- Stemming: Reducing words to linguistic stems.
- parametric:
- Naive Bayes
- K Nearest Neighbor: tends to perform poorly on text data, where feature vectors are high-dimensional and sparse
- SVM
- Boosting: combine weak learners to create a strong learner
- Agglomerative Clustering
- Divisive Clustering
- K-Means Clustering
- Euclidean: for data that can be represented in Euclidean space, or to measure similarity in space
- Cityblock: for binary data
- Jaccard: for completely random f
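
A minimal sketch of the three distance measures listed above, in plain Python. The function names are illustrative, not taken from any particular library. On 0/1 vectors, cityblock simply counts the positions that differ; Jaccard is shown here in its set form (1 minus intersection over union).

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cityblock(a, b):
    """Sum of absolute coordinate differences (also called Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty sets are treated as identical (an assumption of this sketch).
        return 0.0
    return 1.0 - len(a & b) / len(union)


if __name__ == "__main__":
    print(euclidean([0, 0], [3, 4]))
    print(cityblock([1, 0, 1, 1], [0, 0, 1, 0]))
    print(jaccard_distance({"a", "b", "c"}, {"b", "c", "d"}))
```
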
- Use WordNet to bootstrap a larger, more descriptive feature set for text data.
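
One way the WordNet idea above might look in practice: expand each token with its synonyms so that documents using different words for the same concept share features. The sketch below uses a tiny hand-made dictionary (`TOY_SYNONYMS`, hypothetical) as a stand-in for real WordNet lookups; in a real pipeline the synonyms would come from WordNet itself, for example through NLTK's WordNet corpus reader.

```python
# Hypothetical stand-in for WordNet: maps a word to a few synonyms.
TOY_SYNONYMS = {
    "car": ["automobile", "auto"],
    "quick": ["fast", "speedy"],
}


def expand_features(tokens, synonyms=TOY_SYNONYMS):
    """Return the original tokens followed by any synonyms not already present."""
    expanded = list(tokens)
    seen = set(tokens)
    for tok in tokens:
        for syn in synonyms.get(tok, []):
            if syn not in seen:
                seen.add(syn)
                expanded.append(syn)
    return expanded


if __name__ == "__main__":
    print(expand_features(["quick", "car", "ride"]))
```

The expanded token list can then be fed to whatever feature extraction step is already in use (for instance, bag-of-words counts).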