Improved stop-word removal, more aggressive now

Commit c32e62f81c76b3faad0d46329957ecf060b4b787 (1 parent: 5af0fe5), committed by @turian on Apr 12, 2010
Showing with 593 additions and 6 deletions.
  1. +4 −0 README
  2. +579 −0 english.stop
  3. +10 −6 textpreprocess.py
@@ -9,3 +9,7 @@ REQUIREMENTS:
http://github.com/turian/common
and sub-requirements thereof.
* NLTK, for word tokenization
+
+The English stoplist is from:
+ http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop
+However, I added words at the top (above "a").
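
For illustration only, here is a minimal sketch of how a stoplist like english.stop might be applied to a token list. It is not taken from textpreprocess.py, whose changes are not shown here. It assumes the stoplist holds one word per line and that matching is case-insensitive; the function names `load_stoplist` and `remove_stopwords` and the four-word sample are hypothetical.

```python
import io


def load_stoplist(lines):
    """Build a set of lowercase stop words from an iterable of lines.

    Assumes one word per line; blank lines are skipped.
    """
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stopwords(tokens, stoplist):
    """Return the tokens whose lowercase form is not in the stoplist."""
    return [tok for tok in tokens if tok.lower() not in stoplist]


# Hypothetical stand-in for the contents of english.stop.
sample = io.StringIO("a\nthe\nof\nis\n")
stoplist = load_stoplist(sample)

filtered = remove_stopwords("The cat is a friend of the dog".split(), stoplist)
print(filtered)  # ['cat', 'friend', 'dog']
```

With a real file, the same loader could be fed an open file handle, since it only iterates over lines. A set is used so that each membership test is constant time regardless of stoplist size.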