# Converting Words to Features with NLTK
This section compiles feature lists from the words in positive and negative reviews, in the hope of revealing which kinds of words tend to appear in each.

In [1]:
import nltk
import random
from nltk.corpus import movie_reviews

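# Pair each review's word list with its category ('pos' or 'neg').
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle so positive and negative reviews are interleaved.
random.shuffle(documents)

# Count how often each lowercased word appears across the whole corpus.
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

# Take the first 3,000 keys of the distribution as the feature vocabulary.
word_features = list(all_words.keys())[:3000]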

In [2]:
word_features[:10]

['hostility',
 'kamil',
 'gangster',
 'jubilantly',
 'bailed',
 'sewing',
 'aura',
 'curse',
 'conelly',
 'clavell']
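The sample above contains fairly rare words, which suggests that slicing `keys()` does not return words in frequency order. The following is a minimal sketch of the difference, using `collections.Counter` on a made-up token list (an assumption for illustration; in NLTK 3, `FreqDist` subclasses `Counter`, so `most_common` behaves the same way there):

```python
from collections import Counter

# Hypothetical toy tokens; "aura" is rare but happens to appear first.
tokens = ["aura", "the", "film", "the", "plot", "the", "film"]
freq = Counter(w.lower() for w in tokens)

# Slicing keys() keeps insertion order, not frequency order.
first_keys = list(freq.keys())[:2]

# most_common(n) returns (word, count) pairs sorted by descending count.
top_words = [word for word, _count in freq.most_common(2)]

print(first_keys)  # ['aura', 'the']
print(top_words)   # ['the', 'film']
```

On the real corpus, the equivalent call would be `all_words.most_common(3000)`, keeping only the word from each pair.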

Mostly the same as before, only now with a new variable, word_features, which holds the 3,000 words that serve as our features. The intent is to use the 3,000 most common words; because the slice is taken over keys() rather than most_common(3000), the words actually selected depend on key order. Next, we're going to build a quick function that looks for each of these 3,000 words in a document and marks its presence as True or False:

In [4]:
def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features
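To see what this function produces without loading the corpus, here is a small self-contained sketch. The name `find_features_in`, its explicit `vocabulary` parameter, and the toy data are all assumptions made for illustration; the notebook's own function reads the global `word_features` instead.

```python
def find_features_in(document, vocabulary):
    """Map each vocabulary word to True if it occurs in the document."""
    words = set(document)  # set membership tests are fast
    return {w: (w in words) for w in vocabulary}

# Hypothetical three-word vocabulary and a tokenized toy review.
toy_vocabulary = ["plot", "boring", "great"]
toy_review = ["a", "great", "plot", "with", "great", "acting"]

toy_result = find_features_in(toy_review, toy_vocabulary)
print(toy_result)
# {'plot': True, 'boring': False, 'great': True}
```

Repeated words such as "great" still yield a single True, because the features record presence rather than counts.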

Next, we can print the feature set for a single review:

In [6]:
print(find_features(movie_reviews.words('neg/cv000_29416.txt')))



Then we can do this for all of our documents, pairing each review's feature-presence booleans with its positive or negative category:



In [7]:
featuresets = [(find_features(rev), category) for (rev, category) in documents]

In [10]:
featuresets[:10]

[({'hostility': False,
   'manhole': False,
   'gangster': False,
   'jubilantly': False,
   'bailed': False,
   'sewing': False,
   'aura': False,
   'bart': False,
   'passages': False,
   'pantoliano': False,
   'spielberg': False,
   'damning': False,
   'hardass': False,
   'olden': False,
   'burne': False,
   'succeeds': False,
   'drills': False,
   'journeyed': False,
   'moreau': False,
   'mentality': False,
   'drank': False,
   'dariush': False,
   'dudeness': False,
   'makeover': False,
   'cotton': False,
   'fletch': False,
   'smallest': False,
   'editors': False,
   'disappeared': False,
   'meditation': False,
   'abandon': False,
   'meanders': False,
   'nan': False,
   'ignorance': False,
   'escorted': False,
   'ceaseless': False,
   'verge': False,
   'elijah': False,
   'try': False,
   'moaning': False,
   'flimsy': False,
   'venice': False,
   'grumbles': False,
   'lightyear': False,
   'macalaester': False,
   'yevgeny': False,
   'lynched': False,
   ...

Awesome, we now have our features and labels. A quick check confirms that featuresets is a list with one entry per review:

In [12]:
type(featuresets)

list

In [19]:
len(featuresets)

2000
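Beyond type and length, it can help to check how the labels are distributed and how many features are actually present in each review. The sketch below does this on hypothetical toy data shaped like `featuresets` (a list of `(feature_dict, label)` tuples); the variable names are assumptions, not part of the notebook.

```python
from collections import Counter

# Hypothetical toy data in the same shape as featuresets.
toy_featuresets = [
    ({"great": True, "boring": False}, "pos"),
    ({"great": False, "boring": True}, "neg"),
    ({"great": True, "boring": True}, "neg"),
]

# Tally how many reviews carry each label.
label_counts = Counter(label for _features, label in toy_featuresets)

# Count the features marked True in each review (True sums as 1).
present_per_review = [sum(features.values())
                      for features, _label in toy_featuresets]

print(label_counts["pos"], label_counts["neg"])  # 1 2
print(present_per_review)                        # [1, 1, 2]
```

Applied to the real `featuresets`, the label tally should come out even, since the movie_reviews corpus contains 1,000 reviews per category.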