# Natural Language Processing (NLP)

## Introduction

*Adapted from [NLP Crash Course](http://files.meetup.com/7616132/DC-NLP-2013-09%20Charlie%20Greenbacker.pdf) by Charlie Greenbacker and [Introduction to NLP](http://spark-public.s3.amazonaws.com/nlp/slides/intro.pdf) by Dan Jurafsky*

### What is NLP?

- Using computers to process (analyze, understand, generate) natural human languages
- Most knowledge created by humans is unstructured text, and we need a way to make sense of it
- Build probabilistic models using data about a language
- Also referred to as machine learning with text

### What are some of the higher level task areas?

- **Information retrieval**: Find relevant results and similar results
    - [Google](https://www.google.com/)
- **Information extraction**: Structured information from unstructured documents
    - [Events from Gmail](https://support.google.com/calendar/answer/6084018?hl=en)
- **Machine translation**: One language to another
    - [Google Translate](https://translate.google.com/)
- **Text simplification**: Preserve the meaning of text, but simplify the grammar and vocabulary
    - [Rewordify](https://rewordify.com/)
    - [Simple English Wikipedia](https://simple.wikipedia.org/wiki/Main_Page)
- **Predictive text input**: Faster or easier typing
    - [A friend's application](https://justmarkham.shinyapps.io/textprediction/)
    - [A much better application](https://farsite.shinyapps.io/swiftkey-cap/)
- **Sentiment analysis**: Attitude of speaker
    - [Hater News](http://haternews.herokuapp.com/)
- **Automatic summarization**: Extractive or abstractive summarization
    - [autotldr](https://www.reddit.com/r/technology/comments/35brc8/21_million_people_still_use_aol_dialup/cr2zzj0)
- **Natural Language Generation**: Generate text from data
    - [How a computer describes a sports match](http://www.bbc.com/news/technology-34204052)
    - [Publishers withdraw more than 120 gibberish papers](http://www.nature.com/news/publishers-withdraw-more-than-120-gibberish-papers-1.14763)
- **Speech recognition and generation**: Speech-to-text, text-to-speech
    - [Google's Web Speech API demo](https://www.google.com/intl/en/chrome/demos/speech.html)
    - [Vocalware Text-to-Speech demo](https://www.vocalware.com/index/demo)
- **Question answering**: Determine the intent of the question, match query with knowledge base, evaluate hypotheses
    - [How did supercomputer Watson beat Jeopardy champion Ken Jennings?](http://blog.ted.com/how-did-supercomputer-watson-beat-jeopardy-champion-ken-jennings-experts-discuss/)
    - [IBM's Watson Trivia Challenge](http://www.nytimes.com/interactive/2010/06/16/magazine/watson-trivia-game.html)
    - [The AI Behind Watson](http://www.aaai.org/Magazine/Watson/watson.php)

### What are some of the lower level components?

- **Tokenization**: breaking text into tokens (words, sentences, n-grams)
- **Stopword removal**: a/an/the
- **Stemming and lemmatization**: root word
- **TF-IDF**: word importance
- **Part-of-speech tagging**: noun/verb/adjective
- **Named entity recognition**: person/organization/location
- **Spelling correction**: "New Yrok City"
- **Word sense disambiguation**: "buy a mouse"
- **Segmentation**: "New York City subway"
- **Language detection**: "translate this page"
- **Machine learning**

### Why is NLP hard?

- **Ambiguity**:
    - Hospitals are Sued by 7 Foot Doctors
    - Juvenile Court to Try Shooting Defendant
    - Local High School Dropouts Cut in Half
- **Non-standard English**: text messages
- **Idioms**: "throw in the towel"
- **Newly coined words**: "retweet"
- **Tricky entity names**: "Where is A Bug's Life playing?"
- **World knowledge**: "Mary and Sue are sisters", "Mary and Sue are mothers"
- **Texts with the same words and phrases can have different meanings**: a State Farm commercial in which two different people say "Is this my car? What? This is ridiculous! This can't be happening! Shut up! Ahhhh!!!"


NLP requires an understanding of the **language** and the **world**.

## Part 1: Reading in the Yelp Reviews

- "corpus" = collection of documents
- "corpora" = plural form of corpus

In [1]:
#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
# sklearn.cross_validation was replaced by sklearn.model_selection (scikit-learn 0.18+)
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn import metrics
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer
%matplotlib inline




In [2]:
# read yelp.csv into a DataFrame
url = 'yelp.csv'
yelp = pd.read_csv(url, encoding='unicode-escape')

In [3]:
yelp.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


In [9]:
# Create a new DataFrame called yelp_best_worst that only contains the 5-star and 1-star reviews
yelp_best_worst = yelp[yelp.stars.isin([1,5])]

In [10]:
#Look at data
yelp_best_worst.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0
6,zp713qNhx8d9KCJJnrw1xA,2010-02-12,riFQ3vxNpP4rWLk_CSri2A,5,Drop what you're doing and drive here. After I...,review,wFweIWhv2fREZV_dYkz_1g,7,7,4


In [22]:
# define X and y
X = yelp_best_worst.text
y = yelp_best_worst.stars

# split the new DataFrame into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=11)

In [23]:
# Null accuracy
y.value_counts(normalize=True)

5    0.816691
1    0.183309
Name: stars, dtype: float64
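
Null accuracy is the accuracy of a model that always predicts the most frequent class. A minimal sketch in plain Python 3 of that idea; the `labels` list is made up for illustration and is not the Yelp data:

```python
from collections import Counter

def null_accuracy(labels):
    """Accuracy of a model that always predicts the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# hypothetical labels: eight 5-star reviews and two 1-star reviews
labels = [5, 5, 5, 5, 1, 5, 5, 1, 5, 5]
print(null_accuracy(labels))
```

Any classifier worth keeping should beat this baseline.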

## Part 2: Tokenization

- **What:** Separate text into units such as sentences or words
- **Why:** Gives structure to previously unstructured text
- **Notes:** Relatively easy with English language text, not easy with some languages

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.

We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to "convert text into a matrix of token counts":

In [14]:
# example documents
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

In [18]:
# Turn text into tabular data using CountVectorizer

vect = CountVectorizer()

# Method 01: learn the vocabulary and build the document-term matrix in one step
dtm = vect.fit_transform(simple_train)

# Method 02: the same thing in two steps
#vect.fit(simple_train)
#dtm = vect.transform(simple_train)

dtm_df = pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())
dtm_df

Unnamed: 0,cab,call,me,please,tonight,you
0,0,1,0,0,1,1
1,1,1,1,0,0,0
2,0,1,1,2,0,0


In [19]:
# transforming a new sentence, what do you notice?
new_sentence = ['please call yourself a cab burrito']

dtm_new = vect.transform(new_sentence)

pd.DataFrame(dtm_new.toarray(), columns=vect.get_feature_names())

Unnamed: 0,cab,call,me,please,tonight,you
0,1,1,0,1,0,0
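
To see what is happening under the hood, here is a rough pure-Python sketch of the fit/transform idea. It is not scikit-learn's implementation; it only assumes the default `token_pattern` shown in the vectorizer options and lowercasing. The function names `fit_vocabulary` and `transform` are made up for this sketch.

```python
import re

# default token pattern from the CountVectorizer options: words of 2+ characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(docs):
    """Sorted list of unique tokens across the training documents."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})

def transform(docs, vocabulary):
    """One row of token counts per document, one column per vocabulary term."""
    index = {term: i for i, term in enumerate(vocabulary)}
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in tokenize(doc):
            if tok in index:  # tokens outside the learned vocabulary are dropped
                row[index[tok]] += 1
        rows.append(row)
    return rows

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
vocab = fit_vocabulary(simple_train)
train_rows = transform(simple_train, vocab)
new_rows = transform(['please call yourself a cab burrito'], vocab)
print(vocab)
print(train_rows)
print(new_rows)
```

Note that "yourself" and "burrito" never make it into the new row, because the vocabulary is fixed once it has been learned from the training documents.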


In [26]:
# use CountVectorizer to create document-term matrices from X_train and X_test
vect = CountVectorizer()

dtm_train = vect.fit_transform(X_train)

dtm_test = vect.transform(X_test)

train_df = pd.DataFrame(dtm_train.toarray(), columns=vect.get_feature_names())
test_df = pd.DataFrame(dtm_test.toarray(), columns=vect.get_feature_names())

In [29]:
train_df.head()

Unnamed: 0,00,000,00a,00am,00pm,01,03,04,05,06,...,zucca,zucchini,zumba,zupas,zuzu,zuzus,zwiebel,zzed,éclairs,ém
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# rows are documents, columns are terms (aka "tokens" or "features")

# Why do train_df and test_df have the same number of features?

In [30]:
# a slice of 100 features from the middle of the vocabulary (output below is truncated)
vect.get_feature_names()[5008:5108]

[u'emergency',
 u'emissions',
 u'emotion',
 u'emotional',
 u'empanadas',
 u'emphasize',
 u'empire',
 u'employ',
 u'employed',
 u'employee',
 u'employees',
 u'employers',
 u'employess',
 u'empress',
 u'empties',
 u'emptor',
 u'empty',
 u'emptying',
 u'emulate',
 u'emulsion',
 u'en',
 u'enable',
 u'encanto',
 u'encased',
 u'encebollada',
 u'encebollado',
 u'enchalada',
 u'enchanted',
 u'enchanting',
 u'enchilada',
 u'enchiladas',
 u'enclosed',
 u'encompass',
 u'encounter',
 u'encountered',
 u'encounters',
 u'encourage',
 u'encouraged',
 u'encouragement',
 u'encouraging',
 u'encrusted',
 u'end',
 u'endearing',
 u'endeavor',
 u'endeavors',
 u'ended',
 u'ending',
 u'endless',
 u'endorse',
 u'endovenera',
 u'ends',
 u'endure',
 u'endures',
 u'enemies',
 u'enemy',
 u'energetic',
 u'energized',
 u'energy',
 u'enforce',
 u'engage',
 u'engaged',
 u'engagement',
 u'engaging',
 u'engine',
 u'engineering',
 u'england',
 u'english',
 u'engrossed',
 u'enhance',
 u'enigma',
 u'enigmatic',
 u'enjoy',
 

In [31]:
# last 50 features

vect.get_feature_names()[-50:]

[u'yuyuyummy',
 u'yyyeeaahhhh',
 u'yyyyy',
 u'z11',
 u'za',
 u'zabba',
 u'zach',
 u'zam',
 u'zanella',
 u'zankou',
 u'zappos',
 u'zatsiki',
 u'zen',
 u'zero',
 u'zest',
 u'zesty',
 u'zexperience',
 u'zha',
 u'zhou',
 u'zihuatenejo',
 u'zilch',
 u'zillion',
 u'zin',
 u'zinburger',
 u'zinc',
 u'zing',
 u'zip',
 u'ziploc',
 u'zipper',
 u'zippers',
 u'zipps',
 u'ziti',
 u'zombi',
 u'zombies',
 u'zone',
 u'zoners',
 u'zones',
 u'zoning',
 u'zoo',
 u'zoyo',
 u'zucca',
 u'zucchini',
 u'zumba',
 u'zupas',
 u'zuzu',
 u'zuzus',
 u'zwiebel',
 u'zzed',
 u'\xe9clairs',
 u'\xe9m']

In [32]:
# show vectorizer options
vect

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

[CountVectorizer documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

- **lowercase:** boolean, True by default
- Convert all characters to lowercase before tokenizing.

In [33]:
#Create a count vectorizer that doesn't lowercase the words
vect = CountVectorizer(lowercase=False)

- **ngram_range:** tuple (min_n, max_n)
- The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.

In [34]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1,2))
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

(3064, 168212)

In [35]:
# last 50 features
print(vect.get_feature_names()[-50:])

[u'zones', u'zones dolls', u'zones so', u'zoning', u'zoning issues', u'zoo', u'zoo but', u'zoo if', u'zoo is', u'zoo not', u'zoo the', u'zoo tour', u'zoyo', u'zoyo for', u'zucca', u'zucca appetizer', u'zucchini', u'zucchini and', u'zucchini bread', u'zucchini broccoli', u'zucchini carrots', u'zucchini fires', u'zucchini fries', u'zucchini pieces', u'zucchini strips', u'zucchini veal', u'zucchini very', u'zucchini we', u'zucchini with', u'zumba', u'zumba class', u'zumba or', u'zumba yogalates', u'zupas', u'zupas cater', u'zuzu', u'zuzu in', u'zuzu is', u'zuzu the', u'zuzu was', u'zuzus', u'zuzus room', u'zwiebel', u'zwiebel kr\xe4uter', u'zzed', u'zzed in', u'\xe9clairs', u'\xe9clairs napoleons', u'\xe9m', u'\xe9m all']
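
The `ngram_range` rule (all n with min_n <= n <= max_n) can be sketched in a few lines of plain Python. This is an illustration of the idea only; `word_ngrams` is a hypothetical helper, not a scikit-learn function.

```python
def word_ngrams(tokens, min_n, max_n):
    """All contiguous n-grams with min_n <= n <= max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = ["zucchini", "fries", "were", "great"]
print(word_ngrams(tokens, 1, 2))
```

Four tokens produce four 1-grams plus three 2-grams, which is why the feature count grows so quickly once 2-grams are included.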


## **Predicting the star rating with Naive Bayes**

### Naive Bayes

Bayes' theorem describes the probabilistic relationship between variables: it lets us express one conditional probability in terms of the inverse conditional probability and the two underlying (marginal) probabilities. It can be written as:

$$P(y|x) = P(y)P(x|y)/P(x)$$

In words: the probability of y given x equals the probability of y, times the probability of x given y, divided by the probability of x.

The theorem extends to the case where x is a vector (containing the multiple x variables used as inputs for the model):

$$P(y|x_1,...,x_n) = P(y)P(x_1,...,x_n|y)/P(x_1,...,x_n)$$

Let's pretend we have an email with three words: "Send money now." We'll use Naive Bayes to classify it as **ham or spam.**

$$P(spam \ | \ \text{send money now}) = \frac {P(\text{send money now} \ | \ spam) \times P(spam)} {P(\text{send money now})}$$

By assuming that the features (the words) are **conditionally independent**, we can simplify the likelihood function:

$$P(spam \ | \ \text{send money now}) \approx \frac {P(\text{send} \ | \ spam) \times P(\text{money} \ | \ spam) \times P(\text{now} \ | \ spam) \times P(spam)} {P(\text{send money now})}$$

We can calculate all of the values in the numerator by examining a corpus of **spam email**:

$$P(spam \ | \ \text{send money now}) \approx \frac {0.2 \times 0.1 \times 0.1 \times 0.9} {P(\text{send money now})} = \frac {0.0018} {P(\text{send money now})}$$

We would repeat this process with a corpus of **ham email**:

$$P(ham \ | \ \text{send money now}) \approx \frac {0.05 \times 0.01 \times 0.1 \times 0.1} {P(\text{send money now})} = \frac {0.000005} {P(\text{send money now})}$$

All we care about is whether spam or ham has the **higher probability**, and so we predict that the email is **spam**.
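
The arithmetic above can be checked with a short Python sketch. The likelihoods and priors are copied from the worked example; dividing by the sum of the two scores is one way to turn them into probabilities that add up to 1, under the same naive assumption.

```python
# per-class word likelihoods and priors, copied from the worked example
likelihoods = {
    "spam": {"send": 0.2, "money": 0.1, "now": 0.1},
    "ham": {"send": 0.05, "money": 0.01, "now": 0.1},
}
priors = {"spam": 0.9, "ham": 0.1}

def unnormalized_score(label, words):
    """Prior times the product of the per-word likelihoods (the numerator)."""
    score = priors[label]
    for w in words:
        score *= likelihoods[label][w]
    return score

words = ["send", "money", "now"]
scores = {label: unnormalized_score(label, words) for label in priors}

# the shared denominator cancels when comparing classes, so normalizing is optional
total = sum(scores.values())
posteriors = {label: s / total for label, s in scores.items()}
prediction = max(scores, key=scores.get)

print(scores)
print(posteriors)
print(prediction)
```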

#### Key takeaways

- The **"naive" assumption** of Naive Bayes (that the features are conditionally independent) is critical to making these calculations simple.
- The **normalization constant** (the denominator) can be ignored since it's the same for all classes.
- The **prior probability** is much less relevant once you have a lot of features.

#### Pros

- Very fast. It handles tens of thousands of features, which is why it is widely used for text classification
- Works well with a small number of observations
- Isn't negatively affected by "noise"

#### Cons

- Not reliable for probability estimates: most of the time it assigns probabilities that are close to zero or one
- It is literally "naive": it assumes the features are independent
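
For intuition, here is a tiny multinomial Naive Bayes written in plain Python. It is a sketch under stated assumptions, not scikit-learn's code: it uses add-one (Laplace) smoothing, which matches the default `alpha=1.0` of `MultinomialNB`, and it sums log-probabilities instead of multiplying raw probabilities. The toy corpus and the names `train_nb` and `predict_nb` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """docs: list of token lists. Returns log-priors and smoothed log-likelihoods."""
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        token_counts[label].update(tokens)
    vocab = sorted({tok for tokens in docs for tok in tokens})
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {t: math.log((token_counts[c][t] + alpha) / total) for t in vocab}
    return log_prior, log_like

def predict_nb(tokens, log_prior, log_like):
    """Return the class with the highest log-prior plus summed log-likelihoods."""
    scores = {}
    for c in log_prior:
        # tokens never seen during training are skipped
        scores[c] = log_prior[c] + sum(log_like[c][t] for t in tokens if t in log_like[c])
    return max(scores, key=scores.get)

# hypothetical toy corpus
docs = [["send", "money", "now"], ["money", "now"],
        ["see", "you", "tonight"], ["call", "you", "now"]]
labels = ["spam", "spam", "ham", "ham"]
log_prior, log_like = train_nb(docs, labels)
print(predict_nb(["send", "money"], log_prior, log_like))
print(predict_nb(["see", "you"], log_prior, log_like))
```

Training is just counting, which is where the speed comes from.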

In [66]:
#Import Naive Bayes algorithm
from sklearn.naive_bayes import MultinomialNB

In [67]:
# fit on the whole dataset and score on the same data (training accuracy, so it is optimistic)
vect = CountVectorizer()
Xdtm = vect.fit_transform(X)
nb = MultinomialNB()
nb.fit(Xdtm,y)
nb.score(Xdtm,y)


0.97185511502692123

In [68]:
# make a countvectorizer for a train test split
vect = CountVectorizer()
# create document-term matrices
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# use multinomial naive bayes with document feature matrix, NOT the text column
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)
# calculate accuracy
print(metrics.accuracy_score(y_test, y_pred_class))

0.920743639922


In [69]:
# calculate null accuracy, which is the accuracy of our null model (just guessing the most common thing)
y_test_binary = np.where(y_test==5, 1, 0)
max(y_test_binary.mean(), 1 - y_test_binary.mean())

0.80626223091976512

In [70]:
#Test 01
# Predict on new text. But first, we need to transform it
new_text = ["I had a decent time at this restaurant. The food was delicious but the service was poor. I recommend the salad but do not eat the french fries."]
new_text_transform=vect.transform(new_text)

In [71]:
#Call prediction
nb.predict(new_text_transform)

array([5], dtype=int64)

In [59]:
#Test 02
# Predict on new text. But first, we need to transform it
new_text = ["bad food"]
new_text_transform=vect.transform(new_text)

In [60]:
#Call prediction
nb.predict(new_text_transform)

array([5], dtype=int64)

In [61]:
nb.predict_proba(new_text_transform)

array([[ 0.39704467,  0.60295533]])

In [72]:
# Function takes a vectorizer, builds the document-term matrix from X, and prints the
# number of features and the 5-fold cross-validated accuracy of a Naive Bayes model.
def tokenize_test(vect):
    nb = MultinomialNB()
    X_dtm = vect.fit_transform(X)
    print('Features: {}'.format(X_dtm.shape[1]))
    print('Accuracy: {}'.format(cross_val_score(nb, X_dtm, y, cv=5, scoring='accuracy').mean()))


In [73]:
# include 2-grams and 3-grams
vect = CountVectorizer(ngram_range=(2,3))
tokenize_test(vect)

Features:  562844
Accuracy:  0.786827457459


## Part 3: Stopword Removal

- **What:** Remove common words that will likely appear in any text
- **Why:** They don't tell you much about your text

In [None]:
# show vectorizer options
vect

- **stop_words:** string {'english'}, list, or None (default)
- If 'english', a built-in stop word list for English is used.
- If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
- Rule of thumb: if a word is equally likely to show up in a rap lyric as in a medical paper, then it's most likely a stop word.
- Corpus-specific stop words: words that aren't regular stop words but become stop words depending on the context.
- If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.
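
Both ideas from the list above (a fixed stop list, and detecting stop words from document frequency via `max_df`) can be sketched in plain Python. The stop list and documents below are tiny, made-up examples; scikit-learn's built-in English list is much longer.

```python
# a tiny illustrative stop list, not scikit-learn's built-in one
STOP_WORDS = {"a", "an", "the", "is", "was", "and", "of", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    return [tok for tok in tokens if tok not in stop_words]

def high_df_terms(tokenized_docs, max_df=0.7):
    """Terms that appear in more than max_df (a proportion) of the documents."""
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for tokens in tokenized_docs:
        for tok in set(tokens):  # count each term once per document
            doc_freq[tok] = doc_freq.get(tok, 0) + 1
    return sorted(t for t, count in doc_freq.items() if count / n_docs > max_df)

docs = [
    ["the", "food", "was", "great"],
    ["the", "service", "was", "slow"],
    ["the", "patio", "is", "lovely"],
]
print(remove_stop_words(docs[0]))
print(high_df_terms(docs, max_df=0.7))
```

Here "the" appears in every document and is flagged, while "was" appears in two of three (about 0.67) and stays below the 0.7 threshold.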

In [None]:
# remove English stop words
vect = CountVectorizer(stop_words='english', ngram_range=(1, 2))
tokenize_test(vect)

In [None]:
# first 30 stop words (get_stop_words() returns a frozenset, so sort it before slicing)
print(sorted(vect.get_stop_words())[:30])

## Part 4: Other CountVectorizer Options

- **max_features:** int or None, default=None
- If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
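
A rough plain-Python sketch of the `max_features` idea: rank terms by their total count across the corpus and keep only the top few. The alphabetical tie-breaking is an assumption made here so that the result is deterministic; `top_terms` and the documents are hypothetical.

```python
from collections import Counter

def top_terms(tokenized_docs, max_features):
    """Keep the max_features terms with the highest total count across the corpus."""
    totals = Counter(tok for tokens in tokenized_docs for tok in tokens)
    # sort by count (descending), then alphabetically so ties are deterministic
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return sorted(term for term, _ in ranked[:max_features])

docs = [["good", "food", "good", "service"], ["bad", "food"], ["good", "patio"]]
print(top_terms(docs, 2))
```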

In [None]:
# remove English stop words and only keep 100 features, MUCH FASTER
vect = CountVectorizer(stop_words='english', max_features=100)
tokenize_test(vect)

In [None]:
# all 100 features
print(vect.get_feature_names())

In [None]:
# include 1-grams and 2-grams, and limit the number of features
vect = CountVectorizer(ngram_range=(1, 2), max_features=500)
tokenize_test(vect)

- **min_df:** float in range [0.0, 1.0] or int, default=1
- When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts.
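
The key point about `min_df` is that it counts documents, not occurrences. A minimal plain-Python sketch of that rule (`filter_by_min_df` and the documents are made up for illustration):

```python
def filter_by_min_df(tokenized_docs, min_df):
    """Keep terms whose document frequency is at least min_df.

    An int is an absolute number of documents; a float is a proportion.
    """
    n_docs = len(tokenized_docs)
    threshold = min_df if isinstance(min_df, int) else min_df * n_docs
    doc_freq = {}
    for tokens in tokenized_docs:
        for tok in set(tokens):  # count each term once per document
            doc_freq[tok] = doc_freq.get(tok, 0) + 1
    return sorted(t for t, count in doc_freq.items() if count >= threshold)

docs = [["good", "good", "good", "food"], ["good", "service"], ["bad", "food"]]
print(filter_by_min_df(docs, 2))
print(filter_by_min_df(docs, 3))
```

"good" occurs four times in total but only in two documents, so it survives `min_df=2` and is dropped by `min_df=3`.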

In [None]:
# include 1-grams and 2-grams, and only include terms that appear in at least 3 documents
vect = CountVectorizer(ngram_range=(1, 2), min_df=3)
tokenize_test(vect)

## Part 5: Introduction to TextBlob

TextBlob: "Simplified Text Processing"

In [None]:
# print the first review
print(yelp_best_worst.text[0])

In [None]:
# save it as a TextBlob object
review = TextBlob(yelp_best_worst.text[0])

In [None]:
# list the words
review.words

In [None]:
# list the sentences
review.sentences

In [None]:
# Parts-of-speech tagging. Identifies nouns, verbs, adverbs, etc...
review.tags

POS Tags guide: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

## Part 6: Stemming and Lemmatization

**Stemming:**

- **What:** Reduce a word to its base/stem/root form
- **Why:** Often makes sense to treat related words the same way
- **Notes:**
    - Uses a "simple" and fast rule-based approach
    - Stemmed words are usually not shown to users (used for analysis/indexing)
    - Some search engines treat words with the same stem as synonyms
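
To make "rule-based" concrete, here is a deliberately tiny suffix-stripping stemmer in plain Python. It is a toy with a handful of made-up rules, not the Snowball or Porter algorithm, and it will get plenty of words wrong.

```python
# ordered (suffix, replacement) rules; a hypothetical, deliberately tiny rule set
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("ly", ""), ("s", "")]

def toy_stem(word, min_stem_length=3):
    """Apply the first matching suffix rule that leaves a long enough stem."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem = word[:len(word) - len(suffix)] + replacement
        if word.endswith(suffix) and len(stem) >= min_stem_length:
            return stem
    return word

for w in ["parties", "walking", "walked", "reviews", "is", "running"]:
    print(w, "->", toy_stem(w))
```

The crude result for "running" shows why stems are usually kept for analysis and indexing rather than shown to users.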

In [None]:
# initialize stemmer
stemmer = SnowballStemmer('english')

Compare and contrast the words with their stems.

In [None]:
review.words[:100]

In [None]:
# stem each word
print([stemmer.stem(word) for word in review.words[:100]])

**Lemmatization**

- **What:** Derive the canonical form ('lemma') of a word
- **Why:** Can be better than stemming
- **Notes:** Uses a dictionary-based approach (slower than stemming)
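
A dictionary-based approach boils down to a lookup keyed by the word and its part of speech. The sketch below uses a hypothetical four-entry dictionary in place of WordNet, purely to show the shape of the idea.

```python
# hypothetical mini-dictionary keyed by (word, part of speech)
LEMMA_DICT = {
    ("indices", "n"): "index",
    ("mice", "n"): "mouse",
    ("was", "v"): "be",
    ("ran", "v"): "run",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary lemma, or the word unchanged if it is not listed."""
    return LEMMA_DICT.get((word.lower(), pos), word)

print(toy_lemmatize("indices"))
print(toy_lemmatize("was", pos="v"))
print(toy_lemmatize("was"))  # treated as a noun, so no entry matches
```

The part-of-speech argument matters: the same word can map to different lemmas (or to none) depending on whether it is treated as a noun or a verb.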

In [None]:
from nltk.stem.wordnet import WordNetLemmatizer

In [None]:
#Stem "octopi"
word = Word('octopi')
stemmer.stem(word)

In [None]:
lem = WordNetLemmatizer()

In [None]:
#Try it with words that look very different when pluralized like indices and octopi
lem.lemmatize("octopi")

Compare and contrast the original words with their lemmas.

In [None]:
print([word for word in review.words[100:200]])

In [None]:
# assume every word is a noun
print([word.lemmatize() for word in review.words])

In [None]:
# assume every word is a verb
print([word.lemmatize(pos='v') for word in review.words])

In [None]:
# define functions that accept text and return a list of stems, noun lemmas, or verb lemmas
def word_tokenize_stem(text):
    words = TextBlob(text).words
    return [stemmer.stem(word) for word in words]
def word_tokenize_lemma(text):
    words = TextBlob(text).words
    return [word.lemmatize() for word in words]
def word_tokenize_lemma_verb(text):
    words = TextBlob(text).words
    return [word.lemmatize(pos="v") for word in words]

In [None]:
# use word_tokenize_stem as the feature extraction function (WARNING: SLOW!)
# this will stem each word
vect = CountVectorizer(analyzer=word_tokenize_stem)
tokenize_test(vect)

In [None]:
# use word_tokenize_lemma as the feature extraction function (WARNING: SLOW!)
# this will lemmatize each word
vect = CountVectorizer(analyzer=word_tokenize_lemma)
tokenize_test(vect)

## Part 7: Term Frequency-Inverse Document Frequency (TF-IDF)

- **What:** Computes the "relative frequency" with which a word appears in a document compared to its frequency across all documents
- **Why:** More useful than "term frequency" for identifying "important" words in each document (high frequency in that document, low frequency in other documents). For example, "court", "ball", "shooting", and "passing" show up frequently throughout a basketball corpus, so they do little to distinguish one document from another.
- **Notes:** Used for search engine scoring, text summarization, document clustering

In [None]:
# example documents
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

In [None]:
# Remember DTM?
vect = CountVectorizer()
tf = pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names())
tf

In [None]:
# Document Frequency
vect = CountVectorizer(binary=True)
df = vect.fit_transform(simple_train).toarray().sum(axis=0)
pd.DataFrame(df.reshape(1, 6), columns=vect.get_feature_names())

In [None]:
# Term Frequency-Inverse Document Frequency (simple version)
tf/df
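
The same "simple version" (term frequency divided by document frequency) can be reproduced without any library. This sketch assumes lowercasing and tokens of two or more word characters, mirroring the CountVectorizer defaults used above.

```python
import re
from collections import Counter

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in simple_train]
vocab = sorted({tok for tokens in tokenized for tok in tokens})

# document frequency: number of documents containing each term
doc_freq = {term: sum(term in tokens for tokens in tokenized) for term in vocab}

# term frequency divided by document frequency, one dict per document
simple_tfidf = [
    {term: Counter(tokens)[term] / doc_freq[term] for term in vocab}
    for tokens in tokenized
]

print(doc_freq)
print(simple_tfidf[2])
```

In the third document, "call" is discounted the most because it appears in every document, while "please" keeps its full weight.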

In [None]:
# TfidfVectorizer. Why does "please" have the highest score?
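vect = TfidfVectorizer()
pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names())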


**More details:** [TF-IDF is about what matters](http://planspace.org/20150524-tfidf_is_about_what_matters/)
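
One way to answer the "please" question by hand: the sketch below assumes the documented TfidfVectorizer defaults (`smooth_idf=True`, so idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row) and applies them to the third example document. It is plain Python, not the library's code.

```python
import math

n_docs = 3
# counts for the third document, 'please call me... PLEASE!',
# and document frequencies across the three example documents
term_counts = {"call": 1, "me": 1, "please": 2}
doc_freq = {"call": 3, "me": 2, "please": 1}

def smoothed_idf(df, n):
    """Smoothed inverse document frequency (assumed scikit-learn default)."""
    return math.log((1 + n) / (1 + df)) + 1

raw = {t: tf * smoothed_idf(doc_freq[t], n_docs) for t, tf in term_counts.items()}

# L2-normalize so the row has unit length
norm = math.sqrt(sum(v * v for v in raw.values()))
tfidf = {t: v / norm for t, v in raw.items()}
top_term = max(tfidf, key=tfidf.get)

print(tfidf)
print(top_term)
```

"please" wins on both counts: it occurs twice in this document (high term frequency) and in no other document (high idf).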

In [None]:
# create a document-term matrix using TF-IDF and remove stop words
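# assumption: the full yelp.text column is the intended corpus
vect = TfidfVectorizer(stop_words='english')
dtm = vect.fit_transform(yelp.text)
features = vect.get_feature_names()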
dtm.shape

In [None]:
#Call tokenize_test function
vect = TfidfVectorizer(stop_words='english')
tokenize_test(vect)

## Part 8: Sentiment Analysis

In [None]:
print(review)

In [None]:
review.sentiment

In [None]:
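#Apply TextBlob polarity and subjectivity to every review in the yelp df
yelp["polarity"] = yelp.text.apply(lambda text: TextBlob(text).sentiment.polarity)
yelp["subjectivity"] = yelp.text.apply(lambda text: TextBlob(text).sentiment.subjectivity)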

In [None]:
#Create new column of text length (number of characters)
yelp["review_length"] = yelp.text.apply(len)

In [None]:
pd.set_option('max_colwidth', 500)

In [None]:
#Look at text with high polarity
yelp[yelp.polarity == 1].text.head()

In [None]:
#Look at text with low polarity
yelp[yelp.polarity == -1].text.head()

In [None]:
#High ratings and low polarity
yelp[(yelp.stars == 5) & (yelp.polarity < -0.3)]["text"].head(2)

In [None]:
#Low ratings and high polarity
yelp[(yelp.stars == 1) & (yelp.polarity > 0.5)]["text"].head(2)

In [None]:
#Plot polarity
yelp.polarity.plot(kind="hist", bins=20);

In [None]:
#Plot subjectivity
yelp.subjectivity.plot(kind="hist", bins=20)

In [None]:
#Plot scatter plot of polarity vs subjectivity scores
plt.scatter(yelp.polarity, yelp.subjectivity)
plt.xlabel("Polarity Scores")
plt.ylabel("Subjectivity Scores")

In [None]:
#Plot boxplots of the polarity by yelp stars
yelp.boxplot(column='polarity', by='stars')

## Part 9: Calculating "spaminess" of a token

In [None]:
#Load in ham or spam text dataset
df = pd.read_table("sms.tsv",encoding="utf-8", names= ["label", "message"])
df.head()

In [None]:
#Look at null accuracy
df.label.value_counts(normalize=True)

In [None]:
#Define X and y
X = df.message
y = df.label
#Fit a vectorizer
vect = CountVectorizer()
Xdtm = vect.fit_transform(X)
#Train and score multinomial naive bayes model
nb = MultinomialNB()
nb.fit(Xdtm,y)
nb.score(Xdtm,y)

In [None]:
tokens = vect.get_feature_names()
len(tokens)

In [None]:
#Print first 50 features
print(vect.get_feature_names()[:50])

In [None]:
#Print random slice of features
print(vect.get_feature_names()[3200:3250])

In [None]:
#How many times does a word appear in each class
nb.feature_count_

In [None]:
nb.feature_count_.shape

In [None]:
# nb.classes_ is sorted alphabetically, so row 0 is ham and row 1 is spam
ham_token_count = nb.feature_count_[0, :]
ham_token_count

In [None]:
spam_token_count = nb.feature_count_[1, :]
spam_token_count

In [None]:
# create a DataFrame of tokens with their separate ham and spam counts
df_tokens = pd.DataFrame({'token':tokens, 'ham':ham_token_count, 'spam':spam_token_count}).set_index('token')
df_tokens.sample(10, random_state=3)

In [None]:
# add 1 to ham and spam counts to avoid dividing by 0
df_tokens['ham'] = df_tokens.ham + 1
df_tokens['spam'] = df_tokens.spam + 1
df_tokens.sample(10, random_state=3)

In [None]:
# Naive Bayes counts the number of observations in each class
nb.class_count_

In [None]:
# convert the ham and spam counts into frequencies
df_tokens['ham'] = df_tokens.ham / nb.class_count_[0]
df_tokens['spam'] = df_tokens.spam / nb.class_count_[1]
df_tokens.sample(10, random_state=3)

In [None]:
# calculate the ratio of spam-to-ham for each token
df_tokens['spam_ratio'] = df_tokens.spam / df_tokens.ham
df_tokens.sample(10, random_state=3)

In [None]:
# examine the DataFrame sorted by spam_ratio (ascending, so the most ham-like tokens come first)
df_tokens.sort_values('spam_ratio', ascending=True).head(10)

In [None]:
#Try looking up scores of different words
word = "table"
df_tokens.loc[word, 'spam_ratio']
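
The steps above (add 1 to each count, divide by the number of messages in the class, then take the spam-to-ham ratio) fit in one small function. This is a plain-Python sketch with made-up counts, not output from the SMS dataset.

```python
def spam_ratio(ham_count, spam_count, n_ham, n_spam):
    """Add-one smoothed spam frequency divided by add-one smoothed ham frequency."""
    ham_freq = (ham_count + 1) / n_ham
    spam_freq = (spam_count + 1) / n_spam
    return spam_freq / ham_freq

# hypothetical token seen 4 times in 400 ham messages and 49 times in 100 spam messages
spammy = spam_ratio(ham_count=4, spam_count=49, n_ham=400, n_spam=100)

# hypothetical token seen 39 times in ham and never in spam
hammy = spam_ratio(ham_count=39, spam_count=0, n_ham=400, n_spam=100)

print(spammy, hammy)
```

A ratio well above 1 marks a token as spam-like and a ratio well below 1 as ham-like; the added 1 keeps tokens with a zero count in one class from causing a division by zero.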

## Conclusion

- NLP is a gigantic field
- Understanding the basics broadens the types of data you can work with
- Simple techniques go a long way
- Use scikit-learn for NLP whenever possible
