# Natural Language Processing (NLP) Part 1

**Goals**

- Basics of NLP: tokenization, stopwords, POS tagging, stemming/lemmatization
- TextBlob library: how to process text with it and do sentiment analysis
- Text classification in sklearn: vectorizing text, modeling with Naive Bayes, and model optimization with grid search

## What is NLP?

- Using computers to process (analyze, understand, generate) natural human languages
- Most knowledge created by humans is unstructured text, and we need a way to make sense of it
- Build probabilistic models from data about a language
- Also referred to as machine learning with text.

### Examples

- **Information retrieval**: Find relevant results and similar results
    - [Google](https://www.google.com/)
- **Information extraction**: Structured information from unstructured documents
    - [Events from Gmail](https://support.google.com/calendar/answer/6084018?hl=en)
- **Machine translation**: One language to another
    - [Google Translate](https://translate.google.com/)
- **Text simplification**: Preserve the meaning of text, but simplify the grammar and vocabulary
    - [Rewordify](https://rewordify.com/)
    - [Simple English Wikipedia](https://simple.wikipedia.org/wiki/Main_Page)
- **Predictive text input**: Faster or easier typing
    - [A friend's application](https://justmarkham.shinyapps.io/textprediction/)
    - [A much better application](https://farsite.shinyapps.io/swiftkey-cap/)
- **Sentiment analysis**: Attitude of speaker
    - [Hater News](http://haternews.herokuapp.com/)
- **Automatic summarization**: Extractive or abstractive summarization
    - [autotldr](https://www.reddit.com/r/technology/comments/35brc8/21_million_people_still_use_aol_dialup/cr2zzj0)
- **Natural Language Generation**: Generate text from data
    - [How a computer describes a sports match](http://www.bbc.com/news/technology-34204052)
    - [Publishers withdraw more than 120 gibberish papers](http://www.nature.com/news/publishers-withdraw-more-than-120-gibberish-papers-1.14763)
- **Speech recognition and generation**: Speech-to-text, text-to-speech
    - [Google's Web Speech API demo](https://www.google.com/intl/en/chrome/demos/speech.html)
    - [Vocalware Text-to-Speech demo](https://www.vocalware.com/index/demo)
- **Question answering**: Determine the intent of the question, match query with knowledge base, evaluate hypotheses
    - [How did supercomputer Watson beat Jeopardy champion Ken Jennings?](http://blog.ted.com/how-did-supercomputer-watson-beat-jeopardy-champion-ken-jennings-experts-discuss/)
    - [IBM's Watson Trivia Challenge](http://www.nytimes.com/interactive/2010/06/16/magazine/watson-trivia-game.html)
    - [The AI Behind Watson](http://www.aaai.org/Magazine/Watson/watson.php)

### NLP Tools

- **Tokenization**: breaking text into tokens (words, sentences, n-grams)
- **Stopword removal**: a/an/the
- **Stemming and lemmatization**: root word
- **TF-IDF**: word importance
- **Part-of-speech tagging**: noun/verb/adjective
- **Named entity recognition**: person/organization/location
- **Spelling correction**: "New Yrok City"
- **Word sense disambiguation**: "buy a mouse"
- **Segmentation**: "New York City subway"
- **Language detection**: "translate this page"

### NLP is hard! Here's why

- **Ambiguity**:
    - Hospitals are Sued by 7 Foot Doctors
    - Juvenile Court to Try Shooting Defendant
    - Local High School Dropouts Cut in Half
- **Non-standard English**: text messages, "y r u" vs "why are you"
- **Idioms**: "throw in the towel"
- **Newly coined words**: "retweet", "clickbait", "fleek"
- **Tricky entity names**: "Where is A Bug's Life playing?"
- **World knowledge**: "Mary and Sue are sisters", "Mary and Sue are mothers"
- **Same words, different meanings**: Texts with the same words and phrases can have different meanings, as in the State Farm commercial where two different people say "Is this my car? What? This is ridiculous! This can't be happening! Shut up! Ahhhh!!!"


NLP requires an understanding of the **language** and the **world**.

## NLP with the NLTK library

At this point, NLTK should be installed and its additional data should be downloaded as well.

In [None]:
#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb
#sklearn.cross_validation and sklearn.grid_search were removed; both now live in sklearn.model_selection
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize, wordpunct_tokenize
from nltk.tokenize import TreebankWordTokenizer
from nltk.tag import pos_tag
from nltk.corpus import stopwords
from string import punctuation
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.util import ngrams
from textblob import TextBlob

In [None]:
#Downloads the nltk data

# nltk.download()

### Tokenization

- **What:** Separate text into units such as sentences or words
- **Why:** Gives structure to previously unstructured text
- **Notes:** Relatively easy with English language text, not easy with some languages

Sentence tokenization

In [None]:
text = """Hello. How are you, dear Mr. Sir? Are you well?
          Here: drink this! It will make you feel better.
          I mean, it won't make you feel worse!"""


#Tokenize text using sent_tokenize function
sentences = sent_tokenize(text)

sentences



Based on the output, can you figure out the rules of tokenization?

Word tokenization

In [None]:
#Assign last sentence in sentences to sentence

sentence = sentences[-1]


#Word tokenize using one of the sentences from sentences
#Assumes that input has already been tokenized into sentences

words = word_tokenize(sentence)

print(sentence)

print(words)

How did the word_tokenize function work? Let's try the wordpunct_tokenize function

In [None]:
#Pass sentence into wordpunct_tokenize function
wordpunct_tokenize(sentence)

What's the difference?

Online demo of various tokenizers: http://text-processing.com/demo/tokenize/
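
To make the difference concrete, here is a minimal sketch that uses only the standard library. It assumes that `wordpunct_tokenize` behaves like the regular expression `\w+|[^\w\s]+` (runs of word characters, or runs of punctuation). `simple_wordpunct` is a hypothetical helper name used for illustration, not part of NLTK.

```python
import re

def simple_wordpunct(text):
    """Split text into runs of word characters or runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)

sentence = "I mean, it won't make you feel worse!"
tokens = simple_wordpunct(sentence)
print(tokens)
```

Notice that the contraction "won't" is broken into three pieces at the apostrophe. `word_tokenize` treats contractions differently, so comparing the two outputs on the same sentence is a quick way to see which rules each tokenizer applies.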

### Part of speech tagging

<br>

"The process of assigning one of the parts of speech to the given word is called Parts Of Speech tagging. It is commonly referred to as POS tagging. Parts of speech include nouns, verbs, adverbs, adjectives, pronouns, conjunction and their sub-categories."

http://language.worldofcomputing.net/pos-tagging/parts-of-speech-tagging.html

In [None]:
#Text for POS tagging
text = """The process of assigning one of 
the parts of speech to the given word is called Parts Of Speech tagging"""

#Tokenize text
tokens = word_tokenize(text)

#Pass tokens into pos_tag function
pos_tag(tokens)


The output is a list of tuples, each pairing a token with its POS tag.

#### Some POS tags:
WP: wh-pronoun ("who", "what")  
VBZ: verb, 3rd person sing. present ("takes")  
VBG: verb, gerund/present participle ("taking")  
TO: to ("to go", "to him")   
DT: determiner ("the", "this")  
NN: noun, singular or mass ("door")  

All tags: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

### Stopwords

Common words that are likely to appear in any text. Anything that can appear in a poem, a rap lyric, or a medical research paper is most likely a stopword. In most NLP contexts we remove stopwords because they tell you little about what a text is about.

In [None]:

#Initialize the list of stopwords

sw = stopwords.words("english")

sw

In [None]:
#View list of punctuation characters

punctuation

In [None]:
#Add each punctuation character to the sw list

sw += list(punctuation)

Let's remove stopwords and punctuation from a corpus

In [None]:

corpus = """Sony Michel's touchdown in double-overtime gave 
Georgia a 54-48 Rose Bowl win over Oklahoma and 
made up for a late fumble that resulted in six points for the Sooners."""

In [None]:
#Tokenize text

tokens = wordpunct_tokenize(corpus)

tokens

In [None]:
#Clean up tokens by removing stopwords and punctuation characters

clean_tokens = [i for i in tokens if i not in sw]

clean_tokens
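
One detail worth keeping in mind: stopword lists are usually lowercase, so a capitalized "The" at the start of a sentence would slip through a case-sensitive filter. The sketch below shows a case-insensitive version of the same filter. The stopword set and token list are small hypothetical examples, not NLTK's actual list.

```python
from string import punctuation

# Hypothetical mini stopword list; NLTK's English list is much longer.
toy_stopwords = {"the", "a", "in", "for", "and", "over", "that", "up"}
toy_stopwords.update(punctuation)  # adds each punctuation character

tokens = ["The", "Sooners", "fumble", ",", "and", "Georgia",
          "wins", "in", "overtime", "."]

# Lowercase only for the comparison so kept tokens retain their casing.
clean = [tok for tok in tokens if tok.lower() not in toy_stopwords]
print(clean)
```

Lowercasing only inside the membership test keeps proper nouns such as "Georgia" intact, which matters if you later run named entity recognition on the cleaned tokens.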

### Stemming and lemmatization

<br>

**Stemming:**

- **What:** Reduce a word to its base/stem/root form
- **Why:** Often makes sense to treat related words the same way
- **Notes:**
    - Uses a "simple" and fast rule-based approach
    - Stemmed words are usually not shown to users (used for analysis/indexing)
    - Some search engines treat words with the same stem as synonyms

In [None]:
#Initialize stemmer object

stemmer = SnowballStemmer("english")

In [None]:
#Derive stems from random words

stemmer.stem("running")

In [None]:
stemmer.stem("absolutely")

In [None]:
stemmer.stem("forgave")

In [None]:
#Derive the stems of every token in clean tokens

stems = [stemmer.stem(token) for token in clean_tokens]

stems

What do you notice about the results of the stemming process?

**Lemmatization**

- **What:** Derive the canonical form ('lemma') of a word
- **Why:** Returns real dictionary words, which is often more accurate than stemming
- **Notes:** Uses a dictionary-based approach (slower than stemming)
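
The contrast between the two approaches can be sketched with toy versions of each. Neither function below is the Snowball or WordNet algorithm: `toy_stem` is a hypothetical suffix stripper with a handful of made-up rules, and `toy_lemmatize` is a hypothetical lookup in a three-entry dictionary.

```python
# Rule-based: strip the first matching suffix, no knowledge of real words.
SUFFIXES = ("ing", "ly", "es", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Dictionary-based: look the word up, fall back to the word itself.
TOY_LEMMAS = {"indices": "index", "octopi": "octopus", "forgave": "forgive"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

for word in ["running", "absolutely", "indices", "octopi", "forgave"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The stemmer is fast and needs no vocabulary, but it can produce non-words such as "runn" or "indic". The lemmatizer returns real words, but only for entries its dictionary knows about.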

Compare and contrast the stems and lemmatization of certain words

In [None]:
#Stem of octopi (plural of octopus)

stemmer.stem("octopi")

In [None]:
#Initialize lemmatizer object

lem = WordNetLemmatizer()

#Lemmatize octopi

lem.lemmatize("octopi")

What's the difference? Try it again with indices

In [None]:
#Stem

stemmer.stem("indices")

In [None]:
#Lemma
lem.lemmatize("indices")

Derive the lemmas of clean_tokens

In [None]:
#Lemmatize the clean tokens and set pos = v

lemons = [lem.lemmatize(token, pos= "v") for token in clean_tokens]

lemons

### N-Grams

Collections of adjacent words; the number of words in each collection is determined by N.

Bigrams = Two-word phrases

Trigrams = Three-word phrases

http://text-analytics101.rxnlp.com/2014/11/what-are-n-grams.html
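
The idea is simple enough to sketch in plain Python: slide a window of size N across the token list. `make_ngrams` is a hypothetical helper written for illustration, and the token list is a made-up example.

```python
def make_ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    # zip stops at the shortest slice, which is exactly the last full window.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["Georgia", "wins", "the", "Rose", "Bowl"]
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A list of k tokens yields k - N + 1 n-grams, so five tokens give four bigrams and three trigrams.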

In [None]:

#Set N to 2 for bigrams
N = 2

#Make bigrams from clean_tokens (ngrams returns a generator, so wrap it in a list to view it)
bigrams = list(ngrams(clean_tokens, N))

bigrams


In [None]:

#Set N to 3 for trigrams
N = 3

#Make trigrams from clean_tokens (ngrams returns a generator, so wrap it in a list to view it)
trigrams = list(ngrams(clean_tokens, N))

trigrams


## TextBlob

<br>

A Python library for simple NLP tasks.


You may need to download the corpora that TextBlob uses. Type into the command line:

python -m textblob.download_corpora

In [None]:
#Text for using TextBlob

corpus = """
Mr. Persson, 35, sits in front of four computer screens,
one displaying the loader he steers as it lifts freshly blasted rock containing silver,
zinc and lead. If he were down in the mine shaft operating the loader manually,
he would be inhaling dust and exhaust fumes. 
Instead, he reclines in an office chair while using a joystick to control the machine.
"""

#Pass the text into TextBlob


Explore the capabilities of textblob

In [None]:
#tokenized words



In [None]:
#Sentences


In [None]:
#Word counts



In [None]:
#Pos tags



In [None]:
#Noun phrases



In [None]:
#Singularize words


In [None]:
#Pluralize words


In [None]:
#Lemmatization


In [None]:
#Lemmatization with verbs


In [None]:
#bigrams


### Sentiment Analysis

TextBlob rates text on subjectivity and polarity. Subjectivity measures how opinionated a text is on a scale from 0.0 to 1.0, and polarity measures how negative or positive a text is on a scale from -1.0 to 1.0.
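
To build some intuition for where a polarity score can come from, here is a toy lexicon-based scorer. It is a sketch under stated assumptions, not TextBlob's algorithm: the word scores in `TOY_LEXICON` are invented for illustration, and `toy_polarity` simply averages the scores of the words it recognizes.

```python
import re

# Hypothetical word scores in [-1.0, 1.0]; real lexicons are far larger.
TOY_LEXICON = {"love": 0.5, "awesome": 1.0, "fun": 0.3,
               "hate": -0.8, "poor": -0.4}

def toy_polarity(text):
    """Average the lexicon scores of known words; 0.0 if none are known."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("I love learning about data science, it is very fun."))
print(toy_polarity("I hate cupcakes."))
print(toy_polarity("i have no opinions about the matter"))
```

A scorer this simple ignores negation ("not fun") and intensifiers ("very fun"), which is one reason to compare its output against TextBlob's on the same sentences and see where they disagree.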

In [None]:
#Text for sentiment analysis
raw_text = "I love learning about data science, it is very fun."

#Pass raw_text into TextBlob
blob = 

#Derive scores



In [None]:
#Polarity score


In [None]:
#Subjectivity score


More examples

In [None]:
TextBlob("it's so awesome").sentiment


In [None]:
TextBlob("I love this course.").sentiment

In [None]:
TextBlob("Oh my god I love this course.").sentiment

In [None]:
TextBlob("it's so awesome.").sentiment

In [None]:
TextBlob("I hate cupcakes.").sentiment

In [None]:
TextBlob("i have no opinions about the matter").sentiment

Let's analyze the sentiment of yelp reviews

In [None]:
#Load in yelp review data

path = "../../data/NLP_data/yelp.csv"

yelp = pd.read_csv(path, encoding='unicode-escape')

yelp.head()

In [None]:
#Read first review

review = 


In [None]:
#Pass the review into TextBlob and get its sentiment scores

blob = 


What do you think of the scores? Are they too high or low?

In [None]:
#Calculate polarity and subjectivity scores for the entire corpus
# by applying polarity and subjectivity over the yelp reviews df

yelp["polarity"] = 
yelp["subjectivity"] = 

What are the most negative and most positive reviews?

In [None]:
#Adjust settings
pd.set_option('max_colwidth', 500)

In [None]:
#Most negative



In [None]:
#Most positive




Are there reviews whose star rating and polarity score disagree, for example 1-star reviews with high polarity scores?

In [None]:
#One star reviews with high polarity scores



Plot the scores

In [None]:
#Histogram of polarity scores



In [None]:
#Histogram of subjectivity scores


In [None]:
#Plot scatter plot of polarity vs subjectivity scores

plt.xlabel("Polarity Scores")
plt.ylabel("Subjectivity Scores")

In [None]:
#Plot boxplots of the polarity by yelp stars
;

What are your thoughts on the plots? Do they make sense to you?

## Text Classification

We're going to train a machine learning algorithm to classify yelp reviews as either five or one stars. But first we need to transform or "vectorize" our raw text before making any classifications.

### Count Vectorizer: How to turn text into numbers

In [None]:
# Create a new DataFrame called yelp_best_worst that only contains the 5-star and 1-star reviews
yelp_best_worst = 

In [None]:
# define X and y
X = 
y = 

#Null accuracy


# split data into training and testing sets
X_train, X_test, y_train, y_test = 

We can't pass raw text into an algorithm. First we have to vectorize it, which means converting a collection of text documents to a matrix of token counts.

<br>

We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to "convert text into a matrix of token counts":

In [None]:
# example documents
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

In [None]:
# Term Frequency



In [None]:
# transforming a new sentence
new_sentence = ['please call yourself a taxi']



What do you notice? How come the two dataframes have the same features?
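
One way to see why the two tables share the same columns is to build the document-term matrix by hand. The sketch below uses only the standard library and assumes CountVectorizer's default behaviour of lowercasing text and keeping tokens of two or more word characters. The function names `tokenize`, `fit_vocabulary`, and `transform` are hypothetical stand-ins, not sklearn calls.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters

def tokenize(doc):
    return TOKEN_PATTERN.findall(doc.lower())

def fit_vocabulary(docs):
    """Learn the sorted set of tokens seen in the training documents."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})

def transform(docs, vocabulary):
    """Count only the tokens that are already in the vocabulary."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[term] for term in vocabulary])
    return rows

train_docs = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
vocabulary = fit_vocabulary(train_docs)
train_matrix = transform(train_docs, vocabulary)
new_matrix = transform(['please call yourself a taxi'], vocabulary)

print(vocabulary)
print(train_matrix)
print(new_matrix)
```

The vocabulary is fixed when it is learned from the training documents, so words that only appear in the new sentence ("yourself", "taxi") have no column and are silently dropped. This is the same reason the testing data is transformed with a vectorizer that was fit on the training data only.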

Use CountVectorizer to create document-term matrices from X_train and X_test

In [None]:
#Initialize vectorizer object
vect = 

#Fit and transform with training data
X_train_dtm = 

#Transform the testing data
X_test_dtm = 

In [None]:
#Vectorized data shapes




In [None]:
# first 50 features


In [None]:
# Random selection of 50 features


Let's put it in a dataframe

Ways to configure vectorizer

In [None]:
# show vectorizer options
vect

- **lowercase:** boolean, True by default
- Convert all characters to lowercase before tokenizing.

In [None]:
#Create a count vectorizer that doesn't lowercase the words
vect = 
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape # has more features

- **ngram_range:** tuple (min_n, max_n)
- The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.

In [None]:
# include 1-grams and 2-grams
vect = 
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

In [None]:
# last 50 features


- **stop_words:** string {'english'}, list, or None (default)
- If 'english', a built-in stop word list for English is used.
- If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. This allows you to use your own custom stopwords list. Great for corpus-specific stopwords, that is, words that aren't regular stopwords but become stopwords depending on the context.
- If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.

In [None]:
#Set vectorizer with stop_words to english
vect = 
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

In [None]:
#Show the stopwords used



- **max_features:** int or None, default=None
- If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus.

- **min_df:** float in range [0.0, 1.0] or int, default=1
- When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts.
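
The difference between an integer and a float `min_df` can be sketched in plain Python. This is an illustration of the rule described above, not sklearn's implementation; `document_frequency` and `filter_by_min_df` are hypothetical helper names, and the tokenizer assumes lowercased tokens of two or more word characters.

```python
import re
from collections import Counter

def document_frequency(docs):
    """Count how many documents each term appears in."""
    df = Counter()
    for doc in docs:
        # set() so a term repeated within one document is counted once
        df.update(set(re.findall(r"\b\w\w+\b", doc.lower())))
    return df

def filter_by_min_df(docs, min_df):
    """Keep terms whose document frequency is at least the threshold."""
    df = document_frequency(docs)
    # A float is a proportion of documents; an int is an absolute count.
    threshold = min_df * len(docs) if isinstance(min_df, float) else min_df
    return sorted(term for term, count in df.items() if count >= threshold)

docs = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
print(filter_by_min_df(docs, 2))    # terms in at least 2 documents
print(filter_by_min_df(docs, 0.5))  # terms in at least 50% of documents
print(filter_by_min_df(docs, 1.0))  # terms in every document
```

Note that "please" occurs twice in the third document but still has a document frequency of one, because document frequency counts documents, not occurrences.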

In [None]:
#Set vectorizer with max_features to 2000
vect = 
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

In [None]:
#Set vectorizer with min_df to 5
vect = 
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

In [None]:
#Set vectorizer with min_df to 0.1
vect = 
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

In [None]:
#What are the words that show up in at least 10 percent of documents



### Term Frequency-Inverse Document Frequency (TF-IDF)

- **What:** Computes "relative frequency" that a word appears in a document compared to its frequency across all documents
- **Why:** More useful than "term frequency" for identifying "important" words in each document (high frequency in that document, low frequency in other documents). For example, "court", "ball", "shooting", and "passing" will show up frequently in a basketball corpus but do little to distinguish one document from another; they act as corpus-specific stopwords.

[Source: Ultra Violet Analytics](http://www.ultravioletanalytics.com/2016/11/18/tf-idf-basics-with-pandas-scikit-learn/)

"Tf-idf is a very common technique for determining roughly what each document in a set of documents is “about”. It cleverly accomplishes this by looking at two simple metrics: tf (term frequency) and idf (inverse document frequency). Term frequency is the proportion of occurrences of a specific term to total number of terms in a document. Inverse document frequency is the inverse of the proportion of documents that contain that word/phrase. The general idea is that if a specific phrase appears a lot of times in a given document, but it doesn’t appear in many other documents, then we have a good idea that the phrase is important in distinguishing that document from all the others."
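
The definition in the quote can be worked through by hand. The sketch below follows it literally: tf is a term's share of the tokens in a document, and idf is the number of documents divided by the number of documents containing the term. This is a simplified illustration; sklearn's TfidfVectorizer uses a smoothed, logarithmic idf and normalizes each row, so its numbers will differ. The helper name `tf_idf` is hypothetical.

```python
import re
from collections import Counter

docs = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
n_docs = len(docs)

# Number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(tokens):
    """tf = share of the document's tokens; idf = n_docs / document frequency."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: (count / total) * (n_docs / doc_freq[term])
            for term, count in counts.items()}

scores = [tf_idf(tokens) for tokens in tokenized]
for doc, score in zip(docs, scores):
    print(doc, "->", score)
```

In the third document, "please" scores 1.5 while "call" scores only 0.25: "please" is frequent in that document and absent elsewhere, whereas "call" appears in every document and so carries little distinguishing information.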

In [None]:
# example documents
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

In [None]:
# Term Frequency with CountVectorizer

vect = 


Binary = True assigns a 1 if a word is present regardless of count, and 0 for absent words.

In [None]:
#Initialize vectorizer with binary = True
vect = 

#Fit and transform the text and sum up the counts
df = vect.fit_transform(simple_train).toarray().sum(axis=0)
#Put results into dataframe
pd.DataFrame(df.reshape(1, -1), columns=vect.get_feature_names_out())

This is how many documents each word appears in.

TFIDF (simple version)

In [None]:
# Divide tf by df


Let's check out the sklearn version

In [None]:
#Initialize vectorizer
vect = 

#Fit and transform using tfidf and input results into dataframe
pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names_out())

### Naive Bayes

Bayes' theorem describes the probabilistic relationship between variables. Specifically, it lets us express one conditional probability in terms of the prior probabilities and the inverse conditional probability. It can be defined as:

$$P(y|x) = P(y)P(x|y)/P(x)$$

In words: the probability of y given x equals the probability of y times the probability of x given y, divided by the probability of x.

This theorem can be extended to when x is a vector (containing the multiple x variables used as inputs for the model) to:

$$P(y|x_1,...,x_n) = P(y)P(x_1,...,x_n|y)/P(x_1,...,x_n)$$

Let's pretend we have an email with three words: "Send money now." We'll use Naive Bayes to classify it as **ham or spam.**

$$P(spam \ | \ \text{send money now}) = \frac {P(\text{send money now} \ | \ spam) \times P(spam)} {P(\text{send money now})}$$

By assuming that the features (the words) are **conditionally independent**, we can simplify the likelihood function:

$$P(spam \ | \ \text{send money now}) \approx \frac {P(\text{send} \ | \ spam) \times P(\text{money} \ | \ spam) \times P(\text{now} \ | \ spam) \times P(spam)} {P(\text{send money now})}$$

We can calculate all of the values in the numerator by examining a corpus of **spam email**:

$$P(spam \ | \ \text{send money now}) \approx \frac {0.2 \times 0.1 \times 0.1 \times 0.9} {P(\text{send money now})} = \frac {0.0018} {P(\text{send money now})}$$

We would repeat this process with a corpus of **ham email**:

$$P(ham \ | \ \text{send money now}) \approx \frac {0.05 \times 0.01 \times 0.1 \times 0.1} {P(\text{send money now})} = \frac {0.000005} {P(\text{send money now})}$$

All we care about is whether spam or ham has the **higher probability**, and so we predict that the email is **spam**.
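
The arithmetic above can be checked with a few lines of Python. The sketch reuses the example's numbers as given; the dictionary layout and variable names are hypothetical choices for illustration. The final normalization step assumes spam and ham are the only two classes.

```python
from math import prod

# P(word | class), taken from the worked example above
likelihoods = {
    "spam": {"send": 0.2, "money": 0.1, "now": 0.1},
    "ham": {"send": 0.05, "money": 0.01, "now": 0.1},
}
# P(class), also from the worked example
priors = {"spam": 0.9, "ham": 0.1}

words = ["send", "money", "now"]

# Numerator of Bayes' theorem under the naive independence assumption
scores = {
    label: priors[label] * prod(likelihoods[label][w] for w in words)
    for label in priors
}

# The shared denominator cancels when comparing classes; dividing by the
# sum of the numerators rescales them into probabilities that add up to 1.
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}
prediction = max(scores, key=scores.get)

print(scores)
print(posteriors)
print(prediction)
```

The unnormalized scores are enough to pick the winner, which is why the denominator never has to be computed. After rescaling, spam ends up with a probability above 0.99, an early hint at how extreme Naive Bayes probabilities tend to be.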

**Key takeaways**

- The **"naive" assumption** of Naive Bayes (that the features are conditionally independent) is critical to making these calculations simple.
- The **normalization constant** (the denominator) can be ignored since it's the same for all classes.
- The **prior probability** is much less relevant once you have a lot of features.

<b>Pros</b>: 
- Very fast. Adept at handling tens of thousands of features, which is why it's used for text classification
- Works well with a small number of observations
- Relatively robust to irrelevant features ("noise")

<b>Cons</b>:
- Poorly calibrated probabilities. Most of the time it assigns probabilities that are close to zero or one
- It is literally "naive", meaning it assumes the features are independent, which is rarely true of words in text

Let's make our first model

In [None]:
#Vectorize the whole corpus. Remove stop words.

#Initialize vectorizer
vect = 

#fit and transform data
X_dtm = 

In [None]:
#Initialize model
nb = 


#Fit and score model




Not bad, but let's try it on a train-test split

In [None]:
#Null accuracy of testing set




In [None]:
#Initialize vectorizer
vect = 

#Fit and transform on the training data
X_train_dtm = 
#Transform the testing data with the vectorizer
X_test_dtm = 

#Initialize model
nb = 
#Fit it on training data


#Score it on training and testing data



How do you assess this model? 

<br>

Let's try it on some new text

In [None]:
# Predict on new text
new_text = ["I had a decent time at this restaurant. \
The food was delicious but the service was very poor. \
I recommend the salad but do not eat the french fries."]
new_text_transform = 

#Predict class


#Class probabilities


Let's do this again with the tfidf vectorizer

In [None]:
#Initialize vectorizer
vect = 

#Fit and transform on the training data
X_train_dtm = 

#Transform the testing data with the vectorizer
X_test_dtm = 

#Initialize model
nb = 
#Fit it on training data


#Score it on training and testing data



Thoughts on the results? Did you expect the scores to be lower than the CountVectorizer ones?

Let's cross validate with pipelines

In [None]:
#Create pipeline with tfidf vectorizer with max_features = 1000 and lowercase = true

pipe = 


#Cross validate with the pipeline and use the full raw text


Grid search time. Testing combinations of parameters by hand would take a lot of time, so let's have grid search try them systematically.
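
Before running a grid search it helps to know how much work it implies. The sketch below enumerates every combination in a parameter grid using only the standard library. The grid values mirror the ones used for the count vectorizer in this section; the counting logic is an illustration, not sklearn's internals.

```python
from itertools import product

param_grid = {
    "countvectorizer__max_features": [1000, 2500, 5000, 7500, 10000],
    "countvectorizer__ngram_range": [(1, 1), (1, 2), (2, 2)],
    "countvectorizer__lowercase": [True, False],
}

# Every combination is one value chosen from each list.
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

n_folds = 5
n_fits = len(combinations) * n_folds

print(len(combinations), "combinations")
print(n_fits, "model fits with", n_folds, "fold cross validation")
print(combinations[0])
```

Five values times three values times two values gives 30 combinations, and 5-fold cross validation fits each one five times, for 150 fits. Adding one more parameter with four options would quadruple that, which is why grids grow expensive quickly.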

In [None]:
#Make pipeline for countvectorizer and naive bayes model
pipe_cv = make_pipeline(CountVectorizer(), MultinomialNB())

#Initialize parameters for count vectorizer
param_grid_cv = {}
param_grid_cv["countvectorizer__max_features"] = [1000, 2500 ,5000, 7500,10000]
param_grid_cv["countvectorizer__ngram_range"] = [(1,1), (1,2), (2,2)]
param_grid_cv["countvectorizer__lowercase"] = [True, False]

In [None]:
#Make pipeline for tfidfvectorizer and naive bayes model
pipe_tf = make_pipeline(TfidfVectorizer(), MultinomialNB())


#Initialize parameters for tfidf vectorizer
param_grid_tf = {}
param_grid_tf["tfidfvectorizer__max_features"] = [1000, 2500 ,5000, 7500,10000]
param_grid_tf["tfidfvectorizer__ngram_range"] = [(1,1), (1,2), (2,2)]
param_grid_tf["tfidfvectorizer__lowercase"] = [True, False]

In [None]:
#Let's import time to see how long it takes

from time import time

In [None]:
#Grid search for the count vectorizer

grid_cv = GridSearchCV(pipe_cv, param_grid_cv, cv = 5, scoring = "accuracy")

#Initialize time stamp
t = time()
#fit grid search object
grid_cv.fit(X, y)
#Print time elapsed
print(time() - t)

In [None]:
#Best parameters
grid_cv.best_params_

In [None]:
#Best score
grid_cv.best_score_

In [None]:
#Grid search for the tfidf vectorizer

grid_tf = GridSearchCV(pipe_tf, param_grid_tf, cv = 5, scoring = "accuracy")

#Initialize time stamp
t = time()
#fit grid search object
grid_tf.fit(X, y)
#Print time elapsed
print(time() - t)

In [None]:
#Best parameters
grid_tf.best_params_

In [None]:
#Best score
grid_tf.best_score_

Randomized Search option
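
The idea behind a randomized search is to try only a random subset of the grid rather than all of it. The sketch below illustrates that idea with the standard library, assuming the candidates are drawn without repeats from a grid made of plain lists. It is not sklearn's sampling code, and the seed value is an arbitrary choice so the sketch is repeatable.

```python
import random
from itertools import product

param_grid = {
    "countvectorizer__max_features": [1000, 2500, 5000, 7500, 10000],
    "countvectorizer__ngram_range": [(1, 1), (1, 2), (2, 2)],
    "countvectorizer__lowercase": [True, False],
}

names = list(param_grid)
all_combinations = [dict(zip(names, values))
                    for values in product(*param_grid.values())]

n_iter = 5
rng = random.Random(42)  # arbitrary fixed seed
sampled = rng.sample(all_combinations, k=n_iter)

n_folds = 5
print(len(all_combinations), "combinations in the full grid")
print(n_iter * n_folds, "fits for the randomized search")
for candidate in sampled:
    print(candidate)
```

With `n_iter = 5` and 5-fold cross validation, the search performs 25 fits instead of 150. The trade-off is that the best combination in the full grid may not be among the ones sampled.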

In [None]:
#Initialize randomized search
randsearch_cv = RandomizedSearchCV(pipe_cv, n_iter = 5,
                        param_distributions = param_grid_cv, cv = 5, scoring = "accuracy")

#Time the code 

t = time()

#Fit grid on data
randsearch_cv.fit(X, y)

#Print time difference

print(time() - t)

## Calculating the "spamminess" of a token

This is a really helpful technique to find the words most associated with either class.

In [None]:
#Load in ham or spam text dataset
df = pd.read_table("../../data/NLP_data/sms.tsv",encoding="utf-8", names= ["label", "message"])
df.head()

In [None]:
#Look at null accuracy


In [None]:
#Assign X and y
X = 
y = 

#Initialize vectorizer with default settings
vect = 
#Fit and transform X
Xdtm = 
#Initialize, fit, and score model on training data
nb = 


In [None]:
#Assign list of features to tokens variable
tokens = 
len(tokens)

In [None]:
#Print random slice of features


In [None]:
#How many times does a word appear in each class


In [None]:
#Shape


In [None]:
#Counts of each word in documents marked "ham"
ham_token_count = 


In [None]:
#Counts of each word in documents marked "spam"
spam_token_count = 


In [None]:
# create a DataFrame of tokens with their separate ham and spam counts
df_tokens = pd.DataFrame({'token':tokens, 
                          'ham':ham_token_count, 
                          'spam':spam_token_count}).set_index('token')

#Random sample of the data
df_tokens.sample(10, random_state=12)

In [None]:
# add 1 to ham and spam counts to avoid dividing by 0
df_tokens['ham'] = 
df_tokens['spam'] = 


In [None]:
# Naive Bayes counts the number of observations in each class


In [None]:
# convert the ham and spam counts into frequencies
df_tokens['ham'] = 
df_tokens['spam'] = 


In [None]:
# calculate the ratio of spam-to-ham for each token
df_tokens['spam_ratio'] = 


In [None]:
# examine the DataFrame sorted by spam_ratio


Voila, the top ten "spammiest" words in the dataset.
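
The whole calculation can be traced end to end on a tiny example. The sketch below follows the same steps (count tokens per class, add 1, divide by the number of observations in each class, take the ratio) using only the standard library. The four messages are made up for illustration, and whitespace splitting stands in for the vectorizer.

```python
from collections import Counter

# Hypothetical labeled messages, not taken from the SMS dataset
messages = [
    ("ham", "see you at lunch"),
    ("ham", "call me after lunch"),
    ("spam", "win cash now"),
    ("spam", "call now to claim cash prize"),
]

token_counts = {"ham": Counter(), "spam": Counter()}
class_counts = Counter()
for label, text in messages:
    class_counts[label] += 1
    token_counts[label].update(text.lower().split())

vocabulary = set(token_counts["ham"]) | set(token_counts["spam"])

spam_ratio = {}
for token in vocabulary:
    # Add 1 to each count to avoid dividing by zero, then convert the
    # counts to frequencies by dividing by the observations per class.
    ham_freq = (token_counts["ham"][token] + 1) / class_counts["ham"]
    spam_freq = (token_counts["spam"][token] + 1) / class_counts["spam"]
    spam_ratio[token] = spam_freq / ham_freq

ranked = sorted(spam_ratio.items(), key=lambda item: item[1], reverse=True)
for token, ratio in ranked[:5]:
    print(token, round(ratio, 2))
```

A ratio above 1 means the token leans spam, a ratio below 1 means it leans ham, and a token such as "call" that appears equally often in both classes lands at exactly 1.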

## Resources

Tokenization:
- http://text-processing.com/demo/tokenize/
- https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/


POS tagging:
- https://nlp.stanford.edu/software/tagger.shtml
- http://language.worldofcomputing.net/pos-tagging/parts-of-speech-tagging.html
- https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/

NLTK:
- https://likegeeks.com/nlp-tutorial-using-python-nltk/
- http://billchambers.me/tutorials/2015/01/14/python-nlp-cheatsheet-nltk-scikit-learn.html

TextBlob:
- http://textblob.readthedocs.io/en/dev/quickstart.html
- http://rwet.decontextualize.com/book/textblob/
- http://text-analytics101.rxnlp.com/2014/11/what-are-n-grams.html

Stemming and Lemmatization:
- http://textminingonline.com/dive-into-nltk-part-iv-stemming-and-lemmatization
- https://stackoverflow.com/questions/1787110/what-is-the-true-difference-between-lemmatization-vs-stemming

Vectorizing Text:
- https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/
- http://planspace.org/20150524-tfidf_is_about_what_matters/
- http://www.tfidf.com/
- http://blog.christianperone.com/2011/09/machine-learning-text-feature-extraction-tf-idf-part-i/

Text classification:
- https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a
- https://bbengfort.github.io/tutorials/2016/05/19/text-classification-nltk-sckit-learn.html
- https://www.dataquest.io/blog/natural-language-processing-with-python/


## Lab time
- There are three other datasets: Pitchfork album reviews, fake/real news, and political lean.
- Pick one of those three datasets and try to build a model that differentiates between good/bad reviews, real/fake news, or liberal/conservative leaning. Make sure to examine the false positive and false negative texts. Use the "spamminess" technique on the corpus as well.
- Use both count and tfidf vectorizers. Use TextBlob to determine polarity and subjectivity.