# Natural Language Processing (NLP) Part 2

## Time to pick up where we left off

**Goals:**

- Finish text classification lesson by using stemming and lemmatization in our vectorizers
- Build a simple text summarizer
- Find similar documents with cosine similarity and clustering

In [None]:
#Imports
from time import time
import pandas as pd
pd.set_option("display.max_colwidth", 500)
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA, TruncatedSVD, NMF
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize, wordpunct_tokenize
from nltk.tokenize import TreebankWordTokenizer
from nltk.tag import pos_tag
from nltk.corpus import stopwords
from string import punctuation
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.util import ngrams
from textblob import TextBlob

## Text Classification continued

To wrap up our text classification section, we're going to learn how to incorporate stemming and lemmatization in our vectorizers.

In [None]:
#Load in yelp review data

path = "../../data/NLP_data/yelp.csv"

yelp = pd.read_csv(path, encoding='unicode-escape')

yelp.head()

In [None]:
# Create a new DataFrame called yelp_best_worst that only contains the 5-star and 1-star reviews
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

In [None]:
# define X and y
X = yelp_best_worst.text
y = yelp_best_worst.stars

#Null accuracy
print(y.value_counts(normalize=True))

# split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
#Look at the analyzer section of the CountVectorizer doc strings
CountVectorizer()

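The analyzer argument allows us to pass in our own function to transform/tokenize the words in our corpus. Keep in mind that when analyzer is a callable it takes over preprocessing, tokenization, and n-gram generation, so settings such as stop_words, lowercase, and ngram_range are not applied to its output.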
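
As a minimal sketch of what a callable analyzer looks like, the block below stems tokens before they are counted. It is an illustration only: `regex_stem_analyzer` and the toy sentences are hypothetical, and a plain regular expression stands in for TextBlob tokenization so the example needs no extra corpus downloads.

```python
import re
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: regex tokenization stands in for TextBlob in this sketch
stemmer = SnowballStemmer("english")

def regex_stem_analyzer(text):
    # Lowercase the text, pull out alphabetic tokens, and stem each one
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(token) for token in tokens]

# Toy documents made up for illustration
toy_docs = ["The dogs were running", "A dog runs"]

toy_vect = CountVectorizer(analyzer=regex_stem_analyzer)
toy_dtm = toy_vect.fit_transform(toy_docs)

# "dogs"/"dog" and "running"/"runs" collapse to shared stems
print(sorted(toy_vect.vocabulary_))
```

Because the vectorizer hands the raw document straight to the function, anything you want done (lowercasing, stop word removal, n-grams) has to happen inside it.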

In [None]:
# define a function that accepts text and returns a list of stems
def word_tokenize_stem(text):
    #Transform and tokenize words using TextBlob
    
    #Initialize stemmer
    
    #Return a list of the stems
    


# define a function that accepts text and returns a list of lemmas (noun version)
def word_tokenize_lemma(text):
    #Transform and tokenize words using TextBlob
    
    #Return a list of lemmas
    

# define a function that accepts text and returns a list of lemmas (verb version)
def word_tokenize_lemma_verb(text):
    
    #Return a list of lemmas
    

Let's try our three new functions with both count and tfidf vectorizers. First, let's create a function that takes an initialized but unfit vectorizer as an argument and:

- Fits and transforms the training data using the vectorizer
- Transforms the testing data
- Fits a naive Bayes model on the training data
- Evaluates it on the training and testing data
- Prints the number of features and the scores
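
The steps above can be sketched as follows. This is one possible shape rather than the required answer: `evaluate_vectorizer` is a hypothetical name, it takes the data as arguments instead of relying on the notebook's globals, and the toy reviews are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def evaluate_vectorizer(vect, train_text, test_text, train_y, test_y):
    # Learn the vocabulary on the training text only, then reuse it on the test text
    train_dtm = vect.fit_transform(train_text)
    test_dtm = vect.transform(test_text)

    # Fit naive Bayes on the training document-term matrix
    nb = MultinomialNB()
    nb.fit(train_dtm, train_y)

    # Accuracy on both splits
    train_score = nb.score(train_dtm, train_y)
    test_score = nb.score(test_dtm, test_y)

    print("Features: ", train_dtm.shape[1])
    print("Training Score: ", train_score)
    print("Testing Score: ", test_score)
    return train_dtm.shape[1], train_score, test_score

# Toy reviews (hypothetical data)
toy_train = ["great food great service", "terrible food rude staff",
             "loved it amazing", "awful never again"]
toy_train_y = [5, 1, 5, 1]
toy_test = ["great amazing service", "rude awful staff"]
toy_test_y = [5, 1]

n_features, train_score, test_score = evaluate_vectorizer(
    CountVectorizer(), toy_train, toy_test, toy_train_y, toy_test_y)
```

Calling `transform` (not `fit_transform`) on the test text matters: the test set must be encoded with the vocabulary learned from the training set.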

In [None]:
def text_model_evaluator(vect):
    
    
    
    
    print ("Features: ", )
    print ("Training Score: ", )
    print ("Testing Score: ", )

In [None]:
#Initialize Count Vectorizer with stop_words set to english and analyzer to word_tokenize_stem

vect = 

#Pass vectorizer into function


In [None]:
#Initialize Count Vectorizer with stop_words set to english and analyzer to word_tokenize_lemma

vect = 

#Pass vectorizer into function


In [None]:
#Initialize Count Vectorizer with stop_words set to english and analyzer to word_tokenize_lemma_verb

vect = 

#Pass vectorizer into function


How do you interpret these results? Let's try it again with tfidf.

In [None]:
#Initialize Tfidf Vectorizer with stop_words set to english and analyzer to word_tokenize_stem

vect = 

#Pass vectorizer into function


In [None]:
#Initialize Tfidf Vectorizer with stop_words set to english and analyzer to word_tokenize_lemma

vect = 

#Pass vectorizer into function


In [None]:
#Initialize Tfidf Vectorizer with stop_words set to english and analyzer to word_tokenize_lemma_verb

vect = 

#Pass vectorizer into function


How do the tfidf vectorizers compare to counts?

Grid search time. Let's build grid search objects that try all of the analyzer functions with the count and tfidf vectorizers, then do the same with randomized search.
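
Before running these searches, it helps to estimate how much work each one does. The sketch below counts the combinations in a grid shaped like the ones defined in the cells that follow. The analyzer entries are placeholder strings standing in for the custom functions, and the counts exclude the final refit on the full data.

```python
from sklearn.model_selection import ParameterGrid

# Same shape as the lesson's grids; analyzer values are placeholder strings
toy_param_grid = {
    "countvectorizer__max_features": [1000, 2500, 5000, 7500, 10000],
    "countvectorizer__ngram_range": [(1, 1), (1, 2), (2, 2)],
    "countvectorizer__lowercase": [True, False],
    "countvectorizer__binary": [True, False],
    "countvectorizer__analyzer": ["word", "stem", "lemma", "lemma_verb"],
}

# Every combination is tried once per cross-validation fold
n_combinations = len(ParameterGrid(toy_param_grid))
cv_folds = 5
grid_fits = n_combinations * cv_folds

# Randomized search only samples n_iter combinations
n_iter = 10
random_fits = n_iter * cv_folds

print(n_combinations, grid_fits, random_fits)
```

The gap between the two fit counts is why the randomized searches finish so much faster, at the cost of possibly missing the best combination.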

CountVectorizer grid search

In [None]:
#Make pipeline for countvectorizer and naive bayes model
pipe_cv = make_pipeline(CountVectorizer(), MultinomialNB())

#Initialize parameters for count vectorizer
param_grid_cv = {}
param_grid_cv["countvectorizer__max_features"] = [1000, 2500, 5000, 7500, 10000]
param_grid_cv["countvectorizer__ngram_range"] = [(1,1), (1,2), (2,2)]
param_grid_cv["countvectorizer__lowercase"] = [True, False]
param_grid_cv["countvectorizer__binary"] = [True, False]
param_grid_cv["countvectorizer__analyzer"] = ["word", word_tokenize_stem,
                                              word_tokenize_lemma, word_tokenize_lemma_verb]

In [None]:
#Grid search object

grid_cv = GridSearchCV(pipe_cv, param_grid_cv, cv = 5, scoring = "accuracy")

#Initialize timestamp
t = time()
#fit grid search object
grid_cv.fit(X, y)
#Print time elapsed
print (time() - t)

In [None]:
#Best parameters
print (grid_cv.best_params_)
#Best score
print (grid_cv.best_score_)

TfidfVectorizer grid search

In [None]:
#Make pipeline for tfidfvectorizer and naive bayes model
pipe_tf = make_pipeline(TfidfVectorizer(), MultinomialNB())


#Initialize parameters for tfidf vectorizer
param_grid_tf = {}
param_grid_tf["tfidfvectorizer__max_features"] = [1000, 2500, 5000, 7500, 10000]
param_grid_tf["tfidfvectorizer__ngram_range"] = [(1,1), (1,2), (2,2)]
param_grid_tf["tfidfvectorizer__lowercase"] = [True, False]
param_grid_tf["tfidfvectorizer__binary"] = [True, False]
param_grid_tf["tfidfvectorizer__analyzer"] = ["word", word_tokenize_stem,
                                              word_tokenize_lemma, word_tokenize_lemma_verb]

In [None]:
#Grid search object

grid_tf = GridSearchCV(pipe_tf, param_grid_tf, cv = 5, scoring = "accuracy")

#Initialize timestamp
t = time()
#fit grid search object
grid_tf.fit(X, y)
#Print time elapsed
print (time() - t)

CountVectorizer randomized search

In [None]:
#Randomized grid search with n_iter = 10
randsearch_cv = RandomizedSearchCV(pipe_cv, n_iter = 10,
                        param_distributions = param_grid_cv, cv = 5, scoring = "accuracy")

#Time the code 

t = time()

#Fit grid on data
randsearch_cv.fit(X, y)

#Print time difference

print (time() - t)

In [None]:
#Best params
print (randsearch_cv.best_params_)
#Best score
print (randsearch_cv.best_score_)

TfidfVectorizer randomized search

In [None]:
#Randomized grid search with n_iter = 10
randsearch_tf = RandomizedSearchCV(pipe_tf, n_iter = 10,
                        param_distributions = param_grid_tf, cv = 5, scoring = "accuracy")

#Time the code 

t = time()

#Fit grid on data
randsearch_tf.fit(X, y)

#Print time difference

print (time() - t)

In [None]:
#Best params
print (randsearch_tf.best_params_)
#Best score
print (randsearch_tf.best_score_)

This wraps up text classification. Now onto the rest of the lesson.

## Summarizing text

We're going to build a very simple summarizer that uses tfidf scores on a corpus of data science and artificial intelligence articles.

In [None]:
#Load in data

path = "../../data/NLP_data/ds_articles.csv"

#We'll only be using the text and title columns
articles = pd.read_csv(path, usecols=["text", "title"], encoding="utf-8")

#Drop nulls
articles.dropna(inplace=True)

#Reset index
articles.reset_index(inplace=True, drop=True)

articles.head()

In [None]:
#Info


In [None]:
#Initialize tfidf with stop_words = english, max_features = 1000, and stem analyzer

tfidf = 


#Fit and transform the text using the tfidf vectorizer
text = 
dtm = 

#Assign tokens to features
features = 

print (len())

In [None]:
#Create a dataframe of features and their idf scores
idfscores = 
idfscores["tokens"] = 
idfscores["scores"] = 



In [None]:
#Top ten most important words


In [None]:
#Top ten least important words


Let's write our summarizer function, which randomly selects an article to summarize. By "summarize," I mean show the five words with the highest tfidf values.

In [None]:
def summarize():
    #Randomly choose index value
    index = np.random.choice(articles.index, 1)[0]
    article = text.iloc[index]
    # create a dictionary of words and their TF-IDF scores
    word_scores = {}
    for word in TextBlob(article).words:
        word = word.lower()
        if word in features:
            word_scores[word] = dtm[index, features.index(word)]
            
    # print words with the top 5 TF-IDF scores
    print ('TOP SCORING WORDS:')
    top_scores = sorted(word_scores.items(), key=lambda x: x[1], reverse=True)[:5]
    for word, score in top_scores:
        print (word)
        
    #Print title of article
    print ("\n", articles.title[index])
    
    #Print the text of article
    # print(article)

In [None]:
#Give it a go


## Text Similarity with Cosine Similarity and Clustering

### Cosine Similarity

![ew](https://i2.wp.com/dataaspirant.com/wp-content/uploads/2015/04/cosine.png?w=697)
<br><br>
" Cosine similarity metric finds the normalized dot product of the two attributes. By determining the cosine similarity, we would effectively try to find the cosine of the angle between the two objects. The cosine of 0° is 1, and it is less than 1 for any other angle.

It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.

Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in (0,1). One of the reasons for the popularity of cosine similarity is that it is very efficient to evaluate, especially for sparse vectors."
<br>
Source: [Dataaspirant](http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/)

In [None]:
#DIY cosine similarity function

def square_rooted(x):
    #Euclidean norm of the vector (left unrounded so the final result keeps full precision)
    return np.sqrt(sum(a*a for a in x))

def cosine_similarity_function(x, y):
    #Dot product divided by the product of the two norms
    numerator = sum(a*b for a, b in zip(x, y))
    denominator = square_rooted(x) * square_rooted(y)
    return round(numerator / float(denominator), 3)

vec1 = [3, 45, 7, 2]
vec2 = [2, 54, 13, 15]
cosine_similarity_function(vec1, vec2)

Derive a matrix of similarities between all of the data science articles.
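
As a small sketch of what that matrix looks like, the block below runs scikit-learn's `cosine_similarity` on a toy tfidf matrix. The three sentences and the `toy_` names are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus: the first two documents share vocabulary, the third does not
toy_docs = [
    "deep learning for image recognition",
    "image recognition with deep neural networks",
    "stock prices and interest rates",
]

toy_dtm = TfidfVectorizer().fit_transform(toy_docs)

# Passing one matrix compares every row with every other row
toy_sim = cosine_similarity(toy_dtm)

print(toy_sim.shape)
print(np.round(toy_sim, 2))
```

The result is square and symmetric, with ones on the diagonal because every document is identical to itself.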

In [None]:
#Calculate cosine distance for each pair of documents
dist = 

In [None]:
#make it a dataframe
dist_df = 

#Shape


Let's compare some articles!

In [None]:
#Index position of article
index = 239

In [None]:
#Assign titles column to titles variable

titles = 



#Print title
print ()

#print article

print ("\n ************************************************ \n", )

We need to take the index value and use it to grab the column of scores between every article and the one at the index we chose above.
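
A minimal sketch of that lookup with NumPy, using a hypothetical 4x4 similarity matrix where higher means more alike. If your matrix holds distances instead, sort in ascending order.

```python
import numpy as np

# Hypothetical similarity matrix, made up for illustration
toy_sim = np.array([
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.9],
    [0.3, 0.4, 0.9, 1.0],
])
toy_titles = np.array(["A", "B", "C", "D"])

query = 0

# Column of scores between every document and the query document
toy_column = toy_sim[:, query]

# Sort indices from most to least similar, then drop the query itself
order = np.argsort(toy_column)[::-1]
closest = order[order != query][:2]

print(toy_titles[closest])
```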

In [None]:
#Pass
dist_column = 

In [None]:
#Get the index values of the 5 closest articles

closest_index = 

In [None]:
#Pass index values into titles and print them



In [None]:
#Pass index values into titles but don't print


### Clustering

It is common practice to cluster on tfidf data rather than raw count-vectorized data, since tfidf down-weights words that show up in most documents.
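
Here is a small sketch of the workflow on a toy corpus, assuming two clusters rather than the four used in the exercise. The sentences, `toy_` names, and the `n_init` and `random_state` values are illustrative choices.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Toy corpus with two obvious topics
toy_docs = [
    "neural networks deep learning",
    "deep learning neural models",
    "stock market prices fall",
    "market prices and stock trading",
]

toy_dtm = TfidfVectorizer().fit_transform(toy_docs)

# Fix random_state so the cluster assignment is repeatable
toy_km = KMeans(n_clusters=2, n_init=10, random_state=1)
toy_km.fit(toy_dtm)

# Silhouette ranges from -1 (poor) to 1 (well separated)
toy_score = silhouette_score(toy_dtm, toy_km.labels_)

print(toy_km.labels_, toy_score)
```

The cluster numbers themselves are arbitrary, so compare which documents share a label rather than the label values.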

In [None]:
#Initialize clustering algorithm with 4 clusters and fit it on dtm

km4 = 
#Fit algorithm


In [None]:
#Check out silhouette score


In [None]:
#Assign labels to articles dataframe 

articles["cluster"] = 

Print 5 randomly selected headlines from each cluster

In [None]:
#Cluster 0


In [None]:
#Cluster 1



In [None]:
#Cluster 2



In [None]:
#Cluster 3



What do you think the clusters are? Are they easy to decipher? Ignoring the silhouette score, do they pass the eye test?

Let's examine the top words of each cluster

In [None]:
print("Top terms per cluster:")
#Sort each centroid's feature weights from largest to smallest
order_centroids = km4.cluster_centers_.argsort()[:, ::-1]
terms = tfidf.get_feature_names_out()
for i in range(4):
    print("Cluster %d:" % i, end="")
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end="")
    print("\n")

Let's try this exercise again but this time we'll cluster the cosine distances.
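
In this setup each row of the distance matrix becomes a document's feature vector: how far it sits from every other document. The sketch below assumes the distances are computed as 1 minus cosine similarity, which is an assumption rather than something the lesson specifies. The toy corpus and names are made up.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus with two obvious topics
toy_docs = [
    "neural networks deep learning",
    "deep learning neural models",
    "stock market prices fall",
    "market prices and stock trading",
]

toy_dtm = TfidfVectorizer().fit_transform(toy_docs)

# Assumption: distance = 1 - cosine similarity
toy_dist = 1 - cosine_similarity(toy_dtm)

# Each row of toy_dist is treated as one sample with n_docs features
toy_km_dist = KMeans(n_clusters=2, n_init=10, random_state=1).fit(toy_dist)

print(toy_km_dist.labels_)
```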

In [None]:
#Initialize clustering algorithm with 4 clusters
km4 = 

#fit it on dist array



In [None]:
#Check out silhouette score
silhouette_score(dist, km4.labels_)

Print 5 randomly selected headlines from each cluster

In [None]:
#Assign new labels to data frame



In [None]:
#Cluster 0
for i in articles[articles.cluster_dist == 0].sample(n=5).title.tolist():
    print (i)

In [None]:
#Cluster 1
for i in articles[articles.cluster_dist == 1].sample(n=5).title.tolist():
    print (i)

In [None]:
#Cluster 2
for i in articles[articles.cluster_dist == 2].sample(n=5).title.tolist():
    print (i)

In [None]:
#Cluster 3
for i in articles[articles.cluster_dist == 3].sample(n=5).title.tolist():
    print (i)

Are the results better?

# Resources


My fake news classifier article: https://opendatascience.com/blog/how-to-build-a-fake-news-classification-model/
<br>
My data science topic modeling article: https://opendatascience.com/blog/how-to-analyze-articles-about-data-science-using-data-science/
<br><br>
**Regular Expressions**
- https://www.dataquest.io/blog/regular-expressions-data-scientists/
- https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial
- https://www.oreilly.com/ideas/an-introduction-to-regular-expressions


**NLP Tutorials**

- https://github.com/bonzanini/nlp-tutorial
- https://github.com/totalgood/pycon-2016-nlp-tutorial

**Text similarity:**
- https://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/
- http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/
- http://billchambers.me/tutorials/2014/12/22/cosine-similarity-explained-in-python.html
- Explains why text similarity uses cosine similarity -> https://www.quora.com/What-are-the-mechanics-of-cosine-similarity-in-natural-language-processing

**Text classification:**
- Another fake news tutorial -> https://www.datacamp.com/community/tutorials/scikit-learn-fake-news
- http://nlpforhackers.io/text-classification/
- http://zacstewart.com/2015/04/28/document-classification-with-scikit-learn.html
- https://github.com/javedsha/text-classification
- https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a
- https://bbengfort.github.io/tutorials/2016/05/19/text-classification-nltk-sckit-learn.html


**Text clustering:**

- Great tutorial -> http://brandonrose.org/clustering
- http://nlpforhackers.io/recipe-text-clustering/
- https://pythonprogramminglanguage.com/kmeans-text-clustering/
- http://mccormickml.com/2015/08/05/document-clustering-example-in-scikit-learn/


**Word Embeddings/Word2Vec**

- https://chatbotsmagazine.com/introduction-to-word-embeddings-55734fd7068a
- https://www.springboard.com/blog/introduction-word-embeddings/
- http://ruder.io/word-embeddings-1/
- https://www.slideshare.net/BhaskarMitra3/a-simple-introduction-to-word-embeddings
- https://github.com/fastai/word-embeddings-workshop


**Topic Modeling**

- http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
- https://blog.bigml.com/2016/11/16/introduction-to-topic-models/
- http://nbviewer.jupyter.org/github/ogrisel/notebooks/blob/master/nmf_topics.ipynb?create=1
- https://www.youtube.com/watch?v=ZgyA1Q2ywbM
- https://www.youtube.com/watch?v=SjRss8Uk6mQ
- https://github.com/derekgreene/topic-model-tutorial

# Lab time

Pick a text dataset to work on for the rest of class. There are four other datasets in the NLP_data folder that you can work with: pitchfork album reviews, fake/real news, deadspin, and political lean. Make sure to unzip political lean or fake news first. You can also continue to work with the datasets we've already used (data science, yelp, spam).

<br>

For the rest of class apply supervised or unsupervised learning techniques to the dataset of your choice. 

- Build a model that can differentiate between good/bad reviews, real/fake news, or liberal/conservative leaning.

- Build a model that predicts how many page views a deadspin article can get based on its headline and tags.

- Ignore the labels and attempt to cluster the articles.

- Have fun with the summarizer!!

<br>

Be prepared to share your results at the end of class.
