# Natural Language Processing (NLP) Part 2

## Time to pick up where we left off

**Goals:**

- Finish the text classification lesson by using stemming and lemmatization in our vectorizers
- Build a simple text summarizer
- Find similar documents with cosine similarity and clustering

In [2]:
#Imports
from time import time
import pandas as pd
pd.set_option("max.colwidth", 500)
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb
#sklearn.cross_validation and sklearn.grid_search were merged into sklearn.model_selection
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA, TruncatedSVD, NMF
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize, wordpunct_tokenize
from nltk.tokenize import TreebankWordTokenizer
from nltk.tag import pos_tag
from nltk.corpus import stopwords
from string import punctuation
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.util import ngrams
from textblob import TextBlob

## Text Classification continued

To wrap up our text classification section, we're going to learn how to incorporate stemming and lemmatization into our vectorizers.

In [3]:
#Load in yelp review data

path = "../../data/NLP_data/yelp.csv"

yelp = pd.read_csv(path, encoding='unicode-escape')

yelp.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better.\r\n\r\nDo yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I've ever had. I'm pretty sure they only use ing...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,"I have no idea why some people give bad reviews about this place. It goes to show you, you can please everyone. They are probably griping about something that their own fault...there are many people like that.\r\n\r\nIn any case, my friend and I arrived at about 5:50 PM this past Sunday. It was pretty crowded, more than I thought for a Sunday evening and thought we would have to wait forever to get a seat but they said we'll be seated when the girl comes back from seating someone else. We we...",review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I also dig their candy selection :),review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!! It's very convenient and surrounded by a lot of paths, a desert xeriscape, baseball fields, ballparks, and a lake with ducks.\r\n\r\nThe Scottsdale Park and Rec Dept. does a wonderful job of keeping the park clean and shaded. You can find trash cans and poopy-pick up mitts located all over the park and paths.\r\n\r\nThe fenced in area is huge to let the dogs run, play, and sniff!",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,"General Manager Scott Petello is a good egg!!! Not to go into detail, but let me assure you if you have any issues (albeit rare) speak with Scott and treat the guy with some respect as you state your case and I'd be surprised if you don't walk out totally satisfied as I just did. Like I always say..... ""Mistakes are inevitable, it's how we recover from them that is important""!!!\r\n\r\nThanks to Scott and his awesome staff. You've got a customer for life!! .......... :^)",review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


In [4]:
# Create a new DataFrame called yelp_best_worst that only contains the 5-star and 1-star reviews
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

In [5]:
# define X and y
X = yelp_best_worst.text
y = yelp_best_worst.stars

#Null accuracy
print(y.value_counts(normalize=True))

# split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
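
Null accuracy is the score a model would get by always predicting the most frequent class, so it is the baseline any classifier has to beat. A minimal sketch with made-up labels (the labels list is hypothetical, not the yelp data):

```python
from collections import Counter

# Hypothetical star ratings; the real y comes from yelp_best_worst.stars
labels = [5, 5, 5, 1, 5, 1, 5, 5]

# Share of each class, analogous to y.value_counts(normalize=True)
counts = Counter(labels)
shares = {star: n / len(labels) for star, n in counts.items()}

# Null accuracy: always predict the majority class
null_accuracy = max(shares.values())
print(shares)
print(null_accuracy)
```

A testing score only means something when it is compared against this number.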

In [6]:
#Look at the analyzer section of the CountVectorizer doc strings
CountVectorizer()

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

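The analyzer argument lets us pass in our own function to transform and tokenize the words in our corpus. Note that when analyzer is a callable, scikit-learn hands it the raw text and skips its built-in preprocessing, so settings such as stop_words, lowercase, and ngram_range have no effect. Any stop-word removal or lowercasing has to happen inside the function.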

In [7]:
# define a function that accepts text and returns a list of stems
def word_tokenize_stem(text):
    #Transform and tokenize words using TextBlob
    words = TextBlob(text).words
    #Initialize stemmer
    stemmer = SnowballStemmer("english")
    #Return a list of the stems
    return [stemmer.stem(word) for word in words]


# define a function that accepts text and returns a list of lemmas (noun version)
def word_tokenize_lemma(text):
    #Transform and tokenize words using TextBlob
    words = TextBlob(text).words
    #Return a list of lemmas
    return [word.lemmatize() for word in words]

# define a function that accepts text and returns a list of lemmas (verb version)
def word_tokenize_lemma_verb(text):
    #Transform and tokenize words using TextBlob
    words = TextBlob(text).words
    #Return a list of lemmas
    return [word.lemmatize(pos="v") for word in words]
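
Because a callable analyzer receives the raw text, any filtering has to live inside the function itself. The sketch below imitates that contract using only the standard library. toy_analyzer, toy_stem, and TOY_STOP_WORDS are made up for illustration, and the crude suffix rule is not what TextBlob or SnowballStemmer actually do.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only
TOY_STOP_WORDS = {"the", "was", "and", "a", "it"}

def toy_stem(word):
    """Crude stand-in for a stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_analyzer(text):
    """Raw text in, list of tokens out: the contract a callable analyzer follows."""
    words = re.findall(r"[a-z']+", text.lower())
    return [toy_stem(w) for w in words if w not in TOY_STOP_WORDS]

docs = ["The waiter was excellent and the food arrived quickly",
        "It was crowded and we waited"]

# What a count vectorizer does with the analyzer: tokenize each document, then count
for doc in docs:
    print(Counter(toy_analyzer(doc)))
```

Lowercasing and stop-word removal happen here only because the function does them explicitly.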

Let's try our three new functions with both count and tfidf vectorizers.
<br>
First, let's create a function that does the following:
- Takes an initialized but unfit vectorizer as an argument
- Fits and transforms the training data using the vectorizer
- Transforms the testing data
- Fits a naive Bayes model on the training data
- Evaluates it on the training and testing data
- Prints the number of features and the scores

In [9]:
def text_model_evaluator(vect):
    X_train_dtm = vect.fit_transform(X_train)
    X_test_dtm = vect.transform(X_test)
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    print ("Features: ", X_train_dtm.shape[1])
    print ("Training Score: ", nb.score(X_train_dtm, y_train))
    print ("Testing Score: ", nb.score(X_test_dtm, y_test))

In [10]:
#Initialize Count Vectorizer with stop_words set to english and analyzer to word_tokenize_stem

vect = CountVectorizer(stop_words="english", analyzer=word_tokenize_stem)

#Pass vectorizer into function
text_model_evaluator(vect)

Features:  13273
Training Score:  0.970626631854
Testing Score:  0.924657534247


In [11]:
#Initialize Count Vectorizer with stop_words set to english and analyzer to word_tokenize_lemma

vect = CountVectorizer(stop_words="english", analyzer=word_tokenize_lemma)

#Pass vectorizer into function
text_model_evaluator(vect)

Features:  20599
Training Score:  0.974216710183
Testing Score:  0.904109589041


In [12]:
#Initialize Count Vectorizer with stop_words set to english and analyzer to word_tokenize_lemma_verb

vect = CountVectorizer(stop_words="english", analyzer=word_tokenize_lemma_verb)

#Pass vectorizer into function
text_model_evaluator(vect)

Features:  19431
Training Score:  0.974216710183
Testing Score:  0.906066536204


How do you interpret these results? Let's try it again with tfidf

In [13]:
#Initialize Tfidf Vectorizer with stop_words set to english and analyzer to word_tokenize_stem

vect = TfidfVectorizer(stop_words="english", analyzer=word_tokenize_stem)

#Pass vectorizer into function
text_model_evaluator(vect)

Features:  13273
Training Score:  0.816906005222
Testing Score:  0.819960861057


In [14]:
#Initialize Tfidf Vectorizer with stop_words set to english and analyzer to word_tokenize_lemma

vect = TfidfVectorizer(stop_words="english", analyzer=word_tokenize_lemma)

#Pass vectorizer into function
text_model_evaluator(vect)

Features:  20599
Training Score:  0.817232375979
Testing Score:  0.819960861057


In [15]:
#Initialize Tfidf Vectorizer with stop_words set to english and analyzer to word_tokenize_lemma_verb

vect = TfidfVectorizer(stop_words="english", analyzer=word_tokenize_lemma_verb)

#Pass vectorizer into function
text_model_evaluator(vect)


How do the tfidf vectorizers compare to counts?

Time for a grid search. Let's build grid search objects that try all of the analyzer functions with both the count and tfidf vectorizers. We'll then do the same with randomized search.

CountVectorizer grid search

In [20]:
#Make pipeline for countvectorizer and naive bayes model
pipe_cv = make_pipeline(CountVectorizer(), MultinomialNB())

#Initialize parameters for count vectorizer
param_grid_cv = {}
param_grid_cv["countvectorizer__max_features"] = [1000, 2500, 5000, 7500, 10000]
param_grid_cv["countvectorizer__ngram_range"] = [(1,1), (1,2), (2,2)]
param_grid_cv["countvectorizer__lowercase"] = [True, False]
param_grid_cv["countvectorizer__binary"] = [True, False]
param_grid_cv["countvectorizer__analyzer"] = ["word", word_tokenize_stem,
                                              word_tokenize_lemma, word_tokenize_lemma_verb]
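
Before launching the search it helps to count the work involved. A rough sketch that mirrors the grid above with plain lists (the analyzer functions are represented by their names as strings so the block runs on its own):

```python
from itertools import product

# Same option counts as param_grid_cv; analyzers are named by string here
grid = {
    "max_features": [1000, 2500, 5000, 7500, 10000],
    "ngram_range": [(1, 1), (1, 2), (2, 2)],
    "lowercase": [True, False],
    "binary": [True, False],
    "analyzer": ["word", "word_tokenize_stem",
                 "word_tokenize_lemma", "word_tokenize_lemma_verb"],
}

n_combinations = len(list(product(*grid.values())))
cv_folds = 5
n_iter = 10  # candidates sampled by the randomized search

print("Grid search fits:", n_combinations * cv_folds)
print("Randomized search fits:", n_iter * cv_folds)
```

The full grid needs 1,200 model fits (240 combinations times 5 folds, final refit not counted), against 50 for a randomized search with n_iter = 10. Many of those combinations are also redundant, since ngram_range and lowercase are ignored whenever the analyzer is one of the custom functions.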

In [None]:
#Grid search object

grid_cv = GridSearchCV(pipe_cv, param_grid_cv, cv = 5, scoring = "accuracy")

#Initialize time stamp
t = time()
#fit grid search object
grid_cv.fit(X, y)
#Print time elapsed
print (time() - t)

In [None]:
#Best parameters
print (grid_cv.best_params_)
#Best score
print (grid_cv.best_score_)

TfidfVectorizer grid search

In [21]:
#Make pipeline for tfidfvectorizer and naive bayes model
pipe_tf = make_pipeline(TfidfVectorizer(), MultinomialNB())


#Initialize parameters for tfidf vectorizer
param_grid_tf = {}
param_grid_tf["tfidfvectorizer__max_features"] = [1000, 2500, 5000, 7500, 10000]
param_grid_tf["tfidfvectorizer__ngram_range"] = [(1,1), (1,2), (2,2)]
param_grid_tf["tfidfvectorizer__lowercase"] = [True, False]
param_grid_tf["tfidfvectorizer__binary"] = [True, False]
param_grid_tf["tfidfvectorizer__analyzer"] = ["word", word_tokenize_stem,
                                              word_tokenize_lemma, word_tokenize_lemma_verb]

In [None]:
#Grid search object

grid_tf = GridSearchCV(pipe_tf, param_grid_tf, cv = 5, scoring = "accuracy")

#Initialize time stamp
t = time()
#fit grid search object
grid_tf.fit(X, y)
#Print time elapsed
print (time() - t)

CountVectorizer randomized search

In [23]:
#Randomized grid search with n_iter = 10
randsearch_cv = RandomizedSearchCV(pipe_cv, n_iter = 10,
                        param_distributions = param_grid_cv, cv = 5, scoring = "accuracy")

#Time the code 

t = time()

#Fit grid on data
randsearch_cv.fit(X, y)

#Print time difference

print (time() - t)

282.130045176


In [24]:
#Best params
print (randsearch_cv.best_params_)
#Best score
print (randsearch_cv.best_score_)

{'countvectorizer__lowercase': True, 'countvectorizer__analyzer': 'word', 'countvectorizer__ngram_range': (1, 1), 'countvectorizer__binary': False, 'countvectorizer__max_features': 7500}
0.931962799804


TfidfVectorizer randomized search

In [None]:
#Randomized grid search with n_iter = 10
randsearch_tf = RandomizedSearchCV(pipe_tf, n_iter = 10,
                        param_distributions = param_grid_tf, cv = 5, scoring = "accuracy")

#Time the code 

t = time()

#Fit grid on data
randsearch_tf.fit(X, y)

#Print time difference

print (time() - t)

In [None]:
#Best params
print (randsearch_tf.best_params_)
#Best score
print (randsearch_tf.best_score_)

This wraps up text classification. Now onto the rest of the lesson.

## Summarizing text

We're going to build a very simple summarizer that uses tfidf scores on a corpus of data science and artificial intelligence articles.

In [10]:
#Load in data

path = "../../data/NLP_data/ds_articles.csv"

#We'll only be using the text and title columns
articles = pd.read_csv(path, usecols=["text", "title"], encoding="utf-8")

#Drop nulls
articles.dropna(inplace=True)

#Reset index
articles.reset_index(inplace=True, drop=True)

articles.head()

Unnamed: 0,text,title
0,One of the greatest difficulties that companies wishing to become more analytical have encountered over the last several years is finding good analysts and data scientists. A considerable amount of printer’s ink has been spilled into articles over this issue. Many of them mention consultants’ or analyst firms’ projections about how many quantitative analysts or data scientists will be needed in our society and conclude that it will be incredibly difficult to find them.\n\nI always thought th...,What Data Scientist Shortage? Get Serious and Get Talent
1,"Within soccer’s nascent analytics movement, one metric dominates most discussions. It’s called Expected Goals or xG. Models for calculating xG differ, but the underlying concept is the same. In a nutshell, xG takes a shot’s characteristics – distance from goal, angle from goal, root cause, etc. – and assigns a probability that said shot will result in a goal. Accounting for these probabilities reveals which team creates better scoring opportunities. Given a season of data, xG analysis is a p...","xG, Soccer Analytics of Bundesliga in R"
2,"The company’s adjacent market opportunities are growing at a CAGR of 18% for the next five years.\n\nQualcomm (NASDAQ: QCOM) announced a few days ago that its subsidiary Qualcomm Technologies will offer OEMs its first machine learning SDK for running their own neural network models on devices powered by Snapdragon 820 SoCs. The devices include smartphones, cars and drones among many others. Gary Brotman, director of product management, Qualcomm Technologies, said:\n\nWith the introduction of...",Qualcomm: Taking Artificial Intelligence To A New Level
3,"How Web, Tech Companies Use GPUs to Put Deep Learning at Your Fingertips\n\nGPUs have helped researchers spark a deep-learning revolution that’s given computers super-human capabilities.\n\nThey’ve already enabled breakthrough results on the industry-standard ImageNet benchmark. They’re powering Facebook’s “Big Sur” deep learning computing platform. They’re also accelerating major advances in deep learning across a broad range of fields.\n\nGPUs have become the go-to technology for training ...",How Companies Use GPUs to Put Deep Learning at Your Fingertips
4,"White House technology policy adviser Kristen Honey urged government and industry IT leaders to support the open data movement and showcase their work at two upcoming data innovation events.\n\nSpeaking Wednesday to a standing-room-only audience at the annual Data Innovation Summit in Washington, Honey highlighted a number of the administration’s open data initiatives, dating back to 2009, that are leading to innovative advances in medicine, agriculture, energy, transportation and education....",White House official urges IT leaders to join open data efforts


In [200]:
#Info
articles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1418 entries, 0 to 1417
Data columns (total 2 columns):
text     1418 non-null object
title    1418 non-null object
dtypes: object(2)
memory usage: 22.2+ KB


In [58]:
#Initialize tfidf with max_df = 0.3, min_df = 0.05, and the lemma analyzer
#Note: stop_words has no effect when analyzer is a callable

tfidf = TfidfVectorizer(stop_words="english", max_df=0.3, min_df=0.05,
                        analyzer=word_tokenize_lemma)

#Fit and transform the text using the tfidf vectorizer
text = articles.text
dtm = tfidf.fit_transform(text)

#Assign tokens to features
features = tfidf.get_feature_names()

print (len(features))

1383




In [59]:
#Create a dataframe of features and their idf scores
idfscores = pd.DataFrame()
idfscores["tokens"] = features
idfscores["scores"] = tfidf.idf_



In [60]:
#Top ten tokens with the highest idf scores (the rarest tokens across documents)
idfscores.sort_values(by="scores", ascending=False).head(10)

Unnamed: 0,tokens,scores
584,fail,3.981042
322,challenging,3.981042
1108,road,3.981042
303,camera,3.981042
209,analyzed,3.981042
1220,stock,3.981042
548,established,3.981042
23,60,3.981042
905,numerous,3.981042
292,bringing,3.981042


In [61]:
#Top ten tokens with the lowest idf scores (the most common tokens across documents)
idfscores.sort_values(by="scores", ascending=True).head(10)

Unnamed: 0,tokens,scores
778,large,2.207974
164,across,2.207974
70,Google,2.210335
248,available,2.215075
988,possible,2.215075
432,day,2.215075
568,experience,2.222226
131,They,2.222226
1267,think,2.222226
1308,type,2.224621
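
These scores can be checked by hand. With its default smooth_idf=True setting, TfidfVectorizer computes idf as ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the token. The sketch below assumes that default and uses n = 1418 from articles.info(). The document frequencies 71 and 423 are values picked because they reproduce the scores in the tables, not numbers read from the data.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """idf with add-one smoothing: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 1418

# min_df=.05 keeps tokens found in at least 5% of documents (71 of 1418)
print(round(smooth_idf(n_docs, 71), 6))
# A token found in 423 documents, which is under the max_df=0.3 cap
print(round(smooth_idf(n_docs, 423), 6))
```

Under that assumption, the ten-way tie at the top of the first table is what you would expect from tokens sitting right at the min_df cutoff.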


Let's write our summarizer function, which will randomly select an article to summarize. By summarize, I mean show the five words with the highest tfidf values.

In [62]:
def summarize():
    #Randomly choose index value
    index = np.random.choice(articles.index, 1)[0]
    article = text.iloc[index]
    # create a dictionary of words and their TF-IDF scores
    # (words are lowercased here, so only lowercase entries in features can match)
    word_scores = {}
    for word in TextBlob(article).words:
        word = word.lower()
        if word in features:
            word_scores[word] = dtm[index, features.index(word)]

    # print words with the top 5 TF-IDF scores
    print ('TOP SCORING WORDS:')
    top_scores = sorted(word_scores.items(), key=lambda x: x[1], reverse=True)[:5]
    for word, score in top_scores:
        print (word)

    #Print title of article
    print ("\n", articles.title[index])

    #Print the text of article
    #print (article)

In [64]:
#Give it a go
summarize()

TOP SCORING WORDS:
emerging
risk
cluster
claim
approach

 Key tools of Big Data for Transformation: Review & Case Study
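
One possible refinement of the lookup inside summarize, shown on toy data: build a token-to-column dictionary once instead of scanning the features list for every word, and pick the top scores with heapq. The names top_terms, vocab_index, and row_scores are hypothetical, and the scores are invented.

```python
import heapq

def top_terms(doc_tokens, vocab_index, row_scores, k=5):
    """Return the k tokens of one document with the highest scores."""
    scores = {}
    for tok in doc_tokens:
        col = vocab_index.get(tok)      # None when the token is not in the vocabulary
        if col is not None:
            scores[tok] = row_scores[col]
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

features = ["Google", "cluster", "risk", "claim"]
vocab_index = {tok: i for i, tok in enumerate(features)}   # built once
row_scores = [0.10, 0.42, 0.55, 0.31]                      # made-up tfidf row
doc_tokens = ["risk", "cluster", "Google", "unknown", "risk"]

print(top_terms(doc_tokens, vocab_index, row_scores, k=2))
```

If the tokens passed in come from the same analyzer that built the vocabulary, capitalized entries such as Google can be matched as well.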


## Text Similarity with Cosine Similarity and Clustering

### Cosine Similarity

![ew](https://i2.wp.com/dataaspirant.com/wp-content/uploads/2015/04/cosine.png?w=697)
<br><br>
" Cosine similarity metric finds the normalized dot product of the two attributes. By determining the cosine similarity, we would effectively try to find the cosine of the angle between the two objects. The cosine of 0° is 1, and it is less than 1 for any other angle.

It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.

Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0,1]. One of the reasons for the popularity of cosine similarity is that it is very efficient to evaluate, especially for sparse vectors."
<br>
Source: [Dataaspirant](http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/)

In [65]:
#Diy cosine similarity function

def square_rooted(x):

    return round(np.sqrt(sum([a*a for a in x])),3)
 
def cosine_similarity_function(x,y):

    numerator = sum(a*b for a,b in zip(x,y))
    denominator = square_rooted(x)*square_rooted(y)
    return round(numerator/float(denominator),3)
 
vec1 = [3, 45, 7, 2]
vec2 = [2, 54, 13, 15]
cosine_similarity_function(vec1, vec2)

0.972
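
The properties described in the quote can be checked directly. A small sketch using only the standard library; unlike the function above it does not round the intermediate lengths, and the test vectors are made up:

```python
import math

def cosine(x, y):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

v = [3, 45, 7, 2]

print(cosine(v, [10 * a for a in v]))   # same direction, ten times longer
print(cosine([1, 0], [0, 1]))           # perpendicular
print(cosine([1, 2], [-1, -2]))         # opposite direction
```

Up to floating point error, the three results are 1, 0, and -1. This is why a long article and a short one on the same topic can still score as very similar.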

Next, derive a matrix of similarities between all of the data science articles.

In [84]:
#Calculate cosine similarity for each pair of documents (despite its name, dist holds similarities, not distances)
dist = cosine_similarity(dtm.toarray())

In [85]:
#make it a dataframe
dist_df = pd.DataFrame(dist)

#Preview the first rows
dist_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1408,1409,1410,1411,1412,1413,1414,1415,1416,1417
0,1.0,0.113929,0.070045,0.079961,0.116752,0.15406,0.112959,0.060669,0.084572,0.322649,...,0.164872,0.142885,0.09552,0.194064,0.176607,0.08631,0.153051,0.025908,0.193765,0.138436
1,0.113929,1.0,0.076533,0.047124,0.058152,0.0777,0.068518,0.036861,0.038336,0.083151,...,0.108322,0.09662,0.079264,0.088891,0.094266,0.090885,0.05013,0.021027,0.055269,0.041649
2,0.070045,0.076533,1.0,0.102324,0.078817,0.097378,0.21731,0.049943,0.149809,0.131007,...,0.198051,0.024482,0.051428,0.094567,0.093622,0.074362,0.065025,0.146032,0.201584,0.073423
3,0.079961,0.047124,0.102324,1.0,0.038511,0.149229,0.035401,0.049433,0.140992,0.079774,...,0.045,0.073937,0.150466,0.128223,0.100471,0.083198,0.169101,0.054842,0.0899,0.037017
4,0.116752,0.058152,0.078817,0.038511,1.0,0.048237,0.087655,0.110262,0.034083,0.122892,...,0.043802,0.057226,0.03645,0.062074,0.142089,0.114308,0.02185,0.025359,0.07938,0.045385


Let's compare some articles!

In [68]:
#Index position of article
index = 239

In [69]:
#Assign titles column to titles variable

titles = articles.title



#Print title
print (titles[index])

#print article

print ("\n ************************************************ \n", text[index])

10 Popular TV Shows on Data Science and Artificial Intelligence

 ************************************************ 
 Introduction

The development of full artificial intelligence could spell the end of human race. – Stephen Hawking

The world is now rapidly moving towards achieving this finest technology breakthrough ever. It is expected that AI would enrich humans with more power and opportunities. Another group of people (including Stephen Hawking and Elon Musk) believe that this might lead to human destruction (if not handled carefully).

I think, it’s too early for us to envisage such uncertain future. Good news is, companies like Google, Microsoft, Baidu have already started creating products based on AI. It won’t be long enough to experience the influence of AI in our daily lives.

Accidentally, my exploration of AI started with movie ‘Her’. The influence was so powerful that I ended up creating an infographic on 10 Movies on Data Science and Machine Learning. May be a ~ 2 hours

We need to take the index value and use it to grab the column of similarity scores between every article and the one at index 239.

In [70]:
#Select the column of similarity scores for the chosen article
dist_column = dist_df[index]

In [71]:
#Get the index values of the 5 most similar articles (the first result is the article itself, so skip it)

closest_index = dist_column.nlargest(6).index[1:].tolist()

In [72]:
#Pass index values into titles and print them

for i in titles.iloc[closest_index].tolist():
    print (i)

10 Must Watch Movies on Data Science and Machine Learning
10 Must Watch Movies on Data Science and Machine Learning
Why algorithms will be at the core of our AI-powered future, and why you should care
Artificial Intelligence in business to deliver positive experiences
Netflix Changed Its Rating System for AI Purposes
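
The nlargest(6).index[1:] step relies on the article itself being ranked first. When the corpus contains duplicates, as the repeated title above suggests, a copy may tie with the article at the top, and skipping position 0 is then no longer guaranteed to drop the right row. A sketch that leaves the article out by index instead (most_similar and the scores are hypothetical):

```python
import heapq

def most_similar(sim_row, own_index, k=5):
    """Indices of the k highest scores in sim_row, leaving out own_index."""
    candidates = [(score, i) for i, score in enumerate(sim_row) if i != own_index]
    return [i for score, i in heapq.nlargest(k, candidates)]

# Made-up similarity row for article 0; article 3 is an exact duplicate of it
sim_row = [1.0, 0.11, 0.07, 1.0, 0.32, 0.15]
print(most_similar(sim_row, own_index=0, k=3))
```

Dropping duplicate rows before vectorizing would be another way to keep repeated titles out of the results.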


In [222]:
#Pass index values into text to view the articles without printing them
text.iloc[closest_index]

1116    Introduction\n\nSome members of our team (including me) live by just 2 passions in life – Data Science & Movies! For us, slicing and dicing movies over Monday morning coffee is part of warming up ritual.\n\nSo, we decided to do a poll among ourselves on the best movies related to data science and machine learning. We also thought that we would release the outcome of the results in form of an infographic.\n\nNeedless to say there were heated debates and a few disappointed faces in our office!...
1157    Introduction\n\nSome members of our team (including me) live by just 2 passions in life – Data Science & Movies! For us, slicing and dicing movies over Monday morning coffee is part of warming up ritual.\n\nSo, we decided to do a poll among ourselves on the best movies related to data science and machine learning. We also thought that we would release the outcome of the results in form of an infographic.\n\nNeedless to say there were heated debates and a few disappointed faces in

### Clustering

It is standard practice to cluster on tfidf data rather than count-vectorized data.
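
One reason tfidf pairs well with k-means: TfidfVectorizer scales every row to unit length by default (norm='l2'), and for unit vectors the squared Euclidean distance equals 2 - 2 * cosine similarity. K-means on such rows therefore groups documents by direction, much as cosine similarity does. A small check of that identity on made-up vectors:

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(a * a for a in v))
    return [a / length for a in v]

# Hypothetical term weights for two documents
a = l2_normalize([3.0, 0.0, 1.0, 2.0])
b = l2_normalize([1.0, 4.0, 0.0, 2.0])

cos_sim = sum(x * y for x, y in zip(a, b))           # vectors are already unit length
sq_euclid = sum((x - y) ** 2 for x, y in zip(a, b))

print(sq_euclid, 2 - 2 * cos_sim)
```

The two printed numbers agree up to floating point error.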

In [73]:
#Initialize clustering algorithm with 4 clusters and fit it on dtm

km4 = KMeans(n_clusters=4)
#Fit algorithm
km4.fit(dtm)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=4, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [74]:
#Check out silhouette score
silhouette_score(dtm, km4.labels_)

0.015203943655105008
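
For context on that number: the silhouette coefficient of a point compares a, its mean distance to the other points in its own cluster, with b, its mean distance to the points of the nearest other cluster, as (b - a) / max(a, b). It ranges from -1 to 1, and values near 0 suggest overlapping clusters. A toy implementation on four made-up points, assuming Euclidean distance (silhouette is a hypothetical helper, not the scikit-learn function):

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    scores = []
    for i, p in enumerate(points):
        # Group distances from point i to every other point by cluster label
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster or single cluster: treated as 0 here
            continue
        a = sum(own) / len(own)                                # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())  # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette(points, [0, 0, 1, 1]))   # tight, well-separated clusters
print(silhouette(points, [0, 1, 0, 1]))   # clusters that cut across the groups
```

The first labeling scores close to 1 and the second comes out negative. Text data in a high-dimensional tfidf space often produces low scores, so a value near zero is a prompt to inspect the clusters rather than proof that they are useless.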

In [75]:
#Assign labels to articles dataframe 

articles["cluster"] = km4.labels_

Print 5 randomly selected headlines from each cluster

In [76]:
#Cluster 0
for i in articles[articles.cluster == 0].sample(n=5).title.tolist():
    print (i)

Which machine learning algorithm should I use?
Research Blog: Zero-Shot Translation with Google’s Multilingual Neural Machine Translation System
Bayesian Statistics Explained in Simple English For Beginners
Eight (not 10) things an R user will find frustrating when trying to learn Python
May the Force of R be With You, Always!


In [78]:
#Cluster 1
for i in articles[articles.cluster == 1].sample(n=5).title.tolist():
    print (i)

IBM Watson IoT and Its Integration with Blockchain
A methodology for solving problems with DataScience for Internet of Things
Detailed Overview – symbIoTe
IoT’s killer app is home security
Samsung Will Invest $1.2 Billion Into US For 'Internet Of Things'


In [79]:
#Cluster 2
for i in articles[articles.cluster == 2].sample(n=5).title.tolist():
    print (i)

Germany most likely to win Euro 2016
Google's TensorFlow opens its AI doors to iOS developers
Jerry Kaplan: "Making Machine Learning Great Again" | Talks at Google
Rise of the robots: What advances mean for workers
10 Real-World Examples Of Machine Learning And AI You Can Use Today


In [80]:
#Cluster 3
for i in articles[articles.cluster == 3].sample(n=5).title.tolist():
    print (i)

Silicon Valley's Women In Data Science Creating New Opportunities For All Women In Startups And STEM
How AI and machine learning can help solve IT's data management problem
How mixed reality and machine learning are driving innovation in farming
Real-time drives AI uptake
How will AI shape the workforce of the future?


What do you think the clusters are? Are they easy to decipher? Ignoring the silhouette score, do they pass the eye test?

Let's examine the top words of each cluster

In [81]:
print("Top terms per cluster:")
#Sort the feature indices of each centroid from highest to lowest weight
order_centroids = km4.cluster_centers_.argsort()[:, ::-1]
terms = tfidf.get_feature_names()
for i in range(4):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print ("\n")

Top terms per cluster:
Cluster 0:
 R
 Python
 Spark
 function
 variable
 code
 2
 x
 class
 Learning


Cluster 1:
 IoT
 device
 Internet
 Things
 sensor
 connected
 security
 's
 platform
 home


Cluster 2:
 's
 Google
 —
 said
 deep
 neural
 image
 he
 –
 researcher


Cluster 3:
 analytics
 –
 scientist
 organization
 job
 platform
 insight
 marketing
 Big
 skill




Let's try this exercise again, but this time we'll cluster the rows of the cosine similarity matrix.

In [86]:
#Initialize clustering algorithm with 4 clusters
km4 = KMeans(n_clusters=4)

#fit it on dist array

km4.fit(dist)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=4, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [87]:
#Check out silhouette score
silhouette_score(dist, km4.labels_)

0.09926344148978795

Print 5 randomly selected headlines from each cluster

In [88]:
#Assign new labels to data frame

articles["cluster_dist"] = km4.labels_

In [89]:
#Cluster 0
for i in articles[articles.cluster_dist == 0].sample(n=5).title.tolist():
    print (i)

Four Artificial Intelligence Challenges Facing The Industrial IoT
10 Real World Applications of Internet of Things (IoT) – Explained in Videos
Machine learning is leading the way to a smarter internet of things
Implementation of Machine Learning in IoT - Tech-Talk by Aashish Kalra
Does IoT Need Machine Learning To Succeed?


In [90]:
#Cluster 1
for i in articles[articles.cluster_dist == 1].sample(n=5).title.tolist():
    print (i)

How Some of the Top Manufacturing Companies are Using AI
The Tech Predictions for 2017 Series
How AI will disrupt the classroom
Turbocharge productivity and efficiency with AI
The world told 6 guys no one would buy their software: Four years later they have a $940 million company


In [91]:
#Cluster 2
for i in articles[articles.cluster_dist == 2].sample(n=5).title.tolist():
    print (i)

Chatbots and The Future of Marketing
Intel Uses Chips Acquired in Deal With Startup to Drive Deep Learning
Governments are the Tip of the Open Data Iceberg – Datazar Blog
What Is The State Of Artificial Intelligence In China?
The big data explosion sets us profound challenges - how can we keep up?


In [92]:
#Cluster 3
for i in articles[articles.cluster_dist == 3].sample(n=5).title.tolist():
    print (i)

Time Series Prediction With Deep Learning in Keras
What Are The Differences Between AI, Machine Learning, NLP, And Deep Learning?
The Semantic Representation of Pure Mathematics—Wolfram Blog
The 10 Algorithms Machine Learning Engineers Need to Know
How do Chatbots work? A Guide to the Chatbot Architecture


Are the results better?

# Resources


My fake news classifier article: https://opendatascience.com/blog/how-to-build-a-fake-news-classification-model/
<br>
My data science topic modeling article: https://opendatascience.com/blog/how-to-analyze-articles-about-data-science-using-data-science/
<br><br>
**Regular Expressions**
- https://www.dataquest.io/blog/regular-expressions-data-scientists/
- https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial
- https://www.oreilly.com/ideas/an-introduction-to-regular-expressions


**NLP Tutorials**

- https://github.com/bonzanini/nlp-tutorial
- https://github.com/totalgood/pycon-2016-nlp-tutorial

**Text similarity:**
- https://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/
- http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/
- http://billchambers.me/tutorials/2014/12/22/cosine-similarity-explained-in-python.html
- Explains why text similarity uses cosine similarity -> https://www.quora.com/What-are-the-mechanics-of-cosine-similarity-in-natural-language-processing

**Text classification:**
- Another fake news tutorial -> https://www.datacamp.com/community/tutorials/scikit-learn-fake-news
- http://nlpforhackers.io/text-classification/
- http://zacstewart.com/2015/04/28/document-classification-with-scikit-learn.html
- https://github.com/javedsha/text-classification
- https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a
- https://bbengfort.github.io/tutorials/2016/05/19/text-classification-nltk-sckit-learn.html


**Text clustering:**

- Great tutorial -> http://brandonrose.org/clustering
- http://nlpforhackers.io/recipe-text-clustering/
- https://pythonprogramminglanguage.com/kmeans-text-clustering/
- http://mccormickml.com/2015/08/05/document-clustering-example-in-scikit-learn/


**Word Embeddings/Word2Vec**

- https://chatbotsmagazine.com/introduction-to-word-embeddings-55734fd7068a
- https://www.springboard.com/blog/introduction-word-embeddings/
- http://ruder.io/word-embeddings-1/
- https://www.slideshare.net/BhaskarMitra3/a-simple-introduction-to-word-embeddings
- https://github.com/fastai/word-embeddings-workshop


**Topic Modeling**

- http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
- https://blog.bigml.com/2016/11/16/introduction-to-topic-models/
- http://nbviewer.jupyter.org/github/ogrisel/notebooks/blob/master/nmf_topics.ipynb?create=1
- https://www.youtube.com/watch?v=ZgyA1Q2ywbM
- https://www.youtube.com/watch?v=SjRss8Uk6mQ
- https://github.com/derekgreene/topic-model-tutorial

# Lab time

Pick a text dataset to spend the rest of class working on. There are four other datasets in the NLP_data folder that you can work with: pitchfork album reviews, fake/real news, deadspin, and political lean. Make sure to unzip political lean or fake news first. You can also continue to work with the datasets we've already used (data science, yelp, spam).

<br>

For the rest of class apply supervised or unsupervised learning techniques to the dataset of your choice. 

- Build a model that can differentiate between good/bad reviews, real/fake news, or liberal/conservative leaning.

- Build a model that predicts how many page views a deadspin article can get based on its headline and tags.

- Ignore the labels and attempt to cluster the articles.

- Have fun with the summarizer!!

<br>

Be prepared to share your results at the end of class.
