__Lab 06 - Text analysis__

 __Table of contents__

1. [Representing text as numerical data](#Representing-text-as-numerical-data)
    1. [Example 1 - learn a small vocabulary](#learn-a-small-vocabulary-Example1)
1. [Case study - text message analysis](#Case-study-text-message-analysis)
    1. [Classify with multinomial naive bayes](#Classify-with-multinomial-naive-bayes)
    1. [Classify with logistic regression](#Classify-with-logistic-regression)
1. [Parameter tuning w/ CountVectorizer](#Parameter-tuning-w/-CountVectorizer)

In [2]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set(style = 'whitegrid', font_scale = 1.3)


<a id = "Representing-text-as-numerical-data"></a>

# Representing text as numerical data

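A model cannot work with raw text, so text must first be converted to numbers. Tokenization splits each document into tokens (here, words), and counting those tokens gives one numerical feature per vocabulary term.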

<a id = "learn-a-small-vocabulary-Example1"></a>

## Example - learn a small vocabulary

This example takes two steps:
- Tokenize the vocabulary learned from a small set of training data
- Transform a test string based on the training vocabulary

In [None]:
# Load data

simpleTrain = ['call you tonight','Call me a cab','please call me... PLEASE!']
vect = CountVectorizer()

# learn the 'vocabulary' of the training data
vect.fit(simpleTrain)


In [None]:
# Inspect the learned vocabulary

vect.get_feature_names()


In [None]:
# Represent each sample in DataFrame

simpleTrainDtm = vect.transform(simpleTrain)
simpleTrainDtm.toarray()

pd.DataFrame(simpleTrainDtm.toarray(), columns = vect.get_feature_names())


In [None]:
# Tokenize test data string

simpleTest = ["please don't call me"]

simpleTestDtm = vect.transform(simpleTest)
simpleTestDtm.toarray()

pd.DataFrame(simpleTestDtm.toarray(), columns = vect.get_feature_names())


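> Notice that "don't" contributes nothing to the counts. The default token pattern splits it into "don" and "t", drops the single-character "t", and ignores "don" because it is not in the learned vocabulary.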
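To make the fit and transform steps concrete, here is a minimal pure-Python sketch of the same idea. It assumes the default token pattern shown in the `CountVectorizer` parameters later in this lab (`(?u)\b\w\w+\b`) and lowercasing. The helper names `tokenize`, `learn_vocabulary`, and `count_tokens` are hypothetical and are not part of scikit-learn.

```python
import re
from collections import Counter

# Default CountVectorizer token pattern: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and extract tokens of two or more characters."""
    return TOKEN_PATTERN.findall(text.lower())

def learn_vocabulary(documents):
    """Return the sorted list of unique tokens seen in the documents."""
    return sorted({token for doc in documents for token in tokenize(doc)})

def count_tokens(document, vocabulary):
    """Count vocabulary tokens in one document; unknown tokens are ignored."""
    counts = Counter(tokenize(document))
    return [counts[token] for token in vocabulary]

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
vocabulary = learn_vocabulary(simple_train)
test_tokens = tokenize("please don't call me")
test_row = count_tokens("please don't call me", vocabulary)

print(vocabulary)   # ['cab', 'call', 'me', 'please', 'tonight', 'you']
print(test_tokens)  # ['please', 'don', 'call', 'me']
print(test_row)     # [0, 1, 1, 1, 0, 0]
```

The token "don" appears in the test string but has no column in the vocabulary, so it never reaches the output row.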

<a id = "Case-study-text-message-analysis"></a>

# Case study - text message analysis - SPAM or not?

Build a classifier to determine whether an SMS text message is spam or not.

In [None]:
# Load data

url = 'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv'
sms = pd.read_table(url, header = None, names = ['label', 'message'])
sms['labelNum'] = sms.label.map({'ham' : 0, 'spam' : 1})


In [None]:
# Inspect

sms.shape


In [44]:
sms.head()


| | label | message | labelNum |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |


In [45]:
sms.label.value_counts()


ham     4825
spam     747
Name: label, dtype: int64

In [46]:
X = sms['message']
y = sms['labelNum']
print(X.shape)
print(y.shape)


(5572,)
(5572,)


In [None]:
# Train/test split

xTrain, xTest, yTrain, yTest = train_test_split(X, y, random_state = 1)
print(xTrain.shape)
print(xTest.shape)
print(yTrain.shape)
print(yTest.shape)


In [None]:
# Learn the vocabulary - Vectorize the SMS dataset

vect = CountVectorizer()
vect.fit(xTrain)
xTrainDtm = vect.transform(xTrain)
pd.DataFrame(xTrainDtm.toarray(), columns = vect.get_feature_names())[:7]


In [None]:
# Transform test set based on learned vocabulary

xTestDtm = vect.transform(xTest)
pd.DataFrame(xTestDtm.toarray(), columns = vect.get_feature_names())[:7]


<a id = "Classify-with-multinomial-naive-bayes"></a>

## Classify with multinomial naive bayes

In [50]:
nb = MultinomialNB()
nb.fit(xTrainDtm, yTrain)


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [1]:
# Test set predictions

yPredClass = nb.predict(xTestDtm)



In [None]:
# Evaluate predictions

metrics.accuracy_score(yTest, yPredClass)


In [53]:
metrics.confusion_matrix(yTest, yPredClass)


array([[1203,    5],
       [  11,  174]])
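The confusion matrix above can be unpacked by hand. This sketch assumes scikit-learn's layout, where rows are true classes and columns are predicted classes, so the entries read `[[TN, FP], [FN, TP]]` with spam as the positive class. The variable names are illustrative.

```python
# Confusion matrix copied from the output above: rows = true, columns = predicted.
confusion = [[1203, 5],
             [11, 174]]

(tn, fp), (fn, tp) = confusion
total = tn + fp + fn + tp

accuracy = (tp + tn) / total   # share of all messages classified correctly
precision = tp / (tp + fp)     # share of predicted spam that really is spam
recall = tp / (tp + fn)        # share of actual spam that was caught

print(total)
print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```

The 5 false positives and 11 false negatives here are the same messages listed in the two cells that follow.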

In [54]:
# print message for false positives (ham incorrectly labeled spam)

xTest[yTest < yPredClass]


574               Waiting for your call.
3375             Also andros ice etc etc
45      No calls..messages..missed calls
3415             No pic. Please re-send.
1988    No calls..messages..missed calls
Name: message, dtype: object

In [55]:
# print message for false negatives (spam incorrectly labeled ham)

xTest[yTest > yPredClass]


3132    LookAtMe!: Thanks for your purchase of a video...
5       FreeMsg Hey there darling it's been 3 week's n...
3530    Xmas & New Years Eve tickets are now on sale f...
684     Hi I'm sue. I am 20 years old and work as a la...
1875    Would you like to see my XXX pics they are so ...
1893    CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...
4298    thesmszone.com lets you send free anonymous an...
4949    Hi this is Amy, we will be sending you a free ...
2821    INTERFLORA - It's not too late to order Inter...
2247    Hi ya babe x u 4goten bout me?' scammers getti...
4514    Money i have won wining number 946 wot do i do...
Name: message, dtype: object

In [56]:
xTest[2247]


"Hi ya babe x u 4goten bout me?' scammers getting smart..Though this is a regular vodafone no, if you respond you get further prem rate msg/subscription. Other nos used also. Beware!"

In [57]:
# calc predicted probabilities for xTestDtm

yPredProb = nb.predict_proba(xTestDtm)[:,1]
yPredProb


array([2.87744864e-03, 1.83488846e-05, 2.07301295e-03, ...,
       1.09026171e-06, 1.00000000e+00, 3.98279868e-09])

In [58]:
# Calculate area under the curve score

metrics.roc_auc_score(yTest, yPredProb)


0.9866431000536962
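One way to read this score: the area under the ROC curve equals the probability that a randomly chosen spam message receives a higher predicted probability than a randomly chosen ham message. The sketch below computes that quantity directly by comparing every positive/negative pair. The function `pairwise_auc` and the four-message data set are hypothetical and only illustrate the definition; they are not how scikit-learn computes the score.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (1 = spam) and predicted spam probabilities.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)  # 0.75: three of the four spam/ham pairs are ordered correctly
```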

> Remarks - 

__Evaluate internal probabilities__

In [72]:
xTrainTokens = vect.get_feature_names()
len(xTrainTokens)


7456

In [73]:
# examine first fifty tokens

print(xTrainTokens[0:50])


['00', '000', '008704050406', '0121', '01223585236', '01223585334', '0125698789', '02', '0207', '02072069400', '02073162414', '02085076972', '021', '03', '04', '0430', '05', '050703', '0578', '06', '07', '07008009200', '07090201529', '07090298926', '07123456789', '07732584351', '07734396839', '07742676969', '0776xxxxxxx', '07781482378', '07786200117', '078', '07801543489', '07808', '07808247860', '07808726822', '07815296484', '07821230901', '07880867867', '0789xxxxxxx', '07946746291', '0796xxxxxx', '07973788240', '07xxxxxxxxx', '08', '0800', '08000407165', '08000776320', '08000839402', '08000930705']


In [74]:
# examine last fifty tokens

print(xTrainTokens[-50:])


['yer', 'yes', 'yest', 'yesterday', 'yet', 'yetunde', 'yijue', 'ym', 'ymca', 'yo', 'yoga', 'yogasana', 'yor', 'yorge', 'you', 'youdoing', 'youi', 'youphone', 'your', 'youre', 'yourjob', 'yours', 'yourself', 'youwanna', 'yowifes', 'yoyyooo', 'yr', 'yrs', 'ything', 'yummmm', 'yummy', 'yun', 'yunny', 'yuo', 'yuou', 'yup', 'zac', 'zaher', 'zealand', 'zebra', 'zed', 'zeros', 'zhong', 'zindgi', 'zoe', 'zoom', 'zouk', 'zyada', 'èn', '〨ud']


In [75]:
# rows = classes, columns = tokens

nb.feature_count_


array([[ 0.,  0.,  0., ...,  1.,  1.,  1.],
       [ 5., 23.,  2., ...,  0.,  0.,  0.]])

In [76]:
# number of times each token appears in each type of message

hamTokenCount = nb.feature_count_[0,:]
spamTokenCount = nb.feature_count_[1,:]

tokens = pd.DataFrame({'token' : xTrainTokens, 'ham' : hamTokenCount, 'spam' : spamTokenCount}).set_index('token')
tokens[:7]

| token | ham | spam |
|---|---|---|
| 00 | 0.0 | 5.0 |
| 000 | 0.0 | 23.0 |
| 008704050406 | 0.0 | 2.0 |
| 0121 | 0.0 | 1.0 |
| 01223585236 | 0.0 | 1.0 |
| 01223585334 | 0.0 | 2.0 |
| 0125698789 | 1.0 | 0.0 |


In [77]:
tokens.sample(10, random_state = 9)


| token | ham | spam |
|---|---|---|
| gautham | 1.0 | 0.0 |
| home | 116.0 | 2.0 |
| webeburnin | 0.0 | 1.0 |
| report | 6.0 | 0.0 |
| 30 | 10.0 | 2.0 |
| accordin | 1.0 | 0.0 |
| toot | 2.0 | 0.0 |
| village | 0.0 | 2.0 |
| mokka | 2.0 | 0.0 |
| roommates | 2.0 | 0.0 |


In [78]:
# add 1 to each token count to avoid div by 0

tokens['ham'] = tokens['ham'] + 1
tokens['spam'] = tokens['spam'] + 1
tokens.sample(10, random_state = 9)


| token | ham | spam |
|---|---|---|
| gautham | 2.0 | 1.0 |
| home | 117.0 | 3.0 |
| webeburnin | 1.0 | 2.0 |
| report | 7.0 | 1.0 |
| 30 | 11.0 | 3.0 |
| accordin | 2.0 | 1.0 |
| toot | 3.0 | 1.0 |
| village | 1.0 | 3.0 |
| mokka | 3.0 | 1.0 |
| roommates | 3.0 | 1.0 |


In [79]:
# convert ham and spam counts into frequencies
# divide the number of times a token appears in a class by the number of messages in that class
# these frequencies approximate the conditional probabilities the classifier uses to assign a class

tokens['ham'] = tokens['ham'] / nb.class_count_[0] 
tokens['spam'] = tokens['spam'] / nb.class_count_[1] 
tokens.sample(10, random_state = 9)


| token | ham | spam |
|---|---|---|
| gautham | 0.000553 | 0.001779 |
| home | 0.032347 | 0.005338 |
| webeburnin | 0.000276 | 0.003559 |
| report | 0.001935 | 0.001779 |
| 30 | 0.003041 | 0.005338 |
| accordin | 0.000553 | 0.001779 |
| toot | 0.000829 | 0.001779 |
| village | 0.000276 | 0.005338 |
| mokka | 0.000829 | 0.001779 |
| roommates | 0.000829 | 0.001779 |


In [83]:
# add spam-to-ham ratio

tokens['spam_ratio'] = tokens['spam'] / tokens['ham']
tokens.sample(10, random_state = 9)


| token | ham | spam | spam_ratio |
|---|---|---|---|
| gautham | 0.000553 | 0.001779 | 3.217972 |
| home | 0.032347 | 0.005338 | 0.165024 |
| webeburnin | 0.000276 | 0.003559 | 12.871886 |
| report | 0.001935 | 0.001779 | 0.91942 |
| 30 | 0.003041 | 0.005338 | 1.755257 |
| accordin | 0.000553 | 0.001779 | 3.217972 |
| toot | 0.000829 | 0.001779 | 2.145314 |
| village | 0.000276 | 0.005338 | 19.307829 |
| mokka | 0.000829 | 0.001779 | 2.145314 |
| roommates | 0.000829 | 0.001779 | 2.145314 |
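The three steps above (add one, divide by class size, take the ratio) can be checked by hand for a single token. In this sketch the class sizes of 3617 ham and 562 spam training messages are an assumption: they are not printed anywhere in this lab and were back-calculated from the displayed frequencies (for example, 1 / 0.000276 is roughly 3617). The raw counts for "home" and "village" come from the tables above. The function names are hypothetical.

```python
def smoothed_frequency(token_count, class_size):
    """Add one to the raw count, then divide by the number of messages in the class."""
    return (token_count + 1) / class_size

def spam_ratio(ham_count, spam_count, n_ham, n_spam):
    """Ratio of the smoothed spam frequency to the smoothed ham frequency."""
    return smoothed_frequency(spam_count, n_spam) / smoothed_frequency(ham_count, n_ham)

# Assumed class sizes, inferred from the frequencies shown above (not stated in the lab).
N_HAM, N_SPAM = 3617, 562

home_ratio = spam_ratio(116, 2, N_HAM, N_SPAM)     # "home": 116 ham, 2 spam
village_ratio = spam_ratio(0, 2, N_HAM, N_SPAM)    # "village": 0 ham, 2 spam

print(round(home_ratio, 6))
print(round(village_ratio, 6))
```

A ratio well below 1 marks a token that is far more common in ham, and a ratio well above 1 marks a token that is far more common in spam. Without the add-one step, "village" would divide by zero.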


In [84]:
# sort by spam_ratio descending to see the 'spammiest' words

tokens.sort_values(['spam_ratio'], ascending = [False])[:10]


| token | ham | spam | spam_ratio |
|---|---|---|---|
| claim | 0.000276 | 0.158363 | 572.798932 |
| prize | 0.000276 | 0.135231 | 489.131673 |
| 150p | 0.000276 | 0.087189 | 315.36121 |
| tone | 0.000276 | 0.085409 | 308.925267 |
| guaranteed | 0.000276 | 0.076512 | 276.745552 |
| 18 | 0.000276 | 0.069395 | 251.001779 |
| cs | 0.000276 | 0.065836 | 238.129893 |
| www | 0.000553 | 0.129893 | 234.911922 |
| 1000 | 0.000276 | 0.05694 | 205.950178 |
| awarded | 0.000276 | 0.053381 | 193.078292 |


In [85]:
# sort by spam_ratio ascending to see the least 'spammy' words

tokens.sort_values(['spam_ratio'], ascending = [True])[:10]


| token | ham | spam | spam_ratio |
|---|---|---|---|
| gt | 0.064971 | 0.001779 | 0.027387 |
| lt | 0.064142 | 0.001779 | 0.027741 |
| he | 0.047 | 0.001779 | 0.037858 |
| she | 0.035665 | 0.001779 | 0.049891 |
| lor | 0.0329 | 0.001779 | 0.054084 |
| da | 0.0329 | 0.001779 | 0.054084 |
| later | 0.030688 | 0.001779 | 0.057981 |
| come | 0.048936 | 0.003559 | 0.072723 |
| too | 0.021841 | 0.001779 | 0.081468 |
| already | 0.01963 | 0.001779 | 0.090647 |


<a id = "Classify-with-logistic-regression"></a>

## Classify with logistic regression

In [35]:
logReg = LogisticRegression()


In [36]:
logReg.fit(xTrainDtm, yTrain)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [None]:
# Test set predictions

yPredClass = logReg.predict(xTestDtm)


In [None]:
# Evaluate predictions

metrics.accuracy_score(yTest, yPredClass)


In [40]:
yPredProb = logReg.predict_proba(xTestDtm)[:,1]
metrics.roc_auc_score(yTest, yPredProb)


0.9936817612314301

>Remarks - 

<a id = "Parameter-tuning-w/-CountVectorizer"></a>

# Parameter tuning with CountVectorizer


In [88]:
# show default params
vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [89]:
# remove English stop words (the built-in list is selected with the lowercase string 'english')

vect = CountVectorizer(stop_words = 'english')


In [90]:
# expand the scope of tokenization. a range of (1,1) makes tokens of single words
# a range of (1,2) expands the scope of tokenization so that each pair of adjacent words
# also becomes a token. this lets some context of word usage enter the model, but makes
# the document-term matrix larger

vect = CountVectorizer(ngram_range = (1,2))
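To see what `ngram_range = (1,2)` adds, the following pure-Python sketch builds unigrams and bigrams from one short message. It assumes the default lowercasing and token pattern, with n-grams formed from adjacent tokens joined by a space. The function `word_ngrams` is a hypothetical helper, not a scikit-learn function.

```python
import re

def word_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams of the text for every n in the inclusive range."""
    # Tokenize first, using the default pattern (two or more word characters).
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    low, high = ngram_range
    ngrams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens across the token list.
        for start in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[start:start + n]))
    return ngrams

features = word_ngrams('please call me')
print(features)  # ['please', 'call', 'me', 'please call', 'call me']
```

Three words produce five features here, which is why the document-term matrix grows quickly once bigrams are included.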


In [91]:
# ignore terms that appear in more than X% of the documents (here, more than 50%)

vect = CountVectorizer(max_df = 0.5)


In [93]:
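# only keep terms that appear in X or more documents (here, at least 2)
# an integer min_df is a document count; a float would be read as a proportion of documents

vect = CountVectorizer(min_df = 2)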
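Both `max_df` and `min_df` filter the vocabulary by document frequency, meaning the number of documents a term appears in. The sketch below applies that rule to the three training strings from the first example. It assumes the convention that a float threshold is a proportion of documents and an integer threshold is an absolute count. The helpers `document_frequency` and `filter_vocabulary` are hypothetical names, not scikit-learn functions.

```python
import re
from collections import Counter

def document_frequency(documents):
    """Count, for each token, how many documents contain it at least once."""
    df = Counter()
    for doc in documents:
        # A set ensures repeated tokens in one document are counted once.
        df.update(set(re.findall(r"(?u)\b\w\w+\b", doc.lower())))
    return df

def filter_vocabulary(documents, min_df=1, max_df=1.0):
    """Keep tokens whose document frequency lies between min_df and max_df."""
    n_docs = len(documents)
    # Floats are proportions of the corpus; integers are absolute document counts.
    min_count = min_df * n_docs if isinstance(min_df, float) else min_df
    max_count = max_df * n_docs if isinstance(max_df, float) else max_df
    df = document_frequency(documents)
    return sorted(t for t, count in df.items() if min_count <= count <= max_count)

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

common_removed = filter_vocabulary(simple_train, max_df=0.5)
rare_removed = filter_vocabulary(simple_train, min_df=2)

print(common_removed)  # ['cab', 'please', 'tonight', 'you']
print(rare_removed)    # ['call', 'me']
```

With only three documents, `max_df=0.5` removes "call" and "me" because each appears in more than half of them, while `min_df=2` keeps only those same two terms.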
