# TOC

__Lab 06 - Text analysis__
1. [Import](#Import)
1. [Representing text as numerical data](#Representing-text-as-numerical-data)
    1. [Example 1 - learn a small vocabulary](#learn-a-small-vocabulary-Example1)
1. [Case study - text message analysis](#Case-study-text-message-analysis)
    1. [Classify with multinomial naive bayes](#Classify-with-multinomial-naive-bayes)
    1. [Classify with logistic regression](#Classify-with-logistic-regression)
1. [Parameter tuning w/ CountVectorizer](#Parameter-tuning-w/-CountVectorizer)

# Import

<a id = "Import"></a>

In [2]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set(style = 'whitegrid', font_scale = 1.3)


# Representing text as numerical data

Text data can be represented as numerical data by tokenization

<a id = "Representing-text-as-numerical-data"></a>

## Example - learn a small vocabulary

Tokenizing text into numerical data takes two steps:
- Learn a vocabulary of tokens from a small set of training data
- Transform a test string into token counts based on the training vocabulary

<a id = "learn-a-small-vocabulary-Example1"></a>

In [3]:
# load data
simpleTrain = ['call you tonight','Call me a cab','please call me... PLEASE!']
vect = CountVectorizer()

# learn the 'vocabulary' of the training data
vect.fit(simpleTrain)


CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [4]:
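# inspect the learned vocabulary
# (scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2)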
vect.get_feature_names()


['cab', 'call', 'me', 'please', 'tonight', 'you']

In [5]:
# represent each sample in DataFrame
simpleTrainDtm = vect.transform(simpleTrain)
simpleTrainDtm.toarray()

pd.DataFrame(simpleTrainDtm.toarray(), columns = vect.get_feature_names())


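|   | cab | call | me | please | tonight | you |
|---|-----|------|----|--------|---------|-----|
| 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 2 | 0 | 0 |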


In [6]:
# tokenize test data string
simpleTest = ["please don't call me"]

simpleTestDtm = vect.transform(simpleTest)
simpleTestDtm.toarray()

pd.DataFrame(simpleTestDtm.toarray(), columns = vect.get_feature_names())


|   | cab | call | me | please | tonight | you |
|---|-----|------|----|--------|---------|-----|
| 0 | 0 | 1 | 1 | 1 | 0 | 0 |


> Notice that "don't" contributed nothing to the counts: the default tokenizer splits it into "don" (the single-character "t" is dropped), and "don" is not in the learned vocabulary.
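
To see why the test string loses a word, it can help to look at the tokens before they are matched against the vocabulary. The sketch below assumes the default `CountVectorizer` settings and uses `build_analyzer()` to expose the tokenizer; the variable names are illustrative and not part of the lab.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same training sentences as the example above
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
vect = CountVectorizer()
vect.fit(simple_train)

# build_analyzer() returns the function that lowercases text and splits it into tokens
analyzer = vect.build_analyzer()
tokens = analyzer("please don't call me")
print(tokens)   # ['please', 'don', 'call', 'me']

# Tokens missing from the learned vocabulary are silently dropped by transform()
unknown = [t for t in tokens if t not in vect.vocabulary_]
print(unknown)  # ['don']
```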

# Case study - text message analysis - SPAM or not?

Build a classifier to determine whether an SMS text message is spam or not.

<a id = "Case-study-text-message-analysis"></a>

In [7]:
# load data
url = 'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv'
sms = pd.read_table(url, header = None, names = ['label', 'message'])
sms['labelNum'] = sms.label.map({'ham' : 0, 'spam' : 1})


URLError: <urlopen error [Errno 11004] getaddrinfo failed>

In [None]:
# inspect
sms.shape


In [None]:
# review messages and labels
X = sms['message']
y = sms['labelNum']
print(X.shape)
print(y.shape)


In [None]:
# train/test split
xTrain, xTest, yTrain, yTest = train_test_split(X, y, random_state = 1)
print(xTrain.shape)
print(xTest.shape)
print(yTrain.shape)
print(yTest.shape)


In [None]:
# learn the vocabulary - Vectorize the SMS dataset
vect = CountVectorizer()
vect.fit(xTrain)
xTrainDtm = vect.transform(xTrain)
pd.DataFrame(xTrainDtm.toarray(), columns = vect.get_feature_names())[:7]


In [None]:
# transform test set based on learned vocabulary
xTestDtm = vect.transform(xTest)
pd.DataFrame(xTestDtm.toarray(), columns = vect.get_feature_names())[:7]
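
The vectorizer returns the document-term matrix as a SciPy sparse matrix, which is why the cells above call `toarray()` before building a DataFrame. As a rough sketch on the three toy sentences from the earlier example (not the SMS data), the matrix can be inspected without converting it to a dense array:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
dtm = CountVectorizer().fit_transform(docs)

# shape is (number of documents, vocabulary size)
n_rows, n_cols = dtm.shape
# nnz counts the stored non-zero entries; density is their share of all cells
density = dtm.nnz / (n_rows * n_cols)

print(dtm.shape)  # (3, 6)
print(dtm.nnz)    # 9
print(density)    # 0.5
```

On a real corpus the vocabulary is far larger than any single message, so the density is typically much lower and a full `toarray()` call can use a lot of memory.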


## Classify with multinomial naive bayes

<a id = "Classify-with-multinomial-naive-bayes"></a>

In [None]:
# create Naive Bayes model
nb = MultinomialNB()
nb.fit(xTrainDtm, yTrain)


In [None]:
# test set predictions
yPredClass = nb.predict(xTestDtm)


In [None]:
# evaluate predictions
metrics.accuracy_score(yTest, yPredClass)
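
Because the data download failed in this run, the cells in this section show no output. The following is a minimal, self-contained sketch of the same fit/transform/predict flow on a hypothetical four-message corpus; the messages and labels are invented for illustration and say nothing about how the model performs on the SMS data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical mini corpus standing in for the SMS data: 0 = ham, 1 = spam
train_msgs = ['win a free prize now', 'free cash prize, claim now',
              'are we still on for lunch', 'call me when you get home']
train_labels = [1, 1, 0, 0]
test_msgs = ['claim your free prize', 'lunch at home today']

vect = CountVectorizer()
x_train = vect.fit_transform(train_msgs)  # learn the vocabulary on training data only
x_test = vect.transform(test_msgs)        # reuse that vocabulary for the test data

nb = MultinomialNB().fit(x_train, train_labels)
pred = nb.predict(x_test)
proba = nb.predict_proba(x_test)          # one column per class, ordered as nb.classes_

print(pred)  # [1 0]
```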


In [None]:
# create confusion matrix
metrics.confusion_matrix(yTest, yPredClass)
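
For binary labels, the confusion matrix has true classes as rows and predicted classes as columns. A small sketch with made-up labels (not the SMS predictions) shows how the four cells map to true/false positives and negatives:

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels: 0 = ham, 1 = spam
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 0, 1, 0])

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 4 1 1 2

# Accuracy is the share of correct predictions
accuracy = (tn + tp) / (tn + fp + fn + tp)
print(accuracy)        # 0.75
```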


In [None]:
# print message for false positives (ham incorrectly labeled spam)
xTest[yTest < yPredClass]


In [None]:
# print message for false negatives (spam incorrectly labeled ham)
xTest[yTest > yPredClass]
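
The two comparisons above work because the labels are coded 0 (ham) and 1 (spam): `true < predicted` holds only for a ham message predicted as spam, and `true > predicted` only for the reverse. A toy sketch with invented messages:

```python
import numpy as np
import pandas as pd

# Hypothetical messages with true labels and model predictions (0 = ham, 1 = spam)
messages = pd.Series(['see you at 5', 'WIN a prize now', 'lunch?', 'free entry, txt now'])
y_true = pd.Series([0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 0])

# The comparison yields a boolean Series that is used as a row mask
false_positives = messages[y_true < y_pred]  # true 0, predicted 1
false_negatives = messages[y_true > y_pred]  # true 1, predicted 0

print(list(false_positives))  # ['see you at 5']
print(list(false_negatives))  # ['free entry, txt now']
```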


In [None]:
# review specific example
xTest[2247]


In [None]:
# calculate predicted probabilities of spam (class 1) for xTestDtm
yPredProb = nb.predict_proba(xTestDtm)[:,1]
yPredProb


In [None]:
# calculate area under the curve score
metrics.roc_auc_score(yTest, yPredProb)
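
`roc_auc_score` takes the predicted probabilities rather than the predicted class labels. One way to read the score is as the share of (spam, ham) pairs in which the spam message receives the higher probability. A sketch with four made-up scores:

```python
from sklearn import metrics

# Hypothetical true labels and predicted spam probabilities
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

auc = metrics.roc_auc_score(y_true, y_prob)
print(auc)  # 0.75

# Same number by hand: of the 4 (spam, ham) pairs, 3 are ranked correctly
pairs = [(s, h) for s, ys in zip(y_prob, y_true) if ys == 1
                for h, yh in zip(y_prob, y_true) if yh == 0]
by_hand = sum(s > h for s, h in pairs) / len(pairs)
```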


__Evaluate internal probabilities__

In [None]:
# gather feature names
xTrainTokens = vect.get_feature_names()
len(xTrainTokens)


In [None]:
# examine first fifty tokens
print(xTrainTokens[0:50])


In [None]:
# examine last fifty tokens
print(xTrainTokens[-50:])


In [None]:
# rows = classes, columns = tokens
nb.feature_count_


In [None]:
# number of times each token appears in each type of message
hamTokenCount = nb.feature_count_[0,:]
spamTokenCount = nb.feature_count_[1,:]

tokens = pd.DataFrame({'token' : xTrainTokens, 'ham' : hamTokenCount, 'spam' : spamTokenCount}).set_index('token')
tokens[:7]


In [None]:
# sample from tokens
tokens.sample(10, random_state = 9)


In [None]:
# add 1 to each token count to avoid div by 0
tokens['ham'] = tokens['ham'] + 1
tokens['spam'] = tokens['spam'] + 1
tokens.sample(10, random_state = 9)


In [None]:
# convert ham and spam counts into frequencies
# divide the number of times a word appears by the total number of observations in that class
# these probabilities are used to calculate conditional probability for class designation
tokens['ham'] = tokens['ham'] / nb.class_count_[0] 
tokens['spam'] = tokens['spam'] / nb.class_count_[1] 
tokens.sample(10, random_state = 9)


In [None]:
# add spam-to-ham ratio
tokens['spam_ratio'] = tokens['spam'] / tokens['ham']
tokens.sample(10, random_state = 9)
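
The arithmetic behind `spam_ratio` can be followed on a few numbers. The token counts and class sizes below are invented for illustration; the steps mirror the cells above (add 1, divide by the class size, take the ratio).

```python
import pandas as pd

# Hypothetical raw token counts per class
counts = pd.DataFrame(
    {'ham': [30.0, 0.0, 5.0], 'spam': [2.0, 12.0, 5.0]},
    index=['lunch', 'prize', 'call'],
)
# Hypothetical number of ham and spam messages
n_ham, n_spam = 100, 20

# Add 1 so a token never seen in a class does not cause a division by zero
smoothed = counts + 1

# Normalize by class size so the larger class does not dominate
freq = pd.DataFrame({'ham': smoothed['ham'] / n_ham,
                     'spam': smoothed['spam'] / n_spam})
freq['spam_ratio'] = freq['spam'] / freq['ham']

print(freq.sort_values('spam_ratio', ascending=False))
```

With these numbers, "prize" gets a ratio of about 65 (0.65 / 0.01), "call" about 5 (0.30 / 0.06), and "lunch" about 0.48 (0.15 / 0.31), so equal raw counts for "call" still lean toward spam once the smaller spam class is accounted for.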


In [None]:
# sort by spam_ratio descending to see the 'spammiest' words
tokens.sort_values(['spam_ratio'], ascending = [False])[:10]


In [None]:
# sort by spam_ratio ascending to see the least 'spammy' words
tokens.sort_values(['spam_ratio'], ascending = [True])[:10]


## Classify with logistic regression

<a id = "Classify-with-logistic-regression"></a>

In [None]:
# create and fit logistic regression model
logReg = LogisticRegression()
logReg.fit(xTrainDtm, yTrain)


In [None]:
# test set predictions
yPredClass = logReg.predict(xTestDtm)


In [None]:
# evaluate predictions
metrics.accuracy_score(yTest, yPredClass)


In [None]:
# review predicted probabilities
yPredProb = logReg.predict_proba(xTestDtm)[:,1]
metrics.roc_auc_score(yTest, yPredProb)


# Parameter tuning with CountVectorizer


<a id = "Parameter-tuning-w/-CountVectorizer"></a>

In [None]:
# show default params
vect

In [None]:
# remove English stop words
vect = CountVectorizer(stop_words = 'english')


In [None]:
# expand the scope of tokenization. a range of (1,1) makes tokens of single words;
# a range of (1,2) expands the scope of tokenization so that each pair of adjacent words
# also becomes a token. this lets the context of word usage enter the model, but makes
# the document-term matrix larger
vect = CountVectorizer(ngram_range = (1,2))


In [None]:
# ignore terms that appear in more than 50% of the documents
vect = CountVectorizer(max_df = 0.5)


In [None]:
# only keep terms that appear in at least 2 documents
# (an integer is a document count; a float such as 0.5 would mean 50% of the documents)
vect = CountVectorizer(min_df = 2)
