In [7]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [12]:

iris = load_iris()

In [13]:
# store the feature matrix X, and response vector y
X = iris.data
y = iris.target
print(iris.feature_names)
print(iris.target_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
['setosa' 'versicolor' 'virginica']


In [4]:
print(X.shape)
print(y.shape)

(150, 4)
(150,)


In [7]:
# examine the first 5 rows of X with feature names
import pandas as pd
pd.DataFrame(X, columns=iris.feature_names).head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**.

**QUESTION:** Do we need to scale the feature values, especially when using KNN, so that distances are computed on scaled values?

[Rescaling Data for Machine Learning in Python with Scikit-Learn](https://machinelearningmastery.com/rescaling-data-for-machine-learning-in-python-with-scikit-learn/)

[How To Prepare Your Data For Machine Learning in Python with Scikit-Learn](https://machinelearningmastery.com/prepare-data-machine-learning-python-scikit-learn/)
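
One way to act on the scaling question above is to put a scaler in front of KNN. The sketch below assumes that standardizing each feature (zero mean, unit variance) is appropriate; the name `scaled_knn` is made up for illustration. For iris all four features are measured in cm on similar ranges, so the effect is likely small, but for features on very different scales the larger one would otherwise dominate the distance.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target

# StandardScaler rescales each feature to zero mean and unit variance,
# so no single feature dominates the distance that KNN computes.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_knn.fit(X, y)

# The pipeline applies the training-set scaling to new observations before predicting.
prediction = scaled_knn.predict([[3, 5, 4, 2]])
print(iris.target_names[prediction])
```

Wrapping both steps in a pipeline means the scaling parameters are learned from the training data only and reused at prediction time.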


In [9]:
# import the class
from sklearn.neighbors import KNeighborsClassifier

# instantiate the model (with the default parameters)
knn = KNeighborsClassifier()

# fit the model with data (occurs in-place)
knn.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In order to make a **prediction**, the new observation must have the **same features as the training observations**, both in number and meaning.

In [10]:
# predict the response for a new observation
knn.predict([[3, 5, 4, 2]])

array([1])
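
A small sketch of what "same features, both in number and meaning" implies in practice: the predicted code can be mapped back to a species name, and an observation with the wrong number of features is rejected. The variable names are illustrative, not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier()
knn.fit(iris.data, iris.target)

# A valid observation: 4 features, in the same order as the training data
predicted_code = knn.predict([[3, 5, 4, 2]])
predicted_name = iris.target_names[predicted_code]
print(predicted_code, predicted_name)

# An invalid observation: only 3 features, so scikit-learn raises ValueError
try:
    knn.predict([[3, 5, 4]])
    rejected = False
except ValueError as error:
    rejected = True
    print('rejected:', error)
```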

## Part 2: Representing text as numerical data

In [1]:
# example text for model training (SMS messages)
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.

We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to "convert text into a matrix of token counts":

A token is typically a single word, but it does not have to be: tokens can also be n-grams of several adjacent words, or n-grams of characters.
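
As a sketch of tokens that are not single words, the example below asks `CountVectorizer` for unigrams and bigrams via `ngram_range=(1, 2)`. The name `bigram_vect` is illustrative; the training strings are the ones used in this part.

```python
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

# ngram_range=(1, 2) keeps single words and adds pairs of adjacent words
bigram_vect = CountVectorizer(ngram_range=(1, 2))
bigram_vect.fit(simple_train)

# vocabulary_ maps each token to its column index; the sorted keys are the tokens
bigram_tokens = sorted(bigram_vect.vocabulary_)
print(bigram_tokens)
```

Note that the bigrams are built from the tokens that survive tokenization, so the single-character word "a" is dropped before "me cab" is formed.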

In [2]:
# import and instantiate CountVectorizer ( with the default parameters )
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [5]:
# learn the 'vocabulary' of the training data (occurs in-place)
# the vocabulary is the set of tokens described above; the counting itself happens later, in transform
vect.fit(simple_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [6]:
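# examine the fitted vocabulary: all of the tokens that CountVectorizer extracted from the sample training data
# notice: no punctuation, alphabetical order, case is ignored (everything is lowercased)
# single-character tokens such as 'a' are dropped by the default token_pattern
# note: get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2; use get_feature_names_out() there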
vect.get_feature_names()

['cab', 'call', 'me', 'please', 'tonight', 'you']

In [11]:
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm

<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>

pryan:
Notice the 3x6 shape above.

- 3: documents (input strings). Each one is a row of data.
- 6: features (tokens) derived by the fit method of the CountVectorizer.

In [13]:
# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()

array([[0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 0, 0],
       [0, 1, 1, 2, 0, 0]])

In [15]:
# Use pandas to display the document term matrix with the features and the counts of all of the features.
df = pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())
df.head()

Unnamed: 0,cab,call,me,please,tonight,you
0,0,1,0,0,1,1
1,1,1,1,0,0,0
2,0,1,1,2,0,0


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> In this scheme, features and samples are defined as follows:

> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.

> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.

> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.

pryan: **Bag of Words** means that word order is lost in CountVectorizer. Order can matter, so it is important to note that CountVectorizer **DOES NOT** maintain it. To some extent, n-grams preserve the order of a small number of adjacent words.
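
The loss of word order can be illustrated without scikit-learn. This sketch uses a deliberately simplified tokenizer (lowercase, split on whitespace), which is an assumption and not what `CountVectorizer` does internally; the helper names are made up.

```python
from collections import Counter

def bag_of_words(text):
    """Count each lowercase word, ignoring the order of the words."""
    return Counter(text.lower().split())

def bag_of_bigrams(text):
    """Count each pair of adjacent words, which keeps some local order."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

a = 'dog bites man'
b = 'man bites dog'

# Same words, so the unigram bags are identical even though the meaning differs
same_unigrams = bag_of_words(a) == bag_of_words(b)
# The adjacent pairs differ, so the bigram bags tell the sentences apart
same_bigrams = bag_of_bigrams(a) == bag_of_bigrams(b)
print(same_unigrams, same_bigrams)  # True False
```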

In [17]:
# print the sparse matrix
print(simple_train_dtm)

  (0, 1)	1
  (0, 4)	1
  (0, 5)	1
  (1, 0)	1
  (1, 1)	1
  (1, 2)	1
  (2, 1)	1
  (2, 2)	1
  (2, 3)	2


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).

> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.

> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package.
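
To make the sparsity claim concrete, the sketch below measures the fraction of non-zero cells. It rebuilds the small 3x6 matrix from this part by hand, then repeats the arithmetic using the shape and stored-element count reported for the SMS training matrix later in this notebook (4179x7456 with 55209 stored elements). Variable names are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

# The dense form of the small document-term matrix shown earlier
dense = np.array([[0, 1, 0, 0, 1, 1],
                  [1, 1, 1, 0, 0, 0],
                  [0, 1, 1, 2, 0, 0]])
sparse = csr_matrix(dense)

# nnz is the number of stored (non-zero) elements
n_cells = sparse.shape[0] * sparse.shape[1]
density = sparse.nnz / n_cells
print(sparse.nnz, n_cells, density)  # 9 18 0.5

# Same calculation for the SMS training matrix, from its reported shape and nnz
train_density = 55209 / (4179 * 7456)
print(round(train_density * 100, 2), '% of cells are non-zero')
```

The toy matrix is half full, but the real training matrix is well under 1% non-zero, which is why a sparse format pays off.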

In [18]:
# example text for model testing
simple_test = ["please don't call me"]

In order to make a prediction, the new observation must have the same features as the training observations, both in number and meaning.

pryan: This means that a 'raw' new observation must be transformed with the already-fitted vectorizer. The resulting document-term matrix then has the same columns as the one the vectorizer was fit on.

In [19]:
# transform testing data into a document-term matrix (using the existing vocabulary)
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()

array([[0, 1, 1, 1, 0, 0]])

In [20]:
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())

Unnamed: 0,cab,call,me,please,tonight,you
0,0,1,1,1,0,0


pryan: Notice that "don't" is not in the transformed dtm, because "don" was NOT a token seen in the training data (and the single character "t" is never kept as a token). Therefore it gets dropped. This is OK.

If the model predicts accurately without a token that never appeared in the training corpus, that token is unlikely to help. More fundamentally, the model has no training data for an unseen token, so it would have no basis for using it to determine an outcome.


**Summary:**

- `vect.fit(train)` **learns the vocabulary** of the training data
- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data
- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)
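
The summary above can be sketched as follows, together with the mistake it guards against: fitting a second vectorizer on the test data produces columns that no longer line up with the training matrix. `wrong_vect` is a hypothetical name used only to show the error.

```python
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
simple_test = ["please don't call me"]

vect = CountVectorizer()
train_dtm = vect.fit_transform(simple_train)  # learn the vocabulary, then transform
test_dtm = vect.transform(simple_test)        # reuse the training vocabulary

# Mistake to avoid: a fresh vectorizer learns a different vocabulary from the test data
wrong_vect = CountVectorizer()
wrong_dtm = wrong_vect.fit_transform(simple_test)

print(train_dtm.shape, test_dtm.shape, wrong_dtm.shape)  # (3, 6) (1, 6) (1, 4)
```

Only `test_dtm` has the 6 columns a model trained on `train_dtm` would expect.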

## Part 3: Reading the SMS data

In [9]:
# use read_table because it is a tab-separated file.
sms = pd.read_table('../data/sms.tsv', header=None, names=['label', 'message'])

In [5]:
# examine the shape
sms.shape

(5572, 2)

In [7]:
# examine the first 10 rows
sms.head(10)

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


In [10]:
# examine the class distribution
# a pandas Series (column) has a value_counts method.
# about 87% is ham and 13% is spam
sms.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [11]:
# convert label to a numerical variable.
# scikit learn models use numeric values.
sms['label_num'] = sms.label.map({'ham': 0, 'spam': 1})

In [10]:
sms.head(10)

Unnamed: 0,label,message,label_num
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0
5,spam,FreeMsg Hey there darling it's been 3 week's n...,1
6,ham,Even my brother is not like to speak with me. ...,0
7,ham,As per your request 'Melle Melle (Oru Minnamin...,0
8,spam,WINNER!! As a valued network customer you have...,1
9,spam,Had your mobile 11 months or more? U R entitle...,1


In [12]:
# define X and y for use with CountVectorizer
# keep the full data set here, so that it can later be split with train_test_split
# CountVectorizer expects a 1-dimensional series of strings and turns it into a 2-dimensional document-term matrix
X = sms.message
y = sms.label_num
print(X.shape)
print(y.shape)

(5572,)
(5572,)


In [13]:
# split X and y into training and test data sets
# split the data BEFORE vectorization so that the testing set is not polluted by any
# training observations. The testing set may also contain words that the training set has not seen,
# which better models the real world.
# By default, splits at 75% training, 25% testing data
# train_test_split does NOT stratify by default, so the ham/spam proportions are only approximately
# preserved; pass stratify=y to preserve them as closely as possible
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=1)
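
A minimal sketch of a stratified split, using hypothetical labels (80 ham, 20 spam) rather than the SMS data. The `stratify=y` argument asks `train_test_split` to keep the class proportions the same in both parts; the notebook's own split above does not use it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 ham (0) and 20 spam (1)
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1, stratify=y)

# The mean of a 0/1 label vector is the spam fraction
print(y_train.mean(), y_test.mean())  # 0.2 0.2
```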

In [18]:
print(X_train.shape)
print(X_test.shape)

(4179,)
(1393,)


## Part 4: Vectorizing the SMS data

In [14]:
# instantiate CountVectorizer
vect = CountVectorizer()

In [15]:
# Learn training data vocabulary, then use it to create a document-term matrix.
vect.fit(X_train)

# remember that transform returns a document-term matrix
# use the vocabulary that was created in a fit, to create the document-term matrix
X_train_dtm = vect.transform(X_train)

In [16]:
# Alternatively you can call the fit_transform
X_train_dtm = vect.fit_transform(X_train)

In [17]:
# examine the document-term matrix
# 4179x7456
# 4179 - rows from the training set.  X_train.shape = 4179
# 7456 - number of document terms, or also known as unique tokens.
X_train_dtm

<4179x7456 sparse matrix of type '<class 'numpy.int64'>'
	with 55209 stored elements in Compressed Sparse Row format>

In [30]:
# visualize the document-term matrix so that we can see the columns ( term ) and the counts
pd.DataFrame(X_train_dtm.toarray(), columns=vect.get_feature_names())

Unnamed: 0,00,000,008704050406,0121,01223585236,01223585334,0125698789,02,0207,02072069400,...,zed,zeros,zhong,zindgi,zoe,zoom,zouk,zyada,èn,〨ud
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
# transform testing data ( using the fitted vocabulary ) into a document-term matrix
# 1393x7456
# 1393 - rows from the testing set.  X_test.shape = 1393
# 7456 - number of document terms, or also known as unique tokens.
X_test_dtm = vect.transform(X_test)
X_test_dtm

<1393x7456 sparse matrix of type '<class 'numpy.int64'>'
	with 17604 stored elements in Compressed Sparse Row format>

In [19]:
# visualize the document-term matrix so that we can see the columns ( term ) and the counts
pd.DataFrame(X_test_dtm.toarray(), columns=vect.get_feature_names())

Unnamed: 0,00,000,008704050406,0121,01223585236,01223585334,0125698789,02,0207,02072069400,...,zed,zeros,zhong,zindgi,zoe,zoom,zouk,zyada,èn,〨ud
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Part 5: Building a Naive Bayes model

We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):

> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.

In [20]:
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [21]:
# train the model using X_train_dtm
# train aka fit, the model to the training data.  For text it is the document-term matrix that you
# train the model on.
%time nb.fit(X_train_dtm, y_train)

CPU times: user 6.85 ms, sys: 2.68 ms, total: 9.53 ms
Wall time: 9.68 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [22]:
# make classification predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)
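
The vectorize-then-fit steps can also be chained into one object. This is a sketch on a tiny made-up corpus (not the SMS data), assuming scikit-learn's `make_pipeline`; the pipeline calls `fit_transform` on training text and `transform` on new text, so raw strings can be passed straight to `predict`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages: 0 = ham, 1 = spam
toy_messages = ['see you at lunch',
                'call me tonight',
                'are we still meeting tomorrow',
                'win a free prize now',
                'free entry claim your prize',
                'claim your free cash now']
toy_labels = [0, 0, 0, 1, 1, 1]

spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(toy_messages, toy_labels)

# Raw text goes in; the pipeline vectorizes it with the fitted vocabulary
toy_predictions = spam_pipeline.predict(['free prize', 'call me at lunch'])
print(toy_predictions)  # [1 0]
```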

In [23]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.98851399856424982
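
An accuracy of 0.9885 is easier to judge next to the null accuracy, meaning the accuracy of always predicting the most frequent class. The sketch below uses the test-set class counts that appear later in this notebook (1208 ham, 185 spam); the variable names are illustrative.

```python
# Test-set class counts as reported elsewhere in this notebook
n_ham, n_spam = 1208, 185

# Null accuracy: always predict the majority class (ham)
null_accuracy = max(n_ham, n_spam) / (n_ham + n_spam)

# Accuracy of the Naive Bayes model, copied from the output above
model_accuracy = 0.98851399856424982

print(round(null_accuracy, 4), round(model_accuracy, 4))  # 0.8672 0.9885
```

So the model has to be compared against roughly 87%, not 50%.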

In [24]:
# print the confusion matrix
# remember we are trying to create a spam model, and therefore '1' is spam.
# rows are the actual classes, columns are the predicted classes
#
#      +------+------+
#      | ham  | spam |
#      +------+------+
# ham  | 1203 |    5 |
# spam |   11 |  174 |
#      +------+------+
# row[ham]column[ham]   - 1203 True Negative, correctly said ham (negative) was ham
# row[ham]column[spam]  -    5 False Positive, falsely said spam (positive) when it was really ham
# row[spam]column[ham]  -   11 False Negative, falsely said ham (negative) when it was really spam
# row[spam]column[spam] -  174 True Positive, correctly said spam (positive) was spam
metrics.confusion_matrix(y_test, y_pred_class)

array([[1203,    5],
       [  11,  174]])
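
The four cells of the confusion matrix are enough to recompute the usual metrics by hand. This sketch copies the matrix from the output above and assumes scikit-learn's layout (rows are actual classes, columns are predicted classes).

```python
import numpy as np

# Confusion matrix copied from the output above
confusion = np.array([[1203, 5],
                      [11, 174]])

# ravel() flattens row by row: TN, FP, FN, TP
tn, fp, fn, tp = confusion.ravel()

accuracy = (tp + tn) / confusion.sum()
precision = tp / (tp + fp)    # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)       # of the actual spam messages, how many were caught
specificity = tn / (tn + fp)  # of the actual ham messages, how many were kept

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(specificity, 4))
```

The accuracy matches the `accuracy_score` output above, and precision and recall round to the 0.97 and 0.94 shown for class 1 in the classification report.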

In [45]:
# print message text for the false positives (meaning they were incorrectly classified as spam)
# meaning where y_test = 0 (ham), and we predicted 1 (spam)
X_test[y_test < y_pred_class]

574               Waiting for your call.
3375             Also andros ice etc etc
45      No calls..messages..missed calls
3415             No pic. Please re-send.
1988    No calls..messages..missed calls
Name: message, dtype: object

In [46]:
# print message text for the false negatives (meaning they were incorrectly classified as ham)
# meaning where y_test = 1 (spam), and we predicted 0 (ham)
X_test[y_test > y_pred_class]

3132    LookAtMe!: Thanks for your purchase of a video...
5       FreeMsg Hey there darling it's been 3 week's n...
3530    Xmas & New Years Eve tickets are now on sale f...
684     Hi I'm sue. I am 20 years old and work as a la...
1875    Would you like to see my XXX pics they are so ...
1893    CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...
4298    thesmszone.com lets you send free anonymous an...
4949    Hi this is Amy, we will be sending you a free ...
2821    INTERFLORA - It's not too late to order Inter...
2247    Hi ya babe x u 4goten bout me?' scammers getti...
4514    Money i have won wining number 946 wot do i do...
Name: message, dtype: object

In [48]:
# what do you notice about the false negatives?
# meaning we said ham, but was really spam
X_test[3132]

"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323."

In [50]:
# calculate predicted probabilities for X_test_dtm (poorly calibrated)
# for each observation, what is the model's predicted probability that it is ham or spam
# [:, 1] - means all of the rows and the second column (index 1), the predicted probability of spam
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([  2.87744864e-03,   1.83488846e-05,   2.07301295e-03, ...,
         1.09026171e-06,   1.00000000e+00,   3.98279868e-09])

In [51]:
# calculate AUC
# area under the curve
metrics.roc_auc_score(y_test, y_pred_prob)

0.98664310005369615
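
One way to read the AUC: it is the probability that a randomly chosen spam message gets a higher predicted score than a randomly chosen ham message. The sketch below computes it by brute force over all spam/ham pairs on a tiny made-up example; `pairwise_auc` is a hypothetical helper, not a scikit-learn function.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for s, t in zip(scores, y_true) if t == 1]
    negatives = [s for s, t in zip(scores, y_true) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and scores: one spam message (0.35) is ranked below one ham message (0.4)
toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)  # 0.75
```

Because only the ranking of the scores matters, AUC can be high even when the probabilities themselves are poorly calibrated.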

In [52]:
# creates a 2-dimensional array
# [0] - ham prediction
# [1] - spam prediction
# across the row, [0] + [1] = 100%
nb.predict_proba(X_test_dtm)

array([[  9.97122551e-01,   2.87744864e-03],
       [  9.99981651e-01,   1.83488846e-05],
       [  9.97926987e-01,   2.07301295e-03],
       ..., 
       [  9.99998910e-01,   1.09026171e-06],
       [  1.86697467e-10,   1.00000000e+00],
       [  9.99999996e-01,   3.98279868e-09]])

In [27]:
print(classification_report(y_test, y_pred_class))

             precision    recall  f1-score   support

          0       0.99      1.00      0.99      1208
          1       0.97      0.94      0.96       185

avg / total       0.99      0.99      0.99      1393



## Part 6: Comparing Naive Bayes with logistic regression

We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):

> Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.

Logistic regression is typically much slower to train than Naive Bayes.

In [3]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [4]:
# train the model using X_train_dtm
%time logreg.fit(X_train_dtm, y_train)


In [56]:
# make a classification prediction for the X_test_dtm
y_pred_class = logreg.predict(X_test_dtm)

In [60]:
# calculate predicted probabilities for X_test_dtm (well calibrated)
# for logistic regression, the predicted probability can be interpreted as the likelihood of occurrence.
# with Naive Bayes, you should not really interpret the probabilities as likelihoods.
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([ 0.01269556,  0.00347183,  0.00616517, ...,  0.03354907,
        0.99725053,  0.00157706])

In [58]:
# calculate accuracy
metrics.accuracy_score(y_test, y_pred_class)

0.9877961234745154

In [59]:
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)

0.99368176123143015

In [28]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_class))

             precision    recall  f1-score   support

          0       0.99      1.00      0.99      1208
          1       0.97      0.94      0.96       185

avg / total       0.99      0.99      0.99      1393



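Naive Bayes is faster and has slightly higher accuracy here (0.9885 vs 0.9878), but logistic regression has the higher AUC (0.9937 vs 0.9866) and gives better-calibrated predicted probabilities in this case.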

## Part 7: Calculating the "spamminess" of each token

Video 2 @ 20:20

In [61]:
# store the vocabulary of X_train
X_train_tokens = vect.get_feature_names()
len(X_train_tokens)

7456

In [62]:
# examine the first 50 tokens
X_train_tokens[0:50]

['00',
 '000',
 '008704050406',
 '0121',
 '01223585236',
 '01223585334',
 '0125698789',
 '02',
 '0207',
 '02072069400',
 '02073162414',
 '02085076972',
 '021',
 '03',
 '04',
 '0430',
 '05',
 '050703',
 '0578',
 '06',
 '07',
 '07008009200',
 '07090201529',
 '07090298926',
 '07123456789',
 '07732584351',
 '07734396839',
 '07742676969',
 '0776xxxxxxx',
 '07781482378',
 '07786200117',
 '078',
 '07801543489',
 '07808',
 '07808247860',
 '07808726822',
 '07815296484',
 '07821230901',
 '07880867867',
 '0789xxxxxxx',
 '07946746291',
 '0796xxxxxx',
 '07973788240',
 '07xxxxxxxxx',
 '08',
 '0800',
 '08000407165',
 '08000776320',
 '08000839402',
 '08000930705']

In [64]:
# examine the last 50 tokens
X_train_tokens[-50:]

['yer',
 'yes',
 'yest',
 'yesterday',
 'yet',
 'yetunde',
 'yijue',
 'ym',
 'ymca',
 'yo',
 'yoga',
 'yogasana',
 'yor',
 'yorge',
 'you',
 'youdoing',
 'youi',
 'youphone',
 'your',
 'youre',
 'yourjob',
 'yours',
 'yourself',
 'youwanna',
 'yowifes',
 'yoyyooo',
 'yr',
 'yrs',
 'ything',
 'yummmm',
 'yummy',
 'yun',
 'yunny',
 'yuo',
 'yuou',
 'yup',
 'zac',
 'zaher',
 'zealand',
 'zebra',
 'zed',
 'zeros',
 'zhong',
 'zindgi',
 'zoe',
 'zoom',
 'zouk',
 'zyada',
 'èn',
 '〨ud']

In [65]:
# Naive Bayes counts the number of times each token appears in each class
# recall, row[0] is ham, row[1] is spam
nb.feature_count_

array([[  0.,   0.,   0., ...,   1.,   1.,   1.],
       [  5.,  23.,   2., ...,   0.,   0.,   0.]])

What the feature count matrix says is:
For the zeroth element: the term '00' was found zero times in ham messages and 5 times in spam messages.

For the first element: the term '000' was found zero times in ham messages and 23 times in spam messages.

In [67]:
# rows represent classes, columns represent tokens
# row[0] - ham
# row[1] - spam
# 7456 columns or tokens
nb.feature_count_.shape

(2, 7456)

In [68]:
# number of times each token appears across all HAM messages
# row[0] = ham
# this says pull out row zero, and all of the columns
ham_token_count = nb.feature_count_[0, :]
ham_token_count

array([ 0.,  0.,  0., ...,  1.,  1.,  1.])

In [70]:
# number of times each token appears across all SPAM messages
# row[1] = spam
# this says pull out row one, and all of the columns
spam_token_count = nb.feature_count_[1, :]
spam_token_count

array([  5.,  23.,   2., ...,   0.,   0.,   0.])

In [71]:
# create a DataFrame of tokens with their separate ham and spam counts
# keys - column names
# create a DataFrame where each row will show the token, the number of times it appeared in ham, 
#                        and the number of times it appeared in spam.
tokens = pd.DataFrame({'token':X_train_tokens, 'ham':ham_token_count, 'spam':spam_token_count}).set_index('token')

In [72]:
# examine 5 random DataFrame rows
tokens.sample(5, random_state=6)

Unnamed: 0_level_0,ham,spam
token,Unnamed: 1_level_1,Unnamed: 2_level_1
very,64.0,2.0
nasty,1.0,1.0
villa,0.0,1.0
beloved,1.0,0.0
textoperator,0.0,2.0


In [75]:
sms.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In the table above, would 'nasty' be a better predictor of ham or spam?

The answer is spam, because there are far fewer spam messages than ham messages, so a single occurrence is a much larger share of the spam class:

- ham: 1 / 4825
- spam: 1 / 747

Before we can use this to calculate the **"spamminess" of each token**, we need to avoid **dividing by zero** and account for the **class imbalance**.

In [73]:
# add 1 to ham and spam counts to avoid dividing by 0
tokens['ham'] = tokens.ham + 1
tokens['spam'] = tokens.spam + 1
tokens.sample(5, random_state=6)

Unnamed: 0_level_0,ham,spam
token,Unnamed: 1_level_1,Unnamed: 2_level_1
very,65.0,3.0
nasty,2.0,2.0
villa,1.0,2.0
beloved,2.0,1.0
textoperator,1.0,3.0


In [76]:
# Naive Bayes counts the number of observations in each class.
# we want to normalize the ham/spam counts so we get frequencies instead of raw counts.
# naive bayes maintains this from a fitted model.
# scikit learn convention, the trailing underscore means that the property is only there after a 'fit' is called.
nb.class_count_

array([ 3617.,   562.])

3617 + 562 = 4179, which is the number of rows in the X_train_dtm

In [77]:
# normalize the ham and spam counts by the number of messages in each class.
# this adds new columns holding a normalized frequency instead of a raw count,
# using the calculation: ham value / total ham (and likewise for spam)
# this is a better measure because it is adjusted for the class imbalance
tokens['ham_norm'] = tokens.ham / nb.class_count_[0]
tokens['spam_norm'] = tokens.spam / nb.class_count_[1]

In [78]:
# display a random sample
tokens.sample(5, random_state=6)

Unnamed: 0_level_0,ham,spam,ham_norm,spam_norm
token,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
very,65.0,3.0,0.017971,0.005338
nasty,2.0,2.0,0.000553,0.003559
villa,1.0,2.0,0.000276,0.003559
beloved,2.0,1.0,0.000553,0.001779
textoperator,1.0,3.0,0.000276,0.005338


In [81]:
# for each token, we can now calculate the ratio of spam-to-ham.
tokens['spam_ratio'] = tokens.spam_norm / tokens.ham_norm
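
The whole spam-ratio calculation for one token fits in a few lines. This sketch takes the raw counts for 'nasty' and 'very' from the sampled rows above and the class counts from `nb.class_count_` (3617 ham, 562 spam); the function name `spam_ratio` is made up for illustration.

```python
def spam_ratio(ham_count, spam_count, n_ham_messages, n_spam_messages):
    """Add-one smoothed, class-normalized ratio of spam frequency to ham frequency."""
    ham_norm = (ham_count + 1) / n_ham_messages     # +1 avoids dividing by zero
    spam_norm = (spam_count + 1) / n_spam_messages  # normalizing adjusts for class imbalance
    return spam_norm / ham_norm

# Raw counts from the sampled rows; class counts from nb.class_count_
nasty_ratio = spam_ratio(1, 1, 3617, 562)
very_ratio = spam_ratio(64, 2, 3617, 562)

print(round(nasty_ratio, 6), round(very_ratio, 6))  # 6.435943 0.297044
```

A ratio above 1 means the token is relatively more frequent in spam, and a ratio below 1 means it is relatively more frequent in ham.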

In [82]:
# display a random sample
tokens.sample(5, random_state=6)

Unnamed: 0_level_0,ham,spam,ham_norm,spam_norm,spam_ratio
token,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
very,65.0,3.0,0.017971,0.005338,0.297044
nasty,2.0,2.0,0.000553,0.003559,6.435943
villa,1.0,2.0,0.000276,0.003559,12.871886
beloved,2.0,1.0,0.000553,0.001779,3.217972
textoperator,1.0,3.0,0.000276,0.005338,19.307829


In [83]:
# examine the DataFrame sorted by spam_ratio
tokens.sort_values('spam_ratio', ascending=False)

Unnamed: 0_level_0,ham,spam,ham_norm,spam_norm,spam_ratio
token,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
claim,1.0,89.0,0.000276,0.158363,572.798932
prize,1.0,76.0,0.000276,0.135231,489.131673
150p,1.0,49.0,0.000276,0.087189,315.361210
tone,1.0,48.0,0.000276,0.085409,308.925267
guaranteed,1.0,43.0,0.000276,0.076512,276.745552
18,1.0,39.0,0.000276,0.069395,251.001779
cs,1.0,37.0,0.000276,0.065836,238.129893
www,2.0,73.0,0.000553,0.129893,234.911922
1000,1.0,32.0,0.000276,0.056940,205.950178
awarded,1.0,30.0,0.000276,0.053381,193.078292


In [84]:
# look up the spam_ratio for a given token
tokens.loc['dating', 'spam_ratio']

83.667259786476862

## Part 8: Creating a DataFrame from individual text files

Video 2 @ 40:30

This section is about how to read separate files into pandas to prepare for the modeling we just did above.

In [29]:
# use glob to create a list of ham files
# glob only gets the file names - it does NOT actually read the files.
import glob
ham_filenames = glob.glob('../data/ham_files/*.txt')
ham_filenames

['../data/ham_files/email1.txt',
 '../data/ham_files/email3.txt',
 '../data/ham_files/email5.txt']

In [30]:
# read the contents of the ham files into a list ( each list element is one email)
ham_text = []
for filename in ham_filenames:
    with open(filename) as f:
        ham_text.append(f.read())

ham_text

['This is a ham email.\nIt has 2 lines.\n',
 'This is another ham email.\n',
 'This is yet another ham email.\n']

In [31]:
# repeat the process for the spam files
spam_filenames = glob.glob('../data/spam_files/*.txt')
spam_text = []
for filename in spam_filenames:
    with open(filename) as f:
        spam_text.append(f.read())
spam_text

['This is a spam email.\n', 'This is another spam email.\n']

In [32]:
# combine the ham and spam lists
all_text = ham_text + spam_text
all_text

['This is a ham email.\nIt has 2 lines.\n',
 'This is another ham email.\n',
 'This is yet another ham email.\n',
 'This is a spam email.\n',
 'This is another spam email.\n']

In [33]:
# create a list of labels (ham=0, spam=1)
all_labels = [0]*len(ham_text) + [1]*len(spam_text)
all_labels

[0, 0, 0, 1, 1]

In [35]:
# convert the lists into a DataFrame
pd.DataFrame({'label': all_labels, 'message': all_text})

Unnamed: 0,label,message
0,0,This is a ham email.\nIt has 2 lines.\n
1,0,This is another ham email.\n
2,0,This is yet another ham email.\n
3,1,This is a spam email.\n
4,1,This is another spam email.\n
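
The steps in this part can be folded into one helper. This is a sketch under assumptions: `load_labeled_files` is a hypothetical function, the demo files are written to a temporary directory rather than `../data`, and the file names are sorted because the order returned by glob is not guaranteed.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_labeled_files(folder_to_label):
    """Read every .txt file in each folder into one row with that folder's label."""
    rows = []
    for folder, label in folder_to_label.items():
        # sorted() makes the row order reproducible across machines
        for path in sorted(Path(folder).glob('*.txt')):
            rows.append({'label': label, 'message': path.read_text()})
    return pd.DataFrame(rows, columns=['label', 'message'])

# Demo with throwaway files (stand-ins for the ham_files and spam_files folders)
with tempfile.TemporaryDirectory() as root:
    ham_dir = Path(root) / 'ham_files'
    spam_dir = Path(root) / 'spam_files'
    ham_dir.mkdir()
    spam_dir.mkdir()
    (ham_dir / 'email1.txt').write_text('This is a ham email.\n')
    (spam_dir / 'email2.txt').write_text('This is a spam email.\n')
    emails = load_labeled_files({ham_dir: 0, spam_dir: 1})

print(emails)
```

The resulting DataFrame has the same `label` and `message` columns as the one built above, so it can be split and vectorized in the same way as the SMS data.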
