In [79]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Working with Text Data in Scikit-learn

In [80]:
"""
Agenda:
1. Model Building in Scikit-learn - object shapes
2. Representing text as numerical data
3. Reading SMS data
4. Vectorizing the SMS data
5. Building a Naive Bayes Model
6. Comparing Naive Bayes with Logistic Regression
7. Calculating the 'spaminess' of each token
8. Creating a dataframe from individual text files
"""

#iris is a classification dataset
from sklearn.datasets import load_iris
iris = load_iris()

## Part 1: Model Building in Scikit-learn

In [81]:
"""
Features are also known as predictors, inputs, or attributes.
The response is also known as target, label, output

store the feature matrix (X) and response vector (y)
"""
X = iris.data
y = iris.target

In [82]:
print(X.shape)
print(y.shape)

(150, 4)
(150,)


In [83]:
#examine 5 rows of X, including feature names
pd.DataFrame(X, columns=iris.feature_names).head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [84]:
#import the class
from sklearn.neighbors import KNeighborsClassifier

#instantiate the model with default parameters
knn = KNeighborsClassifier()

#fit the model with the data
knn.fit(X,y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [85]:
#predict response for new observation
knn.predict([[3,5,4,2]])

array([1])

## Part 2: Representing Text as Numerical Data

In [86]:
#example text for model training (SMS messages)
simple_train = ['call you tonight', 'Call me a cab', 'please call me...Please!']

In [87]:
simple_train

['call you tonight', 'Call me a cab', 'please call me...Please!']

In [88]:
#import and instantiate CountVectorizer (with default parameters)
#CountVectorizer is used for extracting features from text.
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [89]:
#learn the vocabulary of the training data, i.e. which words are being used
vect.fit(simple_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [90]:
"""
examine the fitted vocabulary
-returns a Python list of strings, in alphabetical order
-lowercase=True converts everything to lowercase, so 'Call' and 'call' are the same token
-punctuation is stripped and each token is listed only once (no duplicates)
-tokens with fewer than 2 characters are dropped. Stop words are NOT removed by default
 ('a' was dropped because it is shorter than 2 characters, not because it is a stop word)
-note: scikit-learn 1.0+ provides get_feature_names_out() in place of get_feature_names()
"""
vect.get_feature_names()

['cab', 'call', 'me', 'please', 'tonight', 'you']
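
The rules listed above can be reproduced with a few lines of plain Python. This is only a sketch of the default behaviour, not scikit-learn's actual implementation: it reuses the `token_pattern` shown in the `CountVectorizer` repr, and the helper name `tokenize` is made up for illustration.

```python
import re

# Default token pattern copied from the CountVectorizer repr above:
# a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

simple_train = ['call you tonight', 'Call me a cab', 'please call me...Please!']

def tokenize(text):
    """Hypothetical helper: lowercase the text, then extract 2+ character tokens."""
    return TOKEN_PATTERN.findall(text.lower())

# Unique tokens across all documents, sorted alphabetically.
vocabulary = sorted({token for doc in simple_train for token in tokenize(doc)})
print(vocabulary)
# ['cab', 'call', 'me', 'please', 'tonight', 'you']
```

The single-character 'a' never matches the pattern, which is why it is missing even though no stop-word list is in use.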

In [91]:
#transform training data into document-term matrix
simple_train_dtm = vect.transform(simple_train)

In [92]:
"""
3 documents, 6 vocabulary words(terms, features, tokens)
"""
simple_train_dtm

<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>

In [93]:
# sparse matrices have a toarray method
#convert sparse matrix to dense matrix
# 3 x 6 matrix as a dense matrix
simple_train_dtm.toarray()

array([[0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 0, 0],
       [0, 1, 1, 2, 0, 0]])

In [94]:
"""
examine the vocabulary and document-term matrix together
It took text, which was non-numeric and of variable length, and now represents it
     as a feature matrix with a fixed number of columns.
Each entry is just a count of the number of times a token appears in that message.

This is the feature matrix. We create it b/c a model needs an X.
"""
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())

Unnamed: 0,cab,call,me,please,tonight,you
0,0,1,0,0,1,1
1,1,1,1,0,0,0
2,0,1,1,2,0,0
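
To make the counting step concrete, here is a minimal pure-Python sketch that rebuilds the same 3 x 6 matrix from the fitted vocabulary. It assumes the default tokenization (lowercasing, tokens of 2+ word characters); `count_row` is a hypothetical helper, not a scikit-learn function.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

simple_train = ['call you tonight', 'Call me a cab', 'please call me...Please!']
vocabulary = ['cab', 'call', 'me', 'please', 'tonight', 'you']

def count_row(text, vocab):
    """Hypothetical helper: one row of the document-term matrix for a single document."""
    counts = Counter(TOKEN_PATTERN.findall(text.lower()))
    # A Counter returns 0 for tokens that do not occur in this document.
    return [counts[token] for token in vocab]

dtm = [count_row(doc, vocabulary) for doc in simple_train]
for row in dtm:
    print(row)
```

Every row has the same length as the vocabulary, no matter how long the message is. That is what turns variable-length text into a fixed-width feature matrix.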


In [95]:
"""
features and samples are defined as follows:
- Each individual token occurrence frequency (normalized or not) is treated as a feature.
- The vector of all token frequencies for a given document is considered a multivariate sample.
    A corpus of documents can be represented as a matrix, with 1 row per document and 1 column per token/word/feature.
    We have a corpus of 3 documents.
vectorization - the process of turning a collection of text documents into numerical feature vectors. This strategy
(tokenization, counting, normalization) is called bag of words. Documents are described by word occurrences while ignoring the
relative position of words in the document. Ordering is lost.

Each entry is a count of the number of times a token appears in that message.
A sparse matrix doesn't store the locations of zeros, only the locations of nonzero values.
    e.g. (2, 3)  2  means: go down to row 2, over to column 3 (counting 0, 1, 2, 3), and the value there is 2.
print the sparse matrix
"""
print(simple_train_dtm)

  (0, 1)	1
  (0, 4)	1
  (0, 5)	1
  (1, 0)	1
  (1, 1)	1
  (1, 2)	1
  (2, 1)	1
  (2, 2)	1
  (2, 3)	2
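
The coordinate listing above can be derived from the dense matrix by keeping only the nonzero cells. The sketch below illustrates the idea of sparse storage in plain Python. It is not how SciPy's Compressed Sparse Row format is laid out internally.

```python
dense = [
    [0, 1, 0, 0, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 1, 1, 2, 0, 0],
]

# Keep ((row, column), value) only where the value is nonzero.
stored = [
    ((r, c), value)
    for r, row in enumerate(dense)
    for c, value in enumerate(row)
    if value != 0
]

for (r, c), value in stored:
    print(f"  ({r}, {c})\t{value}")

n_cells = len(dense) * len(dense[0])
print(f"{len(stored)} stored elements out of {n_cells} cells")
```

With only 3 short documents, half the cells are nonzero. With thousands of messages and thousands of tokens almost every cell is zero, so the saving becomes large.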


### Testing 

In [96]:
"""
point of building a model is to make predictions later.
I'm going to have new text data that comes in, what do I do with it?

-example text for model testing.
"""
simple_test = ["please don't call me"]

In [97]:
#transform testing data into a document-term matrix (using existing vocabulary)
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()

array([[0, 1, 1, 1, 0, 0]])

In [98]:
"""
examine the vocabulary and document-term matrix together
-what has happened that isn't optimal?
What happened to 'don't'? It wasn't a token in the training data, so it was dropped
    ('don' is not in the vocabulary and the lone 't' is too short to be a token).
It's OK that it was dropped: since it wasn't in the original corpus, it can't assist in modeling the text.

In the spam case, 'don't' may be highly relevant to predicting ham or spam, but if that word has never been seen in our
    training data, the model has no reason to believe it's related to spam or non-spam. The model never
    learned whether 'don't' is a hammy or spammy word, so nothing is lost by dropping it.

vect.fit(train) - learns the vocabulary of the training data
vect.transform(train) - uses the fitted vocabulary to build a document-term matrix from the training data
vect.transform(test) - uses the fitted vocabulary to build a document-term matrix from the testing data (and
    ignores tokens it hasn't seen before)

-We didn't run fit on the testing data; we ran fit and transform on the training data. If we had run fit on the testing data we would have
    learned a new vocabulary of 4 words, which would not match the vocabulary of 6 words learned from the training data.

"""
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())

Unnamed: 0,cab,call,me,please,tonight,you
0,0,1,1,1,0,0
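
The difference between transforming with the training vocabulary and refitting on the test message can be sketched in plain Python. This again assumes the default tokenization rules, and `tokenize` is an illustrative helper rather than part of scikit-learn.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

train_vocab = ['cab', 'call', 'me', 'please', 'tonight', 'you']
simple_test = ["please don't call me"]

def tokenize(text):
    """Hypothetical helper mirroring the default lowercase + 2-character rule."""
    return TOKEN_PATTERN.findall(text.lower())

test_tokens = tokenize(simple_test[0])

# "transform": count only tokens that are in the training vocabulary.
counts = Counter(test_tokens)
test_row = [counts[token] for token in train_vocab]

# Tokens silently ignored because the training data never contained them.
dropped = [token for token in test_tokens if token not in train_vocab]

# What a fresh "fit" on the test message alone would have learned instead.
refit_vocab = sorted(set(test_tokens))

print(test_row)     # [0, 1, 1, 1, 0, 0]
print(dropped)      # ['don']
print(refit_vocab)  # ['call', 'don', 'me', 'please']
```

The refit vocabulary has 4 columns in a different order, so a model trained on the 6-column matrix could not use it.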


## Part 3: Reading the SMS data

In [99]:
path = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/sms.tsv'
sms = pd.read_table(path, header=None, names=['label','message'])

In [100]:
sms.head(10)

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


In [101]:
sms.shape

(5572, 2)

In [102]:
#examine the class distribution
# ham is word for non-spam
sms.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [103]:
#convert the label to a numerical variable to use as the response
sms['label_num'] = sms.label.map({'ham':0, 'spam':1})


In [104]:
sms.head(10)

Unnamed: 0,label,message,label_num
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0
5,spam,FreeMsg Hey there darling it's been 3 week's n...,1
6,ham,Even my brother is not like to speak with me. ...,0
7,ham,As per your request 'Melle Melle (Oru Minnamin...,0
8,spam,WINNER!! As a valued network customer you have...,1
9,spam,Had your mobile 11 months or more? U R entitle...,1


In [105]:
#usual way to define X and y (from iris data) for use with a model
X= iris.data
y = iris.target
print(X.shape)
print(y.shape)

(150, 4)
(150,)


In [106]:
#required way to define X and y for use with CountVectorizer
#both are 1D. X starts as 1D b/c CountVectorizer will convert it into a 2D matrix.
X = sms.message  #message series as X data
y = sms.label_num  # label_num series as y
print(X.shape)
print(y.shape)

(5572,)
(5572,)


In [155]:
"""
split X and y into training and testing sets before vectorization
The point of train/test split is that the test set is meant to simulate the future.
In reality, future documents will contain words you've never seen before. Models are handicapped by the fact that they can only
    learn from the past.
If you vectorize before splitting, the vocabulary is learned from all messages, so the testing set will never contain a word the vectorizer hasn't seen.
You make it too easy for the model if you vectorize first and then train/test split.

by default the split is 75/25
"""
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(4179,)
(4179,)
(1393,)
(1393,)
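
A tiny pure-Python illustration of why the order matters. The four messages and the manual 3/1 split are made up for the example, and `vocab_of` is a hypothetical helper.

```python
import re

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

# Made-up messages; the last one plays the role of the test set.
messages = [
    'free prize call now',
    'are you coming tonight',
    'claim your free prize',
    'lunch tomorrow maybe',
]
train, test = messages[:3], messages[3:]

def vocab_of(docs):
    """Hypothetical helper: set of tokens found in a list of documents."""
    return {token for doc in docs for token in TOKEN_PATTERN.findall(doc.lower())}

# Split first, then learn the vocabulary from the training data only.
unseen_split_first = vocab_of(test) - vocab_of(train)

# Vectorize first: the vocabulary is learned from train and test together.
unseen_vectorize_first = vocab_of(test) - vocab_of(messages)

print(sorted(unseen_split_first))      # ['lunch', 'maybe', 'tomorrow']
print(sorted(unseen_vectorize_first))  # []
```

Splitting first leaves the test message with words the vectorizer has never seen, which is the realistic situation. Vectorizing first hides that problem.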


## Part 4: Vectorizing SMS Data

In [108]:
#instantiate the vectorizer
vect = CountVectorizer()

In [109]:
#learn training data vocabulary, then use it to create a document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)

In [110]:
#examine the document-term matrix
#4179 training documents (SMS messages) and 7456 unique tokens (a repeated word gets only one column)
X_train_dtm

<4179x7456 sparse matrix of type '<class 'numpy.int64'>'
	with 55209 stored elements in Compressed Sparse Row format>

In [111]:
#transform testing data (using fitted vocabulary) into a document-term matrix
#7456 stays the same b/c transform uses the fitted vocabulary, and it learned a vocabulary of 7456 tokens previously.
#7456 is the number of features (tokens) that were learned in the fit step.
#we knew it was going to be 1393 rows b/c that is the number of test documents (X_test.shape)
X_test_dtm = vect.transform(X_test)
X_test_dtm

<1393x7456 sparse matrix of type '<class 'numpy.int64'>'
	with 17604 stored elements in Compressed Sparse Row format>

## Part 5: Building and Evaluating Multinomial Naive Bayes Model

In [112]:
"""
import and instantiate a Multinomial Naive Bayes model
There are different types of Naive Bayes; we're using multinomial.
Multinomial Naive Bayes is suitable for classification with discrete features (meaning integer features), such as word counts.
It won't accept negative numbers as features.
Multinomial is very commonly used for text problems.
"""
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [113]:
# train the model using X_train_dtm(timing it with 'magic command')
%time nb.fit(X_train_dtm, y_train)

CPU times: user 5.53 ms, sys: 2.39 ms, total: 7.92 ms
Wall time: 7.08 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [114]:
#make class predictions for X_test_dtm, not X_test
#Step 4, make a prediction
y_pred_class = nb.predict(X_test_dtm)

In [115]:
y_pred_class

array([0, 0, 0, ..., 0, 1, 0])

In [116]:
"""
Accuracy is the percentage of predictions that are correct.
calculate accuracy of class predictions
We're using roughly 4,000 short ham/spam messages to predict whether another 1,400 or so messages are ham or spam,
    based on the words in each message.

pass actual values first, predicted values second.

null accuracy - the accuracy that could be achieved by always predicting the majority class.
   Because of the imbalanced classes (4825 ham vs. 747 spam), that is about 87%.
   If we always predicted ham we would be right about 87% of the time.
"""
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.9885139985642498
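
Null accuracy can be worked out directly from the class distribution printed earlier by `value_counts()` (4825 ham, 747 spam). This short sketch does the arithmetic on the full dataset. The figure for the test split alone could differ slightly.

```python
# Class counts copied from the value_counts() output above.
class_counts = {'ham': 4825, 'spam': 747}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy achieved by always predicting the majority class.
null_accuracy = class_counts[majority_class] / total

print(majority_class, round(null_accuracy, 4))
```

An accuracy of 0.9885 should be judged against this baseline of roughly 0.87 rather than against 0.5.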

In [117]:
"""
print the confusion matrix
It compares actual vs. predicted values for the testing set (rows are actual classes, columns are predicted classes).
TN = 1203, true negatives: ham correctly predicted as ham
TP = 174, true positives: spam correctly predicted as spam
FP = 5, false positives: predicted spam, but the message was ham
FN = 11, false negatives: predicted ham, but the message was spam
"""
metrics.confusion_matrix(y_test, y_pred_class )

array([[1203,    5],
       [  11,  174]])
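
The four cells of the confusion matrix are enough to recompute the accuracy above and a few related metrics by hand. The sketch uses the values printed by `metrics.confusion_matrix`; the metric formulas are the standard definitions.

```python
# Rows are actual classes (0 = ham, 1 = spam); columns are predicted classes.
confusion = [[1203, 5],
             [11, 174]]

(tn, fp), (fn, tp) = confusion
total = tn + fp + fn + tp

accuracy = (tn + tp) / total     # share of all messages classified correctly
precision = tp / (tp + fp)       # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)          # of actual spam, how much was caught
specificity = tn / (tn + fp)     # of actual ham, how much was left alone

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} specificity={specificity:.4f}")
```

The accuracy computed here matches the `accuracy_score` output. Recall is the weakest of the four because the model misses 11 of the 185 spam messages.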

In [118]:
"""
print message text for the false positives (meaning they were incorrectly classified as spam)
The comparison is True wherever the prediction is 1 and the actual value is 0. That Series of True and False values is passed
    to X_test to select the matching rows.
These are the 5 messages that were hand-labeled as ham (the person who built the dataset said they are ham) and that the
    classifier incorrectly predicted as spam.
"""
X_test[y_test < y_pred_class]

574               Waiting for your call.
3375             Also andros ice etc etc
45      No calls..messages..missed calls
3415             No pic. Please re-send.
1988    No calls..messages..missed calls
Name: message, dtype: object

In [119]:
#print message text for false negatives (meaning they were incorrectly classified as ham)
#a false negative is a spam message that was incorrectly classified as ham
X_test[y_test > y_pred_class]


3132    LookAtMe!: Thanks for your purchase of a video...
5       FreeMsg Hey there darling it's been 3 week's n...
3530    Xmas & New Years Eve tickets are now on sale f...
684     Hi I'm sue. I am 20 years old and work as a la...
1875    Would you like to see my XXX pics they are so ...
1893    CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...
4298    thesmszone.com lets you send free anonymous an...
4949    Hi this is Amy, we will be sending you a free ...
2821    INTERFLORA - It's not too late to order Inter...
2247    Hi ya babe x u 4goten bout me?' scammers getti...
4514    Money i have won wining number 946 wot do i do...
Name: message, dtype: object

In [156]:
y_test

1078    0
4028    0
958     0
4642    0
4674    0
5461    0
4210    0
4216    0
1603    0
1504    0
1783    0
3465    0
5534    0
4267    0
2498    0
4259    0
147     1
141     0
4517    1
3053    0
5392    0
2346    0
1242    0
3224    0
4872    0
3044    0
1660    0
3214    0
501     0
1827    0
       ..
2285    0
4829    0
2155    0
3555    0
4582    0
1010    0
2007    0
221     0
1961    1
1141    0
2270    0
2589    0
3779    0
1714    1
1463    1
2694    0
1925    0
3597    0
1988    0
932     0
2870    0
5458    0
2890    0
3658    0
4285    0
3207    0
4655    0
1140    0
1793    1
1710    0
Name: label_num, Length: 1393, dtype: int64

In [120]:
"""
what do you notice about the false negatives
false negatives are a lot longer
Naive Bayes is getting lost in hammy words.
"""
X_test[3132]

"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323."

In [121]:
"""
predicted probabilities - the predicted probability of class membership for each observation.
    What is the model's estimate of the likelihood that a message is spam or ham?
calculate predicted probabilities for X_test_dtm (poorly calibrated)
[:,1] means all rows, column 1. Column 1 is the predicted probability of class 1, and class 1 is 'spam'.
"""
y_pred_prob = nb.predict_proba(X_test_dtm)[:,1]

In [122]:
"""
For the first message in the test set, it thinks the likelihood that it is spam is about 0.0029.
For the last message, it thinks the likelihood that it is spam is almost zero.
1.00 means it thinks the message is 100% spam.
These are the predicted probabilities that were output.
Naive Bayes has poorly calibrated predicted probabilities. It gives very extreme numbers, so they are not very useful
    if you want to interpret them as likelihoods.

You may want your model to output predicted probabilities that can be interpreted as probabilities of class 1, e.g.
   I want my model to tell me that an observation has a 60% chance of belonging to class 1. Maybe I only have time to call 10 prospects per day.
   NB predictions should not be interpreted as actual probabilities.
"""

y_pred_prob

array([2.87744864e-03, 1.83488846e-05, 2.07301295e-03, ...,
       1.09026171e-06, 1.00000000e+00, 3.98279868e-09])
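
One way to see why the probabilities come out so extreme: under the naive independence assumption, each token multiplies the odds by its own likelihood ratio, so a handful of mildly spammy tokens compounds quickly. The sketch below is a simplified model of that effect, not `MultinomialNB` itself. The ratio of 5 per token is an invented value, and the prior uses the training class counts (3617 ham, 562 spam) that `nb.class_count_` reports.

```python
import math

def posterior_spam(prior_spam, likelihood_ratios):
    """Hypothetical helper: P(spam | tokens) when each token contributes an
    independent ratio P(token | spam) / P(token | ham)."""
    log_odds = math.log(prior_spam / (1 - prior_spam))
    log_odds += sum(math.log(ratio) for ratio in likelihood_ratios)
    return 1 / (1 + math.exp(-log_odds))

prior = 562 / (3617 + 562)  # share of spam in the training data

# Assumption for illustration: every token is 5 times more likely in spam.
for n_tokens in (0, 2, 5, 10):
    p = posterior_spam(prior, [5.0] * n_tokens)
    print(n_tokens, f"{p:.7f}")
```

Ten such tokens already push the estimate to within one part in a million of 1, which mirrors the `1.00000000e+00` and `e-09` values in the output above.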

In [158]:
"""
need to slice to get the predicted probability of class 1:
all rows, column 1
"""
nb.predict_proba(X_test_dtm)[:,1]

array([2.87744864e-03, 1.83488846e-05, 2.07301295e-03, ...,
       1.09026171e-06, 1.00000000e+00, 3.98279868e-09])

In [124]:
"""
Even though the predicted probabilities are so extreme, the AUC is still very high. AUC depends only on how well the
    probabilities rank spam above ham, not on how well calibrated they are.
Uses of predicted probabilities:
    1) Some evaluation metrics require predicted probabilities.
       AUC requires predicted probabilities. Log loss requires predicted probabilities.
    2) Maybe you don't actually care about class predictions.
       e.g. credit card fraud: is this transaction fraudulent? You might not care about accuracy (did the predicted
       probability break 50% or not). You might say: if something has more than a 10% likelihood of being
       fraud, I will flag it and disallow payment until the customer authorizes it.
       With fraud you want finely tuned predicted probabilities. You care whether it predicts 2% or 12% fraud.

-calculate AUC (Area Under the Curve)
"""
metrics.roc_auc_score(y_test, y_pred_prob)

0.9866431000536962
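
AUC can be read as the probability that a randomly chosen spam message receives a higher score than a randomly chosen ham message. The pure-Python sketch below computes it that way on made-up scores. `auc_by_ranking` is an illustrative helper, not the scikit-learn implementation.

```python
def auc_by_ranking(y_true, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher.
    Ties count as half a win."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 0, 1]

# Made-up scores: same ordering of messages, very different magnitudes.
extreme = [1e-9, 1e-6, 0.999999, 1e-3, 0.9]
moderate = [0.10, 0.20, 0.80, 0.30, 0.60]
# One ham message (0.70) outranks one spam message (0.60).
flawed = [0.10, 0.70, 0.80, 0.30, 0.60]

auc_extreme = auc_by_ranking(y_true, extreme)
auc_moderate = auc_by_ranking(y_true, moderate)
auc_flawed = auc_by_ranking(y_true, flawed)
print(auc_extreme, auc_moderate, round(auc_flawed, 4))
```

The extreme and moderate scores give the same AUC because only the ordering matters. That is why badly calibrated Naive Bayes probabilities can still produce an excellent AUC.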

## Part 6: Comparing Naive Bayes with Logistic Regression

Compare Naive Bayes with Logistic Regression

In [125]:
#import and instantiate a logistic regression model
# this is a classification model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [126]:
"""
Wall time is bigger: the Naive Bayes fit above took about 7 ms.
Naive Bayes is really fast. Logistic Regression is slower than Naive Bayes.
"""
#train the model using X_train_dtm
%time logreg.fit(X_train_dtm,y_train)

CPU times: user 61.1 ms, sys: 3.61 ms, total: 64.7 ms
Wall time: 35.2 ms


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [127]:
#make a class prediction for X_test_dtm
y_pred_class = logreg.predict(X_test_dtm)

In [128]:
"""
calculate predicted probabilities for X_test_dtm (well calibrated)
Logistic regression probabilities are better calibrated, so they are much more likely to be interpretable as actual likelihoods.
-Logistic Regression is good if you care about predicted probabilities.
   But Logistic Regression is slower than Naive Bayes.
-We don't take Naive Bayes probabilities very seriously.
"""
y_pred_prob = logreg.predict_proba(X_test_dtm)[:,1]

In [129]:
y_pred_prob

array([0.01269556, 0.00347183, 0.00616517, ..., 0.03354907, 0.99725053,
       0.00157706])
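
For a binary problem, logistic regression turns a weighted sum of the token counts into a probability with the sigmoid function. The sketch below shows that calculation with invented weights and an invented intercept. A fitted model stores its learned values in `coef_` and `intercept_`.

```python
import math

def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict_proba_spam(token_counts, weights, intercept):
    """Hypothetical helper: P(spam) = sigmoid(intercept + sum(weight * count))."""
    z = intercept + sum(
        weights.get(token, 0.0) * count for token, count in token_counts.items()
    )
    return sigmoid(z)

# Invented coefficients for illustration only.
weights = {'prize': 2.0, 'claim': 1.5, 'lol': -1.0}
intercept = -3.0

p_spammy = predict_proba_spam({'claim': 1, 'prize': 1}, weights, intercept)
p_hammy = predict_proba_spam({'lol': 2}, weights, intercept)
print(round(p_spammy, 4), round(p_hammy, 4))
```

Because evidence is combined inside a single fitted linear score rather than multiplied token by token, the outputs tend to be less extreme than the Naive Bayes values.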

In [130]:
#calculate accuracy
metrics.accuracy_score(y_test,y_pred_class )

0.9877961234745154

In [131]:
"""
Both accuracy and roc_auc_score are very good in this case, with only minor differences between the models.
calculate AUC
Point: Logistic Regression gives better predicted probabilities, Naive Bayes is faster. Those are 2 ways to think about
    when to use each of the 2 models.
"""
metrics.roc_auc_score(y_test, y_pred_prob)

0.9936817612314301

## Part 7: Calculate the spaminess of each token

In [132]:
"""
-get insight from your models.
#Can do this with Naive Bayes in particular
-Why did certain messages get flagged as spam vs. ham?
-That is a "why" question. Did the model treat individual words as spammy words?
-store the vocabulary of X_train as an object.
-length is 7456, the vocabulary size. The training matrix was 4179 x 7456 b/c 7456 is the vocabulary size.
"""
X_train_tokens = vect.get_feature_names()
len(X_train_tokens)

7456

In [133]:
#examine the first 50 tokens
print(X_train_tokens[:50])

['00', '000', '008704050406', '0121', '01223585236', '01223585334', '0125698789', '02', '0207', '02072069400', '02073162414', '02085076972', '021', '03', '04', '0430', '05', '050703', '0578', '06', '07', '07008009200', '07090201529', '07090298926', '07123456789', '07732584351', '07734396839', '07742676969', '0776xxxxxxx', '07781482378', '07786200117', '078', '07801543489', '07808', '07808247860', '07808726822', '07815296484', '07821230901', '07880867867', '0789xxxxxxx', '07946746291', '0796xxxxxx', '07973788240', '07xxxxxxxxx', '08', '0800', '08000407165', '08000776320', '08000839402', '08000930705']


In [134]:
#examine the last 50 tokens
print(X_train_tokens[-50:])

['yer', 'yes', 'yest', 'yesterday', 'yet', 'yetunde', 'yijue', 'ym', 'ymca', 'yo', 'yoga', 'yogasana', 'yor', 'yorge', 'you', 'youdoing', 'youi', 'youphone', 'your', 'youre', 'yourjob', 'yours', 'yourself', 'youwanna', 'yowifes', 'yoyyooo', 'yr', 'yrs', 'ything', 'yummmm', 'yummy', 'yun', 'yunny', 'yuo', 'yuou', 'yup', 'zac', 'zaher', 'zealand', 'zebra', 'zed', 'zeros', 'zhong', 'zindgi', 'zoe', 'zoom', 'zouk', 'zyada', 'èn', '〨ud']


In [135]:
#naive bayes, when you run a fit,counts the number of times each token appears in a class
# ham is class 0 and spam is class 1

#can use to help us tell which words are hammy and which are spammy

"""
Above, it kept the vocabulary in alphabetical order.

ham is class 0
spam is class 1
The 2 rows represent the 2 classes: class 0 is ham and class 1 is spam. The 7456 columns represent the features/tokens.
Row 0 corresponds to the numerically smallest class.
-The token '00' above was found 0 times in ham messages and 5 times in spam messages.
-'〨ud' appeared 1 time in ham messages and 0 times in spam.

Can use this data to decide which words are hammy and which words are spammy. A word that appears a lot in ham but not in
    spam is a good predictor of a ham message.
How do we know which row is ham and which row is spam?  Kevin assigned ham to class 0 and spam to class 1.

Just like the confusion matrix, it's sorted with the numerically lowest class closest to the upper left corner.
"""
nb.feature_count_

array([[ 0.,  0.,  0., ...,  1.,  1.,  1.],
       [ 5., 23.,  2., ...,  0.,  0.,  0.]])

In [136]:
#rows represent classes and columns represent tokens
#2 rows represent 2 classes: class 0 is ham and class 1 is spam
#7456 columns represent the features (tokens)
#the first token, 00, was found 0 times in ham messages but 5 times in spam messages
#the last token appeared 1 time in ham messages and 0 times in spam messages
#a word that appears a lot in ham but not in spam is a good predictor of a ham message
# 2 x 7456 numpy array
nb.feature_count_.shape

(2, 7456)
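
What `feature_count_` holds can be reproduced on the small 3-document matrix from Part 2 by summing the rows of each class separately. The class labels here are invented for the example.

```python
vocabulary = ['cab', 'call', 'me', 'please', 'tonight', 'you']
dtm = [
    [0, 1, 0, 0, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 1, 1, 2, 0, 0],
]
labels = [0, 1, 0]  # invented labels: documents 0 and 2 are ham, document 1 is spam

n_classes = 2
feature_count = [[0] * len(vocabulary) for _ in range(n_classes)]

# Add each document's token counts to the row of its class.
for row, label in zip(dtm, labels):
    for j, count in enumerate(row):
        feature_count[label][j] += count

for class_index, counts in enumerate(feature_count):
    print(class_index, dict(zip(vocabulary, counts)))
```

The result has one row per class and one column per token, the same layout as the 2 x 7456 array above.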

In [137]:
#first row
#number of times each token appears across all ham messages
ham_token_count = nb.feature_count_[0,:]
ham_token_count


array([0., 0., 0., ..., 1., 1., 1.])

In [138]:
#second row
#number of times each token appears across all spam messages
spam_token_count = nb.feature_count_[1,:]
spam_token_count

array([ 5., 23.,  2., ...,  0.,  0.,  0.])

In [139]:
#create a dataframe of tokens with their separate ham and spam counts
tokens = pd.DataFrame({'token':X_train_tokens, 'ham':ham_token_count, 'spam':spam_token_count}).set_index('token')

In [140]:
tokens

Unnamed: 0_level_0,ham,spam
token,Unnamed: 1_level_1,Unnamed: 2_level_1
00,0.0,5.0
000,0.0,23.0
008704050406,0.0,2.0
0121,0.0,1.0
01223585236,0.0,1.0
01223585334,0.0,2.0
0125698789,1.0,0.0
02,0.0,4.0
0207,0.0,3.0
02072069400,0.0,1.0


In [141]:
tokens.sample(5, random_state=6)

Unnamed: 0_level_0,ham,spam
token,Unnamed: 1_level_1,Unnamed: 2_level_1
very,64.0,2.0
nasty,1.0,1.0
villa,0.0,1.0
beloved,1.0,0.0
textoperator,0.0,2.0


In [142]:
"""
We want to calculate a spamminess score for every word.
We're going to use the ratio of spam to ham to do that.
2 things to do:
1. Account for the class imbalance we were just talking about. There are far fewer spam messages, so a word
appearing in spam has to carry a higher weight.
2. Make sure there are no divide-by-zero errors.
When taking the ratio of 1 column to another, having a 0 is going to be a problem, so add 1 to both columns.
"""
tokens['ham'] = tokens.ham + 1
tokens['spam'] = tokens.spam + 1
tokens.sample(5, random_state=6)

Unnamed: 0_level_0,ham,spam
token,Unnamed: 1_level_1,Unnamed: 2_level_1
very,65.0,3.0
nasty,2.0,2.0
villa,1.0,2.0
beloved,2.0,1.0
textoperator,1.0,3.0


In [143]:
"""
I want to normalize the numbers.
I want to divide by the number of spam messages and ham messages, so I get frequencies rather than counts.

Naive Bayes counts the number of observations in each class:
3617 ham and 562 spam
"""
nb.class_count_

array([3617.,  562.])

In [144]:
"""
Frequencies are a better measure than raw counts b/c they are adjusted for class imbalance.
e.g. 'nasty' is more prevalent in spam messages than in ham messages: 0.003559 vs. 0.000553
Don't take the figures too seriously, since 1 was added to each count before the division.
convert ham and spam counts to frequencies
"""
tokens['ham'] = tokens.ham / nb.class_count_[0]
tokens['spam'] = tokens.spam / nb.class_count_[1]
tokens.sample(5, random_state=6)

Unnamed: 0_level_0,ham,spam
token,Unnamed: 1_level_1,Unnamed: 2_level_1
very,0.017971,0.005338
nasty,0.000553,0.003559
villa,0.000276,0.003559
beloved,0.000553,0.001779
textoperator,0.000276,0.005338
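
The two adjustments (add 1, then divide by the class size) can be checked by hand for the five sampled tokens. The raw counts and the class sizes below are copied from the outputs above.

```python
# (ham count, spam count) per token, from tokens.sample(5, random_state=6).
raw_counts = {
    'very': (64, 2),
    'nasty': (1, 1),
    'villa': (0, 1),
    'beloved': (1, 0),
    'textoperator': (0, 2),
}
n_ham, n_spam = 3617, 562  # from nb.class_count_

# Add 1 to every count to avoid zeros, then divide by the number of messages in the class.
frequencies = {
    token: ((ham + 1) / n_ham, (spam + 1) / n_spam)
    for token, (ham, spam) in raw_counts.items()
}

for token, (ham_freq, spam_freq) in frequencies.items():
    print(f"{token:<14}{ham_freq:.6f}  {spam_freq:.6f}")
```

Note how 'nasty' has the same raw count in both classes, yet its spam frequency is about 6 times its ham frequency once the class sizes are taken into account.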


In [145]:
"""
Of these 5 words, the most spammy is 'textoperator', followed by 'villa', 'nasty', 'beloved', and 'very'.
'very' is the most hammy of the 5. It's the most predictive of ham.

Naive Bayes learns ratios like these and uses them to make predictions.
Based on this training data, if I got a new text message that said 'very nasty villa', Naive Bayes would take the
    per-token values it learned, take their logs and add them up (equivalent to multiplying the values themselves),
    and use the result to decide whether the message is ham or spam.

-calculate the ratio of spam to ham for each token
"""
tokens['spam_ratio'] = tokens.spam / tokens.ham
tokens.sample(5, random_state=6)

Unnamed: 0_level_0,ham,spam,spam_ratio
token,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
very,0.017971,0.005338,0.297044
nasty,0.000553,0.003559,6.435943
villa,0.000276,0.003559,12.871886
beloved,0.000553,0.001779,3.217972
textoperator,0.000276,0.005338,19.307829
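
The "take the logs and add them up" idea from the notes above can be sketched for the message 'very nasty villa' using the `spam_ratio` values just computed. This is a rough illustration only: `MultinomialNB` works from smoothed per-class token probabilities normalized by the total token count of each class, not from these per-message ratios, so the numbers would not match its internals.

```python
import math

# spam_ratio values copied from the table above.
spam_ratio = {'very': 0.297044, 'nasty': 6.435943, 'villa': 12.871886}
message = 'very nasty villa'

# Adding logs is equivalent to multiplying the ratios, but avoids underflow
# when a message contains many tokens.
log_evidence = sum(math.log(spam_ratio[token]) for token in message.split())

# Prior odds of spam vs. ham from the training class counts (562 spam, 3617 ham).
log_prior_odds = math.log(562 / 3617)

log_odds = log_prior_odds + log_evidence
predicted = 'spam' if log_odds > 0 else 'ham'
print(round(log_evidence, 3), round(log_odds, 3), predicted)
```

The hammy 'very' pulls the score down, but the two spammy tokens outweigh it and the prior, so this sketch leans towards spam.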


In [146]:
"""
examine the dataframe sorted by spam_ratio
note: for pandas versions before 0.17, use sort() instead of sort_values()
The words at the top are very spammy in this dataset.
The hammy words in this data are much more conversational: 'lol', 'ask', 'already', 'she', 'he'
For your own problem you may want to dig into which words really contributed to the model predicting ham or spam.
    Great for model diagnostics and a great way to explore what the model has learned.
"""
tokens.sort_values('spam_ratio', ascending=False)

Unnamed: 0_level_0,ham,spam,spam_ratio
token,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
claim,0.000276,0.158363,572.798932
prize,0.000276,0.135231,489.131673
150p,0.000276,0.087189,315.361210
tone,0.000276,0.085409,308.925267
guaranteed,0.000276,0.076512,276.745552
18,0.000276,0.069395,251.001779
cs,0.000276,0.065836,238.129893
www,0.000553,0.129893,234.911922
1000,0.000276,0.056940,205.950178
awarded,0.000276,0.053381,193.078292


In [147]:
"""
look up the spam ratio for a given token. Easiest way is to use the loc method.

loc takes row, column. 'dating' has a high spam ratio, but not as high as others.
Don't read a ratio above 1 as strictly spammy and below 1 as strictly hammy: the 1 added to each count distorts the values.
Don't take the numbers too seriously.
The general scores and the ranking are what you're looking at.
They are helpful for knowing which words indicate a certain class for your documents.

Question: where was the threshold set for ham or spam? The threshold is 50% by default. You can adjust the threshold manually.
Question: can you use Naive Bayes for credit card fraud? Yes. It is a classification model. NB is not only for text-based data.
    NB doesn't know what the data means; it doesn't know it came from text.

Kevin points out that Logistic Regression is not universally more accurate than Naive Bayes. With smaller datasets, NB tends to have
    better accuracy than Logistic Regression. With larger datasets, Logistic Regression tends to do better than NB
    because it has a lower asymptotic error.
"""
tokens.loc['dating','spam_ratio']

83.66725978647686
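
On the threshold question in the notes above: adjusting the threshold manually just means comparing the predicted probabilities against a cutoff of your own choosing instead of 0.5. A minimal sketch, using the six logistic regression probabilities visible in the Part 6 output; the 0.03 cutoff is arbitrary and `classify` is a hypothetical helper.

```python
# The six values shown in the logistic regression y_pred_prob output.
y_pred_prob = [0.01269556, 0.00347183, 0.00616517, 0.03354907, 0.99725053, 0.00157706]

def classify(probabilities, threshold=0.5):
    """Hypothetical helper: predict class 1 when the probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]

default_predictions = classify(y_pred_prob)
cautious_predictions = classify(y_pred_prob, threshold=0.03)

print(default_predictions)   # [0, 0, 0, 0, 1, 0]
print(cautious_predictions)  # [0, 0, 0, 1, 1, 0]
```

Lowering the threshold flags more messages as spam, trading more false positives for fewer false negatives.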

## Part 8: Creating a DataFrame from Individual Text Files

In [148]:
"""
-e.g. my text data is stored in a bunch of separate documents, not in a pre-built dataframe. This is the problem
    Kevin wants to solve.
    Have a bunch of ham and spam emails.

You've got 5 emails and that's the training data. You've dragged and dropped them into
    different folders.
You've got 3 ham emails and 2 spam emails: a ham folder and a spam folder.
Goal is to take those 5 files, treat each file as its own document, and then build a dataframe that
looks just like what we've been using from the SMS dataset.
Each document is its own file and you want to turn that into a dataframe.
glob comes up with a list of file names, so I have something to iterate through.

The job of glob is just to get file names, not to read files.

*.txt - means get all text documents in the folder
Builds a list of files to work with in Python.
-use glob to create a list of ham filenames
"""
import glob
ham_filenames = glob.glob('data_txt/ham/*.txt') 
ham_filenames

['data_txt/ham/email1.txt',
 'data_txt/ham/email3.txt',
 'data_txt/ham/email5.txt']

In [149]:
"""
Build an empty list.
For each of the file names, open that file and append its complete text to the ham_text list.

We've now got a Python list of 3 strings. Each list element (each string) represents 1 document, meaning 1 email.
    \n is a newline character. An entire document, regardless of whether it has 2 lines or 2 thousand lines, gets stored
    as 1 string. That way the documents are separated out as list elements.
-read the contents of the ham files into a list (each list element is 1 email)

"""
ham_text = []
for filename in ham_filenames:
    with open(filename) as f:
        ham_text.append(f.read())
ham_text        



['This is a ham email.\nIt has 2 lines.',
 'This is another ham email.',
 'This is yet another ham email.']

In [150]:
"""
repeat process for spam files.
A list of 2 emails. There were only 2 spam emails.
"""
import glob
spam_filenames = glob.glob('data_txt/spam/*.txt')
spam_text = []
for filename in spam_filenames:
    with open(filename) as f:
        spam_text.append(f.read())
spam_text 

['This is a spam email.', 'This is another spam email.']

In [151]:
"""
Now I've got a list of my 5 documents as strings.
-combine the ham and spam lists
"""
all_text = ham_text + spam_text


In [152]:
all_text

['This is a ham email.\nIt has 2 lines.',
 'This is another ham email.',
 'This is yet another ham email.',
 'This is a spam email.',
 'This is another spam email.']

In [153]:
"""
need to create a list of labels where ham=0 and spam=1
If you have a list with 1 element in it, like [0], and you multiply it by a number, you get a list of that length with
that element repeated.
I want a list with 0 0 0 1 1, in exactly the same order as the documents, to represent ham or spam.

"""
all_labels = [0] * len(ham_text) + [1] * len(spam_text)
all_labels

[0, 0, 0, 1, 1]

In [154]:
"""
Final step: convert these 2 lists to a dataframe by passing a dictionary with the column headers 'label' and 'message'
and the data to the pd.DataFrame constructor.
Now we have a dataframe that looks just like what we built by reading in sms.tsv.
CountVectorizer treats '\n' as whitespace between tokens, so the newline is ignored.
"""
pd.DataFrame({'label': all_labels, 'message':all_text})

Unnamed: 0,label,message
0,0,This is a ham email.\nIt has 2 lines.
1,0,This is another ham email.
2,0,This is yet another ham email.
3,1,This is a spam email.
4,1,This is another spam email.
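
The ham and spam loops above do the same thing, so they can be folded into one function that walks a folder per class. This is a sketch under assumptions: `load_labeled_folders` is a made-up helper, the example writes its own files into a temporary directory instead of using `data_txt`, and the file list is sorted because the order in which `glob` returns names depends on the file system.

```python
import glob
import os
import tempfile

def load_labeled_folders(root, label_by_folder):
    """Hypothetical helper: read every .txt file under root/<folder> and pair it with a label."""
    texts, labels = [], []
    for folder, label in label_by_folder.items():
        # sorted() gives a reproducible order across file systems.
        for filename in sorted(glob.glob(os.path.join(root, folder, '*.txt'))):
            with open(filename, encoding='utf-8') as f:
                texts.append(f.read())  # the whole file becomes one string
            labels.append(label)
    return texts, labels

# Build a throwaway folder structure so the example runs anywhere.
samples = {
    'ham': ['This is a ham email.\nIt has 2 lines.', 'This is another ham email.'],
    'spam': ['This is a spam email.'],
}
with tempfile.TemporaryDirectory() as root:
    for folder, documents in samples.items():
        os.makedirs(os.path.join(root, folder))
        for i, text in enumerate(documents, start=1):
            path = os.path.join(root, folder, f'email{i}.txt')
            with open(path, 'w', encoding='utf-8') as f:
                f.write(text)
    all_text, all_labels = load_labeled_folders(root, {'ham': 0, 'spam': 1})

print(all_labels)
print(all_text)
```

The two returned lists can be passed to `pd.DataFrame({'label': all_labels, 'message': all_text})` exactly as in the cell above.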
