## Ham vs Spam

![Ham-vs-Spam](https://raw.githubusercontent.com/sik-flow/nlp_text_classification/master/Pics/span-vs-ham.png)

In [2]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

We are going to build a classifier to determine whether a message is ham or spam.  The data set is from the UCI Machine Learning Repository and can be found [here](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection).

First, let's read in the data and see what we have.

In [3]:
# Read in the Data

data_source = 'https://raw.githubusercontent.com/sik-flow/nlp_text_classification/master/Data/SMSSpamCollection'
df = pd.read_csv(data_source, delimiter= '\t', header=None)
df.columns = ['ham_spam', 'text']

In [None]:
df.head()

<img align="left" width="500" height="500" src="https://raw.githubusercontent.com/sik-flow/nlp_text_classification/master/Pics/dfhead1.png">

Let's first take a look at the most common words in ham messages and the most common words in spam messages to see if there are any trends.

In [5]:
print ('10 most common words in ham messages:')
pd.Series(' '.join(df[df['ham_spam'] == 'ham'].text).lower().split()).value_counts()[:10]

10 most common words in ham messages:


i      2181
you    1669
to     1552
the    1125
a      1058
u       881
and     846
in      790
my      745
is      717
dtype: int64

In [6]:
print ('10 most common words in spam messages:')
pd.Series(' '.join(df[df['ham_spam'] == 'spam'].text).lower().split()).value_counts()[:10]

10 most common words in spam messages:


to      685
a       375
call    342
your    263
you     252
the     204
for     202
or      188
free    180
2       169
dtype: int64

As you can see above, for ham messages the most common words are "I", "you", "to", and "the".  These words are known as [stop words](https://en.wikipedia.org/wiki/Stop_words).  Stop words are very common words that are often filtered out when doing Natural Language Processing because they say little about the content of a message.

For spam messages we also see some stop words ("to", "a", "you", "the"), along with some words that are likely to carry more signal, such as "call" and "free".

To start, we are going to leave the stop words in and see the results.  That way we can compare the effect of removing stop words. 

I'm now going to take the 10 most common words from ham and spam messages and turn each one into a binary column that indicates whether a given message contains that word.

In [7]:
# Most common words in ham messages 
ham_words = pd.Series(' '.join(df[df['ham_spam'] == 'ham'].text).lower().split()).value_counts()[:10] \
    .index.values.tolist()
    
# Most common words in spam messages 
spam_words = pd.Series(' '.join(df[df['ham_spam'] == 'spam'].text).lower().split()).value_counts()[:10] \
    .index.values.tolist()
    
# Most common words of both messages
common_words = ham_words + spam_words

# Remove duplicate words
common_words = list(set(common_words))

In [8]:
# Loop to make columns and check if each word in common_words is in the message 

for word in common_words:
    df[word] = df['text'].str.contains(" " + word + " ")

In [None]:
df.head()

<img align="left" width="900" height="900" src="https://raw.githubusercontent.com/sik-flow/nlp_text_classification/master/Pics/dfhead2.png">
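
One thing to be aware of: the substring test `" " + word + " "` only matches a word with a space on both sides, so it misses words at the very start or end of a message, words followed by punctuation, and capitalized forms such as "Free". Below is a minimal sketch of a stricter check, assuming Python's built-in `re` module and a hypothetical helper named `contains_word`. It is an illustration only, not the code that produced the numbers in this post.

```python
import re

def contains_word(message, word):
    """Return True if `word` occurs in `message` as a whole word, ignoring case."""
    # re.escape guards against words that contain regex metacharacters.
    # (?<!\w) and (?!\w) require a non-word character (or the string edge) on each side.
    pattern = r'(?<!\w)' + re.escape(word) + r'(?!\w)'
    return re.search(pattern, message, flags=re.IGNORECASE) is not None

# Hypothetical messages, made up for illustration
print(contains_word("Free entry in 2 a weekly comp", "free"))  # word at the start, capitalized
print(contains_word("I am carefree today", "free"))            # only part of a longer word
```

With pandas, the same pattern could be passed to `df['text'].str.contains(pattern, case=False, regex=True)`. Switching to it would change the feature columns, so the accuracy figures below would no longer match exactly.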

We now have our data, so we are going to use [Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html).  We use Bernoulli Naive Bayes because each of our features is binary (a word is either present in a message or it is not).

In [10]:
from sklearn.naive_bayes import BernoulliNB

# Grab Data
X = df[common_words]
y = df['ham_spam']

clf = BernoulliNB()
clf.fit(X, y)
clf.score(X, y)

0.89501076812634606

We ended up with an accuracy of almost 90%. We're done, right?

![Not so fast, my friend](https://raw.githubusercontent.com/sik-flow/nlp_text_classification/master/Pics/corso.png)

We need to check whether our ham_spam column is unbalanced. If 99% of the messages are ham, getting almost 90% is not very impressive!

In [39]:
df['ham_spam'].value_counts()

ham     4825
spam     747
Name: ham_spam, dtype: int64

In [40]:
len(df[df['ham_spam'] == 'ham']) / len(df)

0.8659368269921034

The ham_spam column is unbalanced.  If we had predicted every message to be ham, we would have ended up with an accuracy of almost 87%.  So getting 89% accuracy isn't so great anymore.

Let's look at the confusion matrix to see how we did.

In [41]:
from sklearn.metrics import confusion_matrix
y_pred = clf.predict(X)
confusion_matrix(y, y_pred)

array([[4699,  126],
       [ 459,  288]])

In the matrix, rows are the actual labels and columns are the predicted labels, with ham first and spam second.  We correctly predicted 4,699 ham messages as ham and incorrectly predicted 126 ham messages as spam.  We incorrectly predicted 459 spam messages as ham and correctly predicted 288 spam messages as spam.
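
Because the classes are unbalanced, it can help to look past accuracy at precision and recall for the spam class. The sketch below derives them from the confusion matrix printed above using plain Python; the variable names are my own and the numbers are copied from that output.

```python
# Confusion matrix from above: rows are actual labels, columns are predictions,
# in the order [ham, spam].
matrix = [[4699, 126],
          [459, 288]]

true_ham, false_spam = matrix[0]   # actual ham: predicted ham, predicted spam
false_ham, true_spam = matrix[1]   # actual spam: predicted ham, predicted spam

total = true_ham + false_spam + false_ham + true_spam
accuracy = (true_ham + true_spam) / total

# Of the messages we flagged as spam, what share really was spam?
spam_precision = true_spam / (true_spam + false_spam)
# Of the messages that really were spam, what share did we catch?
spam_recall = true_spam / (true_spam + false_ham)

print(f"accuracy:       {accuracy:.3f}")
print(f"spam precision: {spam_precision:.3f}")
print(f"spam recall:    {spam_recall:.3f}")
```

Seen this way, the model catches fewer than half of the spam messages (288 of 747), which the accuracy figure alone hides.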

### Stop Words

Let's remove stop words and see if that helps out.

In [42]:
# Take an explicit copy so that adding columns later does not touch df
df_stop = df[['ham_spam', 'text']].copy()

In [43]:
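# Newer scikit-learn releases expose the list in sklearn.feature_extraction.text;
# the module is aliased so the stop_words.ENGLISH_STOP_WORDS name keeps working
from sklearn.feature_extraction import text as stop_words
print(stop_words.ENGLISH_STOP_WORDS)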

frozenset({'seem', 'who', 'hereupon', 'con', 'is', 'towards', 'un', 'being', 'whenever', 'well', 'everywhere', 'myself', 'third', 'fill', 'above', 'be', 'beyond', 'etc', 'few', 'its', 'part', 'all', 'anything', 'almost', 'herein', 'made', 'however', 'such', 'hence', 'fifty', 'nine', 'serious', 'them', 'except', 'own', 'between', 'empty', 'seeming', 'what', 'among', 'became', 'give', 'hasnt', 'everyone', 'hereafter', 'same', 'cannot', 'so', 'some', 'this', 'next', 'our', 'though', 'seemed', 'ours', 'it', 're', 'system', 'her', 'again', 'might', 'must', 'my', 'thru', 'about', 'whole', 'his', 'thereafter', 'there', 'can', 'may', 'through', 'for', 'someone', 'whither', 'cant', 'amount', 'how', 'last', 'mine', 'off', 'these', 'herself', 'against', 'formerly', 'via', 'whereas', 'thus', 'that', 'whom', 'enough', 'from', 'perhaps', 'below', 'could', 'take', 'then', 'we', 'amoungst', 'sometimes', 'eleven', 'she', 'nor', 'sincere', 'to', 'whatever', 'or', 'yours', 'due', 'mostly', 'in', 'see', '

Above is the list of stop words in sklearn.  There is no "definitive" list of stop words; NLTK, for example, ships a slightly different list.

Let's remove the stop words now.

In [44]:
# Make everything lowercase
df['text'] = df['text'].str.lower()

# Remove punctuation (regex=True makes recent pandas treat the pattern as a regex)
df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)

# Remove stop words
stop = stop_words.ENGLISH_STOP_WORDS
df_stop['text_without_stopwords'] = df['text'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop))

Now that stop words are removed, let's see what the most common words are for ham and spam messages.

In [45]:
print ('20 most common words in ham messages:')
pd.Series(' '.join(df_stop[df_stop['ham_spam'] == 'ham']['text_without_stopwords']).lower().split()).value_counts()[:20]

20 most common words in ham messages:


u        985
im       460
2        309
just     290
dont     276
ltgt     276
ok       273
ur       246
ill      242
know     232
got      232
like     231
come     227
good     224
love     190
time     189
day      188
4        174
ü        169
going    167
dtype: int64

In [46]:
print ('20 most common words in spam messages:')
pd.Series(' '.join(df_stop[df_stop['ham_spam'] == 'spam']['text_without_stopwords']).lower().split()).value_counts()[:20]

20 most common words in spam messages:


free      216
2         173
txt       150
u         147
ur        144
mobile    123
text      120
4         119
stop      115
claim     113
reply     101
prize      92
just       78
won        73
new        69
send       68
nokia      65
urgent     63
cash       62
win        60
dtype: int64

Looking at the most common words for ham messages we can see trends: "u", "im", "ill" (the apostrophes were stripped along with the rest of the punctuation).  Spam messages also show trends, such as "free", "text", "txt", "mobile", and "claim".

Let's make a new data set with these words and see if we can improve our model.

In [47]:
# Most common words in ham messages 
ham_ns_words = pd.Series(' '.join(df_stop[df_stop['ham_spam'] == 'ham']['text_without_stopwords']).lower().split()).value_counts()[:20].index.values.tolist() 
    
# Most common words in spam messages 
spam_ns_words = pd.Series(' '.join(df_stop[df_stop['ham_spam'] == 'spam']['text_without_stopwords']).lower().split()).value_counts()[:20].index.values.tolist() 
    
# Most common words of both messages
common_ns_words = ham_ns_words + spam_ns_words

# Remove duplicate words
common_ns_words = list(set(common_ns_words))

In [48]:
# Loop to make columns and check if each word in common_ns_words is in the message 

for word in common_ns_words:
    df_stop[word] = df_stop['text_without_stopwords'].str.contains(" " + word + " ")

In [49]:
X = df_stop[common_ns_words]
y = df_stop['ham_spam']

clf = BernoulliNB()
clf.fit(X, y)
clf.score(X, y)

0.94669777458722182

In [50]:
y_pred = clf.predict(X)
confusion_matrix(y, y_pred)

array([[4771,   54],
       [ 243,  504]])

We ended up with an accuracy of almost 95% and our confusion matrix looks better as well.  Originally, we had 126 messages that were classified as spam that were ham, now we only have 54 messages that were classified as spam that were ham.  

### TF-IDF

Next we'll see if we can improve our score even more using TF-IDF.  TF-IDF stands for Term Frequency-Inverse Document Frequency.  Term frequency measures how often a word appears in a document, while inverse document frequency downscales words that appear in many documents.  In short, words with a high TF-IDF score appear frequently in one document but rarely in the others, so they provide the most information about that specific document.
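
To make the idea concrete, here is a small hand-rolled sketch of the textbook TF-IDF formula on three made-up messages. The documents and the helper name `tf_idf` are assumptions for illustration, and this is not how scikit-learn computes it internally.

```python
import math
from collections import Counter

# Three tiny hypothetical "documents", made up for illustration
docs = [
    "free prize call now",
    "call me when you are free",
    "see you at lunch",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(tokens):
    """Textbook TF-IDF: (count / document length) * log(n_docs / document frequency)."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }

# Score the words of the first message, highest score first
scores = tf_idf(tokenized[0])
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{word:6s} {score:.3f}")
```

In the first message, "prize" and "now" score higher than "free" and "call" because the latter two also appear in the second message. scikit-learn's `TfidfVectorizer` uses a smoothed variant of the IDF term and normalizes each row by default, so its numbers differ from this textbook form, but the idea is the same.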

Let's convert our data into a TF-IDF matrix.

In [51]:
from sklearn.feature_extraction.text import TfidfVectorizer

# stop_words must be passed by keyword; the first positional argument is `input`
tfidf = TfidfVectorizer(stop_words='english')

tf_idf = tfidf.fit_transform(df_stop['text_without_stopwords'])

In [52]:
clf.fit(tf_idf, y)
clf.score(tf_idf, y)

0.98546302943287867

In [53]:
y_pred = clf.predict(tf_idf)
confusion_matrix(y, y_pred)

array([[4823,    2],
       [  79,  668]])

We ended up with an accuracy of almost 99% and our confusion matrix looks better as well.  We also ended up with only 2 ham messages classified as spam.  This matters because a ham message that is flagged as spam will most likely never be seen by the user, even though they need to see it.
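
One caveat: every score in this post was computed on the same messages the model was trained on, which tends to flatter the model. Below is a minimal sketch of a held-out evaluation, assuming scikit-learn's `train_test_split` and `make_pipeline` and a hypothetical helper named `holdout_accuracy`. It runs on a handful of made-up messages; on the real data you could pass `df_stop['text_without_stopwords']` and `df_stop['ham_spam']` instead. Note also that `BernoulliNB` binarizes its input by default, so it only uses whether a TF-IDF weight is non-zero, not how large it is.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

def holdout_accuracy(texts, labels, test_size=0.25, seed=0):
    """Fit TF-IDF + BernoulliNB on a training split and score on unseen messages."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, stratify=labels, random_state=seed
    )
    # The pipeline learns the vocabulary from the training split only,
    # so nothing from the test messages leaks into the features.
    model = make_pipeline(TfidfVectorizer(), BernoulliNB())
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

# Hypothetical toy messages, made up for illustration
texts = [
    "ok see you at lunch", "im going home now", "dont forget the keys",
    "love you good night", "ill call you later", "got your message thanks",
    "come over when you can", "what time is the movie",
    "free prize claim now", "urgent you have won cash", "txt stop to end",
    "win a new nokia mobile", "claim your free mobile prize", "reply now to win cash",
    "urgent call to claim prize", "free txt send stop to opt out",
]
labels = ["ham"] * 8 + ["spam"] * 8

print(holdout_accuracy(texts, labels))
```

The toy score itself means little; the point is that the number comes from messages the model never saw during fitting.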

### Summary

We learned the following tools/techniques:

1. What stop words are
2. Basic Text Classification using most common words
3. Basic Text Classification using TF-IDF

Use these NLP skills for good. Try classifying IMDB reviews using data [here](http://ai.stanford.edu/~amaas/data/sentiment/).

![Use NLP for good](https://raw.githubusercontent.com/sik-flow/nlp_text_classification/master/Pics/drevil1.jpg)