 # Spam or Ham? 
The goal of this project is to obtain a model that reliably labels an SMS as spam or not spam (ham). The model will be obtained through the following steps:

  * Download dataset and set up environment
  * Explore data to gain domain knowledge for feature engineering 
  * Establish benchmark model
  * Preprocess data and create features
  * Implement at least three different supervised learning models and evaluate performance
  * Choose best performing model and perform Grid Search 
  * Evaluate model versus benchmark model
  * Evaluate results


### Metrics
A common metric for evaluating performance in binary classification is 
$accuracy  = \frac{correct\: predictions}{total\: predictions}$. The metric does, however, fall short when the dataset is heavily skewed, as it is in this case. Simply predicting that every SMS is ham would achieve an accuracy of roughly 87\% on this dataset (4825 of the 5572 messages are ham), and in a real-life scenario, where the overwhelming majority of text messages are not spam, it would probably achieve an accuracy close to 100\%. Such a model would nevertheless be completely useless, despite the high accuracy. 


To solve this problem we introduce the precision and recall measures. The
$precision = \frac{true\: positives}{true\: positives\ + \ false \: positives}$ metric describes how many of the messages classified as spam actually were spam. The
$recall = \frac{true\: positives}{true\: positives\ + \ false\: negatives}$ metric describes how many of all spam messages the algorithm was able to correctly identify.

Which of the two metrics to prioritise is decided on a case-by-case basis. When it comes to spam, most people are probably okay with receiving a spam message from time to time but are probably not okay with missing an important text message because it was labeled as spam. Therefore, we will evaluate our model with an emphasis on precision. Luckily, there is another metric,
$F_\beta = (1+ \beta^2) \cdot \frac{precision \cdot recall}{\beta^2 \cdot precision + recall}$, that takes both precision and recall into account. Which of the two is weighted higher is determined by $\beta$: a $\beta$ lower than $1$ puts more emphasis on precision, while a $\beta$ higher than $1$ weighs recall higher. In this project $\beta = 1.5$ will be used, which means the reported scores lean somewhat towards recall; a $\beta$ below $1$ would match the stated emphasis on precision more closely.
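
To make the effect of $\beta$ concrete, here is a minimal sketch that computes precision, recall and $F_\beta$ from confusion-matrix counts. The function names and the counts are made up for illustration and are not results from this dataset.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# Hypothetical counts: 90 spam caught, 10 ham flagged as spam, 30 spam missed.
p, r = precision_recall(tp=90, fp=10, fn=30)
for beta in (0.5, 1.0, 1.5):
    print(f"beta={beta}: F={f_beta(p, r, beta):.3f}")
```

With these counts precision (0.90) is higher than recall (0.75), so the score drops as $\beta$ grows and the weight shifts from precision towards recall.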
## 1. Import Dataset


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

data = pd.read_csv("../input/spam.csv",encoding='latin-1')


# Any results you write to the current directory are saved as output.

Let's inspect the first 5 rows of the data.

In [None]:
data.head()

As we can see above, the CSV import created 3 columns that contain no information. Let's remove those and rename the remaining columns appropriately.

In [None]:
data = data.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
data = data.rename(columns={"v1":"label", "v2":"sms"})
data.head()

Let's change the label to be easier for the computer to process: let 0 represent ham and 1 represent spam.

In [None]:
data['y'] = data.label.map({'ham': 0, 'spam': 1})

In [None]:
print(data.shape)
print(data.label.value_counts())

## Data Exploration 

The dataset consists of 5572 text messages. Every data point consists of two features, the label and the text message itself. The messages are labeled as either spam or ham. There are a total of 4825 text messages labeled ham and 747 labeled spam.


#### Text Length 
Let's start by checking whether there is any difference in length between the two types of text messages. Since spam often contains long messages about deals and things you can win, intuitively the spam SMS should be longer on average.


In [None]:
# Add a column for the length of the SMS
data['length'] = data.sms.str.len()
data.head()

In [None]:
spam = data[data['label'] == 'spam']
ham = data[data['label'] == 'ham']
print("Data for the spam:")
print(spam.length.describe())

print("\nData for the ham:")
print(ham.length.describe())



As suspected, the spam messages seem to be longer on average; in fact, the average length of a ham SMS is almost half that of a spam SMS. 
The standard deviation of the ham SMS length is a lot larger, though, and the maximum length of a ham SMS is over 4 times that of spam. 
Since we are seeing such clear differences in SMS length, we should definitely try to use the length as a feature to improve the model later on. 

#### Word usage 

Let's look at word usage. One nice way of doing this is to construct a word cloud, a graphical representation of the most used words in a large corpus.

We will also check whether spelling gives any indication of an SMS being spam or ham. In email, a lot of spam contains misspellings and weird grammar, which can be a good way of determining whether an email is spam. My first thought was that the same might hold for SMS, but then I realized that when I text it is usually with friends, and I use a lot of slang and definitely do not care much about my spelling. It is nevertheless an interesting thing to look into. 

In [None]:
# Import necessary libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import SnowballStemmer
from wordcloud import WordCloud 
import matplotlib.pyplot as plt


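#nltk.download('stopwords')
#nltk.download('words')  # needed once before nltk.corpus.words can be loaded
#stemmer = SnowballStemmer('english')
word_set = set(nltk.corpus.words.words()) # a set of English words, used to decide whether a word counts as misspelled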
stop_words = set(stopwords.words("english"))

In [None]:
# Strings that collect all non-stop-words per class, used later to build the word clouds
ham_wordlist = ''
total_ham_words = 0
total_misspelled_ham = 0

spam_wordlist = ''
total_spam_words = 0
total_misspelled_spam = 0

# Tokenize on runs of word characters, which drops punctuation and whitespace
tknzr = RegexpTokenizer(r'\w+')
for text in ham.sms:
    tokens = tknzr.tokenize(text.lower())
    # Walk through every word in the message
    for word in tokens:
        total_ham_words += 1 # increment total words for every word
        if word not in stop_words: # only save words that are not in stop words
            ham_wordlist = ham_wordlist + ' ' + word  
        if word not in word_set:
            total_misspelled_ham += 1 # count the total of misspelled words 

    
for text in spam.sms:
    tokens = tknzr.tokenize(text.lower())
    # Walk through every word in the message
    for word in tokens:
        total_spam_words += 1 # increment total words for every word
        if word not in stop_words: # only save words that are not in stop words
            spam_wordlist = spam_wordlist + ' ' + word 
        if word not in word_set:
            total_misspelled_spam += 1 # count the total of misspelled words 
    

In [None]:
spam_wordcloud = WordCloud(background_color="lightgrey", width=600, height=400).generate(spam_wordlist)
ham_wordcloud = WordCloud(background_color="lightgrey", width=600, height=400).generate(ham_wordlist)

In [None]:
# Ham wordcloud
plt.figure( figsize=(10,8))
plt.imshow(ham_wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

In [None]:
# Spam wordcloud

plt.figure( figsize=(10,8))
plt.imshow(spam_wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

As seen in the word clouds above, the most used words differ a lot between the two classes. While the most used words in messages labeled ham do not seem to follow any particular pattern, those in messages labeled spam certainly do. The most used word is "free", which appears in, for example, a lot of gambling advertisements. 
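
A word cloud only shows relative sizes. As a numeric counterpart, the following sketch counts the most frequent non-stop-words with `collections.Counter`. The messages, the stop-word list and the helper name `top_words` are toy assumptions for illustration, not the dataset or the NLTK list used in this notebook.

```python
import re
from collections import Counter

# Toy messages and stop-word list, invented for illustration only.
messages = [
    "Free entry! Call now to claim your free prize",
    "Call me when you are free tonight",
]
toy_stop_words = {"to", "your", "me", "when", "you", "are", "now"}


def top_words(texts, stop_words, n=3):
    """Count lower-cased word tokens, skipping stop words."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"\w+", text.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts.most_common(n)


print(top_words(messages, toy_stop_words))
```

The same function could be applied to `spam.sms` and `ham.sms` separately to get exact counts behind each cloud.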

In [None]:
print('HAM \n total words: {} \n total misspells: {} \n %: {}'.format(total_ham_words,
                                                                     total_misspelled_ham, 
                                                                     total_misspelled_ham*100 / total_ham_words))

print('SPAM \n total words: {} \n total misspells: {} \n %: {}'.format(total_spam_words,
                                                                     total_misspelled_spam, 
                                                                     total_misspelled_spam*100 / total_spam_words))

In [None]:
# Function to calculate the share of correctly spelled words in each message
# (1.0 = every word is in word_set, 0.0 = no word is)
def calculate_misspells(x):
    tokens = tknzr.tokenize(x.lower())
    if len(tokens) == 0:
        return 0
    corr_spelled = [word for word in tokens if word in word_set]
    return len(corr_spelled)/len(tokens)

Across all words in the text messages labeled ham, $16.5\%$ were misspelled. For the text messages labeled spam, the same figure was $35.5\%$.


In [None]:
data['misspells'] = data.sms.apply(calculate_misspells)

In [None]:
spam = data[data['label'] == 'spam']
ham = data[data['label'] == 'ham']
print("Data for the spam:")
print(spam.misspells.describe())

print("\nData for the ham:")
print(ham.misspells.describe())

When looking at the misspellings on a record-by-record basis, the statistic is the share of correctly spelled words, where $1.0$ denotes a text message that contains no errors and $0.0$ denotes a message where all words were misspelled. Even though the result is not as clear as in the case of text length, the possibility of using misspellings as a feature should be investigated. 

In [None]:
data_ready = data.drop([ "label"], axis=1)

In [None]:
data_ready.head()

In [None]:
# Create a function to remove all stopwords from a message
def remove_stopwords(x):
    #print(x)
    tokens = tknzr.tokenize(x.lower())
    #print(tokens)
    stop_removed = [word for word in tokens if word not in stop_words]
    
    return " ".join(stop_removed)

In [None]:
data_ready.sms = data.sms.apply(remove_stopwords)

In [None]:
data_x = data_ready.drop(['y'], axis=1)

In [None]:
data_x.head()

### Tf-idf
Tf-idf is a numerical statistic that reflects how important a word is to a document in a corpus. It will be used to turn the text messages into feature vectors. The idea is to treat each document as a bag of words while retaining the information about the occurrences of each word. 

Tf-idf consists of two statistics, tf and idf. Tf is the term frequency: the number of times term $i$ occurs in document $j$, $n_{i,j}$, divided by the total number of terms in that document, $$ tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} $$ 
Idf stands for inverse document frequency and measures how much information a word provides. A word that appears in many documents carries less information than a word that appears in few. Idf is calculated as $idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$, where $|D|$ is the number of documents and the denominator is the number of documents that contain term $t_i$. The tf-idf weight is then the product $\mbox{tf-idf}_{i,j} = tf_{i,j} \cdot idf_i$. A common variant dampens the term frequency with a logarithm, $\mbox{tf-idf}_{i,j} = (1 + \log n_{i,j}) \cdot \log \frac{|D|}{|\{d : t_i \in d\}|}$.
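
The definitions above can be sketched in a few lines of plain Python. This is an illustration on a toy corpus with made-up helper names, not the computation performed by `TfidfVectorizer`, which by default uses a smoothed idf and normalises each document vector, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus of tokenized documents, invented for illustration only.
docs = [
    ["free", "prize", "call", "now"],
    ["call", "me", "later"],
    ["free", "free", "entry"],
]


def tf(term, doc):
    """Term count divided by the number of terms in the document."""
    return Counter(doc)[term] / len(doc)


def idf(term, corpus):
    """Log of (number of documents / documents containing the term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)


def tf_idf(term, doc, corpus):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, corpus)


for term in ("free", "entry"):
    print(term, round(tf_idf(term, docs[2], docs), 3))
```

In the third document "free" occurs twice and "entry" once, yet "entry" receives the higher weight because it appears in only one document of the corpus.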


### Naive Bayes
Naive Bayes is a simple supervised learning method based on applying Bayes' theorem. The classifier assigns a record the class that has the highest probability given the features, that is, it finds the $C_k$ with the highest probability $p(C_k | x_1,...,x_n)$, where $C_k$ is one of the possible classes and $x_i$ is a feature. This probability is infeasible to estimate directly when the number of features is large, so Bayes' theorem is used to rewrite the problem into a solvable form, $p(C_k | x_1,...,x_n) = \frac{p(C_k)p(\mathbf{x}|C_k)}{p(\mathbf{x})}$. 

To be used as a classifier, the formula above is simplified under the "naive" assumption that the features are conditionally independent given the class, so that $p(\mathbf{x}|C_k) = \prod_{i=1}^{n} p(x_i \mid C_k)$, and a decision rule is added. The denominator $p(\mathbf{x})$ is the same for every class and can be dropped. The most common decision rule is to pick the most probable hypothesis, known as the maximum a posteriori (MAP) rule. This gives the following formula for the Naive Bayes classifier: ${\displaystyle {\hat {y}}={\underset {k\in \{1,\dots ,K\}}{\operatorname {argmax} }}\ p(C_{k})\displaystyle \prod _{i=1}^{n}p(x_{i}\mid C_{k}).}$

Naive Bayes has been shown to work well for binary classification, and many of the early spam detectors were implemented using Naive Bayes.
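
The MAP rule can be illustrated with a small from-scratch multinomial Naive Bayes on word counts. This is a sketch under assumptions: the training messages are invented, the `fit` and `predict` helpers are hypothetical names, and Laplace smoothing with `alpha=1.0` is one common choice rather than anything fixed by the text above. Log probabilities are summed instead of multiplying raw probabilities to avoid numerical underflow.

```python
import math
from collections import Counter, defaultdict

# Toy training data, invented for illustration only (1 = spam, 0 = ham).
train = [
    (["free", "prize", "call"], 1),
    (["free", "entry", "win"], 1),
    (["call", "me", "later"], 0),
    (["see", "you", "later"], 0),
    (["lunch", "later"], 0),
]


def fit(samples, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    class_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    for tokens, label in samples:
        word_counts[label].update(tokens)
    vocab = {w for tokens, _ in samples for w in tokens}
    log_prior = {c: math.log(n / len(samples)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_like


def predict(tokens, log_prior, log_like):
    """Return the MAP class: argmax of log prior plus summed log likelihoods."""
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_like[c][w] for w in tokens if w in log_like[c]
        )
    return max(scores, key=scores.get)


log_prior, log_like = fit(train)
print(predict(["free", "prize"], log_prior, log_like))
print(predict(["see", "you", "later"], log_prior, log_like))
```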

### Train test split
The data will be split into a training set and a test set to evaluate the performance of the model. If this were not done, we would have no way of knowing whether the model actually generalises or is just overfitting to the data. 
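
With skewed classes, a purely random split can leave the test set with noticeably more or less spam than the training set. One optional refinement, not used in this notebook, is a stratified split that keeps the class proportions in both parts (scikit-learn's `train_test_split` offers this through its `stratify` parameter). The sketch below shows the idea in plain Python; the function name and the toy labels are assumptions for illustration.

```python
import random


def stratified_split(labels, test_size=0.2, seed=1):
    """Return (train_idx, test_idx) with class proportions kept in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut off the test share.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with a similar kind of imbalance: 80 ham (0) and 20 spam (1).
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx), sum(labels[i] for i in test_idx))
```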

### Benchmark
Since the data is heavily skewed, a simple benchmark is to classify everything as ham (all predictions are 0). This naive predictor achieves an accuracy of $0.87$ and a macro-averaged $F_{1.5}$ score of $0.48$.
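
The benchmark numbers can be checked by hand from the class counts reported in the Data Exploration section. The sketch below assumes the macro average used in the evaluation code, that is, the unweighted mean of the per-class $F_{1.5}$ scores; the helper `f_beta` is written out here for illustration.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall (0 if both are 0)."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# Class counts from the Data Exploration section.
n_ham, n_spam = 4825, 747
total = n_ham + n_spam

# Predicting ham for everything is correct exactly for the ham messages.
accuracy = n_ham / total

# Ham as the positive class: every ham is found, but all spam is also called ham.
f_ham = f_beta(precision=n_ham / total, recall=1.0, beta=1.5)
# Spam as the positive class: nothing is predicted as spam, so the score is 0.
f_spam = f_beta(precision=0.0, recall=0.0, beta=1.5)

macro_f = (f_ham + f_spam) / 2
print(f"accuracy={accuracy:.2f}, macro F1.5={macro_f:.2f}")
```

The spam class contributes a score of zero, which is why the macro-averaged value is so much lower than the accuracy.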

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(data_x,data["y"], test_size = 0.2, random_state = 1)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
transvector = TfidfVectorizer()

tfidf1 = transvector.fit_transform(X_train.sms)

In [None]:
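# toarray() returns a plain ndarray; todense() returns np.matrix, which recent scikit-learn versions reject
X_train_df = tfidf1.toarray()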

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Initialize a scaler, then apply it to the features
# The misspells are already scaled between 0-1 but I had to include it due to some weird error 
# when I tried to only minmax the length feature. 
scaler = MinMaxScaler() 
X_train[['length', 'misspells']] = scaler.fit_transform(X_train[['length', 'misspells']])

In [None]:
X_train.head()

In [None]:
# Convert the Pandas dataframe to a NumPy array so it can be concatenated with the tfidf features
# (as_matrix() was removed in pandas 1.0; to_numpy() is its replacement)
X_train[['length', 'misspells']].to_numpy()

In [None]:
X_train_final = np.concatenate((X_train_df , X_train[['length', 'misspells']].to_numpy()), axis=1)

In [None]:
# Transform test set with the vectorizer and scaler fitted on the training set
tfidf_test = transvector.transform(X_test.sms)
X_test_df = tfidf_test.toarray()
X_test[['length', 'misspells']] = scaler.transform(X_test[['length', 'misspells']])
X_test_final = np.concatenate((X_test_df , X_test[['length', 'misspells']].to_numpy()), axis=1)

In [None]:
# Try using both naive bayes models 
prediction = dict()
from sklearn.naive_bayes import GaussianNB, MultinomialNB
gnb = GaussianNB()
clf = MultinomialNB()
gnb.fit(X_train_final,y_train)
clf.fit(X_train_final,y_train)

In [None]:
prediction["gaussian"] = gnb.predict(X_test_final)
prediction["multinom"] = clf.predict(X_test_final)

In [None]:
# Compare models 
print("F-score Gaussian, F-score Multinom, Accuracy Gaussian, Accuracy Multinom")
from sklearn.metrics import fbeta_score, accuracy_score
print(fbeta_score( y_test, prediction["gaussian"], average='macro', beta=1.5))
print(fbeta_score( y_test, prediction["multinom"], average='macro', beta=1.5))
print(accuracy_score( y_test, prediction["gaussian"]))
print(accuracy_score( y_test, prediction["multinom"]))

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer


In [None]:
# Perform Grid Search on multinomial
param_grid = {'alpha': [0, 0.5, 1, 2,5,10], 'fit_prior': [True, False]}
multinom = MultinomialNB()
scorer = make_scorer(fbeta_score, beta=1.5)
clf = GridSearchCV(multinom, param_grid, scoring=scorer)
clf.fit(X_train_final,y_train)

In [None]:
best_clf = clf.best_estimator_

In [None]:
best_predictions = best_clf.predict(X_test_final)

In [None]:
print("Best model F-beta and Accuracy:")
print(fbeta_score( y_test, best_predictions, average='macro', beta=1.5))
print(accuracy_score( y_test, best_predictions))

In [None]:
#Benchmark model: 
print("Benchmark model metrics on complete set and test set:")
#whole dataset
print(fbeta_score( data.y, np.zeros_like(data.y), average='macro', beta=1.5))
print(accuracy_score( data.y, np.zeros_like(data.y)))
#test set
print(fbeta_score( y_test, np.zeros_like(y_test), average='macro', beta=1.5))
print(accuracy_score( y_test, np.zeros_like(y_test)))

In [None]:
# Misclassified as spam (ham predicted as spam)
X_test[y_test < best_predictions ]

In [None]:
# Misclassified as ham (spam predicted as ham)
X_test[y_test > best_predictions]

In [None]:
from sklearn.model_selection import KFold

In [None]:
# Testing the robustness of the model


kf = KFold(n_splits=5)
print("Scores of the different folds")
for train_index, test_index in kf.split(data_x):
    # Copy the slices so the scaled columns can be assigned without touching data_x
    X_train, X_test = data_x.iloc[train_index].copy(), data_x.iloc[test_index].copy()
    y_train, y_test = data.y.iloc[train_index], data.y.iloc[test_index]
    
    # Fit the vectorizer and the scaler on the training fold only
    tfidf_train = transvector.fit_transform(X_train.sms)
    X_train_df = tfidf_train.toarray()
    X_train[['length', 'misspells']] = scaler.fit_transform(X_train[['length', 'misspells']])
    X_train_final = np.concatenate((X_train_df , X_train[['length', 'misspells']].to_numpy()), axis=1)

    # Apply the fitted transformations to the held-out fold
    tfidf_test = transvector.transform(X_test.sms)
    X_test_df = tfidf_test.toarray()
    X_test[['length', 'misspells']] = scaler.transform(X_test[['length', 'misspells']])
    X_test_final = np.concatenate((X_test_df , X_test[['length', 'misspells']].to_numpy()), axis=1)
    
    clf = MultinomialNB(alpha=0.5, fit_prior=False)
    clf.fit(X_train_final,y_train)
    
    predictions = clf.predict(X_test_final)
    
    print("fbeta:")
    print(fbeta_score( y_test, predictions, average='macro', beta=1.5))
    print("accuracy:")
    print(accuracy_score( y_test, predictions))
