### In this notebook, we look at how to use text analysis techniques and the Naive Bayes algorithm to detect spam messages

In [19]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

### 1 - Import and preprocess data

The dataset comes from the UCI Machine Learning repository: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection 

In [6]:
df = pd.read_table('smsspamcollection/SMSSpamCollection', sep='\t', header=None)
df.columns = ['label', 'sms_message']
df.head()

| | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [7]:
# Encode the labels as integers: spam -> 1, ham -> 0
df['label'] = df.label.map({'spam': 1, 'ham': 0})
df.label.value_counts()

0    4825
1     747
Name: label, dtype: int64
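
The counts above show a strong class imbalance. As a rough illustration (plain arithmetic on the two counts printed above, nothing else assumed), the sketch below computes the spam share and the accuracy a trivial classifier would reach by always predicting ham:

```python
# Class counts taken from the value_counts() output above
ham, spam = 4825, 747
total = ham + spam

# Fraction of messages that are spam
spam_share = spam / total

# Accuracy of a classifier that labels every message as ham
baseline_accuracy = ham / total

print(f"Spam share: {spam_share:.3f}")
print(f"Accuracy of always predicting ham: {baseline_accuracy:.3f}")
```

Because this trivial baseline is already right on most messages, accuracy alone says little about how well spam is detected.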

In [39]:
X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], 
                                                    df['label'], 
                                                    random_state=1, test_size=0.2)
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

Number of rows in the training set: 4457
Number of rows in the test set: 1115


### 2 - Define Bag of Words
Here we use the sklearn [CountVectorizer class](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) to build a Bag of Words representation: each message becomes a vector of word counts over the vocabulary learned from the training set.

In [40]:
count_vector = CountVectorizer()

# Learn the vocabulary from the training messages and return their word-count matrix
training_data = count_vector.fit_transform(X_train)

# Transform the test messages with the training vocabulary (no refitting on test data)
testing_data = count_vector.transform(X_test)
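
To make the fit/transform distinction concrete, here is a minimal sketch on three made-up training messages and one made-up test message (the toy sentences are assumptions for illustration, not rows from the dataset). It uses the default `CountVectorizer` settings, which lowercase the text and keep tokens of two or more characters:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages, not taken from the SMS dataset
train_docs = ["Free entry to win", "Ok see you", "Win free prize now"]
test_docs = ["Free cash now"]

vectorizer = CountVectorizer()

# fit_transform learns the vocabulary and counts words in the training messages
train_counts = vectorizer.fit_transform(train_docs)

# transform reuses that vocabulary; words unseen during fit ("cash") are dropped
test_counts = vectorizer.transform(test_docs)

# Column order of the count matrices follows the sorted vocabulary
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(train_counts.toarray())
print(test_counts.toarray())
```

The test message ends up with counts only for "free" and "now", since "cash" never appeared in the training messages.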

### 3 - Apply Naive Bayes model

We apply the multinomial Naive Bayes algorithm, which is suited to discrete features such as word counts.

In [41]:
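# alpha=1.0 is Laplace (add-one) smoothing, which is also the default
naive_bayes = MultinomialNB(alpha=1.)
# MultinomialNB accepts the sparse count matrix directly, so no .toarray() is needed
naive_bayes.fit(training_data, y_train)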

MultinomialNB()
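
For intuition about what the fitted model computes, the following is a simplified NumPy sketch of multinomial Naive Bayes with add-one smoothing. It is not scikit-learn's implementation; the four-word vocabulary and the count matrix are hypothetical toy data:

```python
import numpy as np

# Hypothetical word counts; columns stand for the words ["free", "win", "ok", "see"]
X = np.array([
    [2, 1, 0, 0],  # spam
    [1, 2, 0, 0],  # spam
    [0, 0, 1, 1],  # ham
    [0, 0, 2, 1],  # ham
    [0, 1, 1, 1],  # ham
])
y = np.array([1, 1, 0, 0, 0])  # 1 = spam, 0 = ham
alpha = 1.0  # add-one (Laplace) smoothing

classes = np.unique(y)

# Log prior: share of training messages in each class
log_prior = np.log(np.array([(y == c).mean() for c in classes]))

# Smoothed per-class word counts, normalised into word probabilities
counts = np.array([X[y == c].sum(axis=0) for c in classes])
smoothed = counts + alpha
log_likelihood = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

def predict(x_new):
    # Score per class: log prior + sum over words of count * log P(word | class)
    scores = log_prior + x_new @ log_likelihood.T
    return classes[np.argmax(scores, axis=1)]

new_messages = np.array([[1, 1, 0, 0], [0, 0, 1, 2]])
pred = predict(new_messages)
print(pred)
```

Smoothing keeps a word that never occurred in one class from forcing that class's probability to zero.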

In [42]:
# predict also accepts the sparse matrix directly
predictions = naive_bayes.predict(testing_data)

##### To evaluate performance, given the strong class imbalance, we use the **F1 score** (harmonic mean of precision and recall)

In [43]:
print('Accuracy score: ', accuracy_score(y_test, predictions))
print('Precision score: ', precision_score(y_test, predictions))
print('Recall score: ', recall_score(y_test, predictions))
print(' ')
print('F1 score: ', f1_score(y_test, predictions))

Accuracy score:  0.9901345291479821
Precision score:  0.9788732394366197
Recall score:  0.9455782312925171
 
F1 score:  0.9619377162629758
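
As a sanity check on how these metrics relate, the sketch below recomputes them from confusion-matrix counts. The counts are an assumption: the notebook does not print them, and they were inferred as the values consistent with the reported scores and the 1115-row test set.

```python
# Confusion-matrix counts inferred from the reported scores (assumption:
# the notebook itself does not print them)
tp, fp, fn, tn = 139, 3, 8, 965

# Precision: share of predicted spam that really is spam
precision = tp / (tp + fp)

# Recall: share of actual spam that was caught
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Accuracy: share of all messages classified correctly
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(accuracy, precision, recall, f1)
```

Under these inferred counts, the model misses 8 spam messages and flags 3 ham messages as spam.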


Disclaimer: This notebook is based on exercises from the Udacity Natural Language Processing Nanodegree.