# Spam Classification 

## 1. Introduction
The following notebook demonstrates a Naive Bayes classifier that identifies messages as "spam" or "no spam". The data is downloaded from the [UCI ML Repository](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection).
Each observation in the dataset consists of one message and an outcome variable "y" that indicates whether that message is "spam" or not. The objective is to build a model that takes in a message, classifies it as "spam" or "no spam", and can be applied to new messages whose outcome y is unknown. This is done by first finding a model that maps each message to a probability of being spam, then deciding on a probability cutoff above which a message is classified as "spam".

#### Multinomial Naive Bayes
This algorithm is based on the theorem by Bayes:
$$
\begin{align}
P(A|B) &= \frac{P(B|A) \cdot P(A)}{P(B)}
\end{align}
$$
and estimates the conditional probability of a particular word given a class as the relative frequency of this word in documents belonging to that class. The number of occurrences of each word is taken into account.

The multinomial Naive Bayes classifier is well suited to text classification and assumes that features are independent given the class. Even though this assumption rarely holds for text, the classifier generally performs well in this context.
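
The relative-frequency estimate described above can be sketched by hand. The toy messages below are made up for illustration and are not part of the SMS dataset, and the add-one (Laplace) smoothing with `alpha=1.0` is an assumption chosen so that unseen words do not get probability zero.

```python
from collections import Counter

# Hypothetical toy training data (not taken from the SMS dataset)
train = [
    ("win money now", "spam"),
    ("win a prize", "spam"),
    ("call me now", "ham"),
    ("see you at home", "ham"),
]

# Count how often each word occurs in each class
word_counts = {"spam": Counter(), "ham": Counter()}
for text, label in train:
    word_counts[label].update(text.split())

# Vocabulary = all distinct words seen in training
vocabulary = {word for counts in word_counts.values() for word in counts}

def word_prob(word, label, alpha=1.0):
    """Estimate P(word | label) as a smoothed relative frequency."""
    counts = word_counts[label]
    total = sum(counts.values())
    return (counts[word] + alpha) / (total + alpha * len(vocabulary))

p_win_spam = word_prob("win", "spam")
p_win_ham = word_prob("win", "ham")
print(p_win_spam, p_win_ham)
```

In this sketch "win" appears twice among the six spam words and never among the ham words, so its estimated probability is higher under the spam class.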

In [1]:
# Import libraries
import pandas as pd

# import display function
from IPython.display import display

# Import data (tab as separator, explicit column names)
df = pd.read_table("data/data_SMSSpam", sep="\t", names=["label", "sms_message"])
display(df.head())

|   | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## 2. The Data
- Two columns.
 - Column 1 (`label`): "ham" (no spam) or "spam".
 - Column 2 (`sms_message`): text of the SMS message

### Data Preprocessing
- Convert labels to binary variables: 
 - ham := 0 for no spam and spam := 1.
 - Use map method to convert
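
One thing worth checking when converting labels with `map`: values that are missing from the mapping become `NaN` instead of raising an error. The sketch below uses a made-up Series with one deliberately miscased label to show how such cases could be counted.

```python
import pandas as pd

# Hypothetical labels; the last one is miscased on purpose
labels = pd.Series(["ham", "spam", "ham", "Spam"])

# Labels without an entry in the mapping are converted to NaN
mapped = labels.map({"ham": 0, "spam": 1})

# Count the labels that could not be mapped
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)
```

A count of zero after mapping indicates that every label was covered by the dictionary.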

In [2]:
# Define mapping
di = {"ham": 0, "spam":1}

# Apply mapping
df["label"] = df.label.map(di)

# Show results
display(df.head())

|   | label | sms_message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


### Convert text to numbers: Bag of Words in scikit-learn
We have to convert the text messages to vectors of numbers that our algorithms can use. A simple way to do this is the bag-of-words representation, which describes each message by how often each vocabulary word occurs in it. It is called a "bag" because the order of the words and their relations to each other are discarded. In sklearn, this model is implemented by `CountVectorizer()`.

- Objective: Convert set of text to a frequency distribution matrix (vectorization)
- Use count vector: [count vectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) 

In the following, we will first go through a simple example of how this is done and what it looks like before applying this method to our spam dataset.

### Example I: CountVectorizer

In [3]:
# Import Count vectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Simple example data
corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?',]

# Instantiate object of CountVectorizer
vectorizer = CountVectorizer()

# Fit to our data
X = vectorizer.fit_transform(corpus)

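# Extract feature names (get_feature_names_out() requires scikit-learn >= 1.0;
# the older get_feature_names() was removed in 1.2)
print(vectorizer.get_feature_names_out().tolist())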
print(X.toarray())

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]


The matrix shows that the word "and" occurs only once, in the third document (row three, column one). "Document" occurs once each in texts 1 and 4 and twice in text 2.
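
To make the counting step concrete, the same matrix can be rebuilt with plain Python. This is only a sketch: the `tokenize` helper is a hypothetical stand-in that mimics `CountVectorizer`'s default behavior (lowercasing, tokens of at least two word characters), not the library's actual implementation.

```python
import re

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

def tokenize(text):
    # Lowercase the text and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: all distinct tokens, sorted alphabetically
vocab = sorted({token for doc in corpus for token in tokenize(doc)})

# One row per document, one column per vocabulary word
matrix = [[tokenize(doc).count(term) for term in vocab] for doc in corpus]

print(vocab)
for row in matrix:
    print(row)
```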

### Example II: CountVectorizer

In [4]:
# Second example: a small list of SMS-like documents
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

# Instantiate the CountVectorizer
count_vector = CountVectorizer()

# Learn the vocabulary from the documents
count_vector.fit(documents)

# Get feature names (get_feature_names_out() requires scikit-learn >= 1.0)
names = count_vector.get_feature_names_out().tolist()

# Transform the documents into a sparse count matrix, then into a dense array
doc_array = count_vector.transform(documents).toarray()

print(names)
print(doc_array)

# Frequency matrix
frequency_matrix = pd.DataFrame(columns=names, data=doc_array)
frequency_matrix

['are', 'call', 'from', 'hello', 'home', 'how', 'me', 'money', 'now', 'tomorrow', 'win', 'you']
[[1 0 0 1 0 1 0 0 0 0 0 1]
 [0 0 1 0 1 0 0 1 0 0 2 0]
 [0 1 0 0 0 0 1 0 1 0 0 0]
 [0 1 0 2 0 0 0 0 0 1 0 1]]


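|   | are | call | from | hello | home | how | me | money | now | tomorrow | win | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |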


### Back to Spam Classification
Note that large datasets contain many filler words (e.g. "is", "the", and the like) that carry little information. They can be dealt with by using the `stop_words` parameter or the [tfidf](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer) method.
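
As a rough illustration of both options, the sketch below uses three made-up sentences. It is not applied to the spam data in this notebook, and the choice of scikit-learn's built-in English stop word list is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini corpus in which "the" occurs in every document
docs = ['the cat sat', 'the dog ran', 'the bird flew']

# Option 1: drop common English words via the built-in stop word list
no_stop = CountVectorizer(stop_words='english').fit(docs)
has_the = 'the' in no_stop.vocabulary_

# Option 2: keep all words, but down-weight those that occur in many documents
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)
w_the = weights[0, tfidf.vocabulary_['the']]
w_cat = weights[0, tfidf.vocabulary_['cat']]

print(has_the, w_the < w_cat)
```

With stop words, "the" disappears from the vocabulary entirely. With tf-idf it stays, but receives a smaller weight than a word such as "cat" that occurs in only one document.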

In [5]:
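# Import train_test_split (it lives in sklearn.model_selection;
# the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split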

# Split data into training and test data
X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], 
                                                    df['label'], 
                                                    random_state=1)

print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393




## 3. Modeling

Applying Bag of Words processing to our dataset
 
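- i) Fit `CountVectorizer()` on the training data and transform it into a count matrix
- ii) Transform the testing data with the already fitted vectorizer, without refitting, so that both matrices share the same vocabulary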
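
The two steps can be sketched on made-up messages to show why the vectorizer is fitted only on the training data: words that appear only in the test data have no column and are silently ignored. The example texts are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training and test messages
train_docs = ['win money now', 'call me tomorrow']
test_docs = ['win a free holiday']  # 'free' and 'holiday' never occur in training

cv = CountVectorizer()

# i) Learn the vocabulary from the training data and build its count matrix
train_matrix = cv.fit_transform(train_docs)

# ii) Reuse the learned vocabulary for the test data (no refitting)
test_matrix = cv.transform(test_docs)

print(sorted(cv.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
print(test_matrix.toarray())
```

Both matrices have the same number of columns, which is what allows a model trained on one to be applied to the other.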

### Bayes Theorem implementation
- Use Bayes' theorem to make predictions and classify a message as spam or not spam
- Bayes' theorem combines a prior (probabilities that we already know or that are given to us) with the observed data to yield the posterior (the probabilities we want to compute)
- The algorithm treats the features it uses for prediction as independent of each other, which may not always be the case. This assumption is what makes it "naive"
- More precisely, if we have more than one feature, we assume that each additional feature is independent of the others given the class, which may or may not be true

Next, we extend Bayes' theorem to cases with more than one feature.

Bayes' theorem for a class y and features x_1, ..., x_n:
$$
\begin{equation}
P(y| x_1, ..., x_n) = \frac{P(y)P(x_1, ...,x_n|y)}{P(x_1,...,x_n)}
\end{equation}
$$
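
As a short worked step, this is how the independence assumption is usually applied to the formula above. The likelihood factorizes into per-feature terms, and because the denominator does not depend on y, it can be dropped when comparing classes. The symbol $\hat{y}$ for the predicted class is introduced here for illustration.

$$
\begin{align}
P(x_1, ..., x_n | y) &= \prod_{i=1}^{n} P(x_i | y) \\
P(y | x_1, ..., x_n) &\propto P(y) \prod_{i=1}^{n} P(x_i | y) \\
\hat{y} &= \arg\max_y \; P(y) \prod_{i=1}^{n} P(x_i | y)
\end{align}
$$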

### Naive Bayes implementation using scikit-learn
- Use the `naive_bayes` module from `sklearn` to make predictions
- Use the multinomial Naive Bayes implementation (suitable for classification with discrete features, such as word counts for text classification).
- Input: integer word counts
- Alternatively: Gaussian Naive Bayes is better suited for continuous data, as it assumes that the input data has a Gaussian (normal) distribution.

In [6]:
# Import Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB

# Instantiate the CountVectorizer method
count_vector = CountVectorizer()

# Fit the vectorizer on the training data; reuse its vocabulary for the test data
training_data = count_vector.fit_transform(X_train)
testing_data = count_vector.transform(X_test)

# Initialize object
naive_bayes = MultinomialNB()

# Fit object to training data
naive_bayes.fit(training_data, y_train)

# Predictions
predictions = naive_bayes.predict(testing_data)
print(predictions[0:100])

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 1 1 1 0
 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0]
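
The introduction describes classification as a two-step process: map each message to a probability of being spam, then apply a cutoff. `predict` does both at once, but the steps can be separated with `predict_proba`. The sketch below uses made-up messages, and the cutoff of 0.5 is a hypothetical choice, not a tuned value.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 1 = spam, 0 = ham
texts = ['win money now', 'free prize win', 'call me tomorrow', 'see you at home']
labels = [1, 1, 0, 0]

cv = CountVectorizer()
model = MultinomialNB().fit(cv.fit_transform(texts), labels)

# Step 1: probability of class 1 (spam) for each new message
new_messages = cv.transform(['win a free prize', 'see you tomorrow'])
spam_prob = model.predict_proba(new_messages)[:, 1]

# Step 2: apply a probability cutoff (0.5 is an assumption)
cutoff = 0.5
flags = (spam_prob >= cutoff).astype(int)

print(spam_prob)
print(flags)
```

Raising the cutoff would flag fewer messages as spam, which trades recall for precision.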


## 4. Model Evaluation
- **Accuracy** measures how often the classifier makes the correct prediction. It's the ratio of the number of correct predictions to the total number of predictions (number of test data points). 
- **Precision** is the proportion of messages we classified as spam that actually were spam. It is the ratio: 

$$
\begin{equation}
\frac{True \; Positives}{True \; Positives + False \; Positives}
\end{equation}
$$

- **Sensitivity** (also called **recall**) tells us what proportion of the messages that actually were spam we classified as spam. It is the ratio of true positives (messages classified as spam that actually are spam) to all messages that actually are spam, hence: 

$$
\begin{equation}
\frac{True \; Positives}{True \; Positives + False \; Negatives}
\end{equation}
$$

- In a skewed classification problem, accuracy by itself is not a very good metric (e.g. with only 2 spam vs 98 non-spam messages, labeling everything as non-spam already yields 98% accuracy).
- For such cases, precision and recall come in handy. They can be combined into the F1 score, which is the harmonic mean of the precision and recall scores. 
- F1 score can range from 0 to 1, with 1 being the best possible score.
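
The skewed case can be worked through by hand. The sketch below assumes a hypothetical test set of 2 spam and 98 non-spam messages and a trivial classifier that labels everything as non-spam. Setting undefined ratios (division by zero) to 0 is a convention chosen here.

```python
# Hypothetical skewed test set: 2 spam (1) and 98 non-spam (0) messages
y_true = [1, 1] + [0] * 98

# A trivial classifier that labels every message as non-spam
y_pred = [0] * 100

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # spam correctly flagged
fp = sum(t == 0 and p == 1 for t, p in pairs)  # non-spam wrongly flagged
fn = sum(t == 1 and p == 0 for t, p in pairs)  # spam that was missed
tn = sum(t == 0 and p == 0 for t, p in pairs)  # non-spam correctly passed

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(accuracy, precision, recall, f1)
```

Accuracy looks excellent even though not a single spam message is caught, which recall and F1 make visible.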

In the following, we calculate all four metrics. Each can range from 0 to 1, and a score close to 1 indicates that the model is doing well.

In [7]:
# Import metrics from sklearn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Calculate scores (built-in round() works for both Python floats and NumPy floats)
print('Accuracy score: ', round(accuracy_score(y_test, predictions), 4))
print('Precision score: ', round(precision_score(y_test, predictions), 4))
print('Recall score: ', round(recall_score(y_test, predictions), 4))
print('F1 score: ', round(f1_score(y_test, predictions), 4))

Accuracy score:  0.9885
Precision score:  0.9721
Recall score:  0.9405
F1 score:  0.956


In [8]:
print("End of Notebook")

End of Notebook
