# Spam/ham predictor using scikit-learn (sklearn)
Supervised classification algorithm

## Basic flow:
**Load dataset**   
**Feature extraction**   
**Classifier**   
**Metrics**

In [1]:
import pandas

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC, SVC

from textblob import TextBlob

### 1) load dataset

- Load the messages dataset with pandas, a popular library for data manipulation and analysis.
- Split the dataset into training and testing sets; sklearn's train_test_split function does this in a single call once the test size is specified.

In [2]:
def load_dataset():
    # spam.csv keeps the label in column v1 and the message text in column v2
    messages = pandas.read_csv("spam.csv", encoding='latin-1')
    messages = messages.rename(columns={"v1":"label", "v2":"content"})

    # Hold out 20% of the messages for testing
    contents_train, contents_test, labels_train, labels_test = train_test_split(messages['content'],
                                                                                messages['label'],
                                                                                test_size=0.2)

    print("Total messages: {}".format(messages.shape[0]))
    print("Training set contains {} messages".format(contents_train.shape[0]))
    print("Testing set contains {} messages".format(contents_test.shape[0]))

    return contents_train, contents_test, labels_train, labels_test

contents_train, contents_test, labels_train, labels_test = load_dataset()

Total messages: 5572
Training set contains 4457 messages
Testing set contains 1115 messages
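
The split above is drawn at random on every run, so the exact messages in each set (and the scores reported later) will vary. As a minimal sketch, assuming reproducibility and equal spam/ham proportions in both sets are wanted, train_test_split also accepts `random_state` and `stratify`. Neither parameter is used in the original cell, and the toy messages below are hypothetical.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 5 spam and 15 ham messages
toy_contents = ["message {}".format(i) for i in range(20)]
toy_labels = ["spam"] * 5 + ["ham"] * 15

# random_state fixes the shuffle; stratify keeps the spam/ham ratio in both sets
toy_c_train, toy_c_test, toy_l_train, toy_l_test = train_test_split(
    toy_contents, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels)

print("Training set: {} messages, {} spam".format(len(toy_c_train), toy_l_train.count("spam")))
print("Testing set: {} messages, {} spam".format(len(toy_c_test), toy_l_test.count("spam")))
```

With 20 messages and a 25% spam rate, the stratified split keeps one spam message in every four in both sets.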


In [3]:
# Lemmatisation: lowercase the message, split it into words and reduce each word to its lemma
def split_into_lemmas(content):
    words = TextBlob(content.lower()).words
    return [word.lemma for word in words]

### 2) feature extraction 

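- Tokenization and lemmatization are done by split_into_lemmas. Stop-word removal is requested through stop_words='english', but CountVectorizer applies that option only when analyzer='word', so it has no effect when a custom analyzer function is passed.
- Convert the messages into vectors that machine learning models can work with: CountVectorizer converts a collection of text documents into a matrix of token counts.
- The fit_transform method does two things: it learns the vocabulary of the messages and extracts the word counts.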
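
To make the counting step concrete, here is a minimal sketch on three hypothetical messages (not taken from spam.csv). It uses CountVectorizer with its default analyzer rather than the custom lemmatizing one, and shows that `transform` reuses the vocabulary learned by `fit_transform`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages
toy_train = ["free prize call now", "call me later"]
toy_test = ["free call tomorrow"]

toy_vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and returns the matrix of token counts
toy_train_counts = toy_vectorizer.fit_transform(toy_train)
# transform reuses that vocabulary; words not seen in training ("tomorrow") are dropped
toy_test_counts = toy_vectorizer.transform(toy_test)

# Vocabulary terms ordered by their column index
vocabulary = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocabulary)
print(toy_train_counts.toarray())
print(toy_test_counts.toarray())
```

Each row is one message and each column is one vocabulary term, which is the shape the classifiers below expect.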

In [4]:
def feature_extraction(contents_train, contents_test):
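    # split_into_lemmas tokenizes and lemmatizes each message. Note: stop_words is only
    # applied when analyzer='word', so it is ignored here because analyzer is a function.
    count_vector = CountVectorizer(analyzer=split_into_lemmas, stop_words='english')

    # Learn the vocabulary from the training messages and return their document-term matrix
    train_messages = count_vector.fit_transform(contents_train)
    # Reuse the learned vocabulary to count words in the test messages
    test_messages = count_vector.transform(contents_test)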
    
    return train_messages, test_messages

train_messages, test_messages = feature_extraction(contents_train, contents_test)

To try to improve accuracy, we can use word frequencies instead of raw word occurrence counts. Instead of how many times a word appears in the message, we compute the share of the message that is made up by that word, and then downweight words that are common across all messages.
The most popular method for this is called TF-IDF.

- Term Frequency: how often a given word appears in a message.
- Inverse Document Frequency: downscales words that appear in many documents.
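
The effect of the two terms can be seen on a tiny, hypothetical count matrix. This is a sketch using TfidfTransformer with its default settings (smoothed IDF and L2-normalized rows): a word that occurs in every message gets a lower weight than a rarer word with the same count, and a fitted transformer can be reused on counts it has not seen.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical counts: 3 messages, 2 words.
# Word 0 appears in every message, word 1 only in the first one.
toy_counts = np.array([[1, 1],
                       [1, 0],
                       [1, 0]])

toy_tfidf = TfidfTransformer()
toy_weights = toy_tfidf.fit_transform(toy_counts).toarray()

print(toy_tfidf.idf_)   # the rarer word 1 has the larger IDF
print(toy_weights)      # in row 0, word 1 outweighs word 0 despite equal counts

# Apply the already fitted IDF weights to a new message
toy_new_weights = toy_tfidf.transform(np.array([[2, 1]])).toarray()
print(toy_new_weights)
```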


In [5]:
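# Fit the IDF weights on the training counts only, then apply the same weights to the test counts
tfidf_transformer = TfidfTransformer()
tfidf_train_messages = tfidf_transformer.fit_transform(train_messages)
tfidf_test_messages = tfidf_transformer.transform(test_messages)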

In [6]:
def metrics(classifier, predictions, labels_test):
    # Rows of the confusion matrix are true labels and columns are predicted labels,
    # in sorted label order: index 0 is 'ham', index 1 is 'spam'
    metric_matrix = confusion_matrix(labels_test, predictions)
    print("{} classifier accuracy: {}".format(classifier, accuracy_score(labels_test, predictions)))
    # False positives: ham predicted as spam
    print("{} ham messages were wrongly classified as spam while {} were classified correctly".format(
        metric_matrix[0][1], metric_matrix[0][0]))
    # False negatives: spam predicted as ham
    print("{} spam messages wrongly classified as ham while {} were classified correctly".format(
        metric_matrix[1][0], metric_matrix[1][1]))
    #print(classification_report(labels_test, predictions))
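
To make the indexing used by the metrics function concrete, here is a small sketch on hypothetical labels. confusion_matrix puts true labels on the rows and predicted labels on the columns; the `labels` argument is passed here only to make the ham/spam order explicit.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels
toy_true = ["ham", "ham", "ham", "spam", "spam"]
toy_pred = ["ham", "spam", "ham", "spam", "ham"]

toy_matrix = confusion_matrix(toy_true, toy_pred, labels=["ham", "spam"])
print(toy_matrix)

ham_as_spam = toy_matrix[0][1]   # false positives: ham predicted as spam
spam_as_ham = toy_matrix[1][0]   # false negatives: spam predicted as ham
print("{} ham flagged as spam, {} spam missed".format(ham_as_spam, spam_as_ham))
```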

### 3) naive bayes classifier 

Based on Bayes' theorem, with the "naive" assumption that every pair of features is independent given the class. The probability of a message being spam (S) rather than ham (H), given that it contains a word W, is:
    
\begin{align}
P(S|W)=\frac{P(W|S)P(S)}{P(W)} = \frac{P(W|S)P(S)}{P(W|S)P(S)+P(W|H)P(H)}
\end{align}
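
As a worked example of the formula, the sketch below plugs in hypothetical probabilities chosen only for illustration; they are not estimated from the dataset.

```python
# Hypothetical probabilities, for illustration only
p_spam = 0.13              # P(S): prior probability that a message is spam
p_ham = 1.0 - p_spam       # P(H)
p_word_given_spam = 0.30   # P(W|S): the word appears in 30% of spam messages
p_word_given_ham = 0.01    # P(W|H): the word appears in 1% of ham messages

# Denominator of the formula: P(W) = P(W|S)P(S) + P(W|H)P(H)
p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
p_spam_given_word = p_word_given_spam * p_spam / p_word

print("P(S|W) = {:.3f}".format(p_spam_given_word))
```

Even though only 13% of these hypothetical messages are spam, seeing a word that is 30 times more likely in spam than in ham raises the spam probability to roughly 0.82.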


In [7]:
def naive_bayes_classifier(train_messages, labels_train, test_messages):
    naive_bayes = MultinomialNB()
    # Train the classifier on the training messages
    naive_bayes.fit(train_messages, labels_train)

    # Predict the labels of the test messages using the trained model
    nb_predictions = naive_bayes.predict(test_messages)
    return nb_predictions

nb_predictions = naive_bayes_classifier(train_messages, labels_train, test_messages)
nb_tfidf_predictions = naive_bayes_classifier(tfidf_train_messages, labels_train, tfidf_test_messages)

### 4) NB metrics report

In [8]:
# Naive Bayes classifier
metrics('Naive Bayes', nb_predictions, labels_test)
print('\n------------------------------------------------\n')

# Naive Bayes classifier with TFIDF
metrics('Naive Bayes with TFIDF', nb_tfidf_predictions, labels_test)

print('\n------------------------------------------------\n')

Naive Bayes classifier accuracy:  0.983856502242
8 ham messages were wrongly classified as spam while 968 were classified correctly
10 spam messages wrongly classified as ham while 129 were classified correctly

------------------------------------------------

Naive Bayes with TFIDF classifier accuracy:  0.956053811659
0 ham messages were wrongly classified as spam while 976 were classified correctly
49 spam messages wrongly classified as ham while 90 were classified correctly

------------------------------------------------



### 3.1) SVM classifier

Treat each data item as a point in n-dimensional space, where the value of each feature is the value of one coordinate. Classification is performed by finding the hyperplane that best separates the two classes.
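
The idea can be sketched in two dimensions with hypothetical, linearly separable points. After fitting, LinearSVC exposes the hyperplane as `coef_` (the weight vector w) and `intercept_` (the offset b); the sign of w . x + b decides the predicted class.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical 2-D points: one cluster per class
toy_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                       [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
toy_labels = np.array(["ham", "ham", "ham", "spam", "spam", "spam"])

toy_svc = LinearSVC(random_state=0)
toy_svc.fit(toy_points, toy_labels)

# The separating hyperplane is w . x + b = 0
w, b = toy_svc.coef_[0], toy_svc.intercept_[0]
new_points = np.array([[0.5, 0.5], [3.5, 3.5]])
scores = new_points.dot(w) + b

print(scores)                       # negative score: 'ham' side, positive score: 'spam' side
print(toy_svc.predict(new_points))
```

In the notebook the points are the message vectors, so n is the vocabulary size rather than 2.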


In [9]:
def linear_svc_classifier(train_messages, labels_train, test_messages):
    linear_svc = LinearSVC()
    # Train the classifier on the training messages
    linear_svc.fit(train_messages, labels_train)

    # Predict the labels of the test messages using the trained model
    linear_svc_predictions = linear_svc.predict(test_messages)
    return linear_svc_predictions

linear_svc_predictions = linear_svc_classifier(train_messages, labels_train, test_messages)
linear_svc_tfidf_predicitions = linear_svc_classifier(tfidf_train_messages, labels_train, tfidf_test_messages)

### 4.1) SVM metrics report

In [10]:
# Linear SVC classifier
metrics('Linear SVC', linear_svc_predictions, labels_test)
print('\n------------------------------------------------\n')

# Linear SVC classifier with TFIDF
metrics('Linear SVC with TFIDF', linear_svc_tfidf_predicitions, labels_test)
print('\n------------------------------------------------\n')

Linear SVC classifier accuracy:  0.985650224215
2 ham messages were wrongly classified as spam while 974 were classified correctly
14 spam messages wrongly classified as ham while 125 were classified correctly

------------------------------------------------

Linear SVC with TFIDF classifier accuracy:  0.987443946188
1 ham messages were wrongly classified as spam while 975 were classified correctly
13 spam messages wrongly classified as ham while 126 were classified correctly

------------------------------------------------



### Improvements 
**Use sklearn pipeline**   
**Add ngram_range parameter**   
**k-fold cross-validation**
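
One way the three improvements listed above could fit together is sketched below. This is an illustration under stated assumptions, not the notebook's implementation: the toy corpus, the step names and the choice of MultinomialNB with four folds are all hypothetical. In the notebook, the inputs would be messages['content'] and messages['label'].

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 4 spam and 4 ham messages
toy_contents = ["free prize call now", "win cash prize today",
                "claim your free cash", "urgent free prize waiting",
                "see you at lunch", "call me later tonight",
                "are you home yet", "lunch at noon tomorrow"]
toy_labels = ["spam"] * 4 + ["ham"] * 4

spam_pipeline = Pipeline([
    ("counts", CountVectorizer(ngram_range=(1, 2))),  # single words and word pairs
    ("tfidf", TfidfTransformer()),
    ("classifier", MultinomialNB()),
])

# 4-fold cross-validation: every fold refits the whole pipeline, so the
# vocabulary and IDF weights are learned from the training folds only
fold_scores = cross_val_score(spam_pipeline, toy_contents, toy_labels, cv=4)
print("Accuracy per fold: {}".format(fold_scores))
print("Mean accuracy: {:.3f}".format(fold_scores.mean()))
```

Wrapping the steps in a Pipeline keeps feature extraction and classification together, so the same fit and predict calls apply to the whole chain, and averaging over folds gives a steadier estimate than a single random split.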