# Sentiment Analysis to Detect Spam Emails

Data used: spam.csv

This project builds a supervised machine learning model that estimates the probability that a given email is spam.  It relies on text mining techniques and the pandas, numpy, and sklearn Python libraries.  The goal is to build a generally reliable model and to explore the key features and commonalities among spam samples, to better understand the breadth (or lack thereof) of their complexity and characteristics.

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

### Step 1: Load and Visualize Dataset

In [2]:
spam_data = pd.read_csv('spam.csv')
spam_data['target'] = np.where(spam_data['target']=='spam',1,0) # converts spam text to binary
print(spam_data.shape) # 5572 samples
spam_data.head(10)

(5572, 2)


| | text | target |
|---|---|---|
| 0 | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | Ok lar... Joking wif u oni... | 0 |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | U dun say so early hor... U c already then say... | 0 |
| 4 | Nah I don't think he goes to usf, he lives aro... | 0 |
| 5 | FreeMsg Hey there darling it's been 3 week's n... | 1 |
| 6 | Even my brother is not like to speak with me. ... | 0 |
| 7 | As per your request 'Melle Melle (Oru Minnamin... | 0 |
| 8 | WINNER!! As a valued network customer you have... | 1 |
| 9 | Had your mobile 11 months or more? U R entitle... | 1 |


In [3]:
spam_data['target'].mean()*100 # 13.4% of the cases are spam

13.406317300789663

In [4]:
def length_of_emails():
    temp = spam_data.copy()
    temp['length'] = temp['text'].str.len()
    average_length = temp.groupby('target')['length'].agg('mean').values
    return average_length[0], average_length[1]
not_s,s = length_of_emails()
print('Avg. Length of Non-Spam Emails: ' + str(round(not_s)))
print('Avg. Length of Spam Emails: ' + str(round(s)))

Avg. Length of Non-Spam Emails: 71.0
Avg. Length of Spam Emails: 139.0


In [5]:
def avg_digits():
    import re
    spam_data['digits_count'] = spam_data['text'].apply(lambda row: len(re.findall(r'(\d)', row)))
    average_digits = spam_data.groupby('target')['digits_count'].agg('mean').values
    return average_digits[0], average_digits[1]
not_s_d, s_d = avg_digits()
print('Avg. Number of digits in Non-Spam Emails: ' + str(round(not_s_d,2)))
print('Avg. Number of digits in Spam Emails: ' + str(int(s_d)))

Avg. Number of digits in Non-Spam Emails: 0.3
Avg. Number of digits in Spam Emails: 15


In [6]:
def non_words():
    import re
    t = spam_data.copy()
    t['not_words'] = t['text'].apply(lambda row: len(re.findall(r'\W', row)))
    avg_nowords = t.groupby('target')['not_words'].agg('mean').values
    return avg_nowords[0], avg_nowords[1]
not_s_w, s_w = non_words()
print('Avg. Number of non-word characters in Non-Spam Emails: ' + str(int(not_s_w)))
print('Avg. Number of non-word characters in Spam Emails: ' + str(int(s_w)))

Avg. Number of non-word characters in Non-Spam Emails: 17
Avg. Number of non-word characters in Spam Emails: 29


The data is skewed, with less than 14% of the cases being spam (target = 1).  Perhaps surprisingly, spam emails are on average longer than non-spam emails.  Moreover, the number of digits in an email appears to be a good indicator of spam, along with a larger count of non-word characters.
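The three per-message counts discussed above can be gathered in a single helper. This is a minimal sketch using only the standard library; the function name `message_features` and the two example messages are hypothetical and are not taken from spam.csv.

```python
import re

def message_features(text):
    """Return the three hand-crafted counts for one message."""
    return {
        'length': len(text),                       # number of characters
        'digits': len(re.findall(r'\d', text)),    # digit characters
        'non_word': len(re.findall(r'\W', text)),  # anything but a letter, digit or underscore
    }

# Hypothetical messages, used only to illustrate the counts.
spam_like = message_features('WIN 100 cash: txt 87121')
ham_like = message_features('see you at 5')
print(spam_like)
print(ham_like)
```

The same regular expressions (`\d` and `\W`) are the ones used in the notebook cells, so the counts should agree with the per-row values computed there.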

### Step 2: Split Data

In [7]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(spam_data['text'], 
                                                    spam_data['target'], 
                                                    random_state=0)

### Step 3: Build out Models

#### Model Alpha: Naive Bayes Classifier

This model uses a bag-of-words approach, comparing the vocabulary of each message against the word counts observed across all messages in each target class.  It then applies Bayes' theorem to estimate the probability that the words in a sample belong to either class.
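To make the idea concrete, here is a small hand-rolled sketch of a multinomial Naive Bayes classifier over word counts. It is an illustration under stated assumptions, not scikit-learn's implementation: the four training messages are invented, tokenization is a plain whitespace split, and the helper names `log_posterior` and `spam_probability` are hypothetical. The smoothing value `alpha = 0.1` mirrors the one used in the notebook.

```python
import math
from collections import Counter

# Invented toy corpus: (message, label) with 1 = spam, 0 = not spam.
train = [
    ('win cash now', 1),
    ('free cash prize', 1),
    ('see you at home', 0),
    ('call me at home now', 0),
]
alpha = 0.1  # additive smoothing, so unseen words never get probability zero

word_counts = {0: Counter(), 1: Counter()}
class_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    class_counts[label] += 1
vocab = set(word_counts[0]) | set(word_counts[1])

def log_posterior(text, label):
    """Unnormalized log P(label | text) under the bag-of-words assumption."""
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))  # class prior
    for word in text.split():
        if word in vocab:  # words never seen in training are ignored
            score += math.log((word_counts[label][word] + alpha)
                              / (total + alpha * len(vocab)))
    return score

def spam_probability(text):
    """Normalize the two class scores into a probability of spam."""
    log_spam, log_ham = log_posterior(text, 1), log_posterior(text, 0)
    top = max(log_spam, log_ham)  # subtract the max for numerical stability
    spam, ham = math.exp(log_spam - top), math.exp(log_ham - top)
    return spam / (spam + ham)

print(spam_probability('free cash'))
print(spam_probability('at home'))
```

On this toy corpus, a message made of words seen mostly in spam scores close to 1, and one made of words seen mostly in non-spam scores close to 0.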


In [8]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score, confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer

def NB_model():
    vect = CountVectorizer().fit(X_train)
    X_train_vectorized = vect.transform(X_train)
    model_nb = MultinomialNB(alpha=0.1).fit(X_train_vectorized, y_train)
    predictions = model_nb.predict(vect.transform(X_test)) # hard 0/1 labels
    score = roc_auc_score(y_test, predictions) # note: AUC computed from labels, not probabilities
    cm = confusion_matrix(y_test, predictions)
    return score, cm
NB_auc, NB_cm = NB_model()
print('AUC Score: ' + str(NB_auc))
print(NB_cm)

AUC Score: 0.9720812182741116
[[1196    0]
 [  11  186]]


In our first model, the Naive Bayes classifier does a great job of recognizing non-spam emails, with zero false positives.  The false negative count is also relatively low, with eleven spam emails misclassified.  Overall, this is a good baseline for comparison.

#### Model Beta: TF-IDF Vectorized and Tuned Naive Bayes Classifier

This model uses the TF-IDF method to assign value to the relevancy of each word in an email.  TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.
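The multiplication described above can be worked through by hand. The sketch below assumes the smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document vector, which is what scikit-learn's `TfidfVectorizer` does with its default settings. The three documents and the helper name `tfidf` are invented for illustration.

```python
import math
from collections import Counter

# Invented toy corpus.
docs = ['free prize call now', 'call me now', 'see you at home']
tokenized = [d.split() for d in docs]
n_docs = len(docs)

# Document frequency: in how many documents each word appears.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    """Return L2-normalized tf-idf weights for one tokenized document."""
    tf = Counter(tokens)
    raw = {w: tf[w] * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1) for w in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

weights = tfidf(tokenized[0])
print(weights)          # 'free' and 'prize' outweigh the more common 'call' and 'now'
print(tfidf(['home']))  # a one-term document normalizes to a weight of 1.0
```

The second print shows why a maximum tf-idf of 1.0 is not unusual: under L2 normalization, any message with a single in-vocabulary term gives that term a weight of exactly 1.0.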

This model applies TF-IDF weighting to the data and uses a smoothing alpha of 0.1, while ignoring terms with a document frequency of less than three.
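The effect of a minimum document frequency can be sketched in a few lines. This is a simplified illustration with a whitespace tokenizer and invented documents; `prune_vocabulary` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter

def prune_vocabulary(docs, min_df):
    """Keep only the terms that occur in at least min_df documents."""
    doc_freq = Counter(term for doc in docs for term in set(doc.lower().split()))
    return sorted(term for term, count in doc_freq.items() if count >= min_df)

# Invented documents. Repeats inside one document count only once.
docs = ['free prize now', 'free call now', 'call me now', 'free free free']
vocab = prune_vocabulary(docs, min_df=3)
print(vocab)
```

Note that the threshold counts documents, not occurrences: a word repeated three times in a single message still has a document frequency of one.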

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

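def tfidf_view(): # returns the ten features with the smallest and the ten with the largest maximum tf-idf weight
    vect = TfidfVectorizer().fit(X_train)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions
    feature_names = np.array(vect.get_feature_names()).reshape(-1, 1)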
    X_train_vectorized = vect.transform(X_train)
    tfidf_values = X_train_vectorized.max(0).toarray()[0].reshape(-1, 1)
    df = pd.DataFrame(data=np.hstack((feature_names, tfidf_values)), columns=['features', 'tfidf'])
    smalldf = df.sort_values(by=['tfidf', 'features']).set_index('features')[:10]
    largestdf = df.sort_values(by=['tfidf', 'features'], ascending=[False, True]).set_index('features')[:10]
    return smalldf, largestdf
s, l = tfidf_view()
print('Ten smallest tfidf:\n')
print(s)
print('\nTen largest tfidf:\n')
print(l)

Ten smallest tfidf:

                           tfidf
features                        
aaniye        0.0744745235430759
athletic      0.0744745235430759
chef          0.0744745235430759
companion     0.0744745235430759
courageous    0.0744745235430759
dependable    0.0744745235430759
determined    0.0744745235430759
exterminator  0.0744745235430759
healer        0.0744745235430759
listener      0.0744745235430759

Ten largest tfidf:

          tfidf
features       
146tf150p   1.0
645         1.0
anything    1.0
anytime     1.0
beerage     1.0
done        1.0
er          1.0
havent      1.0
home        1.0
lei         1.0


In [10]:
def tfidf_NB_model():
    vect = TfidfVectorizer(min_df=3).fit(X_train) # ignores terms that appear in fewer than 3 documents
    X_train_vectorized = vect.transform(X_train)
    X_test_vectorized = vect.transform(X_test)
    clf = MultinomialNB(alpha=0.1).fit(X_train_vectorized, y_train) # alpha smoothing ensures no word ever has a probability of zero
    y_score = clf.predict_proba(X_test_vectorized)[:, 1] # predicted probability of spam
    y_pred = (y_score > 0.5).astype(int) # label as spam when the probability exceeds 0.5
    score = roc_auc_score(y_test, y_score)
    cm = confusion_matrix(y_test, y_pred)
    return score, cm
NB2_auc, NB2_cm = tfidf_NB_model()
print('AUC Score: ' + str(NB2_auc))
print(NB2_cm)


AUC Score: 0.9954968337775665
[[1196    0]
 [  23  174]]


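This model does well overall at predicting spam. Its AUC score (0.995) is higher than the baseline model's (0.972), although the two are not directly comparable: the baseline AUC was computed from hard label predictions, while this one uses predicted probabilities. At the 0.5 threshold it also has a larger false negative count (23 vs. 11). Further tuning might be required.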
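The AUC value depends on whether it is computed from hard labels (as in `NB_model`, which passes the output of `predict`) or from continuous scores (as in `tfidf_NB_model`, which passes `predict_proba`). The sketch below illustrates the gap with a hand-rolled pairwise AUC, used here in place of `roc_auc_score` to stay dependency-free. The six labels and probabilities are invented, and `pairwise_auc` is a hypothetical helper.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented example: every spam message scores above every non-spam message,
# but two spam messages fall below the 0.5 decision threshold.
y_true = [0, 0, 0, 1, 1, 1]
proba = [0.05, 0.10, 0.30, 0.35, 0.45, 0.90]
labels = [int(p > 0.5) for p in proba]

auc_from_scores = pairwise_auc(y_true, proba)
auc_from_labels = pairwise_auc(y_true, labels)
print(auc_from_scores, auc_from_labels)
```

Thresholding discards the ranking information inside each predicted class, so a label-based AUC is generally no higher than the score-based one for the same model.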

#### Model Gamma: SVM model with TF-IDF Applied

This model fits and transforms the training data X_train using a TfidfVectorizer that ignores terms with a document frequency strictly lower than 5. The resulting document-term matrix is combined with one additional feature, the length of the document (number of characters), and the data is fitted to a Support Vector Classification model with regularization parameter C=10000.
This function returns the AUC score and confusion matrix.

In [11]:
from sklearn.svm import SVC

def add_feature(X, feature_to_add):
    """
    Returns sparse feature matrix with added feature.
    feature_to_add can also be a list of features.
    """
    from scipy.sparse import csr_matrix, hstack
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')

def SVC_model():
    spam_data['length_of_doc'] = spam_data['text'].str.len()
    X_train, X_test, y_train, y_test = train_test_split(spam_data.drop('target', axis=1), spam_data['target'], random_state=0)
    vect = TfidfVectorizer(min_df=5).fit(X_train['text'])
    X_train_vectorized = vect.transform(X_train['text'])
    X_train_vectorized = add_feature(X_train_vectorized, X_train['length_of_doc'])
    clf = SVC(C=10000).fit(X_train_vectorized, y_train)
    X_test_vectorized = vect.transform(X_test['text'])
    X_test_vectorized = add_feature(X_test_vectorized, X_test['length_of_doc'])
    y_score = clf.decision_function(X_test_vectorized)
    score = roc_auc_score(y_test, y_score)
    p = clf.predict(X_test_vectorized)
    cm = confusion_matrix(y_test, p)
    return score, cm
SVC_auc, SVC_cm = SVC_model()
print('AUC Score: ' + str(SVC_auc))
print(SVC_cm)


AUC Score: 0.9951106055718724
[[1193    3]
 [  16  181]]


The SVM model produces results similar to the previous model, with a comparable AUC score and confusion matrix. The number of false negatives dropped by seven (23 to 16) while false positives increased by three, most likely because C=10000 applies very little regularization (in scikit-learn, C is the inverse of the regularization strength), which lets the model fit the training data closely.

#### Model Delta: Tuned Logistic Regression

Fits and transforms the training data X_train using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than 5 and using word n-grams from n=1 to n=3 (unigrams, bigrams, and trigrams).

Adds the following additional features:
1) the length of document (number of characters)
2) number of digits per document

Fits a Logistic Regression model with regularization C=100. Then computes the area under the curve (AUC) score using the transformed test data.

This function returns the AUC score as a float and confusion matrix.
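As a rough illustration of what an n-gram range of 1 to 3 produces, the sketch below enumerates word n-grams for one message. It uses a plain whitespace split, which is a simplification: the default token pattern of `TfidfVectorizer` also strips punctuation and drops one-character tokens. The helper `word_ngrams` is hypothetical.

```python
def word_ngrams(text, n_min=1, n_max=3):
    """List every run of n consecutive words for n_min <= n <= n_max."""
    tokens = text.lower().split()
    return [' '.join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

grams = word_ngrams('Free entry in comp')
print(grams)
```

A four-word message yields four unigrams, three bigrams and two trigrams, so the feature space grows quickly, which is one reason to prune rare terms with a minimum document frequency.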

In [12]:
from sklearn.linear_model import LogisticRegression

def LR_model():
    import re
    temp = spam_data.copy()
    temp['length_of_doc'] = temp['text'].str.len() # overall length of email
    temp['digits_count'] = temp['text'].apply(lambda row: len(re.findall(r'(\d)', row))) # finds digits per email
    X_train, X_test, y_train, y_test = train_test_split(temp.drop('target', axis=1), temp['target'], random_state=0)
    # resplit train/test to include new features
    vect = TfidfVectorizer(min_df=5, ngram_range=(1, 3)).fit(X_train['text'])
    X_train_vectorized = vect.transform(X_train['text'])
    X_test_vectorized = vect.transform(X_test['text'])
    X_train_vectorized = add_feature(X_train_vectorized, X_train['length_of_doc'])
    X_train_vectorized = add_feature(X_train_vectorized, X_train['digits_count'])
    X_test_vectorized = add_feature(X_test_vectorized, X_test['length_of_doc'])
    X_test_vectorized = add_feature(X_test_vectorized, X_test['digits_count'])
    
    clf = LogisticRegression(C=100).fit(X_train_vectorized, y_train)
    y_pred = clf.predict(X_test_vectorized) # hard 0/1 labels, so the AUC below is label-based
    cm = confusion_matrix(y_test, y_pred)
    score = roc_auc_score(y_test, y_pred)
    return score, cm

LR_auc, LR_cm = LR_model()
print('AUC Score: ' + str(LR_auc))
print(LR_cm)

AUC Score: 0.9678709064054463
[[1192    4]
 [  12  185]]


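The model performs very well, but its AUC score is lower than those of the SVM and Naive Bayes models (as with the baseline, the AUC here is computed from hard label predictions).  It has the second smallest false negative count and the highest false positive count so far.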

#### Model Epsilon: Logistic Regression With Count Vectorizer

Fits and transforms the training data X_train using a Count Vectorizer ignoring terms that have a document frequency strictly lower than 5 and using character n-grams from n=2 to n=5. Passes in analyzer='char_wb' which creates character n-grams only from text inside word boundaries. This makes the model more robust to spelling mistakes.

Uses this document-term matrix and the following additional features:

- The length of document (number of characters)
- Number of digits per document
- Number of non-word characters (anything other than a letter, digit or underscore.)

Fits a Logistic Regression model with regularization C=100. Then computes the area under the curve (AUC) score using the transformed test data and a confusion matrix.
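The character n-grams described above can be sketched by hand. This is a simplified re-implementation for illustration, not scikit-learn's code: each whitespace-separated word is padded with one space on each side and n-grams are taken inside the padded word only, which is the behavior documented for `analyzer='char_wb'`. The helper `char_wb_ngrams` is hypothetical.

```python
def char_wb_ngrams(text, n_min=2, n_max=5):
    """Character n-grams taken inside each space-padded word, never across words."""
    grams = []
    for word in text.lower().split():
        padded = ' ' + word + ' '
        for n in range(n_min, n_max + 1):
            grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

print(char_wb_ngrams('free', 2, 3))

# A misspelling still shares many n-grams with the correct spelling.
shared = set(char_wb_ngrams('free')) & set(char_wb_ngrams('freee'))
print(sorted(shared))
```

Because 'free' and 'freee' overlap in most of their n-grams, a model trained on one spelling still receives a signal from the other, which is the robustness to spelling mistakes mentioned above.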

In [13]:
def LR2_model():
    import re
    temp = spam_data.copy()
    temp['length_of_doc'] = temp['text'].str.len()
    temp['digit_count'] = temp['text'].apply(lambda row: len(re.findall(r'\d', row)))
    temp['non_word_char_count'] = temp['text'].apply(lambda row: len(re.findall(r'\W', row)))
    X_train, X_test, y_train, y_test = train_test_split(temp.drop('target', axis=1), temp['target'], random_state=0)
    
    vect = CountVectorizer(min_df=5, ngram_range=(2, 5), analyzer='char_wb').fit(X_train['text'])
    X_train_vectorized = vect.transform(X_train['text'])
    X_test_vectorized = vect.transform(X_test['text'])
    X_train_vectorized = add_feature(X_train_vectorized, X_train['length_of_doc'])
    X_train_vectorized = add_feature(X_train_vectorized, X_train['digit_count'])
    X_train_vectorized = add_feature(X_train_vectorized, X_train['non_word_char_count'])
    X_test_vectorized = add_feature(X_test_vectorized, X_test['length_of_doc'])
    X_test_vectorized = add_feature(X_test_vectorized, X_test['digit_count'])
    X_test_vectorized = add_feature(X_test_vectorized, X_test['non_word_char_count'])
    clf = LogisticRegression(C=100).fit(X_train_vectorized, y_train)
    y_pred = clf.predict(X_test_vectorized) # hard 0/1 labels, so the AUC below is label-based
    score = roc_auc_score(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)
    return score, cm

LR2_auc, LR2_cm = LR2_model()
print('AUC Score: ' + str(LR2_auc))
print(LR2_cm)

AUC Score: 0.9788593110707434
[[1194    2]
 [   8  189]]


By using character n-grams within word boundaries, which tolerate spelling errors, and increasing the n-gram range, the Logistic Regression improved on the previous run, earning a higher AUC score and lower false positive and false negative counts.

### Step 4: Summarize Models and Discern Best Candidates

In [14]:
NB = [NB_auc, NB_cm]
NB2 = [NB2_auc, NB2_cm]
SVC_res = [SVC_auc, SVC_cm] # named SVC_res to avoid shadowing the imported SVC class
LR = [LR_auc, LR_cm]
LR2 = [LR2_auc, LR2_cm]
d = list(zip(NB, NB2, SVC_res, LR, LR2)) # one row of AUC scores, one row of confusion matrices
df = pd.DataFrame(d, index=['AUC Score', 'Confusion Matrix'], columns=['NB', 'NB2', 'SVC', 'LR', 'LR2'])
df = df.sort_values(by='AUC Score', axis=1)
df

| | LR | NB | LR2 | SVC | NB2 |
|---|---|---|---|---|---|
| AUC Score | 0.967871 | 0.972081 | 0.978859 | 0.995111 | 0.995497 |
| Confusion Matrix | [[1192, 4], [12, 185]] | [[1196, 0], [11, 186]] | [[1194, 2], [8, 189]] | [[1193, 3], [16, 181]] | [[1196, 0], [23, 174]] |


###  Conclusions

Comparing the outputs of the five models, either the Support Vector model or the tuned Naive Bayes model appears most suitable for predicting spam emails, since their AUC scores are nearly identical and the highest of the five.  Between the two, the tradeoff lies in precision versus recall, and the preference is left to the user. The tuned Naive Bayes model is a strong candidate because it produces zero false positives; however, it has lower overall accuracy than the SVC model (23 mistakes vs. 19). In practice, these differences are small, but they matter if false positives are especially harmful when the model is deployed in an uncontrolled setting.  One caveat: the NB, LR, and LR2 scores were computed from hard label predictions rather than continuous scores, so they are not strictly comparable with the NB2 and SVC scores.  In all, both are good candidates, and an argument can be made for any of the five, given that their AUC scores are within 0.03 of each other.
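The precision and recall tradeoff can be read directly off the confusion matrices reported above. The sketch below assumes scikit-learn's layout, `[[TN, FP], [FN, TP]]`, and copies the five matrices from the summary table; `summarize` is a hypothetical helper.

```python
# Confusion matrices copied from the summary table, laid out as [[TN, FP], [FN, TP]].
matrices = {
    'NB':  [[1196, 0], [11, 186]],
    'NB2': [[1196, 0], [23, 174]],
    'SVC': [[1193, 3], [16, 181]],
    'LR':  [[1192, 4], [12, 185]],
    'LR2': [[1194, 2], [8, 189]],
}

def summarize(cm):
    """Derive precision, recall and the total error count from one matrix."""
    (tn, fp), (fn, tp) = cm
    return {
        'precision': tp / (tp + fp),  # share of flagged messages that really are spam
        'recall': tp / (tp + fn),     # share of spam messages that were flagged
        'errors': fp + fn,
    }

summary = {name: summarize(cm) for name, cm in matrices.items()}
for name, stats in summary.items():
    print(name, round(stats['precision'], 4), round(stats['recall'], 4), stats['errors'])
```

At the default threshold, NB2 trades recall for perfect precision, while SVC gives up a little precision for higher recall and fewer total errors.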

The results also suggest that ignoring terms with a document frequency below 3 is a reasonable benchmark, as is adding a regularization penalty to smooth out inconsistencies such as spelling variations or outliers.  Length of document also appears to be a potential indicator, but it is not used in the Naive Bayes models, suggesting that it might play a minor role in a more finely tuned model.