<h2> 0) Pre-Processing Code </h2>

In [68]:
def Text_Length(Data):
    # Number of NLTK tokens in each message
    Word_lengths=[]

    for i in Data:

        Word_lengths.append(len(nltk.word_tokenize(i)))

    return(Word_lengths)

In [69]:
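def Wordify(Data):
    # Collect the lower-cased, non-stop-word tokens of every message and
    # join them with spaces so WordCloud can split them back into words
    Words=[]

    for i in Data:

        for j in nltk.word_tokenize(i):

            # lower() must be called; comparing the bound method itself
            # against stop_words never filters anything
            if(j.lower() not in stop_words):

                Words.append(j.lower())

    return(" ".join(Words))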


In [96]:
def Featurize(Data):
    # Encode the labels as integers: 1 for "spam", 0 for everything else ("ham")
    Feature=[]

    for i in Data:

        if(i=="spam"):

            Feature.append(1)

        else:

            Feature.append(0)

    return(Feature)

<h1>1) Data & Library Imports</h1>

In [71]:

#Necessary Imports

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns #for developing visualizations
import nltk #for general text processing
import matplotlib.pyplot as plt #for developing visualizations 


In [72]:

#Random Sample of the data

Data=pd.read_csv("../input/spam.csv",encoding = "ISO-8859-1")
Data.sample(n=10).head(n=5)


<h1>2) Exploratory Analysis</h1>

In [73]:

#Distribution of Class of Dataset

print("Length of Dataset: ",len(Data))
print("Spam: ",len(Data[Data['v1']=="spam"]))
print("Ham: ",len(Data[Data['v1']=="ham"]))


In [74]:

Data[Data['v1']=="ham"][0:5]['v2']


In [75]:

Data[Data['v1']=="spam"][0:5]['v2']


In [76]:
from wordcloud import WordCloud #to generate word clouds 
from nltk.corpus import stopwords

In [101]:
stop_words=set(stopwords.words("english"))
Spam=Data[Data['v1']=="spam"]['v2']
Ham=Data[Data['v1']=="ham"]['v2']

In [102]:
spam_wordcloud=WordCloud(width=600,height=400).generate(Wordify(Spam))
ham_wordcloud=WordCloud(width=600,height=400).generate(Wordify(Ham))

<h2>Spam Word Cloud</h2>

In [103]:
plt.figure( figsize=(10,8), facecolor='k')
plt.imshow(spam_wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

<h2>Ham Word Cloud</h2>

In [104]:
plt.figure( figsize=(10,8), facecolor='k')
plt.imshow(ham_wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

<h2>Word Lengths</h2>

In [105]:
Spam_Lengths=pd.Series(Text_Length(Spam),name="Spam Lengths")
Ham_Lengths=pd.Series(Text_Length(Ham),name="Ham Lengths")
fig,ax=plt.subplots(1,2,figsize=(15,6))
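# distplot is deprecated in recent seaborn releases; histplot with kde=True
# and stat="density" draws the same histogram-plus-density view
sns.histplot(Spam_Lengths,kde=True,stat="density",ax=ax[0])
sns.histplot(Ham_Lengths,kde=True,stat="density",ax=ax[1])
plt.show()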

In [83]:
Ham_Lengths.describe()

In [84]:
Spam_Lengths.describe()

<h1>3) Data Pre-Processing</h1>

In [85]:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer


The text messages need to be converted into numeric features that machine learning models can learn from.

<h2>Bigram Features</h2>

In [86]:

BiGram=CountVectorizer(ngram_range=(1, 2))
X_BOW=BiGram.fit_transform(Data['v2'])
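
To make the `ngram_range=(1, 2)` setting concrete, here is a small pure-Python sketch of what gets counted for a single message. It only approximates the vectorizer's default tokenization (lower-casing, tokens of two or more word characters); `ngram_counts` and the example message are made up for illustration and are not part of scikit-learn or the dataset.

```python
import re
from collections import Counter

# Approximation of CountVectorizer's default tokenizer: lowercase the text
# and keep runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def ngram_counts(text, ngram_range=(1, 2)):
    """Count every n-gram of the requested sizes in one message (hypothetical helper)."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        # Slide a window of n consecutive tokens over the message.
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    return counts

# Made-up message, not taken from the dataset.
counts = ngram_counts("Free entry! Win a free prize")
print(counts)
```

Each distinct unigram and bigram becomes one column of `X_BOW`, which is why adding bigrams makes the feature matrix much wider than a plain bag of words.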


<h2>Tf-Idf Features</h2>

In [87]:

tfidf=TfidfTransformer()
X_T=tfidf.fit_transform(X_BOW)
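
As a rough illustration of what the transformer does with its default settings (smoothed idf and L2 row normalization), the sketch below reweights a tiny count matrix by hand. The helper `tfidf_rows`, the three terms and the counts are all invented for this example; the real transformer works on sparse matrices.

```python
import math

def tfidf_rows(count_rows):
    """Apply smoothed idf weighting and L2 row normalization to raw counts."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: number of documents in which each term appears.
    df = [sum(1 for row in count_rows if row[t] > 0) for t in range(n_terms)]
    # Smoothed idf: ln((1 + n_docs) / (1 + df)) + 1
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    weighted_rows = []
    for row in count_rows:
        weighted = [count * w for count, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        # Scale each row to unit length; an all-zero row stays all zero.
        weighted_rows.append([v / norm if norm else 0.0 for v in weighted])
    return weighted_rows

# Toy counts (made up): columns are the terms "free", "call", "home".
toy_counts = [
    [2, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
]
weights = tfidf_rows(toy_counts)
for row in weights:
    print([round(v, 3) for v in row])
```

The term that appears in every message ("call") gets the smallest idf, so rarer terms end up with more weight within a row.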


Let's split the data into train and test sets so we can evaluate how well the algorithms perform and generalize. We will use a 60:40 train/test split. Although cross-validation would give a more reliable estimate, it is slower to run on Kaggle and caused my session to time out.
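
For reference, this is roughly how k-fold cross-validation would partition the rows if runtime were not an issue. It is a pure-Python sketch with a hypothetical helper name, `k_fold_indices`; scikit-learn's `cross_val_score` handles this splitting internally.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is held out exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

# Ten rows split into five folds of two held-out rows each.
splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print(sorted(test_idx), len(train_idx))
```

Every model is then trained k times instead of once, which is where the extra runtime comes from.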

In [97]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_T,Featurize(Data['v1']), test_size=0.4, random_state=0)

<h1>4) Machine Learning </h1>

In [98]:

#Necessary imports for machine learning algorithms

from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import linear_model
from sklearn.model_selection import cross_val_score

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report


<h3>Gaussian Naive Bayes</h3>

In [99]:
gNB=GaussianNB()
gNB.fit(X_train.toarray(),y_train)
accuracy_score(gNB.predict(X_test.toarray()),y_test)

<h3>Multinomial Naive Bayes</h3>

In [100]:
mNB=MultinomialNB()
mNB.fit(X_train.toarray(),y_train)
accuracy_score(mNB.predict(X_test.toarray()),y_test)

<h3>RBF SVM</h3>

In [None]:
rbfsvm=SVC(kernel="rbf")
rbfsvm.fit(X_train.toarray(),y_train)
accuracy_score(rbfsvm.predict(X_test.toarray()),y_test)

<h3>Linear SVM</h3>

In [64]:
linearsvm=SVC(kernel="linear")
linearsvm.fit(X_train.toarray(),y_train)
linearsvm_results=linearsvm.predict(X_test.toarray())
accuracy_score(linearsvm_results,y_test)

<h3>Logistic Regression</h3>

In [65]:
Logit=linear_model.LogisticRegression()
Logit.fit(X_train.toarray(),y_train)
accuracy_score(Logit.predict(X_test.toarray()),y_test)

<h1>5) Model Evaluation</h1>

Among the algorithms above, linear SVM gave the best accuracy. Let's look at its confusion matrix and other evaluation metrics.

In [66]:
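# confusion_matrix expects (y_true, y_pred): rows are true labels,
# columns are predicted labels (0 = ham, 1 = spam)
sns.heatmap(confusion_matrix(y_test,linearsvm_results),annot=True,fmt="d")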

In [67]:
print(classification_report(y_test,linearsvm_results))
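
With roughly one spam message for every six or so ham messages, accuracy alone can look good even when spam is missed, so the per-class numbers in the report matter. The sketch below shows how precision, recall and F1 for the spam class come out of the true/false positive counts. `spam_metrics` is a hypothetical helper, and the labels are toy values, not results from the model above.

```python
def spam_metrics(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (made up): 1 = spam, 0 = ham.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 0, 0, 1]
precision, recall, f1 = spam_metrics(toy_true, toy_pred)
print(precision, recall, f1)
```

In this toy case 6 of 8 predictions are correct (75% accuracy), yet one of the three spam messages slips through and one ham message is flagged, which is what precision and recall make visible.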