Hi all, this is my attempt at building a spam filter from the `sms-spam-collection-dataset`.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.pipeline import Pipeline
plt.style.use('ggplot')
%matplotlib inline

First we read the CSV file.

In [None]:
df = pd.read_csv('../input/sms-spam-collection-dataset/spam.csv',encoding='latin-1')
df

In [None]:
df.isna().sum()

Let's examine the columns to see what is going on.

In [None]:
not_null = df[df['Unnamed: 2'].notnull()].head()
not_null

These look like extra pieces of the message text, so let's add them to the same column.

In [None]:
for index, rows in df.iterrows():
    for r in rows[2:]:
        if type(r) is str:
            df.loc[index,'v2'] = df.loc[index,'v2'] + ' ' + r
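
The same merge can probably be written without the nested loop. Here is a minimal sketch on a made-up frame that mimics the layout of `spam.csv`; the helper name `merge_text_columns` and the toy rows are my own assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the layout of spam.csv (rows are made up for illustration)
toy = pd.DataFrame({
    'v1': ['ham', 'spam', 'ham'],
    'v2': ['see you', 'WIN a prize', 'ok'],
    'Unnamed: 2': [np.nan, 'call now', np.nan],
    'Unnamed: 3': [np.nan, 'txt STOP to end', 'later'],
})

def merge_text_columns(frame):
    """Join v2 with every non-null overflow column, separated by single spaces."""
    text_cols = frame.columns[1:]  # v2 plus the 'Unnamed' columns
    return frame[text_cols].apply(
        lambda row: ' '.join(row.dropna().astype(str)), axis=1
    )

merged = merge_text_columns(toy)
print(merged.tolist())
```

On the real data this would be used as `df['v2'] = merge_text_columns(df)`, assuming the columns are still in their original order.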

  

Let's check the data.

In [None]:
# not_null was copied before the merge, so look the same rows up in df again
df.loc[not_null.index]

In [None]:
df.loc[281,'v2']

In [None]:
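# `!= 0` is True for every row (NaN included), so filter on notnull() instead
df.loc[df['Unnamed: 2'].notnull(), ['v2', 'Unnamed: 2']].head()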

The columns have been combined, so now we can do our analysis without missing any data!

In [None]:
cols = df.columns
cols

In [None]:
df = df.drop(cols[2:],axis=1)

Let's rename the columns for readability.

In [None]:
df = df.rename(columns={'v1': 'class','v2': 'text'})

And we can encode our target variable.

In [None]:
# Map only the label column, so a message whose whole text is 'ham' or 'spam' stays untouched
df['class'] = df['class'].map({'ham': 0, 'spam': 1})

In [None]:
# astype returns a new Series, so assign it back to keep the integer dtype
df['class'] = df['class'].astype('int')

About 13.4% of the text messages are spam.

In [None]:
percentage_spam = df['class'].mean()
percentage_spam

Let's look at the length of the text to see whether it helps predict spam.

In [None]:
df['text_length'] = df['text'].apply(len)
df

In [None]:
df[df['class']==0]['text_length']

It looks like the mean character length for spam is higher. We should visualize this.

In [None]:
df.groupby('class').describe()

In [None]:
plt.figure(figsize=(10,8))
sns.histplot(data=df, x='text_length',hue='class',stat="density", common_norm=False)
plt.xlim(-1,250)
plt.title('Distribution of spam(1) vs ham(0)')
plt.ylabel('Normalized Frequency')
plt.xlabel('Length of Text')

Let's start building our NLP model.

In [None]:
X = df.drop('class',axis=1)
y = df['class']

In [None]:
# CountVectorizer expects a 1-D sequence of strings, so pass only the text column
X_train, X_test, y_train, y_test = train_test_split(X['text'],y,test_size=0.2,random_state=42)

In [None]:
count_vectorizer = CountVectorizer(stop_words='english')

In [None]:
count_train = count_vectorizer.fit_transform(X_train.values)
count_test = count_vectorizer.transform(X_test.values)
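
To see what `fit_transform` and `transform` actually produce, here is a small sketch on made-up messages (not rows from the dataset). The vocabulary is learned from the training messages only, so a word that appears only in the test message is simply dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up messages, only for illustration
train_msgs = ['prize winner claim prize', 'lunch tomorrow', 'claim your prize']
test_msgs = ['prize lunch holiday']

vec = CountVectorizer(stop_words='english')
train_counts = vec.fit_transform(train_msgs)  # learns the vocabulary, returns a sparse matrix
test_counts = vec.transform(test_msgs)        # reuses that vocabulary, ignores unseen words

vocab = sorted(vec.vocabulary_)               # words kept after stop-word removal
print(vocab)
print(train_counts.toarray())
print(test_counts.toarray())
```

Each row is one message and each column is one vocabulary word, which is the kind of count matrix `MultinomialNB` is trained on.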

In [None]:
multi_nb = MultinomialNB()

In [None]:
multi_nb.fit(count_train,y_train)

In [None]:
y_pred = multi_nb.predict(count_test)

In [None]:
metrics.accuracy_score(y_test,y_pred)

This is a pretty good result, as it beats the roughly 87% accuracy we would get by just predicting everything as ham.
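
A quick way to get that kind of baseline number is scikit-learn's `DummyClassifier`. This is only a sketch on synthetic labels with about the same class balance as reported earlier (13.4% spam); the arrays `X_toy` and `y_toy` are made up.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels: 134 spam (1) and 866 ham (0), made up to mirror the class balance
y_toy = np.array([1] * 134 + [0] * 866)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predicts the most frequent class, which is ham here
baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print(baseline_acc)
```

Any real model should beat this number, and because the classes are imbalanced, precision and recall for spam say more than accuracy alone.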

In [None]:
print(classification_report(y_test,y_pred))

In [None]:
# fmt='d' shows the counts as plain integers instead of scientific notation
sns.heatmap(confusion_matrix(y_test,y_pred),annot=True,fmt='d')
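
For reading the confusion matrix: in scikit-learn the rows are the true classes and the columns are the predicted classes. Here is a small sketch with made-up labels (not the notebook's results) showing how precision and recall for spam come out of the four cells.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration: 0 = ham, 1 = spam
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# For binary labels, ravel() gives the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

spam_precision = tp / (tp + fp)  # of the messages flagged as spam, how many really are spam
spam_recall = tp / (tp + fn)     # of the real spam messages, how many were caught

print(tn, fp, fn, tp)
print(spam_precision, spam_recall)
```

For a spam filter, a false positive (ham flagged as spam) is usually the more annoying mistake, so spam precision is worth watching closely.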

Let's try to optimize the model's hyperparameters even further.

In [None]:
alpha_list = np.logspace(0,200,20)
alpha_list

In [None]:
nb_params = {'alpha': alpha_list}

In [None]:
multi_nb.get_params()

In [None]:
nb_grid = GridSearchCV(multi_nb,nb_params,n_jobs=-1)

In [None]:
nb_grid.fit(count_train,y_train)

In [None]:
nb_grid.best_params_

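Looks like the best parameter is `alpha = 1.0`, which the model was already using by default. Keep in mind that `np.logspace(0,200,20)` runs from 10^0 = 1 up to 10^200, so 1.0 is the smallest value in the grid and alphas below 1 were never tested.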
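
One thing I could try next is a grid that also covers alphas below 1, with the vectorizer and the classifier wrapped in a `Pipeline` so the vocabulary is refit inside every cross-validation fold. This is only a sketch on a tiny made-up corpus; the step names `vec` and `nb`, the grid range, and the `f1` scoring are my own assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, repeated so 5-fold CV has enough samples per class
texts = ['win a free prize now', 'claim your cash reward',
         'are we still on for lunch', 'see you at home tonight'] * 5
labels = [1, 1, 0, 0] * 5  # 1 = spam, 0 = ham

pipe = Pipeline([
    ('vec', CountVectorizer(stop_words='english')),
    ('nb', MultinomialNB()),
])

# 9 values from 10**-3 to 10**1, so smoothing weaker than the default is tried too
param_grid = {'nb__alpha': np.logspace(-3, 1, 9)}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='f1')
search.fit(texts, labels)
print(search.best_params_)
```

On the real data this would be fitted with `search.fit(X_train, y_train)`, using the raw text rather than the precomputed count matrix.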

Thanks for reading this beginner's notebook. If you have any suggestions on how to improve my model, please let me know!