<h1>Description</h1>

Context

The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains one set of 5,574 SMS messages in English, each tagged as ham (legitimate) or spam.

Content

The files contain one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.

This corpus has been collected from free or free-for-research sources on the Internet:

-> A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without quoting the actual spam message received. Identifying the text of spam messages in the claims is a very hard and time-consuming task, and it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: [Web Link].

-> A subset of 3,375 randomly chosen SMS ham messages from the NUS SMS Corpus (NSC), which is a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans and mostly from students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available. The NUS SMS Corpus is available at: [Web Link].

-> A list of 450 SMS ham messages collected from Caroline Tag's PhD Thesis, available at [Web Link].

-> Finally, we have incorporated the SMS Spam Corpus v.0.1 Big. It has 1,002 SMS ham messages and 322 spam messages, and it is publicly available at: [Web Link].

This corpus has been used in the following academic research:

Acknowledgements

The original dataset can be found here. The creators ask that, if you find the dataset useful, you reference the paper below and the web page http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ in your papers, research, etc.

We offer a comprehensive study of this corpus in the following paper. This work presents a number of statistics, studies and baseline results for several machine learning methods.

Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.

Inspiration

Can you use this dataset to build a prediction model that will accurately classify which texts are spam?


<h1>Import data</h1>

In [None]:
import numpy as np
import pandas as pd
import plotly as py
import matplotlib.pyplot as plt
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import colorlover as cl
import operator
%matplotlib inline
import string
import itertools
import re
import warnings
warnings.filterwarnings('ignore')

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, train_test_split, cross_val_score, GridSearchCV
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.decomposition import PCA
import xgboost as xgb
import seaborn as sns

import nltk
from nltk.corpus import stopwords

from collections import Counter, OrderedDict

import os
print(os.listdir("../input"))

from IPython.display import HTML

RANDOM_STATE = 43

<h1>Exploratory Data Analysis</h1>
<h2>Data review</h2>

In [None]:
df = pd.read_csv("../input/spam.csv",encoding='latin-1')
df.head()

Let's drop the unnecessary columns and rename the remaining ones:

In [None]:
df = df.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
df = df.rename(columns= {"v1": "label", "v2": "text"})
df.label = df.label.astype('category') 
df.text = df.text.astype('str')
df.head()

In [None]:
df.describe()

In [None]:
df.label.value_counts()


<h2>2. Visualization</h2>

Let's visualize the class counts. Ham messages outnumber spam by more than five to one, so we need to either resample or use stratified k-fold cross-validation.

In [None]:
_, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(20, 5))
df.label.value_counts(sort=True).plot(kind='pie', ax=ax2, autopct='%1.0f%%')
df.label.value_counts(sort=True).plot(kind='bar', color=['blue', 'red'], ax=ax1)
ax1.set_title('Objects counts')
ax1.set_ylabel('Count')
ax1.set_xlabel('Label')
ax2.set_title('Objects counts')
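
The remark above about stratified k-fold can be illustrated with a minimal sketch. The toy labels here are invented (50 ham, 10 spam) and only mimic an imbalanced split; the point is that `StratifiedKFold` keeps the class ratio the same in every fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Invented toy labels: 50 ham (0) and 10 spam (1)
y_toy = np.array([0] * 50 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # features do not matter for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=43)

# Count how many spam messages land in each test fold
spam_per_fold = [int(y_toy[test_idx].sum()) for _, test_idx in skf.split(X_toy, y_toy)]
print(spam_per_fold)
```

With a plain, unstratified split, a fold could end up with very few spam messages, which makes per-fold scores for the minority class unreliable.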

So far we have ignored phone numbers contained in the SMS messages. Let's flag the messages that include one:

In [None]:
# Matches either 10 consecutive digits or a dash-separated 3-3-3 digit group
PHONE_PATTERN = re.compile(r"[0-9]{10}|[0-9]{3}-[0-9]{3}-[0-9]{3}")

def has_phone_number(text):
    return PHONE_PATTERN.search(text) is not None

# Store the flag as a new boolean column, then plot label counts for flagged messages
df = df.assign(has_phone_number=df.text.apply(has_phone_number))
df[df.has_phone_number].label.value_counts(sort=True).plot(kind="bar")
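
To see what the phone-number pattern above does and does not catch, here is a small standalone check. The sample messages are invented for illustration and are not taken from the dataset.

```python
import re

# Same two alternatives as in the notebook: 10 digits in a row, or a 3-3-3 dashed group
PHONE_RE = re.compile(r"[0-9]{10}|[0-9]{3}-[0-9]{3}-[0-9]{3}")

# Invented sample messages
samples = [
    "Call 08001234567 now to claim your prize",  # 11 digits in a row
    "My number is 555-123-456",                  # dashed 3-3-3 group
    "See you at 10 tonight",                     # short number, not a phone number
]

flags = [PHONE_RE.search(s) is not None for s in samples]
print(flags)
```

Note that the pattern is a heuristic: numbers written with spaces or in other groupings would not be flagged.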

<h1>3. Text Analytics</h1>

Find the 100 most common words in the text. We also remove punctuation and stop words using the NLTK library. The results are split into a spam set and a ham set, which we then visualize.

In [None]:
MAX_COMMON_WORDS = 100

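# Build the stop-word set once instead of reloading the list for every word
STOP_WORDS = set(stopwords.words('english'))

def clean_from_stop_words_and_punctuation(x):
    # Lowercase each token and drop stop words and single punctuation tokens
    return [word.lower() for word in x.split() if word.lower() not in STOP_WORDS and word.lower() not in string.punctuation]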

def sort_dict_by_value(t):
    return sorted(t, key=lambda x: x[1],reverse=True)

def get_word_arr(label):
    clean_and_join = lambda x: " ".join(clean_from_stop_words_and_punctuation(x))
    cleaned_arr = df[df.label==label].text.apply(clean_and_join)
    splitted_strings = cleaned_arr.apply(lambda word: word.split(" ")).values
    return list(itertools.chain.from_iterable(splitted_strings))

def get_counter_dict(label):
    return sort_dict_by_value( Counter(get_word_arr(label)).most_common(MAX_COMMON_WORDS))

counter_ham = get_counter_dict('ham')
counter_spam = get_counter_dict('spam')
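
The counting step above can be sketched on a few invented messages. This is a simplified illustration: `TOY_STOP_WORDS` is a hypothetical mini list standing in for NLTK's English stop words, so the example runs without downloading any corpus.

```python
from collections import Counter
import string

# Hypothetical mini stop-word list; the notebook uses NLTK's English list instead
TOY_STOP_WORDS = {"a", "to", "the", "you", "is"}

# Invented messages
messages = ["Free entry to win a prize", "Win a FREE phone !", "The prize is free"]

# Lowercase, split on whitespace, drop stop words and bare punctuation tokens
tokens = [
    w.lower()
    for msg in messages
    for w in msg.split()
    if w.lower() not in TOY_STOP_WORDS and w.lower() not in string.punctuation
]

top = Counter(tokens).most_common(2)
print(top)
```

`Counter.most_common` already returns pairs sorted by descending count, which is why the result can be passed to a DataFrame or a bar chart directly.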

In [None]:
spam_counter_df = pd.DataFrame.from_dict(counter_spam)
spam_counter_df.T

In [None]:
ham_counter_df = pd.DataFrame.from_dict(counter_ham)
ham_counter_df.T

In [None]:
ham_plot = go.Bar(
    x = ham_counter_df.iloc[:, 0],
    y = ham_counter_df.iloc[:, 1],
    name = "Common ham words"
)

iplot([ham_plot])

In [None]:
iplot([go.Bar(
    x = spam_counter_df.iloc[:, 0],
    y = spam_counter_df.iloc[:, 1],
    marker = dict(
        color=cl.scales['3']['div']['RdYlBu'][0]
    )
)])

Let's explore one more feature: message length. We will look for a relationship between length and the spam label.

In [None]:
len_df = df.assign(len=df.text.apply(lambda x: len(x)))
len_df.head()

In this sample, message length does not appear to depend on whether a message is spam or ham:

In [None]:
_, (ax1) = plt.subplots(nrows=1, ncols=1, figsize=(25, 5))
ax = sns.countplot(data=len_df.sort_values(['len'], ascending=False).sample(200), x='len', hue='label', ax=ax1,dodge=True)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.show()
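
A count plot over a random sample can be hard to read, so a per-label numeric summary is a possible complement. The sketch below uses four invented rows purely to show the `groupby` call; it says nothing about the real corpus, where the same call could be applied to `len_df`.

```python
import pandas as pd

# Invented rows, only to demonstrate the aggregation
toy = pd.DataFrame({
    "label": ["ham", "ham", "spam", "spam"],
    "text": ["ok", "see you soon", "WIN a FREE prize now!!!", "Call 0800 to claim"],
})
toy["len"] = toy.text.str.len()

# Mean, median and maximum message length per label
length_summary = toy.groupby("label")["len"].agg(["mean", "median", "max"])
print(length_summary)
```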


<h1>Make predictions</h1>

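Let's try a null model first: a classifier trained on constant features, which can only predict the majority class. Its score is the baseline that any real model has to beat.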

In [None]:
dtc = GridSearchCV(DecisionTreeClassifier(), param_grid = { 'criterion': ['gini', 'entropy'] }, cv=5)
dtc.fit(np.zeros((df.shape[0],1)), df.label).best_score_
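
As an alternative way to get the same kind of baseline, scikit-learn provides `DummyClassifier`. This is a sketch on invented labels (87 ham, 13 spam), not the notebook's data, and it is not the approach the notebook itself uses.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Invented labels with a strong majority class
y_toy = np.array(["ham"] * 87 + ["spam"] * 13)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predicts the most frequent label seen during fit
null_model = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_accuracy = null_model.score(X_toy, y_toy)
print(baseline_accuracy)
```

The accuracy equals the share of the majority class, which is why plain accuracy is a weak metric on imbalanced data.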

In [None]:
vectorizer = CountVectorizer()

X = vectorizer.fit_transform(df.text)
X.shape
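
To make the shape of `X` more concrete, here is a small sketch of what `CountVectorizer` produces on three invented documents: one row per document, one column per vocabulary word, and each cell holding a word count.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented documents
docs = ["free prize now", "call me now", "free free call"]

toy_vectorizer = CountVectorizer()
X_toy = toy_vectorizer.fit_transform(docs)  # sparse matrix of word counts

# Columns follow the alphabetical order of the learned vocabulary
print(sorted(toy_vectorizer.vocabulary_))
print(X_toy.toarray())
```

By default the vectorizer lowercases the text and keeps only tokens of two or more characters.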

Map the labels to binary integers (spam = 1, ham = 0):

In [None]:
df.label = df.label.map({ 'spam': 1, 'ham': 0 })
y = df.label

Let's try to visualize the vectorized text by projecting it onto its first two principal components:

In [None]:
X_dense = X.toarray()  # PCA needs a dense array; convert once and reuse it
data2D = PCA(n_components=2).fit_transform(X_dense)
fig, ax = plt.subplots(figsize=(20, 15))
ax.set_title('Vectorize plot')
ax.set_ylabel('PC2')
ax.set_xlabel('PC1')
# Pass x and y as keywords (required by recent seaborn); hue supplies the legend (0 = ham, 1 = spam)
sns.scatterplot(x=data2D[:, 0], y=data2D[:, 1], hue=df.label, ax=ax)
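
Converting the whole count matrix to a dense array can be memory-hungry for larger corpora. One possible alternative, shown here as a sketch on invented documents, is `TruncatedSVD`, which works on sparse input directly. It is a swapped-in technique, not the notebook's PCA: it does not center the data, so the projection will differ somewhat.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Invented documents
docs = [
    "free prize now", "call me now", "free free call",
    "see you later", "win a free prize",
]
X_toy = CountVectorizer().fit_transform(docs)  # stays sparse

# Project the sparse counts onto two components without densifying
svd = TruncatedSVD(n_components=2, random_state=43)
coords = svd.fit_transform(X_toy)
print(coords.shape)
```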

Let's set up the classifiers. The first one we try is a naive Bayes classifier. Our task is binary classification, so we start with BernoulliNB, which also binarizes the word counts into presence/absence features:


In [None]:
bnb = GridSearchCV(BernoulliNB(),{ 'alpha':range(100),}, cv=StratifiedKFold(n_splits=5), refit=True)
cross_val_score(bnb, X, y, cv=5)

Bernoulli classification report on a held-out test set:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=RANDOM_STATE)
fit = bnb.fit(X_train,y_train)
print(classification_report(y_test, fit.predict(X_test)))
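
The classification report summarizes precision and recall, but the raw confusion matrix is often easier to reason about for spam filtering. The sketch below uses invented true and predicted labels; in the notebook the same call could be made with `y_test` and the model's predictions.

```python
from sklearn.metrics import confusion_matrix

# Invented labels: 0 = ham, 1 = spam
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = (int(v) for v in cm.ravel())
print(cm)
print("ham flagged as spam:", fp, "| spam missed:", fn)
```

For a spam filter, false positives (legitimate messages flagged as spam) are usually the more costly error, so it helps to look at them separately.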

In [None]:
mnb = GridSearchCV(MultinomialNB(),{ 'alpha':range(100),}, cv=StratifiedKFold(n_splits=5), refit=True)
fit = mnb.fit(X_train,y_train)
print(classification_report(y_test, fit.predict(X_test)))
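
The grid search over the smoothing parameter `alpha` can be sketched on a small invented dataset. The messages, labels and `alpha_grid` below are assumptions for illustration. The grid uses only strictly positive values, which avoids the zero-smoothing edge case that a grid starting at 0 includes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB

# Invented toy messages: 6 spam (1) followed by 6 ham (0)
texts = [
    "win a free prize now", "free entry claim prize", "claim your free phone",
    "urgent call now to win", "free cash prize call", "win cash now",
    "see you at lunch", "are you coming home", "call me when you can",
    "lunch at home today", "thanks see you later", "coming home late today",
]
labels = [1] * 6 + [0] * 6

X_toy = CountVectorizer().fit_transform(texts)
alpha_grid = [0.01, 0.1, 1.0, 10.0]  # strictly positive smoothing values

# Stratified folds keep two spam and two ham messages in every test fold
search = GridSearchCV(
    MultinomialNB(),
    {"alpha": alpha_grid},
    cv=StratifiedKFold(n_splits=3),
)
search.fit(X_toy, labels)
print(search.best_params_, search.best_score_)
```

A logarithmically spaced grid like this one covers several orders of magnitude with far fewer fits than a dense integer range.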

In [None]:
gnb = GaussianNB()
fit = gnb.fit(X_train.toarray(), y_train)
print(classification_report(y_test, fit.predict(X_test.toarray())))

Naive Bayes algorithms give quite good results. 