> <h1>SMS Spam Detection</h1>

**Define Problem**

The first step in any machine learning or data science project is defining the problem.
Here the problem is easy to state. There are two kinds of SMS: <br> 1. Spam<br> 2. Ham<br>

We have a dataset of 5574 English SMS messages, each labeled as spam or ham, so this is a supervised classification problem.

**Loading Data**

Since we use a Kaggle dataset, we don't need to gather data ourselves. We use the pandas library to read the CSV file and load it into a pandas DataFrame.

The dataset file is named <mark style="background-color: LightYellow">spam.csv</mark> and is located in the <mark style="background-color: LightYellow">input</mark> directory.

In [None]:
import os
print(os.listdir("../input"))

The <mark style="background-color: LightYellow">load_data</mark> function takes the directory path (../input), the file name (spam.csv) and the file encoding as its parameters, reads the CSV file with pandas and returns a pandas DataFrame.

We call load_data with the <mark style="background-color: LightYellow">latin1</mark> encoding and store the result in the <mark style="background-color: LightYellow">spam</mark> variable.

In [None]:
import pandas as pd
import os

def load_data(path, filename, codec='utf-8'):
  csv_path = os.path.join(path, filename)
  print(csv_path)
  return pd.read_csv(csv_path, encoding=codec)

spam = load_data('../input', 'spam.csv', codec='latin1')
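
Why pass `latin1` instead of the default `utf-8`? The sketch below is only an illustration: it assumes the file contains single bytes above 0x7F, such as a Latin-1 pound sign. The byte string is made up and not taken from spam.csv.

```python
# Hypothetical raw bytes: "£100 prize" encoded as Latin-1 (0xA3 is the pound sign)
raw = b"\xa3100 prize"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xA3 on its own is not a valid UTF-8 sequence, so decoding fails
    utf8_ok = False

# Latin-1 maps every possible byte to a character, so decoding cannot fail
text = raw.decode("latin1")
print(utf8_ok, text)
```

If pandas raises a `UnicodeDecodeError` with the default encoding, this kind of byte is a likely cause.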

The pandas <mark style="background-color: LightYellow">head</mark> method returns the first 5 rows of the DataFrame.

We can see that the DataFrame has 5 columns:
1. **v1:** the label, either <mark style="background-color: LightYellow">spam</mark> or <mark style="background-color: LightYellow">ham</mark>
2. **v2:** the first line of the SMS, which cannot be empty
3. **Unnamed: 2:** the second line of the SMS
4. **Unnamed: 3:** the third line of the SMS
5. **Unnamed: 4:** the fourth line of the SMS

In [None]:
spam.head()

In this step, we <mark style="background-color: LightYellow">rename</mark> the DataFrame columns to make them easier to work with.

The output of the <mark style="background-color: LightYellow">describe</mark> method then shows the renamed columns. The table also tells us that only a few messages span more than one line and that most messages are unique.

In [None]:
spam.columns = ['label', 'line1', 'line2', 'line3', 'line4']
spam.describe()

**Visualize Data**

We use the <mark style="background-color: LightYellow">seaborn</mark> and <mark style="background-color: LightYellow">matplotlib</mark> libraries for visualization and plotting.

In these two plots we see that over 86% of the 5574 SMSs are ham and 13.4% of them are spam.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

f, axs = plt.subplots(1, 2, figsize=(12, 6))
sns.countplot(x=spam['label'], ax=axs[0])  # pass the column as x=; recent seaborn versions expect keyword arguments
axs[1].pie(spam.groupby(spam['label'])['line1'].count(), labels=['ham', 'spam'], autopct='%1.1f%%', startangle=90, pctdistance=0.85)
plt.show()
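
The percentages shown in the pie chart can also be computed directly. This is a minimal sketch on made-up labels, not the real dataset. `value_counts(normalize=True)` returns each class's share of the rows.

```python
import pandas as pd

# Toy labels standing in for spam['label'] (made-up data)
labels = pd.Series(['ham', 'ham', 'ham', 'spam'], name='label')

counts = labels.value_counts()                # absolute count per class
shares = labels.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print((shares * 100).round(1))
```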

We also want to know how many of the SMSs that span more than one line are spam. We can see that 90% of such SMSs are ham.

In [None]:
spam_with_more_line = spam[spam['line2'].notnull()]
f, axs = plt.subplots(1, 2, figsize=(12, 6))
sns.countplot(x=spam_with_more_line['label'], ax=axs[0])  # pass the column as x=; recent seaborn versions expect keyword arguments
axs[1].pie(spam_with_more_line.groupby(spam_with_more_line['label'])['line1'].count(), labels=['ham', 'spam'], autopct='%1.1f%%',
           startangle=90, pctdistance=0.85)
plt.show()

**Data Cleaning**

Now we should <mark style="background-color: LightYellow">concat</mark> the second, third and fourth lines with the first line. To do this we first fill their <mark style="background-color: LightYellow">NaN</mark> values with an empty string. Then we can join the lines and save the result in a new column called <mark style="background-color: LightYellow">text</mark>.

Now we can drop line1, line2, line3 and line4 from spam.

In [None]:
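# Assign the result back: calling fillna(..., inplace=True) on a selected column
# is deprecated in recent pandas and may not modify the original frame
spam['line2'] = spam['line2'].fillna('')
spam['line3'] = spam['line3'].fillna('')
spam['line4'] = spam['line4'].fillna('')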

spam['text'] = spam['line1'] + ' ' + spam['line2'] + ' ' + spam['line3'] + ' ' + spam['line4']

spam.drop(['line1', 'line2', 'line3', 'line4'], axis=1, inplace=True)

spam.head()
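
One side effect of joining with `+ ' ' +` is that messages without extra lines end up with trailing spaces. The following is a sketch of an alternative that joins only the non-empty parts. It uses a made-up three-column frame, and the helper name `join_lines` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the renamed dataframe
df = pd.DataFrame({
    'line1': ['hello', 'win a prize'],
    'line2': [np.nan, 'call now'],
    'line3': [np.nan, np.nan],
})

def join_lines(row):
    # Keep only real, non-empty strings (NaN is a float, so it is skipped)
    parts = [p for p in row if isinstance(p, str) and p]
    return ' '.join(parts)

df['text'] = df[['line1', 'line2', 'line3']].apply(join_lines, axis=1)
print(df['text'].tolist())
```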

**Adding new feature**

From the histogram below, we can see that spam SMSs tend to have more characters than ham SMSs. So we add a new feature called len to our data, holding the length of each text message.

In [None]:
spam_with_len = spam.copy()
spam_with_len['len'] = spam['text'].str.len()

spam_with_len.hist(column='len', by='label', bins=25, figsize=(15, 6), color = "skyblue")
plt.show()

**Create Train and Test Set**

We load the data once again to start from the unmodified file (the pipeline in the next steps changes the data).<br>

We use the last 20% of the rows as the test set and the remaining rows as the training set.

In [None]:
spam = load_data('../input', 'spam.csv', 'latin1')
txts = spam.drop(['v1'], axis=1)
labels = spam['v1']
x_train, x_test, y_train, y_test = txts[:4457], txts[4457:], labels[:4457], labels[4457:]
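
Slicing by position works when the rows are in no particular order. As an alternative sketch (not what this notebook does), sklearn's `train_test_split` can shuffle the rows and, with `stratify`, keep the ham/spam ratio the same in both sets. The toy frame and `random_state=42` are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 15 ham and 5 spam messages
toy = pd.DataFrame({
    'v1': ['ham'] * 15 + ['spam'] * 5,
    'v2': [f'message {i}' for i in range(20)],
})

x_tr, x_te, y_tr, y_te = train_test_split(
    toy.drop(columns=['v1']), toy['v1'],
    test_size=0.2,        # hold out 20% of the rows
    stratify=toy['v1'],   # preserve the class ratio in both splits
    random_state=42,      # reproducible shuffle
)
print(len(x_tr), len(x_te), (y_te == 'spam').sum())
```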

We should encode the labels as numbers, mapping spam to 1 and ham to 0. sklearn's LabelEncoder could also be used.

In [None]:
label_map_func = lambda x: 1 if x == 'spam' else 0

y_test = list(map(label_map_func, y_test))
y_train = list(map(label_map_func, y_train))
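
For comparison, here is a minimal sketch of the LabelEncoder alternative on made-up labels. LabelEncoder sorts the class names, so ham becomes 0 and spam becomes 1, the same mapping as the lambda above.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# fit_transform learns the sorted class names and returns their integer codes
encoded = encoder.fit_transform(['ham', 'spam', 'ham', 'spam'])

print(list(encoder.classes_), encoded.tolist())
```

`encoder.inverse_transform` maps the integers back to the original strings when predictions need to be reported as labels.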

The x_test index starts at 4457, so we reset it to start from 0.

In [None]:
x_test = x_test.reset_index(drop=True)  # drop=True discards the old index instead of keeping it as a column

**Prepare Data for Machine Learning Algorithm**

For this we use the <mark style="background-color: LightYellow">nltk library</mark>.
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries.

We write our own custom, pipeline-friendly transformers and use them step by step in a pipeline.

In [None]:
import nltk
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize
from nltk.tokenize.casual import TweetTokenizer
from nltk.stem import PorterStemmer 

1. **Concat All Lines**<br>
Concatenate all lines and remove the extra columns, as described in the previous parts.

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class ConcatLines(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Nothing to learn; return self so fit(...).transform(...) can be chained
        return self
    def transform(self, X):
        # Replace missing continuation lines with empty strings
        extra = ['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4']
        X[extra] = X[extra].fillna('')

        X['text'] = X['v2'] + ' ' + X['Unnamed: 2'] + ' ' + X['Unnamed: 3'] + ' ' + X['Unnamed: 4']
        X.drop(['v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)
        return X
spam = ConcatLines().transform(spam)
spam.head()

2. **Add Length Feature**

In this transformation, we add the <mark style="background-color: LightYellow">length</mark> of each SMS in a column called len.

In [None]:
class AddLength(BaseEstimator, TransformerMixin):
    def __init__(self, textAttr='text', lenAttr='len'):
        self.lenAttr = lenAttr
        self.textAttr = textAttr
    def fit(self, X, y=None):
        return self  # stateless; returning self lets fit(...).transform(...) chain
    def transform(self, X):
        X[self.lenAttr] = X[self.textAttr].str.len()
        return X
spam = AddLength().transform(spam)
spam.head()

3. **LowerCase All Words**<br>
All words in a sentence should be <mark style="background-color: LightYellow">lowercase</mark>: we compare words for equality, so both sides of the comparison must use the same case.

In [None]:
class ToLowerCase(BaseEstimator, TransformerMixin):
    def __init__(self, textAttr='text', lenAttr='len'):
        self.lenAttr = lenAttr
        self.textAttr = textAttr
    def fit(self, X, y=None):
        return self  # stateless; returning self lets fit(...).transform(...) chain
    def transform(self, X):
        X[self.textAttr] = X[self.textAttr].str.lower()
        return X
    
spam = ToLowerCase().transform(spam)
spam.head()

4. **Tokenize String**:

We tokenize each string with the nltk tweet tokenizer. It splits the text into a list of words, and it recognizes emoticons and keeps them as tokens instead of deleting them.

In [None]:
class Tokenize(BaseEstimator, TransformerMixin):
    def __init__(self, textAttr='text', lenAttr='len'):
        self.lenAttr = lenAttr
        self.textAttr = textAttr
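    def fit(self, X, y=None):
        return self  # stateless; returning self lets fit(...).transform(...) chain
    def transform(self, X):
        tokenizer = TweetTokenizer()  # build once and reuse for every row
        # Replace each message with its list of tokens; the length column is left untouched
        X[self.textAttr] = [tokenizer.tokenize(str(x)) for x in X[self.textAttr]]
        return X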
    
spam = Tokenize().transform(spam)
spam.head()
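
To see what the tweet tokenizer does with an emoticon, here is a small sketch on a made-up message. It only assumes that nltk is installed; TweetTokenizer needs no extra corpus download.

```python
from nltk.tokenize.casual import TweetTokenizer

tokenizer = TweetTokenizer()  # one instance can be reused for all messages

# Made-up example message containing an emoticon
tokens = tokenizer.tokenize("free entry :-) call now")
print(tokens)
```

The emoticon comes back as a single token rather than being split into separate punctuation marks.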

5. **Removing Stop word**:

To remove stop words, we first have to decide which words count as stop words. Fortunately nltk gives us a large list of English stop words. We also treat single letters of the English alphabet and punctuation marks as stop words. We then simply remove every token that equals a stop word.

We also strip trailing dots from words and drop any word that becomes empty.

**Stemming Words**

Stemming is the process of reducing the morphological variants of a word to a common root/base form. Stemming programs are commonly referred to as stemming algorithms or stemmers. For example, a stemming algorithm reduces the words “chocolates”, “chocolatey”, “choco” to the root word “chocolate”, and “retrieval”, “retrieved”, “retrieves” to the stem “retrieve”.<br>
[Reference link](https://www.geeksforgeeks.org/python-stemming-words-with-nltk/)
![](https://www.wolfram.com/language/11/text-and-language-processing/assets.en/generate-and-verify-stemmed-words/O_51.png)

In [None]:
import string

class RemoveStopWordsAndStem(BaseEstimator, TransformerMixin):
    def __init__(self, textAttr='text', lenAttr='len'):
        self.lenAttr = lenAttr
        self.textAttr = textAttr
        self.ps = PorterStemmer()
    def fit(self, X, y=None):
        return self  # stateless; returning self lets fit(...).transform(...) chain
    def transform(self, X):
        # Stop words = NLTK English list + punctuation marks + single letters
        stop_words = set(stopwords.words('english'))
        stop_words.update(string.punctuation)
        stop_words.update(string.ascii_lowercase)
        text = []
        for words in X[self.textAttr]:
            filtered_sentence = []
            for w in words:
                if w not in stop_words:
                    w = w.rstrip(".")  # strip trailing dots
                    if w != "":  # compare by value; `is not ""` tests identity
                        filtered_sentence.append(self.ps.stem(w))
            text.append(filtered_sentence)
        X[self.textAttr] = text
        return X
spam = RemoveStopWordsAndStem().transform(spam)
spam.head()
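
As a quick check of what the Porter stemmer does, the sketch below stems the three "retrieve" variants mentioned above. It only assumes nltk is installed. Note that the stems a Porter stemmer returns are not always dictionary words, so the printed form may be shorter than "retrieve".

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ["retrieval", "retrieved", "retrieves"]

# All three variants should collapse to one shared stem
stems = [ps.stem(w) for w in words]
print(stems)
```

What matters for the bag-of-words matrix is that the variants map to the same token, not that the token is a real word.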

6. **Substitute Emoji, Website and Number**:


* In this step we substitute emojis with the string "emoji". Almost all emoticons start with one of : ; = > and standalone punctuation marks have already been removed, so any token of 2 or more characters that starts with one of these characters is labeled as an emoji.

* Website URLs usually start with https:// or www. or end with .com or a similar suffix, so we substitute website URLs with the website label.

* This dataset contains a lot of numbers. We find them with the [isnumeric()](https://www.programiz.com/python-programming/methods/string/isnumeric) method and replace them with the digitnumber label.

In [None]:
class Substitute(BaseEstimator, TransformerMixin):
    def __init__(self, textAttr='text', lenAttr='len'):
        self.lenAttr = lenAttr
        self.textAttr = textAttr
        self.emoji_list = [':', ';', '>', '=']
        self.website_list = ['.com', '.org', '.co.uk', '.net', 'http', 'www.']
    def fit(self, X, y=None):
        return self  # stateless; returning self lets fit(...).transform(...) chain
    def transform(self, X):
        # Replace tokens message by message
        X[self.textAttr] = [self.substitute(words) for words in X[self.textAttr]]
        return X

    def substitute(self, words):
        for i in range(len(words)):
            if self.is_emoji(words[i]):
                words[i] = 'emoji'
            elif words[i].isnumeric():
                words[i] = 'digitnumber'
            elif any(site in words[i] for site in self.website_list):
                words[i] = 'website'
        return words
    def is_emoji(self, word):
        # Two or more characters, starting with a typical emoticon character
        # (the length check comes first so an empty token cannot raise IndexError)
        return len(word) > 1 and word[0] in self.emoji_list

spam = Substitute().transform(spam)
spam.head()
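
It helps to know exactly which tokens `str.isnumeric()` accepts. The sketch below runs it on a few made-up tokens: only tokens made up entirely of numeric characters pass, so amounts with a decimal point, a currency symbol or letters are not replaced by digitnumber.

```python
# Made-up tokens of the kind that appear in SMS text
samples = ["100", "07123456789", "12.5", "£100", "2nd"]

# True only when every character in the token is numeric
flags = {s: s.isnumeric() for s in samples}
print(flags)
```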

7. **Creating Sparse Matrix**

In the last part of the pipeline we create the occurrence matrix that tells us which words each row contains. We first collect all words that remain after cleaning the training data, then fill each column with the number of times that column's word occurs in the message. The last column holds the message length.

We then convert it to a sparse matrix, which stores only the non-zero entries. This improves performance when training the machine learning algorithms.

In [None]:
from scipy import sparse
import numpy as np

class ToSparseMatrix:
    def __init__(self, train_set, textAttr='text', lenAttr='len'):
        self.lenAttr = lenAttr
        self.textAttr = textAttr
        self.train_words = tokenize_pipeline.transform(train_set.copy())
    def fit(self, X, y=None):
        return self  # returning self lets fit(...).transform(...) chain
    def transform(self, X):
        if self.train_words is None:
            self.train_words = X
        # Vocabulary: sorted unique tokens seen in the tokenized training set
        self.final_words = np.unique([x for t in self.train_words[self.textAttr] for x in t])
        # Map each token to its column index for constant-time lookup
        col_of = {str(w): j for j, w in enumerate(self.final_words)}
        # One column per vocabulary word plus a last column for the message length
        matrix = np.zeros((len(X), len(self.final_words) + 1), dtype=int)
        for i, (tokens, length) in enumerate(zip(X[self.textAttr], X[self.lenAttr])):
            for token in tokens:
                j = col_of.get(token)
                if j is not None:  # tokens unseen in training are ignored
                    matrix[i, j] += 1
            matrix[i, -1] = length
        return sparse.csr_matrix(matrix)
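
For comparison only, sklearn's CountVectorizer builds the same kind of word-count matrix and returns it in sparse form directly. This sketch is a swapped-in alternative, not the transformer used in this notebook. It assumes the messages are already tokenized, so the analyzer just returns each token list unchanged. The token lists are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up, already tokenized messages
train_tokens = [["free", "win", "free"], ["hello"]]
test_tokens = [["win", "unseen"]]

# A callable analyzer receives each document; here it is already a token list
vec = CountVectorizer(analyzer=lambda tokens: tokens)
X_train = vec.fit_transform(train_tokens)  # learns the vocabulary from training data
X_test = vec.transform(test_tokens)        # words unseen in training are ignored

print(sorted(vec.vocabulary_))
print(X_train.toarray())
print(X_test.toarray())
```

This version has no length column. If that feature is wanted, one option would be to append it with `scipy.sparse.hstack`.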

**Creating Pipeline**

We combine the previous steps into a pipeline because we need to apply all of them to both the training set and the test set before creating the sparse matrix. We also need them to find the final vocabulary from the training set.


In [None]:
from sklearn.pipeline import Pipeline

tokenize_pipeline = Pipeline([
    ('concat lines', ConcatLines()),
    ('add length', AddLength()),
    ('lower case words', ToLowerCase()),
    ('tokenize words', Tokenize()),
    ('remove stopwords', RemoveStopWordsAndStem()),
    ('substitute emoji, web, number', Substitute()),
])

sparse_pipeline = Pipeline([
    ('sparse pipeline', ToSparseMatrix(x_train.copy(deep=True)))
])

x_train_prepared = tokenize_pipeline.transform(x_train.copy(deep=True))
x_train_prepared = sparse_pipeline.transform(x_train_prepared)
x_train_prepared
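
The sketch below shows the general contract that sklearn expects from a pipeline step. It uses two hypothetical toy transformers (`Lower` and `AddLen`, not the classes defined above) and a made-up frame. The key point is that `fit` returns `self`, which is what allows `fit_transform` to chain `fit` and `transform`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class Lower(BaseEstimator, TransformerMixin):  # hypothetical example step
    def fit(self, X, y=None):
        return self  # required so fit(...).transform(...) can be chained
    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's frame
        X['text'] = X['text'].str.lower()
        return X

class AddLen(BaseEstimator, TransformerMixin):  # hypothetical example step
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = X.copy()
        X['len'] = X['text'].str.len()
        return X

toy = pd.DataFrame({'text': ['Hello', 'FREE prize']})
pipe = Pipeline([('lower', Lower()), ('len', AddLen())])
out = pipe.fit_transform(toy)
print(out)
```

Copying inside `transform` is a design choice in this sketch: it keeps the input frame unchanged, so the explicit `.copy(deep=True)` calls at the call site would not be needed.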

**Try to find Best Model**

We import the classification models that we expect to work for this problem and collect them in a models list to work with.

In [None]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB, ComplementNB, BernoulliNB
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
models = [
    ('svc', SVC(kernel='rbf')),
    ('neighbors', KNeighborsClassifier(3)),
    ('random_forest', RandomForestClassifier()),
    ('sgd', SGDClassifier()), 
    ('multinomial_nb', MultinomialNB()),
    ('complement_nb', ComplementNB()),
    ('bernoulli_nb', BernoulliNB()),
]

Now it's time to compute scores for each model. For this we use the cross_val_predict function from sklearn to get out-of-fold predictions, then compute and show the metrics for each model.

In [None]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score, recall_score, precision_score, accuracy_score, roc_auc_score
rows = []
for name, model in models:
    # Out-of-fold predictions for every training sample (3-fold CV)
    pred = cross_val_predict(model, x_train_prepared, y_train, cv=3, n_jobs=-1)
    rows.append({
        'model': name,
        'accuracy': accuracy_score(y_train, pred),
        'auc': roc_auc_score(y_train, pred),
        'precision': precision_score(y_train, pred),
        'recall': recall_score(y_train, pred),
        'f1': f1_score(y_train, pred),
    })
# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
scores = pd.DataFrame(rows, columns=['model', 'accuracy', 'auc', 'precision', 'recall', 'f1'])
scores

**Score Analysis**

We know that in this classification problem accuracy is important, but recall and precision matter more than accuracy. As we can see, almost all models have high accuracy, but some of them have a very low f1-score, which is calculated from precision and recall. So we must select the best model, but which one of them is the best?
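
To make the gap between accuracy and the other metrics concrete, here is a small sketch on made-up labels (8 ham, 2 spam) where the classifier misses one of the two spam messages. The f1-score is the harmonic mean of precision and recall, 2 * P * R / (P + R).

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up ground truth and predictions: one spam message is missed
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)     # 9 of 10 correct
prec = precision_score(y_true, y_pred)   # 1 true positive, 0 false positives
rec = recall_score(y_true, y_pred)       # 1 of 2 spam messages found
f1 = f1_score(y_true, y_pred)            # harmonic mean of prec and rec

print(acc, prec, rec, round(f1, 3))
```

Accuracy stays at 0.9 even though half of the spam is missed, which is why recall and f1 are the more informative numbers on an imbalanced dataset like this one.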

**Why The Naive Bayes Model Works So Well**

The Naive Bayes model assumes that the features of the dataset are independent of each other given the class, hence the name Naive.<br>
This fits bag-of-words models of text documents well, since such models treat:
* the words in a text document as independent of each other.
* the location of one word as independent of any other word.

The representation therefore matches the independence assumption of the Naive Bayes model, which is why it is commonly used for text classification, sentiment analysis, spam filtering & recommendation systems.

[Reference](https://towardsdatascience.com/sms-text-classification-a51defc2361c)
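
Under the independence assumption, the score of a class is proportional to its prior times the product of the per-word probabilities, P(class) * P(w1 | class) * ... * P(wn | class). The sketch below fits sklearn's MultinomialNB on a tiny made-up count matrix to show the idea. The two columns and their meaning are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up word counts; columns: occurrences of "free", occurrences of "hello"
X = np.array([[2, 0], [3, 1], [0, 2], [0, 3]])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

model = MultinomialNB().fit(X, y)

# A new message that contains "free" once and "hello" never
new_message = np.array([[1, 0]])
pred = model.predict(new_message)
proba = model.predict_proba(new_message)
print(pred, proba.round(3))
```

Because "free" appears only in the spam rows of the toy data, the message is assigned to the spam class.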

**Find Best Hyperparameter**

We choose the Bernoulli Naive Bayes model and use GridSearchCV to find the best hyperparameters.

In [None]:
from sklearn.model_selection import GridSearchCV

estimator = BernoulliNB()
grid_params = {
    'alpha': [0, 0.08, 0.09, 0.10, 0.11, 0.15],
    'binarize': [0, 0.1, 0.3, 0.5],
    'fit_prior': [True, False],
    'class_prior': [None, [0.4, 0.6]],
}
grid_search = GridSearchCV(estimator, grid_params, scoring='recall')
grid_search.fit(x_train_prepared, y_train)
grid_search.best_score_

The best model from the grid search:

In [None]:
final_model = grid_search.best_estimator_
final_model

Find scores on the test set

In [None]:
x_test_prepared = tokenize_pipeline.transform(x_test.copy(deep=True))
x_test_prepared = sparse_pipeline.transform(x_test_prepared)
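# Predict with the estimator that GridSearchCV already refit on the training set,
# instead of re-training it on folds of the test set
pred = final_model.predict(x_test_prepared)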

print("Precision: ", precision_score(y_test, pred))
print("Recall: ", recall_score(y_test, pred))
print("f1_score: ", f1_score(y_test, pred))
print("Accuracy: ", accuracy_score(y_test, pred))

We can see that, by tuning the model's hyperparameters, we reach 98% accuracy and an f1_score of 93%.