![](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQUXi3mkDlIZMmaGJzZVQnEEC535eNtp3WbO5HzZMxhCcUwucLo)

# **SMS: Spam or Ham (Beginner)**

For my first kernel on Natural Language Processing (NLP), I chose the SMS Spam Collection Dataset.  
It contains the text of 5572 SMS messages and a label classifying each message as "spam" or "ham".

In this kernel I explore some common techniques of NLP like:

* **Removing Punctuation and Stopwords**
* **Tokenizer, Bag of words**  
* **Term frequency inverse document frequency (TFIDF)**

Based on these preprocessing steps, I train 6 different models that classify **unknown** messages as spam or ham. 

* **Naive Bayes Classifier**
* **SVM Classifier**  
* **KNN Classifier**
* **SGD Classifier**
* **Gradient Boosting Classifier**
* **XGBoost Classifier**

For easier handling of the preprocessing steps (for train and test data) and the optimization of different  
models for the same conditions, the classification is done with **Pipelines** including GridSearchCV.  
Finally, for the model evaluation different **metrics** are examined:  
accuracy, precision, recall, fscore, roc_auc

**The Notebook follows this outline:**  

**Part 0: Imports, define functions**  
[import libraries](#Imports)  
define [functions](#Functions) that are used often  

[**Part 1: Exploratory Data Analysis**](#Part-1:-EDA)  
**1.1 Get an overview of the dataset**  
head, describe and value counts  
[Distribution of the target variable](#Distribution-of-the-target-variable)  
[Add numerical label for spam](#Add-numerical-label-for-spam)  
**1.2 length of message**  
[Add feature: length of message](#Add-feature:-length-of-message)  
**1.3 WordClouds**  
[WordCloud: Ham messages](#WordCloud:-Ham-messages)  
[WordCloud: Spam messages](#WordCloud:-Spam-messages)  

[**Part 2: Preprocessing**](#Part-2:-Preprocessing)  
[**2.1 Remove punctuation and stopwords**](#2.1-Remove-Punctuation-and-Stopwords)   
[**2.2 Top 30 words in ham and spam messages**](#2.2-Top-30-words-in-ham-and-spam-messages)  
**2.3 Bag of words with CountVectorizer**  
[The Bag of Words representation](#The-Bag-of-Words-representation)  
[Examples for spam and ham messages](#Examples-for-spam-and-ham-messages)  
[Applying bow_transformer on all messages](#Applying-bow_transformer-on-all-messages)  
**2.4 Term frequency inverse document frequency (TFIDF)**  
[From occurrences to frequencies](#From-occurrences-to-frequencies)  
[TfidfTransformer from sklearn](#TfidfTransformer-from-sklearn)  


[**Part 3: Classifiers**](#Part-3:-Classifiers)  
[**3.1 First test for Classification**](#3.1-First-test-for-Classification) with Naive Bayes Classifier  
[**3.2 train test split**](#3.2-train-test-split)  
**3.3 Pipelines for Classification of unknown messages**  
[Multinomial Naive Bayes](#3.3.1-MultinomialNB)  (simple: Preprocessing and Classification)  
[KNN Classifier](#3.3.2-KNN)  (GridSearchCV for model parameter)   
[Support Vector Classifier](#3.3.3-SVC)  (GridSearchCV for Preprocessing)  
[SGD Classifier](#3.3.4-SGD)  (GridSearchCV for Preprocessing and model parameter)  
[GradientBoostingClassifier](#3.3.5-GradientBoostingClassifier)    (GridSearchCV for Preprocessing and model parameter)  
[XGBoost Classifier](#3.3.6-XGBoost-Classifier)    (GridSearchCV for Preprocessing and model parameter)  
**3.4 Comparison of results**  
[confusion_matrix](#confusion_matrix) +++ [accuracy_score](#accuracy_score)       
[precision_score](#precision_score) +++ [recall_score](#recall_score)  
[f1_score](#f1_score) +++  [classification_report](#classification_report)    
[roc_auc_score](#roc_auc_score)  
**3.5 Optimize classifiers with scoring by precision**  
3.5.1 [GridSearchCV pipelines version 2](#3.5.1-GridSearchCV-pipelines-version-2)  
3.5.2 [Confusion matrices for scoring by precision](#3.5.2-Confusion-matrices-for-scoring-by-precision)  
**3.6 Optimize classifiers with scoring by recall**  
3.6.1 [GridSearchCV pipelines version 3](#3.6.1-GridSearchCV-pipelines-version-3)  
3.6.2 [Confusion matrices for scoring by recall](#3.6.2-Confusion-matrices-for-scoring-by-recall)  
**3.7 Optimize classifiers with scoring by roc_auc**  
3.7.1 [GridSearchCV pipelines version 4](#3.7.1-GridSearchCV-pipelines-version-4)    
3.7.2 [Confusion matrices for scoring by roc auc](#3.7.2-Confusion-matrices-for-scoring-by-roc-auc)  

[**Part 4: NLTK**](#Part-4:-NLTK)


**TODO:**  

* include the feature "text length" in the model
* NLTK

**References:**  
Parts of the EDA and preprocessing are based on the Capstone Project in Jose Portilla's Udemy course.  
I can recommend this course for beginners in Python ML.  


# **Part 0: Imports, define functions** 

## Imports

In [1]:
import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.preprocessing import MinMaxScaler

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

import wordcloud

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 

import os
print(os.listdir("../input"))

FileNotFoundError: [WinError 3] The system cannot find the path specified: '../input'

## Functions

print Classification Report and Accuracy

In [None]:
def print_validation_report(y_true, y_pred):
    print("Classification Report")
    print(classification_report(y_true, y_pred))
    acc_sc = accuracy_score(y_true, y_pred)
    print("Accuracy : "+ str(acc_sc))
    
    return acc_sc

plot_confusion_matrix

In [None]:
def plot_confusion_matrix(y_true, y_pred):
    mtx = confusion_matrix(y_true, y_pred)
    # draw on the current axes (no undefined "ax"), so the function
    # also works inside plt.subplot(...)
    sns.heatmap(mtx, annot=True, fmt='d', linewidths=.5,  
                cmap="Blues", cbar=False)
    plt.ylabel('true label')
    plt.xlabel('predicted label')

# **Part 1: EDA**

In [None]:
data = pd.read_csv("../input/spam.csv",encoding='latin-1')
data.head()

Columns 2,3,4 contain no important data and can be deleted.  
Also, we rename column v1 as "label" and v2 as "text"

In [None]:
data = data.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
data = data.rename(columns={"v1":"label", "v2":"text"})

In [None]:
data.describe()

In [None]:
data.groupby("label").describe()

### Distribution of the target variable

The dataset contains 4825 ham and 747 spam messages.  
For both classes, some messages appear more than once (common phrases, etc.).

In [None]:
data.label.value_counts()

In [None]:
data.label.value_counts().plot.bar();

### Add numerical label for spam   
Target must be numerical for ML classification models

In [None]:
data['spam'] = data['label'].map( {'spam': 1, 'ham': 0} ).astype(int)
data.head(15)

### 1.2 length of message

### Add feature: length of message

In [None]:
data['length'] = data['text'].apply(len)

In [None]:
data.head(10)

In [None]:
data.hist(column='length',by='label',bins=60,figsize=(12,4));
plt.xlim(-40,950);

Looks like spam messages are generally longer than ham messages:  
Bulk of ham has length below 100, for spam it is above 100.  
We will check if this feature is useful for the classification task in Part 3.  

In [None]:
data_ham  = data[data['spam'] == 0].copy()
data_spam = data[data['spam'] == 1].copy()

### 1.3 WordClouds

In [None]:
def show_wordcloud(data_spam_or_ham, title):
    text = ' '.join(data_spam_or_ham['text'].astype(str).tolist())
    stopwords = set(wordcloud.STOPWORDS)
    
    fig_wordcloud = wordcloud.WordCloud(stopwords=stopwords,background_color='lightgrey',
                    colormap='viridis', width=800, height=600).generate(text)
    
    plt.figure(figsize=(10,7), frameon=True)
    plt.imshow(fig_wordcloud)  
    plt.axis('off')
    plt.title(title, fontsize=20 )
    plt.show()

### WordCloud: Ham messages

In [None]:
show_wordcloud(data_ham, "Ham messages")

### WordCloud: Spam messages

In [None]:
show_wordcloud(data_spam, "Spam messages")

# **Part 2: Preprocessing**

**Basic preprocessing for common NLP tasks includes converting text to lowercase and removing punctuation and stopwords.**  
**Further steps, especially for text classification tasks, are:**  
* Tokenization
* Vectorization and 
* TF-IDF weighting  

**Let's apply these approaches to the SMS messages.**

## 2.1 Remove Punctuation and Stopwords

### Punctuation
**We use the punctuation list from the string library:**

In [None]:
import string
string.punctuation

### Stopwords  
from sklearn documentation:  https://scikit-learn.org/stable/modules/feature_extraction.html#stop-words  
Stop words are words like “and”, “the”, “him”, which are presumed to be uninformative in representing the content of a text,   
and which may be removed to avoid them being construed as signal for prediction.  
Sometimes, however, similar words are useful for prediction, such as in classifying writing style or personality.  

Due to the known issues in the ’english’ stop word list of sklearn, we use the stopwords from NLTK:

**NLTK**

In [None]:
from nltk.corpus import stopwords
stopwords.words("english")[100:110]

**With the above lists for punctuation characters and stop words, we define a function to remove these from the text**  
**This function also converts all text to lowercase**

In [None]:
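# Build the stop word set once: a set lookup is much faster than
# calling stopwords.words("english") again for every single word
STOPWORDS_EN = set(stopwords.words("english"))

def remove_punctuation_and_stopwords(sms):
    # drop punctuation characters, then split the text on whitespace
    sms_no_punctuation = [ch for ch in sms if ch not in string.punctuation]
    sms_no_punctuation = "".join(sms_no_punctuation).split()
    
    # lowercase every word and keep only those that are not stop words
    sms_no_punctuation_no_stopwords = \
        [word.lower() for word in sms_no_punctuation if word.lower() not in STOPWORDS_EN]
        
    return sms_no_punctuation_no_stopwords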

In [None]:
data['text'].apply(remove_punctuation_and_stopwords).head()

### 2.2 Top 30 words in ham and spam messages

### 2.2.1 Collections: Counter

In [None]:
from collections import Counter

In [None]:
data_ham.loc[:, 'text'] = data_ham['text'].apply(remove_punctuation_and_stopwords)
words_data_ham = data_ham['text'].tolist()
data_spam.loc[:, 'text'] = data_spam['text'].apply(remove_punctuation_and_stopwords)
words_data_spam = data_spam['text'].tolist()

In [None]:
list_ham_words = []
for sublist in words_data_ham:
    for item in sublist:
        list_ham_words.append(item)

In [None]:
list_spam_words = []
for sublist in words_data_spam:
    for item in sublist:
        list_spam_words.append(item)
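
The two nested loops above flatten a list of token lists into one flat list of words. As a minimal sketch, the same step can be written with `itertools.chain.from_iterable`. The token lists below are made up for illustration and only stand in for `words_data_ham` or `words_data_spam`.

```python
from itertools import chain
from collections import Counter

# Hypothetical token lists, standing in for words_data_ham
token_lists = [["go", "jurong", "point"], ["ok", "lar"], ["go", "home", "ok"]]

# Flatten the list of lists into a single list of words
flat_words = list(chain.from_iterable(token_lists))

# Count the words and keep the two most frequent ones
top_words = Counter(flat_words).most_common(2)
print(top_words)
```

The result is the same flat list that the explicit loops produce, so it can be passed to `Counter` or `nltk.FreqDist` in the same way.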

In [None]:
c_ham  = Counter(list_ham_words)
c_spam = Counter(list_spam_words)
df_hamwords_top30  = pd.DataFrame(c_ham.most_common(30),  columns=['word', 'count'])
df_spamwords_top30 = pd.DataFrame(c_spam.most_common(30), columns=['word', 'count'])

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x='word', y='count', 
            data=df_hamwords_top30, ax=ax)
plt.title("Top 30 Ham words")
plt.xticks(rotation='vertical');

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x='word', y='count', 
            data=df_spamwords_top30, ax=ax)
plt.title("Top 30 Spam words")
plt.xticks(rotation='vertical');

### 2.2.2 NLTK: FreqDist

In [None]:
fdist_ham  = nltk.FreqDist(list_ham_words)
fdist_spam = nltk.FreqDist(list_spam_words)

In [None]:
df_hamwords_top30_nltk  = pd.DataFrame(fdist_ham.most_common(30),  columns=['word', 'count'])
df_spamwords_top30_nltk = pd.DataFrame(fdist_spam.most_common(30), columns=['word', 'count'])

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x='word', y='count', 
            data=df_hamwords_top30_nltk, ax=ax)
plt.title("Top 30 Ham words")
plt.xticks(rotation='vertical');

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x='word', y='count', 
            data=df_spamwords_top30_nltk, ax=ax)
plt.title("Top 30 Spam words")
plt.xticks(rotation='vertical');

## 2.3 Bag of words with CountVectorizer

### The Bag of Words representation

https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction  

Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length.  
In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely:

**Tokenization**  
tokenizing strings and giving an integer id for each possible token, for instance by using white-spaces and punctuation as token separators.  
**Vectorization**  
counting the occurrences of tokens in each document.  
**TF-IDF**  
normalizing and weighting with diminishing importance tokens that occur in the majority of samples / documents.  


**Bag of Words**  
In this scheme, features and samples are defined as follows:
each individual token occurrence frequency (normalized or not) is treated as a feature.  
the vector of all the token frequencies for a given document is considered a multivariate sample.  
A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.  
We call vectorization the general process of turning a collection of text documents into numerical feature vectors.   
This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or “Bag of n-grams” representation.  
Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
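
To make the scheme concrete, here is a small sketch of a bag-of-words matrix built with the standard library only. The three tokenized messages and the helper `to_count_vector` are hypothetical; `CountVectorizer` does the same job (and returns a sparse matrix) for the real data.

```python
from collections import Counter

# Hypothetical mini corpus of already tokenized messages
docs = [["free", "prize", "call", "now"],
        ["call", "me", "now"],
        ["free", "free", "entry"]]

# Tokenization step: give every distinct token an integer column id
all_tokens = sorted({token for doc in docs for token in doc})
vocabulary = {token: idx for idx, token in enumerate(all_tokens)}

def to_count_vector(tokens, vocabulary):
    """Vectorization step: count how often each vocabulary token occurs."""
    counts = Counter(tokens)
    return [counts.get(token, 0) for token in vocabulary]

# One row per document, one column per token; word order is ignored
bow_matrix = [to_count_vector(doc, vocabulary) for doc in docs]
for row in bow_matrix:
    print(row)
```

Each row has the same length (the vocabulary size) no matter how long the message is, which is exactly what the classifiers need.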

For further details and example implementations see:  
https://en.wikipedia.org/wiki/Bag-of-words_model  
https://en.wikipedia.org/wiki/Document-term_matrix  

An Introduction to Bag-of-Words in NLP  
https://medium.com/greyatom/an-introduction-to-bag-of-words-in-nlp-ac967d43b428

In this kernel we apply the CountVectorizer from sklearn as BOW model.  
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html    
As tokenizer we use the remove_punctuation_and_stopwords function defined above

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
bow_transformer = CountVectorizer(analyzer = remove_punctuation_and_stopwords).fit(data['text'])

In [None]:
print(len(bow_transformer.vocabulary_))

In all sms messages bow_transformer counted 9431 different words.

### Examples for spam and ham messages

Let's look at some vectorization examples for spam and ham messages

In [None]:
sample_spam = data['text'][8]
bow_sample_spam = bow_transformer.transform([sample_spam])
print(sample_spam)
print(bow_sample_spam)

In [None]:
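rows, cols = bow_sample_spam.nonzero()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = bow_transformer.get_feature_names_out()
for col in cols: 
    print(feature_names[col])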

In [None]:
print(np.shape(bow_sample_spam))

In [None]:
sample_ham = data['text'][4]
bow_sample_ham = bow_transformer.transform([sample_ham])
print(sample_ham)
print(bow_sample_ham)

In [None]:
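rows, cols = bow_sample_ham.nonzero()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = bow_transformer.get_feature_names_out()
for col in cols: 
    print(feature_names[col])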

### Applying bow_transformer on all messages

In [None]:
bow_data = bow_transformer.transform(data['text'])

In [None]:
bow_data.shape

In [None]:
bow_data.nnz

**Sparsity: percentage of non-zero entries**  
https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction  
**Sparsity**  
As most documents will typically use a very small subset of the words used in the corpus,  
the resulting matrix will have many feature values that are zeros (typically more than 99% of them).  
For instance a collection of 10,000 short text documents (such as emails) will use a vocabulary  
with a size in the order of 100,000 unique words in total while each document will use 100 to   
1000 unique words individually.  
In order to be able to store such a matrix in memory but also to speed up algebraic operations matrix / vector,  
implementations will typically use a sparse representation such as available in the scipy.sparse package.
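
A toy example may help to see what a sparse representation stores. The sketch below uses a made-up 4 x 6 count matrix and a plain dictionary that keeps only the non-zero cells; `scipy.sparse` uses more efficient formats, but the idea is the same.

```python
# Hypothetical toy count matrix: 4 documents x 6 vocabulary words
dense = [[1, 0, 0, 0, 2, 0],
         [0, 0, 1, 0, 0, 0],
         [0, 0, 0, 0, 0, 0],
         [0, 3, 0, 0, 0, 1]]

# "Dictionary of keys" storage: only non-zero cells are kept
sparse = {(r, c): value
          for r, row in enumerate(dense)
          for c, value in enumerate(row)
          if value != 0}

n_rows, n_cols = len(dense), len(dense[0])
nnz = len(sparse)                              # number of non-zero entries
density_percent = nnz / (n_rows * n_cols) * 100
print(nnz, round(density_percent, 1))
```

The last two lines are the same calculation that is applied to `bow_data` below: non-zero entries divided by the total number of cells.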


In [None]:
bow_data

In [None]:
bow_data.shape[0]

In [None]:
bow_data.shape[1]

In [None]:
bow_data.nnz

number of non-zero entries divided by matrix size  

In [None]:
print( bow_data.nnz / (bow_data.shape[0] * bow_data.shape[1]) *100 )

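Only a very small fraction of the matrix entries is non-zero (far below 1 %), because a single SMS contains only a handful of the 9431 vocabulary words.  
The non-zero entries are word counts, so they can also be larger than one.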

## 2.4 Term frequency inverse document frequency - TFIDF

### From occurrences to frequencies  
https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#from-occurrences-to-frequencies

Occurrence count is a good start but there is an issue: longer documents will have higher average count values  
than shorter documents, even though they might talk about the same topics.  
To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a document  
by the total number of words in the document: these new features are called **tf for Term Frequencies**.  
Another refinement on top of tf is to downscale weights for words that occur in many documents in the corpus and are  
therefore less informative than those that occur only in a smaller portion of the corpus.  
This downscaling is called **tf–idf for “Term Frequency times Inverse Document Frequency”**.
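
The weighting can be sketched by hand on a tiny count matrix. This is a simplified illustration, assuming the defaults documented for sklearn's `TfidfTransformer`: raw counts as tf, the smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The count matrix and the helper `tfidf_row` are made up.

```python
import math

# Hypothetical count matrix: 3 documents x 3 terms
counts = [[1, 1, 0],
          [1, 0, 2],
          [1, 0, 0]]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents does each term occur?
df = [sum(1 for row in counts if row[t] > 0) for t in range(n_terms)]

# Smoothed inverse document frequency (assumed sklearn default)
idf = [math.log((1 + n_docs) / (1 + df[t])) + 1 for t in range(n_terms)]

def tfidf_row(row):
    """Weight the counts of one document by idf, then scale to unit length."""
    weighted = [row[t] * idf[t] for t in range(n_terms)]
    norm = math.sqrt(sum(w * w for w in weighted))
    return [w / norm if norm else 0.0 for w in weighted]

tfidf = [tfidf_row(row) for row in counts]
for row in tfidf:
    print([round(w, 3) for w in row])
```

The first term occurs in every document and therefore gets the smallest idf, so in the first row the rarer second term ends up with the larger weight although both have a count of one.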

For further details and example implementations see:  
https://en.wikipedia.org/wiki/Tf%E2%80%93idf

https://www.kaggle.com/adamschroeder/countvectorizer-tfidfvectorizer-predict-comments

### TfidfTransformer from sklearn
Both tf and tf–idf can be computed as follows using TfidfTransformer:   
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer().fit(bow_data)

In [None]:
tfidf_sample_ham = tfidf_transformer.transform(bow_sample_ham)
print(tfidf_sample_ham)

In [None]:
tfidf_sample_spam = tfidf_transformer.transform(bow_sample_spam)
print(tfidf_sample_spam)

In [None]:
data_tfidf = tfidf_transformer.transform(bow_data)

In [None]:
data_tfidf

In [None]:
np.shape(data_tfidf)

## train test split

for TFIDF matrix only

In [None]:
from sklearn.model_selection import train_test_split

data_tfidf_train, data_tfidf_test, label_train, label_test = \
    train_test_split(data_tfidf, data["spam"], test_size=0.3, random_state=5)

In [None]:
data_tfidf_train

In [None]:
data_tfidf_test

for TFIDF matrix and feature "length"

In [None]:
from scipy.sparse import hstack
# append "length" as an extra column, then convert the sparse result to a dense array
X2 = hstack((data_tfidf, np.array(data['length'])[:, None])).toarray()

In [None]:
X2_train, X2_test, y2_train, y2_test = \
    train_test_split(X2, data["spam"], test_size=0.3, random_state=5)

# Part 3: Classifiers

## 3.1 First test for Classification  

### Naive Bayes Classifier

Convert the sparse matrices to dense arrays

In [None]:
data_tfidf_train = data_tfidf_train.toarray()
data_tfidf_test = data_tfidf_test.toarray()

MNB Model using only TFIDF matrix

In [None]:
spam_detect_model = MultinomialNB().fit(data_tfidf_train, label_train)
pred_test_MNB = spam_detect_model.predict(data_tfidf_test)
acc_MNB = accuracy_score(label_test, pred_test_MNB)
print(acc_MNB)

Our first classifier seems to work well, it has an accuracy of 96.5 % for the test set.  

In [None]:
scaler = MinMaxScaler()
data_tfidf_train_sc = scaler.fit_transform(data_tfidf_train)
data_tfidf_test_sc  = scaler.transform(data_tfidf_test)

MNB Model using only TFIDF matrix, scaled

In [None]:
spam_detect_model_minmax = MultinomialNB().fit(data_tfidf_train_sc, label_train)
pred_test_MNB = spam_detect_model_minmax.predict(data_tfidf_test_sc)
acc_MNB = accuracy_score(label_test, pred_test_MNB)
print(acc_MNB)

Applying the min max scaler on the TFIDF matrix improves the performance of the MNB classifier:  
It now has an accuracy of 98.2 % for the test set. 
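
For reference, min-max scaling maps every feature to the range [0, 1] with `(x - min) / (max - min)`, where min and max are learned from the training data only. A minimal sketch with made-up message lengths (the helper names are hypothetical, not part of sklearn):

```python
def fit_min_max(column):
    """Learn the scaling parameters from the training data only."""
    return min(column), max(column)

def apply_min_max(column, lo, hi):
    """Scale values with previously learned min and max."""
    span = hi - lo
    return [(x - lo) / span if span else 0.0 for x in column]

# Hypothetical message lengths
train_lengths = [20, 60, 160, 120]
test_lengths = [90, 200]

lo, hi = fit_min_max(train_lengths)
train_scaled = apply_min_max(train_lengths, lo, hi)
test_scaled = apply_min_max(test_lengths, lo, hi)
print(train_scaled)
print(test_scaled)
```

Test values outside the training range end up outside [0, 1], which is why the scaler is fitted on the training set and only applied to the test set.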

MNB model with TFIDF matrix and feature "length", unscaled

In [None]:
spam_detect_model_2 = MultinomialNB().fit(X2_train, y2_train)
pred_test_MNB_2 = spam_detect_model_2.predict(X2_test)
acc_MNB_2 = accuracy_score(y2_test, pred_test_MNB_2)
print(acc_MNB_2)

Fitting MNB with the unscaled features TFIDF + length of message decreases performance.  
Let's now check the fit with the scaled features.

In [None]:
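# the last column holds "length"; all columns before it (9431 here) are TFIDF features
n_tfidf = X2_train.shape[1] - 1
X2_tfidf_train = X2_train[:, :n_tfidf]
X2_tfidf_test  = X2_test[:, :n_tfidf]
X2_length_train = X2_train[:, n_tfidf]
X2_length_test  = X2_test[:, n_tfidf]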

In [None]:
scaler = MinMaxScaler()
X2_tfidf_train = scaler.fit_transform(X2_tfidf_train)
X2_tfidf_test  = scaler.transform(X2_tfidf_test)

In [None]:
scaler = MinMaxScaler()
X2_length_train = scaler.fit_transform(X2_length_train.reshape(-1, 1))
X2_length_test  = scaler.transform(X2_length_test.reshape(-1, 1))

In [None]:
X2_train = np.hstack((X2_tfidf_train, X2_length_train))
X2_test  = np.hstack((X2_tfidf_test,  X2_length_test))

MNB model with TFIDF matrix and feature "length", scaled

In [None]:
spam_detect_model_3 = MultinomialNB().fit(X2_train, y2_train)
pred_test_MNB_3 = spam_detect_model_3.predict(X2_test)
acc_MNB_3 = accuracy_score(y2_test, pred_test_MNB_3)
print(acc_MNB_3)

We studied the same classifier, Multinomial Naive Bayes, with different sets of features and found that the accuracy of the predictions varies.  
In the following we study a different classifier, again with different sets of features.  
We also examine what this accuracy actually means and whether it is the best metric for this task.

KNN

In [None]:
parameters_KNN = {'n_neighbors': (10,15,17), }

grid_KNN = GridSearchCV( KNeighborsClassifier(), parameters_KNN, cv=5,
                        n_jobs=-1, verbose=1)

grid_KNN.fit(data_tfidf_train, label_train)

In [None]:
print(grid_KNN.best_params_)
print(grid_KNN.best_score_)

In [None]:
parameters_KNN = {'n_neighbors': (6,8,10), }
grid_KNN = GridSearchCV( KNeighborsClassifier(), parameters_KNN, cv=5,
                        n_jobs=-1, verbose=1)
grid_KNN.fit(data_tfidf_train_sc, label_train)

In [None]:
print(grid_KNN.best_params_)
print(grid_KNN.best_score_)

## 3.2 train test split

for text data

In [None]:
from sklearn.model_selection import train_test_split

sms_train, sms_test, label_train, label_test = \
    train_test_split(data["text"], data["spam"], test_size=0.3, random_state=5)

In [None]:
sms_train.head()

## 3.3 Classification Pipelines

After splitting the data into a train and test set we now use a pipeline to apply the   
**CountVectorizer** and the **TfidfTransformer** on both sets.  
We also add a classifier to the pipeline, so we can combine all necessary steps in one object:  
* Preprocessing  
* Cross-validation (GridSearchCV)
* Fitting  
* Predicting
* Evaluating (test score)
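
The cross-validation step deserves a short illustration. With `cv=5`, the training data is split into 5 folds, and each fold serves once as validation set while the model is fitted on the other four. The sketch below shows plain k-fold index splitting without shuffling; it is a simplification, because for classifiers GridSearchCV uses stratified folds by default. The function `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # the first n_samples % k folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop

folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Because the preprocessing steps are part of the pipeline, they are refitted on the training part of every fold, so no information from the validation fold leaks into the vocabulary or the idf weights.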

### 3.3.1 MultinomialNB

**simple Pipeline. no optimization**

In [None]:
pipe_MNB = Pipeline([ ('bow'  , CountVectorizer(analyzer = remove_punctuation_and_stopwords) ),
                   ('tfidf'   , TfidfTransformer()),
                   ('clf_MNB' , MultinomialNB()),
                    ])

**test score**

In [None]:
pipe_MNB.fit(X=sms_train, y=label_train)
pred_test_MNB = pipe_MNB.predict(sms_test)
acc_MNB = accuracy_score(label_test, pred_test_MNB)
print(acc_MNB)
print(pipe_MNB.score(sms_test, label_test))

The two steps  
**CountVectorizer** and **TfidfTransformer**  
can also be performed in one step with  
**TfidfVectorizer**  
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html  
Convert a collection of raw documents to a matrix of TF-IDF features  
Equivalent to CountVectorizer followed by TfidfTransformer.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
pipe_MNB_tfidfvec = Pipeline([ ('tfidf_vec' , TfidfVectorizer(analyzer = remove_punctuation_and_stopwords)),
                               ('clf_MNB'   , MultinomialNB()),
                            ])

In [None]:
pipe_MNB_tfidfvec.fit(X=sms_train, y=label_train)
pred_test_MNB_tfidfvec = pipe_MNB_tfidfvec.predict(sms_test)
acc_MNB_tfidfvec = accuracy_score(label_test, pred_test_MNB_tfidfvec)
print(acc_MNB_tfidfvec)
print(pipe_MNB_tfidfvec.score(sms_test, label_test))

**Yes, results are identical**

### 3.3.2 KNN  
Pipeline with GridSearchCV  
optimize best model parameter: n_neighbors

In [None]:
pipe_KNN = Pipeline([ ('bow'  , CountVectorizer(analyzer = remove_punctuation_and_stopwords) ),
                   ('tfidf'   , TfidfTransformer()),
                   ('clf_KNN' , KNeighborsClassifier() )
                    ])

parameters_KNN = {'clf_KNN__n_neighbors': (8,15,20), }

grid_KNN = GridSearchCV(pipe_KNN, parameters_KNN, cv=5,
                        n_jobs=-1, verbose=1)

grid_KNN.fit(X=sms_train, y=label_train)

**best_params_**

In [None]:
grid_KNN.best_params_

**cross validation score: best_score_**

In [None]:
grid_KNN.best_score_

**test score**

In [None]:
pred_test_grid_KNN = grid_KNN.predict(sms_test)
acc_KNN = accuracy_score(label_test, pred_test_grid_KNN)
print(acc_KNN)
print(grid_KNN.score(sms_test, label_test))

### 3.3.3 SVC  
Pipeline with GridSearchCV  
search best preprocessing: apply TfidfTransformer (yes/no)

In [None]:
pipe_SVC = Pipeline([ ('bow'  , CountVectorizer(analyzer = remove_punctuation_and_stopwords) ),
                   ('tfidf'   , TfidfTransformer()),
                   ('clf_SVC' , SVC(gamma='auto', C=1000)),
                    ])


parameters_SVC = dict(tfidf=[None, TfidfTransformer()],
                      clf_SVC__C=[500, 1000,1500]
                      )
#parameters = {'tfidf__use_idf': (True, False),    }

grid_SVC = GridSearchCV(pipe_SVC, parameters_SVC, 
                        cv=5, n_jobs=-1, verbose=1)

grid_SVC.fit(X=sms_train, y=label_train)

**best_params_**

In [None]:
grid_SVC.best_params_

**cross validation score: best_score_**

In [None]:
grid_SVC.best_score_

**test score**

In [None]:
pred_test_grid_SVC = grid_SVC.predict(sms_test)
acc_SVC = accuracy_score(label_test, pred_test_grid_SVC)
print(acc_SVC)
print(grid_SVC.score(sms_test, label_test))

### 3.3.4 SGD  
Pipeline with GridSearch  
search best preprocessing: use_idf (yes/no)  
and best model parameters (alpha, penalty)

In [None]:
pipe_SGD = Pipeline([ ('bow'  , CountVectorizer(analyzer = remove_punctuation_and_stopwords) ),
                   ('tfidf'   , TfidfTransformer()),
                   ('clf_SGD' , SGDClassifier(random_state=5)),
                    ])

parameters_SGD = {
    #'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'tfidf__use_idf': (True, False),
    #'tfidf__norm': ('l1', 'l2'),
    #'clf_SGD__max_iter': (5,10),
    'clf_SGD__alpha': (1e-05, 1e-04),
}

grid_SGD = GridSearchCV(pipe_SGD, parameters_SGD, cv=5,
                               n_jobs=-1, verbose=1)

grid_SGD.fit(X=sms_train, y=label_train)

**best_params_**

In [None]:
grid_SGD.best_params_

**cross validation score: best_score_**

In [None]:
grid_SGD.best_score_

**test score**

In [None]:
pred_test_grid_SGD = grid_SGD.predict(sms_test)
acc_SGD = accuracy_score(label_test, pred_test_grid_SGD)
print(acc_SGD)
print(grid_SGD.score(sms_test, label_test))

### 3.3.5 GradientBoostingClassifier

In [None]:
pipe_GBC = Pipeline([ ('bow'  , CountVectorizer(analyzer = remove_punctuation_and_stopwords) ),
                      ('tfidf'   , TfidfTransformer() ),
                      ('clf_GBC' , GradientBoostingClassifier(random_state=5) ),
                    ])

parameters_GBC = { 'tfidf__use_idf': (True, False), 
                   'clf_GBC__learning_rate': (0.1, 0.2),
                   #'clf_GBC__min_samples_split': (3,5), 
                 }

grid_GBC = GridSearchCV(pipe_GBC, parameters_GBC, 
                        cv=5, n_jobs=-1, verbose=1)

grid_GBC.fit(X=sms_train, y=label_train)

In [None]:
grid_GBC.best_params_

In [None]:
grid_GBC.best_score_

In [None]:
pred_test_grid_GBC = grid_GBC.predict(sms_test)
acc_GBC = accuracy_score(label_test, pred_test_grid_GBC)
print(acc_GBC)
print(grid_GBC.score(sms_test, label_test))

### 3.3.6 XGBoost Classifier

In [None]:
import xgboost as xgb

# Set params['eval_metric'] = ...

In [None]:
pipe_XGB = Pipeline([ ('bow'  , CountVectorizer(analyzer = remove_punctuation_and_stopwords) ),
                      ('tfidf'   , TfidfTransformer() ),
                      ('clf_XGB' , xgb.XGBClassifier(random_state=5) ),
                    ])

parameters_XGB = { 'tfidf__use_idf': (True, False), 
                   'clf_XGB__eta': (0.01, 0.02),
                   'clf_XGB__max_depth': (5,6), 
                 }

grid_XGB = GridSearchCV(pipe_XGB, parameters_XGB, 
                        cv=5, n_jobs=-1, verbose=1)

grid_XGB.fit(X=sms_train, y=label_train)

In [None]:
grid_XGB.best_params_

In [None]:
grid_XGB.best_score_

In [None]:
pred_test_grid_XGB = grid_XGB.predict(sms_test)
acc_XGB = accuracy_score(label_test, pred_test_grid_XGB)
print(acc_XGB)
print(grid_XGB.score(sms_test, label_test))

## 3.4 Comparison of results

https://towardsdatascience.com/understanding-data-science-classification-metrics-in-scikit-learn-in-python-3bc336865019  

https://scikit-learn.org/stable/modules/model_evaluation.html

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import roc_auc_score

For a binary classification task, there are 4 possible results:


TN: True negatives  (ham mails labeled as ham)  
FP: False positives (ham mails labeled as spam)  
FN: False negatives (spam mails labeled as ham)  
TP: True positives  (spam mails labeled as spam)  

### confusion_matrix

The confusion matrix gives an overview of the classification results:  
The diagonal elements represent the number of points for which the predicted label is equal to the true label,  
while off-diagonal elements are those that are mislabeled by the classifier.  
The higher the diagonal values of the confusion matrix the better, indicating many correct predictions.  
The rows of a confusion matrix correspond to the true (actual) classes and the columns correspond to the predicted classes.  
So, all together the confusion matrix for a binary classifier consists of 4 values:

TN FP  
FN TP
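
As a small sketch, the four values can be counted by hand. The labels below are made up (0 = ham, 1 = spam) and the helper `confusion_counts` is hypothetical; `confusion_matrix` from sklearn returns the same layout for binary labels 0 and 1.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TN, FP, FN and TP for a binary classification result."""
    tn = fp = fn = tp = 0
    for true_label, pred_label in zip(y_true, y_pred):
        if true_label == positive and pred_label == positive:
            tp += 1   # spam labeled as spam
        elif true_label == positive:
            fn += 1   # spam labeled as ham
        elif pred_label == positive:
            fp += 1   # ham labeled as spam
        else:
            tn += 1   # ham labeled as ham
    return tn, fp, fn, tp

# Hypothetical true and predicted labels
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
matrix = [[tn, fp],
          [fn, tp]]
print(matrix)
```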


**Using seaborn heat map for nice plot of confusion matrix**

In [None]:
def plot_confusion_matrix(y_true, y_pred):
    mtx = confusion_matrix(y_true, y_pred)
    #fig, ax = plt.subplots(figsize=(4,4))
    sns.heatmap(mtx, annot=True, fmt='d', linewidths=.5,  
                cmap="Blues", square=True, cbar=False)
    #  
    plt.ylabel('true label')
    plt.xlabel('predicted label')

**Dictionary of predictions**

In [None]:
list_clf = ["MNB", "KNN", "SVC", "SGD", "GBC", "XGB"]

list_pred = [pred_test_MNB, pred_test_grid_KNN, 
             pred_test_grid_SVC, pred_test_grid_SGD,
             pred_test_grid_GBC, pred_test_grid_XGB]

dict_pred = dict(zip(list_clf, list_pred))

In [None]:
def plot_all_confusion_matrices(y_true, dict_all_pred, str_title):
    
    list_classifiers = list(dict_all_pred.keys())
    plt.figure(figsize=(10,7.5))
    plt.suptitle(str_title, fontsize=20, fontweight='bold')
    n=231

    for clf in list_classifiers : 
        plt.subplot(n)
        plot_confusion_matrix(y_true, dict_all_pred[clf])
        plt.title(clf, fontweight='bold')
        n+=1

    plt.tight_layout()
    plt.subplots_adjust(top=0.9)
    

In [None]:
plot_all_confusion_matrices(label_test, dict_pred, "Pipelines v1, scoring=accuracy")

### accuracy_score

classification accuracy = correct predictions / total predictions = (TP + TN) / (TP + TN + FP + FN)
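
A quick calculation shows why accuracy alone can be misleading for imbalanced classes. The sketch assumes a hypothetical baseline that labels every message as ham and uses the class counts of the full dataset from Part 1 (4825 ham, 747 spam).

```python
n_ham, n_spam = 4825, 747   # class counts reported in Part 1

# Hypothetical baseline: every message is labeled as ham
tp, fp = 0, 0               # nothing is ever labeled as spam
tn, fn = n_ham, n_spam      # all ham correct, all spam missed

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(round(accuracy, 3))
```

This baseline does not detect a single spam message and still gets about 87 % of the labels right, so the accuracy values of the classifiers should be read against that number.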



In [None]:
dict_acc = {}
for clf in list_clf :
    dict_acc[clf] = accuracy_score(label_test, dict_pred[clf])

In [None]:
for clf in list_clf :
    print(clf, " " , dict_acc[clf])

Comparing the accuracy_score with the confusion matrices, one finds that  
accuracy may not be the best metric for choosing the best classifier.  
SGD, a model with high accuracy_score, incorrectly classifies 6 ham mails as spam,  
which is usually not wanted for a spam classifier (important mails might get lost).  
MNB has lower accuracy than SGD, but it classifies all ham mails correctly.  
SVC also classifies all ham mails correctly, and compared to MNB it classifies   
many more spam mails correctly.  
Apart from accuracy there are further scoring methods to evaluate a classifier.  
Let's look at the other classifier scores in more detail:  
precision, recall, fscore, support, roc_auc

### precision_score

The precision is the ratio TP / (TP + FP) where TP is the number of true positives and FP the number of false positives.  
The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.  
Precision = 1 for FP = 0 and precision goes up when FP goes down.
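
A minimal sketch of the formula on made-up labels (not the SMS data). It also shows that `precision_score` treats label 1 as the positive class unless `pos_label` is changed.

```python
from sklearn.metrics import precision_score

# Made-up labels for illustration only: 0 = ham, 1 = spam
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Default pos_label=1: spam is the positive class.
# 3 messages are predicted as spam, 2 of them really are spam.
prec_spam = precision_score(y_true, y_pred)              # 2 / 3

# pos_label=0: ham is treated as the positive class instead.
# 5 messages are predicted as ham, 4 of them really are ham.
prec_ham = precision_score(y_true, y_pred, pos_label=0)  # 4 / 5

print(prec_spam, prec_ham)
```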

In [None]:
for clf in list_clf :
    print(clf, " ", precision_score(label_test, dict_pred[clf]))

By default, precision_score calculates the precision for the positive class (label = 1, spam mails),  
because the pos_label parameter defaults to 1.  
With average=None we can also examine the precision for the individual labels (ham, spam = 0, 1):

In [None]:
for clf in list_clf :
    print(clf, " ", precision_score(label_test, dict_pred[clf], average=None, labels=[0,1]))

Precision for classifying spam mails is 1.0 for the MNB and SVC classifiers, because neither labels any ham mail as spam.  
SGD has the best precision for classifying ham mails.

### recall_score

The recall is the ratio TP / (TP + FN) where TP is the number of true positives and FN the number of false negatives.  
The recall is intuitively the ability of the classifier to find all the positive samples.
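
The following sketch uses made-up labels to show how recall differs from precision. The hypothetical classifier here never flags ham as spam, so its precision is perfect, but it finds only half of the spam.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels for illustration only: 0 = ham, 1 = spam
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 0, 0, 0, 1, 0, 0, 1, 0]

# TP = 2, FN = 2, FP = 0
rec_spam = recall_score(y_true, y_pred)      # TP / (TP + FN) = 2 / 4
prec_spam = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 2

# average=None returns one recall value per label, in the order of `labels`
rec_per_class = recall_score(y_true, y_pred, average=None, labels=[0, 1])

print(rec_spam, prec_spam, rec_per_class)
```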

In [None]:
for clf in list_clf :
    print(clf, " ", recall_score(label_test, dict_pred[clf]))

Recall is defined with respect to the positive class (label = 1, spam mails).  
Again, if we call the recall score method with average=None and the labels parameter, we get  
the recall for ham and spam messages:

In [None]:
for clf in list_clf :
    print(clf, " ", recall_score(label_test, dict_pred[clf], average=None, labels=[0,1] ))

### f1_score

The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall,  
where an F-beta score reaches its best value at 1 and its worst value at 0.  
The beta parameter sets the weight of recall relative to precision: beta > 1 favors recall, beta < 1 favors precision.  
beta == 1.0 means recall and precision are equally important; this is the F1 score computed by f1_score.
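
As a sketch of what this means numerically, the snippet below compares scikit-learn's `f1_score` and `fbeta_score` with the usual formula F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). The labels are made up for illustration and `f_beta` is a helper defined only for this example.

```python
from sklearn.metrics import f1_score, fbeta_score, precision_score, recall_score

# Made-up labels for illustration only: 0 = ham, 1 = spam
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 0, 0, 0, 1, 0, 0, 1, 0]

p = precision_score(y_true, y_pred)  # 1.0
r = recall_score(y_true, y_pred)     # 0.5


def f_beta(p, r, beta):
    """Weighted harmonic mean of precision p and recall r (example helper)."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)


f1 = f1_score(y_true, y_pred)              # same as beta = 1
f2 = fbeta_score(y_true, y_pred, beta=2)   # recall counts more
f05 = fbeta_score(y_true, y_pred, beta=0.5)  # precision counts more

print(f1, f_beta(p, r, 1))
print(f2, f_beta(p, r, 2))
print(f05, f_beta(p, r, 0.5))
```

Because recall is the weaker of the two values in this example, the score drops for beta = 2 and rises for beta = 0.5.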

In [None]:
for clf in list_clf :
    print(clf, " ", f1_score(label_test, dict_pred[clf]))

In [None]:
for clf in list_clf :
    print(clf, " ", f1_score(label_test, dict_pred[clf], average=None, labels=[0,1] ))

### classification_report

In [None]:
print(classification_report(label_test, pred_test_MNB))

The support is the number of occurrences of each class in y_true.

### precision_recall_fscore_support

In [None]:
for clf in list_clf :
    print(clf, " ", precision_recall_fscore_support(label_test, dict_pred[clf], average=None, labels=[0,1] ))

https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html

### roc_auc_score

In [None]:
for clf in list_clf :
    print(clf, " ", roc_auc_score(label_test, dict_pred[clf] ))
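
The loop above passes hard 0/1 predictions to `roc_auc_score`. The function also accepts continuous scores (for example the spam column of `predict_proba`, or `decision_function` values), and that usually gives a more informative AUC because the ranking of the messages is kept. A minimal sketch with made-up scores, not taken from the models in this notebook:

```python
from sklearn.metrics import roc_auc_score

# Made-up values for illustration only: 0 = ham, 1 = spam
y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.4, 0.6, 0.45, 0.7, 0.9]   # hypothetical spam probabilities
y_label = [int(s >= 0.5) for s in y_score]  # hard predictions at threshold 0.5

auc_from_scores = roc_auc_score(y_true, y_score)  # 8 / 9
auc_from_labels = roc_auc_score(y_true, y_label)  # 2 / 3

print(auc_from_scores, auc_from_labels)
```

With hard labels the ROC curve has only a single operating point, so the resulting value reflects one threshold rather than the whole curve.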

In [None]:
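import sklearn.metrics
# List all valid strings for the `scoring` parameter.
# sklearn.metrics.SCORERS was removed in scikit-learn 1.3;
# get_scorer_names() is available from version 1.0 on.
sklearn.metrics.get_scorer_names()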

### 3.5 Optimize classifiers with scoring by precision

We perform GridSearchCV again, using the same parameter grids and pipelines as before.  
For all classifier pipelines, we only change the scoring method from "accuracy" to "precision".
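
The effect of the `scoring` argument can be sketched on a toy example. The corpus, the pipeline `pipe` and the grid `param_grid` below are made up for illustration and are much smaller than the pipelines used in this notebook. Only the `scoring` string changes between runs.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus for illustration only: 0 = ham, 1 = spam
texts = ["win a free prize now", "free cash call now",
         "claim your free prize", "urgent call to win cash",
         "see you at lunch", "are you home yet",
         "call me when you are free", "lunch at home today"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hypothetical pipeline and parameter grid (not the ones from Part 3)
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
param_grid = {"clf__alpha": [0.1, 1.0]}

results = {}
for scoring in ["accuracy", "precision", "recall"]:
    # Same pipeline and grid every time; only the selection criterion differs
    grid = GridSearchCV(pipe, param_grid, cv=2, scoring=scoring)
    grid.fit(texts, labels)
    results[scoring] = (grid.best_params_, grid.best_score_)
    print(scoring, grid.best_params_, round(grid.best_score_, 3))
```

`best_score_` is the mean cross-validated value of the chosen metric, so the numbers for different `scoring` strings are not directly comparable with each other.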

In [None]:
scoring = 'precision'

### 3.5.1 GridSearchCV pipelines version 2

MNB

The precision for MNB was already 1.0 so it can not be improved.

KNN

In [None]:
grid_KNN_2 = GridSearchCV(pipe_KNN, parameters_KNN, cv=5,
                          scoring=scoring, n_jobs=-1, verbose=1)

grid_KNN_2.fit(X=sms_train, y=label_train)
pred_test_grid_KNN_2 = grid_KNN_2.predict(sms_test)

In [None]:
grid_KNN_2.best_params_

SVC

In [None]:
grid_SVC_2 = GridSearchCV(pipe_SVC, parameters_SVC, cv=5,
                          scoring=scoring, n_jobs=-1, verbose=1)

grid_SVC_2.fit(X=sms_train, y=label_train)
pred_test_grid_SVC_2 = grid_SVC_2.predict(sms_test)

SGD

In [None]:
grid_SGD_2 = GridSearchCV(pipe_SGD, parameters_SGD, cv=5,
                          scoring=scoring, n_jobs=-1, verbose=1)

grid_SGD_2.fit(X=sms_train, y=label_train)
pred_test_grid_SGD_2 = grid_SGD_2.predict(sms_test)

GBC

In [None]:
grid_GBC_2 = GridSearchCV(pipe_GBC, parameters_GBC, cv=5,
                          scoring=scoring, n_jobs=-1, verbose=1)

grid_GBC_2.fit(X=sms_train, y=label_train)
pred_test_grid_GBC_2 = grid_GBC_2.predict(sms_test)

XGB

In [None]:
grid_XGB_2 = GridSearchCV(pipe_XGB, parameters_XGB, cv=5,
                          scoring=scoring, n_jobs=-1, verbose=1)

grid_XGB_2.fit(X=sms_train, y=label_train)
pred_test_grid_XGB_2 = grid_XGB_2.predict(sms_test)

### 3.5.2 Confusion matrices for scoring by precision

In [None]:
list_clf = ["MNB", "KNN_2", "SVC_2", "SGD_2", "GBC_2", "XGB_2"]

list_pred = [pred_test_MNB, pred_test_grid_KNN_2, 
             pred_test_grid_SVC_2, pred_test_grid_SGD_2,
             pred_test_grid_GBC_2, pred_test_grid_XGB_2]

dict_pred_2 = dict(zip(list_clf, list_pred))

In [None]:
plot_all_confusion_matrices(label_test, dict_pred_2, "Pipelines v2, scoring=precision")

### 3.6 Optimize classifiers with scoring by recall

For spam detection, optimizing the classifiers by precision seems most reasonable.  
But for other tasks it may be advantageous to have a classifier with maximum recall.  
For example, in credit card fraud detection, you want to find all fraud samples.  
For all classifier pipelines, we perform GridSearchCV again, using the same parameter grids  
and only changing the scoring method to "recall".
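
The trade-off between precision and recall can be illustrated without GridSearchCV by moving the decision threshold of a single classifier. This is a different technique from the grid search used below, shown here only as a sketch. The scores are made up and do not come from the models in this notebook.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up values for illustration only: 0 = ham, 1 = spam
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
y_score = [0.05, 0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.95]

trade_off = {}
for thr in (0.3, 0.5, 0.65):
    # Everything with a score at or above the threshold is labeled spam
    y_pred = [int(s >= thr) for s in y_score]
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    trade_off[thr] = (p, r)
    print("threshold", thr, "precision", round(p, 3), "recall", round(r, 3))
```

A low threshold finds every spam message but flags some ham as well, while a high threshold avoids false alarms at the price of missed spam.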

In [None]:
scoring = 'recall'

### 3.6.1 GridSearchCV pipelines version 3

MNB

TODO:  
paramgrid for MNB

KNN

In [None]:
grid_KNN_3 = GridSearchCV(pipe_KNN, parameters_KNN, cv=5,
                          scoring=scoring, n_jobs=-1, verbose=1)

grid_KNN_3.fit(X=sms_train, y=label_train)
pred_test_grid_KNN_3 = grid_KNN_3.predict(sms_test)

In [None]:
grid_KNN_3.best_params_

SVC

In [None]:
grid_SVC_3 = GridSearchCV(pipe_SVC, parameters_SVC, cv=5,
                          scoring=scoring, n_jobs=-1, verbose=1)

grid_SVC_3.fit(X=sms_train, y=label_train)
pred_test_grid_SVC_3 = grid_SVC_3.predict(sms_test)

SGD

In [None]:
grid_SGD_3 = GridSearchCV(pipe_SGD, parameters_SGD, cv=5,
                          scoring=scoring, n_jobs=-1, verbose=1)

grid_SGD_3.fit(X=sms_train, y=label_train)
pred_test_grid_SGD_3 = grid_SGD_3.predict(sms_test)

GBC

In [None]:
grid_GBC_3 = GridSearchCV(pipe_GBC, parameters_GBC, cv=5,
                          scoring=scoring, n_jobs=-1, verbose=1)

grid_GBC_3.fit(X=sms_train, y=label_train)
pred_test_grid_GBC_3 = grid_GBC_3.predict(sms_test)

XGB

In [None]:
grid_XGB_3 = GridSearchCV(pipe_XGB, parameters_XGB, cv=5,
                          scoring=scoring, n_jobs=-1, verbose=1)

grid_XGB_3.fit(X=sms_train, y=label_train)
pred_test_grid_XGB_3 = grid_XGB_3.predict(sms_test)

### 3.6.2 Confusion matrices for scoring by recall

In [None]:
list_clf = ["MNB", "KNN_3", "SVC_3", "SGD_3", "GBC_3", "XGB_3"]

list_pred = [pred_test_MNB, pred_test_grid_KNN_3, 
             pred_test_grid_SVC_3, pred_test_grid_SGD_3,
             pred_test_grid_GBC_3, pred_test_grid_XGB_3]

dict_pred_3 = dict(zip(list_clf, list_pred))

In [None]:
plot_all_confusion_matrices(label_test, dict_pred_3, "Pipelines v3, scoring=recall")

### 3.7 Optimize classifiers with scoring by roc_auc_score

In [None]:
scoring = 'roc_auc'

### 3.7.1 GridSearchCV pipelines version 4

MNB

KNN

In [None]:
grid_KNN_4 = GridSearchCV(pipe_KNN, parameters_KNN, cv=5,
                          scoring=scoring, n_jobs=-1, verbose=1)

grid_KNN_4.fit(X=sms_train, y=label_train)
pred_test_grid_KNN_4 = grid_KNN_4.predict(sms_test)

In [None]:
grid_KNN_4.best_params_

In [None]:
from sklearn.metrics import roc_curve, auc

fpr, tpr, thr = roc_curve(label_test, grid_KNN_4.predict_proba(sms_test)[:,1])
plt.figure(figsize=(5, 5))
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic Plot')
auc_knn4 = auc(fpr, tpr) * 100
plt.legend(["AUC {0:.3f}".format(auc_knn4)]);

SVC

In [None]:
grid_SVC_4 = GridSearchCV(pipe_SVC, parameters_SVC, cv=5,
                          scoring=scoring, n_jobs=-1, verbose=1)

grid_SVC_4.fit(X=sms_train, y=label_train)
pred_test_grid_SVC_4 = grid_SVC_4.predict(sms_test)

SGD

In [None]:
grid_SGD_4 = GridSearchCV(pipe_SGD, parameters_SGD, cv=5,
                          scoring=scoring, n_jobs=-1, verbose=1)

grid_SGD_4.fit(X=sms_train, y=label_train)
pred_test_grid_SGD_4 = grid_SGD_4.predict(sms_test)

GBC

In [None]:
grid_GBC_4 = GridSearchCV(pipe_GBC, parameters_GBC, cv=5,
                          scoring=scoring, n_jobs=-1, verbose=1)

grid_GBC_4.fit(X=sms_train, y=label_train)
pred_test_grid_GBC_4 = grid_GBC_4.predict(sms_test)

In [None]:
from sklearn.metrics import roc_curve, auc
fpr, tpr, thr = roc_curve(label_test, grid_GBC_4.predict_proba(sms_test)[:,1])
plt.figure(figsize=(5, 5))
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic Plot')
auc_gbc4 = auc(fpr, tpr) * 100
plt.legend(["AUC {0:.3f}".format(auc_gbc4)]);

XGB

In [None]:
grid_XGB_4 = GridSearchCV(pipe_XGB, parameters_XGB, cv=5,
                          scoring=scoring, n_jobs=-1, verbose=1)

grid_XGB_4.fit(X=sms_train, y=label_train)
pred_test_grid_XGB_4 = grid_XGB_4.predict(sms_test)

### 3.7.2 Confusion matrices for scoring by roc auc

In [None]:
list_clf = ["MNB", "KNN_4", "SVC_4", "SGD_4", "GBC_4", "XGB_4"]

list_pred = [pred_test_MNB, pred_test_grid_KNN_4, 
             pred_test_grid_SVC_4, pred_test_grid_SGD_4,
             pred_test_grid_GBC_4, pred_test_grid_XGB_4]

dict_pred_4 = dict(zip(list_clf, list_pred))

In [None]:
plot_all_confusion_matrices(label_test, dict_pred_4, "Pipelines v4, scoring=roc auc")

# Part 4: NLTK

In [None]:
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

In [None]:
print(data['text'][7])

In [None]:
print(sent_tokenize(data['text'][7]))

In [None]:
print(word_tokenize(data['text'][7]))

In [None]:
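stopWords = set(stopwords.words('english'))
words = word_tokenize(data['text'][7])

# NLTK's English stopword list is lowercase, so compare the lowercased token;
# otherwise capitalized stopwords such as "The" would not be removed
wordsFiltered = [w for w in words if w.lower() not in stopWords]

print(wordsFiltered)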