# SMS Spam Detection with Feature Selection

In this Jupyter notebook I apply feature selection to a dataset whose features are words rather than numeric or categorical values. I chose the spam SMS dataset for its compact size. First I train a K-Nearest Neighbors classifier with sklearn without any feature selection. Then I perform feature selection and check whether the prediction result improves.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

## Import and overview of the dataset

In [None]:
df = pd.read_csv('../input/spam.csv', encoding='latin-1')

In [None]:
df.head()

In [None]:
df.shape

## Some data clean up

In [None]:
# Remove garbage columns
df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)

In [None]:
# Remove any empty rows
df.dropna(inplace=True)

In [None]:
# Set column names to something meaningful
df.columns = ['type', 'sms']

In [None]:
# Convert target values to numeric
df.loc[df['type'] == 'ham', 'type'] = 0
df.loc[df['type'] == 'spam', 'type'] = 1

In [None]:
df.head()

In [None]:
# Remove any HTML tags from the text (if present) and lowercase the words
from bs4 import BeautifulSoup
df['sms'] = df['sms'].apply(lambda x: BeautifulSoup(x.lower(), 'html.parser').get_text())

In [None]:
# Separating the features and target values
X = df['sms']
y = df['type'].astype('int')

In [None]:
X.shape, y.shape

## Preparing the data for training

In [None]:
from sklearn.model_selection import train_test_split

# We use 70% of the data for training and the rest for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=111, test_size=0.3)

In [None]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

## Vectorizing the features

The classifier cannot work with plain text directly. It only accepts numeric features, so we have to convert each message into a list of numbers. This conversion is known as vectorizing the features. We are going to use sklearn's TfidfVectorizer, which is based on the TF-IDF weighting scheme and is a common choice for text classification tasks such as spam detection.
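
To make the idea concrete, here is a minimal sketch on three made-up messages (the `toy_` names and the messages are hypothetical, not part of the SMS dataset). It shows that the result has one row per message and one column per distinct word, and that a word occurring in fewer messages gets a larger weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy messages, not taken from the SMS dataset
toy_messages = [
    "free prize call now",
    "call me when you are free",
    "see you at lunch",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_messages)

# One row per message, one column per distinct word in the toy corpus
print(toy_matrix.shape)

# vocabulary_ maps each word to its column index
vocab = toy_vectorizer.vocabulary_
print(sorted(vocab))

# In the first message, "prize" (found in one message) should weigh more
# than "free" (found in two messages)
row0 = toy_matrix[0].toarray().ravel()
print(row0[vocab["prize"]] > row0[vocab["free"]])
```

Both words occur once in the first message, so the difference in weight comes only from the inverse document frequency part of TF-IDF.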

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)

In [None]:
# Let's have a look at the vectorized features
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
vectorizer.get_feature_names_out()

In [None]:
# So, we have 8669 features in total (or words, if that is easier to think about)
X_train_vectorized.shape

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Create a K-Nearest Neighbors classifier model with 3 neighbors
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_vectorized, y_train)

## Test the accuracy of trained model

In [None]:
# CAUTION: We reuse the vectorizer that was fitted on the training data.
# For the test data we only call transform, not fit_transform, so that
# the test features use the vocabulary learned from the training data.
X_test_vectorized = vectorizer.transform(X_test)
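
The difference between `transform` and `fit_transform` can be illustrated with a small sketch. The texts and variable names below are made up for illustration. Reusing the fitted vectorizer keeps the column layout of the training matrix and silently ignores words it has never seen, while refitting on the new texts produces a different, incompatible set of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy texts
train_texts = ["win a free prize", "are you coming home"]
test_texts = ["free pizza at home tonight"]

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_texts)  # learns the vocabulary
test_matrix = vec.transform(test_texts)        # reuses the learned vocabulary

# Same number of columns, so a model trained on train_matrix can use test_matrix
print(train_matrix.shape[1], test_matrix.shape[1])

# Only the words already known from training produce non-zero entries
print(test_matrix.nnz)

# Refitting on the test texts yields a different vocabulary and column count
refit_matrix = TfidfVectorizer().fit_transform(test_texts)
print(refit_matrix.shape[1])
```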

In [None]:
y_pred = knn.predict(X_test_vectorized)

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

So, using all the features available in the dataset, we achieved an accuracy of more than 92%. Next we apply feature selection, keeping only the 100 highest-scoring features, and retrain the model. Let's see whether this improves the prediction result.

## Process dataset using feature selection

For the problem at hand, sklearn offers two suitable selectors: SelectKBest (keeps the K highest-scoring features) and SelectPercentile (keeps a given percentage of the original features). Here we use SelectKBest with K = 100 and chi2 as the scoring function.
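
Before applying this to the real data, here is a minimal sketch of both selectors on a hand-made count matrix. The matrix, labels, and `_toy` names are hypothetical. The first two columns occur in only one class each, so chi2 should rank them highest; the last two columns are spread evenly over both classes.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, SelectPercentile, chi2

# Hypothetical word counts: 6 messages x 4 words
X_toy = np.array([
    [3, 0, 1, 1],
    [2, 0, 1, 0],
    [4, 0, 0, 1],
    [0, 2, 1, 1],
    [0, 3, 1, 0],
    [0, 1, 0, 1],
])
y_toy = np.array([1, 1, 1, 0, 0, 0])  # 1 = spam, 0 = ham

# Keep the 2 columns with the highest chi2 score
k_best = SelectKBest(score_func=chi2, k=2).fit(X_toy, y_toy)
print(k_best.scores_)
print(k_best.get_support(indices=True))

# Keep the top 50% of the columns instead of a fixed number
top_half = SelectPercentile(score_func=chi2, percentile=50).fit(X_toy, y_toy)
print(top_half.transform(X_toy).shape)
```

Note that chi2 requires non-negative feature values, which holds for both raw counts and TF-IDF weights.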

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

skb = SelectKBest(score_func=chi2, k=100)
features_fit = skb.fit(X_train_vectorized, y_train)

In [None]:
features_fit.scores_

In [None]:
X_train_selected = features_fit.transform(X_train_vectorized)

In [None]:
X_train_selected.shape

## Train new model with feature selected data

In [None]:
knn_selected = KNeighborsClassifier(n_neighbors=3)
knn_selected.fit(X_train_selected, y_train)

In [None]:
X_test_selected = features_fit.transform(X_test_vectorized)
y_pred_selected = knn_selected.predict(X_test_selected)

In [None]:
accuracy_score(y_test, y_pred_selected)

Woohoo! Without changing the data or the algorithm, and keeping only the 100 best features, we achieved an accuracy score of 95.6%, an improvement of roughly 3.6 percentage points over the >92% we got above.

We can try other training algorithms and play with the model parameters and the number of features to see whether we can reach an accuracy score as high as 99%. But that is for some other day :)
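
One possible way to run such an experiment, sketched here under assumptions, is to chain the three steps in a sklearn Pipeline and let GridSearchCV try several values for the number of features and the number of neighbors. The toy corpus, the step names, and the parameter values are hypothetical. Because the corpus contains repeated messages, the resulting score says nothing about real performance; with the real data you would pass `X_train` and `y_train` instead.

```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy corpus, repeated so that 3-fold cross-validation has enough rows
spam = ["free prize call now", "win cash prize now",
        "free cash call now", "claim free prize today"]
ham = ["see you at lunch", "call me later tonight",
       "are you coming home", "lunch at home today"]
texts = (spam + ham) * 3
labels = ([1] * len(spam) + [0] * len(ham)) * 3

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("select", SelectKBest(score_func=chi2)),
    ("knn", KNeighborsClassifier()),
])

# Step name + double underscore + parameter name addresses a step's parameter
param_grid = {
    "select__k": [5, 10, "all"],
    "knn__n_neighbors": [1, 3],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

A side benefit of the pipeline is that the vectorizer and the selector are refitted inside each cross-validation fold, so the held-out part never influences the vocabulary or the selected features.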

## Saving the trained model on disk

After we successfully train a model, we want to save it on disk so that it can be used later. We are going to save our trained model using pickle.

In [None]:
import pickle

# Use a context manager so each file is flushed and closed after writing
with open('knn_selected.model', 'wb') as f:
    pickle.dump(knn_selected, f)

# What is easy to forget is that we also have to save the
# vectorizer and the feature selection model for future use.
# Our model was trained on their output, so any future data
# has to pass through the same fitted objects before we can
# predict on it. Otherwise we will end up with a shape
# mismatch error when we feed the loaded model data that was
# processed by a fresh vectorizer and a fresh feature
# selection model.

# Save the vectorizer
with open('knn.vect', 'wb') as f:
    pickle.dump(vectorizer, f)

# Save the feature selection model
with open('feature_selector.feat', 'wb') as f:
    pickle.dump(features_fit, f)
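
As an alternative sketch (not what this notebook does), the vectorizer, the feature selector, and the classifier can be combined into one sklearn pipeline, so that a single pickle file holds everything needed for prediction. The toy texts, the variable names, and the file name `sms_pipeline.pkl` are hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy corpus: first three messages are spam (1), the rest ham (0)
texts = [
    "free prize call now",
    "free cash prize",
    "win free prize now",
    "call me later",
    "see you later",
    "lunch later today",
]
labels = [1, 1, 1, 0, 0, 0]

# One object that vectorizes, selects features, and classifies
bundle = make_pipeline(
    TfidfVectorizer(),
    SelectKBest(score_func=chi2, k=5),
    KNeighborsClassifier(n_neighbors=3),
)
bundle.fit(texts, labels)

# Hypothetical file name, written to a temporary directory
path = os.path.join(tempfile.mkdtemp(), "sms_pipeline.pkl")
with open(path, "wb") as f:
    pickle.dump(bundle, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored pipeline accepts raw text directly
new_texts = ["free prize now", "see you later"]
print(restored.predict(new_texts))
```

With this layout there is no way to load the model without its matching vectorizer and selector, which removes the shape mismatch risk described in the comment above.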

In [None]:
os.listdir(os.getcwd())

## Load the trained model from disk

In [None]:
with open('knn_selected.model', 'rb') as f:
    knn_selected_saved = pickle.load(f)
with open('knn.vect', 'rb') as f:
    vectorizer_saved = pickle.load(f)
with open('feature_selector.feat', 'rb') as f:
    features_fit_saved = pickle.load(f)

Let's test our loaded model by using it to predict on our test data. If we get the same accuracy score of 95.6%, the model was reloaded successfully.

In [None]:
saved_pred = knn_selected_saved.predict(X_test_selected)
accuracy_score(y_test, saved_pred)

Perfect!! :)

## Test loaded model on completely unseen data

In [None]:
validation_data = pd.DataFrame.from_dict({
        'sms': ['Baa, baa, black sheep, have you any wool? Yes sir, yes sir, three bags full! One for the master, And one for the dame, One for the little boy Who lives down the lane']
    })

validation_data.head()

In [None]:
# Similar data clean up we did earlier on our train/test dataset
validation_data['sms'] = validation_data['sms'].apply(lambda x: BeautifulSoup(x.lower(), 'html.parser').get_text())

In [None]:
# Note we are processing our data using our loaded vectorizer and 
# feature selection model and only doing transform, not fit_transform
validation_features = vectorizer_saved.transform(validation_data['sms'])
validation_features = features_fit_saved.transform(validation_features)

In [None]:
# Let's see if our feature selection worked
validation_features.shape

In [None]:
# Predict on new unseen data with the loaded model
knn_selected_saved.predict(validation_features)

In [None]:
knn_selected_saved.predict_proba(validation_features)

So, our new unseen data was classified as not spam (remember from above that 0 means not spam). I am glad that my favorite childhood rhyme was not detected as spam :) Also, predict_proba() reports a probability of 1.0 for class 0. For a K-Nearest Neighbors model with 3 neighbors this means that all 3 nearest training messages were not spam.
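
To see where such a probability comes from, here is a minimal sketch on made-up one-dimensional points (the `toy_` and `_toy` names are hypothetical). With the default uniform weights, KNeighborsClassifier's `predict_proba` returns the share of the k nearest neighbors that belong to each class, so with 3 neighbors the only possible values are 0, 1/3, 2/3, and 1.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D points: three ham (0) and three spam (1)
X_toy = np.array([[0.0], [1.0], [2.0], [4.0], [10.0], [11.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_knn = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# Nearest neighbors of 0.5 are 0, 1 and 2: all ham
proba_all_ham = toy_knn.predict_proba([[0.5]])
print(proba_all_ham)

# Nearest neighbors of 3.0 are 2, 4 and 1: two ham, one spam
proba_mixed = toy_knn.predict_proba([[3.0]])
print(proba_mixed)
```

So a value of 1.0 here is a unanimous vote among 3 neighbors rather than a calibrated confidence estimate.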