**<h1><div align="center">SMS Spam Classifier </div></h1>**


---

**<h2>Problem Statement: </h2>** 

> To classify whether a new text message provided by a user is **spam** or **not spam**.

&nbsp;

**Data Link**: [UCI SMS Spam Collection Data Set](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection/) [5572 Documents]

***Keywords:*** *make_pipeline, CountVectorizer, TfidfTransformer, LogisticRegression, MultinomialNB, stop_words*

---

**<h2>Project Planning :</h2>** 


### **1. Gathering Data**
- **Imports -** 
  - Contains all the imports necessary for reading data, visualization, and model building and evaluation.
- Extracting zip files from downloaded UCI dataset.


### **2. Exploring Data**
- Understand the nature of the data using `.info()` and `.describe()`.
- Understand the distribution of target labels.
- Obtain insights on the new column 'length', on spam and ham messages, and on their relation via graphs.


### **3. Model Building**
- Data is split into *train* and *test* messages.
- Text messages are first converted into a bag-of-words representation using *CountVectorizer* and *TfidfTransformer*, then passed to a model.

- **Pipeline -** 
  - Deploying a Pipeline constructed from the three steps below.

  1. **Count Vectorizer -**
    - Remove common English words using *stop_words*.
    - **Tokenization**: Splits each document into words on whitespace and punctuation.
    - **Vocabulary building**: Collects a vocabulary of all words that appear in any of the documents, and numbers them.

  2. **Tf-idf Transformer -** 
    - Takes in the sparse matrix output produced by Count Vectorizer and transforms it by giving high weight to any term that appears often in a particular document, but not in many documents in the corpus.

  3. **Model -**
    - *Logistic Regression* and *Multinomial Naive Bayes* classifiers to classify messages (tf-idf sparse matrix) into Spam or Ham.


- **GridSearchCV -**
  - Running a grid search over the Pipeline and the parameters to adjust, with 5-fold cross-validation.
  - Fitting the grid on the train data and obtaining the best cross-validation score and best parameters.

- Exploring created Vocabulary, Stop Words, TFIDF Vocabulary
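
The pipeline steps described above can be sketched on a tiny corpus. This is only an illustration: the three messages in `docs` are hypothetical and not taken from the dataset, and the variable names are chosen for this sketch.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy messages (assumption: not part of the UCI dataset)
docs = ["Win a free prize now", "Are you free for lunch", "Lunch at noon"]

# Step 1: tokenization, stop-word removal and vocabulary building
vect = CountVectorizer(stop_words='english')
counts = vect.fit_transform(docs)        # sparse document-term count matrix
vocab = sorted(vect.vocabulary_)         # terms listed in column order

# Step 2: re-weight the counts with tf-idf (rows are L2-normalized by default)
tfidf_matrix = TfidfTransformer().fit_transform(counts)
row_norms = np.linalg.norm(tfidf_matrix.toarray(), axis=1)

print('Vocabulary:', vocab)
print('Counts:\n', counts.toarray())
print('Tf-idf:\n', tfidf_matrix.toarray().round(2))
print('Row norms:', row_norms)
```

Within a message, a term that occurs in fewer documents gets a larger tf-idf weight than a term shared across documents.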

### **4. Predictions and Evaluation**
  - Predictions on test text data.
  - Evaluation of model with classification report and confusion matrix.

### **5. New SMS Classifier Prediction for User** 
  - Refitting the model with the best parameters found by GridSearchCV on the whole dataset (X and y), without splitting.
  - Developing a function '*classify*' to classify any user-provided message as spam or not spam.
  - Letting the user input any text message and check the model's prediction for it.

---

Solution by     : **Aditya Karanth**.

GitHub Profile  : https://github.com/Aditya-Karanth

Kaggle Profile  : https://www.kaggle.com/adityakaranth

LinkedIn Profile: https://www.linkedin.com/in/u-aditya-karanth-2206/

# Imports


In [None]:
import numpy as np
import pandas as pd
import string

# Model
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline 
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix

# Visualization
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')
sns.set_context('notebook')
matplotlib.rcParams['figure.figsize'] = (12,8) 

In [None]:
# # Extracting files from 'smsspamcollection' zip file

# from zipfile import ZipFile

# with ZipFile("smsspamcollection.zip", 'r') as z:
#     # printing all the contents of the zip file
#     z.printdir()
#     # extracting all the files
#     print('Extracting all the files now...')
#     z.extractall()
#     print('Done!')

# Exploring Data

In [None]:
corpus = pd.read_csv('../input/uci-sms-spam-collection-data-set/SMSSpamCollection', sep='\t', names=['label','message'])
corpus.head()

In [None]:
corpus.info()

In [None]:
corpus.describe()

In [None]:
corpus.groupby('label').describe()

In [None]:
# Checking a random message at row 13
print('Length of Message: {}\n Message: {}'.format(len(corpus['message'][13]), corpus['message'][13]))

In [None]:
# Distribution of labels in data
sns.countplot(x='label', data=corpus)  # keyword arguments; recent seaborn versions do not accept the data positionally
display(corpus['label'].value_counts())

# Ham messages are 4825 and Spam are 747, i.e Imbalanced dataset
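
Because of this imbalance, accuracy alone can look better than it is. A small sketch of the accuracy a trivial "always ham" predictor would reach, using only the counts quoted above (the variable names are chosen for this sketch):

```python
# Label counts taken from the value_counts() comment above
n_ham, n_spam = 4825, 747
n_total = n_ham + n_spam

# Accuracy of a baseline that predicts 'ham' for every message
baseline_accuracy = n_ham / n_total
print('Total messages   : {}'.format(n_total))
print('Baseline accuracy: {:.3f}'.format(baseline_accuracy))
```

Any classifier should be judged against this baseline, which is why the evaluation later also looks at the confusion matrix and per-class precision and recall.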

In [None]:
# Adding 'length' column to corpus
corpus['length'] = corpus['message'].apply(len)
corpus

In [None]:
corpus['length'].describe()
# The shortest message has length 2 and the longest has length 910

In [None]:
# Minimum length message

print('min_mess: \n\n')
corpus[corpus['length']==2]['message'].iloc[:]

In [None]:
# Maximum length message

print('max_mess: \n') 
corpus[corpus['length']==910]['message'].iloc[0]

In [None]:
# Average lengths of ham and spam

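# Select the numeric column before averaging; mean() over the text column fails in recent pandas versions
corpus.groupby('label')['length'].mean()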

In [None]:
# Distribution of spam and ham messages length

corpus.hist(column='length', by='label', bins=50)

# Model Building

In [None]:
# Splitting data in to train and test messages

X = corpus['message']
y = corpus['label']
text_train, text_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=22) 

Two classifiers are considered for classifying high-dimensional, sparse data:
- Logistic Regression
- Multinomial Naive Bayes (selected)

## Logistic Regression

In [None]:
# PIPELINE [tf–idf actually makes use of the statistical properties of the training data]
pipe_lr = make_pipeline(CountVectorizer(stop_words='english'), # Bag-of-words (Tokenization, Vocabulary building)
                        TfidfTransformer(), # Transforms sparse matrix output produced by CountVectorizer (uses L2 normalization)
                        LogisticRegression(max_iter=1000)) # Model to classify spam/ham

# parameters for grid search
param_grid_lr = {'countvectorizer__ngram_range' : [(1,1), (1,2)], # Unigrams only, or unigrams and bigrams
                 'countvectorizer__min_df' : [1,2,3], # Minimum number of documents a term must appear in
                 'logisticregression__C' : [0.1,1,10,100,1000]} # Inverse of regularization strength (smaller C = stronger regularization)


# GRID SEARCH (using pipeline and param_grid along with cross-validation)
grid_lr = GridSearchCV(pipe_lr, param_grid_lr, cv=5, n_jobs=-1)

# Fit train data to grid
grid_lr.fit(text_train, y_train)

print('Best cross-validation score : {:.2f}\n'.format(grid_lr.best_score_))
print('Best Parameters: ', grid_lr.best_params_)
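
The keys in `param_grid_lr` follow the naming that `make_pipeline` assigns automatically: each step is named after its lowercased class name, and a step's parameter is addressed as `<step>__<parameter>`. A minimal sketch that lists the step names and looks up the grid keys (`demo_pipe` is a hypothetical name used only here):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Same three steps as pipe_lr, rebuilt here so the sketch is self-contained
demo_pipe = make_pipeline(CountVectorizer(stop_words='english'),
                          TfidfTransformer(),
                          LogisticRegression(max_iter=1000))

# Step names generated by make_pipeline
step_names = [name for name, _ in demo_pipe.steps]
print(step_names)

# Every key used in the parameter grid must be a valid pipeline parameter
valid_params = demo_pipe.get_params()
grid_keys = ['countvectorizer__ngram_range',
             'countvectorizer__min_df',
             'logisticregression__C']
for key in grid_keys:
    print(key, '->', valid_params[key])
```

A misspelled key in the grid makes `GridSearchCV.fit` raise an error, so checking the keys against `get_params()` first can save a failed run.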

## Multinomial Naive-Bayes

In [None]:
# PIPELINE [tf–idf actually makes use of the statistical properties of the training data]
pipe_nb = make_pipeline(CountVectorizer(stop_words='english'), # Bag-of-words (Tokenization, Vocabulary building)
                     TfidfTransformer(), # Transforms sparse matrix output produced by CountVectorizer (uses L2 normalization)
                     MultinomialNB()) # Model to classify spam/ham

# parameters for grid search
param_grid_nb = {'countvectorizer__ngram_range' : [(1,1), (1,2)], # Unigrams only, or unigrams and bigrams
              'countvectorizer__min_df' : [1,2,3], # Minimum number of documents a term must appear in
              'multinomialnb__alpha' : [0.001,0.01,0.1,1,10]} # Additive (Laplace/Lidstone) smoothing parameter

# GRID SEARCH (using pipeline and param_grid along with cross-validation)
grid_nb = GridSearchCV(pipe_nb, param_grid_nb, cv=5, n_jobs=-1)

# Fit train data to grid
grid_nb.fit(text_train, y_train)

print('Best cross-validation score : {:.2f}\n'.format(grid_nb.best_score_))
print('Best Parameters: ', grid_nb.best_params_)


Both models perform almost similarly on this data. Choosing **Multinomial Naive Bayes**.

In [None]:
# Cross validation mean scores
grid_nb.cv_results_['mean_test_score']

In [None]:
# Best Estimator
grid_nb.best_estimator_

In [None]:
# Count Vectorizer
vect = grid_nb.best_estimator_.named_steps["countvectorizer"]

# Vocabulary
print('len of vocabulary : ',len(vect.vocabulary_))
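# get_feature_names_out() replaces get_feature_names(), which was removed in recent scikit-learn versions
print('Every 1000th vocabulary term:\n', vect.get_feature_names_out()[::1000])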

# Stop Words
print('\nlen of stop words : ',len(vect.get_stop_words()))
print('\nstop words : ',vect.get_stop_words())

In [None]:
# Tf-idf Transformer
tfidf = grid_nb.best_estimator_.named_steps["tfidftransformer"]

print('len of vocabulary : ',len(tfidf.idf_))

print('\nIDF of word "phone" is', tfidf.idf_[vect.vocabulary_['phone']])
print('IDF of word "cat" is', tfidf.idf_[vect.vocabulary_['cat']])
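
The IDF values printed above can be reproduced by hand. The sketch below assumes the `TfidfTransformer` default `smooth_idf=True`, for which the weight is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` the number of documents containing the term. The document counts used here are hypothetical, and `smoothed_idf` is a helper written for this sketch, not a scikit-learn function.

```python
import numpy as np

def smoothed_idf(n_docs, doc_freq):
    """IDF as computed by TfidfTransformer with smooth_idf=True (the default)."""
    return np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Hypothetical counts: a corpus of 1000 documents
common_idf = smoothed_idf(1000, 100)   # term found in 100 documents
rare_idf = smoothed_idf(1000, 2)       # term found in only 2 documents
everywhere_idf = smoothed_idf(9, 9)    # term found in every document

print('IDF of a common term      : {:.3f}'.format(common_idf))
print('IDF of a rare term        : {:.3f}'.format(rare_idf))
print('IDF of a term in all docs : {:.3f}'.format(everywhere_idf))
```

Rare terms receive a larger IDF than common ones, and a term present in every document gets the minimum value of 1.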

# Predictions and Evaluation

In [None]:
y_pred_nb = grid_nb.predict(text_test)

print('Test Score: {:.3f}'.format(grid_nb.score(text_test, y_test)))

print('\nConfusion Matrix: \n', confusion_matrix(y_test, y_pred_nb))
print('\nClassification Report: \n', classification_report(y_test, y_pred_nb))
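
To read the confusion matrix, recall that `confusion_matrix` puts true labels on rows and predicted labels on columns, with labels in sorted order (`ham`, then `spam`). The sketch below derives the spam-class precision, recall and F1 from a matrix by hand. The counts in `cm` are hypothetical and are not the results of this notebook.

```python
import numpy as np

# Hypothetical confusion matrix (assumption, for illustration only)
# rows = true label, columns = predicted label, order = ['ham', 'spam']
cm = np.array([[1440,   5],
               [  40, 187]])

# For the 'spam' class: tn = ham kept as ham, fp = ham flagged as spam,
# fn = spam missed, tp = spam caught
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of messages flagged as spam, how many are spam
recall = tp / (tp + fn)      # of actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print('Precision: {:.3f}'.format(precision))
print('Recall   : {:.3f}'.format(recall))
print('F1       : {:.3f}'.format(f1))
print('Accuracy : {:.3f}'.format(accuracy))
```

For a spam filter, low spam precision means legitimate messages are being blocked, while low spam recall means spam is getting through.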

# **New SMS Classifier Prediction for User**

Re-Modelling

In [None]:
# MultinomialNB model using the best parameters and complete data (messages)
pipe = make_pipeline(CountVectorizer(stop_words='english', ngram_range=(1,2)),
                     TfidfTransformer(),
                     MultinomialNB(alpha=0.1))
model = pipe.fit(X,y)
# Note: this is accuracy on the same data the model was fitted on, not a held-out estimate
print('Final Score - ',model.score(X, y))

Function to classify any user provided message

In [None]:
def classify(x):
  pred = model.predict([x])
  return "This is a Spam message" if pred[0]=='spam' else "This is not a Spam message"
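
A possible variant of `classify` that also reports how confident the model is, using `predict_proba`. This is a sketch under stated assumptions: `classify_with_confidence` is a hypothetical helper, and `toy_model` is fitted on four made-up messages only so that the block runs on its own. In the notebook, the fitted `model` from the previous cell would be passed instead.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data (assumption: stands in for the real corpus)
toy_messages = ["win a free prize now", "free cash prize claim now",
                "are you coming for lunch", "see you at lunch tomorrow"]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_model = make_pipeline(CountVectorizer(),
                          TfidfTransformer(),
                          MultinomialNB(alpha=0.1)).fit(toy_messages, toy_labels)

def classify_with_confidence(text, fitted_model=toy_model):
    """Return the predicted label and the model's probability for that label."""
    proba = fitted_model.predict_proba([text])[0]
    best = proba.argmax()
    return fitted_model.classes_[best], float(proba[best])

spam_label, spam_conf = classify_with_confidence("claim your free prize")
ham_label, ham_conf = classify_with_confidence("lunch tomorrow")
print(spam_label, round(spam_conf, 3))
print(ham_label, round(ham_conf, 3))
```

Naive Bayes probabilities tend to be pushed toward 0 or 1, so the value is better read as a ranking signal than as a calibrated probability.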

--> Type any new text in `message = '___'` and run the cell.

In [None]:
# This is a new message that is not in the dataset
message = 'Our records show that you overpaid for (a product or service). Kindly supply your bank routing and account number to receive your refund.'

classify(message)

In [None]:
message = 'Hey, hope you are doing well and sound'

classify(message)

In [None]:
# This is a new message that is not in the dataset
message = 'IMPORTANT - You could be entitled up to £3,160 in compensation from mis-sold PPI on a credit card or loan. Please reply PPI for info or STOP to opt out.'

classify(message)