# Dataset: SMS Spam Collection Dataset

Problem Statement: Use this dataset to build a prediction model that accurately classifies which texts are spam.
  

Dataset Description:

The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains one set of 5,574 SMS messages in English, each tagged as either ham (legitimate) or spam.
The files contain one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.

Introduction:
    This problem is about classifying messages as spam or ham. Classification algorithms are used to build models that predict the label. The performance of each model is measured using the following metrics:
    Accuracy, Precision (measuring exactness), Recall (measuring completeness) and the F1 Score (a compromise between Precision and Recall). The formulas for these metrics are (TP = # True Positives, TN = # True Negatives, FP = # False Positives, FN = # False Negatives):
  1. Accuracy = (TP + TN) / (TP + TN + FP + FN)
  2. Precision = TP / (TP + FP)
  3. Recall = TP / (TP + FN)
  4. F1 Score = 2 * Precision * Recall / (Precision + Recall)
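
The four formulas can be checked by hand. The sketch below is illustrative only: the function name `classification_metrics` is hypothetical, and the counts are read off the Naive Bayes confusion matrix `[[822, 127], [24, 142]]` reported later in this notebook, treating spam as the positive class. Returning 0.0 when a denominator is zero is an assumption of this sketch.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumption of this sketch: fall back to 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# Rows of the matrix are actual classes, columns are predicted classes:
# [[TN, FP], [FN, TP]] = [[822, 127], [24, 142]] with spam as positive.
acc, prec, rec, f1 = classification_metrics(tp=142, tn=822, fp=127, fn=24)
print(round(acc, 4), round(prec, 2), round(rec, 2), round(f1, 2))
# prints: 0.8646 0.53 0.86 0.65
```

These values agree with the accuracy and the Spam row of the Naive Bayes classification report shown further down.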
Steps involved in building the model are:
1. Importing the dataset
2. Text preprocessing and building the bag-of-words model
3. Building machine learning models and measuring their performance
     1. Naive Bayes
     2. Random Forest classifier
     3. KNN
     4. SVM, with hyperparameter tuning using GridSearchCV

In [1]:
#importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
#importing the dataset
dataset=pd.read_csv("spam.csv",encoding='latin-1')
dataset.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [3]:
#removing unwanted column
dataset.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'],axis=1,inplace=True)


In [4]:
#renaming the column and encoding the label
dataset.rename(columns={'v1': 'label','v2':'text'},inplace=True)
dataset['label']=dataset.label.map({'ham':0,'spam':1})
dataset.head()

Unnamed: 0,label,text
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [5]:
#frequency table for univariate categorical variable
dataset.label.value_counts()

0    4825
1     747
Name: label, dtype: int64

In [5]:
#to check if there is NaN value
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
label    5572 non-null int64
text     5572 non-null object
dtypes: int64(1), object(1)
memory usage: 87.1+ KB


# Text Preprocessing

In [6]:
#Data cleaning
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
#build the stemmer and the stop-word set once, outside the loop
ps=PorterStemmer()
stop_words=set(stopwords.words('english'))
corpus=[]
for i in range(0,len(dataset)):
    #keep letters only, lowercase, and split into words
    text=re.sub('[^A-Za-z]',' ',dataset['text'][i])
    text=text.lower()
    text=text.split()
    #drop stop words and stem the remaining words
    text=[ps.stem(word) for word in text if word not in stop_words]
    text=' '.join(text)
    corpus.append(text)



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Home\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
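
To see what the cleaning loop does to a single message, here is a simplified, dependency-free sketch. It is not the notebook's exact pipeline: `STOP_WORDS` is a small hand-picked set standing in for NLTK's English list, the stemming step is omitted, and `clean_message` is a hypothetical helper name. The sample text is the visible start of the third message in the dataset preview.

```python
import re

# Hypothetical, hand-picked subset; the notebook uses NLTK's full English list.
STOP_WORDS = {"in", "a", "to", "the", "is"}


def clean_message(message, stop_words=STOP_WORDS):
    """Keep letters only, lowercase, tokenize, and drop stop words (no stemming)."""
    letters_only = re.sub(r"[^A-Za-z]", " ", message)
    tokens = letters_only.lower().split()
    return " ".join(t for t in tokens if t not in stop_words)


print(clean_message("Free entry in 2 a wkly comp to win FA Cup"))
# prints: free entry wkly comp win fa cup
```

Note that the digit `2` disappears because the regular expression replaces every non-letter character with a space.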


In [7]:
#creating the bag-of-words model (toarray() converts the sparse matrix to a dense array)
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(max_features=6000)
X=cv.fit_transform(corpus).toarray()
y=dataset.iloc[:,0].values
X.shape

(5572, 6000)
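
The idea behind the bag-of-words matrix can be sketched in plain Python. This is a simplified illustration, not how `CountVectorizer` is implemented: it tokenizes on whitespace only, and the helper name `bag_of_words` and the three toy documents are made up for the example. Each row is a document, each column a vocabulary term, and each cell a count.

```python
from collections import Counter


def bag_of_words(corpus, max_features=None):
    """Build a vocabulary and a dense document-term count matrix."""
    totals = Counter(word for doc in corpus for word in doc.split())
    # Keep the most frequent terms, then order the columns alphabetically.
    kept = [word for word, _ in totals.most_common(max_features)]
    vocabulary = sorted(kept)
    matrix = []
    for doc in corpus:
        counts = Counter(doc.split())
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Toy, already-cleaned documents (hypothetical).
vocab, X_demo = bag_of_words(["free entri win", "ok lar joke", "win free free cup"])
print(vocab)
# prints: ['cup', 'entri', 'free', 'joke', 'lar', 'ok', 'win']
print(X_demo)
# prints: [[0, 1, 1, 0, 0, 0, 1], [0, 0, 0, 1, 1, 1, 0], [1, 0, 2, 0, 0, 0, 1]]
```

Passing `max_features=2` would keep only the two most frequent terms (`free` and `win` here), which is the role `max_features=6000` plays in the cell above.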

In [8]:
#splitting the dataset into train set and test set
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.20,random_state=0)



# Machine Learning Model
Naive Bayes and Random Forest classifiers often work well for classifying text data.

In [9]:
#Fitting Naive Bayes to Training set
from sklearn.naive_bayes import GaussianNB
classifier_NB=GaussianNB()
classifier_NB.fit(X_train,y_train)

GaussianNB(priors=None)

In [10]:
#Predicting the test set result
y_predict=classifier_NB.predict(X_test)

In [11]:
#evaluating performance using the confusion matrix
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report
cm=confusion_matrix(y_test,y_predict)
accuracy=accuracy_score(y_test,y_predict)
print("Confusion_matrix",cm)
print("accuracy:",accuracy)


Confusion_matrix [[822 127]
 [ 24 142]]
accuracy: 0.864573991031


In [12]:
#evaluating performance using K-Fold cross-validation
from sklearn.model_selection import cross_val_score
cv=cross_val_score(estimator=classifier_NB,X=X_train,y=y_train,cv=10)
accuracy=cv.mean()
print(accuracy)

0.8738972085
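
The number above is the mean of ten per-fold accuracies. The mechanism can be sketched without scikit-learn, under clearly simplified assumptions: the folds here are contiguous and unstratified, the "model" is a baseline that predicts the majority class of the training folds, and the labels are toy values. The helper names are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def majority_baseline_cv(labels, k):
    """Mean accuracy over k folds of a baseline predicting the training majority class."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        train_labels = [labels[i] for i in train_idx]
        majority = max(set(train_labels), key=train_labels.count)
        correct = sum(1 for i in test_idx if labels[i] == majority)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores), scores


toy_labels = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]  # hypothetical labels: 0 = ham, 1 = spam
mean_acc, fold_scores = majority_baseline_cv(toy_labels, k=5)
print(fold_scores, mean_acc)
# prints: [1.0, 0.5, 1.0, 0.5, 1.0] 0.8
```

Each sample is used for testing exactly once, so the averaged score depends less on one particular train/test split than a single hold-out accuracy does.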


In [13]:
print(classification_report(y_test,y_predict,target_names=["Ham","Spam"]))

             precision    recall  f1-score   support

        Ham       0.97      0.87      0.92       949
       Spam       0.53      0.86      0.65       166

avg / total       0.91      0.86      0.88      1115



# Ensemble Method

In [15]:
#Fitting Random forest classifier to training set
from sklearn.ensemble import RandomForestClassifier
classifier_RF=RandomForestClassifier(n_estimators=10,criterion='entropy',random_state=0)
classifier_RF.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
            verbose=0, warm_start=False)

In [16]:
#predicting the test set results
y_predict=classifier_RF.predict(X_test)
accuracy_score(y_test,y_predict)

0.96322869955156953

In [17]:
#measuring accuracy using kfold cross validation
cv=cross_val_score(estimator=classifier_RF,X=X_train,y=y_train,cv=10)
accuracy=cv.mean()
print(accuracy)

0.969039612769


In [18]:
print(classification_report(y_test,y_predict,target_names=["Ham","Spam"]))

             precision    recall  f1-score   support

        Ham       0.96      1.00      0.98       949
       Spam       0.99      0.76      0.86       166

avg / total       0.96      0.96      0.96      1115



# K Nearest Neighbor algorithm

In [19]:
#fitting KNN classifier to training set
from sklearn.neighbors import KNeighborsClassifier
classifier_KNN=KNeighborsClassifier(n_neighbors=5,metric='minkowski',p=2,n_jobs=-1)
classifier_KNN.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=-1, n_neighbors=5, p=2,
           weights='uniform')

In [20]:
#predicting the test result using the KNN classifier
y_predict=classifier_KNN.predict(X_test)
accuracy_score(y_test,y_predict)

0.90852017937219731

In [21]:
#measuring accuracy using kfold cross validation
cv=cross_val_score(estimator=classifier_KNN,X=X_train,y=y_train,cv=10,n_jobs=-1)
accuracy=cv.mean()
print(accuracy)

0.918327119692


In [22]:
print(classification_report(y_test,y_predict,target_names=["Ham","Spam"]))

             precision    recall  f1-score   support

        Ham       0.90      1.00      0.95       949
       Spam       1.00      0.39      0.56       166

avg / total       0.92      0.91      0.89      1115



# SVM classifier

In [23]:
#fitting SVM classifier to training set
from sklearn.svm import SVC
classifier_SVC=SVC(kernel='rbf',random_state=0)
classifier_SVC.fit(X_train,y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=0, shrinking=True,
  tol=0.001, verbose=False)

In [24]:
#predicting the test result using the SVM classifier
y_predict=classifier_SVC.predict(X_test)
accuracy_score(y_test,y_predict)

0.85112107623318389

In [25]:
#measuring accuracy using kfold cross validation
cv=cross_val_score(estimator=classifier_SVC,X=X_train,y=y_train,cv=10,n_jobs=-1)
accuracy=cv.mean()
print(accuracy)

0.869643641869


In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_predict, target_names = ["Ham", "Spam"]))

             precision    recall  f1-score   support

        Ham       0.85      1.00      0.92       949
       Spam       0.00      0.00      0.00       166

avg / total       0.72      0.85      0.78      1115
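
The zero precision and recall for Spam suggest that this SVM labels every test message as ham. A quick check, using only the class counts from the support column above, shows that such an all-ham predictor would reach the same accuracy as the one reported for the SVM (0.8511). This is a consistency check, not a re-run of the model.

```python
# Test-set class counts taken from the support column of the report above.
n_ham, n_spam = 949, 166
y_true = [0] * n_ham + [1] * n_spam
y_all_ham = [0] * (n_ham + n_spam)  # a "classifier" that never predicts spam

accuracy = sum(t == p for t, p in zip(y_true, y_all_ham)) / len(y_true)
spam_recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_all_ham)) / n_spam
print(round(accuracy, 4), spam_recall)
# prints: 0.8511 0.0
```

With about 85% of the test messages being ham, accuracy alone hides the fact that no spam is caught, which is why precision and recall are reported alongside it.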





# Tuning hyper parameters using GridSearchCV


In [None]:

from sklearn.model_selection import GridSearchCV
#search a linear kernel over C, and an RBF kernel over C and gamma
parameters=[{'kernel':['linear'],'C':[1,10,100,1000]},
            {'kernel':['rbf'],'C':[1,10,100,1000],'gamma':[0.5,0.1,0.01,0.001]}]
grid_search=GridSearchCV(classifier_SVC,parameters,scoring='accuracy',cv=10,n_jobs=-1)
grid_search.fit(X_train,y_train)
#best mean cross-validated accuracy and the parameters that produced it
best_score=grid_search.best_score_
best_param=grid_search.best_params_
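
The notebook does not show the grid search output, so the sketch below only illustrates how the results could be inspected. It runs on synthetic stand-in data from `make_classification` rather than the SMS features, uses a much smaller grid, and all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; this is not the SMS dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# A reduced grid with the same shape as the one used in the notebook.
param_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.1, 0.01]},
]
search = GridSearchCV(SVC(random_state=0), param_grid, scoring="accuracy", cv=5)
search.fit(X_tr, y_tr)

print("best CV accuracy:", search.best_score_)
print("best parameters:", search.best_params_)
# With the default refit=True, best_estimator_ is retrained on all of X_tr,
# so it can be scored directly on the held-out split.
print("held-out accuracy:", search.best_estimator_.score(X_te, y_te))
```

Scoring `best_estimator_` on the held-out split gives a number comparable to the test accuracies reported for the other models.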


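Conclusion: For this dataset, the Random Forest classifier achieved the highest accuracy of the models whose results are reported above (about 0.96 on the test set and 0.97 with 10-fold cross-validation).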