Step 1: Importing all the necessary libraries 

In [2]:
import pandas as pd
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk import word_tokenize,pos_tag
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import svm

Step 2: Loading the data into a dataframe

In [11]:
# Raw string so the backslashes in the Windows path are not read as escape sequences.
df = pd.read_csv(r'OneDrive\Desktop\Projects\IITG\spam.csv', encoding='latin-1')
df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


Here, the column names 'v1' and 'v2' say nothing about what the columns contain, and the three 'Unnamed' columns are empty in the rows shown. So I'll keep only the first two columns and rename them to something meaningful.

In [13]:
df=df[['v1','v2']]
df=df.rename(columns={'v1':'label','v2':'text'})
df.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Step 3: Normalising the data by converting all text to lowercase (case normalisation).
(Please refer to the write-up to see why this method of normalisation was chosen.)

In [14]:
def normalize(msg):
    msg=msg.lower()
    return msg

In [17]:
df['text']=df['text'].apply(normalize)
df.head()

Unnamed: 0,label,text
0,ham,"go until jurong point, crazy.. available only ..."
1,ham,ok lar... joking wif u oni...
2,spam,free entry in 2 a wkly comp to win fa cup fina...
3,ham,u dun say so early hor... u c already then say...
4,ham,"nah i don't think he goes to usf, he lives aro..."
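To make the effect of case normalisation concrete, here is a minimal sketch using only the standard library. The sample messages and the helper `token_counts` are hypothetical and are not taken from spam.csv; splitting on whitespace is a simplifying assumption.

```python
from collections import Counter

# Hypothetical sample messages, not taken from spam.csv.
messages = ["FREE entry now", "Free prize inside", "free call today"]

def token_counts(texts, lowercase):
    """Count whitespace-separated tokens, optionally lowercasing first."""
    counts = Counter()
    for text in texts:
        if lowercase:
            text = text.lower()
        counts.update(text.split())
    return counts

raw = token_counts(messages, lowercase=False)
folded = token_counts(messages, lowercase=True)

print(len(raw), len(folded))  # distinct tokens before and after lowercasing
print(folded["free"])         # occurrences of the merged token
```

Without lowercasing, "FREE", "Free" and "free" are three separate features; after lowercasing they collapse into one, so the classifier sees a single, stronger signal.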


Step 4: Before vectorizing the normalized words, let's split the data into a test set and train set.

In [39]:
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size = 0.1, random_state = 1)

Step 5: Vectorizing the data. Here, I'm using the TF-IDF vectorizer. (Refer to the write-up to see why.)

In [41]:
# Learn the vocabulary and IDF weights from the training text only, then transform it.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
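As a rough illustration of what a TF-IDF weighting computes, the sketch below works through it by hand. It assumes the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each row, which is my understanding of the TfidfVectorizer defaults. The toy corpus, the function name `tfidf` and the whitespace tokenisation are assumptions for illustration only, so the numbers will not match the notebook's matrix.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lowercased as in Step 3.
docs = ["free entry free prize", "ok see you later", "free call later"]

def tfidf(corpus):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenized = [doc.split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

weights = tfidf(docs)
# "free" appears in two documents and "entry" in one, so "entry" has the higher IDF,
# but "free" still gets the larger weight in the first message because it occurs twice there.
print(weights[0])
```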

Step 6: Now that the words have been vectorized, I'm using an SVM (again, refer to the write-up for the basis of selection) to build the classifier.

In [60]:
model=svm.SVC(kernel='linear', C=1000, gamma=1)  # gamma has no effect with a linear kernel
model.fit(X_train,y_train)

SVC(C=1000, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=1, kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
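One possible alternative, not what this notebook does, is to bundle the vectorizer and the classifier into a single scikit-learn pipeline, so that the vocabulary is always fitted on whatever data is passed to `fit` and never on the test set. The sketch below assumes scikit-learn is installed; the toy messages and the name `clf` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy messages, not taken from spam.csv.
train_text = [
    "free entry win prize",
    "win cash now",
    "ok see you later",
    "are you coming home",
]
train_label = ["spam", "spam", "ham", "ham"]

# TfidfVectorizer lowercases by default, so the pipeline repeats Step 3 internally.
clf = make_pipeline(TfidfVectorizer(), SVC(kernel="linear", C=1000))
clf.fit(train_text, train_label)

# Predictions on the training messages themselves, just to show the call pattern.
train_pred = clf.predict(train_text)
print(list(train_pred))
```

With a pipeline, raw text goes straight into `clf.predict`, so there is no separate `vectorizer.transform` call to forget or to misapply.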

Step 7: Now, to test the accuracy of this model:

In [64]:
# Reuse the vectorizer fitted on the training set; it must not be refitted on the test data.
X_test=vectorizer.transform(X_test)
y_pred=model.predict(X_test)
print(confusion_matrix(y_test, y_pred))

[[489   1]
 [  2  66]]


In the above confusion matrix, rows are the true labels and columns are the predicted labels, in sorted order (ham, spam). Taking ham as the positive class, it tells us the number of:
True Positives: 489 (ham predicted as ham),
False Negatives: 1 (ham predicted as spam),
False Positives: 2 (spam predicted as ham),
True Negatives: 66 (spam predicted as spam)

Accuracy = (Number of True Positives + Number of True Negatives) / Total

In [66]:
Accuracy=(489+66)/(490+68)
print(Accuracy*100)

99.46236559139786


## Accuracy of Model: 99.462%

### Other Important Parameters pertaining to Predictions and Error Metrics:

Precision (P): Number of True Positives / (Number of True Positives + Number of False Positives)


Recall (R): Number of True Positives / (Number of True Positives + Number of False Negatives)

F1 Score = 2PR / (P + R)

In [67]:
P=489/(489+2)
print(P)

0.9959266802443992


In [68]:
R=489/(489+1)
print(R)

0.9979591836734694

In [69]:
F1=(2*P*R)/(P+R)
print(F1)

0.9969418960244649


## F1 Score of Model: 0.997 (ham as the positive class)
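Because the metrics above depend on which class is treated as positive, it can help to compute them for both classes. The sketch below does this directly from the confusion matrix printed in Step 7, assuming rows are true labels and columns are predicted labels in the order ham, spam. The helper `per_class_metrics` is a hypothetical name introduced for illustration.

```python
# Confusion matrix from Step 7: rows = true label, columns = predicted label,
# with classes in sorted order (ham, spam).
labels = ["ham", "spam"]
matrix = [[489, 1],
          [2, 66]]

def per_class_metrics(cm, positive):
    """Precision, recall and F1 when class index `positive` is the positive class."""
    tp = cm[positive][positive]
    # Predicted positive but actually another class.
    fp = sum(cm[row][positive] for row in range(len(cm))) - tp
    # Actually positive but predicted as another class.
    fn = sum(cm[positive]) - tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for index, name in enumerate(labels):
    p, r, f1 = per_class_metrics(matrix, index)
    print(f"{name}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
# ham: precision=0.9959 recall=0.9980 f1=0.9969
# spam: precision=0.9851 recall=0.9706 f1=0.9778
```

The ham figures match the F1 score computed above. The spam figures are lower because spam is the minority class in the test set (68 of 558 messages), so the same three errors weigh more heavily on it.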