#### Title : Prepare a Bag of Words and use it to perform spam filtering on the provided text.

#### Description : Students are required to analyse the provided text, then tokenize and lemmatize it. The next step is to build a Bag of Words representation and apply a spam filtering method to it.

##### Objective: Become familiar with NLP operations using NLTK and their application in semantic analysis.

##### Domain : Natural Language Processing.

Steps to be taken:

1) Perform cleaning and tokenization.

2) Perform lemmatization after POS (part-of-speech) tagging to avoid errors.

3) Prepare a Bag of Words and perform spam filtering on a database of emails.
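Step 1 can be sketched with the standard library alone. The helper name clean_and_tokenize and its regular expression are assumptions made for illustration, not the code this notebook uses; NLTK's word_tokenize is a common alternative.

```python
import re

def clean_and_tokenize(text):
    """Lowercase the text and return its alphabetic tokens.

    Hypothetical helper for illustration: digits and punctuation are
    dropped, apostrophes inside a word are kept (e.g. "don't").
    """
    text = text.lower()
    # A token is a run of letters, optionally with internal apostrophes.
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text)

tokens = clean_and_tokenize("Free entry in 2 a wkly comp... Don't miss it!")
print(tokens)
```

Keeping the apostrophe inside tokens leaves contractions intact so that a later step can expand them.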

### Importing Libraries

In [1]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt, re
import nltk

#### Reading text file which contains the spam text

In [2]:
with open(r'C:\Users\utkar\OneDrive\Desktop\Machine Learning\SpamCollection.txt') as f:
    text = f.read()  # the file is closed automatically when the block ends

Reading the tab-separated text file into a pandas DataFrame so that the spam/non-spam label and the message text end up in separate columns.

In [3]:
df = pd.read_csv(r'C:\Users\utkar\OneDrive\Desktop\Machine Learning\SpamCollection.txt',sep='\t',header=None,names=['label','text'])
df.head()

| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


Converting the ham and spam values of the label column to numbers: 0 for ham (not spam) and 1 for spam.

In [4]:
df['label'] = df['label'].apply(lambda x:0 if x == "ham" else 1)

In [5]:
df.head()

| | label | text |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [6]:
df.shape

(5572, 2)

Importing two Python files which we created to simplify our task:

nlp_tools contains a function for lemmatization

contractions contains functions for expanding short forms into complete words, e.g. don't → do not

In [7]:
import nlp_tools
import contractions

Here we create a new column, clean_text, by applying the function from the contractions file to the text and then the function from the nlp_tools file to the result.

In [8]:
df['clean_text'] = df['text'].apply(contractions.expand_contraction)
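The contractions file itself is not shown in this notebook. A minimal sketch of how such an expansion step could work is given below; the lookup table and the name expand_contractions_sketch are hypothetical and cover only a handful of forms.

```python
import re

# Tiny illustrative mapping; a real table would cover many more forms.
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
    "i'm": "i am",
    "it's": "it is",
}

# One alternation that matches any known contraction as a whole word.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions_sketch(text):
    """Replace each known contraction with its (lowercase) expanded form."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions_sketch("Nah I don't think he goes to usf"))
```

Expanding contractions before vectorizing means that "don't" and "do not" contribute to the same word counts.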


In [9]:
df['clean_text'] = df['clean_text'].apply(nlp_tools.lemmatization_sentence)
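The nlp_tools file is not shown either. In a typical NLTK setup, a POS tagger returns Penn Treebank tags, while WordNetLemmatizer.lemmatize(word, pos) expects a one-letter WordNet code and treats every word as a noun when no code is given (so "running" stays "running" unless it is passed as a verb). The sketch below shows only the tag mapping that connects the two; the function name and the hand-written tags are assumptions.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to the single-letter WordNet POS code."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also the fallback for every other tag

# Hand-written (token, tag) pairs standing in for a POS tagger's output.
tagged = [("he", "PRP"), ("lives", "VBZ"), ("around", "RB"), ("here", "RB")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```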

##### Now we will create a list of the clean_text column and use CountVectorizer

In [10]:
spam_filter = df['clean_text'].tolist()

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

In [12]:
cv = CountVectorizer()

In [13]:
cv.fit(spam_filter)

CountVectorizer()

In [14]:
X = cv.transform(spam_filter).toarray()

In [15]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
len(cv.get_feature_names_out())

7753
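To make the Bag of Words idea concrete, here is a small standard-library sketch of what fitting and transforming amount to: every distinct token gets a column, and every document becomes a row of counts. The function names and the toy documents are illustrative; CountVectorizer additionally lowercases the text and, by default, ignores one-character tokens.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Assign a column index to every distinct token, in sorted order."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(vocab)}

def transform(docs, vocab):
    """Turn each document into a list of token counts, one per vocabulary entry."""
    rows = []
    for doc in docs:
        # Tokens that are not in the vocabulary are ignored.
        counts = Counter(tok for tok in doc.split() if tok in vocab)
        rows.append([counts[tok] for tok in vocab])
    return rows

docs = ["free entry free prize", "ok see you later", "free prize call now"]
vocab = fit_vocabulary(docs)
matrix = transform(docs, vocab)
print(list(vocab))
print(matrix)
```

The number of columns equals the vocabulary size, which is the quantity reported as 7753 for the full dataset.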

In [16]:
y = df['label'].values

Using the train_test_split function to split the data into training and test sets. With the default settings, 25% of the rows are held out for testing.

In [17]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X,y)
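The split is random and unseeded here, so rerunning the notebook gives a slightly different report each time. The sketch below illustrates a reproducible shuffle-and-split on row indices; split_indices and the seed value are assumptions for illustration. In scikit-learn the same effect comes from passing random_state to train_test_split, and stratify=y keeps the spam/ham ratio equal in both parts.

```python
import math
import random

def split_indices(n_samples, test_size=0.25, seed=42):
    """Shuffle row indices reproducibly and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed, same order
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(5572)
print(len(train_idx), len(test_idx))  # 4179 1393
```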

We will use the multinomial Naive Bayes algorithm to classify the text as spam or not spam.

In [18]:
from sklearn.naive_bayes import MultinomialNB


In [19]:
model = MultinomialNB()

In [20]:
model.fit(x_train,y_train)

MultinomialNB()
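What the fitted model estimates can be sketched in plain Python: a log prior per class and, for every token, a smoothed log likelihood per class. This is a simplified illustration on toy data, not scikit-learn's implementation; train_nb and predict_nb are hypothetical names, and alpha=1.0 matches the default smoothing of MultinomialNB.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods from tokenized docs."""
    vocab = {tok for doc in docs for tok in doc}
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {t: math.log((token_counts[c][t] + alpha) / total) for t in vocab}
    return log_prior, log_like

def predict_nb(doc, log_prior, log_like):
    """Return the class with the highest log score; unseen tokens are skipped."""
    scores = {
        c: log_prior[c] + sum(log_like[c][t] for t in doc if t in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

docs = [["free", "prize", "call"], ["win", "free", "cash"],
        ["see", "you", "later"], ["call", "me", "later"]]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham (toy data)
log_prior, log_like = train_nb(docs, labels)
print(predict_nb(["free", "cash", "prize"], log_prior, log_like))
print(predict_nb(["see", "you", "later"], log_prior, log_like))
```

Smoothing keeps a token that never appeared in one class from forcing that class's probability to zero.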

In [21]:
y_pred = model.predict(x_test)

In [22]:
from sklearn import metrics
cr = metrics.classification_report(y_test,y_pred)
print(cr)

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      1220
           1       0.90      0.92      0.91       173

    accuracy                           0.98      1393
   macro avg       0.94      0.96      0.95      1393
weighted avg       0.98      0.98      0.98      1393
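Because only 173 of the 1393 test messages are spam, a model that labelled everything as ham would already reach about 88% accuracy (1220/1393), so the per-class precision and recall deserve attention. The sketch below shows how these numbers are computed for one class; the labels are toy values, not this notebook's predictions.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(precision_recall_f1(y_true, y_pred))
```

Precision answers "how many flagged messages were really spam", recall answers "how much of the spam was caught".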



### The overall accuracy we achieve with the Naive Bayes algorithm is 98 percent; for the spam class, precision is 0.90 and recall is 0.92.

In [23]:
testspam = "date wed NUMBER aug NUMBER NUMBER NUMBER NUMBER NUMBER from chris garrigues cwg dated NUMBER NUMBER"

In [25]:

clean_spam = contractions.expand_contraction(testspam)

lemma_spam = nlp_tools.lemmatization_sentence(clean_spam)
vector_spam = cv.transform([lemma_spam]).toarray()
spam = model.predict(vector_spam)
prob = model.predict_proba(vector_spam)

In [26]:
spam

array([1], dtype=int64)

In [27]:
prob

array([[0.00983319, 0.99016681]])
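The columns of predict_proba follow the order of model.classes_, so here the second value is the spam probability. With two classes, predict is equivalent to a 0.5 cut-off on that value. The sketch below makes the cut-off explicit; label_from_proba and the stricter threshold are hypothetical and only illustrate how a filter could be made more conservative about flagging messages.

```python
def label_from_proba(proba_row, classes=(0, 1), spam_threshold=0.5):
    """Return 1 (spam) only if the spam probability reaches the threshold."""
    p_spam = proba_row[classes.index(1)]
    return 1 if p_spam >= spam_threshold else 0

row = [0.00983319, 0.99016681]  # probabilities shown in the output above
print(label_from_proba(row))
print(label_from_proba(row, spam_threshold=0.995))
```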

### The model labels the test text as spam (class 1) with a predicted probability of about 99%, which agrees with our reading of the text as spam. Both the held-out test set and this single test text give good results.