# **NLP using Naive Bayes - Spam Classification Problem**

**Import Required Libraries**

In [1]:
import pandas as pd
import numpy as np

**Data Ingestion**

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/subhashdixit/NLP/main/Spam_Classification/SMSSpamCollection.csv', sep="\t", header = None, names = ['label', 'messages'])

In [3]:
df

| | label | messages |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


* Structured data: rows and columns
* Unstructured data: text
* NLP with deep learning:
  * RNN
  * LSTM/GRU
  * Encoder/Decoder
  * Transformer (self-attention)
  * BART
  * GPT
  * Example tasks:
    * Text classification
    * Summarization
    * Sentiment analysis

* NLP with ML:
  * Naive Bayes
  * SVM

**Import nltk and re**

In [4]:
import nltk 
import re

**Download stopwords (common words that carry little signal for this problem)**

In [5]:
# Stopwords add little meaning for classification, so we remove them later
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [6]:
from nltk.corpus import stopwords

**Import PorterStemmer to apply stemming**

In [7]:
from nltk.stem.porter import PorterStemmer

In [8]:
ps = PorterStemmer()

Stemming: reducing a word to its root by stripping suffixes
* Like:
  * Likes
  * Likely
  * Liked
* Love:
  * Lovely
  * Loves
  * Loved
  * Loving

Lemmatization: reducing a word to its dictionary form (lemma) using context
* Like:
  * Likes
  * Likely
  * Liked
* Stemming can return a wrong or non-dictionary root because it only strips suffixes by rule, whereas lemmatization takes context into consideration.
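
A minimal sketch of what the Porter stemmer does to words like the ones listed above, assuming NLTK is installed. The word list is illustrative and not taken from the dataset. Lemmatization is not run here; NLTK's `WordNetLemmatizer` would be one option, but it needs the `wordnet` corpus to be downloaded first.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# Illustrative words (not taken from the dataset)
words = ["likes", "liked", "loves", "loved", "loving", "studies"]

# Map each word to its stem; the stemmer only applies suffix-stripping rules
stems = {word: ps.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word:>8} -> {stem}")
```

Inflected forms of the same verb collapse to one stem, which shrinks the vocabulary. A stem is not guaranteed to be a dictionary word ("studies" becomes "studi"), which is the trade-off against lemmatization.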

## **Data Preprocessing**

In [9]:
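# Build the stopword set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
corpus = []
for message in df['messages']:
  # Replace everything except letters with a space, then lowercase and tokenize
  # (A-Z, not A-z: the latter would also keep [ \ ] ^ _ and the backtick)
  review = re.sub("[^a-zA-Z]", " ", message).lower().split()
  # Drop stopwords and stem the remaining words
  review = [ps.stem(word) for word in review if word not in stop_words]
  corpus.append(' '.join(review))
# corpus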
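
One detail of the cleaning regex is worth checking. Inside a character class, the range `A-z` runs from ASCII 65 to 122 and so also covers the six punctuation characters that sit between `Z` and `a`, while `A-Z` covers uppercase letters only. A small sketch with a made-up string shows the difference:

```python
import re

# Made-up sample text containing characters that sit between 'Z' and 'a' in ASCII
text = "win_cash [now] ^free^"

# 'A-z' spans ASCII 65-122, so [ \ ] ^ _ ` count as allowed characters and are kept
loose = re.sub("[^a-zA-z]", " ", text).split()

# 'A-Z' keeps letters only, so the punctuation is replaced by spaces
strict = re.sub("[^a-zA-Z]", " ", text).split()

print(loose)   # ['win_cash', '[now]', '^free^']
print(strict)  # ['win', 'cash', 'now', 'free']
```

Using the same pattern at training time and at prediction time keeps the two vocabularies consistent.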

## **Feature Extraction**

In [10]:
# Create the bag-of-words model
from sklearn.feature_extraction.text import CountVectorizer
# Build a vocabulary of unique words, then turn each message into a vector of word counts
# max_features=2500 keeps only the 2500 most frequent words as columns
cv = CountVectorizer(max_features=2500)
X = cv.fit_transform(corpus).toarray()

In [11]:
X.shape

(5572, 2500)
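
To make the bag-of-words step concrete, here is a small sketch on a made-up, already-cleaned corpus. The names `toy_corpus` and `toy_cv` are illustrative, and `max_features=3` stands in for the 2500 used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus of already-cleaned messages
toy_corpus = ["free entri win cup", "ok joke", "free cup win win"]

# Keep only the 3 most frequent words across the whole corpus
toy_cv = CountVectorizer(max_features=3)
toy_X = toy_cv.fit_transform(toy_corpus).toarray()

# vocabulary_ maps each kept word to its column index
feature_names = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)

print(feature_names)  # ['cup', 'free', 'win']
print(toy_X)          # rows: [1 1 1], [0 0 0], [1 1 2]
```

The second message becomes an all-zero row because none of its words made the cut. That is the information `max_features` trades away for a smaller matrix.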

In [12]:
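# Encode labels as a 1-D series (1 = spam, 0 = ham); a 1-D target avoids scikit-learn's column-vector warning
y = pd.get_dummies(df['label'], drop_first = True)['spam'].astype(int)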

## **Train Test Split**

In [13]:
from sklearn.model_selection import train_test_split

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y , test_size = 0.25, random_state = 1)

## **Model Creation**

In [15]:
# Train the model using a multinomial Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB().fit(X_train, y_train)


In [16]:
y_pred = model.predict(X_test)
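
For reference, this is the decision rule behind `MultinomialNB`, written out with scikit-learn's default smoothing of $\alpha = 1$. The symbols are introduced here for illustration: $x_i$ is the count of word $i$ in a message, $N_{ci}$ is the total count of word $i$ across training messages of class $c$, $N_c = \sum_i N_{ci}$, and $n$ is the vocabulary size (2500 here).

$$
\hat{y} = \arg\max_{c \in \{\text{ham},\, \text{spam}\}} \left[ \log P(c) + \sum_{i=1}^{n} x_i \log \hat{\theta}_{ci} \right],
\qquad
\hat{\theta}_{ci} = \frac{N_{ci} + \alpha}{N_c + \alpha n}
$$

The smoothing term keeps a word that never appeared in one class from driving that class's probability to zero.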

In [17]:
from sklearn.metrics import accuracy_score

In [18]:
accuracy_score(y_test, y_pred)

0.9877961234745154

In [19]:
accuracy_score(y_train, model.predict(X_train))

0.988992581957406
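
Accuracy can look high even when a classifier misses many spam messages, if ham heavily outnumbers spam. A confusion matrix with precision and recall gives a fuller picture. The sketch below uses made-up labels (1 = spam, 0 = ham) rather than the notebook's `y_test` and `y_pred`, so the numbers are illustrative only.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration: 1 = spam, 0 = ham
toy_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

precision = precision_score(toy_true, toy_pred)  # TP / (TP + FP)
recall = recall_score(toy_true, toy_pred)        # TP / (TP + FN)

print(tn, fp, fn, tp)
print(round(precision, 3), round(recall, 3))
```

Here accuracy is 0.7, yet only half of the spam messages are caught. On the notebook's data the same calls would take `y_test` and `y_pred`.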

## **Sample Prediction**

In [20]:
def Predict_Spam(sample_review):
    """Return the predicted label for sample_review (a truthy value means spam)."""
    # Clean the text: keep letters only, lowercase, and tokenize
    words = re.sub('[^a-zA-Z]', ' ', sample_review).lower().split()
    # Remove stopwords and stem each remaining word (reuses the stemmer ps created earlier)
    stop_words = set(stopwords.words('english'))
    words = [ps.stem(word) for word in words if word not in stop_words]
    # Vectorize with the fitted CountVectorizer and classify
    temp = cv.transform([' '.join(words)]).toarray()
    return model.predict(temp)[0]
    

In [21]:
if (Predict_Spam("Free entry in 2 a wkly comp to win FA Cup fin")):
  print("It is a Spam!")
else:
  print("It is not a Spam!")

It is a Spam!


## **Save Model**

In [22]:
import pickle
# Use a context manager so the file is closed after writing
with open("model_random.pkl", "wb") as f:
    pickle.dump(model, f)
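
A saved classifier alone cannot score raw text, because prediction also needs the fitted `CountVectorizer`. One option, sketched here on a made-up toy corpus, is to pickle both objects together and reload them as a pair. The file name `spam_pipeline.pkl` and the `toy_` names are hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data: 1 = spam, 0 = ham
toy_corpus = ["free win prize", "free cash win", "see you at lunch", "call me later"]
toy_labels = [1, 1, 0, 0]

toy_cv = CountVectorizer()
toy_model = MultinomialNB().fit(toy_cv.fit_transform(toy_corpus), toy_labels)

# Save the vectorizer and the model together (hypothetical file name)
path = os.path.join(tempfile.gettempdir(), "spam_pipeline.pkl")
with open(path, "wb") as f:
    pickle.dump({"vectorizer": toy_cv, "model": toy_model}, f)

# Reload both and classify a new message
with open(path, "rb") as f:
    bundle = pickle.load(f)

features = bundle["vectorizer"].transform(["win free cash"])
prediction = bundle["model"].predict(features)[0]
print(prediction)  # 1 (spam)
```

Pickle files can execute code when loaded, so only unpickle files from a source you trust.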