<a href="https://colab.research.google.com/github/wasimkhan33/Spam-Classification-using-NLP/blob/main/spam_classifier_using_basic_nlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to NLP

Natural Language Processing (NLP) is a subset of Artificial Intelligence (AI) that helps computers understand, interpret and use human language. NLP allows applications to communicate with people in human language.
Whenever our data contains large chunks of text, we use NLP techniques to first clean that data and then feed it to the model. We will go through each step in detail, one by one.

**Steps in any basic NLP project**

1. Tokenization (breaking large texts down into smaller tokens, i.e. paragraphs into sentences and sentences into words)
2. Text Data Cleaning (removing punctuation and stopwords, converting to lower case, etc.)
3. Stemming or Lemmatization (reducing the different forms of a word to a common root so that variants are treated uniformly)
4. Vectorization (converting the remaining words to numeric vectors with techniques like Bag of Words)
5. Feeding those vectors to the model

In this notebook, let's go through all these points by implementing a basic spam classifier using NLP techniques.
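Step 1 (tokenization) is not shown as a separate cell later on, so here is a minimal sketch of the idea using only Python's `re` module. The sample text and the regular expressions are illustrative assumptions; NLTK offers `sent_tokenize` and `word_tokenize` for more robust splitting.

```python
import re

# Illustrative sample text (not taken from the dataset).
text = "NLP is fun. It helps computers read text! Does it work on SMS messages?"

# Paragraph -> sentences: split on whitespace that follows ., ! or ?
sentences = re.split(r'(?<=[.!?])\s+', text.strip())

# Sentence -> words: keep runs of letters, digits or apostrophes.
words = [re.findall(r"[A-Za-z0-9']+", sentence) for sentence in sentences]

print(sentences)
print(words[0])
```

This simple rule breaks on abbreviations such as "e.g.", which is one reason a library tokenizer is usually preferred on real text.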

**Importing Libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

**Importing Dataset**

In [None]:
df=pd.read_csv('../input/sms-spam-collection-dataset/spam.csv',encoding='ISO-8859-1')

In [None]:
df.head()

|   | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


First, let's drop the last three columns, as they are of no use. The v1 column is our target variable (whether a message is spam or ham) and the v2 column is the independent variable. We will also rename the v1 and v2 columns to label and message respectively.

In [None]:
df=df.drop(columns=['Unnamed: 2','Unnamed: 3','Unnamed: 4'])      #drop the three unused columns in one call

In [None]:
df.head()

|   | v1 | v2 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [None]:
df=df.rename(columns={'v1':'label','v2':'message'})
df.head()

|   | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [None]:
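import re                                      #regular expressions, used for text cleaning
import nltk
from nltk.corpus import stopwords              #list of common English stop words
from nltk.stem.porter import PorterStemmer     #stemmer (imported, but not used below)
from nltk.stem import WordNetLemmatizer        #lemmatizer used in this notebook

#if the NLTK data is not already installed, download it once:
#nltk.download('stopwords'); nltk.download('wordnet')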

In [None]:
lemmatizer=WordNetLemmatizer()              #lemmatizer object that will be applied to every message

In [None]:
corpus=[]

The next step is to remove punctuation and stopwords like 'a', 'the', 'is' etc., as these carry little information for the model.
We also convert everything to lower case, because otherwise the same word in different cases would be treated as different words.
Then we apply lemmatization to each message (each row) to reduce words to their dictionary root form.
For example, 'studies' becomes 'study'. Note that `WordNetLemmatizer` treats every word as a noun unless a part-of-speech tag is passed, so a verb form such as 'studying' is only reduced to 'study' when it is called with `pos='v'`. This is done to bring uniformity.
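The cleaning steps described above can be tried on a single message first. This is a minimal sketch using only the standard library; the tiny `STOP_WORDS` set and the helper name `clean_message` are hypothetical stand-ins for illustration (the notebook itself uses NLTK's English stop word list), and the sample text is the visible part of one message from the dataset preview.

```python
import re

# Tiny hand-picked stop word set, for illustration only.
STOP_WORDS = {"a", "the", "is", "in", "to"}

def clean_message(message):
    # Replace every non-letter with a space so neighbouring words stay separated.
    letters_only = re.sub(r'[^a-zA-Z]', ' ', message)
    tokens = letters_only.lower().split()
    return [token for token in tokens if token not in STOP_WORDS]

message = "Free entry in 2 a wkly comp to win FA Cup"
cleaned = clean_message(message)
print(cleaned)

# Replacing non-letters with an empty string would also delete the spaces
# and glue the whole message into a single token.
glued = re.sub(r'[^a-zA-Z]', '', message).lower().split()
print(glued)
```

The second print shows why the replacement string matters: with an empty replacement every message collapses into one long "word", and the later Bag of Words step has almost nothing to count.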

In [None]:
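stop_words=set(stopwords.words('english'))      #build the stop word set once instead of once per word
for i in range(0,len(df)):
    review=re.sub('[^a-zA-Z]',' ',df['message'][i])      #replace every non-letter with a space so words stay separated
    review=review.lower()
    review=review.split()
    review=[lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    review=' '.join(review)
    corpus.append(review)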

In [None]:
df.head()

|   | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


Now that we have cleaned the data, it's time to convert those words into vectors using the Bag of Words technique.

I have explained Bag of Words in detail elsewhere, and you can refer to that explanation for the logic behind the technique.
In short, Bag of Words builds a vocabulary from the whole corpus and then creates one vector per row of the message column, where each entry is the number of times the corresponding word occurs in that message.
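To make the counting concrete, here is a minimal plain-Python sketch of a Bag of Words on three made-up documents. The documents and the helper name `to_count_vector` are assumptions for illustration; this is not how `CountVectorizer` is implemented internally, only the same idea on a small scale.

```python
from collections import Counter

# Three made-up, already cleaned documents.
documents = ["free entry win cup", "ok lar joking", "win free prize free"]

# Vocabulary: every distinct word, sorted so each word gets a fixed column.
vocabulary = sorted({word for doc in documents for word in doc.split()})

def to_count_vector(doc):
    # Count the words in one document and read the counts off in vocabulary order.
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [to_count_vector(doc) for doc in documents]

print(vocabulary)
for row in vectors:
    print(row)
```

Each row has one column per vocabulary word, so word order inside a message is lost and only the counts remain. With `max_features=1000`, `CountVectorizer` keeps only the 1000 most frequent words as columns.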

In [None]:
from sklearn.feature_extraction.text import CountVectorizer 

In [None]:
cv=CountVectorizer(max_features=1000)         #CountVectorizer is the class that builds the BOW; keep only the 1000 most frequent words

In [None]:
X=cv.fit_transform(corpus).toarray()            #independent variable

In [None]:
y=pd.get_dummies(df['label'])               #one-hot (dummy) columns for the target: 'ham' and 'spam'
y=y.iloc[:,1].values                        #keep only the 'spam' column, so spam is the positive class

In [None]:
from sklearn.model_selection import train_test_split                      #Do the train test split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=0)

In [None]:
from sklearn.naive_bayes import MultinomialNB          #we will use the Multinomial Naive Bayes algorithm to predict the label

In [None]:
spam_detect_model=MultinomialNB().fit(X_train,y_train)
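To show what Multinomial Naive Bayes does with the count vectors, here is a simplified from-scratch sketch on a toy matrix. The function names, the four-word vocabulary and the data are all made up for illustration; it uses add-one (Laplace) smoothing, which matches the default `alpha=1.0` of `MultinomialNB`, but it is not scikit-learn's implementation.

```python
import math

def train_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log word probabilities per class."""
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each word over all documents of class c.
        totals = [sum(row[j] for row in rows) for j in range(n_features)]
        denom = sum(totals) + alpha * n_features
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in totals]
    return log_prior, log_likelihood

def predict(x, log_prior, log_likelihood):
    # Score = log prior + sum over words of count * log P(word | class).
    scores = {
        c: log_prior[c] + sum(n * lp for n, lp in zip(x, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy count vectors; columns are counts of "free", "win", "ok", "home".
X = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 1], [0, 0, 2, 1], [0, 1, 1, 2]]
y = [1, 1, 0, 0, 0]          # 1 = spam, 0 = ham

log_prior, log_likelihood = train_multinomial_nb(X, y)
pred_spam = predict([1, 1, 0, 0], log_prior, log_likelihood)
pred_ham = predict([0, 0, 1, 1], log_prior, log_likelihood)
print(pred_spam, pred_ham)
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.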

In [None]:
y_pred=spam_detect_model.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
print(accuracy_score(y_test,y_pred))

0.8556053811659193
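Accuracy alone can be misleading when one class dominates, and most messages in this dataset are ham. The sketch below uses made-up labels (not results from this notebook) to show how a model that always predicts ham can still score a high accuracy while catching no spam at all. The helper name `confusion_counts` is hypothetical; scikit-learn provides `confusion_matrix` and `classification_report` for the same purpose.

```python
def confusion_counts(y_true, y_pred):
    # Count true/false positives and negatives, with 1 = spam as the positive class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# Made-up labels: 8 ham messages and 2 spam messages.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A "model" that predicts ham for everything.
y_pred = [0] * len(y_true)

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0   # share of spam that was caught

print("accuracy:", accuracy)
print("spam recall:", recall)
```

Checking recall and precision for the spam class alongside accuracy gives a clearer picture of whether the classifier has actually learned to detect spam.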


This covers the basic flow of an NLP project, illustrated with a spam classifier.