**NLP Basic Applications**

SMS spam classification - Classify if an SMS is spam or ham based on the content in it

Sentiment Analysis of product reviews - Classify if a product review has a positive sentiment or a negative sentiment

Indian Railways FAQ chatbot - Automate the response to frequently asked questions on Indian Railways ticket booking

Topic modelling - Identify key topics in a document and group documents that belong to similar topics together

To build the SMS spam classifier, we have collected historical data in which each SMS message is marked as 'ham' (not spam) or 'spam'. You can download this data

In [1]:
import numpy as np
import pandas as pd

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
import os

In [4]:
dataset_path = '/content/drive/MyDrive/Datasets/spam.csv'

In [5]:
sms_data = pd.read_csv(dataset_path,encoding = 'latin-1')

In [6]:
# Review the loaded data
print(sms_data.head())

     v1                                                 v2 Unnamed: 2  \
0   ham  Go until jurong point, crazy.. Available only ...        NaN   
1   ham                      Ok lar... Joking wif u oni...        NaN   
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
3   ham  U dun say so early hor... U c already then say...        NaN   
4   ham  Nah I don't think he goes to usf, he lives aro...        NaN   

  Unnamed: 3 Unnamed: 4  
0        NaN        NaN  
1        NaN        NaN  
2        NaN        NaN  
3        NaN        NaN  
4        NaN        NaN  


In [7]:
cols = sms_data.columns[:2]
print(cols)

Index(['v1', 'v2'], dtype='object')


In [8]:
data = sms_data[cols]

In [9]:
print(data.shape)

(5572, 2)


In [10]:
data = data.rename(columns = { 'v1':'Value','v2':'Text'})

In [11]:
print(data.head())

  Value                                               Text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


In [12]:
print(data.Value.value_counts())

Value
ham     4825
spam     747
Name: count, dtype: int64


**Step 2: Feature Engineering**

Feature engineering includes the necessary data cleaning and normalizing operations, along with deriving new features from the message text.


In [13]:
from string import punctuation
import re
import nltk
from nltk import word_tokenize

In [14]:
punctuation = list(punctuation)  # punctuation means comma,fullstop,brackets

In [15]:
# Creating a new feature called Punctuations.
# This feature counts the number of punctuation characters in the sms message.
# Python's re module has no '&&' set intersection, so the class lists its exclusions
# explicitly: word characters, whitespace, '+', '&' and '^' are not counted.
data["Punctuations"] = data["Text"].apply(lambda x: len(re.findall(r"[^\w\s+&^]", x)))


In [16]:
# Creating a new feature called Phonenumbers.
# This feature counts the 10-digit sequences (likely phone numbers) in the sms text
data["Phonenumbers"] = data["Text"].apply(lambda x: len(re.findall(r"[0-9]{10}", x)))

In [17]:
# Creating a new feature called Links.
# This feature is 1 if the sms text contains an http(s) URL, else 0
is_link = lambda x: 1 if re.search(r"https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+", x) is not None else 0
data["Links"] = data["Text"].apply(is_link)
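To see what the three pattern-based features capture, here is a minimal sketch that applies them to a single message. The sample text and the helper name `extract_features` are hypothetical and not part of the dataset or the notebook. The punctuation pattern is written as `[^\w\s+&^]`, which in Python's `re` module matches the same characters as the pattern used in the cell above.

```python
import re

# Hypothetical sample message, not taken from the dataset
sample = "Call 0906170146 now! Visit http://example.com (T&C apply)"

# Punctuation: anything that is not a word character, whitespace, '+', '&' or '^'
punct_pattern = re.compile(r"[^\w\s+&^]")
# Phone numbers: non-overlapping runs of ten consecutive digits
phone_pattern = re.compile(r"[0-9]{10}")
# Links: http:// or https:// followed by host-like characters
link_pattern = re.compile(r"https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+")


def extract_features(text):
    """Return the three hand-crafted features for one message."""
    return {
        "Punctuations": len(punct_pattern.findall(text)),
        "Phonenumbers": len(phone_pattern.findall(text)),
        "Links": 1 if link_pattern.search(text) is not None else 0,
    }


features = extract_features(sample)
print(features)
```

Note that the phone-number pattern counts any run of ten digits, so a longer number still counts once and an order id of ten digits would also be counted.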

In [18]:
import nltk
nltk.download('punkt_tab')
nltk.download('words')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [19]:
# Identifying and counting how many unusual words are there in the sms text.
# The English vocabulary is built once, outside the function, because rebuilding
# it for every message is very slow.
english_vocab_set = set(w.lower() for w in nltk.corpus.words.words())

def find_unusual_words(text):
    # text is a list of tokens; keep only alphabetic tokens, lowercased
    text_vocab_set = set(w.lower() for w in text if w.isalpha())
    unusual_set = text_vocab_set - english_vocab_set
    return len(unusual_set)

data["unusualwords"] = data["Text"].apply(lambda x: find_unusual_words(word_tokenize(x)))
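The idea behind the unusual-words feature can be tried on a toy example without downloading any corpus. In this sketch, `ENGLISH_VOCAB` is a tiny hypothetical stand-in for `nltk.corpus.words.words()`, and a simple regular expression replaces `word_tokenize`; both are assumptions made only to keep the example self-contained.

```python
import re

# Hypothetical stand-in for the NLTK word list; the real list is far larger
ENGLISH_VOCAB = {"free", "entry", "to", "win", "a", "prize", "text", "now"}


def count_unusual_words(text, vocab=ENGLISH_VOCAB):
    """Count distinct alphabetic tokens that are missing from the vocabulary."""
    # Simple tokenizer (an assumption): runs of word characters
    tokens = re.findall(r"\w+", text)
    # Keep purely alphabetic tokens, lowercased, as a set of distinct words
    text_vocab = {t.lower() for t in tokens if t.isalpha()}
    return len(text_vocab - vocab)


message = "Free entry 2 win a prize txt NOW to 87121 wkly"
print(count_unusual_words(message))
```

Here only the abbreviations `txt` and `wkly` fall outside the vocabulary; digits are ignored because they are not alphabetic.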

In [23]:
# View a few records of the data after creating these features
print(data[14:25])

   Value                                               Text  Punctuations  \
14   ham                I HAVE A DATE ON SUNDAY WITH WILL!!             2   
15  spam  XXXMobileMovieClub: To use your credit, click ...            11   
16   ham                         Oh k...i'm watching here:)             6   
17   ham  Eh u remember how 2 spell his name... Yes i di...             5   
18   ham  Fine if thatåÕs the way u feel. ThatåÕs the wa...             1   
19  spam  England v Macedonia - dont miss the goals/team...             7   
20   ham          Is that seriously how you spell his name?             1   
21   ham  IÛ÷m going to try for 2 months ha ha only joking             2   
22   ham  So Ì_ pay first lar... Then when is da stock c...             6   
23   ham  Aft i finish my lunch then i go str down lor. ...             3   
24   ham  Ffffffffff. Alright no way I can meet up with ...             2   

    Phonenumbers  Links  unusualwords  
14             0      0            

In [24]:
print(data)

     Value                                               Text  Punctuations  \
0      ham  Go until jurong point, crazy.. Available only ...             9   
1      ham                      Ok lar... Joking wif u oni...             6   
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...             5   
3      ham  U dun say so early hor... U c already then say...             6   
4      ham  Nah I don't think he goes to usf, he lives aro...             2   
...    ...                                                ...           ...   
5567  spam  This is the 2nd time we have tried 2 contact u...             9   
5568   ham              Will Ì_ b going to esplanade fr home?             1   
5569   ham  Pity, * was in mood for that. So...any other s...             7   
5570   ham  The guy did some bitching but I acted like i'd...             1   
5571   ham                         Rofl. Its true to its name             1   

      Phonenumbers  Links  unusualwords  
0        

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf= TfidfVectorizer(stop_words="english",strip_accents='ascii',max_features=300)
tf_idf_matrix = tf_idf.fit_transform(data["Text"])

In [26]:
data_extra_features = pd.concat([data,pd.DataFrame(tf_idf_matrix.toarray(),columns=tf_idf.get_feature_names_out())],axis=1)
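To make the TF-IDF weights less of a black box, the sketch below computes them by hand for a toy corpus. It is intended to mirror scikit-learn's default settings as we understand them (raw term counts, smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row). The stop-word removal, accent stripping and `max_features` limit used in the notebook are left out, and the documents are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus
docs = [
    "free prize call now",
    "call me now",
    "free free entry",
]


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf(documents):
    tokenized = [tokenize(d) for d in documents]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocab]
        # L2-normalise each document vector
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


vocab, matrix = tfidf(docs)
print(vocab)
for row in matrix:
    print([round(w, 3) for w in row])
```

Terms that appear in fewer documents get a larger weight: in the first document, `prize` (one document) is weighted higher than `call` (two documents).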

In [27]:
print(data_extra_features)

     Value                                               Text  Punctuations  \
0      ham  Go until jurong point, crazy.. Available only ...             9   
1      ham                      Ok lar... Joking wif u oni...             6   
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...             5   
3      ham  U dun say so early hor... U c already then say...             6   
4      ham  Nah I don't think he goes to usf, he lives aro...             2   
...    ...                                                ...           ...   
5567  spam  This is the 2nd time we have tried 2 contact u...             9   
5568   ham              Will Ì_ b going to esplanade fr home?             1   
5569   ham  Pity, * was in mood for that. So...any other s...             7   
5570   ham  The guy did some bitching but I acted like i'd...             1   
5571   ham                         Rofl. Its true to its name             1   

      Phonenumbers  Links  unusualwords  000   10  

**Step 3: Machine Learning**


Now that we have converted our textual data into a numeric form that machine learning algorithms can use, let us go ahead and apply machine learning. Here we aim to train the machine to recognize which messages are 'spam' and which are 'ham'.

Recall that before we build a machine learning model, we split the data into train and test sets. The model is trained on the train set and evaluated against the test set.
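Because 'ham' messages heavily outnumber 'spam' messages in this data, it can help to keep the class proportions the same in both sets and to fix the random seed so the split is repeatable. The following is a minimal pure-Python sketch of such a stratified split; the function name `stratified_split`, the toy labels and the seed are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=42):
    """Return (train_indices, test_indices) that preserve class proportions."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with a ham/spam imbalance (hypothetical counts)
labels = ["ham"] * 80 + ["spam"] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] == "spam" for i in test_idx), "spam messages in the test set")
```

With scikit-learn, passing the `stratify` and `random_state` arguments to `train_test_split` should give a comparable effect.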

In [28]:
from sklearn.model_selection import train_test_split

In [31]:
X = data_extra_features
# Every column except the label and the raw text is used as a feature
features = X.columns.drop(["Value", "Text", "Value_num"], errors='ignore')
# Select the target as a 1-D Series so the classifiers do not warn about a column-vector y
target = "Value"
X_train, X_test, y_train, y_test = train_test_split(X[features], X[target])

In [33]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In [34]:
dt = DecisionTreeClassifier(min_samples_split=40)

In [35]:
dt.fit(X_train,y_train)

In [43]:
pred = dt.predict(X_test)
print("accuracy score",accuracy_score(y_test,pred))

accuracy score 0.9633883704235463


In [39]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [40]:
mnb = MultinomialNB()

In [41]:
mnb.fit(X_train,y_train)



In [42]:
pred_mnb = mnb.predict(X_test)
print(accuracy_score(y_test, pred_mnb))

0.9777458722182341


In [44]:
lr = LogisticRegression()

In [45]:
lr.fit(X_train,y_train)



In [46]:
pred_lr = lr.predict(X_test)

In [47]:
print("accuracy Score",accuracy_score(y_test,pred_lr))

accuracy Score 0.9748743718592965
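Accuracy alone can look flattering on imbalanced data. Using the class counts printed earlier (4825 ham, 747 spam), a model that always predicts 'ham' is already right most of the time, so it is worth looking at precision and recall for the 'spam' class as well. The sketch below shows the calculation on hypothetical toy predictions; `classification_summary` is an illustrative helper, not a scikit-learn function.

```python
def classification_summary(y_true, y_pred, positive="spam"):
    """Return accuracy, precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        # Report 0.0 when nothing was predicted (or labelled) as positive
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Accuracy of always predicting 'ham' on the full dataset
ham_count, spam_count = 4825, 747
baseline_accuracy = ham_count / (ham_count + spam_count)
print("always-ham baseline accuracy:", round(baseline_accuracy, 4))

# Hypothetical toy predictions: every message is predicted as 'ham'
y_true = ["ham"] * 8 + ["spam"] * 2
y_pred = ["ham"] * 10
summary = classification_summary(y_true, y_pred)
print(summary)
```

In the toy case the accuracy is high while the spam recall is zero, which is why the accuracy scores above are best read against the always-ham baseline.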
