# NLP and Machine Learning

Steps:

    1) Collecting textual data
    2) Cleaning the input data
    3) Normalizing the data
        - Removing stop words
        - Converting all text to a standard case
        - Stemming and lemmatization
    4) Feature engineering
        - Vectorization
    5) Applying machine learning techniques to build models
        - Spam classifier / sentiment analyzer: Naive Bayes, decision trees
        - Auto-completing user input: hidden Markov model
    6) Validating the model
    7) Deploying the model and making predictions on new data
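
The normalization step (lowercasing, stop-word removal, stemming) can be sketched in plain Python. The stop-word set and the suffix list below are tiny hypothetical stand-ins chosen for illustration; a real pipeline would more likely use NLTK's stop-word corpus and a proper stemmer or lemmatizer.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption, not NLTK's list).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in"}

# Crude suffix stripping used as a stand-in for stemming (not the Porter algorithm).
SUFFIXES = ("ing", "ed", "s")


def crude_stem(word):
    """Strip one common suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, tokenize on letters/digits, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(normalize("The players are WINNING free tickets to the finals"))
```

The crude stemmer turns "winning" into "winn", which is exactly the kind of rough edge a real stemmer or lemmatizer is meant to handle.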

### SMS Spam classifier using Decision Trees, Logistic Regression and Naive Bayes

Step 1: Collecting data

In [7]:
import numpy as np
import pandas as pd

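# Load the raw SMS data
sms_data=pd.read_csv("spam.csv",encoding='latin-1')
print(sms_data.head())
# Keep only the first two columns: the label (v1) and the message text (v2)
cols= sms_data.columns[:2]
data=sms_data[cols]
print(data.shape)
# Give the columns descriptive names
data=data.rename(columns={'v1':'value','v2':'text'})
print(data.head())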


     v1                                                 v2 Unnamed: 2  \
0   ham  Go until jurong point, crazy.. Available only ...        NaN   
1   ham                      Ok lar... Joking wif u oni...        NaN   
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
3   ham  U dun say so early hor... U c already then say...        NaN   
4   ham  Nah I don't think he goes to usf, he lives aro...        NaN   

  Unnamed: 3 Unnamed: 4  
0        NaN        NaN  
1        NaN        NaN  
2        NaN        NaN  
3        NaN        NaN  
4        NaN        NaN  
(5572, 2)
  value                                               text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


Step 2: Cleaning, normalizing and feature engineering

In [63]:
import re
import nltk
from nltk import word_tokenize

# New feature 'punctuations': number of punctuation characters in the text,
# i.e. characters that are neither word characters nor whitespace.
# Python's re module has no '&&' class intersection, so the class is written as [^\w\s].
data['punctuations']=data.text.apply(lambda x:len(re.findall(r'[^\w\s]',x)))

# New feature 'phonenumbers': number of 10-digit runs found in the text
data['phonenumbers']=data.text.apply(lambda x:len(re.findall(r'[0-9]{10}',x)))

# New feature 'links': 1 if the SMS text contains a URL, otherwise 0
data['links']=data.text.apply(lambda x: 1 if re.search(r"https?://(?:[\w.]|(?:%[\da-fA-F]{2}))+",x) is not None else 0)

# New feature 'uppercase': number of words written entirely in uppercase
count_upper=lambda x: sum(map(str.isupper,x.split()))
data['uppercase']=data.text.apply(count_upper)

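# New feature 'unusualwords': number of distinct alphabetic tokens that are not in NLTK's English word list.
# Requires the NLTK 'words' corpus and tokenizer data ('punkt' or 'punkt_tab', depending on the NLTK version).
# The vocabulary is built once here; rebuilding it for every message is very slow.
english_vocab_set=set(w.lower() for w in nltk.corpus.words.words())

def unusual_words(tokens):
    words=set(w.lower() for w in tokens if w.isalpha())
    return len(words-english_vocab_set)

data['unusualwords']=data.text.apply(lambda x: unusual_words(word_tokenize(x)))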
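
To see what the hand-crafted features measure, here is a small self-contained sketch that applies the same kind of regular expressions to a single made-up message. The sample text and the helper names are hypothetical; punctuation is counted with `[^\w\s]`, meaning any character that is neither a word character nor whitespace.

```python
import re

URL_PATTERN = r"https?://(?:[\w.]|(?:%[\da-fA-F]{2}))+"


def extract_features(text):
    """Return the hand-crafted counts for one message."""
    return {
        # Characters that are neither word characters nor whitespace
        "punctuations": len(re.findall(r"[^\w\s]", text)),
        # Runs of ten consecutive digits
        "phonenumbers": len(re.findall(r"[0-9]{10}", text)),
        # 1 if a URL is present, otherwise 0
        "links": 1 if re.search(URL_PATTERN, text) is not None else 0,
        # Whitespace-separated words written entirely in uppercase
        "uppercase": sum(word.isupper() for word in text.split()),
    }


# Hypothetical message, made up for this example
sample = "FREE entry! Call 0800298603 now or visit http://example.com & WIN"
print(extract_features(sample))
```

Note that `str.isupper` is false for tokens without letters, so the phone number and the `&` do not count as uppercase words.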

In [64]:
# Convert text into TF-IDF matrix. Recall that the TF-IDF matrix is a numeric representation of the text
from sklearn.feature_extraction.text import TfidfVectorizer as tfidf
tf_idf=tfidf(stop_words='english',strip_accents='ascii',max_features=300)
tf_idf_matrix=tf_idf.fit_transform(data.text)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=300,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words='english', strip_accents='ascii',
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)
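
To make "numeric representation of the text" concrete, the sketch below computes TF-IDF by hand for three made-up documents. It assumes the settings visible in the output above (`smooth_idf=True`, `norm='l2'`, raw term counts), for which the weight of a term is its count times `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. Splitting on whitespace is a simplification of the vectorizer's own tokenization.

```python
import math
from collections import Counter

# Hypothetical toy corpus
docs = [
    "free prize call now",
    "call me later",
    "free free prize",
]

tokenized = [doc.split() for doc in docs]
vocabulary = sorted({term for tokens in tokenized for term in tokens})
n_docs = len(docs)

# Document frequency: number of documents in which each term appears
df = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}


def tfidf_row(tokens):
    """Weight raw counts by IDF, then scale the row to unit L2 norm."""
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


matrix = [tfidf_row(tokens) for tokens in tokenized]
for term, column in zip(vocabulary, zip(*matrix)):
    print(f"{term:>6}", [round(v, 3) for v in column])
```

Terms that occur in fewer documents ("later", "me", "now") get a larger IDF than terms shared across documents ("call", "free", "prize"), so they weigh more in the rows where they appear.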

In [69]:
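# Turn the sparse TF-IDF matrix into a DataFrame with one column per term.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
data_features=pd.DataFrame(data=tf_idf_matrix.toarray(),columns=tf_idf.get_feature_names_out())
data_features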

Unnamed: 0,000,10,150p,150ppm,16,18,1st,2nd,50,a1,...,world,www,xmas,xxx,ya,yeah,year,yes,yo,yup
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.594379,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.372953,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5568,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5569,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5570,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Step 3: Machine learning

Recall that before building a machine learning model, we split the data into train and test sets. The model is learnt on the train set and validated against the test set.

In [82]:
from sklearn.model_selection import train_test_split
x=data_features
y=data.value
x_train,x_test,y_train,y_test=train_test_split(x,y)
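
Called with defaults, `train_test_split` holds out 25% of the rows and shuffles them differently on every run, so accuracy figures vary slightly between runs. A minimal sketch of a reproducible, class-balanced variant is shown below; the toy data and the choice of `random_state=0` are assumptions made for illustration only.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 ham and 4 spam messages, one dummy feature each
features = [[i] for i in range(12)]
labels = ["ham"] * 8 + ["spam"] * 4

# stratify keeps the ham/spam ratio the same in both splits;
# random_state makes the shuffle repeatable
x_tr, x_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=0
)

print(len(x_tr), len(x_te))
print(sorted(y_te))
```

Stratifying matters most when one class is rare, because a purely random split can leave the test set with very few examples of that class.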

Since the task here is to distinguish one class of messages from another (spam from ham), we need to use classification algorithms.

We use a DecisionTreeClassifier to learn the first model.

In [89]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
decision_tree=DecisionTreeClassifier(min_samples_split=40)
decision_tree.fit(x_train,y_train)
prediction=decision_tree.predict(x_test)
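# Only the last expression in a cell is displayed, so the value shown below is the test accuracy;
# wrap the train-set line in print() to see it as well.
accuracy_score(y_train,decision_tree.predict(x_train))
accuracy_score(y_test,prediction)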

0.9454414931801867
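
Accuracy on its own can be flattering when one class dominates, which is common for spam data. The sketch below uses made-up labels to show how precision and recall for the spam class expose a weak classifier that accuracy hides. The helper functions are hypothetical; `sklearn.metrics` offers `confusion_matrix` and `classification_report` for the same purpose.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Return (tp, fp, fn, tn) with `positive` as the class of interest."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def summarize(y_true, y_pred):
    """Accuracy, plus precision and recall for the spam class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical labels: 8 ham and 2 spam messages
y_true = ["ham"] * 8 + ["spam"] * 2
always_ham = ["ham"] * 10                    # never flags anything
one_caught = ["ham"] * 8 + ["spam", "ham"]   # catches one of the two spam messages

print(summarize(y_true, always_ham))
print(summarize(y_true, one_caught))
```

The always-ham predictor reaches 80% accuracy on these labels while catching no spam at all, which is why recall on the spam class is worth reporting next to accuracy.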

Logistic Regression

In [95]:
from sklearn.linear_model import LogisticRegression
LR=LogisticRegression(C=1)
LR.fit(x_train,y_train)
pred=LR.predict(x_test)
# As before, only the last expression (the test accuracy) is displayed
accuracy_score(y_train,LR.predict(x_train))
accuracy_score(y_test,pred)



0.9676956209619526

Naive Bayes

In [98]:
from sklearn.naive_bayes import MultinomialNB
nb=MultinomialNB()
nb.fit(x_train,y_train)
predict=nb.predict(x_test)
accuracy_score(y_test,predict)

0.9698492462311558
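
For the final step in the outline, making predictions on new data, the fitted vectorizer and the classifier have to be applied together. One way to do that, sketched here under the assumption of a tiny made-up corpus, is to chain them in a scikit-learn pipeline so that raw text goes in and labels come out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, only to keep the example self-contained
texts = [
    "Win a free prize now, claim your cash reward",
    "Free entry to the weekly prize draw, text WIN",
    "Urgent! You have won a guaranteed cash prize",
    "Are we still meeting for lunch today",
    "I will be home late, see you tonight",
    "Can you send me the notes from class",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# The pipeline fits the vectorizer and the classifier in one call
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(texts, labels)

# New, unseen messages are vectorized with the vocabulary learnt during fit
new_messages = ["Claim your free cash prize", "See you at lunch"]
predictions = model.predict(new_messages)
print(list(zip(new_messages, predictions)))
```

Keeping both stages in one object avoids the mistake of fitting a fresh vectorizer on new messages, which would produce columns that no longer match the ones the classifier was trained on.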