<a href="https://colab.research.google.com/github/ramapriyakp/Portfolio/blob/master/NLP/Spam_filtering_Decision_Tree.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Description**

Our goal is to predict whether a new message is spam or not spam. Not-spam messages are also called ham. We have a dataset of messages (SMS texts in this notebook), each of which already carries a class label: either spam or ham. So,
our dataset is divided into two classes, spam and ham.
<br> Now, if we get a new message, can we assign it to the spam or ham class? The
answer is yes. To classify the new message, we train an ML algorithm on our labeled dataset and
use it to predict the best-suited class for the new message. The algorithm that implements the
classification is called a ***classifier***.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
cd '/content/drive/My Drive/NLP'

/content/drive/My Drive/NLP


In [0]:
from __future__ import print_function
import pandas as pd
import numpy as np

In [0]:
# read file into pandas using a relative path
#path = 'data/sms.tsv'
path = 'sms.tsv'
sms = pd.read_table(path, header=None, names=['label', 'message'])

  


In [0]:
# examine the shape
sms.shape

(5572, 2)

In [0]:
# examine the first 10 rows
sms.head(10)

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


In [0]:
# examine the class distribution
sms.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [0]:
# convert label to a numerical variable
sms['label_num'] = sms.label.map({'ham':0, 'spam':1})

In [0]:
# check that the conversion worked
sms.head(10)

Unnamed: 0,label,message,label_num
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0
5,spam,FreeMsg Hey there darling it's been 3 week's n...,1
6,ham,Even my brother is not like to speak with me. ...,0
7,ham,As per your request 'Melle Melle (Oru Minnamin...,0
8,spam,WINNER!! As a valued network customer you have...,1
9,spam,Had your mobile 11 months or more? U R entitle...,1


In [0]:
# define X and y (from the SMS data) for use with CountVectorizer
X = sms.message
y = sms.label_num
print(X.shape)
print(y.shape)

(5572,)
(5572,)


In [0]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(4179,)
(1393,)
(4179,)
(1393,)


For spam filtering, we use scikit-learn's CountVectorizer API to generate the features. Above, we performed some basic analysis to help us understand our data. Next, we
convert the text data to a vector format using CountVectorizer.

__Word Counts with CountVectorizer__<br>
CountVectorizer provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.

You can use it as follows:

*  Create an instance of the CountVectorizer class.
*  Call the fit() function in order to learn a vocabulary from one or more documents.
*  Call the transform() function on one or more documents as needed to encode each as a vector.
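
To make the three steps concrete, here is a minimal sketch on a tiny corpus. The three sentences and the names `toy_docs`, `toy_vect`, `toy_dtm`, and `new_dtm` are made up for illustration and are not part of the SMS dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (not taken from the SMS dataset)
toy_docs = ["call me now", "call you tonight", "win cash now now"]

# Step 1: create the vectorizer; step 2: learn the vocabulary
toy_vect = CountVectorizer()
toy_vect.fit(toy_docs)

# Column order of the document-term matrix is alphabetical
vocab = sorted(toy_vect.vocabulary_)
print(vocab)
# ['call', 'cash', 'me', 'now', 'tonight', 'win', 'you']

# Step 3: encode documents as rows of word counts
toy_dtm = toy_vect.transform(toy_docs)
print(toy_dtm.toarray())
# [[1 0 1 1 0 0 0]
#  [1 0 0 0 1 0 1]
#  [0 1 0 2 0 1 0]]

# Words that were not seen during fit() ("taxi") are ignored, and the
# default tokenizer drops one-character tokens ("a")
new_dtm = toy_vect.transform(["call me a taxi"])
print(new_dtm.toarray())
# [[1 0 1 0 0 0 0]]
```

The last call shows why the test data is later encoded with `transform()` only: the columns must match the vocabulary learned from the training data.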

In [0]:
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer
# instantiate the vectorizer
vect = CountVectorizer()
# learn training data vocabulary, then use it to create a document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)

In [0]:
# equivalently: combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)

In [0]:
# examine the document-term matrix
X_train_dtm

<4179x7456 sparse matrix of type '<class 'numpy.int64'>'
	with 55209 stored elements in Compressed Sparse Row format>

In [0]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm

<1393x7456 sparse matrix of type '<class 'numpy.int64'>'
	with 17604 stored elements in Compressed Sparse Row format>

# Decision Tree
How does a decision tree decide which feature, and which value of that feature, to split the data on? A decision tree can use the concept of entropy to decide where to split
the data. Let's understand entropy.

**Entropy** is the measure of the impurity in a tree branch.

*  If all data points in a tree branch belong to the same class, then entropy E = 0; otherwise, E > 0. For a two-class problem such as spam vs. ham, E <= 1.
*  If entropy E = 1, the tree branch is highly impure: the data points are evenly split between the two classes.
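
As a small illustration, the sketch below computes Shannon entropy in bits for a list of class labels. The function name `entropy` and the three label lists are made up for this example; scikit-learn computes the same quantity internally when `criterion='entropy'` is used.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(c / total) * log2(c / total) for c in counts.values())

# Hypothetical branches of a tree
pure = ["ham"] * 8                    # one class only
mixed = ["ham"] * 4 + ["spam"] * 4    # evenly split
skewed = ["ham"] * 7 + ["spam"] * 1   # mostly one class

print(entropy(pure))              # 0.0
print(entropy(mixed))             # 1.0
print(round(entropy(skewed), 3))  # 0.544
```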

How do we know which feature to split on? The decision tree uses information gain for this.

*Information Gain (IG) = Entropy (Parent Node) - [Weighted Average] Entropy (Children)*

We calculate the entropy of the parent node and
subtract the weighted average entropy of the children. This calculation is done for all the
available features, and the decision tree splits on the feature with the highest information gain.
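
The formula above can be sketched in a few lines. The helper names `entropy` and `information_gain`, and the hypothetical split on whether a message contains the word "free", are assumptions made for illustration only.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total)
               for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical parent node: 5 ham and 5 spam messages
parent = ["ham"] * 5 + ["spam"] * 5

# Hypothetical split A: message contains the word "free" or not
with_free = ["spam"] * 4 + ["ham"] * 1
without_free = ["spam"] * 1 + ["ham"] * 4
gain_a = information_gain(parent, [with_free, without_free])

# Hypothetical split B: a feature that separates nothing
half_1 = ["ham", "spam"] * 2 + ["ham"]
half_2 = ["ham", "spam"] * 2 + ["spam"]
gain_b = information_gain(parent, [half_1, half_2])

print(round(gain_a, 3))  # 0.278
print(round(gain_b, 3))  # 0.029
```

Split A has the higher gain, so a tree comparing only these two candidates would split on it.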

**Advantages of decision tree**

The following are the advantages that decision trees provide:

*  They are simple and easy to develop
*  They can be interpreted easily by humans; a decision tree is a white-box algorithm
*  They help us determine the worst, best, and expected values for different scenarios

**Disadvantages of decision tree**


*   If you have a lot of features, a decision tree may overfit the training data
*   You need to be careful about the parameters that you pass while training
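
As a rough sketch of how training parameters affect tree size, the example below compares an unconstrained tree with one limited by `max_depth` and `min_samples_leaf`. The data is synthetic (generated with `make_classification`) and only stands in for a many-feature problem; the specific values 3 and 5 are arbitrary assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; this is not the SMS dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=50,
                                     n_informative=5, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=1)

# Unconstrained tree: keeps splitting until the leaves are pure
full_tree = DecisionTreeClassifier(criterion='entropy', random_state=1)
full_tree.fit(Xtr, ytr)

# Constrained tree: depth and leaf size are capped (arbitrary example values)
small_tree = DecisionTreeClassifier(criterion='entropy', max_depth=3,
                                    min_samples_leaf=5, random_state=1)
small_tree.fit(Xtr, ytr)

for name, model in [("full", full_tree), ("small", small_tree)]:
    print(name,
          "depth:", model.get_depth(),
          "train acc:", round(model.score(Xtr, ytr), 3),
          "test acc:", round(model.score(Xte, yte), 3))
```

A large gap between training and testing accuracy for the unconstrained tree is the usual sign of overfitting; whether the constrained tree does better depends on the data.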




In [0]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion='entropy')

In [0]:
# train the model using X_train_dtm (timing it with an IPython "magic command")
%time clf.fit(X_train_dtm, y_train)

CPU times: user 115 ms, sys: 880 µs, total: 116 ms
Wall time: 126 ms


DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [0]:
# make class predictions for X_test_dtm
y_pred_class = clf.predict(X_test_dtm)

In [0]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.9669777458722182

**Why do you need a confusion matrix?**

The confusion matrix visualizes the accuracy of a classifier by comparing the actual and predicted classes.

![alt text](https://www.guru99.com/images/r_programming/032918_0938_DecisionTre2.png)


Here are the benefits of using a confusion matrix.

*  It shows how a classification model is confused when it makes predictions
*  It gives you insight not only into the errors being made by your classifier but also into the types of errors being made
*  This breakdown helps you overcome the limitation of using classification accuracy alone

In [0]:

# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[1182,   26],
       [  20,  165]])
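
The four counts printed above are enough to derive metrics beyond accuracy. The sketch below copies them by hand, assuming scikit-learn's layout (rows are actual classes, columns are predicted classes, with ham = 0 and spam = 1); the variable names are illustrative.

```python
# Counts copied from the confusion matrix printed above
# (rows = actual class, columns = predicted class)
tn, fp = 1182, 26   # actual ham:  predicted ham, predicted spam
fn, tp = 20, 165    # actual spam: predicted ham, predicted spam

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
# Accuracy of a baseline that always predicts the majority class (ham)
null_accuracy = (tn + fp) / total
# Of the messages flagged as spam, how many really were spam
precision = tp / (tp + fp)
# Of the real spam messages, how many were caught
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4))       # 0.967
print(round(null_accuracy, 4))  # 0.8672
print(round(precision, 4))      # 0.8639
print(round(recall, 4))         # 0.8919
print(round(f1, 4))             # 0.8777
```

Because always predicting ham already reaches about 87% accuracy on this test set, precision and recall for the spam class give a clearer picture than accuracy alone.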

In [0]:
# print message text for the false positives (ham incorrectly classified as spam)
X_test[y_test < y_pred_class]

1827    Dude. What's up. How Teresa. Hope you have bee...
574                                Waiting for your call.
1973    Yes but can we meet in town cos will go to gep...
3242      Ok i've sent u da latest version of da project.
1791    Am not working but am up to eyes in philosophy...
2900    Aight, I should be there by 8 at the latest, p...
2497    HCL chennai requires FRESHERS for voice proces...
745       Men like shorter ladies. Gaze up into his eyes.
2340    Cheers for the message Zogtorius. Ive been st...
1832    Hello- thanx for taking that call. I got a job...
566     Ill call u 2mrw at ninish, with my address tha...
3544             I'm e person who's doing e sms survey...
987     I'm in office now . I will call you  &lt;#&gt;...
867     Same here, but I consider walls and bunkers an...
1537    How's it feel? Mr. Your not my real Valentine ...
705     True dear..i sat to pray evening and felt so.s...
988     Geeee ... I miss you already, you know ? Your ...
100     Please

In [0]:
# print message text for the false negatives (spam incorrectly classified as ham)
X_test[y_test > y_pred_class]

3642    You can stop further club tones by replying "S...
1777                    Call FREEPHONE 0800 542 0578 now!
2680    New Tones This week include: 1)McFly-All Ab..,...
763     Urgent Ur £500 guaranteed award is still uncla...
4574    URGENT! This is the 2nd attempt to contact U!U...
881     Reminder: You have not downloaded the content ...
3132    LookAtMe!: Thanks for your purchase of a video...
2514    U have won a nokia 6230 plus a free digital ca...
4499    Latest Nokia Mobile or iPOD MP3 Player +£400 p...
5       FreeMsg Hey there darling it's been 3 week's n...
3530    Xmas & New Years Eve tickets are now on sale f...
4768    Your unique user ID is 1172. For removal send ...
4298    thesmszone.com lets you send free anonymous an...
4949    Hi this is Amy, we will be sending you a free ...
761     Romantic Paris. 2 nights, 2 flights from £79 B...
3564    Auction round 4. The highest bid is now £54. N...
2863    Adult 18 Content Your video will be with you s...
2247    Hi ya 

In [0]:
# example false negative
X_test[761]

'Romantic Paris. 2 nights, 2 flights from £79 Book now 4 next year. Call 08704439680Ts&Cs apply.'

In [0]:
# calculate predicted probabilities for X_test_dtm (poorly calibrated:
# a fully grown tree has pure leaves, so most values are exactly 0 or 1)
y_pred_prob = clf.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([0., 0., 0., ..., 0., 1., 0.])

In [0]:
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)

0.9351843565419724

*  ROC curves summarize the trade-off between the true positive rate and the false positive rate of a predictive model across different probability thresholds.
*  ROC curves are appropriate when the observations are roughly balanced between the classes, whereas precision-recall curves are more appropriate for imbalanced datasets. This dataset is imbalanced (4825 ham vs. 747 spam), so the AUC above is best read together with precision and recall.
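
To see what "different probability thresholds" means in practice, here is a minimal sketch on four hand-made scores. The labels and scores are invented for illustration and are not the notebook's predictions; `average_precision_score` is used here as one common summary of the precision-recall curve.

```python
from sklearn import metrics

# Hypothetical true labels and predicted spam scores
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Each threshold gives one (false positive rate, true positive rate) point
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print("threshold:", th, "FPR:", f, "TPR:", t)

# Area under the ROC curve
auc = metrics.roc_auc_score(y_true, y_score)
print(auc)  # 0.75

# Summary of the precision-recall curve
ap = metrics.average_precision_score(y_true, y_score)
print(round(ap, 4))  # 0.8333
```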

