## Document classification

It can be useful to be able to classify new "test" documents using already classified "training" documents.  
A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  
Here is one example of such data:  

UCI Machine Learning Repository: Spambase Data Set

For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source, such as your own spam folder).

http://archive.ics.uci.edu/ml/datasets/Spambase


For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), 
then analyze these documents to predict how new documents should be classified.

Link to the recording:

https://www.youtube.com/watch?v=tkuucWzI_Ts

## About the Data

The data was prepared by analysing 4601 e-mails for their contents: word frequencies, the presence of special characters, and the run lengths of capital letters.

The last column of 'spambase.data' indicates whether the e-mail was considered spam (1), i.e. unsolicited commercial e-mail, or not (0). Most of the attributes indicate whether a particular word or character occurred frequently in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters.

**Feature set**

**column 1-48** continuous real [0,100] attributes of type word_freq_WORD
= percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.

**column 49-54** continuous real [0,100] attributes of type char_freq_CHAR
= percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail

**column 55** continuous real [1,...] attribute of type capital_run_length_average
= average length of uninterrupted sequences of capital letters

**column 56**  continuous integer [1,...] attribute of type capital_run_length_longest
= length of longest uninterrupted sequence of capital letters

**column 57**  continuous integer [1,...] attribute of type capital_run_length_total
= sum of length of uninterrupted sequences of capital letters
= total number of capital letters in the e-mail

**column 58** nominal {0,1} class attribute of type spam
= denotes whether the e-mail was considered spam (1) or not (0).
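
As a rough illustration of the definitions above, the sketch below computes the same kinds of features for a single made-up e-mail. The function name `spambase_style_features`, the sample text, and the case-insensitive word matching are assumptions for illustration; the dataset's original extraction code is not shown here and may differ in details.

```python
import re

def spambase_style_features(text, word, char):
    """Compute illustrative Spambase-style features for one e-mail body.

    Hypothetical helper: it follows the attribute definitions in this
    section, not the dataset authors' actual extraction code.
    """
    # A "word" is a maximal run of alphanumeric characters.
    words = re.findall(r"[A-Za-z0-9]+", text)
    matches = sum(w.lower() == word.lower() for w in words)  # assumed case-insensitive
    word_freq = 100 * matches / len(words) if words else 0.0

    # Percentage of all characters in the text that equal `char`.
    char_freq = 100 * text.count(char) / len(text) if text else 0.0

    # Lengths of uninterrupted runs of capital letters.
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    return {
        "word_freq": word_freq,
        "char_freq": char_freq,
        "capital_run_length_average": sum(runs) / len(runs) if runs else 0.0,
        "capital_run_length_longest": max(runs) if runs else 0,
        "capital_run_length_total": sum(runs),
    }

features = spambase_style_features(
    "FREE money!!! Claim your FREE prize NOW", word="free", char="!"
)
for name, value in features.items():
    print(f"{name}: {value:.3f}")
```

For this sample text the capital runs are `FREE`, `C`, `FREE` and `NOW`, so the total is 12, the longest run is 4 and the average is 3.0.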

In [1]:
#load the libraries

import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
# sklearn.cross_validation was removed in scikit-learn 0.20; sklearn.model_selection replaces it
from sklearn import datasets, svm, tree, preprocessing, metrics
from sklearn.metrics import classification_report, confusion_matrix
import sklearn.ensemble as ske
import warnings
warnings.filterwarnings('ignore')




In [2]:
spam_ham_data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data', sep = ",", header = None)

In [3]:
#add the column names

col_names = ["word_freq_make", 
 "word_freq_address", 
 "word_freq_all", 
 "word_freq_3d", 
 "word_freq_our", 
 "word_freq_over", 
 "word_freq_remove", 
 "word_freq_internet",
 "word_freq_order",        
 "word_freq_mail",         
 "word_freq_receive",      
 "word_freq_will",     
 "word_freq_people",
 "word_freq_report",       
 "word_freq_addresses",    
 "word_freq_free",       
 "word_freq_business",     
 "word_freq_email",        
 "word_freq_you",         
 "word_freq_credit",       
 "word_freq_your",         
 "word_freq_font",         
 "word_freq_000",          
 "word_freq_money",        
 "word_freq_hp",           
 "word_freq_hpl",          
 "word_freq_george",       
 "word_freq_650",          
 "word_freq_lab",          
 "word_freq_labs",         
 "word_freq_telnet",       
 "word_freq_857",          
 "word_freq_data",         
 "word_freq_415",          
 "word_freq_85",           
 "word_freq_technology",   
 "word_freq_1999",         
 "word_freq_parts",        
 "word_freq_pm",           
 "word_freq_direct",       
 "word_freq_cs",           
 "word_freq_meeting",      
 "word_freq_original",     
 "word_freq_project",      
 "word_freq_re",           
 "word_freq_edu",          
 "word_freq_table",        
 "word_freq_conference",   
 "char_freq_;",            
 "char_freq_(",            
 "char_freq_[",            
 "char_freq_!",            
 "char_freq_$",            
 "char_freq_#",            
 "capital_run_length_average",
 "capital_run_length_longest",
 "capital_run_length_total",
 "spam"]
spam_ham_data.columns = col_names

In [4]:
spam_ham_data.head(10)


Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,spam
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1
5,0.0,0.0,0.0,0.0,1.85,0.0,0.0,1.85,0.0,0.0,...,0.0,0.223,0.0,0.0,0.0,0.0,3.0,15,54,1
6,0.0,0.0,0.0,0.0,1.92,0.0,0.0,0.0,0.0,0.64,...,0.0,0.054,0.0,0.164,0.054,0.0,1.671,4,112,1
7,0.0,0.0,0.0,0.0,1.88,0.0,0.0,1.88,0.0,0.0,...,0.0,0.206,0.0,0.0,0.0,0.0,2.45,11,49,1
8,0.15,0.0,0.46,0.0,0.61,0.0,0.3,0.0,0.92,0.76,...,0.0,0.271,0.0,0.181,0.203,0.022,9.744,445,1257,1
9,0.06,0.12,0.77,0.0,0.19,0.32,0.38,0.0,0.06,0.0,...,0.04,0.03,0.0,0.244,0.081,0.0,1.729,43,749,1


In [5]:
spam_ham_data.shape

(4601, 58)

Since the e-mail data has already been analyzed and the features have been extracted, we build the classifiers directly on the given feature set.

## Class Distribution

Another important thing to check before feeding our data into the model is the class distribution. In our case the target class has two outcomes: 1 (spam) and 0 (ham).



In [6]:
#check for missing values in the spam column, then count the rows in each class
print(spam_ham_data['spam'].isnull().sum())
spam_ham_data['spam'].value_counts()

0


0    2788
1    1813
Name: spam, dtype: int64

The target variable is split roughly 61% ham (2788 e-mails) to 39% spam (1813 e-mails), so the classes are not severely imbalanced.
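
The class shares can be checked by hand from the counts printed above. The snippet below is a small sketch using only those two counts; the variable names are illustrative. It also prints the accuracy of always predicting the majority class, which is a useful floor when judging the models later.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 2788, 1: 1813}  # 0 = ham, 1 = spam

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy obtained by always predicting the most common class.
majority_baseline = max(shares.values())

print(f"total e-mails: {total}")
print(f"ham share:  {shares[0]:.3f}")
print(f"spam share: {shares[1]:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```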

## Split the data 
We separate the feature variables from the target, and then split the data set into training and test sets.

In [19]:
#separate the target and feature data 
X = spam_ham_data.drop(['spam'], axis=1).values
y = spam_ham_data['spam'].values


In [8]:
#split the data into training and test data by 80% and 20%
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

In [9]:
print(X_train.shape)
print(X_test.shape)

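(3220, 57)
(1381, 57)

These shapes correspond to a 70/30 split (`test_size=0.3`). The classification reports below are evaluated on 921 test rows, which is what the 80/20 split in the code above produces (3680 training rows and 921 test rows).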
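
The number of rows in each part of a split can be worked out without running the split itself. The sketch below assumes that `train_test_split` rounds the test count up when `test_size` is a fraction and gives the remaining rows to the training set; `split_sizes` is a hypothetical helper, not part of sklearn.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the training set
    receives the remaining rows.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

for fraction in (0.2, 0.3):
    n_train, n_test = split_sizes(4601, fraction)
    print(f"test_size={fraction}: train={n_train}, test={n_test}")
```

For 4601 rows this gives 3680 training and 921 test rows at `test_size=0.2`, and 3220 training and 1381 test rows at `test_size=0.3`.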


## Model Generation

After making sure our data is ready, we can build our models. 
In this notebook we build 3 models using different algorithms. Each one is a baseline model that uses sklearn's default parameters, except that the decision tree is limited to a maximum depth of 10. After building all 3 models we compare them to see which works best for our case.

**DecisionTreeClassifier**

In [12]:
# initialize a decision tree classifier with a maximum depth of 10 and fit the model
doc_clf = tree.DecisionTreeClassifier(max_depth=10)
doc_clf.fit(X_train, y_train)
doc_clf.score(X_test, y_test)

0.9207383279044516

The score shows that the decision tree classifies about 92% of the test e-mails correctly.

In [14]:
pred = doc_clf.predict(X_test)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))

[[543  28]
 [ 45 305]]
             precision    recall  f1-score   support

          0       0.92      0.95      0.94       571
          1       0.92      0.87      0.89       350

avg / total       0.92      0.92      0.92       921



**KNeighborsClassifier**

In [15]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

pred = knn.predict(X_test)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))


[[492  79]
 [101 249]]
             precision    recall  f1-score   support

          0       0.83      0.86      0.85       571
          1       0.76      0.71      0.73       350

avg / total       0.80      0.80      0.80       921



**GaussianNB**

In [16]:
gnb = GaussianNB()
gnb.fit(X_train, y_train)

pred = gnb.predict(X_test)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))


[[411 160]
 [ 18 332]]
             precision    recall  f1-score   support

          0       0.96      0.72      0.82       571
          1       0.67      0.95      0.79       350

avg / total       0.85      0.81      0.81       921

Accuracy of GNB classifier on training set: 0.83
Accuracy of GNB classifier on test set: 0.81


Of the three models, the decision tree classifier has the highest test accuracy, about 92%, compared with about 80% for KNeighborsClassifier and about 81% for GaussianNB.

## Model comparison and Selection

**Accuracy:** the proportion of correct predictions among the total number of cases examined.

**Precision:** the proportion of cases predicted positive that are actually positive.

**Recall:** the proportion of actual positives that are correctly classified.

**F1 score:** the harmonic mean of precision and recall, a number between 0 and 1.
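
To make these definitions concrete, the sketch below recomputes them by hand from the decision tree's confusion matrix shown earlier, treating spam (1) as the positive class. It assumes the layout that sklearn's `confusion_matrix` uses for labels 0 and 1, with rows as actual classes and columns as predicted classes; `binary_metrics` is an illustrative helper name.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)  # predicted spam that really is spam
    recall = tp / (tp + fn)     # actual spam that was caught
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Decision tree confusion matrix from above:
# [[543  28]    row 0: actual ham  -> 543 predicted ham, 28 predicted spam
#  [ 45 305]]   row 1: actual spam ->  45 predicted ham, 305 predicted spam
spam_metrics = binary_metrics(tn=543, fp=28, fn=45, tp=305)

for name, value in spam_metrics.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, the values agree with the spam row of the decision tree's classification report (precision 0.92, recall 0.87, f1-score 0.89) and with the 0.92 accuracy.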


From the classification reports we can see that **DecisionTreeClassifier** provides the most accurate results for the given data set (averaged over both classes: precision 0.92, recall 0.92, f1-score 0.92; for the ham class alone: precision 0.92, recall 0.95, f1-score 0.94), and hence we select it.
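
The comparison can also be made directly from the three confusion matrices printed above. The sketch below is one way to do that, not the notebook's original code: it computes each model's test accuracy and the number of ham e-mails wrongly flagged as spam, then picks the model with the highest accuracy.

```python
# Confusion matrices copied from the outputs above
# (rows: actual ham/spam, columns: predicted ham/spam).
confusion = {
    "DecisionTreeClassifier": [[543, 28], [45, 305]],
    "KNeighborsClassifier": [[492, 79], [101, 249]],
    "GaussianNB": [[411, 160], [18, 332]],
}

def accuracy(matrix):
    """Share of test rows on the diagonal (correct predictions)."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total

scores = {name: accuracy(matrix) for name, matrix in confusion.items()}
false_alarms = {name: matrix[0][1] for name, matrix in confusion.items()}
best = max(scores, key=scores.get)

for name in confusion:
    print(f"{name}: accuracy={scores[name]:.3f}, "
          f"ham flagged as spam={false_alarms[name]}")
print("highest accuracy:", best)
```

Accuracy alone hides some trade-offs: GaussianNB misses the fewest spam e-mails (18) but flags 160 ham e-mails as spam, which matters if losing legitimate mail is costly.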
