# Data 620 - Web Analytics HW 5.2-Document Classification

**Yina Qiao**


Assignment:

It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  Here is one example of such data:  [UCI Machine Learning Repository: Spambase Data Set](http://archive.ics.uci.edu/ml/datasets/Spambase)

For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).

For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.

## Data

Source: http://archive.ics.uci.edu/ml/datasets/Spambase



## Data preparation and load



In [1]:
import tempfile
import urllib.request
import zipfile
import os
import pandas as pd
import nltk
import numpy as np


from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn import ensemble
from sklearn import svm
import sklearn.metrics as sm

# Define the URL of the ZIP file to download
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.zip'

# Create a temporary file to store the downloaded ZIP file
temp = tempfile.NamedTemporaryFile()

# Download the ZIP file from the URL
urllib.request.urlretrieve(url, temp.name)

# Extract the contents of the ZIP file to a directory named 'l1'
with zipfile.ZipFile(temp.name, 'r') as zip_ref:
    zip_ref.extractall('./l1')

# Change the current working directory to 'l1'
os.chdir('l1')

# Read the 'spambase.data' file into a DataFrame
spambase = pd.read_csv("spambase.data", sep=",")

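# Read the column names from the 'spambase.names' file
cnames = pd.read_csv("spambase.names", comment="|", header=None)[0]
# Note: "[[:punct:]]" is a POSIX character class, which Python's `re` module does not support,
# so this call leaves the names unchanged (they keep the ": continuous." suffix seen below)
cnames = cnames.str.replace("[[:punct:]]", "").tolist()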

# Adjust the column names, excluding the first entry, and add 'target' as the last column name
cnames = cnames[1:] + ["target"]

# Assign the adjusted column names to the DataFrame
spambase.columns = cnames
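
Two details of the loading step are worth illustrating. `read_csv` treats the first line of a file as a header unless `header=None` is passed, which is why the frame below reports 4600 entries although the UCI file contains 4601 instances. The names can also be cleaned with an ordinary regular expression. The following is a minimal sketch, not the code used for the results in this notebook; the sample rows and name strings are made up to mimic the file layout.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same headerless CSV layout as spambase.data
raw = "0.0,0.64,1\n0.21,0.28,1\n0.06,0.0,0\n"

# Default behaviour: the first data row is consumed as the header
with_default = pd.read_csv(io.StringIO(raw))
# header=None keeps every row as data
with_header_none = pd.read_csv(io.StringIO(raw), header=None)
print(len(with_default), len(with_header_none))

# Hypothetical entries in the style of spambase.names
raw_names = pd.Series(["word_freq_make:         continuous.",
                       "char_freq_$:            continuous."])
# Drop everything from the colon onward
clean_names = raw_names.str.replace(r":.*$", "", regex=True).tolist()
print(clean_names)
```

Applying either change to the real data would alter the row counts and column labels shown in the rest of this notebook.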




## Data exploration   

In [2]:
# Looking at first few rows
spambase.head()

Unnamed: 0,word_freq_make: continuous.,word_freq_address: continuous.,word_freq_all: continuous.,word_freq_3d: continuous.,word_freq_our: continuous.,word_freq_over: continuous.,word_freq_remove: continuous.,word_freq_internet: continuous.,word_freq_order: continuous.,word_freq_mail: continuous.,...,char_freq_;: continuous.,char_freq_(: continuous.,char_freq_[: continuous.,char_freq_!: continuous.,char_freq_$: continuous.,char_freq_#: continuous.,capital_run_length_average: continuous.,capital_run_length_longest: continuous.,capital_run_length_total: continuous.,target
0,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
1,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
2,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,1.85,0.0,0.0,1.85,0.0,0.0,...,0.0,0.223,0.0,0.0,0.0,0.0,3.0,15,54,1


In [3]:
spambase.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 58 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   word_freq_make:         continuous.      4600 non-null   float64
 1   word_freq_address:      continuous.      4600 non-null   float64
 2   word_freq_all:          continuous.      4600 non-null   float64
 3   word_freq_3d:           continuous.      4600 non-null   float64
 4   word_freq_our:          continuous.      4600 non-null   float64
 5   word_freq_over:         continuous.      4600 non-null   float64
 6   word_freq_remove:       continuous.      4600 non-null   float64
 7   word_freq_internet:     continuous.      4600 non-null   float64
 8   word_freq_order:        continuous.      4600 non-null   float64
 9   word_freq_mail:         continuous.      4600 non-null   float64
 10  word_freq_receive:      continuous.      4600 no

In [4]:
spambase.describe()

Unnamed: 0,word_freq_make: continuous.,word_freq_address: continuous.,word_freq_all: continuous.,word_freq_3d: continuous.,word_freq_our: continuous.,word_freq_over: continuous.,word_freq_remove: continuous.,word_freq_internet: continuous.,word_freq_order: continuous.,word_freq_mail: continuous.,...,char_freq_;: continuous.,char_freq_(: continuous.,char_freq_[: continuous.,char_freq_!: continuous.,char_freq_$: continuous.,char_freq_#: continuous.,capital_run_length_average: continuous.,capital_run_length_longest: continuous.,capital_run_length_total: continuous.,target
count,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,...,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0
mean,0.104576,0.212922,0.280578,0.065439,0.312222,0.095922,0.114233,0.105317,0.090087,0.239465,...,0.038583,0.139061,0.01698,0.26896,0.075827,0.044248,5.191827,52.17087,283.290435,0.393913
std,0.305387,1.2907,0.50417,1.395303,0.672586,0.27385,0.39148,0.401112,0.278643,0.644816,...,0.243497,0.270377,0.109406,0.815726,0.245906,0.429388,31.732891,194.912453,606.413764,0.488669
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.588,6.0,35.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.065,0.0,0.0,0.0,0.0,2.2755,15.0,95.0,0.0
75%,0.0,0.0,0.42,0.0,0.3825,0.0,0.0,0.0,0.0,0.16,...,0.0,0.188,0.0,0.31425,0.052,0.0,3.70525,43.0,265.25,1.0
max,4.54,14.28,5.1,42.81,10.0,5.88,7.27,11.11,5.26,18.18,...,4.385,9.752,4.081,32.478,6.003,19.829,1102.5,9989.0,15841.0,1.0


## Spam / non-spam count     

The last column, `target`, records whether an e-mail is spam.

A value of 1 denotes that the e-mail was considered spam (i.e. unsolicited commercial e-mail); a value of 0 denotes that it was not.

In [5]:
spam_count = len(spambase[spambase['target'] == 1])
notspam_count = len(spambase[spambase['target'] == 0])

print("Spam count: %d" % spam_count)
print("Non-spam count: %d" % notspam_count)


Spam count: 1812
Non-spam count: 2788


#### Check for nulls

In [6]:
spambase[spambase.isnull().any(axis = 1)]

Unnamed: 0,word_freq_make: continuous.,word_freq_address: continuous.,word_freq_all: continuous.,word_freq_3d: continuous.,word_freq_our: continuous.,word_freq_over: continuous.,word_freq_remove: continuous.,word_freq_internet: continuous.,word_freq_order: continuous.,word_freq_mail: continuous.,...,char_freq_;: continuous.,char_freq_(: continuous.,char_freq_[: continuous.,char_freq_!: continuous.,char_freq_$: continuous.,char_freq_#: continuous.,capital_run_length_average: continuous.,capital_run_length_longest: continuous.,capital_run_length_total: continuous.,target


## Data Preparation

Let's split the data into training, validation, and test sets, assigning 70% of the rows to training and 15% each to validation and testing.

In [7]:
spambase_rows = len(spambase)
train_rows = int(spambase_rows * 0.7)
val_rows = int(spambase_rows * 0.15)
test_rows = spambase_rows - train_rows - val_rows

In [8]:
print("Training rows (70 prc of total): %d" %train_rows)
print("Validation rows (15 prc of total): %d" %val_rows)
print("Testing rows (15 prc of total): %d" %test_rows)
print("Total: %d" %(train_rows + val_rows + test_rows))

Training rows (70 prc of total): 3220
Validation rows (15 prc of total): 690
Testing rows (15 prc of total): 690
Total: 4600


In [9]:
train_set, test_set = train_test_split(spambase, test_size = test_rows, random_state = 8)
train_set, val_set = train_test_split(train_set, test_size = val_rows, random_state = 88)

In [10]:
print("Training set: %d" %len(train_set))
print("Validation set: %d" %len(val_set))
print("Testing set: %d" %len(test_set))
print("Total: %d" %(len(train_set) + len(val_set) + len(test_set)))

Training set: 3220
Validation set: 690
Testing set: 690
Total: 4600
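
The two-stage split above does not stratify, so the spam share can differ slightly between the three sets. A possible variation, shown here as a sketch on a hypothetical toy frame rather than on the Spambase data, passes `stratify` to `train_test_split` so that each set keeps roughly the same class ratio.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 100 rows, 40% positive class
toy = pd.DataFrame({"x": range(100), "target": [1] * 40 + [0] * 60})

# First carve out the test rows, then the validation rows, stratifying both times
rest, toy_test = train_test_split(
    toy, test_size=15, stratify=toy["target"], random_state=8)
toy_train, toy_val = train_test_split(
    rest, test_size=15, stratify=rest["target"], random_state=88)

print(len(toy_train), len(toy_val), len(toy_test))
# Share of positives in each split
print(toy_train["target"].mean(), toy_val["target"].mean(), toy_test["target"].mean())
```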


## Confusion Matrix

We will compute the confusion matrix with the following function. It reports the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), from which the accuracy of a model can be derived.

In [11]:
def func_confusion_matrix(y_true, y_pred):
    cm = sm.confusion_matrix(y_true, y_pred, labels = [1, 0])
    print("TP: %d" %cm[0,0])
    print("FP: %d" %cm[1,0])
    print("TN: %d" %cm[1,1])
    print("FN: %d" %cm[0,1])
    print(sm.classification_report(y_true, y_pred, labels = [1,0], target_names = ["Spam", "Not spam"]))
    return cm
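
To make the indexing in the function above concrete, here is a small sketch with made-up labels. With `labels=[1, 0]`, row 0 and column 0 both refer to the spam class, so `cm[0, 0]` holds the true positives and `cm[1, 0]` the false positives.

```python
import sklearn.metrics as sm

# Made-up labels for illustration: 1 = spam, 0 = not spam
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# Rows are true classes, columns are predicted classes, both ordered [1, 0]
cm = sm.confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fn = cm[0, 0], cm[0, 1]
fp, tn = cm[1, 0], cm[1, 1]

# Accuracy is the share of all predictions that were correct
accuracy = (tp + tn) / cm.sum()
print(tp, fn, fp, tn, accuracy)
```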

## Decision Tree

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.
Source: https://scikit-learn.org/stable/modules/tree.html
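
The classifier below is created with `criterion = "entropy"`, meaning splits are chosen to reduce the Shannon entropy of the class labels in each node. As a rough illustration of that quantity, the helper below (a hypothetical function written for this note, not part of scikit-learn) computes the entropy in bits of a vector of class counts.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy in bits of a vector of class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # skip empty classes (0 * log 0 is taken as 0)
    return float(np.sum(p * np.log2(1.0 / p)))

print(entropy([50, 50]))      # evenly mixed node
print(entropy([100, 0]))      # pure node
print(entropy([1253, 1967]))  # spam / non-spam counts of the training set
```

A pure node has zero entropy and an evenly mixed two-class node has an entropy of one bit, so the tree favors splits that move its child nodes toward zero.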

In [12]:
# Use a distinct name so the imported `tree` module is not overwritten
tree_clf = tree.DecisionTreeClassifier(criterion = "entropy", random_state = 88)

### Training Set

In [13]:
train_class = train_set['target']  # Class labels (1 = spam, 0 = not spam)
train_vars = train_set.drop(labels='target', axis=1)  # Feature columns

tree_fit = tree_clf.fit(train_vars, train_class)

tree_train = tree_fit.predict(train_vars)
cm = func_confusion_matrix(train_class, tree_train)


TP: 1252
FP: 0
TN: 1967
FN: 1
              precision    recall  f1-score   support

        Spam       1.00      1.00      1.00      1253
    Not spam       1.00      1.00      1.00      1967

    accuracy                           1.00      3220
   macro avg       1.00      1.00      1.00      3220
weighted avg       1.00      1.00      1.00      3220



### Validation set

In [14]:
validate_class = val_set['target']  # Class labels (1 = spam, 0 = not spam)
validate_vars = val_set.drop(labels='target', axis=1)  # Feature columns

tree_validate = tree_fit.predict(validate_vars)
cm = func_confusion_matrix(validate_class, tree_validate)


TP: 244
FP: 31
TN: 388
FN: 27
              precision    recall  f1-score   support

        Spam       0.89      0.90      0.89       271
    Not spam       0.93      0.93      0.93       419

    accuracy                           0.92       690
   macro avg       0.91      0.91      0.91       690
weighted avg       0.92      0.92      0.92       690



### Test set

In [15]:
test_class = test_set['target']  # Class labels (1 = spam, 0 = not spam)
test_vars = test_set.drop(labels='target', axis=1)  # Feature columns

tree_test = tree_fit.predict(test_vars)
cm = func_confusion_matrix(test_class, tree_test)


TP: 262
FP: 29
TN: 373
FN: 26
              precision    recall  f1-score   support

        Spam       0.90      0.91      0.91       288
    Not spam       0.93      0.93      0.93       402

    accuracy                           0.92       690
   macro avg       0.92      0.92      0.92       690
weighted avg       0.92      0.92      0.92       690



**Precision** answers the question: what proportion of positive identifications was actually correct? It is computed as TP / (TP + FP).

**Recall** answers the question: what proportion of actual positives was identified correctly? It is computed as TP / (TP + FN).
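
As a quick check, the spam-class scores in the report can be reproduced from the counts printed for the decision tree's test set. A minimal sketch, with the three counts copied by hand from that output:

```python
# Counts copied from the decision tree's test-set output above
tp, fp, fn = 262, 29, 26

precision = tp / (tp + fp)  # share of predicted spam that really is spam
recall = tp / (tp + fn)     # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("precision=%.2f recall=%.2f f1=%.2f" % (precision, recall, f1))
```

The rounded values agree with the `Spam` row of the test-set classification report.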

### Variable importance

In [16]:
def func_arrange_feature_by_importance(fit, features):
    # Pair each feature name with its importance and return the ten largest
    df = {'variable': pd.Series(features.columns.values), 'imp': pd.Series(fit.feature_importances_)}
    return pd.DataFrame(df, columns=['variable','imp']).sort_values(['imp'], ascending=False).head(10)

func_arrange_feature_by_importance(tree_fit, test_vars)

| | variable | imp |
|---|---|---|
| 52 | char_freq_$: continuous. | 0.268194 |
| 6 | word_freq_remove: continuous. | 0.134915 |
| 51 | char_freq_!: continuous. | 0.103756 |
| 24 | word_freq_hp: continuous. | 0.07896 |
| 54 | capital_run_length_average: continuous. | 0.062309 |
| 55 | capital_run_length_longest: continuous. | 0.039214 |
| 15 | word_freq_free: continuous. | 0.036289 |
| 26 | word_freq_george: continuous. | 0.027385 |
| 44 | word_freq_re: continuous. | 0.022259 |
| 56 | capital_run_length_total: continuous. | 0.0211 |


## Random Forest
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.
Source: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
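
The "averaging" in the description above can be checked directly: the forest's class probabilities are the mean of the probabilities predicted by its individual trees. The following sketch uses a small synthetic dataset from `make_classification` (an assumption for illustration, not the Spambase data) and a reduced `n_estimators` to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, for illustration only
X, y = make_classification(n_samples=300, n_features=8, random_state=88)

demo_forest = RandomForestClassifier(
    n_estimators=25, criterion="entropy", random_state=88).fit(X, y)

# Average the per-tree class probabilities by hand
per_tree = np.stack([t.predict_proba(X) for t in demo_forest.estimators_])
manual_avg = per_tree.mean(axis=0)

# Compare with the forest's own probabilities
print(np.allclose(manual_avg, demo_forest.predict_proba(X)))
```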


In [17]:
forest = ensemble.RandomForestClassifier(criterion = "entropy", random_state = 88)

### Training set

In [18]:
forest_fit = forest.fit(train_vars, train_class)

forest_train = forest_fit.predict(train_vars)
cm = func_confusion_matrix(train_class, forest_train)

TP: 1253
FP: 1
TN: 1966
FN: 0
              precision    recall  f1-score   support

        Spam       1.00      1.00      1.00      1253
    Not spam       1.00      1.00      1.00      1967

    accuracy                           1.00      3220
   macro avg       1.00      1.00      1.00      3220
weighted avg       1.00      1.00      1.00      3220



### Test set

In [19]:
forest_test = forest_fit.predict(test_vars)
cm = func_confusion_matrix(test_class, forest_test)

TP: 274
FP: 21
TN: 381
FN: 14
              precision    recall  f1-score   support

        Spam       0.93      0.95      0.94       288
    Not spam       0.96      0.95      0.96       402

    accuracy                           0.95       690
   macro avg       0.95      0.95      0.95       690
weighted avg       0.95      0.95      0.95       690



In [20]:
func_arrange_feature_by_importance(forest_fit, test_vars)

| | variable | imp |
|---|---|---|
| 51 | char_freq_!: continuous. | 0.099876 |
| 52 | char_freq_$: continuous. | 0.098352 |
| 6 | word_freq_remove: continuous. | 0.079624 |
| 54 | capital_run_length_average: continuous. | 0.066567 |
| 15 | word_freq_free: continuous. | 0.063733 |
| 55 | capital_run_length_longest: continuous. | 0.056424 |
| 24 | word_freq_hp: continuous. | 0.05101 |
| 20 | word_freq_your: continuous. | 0.049192 |
| 56 | capital_run_length_total: continuous. | 0.040445 |
| 18 | word_freq_you: continuous. | 0.03485 |


## Support Vector Machines
A support vector machine (SVM) classifier separates the classes with a hyperplane chosen to maximize the margin, that is, the distance to the nearest training points of each class. Because it models the decision boundary directly rather than the distribution of each class, an SVM is a discriminative classifier. New data points are classified according to which side of the hyperplane they fall on.


In [21]:
# Use a distinct name so the imported `svm` module is not overwritten
svm_clf = svm.SVC(random_state = 88)

### Training set

In [22]:
svm_fit = svm_clf.fit(train_vars, train_class)

svm_train = svm_fit.predict(train_vars)
cm = func_confusion_matrix(train_class, svm_train)

TP: 540
FP: 210
TN: 1757
FN: 713
              precision    recall  f1-score   support

        Spam       0.72      0.43      0.54      1253
    Not spam       0.71      0.89      0.79      1967

    accuracy                           0.71      3220
   macro avg       0.72      0.66      0.67      3220
weighted avg       0.71      0.71      0.69      3220



### Test Set

In [23]:
svm_test = svm_fit.predict(test_vars)
cm = func_confusion_matrix(test_class, svm_test)

TP: 129
FP: 37
TN: 365
FN: 159
              precision    recall  f1-score   support

        Spam       0.78      0.45      0.57       288
    Not spam       0.70      0.91      0.79       402

    accuracy                           0.72       690
   macro avg       0.74      0.68      0.68       690
weighted avg       0.73      0.72      0.70       690
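
The SVM scores are well below those of the tree-based models. One possible cause, not verified in this notebook, is feature scale: `SVC` with its default RBF kernel is sensitive to the ranges of the inputs, and the Spambase columns differ widely (for example, `capital_run_length_total` reaches 15841, while the word-frequency columns shown earlier stay below 50). The sketch below shows how standardization could be added with a pipeline. It runs on synthetic data from `make_classification`, with one feature deliberately inflated, so its scores say nothing about the Spambase results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data, for illustration only
X, y = make_classification(n_samples=500, n_features=10, random_state=88)
X[:, 0] *= 1000  # exaggerate the scale of one feature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=88)

# Standardize each feature, then fit an RBF-kernel SVM
scaled_svm = make_pipeline(StandardScaler(), SVC(random_state=88))
scaled_svm.fit(X_train, y_train)
scaled_acc = scaled_svm.score(X_test, y_test)

# Same model without scaling, for comparison
unscaled_svm = SVC(random_state=88).fit(X_train, y_train)
unscaled_acc = unscaled_svm.score(X_test, y_test)

print("unscaled: %.2f, scaled: %.2f" % (unscaled_acc, scaled_acc))
```

Fitting the scaler inside the pipeline means it only sees training rows, which avoids leaking information from the validation or test sets.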



## Conclusion

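Of the three models, the random forest performs best on the test set, with an accuracy of 0.95 compared with 0.92 for the decision tree and 0.72 for the SVM. Both tree-based models score close to 1.00 on the training set, so the gap to their test scores suggests some overfitting.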