# Spam Detector Revision
* By Wai Ping Jerry KWOK
* Created on 2023-11-25

### Introduction
You will build a neural network (NN) to improve the spam classifier from the previous project.

### Objectives
Improve the spam classifier from the previous project.

### Requirements
1. Python Libraries: Ensure you have *librosa*, *keras*, and *tensorflow* installed in your Python environment.
2. Note: You may use the code snippets and functions provided in class or use your own novel approach to accomplish the tasks outlined in the problems.

### **Spam Classifier**

#### **Overview**
In this part of the assignment, you will develop a neural network spam classifier using comments from YouTube videos. The classifier will distinguish between spam and non-spam comments. You will use Python, Keras for neural network modeling, and Pandas for data manipulation. This assignment will guide you through data preparation, model construction, and evaluation. You are free to answer the questions as you wish and need not follow the instructions explicitly, but you must carry out the same tasks and provide the same outputs.

#### **Spam Classifier Tasks**

In [49]:
import warnings
warnings.filterwarnings('ignore')

In [50]:
# import libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import StratifiedKFold

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Activation, Dropout
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.text import Tokenizer

#### **1. Data Loading and Preparation**
Task: Load and concatenate data from multiple CSV files using Pandas.
* Load comments from five CSV files named *Youtube01-Psy.csv*, *Youtube02-KatyPerry.csv*, *Youtube03-LMFAO.csv*, *Youtube04-Eminem.csv*, and *Youtube05-Shakira.csv*.

In [144]:
# read in the data
inp_psy = pd.read_csv('c:/Users/jerry/OneDrive - Red River College Polytech/Documents/COMP3703_intro_to_a_i/module_4_neural_nets/project/Youtube01-Psy.csv')
inp_kat = pd.read_csv('c:/Users/jerry/OneDrive - Red River College Polytech/Documents/COMP3703_intro_to_a_i/module_4_neural_nets/project/Youtube02-KatyPerry.csv')
inp_lmf = pd.read_csv('c:/Users/jerry/OneDrive - Red River College Polytech/Documents/COMP3703_intro_to_a_i/module_4_neural_nets/project/Youtube03-LMFAO.csv')
inp_emi = pd.read_csv('c:/Users/jerry/OneDrive - Red River College Polytech/Documents/COMP3703_intro_to_a_i/module_4_neural_nets/project/Youtube04-Eminem.csv')
inp_sha = pd.read_csv('c:/Users/jerry/OneDrive - Red River College Polytech/Documents/COMP3703_intro_to_a_i/module_4_neural_nets/project/Youtube05-Shakira.csv')

* Concatenate these files into a single DataFrame.

In [145]:
# concatenate all the dataframes
inp_all = pd.concat([inp_psy, inp_kat, inp_lmf, inp_emi, inp_sha])

# check the shape of the data
print(inp_all.shape)

(1956, 5)
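
The load-and-concatenate steps above can also be sketched as a single helper. This is a minimal sketch, not the notebook's method: `load_comments` and its `data_dir` argument are hypothetical names, and `ignore_index=True` assumes a fresh 0..n-1 index is wanted, since a plain `pd.concat` keeps each file's own row labels and therefore repeats them.

```python
from pathlib import Path

import pandas as pd

# file names taken from the task description
FILENAMES = [
    'Youtube01-Psy.csv',
    'Youtube02-KatyPerry.csv',
    'Youtube03-LMFAO.csv',
    'Youtube04-Eminem.csv',
    'Youtube05-Shakira.csv',
]


def load_comments(data_dir, filenames=FILENAMES):
    """Hypothetical helper: read each CSV in data_dir and stack them into one DataFrame."""
    frames = [pd.read_csv(Path(data_dir) / name) for name in filenames]
    # ignore_index=True renumbers the rows 0..n-1 instead of repeating each file's index
    return pd.concat(frames, ignore_index=True)
```

Because the later steps select rows by position with `iloc`, either indexing choice works with the rest of the notebook.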


In [53]:
# check the first few rows of the data
inp_all.head()

| | COMMENT_ID | AUTHOR | DATE | CONTENT | CLASS |
|---|---|---|---|---|---|
| 0 | LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU | Julius NM | 2013-11-07T06:20:48 | Huh, anyway check out this you[tube] channel: ... | 1 |
| 1 | LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A | adam riyati | 2013-11-07T12:37:15 | Hey guys check out my new channel and our firs... | 1 |
| 2 | LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8 | Evgeny Murashkin | 2013-11-08T17:34:21 | just for test I have to say murdev.com | 1 |
| 3 | z13jhp0bxqncu512g22wvzkasxmvvzjaz04 | ElNino Melendez | 2013-11-09T08:28:43 | me shaking my sexy ass on my channel enjoy ^_^ | 1 |
| 4 | z13fwbwp1oujthgqj04chlngpvzmtt3r3dw | GsMega | 2013-11-10T16:05:38 | watch?v=vtaRGgvGtWQ Check this out . | 1 |


* Shuffle the DataFrame rows randomly.

In [54]:
# shuffle the inp_all dataframe
inp_all_shu = inp_all.sample(frac=1, random_state=80)

#### **2. Setting Up Cross-Validation**
Task: Implement Stratified K-Fold Cross-Validation.
* Use the *StratifiedKFold* class from *sklearn.model_selection* to create cross-validation splits. Set the number of splits to 5.

In [55]:
# create cross validation splits
skfold = StratifiedKFold(n_splits=5)

* Ensure that the splits are stratified based on the *CLASS* column in your DataFrame.

In [56]:
# splits on CLASS
splits = skfold.split(inp_all_shu, inp_all_shu['CLASS'])

* Output the pair of *train* and *test* indices from your splits in the previous task. *train* contains the indices of the dataset that should be used for training in this particular split. *test* contains the indices of the dataset that should be used for testing in this particular split. The loop should print "Split" to mark the start of each new split, then print the test indices. Sample output:

In [57]:
# print the train, test splits
for train, test in splits:
    print('Split')
    print(test)

Split
[  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161
 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197
 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215
 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233
 234 235 236 237 238 239 240 241 242 243 244 
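
To see what stratification does, a small sketch with made-up labels (not the YouTube data) can check that every test fold keeps the overall class ratio. It also wraps `split` in `list(...)`, because `split` returns a generator that is exhausted after one pass, which is why the splits are recreated before training later in this notebook. The arrays `X_toy` and `y_toy` are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# toy data: 60 non-spam (0) and 40 spam (1) samples, illustration only
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.zeros((len(y_toy), 1))  # features play no role in how the folds are formed

skfold_toy = StratifiedKFold(n_splits=5)

# materialize the generator so the folds can be iterated more than once
folds = list(skfold_toy.split(X_toy, y_toy))

for train_idx, test_idx in folds:
    # fold size and fraction of spam in this test fold
    print(len(test_idx), y_toy[test_idx].mean())
```

Each test fold here holds 20 samples with a spam fraction equal to the overall 0.4.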

#### **3. Tokenization and Text-to-Matrix Conversion**
Task: Convert text data into a numerical format suitable for a neural network.
* Create a function *prepare_data* that takes indices for training and testing data.

In [58]:
# create a function to take indices and return the train and test data
def prepare_data(dataset, train_index, test_index):
    # get the train data
    train_data = dataset['CONTENT'].iloc[train_index]

    # get the test data
    test_data = dataset['CONTENT'].iloc[test_index]

    return train_data, test_data

* Use the *Tokenizer* from *keras.preprocessing.text* to tokenize the text comments.
* Limit the number of words to 2000.
* Convert the text comments to a TF-IDF matrix format.

In [59]:
# setup the constant
MAX_WORDS = 2000

In [60]:
# create a function to tokenize the data
def tokenize(train_data, test_data):
    # create a tokenizer
    tokenizer = Tokenizer(num_words=MAX_WORDS)

    # fit the tokenizer on the training data
    # only learn the words in the training data
    tokenizer.fit_on_texts(train_data)

    # transform the training and testing data using tf-idf
    X_train = tokenizer.texts_to_matrix(train_data, mode='tfidf')
    X_test = tokenizer.texts_to_matrix(test_data, mode='tfidf')

    return X_train, X_test
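
The key point in `tokenize` is that the vocabulary is learned from the training comments only and then applied to both sets. The same pattern can be sketched with scikit-learn's `TfidfVectorizer`. This is a swapped-in technique, not the Keras `Tokenizer` used above, and its TF-IDF weighting differs, so the numbers will not match. The sample comments are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

MAX_WORDS = 2000  # same vocabulary cap as in the notebook

# made-up comments, for illustration only
train_comments = ['check out my channel', 'love this song', 'subscribe to my channel please']
test_comments = ['check out this brand new song']

vectorizer = TfidfVectorizer(max_features=MAX_WORDS)

# learn the vocabulary and IDF weights from the training comments only
X_train_demo = vectorizer.fit_transform(train_comments).toarray()

# reuse that vocabulary; words unseen in training ('brand', 'new') are dropped
X_test_demo = vectorizer.transform(test_comments).toarray()

print(X_train_demo.shape, X_test_demo.shape)
```

Unlike `texts_to_matrix`, which always returns `num_words` columns, the vectorizer returns one column per learned word, up to `max_features`.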

#### **4. Data Preprocessing**
Task: Normalize the TF-IDF matrix.
* Divide the matrix by its maximum absolute value.
* Subtract the mean from the matrix.

In [61]:
# create a function to normalize the matrix
def normalize(X_train, X_test):
    # divide the data by the max value
    X_train_norm = X_train / np.amax(np.absolute(X_train))
    X_test_norm = X_test / np.amax(np.absolute(X_test))

    # subtract the mean
    X_train_norm = X_train_norm - np.mean(X_train_norm)
    X_test_norm = X_test_norm - np.mean(X_test_norm)

    return X_train_norm, X_test_norm
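
The `normalize` function above scales each matrix by its own maximum and mean. A common alternative, sketched here as an assumption rather than a requirement of the task, is to compute both statistics on the training matrix and reuse them for the test matrix, so that both are transformed identically. The name `normalize_with_train_stats` is hypothetical.

```python
import numpy as np


def normalize_with_train_stats(X_train, X_test):
    """Hypothetical variant: scale and center both matrices with training-set statistics."""
    # maximum absolute value of the training matrix
    scale = np.amax(np.absolute(X_train))

    X_train_norm = X_train / scale
    X_test_norm = X_test / scale

    # mean of the scaled training matrix, reused for the test matrix
    train_mean = np.mean(X_train_norm)

    return X_train_norm - train_mean, X_test_norm - train_mean


# tiny made-up matrices to show the effect
X_train_demo = np.array([[0.0, 2.0], [4.0, 2.0]])
X_test_demo = np.array([[2.0, 6.0]])
train_norm, test_norm = normalize_with_train_stats(X_train_demo, X_test_demo)
print(train_norm)
print(test_norm)
```

With this variant the test values can fall outside the training range, since the scale comes from the training matrix alone.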

In [62]:
# create a function to prepare the labels with one-hot encoding
def prepare_labels(dataset, train_index, test_index):
    # get the labels
    train_labels = dataset['CLASS'].iloc[train_index]
    test_labels = dataset['CLASS'].iloc[test_index]

    # one-hot encode the labels
    train_labels = to_categorical(train_labels)
    test_labels = to_categorical(test_labels)

    return train_labels, test_labels
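
For reference, one-hot encoding turns each label into a row with a single 1 in the column for its class. A NumPy-only sketch of the layout that `to_categorical` produces for these binary labels, using made-up labels:

```python
import numpy as np

# made-up CLASS values: 0 = not spam, 1 = spam
labels = np.array([1, 0, 1, 1])

# row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(2)[labels]

print(one_hot)
```

The two columns line up with the two output units of the softmax layer defined in the next section.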

In [63]:
# create a function to obtain the train and test data and labels
def prepare_data_and_label(dataset, train_index, test_index):
    # get the train and test data
    train_data, test_data = prepare_data(dataset, train_index, test_index)

    # tokenize the train and test data
    X_train, X_test = tokenize(train_data, test_data)

    # normalize the train and test data
    X_train_norm, X_test_norm = normalize(X_train, X_test)

    # get the train and test labels
    train_labels, test_labels = prepare_labels(dataset, train_index, test_index)

    return X_train_norm, X_test_norm, train_labels, test_labels

#### **5. Building the Neural Network Model**
Task: Define and compile a neural network model.
* Use the *Sequential* model from Keras.
* Add a Dense layer with 512 units and '*relu*' activation, followed by a Dropout layer with 0.5 dropout rate.
* Add another Dense layer for classification and use '*softmax*' activation.

In [64]:
# construct a model
model = Sequential()

# add the input layer (one feature per word in the vocabulary)
model.add(Input(shape=(MAX_WORDS,)))

# add the hidden layer
model.add(Dense(units=512))
model.add(Activation('relu'))
model.add(Dropout(0.5))

# add the output layer
model.add(Dense(units=2))
model.add(Activation('softmax'))

* Compile the model with '*categorical_crossentropy*' loss and '*adamax*' optimizer.

In [65]:
# compile the model
model.compile(loss='categorical_crossentropy', optimizer='adamax', metrics=['accuracy'])

# show the model summary
print(model.summary())

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_27 (Dense)             (None, 512)               1024512   
_________________________________________________________________
activation_26 (Activation)   (None, 512)               0         
_________________________________________________________________
dropout_5 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_28 (Dense)             (None, 2)                 1026      
_________________________________________________________________
activation_27 (Activation)   (None, 2)                 0         
Total params: 1,025,538
Trainable params: 1,025,538
Non-trainable params: 0
_________________________________________________________________
None
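
The parameter counts in the summary follow from the layer sizes: a Dense layer has one weight per input-output pair plus one bias per output unit. The arithmetic can be checked directly:

```python
MAX_WORDS = 2000     # input features
HIDDEN_UNITS = 512   # units in the hidden Dense layer
NUM_CLASSES = 2      # units in the output Dense layer

hidden_params = MAX_WORDS * HIDDEN_UNITS + HIDDEN_UNITS    # weights + biases
output_params = HIDDEN_UNITS * NUM_CLASSES + NUM_CLASSES   # weights + biases

print(hidden_params, output_params, hidden_params + output_params)
# → 1024512 1026 1025538
```

The Activation and Dropout layers add no parameters, which is why they show 0 in the summary.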


#### **6. Model Training and Evaluation**
Task: Train and evaluate the model.
* Write a function *train_and_test* that takes training and testing indices.
* Train the model on the training data and evaluate it on the test data.
* Use the accuracy metric to evaluate the model's performance.

In [66]:
# create a function to take training and testing indices
def train_and_test(dataset, train_index, test_index, model):
    # get the train and test data and labels
    X_train, X_test, y_train, y_test = prepare_data_and_label(dataset, train_index, test_index)

    # train the model
    model.fit(X_train, y_train, epochs=10, batch_size=16)

    # evaluate the model on the test data
    print('-'*50)
    print('Evaluate on test data')
    results = model.evaluate(X_test, y_test, batch_size=16)
    print(f'test loss, test acc: {results}')
    print('-'*50)

    return results

#### **7. Cross-Validation Scores**
Task: Calculate and print the mean and standard deviation of the cross-validation scores.
* Run the *train_and_test* function for each split in the cross-validation.
* Collect the accuracy scores from each run and calculate their mean and standard deviation. Print the loss and accuracy output for each of the 10 epochs. Sample output for a fold should look as follows:

In [67]:
# create a list to store the accuracy
cv_scores = []

# create cross validation splits
splits = skfold.split(inp_all_shu, inp_all_shu['CLASS'])

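# iterate through the splits
# note: the same model instance is passed to every fold, so its weights carry over from one fold to the next
for train_index, test_index in splits:
    # train and test the model
    cv_score = train_and_test(inp_all_shu, train_index, test_index, model)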

    # append the accuracy to the list
    cv_scores.append(cv_score[1])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
--------------------------------------------------
Evaluate on test data
test loss, test acc: [0.12118500927273108, 0.96683675]
--------------------------------------------------
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
--------------------------------------------------
Evaluate on test data
test loss, test acc: [0.25380686009326553, 0.9181586]
--------------------------------------------------
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
--------------------------------------------------
Evaluate on test data
test loss, test acc: [0.17286814514861998, 0.9360614]
--------------------------------------------------
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
---------------------
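
Because `model` is created once and passed to every call of `train_and_test`, each fold continues training from the weights left by the previous fold, so later folds are evaluated on comments the network has already been trained on. One way to avoid this is to build a new model inside the loop. The sketch below shows only that pattern, with scikit-learn's `LogisticRegression` as a lightweight stand-in for the Keras network and random toy data; `make_model` and `cross_validate_fresh` are hypothetical names. In the notebook, `make_model` would instead construct and compile the `Sequential` model, and the score would come from `model.evaluate`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold


def make_model():
    """Hypothetical factory: returns a new, untrained model on every call."""
    return LogisticRegression()


def cross_validate_fresh(make_model, X, y, n_splits=5):
    """Train and score a brand-new model on each stratified fold."""
    skfold = StratifiedKFold(n_splits=n_splits)
    scores = []
    for train_index, test_index in skfold.split(X, y):
        model = make_model()  # nothing learned in earlier folds carries over
        model.fit(X[train_index], y[train_index])
        scores.append(model.score(X[test_index], y[test_index]))
    return np.array(scores)


# toy data: the label depends on the first feature only, illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 0] > 0).astype(int)

fold_scores = cross_validate_fresh(make_model, X_toy, y_toy)
print(fold_scores.mean(), fold_scores.std())
```

Passing a factory rather than a model instance makes it hard to reuse trained weights by accident.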

* Print the mean and standard deviation of all of these scores.

In [68]:
# print the mean and standard deviation of the accuracy
print(f'Mean Accuracy: {np.mean(cv_scores)}')
print(f'Standard Deviation of Accuracy: {np.std(cv_scores)}')

Mean Accuracy: 0.9243136644363403
Standard Deviation of Accuracy: 0.028987307101488113
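
Note that `np.std` computes the population standard deviation (`ddof=0`) by default. With only five folds, the sample standard deviation (`ddof=1`) is somewhat larger. A minimal sketch with made-up scores, not the results above:

```python
import numpy as np

# hypothetical fold accuracies, for illustration only
demo_scores = np.array([0.90, 0.92, 0.94, 0.96, 0.98])

mean_acc = np.mean(demo_scores)
std_population = np.std(demo_scores)       # divides by n
std_sample = np.std(demo_scores, ddof=1)   # divides by n - 1

print(f'Mean Accuracy: {mean_acc:.4f}')
print(f'Population std: {std_population:.4f}, sample std: {std_sample:.4f}')
```

Either is acceptable as long as the report states which one was used.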
