## Spam Email Classifier with KNN using TF-IDF scores

1.   Assignment must be implemented in Python 3 only.
2.   You are allowed to use libraries for data preprocessing (numpy, pandas, nltk, etc.), evaluation metrics, and data visualization (matplotlib, etc.).
3.   You will be evaluated not just on the overall performance of the model but also on your experimentation with hyperparameters, data preprocessing techniques, etc.
4.   The report file must be a well-documented Jupyter notebook that explains the experiments you performed, the evaluation metrics, and the corresponding code. The code must run and reproduce the reported accuracies, figures, graphs, etc.
5.   For all the questions, you must create a train-validation data split and evaluate your hyperparameter tuning on the validation set. Your Jupyter notebook must reflect this.
6.   Strict plagiarism checking will be done. An F will be awarded for plagiarism.

**Task: Given an email, classify it as spam or ham**

The input text file ("emails.txt") contains 5572 email messages, one per row, and each row carries its label (spam/ham).

This task also requires basic pre-processing of the text (removing stopwords, stemming/lemmatizing, replacing each email address with 'email-tag', etc.).
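As a rough illustration of the pre-processing steps listed above, the sketch below lower-cases a message, replaces email addresses with a tag, tokenizes, and drops stop words. The names `STOPWORDS`, `EMAIL_RE`, and `preprocess` are hypothetical, the stop-word list is a tiny stand-in for nltk's English list, and stemming is left out to keep the example dependency-free. The tag is written as `emailtag` here because a hyphen would be split by the tokenizer.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# a real run would use nltk's English stop words instead.
STOPWORDS = {"a", "an", "the", "to", "is", "at", "me", "you"}

# Simplified email pattern (an assumption, not a full RFC-compliant matcher).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def preprocess(text):
    """Lower-case, tag email addresses, tokenize, and drop stop words."""
    text = text.casefold()
    text = EMAIL_RE.sub(" emailtag ", text)
    tokens = re.split(r"[^a-z0-9]+", text)
    # discard empty strings produced by the split, then stop words
    return [t for t in tokens if t and t not in STOPWORDS]

print(preprocess("Mail me at John.Doe@example.com to claim the PRIZE!"))
# ['mail', 'emailtag', 'claim', 'prize']
```

Replacing addresses before tokenizing matters: otherwise the address is broken into unrelated fragments such as `john`, `doe`, and `com`.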

You are required to find the tf-idf scores for the given data and use them to perform KNN using Cosine Similarity.
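For reference, here is a minimal sketch of one common TF-IDF variant (relative term frequency times `log(N / df)`) together with cosine similarity on sparse dictionaries. The function names `tfidf_vectors` and `cosine` and the toy documents are assumptions made for illustration; other TF-IDF weightings exist, and this is not necessarily the one used in the cells that follow.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy corpus (hypothetical), already tokenized.
docs = [["free", "prize", "call"], ["free", "prize", "now"], ["meeting", "at", "noon"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # shares 'free' and 'prize', so greater than 0
print(cosine(vecs[0], vecs[2]))  # no shared terms, so 0.0
```

A term that occurs in every document gets weight `log(1) = 0`, which is how IDF suppresses words that carry no discriminating information.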

### Import necessary libraries

In [1]:
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score
import re, math

stopwords_set = set(stopwords.words('english'))
stemmer = SnowballStemmer(language='english')

### Load dataset

In [2]:
with open('emails.txt', 'r') as file:
    crudeData = file.readlines()

### Preprocess data

In [3]:
verdict = [0]
score = dict()
modValue = dict()
stemmedWords = dict()
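id = 0
for line in crudeData:
    line = line.casefold()
    splitt = re.split(r'[^A-Za-z0-9]+', line)
    if len(splitt) < 2:
        # skip malformed rows without consuming an id, so that
        # verdict, score and modValue stay aligned
        print('skipped:', splitt)
        continue
    id += 1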
    verdict.append(0)
    if splitt[0] == "spam":
        verdict[id] = 1
    else:
        verdict[id] = 0

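    # number of pieces after the label (splitt[0] is the label)
    totalCount = len(splitt) - 1
    counts = dict()
    # count stemmed, non-stop-word tokens; empty strings are skipped
    for i in splitt[1:]:
        if i == '':
            continue
        if i not in stopwords_set:
            if i not in stemmedWords:
                stemmedWords[i] = stemmer.stem(i)
            if stemmedWords[i] not in counts:
                counts[stemmedWords[i]] = 0
            counts[stemmedWords[i]] += 1

    # weight of each word: log(1 + count) * log(totalCount / count).
    # Note: the second factor is computed from this message's own length,
    # not from the corpus-wide document frequency used in standard IDF.
    score[id] = dict()
    modValue[id] = 0
    for word in counts:
        score[id][word] = math.log(1 + counts[word]) * math.log(totalCount/counts[word])
        modValue[id] += score[id][word]**2
    # modValue[id] is the Euclidean norm of the message's weight vector
    modValue[id] = (modValue[id])**(0.5)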

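def cosineSimilarity(a, b):
    # cosine similarity between documents a and b over their shared words;
    # returns 0 when a == b so a document is never its own nearest neighbour
    if modValue[a] * modValue[b] == 0 or a == b:
        return 0
    similarity = 0
    for word in score[a]:
        if word in score[b]:
            similarity += score[b][word]*score[a][word]
    similarity /= (modValue[a]*modValue[b])
    return similarity

# dist[i] holds [similarity, j] pairs for every document j, most similar first
dist = [[] for i in range(0, id+1)]
# document ids run from 1 to id inclusive
for i in range(1, id+1):
    for j in range(1, id+1):
        dist[i].append([cosineSimilarity(i, j), j])
    dist[i].sort(reverse=True)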


### Split data

In [4]:
import numpy as np

x = np.arange(id+1)
x_train, x_validation, y_train, y_validation = train_test_split(x[1:], verdict[1:], test_size = 0.2, random_state=13)

### Train your KNN model (reuse previously implemented model built from scratch) and test on your data

In [6]:
from sklearn import metrics as skm

'''find knn,
iterate over a set of k and for each k,
try different criteria to assign a verdict on spam or not spam

based on the result from above, plot a graph to understand
which parameters performed the best'''
def KNNAlgorithm(k, dist, metrics):
    global x_train, x_validation, y_train, y_validation

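    # for every spam document in the training set, count the spam messages
    # among its k nearest neighbours; the total divided by the size of the
    # training set is the neighbour count needed to label a document as spam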
    spamThreshold = 0
    for x in range(len(x_train)):
        if y_train[x] == 1:
            temp = dist[x_train[x]][:k]
            for i in temp:
                spamThreshold += verdict[i[1]]
    spamThreshold /= len(x_train)

    y_out = []
    for x in range(len(x_validation)):
        temp = dist[x_validation[x]][:k]
        count = 0
        y_out.append(0)
        for i in temp:
            count += verdict[i[1]]
        if count >= spamThreshold:
            y_out[x] = 1
        else:
            y_out[x] = 0
    
    # print('Spam threshold =', spamThreshold, "- for k =", k, "- with following results") 
    # disp = skm.ConfusionMatrixDisplay(skm.confusion_matrix(y_validation, y_out))
    # disp.plot()
    ar = [
        skm.confusion_matrix(y_validation, y_out).ravel(), 
        skm.accuracy_score(y_validation, y_out), 
        skm.recall_score(y_validation, y_out),
        skm.f1_score(y_validation, y_out)
        ]
    metrics.append(ar)


Below, we run KNN with the cosine similarity measure for varying values of k.

In [8]:
k_list = [1, 3, 5, 7, 11, 17, 23, 28]

'''
metrics is a dictionary with distance methods as keys 
and each key corresponds to an array of values for different evaluation metrics

each array of metrics is of form [confusion matrix, accuracy, recall, f1-score]
'''
metrics = dict()
def applyAlgorithm(distanceMethod):
    global metrics, k_list
    metrics[distanceMethod] = []
    for k in k_list:
        KNNAlgorithm(k, dist, metrics[distanceMethod])


***1. Experiment with different distance measures [Euclidean distance, Manhattan distance, Hamming Distance] and compare with the Cosine Similarity distance results.***
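The three distance measures named in this question can be sketched on sparse `{word: weight}` vectors as follows. The function names are hypothetical, and Hamming distance is applied here to the binary presence/absence of each word, which is one possible interpretation for text data rather than the only one.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two sparse {word: weight} vectors."""
    keys = u.keys() | v.keys()
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in keys))

def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    keys = u.keys() | v.keys()
    return sum(abs(u.get(t, 0.0) - v.get(t, 0.0)) for t in keys)

def hamming(u, v):
    """Number of words present in exactly one of the two documents
    (assumption: vectors are compared as binary presence/absence)."""
    return len(u.keys() ^ v.keys())

u = {"free": 3.0, "prize": 4.0}
v = {"free": 3.0}
print(euclidean(u, v), manhattan(u, v), hamming(u, v))  # 4.0 4.0 1
```

Unlike cosine similarity, these are distances: a smaller value means a closer neighbour, so neighbour lists built from them would be sorted in ascending order rather than descending.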

In [9]:
applyAlgorithm('cosine')
print(metrics)

{'cosine': [[array([975,   1,  20, 119]), 0.9811659192825112, 0.8561151079136691, 0.918918918918919], [array([973,   3,   9, 130]), 0.989237668161435, 0.935251798561151, 0.9558823529411764], [array([962,  14,   6, 133]), 0.9820627802690582, 0.9568345323741008, 0.93006993006993], [array([951,  25,   5, 134]), 0.9730941704035875, 0.9640287769784173, 0.8993288590604027], [array([945,  31,   7, 132]), 0.9659192825112107, 0.9496402877697842, 0.8741721854304636], [array([922,  54,   5, 134]), 0.947085201793722, 0.9640287769784173, 0.8195718654434251], [array([940,  36,   7, 132]), 0.9614349775784753, 0.9496402877697842, 0.8599348534201955], [array([921,  55,   4, 135]), 0.947085201793722, 0.9712230215827338, 0.8206686930091185]]}


***2. Explain which distance measure works best and why. Explore the distance measures and weigh their pros and cons in different application settings.***

***3. Report Confusion matrix along with accuracy, recall, precision and F1-score in the form of a table***
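One possible way to build such a table is to derive every metric, including precision (which is not stored in the `metrics` list), from the confusion matrices printed in the 'cosine' output above. The helper `summarize` is a hypothetical name; the matrices are copied from that output in the `(tn, fp, fn, tp)` order that `confusion_matrix(...).ravel()` returns for binary labels.

```python
# Confusion matrices copied from the 'cosine' output, one per value of k,
# in the order (tn, fp, fn, tp).
k_list = [1, 3, 5, 7, 11, 17, 23, 28]
confusions = [
    (975, 1, 20, 119), (973, 3, 9, 130), (962, 14, 6, 133), (951, 25, 5, 134),
    (945, 31, 7, 132), (922, 54, 5, 134), (940, 36, 7, 132), (921, 55, 4, 135),
]

def summarize(tn, fp, fn, tp):
    """Return (accuracy, precision, recall, f1) for one confusion matrix."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return accuracy, precision, recall, f1

print(f"{'k':>3} {'TN':>4} {'FP':>4} {'FN':>4} {'TP':>4} "
      f"{'acc':>7} {'prec':>7} {'rec':>7} {'F1':>7}")
for k, (tn, fp, fn, tp) in zip(k_list, confusions):
    acc, prec, rec, f1 = summarize(tn, fp, fn, tp)
    print(f"{k:>3} {tn:>4} {fp:>4} {fn:>4} {tp:>4} "
          f"{acc:>7.4f} {prec:>7.4f} {rec:>7.4f} {f1:>7.4f}")
```

Deriving the metrics from the confusion matrix keeps the table consistent with the stored results without re-running the classifier.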

***4. Choose different K values (k=1,3,5,7,11,17,23,28) and experiment. Plot a graph showing R2 score vs k.***

### Train and test Sklearn's KNN classifier model on your data (use metric which gave best results on your experimentation with built-from-scratch model.)
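A minimal sketch of this step, assuming scikit-learn's `TfidfVectorizer` and `KNeighborsClassifier` with `metric="cosine"` (cosine distance requires the brute-force search). The toy messages and labels below are invented for illustration only; in the notebook they would be replaced by the pre-processed training and validation messages, and `n_neighbors` by the best k found earlier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data (1 = spam, 0 = ham), not taken from emails.txt.
train_texts = [
    "win a free prize now",
    "free cash prize claim now",
    "urgent claim your free reward",
    "are we meeting for lunch today",
    "see you at the meeting tomorrow",
    "lunch tomorrow sounds good",
]
train_labels = [1, 1, 1, 0, 0, 0]
val_texts = ["claim your free prize", "meeting at lunch"]

# Fit the vocabulary and IDF weights on the training messages only,
# then reuse them to transform the validation messages.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_val = vectorizer.transform(val_texts)

knn = KNeighborsClassifier(n_neighbors=3, metric="cosine", algorithm="brute")
knn.fit(X_train, train_labels)
pred = knn.predict(X_val)
print(pred)
```

Fitting the vectorizer on the training split alone keeps validation messages from influencing the IDF weights, which makes the comparison with the built-from-scratch model fairer.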

***Compare the results of both models.***

***What is the time complexity of training using KNN classifier?***

***What is the time complexity while testing? Is KNN a linear classifier or can it learn any boundary?***
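To make these questions concrete, here is a small brute-force KNN sketch. The class `BruteForceKNN` is hypothetical and uses squared Euclidean distance on dense points. In this brute-force form, fitting only stores the data, while each prediction computes one distance per stored point (roughly O(n · d) for n points of dimension d, plus the sort). Index-based variants such as KD-trees have different costs.

```python
from collections import Counter

class BruteForceKNN:
    def __init__(self, k=1):
        self.k = k

    def fit(self, points, labels):
        # "training" only stores the data; no parameters are learned
        self.points = list(points)
        self.labels = list(labels)
        return self

    def predict_one(self, query):
        # one squared distance per stored point: O(n * d) work per query
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, query)), label)
            for p, label in zip(self.points, self.labels)
        )
        # majority vote among the k closest stored points
        votes = Counter(label for _, label in dists[: self.k])
        return votes.most_common(1)[0][0]

# XOR labels are not linearly separable, yet 1-NN reproduces them.
xor_points = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_labels = [0, 1, 1, 0]
model = BruteForceKNN(k=1).fit(xor_points, xor_labels)
queries = [(0.1, 0.1), (0.1, 0.9), (0.9, 0.1), (0.9, 0.9)]
print([model.predict_one(q) for q in queries])  # [0, 1, 1, 0]
```

The XOR example suggests that the decision boundary follows the stored points rather than a single hyperplane, which is relevant to the question of whether KNN is a linear classifier.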