# Assignment: Classification

Classification refers to categorizing given data into classes. For example:
- Given an image of a handwritten character, identifying the character (multi-class classification)
- Given an image, annotating it with all the objects present in it (multi-label classification)
- Classifying an email as spam or non-spam (binary classification)
- Classifying a tumor as benign or malignant (binary classification)

In this assignment, we will build a classifier that labels emails as spam or non-spam, using the Kaggle dataset [Spam or Not Spam Dataset](https://www.kaggle.com/datasets/ozlerhakan/spam-or-not-spam-dataset?resource=download).

**Note**: You may not import any libraries other than the ones mentioned in this notebook.




In [1]:
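# load the dataset: each row is "<email text>,<label>"
data_unfiltered = []  # raw email texts
data_values = []      # integer labels, in the same order as the emails
with open("spam_or_not_spam.csv", "r", encoding="utf8") as data_file:
    data_file.readline()  # skip the header row
    for line in data_file:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # split on the last comma only, so commas inside the email text are preserved
        email, label = line.rsplit(",", 1)
        data_unfiltered.append(email)
        data_values.append(int(label))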


### Data pre-processing
The first step in most machine learning pipelines is to convert the raw data into a meaningful representation. We will use the [Bag-of-Words](https://towardsdatascience.com/a-simple-explanation-of-the-bag-of-words-model-b88fc4f4971) representation to process the text. It consists of the following steps:

- Process emails line-by-line to extract all the words.
- Replace each extracted word with its stem (root) word. This is known as stemming; lemmatization is a closely related technique.
- Remove stop words such as and, or, is, and am.
- Assign a unique index to each word. This forms the vocabulary.
- Represent each email as a binary vector whose length equals the size of the vocabulary, such that the $i^{th}$ element of the vector is 1 iff the $i^{th}$ word is present in the email.
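
The steps above can be sketched on two toy emails. This is only an illustration: `toy_stem`, `TOY_STOP_WORDS`, and the other `toy_*` names are hypothetical stand-ins for the NLTK Porter stemmer and stop-word list, chosen so the example runs without any downloads.

```python
# Toy illustration only: the stop-word set and suffix-stripping "stemmer"
# below are simplified stand-ins, not the NLTK stop-word list or Porter stemmer.
TOY_STOP_WORDS = {"and", "or", "is", "am", "the", "a"}

def toy_stem(word):
    # crude stemming: strip a few common suffixes from sufficiently long words
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_preprocess(email):
    # extract words, stem them, then drop stop words
    stems = [toy_stem(word) for word in email.lower().split()]
    return [stem for stem in stems if stem not in TOY_STOP_WORDS]

def toy_vocabulary(token_lists):
    # assign each new word the next free index, in order of first appearance
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def toy_bow(tokens, vocab):
    # binary vector: element i is 1 iff the word with index i is present
    vector = [0] * len(vocab)
    for token in tokens:
        vector[vocab[token]] = 1
    return vector

emails = ["the offer is ending", "meeting and offers"]
token_lists = [toy_preprocess(email) for email in emails]
vocab = toy_vocabulary(token_lists)
vectors = [toy_bow(tokens, vocab) for tokens in token_lists]
print(vocab)    # {'offer': 0, 'end': 1, 'meet': 2}
print(vectors)  # [[1, 1, 0], [1, 0, 1]]
```

Both emails share the stem `offer`, so their vectors agree in position 0 and differ elsewhere.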

Here we provide you with the function signatures along with the expected functionality. You are expected to complete them accordingly.

In [2]:
import numpy as np
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

import nltk
nltk.download('stopwords')

# takes an email as an argument
# extracts all whitespace-separated words
# returns the list of extracted words
def read_email(email):
    # split() with no argument splits on any whitespace and drops empty strings
    return email.split()

# takes a list of words as an argument
# replaces each word by its stem
# returns the list of stem words
def stemming(words):
    ps = PorterStemmer()
    return [ps.stem(word) for word in words]

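# takes a list of stem-words as an argument
# removes stop words
# returns the list of stem words after removing stop words
# assumption: the emails are in English, so only the English stop words are used
# the set is built once, and set lookups are O(1)
STOP_WORDS = set(stopwords.words("english"))

def remove_stop_words(stem_words):
    return [word for word in stem_words if word not in STOP_WORDS]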

# takes a list of stem-word lists (one list per email) as an argument
# adds each new word to the vocabulary; a word's position in the list is its unique index
# returns the vocabulary
def build_vocabulary(stem_words_list):
    vocab = []
    seen = set()  # set membership is O(1), unlike searching the vocab list
    for stem_words in stem_words_list:
        for word in stem_words:
            if word not in seen:
                seen.add(word)
                vocab.append(word)
    return vocab

# takes a list of stem-words and the vocabulary as arguments
# returns the binary bow representation: element i is 1 iff vocab[i] occurs in the email
def get_bow(stem_words, vocab):
    present = set(stem_words)  # one pass over the email instead of one per vocabulary word
    return [1 if word in present else 0 for word in vocab]

# read the entire dataset
# convert each email to its bow representation (labels stay in data_values, in the same order)
# also write the bow vectors to data_cleaned.csv, one email per row
def read_data(data_unfiltered):
    stem_words_list = []
    for index, email in enumerate(data_unfiltered):
        print("The email " + str(index) + " is being stemmed and stop words being removed")
        words = read_email(email)
        stem_words = stemming(words)
        stem_words_list.append(remove_stop_words(stem_words))

    print("Vocabulary is being generated")
    vocabulary = build_vocabulary(stem_words_list)

    data = []
    with open("data_cleaned.csv", "w") as data_file:
        for index, stem_words in enumerate(stem_words_list):
            print("The email " + str(index) + " is being pre-processed in a suitable representation")
            email_bow = get_bow(stem_words, vocabulary)
            data.append(email_bow)
            # join every entry, including the last one, with commas
            data_file.write(",".join(str(value) for value in email_bow) + "\n")
    return data

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\RISHABH\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# Do not run this cell: it takes a long time to execute
data = read_data(data_unfiltered)

### Data Visualization
Let's understand the data distribution:
- Visualize the frequency of word occurrence across all emails (spam + non-spam)
- Visualize the frequency of word occurrence for spam and non-spam emails separately
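
One possible way to obtain these frequencies is to sum the bag-of-words matrix column-wise, once over all rows and once per class. The sketch below is a hedged illustration on toy data: the helper names `word_frequencies` and `plot_top_words` are hypothetical, and it assumes that label 1 marks spam and label 0 marks non-spam.

```python
import numpy as np

def word_frequencies(bow_matrix, labels):
    # number of emails containing each word: overall, spam only, non-spam only
    # assumption: label 1 means spam and label 0 means non-spam
    bow = np.asarray(bow_matrix)
    labels = np.asarray(labels)
    overall = bow.sum(axis=0)
    spam = bow[labels == 1].sum(axis=0)
    non_spam = bow[labels == 0].sum(axis=0)
    return overall, spam, non_spam

def plot_top_words(counts, vocab, title, top_k=20):
    # matplotlib is imported here so the counting helper works without it
    import matplotlib.pyplot as plt
    order = np.argsort(counts)[::-1][:top_k]  # indices of the most frequent words
    plt.figure(figsize=(10, 4))
    plt.bar([vocab[i] for i in order], counts[order])
    plt.xticks(rotation=90)
    plt.ylabel("number of emails containing the word")
    plt.title(title)
    plt.tight_layout()
    plt.show()

# toy data: 3 vocabulary words, 4 emails (the first two are spam)
toy_vocab = ["offer", "end", "meet"]
toy_bow = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 1]]
toy_labels = [1, 1, 0, 0]
overall, spam, non_spam = word_frequencies(toy_bow, toy_labels)
print(overall, spam, non_spam)  # [3 1 2] [2 1 0] [1 0 2]
```

Calling `plot_top_words(spam, toy_vocab, "spam")` and `plot_top_words(non_spam, toy_vocab, "non-spam")` would then draw one bar chart per class.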

In [3]:
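# load the pre-computed bow vectors, one email per row
data = []
with open("data_cleaned.csv", "r") as data_file:
    for line in data_file:
        line = line.strip()
        if line:
            # convert the comma-separated entries from strings to integers
            data.append([int(value) for value in line.split(",")])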

In [None]:
import matplotlib.pyplot as plt

# visualize data distribution
def data_vis(data):
  return

data_vis(data)

### Learn a Classifier
Split the dataset randomly in the ratio 80:20 into training and test datasets. Use only the training dataset to learn the classifier. The test data must not be used during training; it is reserved for evaluation.
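
A random 80:20 split has to shuffle the feature vectors and their labels with the same permutation, otherwise rows and labels fall out of alignment. The following is a minimal sketch using only NumPy; the function name `split_80_20`, the `seed` parameter, and the toy arrays are assumptions made for illustration.

```python
import numpy as np

def split_80_20(features, labels, train_fraction=0.8, seed=0):
    features = np.asarray(features)
    labels = np.asarray(labels)
    # one shared permutation keeps every feature row aligned with its label
    order = np.random.default_rng(seed).permutation(len(features))
    cut = int(train_fraction * len(features))
    train_idx, test_idx = order[:cut], order[cut:]
    return (features[train_idx], labels[train_idx]), (features[test_idx], labels[test_idx])

# toy data: sample i has features [2i, 2i + 1] and label i % 2
toy_features = np.arange(20).reshape(10, 2)
toy_labels = np.arange(10) % 2
(train_x, train_y), (test_x, test_y) = split_80_20(toy_features, toy_labels)
print(train_x.shape, test_x.shape)  # (8, 2) (2, 2)
```

Fixing the seed makes the split reproducible, so both classifiers can be trained and evaluated on exactly the same partition.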

Now let us use ML algorithms to classify emails as spam or non-spam. You are supposed to use the [SVM](https://scikit-learn.org/stable/modules/svm.html) and [K-Nearest Neighbour](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) implementations available in scikit-learn, training both on the same training dataset.
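
As a rough sketch of how both scikit-learn models can be trained on the same data, the example below fits each one on a tiny hand-made training set and predicts on both splits. The helper `fit_and_predict`, the linear kernel, `n_neighbors=1`, and the toy vectors are illustrative assumptions, not prescribed settings.

```python
import numpy as np
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier

def fit_and_predict(model, train_x, train_y, test_x):
    # fit on the training split only, then predict on both splits
    model.fit(train_x, train_y)
    return model.predict(train_x), model.predict(test_x)

# toy binary vectors: class 1 emails contain word 0, class 0 emails contain word 2
train_x = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]])
train_y = np.array([1, 1, 0, 0])
test_x = np.array([[1, 0, 0], [0, 0, 1]])

svm_train_pred, svm_test_pred = fit_and_predict(
    svm.SVC(kernel="linear"), train_x, train_y, test_x)
knn_train_pred, knn_test_pred = fit_and_predict(
    KNeighborsClassifier(n_neighbors=1), train_x, train_y, test_x)
print(svm_test_pred, knn_test_pred)  # [1 0] [1 0]
```

On the real dataset, the kernel and the number of neighbours are hyperparameters worth tuning, using the training data only.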

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm

# split dataset
def split(data):
  return train_data, test_data

# learn an SVM model
# use the model to make predictions
# return the model predictions on train and test dataset
def svm_classifier(train_data, test_data):
  return train_predictions, test_predictions

# implement k-NN algorithm
# use the model to make predictions
# return the model predictions on train and test dataset
def knn_classifier(train_data, test_data):
  return train_predictions, test_predictions

train_data, test_data = split(data)
svm_train_predictions, svm_test_predictions = svm_classifier(train_data, test_data)
knn_train_predictions, knn_test_predictions = knn_classifier(train_data, test_data)

### Model Evaluation
Compare the SVM and k-NN models using the following metrics:
- Accuracy
- [AUC score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html)
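
Note that the linked `metrics.auc` integrates a curve given as x and y points, so the ROC points have to be computed first, for example with `metrics.roc_curve`. The sketch below shows both metrics on toy labels; the function names `accuracy` and `roc_auc` and the toy values are assumptions for illustration.

```python
import numpy as np
from sklearn import metrics

def accuracy(true_labels, predicted_labels):
    # fraction of predictions that match the true labels
    return float(np.mean(np.asarray(true_labels) == np.asarray(predicted_labels)))

def roc_auc(true_labels, scores):
    # build the ROC curve, then integrate it with the trapezoidal rule
    fpr, tpr, _ = metrics.roc_curve(true_labels, scores)
    return metrics.auc(fpr, tpr)

toy_true = [0, 0, 1, 1]
toy_pred = [0, 1, 1, 1]               # hard label predictions
toy_scores = [0.1, 0.4, 0.35, 0.8]    # continuous scores, higher means more likely class 1
print(accuracy(toy_true, toy_pred))   # 0.75
print(roc_auc(toy_true, toy_scores))  # 0.75
```

Hard 0/1 predictions can also be passed as `scores`, but they yield a ROC curve with a single interior point. Continuous scores such as `SVC.decision_function` or `KNeighborsClassifier.predict_proba` give a more informative curve.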


In [None]:
from sklearn import metrics

# compute accuracy 
def compute_accuracy(true_labels, predicted_labels):
  return acc

# compute AUC score 
def compute_auc(true_labels, predicted_labels):
  return auc

# write code to print train and test accuracy and AUC score of SVM and k-NN classifier