<div>
<img src="https://drive.google.com/uc?export=view&id=1vK33e_EqaHgBHcbRV_m38hx6IkG0blK_" width="350"/>
</div> 

# **Artificial Intelligence - MSc**
## ET5003 - MACHINE LEARNING APPLICATIONS

### Instructor: Enrique Naredo
### ET5003_NLP_SpamClasiffier


### Spam Classification

[Spamming](https://en.wikipedia.org/wiki/Spamming) is the use of messaging systems to send multiple unsolicited messages (spam) to large numbers of recipients, whether for commercial advertising, for non-commercial proselytizing, or for a prohibited purpose (especially phishing), or simply to send the same message over and over to the same user.

Spam classification is the task of deciding whether an email is spam or not (ham).



## Imports

In [None]:
# standard libraries
import pandas as pd
import numpy as np

In [None]:
# Scikit-learn is an open source machine learning library 
# that supports supervised and unsupervised learning
# https://scikit-learn.org/stable/
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

In [None]:
# Regular expression operations
#https://docs.python.org/3/library/re.html
import re 

# Natural Language Toolkit
# https://www.nltk.org/install.html
import nltk

# Stemming maps different forms of the same word to a common “stem” 
# https://pypi.org/project/snowballstemmer/
from nltk.stem import SnowballStemmer

# https://www.nltk.org/book/ch02.html
from nltk.corpus import stopwords

## Step 1: Load dataset

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# path to your (local/cloud) drive 
path = '/content/drive/MyDrive/Colab Notebooks/Enrique/Data/spam/'

# load dataset
df = pd.read_csv(path+'spam.csv', encoding='latin-1')


In [None]:
# original dataset
df.head()

In [None]:
# remove useless features
df = df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis='columns')

In [None]:
# v1 -> class label: ham or spam
# ham -> https://en.wiktionary.org/wiki/ham_e-mail
# spam -> https://en.wiktionary.org/wiki/spam#English
# v2 -> text of the email
df.head()

## Step 2: Pre-processing

In [None]:
# Removing stopwords and stemming
# a stem does not have to be a valid word
# Example: fishing and fished -> stem: fish
# choose English as the target language
stemmer = SnowballStemmer('english', ignore_stopwords=False)

In [None]:
# Stop words are a set of very common words in a language
# that are filtered out before processing natural language data
# https://en.wikipedia.org/wiki/Stop_word
# Example list: https://github.com/igorbrigadir/stopwords/blob/master/en/terrier.txt
nltk.download('stopwords')
stop = set(stopwords.words('english'))

In [None]:
# example: remove anything that is not a letter
string_sample = '123This @45is 890-130 an_example !!'
new_string = re.sub('[^a-zA-Z]', ' ', string_sample) 
print(new_string)

In [None]:
# removing duplicated spaces
" ".join(new_string.split())

In [None]:
# remove anything that is not a letter in the emails
df['v2'] = [re.sub('[^a-zA-Z]', ' ', sms) for sms in df['v2']]
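
The regex substitution and the whitespace collapse shown above can be combined into a single helper. This is only a minimal sketch: `clean_text` is a hypothetical name that does not appear in the original notebook.

```python
import re

def clean_text(text):
    """Replace every non-letter with a space, then collapse repeated spaces."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', text)
    return " ".join(letters_only.split())

# same sample string used in the cells above
print(clean_text('123This @45is 890-130 an_example !!'))
```

A helper like this could be applied to every message in one pass, for example with a list comprehension over `df['v2']`.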

In [None]:
# list of words in the emails
email_words = [sms.split() for sms in df['v2']]
print(email_words)

In [None]:
# function to normalize words
def normalize(words):
  normalized_words = list()
  for word in words:
    # remove  the most common words
    if word.lower() not in stop: 
      # stemming
      new_word = stemmer.stem(word) 
      # lower case
      normalized_words.append(new_word.lower()) 
  return normalized_words

In [None]:
# normalize words in emails
email_words_norm = [normalize(words) for words in email_words]
print(email_words_norm)

In [None]:
# update dataframe
df['v2'] = [" ".join(words) for words in email_words_norm]
df.head()

In [None]:
# training and test datasets
data_register = df['v2']
class_label = df['v1']
factor = 0.2
lucky_number = 7
x_train, x_test, y_train, y_test = train_test_split(data_register, class_label, test_size=factor, random_state=lucky_number)

In [None]:
# train
print(x_train)

In [None]:
# test
print(x_test)

In [None]:
# class labels in training
y_train

In [None]:
# reshape -> gives a new shape to an array without changing its data
# https://het.as.utexas.edu/HET/Software/Numpy/reference/generated/numpy.reshape.html#numpy.reshape
# -1 -> the unspecified value is inferred
y_train.values.reshape(-1,1)

**Build your custom function**

The real difference between stemming and lemmatization is that stemming reduces word forms to (pseudo) stems, which may or may not be meaningful words, whereas lemmatization reduces word forms to linguistically valid lemmas (dictionary forms).
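
A short illustration of the "pseudo stem" point, assuming NLTK is installed and reusing the English Snowball stemmer from the pre-processing step. The names `demo_stemmer`, `demo_words`, and `demo_stems` are hypothetical and used only for this sketch.

```python
from nltk.stem import SnowballStemmer

# Assumption: the same English Snowball stemmer used earlier in this notebook.
demo_stemmer = SnowballStemmer('english')

demo_words = ['studies', 'running']
demo_stems = [demo_stemmer.stem(word) for word in demo_words]

# 'studies' is cut to 'studi', which is not an English word,
# while 'running' is cut to 'run', which happens to be one.
print(demo_stems)
```

A lemmatizer such as NLTK's `WordNetLemmatizer` aims for dictionary forms instead, at the cost of needing the WordNet data and, for best results, part-of-speech information.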

In [None]:
# you can build your own NLP function
# edit it according to your requirements
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
nltk.download('punkt')
nltk.download('wordnet')
# note: newer NLTK releases may also require nltk.download('punkt_tab')

def NLP_preprocess(some_text):
  """
  Normalization using NLTK:
  tokenize, stem, lemmatize, then drop stopwords and punctuation
  """
  # 1. Tokenization
  NLP_token = word_tokenize(some_text)

  # 2. Stemming
  PS = PorterStemmer()
  NLP_stem = [PS.stem(word) for word in NLP_token]

  # 3. Lemmatization
  WL = WordNetLemmatizer()
  NLP_lemma = [WL.lemmatize(word) for word in NLP_stem]

  # 4. Stopwords
  NLP_stop = set(stopwords.words("english"))
  FS = [w for w in NLP_lemma if w not in NLP_stop]

  # 5. Punctuation
  # build a new list instead of removing items while iterating,
  # which would skip the element that follows each removal
  punctuations = set("?:!.,;")
  FS = [word for word in FS if word not in punctuations]

  # print comparison
  print(" ")
  print(some_text)
  print(FS)
  return FS

In [None]:
# example
NLP_preprocess(string_sample)

## Step 3: Counts

**Create new features with NLP**

In [None]:
# Convert a collection of text documents to a matrix of token counts
CV = CountVectorizer()

In [None]:
## training
# transforming emails into counts
# counting the number of times a word appears in each email
# 'x_train_count' is a sparse matrix, which avoids storing the zeros
x_train_count = CV.fit_transform(x_train)

# returns: n_samples, n_features
print("total emails =", x_train_count.shape[0])
print("total words =", x_train_count.shape[1])

In [None]:
# show the counts in train
print(x_train_count)

In [None]:
# full matrix
x_train_count.toarray()

In [None]:
## test
# transforming emails into counts
# counting the number of times a word appears in each email
# using the vocabulary fitted on the training set
# sparse matrix: only non-zero elements are stored
x_test_count = CV.transform(x_test)

# returns: n_samples, n_features
print("total emails =", x_test_count.shape[0])
print("total words =", x_test_count.shape[1])
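
To make the fit/transform split concrete, here is a simplified plain-Python sketch of the counting idea. It is not how `CountVectorizer` is implemented: it splits on whitespace only, and `fit_vocabulary`, `to_counts`, and the toy messages are hypothetical names and data introduced for illustration.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct token in the training documents to a column index."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def to_counts(docs, vocab):
    """Build one row of token counts per document; unseen tokens are skipped."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in doc.split() if tok in vocab)
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows

# hypothetical toy messages, not taken from spam.csv
toy_train = ['free prize call', 'call home soon']
toy_test = ['free call now']

toy_vocab = fit_vocabulary(toy_train)              # learned from training data only
toy_train_counts = to_counts(toy_train, toy_vocab)
toy_test_counts = to_counts(toy_test, toy_vocab)   # 'now' is not in the vocabulary

print(toy_vocab)
print(toy_train_counts)
print(toy_test_counts)
```

The test row has the same number of columns as the training rows because the vocabulary is fixed at fit time; words seen only in the test set are simply dropped.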

In [None]:
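# array mapping from feature integer indices to feature name
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
# converted to a list so that 'Class' can be appended later
int2feature = list(CV.get_feature_names_out())
print(int2feature)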

In [None]:
# warning:
# running the next cell several times
# would append 'Class' each time, so a flag guards against it
c = 0

In [None]:
# append 'Class' to the end of the list
if c==0:
  int2feature.append('Class')
  # print the last 10 feature names (including 'Class')
  print(int2feature[-10:])
  c = 1
else:
  print('already appended')



In [None]:
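# new dataset
# keep the word counts numeric and add the class label as a separate column
# (stacking counts and string labels with np.hstack makes every column non-numeric,
# so describe() would not report numeric statistics)
n_words = x_train_count.shape[1]
new_dataset = pd.DataFrame(data=x_train_count.toarray(), columns=int2feature[:n_words])
new_dataset['Class'] = y_train.values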


In [None]:
# first rows
new_dataset.head()

In [None]:
new_dataset.describe()

In [None]:
# write object to a comma-separated values (csv) file
# verify in your folder
new_dataset.to_csv(path+"spam_clean.csv",index=False)

### Data format

In [None]:
# select an email number
row_index = 0

# decoded numerical input: dense counts for that email,
# as a 2-D array of shape (1, n_features)
DNI = x_train_count[row_index,:].toarray()
print("non-sparse matrix =", DNI)

# words in the email (after pre-processing)
print('original words: ', x_train.values[row_index])

# inverse_transform: return terms per document with nonzero entries
print('decoded input: ', CV.inverse_transform(DNI))

# index of words
ind = np.where(DNI[0]>0)[0]
print('word indexes: ', ind)

# number of times those words appear in the email
print(x_train_count[row_index,ind].toarray())

## Step 4: Learning

Training the classifier and making predictions on the test set

In [None]:
# create a model
MNB = MultinomialNB()

# fit to data
MNB.fit(x_train_count, y_train)
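
`MultinomialNB` with its default `alpha=1.0` applies Laplace (add-one) smoothing to the word counts of each class. The sketch below reproduces that idea in plain Python on a hypothetical four-message toy corpus. The names `train_nb`, `predict_nb`, and the toy data are assumptions made for illustration, and this is not the scikit-learn implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized text."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc.split())
    model = {}
    for label, n_docs in class_docs.items():
        # smoothed denominator: all tokens in the class plus alpha per vocabulary word
        total = sum(token_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            'log_prior': math.log(n_docs / len(docs)),
            'log_lik': {tok: math.log((token_counts[label][tok] + alpha) / total)
                        for tok in vocab},
        }
    return model

def predict_nb(model, doc):
    """Return the class with the highest posterior log score."""
    scores = {}
    for label, params in model.items():
        score = params['log_prior']
        for tok in doc.split():
            # tokens outside the training vocabulary are ignored
            if tok in params['log_lik']:
                score += params['log_lik'][tok]
        scores[label] = score
    return max(scores, key=scores.get)

# hypothetical toy corpus, not taken from spam.csv
toy_docs = ['free prize call', 'win free prize', 'call home soon', 'see home soon']
toy_labels = ['spam', 'spam', 'ham', 'ham']
toy_model = train_nb(toy_docs, toy_labels)

print(predict_nb(toy_model, 'free prize now'))
print(predict_nb(toy_model, 'home soon'))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing keeps a word that never appeared in one class from forcing that class's probability to zero.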

In [None]:
# testing the model

prediction_train = MNB.predict(x_train_count)
print('training prediction\t', prediction_train)

prediction_test = MNB.predict(x_test_count)
print('test prediction\t\t', prediction_test)

In [None]:
# suppress=True: always print floating point numbers
# using fixed point notation, in which case numbers equal to zero
# in the current precision will print as zero
np.set_printoptions(suppress=True)

# Ham and Spam probabilities in test
class_prob = MNB.predict_proba(x_test_count)
print(class_prob)

In [None]:
# show emails classified as 'spam'
threshold = 0.5
spam_ind = np.where(class_prob[:,1]>threshold)[0]
print(x_test.values[spam_ind])
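
The threshold controls how confident the model must be before a message is flagged. The sketch below uses hypothetical probabilities (not output from this notebook) and a hypothetical helper `flag_spam` to show how raising the threshold shrinks the set of flagged messages.

```python
# Hypothetical spam probabilities for five messages (illustration only).
toy_spam_prob = [0.05, 0.40, 0.55, 0.80, 0.99]

def flag_spam(probabilities, threshold):
    """Return the indexes whose spam probability exceeds the threshold."""
    return [i for i, p in enumerate(probabilities) if p > threshold]

for t in (0.5, 0.9):
    print(t, flag_spam(toy_spam_prob, t))
```

A higher threshold usually means fewer legitimate messages are flagged by mistake, at the cost of letting more spam through.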

## Step 5: Accuracy

In [None]:
# accuracy in training set
y_pred_train = prediction_train
print("Train Accuracy: "+str(accuracy_score(y_train, y_pred_train)))

In [None]:
# accuracy in test set (unseen data)
y_true = y_test
y_pred_test = prediction_test
print("Test Accuracy: "+str(accuracy_score(y_true, y_pred_test)))

In [None]:
# confusion matrix
conf_mat = confusion_matrix(y_true, y_pred_test)
print("Confusion Matrix\n", conf_mat)
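
Accuracy alone can look good when most messages are ham, so it can help to read precision and recall for the spam class off the confusion matrix. The sketch below assumes the scikit-learn layout (rows are true classes, columns are predicted classes, ordered ham then spam). The helper `spam_metrics` and the counts in `toy_conf` are hypothetical and are not results from this notebook.

```python
def spam_metrics(conf):
    """Compute accuracy, precision and recall for the 'spam' class.

    `conf` follows the scikit-learn layout: rows are true classes,
    columns are predicted classes, ordered as [ham, spam].
    """
    (tn, fp), (fn, tp) = conf
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # guard against division by zero when no spam is predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical counts, not results from this notebook.
toy_conf = [[90, 2], [4, 4]]
toy_accuracy, toy_precision, toy_recall = spam_metrics(toy_conf)
print(toy_accuracy, toy_precision, toy_recall)
```

In this toy case the accuracy is high even though only half of the spam messages are caught, which is why the confusion matrix is worth inspecting alongside the accuracy score.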

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# class order follows the sorted labels used by confusion_matrix: ham, spam
labels = ['Ham','Spam']

# ConfusionMatrixDisplay places the tick labels on the correct cells
disp = ConfusionMatrixDisplay(confusion_matrix=conf_mat, display_labels=labels)
disp.plot()
disp.ax_.set_title('Confusion matrix of the classifier\n')
disp.ax_.set_xlabel('Predicted')
disp.ax_.set_ylabel('True')
plt.show()