# Hands-on introduction to ML training
So far, we have worked with simple, tabular data. In this notebook, we will learn how to deal with text in a spam detection problem.

### Step 1: Load and explore data
The first step is figuring out the data source. In this case we will use a pre-existing dataset. We will:
1. Create a folder 'data'
2. Download the file from a public GitHub repo using the Python package "requests" and save it as emails.csv in the data folder.

In [19]:
%config IPCompleter.greedy=True #Helps with auto-complete

import numpy as np
import pandas as pd
import os

#Create the data folder; exist_ok avoids an error if it already exists
os.makedirs('data', exist_ok=True)

import requests

url = 'https://raw.githubusercontent.com/techno-nerd/ML_101_Course/main/07%20Unstructured%20Data%20-%20Text/data/emails.csv'
r = requests.get(url)
r.raise_for_status() #Stop early if the download failed

#Save the file exactly as downloaded; splitting each line on ',' would break quoted fields that contain commas
with open('data/emails.csv', 'wb') as f:
    f.write(r.content)


In [20]:
df = pd.read_csv('data/emails.csv')

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5728 entries, 0 to 5727
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    5728 non-null   object
 1   spam    5728 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 89.6+ KB


In [22]:
df.head()

                                                text  spam
0  Subject: naturally irresistible your corporate...     1
1   Subject: the stock trading gunslinger fanny i...     1
2   Subject: unbelievable new homes made easy im ...     1
3   Subject: 4 color printing special request add...     1
4   Subject: do not have money get software cds f...     1


In [23]:
#0 = Not spam, 1 = Spam
print(df['spam'].value_counts())

spam
0    4360
1    1368
Name: count, dtype: int64


### [Kaggle Dataset](https://www.kaggle.com/datasets/karthickveerakumar/spam-filter/code)

This dataset is about detecting spam emails, something almost all mail applications have.

### Step 2: Data preparation

For Natural Language Processing (NLP), the following steps are usually taken:
1. Removal of HTML content, like the "<br>" tags (not required for this dataset)
2. Removal of punctuation and special characters
3. Removal of stopwords ("is", "the", "a", etc.), which carry little meaning on their own
4. Lemmatizing - Reducing different forms of a word to a common root. For example, learnt, learning and learn all become learn
5. Vectorisation - Encoding the cleaned text into numerical values

Then, we split the data into training (80%) and testing (20%) sets.
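
As a rough illustration of steps 2 and 3 on a single made-up sentence, the sketch below uses only the standard library. The stopword set `TOY_STOPWORDS` and the function name `clean_text` are hypothetical and exist only for this example; the notebook itself relies on NLTK's much larger English stopword list.

```python
import re

# Hypothetical, tiny stopword set for illustration only
TOY_STOPWORDS = {"is", "the", "a", "to", "your"}

def clean_text(text):
    """Keep letters only, lowercase, split into words, drop stopwords."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    return [word for word in words if word not in TOY_STOPWORDS]

sample = "Subject: Claim your FREE prize, the offer is open to 100 winners!"
print(clean_text(sample))
# prints ['subject', 'claim', 'free', 'prize', 'offer', 'open', 'winners']
```

The number 100 and the punctuation disappear in the first step, and "your", "the", "is" and "to" are dropped in the second.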

### Cleaning: Regex

A regular expression (regex) is a pattern that describes a set of strings. Here we use one to find every character that is not a letter, such as punctuation, digits and special characters, and replace it with a space.
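
To show what a letters-only pattern keeps and drops, here is a small sketch on made-up strings. The example strings are assumptions for illustration; note that an accented letter is also treated as "not a letter" by this ASCII-only character class.

```python
import re

# The ^ inside the brackets negates the class: match anything that is NOT a-z or A-Z
pattern = re.compile(r"[^a-zA-Z]")

examples = ["win $1000 now!!!", "re: meeting@10am", "caf\u00e9 menu"]
for text in examples:
    cleaned = pattern.sub(" ", text)
    # split() with no arguments also collapses the runs of spaces left behind
    print(repr(text), "->", cleaned.split())
```

Digits are removed along with punctuation, so "10am" becomes "am", and the accented word loses its last character. For an English spam dataset this is usually acceptable, but it is worth keeping in mind for other languages.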

In [24]:
import re

def tokenize(text):
    #Replace anything that is not alphabetical with a space, then lowercase
    review = re.sub('[^a-zA-Z]', ' ', text).lower()
    #Split the text into a list for iterating over words later
    return review.split()

#Assign the whole column at once; df['text'][i] = ... is chained assignment and triggers SettingWithCopyWarning
df['text'] = df['text'].apply(tokenize)

In [25]:
print(df.head(2))

                                                text  spam
0  [subject, naturally, irresistible, your, corpo...     1
1  [subject, the, stock, trading, gunslinger, fan...     1


### Cleaning: Remove Stopwords

In [26]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

#Build the stopword set once instead of once per word
stop_words = set(stopwords.words('english'))

#Assign the whole column at once to avoid chained-assignment warnings
df['text'] = df['text'].apply(lambda words: [word for word in words if word not in stop_words])

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/arushgarg/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

In [27]:
print(df.head(2))

                                                text  spam
0  [subject, naturally, irresistible, corporate, ...     1
1  [subject, stock, trading, gunslinger, fanny, m...     1


### Cleaning: Lemmatizing Words

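Lemmatizing turns each word back into its dictionary root (its lemma), for example homes to home. Note that `WordNetLemmatizer` treats every word as a noun unless a part-of-speech tag is passed, so verb forms such as ran or running are left unchanged by the code below.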

In [28]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lem = WordNetLemmatizer()

#Lemmatize each word, then join the words back into a single string
#Assigning the whole column at once avoids chained-assignment warnings
df['text'] = df['text'].apply(lambda words: ' '.join(lem.lemmatize(word) for word in words))

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/arushgarg/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!

In [29]:
print(df.head())

                                                text  spam
0  subject naturally irresistible corporate ident...     1
1  subject stock trading gunslinger fanny merrill...     1
2  subject unbelievable new home made easy im wan...     1
3  subject color printing special request additio...     1
4  subject money get software cd software compati...     1


### Vectorisation: TF-IDF Vectorizer

In the other notebook, we used a simple Count Vectorizer. <br>
This one uses Term Frequency-Inverse Document Frequency (TF-IDF). <br>
It weights each word by how often it appears in an email, and scales that weight down for words that appear in many emails, since those words do little to tell spam and non-spam apart.
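
To make the weighting concrete, the sketch below computes TF-IDF by hand for three made-up documents. It assumes the smoothed formula that scikit-learn documents as the `TfidfVectorizer` default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and it skips the final L2 normalisation that the vectorizer applies to each row. The names `docs`, `idf` and `tfidf` are illustrative only.

```python
import math
from collections import Counter

docs = [
    "free prize claim free",
    "meeting schedule free",
    "meeting notes schedule",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each word
doc_freq = Counter(word for words in tokenized for word in set(words))

def idf(word):
    # Smoothed inverse document frequency (assumed scikit-learn default)
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf(words):
    # Term frequency (raw count) multiplied by the word's idf
    counts = Counter(words)
    return {word: count * idf(word) for word, count in counts.items()}

weights = tfidf(tokenized[0])
for word, weight in sorted(weights.items()):
    print(f"{word}: {weight:.3f}")
```

Here "prize" appears in only one document, so its idf is larger than that of "free", which appears in two. "free" still ends up with the largest weight in the first document because it occurs twice there.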

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vec = TfidfVectorizer()

reviews_vec = tfidf_vec.fit_transform(df['text'])
reviews_vec = reviews_vec.toarray()
print(reviews_vec[:2])

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


Now that we have the data ready, we will split it into train and test sets.

In [31]:
import sklearn.model_selection as ms

train_features, test_features, train_labels, test_labels = ms.train_test_split(reviews_vec, df['spam'], test_size=0.2)
print(train_features.shape)
print(test_features.shape)
print(train_labels.shape)
print(test_labels.shape)

(4582, 30935)
(1146, 30935)
(4582,)
(1146,)
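
The shapes printed above follow from an 80/20 split of the 5,728 rows. The quick check below assumes that the test size is rounded up and the remainder goes to the training set, which is consistent with the output shown.

```python
import math

n_samples = 5728
test_size = 0.2

# 5728 * 0.2 = 1145.6, which rounds up to 1146 test rows
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

print(n_train, n_test)
# prints 4582 1146
```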


### Step 3: Model Selection and Training

The feature space is very large (30,935 columns), so a single Decision Tree is likely to overfit. Hence, we will use a linear Support Vector Machine (SVM) and a Random Forest (an ensemble of decision trees).

### Support Vector Machines (SVM)

SVMs are very powerful for this kind of problem because they can handle a large feature space. A linear SVM tries to find the boundary (a weighted sum of the features plus a constant) that separates the two classes with the widest possible margin.
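
Once trained, a linear SVM's prediction boils down to the sign of a weighted sum. The sketch below uses hand-picked, hypothetical weights for three made-up word features (they are not values learned from this dataset) to show how such a separating equation is applied.

```python
# Hypothetical weights for three TF-IDF features: "free", "prize", "meeting"
weights = [1.5, 2.0, -2.5]
bias = -0.5

def decision_function(features):
    """Signed score w . x + b; its sign tells us the side of the boundary."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def predict(features):
    # 1 = spam, 0 = not spam
    return 1 if decision_function(features) > 0 else 0

spam_like = [0.6, 0.5, 0.0]      # mentions "free" and "prize"
not_spam_like = [0.1, 0.0, 0.7]  # mostly about a "meeting"

print(predict(spam_like), predict(not_spam_like))
# prints 1 0
```

Training is the process of choosing `weights` and `bias` so that the boundary sits as far as possible from the nearest emails of each class.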

In [32]:
from sklearn.svm import LinearSVC

svm = LinearSVC(C=0.5) #'C' is the inverse regularisation strength: smaller values favour a wider margin over fitting every training point
svm = svm.fit(train_features, train_labels)



In [33]:
def ClassifierMetrics(labels, predictions):
    total = labels.size
    result = (labels == predictions)
    correct = result.sum()
    accuracy = (correct)/total

    #Precision = (correct '1' predictions / total '1' predictions)
    precision = (result[predictions == 1.0].sum()) / (predictions == 1.0).sum()

    #Recall = (correct '1' predictions / total number of actual '1's)
    recall = (result[predictions == 1.0].sum()) / (labels == 1.0).sum()

    return [accuracy, precision, recall]

In [34]:
svm_pred = svm.predict(test_features)
svm_metrics = ClassifierMetrics(test_labels, svm_pred)

print("SVM TEST Metrics:")
print(f"Accuracy: {svm_metrics[0]}")
print(f"Precision: {svm_metrics[1]}")
print(f"Recall: {svm_metrics[2]}")

SVM TEST Metrics:
Accuracy: 0.9956369982547993
Precision: 0.9962962962962963
Recall: 0.9853479853479854


### Random Forest Classifiers

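Random Forests can also perform well on this kind of problem: each tree is trained on a random sample of the emails and considers only a random subset of the features at each split, and the forest combines the votes of all its trees.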
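
The voting idea can be sketched with a toy "forest" of hand-written rules, where each rule sees only a random subset of the features and the final answer is a majority vote. This is a deliberate simplification: a real Random Forest learns its trees from data rather than using fixed rules. All names, words and rules below are hypothetical.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the feature subsets are reproducible

feature_names = ["free", "prize", "click", "meeting", "schedule", "report"]
spam_words = {"free", "prize", "click"}

def make_stump(visible_features):
    """A toy 'tree' that only looks at its own subset of features."""
    def stump(email_words):
        seen = [f for f in visible_features if f in email_words]
        spam_votes = sum(1 for f in seen if f in spam_words)
        # Predict spam (1) when most of the visible words are spam words
        return 1 if spam_votes > len(seen) - spam_votes else 0
    return stump

# Each toy tree gets a random subset of four of the six features
forest = [make_stump(random.sample(feature_names, 4)) for _ in range(25)]

def forest_predict(email_words):
    # Majority vote across all trees
    votes = Counter(tree(email_words) for tree in forest)
    return votes.most_common(1)[0][0]

print(forest_predict({"free", "prize", "click"}))
print(forest_predict({"meeting", "schedule", "report"}))
```

Because no single tree sees every feature, the trees make different mistakes, and the vote tends to average those mistakes out.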

In [35]:
from sklearn.ensemble import RandomForestClassifier

r_forest = RandomForestClassifier(n_estimators=500, min_samples_leaf=3)
r_forest = r_forest.fit(train_features, train_labels)

In [36]:
r_forest_pred = r_forest.predict(test_features)
r_forest_metrics = ClassifierMetrics(test_labels, r_forest_pred)

print("Random Forest TEST Metrics:")
print(f"Accuracy: {r_forest_metrics[0]}")
print(f"Precision: {r_forest_metrics[1]}")
print(f"Recall: {r_forest_metrics[2]}")

Random Forest TEST Metrics:
Accuracy: 0.9755671902268761
Precision: 1.0
Recall: 0.8974358974358975
