## Assignment 3: Spam Detection with Random Forests

1.  Write Python functions that return
    1. the frequency of a given character in a string,
    2. the frequency of a given word (maximal consecutive sequence of letters, case insensitive) in a string,
    3. a list of all maximal sequences of consecutive capital letters in a string.

2.  For each message in the dataset `sms_spam.csv`, compute:
    1. the frequency of each of the words in the list `words` defined below,
    2. the frequency of each of the characters in the list `chars` defined below,
    3. the average run length of a maximal sequence of capital letters,
    4. the longest run length of a maximal sequence of capital letters,
    5. the sum of the run lengths of the maximal sequences of capital letters.
    
    
**More precisely:**
    
- One can interpret *word* in a number of ways.
If we interpret *word* as a string of consecutive letters, the sentence
"The reporter filed a report about reportables." contains three occurrences of the word *report*.
We could also interpret *word* as a string of consecutive letters neither preceded nor followed by a letter;
then that sentence would contain only one occurrence of the word *report*.
Choose whichever of these interpretations you prefer. If you think of a better interpretation, feel free to use that.
    
- The simplest interpretation of maximal sequence of capital letters is a sequence of capital letters
neither preceded nor followed by a capital letter.
For example, in the string "abcDEFGhiJKL345 MNO PQR", "DEFG", "JKL", "MNO" and "PQR"
are maximal sequences of capital letters, while "DEF", "KL", and "MNO PQR" aren't.
(The first two aren't maximal and the third doesn't consist entirely of capital letters.)
This isn't the only interpretation. In the string "He shouted, 'STOP SHOUTING AT ME!'", one could argue that "STOP SHOUTING AT ME" should count as a single (maximal) sequence of capital letters. If you prefer such an interpretation, feel free to use it.
    
- These details aren't really the point. We're just trying to extract features for analysis. Any reasonable approach is fine.
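
The two readings of *word* and the simplest reading of a capital run described above can be checked with a few regular expressions. This is only a minimal sketch: the variable names are illustrative, and the lookaround pattern is one possible way to express "neither preceded nor followed by a letter".

```python
import re

sentence = "The reporter filed a report about reportables."

# Interpretation 1: count every occurrence of the letters, even inside longer words.
substring_hits = re.findall(r"report", sentence.lower())

# Interpretation 2: count only occurrences neither preceded nor followed by a letter.
whole_word_hits = re.findall(r"(?<![a-z])report(?![a-z])", sentence.lower())

# Simplest reading of "maximal sequence of capital letters": runs of A-Z.
capital_runs = re.findall(r"[A-Z]+", "abcDEFGhiJKL345 MNO PQR")

print(len(substring_hits), len(whole_word_hits), capital_runs)
# 3 1 ['DEFG', 'JKL', 'MNO', 'PQR']
```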
    
3.  Arrange the results in a dataframe. The frequencies computed in 1. and 2. should go in columns named
    `freq_<word>` and `freq_<char>`, respectively. The run lengths in 3., 4., and 5. should go in columns named
    `capital_run_length_average`, `capital_run_length_longest`, and `capital_run_length_total`, respectively.
    The last column in your dataframe should be the target, 1 if the message is spam and 0 if it isn't.
    Save your dataset as a `.csv` file called `sms_spam_features.csv`.


4.  Based on these 57 features, use a random forest to train a spam-detecting classifier.

5.  Repeat 2. through 4. for the datasets `spambase.csv` and `spam_or_not_spam.csv`.

6.  Comment on similarities/differences you notice between datasets (include our analysis of `spambase.csv`), classification methods, etc.

7.  **Optional:** Construct spam detectors using other classifiers you've learned about, and compare their performances.

In [1]:
words = ['make', 'address', 'all', '3d', 'our', 'over', 'remove', 'internet', 'order', 'mail', 'receive', 'will', 'people',
         'report', 'addresses', 'free', 'business', 'email', 'you', 'credit', 'your', 'font', '000', 'money', 'hp', 'hpl', 
         'george', '650', 'lab', 'labs', 'telnet', '857', 'data', '415', '85', 'technology', '1999', 'parts', 'pm', 'direct',
         'cs', 'meeting', 'original', 'project', 're', 'edu', 'table', 'conference']

In [2]:
chars = [';', '(', '[', '!', '$', '#']

In [3]:
import pandas as pd
import re
import numpy as np

## 1)

### a)

In [4]:
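def charac_freq (charac, string):
    #   charac is used as a regular-expression pattern, so characters with a special meaning in regular
    #   expressions (such as '(' or '[') must be escaped by the caller.
    regex = re.compile (charac)
    matches = regex.findall (string)
    #   Return the ratio of the count of the desired character in the string to the total number of
    #   characters in the string.
    return len (matches) / len (string)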

In [5]:
#   Test
a = "This will work!"

print (charac_freq ('i', a))
print (charac_freq ('!', a))

0.13333333333333333
0.06666666666666667
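
For single characters, a regular expression is not strictly needed. The sketch below is an alternative, not the function used in this notebook: `char_freq_plain` is a hypothetical name, and returning 0.0 for an empty string is an assumed convention.

```python
def char_freq_plain(char, text):
    """Return occurrences of `char` divided by the length of `text` (0.0 for empty text)."""
    if not text:
        # Assumed convention: an empty message has frequency 0 for every character.
        return 0.0
    # str.count matches the character literally, so '(' or '[' need no escaping.
    return text.count(char) / len(text)

print(char_freq_plain("i", "This will work!"))
print(char_freq_plain("(", "call (me)"))
```

Because `str.count` does no pattern matching, this version avoids the escaping step that the regex-based function requires for characters such as `(` and `[`.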


### b)

In [6]:
def word_freq (word, string):
    #   Because the comparison is case insensitive, the word and the string can both be made lower case
    #   before applying the functions. Occurrences inside longer words are counted as well.
    regex = re.compile (word.lower ()) 
    matches = regex.findall (string.lower ())
    #   Return the ratio of the count of the desired word in the string to the total number of
    #   whitespace-separated words in the string.
    return len (matches) / len (string.split ())

In [7]:
#   Test
b = 'There is a lot of data to examine in Data Science.'

print (word_freq ('Science', b))
print (word_freq ('data', b))

0.09090909090909091
0.18181818181818182


### c)

In [8]:
def max_caps (string):
    regex = re.compile (r'[A-Z]+')   #   This will grab all strings with one or more capital letters.
    return regex.findall (string)

In [9]:
#   Test
c = 'THIS IS BUSIest time of the school YEar.'

print (max_caps (c))

['THIS', 'IS', 'BUSI', 'YE']


## 2)

### a)

In [10]:
messages = pd.read_csv ('~/Desktop/DS607/Assignment/A3/sms_spam_cleaned.csv')
messages.head (3)

| | v1 | v2 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |


In [11]:
freq_words = []   #   will contain one list of word frequencies per message

for ind in range (len (messages)):
    freq_i = []   #   frequency of each word in words for this message
    for word in words:
        f = word_freq (word, messages.v2[ind])
        freq_i.append (f)
    freq_words.append (freq_i)

In [12]:
np.array (freq_words).shape 

(5564, 48)

### b)

In [13]:
freq_characs = []

for ind in range (len (messages)):
    freq_c = []
    for charac in chars:
        #   Characters such as '(' and '[' have special meanings in regular expressions, so re.escape is
        #   used to turn them into '\(' and '\[' before they are passed to the function charac_freq.
        f = charac_freq (re.escape (charac), messages.v2[ind])
        freq_c.append (f)
    freq_characs.append (freq_c)

In [14]:
np.array (freq_characs).shape

(5564, 6)
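
The row-by-row loop above could also be written with pandas string methods. The following is a sketch under assumptions: `demo` and `demo_chars` are small hypothetical stand-ins for `messages` and `chars`, not the real data.

```python
import re
import pandas as pd

# Hypothetical three-row stand-in for the `v2` message column.
demo = pd.DataFrame({"v2": ["Call now!!", "ok (see you)", "Win $$$ today"]})
demo_chars = ["(", "!", "$"]

char_features = pd.DataFrame({
    # Series.str.count treats its argument as a regex, so each character is escaped.
    f"freq_{ch}": demo["v2"].str.count(re.escape(ch)) / demo["v2"].str.len()
    for ch in demo_chars
})

print(char_features.shape)
```

Each column is computed for all rows at once, so the nested lists and the later conversion to an array are not needed in this variant.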

### c)

In [15]:
avg_length = []

vectoring = np.vectorize (len, otypes = [float])
#   Applying this function will convert the strings in the array caps_message to floats equal to the length
#   of the strings.

for ind in range (len (messages)):
    caps_message = max_caps (messages.v2[ind])
    if len (caps_message) != 0:
    #   If there are no capital letters, caps_message is an empty list and 'vectoring' returns an empty
    #   array. The mean of an empty array divides a sum of 0 by a count of 0, which gives nan (with a
    #   RuntimeWarning), so an if statement is used to separate out the empty lists and add zero to
    #   avg_length to indicate there are zero capital letters.
        mean_caps_message = vectoring (np.array (caps_message)).mean ()
        avg_length.append (mean_caps_message)
    else:
        avg_length.append (0)

In [16]:
len (avg_length)

5564

### d)

In [17]:
longest_string = []

for ind in range (len (messages)):
    caps_message = max_caps (messages.v2[ind])
    if len (caps_message) != 0:
    #   The function max () raises an error on an empty list, so empty lists are separated out. The else
    #   branch adds zero to 'longest_string' to indicate the message does not contain any capital letters.
        longest = len (max (caps_message, key = len))
        longest_string.append (longest)
    else:
        longest_string.append (0)

In [18]:
len (longest_string)

5564

### e)

In [19]:
total_length = []

vectoring = np.vectorize (len, otypes = [float])

for ind in range (len (messages)):
    caps_message = max_caps (messages.v2[ind])
    sum_caps_message = vectoring (np.array (caps_message)).sum ()
    total_length.append (sum_caps_message)

#   The sum of an empty list is 0. Therefore, .sum () works where .mean () did not work.

In [20]:
len (total_length)

5564
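
The three capital-run features in c), d), and e) all derive from the same list of run lengths, so they could be computed in one pass. This is a sketch only: `capital_run_stats` is a hypothetical helper, and it keeps the zero-for-no-capitals convention used above.

```python
import re

def capital_run_stats(text):
    """Return (average, longest, total) length of the runs of A-Z in `text`."""
    lengths = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not lengths:
        # No capital letters: report zeros, matching the convention used above.
        return 0.0, 0, 0
    total = sum(lengths)
    return total / len(lengths), max(lengths), total

print(capital_run_stats("THIS IS BUSIest time of the school YEar."))
# (3.0, 4, 12)
```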

## 3)

In [21]:
#   For the words in the list words

words_f = []

#   Convert the words in words to 'freq_<word>' and create a new list.
for word in words:
    word = 'freq_' + word
    words_f.append (word)

#   For the characters in the list chars
    
chars_f = []

#   Convert the characters in chars to 'freq_<character>' and create a new list.
for charac in chars:
    charac = 'freq_' + charac
    chars_f.append (charac)

In [22]:
#   Convert freq_words to a dataframe
param1 = pd.DataFrame (np.array (freq_words), columns = words_f)


#   Convert freq_characs to a dataframe
param2 = pd.DataFrame (np.array (freq_characs), columns = chars_f)


#   Convert the lists avg_length, longest_string, and total_length to a dictionary and then to a dataframe.
d = {'capital_run_length_average' : avg_length, 'capital_run_length_longest' : longest_string, 'capital_run_length_total' : total_length}

param3 = pd.DataFrame (d)

In [23]:
#  Full dataframe of all the attributes.
parameters = pd.concat ([param1, param2, param3], axis = 1)

parameters.shape

(5564, 57)

In [24]:
parameters.head (3)

Unnamed: 0,freq_make,freq_address,freq_all,freq_3d,freq_our,freq_over,freq_remove,freq_internet,freq_order,freq_mail,...,freq_conference,freq_;,freq_(,freq_[,freq_!,freq_$,freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1,3.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1,2.0
2,0.0,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,...,0.0,0.0,0.006452,0.0,0.0,0.0,0.0,1.25,2,10.0


In [25]:
targets = []

#   Creates a list according to the label in messages.v1. 1 is applied for 'spam' and 0 for 'ham'
for ind in range (len (messages)):
    if messages.v1[ind] == 'spam':
        targets.append (1)
    else:
        targets.append (0)
        

#   Adds the list to the dataframe as a new column
parameters['target'] = targets

parameters.shape

(5564, 58)
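
The label conversion can also be done without an explicit loop. A minimal sketch, assuming a hypothetical `labels` series in place of `messages.v1`:

```python
import pandas as pd

# Hypothetical labels standing in for the `v1` column.
labels = pd.Series(["ham", "ham", "spam"])

# A boolean comparison cast to int gives 1 for 'spam' and 0 for everything else.
targets_vec = (labels == "spam").astype(int)

print(targets_vec.tolist())
# [0, 0, 1]
```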

In [26]:
#   Export dataframe as 'sms_spam_features.csv'
parameters.to_csv ('~/Desktop/DS607/Assignment/A3/sms_spam_features.csv', header = True)

## 4)

In [27]:
X = parameters.iloc [:,0:57]
y = parameters.iloc [:,57]

print (X.shape, y.shape)

(5564, 57) (5564,)


In [28]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [29]:
X_tr, X_te, y_tr, y_te = train_test_split (X, y, test_size = 0.2)

In [30]:
M = RandomForestClassifier ()
M.fit (X_tr, y_tr)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [31]:
y_pred = M.predict (X_te)

accuracy_score (y_te, y_pred)

0.9685534591194969
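
Accuracy alone can hide how the errors are distributed if one class is much rarer than the other. The sketch below shows one way to make the split repeatable and to inspect a confusion matrix. It runs on synthetic data from `make_classification`, not on the SMS features, and the class weights and `random_state` values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for a 57-feature matrix (not the SMS data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=57, n_informative=10,
    weights=[0.85, 0.15], random_state=0,
)

# stratify keeps the class ratio equal in both splits; random_state makes the split repeatable.
Xd_tr, Xd_te, yd_tr, yd_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(Xd_tr, yd_tr)
pred = model.predict(Xd_te)

print(accuracy_score(yd_te, pred))
# Rows are true classes, columns are predicted classes.
print(confusion_matrix(yd_te, pred))
```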

## 5)

#### Parts 2-4 using 'spambase.csv'

In [32]:
spambase = pd.read_csv ('~/Desktop/DS607/Assignment/A3/spambase.csv')

spambase.shape

(4601, 58)

In [33]:
spambase.head (3)

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,target
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1


In [72]:
#   Export dataframe as 'spambase_features.csv'
spambase.to_csv ('~/Desktop/DS607/Assignment/A3/spambase_features.csv', header = True)

In [34]:
X2 = spambase.iloc [:, :57]
y2 = spambase.iloc [:, 57]

print (X2.shape, y2.shape)

(4601, 57) (4601,)


In [35]:
X2_tr, X2_te, y2_tr, y2_te = train_test_split (X2, y2, test_size = 0.2)

In [36]:
M2 = RandomForestClassifier ()
M2.fit (X2_tr, y2_tr)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [37]:
y2_pred = M2.predict (X2_te)

accuracy_score (y2_te, y2_pred)

0.9587404994571118

#### Parts 2-4 using 'spam_or_not_spam.csv'

In [4]:
spam_or_not = pd.read_csv ('~/Desktop/DS607/Assignment/A3/spam_or_not_spam.csv')

spam_or_not.shape

(3000, 2)

In [7]:
spam_or_not.head (4)

| | email | label |
|---|---|---|
| 0 | date wed NUMBER aug NUMBER NUMBER NUMBER NUMB... | 0 |
| 1 | martin a posted tassos papadopoulos the greek ... | 0 |
| 2 | man threatens explosion in moscow thursday aug... | 0 |
| 3 | klez the virus that won t die already the most... | 0 |


In [8]:
spam_or_not.tail (4)

| | email | label |
|---|---|---|
| 2996 | hyperlink hyperlink hyperlink let mortgage le... | 1 |
| 2997 | thank you for shopping with us gifts for all ... | 1 |
| 2998 | the famous ebay marketing e course learn to s... | 1 |
| 2999 | hello this is chinese traditional 子 件 NUMBER世... | 1 |


##### 2a)

In [40]:
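#   Find the rows for which word_freq fails, so that they can be inspected.
for ind in range (len (spam_or_not)):
    for word in words:
        try:
            f = word_freq (word, spam_or_not.email[ind])
        except (AttributeError, TypeError, ZeroDivisionError):
            print (ind)
            break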

2806
2828
2966


In [41]:
print (spam_or_not.loc[2806])
print (spam_or_not.loc[2828])
print (spam_or_not.loc[2966])

email     
label    1
Name: 2806, dtype: object
email     
label    1
Name: 2828, dtype: object
email    NaN
label      1
Name: 2966, dtype: object
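
The problem rows found above are either blank or missing. Rows like these could also be located with a boolean mask rather than by catching errors. This is a sketch on a hypothetical three-row frame, not on `spam_or_not` itself.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for spam_or_not: one blank email and one missing email.
demo = pd.DataFrame({"email": ["free money now", " ", np.nan], "label": [1, 1, 1]})

# An email is unusable if it is missing or contains only whitespace.
unusable = demo["email"].isna() | (demo["email"].str.strip() == "")

print(demo.index[unusable].tolist())
# [1, 2]

# Rows that can safely be passed to the feature functions.
usable = demo[~unusable]
```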


In [42]:
freq_words2 = []

for ind in range (len (spam_or_not)):
    freq_i = []
    for word in words:
        try:
        #   Rows 2806 and 2828 are blank and row 2966 is NaN, so word_freq raises an error for them. A try
        #   statement is used to bypass the issue and add an NaN to freq_words2.
            f = word_freq (word, spam_or_not.email[ind])
            freq_i.append (f)
        except (AttributeError, TypeError, ZeroDivisionError):
            freq_i.append (np.nan)
    freq_words2.append (freq_i)

In [43]:
np.array (freq_words2).shape

(3000, 48)

##### 2b)

In [44]:
freq_characs2 = []

for ind in range (len (spam_or_not)):
    freq_c = []
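    for charac in chars:
        try:
            #   re.escape handles characters such as '(' and '[' that are special in regular expressions.
            f = charac_freq (re.escape (charac), spam_or_not.email[ind])
            freq_c.append (f)
        except (TypeError, ZeroDivisionError):
            freq_c.append (np.nan)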
    freq_characs2.append (freq_c)

In [45]:
np.array (freq_characs2).shape

(3000, 6)

##### 2c)

In [46]:
avg_length2 = []

vectoring = np.vectorize (len, otypes = [float])

for ind in range (len (spam_or_not)):
    try:
        caps_message = max_caps (spam_or_not.email[ind])
        if len (caps_message) != 0:
            mean_caps_message = vectoring (np.array (caps_message)).mean ()
            avg_length2.append (mean_caps_message)
        else:
            avg_length2.append (0)
    except:
        avg_length2.append (np.nan)

In [47]:
len (avg_length2)

3000

##### 2d)

In [48]:
longest_string2 = []

for ind in range (len (spam_or_not)):
    try:
        caps_message = max_caps (spam_or_not.email[ind])
        if len (caps_message) != 0:
            longest = len (max (caps_message, key = len))
            longest_string2.append (longest)
        else:
            longest_string2.append (0)
    except:
        longest_string2.append (np.nan)

In [49]:
len (longest_string2)

3000

##### 2e)

In [50]:
total_length2 = []

vectoring = np.vectorize (len, otypes = [float])

for ind in range (len (spam_or_not)):
    try:
        caps_message = max_caps (spam_or_not.email[ind])
        sum_caps_message = vectoring (np.array (caps_message)).sum ()
        total_length2.append (sum_caps_message)
    except:
        total_length2.append (np.nan)

In [51]:
len (total_length2)

3000

##### 3)

In [52]:
param1_2 = pd.DataFrame (np.array (freq_words2), columns = words_f)

param2_2 = pd.DataFrame (np.array (freq_characs2), columns = chars_f)

d2 = {'capital_run_length_average' : avg_length2, 'capital_run_length_longest' : longest_string2, 'capital_run_length_total' : total_length2}
param3_2 = pd.DataFrame (d2)

parameters2 = pd.concat ([param1_2, param2_2, param3_2], axis = 1)

parameters2.shape

(3000, 57)

In [53]:
parameters2['target'] = spam_or_not['label']

parameters2.shape

(3000, 58)

In [71]:
#   Export dataframe as 'spam_or_not_spam_features.csv'
parameters2.to_csv ('~/Desktop/DS607/Assignment/A3/spam_or_not_spam_features.csv', header = True)

##### 4)

In [54]:
X3 = parameters2.iloc[:,0:57]
y3 = parameters2.iloc[:,57]

#   RandomForestClassifier does not accept NaN values in the data here, so all the NaN values are converted to 0.
X3_e = np.nan_to_num (X3)
y3_e = np.nan_to_num (y3)

X3_tr, X3_te, y3_tr, y3_te = train_test_split (X3_e, y3_e, test_size = 0.2)
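
Replacing NaN with 0 makes a blank or missing email look like a message that contains none of the words or characters. Dropping those rows is another option. The sketch below contrasts the two on a hypothetical miniature feature table; which choice is better for this dataset is not settled here.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature feature table; the middle row's features could not be computed.
features = pd.DataFrame({
    "freq_free": [0.1, np.nan, 0.0],
    "capital_run_length_total": [3.0, np.nan, 0.0],
    "target": [1, 1, 0],
})

# Option used above: keep every row and replace NaN by 0.
filled = features.fillna(0)

# Alternative: drop the rows whose features are missing.
dropped = features.dropna()

print(len(filled), len(dropped))
# 3 2
```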

In [55]:
M3 = RandomForestClassifier ()
M3.fit (X3_tr, y3_tr)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [56]:
y3_pred = M3.predict (X3_te)
accuracy_score (y3_te, y3_pred)

0.9483333333333334

## 6)

sms_spam.csv and spam_or_not_spam.csv contain the raw messages (or emails), so the word and character frequencies and the longest, average, and total lengths of the capital-letter runs had to be computed before a model could be trained. In sms_spam.csv, spam and non-spam messages are labeled 'spam' and 'ham', respectively, so the labels had to be converted to 1 and 0. The emails in spam_or_not_spam.csv were already labeled 1 and 0.
spambase.csv already contained the computed features, so no pre-processing was needed.
spam_or_not_spam.csv contained blank and non-string (NaN) entries, which caused errors in the feature functions, so exceptions had to be handled during pre-processing.

Some of the emails in spam_or_not_spam.csv contained non-Roman characters; for example, the last message contained traditional Chinese characters. This does not appear to have affected the analysis.

After pre-processing, all three datasets had the same number of attributes (columns). sms_spam has the most entries (rows) and spam_or_not_spam has the fewest:
* sms_spam: 5564 entries
* spambase: 4601 entries
* spam_or_not_spam: 3000 entries

The procedure for training the models was the same for each dataset.

The accuracies of the three random forest models are similar, ranging from about 95 % to 97 %.

## 7)

#### Try Decision Tree Classifier

In [57]:
from sklearn.tree import DecisionTreeClassifier

In [58]:
#   using sms_spam dataset

MDT_1 = DecisionTreeClassifier ()
MDT_1.fit (X_tr, y_tr)
yDT_1_pred = MDT_1.predict (X_te)
accuracy_score (y_te, yDT_1_pred)

0.9514824797843666

In [59]:
#   using spambase dataset

MDT_2 = DecisionTreeClassifier ()
MDT_2.fit (X2_tr, y2_tr)
yDT_2_pred = MDT_2.predict (X2_te)
accuracy_score (y2_te, yDT_2_pred)

0.9272529858849077

In [60]:
#   using spam_or_not_spam dataset

MDT_3 = DecisionTreeClassifier ()
MDT_3.fit (X3_tr, y3_tr)
yDT_3_pred = MDT_3.predict (X3_te)
accuracy_score (y3_te, yDT_3_pred)

0.9016666666666666

#### Try k-Nearest Neighbours

In [61]:
from sklearn.neighbors import KNeighborsClassifier

In [62]:
#   using sms_spam dataset

MKN_1 = KNeighborsClassifier ()
MKN_1.fit (X_tr, y_tr)
yKN_1_pred = MKN_1.predict (X_te)
accuracy_score (y_te, yKN_1_pred)

0.9317160826594789

In [63]:
#   using spambase dataset

MKN_2 = KNeighborsClassifier ()
MKN_2.fit (X2_tr, y2_tr)
yKN_2_pred = MKN_2.predict (X2_te)
accuracy_score (y2_te, yKN_2_pred)

0.7915309446254072

In [64]:
#   using spam_or_not_spam dataset

MKN_3 = KNeighborsClassifier ()
MKN_3.fit (X3_tr, y3_tr)
yKN_3_pred = MKN_3.predict (X3_te)
accuracy_score (y3_te, yKN_3_pred)

0.82

#### Try Logistic Regression

In [65]:
from sklearn.linear_model import LogisticRegression

In [66]:
#   using sms_spam dataset

MLR_1 = LogisticRegression ()
MLR_1.fit (X_tr, y_tr)
yLR_1_pred = MLR_1.predict (X_te)
accuracy_score (y_te, yLR_1_pred)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.8984725965858041

In [67]:
#   using spambase dataset

MLR_2 = LogisticRegression ()
MLR_2.fit (X2_tr, y2_tr)
yLR_2_pred = MLR_2.predict (X2_te)
accuracy_score (y2_te, yLR_2_pred)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.9250814332247557

In [68]:
#   using spam_or_not_spam dataset

MLR_3 = LogisticRegression ()
MLR_3.fit (X3_tr, y3_tr)
yLR_3_pred = MLR_3.predict (X3_te)
accuracy_score (y3_te, yLR_3_pred)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.7966666666666666
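
The warnings above suggest increasing `max_iter` or scaling the data. One way to do both is a pipeline that standardises the features before fitting. The sketch below uses synthetic data, with one column deliberately put on a much larger scale; the `max_iter` value is an arbitrary assumption, and no claim is made about the accuracy it would give on the spam datasets.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a 57-feature matrix (not the spam data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=57, n_informative=10, random_state=0
)
# Put one column on a much larger scale, as the capital-run totals are.
X_demo[:, -1] *= 1000.0

Xd_tr, Xd_te, yd_tr, yd_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The scaler is fitted on the training split only and reused for prediction.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(Xd_tr, yd_tr)

print(scaled_lr.score(Xd_te, yd_te))
```

k-Nearest Neighbours relies on distances between feature vectors, so the same kind of pipeline might also be worth trying with `KNeighborsClassifier`.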

#### Summary

In [69]:
d3 = {'Dataset' : ['sms_spam.csv', 'spambase.csv', 'spam_or_not_spam.csv'], 
      'Random Forest' : [accuracy_score (y_te, y_pred), accuracy_score (y2_te, y2_pred), accuracy_score (y3_te, y3_pred)],
     'Decision Tree' : [accuracy_score (y_te, yDT_1_pred), accuracy_score (y2_te, yDT_2_pred), accuracy_score (y3_te, yDT_3_pred)],
     'k-Nearest Neighbours' : [accuracy_score (y_te, yKN_1_pred), accuracy_score (y2_te, yKN_2_pred), accuracy_score (y3_te, yKN_3_pred)],
     'Logistic Regression' : [accuracy_score (y_te, yLR_1_pred), accuracy_score (y2_te, yLR_2_pred), accuracy_score (y3_te, yLR_3_pred)]}

display (pd.DataFrame (d3))

| | Dataset | Random Forest | Decision Tree | k-Nearest Neighbours | Logistic Regression |
|---|---|---|---|---|---|
| 0 | sms_spam.csv | 0.968553 | 0.951482 | 0.931716 | 0.898473 |
| 1 | spambase.csv | 0.95874 | 0.927253 | 0.791531 | 0.925081 |
| 2 | spam_or_not_spam.csv | 0.948333 | 0.901667 | 0.82 | 0.796667 |


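For all three datasets, the Random Forest Classifier models provided higher accuracy than the other classifiers seen in class. The k-Nearest Neighbours Classifier models provided the least accurate predictions for spambase, and Logistic Regression provided the least accurate predictions for sms_spam and spam_or_not_spam. The Logistic Regression models stopped at the iteration limit (see the warnings above), so their accuracies might improve with more iterations or scaled data.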
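
Each accuracy in the table comes from a single random train/test split, so small differences between classifiers may not be stable. Cross-validation is one way to get a mean and a spread for each model. This sketch uses synthetic data and only two of the classifiers; the fold count and `random_state` values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a 57-feature matrix (not the spam data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=57, n_informative=10, random_state=0
)

candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

cv_means = {}
for name, clf in candidates.items():
    # Five-fold cross-validation: each fold serves once as the test set.
    scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(name, round(scores.mean(), 3), round(scores.std(), 3))
```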