# Intelligent Systems 2022: 10th Practical Assignment
## Machine Learning Introduction

Your name: Sebastião Manuel Inácio Rosalino

Your VUnetID: sxx209

If you do not provide your name and VUnetID we will not accept your submission. 

### Preliminaries

At the end of this exercise you should be able to work with some basic Machine Learning concepts, and implement and evaluate simple classifiers for *spam classification* using the popular machine learning library scikit-learn (https://scikit-learn.org/stable/).
Scikit-learn offers many helpful methods for creating simple machine learning models and for performing data science tasks.

In this assignment you will:
1. Use pandas to read a dataset from a comma-separated-value (.csv) file.
2. Create tf-idf feature vectors with scikit-learn.
3. Train and evaluate basic classification models.
4. Learn to improve classification models for textual data.




### Practicalities

Follow this Notebook step-by-step. For this course it is necessary that you manipulate the Python programs we provide. You can do the exercises in any programming editor of your liking. Still, please fill in the questions in this notebook as usual.

Please use your studentID+Assignment10.ipynb as the name of the Notebook, and fill in the missing cells.   

Note: unlike the courses dedicated to programming, we will not evaluate the style of the programs. We will, however, test your programs on other data that we provide, and your program should give the correct output on that test data as well.

As was mentioned, the assignment is graded as pass/fail. To pass you need to have either fully working code or an explanation of what you tried and what didn't work for the tasks that you were unable to complete (you can use multi-line comments or a text cell).


### Install some packages

First we need to install some additional packages that we will use throughout this assignment.
This might take a while.


In [66]:
# sys.executable is the path of the Python interpreter that runs this notebook
import sys
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install scikit-learn



## Training classification models with scikit-learn

With this notebook, you have downloaded a small .csv file containing a public spam/ham SMS dataset that is often used for text classification purposes.
We will load this dataset with the pandas library (https://pandas.pydata.org/), which is often used for data analysis.

#### Note that you might need to re-run the Notebook multiple times, because the *df* variable is overwritten multiple times.

In [67]:
# load data
import pandas as pd
df = pd.read_csv('spam.csv', encoding="latin-1")
# drop every column that contains at least one missing value
df.dropna(how="any", inplace=True, axis=1)
df.columns = ['label', 'message']
df.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


As you can see, the resulting pandas dataframe contains an index column, a label, and the message.
Let's first have a look at the class distribution.

## Task 1

For this first task, we ask you to do a basic data science task. Try to get an idea about the dataset by checking how balanced/unbalanced the dataset is. To do this, you need to compute the proportion of the *ham* and the *spam* class.

Find a Pandas function to compute the frequency of the labels to get an idea of the label distribution. 
Then write a short description of your results.
What percentage of the messages are labelled as spam?

*Hint: Have a look at the Pandas documentation (https://pandas.pydata.org/docs/). There are many ways to get your answer!*

In [68]:
#Write your Code for task 1 here.

(df['label'].value_counts() / len(df)) * 100

ham     86.593683
spam    13.406317
Name: label, dtype: float64

In [69]:
MyReport1 = """

To calculate the proportion of each label class (ham and spam) in the dataset, I used the pandas function "value_counts" to count how many times
each label occurred, divided the counts by the total number of instances in the dataset, and multiplied the result by 100 to obtain the relative
frequency of each label as a percentage.
Approximately 86.6% of the messages were labelled as ham and approximately 13.4% were labelled as spam.

"""
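
As the hint for Task 1 says, there is more than one way to get these percentages. The sketch below shows one alternative, the `normalize=True` argument of `value_counts`, on a small hypothetical frame (the toy labels are invented for illustration and are not taken from spam.csv).

```python
import pandas as pd

# Hypothetical toy labels, invented for illustration (not from spam.csv)
toy = pd.DataFrame({"label": ["ham", "ham", "ham", "spam"]})

# normalize=True returns relative frequencies instead of raw counts,
# so no manual division by len(toy) is needed
proportions = toy["label"].value_counts(normalize=True) * 100
print(proportions)
```

Applied to the real dataframe, the same call would be `df['label'].value_counts(normalize=True) * 100`.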

The following code snippet will create textual features, as discussed in last week's lecture. We will create tf-idf vectors and append them to our pandas dataframe.
Then we will perform a simple train/test split of our dataset, using the scikit-learn splitting functions.

Have a look at the different parts that we created. What do the dataframes X_train, y_train, X_test, y_test contain?
Try to understand what is happening here by also having a look at the scikit-learn documentation (https://scikit-learn.org/stable/).
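
To get a feeling for what a tf-idf matrix looks like before applying it to the full dataset, here is a minimal sketch on three hypothetical messages (invented for illustration). It assumes the default settings of `TfidfVectorizer`, under which every row is scaled to unit length.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy messages, invented for illustration
toy_messages = ["free prize now", "see you now", "free free prize"]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(toy_messages)

# vocabulary_ maps each learned token to its column index
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)

# one row per message, one column per distinct token
print(matrix.shape)
print(matrix.toarray().round(2))
```

Each message becomes one row and each distinct token one column, which is why the feature matrix for the full SMS dataset has thousands of columns that are mostly zero.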

In [70]:
#imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

#compute the tf-idf vectors for the messages and create a new dataframe for them
v = TfidfVectorizer()
tf_idf = v.fit_transform(df['message'])
#get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
df_tfidf = pd.DataFrame(tf_idf.toarray(), columns=v.get_feature_names_out())

#combine the original dataframe with the dataframe for the tf-idf vectors
dataframes = [df, df_tfidf]
df_new = pd.concat(dataframes, axis=1)

#split the dataset into training and test set
#note: test_size=0.9 assigns 90% of the rows to the test set, leaving only 10% for training
train, test = train_test_split(df_new, test_size=0.9)

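#separate feature matrices X from label vector y
#columns 0 and 1 are 'label' and 'message' and the tf-idf columns start at index 2,
#so the slice 3: also leaves out the first tf-idf column
X_train = train.iloc[:, 3:]
X_test = test.iloc[:, 3:]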
y_train = train['label']
y_test = test['label']
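
To make the effect of the splitting arguments concrete, here is a small sketch on a hypothetical frame with 8 ham and 2 spam rows. The `stratify` and `random_state` arguments are not used in the cell above; they are shown here only as optional additions that keep the class ratio equal in both parts and make the split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 8 ham and 2 spam rows with one numeric feature
toy = pd.DataFrame({
    "label": ["ham"] * 8 + ["spam"] * 2,
    "feature": range(10),
})

# test_size=0.5 puts half of the rows in the test set;
# stratify keeps the ham/spam ratio the same in both parts;
# random_state fixes the shuffling so the split is repeatable
train, test = train_test_split(
    toy, test_size=0.5, stratify=toy["label"], random_state=0
)

print(len(train), len(test))
print(train["label"].value_counts().to_dict())
print(test["label"].value_counts().to_dict())
```

Without `random_state`, every run produces a different split, so scores computed later in the notebook will vary slightly between runs.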




In [71]:
#let's have a look at the different dataframes here
X_train

Unnamed: 0,000,000pes,008704050406,0089,0121,01223585236,01223585334,0125698789,02,0207,...,ó_,û_,û_thanks,ûªm,ûªt,ûªve,ûï,ûïharry,ûò,ûówell
807,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
168,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
723,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4175,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3954,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5558,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1820,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3199,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4671,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Naive Bayes Classification


In the lecture, we introduced the naive Bayes classification algorithm and computed various examples by hand. Here, we will use scikit-learn to train your first classification model for spam classification.
However, all examples from the lecture used categorical features, while our tf-idf vectors here are real-valued features.
Thus, the model used here will be slightly different from what we have seen in the lecture.
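
To illustrate how a naive Bayes model can be fitted on real-valued tf-idf features, here is a minimal sketch on four hypothetical messages (invented for illustration). It assumes `MultinomialNB`, one of the naive Bayes variants in scikit-learn that accepts non-negative fractional features such as tf-idf values, and chains it with the vectorizer through `make_pipeline`; other variants could be used instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages and labels, invented for illustration
messages = [
    "win a free prize now",
    "free cash prize",
    "see you at lunch",
    "are you coming home",
]
labels = ["spam", "spam", "ham", "ham"]

# The pipeline first turns raw text into tf-idf vectors,
# then fits the naive Bayes model on those vectors
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(messages, labels)

predictions = model.predict(["free prize", "see you"])
print(predictions)
print(model.classes_)
```

Bundling the vectorizer and the classifier in one pipeline means new raw messages can be passed to `predict` directly, without building the tf-idf matrix by hand.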



### Task 2

Use the training and test set created in the previous cell and train a Naive Bayes classifier using scikit-learn.
Please have a look at the documentation on how to fit a classification model using X_train and y_train as input.
Afterwards compute the accuracy of your classifier.


In [72]:
# My classification code

from sklearn.naive_bayes import MultinomialNB

y_test_prediction_naive_bayes = MultinomialNB().fit(X_train, y_train).predict(X_test)

# Computing the accuracy of my Naive Bayes classifier

from sklearn import metrics

print("The accuracy of my Naive Bayes classifier was:", metrics.accuracy_score(y_test, y_test_prediction_naive_bayes))

The accuracy of my Naive Bayes classifier was: 0.8685942173479562


As you might have seen, the accuracy of your Naive Bayes classifier should be over 95%.
This seems to be a very good score, for a very simple classification model and simple tf-idf features.

### Task 3

Have a look at different evaluation metrics for your classifier and discuss the suitability of accuracy for the spam classification task.
Have a look at the definition of accuracy and come up with another metric that is better suited to our problem.

*Hint: Have a look at this documentation and try out different evaluation metrics: https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics*

In [73]:
from sklearn.metrics import balanced_accuracy_score

# Computing the balanced accuracy of my Naive Bayes classifier

print("The balanced accuracy of my Naive Bayes classifier was:", balanced_accuracy_score(y_test, y_test_prediction_naive_bayes))

The balanced accuracy of my Naive Bayes classifier was: 0.5118518518518519


In [74]:
MyReport2 = """

The accuracy metric reports the fraction of test messages whose predicted label matches the true label. On this dataset that number is misleading,
because the classes are strongly imbalanced: as computed in task 1, about 86.6% of the messages are ham. A classifier that labels every message as ham
would therefore already reach an accuracy of roughly 0.87 without detecting a single spam message, and the accuracy of my Naive Bayes classifier
(about 0.87) is no better than that.

To obtain a more realistic picture of the model's performance, I used the balanced accuracy metric. As described in the documentation, balanced
accuracy is the average of the recall obtained on each class, so ham and spam contribute equally to the score regardless of how often each label
occurs. This suits our binary spam/ham problem, where the rare spam class is exactly what the classifier is supposed to detect.

As is clear from the output, the model scores much worse under this metric: about 0.51, close to the 0.5 that a classifier predicting only one class
would obtain.

"""
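
The gap between accuracy and balanced accuracy on imbalanced data can be reproduced with a tiny sketch. The labels below are hypothetical (87 ham, 13 spam, roughly the ratio found in Task 1), and the "classifier" is a trivial one that labels every message as ham; neither comes from the notebook's actual predictions.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical labels: 87 ham and 13 spam, roughly the ratio found in Task 1
y_true = ["ham"] * 87 + ["spam"] * 13
# A trivial "classifier" that labels every message as ham
y_pred = ["ham"] * 100

accuracy = accuracy_score(y_true, y_pred)            # fraction of correct labels
balanced = balanced_accuracy_score(y_true, y_pred)   # mean of per-class recall
spam_recall = recall_score(y_true, y_pred, pos_label="spam")

print(accuracy, balanced, spam_recall)
```

The accuracy is 0.87 even though no spam message is found, while balanced accuracy drops to 0.5 because the recall of 1.0 on ham is averaged with the recall of 0.0 on spam.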

### Task 4

Come up with any improvements for the classification model here.
You can come up with a new method and/or different features to improve the classification.
Can you beat the baseline Naive Bayes model?

If you try out a different classification model, the training of the model might take a couple of seconds.

Write at least 10 sentences describing your improvements and why they help to improve the model.

In [75]:
# My first experiment is to train a classifier based on a classification Decision Tree

from sklearn import tree

y_test_prediction_decision_tree = tree.DecisionTreeClassifier().fit(X_train, y_train).predict(X_test)

# Computing the balanced accuracy of my Classification Decision Tree model

print("The balanced accuracy of my Decision Tree classifier was:", metrics.balanced_accuracy_score(y_test, y_test_prediction_decision_tree))

The balanced accuracy of my Decision Tree classifier was: 0.862652329749104


In [76]:
# My second experiment is to train a classifier based on KNN (K-Nearest Neighbours)

from sklearn.neighbors import KNeighborsClassifier

my_knn_classifier = KNeighborsClassifier(n_neighbors=3)

my_knn_classifier.fit(X_train, y_train)

y_test_prediction_knn = my_knn_classifier.predict(X_test)

# Computing the balanced accuracy of my KNN Classification model

print("The balanced accuracy of my KNN classifier was:", metrics.balanced_accuracy_score(y_test, y_test_prediction_knn))

The balanced accuracy of my KNN classifier was: 0.852231609489674


In [24]:
MyReport3 = """

In order to improve the model's ability to classify SMS messages as spam or ham, I carried out two experiments.

The first one consisted of using a different classification model, a classification Decision Tree. After fitting this model on the training set and
predicting the labels on the test set, it achieved a balanced accuracy of about 0.86, much better than the roughly 0.51 of the Naive Bayes
classifier evaluated in task 3. It therefore beats the baseline Naive Bayes model.

The second experiment consisted of using another model, KNN (K-Nearest Neighbours). After fitting this model on the training set and predicting the
labels on the test set, it achieved a balanced accuracy of about 0.85, again much better than the Naive Bayes classifier evaluated in task 3.
It therefore also beats the baseline. This model was run with K=3, which means that each message is classified by a majority vote among the 3
training messages whose feature vectors are closest to it. Its balanced accuracy is very similar to that of the Decision Tree classifier.

To sum up, both alternative models greatly improve the balanced accuracy on the test set. The Naive Bayes baseline's balanced accuracy of about 0.5
suggests that it labels almost every message as ham, which may be due to the strong class imbalance combined with the small training set. The
Decision Tree and KNN classifiers make their decisions in a different way, by splitting on individual features and by comparing a message with its
nearest training messages respectively, and in this experiment that let them recognise far more of the spam messages.

"""
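
Because the train/test split above is random, a single score per model can vary from run to run. One possible way to compare candidate models more steadily is cross-validation with balanced accuracy as the scoring function. The sketch below uses synthetic, imbalanced stand-in data from `make_classification` rather than the SMS dataset, and the names `candidates` and `results` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (about 87% / 13%); not the SMS dataset
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.87, 0.13], random_state=0
)

# Hypothetical set of candidate models to compare
candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "knn_3": KNeighborsClassifier(n_neighbors=3),
}

# 5-fold cross-validation, scored with balanced accuracy and averaged per model
results = {
    name: cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy").mean()
    for name, model in candidates.items()
}

for name, score in results.items():
    print(f"{name}: {score:.3f}")
```

Averaging over several folds reduces the influence of one lucky or unlucky split, at the cost of fitting each model five times.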

## Final Task: Collect all the results

Uncomment and run this cell (and all the cells above) to generate the text file that you have to hand in together with the notebook on Canvas!

### Please hand in only the text file which is generated by this method!

In [None]:
from utils import *
exportToText("assignment10.txt", MyReport1, MyReport2, MyReport3)