# Intelligent Systems 2023: Practical Assignment 10

## Machine Learning Introduction

Your name: Chantal Ariu

Your VUnetID: car103

If you do not provide your name and VUnetID we will not accept your submission. 

### Preliminaries

At the end of this exercise you should be able to work with some basic Machine Learning concepts, and implement and evaluate simple classifiers for *spam classification* using the popular machine learning library scikit-learn (https://scikit-learn.org/stable/).
Scikit-learn offers many helpful methods for creating simple machine learning models and for performing data science tasks.

In this assignment you will:
1. Use pandas to read a dataset from a comma-separated values (.csv) file.
2. Create tf-idf feature vectors with scikit-learn.
3. Train and evaluate basic classification models.
4. Learn to improve classification models for textual data.




### Practicalities

Follow this Notebook step-by-step. For this course it is necessary that you manipulate the Python programs we provide. You can do the exercises in any programming editor of your liking. Still, please fill in the questions in this notebook as usual.

Please use your studentID+Assignment10.ipynb as the name of the Notebook, and fill in the missing cells.   

Note: unlike the courses dedicated to programming, we will not evaluate the style of the programs. We will, however, test your programs on other data that we provide, and your programs should give the correct output on that test data as well.

As was mentioned, the assignment is graded as pass/fail. To pass you need to have either fully working code or an explanation of what you tried and what didn't work for the tasks that you were unable to complete (you can use multi-line comments or a text cell).


### Install some packages

First we need to install some additional packages that we will use throughout this assignment.
This might take a while.


In [1]:
!python3 -m pip install pandas
!python3 -m pip install scikit-learn

Collecting pandas
  Downloading pandas-2.1.4-cp312-cp312-macosx_11_0_arm64.whl.metadata (18 kB)
Collecting numpy<2,>=1.26.0 (from pandas)
  Downloading numpy-1.26.2-cp312-cp312-macosx_11_0_arm64.whl.metadata (61 kB)
Collecting python-dateutil>=2.8.2 (from pandas)
  Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2023.3.post1-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.1 (from pandas)
  Using cached tzdata-2023.3-py2.py3-none-any.whl (341 kB)
Collecting six>=1.5 (from python-dateutil>=2.8.2->pandas)
  Downloading six-1.16.0-py2.py3-none-any.whl (11 kB)
Downloading pandas-2.1.4-cp312-cp312-macosx_11_0_arm64.whl (10.6 MB)

## Training classification models with scikit-learn

With this notebook, you have downloaded a small .csv file containing a public spam/ham SMS dataset that is often used for text classification purposes.
We will load this dataset with the pandas library (https://pandas.pydata.org/), which is often used for data analysis.


In [20]:
#load data
import pandas as pd
df = pd.read_csv ('spam.csv', encoding = "ISO-8859-1")
df.dropna(how="any", inplace=True, axis=1)
df.columns = ['label', 'message']
df.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


As you can see, the resulting pandas dataframe contains an index column, a label, and the message.
Let's first have a look at the class distribution.

## Task 1

For this first task, we ask you to do a basic data science task. Try to get an idea about the dataset by checking how balanced/unbalanced the dataset is. To do this, you need to compute the proportion of the *ham* and the *spam* class.

Find a Pandas function to compute the frequency of the labels to get an idea of the label distribution. 
Then write a short description of your results.
What percentage of the messages are labelled as spam?

*Hint: Have a look at the Pandas documentation (https://pandas.pydata.org/docs/). There are many ways to get your answer!*

In [23]:
class_proportions = df['label'].value_counts(normalize=True)
print(class_proportions)

label
ham     0.865937
spam    0.134063
Name: proportion, dtype: float64


In [26]:
MyReport1 = """
The output shows that 86.59% of the messages are labelled ham and 13.41% are labelled spam.
The dataset is therefore clearly imbalanced: there are roughly 6.5 ham messages for every spam message.
"""

The following code snippet will create textual features, as discussed in last week's lecture. We will create tf-idf vectors and will append them to our pandas dataframe.
Then we will perform a simple train/test split of our dataset, using the scikit-learn splitting functions.

Have a look at the different parts that we created. What do the dataframes X_train, y_train, X_test, y_test contain?
Try to understand what is happening here by also having a look at the scikit-learn documentation (https://scikit-learn.org/stable/).

In [24]:
#imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

#compute the tf-idf vectors for the messages and create a new dataframe for them
v = TfidfVectorizer()
tf_idf = v.fit_transform(df['message'])
df_tfidf = pd.DataFrame(tf_idf.toarray(), columns=v.get_feature_names_out())

#combine the original dataframe with the dataframe for the tf-idf vectors
dataframes = [df, df_tfidf]
df_new = pd.concat(dataframes, axis=1)

#split the dataset into training and test set
train, test = train_test_split(df_new, test_size=0.9)

#separate feature matrices X from label vector y
X_train = train.iloc[:, 3:]
X_test = test.iloc[:, 3:]
y_train = train['label']
y_test = test['label']


In [25]:
#let's have a look at the different dataframes here
X_train

Unnamed: 0,000,000pes,008704050406,0089,0121,01223585236,01223585334,0125698789,02,0207,...,ó_,û_,û_thanks,ûªm,ûªt,ûªve,ûï,ûïharry,ûò,ûówell
5101,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4650,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1238,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2869,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
130,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
832,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2306,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2246,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Naive Bayes Classification


In the lecture, we introduced the naive Bayes classification algorithm and computed various examples by hand. Here, we will use scikit-learn to train your first classification model for spam classification.
However, all examples from the lecture were using categorical features, while our tf-idf vectors here are real-valued features. 
Thus, the model used here will be slightly different than what we have seen in the lecture.
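
One Naive Bayes variant that accepts real-valued, non-negative features is `MultinomialNB`; the scikit-learn documentation notes that fractional counts such as tf-idf may work in practice. A minimal sketch on made-up messages follows (the toy data and the names `toy_messages`, `toy_labels`, and `model` are assumptions for illustration, not part of the assignment data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, not taken from spam.csv
toy_messages = [
    "see you at lunch",
    "win a free prize now",
    "are you coming home",
    "free entry win cash",
    "call me later",
    "lunch at home today",
]
toy_labels = ["ham", "spam", "ham", "spam", "ham", "ham"]

# The vectorizer turns text into real-valued tf-idf features;
# MultinomialNB treats them as (fractional) word counts.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(toy_messages, toy_labels)

pred = model.predict(["win free cash", "lunch at home"])
print(list(pred))

# Class probabilities, in the order given by model.classes_
print(model.classes_)
print(model.predict_proba(["win free cash"]).round(3))
```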



### Task 2

Use the training and test set created in the previous cell and train a Naive Bayes classifier using scikit-learn.
Please have a look at the documentation on how to fit a classification model using X_train and y_train as input.
Afterwards, compute the accuracy of your classifier.


In [29]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

clf = MultinomialNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.8693918245264207


As you might have seen, the accuracy of your Naive Bayes classifier should be over 85%.
This seems to be a very good score for a very simple classification model and simple tf-idf features.

### Task 3

Have a look at different evaluation metrics for your classifier and discuss the suitability of accuracy for the spam classification task.
Have a look at the definition of accuracy and come up with another metric that is better suited to our problem.

*Hint: Have a look at this documentation and try out different evaluation metrics: https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics*

In [49]:
from sklearn.metrics import precision_score, recall_score, f1_score, jaccard_score, classification_report

precision = precision_score(y_test, y_pred, pos_label='spam')
recall = recall_score(y_test, y_pred, pos_label='spam')
f1 = f1_score(y_test, y_pred, pos_label='spam')
jaccard = jaccard_score(y_test, y_pred, pos_label='spam')
accuracy = accuracy_score(y_test, y_pred)

print("Naive Bayes Accuracy:", accuracy)
print("Naive Bayes Jaccard Score:",jaccard)
print("Naive Bayes Classification Report:\n", classification_report(y_test, y_pred))


Naive Bayes Accuracy: 0.8693918245264207
Naive Bayes Jaccard Score: 0.02092675635276532
Naive Bayes Classification Report:
               precision    recall  f1-score   support

         ham       0.87      1.00      0.93      4346
        spam       1.00      0.02      0.04       669

    accuracy                           0.87      5015
   macro avg       0.93      0.51      0.49      5015
weighted avg       0.89      0.87      0.81      5015
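
The counts behind the report make the weakness of accuracy concrete. The sketch below recomputes the metrics by hand from confusion counts that are reconstructed from the printed report (669 spam messages, spam precision 1.00, Jaccard 0.0209 = 14/669, so 14 spam messages were caught and no ham message was flagged). These counts are derived here; the notebook itself does not print them.

```python
# Confusion counts for the positive class 'spam', reconstructed from the
# Naive Bayes report (support: 4346 ham, 669 spam). Derived values.
tp, fp, fn, tn = 14, 0, 655, 4346

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
jaccard = tp / (tp + fp + fn)

# Accuracy of a trivial model that labels every message as ham
always_ham_accuracy = (tn + fp) / total

print(f"accuracy   {accuracy:.4f}")
print(f"precision  {precision:.4f}")
print(f"recall     {recall:.4f}")
print(f"f1         {f1:.4f}")
print(f"jaccard    {jaccard:.4f}")
print(f"always-ham accuracy {always_ham_accuracy:.4f}")
```

A model that labels every message as ham would reach 4346/5015, about 86.7% accuracy, so the Naive Bayes accuracy of about 86.9% is barely above that trivial baseline, while its recall for spam is only about 2%. Metrics computed for the spam class, such as recall, F1, or the Jaccard score, expose this where accuracy hides it.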



In [52]:
MyReport2 = """
As stated in the documentation, I first used the more common evaluation metrics: precision, recall and f1. 
- Precision output: 1.0 (measure of accuracy of positive predictions) --> 100%
- Recall output: 0.02 (measure of proportion of actual positives that were identified correctly) --> ~2%
- F1 output: 0.04 (measures harmonic mean between precision and recall) --> ~4%
- Jaccard output: 0.02 (defined as size of intersection divided by the size of the union of two label sets) --> ~2%
"""

### Task 4

Come up with any improvements for the classification model here.
You can come up with a new method and/or different features to improve the classification.
Can you beat the baseline Naive Bayes model?

If you try out a different classification model, the training of the model might take a couple of seconds.

Write at least 10 sentences describing your improvements and why these improvements are helping to improve the model?
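
One possible direction, sketched under assumptions rather than given as the intended solution: wrap the vectorizer and the classifier in a `Pipeline` so that the tf-idf vocabulary is learned from the training messages only, and try richer features such as word bigrams. The toy messages, the parameter choices (`ngram_range=(1, 2)`, `sublinear_tf=True`, a linear kernel), and all variable names below are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy training data, not taken from spam.csv
train_messages = [
    "see you at lunch",
    "win a free prize now",
    "are you coming home",
    "free entry win cash",
    "call me later",
    "lunch at home today",
]
train_labels = ["ham", "spam", "ham", "spam", "ham", "ham"]

pipeline = Pipeline([
    # unigrams and bigrams, with log-scaled term frequencies
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
    # class_weight='balanced' gives the rarer class a larger weight
    ("svm", SVC(kernel="linear", class_weight="balanced")),
])

# fit() learns the vocabulary and the classifier from the training data only
pipeline.fit(train_messages, train_labels)

new_messages = ["free cash prize", "see you at home"]
predictions = pipeline.predict(new_messages)
print(dict(zip(new_messages, predictions)))
```

Whether such changes actually help on the SMS dataset has to be checked on the held-out test set, using the spam-class metrics from Task 3 rather than accuracy alone.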

In [48]:
from sklearn.svm import SVC

# Here, we will initialize the SVM classifier model and pass the standard parameters that we will need
svm_classifier = SVC(kernel='sigmoid', gamma=1.0, class_weight='balanced')

# Here, we use the function .fit() (from the sklearn documentation) to train the model
svm_classifier.fit(X_train, y_train)

y_pred_svm = svm_classifier.predict(X_test)

# Now we compute the accuracy and print it out
accuracy_svm = accuracy_score(y_test, y_pred_svm)
print("SVM Accuracy:", accuracy_svm)

precision_svm = precision_score(y_test, y_pred_svm, pos_label='spam')
recall_svm = recall_score(y_test, y_pred_svm, pos_label='spam')
f1_svm = f1_score(y_test, y_pred_svm, pos_label='spam')
jaccard_svm = jaccard_score(y_test, y_pred_svm, pos_label='spam')

# print("SVM Precision:", precision_svm)
# print("SVM Recall:", recall_svm)
# print("SVM F1 Score:", f1_svm)
print("SVM Jaccard Score:", jaccard_svm)

print("SVM Classification Report:\n", classification_report(y_test, y_pred_svm))


SVM Accuracy: 0.9647058823529412
SVM Jaccard Score: 0.7416058394160584
SVM Classification Report:
               precision    recall  f1-score   support

         ham       0.96      1.00      0.98      4346
        spam       0.97      0.76      0.85       669

    accuracy                           0.96      5015
   macro avg       0.97      0.88      0.92      5015
weighted avg       0.96      0.96      0.96      5015
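
The effect of `class_weight='balanced'` can be made concrete. The scikit-learn documentation describes the heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies it to the class proportions from Task 1 as a stand-in; the notebook itself computes the weights from `y_train`, whose exact class counts are not shown, so the real values will differ somewhat.

```python
# Class proportions printed in Task 1 (whole dataset, used here as a stand-in
# for the unknown proportions in y_train)
proportions = {"ham": 0.865937, "spam": 0.134063}
n_classes = len(proportions)

# 'balanced' weight = n_samples / (n_classes * count_of_class).
# Dividing numerator and denominator by n_samples gives
# 1 / (n_classes * proportion_of_class).
weights = {label: 1.0 / (n_classes * p) for label, p in proportions.items()}

for label, w in weights.items():
    print(f"{label}: {w:.2f}")

# How much more a misclassified spam message costs than a misclassified ham one
print(f"spam/ham weight ratio: {weights['spam'] / weights['ham']:.2f}")
```

Under these proportions an error on a spam message is penalised roughly 6.5 times as heavily as an error on a ham message, which pushes the decision boundary towards catching more spam.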



In [50]:
MyReport3 = """
During our Intro to AI course, I was doing research on the SVM (Support Vector Machine) model and therefore wanted
to look into it a little more to test whether it would perform better than the Naive Bayes model. I then went on
to research how to use an SVM model with sklearn and initially ended up on this website: https://scikit-learn.org/stable/modules/svm.html. 
After reading about SVMs on that website, I expected that the SVM model would perform better than the Naive Bayes
model, especially since the SVM model accepts the parameter 'class_weight'. Setting it to 'balanced' (which is what I did)
should be beneficial, since the dataset we are using is very unbalanced.

After doing the initial research and gathering the theoretical knowledge, I followed the guides on sklearn's documentation
as well as the information from the following website: https://www.milindsoorya.com/blog/build-a-spam-classifier-in-python
to gather more insight on how to train and implement an SVM model. Once I understood how to do so, I initialized the 
SVM classifier model, trained it, and used the same metrics I used for analyzing the Naive Bayes model, for my SVM model.

Now we can directly compare the performance of the two models using the output of the classification_report() function as well as the
Jaccard scores and the accuracy scores. We get the following results:

Naive Bayes Accuracy: 0.8693918245264207
Naive Bayes Jaccard Score: 0.02092675635276532
Naive Bayes Classification Report:
               precision    recall  f1-score   support

         ham       0.87      1.00      0.93      4346
        spam       1.00      0.02      0.04       669

    accuracy                           0.87      5015
   macro avg       0.93      0.51      0.49      5015
weighted avg       0.89      0.87      0.81      5015


SVM Accuracy: 0.9647058823529412
SVM Jaccard Score: 0.7416058394160584
SVM Classification Report:
               precision    recall  f1-score   support

         ham       0.96      1.00      0.98      4346
        spam       0.97      0.76      0.85       669

    accuracy                           0.96      5015
   macro avg       0.97      0.88      0.92      5015
weighted avg       0.96      0.96      0.96      5015

As we can see, my SVM model improved the performance by a significant amount. Starting with the accuracy score,
the Naive Bayes accuracy is approximately 86.94%, while the SVM accuracy is approximately 96.47%.
This is an improvement of about 9.5 percentage points. Moving on to the Jaccard score (using spam as
the positive label), we had a score of approximately 2% using the Naive Bayes model, which is very low compared to the Jaccard
score of my SVM model, which is approximately 74%.
As for the other metrics, spam recall rises from 0.02 to 0.76 and the spam F1 score from 0.04 to 0.85, while spam precision drops slightly from 1.00 to 0.97.
"""

## Final Task: Collect all the results

Uncomment and run this cell (and all the cells above) to generate the text file that you have to hand in together with the notebook on Canvas!

### Please hand in only the text file which is generated by this method!

In [53]:
def exportToText(*args):
    with open(args[0], "w") as f:
        for argument in args:
            f.write("{}\n".format(argument))

exportToText("assignment10.txt", MyReport1, MyReport2, MyReport3)