**Name:** \_\_\_\_\_

**EID:** \_\_\_\_\_

# CS5489 - Assignment 1 - SMS classification

## Goal
In this assignment, the task is to predict whether an SMS message is a real message, a spam message, or a phishing message (called smishing). Here are some examples:

  - **Normal**: "For real tho this sucks. I can't even cook my whole electricity is out. And I'm hungry."
  - **Spam**: "Had your mobile 10 mths? Update to latest Orange camera/video phones for FREE. Save £s with Free texts/weekend calls. Text YES for a callback orno to opt out"
  - **Smishing**: "Todays Vodafone numbers ending 5347 are selected to receive a Rs.2,00,000 award. If you have a match please call 6299257179 quoting claim code 2041 standard rates apply"


Your goal is to train a classifier to predict the class from the SMS text.


## Methodology
You need to train classifiers using the training data, and then predict on the test data. You are free to choose the feature extraction method and classifier algorithm. You are free to use methods that were not introduced in class. You should probably use cross-validation to select good hyperparameters.
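
As a minimal sketch of cross-validated parameter selection, the block below tunes one hyperparameter of a text-classification pipeline. The TF-IDF features, the logistic regression classifier, the parameter grid, and the toy messages (`toy_txt`, `toy_y`) are all illustrative assumptions, not a prescribed solution; substitute the real training data and your own choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the real training texts and labels.
toy_txt = [
    "see you at lunch tomorrow",
    "ok call me when you get home",
    "running late be there soon",
    "thanks for the notes from class",
    "free ringtones text YES to 80000 now",
    "sale ends today get 50 percent off phones",
    "win free cinema tickets text FILM now",
    "cheap calls abroad from 1p per minute",
    "your bank account is locked verify at this link",
    "urgent claim your prize call this number with code",
    "your parcel is held pay the fee to release it",
    "account suspended confirm your password to restore",
]
toy_y = [0] * 4 + [1] * 4 + [2] * 4

# Keeping the vectorizer inside the pipeline means it is refit on each
# training fold, so no held-out text leaks into the features.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

# Assumed grid over the regularization strength C.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Score with balanced accuracy so selection matches the assignment metric.
search = GridSearchCV(pipe, param_grid, scoring="balanced_accuracy", cv=cv)
search.fit(toy_txt, toy_y)

print(search.best_params_, search.best_score_)
```

Stratified folds keep the class proportions similar across folds, which matters when some classes are rare.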


## Evaluation

You need to report your test predictions. The CSV file `smishing_val_test.csv` fixes the split between validation and test data. The validation data is used for model selection, for example to decide at which training step to checkpoint the model. The model that achieves the best performance on the validation set should be used to predict on the test data. The test performance will be used to calculate your final ranking.

The evaluation metric is **balanced accuracy score**. This is because the dataset has some class imbalance as there are more normal samples than spam/smishing samples. See details for `sklearn.metrics.balanced_accuracy_score` [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html).
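
To illustrate why balanced accuracy suits imbalanced data, the sketch below computes it by hand as the mean of the per-class recalls and compares the result with `sklearn.metrics.balanced_accuracy_score`. The labels are made up for illustration: a classifier that always predicts `normal` reaches a plain accuracy of 0.6 on them but a balanced accuracy of only 1/3.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical imbalanced labels: six normal, two spam, two smishing.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
# A trivial classifier that always predicts class 0 (normal).
y_pred = np.zeros_like(y_true)

# Recall of class c = fraction of true-c samples that were predicted as c.
recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
manual = float(np.mean(recalls))

plain = accuracy_score(y_true, y_pred)
balanced = balanced_accuracy_score(y_true, y_pred)

print("accuracy:", plain)
print("balanced accuracy (manual):", manual)
print("balanced accuracy (sklearn):", balanced)
```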

## What to hand in
You need to turn in the following things:

1. This ipynb file `Assignment1-Doc.ipynb` with your source code and documentation. _**You should write about all the various attempts that you make to find a good solution.**_ You may also submit python scripts as source code, but you then must document all analysis and results (figures, outputs, etc.) in the ipynb file.
2. The PDF file exported from your `Assignment1-Doc.ipynb` file.
3. Your final CSV submission file on test data.
4. The ipynb file `Assignment1-Final.ipynb`, which contains the code that generates the final CSV submission file.  **This code will be used to verify that your submission is reproducible.**

**Please compress all four files into a single zipfile and upload it to Assignment 1 on Canvas.**

## Basic Requirements of Documentation:

For your documentation, you need to at least explain the following things:

- **Data Preprocessing**: This section should detail your exploratory data analysis and the rationale behind all preprocessing steps.
  - Describe the initial characteristics of your dataset.
  - Explain all techniques you have applied to the data.
  - Clearly state which subset of the data was used to determine any hyperparameters of your preprocessing techniques (if any), ensuring that no information from the test set was used.
- **Methodology**: This section should justify your modeling choices.
  - Chosen Models: describe the core principle or, for deep learning, the model architecture
  - Chosen Optimizers (if any)
  - Chosen Loss Function (if any)
- **Hyperparameters**: This section should demonstrate a rigorous approach to hyperparameter selection.
  - List the key hyperparameters used
  - Document your hyperparameter search process. Compare model performance on a dedicated validation set and select the best-performing configuration based solely on validation metrics. As a core principle of this course, using the test set for hyperparameter selection is strictly prohibited and constitutes academic dishonesty.
- **Results and Visualization**: This section should provide clear evidence of your model's performance and a qualitative analysis of its behavior.
  - Learning Curves (only for deep learning methods): the loss (error) and/or accuracy on the training and validation sets must be provided.
  - Show examples of correctly classified and misclassified test samples. For misclassified samples, hypothesize why the model failed.
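
One possible way to collect misclassified samples for the qualitative analysis is sketched below. The messages, true labels, and predictions are hypothetical placeholders; with real data, `y_pred` would come from your selected model.

```python
import numpy as np

classlabels = ["normal", "spam", "smishing"]

# Hypothetical messages, true labels, and model predictions.
texts = [
    "see you at lunch",
    "free entry text WIN now",
    "verify your account at this link",
    "call me later",
    "you have won a prize call now",
]
y_true = np.array([0, 1, 2, 0, 1])
y_pred = np.array([0, 1, 1, 0, 2])

# Indices where the prediction disagrees with the true label.
wrong = np.flatnonzero(y_true != y_pred)

# Format each error as "[true -> predicted] text" for inspection.
lines = [
    "[{} -> {}] {}".format(classlabels[y_true[i]], classlabels[y_pred[i]], texts[i])
    for i in wrong
]
for line in lines:
    print(line)
```

Grouping the errors by (true, predicted) pair can make recurring confusions, such as smishing predicted as spam, easier to spot.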

## Grading
The marks of the assignment are distributed as follows:
- 45% - Results using various classifiers and feature representations.
- 30% - Trying out feature representations (e.g. adding additional features) or classifiers not used in the tutorials/lectures.
- 20% - Quality of the written report.  More points for insightful observations and analysis.
- 5% - Final performance on the test data. If a submission cannot be reproduced by the submitted code, it will not receive marks for ranking.
- **Late Penalty:** 25 marks will be subtracted for each day late.

**NOTE:** This is an _individual_ assignment.

**NOTE:** You should start early! Some classifiers may take a while to train.

<hr>

# Load the Data

The training data is in the text file `smishing_train.txt`. This comma-separated file contains the SMS text and the class label. The class labels `0`, `1`, and `2` correspond to `normal`, `spam`, and `smishing`.

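The validation/testing data is in the text file `smishing_val_test.txt`, which contains only the SMS text.

The labels of the validation/testing data are in the CSV file `smishing_val_test.csv`, which contains the sample `Id`, the class label (`Prediction`), and whether the sample belongs to the validation or test split (`Usage`).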

You need to generate a CSV file with the following format:

<pre>
Id,Prediction
1,0
2,1
3,0
4,2
...
</pre>

Here are two helpful functions for reading the text data and writing the csv file.

In [None]:
%matplotlib inline
import matplotlib_inline   # setup output image format
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
import matplotlib.pyplot as plt
import matplotlib
from numpy import *
from sklearn import *
from scipy import stats
import csv
import pandas as pd
random.seed(100)

In [None]:
def read_text_data(fname):
    txtdata = []
    classes = []
    with open(fname, 'r', encoding='utf-8') as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        for row in reader:
            # get the text
            txtdata.append(row[0])
            # get the class (convert to integer)
            if len(row)>1:
                classes.append(int(row[1]))

    return (txtdata, classes)

def write_csv(fname, Y):
    # fname = file name
    # Y is a list/array with class entries

    # header
    tmp = [['Id', 'Prediction']]

    # add ID numbers for each Y
    for (i,y) in enumerate(Y):
        tmp2 = [(i+1), y]
        tmp.append(tmp2)

    # write CSV file (newline='' avoids extra blank rows on Windows)
    with open(fname, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(tmp)

The code below loads the training set and the validation/test set.

In [None]:
# load the data
(traintxt, trainY) = read_text_data("smishing_train.txt")
(testtxt, _)       = read_text_data("smishing_val_test.txt")
testY              = pd.read_csv("smishing_val_test.csv")
print(len(traintxt))
print(len(testtxt))
testY.head()

2985
2986


Unnamed: 0,Id,Prediction,Usage
0,1,0,val
1,2,0,test
2,3,0,val
3,4,0,test
4,5,0,val
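
The `Usage` column shown above can be used to score predictions on the validation rows only. The sketch below is a minimal example on a hypothetical stand-in table; it assumes the predictions are stored in the same row order as the label file, which should be checked against the `Id` column on the real data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import balanced_accuracy_score

# Hypothetical stand-in for pd.read_csv("smishing_val_test.csv").
labels = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6],
    "Prediction": [0, 0, 1, 2, 2, 1],
    "Usage": ["val", "test", "val", "test", "val", "test"],
})

# Hypothetical model predictions, assumed to be in the same row order.
pred = np.array([0, 0, 1, 2, 0, 1])

# Boolean mask selecting the validation rows.
val_mask = (labels["Usage"] == "val").to_numpy()

# Balanced accuracy on the validation rows only; use this for model selection.
val_score = balanced_accuracy_score(
    labels.loc[val_mask, "Prediction"], pred[val_mask]
)
print("validation balanced accuracy:", val_score)
```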


In [None]:
# show the classnames
classnames = unique(trainY)
print(classnames)

[0 1 2]


In [None]:
classlabels = ['normal', 'spam', 'smishing']

Here is an example of writing a CSV file with predictions on the test set. These predictions are random.

In [None]:
# write your predictions on the test set
i = random.randint(len(classnames), size=len(testtxt))
predY = classnames[i]
write_csv("my_submission.csv", predY)

Look at the data:

In [None]:
for c in classnames:
    tmp = where(trainY==c)
    for a in tmp[0][0:5]:
        print('[{}]: {}'.format(classlabels[trainY[a]], traintxt[a]))

[normal]: Dunno da next show aft 6 is 850. Toa payoh got 650.
[normal]: I.ll hand her my phone to chat wit u
[normal]: I dont have i shall buy one dear
[normal]: Nite...
[normal]: Ok�congrats�
[spam]: I'd like to tell you my deepest darkest fantasies. Call me 09094646631 just 60p/min. To stop texts call 08712460324 (nat rate)
[spam]: Santa Calling! Would your little ones like a call from Santa Xmas eve? Call 09058094583 to book your time.
[spam]: Meet Top 35 US universities in Delhi at India Habitat Centre Lodhi Road on Nov 8th, 2 to 6 pm for student admission.Entry Free,  details contact 9911489000
[spam]: SMS AUCTION You have won a Nokia 7250i. This is what you get when you win our FREE auction. To take part send Nokia to 86021 now. HG/Suite342/2Lands Row/W1JHL 16+
[spam]: Call Germany for only 1 pence per minute! Call from a fixed line via access number 0844 861 85 86.
[smishing]: WIN URGENT! Your mobile number has been awarded with a £2000 prize GUARANTEED call 09061790121 from lan
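
Beyond reading individual messages, it can help to count how many samples each class has, since the dataset is imbalanced. The sketch below does this with NumPy on a hypothetical label list (`toy_y`); on the real data the same call would be applied to `trainY`.

```python
import numpy as np

classlabels = ["normal", "spam", "smishing"]

# Hypothetical labels standing in for trainY.
toy_y = [0, 0, 0, 0, 1, 2, 0, 1]

# Unique class values and the number of samples in each class.
classes, counts = np.unique(toy_y, return_counts=True)

for c, n in zip(classes, counts):
    print("{}: {} samples ({:.1%})".format(classlabels[c], n, n / len(toy_y)))
```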

# YOUR CODE and DOCUMENTATION HERE

In [None]:
# INSERT YOUR CODE HERE