# Assignment 3

## Guidelines

> Remember that this is a code notebook - add an explanation of what you do using text boxes and markdown, and comment your code. Answers without explanations may get fewer points.
>
> If you re-use a substantial portion of code you find online, e.g. on Stack Overflow, you need to add a link to it and make the borrowing explicit. The same applies if you take it and modify it, even substantially. There is nothing bad in doing that, provided you acknowledge it and make it clear you know what you're doing.
>
> The **Generative AI policy** from the syllabus for the programming assignments applies. Generative AI can be used as a source of information in these assignments if properly referenced. You can use generative AI assistance for writing code, but you must reference the chat used as a source, just as you would for code taken from Stack Overflow. In ChatGPT, you can create a URL to the information you obtained by clicking the "Share link to Chat" button and then "Copy Link". This allows you to cite the source of the information you use in your answer or code solution. Of course, as you know, GenAI tools are not always a reliable source, and their answers are drawn opaquely from other sources, so it is recommended to cross-check their output against other sources or your own understanding of the topic.
> 
> For the explanations of what you do that you provide with each question, as well as for (sub)questions that ask about things like motivation of choices or your opinion, the answer to this must be conceptualized and written by yourself and not copied from a generative AI source.
>
> Make sure your notebooks have been run when you submit, as I won't run them myself. Submit the `.ipynb` file along with an `.html` export of the same. Submit all necessary auxiliary files as well. Please compress your submission into a `.zip` archive. Only `.zip` files can be submitted.
> If you are using Google Colab, here is a tutorial for obtaining an HTML export: https://stackoverflow.com/questions/53460051/convert-ipynb-notebook-to-html-in-google-colab .
>
> With Jupyter, you can simply export it as HTML through the File menu.

## Grading policy
> As follows:
>
> * 80 points for correctly completing the assignment.
>
> * 20 points for appropriately writing and organizing your code in terms of structure, readability (also by humans), comments and minimal documentation. It is important to be concise but also to explain what you did and why, when not obvious. Feel free to re-use functions and variables from previous questions if that helps for structure and readability - you do not need to repeat previous steps for each question.
> 
> Note that there are no extras for this assignment, as all 100 points are accrued via questions and question 6 has 10 'advanced' points to get.

**The AUC code of conduct applies to this assignment: please only submit your own work and follow the instructions on referencing external sources above.**

---

# Introduction

In this assignment, you will build and compare classifiers for measuring the **sentiment of tweets related to COVID-19** from the early days of the first outbreak.

The dataset you will work with is [publicly available on Kaggle](https://www.kaggle.com/datatattle/covid-19-nlp-text-classification) (and attached to the assignment for your convenience). Make sure to check its minimal Kaggle documentation before starting.

This is a real dataset, and therefore messy. It is possible that you won't achieve great results on the classification task with your classifier. That is normal, don't worry about it! You may also find text encoding issues with this dataset. Try to find a simple solution to this problem; I don't think there is an easy way to fix it completely for these files.
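
As a hedged illustration of the encoding issue, the sketch below uses a made-up byte string (not the actual dataset) to show why a strict UTF-8 decode can fail where Latin-1 succeeds, and how UTF-8 text that was wrongly decoded as Latin-1 can sometimes be repaired. The helper name `decode_with_fallback` is hypothetical.

```python
def decode_with_fallback(raw: bytes) -> str:
    """Try UTF-8 first; fall back to Latin-1, which accepts any byte sequence."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin1")


latin1_bytes = "café".encode("latin1")  # b'caf\xe9', which is not valid UTF-8
utf8_bytes = "café".encode("utf-8")     # b'caf\xc3\xa9'

print(decode_with_fallback(latin1_bytes))
print(decode_with_fallback(utf8_bytes))

# Mojibake: UTF-8 bytes that were read as Latin-1 ...
garbled = utf8_bytes.decode("latin1")
# ... can be reversed by re-encoding, as long as no bytes were lost on the way.
repaired = garbled.encode("latin1").decode("utf-8")
print(garbled, "->", repaired)
```

Latin-1 never raises a decoding error, which makes it a simple fallback, but it will silently produce garbled characters if the file was really in another encoding.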

*Please note: this dataset should not, but might, contain content that could be considered offensive.*

---

# Skeleton pipeline (20 points)

## Question 1 (8 points)

Your dataset contains tweets, including handles, hashtags, URLs, etc. Set up a **minimal pre-processing pipeline** for them (focus on the `OriginalTweet` column), possibly including:

* Tokenization
* Filtering
* Lemmatization/Stemming

Please note that what to include is up to you. Motivate your choices and remember that more is not necessarily better: if you are not sure why you are doing something, it might be better not to. Feel free to use NLTK, spaCy or anything else you like here.

*Note: we only really use the `OriginalTweet` and `Sentiment` columns for this assignment.*
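
One possible shape for such a pipeline, sketched with the standard library only. The regular expressions and the `clean_tweet` helper are illustrative assumptions, not a complete tweet grammar, and lemmatization or stemming is deliberately left out.

```python
import re

# Illustrative patterns (assumptions): URLs, @handles, and anything non-alphabetic.
URL_RE = re.compile(r"https?://\S+")
HANDLE_RE = re.compile(r"@\w+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text: str) -> list[str]:
    """Lowercase a tweet, drop URLs and handles, and return alphabetic tokens."""
    text = text.lower()
    text = URL_RE.sub(" ", text)        # filtering: remove links
    text = HANDLE_RE.sub(" ", text)     # filtering: remove @mentions
    text = NON_ALPHA_RE.sub(" ", text)  # drops '#', digits and punctuation
    return text.split()                 # tokenization on whitespace


example = "@user Stocking up at the supermarket #COVID19 https://t.co/abc123"
print(clean_tweet(example))
# ['stocking', 'up', 'at', 'the', 'supermarket', 'covid']
```

Removing URLs before stripping punctuation avoids leftover fragments such as `httpstco...` ending up as tokens.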

In [1]:
# your code here

import numpy as np
import pandas as pd
from pathlib import Path

DATA_PATH = Path.cwd() / "data/Corona_NLP_train.csv"

# the file is not UTF-8 encoded, so change encoding
df_data = pd.read_csv(DATA_PATH, encoding="latin1")
df_data = df_data[["OriginalTweet", "Sentiment"]]
print(df_data.head())
print(df_data.info())

                                       OriginalTweet           Sentiment
0  @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...             Neutral
1  advice Talk to your neighbours family to excha...            Positive
2  Coronavirus Australia: Woolworths to give elde...            Positive
3  My food stock is not the only one which is emp...            Positive
4  Me, ready to go at supermarket during the #COV...  Extremely Negative
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41157 entries, 0 to 41156
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   OriginalTweet  41157 non-null  object
 1   Sentiment      41157 non-null  object
dtypes: object(2)
memory usage: 643.2+ KB
None


In [2]:
import spacy

nlp = spacy.load("en_core_web_sm",
                exclude=["tok2vec", "parser", "ner"])

texts = df_data["OriginalTweet"]

def doc_to_string(doc) -> str:
    """
    Flattens a spaCy Doc into a cleaned string: keeps tokens longer than
    two characters, lowercases them and strips non-alphabetic characters
    """
    
    s = ""
    for token in doc:
        if len(str(token)) > 2:
            s += ''.join([
                char for char in str(token).lower()
                if char.isalpha()
            ]) + ' '

    return s.rstrip()

docs = [doc_to_string(doc) for doc in nlp.pipe(texts)]

df_data["OriginalTweet"] = docs
print(df_data.head())

                                       OriginalTweet           Sentiment
0  menyrbie philgahan chrisitv httpstcoifzfan and...             Neutral
1  advice talk your neighbours family exchange ph...            Positive
2  coronavirus australia woolworths give elderly ...            Positive
3  food stock not the only one which empty   plea...            Positive
4  ready supermarket during the covid outbreak  n...  Extremely Negative


---

## Question 2 (4 points)

**Split your data into a train and a validation set**. You can use 85% for training and 15% for validation, or similar proportions. Remember to shuffle your data before splitting, specifying a seed to be able to replicate your results.
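
The idea of a seeded shuffle followed by a cut can be sketched in plain Python as below. The function name `shuffle_split` and the toy data are assumptions for illustration; scikit-learn's `train_test_split` with a `random_state` covers the same need in practice.

```python
import random


def shuffle_split(items, val_fraction=0.15, seed=42):
    """Shuffle a copy of `items` with a fixed seed, then cut off a validation part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # same seed -> same order on every run
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


data = list(range(20))
train, val = shuffle_split(data)
print(len(train), len(val))  # 17 3
```

Because the seed is fixed, calling the function again on the same data returns exactly the same split, which is what makes results replicable.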

In [3]:
# your code here

from sklearn.model_selection import train_test_split

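# train_test_split returns (train, test) in that order. With test_size=0.9 the
# first 10% part is kept as a held-out test set and the 90% remainder is split further.
(X_test, X_nottest,
 y_test, y_nottest) = train_test_split(
     df_data["OriginalTweet"], df_data["Sentiment"], test_size=0.9, random_state=888
 )

# Same trick: the first 15% part becomes validation, the 85% remainder is training.
(X_val, X_train,
 y_val, y_train) = train_test_split(
     X_nottest, y_nottest, test_size=0.85, random_state=777
 )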

print(X_train.shape, X_val.shape, X_test.shape)

(31486,) (5556,) (4115,)


---

## Question 3 (8 points)

Write a function which, given as input a set of predictions and a set of ground truth labels and the name of the method, prints out a **classification report** including:
* Name of the method
* Accuracy
* Precision, recall and F1 measure
* An example of a correctly classified datapoint (e.g. a tweet)
* An example of a wrongly classified datapoint

*Note: You can do this question at the same time as question 4 so that you have something to report (the result of the baseline)*

In [4]:
# your code here

from sklearn.metrics import confusion_matrix

def classification_report(pred, truth, model: str):
    """
    Prints accuracy and per-class precision, recall and F1 for a given method
    """
    # LABELS = np.unique(truth)
    labels = ["Extremely Negative", "Negative", "Neutral", "Positive", "Extremely Positive"]

    print(f"Results for {model}")

    total_correct = np.sum(pred == truth)
    accuracy = total_correct / len(truth)
    print(f"Accuracy:   {round(accuracy*100, 1)}%")
    
    cm = confusion_matrix(truth, pred, labels=labels)
    # print(labels, cm)

    # rows = true class
    # cols = predicted class
    row_sums = np.sum(cm, axis = 1)
    col_sums = np.sum(cm, axis = 0)
    
    for i, label in enumerate(labels):
        print(f"\nScores for class \"{label}\"")
        
        t_pos = cm[i, i]
        f_neg = row_sums[i] - t_pos
        f_pos = col_sums[i] - t_pos
        # true negatives: everything outside row i and column i (not used below)
        t_neg = cm.sum() - t_pos - f_neg - f_pos
        
        precision = t_pos / (t_pos + f_pos)
        recall = t_pos / (t_pos + f_neg)
        f_measure = 2 * ((precision * recall) / (precision + recall))
        print(f"Precision:  {round(precision * 100, 1)}%")
        print(f"Recall:     {round(recall * 100, 1)}%")
        print(f"F1 Measure: {round(f_measure * 100, 1)}%")

    # more to do but i'm tired
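
The two remaining items of the report (one correctly and one wrongly classified datapoint) could be covered along the following lines. This is only a sketch: `pick_examples` is a hypothetical helper, and it assumes the tweet texts are passed in alongside the predictions and labels, which the function above does not currently do.

```python
def pick_examples(texts, pred, truth):
    """Return the first correct and the first wrong prediction as
    (text, predicted, actual) tuples, or None if there is no such case."""
    correct = wrong = None
    for text, p, t in zip(texts, pred, truth):
        if p == t and correct is None:
            correct = (text, p, t)
        elif p != t and wrong is None:
            wrong = (text, p, t)
        if correct is not None and wrong is not None:
            break
    return correct, wrong


# Toy data, invented for illustration.
toy_texts = ["shelves are empty", "thank you nurses", "prices went up"]
toy_pred = ["Negative", "Positive", "Neutral"]
toy_truth = ["Negative", "Positive", "Negative"]

correct, wrong = pick_examples(toy_texts, toy_pred, toy_truth)
print("Correct:", correct)
print("Wrong:  ", wrong)
```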

---

# Classifying (45 points)

As you will be performing classification on real data, processes may take a while to run. This is normal, but it should not take hours. Here's some advice if you find that some of your code takes a long time to run:
- If you are doing a hyperparameter search, try to make it quite small. Every hyperparameter combination that you try means training a new model, and runtimes can explode. You do not need to do a huge search for this assignment, it is enough if I can see that you are able to do it with a small example.
- If you are doing a grid search, work out how many hyperparameter combinations your code will check, and add print statements so you know how far along it is. The number of combinations is the product of the number of options for each hyperparameter, so it grows exponentially with the number of hyperparameters and can get out of hand quickly. Also, if training a single model as a step of your grid search takes longer than just training the model separately, there might be an issue with your grid search code.
- In a real project, you would want to make your code such that you can pause and resume training or optimization without having to re-do everything, e.g. by writing the results to a file. But for the purpose of this assignment it is not necessary to make it that complicated.
- Use separate code blocks especially for the part of code that trains a model. That way, you only need to run the training step once, while you can mess around with the output/evaluation etc. without having to wait for a new model to be trained each time.
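
To make the advice about counting combinations concrete, here is a small sketch that enumerates a search space and prints progress. The hyperparameter names, values and fold count are hypothetical, and the training step is left as a comment.

```python
from itertools import product

# Hypothetical search space; names and values are illustrative only.
param_grid = {
    "C": [0.1, 1.0, 10.0],
    "max_df": [0.5, 0.9],
    "ngram_range": [(1, 1), (1, 2)],
}

names = list(param_grid)
combos = list(product(*param_grid.values()))  # 3 * 2 * 2 combinations
n_folds = 3  # assumed number of cross-validation folds

print(f"{len(combos)} combinations x {n_folds} folds = {len(combos) * n_folds} model fits")

for i, values in enumerate(combos, start=1):
    params = dict(zip(names, values))
    print(f"[{i}/{len(combos)}] {params}")
    # train and evaluate one model with `params` here
```

Adding a fourth hyperparameter with three options would triple the count to 36 combinations, which is why small grids are recommended here.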

## Question 4 (10 points)

An important first step when dealing with a real-world task is establishing a **solid baseline**. A baseline allows you to a) develop the first full pipeline for your task, and b) have something to compare against when you develop more advanced models.

Pick a method to use as a baseline. *A good option might be a TF-IDF Logistic Regression*. Feel free to use scikit-learn or another library of choice. See [here](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) for more options.

Use your classification report function and the validation set to report on the performance of your baseline. *Pay attention: the validation data only needs to be transformed, and must not be used to fit any transformation. For example, if you have used a TF-IDF vectorizer by fitting it to your train data and then transformed it, use the same fitted vectorizer to transform your validation data.*

In [5]:
# your code here

from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF
from sklearn.linear_model import LogisticRegression

tfidf = TFIDF(
    max_df=0.5,
    min_df=2
)
X_train_tfidf = tfidf.fit_transform(X_train)
X_val_tfidf = tfidf.transform(X_val) # do not fit!

tfidf_logreg = LogisticRegression(solver="sag")
tfidf_logreg.fit(X_train_tfidf, y_train)
# no transformation needed for logistic regression
y_pred_tfidf_logreg = tfidf_logreg.predict(X_val_tfidf)

classification_report(y_pred_tfidf_logreg, y_val, "TF-IDF Logistic Regression")

Results for TF-IDF Logistic Regression
Accuracy:   56.3%

Scores for class "Extremely Negative"
Precision:  64.3%
Recall:     49.4%
F1 Measure: 55.8%

Scores for class "Negative"
Precision:  49.2%
Recall:     51.5%
F1 Measure: 50.3%

Scores for class "Neutral"
Precision:  64.0%
Recall:     64.8%
F1 Measure: 64.4%

Scores for class "Positive"
Precision:  50.9%
Recall:     60.2%
F1 Measure: 55.1%

Scores for class "Extremely Positive"
Precision:  67.0%
Recall:     52.5%
F1 Measure: 58.8%


<h3> EXPLANATION </h3>

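Logistic regression is not limited to two classes: scikit-learn's `LogisticRegression` handles multiclass targets directly, so it can be trained on the five sentiment labels as is.

The `sag` solver is used because the default `lbfgs` solver reached its iteration limit before converging.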

---

## Question 5 (20 points)

Try now to **beat your baseline**. Feel free to use scikit-learn or another library of choice. See [here](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) for more options.

How to beat the baseline? There are many ways:
1. You could have a better text representation (e.g., using PPMI instead of TF-IDF, note that this is challenging because there is no ready-made scikit-learn vectorizer for this).
2. You can pick a more powerful model (e.g., random forests or SVMs).
3. You have to find good hyperparameters for your model, and not just use the default ones.

Regarding point 3 above, make sure to perform some hyperparameter searching using [grid search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) or [randomized search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html).
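
A minimal sketch of a small grid search, assuming scikit-learn is installed. The toy texts, labels and parameter values are invented for illustration and are far smaller than the real dataset. Wrapping the vectorizer and classifier in a `Pipeline` means the vectorizer is re-fitted on the training folds only during cross-validation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only.
texts = [
    "great news today", "so happy and safe", "love this support",
    "good people helping", "happy good news", "great support today",
    "terrible panic buying", "awful empty shelves", "bad scary news",
    "hate this panic", "scary empty shelves", "awful bad day",
]
labels = ["Positive"] * 6 + ["Negative"] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step names are prefixed with "<step>__" to address parameters inside the pipeline.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(len(search.cv_results_["params"]), "combinations tried")
print("Best parameters:", search.best_params_)
```

After fitting, `search.predict(...)` uses the best pipeline refitted on all the data passed to `fit`, so it can be applied directly to the validation tweets.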

Use your classification report function and the validation set to report on the performance of your improved model. *Pay attention: the validation data only needs to be transformed, and must not be used to fit any transformation. For example, if you have used a TF-IDF vectorizer by fitting it to your train data and then transformed it, use the same fitted vectorizer to transform your validation data.*

In [6]:
# your code here

---

## Question 6 (15 points)

Design, develop and train a **neural network-based classifier** for this task, using scikit-learn, PyTorch or the Transformers library. The scikit-learn approach is demonstrated in Notebook 7_1, the PyTorch approach is demonstrated in Notebook 7_2. The Transformers approach is the most state-of-the-art approach, which involves taking a pre-trained LLM and tuning a sequence classification head for your text classification task. You can find a basic example in the Hugging Face documentation: https://huggingface.co/docs/transformers/en/tasks/sequence_classification

The scikit-learn option is probably simpler than you think. PyTorch and Transformers classifiers are more advanced and challenging, but due to the current popularity of Transformer models it is relatively easy to find solutions to your problems with Transformer models. If you are up for a challenge, choose PyTorch if you are more interested in foundations of neural networks and machine learning more broadly, or choose Transformers if you are interested in LLMs and textual data.

The classifier can have the structure that you prefer and use an embedding model of your choice, just make sure to motivate your choices.

*Note: an NN-based classifier with scikit-learn yields 5 points max; one with PyTorch or a pre-tuned Transformers-based model yields 10 points max; one with PyTorch and pre-trained embeddings or a Transformers-based model tuned by yourself yields 15 points max. If you try PyTorch or Transformers but get stuck, you can still get partial points if you have a good explanation of what you tried.*

Use your classification report function and the validation set to report on the performance of your neural network-based classifier. *Pay attention: the validation data only needs to be transformed, and must not be used to fit any transformation. For example, if you have used a TF-IDF vectorizer by fitting it to your train data and then transformed it, use the same fitted vectorizer to transform your validation data.*

In [7]:
# your code here

---

# Evaluating your classifiers (15 points)

## Question 7 (8 points)

Evaluate the performance of your models on the **test set**. Make sure to transform your test data as you did for your train data, and as needed for each classifier. *Pay attention: the test data only needs to be transformed, and must not be used to fit any transformation. For example, if you have used a TF-IDF vectorizer by fitting it to your train data and then transformed your train and validation with it, use the same fitted vectorizer to transform your test data.*

* Report the accuracy of each classifier, as well as its precision, recall and F1 score. 
* Plot a [confusion matrix](https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html) for your best classifier.
* Briefly discuss your results.

In [8]:
df_test = pd.read_csv("data/Corona_NLP_test.csv")

In [9]:
# your code here

---

## Question 8 (7 points)

When you perform a classification or labeling task, you may want to perform an error analysis to look for avenues for improvement. You can do this both quantitatively and qualitatively.

For your best classifier:
* Collect misclassified samples, e.g. by modifying your evaluation code from Question 7.

Perform a brief quantitative error analysis of your best classifier:
* Choose some properties that you think are relevant to classification quality, such as the length of the tweet or use of emoji. Come up with three interesting properties.
* Compute and compare these three properties for the misclassified samples to the average distribution over all samples.
* Describe your conclusions.
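
The quantitative comparison could be sketched as follows. The three properties (token count, hashtag count, presence of a URL), the helper names and the toy tweets are all assumptions for illustration. Properties such as hashtags have to be computed on the raw tweet text, before any pre-processing strips them.

```python
from statistics import mean


def tweet_properties(text: str) -> dict:
    """Compute three simple, illustrative properties of a raw tweet."""
    tokens = text.split()
    return {
        "n_tokens": len(tokens),
        "n_hashtags": sum(tok.startswith("#") for tok in tokens),
        "has_url": int("http" in text),
    }


def average_properties(texts) -> dict:
    """Average each property over a collection of tweets."""
    rows = [tweet_properties(t) for t in texts]
    return {key: mean(row[key] for row in rows) for key in rows[0]}


# Toy data, invented for illustration.
all_texts = [
    "Stay safe everyone #COVID19",
    "Shelves are empty again https://t.co/x",
    "Thanks to all the workers",
    "Panic buying is not fine #panic #toiletpaper",
]
misclassified = [all_texts[1], all_texts[3]]

print("All samples:  ", average_properties(all_texts))
print("Misclassified:", average_properties(misclassified))
```

Comparing the two dictionaries side by side gives a first indication of whether, for example, longer tweets or tweets with URLs are over-represented among the errors.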

Perform a brief qualitative error analysis of your best classifier:
* Look at the misclassified samples, and make observations about their properties. Identify some properties that you think are relevant to classification quality but that you can't easily quantify, such as usage of sarcasm or irony, negation issues (not bad != bad), spelling or grammar issues, interpretation of emojis, context dependence of the tweet, or other observations.
* Describe your conclusions.


In [10]:
# your code here

---