### Exercise 2
# Getting more hands-on experience with supervised machine learning

## Question 1. 
In exercise 1, you set up a classifier using the hatespeech dataset (retrieved from: https://www.dropbox.com/sh/4mapojr85a6sc76/AABYMkjLVG-HhueAgd0qM9kwa?dl=0).
The classifier combined a count vectorizer with Naïve Bayes. For that exercise, you wrote the following code:

```python
import csv
from collections import Counter
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix


file = "hatespeech_text_label_vote_RESTRICTED_100K.csv"
tweets = []
labels = []

with open(file) as fi:
    data = csv.reader(fi, delimiter='\t')
    for row in data:
        tweets.append(row[0])
        labels.append(row[1])

print(len(tweets) == len(labels)) # there should be just as many tweets as there are labels

Counter(labels)
plt.bar(Counter(labels).keys(), Counter(labels).values())


# splitting up the dataset
from sklearn.model_selection import train_test_split

tweets_train, tweets_test, y_train, y_test = train_test_split(tweets, labels, test_size=0.2, random_state=42)


# Classifier with a count vectorizer and Naïve Bayes
from sklearn.feature_extraction.text import (CountVectorizer)

countvectorizer = CountVectorizer(stop_words="english")
X_train = countvectorizer.fit_transform(tweets_train)
X_test = countvectorizer.transform(tweets_test)

from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(X_train, y_train)

y_pred = nb.predict(X_test)


# Print a classification report for the classifier
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))


```

Using the examples from the lecture slides, can you train another classifier based on Logistic Regression and a tf-idf vectorizer?

## Question 2.

As discussed earlier, we can try different combinations of these models (Naïve Bayes and Logistic Regression) and vectorizers (count and tf-idf). If you pair each of the two models with each of the two vectorizers, how many different classifiers could you train?
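
One way to check your answer is to enumerate the combinations in code. The sketch below only pairs up names (no models are trained); the variable names are chosen for illustration.

```python
from itertools import product

# Names only, used to count the possible pairings; nothing is trained here
models = ["Naive Bayes", "Logistic Regression"]
vectorizers = ["count", "tf-idf"]

# Every model paired with every vectorizer
combinations = list(product(models, vectorizers))

for model, vectorizer in combinations:
    print(f"{model} + {vectorizer}")
print(len(combinations))
```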

## Question 3.

We could simply copy-paste the code used in the previous questions and adjust it for each of the classifiers. However, a cleaner approach is to define the specifics of each classifier (a name, a vectorizer, and a model) once, in a list. The code below does that.

In this code, a loop then trains and evaluates each classifier that is defined in the first part of the code.

Run the code below and compare it to the code that you wrote to train one classifier: Do you understand what is happening there?

In [9]:
from sklearn.feature_extraction.text import (CountVectorizer)
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import (TfidfVectorizer)
from sklearn.linear_model import (LogisticRegression)
from sklearn.metrics import classification_report

configs = [
  ("NB-count",CountVectorizer(min_df=5,max_df=.5),
   MultinomialNB()),
  ("NB-TfIdf",TfidfVectorizer(min_df=5,max_df=.5),
   MultinomialNB()),
  ("LR-Count",CountVectorizer(min_df=5,max_df=.5),
   LogisticRegression(solver="liblinear")),
  ("LR-TfIdf",TfidfVectorizer(min_df=5,max_df=.5),
   LogisticRegression(solver="liblinear"))]


for name, vectorizer, classifier in configs:
    print(name)
    X_train = vectorizer.fit_transform(tweets_train)
    X_test = vectorizer.transform(tweets_test)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    print(classification_report(y_test, y_pred))
    print("\n") 

NB-count
              precision    recall  f1-score   support

     abusive       0.80      0.88      0.84      5369
     hateful       0.41      0.28      0.33       966
      normal       0.85      0.79      0.82     10848
        spam       0.53      0.63      0.57      2817

    accuracy                           0.77     20000
   macro avg       0.65      0.64      0.64     20000
weighted avg       0.77      0.77      0.77     20000



NB-TfIdf
              precision    recall  f1-score   support

     abusive       0.81      0.81      0.81      5369
     hateful       0.87      0.05      0.09       966
      normal       0.76      0.92      0.83     10848
        spam       0.65      0.32      0.43      2817

    accuracy                           0.77     20000
   macro avg       0.77      0.53      0.54     20000
weighted avg       0.76      0.77      0.73     20000



LR-Count
              precision    recall  f1-score   support

     abusive       0.87      0.91      0.89 
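
The body of the loop above could also be wrapped in a function, so that the training and evaluation steps live in one reusable place. The following is a minimal sketch under some assumptions: `train_and_report` and the toy texts are hypothetical, and the toy vectorizers omit `min_df=5` and `max_df=.5` because the example data is far too small for those thresholds.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB


def train_and_report(vectorizer, classifier,
                     texts_train, texts_test, labels_train, labels_test):
    """Fit one vectorizer/classifier pair and return its classification report."""
    X_train = vectorizer.fit_transform(texts_train)  # learn the vocabulary on training data only
    X_test = vectorizer.transform(texts_test)        # reuse that vocabulary for the test data
    classifier.fit(X_train, labels_train)
    y_pred = classifier.predict(X_test)
    return classification_report(labels_test, y_pred, zero_division=0)


# Hypothetical toy data so the sketch runs without the tweet dataset
toy_train = ["free money click here", "win a prize now",
             "see you at lunch", "nice weather today"]
toy_train_labels = ["spam", "spam", "normal", "normal"]
toy_test = ["click to win money", "lunch today"]
toy_test_labels = ["spam", "normal"]

toy_configs = [
    ("NB-count", CountVectorizer(), MultinomialNB()),
    ("LR-TfIdf", TfidfVectorizer(), LogisticRegression(solver="liblinear")),
]

reports = {}
for name, vectorizer, classifier in toy_configs:
    reports[name] = train_and_report(vectorizer, classifier,
                                     toy_train, toy_test,
                                     toy_train_labels, toy_test_labels)
    print(name)
    print(reports[name])
```

With the real data, you would pass `tweets_train`, `tweets_test`, `y_train`, and `y_test` instead of the toy lists.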

## Question 4.

Check out the scikit-learn documentation (https://scikit-learn.org/stable/supervised_learning.html). Can you train a classifier with other models? Can you merge this code into the code used in the previous question?
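
As one possible starting point, the sketch below adds two other scikit-learn models, `LinearSVC` and `RandomForestClassifier`, in the same (name, vectorizer, classifier) format. The choice of these two models and the toy data are assumptions made for illustration; many other models in the documentation would work in the same way.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data so the sketch runs without the tweet dataset
toy_train = ["free money click here", "win a prize now",
             "see you at lunch", "nice weather today"]
toy_train_labels = ["spam", "spam", "normal", "normal"]
toy_test = ["click to win money", "lunch today"]

# Same tuple layout as before: name, vectorizer, classifier
extra_configs = [
    ("SVM-TfIdf", TfidfVectorizer(), LinearSVC()),
    ("RF-Count", CountVectorizer(),
     RandomForestClassifier(n_estimators=50, random_state=42)),
]

predictions = {}
for name, vectorizer, classifier in extra_configs:
    X_train = vectorizer.fit_transform(toy_train)
    X_test = vectorizer.transform(toy_test)
    classifier.fit(X_train, toy_train_labels)
    predictions[name] = list(classifier.predict(X_test))
    print(name, predictions[name])
```

In the notebook, entries like these could be appended to the existing list of configurations, with the `min_df` and `max_df` settings used there.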

## Question 5.

Based on the output that the classifiers print, which classifier performs best? In your answer, consider:
* What information you need to identify the best classifier
* Which metric (i.e., precision, recall, accuracy, or F1-score) you base your conclusion on, and why

## Question 6.

Let's say that you base your evaluation on the F1-score of the classifier. You can choose between the macro average and the weighted average of the F1-score. Check out the scikit-learn documentation (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html#sklearn.metrics.precision_recall_fscore_support). Which F1-value (macro average or weighted average) would you select?
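
To see how the two averages can differ, it may help to compute them on a small imbalanced example. The labels below are made up for illustration: eight 'normal' tweets and two 'hateful' ones, with a classifier that misses one of the hateful tweets.

```python
from sklearn.metrics import f1_score

# Hypothetical imbalanced toy labels: 8 "normal", 2 "hateful"
y_true = ["normal"] * 8 + ["hateful"] * 2
# Predictions that get every "normal" tweet right but only one of the two "hateful" ones
y_pred = ["normal"] * 8 + ["hateful", "normal"]

# F1 per class, in the order given by `labels`
per_class = f1_score(y_true, y_pred, average=None, labels=["hateful", "normal"])
# Macro: unweighted mean of the per-class F1-scores
macro = f1_score(y_true, y_pred, average="macro")
# Weighted: mean of the per-class F1-scores, weighted by each class's support
weighted = f1_score(y_true, y_pred, average="weighted")

print("per class:", per_class)
print("macro:    ", round(macro, 2))
print("weighted: ", round(weighted, 2))
```

Here the per-class F1-scores are roughly 0.67 (hateful) and 0.94 (normal), which gives a macro average of about 0.80 and a weighted average of about 0.89. The weighted average is pulled towards the large class.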

## Question 7.

When looking at the classification report, you will see another column indicating values for something labelled 'support'.
Can you do some searching online and find out what 'support' is?

## Question 8.

Two researchers want to use the classifier to distinguish between tweets that are spam or hateful and tweets that are not (either because they are normal or abusive). 
They are, however, not happy with the performance of the classifier when looking at the accuracy, precision, and recall for the spam category or the hateful category.
One of the researchers suggests first recoding the labels, so that all tweets annotated as 'spam' or 'hateful' are grouped together under one label and all other tweets are grouped together under another.

What would the consequences be of doing so?
What can you do to check your own answer? Try to recode the labels and see what happens!
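
Recoding could be sketched as follows. The toy labels stand in for the real `labels` list, and the two group names are an assumption; any pair of distinct names would do.

```python
from collections import Counter

# Hypothetical toy labels standing in for the real `labels` list
toy_labels = ["normal", "spam", "abusive", "hateful", "normal", "spam"]

# Collapse the four original categories into two groups
recoded = ["spam_or_hateful" if label in {"spam", "hateful"} else "other"
           for label in toy_labels]

# Compare the class distribution before and after recoding
print(Counter(toy_labels))
print(Counter(recoded))
```

After recoding the real labels, you would split the data and train the classifier again, then compare the new classification report with the earlier one.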


## Question 9.
For now, let's say that the classifier based on a count vectorizer and Logistic Regression is the one we prefer. We now want to use this model to predict the label for new data that we have not annotated (remember, this was the whole goal of SML)!

To do this, let’s save our classifier and our vectorizer to files. If we don’t do this, we would need to re-train our model every time we want to use it. This is inconvenient: for example, we would always need to have our training data at hand. The code below shows you how to make a vectorizer and train a classifier (a repetition of what we did before, to show you the whole process) and store each of them in a file.

In the code, you will see that both the classifier and the vectorizer are stored. Why do you need to store both (why not store only the classifier)?

In [21]:
import pickle
import joblib

# Make a vectorizer and train a classifier
vectorizer=CountVectorizer(min_df=5, max_df=.5)
classifier=LogisticRegression(solver="liblinear")
X_train=vectorizer.fit_transform(tweets_train)
classifier.fit(X_train, y_train)

# Save them to disk
with open("myvectorizer.pkl",mode="wb") as f:
    pickle.dump(vectorizer, f)
with open("myclassifier.pkl",mode="wb") as f:
    joblib.dump(classifier, f)

# Later on, re-load this classifier and apply:
new_tweets = ["This Tweet is very shitty nasty mean and hateful", 
            "This is a very normal normal tweet.", 
            "2%^&GHJ &(&hrqjf3 click this link"]

with open("myvectorizer.pkl",mode="rb") as f:
    myvectorizer = pickle.load(f)
with open("myclassifier.pkl",mode="rb") as f:
    myclassifier = joblib.load(f)
    
new_features = myvectorizer.transform(new_tweets)
pred = myclassifier.predict(new_features)

for tweet, label in zip(new_tweets, pred):
    print(f"'{tweet}' is probably '{label}'.")

'This Tweet is very shitty nasty mean and hateful' is probably 'abusive'.
'This is a very normal normal tweet.' is probably 'normal'.
'2%^&GHJ &(&hrqjf3 click this link' is probably 'spam'.
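
As an alternative to handling two separate files, scikit-learn's `make_pipeline` can bundle the vectorizer and the classifier into a single object that is saved and loaded in one step. This is a minimal sketch, not the approach used above: the toy data and the file name `mypipeline.joblib` are hypothetical, and a temporary directory is used so that the example leaves no files behind.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data so the sketch runs without the tweet dataset
toy_train = ["free money click here", "win a prize now",
             "see you at lunch", "nice weather today"]
toy_train_labels = ["spam", "spam", "normal", "normal"]

# The pipeline vectorizes raw text and then classifies it
pipeline = make_pipeline(CountVectorizer(),
                         LogisticRegression(solver="liblinear"))
pipeline.fit(toy_train, toy_train_labels)

# Save and re-load the whole pipeline as one file
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "mypipeline.joblib")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline accepts raw text directly
new_pred = restored.predict(["click here to win money"])
same = list(pipeline.predict(toy_train)) == list(restored.predict(toy_train))
print(new_pred, same)
```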


### About this exercise:

This exercise is based on the materials developed and the texts written by Wouter van Atteveldt, Damian Trilling and Carlos Arcila Calderon as reported in their book 'Computational Analysis of Communication' (2022, Wiley-Blackwell).