# Final Project

**Group HOMEWORK**. This final project can be collaborative. A group may have at most 2 members, and you can also work by yourself. Please respect academic integrity. **Remember: if you are caught cheating, you get an F.**

## An Introduction to the competition

<img src="news-sexisme-EN.jpg" alt="drawing" width="380"/>

Sexism is a growing problem online. It can inflict harm on women who are targeted, make online spaces inaccessible and unwelcoming, and perpetuate social asymmetries and injustices. Automated tools are now widely deployed to find and assess sexist content at scale, but most only give classifications for generic, high-level categories, with no further explanation. Flagging what is sexist content and also explaining why it is sexist improves interpretability, trust and understanding of the decisions that automated tools make, empowering both users and moderators.

This project is based on SemEval 2023 - Task 10 - Explainable Detection of Online Sexism (EDOS). [Here](https://codalab.lisn.upsaclay.fr/competitions/7124#learn_the_details-overview) you can find a detailed introduction to this task.

You only need to complete **TASK A - Binary Sexism Detection: a two-class (or binary) classification where systems have to predict whether a post is sexist or not sexist**. To cut down training time, we only use a subset of the original dataset (5k out of 20k). The dataset can be found in the same folder. 

Unlike our previous homework, this competition gives you great flexibility (and very few hints). You can determine:
-  how to preprocess the input text (e.g., remove emoji, remove stopwords, text lemmatization and stemming, etc.);
-  which method to use to encode text features (e.g., TF-IDF, N-grams, Word2vec, GloVe, Part-of-Speech (POS), etc.);
-  which model to use.

## Requirements
-  **Input**: the text for each instance.
-  **Output**: the binary label for each instance.
-  **Feature engineering**: use at least 2 different methods to extract features and encode text into numerical values.
-  **Model selection**: implement with at least 3 different models and compare their performance.
-  **Evaluation**: create a dataframe with rows indicating feature+model and columns indicating Precision, Accuracy and F1-score (using weighted average). Your results should have at least 6 rows (2 feature engineering methods x 3 models). Report best performance with (1) your feature engineering method, and (2) the model you choose. 
- **Format**: add explanations for each step (you can add markdown cells). At the end of the report, write a summary and answer the following questions: 
    - What preprocessing steps do you follow?
    - How do you select the features from the inputs? 
    - Which model do you use and what is the structure of your model?
    - How do you train your model?
    - What is the performance of your best model?
    - What other models or feature engineering methods would you like to implement in the future?
- **Two Rules**, violations will result in 0 points in the grade: 
    - Not allowed to use test set in the training: You CANNOT use any of the instances from test set in the training process. 
    - Not allowed to use code from generative AI (e.g., ChatGPT). 

## Evaluation

The performance should be only evaluated on the test set (a total of 1086 instances). Please split original dataset into train set and test set. The test set should NEVER be used in the training process. The evaluation metric is a combination of precision, recall, and f1-score (use `classification_report` in sklearn). 

The total points are 10.0. Each team will compete with other teams in the class on their best performance. Points will be deducted if not following the requirements above.

If ALL the requirements are met:
- Top 25\% teams: 10.0 points.
- Top 25\% - 50\% teams: 8.5 points.
- Top 50\% - 75\% teams: 7.0 points.
- Top 75\% - 100\% teams: 6.0 points.

## Submission
As with the homework, submit both a PDF and an .ipynb version of the report. 

The report should include: (a) code, (b) outputs, (c) explanations for each step, and (d) a summary (you can add markdown cells). 

The due date is **December 8, Friday by 11:59pm**.

In [33]:
!pip install pandas
!pip install -U scikit-learn
!pip install transformers


# Data Preprocessing 

Here we load our data from the CSV into data frames. We encode the label column and separate the data into training and test sets, which gives four data frames in total: X and Y for each of the training and test sets. Before we can run feature selection or fit a model, the text has to be cleaned up so it is easier to process. I started by lowercasing the text and removing stop words, which are common words that carry little meaning on their own. From there, emails, numbers, HTML tags, special characters, and punctuation are removed.

In [63]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from nltk.corpus import stopwords
import nltk
import re

nltk.download('stopwords')  # Used once to download the stopwords

df = pd.read_csv("edos_labelled_data.csv")

le = LabelEncoder()
df['label'] = le.fit_transform(df['label']) # Encode label column, 1 is sexist, 0 is not sexist content

# Put data into train and test datasets
df_train, df_test = df[df["split"]=="train"], df[df["split"]=="test"]
df_train, df_test = df_train.drop(columns=["split"]), df_test.drop(columns=["split"])

# Put features and results in separate data frames
features = ['rewire_id', 'text']
X_train, Y_train = df_train[features], df_train['label']
X_test, Y_test = df_test[features], df_test['label']

# Build the stop-word set once instead of rebuilding it on every call
stop_words = set(stopwords.words('english'))

# Here's where we start doing the data preprocessing
def preprocess_text(text):
    text = text.lower()  # Lowercase all the data

    # Get rid of words that aren't useful
    text = ' '.join([word for word in text.split() if word not in stop_words])

    # Remove emails, numbers, html tags, special characters, and punctuation
    text = re.sub(r'\S*@\S*\s?', '', text)  # Remove emails
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'<.*?>', '', text)  # Remove HTML tags
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove special characters and punctuations

    return text  # Get back sanitized, processed text

# X_train and X_test are slices of df, so assigning to them triggers SettingWithCopyWarning.
# Working on explicit copies avoids the warning without changing pandas options.
X_train, X_test = X_train.copy(), X_test.copy()
X_train['text'] = X_train['text'].apply(preprocess_text)
X_test['text'] = X_test['text'].apply(preprocess_text)

X_train.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


|   | rewire_id | text |
|---|---|---|
| 0 | sexism2022_english-9609 | nigeria rape woman men rape back nsfw in niger... |
| 1 | sexism2022_english-16993 | then keeper |
| 2 | sexism2022_english-13149 | like metallica video poor mutilated bastard sa... |
| 3 | sexism2022_english-13021 | woman |
| 4 | sexism2022_english-966 | bet wished gun |
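
One detail of the cleaning order above is worth noting: stop words are filtered before punctuation is stripped, so a token such as `is,` passes the filter and ends up in the output as `is`. A minimal sketch of the alternative order (strip markup and punctuation first, filter stop words last) is shown below. The tiny `STOP_WORDS` set, the sample sentence, and the function name `clean_then_filter` are illustrative assumptions, and this variant was not used to produce any of the results in this report.

```python
import re

# Tiny illustrative stop-word set (an assumption; the notebook uses NLTK's English list)
STOP_WORDS = {"the", "is", "a", "and", "to", "of"}

def clean_then_filter(text, stop_words=STOP_WORDS):
    """Strip markup and punctuation first, then drop stop words."""
    text = text.lower()
    text = re.sub(r'<.*?>', ' ', text)        # Remove HTML tags
    text = re.sub(r'\S*@\S*\s?', ' ', text)   # Remove emails
    text = re.sub(r'[^a-z\s]', ' ', text)     # Remove digits, punctuation, special characters
    tokens = [word for word in text.split() if word not in stop_words]
    return ' '.join(tokens)

print(clean_then_filter("The <b>cat</b> is, well, a CAT! mail me@x.com 42"))
# -> cat well cat mail
```

Whether this order helps the classifiers is an open question; it would need to be checked against the training data only.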


# Random Forest, Support Vector Machine, and Naive Bayes with TF-IDF

Here we apply Term Frequency-Inverse Document Frequency (TF-IDF) to transform our text into numerical values that can be used for prediction. I chose these three models because they are relatively simple to set up and I thought they would give me a solid baseline to compare against both another feature selection method and more advanced models. We use the `TfidfVectorizer` from scikit-learn inside pipelines, which let us train each model quickly. At the end you'll see the output of three classification reports, with Random Forest having the highest macro F1 score.

In [67]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# Make the pipelines for each of the models; each pipeline builds its own TF-IDF vectorizer
rf_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', RandomForestClassifier())
])

svm_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', SVC())
])

nb_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', MultinomialNB())
])

# Here's where we train them
rf_pipeline.fit(X_train['text'], Y_train)
svm_pipeline.fit(X_train['text'], Y_train)
nb_pipeline.fit(X_train['text'], Y_train)

# Evaluate each one on the test data
rf_predictions = rf_pipeline.predict(X_test['text'])
svm_predictions = svm_pipeline.predict(X_test['text'])
nb_predictions = nb_pipeline.predict(X_test['text'])

# Generate reports on how well each model identifies sexist vs. not sexist posts in the test data
print("Random Forest Classification Report:")
print(classification_report(Y_test, rf_predictions, target_names=['not sexist', 'sexist']))

print("\nSVM Classification Report:")
print(classification_report(Y_test, svm_predictions, target_names=['not sexist', 'sexist']))

print("\nNaive Bayes Classification Report:")
print(classification_report(Y_test, nb_predictions, target_names=['not sexist', 'sexist']))


Random Forest Classification Report:
              precision    recall  f1-score   support

  not sexist       0.80      0.97      0.88       789
      sexist       0.83      0.37      0.51       297

    accuracy                           0.81      1086
   macro avg       0.82      0.67      0.70      1086
weighted avg       0.81      0.81      0.78      1086


SVM Classification Report:
              precision    recall  f1-score   support

  not sexist       0.78      0.98      0.87       789
      sexist       0.85      0.26      0.40       297

    accuracy                           0.79      1086
   macro avg       0.81      0.62      0.64      1086
weighted avg       0.80      0.79      0.74      1086


Naive Bayes Classification Report:
              precision    recall  f1-score   support

  not sexist       0.74      1.00      0.85       789
      sexist       0.93      0.05      0.09       297

    accuracy                           0.74      1086
   macro avg       0.83    

# Random Forest, Support Vector Machine, and Naive Bayes with Word2Vec

As an alternative feature selection method, I chose Word2Vec. It learns word associations from large bodies of text and can capture relationships between words, such as synonymy. To use it, I had to upload the multi-gigabyte file of the pre-trained Word2Vec model from Google. We go through each word in a given line of text, look up its vector, and average the vectors into a single 300-dimensional vector per post. These are then used to train our classifiers and, just as before, output three classification reports. Here we see a substantial improvement to our SVM, but it is still slightly behind our Random Forest with TF-IDF.
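
The averaging step can be illustrated without the large pre-trained file. The sketch below uses a made-up dictionary of 2-dimensional vectors (an assumption; the real model has 300 dimensions) and a hypothetical helper `mean_vector` that mirrors the logic of `vectorize_texts` for a single text: known words are averaged, unknown words are skipped, and a text with no known words falls back to a zero vector.

```python
import numpy as np

# Toy 2-dimensional "embeddings", made up for illustration
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "bad": np.array([0.0, 1.0]),
}

def mean_vector(text, vectors, dim):
    """Average the vectors of known words; return zeros if no word is known."""
    found = [vectors[word] for word in text.split() if word in vectors]
    if found:
        return np.mean(found, axis=0)
    return np.zeros(dim)

print(mean_vector("good bad unknown", toy_vectors, dim=2))    # [0.5 0.5]
print(mean_vector("unknown words only", toy_vectors, dim=2))  # [0. 0.]
```

Averaging discards word order, which may be one reason the Word2Vec features do not clearly beat TF-IDF here.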

In [68]:
from gensim.models import KeyedVectors
from sklearn.naive_bayes import GaussianNB  # Changed because I had errors when using continuous features

# Here's where we get the massive file that I got from Google
word_vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# Function to vectorize a list of texts
def vectorize_texts(text_list):
    vectors = []
    for text in text_list:
        words = text.split()
        word_vectors_list = [word_vectors[word] for word in words if word in word_vectors]
        if len(word_vectors_list) > 0:
            vectors.append(np.mean(word_vectors_list, axis=0))
        else:
            vectors.append(np.zeros(300))  # 300 is the dimensionality of the Word2Vec vectors
    return np.array(vectors)

# Use that function on the training and testing data
X_train_vect = vectorize_texts(X_train['text'])
X_test_vect = vectorize_texts(X_test['text'])

# Apply our 3 models to the training data
rf_classifier = RandomForestClassifier().fit(X_train_vect, Y_train)
svm_classifier = SVC().fit(X_train_vect, Y_train)
nb_classifier = GaussianNB().fit(X_train_vect, Y_train)  # GaussianNB is used as it works with continuous features

# Run the predictions
rf_predictions = rf_classifier.predict(X_test_vect)
svm_predictions = svm_classifier.predict(X_test_vect)
nb_predictions = nb_classifier.predict(X_test_vect)

# Output performance
print("Random Forest Classification Report:")
print(classification_report(Y_test, rf_predictions, target_names=['not sexist', 'sexist']))

print("\nSVM Classification Report:")
print(classification_report(Y_test, svm_predictions, target_names=['not sexist', 'sexist']))

print("\nNaive Bayes Classification Report:")
print(classification_report(Y_test, nb_predictions, target_names=['not sexist', 'sexist']))

Random Forest Classification Report:
              precision    recall  f1-score   support

  not sexist       0.75      0.98      0.85       789
      sexist       0.71      0.14      0.24       297

    accuracy                           0.75      1086
   macro avg       0.73      0.56      0.54      1086
weighted avg       0.74      0.75      0.68      1086


SVM Classification Report:
              precision    recall  f1-score   support

  not sexist       0.80      0.95      0.87       789
      sexist       0.72      0.37      0.49       297

    accuracy                           0.79      1086
   macro avg       0.76      0.66      0.68      1086
weighted avg       0.78      0.79      0.76      1086


Naive Bayes Classification Report:
              precision    recall  f1-score   support

  not sexist       0.84      0.50      0.63       789
      sexist       0.36      0.75      0.49       297

    accuracy                           0.57      1086
   macro avg       0.60    

# BERT

Here we implement and fine-tune a Bidirectional Encoder Representations from Transformers (BERT) model. It's widely considered a baseline in natural language processing. We create the tokenizer and model from the Hugging Face Transformers library. After tokenizing, we use a `DataLoader` to create iterable datasets with batch processing for more efficient training. Our optimizer is AdamW, which updates the model weights as we go through training. Of note here are three parameters: `max_length` in the tokenizing step, `batch_size` in the data loader, and the number of epochs during training. I set these based on [this](https://arxiv.org/pdf/2305.00076v1.pdf) report to see if my data preprocessing and fine-tuning could match their results. Their settings seemed the best fit for data of this size.

In [69]:
from transformers import BertTokenizer, BertForSequenceClassification, AdamW
from torch.utils.data import DataLoader, TensorDataset
import torch
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

# Get tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Tokenization of the texts
tokenized_train = tokenizer(X_train['text'].tolist(), padding=True, truncation=True, max_length=128, return_tensors='pt')
tokenized_test = tokenizer(X_test['text'].tolist(), padding=True, truncation=True, max_length=128, return_tensors='pt')

# Training + testing set up with TensorDataset objs and DataLoader utility
train_dataset = TensorDataset(tokenized_train['input_ids'], tokenized_train['attention_mask'], torch.tensor(Y_train.tolist()))
test_dataset = TensorDataset(tokenized_test['input_ids'], tokenized_test['attention_mask'], torch.tensor(Y_test.tolist()))

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

# Get params
optimizer = AdamW(model.parameters(), lr=2e-5)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # I had to use cloud compute with a dGPU for this
model.to(device)

# Here's where we do training
num_epochs = 20 # Number of times a dataset passes through an algorithm
for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        input_ids, attention_mask, labels = batch
        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

# Eval
model.eval()

# Going to use the same loop for all the other models, just different training
all_predictions = []  # Goes over the test data in batches, get output, and put predictions here
all_labels = []
with torch.no_grad():
    for batch in test_loader:
        input_ids, attention_mask, labels = batch
        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=1)
        all_predictions.extend(predictions.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

print(classification_report(all_labels, all_predictions, target_names=le.classes_))

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


              precision    recall  f1-score   support

  not sexist       0.87      0.85      0.86       789
      sexist       0.62      0.67      0.65       297

    accuracy                           0.80      1086
   macro avg       0.75      0.76      0.75      1086
weighted avg       0.81      0.80      0.80      1086
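
The number of epochs, the batch size, and `max_length` above were fixed in advance. If they were to be tuned, one option that stays within the rule of never training on the test set is to carve a validation set out of the training split. The sketch below uses scikit-learn's `train_test_split` with `stratify` so both parts keep the same class balance. The data is made up, and the 80/20 ratio and `random_state` are assumptions rather than settings used in this report.

```python
from sklearn.model_selection import train_test_split

# Made-up stand-ins for X_train['text'] and Y_train (1 in 4 posts labelled sexist)
texts = [f"post {i}" for i in range(100)]
labels = [1 if i % 4 == 0 else 0 for i in range(100)]

# Hold out 20% of the training data for validation, preserving the class ratio
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

print(len(train_texts), len(val_texts))
print(sum(val_labels) / len(val_labels))  # Share of positive labels in the validation part
```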



# XLM-RoBERTa

XLM-R is a pre-trained model that seemed to perform very well at text processing and classification. Similar to before, we import a tokenizer and model; note that we specify 2 labels here for binary classification. As with the other models, we tokenize the inputs, create a TensorDataset and DataLoader for both the training and test sets, and then train the model.

In [56]:
from transformers import XLMRobertaTokenizer, XLMRobertaForSequenceClassification, AdamW
from torch.utils.data import DataLoader, TensorDataset

# Get tokenizer and model
tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')
model = XLMRobertaForSequenceClassification.from_pretrained('xlm-roberta-base', num_labels=2)

# Tokenization like before, same max_length params too
tokenized_train = tokenizer.batch_encode_plus(X_train['text'].tolist(), padding=True, truncation=True, max_length=128, return_tensors='pt')
tokenized_test = tokenizer.batch_encode_plus(X_test['text'].tolist(), padding=True, truncation=True, max_length=128, return_tensors='pt')

# Training + testing set up
train_dataset = TensorDataset(tokenized_train['input_ids'], tokenized_train['attention_mask'], torch.tensor(Y_train.tolist()))
test_dataset = TensorDataset(tokenized_test['input_ids'], tokenized_test['attention_mask'], torch.tensor(Y_test.tolist()))

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)  # Same batch sizes as before
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

# Training setup
optimizer = AdamW(model.parameters(), lr=2e-5)  # Initialize the optimizer with a learning rate
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # Same deal as before, better to run on GPU
model.to(device)

# Begin training
num_epochs = 20
for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        input_ids, attention_mask, labels = batch
        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

# Eval
model.eval()

# Same loop and technique as with all the others
all_predictions = []  # Goes over the test data in batches, get output, and put predictions here
all_labels = []
with torch.no_grad():
    for batch in test_loader:
        input_ids, attention_mask, labels = batch
        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=1)
        all_predictions.extend(predictions.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

print(classification_report(all_labels, all_predictions, target_names=le.classes_))

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


              precision    recall  f1-score   support

  not sexist       0.89      0.85      0.87       789
      sexist       0.65      0.72      0.68       297

    accuracy                           0.81      1086
   macro avg       0.77      0.78      0.77      1086
weighted avg       0.82      0.81      0.82      1086
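
All three transformer runs train for a fixed 20 epochs. A common alternative is to stop once a validation metric stops improving. The helper below is a minimal, framework-agnostic sketch of that idea; the class name `EarlyStopper`, the `patience` value, and the list of validation scores are all hypothetical, and nothing like it was used for the results reported here.

```python
class EarlyStopper:
    """Signal a stop when the monitored score has not improved for `patience` epochs."""

    def __init__(self, patience=2):
        self.patience = patience
        self.best_score = float("-inf")
        self.epochs_without_improvement = 0

    def should_stop(self, score):
        if score > self.best_score:
            # New best score: remember it and reset the counter
            self.best_score = score
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Hypothetical validation macro-F1 values, one per epoch
val_scores = [0.70, 0.74, 0.76, 0.75, 0.76, 0.73]

stopper = EarlyStopper(patience=2)
stopped_at = None
for epoch, score in enumerate(val_scores, start=1):
    if stopper.should_stop(score):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_score)  # 5 0.76
```

In a real training loop, `score` would come from evaluating on a validation set taken from the training data, never from the test set.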



# HateBERT

A bit of an oddball choice, but I found it online and it seemed a good fit for both the text size and the goal. It is specifically designed to detect abusive language. The process for setting it up and running it is nearly identical to the other models: we just have to tokenize, prepare the datasets and loaders, train, and evaluate. HateBERT outperformed general BERT on the datasets its authors tried it with, so I was curious whether I would get the same or a better result.

In [57]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AdamW

# Get HateBERT specific tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('GroNLP/hateBERT')
model = AutoModelForSequenceClassification.from_pretrained('GroNLP/hateBERT', num_labels=2)

# Tokenization like before, same max_length params too
tokenized_train = tokenizer.batch_encode_plus(X_train['text'].tolist(), padding=True, truncation=True, max_length=128, return_tensors='pt')
tokenized_test = tokenizer.batch_encode_plus(X_test['text'].tolist(), padding=True, truncation=True, max_length=128, return_tensors='pt')

# Training + testing set up
train_dataset = TensorDataset(tokenized_train['input_ids'], tokenized_train['attention_mask'], torch.tensor(Y_train.tolist()))
test_dataset = TensorDataset(tokenized_test['input_ids'], tokenized_test['attention_mask'], torch.tensor(Y_test.tolist()))

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

# Training setup (as before)
optimizer = AdamW(model.parameters(), lr=2e-5)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # Very GPU intensive, takes a good bit of time to run
model.to(device)

# Train models again
num_epochs = 20  # Same number of epochs before
for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        input_ids, attention_mask, labels = batch
        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

# Eval
model.eval()

# Same loop and technique as with all the others
all_predictions = []
all_labels = []
with torch.no_grad():  # Goes over the test data in batches, get output, and put predictions here
    for batch in test_loader:
        input_ids, attention_mask, labels = batch
        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=1)
        all_predictions.extend(predictions.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

print(classification_report(all_labels, all_predictions, target_names=le.classes_))

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at GroNLP/hateBERT and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


              precision    recall  f1-score   support

  not sexist       0.87      0.91      0.89       789
      sexist       0.73      0.62      0.67       297

    accuracy                           0.83      1086
   macro avg       0.80      0.77      0.78      1086
weighted avg       0.83      0.83      0.83      1086
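
The `macro avg` and `weighted avg` rows in the report above are both derived from the per-class rows: the macro average is the plain mean over the two classes, while the weighted average weights each class by its support. The short calculation below reproduces the two F1 values from the rounded per-class numbers printed above, so it is only accurate to about two decimals.

```python
# Per-class F1 and support copied from the HateBERT report above
f1 = {"not sexist": 0.89, "sexist": 0.67}
support = {"not sexist": 789, "sexist": 297}

# Macro: every class counts equally
macro_f1 = sum(f1.values()) / len(f1)

# Weighted: each class counts in proportion to its number of test instances
weighted_f1 = sum(f1[c] * support[c] for c in f1) / sum(support.values())

print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.78 0.83
```

Because the test set has far more "not sexist" posts, the weighted average is pulled toward that class, which is why it sits above the macro average for every model in this report.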



In [74]:
results = {
    "Feature and Model": ["TF-IDF w/ Random Forest", "TF-IDF w/ SVM", "TF-IDF w/ Naive Bayes",
                      "Word2Vec w/ Random Forest", "Word2Vec w/ SVM", "Word2Vec w/ Naive Bayes",
                     "BERT", "XLM-R", "HateBERT"],
    "Precision": [0.81, 0.80, 0.79, 0.74, 0.78, 0.71, 0.81, 0.82, 0.83],  # Weighted-average precision from the reports above
    "Accuracy": [0.81, 0.79, 0.74, 0.75, 0.79, 0.57, 0.80, 0.81, 0.83],   # Accuracy from the reports above
    "F1-Score": [0.78, 0.74, 0.64, 0.68, 0.76, 0.59, 0.80, 0.82, 0.83]    # Weighted-average F1 from the reports above
}

results_df = pd.DataFrame(results)
print(results_df)

           Feature and Model  Precision  Accuracy  F1-Score
0    TF-IDF w/ Random Forest       0.81      0.81      0.78
1              TF-IDF w/ SVM       0.80      0.79      0.74
2      TF-IDF w/ Naive Bayes       0.79      0.74      0.64
3  Word2Vec w/ Random Forest       0.74      0.75      0.68
4            Word2Vec w/ SVM       0.78      0.79      0.76
5    Word2Vec w/ Naive Bayes       0.71      0.57      0.59
6                       BERT       0.81      0.80      0.80
7                      XLM-R       0.82      0.81      0.82
8                   HateBERT       0.83      0.83      0.83


## Summary

1. What preprocessing steps do you follow?
   
   Your answer: To begin with, I encoded the label column so that 1 indicates sexist content and 0 indicates not sexist. From there, I split the data into training and test dataframes as well as into X and Y. I then removed stop words, which are common words that carry little meaning on their own, using the stop word list built into the Natural Language Toolkit (NLTK). Finally, I removed emails, numbers, HTML tags, special characters, and punctuation.
   
2. How do you select the features from the inputs?
   
   Your answer: I explored a number of feature selection methods. TF-IDF stood out to me for giving more weight to rarer words, which, after a cursory glance at the CSV file, seem to correspond with instances of sexism. I also chose Word2Vec as I had heard it captures relationships between words and maintains context. We got mixed results: TF-IDF made Random Forest the best model among the TF-IDF runs, and Word2Vec made SVM the best model among the Word2Vec runs.
   
3. Which model you use and what is the structure of your model?
   
   Your answer: I created 6 separate models: 3 classical models, each tried with 2 feature selection techniques, and another 3 to experiment with more advanced models. Random Forest, SVM, and Naive Bayes were ones we had all covered in class and explored before. The only significant change to them was the type of encoded data fed into them. The other 3 models are BERT, XLM-R, and HateBERT. I covered some of the reasons why I chose them above, but the main point was that I wanted to see just how much more advanced some of the newer models are, using the simpler classification methods as a reference point. They are quite complex, but I tried to break down some of the higher-level steps in my code.
   
4. How do you train your model?
   
   Your answer: We fed the data into the models. For the first 3 models, I used a scikit-learn `Pipeline`; for the last 3, after tokenizing the inputs, I trained in batches with `train_loader` to simplify things. The trained models were then evaluated on our test data.
   
5. What is the performance of your best model?
   
   Your answer: My best-performing model is HateBERT, which leads on both macro and weighted average F1 with scores of 0.78 and 0.83 respectively.
   
6. What other models or feature engineering methods would you like to implement in the future?
   
   Your answer: In the future, I'd like to devote more time to both data preprocessing (e.g., not filtering out some stop words like 'she', as they may provide useful context) and feature selection. I only explored two methods and I'm sure there may be more useful ones.
   