### Student Information
Name: 蔡睿翊

Student ID: 112065802

GitHub ID: [vincenttsai2015](https://github.com/vincenttsai2015/)

Kaggle name: juiyitsai

Kaggle private scoreboard snapshot: 

[Snapshot](蔡睿翊_rank.png)

---

### Instructions

1. First: __This part is worth 40% of your grade.__ Participate in the in-class [Kaggle Competition](https://www.kaggle.com/t/6132d98db776a0745496d3ebbe011f3c) on Emotion Recognition on Twitter. Your score will be determined by your place in the Private Leaderboard ranking: 
    - **Bottom 40%**: Get 30% of the 40% available for this section.

    - **Top 41% - 100%**: Get (60-x)/6 + 30 points, where x is your ranking in the leaderboard (i.e., if you rank 3rd, your score will be (60-3)/6 + 30 = 39.5% out of 40%).  
    Submit your last submission __BEFORE the deadline (Dec. 19th 11:59 pm, Tuesday)__. Make sure to take a screenshot of your position at the end of the competition.
    

2. Second: __This part is worth 40% of your grade.__ A report of your work developing the model for the competition (you can use code and comment it). This report should include your preprocessing steps, your feature engineering steps, and an explanation of your model. You can also mention different things you tried and insights you gained. 


3. Third: __This part is worth 20% of your grade.__ It's hard for us to follow if your code is messy :'(, so please **tidy up your notebook** and **add minimal comments where needed**.


Upload your files to your repository then submit the assignment with the regulated format to the [folder](https://drive.google.com/drive/folders/1auARVdUHtww5U_T6MDeiZ8ApZ_ANjeYl).

Make sure to commit and save your changes to your repository __BEFORE the deadline (Dec. 20th 11:59 pm, Wednesday)__. 

### Import the necessary tools
First, we import the necessary tools, including general-purpose ones and those from PyTorch and Hugging Face, as follows:

In [1]:
### Begin Assignment Here
# import general tools
import numpy as np
import pandas as pd
import json
import re
import nltk
import string
from tqdm import tqdm

# import DL tools
import torch
from torch import nn
from torch.nn import Module, CrossEntropyLoss
from torch.optim import Adam
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertModel

### Data preprocessing
#### Raw data parsing
From the given dataset, we use pandas to load the CSV files containing the ground truth and the identification (which marks each instance as training or testing) into two dataframes, ```df_groundtruth``` and ```df_identification```, respectively. We then read the raw data line by line, collecting the tweet IDs and texts into two lists, which are used to build the dataframe ```df_raw```. Finally, we merge ```df_raw``` and ```df_identification``` on the column ```tweet_id``` to create the merged dataframe ```df_raw_merge```.

In [3]:
print('Raw data parsing...')
# groundtruth
df_groundtruth = pd.read_csv('emotion.csv')

# train-test-splitting
df_identification = pd.read_csv('data_identification.csv')

# raw data parsing
tweet_ids = []
textual_data = []
with open("tweets_DM.json","r") as jsfile:
    for line in jsfile:  # iterate lazily instead of loading every line into memory first
        dic = json.loads(line)
        tweet_ids.append(dic['_source']['tweet']['tweet_id'])
        textual_data.append(dic['_source']['tweet']['text'])

# merging with identification for splitting train/test data
df_raw = pd.DataFrame({'tweet_id': tweet_ids, 'text': textual_data})
df_raw_merge = pd.merge(df_raw, df_identification, on='tweet_id', how='left')
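
The line-by-line parsing above can be tried in isolation on a couple of in-memory records. The following is a minimal sketch that assumes only the nesting used in the cell above (`_source` → `tweet` → `tweet_id` / `text`); the two sample records and the name `sample_lines` are made up for illustration.

```python
import io
import json

# Two made-up records that mimic the nesting read in the cell above
# (_source -> tweet -> tweet_id / text); the values are illustrative only.
sample_lines = io.StringIO(
    '{"_source": {"tweet": {"tweet_id": "0x1", "text": "hello world"}}}\n'
    '{"_source": {"tweet": {"tweet_id": "0x2", "text": "good night"}}}\n'
)

tweet_ids, texts = [], []
for line in sample_lines:          # one JSON object per line
    tweet = json.loads(line)["_source"]["tweet"]
    tweet_ids.append(tweet["tweet_id"])
    texts.append(tweet["text"])

print(tweet_ids)  # ['0x1', '0x2']
print(texts)      # ['hello world', 'good night']
```

Because each line is a complete JSON object, the file never has to be parsed as a whole, which keeps memory usage low for a large tweet dump.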

#### Tidying up training/testing data
We split ```df_raw_merge``` into training and testing parts according to the column ```identification```. For the testing part, we load the template "sampleSubmission.csv" into the dataframe ```df_submit_samples```, rename its column ```id``` to ```tweet_id```, and drop the column ```emotion``` to obtain ```df_test_ID```. Left-merging ```df_test_ID``` with ```df_raw_merge_test``` (the testing part of ```df_raw_merge```) on ```tweet_id``` keeps the test tweets in the same order as the template.

After dropping the columns ```tweet_id``` and ```identification```, we obtain the clean training and testing dataframes ```df_train``` and ```df_test```.

To perform validation during the training process, we further divide ```df_train``` into a training part and a validation part by sampling 20% of ```df_train``` as validation data.

In [9]:
print('Tidying up training data...')
# preparing training data and labels
df_identification_train = df_identification[df_identification['identification']=='train']
df_raw_merge_train = df_raw_merge[df_raw_merge['identification']=='train']
df_raw_merge_train = pd.merge(df_raw_merge_train, df_groundtruth, on='tweet_id', how='left')

print('Tidying up testing data...')
# preparing testing data 
df_submit_samples = pd.read_csv("sampleSubmission.csv",encoding="utf-8")
df_submit_samples.rename(columns = {'id':'tweet_id'}, inplace = True)
df_test_ID = df_submit_samples.drop(columns=['emotion'])
df_identification_test = df_identification[df_identification['identification']=='test']
df_raw_merge_test = df_raw_merge[df_raw_merge['identification']=='test']
df_raw_merge_test = pd.merge(df_test_ID, df_raw_merge_test, on='tweet_id', how='left')

df_train = df_raw_merge_train.drop(columns=['tweet_id', 'identification'])
df_test = df_raw_merge_test.drop(columns=['tweet_id', 'identification'])

print('Sampling validation data from training data...')
# split the training data into training part and validation part
df_val = df_train.sample(frac=0.2, random_state=30)
df_train = df_train.drop(df_val.index)
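
The sample-then-drop split above can be checked on a toy dataframe. This is a minimal sketch under the assumption that the dataframe has a unique index (as a freshly merged frame does); the toy rows and the names `df`, `train`, and `val` are illustrative only.

```python
import pandas as pd

# Toy frame standing in for df_train; the rows are invented for illustration.
df = pd.DataFrame({"text": [f"tweet {i}" for i in range(10)],
                   "emotion": ["joy", "anger"] * 5})

val = df.sample(frac=0.2, random_state=30)   # 20% of the rows, reproducible
train = df.drop(val.index)                   # the remaining 80%

# The two parts are disjoint and together cover the original frame.
print(set(train.index).isdisjoint(val.index))  # True
print(len(train), len(val))                    # 8 2
```

Dropping by index is only safe when index values are unique; with duplicated index values, `drop` would remove every row sharing a sampled label.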

Note that there are 8 classes in the dataset.

In [11]:
# available labels and their integer IDs
available_labels = df_raw_merge_train['emotion'].unique().tolist()
all_labels = {element: count for count, element in enumerate(available_labels)}
# {'anticipation':0, 'sadness':1, 'fear':2, 'joy':3, 'anger':4, 'trust':5, 'disgust':6, 'surprise':7}
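
As a small illustration of how this mapping is used in both directions (names to IDs for training, IDs back to names for the submission), here is a sketch that hard-codes the label order from the comment above. In the notebook the order comes from `unique()`, i.e. the order of first appearance in the data; `label2id`, `encoded`, and `decoded` are names introduced only for this example.

```python
# Label names in the order listed in the comment above; in the notebook the
# order comes from unique(), i.e. the order of first appearance in the data.
available_labels = ['anticipation', 'sadness', 'fear', 'joy',
                    'anger', 'trust', 'disgust', 'surprise']

label2id = {label: idx for idx, label in enumerate(available_labels)}
id2label = {idx: label for label, idx in label2id.items()}

encoded = [label2id[name] for name in ['joy', 'fear', 'joy']]
decoded = [id2label[i] for i in encoded]
print(encoded)  # [3, 2, 3]
print(decoded)  # ['joy', 'fear', 'joy']
```

Since the IDs depend on the order of `available_labels`, the same list has to be used when decoding predictions, otherwise class names would be swapped.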

#### Dataset module
* To construct dataloaders for the learning process, we implement a dataset module ```KaggleSentimentDataset``` that inherits from ```torch.utils.data.Dataset```. 
* In this module, we first clean each text, removing noise such as links, mentions, punctuation, and stopwords with ```re``` (regular expressions), ```string```, and ```nltk``` (Natural Language Toolkit). The cleaned texts are then tokenized and encoded by ```tokenizer```, an instance of ```BertTokenizer``` from the ```transformers``` package.

In [7]:
print('Tokenizing...')
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')  

# self-defined dataset module
class KaggleSentimentDataset(Dataset):
    def __init__(self, df, tokenizer):
        texts = df.text.values.tolist()
        texts = [self._preprocess(text) for text in texts]
        self.texts = [tokenizer(text, padding='max_length', max_length=128, truncation=True, return_tensors="pt") for text in texts]
        if 'emotion' in df:
            self.labels = [all_labels[label] for label in df['emotion']]
        else:
            self.labels = [-1] * len(df)
    
    def _preprocess(self, text):
        text = self._remove_amp(text)
        text = self._remove_links(text)
        text = self._remove_hashes(text)
        text = self._remove_retweets(text)
        text = self._remove_mentions(text)
        text = self._remove_multiple_spaces(text)
        text = self._remove_punctuation(text)

        text_tokens = self._tokenize(text)
        text_tokens = self._stopword_filtering(text_tokens)
        text = self._stitch_text_tokens_together(text_tokens)

        return text.strip()

    def _remove_amp(self, text):
        return text.replace("&amp;", " ")
    
    def _remove_links(self, text):
        return re.sub(r'https?:\/\/[^\s\n\r]+', ' ', text)

    def _remove_hashes(self, text):
        return re.sub(r'#', ' ', text)
    
    def _remove_retweets(self, text):
        return re.sub(r'^RT[\s]+', ' ', text)
    
    def _remove_mentions(self, text):
        return re.sub(r'(@.*?)[\s]', ' ', text)
    
    def _remove_multiple_spaces(self, text):
        return re.sub(r'\s+', ' ', text)

    def _remove_punctuation(self, text):
        return ''.join(character for character in text if character not in string.punctuation)

    def _tokenize(self, text):
        return nltk.word_tokenize(text, language="english")

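    def _stopword_filtering(self, text_tokens):
        # load the stopword list once and cache it as a set for fast lookups
        if not hasattr(self, '_stop_words'):
            self._stop_words = set(nltk.corpus.stopwords.words('english'))
        return [token for token in text_tokens if token not in self._stop_words]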

    def _stitch_text_tokens_together(self, text_tokens):
        return " ".join(text_tokens)

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = -1
        if hasattr(self, 'labels'):
            label = self.labels[idx]
        return text, label
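
To see what the regex-based cleaning steps do to a concrete tweet, the sketch below chains the same substitutions in a standalone function. It is a simplified illustration: the function name `clean_tweet` and the sample tweets are hypothetical, and the `nltk` tokenization and stopword filtering are left out so that only the standard library is needed.

```python
import re
import string

def clean_tweet(text):
    """Apply the regex and punctuation steps of the dataset class in order."""
    text = text.replace("&amp;", " ")                    # HTML-escaped ampersand
    text = re.sub(r'https?:\/\/[^\s\n\r]+', ' ', text)   # links
    text = re.sub(r'#', ' ', text)                       # hash signs (the tag word stays)
    text = re.sub(r'^RT[\s]+', ' ', text)                # leading retweet marker
    text = re.sub(r'(@.*?)[\s]', ' ', text)              # mentions followed by whitespace
    text = re.sub(r'\s+', ' ', text)                     # collapse whitespace
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    return text.strip()

print(clean_tweet("RT @user Check this &amp; that: https://t.co/abc #Happy day!"))
# Check this that Happy day
print(clean_tweet("thanks @user"))
# thanks user
```

The second call shows a property of the mention pattern: it requires whitespace after the handle, so a mention at the very end of a tweet is not removed; only its `@` disappears in the punctuation step.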

#### Dataloader construction
With the implementation of module ```KaggleSentimentDataset```, we pack the training/validation/test dataframes into training/validation/testing datasets and load them into respective dataloaders with ```batch_size=16```.

In [12]:
# construct the datasets
print('Training dataset construction...')
training_dataset = KaggleSentimentDataset(df_train, tokenizer)
print('Validation dataset construction...')
validation_dataset = KaggleSentimentDataset(df_val, tokenizer)
print('Testing dataset construction...')
testing_dataset = KaggleSentimentDataset(df_test, tokenizer)

Training dataset construction...
Validation dataset construction...
Testing dataset construction...


In [13]:
# dataloaders
print('Building dataloaders...')
train_dataloader = DataLoader(training_dataset, batch_size=16, shuffle=True, num_workers=0)
val_dataloader = DataLoader(validation_dataset, batch_size=16, num_workers=0)
test_dataloader = DataLoader(testing_dataset, batch_size=16, shuffle=False, num_workers=0)
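
The training code further below calls `.squeeze(1)` on `input_ids`. The reason can be seen from the batch shapes: each item is tokenized separately with `return_tensors="pt"`, which gives tensors of shape `(1, max_length)`, and the dataloader stacks them along a new first dimension. The sketch below reproduces this with a toy dataset; `ToyEncodedDataset` and its dummy values are hypothetical stand-ins, not part of the notebook.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyEncodedDataset(Dataset):
    """Hypothetical stand-in: each item mimics one tokenizer output
    created with return_tensors="pt", i.e. tensors of shape (1, seq_len)."""
    def __init__(self, n_items=5, seq_len=8):
        self.items = [{"input_ids": torch.zeros(1, seq_len, dtype=torch.long),
                       "attention_mask": torch.ones(1, seq_len, dtype=torch.long)}
                      for _ in range(n_items)]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return self.items[idx], idx % 2   # (encoded text, dummy label)

loader = DataLoader(ToyEncodedDataset(), batch_size=4)
batch, labels = next(iter(loader))
print(batch["input_ids"].shape)             # torch.Size([4, 1, 8])
print(batch["input_ids"].squeeze(1).shape)  # torch.Size([4, 8])
print(labels)                               # tensor([0, 1, 0, 1])
```

Squeezing dimension 1 turns the stacked batch into the usual `(batch_size, seq_len)` layout expected for `input_ids`.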

### Sentiment Classification Model
We build a sentiment classification model with the implementation of class ```BertSentimentClassifier``` as follows.

#### Base model
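We use the pretrained ```roberta-base``` checkpoint (RoBERTa is a variant of BERT) as the base model, loading it through ```BertModel``` from ```transformers```. Note that ```BertModel``` is not the matching class for this checkpoint: as the warnings printed below report, instantiating a BERT model from a RoBERTa checkpoint is not supported for all configurations, and the listed encoder weights are newly initialized instead of being loaded from the checkpoint. The tokenizer used above (```bert-base-cased```) also belongs to BERT rather than RoBERTa.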

#### Fine-tune the base model
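We fine-tune the base model together with a small classification head: the pooled output passes through dropout, then two fully connected layers, ```fc1``` (768 → 32) and ```fc2``` (32 → 8), and finally a ReLU layer ```relu```, whose output is used as the class scores.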

In [14]:
# model
class BertSentimentClassifier(Module):
    def __init__(self, base_model, dropout=0.5):
        super(BertSentimentClassifier, self).__init__()
        self.bert = base_model
        self.dropout = nn.Dropout(dropout)
        self.fc1 = nn.Linear(768, 32)
        self.fc2 = nn.Linear(32, 8) # len(all_labels)=8
        self.relu = nn.ReLU()

    def forward(self, input_id, mask):
        _, pooled_output = self.bert(input_ids=input_id, attention_mask=mask, return_dict=False)
        dropout_output = self.dropout(pooled_output)
        linear_output_1 = self.fc1(dropout_output)
        linear_output_2 = self.fc2(linear_output_1)
        final_layer = self.relu(linear_output_2)
        return final_layer

# model initialization
print('Initializing BERT model...')
base_model = BertModel.from_pretrained("roberta-base")
model = BertSentimentClassifier(base_model)

Initializing BERT model...


You are using a model of type roberta to instantiate a model of type bert. This is not supported for all configurations of models and can yield errors.


Some weights of BertModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['encoder.layer.1.attention.output.dense.bias', 'encoder.layer.5.output.dense.weight', 'encoder.layer.8.attention.self.query.weight', 'encoder.layer.5.attention.self.query.bias', 'encoder.layer.5.intermediate.dense.bias', 'encoder.layer.3.intermediate.dense.weight', 'encoder.layer.6.attention.self.query.bias', 'encoder.layer.0.attention.self.key.bias', 'encoder.layer.0.output.dense.bias', 'encoder.layer.8.output.dense.bias', 'encoder.layer.4.intermediate.dense.weight', 'encoder.layer.0.attention.self.value.bias', 'encoder.layer.8.attention.output.LayerNorm.weight', 'encoder.layer.9.output.dense.bias', 'encoder.layer.2.output.LayerNorm.bias', 'encoder.layer.6.output.dense.bias', 'encoder.layer.7.output.dense.bias', 'encoder.layer.3.attention.self.value.bias', 'encoder.layer.2.attention.output.LayerNorm.weight', 'encoder.layer.9.attention.self.value.weight', 'encoder.layer.
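
In the head defined above, the ReLU is applied after ```fc2```, so the scores passed to the loss are never negative. A commonly used alternative places the nonlinearity between the two linear layers and returns the raw output of the last layer. The sketch below shows that variant on its own; it is an assumption-labeled illustration (the class name `ClassifierHead` and the random input are hypothetical) and not the configuration behind the results reported in this notebook.

```python
import torch
from torch import nn

class ClassifierHead(nn.Module):
    """Hypothetical variant of the head: dropout -> fc1 -> ReLU -> fc2."""
    def __init__(self, hidden_size=768, inner_size=32, num_classes=8, dropout=0.5):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.fc1 = nn.Linear(hidden_size, inner_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(inner_size, num_classes)

    def forward(self, pooled_output):
        x = self.dropout(pooled_output)
        x = self.relu(self.fc1(x))
        return self.fc2(x)          # raw logits, suitable for CrossEntropyLoss

head = ClassifierHead().eval()      # eval() disables dropout for the demo
pooled = torch.randn(4, 768)        # random stand-in for the encoder's pooled output
logits = head(pooled)
print(logits.shape)                 # torch.Size([4, 8])
```

With the activation in the middle, the two linear layers no longer collapse into a single linear map, and the logits can take any sign before the softmax inside `CrossEntropyLoss`.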

### Training and Validation
After initializing the sentiment classification model, we define the cross-entropy loss with ```CrossEntropyLoss()```, load the training and validation dataloaders, and run the training and validation processes as follows.

#### Training
* We train the model with the training dataloader for 3 epochs, using the ```Adam``` optimizer with learning rate ```0.00002 (2e-5)```.
* The ```output``` is obtained by feeding ```train_input['input_ids']``` and ```train_input['attention_mask']``` to the model.
* We use ```output``` and ```train_label``` to calculate the loss (averaged over the number of batches in the training dataloader) and the accuracy (averaged over the number of training instances). 
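
The two averages are taken over different denominators, which matters when the last batch is smaller than the others. The following sketch works through the arithmetic with made-up per-batch numbers (the tuple layout and the values are purely illustrative).

```python
# Made-up per-batch statistics: (mean loss of the batch, correct, batch size).
batches = [(1.2, 10, 16), (0.9, 12, 16), (0.6, 3, 4)]

# Loss: sum of per-batch mean losses divided by the number of batches.
epoch_loss = sum(loss for loss, _, _ in batches) / len(batches)

# Accuracy: correct predictions divided by the number of instances.
epoch_acc = sum(c for _, c, _ in batches) / sum(n for _, _, n in batches)

print(f"{epoch_loss:.3f}")  # 0.900
print(f"{epoch_acc:.3%}")   # 69.444%
```

Since `CrossEntropyLoss` with its default reduction returns the mean over a batch, averaging those means weights a small final batch as much as a full one, whereas the accuracy is a true per-instance average.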

#### Validation
* In each epoch, we validate the trained model with the validation dataloader.
* The process is basically the same as the training process, except that the model is not updated.
* To avoid overfitting, we track the best validation loss across epochs and stop early once the validation loss has failed to improve for 3 consecutive epochs.
* If the current validation loss is lower than ```best_val_loss```, we set ```best_val_loss``` to the current validation loss and save the trained model as ```best_model.pt``` for testing.
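
The checkpointing and early-stopping rule described above can be sketched in isolation, without any model. The helper `simulate_early_stopping` and the loss values below are hypothetical and serve only to trace which epochs would save a checkpoint and when training would stop.

```python
def simulate_early_stopping(val_losses, patience=3):
    """Return (epochs at which a checkpoint would be saved, epoch at which
    training stops early or None). Epochs are numbered from 0."""
    best_val_loss = float('inf')
    bad_epochs = 0
    saved_at = []
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            saved_at.append(epoch)      # the notebook saves best_model.pt here
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs >= patience:
            return saved_at, epoch
    return saved_at, None

# Made-up validation losses, for illustration only.
print(simulate_early_stopping([1.30, 1.21, 1.25, 1.27, 1.31, 1.29]))
# ([0, 1], 4)
print(simulate_early_stopping([1.30, 1.21, 1.25]))
# ([0, 1], None)
```

Because the first epoch always improves on the initial value of infinity, a 3-epoch run can accumulate at most 2 non-improving epochs. With 3 epochs and a threshold of 3, the early-stopping branch therefore cannot trigger, and only the best-checkpoint saving takes effect.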

In [15]:
# training
def train(model, train_ldr, val_ldr, learning_rate, epochs):
    best_val_loss = float('inf')
    early_stopping_threshold_count = 0
    
    # GPU usage determination
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda:0" if use_cuda else "cpu")
    
    # Defining loss function and optimizer
    criterion = CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr=learning_rate)
    
    # Send the model to GPU memory
    model = model.to(device)
    criterion = criterion.to(device)
    # training iteration
    for epoch_num in range(epochs):
        print(f'Epoch: {epoch_num}')
        # training accuracy and loss
        train_acc = []
        train_loss = 0
        
        model.train()
        # tqdm
        for train_input, train_label in tqdm(train_ldr):
            train_label = train_label.to(device)
            mask = train_input['attention_mask'].to(device)
            input_id = train_input['input_ids'].squeeze(1).to(device)
            
            optimizer.zero_grad()
            # prediction result
            output = model(input_id, mask)
            # loss
            batch_loss = criterion(output, train_label)
            train_loss += batch_loss.item()
            
            # model update (gradients were already reset by optimizer.zero_grad() above)
            batch_loss.backward()
            optimizer.step()
            
            # accuracy
            output_index = output.argmax(axis=1)
            acc = (output_index == train_label)
            train_acc += acc

        train_accuracy = (sum(train_acc)/len(train_acc)).item()
        print(f'Train Loss: {train_loss / len(train_ldr): .3f} | Train Accuracy: {train_accuracy: 10.3%}')
        
        # Model validation
        model.eval()
        # validation accuracy and loss
        val_acc = []
        val_loss = 0        
        with torch.no_grad(): # no need to compute gradient
            # validate with trained model
            for val_input, val_label in tqdm(val_ldr):
                # same process with training
                val_label = val_label.to(device)
                mask = val_input['attention_mask'].to(device)
                input_id = val_input['input_ids'].squeeze(1).to(device)
                # prediction result
                output = model(input_id, mask)
                # loss
                batch_loss = criterion(output, val_label)
                val_loss += batch_loss.item()
                
                # accuracy
                output_index = output.argmax(axis=1)
                acc = (output_index == val_label)
                val_acc += acc
            val_accuracy = (sum(val_acc)/len(val_acc)).item()            
            print(f'Val Loss: {val_loss / len(val_ldr): .3f} | Val Accuracy: {val_accuracy: 10.3%}')
            
            if best_val_loss > val_loss:
                best_val_loss = val_loss
                torch.save(model, "best_model.pt")
                print("Saved model")
                early_stopping_threshold_count = 0
            else:
                early_stopping_threshold_count += 1
            if early_stopping_threshold_count >= 3:
                print("Early stopping")
                break

In [None]:
print('Training...')
EPOCHS = 3
LR = 2e-5
train(model, train_dataloader, val_dataloader, LR, EPOCHS)

As shown in the screenshot of the training procedure below:
* With batch size 16, training one epoch takes around 2 hours on an NVIDIA RTX 3090 GPU.
* If the number of training epochs is set to a larger value (e.g., 10), the training accuracy can reach 72.847% at epoch 9. 
* However, the validation error starts to increase at epoch 3, which indicates that the model starts to overfit. 
* Thus, the ideal number of training epochs is 3.

![training_process.png](training_process.png)

### Testing
* We use the testing dataloader and the model saved during validation to conduct testing. 
* The prediction results in ```pred_results``` consist of the IDs of the sentiment classes.
* To meet the format requirement for uploading the prediction results to Kaggle:
    * update the column ```emotion``` of the dataframe ```sample_submission``` (loaded from the file "sampleSubmission.csv") with ```pred_results```.
    * convert each predicted ID to the corresponding sentiment class name by mapping with the dictionary ```id2label```.    
* Save the updated dataframe as a CSV file (e.g., "final_submission_112065802.csv").
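
The formatting steps above amount to pairing each test ID with the name of its predicted class and writing the two columns ```id``` and ```emotion```. The sketch below does this with the standard library only; the tweet IDs and predictions are invented, an in-memory buffer stands in for the output file, and the `id2label` order is copied from the label comment earlier in this report.

```python
import csv
import io

id2label = {0: 'anticipation', 1: 'sadness', 2: 'fear', 3: 'joy',
            4: 'anger', 5: 'trust', 6: 'disgust', 7: 'surprise'}

# Invented tweet IDs and predicted class IDs, for illustration only.
tweet_ids = ['0x1', '0x2', '0x3']
pred_results = [3, 6, 0]

buffer = io.StringIO()                      # stands in for the output file
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['id', 'emotion'])          # column names of the template
for tweet_id, class_id in zip(tweet_ids, pred_results):
    writer.writerow([tweet_id, id2label[class_id]])

print(buffer.getvalue())
```

The pairing is purely positional, so the predictions must be in the same order as the IDs in the template. In this notebook that order is preserved by left-merging the test data onto the template IDs and by building the test dataloader with `shuffle=False`.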

In [None]:
def get_text_predictions(model, test_ldr):
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    model = model.to(device)    
    
    results_predictions = []
    with torch.no_grad():
        model.eval()
        for data_input, _ in tqdm(test_ldr):
            attention_mask = data_input['attention_mask'].to(device)
            input_ids = data_input['input_ids'].squeeze(1).to(device)
            output = model(input_ids, attention_mask)            
            output_index = output.argmax(axis=1)
            results_predictions.append(output_index)
    
    return torch.cat(results_predictions).cpu().detach().numpy()

print('Testing...')
pred_model = torch.load("best_model.pt")
sample_submission = pd.read_csv("sampleSubmission.csv")
pred_results = get_text_predictions(pred_model, test_dataloader)
id2label = {count: element for count, element in enumerate(available_labels)}
sample_submission["emotion"] = pred_results
sample_submission["emotion"] = sample_submission["emotion"].map(id2label)
sample_submission.to_csv("final_submission_112065802.csv", index=False)

### Postscript
* In the beginning, since I did not fully understand what to do, I used the TA's code in [this link](https://github.com/KevinCodePlace/NTHU_Data_Mining_2022Fall/blob/main/DM2022-Lab2/DM2022-Lab2-Homework-111065542.ipynb) to go through the whole process of this homework assignment and figure out how to imitate the implementation. After going through the whole process with the TA's code, I got 55%, the highest at that time. 

* However, after I implemented the dataset construction, model building, and training procedures on my own (with references found on the Internet), the resulting performance never exceeded 50% by the deadline, even in the last submission. 

* Thus, I chose the best result obtained with my own implementation (46.757%) as my Kaggle submission instead of the initial one (55%, achieved with the TA's code).