---

### Instructions

1. First: __This part is worth 30% of your grade.__ Do the **take home** exercises in the [DM2021-Lab2-master Repo](https://github.com/fhcalderon87/DM2021-Lab2-master). You may need to copy some cells from the Lab notebook to this notebook. 


2. Second: __This part is worth 30% of your grade.__ Participate in the in-class [Kaggle Competition](https://www.kaggle.com/c/dm2021-lab2-hw2/) regarding Emotion Recognition on Twitter. The scoring will be given according to your place in the Private Leaderboard ranking: 
    - **Bottom 40%**: Get 20% of the 30% available for this section.

    - **Top 41% - 100%**: Get (60-x)/6 + 20 points, where x is your ranking in the leaderboard (i.e., if you rank 3rd, your score will be (60-3)/6 + 20 = 29.5% out of 30%).   
    Submit your last submission __BEFORE the deadline (Dec. 24th 11:59 pm, Friday)__. Make sure to take a screenshot of your position at the end of the competition, store it as `pic0.png` under the **img** folder of this repository, and rerun the cell **Student Information**.
    

3. Third: __This part is worth 30% of your grade.__ A report of your work developing the model for the competition (you can use code and comment it). This report should include your preprocessing steps, your feature engineering steps, and an explanation of your model. You can also mention different things you tried and insights you gained. 


4. Fourth: __This part is worth 10% of your grade.__ It's hard for us to follow if your code is messy :'(, so please **tidy up your notebook** and **add minimal comments where needed**.


Upload your files to your repository then submit the link to it on the corresponding e-learn assignment.

Make sure to commit and save your changes to your repository __BEFORE the deadline (Dec. 29th 11:59 pm, Wednesday)__. 
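
As a quick check of the ranking formula in part 2, the sketch below evaluates (60-x)/6 + 20 for a given rank. The function name `leaderboard_score` is made up for illustration, and the sketch covers only the formula branch, not the fixed score for the bottom of the leaderboard.

```python
def leaderboard_score(rank: int) -> float:
    """Points out of 30 from the formula (60 - x) / 6 + 20, where x is the rank."""
    return (60 - rank) / 6 + 20


print(leaderboard_score(3))  # 29.5, matching the worked example in the instructions
```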

In [3]:
### Begin Assignment Here

In [None]:
import json 
import csv
import torch
import numpy as np
from transformers import BertTokenizer,BertModel
from torch import nn
from torch.optim import Adam
from tqdm.notebook import tqdm

td_route = "./dm2021-lab2-hw2"

In [None]:
Tweet_list = []
with open(td_route + '/tweets_DM.json') as f:
    for jsonObj in tqdm(f):
        Tweet_list.append(json.loads(jsonObj))
#I use tqdm to track the progress of many operations in this project.
#The dataset is very large, so I want to see how long each task will take
#instead of sitting and staring at nothing.

In [None]:
ori_Tweet_list = Tweet_list

In [None]:
part = 0 #index (0-9) of the tenth of the dataset used in this run
part_size = len(ori_Tweet_list)//10
Tweet_list = ori_Tweet_list[part_size*part:part_size*(part+1)]
#The dataset is too large to train on at once, so I split it into 10 parts and switched to the next part each epoch
#I rotated the parts by hand: each epoch took so long that automating it did not seem necessary
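
The manual rotation described above could be automated with a small helper. This is only a sketch: `chunk_bounds` is a hypothetical name that does not appear elsewhere in this notebook, and unlike the `len//10` slicing above it gives the last part any leftover items, so no tweet is skipped.

```python
from typing import Tuple


def chunk_bounds(n_items: int, n_parts: int, epoch: int) -> Tuple[int, int]:
    """Return (start, end) slice bounds of the part to use in a given epoch.

    Parts are visited in order and wrap around after n_parts epochs.
    The last part absorbs the remainder when n_items is not divisible by n_parts.
    """
    part = epoch % n_parts
    size = n_items // n_parts
    start = part * size
    end = n_items if part == n_parts - 1 else start + size
    return start, end


data = list(range(25))  # stand-in for ori_Tweet_list
for epoch in range(3):
    start, end = chunk_bounds(len(data), 10, epoch)
    print(epoch, data[start:end])
```

In the notebook, the slice `ori_Tweet_list[start:end]` would take the place of the hand-edited indices.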

In [None]:
identification = {'test':set(),'train':set()}
with open(td_route + '/data_identification.csv', newline='') as csvfile:
    rd = csv.reader(csvfile, delimiter=',', quotechar='|')
    for tid,ident in rd:
        if(ident in identification):identification[ident].add(tid)

In [None]:
emotions = {}
with open(td_route + '/emotion.csv', newline='') as csvfile:
    rd = csv.reader(csvfile, delimiter=',', quotechar='|')
    next(rd, None) #Skip header
    for tid,emt in rd:
        emotions[tid] = emt

In [None]:
Tweet_dict = {'train':{},'test':{}}
for tl in Tweet_list:
    tid = tl['_source']['tweet']['tweet_id']
    if(tid in identification['train']):
        Tweet_dict['train'][tid] = tl['_source']['tweet']['text']
    elif(tid in identification['test']):
        Tweet_dict['test'][tid] = tl['_source']['tweet']['text']

In [None]:
print(len(Tweet_dict['train']),len(Tweet_dict['test']))
print(len(emotions))
print(len(Tweet_list),len(ori_Tweet_list))
#Checks for dataset size

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
#A pretrained BertTokenizer is used

In [None]:
tokenz = tokenizer(Tweet_dict['train']['0x376b20'],padding='max_length', max_length = 280, truncation=True,return_tensors="pt")
tokenz
#Tokenizer sanity check. A tweet is at most 280 characters, so I use 280 as the padded length (note that max_length counts tokens, not characters)

In [None]:
max_len = 0
for tl in ori_Tweet_list:
    if(len(tl['_source']['tweet']['text']) > max_len):
        max_len = len(tl['_source']['tweet']['text'])
print(max_len)
#Double-check the maximum tweet length in the data, which turns out to be 252 characters
#I still used 280 as the max length in the end, so that longer tweets added in the future would not break this code.

In [None]:
labels = {'anger':0,'anticipation':1,'disgust':2,'fear':3,'sadness':4,'surprise':5,'trust':6,'joy':7}
#label dict to convert emotion names to class ids

In [None]:
class Dataset(torch.utils.data.Dataset):

    def __init__(self, texts):
        self.labels = [labels[emotions[key]] for key in tqdm(texts)]
        self.texts = [tokenizer(texts[key],padding='max_length', max_length = 280, truncation=True,return_tensors="pt") for key in tqdm(texts)]

    def classes(self):
        return self.labels

    def __len__(self):
        return len(self.labels)

    def get_batch_labels(self, idx):
        return self.labels[idx]

    def get_batch_texts(self, idx):
        return self.texts[idx]

    def __getitem__(self, idx):

        batch_texts = self.get_batch_texts(idx)
        batch_y = self.get_batch_labels(idx)

        return batch_texts, batch_y

In [None]:
class BertClassifier(nn.Module):

    def __init__(self, dropout=0.5):

        super(BertClassifier, self).__init__()

        self.bert = BertModel.from_pretrained('bert-base-cased')
        #Start with a pretrained model to speed up training
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(768, 8)
        #Classification layer has 8 outputs since we have 8 classes of emotions
        self.relu = nn.ReLU()

    def forward(self, input_id, mask):

        _, pooled_output = self.bert(input_ids= input_id, attention_mask=mask,return_dict=False)
        dropout_output = self.dropout(pooled_output)
        linear_output = self.linear(dropout_output)
        final_layer = self.relu(linear_output) #note: ReLU clamps negative class scores to 0 before they reach the loss

        return final_layer

In [None]:
full_ds = Dataset(Tweet_dict['train'])

In [None]:
train_size = int(len(full_ds)*0.7)
test_size = len(full_ds) - train_size #the remainder, so the two sizes always sum to len(full_ds)
train_ds,test_ds = torch.utils.data.random_split(full_ds, [train_size, test_size], generator=torch.Generator().manual_seed(42069))
#Split the labeled data into training and held-out (validation) sets with a 7:3 ratio

In [None]:
print(len(train_ds),len(test_ds),len(train_ds)/len(full_ds),len(test_ds)/len(full_ds))
#Dataset sizes validation
train_dataloader = torch.utils.data.DataLoader(train_ds, batch_size=2, shuffle=True)
val_dataloader = torch.utils.data.DataLoader(test_ds, batch_size=2)

In [None]:
#Operation Checks
batch_iterator = iter(train_dataloader)
inputs, label = next(batch_iterator)
for inputs,label in tqdm(batch_iterator):
    continue
print(label)
del batch_iterator

In [None]:
def train(model, learning_rate, epochs):


    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    criterion = nn.CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr= learning_rate)
    cnter = 12

    if use_cuda:
            model = model.cuda()
            criterion = criterion.cuda()

    for epoch_num in range(epochs):

            model.train() #enable dropout for the training pass
            total_acc_train = 0
            total_loss_train = 0

            for train_input, train_label in tqdm(train_dataloader):

                train_label = train_label.type(torch.LongTensor).to(device)
                mask = train_input['attention_mask'].to(device)
                input_id = train_input['input_ids'].squeeze(1).to(device)

                output = model(input_id, mask)
                #print(output,train_label)
                batch_loss = criterion(output, train_label)
                total_loss_train += batch_loss.item()

                acc = (output.argmax(dim=1) == train_label).sum().item()
                total_acc_train += acc

                model.zero_grad()
                batch_loss.backward()
                optimizer.step()

            total_acc_val = 0
            total_loss_val = 0

            model.eval() #disable dropout so the validation metrics are not noisy
            with torch.no_grad():

                for val_input, val_label in tqdm(val_dataloader):

                    val_label = val_label.to(device)
                    mask = val_input['attention_mask'].to(device)
                    input_id = val_input['input_ids'].squeeze(1).to(device)

                    output = model(input_id, mask)

                    batch_loss = criterion(output, val_label)
                    total_loss_val += batch_loss.item()
                    
                    acc = (output.argmax(dim=1) == val_label).sum().item()
                    total_acc_val += acc
            
            print(
                f'Epochs: {epoch_num + 1} | Train Loss: {total_loss_train / len(train_ds): .3f} \
                | Train Accuracy: {total_acc_train / len(train_ds): .3f} \
                | Val Loss: {total_loss_val / len(test_ds): .3f} \
                | Val Accuracy: {total_acc_val / len(test_ds): .3f}')
            torch.save(model.state_dict(), 'best_checkpoint_1222_'+ str(cnter) +'.pth')
            cnter += 1

In [None]:
EPOCHS = 1
model = BertClassifier()
#map_location='cpu' lets the checkpoint load on a machine without a GPU; train() moves the model to the GPU if one is available
model.load_state_dict(torch.load('best_checkpoint_1222_11.pth', map_location='cpu'))
LR = 1e-6
train(model, LR, EPOCHS)
#The checkpoint file of the BERT model is too large to include, so a Google Drive link is provided
#In the end I trained one epoch at a time, since even with 1/10 of the data each epoch still takes 2 hours.
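
Stopping after each epoch to inspect the validation results by hand amounts to early stopping with manual checks. A minimal sketch of how that check could be automated is below. The `EarlyStopper` class and its `patience` parameter are illustrative assumptions and are not used anywhere in this notebook.

```python
class EarlyStopper:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 1):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best_loss:
            # New best: remember it and reset the counter
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses, one per epoch, just to exercise the logic
stopper = EarlyStopper(patience=2)
for epoch, loss in enumerate([0.90, 0.80, 0.85, 0.83, 0.70], start=1):
    if stopper.should_stop(loss):
        print(f"stop after epoch {epoch}")
        break
```

With a check like this, `train` could return the validation loss of each epoch and the outer loop could decide whether to continue.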

In [None]:
test_dict = {}
for tl in ori_Tweet_list:
    tid = tl['_source']['tweet']['tweet_id']
    if(tid in identification['test']):
        test_dict[tid] = tl['_source']['tweet']['text']
print(len(test_dict))

#Collect the test tweets, i.e. the ones to predict for the Kaggle submission

In [None]:
rev_labels = {}
for key in labels:
    rev_labels[labels[key]] = key
    
#Reverse the label dict, since the model outputs a class id and we want the emotion name, e.g. 'anticipation'

In [None]:
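use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
model = model.to(device) #make sure the model is on the same device as the inputs
model.eval() #disable dropout so the predictions are deterministic
outs = []
with torch.no_grad():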
    for key in tqdm(test_dict):
        val_input = tokenizer(test_dict[key],padding='max_length', max_length = 280, truncation=True,return_tensors="pt")
        mask = val_input['attention_mask'].to(device)
        input_id = val_input['input_ids'].squeeze(1).to(device)
        output = model(input_id, mask).to('cpu')
        outs.append([key,rev_labels[int(torch.argmax(output))]])

#Generate labels for the test tweets; this took 3 hours each run :(
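
Predicting one tweet at a time probably contributes to the long runtime of this step. One option, shown only as a sketch, is to group the tweet ids into batches and tokenize each batch in a single call. The helper name `batched` and the batch size are assumptions for illustration, and the tokenizer call is left as a comment because it needs the model files.

```python
from typing import Dict, Iterator, List


def batched(keys: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size keys."""
    for start in range(0, len(keys), batch_size):
        yield keys[start:start + batch_size]


# Tiny stand-in for test_dict (tweet id -> text)
demo: Dict[str, str] = {f"id{i}": f"tweet {i}" for i in range(5)}
for ids in batched(list(demo), batch_size=2):
    texts = [demo[k] for k in ids]
    # With the real tokenizer, a whole batch could be encoded at once, e.g.
    # tokenizer(texts, padding=True, truncation=True, max_length=280, return_tensors="pt")
    print(ids, texts)
```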

In [None]:
import csv
print(len(outs))
with open('val3.csv', 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',',
                            quotechar='|', quoting=csv.QUOTE_MINIMAL)
    spamwriter.writerow(["id"]+["emotion"])
    for row in outs:  
        spamwriter.writerow(row)
#Write output file with header

In [None]:
##This code was copied from another notebook file; the original notebook, HW2.ipynb, is also included

In [None]:
##Report##

My final project teammates told me that BERT performs well on this task, so I did not do my own feature engineering or model development. Most of my time on this project went into monitoring the training, which is very slow, and adjusting the input size and learning rate whenever overfitting occurred. The only data preprocessing I did was to split the data into 10 parts, since training on the whole dataset at once maxed out my computer's memory.

When the model stopped improving, I increased the dataset size and the batch size. This improved the results a little, but progress was slow, and I could not increase the batch size much because my GPU did not have enough memory. In the end I stopped at an F-score of 0.52.

Later I learned that my training process is essentially early stopping. I started doing this after the model overfit during a ten-hour run of 3 epochs. To stop wasting time, I trained only 1 epoch at a time from then on and checked the results after each one to decide whether any adjustments were needed.
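
The training loop only prints accuracy, while the score quoted above is an F-score. As a sketch of how an F-score can be computed from true and predicted class ids, the function below averages the per-class F1 scores (macro averaging). Whether the competition uses macro averaging is an assumption here, and `macro_f1` is an illustrative helper, not code from the notebook.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], n_classes: int) -> float:
    """Unweighted mean of per-class F1 scores; a class with no true positives scores 0."""
    scores = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_classes


# Toy example with 3 classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(macro_f1(y_true, y_pred, 3), 3))
```

For this notebook, `n_classes` would be 8 and the class ids would come from the `labels` dict.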