### Student Information
Name: 莊昱陽

Student ID: 111061643

GitHub ID: yuyangdanny

Kaggle name: ABC

Kaggle private scoreboard snapshot: (6-th place on private)

[Snapshot](../pics/pic0.png)

---

### Instructions

1. First: __This part is worth 30% of your grade.__ Do the **take home** exercises in the DM2023-Lab2-master. You may need to copy some cells from the Lab notebook to this notebook. 


2. Second: __This part is worth 30% of your grade.__ Participate in the in-class [Kaggle Competition](https://www.kaggle.com/t/09b1d0f3f8584d06848252277cb535f2) regarding Emotion Recognition on Twitter by this link https://www.kaggle.com/t/09b1d0f3f8584d06848252277cb535f2. The scoring will be given according to your place in the Private Leaderboard ranking: 
    - **Bottom 40%**: Get 20% of the 30% available for this section.

    - **Top 41% - 100%**: Get (60-x)/6 + 20 points, where x is your ranking in the leaderboard (ie. If you rank 3rd your score will be (60-3)/6 + 20 = 29.5% out of 30%)   
    Submit your last submission __BEFORE the deadline (Dec. 27th 11:59 pm, Wednesday)_. Make sure to take a screenshot of your position at the end of the competition and store it as '''pic0.png''' under the **img** folder of this repository and rerun the cell **Student Information**.
    

3. Third: __This part is worth 30% of your grade.__ A report of your work developping the model for the competition (You can use code and comment it). This report should include what your preprocessing steps, the feature engineering steps and an explanation of your model. You can also mention different things you tried and insights you gained. 


4. Fourth: __This part is worth 10% of your grade.__ It's hard for us to follow if your code is messy :'(, so please **tidy up your notebook** and **add minimal comments where needed**.


Upload your files to your repository then submit the link to it on the corresponding e-learn assignment.

Make sure to commit and save your changes to your repository __BEFORE the deadline (Dec. 31th 11:59 pm, Sunday)__. 

# Third part
See the framework architecture in report.pdf at "/Homework/report.pdf" for final result on private leader broad. <br>
And I also provide extra experiment result in report.pdf. <br>
The following code in this notebook is the function I tryed in all my experiments, and I'll also provide the example usage of it.

## 0. build data reader pipeline class 

In [26]:
import json
import pandas as pd
import tqdm as tqdm
from sklearn.model_selection import train_test_split

In [27]:
class DataReader:
    def __init__(self, sample_num=-1):
        # sample_num = -1 for use whole training dataset
        # sample_num > 0 for use sub sample dataset 
        self.sample_num = sample_num

    def read_json(self, path):
        with open(path, 'r') as file:
            lines = file.readlines()

            json_objects = []
            for line in lines:
                json_obj = json.loads(line)
                json_objects.append(json_obj)
        return pd.DataFrame(json_objects)

    def read_csv(self, path):
        return pd.read_csv(path)

    def process(self, tweets_df, id_df, emo_df):
        df = tweets_df
        flattened_df = pd.json_normalize(df['_source'])
        df = pd.concat([df, flattened_df], axis=1)
        df.drop('_source', axis=1, inplace=True)
        df.drop('_crawldate', axis=1, inplace=True)
        df.drop('_index', axis=1, inplace=True)
        df.drop('_type', axis=1, inplace=True)
        df['tweet.hashtags'] = df['tweet.hashtags'].apply(lambda x: ' '.join(x))
        df = df.rename(columns={'tweet.tweet_id': 'tweet_id', 'tweet.text': 'text', 'tweet.hashtags': 'hashtags'})
        merged_df = pd.merge(df, id_df, on='tweet_id', how='inner')
        train_df = merged_df[merged_df['identification'] == 'train']
        train_df = pd.merge(train_df, emo_df, on='tweet_id', how='inner')
        test_df = merged_df[merged_df['identification'] == 'test']
        train_df.drop('identification', axis=1, inplace=True)
        test_df.drop('identification', axis=1, inplace=True)
        test_df.rename(columns={'tweet_id': 'id'})
        
        possible_labels = train_df['emotion'].unique()
        label_dict = {}
        for index, possible_label in tqdm(enumerate(possible_labels)):
            label_dict[possible_label] = index
        train_df['label'] = train_df.emotion.replace(label_dict)

        # Use sample dataframe of not
        if self.sample_num == -1:
            return train_df, test_df, label_dict
        else:
            return train_df.head(self.sample_num), test_df, label_dict
    
    def train_val_split(self, df):
        X_train, X_val, y_train, y_val = train_test_split(df.index.values, 
                                                  df.label.values, 
                                                  test_size=0.15,
                                                  random_state=17, 
                                                  stratify=df.label.values)
        df['data_type'] = ['not_set'] * df.shape[0]
        df.loc[X_train, 'data_type'] = 'train'
        df.loc[X_val, 'data_type'] = 'val'
        
        return df


## 1. Build Preprocessing pipeline class (for feature engineering)

1. I tryed 2 different feature engineering method to deal with the 'text' and 'hashtag' column feature
    1. (MyPreProcessor): Rule from scratch
    2. (EkphrasisPreProcessor): Rule from ekphrasis (https://github.com/cbaziotis/ekphrasis)
    
So I design 2 different class to build the preprocessing pipeline call 'MyPreProcessor' and 'EkphrasisPreProcessor'
And for the final leaderbroad result is by using the rule of 'MyPreProcessor'

In [28]:
import string
import nltk
import re

from nltk.corpus import stopwords
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import Tokenizer
from ekphrasis.dicts.emoticons import emoticons
from ekphrasis.dicts.noslang.slangdict import slangdict

class MyPreProcessor:
    '''
    Main process pipeline is in __call__ function, so you can jump into __call__ funciton to see the detail
    '''
    def __init__(self) -> None:
        self.lemmatizer = WordNetLemmatizer()

    def lemma_traincorpus(self, text):
        words = word_tokenize(text)
        lemmatized_words = [self.lemmatizer.lemmatize(word) for word in words]

        return ' '.join(lemmatized_words)

    def contractions(self, text):
        text = re.sub(r"he's", "he is", text)
        text = re.sub(r'\b(u)\b', 'you', text)
        text = re.sub(r"there's", "there is", text)
        text = re.sub(r"We're", "We are", text)
        text = re.sub(r"That's", "That is", text)
        text = re.sub(r"won't", "will not", text)
        text = re.sub(r"they're", "they are", text)
        text = re.sub(r"Can't", "Cannot", text)
        text = re.sub(r"wasn't", "was not", text)
        text = re.sub(r"don\x89Ûªt", "do not", text)
        text = re.sub(r"aren't", "are not", text)
        text = re.sub(r"isn't", "is not", text)
        text = re.sub(r"What's", "What is", text)
        text = re.sub(r"haven't", "have not", text)
        text = re.sub(r"hasn't", "has not", text)
        text = re.sub(r"There's", "There is", text)
        text = re.sub(r"He's", "He is", text)
        text = re.sub(r"It's", "It is", text)
        text = re.sub(r"You're", "You are", text)
        text = re.sub(r"I'M", "I am", text)
        text = re.sub(r"shouldn't", "should not", text)
        text = re.sub(r"wouldn't", "would not", text)
        text = re.sub(r"i'm", "I am", text)
        text = re.sub(r"I\x89Ûªm", "I am", text)
        text = re.sub(r"I'm", "I am", text)
        text = re.sub(r"Isn't", "is not", text)
        text = re.sub(r"Here's", "Here is", text)
        text = re.sub(r"you've", "you have", text)
        text = re.sub(r"you\x89Ûªve", "you have", text)
        text = re.sub(r"we're", "we are", text)
        text = re.sub(r"what's", "what is", text)
        text = re.sub(r"couldn't", "could not", text)
        text = re.sub(r"we've", "we have", text)
        text = re.sub(r"it\x89Ûªs", "it is", text)
        text = re.sub(r"doesn\x89Ûªt", "does not", text)
        text = re.sub(r"It\x89Ûªs", "It is", text)
        text = re.sub(r"Here\x89Ûªs", "Here is", text)
        text = re.sub(r"who's", "who is", text)
        text = re.sub(r"I\x89Ûªve", "I have", text)
        text = re.sub(r"y'all", "you all", text)
        text = re.sub(r"can\x89Ûªt", "cannot", text)
        text = re.sub(r"would've", "would have", text)
        text = re.sub(r"it'll", "it will", text)
        text = re.sub(r"we'll", "we will", text)
        text = re.sub(r"wouldn\x89Ûªt", "would not", text)
        text = re.sub(r"We've", "We have", text)
        text = re.sub(r"he'll", "he will", text)
        text = re.sub(r"Y'all", "You all", text)
        text = re.sub(r"Weren't", "Were not", text)
        text = re.sub(r"Didn't", "Did not", text)
        text = re.sub(r"they'll", "they will", text)
        text = re.sub(r"they'd", "they would", text)
        text = re.sub(r"DON'T", "DO NOT", text)
        text = re.sub(r"That\x89Ûªs", "That is", text)
        text = re.sub(r"they've", "they have", text)
        text = re.sub(r"i'd", "I would", text)
        text = re.sub(r"should've", "should have", text)
        text = re.sub(r"You\x89Ûªre", "You are", text)
        text = re.sub(r"where's", "where is", text)
        text = re.sub(r"Don\x89Ûªt", "Do not", text)
        text = re.sub(r"we'd", "we would", text)
        text = re.sub(r"i'll", "I will", text)
        text = re.sub(r"weren't", "were not", text)
        text = re.sub(r"They're", "They are", text)
        text = re.sub(r"Can\x89Ûªt", "Cannot", text)
        text = re.sub(r"you\x89Ûªll", "you will", text)
        text = re.sub(r"I\x89Ûªd", "I would", text)
        text = re.sub(r"let's", "let us", text)
        text = re.sub(r"it's", "it is", text)
        text = re.sub(r"can't", "cannot", text)
        text = re.sub(r"don't", "do not", text)
        text = re.sub(r"you're", "you are", text)
        text = re.sub(r"i've", "I have", text)
        text = re.sub(r"that's", "that is", text)
        text = re.sub(r"i'll", "I will", text)
        text = re.sub(r"doesn't", "does not", text)
        text = re.sub(r"i'd", "I would", text)
        text = re.sub(r"didn't", "did not", text)
        text = re.sub(r"ain't", "am not", text)
        text = re.sub(r"you'll", "you will", text)
        text = re.sub(r"I've", "I have", text)
        text = re.sub(r"Don't", "do not", text)
        text = re.sub(r"I'll", "I will", text)
        text = re.sub(r"I'd", "I would", text)
        text = re.sub(r"Let's", "Let us", text)
        text = re.sub(r"you'd", "You would", text)
        text = re.sub(r"It's", "It is", text)
        text = re.sub(r"Ain't", "am not", text)
        text = re.sub(r"Haven't", "Have not", text)
        text = re.sub(r"Could've", "Could have", text)
        text = re.sub(r"youve", "you have", text)  
        text = re.sub(r"donå«t", "do not", text) 

        abbreviations = {
            "aren't" : "are not",
            "can't" : "cannot",
            "couldn't" : "could not",
            "didn't" : "did not",
            "doesn't" : "does not",
            "don't" : "do not",
            "hadn't" : "had not",
            "hasn't" : "has not",
            "haven't" : "have not",
            "he'd" : "he had",
            "he'll" : "he will",
            "he's" : "he is",
            "I'd" : "I had",
            "I'll" : "I will",
            "I'm": "I am",
            "I've" : "I have",
            "isn't" : "is not",
            "let's" : "let us",
            "mightn't" : "might not",
            "mustn't" : "must not",
            "shan't" : "shall not",
            "she'd" : "she had",
            "she'll" : "she will",
            "she's" : "she is",
            "shouldn't" : "should not",
            "that's" : "that is",
            "there's" : "there is",
            "they'd" : "they had",
            "they'll" : "they will",
            "they're" : "they are",
            "they've" : "they have",
            "we'd" : "we had",
            "we're" : "we are",
            "we've" : "we have",
            "weren't" : "were not",
            "what'll" : "what will",
            "what're" : "what are",
            "what's" : "what is",
            "what've" : "what have",
            "where's" : "where is",
            "who's" : "who had",
            "who'll" : "who will",
            "who're" : "who are",
            "who's" : "who is",
            "who've" : "who have",
            "won't" : "will not",
            "wouldn't" : "would not",
            "wouldnt" : "would not",
            "you'd" : "you had",
            "you'll" : "you will",
            "you're" : "you are",
            "you've" : "you have",
            "arent" : "are not",
            "cant" : "cannot",
            "couldnt" : "could not",
            "didnt" : "did not",
            "doesnt" : "does not",
            "dont" : "do not",
            "hadnt" : "had not",
            "hasnt" : "has not",
            "havent" : "have not",
            "Id" : "I had",
            "Ill" : "I will",
            "Im": "I am",
            "Ive" : "I have",
            "isnt" : "is not",
            "lets" : "let us",
            "mightnt" : "might not",
            "mustnt" : "must not",
            "shouldnt" : "should not",
            "werent" : "were not",
            "gonna" : "going to",
            "imma" : "i am going to"
        }

        for key, value in abbreviations.items():
            text = re.sub(re.escape(key), value, text)
        return text
    
    def rm_space(self, text):
        text = text.strip()
        text = text.split()
        return ' '.join(text)

    def clean(self, text):
        text = re.sub(r"&gt;", ">", text)
        text = re.sub(r"&lt;", "<", text)
        text = re.sub(r"&amp;", "&", text)
        text = re.sub(r'&[^ ]*', '', text)
        
        return text

    def __call__(self, text):
        # rm html
        html = re.compile(r'<.*?>')
        text = html.sub(r'',text)

        # rm URL
        url = re.compile(r'https?://\S+|www\.\S+')
        text = url.sub(r'',text)

        # rm informal space
        text = self.rm_space(text)

        # clean
        text = self.clean(text)

        # rm metion
        text = re.sub(r'@\S+', '', text)

        # Split connective
        text = re.sub(r'([a-z])([A-Z])', r'\1 \2', text)

        # lower case transformation
        text = text.lower()

        # contractions
        text = self.contractions(text)

        return text


class EkphrasisPreProcessor:
    """
    This class does some cleaning and normalization prior to BPE tokenization
    """

    def __init__(self):

        self.text_processor = TextPreProcessor(
            # terms that will be normalized
            normalize=[
                "url",
                "email",
                "phone",
                "user",
                "time",
                "date",
                'percent',
                'money'
            ],
            # terms that will be annotated
            # annotate={"repeated", "elongated"},
            annotate={"hashtag", "allcaps", "elongated", "repeated",
                    'emphasis'},
            # corpus from which the word statistics are going to be used
            # for word segmentation
            segmenter="twitter",
            # corpus from which the word statistics are going to be used
            # for spell correction
            spell_correction=True,
            corrector="twitter",
            unpack_hashtags=False,  # perform word segmentation on hashtags
            unpack_contractions=True,  # Unpack contractions (can't -> can not)
            spell_correct_elong=True,  # spell correction for elongated words
            fix_bad_unicode=True,
            tokenizer=Tokenizer(lowercase=True).tokenize,
            # list of dictionaries, for replacing tokens extracted from the text,
            # with other expressions. You can pass more than one dictionaries.
            dicts=[emoticons, slangdict],
        )

    def preprocess_tweet(self, tweet):
        return " ".join(self.text_processor.pre_process_doc(tweet))
    
    # this will return the tokenized text     
    def __call__(self, tweet):
        return self.text_processor.pre_process_doc(tweet)
    

''' Here are some example for using the preprocessor 
1. Use MyPreProcessor:
    preprocessor = MyPreProcessor()
    tqdm.pandas()
    train_df['text'] = train_df['text'].progress_apply(preprocessor)

2. Use EkphrasisPreProcessor
    preprocessor = EkphrasisPreProcessor()
    tqdm.pandas()
    train_df['text'] = train_df['text'].progress_apply(preprocessor.preprocess_tweet)
'''

[nltk_data] Downloading package wordnet to /home/Yuyang/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/Yuyang/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/Yuyang/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


" Here are some example for using the preprocessor \n1. Use MyPreProcessor:\n    preprocessor = MyPreProcessor()\n    tqdm.pandas()\n    train_df['text'] = train_df['text'].progress_apply(preprocessor)\n\n2. Use EkphrasisPreProcessor\n    preprocessor = EkphrasisPreProcessor()\n    tqdm.pandas()\n    train_df['text'] = train_df['text'].progress_apply(preprocessor.preprocess_tweet)\n"

## 2. Build model training pipeline class

In [29]:
import torch
import random
import numpy as np
from sklearn.metrics import f1_score
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
from transformers import BertTokenizer
from transformers import BertForSequenceClassification
from transformers import RobertaTokenizer
from transformers import RobertaForSequenceClassification
from transformers import AdamW, get_linear_schedule_with_warmup
from tqdm import tqdm
from tqdm._tqdm_notebook import tqdm_notebook

In [30]:

class Trainer:
    def __init__(self):
        pass
    
    def prepare_train_val_dataloader(self, df, tokenizer, batch_size=64):
        encoded_data_train = tokenizer.batch_encode_plus(
            tqdm(df[df.data_type=='train'].text.values), 
            add_special_tokens=True, 
            return_attention_mask=True, 
            pad_to_max_length=True, 
            max_length=256, 
            return_tensors='pt'
        )

        encoded_data_val = tokenizer.batch_encode_plus(
            tqdm(df[df.data_type=='val'].text.values), 
            add_special_tokens=True, 
            return_attention_mask=True, 
            pad_to_max_length=True, 
            max_length=256, 
            return_tensors='pt'
        )

        input_ids_train = encoded_data_train['input_ids']
        attention_masks_train = encoded_data_train['attention_mask']
        labels_train = torch.tensor(df[df.data_type=='train'].label.values)
        input_ids_val = encoded_data_val['input_ids']
        attention_masks_val = encoded_data_val['attention_mask']
        labels_val = torch.tensor(df[df.data_type=='val'].label.values)

        dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
        dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)

        dataloader_train = DataLoader(dataset_train, 
                        sampler=RandomSampler(dataset_train), 
                        batch_size=batch_size)
        dataloader_validation = DataLoader(dataset_val, 
                        sampler=SequentialSampler(dataset_val), 
                        batch_size=batch_size)

        return dataloader_train, dataloader_validation

    def build_model(self, model_name, label_dict):
        if model_name == 'bert-base-uncased' or 'prajjwal1/bert-tiny':
            model = BertForSequenceClassification.from_pretrained(model_name,
                                        num_labels=len(label_dict),
                                        output_attentions=False,
                                        output_hidden_states=False)
        elif model_name == 'roberta-base' or 'distilroberta-base':
            model = RobertaForSequenceClassification.from_pretrained(model_name,
                                    num_labels=len(label_dict),
                                    output_attentions=False,
                                    output_hidden_states=False)

        return model
    
    def _setSeed(self, seed_id=42):
        random.seed(seed_id)
        np.random.seed(seed_id)
        torch.manual_seed(seed_id)
        torch.cuda.manual_seed_all(seed_id)

    def train(self, model, model_name, dataloader_train, dataloader_validation, epochs=2):
        self._setSeed()

        optimizer = AdamW(model.parameters(), lr=1e-5, eps=1e-8)
        scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=len(dataloader_train)*epochs)
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        model.to(device)

        for epoch in tqdm(range(1, epochs+1)):
    
            model.train()
            loss_train_total = 0
            progress_bar = tqdm(dataloader_train, desc= f'Model: {model_name}' + 'Epoch {:1d}/{:1d}'.format(epoch, epochs), leave=False, disable=False)
            for batch in progress_bar:

                model.zero_grad()
                batch = tuple(b.to(device) for b in batch)
                inputs = {'input_ids':      batch[0],
                        'attention_mask': batch[1],
                        'labels':         batch[2],
                        }       

                outputs = model(**inputs)
                
                loss = outputs[0]
                loss_train_total += loss.item()
                loss.backward()

                torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                optimizer.step()
                scheduler.step()
                progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item()/len(batch))})
                
                
            torch.save(model.state_dict(), f'finetuned_epoch_{epoch}.model')
            tqdm.write(f'\nEpoch {epoch}')
            loss_train_avg = loss_train_total/len(dataloader_train)             
            tqdm.write(f'Training loss: {loss_train_avg}')
            val_loss, predictions, true_vals = self.evaluate(model, dataloader_validation, device)
            val_f1 = self.f1_score_func(predictions, true_vals)
            tqdm.write(f'Validation loss: {val_loss}')
            tqdm.write(f'F1 Score (Weighted): {val_f1}')

        torch.save(model, f'whole_finetuned_model.pth')

        return model

    def evaluate(self, model, dataloader_val, device):

        model.eval()
        
        loss_val_total = 0
        predictions, true_vals = [], []
        
        for batch in dataloader_val:
            batch = tuple(b.to(device) for b in batch)
            inputs = {'input_ids':      batch[0],
                    'attention_mask': batch[1],
                    'labels':         batch[2],
                    }

            with torch.no_grad():        
                outputs = model(**inputs)
                
            loss = outputs[0]
            logits = outputs[1]
            loss_val_total += loss.item()

            logits = logits.detach().cpu().numpy()
            label_ids = inputs['labels'].cpu().numpy()
            predictions.append(logits)
            true_vals.append(label_ids)
        
        loss_val_avg = loss_val_total/len(dataloader_val) 
        predictions = np.concatenate(predictions, axis=0)
        true_vals = np.concatenate(true_vals, axis=0)
                
        return loss_val_avg, predictions, true_vals

    def f1_score_func(self, preds, labels):
        preds_flat = np.argmax(preds, axis=1).flatten()
        labels_flat = labels.flatten()
        return f1_score(labels_flat, preds_flat, average='weighted')


## 3. Generate submition data function

In [39]:

def gen_submit_csv(test_df, preprocessor, tokenizer, model, model_name, device):

    tqdm_notebook.pandas()
    test_df['text'] = test_df['text'].progress_apply(preprocessor)
    test_df = test_df.drop('_score', axis=1).drop('hashtags', axis=1)
    test = test_df.set_index('tweet_id').T.to_dict('list')
    label = []

    print("Start to parse to leader broad ! ")
    for id in tqdm(test):
        sentence = test[id]

        inputs = tokenizer(sentence, padding='max_length', truncation=True, max_length=256, return_tensors="pt")

        # to gpu
        ids = inputs["input_ids"].to(device)
        mask = inputs["attention_mask"].to(device)

        # to model
        outputs = model(ids, mask)
        logits = outputs[0]

        active_logits = logits.view(-1, model.num_labels)
        flattened_predictions = torch.argmax(active_logits, axis=1)

        tokens = tokenizer.convert_ids_to_tokens(ids.squeeze().tolist())
        ids_to_labels = {'0':'anticipation', '1':'sadness', '2':'fear', '3':'joy', '4':'anger', '5':'trust', '6':'disgust', '7':'surprise'}
        token_predictions = ids_to_labels[str(flattened_predictions.cpu().numpy()[0])]
        label.append(token_predictions)

    submission = test_df.drop('_score', axis=1).drop('hashtags', axis=1).drop('text', axis=1)
    submission = submission.rename({'tweet_id': 'id'})
    submission['emotion'] = 0
    submission = submission.assign(emotion = label)
    submission = submission.drop(['text'], axis=1)
    submission.to_csv('submission.csv', index=False)
    print(f'Save model at submission.csv from model: {model_name}')

## Whole pipeline execution example

### 1. Prepare dataset

In [32]:
# Set file path
json_file = "./kaggle/input/tweets_DM.json"
data_identification_file = "./kaggle/input/data_identification.csv"
emotion_file = "./kaggle/input/emotion.csv"

# Read 3 raw data into dataframe
datareader = DataReader()
id_df = datareader.read_csv(data_identification_file)
id_df_train = id_df[id_df['identification'] == 'train']
id_df_test = id_df[id_df['identification'] == 'test']
emo_df = datareader.read_csv(emotion_file)

# Prepare train val test dataset
train_df, test_df, label_dict = datareader.process(datareader.read_json(json_file), id_df, emo_df)
df = datareader.train_val_split(train_df)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
8it [00:00, 174762.67it/s]


### 2. Preprocess data (feature engineering)

In [None]:
tqdm_notebook.pandas()

# Choose Feature engineer method
FE_method = '1' # '0' for feature engineer from scratch '1' for ekphrasis

# Process feature engineer method
if FE_method == '0':
    preprocessor = MyPreProcessor()
elif FE_method == '1':
    preprocessor = EkphrasisPreProcessor()
    preprocessor = preprocessor.preprocess_tweet

train_df['text'] = train_df['text'].progress_apply(preprocessor)

### 3. Choose model and tokenizer
We can choose 4 different model types here, include:
1. bert-tiny
2. bert-base-uncased
3. roberta-base
4. distilroberta-base

In [35]:
model_name = "prajjwal1/bert-tiny" # "roberta-base", "distilroberta-base", "prajjwal1/bert-tiny", "bert-base-uncased"

if model_name == 'prajjwal1/bert-tiny':
    tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=True)
    epochs = 1
elif model_name == 'bert-base-uncased':
    tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=True)
    epochs = 3
elif model_name == 'roberta-base' or 'distilroberta-base':
    tokenizer = RobertaTokenizer.from_pretrained(model_name, do_lower_case=True)
    epochs = 2

### 4. Train model

In [36]:
trainer = Trainer()
dataloader_train, dataloader_validation = trainer.prepare_train_val_dataloader(df, tokenizer, batch_size=256)
model = trainer.build_model(model_name, label_dict)
model = trainer.train(model, model_name, dataloader_train, dataloader_validation, epochs)

  0%|          | 0/1237228 [00:00<?, ?it/s]Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
100%|██████████| 1237228/1237228 [14:08<00:00, 1457.96it/s]
100%|██████████| 218335/218335 [02:29<00:00, 1461.92it/s]
Some weights of the model checkpoint at prajjwal1/bert-tiny were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight']
- This IS expected i


Epoch 1
Training loss: 1.6145336296699961


100%|██████████| 1/1 [08:04<00:00, 484.75s/it]

Validation loss: 1.5123645513026962
F1 Score (Weighted): 0.38490764461491617





### 5. Generate submission csv

In [41]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
gen_submit_csv(test_df, preprocessor, tokenizer, model, model_name, device)

## 6. Ensumble

In [None]:
import pandas as pd
from tqdm._tqdm_notebook import tqdm_notebook

# Write file path here
file1 = "/mnt/bel/Code/NLP/DM2023-Lab2/new_tryed___bert-base-uncased_submission.csv"
file2 = "/mnt/bel/Code/NLP/DM2023-Lab2/new_tryed___distilroberta-base_submission.csv"
file3 = "/mnt/bel/Code/NLP/DM2023-Lab2/new_tryed___roberta-base_submission.csv"
file4 = "/mnt/bel/Code/NLP/DM2023-Lab2/new_tryed_ekphrasis_distilroberta-base_submission.csv"
file5 = "/mnt/bel/Code/NLP/DM2023-Lab2/new_tryed_ekphrasis_roberta-base_submission.csv"

# Convert into dataframe and rename emotion colunm name
df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
df3 = pd.read_csv(file3)
df4 = pd.read_csv(file4)
df5 = pd.read_csv(file5)
df1 = df1.rename(columns={'emotion': 'emotion_df1'})
df2 = df2.rename(columns={'emotion': 'emotion_df2'})
df3 = df3.rename(columns={'emotion': 'emotion_df3'})
df4 = df4.rename(columns={'emotion': 'emotion_df4'})
df5 = df5.rename(columns={'emotion': 'emotion_df5'})

# Merge all dataframe
merged_df = pd.merge(df1[['id', 'emotion_df1']], df2[['id', 'emotion_df2']], on='id', how='outer')
merged_df = pd.merge(merged_df, df3[['id', 'emotion_df3']], on='id', how='outer')
merged_df = pd.merge(merged_df, df4[['id', 'emotion_df4']], on='id', how='outer')
merged_df = pd.merge(merged_df, df5[['id', 'emotion_df5']], on='id', how='outer')


def majority_vote(row):
    values = row[['emotion_df1', 'emotion_df2', 'emotion_df3', 'emotion_df4', 'emotion_df5']]
    
    mode_values = values.mode()

    # use majority_vote result if mode less than amount of df
    if len(mode_values) < 5:
        mode_value = mode_values.iloc[0]
    # if we didn't get real majority here, we just take first df result
    else:
        mode_value = row['emotion_df1']
    
    return mode_value

# Utils majority vote method and output csv file
tqdm_notebook.pandas()
merged_df['emotion'] = merged_df.progress_apply(majority_vote, axis=1)
result_df = merged_df[['id', 'emotion']]
print("Size: ", result_df.shape)
result_df.to_csv("ensumble_submission.csv", index=False)
