# 1. Information about the submission

## 1.1 Name and number of the assignment 

Text categorization: argument mining 

Assignment 2

## 1.2 Student name

Waralak Pariwatphan

## 1.3 Codalab user ID

wkpn

## 1.4 Additional comments

I apologize for any inconvenience, but I would like to request a **1-day extension** to 
submit this homework. I appreciate your understanding and support in this matter. 

<br> 

**Submission date:** April 24, 2023







# 2. Technical Report

*Use Section 2 to describe the results of your experiments as you would when writing a paper about your results. DO NOT insert code in this part. Only insert plots and tables summarizing results as needed. Use formulas if needed to describe your methodology. The code is provided in Section 3.*

## 2.1 Methodology 

**Data pre-processing**

The data were downloaded from [GitHub](https://github.com/dialogue-evaluation/RuArg/tree/main/data). The dataset consists of three files, `train.tsv`, `val_empty.tsv`, and `test-no_labels.tsv`, which are loaded as `train_df`, `val_df`, and `test_df`. The table below shows the proportion of each label in the training set for every target column.

<br>

Columns | Label -1 | Label 1 | Label 2 | Label 0
--- | --- | --- | --- | ---
masks_stance         | 0.53 | 0.27 | 0.10 | 0.09
masks_argument       | 0.53 | 0.37 | 0.05 | 0.05
quarantine_stance    | 0.69 | 0.20 | 0.09 | 0.02
quarantine_argument  | 0.69 | 0.26 | 0.03 | 0.02
vaccines_stance      | 0.75 | 0.13 | 0.06 | 0.06
vaccines_argument    | 0.75 | 0.19 | 0.02 | 0.04

<br>

I first remove the `[USER]` placeholder, pad hyphens that follow a space so that they become separate tokens, and collapse double spaces. Then I lemmatize each text with Mystem, remove Russian stop words and punctuation tokens, and join the remaining lemmas back into a single string. 

<br>

**Method**

This method is inspired by a Kaggle notebook that combined `BERT` with a neural network for a text classification task. Because of the runtime limit, I instead combine BERT embeddings with classical machine learning models.

The approach has two parts: text embedding and classification. For the embedding, I apply the tokenizer and the `BERT` model ([from the Hugging Face model repository](https://huggingface.co/ai-forever/sbert_large_nlu_ru)) to each preprocessed text, take the hidden state of the first token as the text representation, and L2-normalize it. 
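
In formula form, and assuming the implementation in Section 3.4 (the hidden state of the first token followed by `torch.nn.functional.normalize` with its default arguments), the embedding of one text can be written as:

$$
e = \frac{h_{\text{[CLS]}}}{\max\left(\lVert h_{\text{[CLS]}} \rVert_2,\ \epsilon\right)}, \qquad h_{\text{[CLS]}} \in \mathbb{R}^{1024}
$$

Here $h_{\text{[CLS]}}$ is the last-layer hidden state of the first token and $\epsilon = 10^{-12}$ is the default lower bound PyTorch uses to avoid division by zero, so every embedding has unit Euclidean length.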

The resulting embeddings are then fed to three classifiers: `LogisticRegression`, `RandomForestClassifier`, and `KNeighborsClassifier`. 
`GridSearchCV` is used to select the hyperparameters, and because the classes are imbalanced (see the table above), `RandomOverSampler` ([link to Medium](https://medium.com/geekculture/how-to-deal-with-class-imbalances-in-python-960908fe0425)) is applied to the training data before fitting. Finally, I train each model, predict the labels, and obtain the scores by submitting the predictions to CodaLab.


## 2.2 Discussion of results

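`GridSearchCV` selected the following parameters for most targets:

* `LogisticRegression(penalty = 'l2', C = 10, multi_class = 'auto', random_state=0)`
* `RandomForestClassifier(n_estimators = 200, max_depth = 10, random_state=0)`
* `KNeighborsClassifier(n_neighbors = 5, weights = 'distance', leaf_size = 10)`

The exceptions are `multi_class = 'ovr'` for logistic regression on `quarantine_stance` and `vaccines_stance`, and `n_neighbors = 50` for KNN on `masks_stance`. The final models use the settings listed above for all targets.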


<br>

The scores below were obtained for each model from the CodaLab post-evaluation phase.


Method | Validation: F1 Stance Detection | Validation: F1 Premise Classification
--- | --- | ---
**BERT + LogisticRegression**     | **0.4768** | **0.4662**
BERT + RandomForestClassifier | 0.4466 | 0.4564
BERT + KNeighborsClassifier   | 0.3562 | 0.3865

<br>

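BERT + LogisticRegression achieves the best scores among the three models and outperforms the baseline (0.4180 and 0.4355 in post-evaluation) by 0.0588 on stance detection and 0.0307 on premise classification. BERT + RandomForestClassifier also exceeds the baseline on both tasks, while BERT + KNeighborsClassifier falls below it. 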


# 3. Code

*Enter here all code used to produce your results submitted to Codalab. Add some comments and subsections to navigate through your solution.*

*In this part you are expected to develop yourself a solution of the task and provide a reproducible code:*
- *Using Python 3;*
- *Contains code for installation of all dependencies;*
- *Contains code for downloading of all the datasets used*;
- *Contains the code for reproducing your results (in other words, if a tester downloads your notebook she should be able to run cell-by-cell the code and obtain your experimental results as described in the methodology section)*.


*As a result, your code will be graded according to these criteria:*
- ***Readability**: your code should be well-structured preferably with indicated parts of your approach (Preprocessing, Model training, Evaluation, etc.).*
- ***Reproducibility**: your code should be reproduced without any mistakes with “Run all” mode (obtaining experimental part).*


## 3.1 Requirements

In [1]:
!pip install transformers
!pip install pymystem3


In [2]:
import numpy as np 
import pandas as pd

from tqdm import tqdm
tqdm.pandas()

import warnings
warnings.filterwarnings("ignore")

from pymystem3 import Mystem
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
from string import punctuation

from transformers import AutoTokenizer, AutoModel
import torch

from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## 3.2 Download the data

In [3]:
!wget -O train.tsv https://raw.githubusercontent.com/dialogue-evaluation/RuArg/main/data/train.tsv
!wget -O val_empty.tsv https://raw.githubusercontent.com/dialogue-evaluation/RuArg/main/data/val_empty.tsv
!wget -O test-no_labels.tsv https://raw.githubusercontent.com/dialogue-evaluation/RuArg/main/data/test-no_labels.tsv

--2023-04-24 17:54:56--  https://raw.githubusercontent.com/dialogue-evaluation/RuArg/main/data/train.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1539551 (1.5M) [text/plain]
Saving to: ‘train.tsv’


2023-04-24 17:54:56 (25.5 MB/s) - ‘train.tsv’ saved [1539551/1539551]

--2023-04-24 17:54:56--  https://raw.githubusercontent.com/dialogue-evaluation/RuArg/main/data/val_empty.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 308303 (301K) [text/plain]
Saving to: ‘val_empty.tsv’


2023-04-24 17:54:56 (8.36 MB/s) - ‘val_empty.t

In [4]:
train_df = pd.read_csv("train.tsv", sep="\t")
val_df = pd.read_csv("val_empty.tsv", sep="\t")         # no labels
test_df = pd.read_csv("test-no_labels.tsv", sep="\t")   # no labels

print(train_df.shape, val_df.shape, test_df.shape)

train_df.head(3)

(6717, 8) (1431, 2) (1402, 8)


Unnamed: 0,text_id,text,masks_stance,masks_argument,quarantine_stance,quarantine_argument,vaccines_stance,vaccines_argument
0,17024,"[USER], согласно предписаниям Роспотребнадзора...",-1,-1,1,1,-1,-1
1,17025,О несоблюдении карантинных мер контактными лиц...,-1,-1,1,1,-1,-1
2,17027,"[USER], читайте больше книжек на карантине, мо...",-1,-1,1,1,-1,-1


In [5]:
# ---------- add columns for val_df ----------

for prefix in ['masks', 'quarantine', 'vaccines']:
    for subfix in ['stance', 'argument']:  
        val_df[f'{prefix}_{subfix}'] = np.nan

In [6]:
# ---------- observe the labels in train_df ----------

cols = np.array(train_df.columns[2:])   # the six label columns (after text_id and text)
for col in cols:
    counts = train_df[col].value_counts()
    idx = counts.index.values
    val = counts.values
    print(f'\n{col} \nclass: {idx} \nproportion: {val/sum(val)}')


masks_stance 
class: [-1  1  2  0] 
proportion: [0.53401816 0.27274081 0.10480869 0.08843234]

masks_argument 
class: [-1  1  0  2] 
proportion: [0.53401816 0.36489504 0.05061784 0.05046896]

quarantine_stance 
class: [-1  1  2  0] 
proportion: [0.68736043 0.1996427  0.0873902  0.02560667]

quarantine_argument 
class: [-1  1  2  0] 
proportion: [0.68736043 0.26142623 0.03230609 0.01890725]

vaccines_stance 
class: [-1  1  0  2] 
proportion: [0.75316361 0.1289266  0.06223016 0.05567962]

vaccines_argument 
class: [-1  1  0  2] 
proportion: [0.75316361 0.18430847 0.04034539 0.02218252]


## 3.3 Preprocessing 

In [7]:
m = Mystem() 
russian_stopwords = stopwords.words("russian")

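def preprocessing_text(text):
    """Clean, lemmatize, and filter one text; returns the lemmas joined by spaces."""
    # Drop the [USER] placeholder, pad hyphens that follow a space, collapse double spaces
    text = text.replace('[USER],', '').replace('[USER]', '').replace(' -', ' - ').replace('  ', ' ')
    tokens = m.lemmatize(text)
    # Keep lemmas that are neither stop words, single spaces, nor punctuation or whitespace-only tokens
    tokens = [token for token in tokens
              if token not in russian_stopwords
              and token != " "
              and token.strip() not in punctuation]
    text = " ".join(tokens)
    
    return text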

Installing mystem to /root/.local/bin/mystem from http://download.cdn.yandex.net/mystem/mystem-3.1-linux-64bit.tar.gz
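
The filter in `preprocessing_text` uses `token.strip() not in punctuation`, which is a substring test against `string.punctuation`. The sketch below shows how that test behaves on a few hand-made tokens and compares it with a stricter per-character variant. The helper names `is_punct_substring` and `is_punct_chars` and the sample tokens are assumptions for illustration, not part of the notebook or of Mystem's actual output.

```python
from string import punctuation

def is_punct_substring(token):
    """Mirror the notebook's check: True if the stripped token is a substring of string.punctuation."""
    return token.strip() in punctuation

def is_punct_chars(token):
    """Stricter variant (an assumption): True if the stripped token is empty or consists only of punctuation characters."""
    return all(ch in punctuation for ch in token.strip())

# Hand-made tokens for illustration only
tokens = ["карантин", ", ", "...", "\n", "!"]
kept_substring = [t for t in tokens if not is_punct_substring(t)]
kept_chars = [t for t in tokens if not is_punct_chars(t)]

print(kept_substring)
print(kept_chars)
```

Under the substring test, a multi-character token such as `...` is kept because it is not a contiguous part of `string.punctuation`. The per-character variant would drop it.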


In [8]:
train_df['text_prep'] = train_df['text'].progress_apply(preprocessing_text)
val_df['text_prep'] = val_df['text'].progress_apply(preprocessing_text)
test_df['text_prep'] = test_df['text'].progress_apply(preprocessing_text)

100%|██████████| 6717/6717 [00:15<00:00, 446.23it/s]
100%|██████████| 1431/1431 [00:02<00:00, 562.88it/s]
100%|██████████| 1402/1402 [00:03<00:00, 421.68it/s]


In [9]:
train_df['text_prep'][:5]

0    согласно предписание роспотребнадзор весь тран...
1    несоблюдение карантинный мера контактный лицо ...
2       читать книжка карантин мочь мозг реанимировать
3    идти почитай инст наш городской паблик каждый ...
4    весь контактный лицо который обозначать отправ...
Name: text_prep, dtype: object

In [10]:
train_df.head(3)

Unnamed: 0,text_id,text,masks_stance,masks_argument,quarantine_stance,quarantine_argument,vaccines_stance,vaccines_argument,text_prep
0,17024,"[USER], согласно предписаниям Роспотребнадзора...",-1,-1,1,1,-1,-1,согласно предписание роспотребнадзор весь тран...
1,17025,О несоблюдении карантинных мер контактными лиц...,-1,-1,1,1,-1,-1,несоблюдение карантинный мера контактный лицо ...
2,17027,"[USER], читайте больше книжек на карантине, мо...",-1,-1,1,1,-1,-1,читать книжка карантин мочь мозг реанимировать


In [11]:
val_df.head(3)

Unnamed: 0,text_id,text,masks_stance,masks_argument,quarantine_stance,quarantine_argument,vaccines_stance,vaccines_argument,text_prep
0,17041,> 26 марта его поместили на принудительный кар...,,,,,,,26 март оно помещать принудительный карантин п...
1,17057,И шевкунов вещает из телевизора про необходимо...,,,,,,,шевкунов вещать телевизор необходимость самоиз...
2,17058,Это результат его же лобировал до последнего ...,,,,,,,это результат лобировать последний отказ каран...


In [12]:
test_df.head(3)

Unnamed: 0,text_id,text,masks_stance,masks_argument,quarantine_stance,quarantine_argument,vaccines_stance,vaccines_argument,text_prep
0,17059,Каникулы только дадут почву для распостранения...,,,,,,,каникулы давать почва распостранение вирус нуж...
1,17072,"Думал спокойно посидим в небольшой компании, п...",,,,,,,думать спокойно посидеть небольшой компания по...
2,17077,"[USER], в Китае болезнь гуляет с декабря, прос...",,,,,,,китай болезнь гулять декабрь просто последний ...


## 3.4 My method of text processing

### BERT 

In [13]:
# Load AutoModel from huggingface model repository : https://huggingface.co/ai-forever/sbert_large_nlu_ru

tokenizer_bert = AutoTokenizer.from_pretrained("sberbank-ai/sbert_large_nlu_ru")
model_bert = AutoModel.from_pretrained("sberbank-ai/sbert_large_nlu_ru")

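device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")   # fall back to CPU when no GPU is available
model_bert.to(device)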

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(120138, 1024, padding_idx=0)
    (position_embeddings): Embedding(512, 1024)
    (token_type_embeddings): Embedding(2, 1024)
    (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-23): 24 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=1024, out_features=1024, bias=True)
            (key): Linear(in_features=1024, out_features=1024, bias=True)
            (value): Linear(in_features=1024, out_features=1024, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=1024, out_features=1024, bias=True)
            (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inp

In [14]:
def sentence_embeddings(sentences, tokenizer, model):
    encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=200, return_tensors='pt')

    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in encoded_input.items()})

    # Use the hidden state of the first ([CLS]) token of each input as its embedding
    embeddings = model_output.last_hidden_state[:, 0, :]
    embeddings = torch.nn.functional.normalize(embeddings)   # L2-normalize each row
    
    return embeddings[0].cpu().numpy()      # embedding of the first (here: only) input text

In [15]:
# ---------- apply BERT ----------

text_dict = {'train': train_df['text_prep'], 
             'val': val_df['text_prep'],
             'test': test_df['text_prep']}

dict_text_emb = dict()

for data_cat, text_prep in text_dict.items():
    result_emb = []

    for text in tqdm(text_prep):
        text_emb = sentence_embeddings(text, tokenizer_bert, model_bert)
        result_emb.append(text_emb)
       
    dict_text_emb[data_cat] = np.array(result_emb)

100%|██████████| 6717/6717 [03:25<00:00, 32.71it/s]
100%|██████████| 1431/1431 [00:49<00:00, 29.09it/s]
100%|██████████| 1402/1402 [00:34<00:00, 40.28it/s]
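
The loop above calls `sentence_embeddings` once per text. A possible speed-up is to pass several texts per call. The sketch below only shows a batch iterator in plain Python; `iter_batches` is a hypothetical helper, and using it would also require `sentence_embeddings` to return all rows of `embeddings` instead of `embeddings[0]`.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    items = list(items)
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Illustrative input; in the notebook this would be a column such as train_df['text_prep']
texts = [f"text {i}" for i in range(7)]
batches = list(iter_batches(texts, batch_size=3))

print([len(batch) for batch in batches])
```

Each batch is a plain list of strings, which is a form the tokenizer call in `sentence_embeddings` already accepts together with `padding=True`.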


In [16]:
dict_text_emb

{'train': array([[ 0.02084843, -0.01510109, -0.01588843, ...,  0.01088143,
         -0.00916416,  0.0512589 ],
        [ 0.02601876, -0.00097957,  0.01140994, ...,  0.00829574,
          0.00525257,  0.02397489],
        [ 0.02266327, -0.0101819 ,  0.02739637, ..., -0.01387585,
          0.00345494,  0.05961689],
        ...,
        [ 0.00327311, -0.01681431,  0.00527889, ..., -0.01718167,
         -0.01344006,  0.06987476],
        [ 0.01806905,  0.00453402,  0.00501075, ...,  0.01610673,
          0.03497645,  0.03502917],
        [-0.02683961, -0.0177097 ,  0.03930188, ..., -0.00222838,
         -0.00117566,  0.06533927]], dtype=float32),
 'val': array([[ 0.01507563, -0.01729855,  0.02702075, ...,  0.01835766,
         -0.01561389,  0.03586648],
        [ 0.03517829,  0.00751793,  0.00121354, ...,  0.00619511,
         -0.01025546,  0.03417748],
        [ 0.04251985, -0.01783796, -0.01960066, ..., -0.00442569,
          0.01404218,  0.05558132],
        ...,
        [ 0.02453897, -

### Finding best params

In [None]:
models = {
    'lr' : LogisticRegression(random_state=0),
    'rf' : RandomForestClassifier(random_state=0),
    'knn' : KNeighborsClassifier()
}

In [None]:
params = {
    'lr':{
        # With the default 'lbfgs' solver only 'l2' is supported, so the 'l1' candidates fail to fit
        'penalty' : ['l1', 'l2'],
        'C' : [0.1, 0.5, 1, 5, 10],
        'multi_class' : ['auto', 'ovr', 'multinomial'],
    },

    'rf':{
        'n_estimators' : [50, 100, 200],
        'max_depth' : [1, 5, 10],     
    },   

    'knn':{
        'n_neighbors' : [5, 10, 15, 20, 50],
        'weights' : ['uniform', 'distance'],
        'leaf_size' : [10, 20, 30, 50],
    }
}

In [21]:
y_columns = ['masks_stance', 'masks_argument', 
             'quarantine_stance', 'quarantine_argument',
             'vaccines_stance', 'vaccines_argument']

y_data_all = {}
y_data_all['train'] = train_df[y_columns]
y_data_all['val'] = val_df[y_columns]
y_data_all['test'] = test_df[y_columns]

In [None]:
def grid_search_params(x_data_all, models_all, params_all):
    ros = RandomOverSampler(random_state=0)
    dict_best_params = dict() 
    
    for model_name in models_all:
        print(f'\n---------- Model name : {model_name} ----------')

        dict_best_params_pre = dict()

        for prefix in tqdm(['masks', 'quarantine', 'vaccines']):
            dict_best_params_sub = dict()
            for subfix in ['stance', 'argument']:
                x_ros, y_ros = ros.fit_resample(x_data_all['train'], y_data_all['train'][f'{prefix}_{subfix}'])
                model = models_all[model_name]
                params_grid = params_all[model_name]
                
                gs_model = GridSearchCV(model, param_grid=params_grid, scoring='f1_weighted')
                gs_model.fit(x_ros, y_ros)

                print(f'----- {prefix}_{subfix} -----')
                print(gs_model.best_params_)

                dict_best_params_sub[subfix] = gs_model.best_params_

            dict_best_params_pre[prefix] = dict_best_params_sub

        dict_best_params[model_name] = dict_best_params_pre
               
    return dict_best_params
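
`grid_search_params` oversamples the training data before passing it to `GridSearchCV`, so copies of one original sample can land in different cross-validation folds, which may make the cross-validation scores optimistic. The sketch below checks for that overlap on a toy example. It assumes a simplified round-robin fold assignment rather than scikit-learn's actual splitter, and `leaked_originals` is a hypothetical helper.

```python
def leaked_originals(sample_ids, n_folds=5):
    """Return ids of original samples that occur in more than one fold.

    sample_ids[i] is the index of the original sample behind row i of the
    oversampled data; folds are assigned round-robin for illustration only.
    """
    folds = [set(sample_ids[i::n_folds]) for i in range(n_folds)]
    seen, leaked = set(), set()
    for fold in folds:
        leaked |= seen & fold   # ids already present in an earlier fold
        seen |= fold
    return leaked

# Samples 0-5 are unique originals; sample 9 was duplicated by oversampling
oversampled_ids = [0, 1, 2, 3, 4, 5, 9, 9, 9, 9, 9]
leaked = leaked_originals(oversampled_ids)

print(leaked)
```

One common remedy is to put the sampler and the classifier into an `imblearn.pipeline.Pipeline` and pass that pipeline to `GridSearchCV`, so that resampling is applied only to the training part of each fold.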

In [None]:
dict_params = []
gs_res = grid_search_params(dict_text_emb, models, params)
dict_params.append(gs_res)


---------- Model name : lr ----------


  0%|          | 0/3 [00:00<?, ?it/s]

----- masks_stance -----
{'C': 10, 'multi_class': 'auto', 'penalty': 'l2'}


 33%|███▎      | 1/3 [18:38<37:16, 1118.46s/it]

----- masks_argument -----
{'C': 10, 'multi_class': 'auto', 'penalty': 'l2'}
----- quarantine_stance -----
{'C': 10, 'multi_class': 'ovr', 'penalty': 'l2'}


 67%|██████▋   | 2/3 [47:17<24:31, 1471.48s/it]

----- quarantine_argument -----
{'C': 10, 'multi_class': 'auto', 'penalty': 'l2'}
----- vaccines_stance -----
{'C': 10, 'multi_class': 'ovr', 'penalty': 'l2'}


100%|██████████| 3/3 [1:22:34<00:00, 1651.41s/it]


----- vaccines_argument -----
{'C': 10, 'multi_class': 'auto', 'penalty': 'l2'}

---------- Model name : rf ----------


  0%|          | 0/3 [00:00<?, ?it/s]

----- masks_stance -----
{'max_depth': 10, 'n_estimators': 200}


 33%|███▎      | 1/3 [29:06<58:13, 1746.56s/it]

----- masks_argument -----
{'max_depth': 10, 'n_estimators': 200}
----- quarantine_stance -----
{'max_depth': 10, 'n_estimators': 200}


 67%|██████▋   | 2/3 [55:01<27:13, 1633.69s/it]

----- quarantine_argument -----
{'max_depth': 10, 'n_estimators': 200}
----- vaccines_stance -----
{'max_depth': 10, 'n_estimators': 200}


100%|██████████| 3/3 [1:21:41<00:00, 1633.77s/it]


----- vaccines_argument -----
{'max_depth': 10, 'n_estimators': 200}

---------- Model name : knn ----------


  0%|          | 0/3 [00:00<?, ?it/s]

----- masks_stance -----
{'leaf_size': 10, 'n_neighbors': 50, 'weights': 'distance'}


 33%|███▎      | 1/3 [16:46<33:33, 1006.55s/it]

----- masks_argument -----
{'leaf_size': 10, 'n_neighbors': 5, 'weights': 'distance'}
----- quarantine_stance -----
{'leaf_size': 10, 'n_neighbors': 5, 'weights': 'distance'}


 67%|██████▋   | 2/3 [44:20<23:07, 1387.51s/it]

----- quarantine_argument -----
{'leaf_size': 10, 'n_neighbors': 5, 'weights': 'distance'}
----- vaccines_stance -----
{'leaf_size': 10, 'n_neighbors': 5, 'weights': 'distance'}


100%|██████████| 3/3 [1:16:51<00:00, 1537.26s/it]

----- vaccines_argument -----
{'leaf_size': 10, 'n_neighbors': 5, 'weights': 'distance'}





In [None]:
dict_params

[{'lr': {'masks': {'stance': {'C': 10, 'multi_class': 'auto', 'penalty': 'l2'},
    'argument': {'C': 10, 'multi_class': 'auto', 'penalty': 'l2'}},
   'quarantine': {'stance': {'C': 10, 'multi_class': 'ovr', 'penalty': 'l2'},
    'argument': {'C': 10, 'multi_class': 'auto', 'penalty': 'l2'}},
   'vaccines': {'stance': {'C': 10, 'multi_class': 'ovr', 'penalty': 'l2'},
    'argument': {'C': 10, 'multi_class': 'auto', 'penalty': 'l2'}}},
  'rf': {'masks': {'stance': {'max_depth': 10, 'n_estimators': 200},
    'argument': {'max_depth': 10, 'n_estimators': 200}},
   'quarantine': {'stance': {'max_depth': 10, 'n_estimators': 200},
    'argument': {'max_depth': 10, 'n_estimators': 200}},
   'vaccines': {'stance': {'max_depth': 10, 'n_estimators': 200},
    'argument': {'max_depth': 10, 'n_estimators': 200}}},
  'knn': {'masks': {'stance': {'leaf_size': 10,
     'n_neighbors': 50,
     'weights': 'distance'},
    'argument': {'leaf_size': 10, 'n_neighbors': 5, 'weights': 'distance'}},
   'quar

### Train and predict the data 

In [17]:
# ---------- select the best params ----------

models_best_params = {
    'lr' : LogisticRegression(
                            penalty = 'l2',
                            C = 10,
                            multi_class = 'auto',
                            random_state=0
                            ),
                      
    'rf' : RandomForestClassifier(
                            n_estimators = 200,
                            max_depth = 10,
                            random_state=0
                            ),
                      
    'knn' : KNeighborsClassifier(
                            n_neighbors = 5, 
                            weights = 'distance',
                            leaf_size = 10
                            )
}

In [24]:
ros = RandomOverSampler(random_state=0)
pred_dict = dict()
 
for model_name, model in tqdm(models_best_params.items()):
    
    y_val = y_data_all['val'].copy()
    y_test = y_data_all['test'].copy()

    for prefix in ['masks', 'quarantine', 'vaccines']:
        for subfix in ['stance', 'argument']:
            # Pass the target as a 1-D Series rather than a one-column DataFrame
            x_ros, y_ros = ros.fit_resample(dict_text_emb['train'], train_df[f'{prefix}_{subfix}'])
            best_model = model
            best_model.fit(x_ros, y_ros)
            y_val[f'{prefix}_{subfix}'] = best_model.predict(dict_text_emb['val'])
            y_test[f'{prefix}_{subfix}'] = best_model.predict(dict_text_emb['test'])

    pred_dict[model_name] = {'val' : y_val, 'test' : y_test}

100%|██████████| 3/3 [07:06<00:00, 142.05s/it]
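
The training loop balances every target with `RandomOverSampler` before fitting. The following plain-Python sketch illustrates the idea of random oversampling (resampling each smaller class with replacement until it matches the largest class). It is a conceptual stand-in, not imbalanced-learn's implementation, and `random_oversample_indices` and the toy labels are assumptions.

```python
import random
from collections import Counter

def random_oversample_indices(labels, seed=0):
    """Return row indices in which every class is padded, by sampling with
    replacement, up to the size of the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = max(len(idxs) for idxs in by_class.values())
    resampled = []
    for idxs in by_class.values():
        resampled.extend(idxs)                                     # keep every original row
        resampled.extend(rng.choices(idxs, k=target - len(idxs)))  # add random duplicates
    return resampled

# Toy imbalanced target: six rows of class -1, three of class 1, one of class 2
labels = [-1] * 6 + [1] * 3 + [2]
indices = random_oversample_indices(labels)
balanced_counts = Counter(labels[i] for i in indices)

print(balanced_counts)
```

After resampling, each class has as many rows as the original majority class, and all original rows are still present.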


In [27]:
pd.DataFrame(pred_dict['lr']['test'])

Unnamed: 0,masks_stance,masks_argument,quarantine_stance,quarantine_argument,vaccines_stance,vaccines_argument
0,-1,-1,2,2,-1,-1
1,-1,-1,1,2,-1,-1
2,-1,-1,2,1,-1,-1
3,-1,-1,2,2,-1,-1
4,-1,-1,1,1,-1,-1
...,...,...,...,...,...,...
1397,-1,-1,-1,-1,0,2
1398,-1,-1,-1,-1,2,1
1399,2,0,-1,-1,2,1
1400,-1,-1,-1,-1,2,1


In [28]:
pd.DataFrame(pred_dict['rf']['test'])

Unnamed: 0,masks_stance,masks_argument,quarantine_stance,quarantine_argument,vaccines_stance,vaccines_argument
0,-1,-1,1,1,-1,-1
1,-1,-1,1,1,-1,-1
2,-1,-1,1,1,-1,-1
3,-1,-1,1,2,-1,-1
4,-1,-1,1,1,-1,-1
...,...,...,...,...,...,...
1397,-1,-1,-1,-1,2,1
1398,-1,-1,-1,-1,2,1
1399,-1,-1,-1,-1,2,-1
1400,-1,-1,-1,-1,1,1


In [29]:
# Save the test predictions of each model to a separate TSV file
for model_name in ['lr', 'rf', 'knn']:
    test_res = pd.DataFrame(pred_dict[model_name]['test'])
    test_res.to_csv(f'test_res_bert_{model_name}.tsv', index = False, sep='\t')

**Evaluation**

Since the validation set has no labels, I use CodaLab to obtain the scores for the test set only. All scores are reported in Section 2.2, Discussion of results.



---

