# prerequisite

In [1]:
!git clone https://github.com/sw-membership/datasets

Cloning into 'datasets'...
remote: Enumerating objects: 33, done.[K
remote: Counting objects: 100% (33/33), done.[K
remote: Compressing objects: 100% (26/26), done.[K
remote: Total 33 (delta 6), reused 0 (delta 0), pack-reused 0[K
Unpacking objects: 100% (33/33), done.


In [2]:
!pip install transformers
!pip install datasets
!pip install sadice

!pip install git+https://github.com/ufoym/imbalanced-dataset-sampler

Collecting transformers
  Downloading transformers-4.10.3-py3-none-any.whl (2.8 MB)
[K     |████████████████████████████████| 2.8 MB 4.1 MB/s 
Collecting huggingface-hub>=0.0.12
  Downloading huggingface_hub-0.0.17-py3-none-any.whl (52 kB)
[K     |████████████████████████████████| 52 kB 1.7 MB/s 
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
[K     |████████████████████████████████| 3.3 MB 55.3 MB/s 
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
[K     |████████████████████████████████| 895 kB 67.3 MB/s 
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
[K     |████████████████████████████████| 636 kB 71.9 MB/s 
Installing collected packages: tokenizers, sacremoses, pyyaml, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
  

In [3]:
import torch
import pandas as pd
import numpy as np

from sklearn.metrics import accuracy_score, f1_score
from transformers import ElectraForSequenceClassification, ElectraTokenizerFast
from datasets import load_dataset
from sklearn.model_selection import StratifiedShuffleSplit
from torch.utils.data import DataLoader
from transformers import TrainingArguments

# model & tokenizer

In [4]:
model_config = {
    "num_labels": 3,
    "id2label": {0: 0, 1: 1, 2: 2},
    "label2id": {0: 0, 1: 1, 2: 2}
}

In [5]:
discriminator = ElectraForSequenceClassification.from_pretrained("google/electra-small-discriminator", **model_config)
tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-discriminator")

Downloading:   0%|          | 0.00/665 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/54.2M [00:00<?, ?B/s]

Some weights of the model checkpoint at google/electra-small-discriminator were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense.weight', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense.bias']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

# split train, val, test dataset

In [6]:
csv = pd.read_csv('datasets/data/SAFFN.csv', header=None, usecols=[1,2], names=['labels', 'content'])
csv.head()

Unnamed: 0,labels,content
0,1,"According to Gran , the company has no plans t..."
1,1,Technopolis plans to develop in stages an area...
2,0,The international electronic industry company ...
3,2,With the new production plant the company woul...
4,2,According to the company 's updated strategy f...


In [7]:
split = StratifiedShuffleSplit(n_splits=1, test_size=500, random_state=42)

train_val_idx, test_idx = next(split.split(csv, csv["labels"]))
df_train_val = csv.loc[train_val_idx, :].reset_index()

train_idx, val_idx = next(split.split(df_train_val, df_train_val["labels"]))

train = csv.loc[train_idx, :]
val = csv.loc[val_idx, :]
test = csv.loc[test_idx, :]

train.shape, val.shape, test.shape

((3846, 2), (500, 2), (500, 2))

In [8]:
train["labels"].value_counts()

1    2436
2    1167
0     243
Name: labels, dtype: int64

In [9]:
val["labels"].value_counts()

1    335
2    143
0     22
Name: labels, dtype: int64

In [10]:
test["labels"].value_counts()

1    297
2    141
0     62
Name: labels, dtype: int64

In [11]:
train.to_csv("train.csv", index=False)
val.to_csv("val.csv", index=False)
test.to_csv("test.csv", index=False)

# datasets & dataloader

In [12]:
def tokenize(data, tokenizer=tokenizer):
    return tokenizer(data["content"], padding="max_length", truncation=True)

In [13]:
dataset = load_dataset('csv', data_files={'train': 'train.csv', 'val': 'val.csv', 'test': 'test.csv'})
dataset = dataset.map(tokenize)
dataset.set_format(
    type="torch",
    columns=["input_ids", "token_type_ids", "attention_mask", "labels"],
    device='cpu',
)

Using custom data configuration default-3d26c5bbfb3805e1


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-3d26c5bbfb3805e1/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff...


  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

0 tables [00:00, ? tables/s]

0 tables [00:00, ? tables/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-3d26c5bbfb3805e1/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/3846 [00:00<?, ?ex/s]

  0%|          | 0/500 [00:00<?, ?ex/s]

  0%|          | 0/500 [00:00<?, ?ex/s]

In [14]:
train = dataset["train"].remove_columns("content")
val = dataset["val"].remove_columns("content")
test = dataset["test"].remove_columns("content")

train, val, test

(Dataset({
     features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
     num_rows: 3846
 }), Dataset({
     features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
     num_rows: 500
 }), Dataset({
     features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
     num_rows: 500
 }))

In [15]:
def compute_metrics(p):
    pred, labels = p
    pred = np.argmax(pred, axis=1)

    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    f1 = f1_score(y_true=labels, y_pred=pred, average="macro")

    return {"accuracy": accuracy, "f1": f1}

# ImbalancedSamplerTrainer

In [16]:
import datasets
import torch
from sadice import SelfAdjDiceLoss
from torch.utils.data import DataLoader
from transformers import Trainer

from torchsampler import ImbalancedDatasetSampler


class ImbalancedSamplerTrainer(Trainer):
    def get_train_dataloader(self) -> DataLoader:
        train_dataset = self.train_dataset

        def get_label(dataset):
            return dataset["labels"]

        train_sampler = ImbalancedDatasetSampler(
            train_dataset, callback_get_label=get_label
        )

        return DataLoader(
            train_dataset,
            batch_size=self.args.train_batch_size,
            sampler=train_sampler,
            collate_fn=self.data_collator,
            drop_last=self.args.dataloader_drop_last,
            num_workers=self.args.dataloader_num_workers,
            pin_memory=self.args.dataloader_pin_memory,
        )


class TrainerWithDiceLoss(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        criterion = SelfAdjDiceLoss()
        loss = criterion(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

# train

In [23]:
args = TrainingArguments(
    output_dir='./result/',
    seed=42,
    num_train_epochs=200,
    learning_rate=1e-4,
    weight_decay=0.0,
    gradient_accumulation_steps=1,
    adam_epsilon=1e-8,
    max_grad_norm=1.0,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,

    # checkpoint
    overwrite_output_dir=True,
    save_strategy='steps',
    save_steps=1000,

    # evaluation
    evaluation_strategy='steps',
    eval_steps=1000,
    metric_for_best_model="f1",

    # early stopping
    load_best_model_at_end=True
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [24]:
trainer = ImbalancedSamplerTrainer(
    model=discriminator,
    args=args,
    train_dataset=train,
    eval_dataset=val,
    compute_metrics=compute_metrics
)

In [25]:
trainer.train()

***** Running training *****
  Num examples = 3846
  Num Epochs = 200
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 24200


Step,Training Loss,Validation Loss,Accuracy,F1
1000,0.1186,0.810502,0.834,0.740562
2000,0.0398,1.212663,0.832,0.746579
3000,0.0252,1.268967,0.842,0.749846
4000,0.0188,1.285786,0.838,0.736228
5000,0.0206,1.139631,0.846,0.768818
6000,0.0163,1.25115,0.84,0.74958
7000,0.0169,1.315602,0.84,0.728601
8000,0.0083,1.329102,0.854,0.763891
9000,0.0089,1.508816,0.846,0.760171
10000,0.0088,1.411353,0.844,0.738483


***** Running Evaluation *****
  Num examples = 500
  Batch size = 32
Saving model checkpoint to ./result/checkpoint-1000
Configuration saved in ./result/checkpoint-1000/config.json
Model weights saved in ./result/checkpoint-1000/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 500
  Batch size = 32
Saving model checkpoint to ./result/checkpoint-2000
Configuration saved in ./result/checkpoint-2000/config.json
Model weights saved in ./result/checkpoint-2000/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 500
  Batch size = 32
Saving model checkpoint to ./result/checkpoint-3000
Configuration saved in ./result/checkpoint-3000/config.json
Model weights saved in ./result/checkpoint-3000/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 500
  Batch size = 32
Saving model checkpoint to ./result/checkpoint-4000
Configuration saved in ./result/checkpoint-4000/config.json
Model weights saved in ./result/checkpoint-4000/pytorch_model.bin
***** Ru

TrainOutput(global_step=24200, training_loss=0.01761591829636619, metrics={'train_runtime': 9237.9815, 'train_samples_per_second': 83.265, 'train_steps_per_second': 2.62, 'total_flos': 2.26301950144512e+16, 'train_loss': 0.01761591829636619, 'epoch': 200.0})

# test

In [35]:
!pip install torchmetrics

Collecting torchmetrics
  Downloading torchmetrics-0.5.1-py3-none-any.whl (282 kB)
[?25l[K     |█▏                              | 10 kB 14.3 MB/s eta 0:00:01[K     |██▎                             | 20 kB 7.0 MB/s eta 0:00:01[K     |███▌                            | 30 kB 5.2 MB/s eta 0:00:01[K     |████▋                           | 40 kB 5.0 MB/s eta 0:00:01[K     |█████▉                          | 51 kB 2.5 MB/s eta 0:00:01[K     |███████                         | 61 kB 2.8 MB/s eta 0:00:01[K     |████████                        | 71 kB 2.8 MB/s eta 0:00:01[K     |█████████▎                      | 81 kB 3.2 MB/s eta 0:00:01[K     |██████████▍                     | 92 kB 3.4 MB/s eta 0:00:01[K     |███████████▋                    | 102 kB 2.7 MB/s eta 0:00:01[K     |████████████▊                   | 112 kB 2.7 MB/s eta 0:00:01[K     |██████████████                  | 122 kB 2.7 MB/s eta 0:00:01[K     |███████████████                 | 133 kB 2.7 MB/s eta 0:0

In [52]:
from torch.utils.data import DataLoader
from torchmetrics import F1
from tqdm.auto import tqdm

test_loader = DataLoader(dataset["test"], batch_size=32)

f1_score = F1(num_classes=3, average="macro").cuda()

In [69]:
test_data = load_dataset('csv', data_files={'test': 'test.csv'})
test_data = test_data.map(tokenize)
test_data.set_format(
    type="torch",
    columns=["input_ids", "token_type_ids", "attention_mask", "labels"],
    device='cuda',
)
test_loader = DataLoader(test_data["test"], batch_size=32)

Using custom data configuration default-cf8908480cfdf88b
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-cf8908480cfdf88b/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff)


  0%|          | 0/1 [00:00<?, ?it/s]

Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-cf8908480cfdf88b/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff/cache-84ed058f062fb088.arrow


In [70]:
test_loader.dataset

Dataset({
    features: ['labels', 'content', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 500
})

In [66]:
discriminator = None
discriminator = ElectraForSequenceClassification.from_pretrained("result/checkpoint-16000")
discriminator.cuda()
discriminator.eval()

with torch.no_grad():
    for batch in tqdm(test_loader):
        labels = batch.pop("labels")
        output = discriminator(**batch)
        logits = output.logits
        pred = torch.argmax(logits, dim=1)
        f1_score(pred, labels)

score = f1_score.compute()
print(score.item())

loading configuration file result/checkpoint-16000/config.json
Model config ElectraConfig {
  "_name_or_path": "google/electra-small-discriminator",
  "architectures": [
    "ElectraForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "embedding_size": 128,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 256,
  "id2label": {
    "0": 0,
    "1": 1,
    "2": 2
  },
  "initializer_range": 0.02,
  "intermediate_size": 1024,
  "label2id": {
    "0": 0,
    "1": 1,
    "2": 2
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "electra",
  "num_attention_heads": 4,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "summary_activation": "gelu",
  "summary_last_dropout": 0.1,
  "summary_type": "first",
  "summary_use_proj": true,
  "torch_dtype": "float32",
  "transformers_version": "4.10.3",
  "type_vo

  0%|          | 0/16 [00:00<?, ?it/s]

0.9558663964271545


In [67]:
discriminator = None
discriminator = ElectraForSequenceClassification.from_pretrained("result/checkpoint-24000")
discriminator.cuda()
discriminator.eval()

with torch.no_grad():
    for batch in tqdm(test_loader):
        labels = batch.pop("labels")
        output = discriminator(**batch)
        logits = output.logits
        pred = torch.argmax(logits, dim=1)
        f1_score(pred, labels)

score = f1_score.compute()
print(score.item())

loading configuration file result/checkpoint-24000/config.json
Model config ElectraConfig {
  "_name_or_path": "google/electra-small-discriminator",
  "architectures": [
    "ElectraForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "embedding_size": 128,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 256,
  "id2label": {
    "0": 0,
    "1": 1,
    "2": 2
  },
  "initializer_range": 0.02,
  "intermediate_size": 1024,
  "label2id": {
    "0": 0,
    "1": 1,
    "2": 2
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "electra",
  "num_attention_heads": 4,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "summary_activation": "gelu",
  "summary_last_dropout": 0.1,
  "summary_type": "first",
  "summary_use_proj": true,
  "torch_dtype": "float32",
  "transformers_version": "4.10.3",
  "type_vo

  0%|          | 0/16 [00:00<?, ?it/s]

0.9589840173721313


In [77]:
discriminator = None
discriminator = ElectraForSequenceClassification.from_pretrained("result/checkpoint-24000")

loading configuration file result/checkpoint-24000/config.json
Model config ElectraConfig {
  "_name_or_path": "google/electra-small-discriminator",
  "architectures": [
    "ElectraForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "embedding_size": 128,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 256,
  "id2label": {
    "0": 0,
    "1": 1,
    "2": 2
  },
  "initializer_range": 0.02,
  "intermediate_size": 1024,
  "label2id": {
    "0": 0,
    "1": 1,
    "2": 2
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "electra",
  "num_attention_heads": 4,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "summary_activation": "gelu",
  "summary_last_dropout": 0.1,
  "summary_type": "first",
  "summary_use_proj": true,
  "torch_dtype": "float32",
  "transformers_version": "4.10.3",
  "type_vo

In [74]:
def predict(discriminator, text, tokenizer=tokenizer):
    with torch.no_grad():
        tokens = tokenizer(text, padding="max_length", truncation=True, return_tensors="pt")
        output = discriminator(**tokens)
        logits = output.logits
        pred = logits.argmax().item()
        score = logits.softmax(1).max().item()
        return {"label": pred, "score": score}

In [80]:
text_list = [
    "It's a historical low point.",
    "If you look at the corporate performance that was frustrating to see the company's investment in Korea from 2011 to 2015, the KOSPI achieved 124 trillion 124 trillion won in 2011 based on annual operating profit, and if you look at it until 2015, it continued to go back and forth from 120 trillion won.",
    "In the process, the market was so happy from 2011 to 1975, and from 2016 to 2017, semiconductors turned around from 2016, gave 194 in 2017, and then 198 in 2018, and the KOSPI hit a full high point in 290 and 138 this year.",
    "That's why it's hard for me to predict the KOSPI at the moment, but if I look at the flow of corporate profits, it seems that the KOSPI between 2011 and 2015 will come down in a new way.",
    "And the stock market recovered a lot on the assumption that stocks will recover from the pandemic from the third quarter, and then corporate earnings will falter every time there are small issues, and as I said, corporate earnings will be high next week. Recently, companies that sell a lot of strong soup ramen with MS ramen have a clear personality. Recently, Ottogi was released today, not a Ottogi employee, but it's okay. Evan is okay.",
    "I haven't tried it yet, but I'm saying the sales volume isn't it?"
]

for text in text_list:
    result = predict(discriminator, text)
    print(result, '\n', text, '\n')

{'label': 0, 'score': 0.9999994039535522} 
 It's a historical low point. 

{'label': 0, 'score': 0.9914835095405579} 
 If you look at the corporate performance that was frustrating to see the company's investment in Korea from 2011 to 2015, the KOSPI achieved 124 trillion 124 trillion won in 2011 based on annual operating profit, and if you look at it until 2015, it continued to go back and forth from 120 trillion won. 

{'label': 2, 'score': 0.9999985694885254} 
 In the process, the market was so happy from 2011 to 1975, and from 2016 to 2017, semiconductors turned around from 2016, gave 194 in 2017, and then 198 in 2018, and the KOSPI hit a full high point in 290 and 138 this year. 

{'label': 1, 'score': 0.9999539852142334} 
 That's why it's hard for me to predict the KOSPI at the moment, but if I look at the flow of corporate profits, it seems that the KOSPI between 2011 and 2015 will come down in a new way. 

{'label': 2, 'score': 0.5307590365409851} 
 And the stock market recover