Based on the `transformers` tutorial


- duplicates have been removed from the data

In [1]:
TRAIN_CSV = "./datasets/train_clean.csv"
SMALL_CSV = "./cache/train.csv"
SCORING_CSV = "./datasets/test.csv"

USE_SMALL = False

In [2]:
import torch
from tqdm import tqdm

SEED = 1234

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## Datasets

- the Hugging Face `datasets` package

In [3]:
# !pip install datasets
import datasets

In [4]:
arxiv_dataset = datasets.Dataset.from_csv(SMALL_CSV if USE_SMALL else TRAIN_CSV)

Using custom data configuration default-044aa5b2537bb910
Reusing dataset csv (/home/user1/.cache/huggingface/datasets/csv/default-044aa5b2537bb910/0.0.0)


In [5]:
test_size = 0.2 if USE_SMALL else 0.02
arxiv_dataset = arxiv_dataset.train_test_split(test_size=test_size)
pass

In [6]:
len(arxiv_dataset["train"]), len(arxiv_dataset["test"]), arxiv_dataset["train"][0].keys()

(103490, 2113, dict_keys(['abstract', 'title']))

In [7]:
scoring_dataset = datasets.Dataset.from_csv(SCORING_CSV)
len(scoring_dataset), scoring_dataset[0].keys()

Using custom data configuration default-2e0a9ad90b647d2d
Reusing dataset csv (/home/user1/.cache/huggingface/datasets/csv/default-2e0a9ad90b647d2d/0.0.0)


(1000, dict_keys(['abstract']))

## Tokenizer

In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")   # 1.3 MB

In [9]:
prefix = "summarize: "

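def preprocess_function(examples):
    """Tokenize prefixed abstracts as inputs and titles as labels.

    Almost all sequences fit without truncation (longest observed: 1096 / 103 tokens):
        - max_length=1024 for abstracts
        - max_length=128 for titles
    """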

    srcs = [prefix + doc for doc in examples["abstract"]]
    model_inputs = tokenizer(srcs, max_length=1024, truncation=True)

    with tokenizer.as_target_tokenizer():
        trgs = tokenizer(examples["title"], max_length=128, truncation=True)

    model_inputs["labels"] = trgs["input_ids"]
    return model_inputs

In [10]:
tokenized_arxiv = arxiv_dataset.map(preprocess_function, batched=True)

100%|██████████| 104/104 [00:34<00:00,  2.99ba/s]
100%|██████████| 3/3 [00:00<00:00,  4.28ba/s]


In [11]:
tokenized_arxiv["train"][0].keys(), tokenized_arxiv["train"][0]["abstract"], tokenized_arxiv["train"][0]["title"]

(dict_keys(['abstract', 'title', 'input_ids', 'attention_mask', 'labels']),
 'a version of the virial theorem, which takes into account the effects of the non-compact extra-dimensions, is derived in the framework of the brane world models. in the braneworld scenario, the four dimensional effective einstein equation has some extra terms, called dark radiation and dark pressure, respectively, which arise from the embedding of the 3-brane in the bulk. to derive the generalized virial theorem we use a method based on the collisionless boltzmann equation. the dark radiation term generates an equivalent mass term (the dark mass), which gives an effective contribution to the gravitational energy. this term may account for the well-known virial theorem mass discrepancy in actual clusters of galaxies. an approximate solution of the vacuum field equations on the brane, corresponding to weak gravitational fields, is also obtained, and the expressions for the dark radiation and dark mass are deriv

# Model

In [12]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

In [13]:
# model.to(device)    # needed only when training is skipped; otherwise the Trainer moves the model to the device

In [14]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

# Training

In [15]:
training_args = Seq2SeqTrainingArguments(
    output_dir="./BART-base-results",
    optim="adamw_torch",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,
    weight_decay=0.01,
    # logging_steps=1000,
    save_steps=1000,
    save_total_limit=3,
    num_train_epochs=3*8,
    fp16=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_arxiv["train"],
    eval_dataset=tokenized_arxiv["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

Using amp half precision backend


In [16]:
tqdm._instances.clear()

trainer.train(resume_from_checkpoint=True)

Loading model from ./BART-base-results/checkpoint-74000).
The following columns in the training set  don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: title, abstract. If title, abstract are not expected by `BartForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 103490
  Num Epochs = 24
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 8
  Total optimization steps = 155232
  Continuing training from checkpoint, will skip to saved global_step
  Continuing training from epoch 11
  Continuing training from global step 74000
  Will skip the first 11 epochs then the first 22816 batches in the first epoch. If this takes a lot of time, you can add the `--ignore_data_skip` flag to your launch command, but you will resume the training on data already seen by your model.
Skipping the f

{'loss': 1.7665, 'learning_rate': 1.0405457637600496e-05, 'epoch': 11.52}



Saving model checkpoint to ./BART-base-results/checkpoint-75000
Configuration saved in ./BART-base-results/checkpoint-75000/config.json


{'loss': 1.7499, 'learning_rate': 1.0341037930323644e-05, 'epoch': 11.6}


Model weights saved in ./BART-base-results/checkpoint-75000/pytorch_model.bin
tokenizer config file saved in ./BART-base-results/checkpoint-75000/tokenizer_config.json
Special tokens file saved in ./BART-base-results/checkpoint-75000/special_tokens_map.json
Deleting older checkpoint [BART-base-results/checkpoint-72000] due to args.save_total_limit


{'loss': 1.7542, 'learning_rate': 1.0276618223046794e-05, 'epoch': 11.67}



Saving model checkpoint to ./BART-base-results/checkpoint-76000
Configuration saved in ./BART-base-results/checkpoint-76000/config.json


{'loss': 1.7637, 'learning_rate': 1.0212198515769944e-05, 'epoch': 11.75}


Model weights saved in ./BART-base-results/checkpoint-76000/pytorch_model.bin
tokenizer config file saved in ./BART-base-results/checkpoint-76000/tokenizer_config.json
Special tokens file saved in ./BART-base-results/checkpoint-76000/special_tokens_map.json
Deleting older checkpoint [BART-base-results/checkpoint-73000] due to args.save_total_limit


{'loss': 1.7183, 'learning_rate': 1.0147907647907648e-05, 'epoch': 11.83}



Saving model checkpoint to ./BART-base-results/checkpoint-77000
Configuration saved in ./BART-base-results/checkpoint-77000/config.json


{'loss': 1.7281, 'learning_rate': 1.0083487940630798e-05, 'epoch': 11.9}


Model weights saved in ./BART-base-results/checkpoint-77000/pytorch_model.bin
tokenizer config file saved in ./BART-base-results/checkpoint-77000/tokenizer_config.json
Special tokens file saved in ./BART-base-results/checkpoint-77000/special_tokens_map.json
Deleting older checkpoint [BART-base-results/checkpoint-74000] due to args.save_total_limit


KeyboardInterrupt: 
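The step counts in the log above can be reproduced with a little arithmetic. The sketch below assumes a single GPU and that the number of optimizer updates per epoch is the number of batches floor-divided by the gradient accumulation steps, which is consistent with the logged numbers. The helper names `training_schedule` and `resume_position` are hypothetical and not part of `transformers`.

```python
import math


def training_schedule(num_examples, per_device_batch, grad_accum, epochs, devices=1):
    """Estimate the effective batch size and the number of optimizer updates."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * devices))
    # Assumption: a trailing partial accumulation window is not counted as an update.
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return {
        "effective_batch": per_device_batch * devices * grad_accum,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": updates_per_epoch * epochs,
    }


def resume_position(global_step, updates_per_epoch, grad_accum):
    """Translate a checkpoint's global step into (full epochs done, batches to skip)."""
    epochs_done, updates_into_epoch = divmod(global_step, updates_per_epoch)
    return epochs_done, updates_into_epoch * grad_accum


# Values taken from the training arguments and the log above.
plan = training_schedule(num_examples=103490, per_device_batch=2, grad_accum=8, epochs=24)
print(plan)
print(resume_position(74000, plan["updates_per_epoch"], grad_accum=8))
```

With these inputs the sketch gives an effective batch of 16, 155232 total updates, and a resume point of 11 epochs plus 22816 batches, matching the Trainer's messages.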

# Generation

In [17]:
def generate(example):
    input_ids = tokenizer(prefix + example["abstract"], 
                        max_length=1024, 
                        truncation=True, 
                        return_tensors="pt").input_ids  # Batch size 1
    outputs = model.generate(input_ids.to(device))
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [18]:
n = 10
arxiv_dataset["test"][n]["abstract"], arxiv_dataset["test"][n]["title"], generate(arxiv_dataset["test"][n])

('we provide an approach for learning deep neural net representations of models described via conditional moment restrictions. conditional moment restrictions are widely used, as they are the language by which social scientists describe the assumptions they make to enable causal inference. we formulate the problem of estimating the underling model as a zero-sum game between a modeler and an adversary and apply adversarial training. our approach is similar in nature to generative adversarial networks (gan), though here the modeler is learning a representation of a function that satisfies a continuum of moment conditions and the adversary is identifying violating moments. we outline ways of constructing effective adversaries in practice, including kernels centered by k-means clustering, and random forests. we examine the practical performance of our approach in the setting of non-parametric instrumental variable regression.',
 'adversarial generalized method of moments',
 'learning condi

In [19]:
n = 42
arxiv_dataset["test"][n]["abstract"], arxiv_dataset["test"][n]["title"], generate(arxiv_dataset["test"][n])

('an increasing number of studies use the spectrum of cardiac signals for analyzing the spatiotemporal dynamics of complex cardiac arrhythmias. however, the relationship between the spectrum of cardiac signals and the spatiotemporal dynamics of the underlying cardiac sources remains to date unclear. in this paper, we derive a mathematical expression relating the spectrum of cardiac signals to the spatiotemporal dynamics of cardiac sources and the measurement characteristics of the lead systems. then, by using analytical methods and computer simulations we analyze the spectrum of cardiac signals measured by idealized lead systems during correlated and uncorrelated spatiotemporal dynamics. our results show that lead systems can have distorting effects on the spectral envelope of cardiac signals, which depend on the spatial resolution of the lead systems and on the degree of spatiotemporal correlation of the underlying cardiac sources. in addition to this, our results indicate that the sp

# BLEU-score

Custom models:
- 0.02457 (vocabulary of 6152, 5 epochs each at learning rates 5e-4 and 1e-3, min. val. loss = 3.875)
- **0.19204** (vocabulary of 60k, ~15 epochs with the learning rate lowered from 5e-4 to 5e-5, min. val. loss = 2.289)
- 0.12601 (vocabulary of 84k, many epochs under varying settings, converges poorly, min. val. loss = 3.305)
- 0.10644 (BPE, vocabulary of 16k, many epochs under varying settings, converges poorly, min. val. loss = 3.8)

T5-small:
- BLEU-score: **0.044...** 1% fine-tuning
- BLEU-score: **0.16563** (3 epochs, 2.5 hours on an RTX 2060 6 GB)

T5-base:
- BLEU-score: **0.07422** (without fine-tuning)
- training is too heavy for this hardware...

BART-base:
- BLEU-score: **0.17743** 1% fine-tuning
- BLEU-score: **0.17984** (1.43 epochs, 2.5 hours on an RTX 2060 6 GB)
- BLEU-score: **0.19266** (2 epochs)


In [20]:
from torchtext.data.metrics import bleu_score

tqdm._instances.clear()

candidates = []
references = []
for example in tqdm(tokenized_arxiv["test"]):
    candidates.append(generate(example).split())
    references.append([example["title"].split()])

score = bleu_score(candidates, references, max_n=3, weights=[1/3]*3)

print('BLEU-score: {0:.5f}'.format(score))

100%|██████████| 2113/2113 [08:57<00:00,  3.93it/s]


BLEU-score: 0.19266
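For reference, here is a minimal pure-Python sketch of corpus-level BLEU with n-grams up to 3 and uniform weights, the same configuration passed to `bleu_score` above. It illustrates the standard definition (clipped n-gram precision, geometric mean, brevity penalty) and is not torchtext's implementation. The names `ngram_counts` and `corpus_bleu` are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count all n-grams of orders 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def corpus_bleu(candidates, references, max_n=3):
    """Corpus BLEU with uniform weights; `references` holds a list of references per candidate."""
    clipped = [0] * max_n   # matched n-grams, clipped by the reference counts
    total = [0] * max_n     # all candidate n-grams
    cand_len = ref_len = 0
    for cand, refs in zip(candidates, references):
        cand_len += len(cand)
        # Use the reference whose length is closest to the candidate's.
        ref_len += min((len(r) for r in refs), key=lambda l: abs(l - len(cand)))
        cand_counts = ngram_counts(cand, max_n)
        max_ref = Counter()
        for r in refs:
            max_ref |= ngram_counts(r, max_n)   # per-n-gram maximum over references
        for ngram, c in (cand_counts & max_ref).items():
            clipped[len(ngram) - 1] += c
        for ngram, c in cand_counts.items():
            total[len(ngram) - 1] += c
    if min(clipped) == 0:
        return 0.0   # some order has no matches, so the geometric mean is zero
    log_precision = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    brevity = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity * math.exp(log_precision)


example_score = corpus_bleu(["a b c d".split()], [["a b c e".split()]])
print(round(example_score, 5))
```

In the example, the unigram, bigram, and trigram precisions are 3/4, 2/3, and 1/2, so the score is the cube root of their product, 1/4.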


### Stepik score

In [21]:
SUBMISSION_NAME = "BART-base" if USE_SMALL else "BART-base-tune"

Generating titles for the test data

In [22]:
tqdm._instances.clear()

abstracts = []
titles = []

for example in tqdm(scoring_dataset):
    abstracts.append(example["abstract"])
    titles.append(generate(example))

100%|██████████| 1000/1000 [04:08<00:00,  4.03it/s]


An example of the result:

In [23]:
abstracts[1], titles[1]

('The doc2vec approach was introduced as an extension to word2vec (Le and Mikolov, 2014), to generate embeddings at the level of entire documents, with interesting results, followed by mixed success at reproducing results from the initial paper. This paper presents a rigorous empirical evaluation of doc2vec over two tasks. We compare doc2vec to two baselines and 2 advanced embedding-generating methodologies for documents. We found that doc2vec performs robustly when using models trained on large external corpora, and can be further improved by using pre-trained word embeddings. We also provide recommendations on hyper-parameter settings for general purpose applications, and release source code to induce document embeddings using our trained doc2vec models.',
 'evaluation of the doc2vec embedding-generating methodologies')

Write the generated titles to a file in `<abstract>,<title>` format:

In [24]:
import pandas as pd

submission_df = pd.DataFrame({'abstract': abstracts, 'title': titles})
submission_df.to_csv(f"./submission/predicted_titles_{SUBMISSION_NAME}.csv", index=False)

In [25]:
submission_df["title"].apply(lambda x: len(str(x).split())).describe()[["mean","std", "max"]]

mean     7.460000
std      2.337919
max     14.000000
Name: title, dtype: float64

Using the `generate_csv` script, convert the predictions file (`predicted_titles_{SUBMISSION_NAME}.csv`) into the format required for submission:

In [26]:
from helpers.create_submission import generate_csv

generate_csv(input_file=f"./submission/predicted_titles_{SUBMISSION_NAME}.csv", 
             output_file=f'./submission/submission_{SUBMISSION_NAME}.csv', 
             voc_file=f'./datasets/vocs.pkl')

# Accounting for overlap with the training set

In [27]:
import pandas as pd
import numpy as np

train_df = pd.read_csv("./datasets/train.csv")
submission_df = pd.read_csv(f"./submission/predicted_titles_{SUBMISSION_NAME}.csv")

intersect_idx = np.intersect1d(submission_df["abstract"].str.lower(), train_df["abstract"].str.lower(), return_indices=True)

submission_df.loc[intersect_idx[1], 'title'] = train_df.loc[intersect_idx[2], 'title'].values
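The same substitution can be expressed without index arithmetic. The following is a minimal pure-Python sketch, not the notebook's code: it assumes abstracts are matched case-insensitively and that the first title seen for an abstract wins. The function name `fill_known_titles` and the sample data are hypothetical.

```python
def fill_known_titles(abstracts, predicted_titles, train_pairs):
    """Replace a predicted title with the training title when the abstract is already known."""
    known = {}
    for abstract, title in train_pairs:
        # setdefault keeps the first title if the same abstract appears more than once
        known.setdefault(abstract.lower(), title)
    return [
        known.get(abstract.lower(), predicted)
        for abstract, predicted in zip(abstracts, predicted_titles)
    ]


# Hypothetical toy data for illustration only.
train_pairs = [("we study graphs.", "on graphs"), ("we study trees.", "on trees")]
abstracts = ["We study graphs.", "We study lattices."]
predicted = ["graph study", "lattice study"]
print(fill_known_titles(abstracts, predicted, train_pairs))
```

Unlike `np.intersect1d` with `return_indices=True`, which reports only the first occurrence of each common value, this variant also fills in duplicated abstracts in the submission.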

In [28]:
from helpers.create_submission import generate_csv

submission_df.to_csv(f"./submission/predicted_titles_{SUBMISSION_NAME}_fake.csv", index=False)

generate_csv(input_file=f"./submission/predicted_titles_{SUBMISSION_NAME}_fake.csv", 
             output_file=f'./submission/submission_{SUBMISSION_NAME}_fake.csv', 
             voc_file=f'./datasets/vocs.pkl')

In [29]:
f'./submission/submission_{SUBMISSION_NAME}_fake.csv'

'./submission/submission_BART-base-tune_fake.csv'

T5-small:
- **Score: 0.26174** 1% fine-tuning
- **Score: 0.34497** fine-tuning for 3 epochs
- **Score: 0.51810** + true titles substituted from the training set

T5-base:
- **Score: 0.20510** without fine-tuning
- not enough GPU memory to train with the current sequence length

BART-base:
- **Score: 0.33851** 1% fine-tuning
- **Score: 0.39536** fine-tuning for 1.5 epochs
- **Score: 0.54804** + true titles substituted from the training set
- **Score: 0.56782** + 2 epochs with gradient accumulation (enough to reach the top 10)
- ... nothing of interest beyond this point