Based on a top Kaggle solution and a [tutorial](https://towardsdatascience.com/conditional-text-generation-by-fine-tuning-gpt-2-11c1a9fc639d)

- duplicates have been removed from the data

In [1]:
TRAIN_CSV = f"./datasets/train_clean.csv"
SMALL_CSV = f"./cache/train.csv"
SCORING_CSV = f"./datasets/test.csv"

USE_SMALL = False

In [2]:
import torch
from tqdm import tqdm

SEED = 1234

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [3]:
max_title = 34          # max 103; cutoff at roughly 3 sigma
max_abstract = 457      # max 1096; cutoff at roughly 3 sigma
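
The cutoffs above are described as "roughly 3 sigma". A minimal sketch of how such a cutoff could be derived from per-example token counts, assuming the rule is mean plus three standard deviations. The helper name `three_sigma_cutoff` and the sample lengths are hypothetical and not taken from the notebook:

```python
import math
import statistics

def three_sigma_cutoff(lengths):
    """Return mean + 3 * population std of the lengths, rounded up to an int."""
    mean = statistics.fmean(lengths)
    std = statistics.pstdev(lengths)
    return math.ceil(mean + 3 * std)

def coverage(lengths, cutoff):
    """Fraction of examples that fit into the cutoff without truncation."""
    return sum(1 for n in lengths if n <= cutoff) / len(lengths)

# Toy token counts (assumed data, for illustration only)
sample_lengths = [2, 4, 4, 4, 5, 5, 7, 9]
cutoff = three_sigma_cutoff(sample_lengths)
print(cutoff, coverage(sample_lengths, cutoff))
```

Checking `coverage` on the real token counts shows how many examples would be truncated at a given cutoff.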

In [4]:
# gpt2 124M parameters - 523 MB
# distilgpt2 82M parameters - 336 MB
MODEL = "gpt2"  

## Datasets

- the Hugging Face `datasets` package

In [5]:
# !pip install datasets
import datasets

In [6]:
arxiv_dataset = datasets.Dataset.from_csv(SMALL_CSV if USE_SMALL else TRAIN_CSV)

Using custom data configuration default-044aa5b2537bb910
Reusing dataset csv (/home/user1/.cache/huggingface/datasets/csv/default-044aa5b2537bb910/0.0.0)


In [7]:
test_size = 0.2 if USE_SMALL else 0.02
arxiv_dataset = arxiv_dataset.train_test_split(test_size=test_size)
pass

In [8]:
len(arxiv_dataset["train"]), len(arxiv_dataset["test"]), arxiv_dataset["train"][0].keys()

(103490, 2113, dict_keys(['abstract', 'title']))

In [9]:
scoring_dataset = datasets.Dataset.from_csv(SCORING_CSV)
len(scoring_dataset), scoring_dataset[0].keys()

Using custom data configuration default-2e0a9ad90b647d2d
Reusing dataset csv (/home/user1/.cache/huggingface/datasets/csv/default-2e0a9ad90b647d2d/0.0.0)


(1000, dict_keys(['abstract']))

## Tokenizer

In [10]:
from transformers import AutoTokenizer

SPECIAL_TOKENS  = { "bos_token": "[BOS]",
                    "eos_token": "[EOS]",
                    "unk_token": "[UNK]",                    
                    "pad_token": "[PAD]",
                    "sep_token": "[SEP]"}

tokenizer = AutoTokenizer.from_pretrained(MODEL)   # 1.3 MB
tokenizer.add_special_tokens(SPECIAL_TOKENS)

5

GPT-2 is built from decoder blocks and works as a language model: it takes a sequence as input and predicts its continuation. For both training and generation, the input data and the target are therefore fed to it as a single sequence separated by special tokens, which the model learns.
- in the `transformers` package you also have to pass `label` == `input_ids` here, so that the construction stays uniform for the trainer
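
The sequence layout described above can be sketched without any library. The token strings mirror `SPECIAL_TOKENS` from the notebook; the helper names `build_training_text` and `build_prompt` are hypothetical:

```python
BOS, SEP, EOS = "[BOS]", "[SEP]", "[EOS]"

def build_training_text(abstract, title):
    """Training example: abstract and title in one sequence."""
    return f"{BOS}{abstract}{SEP}{title}{EOS}"

def build_prompt(abstract):
    """Generation prompt: the model is expected to continue with the title."""
    return f"{BOS}{abstract}{SEP}"

print(build_training_text("we study x.", "a study of x"))
print(build_prompt("we study x."))
```

The prompt is a strict prefix of the training text, so at generation time the model only has to continue a pattern it has already seen.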

In [11]:
def preprocess_function_gpt2(examples):
    """No optimizations: pad everything to the maximum length up front."""

    inputs = [tokenizer.bos_token + abstract + \
             tokenizer.sep_token + title + \
             tokenizer.eos_token for (abstract, title) in zip(examples["abstract"], examples["title"])]

    model_inputs = tokenizer(inputs, max_length=max_abstract + max_title + 3, truncation=True, padding="max_length")
    model_inputs["label"] = model_inputs["input_ids"]   # labels are a copy of input_ids; that is how the API works
    return model_inputs
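
The function above copies `input_ids` into the labels unchanged, so padded positions also contribute to the language-modeling loss. A common variant, shown here only as a sketch and not as what the notebook does, replaces the labels at padded positions with -100, the index that PyTorch's cross-entropy loss ignores by default. The helper name `mask_padding_labels` is hypothetical:

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids, replacing positions where attention_mask == 0 with IGNORE_INDEX."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]

# Toy example: the last two positions are padding
ids = [50257, 11, 12, 50258, 50260, 50260]
mask = [1, 1, 1, 1, 0, 0]
print(mask_padding_labels(ids, mask))
```

With every sequence padded to 494 positions, most of each example is padding, so whether those positions count toward the loss can noticeably change the reported loss values.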

In [12]:
tokenized_arxiv = arxiv_dataset.map(preprocess_function_gpt2, batched=True)

100%|██████████| 104/104 [00:55<00:00,  1.88ba/s]
100%|██████████| 3/3 [00:01<00:00,  2.82ba/s]


In [13]:
tokenized_arxiv["train"][0].keys(), tokenized_arxiv["train"][0]["abstract"], tokenized_arxiv["train"][0]["title"]

(dict_keys(['abstract', 'title', 'input_ids', 'attention_mask', 'label']),
 'as in many other areas of science, systems biology makes extensive use of statistical association and significance estimates in contingency tables, a type of categorical data analysis known in this field as enrichment (also over-representation or enhancement) analysis. in spite of efforts to create probabilistic annotations, especially in the gene ontology context, or to deal with uncertainty in high throughput-based datasets, current enrichment methods largely ignore this probabilistic information since they are mainly based on variants of the fisher exact test. we developed an open-source r package to deal with probabilistic categorical data analysis, probcd, that does not require a static contingency table. the contingency table for the enrichment problem is built using the expectation of a bernoulli scheme stochastic process given the categorization probabilities. an on-line interface was created to allow 

The special tokens work

In [14]:
tokenizer.decode(tokenized_arxiv["train"][0]["input_ids"], skip_special_tokens=False)

'[BOS]as in many other areas of science, systems biology makes extensive use of statistical association and significance estimates in contingency tables, a type of categorical data analysis known in this field as enrichment (also over-representation or enhancement) analysis. in spite of efforts to create probabilistic annotations, especially in the gene ontology context, or to deal with uncertainty in high throughput-based datasets, current enrichment methods largely ignore this probabilistic information since they are mainly based on variants of the fisher exact test. we developed an open-source r package to deal with probabilistic categorical data analysis, probcd, that does not require a static contingency table. the contingency table for the enrichment problem is built using the expectation of a bernoulli scheme stochastic process given the categorization probabilities. an on-line interface was created to allow usage by non-programmers and is available at: http://xerad.systemsbiolo

# Model

In [15]:
from transformers import GPT2LMHeadModel, TrainingArguments, Trainer

model = GPT2LMHeadModel.from_pretrained(MODEL,
                                        bos_token_id=tokenizer.bos_token_id,
                                        eos_token_id=tokenizer.eos_token_id,
                                        sep_token_id=tokenizer.sep_token_id,
                                        pad_token_id=tokenizer.pad_token_id, 
                                        output_hidden_states=False)

# alternatively, pad everything with [EOS] and pass pad_token_id = tokenizer.eos_token_id to the constructor

model.resize_token_embeddings(len(tokenizer))

Embedding(50262, 768)

Freezing layers (the farther a layer is from the input, the more general the dependencies in the source sequences it accumulates: token -> sentence -> text -> discourse).

In theory, this should speed up convergence while preserving generalization.

In [16]:
for parameter in model.parameters():
    parameter.requires_grad = False                         # autograd graph construction disabled everywhere

for i, m in enumerate(model.transformer.h):        
    #Only un-freeze the last n transformer blocks
    if i >= 6:
        for parameter in m.parameters():
            parameter.requires_grad = True 

for parameter in model.transformer.ln_f.parameters():        
    parameter.requires_grad = True

for parameter in model.lm_head.parameters():        
    parameter.requires_grad = True
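
To check what the freezing above actually leaves trainable, the parameters can be counted. This is a minimal sketch; the helper name `count_parameters` is hypothetical, and it relies only on each parameter having `numel()` and `requires_grad`, as PyTorch parameters do:

```python
def count_parameters(parameters):
    """Return (trainable, total) element counts for an iterable of parameter-like objects."""
    trainable = 0
    total = 0
    for p in parameters:
        n = p.numel()
        total += n
        if p.requires_grad:
            trainable += n
    return trainable, total
```

In the notebook it would be called as `count_parameters(model.parameters())`. The ratio of the two numbers shows how much of the network is being fine-tuned.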

In [17]:
# model.to(device)    # if no train else trainer sends

In [18]:
# from transformers import DataCollatorForSeq2Seq

# does not work with GPT-2 (everything was padded up front instead; not elegant, but acceptable, and memory use will not blow up unexpectedly)
# data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

# Training

In [19]:
training_args = TrainingArguments(
    output_dir="./GPT2-results",
    optim="adamw_torch",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=3,
    per_device_eval_batch_size=3,
    gradient_accumulation_steps=12,         # effective batch = grad_acc * batch = 36
    weight_decay=0.01,
    save_steps=100, 
    save_total_limit=3,
    num_train_epochs=0.05*12,               # = 0.6 of an epoch (see the log below); gradient accumulation does not divide the epoch count
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_arxiv["train"],
    eval_dataset=tokenized_arxiv["test"],
    tokenizer=tokenizer,
)

Using amp half precision backend
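
The relation between the settings above and the number of optimizer steps can be checked with a little arithmetic. The sketch below assumes a single device and that the number of updates per epoch is the number of batches integer-divided by the accumulation steps. The helper name `total_optimization_steps` is hypothetical:

```python
import math

def total_optimization_steps(num_examples, per_device_batch, grad_accum, num_epochs):
    """Estimate the optimizer steps for a single-device run with gradient accumulation."""
    batches_per_epoch = math.ceil(num_examples / per_device_batch)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(num_epochs * updates_per_epoch)

# Values from the notebook: 103490 training examples, batch 3, accumulation 12
steps = total_optimization_steps(103490, 3, 12, 0.05 * 12)
print(steps)
```

Under these assumptions the estimate matches the `Total optimization steps = 1725` line in the training log, which corresponds to 0.6 of one pass over the training set.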


In [20]:
tqdm._instances.clear()

trainer.train()#(resume_from_checkpoint=True)

The following columns in the training set  don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: title, abstract. If title, abstract are not expected by `GPT2LMHeadModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 103490
  Num Epochs = 1
  Instantaneous batch size per device = 3
  Total train batch size (w. parallel, distributed & accumulation) = 36
  Gradient Accumulation steps = 12
  Total optimization steps = 1725
 29%|██▉       | 500/1725 [27:45<1:07:59,  3.33s/it]

{'loss': 2.8089, 'learning_rate': 1.4249275362318842e-05, 'epoch': 0.17}


 58%|█████▊    | 1000/1725 [55:29<40:04,  3.32s/it] Saving model checkpoint to ./GPT2-results/checkpoint-1000
Configuration saved in ./GPT2-results/checkpoint-1000/config.json


{'loss': 1.5474, 'learning_rate': 8.452173913043478e-06, 'epoch': 0.35}


Model weights saved in ./GPT2-results/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in ./GPT2-results/checkpoint-1000/tokenizer_config.json
Special tokens file saved in ./GPT2-results/checkpoint-1000/special_tokens_map.json
 87%|████████▋ | 1500/1725 [1:23:27<12:19,  3.28s/it]

{'loss': 1.5306, 'learning_rate': 2.6550724637681165e-06, 'epoch': 0.52}


100%|██████████| 1725/1725 [1:35:43<00:00,  3.30s/it]The following columns in the evaluation set  don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: title, abstract. If title, abstract are not expected by `GPT2LMHeadModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 2113
  Batch size = 3
                                                     
100%|██████████| 1725/1725 [1:36:50<00:00,  3.30s/it]

Training completed. Do not forget to share your model on huggingface.co/models =)


100%|██████████| 1725/1725 [1:36:50<00:00,  3.37s/it]

{'eval_loss': 1.433777928352356, 'eval_runtime': 66.8537, 'eval_samples_per_second': 31.606, 'eval_steps_per_second': 10.545, 'epoch': 0.6}
{'train_runtime': 5810.734, 'train_samples_per_second': 10.686, 'train_steps_per_second': 0.297, 'train_loss': 1.9055420275701993, 'epoch': 0.6}





TrainOutput(global_step=1725, training_loss=1.9055420275701993, metrics={'train_runtime': 5810.734, 'train_samples_per_second': 10.686, 'train_steps_per_second': 0.297, 'train_loss': 1.9055420275701993, 'epoch': 0.6})

In [21]:
trainer.save_model() 

Saving model checkpoint to ./GPT2-results
Configuration saved in ./GPT2-results/config.json
Model weights saved in ./GPT2-results/pytorch_model.bin
tokenizer config file saved in ./GPT2-results/tokenizer_config.json
Special tokens file saved in ./GPT2-results/special_tokens_map.json


# Generation

The model has learned that text framed by `[BOS]` and `[SEP]` is followed by another text ending with `[EOS]`, and tries to reproduce it as plausibly as possible.

The package has no ready-made seq2seq generation method for this setup, so:
- the answer is everything the model generated after `[SEP]`, truncated to at most `max_title` words (an undertrained model keeps generating until the length limit)

In [22]:
model.config.pad_token_id = model.config.eos_token_id       # ?? not needed when all special tokens are registered in the model
model.config.max_length = max_abstract + max_title + 3

def generate(example, num_beams=3, num_return_sequences=1):
    """Generation method: beam search only (combined with sampling, since do_sample=True below)"""

    text = tokenizer.bos_token + example["abstract"] + tokenizer.sep_token
    input_ids = tokenizer(text, 
                        max_length=max_abstract + 2, 
                        truncation=True, 
                        return_tensors="pt").input_ids  # Batch size 1
    outputs = model.generate(input_ids.to(device), do_sample=True, 
                                                   num_beams=num_beams, 
                                                   max_length = max_abstract + max_title + 3,
                                                   repetition_penalty=5.0, 
                                                   no_repeat_ngram_size  = 2, 
                                                   early_stopping=True, 
                                                   num_return_sequences=num_return_sequences)
    
    res = tokenizer.decode(outputs[0], skip_special_tokens=False)
    res = res[res.find(tokenizer.sep_token)+len(tokenizer.sep_token):res.find(tokenizer.eos_token)]
    
    return " ".join(res.split()[:max_title])
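
The slicing in `generate` relies on `str.find`, which returns -1 when a token is missing. If `[EOS]` was never generated, the slice drops the last character; if the prompt was truncated before `[SEP]`, the slice starts at an arbitrary offset. A more defensive extraction could look like the sketch below. The function name `extract_title` is hypothetical, and the defaults mirror the notebook's special tokens and `max_title`:

```python
def extract_title(decoded, sep="[SEP]", eos="[EOS]", max_words=34):
    """Return the text between the first sep and the following eos, limited to max_words words."""
    sep_pos = decoded.find(sep)
    if sep_pos == -1:
        return ""                      # no separator: nothing can be attributed to the title
    start = sep_pos + len(sep)
    end = decoded.find(eos, start)     # search for eos only after the separator
    if end == -1:
        end = len(decoded)             # model never emitted eos: keep everything generated
    return " ".join(decoded[start:end].split()[:max_words])

print(extract_title("[BOS]some abstract[SEP]a short title[EOS][PAD][PAD]"))
```

Returning an empty string when `[SEP]` is absent is one possible design choice; it makes such cases easy to count afterwards.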

In [23]:
n = 10
arxiv_dataset["test"][n]["abstract"], arxiv_dataset["test"][n]["title"], generate(arxiv_dataset["test"][n])

('we provide a jacobian criterion that applies to arbitrary chemical reaction networks taken with mass-action kinetics to preclude the existence of multiple positive steady states within any stoichiometric class for any choice of rate constants. we are concerned with the characterization of injective networks, that is, networks for which the species formation rate function is injective in the interior of the positive orthant within each stoichiometric class. we show that a network is injective if and only if the determinant of the jacobian of a certain function does not vanish. the function consists of components of the species formation rate function and a maximal set of independent conservation laws. the determinant of the function is a polynomial in the species concentrations and the rate constants (linear in the latter) and its coefficients are fully determined. the criterion also precludes the existence of degenerate steady states. further, we relate injectivity of a chemical reac

In [24]:
n = 42
arxiv_dataset["test"][n]["abstract"], arxiv_dataset["test"][n]["title"], generate(arxiv_dataset["test"][n])

('compact pseudo-riemannian manifolds that have parallel weyl tensor without being conformally flat or locally symmetric are known to exist in infinitely many dimensions greater than 4. we prove some general topological properties of such manifolds, namely, vanishing of the euler characteristic and real pontryagin classes, and infiniteness of the fundamental group. we also show that, in the lorentzian case, each of them is at least 5-dimensional and admits a two-fold cover which is a bundle over the circle.',
 'on compact manifolds admitting indefinite metrics with parallel weyl   tensor',
 'compacts with hyperbolic polynomials')

# BLEU-score

Home-made models:
- 0.02457 (vocabulary 6152, 5 epochs each at 5e-4 and 1e-3, min.val.loss = 3.875) 
- **0.19204** (vocabulary 60k, ~15 epochs with learning rate 5e-4 -> 5e-5, min.val.loss = 2.289)
- 0.12601 (vocabulary 84k, many assorted epochs, converges poorly, min.val.loss = 3.305)
- 0.10644 (BPE, vocabulary 16k, many assorted epochs, converges poorly, min.val.loss = 3.8)

T5-small
- BLEU-score: **0.044...** 1% tuning
- BLEU-score: **0.16563** (3 epochs, 2.5 hours on an RTX2060 6Gb)

T5-base
- BLEU-score: **0.07422** (without training)
- training is not feasible...

BART-base
- BLEU-score: **0.17743** 1% tuning
- BLEU-score: **0.17984** (1.43 epochs, 2.5 hours on an RTX2060 6Gb)
- BLEU-score: **0.19266** (2 epochs)

GPT-2
- BLEU-score: **0.00318** 5% tuning (the training log above reports 0.6 of an epoch)

In [25]:
from torchtext.data.metrics import bleu_score

tqdm._instances.clear()

candidates = []
references = []
for example in tqdm(tokenized_arxiv["test"]):
    candidates.append(generate(example).split())
    references.append([example["title"].split()])

score = bleu_score(candidates, references, max_n=3, weights=[1/3]*3)

print('BLEU-score: {0:.5f}'.format(score))

100%|██████████| 2113/2113 [32:27<00:00,  1.09it/s] 


BLEU-score: 0.00318
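
For intuition about what the score measures, here is a dependency-free sketch of corpus-level BLEU with uniform weights up to trigrams. It is a simplified illustration, not the `torchtext` implementation: it assumes exactly one reference per candidate (passed as a flat token list, unlike the nested lists in the cell above), and the function names are hypothetical:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(candidates, references, max_n=3):
    """candidates, references: parallel lists of token lists (one reference each)."""
    clipped = [0] * max_n
    total = [0] * max_n
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        cand_len += len(cand)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            cand_counts = ngrams(cand, n)
            ref_counts = ngrams(ref, n)
            # Each candidate n-gram is credited at most as often as it occurs in the reference
            clipped[n - 1] += sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
            total[n - 1] += sum(cand_counts.values())
    if cand_len == 0 or min(clipped) == 0:
        return 0.0
    log_precision = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    brevity = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity * math.exp(log_precision)

print(corpus_bleu(["a b c d".split()], ["a b c e".split()]))
```

Because the geometric mean collapses to zero as soon as one n-gram order has no matches in the whole corpus, titles that share only isolated words with the references produce very low scores.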


### Stepik score

In [26]:
SUBMISSION_NAME = "GPT-base" if USE_SMALL else "GPT-base-tune"

Generating titles for the test data

In [27]:
tqdm._instances.clear()

abstracts = []
titles = []

for example in tqdm(scoring_dataset):
    abstracts.append(example["abstract"])
    titles.append(generate(example))

100%|██████████| 1000/1000 [15:11<00:00,  1.10it/s]


For example, the result is

In [28]:
abstracts[1], titles[1]

('The doc2vec approach was introduced as an extension to word2vec (Le and Mikolov, 2014), to generate embeddings at the level of entire documents, with interesting results, followed by mixed success at reproducing results from the initial paper. This paper presents a rigorous empirical evaluation of doc2vec over two tasks. We compare doc2vec to two baselines and 2 advanced embedding-generating methodologies for documents. We found that doc2vec performs robustly when using models trained on large external corpora, and can be further improved by using pre-trained word embeddings. We also provide recommendations on hyper-parameter settings for general purpose applications, and release source code to induce document embeddings using our trained doc2vec models.',
 'doc2vars: training methods in embedded documents')

Write the generated titles to a file in the format `<abstract>,<title>`:

In [29]:
import pandas as pd

submission_df = pd.DataFrame({'abstract': abstracts, 'title': titles})
submission_df.to_csv(f"./submission/predicted_titles_{SUBMISSION_NAME}.csv", index=False)

In [30]:
submission_df["title"].apply(lambda x: len(str(x).split())).describe()[["mean","std", "max"]]

mean    12.066000
std     10.477663
max     34.000000
Name: title, dtype: float64

Using the `generate_csv` script, convert the file `predicted_titles_{SUBMISSION_NAME}.csv` into the format required for submission:

In [31]:
from helpers.create_submission import generate_csv

generate_csv(input_file=f"./submission/predicted_titles_{SUBMISSION_NAME}.csv", 
             output_file=f'./submission/submission_{SUBMISSION_NAME}.csv', 
             voc_file=f'./datasets/vocs.pkl')

# Accounting for abstracts that also occur in the training set

In [32]:
import pandas as pd
import numpy as np

train_df = pd.read_csv("./datasets/train.csv")
submission_df = pd.read_csv(f"./submission/predicted_titles_{SUBMISSION_NAME}.csv")

intersect_idx = np.intersect1d(submission_df["abstract"].str.lower(), train_df["abstract"].str.lower(), return_indices=True)

submission_df.loc[intersect_idx[1], 'title'] = train_df.loc[intersect_idx[2], 'title'].values
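
The cell above overwrites generated titles with training titles wherever a test abstract also occurs (case-insensitively) in the training set. The same idea as a dependency-free sketch; the function name `fill_known_titles` and the toy data are hypothetical:

```python
def fill_known_titles(abstracts, generated_titles, train_pairs):
    """Replace a generated title with the training title when the abstract is already known.

    train_pairs: iterable of (abstract, title) tuples from the training set.
    If an abstract occurs several times in train_pairs, the last title wins.
    """
    known = {abstract.lower(): title for abstract, title in train_pairs}
    return [known.get(a.lower(), t) for a, t in zip(abstracts, generated_titles)]

train = [("we study x.", "a study of x")]
print(fill_known_titles(["We study X.", "something new"], ["guess 1", "guess 2"], train))
```

A dictionary lookup keeps the matching logic explicit and does not depend on the index arrays returned by `np.intersect1d`.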

In [33]:
from helpers.create_submission import generate_csv

submission_df.to_csv(f"./submission/predicted_titles_{SUBMISSION_NAME}_fake.csv", index=False)

generate_csv(input_file=f"./submission/predicted_titles_{SUBMISSION_NAME}_fake.csv", 
             output_file=f'./submission/submission_{SUBMISSION_NAME}_fake.csv', 
             voc_file=f'./datasets/vocs.pkl')

In [34]:
f'./submission/submission_{SUBMISSION_NAME}_fake.csv'

'./submission/submission_GPT-base-tune_fake.csv'

T5-small:
- **Score: 0.26174** 1% tuning
- **Score: 0.34497** tuning, 3 epochs
- **Score: 0.51810** + adding the correct labels from the training set

T5-base:
- **Score: 0.20510** w/o tuning
- not enough GPU memory to train with the current sequence length

BART-base
- **Score: 0.33851** 1% tuning
- **Score: 0.39536** tuning, 1.5 epochs
- **Score: 0.54804** + adding the correct labels from the training set
- **Score: 0.56782** + 2 epochs with gradient accumulation (and that makes the top 10)
- ... beyond that, not interesting

GPT-2
- **Score:**