## Libraries

* HuggingFace Transformers (for the CodeT5 model)
* HuggingFace Datasets (for loading the dataset + preprocessing it)
* PyTorch Lightning (for training)
* Weights and Biases (for logging training metrics).

In [1]:
!pip install -q transformers datasets pytorch-lightning wandb

## Preparing the data

Load the code-to-text dataset (Ruby subset).

In [2]:
from datasets import load_dataset

dataset = load_dataset("code_x_glue_ct_code_to_text", "ruby")
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'repo', 'path', 'func_name', 'original_string', 'language', 'code', 'code_tokens', 'docstring', 'docstring_tokens', 'sha', 'url'],
        num_rows: 24927
    })
    validation: Dataset({
        features: ['id', 'repo', 'path', 'func_name', 'original_string', 'language', 'code', 'code_tokens', 'docstring', 'docstring_tokens', 'sha', 'url'],
        num_rows: 1400
    })
    test: Dataset({
        features: ['id', 'repo', 'path', 'func_name', 'original_string', 'language', 'code', 'code_tokens', 'docstring', 'docstring_tokens', 'sha', 'url'],
        num_rows: 1261
    })
})


In [3]:
example = dataset['train'][0]

print("Code:", example["code"])
print("Docstring:", example["docstring"])

Code: def handle_parsed_websocket_message(json_data)
      data =  json_data.is_a?(Hash) ? json_data.stringify_keys : {}
      if CelluloidPubsub::Reactor::AVAILABLE_ACTIONS.include?(data['client_action'].to_s)
        log_debug "#{self.class} finds actions for  #{json_data}"
        delegate_action(data) if data['client_action'].present?
      else
        handle_unknown_action(data['channel'], json_data)
      end
    end
Docstring: method that checks if the data is a Hash

 if the data is a hash then will stringify the keys and will call the method {#delegate_action}
 that will handle the message, otherwise will call the method {#handle_unknown_action}

 @see #delegate_action
 @see #handle_unknown_action

 @param [Hash] json_data

 @return [void]

 @api public


The model needs:
* input: the code, converted to `input_ids` + `attention_mask`
* output: the docstrings, converted to `labels` (the `input_ids` of the docstrings).
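
To make the `input_ids` / `attention_mask` pair concrete, here is a toy sketch of fixed-length padding and truncation on plain lists of token ids. The helper `pad_or_truncate` is hypothetical and is not the tokenizer's implementation (a real tokenizer also handles special tokens such as `<s>` and `</s>`); the pad id of 0 is an assumption for the example.

```python
def pad_or_truncate(token_ids, max_length, pad_token_id=0):
    """Toy stand-in for padding="max_length", truncation=True (ignores special tokens)."""
    ids = list(token_ids)[:max_length]   # truncation: drop tokens past max_length
    attention_mask = [1] * len(ids)      # 1 marks a real token
    n_pad = max_length - len(ids)
    ids += [pad_token_id] * n_pad        # right-pad up to the fixed length
    attention_mask += [0] * n_pad        # 0 marks padding
    return {"input_ids": ids, "attention_mask": attention_mask}


short = pad_or_truncate([5, 6, 7], max_length=5)
long = pad_or_truncate([5, 6, 7, 8, 9, 10], max_length=5)
print(short)
print(long)
```

Every example ends up with the same length, which is what lets a `DataLoader` stack them into one tensor per batch.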


In [4]:
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-small")

prefix = "Summarize Ruby: "
max_input_length = 256
max_target_length = 128

def preprocess_examples(examples):
  # encode the code-docstring pairs
  codes = examples['code']
  docstrings = examples['docstring']
  
  inputs = [prefix + code for code in codes]
  model_inputs = tokenizer(inputs, max_length=max_input_length, padding="max_length", truncation=True)

  # encode the summaries
  labels = tokenizer(docstrings, max_length=max_target_length, padding="max_length", truncation=True).input_ids

  # important: replace the padding token id with -100
  # so that CrossEntropyLoss ignores those positions
  labels_with_ignore_index = []
  for labels_example in labels:
    labels_example = [label if label != tokenizer.pad_token_id else -100 for label in labels_example]
    labels_with_ignore_index.append(labels_example)
  
  model_inputs["labels"] = labels_with_ignore_index

  return model_inputs
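
Why -100? It is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so positions carrying that label contribute nothing to the loss. The sketch below mimics that behaviour in plain Python on a made-up 3-token vocabulary; `masked_cross_entropy` is a hypothetical helper for illustration, not the PyTorch implementation.

```python
import math

IGNORE_INDEX = -100  # the value written over padding positions in `labels`


def masked_cross_entropy(log_probs, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignore_index.

    log_probs: one list of per-token log-probabilities per position.
    labels:    one target token id (or ignore_index) per position.
    """
    kept = [-lp[y] for lp, y in zip(log_probs, labels) if y != ignore_index]
    return sum(kept) / len(kept)


# Toy vocabulary of 3 tokens, sequence of 3 positions; the last one is padding.
log_probs = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],
    [math.log(0.1), math.log(0.8), math.log(0.1)],
    [math.log(0.3), math.log(0.3), math.log(0.4)],
]
labels = [0, 1, IGNORE_INDEX]

loss = masked_cross_entropy(log_probs, labels)
print(loss)
```

Only the first two positions enter the average; whatever the model predicts at the padded position has no effect on the loss.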

In [5]:
# Now that we have defined the function, let's call `.map()` on the HuggingFace Dataset object, which allows us to apply this function in batches (by default a batch size of 1,000 is used!) - hence super fast.
dataset = dataset.map(preprocess_examples, batched=True)
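
With `batched=True`, the function receives a dict that maps each column name to a list of values (one per example) and returns new columns in the same shape. The toy `map_batched` below only illustrates that contract on a list of dicts; it is a hypothetical stand-in, not how the Datasets library is implemented.

```python
def map_batched(rows, fn, batch_size=1000):
    """Toy illustration of batched mapping: `fn` gets a dict of column -> list of values."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_cols = fn(batch)  # dict of column -> list, one value per example
        for i, row in enumerate(chunk):
            out.append({**row, **{key: values[i] for key, values in new_cols.items()}})
    return out


def add_lengths(batch):
    # Hypothetical preprocessing function: returns one new value per example.
    return {"code_len": [len(code) for code in batch["code"]]}


rows = [{"code": "def a; end"}, {"code": "puts 1"}, {"code": "x"}]
mapped = map_batched(rows, add_lengths, batch_size=2)
print(mapped)
```

This is why `preprocess_examples` indexes `examples['code']` and gets a list back rather than a single string.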

In [6]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'repo', 'path', 'func_name', 'original_string', 'language', 'code', 'code_tokens', 'docstring', 'docstring_tokens', 'sha', 'url', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 24927
    })
    validation: Dataset({
        features: ['id', 'repo', 'path', 'func_name', 'original_string', 'language', 'code', 'code_tokens', 'docstring', 'docstring_tokens', 'sha', 'url', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1400
    })
    test: Dataset({
        features: ['id', 'repo', 'path', 'func_name', 'original_string', 'language', 'code', 'code_tokens', 'docstring', 'docstring_tokens', 'sha', 'url', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1261
    })
})

In [7]:
from torch.utils.data import DataLoader

dataset.set_format(type="torch", columns=['input_ids', 'attention_mask', 'labels'])
train_dataloader = DataLoader(dataset['train'], shuffle=True, batch_size=48)
valid_dataloader = DataLoader(dataset['validation'], batch_size=16)
test_dataloader = DataLoader(dataset['test'], batch_size=16)
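
A quick sanity check on the batch sizes above: the number of batches per epoch follows from the split sizes printed earlier, since `DataLoader` keeps the last, smaller batch by default. The helper name `num_batches` is made up for this sketch.

```python
import math


def num_batches(num_examples, batch_size):
    # DataLoader keeps the final partial batch by default (drop_last=False)
    return math.ceil(num_examples / batch_size)


train_steps_per_epoch = num_batches(24927, 48)
valid_steps = num_batches(1400, 16)
test_steps = num_batches(1261, 16)
print(train_steps_per_epoch, valid_steps, test_steps)
```

The training value is what `len(train_dataloader)` returns, and it feeds into the total step count of the learning rate schedule.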

In [8]:
batch = next(iter(train_dataloader))
print(batch.keys())

dict_keys(['input_ids', 'attention_mask', 'labels'])


In [9]:
tokenizer.decode(batch['input_ids'][0])

'<s>Summarize Ruby: def [](connection_name)\n      class_name = connection_name.to_s.split(\'_\').map { |s| s.capitalize }.join\n\n      begin\n        Coolio::Connections.const_get class_name\n      rescue NameError\n        raise NameError, "No connection type registered for #{connection_name.inspect}"\n      end\n    end</s><pad><pad><pad>...

In [10]:
labels = batch['labels'][0]
tokenizer.decode([label for label in labels if label != -100])

'<s>Look up a connection class by its name</s>'

## Fine-tuning with PyTorch Lightning


In [11]:
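from transformers import T5ForConditionalGeneration, get_linear_schedule_with_warmup
# transformers' own AdamW is deprecated in favor of the PyTorch implementation;
# note that torch's default weight_decay is 0.01, whereas transformers' was 0.0
from torch.optim import AdamW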
import pytorch_lightning as pl

class CodeT5(pl.LightningModule):
    def __init__(self, lr=5e-5, num_train_epochs=15, warmup_steps=1000):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-small")
        self.save_hyperparameters()

    def forward(self, input_ids, attention_mask, labels=None):     
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        return outputs
    
    def common_step(self, batch, batch_idx):
        outputs = self(**batch)
        loss = outputs.loss

        return loss
      
    def training_step(self, batch, batch_idx):
        loss = self.common_step(batch, batch_idx)     
        # logged at every training step (Lightning's default inside training_step);
        # pass on_epoch=True to also log the average across the epoch
        self.log("training_loss", loss)

        return loss

    def validation_step(self, batch, batch_idx):
        loss = self.common_step(batch, batch_idx)     
        self.log("validation_loss", loss, on_epoch=True)

        return loss

    def test_step(self, batch, batch_idx):
        loss = self.common_step(batch, batch_idx)     

        return loss

    def configure_optimizers(self):
        # create optimizer
        optimizer = AdamW(self.parameters(), lr=self.hparams.lr)
        # create learning rate scheduler
        num_train_optimization_steps = self.hparams.num_train_epochs * len(train_dataloader)
        lr_scheduler = {'scheduler': get_linear_schedule_with_warmup(optimizer,
                                                    num_warmup_steps=self.hparams.warmup_steps,
                                                    num_training_steps=num_train_optimization_steps),
                        'name': 'learning_rate',
                        'interval':'step',
                        'frequency': 1}
        
        return {"optimizer": optimizer, "lr_scheduler": lr_scheduler}

    def train_dataloader(self):
        return train_dataloader

    def val_dataloader(self):
        return valid_dataloader

    def test_dataloader(self):
        return test_dataloader
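
To see what the scheduler in `configure_optimizers` does to the learning rate, here is a plain-Python sketch of a linear warmup followed by linear decay, using the defaults above (`lr=5e-5`, `warmup_steps=1000`, 15 epochs) and assuming 520 batches per epoch (24,927 training examples at batch size 48). The function `linear_warmup_decay` is written for illustration and is only meant to mirror the shape of `get_linear_schedule_with_warmup`.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """LR multiplier: ramps 0 -> 1 during warmup, then decays linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 5e-5
warmup_steps = 1000
total_steps = 15 * 520  # num_train_epochs * batches per epoch (assumed 520)

for step in (0, 500, 1000, 4400, 7800, 9000):
    print(step, base_lr * linear_warmup_decay(step, warmup_steps, total_steps))
```

Note that for steps beyond `total_steps` the multiplier stays at 0, so under this schedule any training past `num_train_epochs` epochs would run with a zero learning rate.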

Initialize Weights & Biases

In [12]:
import wandb

wandb.login()

wandb: Currently logged in as: oleg_oleg_96. Use `wandb login --relogin` to force relogin


True

Initialize the model

In [13]:
model = CodeT5()

In [14]:
!pwd

/home/somov/SBT-Deep-Learning-Seminars/Transformers


In [16]:
from pytorch_lightning import Trainer
import os
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor, ModelCheckpoint

wandb_logger = WandbLogger(name='codet5-finetune-code-summarization-ruby-shuffle', project='CodeT5')
# for early stopping, see https://pytorch-lightning.readthedocs.io/en/1.0.0/early_stopping.html?highlight=early%20stopping
early_stop_callback = EarlyStopping(
    monitor='validation_loss',
    patience=3,
    strict=False,
    verbose=False,
    mode='min'
)

data_path = '/home/somov/SBT-Deep-Learning-Seminars/Transformers'
lr_monitor = LearningRateMonitor(logging_interval='step')
checkpoint_callback = ModelCheckpoint(dirpath=os.path.join(data_path, 'finetuned_t5_model'),
                                      filename='t5-{epoch}-{validation_loss:.2f}')

trainer = Trainer(accelerator="gpu", devices=1,
                  default_root_dir=os.path.join(data_path, 'finetuned_t5_model'), 
                  logger=wandb_logger, 
                  callbacks=[early_stop_callback, lr_monitor, checkpoint_callback])
trainer.fit(model)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/somov/.conda/envs/dl_env/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py:72: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.
You are using a CUDA device ('NVIDIA A100 80GB PCIe') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 60.5 M
-----------------------------------------------------
60.5 M    Trainable params
0         Non-trainable params
60.5 M    Total params
241.969   Total estimated model params size (MB)


/home/somov/.conda/envs/dl_env/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=15` in the `DataLoader` to improve performance.
/home/somov/.conda/envs/dl_env/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=15` in the `DataLoader` to improve performance.
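
The `EarlyStopping` callback configured above (`monitor='validation_loss'`, `patience=3`, `mode='min'`) stops training once the monitored value has failed to improve for `patience` consecutive validation checks. A simplified sketch of that rule, with made-up loss values; `early_stopping_epoch` is a hypothetical helper, not Lightning's implementation:

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the index of the epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:  # mode='min': lower is better
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses, for illustration only.
losses = [2.0, 1.5, 1.4, 1.45, 1.41, 1.42, 1.3]
stop = early_stopping_epoch(losses, patience=3)
print(stop)
```

Here the best value is reached at epoch 2, the next three epochs do not beat it, and the run stops at epoch 5 before the later improvement is ever seen.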



### Save PL model and load as HF model

In [17]:
model.model.save_pretrained(os.path.join(data_path, 'finetuned_t5_model'))

In [18]:
from transformers import T5ForConditionalGeneration

hf_model = T5ForConditionalGeneration.from_pretrained(os.path.join(data_path, 'finetuned_t5_model'))

## Inference

Let's test the model.

In [19]:
from datasets import load_dataset

dataset = load_dataset("code_x_glue_ct_code_to_text", "ruby")
print(dataset['test'])

Dataset({
    features: ['id', 'repo', 'path', 'func_name', 'original_string', 'language', 'code', 'code_tokens', 'docstring', 'docstring_tokens', 'sha', 'url'],
    num_rows: 1261
})


In [25]:
test_example = dataset['test'][5]
print("Code:", test_example['code'])

Code: def glob(pattern)
      pattern = Pathname.new(pattern).cleanpath.to_s
      Dir.glob(pattern, File::FNM_DOTMATCH).sort.reject do |file|
        basename = File.basename(file)
        IGNORED_FILES.include?(basename)
      end
    end



Prepare the model input with `RobertaTokenizer` and generate with the `.generate()` method. There are many decoding methods; see [here](https://huggingface.co/blog/how-to-generate). Here we use greedy decoding.


In [26]:
# prepare for the model
# note: unlike in training, the "Summarize Ruby: " prefix is not prepended here
input_ids = tokenizer(test_example['code'], return_tensors='pt').input_ids
# generate
outputs = hf_model.generate(input_ids)
print("Generated docstring:", tokenizer.decode(outputs[0], skip_special_tokens=True))

Generated docstring: Returns a list of files matching the given pattern.

 @param [String] pattern
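
Greedy decoding, as used above, picks the single highest-scoring token at every step and feeds it back in until an end-of-sequence token appears or a length limit is hit. A minimal sketch with a made-up lookup table standing in for the model; `greedy_decode`, `toy_scores`, and `TABLE` are all hypothetical and unrelated to the `.generate()` internals.

```python
def greedy_decode(next_token_scores, start_token, eos_token, max_new_tokens=20):
    """Repeatedly append the highest-scoring next token until EOS or the length limit."""
    sequence = [start_token]
    for _ in range(max_new_tokens):
        scores = next_token_scores(sequence)  # dict: token -> score
        best = max(scores, key=scores.get)    # greedy: take the argmax
        sequence.append(best)
        if best == eos_token:
            break
    return sequence


# Hypothetical toy "model": scores depend only on the previous token.
TABLE = {
    "<s>": {"Returns": 0.6, "Gets": 0.4},
    "Returns": {"files": 0.7, "a": 0.3},
    "Gets": {"files": 1.0},
    "files": {"</s>": 0.9, "files": 0.1},
    "a": {"</s>": 1.0},
}


def toy_scores(sequence):
    return TABLE[sequence[-1]]


decoded = greedy_decode(toy_scores, "<s>", "</s>")
print(decoded)
```

Greedy decoding never revisits a choice, which is why alternatives such as beam search or sampling, covered in the linked post, can produce different outputs.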


Compare with the ground truth:

In [27]:
print("Ground truth:", test_example['docstring'])

Ground truth: Glob across the given pattern, accounting for dotfiles, removing Ruby's
 dumb idea to include +'.'+ and +'..'+ as entries.

 @param [String] pattern
   the path or glob pattern to get all files from

 @return [Array<String>]
   the list of all files
