<a href="https://colab.research.google.com/github/saurabh-singh-rajput/commit-message-generator/blob/main/commit_message_generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Automated commit message generation

In [38]:
!git clone https://github.com/saurabh-singh-rajput/commit-message-generator.git

Cloning into 'commit-message-generator'...
remote: Enumerating objects: 400, done.[K
remote: Counting objects: 100% (23/23), done.[K
remote: Compressing objects: 100% (23/23), done.[K
remote: Total 400 (delta 9), reused 0 (delta 0), pack-reused 377[K
Receiving objects: 100% (400/400), 61.30 MiB | 11.43 MiB/s, done.
Resolving deltas: 100% (124/124), done.


In [39]:
%cd commit-message-generator

/content/commit-message-generator/commit-message-generator


In [40]:
!pwd

/content/commit-message-generator/commit-message-generator


In [41]:
# !pip install -r requirements.txt

### Read filtered data

This **filtered_data.csv** is the result of passing roughly 63,000 data points through the filter proposed in this [paper](https://arxiv.org/pdf/2202.02974.pdf). The filter retains only the messages that explain both *why* a change was made and *what* was changed.

In [42]:
import pandas as pd
data = pd.read_csv("filtered_data.csv")

### Preprocess data

We read the diff and message tables separately and then combine them into one dataframe, `data`.

In [43]:
data.head()

Unnamed: 0.1,Unnamed: 0,diff,message
0,0,mmm a / README . md <nl> ppp b / README . md <...,updated read me more documentation to come !
1,1,Binary files a / bigbluebutton - client / bran...,Updated fit - to - screen icon
2,2,mmm a / util - taglib / src / com / liferay / ...,LPS - 64187 update package info
3,3,mmm a / bindings / cpp / configure . ac <nl> p...,Remove unneeded check for cppunit
4,4,mmm a / owncloud - android - library <nl> ppp ...,Updated library to fix bug in SAML authenticat...


In [44]:
# Dropping the first leftover index column ("Unnamed: 0.1" in the output above)

data.drop(data.columns[0], inplace=True, axis=1)

In [45]:
data.head()

Unnamed: 0,diff,message
0,mmm a / README . md <nl> ppp b / README . md <...,updated read me more documentation to come !
1,Binary files a / bigbluebutton - client / bran...,Updated fit - to - screen icon
2,mmm a / util - taglib / src / com / liferay / ...,LPS - 64187 update package info
3,mmm a / bindings / cpp / configure . ac <nl> p...,Remove unneeded check for cppunit
4,mmm a / owncloud - android - library <nl> ppp ...,Updated library to fix bug in SAML authenticat...


In [46]:
# Renaming the column

data.rename(columns = {'diff':'commits'}, inplace=True)

In [47]:
# Checking for any missing values

data.isna().any()

commits    False
message    False
dtype: bool

### Train and test splits

About 91.5% of the data is used for training.

About 8.5% of the data is used for validation (the code below slices with a factor of 0.085).

We have a separate set for testing.

In [48]:
val_data = data.iloc[:int(len(data)*0.085),:]
train_data = data.iloc[int(len(data)*0.085):,:]

Resetting indexes after the split

In [49]:
val_data.reset_index(drop=True, inplace=True)
train_data.reset_index(drop=True, inplace=True)

In [50]:
print("Train and validation data length",len(train_data)," ",len(val_data))

Train and validation data length 27707   2573


Now we have a training and a validation dataset: each consists of code-diff data and commit messages.
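
The split above is a plain head/tail slice with no shuffling, so the validation set is simply the first rows of the file. A minimal sketch of the same size arithmetic, assuming 30,280 rows in total (the sum of the two lengths printed above); `split_sizes` is a hypothetical helper, not part of the notebook:

```python
def split_sizes(n_rows, val_fraction=0.085):
    """Return (train_size, val_size) for a head/tail split like the one above."""
    # int() truncates, exactly as in the iloc slice bounds used in the notebook.
    val_size = int(n_rows * val_fraction)
    return n_rows - val_size, val_size


train_size, val_size = split_sizes(30280)
print(train_size, val_size)
```

If a shuffled split is preferred, the rows would need to be shuffled before slicing; that is a design choice the notebook does not make.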

### Model

Encoder-Decoder Model from Hugging Face. The inspiration for this comes from this [paper](https://arxiv.org/abs/1907.12461)

We will be using a CodeBERT-family checkpoint (`microsoft/graphcodebert-base` in the code below) as the encoder and GPT-2 as the decoder. We then fine-tune this model on our dataset of commits and corresponding messages.

Importing libraries: AutoTokenizer tokenizes the input according to the CodeBERT input format, and EncoderDecoderModel constructs the entire model.

In [51]:
# !pip install transformers[onnx]

In [52]:
from transformers import AutoTokenizer
from transformers import EncoderDecoderModel

### Initialize the model with encoder and decoder

In [53]:
model = EncoderDecoderModel.from_encoder_decoder_pretrained("microsoft/graphcodebert-base", "gpt2")

Some weights of RobertaModel were not initialized from the model checkpoint at microsoft/graphcodebert-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.6.crossattention.c_attn.weight', 'h.3.crossattention.c_proj.bias', 'h.10.crossattention.c_attn.weight', 'h.1.ln_cross_attn.bias', 'h.2.crossattention.q_attn.bias', 'h.11.crossattention.c_proj.weight', 'h.5.crossattention.c_proj.weight', 'h.5.crossattention.c_attn.bias', 'h.6.crossattention.c_proj.weight', 'h.6.crossattention.q_attn.bias', 'h.4.ln_cross_attn.bias', 'h.3.crossattention.q_attn.weight', 'h.1.crossattention.q_attn.bias', 'h.2.crossattention.c_proj.weight', 'h.4.crossattention.c_attn.weight', 'h.8.crossattention.q_attn.weight', 'h.0.crossattention.q_attn.bia

### Set the model in training mode

In [54]:
# model.train() puts the model in training mode, which enables training-only behaviour such as dropout.
# It does not update any weights by itself; the updates happen in the training loop.

model.train()

# Note that model.eval() is used during the inference and evaluation stages

EncoderDecoderModel(
  (encoder): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm):

In [55]:
# Total parameters for the model

model.num_parameters()

277452288

### Load codebert's tokenizer

In [56]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

1. Setting cls_token : To indicate start of sentence
2. Setting sep_token: To indicate end of sentence

#### The imports are for the following reasons:

1. The datasets library is used to transform the pandas dataframe into a pyarrow-backed dataset (convenient to use with transformers)
2. The rouge_score and evaluate libraries provide the evaluation metrics

In [57]:
# Importing the libraries

import datasets
import evaluate

In [58]:
# pip install rouge_score

In [59]:
# Loading the metric from huggingface datasets library to evaluate the model performance

rouge = datasets.load_metric("rouge")
bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


We define a function to compute the ROUGE score and the BLEU variants. This is optional, but it gives us better evaluation criteria for the model.

In [60]:
def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    # all unnecessary tokens are removed
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid
    bleu1 = bleu.compute(predictions=pred_str, references=label_str,max_order=1)
    bleu2 = bleu.compute(predictions=pred_str, references=label_str, max_order=2)
    bleu3 = bleu.compute(predictions=pred_str, references=label_str, max_order=3)
    bleu4 = bleu.compute(predictions=pred_str, references=label_str, max_order=4)


    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
        "bleu1_score":round(bleu1["bleu"],4),
        "bleu2_score":round(bleu2["bleu"],4),
        "bleu3_score":round(bleu3["bleu"],4),
        "bleu4_score":round(bleu4["bleu"],4)
        }
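
To make the ROUGE-2 numbers concrete, here is a simplified sketch of bigram precision, recall and F-measure for a single prediction/reference pair. It assumes whitespace tokenization and lower-casing only, so it will not reproduce the `rouge` metric exactly (the library applies its own tokenization and aggregates over the whole batch). `bigram_counts` and `rouge2_sketch` are illustrative names, not library functions.

```python
from collections import Counter


def bigram_counts(text):
    """Count the adjacent word pairs in a whitespace-tokenized string."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_sketch(prediction, reference):
    """Return (precision, recall, fmeasure) based on overlapping bigrams."""
    pred, ref = bigram_counts(prediction), bigram_counts(reference)
    overlap = sum((pred & ref).values())  # bigrams present in both strings
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return precision, recall, fmeasure


scores = rouge2_sketch("fix null check in parser", "fix null check in lexer")
print(scores)
```

Three of the four bigrams match in each direction, so precision, recall and F-measure all come out at 0.75 for this pair.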

We define a function to process the input data : code-diff files.

The following are the steps involved:

1. It takes an input batch of 16 examples from the dataset (choose the batch size to fit the available RAM).
2. The code diffs and messages are tokenized using the CodeBERT tokenizer.
3. The labels are a copy of the tokenized commit-message ids.
4. Pad-token ids are set to -100 so that they are ignored when the loss is computed during training and evaluation.

Things to note: code diffs are padded or truncated to 123 tokens and commit messages to 50 tokens (`decoder_max_length` in the code below). The values are based on the maximum number of tokens present in the set.
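
Step 4 can be sketched on a single padded label sequence. The pad id of 1 is an assumption for illustration (the real value comes from `tokenizer.pad_token_id`), the token ids are made up, and `mask_padding` is a hypothetical helper:

```python
PAD_ID = 1           # assumed pad id for illustration; use tokenizer.pad_token_id in practice
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def mask_padding(label_ids, pad_id=PAD_ID):
    """Replace padding positions so the loss skips them."""
    return [IGNORE_INDEX if token == pad_id else token for token in label_ids]


padded_labels = [0, 2215, 4025, 2, 1, 1, 1]  # made-up ids followed by padding
print(mask_padding(padded_labels))
```

Only the padding positions change; the real tokens, including the start and end markers, still contribute to the loss.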

**Hyper-parameters :**

1. batch_size
2. encoder_max_length
3. decoder_max_length

In [61]:
batch_size=16
encoder_max_length=123
decoder_max_length=50

def process_data_to_model_inputs(batch):
  # tokenize the inputs and labels
  inputs = tokenizer(batch["commits"], padding="max_length", truncation=True, max_length=encoder_max_length)
  outputs = tokenizer(batch["message"], padding="max_length", truncation=True, max_length=decoder_max_length)

  batch["input_ids"] = inputs.input_ids
  batch["attention_mask"] = inputs.attention_mask

  # This was required in earlier versions, but the model now builds the decoder inputs internally from the labels, so we don't have to provide them
  # batch["decoder_input_ids"] = outputs.input_ids
  # batch["decoder_attention_mask"] = outputs.attention_mask

  batch["labels"] = outputs.input_ids.copy()

  # because the encoder-decoder model automatically shifts the labels, the labels correspond exactly to `decoder_input_ids`.
  # We have to make sure that the PAD token is ignored
  batch["labels"] = [[-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"]]

  return batch

### Set the model configuration

This entirely depends on the type of encoder and decoder

In [62]:
# set special tokens
# These are set on the config of the whole encoder-decoder model
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# sensible parameters for beam search
model.config.vocab_size = model.config.decoder.vocab_size # Setting the model's vocabulary size to GPT-2's vocabulary size (~50K)
model.config.max_length = 64 # Maximum length of message generated
model.config.min_length = 5 # Minimum length of message generated
model.config.no_repeat_ngram_size = 3 # No sequence of three tokens (3-gram) may appear more than once in a generated commit message
model.config.early_stopping = True
model.config.length_penalty = 2.0
model.config.num_beams = 4 # Beam search keeps the 4 best partial sequences at each step, instead of only the single best token as in greedy search
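
`no_repeat_ngram_size = 3` constrains three-token sequences, not single words: generation is prevented from emitting any 3-gram that already appears in the output. A minimal sketch of that condition, checked on a finished sequence (whitespace-split words stand in for token ids; `has_repeated_ngram` is a hypothetical helper, not a `transformers` function):

```python
def has_repeated_ngram(tokens, n=3):
    """Return True if any n-gram occurs more than once in tokens."""
    seen = set()
    for start in range(len(tokens) - n + 1):
        gram = tuple(tokens[start:start + n])
        if gram in seen:
            return True
        seen.add(gram)
    return False


# "fix bug in" occurs twice, so this output would violate the constraint.
print(has_repeated_ngram("fix bug in parser fix bug in lexer".split()))
# A single word may repeat as long as no 3-gram does.
print(has_repeated_ngram("update docs and update tests".split()))
```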

### Dataset restructuring

We do this to convert our pandas dataframe to a pyarrow dataset. Pyarrow integrates well with the Hugging Face datasets library, and we can use the map function to transform the whole dataset with a function we provide.

Overall, we just write a preprocessing function and use the map function to extract all the input_ids and attention_masks.

In [63]:
import pyarrow as pa
from datasets import Dataset

In [64]:
train_set = Dataset(pa.Table.from_pandas(train_data))

In [65]:
val_set = Dataset(pa.Table.from_pandas(val_data))

### Data mapping and batch conversions according to the function
This section takes the data and transforms it with the function passed as the first argument. We also define the batch size, which is a hyperparameter.

In [66]:
train_dataset = train_set.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=["commits", "message"]
)


train_dataset.set_format(
    type="torch", columns=["input_ids", "attention_mask", "labels"],
)

Map:   0%|          | 0/27707 [00:00<?, ? examples/s]

In [67]:
val_dataset = val_set.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=["commits", "message"]
)


val_dataset.set_format(
    type="torch", columns=["input_ids", "attention_mask", "labels"],
)

Map:   0%|          | 0/2573 [00:00<?, ? examples/s]

Now, we have all the input_ids, attention_masks and labels for train and validation set
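
As a rough mental model of what `map(..., batched=True)` does, the sketch below groups rows into column-wise batches (a dict of lists), calls the function on each batch, and collects the returned columns. It is a plain-Python imitation for illustration only, not how the `datasets` library is implemented; `batched_map` and `add_lengths` are hypothetical names.

```python
def batched_map(rows, fn, batch_size):
    """Apply fn to column-wise batches and return the results row by row."""
    mapped = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into one dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = fn(batch)
        for i in range(len(chunk)):
            mapped.append({key: values[i] for key, values in result.items()})
    return mapped


def add_lengths(batch):
    """Example batch function: add a word count for every message."""
    batch["n_words"] = [len(text.split()) for text in batch["message"]]
    return batch


rows = [
    {"message": "fix typo"},
    {"message": "update package info"},
    {"message": "remove unneeded check"},
]
mapped = batched_map(rows, add_lengths, batch_size=2)
print([row["n_words"] for row in mapped])
```

The key point is that the function receives lists, one entry per example, which is why `process_data_to_model_inputs` can hand a whole list of strings to the tokenizer at once.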

### We are using a trainer from Hugging Face
We are using **Seq2SeqTrainingArguments** and **Seq2SeqTrainer** from Hugging Face.

The **training_args** variable holds the hyperparameters. Feel free to explore all the hyperparameters on the Hugging Face website.

The trainer variable initializes the trainer; we pass it the model, the datasets and the metric function that we defined.

In [68]:
# !pip install transformers[torch]

In [69]:
# !pip install accelerate -U

In [70]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    overwrite_output_dir=True, # Overwrites previously saved model
    learning_rate=5e-5,
    evaluation_strategy="steps",
    logging_steps=1_000,
    # Batch size for train and validation.
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    output_dir = "./", # Saves the model to current directory
    save_steps=100, # Save the model to output directory after every 100 steps
    eval_steps=300,
    warmup_steps=200,
    save_total_limit=1, # Keeps only the most recent checkpoint.
    num_train_epochs=0.2, # 0.2 epochs for a quick run. Increase for real training
)


trainer = Seq2SeqTrainer(
    model=model, # Model
    tokenizer=tokenizer, # Tokenizer
    args=training_args, # Hyperparameter arguments
    compute_metrics=compute_metrics, # Not mandatory (manually defined metric function)
    # Tokenized datasets. Passing the raw pandas dataframes (train_data, val_data)
    # here makes trainer.train() fail with a KeyError.
    train_dataset=train_dataset,
    eval_dataset=val_dataset
    )

In [71]:
# call .train on trainer to start training.

trainer.train()
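
A back-of-the-envelope estimate of how many optimizer steps these settings produce. It assumes a single device, no gradient accumulation and that the last partial batch is kept, so the trainer's own count may differ slightly:

```python
import math

train_rows = 27707      # training set length printed earlier
batch_size = 16         # per_device_train_batch_size
num_train_epochs = 0.2
save_steps = 100

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = math.ceil(num_train_epochs * steps_per_epoch)
# Last multiple of save_steps that is reached during training.
last_step_checkpoint = (total_steps // save_steps) * save_steps

print(steps_per_epoch, total_steps, last_step_checkpoint)
```

Under these assumptions, training stops after roughly 350 steps, so evaluation at `eval_steps=300` runs once, and the last step-based checkpoint is at step 300, which is consistent with the `checkpoint-300` directory loaded in the testing section.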

## Testing the trained model

This follows similar steps to the previous section, except that we are not training the model; we only use the trained model to produce predictions and compare them with the reference messages.

### Preprocessing

In [None]:
diff = pd.read_csv("data/test.3000.diff", sep="/n")

In [None]:
mess = pd.read_csv("data/test.3000.msg", sep="/n")

In [None]:
diff["commits"] = diff[diff.columns[0]]
mess["message"] = mess[mess.columns[0]]

In [None]:
diff.drop(diff.columns[0], inplace=True, axis=1)
mess.drop(mess.columns[0], inplace=True, axis=1)
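
Note that `read_csv` without `header=None` treats the first line of each file as the column name, so the cells above work with every line except the first. If the test files hold one sample per line (an assumption suggested by the `sep` argument above), they can also be read with plain Python, which keeps every line. A minimal sketch using small made-up stand-in files; `read_lines` is a hypothetical helper:

```python
import os
import tempfile


def read_lines(path):
    """Return one string per line, without the trailing newline."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


# Tiny stand-in files so the sketch runs on its own (contents are made up).
tmp_dir = tempfile.mkdtemp()
diff_path = os.path.join(tmp_dir, "sample.diff")
msg_path = os.path.join(tmp_dir, "sample.msg")
with open(diff_path, "w", encoding="utf-8") as handle:
    handle.write("mmm a / README . md <nl> ppp b / README . md\nmmm a / setup . py\n")
with open(msg_path, "w", encoding="utf-8") as handle:
    handle.write("update readme\nbump version\n")

commits = read_lines(diff_path)
messages = read_lines(msg_path)
print(len(commits), len(messages))
```

The resulting lists can be wrapped in a dataframe, for example `pd.DataFrame({"commits": commits})`, to continue with the steps below.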

In [None]:
diff["commits"][6]

### Converting to pyarrow dataset

In [None]:
val = Dataset(pa.Table.from_pandas(diff))

In [None]:
import datasets
from transformers import AutoTokenizer, EncoderDecoderModel

In [None]:
import torch

tokenizer = AutoTokenizer.from_pretrained("checkpoint-300") # The checkpoint is saved in the current directory
model = EncoderDecoderModel.from_pretrained("checkpoint-300")

# Load the model onto the GPU if one is available, otherwise keep it on the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

In [None]:
def generate_summary(batch):
    # Tokenizer will automatically set [BOS] <text> [EOS]
    inputs = tokenizer(batch["commits"], padding="max_length", truncation=True, max_length=123, return_tensors="pt")
    # Move the inputs to the same device as the model
    input_ids = inputs["input_ids"].to(model.device)
    attention_mask = inputs["attention_mask"].to(model.device)

    outputs = model.generate(input_ids, attention_mask=attention_mask)

    # all special tokens, including padding, will be removed
    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    batch["pred"] = output_str

    return batch

Increase the batch size if you want faster computation of the input ids and attention_mask.

In [None]:
results = val.map(generate_summary, batched=True, batch_size=16)

In [None]:
pred_str = results["pred"]
label_str = mess["message"].tolist()

rouge_output = rouge.compute(predictions=pred_str, references=label_str)
bleu1 = bleu.compute(predictions=pred_str, references=label_str,max_order=1)
bleu2 = bleu.compute(predictions=pred_str, references=label_str, max_order=2)
bleu3 = bleu.compute(predictions=pred_str, references=label_str, max_order=3)
bleu4 = bleu.compute(predictions=pred_str, references=label_str, max_order=4)
# Use a separate name so the meteor metric object is not overwritten by its result
meteor_output = meteor.compute(predictions=pred_str, references=label_str)

print("Rouge: ",rouge_output)
print("Bleu-1: ", bleu1["bleu"])
print("Bleu-2: ",bleu2["bleu"])
print("Bleu-3: ",bleu3["bleu"])
print("Bleu-4: ", bleu4["bleu"])
print("Meteor: ",meteor_output["meteor"])
#print(bert_score)


In [None]:
rouge_output["rougeL"].mid