# Assignment 2

# Introduction

- In this assignment, I used the IndicTrans2 model and evaluated its translations with the BLEU and ROUGE metrics.

- For this task, I used the IN22-GEN test dataset, translating from the source language Assamese (asm_Beng) to the target language Bengali (ben_Beng).

# Dataset

- The dataset is the __IN22-GEN__ test set, from which I used the files __test.asm_Beng__ and __test.ben_Beng__.
- The dataset contains 1024 sentences in each of its 22 languages. Translation was run in two instances, with a batch size of 4 and of 16.
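
As a quick check on the batch sizes, the number of forward passes needed to translate all 1024 sentences can be computed directly. This is only an arithmetic sketch; `num_batches` is a hypothetical helper name and is not part of IndicTrans2.

```python
import math


def num_batches(num_sentences, batch_size):
    """Number of batches needed when the last batch may be partially filled."""
    return math.ceil(num_sentences / batch_size)


NUM_SENTENCES = 1024  # sentences per language, as stated above

for batch_size in (4, 16):
    print(batch_size, num_batches(NUM_SENTENCES, batch_size))
```

A larger batch size means fewer calls to the model, at the cost of more GPU memory per call.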


# Evaluation Metrics

## __BLEU metric__
It is an evaluation metric for machine translation, that is, for text translated automatically from one language to another.

__Working__: The score measures the similarity between the predicted and target text using n-grams, which are contiguous sequences of n words. The n-gram precision is then multiplied by a brevity penalty to penalize translations that are shorter than the target translations.

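__Formula__:

$$ BLEU = Brevity \;\; Penalty \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right) $$

Where:
- Brevity Penalty lowers the score for translations that are shorter than the reference translations.
- $p_n$ is the modified (clipped) n-gram precision and $w_n$ is its weight, usually $1/N$ with $N = 4$.

- The BLEU score ranges from 0 to 1, with higher values indicating a better translation.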
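
To make the definition concrete, here is a minimal sentence-level sketch of the standard BLEU computation: the geometric mean of clipped n-gram precisions for n = 1 to 4, multiplied by the brevity penalty. It assumes a single reference, uniform weights and no smoothing, so it is not the exact implementation behind the `bleu` metric used later. The function names are my own.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU for one candidate and one reference (token lists)."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, one zero precision zeroes the score
    log_avg = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_avg)


reference = "the cat sat on the mat".split()
print(sentence_bleu(reference, reference))             # identical sentences
print(sentence_bleu("the cat sat on".split(), reference))  # shorter candidate is penalized
```

In the second call every n-gram of the candidate appears in the reference, so the precisions are all 1 and the score is reduced only by the brevity penalty.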

## __ROUGE scores__
It is a set of metrics designed for text summarization tasks, where the goal is to generate a summary of a longer text. It evaluates the quality of predicted summaries by comparing them to target summaries.

__Working__

It measures the similarity between the predicted summary and the target summaries using the n-grams that overlap between them. The most commonly used n-grams are unigrams, bigrams, and trigrams. It calculates the recall of n-grams, that is, how many n-grams of the target summaries also appear in the predicted summary.

__Formula__:

$$ ROUGE\text{-}N = \frac{\sum_{gram_n \in Target} Count_{match}(gram_n)}{\sum_{gram_n \in Target} Count(gram_n)} $$

Where:

- Recall of n-grams: the number of n-grams that appear in both the predicted summary and the target summary ($Count_{match}$), divided by the total number of n-grams in the target summary.

ROUGE score ranges from 0 to 1, with larger values showing better summary quality.
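
The recall described above can be sketched in a few lines. This is a simplified ROUGE-N recall for one prediction and one target, using whitespace tokens. It is an illustration only, and the function names are hypothetical rather than taken from any ROUGE library.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(prediction, target, n=1):
    """Fraction of the target's n-grams that also occur in the prediction."""
    pred_counts = ngram_counts(prediction, n)
    target_counts = ngram_counts(target, n)
    total = sum(target_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, pred_counts[gram]) for gram, count in target_counts.items())
    return overlap / total


target = "the cat sat on the mat".split()
prediction = "the cat lay on the mat".split()
print(rouge_n_recall(prediction, target, n=1))  # unigram recall
print(rouge_n_recall(prediction, target, n=2))  # bigram recall
```

Here 5 of the 6 target unigrams and 3 of the 5 target bigrams are recovered by the prediction.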

# Approach

I performed the generation and evaluation using the following steps:
1. I followed the steps for using the IndicTrans2 model from the previous lab session and used the pre-built method, changing the batch size from 4 to 16.
2. I selected the T4 GPU as the run-time environment.
3. I used Assamese and Bengali, whose test files I uploaded to Google Colab.
4. Next, I loaded the tokenizer and the model and generated the predicted translations.
5. Finally, I used the _BLEU_ and _ROUGE_ metrics for evaluation.

# Code

## Necessary Step

Please run the cells below to install the necessary dependencies.

In [None]:
%%capture
# Clone the required Git repository for IndicTrans2
# (the %%capture cell magic must be the first line of the cell)
!git clone https://github.com/AI4Bharat/IndicTrans2.git

In [None]:
%%capture
# Move into the Hugging Face interface directory of the cloned repository
%cd /content/IndicTrans2/huggingface_interface

In [None]:
%%capture
# Install the other essential dependencies for the transformer
# (the version specifier is quoted so the shell does not treat ">" as a redirect)
!python3 -m pip install nltk sacremoses pandas regex mock "transformers>=4.33.2" mosestokenizer
!python3 -c "import nltk; nltk.download('punkt')"
!python3 -m pip install bitsandbytes scipy accelerate datasets
!python3 -m pip install sentencepiece
!python3 -m pip install torchmetrics datasets

!git clone https://github.com/VarunGumma/IndicTransTokenizer
%cd IndicTransTokenizer
!python3 -m pip install --editable ./
%cd ..

Restart your run-time first and then run the cells below.

## Working for Transformer

1. Import the required modules:
  * transformers
  * torch
  * AutoModelForSeq2SeqLM from transformers
  * BitsAndBytesConfig from transformers
  * IndicProcessor from IndicTransTokenizer
  * IndicTransTokenizer from IndicTransTokenizer

2. Set the batch size to 8. Next, create a variable DEVICE and set it to "cuda" if torch.cuda.is_available(), otherwise to "cpu". Finally, set Quantization to None.

3. Two functions are defined:
    * The first function initializes the model and the tokenizer and returns both.
    * The second function translates the input sentences batch by batch.


In [None]:
import torch
from transformers import BitsAndBytesConfig, AutoModelForSeq2SeqLM
from IndicTransTokenizer import IndicProcessor, IndicTransTokenizer
import transformers as trf

In [None]:
# Set the variables
BATCH_SIZE = 8
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
Quantization = None


### Model initializer and tokenizer function.


The function initialize_model_and_tokenizer takes 3 arguments: ckpt_dir, direction, and quantization.

It first creates the tokenizer for the given direction.

Next, it creates the model with AutoModelForSeq2SeqLM, loading the pretrained model from the checkpoint directory.

In [None]:
# Create a function initialize_model_and_tokenizer which takes 3 arguments: ckpt_dir, direction, quantization.
def initialize_model_and_tokenizer(ckpt_dir, direction, quantization):
    # Build the quantization config; any value other than '4-bit' or '8-bit' disables quantization.
    if quantization == '4-bit':
        qconfig = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
    elif quantization == '8-bit':
        qconfig = BitsAndBytesConfig(
            load_in_8bit=True,
            bnb_8bit_use_double_quant=True,
            bnb_8bit_compute_dtype=torch.bfloat16,
        )
    else:
        qconfig = None

    # Create the tokenizer as an IndicTransTokenizer for the given direction.
    tokenizer = IndicTransTokenizer(direction=direction)

    # Load the pretrained model with trust_remote_code=True, low_cpu_mem_usage=True and quantization_config=qconfig.
    model = AutoModelForSeq2SeqLM.from_pretrained(
        ckpt_dir,
        trust_remote_code=True,
        low_cpu_mem_usage=True,
        quantization_config=qconfig,
    )

    # If no quantization is used, move the model to the device (in half precision on GPU).
    if qconfig is None:
        model = model.to(DEVICE)
        if DEVICE == "cuda":
            model.half()

    # Switch to inference mode
    model.eval()
    # Return both the tokenizer and the model
    return tokenizer, model


## Helper Function to get translation

In [None]:
def batch_translate(input_sentences, src_lang, tgt_lang, model, tokenizer, ip):
    translations = []
    for i in range(0, len(input_sentences), BATCH_SIZE):
        batch = input_sentences[i : i + BATCH_SIZE]

        # Preprocess the batch and extract entity mappings
        batch = ip.preprocess_batch(batch, src_lang=src_lang, tgt_lang=tgt_lang)

        # Tokenize the batch and generate input encodings
        inputs = tokenizer(
            batch,
            src=True,
            truncation=True,
            padding="longest",
            return_tensors="pt",
            return_attention_mask=True,
        ).to(DEVICE)

        # Generate translations using the model
        with torch.no_grad():
            generated_tokens = model.generate(
                **inputs,
                use_cache=True,
                min_length=0,
                max_length=256,
                num_beams=5,
                num_return_sequences=1,
            )

        # Decode the generated tokens into text
        generated_tokens = tokenizer.batch_decode(generated_tokens.detach().cpu().tolist(), src=False)

        # Postprocess the translations, including entity replacement
        translations += ip.postprocess_batch(generated_tokens, lang=tgt_lang)

        del inputs
        torch.cuda.empty_cache()

    return translations

## Languages and their codes

Finally, we join all the functions and datasets together to create our own predictions.

Here are the languages used in this assignment, with their IndicTrans2 codes:

| Language                       | Code      |
|--------------------------------|-----------|
| Assamese                       | asm_Beng  |
| Bengali                        | ben_Beng  |

## Importing the Dataset

In [None]:
# Read the Assamese source sentences (UTF-8, one sentence per line)
with open('/content/test.asm_Beng', 'r', encoding='utf-8') as f:
    input_assamese = [line.strip() for line in f]

# Read the Bengali reference sentences
with open('/content/test.ben_Beng', 'r', encoding='utf-8') as f:
    target_bengali = [line.strip() for line in f]

input_assamese[:5]

In [None]:
target_bengali[:5]

## Assamese to Bengali conversion

In [None]:
AI4_BHARAT = "ai4bharat/indictrans2-indic-indic-1B"
my_direction = "/content/IndicTrans2/huggingface_interface/IndicTransTokenizer/IndicTransTokenizer/indic-indic/"

# Get the tokenizer and model by passing the arguments to initialize_model_and_tokenizer
Tokenizer, model = initialize_model_and_tokenizer(ckpt_dir=AI4_BHARAT, direction=my_direction, quantization="4-bit")

indic = IndicProcessor(inference=True)

# Choose the source language as Assamese and the target language as Bengali.
lan_src = "asm_Beng"
lan_tar = "ben_Beng"

# Find the predicted translation using the batch_translate function with arguments: input_sentences, src_lang, tgt_lang, model, tokenizer, ip
pred_bengali = batch_translate(input_assamese, lan_src, lan_tar, model, Tokenizer, indic)

# Free the tokenizer and model once the predictions are available
del Tokenizer, model

In [None]:
pred_bengali[:5]

## Evaluation Metrics

In [None]:
from datasets import load_metric

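# Tokenize predictions and references the same way (on any whitespace);
# each prediction gets a list of references, here with a single entry
pred_new = [sen.split() for sen in pred_bengali]
tar_new = [[sen.split()] for sen in target_bengali]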

metric= load_metric("bleu")
metric.add_batch(predictions= pred_new, references=tar_new)
metric.compute()

In [None]:
from datasets import load_metric

metric= load_metric("rouge")
metric.add_batch(predictions= pred_bengali, references= target_bengali)
metric.compute()
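
ROUGE implementations tokenize the text before counting n-grams, so it may be worth checking how the tokenizer handles Bengali script. Some tokenizers keep only ASCII letters and digits. With such a tokenizer Bengali sentences yield no tokens at all, and the score drops to zero regardless of translation quality. The sketch below illustrates the effect with a hypothetical `ascii_only_tokenize`. This is my own stand-in and an assumption about how such a tokenizer might behave, not the code of the `rouge` metric loaded above.

```python
import re


def ascii_only_tokenize(text):
    """Hypothetical tokenizer that keeps only lowercase ASCII letters and digits."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return cleaned.split()


def whitespace_tokenize(text):
    """Script-agnostic alternative: split on any whitespace."""
    return text.split()


english = "I eat rice"
bengali = "আমি ভাত খাই"

print(ascii_only_tokenize(english))   # English words survive
print(ascii_only_tokenize(bengali))   # Bengali words are removed entirely
print(whitespace_tokenize(bengali))   # whitespace splitting keeps them
```

If the metric in use behaves like the first function, supplying a whitespace-based tokenizer or computing the n-gram overlap on pre-split tokens would give more meaningful scores for Assamese and Bengali.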

# Result

These are the observations and inferences I drew while working on the problem.
  - There is a slight difference in structure between the predicted translations and the target translations.
  - The evaluation metrics indicated very poor performance of the model for translation from Assamese to Bengali.
  - This result may have several causes, such as the difference in how words are structured in the predictions and the targets.
  - A better approach could use a different kind of preprocessing, which may improve the measured performance of the model.
  - Fine-tuning the model may also increase the scores.