<a href="https://colab.research.google.com/github/shahidul034/Data-Structures-and-Algorithm-Tutorial/blob/main/contest_bn_to_rm_translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install Python Libraries

In [None]:
!pip install datasets evaluate transformers[sentencepiece] -q
!pip install accelerate -q
!apt install git-lfs -q
!pip install sacrebleu -q

## Managing Warnings and Disabling WandB Integration

In [None]:
import warnings
import os
warnings.filterwarnings("ignore")
os.environ["WANDB_DISABLED"] = "true"

**Understand the Dataset Structure**:  
   Begin by inspecting the dataset to understand its structure and contents. This includes checking:
   - Column names (`features`)
   - Data types
   - Number of rows (`num_rows`)

In [None]:
from datasets import load_dataset
raw_datasets = load_dataset("SKNahin/bengali-transliteration-data")
raw_datasets

**Preprocessing the Data**:  
   Write code to clean and prepare the dataset for modeling:
   - Handle missing values (e.g., fill, drop, or impute missing values).
   - Normalize or scale features if required.
   - Encode categorical data if applicable.
   - Check for and handle duplicates or outliers.
   - Perform feature selection if necessary.
   - Remove stop words if appropriate for the task.
   - Apply stemming or lemmatization if appropriate for the task.

In [None]:
raw_datasets= ...
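
As a rough illustration of the cleaning steps listed above, the sketch below works on plain Python dictionaries rather than on the loaded dataset. The column names `bn` and `rm` are assumptions about the dataset's schema (inspect `raw_datasets` for the real names), and `clean_pairs` is a hypothetical helper, not part of any library. Stop-word removal and stemming are left out on purpose, since for a sequence-to-sequence task they would likely change the text the model is supposed to reproduce.

```python
import unicodedata


def clean_pairs(rows, source_column="bn", target_column="rm"):
    """Drop incomplete rows, normalize whitespace/Unicode, and remove duplicates.

    `source_column` and `target_column` are assumed names for illustration.
    """
    seen = set()
    cleaned = []
    for row in rows:
        source, target = row.get(source_column), row.get(target_column)
        # Handle missing values: skip rows where either side is None or empty.
        if not source or not target:
            continue
        # Normalize Unicode to NFC and collapse repeated whitespace.
        source = " ".join(unicodedata.normalize("NFC", source).split())
        target = " ".join(unicodedata.normalize("NFC", target).split())
        if not source or not target:
            continue
        # Handle duplicates: keep only the first copy of each pair.
        if (source, target) in seen:
            continue
        seen.add((source, target))
        cleaned.append({source_column: source, target_column: target})
    return cleaned


rows = [
    {"bn": "  আমি  ভালো আছি ", "rm": "ami valo achi"},
    {"bn": "আমি ভালো আছি", "rm": "ami valo achi"},  # duplicate after cleaning
    {"bn": None, "rm": "missing source"},            # missing value
    {"bn": "ধন্যবাদ", "rm": ""},                      # empty target
]
cleaned_rows = clean_pairs(rows)
print(cleaned_rows)
```

The same logic could be applied to a `datasets.Dataset` through its `filter` and `map` methods; that wiring is not shown here.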

**Splitting the Dataset**:  
   - Use the `.train_test_split()` function (or similar) to divide the dataset into training and validation sets.
   - Specify the `train_size` and use a consistent `seed` for reproducibility.
   - Rename the test split to "validation" for clarity if needed.

In [None]:
split_datasets = raw_datasets["train"].train_test_split(train_size='', seed='')
split_datasets

### Selecting a Model and Tokenizer

1. **Understand the Task**:  
   Before selecting a model, identify the task you are solving (e.g., translation, summarization, text classification). In this example, the task is **transliteration** of Bengali (`bn`) text into romanized Bengali (`rm`), which can be treated as a sequence-to-sequence translation problem.

2. **Select a Model Checkpoint**:  
   - Choose a pre-trained model checkpoint suitable for your task from the Hugging Face Model Hub.  
   - Look for a model trained on your desired source and target languages or specific domains.  

3. **Initialize the Tokenizer**:  
   - Load the tokenizer corresponding to your selected model.  

4. **Load the Model**:  
   - Verify the model compatibility with your task (e.g., sequence-to-sequence models for translation).


---

Explore the [Hugging Face Model Hub](https://huggingface.co/models) to find the most suitable model for your specific use case.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_checkpoint = ""  # fill in the name of a seq2seq checkpoint from the Hub
# return_tensors belongs to the tokenizer call, not to from_pretrained
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)


1. **Understand the Preprocessing Requirements**:  
   - Seq2Seq models, such as translation models, require input data (source text) and target data (target text).  
   - Inputs and targets must be tokenized and truncated to a fixed `max_length` to fit the model's requirements.

2. **Define the Maximum Sequence Length**:  
   - Set a `max_length` parameter (e.g., 64) to limit the tokenized sequences for both inputs and targets.  
   - This ensures that overly long sequences do not cause memory issues during training.

3. **Create a Preprocessing Function**:  
   - Write a function that:
     - Reads the source and target columns from the dataset.  
     - Tokenizes the source and target texts.  
     - Truncates sequences that exceed the maximum length.  

In [None]:
max_length = ""
def preprocess_function(examples):
    pass
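
One possible shape for the preprocessing function is sketched below. It assumes the source and target columns are named `bn` and `rm`, and it relies on the `text_target` argument that Hugging Face tokenizers accept for tokenizing labels. So that the example runs without downloading a checkpoint, it uses `ToyTokenizer`, a hypothetical stand-in that only mimics the tokenizer's call signature; `make_preprocess_function` is likewise a helper invented for this sketch.

```python
def make_preprocess_function(tokenizer, max_length, source_column="bn", target_column="rm"):
    """Build a batched preprocessing function around a tokenizer.

    Column names are assumptions; adjust them to the dataset's features.
    """

    def preprocess_function(examples):
        # Tokenize sources as inputs and targets as labels, truncating both.
        return tokenizer(
            examples[source_column],
            text_target=examples[target_column],
            max_length=max_length,
            truncation=True,
        )

    return preprocess_function


class ToyTokenizer:
    """Hypothetical stand-in: each word becomes its length in characters."""

    def __call__(self, texts, text_target=None, max_length=None, truncation=False):
        def encode(batch):
            ids = [[len(word) for word in text.split()] for text in batch]
            return [seq[:max_length] for seq in ids] if truncation else ids

        encoded = {"input_ids": encode(texts)}
        if text_target is not None:
            encoded["labels"] = encode(text_target)
        return encoded


preprocess_function = make_preprocess_function(ToyTokenizer(), max_length=3)
toy_batch = {"bn": ["আমি ভালো আছি আজ"], "rm": ["ami valo achi aj"]}
encoded = preprocess_function(toy_batch)
print(encoded)
```

With a real tokenizer, the same function could be passed to `split_datasets.map(..., batched=True)`, typically with `remove_columns` set to the original text columns.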

## Tokenizing and Preparing the Dataset

In [None]:
tokenized_datasets = split_datasets.map(
    ## write your code
)

## Setting Up a Data Collator to Dynamically Pad the Sequences in Each Batch

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(...)
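
To make "dynamic padding" concrete, here is a plain-Python sketch of what a seq2seq collator does to one batch. It is not the library's implementation: `pad_batch` is a hypothetical helper and `pad_token_id=0` is an assumed value. The label padding value `-100` matches the default `label_pad_token_id` of `DataCollatorForSeq2Seq`, which the loss function ignores.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad every example to the longest sequence in this batch only."""
    max_input_len = max(len(f["input_ids"]) for f in features)
    max_label_len = max(len(f["labels"]) for f in features)

    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_input_pad = max_input_len - len(f["input_ids"])
        n_label_pad = max_label_len - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_input_pad)
        # The attention mask is 1 for real tokens and 0 for padding.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_input_pad)
        # Labels are padded with -100 so padded positions do not affect the loss.
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_label_pad)
    return batch


padded = pad_batch(
    [
        {"input_ids": [5, 6, 7], "labels": [8, 9]},
        {"input_ids": [5], "labels": [8, 9, 10, 11]},
    ]
)
print(padded)
```

Because the padded length depends on each batch, short batches stay short, which saves memory compared with padding everything to `max_length`.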

## Loading and Using Evaluation Metrics

In [None]:
import evaluate

metric = evaluate.load('')

## Select the evaluation metrics for your task
Here are some common evaluation metrics used in Natural Language Processing (NLP):

1. **Accuracy**: Measures the overall correctness of predictions.
2. **Precision**: The proportion of correct positive predictions out of all positive predictions made.
3. **Recall**: The proportion of true positive instances identified out of all actual positives.
4. **F1 Score**: The harmonic mean of precision and recall, providing a balanced evaluation metric.
5. **BLEU (Bilingual Evaluation Understudy)**: Used for evaluating the quality of machine-translated text against one or more reference translations.
6. **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**: Measures the overlap of n-grams between the system and reference summaries, commonly used for summarization tasks.
7. **METEOR (Metric for Evaluation of Translation with Explicit ORdering)**: Considers synonyms, stemming, and word order for evaluating machine translation.
8. **chrF++**: A character n-gram F-score metric, extended with word n-grams, used for evaluating translation quality.
9. **Perplexity**: Measures how well a probability model predicts a sample, often used in language modeling.
10. **BERTScore**: Uses BERT embeddings to evaluate the similarity between predicted and reference texts.

These metrics help assess the performance and effectiveness of NLP models across various tasks.
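
As a small worked example of the first few definitions, the sketch below computes precision, recall, and F1 from raw counts. The counts are made up for illustration, and the helper assumes none of the denominators is zero.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and their harmonic mean (F1) from counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts: 8 correct positives, 2 false alarms, 4 misses.
precision, recall, f1 = precision_recall_f1(8, 2, 4)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```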

https://huggingface.co/evaluate-metric

https://huggingface.co/docs/evaluate/en/choosing_a_metric


In [None]:
import numpy as np


def compute_metrics(eval_preds):
    pass
    ## write your code
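
A common pattern for `compute_metrics` in translation-style tasks is sketched below: replace the `-100` label padding with the pad token, decode predictions and labels to text, and pass them to the metric. To keep the example runnable on its own, `ToyTokenizer` and `ToyExactMatch` are hypothetical stand-ins that only mimic the `batch_decode` and `compute` interfaces; with real objects the metric might be `evaluate.load("sacrebleu")`, whose result contains a `"score"` entry. `make_compute_metrics` is a helper invented for this sketch.

```python
def make_compute_metrics(tokenizer, metric, label_pad_token_id=-100):
    """Build a compute_metrics function around a tokenizer and a metric."""

    def compute_metrics(eval_preds):
        preds, labels = eval_preds
        # Some models return a tuple; the generated ids come first.
        if isinstance(preds, tuple):
            preds = preds[0]
        # -100 marks padded label positions and cannot be decoded directly.
        labels = [
            [int(tok) if tok != label_pad_token_id else tokenizer.pad_token_id for tok in seq]
            for seq in labels
        ]
        decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
        predictions = [pred.strip() for pred in decoded_preds]
        # sacreBLEU-style metrics expect a list of references per prediction.
        references = [[label.strip()] for label in decoded_labels]
        result = metric.compute(predictions=predictions, references=references)
        return {"bleu": result["score"]}

    return compute_metrics


class ToyTokenizer:
    """Hypothetical stand-in with a four-entry vocabulary."""

    pad_token_id = 0
    vocab = {0: "", 1: "ami", 2: "valo", 3: "achi"}

    def batch_decode(self, sequences, skip_special_tokens=True):
        return [
            " ".join(self.vocab[int(t)] for t in seq if int(t) != self.pad_token_id)
            for seq in sequences
        ]


class ToyExactMatch:
    """Hypothetical stand-in metric: percentage of exact matches, not BLEU."""

    def compute(self, predictions, references):
        hits = sum(pred == refs[0] for pred, refs in zip(predictions, references))
        return {"score": 100.0 * hits / len(predictions)}


compute_metrics = make_compute_metrics(ToyTokenizer(), ToyExactMatch())
toy_preds = [[1, 2, 3], [1, 2, 0]]
toy_labels = [[1, 2, 3], [1, 3, -100]]
scores = compute_metrics((toy_preds, toy_labels))
print(scores)
```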

## Setting Up Seq2Seq Training Arguments
1. **Purpose of `Seq2SeqTrainingArguments`**:  
   - `Seq2SeqTrainingArguments` is a specialized configuration class for sequence-to-sequence (Seq2Seq) tasks using the Hugging Face `Trainer`.
   - It configures important training parameters like batch size, learning rate, number of epochs, etc., to control the behavior of the training loop.

2. **Key Parameters in the Training Arguments**:

   - **`output_dir`**:  
     - The directory where the model checkpoints will be saved, for example `"finetuned-bn-to-rm"`.
   
   - **`evaluation_strategy`** (named `eval_strategy` in recent `transformers` releases):  
     - Defines when to run evaluations. Set to `"no"` to disable evaluation during training.
   
   - **`save_strategy`**:  
     - Specifies when to save model checkpoints. `"epoch"` saves checkpoints at the end of each epoch.
   
   - **`learning_rate`**:  
     - The learning rate used by the optimizer. In this case, it's set to `2e-5`, which is commonly used for fine-tuning.
   
   - **`per_device_train_batch_size` and `per_device_eval_batch_size`**:  
     - Batch sizes for training and evaluation. Training uses a batch size of 32, while evaluation uses a larger batch size of 64.
   
   - **`weight_decay`**:  
     - Regularization to prevent overfitting, typically set to a small value like `0.01`.
   
   - **`save_total_limit`**:  
     - Limits the total number of checkpoints to keep. In this case, only the last 3 checkpoints will be retained.
   
   - **`num_train_epochs`**:  
     - Number of training epochs, set to 3 in this example.
   
   - **`predict_with_generate`**:  
     - Makes evaluation call the model's `generate()` method to produce complete output sequences, which text metrics such as BLEU need, rather than relying on raw logits.
   
   - **`fp16`**:  
     - Enables mixed precision training for faster training with lower memory usage, set to `True` to use 16-bit floating point precision.

   - **`push_to_hub`**:  
     - Controls whether or not to upload the model to the Hugging Face Model Hub. Set to `False` to disable this behavior.

https://huggingface.co/docs/transformers/v4.47.1/en/main_classes/trainer#transformers.Seq2SeqTrainingArguments


In [None]:
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    "demo_name",  # output_dir
    evaluation_strategy='',
    save_strategy='',
    learning_rate='',
    per_device_train_batch_size='',
    per_device_eval_batch_size='',
    weight_decay='',
    save_total_limit='',
    num_train_epochs='',
    predict_with_generate='',
    fp16='',
    push_to_hub='',
    ## Add more parameters if needed
    )
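
For reference, the values described in the parameter list above can be collected as keyword arguments, as in the sketch below. `TRAINING_KWARGS` and `build_training_args` are hypothetical names. The key `eval_strategy` assumes a recent `transformers` release; older releases spell it `evaluation_strategy`. `fp16=True` assumes a CUDA GPU is available.

```python
# Values taken from the parameter descriptions; adjust them for your own run.
TRAINING_KWARGS = {
    "output_dir": "finetuned-bn-to-rm",
    "eval_strategy": "no",  # "evaluation_strategy" on older transformers releases
    "save_strategy": "epoch",
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 32,
    "per_device_eval_batch_size": 64,
    "weight_decay": 0.01,
    "save_total_limit": 3,
    "num_train_epochs": 3,
    "predict_with_generate": True,
    "fp16": True,  # requires a CUDA GPU; override with fp16=False on CPU
    "push_to_hub": False,
}


def build_training_args(**overrides):
    """Create Seq2SeqTrainingArguments from the defaults plus any overrides."""
    # Imported lazily so the configuration can be inspected without transformers.
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(**{**TRAINING_KWARGS, **overrides})
```

On a machine without a GPU, `build_training_args(fp16=False)` would be the call to try.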

### Setting Up and Using `Seq2SeqTrainer`

1. **Purpose of `Seq2SeqTrainer`**:  
   - The `Seq2SeqTrainer` class is a specialized trainer for sequence-to-sequence tasks (such as machine translation or summarization).  
   - It integrates the model, datasets, training arguments, and metrics, providing a simple interface for training and evaluation.

2. **Key Parameters in the `Seq2SeqTrainer`**:

   - **`model`**:  
     - The pre-trained or fine-tuned model you are training. This is passed as an argument to the trainer.
   
   - **`args`**:  
     - The training configuration, including batch sizes, learning rate, and number of epochs, which were set up in the `Seq2SeqTrainingArguments`.

   - **`train_dataset`**:  
     - The training dataset that has been preprocessed and tokenized. In this case, it's `tokenized_datasets["train"]`.

   - **`eval_dataset`**:  
     - The evaluation dataset used for validation during training. Here, it's `tokenized_datasets["validation"]`.

   - **`data_collator`**:  
     - The data collator, such as `DataCollatorForSeq2Seq`, used to pad and format the batches for training. It ensures the inputs and targets are padded correctly during batch creation.

   - **`tokenizer`**:  
     - The tokenizer used for tokenizing the text. It is needed to decode predictions during evaluation and to format inputs correctly. Recent `transformers` releases deprecate this argument in favor of `processing_class`.

   - **`compute_metrics`**:  
     - A function that computes evaluation metrics, such as BLEU score. In this case, the `compute_metrics` function is used to compute BLEU during evaluation.


https://huggingface.co/docs/transformers/v4.47.1/en/main_classes/trainer#transformers.Seq2SeqTrainer


In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset='',
    eval_dataset='',
    data_collator='',
    tokenizer='',
    compute_metrics='',
    ## Add more parameters if needed
)

In [None]:
trainer.train()

## Evaluating Your Model and Writing an Evaluation Function
The `compute_metrics` function that you defined earlier is called to compute the evaluation metrics from the model's predictions and the true labels of the evaluation dataset.

In [None]:
trainer.evaluate(
    ## add parameters if needed.
)