# EN-FR Translation

The objective of this project is to fine-tune a machine learning model that translates English words and sentences into French. Machine translation is a core task in natural language processing (NLP), with applications in multilingual communication, education, and machine-assisted translation tools.

To achieve this, we use a parallel English-French dataset from Kaggle:
[Language Translation: English-French Dataset](https://www.kaggle.com/datasets/devicharith/language-translation-englishfrench)


This dataset consists of two columns:


*   `English words/sentences`: English words or sentences.
*   `French words/sentences`: The corresponding French translations.

For the model, we use the Hugging Face Transformers library. Specifically, we fine-tune MarianMT, a family of pre-trained translation models, starting from the `Helsinki-NLP/opus-mt-en-fr` checkpoint.

This notebook will guide you through the following steps:


1.   Data preparation and preprocessing.
2.   Model fine-tuning using the selected dataset.
3.   Evaluation of the model’s performance.
4.   Real-time inference to translate English sentences into French.

Let's not waste any time and get started! 🚀

In [None]:
# Make sure all the necessary libraries are installed
# (sentencepiece is required by MarianTokenizer)
!pip install transformers datasets torch scikit-learn pandas sentencepiece



In [None]:
import torch

# Attempt GPU; if not, stay on CPU
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(device)

cuda:0


# Loading and Preprocessing

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from datasets import Dataset
from transformers import MarianMTModel, MarianTokenizer


data = pd.read_csv('eng_-french.csv')
print(data.head(10))  # Print the first 10 rows

  English words/sentences French words/sentences
0                     Hi.                 Salut!
1                    Run!                Cours !
2                    Run!               Courez !
3                    Who?                  Qui ?
4                    Wow!             Ça alors !
5                   Fire!               Au feu !
6                   Help!             À l'aide !
7                   Jump.                 Saute.
8                   Stop!            Ça suffit !
9                   Stop!                 Stop !


In [None]:
# Rename the columns to short names and preview the first rows
data.columns = ["eng","fre"]
data.head()

    eng         fre
0   Hi.      Salut!
1  Run!     Cours !
2  Run!    Courez !
3  Who?       Qui ?
4  Wow!  Ça alors !


In [None]:
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

# Convert each split from a pandas DataFrame to a Hugging Face Dataset
train_dataset = Dataset.from_pandas(train_data[["eng", "fre"]])
test_dataset = Dataset.from_pandas(test_data[["eng", "fre"]])

# Inspect the first example
print(train_dataset[0])

{'eng': 'Do you know how my friends describe me?', 'fre': 'Sais-tu comment mes amis me décrivent\u202f?', '__index_level_0__': 139981}
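
The `__index_level_0__` field in the example above is the original DataFrame row label, carried over because the shuffled split no longer has a default index. A minimal sketch, using a made-up three-row frame rather than the real CSV, of resetting the index before conversion so that no extra column is created:

```python
import pandas as pd

# Toy stand-in for the real data (hypothetical rows, same column names)
toy = pd.DataFrame({
    "eng": ["Hi.", "Run!", "Who?"],
    "fre": ["Salut!", "Cours !", "Qui ?"],
})

# Shuffling (as a random split does) keeps the original row labels as the index
shuffled = toy.sample(frac=1.0, random_state=42)

# Dropping the old index restores a plain 0..n-1 RangeIndex
clean = shuffled.reset_index(drop=True)

print(list(clean.index))
print(list(clean.columns))
```

Passing `preserve_index=False` to `Dataset.from_pandas` should have the same effect without touching the DataFrame.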


## Pre-process datasets: Tokenize the Data

In [None]:
# Load MarianMT tokenizer and model for translation
model_name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenization function for input-output pairs (English -> French)
def tokenize_function(examples):
    inputs = tokenizer(examples['eng'], padding="max_length", truncation=True, max_length=64)
    targets = tokenizer(examples['fre'], padding="max_length", truncation=True, max_length=64)
    inputs['labels'] = targets['input_ids']
    return inputs

# Apply tokenization to the datasets
train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

# Keep only a small subset of each split to reduce training time
train_dataset = train_dataset.select(range(500))
test_dataset = test_dataset.select(range(100))

# Inspect the tokenized dataset structure
print(train_dataset[0])


Map:   0%|          | 0/140496 [00:00<?, ? examples/s]

Map:   0%|          | 0/35125 [00:00<?, ? examples/s]

{'eng': 'Do you know how my friends describe me?', 'fre': 'Sais-tu comment mes amis me décrivent\u202f?', '__index_level_0__': 139981, 'input_ids': [1123, 55, 340, 541, 240, 3698, 9689, 143, 54, 0, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': [30842, 9, 21, 1148, 1027, 143, 9, 15, 7830, 143, 1205, 9249, 4169, 99, 0, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 
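
In the tokenized example above, `labels` is padded with the tokenizer's pad id (59513), so the training loss is also computed on padding positions. A common convention with PyTorch-based models is to replace those positions with -100, which the cross-entropy loss ignores. A minimal sketch of that step; `mask_padding` is a hypothetical helper, not part of the notebook or of transformers:

```python
PAD_ID = 59513        # pad token id visible in the tokenized output above
IGNORE_INDEX = -100   # default ignore_index of PyTorch's cross-entropy loss

def mask_padding(label_batch, pad_id=PAD_ID, ignore_index=IGNORE_INDEX):
    """Replace pad ids with ignore_index in a batch of label sequences."""
    return [
        [ignore_index if token == pad_id else token for token in labels]
        for labels in label_batch
    ]

# Illustrative label sequence: a few token ids, EOS (0), then padding
example = [[30842, 9, 21, 1148, 0, 59513, 59513, 59513]]
print(mask_padding(example))
```

Inside `tokenize_function` this could be applied as `inputs['labels'] = mask_padding(targets['input_ids'])`. Recent transformers versions also accept `tokenizer(text_target=...)` for tokenizing the target side. Either change would alter the loss values reported later, so the effect would need to be checked by retraining.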

## Set up the model for training

In [None]:
!pip install sacrebleu # Install the necessary library for BLEU score



In [None]:
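from transformers import Trainer, TrainingArguments
import numpy as np
import os
import sacrebleu  # needed by compute_metrics below

os.environ["WANDB_MODE"] = "disabled"  # turn off Weights & Biases logging

def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    # The plain Trainer returns logits (possibly inside a tuple), not token ids.
    # Taking the argmax gives teacher-forced predictions, not generated text.
    if isinstance(predictions, tuple):
        predictions = predictions[0]
    if predictions.ndim == 3:
        predictions = np.argmax(predictions, axis=-1)

    # -100 marks ignored label positions; map it back to the pad token before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    # Decode predictions and labels to text
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Calculate the corpus-level BLEU score using sacrebleu
    bleu = sacrebleu.corpus_bleu(decoded_preds, [decoded_labels])

    return {"bleu": bleu.score}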


# Set up training arguments
training_args = TrainingArguments(
    output_dir='./results',  # Where model checkpoints are saved
    logging_steps=100,  # Log the training loss every 100 steps
    run_name="translation_experiment",
    evaluation_strategy="epoch",  # Evaluate once per epoch (named `eval_strategy` in recent transformers releases)
    learning_rate=2e-5,  # Peak learning rate
    per_device_train_batch_size=4,  # Batch size for training
    per_device_eval_batch_size=4,  # Batch size for evaluation
    num_train_epochs=5,  # Number of training epochs
    weight_decay=0.01,  # Weight decay for regularization
    logging_dir='./logs',  # Where logs are saved
    report_to="none",  # Do not report to external experiment trackers
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    # compute_metrics=compute_metrics,  # disabled: only the loss is reported during evaluation
)
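
`TrainingArguments` above sets only the peak `learning_rate`. Assuming the Trainer defaults are left untouched (a linear schedule with no warmup), a single device, and no gradient accumulation, the rate should decay from 2e-5 to zero over the run. A rough sketch of that schedule; `linear_lr` is an illustrative helper, not a transformers function:

```python
import math

def linear_lr(step, total_steps, peak_lr=2e-5, warmup_steps=0):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

# 500 training examples, batch size 4, 5 epochs (values from the cells above)
steps_per_epoch = math.ceil(500 / 4)
total_steps = steps_per_epoch * 5

# Learning rate at the start, middle, and end of training
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, total_steps))
```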




## Let's train it!

In [None]:
# Train the model
trainer.train()

Epoch,Training Loss,Validation Loss
1,1.0645,0.633157
2,0.5584,0.533348
3,0.4287,0.492458
4,0.3175,0.476037
5,0.2915,0.470724




TrainOutput(global_step=625, training_loss=0.498389656829834, metrics={'train_runtime': 85.3316, 'train_samples_per_second': 29.297, 'train_steps_per_second': 7.324, 'total_flos': 42372956160000.0, 'train_loss': 0.498389656829834, 'epoch': 5.0})
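
The numbers in `TrainOutput` can be cross-checked against the configuration. The arithmetic below, which assumes a single device and no gradient accumulation, reproduces `global_step` and relates the two throughput figures to the runtime:

```python
import math

num_examples, batch_size, num_epochs = 500, 4, 5
train_runtime = 85.3316  # seconds, from TrainOutput above

# One optimizer step per batch, repeated for every epoch
steps_per_epoch = math.ceil(num_examples / batch_size)
global_step = steps_per_epoch * num_epochs

# Throughput is total work divided by wall-clock time
samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = global_step / train_runtime

print(global_step)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```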

## Evaluate the model

In [None]:
# Use a name that does not shadow Python's built-in eval()
eval_results = trainer.evaluate()
print(f"Evaluation results of the model: {eval_results}")

Evaluation results of the model: {'eval_loss': 0.4707241952419281, 'eval_runtime': 0.52, 'eval_samples_per_second': 192.297, 'eval_steps_per_second': 48.074, 'epoch': 5.0}


## Testing: Translate a new sentence

In [None]:
sentence = "This is a good project."
inputs = tokenizer(sentence, return_tensors="pt", padding=True).to(device)
# Pass the full encoding so generate() also receives the attention mask
translated = model.generate(**inputs, max_length=128)
translated_text = tokenizer.decode(translated[0], skip_special_tokens=True)

print(f"Original: {sentence}")
print(f"Translated: {translated_text}")

Original: This is a good project.
Translated: C'est un bon projet.
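
To translate several sentences in one call, the same steps can be wrapped in a function. The sketch below is one possible way to structure it (`translate_batch` is a hypothetical helper, not part of the notebook or of transformers). It expects the `model` and `tokenizer` objects loaded earlier and forwards the attention mask so that padded positions are ignored during generation:

```python
def translate_batch(sentences, model, tokenizer, device="cpu", max_length=128):
    """Translate a list of English sentences with an already loaded model."""
    # Pad to the longest sentence in the batch and move the tensors to the device
    inputs = tokenizer(
        sentences, return_tensors="pt", padding=True, truncation=True
    ).to(device)
    # **inputs passes both input_ids and attention_mask to generate()
    output_ids = model.generate(**inputs, max_length=max_length)
    # Turn every generated id sequence back into text
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

With the notebook's objects this would be called as `translate_batch(["This is a good project.", "Thank you!"], model, tokenizer, device)`, with the model sitting on the same device as the inputs.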


Thank you! 😊
