# Text-to-Text Translation Preprocessing

This notebook preprocesses the Parallel Global Voices English-French dataset for text-to-text translation.

## Import Libraries

Let's start by importing the necessary libraries.

In [1]:
from datasets import load_dataset

import pandas as pd
import re

import torch
from transformers import MarianTokenizer
from sklearn.model_selection import train_test_split


## Load Dataset

Load the `train` split of the Parallel Global Voices English-French dataset with the `load_dataset` function and convert it to a pandas DataFrame.


In [2]:
dataset = load_dataset("Nicolas-BZRD/Parallel_Global_Voices_English_French", split='train').to_pandas()

## Data Preprocessing Steps

### Text Cleaning

The planned text cleaning steps are listed below. `text_cleaning` currently applies the first three; the last two are still TODO:
- Lowercase the text
- Remove HTML tags and URLs
- Collapse repeated spaces
- Remove special characters, keeping punctuation marks because they matter for translation (TODO)
- Expand contractions, e.g. don't -> do not (TODO)

In [3]:
def text_cleaning(text):
    text = text.lower()                 # lowercase
    text = re.sub('<[^>]+>', '', text)  # remove HTML tags
    text = re.sub(r'http\S+', '', text) # remove URLs
    text = re.sub(' +', ' ', text)      # remove extra spaces
    
    # TODO : remove special characters (check the exploratory notebook)
    # TODO : expand contractions (here or inside a data augmentation function "import contractions")
    
    return text

In [4]:
dataset['en_cleaned'] = dataset['en'].apply(text_cleaning)
dataset['fr_cleaned'] = dataset['fr'].apply(text_cleaning)
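The two TODO items in `text_cleaning` could be sketched as follows. This is only an illustration under stated assumptions: the contraction map is a tiny hypothetical sample rather than the `contractions` package mentioned in the TODO, the set of punctuation to keep is a guess, and contraction expansion only makes sense for the English column.

```python
import re

# Hypothetical, deliberately tiny contraction map (English only);
# a real pipeline would need a much fuller list.
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "i'm": "i am",
}

# One pattern that matches any key of the map as a whole word.
_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b"
)


def expand_contractions(text):
    """Replace known contractions in already-lowercased English text."""
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(1)], text)


def remove_special_characters(text):
    """Drop everything except word characters, whitespace, and common punctuation."""
    # \w is Unicode-aware for str patterns, so accented French letters are kept.
    # The punctuation kept here is an assumption, not a list taken from the notebook.
    text = re.sub(r"[^\w\s.,;:!?'’\"«»()\-]", "", text)
    # Removing characters can leave double spaces behind, so collapse them.
    return re.sub(r"\s+", " ", text).strip()
```

Both helpers expect lowercased input, so they would be called after the `text.lower()` step.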

### Tokenization and Numerical Encoding
In this step, we tokenize the cleaned text and encode it as token IDs. We use the `MarianTokenizer` from the transformers library, loaded from the 'Helsinki-NLP/opus-mt-en-fr' checkpoint, which is trained for English-to-French translation.

The ``dataset['en_cleaned']`` column contains the cleaned English text, and the ``dataset['fr_cleaned']`` column contains the cleaned French text. We apply `tokenizer.encode()` to each text, which tokenizes it and returns the token IDs as a PyTorch tensor of shape `(1, sequence_length)`.

The encoded representations are stored in the ``dataset['en_encoded']`` and ``dataset['fr_encoded']`` columns, respectively, and can be used to train or fine-tune a translation model. One caveat: `tokenizer.encode()` treats its input as source-language (English) text. Marian tokenizers generally expect target-language text to be passed through the `text_target` argument of `tokenizer(...)`, so the French encodings produced here may not match what the model expects as labels.


In [5]:
# Tokenization and numerical encoding using appropriate tokenizer
tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-fr')
dataset['en_encoded'] = dataset['en_cleaned'].apply(lambda x: tokenizer.encode(x, return_tensors='pt'))
dataset['fr_encoded'] = dataset['fr_cleaned'].apply(lambda x: tokenizer.encode(x, return_tensors='pt'))

Token indices sequence length is longer than the specified maximum sequence length for this model (733 > 512). Running this sequence through the model will result in indexing errors


### Padding and Truncation

This section re-encodes the text with padding and truncation so that every sequence has the same length.
The maximum sequence length is set to 512, the limit reported in the tokenizer warning above.
Sequences shorter than the maximum length are padded with the tokenizer's padding token (`tokenizer.pad_token_id`, which is not necessarily zero), while longer sequences are truncated.
The padded and truncated sequences overwrite the 'en_encoded' and 'fr_encoded' columns of the dataset.
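To make the effect of padding and truncation concrete, here is a pure-Python sketch on plain lists of token IDs. It is a simplified stand-in, not the tokenizer's implementation: the function name and the `pad_id` value are hypothetical, and it ignores the fact that the real tokenizer keeps its end-of-sequence token when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    kept = list(token_ids[:max_length])             # truncate on the right
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad             # pad on the right
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Hypothetical pad id of 99, chosen only for illustration.
short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5, pad_id=99)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6], max_length=4, pad_id=99)
```

The attention mask is what lets a model ignore the padded positions, which is why it usually travels together with the padded IDs.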


In [6]:
max_seq_length = 512  # Maximum sequence length supported by the model

# Tokenization and numerical encoding with truncation and padding
dataset['en_encoded'] = dataset['en_cleaned'].apply(lambda x: tokenizer.encode(x, max_length=max_seq_length, truncation=True, padding='max_length', return_tensors='pt'))
dataset['fr_encoded'] = dataset['fr_cleaned'].apply(lambda x: tokenizer.encode(x, max_length=max_seq_length, truncation=True, padding='max_length', return_tensors='pt'))

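### Train/Validation/Test Split

The cell below holds out 20% of the rows as a test set, using a fixed random seed. No separate validation set is created; the test split is also used for evaluation during training.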


In [7]:
train_data, test_data = train_test_split(dataset, test_size=0.2, random_state=42)
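If a separate validation set is wanted, one option is a three-way split. The sketch below shuffles row indices with the standard library instead of calling `train_test_split` twice; the function name and the 80/10/10 fractions are assumptions chosen for illustration.

```python
import random


def three_way_split(n_rows, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Return disjoint (train, validation, test) lists of row indices."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_test = int(n_rows * test_fraction)
    n_val = int(n_rows * val_fraction)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = three_way_split(100)
# With a DataFrame, the rows could then be selected with e.g. dataset.iloc[train_idx].
```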

In [10]:
from transformers import MarianMTModel, MarianTokenizer, Seq2SeqTrainer, Seq2SeqTrainingArguments
import torch

# Initialize the Marian tokenizer and model
model_name = 'Helsinki-NLP/opus-mt-en-fr'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Define training arguments
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    output_dir='./model_output',
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    evaluation_strategy='steps',  # newer transformers releases name this `eval_strategy`
    save_steps=500,
    eval_steps=500,
    logging_steps=500,
    warmup_steps=2000,
    num_train_epochs=3,
    logging_dir='./logs',
)

def build_features(frame):
    """Turn the encoded columns into dict-style features for the Trainer."""
    features = []
    for src_seq, tgt_seq in zip(frame['en_encoded'], frame['fr_encoded']):
        input_ids = src_seq.squeeze(0)  # (1, seq_len) -> (seq_len,)
        labels = tgt_seq.squeeze(0).clone()
        labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
        features.append({
            'input_ids': input_ids,
            'attention_mask': (input_ids != tokenizer.pad_token_id).long(),
            'labels': labels,
        })
    return features

# Passing (source, target) tuples as examples raised
# "TypeError: vars() argument must have __dict__ attribute" in the data collator,
# so each example is built as a dict instead.
train_dataset = build_features(train_data)

# The held-out test split doubles as the evaluation set
eval_dataset = build_features(test_data)

# Create a Seq2SeqTrainer instance and train the model
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# Start training
trainer.train()
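The `TypeError: vars() argument must have __dict__ attribute` that appears when `(source, target)` tuples are used as training examples can be reproduced without transformers. The sketch below is a simplified stand-in for what a default collator appears to do, not the library's code: examples that are not mappings are passed through `vars()`, which fails for tuples.

```python
from collections.abc import Mapping


def collate(features):
    """Simplified stand-in for a default collator (hypothetical helper)."""
    if not isinstance(features[0], Mapping):
        # Works for objects with a __dict__, raises TypeError for tuples.
        features = [vars(f) for f in features]
    # Group values by key; a real collator would also stack them into tensors.
    return {key: [f[key] for f in features] for key in features[0]}


tuple_features = [([1, 2], [3, 4])]
dict_features = [{"input_ids": [1, 2], "labels": [3, 4]}]

try:
    collate(tuple_features)
    tuple_error = None
except TypeError as exc:
    tuple_error = str(exc)  # the tuple example has no __dict__

batch = collate(dict_features)  # dict examples collate without error
```

This is why each training example is usually a dict with keys such as `input_ids`, `attention_mask`, and `labels` rather than a tuple.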

In [None]:
# Example sentence to translate
sentence_to_translate = "Hello, how are you?"

# Tokenize the input sentence
inputs = tokenizer.encode(sentence_to_translate, return_tensors='pt').to(model.device)  # keep inputs on the model's device

# Generate translation
translated = model.generate(inputs, max_length=512, num_beams=4, early_stopping=True)

# Decode the translated output tokens back to text
translated_text = tokenizer.decode(translated[0], skip_special_tokens=True)
print("Translated:", translated_text)


## Save Preprocessed Data

In [None]:
# # Determine the number of chunks to split the data
# num_chunks = 10  # You can adjust this number based on your dataset size

# # Calculate the chunk size
# chunk_size = len(train_data) // num_chunks

# # Save the data in chunks
# for i in range(num_chunks):
#     start_idx = i * chunk_size
#     end_idx = start_idx + chunk_size if i < num_chunks - 1 else len(train_data)
#     chunk_data = train_data.iloc[start_idx:end_idx]
#     chunk_data.to_csv(f'../data/preprocessed/train/preprocessed_train_chunk_{i + 1}.csv.gz', index=False, compression='gzip')
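The chunk boundaries used in the commented-out cell above can be checked in isolation. This sketch mirrors the same arithmetic in a small helper; the function name is hypothetical. The last chunk absorbs any remainder rows.

```python
def chunk_bounds(n_rows, num_chunks):
    """Return (start, end) index pairs that cover n_rows in num_chunks slices."""
    chunk_size = n_rows // num_chunks
    bounds = []
    for i in range(num_chunks):
        start = i * chunk_size
        # Every chunk has chunk_size rows except the last, which runs to the end.
        end = start + chunk_size if i < num_chunks - 1 else n_rows
        bounds.append((start, end))
    return bounds


bounds = chunk_bounds(10, 3)
```

Each pair can be used directly as `train_data.iloc[start:end]`.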


In [None]:
# train_data.to_csv('../data/preprocessed/train/preprocessed_train.csv.gz', index=False, compression='gzip')
# test_data.to_csv('../data/preprocessed/test/preprocessed_test.csv.gz', index=False, compression='gzip')