# Text-to-Text Translation Preprocessing

This notebook preprocesses the Parallel Global Voices English-French dataset for text-to-text translation tasks.

## Import Libraries

Let's start by importing the necessary libraries.

In [24]:
from datasets import load_dataset

import pandas as pd
import re

import torch
from transformers import MarianTokenizer
from sklearn.model_selection import train_test_split

## Load Dataset

Load the train split of the Parallel Global Voices English-French dataset using the `load_dataset` function, then convert it to a pandas DataFrame.


In [16]:
dataset = load_dataset("Nicolas-BZRD/Parallel_Global_Voices_English_French", split='train').to_pandas()

## Data Preprocessing Steps

### Text Cleaning

The text cleaning steps are as follows:
- Lowercase the text
- Remove HTML tags and URLs
- Collapse extra whitespace
- Remove special characters, keeping punctuation marks because they matter for translation (not yet implemented, see the TODO in the code)
- Expand contractions, e.g. don't -> do not (not yet implemented, see the TODO in the code)

In [17]:
def text_cleaning(text):
    text = text.lower()                      # lowercase
    text = re.sub(r'<[^>]+>', '', text)      # remove HTML tags
    text = re.sub(r'http\S+', '', text)      # remove URLs
    text = re.sub(r' +', ' ', text).strip()  # collapse repeated spaces and trim the ends

    # TODO: remove special characters (check the exploratory notebook)
    # TODO: expand contractions (here or inside a data augmentation function, e.g. "import contractions")

    return text

In [18]:
dataset['en_cleaned'] = dataset['en'].apply(text_cleaning)
dataset['fr_cleaned'] = dataset['fr'].apply(text_cleaning)
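
The two steps still marked as TODO in `text_cleaning` could look roughly like the sketch below. It is only an illustration under stated assumptions: the `CONTRACTIONS` map is a hypothetical, deliberately tiny stand-in for a full contraction list, the set of allowed punctuation in `DISALLOWED` is a guess rather than the result of the exploratory notebook, and the helper names are not part of the original code. Contraction expansion only makes sense for the English column.

```python
import re

# Hypothetical, deliberately tiny contraction map (English only).
# Some entries are ambiguous in real text, e.g. "it's" can mean "it has".
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "it's": "it is",
    "i'm": "i am",
}

# Assumption: keep letters (accented ones included), digits, whitespace,
# and a handful of punctuation marks used in English and French.
DISALLOWED = re.compile(r"""[^\w\s.,;:!?'’"«»()\-]""")


def expand_contractions(text):
    """Replace each known contraction in already-lowercased text."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b")
    return pattern.sub(lambda m: CONTRACTIONS[m.group(1)], text)


def remove_special_characters(text):
    """Drop characters outside the allowed set, then collapse leftover spaces."""
    text = DISALLOWED.sub("", text)
    return re.sub(r" +", " ", text).strip()


sample = "i'm sure it's fine © 2023, don't worry!"
cleaned = remove_special_characters(expand_contractions(sample))
print(cleaned)
```

Both helpers expect lowercased input, so they would slot in after the `text.lower()` call; whether the allowed character set is right for this corpus is something the exploratory notebook would have to confirm.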

### Tokenization and Numerical Encoding
In this step, we tokenize the cleaned text and convert it to token IDs. We use the `MarianTokenizer` from the transformers library, loaded from the 'Helsinki-NLP/opus-mt-en-fr' checkpoint, which is trained for English-to-French translation.

The ``dataset['en_cleaned']`` column contains the preprocessed English text, and the ``dataset['fr_cleaned']`` column contains the preprocessed French text. We apply ``tokenizer.encode()`` to each text, which tokenizes it and, with ``return_tensors='pt'``, returns the token IDs as a PyTorch tensor. Note that ``encode()`` treats both columns as source-side input; Marian tokenizers also accept a ``text_target`` argument when called directly, which is the usual way to encode target-language text.

The token IDs are stored in the ``dataset['en_encoded']`` and ``dataset['fr_encoded']`` columns, respectively. They can be used as input to train translation models or for other text-to-text translation tasks.


In [19]:
# Tokenization and numerical encoding using appropriate tokenizer
tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-fr')
dataset['en_encoded'] = dataset['en_cleaned'].apply(lambda x: tokenizer.encode(x, return_tensors='pt'))
dataset['fr_encoded'] = dataset['fr_cleaned'].apply(lambda x: tokenizer.encode(x, return_tensors='pt'))

Token indices sequence length is longer than the specified maximum sequence length for this model (733 > 512). Running this sequence through the model will result in indexing errors


### Padding and Truncation

This section re-encodes the text with padding and truncation so that every sequence has the same length.
The maximum sequence length is set to 512, the limit reported in the tokenizer warning above.
Sequences shorter than the maximum length are padded with the tokenizer's padding token, while longer sequences are truncated.
The padded and truncated sequences overwrite the 'en_encoded' and 'fr_encoded' columns of the dataset.


In [22]:
max_seq_length = 512  # Maximum sequence length supported by the model

# Re-encode with truncation and padding so every sequence has max_seq_length tokens
encode_kwargs = dict(max_length=max_seq_length, truncation=True, padding='max_length', return_tensors='pt')
dataset['en_encoded'] = dataset['en_cleaned'].apply(lambda x: tokenizer.encode(x, **encode_kwargs))
dataset['fr_encoded'] = dataset['fr_cleaned'].apply(lambda x: tokenizer.encode(x, **encode_kwargs))
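
To make the effect of padding and truncation concrete, here is a minimal pure-Python sketch of the idea on a plain list of token IDs. It is not the tokenizer's implementation: `pad_or_truncate` and `PAD_ID` are hypothetical names, the padding ID is a placeholder (in practice it would come from `tokenizer.pad_token_id`), and the sketch ignores special tokens such as the end-of-sequence marker.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (ids, attention_mask), each exactly max_length long."""
    ids = list(token_ids[:max_length])   # truncate anything past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)        # how many padding slots are needed
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


PAD_ID = 9  # hypothetical placeholder, not the real Marian padding ID

short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=5, pad_id=PAD_ID)
long_ids, long_mask = pad_or_truncate([11, 12, 13, 14, 15, 16, 17], max_length=5, pad_id=PAD_ID)

print(short_ids, short_mask)  # [11, 12, 13, 9, 9] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [11, 12, 13, 14, 15] [1, 1, 1, 1, 1]
```

The attention mask is included because a model normally needs to know which positions are padding. `tokenizer.encode()` returns only the token IDs, whereas calling the tokenizer object directly also returns an `attention_mask`.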

### Train/Validation/Test Split


In [26]:
# Hold out 20% of the rows as the test set; no separate validation set is created here
train_data, test_data = train_test_split(dataset, test_size=0.2, random_state=42)
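
The section title mentions a validation set, while the cell above produces only train and test sets. One way to get all three is sketched below using only the standard library. The function name `three_way_split` and the 80/10/10 proportions are assumptions for illustration, not choices made in the notebook.

```python
import random


def three_way_split(n_rows, val_size=0.1, test_size=0.1, seed=42):
    """Shuffle row positions and cut them into train/validation/test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible

    n_test = int(round(n_rows * test_size))
    n_val = int(round(n_rows * val_size))

    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = three_way_split(100)
print(len(train_idx), len(val_idx), len(test_idx))  # 80 10 10
```

The returned positions could then be used as `dataset.iloc[train_idx]`, and so on. A common alternative is to call `train_test_split` twice: once to carve off the test set and once more on the remainder to carve off the validation set.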

## Save Preprocessed Data

In [None]:
# train_data.to_csv('../data/preprocessed/preprocessed_train.csv', index=False)
# test_data.to_csv('../data/preprocessed/preprocessed_test.csv', index=False)
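
One caveat with CSV here: each cell is written as text, so the tensor columns would be stored as their string representation and would need parsing to recover. A possible alternative, sketched below under assumptions, is to convert each tensor to a plain list (for example with `.squeeze(0).tolist()`) and write one JSON object per line. The helper names, the record layout, and the temporary output path are all hypothetical.

```python
import json
import os
import tempfile


def save_jsonl(records, path):
    """Write one JSON object per line; token IDs stay as real lists."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_jsonl(path):
    """Read the file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical rows; real ones would come from train_data after converting tensors to lists.
records = [
    {"en": "hello", "fr": "bonjour", "en_ids": [5, 6, 9], "fr_ids": [7, 8, 9]},
    {"en": "summer", "fr": "été", "en_ids": [3, 9, 9], "fr_ids": [4, 9, 9]},
]

out_path = os.path.join(tempfile.mkdtemp(), "preprocessed_train.jsonl")
save_jsonl(records, out_path)
restored = load_jsonl(out_path)
print(restored == records)
```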