# Downloading the OSCAR Dataset

In this notebook, we download the OSCAR dataset to pretrain the **CamemBERT** language model.

## Prerequisites
1. Ensure you are logged into **Hugging Face**:
   - Run the following command in your terminal to authenticate:
     ```bash
     huggingface-cli login
     ```
   - This will prompt you to enter your Hugging Face token. You can generate a token from [Hugging Face Settings](https://huggingface.co/settings/tokens).

2. Accept the dataset usage conditions on Hugging Face:
   - Visit the [OSCAR dataset page](https://huggingface.co/datasets/oscar-corpus/OSCAR-2201) and click on **"Request Access"** if needed.

## Steps in this Notebook
- Download the OSCAR dataset in the desired language.
- Preprocess the dataset for training.


> Once you have completed the prerequisites, you can run the code below to proceed.


In [42]:
from huggingface_hub import whoami

# Check which Hugging Face account is logged in
user_info = whoami()
print(f"You are logged in as: {user_info['name']}")

You are logged in as: Noureddine-khaous


## Download the OSCAR Dataset

In [41]:
from datasets import load_dataset

dataset = load_dataset("oscar-corpus/OSCAR-2201",
                        # use_auth_token=True,  # not accepted by the installed `datasets` version
                        language="fr",
                        streaming=False,  # download the full split locally
                        split="train",
                        cache_dir='CamemBERT/data',
                        trust_remote_code=True)  # optional

Loading dataset shards:   0%|          | 0/658 [00:00<?, ?it/s]

In [24]:
print(f'Dataset size: {len(dataset)}')
print(f'Dataset type: {type(dataset)}')

Dataset size: 52037098
Dataset type: <class 'datasets.arrow_dataset.Dataset'>


In [39]:
import os

# Get the paths of the cached Arrow files
cache_files = dataset.cache_files

# Compute the total size in bytes
total_size_bytes = sum(os.path.getsize(file["filename"]) for file in cache_files)

# Convert to gigabytes (1 GB taken as 1024**3 bytes)
total_size_gb = total_size_bytes / (1024 ** 3)
print(f"Total size of the cached dataset: {total_size_gb:.2f} GB")


Total size of the cached dataset: 308.25 GB


In [21]:
print(f"Total number of examples: {len(dataset)}")
print("Columns:", dataset.column_names)
print()
#print("First example:", dataset[0])
print(dataset.features)      # Column types

Total number of examples: 52037098
Columns: ['id', 'text', 'meta']

{'id': Value(dtype='int64', id=None), 'text': Value(dtype='string', id=None), 'meta': {'warc_headers': {'warc-record-id': Value(dtype='string', id=None), 'warc-date': Value(dtype='string', id=None), 'content-type': Value(dtype='string', id=None), 'content-length': Value(dtype='int32', id=None), 'warc-type': Value(dtype='string', id=None), 'warc-identified-content-language': Value(dtype='string', id=None), 'warc-refers-to': Value(dtype='string', id=None), 'warc-target-uri': Value(dtype='string', id=None), 'warc-block-digest': Value(dtype='string', id=None)}, 'identification': {'label': Value(dtype='string', id=None), 'prob': Value(dtype='float32', id=None)}, 'annotations': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'line_identifications': [{'label': Value(dtype='string', id=None), 'prob': Value(dtype='float32', id=None)}]}}


In [None]:
from transformers import BertTokenizer

# `bert-base-uncased` uses an English WordPiece vocabulary, so French words split into many subword pieces
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

example = {"text": "Bonjour, comment allez-vous ? ."}

tokenized_example = tokenizer(
    example["text"],
    padding="max_length",
    truncation=True,
    max_length=20,  # Arbitrary limit for this example
    return_tensors="pt"  # Return PyTorch tensors
)

print("Tokens:", tokenizer.convert_ids_to_tokens(tokenized_example["input_ids"][0]))
print("Input IDs:", tokenized_example["input_ids"])
print("Attention Mask:", tokenized_example["attention_mask"])


Tokens: ['[CLS]', 'bon', '##jou', '##r', ',', 'comment', 'all', '##ez', '-', 'vo', '##us', '?', 'au', '##jou', '##rd', "'", 'hui', 'il', 'fai', '[SEP]']
Input IDs: tensor([[  101, 14753, 23099,  2099,  1010,  7615,  2035,  9351,  1011, 29536,
          2271,  1029,  8740, 23099,  4103,  1005, 17504,  6335, 26208,   102]])
Attention Mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])


## Preparing Data for Training CamemBERT

### Context
We aim to pretrain CamemBERT on the French portion of the OSCAR dataset. The original CamemBERT paper reports that reducing the pretraining corpus (e.g., from 138GB to 4GB) still gives meaningful results, with only a slight drop in performance.

In this project, we downloaded **308GB** of French text from OSCAR. To keep training manageable and comparable to the paper's 4GB setting, we limit the dataset size to **4GB**.

### Objective
Create a random **4GB** subset of the **308GB** dataset while limiting the risk of overfitting. Shuffling before sampling keeps the subset representative, so the model does not learn biases tied to specific topics.

### Methodology

1. **Dataset Download**: We used Hugging Face's `datasets` library to download the French OSCAR dataset.
   - Total number of examples: **52,037,098**.
   - Total size: **308GB**.

2. **Calculating the Ratio**:
   To reduce the dataset to **4GB**, we compute the ratio:
   $$
   \text{ratio} = \frac{4}{308} \approx 0.013
   $$
   This corresponds to approximately **1.3%** of the dataset, or **676,482 examples** out of the **52 million** (0.013 × 52,037,098 ≈ 676,482), assuming examples are roughly uniform in size.

3. **Selection Process**:
   - Shuffle the dataset to ensure randomness.
   - Select **676,482 examples** to create a dataset of approximately **4GB**.

4. **Saving the Subset**:
   - The selected data is saved in an efficient format:
     - **Arrow**: Optimized for direct use with Hugging Face.
     - **Parquet**: Suitable for general-purpose processing.
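
The arithmetic in step 2 can be reproduced in a few lines of Python. This is a minimal sketch that assumes examples are roughly uniform in size, so that a given fraction of the rows corresponds to about the same fraction of the bytes; the variable names are illustrative.

```python
# Figures reported earlier in this notebook.
total_examples = 52_037_098   # rows in the French OSCAR split
total_size_gb = 308           # size of the cached dataset, rounded
target_size_gb = 4            # desired subset size

# Fraction of the corpus to keep, rounded to three decimals as in the text.
ratio = round(target_size_gb / total_size_gb, 3)

# Number of rows to keep, assuming rows are roughly uniform in size.
num_examples_needed = int(total_examples * ratio)

print(f"ratio = {ratio}")
print(f"num_examples_needed = {num_examples_needed:,}")
```

The count of 676,482 comes from the rounded ratio of 0.013. Using the unrounded ratio 4/308 would give about 675,806 rows, a difference that is negligible at this scale.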

### Pipeline Steps
- Shuffle the full dataset.
- Randomly select **676,482 examples** corresponding to **4GB**.
- Save the subset in a compact format.

### Verification
After saving, we verify the final dataset size in **GB** to ensure it aligns with the target size of **4GB**.
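
One way to carry out this check is to sum the sizes of all files under the saved dataset directory. The sketch below assumes the subset was written with `save_to_disk`, which produces a directory of files; `directory_size_gb` is a hypothetical helper name, and sizes are reported with 1 GB taken as 1024³ bytes, matching the cache-size computation earlier in the notebook.

```python
import os

def directory_size_gb(path: str) -> float:
    """Return the total size of all files under `path`, with 1 GB = 1024**3 bytes."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
    return total_bytes / (1024 ** 3)

# Path used when saving the subset in this notebook; skipped if it does not exist yet.
subset_path = "CamemBERT/data/mini_oscar/mini_dataset.arrow"
if os.path.isdir(subset_path):
    print(f"Subset size on disk: {directory_size_gb(subset_path):.2f} GB")
```

The on-disk size includes the `id` and `meta` columns as well as `text`, so it can differ from the amount of raw text in the subset.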

---

This reduction ensures faster training while preserving the diversity and quality needed to avoid biases and overfitting. The resulting mini-dataset will be used for pretraining or fine-tuning CamemBERT.
___

In [None]:
from huggingface_hub import whoami
user_info = whoami()
print(f"You are logged in as: {user_info['name']}")

In [None]:
from datasets import load_dataset

dataset = load_dataset("oscar-corpus/OSCAR-2201",
                        # use_auth_token=True,  # not accepted by the installed `datasets` version
                        language="fr",
                        streaming=False,  # download the full split locally
                        split="train",
                        cache_dir='CamemBERT/data',
                        trust_remote_code=True)  # optional

## Shuffle the Data

In [None]:
# Shuffle with a fixed seed so the subset is reproducible
shuffled_dataset = dataset.shuffle(seed=42)
num_examples_needed = 676482  # Number of examples for roughly 4GB

# Keep the first `num_examples_needed` rows of the shuffled dataset
mini_dataset = shuffled_dataset.select(range(num_examples_needed))
print(f"Number of examples in the mini-dataset: {len(mini_dataset)}")
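
A fixed row count only approximates the 4GB target, because web-crawled documents vary in length. One alternative, which is not what this notebook does, is to walk the shuffled examples and stop once the accumulated UTF-8 size of the `text` field reaches a byte budget. The sketch below illustrates the idea on plain Python dictionaries standing in for dataset rows; `select_by_byte_budget` and the toy corpus are hypothetical.

```python
import random
from typing import Iterable, List

def select_by_byte_budget(examples: Iterable[dict], budget_bytes: int) -> List[dict]:
    """Collect examples in order until adding the next one would exceed the budget."""
    selected = []
    used = 0
    for example in examples:
        size = len(example["text"].encode("utf-8"))
        if used + size > budget_bytes:
            break
        selected.append(example)
        used += size
    return selected

# Toy corpus standing in for the dataset: texts of increasing length.
corpus = [{"id": i, "text": "é" * (i + 1)} for i in range(10)]
random.Random(42).shuffle(corpus)  # seeded shuffle for reproducibility

subset = select_by_byte_budget(corpus, budget_bytes=40)
total = sum(len(ex["text"].encode("utf-8")) for ex in subset)
print(f"{len(subset)} examples selected, {total} bytes of text")
```

This budget counts only the text itself, so the size of a saved dataset, which also stores `id` and `meta`, would be somewhat larger.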

## Save the 4GB Mini OSCAR Dataset

In [None]:
# Save in Arrow format (recommended for Hugging Face); `save_to_disk` writes a directory at this path
mini_dataset.save_to_disk("CamemBERT/data/mini_oscar/mini_dataset.arrow")
print("Mini-dataset saved in Arrow format")