## Preparing Data for Training CamemBERT

### Context
We aim to pretrain CamemBERT on the French portion of the OSCAR dataset. The original CamemBERT paper reports that shrinking the training dataset (e.g., from 138GB to 4GB) still gives meaningful results, with only a slight drop in performance.

In this project, we downloaded **308GB** of French text from OSCAR. To make the training manageable and comparable, we limit the dataset size to **4GB**.

### Objective
Create a random **4GB** subset of the **308GB** dataset. Shuffling before sampling keeps the subset from being dominated by the topics or sources that happen to sit next to each other in the original order, which reduces the risk of the model learning topic-specific biases.

### Methodology

1. **Dataset Download**: We used Hugging Face's `datasets` library to download the French OSCAR dataset.
   - Total number of examples: **52,037,098**.
   - Total size: **308GB**.

2. **Calculating the Ratio**:
   To reduce the dataset to **4GB**, we compute the ratio:
   $$
   \text{ratio} = \frac{4}{308} \approx 0.013
   $$
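   Applying the rounded ratio to the **52,037,098** examples gives **676,482 examples**, about **1.3%** of the dataset. Sizing the subset by example count assumes the sampled documents have the same average size as the full corpus.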

3. **Selection Process**:
   - Shuffle the dataset to ensure randomness.
   - Select **676,482 examples** to create a dataset of approximately **4GB**.

4. **Saving the Subset**:
   - The selected data can be saved in an efficient format:
     - **Arrow**: Optimized for direct use with Hugging Face (written below with `save_to_disk`).
     - **Parquet**: Suitable for general-purpose processing.
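
The arithmetic in step 2 can be reproduced in a few lines. This is a minimal sketch that uses only the figures quoted above; the variable names are ours. It shows that the 676,482 figure comes from the ratio rounded to three decimals, while the unrounded ratio gives a slightly smaller count.

```python
# Figures reported above for the French OSCAR download.
TOTAL_EXAMPLES = 52_037_098
TOTAL_SIZE_GB = 308
TARGET_SIZE_GB = 4

# Exact ratio, and the ratio rounded to three decimals as in the text.
exact_ratio = TARGET_SIZE_GB / TOTAL_SIZE_GB
rounded_ratio = round(exact_ratio, 3)

# Number of examples implied by each ratio (truncated to an integer).
n_exact = int(TOTAL_EXAMPLES * exact_ratio)
n_rounded = int(TOTAL_EXAMPLES * rounded_ratio)

print(f"exact ratio   : {exact_ratio:.6f} -> {n_exact:,} examples")
print(f"rounded ratio : {rounded_ratio:.3f} -> {n_rounded:,} examples")
```

The two counts differ by about 0.1%, which is negligible next to the natural variation in document length.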

### Pipeline Steps
- Shuffle the full dataset.
- Randomly select **676,482 examples** corresponding to **4GB**.
- Save the subset in a compact format.
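
The shuffle-then-select idea can be illustrated on a toy in-memory corpus. This sketch uses only the Python standard library as a stand-in for the `datasets` calls used later in this notebook; `shuffle_and_select` is a hypothetical helper, not part of any library. A fixed seed makes the selection reproducible.

```python
import random


def shuffle_and_select(examples, num_examples, seed=42):
    """Return `num_examples` items drawn uniformly at random, without replacement."""
    indices = list(range(len(examples)))
    # A dedicated Random instance keeps the result independent of global state.
    random.Random(seed).shuffle(indices)
    return [examples[i] for i in indices[:num_examples]]


# Toy corpus standing in for the full dataset.
corpus = [f"document {i}" for i in range(1_000)]
subset = shuffle_and_select(corpus, num_examples=13)
print(len(subset))
```

Calling the helper again with the same seed returns the same subset, so the experiment can be rerun on identical data.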

### Verification
After saving, we verify the final dataset size in **GB** to ensure it aligns with the target size of **4GB**.
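
One way to run this check without loading the dataset is to sum the file sizes on disk. The sketch below is an assumption about how the check could be done, not the exact method used in this notebook; `directory_size_gib`, `within_target`, and the 5% tolerance are hypothetical choices. Sizes are computed with 1024³ bytes per unit, the same convention as the size check later in this notebook.

```python
import os


def directory_size_gib(path):
    """Sum the sizes of all regular files under `path`, in units of 1024**3 bytes."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
    return total_bytes / (1024 ** 3)


def within_target(size_gib, target_gib=4.0, tolerance=0.05):
    """Return True if `size_gib` is within a relative `tolerance` of `target_gib`."""
    return abs(size_gib - target_gib) <= tolerance * target_gib
```

For example, `within_target(directory_size_gib("CamemBERT/data/mini_oscar/mini_dataset.arrow"))` would report whether the saved subset is within 5% of 4GB.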

---

This reduction speeds up training while aiming to preserve the diversity of the full corpus, which helps limit topic bias and overfitting. The resulting mini-dataset will be used for pretraining or fine-tuning CamemBERT.
___


## 1. Load the full dataset

In [None]:
from huggingface_hub import whoami

# Confirm that a Hugging Face login is active before downloading
user_info = whoami()
print(f"You are logged in as: {user_info['name']}")

In [11]:
from datasets import load_dataset

dataset = load_dataset(
    "oscar-corpus/OSCAR-2201",
    # use_auth_token=True,  # the method doesn't accept the param
    language="fr",
    streaming=False,  # download locally
    split="train",
    cache_dir="CamemBERT/data",
    trust_remote_code=True,  # optional
)

Loading dataset shards:   0%|          | 0/658 [00:00<?, ?it/s]

## 2. Shuffle and select

In [3]:
# Shuffle with a fixed seed so the selection is reproducible
shuffled_dataset = dataset.shuffle(seed=42)
num_examples_needed = 676482  # number of examples for roughly 4GB

# Keep the first N examples of the shuffled dataset
mini_dataset = shuffled_dataset.select(range(num_examples_needed))
print(f"Number of examples in the mini-dataset: {len(mini_dataset)}")

Number of examples in the mini-dataset: 676482
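
Because documents vary in length, a fixed example count only approximates the 4GB target. An alternative, sketched here as an assumption and not as the method used in this notebook, is to walk the shuffled examples and stop once a byte budget is reached. `select_by_byte_budget` is a hypothetical helper. It measures raw UTF-8 text only, so the on-disk Arrow size, which also stores the `id` and `meta` fields, would differ somewhat.

```python
def select_by_byte_budget(texts, budget_bytes):
    """Yield texts in order until adding the next one would exceed `budget_bytes`."""
    used = 0
    for text in texts:
        size = len(text.encode("utf-8"))
        if used + size > budget_bytes:
            break
        used += size
        yield text


sample = ["été", "hiver", "printemps", "automne"]
# UTF-8 sizes: "été" = 5 bytes, "hiver" = 5, "printemps" = 9, "automne" = 7.
kept = list(select_by_byte_budget(sample, budget_bytes=20))
print(kept)  # ['été', 'hiver', 'printemps']
```

On the real data, the input could be a generator such as `(example["text"] for example in shuffled_dataset)`, with the budget set to `4 * 1024 ** 3`.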


In [5]:
# Save in Arrow format (recommended for Hugging Face)
mini_dataset.save_to_disk("CamemBERT/data/mini_oscar/mini_dataset.arrow")
print("Mini-dataset saved in Arrow format")

Saving the dataset (0/9 shards):   0%|          | 0/676482 [00:00<?, ? examples/s]

Mini-dataset saved in Arrow format


**Load dataset**

In [15]:
from datasets import load_from_disk

# Load the dataset
dataset_path = "CamemBERT/data/mini_oscar/mini_dataset.arrow"  # replace with the actual path
dataset = load_from_disk(dataset_path)

# Show dataset info
print(f"Number of examples: {len(dataset)}")


Number of examples: 676482


**Check Size**

In [16]:
import os

# Arrow files backing the loaded dataset
cache_files = dataset.cache_files
# Total size in bytes
total_size_bytes = sum(os.path.getsize(file["filename"]) for file in cache_files)
# Convert to GB (computed here as 1024**3 bytes)
total_size_gb = total_size_bytes / (1024 ** 3)
print(f"Total size of the cached dataset: {total_size_gb:.2f} GB")

Total size of the cached dataset: 4.00 GB


In [21]:
mini_dataset = dataset.select(range(10))
mini_dataset

Dataset({
    features: ['id', 'text', 'meta'],
    num_rows: 10
})

In [18]:
# Convert the full dataset to a pandas DataFrame (loads every row into memory)
pandas_df = dataset.to_pandas()

In [20]:
pandas_df.head()

|   | id | text | meta |
|---|----|------|------|
| 0 | 50771362 | Gardez l’œil sur toutes les images publiées su... | {'warc_headers': {'warc-record-id': '<urn:uuid... |
| 1 | 35841498 | Feeder métal avec levier de serrage manuel, ve... | {'warc_headers': {'warc-record-id': '<urn:uuid... |
| 2 | 44962099 | Audio Lingua - mp3 en anglais, allemand, arabe... | {'warc_headers': {'warc-record-id': '<urn:uuid... |
| 3 | 31392137 | La troisième prestation est une revue ponctuel... | {'warc_headers': {'warc-record-id': '<urn:uuid... |
| 4 | 26786817 | Les machines à café Dolce Gusto sont les modèl... | {'warc_headers': {'warc-record-id': '<urn:uuid... |
