# Downloading the OSCAR Dataset for CamemBERT Pretraining

This notebook downloads the OSCAR dataset used to pretrain the **CamemBERT** language model.

## Prerequisites
1. Ensure you are logged into **Hugging Face**:
   - Run the following command in your terminal to authenticate:
     ```bash
     huggingface-cli login
     ```
   - This will prompt you to enter your Hugging Face token. You can generate a token from [Hugging Face Settings](https://huggingface.co/settings/tokens).

2. Accept the dataset usage conditions on Hugging Face:
   - Visit the [OSCAR dataset page](https://huggingface.co/datasets/oscar-corpus/OSCAR-2201) and click on **"Request Access"** if needed.

## Steps in this Notebook
- Download the OSCAR dataset in the desired language.
- Preprocess the dataset for training.
- Begin pretraining the CamemBERT model.

> Once you have completed the prerequisites, you can run the code below to proceed.


In [None]:
from huggingface_hub import whoami

# Check which Hugging Face account you are authenticated as
user_info = whoami()
print(f"You are logged in as: {user_info['name']}")

In [None]:
from datasets import load_dataset

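dataset = load_dataset("oscar-corpus/OSCAR-2201",
                        # token=True,  # uncomment if the saved login token is not picked up automatically (older releases name this use_auth_token)
                        language="fr",  # French subset
                        streaming=False,  # download the full split to disk instead of streaming it
                        split="train",
                        cache_dir='CamemBERT/data',  # where the downloaded files are cached
                        trust_remote_code=True)  # allows the dataset's loading script to run; may be required depending on the datasets version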


In [None]:
# Explore a few examples from the dataset
print("A few examples from the OSCAR dataset:\n")

# Iterate over the first records
for i, example in enumerate(dataset):
    print(f"Example {i + 1}: {example}\n")
    if i >= 1:  # stop after the first 2 examples
        break
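
OSCAR documents can be long, so printing whole records may flood the output. The following is a minimal sketch of a bounded preview, assuming each record is a dict with a `text` field (check this against your copy of the dataset). The helper name `preview_examples` and its `n` and `max_chars` parameters are hypothetical, and the `sample` list is stand-in data, not real OSCAR content.

```python
from itertools import islice


def preview_examples(records, n=5, max_chars=200):
    """Return up to `n` records with the `text` field truncated to `max_chars`."""
    previews = []
    for record in islice(records, n):
        shortened = dict(record)  # shallow copy so the original record is untouched
        text = shortened.get("text")
        if isinstance(text, str) and len(text) > max_chars:
            shortened["text"] = text[:max_chars] + "..."
        previews.append(shortened)
    return previews


# Stand-in records for illustration only (assumed shape: dict with a "text" key).
sample = [{"id": i, "text": "mot " * 100} for i in range(10)]

for i, example in enumerate(preview_examples(sample, n=3, max_chars=40), start=1):
    print(f"Example {i}: {example}")
```

Because `islice` only needs an iterable, passing the loaded `dataset` in place of `sample` should work the same way, and the loop stops after `n` records without a manual `break`.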