# XNLI Dataset Exploration

This notebook downloads the French subset of the XNLI dataset with the Hugging Face `datasets` library, inspects a few examples, and tokenizes premise/hypothesis pairs with the CamemBERT tokenizer. It also decodes `input_ids` back into tokens and wraps the data in a PyTorch `Dataset`.

## Install Required Libraries

First, we need to install the `datasets` and `transformers` libraries if they are not already installed.

In [None]:
!pip install datasets transformers

In [1]:
from datasets import load_dataset

language = 'fr'
cache_directory = "../../data/xnli"
# Download the XNLI dataset for the specified language
dataset = load_dataset("facebook/xnli", name=language, cache_dir=cache_directory)

print(dataset)

DatasetDict({
    train: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 392702
    })
    test: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 5010
    })
    validation: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 2490
    })
})


## Dataset Exploration

Let's look at some examples from the dataset to understand its structure.

In [2]:
# Display some examples from the dataset
print(dataset['train'][0])
print(dataset['validation'][0])
print(dataset['test'][0])

{'premise': "L' écrémage conceptuel de la crème a deux dimensions fondamentales : le produit et la géographie .", 'hypothesis': 'Le produit et la géographie sont ce qui fait travailler la crème de la crème .', 'label': 1}
{'premise': 'Et il a dit, maman, je suis à la maison.', 'hypothesis': "Il a appelé sa mère dès que le bus scolaire l'a déposé.", 'label': 1}
{'premise': "Eh bien, je ne pensais même pas à cela, mais j'étais si frustré, et j'ai fini par lui reparler.", 'hypothesis': 'Je ne lui ai pas parlé de nouveau', 'label': 2}
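The integer `label` values in the examples above are class indices. As a minimal sketch, assuming the usual XNLI label order on the Hugging Face Hub (0 = entailment, 1 = neutral, 2 = contradiction; this is worth confirming with `dataset['train'].features['label'].names`), the indices can be mapped to readable names. `LABEL_NAMES` and `describe` are illustrative names, not part of the notebook.

```python
# Assumed XNLI label order; verify against dataset['train'].features['label'].names.
LABEL_NAMES = ["entailment", "neutral", "contradiction"]


def describe(example):
    """Return a copy of an XNLI example with a human-readable label name added."""
    return {**example, "label_name": LABEL_NAMES[example["label"]]}


# The test example printed above, copied by hand so the sketch runs offline.
sample = {
    "premise": "Eh bien, je ne pensais même pas à cela, mais j'étais si frustré, et j'ai fini par lui reparler.",
    "hypothesis": "Je ne lui ai pas parlé de nouveau",
    "label": 2,
}
print(describe(sample)["label_name"])
```

Under that assumed mapping, the test example is a contradiction, which matches its content: the premise says the speaker talked to him again, the hypothesis says the opposite.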


## Decoding `input_ids`

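We tokenize a sample sentence with the CamemBERT tokenizer, then map the resulting `input_ids` back to tokens and to a decoded string. The sample is in English, so the French vocabulary splits some words into several subword pieces.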

In [14]:
from transformers import CamembertTokenizer

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")

# Example of input_ids
inputs = tokenizer("This is an example. sentence. \t hello sir", return_tensors="pt")
input_ids = inputs["input_ids"][0]

# Decode the input_ids back to a string and to individual tokens
decoded_sentence = tokenizer.decode(input_ids, skip_special_tokens=True)
tokens = tokenizer.convert_ids_to_tokens(input_ids)

print("Input IDs:", input_ids)
print("Tokens:", tokens)
print("Decoded sentence:", decoded_sentence)

Input IDs: tensor([    5, 17526,  2856,   674,  1017, 21598,     9, 22625,     9,   616,
         6974,    86,    81,     6])
Tokens: ['<s>', '▁This', '▁is', '▁an', '▁ex', 'ample', '.', '▁sentence', '.', '▁h', 'ello', '▁si', 'r', '</s>']
Decoded sentence: This is an example. sentence. hello sir
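The `▁` prefix in the token list marks the start of a new word, and pieces without it continue the previous word. The sketch below is a simplified approximation of how the decoded sentence relates to the tokens, not the tokenizer's actual decoding routine. `SPECIAL_TOKENS` and `join_pieces` are hypothetical names introduced for illustration.

```python
# Special tokens seen in the outputs of this notebook; dropped before joining.
SPECIAL_TOKENS = {"<s>", "</s>", "<pad>"}


def join_pieces(tokens):
    """Rebuild text from SentencePiece-style tokens, where '▁' marks a word start."""
    pieces = [t for t in tokens if t not in SPECIAL_TOKENS]
    return "".join(pieces).replace("▁", " ").strip()


# Token list copied from the output above.
tokens = ['<s>', '▁This', '▁is', '▁an', '▁ex', 'ample', '.', '▁sentence', '.',
          '▁h', 'ello', '▁si', 'r', '</s>']
print(join_pieces(tokens))
```

The tab character from the original input does not survive as a token, which is why it is also absent from the decoded sentence.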


## Conclusion

We have downloaded and explored the XNLI dataset, and we have also decoded the `input_ids` to see the corresponding tokens. You can now use this dataset for your natural language processing tasks.

In [4]:
len(dataset['train'])

392702

In [44]:
from transformers import CamembertTokenizer

# Load the CamemBERT tokenizer
tokenizer = CamembertTokenizer.from_pretrained("camembert-base")

# Example: prepare a premise and a hypothesis
premise = "Le ciel est bleu"
hypothesis = "Le ciel est de couleur bleue"

# Encode the pair as a single sequence
inputs = tokenizer(
    premise, hypothesis,
    max_length=128,        # Maximum sequence length (can be adjusted)
    truncation=True,       # Truncate if the sequence is too long
    padding="max_length",  # Pad to max_length (for batching)
    return_tensors="pt"    # Return PyTorch tensors
)

# Result
print(inputs)

{'input_ids': tensor([[   5,   54, 1918,   30, 1549,    6,    6,   54, 1918,   30,    8,  648,
         5251,    6,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0,

In [45]:
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
tokens

['<s>',
 '▁Le',
 '▁ciel',
 '▁est',
 '▁bleu',
 '</s>',
 '</s>',
 '▁Le',
 '▁ciel',
 '▁est',
 '▁de',
 '▁couleur',
 '▁bleue',
 '</s>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',

In [10]:
print(tokenizer.sep_token, tokenizer.sep_token_id, tokenizer.cls_token, tokenizer.mask_token, tokenizer.sep_token, tokenizer.pad_token)

</s> 6 <s> <mask> </s> <pad>
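Judging from the outputs above, a premise/hypothesis pair is laid out as `<s> premise </s></s> hypothesis </s>`, followed by `<pad>` up to `max_length`, with an attention mask of 1 on real tokens and 0 on padding. The following pure-Python sketch reproduces that layout from the ids visible in the printed tensor (5 for `<s>`, 6 for `</s>`, 1 for `<pad>`). `build_pair` is a hypothetical helper, not a tokenizer method, and it does not handle truncation.

```python
CLS_ID, SEP_ID, PAD_ID = 5, 6, 1  # ids of <s>, </s>, <pad> as seen in the outputs above


def build_pair(premise_ids, hypothesis_ids, max_length=128):
    """Lay out a sentence pair as <s> A </s></s> B </s>, then pad to max_length."""
    ids = [CLS_ID] + premise_ids + [SEP_ID, SEP_ID] + hypothesis_ids + [SEP_ID]
    if len(ids) > max_length:
        raise ValueError("pair is longer than max_length; truncation is not handled here")
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [PAD_ID] * n_pad, attention_mask


# Ids of "Le ciel est bleu" / "Le ciel est de couleur bleue", read off the printed tensor.
premise_ids = [54, 1918, 30, 1549]
hypothesis_ids = [54, 1918, 30, 8, 648, 5251]
input_ids, attention_mask = build_pair(premise_ids, hypothesis_ids)
print(input_ids[:16])
print("real tokens:", sum(attention_mask))
```

Summing the attention mask gives the number of non-padding positions, which is a quick way to check how much of each 128-token sequence is actually used.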


In [15]:
# import torch 
# import torch.nn as nn
# from torch.utils.data import DataLoader, Dataset
# from datasets import load_dataset
# from transformers import CamembertTokenizer

# class XNLIDataset(Dataset):
#     def __init__(self, split="train"):
#         super(XNLIDataset, self).__init__()

#         self.split = split
#         self.language = "fr"
#         self.cache_directory = "../data/xnli"
#         self.data = load_dataset(
#             "facebook/xnli",
#             name=self.language,
#             cache_dir=self.cache_directory
#         )
#         self.tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
#         self.max_length = 128

#     def __len__(self):
#         return len(self.data[self.split])

#     def __getitem__(self, idx) -> dict :
#         if self.split == "train":

#             inputs = self.tokenizer(
#                 self.data(self.split)[idx]["premise"],
#                 self.data(self.split)[idx]["hypothesis"],
#                 max_length=self.max_length,
#                 truncation=True,
#                 padding="max_length",
#                 return_tensors="pt"
#             )
#             return inputs
            
        
#         elif self.split == "test":
#             inputs = self.tokenizer(
#                 self.data(self.split)[idx]["premise"],
#                 self.data(self.split)[idx]["hypothesis"],
#                 max_length=self.max_length,
#                 truncation=True,
#                 padding="max_length",
#                 return_tensors="pt"
#             )
#             return inputs
#         elif self.split == "validation":
#             inputs = self.tokenizer(
#                 self.data(self.split)[idx]["premise"],
#                 self.data(self.split)[idx]["hypothesis"],
#                 max_length=self.max_length,
#                 truncation=True,
#                 padding="max_length",
#                 return_tensors="pt"
#             )
#             return inputs

In [17]:
import torch
from torch.utils.data import Dataset
from datasets import load_dataset
from transformers import CamembertTokenizer


class XNLIDataset(Dataset):
    def __init__(self, split="train", language="fr", tokenizer=None, cache_directory="../data/xnli", max_length=128):
        """
        PyTorch Dataset wrapping the XNLI dataset.

        Args:
            split (str): Data split ("train", "test", "validation").
            language (str): Target language.
            tokenizer: Tokenizer to use; defaults to CamemBERT ("camembert-base").
            cache_directory (str): Directory in which to store the downloaded dataset.
            max_length (int): Maximum length for padding/truncation.
        """
        super().__init__()
        self.split = split
        self.language = language
        self.cache_directory = cache_directory
        self.max_length = max_length

        # Load the data and keep only the requested split
        self.data = load_dataset(
            "facebook/xnli",
            name=self.language,
            cache_dir=self.cache_directory
        )[self.split]

        # Fall back to the CamemBERT tokenizer when none is passed in
        if tokenizer is None:
            tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
        self.tokenizer = tokenizer

    def __len__(self):
        """Return the number of examples in the split."""
        return len(self.data)

    def __getitem__(self, idx):
        """
        Fetch a single example.

        Args:
            idx (int): Index of the example.

        Returns:
            dict: Contains `input_ids`, `attention_mask` and `label`.
        """
        example = self.data[idx]
        inputs = self.tokenizer(
            example["premise"],
            example["hypothesis"],
            max_length=self.max_length,
            truncation=True,
            padding="max_length",
            return_tensors="pt"
        )

        # Drop the batch dimension added by return_tensors="pt", then attach the label
        inputs = {key: val.squeeze(0) for key, val in inputs.items()}
        inputs["label"] = torch.tensor(example["label"], dtype=torch.long)

        return inputs
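With `truncation=True`, pairs whose combined length exceeds `max_length` are shortened. A rough pure-Python sketch of that idea follows, assuming `truncation=True` resolves to the `longest_first` strategy and that four positions are reserved for the special tokens of a pair. `truncate_pair` and its tie-breaking rule are illustrative assumptions, not the library's implementation.

```python
def truncate_pair(a, b, max_length, num_special=4):
    """Trim one token at a time from the end of the longer sequence until the pair fits."""
    a, b = list(a), list(b)
    budget = max_length - num_special  # room left once <s>, </s></s>, </s> are added
    if budget < 0:
        raise ValueError("max_length is too small to hold the special tokens")
    while len(a) + len(b) > budget:
        # Assumption: on a tie, the second sequence is trimmed first.
        if len(a) > len(b):
            a.pop()
        else:
            b.pop()
    return a, b


# Toy ids: a 10-token premise and a 3-token hypothesis squeezed into 12 positions.
premise_ids = list(range(10))
hypothesis_ids = [100, 101, 102]
short_a, short_b = truncate_pair(premise_ids, hypothesis_ids, max_length=12)
print(short_a, short_b)
```

In this toy case only the longer premise is trimmed, so the short hypothesis is kept whole.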

In [None]:
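from torch.utils.data import DataLoader

xnli = XNLIDataset(split="train")

# Wrap the dataset in a DataLoader and pull one shuffled batch of 8 examples
data_loader = DataLoader(xnli, batch_size=8, shuffle=True)
batch = next(iter(data_loader))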

In [19]:
for key, val in batch.items():
    print(f"{key}: {val.size()}")

input_ids: torch.Size([8, 128])
attention_mask: torch.Size([8, 128])
label: torch.Size([8])
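The batch shapes above follow from stacking eight per-example dictionaries key by key. The sketch below mimics that grouping with plain lists instead of tensors, as an illustration of the idea rather than the `DataLoader`'s actual collate function. `collate`, `shape` and `toy_batch` are hypothetical names.

```python
def collate(samples):
    """Group a list of per-sample dicts into one dict of lists, key by key."""
    return {key: [s[key] for s in samples] for key in samples[0]}


def shape(x):
    """Shape of a regular nested list, e.g. [[0, 0], [0, 0]] -> (2, 2)."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        if not x:
            break
        x = x[0]
    return tuple(dims)


# Eight fake examples with the same fields and lengths as XNLIDataset items.
samples = [
    {"input_ids": [1] * 128, "attention_mask": [0] * 128, "label": 0}
    for _ in range(8)
]
toy_batch = collate(samples)
for key, val in toy_batch.items():
    print(key, shape(val))
```

Stacking like this only works because every example was padded to the same `max_length`; variable-length examples would need padding at batch time instead.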


In [None]:
import os
import re
import torch
from torch import nn
from d2l import torch as d2l

#@save
d2l.DATA_HUB['SNLI'] = (
    'https://nlp.stanford.edu/projects/snli/snli_1.0.zip',
    '9fcde07509c7e87ec61c640c1b2753d9041758e4')

data_dir = d2l.download_extract('SNLI')