## NLI dataset examples

This notebook helps explore some examples of NLI datasets for a better understanding of how they are constructed.

## Notebook setup

In [1]:
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer
import random
from datasets import load_dataset, Dataset
import pandas as pd
import textwrap

### Exploring NLI dataset example

In [2]:
mnli_dataset = load_dataset("nyu-mll/multi_nli")
mnli_dataset_train_df = mnli_dataset["train"].to_pandas()
display(mnli_dataset_train_df.head(3))
print('...')
display(mnli_dataset_train_df.tail(3))

Unnamed: 0,promptID,pairID,premise,premise_binary_parse,premise_parse,hypothesis,hypothesis_binary_parse,hypothesis_parse,genre,label
0,31193,31193n,Conceptually cream skimming has two basic dime...,( ( Conceptually ( cream skimming ) ) ( ( has ...,(ROOT (S (NP (JJ Conceptually) (NN cream) (NN ...,Product and geography are what make cream skim...,( ( ( Product and ) geography ) ( ( are ( what...,(ROOT (S (NP (NN Product) (CC and) (NN geograp...,government,1
1,101457,101457e,you know during the season and i guess at at y...,( you ( ( know ( during ( ( ( the season ) and...,(ROOT (S (NP (PRP you)) (VP (VBP know) (PP (IN...,You lose the things to the following level if ...,( You ( ( ( ( lose ( the things ) ) ( to ( the...,(ROOT (S (NP (PRP You)) (VP (VBP lose) (NP (DT...,telephone,0
2,134793,134793e,One of our number will carry out your instruct...,( ( One ( of ( our number ) ) ) ( ( will ( ( (...,(ROOT (S (NP (NP (CD One)) (PP (IN of) (NP (PR...,A member of my team will execute your orders w...,( ( ( A member ) ( of ( my team ) ) ) ( ( will...,(ROOT (S (NP (NP (DT A) (NN member)) (PP (IN o...,fiction,0


...


Unnamed: 0,promptID,pairID,premise,premise_binary_parse,premise_parse,hypothesis,hypothesis_binary_parse,hypothesis_parse,genre,label
392699,13960,13960e,Houseboats are a beautifully preserved traditi...,( Houseboats ( ( are ( ( a ( ( beautifully pre...,(ROOT (S (NP (NNS Houseboats)) (VP (VBP are) (...,The tradition of houseboats originated while t...,( ( ( The tradition ) ( of houseboats ) ) ( ( ...,(ROOT (S (NP (NP (DT The) (NN tradition)) (PP ...,travel,0
392700,114061,114061n,Obituaries fondly recalled his on-air debates ...,( Obituaries ( fondly ( ( ( ( recalled ( his (...,(ROOT (S (NP (NNS Obituaries)) (ADVP (RB fondl...,The obituaries were beautiful and written in k...,( ( The obituaries ) ( ( were ( ( beautiful an...,(ROOT (S (NP (DT The) (NNS obituaries)) (VP (V...,slate,1
392701,2065,2065n,in that other you know uh that i should do it ...,( ( ( ( in ( that other ) ) ( you ( know ( uh ...,(ROOT (SBAR (SBAR (WHPP (IN in) (WHNP (WDT tha...,My husband has been so overworked lately that ...,( ( My husband ) ( ( has ( ( been ( so overwor...,(ROOT (S (NP (PRP$ My) (NN husband)) (VP (VBZ ...,telephone,1


The cell below can be run multiple times to show random examples from the dataset:

In [5]:
example_index = random.randint(0, len(mnli_dataset_train_df))
label_mapping = {0: 'entailment', 1: 'neutral', 2: 'contradiction'}
label = mnli_dataset_train_df['label'][example_index]
print(f"{label} ({label_mapping[label]})")
print(textwrap.fill(mnli_dataset_train_df['premise'][example_index], width=120))
print(textwrap.fill(mnli_dataset_train_df['hypothesis'][example_index], width=120))
print(f"Genre: {mnli_dataset_train_df['genre'][example_index]}")

1 (neutral)
He didn't think Vrenna was a spy for the north but very few knew how to block someone like Susan.
Vrenna was secretive but was probably not a spy.
Genre: fiction


In [6]:
zs_classifier = pipeline("zero-shot-classification", model='facebook/bart-large-mnli', device=0)

nli_model = AutoModelForSequenceClassification.from_pretrained('facebook/bart-large-mnli', clean_up_tokenization_spaces=True)
nli_tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli', clean_up_tokenization_spaces=True)
print(nli_model.classification_head.out_proj)
print(nli_model.config.id2label)






Linear(in_features=1024, out_features=3, bias=True)
{0: 'contradiction', 1: 'neutral', 2: 'entailment'}


In [7]:
dataset = load_dataset("fancyzhx/ag_news", split="test")
id2labels = ["World", "Sports", "Business", "Sci/Tech"]
dataset = dataset.map(lambda x: {"class": id2labels[x["label"]]}, remove_columns=["label"])

In [8]:
print(dataset[0])

{'text': "Fears for T N pension after talks Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.", 'class': 'Business'}


In [9]:
dataset.to_pandas().head(3)

Unnamed: 0,text,class
0,Fears for T N pension after talks Unions repre...,Business
1,The Race is On: Second Private Team Sets Launc...,Sci/Tech
2,Ky. Company Wins Grant to Study Peptides (AP) ...,Sci/Tech


The train_dataset has double the records compared to the original dataset because of the way the create_input_sequence function is designed. Specifically, the function duplicates the input text and creates two sequences for each sample: one with the original label and one with a contradiction label.

Here's a step-by-step explanation:

Text Duplication: The text is duplicated by text*2.
Two Templates: Two sequences are created using the template, one with the original label and one with a contradiction label.
Encoding: The tokenizer encodes these two sequences, effectively doubling the number of records.
This results in each original sample generating two new samples in the train_dataset.

In [10]:
# Check the structure of the dataset
print(dataset[0])  # Print the first example in the dataset to see its structure

# Check the type of the 'text' field
print(type(dataset[0]["text"]))  # Verify if it is a string or something else

{'text': "Fears for T N pension after talks Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.", 'class': 'Business'}
<class 'str'>


### NLI label generation examples

 See nli-finetuning-ag-news-example.ipynb for full PoC

In [11]:
# Load the dataset
dataset = load_dataset("fancyzhx/ag_news", split="test")
id2labels = ["World", "Sports", "Business", "Sci/Tech"]
dataset = dataset.map(lambda x: {"class": id2labels[x["label"]]}, remove_columns=["label"])

# Select a random index and print the original content
random_index = random.randint(0, len(dataset) - 1)
print(dataset[random_index])

# Convert the dataset to a Pandas DataFrame
df = dataset.to_pandas()

# Add a new column for the entailment and contradiction examples
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli', clean_up_tokenization_spaces=True)
template = "This example is {}."

entailment_input_ids = []
contradiction_input_ids = []
attention_masks = []
labels_list = []
input_sentences = []

num_contradictions = 3

for index, row in df.iterrows():
    text = row["text"]
    label = row["class"]
    
    # Encode the entailment example
    encoded_text = tokenizer.encode(f"<s>{text}</s>", add_special_tokens=False)
    entailment_ids = encoded_text + tokenizer.encode(f" {template.format(label)}", add_special_tokens=False)
    
    # Add entailment example
    entailment_input_ids.append(entailment_ids)
    attention_masks.append([1] * len(entailment_ids))
    labels_list.append(2)  # Entailment label
    input_sentences.append(f"<s>{text}</s> {template.format(label)}")
    
    # Create contradiction examples
    possible_contradictions = [x for x in id2labels if x != label]
    selected_contradictions = random.sample(possible_contradictions, num_contradictions)
    
    for contradiction_label in selected_contradictions:
        contradiction_ids = encoded_text + tokenizer.encode(f" {template.format(contradiction_label)}", add_special_tokens=False)
        contradiction_input_ids.append(contradiction_ids)
        attention_masks.append([1] * len(contradiction_ids))
        labels_list.append(0)  # Contradiction label
        input_sentences.append(f"<s>{text}</s> {template.format(contradiction_label)}")

# Create a new DataFrame with the transformed data
transformed_df = pd.DataFrame({
    "input_ids": entailment_input_ids + contradiction_input_ids,
    "attention_mask": attention_masks,
    "labels": labels_list,
    "input_sentence": input_sentences
})

# Convert the transformed DataFrame back to a Dataset
transformed_dataset = Dataset.from_pandas(transformed_df)

# Print outputs for the selected random index
print('Entailment item:')
for key, value in transformed_dataset[random_index * (num_contradictions + 1)].items():
    print(f"{key}: {value}")
print('Contradiction item(s):')
for i in range(1, num_contradictions + 1):
    for key, value in transformed_dataset[random_index * (num_contradictions + 1) + i].items():
        print(f"{key}: {value}")

{'text': 'Eagles lead Cowboys 7-0 after first quarter Terrell Owens turned the first pass thrown to him into a 59-yard touchdown and gave the Philadelphia Eagles a 7-0 lead over the Dallas Cowboys after the first quarter Monday night.', 'class': 'Sports'}
Entailment item:
input_ids: [0, 37590, 2784, 16415, 7842, 13, 1005, 849, 3416, 131, 29, 244, 3345, 849, 3416, 131, 29, 382, 12, 6996, 884, 34, 156, 41, 4023, 22947, 196, 6221, 13, 796, 3949, 6408, 30, 5, 997, 7, 10854, 1459, 7, 244, 21020, 8, 9648, 39, 247, 4, 2, 152, 1246, 16, 2090, 4]
attention_mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
labels: 2
input_sentence: <s>Eagles lead Cowboys 7-0 after first quarter Terrell Owens turned the first pass thrown to him into a 59-yard touchdown and gave the Philadelphia Eagles a 7-0 lead over the Dallas Cowboys after the first quarter Monday night.</s> This example is Spo

In [12]:
# Load the dataset
dataset = load_dataset("fancyzhx/ag_news", split="test")
id2labels = ["World", "Sports", "Business", "Sci/Tech"]
dataset = dataset.map(lambda x: {"class": id2labels[x["label"]]}, remove_columns=["label"])

# Convert the dataset to a Pandas DataFrame
df = dataset.to_pandas()

# Add a new column for the entailment and contradiction examples
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli', clean_up_tokenization_spaces=True)
template = "This example is {}."

entailment_input_ids = []
contradiction_input_ids = []
attention_masks = []
labels_list = []
input_sentences = []

for index, row in df.iterrows():
    text = row["text"]
    label = row["class"]
    
    # Encode the entailment example
    encoded_text = tokenizer.encode(f"<s>{text}</s>", add_special_tokens=False)
    entailment_ids = encoded_text + tokenizer.encode(f" {template.format(label)}", add_special_tokens=False)
    
    # Add entailment example
    entailment_input_ids.append(entailment_ids)
    attention_masks.append([1] * len(entailment_ids))
    labels_list.append(2)  # Entailment label
    input_sentences.append(f"<s>{text}</s> {template.format(label)}")
    
    # Create contradiction examples
    possible_contradictions = [x for x in id2labels if x != label]
    selected_contradictions = random.sample(possible_contradictions, 1)
    
    for contradiction_label in selected_contradictions:
        contradiction_ids = encoded_text + tokenizer.encode(f" {template.format(contradiction_label)}", add_special_tokens=False)
        contradiction_input_ids.append(contradiction_ids)
        attention_masks.append([1] * len(contradiction_ids))
        labels_list.append(0)  # Contradiction label
        input_sentences.append(f"<s>{text}</s> {template.format(contradiction_label)}")

# Create a new DataFrame with the transformed data
transformed_df = pd.DataFrame({
    "input_ids": entailment_input_ids + contradiction_input_ids,
    "attention_mask": attention_masks,
    "labels": labels_list,
    "input_sentence": input_sentences
})

# Convert the transformed DataFrame back to a Dataset
transformed_dataset = Dataset.from_pandas(transformed_df)

# Print outputs
random_index = random.randint(0, len(transformed_dataset) // 2)
print('Entailment item:')
for key, value in transformed_dataset[random_index * 2].items():
    print(f"{key}: {value}")
print('Contradiction item:')
for key, value in transformed_dataset[random_index * 2 + 1].items():
    print(f"{key}: {value}")

Entailment item:
input_ids: [0, 10350, 8435, 12200, 4991, 7724, 23, 1144, 9, 11371, 849, 3416, 131, 29, 3345, 708, 255, 32205, 11371, 21, 2114, 10, 92, 3345, 1486, 94, 363, 71, 8560, 1283, 4373, 31, 624, 39, 308, 168, 14, 37, 21, 2449, 5, 247, 74, 28, 12662, 88, 7724, 71, 5, 1136, 9, 29315, 20442, 4, 2, 152, 1246, 16, 2090, 4]
attention_mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
labels: 2
input_sentence: <s>Blame player, not game It was like nothing youd ever exercised your thumbs to before. You could do whatever you wanted, whenever you wanted. The game seemed endless.</s> This example is Sci/Tech.
Contradiction item:
input_ids: [0, 250, 1792, 8629, 2585, 24038, 4665, 6101, 7, 43038, 11099, 2873, 360, 71, 4370, 12110, 27828, 5, 194, 6, 11, 10, 177, 14, 818, 222, 45, 185, 317, 6, 501, 212, 12, 8970, 9720, 8822, 378, 13, 5386, 158, 12, 466, 1124, 81, 440, 4, 2, 152, 1246, 16, 2090, 4]
attenti

In [14]:
print(dataset)
print(dataset[0])
print(dataset[1])
print(dataset[2])

Dataset({
    features: ['text', 'class'],
    num_rows: 7600
})
{'text': "Fears for T N pension after talks Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.", 'class': 'Business'}
{'text': 'The Race is On: Second Private Team Sets Launch Date for Human Spaceflight (SPACE.com) SPACE.com - TORONTO, Canada -- A second\\team of rocketeers competing for the  #36;10 million Ansari X Prize, a contest for\\privately funded suborbital space flight, has officially announced the first\\launch date for its manned rocket.', 'class': 'Sci/Tech'}
{'text': 'Ky. Company Wins Grant to Study Peptides (AP) AP - A company founded by a chemistry researcher at the University of Louisville won a grant to develop a method of producing better peptides, which are short chains of amino acids, the building blocks of proteins.', 'class': 'Sci/Tech'}


BACKUP CODE

In [15]:
random_index = random.randint(0, len(dataset))
# random_index = 0
print(dataset[random_index])
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli', clean_up_tokenization_spaces=True)
template = "This example is {}."

def create_input_sequence(sample):
  text = sample["text"]
  label = sample["class"][0]
  contradiction_label = random.choice([x for x in id2labels if x!=label])

  encoded_sequence = tokenizer(text*2, [template.format(label), template.format(contradiction_label)])
  encoded_sequence["labels"] = [2,0]
  encoded_sequence["input_sentence"] = tokenizer.batch_decode(encoded_sequence.input_ids)

  return encoded_sequence

train_dataset = dataset.map(create_input_sequence, batched=True, batch_size=1, remove_columns=["class", "text"])
print('Entailment item:') 
for key, value in train_dataset[random_index*2].items():
    print(f"{key}: {value}")
print('Contradiction item:')    
for key, value in train_dataset[random_index*2+1].items():
    print(f"{key}: {value}")

{'text': 'Congressman Spratt wants Fed to US Representative John Spratt of South Carolina said the Federal Reserve should go lightly #39; #39; on raising the benchmark interest rate because of the economy.', 'class': 'Business'}
Entailment item:
input_ids: [0, 25997, 397, 14933, 2611, 1072, 2337, 7, 382, 10308, 610, 14933, 2611, 9, 391, 1961, 26, 5, 1853, 3965, 197, 213, 14998, 849, 3416, 131, 849, 3416, 131, 15, 3282, 5, 5437, 773, 731, 142, 9, 5, 866, 4, 2, 2, 713, 1246, 16, 2090, 4, 2]
attention_mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
labels: 2
input_sentence: <s>Congressman Spratt wants Fed to US Representative John Spratt of South Carolina said the Federal Reserve should go lightly #39; #39; on raising the benchmark interest rate because of the economy.</s></s>This example is Business.</s>
Contradiction item:
input_ids: [0, 25997, 397, 14933, 2611, 1072, 2337, 7, 382, 103