# Bootstrap few-shot CoT demonstrations for IndicXNLI

IndicXNLI is a natural language inference (NLI) dataset for 11 Indian languages, created by high-quality machine translation of the original English XNLI dataset.

This notebook starts with a very simple Chain-of-Thought-based module for IndicXNLI.

We found that bootstrapping demonstrations with DSPy raised test accuracy from 44.0% to 51.0% on 100 Hindi test examples, a relative improvement of 15.9% (7 percentage points). This takes a single compilation step using dspy.BootstrapFewShotWithRandomSearch.
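
The 15.9% figure matches a relative gain computed from the two test accuracies reported later in this notebook (44.0% zero-shot, 51.0% few-shot). A minimal check of that arithmetic, with variable names chosen here for illustration:

```python
# Test accuracies reported later in this notebook (100 test examples each).
zero_shot_acc = 44.0
few_shot_acc = 51.0

# Absolute gain is measured in percentage points.
absolute_gain = few_shot_acc - zero_shot_acc
# Relative gain is the absolute gain as a percentage of the baseline.
relative_gain = 100 * absolute_gain / zero_shot_acc

print(f"absolute gain: {absolute_gain:.1f} points")
print(f"relative gain: {relative_gain:.1f}%")
```

With only 100 test examples, a 7-point difference corresponds to 7 examples, so the estimate is likely to be noisy.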

## Set-up

In [1]:
import os
import random

import pandas as pd

import dspy
from dspy.evaluate import Evaluate
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

In [2]:
os.environ["DSP_NOTEBOOK_CACHEDIR"] = os.path.join('.', 'cache')

In [3]:
# We'll rely on turbo for everything:

turbo = dspy.OpenAI(model='gpt-3.5-turbo-1106', max_tokens=250, model_type='chat')

dspy.settings.configure(lm=turbo)

In [17]:
# Setting this to True reruns the bootstrapping process. When it is
# False, the saved demonstrations are loaded instead, but turbo is
# still used to evaluate the zero-shot and few-shot programs.
RUN_FROM_SCRATCH = True

## IndicXNLI

In [5]:
from datasets import load_dataset
dataset = load_dataset('Divyanshu/indicxnli', 'hi')

## Data loader

In [6]:
def load_indicxlni(dataset, split="validation"):
    
    data_df = pd.DataFrame(dataset[split])
    label_map = {0: "Yes", 1: "Neutral", 2: "No"}
    def as_example(row): 
        return dspy.Example({
            "premise": row['premise'],
            "hypothesis": row['hypothesis'],
            "answer": label_map[row['label']]
        }).with_inputs("premise", "hypothesis")

    return list(data_df.apply(as_example, axis=1).values)
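
The loader's `label_map` turns integer labels into the answer strings the signature expects. The sketch below shows the same mapping on plain dictionaries, without pandas or DSPy. It assumes IndicXNLI keeps XNLI's integer convention (0 = entailment, 1 = neutral, 2 = contradiction). The English rows and the `to_record` helper are hypothetical and are not taken from the dataset or the notebook.

```python
# Assumed XNLI convention: 0 = entailment, 1 = neutral, 2 = contradiction.
LABEL_MAP = {0: "Yes", 1: "Neutral", 2: "No"}


def to_record(row):
    """Convert one raw row (a plain dict) into the fields the loader emits."""
    return {
        "premise": row["premise"],
        "hypothesis": row["hypothesis"],
        "answer": LABEL_MAP[row["label"]],
    }


# Hypothetical English rows, for illustration only.
rows = [
    {"premise": "A man is playing a guitar.",
     "hypothesis": "A person is making music.", "label": 0},
    {"premise": "A man is playing a guitar.",
     "hypothesis": "The man is on a stage.", "label": 1},
    {"premise": "A man is playing a guitar.",
     "hypothesis": "Nobody is playing an instrument.", "label": 2},
]

records = [to_record(r) for r in rows]
for record in records:
    print(record["answer"], "<-", record["hypothesis"])
```

In the notebook itself, each record is additionally wrapped in `dspy.Example` and `premise` and `hypothesis` are marked as inputs, so that `answer` is treated as the label.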

## Train and dev samples

In [7]:
all_train = load_indicxlni(dataset, "train")
all_dev = load_indicxlni(dataset, "validation")

random.seed(1)
random.shuffle(all_train)
random.shuffle(all_dev)

# 200 random train, 50 random dev:
train, dev = all_train[: 200], all_dev[200: 250]

len(train), len(dev)

(200, 50)

## Test

In [8]:
random.seed(1)

test = load_indicxlni(dataset, "test")

# First 100 test examples (the test list is not shuffled, so the seed above has no effect here):
test = test[: 100]
len(test)

100

## Evaluation tools

In [9]:
indicxlni_accuracy = dspy.evaluate.metrics.answer_exact_match
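
An exact-match metric only gives credit when the predicted answer equals the gold label after light normalization. The sketch below is a simplified stand-in that illustrates the idea. It is not DSPy's implementation, whose normalization rules may differ, and the names `normalize` and `simple_exact_match` are hypothetical.

```python
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def simple_exact_match(gold, predicted):
    """Return True only if the two strings are equal after normalization."""
    return normalize(gold) == normalize(predicted)


print(simple_exact_match("Yes", "yes."))      # casing and punctuation are ignored
print(simple_exact_match("Neutral", "No"))    # a different label does not match
print(simple_exact_match("No", "No, because the premise says otherwise"))
```

Under a metric of this kind, a correct but verbose answer such as the third case scores zero, which is one reason the signature constrains the output to "Yes or No or Neutral".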

In [10]:
evaluator = Evaluate(devset=test, num_threads=1, display_progress=True, display_table=0)

## Zero-shot CoT

In [11]:
class IndicXLNISignature(dspy.Signature):
    ("""You are given a premise and a hypothesis. """
    """You must indicate with Yes/No/Neutral answer whether we can logically """
    """conclude the hypothesis from the premise.""")

    premise = dspy.InputField()
    hypothesis = dspy.InputField()
    answer = dspy.OutputField(desc="Yes or No or Neutral")

In [12]:
class IndicXLNICoT(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_answer = dspy.ChainOfThought(IndicXLNISignature)

    def forward(self, premise, hypothesis):
        return self.generate_answer(premise=premise, hypothesis=hypothesis)

In [13]:
cot_zeroshot = IndicXLNICoT()

In [14]:
evaluator(cot_zeroshot, metric=indicxlni_accuracy)

Average Metric: 44 / 100  (44.0): 100%|███████| 100/100 [02:00<00:00,  1.20s/it]

Average Metric: 44 / 100  (44.0%)

44.0

## Optimized few-shot with bootstrapped demonstrations

In [15]:
bootstrap_optimizer = BootstrapFewShotWithRandomSearch(
    max_bootstrapped_demos=8,
    max_labeled_demos=8,
    num_candidate_programs=10,
    num_threads=8,
    metric=indicxlni_accuracy)

Going to sample between 1 and 8 traces per predictor.
Will attempt to train 10 candidate sets.


In [18]:
if RUN_FROM_SCRATCH:
    cot_fewshot = bootstrap_optimizer.compile(cot_zeroshot, trainset=train, valset=dev)
else:
    cot_fewshot = IndicXLNICoT()
    cot_fewshot.load("indicxlni-cot_fewshot-turbo-gpt3.5-demos.json")

Average Metric: 16 / 50  (32.0): 100%|██████████| 50/50 [00:29<00:00,  1.72it/s]


Average Metric: 16 / 50  (32.0%)
Score: 32.0 for set: [0]
New best score: 32.0 for seed -3
Scores so far: [32.0]
Best score: 32.0


Average Metric: 18 / 50  (36.0): 100%|██████████| 50/50 [00:28<00:00,  1.73it/s]


Average Metric: 18 / 50  (36.0%)
Score: 36.0 for set: [8]
New best score: 36.0 for seed -2
Scores so far: [32.0, 36.0]
Best score: 36.0


  6%|██▌                                       | 12/200 [00:35<09:21,  2.99s/it]


Bootstrapped 8 full traces after 13 examples in round 0.


Average Metric: 17 / 50  (34.0): 100%|██████████| 50/50 [00:28<00:00,  1.73it/s]


Average Metric: 17 / 50  (34.0%)
Score: 34.0 for set: [8]
Scores so far: [32.0, 36.0, 34.0]
Best score: 36.0
Average of max per entry across top 1 scores: 0.36
Average of max per entry across top 2 scores: 0.46
Average of max per entry across top 3 scores: 0.52
Average of max per entry across top 5 scores: 0.52
Average of max per entry across top 8 scores: 0.52
Average of max per entry across top 9999 scores: 0.52


  6%|██▌                                       | 12/200 [00:37<09:51,  3.15s/it]


Bootstrapped 7 full traces after 13 examples in round 0.


Average Metric: 19 / 50  (38.0): 100%|██████████| 50/50 [00:29<00:00,  1.71it/s]


Average Metric: 19 / 50  (38.0%)
Score: 38.0 for set: [8]
New best score: 38.0 for seed 0
Scores so far: [32.0, 36.0, 34.0, 38.0]
Best score: 38.0
Average of max per entry across top 1 scores: 0.38
Average of max per entry across top 2 scores: 0.52
Average of max per entry across top 3 scores: 0.58
Average of max per entry across top 5 scores: 0.62
Average of max per entry across top 8 scores: 0.62
Average of max per entry across top 9999 scores: 0.62


  2%|▊                                          | 4/200 [00:10<08:28,  2.59s/it]


Bootstrapped 3 full traces after 5 examples in round 0.


Average Metric: 21 / 50  (42.0): 100%|██████████| 50/50 [00:28<00:00,  1.73it/s]


Average Metric: 21 / 50  (42.0%)
Score: 42.0 for set: [8]
New best score: 42.0 for seed 1
Scores so far: [32.0, 36.0, 34.0, 38.0, 42.0]
Best score: 42.0
Average of max per entry across top 1 scores: 0.42
Average of max per entry across top 2 scores: 0.52
Average of max per entry across top 3 scores: 0.58
Average of max per entry across top 5 scores: 0.66
Average of max per entry across top 8 scores: 0.66
Average of max per entry across top 9999 scores: 0.66


  1%|▍                                          | 2/200 [00:08<14:09,  4.29s/it]


Bootstrapped 1 full traces after 3 examples in round 0.


Average Metric: 23 / 50  (46.0): 100%|██████████| 50/50 [00:28<00:00,  1.73it/s]


Average Metric: 23 / 50  (46.0%)
Score: 46.0 for set: [8]
New best score: 46.0 for seed 2
Scores so far: [32.0, 36.0, 34.0, 38.0, 42.0, 46.0]
Best score: 46.0
Average of max per entry across top 1 scores: 0.46
Average of max per entry across top 2 scores: 0.56
Average of max per entry across top 3 scores: 0.6
Average of max per entry across top 5 scores: 0.7
Average of max per entry across top 8 scores: 0.72
Average of max per entry across top 9999 scores: 0.72


  3%|█▎                                         | 6/200 [00:17<09:33,  2.96s/it]


Bootstrapped 4 full traces after 7 examples in round 0.


Average Metric: 20 / 50  (40.0): 100%|██████████| 50/50 [00:29<00:00,  1.71it/s]


Average Metric: 20 / 50  (40.0%)
Score: 40.0 for set: [8]
Scores so far: [32.0, 36.0, 34.0, 38.0, 42.0, 46.0, 40.0]
Best score: 46.0
Average of max per entry across top 1 scores: 0.46
Average of max per entry across top 2 scores: 0.56
Average of max per entry across top 3 scores: 0.66
Average of max per entry across top 5 scores: 0.74
Average of max per entry across top 8 scores: 0.78
Average of max per entry across top 9999 scores: 0.78


  4%|█▉                                         | 9/200 [00:24<08:30,  2.67s/it]


Bootstrapped 4 full traces after 10 examples in round 0.


Average Metric: 16 / 50  (32.0): 100%|██████████| 50/50 [00:29<00:00,  1.68it/s]


Average Metric: 16 / 50  (32.0%)
Score: 32.0 for set: [8]
Scores so far: [32.0, 36.0, 34.0, 38.0, 42.0, 46.0, 40.0, 32.0]
Best score: 46.0
Average of max per entry across top 1 scores: 0.46
Average of max per entry across top 2 scores: 0.56
Average of max per entry across top 3 scores: 0.66
Average of max per entry across top 5 scores: 0.74
Average of max per entry across top 8 scores: 0.8
Average of max per entry across top 9999 scores: 0.8


  8%|███▏                                      | 15/200 [00:35<07:15,  2.35s/it]


Bootstrapped 5 full traces after 16 examples in round 0.


Average Metric: 21 / 50  (42.0): 100%|██████████| 50/50 [00:29<00:00,  1.68it/s]


Average Metric: 21 / 50  (42.0%)
Score: 42.0 for set: [8]
Scores so far: [32.0, 36.0, 34.0, 38.0, 42.0, 46.0, 40.0, 32.0, 42.0]
Best score: 46.0
Average of max per entry across top 1 scores: 0.46
Average of max per entry across top 2 scores: 0.56
Average of max per entry across top 3 scores: 0.62
Average of max per entry across top 5 scores: 0.72
Average of max per entry across top 8 scores: 0.8
Average of max per entry across top 9999 scores: 0.82


  1%|▍                                          | 2/200 [00:04<07:07,  2.16s/it]


Bootstrapped 2 full traces after 3 examples in round 0.


Average Metric: 8 / 28  (28.6):  54%|█████▉     | 27/50 [00:16<00:13,  1.73it/s]

Backing off 0.8 seconds after 1 tries calling function <function GPT3.request at 0x7fb29b690dc0> with kwargs {}


Average Metric: 21 / 50  (42.0): 100%|██████████| 50/50 [00:29<00:00,  1.67it/s]


Average Metric: 21 / 50  (42.0%)
Score: 42.0 for set: [8]
Scores so far: [32.0, 36.0, 34.0, 38.0, 42.0, 46.0, 40.0, 32.0, 42.0, 42.0]
Best score: 46.0
Average of max per entry across top 1 scores: 0.46
Average of max per entry across top 2 scores: 0.56
Average of max per entry across top 3 scores: 0.62
Average of max per entry across top 5 scores: 0.74
Average of max per entry across top 8 scores: 0.82
Average of max per entry across top 9999 scores: 0.86


  8%|███▎                                      | 16/200 [00:42<08:07,  2.65s/it]


Bootstrapped 6 full traces after 17 examples in round 0.


Average Metric: 16 / 50  (32.0): 100%|██████████| 50/50 [00:29<00:00,  1.68it/s]


Average Metric: 16 / 50  (32.0%)
Score: 32.0 for set: [8]
Scores so far: [32.0, 36.0, 34.0, 38.0, 42.0, 46.0, 40.0, 32.0, 42.0, 42.0, 32.0]
Best score: 46.0
Average of max per entry across top 1 scores: 0.46
Average of max per entry across top 2 scores: 0.56
Average of max per entry across top 3 scores: 0.62
Average of max per entry across top 5 scores: 0.74
Average of max per entry across top 8 scores: 0.82
Average of max per entry across top 9999 scores: 0.86


  9%|███▊                                      | 18/200 [00:45<07:40,  2.53s/it]


Bootstrapped 4 full traces after 19 examples in round 0.


Average Metric: 16 / 50  (32.0): 100%|██████████| 50/50 [00:29<00:00,  1.69it/s]


Average Metric: 16 / 50  (32.0%)
Score: 32.0 for set: [8]
Scores so far: [32.0, 36.0, 34.0, 38.0, 42.0, 46.0, 40.0, 32.0, 42.0, 42.0, 32.0, 32.0]
Best score: 46.0
Average of max per entry across top 1 scores: 0.46
Average of max per entry across top 2 scores: 0.56
Average of max per entry across top 3 scores: 0.62
Average of max per entry across top 5 scores: 0.74
Average of max per entry across top 8 scores: 0.82
Average of max per entry across top 9999 scores: 0.86


  6%|██▎                                       | 11/200 [00:28<08:10,  2.59s/it]


Bootstrapped 8 full traces after 12 examples in round 0.


Average Metric: 20 / 50  (40.0): 100%|██████████| 50/50 [00:28<00:00,  1.74it/s]

Average Metric: 20 / 50  (40.0%)
Score: 40.0 for set: [8]
Scores so far: [32.0, 36.0, 34.0, 38.0, 42.0, 46.0, 40.0, 32.0, 42.0, 42.0, 32.0, 32.0, 40.0]
Best score: 46.0
Average of max per entry across top 1 scores: 0.46
Average of max per entry across top 2 scores: 0.56
Average of max per entry across top 3 scores: 0.62
Average of max per entry across top 5 scores: 0.74
Average of max per entry across top 8 scores: 0.86
Average of max per entry across top 9999 scores: 0.9
13 candidate programs found.
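
The "Average of max per entry across top K scores" lines in the log above read like an oracle-style upper bound: rank the candidate programs by mean dev score, keep the top K, give each dev example the best score any of those K programs achieved on it, and average. This interpretation is an assumption based on the log text, although it is consistent with the top-1 value always equalling the best score (0.46 for 46.0). The function name and the toy scores below are hypothetical.

```python
def average_of_max_per_entry(all_subscores, k):
    """Oracle accuracy of the k best programs.

    all_subscores holds one list of per-example scores for each program.
    """
    # Rank programs by their mean score, best first.
    ranked = sorted(all_subscores, key=lambda s: sum(s) / len(s), reverse=True)
    top_k = ranked[:k]
    # For every example, keep the best score among the top-k programs.
    per_example_best = [max(entry) for entry in zip(*top_k)]
    return sum(per_example_best) / len(per_example_best)


# Hypothetical per-example scores (1 = correct) for three programs on four examples.
subscores = [
    [1, 0, 0, 1],  # mean 0.50
    [0, 1, 0, 1],  # mean 0.50
    [0, 0, 1, 0],  # mean 0.25
]

for k in (1, 2, 3):
    print(k, average_of_max_per_entry(subscores, k))
```

Read this way, the gap between the top-1 value (0.46) and the value over all candidates (0.9) would suggest that the candidate programs succeed on quite different dev examples, even though no single program scores above 46%.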





In [19]:
evaluator(cot_fewshot, metric=indicxlni_accuracy)

Average Metric: 51 / 100  (51.0): 100%|███████| 100/100 [04:13<00:00,  2.54s/it]

Average Metric: 51 / 100  (51.0%)

51.0

In [20]:
cot_fewshot.save("indicxlni-cot_fewshot-turbo-gpt3.5-demos.json")

## Example prompt with prediction

In [21]:
turbo.inspect_history(n=1)

You are given a premise and a hypothesis. You must indicate with Yes/No/Neutral answer whether we can logically conclude the hypothesis from the premise.

---

Follow the following format.

Premise: ${premise}

Hypothesis: ${hypothesis}

Reasoning: Let's think step by step in order to ${produce the answer}. We ...

Answer: Yes or No or Neutral

---

Premise: ब्रिटेन द्वारा गुआडलूप पर कब्जा करने के बाद, विक्टर ह्यूजेस, कन्वेंशन के कमिश्नर ने द्वीप को वापस ले लिया, गुलामी को समाप्त करने की घोषणा की और पुराने गार्ड कोलोन को गिलॉटिंग करने के बारे में सेट किया।

Hypothesis: गुडेलोप ने ब्रिटेन पर कब्जा कर लिया।

Reasoning: Let's think step by step in order to produce the answer. We know that after Britain took control of Guadeloupe, Victor Hugues, the commissioner of the Convention, took the island back, declared an end to slavery, and set up the old guard colony for guillotining.

Answer: No

---

Premise: कर अपने युद्धों के लिए भुगतान करने के लिए बढ़ गया, और अधिक से अधिक किसानों को अपन