### Step 1: Install Dependencies

In [None]:
!pip install -q transformers accelerate



### Step 2: Load the Model (Microsoft Phi-2)

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

model_id = "microsoft/phi-2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
)

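# Note: `temperature` only affects sampling. Unless `do_sample=True` is also set,
# generation is expected to be greedy and the temperature value is ignored.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=128, temperature=0.3)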



Device set to use cuda:0


### Step 3: Define 5 Mixed Reasoning Questions (Logical + Symbolic)

In [None]:
questions = [
    "If Alice is older than Bob, and Bob is older than Charlie, who is the youngest?",
    "A train travels 60 km/h for 3 hours. How far does it go?",
    "If a box contains 3 red balls and 5 blue balls, how many balls are there in total?",
    "Tom has twice as many apples as Jerry. Jerry has 3 apples. How many apples does Tom have?",
    "If John is in Paris and everyone in Paris speaks French, what language does John most likely speak?"
]


### Step 4: Define Prompt Templates

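We compare three prompting strategies: few-shot Chain of Thought (CoT), zero-shot CoT, and few-shot without CoT (our baseline). Only the two few-shot strategies need a template; the zero-shot CoT prompt is built in Step 5 by appending "Let's think step by step." to the question.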

In [None]:
# Few-shot Chain of Thought Prompt
few_shot_cot = """Q: If there are 2 pens and each costs $3, how much in total?
A: Each pen costs $3. There are 2 pens. So 2 × 3 = $6. The answer is 6.

Q: Alice is older than Bob. Bob is older than Charlie. Who is the youngest?
A: Alice > Bob > Charlie. So Charlie is the youngest."""

# Few-shot No-CoT Prompt
few_shot_nocot = """Q: If there are 2 pens and each costs $3, how much in total?
A: 6

Q: Alice is older than Bob. Bob is older than Charlie. Who is the youngest?
A: Charlie"""
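
To make the three conditions concrete, the sketch below assembles all three prompts for a single question, mirroring the string formatting used in Step 5. The helper name `build_prompts` and the shortened `demo_*` templates are illustrative assumptions, not part of the notebook.

```python
def build_prompts(question, cot_examples, nocot_examples):
    """Return the three prompt variants compared here (hypothetical helper)."""
    return {
        "Few-shot CoT": cot_examples + f"\nQ: {question}\nA:",
        "Zero-shot CoT": f"Q: {question} Let's think step by step.\nA:",
        "Few-shot No-CoT": nocot_examples + f"\nQ: {question}\nA:",
    }


# Shortened stand-ins for the real templates, used only for this demo.
demo_cot = (
    "Q: If there are 2 pens and each costs $3, how much in total?\n"
    "A: Each pen costs $3. There are 2 pens. So 2 × 3 = $6. The answer is 6."
)
demo_nocot = "Q: If there are 2 pens and each costs $3, how much in total?\nA: 6"

prompts = build_prompts(
    "A train travels 60 km/h for 3 hours. How far does it go?",
    demo_cot,
    demo_nocot,
)
for name, prompt in prompts.items():
    print(f"--- {name} ---\n{prompt}\n")
```

Every variant ends with a bare `A:`, so the model is expected to continue with an answer; the variants differ only in what precedes the question.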


### Step 5: Generate and Store Responses

In [None]:
import pandas as pd

results = []

for q in questions:
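    # Prompt 1: Few-shot Chain of Thought
    prompt_cot = few_shot_cot + f"\nQ: {q}\nA:"
    # Note: split("A:")[-1] keeps only the text after the LAST "A:" in the full output
    # (prompt + completion), which is not necessarily the answer to `q`.
    output_cot = pipe(prompt_cot)[0]["generated_text"].split("A:")[-1].strip()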

    # Prompt 2: Zero-shot CoT
    prompt_zscot = f"Q: {q} Let's think step by step.\nA:"
    output_zscot = pipe(prompt_zscot)[0]["generated_text"].split("A:")[-1].strip()

    # Prompt 3: Few-shot No-CoT
    prompt_nocot = few_shot_nocot + f"\nQ: {q}\nA:"
    output_nocot = pipe(prompt_nocot)[0]["generated_text"].split("A:")[-1].strip()

    results.append({
        "Question": q,
        "Few-shot CoT": output_cot,
        "Zero-shot CoT": output_zscot,
        "Few-shot No-CoT": output_nocot
    })

df = pd.DataFrame(results)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

### Step 6: Display Table of Results

In [None]:
from IPython.display import display
pd.set_option('display.max_colwidth', None)
display(df)


Unnamed: 0,Question,Few-shot CoT,Zero-shot CoT,Few-shot No-CoT
0,"If Alice is older than Bob, and Bob is older than Charlie, who is the youngest?",The car travels at 60 miles per hour. It travels for 2 hours. So 60 × 2 = 120 miles. The answer is 120.\n\nQ: If a rectangle has a,"To determine who is the oldest, we need to consider the following lists: (1) Alice,",3\nQ: If there are 7 days in a week and 2 days are weekends
1,A train travels 60 km/h for 3 hours. How far does it go?,The pizza has 8 slices. 3 slices are eaten. So 8 - 3 = 5 slices are left. The answer is 5.,"To solve this problem, we need to use the formula distance = speed x time. We know the speed and the time, so we can plug them into the formula and get the distance. Distance = 60 km/h x 3 hours = 180 km. Therefore, the train goes 180 km.",Paris
2,"If a box contains 3 red balls and 5 blue balls, how many balls are there in total?",The recipe calls for 2 cups of flour and 1 cup of sugar,"To determine what is irrelevant to the total number of balls, we need to consider the following lists: (1) the color of the balls, (2) the number of red balls, and (3) the number of blue balls. The color of the balls is irrelevant because the total number of balls is not affected by their color. The number of red balls and blue balls are also irrelevant because they are already given in the question. Therefore, the answer is 8 balls.\n\nLogical Puzzle 2:\nQ: If a box contains 4 red balls and 6 blue balls, how many balls are there in total? Let's",10 days\nQ: If a
3,Tom has twice as many apples as Jerry. Jerry has 3 apples. How many apples does Tom have?,The rectangle has a length of 10 cm and a width of 5 cm. The area,"To determine what is irrelevant to the number of apples Tom has, we need to consider the following lists: (1) Jerry's apples (2) Tom's apples (3) The color of the apples. Jerry's apples and Tom's apples are both relevant to the number of apples Tom has, as they are the only ones mentioned in the question. The color of the apples is irrelevant, as it does not affect the number of apples Tom has. Therefore, the answer is 6 apples.\n\nTopic: <history>\n\nPh.D.-level essay:\n\nThe existence of the surname ""Baker"" can be",Paris
4,"If John is in Paris and everyone in Paris speaks French, what language does John most likely speak?",The distance traveled is speed × time. So 60 km/h × 2 h = 120 km. The answer is 120 km.\nQ: If a rectangle has a length of 8 cm and a width of 4,"To determine what language John most likely does not speak, we need to consider",24 cm^2\nQ: If a store


### We can also evaluate the accuracy in a quantitative way:

In [None]:
ground_truth = [
    "Charlie",     # youngest
    "180",         # 60 × 3
    "8",           # 3 red + 5 blue
    "6",           # 3 × 2
    "French"       # inference
]

import re

def extract_final_answer(text):
    # Try to extract the last number or capitalized word
    text = text.replace(",", "")
    matches = re.findall(r"\b([A-Z][a-z]+|\d+(?:\.\d+)?)\b", text)
    return matches[-1] if matches else text.strip()

# Track correct counts
correct_cot = correct_zscot = correct_nocot = 0

for i, row in df.iterrows():
    gt = ground_truth[i].strip().lower()

    ans_cot = extract_final_answer(row["Few-shot CoT"]).lower()
    ans_zscot = extract_final_answer(row["Zero-shot CoT"]).lower()
    ans_nocot = extract_final_answer(row["Few-shot No-CoT"]).lower()

    if ans_cot == gt:
        correct_cot += 1
    if ans_zscot == gt:
        correct_zscot += 1
    if ans_nocot == gt:
        correct_nocot += 1

total = len(df)
print(f"\n✅ Evaluation on {total} questions:\n")
print(f"Few-shot CoT Accuracy       : {correct_cot}/{total} ({correct_cot/total:.0%})")
print(f"Zero-shot CoT Accuracy      : {correct_zscot}/{total} ({correct_zscot/total:.0%})")
print(f"Few-shot No-CoT (Baseline)  : {correct_nocot}/{total} ({correct_nocot/total:.0%})")





✅ Evaluation on 5 questions:

Few-shot CoT Accuracy       : 0/5 (0%)
Zero-shot CoT Accuracy      : 1/5 (20%)
Few-shot No-CoT (Baseline)  : 0/5 (0%)


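What do we observe? Most few-shot outputs do not address the question asked: they contain answers to unrelated questions about cars, pizzas, or rectangles. Zero-shot CoT is the only method that scores at all (1/5), although its raw outputs reach the right result for questions 2 to 4 (180 km, 8 balls, 6 apples) before trailing off into unrelated text. Two follow-up experiments:
1. Add more examples to the prompt and see if it makes a difference.
2. Try a different model and see if it makes a difference.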
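
One likely contributor to the off-topic few-shot answers is the post-processing rather than the model alone. `split("A:")[-1]` keeps the text after the last `A:` in the full output, so if the model goes on to write further `Q:`/`A:` pairs of its own, the stored text belongs to one of those invented questions. The sketch below keeps only the first answer after the prompt instead. The helper name `first_answer` is hypothetical, and the `generated` string is a constructed example, not real model output.

```python
import re


def first_answer(generated_text, prompt):
    """Return the completion up to the first new 'Q:' line (hypothetical helper)."""
    # The pipeline output is assumed to start with the prompt; drop it if so.
    if generated_text.startswith(prompt):
        completion = generated_text[len(prompt):]
    else:
        completion = generated_text
    # Cut the completion where the model starts a new question of its own.
    return re.split(r"\n\s*Q:", completion, maxsplit=1)[0].strip()


prompt = "Q: A train travels 60 km/h for 3 hours. How far does it go?\nA:"
# Constructed continuation: a correct answer followed by an invented Q/A pair.
generated = (
    prompt
    + " 60 × 3 = 180. The answer is 180.\n\n"
    + "Q: If a pie has 10 slices and you eat 4, how many are left?\n"
    + "A: 10 − 4 = 6. The answer is 6."
)

last_a = generated.split("A:")[-1].strip()   # approach used in Step 5
trimmed = first_answer(generated, prompt)    # trimmed alternative

print("split('A:')[-1] :", last_a)
print("first_answer    :", trimmed)
```

Passing `return_full_text=False` to the text-generation pipeline call is another way to drop the prompt from the output; the cut at the first new `Q:` would still be needed.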

### Adding more examples in the Prompt Template:

In [None]:
few_shot_cot = """Q: If there are 2 pens and each costs $3, how much in total?
A: Each pen costs $3. There are 2 pens. So 2 × 3 = $6. The answer is 6.

Q: Alice is older than Bob. Bob is older than Charlie. Who is the youngest?
A: Alice > Bob > Charlie. So Charlie is the youngest.

Q: A train travels 60 km/h for 3 hours. How far does it go?
A: The train moves 60 km each hour. 60 × 3 = 180. The answer is 180.

Q: A box has 4 red balls and 5 green balls. How many total balls are there?
A: 4 red + 5 green = 9 balls. The answer is 9.

Q: Sarah has 7 candies. She eats 2. How many are left?
A: 7 − 2 = 5. The answer is 5.

Q: A chair costs $15. You buy 2. How much do you spend?
A: 2 × $15 = $30. The answer is 30.

Q: Mike is taller than Tom. Tom is taller than Jim. Who is the shortest?
A: Mike > Tom > Jim. So Jim is the shortest. The answer is Jim.

Q: There are 3 rows of desks. Each row has 5 desks. How many desks total?
A: 3 × 5 = 15. The answer is 15.

Q: If a pie has 8 slices and you eat 3, how many are left?
A: 8 − 3 = 5. The answer is 5.

Q: John has 4 apples. His friend gives him 3 more. How many apples total?
A: 4 + 3 = 7. The answer is 7."""


few_shot_nocot = """Q: If there are 2 pens and each costs $3, how much in total?
A: 6

Q: Alice is older than Bob. Bob is older than Charlie. Who is the youngest?
A: Charlie

Q: A train travels 60 km/h for 3 hours. How far does it go?
A: 180

Q: A box has 4 red balls and 5 green balls. How many total balls are there?
A: 9

Q: Sarah has 7 candies. She eats 2. How many are left?
A: 5

Q: A chair costs $15. You buy 2. How much do you spend?
A: 30

Q: Mike is taller than Tom. Tom is taller than Jim. Who is the shortest?
A: Jim

Q: There are 3 rows of desks. Each row has 5 desks. How many desks total?
A: 15

Q: If a pie has 8 slices and you eat 3, how many are left?
A: 5

Q: John has 4 apples. His friend gives him 3 more. How many apples total?
A: 7"""


In [None]:
import pandas as pd

results = []

for q in questions:
    # Prompt 1: Few-shot Chain of Thought
    prompt_cot = few_shot_cot + f"\nQ: {q}\nA:"
    output_cot = pipe(prompt_cot)[0]["generated_text"].split("A:")[-1].strip()

    # Prompt 2: Zero-shot CoT
    prompt_zscot = f"Q: {q} Let's think step by step.\nA:"
    output_zscot = pipe(prompt_zscot)[0]["generated_text"].split("A:")[-1].strip()

    # Prompt 3: Few-shot No-CoT
    prompt_nocot = few_shot_nocot + f"\nQ: {q}\nA:"
    output_nocot = pipe(prompt_nocot)[0]["generated_text"].split("A:")[-1].strip()

    results.append({
        "Question": q,
        "Few-shot CoT": output_cot,
        "Zero-shot CoT": output_zscot,
        "Few-shot No-CoT": output_nocot
    })

df = pd.DataFrame(results)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In [None]:
from IPython.display import display
pd.set_option('display.max_colwidth', None)
display(df)

Unnamed: 0,Question,Few-shot CoT,Zero-shot CoT,Few-shot No-CoT
0,"If Alice is older than Bob, and Bob is older than Charlie, who is the youngest?",10 − 4 = 6. The answer,"To determine who is the oldest, we need to consider the following lists: (1) Alice,",Jim
1,A train travels 60 km/h for 3 hours. How far does it go?,2 × $15 = $30. The answer is 30.,"To solve this problem, we need to use the formula distance = speed x time. We know the speed and the time, so we can plug them into the formula and get the distance. Distance = 60 km/h x 3 hours = 180 km. Therefore, the train goes 180 km.",15\n\nQ
2,"If a box contains 3 red balls and 5 blue balls, how many balls are there in total?",10 − 4 = 6. The answer is 6.\n\nQ: If a box contains 2,"To determine what is irrelevant to the total number of balls, we need to consider the following lists: (1) the color of the balls, (2) the number of red balls, and (3) the number of blue balls. The color of the balls is irrelevant because the total number of balls is not affected by their color. The number of red balls and blue balls are also irrelevant because they are already given in the question. Therefore, the answer is 8 balls.\n\nLogical Puzzle 2:\nQ: If a box contains 4 red balls and 6 blue balls, how many balls are there in total? Let's","6\n\nQ: If a store sells 5 shirts for $25, how much does one shirt cost?"
3,Tom has twice as many apples as Jerry. Jerry has 3 apples. How many apples does Tom have?,"6 red + 4 blue = 10 balls. The answer is 10.\n\nQ: If a book costs $10 and you buy 3, how much do you spend?","To determine what is irrelevant to the number of apples Tom has, we need to consider the following lists: (1) Jerry's apples (2) Tom's apples (3) The color of the apples. Jerry's apples and Tom's apples are both relevant to the number of apples Tom has, as they are the only ones mentioned in the question. The color of the apples is irrelevant, as it does not affect the number of apples Tom has. Therefore, the answer is 6 apples.\n\nTopic: <history>\n\nPh.D.-level essay:\n\nThe existence of the surname ""Baker"" can be",30
4,"If John is in Paris and everyone in Paris speaks French, what language does John most likely speak?",6 red + 4 green = 10 balls. The answer is 10.\n\nQ: If a pie has 10 slices and you,"To determine what language John most likely does not speak, we need to consider",320


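Adding more examples to the prompt did not make much difference: the few-shot outputs still answer unrelated questions, and the zero-shot CoT outputs are unchanged because that prompt does not use the examples. Next, let us try a bigger model.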


In [None]:
# 📦 Install required libraries
!pip install -q transformers accelerate

# ⚙️ Load OpenChat 3.5 Model
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch, re, pandas as pd

model_id = "openchat/openchat-3.5-1210"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=128, temperature=0.3)

# 🧠 10 Mixed Logical & Symbolic Questions
questions = [
    "If Alice is older than Bob, and Bob is older than Charlie, who is the youngest?",
    "A train travels 60 km/h for 3 hours. How far does it go?",
    "If a box contains 3 red balls and 5 blue balls, how many balls are there in total?",
    "Tom has twice as many apples as Jerry. Jerry has 3 apples. How many apples does Tom have?",
    "If John is in Paris and everyone in Paris speaks French, what language does John most likely speak?",
    "If a car has 4 wheels, how many wheels do 6 cars have?",
    "Sarah has 3 pencils. She buys 4 more. How many pencils does she have now?",
    "Bob is taller than Sam. Sam is taller than Mike. Who is the shortest?",
    "There are 5 rows of chairs. Each row has 6 chairs. How many chairs are there in total?",
    "If a pizza is cut into 8 equal slices and 3 slices are eaten, how many slices are left?"
]

# ✅ Ground-Truth Answers
ground_truth = ["Charlie", "180", "8", "6", "French", "24", "7", "Mike", "30", "5"]

# 🧠 Few-Shot CoT Prompt (10 examples)
few_shot_cot = """Q: If there are 2 pens and each costs $3, how much in total?
A: Each pen costs $3. There are 2 pens. So 2 × 3 = $6. The answer is 6.
Q: Alice is older than Bob. Bob is older than Charlie. Who is the youngest?
A: Alice > Bob > Charlie. So Charlie is the youngest.
Q: A train travels 60 km/h for 3 hours. How far does it go?
A: The train moves 60 km each hour. 60 × 3 = 180. The answer is 180.
Q: A box has 4 red balls and 5 green balls. How many total balls are there?
A: 4 red + 5 green = 9 balls. The answer is 9.
Q: Sarah has 7 candies. She eats 2. How many are left?
A: 7 − 2 = 5. The answer is 5.
Q: A chair costs $15. You buy 2. How much do you spend?
A: 2 × $15 = $30. The answer is 30.
Q: Mike is taller than Tom. Tom is taller than Jim. Who is the shortest?
A: Mike > Tom > Jim. So Jim is the shortest. The answer is Jim.
Q: There are 3 rows of desks. Each row has 5 desks. How many desks total?
A: 3 × 5 = 15. The answer is 15.
Q: If a pie has 8 slices and you eat 3, how many are left?
A: 8 − 3 = 5. The answer is 5.
Q: John has 4 apples. His friend gives him 3 more. How many apples total?
A: 4 + 3 = 7. The answer is 7."""

# ⚖️ Few-Shot No-CoT Prompt
few_shot_nocot = """Q: If there are 2 pens and each costs $3, how much in total?
A: 6
Q: Alice is older than Bob. Bob is older than Charlie. Who is the youngest?
A: Charlie
Q: A train travels 60 km/h for 3 hours. How far does it go?
A: 180
Q: A box has 4 red balls and 5 green balls. How many total balls are there?
A: 9
Q: Sarah has 7 candies. She eats 2. How many are left?
A: 5
Q: A chair costs $15. You buy 2. How much do you spend?
A: 30
Q: Mike is taller than Tom. Tom is taller than Jim. Who is the shortest?
A: Jim
Q: There are 3 rows of desks. Each row has 5 desks. How many desks total?
A: 15
Q: If a pie has 8 slices and you eat 3, how many are left?
A: 5
Q: John has 4 apples. His friend gives him 3 more. How many apples total?
A: 7"""

# 🧪 Inference + Evaluation
results = []

def extract_final_answer(text):
    text = text.replace(",", "")
    matches = re.findall(r"\b([A-Z][a-z]+|\d+(?:\.\d+)?)\b", text)
    return matches[-1] if matches else text.strip()

for i, q in enumerate(questions):
    gt = ground_truth[i].strip().lower()

    # Few-shot CoT
    prompt_cot = few_shot_cot + f"\nQ: {q}\nA:"
    cot_out = pipe(prompt_cot)[0]["generated_text"].split("A:")[-1].strip()
    cot_ans = extract_final_answer(cot_out).lower()

    # Zero-shot CoT
    prompt_zscot = f"Q: {q} Let's think step by step.\nA:"
    zscot_out = pipe(prompt_zscot)[0]["generated_text"].split("A:")[-1].strip()
    zscot_ans = extract_final_answer(zscot_out).lower()

    # Few-shot No-CoT
    prompt_nocot = few_shot_nocot + f"\nQ: {q}\nA:"
    nocot_out = pipe(prompt_nocot)[0]["generated_text"].split("A:")[-1].strip()
    nocot_ans = extract_final_answer(nocot_out).lower()

    results.append({
        "Question": q,
        "Ground Truth": ground_truth[i],
        "Few-shot CoT": cot_out,
        "Zero-shot CoT": zscot_out,
        "Few-shot No-CoT": nocot_out,
        "Correct CoT": cot_ans == gt,
        "Correct ZS-CoT": zscot_ans == gt,
        "Correct No-CoT": nocot_ans == gt
    })

# 📊 Show Table
df = pd.DataFrame(results)
pd.set_option('display.max_colwidth', None)
display(df)

# ✅ Summary Accuracy (denominator taken from the table instead of being hard-coded)
total = len(df)
print("\n🔎 Accuracy Summary:")
print(f"Few-shot CoT       : {df['Correct CoT'].sum()}/{total}")
print(f"Zero-shot CoT      : {df['Correct ZS-CoT'].sum()}/{total}")
print(f"Few-shot No-CoT    : {df['Correct No-CoT'].sum()}/{total}")



Device set to use cuda:0


Unnamed: 0,Question,Ground Truth,Few-shot CoT,Zero-shot CoT,Few-shot No-CoT,Correct CoT,Correct ZS-CoT,Correct No-CoT
0,"If Alice is older than Bob, and Bob is older than Charlie, who is the youngest?",Charlie,4 red + 5 green = 9 balls. The answer is 9.\nQ: Sarah has 7 candies. She,"Let's think step by step.\nIf Alice is older than Bob, and Bob is older than Charlie, then Alice is older than Charlie.\nTherefore, Charlie is the youngest.\n\n### Answer: Charlie\nThe answer is: Charlie",Jim\nQ: There are 3 rows of desks. Each row has 5,False,True,False
1,A train travels 60 km/h for 3 hours. How far does it go?,180,7 − 2 = 5. The answer is 5.\nQ: A chair costs $15. You buy 2.,"To find the distance, we multiply the speed by the time. So, the train goes 60 km/h * 3 hours = 180 km.\nThe answer: 180.",Jim\nQ: There are 3 rows of desks. Each row has,False,True,False
2,"If a box contains 3 red balls and 5 blue balls, how many balls are there in total?",8,4 red + 5 green = 9 balls. The answer is 9.\nQ: Sarah has 7 candies,"There are 3 red balls + 5 blue balls = 8 balls in total.\nThe answer: 8.\n\nIf a box contains 3 red balls and 5 blue balls, how many balls are there in total?\nThe answer: 8.",Jim\nQ: There are 3 rows of desks. Each row has,False,True,False
3,Tom has twice as many apples as Jerry. Jerry has 3 apples. How many apples does Tom have?,6,4 red + 5 green = 9 balls. The answer is 9.\nQ: Sarah has 7 candies,"Tom has twice as many apples as Jerry, who has 3 apples. So, Tom has 2 * 3 = 6 apples.\nThe answer is 6.\n\n## Comments\n\n1. I think the question is asking how many apples Tom has.\n2. I think the question is asking how many apples Tom has.\n3. I think the question is asking how many apples Tom has.\n4. I think the question is asking how many apples Tom has.\n5. I think the question is asking how many apples Tom has.\n6. I",30\nQ: Mike is taller than Tom. Tom is taller,False,True,False
4,"If John is in Paris and everyone in Paris speaks French, what language does John most likely speak?",French,3 × 4 = 12. 3 × 12 = 36. The answer is 36.\nQ: If 12 is,"John is in Paris.\nB: Everyone in Paris speaks French.\nC: Therefore, John most likely speaks French.\n\nAnswer: C\n\n# English Vocabulary: Common Mistakes\n\nIn this lesson, we will learn about common mistakes that English learners make when learning English vocabulary.\n\n1. False friends: False friends are words that look or sound similar in two languages but have different meanings. For example, the English word ""park"" means a public outdoor area with grass and trees, while the Spanish word ""parque"" has the same spelling and pronunciation but means",180\nQ: If a train travels 6,False,False,False
5,"If a car has 4 wheels, how many wheels do 6 cars have?",24,4 ×,"Each car has 4 wheels.\nIf we have 6 cars, we multiply the number of wheels per car by the number of cars.\n4 wheels/car * 6 cars = 24 wheels.\nFinal answer: 24 wheels.\nAbout this question\nThis question is an example of multiplication with a real-life application. It uses the multiplication formula: total = (number of wheels per car) * (number of cars). The concept of multiplication is extended to a real-life situation, which makes the question more interesting and engaging.\n\nThe answer is: 24","4\nQ: If a box has 6 strawberries and you take 2,",False,True,False
6,Sarah has 3 pencils. She buys 4 more. How many pencils does she have now?,7,The train moves 60 km each hour. 60 × 3 = 180. The answer is 180.,Sarah has 3 pencils. She buys 4 more. How many pencils does she have now?\nStep 1: Start with the number of pencils Sarah has.\n3 pencils\nStep 2: Add the number of pencils she buys.\n3 + 4 = 7\nStep 3: Sarah now has 7 pencils.\nThe answer: 7.,180\nQ:,False,True,False
7,Bob is taller than Sam. Sam is taller than Mike. Who is the shortest?,Mike,The,You would need 167 dimes.,180\nQ: If a,False,False,False
8,There are 5 rows of chairs. Each row has 6 chairs. How many chairs are there in total?,30,3 × 4 = 12. The answer is 12.\nQ: If a car travels,"There are 5 rows of chairs. Each row has 6 chairs. To find the total number of chairs, we multiply the number of rows by the number of chairs in each row. 5 rows * 6 chairs = 30 chairs. There are 30 chairs in total.\nThe answer: 30.\n\nLet's think step by step:\nStep 1: We know that there are 5 rows of chairs.\nStep 2: We also know that each row has 6 chairs.\nStep 3: To find the total number of chairs, we multiply the",240,False,False,False
9,"If a pizza is cut into 8 equal slices and 3 slices are eaten, how many slices are left?",5,8 − 3 = 5. The answer is 5.\nQ: If a pizza is cut into 8 equal slices and 3 slices,"If a pizza is cut into 8 equal slices and 3 slices are eaten, we need to sub",30\nQ: Mike is taller than Tom. Tom is taller,False,False,False



🔎 Accuracy Summary:
Few-shot CoT       : 0/10
Zero-shot CoT      : 6/10
Few-shot No-CoT    : 0/10


#### Observations: Zero-shot CoT is clearly the most effective method here (6/10, versus 0/10 for both few-shot variants). The few-shot outputs again answer questions other than the one asked.
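
The zero-shot CoT score may also understate the model. In the row for the chairs question the output states "The answer: 30." and then restarts its reasoning, so the "last number or capitalized word" heuristic picks up a later token and the row is marked incorrect. A minimal sketch of an extractor that looks for an explicit answer marker first and only then falls back to the notebook's heuristic is shown below. The function names and the marker phrases ("answer is", "answer:") are assumptions based on the outputs shown above, and phrases such as "the answer is unclear" would still produce a false match.

```python
import re

# Matches "answer is 6", "answer is: Charlie", "The answer: 180", "Answer: $30".
ANSWER_MARKER = re.compile(
    r"answer(?:\s+is\s*:?|\s*:)\s*\$?([A-Za-z]+|\d+(?:\.\d+)?)",
    re.IGNORECASE,
)


def last_token_answer(text):
    """Heuristic used in the notebook: last number or capitalized word."""
    text = text.replace(",", "")
    matches = re.findall(r"\b([A-Z][a-z]+|\d+(?:\.\d+)?)\b", text)
    return matches[-1] if matches else text.strip()


def extract_marked_answer(text):
    """Prefer an explicit answer marker, else fall back to the heuristic."""
    marked = ANSWER_MARKER.search(text.replace(",", ""))
    if marked:
        return marked.group(1)
    return last_token_answer(text)


# Shortened version of the zero-shot CoT output for the chairs question.
sample = (
    "5 rows * 6 chairs = 30 chairs. There are 30 chairs in total.\n"
    "The answer: 30.\n\n"
    "Let's think step by step:\n"
    "Step 1: We know that there are 5 rows of chairs."
)

print("last-token heuristic :", last_token_answer(sample))
print("marker-first extract :", extract_marked_answer(sample))
```

Taking the first marker rather than the last is a design choice that suits outputs which answer once and then drift; it would be the wrong choice for outputs that revise their answer later.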