# LLM Reasoning

This notebook compares how LLMs from different generative AI providers (mainly Llama 3 and 3.1, though other LLMs can be added easily) perform on three recent examples that expose issues with LLM reasoning:

* [The Reversal Curse](https://github.com/lukasberglund/reversal_curse) shows that LLMs trained on "A is B" fail to learn "B is A".
* [How many r's in the word strawberry?](https://x.com/karpathy/status/1816637781659254908) shows "the weirdness of LLM Tokenization".  
* [Which number is bigger, 9.11 or 9.9?](https://x.com/DrJimFan/status/1816521330298356181) shows that "LLMs are alien beasts."


In [1]:
!cat ../.env.sample

ANTHROPIC_API_KEY=""
FIREWORKS_API_KEY=""
GROQ_API_KEY=""
MISTRAL_API_KEY=""
OPENAI_API_KEY=""
OLLAMA_API_URL="http://localhost:11434"
REPLICATE_API_KEY=""
TOGETHER_API_KEY=""
OCTO_API_KEY=""
AWS_ACCESS_KEY_ID=""
AWS_SECRET_ACCESS_KEY=""

Before running the cell below, make sure your ~/.env file (copied from the .env.sample file above) contains the API keys for the LLM providers you want to compare:

In [2]:
import sys
sys.path.append('../../aimodels')

from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

True
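
A quick check of which keys were actually loaded can save a failed run later. The following is a minimal sketch: `missing_keys` is a hypothetical helper, and the list of required variables is an assumption based on the names in .env.sample and the providers used in this notebook.

```python
import os

# Variable names taken from .env.sample; which ones each provider
# actually needs is an assumption, so adjust the list as required.
REQUIRED_KEYS = [
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "GROQ_API_KEY",
    "FIREWORKS_API_KEY",
    "OCTO_API_KEY",
    "TOGETHER_API_KEY",
    "OPENAI_API_KEY",
    "REPLICATE_API_KEY",
]

def missing_keys(names, env=None):
    """Return the names that are unset or empty in the environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

print("Missing keys:", missing_keys(REQUIRED_KEYS))
```

An empty list means every variable in `REQUIRED_KEYS` has a non-empty value.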

In [None]:
#!pip install boto3
#!pip install fireworks-ai

## Specify LLMs to Compare

In [3]:
import aimodels as ai

client = ai.Client()

In [4]:
import time

llms = ["aws:meta.llama3-8b-instruct-v1:0",
        "groq:llama3-8b-8192",
        "fireworks:accounts/fireworks/models/llama-v3-8b-instruct",
        "octo:meta-llama-3-8b-instruct",
        "together:meta-llama/Llama-3-8b-chat-hf",
        "openai:gpt-3.5-turbo",
        "replicate:meta/meta-llama-3-8b-instruct",

        "aws:meta.llama3-1-8b-instruct-v1:0",
        "groq:llama-3.1-8b-instant",
        "fireworks:accounts/fireworks/models/llama-v3p1-8b-instruct",
        "together:meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
        "octo:meta-llama-3.1-8b-instruct",
       ]

def compare_llm(messages):
    """Send the same messages to every model in llms; return responses and timings."""
    execution_times = []
    responses = []
    for llm in llms:
        # Time only the API call itself.
        start_time = time.time()
        response = client.chat.completions.create(model=llm, messages=messages)
        execution_time = time.time() - start_time
        # Strip once and reuse the text for both the list and the printout.
        content = response.choices[0].message.content.strip()
        responses.append(content)
        execution_times.append(execution_time)
        print(f"{llm} - {execution_time:.2f} seconds: {content}")
    return responses, execution_times
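
Because the loop in `compare_llm` calls each provider in turn, an exception from one provider would stop the whole comparison. One possible way around that, sketched here with a hypothetical `timed_call` helper that is not part of the notebook or of `aimodels`, is to capture the error alongside the timing:

```python
import time

def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, error, seconds).

    On success error is None; on failure result is None and
    error holds the exception that was raised.
    """
    start = time.perf_counter()
    try:
        result, error = fn(*args, **kwargs), None
    except Exception as exc:
        result, error = None, exc
    return result, error, time.perf_counter() - start

# Hypothetical usage inside the loop:
# response, error, seconds = timed_call(
#     client.chat.completions.create, model=llm, messages=messages)
```

A failed provider can then be recorded with a placeholder response instead of aborting the run.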

## The Reversal Curse

In [5]:
messages = [
    {"role": "user", "content": "Who is Tom Cruise's mother?"},
]

responses, execution_times = compare_llm(messages)

aws:meta.llama3-8b-instruct-v1:0 - 2.38 seconds: Tom Cruise's mother is Mary Lee South (née Pfeiffer).
groq:llama3-8b-8192 - 2.24 seconds: Tom Cruise's mother is Mary Lee South (née Pfeiffer).
fireworks:accounts/fireworks/models/llama-v3-8b-instruct - 0.92 seconds: Tom Cruise's mother is Mary Lee South (née Pfeiffer). She was a special education teacher and a social worker.
octo:meta-llama-3-8b-instruct - 1.82 seconds: Tom Cruise's mother is Mary Lee South (née Pfeiffer). She was a special education teacher and a homemaker.
together:meta-llama/Llama-3-8b-chat-hf - 0.61 seconds: Tom Cruise's mother is Mary Lee South (née Pfeiffer).
openai:gpt-3.5-turbo - 1.00 seconds: Tom Cruise's mother is Mary Lee Pfeiffer.
replicate:meta/meta-llama-3-8b-instruct - 1.36 seconds: Tom Cruise's mother is Mary Lee South (née Pfeiffer).
aws:meta.llama3-1-8b-instruct-v1:0 - 0.45 seconds: Tom Cruise's mother is Mary Lee Pfeiffer.
groq:llama-3.1-8b-instant - 0.84 seconds: Tom Cruise's mother is Mary Lee Pfeif

In [6]:
import pandas as pd

def display(llms, execution_times, responses):
    """Return a styled, centered table of models, timings, and responses.

    Note: this name shadows IPython's built-in display() in the notebook.
    """
    data = {
        'Provider:Model Name': llms,
        'Execution Time': execution_times,
        'Model Response': responses
    }

    df = pd.DataFrame(data)
    # Number the rows from 1 instead of 0.
    df.index = df.index + 1
    styled_df = df.style.set_table_styles(
        [{'selector': 'th', 'props': [('text-align', 'center')]},
         {'selector': 'td', 'props': [('text-align', 'center')]}]
    ).set_properties(**{'text-align': 'center'})

    return styled_df

In [7]:
display(llms, execution_times, responses)

,Provider:Model Name,Execution Time,Model Response
1,aws:meta.llama3-8b-instruct-v1:0,2.383425,Tom Cruise's mother is Mary Lee South (née Pfeiffer).
2,groq:llama3-8b-8192,2.241169,Tom Cruise's mother is Mary Lee South (née Pfeiffer).
3,fireworks:accounts/fireworks/models/llama-v3-8b-instruct,0.916995,Tom Cruise's mother is Mary Lee South (née Pfeiffer). She was a special education teacher and a social worker.
4,octo:meta-llama-3-8b-instruct,1.82236,Tom Cruise's mother is Mary Lee South (née Pfeiffer). She was a special education teacher and a homemaker.
5,together:meta-llama/Llama-3-8b-chat-hf,0.607085,Tom Cruise's mother is Mary Lee South (née Pfeiffer).
6,openai:gpt-3.5-turbo,1.002106,Tom Cruise's mother is Mary Lee Pfeiffer.
7,replicate:meta/meta-llama-3-8b-instruct,1.362718,Tom Cruise's mother is Mary Lee South (née Pfeiffer).
8,aws:meta.llama3-1-8b-instruct-v1:0,0.454378,Tom Cruise's mother is Mary Lee Pfeiffer.
9,groq:llama-3.1-8b-instant,0.835516,Tom Cruise's mother is Mary Lee Pfeiffer.
10,fireworks:accounts/fireworks/models/llama-v3p1-8b-instruct,0.371963,Tom Cruise's mother is Mary Lee Pfeiffer.


In [8]:
messages = [
    {"role": "user", "content": "Who is Mary Lee Pfeiffer's son?"},
]

responses, execution_times = compare_llm(messages)

aws:meta.llama3-8b-instruct-v1:0 - 1.23 seconds: I apologize, but I couldn't find any information on a person named Mary Lee Pfeiffer or her son. It's possible that this is a private individual and not a public figure, or that the name is not well-known. Can you provide more context or details about who Mary Lee Pfeiffer is or why you are looking for information about her son?
groq:llama3-8b-8192 - 0.39 seconds: I apologize, but I couldn't find any information on a person named Mary Lee Pfeiffer or her son. It's possible that this is a private or personal matter, or that the person is not a public figure. Can you provide more context or clarify who Mary Lee Pfeiffer is?
fireworks:accounts/fireworks/models/llama-v3-8b-instruct - 0.44 seconds: According to my knowledge, Mary Lee Pfeiffer's son is John Pfeiffer.
octo:meta-llama-3-8b-instruct - 1.25 seconds: I apologize, but I couldn't find any information on a person named Mary Lee Pfeiffer or her son. It's possible that this is a private

In [9]:
display(llms, execution_times, responses)

,Provider:Model Name,Execution Time,Model Response
1,aws:meta.llama3-8b-instruct-v1:0,1.225959,"I apologize, but I couldn't find any information on a person named Mary Lee Pfeiffer or her son. It's possible that this is a private individual and not a public figure, or that the name is not well-known. Can you provide more context or details about who Mary Lee Pfeiffer is or why you are looking for information about her son?"
2,groq:llama3-8b-8192,0.3918,"I apologize, but I couldn't find any information on a person named Mary Lee Pfeiffer or her son. It's possible that this is a private or personal matter, or that the person is not a public figure. Can you provide more context or clarify who Mary Lee Pfeiffer is?"
3,fireworks:accounts/fireworks/models/llama-v3-8b-instruct,0.438607,"According to my knowledge, Mary Lee Pfeiffer's son is John Pfeiffer."
4,octo:meta-llama-3-8b-instruct,1.250298,"I apologize, but I couldn't find any information on a person named Mary Lee Pfeiffer or her son. It's possible that this is a private or personal matter, or that the person is not a public figure. Can you provide more context or clarify who Mary Lee Pfeiffer is?"
5,together:meta-llama/Llama-3-8b-chat-hf,0.924522,"I apologize, but I couldn't find any information on a person named Mary Lee Pfeiffer or her son. It's possible that this is a private or personal matter, or that the person is not a public figure. Can you provide more context or clarify who Mary Lee Pfeiffer is?"
6,openai:gpt-3.5-turbo,0.637278,Mary Lee Pfeiffer's son is actor and filmmaker Joaquin Phoenix.
7,replicate:meta/meta-llama-3-8b-instruct,1.37563,"According to my knowledge, Mary Lee Pfeiffer's son is John Pfeiffer."
8,aws:meta.llama3-1-8b-instruct-v1:0,0.639018,I don't have information on Mary Lee Pfeiffer's son. Is there something else I can help you with?
9,groq:llama-3.1-8b-instant,1.059837,I don't have information on Mary Lee Pfeiffer's son. Is there something else I can help you with?
10,fireworks:accounts/fireworks/models/llama-v3p1-8b-instruct,0.387835,I don't have information on Mary Lee Pfeiffer's son. Is there something else I can help you with?
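
Checking these answers by eye works for a handful of models. A simple string match is one way to flag them automatically. This is a minimal sketch: `mentions` is a hypothetical helper, and the two responses are copied from the outputs above.

```python
def mentions(response, expected):
    """Case-insensitive check that the expected name appears in the response."""
    return expected.lower() in response.lower()

# Forward question ("Who is Tom Cruise's mother?") and reverse question
# ("Who is Mary Lee Pfeiffer's son?"), one response each from the runs above.
forward = "Tom Cruise's mother is Mary Lee South (née Pfeiffer)."
reverse = "According to my knowledge, Mary Lee Pfeiffer's son is John Pfeiffer."

print(mentions(forward, "Pfeiffer"))    # True
print(mentions(reverse, "Tom Cruise"))  # False
```

A substring match is deliberately crude: it cannot tell a confident wrong answer from a refusal, so both count as misses here.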


## How many r's in the word strawberry?

In [10]:
messages = [
    {"role": "user", "content": "How many r's in the word strawberry?"},
]

responses, execution_times = compare_llm(messages)

aws:meta.llama3-8b-instruct-v1:0 - 0.48 seconds: There are 2 R's in the word "strawberry".
groq:llama3-8b-8192 - 0.16 seconds: There are 2 R's in the word "strawberry".
fireworks:accounts/fireworks/models/llama-v3-8b-instruct - 0.25 seconds: There are 2 R's in the word "strawberry".
octo:meta-llama-3-8b-instruct - 0.31 seconds: There are 2 R's in the word "strawberry".
together:meta-llama/Llama-3-8b-chat-hf - 0.25 seconds: There are 2 R's in the word "strawberry".
openai:gpt-3.5-turbo - 0.90 seconds: There are three r's in the word "strawberry."
replicate:meta/meta-llama-3-8b-instruct - 1.33 seconds: Let me count them for you!

There are 2 R's in the word "strawberry".
aws:meta.llama3-1-8b-instruct-v1:0 - 0.49 seconds: There are 3 r's in the word "strawberry".
groq:llama-3.1-8b-instant - 2.36 seconds: There are 3 r's in the word "strawberry".
fireworks:accounts/fireworks/models/llama-v3p1-8b-instruct - 0.43 seconds: There are 3 r's in the word "strawberry".
together:meta-llama/Meta-Lla

In [11]:
display(llms, execution_times, responses)

,Provider:Model Name,Execution Time,Model Response
1,aws:meta.llama3-8b-instruct-v1:0,0.480391,"There are 2 R's in the word ""strawberry""."
2,groq:llama3-8b-8192,0.159436,"There are 2 R's in the word ""strawberry""."
3,fireworks:accounts/fireworks/models/llama-v3-8b-instruct,0.254061,"There are 2 R's in the word ""strawberry""."
4,octo:meta-llama-3-8b-instruct,0.314966,"There are 2 R's in the word ""strawberry""."
5,together:meta-llama/Llama-3-8b-chat-hf,0.248981,"There are 2 R's in the word ""strawberry""."
6,openai:gpt-3.5-turbo,0.899374,"There are three r's in the word ""strawberry."""
7,replicate:meta/meta-llama-3-8b-instruct,1.328329,"Let me count them for you! There are 2 R's in the word ""strawberry""."
8,aws:meta.llama3-1-8b-instruct-v1:0,0.494379,"There are 3 r's in the word ""strawberry""."
9,groq:llama-3.1-8b-instant,2.36402,"There are 3 r's in the word ""strawberry""."
10,fireworks:accounts/fireworks/models/llama-v3p1-8b-instruct,0.434086,"There are 3 r's in the word ""strawberry""."
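
For reference, the ground truth is easy to compute when the word is handled as a sequence of characters, which is how Python sees it. An LLM instead works on tokens, which is the "weirdness" the linked post points to. A minimal sketch:

```python
word = "strawberry"
letter = "r"

# Count occurrences and record their 1-based positions in the word.
count = word.count(letter)
positions = [i for i, ch in enumerate(word, start=1) if ch == letter]

print(count, positions)  # 3 [3, 8, 9]
```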


## Which number is bigger?

In [12]:
messages = [
    {"role": "user", "content": "Which number is bigger, 9.11 or 9.9?"},
]

responses, execution_times = compare_llm(messages)

aws:meta.llama3-8b-instruct-v1:0 - 0.49 seconds: 9.9 is bigger than 9.11.
groq:llama3-8b-8192 - 0.20 seconds: 9.11 is bigger than 9.9.
fireworks:accounts/fireworks/models/llama-v3-8b-instruct - 0.27 seconds: 9.9 is bigger than 9.11.
octo:meta-llama-3-8b-instruct - 0.29 seconds: 9.11 is bigger than 9.9.
together:meta-llama/Llama-3-8b-chat-hf - 0.70 seconds: 9.11 is bigger than 9.9.
openai:gpt-3.5-turbo - 1.05 seconds: 9.9
replicate:meta/meta-llama-3-8b-instruct - 1.58 seconds: Let me help you with that!

9.11 is bigger than 9.9.
aws:meta.llama3-1-8b-instruct-v1:0 - 0.83 seconds: To compare these two numbers, we need to look at the decimal part. Since 9.11 has a larger decimal part (0.11) than 9.9 (0.9), 9.11 is bigger.
groq:llama-3.1-8b-instant - 0.23 seconds: 9.9 is bigger than 9.11.
fireworks:accounts/fireworks/models/llama-v3p1-8b-instruct - 0.19 seconds: 9.9 is bigger than 9.11.
together:meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo - 0.36 seconds: 9.9 is bigger than 9.11.
octo:meta-l

In [13]:
display(llms, execution_times, responses)

,Provider:Model Name,Execution Time,Model Response
1,aws:meta.llama3-8b-instruct-v1:0,0.489279,9.9 is bigger than 9.11.
2,groq:llama3-8b-8192,0.200864,9.11 is bigger than 9.9.
3,fireworks:accounts/fireworks/models/llama-v3-8b-instruct,0.271625,9.9 is bigger than 9.11.
4,octo:meta-llama-3-8b-instruct,0.294958,9.11 is bigger than 9.9.
5,together:meta-llama/Llama-3-8b-chat-hf,0.695657,9.11 is bigger than 9.9.
6,openai:gpt-3.5-turbo,1.051595,9.9
7,replicate:meta/meta-llama-3-8b-instruct,1.580146,Let me help you with that! 9.11 is bigger than 9.9.
8,aws:meta.llama3-1-8b-instruct-v1:0,0.828657,"To compare these two numbers, we need to look at the decimal part. Since 9.11 has a larger decimal part (0.11) than 9.9 (0.9), 9.11 is bigger."
9,groq:llama-3.1-8b-instant,0.232353,9.9 is bigger than 9.11.
10,fireworks:accounts/fireworks/models/llama-v3p1-8b-instruct,0.192978,9.9 is bigger than 9.11.


In [14]:
messages = [
    {"role": "user", "content": "Which number is bigger, 9.11 or 9.9? Think step by step."},
]

responses, execution_times = compare_llm(messages)

aws:meta.llama3-8b-instruct-v1:0 - 1.83 seconds: Let's break it down step by step:

1. Both numbers have the same first digit, which is 9.
2. The second digit of 9.11 is 1, and the second digit of 9.9 is 9.
3. Since 9 is greater than 1, the second digit of 9.9 is larger than the second digit of 9.11.
4. Therefore, 9.9 is greater than 9.11.

So, the answer is: 9.9 is bigger than 9.11.
groq:llama3-8b-8192 - 0.31 seconds: Let's break it down step by step:

1. Both numbers have the same first digit, which is 9.
2. The second digit of 9.11 is 1, and the second digit of 9.9 is 9.
3. Since 9 is greater than 1, the second digit of 9.9 is larger than the second digit of 9.11.
4. Therefore, 9.9 is greater than 9.11.

So, the answer is: 9.9 is bigger than 9.11.
fireworks:accounts/fireworks/models/llama-v3-8b-instruct - 0.72 seconds: Let's break it down step by step!

1. Both numbers have the same first digit, which is 9.
2. The second digit of 9.11 is 1, and the second digit of 9.9 is 9.
3. Since

In [15]:
display(llms, execution_times, responses)

,Provider:Model Name,Execution Time,Model Response
1,aws:meta.llama3-8b-instruct-v1:0,1.830247,"Let's break it down step by step: 1. Both numbers have the same first digit, which is 9. 2. The second digit of 9.11 is 1, and the second digit of 9.9 is 9. 3. Since 9 is greater than 1, the second digit of 9.9 is larger than the second digit of 9.11. 4. Therefore, 9.9 is greater than 9.11. So, the answer is: 9.9 is bigger than 9.11."
2,groq:llama3-8b-8192,0.307869,"Let's break it down step by step: 1. Both numbers have the same first digit, which is 9. 2. The second digit of 9.11 is 1, and the second digit of 9.9 is 9. 3. Since 9 is greater than 1, the second digit of 9.9 is larger than the second digit of 9.11. 4. Therefore, 9.9 is greater than 9.11. So, the answer is: 9.9 is bigger than 9.11."
3,fireworks:accounts/fireworks/models/llama-v3-8b-instruct,0.716259,"Let's break it down step by step! 1. Both numbers have the same first digit, which is 9. 2. The second digit of 9.11 is 1, and the second digit of 9.9 is 9. 3. Since 9 is greater than 1, the second digit of 9.9 is larger than the second digit of 9.11. 4. Therefore, 9.9 is greater than 9.11. So, the correct answer is: 9.9 is bigger than 9.11!"
4,octo:meta-llama-3-8b-instruct,1.397936,"Let's break it down step by step: 1. Both numbers have the same first digit, which is 9. 2. The second digit of 9.11 is 1, and the second digit of 9.9 is 9. 3. Since 9 is greater than 1, the second digit of 9.9 is larger than the second digit of 9.11. 4. Therefore, 9.9 is greater than 9.11. So, the answer is: 9.9 is bigger than 9.11."
5,together:meta-llama/Llama-3-8b-chat-hf,0.764065,"Let's break it down step by step: 1. Both numbers have the same first digit, which is 9. 2. The second digit of 9.11 is 1, and the second digit of 9.9 is 9. 3. Since 9 is greater than 1, the second digit of 9.9 is larger than the second digit of 9.11. 4. Therefore, 9.9 is greater than 9.11. So, the answer is: 9.9 is bigger than 9.11."
6,openai:gpt-3.5-turbo,1.262903,"To determine which number is bigger, we can compare the whole numbers first. In this case, both numbers have a whole number of 9. Since they are the same, we need to look at the decimal portion. 9.11 is bigger than 9.9 because 0.11 is greater than 0.9."
7,replicate:meta/meta-llama-3-8b-instruct,2.307123,"Let's break it down step by step! 1. Both numbers start with the same digits: 9.1 2. The next digit is also the same: 1 3. The last digit is different: 9.11 has a 1, while 9.9 has a 9 Since 9 is greater than 1, the correct answer is: 9.9 is bigger than 9.11"
8,aws:meta.llama3-1-8b-instruct-v1:0,1.374292,"Let's compare the two numbers step by step. 1. Both numbers start with 9, so we can ignore the 9 for now. 2. We're left with 0.11 and 0.9. Which one is bigger? 3. Since 0.9 is bigger than 0.11, the whole number 9.9 is bigger than 9.11. So, the answer is: 9.9 is bigger than 9.11."
9,groq:llama-3.1-8b-instant,0.463357,"Let's compare the two numbers step by step: 1. Both numbers start with 9, so we can ignore the 9 for now. 2. We're left with 0.11 and 0.9. 3. 0.9 is bigger than 0.11. 4. Since 9 is the same in both numbers, the overall number with 0.9 is bigger than the number with 0.11. Therefore, 9.9 is bigger than 9.11."
10,fireworks:accounts/fireworks/models/llama-v3p1-8b-instruct,0.584036,"Let's compare the two numbers step by step: 1. Both numbers start with 9, so we can ignore the 9 for now. 2. We're left with 0.11 and 0.9. 3. 0.9 is bigger than 0.11. 4. Since 9 is the same in both numbers, the overall number with 0.9 is bigger than the number with 0.11. Therefore, 9.9 is bigger than 9.11."
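
Some of the wrong answers above justify themselves by saying that 0.11 is greater than 0.9, which is what happens when the digits after the dot are compared as whole numbers, as in software version numbers. The sketch below contrasts the two readings. Both helper names are hypothetical, and whether a model actually reasons this way is an assumption, not something this notebook establishes.

```python
from decimal import Decimal

def bigger_as_decimal(a, b):
    """Compare the strings as decimal numbers (9.9 == 9.90 > 9.11)."""
    return a if Decimal(a) > Decimal(b) else b

def bigger_as_version(a, b):
    """Compare the strings part by part as integers, like version numbers."""
    def key(s):
        return tuple(int(part) for part in s.split("."))
    return a if key(a) > key(b) else b

print(bigger_as_decimal("9.11", "9.9"))  # 9.9
print(bigger_as_version("9.11", "9.9"))  # 9.11
```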


## Takeaways
1. Not all LLMs are created equal, and not even all Llama 3 (or 3.1) deployments are: the same model served by different providers can give different answers.
2. Asking an LLM to think step by step may improve its reasoning. In the runs shown above, the Llama models answered the 9.11 vs 9.9 question correctly with this prompt, while gpt-3.5-turbo switched to the wrong answer.
3. The way tokenization works in LLMs can lead to a lot of weird behavior (see Andrej Karpathy's awesome [video](https://www.youtube.com/watch?v=zduSFxRajkE) for a deep dive).
4. A more comprehensive benchmark would be desirable, but a quick LLM comparison like the one shown here can be a first step.
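
As a small illustration of the first takeaway, the Llama 3 8B answers from the first "Which number is bigger?" run can be grouped by whether they picked the larger number. This is a rough sketch: `picks_larger` is a hypothetical heuristic that only works for responses starting with the chosen number, and the data is copied from the output above.

```python
from collections import defaultdict

# (provider:model, response) pairs copied from the first 9.11 vs 9.9 run.
results = [
    ("aws:meta.llama3-8b-instruct-v1:0", "9.9 is bigger than 9.11."),
    ("groq:llama3-8b-8192", "9.11 is bigger than 9.9."),
    ("fireworks:accounts/fireworks/models/llama-v3-8b-instruct", "9.9 is bigger than 9.11."),
    ("octo:meta-llama-3-8b-instruct", "9.11 is bigger than 9.9."),
    ("together:meta-llama/Llama-3-8b-chat-hf", "9.11 is bigger than 9.9."),
]

def picks_larger(response, larger="9.9"):
    """Heuristic: count the response as correct if it starts with the larger number."""
    return response.strip().startswith(larger)

# Group provider names by whether their answer was correct.
by_answer = defaultdict(list)
for model, response in results:
    provider = model.split(":", 1)[0]
    by_answer[picks_larger(response)].append(provider)

print(dict(by_answer))
# {True: ['aws', 'fireworks'], False: ['groq', 'octo', 'together']}
```

Two providers return the right answer and three the wrong one for what is nominally the same model. A real benchmark would need many prompts and repeated runs per provider.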