# LLM Reasoning

This notebook compares how LLMs (mainly Llama 3 and 3.1, but other LLMs can be added easily) from different Generative AI providers perform on three recent examples that show issues with LLM reasoning:

* [The Reversal Curse](https://github.com/lukasberglund/reversal_curse) shows that LLMs trained on "A is B" fail to learn "B is A".
* [How many r's in the word strawberry?](https://x.com/karpathy/status/1816637781659254908) shows "the weirdness of LLM Tokenization".  
* [Which number is bigger, 9.11 or 9.9?](https://x.com/DrJimFan/status/1816521330298356181) shows that "LLMs are alien beasts."


In [1]:
!cat ../.env.sample

ANTHROPIC_API_KEY=""
FIREWORKS_API_KEY=""
GROQ_API_KEY=""
MISTRAL_API_KEY=""
OPENAI_API_KEY=""
OLLAMA_API_URL="http://localhost:11434"
REPLICATE_API_KEY=""
TOGETHER_API_KEY=""
OCTO_API_KEY=""
AWS_ACCESS_KEY_ID=""
AWS_SECRET_ACCESS_KEY=""

Before running the cell below, make sure your ~/.env file (copied from the .env.sample file above) contains the API keys for the LLM providers you want to compare:

In [2]:
import sys
sys.path.append('../../aimodels')

from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

True

In [None]:
#!pip install boto3
#!pip install fireworks-ai

## Specify LLMs to Compare

In [3]:
import aimodels as ai

client = ai.Client()

In [4]:
llms = [
    # Llama 3 8B served by different providers (plus gpt-3.5-turbo)
    "aws:meta.llama3-8b-instruct-v1:0",
    "groq:llama3-8b-8192",
    "fireworks:accounts/fireworks/models/llama-v3-8b-instruct",
    "octo:meta-llama-3-8b-instruct",
    "together:meta-llama/Llama-3-8b-chat-hf",
    "openai:gpt-3.5-turbo",
    "replicate:meta/meta-llama-3-8b-instruct",

    # Llama 3.1 8B served by different providers
    "aws:meta.llama3-1-8b-instruct-v1:0",
    "groq:llama-3.1-8b-instant",
    "fireworks:accounts/fireworks/models/llama-v3p1-8b-instruct",
    "together:meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    "octo:meta-llama-3.1-8b-instruct",
]

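def compare_llm(messages):
    """Send the same messages to every model in `llms` and print each reply."""
    for llm in llms:
        # Each entry in `llms` is a "provider:model" string
        response = client.chat.completions.create(model=llm, messages=messages)
        print(f"{llm} - {response.choices[0].message.content.strip()}\n==========")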
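
With this many providers in one loop, a single missing API key or rate limit would raise and stop the whole comparison. A minimal sketch of a more tolerant variant is below. `compare_llm_safe` and its explicit `client` and `llms` parameters are hypothetical names introduced here for illustration; the only call it relies on is the `client.chat.completions.create(model=..., messages=...)` call already used above.

```python
def compare_llm_safe(client, llms, messages):
    """Query each model in turn and collect the replies (hypothetical helper).

    Returns a dict mapping model name -> reply text, or an error string
    when the provider call raises.
    """
    results = {}
    for llm in llms:
        try:
            response = client.chat.completions.create(model=llm, messages=messages)
            results[llm] = response.choices[0].message.content.strip()
        except Exception as exc:  # broad on purpose: provider SDKs raise different error types
            results[llm] = f"[error] {type(exc).__name__}: {exc}"
        print(f"{llm} - {results[llm]}\n==========")
    return results
```

Returning the replies as a dict, in addition to printing them, makes it possible to inspect or score them in a later cell.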

## The Reversal Curse

In [5]:
messages = [
    {"role": "user", "content": "Who is Tom Cruise's mother?"},
]

In [6]:
compare_llm(messages)

aws:meta.llama3-8b-instruct-v1:0 - Tom Cruise's mother is Mary Lee South (née Pfeiffer).
groq:llama3-8b-8192 - Tom Cruise's mother is Mary Lee South (née Pfeiffer). She was a special education teacher and a homemaker.
fireworks:accounts/fireworks/models/llama-v3-8b-instruct - Tom Cruise's mother is Mary Lee South (née Pfeiffer).
octo:meta-llama-3-8b-instruct - Tom Cruise's mother is Mary Lee South (née Pfeiffer).
together:meta-llama/Llama-3-8b-chat-hf - Tom Cruise's mother is Mary Lee South (née Pfeiffer).
openai:gpt-3.5-turbo - Tom Cruise's mother is Mary Lee Pfeiffer.
replicate:meta/meta-llama-3-8b-instruct - Tom Cruise's mother is Mary Lee South (née Pfeiffer).
aws:meta.llama3-1-8b-instruct-v1:0 - Tom Cruise's mother is Mary Lee Pfeiffer.
groq:llama-3.1-8b-instant - Tom Cruise's mother is Mary Lee Pfeiffer.
fireworks:accounts/fireworks/models/llama-v3p1-8b-instruct - Tom Cruise's mother is Mary Lee Pfeiffer.
together:meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo - Tom Cruise's mother 

In [7]:
messages = [
    {"role": "user", "content": "Who is Mary Lee Pfeiffer's son?"},
]
compare_llm(messages)

aws:meta.llama3-8b-instruct-v1:0 - I apologize, but I couldn't find any information on a person named Mary Lee Pfeiffer or her son. It's possible that this is a private individual and not a public figure, or that the name is not well-known. Can you provide more context or details about who Mary Lee Pfeiffer is or why you are looking for information about her son?
groq:llama3-8b-8192 - I apologize, but I couldn't find any information on a person named Mary Lee Pfeiffer or her son. It's possible that this is a private or personal matter, or that the person is not a public figure. Can you provide more context or clarify who Mary Lee Pfeiffer is?
fireworks:accounts/fireworks/models/llama-v3-8b-instruct - Mary Lee Pfeiffer is a well-known American artist, and her son is none other than the famous artist and sculptor, John Pfeiffer!
octo:meta-llama-3-8b-instruct - I apologize, but I couldn't find any information on a person named Mary Lee Pfeiffer or her son. It's possible that this is a pri

## How many r's in the word strawberry?

In [8]:
messages = [
    {"role": "user", "content": "How many r's in the word strawberry?"},
]
compare_llm(messages)

aws:meta.llama3-8b-instruct-v1:0 - There are 2 R's in the word "strawberry".
groq:llama3-8b-8192 - There are 2 R's in the word "strawberry".
fireworks:accounts/fireworks/models/llama-v3-8b-instruct - There are 2 R's in the word "strawberry".
octo:meta-llama-3-8b-instruct - There are 2 R's in the word "strawberry".
together:meta-llama/Llama-3-8b-chat-hf - There are 2 R's in the word "strawberry".
openai:gpt-3.5-turbo - There are three r's in the word "strawberry."
replicate:meta/meta-llama-3-8b-instruct - Let me count them for you!

There are 2 R's in the word "strawberry".
aws:meta.llama3-1-8b-instruct-v1:0 - There are 3 r's in the word "strawberry".
groq:llama-3.1-8b-instant - There are 3 r's in the word "strawberry".
fireworks:accounts/fireworks/models/llama-v3p1-8b-instruct - There are 2 r's in the word strawberry.
together:meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo - There are 2 r's in the word strawberry.
octo:meta-llama-3.1-8b-instruct - There are 3 r's in the word "strawberry".
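
For reference, the ground truth is easy to check in plain Python, which works on characters rather than tokens. This is only a sanity check on the expected answer, not a claim about how any of the models above count.

```python
word = "strawberry"

# str.count returns the number of non-overlapping occurrences of the substring
r_count = word.count("r")

# 0-based positions of each "r", to make the count easy to verify by eye
r_positions = [i for i, ch in enumerate(word) if ch == "r"]

print(r_count, r_positions)  # 3 [2, 7, 8]
```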

## Which number is bigger?

In [9]:
messages = [
    {"role": "user", "content": "Which number is bigger, 9.11 or 9.9?"},
]

In [10]:
compare_llm(messages)

aws:meta.llama3-8b-instruct-v1:0 - 9.9 is bigger than 9.11.
groq:llama3-8b-8192 - 9.11 is bigger than 9.9.
fireworks:accounts/fireworks/models/llama-v3-8b-instruct - 9.9 is bigger than 9.11.
octo:meta-llama-3-8b-instruct - 9.11 is bigger than 9.9.
together:meta-llama/Llama-3-8b-chat-hf - 9.11 is bigger than 9.9.
openai:gpt-3.5-turbo - 9.9
replicate:meta/meta-llama-3-8b-instruct - Let me help you with that!

9.11 is bigger than 9.9.
aws:meta.llama3-1-8b-instruct-v1:0 - The number 9.11 is bigger than 9.9.
groq:llama-3.1-8b-instant - 9.9 is bigger than 9.11.
fireworks:accounts/fireworks/models/llama-v3p1-8b-instruct - 9.9 is bigger than 9.11.
together:meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo - To compare the two numbers, we need to look at the decimal part. 

9.11 has a decimal part of 0.11, and 9.9 has a decimal part of 0.9. 

Since 0.11 is greater than 0.9, 9.11 is bigger than 9.9.
octo:meta-llama-3.1-8b-instruct - 9.9 is bigger than 9.11.
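
The wrong answers above resemble comparing the parts after the dot as whole numbers (11 vs. 9), the way version numbers are ordered. The sketch below contrasts the two readings. Whether any model actually reasons this way is an assumption, not something this notebook establishes; `as_version` is a hypothetical helper written for this illustration.

```python
from decimal import Decimal

a, b = "9.11", "9.9"

# Read as decimal numbers: 9.9 is 9.90, which is larger than 9.11
as_decimals = max(a, b, key=Decimal)

# Read as dotted version-like tuples: (9, 11) sorts after (9, 9)
def as_version(s):
    return tuple(int(part) for part in s.split("."))

as_versions = max(a, b, key=as_version)

print(as_decimals, as_versions)  # 9.9 9.11
```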


In [11]:
messages = [
    {"role": "user", "content": "Which number is bigger, 9.11 or 9.9? Think step by step."},
]

In [12]:
compare_llm(messages)

aws:meta.llama3-8b-instruct-v1:0 - Let's break it down step by step!

1. Compare the whole numbers: Both numbers have the same whole number part, which is 9.
2. Compare the decimal parts: 9.11 has a decimal part of 0.11, while 9.9 has a decimal part of 0.9.
3. Since 0.11 is smaller than 0.9, 9.11 is smaller than 9.9.

So, the correct answer is: 9.9 is bigger than 9.11.
groq:llama3-8b-8192 - Let's break it down step by step:

1. Both numbers have the same first digit, which is 9.
2. The second digit of 9.11 is 1, and the second digit of 9.9 is 9.
3. Since 9 is greater than 1, the second digit of 9.9 is larger than the second digit of 9.11.
4. Therefore, 9.9 is greater than 9.11.

So, the answer is: 9.9 is bigger than 9.11.
fireworks:accounts/fireworks/models/llama-v3-8b-instruct - Let's break it down step by step!

1. Compare the whole numbers: Both numbers have the same whole part, which is 9.
2. Compare the decimal parts: 0.11 is less than 0.9.

So, 9.11 is less than 9.9.
octo:meta-ll

## Takeaways
1. Not all LLMs are created equal; even the same Llama 3 (or 3.1) model can answer differently depending on the provider serving it.
2. Asking an LLM to think step by step may help improve its reasoning.
3. The way an LLM was trained and how its input is tokenized can lead to some weird reasoning.
4. A more comprehensive benchmark would be desirable, but a quick LLM comparison like the one shown here can be a first step.
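
As a small step toward such a benchmark, replies can be checked against an expected answer and tallied. The sketch below is a rough illustration under stated assumptions: `score_responses` is a hypothetical helper, the regex match is a crude stand-in for real answer grading, and the sample replies are copied from the strawberry cell above.

```python
import re

def score_responses(responses, pattern):
    """Return {model: True/False} for whether each reply matches `pattern`."""
    regex = re.compile(pattern, re.IGNORECASE)
    return {model: bool(regex.search(text)) for model, text in responses.items()}

# A few replies copied from the strawberry cell above
responses = {
    "groq:llama3-8b-8192": 'There are 2 R\'s in the word "strawberry".',
    "openai:gpt-3.5-turbo": 'There are three r\'s in the word "strawberry."',
    "groq:llama-3.1-8b-instant": 'There are 3 r\'s in the word "strawberry".',
}

# The expected answer is 3, written either as a digit or as a word
scores = score_responses(responses, r"\b(3|three)\b")
accuracy = sum(scores.values()) / len(scores)

print(scores)
print(round(accuracy, 2))  # 0.67
```

Pattern matching like this can misjudge a reply that mentions the right number while concluding something else, so longer step-by-step answers would need a stricter check.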