# LLM Reasoning

This notebook compares how LLMs from different generative AI providers perform on three prompts known to expose weaknesses in LLM reasoning:

* [The Reversal Curse](https://github.com/lukasberglund/reversal_curse) shows that LLMs trained on "A is B" fail to learn "B is A".
* [How many r's in the word strawberry?](https://x.com/karpathy/status/1816637781659254908) shows "the weirdness of LLM Tokenization".  
* [Which number is bigger, 9.11 or 9.9?](https://x.com/DrJimFan/status/1816521330298356181) shows that "LLMs are alien beasts."

In [None]:
!cat ../.env.sample

Before running the cell below, make sure your `~/.env` file (copied from the `.env.sample` file above) contains the API keys for the LLM providers you want to compare:

In [None]:
import sys
sys.path.append('../../aisuite')

from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())
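
Before calling any provider, it can help to confirm that the expected keys were actually loaded. The following is a minimal sketch: `missing_keys` is a hypothetical helper (not part of aisuite or python-dotenv), and the variable names in `REQUIRED_KEYS` are assumptions for illustration, so substitute the names listed in your own `.env.sample`.

```python
import os

# Assumed variable names for illustration; replace with the ones in .env.sample.
REQUIRED_KEYS = ["ANTHROPIC_API_KEY", "GROQ_API_KEY", "OPENAI_API_KEY"]

def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

missing = missing_keys(REQUIRED_KEYS)
if missing:
    print("Missing API keys:", ", ".join(missing))
```

Passing `env` explicitly makes the helper easy to check against a plain dictionary without touching the real environment.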

## Specify LLMs to Compare

In [None]:
import aisuite as ai

client = ai.Client()

In [None]:
import time

llms = [
    "anthropic:claude-3-5-sonnet-20240620",
    "aws:meta.llama3-1-8b-instruct-v1:0",
    "groq:llama3-8b-8192",
    "groq:llama3-70b-8192",
    "huggingface:mistralai/Mistral-7B-Instruct-v0.3",
    "openai:gpt-3.5-turbo",
]

def compare_llm(messages):
    """Send `messages` to every model in `llms`; return responses and timings."""
    execution_times = []
    responses = []
    for llm in llms:
        # Time the full round trip to the provider.
        start_time = time.perf_counter()
        response = client.chat.completions.create(model=llm, messages=messages)
        execution_time = time.perf_counter() - start_time
        content = response.choices[0].message.content.strip()
        responses.append(content)
        execution_times.append(execution_time)
        print(f"{llm} - {execution_time:.2f} seconds: {content}")
    return responses, execution_times
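
The loop above stops at the first provider that raises, for example because of a missing key or an unavailable model. A possible variant, sketched here under the assumption that recording the error text is acceptable for a quick comparison, keeps going and stores the failure in place of the response. `compare_llm_safe` is a hypothetical name, and `create` stands for any callable with the same keyword arguments as `client.chat.completions.create`.

```python
import time

def compare_llm_safe(create, llms, messages):
    """Query each model via `create`, recording errors instead of raising."""
    responses, execution_times = [], []
    for llm in llms:
        start = time.perf_counter()
        try:
            response = create(model=llm, messages=messages)
            content = response.choices[0].message.content.strip()
        except Exception as exc:  # broad on purpose: provider SDKs raise different types
            content = f"ERROR: {exc}"
        execution_times.append(time.perf_counter() - start)
        responses.append(content)
    return responses, execution_times
```

In this notebook it could be called as `compare_llm_safe(client.chat.completions.create, llms, messages)`, and the returned lists have the same shape as those from `compare_llm`.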

## The Reversal Curse

In [None]:
messages = [
    {"role": "user", "content": "Who is Tom Cruise's mother?"},
]

responses, execution_times = compare_llm(messages)

In [None]:
import pandas as pd

def display(llms, execution_times, responses):
    """Build a centered, 1-indexed table of models, timings, and responses."""
    # Note: this name shadows IPython's built-in display() in the notebook.
    data = {
        'Provider:Model Name': llms,
        'Execution Time (s)': execution_times,
        'Model Response': responses,
    }

    df = pd.DataFrame(data)
    df.index = df.index + 1  # number rows from 1 instead of 0
    styled_df = df.style.set_table_styles(
        [{'selector': 'th', 'props': [('text-align', 'center')]},
         {'selector': 'td', 'props': [('text-align', 'center')]}]
    ).set_properties(**{'text-align': 'center'})

    return styled_df

In [None]:
display(llms, execution_times, responses)

In [None]:
messages = [
    {"role": "user", "content": "Who is Mary Lee Pfeiffer's son?"},
]

responses, execution_times = compare_llm(messages)

In [None]:
display(llms, execution_times, responses)
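
The asymmetry between the two questions can be pictured with a loose analogy. This is only a toy sketch and not a description of how an LLM stores facts: a one-way lookup table answers the question in the direction it was written, while the reverse question needs an inverse mapping that nobody built. All names below are hypothetical.

```python
# Toy analogy only: a one-way lookup table, not a model of LLM internals.
mother_of = {"Tom Cruise": "Mary Lee Pfeiffer"}

def forward(child):
    """Answer 'Who is <child>'s mother?' from the stored direction."""
    return mother_of.get(child)

def reverse_naive(mother):
    """Try to answer 'Who is <mother>'s son?' using the same keys."""
    return mother_of.get(mother)  # the mother is a value, not a key

def reverse_indexed(mother):
    """Answer the reverse question by building the inverse mapping explicitly."""
    child_of = {m: c for c, m in mother_of.items()}
    return child_of.get(mother)

print(forward("Tom Cruise"))                 # Mary Lee Pfeiffer
print(reverse_naive("Mary Lee Pfeiffer"))    # None
print(reverse_indexed("Mary Lee Pfeiffer"))  # Tom Cruise
```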

## How many r's in the word strawberry?

In [None]:
messages = [
    {"role": "user", "content": "How many r's in the word strawberry?"},
]

responses, execution_times = compare_llm(messages)

In [None]:
display(llms, execution_times, responses)
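
For reference, the ground truth is easy to compute at the character level. The sketch below also shows why counting can be awkward when text arrives as subword pieces. The split in `tokens` is a hypothetical example for illustration only, since real tokenizers may cut the word differently.

```python
word = "strawberry"
print(word.count("r"))  # 3

# Hypothetical subword split, for illustration only; real tokenizers may differ.
tokens = ["str", "aw", "berry"]
assert "".join(tokens) == word

# Count the letter inside each piece, then add the counts up.
per_token = {tok: tok.count("r") for tok in tokens}
print(per_token)                # {'str': 1, 'aw': 0, 'berry': 2}
print(sum(per_token.values()))  # 3
```

A program sees the characters inside each piece, whereas a model works with token identifiers and has to have learned which letters each one contains.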

## Which number is bigger?

In [None]:
messages = [
    {"role": "user", "content": "Which number is bigger, 9.11 or 9.9?"},
]

responses, execution_times = compare_llm(messages)

In [None]:
display(llms, execution_times, responses)

In [None]:
messages = [
    {"role": "user", "content": "Which number is bigger, 9.11 or 9.9? Think step by step."},
]

responses, execution_times = compare_llm(messages)

In [None]:
display(llms, execution_times, responses)
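
One possible reason for wrong answers, offered here as an assumption and not as a claim from the linked post, is that the same strings order differently depending on how they are read. A minimal sketch comparing the decimal reading with a version-number reading (`as_version` is a hypothetical helper):

```python
from decimal import Decimal

a, b = "9.11", "9.9"

# Reading 1: decimal numbers. 9.9 equals 9.90, which is larger than 9.11.
print(Decimal(a) < Decimal(b))  # True

def as_version(text):
    """Split 'major.minor' into a tuple of ints, as for software versions."""
    return tuple(int(part) for part in text.split("."))

# Reading 2: version-like strings. (9, 11) sorts after (9, 9).
print(as_version(a) > as_version(b))  # True
```

Under the decimal reading 9.9 is bigger, which is the answer the prompt expects.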

## Takeaways
1. Not all LLMs are created equal; even Llama 3 (or 3.1) models served by different providers can answer differently.
2. Asking an LLM to think step by step may improve its reasoning.
3. The way tokenization works can lead to a lot of odd LLM behavior (see Andrej Karpathy's [video](https://www.youtube.com/watch?v=zduSFxRajkE) for a deep dive).
4. A more comprehensive benchmark would be desirable, but a quick comparison like the one shown here can be a first step.
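
As a small step toward the benchmark mentioned in the last takeaway, the prompts could be paired with expected answers and scored automatically. The sketch below is one rough way to do that. `score` and `CASES` are hypothetical names, `ask` is any function that maps a prompt string to a reply string, and substring matching is a crude check that suits name-style answers only.

```python
def score(ask, cases):
    """Return the fraction of cases whose reply contains the expected text."""
    hits = 0
    for prompt, expected in cases:
        reply = ask(prompt)
        # Case-insensitive substring match; crude, but enough for names.
        if expected.lower() in reply.lower():
            hits += 1
    return hits / len(cases) if cases else 0.0

# Prompt and expected-answer pairs taken from the reversal curse section.
CASES = [
    ("Who is Tom Cruise's mother?", "Mary Lee Pfeiffer"),
    ("Who is Mary Lee Pfeiffer's son?", "Tom Cruise"),
]
```

With the client from this notebook, `ask` could wrap `client.chat.completions.create` for a single model and return `response.choices[0].message.content`. Numeric prompts such as the 9.11 versus 9.9 question would need a stricter check, because a wrong reply can still mention the right number.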