# LLM Basics
Large Language Models (LLMs) are essentially next-word predictors. Some are small enough to run locally, but even the smallest ones are painfully slow without a reasonable GPU. Instead, we will use an API to send our requests to a model hosted online. This is generally how people use ChatGPT and other popular models, but since most of those APIs cost money, we will use the Groq API, which is free within its usage limits.

## Groq
Groq (not to be confused with X/Twitter's LLM, Grok) is a hardware manufacturer that develops very fast chips for LLM inference. It also hosts a number of pretty good models that you can use for free as long as you stay within their usage limits (see [Groq Rate Limits](https://console.groq.com/docs/rate-limits)). Groq integrates with [LangChain](https://python.langchain.com/v0.2/docs/integrations/chat/groq/), which we will use later.

- Create an account at [https://groq.com/](https://groq.com/)
- Follow the links to create an API key (call it whatever you want)
- Save the API key to a file in this folder called `.env`. This keeps secrets like API keys in one place, where your code can load them without hard-coding them.


## Note on API keys and `.env` file
It is always best to avoid writing your API key directly in your code, because it should be kept secret. One way to do this is to store keys in environment variables that you refer to by name in your code. The [python-dotenv](https://pypi.org/project/python-dotenv/) library supports this by reading your API keys from a `.env` file and loading them into environment variables when your program starts.

To use:
- Write each secret in your `.env` file on its own line, like this:
``` bash
GROQ_API_KEY=<your-groq-api-key>
OPENAI_API_KEY=<your-openai-api-key>
# ...one line per secret
```
- Then load them as environment variables with these lines at the start of your code:
```python
import os
from dotenv import load_dotenv
load_dotenv()
```
- The values are then available in your code like this:
```python
groq_api_key = os.environ.get("GROQ_API_KEY")
```
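
`os.environ.get` returns `None` when a variable is missing, so a typo in `.env` may only show up later as an authentication error. A minimal sketch of a fail-fast check is shown below; `require_env` is a hypothetical helper written for illustration, not part of python-dotenv, and the demo variable is a placeholder rather than a real key.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise a clear error.

    Hypothetical helper for illustration; not a python-dotenv function.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check your .env file")
    return value


# Demo with a throwaway variable so the example runs without a real key
os.environ["DEMO_API_KEY"] = "not-a-real-key"
print(require_env("DEMO_API_KEY"))
```

In a real script you would call `load_dotenv()` first and then `require_env("GROQ_API_KEY")`.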

In [1]:
from groq import Groq
import os
from dotenv import load_dotenv
load_dotenv()

True

## Generating responses
Here is an example of invoking an LLM with the Groq API. Other LLM hosts have different APIs, but they are all pretty similar in principle.

In [2]:
# the client allows you to interact with the api. Use the api key env variable loaded from .env file
client = Groq(api_key = os.environ.get("GROQ_API_KEY"))

# the prompts that will be sent to the llm
system_prompt = "You are an expert python programmer and your mission is to help the user with their programming troubles. Keep answers brief and to the point"
# system_prompt = "You are a grumpy assistant who only answers in ALL CAPS"

user_prompt = "Can you explain how a list comprehension works?"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt}
]

# Generate a response
completion = client.chat.completions.create(
    model = "llama3-70b-8192", # could also be "mixtral-8x7b-32768", or others
    messages = messages,
    temperature=1,
    max_tokens=1024,
    stream=True,
)
print("LLM response:\n")
# stream the response, printing each chunk as it arrives
for chunk in completion:
    print(chunk.choices[0].delta.content or "", end="")

# # if you set stream = False then this would display the message contents
# print(completion.choices[0].message.content)

LLM response:

A list comprehension is a concise way to create a new list from an existing iterable. The basic syntax is:

`new_list = [expression(element) for element in iterable]`

Here's a breakdown of how it works:

1. `expression(element)` is evaluated for each `element` in the `iterable`.
2. The results of these evaluations are collected in a new list, `new_list`.
3. The resulting list contains the transformed elements.

Example: Square all numbers in a list:
`numbers = [1, 2, 3, 4, 5]; squared_numbers = [n**2 for n in numbers]`

In this example, the list comprehension iterates over the `numbers` list, squaring each number (`n**2`), and collects the results in a new list, `squared_numbers`.

## What is happening?
Here is a description of the `client.chat.completions.create()` parameters.

### `messages` (prompts)
A list of the prompts that the LLM will use to generate its completion. Most LLMs accept several prompt types, distinguished by the `role` field:
- **System Prompt**: The initial prompt that provides context or instructions for the conversation. It is often something like `You are a helpful assistant`, but you could also use `You are a grumpy assistant who only answers in ALL CAPS`. It sets the tone for the response and for the rest of the conversation.
- **User Prompt**: What you, the person, say to or ask the LLM.
- **Assistant Prompt**: A response previously generated by the AI assistant in reply to a user prompt. We didn't use one in the example above because it was a single-turn interaction. In longer conversations, this is how you tell the LLM what its previous responses were.
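
To make the three roles concrete, here is a minimal sketch that assembles a `messages` list in the order described above. `build_messages` is a hypothetical helper for illustration; it is not part of the Groq library.

```python
from typing import Optional


def build_messages(system_prompt: str, user_prompt: str,
                   history: Optional[list] = None) -> list:
    """Assemble a chat `messages` list: system prompt, prior turns, new user prompt."""
    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(history or [])  # earlier user/assistant turns, oldest first
    messages.append({"role": "user", "content": user_prompt})
    return messages


history = [
    {"role": "user", "content": "Hello, my name is Trevor."},
    {"role": "assistant", "content": "Nice to meet you, Trevor!"},
]
messages = build_messages("You are a helpful assistant", "What's my name?", history)
print([m["role"] for m in messages])
# → ['system', 'user', 'assistant', 'user']
```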

### `temperature`
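This controls how much randomness to allow when the model picks from a probability distribution of possible next words.
- `temperature = 0`: the model always picks the most likely next word, so responses have (almost) no randomness.
- `temperature > 0`: less likely words get a chance of being picked; the higher the value, the more random and 'creative' the responses.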
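
The effect of temperature can be sketched with the standard softmax-with-temperature formula, where each word's score is divided by the temperature before being turned into a probability. The scores below are made up, and treating `temperature = 0` as "always pick the top word" is an assumption about the limiting case, not a description of how any particular host implements it.

```python
import math


def next_word_probabilities(logits: dict, temperature: float) -> dict:
    """Turn raw word scores into probabilities, scaled by temperature."""
    if temperature == 0:
        # Limiting case (assumed): all probability on the highest-scoring word
        best = max(logits, key=logits.get)
        return {word: float(word == best) for word in logits}
    scaled = {word: score / temperature for word, score in logits.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    exps = {word: math.exp(s - top) for word, s in scaled.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}


logits = {"cat": 2.0, "dog": 1.0, "fish": 0.0}  # made-up scores
for t in (0, 0.5, 1, 2):
    probs = next_word_probabilities(logits, t)
    print(t, {w: round(p, 3) for w, p in probs.items()})
```

As the temperature rises, the probabilities move closer together, so less likely words are sampled more often.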
    
### `max_tokens`
A hard limit on the length of the LLM's response. If you are paying to use the model, you are paying by the token, so this parameter lets you avoid excessively long and expensive responses. Some models are more verbose than others, so you could also tell them to be brief in the system prompt, but that doesn't provide a hard limit.
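
Because billing is per token, `max_tokens` also caps what a single response can cost. A minimal sketch of that bound, assuming an illustrative price of \$15 per million output tokens (the function name and the price are assumptions, and prompt tokens are billed separately):

```python
def max_completion_cost(max_tokens: int, usd_per_million_output_tokens: float) -> float:
    """Upper bound on the completion (output) cost of one request, in USD."""
    return max_tokens / 1e6 * usd_per_million_output_tokens


# 1024 tokens at an illustrative $15 per million output tokens
print(f"${max_completion_cost(1024, 15.0):.5f}")
# → $0.01536
```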

### `stream`
This controls whether the model returns a streaming object that can be used to print the response one chunk at a time as it is generated (`stream=True`), or returns the entire message as a single object after the last token is generated (`stream=False`).
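
A stream can only be read once, so if you want the full text afterwards you need to collect the pieces as you print them. The sketch below shows one way to do that. To run offline, it uses stand-in chunk objects (`fake_chunk`, a hypothetical helper) that mimic the `chunk.choices[0].delta.content` attribute path used in the example above; with a real client you would pass the streamed completion instead.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Print streamed text as it arrives and return the full response."""
    parts = []
    for chunk in chunks:
        text = chunk.choices[0].delta.content or ""  # content can be None on some chunks
        print(text, end="")
        parts.append(text)
    print()
    return "".join(parts)


def fake_chunk(text):
    """Stand-in with the same attribute path as a streamed chunk (illustration only)."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


full_text = collect_stream(fake_chunk(t) for t in ["Hello", ", ", "world", None])
```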


## Next: a function that simplifies the above process, which we can reuse later

In [3]:
def get_single_completion(
    messages, 
    stream = True, 
    verbose = True, 
    model = "llama3-70b-8192"
):
    client = Groq(api_key = os.environ.get("GROQ_API_KEY"))
    completion = client.chat.completions.create(
        model = model,
        messages = messages,
        temperature=0,
        max_tokens=1024,
        stream=stream,
    )
    # print the response: chunk by chunk when streaming, all at once otherwise
    # (note: printing a stream consumes it, so the returned stream is already exhausted)
    if verbose and stream:
        for chunk in completion:
            print(chunk.choices[0].delta.content or "", end="")
    elif verbose:
        print(completion.choices[0].message.content)

    return completion

#### Setting `stream=True` returns a stream object

In [4]:
messages = [
    {"role": "system", "content": "you are a helpful assistant"},
    {"role": "user", "content": "tell me a joke"}
]

completion = get_single_completion(messages, stream = True, verbose = True)
completion

Here's one:

Why couldn't the bicycle stand up by itself?

(Wait for it...)

Because it was two-tired!

Hope that made you smile!

<groq.Stream at 0x22f9e0ee000>

#### Setting `stream=False` returns a `ChatCompletion` object that contains lots of interesting info.
Conveniently, it tells you how many tokens were used, including a breakdown into `prompt_tokens` and `completion_tokens`. Typically, completion (output) tokens are much more expensive than prompt (input) tokens.

For example, OpenAI's *gpt-4o* model costs:
- \$15.00 per million output tokens
- \$3.00 per million input tokens

In [5]:
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Why does ice float in water?"}
]
completion = get_single_completion(messages, stream = False, verbose = False)
completion

ChatCompletion(id='chatcmpl-fd4ba3e0-cee0-4209-a8cf-c51913505fe6', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Ice floats in water because of its unique properties. Here\'s why:\n\nWhen water freezes, its molecules arrange themselves in a crystalline structure. In this structure, the molecules are spaced out in a way that creates empty spaces between them. As a result, the density of ice is lower than that of liquid water.\n\nDensity is defined as the mass of a substance per unit volume. In the case of water, the density of liquid water is approximately 1 gram per cubic centimeter (g/cm³). However, the density of ice is about 0.92 g/cm³, which is lower than that of liquid water.\n\nWhen you put ice in water, the buoyancy force (the upward force exerted by the surrounding water) pushes the ice upward because the density of the ice is lower than that of the surrounding water. This is why ice floats on top of water.\n\nThis phenomen

##### How much would this have cost on GPT-4o?

In [6]:
input_tokens = completion.usage.prompt_tokens
output_tokens = completion.usage.completion_tokens

cost = input_tokens/1e6 * 3 + output_tokens/1e6 * 15
print(f"${round(cost, 5)}")

$0.00406


Not much! But this can easily get much bigger once you start having conversations.
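
To see why conversations get expensive, here is a rough sketch that estimates the total cost when every request re-sends the whole history as input. It uses the gpt-4o prices quoted above as defaults. The function name and the token counts per turn are made up for illustration, and the estimate ignores the system prompt and any per-message overhead.

```python
def conversation_cost(turns, usd_per_million_input=3.0, usd_per_million_output=15.0):
    """Estimate total cost when every request re-sends the whole history.

    `turns` is a list of (user_tokens, assistant_tokens) pairs, one per exchange.
    """
    history_tokens = 0
    total = 0.0
    for user_tokens, assistant_tokens in turns:
        prompt_tokens = history_tokens + user_tokens  # history is billed again as input
        total += prompt_tokens / 1e6 * usd_per_million_input
        total += assistant_tokens / 1e6 * usd_per_million_output
        history_tokens = prompt_tokens + assistant_tokens
    return total


one_turn = conversation_cost([(50, 250)])
ten_turns = conversation_cost([(50, 250)] * 10)
print(f"1 turn: ${one_turn:.5f}, 10 turns: ${ten_turns:.5f}")
```

Ten identical turns cost more than ten times a single turn, because the input side grows with the length of the history.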

## Having a conversation with the model
Getting the model to "remember" what you just told it requires a bit of extra work. Let's see why this is required.

In [14]:
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello my name is Trevor. Please remember that for when I ask you about it in just a second"}
]
completion = get_single_completion(messages)

Nice to meet you, Trevor! I've made a mental note of your name, so I'll remember it for our conversation. Go ahead and ask me anything, and I'll do my best to assist you!

In [15]:
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hey what's my name"}, 
]
completion = get_single_completion(messages)

I'm happy to help! However, I'm a large language model, I don't have any information about you or your name. I'm a new conversation each time you interact with me, so I don't retain any personal information. If you'd like to share your name with me, I'd be happy to chat with you!

### Come on, I just told you!
Yeah, but that message was sent in a separate request, and the LLM keeps no memory between requests, so there is no way it could know your name. You must explicitly send the LLM all the previous messages in the conversation that you want it to know about.

In [16]:
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello my name is Trevor. Please remember that for when I ask you about it in just a second"}, 
    {"role": "assistant", "content": "Nice to meet you, Trevor! I've made a mental note of your name, so I'll remember it for our conversation. Go ahead and ask me anything, and I'll do my best to assist you!"},
    {"role": "user", "content": "Hey what's my name"}, 
]

completion = get_single_completion(messages)

I remember! Your name is Trevor!

In [17]:
# Add previous assistant response then new user prompt
messages.append({"role": "assistant", "content": "I remember! Your name is Trevor!"})
messages.append({"role": "user", "content": "Thanks for remembering. My dog daisy wants to go for a walk now"})
completion = get_single_completion(messages)

Aw, that's great! I'm sure Daisy is excited to get some exercise and fresh air! You should take her on a nice walk and enjoy the time together. Does she have a favorite route or spot she likes to visit?

In [18]:
messages.append({"role": "assistant", "content": "Aw, that's so sweet! I'm sure Daisy is excited to get some fresh air and exercise! You should take her on a nice long walk and enjoy the quality time together. Don't forget to bring poop bags and stay hydrated!"})
messages.append({"role": "user", "content": "What was my dog's name again?"})
completion = get_single_completion(messages)

I remember! Your dog's name is Daisy!

In [19]:
messages.append({"role": "assistant",  "content": "I remember! Your dog's name is Daisy"})
messages.append({"role": "user", "content": "What is my name again?"})
completion = get_single_completion(messages)

I remember! Your name is Trevor

In [20]:
# look at the conversation history
messages

[{'role': 'system', 'content': 'You are a helpful assistant'},
 {'role': 'user',
  'content': 'Hello my name is Trevor. Please remember that for when I ask you about it in just a second'},
 {'role': 'assistant',
  'content': "Nice to meet you, Trevor! I've made a mental note of your name, so I'll remember it for our conversation. Go ahead and ask me anything, and I'll do my best to assist you!"},
 {'role': 'user', 'content': "Hey what's my name"},
 {'role': 'assistant', 'content': 'I remember! Your name is Trevor!'},
 {'role': 'user',
  'content': 'Thanks for remembering. My dog daisy wants to go for a walk now'},
 {'role': 'assistant',
  'content': "Aw, that's so sweet! I'm sure Daisy is excited to get some fresh air and exercise! You should take her on a nice long walk and enjoy the quality time together. Don't forget to bring poop bags and stay hydrated!"},
 {'role': 'user', 'content': "What was my dog's name again?"},
 {'role': 'assistant', 'content': "I remember! Your dog's name i

#### Notes
- Long conversations require sending an ever-growing conversation history to the LLM, which uses more tokens and costs more money (if you are not using a free API).
- Manually adding to the list of messages is tedious. We will see easier ways to do this when we start using the LangChain library.
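
Until then, the bookkeeping can be wrapped in a small class. This is a minimal sketch, not how LangChain does it: `Conversation` and `echo_model` are hypothetical names, and `echo_model` is a stand-in for a real LLM call so the example runs offline.

```python
class Conversation:
    """Keeps the message history and appends each exchange automatically."""

    def __init__(self, complete, system_prompt="You are a helpful assistant"):
        # `complete` is any function that maps a messages list to the reply text
        self.complete = complete
        self.messages = [{"role": "system", "content": system_prompt}]

    def say(self, user_text: str) -> str:
        self.messages.append({"role": "user", "content": user_text})
        reply = self.complete(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def echo_model(messages):
    """Stand-in for a real LLM call so the example runs offline."""
    n_user = sum(m["role"] == "user" for m in messages)
    return f"(reply to user message {n_user})"


chat = Conversation(echo_model)
chat.say("Hello my name is Trevor")
chat.say("Hey what's my name")
print([m["role"] for m in chat.messages])
# → ['system', 'user', 'assistant', 'user', 'assistant']
```

With the Groq client, `complete` could be a function that calls `client.chat.completions.create(...)` with `stream=False` and returns `completion.choices[0].message.content`.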