# Inference with LLMs

Inference is the process of using a trained LLM to generate human-like text from a given input prompt. Language models use the knowledge acquired during training to formulate responses one token at a time. The model leverages probabilities learned across billions of parameters to predict and generate the next token in a sequence.

The attention mechanism is key to how LLMs understand context and generate coherent text. When predicting the next word, not every word in a sentence carries equal weight: attention lets the model focus on the words most relevant to the prediction.
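
As a minimal sketch of the idea that context words receive unequal weight, the snippet below computes scaled dot-product attention weights for one query over three context tokens. The 2-dimensional vectors and the names `softmax` and `attention_weights` are illustrative assumptions, not taken from any particular model.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    return softmax(scores)

# Hypothetical vectors for three context tokens and one query.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights = attention_weights(query, keys)
print(weights)  # the first and third tokens receive more weight than the second
```

The weights always sum to 1, so giving more attention to one token necessarily means giving less to the others.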

## Context Length and Attention Span

Context length is the maximum number of tokens that the LLM can process at once, i.e. the size of the model's working memory.
It is limited by factors such as:
- Model's architecture and size
- Available computational resources
- The complexity of the input and desired output

In an ideal world, we could feed unlimited context to the model, but hardware constraints and computational costs make this impractical. This is why different models are designed with different context lengths to balance capability with efficiency.

Essentially, the context length is the maximum number of tokens that a model considers when generating a response.
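
One simple way to respect a context limit is to keep only the most recent tokens that fit. The sketch below assumes a hypothetical helper `fit_to_context` and an optional budget reserved for the output. Real applications may prefer other strategies, such as summarizing older content.

```python
def fit_to_context(token_ids, max_context, reserve_for_output=0):
    """Keep only the most recent tokens that fit in the context window."""
    budget = max_context - reserve_for_output
    if budget <= 0:
        raise ValueError("no room left for the prompt")
    # Negative slicing keeps the last `budget` tokens (or all of them if fewer).
    return token_ids[-budget:]

prompt_ids = list(range(10))  # stand-in for real token ids
trimmed = fit_to_context(prompt_ids, max_context=8, reserve_for_output=2)
print(trimmed)
```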

## The Two-Phase Inference Process

The text generation process can be broken down into two main phases: prefill and decode. These phases work together, each playing a crucial role in producing coherent text:

* **Prefill Phase**
    The preparation stage. This phase involves three key steps:
    1. **Tokenization**: Converting the input text into tokens.
    2. **Embedding Conversion**: Transforming these tokens into numerical representations that capture their meaning.
    3. **Initial Processing**: Running these embeddings through the model's neural network to create a rich understanding of the context.

* **Decode Phase**
    Where the actual text generation happens:
    1. **Attention Computation**: Looking back at all previous tokens to understand context.
    2. **Probability Calculation**: Determining the likelihood of each possible next token.
    3. **Token Selection**: Choosing the next token based on these probabilities.
    4. **Continuation Check**: Deciding whether to continue or stop generation.
    This phase is memory-intensive because the model needs to keep track of all previously generated tokens and their relationships.
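
The two phases can be sketched with a toy example. Everything here is hypothetical: the five-word vocabulary, the `NEXT_SCORES` table standing in for the neural network, and greedy selection as the token selection rule. A real model would embed the tokens and run attention over the whole sequence instead of looking only at the last token.

```python
VOCAB = ["<eos>", "the", "cat", "sat", "down"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

# Hypothetical "model": next-token scores depend only on the last token.
NEXT_SCORES = {
    "the": {"cat": 2.0, "sat": 0.5},
    "cat": {"sat": 2.0, "down": 0.1},
    "sat": {"down": 2.0, "<eos>": 0.5},
    "down": {"<eos>": 3.0},
}

def prefill(prompt):
    """Prefill: tokenize the whole prompt in one pass and return the state."""
    return [TOKEN_TO_ID[word] for word in prompt.lower().split()]

def decode(state, max_new_tokens=10):
    """Decode: generate one token per step until EOS or the token limit."""
    generated = []
    for _ in range(max_new_tokens):
        scores = NEXT_SCORES[VOCAB[state[-1]]]   # score each candidate
        next_token = max(scores, key=scores.get)  # greedy token selection
        if next_token == "<eos>":                 # continuation check
            break
        state.append(TOKEN_TO_ID[next_token])
        generated.append(next_token)
    return generated

state = prefill("the cat")
completion = decode(state)
print(completion)
```

Note that prefill handles all prompt tokens at once, while decode must run sequentially because each new token depends on the ones before it.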

## Sampling Strategies

Now let's explore the ways we can control the generation process. By adjusting how the next token is selected, we can make the model more creative or more precise.

**Token Selection**: When the model needs to choose the next token, it starts with raw scores (logits) for every token in its vocabulary. But how do we turn these scores into actual choices? The process is:
1. **Raw Logits:** Think of these as the model's initial gut feelings about each possible next word.
2. **Temperature Control:** A creativity dial. Lower settings (<1) make choices more focused and deterministic; higher settings (>1) make them more random and creative.
3. **Top-p Sampling:** Instead of considering all possible words, consider only the most likely ones whose probabilities add up to a chosen threshold p.
4. **Top-k Filtering:** An alternative approach that considers only the k most likely words.
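
The steps above can be sketched in plain Python. The function names and the four example logits are assumptions for illustration. Production libraries implement the same ideas on tensors and may differ in details such as tie handling.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature rescales them first."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def sample(dist, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]            # hypothetical scores for 4 tokens
sharp = softmax(logits, temperature=0.5)  # more peaked distribution
flat = softmax(logits, temperature=2.0)   # more uniform distribution
nucleus = top_p_filter(softmax(logits), p=0.9)
choice = sample(nucleus, random.Random(0))
```

Lower temperature concentrates probability on the top token, while top-k and top-p remove the unlikely tail before sampling.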

## Managing Repetition

LLMs have a common tendency to repeat themselves, so two penalties are used to prevent this:
- **Presence Penalty**: A fixed penalty applied to any token that has appeared before, regardless of how often.
- **Frequency Penalty**: A scaling penalty that increases based on how often a token has been used. The more a word appears, the less likely it is to be chosen again.
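
One common formulation subtracts both penalties directly from the logits before sampling. The sketch below assumes that additive form and a hypothetical function name. Specific inference libraries may use different formulas, for example multiplicative penalties.

```python
from collections import Counter

def apply_repetition_penalties(logits, generated_ids,
                               presence_penalty=0.0, frequency_penalty=0.0):
    """Lower the logits of tokens that already appear in the output."""
    counts = Counter(generated_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        # Presence: flat cost for appearing at all.
        # Frequency: cost that grows with the number of occurrences.
        adjusted[token_id] -= presence_penalty + frequency_penalty * count
    return adjusted

logits = [1.0, 1.0, 1.0]
history = [0, 0, 1]  # token 0 used twice, token 1 once, token 2 never
penalized = apply_repetition_penalties(
    logits, history, presence_penalty=0.5, frequency_penalty=0.25
)
print(penalized)
```

Token 2 keeps its original score, token 1 pays the presence penalty plus one unit of frequency penalty, and token 0 pays the most because it was used twice.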

## Controlling Generation Length

It's important to control how much text the LLM generates. Several methods are available:
- **Token Limits:** Setting minimum and maximum token counts.
- **Stop Sequences:** Defining specific patterns that signal the end of generation.
- **End-of-Sequence Detection:** Letting the model naturally conclude its response.
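
The three mechanisms can be combined in a single loop. This is a sketch under stated assumptions: `token_stream` is a hypothetical iterable standing in for the model's output, and an early EOS is simply skipped to enforce the minimum length, whereas real implementations typically mask the EOS logit instead.

```python
def generate(token_stream, max_new_tokens=20, min_new_tokens=0,
             stop_sequences=(), eos_token="<eos>"):
    """Consume tokens until a stopping rule fires; return (text, reason)."""
    pieces = []
    for token in token_stream:
        if token == eos_token:
            if len(pieces) >= min_new_tokens:
                return "".join(pieces), "eos"
            continue  # too short: ignore this EOS and keep going
        pieces.append(token)
        text = "".join(pieces)
        for stop in stop_sequences:
            if stop in text:
                # Cut the text just before the stop sequence.
                return text[: text.index(stop)], "stop_sequence"
        if len(pieces) >= max_new_tokens:
            return text, "max_tokens"
    return "".join(pieces), "stream_ended"

stream = ["Hello", ",", " world", "<eos>", " extra"]
print(generate(stream))
print(generate(stream, max_new_tokens=2))
print(generate(["A", "\n", "\n", "B"], stop_sequences=["\n\n"]))
```

Returning the stop reason alongside the text makes it easy to tell a natural ending from a truncated one.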


## Beam Search

Beam search is a multi-path strategy in which the model explores several candidate sequences in parallel and then selects the best one:
1. At each step, maintain multiple candidate sequences
2. For each candidate, compute probabilities for the next token
3. Keep only the most promising combinations of sequences and next tokens
4. Continue this process until reaching a stop condition
5. Select the sequence with the highest overall probability

This approach often produces more coherent and grammatically correct text, but it requires more computational resources than simpler methods.
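
The five steps can be sketched with a toy model. The `MODEL` table of next-token probabilities is hypothetical and depends only on the last token. It is chosen so that the greedy first choice does not lead to the best overall sequence.

```python
import math

# Hypothetical next-token probabilities, keyed by the last token.
MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"x": 0.9, "y": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
}

def beam_search(start="<s>", end="</s>", beam_width=2, max_steps=5):
    """Return the best (sequence, log-probability) found by beam search."""
    beams = [([start], 0.0)]  # step 1: candidate sequences with their scores
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end:
                candidates.append((seq, score))  # finished: carry over as-is
                continue
            # Step 2: score every possible next token for this candidate.
            for token, prob in MODEL[seq[-1]].items():
                candidates.append((seq + [token], score + math.log(prob)))
        # Step 3: keep only the most promising candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        # Step 4: stop once every surviving beam has finished.
        if all(seq[-1] == end for seq, _ in beams):
            break
    return beams[0]  # step 5: highest overall probability

best_seq, best_score = beam_search(beam_width=2)
greedy_seq, greedy_score = beam_search(beam_width=1)
print(best_seq, math.exp(best_score))
print(greedy_seq, math.exp(greedy_score))
```

With a beam width of 1 the search is greedy and commits to `a` first, ending with probability 0.30. With a width of 2 it keeps `b` alive and finds the sequence with probability 0.36.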

## Key Performance Metrics

When working with LLMs, four critical metrics will shape your implementation decisions:
1. **Time to First Token (TTFT):** How quickly can you get the first response?
2. **Time Per Output Token (TPOT):** How fast can you generate the subsequent tokens?
3. **Throughput:** How many requests can you handle simultaneously?
4. **VRAM Usage:** How much GPU memory do you need?
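
The two latency metrics can be measured around any token stream. The sketch below assumes a hypothetical `measure_stream` helper with an injectable clock so it can be exercised without a real model. It does not cover throughput or VRAM usage, which depend on the serving setup.

```python
import time

def measure_stream(token_stream, clock=time.perf_counter):
    """Measure TTFT and average TPOT (both in seconds) for one streamed response."""
    start = clock()
    arrival_times = [clock() for _ in token_stream]  # timestamp each token
    if not arrival_times:
        raise ValueError("the stream produced no tokens")
    n = len(arrival_times)
    ttft = arrival_times[0] - start
    # TPOT averages the gaps between tokens after the first one.
    tpot = (arrival_times[-1] - arrival_times[0]) / (n - 1) if n > 1 else 0.0
    return {"ttft": ttft, "tpot": tpot, "tokens": n}

# Fake clock: request sent at 0.0 s, tokens arrive at 0.5 s, 0.6 s and 0.7 s.
fake_times = iter([0.0, 0.5, 0.6, 0.7])
metrics = measure_stream(["Hello", ",", " world"], clock=lambda: next(fake_times))
print(metrics)
```

TTFT is dominated by the prefill phase, while TPOT reflects the cost of each decode step.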

## The Context Length Challenge

One of the most significant challenges in LLM inference is managing context length effectively. Longer contexts provide more information but come with substantial costs:
- **Memory usage:** Grows **quadratically** with context length in standard self-attention, since every token attends to every other token
- **Processing speed:** Decreases **linearly** with longer contexts
- **Resource Allocation:** Requires careful balancing of VRAM usage
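
A back-of-the-envelope sketch of the quadratic growth: if the attention score matrix is fully materialized, it holds one score per pair of tokens. The function name and the 2-byte value size are assumptions, and optimized attention implementations may avoid storing the full matrix.

```python
def attention_score_bytes(context_length, num_heads=1, bytes_per_value=2):
    """Bytes for one layer's full attention score matrix (one score per token pair)."""
    return num_heads * context_length * context_length * bytes_per_value

for n in (1024, 2048, 4096):
    # Doubling the context length quadruples the size of the score matrix.
    print(n, attention_score_bytes(n))
```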

### The KV Cache Optimization

To address these challenges, one powerful optimization is KV (key-value) caching: the key and value tensors computed for each token are stored in a cache during inference. Reusing these intermediate calculations instead of recomputing them at every step significantly improves inference speed. This optimization:
- Reduces repeated calculations
- Improves generation speed
- Makes long-context generation practical

The trade-off is additional memory usage, but the performance benefits usually far outweigh this cost.
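
The trade-off can be illustrated with a small sketch. `count_projections` counts how many key/value computations a decode loop performs with and without a cache, and `kv_cache_bytes` estimates the cache size for one sequence. Both helpers and the model dimensions in the example are hypothetical.

```python
def count_projections(num_tokens, use_cache):
    """Count key/value computations over num_tokens decode steps."""
    cache = []  # cached (key, value) pairs, one per processed token
    calls = 0
    for step in range(num_tokens):
        if use_cache:
            # Only the newest token needs its key and value computed.
            cache.append(("k%d" % step, "v%d" % step))
            calls += 1
        else:
            # Without a cache, every previous token is recomputed each step.
            cache = [("k%d" % i, "v%d" % i) for i in range(step + 1)]
            calls += step + 1
    return calls, len(cache)

def kv_cache_bytes(context_length, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2):
    """Estimate the KV cache size for one sequence."""
    # Factor 2: one key tensor and one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * context_length * bytes_per_value)

print(count_projections(5, use_cache=False))  # work grows quadratically
print(count_projections(5, use_cache=True))   # work grows linearly
# Hypothetical model: 32 layers, 32 KV heads of size 128, 16-bit values.
print(kv_cache_bytes(4096, num_layers=32, num_kv_heads=32, head_dim=128))
```

In this example the cache removes two thirds of the computations after only five tokens, at the price of memory that grows linearly with context length (2 GiB at 4096 tokens for the assumed dimensions).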