<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/ufidon/nlp/blob/main/09.llms.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/ufidon/nlp/blob/main/09.llms.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>
<br>

# Large Language Models

📝 SALP chapter 10

## 🤔 How do we learn?
- Fluent speakers command a vast vocabulary, which is central to comprehending and producing language; how they acquire it is a useful window into knowledge acquisition.

- Adult vocabulary estimates range from 30,000 to 100,000 words; the words used in everyday interaction are mostly learned early through spoken interactions.

- Children need to learn 7-10 words daily to reach adult vocabulary levels by age 20, a consistent rate across studies.

- Reading drives vocabulary growth, with children learning words faster than they encounter them in texts.

- The distributional hypothesis suggests word meanings are learned from co-occurrences in text, with real-world interactions enhancing this.

## 🤔 Could machines learn similarly?
- Pretraining large language models (LLMs) involves learning language and world knowledge from vast text data, enabling them to excel in various natural language tasks.

- LLMs have transformed tasks like summarization, translation, question answering, and chatbots by using the knowledge gained during pretraining.

- LLMs are often `autoregressive or causal`: during training they predict the next word from the previous ones, following the order of the text (mostly left-to-right).

- Text generation with LLMs is central to generative AI, which includes text, code, and image generation, using specific algorithms like greedy decoding and sampling.

- Almost any NLP task, such as summarization, can be framed as `word prediction` in LLMs, demonstrating their versatility.

## Overview of LLMs
- **Architecture**: Large Language Models (LLMs) are deep neural networks, often based on transformer architectures, that can process and generate human-like text by learning patterns in massive datasets.

- **Training**: LLMs are trained on vast amounts of text data from the internet, books, and other sources. This enables them to understand grammar, facts, and relationships between words to perform a variety of natural language tasks.

- **Scale**: These models are "large" due to their immense size, often containing billions or even trillions of parameters (weights and biases), which allows them to capture nuanced information and context in text.

- **Generalization**: LLMs can generalize across a wide range of language tasks like text completion, translation, summarization, and answering questions without task-specific training, relying on their broad training data.

- **Adaptability**: LLMs can be fine-tuned for specific applications, such as customer support, writing assistants, or domain-specific text generation, making them versatile across industries and tasks.

## Conditional Text Generation with LLMs
- LLMs generate text token-by-token, using both the `input prompt` and `previously generated tokens`.
  - Long context windows (thousands of tokens) make transformers effective for this task.
- 🍎 Text completion – LLMs predict the next word based on context, leading to coherent outputs.
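
The token-by-token loop can be sketched with a toy stand-in for the model. Everything here is an assumption for illustration: `TOY_BIGRAMS` is a hand-written table that looks only at the last token (a real transformer conditions on the whole context window), `</s>` is a hypothetical end marker, and the loop simply takes the most probable token at each step.

```python
# Toy stand-in for an LLM (assumption): next-token probabilities depend only
# on the last token. A real model would use the full prompt and history.
TOY_BIGRAMS = {
    "for": {"all": 0.7, "the": 0.2, "</s>": 0.1},
    "all": {"the": 0.8, "of": 0.1, "</s>": 0.1},
    "the": {"fish": 0.6, "all": 0.1, "</s>": 0.3},
    "fish": {"</s>": 0.9, "for": 0.1},
    "of": {"the": 1.0},
}


def next_token_probs(context):
    """Return P(next token | context) from the toy table."""
    return TOY_BIGRAMS[context[-1]]


def complete(prompt, max_new_tokens=10):
    """Extend the prompt one token at a time until </s> or the length limit."""
    context = list(prompt)  # input prompt + previously generated tokens
    for _ in range(max_new_tokens):
        probs = next_token_probs(context)
        token = max(probs, key=probs.get)  # pick the most probable token
        if token == "</s>":
            break
        context.append(token)  # the new token becomes part of the context
    return context


print(" ".join(complete(["So", "long", "and", "thanks", "for"])))
```

Each generated token is appended to the context before the next prediction, which is what makes the generation conditional on both the prompt and the model's own earlier output.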

![Left-to-right (also called autoregressive) text completion](./images/llm/lrtextcomp.png)


### **Word Prediction for NLP Tasks**
- Many practical NLP tasks can be cast as `word prediction`.
- **Sentiment Analysis**: Predict sentiment by comparing probabilities of words like "positive" vs. "negative."
  - e.g. The sentiment analysis of `I like NLP` can be cast as comparing
    - P(positive | The sentiment of `I like NLP` is:), and
    - P(negative | The sentiment of `I like NLP` is:)
- **Question Answering**: Predict the next word after a question to generate factual answers.
  - e.g. "Who wrote *The Origin of Species*?" → Answer: "Charles Darwin." can be cast as
    - P(w | Q: Who wrote the book "The Origin of Species"? A:) over all possible next words
      - It is very likely we get `Charles`; we add it to the context and continue the prediction
    - P(w | Q: Who wrote the book "The Origin of Species"? A: Charles)
      - then it is very likely we get `Darwin`
- **Summarization**: LLMs can generate a summary when prompted with a long article followed by `tl;dr`, a cue to condense the preceding text.
  - Transformers handle large context windows, using the entire article and generated text to produce concise summaries.
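
As a minimal sketch of the sentiment example, suppose we already have the model's next-word distribution for the prompt "The sentiment of `I like NLP` is:". The probabilities below are made-up numbers, and `classify_sentiment` is a hypothetical helper, not part of any library.

```python
def classify_sentiment(next_word_probs, labels=("positive", "negative")):
    """Pick the label word with the higher next-word probability."""
    scores = {label: next_word_probs.get(label, 0.0) for label in labels}
    total = sum(scores.values())
    if total == 0:
        raise ValueError("the model assigns zero probability to every label word")
    # Renormalize over the label words only, ignoring the rest of the vocabulary.
    normalized = {label: score / total for label, score in scores.items()}
    return max(normalized, key=normalized.get), normalized


# Made-up next-word probabilities (assumption, not real model output).
probs = {"positive": 0.30, "negative": 0.05, "great": 0.10, "the": 0.02}
label, normalized = classify_sentiment(probs)
print(label, normalized)
```

Only the two label words are compared; all other words in the vocabulary are ignored, which is what turns open-ended word prediction into classification.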

![Summarization with large language models using the tl;dr token and context-based autoregressive generation](./images/llm/texsum.png)

### **Decoding Strategies in LLMs**
- Choosing a word to generate based on the model’s probabilities is called `decoding`. There are 3 popular decoding strategies:
- **Greedy decoding**: Always selects the most probable next word $\hat{w}_t$ from the vocabulary $V$, but results in repetitive, generic text.
  - $\hat{w}_t = \arg \max_{w∈V} P(w|𝐰_{<t})$
  - Extremely predictable: identical contexts always produce the same output.
- **Beam search**: An extension of greedy decoding that keeps several candidate sequences at each step instead of one; it works well for highly constrained tasks like machine translation.
  - In such tasks, the output is a text in one language conditioned on a very specific text in another language.
- **Sampling methods**: repeatedly sample words at random according to their probability until a predetermined length is reached or the end-of-sequence (EOS) token is selected.
  - Introduce diversity by generating less predictable outputs, improving text variation over greedy decoding.
  - It is the most common decoding method in LLMs, generalized to condition on the prompt:
    - Conditioned on `prompts and previous selections`, words are sampled based on their conditional probabilities determined by a transformer language model.
    - Four popular sampling schemes: `random` sampling, `top-k` sampling, `nucleus or top-p` sampling, and `temperature` sampling
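
To contrast greedy decoding with beam search, here is a small sketch on a hand-built two-step distribution. `STEP_PROBS` and its numbers are assumptions chosen so that the locally best first word does not lead to the most probable sequence; a beam width of 1 reduces to greedy decoding.

```python
import math

# Hypothetical conditional distributions, keyed by the sequence generated so far.
STEP_PROBS = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "w": 0.1},
}


def beam_search(step_probs, beam_width, length):
    """Keep the beam_width most probable partial sequences at every step."""
    beams = [((), 0.0)]  # (sequence, log probability)
    for _ in range(length):
        candidates = []
        for seq, logp in beams:
            for token, p in step_probs[seq].items():
                candidates.append((seq + (token,), logp + math.log(p)))
        # Sort by log probability, best first, and prune to the beam width.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


greedy_seq, greedy_logp = beam_search(STEP_PROBS, beam_width=1, length=2)
beam_seq, beam_logp = beam_search(STEP_PROBS, beam_width=2, length=2)
print(greedy_seq, round(math.exp(greedy_logp), 2))
print(beam_seq, round(math.exp(beam_logp), 2))
```

With width 1 the search commits to `a` and ends with a sequence of probability 0.3, while width 2 keeps `b` alive long enough to find `b z` with probability 0.36.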

### Random Sampling 
- Generates a sequence of words $W = w_1, w_2, \cdots, w_N$, until the end-of-sequence token is hit:

  $
  \begin{array}{l}
  i \leftarrow 1\\
  w_i \sim p(w)\\
  \textbf{while } w_i \neq \text{EOS}\\
  \quad i \leftarrow i+1\\
  \quad w_i \sim p(w_i \mid 𝐰_{<i})
  \end{array}
  $
  - `x ∼ p(x)`: choose x by sampling from the distribution p(x)

- Random sampling may generate `strange or incoherent sentences`: each rare word is individually unlikely, but there are so many of them that together they hold a large share of the probability mass.
- Alternative sampling methods reduce the chance of selecting unlikely words by trading off quality (favoring more probable words) against diversity (including middle-probability words for creativity).
- High-probability words lead to coherent but repetitive text, while middle-probability words enhance creativity at the cost of coherence.
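
The sampling loop above can be sketched in plain Python. The bigram table `TOY_BIGRAMS` is an assumption standing in for a language model (it conditions only on the previous word rather than on all of $𝐰_{<i}$), and `max_len` is an added safety limit that is not part of the pseudocode.

```python
import random

# Toy conditional distributions (assumption): P(next word | previous word).
TOY_BIGRAMS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.4, "EOS": 0.1},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sleeps": 0.6, "EOS": 0.4},
    "dog": {"sleeps": 0.5, "EOS": 0.5},
    "sleeps": {"EOS": 1.0},
}


def sample_word(dist, rng):
    """x ~ p(x): draw one word in proportion to its probability."""
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]


def sample_sequence(model, rng, max_len=20):
    """Sample words one at a time until EOS is drawn (or max_len is reached)."""
    words = []
    word = sample_word(model["<s>"], rng)  # w_1 ~ p(w)
    while word != "EOS" and len(words) < max_len:
        words.append(word)
        word = sample_word(model[word], rng)  # w_i ~ p(w_i | previous words)
    return words


print(sample_sequence(TOY_BIGRAMS, random.Random(0)))
```

Passing a seeded `random.Random` makes a run reproducible; different seeds give different sentences, which is the diversity that greedy decoding lacks.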

### Top-k Sampling
- Top-k sampling generalizes greedy decoding by `selecting from the top k most likely words` instead of the single most probable word.
- At `each word generation`, the vocabulary is `truncated` to the top k words based on their likelihood, and the distribution is `renormalized`.
- A word is randomly sampled from these k words based on their `renormalized probabilities`.
- When `k = 1`, top-k sampling behaves the `same as greedy` decoding.
- Larger k values introduce more diverse text while maintaining quality by selecting words that are still sufficiently probable.
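
A minimal sketch of the truncate, renormalize, and sample steps, using a made-up four-word distribution (`top_k_filter` and `sample_top_k` are illustrative names, not library functions):

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable words and renormalize them to sum to 1."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {word: p / total for word, p in top}


def sample_top_k(probs, k, rng=random):
    """Sample one word from the renormalized top-k distribution."""
    filtered = top_k_filter(probs, k)
    return rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]


# Made-up next-word distribution (assumption).
probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
print(top_k_filter(probs, 2))  # only "the" and "a" survive
print(sample_top_k(probs, 1))  # k = 1 always returns the most probable word
```

With `k = 2` the surviving probabilities 0.5 and 0.3 are rescaled by their sum 0.8, so the unlikely words `cat` and `zebra` can never be drawn.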

### Nucleus or top-p sampling
- Top-k sampling has a fixed k, but `word probability distributions vary by context`, making it less adaptable.
- `Top-p` sampling (nucleus sampling) keeps the most probable words that together `cover a fixed fraction p of the probability mass` instead of a fixed number of words.
- This approach aims to remove unlikely words while being more flexible across different contexts.
- Top-p sampling `dynamically adjusts the pool of candidate words`, ensuring better adaptability to varying probability distributions.
- Given a distribution $P(w_t |𝐰_{<t} )$, the top-p vocabulary $V^{(p)}$ is the smallest set of words such that 
  - $\displaystyle \sum_{w∈V^{(p)}}  P(w|𝐰_{<t} ) ≥ p$
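
The definition of $V^{(p)}$ can be sketched by adding words in order of decreasing probability until the running total reaches $p$. The two distributions are made-up examples, and `top_p_filter` is an illustrative name.

```python
def top_p_filter(probs, p):
    """Return the renormalized smallest set of top words whose mass is >= p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    nucleus, mass = [], 0.0
    for word, prob in ranked:
        nucleus.append((word, prob))
        mass += prob
        if mass >= p:  # stop as soon as the nucleus covers p
            break
    return {word: prob / mass for word, prob in nucleus}


# Made-up distributions (assumption): one peaked, one flat.
peaked = {"the": 0.9, "a": 0.05, "cat": 0.03, "zebra": 0.02}
flat = {"the": 0.3, "a": 0.3, "cat": 0.2, "zebra": 0.2}

print(top_p_filter(peaked, 0.75))  # a single word already covers p
print(top_p_filter(flat, 0.75))    # three words are needed to cover p
```

The same `p` yields a candidate pool of one word for the peaked distribution and three for the flat one, which is the context-dependent behavior a fixed `k` cannot provide.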

### Temperature sampling
- Temperature sampling `reshapes the probability distribution` instead of truncating it, `adjusting word probabilities` based on a `temperature parameter` $τ$.
- In `low-temperature` sampling $τ ∈ (0,1]$, the probabilities of the most likely words increase and those of rare words decrease, making the distribution more `focused on high-probability` words.
  - The logits are divided by τ before being passed through softmax, enhancing the probability of the most likely words.
  - As τ approaches 0, the model becomes more "greedy," favoring the most probable word, while τ close to 1 leaves the distribution mostly unchanged.
- `High-temperature` sampling $τ > 1$ flattens the distribution, encouraging `more exploration and diversity` in word selection.
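
The effect of $τ$ can be checked numerically with a small softmax sketch. The logits are arbitrary example values, and `softmax_with_temperature` is an illustrative helper.

```python
import math


def softmax_with_temperature(logits, tau):
    """Divide the logits by tau, then apply a numerically stable softmax."""
    scaled = [z / tau for z in logits]
    biggest = max(scaled)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - biggest) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # arbitrary example scores for three words
p_low = softmax_with_temperature(logits, 0.5)   # sharper than tau = 1
p_one = softmax_with_temperature(logits, 1.0)   # the ordinary softmax
p_high = softmax_with_temperature(logits, 2.0)  # flatter than tau = 1

for name, dist in [("tau=0.5", p_low), ("tau=1.0", p_one), ("tau=2.0", p_high)]:
    print(name, [round(q, 3) for q in dist])
```

The probability of the top word grows as $τ$ shrinks and falls as $τ$ grows, while the ranking of the words never changes.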

## Training LLMs with Self-Supervision
- A transformer language model is trained using `self-supervision`, predicting the next word in a text sequence `without needing additional labels`.
- The model minimizes prediction errors by using `cross-entropy loss`, which measures the difference between the predicted $\bm{\hat{y}}_t$ and actual $\bm{y}_t$ probability distributions.
  - $\displaystyle L_{CE} = -\sum_{w\in V} \bm{y}_t[w]\log \bm{\hat{y}}_t[w]$
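- Because the true distribution $\bm{y}_t$ is one-hot (1 for the actual next word, 0 elsewhere), every term in the sum vanishes except one, so the loss reduces to the negative log probability assigned to the correct next word $w_{t+1}$.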
  - $\displaystyle L_{CE}(\bm{\hat{y}}_t, \bm{y}_t) = -\log \bm{\hat{y}}_t[w_{t+1}]$
- At each time step, the model computes a probability distribution for the next word based on the correct input sequence.
- `Teacher forcing` is used, where the correct sequence of tokens is always fed to the model for prediction, rather than using its previous predictions.
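
A quick numerical check that the full cross-entropy sum and the simplified form agree when the target is one-hot. The three-word vocabulary and the predicted probabilities are made-up values.

```python
import math


def cross_entropy_full(y_true, y_pred):
    """-sum over the vocabulary of y[w] * log(y_hat[w]); zero terms are skipped."""
    return -sum(y * math.log(q) for y, q in zip(y_true, y_pred) if y > 0)


def cross_entropy_one_hot(y_pred, correct_index):
    """Simplified form: -log of the probability given to the correct word."""
    return -math.log(y_pred[correct_index])


y_pred = [0.1, 0.7, 0.2]  # made-up model distribution over 3 words
y_true = [0.0, 1.0, 0.0]  # one-hot: the correct next word is index 1

loss_full = cross_entropy_full(y_true, y_pred)
loss_simple = cross_entropy_one_hot(y_pred, 1)
print(loss_full, loss_simple)
```

The loss is small when the model puts high probability on the correct word and grows without bound as that probability approaches 0.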


![Training a transformer as a language model](./images/llm/trainllm.png)

- The average cross-entropy loss over the entire sequence is calculated, and weights are adjusted via gradient descent to minimize this loss.
- Unlike RNNs, transformers `process each item in the sequence in parallel`, as there’s no recurrence in hidden state calculations.
- Large models `fill the full context window` with text, packing multiple documents with special end-of-text tokens if necessary.
- Batch sizes for gradient descent are typically large, with GPT-3 models using up to 3.2 million tokens per batch.
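
Packing documents into full context windows can be sketched as concatenating them into one token stream with a separator and slicing the stream into fixed-size chunks. The separator string `<|endoftext|>`, the tiny window size, and the choice to drop a trailing partial window are all assumptions for illustration; real pipelines differ in these details.

```python
def pack_documents(docs, window, eot="<|endoftext|>"):
    """Concatenate tokenized documents, separated by eot, into full windows."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eot)  # mark the document boundary
    # Slice into windows of exactly `window` tokens; a trailing partial
    # window is dropped here (one possible choice, assumed for simplicity).
    return [stream[i:i + window] for i in range(0, len(stream) - window + 1, window)]


docs = [["a", "b", "c"], ["d", "e"], ["f", "g", "h", "i"]]
for chunk in pack_documents(docs, window=4):
    print(chunk)
```

A window may start in the middle of one document and end in the middle of the next; the end-of-text token is what tells the model where one document stops and another begins.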