# $$ Overview $$


## What is an LLM
A `Large Language Model` (LLM) is a neural network trained on massive amounts of text to **understand**, **generate**, and **manipulate** human language. <br/>
* “Large” refers to both the model size (billions of parameters) and the scale of the data used during training.
  
### Why are LLMs so important?
LLMs power modern applications such as *ChatGPT*, *translation systems*, *question answering*, *summarization*, and even *code generation*.


**The Key Behind Their Success:**
> Transformer architecture<br/>
> Huge training datasets<br/>

→ Together, these allow the model to capture complex linguistic behavior that cannot be hand-coded.

**GenAI**:<br/>
Because LLMs can generate text, they are categorized as `generative AI (GenAI)`.

<br/>

### Limitations of Pre-LLM NLP
Earlier <span style="color:RED"> **NLP** </span> models were effective only on simple, narrow tasks and failed at deep understanding or at generating coherent text.

**LLMs** vs **Traditional Models**:
- Traditional NLP → single-task systems.
- LLMs → general-purpose, adaptable to many tasks.

<div style="text-align: center; margin-top: 20px;">
  <img 
    src="https://raw.githubusercontent.com/salavii/llm-from-scratch/main/images/llm_hierarchy.png"
    style="width: 450px; border-radius: 10px; display: block; margin-left: auto; margin-right: auto;"
  >

  <p style="font-size: 16px; color: #333; font-weight: bold; margin-top: 10px;">
    Relationship between AI → Machine Learning → Deep Learning → GenAI → LLMs <br/>
    AI refers to systems that exhibit human-like intelligence.<br/>
    ML learns patterns automatically from data.<br/>
    Deep Learning uses multilayer neural networks.<br/>
    GenAI involves the use of deep neural networks to create new content, such as text, images, or various forms of media.<br/>
    LLMs are deep neural networks specialized for processing and generating human-like text.
  </p>
</div>


#### In short
`LLMs` are invaluable for automating almost any task that involves parsing and generating text. <br/>
Their applications are virtually endless, and as we continue to innovate and explore new ways to use these models, it’s clear that LLMs have the potential to redefine our relationship with technology, making it more conversational, intuitive, and accessible.

<div style="text-align: center; margin-top: 20px;">
  <img 
    src="https://raw.githubusercontent.com/salavii/llm-from-scratch/main/images/llm_chat_example.png"
    style="width: 450px; border-radius: 10px; display: block; margin-left: auto; margin-right: auto;"
  >
    <p style="font-size: 16px; color: #333; font-weight: bold; margin-top: 10px;">
   <b>Example of LLM Interaction:</b><br>
        This example shows how an LLM can understand a casual user request and generate<br/>
        a humorous, context-aware response.<br><br>
        It demonstrates the model’s ability to interpret intent and produce natural,
        creative text output.
  </p>
</div>


<br/>

##  Why should we build our own LLMs? 
Building an LLM from scratch helps understand its internal mechanics and limitations. <br/>
It also equips us with the knowledge required for `pretraining` or `fine-tuning` existing open-source LLM architectures on our own domain-specific datasets or tasks. 
- <span style="color:BLUE"> **Pretraining**</span> provides general language understanding using large, diverse datasets.
- <span style="color:BLUE"> **Fine-tuning**</span> adapts the pretrained model to domain-specific tasks using smaller labeled datasets.

**Custom LLMs can outperform general-purpose models on specialized tasks and offer several advantages:**
1. Improved privacy: companies may prefer not to share sensitive data with third-party LLM providers such as OpenAI due to confidentiality concerns.
2. Lower latency (local deployment)
3. Reduced server cost
4. Full control over updates and behavior

<div style="text-align: center; margin-top: 20px;">
  <img 
    src="https://raw.githubusercontent.com/salavii/llm-from-scratch/main/images/llm_building_process.png"
    style="width: 450px; border-radius: 10px; display: block; margin-left: auto; margin-right: auto;"
  >

  <p style="font-size: 16px; color: #333; font-weight: bold; margin-top: 10px;">
    This diagram shows how an LLM is built: first it learns general language patterns through pretraining,
    then it is adapted to specific tasks through fine-tuning.
  </p>
</div>


##  The first step in creating an LLM
### Pretraining Overview
- LLM training begins with a large corpus of unlabeled *raw text*.
- The model learns using **self-supervised learning**, where the model generates its own labels from the input data.
- The result is a **base (foundation) model** with core abilities such as next-word prediction, text completion, and limited few-shot learning.
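
The idea that the model "generates its own labels" can be sketched in a few lines. The snippet below is a minimal illustration, not the pipeline used by any particular LLM: the function name `make_next_word_pairs`, the whitespace tokenization, and the `context_size` parameter are all assumptions made for this example. Each target is simply the word that follows the context in the raw text, so no human labeling is needed.

```python
def make_next_word_pairs(text, context_size=3):
    """Turn raw text into (context, target) pairs.

    The label for each context is the next word in the text itself,
    which is what makes the setup self-supervised.
    """
    # Naive whitespace tokenization: an assumption for illustration only.
    words = text.split()
    pairs = []
    for i in range(len(words) - context_size):
        context = words[i:i + context_size]   # what the model sees
        target = words[i + context_size]      # what it must predict
        pairs.append((context, target))
    return pairs


pairs = make_next_word_pairs("LLMs learn to predict one word at a time", context_size=3)
for context, target in pairs:
    print(context, "->", target)
```

Real LLMs work on subword tokens rather than whitespace-separated words, but the input/target relationship is the same.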

### Fine-tuning Overview
- The pretrained model is then adapted using **labeled datasets** for specific tasks.
- Two common fine-tuning approaches:
  - **Instruction fine-tuning:** pairs of instructions and desired outputs (e.g., “translate this sentence → correct translation”).
  - **Classification fine-tuning:** text samples paired with class labels (e.g., spam vs. not spam).
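
To make the two dataset shapes concrete, here is a small sketch of what such labeled records might look like. The records, the field names (`instruction`, `output`, `text`, `label`), the prompt template, and both helper functions are hypothetical and chosen only for illustration; real datasets and templates vary.

```python
# Hypothetical toy records; field names are illustrative, not a fixed standard.
instruction_data = [
    {"instruction": "Translate to German: Good morning", "output": "Guten Morgen"},
    {"instruction": "Summarize: The cat sat on the mat all day.",
     "output": "A cat stayed on a mat."},
]

classification_data = [
    {"text": "You won a free prize, click here!", "label": "spam"},
    {"text": "Are we still meeting at 3 pm?", "label": "not spam"},
]


def format_instruction_example(record):
    """Join an instruction and its desired output into one training string.

    The template is an assumption; any consistent format would work.
    """
    return (
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Response:\n{record['output']}"
    )


def encode_labels(records):
    """Map class names to integer ids, the form a classifier is trained on."""
    classes = sorted({r["label"] for r in records})
    class_to_id = {name: idx for idx, name in enumerate(classes)}
    encoded = [(r["text"], class_to_id[r["label"]]) for r in records]
    return encoded, class_to_id


print(format_instruction_example(instruction_data[0]))
encoded, class_to_id = encode_labels(classification_data)
print(encoded, class_to_id)
```

In the instruction case the model is still trained to generate text; in the classification case the output is restricted to a fixed set of class ids.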


<br/>

# Transformer Architecture
Modern LLMs are based on the transformer architecture.<br/>
The architecture contains two main components:
> **Encoder:** The encoder module processes the input text and encodes it into a series of numerical representations or vectors that capture the contextual information of the input. <br/> <br/> 
>**Decoder:**  The decoder module takes these encoded vectors and generates the output text.

For example, in a translation task, the encoder would encode the text from the source language into vectors, and the decoder would decode these vectors to generate text in the target language.
Both the encoder and decoder consist of many layers connected by a so-called **self-attention** mechanism.

- A key component is **self-attention**, which lets the model weigh the importance of different words relative to each other and capture long-range dependencies.
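
The "weighing" that self-attention performs can be sketched without any deep learning library. The following is a deliberately simplified version, assuming no trainable weight matrices: each token's embedding acts as its own query, key, and value. The function names and the toy embeddings are hypothetical; a real transformer layer adds learned projections, multiple heads, and more.

```python
import math


def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def simple_self_attention(embeddings):
    """Simplified self-attention without trainable weights.

    For each token, score it against every token (scaled dot product),
    normalize the scores with softmax, and build a context vector as
    the weighted sum of all token vectors.
    """
    dim = len(embeddings[0])
    context_vectors, all_weights = [], []
    for query in embeddings:
        scores = [dot(query, key) / math.sqrt(dim) for key in embeddings]
        weights = softmax(scores)
        context = [
            sum(w * value[j] for w, value in zip(weights, embeddings))
            for j in range(dim)
        ]
        all_weights.append(weights)
        context_vectors.append(context)
    return context_vectors, all_weights


# Three toy 2-dimensional token embeddings (made-up values).
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context_vectors, attention_weights = simple_self_attention(toy_embeddings)
for row in attention_weights:
    print([round(w, 3) for w in row])
```

Each row of `attention_weights` shows how strongly one token attends to every token in the sequence, regardless of how far apart they are, which is how long-range dependencies become reachable in a single step.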

<div style="text-align: center; margin-top: 20px;">
  <img 
    src="https://raw.githubusercontent.com/salavii/llm-from-scratch/main/images/llm-encoder-decoder.gif"
    style="width: 600px; border-radius: 10px; display: block; margin-left: auto; margin-right: auto;"
  >

  <p style="font-size: 16px; color: #333; font-weight: bold; margin-top: 10px;">
    The encoder reads and understands the full input sentence, producing contextual embeddings, and the decoder uses this information to generate the output sequence step by step.
  </p>
</div>


## BERT vs GPT (Transformer Variants)
BERT and GPT are both based on the Transformer architecture.<br/>
They adapt the original transformer for different NLP tasks.

- **BERT** (Encoder-only):
  - Trained with masked word prediction (MLM).
  - Excellent for text understanding tasks such as sentiment analysis and document classification.
  - Used by platforms like X (Twitter) for toxic content detection. <br/>
<br/>
- **GPT** (Decoder-only):
  - Trained with next-token prediction.
  - Designed for text generation tasks (translation, summarization, code generation, etc.).
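
The difference between the two training objectives can be sketched on a single sentence. This is an illustrative toy, not the actual preprocessing of either model: the helper names are hypothetical, the tokens are whitespace-split words, and the masked position is picked by hand, whereas BERT-style training selects positions at random.

```python
MASK = "[MASK]"


def bert_style_example(tokens, position):
    """Masked word prediction: hide one token, keep context on BOTH sides."""
    masked = list(tokens)
    target = masked[position]
    masked[position] = MASK
    return masked, target


def gpt_style_example(tokens, position):
    """Next-token prediction: only the tokens to the LEFT are visible."""
    return list(tokens[:position]), tokens[position]


tokens = "the movie was really great".split()

bert_input, bert_target = bert_style_example(tokens, 3)
gpt_input, gpt_target = gpt_style_example(tokens, 3)

print("BERT input:", bert_input, "target:", bert_target)
print("GPT input: ", gpt_input, "target:", gpt_target)
```

Both examples predict the same word, but the BERT-style input still contains "great" to the right of the gap, while the GPT-style input stops before the target. Seeing both sides suits understanding tasks; seeing only the left side matches how text is generated one token at a time.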

#### Learning Capabilities of GPT
> Zero-shot learning: Solving new tasks without seeing any examples.<br/>
> Few-shot learning: Learning from a very small number of examples.<br/>

GPT models were originally trained only for text completion, yet they prove highly versatile.
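
In practice, zero-shot and few-shot differ only in what is placed in the prompt. The sketch below builds both kinds of prompt as plain strings; the function `build_prompt`, the `Input:`/`Output:` layout, and the example pairs are assumptions for illustration, and no model is actually called.

```python
def build_prompt(task, query, examples=()):
    """Build a text-completion prompt.

    With no examples this is a zero-shot prompt; with a handful of
    (input, output) pairs it becomes a few-shot prompt.
    """
    parts = [task]
    for source, result in examples:
        parts.append(f"Input: {source}\nOutput: {result}")
    # The prompt ends where the model is expected to continue.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


task = "Translate English to German."

zero_shot = build_prompt(task, "breakfast")
few_shot = build_prompt(
    task,
    "breakfast",
    examples=[("good morning", "guten Morgen"), ("thank you", "danke")],
)

print(zero_shot)
print("-" * 20)
print(few_shot)
```

Because the model only ever continues text, both settings reuse the same completion ability; the few-shot prompt just gives it a pattern to imitate.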

**Key difference:**  
BERT is optimized for *understanding*, while GPT is optimized for *generation*.


<div style="text-align: center; margin-top: 20px;">
  <img 
    src="https://raw.githubusercontent.com/salavii/llm-from-scratch/main/images/GPT..png"
    style="width: 600px; border-radius: 10px; display: block; margin-left: auto; margin-right: auto;"
  >
</div>


## Transformers vs. LLMs
Most modern LLMs are based on the transformer architecture. Therefore, the terms Transformer and LLM are often used interchangeably.

**Not all transformers are LLMs.** <BR/>
- Transformers are also used in computer vision.

**Not all LLMs are transformers.** <BR/>
Some LLMs are based on:<BR/>
- Recurrent architectures (RNNs)
- Convolutional architectures (CNNs)

### Building a Large Language Model

- After introducing the core concepts behind LLMs, we now move on to **implementing one from scratch**.
- We will use the **fundamental idea behind GPT** as a **blueprint**.
- The implementation will be structured into **three stages**.


<div style="text-align: center; margin-top: 20px;">
  <img 
    src="https://raw.githubusercontent.com/salavii/llm-from-scratch/main/images/model.png"
    style="width: 850px; border-radius: 10px; display: block; margin-left: auto; margin-right: auto;"
  >

  <p style="font-size: 16px; color: #333; font-weight: bold; margin-top: 10px;">
    The three main stages of coding an LLM are implementing the LLM architecture and data preparation 
    process (stage 1), pretraining an LLM to create a foundation model (stage 2), and fine-tuning the foundation 
    model to become a personal assistant or text classifier (stage 3).
  </p>
</div>


### In stage 1 
We will learn about the fundamental data preprocessing steps and code the attention mechanism at the heart of every LLM.

### In stage 2
We will learn how to code and pretrain a GPT-like LLM capable of generating new texts. We will also go
over the fundamentals of evaluating LLMs, which is essential for developing capable
NLP systems. 

### In stage 3 
We will take a pretrained LLM and fine-tune it to follow instructions such as answering queries or classifying texts—the most common tasks in many real-world applications and research.