<center>
    <h1>Large Language Models (LLMs)</h1>
</center>

# Brief Recap of Large Language Models

- Large Language Models are advanced AI systems trained on vast amounts of text data to understand and generate human-like text. 

- These models have revolutionized natural language processing by demonstrating remarkable capabilities in tasks like text generation, translation, and question-answering.

**Key Characteristics**

- Pre-trained on massive text datasets
- Utilize deep learning and transformer architecture
- Can perform various language tasks without task-specific training
- Generate contextually relevant and coherent responses

**Evolution and Impact**

- Language technology has evolved from simple rule-based systems to neural networks with billions of parameters.

- Modern LLMs such as GPT-3 (175 billion parameters) handle tasks ranging from translation to creative writing.

<center>
    <img src="static/image1.gif" alt="LLMs" style="width:50%;">
</center>

## LLM Architecture Overview

- Modern LLMs are built on the Transformer architecture, introduced in the 2017 paper "Attention Is All You Need".

- The Transformer processes text through the following components:

**Input Processing Layer**

- Embedding Layer: Converts text tokens into numerical vectors

- Positional Encoding: Adds sequence information to maintain word order
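
The two input steps above can be sketched in plain Python. This is a minimal illustration, assuming the sinusoidal encoding from the original Transformer paper; Llama models use rotary position embeddings instead, and the embedding values and function names here are made up for the example.

```python
import math

def sinusoidal_position(pos: int, d_model: int) -> list[float]:
    """Return the sinusoidal encoding vector for one position."""
    vec = []
    for i in range(d_model):
        # Dimensions (2k, 2k+1) share one frequency: sin on even, cos on odd.
        angle = pos / (10000 ** ((i - i % 2) / d_model))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

# Toy embedding table: token ID -> vector (illustrative values only).
embedding_table = {
    0: [0.1, 0.2, 0.3, 0.4],
    1: [0.5, 0.6, 0.7, 0.8],
}

def embed(token_ids: list[int], d_model: int = 4) -> list[list[float]]:
    """Look up each token's embedding and add its position encoding."""
    out = []
    for pos, tok in enumerate(token_ids):
        pe = sinusoidal_position(pos, d_model)
        out.append([e + p for e, p in zip(embedding_table[tok], pe)])
    return out

vectors = embed([1, 0, 1])
# The same token (ID 1) at positions 0 and 2 gets different vectors.
print(vectors[0] != vectors[2])
```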

**Transformer Block**

- Self-Attention Mechanism: Weighs word importance and relationships

- Multi-Head Attention: Processes multiple attention patterns simultaneously

- Feed-Forward Networks: Applies additional transformations to processed data
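
As a rough sketch of the self-attention step, the following computes scaled dot-product attention over plain Python lists. It is a single head with no learned projections and no causal mask (decoder-only LLMs additionally mask future positions), so treat it as an illustration of the weighting idea rather than a full Transformer block.

```python
import math

def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

x = [[1.0, 0.0], [0.0, 1.0]]   # two toy token vectors
out = attention(x, x, x)       # self-attention: Q = K = V = x
print(out)
```

Each output row is a mixture of all value vectors, weighted toward the tokens most similar to the query.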

**Output Processing**

- Layer Normalization

- Residual Connections
- Final Linear Layers

**Key Architectural Features**

- **Parallel Processing**: Unlike traditional sequential models, Transformers process entire sequences simultaneously

- **Attention Mechanism**: Enables understanding of long-range dependencies and contextual relationships between words

- **Scalability**: Architecture supports models of varying sizes, from millions to billions of parameters

The architecture's efficiency comes from its ability to process text in parallel and maintain context through sophisticated attention mechanisms, making it particularly effective for large-scale language tasks.

<center>
    <img src="static/image2.webp" alt="LLMs" style="width:50%;">
</center>

## Major Advantages of Large Language Models

- Natural Language Understanding
    - Advanced comprehension of human language
    - Context-aware responses
    - Sentiment analysis capabilities
    - Understanding of nuanced expressions

- Task Versatility
    - Document summarization
    - Multi-language translation
    - Code generation and debugging
    - Content creation and editing
    - Question-answering systems

- Efficiency Improvements
    - Automated task completion
    - Rapid document processing
    - Parallel processing capabilities
    - Reduced manual intervention

- Business Applications
    - Data analysis and insights extraction
    - Market research automation
    - Customer service enhancement
    - Content generation at scale

- Learning Capabilities
    - In-context learning
    - Few-shot learning adaptation
    - Continuous improvement
    - Transfer learning across domains

- Cost Benefits
    - Reduced operational costs
    - Automated workflows
    - Faster time-to-market
    - Scalable solutions

# Interacting with LLMs (Llama Model)

## Approach 1: Using Hugging Face Transformers

### Step 1: Create a Hugging Face Account

If you don't already have a Hugging Face account, go to the [Hugging Face website](https://huggingface.co/) and click on "Sign Up" to create an account. Follow the verification process, including verifying your email address.

### Step 2: Generate a User Access Token

- Log in to your Hugging Face account.
- Navigate to your profile by clicking on your profile photo in the navigation bar.
- Open the settings menu and select "Access Tokens" from the sidebar.
- Click the "New token" button.
- Provide a name for your token and select an appropriate role:
  - `read`: Allows downloading models and datasets.
  - `write`: Allows downloading and uploading models and datasets.
- Click the "Generate a token" button.
- Copy the generated token by clicking the "Show" and then "Copy" buttons.

### Step 3: Request Access to Gated Models (If Necessary)

For gated models like the Llama family models, you need to request access:
- Go to the model page on the Hugging Face Hub.
- You will be prompted to share your username and email address with the model authors.
- Fill out any additional fields requested by the model authors.
- Click "Agree" to send the access request.
- If the model uses automatic approval, you will gain access immediately. Otherwise, you will need to wait for manual approval from the model authors.

**Link to Llama Family Models:** https://huggingface.co/meta-llama

### Step 4: Authenticate Using the Access Token

You can authenticate using the access token in several ways:

- Using the `huggingface_hub` Library
```python
from huggingface_hub import login

access_token = "YOUR TOKEN"
login(token=access_token)
```

- Setting the Environment Variable
You can set the `HF_TOKEN` environment variable:
```bash
export HF_TOKEN=YOUR_TOKEN
```
Or in your code:
```python
import os
os.environ["HF_TOKEN"] = "YOUR_TOKEN"
```
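
To avoid hard-coding the token in source files, one option is to read it from the environment and fail early when it is missing. The helper below is a hypothetical convenience function, not part of `huggingface_hub`.

```python
import os

def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the token stored in env_var, or raise if it is unset or empty."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable before loading gated models."
        )
    return token

# Usage with the login call shown above (requires huggingface_hub):
# from huggingface_hub import login
# login(token=get_hf_token())
```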

### Step 5: Load the Model and Tokenizer


**AutoTokenizer.from_pretrained()**

`AutoTokenizer` automatically selects and loads the appropriate tokenizer class for a given pre-trained model[4]. Here are the key parameters:

**Required Parameters:**
- `model_id`: The name or path of the pre-trained model (e.g., "meta-llama/Llama-3.1-8B")

**Optional Parameters:**
- `token`: Boolean or string supplying the Hugging Face token for authentication (replaces the deprecated `use_auth_token`)
- `trust_remote_code`: Boolean to allow running custom code shipped in the model repository

**Parameters Used When Calling the Tokenizer:**

The following options are passed when you call the tokenizer on text (see Step 6), not to `from_pretrained()`:
- `padding`: Boolean or string ('max_length' or 'longest') for input padding[5]
- `truncation`: Boolean or string to truncate inputs to max_length
- `max_length`: Integer for maximum sequence length
- `return_tensors`: Format of the returned tensors ("pt" for PyTorch, "tf" for TensorFlow)[5]

Example:
```python
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    token=True,              # Use the token saved by login() or HF_TOKEN
    trust_remote_code=True
)
```
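
What `padding`, `truncation`, and `max_length` do to a batch of token IDs can be illustrated without any library. This is a simplified sketch: it pads on the right with a made-up pad ID of 0, whereas real tokenizers use their own pad token and may pad on the left for generation.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return input_ids, attention_mask

ids, mask = pad_and_truncate([[5, 6, 7, 8, 9], [3, 4]], max_length=4)
print(ids)   # [[5, 6, 7, 8], [3, 4, 0, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```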

**AutoModelForCausalLM.from_pretrained()**

This class is designed for causal language modeling, where the model predicts each token from the tokens that precede it, which is the basis for text generation[4]. Here are the key parameters:

**Required Parameters:**
- `model_id`: The name or path of the pre-trained model

**Optional Parameters:**
- `token`: For authentication with Hugging Face (replaces the deprecated `use_auth_token`)
- `trust_remote_code`: Allow loading remote code
- `torch_dtype`: Specify the model's data type (e.g., torch.float16 for half precision)
- `device_map`: Specify how to distribute the model across available hardware
- `low_cpu_mem_usage`: Boolean to enable low CPU memory usage
- `cache_dir`: Directory for storing downloaded models
- `local_files_only`: Boolean to use only local files[3]
- `revision`: Specific model version to use

Example with additional parameters:
```python
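import torch
from transformers import AutoModelForCausalLM

model_id = "meta-llama/Llama-3.1-8B"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    token=True,                  # Use the token saved by login() or HF_TOKEN
    torch_dtype=torch.float16,   # Half precision to reduce memory use
    device_map="auto",           # Place layers on available devices automatically
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    cache_dir="./model_cache",
    local_files_only=False,
    revision="main"
)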
```
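
The effect of `torch_dtype` on memory can be estimated with simple arithmetic: weights take roughly (number of parameters) × (bytes per parameter). The sketch below assumes a round figure of 8 billion parameters and counts weights only; activations and the generation cache need additional memory.

```python
# Bytes used to store one parameter in each data type.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for model weights, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

print(weight_memory_gb(8e9, "float32"))  # 32.0
print(weight_memory_gb(8e9, "float16"))  # 16.0
```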

Use the `AutoTokenizer` and `AutoModelForCausalLM` from the `transformers` library to load the model and tokenizer, ensuring you use the access token for authentication:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Llama-3.1-8B"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    token=True,
    trust_remote_code=True
)

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    token=True,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)
```

#### NOTE:

You can also load the model from your local directory if you have already downloaded it.

```python
# Load from local directory
local_model_path = "./saved_model"
tokenizer = AutoTokenizer.from_pretrained(
    local_model_path,
    local_files_only=True
)
model = AutoModelForCausalLM.from_pretrained(
    local_model_path,
    local_files_only=True
)
```

### Step 6: Use the Model

- 1. Input Preparation
```python
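# Reuse EOS as the padding token if none is defined, so padding=True works
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Prepare input
prompt = "Write a short story about:"
inputs = tokenizer(
    prompt,
    return_tensors="pt",      # Return PyTorch tensors
    padding=True,             # Optional: pad sequences
    truncation=True,          # Optional: truncate sequences
    max_length=512,           # Optional: maximum sequence length
).to(model.device)           # Move to the model's device (GPU if available)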
```

**Additional Tokenizer Parameters:**
- `add_special_tokens`: Boolean to add model's special tokens (default: True)
- `return_attention_mask`: Boolean to return attention mask (default: True)
- `return_token_type_ids`: Boolean to return token type IDs
- `return_overflowing_tokens`: Boolean to handle texts longer than max_length
- `return_special_tokens_mask`: Boolean to return special tokens mask

- 2. Text Generation
```python
with torch.no_grad():  # Disable gradient calculation for inference
    outputs = model.generate(
        inputs["input_ids"],
        max_length=200,           # Maximum total length (prompt + generated tokens)
        min_length=10,            # Minimum total length (prompt + generated tokens)
        temperature=0.7,          # Controls randomness (> 0; lower = more focused)
        do_sample=True,           # Enable sampling
        num_beams=1,              # Number of beams (1 = no beam search)
        no_repeat_ngram_size=2,   # Prevent repetition of n-grams
        pad_token_id=tokenizer.eos_token_id,  # Padding token ID
        eos_token_id=tokenizer.eos_token_id,  # End of sequence token ID
        attention_mask=inputs["attention_mask"],  # Optional attention mask
        top_k=50,                 # Top-k sampling parameter
        top_p=0.95,               # Nucleus sampling parameter
        repetition_penalty=1.2,   # Penalty for repeated tokens
        length_penalty=1.0,       # Length penalty; only affects beam search (num_beams > 1)
        num_return_sequences=1    # Number of sequences to generate
    )
```

**Detailed Parameter Explanation:**

1. **Sampling Parameters:**
```python
outputs = model.generate(
    inputs["input_ids"],
    # Sampling strategy
    do_sample=True,          # Enable sampling (vs greedy decoding)
    temperature=0.7,         # Higher = more random, Lower = more focused
    top_k=50,               # Limit to top k tokens during sampling
    top_p=0.95,             # Nucleus sampling probability threshold
)
```
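
How these three sampling parameters interact can be sketched in plain Python. This is a simplified illustration over a toy list of logits; the function name is made up, and the library's exact ordering and renormalization steps may differ.

```python
import math
import random

def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.95, rng=None):
    """Pick one token index using temperature, top-k, and top-p filtering."""
    rng = rng or random.Random()
    # Temperature: dividing logits by a value < 1 sharpens the distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-k: keep only the k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Sample among the surviving tokens (weights are renormalized by choices()).
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

toy_logits = [5.0, 1.0, 0.5, -2.0]
print(sample_next_token(toy_logits, rng=random.Random(0)))
```

With `top_k=1` the function reduces to greedy decoding, since only the most probable token survives the filter.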

2. **Length Control:**
```python
outputs = model.generate(
    inputs["input_ids"],
    # Length parameters
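    max_length=200,         # Maximum total length (prompt + generated tokens)
    min_length=10,          # Minimum total length (prompt + generated tokens)
    num_beams=4,            # length_penalty and early_stopping apply to beam search
    length_penalty=1.0,     # Exponential length penalty (beam search only)
    early_stopping=True,    # Stop once num_beams finished candidates exist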
)
```

3. **Beam Search Parameters:**
```python
outputs = model.generate(
    inputs["input_ids"],
    # Beam search parameters
    num_beams=4,           # Number of beams for beam search
    no_repeat_ngram_size=2,# Prevent repetition of n-grams
    num_beam_groups=1,     # Number of groups for diverse beam search
    diversity_penalty=0.0  # Penalty for diverse beam search
)
```
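
To see why keeping several beams can beat greedy decoding, consider the toy example below. The `TOY_MODEL` table of next-token probabilities is invented for illustration; a real model computes these probabilities from the sequence so far.

```python
import math

# Hypothetical next-token probabilities, keyed by the sequence so far.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}

def beam_search(num_beams: int, steps: int = 2):
    """Return the highest-scoring sequence found with the given beam width."""
    beams = [((), 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, p in TOY_MODEL[seq].items():
                candidates.append((seq + (tok,), score + math.log(p)))
        # Keep only the num_beams best partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0][0]

print(beam_search(num_beams=1))  # ('a', 'x'), probability 0.33
print(beam_search(num_beams=2))  # ('b', 'x'), probability 0.36
```

Greedy decoding commits to `a` because it is the best first token, while a beam width of 2 keeps `b` alive long enough to find the more probable full sequence.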

4. **Repetition Control:**
```python
outputs = model.generate(
    inputs["input_ids"],
    # Repetition control
    repetition_penalty=1.2,    # Penalty for repeated tokens
    bad_words_ids=None,        # List of token IDs to prevent
    force_words_ids=None       # List of token IDs to force
)
```
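
The idea behind `repetition_penalty` can be sketched as a small adjustment to the logits of tokens that were already generated: positive logits are divided by the penalty and negative logits are multiplied by it, so both become less likely. The function below is an illustrative stand-alone version operating on plain lists.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of logits with already-generated tokens made less likely."""
    adjusted = list(logits)
    for tok in set(generated_ids):
        if adjusted[tok] > 0:
            adjusted[tok] /= penalty   # shrink positive scores
        else:
            adjusted[tok] *= penalty   # push negative scores further down
    return adjusted

# Tokens 0 and 1 were already generated; token 2 is untouched.
print(apply_repetition_penalty([2.4, -1.0, 0.5], generated_ids=[0, 1]))
```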

- 3. Response Decoding
```python
# Basic decoding
response = tokenizer.decode(
    outputs[0],                 # First generated sequence
    skip_special_tokens=True,   # Skip special tokens such as padding and end-of-sequence markers
    clean_up_tokenization_spaces=True  # Clean up spaces from tokenization
)

# Batch decoding
responses = tokenizer.batch_decode(
    outputs,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True
)

# Decoding with an additional parameter
response = tokenizer.decode(
    outputs[0],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
    spaces_between_special_tokens=True
)
```
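
For decoder-only models, the sequence returned by `generate` starts with the prompt tokens, so the decoded text repeats the prompt. A common way to keep only the continuation is to slice off the prompt length before decoding. The sketch below shows the slicing on plain lists with made-up token IDs.

```python
def new_tokens_only(output_ids, prompt_length):
    """Drop the prompt tokens from the front of a generated sequence."""
    return output_ids[prompt_length:]

prompt_ids = [101, 7, 8]             # illustrative prompt token IDs
output_ids = [101, 7, 8, 42, 43, 2]  # prompt followed by generated tokens
continuation = new_tokens_only(output_ids, len(prompt_ids))
print(continuation)  # [42, 43, 2]

# With the objects from the steps above, the equivalent would be:
# prompt_length = inputs["input_ids"].shape[1]
# text = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
```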

### Complete Example with Advanced Parameters:

```python
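# Reuse EOS as the padding token if none is defined, so padding=True works
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompt = "Write a short story about:"
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=512
).to(model.device)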

with torch.no_grad():
    outputs = model.generate(
        inputs["input_ids"],
        max_length=200,
        min_length=50,
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.95,
        num_beams=4,
        no_repeat_ngram_size=2,
        repetition_penalty=1.2,
        length_penalty=1.0,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        attention_mask=inputs["attention_mask"]
    )

responses = tokenizer.batch_decode(
    outputs,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True
)

for response in responses:
    print(response)
```

### Complete Working Code

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def initialize_model_and_tokenizer(model_id):
    # Initialize tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        model_id,
        token=True,
        trust_remote_code=True
    )

    # Set padding token if the tokenizer does not define one
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Initialize model
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        token=True,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True
    )

    return model, tokenizer

def generate_text(prompt, model, tokenizer):
    # Prepare input
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=512,
        return_attention_mask=True
    ).to(model.device)

    # Generate response
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=200,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id
        )

    # Decode response
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

# Usage
model_id = 'meta-llama/Llama-3.1-8B'
model, tokenizer = initialize_model_and_tokenizer(model_id)
prompt = "Explain the meaning of life."
response = generate_text(prompt, model, tokenizer)
print(response)
```



## Approach 2: Using LlamaAPI client

### Step 1: Getting API Access

1. **Create an Account:**
   - Visit https://www.llama-api.com
   - Click on "Log In" → "Sign up"
   - Complete the registration process[4]

2. **Join Waitlist and Get Approval:**
   - After signup, you'll be added to the waitlist
   - Wait for the approval email (usually takes a few days)
   - Once approved, proceed to the next step[4]

3. **Obtain API Token:**
   - Log in to your Llama API account
   - Navigate to the "API Token" section
   - Find your API token on the page
   - Click the clipboard icon to copy your token[4]

### Step 2: Setting Up Development Environment



1. **Install Required Package:**
```bash
pip install llamaapi
```

2. **Create a New Python File:**
   - Create a new .py file in your preferred IDE
   - Import required libraries at the top of your file

### Step 3: Implementing the Code

1. **Initialize the Client:**
```python
from llamaapi import LlamaAPI
import json

# Replace with your actual API token
llama = LlamaAPI("your_api_token")
```

2. **Create the API Request:**
```python
api_request = {
    "model": "llama3.1-70b",    # Specify the model
    "messages": [
        {
            "role": "user",
            "content": "Explain the concept of machine learning"
        }
    ],
    "stream": False,            # Disable streaming
    "max_length": 500,          # Maximum response length
    "temperature": 0.7          # Control randomness (0.0 to 1.0)
}
```

3. **Execute the Request and Handle Response:**
```python
try:
    response = llama.run(api_request)
    print(json.dumps(response.json(), indent=2))
except Exception as e:
    print(f"An error occurred: {e}")
```
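
Printing the whole JSON payload is useful for debugging, but most applications only need the reply text. The helper below is a sketch that assumes the response follows an OpenAI-style chat layout (`choices[0]["message"]["content"]`); that layout is an assumption here, so check the actual payload printed above before relying on it.

```python
from typing import Optional

def extract_reply(payload: dict) -> Optional[str]:
    """Return the assistant's text from a chat-style payload, or None if absent."""
    choices = payload.get("choices") or []
    if not choices:
        return None
    return choices[0].get("message", {}).get("content")

# Hypothetical payload in the assumed layout.
example_payload = {
    "choices": [
        {"message": {"role": "assistant", "content": "Machine learning is..."}}
    ]
}
print(extract_reply(example_payload))
```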

### Step 4: Additional Configuration Options

You can customize your request with these parameters[3]:
```python
api_request = {
    "model": "llama3.1-70b",
    "messages": [...],
    "max_token": 500,           # Maximum tokens to generate
    "temperature": 0.1,         # Lower = more focused outputs
    "top_p": 1.0,              # Nucleus sampling parameter
    "frequency_penalty": 1.0,   # Reduce repetition
    "stream": False            # Enable/disable streaming
}
```

### Step 5: Running the Complete Code

```python
from llamaapi import LlamaAPI
import json

# Initialize client
llama = LlamaAPI("your_api_token")

# Create request
api_request = {
    "model": "llama3.1-70b",
    "messages": [
        {"role": "user", "content": "Explain the concept of machine learning"}
    ],
    "stream": False,
    "max_length": 500,
    "temperature": 0.7
}

try:
    # Get response
    response = llama.run(api_request)

    # Print formatted response
    print(json.dumps(response.json(), indent=2))
except Exception as e:
    print(f"An error occurred: {e}")
```

### Important Notes:

1. **Rate Limits:**
   - Limited to 20 questions per 60-second window
   - Exceeding this limit triggers a cooldown period

2. **Common Errors:**
   - 429: Too Many Requests (rate limit exceeded)
   - 408: Request Timeout
   - 401: Unauthorized (invalid credentials)

3. **Model Selection:**
   Available models include:
   - Mixtral models: mixtral-8x22b-instruct, mixtral-8x7b-instruct
   - Mistral models: mistral-7b-instruct
   - Qwen models: Various sizes from 0.5B to 110B

4. **Pricing:**
   - Costs vary based on the model used
   - Pricing is per 1 million tokens
   - Separate costs for input and output tokens
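
One way to cope with the rate limit and 429 errors noted above is to retry with exponential backoff. The sketch below is generic: `send` is a hypothetical callable that performs one request and returns a `(status_code, body)` tuple, and the retry counts and delays are illustrative defaults rather than values recommended by the API.

```python
import time

def call_with_backoff(send, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-429 status or retries run out."""
    for attempt in range(max_retries + 1):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_retries:
            # Wait 1x, 2x, 4x, ... the base delay before trying again.
            sleep(base_delay * 2 ** attempt)
    return status, body

# Demonstration with a fake sender that is rate limited twice, then succeeds.
responses = iter([(429, None), (429, None), (200, "ok")])
delays = []
result = call_with_backoff(lambda: next(responses), sleep=delays.append)
print(result, delays)  # (200, 'ok') [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting.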