<center>
    <h1>Large Language Models (LLMs)</h1>
</center>

# Brief Recap of Large Language Models

- Large Language Models are advanced AI systems trained on vast amounts of text data to understand and generate human-like text. 

- These models have revolutionized natural language processing by demonstrating remarkable capabilities in tasks like text generation, translation, and question-answering.

**Key Characteristics**

- Pre-trained on massive text datasets
- Utilize deep learning and transformer architecture
- Can perform various language tasks without task-specific training
- Generate contextually relevant and coherent responses

**Evolution and Impact**

- Language technology has evolved from simple rule-based systems to neural networks with billions of parameters.

- Modern LLMs such as GPT-3 (175 billion parameters) handle tasks ranging from translation to creative writing.

<center>
    <img src="static/image1.gif" alt="LLMs" style="width:50%;">
</center>

## LLM Architecture Overview

- Modern LLMs are built on the Transformer architecture, introduced in the 2017 paper "Attention Is All You Need".

- A Transformer processes text through the following components:

**Input Processing Layer**

- Embedding Layer: Converts text tokens into numerical vectors

- Positional Encoding: Adds sequence information to maintain word order
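
The two input steps above can be sketched in NumPy. This is a minimal illustration, assuming a toy vocabulary, a random embedding table, and the sinusoidal encoding from the original Transformer paper; Llama models use rotary position embeddings instead, and the names `sinusoidal_positions` and `embedding_table` are illustrative only.

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings.

    Assumes d_model is even so sine and cosine columns pair up.
    """
    positions = np.arange(seq_len)[:, None]            # shape (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]      # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)                 # even columns
    encoding[:, 1::2] = np.cos(angles)                 # odd columns
    return encoding

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 8                            # toy sizes (assumption)
embedding_table = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([3, 1, 4, 1])                     # token 1 appears twice
token_vectors = embedding_table[token_ids]             # embedding lookup
model_input = token_vectors + sinusoidal_positions(len(token_ids), d_model)
```

The repeated token receives the same embedding vector at both positions, but the added position encoding makes the two rows of `model_input` differ, which is how word order is preserved.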

**Transformer Block**

- Self-Attention Mechanism: Weighs word importance and relationships

- Multi-Head Attention: Processes multiple attention patterns simultaneously

- Feed-Forward Networks: Applies additional transformations to processed data
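
As a rough illustration of the self-attention step, the sketch below implements single-head scaled dot-product attention in NumPy. The random weights and toy dimensions are assumptions; a real model learns the projection matrices and, for multi-head attention, runs several such heads in parallel and concatenates their outputs.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v, causal=True):
    """Single-head scaled dot-product self-attention.

    x has shape (seq_len, d_model); each weight matrix is (d_model, d_head).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])            # pairwise relevance
    if causal:
        # Decoder-style masking: a token may not attend to later tokens
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)                          # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                                # toy sizes (assumption)
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
output, weights = self_attention(x, w_q, w_k, w_v)
```

Row `i` of `weights` shows how much token `i` draws on each earlier token when building its new representation.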

**Output Processing**

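- Layer Normalization: Rescales activations to keep training stable

- Residual Connections: Add each sub-layer's input back to its output

- Final Linear Layers: Project hidden states to scores over the vocabulary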
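
A small NumPy sketch may help show how these three pieces fit together. It assumes a pre-norm arrangement, omits the learnable scale and shift of layer normalization, and uses random weights; Llama models use the closely related RMSNorm, so treat this as an illustration rather than the exact layout of any particular model.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    variance = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(variance + eps)

def with_residual(x, sublayer):
    """Apply a sub-layer to the normalized input and add the original input back."""
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
seq_len, d_model, vocab_size = 4, 8, 10                # toy sizes (assumption)
hidden = rng.normal(size=(seq_len, d_model))

# A stand-in feed-forward sub-layer: linear map followed by ReLU
w_ff = rng.normal(size=(d_model, d_model)) * 0.1
hidden = with_residual(hidden, lambda h: np.maximum(h @ w_ff, 0.0))

# Final linear layer: project hidden states to one score (logit) per vocabulary entry
w_out = rng.normal(size=(d_model, vocab_size))
logits = layer_norm(hidden) @ w_out
next_token_id = int(logits[-1].argmax())               # greedy pick for the last position
```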

**Key Architectural Features**

- **Parallel Processing**: Unlike traditional sequential models, Transformers process entire sequences simultaneously

- **Attention Mechanism**: Enables understanding of long-range dependencies and contextual relationships between words

- **Scalability**: Architecture supports models of varying sizes, from millions to billions of parameters

The architecture's efficiency comes from its ability to process text in parallel and maintain context through sophisticated attention mechanisms, making it particularly effective for large-scale language tasks.

<center>
    <img src="static/image2.webp" alt="LLMs" style="width:50%;">
</center>

## Major Advantages of Large Language Models

- Natural Language Understanding
    - Advanced comprehension of human language
    - Context-aware responses
    - Sentiment analysis capabilities
    - Understanding of nuanced expressions

- Task Versatility
    - Document summarization
    - Multi-language translation
    - Code generation and debugging
    - Content creation and editing
    - Question-answering systems

- Efficiency Improvements
    - Automated task completion
    - Rapid document processing
    - Parallel processing capabilities
    - Reduced manual intervention

- Business Applications
    - Data analysis and insights extraction
    - Market research automation
    - Customer service enhancement
    - Content generation at scale

- Learning Capabilities
    - In-context learning
    - Few-shot learning adaptation
    - Continuous improvement
    - Transfer learning across domains

- Cost Benefits
    - Reduced operational costs
    - Automated workflows
    - Faster time-to-market
    - Scalable solutions
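
The in-context and few-shot learning listed under Learning Capabilities can be made concrete with a prompt-building sketch. The helper name `build_few_shot_prompt`, the sentiment task, and the example texts are all hypothetical; the point is only that labeled examples are placed in the prompt itself and no weights are updated.

```python
def build_few_shot_prompt(examples, query,
                          instruction="Classify the sentiment as positive or negative."):
    """Assemble an instruction, labeled examples, and an unlabeled query into one prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The query is left unlabeled so the model completes the pattern
    lines.append(f"Text: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)

examples = [
    ("The battery lasts all day.", "positive"),
    ("The screen cracked within a week.", "negative"),
]
prompt = build_few_shot_prompt(examples, "Setup was quick and painless.")
print(prompt)
```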

# Interacting with LLMs (Llama Model)

## Approach 1: Using the Hugging Face Hub with KerasHub

### Step 1: Create a Hugging Face Account

If you don't already have a Hugging Face account, go to the [Hugging Face website](https://huggingface.co/) and click on "Sign Up" to create an account. Follow the verification process, including verifying your email address.

### Step 2: Generate a User Access Token

- Log in to your Hugging Face account.
- Navigate to your profile by clicking on your profile photo in the navigation bar.
- Open the settings menu and select "Access Tokens" from the sidebar.
- Click the "New token" button.
- Provide a name for your token and select an appropriate role:
  - `read`: Allows downloading models and datasets.
  - `write`: Allows downloading and uploading models and datasets.
- Click the "Generate a token" button.
- Copy the generated token by clicking the "Show" and then "Copy" buttons.

### Step 3: Request Access to Gated Models (If Necessary)

Gated models, such as those in the Llama family, require an access request:
- Go to the model page on the Hugging Face Hub.
- You will be prompted to share your username and email address with the model authors.
- Fill out any additional fields requested by the model authors.
- Click "Agree" to send the access request.
- If the model uses automatic approval, you will gain access immediately. Otherwise, you will need to wait for manual approval from the model authors.

**Link to Llama Family Models:** https://huggingface.co/meta-llama

### Step 4: Authenticate Using the Access Token

You can authenticate using the access token in several ways:

- Using the `huggingface_hub` Library
```python
from huggingface_hub import login

access_token = "YOUR_TOKEN"
login(token=access_token)
```

- Setting the Environment Variable
You can set the `HF_TOKEN` environment variable:
```bash
export HF_TOKEN=YOUR_TOKEN
```
Or in your code:
```python
import os
os.environ["HF_TOKEN"] = "YOUR_TOKEN"
```
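
Hardcoding a token in source code makes it easy to leak. As one possible alternative, the sketch below reads the token from the environment and fails with a clear message when it is missing. The helper name `get_hf_token` is hypothetical and not part of `huggingface_hub`.

```python
import os

def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the access token stored in an environment variable, or raise if unset."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face access token."
        )
    return token

# Typical use, assuming the token was exported in the shell beforehand:
#   from huggingface_hub import login
#   login(token=get_hf_token())
```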

### Step 5: Initializing the Model

The `initialize_model()` function loads the Llama 3.2 1B Instruct model with a chosen numerical precision and sampling strategy.

**Core Components**

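```python
from keras_hub import models
from keras_hub.samplers import TopKSampler

def initialize_model():
    try:
        # Load the Llama 3.2 1B Instruct weights from the Hugging Face Hub
        model = models.Llama3CausalLM.from_preset(
            "hf://meta-llama/Llama-3.2-1B-Instruct",
            dtype="bfloat16"
        )
        # Sample each next token from the 50 most likely candidates
        model.compile(sampler=TopKSampler(k=50))
        return model
    except Exception as e:
        print(f"Error initializing model: {e}")
        return None
```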

**Parameters Breakdown**

1. **Model Preset Path**
   - Required parameter: `"hf://meta-llama/Llama-3.2-1B-Instruct"`
   - The `hf://` prefix points to a model repository on the Hugging Face Hub, here Llama 3.2
   - "1B" indicates the version with roughly 1 billion parameters
   - "Instruct" indicates that the model is fine-tuned for instruction-following tasks

2. **Data Type (dtype)**
   - Optional parameter: `dtype="bfloat16"`
   - Sets the numerical precision for model weights
   - "bfloat16" is a memory-efficient format that maintains good numerical stability
   - Alternative options could include "float32" or "float16"

3. **Sampler Configuration**
   - Optional parameter: `k=50` in `TopKSampler`
   - Controls the randomness of text generation
   - At each step, samples the next token from the 50 most likely candidates
   - Higher k values increase diversity but may reduce coherence
   - Lower k values make outputs more deterministic (k=1 always picks the most likely token)
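
To make the effect of `k` concrete, here is a minimal NumPy sketch of top-k sampling over a toy logit vector. It illustrates the idea only and is not the KerasHub implementation; the function name `top_k_sample` and the logit values are assumptions.

```python
import numpy as np

def top_k_sample(logits, k, rng):
    """Sample one token id from the k highest-scoring entries of `logits`."""
    logits = np.asarray(logits, dtype=float)
    top_indices = np.argpartition(logits, -k)[-k:]     # ids of the k largest logits
    top_logits = logits[top_indices]
    probs = np.exp(top_logits - top_logits.max())      # softmax over the survivors
    probs /= probs.sum()
    return int(rng.choice(top_indices, p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, 1.5, -1.0, 0.0])          # toy scores for 5 tokens

# With k=2 only the two best tokens (ids 0 and 2) can ever be drawn
samples = [top_k_sample(logits, k=2, rng=rng) for _ in range(20)]

# With k=1 sampling collapses to always choosing the best token
greedy = top_k_sample(logits, k=1, rng=rng)
```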

**Error Handling**
- The try-except block catches any initialization errors
- Returns None if initialization fails
- Prints the specific error message for debugging
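
Returning to the `dtype` parameter described above, a back-of-the-envelope calculation shows why the precision matters for memory. The sketch assumes a nominal 1 billion parameters (the exact count for this model differs somewhat) and counts weights only, ignoring activations and any caches.

```python
# Bytes needed to store one parameter in each format
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}

def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3

NOMINAL_PARAMS = 1e9  # assumption: "1B" taken literally

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(NOMINAL_PARAMS, dtype):.2f} GiB")
```

Under these assumptions the 16-bit formats need half the memory of `float32`, which is the main reason to prefer them on constrained hardware.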

### Step 6: Generating Text

The `generate_text` function uses a pre-initialized LLaMA model to generate text responses based on a given prompt.

```python
def generate_text(model, prompt, max_length=200):
    try:
        response = model.generate(
            prompt,
            max_length=max_length
        )
        return response
    except Exception as e:
        return f"Error generating text: {e}"
```

**Required Parameters**

1. **model**
   - The initialized LLaMA model instance
   - Must be passed as the first argument
   - Should be a valid model object returned from `initialize_model()`

2. **prompt**
   - The input text that guides the model's response
   - Can be a question, instruction, or any text context
   - Must be a string

**Optional Parameters**

1. **max_length**
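   - Default value: 200 tokens
   - Caps the length of the generated sequence; in KerasHub this count appears to include the prompt tokens, so long prompts leave less room for the response
   - One token corresponds to roughly 4 characters of English text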
   - Can be adjusted based on needs:
     - Lower values (50-100) for short responses
     - Higher values (500+) for longer content
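
The 4-characters-per-token rule of thumb above can be turned into a quick estimate when choosing `max_length`. This is only a heuristic sketch: the helper names are hypothetical, real token counts depend on the tokenizer, and adding the prompt estimate assumes that `max_length` counts prompt tokens as well.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb for English text, not a tokenizer guarantee

def estimate_tokens(text: str) -> int:
    """Rough token count for a piece of English text."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def suggest_max_length(prompt: str, response_tokens: int) -> int:
    """Budget for the prompt plus the desired number of response tokens."""
    return estimate_tokens(prompt) + response_tokens

prompt = "Explain what is machine learning:"
print(estimate_tokens(prompt))
print(suggest_max_length(prompt, response_tokens=200))
```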



### Complete Working Code

```python
import os

# Select the Keras backend before importing any Keras-based library
os.environ["KERAS_BACKEND"] = "tensorflow"

from keras_hub import models
from keras_hub.samplers import TopKSampler
from huggingface_hub import login

def setup_authentication():
    try:
        login(token=os.environ["HF_TOKEN"])
        print("Successfully authenticated with Hugging Face")
    except Exception as e:
        print(f"Authentication error: {e}")
        return False
    return True

def initialize_model():
    try:
        # Initialize the Llama 3.2 model using keras_hub
        model = models.Llama3CausalLM.from_preset(
            "hf://meta-llama/Llama-3.2-1B-Instruct",
            dtype="bfloat16"
        )

        # Configure the sampler for text generation
        model.compile(sampler=TopKSampler(k=50))  # You can adjust the k value

        return model

    except Exception as e:
        print(f"Error initializing model: {e}")
        return None

def generate_text(model, prompt, max_length=200):
    try:
        # Generate response (tokenization is handled automatically)
        response = model.generate(
            prompt,
            max_length=max_length
        )
        return response

    except Exception as e:
        return f"Error generating text: {e}"

def main():
    # Set up authentication
    if not setup_authentication():
        print("Failed to authenticate. Please check your HF token.")
        return

    # Initialize model
    model = initialize_model()
    if model is not None:
        prompt = "Explain what is machine learning:"
        response = generate_text(model, prompt)
        print(f"Generated Response:\n{response}")

if __name__ == "__main__":
    # Replace with your actual Hugging Face token
    os.environ["HF_TOKEN"] = "FILL_YOUR_HUGGING_FACE_TOKEN"
    main()
```

## Approach 2: Using the LlamaAPI Client

### Step 1: Getting API Access

1. **Create an Account:**
   - Visit https://www.llama-api.com
   - Click on "Log In" → "Sign up"
   - Complete the registration process

2. **Join Waitlist and Get Approval:**
   - After signup, you'll be added to the waitlist
   - Wait for the approval email (usually takes a few days)
   - Once approved, proceed to the next step

3. **Obtain API Token:**
   - Log in to your Llama API account
   - Navigate to the "API Token" section
   - Find your API token on the page
   - Click the clipboard icon to copy your token

### Step 2: Setting Up Development Environment



1. **Install Required Package:**
```bash
pip install llamaapi
```

2. **Create a New Python File:**
   - Create a new .py file in your preferred IDE
   - Import required libraries at the top of your file

### Step 3: Implementing the Code

1. **Initialize the Client:**
```python
from llamaapi import LlamaAPI
import json

# Replace with your actual API token
llama = LlamaAPI("your_api_token")
```

2. **Create the API Request:**
```python
api_request = {
    "model": "llama3.1-70b",    # Specify the model
    "messages": [
        {
            "role": "user",
            "content": "Explain the concept of machine learning"
        }
    ],
    "stream": False,            # Disable streaming
    "max_length": 500,          # Maximum response length
    "temperature": 0.7          # Control randomness (0.0 to 1.0)
}
```

3. **Execute the Request and Handle Response:**
```python
try:
    response = llama.run(api_request)
    print(json.dumps(response.json(), indent=2))
except Exception as e:
    print(f"An error occurred: {e}")
```
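
Printing the whole JSON body is useful for inspection, but most programs only need the reply text. The sketch below pulls it out, assuming the response follows the common chat-completion layout with a `choices` list whose entries hold a `message`. That layout is an assumption here, the sample payload is invented for illustration, and `extract_reply` is a hypothetical helper, so check a real response before relying on it.

```python
def extract_reply(payload: dict) -> str:
    """Return the assistant's text from a chat-completion style response body."""
    choices = payload.get("choices") or []
    if not choices:
        raise ValueError("Response contains no choices")
    return choices[0]["message"]["content"]

# Invented payload in the assumed layout, for illustration only
sample_payload = {
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Machine learning lets computers learn patterns from data.",
            },
        }
    ]
}

print(extract_reply(sample_payload))
# With a live response this would be: extract_reply(response.json())
```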

### Step 4: Additional Configuration Options

You can customize your request with these parameters:
```python
api_request = {
    "model": "llama3.1-70b",
    "messages": [...],
    "max_token": 500,           # Maximum tokens to generate
    "temperature": 0.1,         # Lower = more focused outputs
    "top_p": 1.0,              # Nucleus sampling parameter
    "frequency_penalty": 1.0,   # Reduce repetition
    "stream": False            # Enable/disable streaming
}
```
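
To give a feel for what `temperature` and `top_p` do, the NumPy sketch below applies both to a toy set of logits. It illustrates the general techniques (temperature scaling and nucleus filtering) and is not a description of how the API implements them; the function names and values are assumptions.

```python
import numpy as np

def apply_temperature(logits, temperature):
    """Turn logits into probabilities; temperature must be greater than 0."""
    scaled = np.asarray(logits, dtype=float) / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

def nucleus_filter(probs, top_p):
    """Keep the smallest set of top tokens whose total probability reaches top_p."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]                    # most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()                   # renormalize the survivors

logits = [2.0, 1.0, 0.5]
sharp = apply_temperature(logits, 0.1)                 # mass concentrates on the top token
flat = apply_temperature(logits, 1.0)                  # probabilities stay more spread out

filtered = nucleus_filter([0.5, 0.3, 0.15, 0.05], top_p=0.7)  # keeps the two likeliest tokens
```

Lower temperatures and lower `top_p` values both narrow the set of plausible next tokens, which is why they produce more focused output.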

### Step 5: Running the Complete Code

```python
from llamaapi import LlamaAPI
import json

# Initialize client
llama = LlamaAPI("your_api_token")

# Create request
api_request = {
    "model": "llama3.1-70b",
    "messages": [
        {"role": "user", "content": "Explain the concept of machine learning"}
    ],
    "stream": False,
    "max_length": 500,
    "temperature": 0.7
}

try:
    # Get response
    response = llama.run(api_request)

    # Print formatted response
    print(json.dumps(response.json(), indent=2))
except Exception as e:
    print(f"An error occurred: {e}")
```

### Important Notes:

1. **Rate Limits:**
   - Limited to 20 requests per 60-second window
   - Exceeding this limit triggers a cooldown period

2. **Common Errors:**
   - 429: Too Many Requests (rate limit exceeded)
   - 408: Request Timeout
   - 401: Unauthorized (invalid credentials)

3. **Model Selection:**
   Available models include:
   - Mixtral models: mixtral-8x22b-instruct, mixtral-8x7b-instruct
   - Mistral models: mistral-7b-instruct
   - Qwen models: Various sizes from 0.5B to 110B

4. **Pricing:**
   - Costs vary based on the model used
   - Pricing is per 1 million tokens
   - Separate costs for input and output tokens
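
Because 429 and 408 errors are usually transient, one common way to handle them is to retry with exponential backoff. The sketch below is a generic pattern under stated assumptions: `send_request` is a hypothetical callable that returns an object with a `status_code` attribute, and `FakeResponse` exists only to demonstrate the behavior. How the `llamaapi` client surfaces these status codes is not covered here, so adapt the check to what the client actually returns or raises.

```python
import time

RETRYABLE_STATUS = {408, 429}  # timeout and rate limit, per the list above

def call_with_backoff(send_request, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send_request until it returns a non-retryable status or attempts run out."""
    for attempt in range(max_attempts):
        response = send_request()
        if response.status_code not in RETRYABLE_STATUS:
            return response
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)           # wait 1s, 2s, 4s, ...
    return response                                    # last retryable response

class FakeResponse:
    """Stand-in for an HTTP response, used only for this demonstration."""
    def __init__(self, status_code):
        self.status_code = status_code

# Simulate two rate-limit responses followed by a success
queue = [FakeResponse(429), FakeResponse(429), FakeResponse(200)]
delays = []
result = call_with_backoff(lambda: queue.pop(0), sleep=delays.append)
print(result.status_code, delays)
```

Passing `sleep` as a parameter keeps the demonstration instant; in real use the default `time.sleep` applies the actual delays. A 401 error is deliberately not retried, since invalid credentials will not fix themselves.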