
# LLM Fundamentals: A Brief Introduction
# Recurrent Neural Networks Basics

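Early neural language models were commonly built on recurrent neural networks (RNNs), which process text one word at a time. Think of an RNN as a reader that carries a running memory of what it has read so far. Basic RNNs, however, struggle to retain information across long texts because the training signal (the gradient) tends to vanish over many steps, so improved variants were developed.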

LSTM networks largely mitigated this memory problem by adding gates that decide what to keep and what to forget. GRU networks came later as a simpler variant with fewer gates that often performs comparably. These improvements made it practical to process longer texts.
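To make the "reader with a memory" picture concrete, here is a minimal sketch of a single-unit RNN in plain Python. The weights (`w_x`, `w_h`, `b`) are hand-picked for illustration, not trained, and the names `rnn_step` and `run_rnn` are hypothetical. The same update rule is applied at every step, and the hidden state is the only thing carried forward.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One step of a scalar RNN: h_t = tanh(w_x * x_t + w_h * h_prev + b)."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.9, b=0.0):
    """Feed the inputs one at a time, carrying the hidden state forward."""
    h = 0.0  # initial hidden state: nothing has been read yet
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states

# One early input followed by zeros: its trace in the hidden state fades step by step.
states = run_rnn([1.0] + [0.0] * 9)
print([round(h, 3) for h in states])
```

The shrinking values show, in miniature, why information from early inputs is hard to keep over long sequences, which is the limitation that gated variants such as LSTM and GRU were designed to ease.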

# Working with Sequences
Language models handle different sequence tasks. The simplest is predicting the next word, like your phone's keyboard suggestions. More complex tasks include translation, where text goes from one language to another, or summarization, where long text becomes short.
Translation and summarization typically use an encoder-decoder structure. The encoder reads the input and builds a representation of it, while the decoder generates the output from that representation. Modern models add attention - a technique that lets the decoder focus on the relevant parts of the input, much like how humans focus on the important parts of a sentence.

# Attention Mechanism
## Core Idea
Attention lets a model focus on the important parts of the input when producing each piece of output. For example, when translating "The cat sat on the mat", producing the translation of 'cat' mainly requires looking at the word 'cat' rather than the other words. Attention does this by assigning a different importance weight to each input word.
## How It Works
The process uses Query, Key, and Value - like searching a library:

- Query: What you're looking for
- Keys: Labels on information
- Values: The actual information

The model matches Query with Keys to figure out which Values are important right now. It then combines the important parts to make its output.
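The Query/Key/Value matching can be sketched in a few lines of plain Python. The text above does not say how the match is scored, so this sketch assumes scaled dot-product scoring (the variant used in Transformers); the vectors are hand-picked toy values, not learned ones, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    # Match the query against every key (dot product, scaled by sqrt(d)).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Combine the values, weighted by how well each key matched.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy vectors for three input words: "the", "cat", "sat".
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [0.0, 4.0]  # a query that resembles the key for "cat"

weights, output = attention(query, keys, values)
print([round(w, 3) for w in weights], [round(o, 3) for o in output])
```

The largest weight lands on the second key, so the output is dominated by the second value, which is the "focus on 'cat'" behavior described above.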
## Why It's Important
Before attention, models had to remember everything as they processed text one word at a time. With attention, models can directly look at any part of the input when needed - like having the whole text visible at once instead of reading through a keyhole.

This is a major reason modern language models work so well: they can connect related words even when they are far apart in the text.

# Training
Training involves showing the model examples and letting it learn from mistakes. The process needs lots of data and computing power. Models learn by predicting next words and checking if they're right. This simple task actually teaches them grammar, facts, and reasoning.
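"Checking if they're right" is usually done with a cross-entropy loss: the model assigns a score to every word in the vocabulary, and the loss is the negative log of the probability given to the word that actually came next. The sketch below assumes a hypothetical three-word vocabulary and made-up scores; only the arithmetic is the point.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Cross-entropy: negative log-probability of the true next token."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

vocab = ["mat", "dog", "moon"]  # hypothetical vocabulary
logits = [2.0, 0.5, -1.0]       # hypothetical scores after "The cat sat on the"

loss_if_right = next_token_loss(logits, vocab.index("mat"))
loss_if_wrong = next_token_loss(logits, vocab.index("moon"))
print(f"loss when 'mat' is the true next word:  {loss_if_right:.3f}")
print(f"loss when 'moon' is the true next word: {loss_if_wrong:.3f}")
```

The loss is small when the model already favors the correct word and large when it does not; training adjusts the weights to push this number down across many examples.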

Challenges include managing computer memory, keeping training stable, and making sure the model learns useful patterns. Modern LLMs use these same basic principles but at a much larger scale, which gives them their impressive capabilities.
# Pre-training vs Fine-tuning Language Models
## Pre-training
Pre-training is like giving the model a general education. The model learns language by reading massive amounts of text from the internet, books, and articles. During this phase, it learns:

- Basic language understanding
- Grammar and vocabulary
- General knowledge
- Basic reasoning abilities

This process is extremely expensive. It requires:

- Months of training time
- Thousands of GPUs
- Millions of dollars
- Huge datasets (hundreds of billions of tokens)

## Fine-tuning
Fine-tuning is like specialized training for a specific job. You take a pre-trained model and teach it specific skills or knowledge. It's much cheaper and faster than pre-training because it:

- Uses much less data (hundreds to thousands of examples)
- Takes hours instead of months
- Can run on a few GPUs or even one
- Costs hundreds instead of millions of dollars

Common fine-tuning scenarios:

- Teaching specific writing styles
- Training for customer service
- Learning company-specific knowledge
- Following specific formats or rules

## Key Differences
Think of pre-training as building a brain, while fine-tuning is teaching new skills to that brain:

- Pre-training learns from scratch, fine-tuning adjusts existing knowledge
- Pre-training is general, fine-tuning is specific
- Pre-training needs massive resources, fine-tuning is lightweight
- Pre-training builds the foundation, fine-tuning adds specialization

## When to Use What
**Use pre-training when:**

- Building a new base model from scratch
- You need completely new language capabilities
- You have massive resources available

**Use fine-tuning when:**

- Adapting an existing model to specific tasks
- You need a consistent output format
- You have limited resources
- You want to add domain expertise


Installation of libraries

In [None]:
!pip install torch numpy matplotlib pandas seaborn


In [None]:
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from torch.utils.data import Dataset, DataLoader

In [None]:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

In [None]:
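def generate_sequence_data(samples=1000, sequence_length=20):
    """Generate a noisy sine wave, plot it, and return the noisy samples."""
    # Extra `sequence_length` points so that windowing still yields `samples` examples
    x = np.linspace(0, 100, samples + sequence_length)
    y_pure = np.sin(0.1 * x)
    # Add Gaussian noise (standard deviation 0.1) so the task is not trivial
    y_noisy = y_pure + np.random.normal(0, 0.1, len(x))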

    plt.figure(figsize=(15, 5))
    plt.subplot(1, 2, 1)
    plt.plot(x[:200], y_pure[:200], 'b-', label='Pure Signal')
    plt.title('Original Sine Wave')
    plt.xlabel('Time')
    plt.ylabel('Amplitude')
    plt.legend()
    plt.grid(True)

    plt.subplot(1, 2, 2)
    plt.plot(x[:200], y_noisy[:200], 'r.', label='Noisy Data')
    plt.plot(x[:200], y_pure[:200], 'b-', alpha=0.3, label='Pure Signal')
    plt.title('Training Data (with Noise)')
    plt.xlabel('Time')
    plt.ylabel('Amplitude')
    plt.legend()
    plt.grid(True)

    plt.tight_layout()
    plt.show()

    return y_noisy

In [None]:
def create_sequences(data, seq_length):
    """Create input/output sequences"""
    x, y = [], []
    for i in range(len(data) - seq_length):
        x.append(data[i:(i + seq_length)])
        y.append(data[i + seq_length])
    return np.array(x), np.array(y)
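
To see what the windowing in `create_sequences` produces, here is a plain-list equivalent run on a tiny input. The helper name `sliding_windows` is hypothetical and exists only for this illustration; the notebook itself keeps using the NumPy version.

```python
def sliding_windows(data, seq_length):
    """Plain-list equivalent of create_sequences, for inspection only."""
    inputs, targets = [], []
    for i in range(len(data) - seq_length):
        inputs.append(data[i:i + seq_length])  # seq_length consecutive values
        targets.append(data[i + seq_length])   # the value that follows them
    return inputs, targets

toy = [10, 20, 30, 40, 50, 60]
inputs, targets = sliding_windows(toy, seq_length=3)
for window, target in zip(inputs, targets):
    print(window, "->", target)
# [10, 20, 30] -> 40
# [20, 30, 40] -> 50
# [30, 40, 50] -> 60
```

Each window is paired with the single value that follows it, so a series of length N yields N - seq_length training examples.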

In [None]:
class SimpleRNN(nn.Module):
    def __init__(self, input_size=1, hidden_size=32, output_size=1):
        super().__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

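    def forward(self, x):
        # Start from a zero hidden state on the same device as the input
        h0 = torch.zeros(1, x.size(0), self.hidden_size, device=x.device)
        out, _ = self.rnn(x, h0)
        # Use only the last time step's output to predict the next value
        out = self.fc(out[:, -1, :])
        return out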

class LSTMPredictor(nn.Module):
    def __init__(self, input_size=1, hidden_size=32, output_size=1):
        super().__init__()
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

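    def forward(self, x):
        # LSTMs carry two states: the hidden state (h0) and the cell state (c0)
        h0 = torch.zeros(1, x.size(0), self.hidden_size, device=x.device)
        c0 = torch.zeros(1, x.size(0), self.hidden_size, device=x.device)
        out, _ = self.lstm(x, (h0, c0))
        # Use only the last time step's output to predict the next value
        out = self.fc(out[:, -1, :])
        return out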

class GRUPredictor(nn.Module):
    def __init__(self, input_size=1, hidden_size=32, output_size=1):
        super().__init__()
        self.hidden_size = hidden_size
        self.gru = nn.GRU(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

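    def forward(self, x):
        # Start from a zero hidden state on the same device as the input
        h0 = torch.zeros(1, x.size(0), self.hidden_size, device=x.device)
        out, _ = self.gru(x, h0)
        # Use only the last time step's output to predict the next value
        out = self.fc(out[:, -1, :])
        return out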


In [None]:
def train_model(model, X_train, y_train, epochs=100):
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

    losses = []
    model.train()

    for epoch in range(epochs):
        optimizer.zero_grad()
        y_pred = model(X_train)
        loss = criterion(y_pred, y_train)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())

        if epoch % 10 == 0:
            print(f'Epoch {epoch}, Loss: {loss.item():.6f}')

    return losses

In [None]:
def plot_predictions(model, X, y, title):
    model.eval()
    with torch.no_grad():
        predictions = model(X).cpu().numpy()
        actual = y.cpu().numpy()

    plt.figure(figsize=(10, 4))
    plt.plot(actual[:100], 'b-', label='Actual', alpha=0.5)
    plt.plot(predictions[:100], 'r-', label='Predicted', alpha=0.5)
    plt.title(f'{title} - First 100 Predictions')
    plt.legend()
    plt.grid(True)
    plt.show()

    return predictions

In [None]:
print("Generating data...")
data = generate_sequence_data()
sequence_length = 20
X, y = create_sequences(data, sequence_length)

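# Split chronologically: first 80% for training, the rest for testing
train_size = int(len(X) * 0.8)
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]

# Convert to tensors. Inputs have shape (batch, seq_len, 1). Targets are reshaped
# to (batch, 1) to match the model output; otherwise MSELoss broadcasts
# (batch, 1) against (batch,) and computes the wrong loss.
X_train = torch.FloatTensor(X_train).reshape(-1, sequence_length, 1).to(device)
y_train = torch.FloatTensor(y_train).reshape(-1, 1).to(device)
X_test = torch.FloatTensor(X_test).reshape(-1, sequence_length, 1).to(device)
y_test = torch.FloatTensor(y_test).reshape(-1, 1).to(device)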


In [None]:
print("\nInitializing models...")
models = {
    'RNN': SimpleRNN().to(device),
    'LSTM': LSTMPredictor().to(device),
    'GRU': GRUPredictor().to(device)
}

# Train all models
results = {}
for name, model in models.items():
    print(f"\nTraining {name}...")
    losses = train_model(model, X_train, y_train)
    results[name] = losses

In [None]:
plt.figure(figsize=(10, 6))
for name, losses in results.items():
    plt.plot(losses, label=name)
plt.title('Training Loss Comparison')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
print("\nGenerating predictions...")
for name, model in models.items():
    plot_predictions(model, X_test, y_test, name)

# Calculate and print MSE for each model
print("\nModel Performance (MSE):")
criterion = nn.MSELoss()
for name, model in models.items():
    model.eval()
    with torch.no_grad():
        test_predictions = model(X_test)
        mse = criterion(test_predictions, y_test)
        print(f"{name} MSE: {mse.item():.6f}")

print("\nTraining completed!")

1. Instruction Fine-tuning
Think of this as teaching the model to follow specific commands.
Best Use Cases:

Classification tasks
Information extraction
Structured output generation




Example Format:


```
{
    "instruction": "Extract the company names from this text",
    "input": "Apple and Microsoft announced a partnership today",
    "output": ["Apple", "Microsoft"]
}
```


Key Considerations:

Clear, consistent instructions
Well-defined output format
Quality of instruction examples
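
Before training, a record in this format is usually rendered into a single prompt string plus a target string. The sketch below shows one way to do that; the `### Instruction:` / `### Input:` / `### Response:` template is an assumption (one common convention, not something this notebook prescribes), and `format_instruction_example` is a hypothetical helper.

```python
import json

def format_instruction_example(record):
    """Render one record as (prompt, target); the template is an assumption."""
    prompt = (
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Input:\n{record['input']}\n\n"
        f"### Response:\n"
    )
    # Serialize structured outputs as JSON so the target format stays consistent.
    target = json.dumps(record["output"])
    return prompt, target

record = {
    "instruction": "Extract the company names from this text",
    "input": "Apple and Microsoft announced a partnership today",
    "output": ["Apple", "Microsoft"],
}

prompt, target = format_instruction_example(record)
print(prompt + target)
```

Whatever template is chosen, using exactly the same one for every training example and again at inference time is what makes the "clear, consistent instructions" point above hold in practice.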

2. Chat Fine-tuning
Perfect for conversational applications.
Best Use Cases:

Customer service bots
Educational assistants
Interactive agents

Training Format:



```
{
    "messages": [
        {"role": "user", "content": "What's your return policy?"},
        {"role": "assistant", "content": "Our standard return window is 30 days..."},
        {"role": "user", "content": "What if the item is damaged?"},
        {"role": "assistant", "content": "For damaged items, we offer immediate replacement..."}
    ]
}
```




Key Points:

Conversation flow matters
Context handling is crucial
Role consistency is important
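
A conversation in this format also has to be flattened into text before it can be tokenized. The sketch below uses made-up role tags such as `<|user|>`; these tags and the helper names are hypothetical. Each model family defines its own chat template, so treat this only as an illustration of the idea.

```python
def render_chat(messages):
    """Join a conversation into one string using hypothetical role tags."""
    parts = []
    for message in messages:
        role = message["role"]
        # Reject unknown roles early: role consistency matters for chat data.
        if role not in ("system", "user", "assistant"):
            raise ValueError(f"unexpected role: {role}")
        parts.append(f"<|{role}|>\n{message['content']}")
    return "\n".join(parts)

def assistant_turns(messages):
    """Return the turns the model is typically trained to produce."""
    return [m["content"] for m in messages if m["role"] == "assistant"]

conversation = {
    "messages": [
        {"role": "user", "content": "What's your return policy?"},
        {"role": "assistant", "content": "Our standard return window is 30 days..."},
        {"role": "user", "content": "What if the item is damaged?"},
        {"role": "assistant", "content": "For damaged items, we offer immediate replacement..."},
    ]
}

text = render_chat(conversation["messages"])
print(text)
print(len(assistant_turns(conversation["messages"])), "assistant turns to learn from")
```

Because the earlier turns stay in the rendered text, the second assistant reply is learned in the context of the first exchange, which is how the context handling mentioned above enters the training data.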

3. Text Completion
The simplest approach - completing or generating text.
Best For:

Code completion
Content generation
Text continuation

Format:


```
{
    "prompt": "def calculate_fibonacci(n):",
    "completion": "    if n <= 1:\n        return n\n    return calculate_fibonacci(n-1) + calculate_fibonacci(n-2)"
}
```
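
With prompt/completion pairs, a common design choice is to compute the loss only on the completion tokens, so the model is not trained to reproduce the prompt. The sketch below illustrates this with made-up token IDs standing in for a real tokenizer's output; `build_example` is a hypothetical helper, and whether to mask the prompt at all is a choice, not a requirement.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def build_example(prompt_ids, completion_ids):
    """Concatenate prompt and completion; mask prompt positions in the labels."""
    input_ids = prompt_ids + completion_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + completion_ids
    return input_ids, labels

# Hypothetical token IDs, not produced by any real tokenizer.
prompt_ids = [101, 7, 42]        # stands in for "def calculate_fibonacci(n):"
completion_ids = [55, 9, 13, 2]  # stands in for the function body

input_ids, labels = build_example(prompt_ids, completion_ids)
print(input_ids)  # [101, 7, 42, 55, 9, 13, 2]
print(labels)     # [-100, -100, -100, 55, 9, 13, 2]
```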




# Example - BLOOM Fine-tuning

In [None]:
!pip install transformers datasets accelerate torch wandb
!pip install -q peft

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset
import matplotlib.pyplot as plt

In [None]:
model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

In [None]:
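dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
test_dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")

# WikiText contains many empty lines; drop them so no example is pure padding
dataset = dataset.filter(lambda example: len(example["text"].strip()) > 0)
test_dataset = test_dataset.filter(lambda example: len(example["text"].strip()) > 0)

In [None]:
def tokenize_function(examples):
    tokenized = tokenizer(
        examples["text"],
        truncation=True,
        max_length=128,
        padding="max_length",
        return_tensors="pt"
    )
    # Create labels for causal language modeling: the targets are the input IDs
    # (the model shifts them internally). Padding positions are set to -100 so
    # the loss ignores them, which matters here because pad_token is the EOS token.
    labels = tokenized["input_ids"].clone()
    labels[tokenized["attention_mask"] == 0] = -100
    tokenized["labels"] = labels
    return tokenized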



In [None]:
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset.column_names
)
test_tokenized = test_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=test_dataset.column_names
)

In [None]:
training_args = TrainingArguments(
    output_dir="bloom-wiki-tuned",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    save_steps=500,
    logging_steps=100,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_steps=500  # only takes effect when step-based evaluation is enabled
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=test_tokenized
)


print("Starting training...")
train_result = trainer.train()

trainer.save_model()


plt.figure(figsize=(10, 5))
training_loss = [x["loss"] for x in trainer.state.log_history if "loss" in x]
plt.plot(training_loss)
plt.title('Training Loss')
plt.xlabel('Steps')
plt.ylabel('Loss')
plt.show()


test_prompts = [
    "The history of Rome",
    "Quantum mechanics is",
    "The Industrial Revolution"
]

print("\nGeneration Examples:")
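model.eval()
for prompt in test_prompts:
    # Move inputs to the model's device (the Trainer may have moved the model to GPU)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_length=100,
        num_return_sequences=1,
        do_sample=True,  # temperature only has an effect when sampling is enabled
        temperature=0.7
    )
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"\nPrompt: {prompt}")
    print(f"Generated: {generated_text}")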

| Step | Training Loss |
|------|---------------|
| 100  | 1.3939        |
| 200  | 1.4513        |
| 300  | 1.3395        |
| 400  | 1.5019        |
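
A loss value is easier to interpret as perplexity, the exponential of the loss. The sketch below converts the logged values, assuming each one is the mean per-token cross-entropy in nats; roughly speaking, a perplexity of k means the model is as uncertain as if it were choosing uniformly among k tokens.

```python
import math

# Training losses copied from the log above.
logged_losses = {100: 1.3939, 200: 1.4513, 300: 1.3395, 400: 1.5019}

# Perplexity = exp(mean cross-entropy), assuming the loss is measured in nats.
perplexities = {step: math.exp(loss) for step, loss in logged_losses.items()}

for step, ppl in perplexities.items():
    print(f"step {step}: loss={logged_losses[step]:.4f}  perplexity={ppl:.2f}")
```

With only four logged points and no clear downward trend, these numbers say little about whether one epoch of fine-tuning helped; comparing against the loss on the held-out test split would be more informative.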
