# Part 1: Baseline Benchmark

In this notebook, we'll look at why a team might build a private AI pipeline instead of relying on cloud-based services. We'll discuss the trade-offs and then run a baseline test with a compact, memory-efficient, open-source Large Language Model (LLM) to see how it performs on a laptop.

<img src="no-cloud.jpeg" width="400">

## The Great Divide: Cloud vs. Local LLMs

**Large Data Centers (The Cloud):**

Think of services like Google's NotebookLM, OpenAI's ChatGPT, or Anthropic's Claude. The models behind them are very large (vendors rarely publish exact sizes, but estimates commonly run to hundreds of billions of parameters or more), and they run in massive, distributed data centers with thousands of high-end GPUs.

- **Pros:** Incredible power, vast general knowledge, and no need for local hardware investment.
- **Cons:** Your data is sent to a third-party server, which is a non-starter for regulated or confidential workloads.

**Small Environments (In-House):**

This is our focus. A development team, a small company, or even a single researcher needs to work with sensitive data that cannot leave the network. The models are smaller (billions of parameters, not hundreds of billions), and they run on local hardware.

- **Pros:** Your data never has to leave your network, and you keep full control over both the model and the data.
- **Cons:** Requires local hardware (from a powerful laptop to a small server), and smaller models may not have the same breadth of knowledge as their cloud counterparts.

## Use Cases for Offline AI

Why go to all this trouble? Here are a few scenarios where private AI is not just a choice, but a necessity:

- **Healthcare:** Patient records (PHI) are protected by strict regulations like HIPAA.
- **Legal:** Attorney-client privilege and sensitive case data must remain confidential.
- **Finance:** Proprietary trading algorithms, financial reports, and customer data are all highly sensitive.
- **R&D:** Unpatented research, chemical formulas, and engineering plans are the crown jewels of a company.
- **Government/Defense:** Classified information and matters of national security.

## The Hardware Question: Laptop vs. Team Server

**Running on a Laptop:**

- **Feasibility:** Modern laptops, especially those with a decent amount of RAM and a dedicated GPU (like an NVIDIA RTX series), can run smaller LLMs (1B to 8B parameters) quite effectively. It's perfect for individual development and testing.
- **Limitations:** You'll be limited by your laptop's memory and processing power. Running larger models or handling many requests at once will be slow.

**Setting up a Team Server:**

- **Feasibility:** A dedicated server with one or more powerful GPUs (e.g., NVIDIA A100, H100) can be a shared resource for the entire team. It can run larger, more capable models and serve multiple users simultaneously.
- **Investment:** This requires a budget for hardware and someone to maintain it. However, it's often a small price to pay for data security.
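
To make the laptop-versus-server question more concrete, here is a minimal back-of-the-envelope sketch. It assumes that weight memory is simply the parameter count multiplied by the bytes per parameter for the chosen data type. The helper name `weight_memory_gb` is hypothetical, and the estimate ignores activations, the KV cache, and framework overhead, so treat it as a lower bound.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Ignores activations, the KV cache, and framework overhead,
# so the result is a lower bound, not a hardware requirement.

BYTES_PER_PARAM = {
    "float32": 4,
    "bfloat16": 2,
    "float16": 2,
    "int8": 1,
}


def weight_memory_gb(num_params: float, dtype: str = "bfloat16") -> float:
    """Return the approximate size of the model weights in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for billions in (1, 2, 8, 70):
    size = weight_memory_gb(billions * 1e9, "bfloat16")
    print(f"{billions:>3}B parameters in bfloat16: ~{size:.0f} GB of weights")
```

By this estimate, models in the 1B to 8B range fit in the memory of a well-equipped laptop, while a 70B model in bfloat16 already calls for server-class hardware.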

## Benchmark: Vanilla LLM on a Laptop

Let's get our hands dirty. We'll install the necessary libraries and download the `ibm-granite/granite-3.3-2b-instruct` model. This is a powerful, open-source model that's small enough to run on a reasonably modern laptop.

We'll ask it about "MTV." Without any organization-specific context, it should default to its general knowledge.

In [None]:
%pip install transformers torch accelerate

In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import time

model_name = "ibm-granite/granite-3.3-2b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    torch_dtype=torch.bfloat16,
    device_map="auto"  # Automatically handle device placement
)

def query_llm(prompt):
    # Tokenize with attention mask to avoid warnings
    inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True)
    input_ids = inputs.input_ids
    attention_mask = inputs.attention_mask
    
    # Move inputs to the model's device (CUDA, Apple MPS, or CPU).
    # Doing this unconditionally avoids a device mismatch on non-CUDA accelerators.
    input_ids = input_ids.to(model.device)
    attention_mask = attention_mask.to(model.device)
    
    start_time = time.time()
    with torch.no_grad():  # Disable gradient computation for inference
        generation_output = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_new_tokens=256,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=0.3,
            repetition_penalty=1.2,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
    end_time = time.time()
    
    # Decode the full sequence, so the printed text starts with the original prompt
    response = tokenizer.decode(generation_output[0], skip_special_tokens=True)
    duration = end_time - start_time
    
    print(f"Response generated in {duration:.2f} seconds.")
    print("\n---\n")
    print(response)

query_llm("What is MTV?")

Response generated in 17.15 seconds.

---

What is MTV?
MTV stands for Music Television, a cable and satellite television network that was founded in 1981. It initially focused on music videos, particularly alternative rock and hip hop, but later expanded its content to include reality TV shows, talk shows, animated programming, live events, and more. The original name of the channel was M2 (Music Television), which changed to MTV in 1987 when it launched nationally in the United States with a new logo design by Spike Jonze and Dave Hill.


### Expected Outcome

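The model will almost certainly describe **MTV (Music Television)**, the American cable channel. It will typically mention music videos, reality shows (such as *The Real World*), and the channel's cultural impact. This is the expected "general knowledge" reading of the acronym, although a small model can get the details wrong: the sample output above, for instance, claims the channel was originally named M2 and launched nationally in 1987, whereas MTV launched under that name in 1981.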

However, in our organization, "MTV" means **Migration Toolkit for Virtualization**. The model has no way of knowing this. This is the problem that Retrieval-Augmented Generation (RAG) will solve for us in the next notebooks.
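
As a preview of the idea, here is a minimal sketch of how organization-specific context could be placed in front of a question before it reaches the model. It is not the RAG pipeline built in the later notebooks: the `GLOSSARY` dictionary and the `build_prompt` helper are hypothetical, and a real system would retrieve the context from internal documents instead of a hard-coded lookup.

```python
# Hypothetical in-house glossary. In a real pipeline this text would be
# retrieved from internal documents rather than hard-coded.
GLOSSARY = {
    "MTV": "Migration Toolkit for Virtualization",
}


def build_prompt(question: str, glossary: dict[str, str]) -> str:
    """Prepend definitions for any glossary terms that appear in the question."""
    # Strip punctuation so that "MTV?" still matches the "MTV" entry.
    words = {word.strip("?.,!:;") for word in question.split()}
    context_lines = [
        f"- {term}: {meaning}"
        for term, meaning in glossary.items()
        if term in words
    ]
    if not context_lines:
        return question  # Nothing relevant found; pass the question through.
    context = "\n".join(context_lines)
    return f"Use the following context to answer.\n{context}\n\nQuestion: {question}"


print(build_prompt("What is MTV?", GLOSSARY))
```

The augmented prompt could then be passed to `query_llm` in place of the bare question, giving the model the definition it otherwise has no way of knowing.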

Next up, we'll see how to reduce this inference time with some clever optimizations from Intel.
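
A single timed run is a noisy baseline for a before-and-after comparison. The sketch below shows one way to collect a more stable number. The `benchmark` and `tokens_per_second` helpers are hypothetical and not part of this notebook's code, and the workload shown is a stand-in so the example runs without a model loaded.

```python
import statistics
import time


def benchmark(fn, runs: int = 3, warmup: int = 1) -> dict:
    """Time fn() over several runs and summarize the wall-clock durations."""
    for _ in range(warmup):
        fn()  # Warm-up runs are discarded (caches, lazy initialization).
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "median_s": statistics.median(durations),
        "min_s": min(durations),
        "max_s": max(durations),
    }


def tokens_per_second(new_tokens: int, seconds: float) -> float:
    """Convert one generation run into a throughput figure."""
    return new_tokens / seconds


# Stand-in workload so the sketch runs on its own.
stats = benchmark(lambda: sum(range(100_000)))
print(stats)
```

In this notebook, the stand-in could be swapped for `lambda: query_llm("What is MTV?")`. Reporting tokens per second alongside the median duration makes runs with different output lengths easier to compare.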