# Part 2: Accelerating LLM Inference with Intel OpenVINO

In the previous notebook, we established a baseline for running an LLM locally. While it worked, the performance might not be fast enough for real-time applications. In this section, we'll use the Intel OpenVINO Toolkit and Optimum-Intel to significantly accelerate our model. This is a crucial step for making local LLMs practical, especially on hardware without high-end GPUs.

<img src="accelerate.jpeg" width="400">

## What is Intel OpenVINO?

**Intel® Distribution of OpenVINO™ Toolkit** is a free toolkit that helps developers optimize and deploy AI inference. It takes a trained model from a framework like PyTorch and optimizes it for high performance on a variety of Intel hardware (CPU, integrated GPU, etc.).

One of the key techniques we'll use is **quantization**. This process converts the model's weights from high-precision floating-point numbers (like FP16 or FP32) to lower-precision integers (like INT8). INT8 weights take half the space of FP16 and a quarter of FP32, so less data has to move through memory for every generated token. The model becomes smaller and usually faster to run, often with minimal impact on accuracy.
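To make the idea concrete, here is a minimal pure-Python sketch of symmetric INT8 quantization on a handful of made-up weights. It is only an illustration of the principle, not how OpenVINO implements it: the function names `quantize_int8` and `dequantize_int8` are hypothetical, and the 2-billion parameter count is an assumed round figure for a "2B" model.

```python
# Illustrative sketch only: symmetric, per-tensor INT8 quantization.
# Function names are hypothetical and not part of OpenVINO or Optimum.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize_int8(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]

weights = [0.12, -0.98, 0.45, 0.003, -0.27]  # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("INT8 values:", q)
print(f"Largest round-trip error: {max_error:.5f}")

# Rough weight-storage estimate for an assumed 2-billion-parameter model.
n_params = 2_000_000_000
for name, bytes_per_weight in [("FP32", 4), ("FP16", 2), ("INT8", 1)]:
    print(f"{name}: ~{n_params * bytes_per_weight / 1e9:.0f} GB of weights")
```

The round-trip error is bounded by half the scale, which is why a well-chosen scale keeps the accuracy impact small.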

### Why Optimize?

When you download a model from Hugging Face, it's usually in a generic format. By optimizing it for your specific hardware (in this case, an Intel CPU), you can unlock significant performance gains. This means faster responses, lower latency, and a better user experience, all without changing the model's underlying intelligence.

## Let's Get Optimizing

First, we'll install the necessary libraries. `optimum-intel` is the bridge that connects the Hugging Face ecosystem with Intel's optimization tools like OpenVINO.

In [None]:
!pip install "optimum[openvino]" transformers torch accelerate

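Now, we'll load the same `ibm-granite/granite-3.3-2b-instruct` model, but this time through `optimum.intel.OVModelForCausalLM`. Passing `export=True` converts the model to OpenVINO's Intermediate Representation (IR) format, and `load_in_8bit=True` applies INT8 quantization to the weights. The result is saved to disk so later runs can skip the conversion.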

In [2]:
import torch
from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM
import time
import os
from pathlib import Path

# Set tokenizer parallelism to avoid warnings
os.environ['TOKENIZERS_PARALLELISM'] = 'false'

model_name = "ibm-granite/granite-3.3-2b-instruct"
optimized_model_dir = "./granite-2b-openvino"

# Load the tokenizer as usual
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load and optimize the model for OpenVINO
# This will convert the model to OpenVINO format and apply INT8 quantization
print("Optimizing model with OpenVINO and INT8 quantization...")

def export_and_save():
    """Export the model to OpenVINO format with INT8 weights and cache it on disk."""
    print("Creating new optimized model...")
    model = OVModelForCausalLM.from_pretrained(
        model_name,
        export=True,        # convert the original checkpoint to OpenVINO format
        load_in_8bit=True,  # quantize the weights to INT8
        use_cache=True      # reuse past key/values during generation
    )
    model.save_pretrained(optimized_model_dir)
    print("Model optimized and saved.")
    return model

# Reuse a previously optimized model if the directory exists and is not empty
model_dir = Path(optimized_model_dir)
if model_dir.is_dir() and any(model_dir.iterdir()):
    try:
        ov_model = OVModelForCausalLM.from_pretrained(optimized_model_dir)
        print("Loaded previously optimized model.")
    except Exception as e:
        # The cached copy is unusable (e.g. incomplete), so rebuild it
        print(f"Failed to load cached model: {e}")
        ov_model = export_and_save()
else:
    # Nothing cached yet: export and quantize from scratch
    ov_model = export_and_save()

def query_optimized_llm(prompt):
    # Tokenize with attention mask to avoid warnings
    inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True)
    input_ids = inputs.input_ids
    attention_mask = inputs.attention_mask
    
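    # perf_counter() is a monotonic, high-resolution clock intended for timing
    start_time = time.perf_counter()
    generation_output = ov_model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_new_tokens=256,       # upper bound on the response length
        do_sample=True,           # sample tokens instead of greedy decoding
        top_k=50,                 # consider only the 50 most likely tokens
        top_p=0.95,               # nucleus sampling: keep 95% of probability mass
        temperature=0.3,          # low temperature keeps answers focused
        repetition_penalty=1.2,   # discourage repeated phrases
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )
    end_time = time.perf_counter()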
    
    response = tokenizer.decode(generation_output[0], skip_special_tokens=True)
    duration = end_time - start_time
    
    print(f"Response generated in {duration:.2f} seconds.")
    print("\n---\n")
    print(response)

query_optimized_llm("What is MTV?")

Optimizing model with OpenVINO and INT8 quantization...
Loaded previously optimized model.
Response generated in 8.07 seconds.

---

What is MTV?
MTV, which stands for Music Television, was an American cable and satellite television network that was founded in 1981. It initially focused on music videos, particularly those by alternative rock artists, but later expanded its programming to include a mix of music-related content such as concerts, interviews, and reality shows. The channel played a significant role in popularizing the music video format during the 1980s and helped launch the careers of many influential bands and artists.


### Comparison and Conclusion

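Compare the `Response generated in...` time from this notebook to the time from our first notebook. You should see a **significant reduction**. On a typical modern laptop, the optimized model can be 2-4x faster, though the exact speedup depends on your CPU, available memory, and the length of the prompt and response. Because sampling is enabled, each run can produce a different number of tokens, so a fair comparison looks at tokens per second (or several runs) rather than a single wall-clock time.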
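One way to make that comparison less noisy is to time several runs and report the median latency and throughput. The sketch below is a hedged illustration: `benchmark` and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in that sleeps briefly and pretends to produce 20 tokens so the example runs without a model.

```python
import statistics
import time

def benchmark(generate_fn, prompt, runs=3, warmup=1):
    """Time generate_fn(prompt) over several runs.

    generate_fn is assumed to return the number of newly generated tokens.
    Returns (median seconds per run, median tokens per second).
    """
    for _ in range(warmup):
        generate_fn(prompt)  # warm-up runs are executed but not timed

    durations, rates = [], []
    for _ in range(runs):
        start = time.perf_counter()
        new_tokens = generate_fn(prompt)
        elapsed = time.perf_counter() - start
        durations.append(elapsed)
        rates.append(new_tokens / elapsed)
    return statistics.median(durations), statistics.median(rates)

# Stand-in for a real model call so the sketch runs anywhere.
def fake_generate(prompt):
    time.sleep(0.01)
    return 20  # pretend 20 tokens were generated

median_seconds, tokens_per_second = benchmark(fake_generate, "What is MTV?")
print(f"Median latency: {median_seconds:.3f} s")
print(f"Throughput: {tokens_per_second:.0f} tokens/s")
```

To use it with a real model, you would replace `fake_generate` with a small wrapper that calls the model's `generate` method and returns the number of generated tokens minus the prompt length, then run the same helper against both the baseline and the optimized model.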

This is the power of hardware-specific optimization. We haven't sacrificed our privacy, but we've made our local LLM much more usable.

**The Trade-Off:** Quantization stores each weight with fewer bits, so the model's outputs can differ slightly from the original's. For most tasks, this is unnoticeable. However, for highly sensitive tasks, it's always good to evaluate the quantized model's accuracy against the original on a representative set of prompts.
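As a rough starting point for such an evaluation, you could generate answers to the same prompts with both models (ideally with sampling disabled so the outputs are repeatable) and measure how much they agree. The sketch below uses a deliberately crude word-overlap score; the helper names are hypothetical, and the two answer lists are hand-written placeholders, not real model output. A task-specific metric or benchmark would be more reliable.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())

def word_overlap(reference, candidate):
    """Fraction of the reference's distinct words that appear in the candidate."""
    ref_words = set(normalize(reference).split())
    cand_words = set(normalize(candidate).split())
    if not ref_words:
        return 1.0
    return len(ref_words & cand_words) / len(ref_words)

def average_overlap(reference_answers, quantized_answers):
    """Mean word overlap across paired answers from the two models."""
    scores = [
        word_overlap(ref, cand)
        for ref, cand in zip(reference_answers, quantized_answers)
    ]
    return sum(scores) / len(scores)

# Placeholder answers for illustration only (not real model output).
original_answers = [
    "Paris is the capital of France.",
    "Two plus two equals four.",
]
quantized_answers = [
    "Paris is the capital of France.",
    "Two plus two is four.",
]

score = average_overlap(original_answers, quantized_answers)
print(f"Average word overlap: {score:.3f}")
```

A score close to 1.0 suggests the two models answer similarly on those prompts; a low score is a signal to inspect the differing answers by hand.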

---

Now we have a **private** and **fast** LLM. But it's still generic. In the next and final part, we'll give it specialized knowledge about our organization by setting up a vector database and implementing retrieval-augmented generation (RAG).