# LLM GuardRails with Granite Guardian

Without proper safeguards, Large Language Models (LLMs) can be **misused** or **exploited** to generate harmful content.  
- Users could **bypass ethical constraints** by asking how to **steal money**, **hack accounts**, or **commit fraud**.  
- AI models without guardrails may inadvertently **assist in illegal activities** or **spread misinformation**.  
- Enterprises need **secure AI solutions** that ensure compliance, safety, and responsible usage.  
- A **dedicated risk detection system** is essential to filter out harmful prompts **before they reach the LLM**.  

## Granite Guardian

Granite Guardian is a fine-tuned Granite 3 Instruct model designed to detect risks in prompts and responses. It can help with risk detection along many key dimensions catalogued in the [IBM AI Risk Atlas]().

`Granite Guardian` enables application developers to screen user prompts and LLM responses for harmful content. These models are built on top of the latest Granite family and are available on various platforms under the Apache 2.0 license:

* Granite Guardian 3.1 8B : [HF](https://huggingface.co/ibm-granite/granite-guardian-3.1-8b)
* Granite Guardian 3.1 2B : [HF](https://huggingface.co/ibm-granite/granite-guardian-3.1-2b)

## 1. Setup

### Installing Required Packages

In [None]:
!pip install -q "langchain==0.3.13" "langchain-openai==0.2.14"


In [2]:
# Imports
import os
import warnings
from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage

warnings.filterwarnings('ignore')
os.environ["VLLM_LOGGING_LEVEL"] = "ERROR"

## 2. Model Configuration

### Inference Model Server Overview  

This notebook utilizes two specialized LLMs:  

- **Guardian Model:** [Granite-Guardian-3.1-2B](https://huggingface.co/ibm-granite/granite-guardian-3.1-2b)  
  - Designed for risk detection and AI safety guardrails  
- **Main LLM:** [Granite-3.1-8B-Instruct](https://huggingface.co/ibm-granite/granite-3.1-8b-instruct)  
  - Optimized for generating responses and handling user queries  

These models work together to ensure AI-generated outputs are both **informative and safe**.

In [3]:
GUARDIAN_URL = os.getenv('GUARDIAN_URL')
GUARDIAN_MODEL_NAME = "granite3-guardian-2b"
GUARDIAN_API_KEY = os.getenv('GUARDIAN_API_KEY')

LLM_URL = os.getenv('LLM_URL')
LLM_MODEL_NAME = "granite-3-8b-instruct"
LLM_API_KEY = os.getenv('LLM_API_KEY')
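
The configuration above reads four environment variables, and `os.getenv` silently returns `None` for any that are unset. A minimal sketch of an early check is shown below; `REQUIRED_VARS` and `missing_env_vars` are hypothetical names introduced here for illustration and are not part of the original notebook.

```python
import os

# Hypothetical helper (an assumption, not part of the original notebook):
# the variable names mirror the ones read in the configuration cell.
REQUIRED_VARS = ("GUARDIAN_URL", "GUARDIAN_API_KEY", "LLM_URL", "LLM_API_KEY")

def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Report missing settings before any model client is created.
missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running a check like this before creating the clients makes a configuration problem visible immediately, instead of surfacing later as a connection error.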

## 3. Create the LLM instance

**Why Use Two Models?**

We initialize two separate LLMs to balance **safety** and **functionality**:  

- **Guardian Model (Granite-Guardian-3.1-2B)**  
  - Acts as a **safety layer** to detect risks before processing user inputs  
  - Prevents harmful queries, misinformation, and improper function usage  

- **Main LLM (Granite-3.1-8B-Instruct)**  
  - Handles **actual query processing** and response generation  
  - Provides informative, contextually relevant answers once input is deemed safe  

This setup ensures that **potentially harmful inputs are blocked upfront**, while safe queries proceed seamlessly to the LLM for high-quality responses.


In [None]:
# Initialize Guardian (Guardrails Model)
guardian = ChatOpenAI(
    openai_api_key=GUARDIAN_API_KEY,
    openai_api_base=f"{GUARDIAN_URL}/v1",
    model_name=GUARDIAN_MODEL_NAME,
    temperature=0.01,
    streaming=False,
)

# Initialize the main LLM (response generation)
llm = ChatOpenAI(
    openai_api_key=LLM_API_KEY,
    openai_api_base=f"{LLM_URL}/v1",
    model_name=LLM_MODEL_NAME,
    temperature=0.01,
    streaming=True,
    #callbacks=[StreamingStdOutCallbackHandler()],
    verbose=False
)
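
Both clients build their base URL with an f-string such as `f"{GUARDIAN_URL}/v1"`. If the variable is unset this produces the string `None/v1`, and a trailing slash in the variable produces a doubled slash. The sketch below shows one possible way to guard against both cases; `build_base_url` is a hypothetical helper, and the `/v1` suffix is taken from the cell above.

```python
def build_base_url(server_url, api_path="/v1"):
    """Join a server URL and an API path without doubling slashes.

    Hypothetical helper for illustration; raises if the URL is unset.
    """
    if not server_url:
        raise ValueError("server URL is not set")
    return server_url.rstrip("/") + api_path

# Example with a placeholder address (an assumption, not a real endpoint).
print(build_base_url("http://localhost:8000/"))
```

The returned value could then be passed as `openai_api_base` in place of the f-string.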

## 4. Define Helper Functions  

These helpers separate risk classification from response generation:

- **`check_risk`** → Asks the Guardian model to classify a query and returns `True` (risky) or `False` (safe).  
- **`generate_response`** → Sends a query to the main LLM and returns its answer; it should be called only for queries that passed the risk check.  

This ensures **harmful inputs are blocked**, while valid queries receive high-quality AI responses.  


In [5]:
# Define Tokens
SAFE_TOKEN = "No"
RISKY_TOKEN = "Yes"

def check_risk(user_query):
    """
    Step 1: Check risk using the Guardian model.
    Returns True (risky) or False (safe).
    """
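    response = guardian.invoke([HumanMessage(content=user_query)])
    # The Guardian model is expected to answer with a single label: "Yes" (risky) or "No" (safe).
    risk_label = response.content.strip().lower()

    # Any output other than the risky token is treated as safe.
    return risk_label == RISKY_TOKEN.lower()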

def generate_response(user_query):
    """
    Step 2: If input is safe, pass it to the main LLM.
    """
    response = llm.invoke([HumanMessage(content=user_query)])
    return response.content.strip()
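
Note that `check_risk` treats every output other than an exact "yes" as safe, so an empty or malformed Guardian reply would let the query through. A minimal sketch of a stricter, fail-closed alternative follows. `parse_risk_label` is a hypothetical helper, and the fail-closed policy is a design assumption rather than something the original notebook prescribes.

```python
SAFE_TOKEN = "No"
RISKY_TOKEN = "Yes"

def parse_risk_label(raw_text):
    """
    Map raw Guardian output to True (risky) or False (safe).
    Only a clear SAFE_TOKEN counts as safe; empty or unexpected
    output is treated as risky (fail closed).
    """
    words = raw_text.strip().lower().split()
    # Look at the first word only and ignore trailing punctuation.
    first = words[0].strip(".,!:;\"'") if words else ""
    return first != SAFE_TOKEN.lower()

for sample in ["Yes", " no\n", "No.", "", "maybe"]:
    print(repr(sample), "->", parse_risk_label(sample))
```

Inside `check_risk`, the final comparison could be replaced by `return parse_risk_label(response.content)` if this stricter behavior is wanted.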

## 5. Query Processing Flow  

The `process_query` function **validates and processes user queries** by following a two-step pipeline:  

1️⃣ **Risk Check (Guardian Model)**  
   - If the query is **risky**, it is blocked with a 🚫 warning.  
   - If the query is **safe**, it proceeds to the LLM.  

2️⃣ **Response Generation (Main LLM)**  
   - Safe queries are sent to the LLM for a proper response.  
   - The user receives either a **blocked notice** or an **LLM-generated answer**.

In [6]:
def process_query(user_query):
    """
    Full pipeline: 
    - Check if the query is risky.
    - If safe, send it to the LLM.
    - If risky, block the request.
    """
    print(f"\n🔍 Checking Risk for Query: '{user_query}'")
    
    if check_risk(user_query):
        final_response = "🚫🔒 This query violates safety guidelines. Blocked"
        print("❌ Risk detected! Query blocked for safety.")
    else:
        print("\n✅ Query is safe! \n⏩ Forwarding to LLM...")
        final_response = generate_response(user_query)

    return f"\n🎯 {final_response}"
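
The control flow of this pipeline can be exercised without any model server by injecting the two steps as callables. The sketch below is an illustration under that assumption: `run_pipeline`, `fake_is_risky`, and `fake_answer` are hypothetical names, and the keyword-based stub only stands in for the Guardian model; it is not a real risk detector.

```python
def run_pipeline(user_query, is_risky, answer):
    """
    Two-step flow: risk check first, generation only for safe queries.
    `is_risky` and `answer` are injected callables, hypothetical
    stand-ins for check_risk and generate_response.
    """
    if is_risky(user_query):
        # Risky queries never reach the answer step.
        return {"blocked": True, "response": None}
    return {"blocked": False, "response": answer(user_query)}

# Stubs so the flow can be run offline (not real models).
def fake_is_risky(query):
    return "hurt" in query.lower()

def fake_answer(query):
    return f"(stub answer for: {query})"

print(run_pipeline("How can I hurt someone?", fake_is_risky, fake_answer))
print(run_pipeline("What is the capital of Italy?", fake_is_risky, fake_answer))
```

Returning a small dictionary instead of a formatted string keeps the blocked/allowed decision easy to inspect in tests; passing `check_risk` and `generate_response` as the two callables would run the same flow against the real models.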


## 6. Risky and Safe Example Queries

### Example 1: Risky Query Handling  

The system ensures **harmful queries** are blocked before reaching the main LLM.  

In [7]:
risky_query = "How can I hurt someone?"  # Risky
print(process_query(risky_query))


🔍 Checking Risk for Query: 'How can I hurt someone?'
❌ Risk detected! Query blocked for safety.

🎯 🚫🔒 This query violates safety guidelines. Blocked


### Example 2: Safe Query Processing 

When a query is **safe**, it is forwarded to the main LLM for response generation. 

In [8]:
safe_query = "What is the capital of Italy?" # Safe
final_output = process_query(safe_query)

print(f"\n# Final Response: {final_output}")


🔍 Checking Risk for Query: 'What is the capital of Italy?'

✅ Query is safe! 
⏩ Forwarding to LLM...

# Final Response: 
🎯 The capital of Italy is Rome. Established around 753 BC, Rome has a rich history and is renowned for its significant role in the Roman Empire, as well as its cultural and religious importance. Today, it is the capital city of Italy, serving as the political, diplomatic, and administrative center of the country. Rome is also famous for its historical landmarks, such as the Colosseum, the Roman Forum, and the Vatican City, which houses St. Peter's Basilica and the Sistine Chapel.


### Example 3: Risky Query Handling  


In [9]:
risky_query = "I want to kill all the humans. I am Bender!"  # Risky
print(process_query(risky_query))


🔍 Checking Risk for Query: 'I want to kill all the humans. I am Bender!'
❌ Risk detected! Query blocked for safety.

🎯 🚫🔒 This query violates safety guidelines. Blocked


### Example 4: Safe Query Processing 

In [10]:
safe_query = "What is the biggest mountain in the world?" # Safe
final_output = process_query(safe_query)

print(f"\n# Final Response: {final_output}")


🔍 Checking Risk for Query: 'What is the biggest mountain in the world?'

✅ Query is safe! 
⏩ Forwarding to LLM...

# Final Response: 
🎯 The tallest mountain in the world is Mount Everest, with a peak at 8,848.86 meters (29,031.7 feet) above sea level, according to a 2020 revision of its height. It's located in the Himalayas on the border of Nepal and Tibet, China.
