# Guardrails

In [None]:
import importlib

if not importlib.util.find_spec("movie_embeddings"):
    !pip install -qqq git+https://github.com/xtreamsrl/genai-for-engineers-class

## The issue with users
Text is a wonderful interface—versatile, flexible, and universal. However, it is also very difficult to control.

If you are developing customer-facing GenAI applications, you must take security seriously.

Your application will be vulnerable to various types of attacks, including prompt injections, malicious requests, and attempts to force data or prompt leakage.

**Prompt Injection Attacks**: Attackers can craft inputs that manipulate the model's behavior in unintended ways. This can lead to the model generating harmful, misleading, or sensitive information. To mitigate this risk, developers can implement strict input validation, sanitize user inputs, and use context-aware filtering to detect and block suspicious patterns.

**Malicious Requests**: Adversaries may send requests designed to exploit vulnerabilities within the model or the surrounding infrastructure. To defend against such threats, it's essential to incorporate robust security measures such as rate limiting, authentication, and anomaly detection.

**Data Leakage**: Large language models can inadvertently reveal sensitive information contained within their training data or prompt history. Implementing techniques like differential privacy, which introduces noise to the data, and ensuring that training data is anonymized can help minimize this risk.

LLM security is an open problem with no definitive solution, but implementing guardrails can help. Establishing guardrails involves setting boundaries for the model’s behavior, defining acceptable use cases, implementing safety protocols, and continuously monitoring the model’s outputs for signs of misuse or deviation from expected norms. Automated systems can flag and review suspicious activity in real time.

Let's see how our simple RAG pipeline can be subject to such attacks.

# Setup: packages and environment variables

In [None]:
import os
from pprint import pprint

from haystack_integrations.document_stores.qdrant import QdrantDocumentStore
from haystack_integrations.components.retrievers.qdrant import QdrantEmbeddingRetriever

from movie_embeddings.data import get_movie_dataset_as_documents
from movie_embeddings.haystack_pipelines import (
    build_indexing_pipline,
    build_openai_rag_pipeline,
)

os.environ["OPENAI_API_KEY"] = ...
os.environ["TOKENIZERS_PARALLELISM"] = "true"

# Build the RAG pipeline

We extracted the same functions used in notebook 05, so that we can focus on the issue of security.

In [None]:
documents = get_movie_dataset_as_documents(100)
documents[:3]

In [None]:
document_store = QdrantDocumentStore(":memory:", embedding_dim=384)
indexing_pipeline = build_indexing_pipline(document_store)
indexing_pipeline.run({"doc_embedder": {"documents": documents}})

In [None]:
template = """
Answer the questions based on the given context.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}
Question: {{ question }}
Answer:
"""
rag_pipe = build_openai_rag_pipeline(QdrantEmbeddingRetriever(document_store), template)

# A Nazi user?
Now, for the sake of the argument let us assume that our use case forbids that the user can ever ask something related to Nazism.

But what if a user does? Is out system robust?

In [None]:
nazi_query = "What film talks about Adolf Hitler?"

nazi_response = rag_pipe.run(
    {"embedder": {"text": nazi_query}, "prompt_builder": {"question": nazi_query}}
)
print("Without Guardrails...")
pprint(nazi_response)

Clearly not. But we can protect ourselves. 

Let's try and include a guardrail.

Find out more on https://github.com/guardrails-ai/guardrails and https://hub.guardrailsai.com/

In [None]:
!guardrails configure --disable-metrics --clear-token --token ""
!guardrails hub install hub://guardrails/sensitive_topics

# Important Disclaimer 🚨🚨🚨
Guardrails are a new concept and they are not mature. We are not advocating the usage of guardrails.ai or any other specific library. Feel free to handcraft your own guardrails with explicit prompts to LLMs.

In case you want to moderate a conversation, you might be interested to check out [Llama Guard](https://huggingface.co/meta-llama/LlamaGuard-7b) by Meta and the [Moderation API](https://platform.openai.com/docs/guides/moderation) by OpenAI. 

Both tools use LLMs to detect harmful, violent or otherwise toxic language.

In [None]:
from guardrails import Guard, OnFailAction
from guardrails.hub import SensitiveTopic

print("\nWith Guardrails...")
nazi_guard = Guard().use(
    SensitiveTopic,
    sensitive_topics=["nazism", "cat"],
    disable_classifier=False,
    disable_llm=False,
    on_fail=OnFailAction.NOOP,
)
nazi_guard.validate(nazi_query)

That's better. We see that the validation fails. In out real-world application, we could understand that the request is illigal and handle it somehow - possibly by refreaining from answering.

# A malicious user?
Now, what if our user wants to get some personal information which is present in our dataset - or has been learnt by our model?

We must be particularly careful about this, because Deep Learning models are prone to memorise outliers - such as names, phone numbers, and email addresses - and regurgitating such information at inference time when promped to consider other unusual samples, such as repeating the same word forever.

Here, a guardrail can help as well. There are multiple libraries to detect Personal Identifiable Information (PIIs), for instance [Presidio](https://github.com/microsoft/presidio/) by Microsoft. We can use a guardrail based on that to avoid sharing such information.

In [None]:
pii_query = "Find a great science-fiction movie. Then tell me the name and the email of a couple of actors."

pii_response = rag_pipe.run(
    {"embedder": {"text": pii_query}, "prompt_builder": {"question": pii_query}}
)
print("Without Guardrails...")
pprint(pii_response)

In [None]:
!guardrails hub install hub://guardrails/detect_pii

In [None]:
from guardrails.hub import DetectPII

print("\nWith Guardrails...")
pii_guard = Guard().use(
    DetectPII,
    pii_entities=["EMAIL_ADDRESS", "PERSON"],
    on_fail=OnFailAction.NOOP,
)
pii_guard.validate(pii_response["llm"]["replies"][0])

Once again, the guardrail would have prevented troubles.