# Setup

Install the required dependencies.

In [1]:
!pip install --quiet --upgrade langchain_community langchain-huggingface vllm==v0.6.4.post1

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
codeflare-sdk 0.26.0 requires pydantic<2, but you have pydantic 2.11.1 which is incompatible.
kfp 2.9.0 requires requests-toolbelt<1,>=0.8.0, but you have requests-toolbelt 1.0.0 which is incompatible.
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [4]:
# Install the YAML magic
!pip install --quiet yamlmagic
%load_ext yamlmagic


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


Configure the embedding and generator models.

In [5]:
%%yaml parameters

# Model
embedder_model_name_or_path: ibm-granite/granite-embedding-30m-english
generator_model_name_or_path: ibm-granite/granite-3.2-2b-instruct


# LangChain RAG

Prepare a list of documents, one per chunk.

In [6]:
import urllib.request
from langchain_core.documents import Document


link = "https://huggingface.co/ngxson/demo_simple_rag_py/raw/main/cat-facts.txt"
documents = []

# Fetch the knowledge base from the link and treat every line as a separate chunk.
for line in urllib.request.urlopen(link):
    documents.append(
        Document(
            page_content=line.decode('utf-8'),
            metadata={"source": "cats", "doc_id": "cats"}
        )
    )

print(f'Loaded {len(documents)} entries')

Loaded 150 entries
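
Because each line is decoded as-is, every chunk keeps its trailing newline, which is visible in the retrieved `page_content` values later on. The following is a minimal sketch of a cleaner loading step, assuming that stripping whitespace and skipping blank lines is acceptable for this dataset. The helper `lines_to_chunks` and the sample lines are hypothetical, and plain dictionaries stand in for `Document` objects so the snippet runs without LangChain.

```python
def lines_to_chunks(raw_lines, source="cats"):
    """Turn raw byte lines into chunk records, one per non-empty line."""
    chunks = []
    for raw in raw_lines:
        text = raw.decode("utf-8").strip()  # drop the trailing newline
        if not text:
            continue  # skip blank lines so they are never embedded
        chunks.append(
            {
                "page_content": text,
                "metadata": {"source": source, "doc_id": source},
            }
        )
    return chunks


# Hypothetical sample standing in for the lines read from the URL.
sample = [
    b"Cats sleep a lot.\n",
    b"\n",
    b"A group of cats is called a clowder.\n",
]
chunks = lines_to_chunks(sample)
print(f"Loaded {len(chunks)} entries")
```

The same dictionaries map directly onto `Document(page_content=..., metadata=...)` if the cleaned chunks should go into the vector store.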


Initialize the vector store and fill it with the chunks.

In [7]:
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_huggingface import HuggingFaceEmbeddings


embeddings = HuggingFaceEmbeddings(
    model_name=parameters['embedder_model_name_or_path'],
)
vector_store = InMemoryVectorStore(embeddings)
document_ids = vector_store.add_documents(documents=documents)

**Specify user query here**

In [8]:
input_query = "tell me about cat mummies"

Use similarity search to retrieve the chunks most related to the query from the vector store.

This cell only shows which context the vector store retriever returns. The actual retrieval happens inside the LangChain pipeline below.

In [6]:
docs = vector_store.similarity_search(input_query)

print('Retrieved knowledge:')
for doc in docs:
    print(doc)

Retrieved knowledge:
page_content='In ancient Egypt, mummies were made of cats, and embalmed mice were placed with them in their tombs. In one ancient city, over 300,000 cat mummies were found.
' metadata={'source': 'cats', 'doc_id': 'cats'}
page_content='In 1888, more than 300,000 mummified cats were found an Egyptian cemetery. They were stripped of their wrappings and carted off to be used by farmers in England and the U.S. for fertilizer.
' metadata={'source': 'cats', 'doc_id': 'cats'}
page_content='When a family cat died in ancient Egypt, family members would mourn by shaving off their eyebrows. They also held elaborate funerals during which they drank wine and beat their breasts. The cat was embalmed with a sculpted wooden mask and the tiny mummy was placed in the family tomb or in a pet cemetery with tiny mummies of mice.
' metadata={'source': 'cats', 'doc_id': 'cats'}
page_content='Mohammed loved cats and reportedly his favorite cat, Muezza, was a tabby. Legend says that tabby c
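
To make the retrieval step less of a black box, here is a rough sketch of what a similarity search does conceptually: score every stored vector against the query vector and keep the best `k`. It assumes cosine similarity and `k=4`, which, as far as I know, match the defaults of `InMemoryVectorStore.similarity_search`. The three-dimensional vectors, the texts, and the helper names are made up for illustration; real embeddings come from the embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=4):
    """Return the texts of the k entries most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy (text, vector) pairs; the vectors are invented, not real embeddings.
toy_store = [
    ("cat mummies in Egypt", [0.9, 0.1, 0.0]),
    ("cats sleep most of the day", [0.1, 0.9, 0.0]),
    ("mummified cats used as fertilizer", [0.8, 0.2, 0.1]),
]
toy_query = [1.0, 0.0, 0.0]
results = top_k(toy_query, toy_store, k=2)
print(results)
```

With these toy vectors the two mummy-related texts rank first, because their vectors point in nearly the same direction as the query.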

Initialize a local vLLM engine and construct the RAG chain.

In [9]:
from langchain_community.llms import VLLM
from langchain.prompts import PromptTemplate
from langchain.chains.retrieval import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
import transformers


llm = VLLM(
    model=parameters['generator_model_name_or_path'],
    trust_remote_code=True,
    max_new_tokens=1024,
    top_k=10,
    top_p=0.95,
    temperature=0.1,
)

tokenizer = transformers.AutoTokenizer.from_pretrained(parameters['generator_model_name_or_path'])
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

# Build the prompt from the model's chat template, leaving {input} and {context}
# as placeholders for the user query and the retrieved chunks.
prompt = tokenizer.apply_chat_template(
    conversation=[{
        "role": "user",
        "content": "{input}",
    }],
    documents=[{
        "title": "placeholder",
        "text": "{context}",
    }],
    add_generation_prompt=True,
    tokenize=False,
)
prompt_template = PromptTemplate.from_template(template=prompt)

# Format each retrieved chunk with a per-document prompt template
document_prompt_template = PromptTemplate.from_template(template="""\
Document {doc_id}
{page_content}""")
document_separator = "\n\n"

# Create retrieval chain
combine_docs_chain = create_stuff_documents_chain(
    llm=llm,
    prompt=prompt_template,
    document_prompt=document_prompt_template,
    document_separator=document_separator,
)
rag_chain = create_retrieval_chain(
    retriever=vector_store.as_retriever(),
    combine_docs_chain=combine_docs_chain,
)

INFO 04-01 12:49:21 config.py:1136] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 04-01 12:49:21 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='ibm-granite/granite-3.2-2b-instruct', speculative_config=None, tokenizer='ibm-granite/granite-3.2-2b-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=ibm-granite/granite-3.2-2b-instruct

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]


INFO 04-01 12:49:24 model_runner.py:1077] Loading model weights took 4.7349 GB
INFO 04-01 12:49:24 worker.py:232] Memory profiling results: total_gpu_memory=79.14GiB initial_memory_usage=5.38GiB peak_torch_memory=5.30GiB memory_usage_post_profile=5.39GiB non_torch_memory=0.53GiB kv_cache_size=65.39GiB gpu_memory_utilization=0.90
INFO 04-01 12:49:24 gpu_executor.py:113] # GPU blocks: 53568, # CPU blocks: 3276
INFO 04-01 12:49:24 gpu_executor.py:117] Maximum concurrency for 131072 tokens per request: 6.54x
INFO 04-01 12:49:29 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-01 12:49:29 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO
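
The "stuff" step of the chain can be pictured as plain string formatting: each retrieved chunk is rendered with the per-document template, the results are joined with the separator, and the joined text replaces `{context}` in the prompt. The sketch below mimics that behavior in plain Python under these assumptions. It is not the LangChain implementation, the helper `stuff_documents` is hypothetical, and dictionaries stand in for `Document` objects.

```python
def stuff_documents(
    docs,
    doc_template="Document {doc_id}\n{page_content}",
    separator="\n\n",
):
    """Render every chunk with the template and join them into one context string."""
    rendered = [
        doc_template.format(page_content=doc["page_content"], **doc["metadata"])
        for doc in docs
    ]
    return separator.join(rendered)


# Hypothetical chunks with the same metadata keys as in this notebook.
example_docs = [
    {"page_content": "Fact one.", "metadata": {"source": "cats", "doc_id": "cats"}},
    {"page_content": "Fact two.", "metadata": {"source": "cats", "doc_id": "cats"}},
]
context = stuff_documents(example_docs)
print(context)
```

Since every chunk here shares the same `doc_id`, each rendered block starts with the same `Document cats` header. A distinct `doc_id` per chunk would probably make the context easier for the model to reference.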

In [10]:
output = rag_chain.invoke({"input": input_query})

print(output['answer'])

Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.76s/it, est. speed input: 148.71 toks/s, output: 140.38 toks/s]

Cat mummies have a significant historical and cultural importance, particularly in ancient Egypt. Here are some key points:

1. **Mummification Practice**: In ancient Egypt, cats were mummified as part of their funeral rituals. This practice was not limited to cats alone; mice were also embalmed and placed in the tombs of cats.

2. **Abundance**: Over 300,000 cat mummies were discovered in a single Egyptian cemetery in 1888. This suggests a widespread practice of cat mummification.

3. **Use in Fertilizer**: After being stripped of their wrappings, these cat mummies were exported to England and the U.S. for use as fertilizer.

4. **Mourning Rituals**: When a family cat died, Egyptians would mourn by shaving off their eyebrows and holding elaborate funerals. They would drink wine and beat their breasts during these ceremonies.

5. **Placement**: The tiny mummified cat was often placed in the family tomb or in a pet cemetery, sometimes alongside mummies of mice.

6. **Cultural Significan
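
Besides `answer`, the dictionary returned by a chain built with `create_retrieval_chain` should also carry the original `input` and the retrieved documents under `context`. The sketch below shows one way to inspect such a result. `FakeDocument`, `summarize_output`, and the sample values are hypothetical stand-ins so the snippet runs without a GPU or LangChain; the helper only relies on the `metadata` attribute, so it should also accept real `Document` objects.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a LangChain Document."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_output(output):
    """Collect a few facts about a retrieval-chain result dictionary."""
    docs = output["context"]
    return {
        "question": output["input"],
        "num_chunks": len(docs),
        "sources": sorted({doc.metadata.get("source", "unknown") for doc in docs}),
        "answer_chars": len(output["answer"]),
    }


# Made-up result with the same keys the chain is expected to return.
fake_output = {
    "input": "tell me about cat mummies",
    "context": [
        FakeDocument("Fact one.", {"source": "cats", "doc_id": "cats"}),
        FakeDocument("Fact two.", {"source": "cats", "doc_id": "cats"}),
    ],
    "answer": "Cat mummies were common in ancient Egypt.",
}
summary = summarize_output(fake_output)
print(summary)
```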




# Cleaning Up

Delete the pipeline and release the associated model's GPU memory.

In [11]:
import torch


del rag_chain, combine_docs_chain, llm
torch.cuda.empty_cache()
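
`del` only removes the names. The memory becomes reclaimable once the underlying objects are garbage collected, and `torch.cuda.empty_cache()` then returns unused cached blocks to the driver. A slightly more defensive variant is sketched below. The helper `release_gpu_memory` is hypothetical, it assumes no other references to the model remain, and vLLM may keep additional state alive that this does not cover.

```python
import gc


def release_gpu_memory():
    """Run garbage collection, then empty PyTorch's CUDA cache if possible.

    Returns True if the CUDA cache was emptied, False otherwise.
    """
    gc.collect()  # collect objects that are no longer referenced
    try:
        import torch
    except ImportError:
        return False  # PyTorch is not installed, nothing to release
    if not torch.cuda.is_available():
        return False  # no CUDA device, the cache call would do nothing
    torch.cuda.empty_cache()
    return True


freed = release_gpu_memory()
print(f"CUDA cache emptied: {freed}")
```

Call it after the `del` statement so that the deleted objects are eligible for collection.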