# Constructing a RAG System: LlamaIndex & Ollama

AMD Radeon GPUs are officially supported by [ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/), ensuring compatibility with industry-standard software frameworks. This Jupyter notebook uses [Ollama](https://ollama.com/) and [LlamaIndex](https://docs.llamaindex.ai/en/stable/), powered by ROCm, to build a Retrieval-Augmented Generation (RAG) application. LlamaIndex handles the pipeline from reading PDFs to indexing the data and building a query engine, while Ollama provides the backend service for Large Language Model (LLM) inference.

## Prerequisites
This tutorial was developed and tested using the following setup:

### 1. Hardware
- **AMD Radeon™ GPU**: Ensure you have an AMD Radeon™ GPU that supports ROCm. This tutorial was tested on the AMD Radeon PRO W7900.
### 2. Software
- **ROCm 6.2**: Install ROCm by following the official installation instructions for Radeon GPUs.
- **Python 3.8**: Ensure Python is installed and available in your environment.
### 3. Environment
Root or sudo access is required for software installations and configuration.


## Install and Launch Jupyter Notebooks

If Jupyter is not already installed on your system, you can install it and launch JupyterLab using the following commands:

```bash
pip install jupyter
jupyter-lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
```
**Note**: Ensure that port 8888 is not already in use on your system before running the above command. If it is, specify a different port by replacing `--port=8888` with another port number (e.g., `--port=8890`).

Once the command is executed, the terminal will display an output containing a URL and token. Copy and paste this URL into your web browser on the host machine to access JupyterLab. After launching JupyterLab, upload this notebook to the environment and continue following the steps in this tutorial.
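
The port check mentioned in the note above can be scripted. The following is a minimal sketch using only the Python standard library; the helper names `port_in_use` and `first_free_port` are hypothetical and not part of Jupyter. It probes the loopback address only, so a service bound exclusively to another network interface would not be detected.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


def first_free_port(start: int = 8888, attempts: int = 20) -> int:
    """Return the first port at or above `start` that nothing is listening on."""
    for port in range(start, start + attempts):
        if not port_in_use(port):
            return port
    raise RuntimeError(f"no free port found in {start}..{start + attempts - 1}")


if __name__ == "__main__":
    # Pass the printed value to jupyter-lab via --port=<value>.
    print(first_free_port(8888))
```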

## Install Ollama

Ollama supports AMD GPUs through ROCm out of the box. Use the following command to install Ollama on Linux:

In [None]:
!curl -fsSL https://ollama.com/install.sh | sh

**Note**: Ollama's official installation guide is available [here](https://github.com/ollama/ollama/blob/main/docs/linux.md).

Start Ollama and verify it is running:

In [None]:
!sudo systemctl start ollama
!sudo systemctl status ollama
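
Besides `systemctl status`, the server can be probed over HTTP from inside the notebook. The sketch below assumes Ollama is listening on its default address, `http://localhost:11434`; if you changed it (for example through the `OLLAMA_HOST` environment variable), adjust the URL. The helper name `ollama_is_up` is hypothetical.

```python
import urllib.error
import urllib.request

# Assumption: default Ollama address. Change it if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434"


def ollama_is_up(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET on the server root answers with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_up())
```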

## Download Models
Pull the required models for RAG:

**IMPORTANT**: If the Ollama server is running as a foreground process rather than as the systemd service started in the previous step (for example, via `ollama serve`), continue the rest of this notebook in a new instance.

**Note**: Alternative models are available to use [here](https://ollama.com/search).

In [None]:
!ollama pull nomic-embed-text
!ollama pull llama3.1:8b

Verify the downloaded models:

In [3]:
!ollama list llama3.1

NAME           ID              SIZE      MODIFIED     
llama3.1:8b    42182419e950    4.7 GB    2 months ago    


Refer to [Ollama's documentation](https://github.com/ollama/ollama) for more details.
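
The same verification can be done programmatically. This is a sketch under two assumptions: the server runs at the default `http://localhost:11434`, and its `/api/tags` endpoint returns a JSON object with a `models` list whose entries carry a `name` field. The helpers `fetch_tags` and `missing_models` are hypothetical names. Names without an explicit tag are compared as `:latest`, which is how `nomic-embed-text` is stored after a plain `ollama pull`.

```python
import json
import urllib.request
from typing import Dict, List


def fetch_tags(base_url: str = "http://localhost:11434", timeout: float = 5.0) -> Dict:
    """Fetch the list of locally available models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def _with_tag(name: str) -> str:
    """Append ':latest' when a model name carries no explicit tag."""
    return name if ":" in name else f"{name}:latest"


def missing_models(tags_payload: Dict, required: List[str]) -> List[str]:
    """Return the required model names that are absent from the payload."""
    installed = {_with_tag(m["name"]) for m in tags_payload.get("models", [])}
    return [name for name in required if _with_tag(name) not in installed]


# Hand-written sample payload for illustration, not real server output.
sample = {"models": [{"name": "nomic-embed-text:latest"}, {"name": "llama3.1:8b"}]}
print(missing_models(sample, ["nomic-embed-text", "llama3.1:8b"]))  # prints []
```

With a live server, `missing_models(fetch_tags(), ["nomic-embed-text", "llama3.1:8b"])` should return an empty list once both pulls have finished.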

## (Optional) Install PyTorch

For this tutorial, PyTorch is optional. PyTorch utilities are used in this section only for verification purposes.

In [None]:
!pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2

Check installed packages:

In [5]:
!pip list | grep torch

pytorch-triton-rocm                      3.1.0
torch                                    2.5.1+rocm6.2
torchaudio                               2.5.1+rocm6.2
torchvision                              0.20.1+rocm6.2


Verify GPU functionality:

In [6]:
import torch

# Query the GPU. ROCm builds of PyTorch expose AMD GPUs through the torch.cuda API.
if torch.cuda.is_available():
    device = torch.device("cuda")          # a CUDA device object
    print('Using GPU:', torch.cuda.get_device_name(0))
    print('GPU properties:', torch.cuda.get_device_properties(0))
else:
    device = torch.device("cpu")
    print('Using CPU')

## Install LlamaIndex and Dependencies
Run the following command to install LlamaIndex and related packages:

In [None]:
!pip install llama-index llama-index-llms-ollama llama-index-embeddings-ollama llama-index-vector-stores-chroma chromadb

Verify installations:

In [9]:
!pip list | grep llama-index

llama-index                              0.12.10
llama-index-agent-openai                 0.4.1
llama-index-cli                          0.4.0
llama-index-core                         0.12.10.post1
llama-index-embeddings-ollama            0.5.0
llama-index-embeddings-openai            0.3.1
llama-index-indices-managed-llama-cloud  0.6.3
llama-index-llms-ollama                  0.5.0
llama-index-llms-openai                  0.3.13
llama-index-multi-modal-llms-openai      0.4.2
llama-index-program-openai               0.3.1
llama-index-question-gen-openai          0.3.0
llama-index-readers-file                 0.4.3
llama-index-readers-llama-parse          0.4.0
llama-index-readers-web                  0.3.3
llama-index-vector-stores-chroma         0.4.1


## Build RAG Pipeline

### Set Up Indexing and Query Engine

Import necessary libraries:

In [10]:
import chromadb
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama

### Configure Embedding and LLM Models

LlamaIndex implements the Ollama client interface to interact with the Ollama service. Here we request both embedding and LLM services from Ollama.

In [11]:
# Set embedding model
emb_fn="nomic-embed-text"
Settings.embed_model = OllamaEmbedding(model_name=emb_fn)

# Set ollama model
Settings.llm = Ollama(model="llama3.1:8b", request_timeout=120.0)

### Use Data for RAG

Download a PDF (e.g., the ROCm documentation) and save it to the `./data` directory.

`SimpleDirectoryReader` is LlamaIndex's general-purpose data connector. Pass it an input directory or a list of files, and it selects a file reader for each file based on its extension.

In [12]:
documents = SimpleDirectoryReader(input_dir="./data/").load_data()

# Check the content of one loaded document.
# Index 10 raises IndexError if fewer than 11 documents were loaded.
print(documents[10])
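
Before calling the reader, it can help to confirm that `./data` actually contains the PDF. The following is a small standard-library sketch; `find_input_files` is a hypothetical helper, not a LlamaIndex function. It looks only at the top level of the directory, on the assumption that the reader is used without recursion as in the cell above.

```python
from pathlib import Path
from typing import List, Tuple


def find_input_files(input_dir: str = "./data",
                     suffixes: Tuple[str, ...] = (".pdf",)) -> List[Path]:
    """List top-level files in input_dir whose extension is in `suffixes`."""
    root = Path(input_dir)
    if not root.is_dir():
        return []
    # Compare extensions case-insensitively so 'guide.PDF' is also found.
    return sorted(p for p in root.iterdir()
                  if p.is_file() and p.suffix.lower() in suffixes)


files = find_input_files("./data")
print(f"{len(files)} PDF file(s) found:", [p.name for p in files])
```

An empty result means the download step was skipped or the file was saved somewhere else.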

### Create Vector Dataset with Chroma

[Chroma DB](https://www.trychroma.com/) is a database that stores and queries embeddings, documents, and metadata for LLM apps, and it integrates well with LlamaIndex. We use it to build the vector store from the PDF file.


In [14]:
# Initialize client and save data
db = chromadb.PersistentClient(path="./chroma_db/rocm_db")
# create collection
chroma_collection = db.get_or_create_collection("rocm_db")

# assign chroma as the vector_store to the context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

In [15]:
# Build vector index per-document
vector_index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    transformations=[SentenceSplitter(chunk_size=512, chunk_overlap=20)],
)
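
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sliding-window sketch. It is not how `SentenceSplitter` is implemented: the real splitter works with a tokenizer and tries to respect sentence boundaries, while this illustration just slices a list of whitespace-separated words. The function name `chunk_tokens` is hypothetical.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], chunk_size: int,
                 chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
        start += step
    return chunks


words = "a b c d e f g h i j".split()
print(chunk_tokens(words, chunk_size=4, chunk_overlap=1))
# [['a', 'b', 'c', 'd'], ['d', 'e', 'f', 'g'], ['g', 'h', 'i', 'j']]
```

In this simplified model, the settings used above (512 and 20) would advance each window by 492 tokens, so consecutive chunks share 20 tokens of context.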

### Create Query Engine

Next, create the query engine with a response mode. You can select the response mode based on your specific needs. For detailed guidance, refer to the [LlamaIndex documentation on response modes](https://docs.llamaindex.ai/en/v0.10.19/module_guides/deploying/query_engine/response_modes.html).


In [16]:
# Query your data
query_engine = vector_index.as_query_engine(response_mode="refine", similarity_top_k=10)
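
`similarity_top_k=10` asks the retriever for the ten stored chunks whose embeddings are closest to the query embedding. The sketch below illustrates that idea with a brute-force search in plain Python. Cosine similarity is an assumption made for illustration; the metric and the (approximate) search actually used depend on how the vector store is configured. The names `cosine` and `top_k` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], vectors: Dict[str, Sequence[float]],
          k: int) -> List[Tuple[str, float]]:
    """Return the k (id, score) pairs most similar to the query, best first."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings" for illustration only.
store = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print([doc_id for doc_id, _ in top_k([1.0, 0.0], store, k=2)])  # ['a', 'c']
```

A larger `similarity_top_k` gives the LLM more context to work with, at the cost of more LLM calls in `refine` mode, which processes the retrieved chunks one after another.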

## Customize Query Prompts

Define task-specific prompts:

In [17]:
# Updating Prompt for Q&A
from llama_index.core import PromptTemplate

template = (
    "You are a product expert on AMD ROCm and Radeon GPUs, very familiar with the ROCm documentation, and you provide guidance to the end user.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the information from multiple sources and not prior knowledge,\n"
    "answer the question according to the index dataset.\n"
    "If the question is not related to ROCm or Radeon GPUs, just say it is not related to my knowledge base.\n"
    "If you don't know the answer, just say that I don't know.\n"
    "Answers need to be precise and concise.\n"
    "If the question is in Chinese, translate it to English first.\n"
    "Query: {query_str}\n"
    "Answer: "
)
qa_template = PromptTemplate(template)
query_engine.update_prompts(
    {"response_synthesizer:text_qa_template": qa_template}
)

refine_template_str = (
    "The original query is as follows: {query_str}.\n"
    "We have provided an existing answer: {existing_answer}.\n"
    "We have the opportunity to refine the existing answer (only if needed) with some more context below.\n"
    "-------------\n"
    "{context_msg}\n"
    "-------------\n"
    "Given the new context, refine the original answer to better answer the query. If the context isn't useful, return the original answer.\n"
    "If the question is 'who are you', just say I am an expert on AMD ROCm.\n"
    "Answers need to be precise and concise.\n"
    "Refined Answer: "
)

refine_template = PromptTemplate(refine_template_str)

query_engine.update_prompts(
    {"response_synthesizer:refine_template": refine_template}
)
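
The two prompts play different roles in `refine` mode: the text QA prompt produces a first answer from the first retrieved chunk, and the refine prompt is then applied once per remaining chunk, carrying the previous answer forward. The sketch below mimics that control flow with a stand-in LLM. It is a conceptual illustration, not LlamaIndex's implementation; the shortened templates and the names `refine_answer` and `fake_llm` are hypothetical.

```python
from typing import Callable, List

# Shortened stand-ins for the two templates defined above.
TEXT_QA = "Context:\n{context_str}\nQuery: {query_str}\nAnswer: "
REFINE = (
    "Query: {query_str}\nExisting answer: {existing_answer}\n"
    "New context:\n{context_msg}\nRefined Answer: "
)


def refine_answer(query: str, chunks: List[str], llm: Callable[[str], str]) -> str:
    """Answer from the first chunk, then refine once per remaining chunk."""
    if not chunks:
        return ""
    answer = llm(TEXT_QA.format(context_str=chunks[0], query_str=query))
    for chunk in chunks[1:]:
        answer = llm(REFINE.format(query_str=query, existing_answer=answer,
                                   context_msg=chunk))
    return answer


prompts_seen: List[str] = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: records the prompt and returns a counter."""
    prompts_seen.append(prompt)
    return f"answer-{len(prompts_seen)}"


print(refine_answer("How do I install ROCm?", ["chunk 1", "chunk 2", "chunk 3"], fake_llm))
# answer-3 (one LLM call per chunk)
```

This is why `refine` mode with `similarity_top_k=10` can take noticeably longer than a single-call mode: under this model, each query triggers up to ten sequential LLM requests.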

## Query Examples

Run the following queries:

- Query 1: Briefly describe the steps to install ROCm.

In [None]:
response = query_engine.query("Briefly describe the steps to install ROCm.")
print(response)

- Query 2: Which chapter is about installing PyTorch?

In [None]:
response = query_engine.query("Which chapter is about installing PyTorch?")
print(response)

- Query 3: How to verify PyTorch Installation?

In [None]:
response = query_engine.query("How to verify PyTorch Installation?")
print(response)

- Query 4: Can ONNX run on a Radeon GPU?

In [None]:
response = query_engine.query("Can ONNX run on a Radeon GPU?")
print(response)
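
To see which passages an answer was based on, you can inspect the retrieved nodes attached to the response. This sketch assumes the response object exposes a `source_nodes` list whose items have a `score` and a `node` with `metadata` and `get_content()`, and that the PDF reader stored `file_name` and `page_label` in the metadata. Treat those key names as assumptions and check them against your own output. `summarize_sources` is a hypothetical helper.

```python
from typing import Any, Dict, List


def summarize_sources(response: Any, max_chars: int = 80) -> List[Dict[str, Any]]:
    """Collect score, origin, and a short text snippet for each retrieved node."""
    rows: List[Dict[str, Any]] = []
    for item in getattr(response, "source_nodes", None) or []:
        node = item.node
        # Collapse whitespace so the snippet fits on one line.
        snippet = " ".join(node.get_content().split())[:max_chars]
        rows.append({
            "score": item.score,
            "file_name": node.metadata.get("file_name"),    # assumed metadata key
            "page_label": node.metadata.get("page_label"),  # assumed metadata key
            "snippet": snippet,
        })
    return rows
```

After any of the queries above, `for row in summarize_sources(response): print(row)` lists the chunks that were passed to the LLM, which helps when judging whether an answer is grounded in the document.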

## Conclusion

This tutorial demonstrates how to construct a RAG pipeline using LlamaIndex and Ollama on AMD Radeon GPUs with ROCm. For further details, refer to the LlamaIndex, Ollama, and ROCm documentation linked above.