In this tutorial, we will build a Conversational AI System that can answer questions by retrieving the answers from a document.

This whole tutorial is divided into 2 parts:
- Part 1: Indexing and Storing the Data:
    - Step 1: Install all the dependencies (Python and Database)
        - Python Dependencies
        - Database Dependencies
        - PGVector Extension
    - Step 2: Load the LLMs and Embeddings Model
        - LLMs
        - Embeddings Model
    - Step 3: Setup the database
    - Step 4: Index the documents
        - Download the data
        - Load the data
        - Index and store the embeddings
- Part 2: Querying the indexed data using LLMs
    - Step 1: Load the LLMs and Embeddings Model
        - LLMs
        - Embeddings Model
    - Step 2: Load the index from the database
    - Step 3: Setup the query engine
    - Step 4: RAG pipeline in action
    - Step 5: Using gradio to build a UI

In this notebook, we will only cover the first part. The second part is covered in the notebook `2. Querying the indexed data using LLMs.ipynb`.

# Indexing and Storing the Data

## Step 1: Install all the dependencies (Python and Database)

### Install The Python Dependencies

We will use the following Python libraries:
- `psycopg2`: PostgreSQL database adapter for Python. We will use it to connect to the database and execute SQL queries.
- `sqlalchemy`: SQL toolkit and Object Relational Mapper (ORM) that gives application developers the full power and flexibility of SQL. We will use it to parse the connection string into a URL object whose parts can be used to connect to the database.
- `llama-index`: Llama Index is a data framework for your LLM application. It provides a set of tools to help you manage your data and easily build your LLM application. We will use it to index the data and query the database using the LLMs.
- `langchain`: Langchain is similar to Llama Index and provides a set of tools for building LLM-powered applications. We will use it only to create the embeddings model.
- `torch`: PyTorch is an open source machine learning library based on the Torch library. We will use it to set the data type of the model weights when loading the LLM.

In [1]:
# !conda install psycopg2
# !pip install sqlalchemy
# !pip install langchain
# !pip install llama-index
# !pip install torch
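
The import paths used later in this notebook (for example `from llama_index.llms import HuggingFaceLLM` and `ServiceContext`) appear to match the package layout of `llama-index` releases before 0.10, so it can be worth checking what is installed before continuing. The sketch below is one way to do that with the standard library; the `installed_versions` helper is a hypothetical name, not part of any of these libraries.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as installed by the commands above.
# Note: psycopg2 may be published as "psycopg2-binary" depending on how it was installed.
PACKAGES = ["psycopg2", "sqlalchemy", "langchain", "llama-index", "torch"]


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'not installed'}")
```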

### Install the PostgreSQL database

In this tutorial, we will use the PostgreSQL database. You can download and install by following the instructions below:


Create the file repository configuration:
- `sudo sh -c 'echo "deb https://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'`

Import the repository signing key:
- `wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -`

Update the package lists:
- `sudo apt-get update`

Install PostgreSQL 16. To install a different version, change the version number in the package name (for example, `postgresql-12`):
- `sudo apt-get -y install postgresql-16`

Install the pgvector extension (the version number in the package name must match your PostgreSQL version):
- `sudo apt-get -y install postgresql-16-pgvector`

Ensure that the server is running using the systemctl start command:
- `sudo systemctl start postgresql.service`

### Activate the PGVector extension

Once the server is running, you can connect to it using the psql command:

Switch to the postgres user:
- `sudo -i -u postgres`

Connect to the server:
- `psql`

Change the password:
- `ALTER USER postgres WITH PASSWORD 'test123';`

Enable the pgvector extension:
- `create extension vector;`

List the installed extensions:
- `\dx`

## Step 2: Load the LLMs and Embeddings Model

### PHI-2 as an LLM (or SLM?)

In this tutorial, we will build a Conversational AI System using Microsoft's foundation model `PHI-2`, a Small Language Model (SLM). We will also compare it with other LLMs, such as the `Mistral-7B` and `Mistral-7B-Instruct` models.

`PHI-2` is a transformer model developed by Microsoft with about 2.7 billion parameters. It is the successor of the `PHI-1.5` model and is trained on the same dataset, augmented with additional data sources that include various synthetic NLP texts and filtered websites. This augmented dataset was introduced to address safety concerns and to add educational value.

![Safety Score of PHI-2 (Taken from Official blog of Microsoft)](asset/figure3_safety_scores-2048x995.png)
Safety scores computed on 13 demographics from ToxiGen. A higher score indicates the model is less likely to produce toxic sentences compared to benign ones. (Taken from Official blog of Microsoft)

In a comparison between `PHI-2` and other LLMs with fewer than 13 billion parameters, `PHI-2` outperformed most of them on benchmarks testing common sense, language understanding, and logical reasoning.

![Comparison (Taken from Official blog of Microsoft)](asset/figure2_phi_comp-2048x474.png)
Comparison between Phi-2 (2.7B) and Phi-1.5 (1.3B) models (Taken from Official blog of Microsoft)

![Comparison (Taken from Official blog of Microsoft)](asset/phi2-avg-performance.png)

Comparison between Phi-2 and other SLMs (Taken from Official blog of Microsoft)

Microsoft developed `PHI-2` not only with safety concerns and educational value in mind, but also to explore the power of Small Language Models (SLMs). The main goal of SLMs is to deliver the capabilities of LLMs without requiring large computing resources, which makes them more sustainable.

You can read more about `PHI-2` [on the official page of Microsoft](https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/).

You can get the `PHI-2` model from the [official page of Microsoft on HuggingFace](https://huggingface.co/microsoft/phi-2).

In [2]:
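# Pick the model to load. The walkthrough below describes PHI-2 ("microsoft/phi-2");
# the two Mistral models are kept for comparison.
model_name = 'mistralai/Mistral-7B-Instruct-v0.2'
# model_name = "mistralai/Mistral-7B-v0.1"
# model_name = "microsoft/phi-2"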

Here we will be using `HuggingFaceLLM` provided by the `llama-index` library to create our PHI-2 SLM. This class is a wrapper around the HuggingFace Transformers library. Read more about it [here](https://docs.llamaindex.ai/en/stable/api_reference/llms/huggingface.html).

Here we need to define some of the parameters to be used by the `HuggingFaceLLM` class. 

The `context_window` parameter specifies the maximum number of tokens the model can use as context. The maximum context size allowed by `PHI-2` is 2048 tokens.

The `max_new_tokens` parameter specifies the maximum number of new tokens to generate. Here we limit it to 256, but you can increase this number if you want longer answers.
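
These two limits interact: the prompt (instructions, retrieved context, and question) and the generated answer have to share the same context window. The following is a simplified sketch of that arithmetic, assuming nothing else is reserved; `prompt_token_budget` is a hypothetical helper, and a real framework may set aside additional tokens.

```python
def prompt_token_budget(context_window: int, max_new_tokens: int) -> int:
    """Tokens left for the prompt once room for the generated answer is reserved."""
    if max_new_tokens >= context_window:
        raise ValueError("max_new_tokens must be smaller than context_window")
    return context_window - max_new_tokens


# Values used in this tutorial
budget = prompt_token_budget(context_window=2048, max_new_tokens=256)
print(budget)  # 1792
```

Raising `max_new_tokens` for longer answers therefore leaves less room for retrieved context.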

The `tokenizer_name` and `model_name` parameters specify the name of the tokenizer and model to use. We have already defined the model name above. Our main model is `PHI-2`, but later we will also show the comparison with other models such as `Mistral-7B` and `Mistral-7B-Instruct`.

The `device_map` parameter specifies the device on which to load the model. Our system has a GPU, so we use `cuda`. You can set it to `cpu` if you do not have a GPU, but this will significantly increase the total runtime. If you have no access to a GPU, you can use a quantized version of the model, such as the [GGUF build of `PHI-2` on HuggingFace](https://huggingface.co/TheBloke/phi-2-GGUF). Keep in mind that a quantized model is not as accurate as the original, and that it must be loaded differently. Loading the quantized version is outside the scope of this tutorial.
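
If you are not sure whether a GPU is visible to PyTorch, a small check like the one below can choose the device for you. This is only a sketch: `pick_device` is a hypothetical helper, and it falls back to `cpu` when PyTorch is not installed at all.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU backend to use
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```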

The `model_kwargs` parameter specifies any additional keyword arguments to pass to the model. Here we set the data type of the model weights to `bfloat16` to reduce memory usage.
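
To see why a half-precision data type helps, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. It ignores activations, the KV cache, and framework overhead, so treat the numbers as a lower bound; `weight_memory_gb` is a hypothetical helper.

```python
# Bytes needed to store one parameter in each data type
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


phi2_params = 2.7e9  # parameter count quoted earlier in this tutorial
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(phi2_params, dtype):.1f} GB")
```

Under these assumptions, a 16-bit data type needs about half the memory of `float32`.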

There are other additional parameters that can be used with the `HuggingFaceLLM` class. You can read more about them [here](https://docs.llamaindex.ai/en/stable/api_reference/llms/huggingface.html).

Defining an LLM is not compulsory while indexing the data; you can pass `llm=None` to turn the LLM off. There are, however, several use cases where an LLM is useful during indexing. For example, if you want to store a summary of each document in the index alongside the data, you can use an LLM to generate the summaries. This more advanced use case helps the RAG pipeline generate better answers. We will not cover it in this tutorial, but you can read more about it [here](https://docs.llamaindex.ai/en/stable/examples/index_structs/doc_summary/DocSummary.html).

In [3]:
import torch
from llama_index.llms import HuggingFaceLLM

# Context Window specifies how many tokens to use as context for the LLM
context_window = 2048
# Max New Tokens specifies how many new tokens to generate for the LLM
max_new_tokens = 256
# Device specifies which device to use for the LLM
device = "cuda"

# Create the LLM using the HuggingFaceLLM class
llm = HuggingFaceLLM(
    context_window=context_window,
    max_new_tokens=max_new_tokens,
    tokenizer_name=model_name,
    model_name=model_name,
    device_map=device,
    # Load the weights in bfloat16 to reduce memory usage
    model_kwargs={"torch_dtype": torch.bfloat16},
)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

### BGE as an Embeddings Model

Apart from `PHI-2`, we will use BGE (BAAI General Embedding), an embedding model from BAAI (Beijing Academy of Artificial Intelligence). More specifically, we will use the `bge-large-en-v1.5` model to create the embeddings for our dataset and then use the embeddings to index the data.

Before we can start using this embedding model, we need to find its embedding dimension, because it is required to create the table in the database. You can find the embedding dimension of the `bge-large-en-v1.5` model on the [official BAAI page on HuggingFace](https://huggingface.co/BAAI/bge-large-en-v1.5). It is 1024. I've also included the image below for your reference.

![Embedding Dimensions of bge-large-en-v1.5](asset/embedding_dim.png)

You can also experiment with the other embedding models listed in the above image.

You can also find the embedding dimension of this model by doing a forward pass on a random input. You can do something like this: `len(embed_model.get_text_embedding("Hello world"))`

In [4]:
embedding_model_name = "BAAI/bge-large-en-v1.5"

Here we will use the `LangchainEmbedding` class provided by the `llama-index` library to wrap a `langchain` embeddings model as our BGE embedding model. Llama-Index offers several ways to create embedding models. You can read more about them [here](https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings.html#list-of-supported-embeddings).

We first load the `bge-large-en-v1.5` model using the `HuggingFaceBgeEmbeddings` class from `langchain` and then wrap it in a `LangchainEmbedding` model.

We also require the embedding dimension of the model. As discussed in the section above, we do a forward pass on a sample input and take the length of the output vector.

In [5]:
from langchain.embeddings.huggingface import HuggingFaceBgeEmbeddings
from llama_index.embeddings import LangchainEmbedding

# Create the embedding model using the HuggingFaceBgeEmbeddings class
embed_model = LangchainEmbedding(
  HuggingFaceBgeEmbeddings(model_name=embedding_model_name)
)

# Get the embedding dimension of the model by doing a forward pass with a dummy input
embed_dim = len(embed_model.get_text_embedding("Hello world")) # 1024
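
Because the database table is created with a fixed vector size, a mismatch between the expected and the actual embedding dimension only surfaces later, at insert time. A small guard like the following can catch it early. This is a sketch: `check_embed_dim` and `fake_embed` are hypothetical names, and `fake_embed` is a stand-in so the example runs without downloading a model.

```python
from typing import Callable, Sequence


def check_embed_dim(embed_fn: Callable[[str], Sequence[float]], expected_dim: int) -> int:
    """Embed a probe string and confirm the vector has the expected length."""
    dim = len(embed_fn("Hello world"))
    if dim != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {dim}")
    return dim


# Stand-in embedder that returns a zero vector of the size documented for bge-large-en-v1.5.
# With the real model, pass embed_model.get_text_embedding instead.
def fake_embed(text: str) -> list:
    return [0.0] * 1024


print(check_embed_dim(fake_embed, expected_dim=1024))  # 1024
```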

## Step 3: Setup the database

As already mentioned, we will use PostgreSQL as our database. We will use the pgvector extension to store the embeddings in the database.

Before we can start indexing the data, we need to connect to the database. We will use the `psycopg2` library to create the connection.

To create the connection, we first need a connection string. Its format is `postgresql://{username}:{password}@{host}:{port}`.

We will also define the name of the database and the table to be used to store the indexed data.
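
If your password contains characters such as `@` or `:`, writing the connection string by hand can produce an ambiguous URL. The sketch below assembles the string from its parts and percent-encodes the credentials, using only the standard library; `build_connection_string` is a hypothetical helper, and the values are the ones used in this tutorial.

```python
from urllib.parse import quote, urlsplit


def build_connection_string(user: str, password: str, host: str, port: int) -> str:
    """Assemble postgresql://{username}:{password}@{host}:{port}."""
    # Percent-encode the credentials so characters such as "@" or ":" in a
    # password cannot be mistaken for URL delimiters.
    return f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


conn_str = build_connection_string("postgres", "test123", "localhost", 5432)
print(conn_str)  # postgresql://postgres:test123@localhost:5432

# Splitting the string again recovers the individual parts
parts = urlsplit(conn_str)
print(parts.username, parts.hostname, parts.port)  # postgres localhost 5432
```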

In [6]:
connection_string = "postgresql://postgres:test123@localhost:5432"
db_name = "ragdb"
table_name = 'embeddings'

In [7]:
import psycopg2

# Connect to the database
conn = psycopg2.connect(connection_string)
# Set autocommit to True to avoid having to commit after every command
conn.autocommit = True

# Create the database
# If it already exists, then delete it and create a new one
with conn.cursor() as c:
    c.execute(f"DROP DATABASE IF EXISTS {db_name}")
    c.execute(f"CREATE DATABASE {db_name}")
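
The cell above interpolates `db_name` directly into the SQL text, because a database name is an identifier and cannot be passed as a bound query parameter. That is fine for a hard-coded name, but if the name ever comes from user input it should be checked first. Below is a minimal sketch of such a check using a deliberately conservative pattern; `safe_identifier` is a hypothetical helper. `psycopg2` also ships a `sql` module with an `Identifier` class for composing such statements, which may be the better choice in real code.

```python
import re

# Conservative subset of valid PostgreSQL identifiers: a letter or underscore,
# followed by letters, digits, or underscores.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def safe_identifier(name: str) -> str:
    """Return name unchanged if it is a plain identifier, otherwise raise ValueError."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


db_name = safe_identifier("ragdb")
print(f"CREATE DATABASE {db_name}")  # CREATE DATABASE ragdb
```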

## Step 4: Index the documents

In this step we download and load the data, index it using the LLM and the Embeddings Model, and finally store it in the database.

### Download the data

Now that we have installed the Python libraries and the database, set up our LLM and Embeddings Model, and created the database, we can download the data.

We will use the Memento movie script as our dataset, because why not? This movie definitely needs a Conversational AI System to answer questions about it and understand it fully.

You can download the dataset from [here](https://stephenfollows.com/resource-docs/scripts/memento.pdf).

Once you have downloaded the dataset, move it to the `data` folder.

In [8]:
# !wget https://stephenfollows.com/resource-docs/scripts/memento.pdf
# !mkdir data
# !mv memento.pdf data

In [9]:
data_path = "./data/"
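
Before loading, it can help to confirm that the PDF actually landed in the data folder. The following sketch lists the matching files with the standard library; `list_data_files` is a hypothetical helper, and it returns an empty list when the folder does not exist.

```python
from pathlib import Path


def list_data_files(data_path: str, suffix: str = ".pdf") -> list:
    """Return the sorted file names directly under data_path that end with suffix."""
    folder = Path(data_path)
    if not folder.is_dir():
        return []
    return sorted(
        p.name for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == suffix
    )


print(list_data_files("./data/"))
```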

### Load the data

Llama-Index offers a wide variety of ways to load the data. The most common and easiest way is to use the `SimpleDirectoryReader` class. You can read more about it and its arguments [here](https://docs.llamaindex.ai/en/stable/examples/data_connectors/simple_directory_reader.html).

For the sake of simplicity, we will use the default arguments of the `SimpleDirectoryReader` class.

In [10]:
from llama_index import SimpleDirectoryReader

# Load the documents from the data path
documents = SimpleDirectoryReader(data_path).load_data()

### Index and store the embeddings

Now that we have almost everything (LLM, Embeddings Model, Database, and Data), we can finally index the data and store it in the database.

Storing the indexed data in the database is a simple yet crucial step.

Llama-Index provides an easy way to store the indexed data on the file system through the `persist` method, `index.storage_context.persist(persist_dir="<persist_dir>")`, but this method is not suitable for most production use cases. We want to store the indexed data in the database instead, so that we can easily reuse it and query it using LLMs.

#### Setting up the Service context

Llama-Index by default uses the `OpenAI` models to index the data. But we want to use our own local LLMs and Embeddings Model to index the data. 

We can do this by setting up the `ServiceContext` of the index. 

While setting up the `ServiceContext` we need to specify the LLMs and Embeddings Model to be used. We also need to specify the additional parameters that control the behavior of the indexing process.

The `chunk_size` parameter specifies the size of the chunks created from the data. Each chunk is then passed to the Embeddings Model to generate its embedding.

The `chunk_overlap` parameter specifies the number of tokens shared between consecutive chunks. The overlap reduces the loss of information at chunk boundaries.
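
To make the effect of these two parameters concrete, here is a simplified sliding-window sketch over a list of tokens. It is only an illustration of the idea: the splitter used by Llama-Index also takes sentence boundaries into account, and `chunk_tokens` is a hypothetical helper rather than a library function.

```python
def chunk_tokens(tokens: list, chunk_size: int, chunk_overlap: int) -> list:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


# Ten "tokens", windows of four, one token shared between neighbours
print(chunk_tokens(list(range(10)), chunk_size=4, chunk_overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Note how the last token of each chunk reappears as the first token of the next one.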

We can also set this `ServiceContext` globally using the `set_global_service_context` function provided by the `llama_index` library. You can read more about it [here](https://docs.llamaindex.ai/en/stable/module_guides/supporting_modules/service_context.html#setting-global-configuration).

In [11]:
from llama_index import ServiceContext
from llama_index import set_global_service_context

# Set the chunk size and overlap that controls how the documents are chunked
chunk_size = 1024
chunk_overlap = 32

# Create the service context
service_context = ServiceContext.from_defaults(
    embed_model=embed_model,
    llm=llm,
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
)

# Set the global service context
set_global_service_context(service_context)

#### Indexing the data and storing it in the database

We start with parsing the connection string using the `sqlalchemy` library. This is done so as to create a URL object for easier interaction with the database.

Using this URL object, we create the `vector_store` object with the `PGVectorStore` class provided by the `llama_index` library. It stores the embeddings, high-dimensional vectors representing the documents, in the PostgreSQL database. This step is crucial because the embeddings must be in the database before we can query, apply filters, or do a hybrid search on the indexed data using LLMs.

In [12]:
from sqlalchemy import make_url
from llama_index.vector_stores import PGVectorStore

# Creates a URL object from the connection string
url = make_url(connection_string)

# Create the vector store
vector_store = PGVectorStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name=table_name,
    embed_dim=embed_dim,
)

Next we configure the `StorageContext`, similar to the `ServiceContext` but for the storage of the indexed data. It manages how data is stored and retrieved within LlamaIndex. We specify that the `PGVectorStore` should be used to store the embeddings and related data in the database.

Finally, we create the `index` using the `VectorStoreIndex` class provided by the `llama_index` library. This class indexes the data according to the specification provided by the `ServiceContext` and `StorageContext`. The progress bars show how each page is indexed. In the background, the documents are first split into smaller chunks and then passed to the Embeddings Model to generate the embeddings. These embeddings are then stored in the database automatically.

In [13]:
from llama_index import StorageContext
from llama_index.indices.vector_store import VectorStoreIndex

# Create the storage context to be used while indexing and storing the vectors
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create the index
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, show_progress=True
)

Parsing nodes:   0%|          | 0/156 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/156 [00:00<?, ?it/s]
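
Once indexing finishes, one way to confirm that the embeddings reached the database is to count the rows in the table. The sketch below works with any DB-API 2.0 connection and is demonstrated on an in-memory SQLite table so that it runs without a PostgreSQL server. Two assumptions to verify on your side: that `PGVectorStore` prefixes the configured table name with `data_` (check with `\dt` in `psql`), and that you open a separate connection to the `ragdb` database, since the `conn` object created earlier is connected to the default database. `count_rows` is a hypothetical helper.

```python
import sqlite3


def count_rows(connection, table: str) -> int:
    """Count the rows of table through any DB-API 2.0 connection."""
    # The table name is interpolated into the SQL text, so accept plain names only
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unsafe table name: {table!r}")
    cursor = connection.cursor()
    try:
        cursor.execute(f"SELECT COUNT(*) FROM {table}")
        return cursor.fetchone()[0]
    finally:
        cursor.close()


# Stand-in demonstration with SQLite. With PostgreSQL you would instead pass
# a psycopg2 connection opened with f"{connection_string}/{db_name}".
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE data_embeddings (id INTEGER PRIMARY KEY, text TEXT)")
demo.executemany(
    "INSERT INTO data_embeddings (text) VALUES (?)",
    [("chunk 1",), ("chunk 2",)],
)
print(count_rows(demo, "data_embeddings"))  # 2
```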

In [None]:
# We finally close the connection to the database
conn.close()