In this tutorial, we will build a Conversational AI System that can answer questions by retrieving the answers from a document.

This tutorial is divided into two parts:
- Part 1: Indexing and Storing the Data
    - Step 1: Install all the dependencies (Python and database)
    - Step 2: Set up the database, activate the required extensions, and create the tables
    - Step 3: Load the LLM and the embedding model
    - Step 4: Load the data, index it using the LLM and the embedding model, and store it in the database
- Part 2: Querying the indexed data using LLMs


# Indexing and Storing the Data

### Install The Python Dependencies

We will use the following Python libraries:
- `psycopg2`: a PostgreSQL database adapter for Python. We will use it to connect to the database and execute SQL queries.
- `sqlalchemy`: a SQL toolkit and Object Relational Mapper (ORM). We will use it only to parse the connection string into a URL object, so that we can read the host, port, user, and password from it.
- `llama-index`: LlamaIndex is a data framework for LLM applications. It provides a set of tools to help you manage your data and build your application. We will use it to index the data and to query the database using the LLM.
- `langchain`: LangChain, like LlamaIndex, provides a set of tools for building LLM-powered applications. Here we will use it only to create the embedding model.
- `torch`: PyTorch is an open-source machine learning library based on the Torch library. It runs the models, and we will use it to set the data type the LLM is loaded in.

Note: the imports in this tutorial (for example `from llama_index import ServiceContext`) follow the package layout of `llama-index` releases before 0.10. Newer releases moved these modules, so you may need to pin an older version.

In [None]:
# !conda install psycopg2
# !pip install sqlalchemy
# !pip install langchain
# !pip install llama-index
# !pip install torch
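
After installing, it can help to confirm that every library is importable before running the heavier cells. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is an illustration, not part of any of the libraries above. Note that import names can differ from package names (`llama-index` is imported as `llama_index`).

```python
from importlib.util import find_spec

# Import names, not pip package names (llama-index -> llama_index).
REQUIRED_MODULES = ["psycopg2", "sqlalchemy", "langchain", "llama_index", "torch"]


def missing_modules(modules):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


print("Missing:", missing_modules(REQUIRED_MODULES) or "none")
```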

### Install the PostgreSQL database

In this tutorial, we will use the PostgreSQL database. On Debian or Ubuntu, you can install it from the PostgreSQL apt repository by following the instructions below.

Create the repository configuration file:
- `sudo sh -c 'echo "deb https://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'`

Import the repository signing key:
- `wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -`

Update the package lists:
- `sudo apt-get update`

Install PostgreSQL 16.
If you want a different version, replace `postgresql-16` with, for example, `postgresql-12`, and change the pgvector package name in the next step to match:
- `sudo apt-get -y install postgresql-16`

Install the pgvector extension:
- `sudo apt-get -y install postgresql-16-pgvector`

Make sure the server is running by starting it with `systemctl`:
- `sudo systemctl start postgresql.service`

Once the server is running, you can connect to it using the `psql` command.

Switch to the `postgres` user:
- `sudo -i -u postgres`

Connect to the server:
- `psql`

Change the password:
- `ALTER USER postgres WITH PASSWORD 'test123';`

Enable the pgvector extension:
- `create extension vector;`

List the installed extensions:
- `\dx`
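
The same check can be done from Python once `psycopg2` is installed. This is a minimal sketch, not part of the original setup: the helper `has_extension` is a hypothetical name, and it queries the `pg_extension` system catalog, which lists the extensions installed in the database you are connected to.

```python
def has_extension(cursor, name="vector"):
    """Return True if the named extension is installed in the current database."""
    cursor.execute("SELECT 1 FROM pg_extension WHERE extname = %s", (name,))
    return cursor.fetchone() is not None


# Usage, assuming the server configured above is running:
# import psycopg2
# conn = psycopg2.connect("postgresql://postgres:test123@localhost:5432")
# with conn.cursor() as cur:
#     print(has_extension(cur))
```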

### Download the data

Now that we have installed the python libraries and the database, we can download the data. 

We will use the Memento movie script as our dataset, because why not? This movie definitely needs a Conversational AI System to answer questions about it and help us understand it fully.

You can download the dataset from [here](https://stephenfollows.com/resource-docs/scripts/memento.pdf).

Once you have downloaded the dataset, we'll move it to the `data` folder.

In [None]:
# !wget https://stephenfollows.com/resource-docs/scripts/memento.pdf

In [None]:
# !mkdir data
# !mv memento.pdf data
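
If `wget` is not available, the download and move can be done in Python instead. This is a minimal sketch using only the standard library; the helper names `destination_for` and `download` are illustrative assumptions.

```python
from pathlib import Path
from urllib.parse import urlsplit
from urllib.request import urlretrieve


def destination_for(url, data_dir="data"):
    """Build the local path for a URL: <data_dir>/<last path segment>."""
    return Path(data_dir) / Path(urlsplit(url).path).name


def download(url, data_dir="data"):
    """Download url into data_dir, skipping the request if the file exists."""
    target = destination_for(url, data_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        urlretrieve(url, target)
    return target


# download("https://stephenfollows.com/resource-docs/scripts/memento.pdf")
```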

### Let's Dive In

In this tutorial, we will build a Conversational AI System using `PHI-2`, a foundation model from Microsoft and a Small Language Model (SLM). We will also compare it with larger LLMs such as `Mistral-7B` and `Mistral-7B-Instruct`.

`PHI-2` is a Transformer model developed by Microsoft, with about 2.7 billion parameters. It is the successor to `PHI-1.5` and was trained on the same data sources, augmented with additional data consisting of various synthetic NLP texts and filtered websites. This additional data was included to address safety concerns and to add educational value.

Microsoft reports that, compared with other LLMs that have fewer than 13 billion parameters, `PHI-2` outperforms most of them on benchmarks testing common sense, language understanding, and logical reasoning.

Microsoft developed `PHI-2` not only with safety and educational value in mind, but also to explore the power of Small Language Models (SLMs). The main goal of SLMs is to deliver the capabilities of LLMs without requiring large computing resources, which makes them more sustainable.

You can read more about `PHI-2` [on the official Microsoft Research blog](https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/).

You can get the `PHI-2` model from the [official Microsoft page on HuggingFace](https://huggingface.co/microsoft/phi-2).

In [None]:
# model_name = 'mistralai/Mistral-7B-Instruct-v0.2'
# model_name = "mistralai/Mistral-7B-v0.1"
model_name = "microsoft/phi-2"

Apart from `PHI-2`, we will use BGE (BAAI General Embedding), an embedding model from the Beijing Academy of Artificial Intelligence. More specifically, we will use the `bge-large-en-v1.5` model to create the embeddings for our dataset and then use those embeddings to index the data.

Before we can start using this embedding model, we need to find its embedding dimension, because the dimension is required to create the table in the database. You can find the embedding dimension of the `bge-large-en-v1.5` model on the [official BAAI page on HuggingFace](https://huggingface.co/BAAI/bge-large-en-v1.5). It is 1024. I've also included the image below for your reference.

![Embedding Dimensions of bge-large-en-v1.5](asset/embedding_dim.png)

You can also find the embedding dimension of this model by doing a forward pass on a sample input and measuring the length of the result. You can do something like this: `len(embed_model.get_text_embedding("Hello world"))`

In [None]:
embedding_model_name = "BAAI/bge-large-en-v1.5"

Now that we have chosen the LLM (or should I say SLM) and the embedding model, we can start indexing the data.

For that, we first need to set the data path:

In [None]:
data_path = "./data/"

As already mentioned, we will use PostgreSQL as our database. We will use the pgvector extension to store the embeddings in the database.

In [None]:
connection_string = "postgresql://postgres:test123@localhost:5432"
db_name = "vector_db"
table_name = 'embeddings'
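
The connection string packs several settings into one URL. The sketch below splits it with the standard library's `urlsplit` purely for illustration; the tutorial itself uses `make_url` from `sqlalchemy` for this later on. The string names no database, so a client connects to the default one, which for PostgreSQL is the database named after the user.

```python
from urllib.parse import urlsplit

connection_string = "postgresql://postgres:test123@localhost:5432"
parts = urlsplit(connection_string)

print(parts.scheme)    # postgresql
print(parts.username)  # postgres
print(parts.password)  # test123
print(parts.hostname)  # localhost
print(parts.port)      # 5432
```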

In [None]:
from llama_index import SimpleDirectoryReader

# Read every file in the data folder into LlamaIndex Document objects
documents = SimpleDirectoryReader(data_path).load_data()

In [None]:
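# Chunking parameters used when splitting the documents (in tokens)
chunk_size = 1024
chunk_overlap = 32

# Generation limits for the LLM (in tokens)
context_window = 2048
max_new_tokens = 256

# Device to load the LLM on
device = "cuda"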
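
To see how these numbers interact, here is a simplified sketch of overlapping chunking and of the token budget. It is an illustration under stated assumptions, not LlamaIndex's actual splitter: the function `chunk_tokens` is hypothetical and slices a plain list of token ids with a sliding window.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


chunk_size, chunk_overlap = 1024, 32
context_window, max_new_tokens = 2048, 256

tokens = list(range(3000))  # stand-in for a tokenized document
chunks = chunk_tokens(tokens, chunk_size, chunk_overlap)
print([len(c) for c in chunks])  # [1024, 1024, 1016]

# Tokens left for the prompt once room for the answer is reserved
prompt_budget = context_window - max_new_tokens
print(prompt_budget)               # 1792
print(prompt_budget - chunk_size)  # 768 left for the question and template
```

With these settings, one full chunk fits in the prompt and still leaves 768 tokens for the question and the prompt template, while two full chunks would not fit.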

In [None]:
import torch
from langchain.embeddings.huggingface import HuggingFaceBgeEmbeddings
from llama_index import ServiceContext, set_global_service_context
from llama_index.embeddings import LangchainEmbedding
from llama_index.llms import HuggingFaceLLM

llm = HuggingFaceLLM(
    context_window=context_window,
    max_new_tokens=max_new_tokens,
    tokenizer_name=model_name,
    model_name=model_name,
    device_map=device,
    # load the weights in bfloat16 to reduce GPU memory usage
    model_kwargs={"torch_dtype": torch.bfloat16},
)

embed_model = LangchainEmbedding(
  HuggingFaceBgeEmbeddings(model_name=embedding_model_name)
)

embed_dim = len(embed_model.get_text_embedding("Hello world")) # 1024

service_context = ServiceContext.from_defaults(
    embed_model=embed_model,
    llm=llm,
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
)

set_global_service_context(service_context)

In [None]:
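import psycopg2

# Connect to the server's default database. CREATE/DROP DATABASE cannot run
# inside a transaction, so autocommit must be enabled.
conn = psycopg2.connect(connection_string)
conn.autocommit = True

# Start from a clean database every time this cell runs
with conn.cursor() as c:
    c.execute(f"DROP DATABASE IF EXISTS {db_name}")
    c.execute(f"CREATE DATABASE {db_name}")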
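
Database names cannot be passed as query parameters, which is why the cell above builds the statements with f-strings. That is fine for a constant, but if the name ever comes from configuration or user input, it is worth validating first. The sketch below uses a hypothetical helper, `safe_identifier`; `psycopg2` also offers `psycopg2.sql.Identifier` for quoting identifiers, which may be preferable in real code.

```python
import re


def safe_identifier(name):
    """Return name unchanged if it is a plain SQL identifier, else raise ValueError."""
    # Letters, digits, and underscores only; at most 63 characters in total.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]{0,62}", name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


db_name = "vector_db"
statement = f"CREATE DATABASE {safe_identifier(db_name)}"
print(statement)  # CREATE DATABASE vector_db
```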

In [None]:
from sqlalchemy import make_url
from llama_index import StorageContext
from llama_index.indices.vector_store import VectorStoreIndex
from llama_index.vector_stores import PGVectorStore

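# Split the connection string into its parts (host, port, user, password)
url = make_url(connection_string)

# Vector store backed by PostgreSQL and pgvector
vector_store = PGVectorStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name=table_name,
    embed_dim=embed_dim,
)

# Chunk the documents, embed each chunk, and store the vectors in the database
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, show_progress=True
)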
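
Once the chunks are stored as vectors, answering a question starts with finding the chunks whose embeddings are closest to the question's embedding. The sketch below illustrates that idea in plain Python with cosine similarity on toy 3-dimensional vectors. It is a conceptual illustration only: in this tutorial the search runs inside PostgreSQL through pgvector, the real vectors have 1024 dimensions, and the names `cosine_similarity`, `top_k`, and the sample chunk labels are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, stored, k=2):
    """Return the k (text, score) pairs most similar to the query vector."""
    scored = [(text, cosine_similarity(query, vec)) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy stand-ins for embedded chunks of the script
stored = [
    ("chunk about Leonard", [1.0, 0.0, 0.0]),
    ("chunk about Teddy", [0.0, 1.0, 0.0]),
    ("chunk about Natalie", [0.7, 0.7, 0.0]),
]

for text, score in top_k([0.9, 0.1, 0.0], stored, k=2):
    print(f"{score:.3f}  {text}")
```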
