## Example of querying a set of documents with sources using Caikit-TGIS with Llama2, LangChain and a custom prompt

### Set the inference server URL (replace with your own address) and the model ID

In [1]:
inference_server_url = "your_server_address:port"
model_id = "Llama-2-7b-chat-hf"

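# Example values for an in-cluster deployment. These override the placeholders above;
# remove them or replace them with your own.
inference_server_url = "caikit-tgis-example-isvc-predictor.kserve-demo.svc.cluster.local:80"
model_id = 'flan-t5-small-caikit'

# Connection string for the PostgreSQL + pgvector database that holds the embeddings
CONNECTION_STRING = "postgresql+psycopg://vectordb:vectordb@postgresql.pgvector-demo.svc.cluster.local:5432/vectordb"
# Name of the PGVector collection to query (placeholder: replace with the collection that holds your documents)
COLLECTION_NAME = "your_collection_name"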

# If your endpoint uses a self-signed certificate, export the certificate chain as a .pem file and provide its path
# Example: certificate_chain_file = "certificate.pem"
# Pass this parameter to the llm instantiation only if your endpoint needs it
certificate_chain_file = "certificate.pem"
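
The comment above says the `llm` instantiation should use the certificate parameter only when it is needed. A minimal sketch of one way to do that is to build the keyword arguments conditionally. The helper name `build_llm_kwargs` is hypothetical, and the sketch assumes the client accepts exactly the keyword arguments used later in this notebook (`inference_server_url`, `model_id`, `certificate_chain`, `streaming`).

```python
import os


def build_llm_kwargs(inference_server_url, model_id, certificate_chain_file=None):
    """Return keyword arguments for the LLM client.

    The certificate path is included only if a path was given and the file exists.
    (Hypothetical helper, not part of caikit_tgis_langchain.)
    """
    kwargs = {
        "inference_server_url": inference_server_url,
        "model_id": model_id,
        "streaming": True,
    }
    if certificate_chain_file and os.path.isfile(certificate_chain_file):
        kwargs["certificate_chain"] = certificate_chain_file
    return kwargs


# Placeholder values, matching the ones at the top of this cell
llm_kwargs = build_llm_kwargs("your_server_address:port", "Llama-2-7b-chat-hf", "certificate.pem")
```

The resulting dictionary could then be unpacked into the constructor, for example `caikit_tgis_langchain.CaikitLLM(**llm_kwargs)`.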

In [2]:
# Optional: install the requirements if they are not already present
# !pip -q install grpcio grpcio-reflection

### Load some data from the folder where we have stored the PDF documentation

In [3]:
from langchain.document_loaders import PyPDFDirectoryLoader

# pdf_folder_path = 'rhods-doc'

# loader = PyPDFDirectoryLoader(pdf_folder_path)
# docs = loader.load()

### Split the data into chunks large enough to give meaningful answers, with some overlap so that nothing is missed at chunk boundaries

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1024, chunk_overlap = 40)
# all_splits = text_splitter.split_documents(docs)
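
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences); the function name `chunk_text` is hypothetical and only shows the overlap idea.

```python
def chunk_text(text, chunk_size=1024, chunk_overlap=40):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
chunks = chunk_text(sample, chunk_size=12, chunk_overlap=4)
for c in chunks:
    print(repr(c))
```

Each chunk starts with the last `chunk_overlap` characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.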

### Connect to the vector database holding the embeddings (PGVector, with Chroma as a commented-out alternative)

In [5]:
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
# from langchain.vectorstores import Chroma

from langchain.vectorstores.pgvector import PGVector

embeddings = HuggingFaceEmbeddings()
vectorstore = PGVector(
    connection_string=CONNECTION_STRING,
    collection_name=COLLECTION_NAME,
    embedding_function=embeddings)

# vectorstore = Chroma.from_documents(documents=all_splits, embedding=HuggingFaceEmbeddings())

Unable to connect optimized C data functions [No module named '_testbuffer'], falling back to pure Python


### Test data retrieval

In [6]:
# question = "How do I create a Data Science Project"
# docs = vectorstore.similarity_search(question)
# docs

[Document(page_content='-\n, and must start and end with an alphanumeric\ncharacter.\n5\n. \nEnter a \ndescription\n for your data science project.\n6\n. \nClick \nCreate\n.\nA project details page opens. From here, you can create workbenches, add cluster storage, and\nadd data connections to your project.\nVerification\nThe data science project that you created is displayed on the \nData science projects\n page.\nCHAPTER 4. CREATING A DATA SCIENCE PROJECT\n9', metadata={'source': 'rhods-doc/red_hat_openshift_data_science_self-managed-1.32-getting_started_with_red_hat_openshift_data_science_self-managed-en-us.pdf', 'page': 12}),
 Document(page_content='-\n, and must start and end with an alphanumeric\ncharacter.\n5\n. \nEnter a \ndescription\n for your data science project.\n6\n. \nClick \nCreate\n.\nA project details page opens. From here, you can create workbenches, add cluster storage, and\nCHAPTER 3. WORKING ON DATA SCIENCE PROJECTS\n11', metadata={'source': 'rhods-doc/red_hat_open
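
Conceptually, a similarity search embeds the question as a vector and returns the stored chunks whose vectors are closest to it. The sketch below shows that ranking step with cosine similarity on hand-written toy vectors. The vectors, document names, and the `top_k` helper are all illustrative assumptions; the actual distance metric and indexing depend on how the vector store is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the names of the k documents most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Toy 3-dimensional "embeddings"; real embeddings have hundreds of dimensions
toy_vectors = {
    "create-project": [0.9, 0.1, 0.0],
    "cluster-storage": [0.1, 0.9, 0.0],
    "model-serving": [0.0, 0.2, 0.9],
}
ranked = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
print(ranked)
```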

### Create the chain

In [7]:
from langchain.chains import RetrievalQA
import caikit_tgis_langchain
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.prompts import PromptTemplate



# NOTE: This template syntax is specific to Llama2
template="""<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant.
You will be given a question you need to answer, and a context to provide you with information. You must answer the question based as much as possible on this context.
Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

Question: {question}
Context: {context} [/INST]
"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

llm = caikit_tgis_langchain.CaikitLLM(
    inference_server_url=inference_server_url,
    model_id=model_id,
    certificate_chain=certificate_chain_file,
    streaming=True
)

qa_chain = RetrievalQA.from_chain_type(llm,
                                       retriever=vectorstore.as_retriever(),
                                       chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
                                       return_source_documents=True)
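
To see what the model actually receives, the sketch below fills a shortened version of the template by hand with plain `str.format`. It assumes the chain concatenates the retrieved chunks into the `{context}` placeholder separated by blank lines; the chunk texts and the shortened system prompt are illustrative, not output from the real chain.

```python
# Shortened stand-in for the Llama2 template defined above
template = """<s>[INST] <<SYS>>
You are a helpful assistant. Answer based on the context.
<</SYS>>

Question: {question}
Context: {context} [/INST]
"""

# Illustrative chunks standing in for the retriever's results
retrieved_chunks = ["Click Create.", "A project details page opens."]

# Assumption: retrieved chunks are joined with a blank line before being inserted
prompt = template.format(
    question="How do I create a Data Science Project?",
    context="\n\n".join(retrieved_chunks),
)
print(prompt)
```

Printing the filled prompt like this is a quick way to check that the `[INST]` and `<<SYS>>` markers end up where the model expects them.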

### Launch the query

In [8]:
question = "How do I create a Data Science Project?"
result = qa_chain({"query": question, "min_new_tokens": 100, "max_new_tokens": 1024}, callbacks=[StreamingStdOutCallbackHandler()])

To create a data science project in Red Hat OpenShift Data Science, follow these steps:
1. Log in to your OpenShift Data Science account.
2. From the OpenShift Data Science dashboard, click on "Data Science Projects."
3. Click on "Create Data Science Project."
4. Enter a name for your data science project. The name must consist of lowercase alphanumeric characters, and it must start and end with an alphanumeric character.
5. Optional: Edit the resource name for your data science project.
6. Click "Create Data Science Project" to create your project.
Once your project is created, you can add workbenches, add cluster storage, add data connections, and add model servers to enhance the capabilities of your project.
Remember to always follow the prerequisites and procedure steps provided in the context to ensure that your project is created successfully.

### Print the sources

In [9]:
def remove_duplicates(input_list):
    """Return the distinct 'source' values of the documents, in order of first appearance."""
    unique_list = []
    for item in input_list:
        if item.metadata['source'] not in unique_list:
            unique_list.append(item.metadata['source'])
    return unique_list

results = remove_duplicates(result['source_documents'])

for s in results:
    print(s)

rhods-doc/red_hat_openshift_data_science_self-managed-1.32-getting_started_with_red_hat_openshift_data_science_self-managed-en-us.pdf
rhods-doc/red_hat_openshift_data_science_self-managed-1.32-working_on_data_science_projects-en-us.pdf
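
The loop in `remove_duplicates` does a list membership test for every document, which is fine for a handful of sources. As an alternative sketch, `dict.fromkeys` gives the same order-preserving deduplication in one expression. The `FakeDocument` class and the file names below are stand-ins for illustration; the only assumption is that each document exposes a `metadata` dictionary with a `source` key, as in the output above.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Minimal stand-in for a retrieved document (illustrative only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def unique_sources(documents):
    """Return distinct source paths, keeping the order of first appearance."""
    return list(dict.fromkeys(doc.metadata["source"] for doc in documents))


docs = [
    FakeDocument("chunk 1", {"source": "guide-1.pdf", "page": 12}),
    FakeDocument("chunk 2", {"source": "guide-2.pdf", "page": 14}),
    FakeDocument("chunk 3", {"source": "guide-1.pdf", "page": 13}),
]
for source in unique_sources(docs):
    print(source)
```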
