<a href="https://colab.research.google.com/github/tiarawh1301/chatbot/blob/main/ID_Devjam_RAG_files_astra_langchain_dev_jam.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Prerequisites

You need a vector-enabled Astra DB database and an OpenAI account.

* Create an [Astra vector database](https://docs.datastax.com/en/astra-serverless/docs/getting-started/create-db-choices.html).
* Create an [OpenAI account](https://openai.com/).
* Within your database, create an [Astra DB Access Token](https://docs.datastax.com/en/astra-serverless/docs/manage/org/manage-tokens.html) with Database Administrator permissions.
* Get your Astra DB Endpoint:
  * `https://<ASTRA_DB_ID>-<ASTRA_DB_REGION>.apps.astra.datastax.com`
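
The endpoint template above can be filled in and sanity-checked with a few lines of Python. This is only a sketch: `build_astra_endpoint` and `looks_like_astra_endpoint` are hypothetical helpers, not part of any Astra SDK, and the check assumes the database ID is a UUID-style string, which may not hold for every deployment.

```python
import re

# Hypothetical helper (not an Astra SDK function): fills in the URL template
# https://<ASTRA_DB_ID>-<ASTRA_DB_REGION>.apps.astra.datastax.com
def build_astra_endpoint(db_id: str, region: str) -> str:
    return f"https://{db_id}-{region}.apps.astra.datastax.com"

# Assumption: the database ID is UUID-shaped (8-4-4-4-12 hex digits),
# followed by a region such as "us-east-2".
_ENDPOINT_RE = re.compile(
    r"^https://[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}"
    r"-[a-z0-9-]+\.apps\.astra\.datastax\.com$"
)

def looks_like_astra_endpoint(url: str) -> bool:
    """Return True if the URL matches the assumed endpoint shape."""
    return _ENDPOINT_RE.match(url.strip()) is not None
```

Running a check like this before pasting the value into the setup cell catches a missing region or a stray path segment early, before the vector store tries to connect.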


## Setup
`ragstack-ai` includes all the packages you need to build a RAG pipeline.

In [None]:
! pip install -q ragstack-ai pypdf

In [None]:
import os
from getpass import getpass

# Enter your settings for Astra DB and OpenAI:
os.environ["ASTRA_DB_API_ENDPOINT"] = input("Enter your Astra DB API Endpoint: ")
os.environ["ASTRA_DB_APPLICATION_TOKEN"] = getpass("Enter your Astra DB Token: ")
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API Key: ")

Enter your Astra DB API Endpoint: https://6c26120f-a0b8-47cb-98d5-9a85999778cc-us-east-2.apps.astra.datastax.com
Enter your Astra DB Token: ··········
Enter your OpenAI API Key: ··········
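
Because `input` and `getpass` accept empty strings, it can help to confirm that all three settings are actually populated before continuing. The following is a minimal sketch; `REQUIRED_SETTINGS` and `missing_settings` are illustrative names introduced here, not part of the notebook or of `ragstack-ai`.

```python
import os

# The three environment variables set in the cell above.
REQUIRED_SETTINGS = (
    "ASTRA_DB_API_ENDPOINT",
    "ASTRA_DB_APPLICATION_TOKEN",
    "OPENAI_API_KEY",
)

def missing_settings(env=os.environ):
    """Return the names of required settings that are unset or blank."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]

# Usage: call missing_settings() after the setup cell; an empty list
# means every required value has been provided.
```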


## Create RAG Pipeline

In [None]:
import json
from pathlib import Path

# Optional: inspect a JSON file before indexing it.
# "filename.json" is a placeholder. json.loads() cannot parse a PDF,
# so point this at a JSON file of your own.
file_path = "filename.json"
data = json.loads(Path(file_path).read_text())


### Embedding Model and Vector Store

In [None]:
from langchain_openai import OpenAIEmbeddings
from langchain_astradb import AstraDBVectorStore
import os

# Configure your embedding model and vector store
embedding = OpenAIEmbeddings(model="text-embedding-3-small")
vstore = AstraDBVectorStore(
    collection_name="id_devjam",
    embedding=embedding,
    token=os.getenv("ASTRA_DB_APPLICATION_TOKEN"),
    api_endpoint=os.getenv("ASTRA_DB_API_ENDPOINT"),
)
print("Astra vector store configured")

In [None]:
# Retrieve the text of a short story that will be indexed in the vector store
! curl https://raw.githubusercontent.com/CassioML/cassio-website/main/docs/frameworks/langchain/texts/amontillado.txt --output amontillado.txt
SAMPLEDATA = ["amontillado.txt"]

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13022  100 13022    0     0  25886      0 --:--:-- --:--:-- --:--:-- 25940


In [None]:
# Alternatively, provide your own file. If you do, update your queries to match the content of your file.

# Upload a sample file (note: this cell assumes you are running on Google Colab).
# In a local Jupyter notebook, skip the upload: uncomment the next line, set the path to your file, and run only that line.
# SAMPLEDATA = ["<path_to_file>"]

from google.colab import files

print("Please upload your own sample file:")
uploaded = files.upload()
if uploaded:
    # files.upload() returns a dict keyed by file name; keep only the names
    SAMPLEDATA = list(uploaded)
else:
    raise ValueError("Cannot proceed without sample data. Please re-run the cell.")

print("Please make sure to change your queries to match the contents of your file!")

In [None]:
import os
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Loop through each file and collect its documents for the vector store
documents = []
for filename in SAMPLEDATA:
    path = os.path.join(os.getcwd(), filename)

    # Supported file types are pdf, txt, and csv
    if filename.endswith(".pdf"):
        pdf_loader = PyPDFLoader(path)
        # Split the PDF into chunks; embeddings are created later, on insert
        splitter = RecursiveCharacterTextSplitter(chunk_size=4076, chunk_overlap=64)
        pdf_documents = pdf_loader.load_and_split(text_splitter=splitter)
        # Extend rather than assign, so earlier files are not overwritten
        documents.extend(pdf_documents)
        print(pdf_documents)
        print(f"Documents from PDF: {len(pdf_documents)}.")
    elif filename.endswith(".txt"):
        loader = TextLoader(path)
        documents.extend(loader.load_and_split())
        print(f"Processed txt file: {filename}")
    elif filename.endswith(".csv"):
        # source_column assumes your CSV has a "product_id" column
        loader = CSVLoader(file_path=path, source_column="product_id")
        csv_documents = loader.load()
        documents.extend(csv_documents)
        print(csv_documents)
        print(f"Processed csv file: {filename}")
    else:
        print(f"Unsupported file type: {filename}")

# Empty the list of file names in case this cell is run multiple times
SAMPLEDATA = []

print("\nProcessing done.")

Processed txt file: DBS Group Holdings Ltd Annual Report 2022.txt

Processing done.
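
The PDF branch of the loading cell splits text with `chunk_size=4076` and `chunk_overlap=64`. To illustrate what those two parameters mean, here is a simplified sketch of fixed-size chunking with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and sentences); `chunk_text` is a hypothetical helper written only for illustration.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share
    chunk_overlap characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# prints ['abcd', 'defg', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing a few characters twice.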


In [None]:
# Create embeddings by inserting your documents into the vector store.
inserted_ids = vstore.add_documents(documents)
print(f"\nInserted {len(inserted_ids)} documents.")


Inserted 104 documents.


In [None]:
# Check the collection to verify that the documents were embedded.
# The collection name must match the one passed to AstraDBVectorStore.
print(vstore.astra_db.collection("id_devjam").find())

{'data': {'documents': [{'_id': '119d3bbb2cef4291aa107c24047e1a33', 'content': 'เอกสารที่ต้องส่งให้ HR : การลาทุกครั้งพนักงานจะต้องจัดส่ง E-mail เพื่อขออนุมัติ หากเป็นการลาติดต่อกัน 3 \nวัน พนักงานมีหน้าที่จะต้องส่งเอกสาร (เอกสารใบรับรองแพทย์ , ใบนัดหมายการพบแพทย์)', '$vector': [-0.015619659178691384, -0.07236226048344137, -0.011304397362975442, 0.005285975163778866, 0.03318074412218602, 0.05785451200552445, 0.012513377023683582, 0.04062876559423778, 0.01569908143812689, -0.034822133137681004, 0.06064310791675802, -0.020632069972215717, -0.02266174227802616, 0.009724782043772608, 0.007072969947480846, -0.0382990509511835, -0.04518229232561505, 0.020526173005419986, 0.03812255786916913, -0.005166842240286907, -0.008083393959600225, -0.02869781536867258, -0.006115493953098036, 0.03203353722644743, 0.04317026672030278, 0.03019800842843991, 0.0645259631716549, -0.018955383830434016, 0.008153991937464048, -0.029103749084776594, -0.018725941333699186, 0.04715901707935021, -0.0225381986107322

### Basic Retrieval

Retrieve context from your vector database, and pass it to the model with a prompt.
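
Conceptually, this step ranks stored chunks by vector similarity to the query, keeps the top `k`, and pastes them into the prompt. The toy sketch below shows that flow in plain Python. It assumes cosine similarity (the metric actually used depends on how the collection is configured), and the three-dimensional vectors and passages are invented for illustration; real embeddings come from the embedding model and have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vector, store, k):
    """Return the texts of the k entries most similar to the query.
    store is a list of (text, vector) pairs."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

def build_prompt(question, passages):
    """Fill a context/question template with the retrieved passages."""
    context = "\n\n".join(passages)
    return (
        "Answer the question based only on the supplied context.\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Your answer:"
    )

# Made-up toy data, not taken from any real collection
TOY_STORE = [
    ("Sick leave requires a medical certificate.", [0.9, 0.1, 0.0]),
    ("The office is closed on public holidays.", [0.1, 0.9, 0.0]),
    ("Annual leave must be approved by email.", [0.7, 0.3, 0.1]),
]

passages = top_k([1.0, 0.0, 0.0], TOY_STORE, k=2)
print(build_prompt("What do I need for sick leave?", passages))
```

In the notebook itself, the retriever and the prompt template take over these two roles, and the filled prompt is sent to the chat model.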

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

retriever = vstore.as_retriever(search_kwargs={"k": 5})

prompt_template = """
Answer the question based only on the supplied context. If you don't know the answer, say you don't know the answer. Respond to the question in the same language as the user query.
Context: {context}
Question: {question}
Your answer:
"""
prompt = ChatPromptTemplate.from_template(prompt_template)
model = ChatOpenAI(model_name="gpt-4-0125-preview")

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

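# The query is in Thai: "What is this document about?"
# The prompt asks the model to answer in the language of the query.
chain.invoke(
    "เอกสารนี้เกี่ยวกับอะไร?"
)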

'เอกสารนี้เกี่ยวกับรายละเอียดและข้อมูลที่เกี่ยวข้องกับนโยบายการลางานของพนักงาน, ข้อมูลทั่วไปสำหรับแผนก HR, รวมถึงตารางวันหยุดอย่างเป็นทางการของบริษัท G-ABLE ประจำปี พ.ศ. 2567 นอกจากนี้ยังมีข้อมูลเกี่ยวกับเงื่อนไขและเอกสารที่จำเป็นสำหรับการลาแต่ละประเภท เช่น ลาพักร้อน, ลาป่วย, ลาสมรส, และลาไม่รับเงินเดือน รวมถึงการติดต่อ HR และข้อมูลสวัสดิการและการดูแลสุขภาพของพนักงาน.'