# Crayon RAG system - Basic RAG

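 - Setup
 - Loading data
 - Chunking documents into nodes
 - Storing the data in a vector database (Chroma)
 - Creating a retriever
 - Inspecting the retrieved nodes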

## Step 1: Setup

In [43]:
import os
import llama_index
import chromadb
from importlib.metadata import version

from dotenv import load_dotenv,find_dotenv
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core.settings import Settings


print(f"LlamaIndex version: {version('llama_index')}")
print(f"Chroma version: {version('chromadb')}")

LlamaIndex version: 0.11.18
Chroma version: 0.5.16


In [44]:
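# Load environment variables (such as OPENAI_API_KEY) from a local .env file, if one exists
load_dotenv(find_dotenv())

# Optional: pin the LLM explicitly; when left unset, LlamaIndex falls back to its default LLM
# Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

# Use OpenAI's default embedding model for both indexing and querying
Settings.embed_model = OpenAIEmbedding()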
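
`OpenAIEmbedding()` reads the API key from the environment, so a quick check before running the later cells can save a confusing authentication error. A minimal sketch, assuming the key is stored under `OPENAI_API_KEY` (the variable name used in the final cell of this notebook); `require_env` and `mask` are hypothetical helpers, not part of any library:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in the shell."
        )
    return value


def mask(secret: str, visible: int = 4) -> str:
    """Hide all but the last few characters so the key can be printed safely."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

Calling `print(mask(require_env("OPENAI_API_KEY")))` right after `load_dotenv` confirms the key was picked up without echoing it in full.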

## Step 2: Loading data

In [45]:
from llama_index.core import SimpleDirectoryReader

# Load data
documents = SimpleDirectoryReader("./dataset/policy").load_data()

documents

[Document(id_='e745791d-a761-4baa-b302-41f921698320', embedding=None, metadata={'file_path': '/Users/tianyuliu/Code/llm/NLP_examples/src/llm/RAG/Crayon RAY/dataset/policy/Policy_1.txt', 'file_name': 'Policy_1.txt', 'file_type': 'text/plain', 'file_size': 6393, 'creation_date': '2024-11-16', 'last_modified_date': '2024-11-16'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='### Comprehensive Data Privacy Policy\r\n\r\n**1. Introduction**\r\n\r\n**Purpose of the Policy:**  \r\nAt [Company Name], safeguarding the privacy and security of personal data is a foundational principle of our business operations. This Data Privacy Policy is designed to transparently communicate our unwavering commitment to the protection of personal information across all aspects 

## Step 3: Chunk documents into Nodes

As the whole document is too large to fit into the context window of the LLM, you will need to partition it into smaller text chunks, which are called Nodes in LlamaIndex.

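The `SimpleNodeParser` used here splits each document into chunks of at most `chunk_size` tokens, preferring to break at sentence boundaries. Consecutive chunks share `chunk_overlap` tokens, so content that falls on a chunk boundary still appears with some of its surrounding context in the neighbouring chunk.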

In [46]:

from llama_index.core.node_parser import SimpleNodeParser

node_parser = SimpleNodeParser.from_defaults(chunk_size=258, chunk_overlap=50)

# Extract nodes from documents
nodes = node_parser.get_nodes_from_documents(documents)

# Show what a single node looks like
i = 10
print(f"Text: \n{nodes[i].text}")

Text: 
##### 4.4 Accountability
- Implement a standardized AI incident reporting system, which ensures all potential issues are logged, investigated, and addressed promptly.
- Define clear escalation paths for ethical concerns related to AI, including a direct line to the AI Ethics Board.

#### 5. Implementation
##### 5.1 Governance
- Enhance the role of the AI Ethics Board to include periodic ethical reviews of existing AI systems, not just new projects, with the authority to recommend modifications or discontinuations based on ethical evaluations.
- Introduce a third-party ethics audit performed annually to provide an unbiased assessment of our AI practices.

##### 5.2 Risk Assessment
- Develop a comprehensive ethical risk assessment toolkit that includes templates, best practices, and guidelines to standardize the assessment process across the company.
- Employ predictive modeling to forecast potential ethical issues under various operational scenarios and use these insights to guid
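
To make the `chunk_size` and `chunk_overlap` arguments concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not the LlamaIndex implementation: it counts whitespace-separated words instead of tokenizer tokens and ignores sentence boundaries, and `chunk_words` is a hypothetical helper.

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into chunks of at most chunk_size words, where consecutive
    chunks share chunk_overlap words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
for chunk in chunk_words(sample, chunk_size=4, chunk_overlap=2):
    print(chunk)
```

Each printed chunk repeats the last two words of the previous one. A larger overlap preserves more context across boundaries at the cost of storing and embedding more redundant text.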

## Step 4: Store the data in a vector database (Chroma)

In [47]:
import chromadb
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext


# initialize client, setting path to save data
db = chromadb.PersistentClient(path="./dataset/chroma_db")

# create collection
chroma_collection = db.get_or_create_collection("quickstart")

# assign chroma as the vector_store to the context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Build the index from the pre-chunked nodes: each node is embedded with
# Settings.embed_model and written to the Chroma collection for later retrieval
index = VectorStoreIndex(nodes, storage_context=storage_context)

# Alternative: let VectorStoreIndex chunk the raw documents itself
# with its default node parser instead of passing pre-built nodes.
# index = VectorStoreIndex.from_documents(
#     documents, storage_context=storage_context
# )

## Step 5: Create a retriever using Chroma

You'll now create a retriever that fetches relevant nodes from the newly created Chroma vector store.

First, initialize the `PersistentClient` with the same path you specified when creating the Chroma vector store. Then fetch the "quickstart" collection you created earlier and use it to initialize the `ChromaVectorStore` that holds the embeddings of the policy documents. Finally, use the `from_vector_store` function of `VectorStoreIndex` to load the index.

In [48]:

# Load from disk
load_client = chromadb.PersistentClient(path="./dataset/chroma_db")

# Fetch the collection
chroma_collection = load_client.get_collection("quickstart")

# Fetch the vector store
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

# Get the index from the vector store
index = VectorStoreIndex.from_vector_store(vector_store)

# You can ask questions about your data through a generic interface called
# a query engine. Use the `as_query_engine` function of the index to create
# one, then call its `query` function with a question.
# As a quick check that the index loaded correctly, query for
# "AI Risk Assessment": an answer grounded in the policy text means
# retrieval is working.
test_query_engine = index.as_query_engine()
response = test_query_engine.query("AI Risk Assessment")
print(response)

Develop a comprehensive ethical risk assessment toolkit that includes templates, best practices, and guidelines to standardize the assessment process across the company. Employ predictive modeling to forecast potential ethical issues under various operational scenarios and use these insights to guide AI system development.


## Step 6: Inspect the retrieved nodes

In [49]:
# The response keeps the nodes it was generated from; show the text of the top one
source_text = response.source_nodes[0].node.text


print(f"Source chunk: {source_text}")

Source chunk: ##### 4.4 Accountability
- Implement a standardized AI incident reporting system, which ensures all potential issues are logged, investigated, and addressed promptly.
- Define clear escalation paths for ethical concerns related to AI, including a direct line to the AI Ethics Board.

#### 5. Implementation
##### 5.1 Governance
- Enhance the role of the AI Ethics Board to include periodic ethical reviews of existing AI systems, not just new projects, with the authority to recommend modifications or discontinuations based on ethical evaluations.
- Introduce a third-party ethics audit performed annually to provide an unbiased assessment of our AI practices.

##### 5.2 Risk Assessment
- Develop a comprehensive ethical risk assessment toolkit that includes templates, best practices, and guidelines to standardize the assessment process across the company.
- Employ predictive modeling to forecast potential ethical issues under various operational scenarios and use these insi

In [64]:
from llama_index.core.retrievers import VectorIndexRetriever

# Configure the retriever to return the two most similar nodes
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=2,
)

# Retrieve the nodes for the query (this reuses the name `nodes` from Step 3)
nodes = retriever.retrieve("AI Risk Assessment")
print(nodes[0].text)

##### 4.4 Accountability
- Implement a standardized AI incident reporting system, which ensures all potential issues are logged, investigated, and addressed promptly.
- Define clear escalation paths for ethical concerns related to AI, including a direct line to the AI Ethics Board.

#### 5. Implementation
##### 5.1 Governance
- Enhance the role of the AI Ethics Board to include periodic ethical reviews of existing AI systems, not just new projects, with the authority to recommend modifications or discontinuations based on ethical evaluations.
- Introduce a third-party ethics audit performed annually to provide an unbiased assessment of our AI practices.

##### 5.2 Risk Assessment
- Develop a comprehensive ethical risk assessment toolkit that includes templates, best practices, and guidelines to standardize the assessment process across the company.
- Employ predictive modeling to forecast potential ethical issues under various operational scenarios and use these insights to guide AI sy
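
`similarity_top_k=2` asks the retriever for the two nodes whose embeddings are most similar to the query embedding. The sketch below shows that ranking step on toy data, assuming cosine similarity as the score (the actual metric depends on how the vector store is configured). The three-dimensional vectors and the `top_k` helper are made up for illustration; real embeddings come from the embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], store: dict[str, list[float]], k: int = 2):
    """Score every stored vector against the query and keep the k best."""
    scored = [(cosine_similarity(query, vec), text) for text, vec in store.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "vector store": chunk text mapped to a made-up embedding
toy_store = {
    "risk assessment toolkit": [0.9, 0.1, 0.0],
    "data retention schedule": [0.1, 0.9, 0.1],
    "ethics board escalation": [0.7, 0.2, 0.3],
}
query_vector = [1.0, 0.0, 0.0]

for score, text in top_k(query_vector, toy_store, k=2):
    print(f"{score:.3f}  {text}")
```

A production vector store avoids this full scan by using an approximate nearest-neighbour index, but the result has the same shape: a short list of chunks ordered by score.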

In [60]:
# Each retrieved node carries the metadata of the file it came from
nodes[0].metadata["file_name"]

'Policy_2.txt'
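
The `file_name` metadata makes it possible to tell the reader which documents an answer draws on. A minimal sketch, using plain dictionaries as stand-ins for the metadata of retrieved nodes; the sample entries and the `unique_sources` helper are hypothetical:

```python
def unique_sources(metadata_list: list[dict], key: str = "file_name") -> list[str]:
    """Return source names in first-seen order, skipping entries without the key."""
    seen = []
    for metadata in metadata_list:
        name = metadata.get(key)
        if name is not None and name not in seen:
            seen.append(name)
    return seen


# Stand-in for [node.metadata for node in nodes]
retrieved_metadata = [
    {"file_name": "Policy_2.txt"},
    {"file_name": "Policy_2.txt"},
    {"file_name": "Policy_1.txt"},
    {},
]
print("Sources:", ", ".join(unique_sources(retrieved_metadata)))
```

With the retriever above, the same helper could be called as `unique_sources([n.metadata for n in nodes])`.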

In [None]:
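# Chat directly with the model through the OpenAI Python SDK.
# The `OpenAI` name imported in Step 1 is the LlamaIndex LLM wrapper, which
# does not expose `chat.completions`, so import the SDK client under its own name.
from openai import OpenAI as OpenAIClient

client = OpenAIClient(
    # This is the default and can be omitted
    api_key=os.environ.get("OPENAI_API_KEY"),
)

chat_completion = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"}
    ],
    model="gpt-3.5-turbo",
    temperature=0.2
)
# Print the full response object, then only the assistant's reply
print(chat_completion)
print(chat_completion.choices[0].message.content)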
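
To ground the chat model in the policy documents, the retrieved chunk texts can be placed into the prompt before the call. A minimal sketch of that assembly step, assuming the same `messages` structure used above; the prompt wording and the `build_rag_messages` helper are illustrative choices, not what the LlamaIndex query engine does internally:

```python
def build_rag_messages(question: str, chunks: list[str]) -> list[dict]:
    """Build a chat `messages` list that carries retrieved chunks as context."""
    # Number each chunk so the model (and the reader) can refer back to it
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    system = (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you do not know."
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


messages = build_rag_messages(
    "What does the policy say about AI risk assessment?",
    [
        "Develop a comprehensive ethical risk assessment toolkit.",
        "Employ predictive modeling to forecast potential ethical issues.",
    ],
)
print(messages[1]["content"])
```

The resulting list can be passed as the `messages` argument of the chat call, with the chunk texts taken from the nodes returned by the retriever.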