<a href="https://colab.research.google.com/github/reachrkr/llamaindexrag/blob/main/LlamaIndex_chroma_store_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q llama-index
!pip install -q openai
!pip install -q transformers
!pip install -q accelerate
!pip install -q optimum[exporters]
!pip install -q InstructorEmbedding
!pip install -q sentence_transformers
!pip install -q pypdf
!pip install -q chromadb

In [2]:
# llama-index and chromadb are already installed above; pin pydantic to a 1.x release
!pip install -q pydantic==1.10.11

In [20]:
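from google.colab import userdata
# read the key from Colab's secrets panel
openai_api_key = userdata.get('OPENAI_API_KEY')

In [21]:
import os
# expose the key to the OpenAI client used by llama-index
os.environ["OPENAI_API_KEY"] = openai_api_key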
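
Outside Colab, `google.colab.userdata` is not available, so the key has to come from somewhere else. A minimal sketch, assuming the key has already been exported as an environment variable; `resolve_api_key` is a hypothetical helper, not part of any library used here:

```python
import os


def resolve_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise a clear error.

    `resolve_api_key` is a hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the notebook")
    return key
```

Failing early with a readable message is usually easier to debug than an authentication error raised later during the first query.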

In [22]:
# !curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf

# download the three anomaly-detection papers used as the corpus (-L follows redirects)
!curl -L https://arxiv.org/pdf/2106.07178.pdf --output AD1.pdf
!curl -L https://arxiv.org/pdf/1404.4679.pdf --output AD2.pdf
!curl -L https://www.kdd.org/exploration_files/18-1-Article1.pdf --output AD3.pdf


In [23]:
! ls -ltr

total 22348
drwxr-xr-x 1 root root     4096 Nov  7 14:26 sample_data
-rw-r--r-- 1 root root 12742280 Nov  9 07:04 utput
drwxr-xr-x 4 root root     4096 Nov  9 08:04 chroma_db
-rw-r--r-- 1 root root  5579402 Nov  9 08:06 AD1.pdf
-rw-r--r-- 1 root root  2213819 Nov  9 08:06 AD2.pdf
-rw-r--r-- 1 root root  2331363 Nov  9 08:06 AD3.pdf


In [24]:
# import
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.vector_stores import ChromaVectorStore
from llama_index.storage.storage_context import StorageContext
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.llms import OpenAI
from IPython.display import Markdown, display
import chromadb


In [25]:
# define embedding function
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")

In [26]:
documents = SimpleDirectoryReader(
    input_files=["AD1.pdf","AD2.pdf","AD3.pdf"]
).load_data()

# CHROMA-DB

In [28]:
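# create an in-memory client and start from a fresh collection
chroma_client = chromadb.EphemeralClient()
try:
    chroma_client.delete_collection("AD_paper")
except Exception:
    pass  # the collection does not exist on a first run
chroma_collection = chroma_client.create_collection("AD_paper")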

### Set up ChromaVectorStore and load in data


In [29]:
# wrap the Chroma collection in a LlamaIndex vector store
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)


In [30]:
storage_context = StorageContext.from_defaults(vector_store=vector_store)

In [31]:
service_context = ServiceContext.from_defaults(embed_model=embed_model)

In [32]:

# embed the documents and write the vectors into the Chroma collection
index = VectorStoreIndex.from_documents(documents,
                                        storage_context=storage_context,
                                        service_context=service_context)

In [33]:
index

<llama_index.indices.vector_store.base.VectorStoreIndex at 0x7979b978a2c0>

In [34]:
# Query Data
query_engine = index.as_query_engine()

response = query_engine.query("Reinforcement Learning Based Techniques?")

display(Markdown(f"{response}"))

Reinforcement learning based techniques have been successfully applied in the field of anomaly detection. One approach, called NAC, combines reinforcement learning and network embedding techniques to selectively harvest anomalous nodes. NAC is trained with labeled data and learns a node selection plan that can identify anomalous nodes in the undiscovered area of a graph. Another algorithm, GraphUCB, models both attribute information and structural information in attributed graphs for anomalous node detection. It uses the contextual multi-armed bandit technology to output potential anomalies and continuously optimizes the decision-making strategy based on expert evaluation. These reinforcement learning based techniques show promise in detecting anomalies in various types of graphs.

In [39]:
response = query_engine.query("Give a summary of all the techniques?")

display(Markdown(f"{response}"))

The techniques mentioned in the given context include feature-based and proximity-based approaches for anomaly detection in static plain graphs. Feature-based approaches extract structural graph-centric features, such as node degree and subgraph centrality, to identify outliers. Proximity-based approaches use the graph structure to quantify the closeness of nodes and identify associations. These techniques can be used together with other features extracted from additional information sources for outlier detection in the constructed graph.
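
The summary above covers only feature-based and proximity-based approaches. A likely reason is that the query engine answers from a small number of retrieved chunks, not from the whole papers. The sketch below illustrates top-k similarity retrieval in plain Python. It is a simplified illustration, not LlamaIndex's implementation; the vectors are toy values, and `cosine` and `top_k` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunks, k=2):
    """Return the k (score, text) pairs most similar to the query.

    `chunks` is a list of (text, vector) pairs; the vectors are toy
    values chosen for illustration, not real embeddings.
    """
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


chunks = [
    ("feature-based methods", [1.0, 0.0, 0.0]),
    ("proximity-based methods", [0.9, 0.1, 0.0]),
    ("reinforcement learning methods", [0.0, 1.0, 0.0]),
]
hits = top_k([1.0, 0.05, 0.0], chunks, k=2)
print([text for _, text in hits])
```

With `k=2` the third chunk never reaches the LLM, however relevant it might be to a broad question. In the notebook, passing a larger `similarity_top_k` to `index.as_query_engine(...)` is one way to widen the context for summary-style queries.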

### How to Persist: Saving to Disk

In [35]:
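# open (or create) an on-disk Chroma database under ./chroma_db
db = chromadb.PersistentClient(path="./chroma_db")

# reuse the collection if it already exists, otherwise create it
chroma_collection = db.get_or_create_collection("AD_paper")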

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

service_context = ServiceContext.from_defaults(embed_model=embed_model,
                                               chunk_size=800,
                                               chunk_overlap=20)

index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, service_context=service_context
)
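
The `chunk_size=800` and `chunk_overlap=20` settings control how each document is cut before embedding. A rough sketch of the idea, assuming a plain sliding window over a token list; LlamaIndex's own splitter is sentence-aware and will not produce exactly these boundaries, and `chunk_tokens` is a hypothetical helper:

```python
def chunk_tokens(tokens, chunk_size=800, chunk_overlap=20):
    """Split a token list into windows of chunk_size sharing chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


tokens = [f"t{i}" for i in range(2000)]
chunks = chunk_tokens(tokens)
print(len(chunks), [len(c) for c in chunks])
```

Each window starts 780 tokens after the previous one, so consecutive chunks share 20 tokens. The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.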

### Load from Disk

In [36]:
# load from disk
db2 = chromadb.PersistentClient(path="./chroma_db")

chroma_collection = db2.get_or_create_collection("AD_paper")

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

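# rebuild the index from the stored vectors; the service context must use
# the same embedding model that produced them
index = VectorStoreIndex.from_vector_store(
    vector_store,
    service_context=service_context,
)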
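
The saving cell calls `from_documents` on a collection obtained with `get_or_create_collection`, so re-running it may append a second copy of every chunk to `./chroma_db`. A minimal sketch of a guard, assuming only that the collection exposes a `count()` method (Chroma collections do); `needs_indexing` and `FakeCollection` are hypothetical names used for illustration:

```python
def needs_indexing(collection):
    """Return True when the collection holds no vectors yet.

    Relies only on a `count()` method on the collection object.
    """
    return collection.count() == 0


class FakeCollection:
    """Stand-in used only to exercise the guard without a database."""

    def __init__(self, n):
        self._n = n

    def count(self):
        return self._n


print(needs_indexing(FakeCollection(0)))
print(needs_indexing(FakeCollection(128)))
```

In the notebook, one option is to call `VectorStoreIndex.from_documents(...)` only when `needs_indexing(chroma_collection)` is true and to use `VectorStoreIndex.from_vector_store(...)` otherwise.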

In [37]:
# Query Data
query_engine = index.as_query_engine()

response = query_engine.query("Reinforcement Learning Based Techniques?")

display(Markdown(f"{response}"))

Optimizing debt collections using constrained reinforcement learning is an example of a reinforcement learning based technique mentioned in the context information.