# **Document Search System using llama-index and OpenAI**

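Install the required dependencies. llama-index is pinned to 0.9.40 because the imports below use the pre-0.10 package layout (`from llama_index import ...`).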

In [1]:
!pip install llama-index==0.9.40 openai pypdf

In [3]:
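import os
from google.colab import userdata

# Read the key from Colab's secret store and expose it through the
# environment variable that the OpenAI client looks for.
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')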
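
The cell above only works inside Colab, where `userdata` is available. Elsewhere the key would typically be exported as an environment variable before the notebook starts. A minimal sketch of a fail-fast check under that assumption; `require_env` is a hypothetical helper written for this example, not part of llama-index or openai.

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration; not part of llama-index or openai.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before building the index")
    return value


# An explicit mapping is passed here so the sketch runs without a real key.
print(require_env("OPENAI_API_KEY", {"OPENAI_API_KEY": "sk-placeholder"}))
```

Failing early gives a clearer message than an authentication error raised in the middle of embedding generation.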

**Load the Documents**: `SimpleDirectoryReader` reads the files in the specified directory and returns them as a list of document objects, which are indexed in the next step.

In [4]:
from llama_index import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("/content/data").load_data()
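
Before loading, it can help to confirm which files the directory actually contains. The following is a standard-library sketch only, not how `SimpleDirectoryReader` is implemented; `list_input_files` and its extension list are assumptions made for this example.

```python
from pathlib import Path


def list_input_files(directory, extensions=(".pdf", ".txt", ".md")):
    """Return the sorted top-level files in `directory` with a matching extension."""
    root = Path(directory)
    if not root.is_dir():
        raise FileNotFoundError(f"{directory} is not a directory")
    return sorted(
        path for path in root.iterdir()
        if path.is_file() and path.suffix.lower() in extensions
    )


# Guarded so the sketch also runs where /content/data does not exist.
if Path("/content/data").is_dir():
    for path in list_input_files("/content/data"):
        print(path.name)
```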

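**Create the Index**: `VectorStoreIndex.from_documents` splits the loaded documents into chunks (nodes), computes an embedding for each chunk, and stores the results in a vector index. This index will be used for querying.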

In [8]:
index = VectorStoreIndex.from_documents(documents, show_progress=True)
print(index)

Parsing nodes:   0%|          | 0/2936 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/900 [00:00<?, ?it/s]

<llama_index.indices.vector_store.base.VectorStoreIndex object at 0x78f6ae29a710>
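
The progress bars above suggest that the 2936 loaded documents were parsed into 2948 nodes, embedded in batches of 2048 and 900. The splitting step can be pictured with a simplified word-window chunker. This is an illustration only: `chunk_words` and its parameters are assumptions for this sketch, and the library's node parser uses its own splitting rules.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into overlapping word windows (a stand-in for a node parser)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


text = "A model is the single definitive source of information about your data"
for chunk in chunk_words(text, chunk_size=5, overlap=1):
    print(chunk)
```

The overlap repeats a few words between neighbouring chunks so that a sentence cut at a boundary is still partly visible in both.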


**Query the Index**: You can ask any question related to the documents. The query engine retrieves the chunks most similar to the question by vector similarity and passes them to the LLM, which generates the response.

In [13]:
query1 = "what is django"
query2 = "tell me about abhishek yadav"
query3 = "How Django ORM model Works"
query_lists = [query1, query2, query3]

query_engine = index.as_query_engine()
for query in query_lists:
    response = query_engine.query(query)
    print("Query : ", query)
    print("Response : ", response)
    print('-' * 100)


Query :  what is django
Response :  Django is a web framework and a programming tool that allows users to build websites. It is not a content management system (CMS) or a turnkey product in itself. Django provides tools to create database-driven web applications efficiently and easily.
----------------------------------------------------------------------------------------------------
Query :  tell me about abhishek yadav
Response :  Abhishek Yadav is currently pursuing a Bachelor's in Information Technology at Rajiv Gandhi Institute of Petroleum Technology in Jais, Amethi, Uttar Pradesh. He has a CGPA of 9.07 out of 10. Abhishek has experience working as an Intern Software Engineer at Ime Technologies and Consultancy Pvt Ltd in Mumbai, where he developed a web application dashboard for commodity price forecasting using Django and Microsoft SQL Server. Additionally, he has technical skills in programming languages such as Python, C, C++, and JavaScript, along with experience in develop
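
To make retrieval by vector similarity concrete, here is a toy sketch that ranks chunks by cosine similarity. It assumes a bag-of-words "embedding" purely for illustration; the notebook itself relies on OpenAI embeddings through llama-index, and the names `embed`, `cosine_similarity`, and `top_k` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts. Real systems use a learned model."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query as (score, chunk) pairs."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


chunks = [
    "django is a web framework",
    "a model maps to a database table",
    "the resume lists python and javascript",
]
for score, chunk in top_k("what is django", chunks):
    print(f"{score:.3f}  {chunk}")
```

In the real pipeline the retrieved chunks are not printed directly; they are handed to the LLM as context for the answer.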

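We can also control retrieval explicitly. The retriever below returns the 4 most similar nodes, and the postprocessor discards any node whose similarity score is below 0.80: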

In [20]:
from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.indices.postprocessor import SimilarityPostprocessor
from llama_index.response.pprint_utils import pprint_response

retriever = VectorIndexRetriever(index=index, similarity_top_k=4)
postprocessor = SimilarityPostprocessor(similarity_cutoff=0.80)
query_engine = RetrieverQueryEngine(retriever=retriever, node_postprocessors=[postprocessor])

response = query_engine.query("What is  Django models")
pprint_response(response, show_source=True)

Final Response: Django models are Python classes that subclass
`django.db.models.Model`. Each attribute of the model represents a
database field, and Django provides an automatically-generated
database-access API based on these models. Models define the structure
and behavior of the data being stored and generally map to individual
database tables.
______________________________________________________________________
Source Node 1/4
Node ID: 0f2f8d92-2f21-4d59-9dce-eb7209e2ccde
Similarity: 0.8328141639996558
Text: Django Documentation, Release 5.2.dev20240919152630 3.2 Models
and databases
Amodelisthesingle,definitivesourceofinformationaboutyourdata.
Itcontainstheessentialfieldsand behaviorsofthedatayou’restoring.Genera
lly,eachmodelmapstoasingledatabasetable. 3.2.1 Models
Amodelisthesingle,definitivesourceofinformationaboutyourdata.
Itcontainstheessential...
______________________________________________________________________
Source Node 2/4
Node ID: 82c5b091-ad8d-4120-9281-23a4279
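
The two-stage filtering used above (take the top k, then apply a similarity cutoff) can be sketched in plain Python. This is a simplified model of the idea, not the library's implementation. `retrieve_with_cutoff` is a hypothetical name, and the candidate scores are made up for illustration, apart from the first, which echoes the similarity printed above.

```python
def retrieve_with_cutoff(scored_nodes, top_k=4, cutoff=0.80):
    """Keep the top_k highest-scoring nodes, then drop those scoring below cutoff."""
    ranked = sorted(scored_nodes, key=lambda node: node[1], reverse=True)[:top_k]
    return [node for node in ranked if node[1] >= cutoff]


# Hypothetical (node_id, similarity) pairs.
candidates = [("n1", 0.8328), ("n2", 0.81), ("n3", 0.79), ("n4", 0.75), ("n5", 0.60)]
print(retrieve_with_cutoff(candidates))
```

Here `n5` is removed by the top-k step and `n3` and `n4` by the cutoff, so a strict cutoff can leave fewer than `top_k` nodes, or none at all.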

**Persistence of the Index**: The index can be saved to disk so that it is not rebuilt and re-embedded on every run. If `./storage` already exists, the index is loaded from it. Otherwise, the index is built from the documents and saved there.

In [21]:
import os.path
from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)

# check if storage already exists
PERSIST_DIR = "./storage"
if not os.path.exists(PERSIST_DIR):
    # load the documents and create the index
    documents = SimpleDirectoryReader("/content/data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    # store it for later
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # load the existing index
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)

# either way we can now query the index
query_engine = index.as_query_engine()
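
The same load-or-build pattern can be tried without any API calls. The sketch below caches a small JSON artifact in a temporary directory; `load_or_build` and the artifact contents are assumptions for illustration and stand in for the index and its storage context.

```python
import json
import tempfile
from pathlib import Path


def load_or_build(persist_path, build):
    """Load a cached JSON artifact if present; otherwise build it and save it."""
    path = Path(persist_path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8")), "loaded"
    artifact = build()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(artifact), encoding="utf-8")
    return artifact, "built"


def build_artifact():
    """Stand-in for the expensive step (embedding every chunk)."""
    return {"chunks": ["django is a web framework"]}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "storage" / "index.json"
    print(load_or_build(cache, build_artifact)[1])  # first call builds and saves
    print(load_or_build(cache, build_artifact)[1])  # second call loads from disk
```

As in the cell above, the check only tests whether the storage location exists. It does not notice that the source documents have changed, so the stored copy has to be removed to force a rebuild.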


In [25]:
response = query_engine.query("Give me 5 benefits of using Django ORM")
print(response)

1. Ability to run SQL aggregate queries within Django's ORM.
2. Option to return the results of aggregated queries directly or annotate objects in a QuerySet with the results.
3. Support for queries that can refer to another field in the query and traverse relationships to refer to fields on related models.
4. Control over managing the life-cycle of database tables for a model using the managed model option.
5. Enhanced performance through understanding QuerySets, lazy evaluation, and cached attributes in Django ORM.
