## Build Your First RAG System
1. Data Ingestion
2. Indexing
3. Retriever
4. Response Synthesizer
5. Querying

In [None]:
# The imports below use the pre-0.10 llama_index package layout
!pip install "llama-index<0.10"

In [None]:
import os
os.environ['OPENAI_API_KEY'] = 'sk-*****'
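
Hardcoding a key in a notebook cell makes it easy to leak through saved code or cell output. As a minimal sketch, assuming the key has already been exported in the shell that launched the notebook, the helpers below read it from the environment and mask it for printing. `require_api_key` and `mask` are hypothetical names introduced here for illustration, not part of any library.

```python
import os

def require_api_key(name="OPENAI_API_KEY", env=None):
    """Return the API key stored under `name`, or raise if it is missing."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the notebook")
    return key

def mask(key, visible=3):
    """Keep only the first few characters so the value is safe to print."""
    return key[:visible] + "*" * 5
```

Calling `mask(require_api_key())` prints something like `sk-*****` instead of the full secret.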

## Stage 1 : Data Ingestion

In [None]:
# Download the file
!mkdir -p ./data
!wget 'https://raw.githubusercontent.com/aravindpai/Speech-Recognition/c9c45731e966592b1805929fc1585c72e1f34f10/dhs.txt' -O './data/dhs.txt'

--2024-01-02 14:04:18--  https://raw.githubusercontent.com/aravindpai/Speech-Recognition/c9c45731e966592b1805929fc1585c72e1f34f10/dhs.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20789 (20K) [text/plain]
Saving to: ‘./data/dhs.txt’


2024-01-02 14:04:18 (22.5 MB/s) - ‘./data/dhs.txt’ saved [20789/20789]



In [None]:
from llama_index import SimpleDirectoryReader
documents = SimpleDirectoryReader("data").load_data()

In [None]:
type(documents)

list

In [None]:
len(documents)

1

In [None]:
documents

[Document(id_='b3dd011e-e38f-4992-a78a-e1cbf1afce2f', embedding=None, metadata={'file_path': 'data/dhs.txt', 'file_name': 'dhs.txt', 'file_type': 'text/plain', 'file_size': 20789, 'creation_date': '2024-01-02', 'last_modified_date': '2024-01-02', 'last_accessed_date': '2024-01-02'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, hash='e6eef49f9dfc1312339a9b12b837e41f00bb0f8e49f7c9e14f1269048e7c12e9', text="\ufeffDataHack Summit 2023 (DHS) India’s most Futuristic AI Conference organized by Analytics Vidhya.Analytics Vidhya is the World’s leading and India’s largest data science community.Analytics Vidhya is founded by Kunal Jain. Analytics Vidhya aims to build the next generation data science ecosystem across the globe.We have helped millions of people realize

## Embedding Model

In [None]:
from llama_index.embeddings import OpenAIEmbedding
embed_model = OpenAIEmbedding()

In [None]:
embed_model

OpenAIEmbedding(model_name='text-embedding-ada-002', embed_batch_size=10, callback_manager=<llama_index.callbacks.base.CallbackManager object at 0x7b1fa5981570>, additional_kwargs={}, api_key='sk-*****', api_base='https://api.openai.com/v1', api_version='', max_retries=10, timeout=60.0, default_headers=None, reuse_client=True)
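
An embedding model maps text to a vector so that similar texts end up pointing in similar directions. The sketch below shows how two such vectors are commonly compared with cosine similarity. The three-dimensional vectors are made up for illustration (real embeddings have far more dimensions), and `cosine_similarity` is a helper written here, not a LlamaIndex call.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

# Toy vectors, invented for this example only.
query_vec = [1.0, 0.0, 1.0]
similar_vec = [2.0, 0.0, 2.0]    # same direction as the query
unrelated_vec = [0.0, 1.0, 0.0]  # orthogonal to the query
```

Vectors that point the same way score close to 1 regardless of their length, while orthogonal vectors score 0.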

## LLM

In [None]:
from llama_index.llms import OpenAI
llm = OpenAI()

## Stage 2 : Indexing

In [None]:
from llama_index import ServiceContext
service_context = ServiceContext.from_defaults(
    llm = llm,
    embed_model=embed_model
)

In [None]:
from llama_index import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents,service_context=service_context)
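
Before anything is embedded, indexing typically splits each document into smaller chunks (LlamaIndex calls them nodes). The following is a rough sketch of word-based chunking with overlap. `chunk_words` and its parameters are hypothetical, and this is not the splitter LlamaIndex actually uses.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of up to `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

# Twelve placeholder words, split into chunks of five with two shared words.
sample = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(sample, chunk_size=5, overlap=2)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some words twice.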

## Stage 3 : Retrieval

In [None]:
retriever = index.as_retriever()

In [None]:
retrieved_nodes = retriever.retrieve("What is the theme of DHS?")

In [None]:
(retrieved_nodes)[0].text

"To become a sponsor for DataHack Summit 2023, please contact the conference organizers for more information.The format of DHS 2023 includes Live Keynotes, Power Talks, Hack Sessions, Generative AI Sessions, Workshops, Awards Evening, The AI Showcase.In the AI Showcase, check out the latest and the best in Artificial Intelligence from exciting startups, solution providers to bleeding edge hardware and software providers! Awards Evening is to recognize the best in AI, the awards night uplifts and inspires everyone present.This showcases the groundbreaking innovations and business in the AI landscape.Workshops are each day-long hands-on session aimed to make sure you learn Artificial Intelligence by doing it yourself.No more lectures – just code with the help of experts. Hack Session is no better way to understand AI than seeing an expert building it in front of your eyes.Each Hack Session is a 60 to 90 minutes long live interactive session with an expert working in front of you! Generat

In [None]:
(retrieved_nodes)[1].text

"\ufeffDataHack Summit 2023 (DHS) India’s most Futuristic AI Conference organized by Analytics Vidhya.Analytics Vidhya is the World’s leading and India’s largest data science community.Analytics Vidhya is founded by Kunal Jain. Analytics Vidhya aims to build the next generation data science ecosystem across the globe.We have helped millions of people realize their data science dreams.We conduct hackathons, competitions, training & conferences and help companies find the right data science talent.\nDHS 2023 totally has 70+ AI Talks, 30+ Hack Sessions and 8+ Workshops.\nIt's the 4th edition of DHS.Here is the DHS website: https://www.analyticsvidhya.com/datahack-summit-2023/.The previous DHS happened in 2017, 2018 and 2019 at Bengaluru. The 4 day conference is taking place on 2nd – 5th August, 2023 at NIMHANS Convention Centre, Bengaluru. The 4th day consists of day-long workshops. The workshops are held in a table-and-chair set up.The venues for the workshops will be announced shortly.T
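
Conceptually, a vector retriever scores every stored node against the query embedding and keeps the best few. The sketch below illustrates that ranking step with invented two-dimensional vectors and a plain dot product as the score. `top_k`, the node ids, and `k=2` (chosen to mirror the two nodes inspected above) are assumptions for illustration, not LlamaIndex internals.

```python
import heapq

def top_k(query_vec, node_vecs, k=2):
    """Return the k (score, node_id) pairs with the highest dot-product score."""
    scored = (
        (sum(q * v for q, v in zip(query_vec, vec)), node_id)
        for node_id, vec in node_vecs.items()
    )
    return heapq.nlargest(k, scored)

# Made-up vectors standing in for real node embeddings.
node_vecs = {
    "format": [0.9, 0.1],
    "overview": [0.7, 0.3],
    "sponsors": [0.1, 0.9],
}
results = top_k([1.0, 0.0], node_vecs, k=2)
```

Only the nodes in `results` would be passed on to the next stage; everything else in the index is ignored for this query.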

## Stage 4: Response Synthesis

In [None]:
from llama_index.response_synthesizers import get_response_synthesizer
response_synthesizers = get_response_synthesizer()
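
A response synthesizer takes the retrieved chunks plus the question and turns them into one or more LLM calls. As a simplified sketch of the idea, the function below stuffs all chunks into a single prompt. `build_prompt` and the prompt wording are hypothetical and are not the template LlamaIndex ships with.

```python
def build_prompt(question, context_chunks):
    """Combine retrieved chunks and the question into a single prompt string."""
    context = "\n\n".join(chunk.strip() for chunk in context_chunks)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Answer the question using only the context above.\n"
        f"Question: {question}\n"
        "Answer: "
    )

prompt = build_prompt(
    "What is the theme of DHS?",
    ["First retrieved chunk.", "Second retrieved chunk."],
)
```

The resulting string is what would be sent to the LLM, so the model answers from the retrieved text rather than from memory alone.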

## Stage 5 : Query Engine

In [None]:
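from llama_index.query_engine import RetrieverQueryEngine
# The keyword is `response_synthesizer` (singular); build the engine from the parts above
query_engine = RetrieverQueryEngine(retriever=retriever,
                                    response_synthesizer=response_synthesizers)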

In [None]:
response = query_engine.query("What is the theme of DHS?")

In [None]:
response.response

'The theme of DHS 2023 is "Infinite Possibilities: Exploring the Future with Generative AI".'

## End to End RAG Pipeline

In [None]:
import os
os.environ['OPENAI_API_KEY'] = 'sk-*****'

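from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings import OpenAIEmbedding
from llama_index.llms import OpenAI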
documents = SimpleDirectoryReader("data").load_data()

llm = OpenAI()
embed_model = OpenAIEmbedding()

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model)

index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()

print(query_engine.query("What is the theme of DHS?").response)

The theme of DHS is "Infinite Possibilities: Exploring the Future with Generative AI".
