<a href="https://colab.research.google.com/github/SamurAIGPT/LlamaIndex-course/blob/main/introduction/Introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to LlamaIndex

### What does LlamaIndex do?

ChatGPT is trained on huge amounts of public data. But what if you want it to work with your private data? There are three ways to achieve this:

1. Train an open-source LLM such as Llama on your data. (This is a complex and time-consuming process that does not scale well.)
2. Pass all of your documents to the LLM as part of the prompt. (This is limited by the size of the model's context window.)
3. Fetch only the relevant documents and pass them as input to your LLM.

LlamaIndex uses the third method, and we will see how it does so with the help of an example. Some of the LlamaIndex concepts we will use are data connectors, indexes, retrievers, and query engines.
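To make the third method concrete, here is a minimal sketch of "fetch only the relevant documents" using plain word-count vectors and cosine similarity. It is not how LlamaIndex scores documents (real systems use learned embeddings). The function names and the three sample strings are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, top_k=1):
    """Return the top_k documents most similar to the query."""
    query_vec = Counter(tokenize(query))
    scored = [
        (cosine_similarity(query_vec, Counter(tokenize(doc))), doc)
        for doc in documents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


# Hypothetical sample documents, standing in for chunks of private data
documents = [
    "NATO is a military alliance of North American and European countries.",
    "The budget proposal lowers the cost of prescription drugs.",
    "Investment in infrastructure will rebuild roads and bridges.",
]

relevant = retrieve("What is NATO?", documents, top_k=1)
print(relevant[0])
```

Only the retrieved text would then be passed to the LLM, which keeps the prompt well within the context window.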

### Training ChatGPT over your documents

Here is an example of how you can get ChatGPT to answer questions over your documents.



### Install LlamaIndex and dependencies

In [1]:
!pip install llama_index

Collecting llama_index
  Downloading llama_index-0.7.11.post1-py3-none-any.whl (609 kB)
Collecting tiktoken (from llama_index)
  Downloading tiktoken-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Collecting dataclasses-json (from llama_index)
  Downloading dataclasses_json-0.5.13-py3-none-any.whl (26 kB)
Collecting langchain>=0.0.218 (from llama_index)
  Downloading langchain-0.0.240-py3-none-any.whl (1.4 MB)
Collecting openai>=0.26.4 (from llama_index)
  Downloading openai-0.27.8-py3-none-any.whl (73 kB)

### Download the data

We use the State of the Union text document as the data that ChatGPT will answer questions about.

In [2]:
!wget https://raw.githubusercontent.com/hwchase17/chat-your-data/master/state_of_the_union.txt
!mkdir -p data
!mv state_of_the_union.txt data/

--2023-07-24 12:14:42--  https://raw.githubusercontent.com/hwchase17/chat-your-data/master/state_of_the_union.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 39027 (38K) [text/plain]
Saving to: ‘state_of_the_union.txt’


2023-07-24 12:14:42 (16.6 MB/s) - ‘state_of_the_union.txt’ saved [39027/39027]



### Train the chatbot using LlamaIndex

Now we will use LlamaIndex to make our private data available to ChatGPT.

We use the `SimpleDirectoryReader` from LlamaIndex to read the file downloaded above. This reader reads every file in a directory and converts the contents into documents that can then be indexed.
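As a rough picture of what a directory reader does, the sketch below walks a folder and turns each file into a small dictionary holding its text and file name. This is a simplified stand-in written with the standard library only, not the `SimpleDirectoryReader` implementation. The `load_directory` name and the dictionary layout are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def load_directory(directory):
    """Read every regular file in `directory` into a simple document dict."""
    documents = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            documents.append({
                "text": path.read_text(encoding="utf-8"),
                "metadata": {"file_name": path.name},
            })
    return documents


# Demonstrate on a temporary directory holding two small text files
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_text("first file", encoding="utf-8")
    (Path(tmp) / "b.txt").write_text("second file", encoding="utf-8")
    docs = load_directory(tmp)

print([doc["metadata"]["file_name"] for doc in docs])
```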

Replace "OPEN-AI-KEY" with your OpenAI API key.
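If you would rather not paste the key into the notebook, one common alternative is to read it from an environment variable. The sketch below assumes the key has been stored in `OPENAI_API_KEY` beforehand. The `get_api_key` helper is a hypothetical name, not part of the `openai` or `llama_index` packages.

```python
import os


def get_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in `env_var`, or raise a clear error if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running this notebook."
        )
    return key


# Usage in the notebook (assumes the variable was set beforehand):
# openai.api_key = get_api_key()
```

Failing early with a readable message is usually easier to debug than an authentication error raised later during indexing.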

In [None]:
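from llama_index import VectorStoreIndex, SimpleDirectoryReader
import openai

# Replace the placeholder with your own OpenAI API key
openai.api_key = "OPEN-AI-KEY"

# Read every file in the data/ directory into documents
documents = SimpleDirectoryReader('data').load_data()

# Build a vector index over the documents
index = VectorStoreIndex.from_documents(documents)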
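Building a vector index generally involves splitting each document into smaller chunks before embedding them, so that only the relevant pieces need to be retrieved later. The sketch below shows a naive character-based splitter with overlap. The library's own splitter is more sophisticated, and the `chunk_text` name and its default sizes are assumptions chosen for illustration.

```python
def chunk_text(text, chunk_size=200, overlap=20):
    """Split text into chunks of at most chunk_size characters that overlap at the edges."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if overlap < 0 or overlap >= chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# A 50-character sample string split into 20-character chunks with 5 characters of overlap
chunks = chunk_text("abcdefghij" * 5, chunk_size=20, overlap=5)
for chunk in chunks:
    print(repr(chunk))
```

The overlap means a sentence that straddles a chunk boundary still appears mostly intact in at least one chunk.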

Now we create a LlamaIndex interface called a query engine to query our documents.

With this, you can query your data in natural language.

In [None]:
query_engine = index.as_query_engine()
response = query_engine.query("What is NATO?")
print(response)


NATO is the North Atlantic Treaty Organization, an intergovernmental military alliance between 29 North American and European countries. It was created to secure peace and stability in Europe after World War 2.
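Conceptually, a query engine retrieves the chunks most relevant to the question and then places them in the prompt sent to the LLM. The sketch below shows only that prompt-assembly step. The template wording and the `build_prompt` name are illustrative assumptions, not the template LlamaIndex actually uses.

```python
def build_prompt(question, context_chunks):
    """Combine retrieved chunks and the user's question into one prompt string."""
    context = "\n\n".join(context_chunks)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Using only the context above, answer the question.\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder strings stand in for chunks returned by a retriever
prompt = build_prompt("What is NATO?", ["<retrieved chunk 1>", "<retrieved chunk 2>"])
print(prompt)
```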
