### **LlamaIndex test notebook**
Created by Sean Kan, 5 Jan 2024

In [13]:
pip install llama-index transformers accelerate bitsandbytes

Defaulting to user installation because normal site-packages is not writeable
Collecting transformers
  Downloading transformers-4.36.2-py3-none-any.whl.metadata (126 kB)
Collecting accelerate
  Downloading accelerate-0.25.0-py3-none-any.whl.metadata (18 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.41.3.post2-py3-none-any.whl.metadata (9.8 kB)
Collecting filelock (from transformers)
  Downloading filelock-3.13.1-py3-none-any.whl.metadata (2.8 kB)
Collecting huggingface-hub<1.0,>=0.19.3 (from transformers)
  Downloading huggingface_hub-0.20.1-py3-none-any.whl.metadata (12 kB)
Collecting pyyaml>=5.1 (from transformers)
  Downloading PyYAML-6.0.1-cp39-cp39-macosx_11_0_arm64.whl.metadata (2.1 kB)
Collecting tokenizers<0.19,>=0.14 (from transformers)
  Downloading tokenizers-0.15.0-cp39-cp39-macosx_11_0_arm64.whl.metadata (6.7 kB)

### High Level Concepts
This is a quick guide to the high-level concepts you’ll encounter frequently when building LLM applications.

#### Retrieval Augmented Generation (RAG)
LLMs are trained on enormous bodies of data but they aren’t trained on your data. Retrieval-Augmented Generation (RAG) solves this problem by adding your data to the data LLMs already have access to. You will see references to RAG frequently in this documentation.

In RAG, your data is loaded and prepared for queries or “indexed”. User queries act on the index, which filters your data down to the most relevant context. This context and your query then go to the LLM along with a prompt, and the LLM provides a response.
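The flow described above can be sketched in a few lines of plain Python. This is a toy illustration, not LlamaIndex code: the helper names (`build_index`, `retrieve`, `build_prompt`) and the sample chunks are made up for this example, and simple word overlap stands in for the embedding-based retrieval a real index would use. No LLM is called; the sketch stops at the prompt that would be sent to one.

```python
# Toy RAG flow: index -> retrieve -> build prompt (hypothetical helpers, no LLM call).
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(chunks):
    """'Index' the data: pair each chunk with its set of words."""
    return [(chunk, tokenize(chunk)) for chunk in chunks]


def retrieve(index, query, top_k=1):
    """Return the top_k chunks that share the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(index, key=lambda item: len(item[1] & query_words), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(context_chunks, query):
    """Combine the retrieved context and the query into a single prompt string."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


chunks = [
    "The office is closed on public holidays.",
    "Expense reports are due on the fifth day of each month.",
    "The cafeteria serves lunch from noon until two.",
]
toy_index = build_index(chunks)
query = "When are expense reports due?"
context = retrieve(toy_index, query, top_k=1)
prompt = build_prompt(context, query)
print(prompt)
```

The prompt printed at the end is what would go to the LLM: the filtered context first, then the user's question.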

Even if what you’re building is a chatbot or an agent, you’ll want to know RAG techniques for getting data into your application.
<img src="img/basic_rag.png" alt="rag" width="1000"/>


### **Stages within RAG**
There are five key stages within RAG, which in turn will be a part of any larger application you build. These are:

**Loading**: this refers to getting your data from where it lives – whether it’s text files, PDFs, another website, a database, or an API – into your pipeline. LlamaHub provides hundreds of connectors to choose from.

**Indexing**: this means creating a data structure that allows for querying the data. For LLMs this nearly always means creating vector embeddings, numerical representations of the meaning of your data, as well as numerous other metadata strategies to make it easy to accurately find contextually relevant data.

**Storing**: once your data is indexed you will almost always want to store your index, as well as other metadata, to avoid having to re-index it.

**Querying**: for any given indexing strategy there are many ways you can utilize LLMs and LlamaIndex data structures to query, including sub-queries, multi-step queries and hybrid strategies.

**Evaluation**: a critical step in any pipeline is checking how effective it is relative to other strategies, or when you make changes. Evaluation provides objective measures of how accurate, faithful and fast your responses to queries are.
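As a rough illustration of what an "objective measure" can look like, the sketch below computes two common retrieval metrics, hit rate and mean reciprocal rank (MRR), in plain Python. The node ids and the expected answers are invented for the example, and this covers only retrieval accuracy; faithfulness and speed need other measurements.

```python
# Minimal retrieval metrics over made-up data (not a LlamaIndex API).

def hit_rate(results, expected):
    """Fraction of queries whose expected node id appears among the retrieved ids."""
    hits = sum(1 for ids, exp in zip(results, expected) if exp in ids)
    return hits / len(expected)


def mean_reciprocal_rank(results, expected):
    """Average of 1/rank of the expected id; a query that misses contributes 0."""
    total = 0.0
    for ids, exp in zip(results, expected):
        if exp in ids:
            total += 1.0 / (ids.index(exp) + 1)
    return total / len(expected)


# One list of retrieved node ids per query, best match first (hypothetical values).
retrieved = [["n1", "n4"], ["n7", "n2"], ["n5", "n6"]]
# The node id that should have been retrieved for each query.
expected = ["n1", "n2", "n3"]

print(hit_rate(retrieved, expected))
print(mean_reciprocal_rank(retrieved, expected))
```

Comparing these numbers before and after a change (a different chunk size, a different retriever) is one way to check whether the change helped.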



<img src="img/rag_stage.png" alt="alt text" width="1000" />


### **Important concepts within each step**
There are also some terms you’ll encounter that refer to steps within each of these stages.

#### **Loading stage**
**Nodes and Documents**: A Document is a container around any data source, for instance a PDF, an API output, or data retrieved from a database. A Node is the atomic unit of data in LlamaIndex and represents a “chunk” of a source Document. Nodes have metadata that relate them to the document they are in and to other nodes.

**Connectors**: A data connector (often called a Reader) ingests data from different data sources and data formats into Documents and Nodes.
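To make the Document/Node relationship concrete, here is a small sketch using plain dataclasses. The `Document` and `Node` classes, their fields, and `split_into_nodes` are simplified stand-ins written for this example, not the LlamaIndex classes of the same name, and fixed-size character chunks are only one possible splitting strategy.

```python
# Hypothetical stand-ins for Document and Node, to show chunking and linking metadata.
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    text: str


@dataclass
class Node:
    node_id: str
    text: str
    metadata: dict = field(default_factory=dict)


def split_into_nodes(document, chunk_size=40):
    """Cut a document into fixed-size character chunks and link neighbouring nodes."""
    pieces = [
        document.text[start:start + chunk_size]
        for start in range(0, len(document.text), chunk_size)
    ]
    last = len(pieces) - 1
    nodes = []
    for position, piece in enumerate(pieces):
        metadata = {
            "source_doc": document.doc_id,
            "prev": f"{document.doc_id}-{position - 1}" if position > 0 else None,
            "next": f"{document.doc_id}-{position + 1}" if position < last else None,
        }
        nodes.append(Node(f"{document.doc_id}-{position}", piece, metadata))
    return nodes


doc = Document(doc_id="essay", text="a" * 40 + "b" * 40 + "c" * 20)
nodes = split_into_nodes(doc, chunk_size=40)
for node in nodes:
    print(node.node_id, len(node.text), node.metadata)
```

Each node records which document it came from and which nodes sit before and after it, which is the kind of relationship metadata the paragraph above describes.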

#### **Indexing Stage**
**Indexes**: Once you’ve ingested your data, LlamaIndex will help you index the data into a structure that’s easy to retrieve. This usually involves generating vector embeddings which are stored in a specialized database called a vector store. Indexes can also store a variety of metadata about your data.

**Embeddings**: LLMs generate numerical representations of data called embeddings. When filtering your data for relevance, LlamaIndex will convert queries into embeddings, and your vector store will find data that is numerically similar to the embedding of your query.
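"Numerically similar" is usually measured with cosine similarity. The sketch below shows the idea with hand-written three-dimensional vectors; real embeddings have hundreds of dimensions and come from a model, so the values and the tiny `store` dictionary here are purely illustrative.

```python
# Cosine similarity over toy vectors (the embeddings are invented for illustration).
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# A stand-in for a vector store: text label -> embedding.
store = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.2, 0.1],
    "tax law": [0.0, 0.1, 0.9],
}
query_embedding = [1.0, 0.0, 0.0]

# Pick the stored item whose embedding is closest to the query embedding.
best = max(store, key=lambda name: cosine_similarity(store[name], query_embedding))
print(best)
```

A vector store does the same comparison, only over many more vectors and with data structures that avoid scanning every entry.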

#### **Querying Stage**
**Retrievers**: A retriever defines how to efficiently retrieve relevant context from an index when given a query. Your retrieval strategy is key to the relevancy of the data retrieved and the efficiency with which it’s done.

**Routers**: A router determines which retriever will be used to retrieve relevant context from the knowledge base. More specifically, the RouterRetriever class is responsible for selecting one or multiple candidate retrievers to execute a query. It uses a selector to choose the best option based on each candidate’s metadata and the query.

**Node Postprocessors**: A node postprocessor takes in a set of retrieved nodes and applies transformations, filtering, or re-ranking logic to them.

**Response Synthesizers**: A response synthesizer generates a response from an LLM, using a user query and a given set of retrieved text chunks.
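The sketch below shows how a router and a node postprocessor fit together, using plain Python rather than the LlamaIndex classes. Everything in it is hypothetical: the two retriever functions return hard-coded, made-up scores, and the selector uses keyword overlap with each candidate's description, whereas a real selector typically asks an LLM to choose.

```python
# Toy router + postprocessor (hypothetical functions and scores, not LlamaIndex APIs).
import re


def words(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def essay_retriever(query):
    """Pretend retriever over essays: returns (text, similarity score) pairs."""
    return [("Paul Graham wrote essays about startups.", 0.82), ("He also painted.", 0.35)]


def wiki_retriever(query):
    """Pretend retriever over encyclopedia articles."""
    return [("Wikipedia is a free online encyclopedia.", 0.77)]


# Each candidate carries a description, which plays the role of its metadata.
CANDIDATES = [
    {"description": "essays written by Paul Graham", "retriever": essay_retriever},
    {"description": "general encyclopedia articles from Wikipedia", "retriever": wiki_retriever},
]


def select(query, candidates):
    """Selector: pick the candidate whose description shares the most words with the query."""
    return max(candidates, key=lambda c: len(words(c["description"]) & words(query)))


def similarity_cutoff(scored_nodes, cutoff):
    """Postprocessor: drop retrieved nodes whose score is below the cutoff."""
    return [(text, score) for text, score in scored_nodes if score >= cutoff]


def route_and_retrieve(query, cutoff=0.5):
    """Route the query to one retriever, then filter what it returns."""
    chosen = select(query, CANDIDATES)
    return similarity_cutoff(chosen["retriever"](query), cutoff)


result = route_and_retrieve("What essays did Paul Graham write?")
print(result)
```

The filtered list is what a response synthesizer would receive, together with the query, to produce the final answer.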

#### **Putting it all together**
There are endless use cases for data-backed LLM applications but they can be roughly grouped into three categories:

**Query Engines**: A query engine is an end-to-end pipeline that allows you to ask questions over your data. It takes in a natural language query, and returns a response, along with reference context retrieved and passed to the LLM.

**Chat Engines**: A chat engine is an end-to-end pipeline for having a conversation with your data (multiple back-and-forth instead of a single question-and-answer).

**Agents**: An agent is an automated decision-maker powered by an LLM that interacts with the world via a set of tools. Agents can take an arbitrary number of steps to complete a given task, dynamically deciding on the best course of action rather than following pre-determined steps. This gives them additional flexibility to tackle more complex tasks.

### Demonstration
#### Data preparation
Only two types of data are demonstrated in this notebook: unstructured documents (txt, pdf) and an API (Wikipedia).

In [3]:
# Download llama_index sample data, paul graham essays
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay1.txt'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay2.txt'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay3.txt'

--2024-01-05 15:58:00--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay1.txt’


2024-01-05 15:58:01 (1.31 MB/s) - ‘data/paul_graham/paul_graham_essay1.txt’ saved [75042/75042]

--2024-01-05 15:58:01--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 

### Prepare the LLM
By default, LlamaIndex uses OpenAI for both the LLM and the embedding model, so indexing asks for an OpenAI API key. To avoid that, we prepare a local LLM and a local embedding model before indexing.

In [34]:
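from llama_index import ServiceContext
from llama_index.llms import Ollama

# Use a Mixtral model served locally by Ollama, plus a local embedding model,
# so that neither the LLM nor the embeddings need an OpenAI key
llm = Ollama(model="mixtral")
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")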

config.json: 100%|██████████| 684/684 [00:00<00:00, 713kB/s]
model.safetensors: 100%|██████████| 133M/133M [00:05<00:00, 23.1MB/s] 
tokenizer_config.json: 100%|██████████| 366/366 [00:00<00:00, 149kB/s]
vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 5.44MB/s]
tokenizer.json: 100%|██████████| 711k/711k [00:00<00:00, 1.73MB/s]
special_tokens_map.json: 100%|██████████| 125/125 [00:00<00:00, 57.7kB/s]
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/seankan/Library/Caches/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [28]:
import torch
from transformers import BitsAndBytesConfig
from llama_index.prompts import PromptTemplate
# from llama_index.llms import HuggingFaceLLM
from llama_index.llms import Ollama


quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    )


def messages_to_prompt(messages):
  prompt = ""
  for message in messages:
    if message.role == 'system':
      prompt += f"<|system|>\n{message.content}</s>\n"
    elif message.role == 'user':
      prompt += f"<|user|>\n{message.content}</s>\n"
    elif message.role == 'assistant':
      prompt += f"<|assistant|>\n{message.content}</s>\n"

  # ensure we start with a system prompt, insert blank if needed
  if not prompt.startswith("<|system|>\n"):
    prompt = "<|system|>\n</s>\n" + prompt

  # add final assistant prompt
  prompt = prompt + "<|assistant|>\n"

  return prompt


# llm = HuggingFaceLLM(
#     model_name="HuggingFaceH4/zephyr-7b-beta",
#     tokenizer_name="HuggingFaceH4/zephyr-7b-beta",
#     query_wrapper_prompt=PromptTemplate("<|system|>\n</s>\n<|user|>\n{query_str}</s>\n<|assistant|>\n"),
#     context_window=3900,
#     max_new_tokens=256,
#     model_kwargs={"quantization_config": quantization_config},
#     # tokenizer_kwargs={},
#     generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
#     messages_to_prompt=messages_to_prompt,
#     # Changed to mps for apple silicon chips
#     device_map='mps',
# )

# Ollama serves an already-quantized model and applies the model's own prompt
# template, so the HuggingFaceLLM-specific arguments (quantization_config,
# generate_kwargs, max_new_tokens, device_map, the Zephyr prompt wrappers)
# are not passed here. bitsandbytes quantization also needs a CUDA GPU; without
# one it fails with "RuntimeError: No GPU found. A GPU is needed for quantization."
from llama_index import ServiceContext

llm = Ollama(
    model="mixtral",
    context_window=3900,
    temperature=0.7,
    # forwarded to Ollama as generation options
    additional_kwargs={"top_k": 50, "top_p": 0.95},
)
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")

### **Load the data into index**
The simplest way to use a vector store is to load a set of documents and build an index from them using `VectorStoreIndex.from_documents`.

In [9]:
pip install nbconvert

Defaulting to user installation because normal site-packages is not writeable
Collecting nbconvert
  Downloading nbconvert-7.14.0-py3-none-any.whl.metadata (7.7 kB)
Collecting bleach!=5.0.0 (from nbconvert)
  Downloading bleach-6.1.0-py3-none-any.whl.metadata (30 kB)
Collecting defusedxml (from nbconvert)
  Downloading defusedxml-0.7.1-py2.py3-none-any.whl (25 kB)
Collecting jinja2>=3.0 (from nbconvert)
  Downloading Jinja2-3.1.2-py3-none-any.whl (133 kB)
Collecting jupyterlab-pygments (from nbconvert)
  Downloading jupyterlab_pygments-0.3.0-py3-none-any.whl.metadata (4.4 kB)
Collecting markupsafe>=2.0 (from nbconvert)
  Downloading MarkupSafe-2.1.3-cp39-cp39-macosx_10_9_universal2.whl.metadata (3.0 kB)
Collecting mistune<4,>=2.0.3 (from nbconvert)
  Downloading mistune-3.0.2-py3-none-any.whl.metadata (1.7 kB)
Collecting nbclient>=0.5.0 (from nbconv

In [32]:
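from llama_index import ServiceContext

# Use None (Python has no `null`) to disable the LLM; an invalid value here can fail with
# "AttributeError: 'NotImplementedType' object has no attribute 'callback_manager'".
# With llm=None LlamaIndex substitutes a mock LLM, so this context is suitable
# for embedding and indexing only, not for answering queries.
service_context = ServiceContext.from_defaults(llm=None, embed_model="local:BAAI/bge-small-en-v1.5")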

In [11]:
import os.path

from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)

# check if storage already exists
PERSIST_DIR = "./storage"
if not os.path.exists(PERSIST_DIR):
    # load the documents and create the index.
    # The essays live in data/paul_graham/, so the reader has to recurse into
    # subdirectories; a non-recursive read of ./data fails with
    # "ValueError: No files found in data."
    documents = SimpleDirectoryReader("./data", recursive=True).load_data()
    # service_context comes from an earlier cell; passing it avoids the OpenAI
    # defaults. It must carry a real LLM (e.g. the Ollama one) for real answers.
    index = VectorStoreIndex.from_documents(documents, service_context=service_context)
    # store it for later
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # load the existing index
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context, service_context=service_context)

# either way we can now query the index
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
print(response)