### **LlamaIndex Notebook**
Created by Sean Kan, 5 Jan 2024

In [160]:
pip install llama-index transformers accelerate bitsandbytes matplotlib pypdf



### High Level Concepts
This is a quick guide to the high-level concepts you’ll encounter frequently when building LLM applications.

#### Retrieval Augmented Generation (RAG)
LLMs are trained on enormous bodies of data but they aren’t trained on your data. Retrieval-Augmented Generation (RAG) solves this problem by adding your data to the data LLMs already have access to. You will see references to RAG frequently in this documentation.

In RAG, your data is loaded and prepared for queries or “indexed”. User queries act on the index, which filters your data down to the most relevant context. This context and your query then go to the LLM along with a prompt, and the LLM provides a response.

Even if what you’re building is a chatbot or an agent, you’ll want to know RAG techniques for getting data into your application.
<img src="img/basic_rag.png" alt="rag" width="1000"/>
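The flow described above (index the data, retrieve the most relevant context for a query, then send context plus query to the LLM) can be sketched in plain Python. This is a toy, not LlamaIndex's implementation: it swaps embeddings for simple word overlap, the prompt wording is an assumption, the LLM call itself is left out, and `retrieve` and `build_prompt` are hypothetical helper names.

```python
import re


def tokenize(text):
    """Lower-case word tokens; a crude stand-in for real embeddings."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, chunks, top_k=1):
    """Rank indexed chunks by how many query words they share; keep the top_k."""
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, context_chunks):
    """Combine the retrieved context and the user query into one LLM prompt."""
    context = "\n".join(context_chunks)
    return (
        "Context information is below.\n"
        f"{context}\n"
        "Given the context and no prior knowledge, answer the query.\n"
        f"Query: {query}\nAnswer: "
    )


# "Indexing": in this toy the index is just a list of text chunks.
index = [
    "The author wrote short stories as a teenager.",
    "The author programmed an IBM 1401 in junior high school.",
    "Georgia Tech is located in Atlanta.",
]

query = "What did the author program on the IBM 1401?"
context = retrieve(query, index, top_k=1)
prompt = build_prompt(query, context)
print(prompt)  # this string is what would be sent to the LLM
```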


### **Stages within RAG**
There are five key stages within RAG, which in turn will be a part of any larger application you build. These are:

**Loading**: this refers to getting your data from where it lives – whether it’s text files, PDFs, another website, a database, or an API – into your pipeline. LlamaHub provides hundreds of connectors to choose from.

**Indexing**: this means creating a data structure that allows for querying the data. For LLMs this nearly always means creating vector embeddings, numerical representations of the meaning of your data, as well as numerous other metadata strategies to make it easy to accurately find contextually relevant data.

**Storing**: once your data is indexed you will almost always want to store your index, as well as other metadata, to avoid having to re-index it.

**Querying**: for any given indexing strategy there are many ways you can utilize LLMs and LlamaIndex data structures to query, including sub-queries, multi-step queries and hybrid strategies.

**Evaluation**: a critical step in any pipeline is checking how effective it is relative to other strategies, or when you make changes. Evaluation provides objective measures of how accurate, faithful and fast your responses to queries are.


<img src="img/rag_stage.png" alt="alt text" width="1000" />
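Evaluation can cover the generated answer (faithfulness, relevance) as well as latency; the sketch below shows only two common retrieval metrics, hit rate and mean reciprocal rank (MRR), computed by hand. LlamaIndex ships its own evaluation modules; this is not their implementation, and the evaluation data is made up for illustration.

```python
def hit_rate(results, expected):
    """Fraction of queries whose expected node id appears anywhere in the retrieved list."""
    hits = sum(1 for retrieved, gold in zip(results, expected) if gold in retrieved)
    return hits / len(expected)


def mean_reciprocal_rank(results, expected):
    """Average of 1/rank of the expected node id (0 when it was not retrieved)."""
    total = 0.0
    for retrieved, gold in zip(results, expected):
        if gold in retrieved:
            total += 1.0 / (retrieved.index(gold) + 1)
    return total / len(expected)


# Hypothetical evaluation set: node ids retrieved per query, and the node each query should find.
results = [["n2", "n1"], ["n3", "n4"], ["n5", "n6"]]
expected = ["n1", "n3", "n9"]

print(f"hit rate = {hit_rate(results, expected):.3f}")
print(f"MRR      = {mean_reciprocal_rank(results, expected):.3f}")
```

Hit rate only asks whether the right node was found; MRR also rewards finding it near the top of the list.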


### **Important concepts within each step**
There are also some terms you’ll encounter that refer to steps within each of these stages.

#### **Loading stage**
**Nodes and Documents**: A Document is a container around any data source, for instance a PDF, an API output, or data retrieved from a database. A Node is the atomic unit of data in LlamaIndex and represents a “chunk” of a source Document. Nodes have metadata that relate them to the document they are in and to other nodes.
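The Document/Node relationship can be sketched with two small dataclasses. `ToyDocument`, `ToyNode` and `split_into_nodes` are hypothetical names; LlamaIndex's own classes (`Document`, `TextNode`) and node parsers differ in detail, for example they chunk by tokens rather than by a fixed number of words.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyDocument:
    """Hypothetical container for one whole data source."""
    doc_id: str
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class ToyNode:
    """Hypothetical chunk of a ToyDocument, linked to its source and its neighbours."""
    node_id: str
    text: str
    source_doc_id: str
    metadata: dict
    prev_id: Optional[str] = None
    next_id: Optional[str] = None


def split_into_nodes(doc: ToyDocument, chunk_size: int = 5) -> List[ToyNode]:
    """Split a document into fixed-size word chunks and link each node to its neighbours."""
    words = doc.text.split()
    nodes = [
        ToyNode(
            node_id=f"{doc.doc_id}-{i // chunk_size}",
            text=" ".join(words[i:i + chunk_size]),
            source_doc_id=doc.doc_id,
            metadata=dict(doc.metadata),  # each node inherits a copy of the document metadata
        )
        for i in range(0, len(words), chunk_size)
    ]
    for prev, nxt in zip(nodes, nodes[1:]):
        prev.next_id = nxt.node_id
        nxt.prev_id = prev.node_id
    return nodes


doc = ToyDocument("essay", " ".join(f"w{i}" for i in range(12)), {"file_name": "essay.txt"})
for node in split_into_nodes(doc):
    print(node.node_id, node.prev_id, node.next_id, node.text)
```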

**Connectors**: A data connector (often called a Reader) ingests data from different data sources and data formats into Documents and Nodes.

#### **Indexing Stage**
**Indexes**: Once you’ve ingested your data, LlamaIndex will help you index the data into a structure that’s easy to retrieve. This usually involves generating vector embeddings which are stored in a specialized database called a vector store. Indexes can also store a variety of metadata about your data.

**Embeddings**: LLMs generate numerical representations of data called embeddings. When filtering your data for relevance, LlamaIndex will convert queries into embeddings, and your vector store will find data that is numerically similar to the embedding of your query.

#### **Querying Stage**
**Retrievers**: A retriever defines how to efficiently retrieve relevant context from an index when given a query. Your retrieval strategy is key to the relevancy of the data retrieved and the efficiency with which it’s done.

**Routers**: A router determines which retriever will be used to retrieve relevant context from the knowledge base. More specifically, the RouterRetriever class is responsible for selecting one or multiple candidate retrievers to execute a query. It uses a selector to choose the best option based on each candidate’s metadata and the query.
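A router's selection step can be sketched as "pick the candidate whose metadata best matches the query". LlamaIndex's own selectors typically prompt an LLM to make this choice; this toy swaps in word overlap between the query and each candidate's description, and the candidate names, descriptions and retriever functions are hypothetical.

```python
import re


def words(text):
    """Lower-case word set used for crude matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_retriever(query, candidates):
    """Toy selector: choose the candidate whose description shares the most words with the query."""
    q = words(query)
    return max(candidates, key=lambda c: len(q & words(c["description"])))


def essay_retriever(query):
    return ["chunk from the essay index"]


def wiki_retriever(query):
    return ["chunk from the Wikipedia index"]


candidates = [
    {"name": "essays", "description": "Paul Graham essays about programming and startups", "retrieve": essay_retriever},
    {"name": "universities", "description": "Wikipedia pages about universities in Georgia and Massachusetts", "retrieve": wiki_retriever},
]

chosen = select_retriever("Which universities are in Georgia?", candidates)
print(chosen["name"], chosen["retrieve"]("Which universities are in Georgia?"))
```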

**Node Postprocessors**: A node postprocessor takes in a set of retrieved nodes and applies transformations, filtering, or re-ranking logic to them.
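Postprocessors are easiest to picture as functions from a list of scored nodes to a new list, applied in sequence. The sketch below (all names hypothetical, not LlamaIndex's classes) chains a similarity cutoff with a sort-and-truncate step; a real re-ranker would compute fresh scores, for example with a cross-encoder or an LLM, rather than reuse the retriever's.

```python
from dataclasses import dataclass
from functools import partial


@dataclass
class ScoredNode:
    text: str
    score: float  # similarity score assigned by the retriever


def similarity_cutoff(nodes, cutoff):
    """Drop nodes whose retrieval score is below the cutoff."""
    return [n for n in nodes if n.score >= cutoff]


def keyword_filter(nodes, required):
    """Keep only nodes that mention every required keyword (case-insensitive)."""
    return [n for n in nodes if all(k.lower() in n.text.lower() for k in required)]


def keep_top_n(nodes, top_n):
    """Sort by existing score and truncate; a real re-ranker would rescore first."""
    return sorted(nodes, key=lambda n: n.score, reverse=True)[:top_n]


def postprocess(nodes, postprocessors):
    """Apply each postprocessor in order, feeding the output of one into the next."""
    for step in postprocessors:
        nodes = step(nodes)
    return nodes


retrieved = [
    ScoredNode("Heathkit computer kit", 0.40),
    ScoredNode("TRS-80 games and a word processor", 0.75),
    ScoredNode("IBM 1401 programming with punch cards", 0.82),
]
pipeline = [partial(similarity_cutoff, cutoff=0.5), partial(keep_top_n, top_n=1)]
print(postprocess(retrieved, pipeline))
```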

**Response Synthesizers**: A response synthesizer generates a response from an LLM, using a user query and a given set of retrieved text chunks.
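This notebook later queries with `response_mode="refine"` and `response_mode="compact"`. Roughly, "refine" makes one LLM call per retrieved chunk, passing the previous answer forward to be improved, while "compact" first packs as many chunks as fit into each prompt and then refines over those packs, so it needs fewer calls. The sketch below imitates that difference with a fake LLM; it limits prompts by characters rather than tokens, and the prompt wording and `fake_llm` are assumptions.

```python
def refine(query, chunks, llm):
    """'refine' style: one LLM call per chunk, each improving the previous answer."""
    answer = None
    for chunk in chunks:
        if answer is None:
            prompt = f"Context: {chunk}\nQuestion: {query}\nAnswer:"
        else:
            prompt = (
                f"Existing answer: {answer}\nNew context: {chunk}\n"
                f"Refine the existing answer to the question: {query}\nAnswer:"
            )
        answer = llm(prompt)
    return answer


def compact(query, chunks, llm, max_chars):
    """'compact' style: pack chunks into as few prompts as fit, then refine over the packs."""
    packs, current = [], ""
    for chunk in chunks:
        candidate = f"{current}\n{chunk}" if current else chunk
        if current and len(candidate) > max_chars:
            packs.append(current)  # current pack is full; start a new one
            current = chunk
        else:
            current = candidate
    if current:
        packs.append(current)
    return refine(query, packs, llm)


calls = []


def fake_llm(prompt):
    """Hypothetical stand-in for an LLM: records the prompt and returns a numbered draft."""
    calls.append(prompt)
    return f"draft {len(calls)}"


chunks = ["aaaa", "bbbb", "cccc", "dddd"]
print(refine("q", chunks, fake_llm), len(calls))  # one call per chunk
calls.clear()
print(compact("q", chunks, fake_llm, max_chars=9), len(calls))  # two packed calls
```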

#### **Putting it all together**
There are endless use cases for data-backed LLM applications but they can be roughly grouped into three categories:

**Query Engines**: A query engine is an end-to-end pipeline that allows you to ask questions over your data. It takes in a natural language query, and returns a response, along with reference context retrieved and passed to the LLM.

**Chat Engines**: A chat engine is an end-to-end pipeline for having a conversation with your data (multiple back-and-forth instead of a single question-and-answer).

**Agents**: An agent is an automated decision-maker powered by an LLM that interacts with the world via a set of tools. Agents can take an arbitrary number of steps to complete a given task, dynamically deciding on the best course of action rather than following pre-determined steps. This gives agents additional flexibility to tackle more complex tasks.

### Demonstration
#### Data preparation
Only two types of data are demonstrated in this notebook: unstructured documents (TXT, PDF) and web/API sources (web pages by URL, and Wikipedia).

In [None]:
# Download the llama_index sample data (Paul Graham's essay), saved as three identical copies
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay1.txt'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay2.txt'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay3.txt'


### Prepare the LLM
By default, LlamaIndex uses OpenAI for both the LLM and the embedding model, so indexing requires an OpenAI API key. To avoid that, we prepare a local, customised LLM before indexing.

In [2]:
# import torch
# from transformers import BitsAndBytesConfig
# from llama_index.prompts import PromptTemplate
# from llama_index.llms import HuggingFaceLLM
from llama_index.llms import Ollama

# quantization_config = BitsAndBytesConfig(
#     # load_in_4bit=True,
#     bnb_4bit_compute_dtype=torch.float16,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_use_double_quant=True,
#     )

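# Convert a list of chat messages into Zephyr's chat format:
# <|system|>, <|user|> and <|assistant|> tags, each turn closed by </s>
def messages_to_prompt(messages):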
  prompt = ""
  for message in messages:
    if message.role == 'system':
      prompt += f"<|system|>\n{message.content}</s>\n"
    elif message.role == 'user':
      prompt += f"<|user|>\n{message.content}</s>\n"
    elif message.role == 'assistant':
      prompt += f"<|assistant|>\n{message.content}</s>\n"

  # ensure we start with a system prompt, insert blank if needed
  if not prompt.startswith("<|system|>\n"):
    prompt = "<|system|>\n</s>\n" + prompt

  # add final assistant prompt
  prompt = prompt + "<|assistant|>\n"

  return prompt


# llm = HuggingFaceLLM(
#     model_name="HuggingFaceH4/zephyr-7b-beta",
#     tokenizer_name="HuggingFaceH4/zephyr-7b-beta",
#     query_wrapper_prompt=PromptTemplate("<|system|>\n</s>\n<|user|>\n{query_str}</s>\n<|assistant|>\n"),
#     context_window=3900,
#     max_new_tokens=256,
#     model_kwargs={"quantization_config": quantization_config},
#     # tokenizer_kwargs={},
#     generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
#     messages_to_prompt=messages_to_prompt,
#     # # Changed to mps for apple silicon chips
#     # device_map='mps',
#     # Normal GPU run
#     device_map='auto',
# )

# # llama2 model
# llm = Ollama(
#   model="llama2",
#   context_window=3900,
#   messages_to_prompt=messages_to_prompt,
#   temperature = 0.7,
#   additional_kwargs={"top_k": 50, "top_p": 0.95}
#   )

# # zephyr model
# llm = Ollama(
#   model="zephyr",
#   context_window=3900,
#   messages_to_prompt=messages_to_prompt,
#   temperature = 0.7,
#   additional_kwargs={"top_k": 50, "top_p": 0.95}
#   )

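# Mixtral 8x7B model, served locally by Ollama
# (assumes the Ollama server is running and the model was pulled with `ollama pull mixtral`)
# Note: messages_to_prompt above follows Zephyr's chat format, not Mixtral's [INST] format
llm = Ollama(
  model="mixtral",
  context_window=3900,
  messages_to_prompt=messages_to_prompt,
  temperature=0.7,
  additional_kwargs={"top_k": 50, "top_p": 0.95},
)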



### **Indexing**
With your data loaded, you now have a list of Document objects (or a list of Nodes). It’s time to build an Index over these objects so you can start querying them.

#### **What is an Index?**
In LlamaIndex terms, an Index is a data structure composed of Document objects, designed to enable querying by an LLM. Your Index is designed to be complementary to your querying strategy.

LlamaIndex offers several different index types. We’ll cover the two most common here.

#### **Vector Store Index**
A VectorStoreIndex is by far the most frequent type of Index you’ll encounter. The Vector Store Index takes your Documents and splits them up into Nodes. It then creates vector embeddings of the text of every node, ready to be queried by an LLM.

#### **What is an embedding?**
Vector embeddings are central to how LLM applications function.

A vector embedding, often just called an embedding, is a numerical representation of the semantics, or meaning of your text. Two pieces of text with similar meanings will have mathematically similar embeddings, even if the actual text is quite different.

This mathematical relationship enables semantic search, where a user provides query terms and LlamaIndex can locate text that is related to the meaning of the query terms rather than simple keyword matching. This is a big part of how Retrieval-Augmented Generation works, and how LLMs function in general.
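"Mathematically similar" usually means a high cosine similarity between the two vectors. The sketch below computes it by hand; the three-dimensional vectors are made-up numbers chosen for illustration, not output from a real embedding model (which would produce hundreds of dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Made-up 3-dimensional "embeddings" for illustration only.
embeddings = {
    "The cat sat on the mat.": [0.9, 0.1, 0.0],
    "A kitten rested on the rug.": [0.85, 0.15, 0.05],
    "Quarterly revenue rose 5%.": [0.0, 0.2, 0.95],
}

query = embeddings["The cat sat on the mat."]
for text, vec in embeddings.items():
    print(f"{cosine_similarity(query, vec):.3f}  {text}")
```

The two sentences about a cat share almost no words, yet their (illustrative) vectors point the same way; that is the property semantic search relies on.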

There are many types of embeddings, and they vary in efficiency, effectiveness and computational cost. By default LlamaIndex uses text-embedding-ada-002, which is the default embedding used by OpenAI. If you are using different LLMs you will often want to use different embeddings.

#### **Vector Store Index embeds your documents**
Vector Store Index turns all of your text into embeddings using an API from your LLM; this is what is meant when we say it “embeds your text”. If you have a lot of text, generating embeddings can take a long time since it involves many round-trip API calls.

When you want to search your embeddings, your query is itself turned into a vector embedding, and then a mathematical operation is carried out by VectorStoreIndex to rank all the embeddings by how semantically similar they are to your query.

#### **Top K Retrieval**
Once the ranking is complete, VectorStoreIndex returns the most-similar embeddings as their corresponding chunks of text. The number of embeddings it returns is known as k, so the parameter controlling how many embeddings to return is known as top_k. This whole type of search is often referred to as “top-k semantic retrieval” for this reason.

Top-k retrieval is the simplest form of querying a vector index; you will learn about more complex and subtler strategies when you read the querying section.
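Top-k retrieval itself is a small operation: score every stored vector against the query and keep the k best. The sketch below normalises vectors to unit length so that a dot product equals cosine similarity, and uses `heapq.nlargest` to keep the top k. The vectors and node ids are hypothetical, and vector databases often use approximate nearest-neighbour indexes instead of scanning every vector. In LlamaIndex the equivalent knob is passed as `index.as_query_engine(similarity_top_k=...)`.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]


def top_k_retrieve(query_vec, node_vecs, top_k=2):
    """Score every stored vector against the query and return the top_k (score, node_id) pairs."""
    q = normalize(query_vec)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(vec))), node_id)
        for node_id, vec in node_vecs.items()
    )
    return heapq.nlargest(top_k, scored)


# Hypothetical 2-dimensional embeddings for three nodes.
node_vecs = {"n1": [1.0, 0.0], "n2": [0.8, 0.6], "n3": [0.0, 1.0]}
results = top_k_retrieve([1.0, 0.1], node_vecs, top_k=2)
for score, node_id in results:
    print(node_id, round(score, 3))
```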

### **Load the data into index**
The simplest way to use a Vector Store is to load a set of documents and build an index from them using `from_documents`.
### **Creating LlamaIndex Index**
The core essence of LlamaIndex lies in its ability to build structured indices over ingested data, represented as either Documents or Nodes. This indexing facilitates efficient querying over the data. Let's delve into how to build indices with both Document and Node objects, and what happens under the hood during this process.

Different types of indices in LlamaIndex handle data in distinct ways:

1. **Summary Index**: Stores Nodes as a sequential chain, and during query time, all Nodes are loaded into the Response Synthesis module if no other query parameters are specified.

1. **Vector Store Index**: Stores each Node and a corresponding embedding in a Vector Store, and queries involve fetching the top-k most similar Nodes.

1.  **Tree Index**: Builds a hierarchical tree from a set of Nodes, and queries involve traversing from root nodes down to leaf nodes.

1. **Keyword Table Index**: Extracts keywords from each Node to build a mapping, and queries extract relevant keywords to fetch corresponding Nodes.

To choose an index, evaluate the LlamaIndex module guides and pick the index type that fits your use case.
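Of the index types listed above, the Keyword Table Index is the easiest to picture without embeddings: a mapping from each keyword to the nodes containing it. The sketch below is a toy; LlamaIndex can extract keywords with an LLM, whereas this swaps in a regex plus a small hypothetical stopword list, and the node texts are made up.

```python
import re
from collections import defaultdict

# A tiny, hypothetical stopword list; real extractors use larger lists or an LLM.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to", "was", "what", "which", "who"}


def extract_keywords(text):
    """Lower-case words minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def build_keyword_table(nodes):
    """Map each keyword to the ids of the nodes that contain it."""
    table = defaultdict(set)
    for node_id, text in nodes.items():
        for keyword in extract_keywords(text):
            table[keyword].add(node_id)
    return table


def query_keyword_table(table, query):
    """Return the ids of all nodes sharing at least one keyword with the query."""
    matched = set()
    for keyword in extract_keywords(query):
        matched |= table.get(keyword, set())
    return sorted(matched)


nodes = {
    "n1": "The author programmed an IBM 1401.",
    "n2": "The author bought a TRS-80.",
    "n3": "MIT is in Massachusetts.",
}
table = build_keyword_table(nodes)
print(query_keyword_table(table, "Which computer was the IBM 1401?"))
print(query_keyword_table(table, "What did the author do?"))
```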



In [None]:
# pip install nbconvert



In [3]:
from llama_index import ServiceContext

# Use the local Ollama LLM and a locally run embedding model instead of OpenAI's defaults
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")



In [3]:
# Check whether a CUDA GPU is available and print its details
import torch

use_cuda = torch.cuda.is_available()
if use_cuda:
    print('__CUDNN VERSION:', torch.backends.cudnn.version())
    print('__Number CUDA Devices:', torch.cuda.device_count())
    print('__CUDA Device Name:', torch.cuda.get_device_name(0))
    print('__CUDA Device Total Memory [GB]:', torch.cuda.get_device_properties(0).total_memory / 1e9)
    print('__CUDA Max Memory Allocated [GB]:', torch.cuda.max_memory_allocated() / 1e9)

__CUDNN VERSION: 8902
__Number CUDA Devices: 1
__CUDA Device Name: NVIDIA GeForce RTX 3070 Laptop GPU
__CUDA Device Total Memory [GB]: 8.589410304
__CUDA Max Memory Allocated [GB]: 0.133448192


In [4]:
import os.path
from llama_index.response.notebook_utils import display_response

from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)

# check if storage already exists
PERSIST_DIR = "./storage"
if not os.path.exists(PERSIST_DIR):
    # load the documents and create the index
    documents = SimpleDirectoryReader("data/paul_graham/").load_data()
    index = VectorStoreIndex.from_documents(documents, service_context=service_context)
    # store it for later
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # load the existing index
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context, service_context=service_context)

# either way we can now query the index
query_engine = index.as_query_engine(response_mode = "compact")
response = query_engine.query("What did the author do growing up?")
display_response(response)


In [15]:
query_engine = index.as_query_engine(response_mode = "refine")
response = query_engine.query("What did the author do growing up?")
display_response(response)

**`Final Response:`** The author's hobbies outside of school before college were writing short stories and programming on an IBM 1401 in their junior high school's basement. However, due to limitations of the machine without input or storage capabilities, the author couldn't remember any programs they wrote for the IBM 1401 as it had limited options without data stored on punch cards. The author was more impressed watching a friend build his own computer with a Heathkit kit and later convinced their father to buy them a TRS-80 microcomputer, which they used to program simple games, predict how high their model rockets would fly, and create a word processor for their father to write books. The author didn't initially plan to study programming in college but switched from philosophy to AI after finding the latter subject more interesting due to its practical applications shown in novels and documentaries like Heinlein's "The Moon is a Harsh Mistress" and a PBS documentary featuring Terry Winograd using SHRDLU.

#### **Demonstration of web and API data**
Web pages (by URL) and Wikipedia

In [None]:
from llama_index.readers import BeautifulSoupWebReader

url = "https://www.scmp.com/business/china-business/article/3247567/chinese-ev-maker-geely-introduces-first-pure-electric-galaxy-model-woo-mainstream-buyers-byd-foreign"

documents = BeautifulSoupWebReader().load_data([url])

In [None]:
from llama_index import SummaryIndex

index = SummaryIndex.from_documents(documents, service_context=service_context)

In [None]:
query_engine = index.as_query_engine(response_mode = "compact")
response = query_engine.query("summarise the context in bullet points")
print(response)

- Geely, a major Chinese private carmaker, has released its first fully electric sedan under the Galaxy brand for mass-market buyers.
- The Galaxy E8 starts at 175,800 yuan (approximately US$24,752) and boasts a driving range of 550 km, making it cheaper and more efficient than BYD's Han model.
- Geely aims to offer seven models under the Galaxy brand by 2025 for budget-conscious consumers, with a focus on safety, design, performance, and intelligence.
- The E8 uses cutting-edge technology from companies like Qualcomm and BOE Technology, including a 45-inch screen and a Snapdragon 8295 chip for intelligent features.
- Geely's parent company, Zhejiang Geely Holding Group, also owns other well-known marques such as Volvo, Lotus, and Lynk.
- The release of the Galaxy E8 comes amid intensifying competition in China's electric car market, with rival companies like BYD and Tesla competing for dominance.
- Chinese EV sales are expected to grow by 20% in 2024, according to a Fitch Ratings repo

In [None]:
query_engine = index.as_query_engine(response_mode = "refine")
response = query_engine.query("summarise the context in bullet points")
print(response)

- Geely Automobile Group, a major private carmaker in China, has introduced its first pure electric sedan under the Galaxy brand for mainstream buyers.
- The new model, called E8, costs nearly $25,000 and has a driving range of 550km.
- This is less expensive than BYD's Han EV, which sells for almost $30,000 and has a driving range of 506km.
- The E8 is more affordable than Geely's Zeekr-branded electric cars, which target affluent buyers and compete with premium models made by companies such as Tesla.
- The company aims to offer seven Galaxy branded models by 2025 and has a nearly 6% share of China's EV market.
- Battery-swapping technology is being promoted by Geely's parent company and Shanghai-based Nio to address the lack of charging infrastructure for electric cars in China.


#### **Demo of reading Chinese website**
When llama_index loads a Chinese-language web page, no OCR or translation takes place. `BeautifulSoupWebReader` downloads the HTML and extracts its text, which is already machine-readable Unicode, so the Chinese characters are stored in the Documents as-is. (OCR is only needed when the text exists solely as an image, such as a scanned PDF.)

At query time the Chinese text is placed in the prompt, and the LLM (Mixtral in this notebook, or Zephyr in the commented-out configuration) reads it directly and answers in English. How well this works depends on how much Chinese the model saw during training: a mostly English model can often produce a usable summary, but it may mistranslate names and terms or miss nuances, so a Chinese or multilingual LLM will usually be more accurate. Embedding-based retrieval with an English embedding model can also be weaker on Chinese text; the `SummaryIndex` used below avoids similarity ranking altogether because it passes every node to the LLM.

In [161]:
from llama_index.readers import BeautifulSoupWebReader

url = "https://news.mingpao.com/pns/要聞/article/20240108/s00001/1704651224542/15立會議員去年質詢不足5次-「零蛋」蘇長荣-值得問的已藉其他方式表達"

documents = BeautifulSoupWebReader().load_data([url])

In [162]:
from llama_index import SummaryIndex

index = SummaryIndex.from_documents(documents, service_context=service_context)

In [163]:
query_engine = index.as_query_engine(response_mode = "refine")
response = query_engine.query("summarise the context in bullet points")
display_response(response)

**`Final Response:`** In summary, 15 out of 120 legislators in the previous year did not raise enough questions during official parliamentary sessions, with Legislator Su Chao-cheng from the opposition party being the one who made the fewest number of inquiries. Some lawmakers have communicated their concerns through other methods instead of asking questions during these sessions, resulting in low numbers of questions raised. This reflects poorly on legislators' oversight functions, and it is suggested that both the quantity and quality of queries should be assessed to avoid unnecessary duplication. The article also briefly mentions public service fees for users as a possible area for further scrutiny. Some issues received multiple inquiries from lawmakers, including development and works projects initiated by New People's Party Chairperson Rachel Ngai Yuen-kwan and the medical policies being pursued by Legislator Gilbert Chen Ka-fai. However, political analyst Liu Tak-shia cautioned that low numbers of questions reflect poorly on legislators' oversight functions, and it is recommended that lawmakers demonstrate genuine engagement with policy issues being debated in parliament to avoid superficial or repetitive queries that waste both parliamentary time and resources.

### **Sub Question Query Engine Demo**
Use a sub question query engine to tackle the problem of answering a complex query using multiple data sources.
It first breaks down the complex query into sub questions for each relevant data source, then gathers all the intermediate responses and synthesizes a final response.
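The decompose, answer, synthesize pattern can be sketched without an LLM. In `SubQuestionQueryEngine` the LLM generates the sub-questions and writes the final answer; here the sub-questions are written by hand, the "tools" are hypothetical lambdas standing in for per-source query engines, and the synthesizer just concatenates the answers. With `use_async=True` the engine can run the sub-questions concurrently; this sketch runs them one after another.

```python
def sub_question_query(query, sub_questions, tools, synthesize):
    """Answer each (tool_name, sub_question) pair with its tool, then combine the answers."""
    qa_pairs = [(sub_q, tools[tool_name](sub_q)) for tool_name, sub_q in sub_questions]
    return synthesize(query, qa_pairs)


def join_answers(query, qa_pairs):
    """Hypothetical synthesizer: a real engine would send these pairs to the LLM."""
    lines = [f"Q: {sub_q}\nA: {answer}" for sub_q, answer in qa_pairs]
    return f"Original question: {query}\n" + "\n".join(lines)


# Hypothetical tools standing in for per-source query engines.
tools = {
    "uga": lambda q: "The University of Georgia is in Athens, Georgia.",
    "mit": lambda q: "MIT is in Cambridge, Massachusetts.",
}
# Written by hand here; SubQuestionQueryEngine asks the LLM to generate these.
sub_questions = [
    ("uga", "Where is the University of Georgia located?"),
    ("mit", "Where is MIT located?"),
]

print(sub_question_query("Where are UGA and MIT located?", sub_questions, tools, join_answers))
```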

In [None]:
from llama_index import download_loader

WikipediaReader = download_loader("WikipediaReader")

loader = WikipediaReader()
wikipedia_documents = loader.load_data(pages=['University of Georgia', 'Georgia Tech', 'Massachusetts Institute of Technology'])



In [None]:
vector_query_engine = VectorStoreIndex.from_documents(
    wikipedia_documents, service_context=service_context
).as_query_engine()

In [None]:
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.query_engine import SubQuestionQueryEngine
import nest_asyncio
nest_asyncio.apply()

query_engine_tools = [
    QueryEngineTool(
        query_engine=vector_query_engine,
        metadata=ToolMetadata(
            name="universities",
            description="Wikipedia pages about the universities - University of Georgia, Georgia Tech, Massachusetts Institute of Technology.",
        ),
    ),
]

query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    service_context=service_context,
    use_async=True,
)

In [None]:
response = query_engine.query(
    "Give me all similarities between University of Georgia, Georgia Tech, Massachusetts Institute of Technology"
)

Generated 6 sub questions.
[universities] Q: In what states are University of Georgia, Georgia Tech, and Massachusetts Institute of Technology located?
[universities] A: The University of Georgia is located in the state of Georgia, while Georgia Tech is also located in the state of Georgia. The Massachusetts Institute of Technology (MIT) is located in the state of Massachusetts.
[universities] Q: What types of programs do University of Georgia, Georgia Tech, and Massachusetts Institute of Technology offer?
[universities] A: The University of Georgia offers programs at the baccalaureate, master's, and doctoral levels in various fields such as the arts and humanities, business, education, agriculture, and environmental sciences.

Georgia Institute of Technology (Georgia Tech) primarily focuses on science, technology, engineering, and mathematics (STEM) fields at both undergraduate and grad

In [None]:
display_response(response)

**`Final Response:`** 1. Located in the United States: All three universities are physically located within the boundaries of the United States.
2. Research Universities: Both University of Georgia and Massachusetts Institute of Technology (MIT) have strong research programs, while Georgia Tech primarily focuses on science, technology, engineering, and mathematics (STEM) fields at both undergraduate and graduate levels, with master's-level courses in Electrical and Computer Engineering, Computer Science, and Mechanical Engineering, as well as Ph.D. coursework in Electrical and Computer Engineering and Mechanical Engineering.
3. Members of Association of American Universities: Both MIT and UGA have been members of the Association of American Universities since 1900 and 1928, respectively. This indicates that all three universities are recognized as leading institutions in their respective states and across the country in terms of academic excellence and research output.
4. Notable Alumni: All three universities have produced notable alumni in various fields such as politics, business, sports, entertainment, academia, and military service, some of whom have received prestigious awards like Nobel Prizes, Turing Awards, National Medals of Science, etc.
5. Rankings: While the rankings may vary from year to year, all three universities have consistently been ranked among the top research universities in their respective states and nationally by various publications such as U.S. News & World Report, Times Higher Education, QS World University Rankings, and Academic Ranking of World Universities (ARWU).
6. State Affiliations: All three universities are affiliated with their respective states in which they are located - the University of Georgia is located in the state of Georgia, while Georgia Tech and MIT are both located in the state of Massachusetts. This may indicate that all three universities have a strong influence on the academic and research communities within their respective states, as well as contribute to the economic and cultural development of those states through various initiatives, partnerships, and collaborations with industry, government, and other academic institutions.

### **Demo of loading PDFs**

In [None]:
from llama_index import SimpleDirectoryReader

reader = SimpleDirectoryReader(
    "data/annual_reports/"
)

pdf_documents = reader.load_data()



In [None]:
vector_query_engine = VectorStoreIndex.from_documents(
    pdf_documents, service_context=service_context
).as_query_engine()

query_engine_tools = [
    QueryEngineTool(
        query_engine=vector_query_engine,
        metadata=ToolMetadata(
            name="annual reports",
            description="Annual reports of three different property developers in Hong Kong for the year 2022.",
        ),
    ),
]

query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    service_context=service_context,
    use_async=True,
)

In [None]:
response = query_engine.query(
    "Compare the three different companies in terms of profitability, which one has the highest return in year 2022"
)

Generated 4 sub questions.
[annual reports] Q: What is the profitability of Company A as mentioned in the annual reports?
[annual reports] A: To determine the profitability of Company A based on the provided context information, we need to analyze the financial statements presented in the annual reports.

Firstly, we can see that Company A's profit before taxation for the year 201x is stated as 9,214.5 in page 235 of the first report and 10,332.5 in page 236 of the second report. After deducting the applicable taxes, Company A's profit for the year in 201x is 4,301.8 (page 235) and 4,670.9 (page 236).

Comparing these figures across both years, we can see that Company A had higher profits in the second year. Specifically, in 201x, Company A's profit increased by approximately 670.1 (page 236 - page 235) or 15.4% (page 236 profit / page 235 profit).

Therefore, based on the financial information provided in these annual reports, Company A'

In [None]:
display_response(response)

**`Final Response:`** Based on the provided context information, it is not possible to compare the profitability and determine which company among Company A, Company B, and Company C has the highest return in year 2022. While we have profitability figures for Companies A and C (CK Hutchison Holdings Limited) from their respective annual reports, there is no information provided about Company B's profitability in 2022. Additionally, it is unclear if there is a mistake in labeling Company C's data in the income statement presented for Company A's 2022 report or if there is another Company C that was not mentioned in the provided context information. Without further contextual information, it is best to assume that the profitability figures provided are for Company C (CK Hutchison Holdings Limited) and not another company with a similar name. Therefore, to compare the three different companies in terms of profitability and determine which one has the highest return in year 2022, more context or additional information is required.
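The engine printed its plan ("Generated 4 sub questions") before answering. Conceptually, a sub-question engine asks the LLM to split the query into sub-questions, each routed to a named tool, answers every sub-question with that tool's query engine, and then synthesizes a final answer from the sub-answers. The toy sketch below mimics only that control flow in plain Python; `decompose`, `synthesize` and the dictionary of tools are hypothetical stand-ins for LLM calls and query engines, not LlamaIndex APIs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SubQuestion:
    tool_name: str  # which tool (query engine) should answer it
    question: str


def decompose(query: str, tool_names: List[str]) -> List[SubQuestion]:
    """Hypothetical stand-in for the LLM question generator:
    here we simply send the same question to every tool."""
    return [SubQuestion(name, query) for name in tool_names]


def synthesize(query: str, answers: List[str]) -> str:
    """Hypothetical stand-in for the final LLM synthesis step."""
    return f"{query} -> " + " | ".join(answers)


def sub_question_query(query: str, tools: Dict[str, Callable[[str], str]]) -> str:
    # 1. Plan: split the query into sub-questions, each tagged with a tool name.
    sub_questions = decompose(query, list(tools))
    # 2. Answer each sub-question with the tool it was routed to.
    answers = [tools[sq.tool_name](sq.question) for sq in sub_questions]
    # 3. Combine the sub-answers into one response.
    return synthesize(query, answers)


tools = {
    "shkp_report": lambda q: "SHKP answer",  # hypothetical per-report tools
    "nwd_report": lambda q: "NWD answer",
}
print(sub_question_query("Compare profitability", tools))
# Compare profitability -> SHKP answer | NWD answer
```

The demo above registers a single tool covering all three reports, so every sub-question goes to the same index and the sub-questions end up asking about a generic "Company A". Registering one tool per report, with a description naming the company, is a common alternative that gives the decomposer something concrete to route between; we have not run that variant here.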

In [None]:
response = query_engine.query(
    "Who are the leaders of Sun Hung Kai, New World Development Company and CK Hutchison?"
)

Generated 1 sub questions.
[annual reports] Q: Which property developer's annual report should I refer to in order to identify the leaders of Sun Hung Kai?
[annual reports] A: The query asks which property developer's annual report should be referred to in order to identify the leaders of Sun Hung Kai. Based on the context information provided, it is clear that the annual report being referred to is for Sun Hung Kai Properties Limited (SHKPAR). Therefore, the relevant section from their annual report should be consulted to learn about the company's Board of Directors and Committees, which will provide information on the leaders of Sun Hung Kai.

In [None]:
display_response(response)

**`Final Response:`** To identify the leaders of Sun Hung Kai, based on the context information provided, we need to refer to their annual report. The query does not specify which year's annual report should be consulted, so we can assume it is for the most recent financial year. As per the latest annual report of Sun Hung Kai Properties Limited (SHKPAR), as of 2021, the Board of Directors consists of:

- Ronnie C. Chan - Chairman
- Raymond Kwok Ki-sun - Managing Director
- Cindy S. K. Chan - Executive Director
- Alexis Lau Ka-fai - Independent Non-executive Director
- Paul Y.S. Chu - Independent Non-executive Director
- Michael W.T. Fung - Independent Non-executive Director
- Rebecca R. Y. Cheung - Independent Non-executive Director
- Joseph K. Ng Ka-ki - Senior Vice President and Chief Financial Officer
- Tommy Lai Ming-kam - Executive Vice President
- Terence Chan Tze-leung - Executive Vice President

Regarding New World Development Company, we can also refer to their latest annual report, as per the context information provided. As of 2021, according to their annual report, the Board of Directors consists of:

- Cheng Kar-shun (KC) - Chairman
- Adrian Cheng Chi-kai - Managing Director and Executive Director
- Kenneth Wong Kin-keung - Deputy Managing Director and Executive Director
- John C. Chau Chor-kiu - Independent Non-executive Director
- Joseph Fan Yu-ming - Senior Independent Non-executive Director
- Florence Po Chiu-lan - Independent Non-executive Director
- Peter Lee Ka-kit - Independent Non-executive Director
- Henry Lau Yip-shing - Deputy Chairman and Executive Director (Property) of NWD's subsidiary New World China Land Limited

For CK Hutchison, the context information provided does not specify if it is referring to their overall group or a specific company within the conglomerate. If we assume that the query refers to the parent company CK Hutchison Holdings Limited (CKHH), according to their latest annual report as of 2021, the Board of Directors consists of:

- Victor Li Tzar-kuoi - Chairman and Managing Director
- Canning Fok Kin-ning - Deputy Managing Director and Executive Director
- Allan Zeman Kung - Independent Non-executive Director
- Henry Lau Yip-shing - Senior Independent Non-executive Director
- Dato' Sri Lau Ban Seng - Independent Non-executive Director
- Florence Po Chiu-lan - Independent Non-executive Director
- Joseph Fan Yu-ming - Senior Independent Non-executive Director
- Peter Lee Ka-kit - Independent Non-executive Director

I hope this helps!

### **Demo of Different Documents inside the same folder**

In [25]:
import os

from llama_index import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.response.notebook_utils import display_response

# check if storage already exists
PERSIST_DIR = "./storage"
if not os.path.exists(PERSIST_DIR):
    # load the documents and create the index
    documents = SimpleDirectoryReader("data/Mixed/").load_data()
    index = VectorStoreIndex.from_documents(documents, service_context=service_context)
    # store it for later
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # load the existing index
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context, service_context=service_context)

# either way we can now query the index; "refine" builds the answer chunk by chunk
query_engine = index.as_query_engine(response_mode="refine")
response = query_engine.query("What did the Geely Automobile Group do in China? provide the response along the source file and its filepath")
display_response(response)

**`Final Response:`** The Geely Automobile Group introduced their first pure electric Galaxy model in China with the aim of attracting budget-sensitive mainstream buyers from competitors like BYD and foreign brands. The E8 sedan, priced at 175,800 yuan (US$24,752), has a driving range of 550 kilometers and is positioned as an ideal model to replace both existing petrol and electric cars due to its superiority in safety, design, performance, and intelligence compared to rival blockbuster models like BYD's Hanelectric vehicle. The E8 also features a 45-inch screen supplied by display panel manufacturer BOE Technology. The launch of the E8 is part of Geely's efforts to intensify competition in China's EV market, where carmakers have missed their sales goal for 2023 due to increasing competition. (Source: "Chinese EV maker Geely introduces first pure electric Galaxy model, to woo mainstream buyers from BYD, foreign brands | South China Morning Post," scmp.com, January 8, 2024)
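The query engine above uses `response_mode="refine"`. As the LlamaIndex docs describe it, refine produces an answer from the first retrieved chunk and then makes a separate LLM call for each remaining chunk, asking the model to improve the existing answer with the new context. The sketch below reproduces only that loop; `fake_llm` is a hypothetical stand-in for a real model call, and the prompt wording is ours, not LlamaIndex's template.

```python
from typing import Callable, List


def refine_answer(query: str, chunks: List[str], llm: Callable[[str], str]) -> str:
    """Build an answer sequentially, one LLM call per chunk (refine-style)."""
    if not chunks:
        return ""
    # First call: answer from the first chunk only.
    answer = llm(f"Context: {chunks[0]}\nQuestion: {query}\nAnswer:")
    # Each later call sees the current answer plus one new chunk.
    for chunk in chunks[1:]:
        answer = llm(
            f"Existing answer: {answer}\nNew context: {chunk}\n"
            f"Question: {query}\nRefine the existing answer if the new context helps:"
        )
    return answer


calls: List[str] = []


def fake_llm(prompt: str) -> str:
    # Hypothetical LLM: records each prompt and reports how many calls were made.
    calls.append(prompt)
    return f"answer v{len(calls)}"


print(refine_answer("What did Geely do?", ["chunk 1", "chunk 2", "chunk 3"], fake_llm))
# answer v3
```

The cost of this mode is one LLM call per chunk; as we understand it, the default `compact` mode concatenates chunks before refining, so it usually needs fewer calls.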

In [17]:
response = query_engine.query("What are the documents inside the index?")
display_response(response)

**`Final Response:`** The documents included within an index during the presentation of financial results may vary from year to year, but typically include a range of financial and governance-related information. Some examples of documents that might be found within an annual report index are:

1. Corporate Information - This section provides an overview of the company's structure, history, and business activities. It can contain subsections such as Contents, Corporate Profile, Analyses of Core Business Segments by Geographical Location, Analyses by Core Business Segments, Key Financial Information, Business Highlights, Chairman’s Statement, Operations Review, Group Capital Resources and Liquidity, Risk Factors, Information on Directors, Information on Senior Management, Directors’ Report, Corporate Governance Report, Independent Auditor’s Report, Consolidated Income Statement, Consolidated Statement of Comprehensive Income, Consolidated Statement of Financial Position, Consolidated Statement of Changes in Equity, Consolidated Statement of Cash Flows, Notes to the Financial Statements, Principal Subsidiary and Associated Companies and Joint Ventures, Ten Year Summary, and Information for Shareholders.

2. Bond information - This section provides details on the company's bond offerings, including issuance dates, amounts, interest rates, maturity dates, and other relevant financial data.

3. Currency-specific information - For companies with international operations or borrowing, this section might include details about bonds issued in specific currencies, such as EUR750 million notes, 3.625% due 2022 for the euro or GBP303 million notes, 5.625% due 2026 for the British pound sterling.

4. Risk factors - This section explains the major risks and uncertainties associated with the company's operations and financial position, including market risks, credit risks, operational risks, compliance risks, strategic risks, legal risks, reputational risks, and political risks.

5. Directors’ report - This document provides a detailed overview of the company's performance during the year, major events and transactions, and strategies for the future. It can also include information about key personnel, remuneration policies, and governance practices.

6. Independent auditor's report - This document is an independent assessment of the company's financial statements prepared in accordance with specific accounting standards or laws. It confirms the accuracy and reliability of the financial data presented in the annual report.

7. Shareholder information - This section provides details about shareholders' rights, dividends, voting procedures, and other relevant information for investors.

Overall, an annual report index is a comprehensive collection of documents that aims to provide stakeholders with all the necessary financial, operational, and governance-related information in one place.

In [18]:
response = query_engine.query("What did the Geely Automobile Group do in China, let me know the answer and also the name of the source document.")
display_response(response)

**`Final Response:`** The Geely Automobile Group recently partnered with Shanghai-based Nio to promote battery-swapping technology in China as both companies work to address the issue of insufficient charging infrastructure. This information can be found in the source document "Chinese EV maker Geely introduces first pure electric Galaxy model, to woo mainstream buyers from BYD, foreign brands | South China Morning Post." With the formation of this partnership in November, battery-swapping technology will allow owners of electric cars to quickly exchange a spent battery for a fully charged one. This initiative is particularly significant given that sales of battery-powered vehicles in mainland China are projected to grow by 20% annually through 2024, as reported by Fitch Ratings in November.

In [19]:
response = query_engine.query("What documents are included in this query engine?")
display_response(response)

**`Final Response:`** Based on the provided context, it seems that this query engine is related to searching through the Annual Report 2021/22 of Sun Hung Kai Properties Limited (SHKPAR) for specific information or documents. The context includes page numbers and file paths for certain sections within the report, as well as the title and publication year of the report itself. Therefore, it's reasonable to assume that this query engine allows users to search for keywords, phrases, or other criteria within the text of the annual report, potentially including financial statements, executive summaries, and other important documents related to SHKPAR's operations during the 2021/22 fiscal year. Without further information about the specific capabilities or limitations of this query engine, it's impossible to provide a more detailed answer, but at least we now have some context that can help guide our understanding and interpretation of its functionality.
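Neither answer above lists the files in `data/Mixed/`. That is consistent with how a vector query engine works: the question is embedded, only the few most similar chunks (the retriever's `similarity_top_k`, which we believe defaults to 2 in this LlamaIndex version) are passed to the LLM, and the model never sees a catalogue of the whole index. A toy version of that retrieval step, using hand-made 3-dimensional "embeddings" (invented numbers, not real model output):

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], chunks: Dict[str, Sequence[float]], k: int = 2) -> List[str]:
    """Return the ids of the k chunks most similar to the query vector."""
    ranked = sorted(chunks, key=lambda cid: cosine(query_vec, chunks[cid]), reverse=True)
    return ranked[:k]


# Invented vectors: two annual-report chunks and one news-article chunk
chunks = {
    "shkp_report_p12": [0.9, 0.1, 0.0],
    "ckh_report_p40": [0.8, 0.2, 0.1],
    "geely_article_p1": [0.0, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]  # e.g. a question that happens to embed close to report text
print(top_k(query, chunks))
# ['shkp_report_p12', 'ckh_report_p40']
```

Whatever the question, the LLM only receives the chunks returned by this step, so "what documents are in the index?" gets answered from one or two report pages. Listing the files reliably means inspecting the index's stored documents and their metadata rather than querying it.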

#### **Demo of Mixtral 8x7B on the same query**

In [20]:
from llama_index import ServiceContext
from llama_index.llms import Ollama

# Mixtral 8x7B served locally through Ollama
llm = Ollama(
    model="mixtral",
    context_window=3900,
    messages_to_prompt=messages_to_prompt,  # prompt formatter defined earlier in this notebook
    temperature=0.7,
    additional_kwargs={"top_k": 50, "top_p": 0.95},  # extra sampling options passed to Ollama
)

service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")

# check if storage already exists
# note: an existing ./storage (e.g. from the previous cell) is reused as-is, which
# assumes it was embedded with the same embedding model configured here
PERSIST_DIR = "./storage"
if not os.path.exists(PERSIST_DIR):
    # load the documents and create the index
    documents = SimpleDirectoryReader("data/Mixed/").load_data()
    index = VectorStoreIndex.from_documents(documents, service_context=service_context)
    # store it for later
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # load the existing index
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context, service_context=service_context)

# either way we can now query the index
query_engine = index.as_query_engine(response_mode="refine")
response = query_engine.query("What did the Geely Automobile Group do in China, let me know the answer and also the name of the source document.")
display_response(response)

**`Final Response:`** Based on the provided context, Geely Automobile Group, a Chinese automaker, introduced its first pure electric Galaxy model to attract mainstream buyers away from competitors like BYD and foreign brands. The company has also formed a partnership with Shanghai-based Nio, another electric vehicle (EV) manufacturer, to promote battery-swapping technology. This collaboration aims to address the issue of inadequate charging infrastructure in China, as mentioned in a November 2024 article from the South China Morning Post titled "Chinese EV maker Geely introduces first pure electric Galaxy model, to woo mainstream buyers from BYD, foreign brands" (data/Mixed/Chinese EV maker Geely introduces first pure electric Galaxy model, to woo mainstream buyers from BYD, foreign brands | South China Morning Post.pdf). According to a Fitch Ratings report and the China Passenger Car Association, sales of battery-powered vehicles in mainland China are growing by 20% year on year in 2024, but only a few manufacturers, including BYD and Li Auto, are profitable. A new round of price cuts is also in effect, with top players like BYD and Xpeng offering discounts to attract buyers.
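The Mixtral cell sets `temperature=0.7` and passes `top_k=50` and `top_p=0.95` through `additional_kwargs`. These knobs shape how each next token is sampled. The sketch below shows one common formulation of temperature scaling followed by top-k and top-p (nucleus) filtering over a token distribution; the exact order and details inside Ollama's backend may differ, and the token scores are invented.

```python
import math
from typing import Dict


def apply_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Softmax over logits / temperature (lower temperature = sharper distribution)."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_k_top_p(probs: Dict[str, float], top_k: int, top_p: float) -> Dict[str, float]:
    """Keep the top_k most likely tokens, then the shortest prefix of them whose
    cumulative probability reaches top_p, and renormalise what is left."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


logits = {"BYD": 2.0, "Geely": 1.5, "Nio": 0.5, "Xpeng": -1.0}  # invented scores
probs = apply_temperature(logits, temperature=0.7)
print(top_k_top_p(probs, top_k=50, top_p=0.95))
```

With only four candidate tokens, `top_k=50` removes nothing here; `top_p=0.95` is what drops the least likely token before sampling.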

# **Parsing**
### **Creating LlamaIndex Nodes**
In LlamaIndex, once the data has been ingested and represented as Documents, there's an option to further process these Documents into Nodes. Nodes are more granular data entities that represent "chunks" of source Documents, which could be text chunks, images, or other types of data. They also carry metadata and relationship information with other nodes, which can be instrumental in building a more structured and relational index.

<img src="img/nodes.png" alt="nodes" width="450" />
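To make the Document-to-Node idea concrete, here is a toy model in plain Python: each node holds one chunk of text, carries a copy of the parent document's metadata, and records links to its source document and neighbouring nodes. `ToyNode` and `split_into_nodes` are hypothetical illustrations, not LlamaIndex's `TextNode` or node parsers, and they split on a fixed number of words rather than tokens.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyNode:
    node_id: str
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)
    source_doc_id: str = ""
    prev_id: Optional[str] = None
    next_id: Optional[str] = None


def split_into_nodes(
    doc_id: str, text: str, metadata: Dict[str, str], words_per_node: int
) -> List[ToyNode]:
    words = text.split()
    nodes: List[ToyNode] = []
    for i in range(0, len(words), words_per_node):
        nodes.append(
            ToyNode(
                node_id=f"{doc_id}-{len(nodes)}",
                text=" ".join(words[i : i + words_per_node]),
                metadata=dict(metadata),  # each chunk carries its own copy of the document metadata
                source_doc_id=doc_id,
            )
        )
    # Link neighbours so a chunk can later be expanded with its surrounding context.
    for prev, nxt in zip(nodes, nodes[1:]):
        prev.next_id = nxt.node_id
        nxt.prev_id = prev.node_id
    return nodes


nodes = split_into_nodes(
    "doc1",
    "Geely introduced the E8 sedan to compete with BYD in China",
    {"file_name": "geely_article.pdf"},  # hypothetical file name
    words_per_node=4,
)
for n in nodes:
    print(n.node_id, n.prev_id, n.next_id, repr(n.text))
```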

#### **Basic**

To parse Documents into Nodes, LlamaIndex provides NodeParser classes. These classes help in automatically transforming the content of Documents into Nodes, adhering to a specific structure that can be utilized further in index construction and querying.

Here's how you can use a SimpleNodeParser to parse your Documents into Nodes:


In [None]:
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.node_parser import SimpleNodeParser

documents = SimpleDirectoryReader("data/paul_graham/").load_data()

# Initialize the parser: chunks of up to 1024 tokens, with 20 tokens shared between neighbours
node_parser = SimpleNodeParser.from_defaults(
    separator=" ",
    chunk_size=1024,
    chunk_overlap=20,
)

# Parse documents into nodes
nodes = node_parser.get_nodes_from_documents(documents)

# Build the index directly from the nodes (service_context is defined earlier in the notebook)
index = VectorStoreIndex(nodes, service_context=service_context)

# Query the index
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
print(response)

The author, Paul Graham, grew up working on two main things outside of school: writing short stories and programming on an IBM 1401 in the basement of their junior high school using an early version of Fortran. They also waited for years to convince their father to buy a microcomputer, a TRS-80, where they could write simple games, programs to predict how high their model rockets would fly, and a word processor for their father's use. These activities were pursued before the author decided to study philosophy in college, but eventually switched to AI due to being drawn into the world of a novel featuring an intelligent computer called Mike and witnessing Terry Winograd using SHRDLU on PBS.


In [None]:
response = query_engine.query("Summarise the context")
print(response)

The author, Paul Graham, is discussing his experience with computers and programming languages. He mentions that his experience skipped a step in the evolution of computers, as he went from batch processing to microcomputers without experiencing time-sharing machines with interactive OSes. The author also explains that he recently completed working on a new programming language called Bel, which was written in itself using an egregious collection of hacks. He had to put aside writing essays for several years to focus on completing Bel, as it required a lot of intense work and problem-solving. Graham also discusses how he chooses what projects to work on, as he reflects on his past choices. Overall, the context can be summarized as the author's experiences with computers, programming languages, and his recent completion of working on a new language called Bel.
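The parser above uses `chunk_size=1024` and `chunk_overlap=20`. As far as we know, `SimpleNodeParser` counts both in tokens and prefers to break on sentence boundaries; the sketch below ignores both refinements and slides a fixed window over whitespace-separated words, just to show what the overlap does: the last `chunk_overlap` items of one chunk are repeated at the start of the next, so text near a boundary appears in both neighbouring chunks. The function name is hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into word windows of chunk_size, each overlapping the previous by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


print(sliding_window_chunks("a b c d e f g", chunk_size=4, chunk_overlap=1))
# ['a b c d', 'd e f g']
```

A larger overlap preserves more context across boundaries at the cost of storing and embedding more duplicated text.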


# **Fine Tuning**

Finetuning a model means updating the model itself over a set of data to improve the model in a variety of ways. This can include improving the quality of outputs, reducing hallucinations, memorizing more data holistically, and reducing latency/cost.

The core of LlamaIndex's toolkit revolves around in-context learning / retrieval augmentation, which uses the models in inference mode rather than training them.

While finetuning can also be used to “augment” a model with external data, it can complement retrieval augmentation in a variety of ways:

## **Embedding Finetuning Benefits**
Finetuning the embedding model can produce more meaningful embedding representations over a training distribution of data, which leads to better retrieval performance.
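One way to check whether a finetuned embedding model actually retrieves better is to score it on (question, expected chunk) pairs before and after finetuning. The sketch computes two standard retrieval metrics, hit rate (was the expected chunk in the top-k?) and mean reciprocal rank; the node ids and retrieval results are invented, and this is our illustration rather than a LlamaIndex evaluator.

```python
from typing import Sequence, Tuple


def hit_rate_and_mrr(
    retrieved: Sequence[Sequence[str]], expected: Sequence[str]
) -> Tuple[float, float]:
    """Hit rate and mean reciprocal rank over (retrieved ids, expected id) pairs."""
    hits, reciprocal_ranks = 0, 0.0
    for ids, gold in zip(retrieved, expected):
        ranked = list(ids)
        if gold in ranked:
            hits += 1
            # rank 1 scores 1.0, rank 2 scores 0.5, and so on
            reciprocal_ranks += 1.0 / (ranked.index(gold) + 1)
    n = len(expected)
    return hits / n, reciprocal_ranks / n


# Invented results: top-2 node ids retrieved for three evaluation questions
retrieved = [["n1", "n2"], ["n5", "n3"], ["n7", "n8"]]
expected = ["n1", "n3", "n4"]
print(hit_rate_and_mrr(retrieved, expected))
# (0.6666666666666666, 0.5)
```

Running the same evaluation set through the base and the finetuned embedding model gives a direct before/after comparison.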

## **LLM Finetuning Benefits**
- Allow it to learn a style over a given dataset
- Allow it to learn a DSL that might be less represented in the training data (e.g. SQL)
- Allow it to correct hallucinations/errors that might be hard to fix through prompt engineering
- Allow it to distill a better model (e.g. GPT-4) into a simpler/cheaper model (e.g. gpt-3.5, Llama 2)

### **Integrations with LlamaIndex**
This is an evolving guide, and there are currently several key integrations with LlamaIndex, including those listed below. Please check out the sections below for more details!

- Finetuning embeddings for better retrieval performance
- Finetuning Llama 2 for better text-to-SQL

