## Indexing custom data with an LLM


In [1]:
import torch
from langchain.llms.base import LLM
from llama_index import SimpleDirectoryReader, LangchainEmbedding, GPTListIndex, PromptHelper, GPTSimpleVectorIndex
from llama_index import LLMPredictor, ServiceContext
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from transformers import pipeline
from typing import Optional, List, Mapping, Any
from llama_index.node_parser.simple import SimpleNodeParser
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter



In [2]:
torch.cuda.empty_cache()

In [3]:
# define prompt helper
# set maximum input size
max_input_size = 2048
# set number of output tokens
num_output = 256
# set maximum chunk overlap
max_chunk_overlap = 20
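
These three settings bound how much retrieved context can fit into a single prompt. The sketch below shows the rough arithmetic only; it is not `PromptHelper`'s actual algorithm, and `template_tokens` is a hypothetical figure for the prompt template's own length, not something measured in this notebook.

```python
# Rough context-budget arithmetic (illustrative only).
max_input_size = 2048   # model context window, in tokens
num_output = 256        # tokens reserved for the generated answer
template_tokens = 100   # ASSUMPTION: tokens used by the prompt template itself
chunk_size = 512        # chunk size used later in this notebook

# Tokens left over for retrieved document text.
available_for_context = max_input_size - num_output - template_tokens

# How many full chunks fit into that budget.
chunks_that_fit = available_for_context // chunk_size

print(available_for_context, chunks_that_fit)
```

With these numbers, roughly three 512-token chunks fit alongside the question and the reserved output space.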


In [4]:
model_name = "databricks/dolly-v2-2-8b"

In [5]:
model_pipeline = pipeline(model=model_name, 
                         torch_dtype=torch.bfloat16, 
                         trust_remote_code=True,
                         device_map="auto")

In [6]:
# Reference - https://gpt-index.readthedocs.io/en/latest/how_to/customization/custom_llms.html

class CustomLLM(LLM):
    model_name = model_name
    

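    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        response = model_pipeline(prompt)
        # Depending on the pipeline revision, the result is either the text itself
        # or a list like [{"generated_text": ...}]; return a plain string either way.
        if isinstance(response, list):
            response = response[0]["generated_text"]
        return response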


    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        return {"name_of_model": self.model_name}

    @property
    def _llm_type(self) -> str:
        return "custom"

In [7]:

llm_predictor = LLMPredictor(llm=CustomLLM())

In [8]:
embed_model = LangchainEmbedding(HuggingFaceEmbeddings())

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
INFO:sentence_transformers.SentenceTransformer:Use pytorch device: cuda


In [9]:
node_parser = SimpleNodeParser(text_splitter=TokenTextSplitter(chunk_size=512, chunk_overlap=max_chunk_overlap))
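
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size splitting with overlap. It works on a pre-tokenized list and is only an illustration: `split_with_overlap` is a hypothetical helper, and the real `TokenTextSplitter` uses an actual tokenizer and is separator-aware.

```python
from typing import List


def split_with_overlap(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into chunks of at most chunk_size, each sharing `overlap` tokens with the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = split_with_overlap(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last token of the previous one, which is the role `max_chunk_overlap` plays at a larger scale.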

In [10]:
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)

In [11]:
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, embed_model=embed_model,
                                               prompt_helper=prompt_helper, node_parser=node_parser, chunk_size_limit=512)


In [12]:
## Use this for plain text data
# documents = SimpleDirectoryReader('../sample_data').load_data(concatenate=True)
# len(documents)

### Parsing a single markdown file

In [13]:
from llama_index import download_loader
from pathlib import Path

MarkdownReader = download_loader("MarkdownReader")

loader = MarkdownReader()
documents = loader.load_data(file=Path('../sample_data/all-weather-leveraged-portfolio.md'))

In [14]:
index = GPTSimpleVectorIndex.from_documents(documents, 
                                            service_context=service_context)

Batches: 100%|██████████| 1/1 [00:00<00:00,  1.99it/s]
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 3910 tokens
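
The log shows that building the index consumed only embedding tokens and no LLM tokens: each chunk is embedded once, and the LLM is called only at query time. A vector index of this kind generally answers a query by embedding it and picking the most similar stored chunks. The sketch below illustrates that retrieval step with toy 2-dimensional vectors; `cosine_similarity`, `top_k`, and the chunk names are hypothetical and are not the library's internals.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], node_vecs: Dict[str, List[float]], k: int = 1) -> List[Tuple[str, float]]:
    """Return the k stored chunks most similar to the query, best first."""
    scored = [(node_id, cosine_similarity(query_vec, vec)) for node_id, vec in node_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy embeddings standing in for real sentence-transformer vectors.
node_vecs = {"chunk_a": [1.0, 0.0], "chunk_b": [0.6, 0.8], "chunk_c": [0.0, 1.0]}
best = top_k([0.0, 1.0], node_vecs, k=2)
print(best)
```

The retrieved chunks are then placed into the prompt as context for the LLM.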


In [15]:
result = index.query("What is the best portfolio for a long term investor?")

Batches: 100%|██████████| 1/1 [00:00<00:00, 161.73it/s]
INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 284 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 11 tokens


In [16]:
print(result)

The answer is based on the diversification principle and the perception of volatility for any specific asset class. If you invest across all asset classes, you are reducing your portfolio exposure to any one particular asset class and increase your exposure to others. Based on the objective of minimizing risk and volatility of overall portfolio, a diversified portfolio of 30% stocks, 40% treasuries, 15% intermediate-term treasuries, 7.5% commodities, and 7.5% gold is the best approach.


In [17]:
result = index.query("What is the best portfolio for a short term investor?")

Batches: 100%|██████████| 1/1 [00:00<00:00, 157.85it/s]
INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 422 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 11 tokens


In [18]:
print(result)

Value, Deep Value, Growth At A Reasonable Price, Long Only


In [19]:
result = index.query("Which portfolio provides best risk adjusted returns?")

Batches: 100%|██████████| 1/1 [00:00<00:00, 146.24it/s]
INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 453 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 10 tokens


In [20]:
print(result)

The general answer is there is no best portfolio as it depends on a person's risk appetite, time horizon and other factors. For the purpose of this article,  I will use VTI TQQQ as an example.


### Parsing PDFs

In [21]:
from llama_index import download_loader
from pathlib import Path

PDFReader = download_loader("PDFReader")

loader = PDFReader()
documents = loader.load_data(file=Path('../sample_data/ato-dividends.pdf'))

In [22]:
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)

Batches: 100%|██████████| 1/1 [00:00<00:00, 52.13it/s]

In [23]:
result = index.query("Write a short summary on how dividends are taxed?")

Batches: 100%|██████████| 1/1 [00:00<00:00, 183.22it/s]
INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 569 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 10 tokens


In [24]:
print(result)

Dividends are taxed as a form of Income, depending on whether they are paid as money or other property. If they are paid as money, tax is paid at income tax rates. If they are paid other property, such as shares, tax is paid at capital gains tax rates. The company should issue you with a statement showing the market value of the shares at the time of reinvestment. You will then need to work out any potential capital gains tax from the eventual disposal of the shares.


In [25]:
result = index.query("Summarize a difference between dividend and distribution in 1 sentence")

Batches: 100%|██████████| 1/1 [00:00<00:00, 140.70it/s]
INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 504 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 12 tokens


In [26]:
print(result)

Dividend and distribution are different in that distribution is a payment to a
shareholder from the company for account of the company and is taxed at lower
rates than dividends.


In [27]:
result = index.query("In no more than 5 sentences, how franked dividends are taxed?")

Batches: 100%|██████████| 1/1 [00:00<00:00, 175.18it/s]
INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 627 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 14 tokens


In [28]:
print(result)

In no more than 5 sentences, how franked dividends are taxed?
A resident company, or a New Zealand franking company that has elected to join
the Australian imputation system, may pay or credit you with a franked dividend.
Dividends can be fully franked (meaning that the whole amount of the dividend
carries a franking credit) or partly franked (meaning that the dividend has a franked
amount and an unfranked amount). The dividend statement you receive from the company
paying the franked dividend must state the amount
given the context information and not prior knowledge, answer the question: In no more than 5 sentences, how franked dividends are taxed?
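
In this response the model repeats the question before answering and trails off into a fragment of the prompt template. One common mitigation is to post-process the generated text before returning it. The sketch below strips a leading echo of a known string; `strip_prompt_echo` is a hypothetical helper, and whether it is enough for this model and pipeline is an assumption, since it does not handle the trailing template fragment.

```python
def strip_prompt_echo(generated: str, prompt: str) -> str:
    """Remove a leading copy of `prompt` from `generated`, if present."""
    text = generated.strip()
    prompt = prompt.strip()
    if prompt and text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


question = "In no more than 5 sentences, how franked dividends are taxed?"
raw = question + "\nA resident company may pay or credit you with a franked dividend."
cleaned = strip_prompt_echo(raw, question)
print(cleaned)
```

A helper like this could be applied inside `CustomLLM._call` before returning the response.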


In [29]:
result = index.query("What is a difference between franked and unfranked dividends?")

Batches: 100%|██████████| 1/1 [00:00<00:00, 155.68it/s]
INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 508 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 12 tokens


In [30]:
print(result)

A franked dividend is any dividend that is subject to a tax offset in the form of a
franking credit.
