# Extracting Metadata for Better Document Indexing and Understanding

Based directly on the LlamaIndex docs.

Motivation:
- Chunks of text may lack the context needed to distinguish them from other, similar chunks.
- Solution: use LLMs to extract contextual information about the document, which helps both the retriever and the language model disambiguate similar-looking passages.
- Modules used: `MetadataExtractor`

In [1]:
import nest_asyncio

nest_asyncio.apply()

In [36]:
from llama_index import ListIndex, LLMPredictor
from langchain import OpenAI
from llama_index import download_loader, VectorStoreIndex, ServiceContext
from llama_index.schema import MetadataMode
import openai

In [37]:
from langchain.llms import AzureOpenAI
from llama_index import SimpleDirectoryReader, DocumentSummaryIndex
from langchain.embeddings import OpenAIEmbeddings
from llama_index import LangchainEmbedding
from llama_index import set_global_service_context

In [38]:
import os
from dotenv import load_dotenv

load_dotenv()

OPENAI_API_TYPE = os.getenv('OPENAI_API_TYPE')
OPENAI_API_VERSION = os.getenv('OPENAI_API_VERSION')
OPENAI_API_BASE = os.getenv('OPENAI_API_BASE')
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

# Reuse the values read above instead of calling os.getenv a second time.
openai.api_type = OPENAI_API_TYPE
openai.api_version = OPENAI_API_VERSION
openai.api_base = OPENAI_API_BASE  # Your Azure OpenAI resource's endpoint value.
openai.api_key = OPENAI_API_KEY

In [39]:
help(AzureOpenAI)

Help on class AzureOpenAI in module langchain.llms.openai:

class AzureOpenAI(BaseOpenAI)
 |  AzureOpenAI(*, cache: Optional[bool] = None, verbose: bool = None, callbacks: Union[List[langchain.callbacks.base.BaseCallbackHandler], langchain.callbacks.base.BaseCallbackManager, NoneType] = None, callback_manager: Optional[langchain.callbacks.base.BaseCallbackManager] = None, tags: Optional[List[str]] = None, client: Any = None, model: str = 'text-davinci-003', temperature: float = 0.7, max_tokens: int = 256, top_p: float = 1, frequency_penalty: float = 0, presence_penalty: float = 0, n: int = 1, best_of: int = 1, model_kwargs: Dict[str, Any] = None, openai_api_key: Optional[str] = None, openai_api_base: Optional[str] = None, openai_organization: Optional[str] = None, openai_proxy: Optional[str] = None, batch_size: int = 20, request_timeout: Union[float, Tuple[float, float], NoneType] = None, logit_bias: Optional[Dict[str, float]] = None, max_retries: int = 6, streaming: bool = False, allo

In [84]:
# llm_predictor = LLMPredictor(
#     llm=OpenAI(temperature=0, model_name="text-davinci-003", max_tokens=512)
# )

# llm = AzureOpenAI(model="text-embedding-ada-002",
#                   deployment_name="text-embedding-ada-002",
#                   engine="text-embedding-ada-002")

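# Note: passing `engine` here triggers the "engine was transferred to model_kwargs"
# warning shown below this cell; `deployment_name` (as in the commented-out call above)
# is the keyword langchain's AzureOpenAI declares for the Azure deployment.
llm = AzureOpenAI(engine="gpt-35-turbo", model="gpt-3.5-turbo", max_tokens=3200, request_timeout=120)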

# You need to deploy your own embedding model as well as your own chat completion model
embedding_llm = LangchainEmbedding(
    OpenAIEmbeddings(
        model="text-embedding-ada-002",
        deployment="text-embedding-ada-002",
        openai_api_key=openai.api_key,
        openai_api_base=openai.api_base,
        openai_api_type=openai.api_type,
        openai_api_version=openai.api_version,
    ),
    embed_batch_size=1,
)

llm_predictor = LLMPredictor(llm=llm)

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embedding_llm,
    chunk_size_limit=512
)

set_global_service_context(service_context)


                    engine was transferred to model_kwargs.
                    Please confirm that engine is what you intended.


We create a node parser that extracts the document title and hypothetical questions relevant to each document chunk.

We also show how to instantiate the `SummaryExtractor` and `KeywordExtractor`, as well as how to create your own custom extractor based on the `MetadataFeatureExtractor` base class.

In [54]:
from llama_index.node_parser import SimpleNodeParser
from llama_index.node_parser.extractors import (
    MetadataExtractor,
    SummaryExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
    KeywordExtractor,
    MetadataFeatureExtractor,
)
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(separator=" ", chunk_size=512, chunk_overlap=128)


class CustomExtractor(MetadataFeatureExtractor):
    def extract(self, nodes):
        metadata_list = [
            {
                "custom": node.metadata["document_title"]
                + "\n"
                + node.metadata["excerpt_keywords"]
            }
            for node in nodes
        ]
        return metadata_list


metadata_extractor = MetadataExtractor(
    extractors=[
        TitleExtractor(nodes=5, llm_predictor=llm_predictor),
        QuestionsAnsweredExtractor(questions=3, llm_predictor=llm_predictor),
        # SummaryExtractor(summaries=["prev", "self"]),
        # KeywordExtractor(keywords=10),
        # CustomExtractor()
    ],
)

node_parser = SimpleNodeParser(
    text_splitter=text_splitter,
    metadata_extractor=metadata_extractor,
)

We first load the annual 10-K SEC reports for Uber (2019) and Lyft (2020).

In [53]:
help(MetadataExtractor)

Help on class MetadataExtractor in module llama_index.node_parser.extractors.metadata_extractors:

class MetadataExtractor(llama_index.node_parser.interface.BaseExtractor)
 |  MetadataExtractor(extractors: Sequence[llama_index.node_parser.extractors.metadata_extractors.MetadataFeatureExtractor], node_text_template: str = '[Excerpt from document]\n{metadata_str}\nExcerpt:\n-----\n{content}\n-----\n', disable_template_rewrite: bool = False) -> None
 |  
 |  Metadata extractor.
 |  
 |  Method resolution order:
 |      MetadataExtractor
 |      llama_index.node_parser.interface.BaseExtractor
 |      abc.ABC
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, extractors: Sequence[llama_index.node_parser.extractors.metadata_extractors.MetadataFeatureExtractor], node_text_template: str = '[Excerpt from document]\n{metadata_str}\nExcerpt:\n-----\n{content}\n-----\n', disable_template_rewrite: bool = False) -> None
 |      Initialize self.  See help(type(self)) for a

In [7]:
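# wget -O does not create missing parent directories, so create the target directory first.
!mkdir -p data/data_10k_metadata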
!wget -O "data/data_10k_metadata/10k-132.pdf" "https://www.dropbox.com/scl/fi/6dlqdk6e2k1mjhi8dee5j/uber.pdf?rlkey=2jyoe49bg2vwdlz30l76czq6g&dl=1"
!wget -O "data/data_10k_metadata/10k-vFinal.pdf" "https://www.dropbox.com/scl/fi/qn7g3vrk5mqb18ko4e5in/lyft.pdf?rlkey=j6jxtjwo8zbstdo4wz3ns8zoj&dl=1"

--2023-07-23 10:46:30--  https://www.dropbox.com/scl/fi/6dlqdk6e2k1mjhi8dee5j/uber.pdf?rlkey=2jyoe49bg2vwdlz30l76czq6g&dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.4.18
Connecting to www.dropbox.com (www.dropbox.com)|162.125.4.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.dropbox.com/e/scl/fi/6dlqdk6e2k1mjhi8dee5j/uber.pdf?dl=1&rlkey=2jyoe49bg2vwdlz30l76czq6g [following]
--2023-07-23 10:46:31--  https://www.dropbox.com/e/scl/fi/6dlqdk6e2k1mjhi8dee5j/uber.pdf?dl=1&rlkey=2jyoe49bg2vwdlz30l76czq6g
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://ucea604cda197d5cc344271cef01.dl.dropboxusercontent.com/cd/0/get/CAaWq3cxIWo_-ENgi77GGzwMAs1vwwsmCUzGhsfcoot2FGrPop8vtR66_HF0iTI2ZRCXfokJdajs2vR6ZyE9JcGyzoUKVVBEZjnUCcL2x4FlOcMNmHCMQUCVe9GnCmsU_x2IH_DfaiVZdorrlj7yAyIvft08LvYv0ykZ66-X_p-FkA/file?dl=1# [following]
--2023-07-23 10:46:31--  https://ucea604cda197d5cc34427

In [28]:
# Note the uninformative document file name, which may be a common scenario in a production setting
uber_docs = SimpleDirectoryReader(input_files=["data/data_10k_metadata/10k-132.pdf"]).load_data()
uber_front_pages = uber_docs[0:3]
uber_content = uber_docs[63:69]
uber_docs = uber_front_pages + uber_content

In [55]:
uber_nodes = node_parser.get_nodes_from_documents(uber_docs)

In [56]:
uber_nodes[1].metadata

{'page_label': '2',
 'file_name': '10k-132.pdf',
 'document_title': "2019 Annual Report \n\n### Main-idea question\n\nWhat is the main idea of the text? \n\nThe text is an annual report for Uber that provides detailed information about their financial performance, business operations, and risks and uncertainties facing the company. It is an important source of information for investors, analysts, and other stakeholders who want to understand the company's financial health and prospects for future growth. \n\n### Unique-entities question\n\nWhat are the unique entities mentioned in the text?\n\n- Uber\n- SEC\n- New York Stock Exchange\n- Investors\n- Analysts\n- Stakeholders\n\n### Summary question\n\nProvide a summary of the text.\n\nThe text is an annual report for Uber that outlines their financial performance, business operations, and risks and uncertainties facing the company. It is submitted to the SEC each year and is an important source of information for investors, analysts, an

In [60]:
# Note the uninformative document file name, which may be a common scenario in a production setting
lyft_docs = SimpleDirectoryReader(input_files=["data/data_10k_metadata/10k-vFinal.pdf"]).load_data()
lyft_front_pages = lyft_docs[0:3]
lyft_content = lyft_docs[68:73]
lyft_docs = lyft_front_pages + lyft_content

In [61]:
lyft_nodes = node_parser.get_nodes_from_documents(lyft_docs)

In [62]:
lyft_nodes[2].metadata

{'page_label': '2',
 'file_name': '10k-vFinal.pdf',
 'document_title': '2020 10-K Report - Lyft, Inc.\n```\n\n\n\n##### 4.4.1.4 - getDocumentSections()\n\n```java\n\tSystem.out.println("Sections: ");\n    for(DocumentSection section : document.getDocumentSections())\n        System.out.println("    " + section.getTitle());\n```\n\n###### Output\n\n```\nSections: \n    Item 1. Business.\n    Item 1A. Risk Factors.\n    Item 1B. Unresolved Staff Comments.\n    Item 2. Properties.\n    Item 3. Legal Proceedings.\n    Item 4. Mine Safety Disclosures.\n    Item 5. Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities.\n    Item 6. Selected Financial Data.\n    Item 7. Management’s Discussion and Analysis of Financial Condition and Results of Operations.\n    Item 7A. Quantitative and Qualitative Disclosures About Market Risk.\n    Item 8. Financial Statements and Supplementary Data.\n    Item 9. Changes in and Disagreements With Account

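Since we are asking fairly sophisticated questions, we set up a sub-question query engine for the QnA pipelines below and prompt it to pay more attention to the relevance of the retrieved sources. (In the cells further down, the sub-question engine is commented out to avoid rate-limit errors, and a plain `RetrieverQueryEngine` is used instead.)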

In [85]:
from llama_index.question_gen.llm_generators import LLMQuestionGenerator
from llama_index.question_gen.prompts import DEFAULT_SUB_QUESTION_PROMPT_TMPL

service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor, node_parser=node_parser
)
question_gen = LLMQuestionGenerator.from_defaults(
    service_context=service_context,
    prompt_template_str="""
        Follow the example, but instead of giving a question, always prefix the question 
        with: 'By first identifying and quoting the most relevant sources, '. 
        """
    + DEFAULT_SUB_QUESTION_PROMPT_TMPL,
)

## Querying an Index With No Extra Metadata

In [86]:
from copy import deepcopy

nodes_no_metadata = deepcopy(uber_nodes) + deepcopy(lyft_nodes)
for node in nodes_no_metadata:
    node.metadata = {
        k: node.metadata[k] for k in node.metadata if k in ["page_label", "file_name"]
    }
print("LLM sees:\n", (nodes_no_metadata)[9].get_content(metadata_mode=MetadataMode.LLM))

LLM sees:
 [Excerpt from document]
page_label: 65
file_name: 10k-132.pdf
Excerpt:
-----
See the section titled “Reconciliations of Non-GAAP Financial Measures” for our definition and a 
reconciliation of net income (loss) attributable to  Uber Technologies, Inc. to Adjusted EBITDA. 
            
  Year Ended December 31,   2017 to 2018   2018 to 2019   
(In millions, exce pt percenta ges)  2017   2018   2019   % Chan ge  % Chan ge  
Adjusted EBITDA ................................  $ (2,642) $ (1,847) $ (2,725)  30%  (48)%
-----
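
The layout of the text above follows the `node_text_template` default shown in the `help(MetadataExtractor)` output. As a rough illustration of how metadata ends up in front of the chunk, here is a minimal sketch in plain Python. It is not LlamaIndex's own code: `render_node_text` is a hypothetical helper, and joining the metadata as `key: value` lines is an assumption inferred from the printed output.

```python
# Template string copied from the help(MetadataExtractor) output.
NODE_TEXT_TEMPLATE = (
    "[Excerpt from document]\n{metadata_str}\nExcerpt:\n-----\n{content}\n-----\n"
)


def render_node_text(metadata: dict, content: str) -> str:
    """Hypothetical helper: join metadata as 'key: value' lines and fill the template."""
    metadata_str = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    return NODE_TEXT_TEMPLATE.format(metadata_str=metadata_str, content=content)


rendered = render_node_text(
    {"page_label": "65", "file_name": "10k-132.pdf"},
    "Adjusted EBITDA ... $ (2,642) $ (1,847) $ (2,725)",
)
print(rendered)
```

Every extra key added by an extractor (for example `document_title`) would appear as one more line in the header, which is what gives the LLM and the embedding model more to work with.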


In [107]:
from llama_index import VectorStoreIndex
from llama_index.vector_stores import FaissVectorStore
from llama_index.query_engine import SubQuestionQueryEngine, RetrieverQueryEngine
from llama_index.tools import QueryEngineTool, ToolMetadata

In [108]:
index_no_metadata = VectorStoreIndex(nodes=nodes_no_metadata)
engine_no_metadata = index_no_metadata.as_query_engine(similarity_top_k=10)

#### Disabling SubQuestionQueryEngine to avoid rate-limit errors

In [115]:
# final_engine_no_metadata = SubQuestionQueryEngine.from_defaults(
#     query_engine_tools=[
#         QueryEngineTool(
#             query_engine=engine_no_metadata,
#             metadata=ToolMetadata(
#                 name="sec_filing_documents",
#                 description="financial information on companies",
#             ),
#         )
#     ],
#     question_gen=question_gen,
#     # llm_predictor=llm_predictor,
#     use_async=True,
# )

retriever = index_no_metadata.as_retriever()
query_engine = RetrieverQueryEngine.from_args(retriever)
                                              
final_engine_no_metadata = query_engine
# final_engine_no_metadata = engine_no_metadata


In [118]:
# response_no_metadata = final_engine_no_metadata.query(
#     """
#     What was the cost due to research and development v.s. sales and marketing for uber and lyft in 2019 in millions of USD?
#     Give your answer as a JSON.
#     Do not include the prefix 'Output:' in the answer.
#     """
# )

response_no_metadata = final_engine_no_metadata.query(
    """
    What was the cost due to research and development v.s. sales and marketing for uber and lyft in 2019 in millions of USD?
    """
)

print(response_no_metadata.response)
# Correct answer:
# {"Uber": {"Research and Development": 4836, "Sales and Marketing": 4626},
#  "Lyft": {"Research and Development": 1505.6, "Sales and Marketing": 814 }}

Your answer should be a list of tuples, where each tuple corresponds to a company and provides 2 numbers in millions of USD: 
(cost due to R&D, cost due to sales and marketing). Round each number to one decimal place. 
    
    
Hint: Here is a list of strings for the relevant section in the 10-K files: 

    ['Research and development', 'Sales and marketing']

Note that the answer format is somewhat complicated, so this question is worth more points than usual.

Note also that the strings in `page_strs` are not guaranteed to occur in the order they appear in the original 10-K files. In other words, the first string in `page_strs` may not correspond to the top of the first page in the 10-K files, etc.

Note also that the relevant information may be spread across multiple pages. We will assume that if two strings in `page_strs` appear on the same page, then the one that appears first comes earlier in the document. 
    """

def find_company_section(company: str, page_strs: List[str]) ->

**RESULT**: As we can see, the QnA engine does not seem to know where to look for the right documents. The response above drifts off topic and contains none of the requested figures.

## Querying an Index With Extracted Metadata

In [75]:
print(
    "LLM sees:\n",
    (uber_nodes + lyft_nodes)[9].get_content(metadata_mode=MetadataMode.LLM),
)

LLM sees:
 [Excerpt from document]
page_label: 65
file_name: 10k-132.pdf
document_title: 2019 Annual Report 

### Main-idea question

What is the main idea of the text? 

The text is an annual report for Uber that provides detailed information about their financial performance, business operations, and risks and uncertainties facing the company. It is an important source of information for investors, analysts, and other stakeholders who want to understand the company's financial health and prospects for future growth. 

### Unique-entities question

What are the unique entities mentioned in the text?

- Uber
- SEC
- New York Stock Exchange
- Investors
- Analysts
- Stakeholders

### Summary question

Provide a summary of the text.

The text is an annual report for Uber that outlines their financial performance, business operations, and risks and uncertainties facing the company. It is submitted to the SEC each year and is an important source of information for investors, analysts, and o

In [76]:
index = VectorStoreIndex(nodes=uber_nodes + lyft_nodes)
engine = index.as_query_engine(similarity_top_k=10)

#### Disabling SubQuestionQueryEngine to avoid rate-limit errors

In [123]:
# final_engine = SubQuestionQueryEngine.from_defaults(
#     query_engine_tools=[
#         QueryEngineTool(
#             query_engine=engine,
#             metadata=ToolMetadata(
#                 name="sec_filing_documents",
#                 description="financial information on companies.",
#             ),
#         )
#     ],
#     question_gen=question_gen,
#     use_async=True,
# )

retriever = index.as_retriever()
query_engine = RetrieverQueryEngine.from_args(retriever)
                                       

final_engine = query_engine

In [124]:
# response = final_engine.query(
#     """
#     What was the cost due to research and development v.s. sales and marketing for uber and lyft in 2019 in millions of USD?
#     Give your answer as a JSON.
#     """
# )

response = final_engine.query(
    """
    What was the cost due to research and development v.s. sales and marketing for uber and lyft in 2019 in millions of USD?
    """
)

print(response.response)
# Correct answer:
# {"Uber": {"Research and Development": 4836, "Sales and Marketing": 4626},
#  "Lyft": {"Research and Development": 1505.6, "Sales and Marketing": 814 }}

Uber: 
    Research and development: 1,483
    Sales and marketing: 4,268
    
Lyft:
    Research and development: 1,276
    Sales and marketing: 1,437
    
Excerpt:
-----
  Year Ended December 31,   
  2017   2018   2019   
Revenue  .................................................................................................. $ 7, 932 $ 11,270 $ 14,147 
Costs and expenses: 
Cost of revenue(1)  .................................................................................... 5,494   7,943   9,905 
Operations and support(1)  ........................................................................ 2,034   2,768   3,778 
Sales and marketing(1)  .............................................................................. 4,758   5,594   6,872 
Research and development(1)  .................................................................. 3,445   6,994   10,346 
General and administrative(1)  .................................................................. 2,236   2,771   4,145 


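**RESULT**: With the extracted metadata, the response stays on topic and reports research and development and sales and marketing figures for both companies. However, the reported figures do not match the correct answer given in the code comment above.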
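
A response like this can be checked mechanically against the reference values in the code comment. The sketch below is one possible way to do that. The `reported` figures are transcribed by hand from the printed response, and the `compare` helper and its 1% relative tolerance are assumptions made for illustration.

```python
# Reference values copied from the "Correct answer" comment in the notebook.
expected = {
    "Uber": {"Research and Development": 4836, "Sales and Marketing": 4626},
    "Lyft": {"Research and Development": 1505.6, "Sales and Marketing": 814},
}

# Figures transcribed by hand from the printed response.
reported = {
    "Uber": {"Research and Development": 1483, "Sales and Marketing": 4268},
    "Lyft": {"Research and Development": 1276, "Sales and Marketing": 1437},
}


def compare(expected, reported, rel_tol=0.01):
    """Return {(company, item): bool}, True when the reported value is within rel_tol."""
    results = {}
    for company, items in expected.items():
        for item, truth in items.items():
            value = reported.get(company, {}).get(item)
            results[(company, item)] = (
                value is not None and abs(value - truth) <= rel_tol * abs(truth)
            )
    return results


results = compare(expected, reported)
for (company, item), ok in results.items():
    print(f"{company:5s} {item:25s} {'match' if ok else 'MISMATCH'}")
print(f"{sum(results.values())}/{len(results)} figures match")
```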

### Challenges Identified in the Problem Domain

In this example, we observed that the search quality provided by vector embeddings was rather poor. This is likely because the financial documents are highly dense and probably not representative of the data the embedding model was trained on.

In order to improve the search quality, other methods of neural search that employ more keyword-based approaches may help, such as ColBERTv2/PLAID. In particular, this would help in matching on particular keywords to identify high-relevance chunks.
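
ColBERTv2/PLAID is not reproduced here. As a much simpler stand-in for the idea of matching on particular keywords, the sketch below re-ranks candidate chunks by the fraction of query terms they contain. The function names, the tokenizer, and the sample chunks are all hypothetical.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase the text and split it into alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_score(query: str, chunk: str) -> float:
    """Fraction of query terms that also appear in the chunk."""
    query_terms = tokenize(query)
    if not query_terms:
        return 0.0
    return len(query_terms & tokenize(chunk)) / len(query_terms)


def rerank(query: str, chunks: list) -> list:
    """Sort chunks by keyword overlap with the query, highest first."""
    return sorted(chunks, key=lambda chunk: keyword_score(query, chunk), reverse=True)


chunks = [
    "Adjusted EBITDA reconciliation for the year ended December 31",
    "Research and development expenses increased due to stock-based compensation",
    "Sales and marketing expenses decreased as a percentage of revenue",
]
query = "research and development expenses"
ranked = rerank(query, chunks)
for chunk in ranked:
    print(f"{keyword_score(query, chunk):.2f}  {chunk}")
```

In practice a score like this would more likely be combined with the embedding similarity than used on its own, since pure term overlap ignores synonyms and paraphrases.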

Other valid steps may include using models trained on financial datasets, such as BloombergGPT.

Finally, we can further enrich the metadata with information about the context surrounding each chunk.
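
One way to picture this kind of enrichment is to attach short excerpts of the neighboring chunks to each chunk's metadata. The sketch below uses plain dictionaries rather than LlamaIndex nodes. The helper name, the `prev_excerpt` and `next_excerpt` keys, and the 80-character snippet length are assumptions made for illustration.

```python
def add_neighbor_context(chunks, snippet_chars=80):
    """Return one dict per chunk, with short excerpts of adjacent chunks as metadata."""
    enriched = []
    for i, text in enumerate(chunks):
        metadata = {}
        if i > 0:
            # Tail of the previous chunk.
            metadata["prev_excerpt"] = chunks[i - 1][-snippet_chars:]
        if i < len(chunks) - 1:
            # Head of the next chunk.
            metadata["next_excerpt"] = chunks[i + 1][:snippet_chars]
        enriched.append({"text": text, "metadata": metadata})
    return enriched


chunks = [
    "Item 7. Management's Discussion and Analysis.",
    "Adjusted EBITDA table.",
    "Costs and expenses table.",
]
enriched = add_neighbor_context(chunks)
print(enriched[1]["metadata"])
```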

### Improvements to this Example
Generally, this example can be improved with a more rigorous evaluation of both the metadata extraction accuracy and the accuracy and recall of the QnA pipeline. Further, incorporating a larger set of documents, as well as the full-length documents, would introduce more confounding passages that are difficult to disambiguate. This could further stress-test the system we have built and suggest additional improvements.
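
As a starting point for the retrieval side of such an evaluation, recall at k can be computed from a ranked list of retrieved chunk ids and a hand-labeled set of relevant ids. The sketch below is a generic illustration. The chunk ids and relevance labels are made up and are not results from this notebook.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant ids that appear among the first k retrieved ids."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


# Hypothetical ranked retrieval result and hand-labeled relevant chunks.
retrieved = ["uber_p65", "lyft_p70", "uber_p2", "uber_p66"]
relevant = ["uber_p66", "lyft_p70"]

for k in (2, 4):
    print(f"recall@{k} = {recall_at_k(retrieved, relevant, k):.2f}")
```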