# **Introduction**

This notebook shows how Indexify can help extract insights from complex SEC filings such as the Form 10-K annual report. Using Uber's 10-K as an example, we first set up question answering over the filing text, then use schema-based extraction to pull key data points out of the unstructured document. Together, the two techniques form a practical toolkit for working with dense financial filings.

## **Setup**

In [None]:
%pip install indexify indexify-extractor-sdk

# Download Indexify Server
!curl https://getindexify.ai | sh

# Install Poppler (required for PDF extraction)
# You can use brew on MacOS.
!sudo apt-get install -y poppler-utils

# Download Extractors
!indexify-extractor download hub://text/chunking
!indexify-extractor download hub://embedding/minilm-l6
!indexify-extractor download hub://pdf/pdf-extractor

After installing the libraries and downloading the server and the extractors, restart the runtime. Then start the Indexify server together with the extractors.

Open 2 terminals and run the following commands:

```bash
# Terminal 1
./indexify server -d

# Terminal 2
indexify-extractor join-server
```

## **Test the extractors**

We will try the PDFExtractor first. It extracts the text and the tables from a PDF in one pass and hands the results to the next extractors in the chain, which can then be used for question answering.

We'll start by downloading Uber's Form 10-K report.

In [2]:
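import requests

# Download Uber's Form 10-K; fail early on an HTTP error instead of saving an error page.
req = requests.get("https://d18rn0p25nwr6d.cloudfront.net/CIK-0001543151/6fabd79a-baa9-4b08-84fe-deab4ef8415f.pdf")
req.raise_for_status()

with open('form10-k.pdf','wb') as f:
    f.write(req.content)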
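
Before handing the file to an extractor, it can be worth confirming that the download really is a PDF and not, say, an HTML error page. The following is a minimal sketch using only the standard library; `looks_like_pdf` is a hypothetical helper name, not part of Indexify, and it only checks the `%PDF-` signature at the start of the file.

```python
from pathlib import Path


def looks_like_pdf(path: str) -> bool:
    """Return True if the file exists and starts with the PDF signature."""
    file = Path(path)
    if not file.is_file():
        return False
    with file.open("rb") as f:
        # Every PDF file begins with the bytes "%PDF-" followed by a version.
        return f.read(5) == b"%PDF-"
```

In the notebook, `looks_like_pdf("form10-k.pdf")` would be expected to return `True` after a successful download.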

In [None]:
from indexify_extractor_sdk import load_extractor, Content

# Load the extractor class and its config class from the downloaded package.
pdfextractor, pdfconfig_cls = load_extractor("indexify_extractors.pdf-extractor.pdf_extractor:PDFExtractor")
content = Content.from_file("form10-k.pdf")
config = pdfconfig_cls()

# Run the extractor, then keep the first plain-text part of the result.
pdf_result = pdfextractor.extract(content, config)
text_content = next(part.data.decode('utf-8') for part in pdf_result if part.content_type == 'text/plain')
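
The cell above keeps only the first `text/plain` part. To see what else the extractor returned, one option is to count the parts per content type and to join all text parts. The sketch below relies only on the `.content_type` and `.data` attributes used above; the sample objects are stand-ins built with `SimpleNamespace`, and the `application/json` entry is an assumed example of a non-text part, not confirmed extractor output.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_content_types(results) -> Counter:
    """Count extracted parts per content type."""
    return Counter(item.content_type for item in results)


def join_text_parts(results, encoding: str = "utf-8") -> str:
    """Concatenate every text/plain part, in order, separated by newlines."""
    return "\n".join(
        item.data.decode(encoding)
        for item in results
        if item.content_type == "text/plain"
    )


# Stand-in objects that mimic the two attributes used in the notebook.
sample = [
    SimpleNamespace(content_type="text/plain", data=b"page one"),
    SimpleNamespace(content_type="application/json", data=b"{}"),
    SimpleNamespace(content_type="text/plain", data=b"page two"),
]
print(summarize_content_types(sample))
print(join_text_parts(sample))
```

With the real result, `summarize_content_types(pdf_result)` could be called in place of the sample, assuming `pdf_result` is a list that can be iterated more than once.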

In [4]:
print(text_content)

UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
____________________________________________ 
FORM 10-K
____________________________________________ 
(Mark One)
☒ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the fiscal year ended December 31, 2023
OR
☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the transition period from_____ to _____            
Commission File Number: 001-38902
____________________________________________ 
UBER TECHNOLOGIES, INC.
(Exact name of registrant as specified in its charter)
____________________________________________ 
Delaware
45-2647441
(State or other jurisdiction of incorporation or organization)
(I.R.S. Employer Identification No.)
1725 3rd Street
San Francisco, California 94158
(Address of principal executive offices, including zip code)
(415) 612-8582
(Registrant’s telephone number, including area code)
 ______________________________

## **Create a Client**
Instantiate the Indexify client.

In [16]:
from indexify import IndexifyClient
client = IndexifyClient()

## **1. Question Answering Task**

### **Extraction Graph Setup**

1. Import the `ExtractionGraph` class from the `indexify` package.

2. Define the extraction graph specification in YAML format:
   - Set the name of the extraction graph to "pdfqa".
   - Define the extraction policies:
     - Use the "tensorlake/pdf-extractor" extractor to parse the PDF and name it "docextractor".
     - Use the "tensorlake/chunk-extractor" extractor for text chunking and name it "chunker".
       - Set the input parameters for the chunker:
         - `chunk_size`: 1000 (size of each text chunk)
         - `overlap`: 100 (overlap between consecutive chunks)
       - Set its `content_source` to "docextractor" so that it chunks the output of the PDF extractor.
     - Use the "tensorlake/minilm-l6" extractor for embedding and name it "embedder".
       - Set its `content_source` to "chunker".

3. Create an `ExtractionGraph` object from the YAML specification using `ExtractionGraph.from_yaml()`.

4. Create the extraction graph on the Indexify client using `client.create_extraction_graph()`.

In [6]:
from indexify import ExtractionGraph

extraction_graph_spec = """
name: 'pdfqa'
extraction_policies:
   - extractor: 'tensorlake/pdf-extractor'
     name: 'docextractor'
   - extractor: 'tensorlake/chunk-extractor'
     name: 'chunker'
     input_params:
        chunk_size: 1000
        overlap: 100
     content_source: 'docextractor'
   - extractor: 'tensorlake/minilm-l6'
     name: 'embedder'
     content_source: 'chunker'
"""

extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)
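
To build intuition for the `chunk_size` and `overlap` parameters, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only: it counts characters and slides a window, and it is an assumption that this resembles what the chunk extractor does internally (the real extractor may split on separators or tokens instead). `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the values from the graph above, a 2,500-character text yields three chunks starting at offsets 0, 900, and 1800, so each chunk repeats the last 100 characters of the previous one.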

### **Upload the Form 10-K PDF File**

In [7]:
content_id = client.upload_file("pdfqa", "form10-k.pdf")
client.wait_for_extraction(content_id)  

'357d5a0d5e9a7d30'

### **What is happening behind the scenes**

Indexify responds to ingestion events by evaluating all existing extraction policies and triggering the extractors they require. Once the PDF extractor has finished extracting text, bytes, and JSON from the document, Indexify automatically runs the chunk extractor to split the text and then the embedding extractor to compute embeddings and populate an index.

With Indexify, you can upload hundreds of PDF files at once, and the platform extracts and indexes their contents without manual intervention. To speed up extraction, you can deploy multiple instances of the extractors; Indexify's built-in scheduler distributes the workload among them transparently.
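
Uploading several files could look like the sketch below. It uses only the two client calls that appear in this notebook, `upload_file` and `wait_for_extraction`; `upload_all` is a hypothetical helper, and the client is passed in as an argument so the function can be tried with any object that offers those two methods.

```python
from typing import Iterable


def upload_all(client, graph_name: str, paths: Iterable[str]) -> dict[str, str]:
    """Upload each file to the given extraction graph and return path -> content id."""
    content_ids = {}
    # First pass: submit every file so the server can start working on them.
    for path in paths:
        content_ids[path] = client.upload_file(graph_name, path)
    # Second pass: wait until extraction has finished for each uploaded file.
    for content_id in content_ids.values():
        client.wait_for_extraction(content_id)
    return content_ids
```

For example, `upload_all(client, "pdfqa", ["form10-k.pdf"])` would reproduce the single upload shown earlier; the list can hold as many paths as needed.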

### **Perform RAG with OpenAI**

In [8]:
def get_context(question: str, index: str, top_k=3):
    results = client.search_index(name=index, query=question, top_k=top_k)
    context = ""
    for result in results:
        context = context + f"content id: {result['content_id']} \n\n passage: {result['text']}\n"
    return context

In [9]:
question = "What are the disclosures with respect to Foreign Subsidiaries?"
context = get_context(question, "pdfqa.embedder.embedding")
context

'content id: 42c552ac1b572bd3 \n\n passage: harm to the acquired company’s brand.\nIn addition, our acquisition of Careem has increased our risks under the U.S. Foreign Corrupt Practices Act (“FCPA”) and other similar laws outside the United\nStates. Our existing and planned safeguards, including training and compliance programs to discourage corrupt practices by such parties, may not prove effective,\nand such parties may engage in conduct for which we could be held responsible.\nWe may not receive a favorable return on investment for prior or future business combinations, and we cannot predict whether these transactions will be\naccretive to the value of our common stock. It is also possible that acquisitions, combinations, divestitures, joint ventures, or other strategic transactions we\nannounce could be viewed negatively by the press, investors, platform users, or regulators, any or all of which may adversely affect our reputation and our\ncontent id: 5c647fd06e97f75a \n\n passage

In [10]:
def create_prompt(question, context):
    return f"Answer the question, based on the context.\n question: {question} \n context: {context}"

prompt = create_prompt(question, context)

In [11]:
from openai import OpenAI
client_openai = OpenAI()

Now ask a question about the ingested Form 10-K document by sending the prompt to the model.

In [12]:
chat_completion = client_openai.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="gpt-3.5-turbo",
)
print(chat_completion.choices[0].message.content)

The disclosures with respect to Foreign Subsidiaries include risks under the U.S. Foreign Corrupt Practices Act and other similar laws, potential negative impacts on reputation from acquisitions or strategic transactions, and the evaluation of controls and procedures to prevent unauthorized acquisition, use, or disposition of assets.
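
The three steps above (retrieve context, build the prompt, call the model) can be wrapped into one function so that further questions take a single call. This is a sketch under assumptions: `answer_question` is a hypothetical helper, and the retrieval and completion steps are passed in as plain callables so the function itself does not depend on a running server or an API key. `create_prompt` is repeated from the notebook to keep the block self-contained.

```python
from typing import Callable


def create_prompt(question: str, context: str) -> str:
    # Same prompt template as in the notebook cell above.
    return f"Answer the question, based on the context.\n question: {question} \n context: {context}"


def answer_question(
    question: str,
    retrieve: Callable[[str], str],
    complete: Callable[[str], str],
) -> str:
    """Retrieve context for the question, build the prompt, and return the model's answer."""
    context = retrieve(question)
    prompt = create_prompt(question, context)
    return complete(prompt)


# Possible wiring with the objects defined in this notebook (not executed here):
# answer = answer_question(
#     "What are the main risk factors?",
#     retrieve=lambda q: get_context(q, "pdfqa.embedder.embedding"),
#     complete=lambda p: client_openai.chat.completions.create(
#         messages=[{"role": "user", "content": p}],
#         model="gpt-3.5-turbo",
#     ).choices[0].message.content,
# )
```

Passing the two steps as callables also makes it easy to swap the model or the index without touching the function.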


## **2. Schema-based Extraction Task**
Warning: the following example can consume a large amount of OpenAI credits. Proceed at your own risk.

### **Extraction Graph Setup**

In [18]:
extraction_graph_spec = """
name: 'pdfschema'
extraction_policies:
   - extractor: 'tensorlake/pdf-extractor'
     name: 'docextractor'
   - extractor: 'tensorlake/schema'
     name: 'schemaprocessor'
     input_params:
        service: 'openai'
        schema: {'properties': {'file_number': {'title': 'File Number', 'type': 'string'}, 'registrant_name': {'title': 'Registrant Name', 'type': 'string'}, 'jurisdiction': {'title': 'Jurisdiction', 'type': 'string'}, 'employer_id_number': {'title': 'Employer Id Number', 'type': 'string'}, 'address': {'title': 'Address', 'type': 'string'}, 'telephone_number': {'title': 'Telephone Number', 'type': 'string'}, 'title_of_each_class': {'title': 'Title Of Each Class', 'type': 'string'}, 'trading_symbol': {'title': 'Trading Symbol', 'type': 'string'}, 'name_of_exchange': {'title': 'Name Of Exchange', 'type': 'string'}}, 'required': ['file_number', 'registrant_name', 'jurisdiction', 'employer_id_number', 'address', 'telephone_number', 'title_of_each_class', 'trading_symbol', 'name_of_exchange'], 'title': 'Form', 'type': 'object'}
     content_source: 'docextractor'
"""

extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)
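
The inline schema above is long and easy to mistype. One way to produce a schema of the same shape is to generate it from a list of field names, as in this sketch. `build_string_schema` is a hypothetical helper, it assumes every field is a string as in the schema above, and the notebook does not say how the original schema was generated.

```python
import json


def build_string_schema(title: str, field_names: list[str]) -> dict:
    """Build a JSON-schema-style dict in which every field is a required string."""
    properties = {
        # "employer_id_number" becomes the title "Employer Id Number".
        name: {"title": name.replace("_", " ").title(), "type": "string"}
        for name in field_names
    }
    return {
        "properties": properties,
        "required": list(field_names),
        "title": title,
        "type": "object",
    }


fields = [
    "file_number", "registrant_name", "jurisdiction", "employer_id_number",
    "address", "telephone_number", "title_of_each_class", "trading_symbol",
    "name_of_exchange",
]
schema = build_string_schema("Form", fields)
print(json.dumps(schema, indent=2))
```

The printed result can be compared against the `schema` entry in the YAML specification before creating the graph.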

### **Upload the Form 10-K PDF File**

In [19]:
content_id = client.upload_file("pdfschema", "form10-k.pdf")
client.wait_for_extraction(content_id)  

'6c898c11de955629'

### **View the extracted content**

In [None]:
# Reuse the content id returned by upload_file instead of hard-coding it.
client.get_extracted_content(content_id)
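
Once the extracted fields are available as a Python dict, a quick check against the schema's `required` list shows which fields the model left empty. The sketch below assumes you have already pulled such a dict out of the returned content; how the fields are nested in the return value of `get_extracted_content` is not shown in this notebook, so that step is left out. `missing_required` is a hypothetical helper, and the sample record is hand-written from the filing's cover page, not actual extractor output.

```python
def missing_required(record: dict, schema: dict) -> list[str]:
    """Return the required field names that are absent or empty in the record."""
    return [name for name in schema.get("required", []) if not record.get(name)]


# Hand-written sample for illustration only.
sample_schema = {"required": ["file_number", "registrant_name", "trading_symbol"]}
sample_record = {
    "file_number": "001-38902",
    "registrant_name": "UBER TECHNOLOGIES, INC.",
    "trading_symbol": "",
}
print(missing_required(sample_record, sample_schema))
```

An empty list means every required field was filled in; any names that come back are candidates for a manual look at the filing.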