# LlamaParse JSON Mode Tour

<a href="https://colab.research.google.com/github/run-llama/llama_parse/blob/main/examples/demo_json.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook shows you how to use LlamaParse JSON mode with LlamaIndex and all the features it supports.

JSON mode gives you a rich variety of data and metadata about every page in your document.

## Setup

Let's bring in our imports and set up our API keys.

In [None]:
!pip install llama-index-core
!pip install llama-parse

In [None]:
# llama-parse is async-first; running async code in a notebook requires nest_asyncio
import nest_asyncio

nest_asyncio.apply()

import os

# API access to llama-cloud
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-xxx"

## Load Data

Let's load a large and complex PDF, San Francisco's 2023 proposed budget.

In [None]:
!wget 'https://www.dropbox.com/scl/fi/vip161t63s56vd94neqlt/2023-CSF_Proposed_Budget_Book_June_2023_Master_Web.pdf?rlkey=hemoce3w1jsuf6s2bz87g549i&dl=0' -O './san_francisco_budget_2023.pdf'


## Using LlamaParse in JSON Mode for PDF Reading

Let's parse our document! We're using Premium mode to get all the bells and whistles, so this takes a while the first time.

In [None]:
from llama_parse import LlamaParse

parser = LlamaParse(verbose=True, premium_mode=True)
json_objs = parser.get_json_result("./san_francisco_budget_2023.pdf")

Started parsing the file under job_id 32778fb0-9e83-4b00-aebe-0d7f59ff0b5f


Let's see what we got:

In [None]:
import json

print(list(json_objs[0].keys()))

['pages', 'job_metadata', 'job_id', 'file_path']


Let's take a look at the metadata first:

In [None]:
pages = json_objs[0]["pages"]
metadata = json_objs[0]["job_metadata"]

print(json.dumps(metadata, indent=2))

{
  "credits_used": 5430.0,
  "job_credits_usage": 0,
  "job_pages": 0,
  "job_auto_mode_triggered_pages": 0,
  "job_is_cache_hit": true
}


This was a cache hit (I re-ran this notebook while writing it), so we can see how many credits were used to parse this initially, and also that no credits were used for a cache hit. The cache lasts 48 hours.

Now let's take a look at the real meat of the document. How many pages did we get?


In [None]:
print(len(pages))

362


That's a lot! Let's look at the very first page, which we know looks like this:

<img src="./json_tour_screenshots/page_1.png" alt="Page 1" width="300"/>


In [None]:
print(list(pages[0].keys()))

['page', 'text', 'md', 'images', 'charts', 'items', 'status', 'links', 'width', 'height', 'triggeredAutoMode', 'structuredData', 'noStructuredContent', 'noTextContent']


We get a long list of properties in JSON mode. In this tour we're going to touch on them all!

## Basic properties

Let's get the easy-to-explain properties out of the way first.

* `page`: this is simply the page number, starting at 1.
* `status`: this is the status of the page, which is usually "OK" unless there was an error processing the page.
* `width` and `height`: these are the dimensions of the page in pixels.
* `triggeredAutoMode`: this indicates whether the page triggered auto mode; see [LlamaParse docs](https://docs.cloud.llamaindex.ai/llamaparse/getting_started) for more details.
* `structuredData`/`noStructuredContent`: these are set if you are using structured mode; see [LlamaParse docs](https://docs.cloud.llamaindex.ai/llamaparse/getting_started) for more details.
* `noTextContent`: this is true if the page was empty of text.
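
These basic properties are already enough to triage a parse before looking at any content. The following is a minimal sketch, assuming page dictionaries shaped like the key list printed earlier; the helper name `summarize_pages` and the sample values (including the page dimensions) are hypothetical and not part of the LlamaParse SDK.

```python
from collections import Counter


def summarize_pages(pages):
    """Return status counts and the page numbers flagged as having no text."""
    status_counts = Counter(page.get("status", "UNKNOWN") for page in pages)
    empty_pages = [page["page"] for page in pages if page.get("noTextContent")]
    return dict(status_counts), empty_pages


# Hypothetical sample data that only uses the basic properties listed above.
sample_pages = [
    {"page": 1, "status": "OK", "width": 612, "height": 792, "noTextContent": False},
    {"page": 2, "status": "OK", "width": 612, "height": 792, "noTextContent": True},
]

status_counts, empty_pages = summarize_pages(sample_pages)
print(status_counts)
print(empty_pages)
```

In the notebook itself you would pass the real `pages` list instead of `sample_pages`.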

Now let's take a closer look at the more complex properties.


### `text`

In [None]:
print(pages[0]["text"])

                    CITY & COUNTY OF SAN FRANCISCO, CALIFORNIA



PROPOSED BUDGETFISCAL YEARS 2023-2024 & 2024-2025



                   LONDON N. BREED
                                  6 ANDCOUNTY g
                             1                                    23
                                o TVTS EngueRRAODSIOAVFierrO
            MAYOR’S OFFICE OF PUBLIC POLICY AND FINANCE
       Anna Duning, Director of Mayor’s                          Fisher Zhu, Fiscal and Policy Analyst
       Office of Public Policy and Finance                     Anya Shutovska, Fiscal and Policy Analyst
       Sally Ma, Deputy Budget Director
Radhika Mehlotra, Senior Fiscal and Policy Analyst               Jack English, Fiscal and Policy Analyst
    Damon Daniels, Fiscal and Policy Analyst                   Xang Hang, Junior Fiscal and Policy Analyst
   Matthew Puckett, Fiscal and Policy Analyst               Tabitha Romero-Bothi, Fiscal and Policy Assistant


As you can see, this is a laid-out plain-text version of all the text LlamaParse was able to extract from the page. On a page that is mostly text, this is often sufficient! On a page like this one, with a complex layout and images, it has some limitations: the text "and county" has been extracted from the seal of the city, as has some of the motto. We'll see later how we can use the other structures in the JSON output to get around this.

### `md`

In [None]:
print(pages[0]["md"])

CITY & COUNTY OF SAN FRANCISCO, CALIFORNIA

# PROPOSED BUDGET

FISCAL YEARS 2023-2024 & 2024-2025

## LONDON N. BREED

[A circular seal of the City and County of San Francisco is displayed]

### MAYOR'S OFFICE OF PUBLIC POLICY AND FINANCE

| Position | Name |
|----------|------|
| Director of Mayor's Office of Public Policy and Finance | Anna Duning |
| Deputy Budget Director | Sally Ma |
| Senior Fiscal and Policy Analyst | Radhika Mehlotra |
| Fiscal and Policy Analyst | Damon Daniels |
| Fiscal and Policy Analyst | Matthew Puckett |
| Fiscal and Policy Analyst | Fisher Zhu |
| Fiscal and Policy Analyst | Anya Shutovska |
| Fiscal and Policy Analyst | Jack English |
| Junior Fiscal and Policy Analyst | Xang Hang |
| Fiscal and Policy Assistant | Tabitha Romero-Bothi |


`md` is a Markdown representation of the page, and it has been parsed with more sophistication. Images have been accounted for: you get a description of the image instead of partial OCR from the image text. The table of authors has also been rendered as a markdown table. This renders visually quite nicely:

CITY & COUNTY OF SAN FRANCISCO, CALIFORNIA

# PROPOSED BUDGET

FISCAL YEARS 2023-2024 & 2024-2025

## LONDON N. BREED

[A circular seal of the City and County of San Francisco is displayed]

### MAYOR'S OFFICE OF PUBLIC POLICY AND FINANCE

| Position | Name |
|----------|------|
| Director of Mayor's Office of Public Policy and Finance | Anna Duning |
| Deputy Budget Director | Sally Ma |
| Senior Fiscal and Policy Analyst | Radhika Mehlotra |
| Fiscal and Policy Analyst | Damon Daniels |
| Fiscal and Policy Analyst | Matthew Puckett |
| Fiscal and Policy Analyst | Fisher Zhu |
| Fiscal and Policy Analyst | Anya Shutovska |
| Fiscal and Policy Analyst | Jack English |
| Junior Fiscal and Policy Analyst | Xang Hang |
| Fiscal and Policy Assistant | Tabitha Romero-Bothi |

### `images`

This is an array of all the images on the page, including metadata and text OCRed out of the images, as well as a full-page screenshot of the entire page. Let's take a look at the first image (we're going to skip the `ocr` key for now because it's very large):


In [None]:
image_data = pages[0]["images"][0].copy()
del image_data["ocr"]
print(json.dumps(image_data, indent=2))

{
  "name": "img_p0_1.png",
  "height": 393,
  "width": 396,
  "x": 217.1828461,
  "y": 347.84094239999996,
  "original_width": 396,
  "original_height": 393
}


These fields are hopefully self-explanatory. Let's take a look at `img_p0_1.png` by getting the SDK to download the images for us.

In [None]:
# Build a single-page copy of json_objs to avoid downloading all the images.
# A plain json_objs.copy() is shallow, so reassigning "pages" on it would also
# truncate json_objs[0]; build a new dict instead.
first_page_json = [{**json_objs[0], "pages": json_objs[0]["pages"][:1]}]

# get the SDK to download all the images to a local directory for us
images = parser.get_images(first_page_json, download_path="./json_tour_screenshots")

print(images[0]["path"])

> Image for page 1: [{'name': 'img_p0_1.png', 'height': 393, 'width': 396, 'x': 217.1828461, 'y': 347.84094239999996, 'original_width': 396, 'original_height': 393, 'ocr': [{'x': 349, 'y': 197, 'w': 38, 'h': 62, 'confidence': '0.8849547199866379', 'text': '3'}, {'x': 30, 'y': 264, 'w': 50, 'h': 58, 'confidence': '0.5020322574275724', 'text': 'o'}, {'x': 229, 'y': 297, 'w': 52, 'h': 18, 'confidence': '0.2651256745127598', 'text': 'FierrO'}, {'x': 169, 'y': 313, 'w': 78, 'h': 18, 'confidence': '0.41575810326364837', 'text': 'EngueRRA'}, {'x': 155, 'y': -5, 'w': 155, 'h': 82, 'confidence': '0.9739526858073052', 'text': 'COUNTY'}, {'x': 51, 'y': 57, 'w': 104, 'h': 6, 'confidence': '0.9999688909318631', 'text': 'AND'}, {'x': 301, 'y': 74, 'w': 63, 'h': 36, 'confidence': '0.6971765886842753', 'text': 'g'}, {'x': 42, 'y': 67, 'w': 2, 'h': 118, 'confidence': '0.266282299733124', 'text': '6'}, {'x': 339, 'y': 116, 'w': 48, 'h': 81, 'confidence': '0.9965592468417981', 'text': '2'}, {'x': 2, 'y':

![](./json_tour_screenshots/32778fb0-9e83-4b00-aebe-0d7f59ff0b5f-img_p0_1.png)

The image has been accurately identified and extracted from the page. Let's take a look at that lengthy `ocr` field now:

In [None]:
print(json.dumps(pages[0]["images"][0]["ocr"], indent=2))

[
  {
    "x": 349,
    "y": 197,
    "w": 38,
    "h": 62,
    "confidence": "0.8849547199866379",
    "text": "3"
  },
  {
    "x": 30,
    "y": 264,
    "w": 50,
    "h": 58,
    "confidence": "0.5020322574275724",
    "text": "o"
  },
  {
    "x": 229,
    "y": 297,
    "w": 52,
    "h": 18,
    "confidence": "0.2651256745127598",
    "text": "FierrO"
  },
  {
    "x": 169,
    "y": 313,
    "w": 78,
    "h": 18,
    "confidence": "0.41575810326364837",
    "text": "EngueRRA"
  },
  {
    "x": 155,
    "y": -5,
    "w": 155,
    "h": 82,
    "confidence": "0.9739526858073052",
    "text": "COUNTY"
  },
  {
    "x": 51,
    "y": 57,
    "w": 104,
    "h": 6,
    "confidence": "0.9999688909318631",
    "text": "AND"
  },
  {
    "x": 301,
    "y": 74,
    "w": 63,
    "h": 36,
    "confidence": "0.6971765886842753",
    "text": "g"
  },
  {
    "x": 42,
    "y": 67,
    "w": 2,
    "h": 118,
    "confidence": "0.266282299733124",
    "text": "6"
  },
  {
    "x": 339,
    "y": 116,
 

In this example, the OCR has attempted to extract text from the image of the seal and been foiled by text that is small and curved. This is very tricky text! For each chunk of text extracted, you get the approximate size and location of the text, as well as a confidence score of how well the text was extracted.
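
One way to put the confidence score to work is to drop low-confidence chunks before using the OCR text. The sketch below assumes entries shaped like the output above, where `confidence` arrives as a string; the helper name `confident_ocr_text` and the 0.9 default threshold are illustrative choices, not part of LlamaParse.

```python
def confident_ocr_text(ocr_entries, min_confidence=0.9):
    """Keep the text of OCR chunks at or above the confidence threshold."""
    kept = []
    for entry in ocr_entries:
        # Confidence is a string in the output above, so convert it first.
        if float(entry["confidence"]) >= min_confidence:
            kept.append(entry["text"])
    return kept


# A few entries copied from the OCR output shown above.
sample_ocr = [
    {"x": 349, "y": 197, "w": 38, "h": 62, "confidence": "0.8849547199866379", "text": "3"},
    {"x": 229, "y": 297, "w": 52, "h": 18, "confidence": "0.2651256745127598", "text": "FierrO"},
    {"x": 155, "y": -5, "w": 155, "h": 82, "confidence": "0.9739526858073052", "text": "COUNTY"},
    {"x": 51, "y": 57, "w": 104, "h": 6, "confidence": "0.9999688909318631", "text": "AND"},
]

print(confident_ocr_text(sample_ocr))
```

With this sample, only "COUNTY" and "AND" clear the threshold, which matches the intuition that the large, straight lettering on the seal is the easiest to read.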


### `items`

This is the most valuable and complex property of JSON mode. This is an array of all the parsed elements on the page, as used to render the markdown, but separated out into their own objects. This is useful if you want to do more processing on the data. Let's take a look:

In [None]:
print(json.dumps(pages[0]["items"], indent=2))

[
  {
    "type": "text",
    "value": "CITY & COUNTY OF SAN FRANCISCO, CALIFORNIA",
    "md": "CITY & COUNTY OF SAN FRANCISCO, CALIFORNIA",
    "bBox": {
      "x": 176,
      "y": 64,
      "w": 276.2,
      "h": 12
    }
  },
  {
    "type": "heading",
    "lvl": 1,
    "value": "PROPOSED BUDGET",
    "md": "# PROPOSED BUDGET",
    "bBox": {
      "x": 89,
      "y": 165,
      "w": 450.97,
      "h": 47
    }
  },
  {
    "type": "text",
    "value": "FISCAL YEARS 2023-2024 & 2024-2025",
    "md": "FISCAL YEARS 2023-2024 & 2024-2025",
    "bBox": {
      "x": 170,
      "y": 204,
      "w": 288.41,
      "h": 268.16
    }
  },
  {
    "type": "heading",
    "lvl": 2,
    "value": "LONDON N. BREED",
    "md": "## LONDON N. BREED",
    "bBox": {
      "x": 171,
      "y": 292,
      "w": 285.94,
      "h": 31
    }
  },
  {
    "type": "text",
    "value": "[A circular seal of the City and County of San Francisco is displayed]",
    "md": "[A circular seal of the City and County of S

As you can see, you get different element types: text, headings, and tables. Each comes with its own `md` key containing a Markdown representation of that element, so you can easily build a summary from only the headings, only the tables, and so on.
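
As an illustration of filtering by element type, the sketch below builds an indented outline from the heading items only. It assumes items shaped like the output above (`type`, `lvl`, `value`); the helper name `headings_outline` is hypothetical.

```python
def headings_outline(items):
    """Return one indented bullet per heading item, nested by heading level."""
    lines = []
    for item in items:
        if item.get("type") == "heading":
            indent = "  " * (item.get("lvl", 1) - 1)
            lines.append(f"{indent}- {item['value']}")
    return lines


# The first four items from the output above, with bounding boxes omitted.
sample_items = [
    {"type": "text", "value": "CITY & COUNTY OF SAN FRANCISCO, CALIFORNIA"},
    {"type": "heading", "lvl": 1, "value": "PROPOSED BUDGET"},
    {"type": "text", "value": "FISCAL YEARS 2023-2024 & 2024-2025"},
    {"type": "heading", "lvl": 2, "value": "LONDON N. BREED"},
]

for line in headings_outline(sample_items):
    print(line)
```

The same pattern works for tables: swap the type check to `"table"` and collect each item's `md` value instead.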

The ability to extract tables from visual data is really powerful. Let's take a look at page 35, which has some bar charts that get automatically converted into tables:

<img src="./json_tour_screenshots/page_35.png" alt="Page 35" width="300"/>


The bar chart has been converted into a table. Even though the chart does not print explicit values, the parser has read the bars and included an approximate value for each one!

In [None]:
print(json.dumps(pages[34]["items"][2], indent=2))

{
  "type": "table",
  "rows": [
    [
      "Age Group",
      "Number of Residents"
    ],
    [
      "Under 5 Years",
      "~50,000"
    ],
    [
      "5 to 19 Years",
      "~100,000"
    ],
    [
      "20 to 34 Years",
      "~250,000"
    ],
    [
      "35 to 59 Years",
      "~300,000"
    ],
    [
      "60 and Over",
      "~200,000"
    ]
  ],
  "md": "| Age Group | Number of Residents |\n|-----------|---------------------|\n| Under 5 Years | ~50,000 |\n| 5 to 19 Years | ~100,000 |\n| 20 to 34 Years | ~250,000 |\n| 35 to 59 Years | ~300,000 |\n| 60 and Over | ~200,000 |",
  "isPerfectTable": true,
  "csv": "\"Age Group\",\"Number of Residents\"\n\"Under 5 Years\",\"~50,000\"\n\"5 to 19 Years\",\"~100,000\"\n\"20 to 34 Years\",\"~250,000\"\n\"35 to 59 Years\",\"~300,000\"\n\"60 and Over\",\"~200,000\"",
  "bBox": {
    "x": null,
    "y": 88,
    "w": null,
    "h": 693
  }
}
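
Because each table item carries a `csv` string, it can be loaded with Python's standard `csv` module for further processing. This is a minimal sketch, assuming a table item shaped like the output above; the helper names `table_csv_to_records` and `approx_to_int` are hypothetical, and the handling of the `~` prefix is an assumption based on this one example.

```python
import csv
import io


def table_csv_to_records(table_item):
    """Parse a table item's `csv` string into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(table_item["csv"], newline=""))
    return list(reader)


def approx_to_int(value):
    """Convert a string such as '~50,000' to the integer 50000."""
    return int(value.lstrip("~").replace(",", ""))


# A shortened copy of the table item shown above.
sample_table = {
    "type": "table",
    "csv": '"Age Group","Number of Residents"\n'
    '"Under 5 Years","~50,000"\n'
    '"5 to 19 Years","~100,000"',
}

records = table_csv_to_records(sample_table)
for record in records:
    print(record["Age Group"], approx_to_int(record["Number of Residents"]))
```

Keep in mind that these numbers are estimates read off a chart, so they are better suited to rough comparisons than to exact reporting.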


### `links`

Our budget PDF doesn't have any links, so let's load a different PDF with links and see what we get.


In [None]:
!wget 'https://www.dropbox.com/scl/fi/hay06lyxc49gkuh91oek6/basic-link-1.pdf?rlkey=uije7yb0lxqgqwk7p7hnqepdx&dl=0' -O './basic-link-1.pdf'


In [None]:
link_parsed = parser.get_json_result("./basic-link-1.pdf")

Started parsing the file under job_id 8859f204-a3b5-49f8-b254-870be06cbfc2


This is a very simple document with some internal and external links:

<img src="./json_tour_screenshots/links_page.png" alt="Page 1" width="300"/>


The parser finds the external links and their labels and includes them in the `links` section:

In [None]:
link_page = link_parsed[0]["pages"][0]
print(json.dumps(link_page["links"], indent=2))

[
  {
    "url": "https://www.antennahouse.com/",
    "text": "Antenna House, Inc."
  },
  {
    "url": "https://www.antennahouse.com/",
    "text": "Linking to a website (https://www.antennahouse.com/)"
  }
]
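
Notice that the same URL appears twice with different labels. If you only need the distinct destinations, a small grouping step handles that. The sketch below assumes link entries shaped like the output above; the helper name `group_links_by_url` is hypothetical.

```python
def group_links_by_url(links):
    """Map each distinct URL to the list of labels that point at it."""
    grouped = {}
    for link in links:
        grouped.setdefault(link["url"], []).append(link["text"])
    return grouped


# The two entries from the output above.
sample_links = [
    {"url": "https://www.antennahouse.com/", "text": "Antenna House, Inc."},
    {
        "url": "https://www.antennahouse.com/",
        "text": "Linking to a website (https://www.antennahouse.com/)",
    },
]

grouped_links = group_links_by_url(sample_links)
for url, labels in grouped_links.items():
    print(url, labels)
```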


This concludes our tour! I hope this makes clear the power of JSON mode and the flexibility it gives you over what parts of your documents you can use.