# RAG over the Caltrain Weekend Schedule 

<a href="https://colab.research.google.com/github/run-llama/llama_parse/blob/main/examples/caltrain/caltrain_text_mode.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This example uses LlamaParse to build a working query pipeline over the Caltrain weekend schedule, a large timetable listing every northbound and southbound train and the time it stops at each station.

Naive parsing solutions struggle to represent this table, which leads to LLM hallucinations. In contrast, LlamaParse text mode lays the table out spatially, so a capable LLM such as gpt-4o can use the spacing to reason over all the numbers.

**NOTE**: LlamaParse markdown mode doesn't quite work yet - it's in development!

## Setup

Apply `nest_asyncio` so that async calls can run inside the notebook's event loop, then download the data.

In [None]:
import nest_asyncio

nest_asyncio.apply()

In [None]:
!wget "https://www.caltrain.com/media/31602/download?inline?inline" -O caltrain_schedule_weekend.pdf
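
The cells below call both LlamaParse and OpenAI, and both need API keys, which this notebook does not set explicitly. Assuming the usual environment variable names (`LLAMA_CLOUD_API_KEY` for LlamaParse and `OPENAI_API_KEY` for OpenAI), a minimal sketch for supplying them without hard-coding secrets might look like the following. `ensure_env` is a hypothetical helper written for this example, not part of either library.

```python
import os
from getpass import getpass


def ensure_env(name, prompt=getpass):
    """Return the value of env var `name`, prompting once if it is unset.

    `prompt` is any callable that takes a message and returns a string;
    it defaults to getpass so the key is not echoed in the notebook.
    """
    if not os.environ.get(name):
        os.environ[name] = prompt(f"{name}: ")
    return os.environ[name]


# Example calls (commented out so the cell does not block when run unattended):
# ensure_env("LLAMA_CLOUD_API_KEY")  # assumed to be read by LlamaParse
# ensure_env("OPENAI_API_KEY")       # assumed to be read by the OpenAI client
```

If the keys are already exported in your shell, the helper leaves them untouched.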

## Initialize LlamaParse

Initialize LlamaParse in `text` mode, which represents complex documents (including text, tables, and figures) as neatly formatted text.

In [None]:
from llama_parse import LlamaParse

docs = LlamaParse(result_type="text").load_data("./caltrain_schedule_weekend.pdf")

Started parsing the file under job_id 5f73353a-1f4b-480d-9eea-58d1d22b75f6


Take a look at the text below (and zoom out in your browser to really get the effect!). You'll see that the entire table is laid out with its columns aligned.

In [None]:
print(docs[0].get_content())

ZONE 2ZONE 3ZONE 4ZONE 4 ZONE 3ZONE 2ZONE 1ZONE 1
                                      Printer-Friendly Caltrain Schedule
              Northbound –                         WEEKEND SERVICE to SAN FRANCISCO                                                                                                                2XX Local


                  Train No.       221        225        229        233        237        241        245        249        253        257        261        265        269        273       *277       *281
                 Service Types      L2        L2          L2        L2         L2         L2         L2         L2         L2         L2         L2         L2         L2         L2         L2         L2
                      Tamien      7:12a      9:05a     10:05a     11:05a                1:05p                 3:05p                 5:05p                 7:05p                 9:05p                11:05p
           San Jose Diridon       7:19a      9:12a     10:12

## Initialize Query Engine

We now initialize a query engine over this data. Here we use a baseline summary index, which doesn't do vector indexing/chunking and instead dumps the entire text into the prompt.

We see that the LLM (gpt-4o) is able to provide all the stops for train no. 237 northbound.

In [None]:
from llama_index.core import SummaryIndex
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o")
index = SummaryIndex.from_documents(docs)
query_engine = index.as_query_engine(llm=llm)
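
Because the summary index places the entire document in the prompt, it can be worth checking that the text plausibly fits in the model's context window before querying. The sketch below uses a rough characters-per-token heuristic (about four characters per token is a common rule of thumb for English text; whitespace-heavy tables may tokenize differently), so treat the result as an estimate only. `estimate_tokens` and `fits_in_context` are hypothetical helpers, and the context window size is a value you supply.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; chars_per_token is a heuristic, not a tokenizer."""
    return int(len(text) / chars_per_token)


def fits_in_context(text: str, context_window: int, reserve: int = 1_000) -> bool:
    """True if the estimated prompt still leaves `reserve` tokens for the answer."""
    return estimate_tokens(text) + reserve <= context_window


# Stand-in text for the sketch; in the notebook you could pass
# docs[0].get_content() instead.
sample = "Tamien      7:12a      9:05a     10:05a     11:05a" * 100
print(estimate_tokens(sample), fits_in_context(sample, context_window=8_000))
```

For an exact count you would need the model's actual tokenizer; this check is only meant to flag documents that are clearly too large.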

In [None]:
response = query_engine.query(
    "What are the stops (and times) for train no 237 northbound?"
)

In [None]:
print(str(response))

The stops and times for train no. 237 northbound are as follows:

- San Jose Diridon: 12:12 PM
- Santa Clara: 12:18 PM
- Lawrence: 12:24 PM
- Sunnyvale: 12:28 PM
- Mountain View: 12:34 PM
- San Antonio: 12:37 PM
- California Ave: 12:42 PM
- Palo Alto: 12:46 PM
- Menlo Park: 12:50 PM
- Redwood City: 12:56 PM
- San Carlos: 1:01 PM
- Belmont: 1:04 PM
- Hillsdale: 1:08 PM
- Hayward Park: 1:11 PM
- San Mateo: 1:15 PM
- Burlingame: 1:19 PM
- Broadway: 1:22 PM
- Millbrae: 1:26 PM
- San Bruno: 1:30 PM
- S. San Francisco: 1:34 PM
- Bayshore: 1:41 PM
- 22nd Street: 1:46 PM
- San Francisco: 1:52 PM


In [None]:
response = query_engine.query(
    "What are all the trains (and times) that end at Tamien going Southbound?"
)

It gets most of the answer right, though it does miss two trains.

In [None]:
print(str(response))

The trains that end at Tamien going Southbound are:

- Train 224 at 10:15a
- Train 228 at 11:45a
- Train 240 at 2:45p
- Train 248 at 4:45p
- Train 256 at 6:45p
- Train 264 at 8:45p
- Train 272 at 10:45p
- Train 284 at 1:49a


## Try Baseline

For comparison, we try a baseline approach with the default PDF reader (PyPDF) in `SimpleDirectoryReader`.

In [None]:
from llama_index.core import SimpleDirectoryReader
from llama_index.core import SummaryIndex
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o")
input_file = "caltrain_schedule_weekend.pdf"
reader = SimpleDirectoryReader(input_files=[input_file])
base_docs = reader.load_data()
index = SummaryIndex.from_documents(base_docs)
base_query_engine = index.as_query_engine(llm=llm)

In [None]:
print(base_docs[0].get_content())

Southbound  – WEEKEND SERVICE to SAN JOSE
Train No. 224 228 232 236 240 244 248 252 256 260 264 268 272 276 280 284
Service Types L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2
San Francisco 8:28a 9:58a 10:58a 11:58a 12:58p 1:58p 2:58p 3:58p 4:58p 5:58p 6:58p 7:58p 8:58p 9:58p 10:58p 12:05a
22nd Street 8:33a 10:03a 11:03a 12:03p 1:03p 2:03p 3:03p 4:03p 5:03p 6:03p 7:03p 8:03p 9:03p 10:03p 11:03p 12:10a
Bayshore 8:38a 10:08a 11:08a 12:08p 1:08p 2:08p 3:08p 4:08p 5:08p 6:08p 7:08p 8:08p 9:08p 10:08p 11:08p 12:15a
S. San Francisco 8:45a 10:15a 11:15a 12:15p 1:15p 2:15p 3:15p 4:15p 5:15p 6:15p 7:15p 8:15p 9:15p 10:15p 11:15p 12:22a
San Bruno 8:49a 10:19a 11:19a 12:19p 1:19p 2:19p 3:19p 4:19p 5:19p 6:19p 7:19p 8:19p 9:19p 10:19p 11:19p 12:26a
Millbrae 8:53a 10:24a 11:24a 12:24p 1:24p 2:24p 3:24p 4:24p 5:24p 6:24p 7:24p 8:24p 9:24p 10:24p 11:24p 12:31a
Broadway 8:57a 10:27a 11:27a 12:27p 1:27p 2:27p 3:27p 4:27p 5:27p 6:27p 7:27p 8:27p 9:27p 10:27p 11:27p 12:35a
Burlingame 9:00a 10:31a 11:31

In [None]:
base_response = base_query_engine.query(
    "What are the stops (and times) for train no 237 northbound?"
)

In [None]:
print(str(base_response))

Train No. 237 northbound stops at the following stations and times:

- Tamien: 1:05p
- San Jose Diridon: 1:12p
- Santa Clara: 1:18p
- Lawrence: 1:24p
- Sunnyvale: 1:28p
- Mountain View: 1:34p
- San Antonio: 1:37p
- California Ave: 1:42p
- Palo Alto: 1:46p
- Menlo Park: 1:50p
- Redwood City: 1:56p
- San Carlos: 2:01p
- Belmont: 2:04p
- Hillsdale: 2:08p
- Hayward Park: 2:11p
- San Mateo: 2:15p
- Burlingame: 2:19p
- Broadway: 2:22p
- Millbrae: 2:26p
- San Bruno: 2:30p
- S. San Francisco: 2:34p
- Bayshore: 2:41p
- 22nd Street: 2:46p
- San Francisco: 2:52p
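
Compare this with the LlamaParse text earlier: in the Tamien row the cell under train 237 is blank and 1:05p sits under train 241, yet the baseline answer starts train 237 at Tamien at 1:05p. One plausible explanation is that the baseline text collapses runs of whitespace, so blank cells disappear and every later value shifts left. The sketch below reproduces that effect on the first six columns of the Tamien row. The column width and the helper names are assumptions made for the illustration, not how either parser works internally.

```python
# Train numbers and Tamien times copied from the LlamaParse output above
# (first six columns only); an empty string means the cell is blank.
trains = ["221", "225", "229", "233", "237", "241"]
tamien = ["7:12a", "9:05a", "10:05a", "11:05a", "", "1:05p"]

WIDTH = 8  # hypothetical fixed column width, chosen for this sketch


def layout_row(cells, width=WIDTH):
    """Render cells as a fixed-width row, keeping blank cells as padding."""
    return "".join(cell.rjust(width) for cell in cells)


def read_fixed_width(row, n_cols, width=WIDTH):
    """Recover cells by character position, so blank cells stay blank."""
    return [row[i * width:(i + 1) * width].strip() for i in range(n_cols)]


spatial_row = layout_row(tamien)                            # spacing preserved
flattened_row = " ".join(cell for cell in tamien if cell)   # whitespace collapsed

by_position = dict(zip(trains, read_fixed_width(spatial_row, len(trains))))
by_token = dict(zip(trains, flattened_row.split()))

print(repr(by_position["237"]))  # blank: no Tamien time for train 237
print(repr(by_token["237"]))     # picks up the time that belongs to train 241
```

With the spacing intact, position alone identifies the column; once it is flattened, nothing in the text says which train a given time belongs to.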


In [None]:
base_response = base_query_engine.query(
    "What are all the trains (and times) that end at Tamien going Southbound?"
)

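Note that the train numbers don't line up with the times! Compared with the LlamaParse answer above, several times are attached to different trains, and train 284 is listed twice.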

In [None]:
print(str(base_response))

The trains that end at Tamien going Southbound are:

- Train 224 at 10:15a
- Train 228 at 11:45a
- Train 240 at 2:45p
- Train 252 at 4:45p
- Train 264 at 6:45p
- Train 276 at 8:45p
- Train 284 at 10:45p
- Train 284 at 12:44a
