pdfs as parquet text #2

jacksonloper · 2025-11-03T19:03:48Z

I think we can store the text of every PDF we have ever seen as a directory of Parquet files with the following structure:

sha256 | dateprocessed | text (list of strings)

This directory can be the basis for the other parsing pipelines. I think we can actually store that parquet directory right here in the repo, making it maximally transparent and allowing easy reproducibility for the rest of our stuff.
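For illustration, here is a minimal sketch of how one row of that structure might be built and written. The PR does not say which library writes the files, what unit each string in `text` holds, or how `dateprocessed` is encoded, so the use of `pyarrow`, the ISO-8601 timestamp string, and the helper names `make_record` and `write_parquet` are all assumptions rather than the code in this PR.

```python
import hashlib
from datetime import datetime, timezone


def make_record(pdf_bytes, chunks):
    """Build one row: content hash, processing time, extracted text.

    `chunks` is a list of strings (for example one per page; the
    actual unit used in this PR is not stated, so this is an assumption).
    """
    return {
        "sha256": hashlib.sha256(pdf_bytes).hexdigest(),
        # Assumed encoding: UTC timestamp as an ISO-8601 string.
        "dateprocessed": datetime.now(timezone.utc).isoformat(),
        "text": list(chunks),
    }


def write_parquet(records, path):
    """Write rows to a single Parquet file (hypothetical helper)."""
    # pyarrow is an assumed dependency, imported lazily so that
    # make_record stays usable without it.
    import pyarrow as pa
    import pyarrow.parquet as pq

    schema = pa.schema([
        ("sha256", pa.string()),
        ("dateprocessed", pa.string()),
        ("text", pa.list_(pa.string())),
    ])
    table = pa.Table.from_pylist(records, schema=schema)
    pq.write_table(table, path)
```

Keying each row on the SHA-256 of the PDF bytes would let downstream pipelines skip files they have already processed, since identical content always hashes to the same value.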

stefaneng

Cool idea! Looks good to me

stefaneng · 2025-11-03T23:53:54Z

pdf_parsing/parquet_files/20251103_134410_pdf_text.parquet

Curious why this one is 10.3 MB. Is this all of the current documents?

jacksonloper added 4 commits November 3, 2025 14:02

pdfs as parquet text

45979b8

rest of the pdfs

606520c

woops missed one

7342645

Pdf --> text routine

3269624

stefaneng approved these changes Nov 3, 2025

