
feat: run PDF ingestion on document upload #29

Merged · 39 commits into main from gjreda/grobid-upload · May 29, 2023
Conversation

@gjreda (Collaborator) commented May 23, 2023

Including the new dependencies in the binary was rather finicky, so please pull down this branch and try to test it. I want to make sure it works on more machines than mine.

To do so:

$ poetry install
$ yarn python
$ yarn tauri dev

Note that this might take a minute or two to run once a document is uploaded. You can check the .lancedb directory for your document to make sure the process has completed successfully.

$ ls -la ~/.ref-studio/project-x/.lancedb
total 0
drwxr-xr-x  4 greg  staff  128 May 23 14:53 .
drwxr-xr-x  9 greg  staff  288 May 23 14:52 ..
drwxr-xr-x  5 greg  staff  160 May 23 14:52 A Few Useful Things to Know about Machine Learning.tei.lance
drwxr-xr-x  5 greg  staff  160 May 23 14:53 Machine Learning at Scale.tei.lance

When a document is uploaded, run the PDF ingestion pipeline so that we can perform Q&A over the documents.

Right now, the pipeline stores everything in the project working directory, but I suspect we'll want to change that later (there's little reason for the user to see this data):

  1. Call the HF grobid server, writing the output back to a grobid directory within the project working directory
  2. Convert the XML to JSON for easier parsing down the road
  3. Use sentence-transformers to generate sentence embeddings for dense retrieval during AI interactions
  4. Store embeddings via lancedb -- we will query this data in AI interactions (see the sketch after this list)
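For orientation, here is a minimal sketch of those four steps, assuming the grobid_client_python, sentence-transformers, and lancedb APIs; the model choice, file naming, and chunk_document helper are placeholders, not the actual sidecar code:

# Illustrative sketch only -- not the real pipeline.
from pathlib import Path

import lancedb
from grobid_client.grobid_client import GrobidClient
from sentence_transformers import SentenceTransformer


def chunk_document(json_path: Path) -> list[str]:
    """Hypothetical helper: return text chunks from a converted JSON doc."""
    raise NotImplementedError


def ingest_pdfs(project_dir: Path, grobid_url: str) -> None:
    # 1. Call the grobid server; TEI XML is written to <project>/grobid
    client = GrobidClient(grobid_server=grobid_url)
    client.process(
        "processFulltextDocument",
        str(project_dir / "uploads"),
        output=str(project_dir / "grobid"),
    )

    # 2. Convert the TEI XML to JSON in <project>/storage (details elided)

    # 3. Generate sentence embeddings for dense retrieval
    model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is illustrative

    # 4. Store one lancedb table per document under <project>/.lancedb
    db = lancedb.connect(str(project_dir / ".lancedb"))
    for json_path in (project_dir / "storage").glob("*.json"):
        chunks = chunk_document(json_path)
        vectors = model.encode(chunks)
        db.create_table(
            f"{json_path.stem}.tei",  # matches the *.tei.lance dirs above
            data=[{"vector": v.tolist(), "text": c} for v, c in zip(vectors, chunks)],
        )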

Some things to note:

  1. Since this pipeline will not be instantaneous, we'll ideally get the user to upload all their PDFs at once (or point to a directory).
  2. Both sentence-transformers and lancedb have some very hefty dependencies (torch, pyarrow). These increase the size of the binary significantly and slow down calls to it. This is something we'll want to optimize once we have the full end-to-end flow in place. We might even want to remove the dependency and use OpenAI to generate the embeddings.
  3. The chunk strategy for documents is terribly naive right now and is worth improving later (see the sketch after this list).
  4. I'd like to get the end-to-end ingestion and Q&A flow in place first, so this doesn't pull additional document metadata into lancedb yet (it's all in the json, though).
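Note 3's chunker might look something like this fixed-size character window (an illustration of a naive strategy, not necessarily the one in this PR):

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows with a small overlap,
    ignoring sentence and section boundaries entirely."""
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunk = text[start : start + size]
        if chunk:
            chunks.append(chunk)
    return chunks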

@gjreda (Collaborator Author) commented May 23, 2023

I think the diff on this PR should shrink significantly once @cguedes merges his PR here: #25. This branch is based off of that one.

Comment on lines 36 to 39
runPDFIngestion().then(() => {
  console.log('PDFs ingested with success');
  readAllProjectFiles().then(setFiles);
});
@gjreda (Collaborator Author) commented:

@cguedes @sehyod Advice or thoughts on how to do this piece are much appreciated (my JS/TS/React knowledge is minimal). I'll look into a way of communicating ingestion progress, but for now maybe we can just show a spinner or something off to the side until the sidecar command returns.

@sehyod (Collaborator) commented May 25, 2023

About the way to communicate progress, I think this is related to #33: we need to define how we want to pass information between components, in this case the loading status/progress.

About how to do this piece, I guess it depends on when we want to run the ingestion process. For now, if we want to run it every time files are uploaded, that is the right place to do it. However, you need to call runPDFIngestion only once the files have been uploaded; otherwise you will run into race conditions, with the sidecar being called before the app directory receives the new files. I'm not a big fan of the .then chaining syntax and prefer async/await, so you could write something like this:

const handleChange = async (files: FileList) => {
  await uploadFiles(files);
  console.log('File uploaded with success');
  console.log(files);
  let updatedFiles = await readAllProjectFiles();
  setFiles(updatedFiles);

  await runPDFIngestion();
  console.log('PDFs ingested with success');
  updatedFiles = await readAllProjectFiles();
  setFiles(updatedFiles);
};

A contributor commented:

> I guess it depends on when we want to run the ingestion process. For now, if we want to run it every time files are uploaded

I want to run it every time files are uploaded

@cguedes (Collaborator) commented:

I would provide runPDFIngestion with both the current array of uploaded file paths and the base directory for the uploads, both as absolute filesystem paths.

This would allow the ingestion to read either the newly uploaded files or the full list of files in the upload folder.

// In filesystem.ts
export async function getUploadsDir() {
  return join(await getBaseDir(), UPLOADS_DIR);
}

// In FoldersView.tsx
const handleChange = async (files: FileList) => {
  await uploadFiles(files);
  console.log('File uploaded with success');
  console.log(files);
  let updatedFiles = await readAllProjectFiles();
  setFiles(updatedFiles);

  const uploadedFilesPath = updatedFiles.map((f) => f.path);
  const uploadsDir = await getUploadsDir();
  await runPDFIngestion(uploadedFilesPath, uploadsDir);
  console.log('PDFs ingested with success');
  updatedFiles = await readAllProjectFiles();
  setFiles(updatedFiles);
};

@gjreda (Collaborator Author) replied:

@cguedes While helpful, I don't think passing the current array of uploaded file paths is necessary. The Grobid client takes an input and output directory, walks the input directory tree, and will skip PDFs that it has already processed (found in the output directory).

I think it's ok for the ingest process to just take the input directory path (/uploads), since the backend can determine what has already been processed (output from grobid found in /grobid, converted json found in /storage, and embeddings found in .lancedb).
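To make that concrete, the skip logic could look something like this sketch; the directory names follow this thread's layout, the .json naming is an assumption, and needs_ingestion is a hypothetical helper:

from pathlib import Path


def needs_ingestion(pdf: Path, project_dir: Path) -> bool:
    # Hypothetical check: a PDF is done when all three stage outputs exist.
    stem = pdf.stem  # e.g. "Machine Learning at Scale"
    tei_xml = project_dir / "grobid" / f"{stem}.tei.xml"
    json_doc = project_dir / "storage" / f"{stem}.json"  # naming is an assumption
    lance_dir = project_dir / ".lancedb" / f"{stem}.tei.lance"
    return not (tei_xml.exists() and json_doc.exists() and lance_dir.exists())


def pdfs_to_ingest(project_dir: Path) -> list[Path]:
    # Walk /uploads and keep only PDFs missing some stage of output.
    return [p for p in (project_dir / "uploads").glob("*.pdf")
            if needs_ingestion(p, project_dir)]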

@hammer (Contributor) commented May 25, 2023

It might be worth making this PR inclusive of adding the newly parsed PDF document to the list of references in the references component.

@hammer (Contributor) commented May 25, 2023

Also as a reminder please reference the associated issue in the PR description!

@hammer (Contributor) commented May 25, 2023

Copying over some comments from Slack to have them on GitHub:

I'm less concerned with install size and more concerned with resident memory size, and especially memory leaks, cf. this terrifying thread tauri-apps/tauri#4026

But I hear you, if we end up shipping multiple models to the client we're going to be really bloating the download

BTW the several seconds of lag for each call is possibly because the model has to be loaded into memory for each call since we're calling the sidecar as a command-line binary not as a service. I may be wrong as I haven't fully grokked the Tauri architecture. It will be interesting to see if the lag goes down with WebSockets. Longer term we can also explore various inference optimization approaches such as TVM, Neural Magic, etc.
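To illustrate the lag hypothesis: the model load is paid once per process, so a one-shot CLI pays it on every call while a resident service pays it once. A toy sketch (not the actual sidecar):

import sys

from sentence_transformers import SentenceTransformer

# Several seconds, paid once per *process*. As a one-shot CLI this runs on
# every invocation; as a long-lived service it runs once at startup.
model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is illustrative

for line in sys.stdin:  # resident loop: many requests amortize the load
    vector = model.encode(line.strip())
    print(len(vector))  # placeholder response; a real service would return the embedding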

@gjreda (Collaborator Author) commented May 25, 2023

> It might be worth making this PR inclusive of adding the newly parsed PDF document to the list of references in the references component.

I'd prefer to leave that out of this PR since it is already decently sized.

Something we can discuss more tomorrow, but the ingestion process writes json files to the filesystem, so one option is for the sidecar to return the filepath for the json file, which the client side can then read from the filesystem.

@gjreda (Collaborator Author) commented May 25, 2023

Pushed an update so that this process now returns the uploaded reference metadata JSON once it has completed. I think it makes sense to separate this PR (backend ingestion) from its frontend presentation (I've created #49 for that).

$ ./src-tauri/bin/python/main-aarch64-apple-darwin/main ingest --pdf_directory="/Users/greg/Library/Application Support/com.tauri.dev/project-x/uploads" | jq
{
  "project_name": "project-x",
  "references": [
    {
      "title": "Machine Learning at Scale",
      "authors": [
        {
          "full_name": "Sergei Izrailev",
          "given_name": "Sergei",
          "surname": "Izrailev",
          "email": "sizrailev@collective.com"
        },
        {
          "full_name": "Jeremy M Stanley",
          "given_name": "Jeremy",
          "surname": "Stanley",
          "email": "jstanley@collective.com"
        }
      ]
    },
    {
      "title": "A Few Useful Things to Know about Machine Learning",
      "authors": [
        {
          "full_name": "Pedro Domingos",
          "given_name": "Pedro",
          "surname": "Domingos",
          "email": "pedrod@cs.washington.edu"
        }
      ]
    },
    {
      "title": "Hidden Technical Debt in Machine Learning Systems",
      "authors": [
        {
          "full_name": "D Sculley",
          "given_name": "D",
          "surname": "Sculley",
          "email": "dsculley@google.com"
        },
        {
          "full_name": "Gary Holt",
          "given_name": "Gary",
          "surname": "Holt",
          "email": "gholt@google.com"
        },
        {
          "full_name": "Daniel Golovin",
          "given_name": "Daniel",
          "surname": "Golovin",
          "email": null
        },
        {
          "full_name": "Eugene Davydov",
          "given_name": "Eugene",
          "surname": "Davydov",
          "email": "edavydov@google.com"
        },
        {
          "full_name": "Todd Phillips",
          "given_name": "Todd",
          "surname": "Phillips",
          "email": "toddphillips@google.com"
        },
        {
          "full_name": "Dietmar Ebner",
          "given_name": "Dietmar",
          "surname": "Ebner",
          "email": "ebner@google.com"
        },
        {
          "full_name": "Vinay Chaudhary",
          "given_name": "Vinay",
          "surname": "Chaudhary",
          "email": "vchaudhary@google.com"
        },
        {
          "full_name": "Michael Young",
          "given_name": "Michael",
          "surname": "Young",
          "email": "mwyoung@google.com"
        },
        {
          "full_name": "Jean-Franc ¸ois Crespo",
          "given_name": "Jean-Franc ¸ois",
          "surname": "Crespo",
          "email": "jfcrespo@google.com"
        },
        {
          "full_name": "Dan Dennison",
          "given_name": "Dan",
          "surname": "Dennison",
          "email": "dennison@google.com"
        }
      ]
    }
  ]
}
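A sketch of a sidecar entry point that could produce output like the above, assuming argparse; run_ingestion is a stub standing in for the actual pipeline:

import argparse
import json
from pathlib import Path


def run_ingestion(pdf_directory: str) -> list[dict]:
    """Stub: run the ingestion pipeline and return reference metadata dicts."""
    return []


def main() -> None:
    parser = argparse.ArgumentParser(prog="main")
    subparsers = parser.add_subparsers(dest="command", required=True)
    ingest = subparsers.add_parser("ingest")
    ingest.add_argument("--pdf_directory", required=True)
    args = parser.parse_args()

    if args.command == "ingest":
        # ".../project-x/uploads" -> project name "project-x"
        project_name = Path(args.pdf_directory).parent.name
        references = run_ingestion(args.pdf_directory)
        print(json.dumps({"project_name": project_name, "references": references}))


if __name__ == "__main__":
    main()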

@gjreda gjreda requested review from hammer, sehyod and cguedes May 25, 2023 21:54
@gjreda gjreda marked this pull request as ready for review May 26, 2023 01:58
@gjreda (Collaborator Author) commented May 26, 2023

Ok, this should be good to merge now.

Some related action items:

  1. It is slow due to PyInstaller --onefile, but if I get rid of this flag, we get the errors I mentioned in Slack. I've filed #56 (Sidecar dylib binary errors: PyInstaller --onedir does not play nicely with Tauri sidecar pattern) and #48 (Sidecar process profiling) to investigate further.
  2. #50 (Remove embedding creation from PDF ingestion). Right now it is all one linear pipeline, but it makes sense to break it up.
  3. #49 (Populate new references in ReferencesView after PDF ingestion). I'll leave this one to either @cguedes or @sehyod.

@gjreda gjreda mentioned this pull request May 26, 2023
@@ -0,0 +1,8 @@
#!/usr/bin/env bash

PROJECT_DIR="/Users/greg/Library/Application Support/com.tauri.dev/project-x"
A collaborator commented:

@gjreda was this a temp reset script or should we have one and make it user-agnostic?

@cguedes previously approved these changes May 29, 2023
@sergioramos sergioramos merged commit 4a0c154 into main May 29, 2023
@sergioramos sergioramos deleted the gjreda/grobid-upload branch May 29, 2023 10:56