
feat: run PDF ingestion on document upload #29

Merged · 39 commits into main from gjreda/grobid-upload · May 29, 2023
Conversation

@gjreda (Collaborator) commented May 23, 2023

Including the new dependencies in the binary was rather finicky, so please pull down this branch and try to test it. I want to make sure it works on more machines than mine.

To do so:

$ poetry install
$ yarn python
$ yarn tauri dev

Note that this might take a minute or two to run once a document is uploaded. You can check the .lancedb directory for your document to make sure the process has completed successfully.

$ ls -la ~/.ref-studio/project-x/.lancedb
total 0
drwxr-xr-x  4 greg  staff  128 May 23 14:53 .
drwxr-xr-x  9 greg  staff  288 May 23 14:52 ..
drwxr-xr-x  5 greg  staff  160 May 23 14:52 A Few Useful Things to Know about Machine Learning.tei.lance
drwxr-xr-x  5 greg  staff  160 May 23 14:53 Machine Learning at Scale.tei.lance

When a document is uploaded, run the PDF ingestion pipeline so that we can perform Q&A over the documents.

Right now, the pipeline stores everything in the project working directory, but I suspect we'll want to change that later (there's little reason for the user to see this data):

  1. Call the HF grobid server, writing the output back to a grobid directory within the project working directory
  2. Convert the XML to JSON for easier parsing down the road
  3. Use sentence-transformers to generate sentence embeddings for dense retrieval during AI interactions
  4. Store embeddings via lancedb -- we will query this data in AI interactions (see the sketch after this list)
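For orientation, here is a minimal sketch of those four steps, assuming the grobid_client_python, sentence-transformers, and lancedb APIs; the model choice, file naming, and chunk_document helper are placeholders, not the actual sidecar code:

# Illustrative sketch only -- not the real pipeline.
from pathlib import Path

import lancedb
from grobid_client.grobid_client import GrobidClient
from sentence_transformers import SentenceTransformer


def chunk_document(json_path: Path) -> list[str]:
    """Hypothetical helper: return text chunks from a converted JSON doc."""
    raise NotImplementedError


def ingest_pdfs(project_dir: Path, grobid_url: str) -> None:
    # 1. Call the grobid server; TEI XML is written to <project>/grobid
    client = GrobidClient(grobid_server=grobid_url)
    client.process(
        "processFulltextDocument",
        str(project_dir / "uploads"),
        output=str(project_dir / "grobid"),
    )

    # 2. Convert the TEI XML to JSON in <project>/storage (details elided)

    # 3. Generate sentence embeddings for dense retrieval
    model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is illustrative

    # 4. Store one lancedb table per document under <project>/.lancedb
    db = lancedb.connect(str(project_dir / ".lancedb"))
    for json_path in (project_dir / "storage").glob("*.json"):
        chunks = chunk_document(json_path)
        vectors = model.encode(chunks)
        db.create_table(
            f"{json_path.stem}.tei",  # matches the *.tei.lance dirs above
            data=[{"vector": v.tolist(), "text": c} for v, c in zip(vectors, chunks)],
        )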

Some things to note:

  1. Since this pipeline will not be instantaneous, we'll ideally get the user to upload all their PDFs at once (or point to a directory).
  2. Both sentence-transformers and lancedb have some very hefty dependencies (torch, pyarrow). These increase the size of the binary significantly and slow down calls to it. This is something we'll want to optimize once we have the full end-to-end flow in place. We might even want to remove the dependency and use OpenAI to generate the embeddings.
  3. The chunk strategy for documents is terribly naive right now and is worth improving later (see the sketch after this list).
  4. I'd like to get the end-to-end ingestion and Q&A flow in place first, so this doesn't pull additional document metadata into lancedb yet (it's all in the json, though).
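Note 3's chunker might look something like this fixed-size character window (an illustration of a naive strategy, not necessarily the one in this PR):

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows with a small overlap,
    ignoring sentence and section boundaries entirely."""
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunk = text[start : start + size]
        if chunk:
            chunks.append(chunk)
    return chunks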

@gjreda (Collaborator Author) commented May 23, 2023

I think the diff on this PR should shrink significantly once @cguedes merges his PR here: #25. This branch is based off of that one.

Comment on lines 36 to 39
runPDFIngestion().then(() => {
  console.log('PDFs ingested with success');
  readAllProjectFiles().then(setFiles);
});
@gjreda (Collaborator Author) commented:

@cguedes @sehyod Advice or thoughts on how to do this piece are much appreciated (my JS/TS/React knowledge is minimal). I'll look into a way of communicating ingestion progress, but for now maybe we can just show a spinner or something off to the side until the sidecar command returns.

@sehyod (Collaborator) commented May 25, 2023

About the way to communicate progress, I think this is related to #33: we need to define how we want to pass information between components, in this case the loading status/progress.

About how to do this piece, I guess it depends on when we want to run the ingestion process. For now, if we want to run it every time files are uploaded, that is the right place to do it. However, you need to call runPDFIngestion only once the files have been uploaded; otherwise you will run into race conditions, with the sidecar being called before the app directory receives the new files. I'm not a big fan of the .then chaining syntax and prefer async/await, so you could write something like this:

const handleChange = async (files: FileList) => {
  await uploadFiles(files);
  console.log('File uploaded with success');
  console.log(files);
  let updatedFiles = await readAllProjectFiles();
  setFiles(updatedFiles);

  await runPDFIngestion();
  console.log('PDFs ingested with success');
  updatedFiles = await readAllProjectFiles();
  setFiles(updatedFiles);
};

A contributor commented:

> I guess it depends on when we want to run the ingestion process. For now, if we want to run it every time files are uploaded

I want to run it every time files are uploaded

@cguedes (Collaborator) commented:

I would provide runPDFIngestion with both the current array of uploaded file paths and the base directory for the uploads, both as absolute filesystem paths.

This would allow the ingestion to read either the newly uploaded files or the full list of files in the upload folder.

// In filesystem.ts
export async function getUploadsDir() {
  return join(await getBaseDir(), UPLOADS_DIR);
}

// In FoldersView.tsx
const handleChange = async (files: FileList) => {
  await uploadFiles(files);
  console.log('File uploaded with success');
  console.log(files);
  let updatedFiles = await readAllProjectFiles();
  setFiles(updatedFiles);

  const uploadedFilesPath = updatedFiles.map((f) => f.path);
  const uploadsDir = await getUploadsDir();
  await runPDFIngestion(uploadedFilesPath, uploadsDir);
  console.log('PDFs ingested with success');
  updatedFiles = await readAllProjectFiles();
  setFiles(updatedFiles);
};

@gjreda (Collaborator Author) replied:

@cguedes While helpful, I don't think passing the current array of uploaded file paths is necessary. The Grobid client takes an input and output directory, walks the input directory tree, and will skip PDFs that it has already processed (found in the output directory).

I think it's ok for the ingest process to just take the input directory path (/uploads), since the backend can determine what has already been processed (output from grobid found in /grobid, converted json found in /storage, and embeddings found in .lancedb).
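To make that concrete, the skip logic could look something like this sketch; the directory names follow this thread's layout, the .json naming is an assumption, and needs_ingestion is a hypothetical helper:

from pathlib import Path


def needs_ingestion(pdf: Path, project_dir: Path) -> bool:
    # Hypothetical check: a PDF is done when all three stage outputs exist.
    stem = pdf.stem  # e.g. "Machine Learning at Scale"
    tei_xml = project_dir / "grobid" / f"{stem}.tei.xml"
    json_doc = project_dir / "storage" / f"{stem}.json"  # naming is an assumption
    lance_dir = project_dir / ".lancedb" / f"{stem}.tei.lance"
    return not (tei_xml.exists() and json_doc.exists() and lance_dir.exists())


def pdfs_to_ingest(project_dir: Path) -> list[Path]:
    # Walk /uploads and keep only PDFs missing some stage of output.
    return [p for p in (project_dir / "uploads").glob("*.pdf")
            if needs_ingestion(p, project_dir)]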

@hammer (Contributor) commented May 25, 2023

It might be worth making this PR inclusive of adding the newly parsed PDF document to the list of references in the references component.

@hammer (Contributor) commented May 25, 2023

Also as a reminder please reference the associated issue in the PR description!

@hammer (Contributor) commented May 25, 2023

Copying over some comments from Slack to have them on GitHub:

I'm less concerned with install size and more concerned with resident memory size, and especially memory leaks, cf. this terrifying thread tauri-apps/tauri#4026

But I hear you, if we end up shipping multiple models to the client we're going to be really bloating the download

BTW the several seconds of lag for each call is possibly because the model has to be loaded into memory for each call since we're calling the sidecar as a command-line binary not as a service. I may be wrong as I haven't fully grokked the Tauri architecture. It will be interesting to see if the lag goes down with WebSockets. Longer term we can also explore various inference optimization approaches such as TVM, Neural Magic, etc.
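To illustrate the lag hypothesis: the model load is paid once per process, so a one-shot CLI pays it on every call while a resident service pays it once. A toy sketch (not the actual sidecar):

import sys

from sentence_transformers import SentenceTransformer

# Several seconds, paid once per *process*. As a one-shot CLI this runs on
# every invocation; as a long-lived service it runs once at startup.
model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is illustrative

for line in sys.stdin:  # resident loop: many requests amortize the load
    vector = model.encode(line.strip())
    print(len(vector))  # placeholder response; a real service would return the embedding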

@gjreda (Collaborator Author) commented May 25, 2023

> It might be worth making this PR inclusive of adding the newly parsed PDF document to the list of references in the references component.

I'd prefer to leave that out of this PR since it is already decently sized.

Something we can discuss more tomorrow, but the ingestion process writes json files to the filesystem, so one option is for the sidecar to return the filepath for the json file, which the client side can then read from the filesystem.

@gjreda (Collaborator Author) commented May 25, 2023

Pushed an update so that this process now returns the uploaded reference metadata JSON once it has completed. I think it makes sense to separate this PR (backend ingestion) from its frontend presentation (I've created #49 for that).

$ ./src-tauri/bin/python/main-aarch64-apple-darwin/main ingest --pdf_directory="/Users/greg/Library/Application Support/com.tauri.dev/project-x/uploads" | jq
{
  "project_name": "project-x",
  "references": [
    {
      "title": "Machine Learning at Scale",
      "authors": [
        {
          "full_name": "Sergei Izrailev",
          "given_name": "Sergei",
          "surname": "Izrailev",
          "email": "sizrailev@collective.com"
        },
        {
          "full_name": "Jeremy M Stanley",
          "given_name": "Jeremy",
          "surname": "Stanley",
          "email": "jstanley@collective.com"
        }
      ]
    },
    {
      "title": "A Few Useful Things to Know about Machine Learning",
      "authors": [
        {
          "full_name": "Pedro Domingos",
          "given_name": "Pedro",
          "surname": "Domingos",
          "email": "pedrod@cs.washington.edu"
        }
      ]
    },
    {
      "title": "Hidden Technical Debt in Machine Learning Systems",
      "authors": [
        {
          "full_name": "D Sculley",
          "given_name": "D",
          "surname": "Sculley",
          "email": "dsculley@google.com"
        },
        {
          "full_name": "Gary Holt",
          "given_name": "Gary",
          "surname": "Holt",
          "email": "gholt@google.com"
        },
        {
          "full_name": "Daniel Golovin",
          "given_name": "Daniel",
          "surname": "Golovin",
          "email": null
        },
        {
          "full_name": "Eugene Davydov",
          "given_name": "Eugene",
          "surname": "Davydov",
          "email": "edavydov@google.com"
        },
        {
          "full_name": "Todd Phillips",
          "given_name": "Todd",
          "surname": "Phillips",
          "email": "toddphillips@google.com"
        },
        {
          "full_name": "Dietmar Ebner",
          "given_name": "Dietmar",
          "surname": "Ebner",
          "email": "ebner@google.com"
        },
        {
          "full_name": "Vinay Chaudhary",
          "given_name": "Vinay",
          "surname": "Chaudhary",
          "email": "vchaudhary@google.com"
        },
        {
          "full_name": "Michael Young",
          "given_name": "Michael",
          "surname": "Young",
          "email": "mwyoung@google.com"
        },
        {
          "full_name": "Jean-Franc ¸ois Crespo",
          "given_name": "Jean-Franc ¸ois",
          "surname": "Crespo",
          "email": "jfcrespo@google.com"
        },
        {
          "full_name": "Dan Dennison",
          "given_name": "Dan",
          "surname": "Dennison",
          "email": "dennison@google.com"
        }
      ]
    }
  ]
}
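A sketch of a sidecar entry point that could produce output like the above, assuming argparse; run_ingestion is a stub standing in for the actual pipeline:

import argparse
import json
from pathlib import Path


def run_ingestion(pdf_directory: str) -> list[dict]:
    """Stub: run the ingestion pipeline and return reference metadata dicts."""
    return []


def main() -> None:
    parser = argparse.ArgumentParser(prog="main")
    subparsers = parser.add_subparsers(dest="command", required=True)
    ingest = subparsers.add_parser("ingest")
    ingest.add_argument("--pdf_directory", required=True)
    args = parser.parse_args()

    if args.command == "ingest":
        # ".../project-x/uploads" -> project name "project-x"
        project_name = Path(args.pdf_directory).parent.name
        references = run_ingestion(args.pdf_directory)
        print(json.dumps({"project_name": project_name, "references": references}))


if __name__ == "__main__":
    main()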

@gjreda gjreda requested review from hammer, sehyod and cguedes May 25, 2023 21:54
@gjreda gjreda marked this pull request as ready for review May 26, 2023 01:58
@gjreda (Collaborator Author) commented May 26, 2023

Ok, this should be good to merge now.

Some related action items:

  1. It is slow due to PyInstaller --onefile, but if I get rid of this flag, we get the errors I mentioned in Slack. I've filed #56 (Sidecar dylib binary errors: PyInstaller --onedir does not play nicely with Tauri sidecar pattern) and #48 (Sidecar process profiling) to investigate further.
  2. #50 (Remove embedding creation from PDF ingestion). Right now it is all one linear pipeline, but it makes sense to break it up.
  3. #49 (Populate new references in ReferencesView after PDF ingestion). I'll leave this one to either @cguedes or @sehyod.

@gjreda gjreda mentioned this pull request May 26, 2023
@@ -0,0 +1,8 @@
#!/usr/bin/env bash

PROJECT_DIR="/Users/greg/Library/Application Support/com.tauri.dev/project-x"
A collaborator commented:

@gjreda was this a temp reset script or should we have one and make it user-agnostic?

@cguedes previously approved these changes May 29, 2023
@sergioramos sergioramos merged commit 4a0c154 into main May 29, 2023
@sergioramos sergioramos deleted the gjreda/grobid-upload branch May 29, 2023 10:56