feat: run PDF ingestion on document upload #29
Conversation
- App reads file structure on startup
- App adds sample structure on load
- Redesigned FoldersView
- Uploaded files are saved to the `/uploads` folder
- Selected files in the sidebar show in the center
- TipTap files show with the editor
- Other files show a placeholder view
src/views/FoldersView.tsx
```ts
runPDFIngestion().then(() => {
  console.log('PDFs ingested with success');
  readAllProjectFiles().then(setFiles);
});
```
About the way to communicate progress: I think this is related to #33; we need to define the way we want to pass information between components, in this case the loading status/progress.

About how to do this piece, I guess it depends on when we want to run the ingestion process. For now, if we want to run it every time files are uploaded, this is the right place to do it. However, you need to call `runPDFIngestion` only once the files have been uploaded, otherwise you will run into race conditions, with the sidecar being called before the app directory receives the new files. I'm not a big fan of the `.then` chaining syntax and prefer `async`/`await`, so you could write something like this:
```ts
const handleChange = async (files: FileList) => {
  await uploadFiles(files);
  console.log('File uploaded with success');
  console.log(files);

  // Refresh the file list so the sidebar reflects the upload immediately
  let updatedFiles = await readAllProjectFiles();
  setFiles(updatedFiles);

  // Run ingestion only after the files are on disk (avoids the race described above)
  await runPDFIngestion();
  console.log('PDFs ingested with success');

  // Refresh again to pick up anything the ingestion produced
  updatedFiles = await readAllProjectFiles();
  setFiles(updatedFiles);
};
```
> I guess it depends on when we want to run the ingestion process. For now, if we want to run it every time files are uploaded

I want to run it every time files are uploaded.
I would provide `runPDFIngestion` with both the array of file paths that were just uploaded and the base directory for the uploads, both as absolute filesystem paths. This would allow the ingestion to read either the newly uploaded files or the full list of files in the upload folder.
```ts
// In filesystem.ts
export async function getUploadsDir() {
  return join(await getBaseDir(), UPLOADS_DIR);
}
```

```ts
// In FoldersView.tsx
const handleChange = async (files: FileList) => {
  await uploadFiles(files);
  console.log('File uploaded with success');
  console.log(files);

  let updatedFiles = await readAllProjectFiles();
  setFiles(updatedFiles);

  // Absolute paths: the uploaded files and their base directory
  const uploadedFilesPath = updatedFiles.map((f) => f.path);
  const uploadsDir = await getUploadsDir();
  await runPDFIngestion(uploadedFilesPath, uploadsDir);
  console.log('PDFs ingested with success');

  updatedFiles = await readAllProjectFiles();
  setFiles(updatedFiles);
};
```
@cguedes While helpful, I don't think passing the current array of uploaded file paths is necessary. The Grobid client takes an input and output directory, walks the input directory tree, and will skip PDFs that it has already processed (found in the output directory).

I think it's ok for the ingest process to just take the input directory path (`/uploads`), since the backend can determine what has already been processed (Grobid output found in `/grobid`, converted JSON found in `/storage`, and embeddings found in `.lancedb`).
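A minimal sketch of that simpler shape, assuming the sidecar is invoked through Tauri's shell API (the sidecar registration name is illustrative; the `ingest --pdf_directory` interface matches the command shown later in this thread):

```ts
import { Command } from '@tauri-apps/api/shell';

// Sketch only: run the Python sidecar's `ingest` command with just the
// uploads directory. Assumes the binary is registered as a sidecar under
// this (illustrative) name in tauri.conf.json.
export async function runPDFIngestion(pdfDirectory: string): Promise<string> {
  const command = Command.sidecar('bin/python/main', [
    'ingest',
    `--pdf_directory=${pdfDirectory}`,
  ]);
  const { stdout } = await command.execute();
  return stdout; // JSON payload; see the sample output further down
}
```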
It might be worth making this PR also cover adding the newly parsed PDF document to the list of references in the references component.
Also, as a reminder, please reference the associated issue in the PR description!
Copying over some comments from Slack to have them on GitHub:
I'd prefer to leave this out of this PR since it is already decently sized. Something we can discuss more tomorrow, but the ingestion process writes JSON files to the filesystem, so one option is for the sidecar to return the filepath for the JSON file, which the client side can then read from the filesystem.
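On the client, that option might look roughly like this (a sketch only; it assumes a hypothetical variant of `runPDFIngestion` that resolves with the path of the JSON file the sidecar wrote):

```ts
import { readTextFile } from '@tauri-apps/api/fs';

// Sketch of the option above: the sidecar reports the path of the JSON file
// it wrote, and the client reads and parses that file from disk.
async function loadIngestedMetadata(): Promise<unknown> {
  const jsonPath = await runPDFIngestion(); // hypothetical: resolves with a file path
  const raw = await readTextFile(jsonPath);
  return JSON.parse(raw);
}
```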
Pushed an update so that this process now returns the uploaded reference metadata JSON once it has completed. I think it makes sense to separate this PR (backend ingestion) from its frontend presentation (I've created #49).

```
$ ./src-tauri/bin/python/main-aarch64-apple-darwin/main ingest --pdf_directory="/Users/greg/Library/Application Support/com.tauri.dev/project-x/uploads" | jq
{
"project_name": "project-x",
"references": [
{
"title": "Machine Learning at Scale",
"authors": [
{
"full_name": "Sergei Izrailev",
"given_name": "Sergei",
"surname": "Izrailev",
"email": "sizrailev@collective.com"
},
{
"full_name": "Jeremy M Stanley",
"given_name": "Jeremy",
"surname": "Stanley",
"email": "jstanley@collective.com"
}
]
},
{
"title": "A Few Useful Things to Know about Machine Learning",
"authors": [
{
"full_name": "Pedro Domingos",
"given_name": "Pedro",
"surname": "Domingos",
"email": "pedrod@cs.washington.edu"
}
]
},
{
"title": "Hidden Technical Debt in Machine Learning Systems",
"authors": [
{
"full_name": "D Sculley",
"given_name": "D",
"surname": "Sculley",
"email": "dsculley@google.com"
},
{
"full_name": "Gary Holt",
"given_name": "Gary",
"surname": "Holt",
"email": "gholt@google.com"
},
{
"full_name": "Daniel Golovin",
"given_name": "Daniel",
"surname": "Golovin",
"email": null
},
{
"full_name": "Eugene Davydov",
"given_name": "Eugene",
"surname": "Davydov",
"email": "edavydov@google.com"
},
{
"full_name": "Todd Phillips",
"given_name": "Todd",
"surname": "Phillips",
"email": "toddphillips@google.com"
},
{
"full_name": "Dietmar Ebner",
"given_name": "Dietmar",
"surname": "Ebner",
"email": "ebner@google.com"
},
{
"full_name": "Vinay Chaudhary",
"given_name": "Vinay",
"surname": "Chaudhary",
"email": "vchaudhary@google.com"
},
{
"full_name": "Michael Young",
"given_name": "Michael",
"surname": "Young",
"email": "mwyoung@google.com"
},
{
"full_name": "Jean-Franc ¸ois Crespo",
"given_name": "Jean-Franc ¸ois",
"surname": "Crespo",
"email": "jfcrespo@google.com"
},
{
"full_name": "Dan Dennison",
"given_name": "Dan",
"surname": "Dennison",
"email": "dennison@google.com"
}
]
}
]
}
```
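On the frontend side, this payload could be typed roughly like this (shapes inferred from the sample output above; treating `email` as nullable matches the entry without one):

```ts
// Shapes inferred from the sample `ingest` output above.
interface Author {
  full_name: string;
  given_name: string;
  surname: string;
  email: string | null;
}

interface Reference {
  title: string;
  authors: Author[];
}

interface IngestResponse {
  project_name: string;
  references: Reference[];
}
```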
Ok, this should be good to merge now. Some related action items:
```bash
#!/usr/bin/env bash

PROJECT_DIR="/Users/greg/Library/Application Support/com.tauri.dev/project-x"
```
@gjreda was this a temp reset script or should we have one and make it user-agnostic?
Including the new dependencies in the binary was rather finicky, so please pull down this branch and try to test it. I want to make sure it works on more machines than mine.
To do so:
Note that this might take a minute or two to run once a document is uploaded. You can check the `.lancedb` directory for your document to make sure the process has completed successfully.

When a document is uploaded, run the PDF ingestion pipeline so that we can perform Q&A over the documents.

Right now, the pipeline stores everything in the project working directory, but I suspect we'll want to change that later (there's little reason for the user to see this data):

- Grobid output is stored in a `grobid` directory within the project working directory
- `sentence-transformers` is used to generate sentence embeddings for dense retrieval during AI interactions
- Embeddings are stored in `lancedb` -- we will query this data in AI interactions

Some things to note:

- `sentence-transformers` and `lancedb` have some very hefty dependencies (torch, pyarrow). These increase the size of the binary significantly and slow down calls to it. This is something we'll want to optimize once we have the full end-to-end flow in place. We might even want to remove the dependency and use OpenAI to generate the embeddings.
- The `chunk` strategy for documents is terribly naive right now and is worth improving later (for a rough picture of what "naive" means here, see the sketch below).
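Purely as an illustration of the current strategy's general shape (the real implementation lives in the Python sidecar; this TypeScript sketch only mirrors the fixed-window idea):

```ts
// Illustrative sketch only: fixed-size character windows with a small overlap,
// no sentence or section awareness. Parameter values are arbitrary.
function chunkText(text: string, chunkSize = 1000, overlap = 100): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}
```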