feat: Update ingest so uploads are only processed once #219

Merged
merged 4 commits into from
Jun 29, 2023

@gjreda gjreda commented Jun 29, 2023

fixes #218

Note that IngestResponse will still contain all References, not just the newly uploaded Reference. Let me know if we want to change this.


$ ls -la $HOME"/Library/Application Support/com.tauri.dev/project-x/uploads"
total 5536
drwxr-xr-x  4 greg  staff      128 Jun 28 21:21 .
drwxr-xr-x  9 greg  staff      288 Jun 28 21:22 ..
-rw-r--r--@ 1 greg  staff   333450 Jun 28 15:24 Machine Learning at Scale.pdf
-rw-r--r--@ 1 greg  staff  2497316 Jun 28 15:36 grobid-fails.pdf

# run ingest
$ poetry run python main.py ingest --pdf_directory=$HOME"/Library/Application Support/com.tauri.dev/project-x/uploads"

# output from /tmp/refstudio-sidecar.log
2023-06-28 21:22:47,767 - sidecar.ingest - INFO - Grobid successfully parsed file: Machine Learning at Scale.pdf
2023-06-28 21:22:47,767 - sidecar.ingest - WARNING - Grobid failed to parse file: grobid-fails.pdf
2023-06-28 21:22:47,767 - sidecar.ingest - INFO - Converting 1 Grobid XML files to JSON
2023-06-28 21:22:47,776 - sidecar.ingest - INFO - Found 1 Grobid JSON files to parse
2023-06-28 21:22:47,776 - sidecar.ingest - INFO - Creating Reference from file: Machine Learning at Scale.json
2023-06-28 21:22:47,780 - sidecar.ingest - INFO - Found 1 txt files from Grobid parsing errors
2023-06-28 21:22:47,780 - sidecar.ingest - INFO - Creating Reference from file: grobid-fails_500.txt
2023-06-28 21:22:47,780 - sidecar.ingest - INFO - Created 2 Reference objects: 1 successful Grobid parses, 1 Grobid failures
2023-06-28 21:22:47,780 - sidecar.ingest - INFO - Saving references to file: /Users/greg/Library/Application Support/com.tauri.dev/project-x/.storage/references.json
2023-06-28 21:22:47,812 - sidecar.ingest - INFO - Finished ingestion for project: project-x


# add a new upload ...
$ cp ~/Downloads/2212.08037v2.pdf uploads/.

# ... and run ingest again
# there are now three uploads - two from above + one new upload
$ poetry run python main.py ingest --pdf_directory=$HOME"/Library/Application Support/com.tauri.dev/project-x/uploads"

# output from logs - note only the new upload is processed
2023-06-28 21:27:17,850 - sidecar.ingest - INFO - Starting ingestion for project: project-x
2023-06-28 21:27:17,850 - sidecar.ingest - INFO - Found 1 new uploads to ingest
2023-06-28 21:27:17,850 - sidecar.ingest - INFO - Copying 2212.08037v2.pdf to /Users/greg/Library/Application Support/com.tauri.dev/project-x/.staging
2023-06-28 21:27:17,851 - sidecar.ingest - INFO - Calling Grobid server for 1 files
2023-06-28 21:27:29,573 - sidecar.ingest - INFO - Finished calling Grobid server
2023-06-28 21:27:29,574 - sidecar.ingest - INFO - Grobid successfully parsed file: 2212.08037v2.pdf
2023-06-28 21:27:29,574 - sidecar.ingest - INFO - Converting 1 Grobid XML files to JSON
2023-06-28 21:27:29,632 - sidecar.ingest - INFO - Found 1 Grobid JSON files to parse
2023-06-28 21:27:29,632 - sidecar.ingest - INFO - Creating Reference from file: 2212.08037v2.json
2023-06-28 21:27:29,634 - sidecar.ingest - INFO - Found 0 txt files from Grobid parsing errors
2023-06-28 21:27:29,634 - sidecar.ingest - INFO - Created 1 Reference objects: 1 successful Grobid parses, 0 Grobid failures
2023-06-28 21:27:29,634 - sidecar.ingest - INFO - Saving references to file: /Users/greg/Library/Application Support/com.tauri.dev/project-x/.storage/references.json
2023-06-28 21:27:29,684 - sidecar.ingest - INFO - Finished ingestion for project: project-x

# check references.json -- should be 3 references
$ cat .storage/references.json | jq '.[] | {"source_filename": .source_filename, "title": .title}'
{
  "source_filename": "Machine Learning at Scale.pdf",
  "title": "Machine Learning at Scale"
}
{
  "source_filename": "grobid-fails.pdf",
  "title": null
}
{
  "source_filename": "2212.08037v2.pdf",
  "title": "Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models"
}
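
A rough sketch of the behaviour the second run demonstrates, where only uploads not yet recorded in references.json are picked up. This is an illustration rather than the code in this PR: `load_processed_filenames` and `get_new_uploads` are hypothetical names, and it assumes references.json is a JSON array of objects with a `source_filename` field, as the jq output above suggests.

```python
import json
from pathlib import Path


def load_processed_filenames(references_path: Path) -> set[str]:
    """Return the source filenames already recorded in references.json.

    Hypothetical helper: assumes the file holds a JSON array of objects
    that each carry a "source_filename" key.
    """
    if not references_path.exists():
        # First run: nothing has been ingested yet.
        return set()
    with references_path.open() as f:
        references = json.load(f)
    return {ref["source_filename"] for ref in references}


def get_new_uploads(upload_dir: Path, references_path: Path) -> list[Path]:
    """Return PDFs in upload_dir that have no entry in references.json."""
    processed = load_processed_filenames(references_path)
    return sorted(
        pdf for pdf in upload_dir.glob("*.pdf") if pdf.name not in processed
    )
```

Keying on `source_filename` means a file whose Grobid parse failed (like grobid-fails.pdf above) is also skipped on later runs, since it still has an entry in references.json.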

codecov bot commented Jun 29, 2023

Codecov Report

Merging #219 (ae1c524) into main (b70068c) will increase coverage by 0.24%.
The diff coverage is 82.07%.

@@            Coverage Diff             @@
##             main     #219      +/-   ##
==========================================
+ Coverage   72.18%   72.42%   +0.24%     
==========================================
  Files         103      103              
  Lines        5069     5125      +56     
  Branches      405      405              
==========================================
+ Hits         3659     3712      +53     
- Misses       1392     1395       +3     
  Partials       18       18              
Impacted Files               Coverage Δ
python/main.py               0.00% <0.00%> (ø)
python/sidecar/ingest.py     91.47% <82.00%> (+0.65%) ⬆️
python/sidecar/shared.py     96.61% <100.00%> (+0.31%) ⬆️

@gjreda gjreda marked this pull request as ready for review June 29, 2023 02:29

@sehyod sehyod left a comment

I don't know how feasible that is, but it would be great to have a test checking that when _load_references returns a reference, the reference is excluded from the ingestion process

@cguedes cguedes merged commit ef64f0e into main Jun 29, 2023
11 checks passed
@cguedes cguedes deleted the 218-update-ingest-so-uploads-are-only-processed-once branch June 29, 2023 10:45

gjreda commented Jun 29, 2023

> I don't know how feasible that is, but it would be great to have a test checking that when _load_references returns a reference, the reference is excluded from the ingestion process

@sehyod can you say more about this? As implemented, we exclude any references from _load_references here: https://github.com/refstudio/refstudio/pull/219/files#diff-ce6aff220b56341156a579ce17e68875c3179400b864d7519b8a657c8dbae910R148-R152.

Are you saying you'd like a test for the _get_files_to_ingest method? To make sure it excludes references that were loaded via _load_references?


sehyod commented Jun 30, 2023

Yes, I meant adding a test to make sure references that were loaded via _load_references are excluded. But to be honest, the code is pretty straightforward, so I think it's fine without the test.
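
For reference, the test discussed in this thread might look roughly like the sketch below. It does not call the real `_get_files_to_ingest` or `_load_references`, whose signatures are not shown in this conversation. `Reference` and `get_files_to_ingest` here are hypothetical stand-ins, and the test assumes pytest's `tmp_path` fixture.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Reference:
    # Hypothetical stand-in: only the field this test needs.
    source_filename: str


def get_files_to_ingest(upload_dir: Path, loaded: list[Reference]) -> list[Path]:
    """Stand-in for _get_files_to_ingest: skip uploads that already have a Reference."""
    seen = {ref.source_filename for ref in loaded}
    return sorted(p for p in upload_dir.iterdir() if p.name not in seen)


def test_loaded_references_are_excluded(tmp_path: Path) -> None:
    # Two uploads on disk, one of which was ingested on a previous run.
    for name in ("already-ingested.pdf", "new-upload.pdf"):
        (tmp_path / name).write_bytes(b"%PDF-1.4")
    loaded = [Reference(source_filename="already-ingested.pdf")]

    to_ingest = get_files_to_ingest(tmp_path, loaded)

    # Only the upload without an existing Reference should remain.
    assert [p.name for p in to_ingest] == ["new-upload.pdf"]
```

In the real test suite, the stand-in function would be replaced by the ingest method under test, with `_load_references` patched to return the preloaded reference.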
