Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate reference citation keys for ingested references #114

Merged

Conversation

gjreda
Copy link
Collaborator

@gjreda gjreda commented Jun 7, 2023

fixes #112

Adds citation_key and published_date to the Reference response.

Depending on whether the "uniqueness" postfix of a, b, c ... 1, 2, 3 should be handled by the frontend or sidecar, this PR might require more work. See discussion here.

@@ -4,5 +4,4 @@
yarn check
yarn test

yarn pylint:check
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noticed that someone added this to yarn check (thank you!), so we don't need to call it twice here

Comment on lines +55 to +68
def update(self, target: typing.Reference):
"""
Update a Reference in storage with the target reference.
This is used when the client has updated the reference in the UI.
"""
for i, source in enumerate(self.references):
if source.filename_md5 == target.filename_md5:
logger.info(f"Updating reference {source.source_filename} ({source.filename_md5})")
self.references[i] = target
break
else:
logger.info(f"Unable to find reference {target.source_filename} ({target.filename_md5})")
return
self.save()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cguedes @sehyod for now, i'm assuming that you will be passing back in the entire Reference object, inclusive of chunks and such. it feels a little heavy, but seemed easiest for now. let me know if that's an issue for y'all.

Copy link
Collaborator Author

@gjreda gjreda Jun 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alternatively, we could pass in the filename hash (or we should probably use a UUID) and dict, making the function signature something like:

def update(reference_id: UUID, updates: dict[string, Any])

where dict[string, Any] => field_name_to_update: value_to_set. would need to think through how to do this for subtypes like Author though. maybe individual methods for update_reference and update_author to make things clear and specific.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is ok, also bc this is JsonStorage file. Maybe we need to expose to the frontend more granular operations (ex: rename citation). So this is ok for now.

Comment on lines +33 to +35
given_name: str | None = None
surname: str | None = None
email: str | None = None
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grobid won't always have these fields, but from what I can tell full_name will always be present if it can parse any authors.

We'll be able to get a lot more consistent here once we start work on further augmenting our references with #140

@@ -7,7 +7,7 @@ function parsePdfIngestionResponse(response: IngestResponse): ReferenceItem[] {
return response.references.map((reference) => ({
id: reference.filename_md5,
title: reference.title ?? reference.source_filename.replace('.pdf', ''),
authors: (reference.authors ?? []).map((author) => ({ fullName: author.full_name, surname: author.surname })),
authors: (reference.authors ?? []).map((author) => ({ fullName: author.full_name })),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cguedes @sehyod I modified this to remove surname since it is not guaranteed we will have it even though we might have reference.authors. Feel free to push a commit if there's a better way to do this!

@gjreda gjreda marked this pull request as ready for review June 15, 2023 21:19
@gjreda gjreda requested review from cguedes and sehyod and removed request for cguedes June 15, 2023 21:41
Comment on lines +55 to +68
def update(self, target: typing.Reference):
"""
Update a Reference in storage with the target reference.
This is used when the client has updated the reference in the UI.
"""
for i, source in enumerate(self.references):
if source.filename_md5 == target.filename_md5:
logger.info(f"Updating reference {source.source_filename} ({source.filename_md5})")
self.references[i] = target
break
else:
logger.info(f"Unable to find reference {target.source_filename} ({target.filename_md5})")
return
self.save()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is ok, also bc this is JsonStorage file. Maybe we need to expose to the frontend more granular operations (ex: rename citation). So this is ok for now.

@cguedes cguedes self-requested a review June 19, 2023 16:23
@cguedes
Copy link
Collaborator

cguedes commented Jun 19, 2023

@gjreda just found an error (date parsing) in the ingestion for the Machine Learning at Scale.pdf.

image

Copy link
Collaborator

@cguedes cguedes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a pending error when parsing some PDF files (check my latest comment).

@gjreda gjreda requested a review from cguedes June 19, 2023 17:37
@codecov
Copy link

codecov bot commented Jun 19, 2023

Codecov Report

Merging #114 (a6208bf) into main (929c094) will not change coverage.
The diff coverage is 0.00%.

❗ Current head a6208bf differs from pull request most recent head f12f61f. Consider uploading reports for the commit f12f61f to get more accurate results

@@           Coverage Diff           @@
##             main     #114   +/-   ##
=======================================
  Coverage   36.73%   36.73%           
=======================================
  Files          52       52           
  Lines        2488     2488           
  Branches      114      114           
=======================================
  Hits          914      914           
  Misses       1551     1551           
  Partials       23       23           
Impacted Files Coverage Δ
src/api/ingestion.ts 0.00% <0.00%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@gjreda gjreda merged commit 615e0c1 into main Jun 19, 2023
9 checks passed
@gjreda gjreda deleted the 112-generate-reference-citation-key-for-ingested-references branch June 19, 2023 17:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Generate reference citation key for ingested references
2 participants