Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add index fields to subject sets #5859

Merged
merged 9 commits into from
Jan 26, 2021
Merged

Conversation

eatyourgreens
Copy link
Contributor

@eatyourgreens eatyourgreens commented Dec 10, 2020

Add a new field, subjectSet.metadata.indexFields, to new subject sets. This field contains a comma-separated list of any manifest headings that begin with &. If none of those headings are present, then this field isn't created.

For example, these manifest headings:

subject_id,image_name_1,&origin,link,&attribution,license,#secret_description

will set subjectSet.metadata.indexFields to origin,attribution.

Staging branch URL: https://pr-5859.pfe-preview.zooniverse.org

Required Manual Testing

  • Does the non-logged in home page render correctly?
  • Does the logged in home page render correctly?
  • Does the projects page render correctly?
  • Can you load project home pages?
  • Can you load the classification page?
  • Can you submit a classification?
  • Does talk load correctly?
  • Can you post a talk comment?

Review Checklist

  • Does it work in all major browsers: Firefox, Chrome, Edge, Safari?
  • Does it work on mobile?
  • Can you rm -rf node_modules/ && npm install and app works as expected?
  • If the component is in coffeescript, is it converted to ES6? Is it free of eslint errors? Is the conversion its own commit?
  • Are the tests passing locally and on Travis?

Optional

@eatyourgreens
Copy link
Contributor Author

Once we're happy with this, we'll need to port it over to the Python tools.

@shaunanoordin
Copy link
Member

Code looks good, though the only thing I haven't figured out yet is how/where/if the PFE Subject Set page creates Manifest files when none is provided. I'm running tests now to...

(Basic tests)

  • create Subject Set X, then upload a batch of Subjects with a Manifest file with &indexFields
  • create Subject Set Y, then upload a batch of Subjects with no Manifest file

Notes for self:

  • If Subject Set X is created, and then a batch of Subjects with Manifest A is uploaded, then Subject Set X will have the .indexFields information from Manifest A.
  • If a second batch of Subjects with Manifest B is uploaded... I think the .indexFields information from Manifest B will overwrite all the previous information from Manifest A.
  • If a third batch of Subjects with NO manifest is uploaded... then question mark? (In theory, the PFE Subject Set page would create a basic Manifest file, which would have no .indexFields... which means indexFields.length would be 0, so subjectSet.update() won't trigger. But again, run checks to confirm that data in addIndexField isn't undefined for whatever reason.)
  • Shenanigans MIGHT occur if someone has a header row that looks like filename,&series,"&console, PC, or platform",&,price but I don't think this is something we'd need to realistically consider

@eatyourgreens
Copy link
Contributor Author

Maybe it should append to metadata.indexFields, rather than overwriting it? 🤔

@shaunanoordin
Copy link
Member

I think overwriting makes sense. ✅ If new Subjects are added to an existing Subject Set, I can see three scenarios...

  • a Manifest file is included with NO &indexFields. In which case, the current SubjectSet.metadata.indexFields shouldn't change - this scenario is accounted for in the code.
    • reason 1: it's easy for a project owner to later forget to add &indexFields when uploading new Subjects, and this human error shouldn't cause the project to break, and reason 2: it's very unlikely that a project owner wants to REMOVE &indexFields, and this niche scenario can be easily accounted for by us manually editing the subjectset.metadata
  • no Manifest file is included. In which case, the scenario should be treated the same as above
  • a Manifest file is included with &indexFields. In which case, it makes sense to overwrite instead of append
    • If a project owner adds a manifest with &indexFields, it means they're aware of the index system, and can be responsible for ensuring the &indexFields are consistent with the previous uploads, or changed on purpose.
    • appending runs the risk of creating too much junk code in the metadata, which will be a pain because then we'd need a mechanism to remove items in the &indexFields field.

I have no idea why I spent 10 minutes writing an essay that basically just says "actually overwrite is fine".

@shaunanoordin
Copy link
Member

shaunanoordin commented Dec 10, 2020

[UI][minor] Future Consideration
EDIT: tagging this as a possible quirk

  • in Subject manifests, an & ampersand is now used to indicate a certain metadata field is to be considered a special indexField.
    • c.f. how the ! EXCLAMATION MARK! is used to indicate a certain metadata field should be field hidden from users
  • The & ampersand is still treated like any other character in terms of a metadata field, so each Subject in the manifest will literally have subject.metadata.&someField (instead of the more readable subject.metadata.someField)
  • Our current "metadata viewer" components on PFE and front-end-monorepo will literally display the &someField field with the ampersand.

Screen Shot 2020-12-10 at 17 49 16

This is a low impact, low priority consideration. In my head right now are the following paths:

  • Option 1: change all our "metadata viewer" components to strip & from when rendering the fields
    • Sounds fine, because these components already hide !hidden fields so there's precedent
  • Option 2: adjust subjectsFromManifest to strip away any preceding & to keep the field names consistent
    • Sounds fine, but I'm not sure if there's any reason we want to know the & indexField indicator on a per Subject basis as opposed to on a per Subject Set basis
  • Option 3: don't worry about it, it's just an ampersand

Sudden thought: following previous post, what if the project owner uploads the first batch of Subjects with the manifest that looks like !filename,&series,&awesomeness and the second batch of Subjects has !filename,series,awesomeness?

  • That'd mean means the different batch of subjects will have different metadata fields because '&series' !== 'series'
  • But, this falls on the project owner's shoulders to be responsible for keeping the fields consistent.

Wait, bigger question: our Subject Set Indexing tool, does it CARE if the metadata field is 'someField' or '&someField' ?

EDIT: wait, I was testing this on master. Retrying to see if & isn't being stripped properly on subject-set-index-fields EDIT EDIT: No, it was definitely the subject-set-index-fields branch. The & isn't being stripped.

Copy link
Member

@shaunanoordin shaunanoordin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review

This PR allows the PFE project builder's Subjects upload system to identify the special "index fields" that's relevant to a Subject Set.

  • Index fields are identified in the Subject manifest by an & ampersand in their metadata field name (in the header row of the manifest.csv), e.g. &someField.
  • When a batch of Subjects are uploaded to a Subject Set, if the batch has a Manifest file WITH & index fields defined, those index fields will be recorded in the Subject Set's metadata.
    • See the Testing section for a more solid example.
    • When a Subject upload has NO Manifest included, or includes a Manifest file with & no index fields, then the upload proceeds normally and ignores all the aforementioned & index fields considerations.

Testing

Baseline test:

  • 5 image files are dragged and dropped into a new, empty Subject Set:
    • URL: https://local.zooniverse.org:3735/lab/1898/subject-sets/4825
    • Result: no issues.
      • All 5 images are uploaded normally. Each Subject only has the default 'Filename' in its subject.metadata.
      • Subject Set has an empty subject_set.metadata configuration.

Main test:

  • 5 images + 1 manifest file (with index fields) are dragged and dropped into a new, empty Subject Set:
    • URL: https://local.zooniverse.org:3735/lab/1898/subject-sets/4824
      • Manifest file looks like so:
        character,!filename,&series,&awesomeness
        Mei,overwatch-2-mei.jpg,overwatch,10
        Mercy,overwatch-2-mercy.jpg,overwatch,7
        Tracer,overwatch-2-tracer.jpg,overwatch,9
        Geed,ultraman-geed.png,ultraman,8
        Orb,ultraman-orb.png,ultraman,10
        
    • Result: no issues, but possible quirk (see dev notes)
      • All 5 images are uploaded, with the appropriate metadata fields
      • Subject Set has the following subject_set.metadata configuration:
        {
          indexFields: "series,awesomeness"
        }
        

Tested on localhost+macOS10+Chrome. LGTM 👍

Dev Notes

❓ While the Subject Set metadata looks fine, there's a possible quirk with how the individual Subject metadata looks like.

In the testing example above, one of the Subjects has a subject.metadata that looks like...

{
  "!filename": "overwatch-2-mei.jpg"
  "&awesomeness": "10"
  "&series": "overwatch"
  "character": "Mei"
}

...and I'm not sure if it's supposed to be...

{
  "!filename": "overwatch-2-mei.jpg"
  "awesomeness": "10"
  "series": "overwatch"
  "character": "Mei"
}

Basically I'm not sure if the Subject-searching functionality will go through every Subject looking for a metadata field called series or &series. ❓

Status

I'm going to give this a 👍 for now because the code works as is, and I'm not sure if the quirk mentioned in the dev notes is a quirk, or something working as expected.

The only additional check I'm going to do is to see if & could be a special character in any scenario. It's not an issue with standard CSVs (i.e. if we're working with plain text editors) as far as I can tell, but I need to check with Microsoft Excel or Google Sheets or whatever to see that they don't try to play smart and encode & as & during an export operation. This is a VERY niche worry, so don't sweat it for this PR.

@coveralls
Copy link

coveralls commented Dec 10, 2020

Coverage Status

Coverage increased (+0.08%) to 50.879% when pulling 46a4131 on subject-set-index-fields into cdf2324 on master.

@eatyourgreens
Copy link
Contributor Author

I think you're right about trimming each subject's metadata. The field names of the indexed metadata in the Redis search should match the actual field names, which means trimming extra characters from the subjects that are fed to Redis.

@eatyourgreens
Copy link
Contributor Author

@shaunanoordin I've added a helper to clean up individual subjects too, before uploading them. If this still looks good, I'll merge it on Monday.

@eatyourgreens
Copy link
Contributor Author

I'm also not sure that we'll use & as the special character, so I've made that configurable here.

export default {
INDEX_FIELD_HEADER: '&'
}

Copy link
Member

@shaunanoordin shaunanoordin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review (Update)

These changes LGTM! I'm giving this a double approval.

  • These new changes ensure that any index field that looks like &someField will be cleaned up so we get Subject.metadata.someField (without the ampersand)
  • Tests look good.
    • Baseline "5 images, no manifest" upload works fine.
    • Main test of "5 images, manifest with &indexFields" works fine as well 👍
    • (test manifest and image files are the exact same as with the initial PR review)

Screen Shot 2020-12-11 at 16 59 33

Status

LGTM. I have a very minor suggestion for the code, but this PR is overall ready to be merged on Monday. 👍

@eatyourgreens
Copy link
Contributor Author

I used the wrong character. 🤦‍♂️ It should be %, not &. I'll change it before merging on Monday.

Add a new field, subjectSet.metadata.indexFields, to new subject sets. This field contains a comma-separated list of any manifest headings that begin with `&`.
Move subject set helpers to `./helpers/subject-sets`. Add `cleanSubjectData` which trims whitespace and removes leading ampersands from subject metadata fields.
@eatyourgreens eatyourgreens force-pushed the subject-set-index-fields branch from 663264c to 46a4131 Compare January 26, 2021 10:47
@eatyourgreens eatyourgreens merged commit f62f3f8 into master Jan 26, 2021
@eatyourgreens eatyourgreens deleted the subject-set-index-fields branch January 26, 2021 11:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants