Integrate multitask splitter+parser model #505
Conversation
sectioned_documents = transform_scraper_file(scraper_file)

# Instantiate deep_reference_parser model here (not in loop!)
section_splitter = SplitSection()
splitter_parser = SplitParser(config_file=MULTITASK_CFG)
possibly the config_file should be the default in the version you shipped, so it isn't exposed here, but this is a minor thing
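The reviewer's suggestion could be sketched like this. This is a hypothetical illustration, not the real deep_reference_parser API: if `SplitParser` defaulted `config_file` to the packaged `MULTITASK_CFG`, call sites in reach would not need to pass it at all.

```python
# Hypothetical sketch (names mirror the imports in this PR, behaviour assumed):
MULTITASK_CFG = "config_multitask.ini"  # assumed packaged default path

class SplitParser:
    def __init__(self, config_file=MULTITASK_CFG):
        # the shipped default is used unless a caller overrides it
        self.config_file = config_file

# refparse could then instantiate the model without exposing the config:
splitter_parser = SplitParser()
```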
from .fuzzy_match import FuzzyMatcher
from .file_manager import FileManager
from .serialiser import serialise_matched_reference, serialise_reference
from .exact_match import ExactMatcher

__all__ = [
    split_reference,
    predict_components,
    merge_components,
don't you still need to do merge?
no, everything is done all together in parse.structure_reference now.
@@ -21,20 +21,15 @@
    ExactMatcher)
from .settings import settings

from deep_reference_parser.split_section import SplitSection
from deep_reference_parser.split_parse import SplitParser
from deep_reference_parser.common import MULTITASK_CFG
is this duplication? why are there two files doing the same thing?
This PR moves to a combined split/parse logic, but note that this is not necessary. The multitask model does not make joint predictions; it just trains on two tasks, i.e. multitask. You could still have separate split and parse methods, which would make integrating a bit easier.
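The reviewer's point can be illustrated with a small sketch: because the multitask model predicts a splitting label and a parsing label per token independently, separate `split()` and `parse()` wrappers over one shared model remain possible. Everything here, including the fake `predict()`, is a stand-in, not the real deep_reference_parser API.

```python
def predict(tokens):
    # stand-in for the multitask model: one (split_label, parse_label) per token
    return [("b-r" if i == 0 else "i-r", "title") for i, _ in enumerate(tokens)]

def split(tokens):
    # use only the splitting task's labels
    return [split_label for split_label, _ in predict(tokens)]

def parse(tokens):
    # use only the parsing task's labels
    return [parse_label for _, parse_label in predict(tokens)]
```

Keeping both wrappers means predicting twice, which is the inefficiency the author is weighing against the easier transition.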
@@ -0,0 +1,39 @@
[DEFAULT]
this should not be here.
we need the test config file for test_split_parse.py, since the default config_multitask.ini file points to an unnecessarily large embeddings file which would slow down the testing significantly. Or do you mean it should be located in a different folder?
i meant that it should live in deep reference parser not in reach. also if it ends up staying here maybe you can remove everything but the build part as the others are not needed.
def predict_components(model, reference_components):
def structure_reference(reference_components):
is that doing the job of merge_components?
yep
add input/output, or Args and Returns statements
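A docstring in the style the reviewer is asking for might look like the following. The function body is a simplified stand-in for illustration, not the PR's actual implementation.

```python
def structure_reference(reference_components):
    """Structure predicted tokens into a single reference dict.

    Args:
        reference_components: list of (token, component_label) tuples as
            predicted by the multitask model, e.g. [("Smith", "author"), ...].

    Returns:
        dict: mapping each predicted component label to the joined text of
        all tokens carrying that label.
    """
    structured = {}
    for token, label in reference_components:
        structured.setdefault(label, []).append(token)
    return {label: " ".join(tokens) for label, tokens in structured.items()}
```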
There are certain issues that need discussing before this merges. Those are:
- having a test config file here
- combining split and parse which is not strictly necessary
- remove pool_map which should be avoided or better understood
https://datalabs-public.s3.eu-west-2.amazonaws.com/reach_evaluator/reach_evaluator-2020.1.1-py3-none-any.whl

https://github.com/wellcometrust/deep_reference_parser/releases/download/2020.4.29/deep_reference_parser-2020.4.5-py3-none-any.whl
This line needs to be updated in split_reach/extracter/requirements.txt too :)
oh i didn't realise each component had its own makefile+requirements!
@nsorros 👍 for the review, on your points:

- ideally the test file should be on the deep reference end, not here. reach should use deep reference parser as a package without needing to define such a config to test.
- split and parse could still be used separately; this would mean predicting twice. it would make integrating easier, but i understand this is inefficient. just making the point that an easier transition could be made before combining the two.
- removing pool_map should be left to devs, it is up to them, but it is not as simple as saying that map is used at the moment, as you are breaking an easy way to parallelise the code in case running reach takes longer than needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy with the requirement change, but I'll leave the final approval to @nsorros. Could you comment on the removal of pool_map as well?
@nsorros Sorry, just saw your comment. It's okay to remove it for the time being, but it'd probably be good to put it back at some point, to speed things up. For now, we can monitor this to see if it's actually needed.
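The parallelisation path being given up could be sketched as follows, assuming pool_map was a `Pool.map`-style callable threaded through the pipeline (the per-reference work function here is a stand-in):

```python
from multiprocessing.dummy import Pool  # thread-backed Pool, same .map API

def structure(reference):
    # stand-in for the per-reference parsing work done in refparse
    return reference.upper()

references = ["ref one", "ref two"]

# serial version, as the code now does it:
serial = [structure(r) for r in references]

# pool_map version: swapping pool.map back in parallelises with no other changes
with Pool(2) as pool:
    parallel = pool.map(structure, references)
```

This is the reviewer's point: keeping the map-shaped interface means parallelism can be restored by changing one call site.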
@@ -144,33 +137,27 @@ def get_file(

def yield_structured_references(scraper_file,
                                pool_map, logger, model_file=DEFAULT_MODEL_FILE):
                                pool_map, logger):
no need to pass pool_map here
this file will be deleted (it's in reach/)
""" | ||
|
||
logger.info("[+] Reading input files") | ||
|
||
# Loading the scraper results | ||
scraper_file = get_file(scraper_file, "", get_scraped=True) | ||
|
||
# Loading the pipeline | ||
model = get_file(model_file, 'pickle') |
i know sam hated this function get_file but i still remember it with nostalgia
@@ -224,14 +205,13 @@ def yield_structured_references(scraper_file,
    )

def parse_references(scraper_file, model_file, num_workers, logger):
def parse_references(scraper_file, num_workers, logger):
you also do not need num_workers anymore
again, will be deleted
structured_reference = {ref_class: '' for ref_class in settings.REF_CLASSES}
for component in settings.DRP_REF_COMPONENTS:
    structured_reference[
        settings.COMPONENT_NAME_MAP.get(component, component)
i would add a comment at the end of the line to say "leave component unchanged if no map", as the logic .get(component, component) makes you stop and think what is going on
ok will do, but I'm just going to make the change to split_reach/extracter/refparse/utils/parse.py
for component in settings.DRP_REF_COMPONENTS:
    structured_reference[
        settings.COMPONENT_NAME_MAP.get(component, component)
    ] = ' '.join([r[0] for r in reference_components if r[1] == component])
this logic of merge is different from the previous one: here you are merging all components that have been tagged the same. i understand that we do not have a way to evaluate the impact of this, but could we write an email to @jdu to push for the resolution of the unique id, or get a timeline for it? it is quite important to be able to argue for those changes.
The reason for this is that the deep reference parser doesn't output prediction probabilities, which were the information needed for the old way of finding the most likely 'group' of authors, titles, etc. Perhaps there is a better way to deal with this than just grouping them all together, though.
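A minimal sketch of the merge behaviour under discussion: with no per-token probabilities, every token sharing a predicted label is joined into one field, even across disjoint runs. The component list and tokens below are illustrative, not the actual settings.

```python
DRP_REF_COMPONENTS = ["author", "title", "year"]  # illustrative settings value

reference_components = [
    ("Smith", "author"), ("(2019)", "year"), ("Jones", "author"), ("Title", "title"),
]

structured_reference = {}
for component in DRP_REF_COMPONENTS:
    # join all tokens tagged with this component, regardless of position
    structured_reference[component] = " ".join(
        token for token, label in reference_components if label == component
    )
# "Smith" and "Jones" end up merged into one author field despite the year between them
```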
for component in settings.DRP_REF_COMPONENTS:
    structured_reference[
        settings.COMPONENT_NAME_MAP.get(component, component)
    ] = ' '.join([r[0] for r in reference_components if r[1] == component])
i find it really difficult to repeat comments, and i am afraid that if you keep both, the code in the two different places will differ. i will not hold the changes due to that, but you have been warned :P
Sam is just going to delete the entire reach/ folder, so I'm only making the edits to the split_reach folder now, but I am being careful. It was a bit of a nightmare to undo the commits since they were all paired.
happy to approve at this point; there are some minor comments on my end. the most problematic bit is that the code is repeated, which, as liz addresses comments, i am sure will lead to different functionalities. i strongly suggest implementing the changes in one place and removing the others. as an example of divergence, i noticed that pool_map and num_workers have been removed in one place and not in the other.
…/reach into integrate-multitask-parser
* Updates DRP whl location to latest github release
* Add new names for deep reference components and mapping to old names to settings
* New, easier way to structure references using the output of the multitask deep reference model
* Subbing in the SplitParser model instead of the SplitSection model into refparser.py - involves getting rid of the now unneeded parser model, and a slightly different structure in yield_structured_references
* Delete old tests for the split and parse tasks, and combine into one test script for the multitask model
* Add test config files so that deep reference parser tests don't download an unnecessarily massive weights file
* Get rid of old parse functions from init file
* Update release version of deep reference parser
* Correct path for test config
* Update new whl location in extracter requirements
* Delete test of deep splitparser and associated config file
* Remove pool map from refparse
* Adding comment to make structure reference dict naming clearer
Description
Note: No need to review anything from reach/ as this will all be deleted after the merge.

Addresses issue #500.

In this PR I replace SplitSection with SplitParser in refparse.py. This means that instead of using the deep reference splitter model plus the naive Bayes parser model, we now combine splitting and parsing by using the deep reference multitask model.

This has knock-on effects for various other scripts: changing the tests, the init and settings files, and the functions in parse.py (now reduced to just one), and adding a file (test_config_multitask.ini) which will make the tests on this model a bit quicker than they would be with the default config file.

Note: I make changes to both the scripts in reach/refparse and split_reach/extracter/refparse, which are essentially the same changes for both file locations. In each commit I add each pair of scripts (e.g. reach/refparse/utils/__init__.py and split_reach/extracter/refparse/utils/__init__.py), so to review more easily you might want to go through commit by commit.

Type of change
Please delete options that are not relevant.

How Has This Been Tested?

make test

Checklist: