This repository has been archived by the owner on Feb 4, 2022. It is now read-only.

Integrate multitask splitter+parser model #505

Merged
merged 15 commits into from
May 13, 2020

Conversation

lizgzil
Contributor

@lizgzil lizgzil commented Apr 27, 2020

Description

Note: No need to review anything from reach/ as this will all be deleted after the merge.

Addresses issue #500.

In this PR I replace SplitSection with SplitParser in refparse.py. This means that instead of using the deep reference splitter model plus the naive Bayes parser model, we now combine splitting and parsing using the single deep reference multitask model.

This has knock-on effects on various other scripts: the tests, the init and settings files, and the functions in parse.py (now reduced to just one) all change, and a new file (test_config_multitask.ini) makes the tests on this model a bit quicker than they would be with the default config file.

Note: I make the same changes in both reach/refparse and split_reach/extracter/refparse. Each commit adds one pair of scripts (e.g. reach/refparse/utils/__init__.py and split_reach/extracter/refparse/utils/__init__.py), so for easier reviewing you may want to go through commit by commit.
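As a rough illustration of the refactor (the class below is a toy stand-in, not the real deep_reference_parser API, which is configured via MULTITASK_CFG), the two-model pipeline collapses into a single call that both splits and labels:

```python
# Toy stand-in for the multitask model; NOT the real deep_reference_parser API.
class StubSplitParser:
    def split_parse(self, doc_text):
        # Pretend each non-empty line is one reference and label every token
        # "title", so the combined split+parse output shape is visible.
        return [
            [(token, "title") for token in line.split()]
            for line in doc_text.splitlines() if line.strip()
        ]

# Instantiate once (not inside the per-document loop), mirroring the PR.
splitter_parser = StubSplitParser()
references = splitter_parser.split_parse("Smith J. 2019\nJones A. 2020")
assert len(references) == 2                    # two references split out
assert references[0][0] == ("Smith", "title")  # each token carries a label
```

The point of the change is that one model call now produces both the reference boundaries and the per-token labels, where previously two separate models were run.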

Type of change

Please delete options that are not relevant.

  • 🐛 Bug fix (Add Fix #(issue) to your PR)
  • ✨ New feature
  • 🔥 Breaking change
  • 📝 Documentation update

How Has This Been Tested?

make test

Checklist:

  • My code follows the style guidelines of this project (pep8 AND pyflakes)
  • I have commented my code, particularly in hard-to-understand areas
  • If needed, I changed related parts of the documentation
  • I included tests in my PR
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • If my PR aims to fix an issue, I referenced it using #(issue)

@lizgzil lizgzil changed the title Integrate multitask parser Integrate multitask splitter+parser model Apr 27, 2020
@lizgzil lizgzil marked this pull request as ready for review April 29, 2020 12:31
sectioned_documents = transform_scraper_file(scraper_file)

# Instantiate deep_reference_parser model here (not in loop!)

section_splitter = SplitSection()
splitter_parser = SplitParser(config_file=MULTITASK_CFG)
Contributor

possibly the config_file should be the default on the version you shipped to not expose that here but this is a minor thing

from .fuzzy_match import FuzzyMatcher
from .file_manager import FileManager
from .serialiser import serialise_matched_reference, serialise_reference
from .exact_match import ExactMatcher

__all__ = [
split_reference,
predict_components,
merge_components,
Contributor

don't you still need to do merge?

Contributor Author

no, everything is done all together in parse.structure_reference now.

@@ -21,20 +21,15 @@
ExactMatcher)
from .settings import settings

from deep_reference_parser.split_section import SplitSection
from deep_reference_parser.split_parse import SplitParser
from deep_reference_parser.common import MULTITASK_CFG
Contributor

is this duplication? why are there two files doing the same thing?

@nsorros
Copy link
Contributor

nsorros commented Apr 30, 2020

This PR moves to a combined split/parse logic, but note that this is not necessary. The multitask model does not make joint predictions; it just trains on two tasks, i.e. multitask. You could still have separate split and parse methods, which would make integrating a bit easier.

@@ -0,0 +1,39 @@
[DEFAULT]
Contributor

this should not be here.

Contributor Author

we need the test config file for test_split_parse.py since the default config_multitask.ini file points to an unnecessarily large embeddings file which would slow down the testing significantly. Or do you mean it should be located in a different folder?

Contributor

i meant that it should live in deep reference parser not in reach. also if it ends up staying here maybe you can remove everything but the build part as the others are not needed.



def predict_components(model, reference_components):
def structure_reference(reference_components):
Contributor

is that doing the job of merge_components?

Contributor Author

yep

Contributor

add input output or args and returns statements

Contributor

@nsorros nsorros left a comment

There are certain issues that need discussing before this merges. Those are:

  • having a test config file here
  • combining split and parse which is not strictly necessary
  • remove pool_map which should be avoided or better understood

https://datalabs-public.s3.eu-west-2.amazonaws.com/reach_evaluator/reach_evaluator-2020.1.1-py3-none-any.whl

https://github.com/wellcometrust/deep_reference_parser/releases/download/2020.4.29/deep_reference_parser-2020.4.5-py3-none-any.whl
Contributor

This line needs to be updated in split_reach/extracter/requirements.txt too :)

Contributor Author

oh i didn't realise each component had its own makefile+requirements!

@lizgzil
Contributor Author

lizgzil commented May 1, 2020

@nsorros 👍 for the review, on your points:

  • "having a test config file here" - see my response above, I was a bit unsure what you meant
  • "combining split and parse which is not strictly necessary" - As we discussed the new multitask model does indeed make predictions of both tasks, so it is necessary to combine.
  • "remove pool_map which should be avoided or better understood" : I've added an issue Investigate parallelisation of yielding structured references #507, but for now the pool_map isn't used for multiprocessing since pool_map = map, so I hope it's ok to address this at another point.

@nsorros
Contributor

nsorros commented May 1, 2020

> @nsorros 👍 for the review, on your points:
>
>   • "having a test config file here" - see my response above, I was a bit unsure what you meant
>   • "combining split and parse which is not strictly necessary" - As we discussed the new multitask model does indeed make predictions of both tasks, so it is necessary to combine.
>   • "remove pool_map which should be avoided or better understood" : I've added an issue Investigate parallelisation of yielding structured references #507, but for now the pool_map isn't used for multiprocessing since pool_map = map, so I hope it's ok to address this at another point.

ideally the test file should be in the deep reference end not here. reach should use deep reference parser as a package without needing to define such a config to test.

split and parse could still be used separately, this would mean predicting twice, it would make integrating easier but i understand this is inefficient, just making the point that an easier transition could be made before combining the two.

removing pool_map should be left to the devs, it is up to them, but it is not as simple as saying that map is used at the moment, as you are breaking an easy way to parallelise the code in case running reach takes longer than needed.
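For context, the pool_map pattern under discussion is a small indirection: binding pool_map = map keeps execution serial, while rebinding it to a process pool's map parallelises the same call site. A minimal sketch (the worker function here is invented, not the real refparse code):

```python
def structure(raw_reference):
    # Invented stand-in for the real per-reference structuring work.
    return raw_reference.strip().upper()

references = [" smith 2019 ", " jones 2020 "]

pool_map = map  # serial: equivalent to what refparse does at the moment
serial = list(pool_map(structure, references))
assert serial == ["SMITH 2019", "JONES 2020"]

# The same call site parallelises by rebinding pool_map, e.g.:
#   from multiprocessing import Pool
#   with Pool(4) as pool:
#       results = pool.map(structure, references)
```

Removing the indirection entirely means callers must later be rewritten to reintroduce parallelism, which is the concern raised above.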

@lizgzil
Contributor Author

lizgzil commented May 4, 2020

@nsorros good point about the testing! Of course it should just be tested in the deep reference parser repo. I've removed these tests and the config (from split_reach/ only since reach/ will be deleted anyway) in this commit c623c5b

Contributor

@SamDepardieu SamDepardieu left a comment

I'm happy with the requirement change, but I'll leave the final approval to @nsorros

@nsorros
Contributor

nsorros commented May 5, 2020

> I'm happy with the requirement change, but I'll leave the final approval to @nsorros

could you comment on the removal of pool_map as well?

@SamDepardieu
Contributor

@nsorros Sorry, just saw your comment. It's okay to remove it for the time being, but it'd probably be good to put it back at some point to speed things up. For now, we can monitor this to see if it's actually needed.

@SamDepardieu SamDepardieu self-requested a review May 12, 2020 11:01
@@ -144,33 +137,27 @@ def get_file(


def yield_structured_references(scraper_file,
pool_map, logger, model_file=DEFAULT_MODEL_FILE):
pool_map, logger):
Contributor

no need to pass pool_map here

Contributor Author

this file will be deleted (it's in reach/)

"""

logger.info("[+] Reading input files")

# Loading the scraper results
scraper_file = get_file(scraper_file, "", get_scraped=True)

# Loading the pipeline
model = get_file(model_file, 'pickle')
Contributor

i know sam hated this function get_file but i still remember it with nostalgia

@@ -224,14 +205,13 @@ def yield_structured_references(scraper_file,
)


def parse_references(scraper_file, model_file, num_workers, logger):
def parse_references(scraper_file, num_workers, logger):
Contributor

you also do not need num_workers anymore

Contributor Author

again, will be deleted

structured_reference = {ref_class: '' for ref_class in settings.REF_CLASSES}
for component in settings.DRP_REF_COMPONENTS:
structured_reference[
settings.COMPONENT_NAME_MAP.get(component, component)
Contributor

i would add a comment in the end of the line to say "leave component unchanged if no map" as the logic .get(component, component) makes you stop and think what is going on

Contributor Author

ok will do, but I'm just going to make the change to split_reach/extracter/refparse/utils/parse.py
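The .get(component, component) idiom being discussed falls back to the key itself when the map has no entry. A tiny illustration (the map contents here are invented; the real values live in settings.COMPONENT_NAME_MAP):

```python
# Hypothetical name map standing in for settings.COMPONENT_NAME_MAP.
COMPONENT_NAME_MAP = {"author": "Authors", "title": "Title"}

mapped = COMPONENT_NAME_MAP.get("author", "author")
unmapped = COMPONENT_NAME_MAP.get("year", "year")  # leave component unchanged if no map

assert mapped == "Authors"
assert unmapped == "year"
```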

for component in settings.DRP_REF_COMPONENTS:
structured_reference[
settings.COMPONENT_NAME_MAP.get(component, component)
] = ' '.join([r[0] for r in reference_components if r[1]==component])
Contributor

this logic of merge is different from the previous one. here you are merging all components that have been tagged the same. i understand that we do not have a way to evaluate the impact of this, but could we write an email to @jdu to push for the resolution of the unique id, or get a timeline for it? it is quite important to be able to argue for those changes.

Contributor Author

The reason for this is that the deep reference parser doesn't output probabilities of the predictions, which was the information needed for the old way of finding the most likely 'group' of authors, titles, etc. Perhaps there is a better way to deal with this though than just grouping them all together
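The merge behaviour being discussed can be sketched as follows: every token whose predicted label matches a component is concatenated into that component's field, regardless of where it appears in the reference (tokens and labels below are invented):

```python
# Invented token/label pairs standing in for the multitask model's output.
reference_components = [
    ("Smith", "author"), ("J.", "author"),
    ("Deep", "title"), ("Reference", "title"), ("Parsing", "title"),
    ("2019", "year"),
]
components = ["author", "title", "year"]

# All tokens sharing a label are grouped into one field, which is the
# difference from the old probability-based "most likely group" approach.
structured_reference = {
    component: " ".join(
        token for token, label in reference_components if label == component
    )
    for component in components
}

assert structured_reference == {
    "author": "Smith J.",
    "title": "Deep Reference Parsing",
    "year": "2019",
}
```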

for component in settings.DRP_REF_COMPONENTS:
structured_reference[
settings.COMPONENT_NAME_MAP.get(component, component)
] = ' '.join([r[0] for r in reference_components if r[1]==component])
Contributor

i find it really difficult to repeat comments and i am afraid that if you keep both changes the code in the two different places will differ. i will not hold the changes due to that but you have been warned :P

Contributor Author

Sam is just going to delete the entire reach/ folder, so I'm only making the edits to the split_reach folder now, but I am being careful. It was a bit of a nightmare to undo the commits since they were all paired.

Contributor

@nsorros nsorros left a comment

happy to approve at this point, there are some minor comments on my end. the most problematic bit is that the code is repeated, which, if liz addresses comments, i am sure will lead to diverging functionality. i strongly suggest implementing the changes in one place and removing the others. as an example of divergence i noticed that pool_map and num_workers have been removed in one place and not in the other.

@lizgzil lizgzil merged commit 9a8acb0 into master May 13, 2020
@lizgzil lizgzil deleted the integrate-multitask-parser branch May 13, 2020 09:44
jdu pushed a commit that referenced this pull request Jun 2, 2020
* Updates DRP whl location to latest github release

* Add new names for deep reference components and mapping to old names to settings

* new easier way to structure references using the output of the multitask deep reference model

* Subbing in the SplitParser model instead of the SplitSection model into refparser.py - involves getting rid of the now unneeded parser model, and a slightly different structure in yield_structured_references

* Delete old tests for the split and parse tasks, and combine into one test script for the multitask model

* Add test config files so that deep reference parser tests don't download an unnecessarily massive weights file

* get rid of old parse functions from init file

* Update release version of deep reference parser

* Correct path for test config

* Update new whl location in extracter requirements

* Delete test of deep splitparser and associated config file

* remove pool map from refparse

* Adding comment to make structure reference dict naming clearer

3 participants