Integrate multitask splitter+parser model #505

lizgzil · 2020-04-27T15:56:20Z

Description

Note: No need to review anything from reach/ as this will all be deleted after the merge.

Addresses issue #500.

In this PR I replace SplitSection with SplitParser in refparse.py. This means instead of using the deep reference splitter model plus the naive bayes parser model, we now combine splitting and parsing by using the deep reference multitask model.

This has knock-on effects for various other scripts - changing the tests, the init and settings files, the functions in parse.py (now reduced to just one), adding a file (test_config_multitask.ini) which will make the tests on this model a bit quicker than they would be with the default config file.

Note: I make changes to both the scripts in reach/refparse and split_reach/extracter/refparse, which are essentially the same changes for both file locations. In each commit I add each pair of scripts (e.g. reach/refparse/utils/__init__.py and split_reach/extracter/refparse/utils/__init__.py) so in terms of reviewing more easily you might want to go through commit by commit.

Type of change

Please delete options that are not relevant.

🐛 Bug fix (Add Fix #(issue) to your PR)
✨ New feature
🔥 Breaking change
📝 Documentation update

How Has This Been Tested?

make test

Checklist:

My code follows the style guidelines of this project (pep8 AND pyflakes)
I have commented my code, particularly in hard-to-understand areas
If needed, I changed related parts of the documentation
I included tests in my PR
New and existing unit tests pass locally with my changes
Any dependent changes have been merged and published in downstream modules
If my PR aims to fix an issue, I referenced it using #(issue)

…to settings

…ask deep reference model

…to refparser.py - involves getting rid of the now unneeded parser model, and a slightly different structure in yield_structured_references

…test script for the multitask model

…ad a unneccessarily massive weights file

nsorros · 2020-04-30T08:13:09Z

reach/refparse/refparse.py

    sectioned_documents = transform_scraper_file(scraper_file)

    # Instantiate deep_reference_parser model here (not in loop!)
-
-    section_splitter = SplitSection()
+    splitter_parser = SplitParser(config_file=MULTITASK_CFG)


possibly the config_file should be the default on the version you shipped to not expose that here but this is a minor thing

nsorros · 2020-04-30T08:15:40Z

reach/refparse/utils/__init__.py

 from .fuzzy_match import FuzzyMatcher
 from .file_manager import FileManager
 from .serialiser import serialise_matched_reference, serialise_reference
 from .exact_match import ExactMatcher

 __all__ = [
-    split_reference,
-    predict_components,
-    merge_components,


don't you still need to do merge?

no, everything is done all together in parse.structure_reference now.

nsorros · 2020-04-30T08:21:19Z

split_reach/extracter/refparse/refparse.py

@@ -21,20 +21,15 @@
                   ExactMatcher)
 from .settings import settings

-from deep_reference_parser.split_section import SplitSection
+from deep_reference_parser.split_parse import SplitParser
+from deep_reference_parser.common import MULTITASK_CFG


is this duplciation? why there are two files doing the same thing?

reach/refparse/refparse.py

nsorros · 2020-04-30T08:26:25Z

This PR moves into a combined split / parse logic but note that this is not necessary. The multi task model does not make joint predictions, just trains on two tasks i.e. multitask. You can still have a separate split and parse method which would make integrating a bit easier.

nsorros · 2020-04-30T08:26:53Z

reach/refparse/tests/test_config_multitask.ini

@@ -0,0 +1,39 @@
+[DEFAULT]


this should not be here.

we need the test config file for test_split_parse.py since the default config_multitask.ini file points to an unnecessarily large embeddings file which would slow down the testing significantly. Or do you mean it should be located in a different folder?

i meant that it should live in deep reference parser not in reach. also if it ends up staying here maybe you can remove everything but the build part as the others are not needed.

nsorros · 2020-04-30T08:27:44Z

reach/refparse/utils/parse.py

-
-
-def predict_components(model, reference_components):
+def structure_reference(reference_components):


is that doing the job of merge_components?

add input output or args and returns statements

nsorros

There are certain issues that need discussing before this merges. Those are:

having a test config file here
combining split and parse which is not strictly necessary
remove pool_map which should be avoided or better understood

SamDepardieu · 2020-04-30T08:51:16Z

requirements.txt

 https://datalabs-public.s3.eu-west-2.amazonaws.com/reach_evaluator/reach_evaluator-2020.1.1-py3-none-any.whl
+
+https://github.com/wellcometrust/deep_reference_parser/releases/download/2020.4.29/deep_reference_parser-2020.4.5-py3-none-any.whl


This line needs to be updated in split_reach/extracter/requirements.txt too :)

oh i didnt realise each component had its own makefile+requirements!

lizgzil · 2020-05-01T15:27:25Z

@nsorros 👍 for the review, on your points:

"having a test config file here" - see my response above, I was a bit unsure what you meant
"combining split and parse which is not strictly necessary" - As we discussed the new multitask model does indeed make predictions of both tasks, so it is necessary to combine.
"remove pool_map which should be avoided or better understood" : I've added an issue Investigate paralysation of yielding structured references #507, but for now the pool_map isn't used for multiprocessing since pool_map = map, so I hope it's ok to address this at another point.

nsorros · 2020-05-01T17:41:21Z

@nsorros 👍 for the review, on your points:

"having a test config file here" - see my response above, I was a bit unsure what you meant

"combining split and parse which is not strictly necessary" - As we discussed the new multitask model does indeed make predictions of both tasks, so it is necessary to combine.

"remove pool_map which should be avoided or better understood" : I've added an issue Investigate paralysation of yielding structured references #507, but for now the pool_map isn't used for multiprocessing since pool_map = map, so I hope it's ok to address this at another point.

ideally the test file should be in the deep reference end not here. reach should use deep reference parser as a package without needing to define such a config to test.

split and parse could still be used separately, this would mean predicting twice, it would make integrating easier but i understand this is inefficient, just making the point that an easier transition could be made before combining the two.

removing pool_map should be left to devs, it is up to them but it is not as simple as saying that map is used at the moment, as you are breaking an easy way to parallelise the code in case running reach takes longer than needed.

lizgzil · 2020-05-04T11:55:35Z

@nsorros good point about the testing! Of course it should just be tested in the deep reference parser repo. I've removed these tests and the config (from split_reach/ only since reach/ will be deleted anyway) in this commit c623c5b

SamDepardieu

I'm happy with the requirement change, but I'll leave the final approval to @nsorros

nsorros · 2020-05-05T12:18:49Z

I'm happy with the requirement change, but I'll leave the final approval to @nsorros

could you comment on the removal of pool_map as well?

SamDepardieu · 2020-05-12T10:58:48Z

@nsorros Sorry, just saw your comment. It's okay to remove it for the time being, but it'd probably be good to put it back at some point, to speed things up. In the first time, we can monitor this to see if it actually needs this

nsorros · 2020-05-13T07:16:52Z

reach/refparse/refparse.py

@@ -144,33 +137,27 @@ def get_file(


 def yield_structured_references(scraper_file,
-                pool_map, logger, model_file=DEFAULT_MODEL_FILE):
+                pool_map, logger):


no need to pass pool_map here

this file will be deleted (it's in reach/)

nsorros · 2020-05-13T07:17:21Z

reach/refparse/refparse.py

    """

    logger.info("[+] Reading input files")

    # Loading the scraper results
    scraper_file = get_file(scraper_file, "", get_scraped=True)

-    # Loading the pipeline
-    model = get_file(model_file, 'pickle')


i know sam hated this function get_file but i still remember it with nostalgia

nsorros · 2020-05-13T07:18:04Z

reach/refparse/refparse.py

@@ -224,14 +205,13 @@ def yield_structured_references(scraper_file,
    )


-def parse_references(scraper_file, model_file, num_workers, logger):
+def parse_references(scraper_file, num_workers, logger):


you also do not need num_workers anymore

again, will be deleted

nsorros · 2020-05-13T07:22:20Z

reach/refparse/utils/parse.py

+    structured_reference = {ref_class: '' for ref_class in settings.REF_CLASSES}
+    for component in settings.DRP_REF_COMPONENTS:
+        structured_reference[
+            settings.COMPONENT_NAME_MAP.get(component, component)


i would add a comment in the end of the line to say "leave component unchanged if no map" as the logic .get(component, component) makes you stop and think what is going on

ok will do, but I'm just going to make the change to split_reach/extracter/refparse/utils/parse.py

nsorros · 2020-05-13T07:24:09Z

reach/refparse/utils/parse.py

+    for component in settings.DRP_REF_COMPONENTS:
+        structured_reference[
+            settings.COMPONENT_NAME_MAP.get(component, component)
+            ] = ' '.join([r[0] for r in reference_components if r[1]==component])


this logic of merge is different from the previous one. here you are merging all components that have been tagged the same. i understand that we do not have a way to evaluate the impact of this but could we write an email to @jdu to push about the resolution of the unique id or get a timeline for it, it is quite important to be able to argue those changes.

The reason for this is that the deep reference parser doesn't output probabilities of the predictions, which was the information needed for the old way of finding the most likely 'group' of authors, titles, etc. Perhaps there is a better way to deal with this though than just grouping them all together

nsorros · 2020-05-13T07:25:27Z

split_reach/extracter/refparse/utils/parse.py

+    for component in settings.DRP_REF_COMPONENTS:
+        structured_reference[
+            settings.COMPONENT_NAME_MAP.get(component, component)
+            ] = ' '.join([r[0] for r in reference_components if r[1]==component])


i find it really difficult to repeat comments and i am afraid that if you keep both changes that code in the two different places will differ. i will not hold the changes due to that but you have been warned :P

Sam is just going to delete the entire reach/ folder, so I'm only making the edits to the split_reach folder now, but I am being careful. It was a bit of a nightmare to undo the commits since they were all paired.

nsorros

happy to approve at this point, there are some minor comments on my end. the most problematic bit is that the code is repeated which if liz addresses comments i am sure that will lead to different functionalities. i strongly suggest implementing the changes in one place and removing the others. as an example of divergence i noticed that pool_map and num_workers has been removed in one place and not in the other.

…/reach into integrate-multitask-parser

* Updates DRP whl location to latest github release * Add new names for deep reference components and mapping to old names to settings * new easier way to structure references which the output of the multitask deep reference model * Subbing in the SplitParser model instead of the SplitSection model into refparser.py - involves getting rid of the now unneeded parser model, and a slightly different structure in yield_structured_references * Delete old tests for the split and parse tasks, and combine into one test script for the multitask model * Add test config files so that deep reference parser tests dont download a unneccessarily massive weights file * get rid of old parse functions from init file * Update release version of deep reference parser * Correct path for test config * Update new whl location in extracter requirements * Delete test of deep splitparser and associated config file * remove pool map from refparse * Adding comment to make structure reference dict naming clearer

lizgzil added 7 commits April 27, 2020 11:31

Updates DRP whl location to latest github release

0b6209d

Add new names for deep reference components and mapping to old names …

30a51e5

…to settings

new easier way to structure references which the output of the multit…

56b2454

…ask deep reference model

Subbing in the SplitParser model instead of the SplitSection model in…

2633194

…to refparser.py - involves getting rid of the now unneeded parser model, and a slightly different structure in yield_structured_references

Delete old tests for the split and parse tasks, and combine into one …

2448e13

…test script for the multitask model

Add test config files so that deep reference parser tests dont downlo…

5e88954

…ad a unneccessarily massive weights file

get rid of old parse functions from init file

5f14803

lizgzil changed the title ~~Integrate multitask parser~~ Integrate multitask splitter+parser model Apr 27, 2020

lizgzil mentioned this pull request Apr 27, 2020

Incorporate new multitask deep reference parser model #500

Closed

lizgzil added 2 commits April 29, 2020 13:28

Update release version of deep reference parser

ba33304

Correct path for test config

eba181d

lizgzil requested review from SamDepardieu, nsorros, jdu and aCampello April 29, 2020 12:31

lizgzil marked this pull request as ready for review April 29, 2020 12:31

nsorros reviewed Apr 30, 2020

View reviewed changes

reach/refparse/refparse.py Show resolved Hide resolved

nsorros reviewed Apr 30, 2020

View reviewed changes

nsorros suggested changes Apr 30, 2020

View reviewed changes

SamDepardieu suggested changes Apr 30, 2020

View reviewed changes

lizgzil mentioned this pull request Apr 30, 2020

Use less characters in text to be split #506

Merged

11 tasks

Update new whl location in extracter requirements

0ffba4f

Delete test of deep splitparser and associated config file

c623c5b

SamDepardieu reviewed May 5, 2020

View reviewed changes

SamDepardieu self-requested a review May 12, 2020 11:01

SamDepardieu approved these changes May 12, 2020

View reviewed changes

lizgzil added 2 commits May 12, 2020 16:35

remove pool map from refparse

059b306

Merge branch 'master' into integrate-multitask-parser

d3d870a

nsorros reviewed May 13, 2020

View reviewed changes

nsorros approved these changes May 13, 2020

View reviewed changes

lizgzil added 2 commits May 13, 2020 10:38

Adding comment to make structure reference dict naming clearer

6202bae

Merge branch 'integrate-multitask-parser' of github.com:wellcometrust…

7f72f7d

…/reach into integrate-multitask-parser

lizgzil merged commit 9a8acb0 into master May 13, 2020

lizgzil deleted the integrate-multitask-parser branch May 13, 2020 09:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Integrate multitask splitter+parser model #505

Integrate multitask splitter+parser model #505

lizgzil commented Apr 27, 2020 •

edited

Loading

nsorros Apr 30, 2020

nsorros Apr 30, 2020

lizgzil Apr 30, 2020

nsorros Apr 30, 2020

nsorros commented Apr 30, 2020

nsorros Apr 30, 2020

lizgzil Apr 30, 2020

nsorros May 1, 2020

nsorros Apr 30, 2020

lizgzil Apr 30, 2020

nsorros May 13, 2020

nsorros left a comment

SamDepardieu Apr 30, 2020

lizgzil Apr 30, 2020

lizgzil Apr 30, 2020

lizgzil commented May 1, 2020

nsorros commented May 1, 2020

lizgzil commented May 4, 2020

SamDepardieu left a comment

nsorros commented May 5, 2020

SamDepardieu commented May 12, 2020

nsorros May 13, 2020

lizgzil May 13, 2020

nsorros May 13, 2020

nsorros May 13, 2020

lizgzil May 13, 2020

nsorros May 13, 2020

lizgzil May 13, 2020

nsorros May 13, 2020

lizgzil May 13, 2020

nsorros May 13, 2020

lizgzil May 13, 2020

nsorros left a comment



		def predict_components(model, reference_components):
		def structure_reference(reference_components):

		https://datalabs-public.s3.eu-west-2.amazonaws.com/reach_evaluator/reach_evaluator-2020.1.1-py3-none-any.whl

		https://github.com/wellcometrust/deep_reference_parser/releases/download/2020.4.29/deep_reference_parser-2020.4.5-py3-none-any.whl

Integrate multitask splitter+parser model #505

Integrate multitask splitter+parser model #505

Conversation

lizgzil commented Apr 27, 2020 • edited Loading

Description

Type of change

How Has This Been Tested?

Checklist:

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nsorros commented Apr 30, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nsorros left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lizgzil commented May 1, 2020

nsorros commented May 1, 2020

lizgzil commented May 4, 2020

SamDepardieu left a comment

Choose a reason for hiding this comment

nsorros commented May 5, 2020

SamDepardieu commented May 12, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nsorros left a comment

Choose a reason for hiding this comment

lizgzil commented Apr 27, 2020 •

edited

Loading