Make exact matcher look in full text by lizgzil · Pull Request #252 · wellcometrust/reach

lizgzil · 2019-10-07T14:20:05Z

Fix #202

Description

This PR adapts the code to look for exact matches in all the policy text, rather than just in the references sections.

Renames the exact matcher variables so that they no longer refer to sections.
Adds an optional argument to get_scraping_results in file_manager.py that lets you choose which columns to include in the scraper output dictionary. This means we can include the text column needed to exact match on all the text. I did this rather than adding 'text' to the global variable SCRAPING_COLUMNS because the fuzzy matcher does not need the text, and holding that much text in memory would be wasteful.
Adds a new function, transform_scraper_text_file, in refparse.py that transforms the scraped full text for the exact matcher. It is very similar to transform_scraper_file, but I chose to create a separate function rather than add an if statement to the existing one. Perhaps that would be better, though?
Adds the exact matcher task to the policy DAG (editing exact_match_refs_operator.py and policy.py).
Adds 8 new exact matcher tests.
Changes test_pdf_objects so that these tests can be run (or skipped) on a Mac. This should be fixed properly at some point; see issue #273, "Fix PDF extraction for OS X (in pdf_parser.py)".
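
As an illustration of the optional columns argument described above, here is a minimal sketch. It is not the code in file_manager.py: the default column names, the JSON-lines file layout and the exact signature are all assumptions made for the example.

```python
import gzip
import json

# Hypothetical defaults; the real SCRAPING_COLUMNS in file_manager.py may differ.
SCRAPING_COLUMNS = ("title", "file_hash", "sections")


def get_scraping_results(path, scraping_columns=SCRAPING_COLUMNS):
    """Yield one dict per scraped document, keeping only the requested columns.

    Assumes a gzipped file with one JSON document per line.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            document = json.loads(line)
            # Missing columns come back as None instead of raising KeyError.
            yield {col: document.get(col) for col in scraping_columns}
```

With this shape, the exact matcher could ask for ("file_hash", "text") while the fuzzy matcher keeps the default and never loads the full text.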

Type of change

Please delete options that are not relevant.

🐛 Bug fix (Add Fix #(issue) to your PR)

Running time increases a lot!

I ran this locally for the 70 documents in the latest MSF scrape, using:

python -m policytool.refparse.refparse \
    --scraper-file "s3://datalabs-dev/reach-airflow/output/policy/parsed-pdfs/msf/parsed-pdfs-msf.json.gz" \
    --references-file "s3://datalabs-data/wellcome_publications/uber_api_publications.csv" \
    --model-file "s3://datalabs-data/reference_parser_models/reference_parser_pipeline.pkl" \
    --output-url "file://./tmp/parser-output/output_folder_name"

This took 504 seconds and found 10 doc-publication matches.

Without these changes it took 8 seconds and found 2 doc-publication matches (reassuringly, both are also among the 10 full-text matches).

I tried flashtext to make it quicker, but it didn't work when looking for whole sentences in large amounts of text; it works for 1 or 2 words in text.
I tried Whoosh to make this quicker, but it took 1774 seconds and found 7538 matches. There is probably a bug somewhere, but I didn't investigate any further.
I tried spaCy's PhraseMatcher (pip install spacy and python -m spacy download en_core_web_sm) but abandoned it, since we don't need it in Airflow, so it doesn't matter if refparse is slow when run locally.

I've created issue #257 in case someone wants to pick up this performance issue in the future.
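
To make the slowdown easier to reason about, here is a naive sketch of exact matching over full text. It is not the ExactMatcher in this PR; the normalisation and the function names are assumptions. It shows the shape of the cost: every publication title is scanned against every document, and the full text is far longer than the references section.

```python
import re


def normalise(text):
    """Lowercase and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def exact_matches(documents, publications):
    """Return (doc_id, pub_id) pairs where a title occurs in a document's text.

    documents: iterable of (doc_id, full_text)
    publications: iterable of (pub_id, title)
    """
    # Pad with spaces so that titles only match on whole-word boundaries.
    titles = []
    for pub_id, title in publications:
        cleaned = normalise(title)
        if cleaned:  # skip titles that are empty after normalisation
            titles.append((pub_id, f" {cleaned} "))

    matches = []
    for doc_id, text in documents:
        padded = f" {normalise(text)} "
        for pub_id, title in titles:  # one substring scan per title, per document
            if title in padded:
                matches.append((doc_id, pub_id))
    return matches
```

The work grows with (number of documents) x (number of titles) x (text length), which is why an index-based search is attractive for the production pipeline.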

How Has This Been Tested?

>>> make docker-test
74 passed, 4 warnings in 10.56s
>>> make test
 72 passed, 2 skipped, 5 warnings in 7.34s
>>> cd reach/refparse/
>>> python -m unittest
Ran 35 tests in 0.112s

OK

I ran:

source build/virtualenv/bin/activate
eval $(./export_env.py)
docker-compose up -d
./docker_exec.sh airflow test policy-test ExactMatchRefs.msf 2019-11-8

It took about 3 hours to run, processed 5377194 publications and found 8132 matches.
I noticed my results contain duplicates. Fixing this might help speed things up, if we delete duplicate policy docs before the exact matcher runs. I added some information about this duplication to issue #183.
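
One possible way to drop duplicates before matching is sketched below. It is only an illustration: the record layout and the choice of key (here a hypothetical "text" field) are assumptions, and issue #183 may call for a different notion of "duplicate".

```python
import hashlib


def drop_duplicates(records, key="text"):
    """Yield each record whose value under `key` has not been seen before."""
    seen = set()
    for record in records:
        value = record.get(key) or ""
        # Store a fixed-size digest rather than the (possibly huge) value itself.
        digest = hashlib.sha1(value.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield record
```

Because it is a generator, it can sit in front of the matcher without loading every record into memory at once.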

SamDepardieu · 2019-10-07T15:30:14Z

this took 504 seconds and finds 10 doc-publication matches.

Without these changes this took 8 seconds and found 2 doc-publication matches (reassuring also found in the 10 full text search matches).

I don't see anything obvious, but this is slightly worrying in my opinion. 504 seconds for 70 documents is a lot, especially considering that some of our organisations (e.g. parliament) have around 60k documents.

We may want to profile this code to find out what's taking so long, and see if we can come up with a solution that makes it run faster before merging this PR.

I'd be happy to help you with that if needed, though
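
For the profiling suggested above, a small helper built on the standard library's cProfile could look like this. The helper name is made up for the example; which refparse function to pass in is left open.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, text report of the top entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the slowest call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()
```

Calling it on the matching step alone, rather than the whole run, should show whether the time goes into the text search itself or into reading and transforming the scraper file.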

lizgzil · 2019-10-07T15:33:46Z

@SamDepardieu

I'd be happy to help you with that if needed, though
That'd be great! Yes, it's quite a huge increase in time :/

nsorros · 2019-10-07T15:59:24Z

this will stop the yielding if any of the rows does not have a text column

nsorros · 2019-10-07T16:00:21Z

you can add it now by document.get('uri', None)

sam added this as 'url' in his latest PR so using that instead of 'uri'
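
A sketch of the behaviour the two comments above ask for: skip rows that have no text instead of ending the generator, and read the URL with .get so a missing key does not raise. The function name and the output keys are assumptions, not the code under review.

```python
def yield_full_text(documents, text_column="text"):
    """Yield a small dict for every document that actually has text."""
    for document in documents:
        text = document.get(text_column)
        if not text:
            # `continue` skips only this row; a `return` or an uncaught
            # KeyError here would stop all remaining documents being yielded.
            continue
        yield {
            "url": document.get("url", None),  # None when the scraper gave no URL
            "text": text,
        }
```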

nsorros · 2019-10-07T16:01:53Z

Also note that this change does not affect the Reach tool: for that, you need to change the Airflow exact match task so that it searches the full text index, not only the reference section index. You should add this before we merge.

nsorros

You need to implement the change for the Airflow DAG as well.

ivyleavedtoadflax · 2019-10-07T16:02:25Z

There is a package I used in a previous role which is significantly faster than regex, and could be used in ExactMatcher. I'll see if I can dig it out!

ivyleavedtoadflax · 2019-10-07T16:03:58Z

Found it! We had good results with this.

https://www.analyticsvidhya.com/blog/2017/11/flashtext-a-library-faster-than-regular-expressions/

https://github.com/vi3k6i5/flashtext

lizgzil · 2019-10-08T10:13:10Z

@hblanks could you check that these lines are performing as you think they should, and whether parts of them are redundant? My first thought is that if document[section_column]: should make the try redundant (i.e. if it passes the if then it should also pass the try in this case). And as @nsorros says below, if there is no 'sections' part of the dict for a document then this will stop all the rest of the documents being processed.

lizgzil · 2019-10-09T10:36:18Z

@SamDepardieu or @hblanks please can you check my commit changing the airflow stuff if you have time? Tests passed, but I'm not confident the arguments of exact_match_refs_operator.ExactMatchRefsOperator are correct.

lizgzil · 2019-10-09T10:37:55Z

Also (@ivyleavedtoadflax @SamDepardieu), I've given up on making local runs of refparse quicker in this PR since it won't affect the product's performance anyway. I've created issue #257 in case someone wants to pick it up in the future.

nsorros · 2019-10-10T10:18:55Z

es_index is not a good name; every index in Elasticsearch is an ES index. I suggest full_text_index or es_full_text_index.

nsorros · 2019-10-10T10:20:04Z

@SamDepardieu or @hblanks please can you check my commit changing the airflow stuff if you have time? Tests passed, but I'm not confident the arguments of exact_match_refs_operator.ExactMatchRefsOperator are correct.

Did you run the DAG? Did the exact matcher find matches?

The tests pass because, I think, there are no tests testing the exact matcher.

SamDepardieu · 2019-10-10T15:00:31Z

@SamDepardieu or @hblanks please can you check my commit changing the airflow stuff if you have time? Tests passed, but I'm not confident the arguments of exact_match_refs_operator.ExactMatchRefsOperator are correct.

Did you run the DAG? Did the exact matcher find matches?

The tests pass because, I think, there are no tests testing the exact matcher.

The best way to test the dag is to run the policy-test DAG. If you're able to docker-compose up -d Reach, you can access Airflow's interface on localhost:8080 and then just click the little "play" button on the right of the policy-test dag

ivyleavedtoadflax · 2019-10-30T15:21:04Z

For simple tests like this it is probably preferable to use pytest, which means you can do away with creating a test class and define everything as functions. It's no biggie, especially because pytest understands unittest tests, but pytest is much more user friendly to write.
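
To illustrate the difference, here is the same check written both ways. find_title is a stand-in invented for the example, not a function from this PR.

```python
import unittest


def find_title(title, text):
    """Stand-in for the code under test: case-insensitive substring check."""
    return title.lower() in text.lower()


# unittest style: a class, methods and assert* helpers.
class TestFindTitle(unittest.TestCase):
    def test_match(self):
        self.assertTrue(find_title("Malaria control", "On malaria control in 2019"))

    def test_no_match(self):
        self.assertFalse(find_title("Malaria control", "Unrelated text"))


# pytest style: plain functions and plain assert statements.
def test_find_title_match():
    assert find_title("Malaria control", "On malaria control in 2019")


def test_find_title_no_match():
    assert not find_title("Malaria control", "Unrelated text")
```

pytest collects both styles, so the existing unittest classes can stay while new tests are written as functions.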

lizgzil · 2019-10-30T18:14:58Z

The policy test DAG is broken (not even running) and I'm not sure why. The error in Airflow is Broken DAG: [/airflow/dags/policy.py] dictionary update sequence element #0 has length 18; 2 is required, but this doesn't give me enough info to debug!
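
For what it's worth, that message is the standard Python ValueError raised when a dict is built or updated from a sequence whose first element is not a key/value pair, for example an 18-character string where a mapping was expected. The snippet below only reproduces the message; where this happens in policy.py is not known from the error alone, and the example value is made up.

```python
def describe_dict_error(value):
    """Return the error message dict(value) raises, or None if it succeeds."""
    try:
        dict(value)
    except (TypeError, ValueError) as error:
        return str(error)
    return None


# A list holding one 18-character string instead of (key, value) pairs.
message = describe_dict_error(["policy-test-output"])
print(message)
```

So a likely place to look is any argument that expects a dict (or a list of pairs) but is being handed a string or a list of strings.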

hblanks · 2019-10-31T08:10:22Z

@lizgzil - yes, the place for debugging dags is unfortunately in the airflow web (and maybe scheduler logs), using docker-compose logs airflow-web, sometimes with options. It's yucky.

Your branch could also stand to be rebased on top of the reach rename. Would you like me to do this, and then we debug the DAG together afterwards?

lizgzil · 2019-10-31T10:09:45Z

Your branch could also stand to be rebased on top of the reach rename. Would you like me to do this, and then we debug the DAG together afterwards?

@hblanks Yes please, that'd be great!

lizgzil · 2019-10-31T16:47:07Z

The error log for a full text task is:

[2019-10-31 16:43:07,601] {logging_mixin.py:95} INFO - [2019-10-31 16:43:07,600] {base.py:149} WARNING - POST http://elasticsearch:9200:9200/policy-test-docs/_delete_by_query [status:N/A request:0.000s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/urllib3/connection.py", line 160, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "/usr/local/lib/python3.6/site-packages/urllib3/util/connection.py", line 57, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/local/lib/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

Not sure if this is to do with my connection or not (after a quick google of what a socket.gaierror is).

I'll try with my DNS not set to 8.8.8.8 when I get back from holiday, but if anyone else wants to try running the test DAG in the meantime, that'd be great :)
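
Note that the URL in the log contains the port twice (elasticsearch:9200:9200), and a later commit in this PR reports that the connection method appended the port to a host that already included it. A minimal sketch of splitting host and port before building the URL follows; the function names are made up, this is not the code from that commit, and IPv6 addresses are not handled.

```python
def split_host_port(value, default_port=9200):
    """Split 'host', 'host:port' or 'scheme://host:port' into (host, port)."""
    value = value.split("://", 1)[-1].rstrip("/")  # drop any scheme and trailing slash
    host, sep, port = value.rpartition(":")
    if not sep:
        return value, default_port  # no port given in the configuration value
    return host, int(port)


def build_url(value, scheme="http"):
    """Build a URL that contains the port exactly once."""
    host, port = split_host_port(value)
    return f"{scheme}://{host}:{port}"
```

With this, a configuration value of either "elasticsearch" or "elasticsearch:9200" produces the same URL.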

…ns, and add an optional argument in filemanager which allows you to decide which columns to include in the scraper output dictionary. This will mean we can include the text column needed to exact match on all the text, rather than just the sections

…t the name of the sections column is called, create brand new function 'transform_scraper_text_file' which specifically transforms for the text column in the scraper, adding new argument 'scraping_columns' to get_file which deals with allowing the user to query any column names they'd like

lizgzil · 2019-11-08T15:09:09Z

Thanks to @SamDepardieu and @hblanks I've fixed my issues with Airflow.

I ran ./docker_exec.sh airflow test policy-test ExactMatchRefs.msf 2019-11-8 and got 8132 matches in 3 hours. As noted in the description, this takes longer than really necessary since there are duplicates in the ES EPMC Metadata index. I'm currently running the whole policy-test DAG, but I anticipate it'll take a long time! (Note that msf is the smallest dataset and it still took 3 hours.)

It'd be great to get this merged so we could see if the policy-test dag succeeds not using my computer, so a review of the code / suggestions on how to speed up performance / help fixing issue #183 would be very much appreciated :)

hblanks · 2019-11-12T07:31:14Z

@lizgzil - thanks for keeping this going. A couple thoughts:

Because we use the test DAG to quickly verify the pipeline, the test DAG has to complete within a fairly small amount of time.
It doesn't sound like we can get the test DAG running quickly right away with the exact matcher in place.
But, we do want to have the exact matcher available in the codebase, and maybe even able to run in a deployment environment.

So, my two cents would be that we create a separate dag, policy-test-exact-match, which we can use to try speeding up things as best we can. The first place is probably batching search queries to ElasticSearch, if that can be done -- or else running them in a thread pool.
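
As a sketch of what batching could look like: Elasticsearch's multi-search endpoint (_msearch) accepts a newline-delimited body with one header line and one query line per search. The helpers below only build that body with the standard library. The field name "text", the batch size and the use of match_phrase are assumptions, and sending the request (and checking it is actually faster) is left out.

```python
import json


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_msearch_body(titles, index, field="text", hits_per_title=10):
    """Return the NDJSON body for one _msearch request over `titles`."""
    lines = []
    for title in titles:
        lines.append(json.dumps({"index": index}))  # header line
        lines.append(json.dumps({                   # query line
            "query": {"match_phrase": {field: title}},
            "size": hits_per_title,
        }))
    # The multi-search body must end with a newline.
    return "\n".join(lines) + "\n"
```

Each batch then costs one HTTP round trip instead of one per title, for example `for chunk in batched(titles, 100): body = build_msearch_body(chunk, "policy-test-docs")`.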

hblanks · 2019-11-14T14:18:28Z

OK! This branch is rebased to a new branch, fix-exact-matcher-rebase. After adding a small commit, the exact matcher test DAG Liz added runs in short order. So, I suggest we:

Close this PR.
Create a PR for the new branch.
Maybe update dag.py so that the main DAG doesn't run the exact matcher, but the test DAG does?

lizgzil · 2019-11-14T14:27:48Z

Closing this and opening a rebased version #285

lizgzil requested review from SamDepardieu, aCampello, hblanks, ivyleavedtoadflax and nsorros October 7, 2019 14:45

lizgzil changed the title ~~[WIP] Make exact matcher look in full text~~ Make exact matcher look in full text Oct 7, 2019

nsorros reviewed Oct 7, 2019

nsorros previously requested changes Oct 7, 2019

lizgzil commented Oct 8, 2019

lizgzil mentioned this pull request Oct 9, 2019

Exact matching is slow on local run of refparse #257

Open

nsorros reviewed Oct 10, 2019

ivyleavedtoadflax reviewed Oct 30, 2019

hblanks force-pushed the fix-exact-matcher branch from 552db8d to 2a18a80 Compare October 31, 2019 11:23

lizgzil added 3 commits November 6, 2019 11:03

get rid of spaces

03a58e4

lizgzil and others added 8 commits November 6, 2019 11:03

Corrections to the transform_scraper functions

0d3e519

Add exact match task to policy dag

f663e84

Change name of es_index

48aca7c

Add exact matcher tests

5c1bc30

Changes to the airflow task for exact matching - rename variables in …

f1ae868

…policy.py, and define threshold variable

WIP - mark places where we should be using get_es_hosts()

2deb176

Updating es_hosts

8b79f9e

Refactoring for connect to take list of hosts

edc0def

lizgzil force-pushed the fix-exact-matcher branch from edb68cc to edc0def Compare November 6, 2019 11:07

SamDepardieu and others added 4 commits November 6, 2019 12:24

Fix es connection method

5b24d3e

- The new connection method added the port twice to the host, leading the lib to try to connect to 'http://elasticsearch:9200:9200'. Changed the connection method to split the port from the configuration variable.

Ignore 404 on docs deletion if index does not exist yet

5f69613

Use exact s3 path to publications and correct index to the full text …

cd5ebac

…location

Change publications path to the same empc key as used for the ESIndex…

cce0d64

…EPMCMetadata task, query correct column names in the exact match, save publications doi and pmid is exact match output, add logger information, fix keys for elasticsearch paths

Make changes to test_pdf_objects to allow for tests to be run on a ma…

7c1e406

…c, where pdfs will be interpreted differently

lizgzil requested a review from jdu November 8, 2019 16:39

lizgzil added 2 commits November 14, 2019 10:51

separate out policy-test-exact-matcher dag to policy-test dag

03c0193

Include more EPMC pubs in exact-matcher test dag, and include spider …

f4e8dd4

…in flow of tasks

lizgzil closed this Nov 14, 2019

lizgzil mentioned this pull request Nov 14, 2019

Add Exact Matcher #285

Merged

1 task
