Skip to content
This repository was archived by the owner on Feb 4, 2022. It is now read-only.

Make exact matcher look in full text#252

Closed
lizgzil wants to merge 18 commits intomasterfrom
fix-exact-matcher
Closed

Make exact matcher look in full text#252
lizgzil wants to merge 18 commits intomasterfrom
fix-exact-matcher

Conversation

@lizgzil
Copy link
Copy Markdown
Contributor

@lizgzil lizgzil commented Oct 7, 2019

Fix #202

Description

This PR adapts the code to look for exact matches in all the policy text, rather than just in the references sections.

  • Rename exact matcher variables to not have anything to do with sections
  • Adds an optional argument in get_scraping_results in file_manager.py which allows you to input which columns to include in the scraper output dictionary. This will mean we can include the text column needed to exact match on all the text. I did this rather than just updating the global variable SCRAPING_COLUMNS with 'text' since for the fuzzy match it is unneeded to store the large text data in memory.
  • I create a new function transform_scraper_text_file in refparse.py to specifically transform the scraped full text for the exact match. This function is v simliar to transform_scraper_file, but I chose to create another function rather than using an if statement in the same one - perhaps this would be better though?
  • Added the exact matcher task to the policy dag (editing exact_match_refs_operator.py and policy.py).
  • Adds 8 new exact matcher tests
  • Changes to test_pdf_objects to allow for these tests to be run (or skipped) on a mac (this should be fixed properly at some point, see issue Fix PDF extraction for OS X(in pdf_parser.py) #273 )

Type of change

Please delete options that are not relevant.

  • 🐛 Bug fix (Add Fix #(issue) to your PR)

Running time increases a lot!

A ran locally for the 70 documents in the latest MSF scrape, using:

python -m policytool.refparse.refparse \
    --scraper-file "s3://datalabs-dev/reach-airflow/output/policy/parsed-pdfs/msf/parsed-pdfs-msf.json.gz" \
    --references-file "s3://datalabs-data/wellcome_publications/uber_api_publications.csv" \
    --model-file "s3://datalabs-data/reference_parser_models/reference_parser_pipeline.pkl" \
    --output-url "file://./tmp/parser-output/output_folder_name"

this took 504 seconds and finds 10 doc-publication matches.

Without these changes this took 8 seconds and found 2 doc-publication matches (reassuring also found in the 10 full text search matches).

  • I tried flashtext to make it quicker, but it didn't work on looking for whole sentences in large amounts of text - it works for 1 or 2 words in text.
  • I tried Whoosh to make this quicker, but it took 1774 seconds and found 7538 matches - obviously there is probably a bug somewhere, but I didn't investigate any further.
  • I tried Spacy's PhraseMatcher (pip install spacy and python -m spacy download en_core_web_sm) but aborted trying this out, since in airflow we dont need it, so it doesnt matter if refparse ran locally is slow

I've created an issue #257, if someone wants to pick it up this performance issue in the future.

How Has This Been Tested?

>>> make docker-test
74 passed, 4 warnings in 10.56s
>>> make test
 72 passed, 2 skipped, 5 warnings in 7.34s
>>> cd reach/refparse/
>>> python -m unittest
Ran 35 tests in 0.112s

OK

I ran:

source build/virtualenv/bin/activate
eval $(./export_env.py)
docker-compose up -d
./docker_exec.sh airflow test policy-test ExactMatchRefs.msf 2019-11-8

it took about 3 hours to run and processed 5377194 publications and found 8132 matches.
I noticed my results have duplicates in, fixing this might help speed things up if we delete duplicate policy docs before the exact matcher runs. I added some info about this duplication to issue #183

@lizgzil lizgzil changed the title [WIP] Make exact matcher look in full text Make exact matcher look in full text Oct 7, 2019
@SamDepardieu
Copy link
Copy Markdown
Contributor

this took 504 seconds and finds 10 doc-publication matches.

Without these changes this took 8 seconds and found 2 doc-publication matches (reassuring also found in the 10 full text search matches).

I don't see anything obvious, but this is slightly worrying in my opinion. 504 seconds for 70 documents is a lot, especially considering that some of our organisations (e.g. parliament) have around 60k documents.

We may want to profile this code to find out what's taking so long and to see if we can come with a solution for it to run faster before merging this PR.

I'd be happy to help you with that if needed, though

@lizgzil
Copy link
Copy Markdown
Contributor Author

lizgzil commented Oct 7, 2019

@SamDepardieu

I'd be happy to help you with that if needed, though
That'd be great! Yes, it's quite a huge increase in time :/

Comment thread policytool/refparse/refparse.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will stop the yielding if any of the rows does not have a text column

Comment thread policytool/refparse/refparse.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can add it now by document.get('uri', None)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sam added this as 'url' in his latest PR so using that instead of 'uri'

@nsorros
Copy link
Copy Markdown
Contributor

nsorros commented Oct 7, 2019

Also note that this change does not affect the Reach tool as you would need to change the airflow task exact match. You need to search in the full text index not only in reference section index. You should add this before we merge.

nsorros
nsorros previously requested changes Oct 7, 2019
Copy link
Copy Markdown
Contributor

@nsorros nsorros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to implement the change for the Ariflow DAG as well

@ivyleavedtoadflax
Copy link
Copy Markdown
Contributor

There is a package I used in a previous role which is significantly faster than regex, and could be used in ExactMatcher. I'll see if I can dig it out!

@ivyleavedtoadflax
Copy link
Copy Markdown
Contributor

Found it We had good results with this.

https://www.analyticsvidhya.com/blog/2017/11/flashtext-a-library-faster-than-regular-expressions/

https://github.com/vi3k6i5/flashtext

Comment thread policytool/refparse/refparse.py Outdated
Comment on lines 51 to 55
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hblanks could you check that these lines are performing as you think they should/are parts of it redundant? My first thought is that if document[section_column]: should make the try redundant (i.e. if it passes the if then it should also pass the try in this case)? And as @nsorros says below, if there is no 'sections' part of the dict for a document then this will stop all the rest of the documents being processed.

@lizgzil
Copy link
Copy Markdown
Contributor Author

lizgzil commented Oct 9, 2019

@SamDepardieu or @hblanks please can you check my commit changing the airflow stuff if you have time? Tests passed, but I'm not confident the arguments of exact_match_refs_operator.ExactMatchRefsOperator are correct.

@lizgzil
Copy link
Copy Markdown
Contributor Author

lizgzil commented Oct 9, 2019

also (@ivyleavedtoadflax @SamDepardieu ) I've given up on making the local running of refparse quicker in this PR since it won't effect the product performance anyway. I've created an issue #257, if someone wants to pick it up in the future.

Copy link
Copy Markdown
Contributor

@nsorros nsorros Oct 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

es_index is not a good name, all indices in elastic search are es_index. i suggest full_text_index or es_full_text index

@nsorros
Copy link
Copy Markdown
Contributor

nsorros commented Oct 10, 2019

@SamDepardieu or @hblanks please can you check my commit changing the airflow stuff if you have time? Tests passed, but I'm not confident the arguments of exact_match_refs_operator.ExactMatchRefsOperator are correct.

Did you run the DAG? Did the exact matcher found matches?

Test pass as I think there are no tests testing the exact matcher.

@SamDepardieu
Copy link
Copy Markdown
Contributor

@SamDepardieu or @hblanks please can you check my commit changing the airflow stuff if you have time? Tests passed, but I'm not confident the arguments of exact_match_refs_operator.ExactMatchRefsOperator are correct.

Did you run the DAG? Did the exact matcher found matches?

Test pass as I think there are no tests testing the exact matcher.

The best way to test the dag is to run the policy-test DAG. If you're able to docker-compose up -d Reach, you can access Airflow's interface on localhost:8080 and then just click the little "play" button on the right of the policy-test dag

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for simple tests like this it is probably preferable to use pytest which means you can do away with creating a test class, and define everything as functions. It's no biggie, especially because pytest understand unittest tests, but pytest is much more user friendly to write.

@lizgzil
Copy link
Copy Markdown
Contributor Author

lizgzil commented Oct 30, 2019

The policy test dag is broken (not even running) and I'm not sure why. The error in Airflow is Broken DAG: [/airflow/dags/policy.py] dictionary update sequence element #0 has length 18; 2 is required, but this doesnt give me enough info to debug!

@hblanks
Copy link
Copy Markdown
Contributor

hblanks commented Oct 31, 2019

@lizgzil - yes, the place for debugging dags is unfortunately in the airflow web (and maybe scheduler logs), using docker-compose logs airflow-web, sometimes with options. It's yucky.

Your branch also could stand to be rebased on top of the reach rename. Would you like it if I did this and after we debugged the DAG together?

@lizgzil
Copy link
Copy Markdown
Contributor Author

lizgzil commented Oct 31, 2019

Your branch also could stand to be rebased on top of the reach rename. Would you like it if I did this and after we debugged the DAG together?

@hblanks Yes please, thatd be great!

@lizgzil
Copy link
Copy Markdown
Contributor Author

lizgzil commented Oct 31, 2019

Screenshot 2019-10-31 at 16 33 51

The error log for an full text task is:

[2019-10-31 16:43:07,601] {logging_mixin.py:95} INFO - [�[34m2019-10-31 16:43:07,600�[0m] {�[34mbase.py:�[0m149} WARNING�[0m - �[1mPOST�[0m �[1mhttp://elasticsearch:9200:9200/policy-test-docs/_delete_by_query�[0m [status:�[1mN/A�[0m request:0.000s]�[0m
�[31mTraceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/urllib3/connection.py", line 160, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "/usr/local/lib/python3.6/site-packages/urllib3/util/connection.py", line 57, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/local/lib/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

Not sure if this is to do with my connection or not (after a quick google of what a socket.gaierror is?

Will try with my dns not being 8.8.8.8 when I get back off holiday, but if anyone else want to try running the test-dag in the mean time then that's be great :)

…ns, and add an optional argument in filemanager which allows you to decide which columns to include in the scraper output dictionary. This will mean we can include the text column needed to exact match on all the text, rather than just the sections
…t the name of the sections column is called, create brand new function 'transform_scraper_text_file' whcih specifically transforms for the text column in the scraper, adding new argument 'scraping_columns' to get_file which deals with allowing the user to query any column names theyd like
SamDepardieu and others added 4 commits November 6, 2019 12:24
 - The new connection method added the port twice to the host, leading the lib to try to connecto 'http://elasticsearch9200:9200. Changed the conenction method to split the port from the configuration variable.
…EPMCMetadata task, query correct column names in the exact match, save publications doi and pmid is exact match output, add logger information, fix keys for elasticsearch paths
@lizgzil
Copy link
Copy Markdown
Contributor Author

lizgzil commented Nov 8, 2019

Thanks to @SamDepardieu and @hblanks I've fixed my issues with Airflow.

I ran ./docker_exec.sh airflow test policy-test ExactMatchRefs.msf 2019-11-8 and got 8132 matches in 3 hours. As noted in the description this takes longer than really necessary since there are duplicates in the ES EPMC Metadata index. I'm currently running the whole policy-test dag, but I anticipate it'll take a long time! (note msf is the smallest data and it still took 3hours).

It'd be great to get this merged so we could see if the policy-test dag succeeds not using my computer, so a review of the code / suggestions on how to speed up performance / help fixing issue #183 would be very much appreciated :)

…c, where pdfs will be interpreted differently
@lizgzil lizgzil requested a review from jdu November 8, 2019 16:39
@hblanks
Copy link
Copy Markdown
Contributor

hblanks commented Nov 12, 2019

@lizgzil - thanks for keeping this going. A couple thoughts:

  1. Because we use the test DAG to quickly verify the pipeline, the test DAG has to complete within a fairly small amount of time.
  2. It doesn't sound like we can get the test DAG running quickly right away with the exact matcher in place.
  3. But, we do want to have the exact matcher available in the codebase, and maybe even able to run in a deployment environment.

So, my two cents would be that we create a separate dag, policy-test-exact-match, which we can use to try speeding up things as best we can. The first place is probably batching search queries to ElasticSearch, if that can be done -- or else running them in a thread pool.

@nsorros nsorros dismissed their stale review November 14, 2019 12:00

long time ago

@hblanks
Copy link
Copy Markdown
Contributor

hblanks commented Nov 14, 2019

OK! This branch is rebased to a new branch, fix-exact-matcher-rebase. After adding a small commit, the exact matcher test DAG Liz added runs in short order. So, I suggest we:

  1. Close this PR.
  2. Create a PR for the new branch.
  3. Maybe update dag.py so that it the main DAG doesn't run the exact matcher, but so that the test DAG does?

@lizgzil
Copy link
Copy Markdown
Contributor Author

lizgzil commented Nov 14, 2019

Closing this and opening a rebased version #285

@lizgzil lizgzil closed this Nov 14, 2019
@lizgzil lizgzil mentioned this pull request Nov 14, 2019
1 task
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

The exact matcher is only used in reference section texts

5 participants