
Parse notes fields #166

Merged
merged 25 commits into scrapd:master on Jul 17, 2019

Conversation

@mscarey (Contributor) commented Jul 4, 2019

Types of changes

  • Bug fix (non-breaking change which fixes an issue)

Description

Uses BeautifulSoup to collect the multiple HTML elements that the Notes field may be spread across, and then merges the text from all of them.
Also uses BeautifulSoup to collect the Deceased field.

Under the current approach of parsing the HTML with regex, the Notes field is truncated if it continues through multiple HTML elements, and the Deceased field is truncated if a number occurs in the middle, e.g. "Hispanic male, 19 years of age" is truncated to "Hispanic male, 19". BeautifulSoup makes it easier to navigate and parse the HTML elements.
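For illustration, a minimal sketch of the BeautifulSoup approach (not the PR's exact code; the helper name and HTML structure are assumptions):

from bs4 import BeautifulSoup

def collect_notes(html):
    """Merge the text of every element that follows the 'Notes' label."""
    soup = BeautifulSoup(html, 'html.parser')
    label = soup.find(string=lambda s: s and s.strip().startswith('Notes'))
    if label is None:
        return ''
    parts = []
    for element in label.parent.next_siblings:
        # NavigableString instances are plain text; Tags need get_text().
        text = str(element) if isinstance(element, str) else element.get_text()
        if text.strip():
            parts.append(text.strip())
    return ' '.join(parts)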

I added tests to verify that Notes fields start and end with the right substrings for all the bulletins cited in issues #73 and #81. However, https://austintexas.gov/news/traffic-fatality-15-4 is still affected by issue #77 (multiple fatalities in one bulletin).

To test #150, I changed the parse_page_content function so the Deceased field would be in its return value. Otherwise, there was no apparent way to test the value of that string. Unfortunately I had to add lines to several other tests to ignore the Deceased field in the return value of parse_page_content, similarly to what's already been done with the Notes field.

Checklist:

  • [ ] I have updated the documentation accordingly
  • [x] I have written unit tests

Fixes: #73
Fixes: #81
Fixes: #150

Changed the part of parse_deceased_field that removed some items and then used the last remaining item as the First Name. I wasn't able to verify that the issue (scrapd#74, recording a nickname as the first name) is fixed in the CLI.
Allows the parse_deceased_field to be tested by passing in strings, not split lists
To handle gender and ethnicity fields in a format like "W/F" (a sketch follows these commit notes)
These tags should be removed from the notes by BeautifulSoup, and Coveralls says the tests aren't touching this line.
Fixes the parsing bug where the Deceased field 'Hispanic male, 19 years of age' was truncated to 'Hispanic male, 19'.

Closes scrapd#150.
The Deceased field wasn't actually in the return value of any function, so in order to test its value, I had to move the line that deletes the Deceased field outside the parse_page_content function.
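As a minimal illustration of the "W/F" handling mentioned above (hypothetical helper; the abbreviation tables are assumptions):

import re

# Abbreviation tables are assumptions for illustration.
ETHNICITIES = {'W': 'White', 'H': 'Hispanic', 'B': 'Black', 'A': 'Asian'}
GENDERS = {'M': 'male', 'F': 'female'}

def parse_ethnicity_gender(token):
    """Expand a compact token like 'W/F' into ('White', 'female')."""
    match = re.fullmatch(r'([WHBA])/([MF])', token.strip().upper())
    if not match:
        return None, None
    return ETHNICITIES[match.group(1)], GENDERS[match.group(2)]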
@rgreinho (Member) left a comment

Great work Matt!

It definitely increased the reliability of the Notes parsing!

@@ -6,6 +6,7 @@
 from urllib.parse import urljoin
 
 import aiohttp
+from bs4 import BeautifulSoup, SoupStrainer, NavigableString

@rgreinho (Member) left a comment

Sorry, I had one small change to request regarding the imports.

@rgreinho (Member) commented Jul 5, 2019

I also implemented a fix for #150 in #172. It might cause some interference with your PR.

rgreinho previously approved these changes Jul 6, 2019
Adds another path through parse_deceased_field, where the deceased field is the full text of a <p> element, even if the element contains extraneous tags.
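As a minimal illustration of taking the full text of a <p> element (the HTML snippet is made up):

from bs4 import BeautifulSoup

html = '<p>Hispanic male, <strong>19</strong> years of age</p>'
# get_text() flattens the extraneous inline tag, so the field is no longer truncated at '19'.
deceased = BeautifulSoup(html, 'html.parser').p.get_text()
print(deceased)  # -> 'Hispanic male, 19 years of age'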
@mscarey (Contributor, Author) commented Jul 13, 2019

I don't exactly understand the error blocking this, but is it just a connection failure, unrelated to the pull request?

@rgreinho (Member)

There is a deeper issue; it does not seem to be a network problem.

If you run the integration tests locally, you can reproduce the problem:

$ nox -s test-integrations
nox > Running session test-integrations
nox > Creating virtualenv using python3.7 in .nox/test-integrations
nox > pip install -rrequirements-dev.txt
nox > pip install -e .
nox > pytest -x --junitxml=/tmp/pytest/junit-py37.xml --cov-report term-missing --cov-report html --cov=scrapd -m integrations --reruns 3 --reruns-delay 5 -r R /Users/remy/projects/scrapd/scrapd/tests
=============================================================== test session starts ================================================================
platform darwin -- Python 3.7.4, pytest-4.4.1, py-1.8.0, pluggy-0.12.0
rootdir: /Users/remy/projects/scrapd/scrapd, inifile: setup.cfg
plugins: socket-0.3.3, cov-2.6.1, forked-1.0.2, asyncio-0.10.0, bdd-3.1.0, xdist-1.28.0, mock-1.10.4, rerunfailures-7.0
gw0 [3] / gw1 [3] / gw2 [3] / gw3 [3]
.RR                                                                                                                                          [100%]R [100%]R [100%]R [100%]R [100%]FCoverage.py warning: No data was collected. (no-data-collected)
Future exception was never retrieved
future: <Future finished exception=ServerDisconnectedError(None)>
aiohttp.client_exceptions.ServerDisconnectedError: None

===================================================================== FAILURES =====================================================================
_____________________________________________ test_collect_information[json-Jan 15 2018-Jan 18 2018-1] _____________________________________________
[gw2] darwin -- Python 3.7.4 /Users/remy/projects/scrapd/scrapd/.nox/test-integrations/bin/python3.7

to_date = 'Jan 18 2018', format = 'json', from_date = 'Jan 15 2018', entry_count = 1
request = <FixtureRequest for <Function test_collect_information[json-Jan 15 2018-Jan 18 2018-1]>>

        example_converters={'entry_count': int},
>   )
    def test_collect_information():

tests/step_defs/test_retrieve.py:16:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.nox/test-integrations/lib/python3.7/site-packages/pytest_bdd/scenario.py:195: in _execute_scenario
    _execute_step_function(request, scenario, step, step_func)
.nox/test-integrations/lib/python3.7/site-packages/pytest_bdd/scenario.py:136: in _execute_step_function
    step_func(**kwargs)
tests/step_defs/test_retrieve.py:42: in ensure_results
    results, _ = event_loop.run_until_complete(apd.async_retrieve(pages=-1, **time_range))
/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/base_events.py:579: in run_until_complete
    return future.result()
scrapd/core/apd.py:926: in async_retrieve
    page_res = await asyncio.gather(*tasks)
.nox/test-integrations/lib/python3.7/site-packages/tenacity/_asyncio.py:43: in call
    do = self.iter(retry_state=retry_state)
.nox/test-integrations/lib/python3.7/site-packages/tenacity/__init__.py:332: in iter
    six.raise_from(retry_exc, fut.exception())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

value = None, from_value = TypeError('sequence item 0: expected str instance, ValueError found')

>   ???
E   tenacity.RetryError: RetryError[<Future at 0x1121b5950 state=finished raised TypeError>]

<string>:3: RetryError
-------------------------------------------------- generated xml file: /tmp/pytest/junit-py37.xml --------------------------------------------------

---------- coverage: platform darwin, python 3.7.4-final-0 -----------
Name                        Stmts   Miss  Cover   Missing
---------------------------------------------------------
scrapd/core/apd.py            379     53    86%   40-45, 111, 139, 162, 198, 217-255, 308, 312, 508, 510, 514, 516, 518-520, 573, 580, 587, 594, 601, 608, 691-692, 696-697, 869, 874, 908-909, 941, 943-944, 964
scrapd/core/constant.py        15      0   100%
scrapd/core/date_utils.py      32      3    91%   86, 101, 103
scrapd/core/formatter.py       51      7    86%   78, 87, 110-111, 124-125, 155
---------------------------------------------------------
TOTAL                         477     63    87%
Coverage HTML written to dir htmlcov

============================================================= rerun test summary info ==============================================================
RERUN tests/step_defs/test_retrieve.py::test_collect_information[json-Jan 15 2018-Jan 18 2018-1]
RERUN tests/step_defs/test_retrieve.py::test_collect_information[json-Jan 2018-Dec 2018-72]
RERUN tests/step_defs/test_retrieve.py::test_collect_information[json-Jan 15 2018-Jan 18 2018-1]
RERUN tests/step_defs/test_retrieve.py::test_collect_information[json-Jan 2018-Dec 2018-72]
RERUN tests/step_defs/test_retrieve.py::test_collect_information[json-Jan 15 2018-Jan 18 2018-1]
RERUN tests/step_defs/test_retrieve.py::test_collect_information[json-Jan 2018-Dec 2018-72]
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! xdist.dsession.Interrupted: stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
============================================= 1 failed, 1 passed, 4 warnings, 6 rerun in 62.70 seconds =============================================
nox > Command pytest -x --junitxml=/tmp/pytest/junit-py37.xml --cov-report term-missing --cov-report html --cov=scrapd -m integrations --reruns 3 --reruns-delay 5 -r R /Users/remy/projects/scrapd/scrapd/tests failed with exit code 2
nox > Session test-integrations failed.

@rgreinho (Member)

If you run this command from your PR branch, you can see the issue:

$ scrapd -v --from "Jan 2018" --to "Dec 2018" --format json

Fetching page 1...
Fetching page 2...
Fetching page 3...
Fetching page 4...
Fetching page 5...
Fetching page 6...
Fetching page 7...
Fetching page 8...
Fetching page 9...
Fetching page 10...
RetryError[<Future at 0x105969f90 state=finished raised TypeError>]
Traceback (most recent call last):

  File "/Users/remy/projects/scrapd/scrapd/venv/lib/python3.7/site-packages/tenacity/_asyncio.py", line 46, in call
    result = yield from fn(*args, **kwargs)
                        │   │       └ {}
                        │   └ (<aiohttp.client.ClientSession object at 0x104799290>, 'http://austintexas.gov/news/traffic-fatality-55-3')
                        └ <function fetch_and_parse at 0x10464bef0>

  File "/Users/remy/projects/scrapd/scrapd/scrapd/core/apd.py", line 872, in fetch_and_parse
    d = parse_page(page, url)
        │          │     └ 'http://austintexas.gov/news/traffic-fatality-55-3'
        │          └ '<!doctype html>\n<!--[if lt IE 7 ]><html class="ie ie6" lang="en"> <![endif]-->\n<!--[if IE 7 ]><html class="ie ie7" lang="en">...
        └ <function parse_page at 0x10464bdd0>

  File "/Users/remy/projects/scrapd/scrapd/scrapd/core/apd.py", line 842, in parse_page
    logger.debug(f'Fatality report {url} was not parsed correctly:\n\t * ' + '\n\t * '.join(err))
    │      │                                                                                └ [ValueError('cannot parse Deceased: Unidentified Hispanic male'), 'age is invalid: None']
    │      └ <bound method Logger._make_log_function.<locals>.log_function of <loguru._logger.Logger object at 0x102c2d450>>
    └ <loguru._logger.Logger object at 0x102c2d450>

TypeError: sequence item 0: expected str instance, ValueError found


The above exception was the direct cause of the following exception:


Traceback (most recent call last):

  File "/Users/remy/projects/scrapd/scrapd/venv/bin/scrapd", line 10, in <module>
    sys.exit(cli())
    │   │    └ <click.core.Command object at 0x1048e70d0>
    │   └ <built-in function exit>
    └ <module 'sys' (built-in)>
  File "/Users/remy/projects/scrapd/scrapd/venv/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
           │    │     │       └ {}
           │    │     └ ()
           │    └ <bound method BaseCommand.main of <click.core.Command object at 0x1048e70d0>>
           └ <click.core.Command object at 0x1048e70d0>
  File "/Users/remy/projects/scrapd/scrapd/venv/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
         │    │      └ <click.core.Context object at 0x102539550>
         │    └ <bound method Command.invoke of <click.core.Command object at 0x1048e70d0>>
         └ <click.core.Command object at 0x1048e70d0>
  File "/Users/remy/projects/scrapd/scrapd/venv/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           │   │      │    │           │   └ {'verbose': 1, 'from_': 'Jan 2018', 'to': 'Dec 2018', 'format_': 'json', 'attempts': 3, 'backoff': 3, 'pages': -1}
           │   │      │    │           └ <click.core.Context object at 0x102539550>
           │   │      │    └ <function cli at 0x10468eb90>
           │   │      └ <click.core.Command object at 0x1048e70d0>
           │   └ <bound method Context.invoke of <click.core.Context object at 0x102539550>>
           └ <click.core.Context object at 0x102539550>
  File "/Users/remy/projects/scrapd/scrapd/venv/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
           │         │       └ {'verbose': 1, 'from_': 'Jan 2018', 'to': 'Dec 2018', 'format_': 'json', 'attempts': 3, 'backoff': 3, 'pages': -1}
           │         └ ()
           └ <function cli at 0x10468eb90>
  File "/Users/remy/projects/scrapd/scrapd/venv/lib/python3.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
           │ │                       │       └ {'verbose': 1, 'from_': 'Jan 2018', 'to': 'Dec 2018', 'format_': 'json', 'attempts': 3, 'backoff': 3, 'pages': -1}
           │ │                       └ ()
           │ └ <function get_current_context at 0x102c0fb90>
           └ <function cli at 0x10468eb00>

  File "/Users/remy/projects/scrapd/scrapd/scrapd/cli/cli.py", line 69, in cli
    command.execute()
    │       └ <bound method AbstractCommand.execute of <scrapd.cli.cli.Retrieve object at 0x10478f690>>
    └ <scrapd.cli.cli.Retrieve object at 0x10478f690>

> File "/Users/remy/projects/scrapd/scrapd/scrapd/cli/base.py", line 33, in execute
    sys.exit(self._execute())
    │   │    │    └ <bound method Retrieve._execute of <scrapd.cli.cli.Retrieve object at 0x10478f690>>
    │   │    └ <scrapd.cli.cli.Retrieve object at 0x10478f690>
    │   └ <built-in function exit>
    └ <module 'sys' (built-in)>

  File "/Users/remy/projects/scrapd/scrapd/scrapd/cli/cli.py", line 84, in _execute
    self.args['backoff'],
    │    └ {'verbose': 1, 'from_': 'Jan 2018', 'to': 'Dec 2018', 'format_': 'json', 'attempts': 3, 'backoff': 3, 'pages': -1}
    └ <scrapd.cli.cli.Retrieve object at 0x10478f690>

  File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/runners.py", line 43, in run
    return loop.run_until_complete(main)
           │    │                  └ <coroutine object async_retrieve at 0x102478e60>
           │    └ <bound method BaseEventLoop.run_until_complete of <_UnixSelectorEventLoop running=False closed=True debug=False>>
           └ <_UnixSelectorEventLoop running=False closed=True debug=False>

  File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/base_events.py", line 579, in run_until_complete
    return future.result()
           │      └ <built-in method result of _asyncio.Task object at 0x10495d9f0>
           └ <Task finished coro=<async_retrieve() done, defined at /Users/remy/projects/scrapd/scrapd/scrapd/core/apd.py:883> exception=Retr...

  File "/Users/remy/projects/scrapd/scrapd/scrapd/core/apd.py", line 926, in async_retrieve
    page_res = await asyncio.gather(*tasks)
    │                │       │       └ [<generator object AsyncRetrying.call at 0x1059e1950>, <generator object AsyncRetrying.call at 0x1059e1f50>, <generator object A...
    │                │       └ <function gather at 0x102b239e0>
    │                └ <module 'asyncio' from '/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/__init__.p...
    └ [{'Case': '18-3060392', 'Fatal crashes this year': '62', 'Date': datetime.date(2018, 11, 2), 'Time': datetime.time(7, 22), 'Loca...

  File "/Users/remy/projects/scrapd/scrapd/venv/lib/python3.7/site-packages/tenacity/_asyncio.py", line 43, in call
    do = self.iter(retry_state=retry_state)
    │    │    │    │           └ <tenacity.RetryCallState object at 0x1058e94d0>
    │    │    │    └ <tenacity.RetryCallState object at 0x1058e94d0>
    │    │    └ <bound method BaseRetrying.iter of <AsyncRetrying object at 0x1059a9690 (stop=<tenacity.stop.stop_after_attempt object at 0x1059...
    │    └ <AsyncRetrying object at 0x1059a9690 (stop=<tenacity.stop.stop_after_attempt object at 0x1059a9a10>, wait=<tenacity.wait.wait_ex...
    └ <tenacity.DoAttempt object at 0x1058fb290>
  File "/Users/remy/projects/scrapd/scrapd/venv/lib/python3.7/site-packages/tenacity/__init__.py", line 332, in iter
    six.raise_from(retry_exc, fut.exception())
    │   │          │          │   └ <bound method Future.exception of <Future at 0x105969f90 state=finished raised TypeError>>
    │   │          │          └ <Future at 0x105969f90 state=finished raised TypeError>
    │   │          └ RetryError(<Future at 0x105969f90 state=finished raised TypeError>)
    │   └ <function raise_from at 0x10391ce60>
    └ <module 'six' from '/Users/remy/projects/scrapd/scrapd/venv/lib/python3.7/site-packages/six.py'>
  File "<string>", line 3, in raise_from

tenacity.RetryError: RetryError[<Future at 0x105969f90 state=finished raised TypeError>]

It fails with a type error. My guess is that you messed up updating your branch at some point. We updated the internal structure to use datetime.date() objects for the dates and datetime.time() objects for the times.

The JSON formatter was also updated to reflect these changes.

These two changes may be good pointers to start digging.
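A minimal sketch of that convention (field names match the output shown later; the rendering formats are assumptions):

import datetime

record = {
    'Date': datetime.date(2018, 11, 2),  # stored internally as a date object
    'Time': datetime.time(7, 22),        # stored internally as a time object
}

# The formatter renders the objects only at output time.
formatted = {
    'Date': record['Date'].strftime('%m/%d/%Y'),  # -> '11/02/2018'
    'Time': record['Time'].strftime('%I:%M %p'),  # -> '07:22 AM'
}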

@rgreinho (Member)

You can also run pytest manually to make the output more verbose:

$ pytest -s -x -vvv -n0 -m integrations tests/
=============================================================== test session starts ================================================================
platform darwin -- Python 3.7.4, pytest-4.4.1, py-1.8.0, pluggy-0.12.0 -- /Users/remy/projects/scrapd/scrapd/venv/bin/python3.7
cachedir: .pytest_cache
rootdir: /Users/remy/projects/scrapd/scrapd, inifile: setup.cfg
plugins: socket-0.3.3, cov-2.6.1, forked-1.0.2, asyncio-0.10.0, bdd-3.1.0, xdist-1.28.0, mock-1.10.4, rerunfailures-7.0
collected 136 items / 133 deselected / 3 selected

tests/step_defs/test_retrieve.py::test_collect_information[csv-Jan 15 2019-Jan 18 2019-2] Fatal crashes this year,Case,Date,Time,Location,First Name,Last Name,Ethnicity,Gender,DOB,Age,Link,Notes
1,19-0150158,01/15/2019,06:20 AM,10500 block of N IH 35 SB,David,Sell,White,male,07/09/1987,31,http://austintexas.gov/news/traffic-fatality-1-4,The preliminary investigation shows that a 2000 Peterbilt semi truck was travelling southbound in the center lane on IH 35 when it struck pedestrian David Sell. The driver stopped as soon as it was possible to do so and remained on scene. He reported not seeing the pedestrian prior to impact given that it was still dark at the time of the crash. Sell was pronounced deceased at the scene at 6:24 a.m. No charges are expected to be filed.
2,19-0161105,01/16/2019,03:42 PM,West William Cannon Drive and Ridge Oak Road,Ann,Bottenfield-Seago,White,female,02/15/1960,58,http://austintexas.gov/news/traffic-fatality-2-3,"The preliminary investigation shows that the grey, 2003 Volkwagen Jetta being driven by Ann Bottenfield-Seago failed to yield at a stop sign while attempting to turn westbound on to West William Cannon Drive from Ridge Oak Road. The Jetta collided with a black, 2017 Chevrolet truck that was eastbound in the inside lane of West William Cannon Drive. Bottenfield-Seago was pronounced deceased at the scene. The passenger in the Jetta and the driver of the truck were both transported to a local hospital with non-life threatening injuries. No charges are expected to be filed."
PASSED
tests/step_defs/test_retrieve.py::test_collect_information[json-Jan 2018-Dec 2018-72] FAILED

===================================================================== FAILURES =====================================================================
_______________________________________________ test_collect_information[json-Jan 2018-Dec 2018-72] ________________________________________________

self = <AsyncRetrying object at 0x1110514d0 (stop=<tenacity.stop.stop_after_attempt object at 0x11043e9d0>, wait=<tenacity.wa...bject at 0x10eebd4d0>, before=<function before_nothing at 0x10eeb47a0>, after=<function after_nothing at 0x10eebc3b0>)>
fn = <function fetch_and_parse at 0x10fb12d40>
args = (<aiohttp.client.ClientSession object at 0x110f06950>, 'http://austintexas.gov/news/traffic-fatality-55-3'), kwargs = {}
retry_state = <tenacity.RetryCallState object at 0x110fab1d0>, do = <tenacity.DoAttempt object at 0x110dd1810>

    @asyncio.coroutine
    def call(self, fn, *args, **kwargs):
        self.begin(fn)

        retry_state = RetryCallState(
            retry_object=self, fn=fn, args=args, kwargs=kwargs)
        while True:
            do = self.iter(retry_state=retry_state)
            if isinstance(do, DoAttempt):
                try:
>                   result = yield from fn(*args, **kwargs)

venv/lib/python3.7/site-packages/tenacity/_asyncio.py:46:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

session = <aiohttp.client.ClientSession object at 0x110f06950>, url = 'http://austintexas.gov/news/traffic-fatality-55-3'

    @retry()
    async def fetch_and_parse(session, url):
        """
        Parse a fatality page from a URL.

        :param aiohttp.ClientSession session: aiohttp session
        :param str url: detail page URL
        :return: a dictionary representing a fatality.
        :rtype: dict
        """
        # Retrieve the page.
        page = await fetch_detail_page(session, url)
        if not page:
            raise ValueError(f'The URL {url} returned a 0-length content.')

        # Parse it.
>       d = parse_page(page, url)

scrapd/core/apd.py:872:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

page = '<!doctype html>\n<!--[if lt IE 7 ]><html class="ie ie6" lang="en"> <![endif]-->\n<!--[if IE 7 ]><html class="ie ie7" ...v><!-- /footer -->  <div id="disable_messages-debug-div" style="display:none;"><pre>NULL</pre></div></body>\n</html>\n'
url = 'http://austintexas.gov/news/traffic-fatality-55-3'

    def parse_page(page, url):
        """
        Parse the page using all parsing methods available.

        :param str  page: the content of the fatality page
        :param str url: detail page URL
        :return: a dictionary representing a fatality.
        :rtype: dict
        """
        # Parse the page.
        twitter_d = parse_twitter_fields(page)
        page_d, err = parse_page_content(page, bool(twitter_d.get(Fields.NOTES)))
        if err:
>           logger.debug(f'Fatality report {url} was not parsed correctly:\n\t * ' + '\n\t * '.join(err))
E           TypeError: sequence item 0: expected str instance, ValueError found

scrapd/core/apd.py:842: TypeError

The above exception was the direct cause of the following exception:

to_date = 'Dec 2018', entry_count = 72, format = 'json', from_date = 'Jan 2018'
request = <FixtureRequest for <Function test_collect_information[json-Jan 2018-Dec 2018-72]>>

        example_converters={'entry_count': int},
>   )
    def test_collect_information():

tests/step_defs/test_retrieve.py:16:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
venv/lib/python3.7/site-packages/pytest_bdd/scenario.py:195: in _execute_scenario
    _execute_step_function(request, scenario, step, step_func)
venv/lib/python3.7/site-packages/pytest_bdd/scenario.py:136: in _execute_step_function
    step_func(**kwargs)
tests/step_defs/test_retrieve.py:42: in ensure_results
    results, _ = event_loop.run_until_complete(apd.async_retrieve(pages=-1, **time_range))
/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/base_events.py:579: in run_until_complete
    return future.result()
scrapd/core/apd.py:926: in async_retrieve
    page_res = await asyncio.gather(*tasks)
venv/lib/python3.7/site-packages/tenacity/_asyncio.py:43: in call
    do = self.iter(retry_state=retry_state)
venv/lib/python3.7/site-packages/tenacity/__init__.py:332: in iter
    six.raise_from(retry_exc, fut.exception())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

value = None, from_value = TypeError('sequence item 0: expected str instance, ValueError found')

>   ???
E   tenacity.RetryError: RetryError[<Future at 0x11107e650 state=finished raised TypeError>]

<string>:3: RetryError
========================================= 1 failed, 1 passed, 133 deselected, 1 warnings in 14.00 seconds ==========================================

There is a TypeError: something expects a string but receives a ValueError instead. Maybe an exception is not handled correctly.

@rgreinho (Member)

Here is your issue:

File "/Users/remy/projects/scrapd/scrapd/scrapd/core/apd.py", line 842, in parse_page
    logger.debug(f'Fatality report {url} was not parsed correctly:\n\t * ' + '\n\t * '.join(err))
    │      │                                                                                └ [ValueError('cannot parse Deceased: Unidentified Hispanic male'), 'age is invalid: None']
    │      └ <bound method Logger._make_log_function.<locals>.log_function of <loguru._logger.Logger object at 0x102c2d450>>
    └ <loguru._logger.Logger object at 0x102c2d450>

TypeError: sequence item 0: expected str instance, ValueError found

You store an exception in the error list, where only strings (i.e. error messages) are expected.
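A minimal sketch of the fix (the stub stands in for the real parse_deceased_field):

def parse_deceased_field(text):
    # Stand-in for the real parser: it raises when the field cannot be parsed.
    raise ValueError(f'cannot parse Deceased: {text}')

errors = []
try:
    parse_deceased_field('Unidentified Hispanic male')
except ValueError as exc:
    errors.append(str(exc))  # append the message, not the exception object
errors.append('age is invalid: None')

print('\n\t * '.join(errors))  # joins cleanly now that every item is a str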

Happy fixing 😉

@mscarey (Contributor, Author) commented Jul 13, 2019

Thanks. I couldn't figure out that error; my best guess was that it was failing to get some expected text from an HTTP request.

@rgreinho (Member) left a comment

I regenerated the data sets with your branch. It looks really good! Great work tackling this problem!

Two questions regarding the new output:

{
     "Age": 62,
     "Case": "17-3460912",
     "DOB": "01/22/1955",
     "Date": "12/12/2017",
+    "Deceased": "Robert Lance Trewitt, White male (D.O.B. 1-22-55)",
     "Ethnicity": "White",
     "Fatal crashes this year": "70",
     "First Name": "Robert",
@@ -26,14 +28,15 @@
     "Last Name": "Trewitt",
     "Link": "http://austintexas.gov/news/traffic-fatality-70-0",
     "Location": "8400 block of Research Blvd. Southbound",
-    "Notes": "The preliminary investigation indicates that a 2002, blue, Chevrolet truck was traveling southbound in the 8400 block of Research Blvd. in the inside lane when it swerved into the right lane to avoid a collision. When the Chevrolet entered the right lane, it struck a 1998 BMW motorcycle from behind, knocking the driver off the bike and over a retaining wall. The driver fell to the frontage road where he sustained serious injuries. The driver of the motorcycle was transported to Dell Seton Medical Center at the University of Texas where he died as a result of his injuries on Wednesday, December 13, 2017.",
-    "Time": "2:13 p.m."
+    "Notes": "The preliminary investigation indicates that a 2002, blue, Chevrolet truck was traveling southbound in the 8400 block of Research Blvd. in the inside lane when it swerved into the right lane to avoid a collision. When the Chevrolet entered the right lane, it struck a 1998 BMW motorcycle from behind, knocking the driver off the bike and over a retaining wall. The driver fell to the frontage road where he sustained serious injuries. The driver of the motorcycle was transported to Dell Seton Medical Center at the University of Texas where he died as a result of his injuries on Wednesday, December 13, 2017.\n\tThis case is still being investigated.",
+    "Time": "02:13 PM"
   },
  1. The Deceased field appears in the output; is there a way we can remove it? (Not a blocker, but it would definitely be cleaner without temporary fields in the final output.)
  2. The notes definitely look better, but they contain a lot of \n or \t characters. Is there a way to interpret them instead of displaying them? Erf, scratch that 🤦‍♂

('traffic-fatality-4-6', 'White female, DOB 12/31/1960'),
('traffic-fatality-20-4', 'Hispanic male, 19 years of age'),
('traffic-fatality-25-4', ', Hispanic male, D.O.B. 6-9-70'),
('traffic-fatality-73-2', 'White male, DOB 02/09/80'),
@rgreinho (Member)

Can you add your tests using the same structure as before? I created the fixture to load the page initially to test the full flow, but I quickly regretted it and I plan to remove it in the future.

I also think it was more readable before, and we had more test cases.

@mscarey (Contributor, Author)

The old style of tests wasn't really adequate, though. Remember, we had issue #92 (Deceased fields that can't be parsed because they contain no DOB), and I got the old-style test passing for https://austintexas.gov/news/traffic-fatality-20-4 in #125. But then you had to open #150 because the same bulletin still wasn't being scraped correctly, so I added these tests that check the parsing of the whole page. Also, I think BeautifulSoup depends on having a valid HTML tree to select text from.

@mscarey (Contributor, Author) commented Jul 15, 2019

@rgreinho are you sure the Deceased field is showing up in the final output? I just ran the test command scrapd --from "Jan 15 2019" --to "Jan 18 2019" --format json and the Deceased field isn't in the output. I do have del page_d['Deceased'] on line 847 of apd.py.

@rgreinho (Member)

Run this command from your branch:

$ scrapd --from "12/12/2017" --to "12/12/2017"

[
  {
    "Age": 62,
    "Case": "17-3460912",
    "DOB": "01/22/1955",
    "Date": "12/12/2017",
    "Deceased": "Robert Lance Trewitt, White male (D.O.B. 1-22-55)",
    "Ethnicity": "White",
    "Fatal crashes this year": "70",
    "First Name": "Robert",
    "Gender": "male",
    "Last Name": "Trewitt",
    "Link": "http://austintexas.gov/news/traffic-fatality-70-0",
    "Location": "8400 block of Research Blvd. Southbound",
    "Notes": "The preliminary investigation indicates that a 2002, blue, Chevrolet truck was traveling southbound in the 8400 block of Research Blvd. in the inside lane when it swerved into the right lane to avoid a collision. When the Chevrolet entered the right lane, it struck a 1998 BMW motorcycle from behind, knocking the driver off the bike and over a retaining wall. The driver fell to the frontage road where he sustained serious injuries. The driver of the motorcycle was transported to Dell Seton Medical Center at the University of Texas where he died as a result of his injuries on Wednesday, December 13, 2017.\n\tThis case is still being investigated.",
    "Time": "02:13 PM"
  }
]

You can see the Deceased field.

@mscarey (Contributor, Author) commented Jul 16, 2019

I think this fixes it... I never looked very closely at parse_twitter_description, but I think the Deceased field wasn't being deleted if the Twitter description had been parsed. And then I had the test ignoring Deceased in the return value of parse_page, when it should only have been ignored for parse_page_content.
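A minimal sketch of the idea (the merge structure is an assumption; only the field name comes from this PR):

def merge_parsed(twitter_d, page_d):
    """Combine the Twitter-description and page-content results."""
    result = {**twitter_d, **page_d}
    # Drop the temporary field regardless of which parse path produced it.
    result.pop('Deceased', None)
    return result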

@rgreinho (Member)

Nice job @mscarey! 👍

I'll use this branch to regenerate the data sets tonight and I'll keep you posted!

@rgreinho (Member) left a comment

It looks great, @mscarey! Very good job on these issues!

@mergify mergify bot merged commit b842e6b into scrapd:master Jul 17, 2019

Successfully merging this pull request may close these issues:

  • Incorrect age/DOB parsing
  • Only the last paragraph of the "Notes" is being parsed
  • Notes are not parsed