
Parse notes fields #166

Merged
merged 25 commits into scrapd:master on Jul 17, 2019

Conversation

@mscarey (Contributor) commented Jul 4, 2019

Types of changes

  • Bug fix (non-breaking change which fixes an issue)

Description

Uses BeautifulSoup to collect the multiple HTML elements that the Notes field may be spread across, and then merges the text from all of them.
Also uses BeautifulSoup to collect the Deceased field.

Under the current approach of parsing the HTML with regex, the Notes field is truncated if it continues through multiple HTML elements, and the Deceased field is truncated if a number occurs in the middle, e.g. "Hispanic male, 19 years of age" is truncated to "Hispanic male, 19". BeautifulSoup makes it easier to navigate and parse the HTML elements.
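For illustration, a minimal sketch of the BeautifulSoup approach (not the PR's exact code; the helper name and HTML structure are assumptions):

from bs4 import BeautifulSoup

def collect_notes(html):
    """Merge the text of every element that follows the 'Notes' label."""
    soup = BeautifulSoup(html, 'html.parser')
    label = soup.find(string=lambda s: s and s.strip().startswith('Notes'))
    if label is None:
        return ''
    parts = []
    for element in label.parent.next_siblings:
        # NavigableString instances are plain text; Tags need get_text().
        text = str(element) if isinstance(element, str) else element.get_text()
        if text.strip():
            parts.append(text.strip())
    return ' '.join(parts)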

I added tests to verify that Notes fields start and end with the right substrings for all the bulletins cited in issues #73 and #81. However, https://austintexas.gov/news/traffic-fatality-15-4 is still affected by issue #77 (multiple fatalities in one bulletin).

To test #150, I changed the parse_page_content function so the Deceased field would be in its return value. Otherwise, there was no apparent way to test the value of that string. Unfortunately I had to add lines to several other tests to ignore the Deceased field in the return value of parse_page_content, similarly to what's already been done with the Notes field.

Checklist:

  • [ ] I have updated the documentation accordingly
  • [x] I have written unit tests

Fixes: #73
Fixes: #81
Fixes: #150

Changed the part of parse_deceased_field that removed some items and then used the last remaining item as the First Name. I wasn't able to verify that the issue (scrapd#74, recording a nickname as the first name) is fixed in the CLI.
Allows the parse_deceased_field to be tested by passing in strings, not split lists
To handle gender and ethnicity fields in a format like "W/F" (a sketch follows these commit notes)
These tags should be removed from the notes by BeautifulSoup, and Coveralls says the tests aren't touching this line.
Fixes the parsing bug where the Deceased field 'Hispanic male, 19 years of age' was truncated to 'Hispanic male, 19'.

Closes scrapd#150.
The Deceased field wasn't actually in the return value of any function, so in order to test its value, I had to move the line that deletes the Deceased field outside the parse_page_content function.
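As a minimal illustration of the "W/F" handling mentioned above (hypothetical helper; the abbreviation tables are assumptions):

import re

# Abbreviation tables are assumptions for illustration.
ETHNICITIES = {'W': 'White', 'H': 'Hispanic', 'B': 'Black', 'A': 'Asian'}
GENDERS = {'M': 'male', 'F': 'female'}

def parse_ethnicity_gender(token):
    """Expand a compact token like 'W/F' into ('White', 'female')."""
    match = re.fullmatch(r'([WHBA])/([MF])', token.strip().upper())
    if not match:
        return None, None
    return ETHNICITIES[match.group(1)], GENDERS[match.group(2)]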
@rgreinho (Member) left a comment

Great work Matt!

It definitely increased the reliability of the Notes parsing!

@@ -6,6 +6,7 @@
 from urllib.parse import urljoin
 
 import aiohttp
+from bs4 import BeautifulSoup, SoupStrainer, NavigableString

@rgreinho (Member) left a comment

Sorry, I had one small change to request regarding the imports.

@rgreinho (Member) commented Jul 5, 2019

I also implemented a fix for #150 in #172. It might cause some interference with your PR.

rgreinho previously approved these changes Jul 6, 2019
Adds another path through parse_deceased_field, where the deceased field is the full text of a <p> element, even if the element contains extraneous tags.
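As a minimal illustration of taking the full text of a <p> element (the HTML snippet is made up):

from bs4 import BeautifulSoup

html = '<p>Hispanic male, <strong>19</strong> years of age</p>'
# get_text() flattens the extraneous inline tag, so the field is no longer truncated at '19'.
deceased = BeautifulSoup(html, 'html.parser').p.get_text()
print(deceased)  # -> 'Hispanic male, 19 years of age'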
@mscarey (Contributor, Author) commented Jul 13, 2019

I don't exactly understand the error blocking this, but is it just a connection failure, unrelated to the pull request?

@rgreinho (Member)

There is a deeper issue; it does not seem to be a network problem.

If you run the integration tests locally, you can reproduce the problem:

$ nox -s test-integrations
nox > Running session test-integrations
nox > Creating virtualenv using python3.7 in .nox/test-integrations
nox > pip install -rrequirements-dev.txt
nox > pip install -e .
nox > pytest -x --junitxml=/tmp/pytest/junit-py37.xml --cov-report term-missing --cov-report html --cov=scrapd -m integrations --reruns 3 --reruns-delay 5 -r R /Users/remy/projects/scrapd/scrapd/tests
=============================================================== test session starts ================================================================
platform darwin -- Python 3.7.4, pytest-4.4.1, py-1.8.0, pluggy-0.12.0
rootdir: /Users/remy/projects/scrapd/scrapd, inifile: setup.cfg
plugins: socket-0.3.3, cov-2.6.1, forked-1.0.2, asyncio-0.10.0, bdd-3.1.0, xdist-1.28.0, mock-1.10.4, rerunfailures-7.0
gw0 [3] / gw1 [3] / gw2 [3] / gw3 [3]
.RR                                                                                                                                          [100%]R [100%]R [100%]R [100%]R [100%]FCoverage.py warning: No data was collected. (no-data-collected)
Future exception was never retrieved
future: <Future finished exception=ServerDisconnectedError(None)>
aiohttp.client_exceptions.ServerDisconnectedError: None

===================================================================== FAILURES =====================================================================
_____________________________________________ test_collect_information[json-Jan 15 2018-Jan 18 2018-1] _____________________________________________
[gw2] darwin -- Python 3.7.4 /Users/remy/projects/scrapd/scrapd/.nox/test-integrations/bin/python3.7

to_date = 'Jan 18 2018', format = 'json', from_date = 'Jan 15 2018', entry_count = 1
request = <FixtureRequest for <Function test_collect_information[json-Jan 15 2018-Jan 18 2018-1]>>

        example_converters={'entry_count': int},
>   )
    def test_collect_information():

tests/step_defs/test_retrieve.py:16:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.nox/test-integrations/lib/python3.7/site-packages/pytest_bdd/scenario.py:195: in _execute_scenario
    _execute_step_function(request, scenario, step, step_func)
.nox/test-integrations/lib/python3.7/site-packages/pytest_bdd/scenario.py:136: in _execute_step_function
    step_func(**kwargs)
tests/step_defs/test_retrieve.py:42: in ensure_results
    results, _ = event_loop.run_until_complete(apd.async_retrieve(pages=-1, **time_range))
/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/base_events.py:579: in run_until_complete
    return future.result()
scrapd/core/apd.py:926: in async_retrieve
    page_res = await asyncio.gather(*tasks)
.nox/test-integrations/lib/python3.7/site-packages/tenacity/_asyncio.py:43: in call
    do = self.iter(retry_state=retry_state)
.nox/test-integrations/lib/python3.7/site-packages/tenacity/__init__.py:332: in iter
    six.raise_from(retry_exc, fut.exception())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

value = None, from_value = TypeError('sequence item 0: expected str instance, ValueError found')

>   ???
E   tenacity.RetryError: RetryError[<Future at 0x1121b5950 state=finished raised TypeError>]

<string>:3: RetryError
-------------------------------------------------- generated xml file: /tmp/pytest/junit-py37.xml --------------------------------------------------

---------- coverage: platform darwin, python 3.7.4-final-0 -----------
Name                        Stmts   Miss  Cover   Missing
---------------------------------------------------------
scrapd/core/apd.py            379     53    86%   40-45, 111, 139, 162, 198, 217-255, 308, 312, 508, 510, 514, 516, 518-520, 573, 580, 587, 594, 601, 608, 691-692, 696-697, 869, 874, 908-909, 941, 943-944, 964
scrapd/core/constant.py        15      0   100%
scrapd/core/date_utils.py      32      3    91%   86, 101, 103
scrapd/core/formatter.py       51      7    86%   78, 87, 110-111, 124-125, 155
---------------------------------------------------------
TOTAL                         477     63    87%
Coverage HTML written to dir htmlcov

============================================================= rerun test summary info ==============================================================
RERUN tests/step_defs/test_retrieve.py::test_collect_information[json-Jan 15 2018-Jan 18 2018-1]
RERUN tests/step_defs/test_retrieve.py::test_collect_information[json-Jan 2018-Dec 2018-72]
RERUN tests/step_defs/test_retrieve.py::test_collect_information[json-Jan 15 2018-Jan 18 2018-1]
RERUN tests/step_defs/test_retrieve.py::test_collect_information[json-Jan 2018-Dec 2018-72]
RERUN tests/step_defs/test_retrieve.py::test_collect_information[json-Jan 15 2018-Jan 18 2018-1]
RERUN tests/step_defs/test_retrieve.py::test_collect_information[json-Jan 2018-Dec 2018-72]
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! xdist.dsession.Interrupted: stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
============================================= 1 failed, 1 passed, 4 warnings, 6 rerun in 62.70 seconds =============================================
nox > Command pytest -x --junitxml=/tmp/pytest/junit-py37.xml --cov-report term-missing --cov-report html --cov=scrapd -m integrations --reruns 3 --reruns-delay 5 -r R /Users/remy/projects/scrapd/scrapd/tests failed with exit code 2
nox > Session test-integrations failed.

@rgreinho (Member)

If you run this command from your PR branch, you can see the issue:

$ scrapd -v --from "Jan 2018" --to "Dec 2018" --format json

Fetching page 1...
Fetching page 2...
Fetching page 3...
Fetching page 4...
Fetching page 5...
Fetching page 6...
Fetching page 7...
Fetching page 8...
Fetching page 9...
Fetching page 10...
RetryError[<Future at 0x105969f90 state=finished raised TypeError>]
Traceback (most recent call last):

  File "/Users/remy/projects/scrapd/scrapd/venv/lib/python3.7/site-packages/tenacity/_asyncio.py", line 46, in call
    result = yield from fn(*args, **kwargs)
                        │   │       └ {}
                        │   └ (<aiohttp.client.ClientSession object at 0x104799290>, 'http://austintexas.gov/news/traffic-fatality-55-3')
                        └ <function fetch_and_parse at 0x10464bef0>

  File "/Users/remy/projects/scrapd/scrapd/scrapd/core/apd.py", line 872, in fetch_and_parse
    d = parse_page(page, url)
        │          │     └ 'http://austintexas.gov/news/traffic-fatality-55-3'
        │          └ '<!doctype html>\n<!--[if lt IE 7 ]><html class="ie ie6" lang="en"> <![endif]-->\n<!--[if IE 7 ]><html class="ie ie7" lang="en">...
        └ <function parse_page at 0x10464bdd0>

  File "/Users/remy/projects/scrapd/scrapd/scrapd/core/apd.py", line 842, in parse_page
    logger.debug(f'Fatality report {url} was not parsed correctly:\n\t * ' + '\n\t * '.join(err))
    │      │                                                                                └ [ValueError('cannot parse Deceased: Unidentified Hispanic male'), 'age is invalid: None']
    │      └ <bound method Logger._make_log_function.<locals>.log_function of <loguru._logger.Logger object at 0x102c2d450>>
    └ <loguru._logger.Logger object at 0x102c2d450>

TypeError: sequence item 0: expected str instance, ValueError found


The above exception was the direct cause of the following exception:


Traceback (most recent call last):

  File "/Users/remy/projects/scrapd/scrapd/venv/bin/scrapd", line 10, in <module>
    sys.exit(cli())
    │   │    └ <click.core.Command object at 0x1048e70d0>
    │   └ <built-in function exit>
    └ <module 'sys' (built-in)>
  File "/Users/remy/projects/scrapd/scrapd/venv/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
           │    │     │       └ {}
           │    │     └ ()
           │    └ <bound method BaseCommand.main of <click.core.Command object at 0x1048e70d0>>
           └ <click.core.Command object at 0x1048e70d0>
  File "/Users/remy/projects/scrapd/scrapd/venv/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
         │    │      └ <click.core.Context object at 0x102539550>
         │    └ <bound method Command.invoke of <click.core.Command object at 0x1048e70d0>>
         └ <click.core.Command object at 0x1048e70d0>
  File "/Users/remy/projects/scrapd/scrapd/venv/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           │   │      │    │           │   └ {'verbose': 1, 'from_': 'Jan 2018', 'to': 'Dec 2018', 'format_': 'json', 'attempts': 3, 'backoff': 3, 'pages': -1}
           │   │      │    │           └ <click.core.Context object at 0x102539550>
           │   │      │    └ <function cli at 0x10468eb90>
           │   │      └ <click.core.Command object at 0x1048e70d0>
           │   └ <bound method Context.invoke of <click.core.Context object at 0x102539550>>
           └ <click.core.Context object at 0x102539550>
  File "/Users/remy/projects/scrapd/scrapd/venv/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
           │         │       └ {'verbose': 1, 'from_': 'Jan 2018', 'to': 'Dec 2018', 'format_': 'json', 'attempts': 3, 'backoff': 3, 'pages': -1}
           │         └ ()
           └ <function cli at 0x10468eb90>
  File "/Users/remy/projects/scrapd/scrapd/venv/lib/python3.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
           │ │                       │       └ {'verbose': 1, 'from_': 'Jan 2018', 'to': 'Dec 2018', 'format_': 'json', 'attempts': 3, 'backoff': 3, 'pages': -1}
           │ │                       └ ()
           │ └ <function get_current_context at 0x102c0fb90>
           └ <function cli at 0x10468eb00>

  File "/Users/remy/projects/scrapd/scrapd/scrapd/cli/cli.py", line 69, in cli
    command.execute()
    │       └ <bound method AbstractCommand.execute of <scrapd.cli.cli.Retrieve object at 0x10478f690>>
    └ <scrapd.cli.cli.Retrieve object at 0x10478f690>

> File "/Users/remy/projects/scrapd/scrapd/scrapd/cli/base.py", line 33, in execute
    sys.exit(self._execute())
    │   │    │    └ <bound method Retrieve._execute of <scrapd.cli.cli.Retrieve object at 0x10478f690>>
    │   │    └ <scrapd.cli.cli.Retrieve object at 0x10478f690>
    │   └ <built-in function exit>
    └ <module 'sys' (built-in)>

  File "/Users/remy/projects/scrapd/scrapd/scrapd/cli/cli.py", line 84, in _execute
    self.args['backoff'],
    │    └ {'verbose': 1, 'from_': 'Jan 2018', 'to': 'Dec 2018', 'format_': 'json', 'attempts': 3, 'backoff': 3, 'pages': -1}
    └ <scrapd.cli.cli.Retrieve object at 0x10478f690>

  File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/runners.py", line 43, in run
    return loop.run_until_complete(main)
           │    │                  └ <coroutine object async_retrieve at 0x102478e60>
           │    └ <bound method BaseEventLoop.run_until_complete of <_UnixSelectorEventLoop running=False closed=True debug=False>>
           └ <_UnixSelectorEventLoop running=False closed=True debug=False>

  File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/base_events.py", line 579, in run_until_complete
    return future.result()
           │      └ <built-in method result of _asyncio.Task object at 0x10495d9f0>
           └ <Task finished coro=<async_retrieve() done, defined at /Users/remy/projects/scrapd/scrapd/scrapd/core/apd.py:883> exception=Retr...

  File "/Users/remy/projects/scrapd/scrapd/scrapd/core/apd.py", line 926, in async_retrieve
    page_res = await asyncio.gather(*tasks)
    │                │       │       └ [<generator object AsyncRetrying.call at 0x1059e1950>, <generator object AsyncRetrying.call at 0x1059e1f50>, <generator object A...
    │                │       └ <function gather at 0x102b239e0>
    │                └ <module 'asyncio' from '/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/__init__.p...
    └ [{'Case': '18-3060392', 'Fatal crashes this year': '62', 'Date': datetime.date(2018, 11, 2), 'Time': datetime.time(7, 22), 'Loca...

  File "/Users/remy/projects/scrapd/scrapd/venv/lib/python3.7/site-packages/tenacity/_asyncio.py", line 43, in call
    do = self.iter(retry_state=retry_state)
    │    │    │    │           └ <tenacity.RetryCallState object at 0x1058e94d0>
    │    │    │    └ <tenacity.RetryCallState object at 0x1058e94d0>
    │    │    └ <bound method BaseRetrying.iter of <AsyncRetrying object at 0x1059a9690 (stop=<tenacity.stop.stop_after_attempt object at 0x1059...
    │    └ <AsyncRetrying object at 0x1059a9690 (stop=<tenacity.stop.stop_after_attempt object at 0x1059a9a10>, wait=<tenacity.wait.wait_ex...
    └ <tenacity.DoAttempt object at 0x1058fb290>
  File "/Users/remy/projects/scrapd/scrapd/venv/lib/python3.7/site-packages/tenacity/__init__.py", line 332, in iter
    six.raise_from(retry_exc, fut.exception())
    │   │          │          │   └ <bound method Future.exception of <Future at 0x105969f90 state=finished raised TypeError>>
    │   │          │          └ <Future at 0x105969f90 state=finished raised TypeError>
    │   │          └ RetryError(<Future at 0x105969f90 state=finished raised TypeError>)
    │   └ <function raise_from at 0x10391ce60>
    └ <module 'six' from '/Users/remy/projects/scrapd/scrapd/venv/lib/python3.7/site-packages/six.py'>
  File "<string>", line 3, in raise_from

tenacity.RetryError: RetryError[<Future at 0x105969f90 state=finished raised TypeError>]

It fails with a type error. My guess is that you messed up updating your branch at some point. We updated the internal structure to use datetime.date() objects for the dates and datetime.time() objects for the times.

The JSON formatter was also updated to reflect these changes.

These two changes may be good pointers to start digging.
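A minimal sketch of that convention (field names match the output shown later; the rendering formats are assumptions):

import datetime

record = {
    'Date': datetime.date(2018, 11, 2),  # stored internally as a date object
    'Time': datetime.time(7, 22),        # stored internally as a time object
}

# The formatter renders the objects only at output time.
formatted = {
    'Date': record['Date'].strftime('%m/%d/%Y'),  # -> '11/02/2018'
    'Time': record['Time'].strftime('%I:%M %p'),  # -> '07:22 AM'
}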

@rgreinho (Member)

You can also run pytest manually to make the output more verbose:

$ pytest -s -x -vvv -n0 -m integrations tests/
=============================================================== test session starts ================================================================
platform darwin -- Python 3.7.4, pytest-4.4.1, py-1.8.0, pluggy-0.12.0 -- /Users/remy/projects/scrapd/scrapd/venv/bin/python3.7
cachedir: .pytest_cache
rootdir: /Users/remy/projects/scrapd/scrapd, inifile: setup.cfg
plugins: socket-0.3.3, cov-2.6.1, forked-1.0.2, asyncio-0.10.0, bdd-3.1.0, xdist-1.28.0, mock-1.10.4, rerunfailures-7.0
collected 136 items / 133 deselected / 3 selected

tests/step_defs/test_retrieve.py::test_collect_information[csv-Jan 15 2019-Jan 18 2019-2] Fatal crashes this year,Case,Date,Time,Location,First Name,Last Name,Ethnicity,Gender,DOB,Age,Link,Notes
1,19-0150158,01/15/2019,06:20 AM,10500 block of N IH 35 SB,David,Sell,White,male,07/09/1987,31,http://austintexas.gov/news/traffic-fatality-1-4,The preliminary investigation shows that a 2000 Peterbilt semi truck was travelling southbound in the center lane on IH 35 when it struck pedestrian David Sell. The driver stopped as soon as it was possible to do so and remained on scene. He reported not seeing the pedestrian prior to impact given that it was still dark at the time of the crash. Sell was pronounced deceased at the scene at 6:24 a.m. No charges are expected to be filed.
2,19-0161105,01/16/2019,03:42 PM,West William Cannon Drive and Ridge Oak Road,Ann,Bottenfield-Seago,White,female,02/15/1960,58,http://austintexas.gov/news/traffic-fatality-2-3,"The preliminary investigation shows that the grey, 2003 Volkwagen Jetta being driven by Ann Bottenfield-Seago failed to yield at a stop sign while attempting to turn westbound on to West William Cannon Drive from Ridge Oak Road. The Jetta collided with a black, 2017 Chevrolet truck that was eastbound in the inside lane of West William Cannon Drive. Bottenfield-Seago was pronounced deceased at the scene. The passenger in the Jetta and the driver of the truck were both transported to a local hospital with non-life threatening injuries. No charges are expected to be filed."
PASSED
tests/step_defs/test_retrieve.py::test_collect_information[json-Jan 2018-Dec 2018-72] FAILED

===================================================================== FAILURES =====================================================================
_______________________________________________ test_collect_information[json-Jan 2018-Dec 2018-72] ________________________________________________

self = <AsyncRetrying object at 0x1110514d0 (stop=<tenacity.stop.stop_after_attempt object at 0x11043e9d0>, wait=<tenacity.wa...bject at 0x10eebd4d0>, before=<function before_nothing at 0x10eeb47a0>, after=<function after_nothing at 0x10eebc3b0>)>
fn = <function fetch_and_parse at 0x10fb12d40>
args = (<aiohttp.client.ClientSession object at 0x110f06950>, 'http://austintexas.gov/news/traffic-fatality-55-3'), kwargs = {}
retry_state = <tenacity.RetryCallState object at 0x110fab1d0>, do = <tenacity.DoAttempt object at 0x110dd1810>

    @asyncio.coroutine
    def call(self, fn, *args, **kwargs):
        self.begin(fn)

        retry_state = RetryCallState(
            retry_object=self, fn=fn, args=args, kwargs=kwargs)
        while True:
            do = self.iter(retry_state=retry_state)
            if isinstance(do, DoAttempt):
                try:
>                   result = yield from fn(*args, **kwargs)

venv/lib/python3.7/site-packages/tenacity/_asyncio.py:46:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

session = <aiohttp.client.ClientSession object at 0x110f06950>, url = 'http://austintexas.gov/news/traffic-fatality-55-3'

    @retry()
    async def fetch_and_parse(session, url):
        """
        Parse a fatality page from a URL.

        :param aiohttp.ClientSession session: aiohttp session
        :param str url: detail page URL
        :return: a dictionary representing a fatality.
        :rtype: dict
        """
        # Retrieve the page.
        page = await fetch_detail_page(session, url)
        if not page:
            raise ValueError(f'The URL {url} returned a 0-length content.')

        # Parse it.
>       d = parse_page(page, url)

scrapd/core/apd.py:872:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

page = '<!doctype html>\n<!--[if lt IE 7 ]><html class="ie ie6" lang="en"> <![endif]-->\n<!--[if IE 7 ]><html class="ie ie7" ...v><!-- /footer -->  <div id="disable_messages-debug-div" style="display:none;"><pre>NULL</pre></div></body>\n</html>\n'
url = 'http://austintexas.gov/news/traffic-fatality-55-3'

    def parse_page(page, url):
        """
        Parse the page using all parsing methods available.

        :param str  page: the content of the fatality page
        :param str url: detail page URL
        :return: a dictionary representing a fatality.
        :rtype: dict
        """
        # Parse the page.
        twitter_d = parse_twitter_fields(page)
        page_d, err = parse_page_content(page, bool(twitter_d.get(Fields.NOTES)))
        if err:
>           logger.debug(f'Fatality report {url} was not parsed correctly:\n\t * ' + '\n\t * '.join(err))
E           TypeError: sequence item 0: expected str instance, ValueError found

scrapd/core/apd.py:842: TypeError

The above exception was the direct cause of the following exception:

to_date = 'Dec 2018', entry_count = 72, format = 'json', from_date = 'Jan 2018'
request = <FixtureRequest for <Function test_collect_information[json-Jan 2018-Dec 2018-72]>>

        example_converters={'entry_count': int},
>   )
    def test_collect_information():

tests/step_defs/test_retrieve.py:16:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
venv/lib/python3.7/site-packages/pytest_bdd/scenario.py:195: in _execute_scenario
    _execute_step_function(request, scenario, step, step_func)
venv/lib/python3.7/site-packages/pytest_bdd/scenario.py:136: in _execute_step_function
    step_func(**kwargs)
tests/step_defs/test_retrieve.py:42: in ensure_results
    results, _ = event_loop.run_until_complete(apd.async_retrieve(pages=-1, **time_range))
/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/base_events.py:579: in run_until_complete
    return future.result()
scrapd/core/apd.py:926: in async_retrieve
    page_res = await asyncio.gather(*tasks)
venv/lib/python3.7/site-packages/tenacity/_asyncio.py:43: in call
    do = self.iter(retry_state=retry_state)
venv/lib/python3.7/site-packages/tenacity/__init__.py:332: in iter
    six.raise_from(retry_exc, fut.exception())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

value = None, from_value = TypeError('sequence item 0: expected str instance, ValueError found')

>   ???
E   tenacity.RetryError: RetryError[<Future at 0x11107e650 state=finished raised TypeError>]

<string>:3: RetryError
========================================= 1 failed, 1 passed, 133 deselected, 1 warnings in 14.00 seconds ==========================================

There is a TypeError: something expects a string but receives a ValueError instead. Maybe an exception is not handled correctly.

@rgreinho (Member)

Here is your issue:

File "/Users/remy/projects/scrapd/scrapd/scrapd/core/apd.py", line 842, in parse_page
    logger.debug(f'Fatality report {url} was not parsed correctly:\n\t * ' + '\n\t * '.join(err))
    │      │                                                                                └ [ValueError('cannot parse Deceased: Unidentified Hispanic male'), 'age is invalid: None']
    │      └ <bound method Logger._make_log_function.<locals>.log_function of <loguru._logger.Logger object at 0x102c2d450>>
    └ <loguru._logger.Logger object at 0x102c2d450>

TypeError: sequence item 0: expected str instance, ValueError found

You store an exception in the error list, where only strings (i.e. error messages) are expected.
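A minimal sketch of the fix (the stub stands in for the real parse_deceased_field):

def parse_deceased_field(text):
    # Stand-in for the real parser: it raises when the field cannot be parsed.
    raise ValueError(f'cannot parse Deceased: {text}')

errors = []
try:
    parse_deceased_field('Unidentified Hispanic male')
except ValueError as exc:
    errors.append(str(exc))  # append the message, not the exception object
errors.append('age is invalid: None')

print('\n\t * '.join(errors))  # joins cleanly now that every item is a str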

Happy fixing 😉

@mscarey (Contributor, Author) commented Jul 13, 2019

Thanks. I couldn't figure out that error; my best guess was that it was failing to get some expected text from an HTTP request.

@rgreinho (Member) left a comment

I regenerated the data sets with your branch. It looks really good! Great work tackling this problem!

Two questions regarding the new output:

{
     "Age": 62,
     "Case": "17-3460912",
     "DOB": "01/22/1955",
     "Date": "12/12/2017",
+    "Deceased": "Robert Lance Trewitt, White male (D.O.B. 1-22-55)",
     "Ethnicity": "White",
     "Fatal crashes this year": "70",
     "First Name": "Robert",
@@ -26,14 +28,15 @@
     "Last Name": "Trewitt",
     "Link": "http://austintexas.gov/news/traffic-fatality-70-0",
     "Location": "8400 block of Research Blvd. Southbound",
-    "Notes": "The preliminary investigation indicates that a 2002, blue, Chevrolet truck was traveling southbound in the 8400 block of Research Blvd. in the inside lane when it swerved into the right lane to avoid a collision. When the Chevrolet entered the right lane, it struck a 1998 BMW motorcycle from behind, knocking the driver off the bike and over a retaining wall. The driver fell to the frontage road where he sustained serious injuries. The driver of the motorcycle was transported to Dell Seton Medical Center at the University of Texas where he died as a result of his injuries on Wednesday, December 13, 2017.",
-    "Time": "2:13 p.m."
+    "Notes": "The preliminary investigation indicates that a 2002, blue, Chevrolet truck was traveling southbound in the 8400 block of Research Blvd. in the inside lane when it swerved into the right lane to avoid a collision. When the Chevrolet entered the right lane, it struck a 1998 BMW motorcycle from behind, knocking the driver off the bike and over a retaining wall. The driver fell to the frontage road where he sustained serious injuries. The driver of the motorcycle was transported to Dell Seton Medical Center at the University of Texas where he died as a result of his injuries on Wednesday, December 13, 2017.\n\tThis case is still being investigated.",
+    "Time": "02:13 PM"
   },
  1. The Deceased field appears in the output; is there a way we can remove it? (Not a blocker, but it would definitely be cleaner without temporary fields in the final output.)
  2. The notes definitely look better, but they contain a lot of \n or \t characters. Is there a way to interpret them instead of displaying them? Erf, scratch that 🤦‍♂

('traffic-fatality-4-6', 'White female, DOB 12/31/1960'),
('traffic-fatality-20-4', 'Hispanic male, 19 years of age'),
('traffic-fatality-25-4', ', Hispanic male, D.O.B. 6-9-70'),
('traffic-fatality-73-2', 'White male, DOB 02/09/80'),
@rgreinho (Member)

Can you add your tests using the same structure as before? I created the fixture to load the page initially to test the full flow, but I quickly regretted it and I plan to remove it in the future.

I also think it was more readable before, and we had more test cases.

@mscarey (Contributor, Author)

The old style of tests wasn't really adequate, though. Remember, we had issue #92 (Deceased fields that can't be parsed because they contain no DOB), and I got the old-style test passing for https://austintexas.gov/news/traffic-fatality-20-4 in #125. But then you had to open #150 because the same bulletin still wasn't being scraped correctly, so I added these tests that check the parsing of the whole page. Also, I think BeautifulSoup depends on having a valid HTML tree to select text from.

@mscarey (Contributor, Author) commented Jul 15, 2019

@rgreinho are you sure the Deceased field is showing up in the final output? I just ran the test command scrapd --from "Jan 15 2019" --to "Jan 18 2019" --format json and the Deceased field isn't in the output. I do have del page_d['Deceased'] on line 847 of apd.py.

@rgreinho (Member)

Run this command from your branch:

$ scrapd --from "12/12/2017" --to "12/12/2017"

[
  {
    "Age": 62,
    "Case": "17-3460912",
    "DOB": "01/22/1955",
    "Date": "12/12/2017",
    "Deceased": "Robert Lance Trewitt, White male (D.O.B. 1-22-55)",
    "Ethnicity": "White",
    "Fatal crashes this year": "70",
    "First Name": "Robert",
    "Gender": "male",
    "Last Name": "Trewitt",
    "Link": "http://austintexas.gov/news/traffic-fatality-70-0",
    "Location": "8400 block of Research Blvd. Southbound",
    "Notes": "The preliminary investigation indicates that a 2002, blue, Chevrolet truck was traveling southbound in the 8400 block of Research Blvd. in the inside lane when it swerved into the right lane to avoid a collision. When the Chevrolet entered the right lane, it struck a 1998 BMW motorcycle from behind, knocking the driver off the bike and over a retaining wall. The driver fell to the frontage road where he sustained serious injuries. The driver of the motorcycle was transported to Dell Seton Medical Center at the University of Texas where he died as a result of his injuries on Wednesday, December 13, 2017.\n\tThis case is still being investigated.",
    "Time": "02:13 PM"
  }
]

You can see the Deceased field.

@mscarey (Contributor, Author) commented Jul 16, 2019

I think this fixes it... I never looked very closely at parse_twitter_description, but I think the Deceased field wasn't being deleted if the Twitter description had been parsed. And then I had the test ignoring Deceased in the return value of parse_page, when it should only have been ignored for parse_page_content.
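A minimal sketch of the idea (the merge structure is an assumption; only the field name comes from this PR):

def merge_parsed(twitter_d, page_d):
    """Combine the Twitter-description and page-content results."""
    result = {**twitter_d, **page_d}
    # Drop the temporary field regardless of which parse path produced it.
    result.pop('Deceased', None)
    return result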

@rgreinho (Member)

Nice job @mscarey! 👍

I'll use this branch to regenerate the data sets tonight and I'll keep you posted!

@rgreinho (Member) left a comment

It looks great, @mscarey! Very good job on these issues!

@mergify mergify bot merged commit b842e6b into scrapd:master Jul 17, 2019

Successfully merging this pull request may close these issues:

  • Incorrect age/DOB parsing
  • Only the last paragraph of the "Notes" is being parsed
  • Notes are not parsed