This repository has been archived by the owner on Feb 2, 2022. It is now read-only.

ScraPD fails when not retrieving a full data entry #96

Closed
rgreinho opened this issue Apr 27, 2019 · 2 comments · Fixed by #116
Assignees
Labels
exp/beginner Good for newcomers kind/bug Something isn't working

Comments

@rgreinho
Member

Issue Type

  • Bug report

Current Behavior

Sometimes ScrAPD retrieves only the link to the detail page and no other information, causing the parser to crash:

++ scrapd -v --format json --from 'Jan 1 2019' --to 'Dec 31 2019'
Fetching page 1...
'Date'
Traceback (most recent call last):

 File "/home/circleci/project/venv/bin/scrapd", line 10, in <module>
   sys.exit(cli())
   │   │    └ <click.core.Command object at 0x7feab9fe14a8>
   │   └ <built-in function exit>
   └ <module 'sys' (built-in)>

 File "/home/circleci/project/venv/lib/python3.7/site-packages/click/core.py", line 764, in __call__
   return self.main(*args, **kwargs)
          │    │     │       └ {}
          │    │     └ ()
          │    └ <bound method BaseCommand.main of <click.core.Command object at 0x7feab9fe14a8>>
          └ <click.core.Command object at 0x7feab9fe14a8>
 File "/home/circleci/project/venv/lib/python3.7/site-packages/click/core.py", line 717, in main
   rv = self.invoke(ctx)
        │    │      └ <click.core.Context object at 0x7feabef59780>
        │    └ <bound method Command.invoke of <click.core.Command object at 0x7feab9fe14a8>>
        └ <click.core.Command object at 0x7feab9fe14a8>
 File "/home/circleci/project/venv/lib/python3.7/site-packages/click/core.py", line 956, in invoke
   return ctx.invoke(self.callback, **ctx.params)
          │   │      │    │           │   └ {'verbose': 1, 'format_': 'json', 'from_': 'Jan 1 2019', 'to': 'Dec 31 2019', 'gcontributors': None, 'gcredentials': None, 'page...
          │   │      │    │           └ <click.core.Context object at 0x7feabef59780>
          │   │      │    └ <function cli at 0x7feab5589ae8>
          │   │      └ <click.core.Command object at 0x7feab9fe14a8>
          │   └ <bound method Context.invoke of <click.core.Context object at 0x7feabef59780>>
          └ <click.core.Context object at 0x7feabef59780>
 File "/home/circleci/project/venv/lib/python3.7/site-packages/click/core.py", line 555, in invoke
   return callback(*args, **kwargs)
          │         │       └ {'verbose': 1, 'format_': 'json', 'from_': 'Jan 1 2019', 'to': 'Dec 31 2019', 'gcontributors': None, 'gcredentials': None, 'page...
          │         └ ()
          └ <function cli at 0x7feab5589ae8>
 File "/home/circleci/project/venv/lib/python3.7/site-packages/click/decorators.py", line 17, in new_func
   return f(get_current_context(), *args, **kwargs)
          │ │                       │       └ {'verbose': 1, 'format_': 'json', 'from_': 'Jan 1 2019', 'to': 'Dec 31 2019', 'gcontributors': None, 'gcredentials': None, 'page...
          │ │                       └ ()
          │ └ <function get_current_context at 0x7feabcdf0268>
          └ <function cli at 0x7feab5589a60>
 File "/home/circleci/project/venv/lib/python3.7/site-packages/scrapd/cli/cli.py", line 76, in cli
   command.execute()
   │       └ <bound method AbstractCommand.execute of <scrapd.cli.cli.Retrieve object at 0x7feaba04d9b0>>
   └ <scrapd.cli.cli.Retrieve object at 0x7feaba04d9b0>
> File "/home/circleci/project/venv/lib/python3.7/site-packages/scrapd/cli/base.py", line 34, in execute
   sys.exit(self._execute())
   │   │    │    └ <bound method Retrieve._execute of <scrapd.cli.cli.Retrieve object at 0x7feaba04d9b0>>
   │   │    └ <scrapd.cli.cli.Retrieve object at 0x7feaba04d9b0>
   │   └ <built-in function exit>
   └ <module 'sys' (built-in)>
 File "/home/circleci/project/venv/lib/python3.7/site-packages/scrapd/cli/cli.py", line 91, in _execute
   self.args['to'],
   │    └ {'verbose': 1, 'format_': 'json', 'from_': 'Jan 1 2019', 'to': 'Dec 31 2019', 'gcontributors': None, 'gcredentials': None, 'page...
   └ <scrapd.cli.cli.Retrieve object at 0x7feaba04d9b0>
 File "/usr/local/lib/python3.7/asyncio/runners.py", line 43, in run
   return loop.run_until_complete(main)
          │    │                  └ <coroutine object async_retrieve at 0x7feab5772e48>
          │    └ <bound method BaseEventLoop.run_until_complete of <_UnixSelectorEventLoop running=False closed=True debug=False>>
          └ <_UnixSelectorEventLoop running=False closed=True debug=False>
 File "/usr/local/lib/python3.7/asyncio/base_events.py", line 584, in run_until_complete
   return future.result()
          │      └ <built-in method result of _asyncio.Task object at 0x7feab5588188>
          └ <Task finished coro=<async_retrieve() done, defined at /home/circleci/project/venv/lib/python3.7/site-packages/scrapd/core/apd.p...
 File "/home/circleci/project/venv/lib/python3.7/site-packages/scrapd/core/apd.py", line 455, in async_retrieve
   entry for entry in page_res if date_utils.is_in_range(entry[Fields.DATE], from_, to)
                      │           │          │                 │      │      │      └ 'Dec 31 2019'
                      │           │          │                 │      │      └ 'Jan 1 2019'
                      │           │          │                 │      └ 'Date'
                      │           │          │                 └ <class 'scrapd.core.constant.Fields'>
                      │           │          └ <function is_in_range at 0x7feab6cb2ae8>
                      │           └ <module 'scrapd.core.date_utils' from '/home/circleci/project/venv/lib/python3.7/site-packages/scrapd/core/date_utils.py'>
                      └ [{'Fatal crashes this year': '23', 'Date': '04/14/2019', 'Notes': 'This is Austin’s 23rd fatal traffic crash of 2019, resulting ...
 File "/home/circleci/project/venv/lib/python3.7/site-packages/scrapd/core/apd.py", line 455, in <listcomp>
   entry for entry in page_res if date_utils.is_in_range(entry[Fields.DATE], from_, to)
   │         │                    │          │           │     │      │      │      └ 'Dec 31 2019'
   │         │                    │          │           │     │      │      └ 'Jan 1 2019'
   │         │                    │          │           │     │      └ 'Date'
   │         │                    │          │           │     └ <class 'scrapd.core.constant.Fields'>
   │         │                    │          │           └ {'Link': 'http://austintexas.gov/news/traffic-fatality-22-3'}
   │         │                    │          └ <function is_in_range at 0x7feab6cb2ae8>
   │         │                    └ <module 'scrapd.core.date_utils' from '/home/circleci/project/venv/lib/python3.7/site-packages/scrapd/core/date_utils.py'>
   │         └ {'Link': 'http://austintexas.gov/news/traffic-fatality-22-3'}
   └ {'Link': 'http://austintexas.gov/news/traffic-fatality-22-3'}

KeyError: 'Date'
Traceback (most recent call last):
 File "/home/circleci/project/tools/scrapd-merger.py", line 73, in <module>
   main()
 File "/home/circleci/project/tools/scrapd-merger.py", line 28, in main
   results = merge(json.loads(args.old.read()), json.loads(args.infile.read()))
 File "/usr/local/lib/python3.7/json/__init__.py", line 348, in loads
   return _default_decoder.decode(s)
 File "/usr/local/lib/python3.7/json/decoder.py", line 337, in decode
   obj, end = self.raw_decode(s, idx=_w(s, 0).end())
 File "/usr/local/lib/python3.7/json/decoder.py", line 355, in raw_decode
   raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Exited with code 1

Expected Behavior

ScrAPD should fail more gracefully.
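
One way to read "fail gracefully" is to skip entries that lack a date instead of letting the `KeyError` propagate. The snippet below is a minimal sketch, not ScrAPD's actual code: `filter_by_date` and `parse_date` are hypothetical names, the entries are assumed to be plain dictionaries keyed by `'Date'` and `'Link'` as in the traceback, and the date format is assumed to be the `MM/DD/YYYY` form seen in the sample entry.

```python
import datetime
import logging

logger = logging.getLogger(__name__)

DATE_KEY = 'Date'  # the key that raised KeyError in the traceback


def parse_date(value):
    """Parse a MM/DD/YYYY string (assumed format, taken from the sample entry)."""
    return datetime.datetime.strptime(value, '%m/%d/%Y').date()


def filter_by_date(entries, start, end):
    """Keep entries dated within [start, end]; log and skip incomplete ones."""
    kept = []
    for entry in entries:
        raw = entry.get(DATE_KEY)
        if raw is None:
            # The entry only has a link: report it instead of crashing.
            logger.warning('Skipping incomplete entry: %s', entry.get('Link', '<no link>'))
            continue
        if start <= parse_date(raw) <= end:
            kept.append(entry)
    return kept
```

Skipping keeps the run alive, but it also silently drops a fatality from the output, so it is probably best combined with a retry rather than used alone.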

Possible Solution

Implement a retry mechanism.

@rgreinho rgreinho added the kind/bug Something isn't working label Apr 27, 2019
@rgreinho rgreinho added this to the 1.5.2 - Keep the code clean milestone Apr 28, 2019
@rgreinho
Member Author

It looks like parse_page sometimes returns an empty dictionary, so the entry contains only the link to the fatality detail page.

A solution would be to check for an empty result, raise an exception there if nothing is returned, and wrap the call in retries with exponential backoff.
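
The idea above could be sketched roughly as follows, using only the standard library. This is an illustration under assumptions, not the change that was eventually merged: `EmptyResultError`, `parse_or_raise`, and `fetch_with_retries` are hypothetical names, and `fetch_and_parse` stands in for whatever coroutine fetches a detail page and returns the parsed dictionary.

```python
import asyncio


class EmptyResultError(Exception):
    """Raised when a detail page yields no parsed fields (hypothetical name)."""


async def parse_or_raise(fetch_and_parse, url):
    """Run the fetch/parse coroutine and treat an empty result as an error."""
    result = await fetch_and_parse(url)
    if not result:
        raise EmptyResultError(f'No data parsed from {url}')
    return result


async def fetch_with_retries(fetch_and_parse, url, attempts=4, base_delay=1.0):
    """Retry on empty results; `attempts` must be at least 1."""
    for attempt in range(attempts):
        try:
            return await parse_or_raise(fetch_and_parse, url)
        except EmptyResultError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff: wait 1x, 2x, 4x, ... the base delay.
            await asyncio.sleep(base_delay * 2 ** attempt)
```

A caller could then catch `EmptyResultError` once per entry and decide whether to skip that entry or abort the run. A retry library could replace the hand-written loop; the explicit version is shown only to make the backoff visible.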

@rgreinho rgreinho added the exp/beginner Good for newcomers label Apr 29, 2019
@rgreinho rgreinho added this to To do in Maintaince mode or second milestone via automation Apr 29, 2019
@rgreinho rgreinho moved this from To do to In progress in Maintaince mode or second milestone May 1, 2019
@rgreinho rgreinho self-assigned this May 1, 2019
rgreinho added a commit to rgreinho/scrapd that referenced this issue May 1, 2019
Adds retries to the function retrieving the detailed pages and
extracting the information.

Fixes scrapd#96
mergify bot pushed a commit that referenced this issue May 2, 2019
Adds retries to the function retrieving the detailed pages and
extracting the information.

Fixes #96
@rgreinho
Member Author

rgreinho commented May 2, 2019

Closed by #116.

@rgreinho rgreinho closed this as completed May 2, 2019
Maintaince mode or second milestone automation moved this from In progress to Done May 2, 2019