Fix two bugs in scraper code. #3063

Merged 3 commits on May 15, 2019
1 change: 1 addition & 0 deletions changelog/3063.bugfix.1.rst
@@ -0,0 +1 @@
Fix `sunpy.util.scraper.Scraper` failing if a directory is not found on a remote server.
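For context, a minimal sketch of the approach this PR takes (the helper name here is illustrative and not part of the change): a directory missing on the remote server raises an HTTP 404, which can be treated as "no files for that date" rather than a fatal error.

import urllib.request
from urllib.error import HTTPError

def read_directory_or_skip(url):
    # Hypothetical helper, for illustration only.
    # A missing remote directory raises HTTPError with code 404;
    # returning None lets the caller skip it and keep scraping other dates.
    try:
        return urllib.request.urlopen(url).read()
    except HTTPError as err:
        if err.code == 404:
            return None
        raise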
1 change: 1 addition & 0 deletions changelog/3063.bugfix.2.rst
@@ -0,0 +1 @@
Correctly zero pad milliseconds in the `sunpy.util.scraper.Scraper` formatting to prevent errors when the millisecond value was less than 100.
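A minimal sketch of the underlying problem, using made-up timestamp values: when the millisecond value is below 100, converting it with str() produces fewer than the three characters the %e placeholder stands for, so the generated filename no longer matches the pattern.

import datetime

now = datetime.datetime(2019, 4, 19, 0, 0, 0, 4009)  # 4009 microseconds = 4 ms
milliseconds = int(now.microsecond / 1000.)

print(str(milliseconds))              # '4'   -> too short, URL pattern match fails
print('{:03d}'.format(milliseconds))  # '004' -> zero-padded to the expected width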
14 changes: 11 additions & 3 deletions sunpy/util/scraper.py
@@ -5,6 +5,7 @@
import re
import datetime
from ftplib import FTP
from urllib.error import HTTPError
from urllib.request import urlopen

from bs4 import BeautifulSoup
@@ -67,9 +68,11 @@ def __init__(self, pattern, **kwargs):
else:
now = datetime.datetime.now()
milliseconds_ = int(now.microsecond / 1000.)
self.now = now.strftime(self.pattern[0:milliseconds.start()] +
str(milliseconds_) +
self.pattern[milliseconds.end():])
self.now = now.strftime('{start}{milli:03d}{end}'.format(
start=self.pattern[0:milliseconds.start()],
milli=milliseconds_,
end=self.pattern[milliseconds.end():]
))

def matches(self, filepath, date):
return date.strftime(self.pattern) == filepath
@@ -234,6 +237,11 @@ def filelist(self, timerange):
filesurls.append(fullpath)
finally:
opn.close()
except HTTPError as http_err:
# Ignore missing directories (issue #2684).
if http_err.code == 404:
continue
raise
except Exception:
raise
return filesurls
32 changes: 26 additions & 6 deletions sunpy/util/tests/test_scraper.py
@@ -1,3 +1,6 @@
import datetime
from unittest.mock import patch, Mock

import pytest

import astropy.units as u
@@ -140,22 +143,29 @@ def testURL_pattern():
assert not s._URL_followsPattern('fd_20130410_ar_231211.fts.gz')


@pytest.mark.xfail
def testURL_patternMilliseconds():
def testURL_patternMillisecondsGeneric():
s = Scraper('fd_%Y%m%d_%H%M%S_%e.fts')
# NOTE: Seems that if below fails randomly - not understood why
# with `== True` fails a bit less...
assert s._URL_followsPattern('fd_20130410_231211_119.fts')
assert not s._URL_followsPattern('fd_20130410_231211.fts.gz')
assert not s._URL_followsPattern('fd_20130410_ar_231211.fts.gz')


def testURL_patternMillisecondsZeroPadded():
# Asserts solution to ticket #1954.
# Milliseconds must be zero-padded in order to match URL lengths.
now_mock = Mock(return_value=datetime.datetime(2019, 4, 19, 0, 0, 0, 4009))
with patch('datetime.datetime', now=now_mock):
s = Scraper('fd_%Y%m%d_%H%M%S_%e.fts')
now_mock.assert_called_once()
assert s.now == 'fd_20190419_000000_004.fts'


@pytest.mark.xfail
def testFilesRange_sameDirectory_local():
# Fails due to an IsADirectoryError, wrapped in a URLError, after `requests`
# tries to open a directory as a binary file.
Contributor:
So is this something we can fix, or should this test actually always fail?

Author:

Excellent question. I don't know if scanning directories was a functioning feature at some point or not. It's definitely fixable/implementable, but it'd take some effort.

Contributor:

Yeah, that does seem like an undertaking. I wonder if we want to support that in the future or just drop it. What do you think @Cadair?

Member:

I honestly have no idea. If there isn't an issue for this already we should probably open one. Also I am guessing @dpshelio wrote this test.

Contributor:

There is already a check for ftp-type URLs (URIs? I never know which); maybe add a check for file-type URLs that raises a not-implemented exception, and alter the test to match? I mean, it can't work as it is, can it?
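For illustration, a rough sketch of that suggestion (hypothetical, not part of this PR): detect file-type patterns up front and raise NotImplementedError, rather than failing later when a local directory is opened as a file.

from urllib.parse import urlparse

def _reject_local_patterns(pattern):
    # Hypothetical check based on the review suggestion above; not in this PR.
    if urlparse(pattern).scheme == 'file':
        raise NotImplementedError(
            'Scraping local file:// patterns is not currently supported.')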

s = Scraper('/'.join(['file:/', rootdir,
'EIT', 'efz%Y%m%d.%H%M%S_s.fits']))
print(s.pattern)
print(s.now)
startdate = parse_time((2004, 3, 1, 4, 0))
enddate = parse_time((2004, 3, 1, 6, 30))
assert len(s.filelist(TimeRange(startdate, enddate))) == 3
@@ -199,3 +209,13 @@ def test_ftp():
s = Scraper(pattern)
timerange = TimeRange('2016/5/18 15:28:00', '2016/5/20 16:30:50')
assert len(s.filelist(timerange)) == 2


@pytest.mark.remote_data
def test_filelist_url_missing_directory():
# Asserts solution to ticket #2684.
# Attempting to access data for the year 1960 results in a 404, so no files are returned.
pattern = 'http://lasp.colorado.edu/eve/data_access/evewebdataproducts/level2/%Y/%j/'
s = Scraper(pattern)
timerange = TimeRange('1960/01/01 00:00:00', '1960/01/02 00:00:00')
assert len(s.filelist(timerange)) == 0