Stats for JATS-Con talk #108

Closed
Daniel-Mietchen opened this Issue Oct 1, 2013 · 77 comments

2 participants
@Daniel-Mietchen
Member

Daniel-Mietchen commented Oct 1, 2013

For #98, I suggest we run the stats as per #102 and #101 for at least a year instead of a week, so as to have a better basis for discussion.

We can also include any other material that may be of interest and that hasn't found its way into the paper.

@ghost ghost assigned erlehmann Oct 1, 2013

@Daniel-Mietchen

Daniel-Mietchen commented Oct 5, 2013

Any news on this?

@erlehmann

erlehmann commented Oct 16, 2013

Creating the list of PMC IDs for 2012, analogous to #102 (comment):

$ ./oa-pmc-ids --from 2012-01-01 --until 2013-01-01 --verbose > pmc-ids-from-2012-01-01-until-2013-01-01
@erlehmann

erlehmann commented Oct 16, 2013

Oops. My naive implementation does not work for that case.

  File "./oa-pmc-ids", line 35, in get_records
    for record in records:
  File "./oa-pmc-ids", line 24, in get_records
    request = get(url)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 52, in get
    return request('get', url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 39, in request
    s = kwargs.pop('session') if 'session' in kwargs else sessions.session()
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 329, in session
    return Session(**kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 86, in __init__
    self.init_poolmanager()
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 97, in init_poolmanager
    maxsize=self.config.get('pool_maxsize')
  File "/usr/lib/python2.7/dist-packages/requests/packages/urllib3/poolmanager.py", line 55, in __init__
    self.pools = RecentlyUsedContainer(num_pools)
  File "/usr/lib/python2.7/dist-packages/requests/packages/urllib3/_collections.py", line 40, in __init__
    self.access_log_lock = RLock()
  File "/usr/lib/python2.7/threading.py", line 102, in RLock
    return _RLock(*args, **kwargs)
  File "/usr/lib/python2.7/threading.py", line 107, in __init__
    _Verbose.__init__(self, verbose)
RuntimeError: maximum recursion depth exceeded while calling a Python object
@Daniel-Mietchen

Daniel-Mietchen commented Oct 16, 2013

And if you use tee instead of --verbose?

@erlehmann

erlehmann commented Oct 16, 2013

“a typical Python implementation allows 1000 recursions, which is plenty for non-recursively written code and for code that recurses to traverse, for example, a typical parse tree, but not enough for a recursively written loop over a large list.” http://neopythonic.blogspot.com.au/2009/04/tail-recursion-elimination.html

@erlehmann

erlehmann commented Oct 16, 2013

The problem is my recursive URL get function. I'll have to refactor it, let me think about it.

@erlehmann

erlehmann commented Oct 17, 2013

Refactored URL get function to be iterative as of ab78a5f.
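
For readers following along, here is a minimal sketch of why a recursive paging function hits the recursion limit and how an iterative loop avoids it. It is not the code from ab78a5f: the paged source is a hypothetical in-memory dictionary, and make_pages, get_records_recursive and get_records_iterative are illustrative names only. It is written for Python 3, while the tracebacks in this thread come from Python 2.7.

```python
import sys


def make_pages(page_count, page_size=2):
    """Build a hypothetical paged source: token -> (records, next_token)."""
    pages = {}
    for n in range(page_count):
        records = ["PMC%d" % (n * page_size + i) for i in range(page_size)]
        next_token = n + 1 if n + 1 < page_count else None
        pages[n] = (records, next_token)
    return pages


def get_records_recursive(pages, token=0):
    """Recursive generator: every further page adds one nested frame."""
    records, next_token = pages[token]
    for record in records:
        yield record
    if next_token is not None:
        for record in get_records_recursive(pages, next_token):
            yield record


def get_records_iterative(pages, token=0):
    """Iterative generator: stack depth stays constant for any page count."""
    while token is not None:
        records, token = pages[token]
        for record in records:
            yield record


if __name__ == "__main__":
    pages = make_pages(page_count=5000)
    print(len(list(get_records_iterative(pages))))
    try:
        list(get_records_recursive(pages))
    except RecursionError:
        print("recursive version exceeded the limit of", sys.getrecursionlimit())
```

With a default interpreter the recursive variant fails once the number of pages passes the recursion limit, while the loop handles the same input without trouble.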

@erlehmann

erlehmann commented Oct 17, 2013

Currently testing the refactored URL get function.

@Daniel-Mietchen

Daniel-Mietchen commented Oct 17, 2013

In terms of dates, I would suggest going for something like Sep 18, 2012 to Sep 17, 2013, rather than the complete year 2012.

@erlehmann

erlehmann commented Oct 17, 2013

Fixed URL get function several times. Should work now.

$ nohup sh -c './oa-pmc-ids --from 2012-01-01 --until 2013-01-01 --verbose > pmc-ids-from-2012-01-01-until-2013-01-01'
@erlehmann

erlehmann commented Oct 17, 2013

Btw, it seems that PMCIDs constantly get added and removed. At first I suspected a bug in my code.

@Daniel-Mietchen

Daniel-Mietchen commented Oct 17, 2013

Can you describe in a bit more detail what you mean? The "removed" part is rather new to me.

@erlehmann

erlehmann commented Oct 17, 2013

I only found one removed entry – maybe that is a failure in my method. I'll document it.

@erlehmann

erlehmann commented Oct 21, 2013

Number seems much too round, but I have not found the error:

$ wc -w pmc-ids-from-2012-01-01-until-2013-01-01
974000 pmc-ids-from-2012-01-01-until-2013-01-01
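
One way to probe a suspiciously round total is to check for duplicates and whether the count is an exact multiple of a batch size. The sketch below assumes the IDs are whitespace-separated, as `wc -w` counts them. The batch size of 1000 is a guess for illustration, not a value taken from oa-pmc-ids, and summarize_ids is a hypothetical helper.

```python
from collections import Counter


def summarize_ids(text, batch_size=1000):
    """Count whitespace-separated IDs like `wc -w` and report duplicates.

    batch_size is a hypothetical page size used only for the divisibility check.
    """
    ids = text.split()
    counts = Counter(ids)
    return {
        "total": len(ids),
        "unique": len(counts),
        "duplicated": sorted(i for i, n in counts.items() if n > 1),
        "multiple_of_batch": len(ids) > 0 and len(ids) % batch_size == 0,
    }


if __name__ == "__main__":
    sample = "PMC1 PMC2 PMC2 PMC3\n"
    print(summarize_ids(sample, batch_size=2))
```

If the total and unique counts differ, or the total is an exact multiple of the batch size, the paging logic would be the first place to look.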
@erlehmann

erlehmann commented Oct 21, 2013

Publications for March 2013:

$ ./oa-pmc-ids --from 2013-03-01 --until 2013-04-01 > pmc-ids-from-2013-03-01-until-2013-04-01
@erlehmann

erlehmann commented Oct 21, 2013

Confirming number of PMC IDs:

$ wc -w <pmc-ids-from-2013-03-01-until-2013-04-01
139674

This is slightly higher than in #102 (comment), therefore probably correct.

@erlehmann

erlehmann commented Oct 21, 2013

Creating database on host files.mi.ur.de (user erlehmann).

$ nohup sh -c 'cat pmc-ids-from-2013-03-01-until-2013-04-01 | ./oa-get download-metadata pmc_pmcid 2>oa-get-download-metadata.log'
@erlehmann

erlehmann commented Oct 21, 2013

Downloading “http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC3492181&id=PMC3492194&id=PMC3492197&[…]&id=PMC3498557&id=PMC3498558&id=PMC3498651”, saving into directory “/home/erlehmann/.cache/open-access-media-importer/metadata/raw/pmc_pmcid” …
Traceback (most recent call last):
  File "./oa-get", line 161, in <module>
    for result in source_module.download_metadata(source_path):
  File "/home/erlehmann/open-access-media-importer/sources/pmc_pmcid.py", line 47, in download_metadata
    local_file.write(content.read())
  File "/usr/lib/python2.7/socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib/python2.7/httplib.py", line 541, in read
    return self._read_chunked(amt)
  File "/usr/lib/python2.7/httplib.py", line 596, in _read_chunked
    value.append(self._safe_read(amt))
  File "/usr/lib/python2.7/httplib.py", line 649, in _safe_read
    raise IncompleteRead(''.join(s), amt)
httplib.IncompleteRead: IncompleteRead(4781 bytes read, 73 more expected)

Retrying now.
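
Instead of restarting by hand, the download step could retry automatically when the response is cut short. The following is only a sketch of that idea, not the behaviour of oa-get: read_with_retry, its attempts and delay parameters, and the simulated fetch function are all hypothetical. It uses Python 3, where the exception lives in http.client rather than httplib.

```python
import time
from http.client import IncompleteRead


def read_with_retry(fetch, attempts=3, delay=0.0):
    """Call fetch() until it succeeds, retrying on IncompleteRead.

    fetch is any zero-argument callable that returns the response body.
    The last failure is re-raised once all attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except IncompleteRead:
            if attempt == attempts:
                raise
            time.sleep(delay)


if __name__ == "__main__":
    calls = []

    def flaky_fetch():
        # Simulates a truncated chunked response on the first two calls.
        calls.append(1)
        if len(calls) < 3:
            raise IncompleteRead(b"partial", 73)
        return b"<articles/>"

    print(read_with_retry(flaky_fetch), "after", len(calls), "calls")
```

A real version would probably also want a delay between attempts so as not to hammer the server.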

@erlehmann

erlehmann commented Oct 21, 2013

Finding supplementary materials on host files.mi.ur.de (user erlehmann), same as #102 (comment):

$ export PYTHONIOENCODING=utf-8
$ nohup sh -c './oa-cache find-media pmc_pmcid 2> oa-cache-find-media.log'
@erlehmann

erlehmann commented Oct 21, 2013

The process has been running for 192 minutes:

$ ps x | grep python | head -n1
 8373 ?        R    192:26 python ./oa-cache find-media pmc_pmcid

Process found ~30k supplementary materials:

$ cat oa-cache-find-media.log | grep "^$" | wc -l
31146

There are ~140k PMC IDs:

$ wc -w <pmc-ids-from-2013-03-01-until-2013-04-01
139674
@erlehmann

erlehmann commented Oct 21, 2013

From #102 (comment) and #102 (comment) we know that the first week of May had 5649 PMC IDs with 6306 supplementary materials, averaging ~1.12 supplementary materials per PMC ID.

@erlehmann

erlehmann commented Oct 21, 2013

PMC IDs until completion: 139674 - 31146 = 108528
Supplementary materials until completion: 108528 × 1.12 = 121551
31146 materials in 11546 seconds means one supplementary material is found every 0.37 seconds.
Time until completion: 121551 × 0.37 seconds = 44973 seconds, meaning 12.5 hours. Seems I failed here. :(

Curiously, a year's worth of data would have taken over 6 days if everything had gone perfectly, which it didn't. I started too late. Conclusion: The oa-cache find-media functionality is missing a progress bar. I'll have to add it.
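
The estimate above can be rechecked with a few lines of Python. The input figures are the ones quoted in this thread; the function name estimate_remaining is only illustrative.

```python
def estimate_remaining(total_ids, done, elapsed_seconds, materials_per_id):
    """Redo the back-of-the-envelope estimate from the figures in this thread."""
    remaining_ids = total_ids - done
    remaining_materials = remaining_ids * materials_per_id
    seconds_per_item = elapsed_seconds / done
    remaining_seconds = remaining_materials * seconds_per_item
    return remaining_ids, remaining_materials, seconds_per_item, remaining_seconds


if __name__ == "__main__":
    ids, materials, per_item, seconds = estimate_remaining(
        total_ids=139674,               # PMC IDs in the list
        done=31146,                     # count taken from the log so far
        elapsed_seconds=192 * 60 + 26,  # "192:26" as reported by ps
        materials_per_id=1.12,          # average from #102
    )
    print(ids, round(materials), round(per_item, 2), round(seconds / 3600, 1))
```

Using the unrounded rate gives slightly more than 45000 seconds rather than 44973, which is still about 12.5 hours.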

@erlehmann

erlehmann commented Oct 21, 2013

Two rules of thumb learned here for the media discovery process using oa-cache:

  • Scanning a single PMC ID takes 0.32 seconds on average.
  • The server is able to scan approx. 187 PMC IDs per minute.
@erlehmann

erlehmann commented Oct 22, 2013

Dumping statistics, same as #102 (comment):

$ ./oa-cache stats pmc_pmcid < oa-stats

Output:

Counting supplementary materials … 128215 supplementary materials found.
100% |#########################################################################|
@erlehmann

erlehmann commented Oct 22, 2013

The licensing graphic shows a notable error in license detection: CC BY 2.0 and CC BY 2.0 UK are not detected as free.

@erlehmann

erlehmann commented Oct 22, 2013

Graphic demonstrating the error: http://daten.dieweltistgarnichtso.net/pics/graphs/open-access-media-importer/pmc-licenses-2013-03-error.png

More than 20k supplementary materials (out of more than 128k) were incorrectly assessed as non-free. If this is fixed and the licensing distribution is assumed to be uniform, we can assume a 20% higher yield on subsequent re-runs of the Open Access Media Importer.
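
A possible shape for a fix, sketched under assumptions: match the license URL by its license code and ignore version and jurisdiction, so that older and ported licenses such as CC BY 2.0 UK are not rejected by accident. This is not the importer's actual detection code. FREE_CC_CODES, CC_URL and is_free_license are hypothetical names, and the allow-list of codes is an assumption.

```python
import re

# Hypothetical allow-list of Creative Commons license codes treated as free.
FREE_CC_CODES = {"by", "by-sa"}

# Matches e.g. http://creativecommons.org/licenses/by/2.0/uk/
CC_URL = re.compile(
    r"^https?://creativecommons\.org/licenses/"
    r"(?P<code>[a-z-]+)/(?P<version>\d\.\d)(?:/(?P<jurisdiction>[a-z]+))?/?$"
)


def is_free_license(url):
    """Return True if url names a CC license whose code is in FREE_CC_CODES.

    Version and jurisdiction are parsed but do not influence the result.
    """
    match = CC_URL.match(url.strip().lower())
    return bool(match) and match.group("code") in FREE_CC_CODES


if __name__ == "__main__":
    for url in (
        "http://creativecommons.org/licenses/by/2.0/",
        "http://creativecommons.org/licenses/by/2.0/uk/",
        "http://creativecommons.org/licenses/by-nc/3.0/",
    ):
        print(url, is_free_license(url))
```

Public domain dedications and non-CC licenses would need their own handling. This sketch only covers URLs under creativecommons.org/licenses/.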

@erlehmann

erlehmann Mar 26, 2014

Quick estimate of how long the media discovery process will take, based on the 0.32 seconds figure from #108 (comment):

; dc
2
k
0.32
37966
*
p
12149.12
60
/
p
202.48
60
/
p
3.37

I think the media discovery process using oa-cache is going to take around 200 minutes (about 3.4 hours).
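The same estimate as a small Python sketch, in case the per-article figure or the article count changes. The function name `estimate_runtime` is made up for illustration; the inputs are the 0.32 seconds and 37966 articles used above:

```python
def estimate_runtime(seconds_per_article, article_count):
    """Return the estimated total runtime as (seconds, minutes, hours)."""
    seconds = seconds_per_article * article_count
    minutes = seconds / 60.0
    hours = minutes / 60.0
    return seconds, minutes, hours


seconds, minutes, hours = estimate_runtime(0.32, 37966)
print("%.2f s = %.2f min = %.2f h" % (seconds, minutes, hours))
```

Note that dc truncates to the configured precision while the format string here rounds, so the last digit can differ.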

erlehmann Mar 27, 2014

Failure:

$ tail oa-cache-find-media.log 
Traceback (most recent call last):
  File "./oa-cache", line 214, in <module>
    skip=skip
  File "/home/erlehmann/open-access-media-importer/sources/pmc_pmcid.py", line 86, in list_articles
    result['supplementary-materials'] = _get_supplementary_materials(tree)
  File "/home/erlehmann/open-access-media-importer/sources/pmc.py", line 576, in _get_supplementary_materials
    material = _get_supplementary_material(tree, sup)
  File "/home/erlehmann/open-access-media-importer/sources/pmc.py", line 616, in _get_supplementary_material
    assert 'Click here' not in caption
AssertionError

erlehmann Mar 27, 2014

Commented out assertion. Restarted media discovery:

$ export PYTHONIOENCODING=utf-8
$ nohup sh -c './oa-cache find-media pmc_pmcid 2>> oa-cache-find-media.log'
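Commenting out the assertion gets the run going again, but it also hides how often the condition occurs. A minimal sketch of an alternative, assuming the check only exists to flag captions that are leftover link text; `caption_is_placeholder` and its `article_id` parameter are hypothetical and not part of sources/pmc.py:

```python
import logging

logger = logging.getLogger("oa-cache")


def caption_is_placeholder(caption, article_id):
    """Return True and log a warning if the caption is only link text."""
    if 'Click here' in caption:
        logger.warning(
            "Article %s: caption looks like placeholder link text: %r",
            article_id, caption)
        return True
    return False
```

The caller can then decide whether to skip the material or keep it with an empty caption, and the log shows how many captions were affected.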

erlehmann Mar 27, 2014

oa-cache-find-media.log contains lots of messages similar to:

Skipping Article “3587020”, as it has no DOI.

This might skew our results.

erlehmann Mar 27, 2014

Current status of skipped articles:

$ grep -c 'Skipping Article' oa-cache-find-media.log 
1440

Since a missing DOI hinders re-use (we cannot easily reference the original material), this might be on topic for JATS-Con.

erlehmann Mar 27, 2014

Dumping statistics:

$ ./oa-cache stats pmc_pmcid > oa-stats
Counting supplementary materials … 23730 supplementary materials found.
100% |#########################################################################|

erlehmann Mar 27, 2014

Skipped articles:

$ grep -c 'Skipping Article' oa-cache-find-media.log
2933

How many articles were considered:

$ dc
37966 2933 - p
35033
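To judge how much the skipped articles could skew the results, the share matters as well as the count. A short sketch using the two numbers above; the helper names are made up for illustration, and `count_skipped` mirrors what `grep -c` does on the log:

```python
def count_skipped(log_lines, marker="Skipping Article"):
    """Count log lines that contain the marker, like grep -c."""
    return sum(1 for line in log_lines if marker in line)


def considered_articles(total, skipped):
    """Return (articles considered, share of articles skipped)."""
    considered = total - skipped
    skipped_share = float(skipped) / total
    return considered, skipped_share


considered, share = considered_articles(37966, 2933)
print("%d considered, %.1f%% skipped" % (considered, 100 * share))
```

With these inputs, about 7.7% of the articles were skipped.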

erlehmann Mar 27, 2014

$ export PYTHONIOENCODING=utf-8
$ nohup sh -c './oa-get update-mimetypes pmc_pmcid 2> oa-get-update-mimetypes.log'

erlehmann Mar 27, 2014

No idea how long it will take, though.

erlehmann Mar 28, 2014

Meh:

$ tail oa-get-update-mimetypes.log -n24
Traceback (most recent call last):
  File "./oa-get", line 106, in <module>
    chunk = urlopen(request, timeout=3).read()
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 401, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 419, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 379, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1211, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
    r = h.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1034, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 407, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 365, in _read_status
    line = self.fp.readline()
  File "/usr/lib/python2.7/socket.py", line 447, in readline
    data = self._sock.recv(self._rbufsize)
socket.timeout: timed out

erlehmann Mar 28, 2014

Patched the download routine to use Pokémon exception handling (catching every exception) and restarted the process.
http://forums.thedailywtf.com/forums/p/8499/161670.aspx#161535
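A narrower alternative to catching everything, as a sketch only. It is written for Python 3 (`urllib.request`), whereas the traceback above comes from Python 2.7 and `urllib2`; the function `fetch_chunk`, its `opener` parameter and the message format are assumptions for illustration, not the code in oa-get:

```python
import socket
import sys
from http.client import HTTPException
from urllib.request import urlopen


def fetch_chunk(url, size=2048, timeout=3, opener=urlopen):
    """Return the first bytes of url, or None if the download failed."""
    try:
        response = opener(url, timeout=timeout)
        try:
            return response.read(size)
        finally:
            response.close()
    # In Python 3, socket.timeout and urllib.error.URLError are both
    # subclasses of OSError, so network failures end up here.
    except (OSError, HTTPException) as error:
        sys.stderr.write("Download of %s failed: %s\n" % (url, error))
        return None
```

Programming errors such as a TypeError still surface this way, while timeouts and unreachable hosts only cost one file.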

erlehmann Mar 28, 2014

213 of 21222   1% |                                            | ETA:  08:34:00

erlehmann Mar 28, 2014

CIF files are detected as text/plain because “chemical” is not a registered top-level media type:
https://en.wikipedia.org/wiki/Chemical_file_format#The_Chemical_MIME_Project
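One possible workaround is to fall back on the file extension when content detection only yields text/plain. A minimal sketch; the override table and `refine_mimetype` are hypothetical, and `chemical/x-cif` is the unregistered type the Chemical MIME Project uses for CIF:

```python
import os

# Hypothetical override table, consulted only for generic text/plain results.
EXTENSION_OVERRIDES = {".cif": "chemical/x-cif"}


def refine_mimetype(detected_mimetype, url):
    """Replace a generic text/plain result if the extension is known."""
    extension = os.path.splitext(url)[1].lower()
    if detected_mimetype == "text/plain":
        return EXTENSION_OVERRIDES.get(extension, detected_mimetype)
    return detected_mimetype
```

Restricting the override to text/plain keeps a specific detection result from being overruled by a misleading file name.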

erlehmann Mar 28, 2014

Also, lots of XML (?) files are detected as text/plain:

DOI 10.3897/zookeys.268.4071, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3592199/bin/zookeys.268.4071-treatment49.xml, source claimed text/xml but is text/plain.

erlehmann Mar 28, 2014

When trying to download , the following error occured: “''”.
When trying to download , the following error occured: “timed out”.
When trying to download , the following error occured: “”.
When trying to download , the following error occured: “”.
When trying to download , the following error occured: “timed out”.
625 of 20822   3% |#                                           | ETA:  09:45:22

Raphael, please look into “network is unreachable”.

erlehmann Mar 28, 2014

More timeouts:

When trying to download , the following error occured: “timed out”.
When trying to download , the following error occured: “timed out”.
833 of 20822   4% |#                                           | ETA:  09:56:00

erlehmann Mar 28, 2014

Weird, BMP advertised as JPEG:

DOI 10.1007/s10194-008-0089-8, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3451755/bin/10194_2008_89_MOESM1_ESM.jpg, source claimed image/jpeg but is image/x-ms-bmp.
DOI 10.1007/s10194-008-0089-8, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3451755/bin/10194_2008_89_MOESM2_ESM.jpg, source claimed image/jpeg but is image/x-ms-bmp.
DOI 10.1007/s10194-008-0089-8, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3451755/bin/10194_2008_89_MOESM3_ESM.bmp, source claimed image/bmp but is image/x-ms-bmp.

erlehmann Mar 28, 2014

GIMP image advertised as plain text:

DOI 10.1371/journal.pone.0044641, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3439412/bin/pone.0044641.s001.xcf, source claimed text/plain but is image/x-xcf.

erlehmann Mar 28, 2014

Someone seems to have hardcoded “.rar” to audio/x-realaudio:

DOI 10.1186/1742-4682-10-27, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3648501/bin/1742-4682-10-27-S3.rar, source claimed audio/x-realaudio but is application/x-rar.
DOI 10.1186/1742-4682-10-27, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3648501/bin/1742-4682-10-27-S5.rar, source claimed audio/x-realaudio but is application/x-rar
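Mismatch lines like these could be tallied per (claimed, detected) pair to see which mislabelings are systematic. A minimal sketch, assuming every mismatch line follows the format shown above; the pattern and `tally_mismatches` are illustrative and not part of oa-get:

```python
import re
from collections import Counter

# Assumed line format:
# "DOI <doi>, <url>, source claimed <type> but is <type>."
MISMATCH_PATTERN = re.compile(
    r"^DOI (?P<doi>\S+), (?P<url>\S+), "
    r"source claimed (?P<claimed>\S+) but is (?P<detected>\S+?)\.?$"
)


def tally_mismatches(log_lines):
    """Count how often each (claimed, detected) media type pair occurs."""
    counts = Counter()
    for line in log_lines:
        match = MISMATCH_PATTERN.match(line.strip())
        if match:
            counts[(match.group("claimed"), match.group("detected"))] += 1
    return counts
```

Feeding the log through `tally_mismatches(open(path))` and printing `counts.most_common()` would put pairs such as the “.rar” case at the top if they recur.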

erlehmann Mar 28, 2014

1042 of 20822   5% |##                                         | ETA:  09:37:40

erlehmann Mar 28, 2014

1874 of 20822   9% |###                                        | ETA:  09:04:13

erlehmann Mar 28, 2014

Traceback (most recent call last):
  File "./oa-get", line 112, in <module>
    if 'Document, corrupt' in detected_mimetype:  # partial MS Office document
TypeError: argument of type 'NoneType' is not iterable
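The TypeError means `detected_mimetype` was None at this point, presumably because detection returned nothing for a file. A guard along these lines would avoid the crash; the helper name is hypothetical, and treating an undetected type as "not a partial document" is an assumption:

```python
def is_partial_office_document(detected_mimetype):
    """Return True if detection reported a corrupt MS Office document."""
    if detected_mimetype is None:
        # Detection produced no result, so there is nothing to inspect.
        return False
    return 'Document, corrupt' in detected_mimetype
```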

erlehmann Mar 28, 2014

Patched, restarted.

erlehmann Mar 28, 2014

Looks good:

3262 of 18118  18% |#######                                    | ETA:  06:42:05

I hope it does not crash again.

erlehmann Mar 28, 2014

Weird stuff:

DOI 10.1186/1471-2105-6-79, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1131891/bin/1471-2105-6-79-S1.doc, source claimed application/msword but is text/html.
DOI 10.1371/journal.pone.0064139, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3656924/bin/pone.0064139.s004.pdf, source claimed application/pdf but is application/zip.

erlehmann Mar 29, 2014

18118 of 18118 100% |##########################################| Time: 07:56:29

erlehmann Mar 29, 2014

Dumping stats:

$ ./oa-cache stats pmc_pmcid > stats-2014-03-29
Counting supplementary materials … 23730 supplementary materials found.
100% |#########################################################################|

erlehmann Mar 29, 2014

This is probably what you wanted: http://daten.dieweltistgarnichtso.net/pics/graphs/open-access-media-importer/stats-2014-03-29.tar.gz

Known errors: some media types are not detected correctly, and CC UK licenses are not recognized as free.
