
oami_pmc_pmcid_import should loop over the PMCIDs found by oa-pmc-ids #111

Closed
Daniel-Mietchen opened this issue Oct 6, 2013 · 10 comments

@Daniel-Mietchen
Member

Moved here from #94 (comment).

@Daniel-Mietchen
Member Author

Just ran

./oa-pmc-ids --from 2013-10-08 --until 2013-10-13 | ./oami_pmc_pmcid_import

which gave

...
Skipping <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3786892/bin/pone.0076065.s001.avi>, already exists at <https://commons.wikimedia.org/w/api.php>.
Unknown, possibly non-free license: <None>
Skipping <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3786914/bin/pone.0075952.s001.mpg>, already exists at <https://commons.wikimedia.org/w/api.php>.
Traceback (most recent call last):
  File "./oa-get", line 187, in <module>
    if mediawiki.is_uploaded(material):
  File "/home/danielmietchen/open-access-media-importer/helpers/mediawiki.py", line 80, in is_uploaded
    assert len(first_sentence_of_caption) > 0
AssertionError

This error is the subject of #84; the point here in #111 is that such an individual error should not stop the processing of other articles, but it currently does.

Moreover, the current handling of the loop does not let me find out which article triggered the "Unknown, possibly non-free license:" message, or whether that statement is correct.

@erlehmann

I assume we can use xargs.
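A minimal sketch of the xargs idea: `xargs -n 1` invokes the command once per PMCID, so a failure on one article cannot abort the rest. Here `cat` stands in for `./oami_pmc_pmcid_import` so the snippet runs standalone; the PMCIDs are examples from the log above.

```shell
# Run the importer once per PMCID; -n 1 gives each invocation a single
# argument, and -I {} substitutes it into the pipeline.
printf '%s\n' PMC3786892 PMC3786914 \
  | xargs -n 1 -I {} sh -c 'echo "{}" | cat'
```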

@erlehmann

Does using a read loop fix the problem? If so, I'll add it to the shell script.

./oa-pmc-ids --from 2013-10-14 --until 2013-10-15 | tr ' ' '\n' | while read -r; do echo "$REPLY" | ./oami_pmc_pmcid_import; done

@Daniel-Mietchen
Member Author

Running it now. Looks good so far.

@Daniel-Mietchen
Member Author

No problem so far.
Running

./oa-cache clear-database pmc_pmcid | ./oa-pmc-ids --from 2013-10-04 --until 2013-10-15 | tr ' ' '\n' | while read -r; do echo "$REPLY" | ./oami_pmc_pmcid_import; ./oa-cache clear-database pmc_pmcid; done

now.

@Daniel-Mietchen
Member Author

Got stuck with
#18

Removing “/home/danielmietchen/.local/share/open-access-media-importer/pmc_pmcid.sqlite” … done.
Input PMCIDs, delimited by whitespace: Removing “/home/danielmietchen/.cache/open-access-media-importer/metadata/raw/pmc_pmcid/efetch.fcgi0” … done.
Downloading “http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC3783376”, saving into directory “/home/danielmietchen/.cache/open-access-media-importer/metadata/raw/pmc_pmcid” …
100% |#########################################################################################################################################|
PLOS ONE 2013
        Vocal Recruitment for Joint Travel in Wild Chimpanzees
/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:463: SAWarning: Unicode type received non-unicode bind param value.
  param.append(processors[key](compiled_params[key]))
“Vocal Recruitment for Joint Travel in Wild Chimpanzees”:
        2 × /

Checking MIME types …
DOI 10.1371/journal.pone.0076073, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3783376/bin/pone.0076073.s001.wmv, source claimed / but is video/x-ms-asf.
/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:463: SAWarning: Unicode type received non-unicode bind param value.
  param.append(processors[key](compiled_params[key]))
2 of 2 100% |###################################################################################################################| Time: 00:00:02
DOI 10.1371/journal.pone.0076073, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3783376/bin/pone.0076073.s002.pdf, source claimed / but is application/pdf.
Skipping download of <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3783376/bin/pone.0076073.s001.wmv>.
Converting “/home/danielmietchen/.cache/open-access-media-importer/media/raw/pmc_pmcid/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3783376%2Fbin%2Fpone.0076073.s001.wmv”, saving into “/home/danielmietchen/.cache/open-access-media-importer/media/refined/pmc_pmcid/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3783376%2Fbin%2Fpone.0076073.s001.wmv.ogg” …   99% |##                    

so we will perhaps have to use timeout.
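A sketch of the timeout idea, assuming GNU coreutils `timeout`, which exits with status 124 when it has to kill the command; `sleep 5` stands in for a hung `./oami_pmc_pmcid_import` run:

```shell
# Kill the command after 1 second; timeout reports this with exit
# status 124, which the caller can check to log the stuck PMCID
# and continue with the next one.
timeout 1 sleep 5
if [ $? -eq 124 ]; then
  echo "timed out"
fi
```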

@Daniel-Mietchen
Member Author

Just set up a cron job

22 7 * * * sh oami_pmc_pmcid_import.sh

with oami_pmc_pmcid_import.sh
consisting of

#!/bin/bash

TIMEOUTFILE=timeout_pmcid.txt

# clear cache
./oa-cache clear-database pmc_pmcid

for pmcid in $(./oa-pmc-ids --from "$(date +"%F" -d '2 days ago')" --until "$(date +"%F")"); do
  date
  timeout 6h sh -c "echo $pmcid | ./oami_pmc_pmcid_import"
  if [[ $? == 124 ]]; then
    echo "------------------ Timed out! --------------------"
    echo "$pmcid" >> "$TIMEOUTFILE"
  fi
  ./oa-cache clear-database pmc_pmcid
done

@erlehmann

Daniel, if I put the above code into the repository, can the bug be closed?

@Daniel-Mietchen
Member Author

I am now using a cronjob

12 6 * * * cd ~/open-access-media-importer; sh oami_pmc_pmcid_import.sh | tee -a oami_pmc_pmcid_import.tee

where oami_pmc_pmcid_import.sh is

#!/bin/bash

TIMEOUTFILE=timeout_pmcid.txt

# clear cache
./oa-cache clear-database pmc_pmcid

for pmcid in $(./oa-pmc-ids --from "$(date +"%F" -d '3 days ago')" --until "$(date +"%F")"); do
  date
  timeout 6h sh -c "echo $pmcid | ./oami_pmc_pmcid_import"
  if [[ $? == 124 ]]; then
    echo "------------------ Timed out! --------------------"
    echo "$pmcid" >> "$TIMEOUTFILE"
  fi
  ./oa-cache clear-database pmc_pmcid
done

This works fine, so there is no need for your workaround in #111 (comment), and I am closing this issue.

@erlehmann

For proper logging (redirecting stderr as well, so error output also reaches the log):

12 6 * * * cd ~/open-access-media-importer; sh oami_pmc_pmcid_import.sh 2>&1 | tee -a oami_pmc_pmcid_import.tee
