phylogeny-data
plaintext-conversions
2textANDimages.sh
LICENSE
List-of-CC-BY-Phytotaxa-PDF-urls.txt
README.md
scrape_Phytotaxa.sh
uploadr.ini.bak
uploadr.py

LeidenPDFhack

Extract and release open (CC BY) biodiversity images from non-PMC biodiversity journals, starting with Phytotaxa.

Inspired by Daniel Mietchen's Open Access Media Importer, which releases media content from PMC-indexed academic journals.

Starting off with the CC BY content from Phytotaxa and Zootaxa.

1.) scrape journal links for OA PDF content

wget journal contents page e.g.: http://www.mapress.com/phytotaxa/content.htm

Regular expression: content/..../...........

parse out all the OA urls e.g.: http://www.mapress.com/phytotaxa/content/2014/f/pt00162p216.pdf

'/f/' in the path indicates a freely available PDF

compile as link-list and wget all free PDFs
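Step 1 could be sketched in Python roughly as follows. This is only an illustration: the function name `extract_oa_pdf_urls` is hypothetical, and the pattern assumes every free PDF link has the shape `content/<year>/f/<name>.pdf`, as in the example URL above. Fetching the contents page itself is left to wget.

```python
import re
from urllib.parse import urljoin

# Assumed link shape: content/<4-digit year>/f/<name>.pdf ('/f/' marks free PDFs).
OA_PDF_PATTERN = re.compile(r"content/\d{4}/f/[A-Za-z0-9_]+\.pdf")


def extract_oa_pdf_urls(html, base_url):
    """Return absolute URLs of free PDFs in page order, without duplicates."""
    urls = []
    for path in OA_PDF_PATTERN.findall(html):
        url = urljoin(base_url, path)  # resolve relative to the contents page
        if url not in urls:
            urls.append(url)
    return urls
```

Writing the returned URLs one per line to a text file gives a link list that `wget -i` can consume.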

2.) pdftotext all PDFs to facilitate text-parsing

check licencing of each: grep -i 'creative commons' *.txt

For Phytotaxa, CC BY-licenced 'free' content starts from pt00093p039.pdf onwards
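A minimal sketch of step 2, assuming the `pdftotext` command-line tool is installed and on the PATH. The helper names `pdf_to_text` and `mentions_creative_commons` are hypothetical. The licence check is similar to the grep above, except that it also tolerates a line break between the two words.

```python
import re
import subprocess
from pathlib import Path


def pdf_to_text(pdf_path):
    """Run pdftotext on one PDF and return the path of the .txt it writes."""
    txt_path = Path(pdf_path).with_suffix(".txt")
    subprocess.run(["pdftotext", str(pdf_path), str(txt_path)], check=True)
    return txt_path


def mentions_creative_commons(text):
    """Case-insensitive search for 'creative commons' in extracted text."""
    # \s+ also matches a newline, which a line-based grep would miss.
    return re.search(r"creative\s+commons", text, re.IGNORECASE) is not None
```

A match only shows that the phrase appears somewhere in the article; the exact licence (CC BY or another variant) would still need checking.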

3.) parse out the DOI of each article and rename each PDF by its partial DOI

grep -i -m1 '10.11646/phytotaxa' *.txt | cut -c 28-

Among the freely accessible PDFs, Phytotaxa DOIs start with http://dx.doi.org/10.11646/phytotaxa.76.1.2

pass each DOI to Crossref content negotiation to get the full citation for each PDF
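Step 3 might look like the sketch below. Several things here are assumptions: that Phytotaxa DOIs always have three numeric parts like the example above, that "partial DOI" means the part after `10.11646/`, and that the citation is requested through the `https://doi.org/` resolver with a `text/x-bibliography` Accept header. All function names are hypothetical.

```python
import re
import urllib.request

# Assumed DOI shape, based on the single example 10.11646/phytotaxa.76.1.2.
DOI_PATTERN = re.compile(r"10\.11646/phytotaxa\.\d+\.\d+\.\d+", re.IGNORECASE)


def extract_doi(text):
    """Return the first Phytotaxa DOI found in the text, or None."""
    match = DOI_PATTERN.search(text)
    return match.group(0).lower() if match else None


def partial_doi(doi):
    """Drop the '10.11646/' prefix, leaving e.g. 'phytotaxa.76.1.2'."""
    return doi.split("/", 1)[1]


def fetch_citation(doi, style="apa"):
    """Ask the DOI resolver for a formatted citation via content negotiation."""
    request = urllib.request.Request(
        "https://doi.org/" + doi,
        headers={"Accept": "text/x-bibliography; style=" + style},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

Unlike `cut -c 28-`, the regular expression does not depend on the length of the file name that grep prints before each match.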

4.) use pdfimages to strip the images out of each article

pdfimages -j file.pdf foobar

5.) delete image files of 480 bytes to 13k (the Phytotaxa logos in each paper)

.ppm .pbm .jpg ONLY
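Steps 4 and 5 could be combined along these lines. It is a sketch under assumptions: `pdfimages` is installed, "13k" is read as 13 KiB (13312 bytes), and the helper names are hypothetical. Only `.ppm`, `.pbm` and `.jpg` files are ever deleted.

```python
import subprocess
from pathlib import Path

LOGO_EXTENSIONS = {".ppm", ".pbm", ".jpg"}
MIN_LOGO_BYTES = 480
MAX_LOGO_BYTES = 13 * 1024  # assumption: "13k" interpreted as 13 KiB


def extract_images(pdf_path, prefix):
    """Same call as: pdfimages -j file.pdf prefix"""
    subprocess.run(["pdfimages", "-j", str(pdf_path), str(prefix)], check=True)


def delete_probable_logos(directory):
    """Delete image files in the logo size range; return the removed names."""
    removed = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file() or path.suffix.lower() not in LOGO_EXTENSIONS:
            continue
        if MIN_LOGO_BYTES <= path.stat().st_size <= MAX_LOGO_BYTES:
            path.unlink()
            removed.append(path.name)
    return removed
```

A size filter is crude and may also remove small genuine figures, so it is worth reviewing the returned list of names on a first run.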

6.) associate figure caption with each image SOMEHOW???
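One possible starting point for step 6 is to pull candidate captions out of the pdftotext output. This is a heuristic, not a solution. It assumes captions appear as blank-line-separated blocks that begin with "Figure", "FIGURE" or "Fig." followed by a number, and the function name `find_captions` is hypothetical. Matching each caption to the right image file is still open.

```python
import re

# Assumed caption openers; real articles may use other forms.
CAPTION_START = re.compile(r"(?:FIGURES?|Figures?|FIGS?\.|Figs?\.)\s*(\d+)")


def find_captions(text):
    """Map figure number -> caption text for blocks that open like a caption."""
    captions = {}
    for block in re.split(r"\n\s*\n", text):
        flat = " ".join(block.split())  # join wrapped lines into one string
        match = CAPTION_START.match(flat)
        if match:
            # Keep the first block seen for each figure number.
            captions.setdefault(int(match.group(1)), flat)
    return captions
```

Because `match` only tests the start of a block, in-text mentions such as "as shown in Figure 2" are ignored unless they happen to begin a paragraph.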

7.) Make it all a cron job that regularly re-uploads open, attributed images to Flickr/Wikimedia