Skip to content
master
Switch branches/tags
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 

LeidenPDFhack

Extract & release open (CC-BY) biodiversity images from non-PMC biodiversity journals. Starting with Phytotaxa

Inspired by Daniel Mietchen's open access media importer which releases media content from PMC-indexed academic journals.

Starting off with the CC-BY content from Phytotaxa and Zootaxa

1.) scrape journal links for OA PDF content

wget journal contents page e.g.: http://www.mapress.com/phytotaxa/content.htm

Regular expression: content/..../...........

parse out all the OA urls e.g.: http://www.mapress.com/phytotaxa/content/2014/f/pt00162p216.pdf

'/f/' indicates freely available

compile as link-list and wget all free PDFs

2.) pdftotext all PDFs to facilitate text-parsing

check licencing of each: grep -i 'creative commons' *.txt

for Phytotaxa CC BY licenced 'free' content starts from pt00093p039.pdf onwards

3.) parse out DOIs of each article and rename PDF by partial doi

grep -i -m1 '10.11646/phytotaxa' *.txt | cut -c 28-

Phytotaxa doi's have been implemented starting with http://dx.doi.org/10.11646/phytotaxa.76.1.2 (of the freely accessible PDFs)

pass each DOI to crossref content negotiation to get full citation for each PDF

4.) pdfimage strip images out of each article

pdfimages -j file.pdf foobar

5.) delete 480bytes to 13k (phyotaxa logos from each paper)

.ppm .pbm .jpg ONLY

6.) associate figure caption with each image SOMEHOW???

7.) Make it all a cronjob to run regularly, re-uploading open, attributed images to Flickr/Wikimedia on a regular basis

About

Extract & release open (CC-BY) biodiversity images from non-PMC biodiversity journals. Starting with Phytotaxa

Resources

License

Releases

No releases published

Packages

No packages published