In [10]:
import slate
import urllib
import re

In [3]:
furl = urllib.urlopen('http://www.nature.com/ismej/journal/v10/n1/pdf/ismej2015100a.pdf')
# PDFs are binary, so write and read the temporary copy in binary mode
with open( '/tmp/asdfasdfasdf.pdf', 'wb' ) as ftempfile :
    ftempfile.write( furl.read() )
with open( '/tmp/asdfasdfasdf.pdf', 'rb' ) as f :
    doc = slate.PDF(f)

### PDF is garbage

In this example, we are looking for a link to some source code:

[`http://prodege.jgi-psf.org//downloads/src`](http://prodege.jgi-psf.org//downloads/src)

However, in the PDF, the URL is line-wrapped, so the trailing `src` is lost.

In [20]:
urlre = re.compile( r'(?P<url>https?://[^\s]+)' )

for page in doc :
    print urlre.findall( page )

[]
['http://prodege.jgi-psf.org//downloads/', 'http://prodege.jgi-psf.org,']
[]
['http://img.jgi.', 'http://www.nature.com/ismej)']


### PDF is garbage, continued

If we remove line breaks to fix URLs that have been wrapped, we discover
that the visible line breaks in the document do not correspond to actual
line breaks in the represented text. The result is garbage: unrelated
words end up glued onto the ends of the URLs.

In [19]:
urlre = re.compile( r'(?P<url>https?://[^\s]+)' )

for page in doc :
    print urlre.findall( page.replace('\n','') )

[]
['http://prodege.jgi-psf.org//downloads/availablerun', 'http://prodege.jgi-psf.org,which']
[]
['http://img.jgi.Cell', 'http://creativecommons.org/licenses/by/4.0/the', 'http://www.nature.com/ismej)The']
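
Both failure modes can be reproduced without the PDF. The snippet below is only a sketch: `extracted` is a made-up string standing in for what a PDF text extractor might return, and `http://example.org` is a placeholder, not the real link.

```python
import re

# Same idea as the pattern used above: "http(s)://" followed by non-whitespace.
URL_RE = re.compile(r'https?://\S+')

# Hypothetical extractor output: the URL is cut at a line break, and the
# text that follows the break is NOT the rest of the URL.
extracted = "code is at http://example.org/downloads/\navailable and can be run\nsrc"

# Matching line by line truncates the URL at the break.
truncated = URL_RE.findall(extracted)

# Deleting the breaks glues the next word onto the URL instead.
glued = URL_RE.findall(extracted.replace('\n', ''))

print(truncated)
print(glued)
```

Neither variant recovers the full path ending in `src`, which is the same thing the two cells above show for the real document.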


### Nope.

At this point, the author elects to flip a table. 

Let's try looking at the HTML version. I'll swipe some code from
[Dive into Python](http://www.diveintopython.net/) here, because
finding URLs in an HTML document is what is known as a "Solved Problem."

In [21]:
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):                              
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):                     
        href = [v for k, v in attrs if k=='href']  
        if href:
            self.urls.extend(href)
            
def get_urls_from(url):
    usock = urllib.urlopen(url)
    parser = URLLister()
    parser.feed(usock.read())
    usock.close()
    parser.close()
    # keep only links that look absolute (http, ftp or www prefixes)
    return [item for item in parser.urls if item.startswith(('http', 'ftp', 'www'))]
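
`sgmllib` only exists in Python 2; it was removed in Python 3. A rough Python 3 equivalent can be built on the standard library's `html.parser`. This is a sketch, not a drop-in replacement: `LinkCollector` and `urls_in_html` are names invented here, and the function takes an HTML string rather than fetching a URL so that it can be tried offline.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # attrs is a list of (name, value) pairs; value may be None
            self.urls.extend(v for k, v in attrs if k == 'href' and v)


def urls_in_html(html):
    """Return the links in an HTML string that look absolute."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [u for u in parser.urls if u.startswith(('http', 'ftp', 'www'))]


sample = ('<p><a href="http://example.org/src">code</a> '
          '<a href="/about">about</a> '
          '<a href="http://example.org/q?a=1&amp;b=2">query</a></p>')
found = urls_in_html(sample)
print(found)
```

One difference worth knowing about: `html.parser` replaces entity references in attribute values, so a link written as `...&amp;b=2` in the page source comes back as `...&b=2`.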

Here are all the URLs in the document...

In [24]:
urls = get_urls_from('http://www.nature.com/ismej/journal/v10/n1/full/ismej2015100a.html')
urls

['http://www.isme-microbes.org/',
 'http://mts-isme.nature.com/cgi-bin/main.plex',
 'http://ad.doubleclick.net/N285/jump/ismej.nature.com/;abr=!NN2;type=sc;artid=ismej2015100a;issue=1;pos=top;subjmeta=631;subjmeta=208;subjmeta=212;techmeta=45;techmeta=23;sz=728x90;tile=1;ord=123456789?',
 'http://prodege.jgi-psf.org//downloads/src',
 'http://prodege.jgi-psf.org',
 'http://dx.doi.org/10.1089/cmb.2012.0021',
 'http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?holding=npg&amp;cmd=Retrieve&amp;db=PubMed&amp;list_uids=22506599&amp;dopt=Abstract',
 'http://links.isiglobalnet2.com/gateway/Gateway.cgi?&amp;GWVersion=2&amp;SrcAuth=Nature&amp;SrcApp=Nature&amp;DestLinkType=FullRecord&amp;KeyUT=000303811800001&amp;DestApp=WOS_CPL',
 'http://chemport.cas.org/cgi-bin/sdcgi?APP=ftslink&action=reflink&origin=npg&version=1.0&coi=1:CAS:528:DC%2BC38XmsFOmt7k%3D&pissn=1751-7362&pyear=2016&md5=186ecb2f73d6216debcdc0dc83436dfe',
 'http://dx.doi.org/10.1186/1471-2105-10-421',
 'http://www.ncbi.nlm.nih.gov/entre

Bleh. That is mostly links in the references, ads and navigation cruft
from the journal's content mismanagement system. Because their system
is heinously *ad hoc*, there is no common base URL we could filter on.
So, we're forced to use an *ad hoc* exclusion list.

In [32]:
excluded = [ 'http://www.nature.com',
             'http://dx.doi.org',
             'http://www.ncbi.nlm.nih.gov',
             'http://creativecommons.org',
             'https://s100.copyright.com',
             'http://mts-isme.nature.com',
             'http://www.isme-microbes.org',
             'http://ad.doubleclick.net',
             'http://mse.force.com',
             'http://links.isiglobalnet2.com',
             'http://www.readcube.com',
             'http://chemport.cas.org',
             'http://publicationethics.org/',
             'http://www.natureasia.com/'
           ]

def novel_url( url ) :
    for excluded_url in excluded :
        if url.startswith( excluded_url ) :
            return False
    return True

filter( novel_url, urls )

['http://prodege.jgi-psf.org//downloads/src',
 'http://prodege.jgi-psf.org',
 'http://img.jgi.doe.gov/w/doc/SingleCellDataDecontamination.pdf']
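
Prefix matching treats `http://` and `https://` versions of the same site as different things, so each scheme needs its own entry in the list. One possible variation, sketched here for Python 3, is to compare host names instead. `EXCLUDED_HOSTS` and `novel_url_by_host` are hypothetical names, and the set below is a shortened sample rather than the full list above.

```python
from urllib.parse import urlparse

# Shortened, illustrative sample of hosts to ignore (an assumption, not the full list).
EXCLUDED_HOSTS = {'www.nature.com', 'dx.doi.org', 'ad.doubleclick.net'}


def novel_url_by_host(url, excluded_hosts=EXCLUDED_HOSTS):
    """Return True when the URL's host is not on the exclusion list."""
    host = urlparse(url).hostname or ''
    return host.lower() not in excluded_hosts


candidates = ['http://prodege.jgi-psf.org//downloads/src',
              'https://www.nature.com/ismej',
              'http://dx.doi.org/10.1089/cmb.2012.0021']
kept = [u for u in candidates if novel_url_by_host(u)]
print(kept)
```

The trade-off is that host matching cannot exclude only part of a site, which the prefix list can.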

Much better. Now, let's see if these exist...

In [33]:
import requests

for url in filter( novel_url, urls ) :
    # a timeout keeps one unresponsive server from hanging the loop
    response = requests.get( url, timeout=30 )
    if response.status_code == 200:
        print 'Good : ', url
    else:
        print 'Fail : ', url

Good :  http://prodege.jgi-psf.org//downloads/src
Good :  http://prodege.jgi-psf.org
Fail :  http://img.jgi.doe.gov/w/doc/SingleCellDataDecontamination.pdf
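
A dead link can also show up as an exception rather than a status code, for instance when the host no longer resolves. The sketch below counts those as failures too. `check_links` is a hypothetical helper; it accepts any `get` callable with the same shape as `requests.get`, which is an assumption made so the logic can be exercised without network access.

```python
def check_links(urls, get=None, timeout=10):
    """Map each URL to 'Good' or a 'Fail (...)' string describing why."""
    if get is None:
        # Imported lazily so the function can be tested with a stand-in.
        import requests
        get = requests.get
    results = {}
    for url in urls:
        try:
            status = get(url, timeout=timeout).status_code
        except IOError as error:
            # requests' exceptions (connection errors, timeouts) derive from IOError
            results[url] = 'Fail (%s)' % type(error).__name__
        else:
            results[url] = 'Good' if status == 200 else 'Fail (HTTP %d)' % status
    return results
```

With `requests` installed, `check_links(filter(novel_url, urls))` would cover the same URLs as the loop above while also recording why each failure happened.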


Looks like this will work, though we'll need to create a hand-curated list of
excluded URLs. Otherwise, the counts of dead links could be badly skewed by
any issues within the journal's content mismanagement system, ad servers and
other irrelevant crud.