# Build combined bibliography
We have two sets of publication DOIs, one per collection (UMMZ and USNM). For each DOI we first request a RIS file through Crossref. We then bulk upload the RIS files manually to a citation manager (Paperpile), which lets us download the PDFs through a U-M Library proxy.

1. Add flag for UMMZ or USNM
2. Assign each paper a project-specific UUID
3. Use each paper's DOI to get a RIS file from Crossref
4. Get metadata and PDF for each paper with Paperpile
5. Match paper file name to metadata record
6. Batch rename each PDF with its UUID
7. Convert PDF to JSON using GROBID

**In**: UMMZ Bibliography [spreadsheet](https://docs.google.com/spreadsheets/d/1aK8KfREk6M3BTma-Gt-jlfrO6yS-yUC2nDTP9vY4aM0/edit#gid=1768615789); USNM Bibliography [spreadsheet](https://docs.google.com/spreadsheets/d/19HmbPDMdimm_gLkmHeiaOntT6rJahCNNqp6Xg-ePG-E/edit#gid=608332527)

**Out**: UMMZ Bibliography RIS files and full text PDFs (`/nfs/turbo/isr-slafia/specimen/`)

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import uuid
import random
from habanero import Crossref, cn
import requests
import os
import re
import glob
import fuzzymatcher
from fuzzymatcher import link_table, fuzzy_left_join

Load bibliographies. Deduplicate by DOI. Add a flag for each collection. Drop fields not needed for search.

In [2]:
# Load the USNM bibliography and keep one row per DOI
df_usnm = pd.read_csv('/nfs/turbo/isr-slafia/specimen/usnm_bib.csv').drop_duplicates(subset=['doi'])
df_usnm = df_usnm[['doi']] # keep only the field needed for search
df_usnm['bib'] = 'USNM' # flag the source collection
df_usnm.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1665 entries, 0 to 1664
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   doi     1665 non-null   object
 1   bib     1665 non-null   object
dtypes: object(2)
memory usage: 39.0+ KB


In [3]:
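# Load the UMMZ bibliography, keep one row per DOI, and align the column name with USNM
df_ummz = pd.read_csv('/nfs/turbo/isr-slafia/specimen/ummz_bib.csv').drop_duplicates(subset=['DOIs']).rename(columns={"DOIs":"doi"})
df_ummz = df_ummz[['doi']] # keep only the field needed for search
df_ummz['bib'] = 'UMMZ' # flag the source collection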
df_ummz.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 536 entries, 0 to 1111
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   doi     535 non-null    object
 1   bib     536 non-null    object
dtypes: object(2)
memory usage: 12.6+ KB


Make a combined dataframe for search, with a unique identifier per paper for naming files. The random generator is seeded so the same UUIDs are produced on every run.

In [4]:
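# Seed a dedicated generator so the UUIDs are reproducible across runs
rd = random.Random()
rd.seed(0)

df = pd.concat([df_usnm, df_ummz], axis=0).replace(["NaN"], np.nan).dropna()
# Build one UUID per row from 128 seeded random bits instead of patching uuid.uuid4
df['uuid'] = [uuid.UUID(int=rd.getrandbits(128)) for _ in range(len(df))]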
df.to_csv('/nfs/turbo/isr-slafia/specimen/build_bibliography/build_bibliography.csv',index=False)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2200 entries, 0 to 1111
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   doi     2200 non-null   object
 1   bib     2200 non-null   object
 2   uuid    2200 non-null   object
dtypes: object(3)
memory usage: 68.8+ KB
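
The seeded generator makes the UUID assignment repeatable: rerunning on the same rows in the same order yields the same file names. A minimal standalone sketch of that idea follows. `make_seeded_uuids` is a hypothetical helper, not part of this notebook, and it takes the seed as a parameter.

```python
import random
import uuid


def make_seeded_uuids(n, seed=0):
    """Return n UUIDs built from a seeded generator (hypothetical helper)."""
    rd = random.Random(seed)
    # Each UUID comes from 128 pseudo-random bits, so the sequence is fully
    # determined by the seed and by the order in which rows are processed.
    return [uuid.UUID(int=rd.getrandbits(128)) for _ in range(n)]


first_run = make_seeded_uuids(3)
second_run = make_seeded_uuids(3)
print(first_run == second_run)  # same seed, same UUIDs
```

Because the sequence depends on row order, adding, removing, or reordering DOIs changes which UUID each paper receives.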


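Get a RIS file from Crossref for each publication. Each file is saved under the paper's UUID, and DOIs that cannot be resolved are collected and returned.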

In [5]:
testing=0

if testing==1:
    df=df[:10]

cr = Crossref(mailto='slafia@umich.edu') # contact email for Crossref; habanero's first positional argument is base_url, so pass mailto by keyword

def search_crossref(df):
    crf_success = []
    crf_failure = []
    
    for i in tqdm(range(len(df))):
        try:
            r = cn.content_negotiation(ids = df.doi.iloc[i], format = "ris")
            with open("/nfs/turbo/isr-slafia/specimen/build_bibliography_ris/" + str(df.uuid.iloc[i]) + ".ris", 'w') as outfile:
                outfile.write(r)
                crf_success.append(df.doi.iloc[i])
        except requests.exceptions.HTTPError as errh:
            print("HTTPError:", errh)
            crf_failure.append(df.doi.iloc[i])
            pass
        except requests.exceptions.ConnectionError as errc:
            print("ConnectionError:", errc)
            crf_failure.append(df.doi.iloc[i])
            pass
        except requests.exceptions.Timeout as errt:
            print("Timeout:",errt)
            crf_failure.append(df.doi.iloc[i])
            pass
        except requests.exceptions.RequestException as err:
            print("Other error:",err)
            crf_failure.append(df.doi.iloc[i])
            pass
        
    print(len(crf_success), "RIS found")
    print(len(crf_failure), "RIS not found")
    
    return crf_failure
    
search_crossref(df)

  7%|▋         | 162/2200 [01:10<13:18,  2.55it/s]

HTTPError: 404 Client Error: Not Found for url: https://doi.org/10.9784/LEB4(3)SantiagoBlay01


  8%|▊         | 184/2200 [01:19<14:04,  2.39it/s]

HTTPError: 404 Client Error: Not Found for url: https://doi.org/10.9784/LEB4(3)Martins.01


 10%|█         | 221/2200 [01:35<13:40,  2.41it/s]

HTTPError: 404 Client Error: Not Found for url: https://doi.org/10.9784/LEB3(4)Lambert.01


 15%|█▌        | 333/2200 [02:24<10:21,  3.00it/s]

HTTPError: 404 Client Error: Not Found for url: https://doi.org/10.9784/LEB3(4)Santiago-Blay.03


 26%|██▌       | 572/2200 [04:08<09:04,  2.99it/s]

HTTPError: 404 Client Error: Not Found for url: https://doi.org/10.3897/zookeys.301.5081ZooKey


 31%|███       | 682/2200 [04:55<08:34,  2.95it/s]

HTTPError: 404 Client Error: Not Found for url: https://doi.org/10.600/036364412X616701


 35%|███▌      | 779/2200 [05:37<08:59,  2.63it/s]

HTTPError: 404 Client Error: Not Found for url: https://doi.org/10.1163/156854012


 45%|████▍     | 981/2200 [07:00<06:43,  3.02it/s]

HTTPError: 404 Client Error: Not Found for url: https://doi.org/10.4289/0013-8797.112.1.32


 76%|███████▌  | 1664/2200 [11:55<03:26,  2.60it/s]

HTTPError: 404 Client Error: Not Found for url: https://doi.org/10.5479/si.00775630.495


 86%|████████▋ | 1898/2200 [13:32<01:37,  3.10it/s]

HTTPError: 404 Client Error: Not Found for url: https://doi.org/10.1101/gr.1796204.


 92%|█████████▏| 2025/2200 [14:24<01:03,  2.75it/s]

HTTPError: 400 Client Error: Bad Request for url: https://doi.org/0.2307/2424624


 95%|█████████▍| 2083/2200 [14:48<00:37,  3.12it/s]

HTTPError: 404 Client Error: Not Found for url: https://doi.org/10.1016/S0002-8223(21)29159-1


100%|█████████▉| 2196/2200 [15:34<00:01,  2.74it/s]

HTTPError: 404 Client Error: Not Found for url: https://doi.org/10.1093/condor/17.1.60a


100%|██████████| 2200/2200 [15:36<00:00,  2.35it/s]

2187 RIS found
13 RIS not found





['10.9784/LEB4(3)SantiagoBlay01',
 '10.9784/LEB4(3)Martins.01',
 '10.9784/LEB3(4)Lambert.01',
 '10.9784/LEB3(4)Santiago-Blay.03',
 '10.3897/zookeys.301.5081ZooKey',
 '10.600/036364412X616701',
 '10.1163/156854012',
 '10.4289/0013-8797.112.1.32',
 '10.5479/si.00775630.495',
 '10.1101/gr.1796204.',
 '0.2307/2424624',
 '10.1016/S0002-8223(21)29159-1',
 '10.1093/condor/17.1.60a']
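
Some of the DOIs above look malformed rather than unregistered: `0.2307/2424624` lacks the leading `1`, `10.600/036364412X616701` has a three-digit prefix, and `10.1101/gr.1796204.` ends in a period. A rough pre-check could flag such entries before querying. The sketch below assumes the common `10.<4 to 9 digits>/<suffix>` shape. `looks_like_doi` is a hypothetical helper, and passing the check does not mean a DOI resolves.

```python
import re

# Assumed shape: "10.", a 4-9 digit registrant prefix, "/", and a non-empty suffix
DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")


def looks_like_doi(doi):
    """Return True if the string has the usual DOI shape (hypothetical helper)."""
    doi = doi.strip()
    if doi.endswith("."):
        # A trailing period is usually sentence punctuation copied with the DOI
        return False
    return bool(DOI_SHAPE.match(doi))


# Three entries taken from the failure list above
for doi in ["0.2307/2424624", "10.600/036364412X616701", "10.1101/gr.1796204."]:
    print(doi, looks_like_doi(doi))
```

Entries that fail the check are candidates for manual correction in the source spreadsheets. Entries that pass but still return 404, such as `10.1163/156854012`, need to be looked up by hand.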

Join results from Paperpile on original dataframe
- Remove records that do not have a DOI or Title

In [7]:
df_original = pd.read_csv('/nfs/turbo/isr-slafia/specimen/build_bibliography/build_bibliography.csv')
df_original = df_original.rename(columns={'doi':'DOI'}).replace(["NaN"], np.nan).dropna()
df_original['DOI'] = df_original['DOI'].astype('str')
df_original.info()

Unnamed: 0,DOI,bib,uuid
0,10.1600/036364417X694926,USNM,e3e70682-c209-4cac-629f-6fbed82c07cd
1,10.1073/pnas.1703658114,USNM,f728b4fa-4248-5e3a-0a5d-2f346baa9455
2,10.1098/rspb.2017.1803,USNM,eb1167b3-67a9-c378-7c65-c1e582e2e662
3,10.11646/zootaxa.4306.2.7,USNM,f7c1bd87-4da5-e709-d471-3d60c8a70639
4,10.5252/z2017n2a1,USNM,e443df78-9558-867f-5ba9-1faf7a024204
...,...,...,...
2195,10.1093/condor/17.1.60a,UMMZ,b686db4a-bfdc-7072-6e23-d73a2913689e
2196,10.1037/h0070810,UMMZ,c57a5c70-ecdc-67dc-b453-7473e8f6fee3
2197,10.1126/science.35.908.834,UMMZ,4cfbe8d1-32e0-1da3-bb65-5d51dde97098
2198,10.2307/4071134,UMMZ,9427ad92-882d-ebde-f346-832e9aeebe00


In [8]:
df_paperpile = pd.read_csv('/nfs/turbo/isr-slafia/specimen/build_bibliography/build_bibliography_paperpile.csv')
df_paperpile['DOI'] = df_paperpile['DOI'].replace(["NaN"], np.nan)
df_paperpile = df_paperpile.dropna(subset=['Title', 'DOI']) # remove records without a title or DOI
df_paperpile['DOI'] = df_paperpile['DOI'].astype('str')
print(f"Unique papers: {df_paperpile.DOI.nunique()}")
df_paperpile.info()

Unique papers: 2175
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2179
Data columns (total 38 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Item type          2175 non-null   object 
 1   Authors            2159 non-null   object 
 2   Editors            3 non-null      object 
 3   Title              2175 non-null   object 
 4   Journal            2167 non-null   object 
 5   Full journal       1963 non-null   object 
 6   Publication year   2174 non-null   float64
 7   Volume             2139 non-null   object 
 8   Issue              1875 non-null   object 
 9   Pages              2078 non-null   object 
 10  Folders filed in   2175 non-null   object 
 11  Labels filed in    0 non-null      float64
 12  Publisher          2175 non-null   object 
 13  Address            5 non-null      object 
 14  Book title         5 non-null      object 
 15  Proceedings title  2 non-null      object 
 16  Date

Inner merge on DOI, keeping only papers that appear in both the original bibliography and the Paperpile export.

In [9]:
df_merge = pd.merge(df_original, df_paperpile, how='inner', on='DOI')
df_merge.sample(10)
df_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1893 entries, 0 to 1892
Data columns (total 40 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   DOI                1893 non-null   object 
 1   bib                1893 non-null   object 
 2   uuid               1893 non-null   object 
 3   Item type          1893 non-null   object 
 4   Authors            1881 non-null   object 
 5   Editors            3 non-null      object 
 6   Title              1893 non-null   object 
 7   Journal            1885 non-null   object 
 8   Full journal       1730 non-null   object 
 9   Publication year   1892 non-null   float64
 10  Volume             1858 non-null   object 
 11  Issue              1607 non-null   object 
 12  Pages              1814 non-null   object 
 13  Folders filed in   1893 non-null   object 
 14  Labels filed in    0 non-null      float64
 15  Publisher          1893 non-null   object 
 16  Address            5 non
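
The merge compares DOI strings exactly, and the merged table has 1,893 rows, fewer than the 2,175 Paperpile records. DOI names are case-insensitive, so differences in letter case, stray whitespace, or a resolver prefix would be enough to prevent a match. Whether that explains any of the lost rows here is an assumption that has not been checked. The sketch below only shows one way to normalize DOIs before merging, and `normalize_doi` is a hypothetical helper.

```python
# Prefixes that sometimes precede a bare DOI (assumed list, extend as needed)
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi):
    """Lowercase a DOI and strip whitespace and resolver prefixes (hypothetical helper)."""
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()


print(normalize_doi(" https://doi.org/10.1600/036364417X694926 "))
```

In the notebook this could be applied to both columns, for example `df_original['DOI'].map(normalize_doi)`, before calling `pd.merge`.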

Check overlap between bibliographies

In [10]:
df_merge[df_merge.duplicated(subset=['DOI'],keep=False)]

Unnamed: 0,DOI,bib,uuid,Item type,Authors,Editors,Title,Journal,Full journal,Publication year,...,Notes,Copyright,Affiliation,Language,Sub-type,Page count,Dataset name,Dataset author(s),Dataset URL,Dataset DOI
122,10.1007/s13364-016-0289-6,USNM,9b8b71a1-b38a-05fb-f611-64cebfc74ca9,Journal Article,"Woodman N,Timm RM",,A new species of small-eared shrew in the Cryp...,Mammal Research,,2016.0,...,,,,,,,,,,
123,10.1007/s13364-016-0289-6,UMMZ,45224c79-9cf5-2508-aac4-87b171db0784,Journal Article,"Woodman N,Timm RM",,A new species of small-eared shrew in the Cryp...,Mammal Research,,2016.0,...,,,,,,,,,,
351,10.1126/science.344.6186.814,USNM,fea2a33a-51d1-1bcd-5a23-754bef38d426,Commentary,"Rocha LA,Aleixo A,Allen G,Almeda F,Baldwin CC,...",,Specimen collection: An essential tool,Science,Science,2014.0,...,,,"California Academy of Sciences, San Francisco,...",en,Commentary,,,,,
352,10.1126/science.344.6186.814,UMMZ,3f41b5b5-4b62-50da-e645-eef27c6d07e3,Commentary,"Rocha LA,Aleixo A,Allen G,Almeda F,Baldwin CC,...",,Specimen collection: An essential tool,Science,Science,2014.0,...,,,"California Academy of Sciences, San Francisco,...",en,Commentary,,,,,
787,10.2988/11-11.1,USNM,1aac3ca1-920c-904e-4063-d5251152797e,Journal Article,Woodman N,,Nomenclatural notes and identification of smal...,Proceedings of the Biological Society of Washi...,Proceedings of the Biological Society of Washi...,2011.0,...,,,,en,,,,,,
788,10.2988/11-11.1,UMMZ,994d799a-9922-5957-ee70-5259262ae4c4,Journal Article,Woodman N,,Nomenclatural notes and identification of smal...,Proceedings of the Biological Society of Washi...,Proceedings of the Biological Society of Washi...,2011.0,...,,,,en,,,,,,


Add a new column for a processed title

In [11]:
def preprocess_text(text):
    res = str(re.sub(r'[^\w\s]', '', text.lower())).strip() # lowercase, remove punctuation, trim surrounding whitespace
    return res

df_merge['clean_title'] = df_merge['Title'].apply(preprocess_text)
df_merge

Unnamed: 0,DOI,bib,uuid,Item type,Authors,Editors,Title,Journal,Full journal,Publication year,...,Copyright,Affiliation,Language,Sub-type,Page count,Dataset name,Dataset author(s),Dataset URL,Dataset DOI,clean_title
0,10.1073/pnas.1703658114,USNM,f728b4fa-4248-5e3a-0a5d-2f346baa9455,Journal Article,Piperno DR,,Assessing elements of an extended evolutionary...,Proceedings of the National Academy of Sciences,,2017.0,...,,,,,,,,,,assessing elements of an extended evolutionary...
1,10.1098/rspb.2017.1803,USNM,eb1167b3-67a9-c378-7c65-c1e582e2e662,Journal Article,"Segar ST,Volf M,Isua B,Sisol M,Redmond CM,Rosa...",,Variably hungry caterpillars: predictive model...,Proceedings of the Royal Society B: Biological...,,2017.0,...,,,,,,,,,,variably hungry caterpillars predictive models...
2,10.11646/zootaxa.4306.2.7,USNM,f7c1bd87-4da5-e709-d471-3d60c8a70639,Journal Article,"Landschoff J,Lemaitre R",,Crossing the Indian Ocean: a range extension f...,Zootaxa,Zootaxa,2017.0,...,,,,,,,,,,crossing the indian ocean a range extension fo...
3,10.5252/z2017n2a1,USNM,e443df78-9558-867f-5ba9-1faf7a024204,Journal Article,"Lemaitre R,Felder DL,Poupin J",,Discovery of a new micro-pagurid fauna (Crusta...,Zoosystema,Zoosystema,2017.0,...,,,,,,,,,,discovery of a new micropagurid fauna crustace...
4,10.3389/fmicb.2017.00618,USNM,23a7711a-8133-2876-37eb-dcd9e87a1613,Journal Article,"Meyer JL,Paul VJ,Raymundo LJ,Teplitski M",,Comparative Metagenomics of the Polymicrobial ...,Front. Microbiol.,Frontiers in microbiology,2017.0,...,,,,,,,,,,comparative metagenomics of the polymicrobial ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1888,10.2307/1362845,UMMZ,da09fb60-2d27-dd76-86e0-c466b68c76f8,Journal Article,Dice LR,,Habits of the Magpie in Southeastern Washington,Condor,The Condor,1917.0,...,,,,,,,,,,habits of the magpie in southeastern washington
1889,10.1037/h0070810,UMMZ,c57a5c70-ecdc-67dc-b453-7473e8f6fee3,Journal Article,Dice LR,,The factors determining the vertical movements...,J. Exp. Psychol. Anim. Behav. Process.,Journal of experimental psychology. Animal beh...,1914.0,...,,,,,,,,,,the factors determining the vertical movements...
1890,10.1126/science.35.908.834,UMMZ,4cfbe8d1-32e0-1da3-bb65-5d51dde97098,Journal Article,Dice LR,,Color Variations of the House Mouse in California,Science,Science,1912.0,...,,,en,Research Article,,,,,,color variations of the house mouse in california
1891,10.2307/4071134,UMMZ,9427ad92-882d-ebde-f346-832e9aeebe00,Journal Article,Dice LR,,New Records for the State of Washington,Auk,The Auk,1910.0,...,,,,,,,,,,new records for the state of washington
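
As a worked example of the cleaning step, the snippet below repeats the function and applies it to two shortened titles from the table above. Hyphenated words are joined rather than split, and whitespace inside the title is left as it is.

```python
import re


def preprocess_text(text):
    # Lowercase, drop every character that is not a word character or
    # whitespace, then trim whitespace at both ends
    return re.sub(r'[^\w\s]', '', text.lower()).strip()


print(preprocess_text("Crossing the Indian Ocean: a range extension"))
# crossing the indian ocean a range extension
print(preprocess_text("Discovery of a new micro-pagurid fauna"))
# discovery of a new micropagurid fauna
```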


Make a dataframe of the current PDF filenames from Paperpile.

In [37]:
path = r'/nfs/turbo/isr-slafia/specimen/build_bibliography_pdf/'
files = glob.glob(os.path.join(path, "*.pdf"))

all_files = list(filter(lambda file: os.stat(file).st_size > 1, files)) #filter empty files

paths = []
files = []
authors = []
titles = []

for file in all_files:
    filename = os.path.basename(file)
    stem = os.path.splitext(filename)[0] # drop the .pdf extension
    # Paperpile names files "Author Year - Title.pdf"; split on the first " - " only,
    # so hyphens inside author names or titles are preserved
    author, _, title = stem.partition(" - ")
    paths.append(file)
    files.append(filename)
    authors.append(author)
    titles.append(title)

df_filenames = pd.DataFrame(list(zip(paths, files, authors, titles)),
               columns =['path', 'file', 'author', 'title'])

df_filenames['clean_filename'] = df_filenames['file'].apply(preprocess_text)
df_filenames['clean_title'] = df_filenames['title'].apply(preprocess_text)
df_filenames

Unnamed: 0,path,file,author,title,clean_filename,clean_title
0,/nfs/turbo/isr-slafia/specimen/build_bibliogra...,Burt et al. 1971 - Mammals of Pennsylvania.pdf,Burt et al. 1971,Mammals of Pennsylvania,burt et al 1971 mammals of pennsylvaniapdf,mammals of pennsylvania
1,/nfs/turbo/isr-slafia/specimen/build_bibliogra...,Pogue 2013 - A review of the Paectes arcigera ...,Pogue 2013,A review of the Paectes arcigera species comp...,pogue 2013 a review of the paectes arcigera s...,a review of the paectes arcigera species compl...
2,/nfs/turbo/isr-slafia/specimen/build_bibliogra...,Pyenson and Sponberg 2011 - Reconstructing Bod...,Pyenson and Sponberg 2011,Reconstructing Body Size in Extinct Crown Ce ...,pyenson and sponberg 2011 reconstructing body...,reconstructing body size in extinct crown ce ...
3,/nfs/turbo/isr-slafia/specimen/build_bibliogra...,Rankin et al. 2019 - Contrasting consequences ...,Rankin et al. 2019,Contrasting consequences of historical climat...,rankin et al 2019 contrasting consequences of...,contrasting consequences of historical climate...
4,/nfs/turbo/isr-slafia/specimen/build_bibliogra...,Oliveira et al. 2011 - Phylogenetic relationsh...,Oliveira et al. 2011,Phylogenetic relationships within the specios...,oliveira et al 2011 phylogenetic relationship...,phylogenetic relationships within the speciose...
...,...,...,...,...,...,...
2553,/nfs/turbo/isr-slafia/specimen/build_bibliogra...,Wagner et al. 2014 - Revision of endemic Marqu...,Wagner et al. 2014,Revision of endemic Marquesas Islands Bidens ...,wagner et al 2014 revision of endemic marques...,revision of endemic marquesas islands bidens a...
2554,/nfs/turbo/isr-slafia/specimen/build_bibliogra...,Zelditch et al. 2008 - Building Developmental ...,Zelditch et al. 2008,Building Developmental Integration into Funct...,zelditch et al 2008 building developmental in...,building developmental integration into functi...
2555,/nfs/turbo/isr-slafia/specimen/build_bibliogra...,"Olson 2012 - History, Structure, Evolution, Be...",Olson 2012,"History, Structure, Evolution, Behavior, Dist...",olson 2012 history structure evolution behavi...,history structure evolution behavior distribut...
2556,/nfs/turbo/isr-slafia/specimen/build_bibliogra...,Peterson et al. 2010 - A classification of the...,Peterson et al. 2010,A classification of the Chloridoideae (Poacea...,peterson et al 2010 a classification of the c...,a classification of the chloridoideae poaceae ...


Fuzzy-match each PDF filename to a metadata record on the cleaned title, then keep only matches with a positive score.

In [77]:
left_on = ["clean_title"]
right_on = ["clean_title"]

df = fuzzymatcher.fuzzy_left_join(df_filenames, df_merge, left_on, right_on) # link on the closest match
df = df[df.best_match_score > 0] # keep records with a positive match score
df = df[['uuid', 'bib', 'Title', 'Authors', 'Journal', 'Item type', 'Publication year', 'DOI']] # simplify metadata
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1742 entries, 0 to 88371
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   uuid              1742 non-null   object 
 1   bib               1742 non-null   object 
 2   Title             1742 non-null   object 
 3   Authors           1735 non-null   object 
 4   Journal           1734 non-null   object 
 5   Item type         1742 non-null   object 
 6   Publication year  1742 non-null   float64
 7   DOI               1742 non-null   object 
dtypes: float64(1), object(7)
memory usage: 187.0+ KB
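
A title taken from a file name may not equal the metadata title exactly, for instance when it is shortened, which is why the join is fuzzy rather than exact. To illustrate the idea without `fuzzymatcher`, the sketch below scores candidates with `difflib.SequenceMatcher` from the standard library. This is a swapped-in technique, not the scoring that `fuzzymatcher` uses. `best_title_match` and the shortened query are hypothetical.

```python
from difflib import SequenceMatcher


def best_title_match(query, candidates):
    """Return (score, candidate) for the candidate most similar to query."""
    scored = [(SequenceMatcher(None, query, c).ratio(), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])


# Cleaned titles taken from the metadata table above
titles = [
    "habits of the magpie in southeastern washington",
    "new records for the state of washington",
    "color variations of the house mouse in california",
]

# Hypothetical shortened title, as it might appear in a file name
score, match = best_title_match("habits of the magpie in southeastern wa", titles)
print(round(score, 2), match)
```

A score threshold chosen by inspecting borderline pairs would be stricter than keeping every positive score.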


Rename files in directory based on match and UUID
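- Non-matches keep their original Paperpile filenames, which contain spaces, so they can be removed from the command line with the whitespace pattern `rm -- *\ *`
- This cell needs the `file` column from the fuzzy join, so run it before the metadata is simplified to the eight columns listed above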

In [44]:
references = dict(df.dropna(subset=["file","uuid"]).set_index("file")["uuid"])

# Rename each matched PDF from its Paperpile filename to <uuid>.pdf
for filename, paper_uuid in references.items():
    filepath = os.path.join(path, str(filename))
    new_filepath = os.path.join(path, str(paper_uuid) + '.pdf')
    os.rename(filepath, new_filepath)
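
Batch renames are hard to undo, so a dry run can be useful before touching the files. The sketch below is one cautious variant under stated assumptions: `rename_with_uuid` is a hypothetical helper, it skips sources that are missing, and it never overwrites an existing target. The file name in the demonstration is made up.

```python
import os
import tempfile


def rename_with_uuid(directory, references, dry_run=True):
    """Rename files given {old_name: uuid}; return the (old, new) pairs handled."""
    handled = []
    for filename, paper_uuid in references.items():
        src = os.path.join(directory, filename)
        dst = os.path.join(directory, f"{paper_uuid}.pdf")
        if not os.path.exists(src) or os.path.exists(dst):
            continue  # skip missing sources and never overwrite a target
        handled.append((filename, os.path.basename(dst)))
        if not dry_run:
            os.rename(src, dst)
    return handled


# Demonstration in a throwaway directory with a made-up file name
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "Dice 1910 - New Records.pdf"), "w").close()
    refs = {"Dice 1910 - New Records.pdf": "9427ad92-882d-ebde-f346-832e9aeebe00"}
    print(rename_with_uuid(tmp, refs))                 # dry run, nothing renamed
    print(rename_with_uuid(tmp, refs, dry_run=False))  # performs the rename
    print(sorted(os.listdir(tmp)))
```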

Filter metadata to match remaining records

In [85]:
path = r'/nfs/turbo/isr-slafia/specimen/build_bibliography_pdf/'
files = glob.glob(path + "/*.pdf")

all_files = list(filter(lambda file: os.stat(file).st_size > 1, files)) #filter empty files

file_list = []

for file in all_files:
    file_full = os.path.basename(file)
    file_name = os.path.splitext(file_full)[0]
    file_list.append(file_name)
    
len(file_list)
df_final = df[df.uuid.isin(file_list)].drop_duplicates(subset=['DOI'])
df_final.to_csv('/nfs/turbo/isr-slafia/specimen/build_bibliography_final.csv', index=False)
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1308 entries, 0 to 88369
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   uuid              1308 non-null   object 
 1   bib               1308 non-null   object 
 2   Title             1308 non-null   object 
 3   Authors           1301 non-null   object 
 4   Journal           1302 non-null   object 
 5   Item type         1308 non-null   object 
 6   Publication year  1308 non-null   float64
 7   DOI               1308 non-null   object 
dtypes: float64(1), object(7)
memory usage: 92.0+ KB
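
The filter above relies on every remaining PDF having a UUID as its file stem. A quick way to confirm that no Paperpile-style names are left is to try parsing each stem as a UUID. This is a sketch, and `has_uuid_name` is a hypothetical helper.

```python
import os
import uuid


def has_uuid_name(filename):
    """Return True if the file stem parses as a UUID (hypothetical helper)."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    try:
        uuid.UUID(stem)
    except ValueError:
        return False
    return True


# One renamed file and one original Paperpile-style name from this notebook
print(has_uuid_name("e3e70682-c209-4cac-629f-6fbed82c07cd.pdf"))
print(has_uuid_name("Burt et al. 1971 - Mammals of Pennsylvania.pdf"))
```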


Check total number of papers per collection

In [93]:
df_final.bib.value_counts()

USNM    943
UMMZ    365
Name: bib, dtype: int64