# ICPSR Biennial Meeting: Bibliography Demo
We apply computational models to identify sentences in journal articles that indicate research data reuse.

In [1]:
import glob
import json
import re
import numpy as np
import pandas as pd
import pickle

import nltk
from nltk.tokenize import sent_tokenize

import spacy
from spacy import displacy

In [2]:
def clean_text(txt):
    """
    Converts text to lowercase and replaces runs of special characters and punctuation with a single space
    """
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower())

def find_acronyms(txt):
    """
    Returns 1 if the text contains a run of two or more capital letters
    (optionally with periods or a trailing "s"), otherwise 0
    """
    matches = re.findall(r"\b[A-Z\.]{2,}s?\b", txt)
    return 1 if matches else 0

## Journal articles
We select two journal articles from the [ICPSR Bibliography of Data-Related Literature](https://www.icpsr.umich.edu/web/pages/ICPSR/citations/) that reuse ICPSR data. We parse these articles and look for references to research datasets. We know that the following studies are related to each publication:

* **Alsan, M., & Wanamaker, M. (2018). [Tuskegee and the health of black men](https://app-dimensions-ai.proxy.lib.umich.edu/details/publication/pub.1091026367?search_mode=content&search_text=TUSKEGEE%20AND%20THE%20HEALTH%20OF%20BLACK%20MEN&search_type=kws&search_field=full_search). The Quarterly Journal of Economics, 133(1), 407-455.**

    * Mortality Detail Files, 1968-1991 (ICPSR 7632)
    * Electoral Data for Counties in the United States: Presidential and Congressional Races, 1840-1972 (ICPSR 8611)
    * Bureau of Health Professions Area Resource File, 1940-1990: [United States] (ICPSR 9075)
    * General Social Survey, 1972-2014 [Cumulative File] (ICPSR 36319)
    * Historical, Demographic, Economic, and Social Data: The United States, 1790-2002 (ICPSR 2896)


* **Acharya, A., Blackwell, M., & Sen, M. (2016). [The political legacy of American slavery](https://app-dimensions-ai.proxy.lib.umich.edu/details/publication/pub.1058867878?search_mode=content&search_text=%22The%20Political%20Legacy%20of%20American%20Slavery%22&search_type=kws&search_field=text_search). The Journal of Politics, 78(3), 621-641.**

    * Three Generations Combined, 1965-1997 (ICPSR 4532)
    * Youth-Parent Socialization Panel Study, 1965-1997: Four Waves Combined (ICPSR 4037)
    * Historical, Demographic, Economic, and Social Data: The United States, 1790-2002 (ICPSR 2896)
    * ANES 1984 Time Series Study (ICPSR 8298)
    * ANES 1986 Time Series Study (ICPSR 8678)
    * ANES 1988 Time Series Study (ICPSR 9196)
    * ANES 1990 Time Series Study (ICPSR 9548)
    * ANES 1992 Time Series Study (ICPSR 6067)
    * ANES 1994 Time Series Study (ICPSR 6507)
    * ANES 1996 Time Series Study (ICPSR 6896)
    * ANES 1998 Time Series Study (ICPSR 2684)

We split each article into sentences and check whether each sentence:
* has indicator terms, adapted from [Park et al., 2018](https://asistdl.onlinelibrary.wiley.com/doi/abs/10.1002/asi.24049)
    * "data", "edu", "sample", "national", "survey", "public", "avail", "nsf", "gov", "access"
* is in a standard section
    * "Abstract", "Introduction", "Background", "Data", "Method", "Result", "Discussion", "Summary", "Conclusion", "Acknowledgement"
* has at least one acronym
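
The three checks above can be sketched for a single sentence without pandas. This is a minimal illustration, not the notebook's implementation: the names `INDICATOR_TERMS`, `SECTION_STEMS`, `normalize`, and `sentence_features` are hypothetical, the section stems are the substrings the notebook searches for, and the example sentence is made up.

```python
import re

# Indicator terms and section stems taken from the lists above; the constant
# and function names are hypothetical, chosen only for this illustration.
INDICATOR_TERMS = ["data", "edu", "sample", "national", "survey",
                   "public", "avail", "nsf", "gov", "access"]
SECTION_STEMS = ["abstr", "intro", "back", "data", "meth",
                 "resul", "discus", "summ", "conclu", "acknowl"]
ACRONYM_PATTERN = re.compile(r"\b[A-Z\.]{2,}s?\b")

def normalize(txt):
    """Lowercase and replace runs of non-alphanumeric characters with a space."""
    return re.sub(r"[^A-Za-z0-9]+", " ", str(txt).lower())

def sentence_features(sentence, section):
    """Return a dict of 0/1 features for one sentence and its section heading."""
    text = normalize(sentence)
    heading = normalize(section)
    # Substring checks on the normalized sentence text
    features = {f"has_{term}": int(term in text) for term in INDICATOR_TERMS}
    # Substring checks on the normalized section heading
    features.update({f"in_{stem}": int(stem in heading) for stem in SECTION_STEMS})
    # The acronym check runs on the original text, since it depends on case
    features["has_acronym"] = int(bool(ACRONYM_PATTERN.search(sentence)))
    return features

# Made-up sentence and section heading
example = sentence_features(
    "We use data from the General Social Survey (GSS).", "II. Data")
print({name: value for name, value in example.items() if value})
```

Because these are plain substring checks, a short term such as "edu" also fires on unrelated words like "reduced", so the features are deliberately loose signals rather than exact matches.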

In [3]:
%%time

files = glob.glob('/nfs/turbo/hrg/bib/ICPSR-bibliography-demo/*.json')

pubs = pd.DataFrame()
for file in files: 
    with open(file, 'r') as f:
        data = json.loads(f.read())
        file_data = pd.json_normalize(data["pdf_parse"]["body_text"])
        file_data["paper_id"] = data["title"]
    pubs = pd.concat([pubs, file_data])

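sentences = []

# itertuples() yields (index, col1, col2, ...): row[1] is the paragraph text,
# row[5] the section heading, and row[7] the paper title added above
for row in pubs.itertuples():
    for sent in sent_tokenize(row[1]):
        sentences.append((row[7], row[5], sent))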

df = pd.DataFrame(sentences, columns=['paper_title', 'paper_section', 'sentence_text'])

df['sentence_text_clean'] = df['sentence_text'].apply(clean_text)
df['paper_section_clean']= df['paper_section'].apply(clean_text)

df['hasData'] = np.where(df['sentence_text_clean'].str.contains('data'), 1, 0)
df['hasEdu'] = np.where(df['sentence_text_clean'].str.contains('edu'), 1, 0)
df['hasSample'] = np.where(df['sentence_text_clean'].str.contains('sample'), 1, 0)
df['hasNational'] = np.where(df['sentence_text_clean'].str.contains('national'), 1, 0)
df['hasSurvey'] = np.where(df['sentence_text_clean'].str.contains('survey'), 1, 0)
df['hasPublic'] = np.where(df['sentence_text_clean'].str.contains('public'), 1, 0)
df['hasAvail'] = np.where(df['sentence_text_clean'].str.contains('avail'), 1, 0)
df['hasNSF'] = np.where(df['sentence_text_clean'].str.contains('nsf'), 1, 0)
df['hasGov'] = np.where(df['sentence_text_clean'].str.contains('gov'), 1, 0)
df['hasAccess'] = np.where(df['sentence_text_clean'].str.contains('access'), 1, 0)

df['inIntro'] = np.where(df['paper_section_clean'].str.contains('intro'), 1, 0)
df['inDisc'] = np.where(df['paper_section_clean'].str.contains('discus'), 1, 0)
df['inAbst'] = np.where(df['paper_section_clean'].str.contains('abstr'), 1, 0)
df['inResult'] = np.where(df['paper_section_clean'].str.contains('resul'), 1, 0)
df['inConcl'] = np.where(df['paper_section_clean'].str.contains('conclu'), 1, 0)
df['inMethod'] = np.where(df['paper_section_clean'].str.contains('meth'), 1, 0)
df['inBack'] = np.where(df['paper_section_clean'].str.contains('back'), 1, 0)
df['inData'] = np.where(df['paper_section_clean'].str.contains('data'), 1, 0)
df['inSumm'] = np.where(df['paper_section_clean'].str.contains('summ'), 1, 0)
df['inAckno'] = np.where(df['paper_section_clean'].str.contains('acknowl'), 1, 0)

df['hasAcronym'] = df['sentence_text'].apply(find_acronyms)

X_predict = df.iloc[:, 5:]

df

CPU times: user 131 ms, sys: 4.51 ms, total: 135 ms
Wall time: 434 ms


| | paper_title | paper_section | sentence_text | sentence_text_clean | paper_section_clean | hasData | hasEdu | hasSample | hasNational | hasSurvey | ... | inDisc | inAbst | inResult | inConcl | inMethod | inBack | inData | inSumm | inAckno | hasAcronym |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NBER WORKING PAPER SERIES TUSKEGEE AND THE HEA... | I. Introduction | The Tuskegee Study became a symbol of their mi... | the tuskegee study became a symbol of their mi... | i introduction | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | NBER WORKING PAPER SERIES TUSKEGEE AND THE HEA... | I. Introduction | Corbie -Smith et al. | corbie smith et al | i introduction | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | NBER WORKING PAPER SERIES TUSKEGEE AND THE HEA... | I. Introduction | (1999) African-American men have the worst hea... | 1999 african american men have the worst heal... | i introduction | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | NBER WORKING PAPER SERIES TUSKEGEE AND THE HEA... | I. Introduction | 1 Although recent trends have shown signs of i... | 1 although recent trends have shown signs of i... | i introduction | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | NBER WORKING PAPER SERIES TUSKEGEE AND THE HEA... | I. Introduction | Compared to other demographic groups, black me... | compared to other demographic groups black men... | i introduction | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1081 | The Political Legacy of American Slavery * | | Data was also collected on the enslaved status... | data was also collected on the enslaved status... | | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1082 | The Political Legacy of American Slavery * | | Manuscripts were only available for Alabama, G... | manuscripts were only available for alabama ge... | | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1083 | The Political Legacy of American Slavery * | | † p < .1; * p < .05; * * p < .01. | p 1 p 05 p 01 | | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1084 | The Political Legacy of American Slavery * | | All analyses are at the individual level with ... | all analyses are at the individual level with ... | | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


## Models

The **Random Forest classifier** predicts whether a sentence contains a reference to a dataset. The model takes the sentence features built in the previous step as input, which lets us reduce the total amount of text to review. We get a probability score for each sentence and keep only sentences scoring at least 10%, filtering out unlikely ones (i.e., those that contain no acronym or indicator terms, or that fall in standard sections unlikely to reference data).
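
The thresholding step can be illustrated without the trained model. The following is a minimal sketch under stated assumptions: the sentences and probabilities are made up, and `filter_candidates` is a hypothetical helper rather than part of the notebook. Each probability pair mimics one row of `predict_proba` output, ordered as the notebook's `prob_0` and `prob_1` columns.

```python
# Hypothetical sentences and class probabilities, for illustration only.
# Each pair mimics one row of predict_proba output: [prob_0, prob_1].
sentences = [
    "Smith et al.",
    "We use data from the General Social Survey (GSS).",
    "Standard errors are clustered at the state level.",
    "The national sample covers all counties.",
]
probabilities = [
    [0.98, 0.02],
    [0.25, 0.75],
    [0.93, 0.07],
    [0.88, 0.12],
]

def filter_candidates(sentences, probabilities, threshold=0.1):
    """Keep (sentence, prob_1) pairs whose positive-class probability meets the threshold."""
    return [
        (sentence, probs[1])
        for sentence, probs in zip(sentences, probabilities)
        if probs[1] >= threshold
    ]

candidates = filter_candidates(sentences, probabilities)
for sentence, prob_1 in candidates:
    print(f"{prob_1:.2f}  {sentence}")
```

A low cutoff such as 10% keeps borderline sentences in the candidate pool, which seems a reasonable trade-off here because a second model reviews the candidates afterwards.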

In [4]:
np.random.seed(1)

pkl_filename = "./output/model-RF.pkl"

with open(pkl_filename, 'rb') as file:
    rf_model = pickle.load(file)

predict = rf_model.predict(X_predict)

prob = rf_model.predict_proba(X_predict)
df['prob_0'] = prob[:,0] 
df['prob_1'] = prob[:,1]

df_candidates = df.query('prob_1 >= 0.1')

df_candidates_health = df_candidates[df_candidates['paper_title'] == "NBER WORKING PAPER SERIES TUSKEGEE AND THE HEALTH OF BLACK MEN"]
df_candidates_legacy = df_candidates[df_candidates['paper_title'] == "The Political Legacy of American Slavery *"]

print("Predicted citances (Alsan and Wanamaker, 2017):", len(df_candidates_health))
print("Predicted citances (Acharya et al., 2016):", len(df_candidates_legacy))

Predicted citances (Alsan and Wanamaker, 2017): 160
Predicted citances (Acharya et al., 2016): 112


The **Named Entity Recognition** model is used to predict a span of text that contains a `Dataset` entity. We compare this prediction to the names of the studies we know were used in each paper.
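
The notebook does not show how predicted entities are compared with the known study names. One simple possibility, sketched here as an assumption, is to match the part of each study title before the first comma against the cleaned entity text. The helpers `title_stem` and `match_studies` are hypothetical, and the mentions are sample strings rather than actual model output.

```python
import re

def clean_text(txt):
    """Lowercase, replace non-alphanumeric runs with a space, and trim the ends."""
    return re.sub(r"[^A-Za-z0-9]+", " ", str(txt).lower()).strip()

def title_stem(study_name):
    """Keep the title text before the first comma, dropping year ranges and qualifiers."""
    return clean_text(study_name.split(",")[0])

def match_studies(mentions, known_studies):
    """Map each known study to the mentions whose cleaned text contains its title stem."""
    cleaned = [clean_text(mention) for mention in mentions]
    return {
        study: [m for m, c in zip(mentions, cleaned) if title_stem(study) in c]
        for study in known_studies
    }

known = [
    "General Social Survey, 1972-2014 [Cumulative File] (ICPSR 36319)",
    "Mortality Detail Files, 1968-1991 (ICPSR 7632)",
]
# Sample mentions for illustration, not actual model output
mentions = [
    "General Social Survey (GSS)",
    "National Health Interview Survey (NHIS)",
    "1940 U.S. Census",
]

matches = match_studies(mentions, known)
for study, found in matches.items():
    print(study, "->", found)
```

Stem matching of this kind only catches studies mentioned by name; a study cited through its PI or through a generic description would still need manual review.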

In [5]:
# from thinc.api import set_gpu_allocator, require_gpu
# set_gpu_allocator("pytorch")
# require_gpu(0)

In [6]:
# spacy.prefer_gpu()
custom_ner_model = spacy.load('./output/model-best')

## Predictions

**Alsan, M., & Wanamaker, M. (2018). [Tuskegee and the health of black men](https://app-dimensions-ai.proxy.lib.umich.edu/details/publication/pub.1091026367?search_mode=content&search_text=TUSKEGEE%20AND%20THE%20HEALTH%20OF%20BLACK%20MEN&search_type=kws&search_field=full_search). The Quarterly Journal of Economics, 133(1), 407-455.**

Our approach finds 1 of the 5 known ICPSR studies (shown in bold):

* Mortality Detail Files, 1968-1991 (ICPSR 7632)
    - No proper name is mentioned explicitly in the text (Table and Figure notes: "mortality files from the CDC")
* Electoral Data for Counties in the United States: Presidential and Congressional Races, 1840-1972 (ICPSR 8611)
* Bureau of Health Professions Area Resource File, 1940-1990: [United States] (ICPSR 9075)
* **General Social Survey, 1972-2014 [Cumulative File] (ICPSR 36319)**
    - Referenced by its alias (Introduction: "General Social Survey (GSS)")
* Historical, Demographic, Economic, and Social Data: The United States, 1790-2002 (ICPSR 2896)
    - Referenced by its PI name (Footnote: "(Haines 2010)")

Other data found include:
* National Health Interview Survey (NHIS)
* IPUMS Health Surveys
* IHIS
* NCHS
* 1979 Survey of Black Americans
* 1940 U.S. Census
* 1960 U.S. Census of Population
* 1970 U.S. Census of Population
* Medicare data

In [7]:
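candidates = df_candidates_health['sentence_text'].to_list()

# Collect the distinct dataset names predicted across all candidate sentences
labels = set()
for candidate in candidates:
    doc = custom_ner_model(candidate)
    if doc.ents:
        # doc.ents is a tuple of spans; clean each entity's text separately
        labels.update(clean_text(ent.text) for ent in doc.ents)
        displacy.render(doc, style="ent", jupyter=True)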

**Acharya, A., Blackwell, M., & Sen, M. (2016). [The political legacy of American slavery](https://app-dimensions-ai.proxy.lib.umich.edu/details/publication/pub.1058867878?search_mode=content&search_text=%22The%20Political%20Legacy%20of%20American%20Slavery%22&search_type=kws&search_field=text_search). The Journal of Politics, 78(3), 621-641.**

Our approach finds 1 of the 4+ known ICPSR studies (shown in bold):

* Three Generations Combined, 1965-1997 (ICPSR 4532)
* Youth-Parent Socialization Panel Study, 1965-1997: Four Waves Combined (ICPSR 4037)
* Historical, Demographic, Economic, and Social Data: The United States, 1790-2002 (ICPSR 2896)
* **ANES 1984 Time Series Study (ICPSR 8298)**
* ANES 1986 Time Series Study (ICPSR 8678)
* ANES 1988 Time Series Study (ICPSR 9196)
* ANES 1990 Time Series Study (ICPSR 9548)
* ANES 1992 Time Series Study (ICPSR 6067)
* ANES 1994 Time Series Study (ICPSR 6507)
* ANES 1996 Time Series Study (ICPSR 6896)
* ANES 1998 Time Series Study (ICPSR 2684)

Other data found include:
* 1860 U.S. Census
* 1940 U.S. Census
* 1925 Agricultural Census
* Cooperative Congressional Election Study (CCES)

In [8]:
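candidates = df_candidates_legacy['sentence_text'].to_list()

# Collect the distinct dataset names predicted across all candidate sentences
labels = set()
for candidate in candidates:
    doc = custom_ner_model(candidate)
    if doc.ents:
        # doc.ents is a tuple of spans; clean each entity's text separately
        labels.update(clean_text(ent.text) for ent in doc.ents)
        displacy.render(doc, style="ent", jupyter=True)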