# Notebook to get abstracts from Semantic Scholar using DOIs, PMIDs, or PMCIDs
 
Note: This notebook reads `filtered_data_with_identifiers.csv`, which holds the identifier for each paper and is produced by the notebook `get_identifiers_from_links.ipynb`. Generate that file (or download it from Google Drive) and place it in a folder called `data` before running this notebook; the code below reads it from `../data/`. A Semantic Scholar API key is also required.
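The two prerequisites above can be checked before any request is sent. The following is a minimal sketch, assuming the file path and the `SEM_SCHOLAR_API_KEY` environment variable used later in this notebook; `check_prerequisites` is a hypothetical helper and not part of the original code.

```python
import os

def check_prerequisites(csv_path, key_var='SEM_SCHOLAR_API_KEY'):
    """Return a list of human-readable problems; an empty list means ready to run."""
    problems = []
    if not os.path.isfile(csv_path):
        problems.append(f"Missing input file: {csv_path}")
    if not os.getenv(key_var):
        problems.append(f"Environment variable {key_var} is not set")
    return problems

# Print every problem found; nothing is printed when both prerequisites are met
for problem in check_prerequisites('../data/filtered_data_with_identifiers.csv'):
    print(problem)
```

Returning a list instead of raising on the first problem lets both issues be reported in one pass.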

In [1]:
import pandas as pd

In [2]:
inp_df = pd.read_csv('../data/filtered_data_with_identifiers.csv')
print(inp_df.shape)
inp_df.head()

(9354, 23)


Unnamed: 0.1,Unnamed: 0,year,month,title,link_flair_text,domain,score,num_comments,sensationalism_score,jargon_proportion,...,is_top_domain_scientific,is_top_domain_news,is_top_domain_repo,is_top_domain_scam,is_top_domain_unknown,is_top_domain_indecisive,is_top_domain_less_than_2,label_voting_lm,label_voting_manual,identifier
0,5,2018,3,Firearm Injuries Drop 20 Percent When Gun Owne...,Biology,nejm.org,84,22,0.530595,0.0,...,False,False,True,False,False,False,False,repo,repo,DOI:10.1056/NEJMc1712773
1,19,2018,3,Supplementation with probiotics during late pr...,Health,journals.plos.org,8,1,0.482136,0.314286,...,True,False,False,False,False,False,False,scientific,scientific,DOI:10.1371/journal.pmed.1002507
2,25,2018,3,Study finds that bee venom could be a useful p...,Medicine,ncbi.nlm.nih.gov,27,9,0.507827,0.333333,...,False,False,True,False,False,False,False,repo,repo,PMCID:5793096
3,31,2018,3,"The interplay of gene flow, population size va...",Biology,onlinelibrary.wiley.com,6,0,0.433548,0.352941,...,True,False,False,False,False,False,False,scientific,scientific,DOI:10.1111/evo.13435/abstract
4,33,2018,3,Undisclosed Conflicts of Interests among Biome...,Social Science,ncbi.nlm.nih.gov,263,21,0.511708,0.086957,...,False,False,True,False,False,False,False,repo,repo,PMID:29400625


In [3]:
# Use the Semantic Scholar batch endpoint to fetch abstracts
import requests
import time
import os
from tqdm import tqdm


In [4]:
# Read the API key from the environment; report only whether it is set, never print the key itself
SEM_SCHOLAR_API_KEY = os.getenv('SEM_SCHOLAR_API_KEY')
print('API key set:', SEM_SCHOLAR_API_KEY is not None)

API key set: False


In [5]:
# API endpoint: https://api.semanticscholar.org/graph/v1/paper/batch
# Take the key from the environment variable read above instead of hardcoding it in the notebook
headers = {'x-api-key': SEM_SCHOLAR_API_KEY} if SEM_SCHOLAR_API_KEY else {}

def fetch_paper_details(ids):
    """POST one batch of identifiers; return the list of paper records, or [] on error."""
    url = 'https://api.semanticscholar.org/graph/v1/paper/batch'
    params = {
        'fields': 'title,abstract'
    }
    response = requests.post(url, headers=headers, json={'ids': ids}, params=params)
    if response.status_code == 200:
        return response.json()
    else:
        print(f"Error: {response.status_code} for IDs: {ids}")
        return []


batch_size = 500
results = []
for i in tqdm(range(0, len(inp_df), batch_size)):
    batch_ids = inp_df['identifier'].iloc[i:i + batch_size].tolist()
    data = fetch_paper_details(batch_ids)
    results.extend(data)
    # Records come back in the same order as batch_ids; an entry can be None when nothing was found
    for j, r in enumerate(data):
        if r:
            # Map the position back to the row label so this also works with a non-default index
            row = inp_df.index[i + j]
            sem_title = r.get('title', None)
            sem_abstract = r.get('abstract', None)
            if sem_title:
                inp_df.loc[row, 'sem_scholar_title'] = sem_title
            if sem_abstract:
                inp_df.loc[row, 'sem_scholar_abstract'] = sem_abstract
    time.sleep(1.5)  # Max 1 request per second; the extra buffer avoids the 429 errors seen earlier


100%|██████████| 19/19 [00:46<00:00,  2.45s/it]
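The fixed 1.5 second pause above reduces 429 responses but does not recover a batch that is still rejected. One possible refinement is to retry with an increasing delay. The sketch below is an assumption, not the method used in this notebook: `post_with_retry`, its parameters, and the delay schedule are all hypothetical.

```python
import time

def post_with_retry(send, max_retries=3, base_delay=1.5, sleep=time.sleep):
    """Call send() until it returns a non-429 response or the retries run out.

    send  : zero-argument callable returning an object with a .status_code attribute
    sleep : injectable wait function, so the delay can be skipped in tests
    """
    response = send()
    for attempt in range(max_retries):
        if response.status_code != 429:
            break
        # Wait longer after each rejected attempt: 1.5 s, 3 s, 6 s, ...
        sleep(base_delay * 2 ** attempt)
        response = send()
    return response
```

Inside `fetch_paper_details`, the request could then be wrapped as `post_with_retry(lambda: requests.post(url, headers=headers, json={'ids': ids}, params=params))`, leaving the status check after it unchanged.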


In [6]:
inp_df.head()

Unnamed: 0.1,Unnamed: 0,year,month,title,link_flair_text,domain,score,num_comments,sensationalism_score,jargon_proportion,...,is_top_domain_repo,is_top_domain_scam,is_top_domain_unknown,is_top_domain_indecisive,is_top_domain_less_than_2,label_voting_lm,label_voting_manual,identifier,sem_scholar_title,sem_scholar_abstract
0,5,2018,3,Firearm Injuries Drop 20 Percent When Gun Owne...,Biology,nejm.org,84,22,0.530595,0.0,...,True,False,False,False,False,repo,repo,DOI:10.1056/NEJMc1712773,Reduction in Firearm Injuries during NRA Annua...,Decline in Firearm Injuries during NRA Convent...
1,19,2018,3,Supplementation with probiotics during late pr...,Health,journals.plos.org,8,1,0.482136,0.314286,...,False,False,False,False,False,scientific,scientific,DOI:10.1371/journal.pmed.1002507,Diet during pregnancy and infancy and risk of ...,Background There is uncertainty about the infl...
2,25,2018,3,Study finds that bee venom could be a useful p...,Medicine,ncbi.nlm.nih.gov,27,9,0.507827,0.333333,...,True,False,False,False,False,repo,repo,PMCID:5793096,Bee Venom Suppresses the Differentiation of Pr...,Bee venom (BV) has been widely used in the tre...
3,31,2018,3,"The interplay of gene flow, population size va...",Biology,onlinelibrary.wiley.com,6,0,0.433548,0.352941,...,False,False,False,False,False,scientific,scientific,DOI:10.1111/evo.13435/abstract,,
4,33,2018,3,Undisclosed Conflicts of Interests among Biome...,Social Science,ncbi.nlm.nih.gov,263,21,0.511708,0.086957,...,True,False,False,False,False,repo,repo,PMID:29400625,Undisclosed conflicts of interest among biomed...,ABSTRACT Background: Textbooks are a formative...


In [7]:
# Show the rows for which no title came back (missing value or the literal string 'nan')
inp_df[inp_df['sem_scholar_title'].isna() | (inp_df['sem_scholar_title'] == 'nan')]


Unnamed: 0.1,Unnamed: 0,year,month,title,link_flair_text,domain,score,num_comments,sensationalism_score,jargon_proportion,...,is_top_domain_repo,is_top_domain_scam,is_top_domain_unknown,is_top_domain_indecisive,is_top_domain_less_than_2,label_voting_lm,label_voting_manual,identifier,sem_scholar_title,sem_scholar_abstract
3,31,2018,3,"The interplay of gene flow, population size va...",Biology,onlinelibrary.wiley.com,6,0,0.433548,0.352941,...,False,False,False,False,False,scientific,scientific,DOI:10.1111/evo.13435/abstract,,
27,511,2018,3,Features of immunosenescence greatly differed ...,Health,onlinelibrary.wiley.com,1,2,0.499449,0.190476,...,False,False,False,False,False,scientific,scientific,DOI:10.1111/acel.12750/fulled,,
35,602,2018,3,Caffeine and Cannabis Effects on Vital Neurotr...,Neuroscience,ncbi.nlm.nih.gov,5,3,0.434245,0.055556,...,True,False,False,False,False,repo,repo,PMCID:5448447,,
36,610,2018,3,Caffeine and Cannabis Effects on Vital Neurotr...,Neuroscience,ncbi.nlm.nih.gov,47,11,0.434245,0.055556,...,True,False,False,False,False,repo,repo,PMCID:5448447,,
38,681,2018,3,How much carbon do European soils have?,Environment,onlinelibrary.wiley.com,1,1,0.444545,0.285714,...,False,False,False,False,False,scientific,scientific,DOI:10.1111/gcb.12292/abstract,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9238,194280,2017,2,Conspiracy Endorsement as Motivated Reasoning:...,Environment,onlinelibrary.wiley.com,1,0,0.503490,0.000000,...,False,False,False,False,False,scientific,scientific,DOI:10.1111/ajps.12234/abstract,,
9262,195041,2017,2,Demonic Influence: The Negative Mental Health ...,Psychology,onlinelibrary.wiley.com,411,43,0.491181,0.000000,...,False,False,False,False,False,scientific,scientific,DOI:10.1111/jssr.12287/abstract,,
9277,195583,2019,10,Eating more than 2.5 eggs per week leads to 81...,Cancer,ncbi.nlm.nih.gov,0,40,0.571614,0.294118,...,True,False,False,False,False,repo,repo,PMCID:3232297,,
9280,195681,2019,10,The hair cells of the inner ear detect CO2 and...,Biology,ncbi.nlm.nih.gov,34,9,0.467021,0.153846,...,True,False,False,False,False,repo,repo,PMCID:3812300,,
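Several of the DOIs in the rows above still carry a URL path segment such as `/abstract` or `/fulled`, which may be why the lookup returned nothing for them. A minimal sketch of a cleanup step follows, assuming those suffixes are leftovers from the original links; `clean_identifier` and the suffix list are hypothetical, and the rows with PMCID identifiers are not addressed by it.

```python
import re

# Suffixes seen on DOIs in the rows above, plus 'full' as a guess; extend as needed (assumption)
_URL_SUFFIX = re.compile(r'/(abstract|full|fulled)$', re.IGNORECASE)

def clean_identifier(identifier):
    """Strip a trailing URL path segment from a DOI identifier; leave other ID types alone."""
    if not identifier.startswith('DOI:'):
        return identifier
    return _URL_SUFFIX.sub('', identifier)

print(clean_identifier('DOI:10.1111/evo.13435/abstract'))  # DOI:10.1111/evo.13435
```

Applying this to the `identifier` column (for example with `inp_df['identifier'].map(clean_identifier)`) before the batch requests would let these rows be retried, though whether the cleaned DOIs resolve has not been checked here.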


In [8]:
# Drop rows where the Semantic Scholar title or abstract is missing (NaN) or is the literal string "nan"
out_df = inp_df[(~inp_df['sem_scholar_title'].isnull()) & (inp_df['sem_scholar_title'] != 'nan')]
out_df = out_df[(~out_df['sem_scholar_abstract'].isnull()) & (out_df['sem_scholar_abstract'] != 'nan')]

out_df

Unnamed: 0.1,Unnamed: 0,year,month,title,link_flair_text,domain,score,num_comments,sensationalism_score,jargon_proportion,...,is_top_domain_repo,is_top_domain_scam,is_top_domain_unknown,is_top_domain_indecisive,is_top_domain_less_than_2,label_voting_lm,label_voting_manual,identifier,sem_scholar_title,sem_scholar_abstract
0,5,2018,3,Firearm Injuries Drop 20 Percent When Gun Owne...,Biology,nejm.org,84,22,0.530595,0.000000,...,True,False,False,False,False,repo,repo,DOI:10.1056/NEJMc1712773,Reduction in Firearm Injuries during NRA Annua...,Decline in Firearm Injuries during NRA Convent...
1,19,2018,3,Supplementation with probiotics during late pr...,Health,journals.plos.org,8,1,0.482136,0.314286,...,False,False,False,False,False,scientific,scientific,DOI:10.1371/journal.pmed.1002507,Diet during pregnancy and infancy and risk of ...,Background There is uncertainty about the infl...
2,25,2018,3,Study finds that bee venom could be a useful p...,Medicine,ncbi.nlm.nih.gov,27,9,0.507827,0.333333,...,True,False,False,False,False,repo,repo,PMCID:5793096,Bee Venom Suppresses the Differentiation of Pr...,Bee venom (BV) has been widely used in the tre...
4,33,2018,3,Undisclosed Conflicts of Interests among Biome...,Social Science,ncbi.nlm.nih.gov,263,21,0.511708,0.086957,...,True,False,False,False,False,repo,repo,PMID:29400625,Undisclosed conflicts of interest among biomed...,ABSTRACT Background: Textbooks are a formative...
5,48,2018,3,One Tribe to Bind Them All: How Our Social Gro...,Psychology,onlinelibrary.wiley.com,15,1,0.540974,0.000000,...,False,False,False,False,False,scientific,scientific,DOI:10.1111/pops.12485,One Tribe to Bind Them All: How Our Social Gro...,“Social sorting” is a concept used by Mason (2...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9349,197738,2019,10,"Statements about building walls, deportation a...",Social Science,journals.plos.org,0,4,0.554148,0.263158,...,False,False,False,False,False,scientific,scientific,DOI:10.1371/journal.pone.0222837,Declared impact of the US President’s statemen...,"Statements about building walls, deportation a..."
9350,197742,2019,10,Many college students will uncritically accept...,Social Science,journals.plos.org,170,25,0.543118,0.100000,...,False,False,False,False,False,scientific,scientific,DOI:10.1371/journal.pone.0223736,When calculators lie: A demonstration of uncri...,Calculators are often unnecessary to solve rou...
9351,197760,2019,10,New method for making polymers with perfectly ...,Chemistry,pubs.acs.org,148,8,0.503372,0.171429,...,True,False,False,False,False,repo,repo,DOI:10.1021/jacs.9b08240,Homogenous Synthesis of Monodisperse High Olig...,Whereas monodisperse polymers are ubiquitous i...
9352,197779,2019,10,Research Shows That Doing a Bad Job Wrapping P...,Social Science,onlinelibrary.wiley.com,175,15,0.589604,0.000000,...,False,False,False,False,False,scientific,scientific,DOI:10.1002/jcpy.1140,Presentation Matters: The Effect of Wrapping N...,While gift-givers typically wrap gifts prior t...


In [9]:
out_df = out_df[['id', 'title', 'url', 'identifier', 'sem_scholar_title', 'sem_scholar_abstract']]
print(out_df.shape)
out_df

(7083, 6)


Unnamed: 0,id,title,url,identifier,sem_scholar_title,sem_scholar_abstract
0,811az5,Firearm Injuries Drop 20 Percent When Gun Owne...,http://www.nejm.org/doi/full/10.1056/NEJMc1712773,DOI:10.1056/NEJMc1712773,Reduction in Firearm Injuries during NRA Annua...,Decline in Firearm Injuries during NRA Convent...
1,814dwi,Supplementation with probiotics during late pr...,http://journals.plos.org/plosmedicine/article?...,DOI:10.1371/journal.pmed.1002507,Diet during pregnancy and infancy and risk of ...,Background There is uncertainty about the infl...
2,814umr,Study finds that bee venom could be a useful p...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,PMCID:5793096,Bee Venom Suppresses the Differentiation of Pr...,Bee venom (BV) has been widely used in the tre...
4,815lr6,Undisclosed Conflicts of Interests among Biome...,https://www.ncbi.nlm.nih.gov/pubmed/29400625,PMID:29400625,Undisclosed conflicts of interest among biomed...,ABSTRACT Background: Textbooks are a formative...
5,81745j,One Tribe to Bind Them All: How Our Social Gro...,http://onlinelibrary.wiley.com/doi/10.1111/pop...,DOI:10.1111/pops.12485,One Tribe to Bind Them All: How Our Social Gro...,“Social sorting” is a concept used by Mason (2...
...,...,...,...,...,...,...
9349,dpnj23,"Statements about building walls, deportation a...",https://journals.plos.org/plosone/article?id=1...,DOI:10.1371/journal.pone.0222837,Declared impact of the US President’s statemen...,"Statements about building walls, deportation a..."
9350,dpnu7e,Many college students will uncritically accept...,https://journals.plos.org/plosone/article?id=1...,DOI:10.1371/journal.pone.0223736,When calculators lie: A demonstration of uncri...,Calculators are often unnecessary to solve rou...
9351,dpqhem,New method for making polymers with perfectly ...,https://pubs.acs.org/doi/10.1021/jacs.9b08240,DOI:10.1021/jacs.9b08240,Homogenous Synthesis of Monodisperse High Olig...,Whereas monodisperse polymers are ubiquitous i...
9352,dptk4o,Research Shows That Doing a Bad Job Wrapping P...,https://onlinelibrary.wiley.com/doi/epdf/10.10...,DOI:10.1002/jcpy.1140,Presentation Matters: The Effect of Wrapping N...,While gift-givers typically wrap gifts prior t...


In [10]:
out_df.to_csv('../data/filtered_data_with_abstracts.csv', index=False)

# Conclusion

Identifiers were available for 8299 papers. Of these, the Semantic Scholar API returned _some_ information for 6101 papers, and 5236 of those included an abstract.
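Counts like these can be derived from the `results` list collected in the request loop instead of being tallied by hand. The sketch below assumes each entry is either `None` (nothing found) or a dict with optional `title` and `abstract` keys; `coverage_counts` is a hypothetical helper and the sample records are made up for illustration.

```python
def coverage_counts(records):
    """Count all entries, entries with any data, and entries with a non-empty abstract."""
    found = [r for r in records if r]
    with_abstract = [r for r in found if r.get('abstract')]
    return {
        'total': len(records),
        'found': len(found),
        'with_abstract': len(with_abstract),
    }

# Tiny made-up sample, not real data from the API
sample = [
    {'title': 'A', 'abstract': 'Some text'},
    {'title': 'B', 'abstract': None},
    None,
]
print(coverage_counts(sample))  # {'total': 3, 'found': 2, 'with_abstract': 1}
```

Because failed batches add nothing to `results`, `total` counts only the entries actually returned, not every identifier that was requested.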