This code will explore some of the example files in the Semantic Scholar Open Research Corpus (S2ORC).  S2ORC is a general-purpose corpus for NLP and text mining research over scientific papers.  

First, we will examine metadata.

Then, we will examine pdf parses.

Decisions:
    - Should we drop certain fields (e.g. Physics, Biology, Chemistry, Engineering, Material Sciency, Art, Geology)
        - Could lead to some false positives being dropped.
    - Should only published works be considered?
        - Would omit many manuscripts including papers in the PRWP series
        - Also, may introduce a time lag where more recent papers will be excluded.

In [3]:
#load all packages

import pandas as pd
import json
import spacy
import os




In [4]:


#read in sample json from metadata
meta_path='C:/Users/wb469649/Documents/Github/s2orc/data/metadata/sample.jsonl'
meta_samp = pd.read_json(meta_path, lines=True)

meta_samp.head()

#read in sample json pdf parses
pdf_path='C:/Users/wb469649/Documents/Github/s2orc/data/pdf_parses/sample.jsonl'
pdf_samp = pd.read_json(pdf_path, lines=True)

pdf_flat=pd.json_normalize(pd.Series(open(pdf_path).readlines()).apply(json.loads))

pdf_samp.head()


Unnamed: 0,paper_id,_pdf_hash,abstract,body_text,bib_entries,ref_entries
0,77499681,11f281316fe4638843a83cf559ce4f60aade00f8,"[{'section': 'Abstract', 'text': 'The purpose ...","[{'section': '', 'text': 'Values are presented...",{'BIBREF0': {'title': 'Bone health and osteopo...,{'FIGREF0': {'text': '비스포스포네이트를 장기간 복용한 골다공증 환...
1,94550656,42b3e1bd9c4740192f22d8725d470218e86301c8,[],[],{'BIBREF0': {'title': 'Solving ratio-dependent...,{}
2,94551239,b355fc0f19e1945bcb585b0f696da8b01aa4578f,[],[],{'BIBREF2': {'title': 'Optical Near Field Reco...,{}
3,94551546,9bf1cb19041b8ddfca7aeccc9d2f7689c8aa1c7e,"[{'section': 'Abstract', 'text': 'Ethanolamine...","[{'section': 'INTRODUCTION', 'text': 'Gene the...","{'BIBREF0': {'title': 'Cancer statistics', 'au...",{'FIGREF0': {'text': 'General procedures for t...
4,94552339,7dc1bf397fb5aae2fa07e2697ba6f0237a411bb6,[],[],"{'BIBREF0': {'title': 'Titanium, its occurrenc...",{}


In the next section, I will play with the spacy python package to extract information from the title and abstract.

More details on the Spacy named entity recognition can be found here:
https://spacy.io/usage/linguistic-features#named-entities

In [15]:

#This is an example snippet using an example sentence.

nlp=spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion and will produce products in the U.S.A, the U.K., and Bangladesh.")

nlp_results=[]
cols=['text','label']
for token in doc.ents:
    nlp_results.append([token.text,token.label_])
nlp_df=pd.DataFrame(nlp_results, columns=cols)


Below is the code applying the spacy package to the titles and abstracts in the metadata file.

In [7]:

# this code applies the spacy tools to the sample metadata file.
nlp=spacy.load("en_core_web_sm")
meta_path='C:/Users/wb469649/Documents/Github/s2orc/data/metadata/sample.jsonl'
pdf_path='C:/Users/wb469649/Documents/Github/s2orc/data/pdf_parses/sample.jsonl'

#loop through all the metadata entries and save to pandas dataframe
nlp_results=[]
cols=['paper_id','text','label']

with open(meta_path) as f_meta:
    for line in f_meta:
        metadata_dict = json.loads(line)
        paper_id=metadata_dict['paper_id']
        text=' '.join(filter(None, [metadata_dict['title'] ,metadata_dict['abstract']]))
        #apply nlp to title and abstract
        doc=nlp(text)
        #save results
        
        for token in doc.ents:
            nlp_results.append([paper_id,token.text,token.label_])
        
nlp_df=pd.DataFrame(nlp_results, columns=cols)



Next we will apply the spacy package to the pdf_parsed files.



In [31]:
# this code applies the spacy tools to the sample metadata file.
nlp=spacy.load("en_core_web_sm")

#loop through all the metadata entries and save to pandas dataframe
pdf_results=[]
cols=['paper_id','text','label','original_text']

#list of excluded academic fields (mostly hard science fields and art)
excluded_fields = ['Medicine', 'Biology','Chemistry', 'Engineering', 'Physics','Material Sciency', 'Geology', 'Art']

#get list of entries with pdf text available
papers_with_pdf={}
with open(meta_path) as f_meta:
    for line in f_meta:
        metadata_dict = json.loads(line)
        paper_id=metadata_dict['paper_id']
        field=metadata_dict['mag_field_of_study']

        # if field is empty just replace with empty string
        if len(field)==0:
            field=[" "]

        # we want only papers pdf text available
        if not metadata_dict['has_pdf_parse']:
            continue
        
        # and we want to exclude some fields that are unlikely to use data produced by national governments
        # checks if all fields listed for paper are in the exlcuded list of fields
        if all(x in field for x in excluded_fields):
            continue
        
        papers_with_pdf[metadata_dict['paper_id']] = metadata_dict


        

#now go through pdf file and do nlp code on text
with open(pdf_path) as f_pdf:
    for line in f_pdf:
        pdf_dict = json.loads(line)
        paper_id=pdf_dict['paper_id']

        #Only check papers with pdf parsed
        if paper_id in papers_with_pdf:

            paragraphs = pdf_dict['abstract'] + pdf_dict['body_text']

            # (3) loop over paragraphs, apply nlp to title and abstract
            full_text=""
            for paragraph in paragraphs:

                # (4) loop over text in this paragraph and append together
                
                for text_span in paragraph['text']:
                    #append
                    full_text += " "
                    full_text += text_span
                
            #run nlp on text for paper
            doc=nlp(full_text)
            #save results
                
            for token in doc.ents:
                pdf_results.append([paper_id,token.text,token.label_,full_text])
        
nlp_pdf_df=pd.DataFrame(pdf_results, columns=cols)

nlp_pdf_df.to_csv('C:/Users/wb469649/Documents/Github/s2orc/data/sample_nlp.csv')


TypeError: argument of type 'NoneType' is not iterable

In [32]:
line

'{"paper_id": "18980190", "title": "Satellite Image Resolution Enhancement Using Image Fusion Method Based on DSW and TFT-CW Transforms", "authors": [{"first": "P", "middle": ["V"], "last": "Ritu", "suffix": ""}, {"first": "Karun", "middle": ["M"], "last": "Tech", "suffix": ""}, {"first": "Bala", "middle": ["Krishna"], "last": "Ch", "suffix": ""}], "abstract": null, "year": null, "arxiv_id": null, "acl_id": null, "pmc_id": null, "pubmed_id": null, "doi": null, "venue": null, "journal": null, "has_pdf_body_text": false, "mag_id": null, "mag_field_of_study": null, "outbound_citations": ["20845360", "6627072", "9760560", "15705701", "15330970", "18456373", "3035177", "18314517", "5481423", "122822095", "9060120", "61939206", "5481423"], "inbound_citations": [], "has_outbound_citations": true, "has_inbound_citations": false, "has_pdf_parse": true, "has_pdf_parsed_abstract": false, "has_pdf_parsed_body_text": false, "has_pdf_parsed_bib_entries": true, "has_pdf_parsed_ref_entries": false, "s

In [16]:
pdf_dict

{'paper_id': '77499681',
 '_pdf_hash': '11f281316fe4638843a83cf559ce4f60aade00f8',
 'abstract': [{'section': 'Abstract',
   'text': 'The purpose of this study is to evaluate the effects of teriparatide administration on fracture healing after intramedullary nailing in atypical femoral fractures. Materials and Methods: We retrospectively reviewed 26 patients (26 cases) with atypical femoral fracture who were treated using intramedullary nailing between January 2009 and December 2013. Teriparatide was not administered to 15 patients (non-injection group) and was administered to 11 patients after surgery (injection group). Clinical results were assessed using the Nakajima score and the visual analogue scale (VAS). Radiographic results were compared for the time of callus formation, callus bridge formation, and bone union between the groups. Results: Time to recover walking ability and to decrease pain in the surgery region (VAS≤2) were significantly shorter in the injection group than in 

In [7]:
# explore pdf_parses

# feel free to wrap this into a larger loop for batches 0~99
BATCH_ID = 0
meta_path='C:/Users/wb469649/Documents/Github/s2orc/data/metadata/sample.jsonl'
pdf_path='C:/Users/wb469649/Documents/Github/s2orc/data/pdf_parses/sample.jsonl'


# create a lookup for the pdf parse based on paper ID
paper_id_to_pdf_parse = {}
with open(pdf_path) as f_pdf:
    for line in f_pdf:
        pdf_parse_dict = json.loads(line)
        paper_id_to_pdf_parse[pdf_parse_dict['paper_id']] = pdf_parse_dict

# filter papers using metadata values
citation_contexts = []
with open(meta_path) as f_meta:
    for line in f_meta:
        metadata_dict = json.loads(line)
        paper_id = metadata_dict['paper_id']
        print(f"Currently viewing S2ORC paper: {paper_id}")
        
        # suppose we only care about ACL anthology papers
        if not metadata_dict['acl_id']:
            continue
    
        # and we want only papers with resolved outbound citations
        if not metadata_dict['has_outbound_citations']:
            continue
        
        # get citation context (paragraphs)!
        if paper_id in paper_id_to_pdf_parse:
            # (1) get the full pdf parse from the previously computed lookup dict
            pdf_parse = paper_id_to_pdf_parse[paper_id]
            
            # (2) pull out fields we need from the pdf parse, including bibliography & text
            bib_entries = pdf_parse['bib_entries']
            paragraphs = pdf_parse['abstract'] + pdf_parse['body_text']

            # (3) loop over paragraphs, grabbing citation contexts
            for paragraph in paragraphs:
                
                # (4) loop over each inline citation in this paragraph
                for text in paragraph['cite_spans']:
                    
                    # (5) each inline citation can be resolved to a bib entry
                    cited_bib_entry = bib_entries[cite_span['ref_id']]
                    
                    # (6) that bib entry *may* be linked to a S2ORC paper.  if so, grab paragraph
                    linked_paper_id = cited_bib_entry['link']
                    if linked_paper_id:
                        citation_contexts.append({
                            'citing_paper_id': paper_id,
                            'cited_paper_id': linked_paper_id,
                            'context': paragraph['text'],
                            'citation_mention_start': cite_span['start'],
                            'citation_mention_end': cite_span['end'],
                        })

Currently viewing S2ORC paper: 77490025
Currently viewing S2ORC paper: 77490084
Currently viewing S2ORC paper: 77490191
Currently viewing S2ORC paper: 77490289
Currently viewing S2ORC paper: 77490322
Currently viewing S2ORC paper: 77490340
Currently viewing S2ORC paper: 77490349
Currently viewing S2ORC paper: 77490515
Currently viewing S2ORC paper: 77490720
Currently viewing S2ORC paper: 77490837
Currently viewing S2ORC paper: 77490849
Currently viewing S2ORC paper: 77490871
Currently viewing S2ORC paper: 77491094
Currently viewing S2ORC paper: 77491174
Currently viewing S2ORC paper: 77491215
Currently viewing S2ORC paper: 77491274
Currently viewing S2ORC paper: 77491423
Currently viewing S2ORC paper: 77491534
Currently viewing S2ORC paper: 77491577
Currently viewing S2ORC paper: 77491863
Currently viewing S2ORC paper: 77491866
Currently viewing S2ORC paper: 77491955
Currently viewing S2ORC paper: 77491964
Currently viewing S2ORC paper: 77492035
Currently viewing S2ORC paper: 77492214


In [8]:
paper_id_to_pdf_parse

{'77499681': {'paper_id': '77499681',
  '_pdf_hash': '11f281316fe4638843a83cf559ce4f60aade00f8',
  'abstract': [{'section': 'Abstract',
    'text': 'The purpose of this study is to evaluate the effects of teriparatide administration on fracture healing after intramedullary nailing in atypical femoral fractures. Materials and Methods: We retrospectively reviewed 26 patients (26 cases) with atypical femoral fracture who were treated using intramedullary nailing between January 2009 and December 2013. Teriparatide was not administered to 15 patients (non-injection group) and was administered to 11 patients after surgery (injection group). Clinical results were assessed using the Nakajima score and the visual analogue scale (VAS). Radiographic results were compared for the time of callus formation, callus bridge formation, and bone union between the groups. Results: Time to recover walking ability and to decrease pain in the surgery region (VAS≤2) were significantly shorter in the injectio

In [9]:
citation_contexts

[]

## Borrowed Code

Original Code from S2ORC which was an example to grab citation information within a paragraph.

In [10]:
# explore pdf_parses
import os
import json

# feel free to wrap this into a larger loop for batches 0~99
BATCH_ID = 0
meta_path='C:/Users/wb469649/Documents/Github/s2orc/data/metadata/sample.jsonl'
pdf_path='C:/Users/wb469649/Documents/Github/s2orc/data/pdf_parses/sample.jsonl'


# create a lookup for the pdf parse based on paper ID
paper_id_to_pdf_parse = {}
with open(pdf_path) as f_pdf:
    for line in f_pdf:
        pdf_parse_dict = json.loads(line)
        paper_id_to_pdf_parse[pdf_parse_dict['paper_id']] = pdf_parse_dict

# filter papers using metadata values
citation_contexts = []
with open(meta_path) as f_meta:
    for line in f_meta:
        metadata_dict = json.loads(line)
        paper_id = metadata_dict['paper_id']
        print(f"Currently viewing S2ORC paper: {paper_id}")
        
        # suppose we only care about ACL anthology papers
        if not metadata_dict['acl_id']:
            continue
    
        # and we want only papers with resolved outbound citations
        if not metadata_dict['has_outbound_citations']:
            continue
        
        # get citation context (paragraphs)!
        if paper_id in paper_id_to_pdf_parse:
            # (1) get the full pdf parse from the previously computed lookup dict
            pdf_parse = paper_id_to_pdf_parse[paper_id]
            
            # (2) pull out fields we need from the pdf parse, including bibliography & text
            bib_entries = pdf_parse['bib_entries']
            paragraphs = pdf_parse['abstract'] + pdf_parse['body_text']

            # (3) loop over paragraphs, grabbing citation contexts
            for paragraph in paragraphs:
                
                # (4) loop over each inline citation in this paragraph
                for cite_span in paragraph['cite_spans']:
                    
                    # (5) each inline citation can be resolved to a bib entry
                    cited_bib_entry = bib_entries[cite_span['ref_id']]
                    
                    # (6) that bib entry *may* be linked to a S2ORC paper.  if so, grab paragraph
                    linked_paper_id = cited_bib_entry['link']
                    if linked_paper_id:
                        citation_contexts.append({
                            'citing_paper_id': paper_id,
                            'cited_paper_id': linked_paper_id,
                            'context': paragraph['text'],
                            'citation_mention_start': cite_span['start'],
                            'citation_mention_end': cite_span['end'],
                        })

Currently viewing S2ORC paper: 77490025
Currently viewing S2ORC paper: 77490084
Currently viewing S2ORC paper: 77490191
Currently viewing S2ORC paper: 77490289
Currently viewing S2ORC paper: 77490322
Currently viewing S2ORC paper: 77490340
Currently viewing S2ORC paper: 77490349
Currently viewing S2ORC paper: 77490515
Currently viewing S2ORC paper: 77490720
Currently viewing S2ORC paper: 77490837
Currently viewing S2ORC paper: 77490849
Currently viewing S2ORC paper: 77490871
Currently viewing S2ORC paper: 77491094
Currently viewing S2ORC paper: 77491174
Currently viewing S2ORC paper: 77491215
Currently viewing S2ORC paper: 77491274
Currently viewing S2ORC paper: 77491423
Currently viewing S2ORC paper: 77491534
Currently viewing S2ORC paper: 77491577
Currently viewing S2ORC paper: 77491863
Currently viewing S2ORC paper: 77491866
Currently viewing S2ORC paper: 77491955
Currently viewing S2ORC paper: 77491964
Currently viewing S2ORC paper: 77492035
Currently viewing S2ORC paper: 77492214


In [11]:
variable = f_pdf[1].loads
print(variable)


TypeError: '_io.TextIOWrapper' object is not subscriptable