Take a paper, for example the BERT paper that was published last year:
https://www.semanticscholar.org/paper/BERT%3A-Pre-training-of-Deep-Bidirectional-for-Devlin-Chang/df2b0e26d0599ce3e70df8a9da02e51594e0e992  

Our goal is to obtain a summary of what other papers say about this paper.

To achieve that, we need:
1. A list of all the papers that cite this paper (600+ of them in the case of BERT)
2. A PDF version of each of these papers
3. In each of these citing PDFs, the precise location at which the BERT paper is cited

To obtain 1, we use the Semantic Scholar API: http://api.semanticscholar.org/v1/paper/df2b0e26d0599ce3e70df8a9da02e51594e0e992 (see the "citations" field).
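
As a minimal sketch of step 1, the snippet below pulls the citing papers out of a record shaped like the response used in this notebook. The `sample_response` dictionary is a hand-made stand-in rather than real API output, and `list_citing_papers` is a hypothetical helper name; only the field names (`citations`, `title`, `paperId`, `arxivId`) come from the notebook itself.

```python
def list_citing_papers(paper):
    """Return (title, paperId, arxivId) tuples from a paper record's "citations" field."""
    return [
        (c.get("title"), c.get("paperId"), c.get("arxivId"))
        for c in paper.get("citations", [])
    ]

# Hand-made stand-in for the JSON returned by the endpoint (illustrative values only).
sample_response = {
    "title": "Example cited paper",
    "citations": [
        {"title": "Citing paper A", "paperId": "aaa", "arxivId": "1905.05538"},
        {"title": "Citing paper B", "paperId": "bbb", "arxivId": None},
    ],
}

citing = list_citing_papers(sample_response)
print(len(citing), "citing papers")
for title, paper_id, arxiv_id in citing:
    print(title, paper_id, arxiv_id)
```

Papers without an arXiv version keep `None` in the last position, which matters for the download step.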

To obtain 2, we can download the PDFs directly from arXiv: https://arxiv.org/pdf/[arxiv ID].pdf, for example: https://arxiv.org/pdf/1907.04829.pdf
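
The URL pattern above can be sketched as a small helper. `ARXIV_PDF_TEMPLATE` and `arxiv_pdf_urls` are illustrative names, and treating `None` or an empty string as "no arXiv version" is an assumption about how missing IDs appear in the citation list.

```python
ARXIV_PDF_TEMPLATE = "https://arxiv.org/pdf/{arxiv_id}.pdf"

def arxiv_pdf_urls(arxiv_ids):
    """Build one PDF URL per usable arXiv ID, skipping missing (None or empty) IDs."""
    return [
        ARXIV_PDF_TEMPLATE.format(arxiv_id=a_id)
        for a_id in arxiv_ids
        if a_id
    ]

# Example IDs; None stands for a citing paper that has no arXiv version.
urls = arxiv_pdf_urls(["1907.04829", None, "1905.05538"])
print(urls)
```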

To obtain 3, I recommend GROBID: https://github.com/kermitt2/grobid . You can try out the demo by uploading a PDF here: http://cloud.science-miner.com/grobid/
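
To make step 3 more concrete, here is a rough sketch of how a citation can be located in TEI XML of the kind GROBID produces: each bibliography entry is a `biblStruct` carrying an `xml:id`, and each in-text citation is a `ref` whose `target` points at that id. The `SAMPLE_TEI` document is a hand-written miniature, not real GROBID output, and `find_citation_id`, `count_mentions` and the 0.9 threshold are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hand-written miniature of a TEI document (not real GROBID output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <p>We build on contextual embeddings <ref type="bibr" target="#b1">[2]</ref> in this work.</p>
    </body>
    <back>
      <listBibl>
        <biblStruct xml:id="b0"><analytic><title>Some other paper</title></analytic></biblStruct>
        <biblStruct xml:id="b1"><analytic><title>Deep contextualized word representations</title></analytic></biblStruct>
      </listBibl>
    </back>
  </text>
</TEI>"""

def find_citation_id(root, title, threshold=0.9):
    """Return the xml:id of the bibliography entry whose title best matches `title`."""
    best_id, best_score = None, 0.0
    for bib in root.iter(TEI + "biblStruct"):
        title_node = bib.find(".//" + TEI + "title")
        # Skip entries that have no usable title text.
        if title_node is None or not title_node.text:
            continue
        score = SequenceMatcher(
            None, title.strip().lower(), title_node.text.strip().lower()
        ).ratio()
        if score > best_score:
            best_id, best_score = bib.get(XML_ID), score
    return best_id if best_score >= threshold else None

def count_mentions(root, bib_id):
    """Count in-text <ref> elements that point at the given bibliography id."""
    return sum(1 for ref in root.iter(TEI + "ref") if ref.get("target") == f"#{bib_id}")

root = ET.fromstring(SAMPLE_TEI)
bib_id = find_citation_id(root, "Deep Contextualized Word Representations")
print(bib_id, count_mentions(root, bib_id))
```

Fuzzy matching on the title is used because the extracted bibliography title rarely matches the catalogue title character for character.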


In [435]:
import numpy as np
import pandas as pd
import requests
import json


# A paper can be identified by its arXiv ID or by its Semantic Scholar paper ID.
# The two commented-out URLs point at the BERT paper; the active one queries arXiv:1802.05365 (ELMo).
#url = "http://api.semanticscholar.org/v1/paper/arXiv:1810.04805"
#url = "http://api.semanticscholar.org/v1/paper/df2b0e26d0599ce3e70df8a9da02e51594e0e992"
url = "http://api.semanticscholar.org/v1/paper/arXiv:1802.05365"
JSONContent = requests.get(url).json()

# json.dumps() serializes the parsed response into a pretty-printed string.
content = json.dumps(JSONContent, indent=4, sort_keys=True)

print(content)


{
    "abstract": "We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pretrained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pretrained network is crucial, allowing downstream models to mix different types of semi-supervision signals.",
    "arxivId": "1802.05365",
    "authors": [
        {
            "authorId": "39139825",
            "name": "Matthew E. Peters",
            "url": "https://www.semantic

In [436]:
# Parse the JSON string back into a dictionary so that we can access its keys and values, then build DataFrames from it.
my_dict = json.loads(content)

authors = my_dict["authors"]
citation_vel = my_dict["citationVelocity"]
citations = my_dict["citations"]

# Build a DataFrame of authors with pandas.DataFrame.from_dict(), keeping only the names.
auth_df = pd.DataFrame.from_dict(authors)
auth_df = auth_df.drop(columns=['authorId', 'url'])

# Build a DataFrame of citations to get the list of papers that cite this paper.
# We keep the title and the intent, assuming that intent indicates where in the citing paper this paper is cited.
citations_df = pd.DataFrame.from_dict(citations)
paper_df = citations_df[['title', 'intent', 'url']]
paper_url = citations_df[['url', 'paperId', 'arxivId']]
paper_arxivId = citations_df[['arxivId']]
paper_df
paper_url
#paper_arxivId
#auth_df

   url                                                paperId                                   arxivId
0  https://www.semanticscholar.org/paper/fea13dcb...  fea13dcb73f9754ca498ef700769ec21793e6870  1905.05538
1  https://www.semanticscholar.org/paper/425e249c...  425e249c1c91128c016d27ce353540aed19eea7e
2  https://www.semanticscholar.org/paper/fca28d8d...  fca28d8de84cae5bc8f48cb8cfc5cedf106f62a1  1808.06305
3  https://www.semanticscholar.org/paper/68acfb44...  68acfb44aed9a138b0693facadda025d90693f61  1904.12683
4  https://www.semanticscholar.org/paper/8492269d...  8492269d2bb474d57d6def97efcf86c42735554a  1908.03548
5  https://www.semanticscholar.org/paper/bd40bc8f...  bd40bc8f5b32f3b72495c5fdd83719796a212cd1
6  https://www.semanticscholar.org/paper/39c4e608...  39c4e6082cca859d9277af126b6c7d3418da27c1
7  https://www.semanticscholar.org/paper/7db30882...  7db308823bdd2b297da2c56096c5dd57b65c6152  1906.07854
8  https://www.semanticscholar.org/paper/69ab4622...  69ab4622c2cad8a6e03c5061cb1a563e57ff0e31
9  https://www.semanticscholar.org/paper/61d24c45...  61d24c45e41f4f6cfc3d9da2f03efde1862b839e  1904.00585


In [437]:
paper_arxivId = citations_df[['arxivId' ]].head(10)


## Download PDF version of papers
We need a PDF version of each citing paper. We can download them directly from arXiv: https://arxiv.org/pdf/[arxiv ID].pdf, for example: https://arxiv.org/pdf/1907.04829.pdf

In [438]:
!rm -f input/*

In [439]:
import os
from multiprocessing.pool import ThreadPool

import requests

INPUT_DIR = './input'
# Make sure the target directory exists before any thread writes to it.
os.makedirs(INPUT_DIR, exist_ok=True)

def download_url(url):
    print("downloading: ",url)
    # Assumes that the last segment after the / represents the file name:
    # if url is abc/xyz/file.txt, the file name will be file.txt
    file_name_start_pos = url.rfind("/") + 1
    file_name = url[file_name_start_pos:]
    file_path = os.path.join(INPUT_DIR, file_name)

    r = requests.get(url, stream=True)
    # Write the file only when the request succeeded; failed downloads are skipped silently.
    if r.status_code == requests.codes.ok:
        with open(file_path, 'wb') as f:
            for data in r.iter_content(chunk_size=8192):
                f.write(data)
    return url

# Keep only the citing papers that have an arXiv ID.
arxiv_ids = [a_id for a_id in citations_df['arxivId'] if a_id is not None]
urls = [f'https://arxiv.org/pdf/{a_id}.pdf' for a_id in arxiv_ids]

# Cap the number of downloads.
urls = urls[:999]
print(f"{len(urls)} files to be downloaded.")

# Run 5 worker threads. Each call takes the next element of the urls list,
# and results are yielded in completion order, not in input order.
results = ThreadPool(5).imap_unordered(download_url, urls)
for r in results:
    print(r)

605 files to be downloaded.
downloading:  https://arxiv.org/pdf/1905.05538.pdf
downloading:  https://arxiv.org/pdf/1808.06305.pdf
downloading:  https://arxiv.org/pdf/1904.12683.pdf
downloading:  https://arxiv.org/pdf/1908.03548.pdf
downloading:  https://arxiv.org/pdf/1906.07854.pdf
downloading: https://arxiv.org/pdf/1808.06305.pdf
 https://arxiv.org/pdf/1904.00585.pdf
downloading:  https://arxiv.org/pdf/1901.11504.pdf
https://arxiv.org/pdf/1904.12683.pdf
downloading: https://arxiv.org/pdf/1905.05538.pdf
 https://arxiv.org/pdf/1810.05201.pdf
downloading: https://arxiv.org/pdf/1908.03548.pdf
 https://arxiv.org/pdf/1906.05149.pdf
downloading: https://arxiv.org/pdf/1906.07854.pdf
 https://arxiv.org/pdf/1905.10892.pdf
downloading: https://arxiv.org/pdf/1904.00585.pdf https://arxiv.org/pdf/1811.08705.pdf

downloading:  https://arxiv.org/pdf/1902.10985.pdf
https://arxiv.org/pdf/1810.05201.pdf
downloading: https://arxiv.org/pdf/1901.11504.pdf https://arxiv.org/pdf/1808.09147.pdf

downloading: 

downloading: https://arxiv.org/pdf/1908.06931.pdf
 https://arxiv.org/pdf/1907.13337.pdf
downloading: https://arxiv.org/pdf/1902.10296.pdf
 https://arxiv.org/pdf/1904.09545.pdf
downloading: https://arxiv.org/pdf/1905.06638.pdf
 https://arxiv.org/pdf/1907.02030.pdf
downloading:  https://arxiv.org/pdf/1905.13453.pdf
https://arxiv.org/pdf/1903.04933.pdf
downloading: https://arxiv.org/pdf/1905.02331.pdf
 https://arxiv.org/pdf/1902.01069.pdf
downloading: https://arxiv.org/pdf/1907.13337.pdf
 https://arxiv.org/pdf/1907.00464.pdf
downloading: https://arxiv.org/pdf/1904.09545.pdf
 https://arxiv.org/pdf/1907.04944.pdf
downloading:  https://arxiv.org/pdf/1904.12848.pdf
https://arxiv.org/pdf/1907.02030.pdf
downloading: https://arxiv.org/pdf/1905.13453.pdf
 https://arxiv.org/pdf/1906.04341.pdf
downloading:  https://arxiv.org/pdf/1903.10104.pdf
https://arxiv.org/pdf/1902.01069.pdf
downloading:  https://arxiv.org/pdf/1903.05260.pdf
https://arxiv.org/pdf/1907.00464.pdf
downloading:  https://arxiv.org/

downloading:  https://arxiv.org/pdf/1904.09636.pdf
https://arxiv.org/pdf/1904.03084.pdf
downloading: https://arxiv.org/pdf/1905.12741.pdf https://arxiv.org/pdf/1904.08783.pdf

downloading:  https://arxiv.org/pdf/1807.10675.pdf
https://arxiv.org/pdf/1901.11117.pdf
downloading: https://arxiv.org/pdf/1902.09492.pdf
 https://arxiv.org/pdf/1809.02731.pdf
downloading:  https://arxiv.org/pdf/1806.06259.pdf
https://arxiv.org/pdf/1904.09636.pdf
downloading:  https://arxiv.org/pdf/1809.02279.pdf
https://arxiv.org/pdf/1904.08783.pdf
downloading: https://arxiv.org/pdf/1901.06796.pdf
 https://arxiv.org/pdf/1908.09982.pdf
downloading:  https://arxiv.org/pdf/1810.10045.pdf
https://arxiv.org/pdf/1807.10675.pdf
downloading:  https://arxiv.org/pdf/1908.07262.pdf
https://arxiv.org/pdf/1806.06259.pdf
downloading:  https://arxiv.org/pdf/1811.08600.pdf
https://arxiv.org/pdf/1809.02279.pdf
downloading:  https://arxiv.org/pdf/1908.05957.pdf
https://arxiv.org/pdf/1809.02731.pdf
downloading:  https://arxiv.org/

downloading:  https://arxiv.org/pdf/1812.10464.pdf
https://arxiv.org/pdf/1811.01088.pdf
downloading: https://arxiv.org/pdf/1810.11067.pdf https://arxiv.org/pdf/1908.07750.pdf

downloading:  https://arxiv.org/pdf/1908.05762.pdf
https://arxiv.org/pdf/1811.10773.pdf
downloading:  https://arxiv.org/pdf/1812.11760.pdf
https://arxiv.org/pdf/1812.10464.pdf
downloading:  https://arxiv.org/pdf/1908.05763.pdf
https://arxiv.org/pdf/1808.07036.pdf
downloading: https://arxiv.org/pdf/1904.01098.pdf
 https://arxiv.org/pdf/1906.11604.pdf
downloading:  https://arxiv.org/pdf/1908.06121.pdf
https://arxiv.org/pdf/1908.05762.pdf
downloading:  https://arxiv.org/pdf/1905.05682.pdf
https://arxiv.org/pdf/1812.11760.pdf
downloading: https://arxiv.org/pdf/1908.05763.pdf
 https://arxiv.org/pdf/1905.03329.pdf
downloading:  https://arxiv.org/pdf/1808.01371.pdf
https://arxiv.org/pdf/1906.11604.pdf
downloading: https://arxiv.org/pdf/1908.06121.pdf
 https://arxiv.org/pdf/1905.05950.pdf
downloading:  https://arxiv.org/

downloading: https://arxiv.org/pdf/1906.03608.pdf https://arxiv.org/pdf/1904.09223.pdf

downloading:  https://arxiv.org/pdf/1809.05053.pdf
https://arxiv.org/pdf/1812.09449.pdf
downloading:  https://arxiv.org/pdf/1905.13370.pdf
https://arxiv.org/pdf/1906.05807.pdf
downloading:  https://arxiv.org/pdf/1906.04473.pdf
https://arxiv.org/pdf/1908.04728.pdf
downloading:  https://arxiv.org/pdf/1902.00164.pdf
https://arxiv.org/pdf/1904.09223.pdf
downloading:  https://arxiv.org/pdf/1812.00686.pdf
https://arxiv.org/pdf/1907.13362.pdf
downloading:  https://arxiv.org/pdf/1906.06606.pdf
https://arxiv.org/pdf/1809.05053.pdf
downloading: https://arxiv.org/pdf/1905.13370.pdf
 https://arxiv.org/pdf/1804.07461.pdf
downloading:  https://arxiv.org/pdf/1811.00147.pdf
https://arxiv.org/pdf/1902.00164.pdf
downloading:  https://arxiv.org/pdf/1906.01604.pdf
https://arxiv.org/pdf/1812.00686.pdf
downloading:  https://arxiv.org/pdf/1810.10176.pdf
https://arxiv.org/pdf/1906.04473.pdf
downloading:  https://arxiv.org/

downloading:  https://arxiv.org/pdf/1902.05770.pdf
https://arxiv.org/pdf/1808.09367.pdf
downloading:  https://arxiv.org/pdf/1901.03438.pdf
https://arxiv.org/pdf/1906.04043.pdf
downloading:  https://arxiv.org/pdf/1808.09653.pdf
https://arxiv.org/pdf/1907.04347.pdf
downloading:  https://arxiv.org/pdf/1810.08740.pdf
https://arxiv.org/pdf/1903.05987.pdf
downloading: https://arxiv.org/pdf/1901.03438.pdf
 https://arxiv.org/pdf/1811.07691.pdf
downloading:  https://arxiv.org/pdf/1812.01260.pdf
https://arxiv.org/pdf/1808.09653.pdf
downloading:  https://arxiv.org/pdf/1905.11658.pdf
https://arxiv.org/pdf/1810.08740.pdf
downloading:  https://arxiv.org/pdf/1907.12412.pdf
https://arxiv.org/pdf/1808.09111.pdf
downloading: https://arxiv.org/pdf/1902.05770.pdf
 https://arxiv.org/pdf/1908.05854.pdf
downloading:  https://arxiv.org/pdf/1904.09380.pdf
https://arxiv.org/pdf/1811.07691.pdf
downloading:  https://arxiv.org/pdf/1908.06606.pdfhttps://arxiv.org/pdf/1907.12412.pdf

downloading:  https://arxiv.org/

downloading:  https://arxiv.org/pdf/1905.11471.pdf
https://arxiv.org/pdf/1907.10726.pdf
downloading: https://arxiv.org/pdf/1905.06655.pdf https://arxiv.org/pdf/1812.10814.pdf

downloading:  https://arxiv.org/pdf/1807.10857.pdf
https://arxiv.org/pdf/1811.04210.pdf
downloading:  https://arxiv.org/pdf/1808.10503.pdf
https://arxiv.org/pdf/1905.11471.pdf
downloading:  https://arxiv.org/pdf/1808.07644.pdf
https://arxiv.org/pdf/1904.09675.pdf
downloading:  https://arxiv.org/pdf/1712.03556.pdf
https://arxiv.org/pdf/1902.05196.pdf
downloading:  https://arxiv.org/pdf/1808.05326.pdf
https://arxiv.org/pdf/1812.10814.pdf
downloading:  https://arxiv.org/pdf/1906.06947.pdf
https://arxiv.org/pdf/1807.10857.pdf
downloading: https://arxiv.org/pdf/1808.10503.pdf
 https://arxiv.org/pdf/1908.05620.pdf
downloading:  https://arxiv.org/pdf/1809.01682.pdf
https://arxiv.org/pdf/1712.03556.pdf
downloading:  https://arxiv.org/pdf/1906.06253.pdf
https://arxiv.org/pdf/1808.07644.pdf
downloading: https://arxiv.org/p

In [68]:
# run only once
# !git clone https://github.com/kermitt2/grobid-client-python.git

In [440]:
!rm -f output/*

In [441]:
!cd grobid-client-python && python grobid-client.py --input ../input --output ../output processFulltextDocument --force

GROBID server is up and running
605 PDF files to process
../input/1811.10052.pdf
../input/1903.11245.pdf
../input/1904.08770.pdf
../input/1908.05426.pdf
../input/1906.05149.pdf
../input/1902.06000.pdf
../input/1908.06926.pdf
../input/1906.01515.pdf
../input/1808.09500.pdf
../input/1806.02847.pdf
../input/1904.01608.pdf
../input/1905.13125.pdf
../input/1903.04933.pdf
../input/1907.02679.pdf
../input/1906.00346.pdf
../input/1906.11511.pdf
../input/1907.03228.pdf
../input/1809.02796.pdf
../input/1805.04032.pdf
../input/1901.10125.pdf
../input/1906.06253.pdf
../input/1805.04218.pdf
../input/1906.00839.pdf
../input/1906.00742.pdf
../input/1903.02953.pdf
../input/1907.10136.pdf
../input/1808.09111.pdf
../input/1906.04726.pdf
../input/1810.05788.pdf
../input/1907.04355.pdf
../input/1906.08646.pdf
../input/1809.06309.pdf
../input/1901.11429.pdf
../input/1902.04574.pdf
../input/1903.11508.pdf
Processing failed with error 500
../input/1906.03088.pdf
../input/1904.11544.pdf
../input/1908.05828.pd

../input/1902.04260.pdf
../input/1808.03894.pdf
../input/1902.00751.pdf
../input/1808.09422.pdf
../input/1906.03591.pdf
../input/1904.01500.pdf
../input/1905.00537.pdf
../input/1904.02141.pdf
../input/1907.01118.pdf
../input/1906.01543.pdf
../input/1812.06083.pdf
../input/1906.06947.pdf
../input/1810.11190.pdf
../input/1811.10830.pdf
../input/1804.07888.pdf
../input/1904.08109.pdf
../input/1808.09543.pdf
../input/1807.06234.pdf
../input/1905.13370.pdf
../input/1810.06638.pdf
../input/1808.07036.pdf
../input/1907.12763.pdf
../input/1808.09147.pdf
../input/1904.03100.pdf
../input/1908.05854.pdf
../input/1811.00570.pdf
../input/1903.04153.pdf
../input/1904.01098.pdf
../input/1901.11333.pdf
../input/1907.03491.pdf
../input/1810.03352.pdf
../input/1905.00078.pdf
../input/1804.07726.pdf
../input/1906.04772.pdf
../input/1906.00266.pdf
../input/1907.08922.pdf
../input/1904.02181.pdf
../input/1906.00138.pdf
../input/1904.02142.pdf
../input/1905.05538.pdf
../input/1811.05542.pdf
../input/1812.01

In [71]:
# run only once
# import nltk
# nltk.download('punkt')

In [442]:
import xml.etree.ElementTree as ET

def match_all_bib_nodes(curr_node, acc):
    new_list = []
    if curr_node.tag == '{http://www.tei-c.org/ns/1.0}biblStruct':
        new_list += [curr_node]
    else:
        children = list(curr_node)
        for child in children:
            new_list += match_all_bib_nodes(child, acc)
    return acc + new_list

def get_title_node(curr_node):
    if curr_node.tag == '{http://www.tei-c.org/ns/1.0}title':
        return curr_node
    else:
        children = list(curr_node)
        for child in children:
            n = get_title_node(child)
            if n is not None:
                return n
        return None
    
def get_ref(bib_node):
    ID_KEY = '{http://www.w3.org/XML/1998/namespace}id'
    if ID_KEY not in bib_node.attrib.keys():
        return None
    else:
        return bib_node.attrib[ID_KEY]

from difflib import SequenceMatcher

def get_matching_score(title1: str, title2: str) -> float:
    return SequenceMatcher(None, 
                           title1.strip().lower(), 
                           title2.strip().lower()
                          ).ratio()

    
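def get_citation_number(title, root):
    # Returns (xml:id, matched title) of the bibliography entry whose title best
    # matches `title`, or (None, None) when no entry scores above THRESHOLD.
    THRESHOLD = 0.98
    bib_nodes = match_all_bib_nodes(root, [])
    titles_by_ref = {}
    for n in bib_nodes:
        ref = get_ref(n)
        title_node = get_title_node(n)
        # Skip entries without an xml:id or without a usable title, so that one
        # incomplete entry does not abort the lookup for the whole document.
        if ref is None or title_node is None or title_node.text is None:
            continue
        titles_by_ref[ref] = title_node.text
    if not titles_by_ref:
        return None, None
    scores = {
        k: get_matching_score(title, item)
        for k, item in titles_by_ref.items()
    }
    max_key = max(scores, key=scores.get)
    if scores[max_key] > THRESHOLD:
        return max_key, titles_by_ref[max_key]
    return None, None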

def get_xml_node(xmlpath: str):
    tree = ET.parse(xmlpath)
    root = tree.getroot()
    return root

def find_ref_nodes(curr, ref):
    if match_ref_node(curr, ref):
        return [curr]
    children = list(curr)
    matched_nodes = []
    for child in children:
        newfound = find_ref_nodes(child, ref)
        matched_nodes += newfound
    return matched_nodes

def match_ref_node(curr_node, ref):
    return curr_node.tag == '{http://www.tei-c.org/ns/1.0}ref' and \
           'target' in curr_node.attrib and \
           curr_node.attrib['target'] == f'#{ref}'

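def get_ref_paragraph(ref_node, root):
    # Map every element to its parent. Element.getiterator() was removed in Python 3.9, so use iter().
    parent_map = dict((c, p) for p in root.iter() for c in p)
    parent = parent_map[ref_node]
    # '<REF>' is inserted once, right after the parent's leading text, followed by the
    # tail text of every child element. parent.text can be None when the paragraph
    # starts with a child element.
    sentences = ''.join([parent.text or ''] + ['<REF>'] + [n.tail for n in list(parent) if n.tail is not None])
    return sentences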

def get_paragraphs_by_filepath(xmlpath: str, title: str):
    root = get_xml_node(xmlpath)
    ref, _ = get_citation_number(title, root)
    ref_nodes = find_ref_nodes(root, ref)
    sentences = [
        get_ref_paragraph(rn, root) for rn in ref_nodes
    ]
    return sentences

# xmlpath = "output_style_transfer/4eece2039e953d4403ca2199d996469e877b.tei.xml"
# get_paragraphs_by_filepath(xmlpath, title)

from os import listdir
from os.path import isfile, join

def get_xml_files_in_dir(xmldir: str):
    onlyfiles = [join(xmldir, f) for f in listdir(xmldir) if isfile(join(xmldir, f))]
    return onlyfiles

from nltk.tokenize import sent_tokenize

def extract_ref_sentence_from_paragraph(para: str, ref_token = '<REF>'):
    # Split the paragraph into sentences and return the first one that contains the marker.
    sentences = sent_tokenize(para)
    sentences_containing_ref = [s for s in sentences if ref_token in s]
    return sentences_containing_ref[0]

def get_first_n_ref_sentences_by_dir(n: int, xmldir: str, title: str):
    xmlfiles = get_xml_files_in_dir(xmldir)
    paragraph_set = [
         get_paragraphs_by_filepath(xmlf, title) for xmlf in xmlfiles
    ]
    # Keep at most the first n citing paragraphs from each paper that cites the title.
    mention_para_groups = [
        paragraphs[0:n] for paragraphs in paragraph_set if len(paragraphs) > 0
    ]
    first_mention_sentences = [
        extract_ref_sentence_from_paragraph(para) 
        for para_group in mention_para_groups
        for para in para_group
    ]
    return first_mention_sentences


xmldirpath = './output'
title = my_dict['title']
NUMBER_OF_SENTENCES_PER_PAPER = 5

x = get_first_n_ref_sentences_by_dir(NUMBER_OF_SENTENCES_PER_PAPER, xmldirpath, title)
x

['For Question Match and ESIM we also experiment with ELMo <REF> which improved their score on Test with 0.4% and 1.8%.',
 'One of the greatest opportunities for further gains is through the use of context-sensitive word embeddings, such as ELMo <REF> and ULMfit .',
 'Modern deep learning algorithms often do away with feature engineering and learn latent representations directly from raw data that are given as input to Deep Neural Networks (DNNs) <REF>.',
 "Other variants have been proposed since Word2vec's initial public release, such as GloVe <REF>, ELMo , and BERT .",
 'Language Modeling recently gained in importance as it is being used as a base for transfer learning in multiple supervised tasks, obtaining impressive improvements over state-of-the-art <REF>.',
 'Natural language processing tasks show the best performance when transfer learning is applied either from an LSTM language model <REF> or from selfattention language models .',
 'Another approach to address inflections in P

In [443]:
import pandas as pd
df = pd.DataFrame(x, columns=["paragraph"])
df.to_csv('Elmo-605.csv', index=False)
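
When a paragraph cites several papers, it can help to place the `<REF>` marker at the one `ref` element being tracked rather than at the first citation in the paragraph. The sketch below shows one possible way to do that with ElementTree. `paragraph_with_marker` and `sentence_with_marker` are hypothetical helpers, the sample paragraph is hand-written, and the regex-based sentence split is a simple stand-in for `nltk.sent_tokenize`.

```python
import re
import xml.etree.ElementTree as ET

def paragraph_with_marker(parent, ref_node, marker="<REF>"):
    """Rebuild a paragraph's text, replacing only `ref_node` with the marker.

    Other child elements (for example other citations) are dropped, while
    the text that follows them is kept.
    """
    parts = [parent.text or ""]
    for child in parent:
        if child is ref_node:
            parts.append(marker)
        parts.append(child.tail or "")
    return "".join(parts)

def sentence_with_marker(paragraph, marker="<REF>"):
    """Return the first sentence containing the marker, or None if there is none."""
    # Naive split on sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
        if marker in sentence:
            return sentence
    return None

# Hand-written paragraph with three citations; we track the second one (#b2).
para = ET.fromstring(
    '<p>Earlier work used static vectors. Variants include GloVe '
    '<ref target="#b1">[1]</ref>, ELMo <ref target="#b2">[2]</ref>, '
    'and BERT <ref target="#b3">[3]</ref>.</p>'
)
target = next(r for r in para.iter("ref") if r.get("target") == "#b2")
text = paragraph_with_marker(para, target)
print(sentence_with_marker(text))
```

Comparing elements with `is` ensures that only the tracked citation receives the marker, even when the same reference is cited more than once in a paragraph.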