### About this script
This script is a 'running experiment' that scrapes paper metadata from sources such as Google Scholar, Semantic Scholar, Crossref and ResearchGate, so expect some spaghetti. The scraping pipeline consists of several stages, some of them technically optional:

1. Define the root paper (in this case, it will be X. Leroy's Coinductive big-step operational semantics).
2. [optional] Scrape *Google Scholar* and/or *Semantic Scholar* to get papers that cite the root paper.
3. [optional] Fill in part of the missing details by querying *Crossref*'s REST API, *ResearchGate* and/or *ACM*.
4. Produce a set of complete papers.
    
Once these steps are completed, we should have a mostly complete set of papers. Of course, there is no guarantee that every paper has a valid DOI, abstract and so on, so manual adjustment might still be necessary. Once we have a complete dataset, we compute some statistics about the papers, namely *tf-idf* on the documents' abstracts; this gives a list of relevant terms which we can order and show in relation to the papers.
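
As a rough illustration of the tf-idf step mentioned above, the sketch below scores terms in three made-up abstracts using only the standard library. The toy documents and the helper name `tf_idf` are assumptions for illustration; the notebook itself relies on scikit-learn's `TfidfVectorizer`, which uses a smoothed idf and normalisation, so its numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document (plain tf * log(N / df))."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Made-up abstracts, only for illustration.
docs = [
    "coinductive semantics in agda",
    "big-step semantics for compilers",
    "verified compilers in coq",
]
scores = tf_idf(docs)
# A term found in a single abstract outranks one shared by several.
for term, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(term, round(score, 3))
```

Terms that occur in only one abstract (here "agda") get a higher weight than terms shared across abstracts (here "semantics"), which is what makes the scores usable for ranking.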

In [1]:
from typing import Optional


class Article:
    author: str
    title: str
    doi: Optional[str]
    abstract: Optional[str]
    url: Optional[str]

    def __init__(
        self,
        author: str,
        title: str,
        doi: Optional[str] = None,
        abstract: Optional[str] = None,
        url: Optional[str] = None,
    ):
        self.author = author
        self.title = title
        self.doi = doi
        self.abstract = abstract
        self.url = url


import logging

logging.basicConfig()

# Comment out to avoid cluttering your screen
logging.getLogger().setLevel(logging.INFO)

base = Article(
    author="Xavier Leroy",
    title="Coinductive big-step operational semantics",
)

The following block controls the program: scrape or not, check Crossref or not, ...

In [2]:
cits_path = "./cits.json"

# (Re)do scraping or not. 
scrape = True

# Scrape Google Scholar
do_gs = scrape and True
# Scrape SemanticScholar
do_s2 = scrape and True

# (Re)do filling or not. 
fill = False

# Scrape Crossref
do_cf = fill and False
# Scrape ResearchGate
do_rg = fill and False
# Scrape ACM
do_acm = fill and False

# Do statistical analysis. 
stats = True

### Scrape Google Scholar and/or Semantic Scholar to get papers that cite the root paper.

In [3]:
import json

# Citations are keyed by title; the root paper is the first entry.
cits = {base.title: base}

if not scrape: 
    # Assume we already have a file to read citations from. 
    try:
        with open(cits_path) as f:
            articles = json.load(f)
            for article in articles:
                cits[article["title"]] = Article(
                    author=article["author"],
                    title=article["title"],
                    doi=article["doi"],
                    abstract=article["abstract"],
                    url = article["url"]
                )
    except BaseException:
        raise
else:
    # This will take a lot of time.
    if do_gs:
        # google scholar scraper
        from scholarly import scholarly

        # Avoid Google CAPTCHAs...
        from scholarly import ProxyGenerator

        pg = ProxyGenerator()
        
        # Change here if you don't want to use Tor.
        use_tor = True
        
        if use_tor:
            success = pg.Tor_Internal(tor_cmd="tor")
            if not success:
                raise RuntimeError("Tor not working...")
        else:
            pg.FreeProxies()
            import ssl

            ssl._create_default_https_context = ssl._create_unverified_context

        scholarly.use_proxy(pg)

        gs_base = scholarly.search_single_pub(base.title)
        logging.info(
            "fetched gscholar pub for base article " +
            str(base) +
            " : " +
            str(gs_base))
        assert gs_base["bib"]["title"] == base.title

        # Collect the Google Scholar entries that cite the base article.
        gs_cits = []
        for cit in scholarly.citedby(gs_base):
            logging.info(
                "citation "
                + str(cit["bib"]["author"][0])
                + " "
                + cit["bib"]["title"]
                + " cited base"
            )
            gs_cits.append(cit)

        for cit in gs_cits:
            author, title = cit["bib"]["author"][0], cit["bib"]["title"]
            cits[title] = Article(author=author, title=title)

    if do_s2:
        import s2
        import requests

        res = requests.get(
            "https://api.semanticscholar.org/graph/v1/paper/search?query="
            + base.title.replace(" ", "+").replace("-", "+")
            + "&limit=1"
        )
        s2_base = json.loads(res.text)
        logging.info(
            "fetched s2 pub for base article " +
            str(base) +
            " : " +
            str(s2_base))
        s2_base = s2_base["data"][0]
        assert s2_base["title"] == base.title
        s2_base = s2.api.get_paper(paperId=s2_base["paperId"])
        base.doi = s2_base.doi
        for cit in s2_base.citations:
            author, doi, title = (
                cit.authors[0].name,
                cit.doi,
                cit.title,
            )
            logging.info(
                "cit " +
                str(author) +
                " " +
                str(title) +
                " cited base")
            cits[title] = Article(author=author, doi=doi, title=title)

    with open(cits_path, "w+") as f:
        json.dump([cit.__dict__ for cit in cits.values()], f)
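
The Semantic Scholar query above is assembled by hand with `str.replace`. A minimal sketch of the same URL built with `urllib.parse.urlencode`, which also escapes characters such as `&` or `+` that may appear in other titles. The helper name `s2_search_url` is hypothetical, and treating hyphens as word separators simply mirrors what the cell does.

```python
from urllib.parse import urlencode

S2_SEARCH = "https://api.semanticscholar.org/graph/v1/paper/search"


def s2_search_url(title, limit=1):
    """Build the search URL for a paper title, percent-encoding as needed."""
    # Mirror the notebook: hyphens are treated as word separators.
    query = title.replace("-", " ")
    return S2_SEARCH + "?" + urlencode({"query": query, "limit": limit})


url = s2_search_url("Coinductive big-step operational semantics")
print(url)
```

For the root paper this yields the same string as the manual concatenation; the difference only shows up for titles containing reserved characters.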

In [4]:
if fill and do_cf:
    # crossref.org client
    import habanero

    # Specify email to get into the 'polite' queue
    user_mail = "ecmm@anche.no"
    cr = habanero.Crossref(mailto=user_mail)
    from tqdm import tqdm
    from tqdm.contrib.logging import logging_redirect_tqdm

    lastit = 1

    with logging_redirect_tqdm():
        while lastit < len(cits.values()):
            i = 0
            try:
                for cit in tqdm(cits.values()):
                    i = i + 1
                    if i <= lastit:
                        continue
                    if cit.doi is None or cit.abstract is None:
                        do_extsearch = True
                        if cit.doi is not None:
                            try:
                                res = cr.works(ids=cit.doi)
                                if not res["status"] == "ok":
                                    logging.error(
                                        "Error fetching crossref work for citation "
                                        + str(cit)
                                    )
                                    continue
                                item = res["message"]
                                author = item["author"][0]
                                author = author["given"] + " " + author["family"]
                                if len(author) > len(cit.author):
                                    cit.author = author
                                abstract = None
                                if "abstract" in item:
                                    abstract = item["abstract"]
                                if cit.abstract is None:
                                    cit.abstract = abstract
                                do_extsearch = False
                            except BaseException as e:
                                logging.error(str(e))
                                cit.doi = None
                                pass

                        if do_extsearch:
                            res = cr.works(
                                query=cit.title,
                            )

                            res1 = cr.works(
                                query=cit.title + " " + cit.author,
                                query_author=cit.author,
                            )

                            res2 = cr.works(
                                query_title=cit.title,
                            )

                            res3 = cr.works(
                                query_container_title=cit.title,
                            )

                            items = (
                                res["message"]["items"]
                                + res1["message"]["items"]
                                + res2["message"]["items"]
                                + res3["message"]["items"]
                            )

                            for item in items:
                                try:
                                    title = item["title"][0]
                                    if not title.lower() == cit.title.lower():
                                        continue
                                    if "DOI" not in item:
                                        continue
                                    doi = item["DOI"]
                                    abstract = None
                                    author = item["author"][0]
                                    author = author["given"] + " " + author["family"]
                                    if len(author) > len(cit.author):
                                        cit.author = author
                                    if "abstract" in item:
                                        abstract = item["abstract"]
                                    if cit.doi is None:
                                        cit.doi = doi
                                    if cit.abstract is None:
                                        cit.abstract = abstract
                                except BaseException:
                                    pass
                    # Record progress for every processed citation, so the
                    # outer while loop always terminates.
                    lastit = i
            except:
                lastit = i
                continue

    with open(cits_path, "w+") as f:
        to_write = [cit.__dict__ for cit in cits.values()]
        json.dump(to_write, f)
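
The fill cells restart their `for` loop from `lastit` whenever an item raises, which effectively skips the failing item. A minimal sketch of the same skip-and-continue behaviour with per-item exception handling, so no restart bookkeeping is needed. The names `process_all` and `flaky` are hypothetical and stand in for the per-citation lookup.

```python
def process_all(items, handle):
    """Apply `handle` to every item; collect failures instead of aborting."""
    failures = []
    for index, item in enumerate(items):
        try:
            handle(item)
        except Exception as exc:  # Exception, not BaseException: Ctrl-C still works
            failures.append((index, exc))
    return failures


seen = []


def flaky(value):
    """Stand-in for a network lookup that fails on one input."""
    if value == 2:
        raise ValueError("lookup failed")
    seen.append(value)


failures = process_all([1, 2, 3], flaky)
print(seen, [index for index, _ in failures])
```

Catching `Exception` rather than `BaseException` keeps `KeyboardInterrupt` usable for stopping a long scrape, and the returned list shows which citations need a second look.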

In [5]:
if fill and do_rg:
    from bs4 import BeautifulSoup
    import selenium
    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    browser = webdriver.Firefox()

    from tqdm import tqdm
    from tqdm.contrib.logging import logging_redirect_tqdm
    import re

    session = browser

    use_tor = False
    if use_tor:
        import requests
        from stem import Signal
        from stem.control import Controller

        # signal TOR for a new connection
        def renew_connection():
            with Controller.from_port(port=9051) as controller:
                # controller.authenticate(password="password")
                controller.signal(Signal.NEWNYM)

        def get_tor_session():
            session = requests.session()
            # Tor uses the 9050 port as the default socks port
            session.proxies = {
                "http": "socks5://127.0.0.1:9050",
                "https": "socks5://127.0.0.1:9050",
            }
            return session

        session = get_tor_session()

    lastit = 1
    with logging_redirect_tqdm():
        while lastit < len(cits.values()):
            logging.info("lastit: " + str(lastit))
            i = 0

            try:
                for cit in tqdm(cits.values()):
                    
                    i = i + 1
                    if i <= lastit:
                        continue
                    time.sleep(1)
                    url = f"https://www.researchgate.net/search/publication?q={cit.title}+{cit.author}".replace(
                        " ", "+"
                    )
                    session.get(url)
                    res = browser.page_source
                    soup = BeautifulSoup(res, "html.parser")
                    for article in soup.find_all(
                        "div", {"class": "nova-legacy-v-publication-item__body"}
                    ):
                        title_tag = article.find_all(
                            "a", {"class": "nova-legacy-e-link--color-inherit"}
                        )
                        title = None
                        rg_url = None
                        doi = None
                        if len(title_tag) > 0:
                            title_tag = title_tag[0]
                            title = title_tag.text
                            rg_url = "https://www.researchgate.net/search/../" + title_tag["href"]
                        for bar in article.find_all(
                            "ul", {"class": "nova-legacy-e-list--type-inline"}
                        ):
                            for item in bar.find_all("li"):
                                if "DOI" in item.text:
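                                    # str.strip("DOI: ") strips a character set, not a
                                    # prefix, and can eat the end of the identifier.
                                    doi = item.text.replace("DOI:", "", 1).strip()
                                    
                        if title is None:
                            continue
                        if title.lower() == cit.title.lower():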
                            session.get(rg_url)
                            res = browser.page_source
                            soup = BeautifulSoup(res, "html.parser")
                            abstract = soup.find("div", {"class": "research-detail-middle-section__abstract"}).text
                            if cit.doi is None: 
                                cit.doi = doi
                            if cit.abstract is None: 
                                cit.abstract = abstract
                    # Record progress even when the search returned no results.
                    lastit = i
            except BaseException as e:
                print(e)
                lastit = i
                continue
    with open(cits_path, "w+") as f:
        to_write = [cit.__dict__ for cit in cits.values()]
        json.dump(to_write, f)
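
One detail worth spelling out for the DOI extraction: `str.strip` with a string argument removes any of those *characters* from both ends, not a literal prefix. The sketch below contrasts the two on a made-up DOI; `strip_doi_label` is a hypothetical helper, not part of the notebook.

```python
def strip_doi_label(text):
    """Return the DOI from a string such as 'DOI: 10.1000/xyz'."""
    text = text.strip()
    if text.upper().startswith("DOI:"):
        text = text[len("DOI:"):]
    return text.strip()


# Made-up identifier whose suffix happens to use the letters D, O and I.
raw = "DOI: 10.1000/example.OID"
print(raw.strip("DOI: "))    # character-set stripping also eats the trailing "OID"
print(strip_doi_label(raw))  # prefix removal keeps the identifier intact
```

Real DOIs can end in uppercase letters, so the character-set behaviour can silently corrupt an identifier that later Crossref lookups depend on.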

In [6]:
if fill and do_acm: 
    from bs4 import BeautifulSoup
    import selenium
    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    browser = webdriver.Firefox()

    from tqdm import tqdm
    from tqdm.contrib.logging import logging_redirect_tqdm
    import re

    session = browser

    use_tor = False
    if use_tor:
        import requests
        from stem import Signal
        from stem.control import Controller

        # signal TOR for a new connection
        def renew_connection():
            with Controller.from_port(port=9051) as controller:
                # controller.authenticate(password="password")
                controller.signal(Signal.NEWNYM)

        def get_tor_session():
            session = requests.session()
            # Tor uses the 9050 port as the default socks port
            session.proxies = {
                "http": "socks5://127.0.0.1:9050",
                "https": "socks5://127.0.0.1:9050",
            }
            return session

        session = get_tor_session()

    lastit = 1
    with logging_redirect_tqdm():
        while lastit < len(cits.values()):
            logging.info("lastit: " + str(lastit))
            i = 0

            try:
                for cit in tqdm(cits.values()):
                    
                    i = i + 1
                    if i <= lastit:
                        continue
                    time.sleep(1)

                    url = f"https://dl.acm.org/action/doSearch?AllField={cit.title}+{cit.author}&startPage=0&pageSize=50".replace(
                        " ", "+"
                    )
                    session.get(url)
                    res = browser.page_source
                    soup = BeautifulSoup(res, "html.parser")
                    for article in soup.find_all(
                        "li", {"class": "search__item issue-item-container"}
                    ):
                        title_tag = article.find_all(
                            "span", {"class": "hlFld-Title"}
                        )
                        title = None
                        rg_url = None
                        doi = None
                        if len(title_tag) > 0:
                            title_tag = title_tag[0]
                            title = title_tag.text
                            rg_url = "https://dl.acm.org" + title_tag.find("a", href=True)["href"]
                        doi_tag = article.find("a", {"class":"issue-item__doi"})
                        if doi_tag is not None: 
                            doi = doi_tag["href"]
                        if title is None: 
                            continue
                        if title.lower() == cit.title.lower():
                            session.get(rg_url)
                            res = browser.page_source
                            soup = BeautifulSoup(res, "html.parser")
                            abstract = soup.find("div", {"class": "abstractSection"}).text
                            if cit.doi is None: 
                                cit.doi = doi
                            if cit.abstract is None: 
                                cit.abstract = abstract
                    # Record progress even when the search returned no results.
                    lastit = i
            except BaseException as e:
                print(e)
                lastit = i
                continue
    with open(cits_path, "w+") as f:
        to_write = [cit.__dict__ for cit in cits.values()]
        json.dump(to_write, f)

In [7]:
print("without DOI:", len(list(filter(lambda x: x.doi is None, cits.values()))))
print("without abstract:", len(list(filter(lambda x: x.abstract is None, cits.values()))))

without DOI: 155
without abstract: 161


In [21]:
if stats:
    w_abstract = [cit for cit in cits.values() if cit.abstract is not None]
    print("there are" , len(w_abstract), "documents w/abstract in the dataset")
    
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity 
    import nltk 
    
    from nltk.corpus import stopwords
    import string 
    
    stop  = set(stopwords.words('english') + list(string.punctuation))
    
    tokenizer = lambda x: [i for i in nltk.word_tokenize(x.lower()) if i not in stop]
    vectorizer = TfidfVectorizer(tokenizer = tokenizer)
    documents = vectorizer.fit_transform([c.abstract for c in w_abstract])
    
    def search(query): 
        q = vectorizer.transform([query])
        match = cosine_similarity(q, documents)
        answers, scores = [], []
        for i, s in sorted(enumerate(match[0]), key=lambda x: -x[1]):
            answers.append(i)
            scores.append(s)
        return answers, scores
    
    anss, scores = search("agda")
    for i in range(len(anss)): 
        print(w_abstract[anss[i]].title, scores[i])

there are 144 documents w/abstract in the dataset
Flexible coinduction in agda 0.16996154584003076
Flexible coinduction in Agda 0.16996154584003076
Flexible Coinduction in Agda 0.16996154584003076
Operational semantics using the partiality monad 0.11423780231969859
Beating the Productivity Checker Using Embedded Languages 0.07988117770413543
Beating the productivity checker using embedded languages 0.07934553190445046
Introduction to bisimulation and coinduction 0.0
CakeML: a verified implementation of ML 0.0
A verified compiler for an impure functional language 0.0
Biorthogonality, step-indexing and compiler correctness 0.0
Probabilistic operational semantics for the lambda calculus 0.0
Functional big-step semantics 0.0
One-path reachability logic 0.0
Dynamic determinacy analysis 0.0
Pretty-big-step semantics 0.0
Type soundness proofs with definitional interpreters 0.0
Interaction trees: representing recursive and impure programs in Coq 0.0
Denotational cost semantics for functional l
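
The ranking above lists the same paper several times with different capitalisation, because `cits` is keyed by the exact title string. A minimal sketch of merging such variants under a normalised key; `title_key` and `merge_by_title` are hypothetical helpers, the records are plain dicts rather than `Article` objects, and the DOI is a placeholder.

```python
def title_key(title):
    """Case- and whitespace-insensitive key for a paper title."""
    return " ".join(title.casefold().split())


def merge_by_title(records):
    """Merge records sharing a normalised title, keeping the first non-None value."""
    merged = {}
    for record in records:
        key = title_key(record["title"])
        if key not in merged:
            merged[key] = dict(record)
            continue
        for field, value in record.items():
            if merged[key].get(field) is None:
                merged[key][field] = value
    return list(merged.values())


records = [
    {"title": "Flexible coinduction in agda", "doi": None},
    {"title": "Flexible Coinduction in Agda", "doi": "10.1000/placeholder"},
    {"title": "Pretty-big-step semantics", "doi": None},
]
unique = merge_by_title(records)
print(len(unique), unique[0]["doi"])
```

Merging before the statistics step would also stop duplicated abstracts from being counted as separate documents in the tf-idf corpus.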

In [25]:
for cit in sorted(cits.values(), key=lambda x: x.title): 
    print(cit.title)

A Big Step from Finite to Infinite Computations (SCICO Journal-first)
A Certified Extension of the Krivine Machine for a Call-by-Name Higher-Order Imperative Language
A Comparison of Big-step Semantics Definition Styles
A Compositional Framework for Certified Separate Compilation and Modular Program Verification
A Correct Compiler from Mini-ML to a Big-Step Machine Verified Using Natural Semantics in Coq
A Formally Verified Compiler Back-end
A Hoare logic for the coinductive trace-based big-step semantics of While
A Machine-Checked, Type-Safe Model of Java Concurrency
A Meta-theory for Big-step Semantics
A Mixin Based Object-Oriented Calculus: True Modularity in Object-Oriented Programming
A Provably Correct Compilation of Functional Languages into Scripting Languages
A Semantic Approach to Machine-Level Software Security
A Supposedly Fun Thing I May Have to Do Again
A Theory of Agreements and Protection
A big step from finite to infinite computations
A certified multi-prover verificat