# Sort Google Scholar
This is a Jupyter environment where you can try the repository's code without installing anything locally. The main limitation is Google Scholar's robot check: getting past it requires running Selenium and solving the CAPTCHAs manually. For a handful of keywords, though, it should work!

> **INSTRUCTIONS:** If this is your first time using a Jupyter environment, run each code block by selecting it and pressing `SHIFT` + `ENTER`. Make sure to update the keyword parameters where required.

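SortGS was recently added to PyPI, which simplifies the setup. The cell below installs PyMuPDF (imported later as `fitz`), which the PDF extraction step relies on. If `sortgs` itself is not installed yet, `pip install sortgs` installs it the same way.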

In [51]:
!pip install pymupdf

Defaulting to user installation because normal site-packages is not writeable
Collecting pymupdf
  Downloading pymupdf-1.25.3-cp39-abi3-macosx_10_9_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.25.3-cp39-abi3-macosx_10_9_x86_64.whl (19.3 MB)
Installing collected packages: pymupdf
Successfully installed pymupdf-1.25.3


In [53]:
import sortgs
import requests
import os
import time
import sys
import logging
import fitz
import plotly.express as px
import pandas as pd
from tqdm.auto import tqdm
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

Example `search_query`:

- `Large Language Models` → General search
- `"Large Language Models"` → Exact phrase search
- `Large Language Models -transformer` → Exclude specific term
- `Large Language Models author:"Geoffrey Hinton"` → Search by author
- `Large Language Models source:Nature` → Search within a specific publication
- `("Large Language Models" OR "Transformer Models") AND (GPT OR BERT)` → Boolean search
- `intitle:"Large Language Models"` → Search in the title only
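
The operators listed above can also be combined in code before the query is assigned to `search_query`. The following is a minimal sketch; `build_query` and its parameters are hypothetical names introduced for illustration and are not part of sortgs.

```python
def build_query(terms, exact=False, exclude=(), author=None, title_only=False):
    """Compose a Google Scholar query string from the operators listed above.

    Hypothetical helper for illustration; it is not part of sortgs.
    """
    # Exact-phrase and title-only searches both wrap the terms in double quotes
    query = f'"{terms}"' if (exact or title_only) else terms
    if title_only:
        query = f"intitle:{query}"
    # Each excluded word is prefixed with a minus sign
    for word in exclude:
        query += f" -{word}"
    if author:
        query += f' author:"{author}"'
    return query


examples = [
    build_query("Large Language Models", exact=True),
    build_query("Large Language Models", exclude=["transformer"]),
    build_query("Large Language Models", author="Geoffrey Hinton"),
    build_query("Large Language Models", title_only=True),
]
for example in examples:
    print(example)
```

Any of the resulting strings can be used as the value of `search_query` in the next cell.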

In [39]:
# Main query
search_query = 'airports'

# Expanded form with extra parameters
sortby = "cit/year"  # @param ["Citations", "cit/year"] {type:"string"}
nresults = 100  # @param {type:"number"}
startyear = '2000'  # @param {type:"string"}
endyear = None  # @param {type:"string"}
langfilter = None  # @param ["None", "zh-CN", "zh-TW", "nl", "en", "fr", "de", "it", "ja", "ko", "pl", "pt", "es", "tr"] {type:"string"}

# Convert the langfilter to a list if it's not None
if langfilter and langfilter != "None":
    langfilter = [langfilter]
else:
    langfilter = None  # No language filter applied if "None" is selected

# Constructing the base command
cmd = f"sortgs '{search_query}' --sortby '{sortby}' --nresults {nresults}"

if startyear:
    cmd += f" --startyear {startyear}"

if endyear:
    cmd += f" --endyear {endyear}"

if langfilter:
    lang_str = ' '.join(langfilter)
    cmd += f" --langfilter {lang_str}"

# Output the constructed command for review
print("Constructed command:", cmd)


Constructed command: sortgs 'airports' --sortby 'cit/year' --nresults 100 --startyear 2000
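
The command string above wraps the query in single quotes, which breaks if the query itself contains an apostrophe. One possible alternative, sketched here under the assumption of a POSIX shell, builds the argument list first and lets `shlex.join` (Python 3.8+) do the quoting. `build_sortgs_command` is a hypothetical helper; it uses only the flags that already appear in this notebook.

```python
import shlex


def build_sortgs_command(search_query, sortby="cit/year", nresults=100,
                         startyear=None, endyear=None, langfilter=None):
    """Return a shell-safe sortgs command line (hypothetical helper)."""
    args = ["sortgs", search_query, "--sortby", sortby, "--nresults", str(nresults)]
    if startyear:
        args += ["--startyear", str(startyear)]
    if endyear:
        args += ["--endyear", str(endyear)]
    if langfilter:
        # langfilter is a list of language codes, as in the cell above
        args += ["--langfilter", *langfilter]
    # shlex.join quotes each argument only where the shell requires it
    return shlex.join(args)


cmd = build_sortgs_command("airport's capacity", startyear="2000")
print("Constructed command:", cmd)
```

The returned string can be run with `!{cmd}` exactly like the hand-built one.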


In [40]:
!{cmd}

Running with the following parameters:
Keyword: airports, Number of results: 100, Save database: True, Path: /Users/sotapanna/Sync/Software/sort-google-scholar/jupyter, Sort by: cit/year, Permitted Languages: All, Plot results: False, Start year: 2000, End year: 2025, Debug: False
Loading next 10 results
Loading next 20 results
Loading next 30 results
Loading next 40 results
Loading next 50 results
Loading next 60 results
Loading next 70 results
Loading next 80 results
Loading next 90 results
Loading next 100 results
                                          Author  ... cit/year
Rank                                              ...         
1                                       A Graham  ...      479
25                                R De Neufville  ...      232
71                         A Di Vaio, L Varriale  ...       88
8                         W Schlenker, WR Walker  ...       79
66    BJ Quilty, S Clifford, S Flasche, RM Eggo…  ...       62
...                                 

> _**NOTE:** Some warnings are normal, for example "year not found" or "author not found". A robot-check warning, however, means the search may no longer work from your current IP. You can try going to 'Runtime' > 'Disconnect and delete runtime' to get a new IP. If the problem persists, you will have to run the code locally using Selenium and solve the CAPTCHAs manually. Avoid running this code too often, since frequent requests are what trigger the robot check._
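
One way to run the search less often is to reuse the CSV from an earlier run. The sketch below assumes the file naming used later in this notebook (spaces in the keyword replaced by underscores, plus `.csv`); `csv_path_for` and `needs_new_search` are hypothetical helpers, not part of sortgs.

```python
import os


def csv_path_for(search_query, directory="."):
    """Path of the CSV this notebook expects for a given keyword."""
    return os.path.join(directory, search_query.replace(" ", "_") + ".csv")


def needs_new_search(search_query, directory="."):
    """True when no CSV from an earlier run exists for this keyword."""
    return not os.path.exists(csv_path_for(search_query, directory))


search_query = "airports"
if needs_new_search(search_query):
    print("No cached CSV found; run the sortgs command.")
else:
    print("Reusing", csv_path_for(search_query))
```

A cached file does not reflect new citations, so delete it when fresh numbers matter.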

Next, you will see that a CSV file named after the keyword has been created. The following cell loads it with pandas.

In [58]:
csv_filename = search_query.replace(' ', '_')+'.csv'
df = pd.read_csv(csv_filename)
pd.set_option('display.max_colwidth', None)  # Set to None for full width
df.head(5)

Unnamed: 0,Rank,Author,Title,Citations,Year,Publisher,Venue,Source,cit/year
0,1,A Graham,Managing airports: An international perspective,1438,2023,taylorfrancis.com,,https://www.taylorfrancis.com/books/mono/10.4324/9781003269359/managing-airports-anne-graham,479
1,25,R De Neufville,"Airport systems planning, design, and management",1389,2020,api.taylorfrancis.com,Air transport management,https://api.taylorfrancis.com/content/chapters/edit/download?identifierName=doi&identifierValue=10.4324/9780429299445-6&type=chapterpdf,232
2,71,"A Di Vaio, L Varriale",Blockchain technology in supply chain management for sustainable performance: Evidence from the airport industry,528,2020,Elsevier,International Journal of Information Management,https://www.sciencedirect.com/science/article/pii/S0268401219304803,88
3,8,"W Schlenker, WR Walker","Airports, air pollution, and contemporaneous health",793,2016,academic.oup.com,The Review of economic studies,https://academic.oup.com/restud/article-abstract/83/2/768/2461206,79
4,66,"BJ Quilty, S Clifford, S Flasche, RM Eggo…",Effectiveness of airport screening at detecting travellers infected with novel coronavirus (2019-nCoV),375,2020,pmc.ncbi.nlm.nih.gov,…,https://pmc.ncbi.nlm.nih.gov/articles/PMC7014668/,62
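
The `cit/year` values in this table appear consistent with dividing the citation count by the paper's age in years, counting the end year itself (2025 in the run above), and rounding. This is inferred from the rows shown, not taken from the sortgs source, so treat the sketch below as an assumption; `citations_per_year` is a hypothetical name.

```python
def citations_per_year(citations, year, end_year=2025):
    """Citations divided by the paper's age in years, counting the end year.

    Assumed formula, inferred from the table above rather than from sortgs.
    """
    age = end_year - year + 1
    return round(citations / age)


# (Citations, Year) pairs copied from the first five rows of the table
rows = [(1438, 2023), (1389, 2020), (528, 2020), (793, 2016), (375, 2020)]
for citations, year in rows:
    print(year, citations, citations_per_year(citations, year))
```

Under this reading, a recent book with many citations (first row) outranks older works with similar totals.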


In [59]:
# Configure Logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")

# No global Selenium WebDriver is started here: fetch_abstract_with_selenium()
# builds its own headless Chrome options and closes its driver after each call,
# so a browser opened at this point would never be used or shut down.

# Headers for requests
HEADERS = {"User-Agent": "AbstractFetcher/1.0"}

# =================== FETCH ABSTRACTS FROM SEMANTIC SCHOLAR ===================
def fetch_abstract_semantic_scholar(row, max_retries=1):
    """Fetches the abstract using Semantic Scholar API, trying DOI first, then title."""

    base_url = "https://api.semanticscholar.org/graph/v1/paper/"
    fields = "title,abstract"
    headers = {"User-Agent": "Academic-Abstract-Fetcher/1.0"}

    def get_with_retry(url, params):
        """GET a URL, retrying at most max_retries times after an HTTP 429."""
        for attempt in range(max_retries + 1):
            try:
                res = requests.get(url, params=params, headers=headers, timeout=10)
            except requests.RequestException as e:
                logging.error("Semantic Scholar request error: %s", e)
                return None
            if res.status_code != 429 or attempt == max_retries:
                return res
            logging.warning("Semantic Scholar rate limit hit. Retrying in 5 seconds...")
            time.sleep(5)

    logging.info("Attempting to fetch abstract from Semantic Scholar for title: %s", row['Title'])

    # Attempt DOI-based retrieval
    source = str(row['Source'])
    if '10.' in source:
        # Heuristic: keep the DOI prefix and the first suffix segment found in the
        # URL, dropping query strings and trailing path parts such as a title slug.
        tail = source.split('10.', 1)[1]
        for sep in ('?', '&', '#'):
            tail = tail.split(sep)[0]
        doi = "10." + "/".join(tail.split('/')[:2])

        res = get_with_retry(f"{base_url}DOI:{doi}", {"fields": fields})
        if res is not None and res.status_code == 200:
            abstract = res.json().get('abstract')
            if abstract:
                logging.info("Abstract found via Semantic Scholar (DOI).")
                return abstract
        elif res is not None:
            logging.debug("Semantic Scholar (DOI) returned status code: %d", res.status_code)

    # Fallback to title-based retrieval
    res = get_with_retry(f"{base_url}search",
                         {"query": row['Title'], "limit": 1, "fields": fields})
    if res is not None and res.status_code == 200:
        data = res.json().get('data', [])
        if data and data[0].get('abstract'):
            logging.info("Abstract found via Semantic Scholar (Title Search).")
            return data[0]['abstract']
        else:
            logging.warning("Title search returned no abstract.")
    elif res is not None:
        logging.debug("Semantic Scholar (Title) returned status code: %d", res.status_code)

    logging.info("Semantic Scholar did not return an abstract.")
    return None

# =================== FETCH ABSTRACTS FROM CROSSREF ===================
def fetch_crossref(doi):
    """Fetch abstract from CrossRef using DOI."""
    url = f"https://api.crossref.org/works/{doi}"

    try:
        response = requests.get(url, headers=HEADERS, timeout=10)
        if response.status_code == 200:
            data = response.json()
            if "message" in data and "abstract" in data["message"]:
                return data["message"]["abstract"]
        else:
            logging.info("CrossRef did not return an abstract.")
    except requests.RequestException as e:
        logging.error(f"CrossRef request error: {e}")

    return None  # Return None if CrossRef fails

# =================== FETCH ABSTRACTS VIA WEB SCRAPING ===================
def fetch_abstract_web_scraping(url):
    """Attempts to fetch an abstract from a webpage using web scraping."""
    
    try:
        response = requests.get(url, headers=HEADERS, timeout=10)
        
        # Check if content type is a PDF
        if "application/pdf" in response.headers.get("Content-Type", ""):
            logging.info("Skipping PDF content, trying PDF extraction...")
            return extract_abstract_from_pdf(url)

        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "html.parser")
            abstract_candidates = soup.find_all(["p", "div"], string=True, limit=10)
            
            for tag in abstract_candidates:
                text = tag.get_text(strip=True)
                if len(text) > 100:  # Ensure a meaningful abstract
                    return text
        
        logging.info("Web scraping did not return an abstract. Trying Selenium for JS-loaded pages...")
        return fetch_abstract_with_selenium(url)

    except requests.RequestException as e:
        logging.error(f"Web Scraping request error: {e}")

    return None

# =================== FETCH ABSTRACTS USING SELENIUM ===================
def fetch_abstract_with_selenium(url):
    """Uses Selenium to scrape JS-loaded abstracts."""
    
    try:
        options = Options()
        options.add_argument("--headless")  # Run in headless mode
        options.add_argument("--disable-gpu")
        options.add_argument("--no-sandbox")

        with webdriver.Chrome(options=options) as driver:
            driver.get(url)
            soup = BeautifulSoup(driver.page_source, "html.parser")
            
            abstract_candidates = soup.find_all(["p", "div"], string=True, limit=10)
            for tag in abstract_candidates:
                text = tag.get_text(strip=True)
                if len(text) > 100:
                    return text
        
    except Exception as e:
        logging.error(f"Selenium Web Scraping failed: {e}")

    return None

# =================== FETCH ABSTRACTS FROM PDF ===================
def extract_abstract_from_pdf(pdf_url):
    """Download and extract abstract from a PDF file."""

    try:
        response = requests.get(pdf_url, headers=HEADERS, timeout=30)

        if response.status_code == 200:
            # Open the PDF from memory so no temporary file is left on disk
            with fitz.open(stream=response.content, filetype="pdf") as doc:
                text = ""
                for page in doc:
                    text += page.get_text()
                    if len(text) >= 500:
                        break  # Enough text collected; skip the remaining pages

                return text[:500]  # Limit to first 500 characters (likely abstract)

    except requests.RequestException as e:
        logging.error(f"PDF download error: {e}")
    except Exception as e:  # PyMuPDF raises on corrupt or non-PDF content
        logging.error(f"PDF extraction error: {e}")

    return None

# =================== INTELLIGENT SOURCE SELECTION ===================
def get_abstract(row):
    """Selects the best source for fetching abstracts based on available data."""
    
    title = row.get("Title", "").strip()
    source_url = row.get("Source", "").strip()
    doi = row.get("DOI", "").strip()

    logging.info(f"Starting abstract retrieval for: {title}")

    # 1. Try Semantic Scholar First (DOI-based, then Title-based)
    logging.info(f"Trying Semantic Scholar for Title: {title}")
    abstract = fetch_abstract_semantic_scholar(row)
    if abstract:
        return abstract

    # 2. Try DOI-based lookup via CrossRef
    if doi:
        logging.info(f"Trying CrossRef for DOI: {doi}")
        abstract = fetch_crossref(doi)
        if abstract:
            return abstract

    # 3. Try Web Scraping if a valid source URL is available
    if source_url.startswith("http"):
        logging.info(f"Trying Web Scraping for Source: {source_url}")
        abstract = fetch_abstract_web_scraping(source_url)
        if abstract:
            return abstract

    # 4. Try PDF Extraction if the source is a PDF
    if source_url.endswith(".pdf"):
        logging.info(f"Trying PDF Extraction for: {source_url}")
        abstract = extract_abstract_from_pdf(source_url)
        if abstract:
            return abstract
    
    logging.warning(f"Abstract not found for: {title}")
    return "Abstract not found"

# =================== MAIN SCRIPT ===================
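# Fetch abstracts with a progress bar.
# The retrieval line below is left commented out because it sends several
# network requests per row; uncomment it to add an 'Abstract' column to df.
tqdm.pandas(desc="Fetching Abstracts")
# df['Abstract'] = df.progress_apply(get_abstract, axis=1)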

# Export enriched DataFrame to CSV
df.to_csv(csv_filename, index=False)

print("Abstract retrieval completed. Data saved to .csv")

Abstract retrieval completed. Data saved to .csv
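
Abstracts returned by CrossRef are often wrapped in JATS XML markup (for example `<jats:p>` tags), which would end up verbatim in the `Abstract` column. A minimal cleanup sketch follows; `strip_jats_markup` is a hypothetical helper, and the regular expression is a simple tag stripper rather than a full XML parser.

```python
import html
import re


def strip_jats_markup(abstract):
    """Remove XML tags and decode entities in a CrossRef abstract string."""
    # Replace every tag with a space so adjacent paragraphs do not run together
    text = re.sub(r"<[^>]+>", " ", abstract)
    # Decode entities such as &amp; into plain characters
    text = html.unescape(text)
    # Collapse the whitespace left behind by the removed tags
    return re.sub(r"\s+", " ", text).strip()


raw = "<jats:title>Abstract</jats:title><jats:p>Airports are &amp; remain busy.</jats:p>"
print(strip_jats_markup(raw))
```

It could be applied to the return value of `fetch_crossref` before the text is stored.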


In [57]:
df.head(10)

Unnamed: 0,Rank,Author,Title,Citations,Year,Publisher,Venue,Source,cit/year
0,1,A Graham,Managing airports: An international perspective,1438,2023,taylorfrancis.com,,https://www.taylorfrancis.com/books/mono/10.4324/9781003269359/managing-airports-anne-graham,479
1,25,R De Neufville,"Airport systems planning, design, and management",1389,2020,api.taylorfrancis.com,Air transport management,https://api.taylorfrancis.com/content/chapters/edit/download?identifierName=doi&identifierValue=10.4324/9780429299445-6&type=chapterpdf,232
2,71,"A Di Vaio, L Varriale",Blockchain technology in supply chain management for sustainable performance: Evidence from the airport industry,528,2020,Elsevier,International Journal of Information Management,https://www.sciencedirect.com/science/article/pii/S0268401219304803,88
3,8,"W Schlenker, WR Walker","Airports, air pollution, and contemporaneous health",793,2016,academic.oup.com,The Review of economic studies,https://academic.oup.com/restud/article-abstract/83/2/768/2461206,79
4,66,"BJ Quilty, S Clifford, S Flasche, RM Eggo…",Effectiveness of airport screening at detecting travellers infected with novel coronavirus (2019-nCoV),375,2020,pmc.ncbi.nlm.nih.gov,…,https://pmc.ncbi.nlm.nih.gov/articles/PMC7014668/,62
5,3,"NJ Ashford, S Mumayiz, PH Wright","Airport engineering: planning, design, and development of 21st century airports",838,2011,books.google.com,,https://books.google.com/books?hl=en&lr=&id=-V_yKCO592EC&oi=fnd&pg=PR11&dq=airports&ots=M7f1MmuxDM&sig=_nn-WnmYWab5DivanBMbYeXISFI,56
6,58,"M Dresner, JSC Lin, R Windle",The impact of low-cost carriers on airport and route competition,462,2017,taylorfrancis.com,Low Cost Carriers,https://www.taylorfrancis.com/chapters/edit/10.4324/9781315091617-18/impact-low-cost-carriers-airport-route-competition-martin-dresner-jiun-sheng-chris-lin-robert-windle,51
7,35,"M Masiol, RM Harrison",Aircraft engine exhaust emissions and other airport-related contributions to ambient air pollution: A review,587,2014,Elsevier,Atmospheric environment,https://www.sciencedirect.com/science/article/pii/S1352231014004361,49
8,67,"NJ Ashford, HP Stanton, CA Coutu, JR Beaslley",Airport operations,626,2013,repo.poltekbangsby.ac.id,,"http://repo.poltekbangsby.ac.id/id/eprint/581/1/Airport%20Operations,%203%20edition%20(%20PDFDrive%20).pdf",48
9,65,SD Barrett,How do the demands for airport services differ between full-service carriers and low-cost carriers?,407,2017,taylorfrancis.com,Low Cost Carriers,https://www.taylorfrancis.com/chapters/edit/10.4324/9781315091617-11/demands-airport-services-differ-full-service-carriers-low-cost-carriers-sean-barrett,45


In [None]:
# @title Rank vs Citations
view = df.reset_index().copy()

# Function to truncate and add line breaks to long titles
def shorten_title(title, max_length=60):
    words = title.split()
    shortened_lines = []
    current_line = []

    # Add words to the current line until max_length is exceeded
    for word in words:
        if len(' '.join(current_line + [word])) <= max_length:
            current_line.append(word)
        else:
            shortened_lines.append(' '.join(current_line))
            current_line = [word]

    # Add the last line
    if current_line:
        shortened_lines.append(' '.join(current_line))

    return '<br>'.join(shortened_lines)


# Apply this function to the 'Title' column and create a new column for the shortened titles
view['Short_Title'] = view['Title'].apply(shorten_title)

# Now use 'Short_Title' for hover_name
fig = px.scatter(view,
                 x='Rank',
                 y='Citations',
                 title='Number of Citations vs Google Scholar Rank',
                 hover_name='Short_Title',
                 hover_data=['Rank', 'Author', 'Citations', 'Year', 'Publisher', 'Venue', 'cit/year']
)
fig.show()
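
The word-by-word wrapping done by `shorten_title` can also be delegated to the standard library. This is a sketch of an alternative, not a replacement the notebook requires; `wrap_title` is a hypothetical name. Note that `textwrap.wrap` may also break lines at hyphens and split words longer than the width, which `shorten_title` does not.

```python
import textwrap


def wrap_title(title, max_length=60):
    """Break a long title into lines of at most max_length characters,
    joined with <br> so Plotly renders them as separate hover lines."""
    return "<br>".join(textwrap.wrap(title, width=max_length))


title = ("Blockchain technology in supply chain management for sustainable "
         "performance: Evidence from the airport industry")
print(wrap_title(title))
```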

In [None]:
# Generate .bib filename based on the CSV filename
bib_filename = os.path.splitext(csv_filename)[0] + ".bib"

# Function to convert DataFrame to BibTeX
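def df_to_bib(df, filename):
    with open(filename, "w", encoding="utf-8") as f:
        for _, row in df.iterrows():
            entry_type = "article"  # Assuming all are journal articles
            # Citation key: first author's surname + year, e.g. "Graham2023"
            name_parts = str(row['Author']).split(',')[0].split()
            surname = name_parts[-1] if name_parts else "Unknown"
            citation_key = f"{surname}{row['Year']}"
            fields = {
                # BibTeX separates authors with "and" rather than commas
                "author": str(row['Author']).replace(', ', ' and '),
                "title": row['Title'],
                "year": row['Year'],
                "publisher": row['Publisher'],
                "journal": row['Venue'],
                "url": row['Source'],
            }
            # Skip empty cells, which pandas reads as NaN
            body = ",\n".join(
                f"    {name} = {{{value}}}"
                for name, value in fields.items() if pd.notna(value)
            )
            f.write(f"@{entry_type}{{{citation_key},\n{body}\n}}\n\n")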

# Export DataFrame as .bib with the same name as the CSV
df_to_bib(df.head(10), bib_filename)

print(f"BibTeX file exported successfully as {bib_filename}.")
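
Titles and venue names can contain characters that LaTeX treats specially, such as `&` or `%`, and these break compilation if written to the `.bib` file unescaped. A minimal escaping sketch follows; `escape_bibtex` is a hypothetical helper, and its character list is a common subset rather than a complete one. It should not be applied to the `url` field.

```python
# Characters LaTeX treats specially, mapped to their escaped forms
BIBTEX_SPECIALS = {"&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#", "_": r"\_"}


def escape_bibtex(value):
    """Escape LaTeX special characters in a BibTeX field value."""
    return "".join(BIBTEX_SPECIALS.get(ch, ch) for ch in str(value))


print(escape_bibtex("Planning & design: 100% of airports"))
```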