---
title: "A Structural Analysis of Academic Writing"
date: "2024-11-05"
date-modified: "2024-11-05"
categories: [LLM, Python, R, Web Scraping]
image: sunset5.jpeg
format:
  html:
    code-fold: true
jupyter: python3
---




Academic journals require research articles to follow a specific format. First, a 150-300 word abstract provides a big-picture overview of the main ideas contained within the paper. An abstract may also provide a brief summary of relevant background information or methods used in the paper. Following the abstract, authors typically provide a longer introduction section. In this section, authors may do a number of things, including explaining why their research is important, summarizing past results in the field, identifying key gaps in the literature, and discussing the methods used in the paper. After the introduction, authors typically include methods, results, discussion, and conclusion sections. However, these sections are often modified, split up, or removed entirely to better suit the project.

Since the abstract and introduction are crucial to any research paper, understanding how to write these sections effectively is an essential skill for a researcher. Advice on paper writing is typically *qualitative* -- stuff like "start broad and then gradually get narrower" or "make sure to emphasize the importance of your research". Today, I outline a *quantitative* framework for writing abstracts and introductions. As an example, my analysis gives an estimate of how many sentences of motivation you should provide and where these sentences should be located in your paper. To do this, I analyzed over $9\,000$ papers from the [PLOS Computational Biology](https://journals.plos.org/ploscompbiol/) journal using the open-source LLM [Llama 3.2](https://www.llama.com/).
 
**Remark**: The idea for this project came from my supervisor, [Prof. Eric Cytrynbaum](https://personal.math.ubc.ca/~cytryn/index.shtml).

## Web Scraping

Before I could do any fancy AI-powered analysis, I needed to get the abstracts and introductions from a large number of academic papers. Getting a bunch of abstracts is straightforward: the [arXiv Dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv) contains millions of them. However, sourcing the introductions from these papers is much harder. My first idea was to just feed the full-text PDFs from the arXiv dataset directly into the LLM. However, even state-of-the-art models like ChatGPT 4o struggled to extract the abstract and introduction sections I needed. I'm not entirely sure why this task is so difficult, but I think it might have something to do with the two-column formatting present in many academic journals. This experience made me realize that I needed a way to extract text directly from the papers themselves, which led me to web scraping.

Web scraping is a tedious and annoying task. Small changes between webpages can completely break your scraper, and you're constantly running the risk of getting your IP permanently banned. Journals run by the big publishing companies like Springer and Elsevier also require authentication, which adds an additional layer of complexity. To keep things as simple as possible, I chose to extract papers from a single open-access journal[^1], [PLOS Computational Biology](https://journals.plos.org/ploscompbiol/). To actually do the web scraping, I used the Python library [beautifulsoup4](https://pypi.org/project/beautifulsoup4/). I've attached an abridged and annotated version of my code below:


In [None]:
#| code-fold: True

import requests
import time
import random
import json
from bs4 import BeautifulSoup

# Get the abstract and introduction of a paper given its DOI URL
def extract_paper(url):

    # Make an HTTP request to the URL and get the HTML content
    try:
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "html5lib")
    except: 
        ...

    # Attempt to find the abstract
    try:
        section = soup.find("div", {"class": "abstract-content"})
        paragraphs = [p.text for p in section.find_all("p")]
        abstract = " ".join(paragraphs).replace("\n", "")
    except:
        ...

    # Attempt to find the introduction
    ...
        
    # Attempt to extract the date
    ...
      
    # Return the data from the paper in a JSON-friendly format
    return { ... }

# Extract DOIs from a webpage
def extract_dois(url):

    # Attempt to extract HTML content from the URL
    ...
      
    # Attempt to get all DOI links from the "Research Articles" section
    doi_list = []
    try:
        header = soup.find("h2", string="Research Articles")
        links = header.parent.find_all("a", href=True)
        doi_list = [a["href"] for a in links if "doi" in a["href"]]
    except:
        ...
    
    return doi_list

# Extract data from papers in PLOS Computational Biology
def main():

    # Base URL for getting monthly journal pages 
    URL = "https://journals.plos.org/ploscompbiol/issue?id=10.1371/"

    # Iterate through each volume (1-20) and issue (1-12)
    # Do this in reverse order to get the most recent papers first
    doi_list = []
    for volume in range(20, 0, -1):
        for issue in range(12, 0, -1):

            # Create a list of potential URLs to check
            urls = [URL + f"issue.pcbi.v{volume:02d}.i{issue:02d}"]

            # Attempt to get the DOIs from each URL
            for url in urls:
                doi_list += extract_dois(url)

            # Pause for 1-2 seconds to avoid getting IP banned 
            time.sleep(1 + random.random())

    # Then, scrape the abstract and introduction from each DOI
    data = []
    for i, url in enumerate(doi_list):
        data.append(extract_paper(url))

        # Print a status message and wait 1-2 seconds 
        ...

    # Export data to a json file
    ...

[^1]: I picked computational biology over another PLOS journal because it's closest to my own research interests. Hopefully, the results of this analysis will come in handy if I ever manage to write a paper!
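
Since the code above is abridged and needs live pages to run, here is a minimal, self-contained sketch of the two parsing steps applied to made-up HTML. The sample markup, the `example.org` links, and the function names `parse_abstract` and `parse_doi_links` are assumptions for illustration only; they mimic the selectors used above but are not the real PLOS page structure or the full scraper. The sketch also swaps in Python's built-in `html.parser` backend for `html5lib`, so it needs nothing beyond `beautifulsoup4`.

```python
from bs4 import BeautifulSoup

# Hypothetical article page: only mimics the "abstract-content" <div> used above
SAMPLE_ARTICLE = """
<html><body>
  <div class="abstract-content">
    <p>First sentence of the abstract.</p>
    <p>Second sentence
    of the abstract.</p>
  </div>
</body></html>
"""

# Hypothetical issue page: two article links, one unrelated link, one editorial
SAMPLE_ISSUE = """
<html><body>
  <div class="section">
    <h2>Research Articles</h2>
    <a href="https://example.org/doi/10.0000/example.1">Paper one</a>
    <a href="https://example.org/about">About</a>
    <a href="https://example.org/doi/10.0000/example.2">Paper two</a>
  </div>
  <div class="section">
    <h2>Editorial</h2>
    <a href="https://example.org/doi/10.0000/example.3">Editorial</a>
  </div>
</body></html>
"""


def parse_abstract(html):
    """Return the abstract text, or None if the expected <div> is missing."""
    soup = BeautifulSoup(html, "html.parser")
    section = soup.find("div", {"class": "abstract-content"})
    if section is None:
        return None
    paragraphs = [p.get_text() for p in section.find_all("p")]
    # Collapse newlines and repeated whitespace into single spaces
    return " ".join(" ".join(paragraphs).split())


def parse_doi_links(html):
    """Return DOI links that share a parent with the 'Research Articles' header."""
    soup = BeautifulSoup(html, "html.parser")
    header = soup.find("h2", string="Research Articles")
    if header is None:
        return []
    links = header.parent.find_all("a", href=True)
    return [a["href"] for a in links if "doi" in a["href"]]


if __name__ == "__main__":
    print(parse_abstract(SAMPLE_ARTICLE))
    print(parse_doi_links(SAMPLE_ISSUE))
```

Two design choices differ from the abridged code. The sketch checks for `None` explicitly instead of relying on `try`/`except`, which makes a missing section easy to tell apart from a network failure. It also normalizes whitespace rather than deleting newlines, so words separated only by a line break are not glued together.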

## Sentence Categorization

## Data Analysis

## Conclusion