## Web Scraping Approach

We used web scraping to collect articles from an online platform (PubMed, in the code below). Three libraries were used:

- `requests`: retrieves web page content (Python Software Foundation, n.d.).
- `BeautifulSoup`: parses and navigates the HTML structure of these pages for data extraction (Mitchell & Richardson, n.d.).
- `pandas`: structures the scraped information into a format ready for further analysis (McKinney, n.d.).

Our extraction focuses on each article's title and main textual content (the abstract, for PubMed pages), taken from a list of URLs that were first checked to exist. We identified the HTML elements known to hold the title and body text, extracted them, and compiled the results into a single dataset. Targeting the same elements on every page kept the collection efficient for our needs and kept the extracted fields consistent across articles.
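The element lookup described above can be tried offline on a small HTML string before any page is requested. This is a minimal sketch: `SAMPLE_HTML` is a hypothetical, heavily simplified page (real PubMed markup is more elaborate), and `extract_fields` is an illustrative helper name, not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified page used only for illustration
SAMPLE_HTML = """
<html><body>
  <h1 class="heading-title">  Example article title </h1>
  <div class="abstract-content selected">
    <p>First sentence of the abstract.</p>
  </div>
</body></html>
"""

def extract_fields(html):
    """Return the title and abstract text found in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("h1", class_="heading-title")
    # A class_ string with a space matches the exact class attribute value
    abstract_tag = soup.find("div", class_="abstract-content selected")
    return {
        "title": title_tag.get_text(strip=True) if title_tag else "",
        "text": abstract_tag.get_text(strip=True) if abstract_tag else "",
    }

fields = extract_fields(SAMPLE_HTML)
missing = extract_fields("<html><body><p>No abstract here</p></body></html>")
```

Here `missing` has empty strings for both fields, which is how a page without the expected elements shows up in the collected data.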


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
import requests

def check_url_exists(url):
    # HEAD checks existence without downloading the whole page
    try:
        response = requests.head(url, timeout=10)
    except requests.RequestException:
        return False  # Treat network errors as "page not found"
    return response.status_code == 200  # True only for a direct 200 response

def generate_pubmed_urls(start_id, count=300):
    existing_urls = []
    attempt = 0  # Count of attempts to find existing pages
    while len(existing_urls) < count and attempt < 1000:  # Prevent infinite loops
        url = f"https://pubmed.ncbi.nlm.nih.gov/{start_id + attempt}/"
        if check_url_exists(url):
            existing_urls.append(url)
        attempt += 1
    return existing_urls

# Starting ID
start_id = 38535994
urls = generate_pubmed_urls(start_id)

In [3]:
len(urls)

300
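The loop in cell 2 stops on whichever comes first: `count` URLs found or 1000 IDs probed, so it can return fewer URLs than requested. The sketch below restates that logic with the existence check passed in as a function, which lets it run without network access. The names `collect_existing_urls`, `fake_exists`, and `max_attempts` are illustrative and not part of the notebook.

```python
def collect_existing_urls(start_id, count, exists, max_attempts=1000):
    """Probe sequential IDs and keep the URLs for which exists(url) is true."""
    found = []
    for offset in range(max_attempts):
        if len(found) >= count:
            break
        url = f"https://pubmed.ncbi.nlm.nih.gov/{start_id + offset}/"
        if exists(url):
            found.append(url)
    return found

# Hypothetical stand-in for a network check: pretend even-numbered IDs exist
def fake_exists(url):
    record_id = int(url.rstrip("/").rsplit("/", 1)[1])
    return record_id % 2 == 0

sample = collect_existing_urls(100, count=3, exists=fake_exists)
capped = collect_existing_urls(100, count=3, exists=fake_exists, max_attempts=4)
```

With the cap lowered to 4 probes, `capped` holds only two URLs, so checking `len(urls)` after generation, as the notebook does, is a useful guard.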

In [4]:
def scrape_pubmed_article(url):
    response = requests.get(url, timeout=10)  # Timeout so one slow page cannot stall the run
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find the title of the article
    title = soup.find('h1', class_='heading-title')
    article_title = title.get_text(strip=True) if title else 'PubMed Article'  # Fallback title if not found
    
    # Find the main content of the article (abstract).
    # A class_ string containing a space matches the exact class attribute value.
    abstract_div = soup.find('div', class_='abstract-content selected')
    abstract_text = abstract_div.get_text(strip=True) if abstract_div else ''
    
    return {
        'title': article_title,
        'text': abstract_text
    }

scraping_functions = {
    'pubmed.ncbi.nlm.nih.gov': scrape_pubmed_article,  
}

def scrape_article(url):
    domain = url.split('//')[1].split('/')[0]
    if domain in scraping_functions:
        func = scraping_functions[domain]
        article_data = func(url)
        return article_data
    print(f"No specific scraping function for URL: {url}")
    return None

# Scrape the articles and collect the data
articles = []
for url in urls:
    result = scrape_article(url)
    if result:
        articles.append({'url': url, 'title': result['title'], 'text': result['text'], 'label': 'Human-written'})

pubmed_df = pd.DataFrame(articles)
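The dispatch in `scrape_article` finds the domain by splitting the URL string, which works for the URLs generated here. As a hedged alternative, the standard library's `urllib.parse.urlparse` also copes with upper-case hosts and explicit ports. In this sketch `scrape_stub`, `handlers`, and `pick_handler` are hypothetical names used only for illustration.

```python
from urllib.parse import urlparse

def scrape_stub(url):
    # Hypothetical placeholder for a real scraping function
    return {"title": "stub", "text": "stub"}

handlers = {"pubmed.ncbi.nlm.nih.gov": scrape_stub}

def pick_handler(url):
    """Return the handler registered for the URL's host, or None."""
    host = urlparse(url).hostname  # lower-cased, without port or credentials
    return handlers.get(host)

plain = pick_handler("https://pubmed.ncbi.nlm.nih.gov/38535994/")
with_port = pick_handler("https://PubMed.ncbi.nlm.nih.gov:443/38535994/")
unknown = pick_handler("https://example.org/article")
```

Both PubMed URLs resolve to the same handler, while the unregistered host yields `None`, matching the "no specific scraping function" branch in the notebook.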

In [5]:
pubmed_df

| | url | title | text | label |
|---|---|---|---|---|
| 0 | https://pubmed.ncbi.nlm.nih.gov/38535994/ | NPEPPS Is a Druggable Driver of Platinum Resis... | There is an unmet need to improve the efficacy... | Human-written |
| 1 | https://pubmed.ncbi.nlm.nih.gov/38535995/ | High Internal Phase Emulsion for Constructing ... | Polymerized high internal phase emulsions (pol... | Human-written |
| 2 | https://pubmed.ncbi.nlm.nih.gov/38535996/ | Proteomic Analysis of Human Saliva via Solid-P... | Proteomics of human saliva samples was achieve... | Human-written |
| 3 | https://pubmed.ncbi.nlm.nih.gov/38535997/ | Adenine Methylation Enhances the Conformationa... | The N6-methyladenosine modification is one of ... | Human-written |
| 4 | https://pubmed.ncbi.nlm.nih.gov/38535998/ | Targeting mitochondrial dysfunction using meth... | Methylene blue (MB) is a well-established anti... | Human-written |
| ... | ... | ... | ... | ... |
| 295 | https://pubmed.ncbi.nlm.nih.gov/38536291/ | Operational stressors, psychological distress,... | Military personnel experience stressors during... | Human-written |
| 296 | https://pubmed.ncbi.nlm.nih.gov/38536292/ | Validation of the adapted response to stressfu... | There is evidence to suggest that resilience m... | Human-written |
| 297 | https://pubmed.ncbi.nlm.nih.gov/38536293/ | The impact of family stressors and resources o... | Much of the prior research on variables impact... | Human-written |
| 298 | https://pubmed.ncbi.nlm.nih.gov/38536294/ | Diverse predictors of early attrition in an el... | Reconnaissance Marine training is deliberately... | Human-written |


In [6]:
(pubmed_df['text'] == '').sum()

55

In [7]:
pubmed_df = pubmed_df[pubmed_df['text'] != '']

In [8]:
(pubmed_df['text'] == '').sum()

0

In [9]:
pubmed_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 245 entries, 0 to 299
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   url     245 non-null    object
 1   title   245 non-null    object
 2   text    245 non-null    object
 3   label   245 non-null    object
dtypes: object(4)
memory usage: 9.6+ KB
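The `info()` output reports 245 entries with an index running from 0 to 299, because boolean filtering keeps the original row labels. If contiguous labels are wanted downstream, the index can be reset after filtering. The sketch below uses a small made-up frame. The extra `str.strip()` is a defensive assumption, since the notebook's `get_text(strip=True)` should already remove surrounding whitespace.

```python
import pandas as pd

# Small illustrative frame; the values are made up
df = pd.DataFrame({
    "url": ["u0", "u1", "u2", "u3"],
    "text": ["abstract a", "", "   ", "abstract b"],
})

# Treat whitespace-only strings as empty, then drop those rows
has_text = df["text"].str.strip() != ""
cleaned = df[has_text].reset_index(drop=True)
```

After this step `cleaned` is labelled 0 and 1 rather than 0 and 3, so positional and label-based lookups agree.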


## Data Storage for Further Analysis

After scraping and organizing the data, we stored it on disk. This gives us a stable, easily accessible dataset for further analysis without having to repeat the scraping. We chose a pickle file because it serializes the DataFrame as a Python object, so the column types and index are restored exactly when the file is loaded.


In [10]:
pubmed_df.to_pickle("pubmed_data.pkl")
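A quick round trip shows what the pickle format preserves. This is a minimal sketch on a made-up frame written to a temporary directory. The file name `example.pkl` is hypothetical and is separate from the notebook's `pubmed_data.pkl`.

```python
import os
import tempfile

import pandas as pd

# Made-up frame with a non-contiguous index, as after filtering
df = pd.DataFrame(
    {"title": ["a", "b"], "text": ["x", "y"]},
    index=[0, 5],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pkl")  # hypothetical file name
    df.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares values, dtypes, and axis labels
same = restored.equals(df)
```

Pickle files can execute code when loaded, so `pd.read_pickle` is best reserved for files you created yourself or otherwise trust.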

### References

- Python Software Foundation. (n.d.). *Requests: HTTP for Humans™*. Retrieved from [https://requests.readthedocs.io](https://requests.readthedocs.io)

- Mitchell, R., & Richardson, L. (n.d.). *Beautiful Soup Documentation*. Retrieved from [https://www.crummy.com/software/BeautifulSoup/bs4/doc/](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

- McKinney, W. (n.d.). *pandas: powerful Python data analysis toolkit*. Retrieved from [https://pandas.pydata.org/pandas-docs/stable/index.html](https://pandas.pydata.org/pandas-docs/stable/index.html)