## Web Scraping Approach

We scraped news articles from the Nature website (nature.com). Three libraries were used to carry out the scraping process:

- `requests`: retrieves web page content (Python Software Foundation, n.d.).
- `BeautifulSoup`: parses and navigates the HTML structure of these pages for data extraction (Mitchell & Richardson, n.d.).
- `pandas`: structures the scraped information into a format ready for further analysis (McKinney, n.d.).

Our extraction focuses on each article's title and main textual content, taken from a hand-picked list of URLs. We identified the HTML elements known to hold the title and body text, extracted them, and compiled the results into a single dataset. Pages where these elements are missing yield a placeholder title and empty text, and are filtered out afterwards. Applying the same selectors to every page kept the data collection efficient for our needs and the resulting dataset consistent.


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
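def scrape_nature_article(url):
    """Fetch one Nature article page and return its title and text."""
    # A timeout keeps a stalled connection from hanging the whole run
    response = requests.get(url, timeout=10)
    if response.status_code != 200:
        return {'title': 'Failed to fetch article', 'text': ''}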
    
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Extracting the title
    title_tag = soup.find('h1', class_='c-article-magazine-title')
    article_title = title_tag.get_text(strip=True) if title_tag else 'No Title Found'
    
    # Extracting the teaser text
    teaser_text_tag = soup.find('div', class_='c-article-teaser-text')
    teaser_text = teaser_text_tag.get_text(strip=True) if teaser_text_tag else ''
    
    # Extracting the main article paragraphs
    article_body_texts = soup.find_all('p', class_='article__teaser')
    article_body = ' '.join([p.get_text(strip=True) for p in article_body_texts])
    
    # Combining teaser and main article body
    full_text = teaser_text + " " + article_body if teaser_text else article_body
    
    return {
        'title': article_title,
        'text': full_text.strip()
    }

# List of Nature URLs to scrape
urls = [
    'https://www.nature.com/articles/d41586-024-00576-w',
    'https://www.nature.com/articles/d41586-024-00662-z',
    'https://www.nature.com/articles/s41586-024-07176-8',
    'https://www.nature.com/articles/s41586-024-07186-6',
    'https://www.nature.com/articles/s41586-024-07187-5',
    'https://www.nature.com/articles/s41586-024-07096-7',
    'https://www.nature.com/articles/s41586-024-07167-9',
    'https://www.nature.com/articles/s41586-024-07159-9',
    'https://www.nature.com/articles/s41586-024-07058-z',
    'https://www.nature.com/articles/s41586-024-07150-4',
    'https://www.nature.com/articles/d41586-024-00780-8',
    'https://www.nature.com/articles/d41586-024-00895-y',
    'https://www.nature.com/articles/d41586-024-00886-z',
    'https://www.nature.com/articles/d41586-024-00828-9',
    'https://www.nature.com/articles/d41586-024-00839-6',
    'https://www.nature.com/articles/d41586-024-00795-1',
    'https://www.nature.com/articles/d41586-024-00747-9',
    'https://www.nature.com/articles/d41586-024-00720-6',
    'https://www.nature.com/articles/d41586-024-00661-0',
    'https://www.nature.com/articles/d41586-024-00695-4',
    'https://www.nature.com/articles/d41586-023-02240-1',
    'https://www.nature.com/articles/d41586-023-02357-3',
    'https://www.nature.com/articles/d41586-023-02333-x',
    'https://www.nature.com/articles/d41586-023-02215-2',
    'https://www.nature.com/articles/d41586-023-01618-5',
    'https://www.nature.com/articles/d41586-023-01935-9',
    'https://www.nature.com/articles/d41586-023-01890-5',
    'https://www.nature.com/articles/d41586-022-01932-4',
    'https://www.nature.com/articles/d41586-023-01491-2',
    'https://www.nature.com/articles/d41586-023-01355-9',
    'https://www.nature.com/articles/d41586-023-01020-1',
    'https://www.nature.com/articles/d41586-023-01023-y',
    'https://www.nature.com/articles/d41586-023-00979-1',
    'https://www.nature.com/articles/d41586-023-00850-3',
    'https://www.nature.com/articles/d41586-023-00835-2',
    'https://www.nature.com/articles/d41586-023-00847-y',
    'https://www.nature.com/articles/d41586-023-00798-4',
    'https://www.nature.com/articles/d41586-023-00710-0',
    'https://www.nature.com/articles/d41586-023-00755-1',
    'https://www.nature.com/articles/d41586-023-00597-x',
    'https://www.nature.com/articles/d41586-023-00531-1',
    'https://www.nature.com/articles/d41586-023-00539-7',
    'https://www.nature.com/articles/d41586-023-00475-6',
    'https://www.nature.com/articles/d41586-023-00268-x',
    'https://www.nature.com/articles/d41586-023-00243-6',
    'https://www.nature.com/articles/d41586-023-00225-8',
    'https://www.nature.com/articles/d41586-023-00195-x',
    'https://www.nature.com/articles/d41586-023-00027-y',
    'https://www.nature.com/articles/d41586-023-00068-3',
    'https://www.nature.com/articles/d41586-022-04494-7'
]

# Scrape the articles and collect the data
articles = []
for url in urls:
    result = scrape_nature_article(url)
    if result:
        articles.append({
            'url': url,
            'title': result['title'],
            'text': result['text'],
            'label': 'Human-written'
        })

nature_df = pd.DataFrame(articles)
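
The fallback values returned by `scrape_nature_article` can be illustrated without any network access. The sketch below applies the same three selectors to two invented HTML snippets. Both the helper name `extract_fields` and the markup are assumptions made for illustration and are not taken from real Nature pages.

```python
from bs4 import BeautifulSoup

def extract_fields(html):
    """Pull title and text out of one HTML document (illustrative helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    title_tag = soup.find('h1', class_='c-article-magazine-title')
    teaser_tag = soup.find('div', class_='c-article-teaser-text')
    body_tags = soup.find_all('p', class_='article__teaser')

    title = title_tag.get_text(strip=True) if title_tag else 'No Title Found'
    parts = []
    if teaser_tag:
        parts.append(teaser_tag.get_text(strip=True))
    parts.extend(p.get_text(strip=True) for p in body_tags)
    return {'title': title, 'text': ' '.join(parts).strip()}

# Invented markup that only mimics the class names used by the scraper
matching_html = """
<h1 class="c-article-magazine-title">Example title</h1>
<div class="c-article-teaser-text">Example teaser.</div>
<p class="article__teaser">First paragraph.</p>
<p class="article__teaser">Second paragraph.</p>
"""
# Invented markup with a different layout: none of the selectors match
other_layout_html = '<h1 class="some-other-title">Example title</h1><p>Body</p>'

matched = extract_fields(matching_html)
missed = extract_fields(other_layout_html)
print(matched)
print(missed)
```

When no selector matches, the result is the placeholder title with an empty text field, which is the pattern to look for in the scraped DataFrame.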

In [3]:
nature_df

| | url | title | text | label |
|---|---|---|---|---|
| 0 | https://www.nature.com/articles/d41586-024-005... | Magnetic whirlpools offer improved data storage | Complex magnetic structures called skyrmions h... | Human-written |
| 1 | https://www.nature.com/articles/d41586-024-006... | Whittling down the bacterial subspecies that m... | Understanding the factors that drive formation... | Human-written |
| 2 | https://www.nature.com/articles/s41586-024-071... | No Title Found | | Human-written |
| 3 | https://www.nature.com/articles/s41586-024-071... | No Title Found | | Human-written |
| 4 | https://www.nature.com/articles/s41586-024-071... | No Title Found | | Human-written |
| 5 | https://www.nature.com/articles/s41586-024-070... | No Title Found | | Human-written |
| 6 | https://www.nature.com/articles/s41586-024-071... | No Title Found | | Human-written |
| 7 | https://www.nature.com/articles/s41586-024-071... | No Title Found | | Human-written |
| 8 | https://www.nature.com/articles/s41586-024-070... | No Title Found | | Human-written |
| 9 | https://www.nature.com/articles/s41586-024-071... | No Title Found | | Human-written |


In [4]:
(nature_df['text'] == '').sum()

8

In [5]:
nature_df = nature_df[nature_df['text'] != '']

In [6]:
(nature_df['text'] == '').sum()

0

In [7]:
nature_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 42 entries, 0 to 49
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   url     42 non-null     object
 1   title   42 non-null     object
 2   text    42 non-null     object
 3   label   42 non-null     object
dtypes: object(4)
memory usage: 1.6+ KB
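
The eight rows dropped for empty text line up with the eight URLs in the list whose article ID begins with `s41586`. This suggests, though the notebook does not confirm it, that those pages use a different HTML layout from the `d41586` pages. A minimal sketch for checking this kind of pattern is shown below. The helpers `id_prefix` and `empty_text_by_prefix` are hypothetical names, and the `text` values in the sample are placeholders rather than scraped content.

```python
from collections import Counter
from urllib.parse import urlparse

def id_prefix(url):
    """Return the part of the article ID before the first hyphen (hypothetical helper)."""
    article_id = urlparse(url).path.rstrip('/').split('/')[-1]
    return article_id.split('-')[0]

def empty_text_by_prefix(records):
    """Count records with empty text, grouped by article-ID prefix."""
    return Counter(id_prefix(r['url']) for r in records if r['text'] == '')

# Stand-in records: URLs come from the list above, text values are placeholders
sample_records = [
    {'url': 'https://www.nature.com/articles/d41586-024-00576-w', 'text': 'placeholder'},
    {'url': 'https://www.nature.com/articles/s41586-024-07176-8', 'text': ''},
    {'url': 'https://www.nature.com/articles/s41586-024-07186-6', 'text': ''},
]

counts = empty_text_by_prefix(sample_records)
print(dict(counts))
```

On the real data, the same function could be called on the unfiltered `articles` list, since each entry carries the `url` and `text` keys it relies on.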


## Data Storage for Further Analysis

After scraping and organizing the data, we saved it to disk. This gives us a stable and easily accessible dataset for further analysis, without having to repeat the scraping process. We chose a pickle file as the storage format because it serializes Python objects directly, so the DataFrame's structure (column types and index) and content are preserved when the file is loaded again.


In [8]:
nature_df.to_pickle("nature_data.pkl")
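
As a minimal sketch of what the pickle format preserves, the example below round-trips a small stand-in DataFrame through `to_pickle` and `pd.read_pickle`. The frame contents, the file name `example.pkl`, and the use of a temporary directory are assumptions made for illustration. The real notebook writes `nature_data.pkl` to the working directory.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for nature_df, filtered so that the index has a gap
df = pd.DataFrame({
    'url': ['u0', 'u1', 'u2'],
    'title': ['Title 0', 'No Title Found', 'Title 2'],
    'text': ['some text', '', 'more text'],
    'label': ['Human-written'] * 3,
})
df = df[df['text'] != '']

# Write to a temporary location and read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.pkl')
    df.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares values, dtypes, and axis labels
round_trip_ok = restored.equals(df)
print(round_trip_ok)
print(list(restored.index))
```

The restored frame keeps the non-contiguous index left behind by the filtering step, so later cells see the same object that was saved.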

### References

- Python Software Foundation. (n.d.). *Requests: HTTP for Humans™*. Retrieved from [https://requests.readthedocs.io](https://requests.readthedocs.io)

- Mitchell, R., & Richardson, L. (n.d.). *Beautiful Soup Documentation*. Retrieved from [https://www.crummy.com/software/BeautifulSoup/bs4/doc/](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

- McKinney, W. (n.d.). *pandas: powerful Python data analysis toolkit*. Retrieved from [https://pandas.pydata.org/pandas-docs/stable/index.html](https://pandas.pydata.org/pandas-docs/stable/index.html)