# Introduction to the Data Scraping Notebook

In this data scraping notebook, our objective is to gather relevant information from two key sources: The White House and The European Commission. By systematically collecting and processing their published statements, we aim to compare the nature of their support for Ukraine, identify potential differences in rhetoric, and assess the impact of President Zelenskiy's visits.

The notebook contains one scraping pipeline per source. The White House pipeline is complete; The European Commission pipeline is still under development. The scraping classes are implemented in the src/scraper.py file.
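
The pipeline cells later in this notebook call three methods on a scraper object: get_html_content, get_page_num and get_articles. The actual classes in the scraper module are not shown here, so the following is only a sketch of the interface the notebook appears to rely on. The return types, the FakeScraper stand-in, the collect helper and the example.test URL are assumptions made for illustration.

```python
from typing import Any, Protocol


class PageScraper(Protocol):
    """Methods the notebook calls on a scraper (return types are assumed)."""

    def get_html_content(self) -> Any:
        """Return the parsed HTML of the page at the scraper's URL."""
        ...

    def get_page_num(self, soup: Any) -> int:
        """Return the total number of listing pages."""
        ...

    def get_articles(self, soup: Any) -> list[dict[str, str]]:
        """Return one record per article found on the page."""
        ...


class FakeScraper:
    """Hypothetical in-memory stand-in for running the loop without network access."""

    def __init__(self, url: str) -> None:
        self.url = url

    def get_html_content(self) -> str:
        # Placeholder markup; a real scraper would download and parse the page
        return f'<html><!-- {self.url} --></html>'

    def get_page_num(self, soup: Any) -> int:
        return 2

    def get_articles(self, soup: Any) -> list[dict[str, str]]:
        # Placeholder record with the same columns as the notebook's dataframe
        return [{'Title': f'Article from {self.url}', 'Date': '',
                 'Category': '', 'Location': '', 'Text': ''}]


def collect(scraper_cls: Any, link: str) -> list[dict[str, str]]:
    """Mirror the notebook's loop: read the page count, then visit each page."""
    first = scraper_cls(url=link)
    page_num = first.get_page_num(first.get_html_content())
    rows: list[dict[str, str]] = []
    for i in range(1, page_num + 1):
        page = scraper_cls(url=f'{link}page/{i}/')
        rows.extend(page.get_articles(page.get_html_content()))
    return rows


rows = collect(FakeScraper, 'https://example.test/')
print(len(rows))
```

A stand-in like this makes it possible to check the loop logic before sending any requests to the real site.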

# Set Up Environment

## Import necessary libraries

In [11]:
import pandas as pd

import src.scraper as s

# Scrape The White House Data

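In this section, we retrieve data from The White House, focusing on President Biden's administration. This includes parsing official statements, speeches, and documents. The pipeline below currently covers the Speeches and Remarks category of the Briefing Room; further categories can be added to the list of links.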

The data from The White House will be crucial in understanding the United States' stance and support for Ukraine, particularly in the context of President Zelenskiy's visits. It forms a foundational component of our comparative analysis with The European Commission.

## Scraping Pipeline

In [18]:
# Initialize an empty dataframe to store the results
articles_df = pd.DataFrame(columns=['Title', 'Date', 'Category', 'Location', 'Text'])

# List of categories to be scraped
links = ['https://www.whitehouse.gov/briefing-room/speeches-remarks/']

# Iterate through all categories
for link in links:
    # Initialize the scraping class
    scraper = s.TheWhiteHouseScraper(url=link)
    soup = scraper.get_html_content()
    
    # Get the total number of pages
    page_num = scraper.get_page_num(soup)
    print(link)
    print(f'Total number of pages: {page_num}')

    # Get articles from each page
    for i in range(1, page_num+1):
        page_link = f'{link}page/{i}/'
        page_scraper = s.TheWhiteHouseScraper(url=page_link)
        page_soup = page_scraper.get_html_content()
        
        # Add articles to a dataframe
        df_temp = pd.DataFrame(page_scraper.get_articles(page_soup))
        articles_df = pd.concat([articles_df, df_temp], ignore_index=True)

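        # Print progress roughly every 10%; the step is at least 1 so that
        # listings with fewer than 10 pages do not cause a division by zero
        step = max(1, page_num // 10)
        if i % step == 0: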
            progress_percent = (i / page_num) * 100
            print(f'{progress_percent:.0f}% completed - {i}/{page_num}')
    print()

print("Scraping completed.")

https://www.whitehouse.gov/briefing-room/speeches-remarks/
Total number of pages: 202
10% completed - 20/202
20% completed - 40/202
30% completed - 60/202
40% completed - 80/202
50% completed - 100/202
59% completed - 120/202
69% completed - 140/202
79% completed - 160/202
89% completed - 180/202
99% completed - 200/202

Scraping completed.
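
The log above shows 59%, 69% and so on rather than round tens because the reporting step is page_num // 10, which is 20 pages, slightly less than 10% of 202. A minimal sketch of one alternative is given below: it precomputes the first page that reaches each tenth of the total, and it also works when there are fewer than 10 pages. The helper name progress_milestones is hypothetical and not part of the scraper module.

```python
def progress_milestones(total: int, parts: int = 10) -> list[int]:
    """Return the page indices at which progress should be reported."""
    if total <= 0:
        return []
    # Ceiling division: the first page index that reaches k/parts of the total.
    # The set removes duplicates that occur when total < parts.
    return sorted({-(-total * k // parts) for k in range(1, parts + 1)})


page_num = 202
milestones = set(progress_milestones(page_num))

for i in range(1, page_num + 1):
    # ... scrape page i here ...
    if i in milestones:
        print(f'{i / page_num:.0%} completed - {i}/{page_num}')
```

With this approach the final page is always a milestone, so the log ends at 100%.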


In [20]:
articles_df.head()

,Title,Date,Category,Location,Text
0,Remarks by President Biden at a Political Even...,2024-02-01T20:24:19-05:00,Speeches and Remarks,"Region 1 Union Hall; Warren, Michigan","4:41 P.M. EST\n \nTHE PRESIDENT: Well, thank ..."
1,Remarks by President Biden at the National Pra...,2024-02-01T14:13:03-05:00,Speeches and Remarks,"U.S. Capitol; Washington, D.C.","9:04 A.M. EST\nTHE PRESIDENT: Frank, thank yo..."
2,Remarks by President Biden at a Campaign Recep...,2024-01-31T00:04:32-05:00,Speeches and Remarks,"Private Residence; Miami, Florida","6:27 P.M. EST\n\nTHE PRESIDENT: Well, Chris, t..."
3,Remarks and Q&A by National Security Advisor J...,2024-01-30T22:00:00-05:00,Speeches and Remarks,"Council on Foreign Relations; Washington, D.C.",MR. SULLIVAN: At least I had the bravery to g...
4,Remarks as Prepared for Delivery by First Lady...,2024-01-30T20:52:50-05:00,Speeches and Remarks,The White House,"Brian and Sandra, it’s an honor for Joe and me..."


In [24]:
# Check if there are any missing values in the dataframe
articles_df.isnull().values.any()

False

The check returns False, so the dataframe contains no missing values.
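
Note that isnull() only detects NaN and None. A field that was scraped as an empty or whitespace-only string would not be flagged. The following sketch shows one way to count both kinds of gaps per column; the sample dataframe is made up for illustration and is not taken from the scraped data.

```python
import pandas as pd

# Hypothetical sample with one empty and one whitespace-only field
sample = pd.DataFrame({
    'Title': ['Remarks A', 'Remarks B'],
    'Location': ['The White House', ''],
    'Text': ['Some text', '   '],
})

# Count NaN/None values per column
missing_per_column = sample.isna().sum()

# Count empty or whitespace-only strings per column
blank_per_column = sample.apply(
    lambda col: col.astype(str).str.strip().eq('').sum()
)

print(missing_per_column)
print(blank_per_column)
```

Applied to articles_df, the same two expressions would show whether any column contains blank entries that the boolean check above cannot see.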

In [25]:
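# Save to csv (create the output directory first if it does not exist)
import os
os.makedirs('data', exist_ok=True)
articles_df.to_csv('data/thewhitehouse.csv', index=False)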

# Scrape The European Commission Data

In this phase, our focus shifts to collecting data from The European Commission, which plays a significant role in the European Union's policies and actions.

Specifically, we will focus on the speeches and remarks of Ursula von der Leyen, the current President of the European Commission. The data collected will offer insights into the European Commission's stance and support for Ukraine.

In [None]:
# Under development