# Introduction to the Data Scraping Notebook

In this data scraping notebook, our objective is to gather relevant information from two key sources: The White House and The European Commission. By systematically collecting and processing this data, we aim to provide insight into the nature of their support for Ukraine, potential differences in rhetoric, and the impact of President Zelenskiy's visits.

The notebook includes pipelines for scraping data from The White House and The European Commission (the latter is still under development). The scraping classes are implemented in the src/scraper.py file.

# Set Up Environment

## Import necessary libraries

In [1]:
import pandas as pd

import src.scraper as s

# Scrape The White House Data

In this section, we retrieve data from The White House, focusing on President Biden's administration. This includes parsing official statements, speeches, and press briefings.

The data from The White House will be crucial in understanding the United States' stance and support for Ukraine, particularly in the context of President Zelenskiy's visits. It forms a foundational component of our comparative analysis with The European Commission.

## Scraping Pipeline

In [2]:
# Initialize an empty dataframe to store the results
wh_articles_df = pd.DataFrame(columns=['Title', 'Date', 'Category', 'Location', 'Text'])

# List of categories to be scraped
wh_links = ['https://www.whitehouse.gov/briefing-room/speeches-remarks/',
            'https://www.whitehouse.gov/briefing-room/statements-releases/',
            'https://www.whitehouse.gov/briefing-room/press-briefings/']        

# Iterate through all categories
for link in wh_links:
    # Get the category
    category = link.split('/')[-2]
    print(category)
    # Initialize the scraping class
    scraper = s.TheWhiteHouseScraper(url=link)
    soup = scraper.get_html_content()
    
    # Get the total number of pages
    page_num = scraper.get_page_num(soup)
    print(f'Total number of pages: {page_num}')

    # Get articles from each page
    for i in range(1, page_num+1):
        page_link = f'{link}page/{i}/'
        page_scraper = s.TheWhiteHouseScraper(url=page_link)
        page_soup = page_scraper.get_html_content()
        
        # Add articles to a dataframe
        df_temp = pd.DataFrame(page_scraper.get_articles(page_soup, category))
        wh_articles_df = pd.concat([wh_articles_df, df_temp], ignore_index=True)

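        # Print progress roughly every 10%; the step is at least 1 so that
        # a category with fewer than 10 pages does not cause a modulo by zero
        if i % max(1, page_num // 10) == 0:
            print(f'{i}/{page_num} completed.')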
    print()

print("Scraping completed.")

speeches-remarks
Total number of pages: 202
20/202 completed.
40/202 completed.
60/202 completed.
80/202 completed.
100/202 completed.
120/202 completed.
140/202 completed.
160/202 completed.
180/202 completed.
200/202 completed.

statements-releases
Total number of pages: 512
51/512 completed.
102/512 completed.
153/512 completed.
204/512 completed.
255/512 completed.
306/512 completed.
357/512 completed.
408/512 completed.
459/512 completed.
510/512 completed.

press-briefings
Total number of pages: 92
9/92 completed.
18/92 completed.
27/92 completed.
36/92 completed.
45/92 completed.
54/92 completed.
63/92 completed.
72/92 completed.
81/92 completed.
90/92 completed.

Scraping completed.
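
The loop above grows wh_articles_df with pd.concat on every page, which copies the accumulated frame each time. A minimal sketch of an alternative, assuming the same column layout: collect one frame per page in a list and concatenate once at the end. The fetch_page_articles function is a hypothetical stand-in for the scraper calls so that the example runs without network access; it is not part of src/scraper.py, and its values are placeholders.

```python
import pandas as pd

COLUMNS = ['Title', 'Date', 'Category', 'Location', 'Text']


def fetch_page_articles(category, page):
    """Hypothetical stand-in for page_scraper.get_articles(page_soup, category).

    Returns a list of dicts, one per article, filled with placeholder values.
    """
    return [
        {
            'Title': f'{category} article {page}-{n}',
            'Date': '2024-01-01T00:00:00-05:00',
            'Category': category,
            'Location': 'Placeholder location',
            'Text': 'Placeholder text',
        }
        for n in range(2)
    ]


def collect_articles(categories, pages_per_category):
    """Build one frame per page, then concatenate all of them once."""
    frames = []
    for category in categories:
        for page in range(1, pages_per_category + 1):
            rows = fetch_page_articles(category, page)
            frames.append(pd.DataFrame(rows, columns=COLUMNS))
    if not frames:
        # pd.concat raises on an empty list, so return an empty frame instead
        return pd.DataFrame(columns=COLUMNS)
    return pd.concat(frames, ignore_index=True)


demo_df = collect_articles(['speeches-remarks', 'press-briefings'], pages_per_category=3)
print(demo_df.shape)
```

With several hundred pages per category, a single concatenation avoids repeatedly copying rows that were already collected; the resulting frame has the same columns as wh_articles_df.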


In [11]:
wh_articles_df.shape

(8053, 5)

In [12]:
wh_articles_df.head()

| | Title | Date | Category | Location | Text |
|---|---|---|---|---|---|
| 0 | Remarks by President Biden and Vice President ... | 2024-02-03T22:00:00-05:00 | Speeches and Remarks | Biden for President Campaign Headquarters; Wil... | THE VICE PRESIDENT: Hello, Delaware! (Applau... |
| 1 | Remarks by Vice President Harris at a Campaign... | 2024-02-02T23:33:00-05:00 | Speeches and Remarks | South Carolina State University; Orangeburg, S... | THE VICE PRESIDENT: All right. Can we hear i... |
| 2 | Remarks by President Biden at a Political Even... | 2024-02-01T20:24:19-05:00 | Speeches and Remarks | Region 1 Union Hall; Warren, Michigan | 4:41 P.M. EST\n \nTHE PRESIDENT: Well, thank ... |
| 3 | Remarks by President Biden at the National Pra... | 2024-02-01T14:13:03-05:00 | Speeches and Remarks | U.S. Capitol; Washington, D.C. | 9:04 A.M. EST\nTHE PRESIDENT: Frank, thank yo... |
| 4 | Remarks by President Biden at a Campaign Recep... | 2024-01-31T00:04:32-05:00 | Speeches and Remarks | Private Residence; Miami, Florida | 6:27 P.M. EST\n\nTHE PRESIDENT: Well, Chris, t... |


In [13]:
# Save to csv
wh_articles_df.to_csv('data/thewhitehouse.csv', index=False)
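
The Date column holds ISO 8601 strings with a UTC offset, as the preview of wh_articles_df shows. A minimal sketch of how such values could be parsed for later filtering, assuming that normalizing every timestamp to UTC is acceptable for the analysis. The small frame below reuses three Date values from the preview instead of reading data/thewhitehouse.csv, so the example runs on its own; the cutoff date is an arbitrary illustration.

```python
import pandas as pd

# Three Date values copied from the preview of wh_articles_df
sample = pd.DataFrame({
    'Date': [
        '2024-02-03T22:00:00-05:00',
        '2024-02-02T23:33:00-05:00',
        '2024-01-31T00:04:32-05:00',
    ]
})

# Parse the ISO 8601 strings into timezone-aware timestamps, normalized to UTC
sample['Date'] = pd.to_datetime(sample['Date'], utc=True)

# Illustrative filter: keep rows dated on or after 1 February 2024 (UTC)
cutoff = pd.Timestamp('2024-02-01', tz='UTC')
recent = sample[sample['Date'] >= cutoff]
print(len(recent))
```

Parsing with utc=True keeps the column in a single datetime dtype even if the scraped offsets differ between rows, for example across daylight saving changes.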

# Scrape The European Commission Data

In this phase, our focus shifts to collecting data from The European Commission, which plays a significant role in the European Union's policies and actions.

Specifically, we focus on the speeches and remarks of Ursula von der Leyen, the current President of the European Commission. The data collected will offer insights into the European Commission's stance on and support for Ukraine.

In [None]:
# Under development

In [None]:
link = 'https://ec.europa.eu/commission/presscorner/home/en'