# Scrape White House press release text

> Use the requests library to read each of the Biden's administration's [press releases](https://www.whitehouse.gov/briefing-room).

---

## Config

#### Python tools and Jupyter settings

In [9]:
import json
import requests
import pandas as pd
import jupyter_black
import altair as alt
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm

In [2]:
jupyter_black.load()
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = None

In [3]:
today = pd.Timestamp("today").strftime("%Y-%m-%d")

---

## Read data

#### Read the metadata scraped in `00` notebook

In [None]:
df = pd.read_json("../../data/processed/biden_trump_releases_metadata_keywords.json")

In [22]:
df.head()

Unnamed: 0,headline,url,date_str,category,page,location,date,president,issue_flag,derived_category,latitude,longitude,keywords,headline_lower
0,Remarks as Prepared for Delivery by First Lady Jill Biden at the Human Rights Campaign 2024 Los Angeles Dinner,https://www.whitehouse.gov/briefing-room/speeches-remarks/2024/03/24/remarks-as-prepared-for-delivery-by-first-lady-jill-biden-at-the-human-rights-campaign-2024-los-angeles-dinner/,"March 24, 2024",Speeches and Remarks,1,,2024-03-24,biden,Not specified,Speeches and Remarks,,,"[remarks, delivery, jill, rights, campaign, los, angeles, dinner]",remarks as prepared for delivery by first lady jill biden at the human rights campaign 2024 los angeles dinner
1,"Remarks by Vice President Harris on Gun Violence Prevention While at Marjory Stoneman Douglas High School | Parkland, FL",https://www.whitehouse.gov/briefing-room/speeches-remarks/2024/03/23/remarks-by-vice-president-harris-on-gun-violence-prevention-while-at-marjory-stoneman-douglas-high-school-parkland-fl/,"March 23, 2024",Speeches and Remarks,1,"Parkland, FL",2024-03-23,biden,Not specified,Speeches and Remarks,26.310777,-80.253225,"[remarks, vice, president, harris, gun, violence, prevention, marjory, douglas, school, parkland, fl]","remarks by vice president harris on gun violence prevention while at marjory stoneman douglas high school | parkland, fl"
2,Statement from Press Secretary Karine Jean-Pierre on the Terrorist Attack in Moscow,https://www.whitehouse.gov/briefing-room/statements-releases/2024/03/23/statement-from-press-secretary-karine-jean-pierre-on-the-terrorist-attack-in-moscow/,"March 23, 2024",Statements and Releases,1,,2024-03-23,biden,Not specified,Statements and Releases,,,"[statement, press, secretary, karine, attack, moscow]",statement from press secretary karine jean-pierre on the terrorist attack in moscow
3,"Press Release: Letter to the Speaker of the House and President of the Senate: Designation of Funding as Emergency Requirements in Accordance with Section 6 of the Further Consolidated Appropriations Act, 2024",https://www.whitehouse.gov/briefing-room/presidential-actions/2024/03/23/press-release-letter-to-the-speaker-of-the-house-and-president-of-the-senate-designation-of-funding-as-emergency-requirements-in-accordance-with-section-6-of-the-further-consolidated-appropriations/,"March 23, 2024",Presidential Actions,1,,2024-03-23,biden,Not specified,Presidential Actions,,,"[press, release, letter, speaker, house, president, designation, funding, emergency, requirements, accordance, section, appropriations, act]","press release: letter to the speaker of the house and president of the senate: designation of funding as emergency requirements in accordance with section 6 of the further consolidated appropriations act, 2024"
4,Statement from President Joe Biden on the Bipartisan Government Funding Bill,https://www.whitehouse.gov/briefing-room/statements-releases/2024/03/23/statement-from-president-joe-biden-on-the-bipartisan-government-funding-bill/,"March 23, 2024",Statements and Releases,1,,2024-03-23,biden,Not specified,Statements and Releases,,,"[statement, president, joe, biden, government, funding, bill]",statement from president joe biden on the bipartisan government funding bill


In [None]:
urls = list(df["url"])

In [20]:
urls[0:4]

['https://www.whitehouse.gov/briefing-room/speeches-remarks/2024/03/24/remarks-as-prepared-for-delivery-by-first-lady-jill-biden-at-the-human-rights-campaign-2024-los-angeles-dinner/',
 'https://www.whitehouse.gov/briefing-room/speeches-remarks/2024/03/23/remarks-by-vice-president-harris-on-gun-violence-prevention-while-at-marjory-stoneman-douglas-high-school-parkland-fl/',
 'https://www.whitehouse.gov/briefing-room/statements-releases/2024/03/23/statement-from-press-secretary-karine-jean-pierre-on-the-terrorist-attack-in-moscow/',
 'https://www.whitehouse.gov/briefing-room/presidential-actions/2024/03/23/press-release-letter-to-the-speaker-of-the-house-and-president-of-the-senate-designation-of-funding-as-emergency-requirements-in-accordance-with-section-6-of-the-further-consolidated-appropriations/']

In [21]:
releases_list = []

for u in urls[0:4]:
    r = requests.get(u)
    soup = BeautifulSoup(r.text, "html.parser")

    # Find the section with class 'body-content'
    body_content_section = soup.find("section", class_="body-content")

    # Find all <p> tags within the body content section
    paragraphs = body_content_section.find_all("p")
    paragraphs_center = body_content_section.find_all(
        "p", class_="has-text-align-center"
    )[0]

    # Iterate through each <p> tag
    for p in paragraphs:
        # Initialize variables to store location, time, and text for each iteration
        location_one = None
        location_two = None
        time_begin = None
        time_end = None
        text_clean = ""

        # Check if the <p> tag contains location information
        if p.find("em"):
            location_parts = p.get_text().split("<br/>")
            location_one = location_parts[0].strip()
            try:
                location_two = location_parts[1].replace("</em></p>", "").strip()
            except IndexError:
                pass  # Handle the case when there's no second location part
        # Check if the <p> tag contains time information
        elif (
            p.get_text()
            .strip()
            .startswith(("1", "2", "3", "4", "5", "6", "7", "8", "9"))
        ):
            time_parts = p.get_text().split("<br/>")
            time_begin = (
                time_parts[0]
                .strip()
                .replace('<p class="has-text-align-center"><em>', "")
                .replace("<p>", "")
            )
            try:
                time_end = (
                    time_parts[-1]
                    .strip()
                    .replace("\xa0", "")
                    .replace("END  ", "")
                    .replace("</p>", "")
                )
            except IndexError:
                pass  # Handle the case when there's no time end information
            try:
                text_clean = p.get_text().split("<br/><br/>", 1)[1]
            except IndexError:
                text_clean = (
                    p.get_text().strip() + "\n"
                )  # If no text split, use the whole text
        else:
            text_clean = p.get_text().strip() + "\n"  # Extract text

        # Create a dictionary for the extracted information
        text_dict = {
            "loc_one": location_one,
            "loc_two": location_two,
            "start": time_begin,
            "end": time_end,
            "text": text_clean,
        }

        releases_list.append(text_dict)

IndexError: list index out of range

In [10]:
releases_list

[{'loc_one': 'CommerzbankMunich, Germany',
  'loc_two': None,
  'start': None,
  'end': None,
  'text': ''},
 {'loc_one': None,
  'loc_two': None,
  'start': '2:07 P.M. CESTVICE PRESIDENT HARRIS:\xa0 Good afternoon, everyone.President Zelenskyy, it was my honor to meet with you again.\xa0 This is our fifth meeting, by my count.\xa0 And our first meeting was here almost exactly two years ago.I want to thank you for all that you have done and all that you are as a leader.\xa0 You and I have had many conversations.\xa0 And it is my honor to say, as part of a public conversation, that you have been an extraordinarily courageous leader and have shown you commitment to the Ukrainian people and to democratic principles, including the most important — one of the most important — which is the importance of sovereignty and territorial integrity.So, it is good to see you again.I was in Munich and have been here to talk about where we stand currently in terms of our relationship to the Ukrainian p

## Export

#### All to CSV & JSON

In [11]:
df.to_csv(f"data/processed/white_house_release_release_metadata.csv", index=False)
df.to_json(
    f"data/processed/white_house_release_release_metadata.json",
    indent=4,
    orient="records",
)

#### Export by category

In [12]:
categories = list(df.category.unique())

In [13]:
for c in categories:
    df.query(f'category == "{c}"').to_csv(
        f"data/processed/white_house_release_release_metadata_{c.lower().replace(' ', '_')}.csv",
        index=False,
    )
    df.query(f'category == "{c}"').to_json(
        f"data/processed/white_house_release_release_metadata_{c.lower().replace(' ', '_')}.json",
        indent=4,
        orient="records",
    )