# Scrape Biden White House press release metadata

> Use the requests library to visit the Biden administration's [press release page](https://www.whitehouse.gov/briefing-room), loop over the pagination, and grab basic information (headline, URL, date and category) about each release.

---

## Config

#### Python tools and Jupyter settings

In [1]:
import json
import requests
import pandas as pd
import jupyter_black
import altair as alt
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm

In [2]:
jupyter_black.load()
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = None

In [3]:
today = pd.Timestamp("today").strftime("%Y-%m-%d")
president = "biden"

---

## Read data

#### Get pagination details from White House press release [home page](https://www.whitehouse.gov/briefing-room/)

In [4]:
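# Fetch the landing page; its last pagination link holds the final page number
r = requests.get("https://www.whitehouse.gov/briefing-room/")
sp = BeautifulSoup(r.text, "html.parser")

pagination = sp.find_all("a", class_="page-numbers")[-1].text
end = int(pagination.replace("Page ", "")) + 1  # exclusive upper bound for range()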
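The last-page lookup can be tried offline against a small HTML fragment. This is a minimal sketch: the markup in `SAMPLE_HTML` is a made-up stand-in for the site's pagination, and `last_page_number` is a hypothetical helper, not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the pagination markup; the live page may differ.
SAMPLE_HTML = """
<nav>
  <a class="page-numbers" href="/briefing-room/page/2/">Page 2</a>
  <a class="page-numbers" href="/briefing-room/page/3/">Page 3</a>
  <a class="page-numbers" href="/briefing-room/page/993/">Page 993</a>
</nav>
"""


def last_page_number(html: str) -> int:
    """Return the number in the final "page-numbers" link, or 1 if none exist."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a", class_="page-numbers")
    if not links:
        return 1
    # Keep only the digits so "Page 993" and "993" both parse.
    digits = "".join(ch for ch in links[-1].get_text() if ch.isdigit())
    return int(digits) if digits else 1


last_page = last_page_number(SAMPLE_HTML)
end = last_page + 1  # range(1, end) then covers every page, last one included
```

Falling back to `1` when no pagination links are found is a design choice for this sketch; raising an error may be preferable if a missing pager should stop the scrape.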

#### Loop through each page and snag details about each release

In [5]:
releases = []

for i in tqdm(range(1, end)):
    response = requests.get(f"https://www.whitehouse.gov/briefing-room/page/{i}/")
    soup = BeautifulSoup(response.text, "html.parser")

    for s in soup.find_all("article", class_="news-item"):
        headline = s.find("h2").text.strip()
        url = s.find("a")["href"]
        date = s.find("time").text
        category = s.find("span", class_="cat-links").text
        releases_dict = {
            "headline": headline,
            "url": url,
            "date_str": date,
            "category": category,
            "page": i,
        }
        releases.append(releases_dict)

  0%|          | 0/993 [00:00<?, ?it/s]
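The loop above assumes every `article` has an `h2`, a link, a `time` and a `cat-links` span; a missing element would raise an error partway through the run. A defensive variant might look like the sketch below. The sample markup is invented to match the selectors used in the loop, and `text_or_empty` and `parse_page` are hypothetical helper names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the selectors in the loop; the live site may differ.
SAMPLE_PAGE = """
<article class="news-item">
  <h2><a href="https://example.com/release-one/">Release one | Parkland, FL</a></h2>
  <time>March 23, 2024</time>
  <span class="cat-links">Speeches and Remarks</span>
</article>
<article class="news-item">
  <h2><a href="https://example.com/release-two/">Release two</a></h2>
</article>
"""


def text_or_empty(tag):
    """Return stripped text for a tag, or "" when the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else ""


def parse_page(html, page):
    """Extract one dict per news item, tolerating missing child elements."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("article", class_="news-item"):
        link = item.find("a")
        rows.append(
            {
                "headline": text_or_empty(item.find("h2")),
                "url": link.get("href", "") if link is not None else "",
                "date_str": text_or_empty(item.find("time")),
                "category": text_or_empty(item.find("span", class_="cat-links")),
                "page": page,
            }
        )
    return rows


rows = parse_page(SAMPLE_PAGE, page=1)
```

In the real loop, `parse_page(response.text, i)` could replace the inner `for` block, with `releases.extend(...)` collecting the results.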

#### Get the list into a dataframe

In [6]:
src = pd.DataFrame(releases)

In [7]:
count = len(src)
count

9922

#### Clean up

In [8]:
# Headlines sometimes end with "| City, ST"; keep that part and trim the padding
src["location"] = (
    src["headline"].str.split("|", expand=True)[1].fillna("").str.strip()
)
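Selecting column `1` after `str.split(..., expand=True)` depends on at least one headline containing a pipe; otherwise that column does not exist. One alternative, sketched here on invented sample headlines, is `str.partition`, which always returns three columns. Note that it keeps everything after the first pipe, which differs slightly from the split above when a headline contains more than one.

```python
import pandas as pd

# Sample headlines for illustration only.
headlines = pd.Series(
    [
        "Remarks by Vice President Harris | Parkland, FL",
        "Statement from President Joe Biden",
    ]
)

# str.partition always returns three columns (before, separator, after),
# so column 2 exists even when no headline contains "|".
parts = headlines.str.partition("|")
location = parts[2].str.strip()
```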

In [9]:
src["date"] = pd.to_datetime(src["date_str"]).astype(str)
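The date strings in this dataset follow a "Month D, YYYY" pattern, so an explicit format can be passed instead of relying on inference. The following is a sketch on sample values, assuming every scraped date uses that same pattern; the third value is deliberately invalid to show the `errors="coerce"` behavior.

```python
import pandas as pd

date_str = pd.Series(["March 24, 2024", "March 23, 2024", "not a date"])

# An explicit format avoids per-value inference; errors="coerce" turns
# unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(date_str, format="%B %d, %Y", errors="coerce")

# Render as ISO strings; NaT rows become missing values.
iso = parsed.dt.strftime("%Y-%m-%d")
```

Unlike `.astype(str)`, which would write the literal string `"NaT"` for a failed parse, this leaves the value missing so it can be found with `isna()`.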

In [10]:
src["president"] = president

In [11]:
df = src.copy()

#### The result

In [12]:
df.head()

| | headline | url | date_str | category | page | location | date | president |
|---|---|---|---|---|---|---|---|---|
| 0 | Remarks as Prepared for Delivery by First Lady Jill Biden at the Human Rights Campaign 2024 Los Angeles Dinner | https://www.whitehouse.gov/briefing-room/speeches-remarks/2024/03/24/remarks-as-prepared-for-delivery-by-first-lady-jill-biden-at-the-human-rights-campaign-2024-los-angeles-dinner/ | March 24, 2024 | Speeches and Remarks | 1 | | 2024-03-24 | biden |
| 1 | Remarks by Vice President Harris on Gun Violence Prevention While at Marjory Stoneman Douglas High School \| Parkland, FL | https://www.whitehouse.gov/briefing-room/speeches-remarks/2024/03/23/remarks-by-vice-president-harris-on-gun-violence-prevention-while-at-marjory-stoneman-douglas-high-school-parkland-fl/ | March 23, 2024 | Speeches and Remarks | 1 | Parkland, FL | 2024-03-23 | biden |
| 2 | Statement from Press Secretary Karine Jean-Pierre on the Terrorist Attack in Moscow | https://www.whitehouse.gov/briefing-room/statements-releases/2024/03/23/statement-from-press-secretary-karine-jean-pierre-on-the-terrorist-attack-in-moscow/ | March 23, 2024 | Statements and Releases | 1 | | 2024-03-23 | biden |
| 3 | Press Release: Letter to the Speaker of the House and President of the Senate: Designation of Funding as Emergency Requirements in Accordance with Section 6 of the Further Consolidated Appropriations Act, 2024 | https://www.whitehouse.gov/briefing-room/presidential-actions/2024/03/23/press-release-letter-to-the-speaker-of-the-house-and-president-of-the-senate-designation-of-funding-as-emergency-requirements-in-accordance-with-section-6-of-the-further-consolidated-appropriations/ | March 23, 2024 | Presidential Actions | 1 | | 2024-03-23 | biden |
| 4 | Statement from President Joe Biden on the Bipartisan Government Funding Bill | https://www.whitehouse.gov/briefing-room/statements-releases/2024/03/23/statement-from-president-joe-biden-on-the-bipartisan-government-funding-bill/ | March 23, 2024 | Statements and Releases | 1 | | 2024-03-23 | biden |


In [13]:
# Report the scrape size; the export itself happens in the next section
print(
    f"The scraper collected {count} metadata records for President {president.title()}'s press releases."
)

The scraper collected 9922 metadata records for President Biden's press releases.


---

## Export

#### All to CSV & JSON

In [14]:
df.to_csv(
    f"../../data/processed/{president}/{president}_release_metadata.csv", index=False
)
df.to_json(
    f"../../data/processed/{president}/{president}_release_metadata.json",
    indent=4,
    orient="records",
)

#### Export by category

In [15]:
categories = list(df.category.unique())

In [16]:
# for c in categories:
#     df.query(f'category == "{c}"').to_csv(
#         f"../../data/processed/{president}/{president}_release_metadata_{c.lower().replace(' ', '_')}.csv",
#         index=False,
#     )
#     df.query(f'category == "{c}"').to_json(
#         f"../../data/processed/{president}/{president}_release_metadata_{c.lower().replace(' ', '_')}.json",
#         indent=4,
#         orient="records",
#     )
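The commented-out cell filters the frame twice per category with `query`. An equivalent approach, sketched below on a tiny stand-in frame, walks `groupby("category")` once and writes both formats per group. The `slugify` helper is hypothetical, the category values are only examples, and a temporary directory stands in for `../../data/processed/{president}/`.

```python
import re
import tempfile
from pathlib import Path

import pandas as pd


def slugify(label: str) -> str:
    """Lowercase a label and collapse non-alphanumeric runs into underscores."""
    return re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")


# Tiny stand-in frame; the real `df` comes from the scrape above.
df = pd.DataFrame(
    {
        "headline": ["A", "B", "C"],
        "category": ["Speeches and Remarks", "Press Briefings", "Speeches and Remarks"],
    }
)

# A temporary directory stands in for the project's processed-data folder.
out_dir = Path(tempfile.mkdtemp())
president = "biden"

written = []
for category, group in df.groupby("category"):
    stem = f"{president}_release_metadata_{slugify(category)}"
    group.to_csv(out_dir / f"{stem}.csv", index=False)
    group.to_json(out_dir / f"{stem}.json", indent=4, orient="records")
    written.append(stem)
```

Slugging with a regular expression also handles category names containing punctuation such as `&` or `/`, which `replace(' ', '_')` alone would leave in the file name.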