# Scrape Trump White House transcript metadata

> Use the requests library to visit the Trump administration's archived [news page](https://trumpwhitehouse.archives.gov/news/), loop over the pagination and grab basic information (headline, URL, date, category and issue flag) about each release.

---

## Config

#### Python tools and Jupyter settings

In [1]:
import json
import requests
import pandas as pd
import jupyter_black
import altair as alt
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm

In [2]:
jupyter_black.load()
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = None

In [3]:
today = pd.Timestamp("today").strftime("%Y-%m-%d")
president = "trump"

---

## Read data

#### Get pagination details from the archived White House [news page](https://trumpwhitehouse.archives.gov/news/)

In [4]:
r = requests.get("https://trumpwhitehouse.archives.gov/news/")
sp = BeautifulSoup(r.text, "html.parser")

pagination = sp.find_all("a", class_="page-numbers")[-1].text
end = int(pagination.replace("Page ", "")) + 1
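
The cell above assumes the last `page-numbers` link reads like `Page 910`. A minimal sketch of a more defensive parse, assuming only that the link texts are available as a list of strings, takes the largest integer found in any label, so a trailing "Next" link or a thousands separator would not break `int()`. The helper name `last_page_number` and the sample labels are hypothetical.

```python
import re


def last_page_number(labels):
    """Return the largest page number found in a list of pagination link texts."""
    numbers = []
    for label in labels:
        # Drop thousands separators, then pull out the first run of digits
        match = re.search(r"\d+", label.replace(",", ""))
        if match:
            numbers.append(int(match.group()))
    if not numbers:
        raise ValueError("no page numbers found in pagination labels")
    return max(numbers)


# Hypothetical link texts, in the order they might appear on the page
labels = ["Page 1", "Page 2", "Page 3", "Page 910", "Next"]
end = last_page_number(labels) + 1  # range(1, end) then covers every page
```

With the notebook's `sp` object, the labels could be built as `[a.text for a in sp.find_all("a", class_="page-numbers")]`.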

#### Loop through each page [in the archive](https://trumpwhitehouse.archives.gov/news/) and snag details about each release

In [5]:
releases = []

for i in tqdm(range(1, end)):  # Pages are numbered 1 through end - 1
    response = requests.get(f"https://trumpwhitehouse.archives.gov/news/page/{i}/")
    soup = BeautifulSoup(response.text, "html.parser")

    # Select articles for both briefing statements and presidential actions
    articles = soup.find_all(
        "article", class_=["briefing-statement", "presidential-action"]
    )

    for article in articles:
        # Common elements
        headline = article.find("h2").text.strip()
        url = article.find("a")["href"]
        date = article.find("time").text

        # Category can be either briefing-statement__type or presidential-action__type
        category = article.find(
            "p", class_=["briefing-statement__type", "presidential-action__type"]
        )
        if category:
            category_text = category.text.strip()
        else:
            category_text = "Not specified"

        # Extracting issue flag text
        issue_flag = article.find("p", class_="issue-flag")
        if issue_flag:
            issue_flag_text = issue_flag.text.strip()
        else:
            issue_flag_text = "Not specified"

        releases_dict = {
            "headline": headline,
            "url": url,
            "date_str": date,
            "category": category_text,
            "issue_flag": issue_flag_text,  # Adding issue flag text to the dictionary
            "page": i,
        }
        releases.append(releases_dict)

  0%|          | 0/910 [00:00<?, ?it/s]
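
A loop over roughly 900 pages is likely to hit an occasional timeout or server error. The sketch below shows one way to retry a failed request with exponential backoff. It is not part of the original notebook: `fetch_with_retry` and its parameters are hypothetical, and the HTTP call is passed in as `get` so the helper does not depend on any particular client.

```python
import time


def fetch_with_retry(url, get, retries=3, backoff=1.0, sleep=time.sleep):
    """Call get(url) until it succeeds, waiting longer after each failure."""
    last_error = None
    for attempt in range(retries):
        try:
            response = get(url)
            response.raise_for_status()  # treat HTTP 4xx/5xx as failures
            return response
        except Exception as error:  # deliberately broad for a sketch
            last_error = error
            if attempt < retries - 1:
                # Wait backoff, then 2 * backoff, then 4 * backoff, ...
                sleep(backoff * 2**attempt)
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error
```

Inside the loop, the plain `requests.get(...)` call could then become `fetch_with_retry(url, get=lambda u: requests.get(u, timeout=30))`, where the 30-second timeout is an assumed value.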

#### Get the list into a dataframe

In [6]:
src = pd.DataFrame(releases)

#### How many releases?

In [7]:
count = len(src)
count

8478

#### Clean up

In [8]:
# Text after the first "|" in a headline is treated as the location; blank when absent
src["location"] = src["headline"].str.split("|", n=1).str[1].fillna("").str.strip()
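
The location column relies on some headlines carrying a trailing `| Location` segment. A pure-Python sketch of the same idea for a single headline is shown here. The function name `split_headline` and the example headline are hypothetical. Unlike the pandas expression, this version keeps everything after the first `|` and strips surrounding whitespace.

```python
def split_headline(headline):
    """Split 'Title | Location' into (title, location); location is '' if absent."""
    title, _, location = headline.partition("|")
    return title.strip(), location.strip()


# Hypothetical headline, for illustration only
title, location = split_headline("Remarks by the President | Washington, D.C.")
```

A headline without a `|` comes back unchanged with an empty location, which matches the blank `location` values in the sample rows.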

In [9]:
src["date"] = pd.to_datetime(src["date_str"]).astype(str)
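
The `date_str` values follow the pattern `Jan 20, 2021`. As a sketch of what the conversion does for one value, the standard library can parse that pattern with an explicit format string. The helper name `to_iso_date` is hypothetical, and the format assumes English month abbreviations.

```python
from datetime import datetime

DATE_FORMAT = "%b %d, %Y"  # abbreviated month, day, four-digit year


def to_iso_date(date_str):
    """Convert a string like 'Jan 20, 2021' to ISO format ('2021-01-20')."""
    return datetime.strptime(date_str.strip(), DATE_FORMAT).date().isoformat()


iso = to_iso_date("Jan 20, 2021")
```

The same format string can be handed to pandas as `pd.to_datetime(src["date_str"], format="%b %d, %Y")`, which makes parsing stricter: an unexpected date layout raises an error instead of being guessed at.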

In [10]:
src["president"] = president

In [11]:
df = src.copy()

#### The result

In [12]:
df.head()

| | headline | url | date_str | category | issue_flag | page | location | date | president |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Executive Order on the Revocation of Executive Order 13770 | https://trumpwhitehouse.archives.gov/presidential-actions/executive-order-revocation-executive-order-13770/ | Jan 20, 2021 | Executive Orders | Not specified | 1 | | 2021-01-20 | trump |
| 1 | Statement from the Press Secretary Regarding Executive Grants of Clemency | https://trumpwhitehouse.archives.gov/briefings-statements/statement-press-secretary-regarding-executive-grants-clemency-012021/ | Jan 20, 2021 | Statements & Releases | Law & Justice | 1 | | 2021-01-20 | trump |
| 2 | Executive Order on Care Of Veterans With Service In Uzbekistan | https://trumpwhitehouse.archives.gov/presidential-actions/executive-order-care-veterans-service-uzbekistan/ | Jan 19, 2021 | Executive Orders | Veterans | 1 | | 2021-01-19 | trump |
| 3 | Memorandum on Deferred Enforced Departure for Certain Venezuelans | https://trumpwhitehouse.archives.gov/presidential-actions/memorandum-deferred-enforced-departure-certain-venezuelans/ | Jan 19, 2021 | Presidential Memoranda | Foreign Policy | 1 | | 2021-01-19 | trump |
| 4 | Statement from National Security Advisor Robert C. O’Brien | https://trumpwhitehouse.archives.gov/briefings-statements/statement-national-security-advisor-robert-c-obrien-011921/ | Jan 19, 2021 | Statements & Releases | National Security & Defense | 1 | | 2021-01-19 | trump |


---

## Export

#### All to CSV & JSON

In [13]:
df.to_csv(
    f"../../data/processed/{president}/{president}_release_metadata.csv", index=False
)
df.to_json(
    f"../../data/processed/{president}/{president}_release_metadata.json",
    indent=4,
    orient="records",
)

In [14]:
print(
    f"The scraper collected {count} metadata records from President {president.title()}. Data exported successfully."
)

The scraper collected 8478 metadata records from President Trump. Data exported successfully.


#### Export by category

In [15]:
categories = list(df.category.unique())

In [16]:
# for c in categories:
#     df.query(f'category == "{c}"').to_csv(
#         f"../../data/processed/{president}/{president}_release_metadata_{c.lower().replace(' ', '_')}.csv",
#         index=False,
#     )
#     df.query(f'category == "{c}"').to_json(
#         f"../../data/processed/{president}/{president}_release_metadata_{c.lower().replace(' ', '_')}.json",
#         indent=4,
#         orient="records",
#     )
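
The commented-out export builds file names with `c.lower().replace(' ', '_')`, which would leave characters such as `&` in the name (for example `statements_&_releases`). A small sketch of a stricter slug, assuming file names should contain only lowercase letters, digits and underscores, is shown here. The helper name `category_slug` is hypothetical.

```python
import re


def category_slug(category):
    """Turn a category label into a lowercase, underscore-separated file name part."""
    # Collapse every run of non-alphanumeric characters into a single underscore
    slug = re.sub(r"[^a-z0-9]+", "_", category.lower())
    return slug.strip("_")


slugs = [category_slug(c) for c in ["Statements & Releases", "Presidential Memoranda"]]
```

If the per-category export is re-enabled, `category_slug(c)` could stand in for the `c.lower().replace(' ', '_')` expression in both file paths.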