# NY Times story landing pages
> This notebook fetches roughly the first 100 stories (up to nine pages of results) from each topic or author landing page, extracting the headline, summary, byline, URL and date.

---

#### Import Python tools and Jupyter config

In [1]:
import requests
import pandas as pd
import jupyter_black
from tqdm.notebook import tqdm
from bs4 import BeautifulSoup

In [2]:
jupyter_black.load()
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = None

In [3]:
today = pd.Timestamp("today").strftime("%Y-%m-%d")

---

## Fetch

#### Headers for requests

In [4]:
headers = {
    "Accept": "*/*",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36",
}
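
The headers only take effect when they are sent along with a request. One way to make that automatic, sketched here as an assumption rather than as this notebook's method, is a `requests.Session` that carries the headers for every call. The shortened `User-Agent` string is a placeholder, and the request is only prepared, not sent, so the example runs offline.

```python
import requests

# Placeholder headers for illustration; the notebook's real dict is defined above
headers = {"Accept": "*/*", "User-Agent": "Mozilla/5.0 (example)"}

# A session merges these headers into every request it prepares or sends
session = requests.Session()
session.headers.update(headers)

# Prepare (but do not send) a request to inspect what would go over the wire
prepared = session.prepare_request(
    requests.Request(
        "GET", "https://www.nytimes.com/by/kurtis-lee", params={"page": 1}
    )
)
print(prepared.url)
print(prepared.headers["User-Agent"])
```

With a session in place, `session.get(url)` would send the same headers without repeating `headers=headers` at each call site.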

#### List of topic page urls

In [5]:
topic_page_urls = [
    "https://www.nytimes.com/spotlight/donald-trump",
    "https://www.nytimes.com/spotlight/kamala-harris",
    "https://www.nytimes.com/by/maggie-haberman",
    "https://www.nytimes.com/news-event/2024-election",
    "https://www.nytimes.com/by/jill-cowan",
    "https://www.nytimes.com/by/soumya-karlamangla",
    "https://www.nytimes.com/news-event/donald-trump-investigations",
    "https://www.nytimes.com/by/kurtis-lee",
]
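
The last path segment of each URL is used as the topic label further down. A plain `split("/")[-1]` returns an empty string if a URL ends with a slash and keeps any query string, so the sketch below shows a slightly more forgiving variant. `topic_slug` is a hypothetical helper, not part of the notebook.

```python
from urllib.parse import urlsplit


def topic_slug(url):
    """Return the last non-empty path segment of a landing-page URL."""
    path = urlsplit(url).path  # drops any query string or fragment
    return path.rstrip("/").split("/")[-1]


print(topic_slug("https://www.nytimes.com/by/kurtis-lee"))
print(topic_slug("https://www.nytimes.com/spotlight/donald-trump/?page=2"))
```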

#### Loop through topic pages, extracting stories from each and storing them in a list

In [6]:
# Placeholder for storing all stories
all_stories = []

# Loop through each topic page URL
for topic_page_url in topic_page_urls:
    topic = topic_page_url.split("/")[-1]  # Extract topic from the URL
    page = 1
    max_pages = 9  # Max pages to scrape
    stories_list = []

    # Progress bar for visual feedback
    with tqdm(total=max_pages, unit="page") as pbar:
        while page <= max_pages:
            pbar.set_description(f"Processing {topic} page {page}")
            page_url = f"{topic_page_url}?page={page}"
            # Pass the headers defined above; without them the default
            # python-requests User-Agent is sent
            response = requests.get(page_url, headers=headers, timeout=30)
            soup = BeautifulSoup(response.text, "html.parser")
            stories = soup.find_all("li", class_="css-18yolpw")

            # Break the loop if no stories are found
            if not stories:
                break

            for story in stories:
                # Extract the URL, headline, summary and byline (if present)
                story_url = story.find("a")["href"]
                headline = story.find("h3").text
                paragraphs = story.find_all("p")
                summary = paragraphs[0].text if paragraphs else None
                byline = paragraphs[1].get_text() if len(paragraphs) > 1 else None

                # Add the extracted data to the stories list
                stories_dict = {
                    "topic": topic,
                    "headline": headline,
                    "summary": summary,
                    "byline": byline,
                    "url": story_url,
                }
                stories_list.append(stories_dict)

            page += 1  # Go to the next page

            # Update the progress bar
            pbar.update(1)

    # Append stories from this topic page to all_stories
    all_stories.extend(stories_list)

  0%|          | 0/9 [00:00<?, ?page/s]
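
The per-story extraction can be tried offline on a small HTML fragment before hitting the live site. The markup below is made up for illustration (only the `css-18yolpw` class name comes from the loop above, and the real page structure may differ), and `parse_story` and `text_or_none` are hypothetical helpers that return `None` for missing tags instead of raising.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking two listing items; real NYT markup may differ
sample_html = """
<ol>
  <li class="css-18yolpw">
    <a href="/2024/08/12/business/example.html">
      <h3>Example headline</h3>
      <p>Example summary.</p>
    </a>
    <p>By Example Author</p>
  </li>
  <li class="css-18yolpw">
    <a href="/2024/08/11/us/no-summary.html"><h3>Headline only</h3></a>
  </li>
</ol>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_story(story):
    """Build one story record, tolerating absent link, summary or byline."""
    link = story.find("a")
    paragraphs = story.find_all("p")
    return {
        "headline": text_or_none(story.find("h3")),
        "summary": text_or_none(paragraphs[0]) if paragraphs else None,
        "byline": text_or_none(paragraphs[1]) if len(paragraphs) > 1 else None,
        "url": link.get("href") if link is not None else None,
    }


soup = BeautifulSoup(sample_html, "html.parser")
parsed = [parse_story(li) for li in soup.find_all("li", class_="css-18yolpw")]
for record in parsed:
    print(record)
```

The second item has no `<p>` tags, so its summary and byline come back as `None` rather than stopping the loop with an exception.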


#### Large dataframe with all the stories

In [7]:
src = pd.DataFrame(all_stories).drop_duplicates()
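
Called without arguments, `drop_duplicates()` compares every column, including `topic`, so a story that appears on two landing pages is kept once per topic. The toy frame below (invented rows, not scraped data) shows the difference from deduplicating on `url` alone; which behavior is wanted depends on the analysis.

```python
import pandas as pd

# Invented rows: the same story listed twice under one topic and once under another
toy = pd.DataFrame(
    [
        {"topic": "donald-trump", "headline": "A", "url": "/2024/08/01/us/a.html"},
        {"topic": "donald-trump", "headline": "A", "url": "/2024/08/01/us/a.html"},
        {"topic": "2024-election", "headline": "A", "url": "/2024/08/01/us/a.html"},
    ]
)

per_topic = toy.drop_duplicates()  # exact-row duplicates only
unique_urls = toy.drop_duplicates(subset="url")  # one row per story URL

print(len(per_topic), len(unique_urls))
```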

#### Function to extract date if the URL has the expected structure

In [8]:
def extract_date(url):
    """Return "YYYY-MM-DD" from a site-relative URL such as /2024/08/12/..., else None."""
    parts = url.split("/")
    if (
        len(parts) > 3
        and len(parts[1]) == 4
        and len(parts[2]) == 2
        and len(parts[3]) == 2
    ):
        return f"{parts[1]}-{parts[2]}-{parts[3]}"
    return None
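
A quick check on a few sample paths shows what the positional test accepts. The block repeats the same logic so it runs on its own, then adds `extract_date_regex`, a hypothetical regex-based variant that also handles absolute URLs and paths with a leading section. The second and third sample URLs are invented for illustration.

```python
import re


def extract_date(url):
    # Same positional logic as the cell above
    parts = url.split("/")
    if (
        len(parts) > 3
        and len(parts[1]) == 4
        and len(parts[2]) == 2
        and len(parts[3]) == 2
    ):
        return f"{parts[1]}-{parts[2]}-{parts[3]}"
    return None


# Hypothetical alternative: find /YYYY/MM/DD/ anywhere in the URL
DATE_IN_PATH = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")


def extract_date_regex(url):
    match = DATE_IN_PATH.search(url)
    return "-".join(match.groups()) if match else None


samples = [
    "/2024/08/12/business/economy/olympics-los-angeles-2028-economy.html",
    "/interactive/2024/08/12/us/example.html",
    "https://www.nytimes.com/2024/08/12/us/example.html",
]
for url in samples:
    print(extract_date(url), extract_date_regex(url))
```

The positional version returns a date only for the first sample, because it expects the year to be the first path segment of a relative URL. The regex version finds the date in all three.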

#### Apply the function to the 'url' column to create a new 'date' column

In [9]:
src["date"] = src["url"].apply(extract_date)

In [15]:
df = src.sort_values(["topic", "date"], ascending=[True, False]).reset_index(drop=True)
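
The `date` column holds ISO-formatted strings (or `None`), which sort correctly as text. Converting it to a datetime type is optional, but it makes date filtering and resampling easier. A minimal sketch on invented rows, assuming the same `YYYY-MM-DD` format:

```python
import pandas as pd

# Invented rows; the last one stands in for a URL with no recognizable date
toy = pd.DataFrame(
    {
        "topic": ["kurtis-lee", "kurtis-lee", "kurtis-lee"],
        "date": ["2024-08-12", "2024-03-11", None],
    }
)

# Missing values become NaT instead of raising
toy["date"] = pd.to_datetime(toy["date"], format="%Y-%m-%d")

# Comparisons against NaT are False, so undated rows drop out of the filter
recent = toy[toy["date"] >= "2024-06-01"]
print(recent)
```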

#### How many stories?

In [16]:
len(df)

687
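
The total alone does not show whether any landing page returned fewer stories than expected. A per-topic breakdown with `value_counts()` makes that visible; the sketch below uses invented rows in place of `df`.

```python
import pandas as pd

# Invented rows standing in for the scraped dataframe
toy = pd.DataFrame(
    {
        "topic": ["kurtis-lee", "kurtis-lee", "jill-cowan"],
        "url": ["/2024/08/12/a.html", "/2024/06/29/b.html", "/2024/07/01/c.html"],
    }
)

# Number of stories per topic, largest first
counts = toy["topic"].value_counts()
print(counts)
```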

In [17]:
df.query('topic=="kurtis-lee"').head()

| | topic | headline | summary | byline | url | date |
|---|---|---|---|---|---|---|
| 453 | kurtis-lee | How Los Angeles Aims to Make a Profit on the 2028 Olympics | The Summer Games will be the third for Los Angeles as host, but it will be a challenge to repeat the financial success of 1984. | By Kurtis Lee | /2024/08/12/business/economy/olympics-los-angeles-2028-economy.html | 2024-08-12 |
| 454 | kurtis-lee | Along the Hollywood Walk of Fame, a Struggle to Make a Living | Los Angeles lifted restrictions that had forced street vendors, mostly immigrants, on Hollywood Boulevard to dodge citations. Other challenges remain. | By Kurtis Lee, Ana Facio-Krajcer and Adam Perez | /2024/06/29/business/economy/hollywood-street-vendors.html | 2024-06-29 |
| 455 | kurtis-lee | California Moves to Modify Law Letting Workers Sue Employers | Gov. Gavin Newsom announced a deal with business and labor leaders heading off a ballot measure to repeal the law, which has cost companies billions. | By Kurtis Lee | /2024/06/18/business/economy/california-newsom-labor.html | 2024-06-18 |
| 456 | kurtis-lee | ‘Winners and Losers’ as $20 Fast-Food Wage Nears in California | The nation’s highest state minimum wage for fast-food workers takes effect on Monday. Owners and employees are sizing up the potential impact. | By Kurtis Lee | /2024/03/28/business/economy/fast-food-minimum-wage-california.html | 2024-03-28 |
| 457 | kurtis-lee | California’s Economy Has Been Pinched by Unemployment | The Golden State’s jobless rate remains stubbornly high. | By Kurtis Lee | /2024/03/11/us/california-economy-unemployment.html | 2024-03-11 |
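
The `url` values in this output are site-relative paths. If full links are needed downstream, they can be resolved against the site root with `urljoin`. In this sketch `BASE_URL` is an assumption based on the landing-page URLs listed earlier.

```python
from urllib.parse import urljoin

# Assumed site root for the relative paths in the url column
BASE_URL = "https://www.nytimes.com"

relative = "/2024/08/12/business/economy/olympics-los-angeles-2028-economy.html"
absolute = urljoin(BASE_URL, relative)
print(absolute)

# URLs that are already absolute pass through unchanged
print(urljoin(BASE_URL, "https://example.com/story.html"))
```

On the dataframe this could be applied column-wide, for example with `df["url"].apply(lambda path: urljoin(BASE_URL, path))`.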
