# Scrape NYT

### Sample HTML tag

```html
<section class="story-wrapper"><a class="css-9mylee" href="https://www.nytimes.com/2024/12/01/us/politics/biden-hunter-pardon-politics.html" data-uri="nyt://article/dffb88f6-058f-5e6f-8a61-6b4c08e420e4" aria-hidden="false"><div><div class="css-xdandi"><div class="css-1a3ibh4"><p class="css-tdd4a3"><span class="css-wt2ynm">Analysis</span></p></div><p class="indicate-hover css-91bpc3">In Pardoning His Son, Biden Echoes Some of Trump’s Complaints</p></div><p class="summary-class css-1l5zmz6">President Biden and President-elect Trump now agree on one thing: The Biden Justice Department has been politicized.</p><div class="css-1tic89u"><div><p class="css-1a0ymrn" data-ttr="1">7 min read</p></div></div></div></a></section>
```

We need to extract two things from this element: the headline and the summary.

### Code
(You may have to install BeautifulSoup first: `pip install beautifulsoup4`.)

In [8]:
from bs4 import BeautifulSoup

In [10]:
html_element = """<section class="story-wrapper"><a class="css-9mylee" href="https://www.nytimes.com/2024/12/01/us/politics/biden-hunter-pardon-politics.html" data-uri="nyt://article/dffb88f6-058f-5e6f-8a61-6b4c08e420e4" aria-hidden="false"><div><div class="css-xdandi"><div class="css-1a3ibh4"><p class="css-tdd4a3"><span class="css-wt2ynm">Analysis</span></p></div><p class="indicate-hover css-91bpc3">In Pardoning His Son, Biden Echoes Some of Trump’s Complaints</p></div><p class="summary-class css-1l5zmz6">President Biden and President-elect Trump now agree on one thing: The Biden Justice Department has been politicized.</p><div class="css-1tic89u"><div><p class="css-1a0ymrn" data-ttr="1">7 min read</p></div></div></div></a></section>"""

In [12]:
soup = BeautifulSoup(html_element, 'html.parser')

In [48]:
headline1 = soup.find('section', class_='story-wrapper')
headline1.find_all('p')[1], headline1.find_all('p')[2]

(<p class="indicate-hover css-91bpc3">In Pardoning His Son, Biden Echoes Some of Trump’s Complaints</p>,
 <p class="summary-class css-1l5zmz6">President Biden and President-elect Trump now agree on one thing: The Biden Justice Department has been politicized.</p>)

In [64]:
title_and_summary_tag = headline1.find_all('p')
title = title_and_summary_tag[1].text
summary = title_and_summary_tag[2].text

title_and_summary = title + ". " + summary
title_and_summary

'In Pardoning His Son, Biden Echoes Some of Trump’s Complaints. President Biden and President-elect Trump now agree on one thing: The Biden Justice Department has been politicized.'

In [120]:
def get_text(html_element):
    # Collect every <p> tag inside the story wrapper, in document order
    title_and_summary_tag = html_element.find_all('p')

    if len(title_and_summary_tag) == 0:
        return None

    # This function is not very robust :( It assumes the first <p> is the
    # headline and the second is the summary, which fails when a label or a
    # read-time <p> takes one of those positions.
    if len(title_and_summary_tag) < 2:
        return title_and_summary_tag[0].text

    title   = title_and_summary_tag[0].text
    summary = title_and_summary_tag[1].text

    return title + ". " + summary

In [99]:
get_text(headline1)

'Analysis. In Pardoning His Son, Biden Echoes Some of Trump’s Complaints'
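
The output above picks up the "Analysis" label and drops the summary, because `get_text` selects `<p>` tags by position. One possible alternative is to select by class name instead. The sketch below assumes that the `indicate-hover` and `summary-class` classes seen in the sample HTML mark the headline and summary on every story, which this notebook has not verified and which may change whenever the site is redesigned. The name `get_text_by_class` is hypothetical.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample element: label, headline, summary, read time.
SAMPLE = (
    '<section class="story-wrapper"><a class="css-9mylee">'
    '<p class="css-tdd4a3"><span class="css-wt2ynm">Analysis</span></p>'
    '<p class="indicate-hover css-91bpc3">In Pardoning His Son, Biden Echoes '
    'Some of Trump’s Complaints</p>'
    '<p class="summary-class css-1l5zmz6">President Biden and President-elect '
    'Trump now agree on one thing: The Biden Justice Department has been '
    'politicized.</p>'
    '<p class="css-1a0ymrn" data-ttr="1">7 min read</p>'
    '</a></section>'
)


def get_text_by_class(story):
    """Return 'headline. summary', the headline alone, or None."""
    # Assumption: the headline <p> carries the 'indicate-hover' class
    headline_tag = story.find('p', class_='indicate-hover')
    if headline_tag is None:
        return None
    title = headline_tag.get_text(strip=True)

    # Assumption: the summary <p> carries the 'summary-class' class
    summary_tag = story.find('p', class_='summary-class')
    if summary_tag is None:
        return title
    return title + ". " + summary_tag.get_text(strip=True)


story = BeautifulSoup(SAMPLE, 'html.parser').find('section', class_='story-wrapper')
print(get_text_by_class(story))
```

Selecting by class skips the label and the read-time paragraph regardless of where they appear, at the cost of depending on class names the site controls.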

### Find ALL headlines

First, we download the front page.

In [68]:
import requests

In [69]:
%%time
response = requests.get('https://www.nytimes.com/')

CPU times: total: 0 ns
Wall time: 295 ms
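
The request above has no timeout and does not check the HTTP status, so a stalled connection could hang the cell and an error page could be parsed as if it were the front page. A minimal sketch of a more defensive download follows. The function name `fetch_html`, the 10-second default, and the optional `session` parameter are illustrative choices, not part of the original notebook.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the page body as text, raising on HTTP errors or timeouts."""
    if session is None:
        session = requests.Session()
    # timeout is in seconds; without it, requests may wait indefinitely
    response = session.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

It would be called as `fetch_html('https://www.nytimes.com/')`, and the returned string can be passed to `BeautifulSoup` in place of `response.text`.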


In [72]:
response

<Response [200]>

In [74]:
print(response.text[:500])

<!DOCTYPE html>
<html lang="en" class=" nytapp-vi-homepage "  xmlns:og="http://opengraphprotocol.org/schema/">
  <head>
    
    
    
    <meta charset="utf-8" />
    <title data-rh="true">The New York Times - Breaking News, US News, World News and Videos</title>
    <meta data-rh="true" name="description" content="Live news, investigations, opinion, photos and video by the journalists of The New York Times from more than 150 countries around the world. Subscribe for coverage of U.S. and intern


In [76]:
html = BeautifulSoup(response.text, 'html.parser')

In [78]:
html.find_all(class_="story-wrapper")[:5]

[<section class="story-wrapper"><a aria-hidden="false" class="css-9mylee" data-uri="nyt://article/bef4947b-a6fe-5d15-8030-c60d25bdc916" href="https://www.nytimes.com/2024/12/01/us/politics/biden-pardon-son-hunter.html"><div><div class="css-xdandi"><p class="indicate-hover css-1gg6cw2">Biden Issues a ‘Full and Unconditional Pardon’ of His Son Hunter</p></div><p class="summary-class css-ofqxyv">After pledging not do so amid President-elect Trump’s attacks, President Biden ended Hunter Biden’s legal woes, including a guilty verdict in a gun case.</p><div class="css-1tic89u"><div><p class="css-1a0ymrn" data-ttr="1">6 min read</p></div></div></div></a></section>,
 <section class="story-wrapper"><a aria-hidden="false" class="css-9mylee" data-uri="nyt://article/dffb88f6-058f-5e6f-8a61-6b4c08e420e4" href="https://www.nytimes.com/2024/12/01/us/politics/biden-hunter-pardon-politics.html"><div><div class="css-xdandi"><div class="css-1a3ibh4"><p class="css-tdd4a3"><span class="css-wt2ynm">Analysis

### Extract headlines

In [80]:
html.find_all(class_="story-wrapper")[0]

<section class="story-wrapper"><a aria-hidden="false" class="css-9mylee" data-uri="nyt://article/bef4947b-a6fe-5d15-8030-c60d25bdc916" href="https://www.nytimes.com/2024/12/01/us/politics/biden-pardon-son-hunter.html"><div><div class="css-xdandi"><p class="indicate-hover css-1gg6cw2">Biden Issues a ‘Full and Unconditional Pardon’ of His Son Hunter</p></div><p class="summary-class css-ofqxyv">After pledging not do so amid President-elect Trump’s attacks, President Biden ended Hunter Biden’s legal woes, including a guilty verdict in a gun case.</p><div class="css-1tic89u"><div><p class="css-1a0ymrn" data-ttr="1">6 min read</p></div></div></div></a></section>

In [84]:
html.find_all(class_="story-wrapper")[0].find_all('p')

[<p class="indicate-hover css-1gg6cw2">Biden Issues a ‘Full and Unconditional Pardon’ of His Son Hunter</p>,
 <p class="summary-class css-ofqxyv">After pledging not do so amid President-elect Trump’s attacks, President Biden ended Hunter Biden’s legal woes, including a guilty verdict in a gun case.</p>,
 <p class="css-1a0ymrn" data-ttr="1">6 min read</p>]

In [104]:
for e in html.find_all(class_="story-wrapper")[:15]:
    #print(e)
    print(get_text(e))

Biden Issues a ‘Full and Unconditional Pardon’ of His Son Hunter. After pledging not do so amid President-elect Trump’s attacks, President Biden ended Hunter Biden’s legal woes, including a guilty verdict in a gun case.
Analysis. In Pardoning His Son, Biden Echoes Some of Trump’s Complaints
Hunter Biden Faced Prison Time for Tax and Gun Charges. 1 min read
The HeadlinesAudio. Biden Pardons His Son in U-Turn, Syrian Rebels Advance, and More
Analysis. Trump Remains Defiant After the Collapse of the Matt Gaetz Selection
Schumer Presses for F.B.I. Checks and Senate Consideration of Trump Nominees. In a letter, Senator Chuck Schumer said Democrats would work with Republicans, but asserted that Donald Trump’s picks should undergo Senate vetting.
Kash Patel Would Bring Bravado and Baggage to F.B.I. Role. 6 min read
Kash Patel Would Bring Bravado and Baggage to F.B.I. Role. 6 min read
Distrustful of Health Agencies, These Voters Cheer Trump’s Picks to Run Them. 6 min read
Distrustful of Health

In [122]:
headlines = [get_text(headline) for headline in html.find_all(class_="story-wrapper")]

In [124]:
headlines[:5]

['Biden Issues a ‘Full and Unconditional Pardon’ of His Son Hunter. After pledging not do so amid President-elect Trump’s attacks, President Biden ended Hunter Biden’s legal woes, including a guilty verdict in a gun case.',
 'Analysis. In Pardoning His Son, Biden Echoes Some of Trump’s Complaints',
 'Hunter Biden Faced Prison Time for Tax and Gun Charges. 1 min read',
 'The HeadlinesAudio. Biden Pardons His Son in U-Turn, Syrian Rebels Advance, and More',
 'Analysis. Trump Remains Defiant After the Collapse of the Matt Gaetz Selection']

In [126]:
len(headlines)

99
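
The count of 99 covers every `story-wrapper` element, so it may include `None` entries and repeated stories (the printed sample earlier shows the Kash Patel line twice). A small sketch of one way to clean the list while keeping the original order is shown below. The helper name `clean_headlines` is hypothetical, and it treats two entries as duplicates only when their text is identical.

```python
def clean_headlines(headlines):
    """Drop None entries and exact duplicates, keeping first-seen order."""
    # dict preserves insertion order, so fromkeys removes later repeats only
    return list(dict.fromkeys(h for h in headlines if h is not None))


example = [
    "Kash Patel Would Bring Bravado and Baggage to F.B.I. Role. 6 min read",
    None,
    "Kash Patel Would Bring Bravado and Baggage to F.B.I. Role. 6 min read",
    "Hunter Biden Faced Prison Time for Tax and Gun Charges. 1 min read",
]
print(len(clean_headlines(example)))
```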

### Write headlines to file

#### Create the filename

In [130]:
import datetime

In [132]:
datetime.datetime.today()

datetime.datetime(2024, 12, 2, 6, 8, 55, 758276)

In [134]:
datetime.datetime.today().strftime('%Y-%m-%d')

'2024-12-02'

In [136]:
TODAY = datetime.datetime.today().strftime('%Y-%m-%d')

In [137]:
TODAY

'2024-12-02'

In [140]:
filename = f"headlines_nyt_{TODAY}.txt"
filename

'headlines_nyt_2024-12-02.txt'

In [144]:
with open(filename, 'w', encoding='utf-8') as output_file:
    for headline in headlines:
        if headline is None: continue
        output_file.write(headline + '\n')
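
The filename and writing steps can also be wrapped into one reusable function, sketched below. The name `save_headlines` and its `directory` and `day` parameters are assumptions added for illustration. `datetime.date.isoformat()` produces the same `YYYY-MM-DD` string as the `strftime('%Y-%m-%d')` call used above.

```python
import datetime
from pathlib import Path


def save_headlines(headlines, directory=".", day=None):
    """Write one headline per line to headlines_nyt_<date>.txt; return the path."""
    if day is None:
        day = datetime.date.today()
    path = Path(directory) / f"headlines_nyt_{day.isoformat()}.txt"

    # Skip entries for which no text could be extracted
    lines = [h for h in headlines if h is not None]
    with open(path, 'w', encoding='utf-8') as output_file:
        for line in lines:
            output_file.write(line + '\n')
    return path
```

Passing `day` explicitly makes it possible to regenerate a file for a specific date, and returning the path lets the caller confirm where the file was written.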