# Scraping: https://www.nytimes.com/

Let's try to scrape the front page of the NYT. We're looking for:

* Headlines
* Bylines
* Article links

## Getting started

We'll start by **importing the necessary libraries**.

In [1]:
from bs4 import BeautifulSoup
import requests

And then move into **downloading the page** and **importing it into BeautifulSoup**.

In [3]:
response = requests.get("https://www.nytimes.com")
doc = BeautifulSoup(response.text, "html.parser")

A lot of people call the analyzed page variable `soup` but for once in my life I actually go against the popular thing - I like to call it `doc`, since it helps me remember that it's the *entire document*.

## ATTEMPT ONE: Grabbing the tags directly

Let's jump right into trying to grab the link.

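Oh, look, it's an... `a` tag. No special class or anything. What if we try to get all of the `a` tags on the page? That would match every link on the page, so the cells below start from the headlines instead: `h2` tags with the `story-heading` class.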

In [11]:
headline_tags = doc.find_all('h2', attrs = {'class': 'story-heading'})

In [12]:
len(headline_tags)

# alternative way (matches any tag with this class, not only h2):
# headline_tags = doc.find_all(class_='story-heading')

119
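
The two spellings of the class filter can be compared on a tiny page. This is a minimal sketch: `sample_html` is made-up markup, not the real NYT front page, and the CSS selector version via `.select` is an extra that the cells above don't use.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only (not the real NYT page)
sample_html = """
<h2 class="story-heading"><a href="/a.html">First story</a></h2>
<h3 class="story-heading"><a href="/b.html">Second story</a></h3>
<h2 class="kicker">Not a headline</h2>
"""
sample_doc = BeautifulSoup(sample_html, "html.parser")

# Same filter written three ways: h2 tags with the story-heading class
by_attrs = sample_doc.find_all("h2", attrs={"class": "story-heading"})
by_keyword = sample_doc.find_all("h2", class_="story-heading")
by_css = sample_doc.select("h2.story-heading")

# Leaving out the tag name matches ANY tag with that class (the h3 too)
any_tag = sample_doc.find_all(class_="story-heading")

print(len(by_attrs), len(by_keyword), len(by_css), len(any_tag))
```

If the counts from the tag-restricted and class-only versions differ on the real page, some other tag is sharing the class.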

In [13]:
for tag in headline_tags:
    print(tag.text)

ISIS Proves an Elusive Target for America’s Cyberweapons
Opioid Addicts Find an Ally in Blue
Addiction Drug Lacks Results, but It Has Powerful Friends 
Trump Era, Unlike Watergate Era, Has Rival Sets of Facts
Democrats Call for Sessions’s Testimony to Be Public 
Fired U.S. Attorney Says Trump Tried to Build Relationship With Him 
Role of Trump’s Lawyer Blurs Public and Private Lines
Obamacare Repeal Limits Flexibility for Those in Transition
Uber Board Discusses a Leave for Embattled C.E.O.
Jeffrey Immelt to Retire as General Electric Chief 9:28 AM ET
Bill Cosby Sex Assault Trial: The Defense’s Turn 5:00 AM ET
Macron’s Party on Track to Claim Majority in Parliament 
Should Puerto Rico Be 51st State? Residents Go to Polls. 

                                    Before the Cloud, a Mine of Data                            
Your Monday Briefing
California Today: Talking to a Tony Winner


      Listen to ‘The Daily’
    

How to Save on Summer Travel
Ways Your iPhone Will Change After App

In [16]:
# Looping through my h2 tags or whatever

for tag in headline_tags[:3]:
    # Now that I am looking at one of those h2 tags
    # find me ONE link inside of it (the first one)
    link = tag.find('a')
    print(tag.text)
    print(link['href'])
    print("---------")

ISIS Proves an Elusive Target for America’s Cyberweapons
https://www.nytimes.com/2017/06/12/world/middleeast/isis-cyber.html
---------
Opioid Addicts Find an Ally in Blue
https://www.nytimes.com/2017/06/12/nyregion/when-opioid-addicts-find-an-ally-in-blue.html
---------
Addiction Drug Lacks Results, but It Has Powerful Friends 
https://www.nytimes.com/2017/06/11/health/vivitrol-drug-opioid-addiction.html
---------
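
The loop above only looks at the first three headlines, and those all contain a link. Some headlines might not, in which case `tag.find('a')` returns `None` and `link['href']` raises an error. Here's a hedged sketch of a safer version, using made-up markup (`sample_html` and the example.com URL are placeholders):

```python
from bs4 import BeautifulSoup

# Made-up markup: one headline with a link, one without
sample_html = """
<h2 class="story-heading"><a href="https://example.com/one.html">Linked headline</a></h2>
<h2 class="story-heading">Headline with no link</h2>
"""
sample_doc = BeautifulSoup(sample_html, "html.parser")

results = []
for tag in sample_doc.find_all("h2", class_="story-heading"):
    link = tag.find("a")
    # .find returns None when nothing matches, so check before reading href;
    # .get returns None instead of raising if the attribute is missing
    url = link.get("href") if link is not None else None
    results.append((tag.text.strip(), url))

for headline, url in results:
    print(headline, "->", url)
```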


In [17]:
# find all p tags with the class of summary
summary_tags = doc.find_all('p', attrs={'class': 'summary'})

In [18]:
len(summary_tags)

33

Okay, that's terrible. Do you know how many `a` tags are going to be on that page? Many many many. Many very useless ones.

## Talking to parents

When you can't uniquely identify something, sometimes you need to go up the tree to find its **parent**, an element that sits above it and contains it. We'll be looking for an element that covers the **entire story**, then we'll pick the link out of it.
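
Going up the tree can also be done from code. A minimal sketch, assuming a story is wrapped in an `article` tag the way the output further down suggests; the markup here is made up, and `find_parent` is a BeautifulSoup method the rest of this notebook doesn't otherwise use:

```python
from bs4 import BeautifulSoup

# Made-up markup shaped like a single story
sample_html = """
<article class="story">
  <h2 class="story-heading"><a href="https://example.com/one.html">A headline</a></h2>
  <p class="byline">By A REPORTER</p>
</article>
"""
sample_doc = BeautifulSoup(sample_html, "html.parser")

# Start at the headline, then walk UP to the nearest enclosing article
headline = sample_doc.find("h2", class_="story-heading")
story = headline.find_parent("article")

# From the parent we can walk back DOWN to anything else in the story
byline = story.find(class_="byline")
link = story.find("a")

print(story.name, "|", byline.text, "|", link["href"])
```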

In [19]:
# find all elements with the class of story
story_tags = doc.find_all(class_='story')
len(story_tags)

154

In [21]:
story_tags[10]

<article class="story" data-collection-renderstyle="HpHeadline" data-rank="0" data-story-id="100000005159361" id="topnews-100000005159361">
<h2 class="story-heading"><i class="icon"></i><a href="https://www.nytimes.com/2017/06/12/business/ge-immelt.html">Jeffrey Immelt to Retire as General Electric Chief</a> <time class="timestamp" data-eastern-timestamp="9:28 AM" data-utc-timestamp="1497274133" datetime="2017-06-12">9:28 AM ET</time></h2>
</article>

Great, it looks like this:
    
    <article class="story theme-summary lede" id="topnews-100000004994965" data-story-id="100000004994965" data-rank="0" data-collection-renderstyle="LedeSum">

I'm going to go out on a limb and say we should look for an `article` tag, but what about the class? `story theme-summary lede` gives us three options:

* `story`
* `theme-summary`
* `lede`

`story` sounds promising, yeah?
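
Why picking just one of the three class names works: BeautifulSoup treats `class` as a list of separate names, and a class search matches if any one of them fits. A small sketch on made-up markup (the `id` values are placeholders):

```python
from bs4 import BeautifulSoup

# Made-up markup: two stories with different class lists, plus a non-story
sample_html = """
<article class="story theme-summary lede" id="first">Lead story</article>
<article class="story" id="second">Plain story</article>
<article class="promo" id="third">Not a story</article>
"""
sample_doc = BeautifulSoup(sample_html, "html.parser")

# The class attribute comes back as a list of names, not one string
lede = sample_doc.find(id="first")
print(lede["class"])

# Searching for a single class matches every tag that has it in its list
matches = sample_doc.find_all("article", class_="story")
matched_ids = [tag["id"] for tag in matches]
print(matched_ids)
```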

In [35]:
# make a list of dictionaries
stories = []

for story in story_tags:
    # Empty dictionary
    current = {}
    headline = story.find(class_= 'story-heading')
    if headline:
        print(headline.text)
        current['headline'] = headline.text.strip()
    link = story.find('a')
    if link:
        print(link['href'])
        current['url'] = link['href']
    summary = story.find(class_='summary')
    if summary:
        print(summary.text)
        current['summary'] = summary.text.strip()
    byline = story.find(class_='byline')
    if byline:
        print(byline.text)
        current['byline'] = byline.text.strip()
    stories.append(current)
    print(current)
    print("----------")

ISIS Proves an Elusive Target for America’s Cyberweapons
https://www.nytimes.com/2017/06/12/world/middleeast/isis-cyber.html
The effectiveness of cyberweapons hit its limits against an enemy that exploits the internet to recruit, spread propaganda and use encrypted communications, all of which can be quickly reconstituted.
This is prompting officials to rethink how cyberwarfare techniques, first designed for fixed targets like nuclear facilities, must be refashioned.
By DAVID E. SANGER and ERIC SCHMITT 5:00 AM ET
{'headline': 'ISIS Proves an Elusive Target for America’s Cyberweapons', 'url': 'https://www.nytimes.com/2017/06/12/world/middleeast/isis-cyber.html', 'summary': 'The effectiveness of cyberweapons hit its limits against an enemy that exploits the internet to recruit, spread propaganda and use encrypted communications, all of which can be quickly reconstituted.\nThis is prompting officials to rethink how cyberwarfare techniques, first designed for fixed targets like nuclear fac

Seems to work well enough! Now that we have a parent, **we can use that parent to grab the elements inside of the story.** We'll use `.find` and `.find_all` to get everything we need.

* STEP ONE: Get the story
* STEP TWO: Get the headline
* STEP THREE: Get the byline
* STEP FOUR: Get the link
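
The four steps can also be sketched as a pair of small functions. This is one possible arrangement, not the only one: `text_of` and `parse_story` are hypothetical helper names, the markup is made up, and missing pieces come back as `None` instead of being left out of the dictionary.

```python
from bs4 import BeautifulSoup

def text_of(parent, class_name):
    """Stripped text of the first element inside parent with this class, or None."""
    found = parent.find(class_=class_name)
    return found.text.strip() if found is not None else None

def parse_story(story):
    """STEPS TWO to FOUR: pull headline, byline and link out of one story."""
    link = story.find("a")
    return {
        "headline": text_of(story, "story-heading"),
        "byline": text_of(story, "byline"),
        "url": link.get("href") if link is not None else None,
    }

# Made-up markup: the second story has no byline
sample_html = """
<article class="story">
  <h2 class="story-heading"><a href="https://example.com/one.html">A headline</a></h2>
  <p class="byline">By A REPORTER</p>
</article>
<article class="story">
  <h2 class="story-heading"><a href="https://example.com/two.html">No byline here</a></h2>
</article>
"""
sample_doc = BeautifulSoup(sample_html, "html.parser")

# STEP ONE: get the stories, then parse each one
parsed = [parse_story(s) for s in sample_doc.find_all("article", class_="story")]
for row in parsed:
    print(row)
```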

In [34]:
print(stories)

[{'headline': 'ISIS Proves an Elusive Target for America’s Cyberweapons', 'url': 'https://www.nytimes.com/2017/06/12/world/middleeast/isis-cyber.html', 'summary': 'The effectiveness of cyberweapons hit its limits against an enemy that exploits the internet to recruit, spread propaganda and use encrypted communications, all of which can be quickly reconstituted.\nThis is prompting officials to rethink how cyberwarfare techniques, first designed for fixed targets like nuclear facilities, must be refashioned.', 'byline': 'By DAVID E. SANGER and ERIC SCHMITT 5:00 AM ET'}, {'headline': 'Opioid Addicts Find an Ally in Blue', 'url': 'https://www.nytimes.com/2017/06/12/nyregion/when-opioid-addicts-find-an-ally-in-blue.html', 'summary': 'Police leaders are assigning themselves a big role in reversing a complex crisis, and not through mass arrests.', 'byline': 'By AL BAKER 5:00 AM ET'}, {'headline': 'Addiction Drug Lacks Results, but It Has Powerful Friends ', 'url': 'https://www.nytimes.com/201

In [38]:
import pandas as pd

In [39]:
df = pd.DataFrame(stories)
df.head()

| | byline | headline | summary | url |
|---|---|---|---|---|
| 0 | By DAVID E. SANGER and ERIC SCHMITT 5:00 AM ET | ISIS Proves an Elusive Target for America’s Cy... | The effectiveness of cyberweapons hit its limi... | https://www.nytimes.com/2017/06/12/world/middl... |
| 1 | By AL BAKER 5:00 AM ET | Opioid Addicts Find an Ally in Blue | Police leaders are assigning themselves a big ... | https://www.nytimes.com/2017/06/12/nyregion/wh... |
| 2 | | Addiction Drug Lacks Results, but It Has Power... | | https://www.nytimes.com/2017/06/11/health/vivi... |
| 3 | By JIM RUTENBERG | Trump Era, Unlike Watergate Era, Has Rival Set... | Different versions of the Trump-Russia scandal... | https://www.nytimes.com/2017/06/11/business/me... |
| 4 | | Democrats Call for Sessions’s Testimony to Be ... | | https://www.nytimes.com/2017/06/11/us/politics... |
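
Some rows have no byline or summary because those keys were never added to that story's dictionary. pandas fills the gaps with missing values. A small sketch of what that looks like and two ways one might handle it, using made-up rows (`sample_stories` is a placeholder, not the scraped data):

```python
import pandas as pd

# Made-up rows: the second dictionary has no 'byline' key at all
sample_stories = [
    {"headline": "First", "url": "https://example.com/one.html", "byline": "By A REPORTER"},
    {"headline": "Second", "url": "https://example.com/two.html"},
]
sample_df = pd.DataFrame(sample_stories)

# The missing key shows up as a missing value in that row
print(sample_df["byline"].isna().tolist())

# Option 1: replace missing values with empty strings
cleaned = sample_df.fillna("")

# Option 2: keep only the rows that actually have a byline
complete = sample_df.dropna(subset=["byline"])
print(len(complete))
```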


In [40]:
df.to_csv('stories.csv', index = False)

# index=False: otherwise pandas writes the row index (0, 1, 2, ...) as an extra first column in the CSV
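
To see what `index=False` changes, here's a small sketch that writes a made-up one-row DataFrame to a string instead of a file (`sample_df` and the example.com URL are placeholders, not part of the scrape):

```python
import io
import pandas as pd

sample_df = pd.DataFrame([{"headline": "First", "url": "https://example.com/one.html"}])

# With no path, to_csv returns the CSV text instead of writing a file
with_index = sample_df.to_csv()
without_index = sample_df.to_csv(index=False)

header_with = with_index.splitlines()[0]
header_without = without_index.splitlines()[0]
print(repr(header_with))     # leading comma: an unnamed index column comes first
print(repr(header_without))  # just the real columns

# Reading the indexed version back turns that column into 'Unnamed: 0'
reloaded = pd.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))
```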

If we examine the page, it looks like headlines might be `h2` tags that have a `story-heading` class.