# Scraping: https://www.nytimes.com/

Let's try to scrape the front page of the NYT. We're looking for

* Headlines
* Bylines
* Article links

## Getting started

We'll start by **importing the necessary libraries**.

In [28]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

And then move into **downloading the page** and **importing it into BeautifulSoup**.

In [5]:
response = requests.get("https://www.nytimes.com/")

In [6]:
doc = BeautifulSoup(response.text, "html.parser")

A lot of people call the analyzed page variable `soup` but for once in my life I actually go against the popular thing - I like to call it `doc`, since it helps me remember that it's the *entire document*.

## ATTEMPT ONE: Grabbing the tags directly

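Let's jump right into trying to grab the headlines and their links.

Oh, look, the link is an.... `a` tag. No special class or anything. It does sit inside of a tag with the class `story-heading`, though. What if we grab every `story-heading` on the page and pull the `a` tag out of each one?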

In [8]:
headline_tags = doc.find_all(class_="story-heading")

In [14]:
for tag in headline_tags:
    # Headline text, minus the surrounding whitespace
    title = tag.text.strip()
    # Assumes every headline has an a tag inside of it
    link = tag.find("a")["href"]

    print(title)
    print(link)
    print("--------")

ISIS Proves an Elusive Target for America’s Cyberweapons
https://www.nytimes.com/2017/06/12/world/middleeast/isis-cyber.html
Opioid Addicts Find an Ally in Blue
https://www.nytimes.com/2017/06/12/nyregion/when-opioid-addicts-find-an-ally-in-blue.html
Addiction Drug Lacks Results, but It Has Powerful Friends
https://www.nytimes.com/2017/06/11/health/vivitrol-drug-opioid-addiction.html
Trump Era, Unlike Watergate Era, Has Rival Sets of Facts
https://www.nytimes.com/2017/06/11/business/media/comey-trump-watergate.html
Democrats Call for Sessions’s Testimony to Be Public
https://www.nytimes.com/2017/06/11/us/politics/jeff-sessions-russia-trump-attorney-general-senate.html
Fired U.S. Attorney Says Trump Tried to Build Relationship With Him
https://www.nytimes.com/2017/06/11/us/politics/preet-bharara-trump-contacts.html
Role of Trump’s Lawyer Blurs Public and Private Lines
https://www.nytimes.com/2017/06/11/us/politics/trump-lawyer-marc-kasowitz.html
Obamacare Repeal Limits Flexibility for T

TypeError: 'NoneType' object is not subscriptable

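Okay, that's terrible. The loop crashed partway through: for one of the headlines `tag.find("a")` came back as `None`, so asking it for `["href"]` raised the `TypeError`. Grabbing tags directly only works if every single one of them looks the way we expect.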

## Talking to parents

When you can't uniquely identify something, sometimes you need to go up the tree to find its **parent**, the elements that are above it. We'll be looking for an element that covers the **entire story**, then we'll pick the link out of it.

In [16]:
story_tags = doc.find_all(class_="story")

Great, it looks like this:
    
    <article class="story theme-summary lede" id="topnews-100000004994965" data-story-id="100000004994965" data-rank="0" data-collection-renderstyle="LedeSum">

I'm going to go out on a limb and say we should look for an `article` tag, but what about the class? `story theme-summary lede` gives us three options:

* `story`
* `theme-summary`
* `lede`

`story` sounds promising, yeah?

In [17]:
print(story_tags[0])

<article class="story theme-summary lede" data-collection-renderstyle="LedeSum" data-rank="0" data-story-id="100000005146544" id="topnews-100000005146544">
<h2 class="story-heading"><a href="https://www.nytimes.com/2017/06/12/world/middleeast/isis-cyber.html">ISIS Proves an Elusive Target for America’s Cyberweapons</a></h2>
<p class="byline">By DAVID E. SANGER and ERIC SCHMITT <time class="timestamp" data-eastern-timestamp="5:00 AM" data-utc-timestamp="1497258021" datetime="2017-06-12">5:00 AM ET</time></p>
<p class="summary"><ul><li>The effectiveness of cyberweapons hit its limits against an enemy that exploits the internet to recruit, spread propaganda and use encrypted communications, all of which can be quickly reconstituted.</li>
<li>This is prompting officials to rethink how cyberwarfare techniques, first designed for fixed targets like nuclear facilities, must be refashioned.</li></ul></p>
<p class="theme-comments">
<a class="comments-link" href="https://www.nytimes.com/2017/06/

Seems to work well enough! Now that we have a parent, **we can use that parent to grab the elements inside of the story.** We'll use `.find` and `.find_all` to get everything we need.

* STEP ONE: Get the story
* STEP TWO: Get the headline
* STEP THREE: Get the byline
* STEP FOUR: Get the link

In [25]:
# Print the stuff
for story in story_tags:
    headline = story.find(class_="story-heading")
    link = story.find("a")
    summary = story.find(class_="summary")
    byline = story.find(class_="byline")
    # Any of these can be missing, so check before printing
    if headline:
        print(headline.text.strip())
    if link:
        print(link["href"])
    if summary:
        print(summary.text.strip())
    if byline:
        print(byline.text.strip())
    print("--------")

ISIS Proves an Elusive Target for America’s Cyberweapons
https://www.nytimes.com/2017/06/12/world/middleeast/isis-cyber.html
The effectiveness of cyberweapons hit its limits against an enemy that exploits the internet to recruit, spread propaganda and use encrypted communications, all of which can be quickly reconstituted.
This is prompting officials to rethink how cyberwarfare techniques, first designed for fixed targets like nuclear facilities, must be refashioned.
By DAVID E. SANGER and ERIC SCHMITT 5:00 AM ET
--------
Opioid Addicts Find an Ally in Blue
https://www.nytimes.com/2017/06/12/nyregion/when-opioid-addicts-find-an-ally-in-blue.html
Police leaders are assigning themselves a big role in reversing a complex crisis, and not through mass arrests.
By AL BAKER 5:00 AM ET
--------
Addiction Drug Lacks Results, but It Has Powerful Friends
https://www.nytimes.com/2017/06/11/health/vivitrol-drug-opioid-addiction.html
--------
Trump Era, Unlike Watergate Era, Has Rival Sets of Facts


In [27]:
# Save the stuff in a list of dictionaries, one dictionary per story
stories = []
for story in story_tags:
    current = {}

    headline = story.find(class_="story-heading")
    link = story.find("a")
    summary = story.find(class_="summary")
    byline = story.find(class_="byline")

    # Only add the keys for the pieces this story actually has
    if headline:
        current["headline"] = headline.text.strip()
    if link:
        current["link"] = link["href"]
    if summary:
        current["summary"] = summary.text.strip()
    if byline:
        current["byline"] = byline.text.strip()

    stories.append(current)

print(stories)

[{'headline': 'ISIS Proves an Elusive Target for America’s Cyberweapons', 'link': 'https://www.nytimes.com/2017/06/12/world/middleeast/isis-cyber.html', 'summary': 'The effectiveness of cyberweapons hit its limits against an enemy that exploits the internet to recruit, spread propaganda and use encrypted communications, all of which can be quickly reconstituted.\nThis is prompting officials to rethink how cyberwarfare techniques, first designed for fixed targets like nuclear facilities, must be refashioned.', 'byline': 'By DAVID E. SANGER and ERIC SCHMITT 5:00 AM ET'}, {'headline': 'Opioid Addicts Find an Ally in Blue', 'link': 'https://www.nytimes.com/2017/06/12/nyregion/when-opioid-addicts-find-an-ally-in-blue.html', 'summary': 'Police leaders are assigning themselves a big role in reversing a complex crisis, and not through mass arrests.', 'byline': 'By AL BAKER 5:00 AM ET'}, {'headline': 'Addiction Drug Lacks Results, but It Has Powerful Friends', 'link': 'https://www.nytimes.com/2

In [29]:
#put the stuff in a dataframe
df = pd.DataFrame(stories)

In [30]:
df.head()

| | byline | headline | link | summary |
|---|---|---|---|---|
| 0 | By DAVID E. SANGER and ERIC SCHMITT 5:00 AM ET | ISIS Proves an Elusive Target for America’s Cy... | https://www.nytimes.com/2017/06/12/world/middl... | The effectiveness of cyberweapons hit its limi... |
| 1 | By AL BAKER 5:00 AM ET | Opioid Addicts Find an Ally in Blue | https://www.nytimes.com/2017/06/12/nyregion/wh... | Police leaders are assigning themselves a big ... |
| 2 | | Addiction Drug Lacks Results, but It Has Power... | https://www.nytimes.com/2017/06/11/health/vivi... | |
| 3 | By JIM RUTENBERG | Trump Era, Unlike Watergate Era, Has Rival Set... | https://www.nytimes.com/2017/06/11/business/me... | Different versions of the Trump-Russia scandal... |
| 4 | | Democrats Call for Sessions’s Testimony to Be ... | https://www.nytimes.com/2017/06/11/us/politics... | |


In [32]:
df.to_csv("stories.csv", index=False)

The error seems to happen with this one piece here:
    
    <h1 class="story-heading"><a href="https://www.nytimes.com/2017/03/17/nyregion/norman-podhoretz-still-picks-fights-and-drops-names.html">Legendary New York Intellectuals Are His Ex-Friends</a></h1>
    <p class="summary">Norman Podhoretz, the former editor at Commentary magazine, looks back at the fierce, argumentative parties of New York’s intelligentsia.</p>
    <p class="byline">By JOHN LELAND </p>
    <p class="theme-comments">
    <a class="comments-link" href="https://www.nytimes.com/2017/03/17/nyregion/norman-podhoretz-still-picks-fights-and-drops-names.html?hp&amp;target=comments#commentsContainer"><i class="icon sprite-icon comments-icon"></i><span class="comment-count"> Comments</span></a>
    </p>
    </article>

Oh look, it uses an `h1` instead of an `h2`, but it's still a `story-heading`. Let's change our code to **look for a `story-heading` class regardless of tag name**.
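Here's a minimal sketch of that idea. The `sample_html` string and the example.com links are made up for illustration so the cell runs without downloading anything; only `find_all(class_=...)` comes from the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample HTML standing in for the real front page
sample_html = """
<article class="story"><h2 class="story-heading"><a href="https://example.com/a">First headline</a></h2></article>
<article class="story"><h1 class="story-heading"><a href="https://example.com/b">Second headline</a></h1></article>
"""
sample_doc = BeautifulSoup(sample_html, "html.parser")

# No tag name is given, so an h1 and an h2 both match
# as long as they have the story-heading class
headings = sample_doc.find_all(class_="story-heading")
heading_names = [tag.name for tag in headings]
heading_texts = [tag.text.strip() for tag in headings]

print(heading_names)
print(heading_texts)
```

If you passed a tag name as well, like `find_all("h2", class_="story-heading")`, the `h1` version would be left out.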

Another error! Let's print it out again.

It looks like it failed on this one:

    <article class="story">
    <h3 class="kicker">
    <a href="http://wordplay.blogs.nytimes.com">Wordplay »</a>
    </h3>
    </article>

Now we have a choice to make: do we care about this? I... don't. If we want to skip through to the next element in a loop, we can use `continue`.

Let's say **hey, if you don't have a headline, we're going to skip you.**

Maybe we can also say hey, let's get rid of the whitespace on the headlines by using `.strip()`.
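Both ideas together might look something like this. It's a sketch against a made-up `sample_html` string (the first story is invented, the second mirrors the Wordplay kicker above), not the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample HTML: one normal story and one kicker-only story
sample_html = """
<article class="story">
<h2 class="story-heading">
  <a href="https://example.com/a">A real headline</a>
</h2>
</article>
<article class="story">
<h3 class="kicker"><a href="http://wordplay.blogs.nytimes.com">Wordplay »</a></h3>
</article>
"""
sample_doc = BeautifulSoup(sample_html, "html.parser")

headlines = []
for story in sample_doc.find_all(class_="story"):
    headline = story.find(class_="story-heading")
    # No headline? Skip straight to the next story in the loop
    if headline is None:
        continue
    # .strip() removes the newlines and spaces around the text
    headlines.append(headline.text.strip())

print(headlines)
```

The kicker-only story never reaches the `append` line, so it simply doesn't show up in the results.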

### Next step: Adding more pieces

Now we need to add in the links and the bylines. We'll start with the links by pulling in any `a` tags.
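As a rough sketch, assuming the first `a` tag inside of a story is the one we want (that's what `story.find("a")` gives us), it could look like this. The `sample_html` is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample HTML: one story with a link, one without
sample_html = """
<article class="story">
<h2 class="story-heading"><a href="https://example.com/a">Linked headline</a></h2>
</article>
<article class="story">
<h2 class="story-heading">Headline without a link</h2>
</article>
"""
sample_doc = BeautifulSoup(sample_html, "html.parser")

links = []
for story in sample_doc.find_all(class_="story"):
    # .find returns the first a tag inside this story, or None
    link = story.find("a")
    if link is not None:
        # .get returns None instead of raising if there is no href
        links.append(link.get("href"))
    else:
        links.append(None)

print(links)
```

Checking for `None` first is what keeps us from hitting `'NoneType' object is not subscriptable` again.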

## Adding in bylines

Bylines look like this:

    <p class="byline">By PETER BAKER and STEVEN ERLANGER <time class="timestamp" datetime="2017-03-17" data-eastern-timestamp="12:36 PM" data-utc-timestamp="1489768575">12:36 PM ET</time></p>
    
So... let's just grab the element inside of the story that has the class of `byline`!

So we get another one of those "missing byline" errors, yeah? Well, maybe not everything has a byline. That doesn't mean we should skip the whole story; let's just skip the byline for that one.
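A sketch of the difference between "skip the story" and "skip the byline", again using made-up `sample_html` and a hypothetical `rows` list rather than the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical sample HTML: one story with a byline, one without
sample_html = """
<article class="story">
<h2 class="story-heading"><a href="https://example.com/a">Story with a byline</a></h2>
<p class="byline">By JOHN LELAND </p>
</article>
<article class="story">
<h2 class="story-heading"><a href="https://example.com/b">Story without a byline</a></h2>
</article>
"""
sample_doc = BeautifulSoup(sample_html, "html.parser")

rows = []
for story in sample_doc.find_all(class_="story"):
    headline = story.find(class_="story-heading")
    # A missing headline means we skip the whole story
    if headline is None:
        continue
    row = {"headline": headline.text.strip()}

    # A missing byline only means we leave that one key out
    byline = story.find(class_="byline")
    if byline is not None:
        row["byline"] = byline.text.strip()

    rows.append(row)

print(rows)
```

Both stories end up in `rows`; the second one just has no `byline` key, which pandas would show as an empty cell.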

**Looking a lot better!** Now the only problem is that we get "By LOUIS LUCERO II 1:00 PM ET" instead of just "LOUIS LUCERO II".

## So I guess you better learn regular expressions, eh?
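To give a taste of where that goes, here's one possible sketch. The `clean_byline` function and its pattern are my own guess at a cleanup, based only on the bylines we've seen so far: a leading "By", the names, and sometimes a trailing time like "1:00 PM ET". Bylines in other shapes would need a different pattern.

```python
import re

# Leading "By", then the names, then an optional "H:MM AM/PM ET" at the end
BYLINE_PATTERN = re.compile(r"^By\s+(.*?)(?:\s+\d{1,2}:\d{2}\s+[AP]M\s+ET)?$")

def clean_byline(text):
    """Return just the names from a byline, or the text unchanged if it doesn't match."""
    match = BYLINE_PATTERN.match(text.strip())
    if match:
        return match.group(1)
    return text

print(clean_byline("By LOUIS LUCERO II 1:00 PM ET"))
print(clean_byline("By JIM RUTENBERG"))
print(clean_byline("By DAVID E. SANGER and ERIC SCHMITT 5:00 AM ET"))
```

Returning the original text when nothing matches means an oddly formatted byline gets left alone instead of crashing the loop.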