# Scraping: https://www.nytimes.com/

Let's try to scrape the front page of the NYT. We're looking for

* Headlines
* Bylines
* Article links

## Getting started

We'll start by **importing the necessary libraries**.

In [1]:
from bs4 import BeautifulSoup
import requests

And then move into **downloading the page** and **importing it into BeautifulSoup**.

In [3]:
response = requests.get("https://www.nytimes.com/")

In [None]:
doc = BeautifulSoup(response.text, 'html.parser')

A lot of people call the analyzed page variable `soup` but for once in my life I actually go against the popular thing - I like to call it `doc`, since it helps me remember that it's the *entire document*.

## ATTEMPT ONE: Grabbing the tags directly

Let's jump right into trying to grab the link.

Oh, look, it's an... `a` tag. No special class or anything. What if we try to get all of the `a` tags on the page?
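Here's a minimal sketch of that attempt. It runs against a tiny hand-written snippet (`sample_html`, a made-up stand-in for the real front page) so it works without a network connection.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real one is far larger.
sample_html = """
<nav><a href="/section/world">World</a><a href="/section/us">U.S.</a></nav>
<article class="story">
  <h2 class="story-heading"><a href="/2017/03/17/example.html">An example headline</a></h2>
</article>
<footer><a href="/privacy">Privacy</a></footer>
"""

doc = BeautifulSoup(sample_html, "html.parser")

# find_all returns every matching tag, whether we wanted it or not
links = doc.find_all("a")
print(len(links))
for link in links:
    print(link.get("href"), link.text)
```

Even in this toy page, only one of the links is an actual story.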

Okay, that's terrible. Do you know how many `a` tags are going to be on that page? Many many many. Many very useless ones.

## Talking to parents

When you can't uniquely identify something, sometimes you need to go up the tree to find its **parent**, the elements that are above it. We'll be looking for an element that covers the **entire story**, then we'll pick the link out of it.
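One way to walk up the tree from code is BeautifulSoup's `find_parent`. This is only a sketch on a made-up snippet (`sample_html` is an assumption, trimmed down from the kind of markup the page uses); poking around in your browser's inspector gets you to the same place.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified story markup
sample_html = """
<article class="story theme-summary lede" id="topnews-100000004994965">
  <h2 class="story-heading"><a href="/lede.html">Lede story</a></h2>
</article>
"""

doc = BeautifulSoup(sample_html, "html.parser")

link = doc.find("a")
# Climb from the link to the nearest enclosing <article>
parent = link.find_parent("article")
print(parent.name, parent["class"])
```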

Great, it looks like this:
    
    <article class="story theme-summary lede" id="topnews-100000004994965" data-story-id="100000004994965" data-rank="0" data-collection-renderstyle="LedeSum">

I'm going to go out on a limb and say we should look for an `article` tag, but what about the class? `story theme-summary lede` gives us three options:

* `story`
* `theme-summary`
* `lede`

`story` sounds promising, yeah?
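A quick sketch of that search, again on a made-up snippet (`sample_html` and the `promo` class are assumptions for illustration). Asking for the class `story` matches any tag that has `story` as *one of* its classes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two stories and one unrelated article
sample_html = """
<article class="story theme-summary lede" id="topnews-100000004994965">
  <h2 class="story-heading"><a href="/lede.html">Lede story</a></h2>
</article>
<article class="story theme-summary">
  <h2 class="story-heading"><a href="/second.html">Second story</a></h2>
</article>
<article class="promo">
  <h2>Not a story</h2>
</article>
"""

doc = BeautifulSoup(sample_html, "html.parser")

# "story" matches "story theme-summary lede" too, since class is multi-valued
stories = doc.find_all("article", {"class": "story"})
print(len(stories))
```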

Seems to work well enough! Now that we have a parent, **we can use that parent to grab the elements inside of the story.** We'll use `.find` and `.find_all` to get everything we need.

* STEP ONE: Get the story
* STEP TWO: Get the headline
* STEP THREE: Get the byline
* STEP FOUR: Get the link

If we examine the page, it looks like headlines might be `h2` tags that have a `story-heading` class.

### An error strikes!

But we get an error!

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-57-9218ec61124f> in <module>()
          4     print("This is a story")
          5     headline = story.find('h2', { 'class': 'story-heading' })
    ----> 6     print(headline.text)

    AttributeError: 'NoneType' object has no attribute 'text'

Hm, a story missing a headline? Let's look at it a little closer. We could do this in a classy way, but let's just brute force it by printing out every article just before the error line.
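The brute-force approach might look something like this. The snippet is made up (`sample_html` is an assumption), and the `try`/`except` is only there so the sketch runs to completion instead of crashing; in the notebook you'd just let it blow up and read the last thing printed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second story uses an h1 instead of an h2
sample_html = """
<article class="story">
  <h2 class="story-heading"><a href="/first.html">First story</a></h2>
</article>
<article class="story">
  <h1 class="story-heading"><a href="/second.html">Second story</a></h1>
</article>
"""

doc = BeautifulSoup(sample_html, "html.parser")

failed_story = None
for story in doc.find_all("article", {"class": "story"}):
    print("This is a story")
    print(story)  # brute force: dump the raw HTML before the line that may fail
    headline = story.find("h2", {"class": "story-heading"})
    try:
        print(headline.text)
    except AttributeError:
        # find returned None, so the story we just printed is the culprit
        failed_story = story
        break
```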

The error seems to happen with this one piece here:
    
    <h1 class="story-heading"><a href="https://www.nytimes.com/2017/03/17/nyregion/norman-podhoretz-still-picks-fights-and-drops-names.html">Legendary New York Intellectuals Are His Ex-Friends</a></h1>
    <p class="summary">Norman Podhoretz, the former editor at Commentary magazine, looks back at the fierce, argumentative parties of New York’s intelligentsia.</p>
    <p class="byline">By JOHN LELAND </p>
    <p class="theme-comments">
    <a class="comments-link" href="https://www.nytimes.com/2017/03/17/nyregion/norman-podhoretz-still-picks-fights-and-drops-names.html?hp&amp;target=comments#commentsContainer"><i class="icon sprite-icon comments-icon"></i><span class="comment-count"> Comments</span></a>
    </p>
    </article>

Oh look, it uses an `h1` instead of an `h2`, but it's still a `story-heading`. Let's change our code to **look for a `story-heading` class regardless of tag name**.
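A sketch of the tag-agnostic version: leave the tag name out and search by class alone. `class_="story-heading"` is the keyword form of passing `{'class': 'story-heading'}`. The sample markup is made up and only contains stories that do have a heading.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one h2 headline, one h1 headline
sample_html = """
<article class="story">
  <h2 class="story-heading"><a href="/first.html">First story</a></h2>
</article>
<article class="story">
  <h1 class="story-heading"><a href="/second.html">Second story</a></h1>
</article>
"""

doc = BeautifulSoup(sample_html, "html.parser")

headlines = []
for story in doc.find_all("article", {"class": "story"}):
    # No tag name given: any element with the story-heading class will do
    headline = story.find(class_="story-heading")
    headlines.append(headline.text)

print(headlines)
```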

Another error! Let's print out again.

It looks like it failed on this one. 

    <article class="story">
    <h3 class="kicker">
    <a href="http://wordplay.blogs.nytimes.com">Wordplay »</a>
    </h3>
    </article>

Now we have a choice to make: do we care about this? I... don't. If we want to skip through to the next element in a loop, we can use `continue`.

Let's say **hey, if you don't have a headline, we're going to skip you.**

Maybe we can also say hey, let's get rid of the whitespace on the headlines by using `.strip()`.
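Put together, the skip-and-strip loop could look like this sketch. The markup is a made-up sample (`sample_html` is an assumption) that includes a headline-less kicker like the Wordplay one.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: messy whitespace plus one story with no headline
sample_html = """
<article class="story">
  <h2 class="story-heading">
    <a href="/first.html">First story</a>
  </h2>
</article>
<article class="story">
  <h3 class="kicker"><a href="http://wordplay.blogs.nytimes.com">Wordplay »</a></h3>
</article>
<article class="story">
  <h1 class="story-heading"><a href="/second.html"> Second story </a></h1>
</article>
"""

doc = BeautifulSoup(sample_html, "html.parser")

headlines = []
for story in doc.find_all("article", {"class": "story"}):
    headline = story.find(class_="story-heading")
    if headline is None:
        # No headline? Skip ahead to the next story in the loop
        continue
    # strip() trims the newlines and spaces around the text
    headlines.append(headline.text.strip())

print(headlines)
```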

### Next step: Adding more pieces

Now we need to add in the links and the bylines. We'll start with the links by pulling in any `a` tags.
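A sketch of grabbing the link. It assumes the first `a` tag inside a story is the article link, which holds for this made-up sample but is worth double-checking on the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with absolute article URLs
sample_html = """
<article class="story">
  <h2 class="story-heading"><a href="https://www.nytimes.com/first.html">First story</a></h2>
</article>
<article class="story">
  <h1 class="story-heading"><a href="https://www.nytimes.com/second.html">Second story</a></h1>
</article>
"""

doc = BeautifulSoup(sample_html, "html.parser")

results = []
for story in doc.find_all("article", {"class": "story"}):
    headline = story.find(class_="story-heading")
    if headline is None:
        continue
    # Assumption: the first <a> in the story points at the article
    link = story.find("a")
    url = link["href"] if link is not None else None
    results.append((headline.text.strip(), url))

for title, url in results:
    print(title, url)
```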

## Adding in bylines

Bylines look like this:

    <p class="byline">By PETER BAKER and STEVEN ERLANGER <time class="timestamp" datetime="2017-03-17" data-eastern-timestamp="12:36 PM" data-utc-timestamp="1489768575">12:36 PM ET</time></p>
    
So... let's just grab the element inside of story that has the class of `byline`!

So we get another one of those "missing byline" errors, yeah? Well, maybe not everything has a byline. It doesn't mean we should skip the whole thing; let's just skip the byline for that one.
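Here's a sketch of treating the byline as optional: keep the story, and store `None` when there's no byline. The sample markup and the `rows` list of dictionaries are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second story has no byline
sample_html = """
<article class="story">
  <h2 class="story-heading"><a href="/first.html">First story</a></h2>
  <p class="byline">By PETER BAKER and STEVEN ERLANGER <time class="timestamp" datetime="2017-03-17">12:36 PM ET</time></p>
</article>
<article class="story">
  <h2 class="story-heading"><a href="/second.html">Second story</a></h2>
</article>
"""

doc = BeautifulSoup(sample_html, "html.parser")

rows = []
for story in doc.find_all("article", {"class": "story"}):
    headline = story.find(class_="story-heading")
    if headline is None:
        continue
    byline = story.find(class_="byline")
    rows.append({
        "headline": headline.text.strip(),
        # Only touch .text when a byline was actually found
        "byline": byline.text.strip() if byline is not None else None,
    })

for row in rows:
    print(row)
```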

**Looking a lot better!** Now the only problem is that we get "By LOUIS LUCERO II 1:00 PM ET" instead of "By LOUIS LUCERO II" or, even better, "LOUIS LUCERO II".

## So I guess you better learn regular expressions, eh?
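As a taste, here's one possible cleanup using Python's `re` module. The helper name `clean_byline` is made up, and the patterns assume bylines start with "By " and may end with a timestamp shaped like "1:00 PM ET"; other formats would need their own patterns.

```python
import re

def clean_byline(text):
    """Hypothetical helper: drop a leading 'By ' and a trailing timestamp."""
    text = re.sub(r"^By\s+", "", text.strip())
    # Matches endings like " 1:00 PM ET" or " 12:36 PM ET"
    text = re.sub(r"\s*\d{1,2}:\d{2}\s*[AP]M\s*ET$", "", text)
    return text

print(clean_byline("By LOUIS LUCERO II 1:00 PM ET"))
print(clean_byline("By JOHN LELAND "))
```

Bylines without a timestamp pass through the second substitution unchanged.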