
# Webscraping with BeautifulSoup and requests


---

## Learning Objectives

After this lesson students will be able to:
- Get HTML content from websites with requests 
- Parse website content with BeautifulSoup


### Prior knowledge required
- Python and pandas basics
---

# Web scraping issues

## Terms of service ⭐️
Google is your friend: search for the site's terms of service and see what they say about web scraping.

The law is unresolved, but generally, if the data is publicly available and you are using it for educational purposes, it's unlikely that you will have problems. 

## Let's do some scraping
### Imports

In [None]:
# install if needed
# pip install bs4

In [1]:


# import pandas, bs4, and requests
import pandas as pd
from bs4 import BeautifulSoup
import requests

#### Use the requests library to get the content of a sample webpage

In [2]:
import bs4

In [3]:
bs4.__version__

'4.13.4'

In [4]:
url = 'https://rldaggie.github.io/sample-html/'
response = requests.get(url)

#### What did we get back?

In [5]:
response

<Response [200]>

#### Our response object has a lot more in it; we just have to get it out.
#### Status Codes

In [7]:
response.status_code

200

## Status codes
Status codes tell you how the target server responded to your request.

#### 200 = OK

#### 300s = Redirection

#### 400s = Client Error
- 400 = Bad Request
- 403 = Forbidden (not authorized)
- 404 = Not Found

#### 500s = Server Error

If your request was successful, you now have the contents of the webpage stored in memory on your machine.
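As a rough illustration of the families listed above, the sketch below labels a status code using Python's standard-library `http.HTTPStatus`. The helper name `describe_status` is hypothetical and not part of requests; with a real response you would pass in `response.status_code`.

```python
from http import HTTPStatus

# Map the first digit of a status code to its family.
FAMILIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code: int) -> str:
    """Return a label such as '404 Not Found (client error)'."""
    family = FAMILIES.get(code // 100, "other")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Non-standard codes are not in the HTTPStatus enum.
        phrase = "Unknown"
    return f"{code} {phrase} ({family})"


for code in (200, 301, 403, 404, 500):
    print(describe_status(code))
```

requests also provides `response.ok` and `response.raise_for_status()` if you would rather stop early on a 4xx or 5xx response.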

---
#### Let's get the good stuff 🚀

In [17]:
response.text.find('<li')

411

In [16]:
response.text

'<!DOCTYPE html>\n<html>\n  <head>\n    <meta charset="utf-8">\n    <title>The title</title>\n\n    <style media="screen">\n      tbody tr {\n        color: red;\n      }\n    </style>\n  </head>\n  <body>\n    <h1 class="foobar" id="title">This is an h1</h1>\n\n    <div>\n      <h1 class="foobar">This is yet another heading.</h1>\n\n      Something inside the div\n    </div>\n\n    <h3>Todo List</h3>\n    <ol class="todo">\n      <li class="foobar">Take out trash</li>\n      <li>Pay billz</li>\n      <li class="foobar">Feed dog</li>\n    </ol>\n\n    <h3>Completed</h3>\n    <ol class=\'done\'>\n      <li>Mow lawn</li>\n      <li class="foobar"><span>Take out compost</span></li>\n      <li><span>Create scraping lecture</span></li>\n    </ol>\n\n    <p class=\'foobar\'>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo

In [19]:
response.text.rindex('</li>')

3803

In [22]:
response.text[411:3810]

'<li class="foobar">Take out trash</li>\n      <li>Pay billz</li>\n      <li class="foobar">Feed dog</li>\n    </ol>\n\n    <h3>Completed</h3>\n    <ol class=\'done\'>\n      <li>Mow lawn</li>\n      <li class="foobar"><span>Take out compost</span></li>\n      <li><span>Create scraping lecture</span></li>\n    </ol>\n\n    <p class=\'foobar\'>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. <span>Duis aute irure dolor</span> in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. <em>Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum</em>.</p>\n\n    <p>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitat

#### We could parse this by hand 😿

#### But that would be painful and we can instead use a library 😀
### Create a `BeautifulSoup` object

In [23]:
# name the parser explicitly; otherwise bs4 picks one for you and emits a warning
soup = BeautifulSoup(response.text, 'html.parser')

### What is it

In [24]:
type(soup)

bs4.BeautifulSoup

#### Let's take a look at it

In [25]:
soup

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<title>The title</title>
<style media="screen">
      tbody tr {
        color: red;
      }
    </style>
</head>
<body>
<h1 class="foobar" id="title">This is an h1</h1>
<div>
<h1 class="foobar">This is yet another heading.</h1>

      Something inside the div
    </div>
<h3>Todo List</h3>
<ol class="todo">
<li class="foobar">Take out trash</li>
<li>Pay billz</li>
<li class="foobar">Feed dog</li>
</ol>
<h3>Completed</h3>
<ol class="done">
<li>Mow lawn</li>
<li class="foobar"><span>Take out compost</span></li>
<li><span>Create scraping lecture</span></li>
</ol>
<p class="foobar">Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. <span>Duis aute irure dolor</span> in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. <em>Excepteu

# `soup.find()`

### Returns either:

1. A `bs4.element.Tag` object for the first match
2. `None` if nothing matches
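Because `find()` can return `None`, it is worth checking the result before using it. The sketch below uses a small inline HTML snippet (made up for illustration, so it runs without a network request) to show both outcomes.

```python
from bs4 import BeautifulSoup

# Inline snippet so the example runs offline.
html = "<ol><li class='foobar'>Take out trash</li><li>Pay billz</li></ol>"
demo_soup = BeautifulSoup(html, "html.parser")

first_li = demo_soup.find("li")    # a Tag for the first <li>
missing = demo_soup.find("table")  # the snippet has no <table>, so this is None

print(first_li.text)
print(missing is None)

# Guard before using the result; calling .text on None raises AttributeError.
table_text = missing.text if missing is not None else ""
print(repr(table_text))
```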

In [26]:
soup.find('li')

<li class="foobar">Take out trash</li>

In [27]:
li = soup.find('li')

In [28]:
type(li)

bs4.element.Tag

#### Get the text in the tag

In [29]:
li.text

'Take out trash'

#### Get the attributes of the tag

In [30]:
li.attrs

{'class': ['foobar']}

In [31]:
li

<li class="foobar">Take out trash</li>
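A tag also supports dictionary-style access to its attributes. The sketch below, using a made-up inline snippet, shows that `class` comes back as a list (a tag can carry several classes), that `id` comes back as a string, and that `.get()` returns `None` for a missing attribute where square brackets would raise a `KeyError`.

```python
from bs4 import BeautifulSoup

# Inline snippet so the example runs offline.
html = '<h1 class="foobar big" id="title">This is an h1</h1><li>Pay billz</li>'
demo_soup = BeautifulSoup(html, "html.parser")

h1 = demo_soup.find("h1")
li = demo_soup.find("li")

print(h1["class"])      # class is multi-valued, so it is returned as a list
print(h1["id"])         # id is single-valued, so it is returned as a string
print(li.get("class"))  # .get() returns None when the attribute is absent
print(li.attrs)         # a tag with no attributes has an empty dict
```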

# ⭐️ ⭐️ `soup.find_all()` ⭐️ ⭐️

### Returns a **_LIST_** (technically a `bs4.element.ResultSet`) of soup objects that match your query.

## Behaves differently than `find()`

In [32]:
soup.find_all('li')

[<li class="foobar">Take out trash</li>,
 <li>Pay billz</li>,
 <li class="foobar">Feed dog</li>,
 <li>Mow lawn</li>,
 <li class="foobar"><span>Take out compost</span></li>,
 <li><span>Create scraping lecture</span></li>,
 <li><a href="#">Home</a></li>,
 <li><a href="#">About</a></li>,
 <li><a href="#">Contact</a></li>]

In [33]:
[t.text for t in soup.find_all('li')]

['Take out trash',
 'Pay billz',
 'Feed dog',
 'Mow lawn',
 'Take out compost',
 'Create scraping lecture',
 'Home',
 'About',
 'Contact']
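One practical difference from `find()`: when nothing matches, `find_all()` returns an empty result rather than `None`, so looping over it is always safe. The sketch below uses a made-up inline snippet and also shows the `limit` keyword for capping the number of matches.

```python
from bs4 import BeautifulSoup

# Inline snippet so the example runs offline.
html = "<ul><li>Home</li><li>About</li><li>Contact</li></ul>"
demo_soup = BeautifulSoup(html, "html.parser")

all_items = demo_soup.find_all("li")           # every <li>
no_tables = demo_soup.find_all("table")        # no match: empty result, not None
first_two = demo_soup.find_all("li", limit=2)  # stop after two matches

print(len(all_items))
print(len(no_tables))
print([tag.text for tag in first_two])
```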


#### Make a list comprehension that creates a list containing only the text of the tags

In [36]:
l = [t.text for t in soup.find_all("li")]
l

['Take out trash',
 'Pay billz',
 'Feed dog',
 'Mow lawn',
 'Take out compost',
 'Create scraping lecture',
 'Home',
 'About',
 'Contact']

#### List comprehension that puts the classes of the h1 tags in a list

In [37]:
soup.find('h1').text

'This is an h1'

In [38]:
soup.find('h1').attrs['class']

['foobar']

In [39]:
[i.attrs['class'] for i in soup.find_all('h1')]

[['foobar'], ['foobar']]

In [40]:
[tag['class'] for tag in soup.find_all('h1') if tag.has_attr('class')]

[['foobar'], ['foobar']]

## Todo List

Find the ordered list whose class is 'done'.

In [33]:
done = soup.find('ol',attrs={'class':'done'})

In [38]:
done

<ol class="done">
<li>Mow lawn</li>
<li class="foobar"><span>Take out compost</span></li>
<li><span>Create scraping lecture</span></li>
</ol>

#### Get the list item texts from the ol

In [40]:
done.find_all('li')

[<li>Mow lawn</li>,
 <li class="foobar"><span>Take out compost</span></li>,
 <li><span>Create scraping lecture</span></li>]

In [41]:
[t.text for t in done.find_all('li')]

['Mow lawn', 'Take out compost', 'Create scraping lecture']
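There are other ways to write the same query. As a sketch on a made-up inline snippet: the `class_` keyword is shorthand for the `attrs` dictionary, and `select()` accepts a CSS selector that reaches the nested list items in one step.

```python
from bs4 import BeautifulSoup

# Inline snippet so the example runs offline.
html = """
<ol class="todo"><li>Take out trash</li><li>Pay billz</li></ol>
<ol class="done"><li>Mow lawn</li><li><span>Take out compost</span></li></ol>
"""
demo_soup = BeautifulSoup(html, "html.parser")

via_attrs = demo_soup.find("ol", attrs={"class": "done"})
via_keyword = demo_soup.find("ol", class_="done")  # trailing underscore avoids the Python keyword

# CSS selector: every <li> inside an <ol> that has class "done"
done_texts = [li.text for li in demo_soup.select("ol.done li")]

print(via_attrs == via_keyword)
print(done_texts)
```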

## Let's scrape a music reviews website

### TOS

Find the Terms of Service for the website. 

### robots.txt

- `robots.txt` (found at `https://my_site_name_here.com/robots.txt`) tells crawlers which pages the site does and does not want them to visit.
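Python's standard library can read these rules for you. The sketch below feeds `urllib.robotparser.RobotFileParser` a hypothetical set of rules inline so that it runs offline; the rules and the `example.com` URLs are made up. For a live site you would call `set_url()` with the robots.txt address and then `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written inline so the sketch runs offline.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
]

parser = RobotFileParser()
parser.parse(robots_lines)

allowed = parser.can_fetch("*", "https://example.com/reviews/albums/")
blocked = parser.can_fetch("*", "https://example.com/private/page.html")
delay = parser.crawl_delay("*")

print(allowed)
print(blocked)
print(delay)
```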

#### Get the content

In [50]:
url = 'https://pitchfork.com/reviews/albums/'

response = requests.get(url)

# response.text

soup = BeautifulSoup(response.text, 'html.parser')
soup.find('h2')

In [43]:




# album titles
soup.find('h2')

albums = [t.text for t in soup.find_all('h2')]

soup.find('ul', {'class': 'artist-list'})

artists = [t.text for t in soup.find_all('ul', {'class': 'artist-list'})]

albums

[]

In [42]:
pd.DataFrame({'artist': artists, 'album': albums})

Unnamed: 0,artist,album


#### Find the content of any H2 tags with BS4

In [56]:
reviews = soup.find_all('div', {'class': 'review'})

In [64]:
reviews[0]#.find('a').attrs['href']

<div class="review"><a class="review__link" href="/reviews/albums/beyonce-cowboy-carter/"><div class="review__artwork artwork"><div class="review__artwork--with-notch"><img alt="Beyoncé: Cowboy Carter" src="https://media.pitchfork.com/photos/65f9ba5e7f6a6f4c6c74a9d8/1:1/w_160/Beyonce-Cowboy-Carter.jpg"/></div></div><div class="review__title"><ul class="artist-list review__title-artist"><li>Beyoncé</li></ul><h2 class="review__title-album"><em>Cowboy Carter</em></h2></div></a><div class="review__meta"><a class="review__meta-bnm" href="/reviews/best/albums/">Best New Album</a><ul class="genre-list genre-list--inline review__genre-list"><li class="genre-list__item"><a class="genre-list__link" href="/reviews/albums/?genre=folk">Folk/Country</a></li><li class="genre-list__item"><a class="genre-list__link" href="/reviews/albums/?genre=pop">Pop/R&amp;B</a></li></ul><ul class="authors"><li><a class="linked display-name display-name--linked" href="/staff/julianne-escobedo-shepard/"><span class
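Collecting artists and albums in two separate `find_all()` passes can leave the lists misaligned if one review is missing a field. One possible approach is to loop over each review block and pull both fields from it. The markup below is a simplified, hypothetical stand-in modeled on the block above, and `text_or_none` is an illustrative helper, not part of bs4; the live page's structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the review block above; real pages may differ.
html = """
<div class="review">
  <a class="review__link" href="/reviews/albums/album-one/">
    <ul class="artist-list"><li>Artist One</li></ul>
    <h2 class="review__title-album"><em>Album One</em></h2>
  </a>
</div>
<div class="review">
  <a class="review__link" href="/reviews/albums/album-two/">
    <h2 class="review__title-album"><em>Album Two</em></h2>
  </a>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")


def text_or_none(parent, name, class_name):
    """Return the stripped text of the first matching descendant, or None."""
    tag = parent.find(name, class_=class_name)
    return tag.get_text(strip=True) if tag is not None else None


# One dict per review keeps each artist paired with its own album.
rows = [
    {
        "artist": text_or_none(review, "ul", "artist-list"),
        "album": text_or_none(review, "h2", "review__title-album"),
    }
    for review in demo_soup.find_all("div", class_="review")
]
print(rows)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)`.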

In [63]:
# the hrefs already start with '/', so the base URL must not end with one
urls = ['https://www.pitchfork.com' + review.find('a').attrs['href'] for review in reviews]
urls

['https://www.pitchfork.com/reviews/albums/beyonce-cowboy-carter/',
 'https://www.pitchfork.com/reviews/albums/yung-lean-bladee-psykos/',
 'https://www.pitchfork.com/reviews/albums/mizu-forest-scenes/',
 'https://www.pitchfork.com/reviews/albums/photek-modus-operandi/',
 'https://www.pitchfork.com/reviews/albums/various-artists-funk-br-sao-paulo/',
 'https://www.pitchfork.com/reviews/albums/kelly-moran-moves-in-the-field/',
 'https://www.pitchfork.com/reviews/albums/nourished-by-time-catching-chickens-ep/',
 'https://www.pitchfork.com/reviews/albums/alena-spanger-fire-escape/',
 'https://www.pitchfork.com/reviews/albums/1010benja-ten-total/',
 'https://www.pitchfork.com/reviews/albums/jlin-akoma/',
 'https://www.pitchfork.com/reviews/albums/tatyana-its-over/',
 'https://www.pitchfork.com/reviews/albums/future-metro-boomin-we-dont-trust-you/']
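Building URLs by string concatenation is sensitive to stray or missing slashes. A more forgiving option is the standard library's `urllib.parse.urljoin`, sketched here with two of the hrefs from the listing; the variable names are illustrative.

```python
from urllib.parse import urljoin

# The page the links were scraped from, used as the base for resolution.
base_url = "https://pitchfork.com/reviews/albums/"
hrefs = [
    "/reviews/albums/beyonce-cowboy-carter/",
    "/reviews/albums/jlin-akoma/",
]

# urljoin resolves each href against the base the way a browser would.
full_urls = [urljoin(base_url, href) for href in hrefs]
print(full_urls)
```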

In [None]:
# stubs to fill in
def get_reviews():
    pass

def get_score():
    pass

def get_text_of_reviews():
    pass



#### Grab all the Trending Beers 

## More Issues
Sometimes the HTML doesn't appear right away. Maybe you need to simulate clicking on buttons.

You can use a headless browser. 

- Selenium with Chromium will do the job. Here's an article on the topic: https://www.scrapingbee.com/blog/selenium-python/

- [Scrapy](https://scrapy.org/) is another option for scraping websites. It makes requests and gets data but is more powerful and complex than requests with BS4.

- Your IP address (or username if logged in) can get blocked if you are deemed to be malicious. 

- DOS (Denial of Service) attacks are real and if you ping a website lots and lots of times quickly you might get blocked, regardless of what robots.txt or the terms of use say.

- If you want to scrape repeatedly, check that the website hasn't changed, since changes can break how you grab the data!
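To avoid hammering a server, a common habit is to pause between requests. The sketch below is one minimal way to do that. The helper name `fetch_politely` and the one-second default are assumptions, and a stand-in function replaces the real download so the example runs offline.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # pause before every request except the first
        results.append(fetch(url))
    return results


# str.upper stands in for a real download; swap in requests.get for actual use.
pages = fetch_politely(["page-1", "page-2", "page-3"], fetch=str.upper, delay_seconds=0.01)
print(pages)
```

If the site's robots.txt specifies a crawl delay, that value is a sensible choice for `delay_seconds`.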

## Summary

You've seen how to use requests with BS4 to get HTML and parse it.

Scraping websites is brittle and can be frustrating. But it's pretty cool. 😉

### Check for understanding

- What requests method do you use to grab HTML?
- How do you get HTML content out of the requests object?