# Data Acquisition with Web Scraping

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

`$ pip install beautifulsoup4`

## Soup Methods

- `soup.select('.class')`: returns a list of all elements with class `class` (accepts any CSS selector)
- `soup.select_one('.class')`: returns the first element with class `class`, or `None` if nothing matches
- `soup.h2`: returns the first `h2` element
- `soup.find_all('h2')`: returns all elements with the tag name `h2`
- `soup('h2')`: shorthand for the `find_all` call above
- `soup.find('h2')`: returns the first matching element, or `None` if nothing matches
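The methods listed above can be tried on a small document before pointing them at a real page. The snippet below is a minimal sketch; the HTML string and its class names (`card`, `note`) are made up for illustration and do not come from the demo site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this illustration
sample_html = """
<div class="card"><h2>First</h2><p class="note">one</p></div>
<div class="card"><h2>Second</h2><p class="note">two</p></div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

cards = sample_soup.select('.card')            # every element with class "card"
first_note = sample_soup.select_one('.note')   # first element with class "note"
first_h2 = sample_soup.h2                      # first h2, same as find('h2')
all_h2 = sample_soup.find_all('h2')            # every h2 element
same_as_find_all = sample_soup('h2')           # calling the soup is find_all
missing = sample_soup.find('h3')               # no h3 in the sample, so None

print(len(cards))
print(first_note.text)
print(first_h2.text)
print([h.text for h in all_h2])
print(missing)
```

Note that `select` and `find_all` always return a list (possibly empty), while `select_one` and `find` return a single element or `None`.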

First, make the request. The body of the response is the page's raw HTML as one long string.

In [2]:
response = requests.get('https://web-scraping-demo.zgulde.net/news')
html = response.text
html

'<!DOCTYPE html>\n<html lang="en">\n<head>\n    <meta charset="UTF-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n    <title>News Example Page</title>\n    <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet" />\n    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" />\n</head>\n<body class="mx-auto max-w-screen-lg pb-32">\n    \n<h1 class="my-5 text-4xl text-center">News!</h1>\n<div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">\n    <p>\n        <i class="bi bi-exclamation-circle text-xl"></i>\n        All data on this page is strictly for demonstration purposes and fake.\n    </p>\n</div>\n<div class="grid gap-y-12">\n    \n    <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">\n        <img src="/static/placeholder.png" />\n        <d
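The request above assumes the server answered successfully. A slightly more cautious version is sketched below; `fetch_html` is a hypothetical helper name, and the default user agent and the 10 second timeout are assumptions chosen for illustration.

```python
import requests

def fetch_html(url, user_agent='Codeup DS Hopper', timeout=10):
    """Return the HTML of url as text, raising on 4xx/5xx responses.

    Hypothetical helper: the name, default user agent, and timeout
    are illustrative choices, not part of the original notebook.
    """
    response = requests.get(
        url,
        headers={'user-agent': user_agent},
        timeout=timeout,  # give up instead of waiting forever
    )
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text

# Example call (needs network access):
# html = fetch_html('https://web-scraping-demo.zgulde.net/news')
```

Failing loudly on an error status avoids parsing an error page as if it were the content we asked for.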

We can make more sense of that HTML with the Beautiful Soup library.

In [3]:
# name the parser explicitly so results don't depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   News Example Page
  </title>
  <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
  <link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
 </head>
 <body class="mx-auto max-w-screen-lg pb-32">
  <h1 class="my-5 text-4xl text-center">
   News!
  </h1>
  <div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
   <p>
    <i class="bi bi-exclamation-circle text-xl">
    </i>
    All data on this page is strictly for demonstration purposes and fake.
   </p>
  </div>
  <div class="grid gap-y-12">
   <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
    <img src="/static/placeholder.png"/>
    <div class="col-span-3 space-y-3 py-3

From here we can switch between the browser and Python, trying out different ways of selecting different parts of the HTML document.

Google Chrome's developer tools help here: right-click an element on the page and choose "Inspect". The inspector shows the tags and classes around that element, which tells us what selectors to use when scraping.

In [None]:
# Use beautifulsoup methods to extract necessary content from an article

In [5]:
articles = soup.select('.grid-cols-4')
articles

[<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
 <img src="/static/placeholder.png"/>
 <div class="col-span-3 space-y-3 py-3">
 <h2 class="text-2xl text-green-900">situation safe drop</h2>
 <div class="grid grid-cols-2 italic">
 <p> 2012-12-06 </p>
 <p class="text-right">By Brenda Adams </p>
 </div>
 <p>Rate energy skin important where. Mind participant sister cost my.
 Reduce money heavy reason forward now someone.</p>
 </div>
 </div>,
 <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
 <img src="/static/placeholder.png"/>
 <div class="col-span-3 space-y-3 py-3">
 <h2 class="text-2xl text-green-900">Congress thing simply</h2>
 <div class="grid grid-cols-2 italic">
 <p> 2012-10-31 </p>
 <p class="text-right">By Courtney Edwards </p>
 </div>
 <p>Argue source top to. Minute this good ask show. Democratic behavior husband like reduce.
 By assume school long. Blue up

In [7]:
article = articles[0]
article

<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
<img src="/static/placeholder.png"/>
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">situation safe drop</h2>
<div class="grid grid-cols-2 italic">
<p> 2012-12-06 </p>
<p class="text-right">By Brenda Adams </p>
</div>
<p>Rate energy skin important where. Mind participant sister cost my.
Reduce money heavy reason forward now someone.</p>
</div>
</div>

In [8]:
# returns the first h2 tag inside the article
article.h2

<h2 class="text-2xl text-green-900">situation safe drop</h2>

In [9]:
# to extract text from tag
headline = article.h2.text
headline

'situation safe drop'

In [16]:
# to extract date
print(article.p)
# or to get date without extra elements
print(article.p.text)
# remove whitespace
print(article.p.text.strip())
date = article.p.text.strip()
date

<p> 2012-12-06 </p>
 2012-12-06 
2012-12-06


'2012-12-06'

In [19]:
# the byline reads "By Brenda Adams"; [3:] drops the leading "By "
author = article.select('.text-right')[0].text.strip()[3:]
author

'Brenda Adams'

In [21]:
# the last p tag in the article holds the body text
content = article.select('p')[-1].text
content

'Rate energy skin important where. Mind participant sister cost my.\nReduce money heavy reason forward now someone.'

Bringing it all together: Make a function

In [24]:
def parse_news(article):
    headline = article.h2.text
    date = article.p.text.strip()
    author = article.select('.text-right')[0].text.strip()[3:]
    content = article.select('p')[-1].text
    
    return {
        'headline': headline, 'date': date, 'author': author,
        'content': content
    }
# test function
parse_news(article)

{'headline': 'situation safe drop',
 'date': '2012-12-06',
 'author': 'Brenda Adams',
 'content': 'Rate energy skin important where. Mind participant sister cost my.\nReduce money heavy reason forward now someone.'}

In [26]:
# loop through all the articles
pd.DataFrame([parse_news(art) for art in articles])

| | headline | date | author | content |
|---|---|---|---|---|
| 0 | situation safe drop | 2012-12-06 | Brenda Adams | Rate energy skin important where. Mind partici... |
| 1 | Congress thing simply | 2012-10-31 | Courtney Edwards | Argue source top to. Minute this good ask show... |
| 2 | ball director bag | 1999-03-26 | Rachel Williams | My media finally sea weight street. Pull howev... |
| 3 | main son common | 2016-01-09 | Zachary Williams | Expect others case house inside front informat... |
| 4 | professor cell science | 2000-02-24 | Daniel Smith | Help mind offer fund process drop about person... |
| 5 | operation capital serious | 1994-10-26 | Shannon Meyer | Might six very specific director out. Defense ... |
| 6 | mother take food | 2006-01-01 | Rodney Clark | Message class movie skill fire. Camera fish gr... |
| 7 | talk professor too | 1975-01-25 | Samuel Owens | Lose man other feel blue. Teach do amount sold... |
| 8 | establish research prevent | 1977-04-04 | Robert Hobbs | Bad knowledge easy senior billion moment lawye... |
| 9 | window skill card | 1990-06-03 | Samantha Yates | Cover administration government sign worker ef... |
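`parse_news` works because every article on the demo page has the same structure. On a messier page, indexing into an empty `select` result raises `IndexError`. The sketch below shows one way to tolerate missing pieces; `parse_news_safe` and `text_or_none` are hypothetical names, the first sample article is abridged from the output above, and the second (with no author line) is invented to exercise the missing case.

```python
from bs4 import BeautifulSoup

# First article abridged from the page above; the second is hypothetical
# and deliberately lacks the author paragraph.
sample_html = """
<div class="grid grid-cols-4">
<h2>situation safe drop</h2>
<div class="grid grid-cols-2 italic">
<p> 2012-12-06 </p>
<p class="text-right">By Brenda Adams </p>
</div>
<p>Rate energy skin important where.</p>
</div>
<div class="grid grid-cols-4">
<h2>no author here</h2>
<div class="grid grid-cols-2 italic">
<p> 2020-01-01 </p>
</div>
<p>Body text.</p>
</div>
"""

def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag is missing."""
    return tag.text.strip() if tag is not None else None

def parse_news_safe(article):
    # select_one returns None instead of raising when nothing matches
    author = text_or_none(article.select_one('.text-right'))
    if author is not None and author.startswith('By '):
        author = author[len('By '):]
    paragraphs = article.select('p')
    return {
        'headline': text_or_none(article.h2),
        'date': text_or_none(article.p),
        'author': author,
        'content': paragraphs[-1].text if paragraphs else None,
    }

sample_soup = BeautifulSoup(sample_html, 'html.parser')
rows = [parse_news_safe(a) for a in sample_soup.select('.grid-cols-4')]
print(rows)
```

Missing fields come back as `None`, which pandas treats as missing values when the rows are loaded into a DataFrame.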


## Scraping People

In [28]:
response = requests.get('https://web-scraping-demo.zgulde.net/people', headers={'user-agent': 'Codeup DS Hopper'})
soup = BeautifulSoup(response.text, 'html.parser')

In [29]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   Example People Page
  </title>
  <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
  <link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
 </head>
 <body class="mx-auto max-w-screen-lg pb-32">
  <h1 class="my-5 text-4xl text-center">
   People
  </h1>
  <div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
   <p>
    <i class="bi bi-exclamation-circle text-xl">
    </i>
    All data on this page is strictly for demonstration purposes and fake.
   </p>
  </div>
  <div class="grid grid-cols-2 gap-x-12 gap-y-16" id="people">
   <div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
    <h2 class="text-2xl text-purp

In [36]:
people = soup.select('.person')
people

[<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-800 name col-span-full border-b">Christopher Todd</h2>
 <p class="quote col-span-full px-5 py-5 text-center text-gray-500">
             "Seamless zero administration workforce"
         </p>
 <div class="grid grid-cols-9">
 <i class="bi bi-envelope-fill text-purple-800"></i>
 <p class="email col-span-8">jamescourtney@hotmail.com</p>
 <i class="bi bi-telephone-fill text-purple-800"></i>
 <p class="phone col-span-8">2080990891</p>
 </div>
 <div class="address grid grid-cols-9">
 <i class="bi bi-geo-fill text-purple-800"></i>
 <p class="col-span-8">
                 2192 Kenneth Plain <br/>
                 Ellishaven, SD 48201
             </p>
 </div>
 </div>,
 <div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-800 name col-sp

In [37]:
person = people[0]
person

<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
<h2 class="text-2xl text-purple-800 name col-span-full border-b">Christopher Todd</h2>
<p class="quote col-span-full px-5 py-5 text-center text-gray-500">
            "Seamless zero administration workforce"
        </p>
<div class="grid grid-cols-9">
<i class="bi bi-envelope-fill text-purple-800"></i>
<p class="email col-span-8">jamescourtney@hotmail.com</p>
<i class="bi bi-telephone-fill text-purple-800"></i>
<p class="phone col-span-8">2080990891</p>
</div>
<div class="address grid grid-cols-9">
<i class="bi bi-geo-fill text-purple-800"></i>
<p class="col-span-8">
                2192 Kenneth Plain <br/>
                Ellishaven, SD 48201
            </p>
</div>
</div>

In [45]:
# isolate name
name = person.h2.text
name

'Christopher Todd'

In [58]:
# isolate quote
quote = person.p.text.strip().strip('"')
quote

'Seamless zero administration workforce'

In [98]:
# isolate email
person.select('.email')[0].text

'jamescourtney@hotmail.com'

In [99]:
# isolate phone
person.select('.phone')[0].text

'2080990891'

In [110]:
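# isolate address
import re
# the two address lines are separated by a newline plus indentation;
# replace each run of 2+ whitespace characters with ', '
re.sub(r'\s{2,}', ', ', person.select('.address')[0].text.strip())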

'2192 Kenneth Plain, Ellishaven, SD 48201'
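The regular expression above collapses the whitespace left behind by the `<br/>` tag. An alternative that avoids `re` is Beautiful Soup's `get_text` with a separator and `strip=True`, which joins the stripped text fragments directly. The sketch below compares the two on the address markup copied from the first person above; treat it as one possible approach rather than the notebook's method.

```python
import re
from bs4 import BeautifulSoup

# Address markup copied from the first person on the page
sample_html = """
<div class="address grid grid-cols-9">
<i class="bi bi-geo-fill text-purple-800"></i>
<p class="col-span-8">
                2192 Kenneth Plain <br/>
                Ellishaven, SD 48201
            </p>
</div>
"""
address_tag = BeautifulSoup(sample_html, 'html.parser').select_one('.address')

# Approach used in the notebook: collapse runs of whitespace with a regex
via_regex = re.sub(r'\s{2,}', ', ', address_tag.text.strip())

# Alternative: strip each text fragment and join the non-empty ones with ', '
via_get_text = address_tag.get_text(', ', strip=True)

print(via_regex)
print(via_get_text)
```

Both lines print `2192 Kenneth Plain, Ellishaven, SD 48201` for this markup.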

In [109]:
# combine steps into function
def parse_person(person):
    name = person.h2.text
    quote = person.p.text.strip().strip('"')
    email = person.select('.email')[0].text
    phone = person.select('.phone')[0].text
    address = re.sub(r'\s{2,}', ', ', person.select('.address')[0].text.strip())

    return {
        'name': name, 'quote': quote, 'email': email,
        'phone': phone,
        'address': address
    }
# test function
parse_person(person)

{'name': 'Christopher Todd',
 'quote': 'Seamless zero administration workforce',
 'email': 'jamescourtney@hotmail.com',
 'phone': '2080990891',
 'address': '2192 Kenneth Plain, Ellishaven, SD 48201'}

In [111]:
# loop through all the people
pd.DataFrame([parse_person(pers) for pers in people])

| | name | quote | email | phone | address |
|---|---|---|---|---|---|
| 0 | Christopher Todd | Seamless zero administration workforce | jamescourtney@hotmail.com | 2080990891 | 2192 Kenneth Plain, Ellishaven, SD 48201 |
| 1 | Jason Lewis | Cross-platform user-facing secured line | glenn21@gmail.com | +1-304-624-2556x2733 | 6766 Jose Cove Suite 009, New Ryan, UT 00945 |
| 2 | Marissa Schultz | Vision-oriented object-oriented paradigm | xlang@yahoo.com | +1-594-277-8390x31096 | 8710 Lynch Springs, Robertfort, NC 38892 |
| 3 | Angela Miller | Function-based background matrices | mitchelldaniel@gmail.com | +1-236-506-2547 | 68057 Stephanie Route, Matthewview, SD 73860 |
| 4 | Donna Morales | Multi-lateral exuding capability | monicafuller@yahoo.com | +1-297-017-7159x452 | 7450 Davis Fords, South Bridget, IL 45407 |
| 5 | Ricardo Hart | Cloned interactive approach | pfernandez@anderson.biz | 001-996-284-6269x91723 | 60157 Paula Plaza, Lake Nicole, MI 85996 |
| 6 | John Patel | Function-based motivating alliance | regina98@fox.com | 001-832-652-7073x1426 | 4802 Payne Parkway, East Kayla, AZ 63826 |
| 7 | Cathy Wilson | Right-sized real-time workforce | hunterashley@mcbride.com | 680.000.6346 | 2287 Bradley Cliffs Suite 159, Heathermouth, N... |
| 8 | Lisa Wright | Balanced intangible capability | lynnrivera@underwood.biz | (574)481-7388x073 | 787 Day Mission, New Troyton, NJ 61019 |
| 9 | Andrew Conley | Right-sized attitude-oriented strategy | brett10@hotmail.com | 001-561-136-1498x1274 | 52783 Edward Terrace Suite 509, New Joshua, AL... |


## Web Scraping Etiquette

- respect the `robots.txt` file if present

    * [Wikipedia: Robots exclusion standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard)
    * [robotstxt.org](http://www.robotstxt.org/robotstxt.html)
    * [Codeup's robots.txt](https://codeup.com/robots.txt)

- use a descriptive user agent

    ```python
    requests.get('http://example.com', headers={'user-agent': 'codeup data science germain cohort'})
    ```
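Checking `robots.txt` can be done in code with the standard library's `urllib.robotparser`. The sketch below parses a made-up set of rules inline so that it runs without network access; the rules and the `example.com` URLs are assumptions for illustration, not the contents of any real site's file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, written inline so the sketch runs offline.
# For a live site, call parser.set_url('https://example.com/robots.txt')
# followed by parser.read() instead of parser.parse(...).
robots_lines = [
    'User-agent: *',
    'Disallow: /private/',
]

parser = RobotFileParser()
parser.parse(robots_lines)

user_agent = 'codeup data science germain cohort'
allowed_blog = parser.can_fetch(user_agent, 'https://example.com/blog/')
allowed_private = parser.can_fetch(user_agent, 'https://example.com/private/page')

print(allowed_blog)     # the rules above do not block /blog/
print(allowed_private)  # /private/ is disallowed for every user agent
```

A scraper can call `can_fetch` before each request and skip any URL for which it returns `False`.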

## Exercises

#### Codeup Blog Articles

Visit [Codeup's Blog](http://codeup.com/blog/) and record the URLs for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content.

Encapsulate your work in a function named `get_blog_articles` that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

In [112]:
response = requests.get('https://codeup.com/blog/', headers={'user-agent': 'Codeup DS Hopper'})
soup = BeautifulSoup(response.text, 'html.parser')

In [113]:
response

<Response [200]>