# Data Acquisition with Web Scraping

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

`$ pip install beautifulsoup4`

## Soup Methods

- `soup.select('.class')`: to get all the elements with class `class`
- `soup.select_one('.class')`: to get the first element with class `class`
- `soup.h2`: to get the first `h2` element
- `soup.find_all('h2')`: to get all the elements with tag name of `h2`
- `soup('h2')`: same as the `find_all` method above
- `soup.find('h2')`: finds the first matching element
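To see how these methods differ, here is a minimal sketch run against a small hand-written HTML snippet. The snippet and its class names (`card`, `note`) are made up for illustration and are not taken from the demo site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html = """
<div class="card"><h2>first</h2><p class="note">a</p></div>
<div class="card"><h2>second</h2><p class="note">b</p></div>
"""
soup = BeautifulSoup(html, 'html.parser')

cards = soup.select('.card')           # list of every element with class "card"
first_note = soup.select_one('.note')  # first match only, or None if nothing matches
first_h2 = soup.h2                     # first <h2>, like soup.find('h2')
all_h2 = soup.find_all('h2')           # list of every <h2>

print(len(cards))                        # 2
print(first_note.text)                   # a
print(first_h2.text)                     # first
print([h.text for h in all_h2])          # ['first', 'second']
print([h.text for h in soup('h2')])      # ['first', 'second']
```

The list-returning methods (`select`, `find_all`) give an empty list when nothing matches, while the single-element ones (`select_one`, `find`) give `None`.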

First, make the request. The response body is the page's raw HTML, returned as one long string.

In [2]:
response = requests.get('https://web-scraping-demo.zgulde.net/news')
html = response.text
html

'<!DOCTYPE html>\n<html lang="en">\n<head>\n    <meta charset="UTF-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n    <title>News Example Page</title>\n    <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet" />\n    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" />\n</head>\n<body class="mx-auto max-w-screen-lg pb-32">\n    \n<h1 class="my-5 text-4xl text-center">News!</h1>\n<div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">\n    <p>\n        <i class="bi bi-exclamation-circle text-xl"></i>\n        All data on this page is strictly for demonstration purposes and fake.\n    </p>\n</div>\n<div class="grid gap-y-12">\n    \n    <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">\n        <img src="/static/placeholder.png" />\n        <d
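Before parsing, it can help to confirm that the request actually succeeded. A minimal sketch, assuming a helper named `fetch_html` (hypothetical, not part of the original notebook) and an arbitrary 10-second timeout:

```python
import requests

def fetch_html(url, get=requests.get):
    """Return the page's HTML, raising an error on a failed request.

    `get` defaults to requests.get; it is a parameter only so the
    function can be exercised without network access.
    """
    response = get(url, timeout=10)  # timeout value is an arbitrary choice
    response.raise_for_status()      # raises requests.HTTPError on 4xx/5xx
    return response.text

# Usage (performs a real request):
# html = fetch_html('https://web-scraping-demo.zgulde.net/news')
```

Failing early here avoids handing an error page to the parser and scraping nothing from it.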

We can make more sense of that HTML by parsing it with the Beautiful Soup library.

In [3]:
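# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')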
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   News Example Page
  </title>
  <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
  <link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
 </head>
 <body class="mx-auto max-w-screen-lg pb-32">
  <h1 class="my-5 text-4xl text-center">
   News!
  </h1>
  <div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
   <p>
    <i class="bi bi-exclamation-circle text-xl">
    </i>
    All data on this page is strictly for demonstration purposes and fake.
   </p>
  </div>
  <div class="grid gap-y-12">
   <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
    <img src="/static/placeholder.png"/>
    <div class="col-span-3 space-y-3 py-3

From here we can switch between the browser and Python, trying out different ways of extracting different parts of the HTML document.

In Google Chrome, right-clicking an element and choosing "Inspect" opens the developer tools. The element inspector shows the tags and class names we can target when scraping.

In [None]:
# Use beautifulsoup methods to extract necessary content from an article

In [4]:
articles = soup.select('.grid-cols-4')
articles

[<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
 <img src="/static/placeholder.png"/>
 <div class="col-span-3 space-y-3 py-3">
 <h2 class="text-2xl text-green-900">example believe medical</h2>
 <div class="grid grid-cols-2 italic">
 <p> 1993-01-24 </p>
 <p class="text-right">By David Brown </p>
 </div>
 <p>Accept tax however cup how ball. Member set something teach star never. Single about game expert.
 Effort simply student wonder rule. Particular pressure hold foot newspaper well new. In account image thousand.</p>
 </div>
 </div>,
 <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
 <img src="/static/placeholder.png"/>
 <div class="col-span-3 space-y-3 py-3">
 <h2 class="text-2xl text-green-900">still writer walk</h2>
 <div class="grid grid-cols-2 italic">
 <p> 1983-11-05 </p>
 <p class="text-right">By Allison White </p>
 </div>
 <p>Region read yourself. Model 

In [5]:
article_one = articles[0]
article_one

<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
<img src="/static/placeholder.png"/>
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">example believe medical</h2>
<div class="grid grid-cols-2 italic">
<p> 1993-01-24 </p>
<p class="text-right">By David Brown </p>
</div>
<p>Accept tax however cup how ball. Member set something teach star never. Single about game expert.
Effort simply student wonder rule. Particular pressure hold foot newspaper well new. In account image thousand.</p>
</div>
</div>

In [6]:
headline_one = article_one.h2.text
headline_one

'example believe medical'

In [7]:
date_one = article_one.p.text.strip()
date_one

'1993-01-24'

In [19]:
article_one.select('.text-right')[0].text.strip().replace('By ', '')

'David Brown'

In [20]:
# author = article_one.select('.text-right')[0].text.strip()[3:]
# author
author_one = article_one.select('.text-right')[0].text.strip().replace('By ', '')
author_one

'David Brown'
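Both the slice and the `replace` call work for this page. As an alternative sketch (assuming Python 3.9 or later), `str.removeprefix` strips the label only when it appears at the start of the string, whereas `replace` removes every occurrence and `[3:]` drops three characters unconditionally. The second and third bylines below are invented to show the difference.

```python
byline = 'By David Brown'
author = byline.removeprefix('By ')
print(author)                        # David Brown

# Invented byline that contains 'By ' a second time
tricky = 'By StandBy Media'
print(tricky.replace('By ', ''))     # StandMedia
print(tricky.removeprefix('By '))    # StandBy Media

# Invented byline with no label at all
no_label = 'David Brown'
print(no_label[3:])                  # id Brown
print(no_label.removeprefix('By '))  # David Brown
```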

In [24]:
article_one.select('p')[-1].text

'Accept tax however cup how ball. Member set something teach star never. Single about game expert.\nEffort simply student wonder rule. Particular pressure hold foot newspaper well new. In account image thousand.'

In [25]:
content_one = article_one.select('p')[-1].text
content_one

'Accept tax however cup how ball. Member set something teach star never. Single about game expert.\nEffort simply student wonder rule. Particular pressure hold foot newspaper well new. In account image thousand.'

Bringing it all together: Make a function

In [26]:
def parse_news(article):
    headline = article.h2.text
    # the first <p> in each article holds the date
    date = article.p.text.strip()
    # drop the leading 'By ' from the byline
    author = article.select('.text-right')[0].text.strip()[3:]
    # the last <p> holds the article body
    content = article.select('p')[-1].text

    return {
        'headline': headline, 'date': date, 'author': author,
        'content': content
    }

In [27]:
parse_news(article_one)

{'headline': 'example believe medical',
 'date': '1993-01-24',
 'author': 'David Brown',
 'content': 'Accept tax however cup how ball. Member set something teach star never. Single about game expert.\nEffort simply student wonder rule. Particular pressure hold foot newspaper well new. In account image thousand.'}
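`parse_news` assumes every article contains every element; a missing `h2` or byline would raise an error. A hedged variant, using the hypothetical names `text_or_none` and `parse_news_safe`, returns `None` for a missing piece instead. The first snippet is an abridged copy of the first article's markup, and the second is invented to show the missing-element case.

```python
from bs4 import BeautifulSoup

def text_or_none(element):
    """Return stripped text for a tag, or None when the tag was not found."""
    return element.get_text(strip=True) if element is not None else None

def parse_news_safe(article):
    paragraphs = article.select('p')
    byline = text_or_none(article.select_one('.text-right'))
    return {
        'headline': text_or_none(article.select_one('h2')),
        'date': text_or_none(paragraphs[0]) if paragraphs else None,
        # str.removeprefix needs Python 3.9+
        'author': byline.removeprefix('By ') if byline else None,
        'content': paragraphs[-1].get_text() if paragraphs else None,
    }

# Abridged markup from the first article on the demo page
complete = BeautifulSoup('''
<div class="grid grid-cols-4">
<h2 class="text-2xl text-green-900">example believe medical</h2>
<div class="grid grid-cols-2 italic">
<p> 1993-01-24 </p>
<p class="text-right">By David Brown </p>
</div>
<p>Accept tax however cup how ball.</p>
</div>''', 'html.parser')

# Invented markup with no byline and no paragraphs
partial = BeautifulSoup('<div><h2>untitled</h2></div>', 'html.parser')

print(parse_news_safe(complete))
print(parse_news_safe(partial))
```

Whether to return `None`, skip the article, or raise is a design choice; returning `None` keeps every row in the resulting table so the gaps stay visible.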

In [29]:
[parse_news(article) for article in articles]

[{'headline': 'example believe medical',
  'date': '1993-01-24',
  'author': 'David Brown',
  'content': 'Accept tax however cup how ball. Member set something teach star never. Single about game expert.\nEffort simply student wonder rule. Particular pressure hold foot newspaper well new. In account image thousand.'},
 {'headline': 'still writer walk',
  'date': '1983-11-05',
  'author': 'Allison White',
  'content': 'Region read yourself. Model plan cost adult. Trade exist worker national history together.\nEver responsibility help provide church. Look out media. Bar onto door.'},
 {'headline': 'range because number',
  'date': '1978-10-26',
  'author': 'Aaron Luna',
  'content': 'Phone alone TV general growth special. Fine economic issue as realize year director experience.\nHuge least as strong. Still job run.'},
 {'headline': 'opportunity charge mouth',
  'date': '2000-02-15',
  'author': 'Louis Lee',
  'content': 'Wide local few dark which tonight investment money. Relate down aud

In [30]:
# loop through all the articles
pd.DataFrame([parse_news(article) for article in articles])

| | headline | date | author | content |
|---|---|---|---|---|
| 0 | example believe medical | 1993-01-24 | David Brown | Accept tax however cup how ball. Member set so... |
| 1 | still writer walk | 1983-11-05 | Allison White | Region read yourself. Model plan cost adult. T... |
| 2 | range because number | 1978-10-26 | Aaron Luna | Phone alone TV general growth special. Fine ec... |
| 3 | opportunity charge mouth | 2000-02-15 | Louis Lee | Wide local few dark which tonight investment m... |
| 4 | red I pull | 2012-03-30 | Jessica Houston | Other day they. Full glass tend.\nHow difficul... |
| 5 | loss involve traditional | 1987-06-10 | Anne Tucker | Stop subject own sea road several produce. Rea... |
| 6 | off mother subject | 1970-04-05 | David Saunders | Add they herself where major. Recently north h... |
| 7 | class me ago | 2002-06-29 | Cynthia Cruz DDS | Across box own leader while.\nStage probably s... |
| 8 | concern ago just | 1977-10-24 | Olivia Perez | Attack hard us south impact. Increase lot sout... |
| 9 | staff politics blood | 2019-02-09 | Charles Griffin | Between city use short best. True artist belie... |
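The `date` column comes back as plain strings. If sorting or date arithmetic is needed, one option is to convert it with `pd.to_datetime`. A small sketch using three rows copied from the output above (the `content` column is left out for brevity):

```python
import pandas as pd

rows = [
    {'headline': 'example believe medical', 'date': '1993-01-24', 'author': 'David Brown'},
    {'headline': 'still writer walk', 'date': '1983-11-05', 'author': 'Allison White'},
    {'headline': 'range because number', 'date': '1978-10-26', 'author': 'Aaron Luna'},
]
df = pd.DataFrame(rows)

# Parse the YYYY-MM-DD strings into datetime64 values
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

oldest = df.sort_values('date').iloc[0]
print(oldest['headline'])            # range because number
print(df['date'].dt.year.tolist())   # [1993, 1983, 1978]
```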


## Scraping People

In [31]:
requests.get('https://web-scraping-demo.zgulde.net/people')

<Response [200]>

In [32]:
requests.get('https://web-scraping-demo.zgulde.net/people').text

'<!DOCTYPE html>\n<html lang="en">\n<head>\n    <meta charset="UTF-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n    <title>Example People Page</title>\n    <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet" />\n    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" />\n</head>\n<body class="mx-auto max-w-screen-lg pb-32">\n    \n<h1 class="my-5 text-4xl text-center">People</h1>\n\n<div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">\n    <p>\n        <i class="bi bi-exclamation-circle text-xl"></i>\n        All data on this page is strictly for demonstration purposes and fake.\n    </p>\n</div>\n\n<div id="people" class="grid grid-cols-2 gap-x-12 gap-y-16">\n    \n    <div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">\n    

In [33]:
BeautifulSoup(requests.get('https://web-scraping-demo.zgulde.net/people').text, 'html.parser')

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>Example People Page</title>
<link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
<link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
</head>
<body class="mx-auto max-w-screen-lg pb-32">
<h1 class="my-5 text-4xl text-center">People</h1>
<div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
<p>
<i class="bi bi-exclamation-circle text-xl"></i>
        All data on this page is strictly for demonstration purposes and fake.
    </p>
</div>
<div class="grid grid-cols-2 gap-x-12 gap-y-16" id="people">
<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
<h2 class="text-2xl text-purple-800 name col-span-full border-b">Erika Frazier</

In [38]:
response = requests.get('https://web-scraping-demo.zgulde.net/people', headers={'user-agent': 'Codeup DS Hopper'})
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   Example People Page
  </title>
  <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
  <link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
 </head>
 <body class="mx-auto max-w-screen-lg pb-32">
  <h1 class="my-5 text-4xl text-center">
   People
  </h1>
  <div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
   <p>
    <i class="bi bi-exclamation-circle text-xl">
    </i>
    All data on this page is strictly for demonstration purposes and fake.
   </p>
  </div>
  <div class="grid grid-cols-2 gap-x-12 gap-y-16" id="people">
   <div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
    <h2 class="text-2xl text-purp

In [42]:
people = soup.select('.person')
people

[<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-800 name col-span-full border-b">Jesse Tucker</h2>
 <p class="quote col-span-full px-5 py-5 text-center text-gray-500">
             "Visionary foreground interface"
         </p>
 <div class="grid grid-cols-9">
 <i class="bi bi-envelope-fill text-purple-800"></i>
 <p class="email col-span-8">mannpeter@simmons-jensen.net</p>
 <i class="bi bi-telephone-fill text-purple-800"></i>
 <p class="phone col-span-8">894-426-0860x15929</p>
 </div>
 <div class="address grid grid-cols-9">
 <i class="bi bi-geo-fill text-purple-800"></i>
 <p class="col-span-8">
                 1552 Patricia Gateway <br/>
                 South Jose, PA 66084
             </p>
 </div>
 </div>,
 <div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
 <h2 class="text-2xl text-purple-800 name col-

In [44]:
person = people[0]
person

<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
<h2 class="text-2xl text-purple-800 name col-span-full border-b">Jesse Tucker</h2>
<p class="quote col-span-full px-5 py-5 text-center text-gray-500">
            "Visionary foreground interface"
        </p>
<div class="grid grid-cols-9">
<i class="bi bi-envelope-fill text-purple-800"></i>
<p class="email col-span-8">mannpeter@simmons-jensen.net</p>
<i class="bi bi-telephone-fill text-purple-800"></i>
<p class="phone col-span-8">894-426-0860x15929</p>
</div>
<div class="address grid grid-cols-9">
<i class="bi bi-geo-fill text-purple-800"></i>
<p class="col-span-8">
                1552 Patricia Gateway <br/>
                South Jose, PA 66084
            </p>
</div>
</div>

In [49]:
name = person.h2.text
name

'Jesse Tucker'

In [64]:
person.select('.quote')[0].text.strip().replace('\"', '')

'Visionary foreground interface'

In [72]:
person.select('.email')[0].text

'mannpeter@simmons-jensen.net'

In [77]:
# phone no
person.select('.phone')[0].text.split('x')[0]

'894-426-0860'

In [78]:
# ext
person.select('.phone')[0].text.split('x')[1]

'15929'
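Indexing `split('x')[1]` raises an `IndexError` when a number has no extension. A sketch of a more forgiving split using `str.partition`; the helper name `split_phone` is hypothetical, and the second number is simply an example of a string with no `x` in it.

```python
def split_phone(raw):
    """Split 'number x extension' into (number, extension or None)."""
    number, _, ext = raw.strip().partition('x')
    return number, ext or None

print(split_phone('894-426-0860x15929'))  # ('894-426-0860', '15929')
print(split_phone('950-002-2608'))        # ('950-002-2608', None)
```

`partition` always returns three parts, so the unpacking is safe whether or not the separator is present.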

In [125]:
# address
person.select('.address')[0].text.strip().replace('\n', '').replace('  ', '')

'1552 Patricia Gateway South Jose, PA 66084'
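The chained `replace` calls work here, but removing pairs of spaces depends on how the page happens to be indented. An alternative sketch that collapses any run of whitespace (spaces or newlines) into a single space, using the address markup from the first person above:

```python
from bs4 import BeautifulSoup

# Address markup copied from the first person on the demo page
html = '''<div class="address grid grid-cols-9">
<i class="bi bi-geo-fill text-purple-800"></i>
<p class="col-span-8">
                1552 Patricia Gateway <br/>
                South Jose, PA 66084
            </p>
</div>'''
address_tag = BeautifulSoup(html, 'html.parser').select_one('.address')

# str.split() with no arguments splits on any run of whitespace,
# so joining the pieces with one space normalizes the text
address = ' '.join(address_tag.get_text().split())
print(address)  # 1552 Patricia Gateway South Jose, PA 66084
```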

In [127]:
def parse_person(person):
    name = person.h2.text
    # drop the double quotes that surround the quote text
    quote = person.select('.quote')[0].text.strip().replace('"', '')
    email = person.select('.email')[0].text
    # keep the part before the extension and normalize separators to '-'
    phone = (
        person.select('.phone')[0].text.split('x')[0]
        .replace('(', '').replace(')', ' ')
        .replace('.', '-').replace(' ', '-')
    )
    # ext = person.select('.phone')[0].text.split('x')[1]
    # remove the line break and indentation inside the address
    address = person.select('.address')[0].text.strip().replace('\n', '').replace('  ', '')

    return {
        'name': name.lower(), 'quote': quote.lower(), 'email': email.lower(),
        'phone': phone.lower(),  # 'ext': ext
        'address': address
    }

In [128]:
parse_person(person)

{'name': 'jesse tucker',
 'quote': 'visionary foreground interface',
 'email': 'mannpeter@simmons-jensen.net',
 'phone': '894-426-0860',
 'address': '1552 Patricia Gateway South Jose, PA 66084'}

In [129]:
# loop through all the persons
[parse_person(person) for person in people]

[{'name': 'jesse tucker',
  'quote': 'visionary foreground interface',
  'email': 'mannpeter@simmons-jensen.net',
  'phone': '894-426-0860',
  'address': '1552 Patricia Gateway South Jose, PA 66084'},
 {'name': 'molly lopez',
  'quote': 'reactive transitional core',
  'email': 'kking@hotmail.com',
  'phone': '950-002-2608',
  'address': '9967 Scott Ridge Guerrashire, AK 97431'},
 {'name': 'misty torres',
  'quote': 'sharable high-level structure',
  'email': 'vboyd@miller.com',
  'phone': '690-672-5246',
  'address': '08582 William River Lake Anitatown, DC 82582'},
 {'name': 'samuel robinson',
  'quote': 'progressive methodical customer loyalty',
  'email': 'heather05@jackson.com',
  'phone': '001-537-965-5807',
  'address': '224 Carney Parkways Suite 651 South Oliviahaven, KS 23905'},
 {'name': 'christopher williamson',
  'quote': 'self-enabling eco-centric methodology',
  'email': 'wraymond@english.com',
  'phone': '511-984-8371',
  'address': '36238 James Court Suite 237 Zacharyview

In [130]:
pd.DataFrame([parse_person(person) for person in people])

| | name | quote | email | phone | address |
|---|---|---|---|---|---|
| 0 | jesse tucker | visionary foreground interface | mannpeter@simmons-jensen.net | 894-426-0860 | 1552 Patricia Gateway South Jose, PA 66084 |
| 1 | molly lopez | reactive transitional core | kking@hotmail.com | 950-002-2608 | 9967 Scott Ridge Guerrashire, AK 97431 |
| 2 | misty torres | sharable high-level structure | vboyd@miller.com | 690-672-5246 | 08582 William River Lake Anitatown, DC 82582 |
| 3 | samuel robinson | progressive methodical customer loyalty | heather05@jackson.com | 001-537-965-5807 | 224 Carney Parkways Suite 651 South Oliviahave... |
| 4 | christopher williamson | self-enabling eco-centric methodology | wraymond@english.com | 511-984-8371 | 36238 James Court Suite 237 Zacharyview, OH 57289 |
| 5 | rhonda davis | configurable zero-defect database | ismith@yahoo.com | 698-250-6267 | 967 Fowler Locks Apt. 599 East Kyle, VT 86580 |
| 6 | catherine hill | realigned dynamic budgetary management | alejandro48@yahoo.com | 092-335-8030 | 6890 Barbara Vista Apt. 497 West Jessica, ID 3... |
| 7 | mr. daniel thompson dvm | organic user-facing frame | gwendolyn26@james-roman.net | 034-610-6618 | 01121 Amber Meadow Suite 315 Kathleenborough, ... |
| 8 | ashley frederick | public-key methodical hub | batesjose@hotmail.com | 631-829-9244 | 70935 Horne Track Suite 354 Kimberlyborough, S... |
| 9 | adam neal | synergized static complexity | castillojoanne@rose.com | 890-005-5735 | 08447 Andrade Plaza Suite 909 Travishaven, CA ... |


## Web Scraping Etiquette

- respect the `robots.txt` file if present

    * [Wikipedia: Robots exclusion standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard)
    * [robotstxt.org](http://www.robotstxt.org/robotstxt.html)
    * [Codeup's robots.txt](https://codeup.com/robots.txt)

- use a descriptive user agent

    ```python
    requests.get('http://example.com', headers={'user-agent': 'codeup data science germain cohort'})
    ```
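Checking `robots.txt` can be done in code with the standard library's `urllib.robotparser`. The sketch below parses a small, made-up set of rules so it runs without network access; the rules and the `example.com` URLs are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; a real scraper would load the site's own file with
# parser.set_url('https://example.com/robots.txt') followed by parser.read()
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

user_agent = 'codeup data science germain cohort'
print(parser.can_fetch(user_agent, 'https://example.com/blog/'))      # True
print(parser.can_fetch(user_agent, 'https://example.com/private/x'))  # False
```

Calling `can_fetch` before each request, with the same user agent string sent in the headers, keeps the scraper within what the site allows.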

## Exercises

#### Codeup Blog Articles

Visit [Codeup's Blog](http://codeup.com/blog/) and record the URLs for at least 5 distinct blog posts. For each post, you should scrape at least the post's title and content.

Encapsulate your work in a function named `get_blog_articles` that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

In [None]:
response = requests.get('https://codeup.com/blog/', headers={'user-agent': 'Codeup DS Hopper'})
soup = BeautifulSoup(response.text, 'html.parser')