# Data Acquisition with Web Scraping

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

First make the request. The response is a bunch of html.

In [3]:
response = requests.get('https://web-scraping-demo.zgulde.net/news')
html = response.text
html

'<!DOCTYPE html>\n<html lang="en">\n<head>\n    <meta charset="UTF-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n    <title>News Example Page</title>\n    <link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet" />\n    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" />\n</head>\n<body class="mx-auto max-w-screen-lg pb-32">\n    \n<h1 class="my-5 text-4xl text-center">News!</h1>\n<div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">\n    <p>\n        <i class="bi bi-exclamation-circle text-xl"></i>\n        All data on this page is strictly for demonstration purposes and fake.\n    </p>\n</div>\n<div class="grid gap-y-12">\n    \n    <div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">\n        <img src="/static/placeholder.png" />\n        <d

We can make more sense of that html with the beautiful soup library.

In [4]:
soup = BeautifulSoup(html)
soup

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>News Example Page</title>
<link href="https://unpkg.com/tailwindcss@^2/dist/tailwind.min.css" rel="stylesheet"/>
<link href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.4.1/font/bootstrap-icons.css" rel="stylesheet"/>
</head>
<body class="mx-auto max-w-screen-lg pb-32">
<h1 class="my-5 text-4xl text-center">News!</h1>
<div class="my-5 text-red-800 px-5 py-3 bg-red-100 font-bold">
<p>
<i class="bi bi-exclamation-circle text-xl"></i>
        All data on this page is strictly for demonstration purposes and fake.
    </p>
</div>
<div class="grid gap-y-12">
<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
<img src="/static/placeholder.png"/>
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">knowledge life b

From here we can switch between the browser and python and try out different ways of getting different parts of the html document.

We can leverage Google Chrome's developer tools by right clicking and choosing "Inspect". We can then use this html document inspector to help us with our web scraping.

In [5]:
articles = soup.select('.grid.grid-cols-4.gap-x-4.border')
articles[0].select('.italic')[0].select('p')

[<p> 1975-08-28 </p>, <p class="text-right">By Brett Leach </p>]

In [7]:
article = articles[0]

In [11]:
article.find_all('p')[-1].text

'Writer people less defense. Data once year region spend street.\nEvening consider from rate guess. Style opportunity bar law material. Be purpose picture make left. Hair north million create various.'

Bringing it all together:

In [5]:
def process_article(article):
    date, author = articles[0].select('.italic')[0].select('p')
    return {
        'title': article.h2.text,
        'date': date.text,
        'author': author.text
    }

pd.DataFrame([process_article(article) for article in articles])

Unnamed: 0,title,date,author
0,address gun different,2020-01-13,By Curtis Watkins
1,image candidate window,2020-01-13,By Curtis Watkins
2,necessary part sense,2020-01-13,By Curtis Watkins
3,major example know,2020-01-13,By Curtis Watkins
4,the both poor,2020-01-13,By Curtis Watkins
5,laugh life spring,2020-01-13,By Curtis Watkins
6,up mention house,2020-01-13,By Curtis Watkins
7,onto grow theory,2020-01-13,By Curtis Watkins
8,wide agree card,2020-01-13,By Curtis Watkins
9,fact successful sense,2020-01-13,By Curtis Watkins


## Soup Methods

- `soup.select('.class')`: to get all the elements with class `class`
- `soup.select_one('.class')`: to get the first element with class `class`
- `soup.h2`: to get the first `h2` element
- `soup.find_all('p')`: to get all the elements with tag name of `p`

## Scraping People

In [21]:
response = requests.get('https://web-scraping-demo.zgulde.net/people', headers={'user-agent': 'Codeup DS Germain'})
soup = BeautifulSoup(response.text)

In [27]:
people = soup.select('.person')

In [28]:
person = people[0]

In [73]:
import re


def parse_person(person):
    name = person.h2.text
    # .p finds the first p element; or element with a tag name of "p"
    quote = person.p.text.strip()
    # email
    email = person.select('.email')[0].text
    # phone
    phone = person.select('.phone')[0].text
    # address
    address = person.select('.address')[0].text.strip()
    address = re.sub(r'\s{2,}', ' ', address)
    
    return {'name': name, 'quote': quote, 'email': email, 'phone': phone, 'address': address}

In [80]:
pd.DataFrame([parse_person(person) for person in people])

Unnamed: 0,name,quote,email,phone,address
0,Andre Griffin,"""Phased leadingedge policy""",thomaskim@gmail.com,(436)679-1646x0120,"060 Kathy Mountain Suite 012 Daltonchester, NH..."
1,Maria Thompson,"""Universal exuding synergy""",llee@torres.com,(123)875-0068x72236,"46885 Lori Locks Lucasside, TN 85690"
2,Sarah Morales DDS,"""Exclusive leadingedge array""",gonzalezangela@holt.com,(567)108-2583,"192 Sanders Wall Ritafurt, MI 20512"
3,Charles Goodman,"""Diverse actuating monitoring""",amber48@gates.info,1452733117,"058 Lynn Causeway East Jessicaberg, NM 02370"
4,Jeremy Owens,"""Profound multi-tasking project""",michelle07@walls.com,+1-249-206-1988,"6144 Christina Locks Williambury, WY 38484"
5,Rebecca Jackson,"""Monitored methodical knowledge user""",uholland@rubio.com,248-719-3324x74895,"378 Zavala Camp Port Rebeccafurt, MS 37657"
6,Carol Ward,"""Programmable explicit frame""",joanna31@hotmail.com,001-726-765-3318x78956,"982 Davis Mill Apt. 088 West Christineshire, M..."
7,Amanda Johnson,"""Optimized even-keeled help-desk""",ryan77@williams-ross.org,+1-201-149-5239,"581 Holmes Islands Apt. 776 West Denise, MI 47197"
8,Matthew Cruz,"""Organized fault-tolerant secured line""",ophillips@hotmail.com,+1-161-190-2315x11798,"5735 Goodwin Extension Suite 339 Markland, ID ..."
9,George Mcmillan,"""Fully-configurable high-level firmware""",garciataylor@gmail.com,643.855.9056x93770,"11309 Wilson Plain Apt. 385 Kellyburgh, ID 08407"


## Web Scraping Etiquitte

- respect the `robots.txt` file if present

    * [Wikipedia: Robots exclusion standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard)
    * [robotstxt.org](http://www.robotstxt.org/robotstxt.html)
    * [codeup's robots.txt](https://codeup.com/robots.txt)

- use a descriptive user agent

    ```python
    requests.get('http://example.com', headers={'user-agent': 'codeup data science germain cohort'})
    ```