# Web Scraping
<br>
## Requests, BeautifulSoup & PyMongo
<br>
### Mark Llorente, borrowed heavily from Cary Goltermann

# Afternoon Objectives

1. Understand the process of getting data from the web.
2. HTML/CSS for scraping:
    * Connect to web pages from Python
    * Parse HTML in Python
3. Format data for insertion into Mongo.
4. Be able to use existing APIs to fetch pre-formatted data.

### Internet vs. World Wide Web

* The internet is commonly referred to as a network of networks. It is the infrastructure that allows networks all around the world to connect with one another. There are many different protocols to transfer information within this larger, meta-network.
* The World Wide Web, or Web, provides one of the ways that data can be transferred over the internet. It uses a **U**niform **R**esource **L**ocator, URL, to specify the location, within the internet, of a document.

<center><img src="images/url.png" style="width: 600px"></center>
    
* Documents on the web are generally written in **H**yper**T**ext **M**arkup **L**anguage, HTML, which can be natively viewed by browsers, the tool that we use to browse the web.
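
To make the pieces of a URL concrete, here is a small sketch that splits one apart with the standard library. The URL itself is hypothetical and was chosen only so that every component is present.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL, used only to show each component.
url = "https://www.example.com:8080/path/to/page.html?key=value&page=2#section"

parts = urlsplit(url)
print(parts.scheme)            # protocol used to talk to the server
print(parts.hostname)          # the machine that holds the document
print(parts.port)              # port on that machine (often omitted)
print(parts.path)              # where the document lives on the server
print(parse_qs(parts.query))   # optional parameters, parsed into a dict of lists
print(parts.fragment)          # a location inside the document
```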

### Communication on the Web

* Information is transmitted around the web through a number of protocols. The main one that you will see is the **H**yper**T**ext **T**ransfer **P**rotocol, HTTP.
* These transfers, called **requests**, are initiated in a number of ways, but always begin with the client, read: you at your browser.

<center><img src="images/requests.png" style="width: 600px"></center>

## Requests

There are 4 main types of request that can be issued by your browser: 
* get
* post
* put
* delete

For web scraping purposes, you will almost always be using get requests. We will learn some more about the others in a couple of weeks during data products day.
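
To see how the four request types differ, the sketch below builds one of each with the `requests` library without sending anything over the network. The `example.com` endpoint and the form fields are hypothetical.

```python
import requests

# Hypothetical endpoint; the requests are only prepared, never sent.
url = "https://example.com/events"

prepared = {
    "get": requests.Request("GET", url, params={"day": "monday"}).prepare(),
    "post": requests.Request("POST", url, data={"title": "Trivia"}).prepare(),
    "put": requests.Request("PUT", url + "/1", data={"title": "Trivia Night"}).prepare(),
    "delete": requests.Request("DELETE", url + "/1").prepare(),
}

for name, req in prepared.items():
    # A get carries its parameters in the URL; post and put carry them in the body.
    print(name, req.method, req.url, req.body)
```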

## Scraping from a Web Page with Python

* Scraping a web site basically comes down to making a request from Python and parsing through the HTML that is returned from each page.
* For each of these tasks we have a Python library:
    * `requests`, for making...well requests, and
    * `bs4`, aka BeautifulSoup.

## Requests Library

* The [requests](http://docs.python-requests.org/en/latest/index.html) library is designed to simplify the process of making HTTP requests within Python.
* The interface is mindbogglingly simple:
    1. Call the function named after the type of request, which will mostly be `get`, with the URL and any optional parameters you'd like passed along with the request.
    2. The response object that comes back makes the results of the request available via attributes/methods.

In [2]:
import requests
fun_cheap = 'http://sf.funcheap.com/'
r = requests.get(fun_cheap)
r.text[:1000] # First 1000 characters of the HTML, stored on the "text" attribute

'\r\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="https://www.w3.org/1999/xhtml" lang="en-US" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/" prefix="og: http://ogp.me/ns#">\n\n<head profile="https://gmpg.org/xfn/11">\n<script src="//cdn.optimizely.com/js/195632799.js"></script>\n\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\n\n\n<title>Funcheap | Free Events &amp; Things to Do in San Francisco</title>\n\n<meta name="generator" content="WordPress" /> <!-- leave this for stats -->\n\n<link rel="stylesheet" href="https://cdn.funcheap.com/wp-content/themes/arthemia-premium/style.css?v=1.8.10" type="text/css" media="screen" />\n<link rel="stylesheet" href="https://cdn.funcheap.com/wp-content/themes/arthemia-premium/madmenu.css?v=1.1" type="text/css" media="screen" />\n<!--[if IE 6]>\n    <style type="text/css">\n    body {\n     
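
The `text` attribute is only one of the things a response exposes. As a minimal sketch, a scraper will usually also want to set a timeout and check that the request succeeded before parsing anything. The helper name `fetch_html` and the 10 second timeout are assumptions made for illustration.

```python
import requests

def fetch_html(url, session=None, timeout=10):
    """Return the HTML of `url`, raising if the server reports an error."""
    session = session or requests.Session()
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx status
    return response.text

# Usage (makes a real network request):
# html = fetch_html('http://sf.funcheap.com/')
# print(html[:200])
```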

## Getting Info from a Web Page

* Now that we can gain easy access to the HTML for a web page, we need some way to pull the desired content from it. Luckily there is already a system in place to do this.
* With a combination of HTML and CSS selectors we can identify the information on an HTML page that we wish to retrieve and grab it with [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree).

# CSS Selectors

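* Cascading Style Sheets, or CSS, is the "language" used to make the items in a web page look better.
* CSS allows for adding properties to specific parts of a page, generally divided up into tags, e.g. `<tag> SOME STUFF </tag>`. A CSS selector is the pattern that picks out which tags a style applies to, and the same patterns let us pick out content when scraping.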
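
As a rough sketch of what selectors look like in practice, here are the most common ones applied with BeautifulSoup's `select` method. The HTML snippet, class names and ids are made up for illustration and do not come from any real page.

```python
from bs4 import BeautifulSoup

# Made-up HTML, used only to demonstrate selectors.
html = """
<div id="events">
  <h2 class="title">Events</h2>
  <div class="event free"><a href="/a">Concert</a><span class="cost">FREE</span></div>
  <div class="event"><a href="/b">Trivia</a><span class="cost">$5</span></div>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

print(demo_soup.select("h2"))                # by tag name
print(demo_soup.select(".cost"))             # by class
print(demo_soup.select("#events"))           # by id
print(demo_soup.select("div.event.free a"))  # <a> inside a div that has both classes
print([a["href"] for a in demo_soup.select("div.event > a")])  # direct children only
```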

# Very cool resource for learning about CSS selectors:
* http://flukeout.github.io/
* Let's work through some of these breakout style

## sf.funcheap.com

* Say we want to scrape all of the events, their description and cost.
* How do we do this?

### Let's take a look at the site

What are we looking for:
* Patterns in the way the site is put together.
* If we can find a pattern we can algorithmically exploit it to get our scraping done.
* Live demo of chrome dev tools!
* [sffuncheap](http://sf.funcheap.com)

### Notes
* If you look around the home page you'll see the current day and days in the near future.
* If you click on one of those days the url will become: `sf.funcheap.com/<year>/<month>/<day>`
* This can be exploited.
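
One way to exploit the pattern, sketched under the assumption that every day follows the `<year>/<month>/<day>` layout with zero-padded numbers, is to generate the URLs from dates. The helper names `day_url` and `day_urls` are made up for this example.

```python
from datetime import date, timedelta

BASE_URL = "http://sf.funcheap.com/"

def day_url(day):
    """Build the listing URL for a single date."""
    return BASE_URL + day.strftime("%Y/%m/%d")

def day_urls(start, n_days):
    """Build listing URLs for n_days consecutive days, beginning at start."""
    return [day_url(start + timedelta(days=i)) for i in range(n_days)]

urls = day_urls(date(2018, 9, 17), 3)
for url in urls:
    print(url)
```

Each of these URLs can then be passed to `requests.get` in a loop.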

In [3]:
from bs4 import BeautifulSoup
r = requests.get(fun_cheap + '2018/09/17')
soup = BeautifulSoup(r.text, 'html.parser')
type(soup)

bs4.BeautifulSoup

## What is this SOUP?
### And why is it BEAUTIFUL??

Well, we can do some fancy HTML parsing/CSS selecting effortlessly.

In [4]:
soup.select('h2.title')
# Or we could do it like:

[<h2 class="title">Events for  September 17, 2018</h2>]

In [5]:
title = soup.find_all('h2', class_='title')[0] # Going to use this later
title

<h2 class="title">Events for  September 17, 2018</h2>

## Back to the task at hand

* Want to scrape all of the events, their description and cost.
* How do we do this?
* What if we just inspect the page looking for a tag that has the info we want!

In [5]:
tanbox_lefts = soup.find_all('div', class_='tanbox left')
print(type(tanbox_lefts))
tanbox_lefts[0]

<class 'bs4.element.ResultSet'>


<div class="tanbox left" style="background-color:white;">\n<span class="title"><a href="http://sf.funcheap.com/clover-sonoma-milk-tasting-room-popup-free-cereal-coffee-closing-day/" rel="bookmark" title="Clover Sonoma Milk Tasting Room Pop-Up: Free Cereal &amp; Coffee | Final Day">Clover Sonoma Milk Tasting Room Pop-Up: Free Cereal &amp; Coffee | Final Day</a></span>\n<div class="meta archive-meta">Monday, July 17 \x96 8:00 am | \n\n\n<span class="cost">Cost: FREE</span> | <span>Clover Sonoma Milk Tasting Room Pop-up</span> </div>\n<div class="thumbnail-wrapper"><a href="http://sf.funcheap.com/clover-sonoma-milk-tasting-room-popup-free-cereal-coffee-closing-day/" rel="bookmark" title="Clover Sonoma Milk Tasting Room Pop-Up: Free Cereal &amp; Coffee | Final Day"><img alt="Clover Sonoma Milk Tasting Room Pop-Up: Free Cereal &amp; Coffee | Final Day" class="left" src="http://cdn.funcheap.com/wp-content/uploads/2017/06/Screen-Shot-2017-06-06-at-10.40.54-AM1-175x130.png"/></a></div><p>Clove

## Did I Find the Right Things?

In [6]:
tbl1 = tanbox_lefts[0]
tbl1.find('a') # Tags returned by BeautifulSoup can themselves be searched in the same way

<a href="http://sf.funcheap.com/clover-sonoma-milk-tasting-room-popup-free-cereal-coffee-closing-day/" rel="bookmark" title="Clover Sonoma Milk Tasting Room Pop-Up: Free Cereal &amp; Coffee | Final Day">Clover Sonoma Milk Tasting Room Pop-Up: Free Cereal &amp; Coffee | Final Day</a>

In [7]:
a_tags = [tbl.find('a') for tbl in tanbox_lefts]
a_tags[-5:] # These are the every Monday events

[<a href="http://sf.funcheap.com/event-series/popcorn-poker-movie-trivia-game-night-parkway/" rel="bookmark" title="Popcorn Poker: Movie Trivia Game Night | The New Parkway">Popcorn Poker: Movie Trivia Game Night | The New Parkway</a>,
 <a href="http://sf.funcheap.com/event-series/move-free-comedy-night-light/" rel="bookmark" title="Free Comedy: Move Along, Nothing to See Here | Oakland">Free Comedy: Move Along, Nothing to See Here | Oakland</a>,
 <a href="http://sf.funcheap.com/event-series/classical-revolution-chamber-music-jam-mission-dist/" rel="bookmark" title="\u201cClassical Revolution\u201d Chamber Music Jam | Mission Dist.">\u201cClassical Revolution\u201d Chamber Music Jam | Mission Dist.</a>,
 <a href="http://sf.funcheap.com/event-series/ivy-league-free-comedy-monday-nights-albany/" rel="bookmark" title="Ivy League Free Comedy on Monday Nights | Albany">Ivy League Free Comedy on Monday Nights | Albany</a>,
 <a href="http://sf.funcheap.com/event-series/kevin-wongs-funny-comed

## So we got too much stuff.
### What to do now?

We could:
* Only keep events that have real dates, not "Every Monday". Requires going further down the HTML.
* Find some other container that only holds the real events for the day.

In [8]:
good_clear_float = title.next_sibling.next_sibling
str(good_clear_float)[:500]

'<div class="clearfloat">\n<span class="left"><a href="/2017/07/16/">&lt; Sunday, July 16</a></span>\n<span class="right"><a href="/2017/07/18/">Tuesday, July 18 &gt;</a></span>\n<div style="clear:both"></div>\n<div class="archive_date_title" style="background-color:white;margin:0px;margin-bottom:4px;padding:0px;"><h3 style="background-color:black;color:white;margin:0px;padding:3px;font: 16px Arial;font-weight:bold;">Monday<span style="font-weight:normal;">, July 17</span></h3></div>\n<div class="tanb'
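
Why `next_sibling` twice? A likely explanation, assuming the page has a line break between the heading and the container, is that the whitespace between two tags is itself a node in the tree. The sketch below reproduces this on a made-up snippet and shows `find_next_sibling`, which skips over text nodes.

```python
from bs4 import BeautifulSoup

# Made-up HTML with a newline between the two tags.
html = """<h2 class="title">Events</h2>
<div class="clearfloat">day container</div>"""
demo = BeautifulSoup(html, "html.parser")
heading = demo.find("h2", class_="title")

print(repr(heading.next_sibling))         # the newline text node between the tags
print(heading.next_sibling.next_sibling)  # the div we actually want
print(heading.find_next_sibling("div"))   # same div, without counting text nodes
```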

In [9]:
for tag in good_clear_float.find_all('a', rel=True):
    print(tag.attrs['href'])
print('These look right, apart from some repeats')

http://sf.funcheap.com/clover-sonoma-milk-tasting-room-popup-free-cereal-coffee-closing-day/
http://sf.funcheap.com/clover-sonoma-milk-tasting-room-popup-free-cereal-coffee-closing-day/
http://sf.funcheap.com/free-coworking-job-sunnyvale/
http://sf.funcheap.com/free-coworking-job-sunnyvale/
http://sf.funcheap.com/jack-daniels-popup-store-trims-southern-cooking-happy-hour-sf-2/
http://sf.funcheap.com/jack-daniels-popup-store-trims-southern-cooking-happy-hour-sf-2/
http://sf.funcheap.com/tour-of-diego-riveras-1st-us-mural-union-square-36/
http://sf.funcheap.com/laborfest2017-destruction-city-college-san-francisco-bernal-public-library/
http://sf.funcheap.com/the-book-of-greens-book-launch-omnivore-books/
http://sf.funcheap.com/free-sneak-preview-movie-dunkirk-sundance-kabuki-cinemas/
http://sf.funcheap.com/free-sneak-preview-movie-dunkirk-amc-saratoga/
http://sf.funcheap.com/haight-street-trivia-comedy-gaming-free-pizza-night-milk-bar-7/
http://sf.funcheap.com/garage-surf-jurassic-punk-t
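
The repeats show up because some events link to themselves twice, once from the title and once from the thumbnail. A simple way to drop them while keeping the original order, sketched here on a few of the URLs from the output above, is `dict.fromkeys`.

```python
# A few of the hrefs printed above, including the repeats.
hrefs = [
    "http://sf.funcheap.com/free-coworking-job-sunnyvale/",
    "http://sf.funcheap.com/free-coworking-job-sunnyvale/",
    "http://sf.funcheap.com/tour-of-diego-riveras-1st-us-mural-union-square-36/",
    "http://sf.funcheap.com/the-book-of-greens-book-launch-omnivore-books/",
]

# Dictionary keys are unique and keep insertion order, so this removes repeats.
unique_hrefs = list(dict.fromkeys(hrefs))
print(unique_hrefs)
```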

## New Strategy

As you go through a web site you should build up a dictionary for the documents that you want to store in Mongo. In the example above we may, for each post url, create a dictionary with the information:

```python
    {'url': url_of_event,
     'date': date_of_event,
     'cost': cost_of_event}
```

We can then insert these dictionaries into a Mongo database via PyMongo, which we will learn about next. First we want a function that will allow us to do something like:

```python
event_list = []
for tbl in good_clear_float.find_all('div', class_='tanbox left'):
    event_list.append(make_event_dict(tbl))
```

In [10]:
# For testing
tbl1

<div class="tanbox left" style="background-color:white;">\n<span class="title"><a href="http://sf.funcheap.com/clover-sonoma-milk-tasting-room-popup-free-cereal-coffee-closing-day/" rel="bookmark" title="Clover Sonoma Milk Tasting Room Pop-Up: Free Cereal &amp; Coffee | Final Day">Clover Sonoma Milk Tasting Room Pop-Up: Free Cereal &amp; Coffee | Final Day</a></span>\n<div class="meta archive-meta">Monday, July 17 \x96 8:00 am | \n\n\n<span class="cost">Cost: FREE</span> | <span>Clover Sonoma Milk Tasting Room Pop-up</span> </div>\n<div class="thumbnail-wrapper"><a href="http://sf.funcheap.com/clover-sonoma-milk-tasting-room-popup-free-cereal-coffee-closing-day/" rel="bookmark" title="Clover Sonoma Milk Tasting Room Pop-Up: Free Cereal &amp; Coffee | Final Day"><img alt="Clover Sonoma Milk Tasting Room Pop-Up: Free Cereal &amp; Coffee | Final Day" class="left" src="http://cdn.funcheap.com/wp-content/uploads/2017/06/Screen-Shot-2017-06-06-at-10.40.54-AM1-175x130.png"/></a></div><p>Clove

In [11]:
def make_event_dict(tbl):
    pass

In [12]:
def make_event_dict(tbl):
    """Parses event in a "tanbox left" div for its url, title and cost.
    Returns as dictionary.
    """
    event_dict = {}
    event_dict['url'] = tbl.find('a', rel=True).attrs['href']
    event_dict['title'] = tbl.find('a').attrs['title']
    maybe_cost = tbl.find('span', class_='cost')
    if maybe_cost:
        event_dict['cost'] = maybe_cost.text
    return event_dict

In [13]:
event_list = []
for tbl in good_clear_float.find_all('div', class_='tanbox left'):
    if tbl.find('a', rel=True):
        event_list.append(make_event_dict(tbl))

event_list

[{'cost': u'Cost: FREE',
  'title': u'Clover Sonoma Milk Tasting Room Pop-Up: Free Cereal & Coffee | Final Day',
  'url': u'http://sf.funcheap.com/clover-sonoma-milk-tasting-room-popup-free-cereal-coffee-closing-day/'},
 {'cost': u'Cost:',
  'title': u'Free Co-working: Getting The Job Done | Sunnyvale',
  'url': u'http://sf.funcheap.com/free-coworking-job-sunnyvale/'},
 {'cost': u'Cost:',
  'title': u'Jack Daniel\u2019s Pop-Up Store: Trims, Southern Cooking & Daily Demos | SF',
  'url': u'http://sf.funcheap.com/jack-daniels-popup-store-trims-southern-cooking-happy-hour-sf-2/'},
 {'title': u'Tour of Diego Rivera\u2019s 1st US Mural | Union Square',
  'url': u'http://sf.funcheap.com/tour-of-diego-riveras-1st-us-mural-union-square-36/'},
 {'title': u'LaborFest2017: The Destruction of City College of San Francisco | Bernal Public Library',
  'url': u'http://sf.funcheap.com/laborfest2017-destruction-city-college-san-francisco-bernal-public-library/'},
 {'title': u'\u201cThe Book of Greens\u

Now that we have the contents of the events that we want, how do we save them to Mongo?

In [10]:
from pymongo import MongoClient

remote_client = MongoClient("<string_from_morning_remote_connection>")
tweets = remote_client.tweets
coffee = tweets.coffee

In [15]:
tweet = coffee.find_one()
print(type(tweet))
tweet['place']

<type 'dict'>


{u'country': u'United States',
 u'country_code': u'US',
 u'full_name': u'city, MS',
 u'place_type': u'city'}

In [16]:
tweet.keys()

[u'contributors',
 u'truncated',
 u'text',
 u'in_reply_to_status_id',
 u'id',
 u'favorite_count',
 u'source',
 u'retweeted',
 u'coordinates',
 u'entities',
 u'in_reply_to_screen_name',
 u'id_str',
 u'retweet_count',
 u'in_reply_to_user_id',
 u'favorited',
 u'user',
 u'geo',
 u'in_reply_to_user_id_str',
 u'lang',
 u'created_at',
 u'filter_level',
 u'in_reply_to_status_id_str',
 u'place',
 u'_id']

In [17]:
for tweet in coffee.find().limit(20):
    print(tweet['user']['screen_name'])

TransitoRos
BaristaBatgirl
arianaskittles
jordansundblad
alexdelro
RosenaraM
MerveUygr
melissat91
_alysaperez
hopethepope
lucianamachad15
Angela79248101
hannahTkoch
charleyleemcken
jennifelovepink
SoyLatte
JustinC_93
MO_SandraE
NoFeelingsPls
LoveAfrodita_HC


In [18]:
# Remember to close the connection
remote_client.close()

## Back to sffuncheap

Knowing what we know now, how do we put the events we scraped into Mongo?

Steps are simple:
1. Start a Mongo server
2. Open a connection to the server with PyMongo
3. Write code that executes queries (can be insertions) into a database from Python

In [11]:
local_client = MongoClient() # Default is to connect locally
db = local_client.sffuncheap # What's the server look like before we run this?
events = db.events

In [20]:
for event in event_list[:5]:
    events.insert_one(event)

In [21]:
events.insert_many(event_list[5:])

<pymongo.results.InsertManyResult at 0x10ceafe60>

In [12]:
# Remember to close the connection
events.drop() # <<<---- only for lecture refreshing purposes
local_client.close()
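
If the scraper is run more than once, plain inserts will store the same event several times. One possible approach, sketched here with the event URL assumed to be a unique key, is to upsert instead. The helper name `upsert_events` is made up; `update_one` with `upsert=True` is the PyMongo call doing the work.

```python
def upsert_events(collection, events):
    """Insert or update each event, using its URL as the unique key."""
    for event in events:
        # Drop any _id that an earlier insert added to the dictionary.
        fields = {key: value for key, value in event.items() if key != "_id"}
        collection.update_one(
            {"url": event["url"]},  # match an existing document by URL
            {"$set": fields},       # add or overwrite the scraped fields
            upsert=True,            # insert when no document matches
        )

# Usage (needs a running Mongo server):
# from pymongo import MongoClient
# client = MongoClient()
# upsert_events(client.sffuncheap.events, event_list)
# client.close()
```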

## Scraping Note

* Frequently we don't know what data we're going to want for a project/employer/etc.
* In these, likely, cases a good scraping practice is just to download all of the HTML and save it.
* Later, then, you can figure out how to parse it into another database without the risk of having missed some important info.
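
A minimal sketch of that practice: wrap the raw HTML in a dictionary along with where and when it came from, so it can go straight into Mongo. The function name `raw_page_document`, the field names and the `raw_pages` collection in the usage note are all assumptions made for illustration.

```python
from datetime import datetime, timezone

def raw_page_document(url, html):
    """Package a page's raw HTML with enough context to parse it later."""
    return {
        "url": url,
        "html": html,
        "fetched_at": datetime.now(timezone.utc),  # when the page was downloaded
    }

doc = raw_page_document("http://sf.funcheap.com/2018/09/17", "<html>...</html>")
print(sorted(doc))

# Usage with a PyMongo database (hypothetical collection name):
# db.raw_pages.insert_one(doc)
```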

## Scraping from an Existing API

* Let's take a look at the API for all the publicly available policing data in the [UK](https://data.police.uk/docs/).
* After taking a look at the documentation for the interface, let's experiment with what we get when we issue a request to this API.
* The process looks remarkably similar to the one we went through for scraping a web page, except this time the response we're looking for is available via the `json()` method.
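
A rough sketch of such a request follows. The `forces` endpoint is my reading of the API documentation linked above and should be checked against it; the helper name `fetch_json` and the timeout are assumptions.

```python
import requests

# Endpoint assumed from the data.police.uk documentation; verify before relying on it.
FORCES_URL = "https://data.police.uk/api/forces"

def fetch_json(url, params=None, session=None, timeout=10):
    """Issue a get request and return the decoded JSON body."""
    session = session or requests.Session()
    response = session.get(url, params=params, timeout=timeout)
    response.raise_for_status()  # stop early on a 4xx/5xx status
    return response.json()       # Python lists/dicts instead of raw text

# Usage (makes a real network request):
# forces = fetch_json(FORCES_URL)
# print(forces[:3])
```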

## API Scraping and Mongo

* Many APIs will give you a choice of how they will return data to you.
* Choosing JSON will make life easier since we will frequently be using Mongo for our storage unit during our scraping endeavors, and it plays very well with JSON.
* If an API doesn't offer a data format choice then take whatever they provide and coerce it into a dictionary format with Python so you can put it in Mongo.
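
As an example of that kind of coercion, here is a sketch that turns a CSV response into a list of dictionaries with the standard library. The CSV text is made up.

```python
import csv
import io

# Made-up CSV, standing in for the text of an API response.
raw_csv = "id,name\n1,Alpha\n2,Beta\n"

# DictReader uses the header row as keys, giving one dictionary per data row.
documents = [dict(row) for row in csv.DictReader(io.StringIO(raw_csv))]
print(documents)

# Each dictionary can then be inserted into Mongo, e.g. with insert_many.
```

Note that every value comes back as a string, so numeric fields need converting before they are stored.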

## Demo: Network Tab

* Some websites are powered on the backend by an API that isn't public but that can be reverse engineered.
* [NBA Stats](http://stats.nba.com/) (technically most of this API has been made public).

# Afternoon Objectives

1. Understand the process of getting data from the web.
2. HTML/CSS for scraping:
    * Connect to web pages from Python
    * Parse HTML in Python
3. Format data for insertion into Mongo.
4. Be able to use existing APIs to fetch pre-formatted data.