# Web scraping

**Date: 28 March 2017**

@author: Daniel Csaba


## Preliminaries 

Import usual packages.  

In [5]:
import pandas as pd             # data package
import matplotlib.pyplot as plt # graphics 
import datetime as dt           # date tools, used to note current date  

%matplotlib inline

We have seen how to input data from `csv` and `xls` files -- either online or from our computer -- and through APIs. Sometimes the data is only available as a specific part of a website.

We want to access the source code of the website and systematically extract the relevant information.

Again, use Google fu to find useful links. Here are a couple:
* [link 1](https://www.dataquest.io/blog/web-scraping-tutorial-python/)
* [link 2](http://blog.danwin.com/examples-of-web-scraping-in-python-3-x-for-data-journalists/)
* [link 3](https://www.analyticsvidhya.com/blog/2015/10/beginner-guide-web-scraping-beautiful-soup-python/)

## Structure of web pages (very simplistic)

`Hypertext Markup Language` (HTML) specifies the structure and main content of the site -- it tells the browser how to lay out the content. Think of `Markdown`.

It is structured using tags.

```html
<html>
    <head>
        (Meta) Information about the page.
    </head>
    <body>
        <p>
            This is a paragraph.
        </p>
        <table>
            This is a table
        </table>
    </body>
</html>
```

`Tag`s determine the content and layout depending on their relation to other tags. Useful terminology:

* `child` -- a child is a tag inside another tag. The `p` tag above is a child of the `body` tag.
* `parent` --  a parent is the tag another tag is inside. The `body` tag above is a parent of the `p` tag.
* `sibling` -- a sibling is a tag that is nested inside the same parent as another tag. The `head` and `body` tags above are siblings.
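The relations listed above can be checked in code. The following is a minimal sketch that assumes the `bs4` package (introduced later in this notebook) is installed; the HTML string is a hand-written toy page, with a `title` tag added to the `head` for illustration.

```python
from bs4 import BeautifulSoup

# Toy page mirroring the example above. It is written without whitespace
# between tags so that the only children are tags, not newline strings.
html = (
    "<html><head><title>Demo</title></head>"
    "<body><p>This is a paragraph.</p><table></table></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

p_tag = soup.find("p")
parent_name = p_tag.parent.name                          # parent of <p>
child_names = [child.name for child in soup.body.children]  # children of <body>
sibling_name = soup.head.next_sibling.name               # sibling following <head>

print(parent_name, child_names, sibling_name)
```

In a real page the whitespace between tags shows up as extra string children, so lists of children are usually longer than the number of tags.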

There are many different tags -- take a look at a [reference list](https://developer.mozilla.org/en-US/docs/Web/HTML/Element). You won't and shouldn't remember all of them but it's useful to have a rough idea about them.

Now take a look at a real example -- open a page, right click, and choose "View Page Source".

In the real example you will see that there is more information after the tag name, most commonly a `class` and an `id`. Something similar to the following:

```html
<html>
    <head class='main-head'>
        (Meta) Information about the page.
    </head>
    <body>
        <p class='inner-paragraph' id='001'>
            This is a paragraph.
        </p>
        <table class='inner-table' id='002'>
            This is a table
        </table>
    </body>
</html>
```
The `class` and `id` information will help us in locating the information we are looking for in a systematic way.
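As a rough illustration of how `class` and `id` help, the sketch below looks up tags in a shortened copy of the toy page above. It assumes the `bs4` package (introduced later in this notebook); the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the toy page above
html = """
<html>
    <body>
        <p class='inner-paragraph' id='001'>This is a paragraph.</p>
        <table class='inner-table' id='002'></table>
    </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

# `class` is a reserved word in Python, so the keyword is spelled `class_`
by_class = soup.find("p", class_="inner-paragraph")
by_id = soup.find(id="002")

paragraph_text = by_class.get_text()
table_name = by_id.name
print(paragraph_text, "|", table_name)
```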

A useful way to explore the `html` and the corresponding website is to right click on the web page and then click on `Inspect element` -- this shows the browser's interpretation of the html.


Suppose we want to check prices for renting a room in Manhattan on Craigslist. Let's check for example the `rooms & shares` section for the [East Village](https://newyork.craigslist.org/search/roo?query=east+village&availabilityMode=0).

## Accessing web pages 

We have to download the content of the webpage -- i.e. get the contents structured by the HTML. This we can do with the `requests` library, which is a human readable HTTP (HyperText Transfer Protocol) library for Python. You can find the Quickstart Documentation [here](http://docs.python-requests.org/en/master/user/quickstart/).

In [7]:
import requests

You might want to query for several things and download the information for all of them. You can pass these as extra parameters through the `params` argument of `requests.get`, as in the second request further down.

In [8]:
# send a GET request to the URL
url = 'https://newyork.craigslist.org/search/roo?query=east+village&availabilityMode=0'
cl = requests.get(url)

In [9]:
type(cl)

requests.models.Response

In [10]:
cl

<Response [200]>

The `[200]` stands for the `status_code`, which tells us whether the download was successful. If it starts with 2 it's a good sign, 4 or 5 not so much.

In [11]:
cl.status_code

200
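The rule of thumb about the first digit can be written down as a small helper. This is only a sketch using the standard library; `describe_status` is a hypothetical name, not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a rough, human-readable reading of an HTTP status code."""
    families = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error (problem with the request)",
        5: "server error",
    }
    family = families.get(code // 100, "other")
    try:
        phrase = HTTPStatus(code).phrase   # standard reason phrase
    except ValueError:                     # code not in the standard list
        phrase = "unknown"
    return "{} {}: {}".format(code, phrase, family)


for code in (200, 404, 503):
    print(describe_status(code))
```

With `requests` itself, calling `cl.raise_for_status()` raises an exception for 4xx and 5xx responses, which is a convenient way to stop early when a download fails.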

Use tab completion to explore the attributes of the response.

In [12]:
cl.url

'https://newyork.craigslist.org/search/roo?query=east+village&availabilityMode=0'

The raw output is going to be ugly and unreadable -- `content` gives the body as bytes, `text` gives it as a decoded string.

In [15]:
cl.content

b'\xef\xbb\xbf<!DOCTYPE html>\n\n<html class="no-js"><head>\n    <title>new york rooms for rent &amp; shares available &quot;east village&quot; - craigslist</title>\n\n    <meta name="description" content="new york rooms for rent &amp; shares available &quot;east village&quot; - craigslist">\n    <meta http-equiv="X-UA-Compatible" content="IE=Edge"/>\n    <link rel="canonical" href="https://newyork.craigslist.org/search/roo">\n    <link rel="alternate" type="application/rss+xml" href="https://newyork.craigslist.org/search/roo?availabilityMode=0&amp;format=rss&amp;query=east%20village" title="RSS feed for craigslist | new york rooms for rent &amp; shares available &quot;east village&quot; - craigslist ">\n    \n    <link rel="next" href="https://newyork.craigslist.org/search/roo?s=120&amp;availabilityMode=0&amp;query=east%20village">\n    <meta name="viewport" content="width=device-width,initial-scale=1">\n    <link type="text/css" rel="stylesheet" media="all" href="//www.craigslist.org

In [22]:
# this is another way to specify the URL
url= 'https://newyork.craigslist.org/search/roo'
keys = {'query':['east village', 'west village'],'availabilityMode':'0'}
cl_extra = requests.get(url, params=keys)

In [23]:
cl_extra.status_code

200

In [24]:
cl_extra.url

'https://newyork.craigslist.org/search/roo?query=east+village&query=west+village&availabilityMode=0'
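To see how the `params` dictionary turns into the query string above without sending a request, the sketch below uses `urlencode` from the standard library. It is meant to mirror, not reproduce, what `requests` does internally.

```python
from urllib.parse import urlencode

url = "https://newyork.craigslist.org/search/roo"
keys = {"query": ["east village", "west village"], "availabilityMode": "0"}

# doseq=True expands a list value into one key=value pair per element;
# spaces are encoded as '+'
query_string = urlencode(keys, doseq=True)
full_url = url + "?" + query_string
print(full_url)
```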

In [25]:
cl_extra.content

b'\xef\xbb\xbf<!DOCTYPE html>\n\n<html class="no-js"><head>\n    <title>new york rooms for rent &amp; shares available &quot;east village&quot; - craigslist</title>\n\n    <meta name="description" content="new york rooms for rent &amp; shares available &quot;east village&quot; - craigslist">\n    <meta http-equiv="X-UA-Compatible" content="IE=Edge"/>\n    <link rel="canonical" href="https://newyork.craigslist.org/search/roo">\n    <link rel="alternate" type="application/rss+xml" href="https://newyork.craigslist.org/search/roo?availabilityMode=0&amp;format=rss&amp;query=east%20village&amp;query=west%20village" title="RSS feed for craigslist | new york rooms for rent &amp; shares available &quot;east village&quot; - craigslist ">\n    \n    <link rel="next" href="https://newyork.craigslist.org/search/roo?s=120&amp;availabilityMode=0&amp;query=east%20village&amp;query=west%20village">\n    <meta name="viewport" content="width=device-width,initial-scale=1">\n    <link type="text/css" rel="

## Extracting information from a web page 

Now that we have the content of the web page we want to extract certain information. `BeautifulSoup` is a Python package which helps us in doing that. See the [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for more information.


In [26]:
from bs4 import BeautifulSoup

In [28]:
BeautifulSoup?

In [29]:
# parse the html
cl_soup = BeautifulSoup(cl.content,'html.parser')

Print this out in a prettier way.

In [30]:
print(cl_soup.prettify())

<!DOCTYPE html>
<html class="no-js">
 <head>
  <title>
   new york rooms for rent &amp; shares available "east village" - craigslist
  </title>
  <meta content='new york rooms for rent &amp; shares available "east village" - craigslist' name="description">
   <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
   <link href="https://newyork.craigslist.org/search/roo" rel="canonical">
    <link href="https://newyork.craigslist.org/search/roo?availabilityMode=0&amp;format=rss&amp;query=east%20village" rel="alternate" title='RSS feed for craigslist | new york rooms for rent &amp; shares available "east village" - craigslist ' type="application/rss+xml">
     <link href="https://newyork.craigslist.org/search/roo?s=120&amp;availabilityMode=0&amp;query=east%20village" rel="next">
      <meta content="width=device-width,initial-scale=1" name="viewport">
       <link href="//www.craigslist.org/styles/cl.css?v=a14d0c65f7978c2bbc0d780a3ea7b7be" media="all" rel="stylesheet" type="text/css">
  

In [31]:
print('Type:', type(cl_soup))

Type: <class 'bs4.BeautifulSoup'>


In [33]:
# we can access a tag 
print('Title: ', cl_soup.title)

Title:  <title>new york rooms for rent &amp; shares available "east village" - craigslist</title>


In [34]:
# or only the text content
print('Title: ', cl_soup.title.text) # or
print('Title: ', cl_soup.get_text())

Title:  new york rooms for rent & shares available "east village" - craigslist
Title:  

new york rooms for rent & shares available "east village" - craigslist














<!--
var areaCountry = "US";
var areaID = "3";
var areaRegion = "NY";
var catAbb = "roo";
var countOfTotalText = "image {count} of {total}";
var currencySymbol = "&#x0024;";
var defaultView = "list";
var expiredFavIDs = null;
var imageConfig = {"1":{"hostname":"https://images.craigslist.org","sizes":["50x50c","300x300","600x450","1200x900"]},"0":{"hostname":"https://images.craigslist.org","sizes":["50x50c","300x300","600x450"]},"2":{"hostname":"https://images.craigslist.org","sizes":["50x50c","300x300","600x450","1200x900"]}};
var lessInfoText = "less info";
var maptileBaseUrl = "//map{s}.craigslist.org/t09/{z}/{x}/{y}.png";
var maxResults = 2500;
var noImageText = "no image";
var pID = null;
var postalLat = null;
var postalLon = null;
var purveyorCategories = null;
var searchDistance = null;
var sectionAbb = "hhh"

We can find all tags of certain type with the `find_all` method. This returns a list.

In [None]:
cl_soup.find_all?

To get all the paragraph tags in the html write

In [35]:
cl_soup.find_all('p')

[<p class="result-info">
 <span class="icon icon-star" role="button">
 <span class="screen-reader-text">favorite this post</span>
 </span>
 <time class="result-date" datetime="2017-04-04 13:54" title="Tue 04 Apr 01:54:15 PM">Apr  4</time>
 <a class="result-title hdrlnk" data-id="6058998250" href="/mnh/roo/6058998250.html">Room on Central East Village prime location ♥</a>
 <span class="result-meta">
 <span class="result-price">$500</span>
 <span class="result-hood"> (East Village)</span>
 <span class="result-tags">
 <span class="maptag" data-pid="6058998250">map</span>
 </span>
 <span class="banish icon icon-trash" role="button">
 <span class="screen-reader-text">hide this posting</span>
 </span>
 <span aria-hidden="true" class="unbanish icon icon-trash red" role="button"></span>
 <a class="restore-link" href="#">
 <span class="restore-narrow-text">restore</span>
 <span class="restore-wide-text">restore this posting</span>
 </a>
 </span>
 </p>, <p class="result-info">
 <span class="icon

This is a lot of information and we want to extract some part of it. Use the `text` attribute or the `get_text()` method to get the text content.

This is still messy. We will need a smarter search.
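To see what "messy" means here, the sketch below calls `get_text()` on a simplified, hand-written stand-in for one search result (not the real page). The `separator` and `strip` arguments are options of `get_text()` that tidy up the whitespace.

```python
from bs4 import BeautifulSoup

# Simplified, hand-written stand-in for one search result
html = """
<p class="result-info">
  <a class="result-title hdrlnk" data-id="6058998250" href="/mnh/roo/6058998250.html">Room in East Village</a>
  <span class="result-price">$500</span>
  <span class="result-hood"> (East Village)</span>
</p>
"""
first_p = BeautifulSoup(html, "html.parser").find("p")

# plain get_text() keeps the newlines and indentation between tags
raw_text = first_p.get_text()
# strip=True trims each piece of text and drops the empty ones
clean_text = first_p.get_text(separator=" | ", strip=True)

print(repr(raw_text))
print(clean_text)
```

Even the tidier version mixes title, price and neighborhood in one string, which is why searching for specific tags is the better route.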

We can also access the `children` of a certain tag. For example here are the children of the first paragraph tag.

In [None]:
list(cl_soup.find_all('p')[0].children)

Look for tags based on their class. This is extremely useful for efficiently locating information.

In [None]:
cl_soup.find_all('span', class_='result-price')[0].get_text()

In [None]:
prices = cl_soup.find_all('span', class_='result-price')

In [None]:
price_data = [price.get_text() for price in prices]

In [None]:
price_data[:10]

In [None]:
len(price_data)

We are getting more prices than we want -- there were only 120 listings on the page. Check the ads with "Inspect Element".

In [None]:
cl_soup.find_all('li', class_='result-row')[0]

In [None]:
ads = cl_soup.find_all('li', class_='result-row')

What's going wrong? Some ads don't have a price listed, so we can't retrieve it.

In [None]:
import bs4

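# keep only the ads that list a price; `find` returns None otherwise
data = [[ad.find('a', class_='result-title hdrlnk').get_text(),
         ad.find('a', class_='result-title hdrlnk')['data-id'],
         ad.find('span', class_='result-price').get_text()] for ad in ads
        if isinstance(ad.find('span', class_='result-price'), bs4.element.Tag)]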

We only have x listings because (120-x) listings did not have a price.

In [None]:
df = pd.DataFrame(data, columns=['Title', 'ID', 'Price'])

In [None]:
df.head()
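An alternative to dropping the ads without a price is to keep them and record the price as missing. The sketch below shows the idea on two hand-written rows; the markup is a simplified assumption about the page, not a copy of it.

```python
from bs4 import BeautifulSoup

# Two hand-written rows: one with a price, one without
html = """
<ul>
  <li class="result-row">
    <a class="result-title hdrlnk" data-id="1">Sunny room</a>
    <span class="result-price">$500</span>
  </li>
  <li class="result-row">
    <a class="result-title hdrlnk" data-id="2">Room, price on request</a>
  </li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for ad in soup.find_all("li", class_="result-row"):
    title = ad.find("a", class_="result-title")
    price = ad.find("span", class_="result-price")   # None when no price is listed
    rows.append([title.get_text(),
                 title["data-id"],
                 price.get_text() if price is not None else None])

print(rows)
```

Passing `rows` to `pd.DataFrame` would then give one row per listing, with a missing value where no price was posted.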

We could do text analysis and see what words are common in ads that have relatively high prices.

This approach is incomplete because it only gets the first page of the search results. At the top of the CL page we see the total number of listings. In the `Inspect element` mode we can pick an element from the page and check how it is defined in the `html` -- this is useful to get tags and classes efficiently.

For example, the total number of ads is a `span` tag with a 'totalcount' `class`.
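A minimal sketch of reading that number, using a hand-written fragment in place of the live page. The surrounding markup and the value 463 are made up for illustration; only the `totalcount` class comes from the text above.

```python
from bs4 import BeautifulSoup

# Hand-written fragment; the count 463 is a made-up value
html = '<span class="button pagenum"><span class="totalcount">463</span></span>'
soup = BeautifulSoup(html, "html.parser")

total_tag = soup.find("span", class_="totalcount")
# fall back to 0 if the tag is missing, e.g. when the search has no results
total = int(total_tag.get_text()) if total_tag is not None else 0
print(total)
```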

If we start clicking on the 2nd and 3rd pages of the results, we can see that there is a structure in how their URLs are defined.

First page:

https://newyork.craigslist.org/search/roo?query=east+village&availabilityMode=0

Second page:

https://newyork.craigslist.org/search/roo?s=120&availabilityMode=0&query=east%20village

Third page:

https://newyork.craigslist.org/search/roo?s=240&availabilityMode=0&query=east%20village


The number after `roo?s=` in the URL specifies the listing after which the page starts (not inclusive). In fact, if we modify it ourselves, the page starts right after the corresponding listing and shows the next 120 listings. Try it!

We can also define the first page by putting `s=0&` after `roo?` like this:

https://newyork.craigslist.org/search/roo?s=0&availabilityMode=0&query=east%20village
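Given the pattern above, the URLs of all result pages can be generated from the total number of listings. The sketch below assumes 120 listings per page, as observed above, and uses a made-up total of 300; `page_urls` is a hypothetical name.

```python
from urllib.parse import urlencode

base_url = "https://newyork.craigslist.org/search/roo"
total = 300      # made-up value; in practice read it from the page
per_page = 120   # listings per page, as observed above

page_urls = []
for start in range(0, total, per_page):
    # `s` is the offset of the first listing shown on the page
    params = {"s": start, "availabilityMode": "0", "query": "east village"}
    page_urls.append(base_url + "?" + urlencode(params))

for page_url in page_urls:
    print(page_url)
```

Each of these URLs could then be passed to `requests.get` and parsed in the same way as the first page. Pausing briefly between requests (for example with `time.sleep`) is a common courtesy towards the server.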


In [None]:
# First we get the total number of listings in real time


In [None]:
# Next we write a loop to scrape all pages



In [None]:
df.head(15)

In [None]:
df.shape

In [None]:
df.tail(15)

We have scraped all the listings from CL in section "Rooms and Shares" for the East Village.

## Exercise

Suppose you have a couple of destinations in mind and you want to check the weather for each of them for this Friday. You want to get it from the [National Weather Service](http://www.weather.gov/).

These are the places I want to check (suppose there are many more and you want to automate it):

```python
locations = ['Bozeman, Montana', 'White Sands National Monument', 'Stanford University, California']
```

It seems that the NWS is using latitude and longitude coordinates in its search.

e.g. for White Sands
http://forecast.weather.gov/MapClick.php?lat=32.38092788700044&lon=-106.4794398029997

Would be cool to pass these on as arguments.

After some Google fu (i.e. "latitude and longitude of location python") we find a post by [Chris Albon](https://chrisalbon.com/python/geocoding_and_reverse_geocoding.html) which describes exactly what we want.

> "Geocoding (converting a physical address or location into latitude/longitude) and reverse geocoding (converting a lat/long to a physical address or location)[...] Python offers a number of packages to make the task incredibly easy [...] use pygeocoder, a wrapper for Google's geo-API, to both geocode and reverse geocode."

Install `pygeocoder` through `pip install pygeocoder` (from `conda` only the OSX version is available).

In [None]:
from pygeocoder import Geocoder

In [None]:
loc = Geocoder.geocode('Bozeman, Montana')
loc.coordinates

We can check whether it's working fine at http://www.latlong.net/

In [None]:
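locations = ['Bozeman, Montana', 'White Sands National Monument', 'Stanford University, California']

# look up the (latitude, longitude) pair of each location
coordinates = [Geocoder.geocode(location).coordinates for location in locations]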

In [None]:
for location, coordinate in zip(locations, coordinates):
    print('The coordinates of {} are:'.format(location), coordinate)

Define a dictionary (of dictionaries) for the parameters we want to pass to the GET request that we send to the NWS server.

In [None]:
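keys = {}
for location, coordinate in zip(locations, coordinates):
    # coordinate is a (latitude, longitude) pair
    keys[location] = {'lat': coordinate[0], 'lon': coordinate[1]}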

In [None]:
keys

Recall the format of the url associated with a particular location:

http://forecast.weather.gov/MapClick.php?lat=32.38092788700044&lon=-106.4794398029997


In [None]:
url = 'http://forecast.weather.gov/MapClick.php'
nws = requests.get(url, params=keys[locations[0]])

In [None]:
nws.status_code

Create a BeautifulSoup instance

In [None]:
nws_soup = BeautifulSoup(nws.content, 'html.parser')

In [None]:
seven = nws_soup.find('div', id='seven-day-forecast-container')

In [None]:
data = []

for location in locations:
    # send GET request to the server
    nws = requests.get(url, params=keys[location])
    # create a BeautifulSoup instance
    nws_soup = BeautifulSoup(nws.content, 'html.parser')
    
    # locate the part of the html file that contains the information of interest
    seven = nws_soup.find('div', id='seven-day-forecast-container')
    temp = seven.find(text='Friday').parent.parent.find('p', class_='temp temp-high').get_text()
    
    data.append([location, temp])

In [None]:
df_weather = pd.DataFrame(data, columns=['Location', 'Friday weather'])
df_weather
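The chained lookup in the loop above fails with an `AttributeError` if the page has no period labeled exactly "Friday" or no daytime high for it. The sketch below wraps the same steps in a hypothetical helper, `friday_high`, that returns `None` instead. The HTML is a hand-written stand-in for the forecast markup (its class names other than `temp temp-high` are assumptions), and `string=` is the newer spelling of the `text=` argument used above.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the forecast markup; the real page may differ
html = """
<div id="seven-day-forecast-container">
  <ul>
    <li><div class="tombstone-container">
      <p class="period-name">Thursday</p>
      <p class="temp temp-high">High: 58 °F</p>
    </div></li>
    <li><div class="tombstone-container">
      <p class="period-name">Friday</p>
      <p class="temp temp-high">High: 61 °F</p>
    </div></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
seven = soup.find("div", id="seven-day-forecast-container")


def friday_high(container, day="Friday"):
    """Return the high-temperature text for `day`, or None if it is not found."""
    label = container.find(string=day)      # the text node that reads e.g. 'Friday'
    if label is None:
        return None
    # two levels up from the text node is the block holding that period
    temp = label.parent.parent.find("p", class_="temp temp-high")
    return temp.get_text() if temp is not None else None


print(friday_high(seven))
print(friday_high(seven, day="Sunday"))
```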