### Practice - Python for data scientists II - accessing data - scraping

Answer all **Questions**

In many cases as a data scientist, you’ll have access to data via a database, CSV files, or an Application Programming Interface (API). However, there are times when the data you want can only be accessed as part of a web page. In this case you will want to use a technique called *web scraping* to get the data from the web page and format it for further analysis.

In this notebook, we will go over how to work with the `Requests` and `BeautifulSoup` Python libraries in order to make use of data from web pages. The `Requests` module lets you integrate your Python programs with web services, while the `BeautifulSoup` module is designed to facilitate extracting data from parsed HTML. 

After an introduction to each library, we'll scrape weather forecasts from the National Weather Service using `Python 3`, the `Requests` and `BeautifulSoup` libraries, and then load the data into a `Pandas` dataframe for further analysis.

[Scrapy](https://scrapy.org/) is another web scraping library that also includes a web crawling pipeline. It's a little more difficult to use, but provides greater functionality.

Prerequisites:    
- Python 3 programming    
- Pandas   
- Basic HTML, CSS tag knowledge  
- Basic understanding of HTTP requests   

References:    
Beautiful Soup Documentation  
- https://www.crummy.com/software/BeautifulSoup/bs4/doc/#   

How To Work with Web Data Using Requests and Beautiful Soup with Python 3  
- https://www.digitalocean.com/community/tutorials/how-to-work-with-web-data-using-requests-and-beautiful-soup-with-python-3

Tutorial: Python Web Scraping Using BeautifulSoup   
- https://www.dataquest.io/blog/web-scraping-tutorial-python/   

StackOverflow, Difference between BeautifulSoup and Scrapy crawler?         
- https://stackoverflow.com/questions/19687421/difference-between-beautifulsoup-and-scrapy-crawler   


# In addition to Python 3, make sure the following libraries are installed
# Note: This is a non-executable cell

$ pip install pandas   
$ pip install requests
$ pip install beautifulsoup4
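
One way to confirm the installs from inside Python is to query the installed package metadata. This is a minimal sketch, assuming Python 3.8 or later (for `importlib.metadata`); the helper name `installed_versions` is hypothetical and not part of any of these libraries.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(["pandas", "requests", "beautifulsoup4"])
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```

Any package reported as `not installed` still needs the corresponding `pip install` command above.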

### The requests library

The `requests` library is the de facto standard for making HTTP requests in Python. It is a third-party package (not part of the Python standard library) that abstracts the complexities of making HTTP requests behind a relatively simple API.

Calling `requests.get` sends a `GET` request to a web server, which returns an HTTP response containing the HTML of the requested page. `GET` is only one of several HTTP methods that `requests` supports. Information on other HTTP methods can be found here:  
https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods 

Use `requests` to download the sample web page below.

Feel free to try other web pages.

In [1]:
import requests
page = requests.get("http://jayurbain.com/simplehtml.html")
page

<Response [200]>

After running the HTTP GET request, we get a `Response` object. This object has a `status_code` attribute holding the HTTP response status, which indicates whether the page was downloaded successfully.

More on response codes: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status


In [2]:
page.status_code

200

A `status_code` of $200$ means that the page downloaded successfully. A status code starting with 2 generally indicates success, a code starting with 4 indicates a client (request) error, and a code starting with 5 indicates a server error.
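
The first-digit rule above can be sketched with the standard library alone. The helper name `describe_status` is hypothetical, and the category labels are chosen for illustration; `HTTPStatus` comes from Python's built-in `http` module.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a coarse category for an HTTP status code based on its first digit."""
    categories = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return categories.get(code // 100, "unknown")


for code in (200, 301, 404, 503):
    # HTTPStatus(code).phrase is the standard reason phrase, e.g. "Not Found"
    print(code, HTTPStatus(code).phrase, "->", describe_status(code))
```

In practice, a `Response` object from `requests` also offers `page.ok` (true for codes below 400) and `page.raise_for_status()`, which raises an exception for 4xx and 5xx codes.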

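Use the response object's `content` attribute to view the raw HTML of the page. The content is returned as bytes (note the `b'...'` prefix in the output); the `text` attribute gives the same content decoded to a string.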

In [3]:
page.content

b'<html>\n<head><meta charset="UTF-8">\n<title>Basic Web Page</title>\n</head>\n<body>\n<h1>HTML syntax summary</h1>\n<h2>or, all you need to know about HTML</h2>\n<p>This is how you write an HTML document.</p>\n<p>The end.</p>\n<br>\nSome links:<br>\n<ul>\n<li><a class="website" href="https://google.com" id="link1">Google</a></li>\n<li><a class="website" href="https://twitter.com" id="link2">Twitter</a></li>\n<li><a class="website" href="https://facebook.com" id="link3">Facebook</a></li>\n</ul>\n</body>\n</html>'

### Parsing a web page with BeautifulSoup    

Once the page is downloaded, the document needs to be parsed so we can extract data.

[`Beautiful Soup`](https://www.crummy.com/software/BeautifulSoup/) creates a parse tree for parsed documents that can be used to extract data from HTML for web scraping.

The `BeautifulSoup` library can be used to parse the document downloaded above.

Note: `bs4` is the package name for Beautiful Soup 4, the current major version of the library.    
https://www.crummy.com/software/BeautifulSoup/bs4/doc/


In [4]:
from bs4 import BeautifulSoup    
soup = BeautifulSoup(page.content, 'html.parser')
type(soup)

bs4.BeautifulSoup

We can format and print the HTML content of the page, using the `prettify` method on the `BeautifulSoup` object.

In [5]:
print(soup.prettify())

<html>
 <head>
  <meta charset="utf-8"/>
  <title>
   Basic Web Page
  </title>
 </head>
 <body>
  <h1>
   HTML syntax summary
  </h1>
  <h2>
   or, all you need to know about HTML
  </h2>
  <p>
   This is how you write an HTML document.
  </p>
  <p>
   The end.
  </p>
  <br/>
  Some links:
  <br/>
  <ul>
   <li>
    <a class="website" href="https://google.com" id="link1">
     Google
    </a>
   </li>
   <li>
    <a class="website" href="https://twitter.com" id="link2">
     Twitter
    </a>
   </li>
   <li>
    <a class="website" href="https://facebook.com" id="link3">
     Facebook
    </a>
   </li>
  </ul>
 </body>
</html>


Web page documents are represented as a logical tree defined by the Document Object Model (DOM). Each HTML element is a node in the tree. Some nodes contain other nodes (for example, `<ul>` contains `<li>` elements), while others, such as plain text, are terminal (leaf) nodes.

References:  
- https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model  
- https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Introduction  
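
To make the tree structure concrete, here is a small sketch that prints one indented line per tag. It parses a shortened, inline copy of the sample page so it runs without a network connection; the helper name `outline` and the variable `tree_html` are hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for the sample page used in this notebook.
tree_html = """<html><head><title>Basic Web Page</title></head>
<body><h1>HTML syntax summary</h1>
<ul><li><a class="website" href="https://google.com" id="link1">Google</a></li></ul>
</body></html>"""


def outline(node, depth=0):
    """Return one indented line per tag, mirroring the nesting of the parse tree."""
    lines = []
    for child in node.children:
        if isinstance(child, Tag):  # skip text nodes such as newlines
            lines.append("  " * depth + child.name)
            lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(BeautifulSoup(tree_html, "html.parser"))
print("\n".join(tree_lines))
```

Each level of indentation corresponds to one level of nesting: `title` sits under `head`, and `a` sits under `li`, which sits under `ul`.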

Once parsed, `BeautifulSoup` has methods that allow navigation of the nested tree structure one level at a time, or extraction of specific node elements.

For example, we can first select all the elements at the top level of the page using the `children` property of the `BeautifulSoup` object.

Note: `children` returns an iterator rather than a list, so we call the `list` function to see all the items at once.

In [7]:
list(soup.children)

[<html>
 <head><meta charset="utf-8"/>
 <title>Basic Web Page</title>
 </head>
 <body>
 <h1>HTML syntax summary</h1>
 <h2>or, all you need to know about HTML</h2>
 <p>This is how you write an HTML document.</p>
 <p>The end.</p>
 <br/>
 Some links:<br/>
 <ul>
 <li><a class="website" href="https://google.com" id="link1">Google</a></li>
 <li><a class="website" href="https://twitter.com" id="link2">Twitter</a></li>
 <li><a class="website" href="https://facebook.com" id="link3">Facebook</a></li>
 </ul>
 </body>
 </html>]

In [8]:
newlist = []
for item in list(soup.children):
    newlist.append(item)

print(newlist)

[<html>
<head><meta charset="utf-8"/>
<title>Basic Web Page</title>
</head>
<body>
<h1>HTML syntax summary</h1>
<h2>or, all you need to know about HTML</h2>
<p>This is how you write an HTML document.</p>
<p>The end.</p>
<br/>
Some links:<br/>
<ul>
<li><a class="website" href="https://google.com" id="link1">Google</a></li>
<li><a class="website" href="https://twitter.com" id="link2">Twitter</a></li>
<li><a class="website" href="https://facebook.com" id="link3">Facebook</a></li>
</ul>
</body>
</html>]


Check the type of each top-level child:

In [9]:
[type(item) for item in list(soup.children)]

[bs4.element.Tag]
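
The sample page has a single top-level `Tag`, but other pages often do not. As a hedged illustration, the sketch below parses a small inline document (an assumption, not the sample page) that starts with a doctype declaration, then filters the top-level children down to tags only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

doc = "<!DOCTYPE html>\n<html><body><p>Hello</p></body></html>"
doc_soup = BeautifulSoup(doc, "html.parser")

# Top-level children may include non-tag nodes (doctype, whitespace strings).
child_types = [type(child).__name__ for child in doc_soup.children]
print(child_types)

# Keep only Tag objects when you want elements you can search further.
top_level_tags = [child for child in doc_soup.children if isinstance(child, Tag)]
print([tag.name for tag in top_level_tags])
```

Filtering with `isinstance(child, Tag)` avoids errors from calling tag-only methods on text or doctype nodes.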

Here are some examples showing how to navigate the document data structure:

In [10]:
print('soup.title', soup.title)

print('soup.title.name', soup.title.name)

print('soup.title.string', soup.title.string)

print('soup.title.parent.name', soup.title.parent.name)

print('soup.p', soup.p)

print("soup.find_all('a')", soup.find_all('a'))

print('soup.find(id="link3")', soup.find(id="link3"))

soup.title <title>Basic Web Page</title>
soup.title.name title
soup.title.string Basic Web Page
soup.title.parent.name head
soup.p <p>This is how you write an HTML document.</p>
soup.find_all('a') [<a class="website" href="https://google.com" id="link1">Google</a>, <a class="website" href="https://twitter.com" id="link2">Twitter</a>, <a class="website" href="https://facebook.com" id="link3">Facebook</a>]
soup.find(id="link3") <a class="website" href="https://facebook.com" id="link3">Facebook</a>


#### Finding Instances of a Tag

All instances of a given tag can be extracted from a page with Beautiful Soup's `find_all` method, which returns them as a list.

In [11]:
soup.find_all('p')

[<p>This is how you write an HTML document.</p>, <p>The end.</p>]

In [12]:
soup.find_all('a')

[<a class="website" href="https://google.com" id="link1">Google</a>,
 <a class="website" href="https://twitter.com" id="link2">Twitter</a>,
 <a class="website" href="https://facebook.com" id="link3">Facebook</a>]

Note: The results above are returned in a list, so we can access individual elements using indexing.


In [13]:
soup.find_all('p')[0].get_text()

'This is how you write an HTML document.'

In [14]:
soup.find_all('a')[1].get_text()

'Twitter'
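
When only the first match is needed, `find` is a shortcut for indexing into `find_all`, and `find_all` accepts a `limit` argument. The sketch below uses an inline fragment modeled on the sample page so it runs offline; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

sample_html = """<body>
<p>This is how you write an HTML document.</p>
<p>The end.</p>
<a class="website" href="https://google.com" id="link1">Google</a>
<a class="website" href="https://twitter.com" id="link2">Twitter</a>
</body>"""
sample = BeautifulSoup(sample_html, "html.parser")

first_p = sample.find("p")                 # first match only
two_links = sample.find_all("a", limit=2)  # stop after two matches
missing = sample.find("table")             # no <table> in this fragment

print(first_p.get_text())
print([a.get_text() for a in two_links])
print(missing)
```

Note that `find` returns `None` when nothing matches, while `find_all` returns an empty list, so it is worth checking the result of `find` before calling methods on it.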

#### Finding Tags by Class or ID

HTML elements often carry `class` and `id` attributes, which CSS selectors use to target them. These attributes are useful handles when working with web data in `BeautifulSoup`: pass `class_` (with a trailing underscore, because `class` is a reserved word in Python) or `id` as keyword arguments to the `find_all()` method.

Find all of the instances of the class *website*. 

In [15]:
soup.find_all(class_='website')

[<a class="website" href="https://google.com" id="link1">Google</a>,
 <a class="website" href="https://twitter.com" id="link2">Twitter</a>,
 <a class="website" href="https://facebook.com" id="link3">Facebook</a>]

You can also search for the class *website* only within `<a>` tags.

In [16]:
soup.find_all('a', class_='website')

[<a class="website" href="https://google.com" id="link1">Google</a>,
 <a class="website" href="https://twitter.com" id="link2">Twitter</a>,
 <a class="website" href="https://facebook.com" id="link3">Facebook</a>]
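
The same lookups can also be written as CSS selectors with `select` and `select_one`. This is a sketch against an inline copy of the link list from the sample page; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

links_html = """<ul>
<li><a class="website" href="https://google.com" id="link1">Google</a></li>
<li><a class="website" href="https://twitter.com" id="link2">Twitter</a></li>
<li><a class="website" href="https://facebook.com" id="link3">Facebook</a></li>
</ul>"""
links_soup = BeautifulSoup(links_html, "html.parser")

by_class = links_soup.select("a.website")   # all <a> tags with class "website"
by_id = links_soup.select_one("#link3")     # the single element with id "link3"
hrefs = [a["href"] for a in by_class]       # pull the link targets

print(hrefs)
print(by_id.get_text())
```

Selectors become more convenient than `find_all` once you need to express nesting, such as "an `a` inside an `li` inside a `ul`" (`ul li a`).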

### Practice Exercise


#### Web Scraping Example: National Weather Service

Load the following page from the National Weather Service:

https://forecast.weather.gov/MapClick.php?lat=43.042&lon=-87.9069#.XbXtPEU3ny8

<img src="national_weather_service_2019-10-27_mke.png" width="600px"/>
    
Feel free to try another location, but you usually don't have this nice of weather in Wisconsin this time of year!    
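
To try another location, it helps to see how the URL is put together. The sketch below uses only the standard library to split the URL above into its parts and to build a new one; the helper name `forecast_url` is hypothetical, and the second pair of coordinates is an approximate, illustrative location.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

url = "https://forecast.weather.gov/MapClick.php?lat=43.042&lon=-87.9069#.XbXtPEU3ny8"
parts = urlsplit(url)

print(parts.query)            # the query string carries the latitude and longitude
print(parts.fragment)         # the part after "#" is not sent to the server
print(parse_qs(parts.query))  # query parameters as a dict of lists


def forecast_url(lat, lon):
    """Build a forecast URL for the given coordinates (same page, new query)."""
    base = "https://forecast.weather.gov/MapClick.php"
    return base + "?" + urlencode({"lat": lat, "lon": lon})


# Approximate coordinates chosen only for illustration.
print(forecast_url(41.8781, -87.6298))
```

Because the fragment is handled by the browser and never reaches the server, it can be dropped when requesting the page from Python.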

#### Explore the page structure

Prior to scraping a web page, you need to review its structure to identify the elements from which to extract data.

The screenshot below uses [Chrome Developer Tools](https://developer.chrome.com/devtools).

From the menu (upper-right: 3 vertical dots) select *More Tools -> Developer Tools* then select *Elements* to view HTML elements.

<img src="chrome_developer_tools_2019-10-27.png" width="600px"/>

*Note: Other browsers have similar tools.*

The elements panel will show the HTML tags on the page and let you navigate through nested children. 

Any element can be selected and inspected. Open and inspect the *seven-day-forecast* div.

<img src="seven-day-forecast.png" width="600px"/>


#### On your own: Scrape the forecast

- Download the web page containing the forecast:  
  `page = requests.get("https://forecast.weather.gov/MapClick.php?lat=43.042&lon=-87.9069#.XbXtPEU3ny8")`   
- Create a `BeautifulSoup` object to parse the page.  
- Find the `div` with id `seven-day-forecast`, and assign it to the variable `seven_day`.  
- Inside `seven_day`, find each individual forecast item.  
- Extract and print the first forecast item.  

In [26]:
# Answer:

import requests
from bs4 import BeautifulSoup

url = 'https://forecast.weather.gov/MapClick.php?lat=43.042&lon=-87.9069#.XhkSYchKiCp'

page = requests.get(url)

if page.ok:
    soup = BeautifulSoup(page.content, 'html.parser')
    seven_day = soup.find('div', id='seven-day-forecast')
    # print(seven_day) # optional print statement
    forecast_items = seven_day.find_all('li', class_='forecast-tombstone')
    # print(forecast_items) # optional print statement
    tonight = forecast_items[0]
    print(tonight)
    print('available text:', tonight.text)
else:
    print('status code:', page.status_code)

<li class="forecast-tombstone">
<div class="tombstone-container">
<p class="period-name">Tonight<br/><br/></p>
<p><img alt="Tonight: Rain and snow, possibly mixed with sleet before 2am, then snow and sleet, possibly mixed with freezing rain between 2am and 3am, then snow after 3am.  Patchy blowing snow after 3am. Low around 29. Windy, with a north wind 15 to 20 mph increasing to 25 to 30 mph after midnight. Winds could gust as high as 40 mph.  Chance of precipitation is 100%. Little or no ice accumulation expected.  Total nighttime snow and sleet accumulation of 1 to 2 inches possible. " class="forecast-icon" src="DualImage.php?i=nra&amp;j=nfzra_sn&amp;ip=100&amp;jp=100" title="Tonight: Rain and snow, possibly mixed with sleet before 2am, then snow and sleet, possibly mixed with freezing rain between 2am and 3am, then snow after 3am.  Patchy blowing snow after 3am. Low around 29. Windy, with a north wind 15 to 20 mph increasing to 25 to 30 mph after midnight. Winds could gust as high a
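
Page layouts change, and `find` returns `None` when an element is missing, so a scraper benefits from checking for the expected markup before drilling into it. The sketch below is one possible guard, run against a hypothetical, heavily simplified stand-in for the forecast page rather than the live site; the function name `first_forecast_item` is an assumption.

```python
from bs4 import BeautifulSoup


def first_forecast_item(html):
    """Return the first forecast item Tag, or None if the expected markup is absent."""
    soup = BeautifulSoup(html, "html.parser")
    seven_day = soup.find("div", id="seven-day-forecast")
    if seven_day is None:
        return None
    return seven_day.find("li", class_="forecast-tombstone")


# Hypothetical, simplified stand-in for the forecast page.
fake_page = """<div id="seven-day-forecast"><ul>
<li class="forecast-tombstone"><p class="period-name">Tonight</p></li>
<li class="forecast-tombstone"><p class="period-name">Saturday</p></li>
</ul></div>"""

item = first_forecast_item(fake_page)
print(item.find(class_="period-name").get_text())
print(first_forecast_item("<html><body>No forecast here</body></html>"))
```

Returning `None` lets the caller decide how to react, instead of failing with an `AttributeError` deep inside the parsing code.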

### Additional tutorial information, optional

#### Extracting information 

We're interested in tonight's forecast. 

There are four pieces of information we can extract:  
- The name of the forecast item, in this case `Tonight`.  
- The full description of the conditions, which is stored in the `title` attribute of the `img` tag.  
- A short description of the conditions.  
- The low temperature.  

First, extract the name of the forecast item, the short description, and the temperature.

In [27]:
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()
print(period)
print(short_desc)
print(temp)

Tonight
Rain andWindy thenWintry Mixand PatchyBlowing Snow
Low: 29 °F
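
Notice that words in the short description run together (`Rain andWindy`). A likely cause, assumed here rather than confirmed from the page source, is that the text is broken up by `<br>` tags, which `get_text()` drops without adding whitespace. The sketch below reproduces the effect on an inline snippet and shows the `separator` and `strip` arguments of `get_text` as one possible fix.

```python
from bs4 import BeautifulSoup

# Assumed markup: line breaks inside the description, as the fused words suggest.
snippet = '<p class="short-desc">Rain and<br/>Windy then<br/>Wintry Mix</p>'
desc_tag = BeautifulSoup(snippet, "html.parser").find(class_="short-desc")

fused = desc_tag.get_text()                             # pieces joined with nothing
spaced = desc_tag.get_text(separator=" ", strip=True)   # pieces joined with a space

print(fused)
print(spaced)
```

The same arguments can be passed wherever `get_text()` is called in this notebook.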


Extract the `title` attribute from the `img` tag. A `BeautifulSoup` tag object can be treated like a dictionary: pass the name of the attribute we want as the key.


In [28]:
img = tonight.find("img")
desc = img['title']
print(desc)

Tonight: Rain and snow, possibly mixed with sleet before 2am, then snow and sleet, possibly mixed with freezing rain between 2am and 3am, then snow after 3am.  Patchy blowing snow after 3am. Low around 29. Windy, with a north wind 15 to 20 mph increasing to 25 to 30 mph after midnight. Winds could gust as high as 40 mph.  Chance of precipitation is 100%. Little or no ice accumulation expected.  Total nighttime snow and sleet accumulation of 1 to 2 inches possible. 
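
Dictionary-style access raises a `KeyError` when the attribute is missing, so the `get` method is a safer choice for attributes that may not be present. A minimal sketch on a shortened, made-up `img` tag:

```python
from bs4 import BeautifulSoup

# Shortened, made-up tag for illustration.
img_html = '<img alt="Tonight: Snow." class="forecast-icon" src="icon.png" title="Tonight: Snow."/>'
img_tag = BeautifulSoup(img_html, "html.parser").find("img")

print(img_tag["title"])                 # raises KeyError if the attribute is absent
print(img_tag.get("width"))             # returns None instead of raising
print(img_tag.get("width", "unknown"))  # or a default of your choosing
print(img_tag["class"])                 # class can hold several values, so it is a list
print(sorted(img_tag.attrs))            # all attribute names on the tag
```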


#### Extracting all the information from the page

Use a CSS selector to select all items with the class `period-name` inside an item with the class `tombstone-container` from `seven_day`.

We can then use a list comprehension to call the `get_text` method on each matching tag.


In [29]:
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

['Tonight',
 'Saturday',
 'SaturdayNight',
 'Sunday',
 'SundayNight',
 'Monday',
 'MondayNight',
 'Tuesday',
 'TuesdayNight']

We can apply the same technique to get the other 3 fields.

In [30]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
print(short_descs)

temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
print(temps)

descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
print(descs)

['Rain andWindy thenWintry Mixand PatchyBlowing Snow', 'Snow andPatchyBlowing Snow', 'Heavy Snowand AreasBlowing Snow', 'Partly Sunny', 'Chance Snow', 'Partly Sunny', 'Mostly Cloudythen ChanceRain/Snow', 'ChanceRain/Snowthen PartlySunny', 'Mostly Cloudy']
['Low: 29 °F', 'High: 31 °F', 'Low: 17 °F', 'High: 28 °F', 'Low: 24 °F', 'High: 34 °F', 'Low: 28 °F⇑', 'High: 41 °F', 'Low: 23 °F']
['Tonight: Rain and snow, possibly mixed with sleet before 2am, then snow and sleet, possibly mixed with freezing rain between 2am and 3am, then snow after 3am.  Patchy blowing snow after 3am. Low around 29. Windy, with a north wind 15 to 20 mph increasing to 25 to 30 mph after midnight. Winds could gust as high as 40 mph.  Chance of precipitation is 100%. Little or no ice accumulation expected.  Total nighttime snow and sleet accumulation of 1 to 2 inches possible. ', 'Saturday: Snow.  Patchy blowing snow. High near 31. Windy, with a northeast wind around 30 mph, with gusts as high as 45 mph.  Chance of 
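
Building each list with a separate selector works as long as every forecast item contains every field. If one item lacked, say, a temperature, the lists would have different lengths and no longer line up. An alternative sketch, shown here on hypothetical simplified markup, is to loop over the containers and build one record per item; the helper name `text_or_none` is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the second item has no temperature.
forecast_html = """<div id="seven-day-forecast">
<div class="tombstone-container"><p class="period-name">Tonight</p>
<p class="short-desc">Snow</p><p class="temp">Low: 29 °F</p></div>
<div class="tombstone-container"><p class="period-name">Notice</p>
<p class="short-desc">Advisory</p></div>
</div>"""
container = BeautifulSoup(forecast_html, "html.parser")


def text_or_none(parent, selector):
    """Return the text of the first match for a CSS selector, or None."""
    match = parent.select_one(selector)
    return match.get_text(" ", strip=True) if match is not None else None


records = [
    {
        "period": text_or_none(item, ".period-name"),
        "short_desc": text_or_none(item, ".short-desc"),
        "temp": text_or_none(item, ".temp"),
    }
    for item in container.select(".tombstone-container")
]
print(records)
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame`, with missing fields showing up as empty values.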

#### Integrating Scraped Data in a Pandas DataFrame

We can integrate the data into a Pandas DataFrame for analysis. 


In [32]:
import pandas as pd
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc":descs
})
weather

| | period | short_desc | temp | desc |
|---|---|---|---|---|
| 0 | Tonight | Rain andWindy thenWintry Mixand PatchyBlowing ... | Low: 29 °F | Tonight: Rain and snow, possibly mixed with sl... |
| 1 | Saturday | Snow andPatchyBlowing Snow | High: 31 °F | Saturday: Snow. Patchy blowing snow. High nea... |
| 2 | SaturdayNight | Heavy Snowand AreasBlowing Snow | Low: 17 °F | Saturday Night: Snow, mainly before 3am. The s... |
| 3 | Sunday | Partly Sunny | High: 28 °F | Sunday: Partly sunny, with a high near 28. Nor... |
| 4 | SundayNight | Chance Snow | Low: 24 °F | Sunday Night: A 30 percent chance of snow. Cl... |
| 5 | Monday | Partly Sunny | High: 34 °F | Monday: Partly sunny, with a high near 34. Wes... |
| 6 | MondayNight | Mostly Cloudythen ChanceRain/Snow | Low: 28 °F⇑ | Monday Night: A chance of snow after midnight,... |
| 7 | Tuesday | ChanceRain/Snowthen PartlySunny | High: 41 °F | Tuesday: A chance of rain and snow before 9am,... |
| 8 | TuesdayNight | Mostly Cloudy | Low: 23 °F | Tuesday Night: Mostly cloudy, with a low aroun... |
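
With the data in a DataFrame, the temperature strings can be turned into numbers for analysis. The sketch below works on a three-row sample copied from the output above, so it runs without scraping; the column names `temp_num` and `is_night` are illustrative choices.

```python
import pandas as pd

sample = pd.DataFrame({
    "period": ["Tonight", "Saturday", "MondayNight"],
    "temp": ["Low: 29 °F", "High: 31 °F", "Low: 28 °F⇑"],
})

# Pull the first run of digits (with an optional minus sign) out of each string.
sample["temp_num"] = sample["temp"].str.extract(r"(-?\d+)", expand=False).astype(int)
# Rows whose label starts with "Low" are overnight forecasts.
sample["is_night"] = sample["temp"].str.startswith("Low")

print(sample)
print("Mean overnight low:", sample.loc[sample["is_night"], "temp_num"].mean())
```

The same two lines can be applied to the full `weather` DataFrame, after which the numeric column supports the usual operations such as `mean`, sorting, and filtering.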
