### Practice - Python for data scientists II - accessing data - scraping

Answer all **Questions**

As a data scientist, you'll often have access to data through a database, a CSV file, or an Application Programming Interface (API). However, there are times when the data you want can only be accessed as part of a web page. In this case you can use a technique called *web scraping* to get the data from the web page and format it for further analysis.

In this notebook, we will go over how to work with the `Requests` and `BeautifulSoup` Python libraries in order to make use of data from web pages. The `Requests` module lets you integrate your Python programs with web services, while the `BeautifulSoup` module is designed to facilitate extracting data from parsed HTML. 

After an introduction to each library, we'll scrape weather forecasts from the National Weather Service using `Python 3`, the `Requests` and `BeautifulSoup` libraries, and then load the data into a `Pandas` dataframe for further analysis.

[Scrapy](https://scrapy.org/) is another web scraping library that also includes a web crawling pipeline. It's a little more difficult to use, but provides greater functionality.

Prerequisites:    
- Python 3 programming    
- Pandas   
- Basic HTML, CSS tag knowledge  
- Basic understanding of HTTP requests   

References:    
Beautiful Soup Documentation  
- https://www.crummy.com/software/BeautifulSoup/bs4/doc/#   

How To Work with Web Data Using Requests and Beautiful Soup with Python 3  
- https://www.digitalocean.com/community/tutorials/how-to-work-with-web-data-using-requests-and-beautiful-soup-with-python-3

Tutorial: Python Web Scraping Using BeautifulSoup   
- https://www.dataquest.io/blog/web-scraping-tutorial-python/   

StackOverflow, Difference between BeautifulSoup and Scrapy crawler?         
- https://stackoverflow.com/questions/19687421/difference-between-beautifulsoup-and-scrapy-crawler   


# In addition to Python 3, make sure the following libraries are installed
# Note: This is a non-executable cell

$ pip install pandas   
$ pip install requests
$ pip install beautifulsoup4

### The requests library

The `requests` library is the de facto standard for making HTTP requests in Python. It is a third-party package (not part of the Python standard library) that abstracts the complexities of making HTTP requests behind a relatively simple API.

Calling `requests.get()` sends a `GET` request to a web server, which returns an `HTTP` response containing the HTML contents of the requested web page. `GET` is only one of several HTTP methods that `requests` supports. Information on other HTTP methods can be found here:  
https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods 
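
To see what a `GET` request consists of without sending anything over the network, the following minimal sketch builds one with `requests.Request` and inspects the prepared result. The query parameters are taken from the National Weather Service URL used later in this notebook; the variable names are illustrative.

```python
import requests

# Build (but do not send) a GET request. The query-string values come from
# the forecast URL used later in this notebook.
request = requests.Request(
    "GET",
    "https://forecast.weather.gov/MapClick.php",
    params={"lat": "43.042", "lon": "-87.9069"},
)
prepared = request.prepare()

print(prepared.method)  # the HTTP method
print(prepared.url)     # the full URL, including the encoded query string
```

Calling `requests.get(url, params=...)` prepares a request like this one and sends it in a single step.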

Using `requests`, download the sample web page below.

Feel free to try other web pages.

In [0]:
import requests
page = requests.get("http://jayurbain.com/simplehtml.html")
page

After running the HTTP GET request, we get a `Response` object. This object has an HTTP response `status_code` property, which indicates whether the page was downloaded successfully.

More on response codes: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status


In [0]:
page.status_code

A `status_code` of $200$ means that the page downloaded successfully. In general, a status code starting with 2 indicates success, a code starting with 4 indicates a client (request) error, and a code starting with 5 indicates a server error.
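
The status code ranges described above can be sketched as a small helper. The function `describe_status` is hypothetical (it is not part of `requests`) and only illustrates the grouping by leading digit.

```python
def describe_status(status_code):
    """Map an HTTP status code to a coarse category (hypothetical helper)."""
    if 200 <= status_code < 300:
        return "success"
    if 300 <= status_code < 400:
        return "redirection"
    if 400 <= status_code < 500:
        return "client error"
    if 500 <= status_code < 600:
        return "server error"
    return "other"


for code in (200, 404, 503):
    print(code, describe_status(code))
```

In practice, a `Response` object also provides `raise_for_status()`, which raises an `HTTPError` for 4xx and 5xx codes, so a script can stop early instead of parsing an error page.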

Use the response object to print the HTML content of the page using the `content` property:

In [0]:
page.content

### Parsing a web page with BeautifulSoup    

Once the page is downloaded, the document needs to be parsed so we can extract data.

[`Beautiful Soup`](https://www.crummy.com/software/BeautifulSoup/) creates a parse tree for parsed documents that can be used to extract data from HTML for web scraping.

The `BeautifulSoup` library can be used to parse the document downloaded above.

Note: `bs4` is the package name for Beautiful Soup 4, the current major version of the library.    
https://www.crummy.com/software/BeautifulSoup/bs4/doc/


In [0]:
from bs4 import BeautifulSoup    
soup = BeautifulSoup(page.content, 'html.parser')
type(soup)

We can format and print the HTML content of the page, using the `prettify` method on the `BeautifulSoup` object.

In [0]:
print(soup.prettify())

Web page documents are represented as a logical tree defined by the Document Object Model (DOM). Each branch of the tree ends in a node, and each node contains objects. Some nodes contain other nodes; others are terminal (leaf) nodes.

References:  
- https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model  
- https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Introduction  

Once parsed, `BeautifulSoup` has methods that allow navigation of the nested tree structure one level at a time, or extraction of specific node elements.

For example, we can first select all the elements at the top level of the page using the `children` property of `BeautifulSoup`. 

Note: `children` returns an iterator rather than a list, so we pass it to the `list` function to see all of the items at once.

In [0]:
list(soup.children)

The types of the top-level children:

In [0]:
[type(item) for item in soup.children]
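
As a minimal offline sketch of the same idea, the block below parses a small made-up HTML string (not the sample page above) and inspects its top-level children. The names `sample_soup`, `top_level`, and `tags` are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# A small, made-up HTML document used only for illustration.
html = """<!DOCTYPE html>
<html>
<head><title>Sample page</title></head>
<body><p>Hello</p></body>
</html>"""

sample_soup = BeautifulSoup(html, "html.parser")

# .children is an iterator, so materialize it before indexing or reusing it.
top_level = list(sample_soup.children)
type_names = [type(item).__name__ for item in top_level]
print(type_names)

# Keep only Tag objects, which are the nodes that can be searched further.
tags = [item for item in top_level if isinstance(item, Tag)]
print([tag.name for tag in tags])
```

Filtering on `Tag` is useful because the doctype declaration and whitespace between elements also appear as children but cannot be navigated any further.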

Here are some examples showing how to navigate the document data structure:

In [0]:
print('soup.title', soup.title)

print('soup.title.name', soup.title.name)

print('soup.title.string', soup.title.string)

print('soup.title.parent.name', soup.title.parent.name)

print('soup.p', soup.p)

print("soup.find_all('a')", soup.find_all('a'))

print('soup.find(id="link3")', soup.find(id="link3"))

#### Finding Instances of a Tag

Tags of a single type can be extracted from a page by using Beautiful Soup's `find_all` method, which returns all instances of a given tag within a document.

In [0]:
soup.find_all('p')

In [0]:
soup.find_all('a')

Note: `find_all` returns its results in a list, so we can access individual elements by index.


In [0]:
soup.find_all('p')[0].get_text()

In [0]:
soup.find_all('a')[1].get_text()
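
The following sketch repeats these steps on a made-up HTML string, so it can be run without downloading anything. It also contrasts `find_all` with `find`, which returns only the first match. The HTML content and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; it is not the sample page used above.
html = """
<body>
  <p>First paragraph</p>
  <p>Second paragraph with a <a href="https://example.com">link</a></p>
</body>
"""
sample_soup = BeautifulSoup(html, "html.parser")

paragraphs = sample_soup.find_all("p")
print(len(paragraphs))
print(paragraphs[0].get_text())

# get_text() includes the text of nested tags.
print(paragraphs[1].get_text())

# find() returns only the first match, or None when nothing matches.
print(sample_soup.find("a").get_text())
print(sample_soup.find("table"))
```

Because `find` can return `None`, it is worth checking the result before calling `get_text()` on pages whose structure you do not control.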

#### Finding Tags by Class or ID

HTML elements often carry `class` and `id` attributes, which CSS uses to select them. These attributes are equally useful for locating elements with `BeautifulSoup`: specific classes and IDs can be selected by passing them to the `find_all()` method as the `class_` and `id` keyword arguments.

Find all of the instances of the class *website*. 

In [0]:
soup.find_all(class_='website')

You can also search for the class *website* only within `<a>` tags.

In [0]:
soup.find_all('a', class_='website')
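
A minimal sketch of class and ID searches on a made-up HTML string follows. The class names, IDs, and URLs are hypothetical; the last two lines show the same searches written as CSS selectors with `select` and `select_one`.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the class and id values are hypothetical.
html = """
<p class="website">A paragraph with the class</p>
<a class="website" id="link1" href="https://example.com/one">One</a>
<a class="other" id="link2" href="https://example.com/two">Two</a>
"""
sample_soup = BeautifulSoup(html, "html.parser")

# class_ has a trailing underscore because class is a reserved word in Python.
any_tag = sample_soup.find_all(class_="website")
only_links = sample_soup.find_all("a", class_="website")
print([tag.name for tag in any_tag])
print([tag.name for tag in only_links])

# IDs are matched with the id keyword argument.
print(sample_soup.find(id="link2").get_text())

# The same searches written as CSS selectors.
print([tag.name for tag in sample_soup.select("a.website")])
print(sample_soup.select_one("#link2")["href"])
```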

### Practice Exercise


#### Web Scraping Example: National Weather Service

Load the following page from the National Weather Service:

https://forecast.weather.gov/MapClick.php?lat=43.042&lon=-87.9069#.XbXtPEU3ny8

<img src="national_weather_service_2019-10-27_mke.png" width="600px"/>
    
Feel free to try another location, but you usually don't have this nice of weather in Wisconsin this time of year!    

#### Explore the page structure

Prior to scraping a web page, you need to review its structure to identify the elements from which to extract data.

The screenshot below uses [Chrome Developer Tools](https://developer.chrome.com/devtools).

From the menu (upper-right: 3 vertical dots) select *More Tools -> Developer Tools* then select *Elements* to view HTML elements.

<img src="chrome_developer_tools_2019-10-27.png" width="600px"/>

*Note: Other browsers have similar tools.*

The elements panel will show the HTML tags on the page and let you navigate through nested children. 

Any element can be selected and inspected. Open and inspect the *seven-day-forecast* div.

<img src="seven-day-forecast.png" width="600px"/>


#### On your own: Scrape the forecast

- Download the web page containing the forecast.  
`page = requests.get("https://forecast.weather.gov/MapClick.php?lat=43.042&lon=-87.9069#.XbXtPEU3ny8")`   
- Create a `BeautifulSoup` object to parse the page.  
- Find the `div` with id `seven-day-forecast`, and assign it to the variable `seven_day`.  
- Inside `seven_day`, find each individual forecast item.  
- Extract and print the first forecast item.  

In [0]:
# Answer:
    

### Additional tutorial information, optional

#### Extracting information 

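We're interested in tonight's forecast, which is the first forecast item extracted in the exercise above. The code below assumes that item has been assigned to a variable named `tonight`.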

There are 4 pieces of information we can extract:  
- The name of the forecast item — in this case, `Tonight`.  
- The description of the conditions, which is stored in the `title` attribute of the `img` tag.  
- A short description of the conditions.  
- The low temperature.  

First, extract the name of the forecast item, the short description, and the temperature.

In [0]:
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()
print(period)
print(short_desc)
print(temp)

Extract the `title` attribute from the `img` tag. A `BeautifulSoup` `Tag` object can be treated like a dictionary: pass the name of the attribute you want as the key.


In [0]:
img = tonight.find("img")
desc = img['title']
print(desc)
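
The dictionary-style access can be tried offline with the sketch below. The HTML fragment is made up to imitate one forecast item using the class names mentioned in this section; the forecast values, the `wind` class, and the variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating one forecast item; the values are invented.
html = """
<div class="tombstone-container">
  <p class="period-name">Tonight</p>
  <p><img src="night.png" title="Tonight: Mostly clear, with a low around 42."></p>
  <p class="short-desc">Mostly Clear</p>
  <p class="temp">Low: 42 °F</p>
</div>
"""
sample_soup = BeautifulSoup(html, "html.parser")
sample_item = sample_soup.find(class_="tombstone-container")
sample_img = sample_item.find("img")

print(sample_img["title"])     # raises KeyError if the attribute is missing
print(sample_img.get("title")) # returns None if the attribute is missing
print(sample_img.get("alt"))   # this made-up tag has no alt attribute
print(sample_img.attrs)        # all attributes as a plain dictionary

# find() returns None for a class that is absent, so guard before get_text().
wind_tag = sample_item.find(class_="wind")  # hypothetical class, not present
wind = wind_tag.get_text() if wind_tag is not None else None
print(wind)
```

Using `.get()` and a `None` check is a defensive choice: forecast items on a live page may not all carry the same attributes.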

#### Extracting all the information from the page

From `seven_day`, select all items with the class `period-name` that sit inside an item with the class `tombstone-container`.

We can then use a list comprehension to call the `get_text` method on each selected tag.


In [0]:
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

We can apply the same technique to get the other 3 fields.

In [0]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
print(short_descs)

temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
print(temps)

descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
print(descs)

#### Integrating Scraped Data in a Pandas DataFrame

We can integrate the data into a Pandas DataFrame for analysis. 


In [0]:
import pandas as pd
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc": descs,
})
weather
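
As one possible next step for analysis, the sketch below pulls a numeric temperature out of the text column. It uses made-up rows and assumes the temperature strings look like `Low: 42 °F` or `High: 55 °F`; the column names `temp_num` and `is_night` are illustrative.

```python
import pandas as pd

# Made-up rows in the same shape as the scraped data; values are invented.
sample_weather = pd.DataFrame({
    "period": ["Tonight", "Monday"],
    "temp": ["Low: 42 °F", "High: 55 °F"],
})

# Pull the (optionally negative) integer out of each temperature string.
temp_num = sample_weather["temp"].str.extract(r"(-?\d+)", expand=False)
sample_weather["temp_num"] = temp_num.astype(int)

# Flag overnight periods, whose temperature strings start with "Low".
sample_weather["is_night"] = sample_weather["temp"].str.startswith("Low")

print(sample_weather)
print(sample_weather["temp_num"].mean())
```

Converting the temperatures to integers makes numeric operations such as `mean()` or filtering by threshold possible, which the raw strings do not support.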