# <center>Web Scraping</center>

References: 
 - https://www.dataquest.io/blog/web-scraping-tutorial-python/
 - https://www.crummy.com/software/BeautifulSoup/bs4/doc/#

## 1. Different ways to access data on the web
 - Scrape HTML web pages 
 - Download data file directly 
    * data files such as csv, txt
    * pdf files
 - Access data through Application Programming Interface (API), e.g. The Movie DB, Twitter

## 2. Scrape HTML web pages using BeautifulSoup

### 2.1. Basic structure of HTML pages ###
* HTML tags: <font color="green">head</font>, <font color="green">body</font>, <font color="green">p</font>, <font color="green">a</font>, <font color="green">form</font>, <font color="green">table</font>, ...
* A tag may have properties. 
  * For example, tag <font color="green">a</font> has property (or attribute) <font color="green">href</font>, the target of the link
  *  <font color="green">class</font> and <font color="green">id</font> are special properties used by html to control the style of each element through Cascading Style Sheets (CSS). <font color="green">id</font> is the unique identifier of an element, and <font color="green">class</font> is used to group elements for styling. 
* Tags can be referenced by their position relative to one another
  * **child** – a child is a tag inside another tag, e.g. the two <font color="green">p</font> tags are children of the <font color="green">div</font> tag.
  * **parent** – a parent is the tag another tag is inside, e.g. the <font color="green">html</font> tag is the parent of the <font color="green">body</font> tag.
  * **sibling** – a sibling is a tag that has the same parent as another tag, e.g. in the html example, the <font color="green">head</font> and <font color="green">body</font> tags are siblings, since they’re both inside <font color="green">html</font>. Both outer <font color="green">p</font> tags are siblings, since they’re both directly inside <font color="green">body</font>.
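
As a hedged illustration of these relationships, the sketch below parses a small HTML string with BeautifulSoup (the library is introduced in section 2.2). The markup is an assumption: it is written to match the page displayed below and the tag names used in this section, and it is not copied from any real site.

```python
from bs4 import BeautifulSoup

# Assumed sample markup (hypothetical, for illustration only)
sample_html = """
<html>
  <head><title>A simple example page</title></head>
  <body>
    <div>
      <p class="inner-text first-item" id="first">First paragraph.</p>
      <p class="inner-text">Second paragraph.</p>
    </div>
    <p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
    <p class="outer-text"><b>Second outer paragraph.</b></p>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# child: the two inner p tags are direct children of the div tag
div = soup.find("div")
inner_paragraphs = div.find_all("p", recursive=False)
print([p.get_text() for p in inner_paragraphs])

# parent: html is the parent of body
print(soup.body.parent.name)

# sibling: head and body share the same parent
print(soup.head.parent is soup.body.parent)
```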

Web page displayed:
--------------------

First paragraph.

Second paragraph.

**First outer paragraph.**

**Second outer paragraph.**

### 2.2. Basic steps of scraping a web page using package BeautifulSoup ###
  1. Install modules <font color="green">requests</font> and <font color="green">BeautifulSoup4</font>.
      * **requests**: allows you to send HTTP/1.1 requests using Python. To install:
          - Open a terminal (Mac) or Anaconda Command Prompt (Windows)
          - Issue: pip install requests
      * **BeautifulSoup**: a web page parsing library. To install, use: pip install beautifulsoup4
  2. Open the **source code of the web page** to find the HTML elements that you will scrape
      * Firefox: right click on the web page and select "View Page Source"
      * Safari: see the instructions at http://ccm.net/faq/33026-safari-view-the-source-code-of-a-webpage
      * Internet Explorer: see the instructions at https://www.computerhope.com/issues/ch000746.htm
  3. Use the <font color="green">**requests**</font> library to retrieve the source code
  4. Use a library to parse the source code. Available libraries:
      * <font color="green">BeautifulSoup</font>
      * <font color="green">lxml</font>: another good library for web page scraping
      * ...
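
The retrieve and parse steps above can be sketched as two small functions. This is a minimal sketch, not a definitive implementation: the helper names `fetch_html` and `parse_title` are hypothetical, and the demonstration runs on a literal string so that it works without network access.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Retrieve the page source; raises an exception on 4xx/5xx responses."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def parse_title(html):
    """Parse the source and pull out one element, here the <title> text."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text() if soup.title is not None else None


# Offline demonstration on a literal string (hypothetical markup).
# With network access one could call parse_title(fetch_html(some_url)).
print(parse_title("<html><head><title>Demo</title></head><body></body></html>"))
```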

### 2.3. Scrape the sample html using BeautifulSoup ###
- Kinds of Objects in BeautifulSoup
  * <font color="green">**Tag**</font>: an XML or HTML tag
  * <font color="green">**Name**</font>: every tag has a name
  * <font color="green">**Attributes**</font>: a tag may have any number of attributes. A tag's attributes are accessed like a **dictionary** of the form {attribute1_name:attribute1_value, attribute2_name:attribute2_value, ...}. If an attribute has multiple values (e.g. <font color="green">class</font>), the value is stored as a list
  * <font color="green">**NavigableString**</font>: the text within a tag
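
A minimal sketch of these object kinds, assuming a one-line HTML snippet made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document
html = '<p class="inner-text first-item" id="first">First paragraph.</p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p
print(type(tag).__name__)    # the object kind: Tag
print(tag.name)              # the tag's name
print(tag.attrs)             # all attributes, as a dictionary
print(tag["class"])          # multi-valued attribute, stored as a list
print(tag["id"])             # single-valued attribute, stored as a string

text = tag.string            # the text within the tag
print(type(text).__name__)   # the object kind: NavigableString
print(str(text))
```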

In [None]:
# Exercise 2.3.1. Import requests and beautifulsoup packages

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# import requests package
import requests                   

# import BeautifulSoup from package bs4 (i.e. beautifulsoup4)
from bs4 import BeautifulSoup     

In [None]:
# Exercise 2.3.2. Get web page content

# send a get request to the web page
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")    

# status_code 200 indicates success. 
# 4xx and 5xx status codes indicate a failure (client or server error)
if page.status_code==200:   
    
    # content property gives the content returned in bytes 
    print(page.content)          

**Basics of HTTP (Hypertext Transfer Protocol)**
- HTTP is designed to enable communications between clients (e.g. browser) and servers (e.g. Apache web server).
- A client submits an HTTP **request** to the server; then the server returns a **response** to the client. 
- Two commonly used methods for a request-response between a client and server:
  - GET - Requests data from a specified resource
  - POST - Submits data to be processed to a specified resource
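
To make the difference between GET and POST concrete, the sketch below builds one request of each kind with the `requests` library and inspects it without sending anything. The URL `http://example.com/search` and the parameter `q` are hypothetical placeholders.

```python
import requests

# Hypothetical endpoint; the requests are only prepared, never sent
url = "http://example.com/search"

# GET: the parameters travel in the URL's query string, and there is no body
get_req = requests.Request("GET", url, params={"q": "python"}).prepare()

# POST: the data travels in the request body, and the URL stays unchanged
post_req = requests.Request("POST", url, data={"q": "python"}).prepare()

print(get_req.method, get_req.url, get_req.body)
print(post_req.method, post_req.url, post_req.body)
```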

In [None]:
# Exercise 2.3.3. Parse web page content

# Process the returned content using beautifulsoup module

# initiate a beautifulsoup object using the html source and Python’s html.parser
soup = BeautifulSoup(page.content, 'html.parser')  

# soup object stands for the **root** node of the html document tree
print("Soup object:")
# print soup object nicely
print(soup.prettify())                             


In [None]:
# soup.children returns an iterator of all children nodes
print("\nsoup children nodes:")
soup_children=soup.children
print(soup_children)

# convert to list
soup_children=list(soup.children)
print("\nlist of children of root:")
print(len(soup_children))

In [None]:
                    
# html is the only child of the root node
html=soup_children[0]    

html

In [None]:
# Exercise 2.3.4. Get head and body tag

html_children=list(html.children)
# head is the second child of html
print(html_children)
head=html_children[1]

# extract all text inside head
print("\nhead text:")
print(head.get_text())

# body is the fourth child of html
body=html_children[3]

In [None]:
# Exercise 2.3.5. Continue the navigation through html document tree 

# Task 1. get div tag inside body. 
# div is the second child of body


# Task 2: get the first p in div (2nd child of div)


In [None]:
# Exercise 2.3.6. Get details of a tag

# get the first p tag in the div of body
div=list(body.children)[1]
p=list(div.children)[1]

p

# get the details of p tag
# first, get the data type of p
print("\ndata type:")
print(type(p))
# get tag name (property of p object)
print ("\ntag name: ")     
print(p.name)

# the attributes of a tag object behave like a dictionary. 
# each attribute name of the tag is a key
# get "class" attribute

print ("\ntag class: ")
print(p["class"])

# get text of p tag
p.get_text()

In [None]:
# Exercise 2.3.7. Get siblings of p object
print("\nget siblings of the first p tag")
print(list(p.next_siblings))

# get next p tag within the div
# get the sibling next to the next sibling of p
print("\nget the 2nd p tag")
print(p.next_sibling.next_sibling)
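
Why `next_sibling.next_sibling`, and why is a tag the "2nd" or "4th" child in the exercises above? When tags sit on separate lines in the source, the parser keeps the line breaks between them as text nodes, so those also count as children and siblings. A small sketch on an assumed two-paragraph snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with line breaks between the tags
html = "<div>\n<p>First paragraph.</p>\n<p>Second paragraph.</p>\n</div>"
div = BeautifulSoup(html, "html.parser").div

children = list(div.children)
# The line breaks show up as NavigableString nodes between the Tag nodes
print([type(child).__name__ for child in children])

first_p = children[1]
print(repr(first_p.next_sibling))             # the line break after the first p
print(first_p.next_sibling.next_sibling)      # the second p tag

# find_next_sibling("p") skips the text nodes and returns the next p tag
print(first_p.find_next_sibling("p"))
```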

### 2.4. Navigating the html document tree ###
 
* Going down
  * <font color="green">**contents**</font>: get a tag's direct children as a **list**
  * <font color="green">**children**</font>: get a tag's direct children as an **iterator**
  * <font color="green">**descendants**</font>: get an iterator over all of a tag's descendants, including direct children, the children of its direct children, and so on
* Going up
  * <font color="green">**parent**</font>: get a tag's parent
  * <font color="green">**parents**</font>: get an iterator for a tag's ancestors, from the parent to the very top of the document
* Going sideways
  * <font color="green">**next_sibling**</font>: get a tag's next sibling
  * <font color="green">**previous_sibling**</font>: get a tag's previous sibling

### 2.5. Finding all tags by filters and save found tags into a list ###
* **find_all**: find **all instances of a tag** at once, e.g. find\_all('p')
* search for tags by **attributes**, e.g. find\_all(<font color="blue">**class\_**</font> ='inner-text', id="first")
* search for tags by **tag names and attributes**, e.g. find\_all('p', class\_ ='inner-text')
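
A minimal offline sketch of the three filter styles, assuming a small hypothetical snippet that reuses the class and id values from this section:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = """
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second">First outer paragraph.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# by tag name
all_p = soup.find_all("p")

# by attributes: both conditions must hold
by_attrs = soup.find_all(class_="inner-text", id="first")

# by tag name and attribute: class_ matches any one of a tag's class values
by_name_and_class = soup.find_all("p", class_="first-item")

print(len(all_p), len(by_attrs), len(by_name_and_class))
print([p["id"] for p in by_name_and_class])
```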

In [None]:
# Exercise 2.5.1. find all p tags

# find all p tags
for p in soup.find_all('p'):
    print (p)

In [None]:
# Exercise 2.5.2.  Searching for tags by attributes
# since "class" is a reserved word in Python, 
# "class_" is used to denote attribute "class"
for p in soup.find_all(class_ ='inner-text', id="first"):
    print (p )                                
    # note: p tag has two class values: inner-text and first-item. 
    # The filter matches with one of them
    

In [None]:
# Exercise 2.5.3. Searching for tags by names and attributes
for p in soup.find_all("p", class_ ='first-item', id='first'):
    print (p)
    

In [None]:
# Exercise 2.5.4. Get the details of a tag

# p is the object you found in Exercise 2.5.3

# get data type of object p
print("data type of object p", type(p))

# get tag name
print("tag's name:", p.name)

# get attributes
print("tag's class attribute:", p["class"])
print("tag's id attribute:", p["id"])

# get tag's text
print("tag's text:", p.get_text())

# the tag's text is a NavigableString
p_text=list(p.children)
print("tag's text object data type:", type(p_text[0]))


### 2.6.  Select tags into a list by CSS Selectors: select ###
* CSS selectors are used by the CSS language to specify which HTML tags to style
* Some examples:
  1. **div p** – finds all <font color="green">p</font> tags inside a <font color="green">div</font> tag.
  2. **body p b** – finds all <font color="green">b</font> tags inside <font color="green">p</font> tags within a <font color="green">body</font> tag
  3. **p.outer-text** – finds all <font color="green">p</font> tags with a <font color="green">class</font> of **outer-text**.
  4. **p#first** – finds all <font color="green">p</font> tags with an <font color="green">id</font> attribute of **first**
  5. **p[class=outer-text]** – finds all <font color="green">p</font> tags with a class attribute that is **exactly** "outer-text" (no other class). Note [ ] is the generic way to define a filter on any attribute. "." is just for "class" attribute.
  6. **p[class~=outer-text]** – finds all <font color="green">p</font> tags  with a class attribute that **contains** a value "outer-text" (it may contain other values too, equivalent to p.outer-text). 
  7. **body p.outer-text b** – finds any <font color="green">b</font> tags within <font color="green">p</font> tags with a <font color="green">class</font> of **outer-text** inside of a <font color="green">body</font> tag.
  8. **div, p** – finds all <font color="green">div</font> and <font color="green">p</font> tags (without nesting relationships). Compare it with example #1!
  9. **p.outer-text.first-item** – finds all <font color="green">p</font> tags  with **both class attribute "outer-text" and "first-item"**.
  10. What about finding all p with class "outer-text" but not class "first-item"?
        
* For details of css selectors, see https://developer.mozilla.org/en-US/docs/Learn/CSS/Introduction_to_CSS/Selectors
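
To compare several of these selectors side by side, the sketch below counts the matches for each one. The markup is an assumption, written to use the same tag, class and id names as the examples above.

```python
from bs4 import BeautifulSoup

# Assumed sample markup (hypothetical, for illustration only)
html = """
<html><body>
  <div>
    <p class="inner-text first-item" id="first">First paragraph.</p>
    <p class="inner-text">Second paragraph.</p>
  </div>
  <p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
  <p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

selectors = [
    "div p",                    # p tags nested inside a div
    "div, p",                   # every div plus every p
    "p.outer-text",             # class list contains outer-text
    "p#first",                  # id is first
    "p[class=outer-text]",      # class attribute is exactly "outer-text"
    "p[class~=outer-text]",     # class attribute contains the value outer-text
    "p.outer-text.first-item",  # both classes present
]
counts = {selector: len(soup.select(selector)) for selector in selectors}
for selector, count in counts.items():
    print(f"{selector}: {count} match(es)")
```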

In [None]:
# Exercise 2.6.1.: select p tags within div tags
# Notice the space between div and p
# This means p is a descendant of div 
# p is not necessarily a direct child of div
soup.select("div p")


In [None]:
# Exercise 2.6.2.: select b tags within p tags in the body


In [None]:
# Exercise 2.6.3.: finds all p tags with a class of outer-text
soup.select("p.outer-text")


In [None]:
# Exercise 2.6.4.: select p tags with id "first"
soup.select("p#first")


In [None]:
# Exercise 2.6.5.:find p tag within body and
# with a class attribute which is **exactly** "outer-text"

# Note: this is the generic way to set  
# a filter on any attribute

soup.select("body p[class=outer-text]")
# compare the result with # Exercise 2.6.3.

In [None]:
# Exercise 2.6.6. find p tags within body 
# which have a class attribute **containing** the value "outer-text"
# Note the use of "~". 

soup.select("body p[class~=outer-text]")

# This is equivalent to soup.select("body p.outer-text")
# However, it's a generic way to set condition 
# on any type of attributes, not just "class" attribute

soup.select("body p.outer-text")


In [None]:
# Exercise 2.6.7. select b tags within 
# p tags which have class outer-text and 
# are within the body tag


In [None]:
# Exercise 2.6.8. select all div and p tags 
# Compare the result with Exercise 2.6.1.
# "," between selectors means "or" (the union of both matches), 
# while " " (space) between tags means "descendant"
soup.select("div, p")

In [None]:
# Exercise 2.6.9. select p tags 
# with two classes: outer-text and first-item
soup.select("p.outer-text.first-item")

# what if another class, say "xxx" also required?

In [None]:
# Exercise 2.6.10. find all p tags with class "outer-text" 
# but not class "first-item"


### 2.7.  Example: downloading weather forecast for the next week for New York City ###
- Instruction:
    1. Open web site http://forecast.weather.gov/MapClick.php?lat=40.7146&lon=-74.0071#.WXi6hlGQzIU and inspect page source
    2. Find "Extended Forecast for" in the source code
    3. Extract div tags in this section using the selector "div#seven-day-forecast-body div ul li div.tombstone-container"
       * Notice that the div under "Extended Forecast for" is what we need
       * Follow the path to weather forecast for each period
       <img src='weather.png' width='60%'>
    4. For each div tag, extract text in different p tags and represent the result as a tuple, e.g. ("Today", "Mostly Sunny", "High: 75F"). Save the 7-day forecast as a list and print the list 
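
Before running against the live site, the extraction logic in steps 3 and 4 can be tried offline. The sketch below uses hypothetical, heavily simplified markup that only mimics the selector path and class names mentioned above; the real page may be structured differently, and the second period's values are made up. The helper `text_or_none` is also a hypothetical name.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the forecast section of the page
forecast_html = """
<div id="seven-day-forecast-body">
  <div>
    <ul>
      <li>
        <div class="tombstone-container">
          <p class="period-name">Today</p>
          <p class="short-desc">Mostly Sunny</p>
          <p class="temp">High: 75F</p>
        </div>
      </li>
      <li>
        <div class="tombstone-container">
          <p class="period-name">Tonight</p>
          <p class="short-desc">Clear</p>
        </div>
      </li>
    </ul>
  </div>
</div>
"""


def text_or_none(parent, selector):
    """Return the text of the first match under parent, or None if absent."""
    found = parent.select_one(selector)
    return found.get_text(strip=True) if found is not None else None


soup = BeautifulSoup(forecast_html, "html.parser")
rows = []
for div in soup.select("div#seven-day-forecast-body div ul li div.tombstone-container"):
    # one tuple per period: (title, description, temperature)
    rows.append((text_or_none(div, "p.period-name"),
                 text_or_none(div, "p.short-desc"),
                 text_or_none(div, "p.temp")))
print(rows)
```

Returning `None` for a missing element keeps every tuple the same length, so a period without a temperature does not break the loop.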

In [None]:
# Exercise 2.7.1. downloading weather forecast for the next week for New York City 

# send a get request to the web page
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=40.7146&lon=-74.0071#.WXi6hlGQzIU")
rows=[]

# status_code 200 indicates success. 
# 4xx and 5xx status codes indicate a failure 
if page.status_code==200:        
    soup = BeautifulSoup(page.content, 'html.parser')
    
    # find a block with id='seven-day-forecast-body'
    # follow the path down to the div for each period
    divs=soup.select("div#seven-day-forecast-body "
                     "div ul li div.tombstone-container")
    # print(len(divs))
    # print(divs)
    
    for idx, div in enumerate(divs):
        # for testing you can print idx, div
        # print(idx, div)
        
        # initiate the variable for each period
        title=None
        desc=None
        temp=None
        
        # get title
        p_title=div.select("p.period-name")
        
        # test if "period-name" indeed exists
        # before you get the text
        # (select returns an empty list when nothing matches)
        if p_title:
            title=p_title[0].get_text()
        
        # get description
        p_desc=div.select("p.short-desc")
        if p_desc:
            desc=p_desc[0].get_text()
        
        # get temperature
        p_temp=div.select("p.temp")
        if p_temp:
            temp=p_temp[0].get_text()
            
        # add title, description, and temperature as a tuple into the list
        rows.append((title, desc, temp))
        print((title, desc, temp))


In [None]:
# Exercise 2.7.2. Extract "Detailed Forecast" section in the web page 