<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:200%;
           font-family:Arial;letter-spacing:0.5px">

<p width = 20%, style="padding: 10px;
              color:white;">
Web Scraping
              
</p>
</div>

Data Science Cohort Live NYC Sept 2022
<p>Phase 1: Topic 10</p>
<br>
<br>

<div align = "right">
<img src="Images/flatiron-school-logo.png" align = "right" width="200"/>
</div>
    
   

Previously:
    
- Accessed data via API

Sometimes no programmatic access to data!
- No API exists
- No SQL server to interact with.
- No csv files to download.

Many ecommerce sites: no APIs or databases to interact with.
<br>
<br>

<div>
<center><img src="Images/prod_page.png" width="600"/></center>
</div>
<center> Master of Malt</center>
   

<div>
<center><img src="Images/ardbeg_page.png" width="700"/></center>
</div>   

There is data on the page
<div>
<center><img src="Images/ardbeg_tast_nt.png" width="600"/></center>
</div>   

The data is in the website's source code...
<div>
<center><img src="Images/source_whiskex.png" width="1800"/></center>
</div>
    <center> Data embedded within a soup of HTML tags </center>   

Let's take a look at a very simple sample web site.

#### HyperText Markup Language (HTML)

Tells a browser how to lay out content.

- Consists of elements called tags. 
- The most basic tag is the html tag: it specifies that everything between its opening and closing tags is HTML. 

Take a look at an example website.


### Let's take a look at Yelp
- Open up yelp.com in your browser.
- Open up the inspector
    - Mac: cmd+option+c
    - Windows: ctrl+shift+c
- Click on the elements tab, and click on an element

| Tag | Function | 
| --- | --- |
| html | Denotes extent of HTML document |
| head | External style sheet definition, metadata, titles |
| title | Web page title |
| body | Specifies main web page content block |
| h1-h6 | Section heading (ordered by decreasing size)|
| p | Represents paragraph |
| div | Defines division or section of document |
| span | Meant for inline or small selection  |
| img | Signifies image and defines source |
| a | Linking to external sites or internal events  |
| ul | Declare unordered (bulleted) list |
| li | List item |

#### CSS (Cascading Style Sheets)

- Uses class and id modifiers on tag.
- Styling:
    - Color
    - Font
    - Spacing,
    - etc.
- Can use external sheet for styling
- Separate content and styling.

#### Structure of tag levels
- HTML document structured as tree structure:
<br>
<br>
<div>
    <center><img src="Images/html_tree.png" width="500"/></center>
</div>
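
To make the tree idea concrete, here is a minimal sketch that prints the nesting of tags in a small document using only the standard library's `html.parser`. The sample HTML and the `TreePrinter` class are made up for illustration and are not part of the lesson's pages.

```python
from html.parser import HTMLParser

# Hypothetical sample document, made up for illustration.
SAMPLE_HTML = """<html>
<head><title>Sample page</title></head>
<body>
<h1>Heading</h1>
<div><p>First paragraph</p><p>Second paragraph</p></div>
</body>
</html>"""


class TreePrinter(HTMLParser):
    """Record each opening tag, indented by its depth in the tree.

    Assumes every tag is explicitly closed (no void tags such as <img> or <br>).
    """

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


printer = TreePrinter()
printer.feed(SAMPLE_HTML)
print("\n".join(printer.lines))
```

Each level of indentation in the printed outline corresponds to one level down the tree: `head` and `body` are children of `html`, and the two `p` tags are children of the `div`.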

#### Goal
Extract information structured by tags.

- Get HTML documents as text.
- Parse tags and extract data.

#### Web scraping frameworks

<div>
    <center><img src="Images/scrapy.png" width="180"/></center>
</div>
<div>
<center><img src="Images/selenium.png" width="300"/></center>
</div>
<div>
<center><img src="Images/bs4.png" width="300"/></center>
</div>

We will use:

<div>
<center><img src="Images/bs4.png" width="400"/></center>
</div>

<div>
<center><img src="Images/requests.png" width="300"/></center>
</div>

- **Requests**: grab the HTML content as text.
- **BeautifulSoup**: parse the content and extract data.

In [None]:
# import requests
import requests

Make requests on a simple webpage:

In [None]:
sample_url = "http://dataquestio.github.io/web-scraping-pages/simple.html"
r = requests.get(sample_url)

Let's get the content:
- .content is like the .text attribute,
- but returns the page body as bytes rather than a decoded string.
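
The difference between bytes and text can be seen without making a request. This is only a sketch: `raw_bytes` is a made-up stand-in for what `r.content` would hold, and UTF-8 is an assumed encoding (requests picks the encoding for `.text` itself).

```python
# Made-up bytes standing in for what r.content would hold; no request is made.
raw_bytes = b"<p>Caf\xc3\xa9 menu</p>"

# Decoding the bytes gives a str, comparable to what r.text provides.
# UTF-8 is an assumption here.
decoded_text = raw_bytes.decode("utf-8")

print(type(raw_bytes).__name__)     # bytes
print(type(decoded_text).__name__)  # str
print(decoded_text)
```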

In [None]:
req_content = r.content
req_content 

- Pretty ugly.
- Parse and get relevant data:
    - Want to use HTML tree structure.
    - Class and id structure.
    
BeautifulSoup helps us with this:

In [None]:
from bs4 import BeautifulSoup

Create Soup object with web site content as input.

In [None]:
soup = BeautifulSoup(req_content, 'html.parser') 

In [None]:
print(soup.prettify())

Soup parses the structure and hierarchy of tags and content in the HTML document.

We can traverse the tree hierarchy:

#### Descending through hierarchy

In [None]:
soup

In [None]:
html_level = soup.html
html_level

Tag element contains:
- node tag
- node contents (children nodes, text, etc.)

In [None]:
type(html_level)

**.name** attribute 

Can get the name of the node that you are at:
- .name attribute of Soup/Tag objects    

In [None]:
html_level.name

**.contents** attribute

- gets list of tag's children

In [None]:
html_level.contents

**.children** attribute

Can also get the children as an iterator:
- as opposed to .contents, which returns the entire list of children.
- useful when creating list comprehensions off the tag's children.
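
A small offline sketch of the difference, using a made-up HTML string rather than the lesson's page:

```python
from bs4 import BeautifulSoup

# Made-up document; the "\n" between head and body becomes a child of html.
sample_html = "<html><head><title>T</title></head>\n<body><p>Hello</p></body></html>"
sample_soup = BeautifulSoup(sample_html, "html.parser")
html_tag = sample_soup.html

# .contents is a plain list, so it supports len() and indexing.
contents_list = html_tag.contents
is_list = isinstance(contents_list, list)
print(len(contents_list))  # 3: head, the line break, body

# .children is consumed by iterating, e.g. in a list comprehension.
names = [child.name for child in html_tag.children if child != "\n"]
print(names)  # ['head', 'body']
```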

In [None]:
html_level.children

Get the name of the tags of html's direct children:
- need to exclude line breaks.

In [None]:
children_names = \
[child.name for child 
 in html_level.children 
 if child != '\n']

In [None]:
children_names

Let's access the body child and go down the branch:
- Can address body child as an attribute of previous level.

In [None]:
body_level = html_level.body
body_level

There's another level left down this branch:

In [None]:
body_level

Accessing the paragraph <p> child:

In [None]:
p_level = body_level.p
p_level

**.text** attribute

Get the text inside the tag:
- .text attribute

In [None]:
p_level.text

#### Going up levels:
**.parent** attribute:
- We can also go the other way:

In [None]:
p_level.parent

Not too shabby.

#### Going sideways
- Traversing through siblings

**.previous_siblings** attribute
- generator that yields the previous siblings

In [None]:
html_level

In [None]:
body_level

In [None]:
prev_sibs = body_level.previous_siblings
prev_sibs

Traversing the generator:
- goes backwards through previous siblings with the next() function
- raises StopIteration once the previous siblings are exhausted

In [None]:
next(prev_sibs)

Can be used in a list comprehension as well:
- get tag names of previous siblings excluding line breaks

In [None]:
prevsib_names = [prev_sib.name for prev_sib in body_level.previous_siblings if prev_sib != "\n"]
prevsib_names

**.next_siblings** attribute: 
- does the same thing but for siblings following the current tag

In [None]:
head_level = html_level.head
list(head_level.next_siblings)
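
The order in which siblings come back is easy to miss. A minimal sketch on a made-up document (the ids `a` to `d` are hypothetical), with no whitespace between tags so every sibling is a Tag:

```python
from bs4 import BeautifulSoup

# Made-up document with four sibling paragraphs.
sample_html = (
    "<html><body>"
    "<p id='a'>one</p><p id='b'>two</p><p id='c'>three</p><p id='d'>four</p>"
    "</body></html>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")
third_p = sample_soup.find("p", id="c")

# previous_siblings walks backwards: nearest sibling first.
prev_ids = [sib["id"] for sib in third_p.previous_siblings]
# next_siblings walks forwards from the current tag.
next_ids = [sib["id"] for sib in third_p.next_siblings]

print(prev_ids)  # ['b', 'a']
print(next_ids)  # ['d']
```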

The previous website was very simple. Websites usually have more complex tree structures:
- A given tag can have many children of the same type. Want all children of a given type.
- Dealing with nested structures: divs within divs 
- A set of children with a given class. 
- Specific tag with a unique id  

A more complex but still simple example might help:

In [None]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup

Going down to the body level:

In [None]:
body_level = soup.html.body
body_level

There are many p tags.
- Want all p tags:

In [None]:
body_level.p

Only got the first.

We need:
    
.find_all() 
- finds all instances of specified tags contained in current node.
- returns a list

In [None]:
body_level.find_all('p')

But let's take a closer look at the body structure:
- .prettify() can sometimes be useful

In [None]:
print(body_level.prettify())

**Nested structures**
- One set of paragraphs p tags contained in a div
- Other set as direct children of body.

May want to access p tags that are direct children.

.find_all() has recursive argument (True as default)
- recursive = False:
- gets only the immediate children satisfying the requirement

In [None]:
body_level.find_all('p', recursive = False)

These are the paragraph tags on the outer level.
- direct children of the body
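
The effect of `recursive` can be checked on a small made-up snippet with the same kind of nesting (paragraphs both inside a div and directly under body):

```python
from bs4 import BeautifulSoup

# Made-up body: two outer paragraphs and two nested in a div.
sample_html = (
    "<body><p>outer 1</p>"
    "<div><p>inner 1</p><p>inner 2</p></div>"
    "<p>outer 2</p></body>"
)
body = BeautifulSoup(sample_html, "html.parser").body

# Default (recursive=True): searches all descendants, in document order.
all_p = [p.text for p in body.find_all("p")]
# recursive=False: only direct children of body are considered.
direct_p = [p.text for p in body.find_all("p", recursive=False)]

print(all_p)     # ['outer 1', 'inner 1', 'inner 2', 'outer 2']
print(direct_p)  # ['outer 1', 'outer 2']
```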

**Exercise**: get the paragraph tags nested in the div layer.

In [None]:
# do it!
#body_level

<details>
    <summary><b><u>Solution</u></b></summary>

```python 
body_level.div.find_all('p')

```
</details>


In [None]:
# .find('div') does the same as .div, but can also
# take class or id arguments to pick a specific div

body_level.find('div').find_all('p')

#### Class and id selectors

- a grouping of tags (class)
- naming a specific tag instance (id).

Used in CSS styling.

Can also use this for data selection / scraping.

**Class and id selectors with**:
- .find()
- .find_all()

Take additional arguments for class/id

In [None]:
print(body_level.prettify())

Get the group of paragraph tags in the inner-text class.

In [None]:
body_level.find_all('p', class_ = 'inner-text')

In [None]:
body_level

Extract by id:

In [None]:
body_level.find('p', id = 'second')
#body_level.find_all('p', id = 'second')
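
A compact offline sketch of both selectors on a made-up snippet. The last two lines use CSS selector syntax through `.select()` and `.select_one()`, which is an alternative not covered above; the class names and ids here are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up snippet; the first paragraph carries two classes.
sample_html = """<body>
<p class="inner-text first-item" id="first">First</p>
<p class="inner-text" id="second">Second</p>
<p class="outer-text" id="third">Third</p>
</body>"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ matches a tag if any one of its classes equals the given name.
by_class = [p["id"] for p in sample_soup.find_all("p", class_="inner-text")]
# ids are meant to be unique, so .find() is enough.
by_id = sample_soup.find("p", id="second").text

# Equivalent lookups with CSS selector syntax.
css_class = [p["id"] for p in sample_soup.select("p.inner-text")]
css_id = sample_soup.select_one("#second").text

print(by_class, by_id)  # ['first', 'second'] Second
```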

#### Going back to our whisky page

- Get bottling details (age, ABV, distillery, etc)

In [None]:
ardbeg_url = "https://www.thewhiskyexchange.com/p/114/ardbeg-uigeadail"

In [None]:
ardbeg_req = requests.get(ardbeg_url)
ardbeg_soup = BeautifulSoup(ardbeg_req.content, 'html.parser')

Let's get the whisky facts:
- Bottler
- Country
- Chill filtered
- etc.

In [None]:
# returns the match as a tag object
prod_fact = ardbeg_soup.find('ul', 
                 class_ = "product-facts" )

prod_fact

Clearly have a list with each element (li) containing:
- attribute image
- key in h3 tag of class "product-facts__type"
- value in p tag of class "product-facts__data"


Let's get the first li item:
- prod_fact.find('li')
- prod_fact.li

In [None]:
first_li_elem = prod_fact.find('li')
first_li_elem

Now we want the key-value pairs:

In [None]:
# using .find() because we know there is only one of these tags
detail_key = first_li_elem.find('h3', class_ = "product-facts__type").text
detail_key

In [None]:
# using .find() because we know there is only one of these tags
detail_val = first_li_elem.find('p', class_ = "product-facts__data").text
detail_val

We can get the key,value pairs for all of these list elements in the product fact list:
- iterate over .children
- or use .find_all('li')

In [None]:
factTag_list = prod_fact.find_all('li')
factTag_list

Each element is a Tag object.

In [None]:
print(type(factTag_list[0]))

Let's iterate over the list and extract the keys and values:

In [None]:
data_dict = {}

for elem in factTag_list:
    
    detail_key = elem.find('h3', class_ = "product-facts__type").text
    detail_val = elem.find('p', class_ = "product-facts__data").text
    
    data_dict.update({detail_key: detail_val})

Take a look at our data dictionary:

In [None]:
data_dict

Starting to look a lot like data that could be a row in a table or DataFrame.

Let's try and extract some other information about this whisky as well:
- Get the header for the Flavour Profile subsection
- Get the contents

- notice that the subsection header is an h2 tag with id = FlavourProfile
- let's access this h2

In [None]:
flavorheader = ardbeg_soup.find(
    'h2', id = "FlavourProfile")
flavorheader

In [None]:
header_text = flavorheader.text
header_text

- the corresponding content is in a div with class = "flavour-profile"
- let's access this

In [None]:
flavor_content = ardbeg_soup.find(
    'div', class_ = "flavour-profile")

print(flavor_content.prettify())

The whisky has four flavor scores that we are interested in extracting:
- Body
- Richness
- Smoke
- Sweetness

These are contained in the first of the children div nodes.

In [None]:
[child.name for child 
 in flavor_content.children if child != '\n']

A list with the strength of various taste characteristics:

In [None]:
flavor_style = flavor_content.div
flavor_style

There is another sibling div that contains other information:
- A list with similar taste descriptors

In [None]:
# first sibling is a \n character
flavor_style.next_sibling.next_sibling

In [None]:
print(flavor_style.prettify())

Get the keys for the flavor profile:
- Contained as text in span of class="flavour-profile__label"

In [None]:
flav_key_spans  = \
flavor_style.find_all(
    'span', 
    class_ = 'flavour-profile__label')

In [None]:
flav_profile_keys = \
[ span.text for span in flav_key_spans ]
flav_profile_keys

For getting the values:
- Value is inside the tag as the data-text attribute

**How can we extract it**?

In [None]:
flav_profile_gauges = \
flavor_style.find_all(
    'div', 
    class_ = 'flavour-profile__gauge')

flav_profile_gauges

Tags are addressable as dictionaries:
- tag attribute name is key

In [None]:
flav_profile_gauges[0]

In [None]:
flav_profile_gauges[0]['data-text']

Extracting the values for the flavor profile is straightforward:

In [None]:
value_list = [gauge['data-text'] for gauge in flav_profile_gauges]
value_list
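
Square-bracket access raises a `KeyError` when a tag lacks the attribute, which can happen on pages with slightly different markup. A minimal sketch on made-up gauges (the `gauge` class and the `"n/a"` placeholder are hypothetical) showing `.get()` as a more forgiving alternative:

```python
from bs4 import BeautifulSoup

# Made-up markup: the second gauge has no data-text attribute.
sample_html = '<div class="gauge" data-text="4"></div><div class="gauge"></div>'
sample_soup = BeautifulSoup(sample_html, "html.parser")
gauges = sample_soup.find_all("div", class_="gauge")

first_value = gauges[0]["data-text"]          # '4'
missing_value = gauges[1].get("data-text")    # None instead of a KeyError

# .get() also accepts a default, like dict.get().
values = [gauge.get("data-text", "n/a") for gauge in gauges]

# All attributes of a tag are available as a dictionary via .attrs.
first_attrs = gauges[0].attrs

print(values)  # ['4', 'n/a']
```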

In [None]:
# Zipping this together and making a dictionary
flav_dict = \
dict(zip(flav_profile_keys, value_list))

flav_dict

And we can update our data dictionary:

In [None]:
# and we can update our data dictionary
data_dict.update(flav_dict)

data_dict

There are many places where we can get the name of the Whisky:
- A meta tag below the html level
- with attributes name = 'twitter:title' and content = "Ardbeg Uigeadail"

.find() and .find_all()
- have a way to select a tag by its attributes
- attrs takes in a dictionary of attributes

In [None]:
name_meta_tag = ardbeg_soup.html.find(
    'meta', attrs = {'name': 'twitter:title'})
name_meta_tag
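
The `attrs` dictionary matters here because `name` is also the name of the first parameter of `.find()` (the tag name), so an HTML attribute called `name` cannot be passed as a keyword. A minimal sketch on a made-up head section (the content values are hypothetical):

```python
from bs4 import BeautifulSoup

# Made-up head with two meta tags that differ only in their attributes.
sample_html = (
    "<html><head>"
    '<meta name="description" content="A sample page">'
    '<meta name="twitter:title" content="Sample Whisky">'
    "</head><body></body></html>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Select the meta tag whose name attribute equals 'twitter:title'.
title_tag = sample_soup.find("meta", attrs={"name": "twitter:title"})
title = title_tag["content"]

print(title)  # Sample Whisky
```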

Extract the content attribute from tag:
- update data dictionary

In [None]:
name_meta_tag['content']

In [None]:
data_dict.update(
    {'name': name_meta_tag['content'] })

data_dict

A lot of:
- html tree traversing
- exploring scheme
- searches to extract data

When each product site has same tagging structure:

- Build function that extracts data like we did.
- Can be used while looping through multiple products.

#### Build Function

In [None]:
def extract_data_dict(product_page):
    
    data_dict = {}
    product_req = requests.get(product_page)
    product_soup = BeautifulSoup(product_req.content, 'html.parser')
    
    # get name
    name_meta_tag = product_soup.html.find('meta', attrs = {'name': 'twitter:title'})
    data_dict.update({'name': name_meta_tag['content'] })
    
    # get product facts and extract information

    prod_fact = product_soup.find('ul', class_ = "product-facts" )
    
    # loops through to update data_dict with bottling information
    for elem in prod_fact.find_all('li'):
    
        detail_key = elem.find('h3', class_ = "product-facts__type").text
        detail_val = elem.find('p', class_ = "product-facts__data").text
        data_dict.update({detail_key: detail_val})
    
    # get flavor ratings
    flavor_style = product_soup.find('div', class_ = "flavour-profile").div
    flav_profile_keys = [ span.text for span in flavor_style.find_all('span', class_ = 'flavour-profile__label') ]
    value_list = [gauge['data-text'] for gauge in flavor_style.find_all('div', class_ = 'flavour-profile__gauge')]
    data_dict.update(dict(zip(flav_profile_keys, value_list)))

    return data_dict

Use function on a product page url:

In [None]:
extract_data_dict('https://www.thewhiskyexchange.com/p/114/ardbeg-uigeadail')
#extract_data_dict('https://www.thewhiskyexchange.com/p/12827/bunnahabhain-12-year-old')

#### Crawling a page of products

Let's end this by applying our function to a page with a list of products:
- need to extract product list
- get link urls
- apply our function to the link urls.

Soupify the products page.

In [None]:
scotchproducts_page = "https://www.thewhiskyexchange.com/c/40/single-malt-scotch-whisky"
prodpage_req = requests.get(scotchproducts_page)
scotchproducts_soup = BeautifulSoup(prodpage_req.content, 'html.parser')

Parse the source code and get a list of the product urls:

In [None]:
li_items = scotchproducts_soup.find_all('li', class_="product-grid__item")
prod_urls = [ 'https://www.thewhiskyexchange.com' + elem.a['href'] for elem in li_items][0:10]
prod_urls

Now apply our function:
- Thread it through the list
- map() is useful here.

A list of dicts.

In [None]:
extracted_data = \
list(map(extract_data_dict, prod_urls))

extracted_data

This is a tabular format:
- put this in a dataframe.
- can save to csv, etc.

In [None]:
import pandas as pd
whisky_df = pd.DataFrame(extracted_data)
whisky_df

If you want to scrape the entire set of product pages:
- need to loop through product list pages.
- extract all urls.
- apply function to url list.

**Not always easy**
- Some product pages have some tag elements missing
- slightly different page structure.
- Error handling required.
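
One possible way to handle missing tags, sketched on made-up snippets: `.find()` returns `None` when nothing matches, so calling `.text` on the result would raise an `AttributeError`. The helper `safe_text` is hypothetical and not part of the function built above.

```python
from bs4 import BeautifulSoup


def safe_text(parent, tag_name, class_name):
    """Return the stripped text of the first matching tag, or None if absent."""
    if parent is None:
        return None
    match = parent.find(tag_name, class_=class_name)
    return match.get_text(strip=True) if match is not None else None


# Made-up pages: one with the product facts list, one without it.
complete_html = (
    '<ul class="product-facts"><li>'
    '<h3 class="product-facts__type">Country</h3>'
    '<p class="product-facts__data">Scotland</p>'
    "</li></ul>"
)
incomplete_html = "<div>No facts on this page</div>"

item = BeautifulSoup(complete_html, "html.parser").find("li")
key = safe_text(item, "h3", "product-facts__type")
value = safe_text(item, "p", "product-facts__data")

# Here .find() returns None, and safe_text passes the None through.
missing_list = BeautifulSoup(incomplete_html, "html.parser").find(
    "ul", class_="product-facts")
missing_key = safe_text(missing_list, "h3", "product-facts__type")

print(key, value, missing_key)  # Country Scotland None
```

With this approach a missing field ends up as `None` in the dictionary instead of stopping the whole crawl.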

#### A final important etiquette note

**Throttle requests**
- add a delay between each request/product page scrape
- server limits scrape rate: 
    - cut access to you if you scrape too much, too fast

- time.sleep(t)
- t is in seconds

Try one request per second.
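
As an alternative to putting the pause inside the extraction function, the delay can live in the loop. This is only a sketch: `throttled_map` is a hypothetical helper, and `len` with a short delay stands in for the real scraping function and a one-second wait.

```python
import time


def throttled_map(func, items, delay_seconds=1.0):
    """Apply func to each item, pausing between consecutive calls."""
    results = []
    for index, item in enumerate(items):
        if index > 0:
            # No need to wait before the very first call.
            time.sleep(delay_seconds)
        results.append(func(item))
    return results


# Stand-in usage: len() replaces the scraping function, 0.05 s replaces 1 s.
start = time.monotonic()
lengths = throttled_map(len, ["a", "bb", "ccc"], delay_seconds=0.05)
elapsed = time.monotonic() - start

print(lengths)  # [1, 2, 3]
```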

In [None]:
import time

def extract_data_dict(product_page):
    
    # essential to not getting blocked by server
    time.sleep(1) # waits 1 second before making the request
    
    data_dict = {}
    product_req = requests.get(product_page)
    product_soup = BeautifulSoup(product_req.content, 'html.parser')
    
    # get name
    name_meta_tag = product_soup.html.find('meta', attrs = {'name': 'twitter:title'})
    data_dict.update({'name': name_meta_tag['content'] })
    
    # get product facts and extract information

    prod_fact = product_soup.find('ul', class_ = "product-facts" )
    
    # loops through to update data_dict with bottling information
    for elem in prod_fact.find_all('li'):
    
        detail_key = elem.find('h3', class_ = "product-facts__type").text
        detail_val = elem.find('p', class_ = "product-facts__data").text
        data_dict.update({detail_key: detail_val})
    
    # get flavor ratings
    flavor_style = product_soup.find('div', class_ = "flavour-profile").div
    flav_profile_keys = [ span.text for span in flavor_style.find_all('span', class_ = 'flavour-profile__label') ]
    value_list = [gauge['data-text'] for gauge in flavor_style.find_all('div', class_ = 'flavour-profile__gauge')]
    data_dict.update(dict(zip(flav_profile_keys, value_list)))

    return data_dict

Running the data extractor will take more time:

In [None]:
extracted_data = \
list(map(extract_data_dict, prod_urls))

extracted_data

Given:

- errors might pop up
- connection may be severed
- with wait time: takes a long time

Good idea to append requested data to a .json or .csv file as you scrape.
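
One way this could look, as a sketch: write each scraped dictionary as one JSON line, so that everything gathered before a crash is already on disk. The helper names, the file name and the sample records are all made up for illustration.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one dictionary to the file as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def load_records(path):
    """Read back every JSON line as a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up records standing in for the output of the extraction function.
sample_records = [
    {"name": "Sample A", "Country": "Scotland"},
    {"name": "Sample B"},
]

# A temporary directory keeps the sketch from touching real files.
out_path = os.path.join(tempfile.mkdtemp(), "scraped.jsonl")
for record in sample_records:
    append_record(out_path, record)  # saved immediately, one line per product

saved = load_records(out_path)
print(saved)
```

The list returned by `load_records` has the same list-of-dicts shape that was passed to `pd.DataFrame` earlier.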
