<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:200%;
           font-family:Arial;letter-spacing:0.5px">

<p width="20%" style="padding: 10px;
              color:white;">
Web Scraping
              
</p>
</div>

Data Science Cohort Live NYC May 2022
<p>Phase 1: Topic 10</p>
<br>
<br>

<div align = "right">
<img src="Images/flatiron-school-logo.png" align = "right" width="200"/>
</div>
    
   

Previously:
    
- Accessed data via API

Sometimes there is no programmatic access to the data:
- No API exists.
- No SQL server to interact with.
- No CSV files to download.

Many ecommerce sites: no APIs or databases to interact with.
<br>
<br>

<div>
<center><img src="Images/master_of_malt_menu.png" width="600"/></center>
</div>
<center> Master of Malt</center>
   

<div>
<center><img src="Images/edradour.png" width="900"/></center>
</div>
<center> But I want data from these fields! </center>    

The data is in the web site source code...
<div>
<center><img src="Images/source_mom.png" width="1800"/></center>
</div>
    <center> Data embedded within a soup of HTML tags </center>   

#### HyperText Markup Language (HTML)

Tells a browser how to lay out content.

- Consists of elements marked up with tags.
- The most basic tag is the html tag: it specifies that everything between its opening and closing tags is HTML.

Take a look at an example website.


### Let's take a look at Yelp
- Open up yelp.com in your browser.
- Open up the inspector
    - Mac: cmd+option+c
    - Windows: ctrl+shift+c
- Click on the elements tab, and click on an element

| Tag | Function | 
| --- | --- |
| html | Denotes extent of HTML document |
| head | External style sheet definition, metadata, titles |
| title | Web page title |
| body | Specifies main web page content block |
| h1-h6 | Section heading (ordered by decreasing size)|
| p | Represents paragraph |
| div | Defines division or section of document |
| span | Inline container for a small piece of content |
| img | Signifies image and defines source |
| a | Links to other pages or to locations within the page |
| ul | Declare unordered (bulleted) list |
| li | List item |
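
To see how these tags nest in practice, here is a small sketch that parses a made-up page and lists the tags it contains. The markup, the `sample_html` name and the `example.com` link are assumptions for illustration only; the parsing uses BeautifulSoup, which this lesson introduces further down.

```python
from bs4 import BeautifulSoup

# Hypothetical page, invented for illustration only.
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Sample heading</h1>
    <p>A paragraph with an <span>inline</span> part.</p>
    <ul>
      <li>First item</li>
      <li><a href="https://example.com">A link</a></li>
    </ul>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all(True) matches every tag, in document order.
tag_names = [tag.name for tag in soup.find_all(True)]
print(tag_names)
```

The printed list starts at `html` and walks down through `head`, `body` and their children, which is the nesting the table describes.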

#### CSS (Cascading Style Sheets)

- Uses class and id attributes on tags to select what to style.
- Styling:
    - Color
    - Font
    - Spacing
    - etc.
- Can use an external style sheet.
- Separates content from styling.

#### Structure of tag levels
- An HTML document is structured as a tree:
<br>
<br>
<div>
    <center><img src="Images/html_tree.png" width="500"/></center>
</div>

#### Goal
Extract information structured by tags.

- Get HTML documents as text.
- Parse tags and extract data.

#### Web scraping frameworks

<div>
    <center><img src="Images/scrapy.png" width="180"/></center>
</div>
<div>
<center><img src="Images/selenium.png" width="300"/></center>
</div>
<div>
<center><img src="Images/bs4.png" width="300"/></center>
</div>

We will use:

<div>
<center><img src="Images/bs4.png" width="400"/></center>
</div>

<div>
<center><img src="Images/requests.png" width="300"/></center>
</div>

- **Requests**: grab the HTML content as text.
- **BeautifulSoup**: parse the content and extract data.

In [None]:
# import requests
import requests

Make requests on a simple webpage:

In [None]:
sample_url = "http://dataquestio.github.io/web-scraping-pages/simple.html"
r = requests.get(sample_url)

After running our request, we get a Response object. This object has a status_code property, which indicates if the page was downloaded successfully:

In [None]:
r.status_code
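
A status code of 200 means the request succeeded, while codes in the 400s and 500s signal client and server errors. One way to avoid parsing an error page by accident is to fail early. The sketch below assumes a helper named `fetch_html` and a 10-second timeout, neither of which is part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or raise if the request failed."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx codes.
    response.raise_for_status()
    return response.content
```

Calling `fetch_html(sample_url)` would then either return the page bytes or raise an exception that names the failing status code.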

Let's get the content with the .content attribute:
- like the .text attribute, it holds the response body,
- but it returns raw bytes rather than a decoded string.
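
The difference between bytes and text can be seen without making a request. In this sketch `raw_bytes` and `decoded_text` are stand-ins for `r.content` and `r.text`, and the markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Stand-ins for r.content (bytes) and r.text (str); the markup is made up.
raw_bytes = b"<html><body><p>Hello</p></body></html>"
decoded_text = raw_bytes.decode("utf-8")

print(type(raw_bytes))     # <class 'bytes'>
print(type(decoded_text))  # <class 'str'>

# BeautifulSoup accepts either form as input.
soup_from_bytes = BeautifulSoup(raw_bytes, "html.parser")
soup_from_text = BeautifulSoup(decoded_text, "html.parser")
print(soup_from_bytes.p.text == soup_from_text.p.text)  # True
```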

In [None]:
req_content = r.content
req_content 

- Pretty ugly.
- Parse and get relevant data:
    - Want to use HTML tree structure.
    - Class and id structure.
    
BeautifulSoup helps us with this:

In [None]:
from bs4 import BeautifulSoup

Create Soup object with web site content as input.

In [None]:
soup = BeautifulSoup(req_content, 'html.parser') 

In [None]:
soup

In [None]:
type(soup)

In [None]:
print(soup.prettify())

The soup object captures the structure and hierarchy of tags and content in the HTML document.

We can traverse the tree hierarchy:

#### Descending through hierarchy

In [None]:
soup

In [None]:
html_level = soup.html
html_level

.contents attribute: gets list of tag's children

In [None]:
html_level.contents

Can also yield children as iterator:

In [None]:
for x in html_level.children:
    print(x)
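
One thing to watch for: the children of a tag include the whitespace strings between tags, not only the tags themselves. A minimal sketch on a made-up snippet (the `html` string is an assumption, not the lesson's page) shows how to keep only real tags:

```python
from bs4 import BeautifulSoup, Tag

html = "<html>\n<head><title>Demo</title></head>\n<body><p>Hi</p></body>\n</html>"
soup = BeautifulSoup(html, "html.parser")

# .contents includes the newline strings between tags, not just the tags.
print(soup.html.contents)

# Keep only real tags by filtering on type.
child_tags = [child.name for child in soup.html.children if isinstance(child, Tag)]
print(child_tags)  # ['head', 'body']
```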

Let's go down the body branch:
- Can address body child as an attribute of previous level.

In [None]:
body_level = html_level.body
body_level

There's another level left down this branch:

In [None]:
body_level

In [None]:
p_level = body_level.p
p_level

Note: this only gets the first p tag.
- To get more, use .find_all()

Get the text inside the tag:
- .text attribute

In [None]:
p_level.text
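
The .text attribute keeps any whitespace that surrounds the text in the source. If that gets in the way, `get_text(strip=True)` is one option. The snippet below is a sketch on a made-up paragraph, not the lesson's page.

```python
from bs4 import BeautifulSoup

# Made-up paragraph with leading and trailing whitespace.
soup = BeautifulSoup("<p>\n   Hello world\n</p>", "html.parser")
p_tag = soup.p

raw_text = p_tag.text
clean_text = p_tag.get_text(strip=True)

print(repr(raw_text))    # '\n   Hello world\n'
print(repr(clean_text))  # 'Hello world'
```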

#### Going up levels
We can also go the other way:

In [None]:
p_level.parent

Not too shabby.

#### Going sideways
- Traversing through siblings

In [None]:
html_level

In [None]:
body_level

In [None]:
list(body_level.previous_siblings)
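
Sibling iterators also return the whitespace strings that sit between tags. When only a specific tag is wanted, `find_previous_sibling()` and `find_next_sibling()` skip over those strings. A sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Demo</title></head>\n<body><p>Hi</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# previous_siblings walks backwards and includes the newline string.
siblings_before_body = list(body.previous_siblings)
print(siblings_before_body)

# Asking for a tag by name skips the strings in between.
print(body.find_previous_sibling("head").name)   # head
print(soup.head.find_next_sibling("body").name)  # body
```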

Let's have a gander at a slightly more complex website:

In [None]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup

Going down to the body level:

In [None]:
body_level = soup.html.body
body_level

We want all p tags, but attribute access only returns the first:

In [None]:
body_level.p

We need .find_all():
- finds all instances of the specified tag within a level.
- returns a list.

In [None]:
body_level.find_all('p')
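
Because the result behaves like a list, it can be looped over or indexed. The sketch below uses a made-up `html` string (an assumption, not the lesson's page) to contrast attribute access with `.find_all()`:

```python
from bs4 import BeautifulSoup

html = """
<body>
  <p class="inner-text">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
  <p>Third paragraph.</p>
</body>
"""
body = BeautifulSoup(html, "html.parser").body

first_only = body.p                 # attribute access: first match only
all_paragraphs = body.find_all("p")  # every match, in document order

texts = [p.get_text(strip=True) for p in all_paragraphs]
print(texts)

# limit caps the number of matches returned.
limited = body.find_all("p", limit=2)
print(len(limited))  # 2
```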

#### Class and id selectors
- Modify style easily across:
    - many instances of the same type (class)
    - or one specific instance (id).
- Can also use this for data selection / scraping.

Additional arguments for .find_all()

In [None]:
body_level

Extract by class:

In [None]:
body_level.find_all('p', class_ = 'inner-text')

In [None]:
body_level

Extract by id:

In [None]:
body_level.find_all('p', id = 'second')
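
BeautifulSoup also accepts CSS selector strings through `.select()` and `.select_one()`, which some people find more compact than keyword arguments. This is an alternative to the approach above, not something the original notebook uses, and the `html` string is made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<body>
  <p class="inner-text first-item" id="first">First</p>
  <p class="inner-text" id="second">Second</p>
  <p class="outer-text">Third</p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# "p.inner-text" means: p tags whose class list contains inner-text.
by_class = soup.select("p.inner-text")
print([p.text for p in by_class])  # ['First', 'Second']

# "#second" means: the element whose id is second.
by_id = soup.select_one("#second")
same_by_find = soup.find("p", id="second")
print(by_id == same_by_find)  # True
```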

#### Going back to our whisky page

- Get bottling details (age, ABV, distillery, etc)

In [None]:
edradour_url = "https://www.masterofmalt.com/whiskies/edradour-10-year-old-whisky/?srh=1"

In [None]:
edrad_req = requests.get(edradour_url)
edrad_soup = BeautifulSoup(edrad_req.content, 'html.parser')

Extract all info from bottling details:

In [None]:
details = edrad_soup.find_all('div', id="whiskyDetailsWrapper")[0]
details

In [None]:
detail_keys = details.find_all('span', class_ = "kv-key gold")

In [None]:
detail_keys

In [None]:
detail_values = details.find_all('span', class_ = "kv-val")

In [None]:
detail_values

In [None]:
data_dict = {}
for key,val in zip(detail_keys, detail_values):
    data_dict[key.text] = val.text
    

In [None]:
data_dict
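
Pairing keys and values with `zip` silently drops items if the two lists differ in length, so a quick length check can catch a layout surprise. The sketch below runs on hypothetical markup that only mimics the key/value structure; the `details` id and the sample values are assumptions, not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking a key/value details box.
html = """
<div id="details">
  <span class="kv-key gold">Country</span> <span class="kv-val"> Scotland </span>
  <span class="kv-key gold">Age</span> <span class="kv-val"> 10 year old </span>
</div>
"""
details = BeautifulSoup(html, "html.parser").find("div", id="details")

keys = details.find_all("span", class_="kv-key")
values = details.find_all("span", class_="kv-val")

# zip() stops at the shorter list, so check the lengths first.
if len(keys) != len(values):
    raise ValueError("Number of keys and values does not match")

# strip=True removes the padding whitespace around each entry.
data = {k.get_text(strip=True): v.get_text(strip=True) for k, v in zip(keys, values)}
print(data)  # {'Country': 'Scotland', 'Age': '10 year old'}
```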

In [None]:
tasting_note_div = edrad_soup.find_all("div", 
                                       id = "ContentPlaceHolder1_ctl00_ctl02_TastingNoteBox_ctl00_breakDownTastingNote")[0]

tasting_note_div

In [None]:
# Split each note on its first colon only, so colons inside the description are kept.
tasting_dict = dict(note.text.split(':', 1) for note in tasting_note_div.find_all('p'))
data_dict.update(tasting_dict)

In [None]:
data_dict

We have started wrangling data from an actual website.

Now we might want to do this for many whiskies on this site:

- How might we start doing this?
- Any ideas?

When each product page has the same tag structure:

- Build function that extracts data like we did.
- Loop through each product.
- Apply function to each product to scrape data.
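
The three steps above can be sketched as follows. The function names `parse_details` and `scrape_products`, the one-second delay and the 10-second timeout are assumptions for illustration; the `whiskyDetailsWrapper` id and the `kv-key` / `kv-val` classes are the ones used earlier in this notebook.

```python
import time

import requests
from bs4 import BeautifulSoup


def parse_details(html):
    """Extract key/value bottling details from one product page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    wrapper = soup.find("div", id="whiskyDetailsWrapper")
    if wrapper is None:
        return {}  # the block is missing or the page layout differs
    keys = wrapper.find_all("span", class_="kv-key")
    values = wrapper.find_all("span", class_="kv-val")
    return {k.get_text(strip=True): v.get_text(strip=True)
            for k, v in zip(keys, values)}


def scrape_products(urls, delay_seconds=1.0):
    """Fetch each URL in turn and return a list of detail dictionaries."""
    rows = []
    for url in urls:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        row = parse_details(response.content)
        row["url"] = url
        rows.append(row)
        time.sleep(delay_seconds)  # pause between requests to be polite
    return rows
```

Keeping the parsing in its own function means it can be tried on saved HTML without hitting the site, and the resulting list of dictionaries can be passed straight to `pandas.DataFrame` if a table is wanted.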