<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:200%;
           font-family:Arial;letter-spacing:0.5px">

<p width="20%" style="padding: 10px;
              color:white;">
Web Scraping
              
</p>
</div>

Data Science Cohort Live NYC Feb 2022
<p>Phase 1: Topic 10</p>
<br>
<br>

<div align = "right">
<img src="Images/flatiron-school-logo.png" align = "right" width="200"/>
</div>
    
   

Previously:
    
- Accessed data via API

Sometimes no programmatic access to data!
- No API exists
- No SQL server to interact with.
- No csv files to download.

Many ecommerce sites: no APIs or databases to interact with.
<br>
<br>

<div>
<center><img src="Images/master_of_malt_menu.png" width="600"/></center>
</div>
<center> Master of Malt</center>
   

<div>
<center><img src="Images/edradour.png" width="900"/></center>
</div>
<center> But I want data from these fields! </center>    

The data is in the web site source code...
<div>
<center><img src="Images/source_mom.png" width="1800"/></center>
</div>
    <center> Data embedded within a soup of HTML tags </center>   

Let's take a look at a very simple sample web site.

#### HyperText Markup Language (HTML)

Tells a browser how to lay out content.

- Consists of elements marked up with tags.
- The most basic tag is the html tag: it specifies that everything between its opening and closing tags is HTML.

Take a look at an example website.


| Tag | Function | 
| --- | --- |
| html | Denotes extent of HTML document |
| head | External style sheet definition, metadata, titles |
| title | Web page title |
| body | Specifies main web page content block |
| h1-h6 | Section heading (ordered by decreasing size)|
| p | Represents paragraph |
| div | Defines division or section of document |
| span | Meant for inline or small selection  |
| img | Signifies image and defines source |
| a | Links to other pages or to locations within the same page |
| ul | Declare unordered (bulleted) list |
| li | List item |

#### CSS (Cascading Style Sheets)

- Uses class and id attributes on tags to select what to style.
- Styling:
    - Color
    - Font
    - Spacing
    - etc.
- Can use external sheet for styling
- Separate content and styling.

#### Structure of tag levels
- An HTML document is structured as a tree:
<br>
<br>
<div>
    <center><img src="Images/html_tree.png" width="500"/></center>
</div>
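
To make the tree idea concrete, here is a minimal sketch that uses only Python's standard-library `html.parser` to print each opening tag indented by its depth. The sample HTML string and the `TreePrinter` class name are made up for illustration; they are not part of the lesson's pages.

```python
from html.parser import HTMLParser

# Hypothetical sample document, not one of the pages scraped in this lesson
SAMPLE_HTML = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"


class TreePrinter(HTMLParser):
    """Record each opening tag, indented by its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1  # children of this tag sit one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # closing a tag moves back up one level


printer = TreePrinter()
printer.feed(SAMPLE_HTML)
print("\n".join(printer.lines))
```

The indentation in the printed output mirrors the parent/child relationships in the diagram: `head` and `body` are children of `html`, while `title` and `p` sit one level further down.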

#### Goal
Extract information structured by tags.

- Get HTML documents as text.
- Parse tags and extract data.

#### Web scraping frameworks

<div>
    <center><img src="Images/scrapy.png" width="180"/></center>
</div>
<div>
<center><img src="Images/selenium.png" width="300"/></center>
</div>
<div>
<center><img src="Images/bs4.png" width="300"/></center>
</div>

We will use:

<div>
<center><img src="Images/bs4.png" width="400"/></center>
</div>

<div>
<center><img src="Images/requests.png" width="300"/></center>
</div>

- **Requests**: grab the HTML content as text.
- **BeautifulSoup**: parse the content and extract data.

In [6]:
# import requests
import requests

Make requests on a simple webpage:

In [7]:
sample_url = "http://dataquestio.github.io/web-scraping-pages/simple.html"
r = requests.get(sample_url)
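
A slightly more defensive way to make the same request is sketched below. The helper name `fetch_html` and the 10-second default timeout are assumptions chosen for illustration; `timeout` and `raise_for_status()` are standard parts of the requests API.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the response body as bytes, raising on HTTP error statuses."""
    response = requests.get(url, timeout=timeout)  # give up if the server stalls
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.content
```

With this helper, `fetch_html(sample_url)` would return the same bytes as `r.content`, but a failed request surfaces as an exception instead of passing silently.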

Let's get the content:
- similar to the .text attribute
- but returns the raw bytes rather than a decoded string.
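
The difference between bytes and text can be seen without making a request. The sketch below uses a made-up byte string standing in for `r.content` and decodes it by hand, which is roughly what `.text` does once an encoding has been chosen.

```python
# Hypothetical UTF-8 encoded bytes, standing in for r.content
raw = b"<p>cr\xc3\xa8me anglaise</p>"

decoded = raw.decode("utf-8")    # correct codec: a readable string
garbled = raw.decode("latin-1")  # wrong codec: accented characters come out garbled

print(type(raw), type(decoded))
print(decoded)
print(garbled)
```

If scraped text shows odd character pairs where accented letters should be, a mismatched encoding like this is a likely cause.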

In [8]:
req_content = r.content
req_content 

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

- Pretty ugly.
- Parse and get relevant data:
    - Want to use HTML tree structure.
    - Class and id structure.
    
BeautifulSoup helps us with this:

In [9]:
from bs4 import BeautifulSoup

Create Soup object with web site content as input.

In [10]:
soup = BeautifulSoup(req_content, 'html.parser') 

In [11]:
type(soup)

bs4.BeautifulSoup

In [12]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


The soup object represents the structure and hierarchy of tags and content in the HTML document.

We can traverse the tree hierarchy:

#### Descending through hierarchy

In [13]:
soup

<!DOCTYPE html>

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

In [14]:
html_level = soup.html
html_level

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

.contents attribute: gets list of tag's children

In [16]:
html_level.contents

['\n',
 <head>
 <title>A simple example page</title>
 </head>,
 '\n',
 <body>
 <p>Here is some simple content for this page.</p>
 </body>,
 '\n']

Can also yield children as iterator:

In [22]:
for x in html_level.children:
    print(x)

# .children itself is an iterator, not a list
print(html_level.children)



<head>
<title>A simple example page</title>
</head>


<body>
<p>Here is some simple content for this page.</p>
</body>


<list_iterator object at 0x7fe8f2497ac0>
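
The children include bare newline strings that sit between the tags. If only the tags are wanted, one option is to filter by type. This is a small sketch on a stand-in HTML string (an assumption, not the page fetched above):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Stand-in document with newlines between the tags
html = "<html>\n<head><title>A page</title></head>\n<body><p>Hi</p></body>\n</html>"
soup = BeautifulSoup(html, "html.parser")

all_children = list(soup.html.children)
# Keep only real tags, dropping the whitespace strings
tag_children = [child for child in all_children if isinstance(child, Tag)]
names = [child.name for child in tag_children]

print(len(all_children), names)
```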


Let's go down the body branch:
- Can address body child as an attribute of previous level.

In [29]:
body_level = html_level.body
body_level

<body>
<p>Here is some simple content for this page.</p>
</body>

There's another level left down this branch:

In [30]:
body_level

<body>
<p>Here is some simple content for this page.</p>
</body>

In [31]:
p_level = body_level.p
p_level

<p>Here is some simple content for this page.</p>

Note: this only gets the first p tag.
- To get all of them, use .find_all()

Get the text inside the tag:
- .text attribute

In [41]:
p_level.text

'Here is some simple content for this page.'

#### Going up levels
We can also go the other way:

In [42]:
p_level.parent

<body>
<p>Here is some simple content for this page.</p>
</body>

Not too shabby.

#### Going sideways
- Traversing through siblings

In [43]:
html_level

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

In [44]:
body_level

<body>
<p>Here is some simple content for this page.</p>
</body>

In [45]:
list(body_level.previous_siblings)

['\n',
 <head>
 <title>A simple example page</title>
 </head>,
 '\n']
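
Siblings can be reached in both directions. The sketch below, on a stand-in HTML string (an assumption for illustration), contrasts `.next_sibling`, which may return a whitespace string, with `find_next_sibling()` and `find_previous_sibling()`, which skip straight to tags.

```python
from bs4 import BeautifulSoup

html = "<html>\n<head><title>A page</title></head>\n<body><p>Hi</p></body>\n</html>"
soup = BeautifulSoup(html, "html.parser")

head = soup.head
immediate = head.next_sibling                 # the newline string right after </head>
next_tag = head.find_next_sibling("body")     # the next sibling that is a <body> tag
prev_tag = soup.body.find_previous_sibling("head")

print(repr(immediate), next_tag.name, prev_tag.name)
```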

Let's have a gander at a slightly more complex website:

In [47]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

Going down to the body level:

In [48]:
body_level = soup.html.body
body_level

<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>

Want all p tags, but attribute access returns only the first one:

In [49]:
body_level.p

<p class="inner-text first-item" id="first">
                First paragraph.
            </p>

We need:
    
.find_all() 
- finds all instances of the specified tag anywhere below the current level.
- returns a list-like ResultSet

In [52]:
body_level.find_all('p')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>,
 <p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]
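
The result behaves like a list of tags, so the text of each one can be pulled out with a comprehension. The sketch below uses a shortened stand-in copy of the page's markup (an assumption, so that it runs without a request) and `get_text(strip=True)` to drop the surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the ids_and_classes page body
html = """
<body>
<div>
<p class="inner-text first-item" id="first">
    First paragraph.
</p>
<p class="inner-text">
    Second paragraph.
</p>
</div>
<p class="outer-text first-item" id="second">
<b>
    First outer paragraph.
</b>
</p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

raw_text = soup.find("p").text  # keeps the newlines and indentation
texts = [p.get_text(strip=True) for p in soup.find_all("p")]

print(repr(raw_text))
print(texts)
```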

#### Class and id selectors
- Modify style easily across:
    - many instances of the same type (class)
    - or one specific instance (id).
- Can also use this for data selection / scraping.

Additional arguments for .find_all()

In [53]:
body_level

<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>

Extract by class:

In [54]:
body_level.find_all('p', class_ = 'inner-text')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>]

In [55]:
body_level

<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>

Extract by id:

In [57]:
body_level.find_all('p', id = 'second')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>]
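
A few variations on the same idea are sketched below, again on stand-in markup rather than the live page: `find()` returns a single tag (or `None`) instead of a list, `class_` matches any one of a tag's classes, and `select()` accepts CSS selectors.

```python
from bs4 import BeautifulSoup

# Stand-in markup modelled on the page above
html = """
<body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

second = soup.find("p", id="second")                   # a single tag, not a list
first_items = soup.find_all("p", class_="first-item")  # matches one class among several
outer = soup.select("p.outer-text")                    # CSS selector syntax
missing = soup.find("p", id="third")                   # no match: None

print(second.get_text())
print([p["id"] for p in first_items])
print(len(outer), missing)
```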

#### Going back to our whisky page

- Get bottling details (age, ABV, distillery, etc)

In [77]:
edradour_url = "https://www.masterofmalt.com/whiskies/edradour-10-year-old-whisky/?srh=1"

In [78]:
edrad_req = requests.get(edradour_url)
edrad_soup = BeautifulSoup(edrad_req.content, 'html.parser')

In [79]:
print(edrad_soup.prettify())

<!DOCTYPE html>
<html lang="en-US">
 <head>
  <title>
   Just a moment...
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="noindex,nofollow" name="robots"/>
  <meta content="width=device-width,initial-scale=1" name="viewport"/>
  <link href="/cdn-cgi/styles/cf-errors.css" rel="stylesheet"/>
  <script>
   (function(){
        window._cf_chl_opt={
            cvId: '2',
            cType: 'managed',
            cNounce: '85171',
            cRay: '73cc7832acad195d',
            cHash: '8c62e218cc11768',
            cUPMDTk: "\/whiskies\/edradour-10-year-old-whisky\/?srh=1&__cf_chl_tk=tJMZFQ57tF8caHkQwp5BdyvxkRUwU6vCqpbtAuxl6sY-1660844792-0-gaNycGzNBz0",
            cFPWv: 'g',
            cTTimeMs: '1000',
            cTplV: 3,
            cRq: {
                ru: 'aHR0cHM6Ly93d3cubWFzdGVyb2ZtYWx0LmNvbS93aGlza2llcy9lZHJhZG91ci0xMC15ZWFyLW9sZC13aGlza3kvP3NyaD0x',
                ra

Extract all info from bottling details:

In [80]:
body_content = edrad_soup.html.body

In [81]:
body_content.find_all('div')

[<div class="main-wrapper" role="main">
 <div class="main-content">
 <h1 class="zone-name-title h1">
 <img class="heading-favicon" onerror="this.onerror=null;this.parentNode.removeChild(this)" src="/favicon.ico"/>
             www.masterofmalt.com
         </h1>
 <h2 class="h2" id="cf-challenge-running">
             Checking if the site connection is secure
         </h2>
 <noscript>
 <div id="cf-challenge-error-title">
 <div class="h2">
 <span class="icon-wrapper">
 </span>
 <span id="cf-challenge-error-text">
                         Enable JavaScript and cookies to continue
                     </span>
 </div>
 </div>
 </noscript>
 <div id="trk_jschal_js" style="display:none;background-image:url('/cdn-cgi/images/trace/managed/nojs/transparent.gif?ray=73cc7832acad195d')"></div>
 <div class="core-msg spacer" id="cf-challenge-body-text">
             www.masterofmalt.com needs to review the security of your connection before
             proceeding.
         </div>
 <form action="/whi

In [82]:
details = edrad_soup.find_all('div', id="whiskyDetailsWrapper")[0]
details

IndexError: list index out of range
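
This IndexError means `find_all` returned an empty result: the page that came back (titled "Just a moment...") is a security check page, not the product page, so the div is not there. A minimal sketch of a guarded lookup follows, using a stand-in HTML string as an assumption in place of the real response:

```python
from bs4 import BeautifulSoup

# Stand-in for a challenge page: no element with id="whiskyDetailsWrapper"
challenge_html = """
<html><head><title>Just a moment...</title></head>
<body><div class="main-wrapper"></div></body></html>
"""
soup = BeautifulSoup(challenge_html, "html.parser")

matches = soup.find_all("div", id="whiskyDetailsWrapper")  # empty, so [0] would fail
details = soup.find("div", id="whiskyDetailsWrapper")      # None instead of an exception

if details is None:
    print("Details block not found; page title:", soup.title.get_text(strip=True))
```

Checking for `None` before going further makes it obvious when a site has served something other than the expected page.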

In [75]:
detail_keys = details.find_all('span', class_ = "kv-key gold")

AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

In [76]:
detail_keys

NameError: name 'detail_keys' is not defined

In [263]:
detail_values = details.find_all('span', class_ = "kv-val")

In [264]:
detail_values

[<span class="kv-val"><a href="https://www.masterofmalt.com/country/scotch-whisky/">Scotch Whisky</a></span>,
 <span class="kv-val"><a href="https://www.masterofmalt.com/region/highland-whisky/">Highland Whisky</a></span>,
 <span class="kv-val" itemprop="brand"><a href="https://www.masterofmalt.com/distilleries/edradour-whisky-distillery/">Edradour</a></span>,
 <span class="kv-val" itemprop="manufacturer">Edradour</span>,
 <span class="kv-val"><a href="https://www.masterofmalt.com/age/10-year-old/">10 year old Whisky</a></span>,
 <span class="kv-val"><a href="https://www.masterofmalt.com/style/single-malt-whisky/">Single Malt Whisky</a></span>,
 <span class="kv-val">40.0%</span>,
 <span class="kv-val">70cl</span>]

In [265]:
data_dict = {}
for key,val in zip(detail_keys, detail_values):
    data_dict[key.text] = val.text
    

In [266]:
data_dict

{'Country': 'Scotch Whisky',
 'Region': 'Highland Whisky',
 'Distillery / Brand': 'Edradour',
 'Bottler': 'Edradour',
 'Age': '10 year old Whisky',
 'Style': 'Single Malt Whisky',
 'Alcohol': '40.0%',
 'Volume': '70cl'}

In [267]:
tasting_note_div = edrad_soup.find_all("div", 
                                       id = "ContentPlaceHolder1_ctl00_ctl02_TastingNoteBox_ctl00_breakDownTastingNote")[0]

tasting_note_div

<div id="ContentPlaceHolder1_ctl00_ctl02_TastingNoteBox_ctl00_breakDownTastingNote">
<p class="pageCopy" id="ContentPlaceHolder1_ctl00_ctl02_TastingNoteBox_ctl00_noseTastingNote"><b>Nose:</b> Medium, great complexity. Thoroughly fruity, sherry, sweetness, alluring vanilla.</p>
<p class="pageCopy" id="ContentPlaceHolder1_ctl00_ctl02_TastingNoteBox_ctl00_palateTastingNote"><b>Palate:</b> Cloying, seductive murkiness. Rum, barley, toasted almonds. Some may find themselves lost in the mêlée, not quite enough method to the madness.</p>
<p class="pageCopy" id="ContentPlaceHolder1_ctl00_ctl02_TastingNoteBox_ctl00_finishTastingNote"><b>Finish:</b> Any confusion is arrested: spiced fruitcake with crème anglaise.</p>
<p class="pageCopy" id="ContentPlaceHolder1_ctl00_ctl02_TastingNoteBox_ctl00_overallTastingNote"><b>Overall:</b> An unusual malt.</p>
</div>

In [268]:
# maxsplit=1 splits on the first colon only, so colons inside a note are preserved
tasting_dict = dict(note.text.split(':', 1) for note in tasting_note_div.find_all('p'))
data_dict.update(tasting_dict)

In [269]:
data_dict

{'Country': 'Scotch Whisky',
 'Region': 'Highland Whisky',
 'Distillery / Brand': 'Edradour',
 'Bottler': 'Edradour',
 'Age': '10 year old Whisky',
 'Style': 'Single Malt Whisky',
 'Alcohol': '40.0%',
 'Volume': '70cl',
 'Nose': ' Medium, great complexity. Thoroughly fruity, sherry, sweetness, alluring vanilla.',
 'Palate': ' Cloying, seductive murkiness. Rum, barley, toasted almonds. Some may find themselves lost in the mêlée, not quite enough method to the madness.',
 'Finish': ' Any confusion is arrested: spiced fruitcake with crème anglaise.',
 'Overall': ' An unusual malt.'}

We have started wrangling data from an actual website.

Now we might want to do this for many whiskies on this site:

- How might we start doing this?
- Any ideas?

When each product page has the same tag structure:

- Build function that extracts data like we did.
- Loop through each product.
- Apply function to each product to scrape data.
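
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not a finished scraper: the function name `extract_details` and the `pages` dictionary of stand-in HTML strings are hypothetical, and the ids and class names are the ones used on the single product page earlier.

```python
from bs4 import BeautifulSoup


def extract_details(html):
    """Return {key: value} from a product page, or {} if the details block is missing."""
    soup = BeautifulSoup(html, "html.parser")
    wrapper = soup.find("div", id="whiskyDetailsWrapper")
    if wrapper is None:  # e.g. a security check page came back instead
        return {}
    keys = wrapper.find_all("span", class_="kv-key")
    values = wrapper.find_all("span", class_="kv-val")
    return {k.get_text(strip=True): v.get_text(strip=True)
            for k, v in zip(keys, values)}


# Hypothetical stand-in pages; real ones would come from requests.get(url).content
pages = {
    "whisky-a": """
        <div id="whiskyDetailsWrapper">
          <span class="kv-key gold">Age</span><span class="kv-val">10 year old Whisky</span>
          <span class="kv-key gold">Alcohol</span><span class="kv-val">40.0%</span>
        </div>""",
    "whisky-b": """
        <div id="whiskyDetailsWrapper">
          <span class="kv-key gold">Age</span><span class="kv-val">12 year old Whisky</span>
          <span class="kv-key gold">Alcohol</span><span class="kv-val">46.0%</span>
        </div>""",
    "blocked": "<html><body><h2>Checking if the site connection is secure</h2></body></html>",
}

# Loop through each product and apply the same extraction function
rows = [{"product": name, **extract_details(html)} for name, html in pages.items()]
for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed directly to `pandas.DataFrame` for further wrangling. When looping over real URLs, pausing between requests (for example with `time.sleep`) is a common courtesy to the site.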