### # Basic Idea about a website:

- The user requests a page from a server. Say, `www.google.com`.

- The server responds with a raw HTML file (for example, **index.html**).

- Web browsers (such as Chrome or Edge) render that raw HTML into the formatted webpage we see.

**NOTE:** To get data out of a website, we can either use its **API** (when one is available) or scrape the HTML with tools like bs4.

### # Key difference between API and Web Services:

A **web service** is a collection of open protocols and standards used for exchanging data between systems or applications over a network, whereas an **API** is a software interface that allows two applications to interact with each other without any user involvement.
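To see why an API is usually the easier route than scraping, here is a minimal sketch. The JSON payload and the HTML snippet are made-up stand-ins for what an API and a web page might return; no real endpoint is contacted.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical API response: structured data, usable right after json.loads
api_payload = '{"title": "What is Data Science?", "views": 120}'
record = json.loads(api_payload)
api_title = record["title"]

# Hypothetical web page: the same fact sits inside markup and has to be scraped out
page_html = "<html><body><h1>What is Data Science?</h1><span>120 views</span></body></html>"
page = BeautifulSoup(page_html, "html.parser")
scraped_title = page.h1.get_text()

print(api_title == scraped_title)
```

With the API we read a field by name; with the page we have to know which tag holds the value.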

### # Libraries We're gonna need for Scraping:

- **requests**

In [1]:
## requests: Solely for fetching the raw content. The fetched content won't have any structure yet.

%pip install requests 




- **html5lib**

In [2]:
## html5lib: A parser for the content (optional here; the cells below use Python's built-in 'html.parser').

%pip install html5lib

Note: you may need to restart the kernel to use updated packages.


- **bs4** (BeautifulSoup)

In [3]:
## bs4: Uses a parser to extract the desired content with the help of built-in functions.

%pip install bs4

Note: you may need to restart the kernel to use updated packages.


In [4]:
# Import required libs 

import requests
from bs4 import BeautifulSoup

In [5]:
# Step 1: Get the HTML

r = requests.get("https://aws.amazon.com/what-is/data-science/")
html_content = r.content
html_content

b'<!doctype html>\n<html class="no-js aws-lng-en_US aws-with-target" lang="en-US" data-static-assets="https://a0.awsstatic.com" data-js-version="1.0.517" data-css-version="1.0.468">\n <head> \n  <meta http-equiv="Content-Security-Policy" content="default-src \'self\' data: https://a0.awsstatic.com; connect-src \'self\' https://112-tzm-766.mktoresp.com https://112-tzm-766.mktoutil.com https://a0.awsstatic.com https://a0.p.awsstatic.com https://a1.awsstatic.com https://amazonwebservices.d2.sc.omtrdc.net https://amazonwebservicesinc.tt.omtrdc.net https://api.regional-table.region-services.aws.a2z.com https://api.us-west-2.prod.pricing.aws.a2z.com https://auth.aws.amazon.com https://aws.amazon.com https://aws.demdex.net https://b0.p.awsstatic.com https://c0.b0.p.awsstatic.com https://calculator.aws https://chatbot-api.us-east-1.prod.mrc-sunrise.marketing.aws.dev https://cm.everesttech.net https://csml-plc-prod.us-west-2.api.aws/plc/csml/logging https://d0.awsstatic.com https://d1.awsstatic
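Step 1 assumes the request always succeeds. A slightly more defensive sketch is below; the helper name `fetch_html` and the 10-second timeout are illustrative choices, not part of the original notebook.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx status codes into exceptions
    except requests.exceptions.RequestException as exc:
        print(f"Request failed: {exc}")
        return None
    return response.content

# A URL without a scheme is rejected before anything is sent over the network
missing = fetch_html("not-a-url")
print(missing)
```

Calling `fetch_html("https://aws.amazon.com/what-is/data-science/")` would give the same bytes as `r.content` above, provided the site is reachable.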

In [6]:
## Step 2: Parse the html content

soup = BeautifulSoup(html_content, 'html.parser')  # We did get some beautified structure!
soup

<!DOCTYPE html>

<html class="no-js aws-lng-en_US aws-with-target" data-css-version="1.0.468" data-js-version="1.0.517" data-static-assets="https://a0.awsstatic.com" lang="en-US">
<head>
<meta content="default-src 'self' data: https://a0.awsstatic.com; connect-src 'self' https://112-tzm-766.mktoresp.com https://112-tzm-766.mktoutil.com https://a0.awsstatic.com https://a0.p.awsstatic.com https://a1.awsstatic.com https://amazonwebservices.d2.sc.omtrdc.net https://amazonwebservicesinc.tt.omtrdc.net https://api.regional-table.region-services.aws.a2z.com https://api.us-west-2.prod.pricing.aws.a2z.com https://auth.aws.amazon.com https://aws.amazon.com https://aws.demdex.net https://b0.p.awsstatic.com https://c0.b0.p.awsstatic.com https://calculator.aws https://chatbot-api.us-east-1.prod.mrc-sunrise.marketing.aws.dev https://cm.everesttech.net https://csml-plc-prod.us-west-2.api.aws/plc/csml/logging https://d0.awsstatic.com https://d1.awsstatic.com https://d1fgizr415o1r6.cloudfront.net https:
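The second argument to `BeautifulSoup` names the parser. The sketch below uses a deliberately broken snippet (an assumption made for illustration) to show that the built-in `'html.parser'` still builds a usable tree.

```python
from bs4 import BeautifulSoup

broken_html = "<p>Hello <b>world"   # neither <b> nor <p> is closed
demo = BeautifulSoup(broken_html, "html.parser")

print(demo)            # the tree serializes with the missing end tags added
print(demo.b.string)
```

If `html5lib` or `lxml` is installed, passing `'html5lib'` or `'lxml'` instead selects that parser; they may repair broken markup differently.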

In [7]:
soup.prettify  # Without parentheses this only shows the bound method; print(soup.prettify()) gives the indented HTML

<bound method Tag.prettify of <!DOCTYPE html>

<html class="no-js aws-lng-en_US aws-with-target" data-css-version="1.0.468" data-js-version="1.0.517" data-static-assets="https://a0.awsstatic.com" lang="en-US">
<head>
<meta content="default-src 'self' data: https://a0.awsstatic.com; connect-src 'self' https://112-tzm-766.mktoresp.com https://112-tzm-766.mktoutil.com https://a0.awsstatic.com https://a0.p.awsstatic.com https://a1.awsstatic.com https://amazonwebservices.d2.sc.omtrdc.net https://amazonwebservicesinc.tt.omtrdc.net https://api.regional-table.region-services.aws.a2z.com https://api.us-west-2.prod.pricing.aws.a2z.com https://auth.aws.amazon.com https://aws.amazon.com https://aws.demdex.net https://b0.p.awsstatic.com https://c0.b0.p.awsstatic.com https://calculator.aws https://chatbot-api.us-east-1.prod.mrc-sunrise.marketing.aws.dev https://cm.everesttech.net https://csml-plc-prod.us-west-2.api.aws/plc/csml/logging https://d0.awsstatic.com https://d1.awsstatic.com https://d1fgiz
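The output above is the repr of a bound method, because `prettify` was referenced without being called. A small sketch on a made-up snippet shows the difference:

```python
from bs4 import BeautifulSoup

snippet = BeautifulSoup("<div><p>Hi</p></div>", "html.parser")

method = snippet.prettify        # a bound method, nothing is formatted yet
pretty = snippet.prettify()      # a str with one node per line, indented by depth

print(callable(method))
print(pretty)
```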

In [8]:
## Type of `soup`

type(soup)

bs4.BeautifulSoup

In [9]:
## Step 3: Tree traversal. Now, we're ready to walk the HTML tree.

# Fetch the title from the web page
title = soup.title
print(title)
print(title.string)

<title>What is Data Science? - Beginner's Guide to Data Science - AWS</title>
What is Data Science? - Beginner's Guide to Data Science - AWS


In [10]:
print(type(title))
print(type(title.string))

<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>


### # Commonly used types of objects in BeautifulSoup:

**i.** Tag

**ii.** NavigableString

**iii.** BeautifulSoup

**iv.** Comment
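A quick way to tell the four types apart is `isinstance`. The snippet below is a sketch on a made-up fragment; the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

doc = BeautifulSoup("<p>Text<!-- hidden note --></p>", "html.parser")
text_node, comment_node = doc.p.contents

print(type(doc).__name__)                       # the whole document
print(isinstance(doc.p, Tag))                   # an element
print(isinstance(text_node, NavigableString))   # text inside an element
print(isinstance(comment_node, Comment))        # an HTML comment

# Comment subclasses NavigableString, so exclude it explicitly
# when only the visible text is wanted
visible = [
    s for s in doc.p.contents
    if isinstance(s, NavigableString) and not isinstance(s, Comment)
]
print(visible)
```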

In [11]:
## Fetch the body of the html page

soup.body
# print(soup.body.string)

<body class="awsm">
<a class="lb-sr-only lb-sr-only-focusable lb-bold lb-skip-el" href="#aws-page-content-main" id="aws-page-skip-to-main"> Skip to main content</a>
<header class="awsm m-page-header lb-with-mobile-subrow" id="aws-page-header" role="banner">
<div class="m-nav" id="m-nav" role="navigation">
<div class="m-nav-header lb-clearfix" data-menu-url="https://s0.awsstatic.com/en_US/nav/v3/panel-content/desktop/index.html">
<div class="m-nav-logo">
<div class="lb-bg-logo aws-amazon_web_services_smile-header-desktop-en">
<a href="https://aws.amazon.com/?nc2=h_lg"><span>Click here to return to Amazon Web Services homepage</span></a>
</div>
</div>
<nav aria-label="Secondary navigation" class="m-nav-secondary-links" style="min-width: 620px">
<a href="/contact-us/?nc2=h_header">Contact Us</a>
<a aria-controls="popover-support-selector" aria-expanded="false" aria-haspopup="true" aria-label="Explore support options" class="lb-txt-none lb-tiny-iblock lb-txt-13 lb-txt lb-has-trigger-indica

In [12]:
## Get all the `paras` from the html page

paras = soup.find_all('p')
paras

[<p>Data science is the study of data to extract meaningful insights for business. It is a multidisciplinary approach that combines principles and practices from the fields of mathematics, statistics, artificial intelligence, and computer engineering to analyze large amounts of data. This analysis helps data scientists to ask and answer questions like what happened, why it happened, what will happen, and what can be done with the results.</p>,
 <p>Data science is important because it combines tools, methods, and technology to generate meaning from data. Modern organizations are inundated with data; there is a proliferation of devices that can automatically collect and store information. Online systems and payment portals capture more data in the fields of e-commerce, medicine, finance, and every other aspect of human life. We have text, audio, video, and image data available in vast quantities.  </p>,
 <p>While the term data science is not new, the meanings and connotations have change

In [13]:
## Get all the `anchors` from the html page

anchors = soup.find_all('a')
anchors

[<a class="lb-sr-only lb-sr-only-focusable lb-bold lb-skip-el" href="#aws-page-content-main" id="aws-page-skip-to-main"> Skip to main content</a>,
 <a href="https://aws.amazon.com/?nc2=h_lg"><span>Click here to return to Amazon Web Services homepage</span></a>,
 <a href="/contact-us/?nc2=h_header">Contact Us</a>,
 <a aria-controls="popover-support-selector" aria-expanded="false" aria-haspopup="true" aria-label="Explore support options" class="lb-txt-none lb-tiny-iblock lb-txt-13 lb-txt lb-has-trigger-indicator" data-lb-popover-trigger="popover-support-selector" data-mbox-ignore="true" href="#" id="popover-popover-support-selector-trigger" role="button"> Support  <i class="icon-caret-down lb-trigger-mount"></i></a>,
 <a aria-controls="popover-language-selector" aria-expanded="false" aria-haspopup="true" aria-label="Set site language" class="lb-tiny-iblock lb-txt lb-has-trigger-indicator" data-language="en" data-lb-popover-trigger="popover-language-selector" href="#" id="m-nav-language-s

In [14]:
## Get the first paragraph in the html page
para1 = soup.find('p')
print(para1.string)

# Access the attributes of a tag (indexing a missing attribute raises KeyError)
print(para1['id'])  # para1 has no id attribute, so this raises KeyError
print(para1['class'])  # para1 has no class attribute either (never reached here)

Data science is the study of data to extract meaningful insights for business. It is a multidisciplinary approach that combines principles and practices from the fields of mathematics, statistics, artificial intelligence, and computer engineering to analyze large amounts of data. This analysis helps data scientists to ask and answer questions like what happened, why it happened, what will happen, and what can be done with the results.


KeyError: 'id'
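The `KeyError` comes from indexing an attribute that the tag does not have. A sketch of the safer lookups, using a made-up two-paragraph snippet:

```python
from bs4 import BeautifulSoup

html = '<p>No attributes here</p><p id="intro" class="lead big">Intro</p>'
soup_attrs = BeautifulSoup(html, "html.parser")
plain, intro = soup_attrs.find_all("p")

print(plain.get("id"))           # missing attribute: None instead of KeyError
print(plain.get("id", "n/a"))    # or a default of your choosing
print(plain.has_attr("class"))   # explicit membership test
print(intro["id"])               # indexing is fine when the attribute exists
print(intro["class"])            # class is multi-valued, so it comes back as a list
print(intro.attrs)               # all attributes as a dict
```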

In [15]:
## Find all paras with class="lead"

soup.find_all('p', class_='lead')

[]
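The empty list simply means no `<p>` on this page carries the class `lead`. To see the filter match something, here is a sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

html = """
<p class="lead">First</p>
<p class="lead highlight">Second</p>
<p class="note">Third</p>
"""
soup_cls = BeautifulSoup(html, "html.parser")

leads = soup_cls.find_all("p", class_="lead")            # any <p> whose classes include "lead"
notes = soup_cls.find_all("p", attrs={"class": "note"})  # the same kind of filter via attrs
none_found = soup_cls.find_all("p", class_="missing")    # no match gives an empty list, not an error

print([p.get_text() for p in leads])
print([p.get_text() for p in notes])
print(none_found)
```

The keyword is `class_` with a trailing underscore because `class` is a reserved word in Python.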

In [16]:
## Get only the text from the tags/soup

print(soup.find('body').get_text())


 Skip to main content





Click here to return to Amazon Web Services homepage



Contact Us
 Support  
English 
My Account 




 Sign In


  Create an AWS Account 









Products
Solutions
Pricing
Documentation
Learn
Partner Network
AWS Marketplace
Customer Enablement
Events
Explore More 

















 Close 



عربي
Bahasa Indonesia
Deutsch
English
Español
Français
Italiano
Português




Tiếng Việt
Türkçe
Ρусский
ไทย
日本語
한국어
中文 (简体)
中文 (繁體)





 Close 

My Profile
Sign out of AWS Builder ID
AWS Management Console
Account Settings
Billing & Cost Management
Security Credentials
AWS Personal Health Dashboard



 Close 

Support Center
Knowledge Center
AWS Support Overview
AWS re:Post












Click here to return to Amazon Web Services homepage







  Get Started for Free 


  Contact Us 












 Products 
 Solutions 
 Pricing 
 Introduction to AWS 
 Getting Started 
 Documentation 
 Training and Certification 
 Developer Center 
 Customer Success 
 Partner Network 
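Most of the output above is blank lines, because `get_text()` keeps the whitespace between tags. A sketch of a tidier call on a made-up snippet:

```python
from bs4 import BeautifulSoup

html = "<div>\n<h1> Title </h1>\n\n<p>Body text</p>\n</div>"
soup_text = BeautifulSoup(html, "html.parser")

raw_text = soup_text.get_text()                              # whitespace kept as-is
clean_text = soup_text.get_text(separator=" ", strip=True)   # strip each piece, join with a space

print(repr(raw_text))
print(clean_text)
```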


In [17]:
## Get the unique links from the page (skipping anchors without an href and '#' links)

unique_links = set()

for link in anchors:
    href = link.get('href')
    if href and '#' not in href:
        unique_links.add(href)

unique_links

{'/architecture/?nc1=f_cc',
 '/big-data/datalakes-and-analytics/',
 '/big-data/datalakes-and-analytics/?sc_icampaign=aware_what-is-seo-pages&sc_ichannel=ha&sc_icontent=awssm-11373_aware&sc_iplace=ed&trk=edb040cb-3307-4428-90ec-83f484dc26bd~ha_awssm-11373_aware',
 '/big-data/datalakes-and-analytics/what-is-a-data-lake/?nc1=f_cc',
 '/blogs/?awsf.blog-master-category=category%2523analytics&sc_icampaign=aware_what-is-seo-pages&sc_ichannel=ha&sc_icontent=awssm-11373_aware&sc_iplace=ed&trk=e11c65a7-7ed5-412a-9acb-7172728db26b~ha_awssm-11373_aware',
 '/blogs/?nc1=f_cc',
 '/console/mobile/',
 '/contact-us/?nc1=f_m',
 '/contact-us/?nc2=h_header',
 '/contact-us/?nc2=h_ql_exm',
 '/containers/?nc1=f_cc',
 '/customer-enablement/?nc2=h_ql_ce',
 '/developer/?nc1=f_dr',
 '/developer/?nc2=h_mo',
 '/developer/language/java/?nc1=f_dr',
 '/developer/language/javascript/?nc1=f_dr',
 '/developer/language/net/?nc1=f_dr',
 '/developer/language/php/?nc1=f_cc',
 '/developer/language/python/?nc1=f_dr',
 '/develo
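Many of these links are relative (they start with `/`). If absolute URLs are needed, `urllib.parse.urljoin` can resolve them against the page URL. The snippet below is a sketch on a made-up set of anchors:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://aws.amazon.com/what-is/data-science/"
html = """
<a href="/contact-us/">Contact</a>
<a href="#top">Top</a>
<a>No href</a>
<a href="https://aws.amazon.com/blogs/">Blogs</a>
<a href="/contact-us/">Contact again</a>
"""
soup_links = BeautifulSoup(html, "html.parser")

absolute_links = set()
for a in soup_links.find_all("a", href=True):   # href=True skips anchors without an href
    href = a["href"]
    if href.startswith("#"):                    # in-page fragment, not a separate page
        continue
    absolute_links.add(urljoin(base_url, href))

print(sorted(absolute_links))
```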

### Let's see what the `comment` type is all about:

In [18]:
markup = '<p><!-- This is a comment! --></p>'
soup2 = BeautifulSoup(markup, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning
type(soup2.p.string)

bs4.element.Comment

In [19]:
## Where id="something"

soup.find(id="aws-page-header")

# Print out its children
print(soup.find(id="aws-page-header").children)  ## .children is an iterator, so this prints the iterator object itself

# Print out its contents
for elem in soup.find(id="aws-page-header").contents:
    print(elem)

<list_iterator object at 0x0000023A243CECD0>


<div class="m-nav" id="m-nav" role="navigation">
<div class="m-nav-header lb-clearfix" data-menu-url="https://s0.awsstatic.com/en_US/nav/v3/panel-content/desktop/index.html">
<div class="m-nav-logo">
<div class="lb-bg-logo aws-amazon_web_services_smile-header-desktop-en">
<a href="https://aws.amazon.com/?nc2=h_lg"><span>Click here to return to Amazon Web Services homepage</span></a>
</div>
</div>
<nav aria-label="Secondary navigation" class="m-nav-secondary-links" style="min-width: 620px">
<a href="/contact-us/?nc2=h_header">Contact Us</a>
<a aria-controls="popover-support-selector" aria-expanded="false" aria-haspopup="true" aria-label="Explore support options" class="lb-txt-none lb-tiny-iblock lb-txt-13 lb-txt lb-has-trigger-indicator" data-lb-popover-trigger="popover-support-selector" data-mbox-ignore="true" href="#" id="popover-popover-support-selector-trigger" role="button"> Support  <i class="icon-caret-down lb-trigger-mount"></i></a>

In [20]:
# Print out its Children

for elem in soup.find(id="aws-page-header").children:
    print(elem)



<div class="m-nav" id="m-nav" role="navigation">
<div class="m-nav-header lb-clearfix" data-menu-url="https://s0.awsstatic.com/en_US/nav/v3/panel-content/desktop/index.html">
<div class="m-nav-logo">
<div class="lb-bg-logo aws-amazon_web_services_smile-header-desktop-en">
<a href="https://aws.amazon.com/?nc2=h_lg"><span>Click here to return to Amazon Web Services homepage</span></a>
</div>
</div>
<nav aria-label="Secondary navigation" class="m-nav-secondary-links" style="min-width: 620px">
<a href="/contact-us/?nc2=h_header">Contact Us</a>
<a aria-controls="popover-support-selector" aria-expanded="false" aria-haspopup="true" aria-label="Explore support options" class="lb-txt-none lb-tiny-iblock lb-txt-13 lb-txt lb-has-trigger-indicator" data-lb-popover-trigger="popover-support-selector" data-mbox-ignore="true" href="#" id="popover-popover-support-selector-trigger" role="button"> Support  <i class="icon-caret-down lb-trigger-mount"></i></a>
<a aria-controls="popover-language-selector"

**=>** What exactly is the difference between the two?

### `.contents`: A tag's children are available as a list.

### `.children`: A tag's children are available as an iterator, so you loop over them rather than index into them.
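The difference is easier to see on a tiny tree. This is a sketch on a made-up list:

```python
from bs4 import BeautifulSoup

soup_kids = BeautifulSoup("<ul><li>a</li><li>b</li><li>c</li></ul>", "html.parser")
ul = soup_kids.ul

as_list = ul.contents                 # a real list: supports len() and indexing
print(type(as_list).__name__, len(as_list))
print(as_list[0])

as_iter = ul.children                 # an iterator: consume it with a loop
print(isinstance(as_iter, list))
names = [child.name for child in as_iter]
print(names)
```

Both give direct children only; `.descendants` is the attribute that walks the whole subtree.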

In [21]:
## .stripped_strings

for item in soup.find(id="aws-page-header").stripped_strings:  # .strings also works, but keeps the surrounding whitespace
    print(item)

Click here to return to Amazon Web Services homepage
Contact Us
Support
English
My Account
Sign In
Create an AWS Account
Products
Solutions
Pricing
Documentation
Learn
Partner Network
AWS Marketplace
Customer Enablement
Events
Explore More
Close
عربي
Bahasa Indonesia
Deutsch
English
Español
Français
Italiano
Português
Tiếng Việt
Türkçe
Ρусский
ไทย
日本語
한국어
中文 (简体)
中文 (繁體)
Close
My Profile
Sign out of AWS Builder ID
AWS Management Console
Account Settings
Billing & Cost Management
Security Credentials
AWS Personal Health Dashboard
Close
Support Center
Knowledge Center
AWS Support Overview
AWS re:Post
Click here to return to Amazon Web Services homepage
Get Started for Free
Contact Us
Products
Solutions
Pricing
Introduction to AWS
Getting Started
Documentation
Training and Certification
Developer Center
Customer Success
Partner Network
AWS Marketplace
Support
AWS re:Post
Log into Console
Download the Mobile App
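For comparison, here is what `.strings` and `.stripped_strings` each yield on a made-up list (a sketch, not taken from the AWS page):

```python
from bs4 import BeautifulSoup

soup_str = BeautifulSoup("<ul>\n<li> Products </li>\n<li>Pricing</li>\n</ul>", "html.parser")

all_strings = list(soup_str.ul.strings)             # every text node, whitespace included
clean_strings = list(soup_str.ul.stripped_strings)  # trimmed, whitespace-only nodes dropped

print(all_strings)
print(clean_strings)
```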


In [22]:
## Print out its `parent`

soup.find(id="aws-page-header").parent

<body class="awsm">
<a class="lb-sr-only lb-sr-only-focusable lb-bold lb-skip-el" href="#aws-page-content-main" id="aws-page-skip-to-main"> Skip to main content</a>
<header class="awsm m-page-header lb-with-mobile-subrow" id="aws-page-header" role="banner">
<div class="m-nav" id="m-nav" role="navigation">
<div class="m-nav-header lb-clearfix" data-menu-url="https://s0.awsstatic.com/en_US/nav/v3/panel-content/desktop/index.html">
<div class="m-nav-logo">
<div class="lb-bg-logo aws-amazon_web_services_smile-header-desktop-en">
<a href="https://aws.amazon.com/?nc2=h_lg"><span>Click here to return to Amazon Web Services homepage</span></a>
</div>
</div>
<nav aria-label="Secondary navigation" class="m-nav-secondary-links" style="min-width: 620px">
<a href="/contact-us/?nc2=h_header">Contact Us</a>
<a aria-controls="popover-support-selector" aria-expanded="false" aria-haspopup="true" aria-label="Explore support options" class="lb-txt-none lb-tiny-iblock lb-txt-13 lb-txt lb-has-trigger-indica

In [23]:
## Print out its `parents`

soup.find(id="aws-page-header").parents

<generator object PageElement.parents at 0x0000023A27ABE120>

**=>** We've got a generator object to iterate!

In [24]:
for item in soup.find(id="aws-page-header").parents:
    print(item.name)

body
html
[document]
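The same walk up the tree can be tried on a small made-up document. `find_parent` is handy when only one particular ancestor is wanted:

```python
from bs4 import BeautifulSoup

soup_tree = BeautifulSoup(
    "<html><body><div><p id='x'>hi</p></div></body></html>", "html.parser"
)
p = soup_tree.find(id="x")

ancestor_names = [parent.name for parent in p.parents]   # nearest ancestor first
nearest_div = p.find_parent("div")                       # first ancestor that is a <div>

print(ancestor_names)
print(nearest_div.name)
```

The final `[document]` entry is the `BeautifulSoup` object itself, which sits above `<html>`.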


In [25]:
## Siblings (the immediate sibling on each side is a whitespace string, hence the double hop)

print(soup.find(id="aws-page-header").previous_sibling.previous_sibling)
print("---")
print(soup.find(id="aws-page-header").next_sibling.next_sibling)

<a class="lb-sr-only lb-sr-only-focusable lb-bold lb-skip-el" href="#aws-page-content-main" id="aws-page-skip-to-main"> Skip to main content</a>
---
<div class="lb-page-content lb-page-with-sticky-subnav" data-page-alert-target="true" id="aws-page-content" style="padding-top:0px; padding-bottom:0px;">
<main id="aws-page-content-main" role="main" tabindex="-1">
<div data-eb-slot="what-is-header" data-eb-slot-meta="{'version':'1.0','slotId':'what-is-header','experienceId':'93f2c10b-57a0-4aac-a291-b4b33afe10b1','allowBlank':false,'filters':{'limit':1,'query':'id \u003d \'what-is-data-science\''}}">
<div data-eb-a711d52c="" data-eb-c-scope="what-is-header" data-eb-ce="" data-eb-d-scope="DIRECTORIES" data-eb-ssr-ce="" data-eb-tpl-n="what-is-header" data-eb-tpl-v="1.0.0">
<style>[data-eb-a711d52c] .eb-what-is-header {
  background-color: #1e2832;
  background-image: url("//d1.awsstatic.com/r2018/h/QuickSight Q/Site Merch/SiteMerch-QuickSightQ_Hero-BG.c455f708c1d1da51ca3520e7678b415423fd06a5.
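Chaining `.previous_sibling.previous_sibling` works, but it depends on exactly one whitespace string sitting between the tags. A sketch with `find_previous_sibling` and `find_next_sibling`, which skip text nodes, on a made-up body:

```python
from bs4 import BeautifulSoup

html = '<body><a id="skip">Skip</a>\n<header id="hdr">Header</header>\n<main id="content">Main</main></body>'
soup_sib = BeautifulSoup(html, "html.parser")
header = soup_sib.find(id="hdr")

print(repr(header.previous_sibling))      # the newline between the two tags

prev_tag = header.find_previous_sibling() # nearest sibling that is a tag
next_tag = header.find_next_sibling()
print(prev_tag["id"], next_tag["id"])
```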

In [26]:
## To query with "CSS Selectors":

print(soup.select('#m-nav-secondary-links'))  # '#' selects by id; no element has this id, so the result is empty
print("---")
print(soup.select('.m-nav-secondary-links'))  # '.' selects by class

[]
---
[<nav aria-label="Secondary navigation" class="m-nav-secondary-links" style="min-width: 620px">
<a href="/contact-us/?nc2=h_header">Contact Us</a>
<a aria-controls="popover-support-selector" aria-expanded="false" aria-haspopup="true" aria-label="Explore support options" class="lb-txt-none lb-tiny-iblock lb-txt-13 lb-txt lb-has-trigger-indicator" data-lb-popover-trigger="popover-support-selector" data-mbox-ignore="true" href="#" id="popover-popover-support-selector-trigger" role="button"> Support  <i class="icon-caret-down lb-trigger-mount"></i></a>
<a aria-controls="popover-language-selector" aria-expanded="false" aria-haspopup="true" aria-label="Set site language" class="lb-tiny-iblock lb-txt lb-has-trigger-indicator" data-language="en" data-lb-popover-trigger="popover-language-selector" href="#" id="m-nav-language-selector" role="button">English <i class="icon-caret-down lb-trigger-mount"></i></a>
<a aria-controls="popover-my-account" aria-expanded="false" aria-haspopup="true"
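A few more selector forms, sketched on a made-up `<nav>` (the id `main-nav` is hypothetical and does not come from the AWS page):

```python
from bs4 import BeautifulSoup

html = """
<nav id="main-nav" class="m-nav-secondary-links">
  <a href="/contact-us/">Contact Us</a>
  <a href="#" class="lb-txt">Support</a>
</nav>
"""
soup_css = BeautifulSoup(html, "html.parser")

by_id = soup_css.select("#main-nav")                   # '#': match on id
by_class = soup_css.select(".m-nav-secondary-links")   # '.': match on class
links = soup_css.select("nav a")                       # descendant: every <a> inside <nav>
first_txt = soup_css.select_one("a.lb-txt")            # first match only
missing = soup_css.select_one("#does-not-exist")       # None when nothing matches

print([a.get_text() for a in links])
print(first_txt.get_text())
print(missing)
```

`select` always returns a list (possibly empty), while `select_one` returns a single tag or `None`.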