### Beautiful Soup Basics

<img src="beautifulsoup_image.png">

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.

<img src="process.png">

<img src="setting_environment.png">

<img src="fetching_html.png">

<img src="parsing_html.png">

<img src="traversing_html_tree.png">

There are mainly two ways to extract data from a website:

<ul>
<li>Use the API of the website (if it exists). For example, Facebook has the Facebook Graph API, which allows retrieval of data posted on Facebook.</li>
<li>Access the HTML of the webpage and extract useful information/data from it. This technique is called web scraping, web harvesting, or web data extraction.</li>
</ul>

#### Step 0: Setting up Environment
##### Install all the requirements
pip install requests

pip install beautifulsoup4

pip install html5lib

In [52]:
import requests
from bs4 import BeautifulSoup
url = 'https://codewithharry.com/'

#### Step 1: Get the HTML

In [53]:
r = requests.get(url)
htmlcontent = r.content
htmlcontent

b'<!DOCTYPE html><html lang="en"><head><meta name="viewport" content="width=device-width"/><meta charSet="utf-8"/><script async="" src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js?client=ca-pub-9655830461045889" crossorigin="anonymous"></script><script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({\'gtm.start\':\n        new Date().getTime(),event:\'gtm.js\'});var f=d.getElementsByTagName(s)[0],\n        j=d.createElement(s),dl=l!=\'dataLayer\'?\'&l=\'+l:\'\';j.async=true;j.src=\n        \'https://www.googletagmanager.com/gtm.js?id=\'+i+dl;f.parentNode.insertBefore(j,f);\n        })(window,document,\'script\',\'dataLayer\',\'GTM-MCDDKRF\');</script><title>Home | CodeWithHarry</title><meta name="description" content="Welcome to Code With Harry. Code With Harry is my attempt to teach basics and those coding techniques to people in short time which took me ages to learn."/><link rel="icon" href="/favicon.ico"/><meta name="next-head-count" content="7"/><link rel="preload
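A slightly more defensive version of the fetch step might set a timeout and check the HTTP status before handing the bytes to the parser. This is only a minimal sketch: the `fetch_html` helper name and the 10-second timeout are assumptions made for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw response body, raising on network or HTTP errors."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.content


# Usage (performs a real network request, so it is left commented out):
# htmlcontent = fetch_html('https://codewithharry.com/')
```

Failing early on a bad status avoids parsing an error page as if it were the real content.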

#### Step 2: Parse the HTML

In [54]:
soup = BeautifulSoup(htmlcontent, 'html.parser')
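soup.prettify  # no parentheses, so this displays the bound method plus the tag's repr; call print(soup.prettify()) for indented HTML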

<bound method Tag.prettify of <!DOCTYPE html>
<html lang="en"><head><meta content="width=device-width" name="viewport"/><meta charset="utf-8"/><script async="" crossorigin="anonymous" src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js?client=ca-pub-9655830461045889"></script><script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
        new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
        j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
        'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
        })(window,document,'script','dataLayer','GTM-MCDDKRF');</script><title>Home | CodeWithHarry</title><meta content="Welcome to Code With Harry. Code With Harry is my attempt to teach basics and those coding techniques to people in short time which took me ages to learn." name="description"/><link href="/favicon.ico" rel="icon"/><meta content="7" name="next-head-count"/><link as="s
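Note that `prettify` is a method: written without parentheses it evaluates to the bound method object rather than to formatted markup. The sketch below shows the difference on a small inline document (the HTML string is made up for illustration).

```python
from bs4 import BeautifulSoup

html = '<html><head><title>Demo</title></head><body><p>Hello</p></body></html>'
doc = BeautifulSoup(html, 'html.parser')

method = doc.prettify    # the bound method itself; nothing is formatted yet
pretty = doc.prettify()  # a str with one tag or string per line, indented by depth

print(pretty)
```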

#### Step 3: HTML tree traversal

In [55]:
# <body>

#   <div id="content">
#     <h1>Heading here</h1>
#     <p>Lorem ipsum dolor sit amet.</p>
#     <p>Lorem ipsum dolor <em>sit</em> amet.</p>
#     <hr>
#   </div>
  
#   <div id="nav">
#     <ul>
#       <li>item 1</li>
#       <li>item 2</li>
#       <li>item 3</li>
#     </ul>
#   </div>

# </body>

<img src="tree.gif">

##### Ancestor
An ancestor refers to any element that is connected but further up the document tree - no matter how many levels higher.

In the diagram below, the body element is the ancestor of all other elements on the page.

<img src="tree_ancestor.gif">


##### Descendant
A descendant refers to any element that is connected but lower down the document tree - no matter how many levels lower.

In the diagram below, all elements that are connected below the div element are descendants of that div.

<img src="tree_descendant.gif">

##### Parent and Child
A parent is an element that is directly above and connected to an element in the document tree. In the diagram below, the div is the parent of the ul.

A child is an element that is directly below and connected to an element in the document tree. In the diagram below, the ul is a child of the div.

<img src="tree_parent.gif">


##### Sibling
A sibling is an element that shares the same parent with another element.

In the diagram below, the li elements are siblings, as they all share the same parent - the ul.

<img src="tree_siblings.gif">
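The relationships described above map onto Beautiful Soup attributes such as `.parent`, `.parents`, `.children`, and `.descendants`. The following sketch rebuilds the sample document from this step as a compact string (whitespace between tags is omitted so that no extra text nodes appear) and walks it; the variable names are illustrative.

```python
from bs4 import BeautifulSoup, Tag

html = (
    '<body>'
    '<div id="content"><h1>Heading here</h1>'
    '<p>Lorem ipsum dolor sit amet.</p>'
    '<p>Lorem ipsum dolor <em>sit</em> amet.</p><hr></div>'
    '<div id="nav"><ul><li>item 1</li><li>item 2</li><li>item 3</li></ul></div>'
    '</body>'
)
doc = BeautifulSoup(html, 'html.parser')

ul = doc.find('ul')
first_li = ul.find('li')

# Parent: the element directly above
parent_name = ul.parent.name

# Ancestors: every element above, nearest first
ancestor_names = [tag.name for tag in first_li.parents]

# Children: direct children only
child_names = [child.name for child in ul.children]

# Descendants: everything below, at any depth (text nodes filtered out)
descendant_tags = [d.name for d in doc.body.descendants if isinstance(d, Tag)]

# Siblings: elements sharing the same parent
sibling_texts = [li.get_text() for li in first_li.find_next_siblings('li')]

print(parent_name, ancestor_names, child_names, sibling_texts)
```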

In [56]:
title=soup.title
title

<title>Home | CodeWithHarry</title>

In [57]:
type(title)

bs4.element.Tag

In [58]:
title.string

'Home | CodeWithHarry'

In [59]:
type(title.string)

bs4.element.NavigableString
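One caveat worth knowing: `.string` only returns text when a tag has a single child. For tags with mixed content it is `None`, and `get_text()` is the usual alternative. A small sketch on made-up markup:

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup(
    '<p id="plain">Lorem ipsum</p><p id="mixed">Lorem <em>ipsum</em> dolor</p>',
    'html.parser',
)
plain = doc.find('p', id='plain')
mixed = doc.find('p', id='mixed')

plain_string = plain.string    # the single text child
mixed_string = mixed.string    # None, because <p> has three children
mixed_text = mixed.get_text()  # concatenates all nested text

print(plain_string, mixed_string, mixed_text)
```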

##### Commonly used types of objects

<ol>
<li>Tag</li>
<li>NavigableString</li>
<li>BeautifulSoup</li>
<li>Comment</li>
</ol>

In [60]:
print(type(soup))
print(type(title))
print(type(title.string))
markup = '<p><!-- This is a comment --></p>'
soup2 = BeautifulSoup(markup, 'html.parser')  # name the parser explicitly to avoid a warning
print(type(soup2.p.string))

<class 'bs4.BeautifulSoup'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.Comment'>


##### Get all the paragraphs from the page

In [61]:
paras = soup.find_all('p')
paras

[<p class="mt-2 text-sm text-gray-500 md:text-base">Confused on which course to take? I have got you covered. Browse courses and find out the best course for you. Its free! Code With Harry is my attempt to teach basics and those coding techniques to people in short time which took me ages to learn.</p>,
 <p class="text-gray-700 text-base">Complete Tailwind CSS Course by CodeWithHarry in Hindi - Learn Tailwind CSS from scratch for free! </p>,
 <p class="text-gray-700 text-base">Complete Next.js Course by CodeWithHarry in Hindi - Learn Next.js from Scratch.</p>,
 <p class="text-gray-700 text-base">React is a free and open-source front-end JavaScript library. This series will cover React from starting to the end. We will learn react from the ground up!</p>,
 <p class="leading-relaxed mb-6">I don't have words to thank this man, I'm really grateful to have this channel and website in my daily routine. If you're a mere beginner, then you can trust this guy and can put your time into his cont

##### Get all the anchor tags from the page

In [62]:
anchors = soup.find_all('a')
anchors

[<a href="/">CodeWithHarry</a>,
 <a href="/">Home</a>,
 <a href="/videos/">Courses</a>,
 <a href="/blog/">Blog</a>,
 <a href="/contact/">Contact</a>,
 <a href="/videos/tailwind-course-in-hindi-1/">Tailwind Course In Hindi</a>,
 <a href="/videos/nextjs-tutorial-in-hindi-1/">Next.js Tutorials For Beginners</a>,
 <a href="/videos/react-tutorials-in-hindi-1/">React Js Tutorials For Beginners</a>,
 <a class="text-gray-500" href="https://www.facebook.com/CodeWithHarry/" rel="noreferrer" target="_blank"><svg class="w-5 h-5" fill="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="2" viewbox="0 0 24 24"><path d="M18 2h-3a5 5 0 00-5 5v3H7v4h3v8h4v-8h3l1-4h-4V7a1 1 0 011-1h3z"></path></svg></a>,
 <a class="ml-3 text-gray-500" href="https://www.instagram.com/CodeWithHarry/" rel="noreferrer" target="_blank"><svg class="w-5 h-5" fill="none" stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="2" viewbox="0 0 24 24"><rect height="20" rx="5" ry="5

##### Get the first paragraph from the page

In [63]:
soup.find('p')

<p class="mt-2 text-sm text-gray-500 md:text-base">Confused on which course to take? I have got you covered. Browse courses and find out the best course for you. Its free! Code With Harry is my attempt to teach basics and those coding techniques to people in short time which took me ages to learn.</p>

##### Get the class names of the first paragraph

In [64]:
soup.find('p')['class']

['mt-2', 'text-sm', 'text-gray-500', 'md:text-base']

##### Find all the paragraphs with class text-base

In [65]:
soup.find_all("p",class_='text-base')

[<p class="text-gray-700 text-base">Complete Tailwind CSS Course by CodeWithHarry in Hindi - Learn Tailwind CSS from scratch for free! </p>,
 <p class="text-gray-700 text-base">Complete Next.js Course by CodeWithHarry in Hindi - Learn Next.js from Scratch.</p>,
 <p class="text-gray-700 text-base">React is a free and open-source front-end JavaScript library. This series will cover React from starting to the end. We will learn react from the ground up!</p>]
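The `class_` filter matches against each individual class token, which is why the first paragraph on the page (whose classes include `md:text-base` but not `text-base`) is absent from the result above. A minimal sketch on made-up markup that mirrors those class lists:

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup(
    '<p class="text-gray-700 text-base">one</p>'
    '<p class="mt-2 text-sm md:text-base">two</p>'
    '<p class="leading-relaxed">three</p>',
    'html.parser',
)

# class is a multi-valued attribute, so it comes back as a list of tokens
classes = doc.find('p')['class']

# Matches any <p> that has 'text-base' as one of its tokens;
# 'md:text-base' is a different token and does not match
matches = [p.get_text() for p in doc.find_all('p', class_='text-base')]

print(classes, matches)
```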

##### Get the text from a tag

In [66]:
soup.find('p').get_text()

'Confused on which course to take? I have got you covered. Browse courses and find out the best course for you. Its free! Code With Harry is my attempt to teach basics and those coding techniques to people in short time which took me ages to learn.'

##### Get the text from the entire page

In [67]:
soup.get_text()

"Home | CodeWithHarryCodeWithHarryMenuHomeCoursesBlogContactLoginSignupWelcome to CodeWithHarryConfused on which course to take? I have got you covered. Browse courses and find out the best course for you. Its free! Code With Harry is my attempt to teach basics and those coding techniques to people in short time which took me ages to learn.Free CoursesExplore BlogRecommended CoursesFree CourseTailwind Course In HindiComplete Tailwind CSS Course by CodeWithHarry in Hindi - Learn Tailwind CSS from scratch for free!  Start WatchingFree CourseNext.js Tutorials For BeginnersComplete Next.js Course by CodeWithHarry in Hindi - Learn Next.js from Scratch. Start WatchingFree CourseReact Js Tutorials For BeginnersReact is a free and open-source front-end JavaScript library. This series will cover React from starting to the end. We will learn react from the ground up! Start WatchingTestimonialsI don't have words to thank this man, I'm really grateful to have this channel and website in my daily r
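In the output above, text from adjacent tags runs together ("CodeWithHarryMenuHome..."). `get_text()` accepts `separator` and `strip` arguments that help with this; the sketch below uses a small made-up fragment to show the effect.

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup(
    '<div><h1>Welcome</h1><p>Browse courses</p><a href="/blog/">Explore Blog</a></div>',
    'html.parser',
)

joined = doc.get_text()                           # pieces are concatenated directly
spaced = doc.get_text(separator=' ', strip=True)  # pieces are stripped, then joined by a space

print(joined)
print(spaced)
```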

##### Get all the links on the page

In [68]:
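from urllib.parse import urljoin

base_url = "https://CodeWithHarry.com"
anchors = soup.find_all('a')
all_links = set()
for link in anchors:
    href = link.get('href')
    # Skip anchors without an href and placeholder '#' links
    if href and href != '#':
        # urljoin resolves relative paths and leaves absolute URLs untouched
        all_links.add(urljoin(base_url, href))

for link in all_links:
    # Print only internal links; external ones (e.g. social media) contain 'www'
    if 'www' not in link:
        print(link)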

https://CodeWithHarry.com/blog/
https://CodeWithHarry.com/videos/tailwind-course-in-hindi-1/
https://CodeWithHarry.com/
https://CodeWithHarry.com/videos/react-tutorials-in-hindi-1/
https://CodeWithHarry.com/contact/
https://CodeWithHarry.com/videos/nextjs-tutorial-in-hindi-1/
https://CodeWithHarry.com/videos/
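Prefixing every `href` with the site root only works for root-relative paths. The standard library's `urllib.parse.urljoin` handles root-relative, page-relative, and already-absolute links alike. A short sketch, with example hrefs chosen for illustration:

```python
from urllib.parse import urljoin

base = 'https://codewithharry.com/videos/'

root_relative = urljoin(base, '/blog/')    # replaces the whole path
page_relative = urljoin(base, 'react/')    # resolved against the base path
absolute = urljoin(base, 'https://www.facebook.com/CodeWithHarry/')  # left unchanged

print(root_relative)
print(page_relative)
print(absolute)
```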


In [69]:
searchcontent = soup.find('div', id='search-content')
searchcontent

<div class="relative w-full hidden bg-white shadow-xl" id="search-content"><div class="container mx-auto py-4 text-black"><input autofocus="" class="w-full text-grey-800 transition focus:outline-none focus:border-transparent p-2 appearance-none leading-normal text-xl lg:text-2xl" id="searchfield" placeholder="Search..." type="search"/></div></div>

In [70]:
searchcontent.contents

[<div class="container mx-auto py-4 text-black"><input autofocus="" class="w-full text-grey-800 transition focus:outline-none focus:border-transparent p-2 appearance-none leading-normal text-xl lg:text-2xl" id="searchfield" placeholder="Search..." type="search"/></div>]

`.contents` returns a tag's direct children as a list.

`.children` returns the same direct children as an iterator, which is handy when you only need to loop over them once.
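A quick way to see the difference, and to notice that whitespace between tags counts as a child too, is the following sketch on a made-up list:

```python
from bs4 import BeautifulSoup, Tag

doc = BeautifulSoup('<ul>\n<li>a</li>\n<li>b</li>\n</ul>', 'html.parser')
ul = doc.ul

contents = ul.contents   # a real list: supports len() and indexing
children = ul.children   # an iterator over the same nodes

# The newlines between tags are text nodes, so there are 5 children, not 2
tag_children = [child for child in ul.children if isinstance(child, Tag)]

print(len(contents), len(tag_children))
```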

In [71]:
for elem in searchcontent.contents:
    print(elem)

<div class="container mx-auto py-4 text-black"><input autofocus="" class="w-full text-grey-800 transition focus:outline-none focus:border-transparent p-2 appearance-none leading-normal text-xl lg:text-2xl" id="searchfield" placeholder="Search..." type="search"/></div>


In [72]:
for elem in searchcontent.children:
    print(elem)

<div class="container mx-auto py-4 text-black"><input autofocus="" class="w-full text-grey-800 transition focus:outline-none focus:border-transparent p-2 appearance-none leading-normal text-xl lg:text-2xl" id="searchfield" placeholder="Search..." type="search"/></div>


In [73]:
for item in searchcontent.strings:
    print(item)

In [74]:
for item in searchcontent.stripped_strings:
    print(item)
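Both loops above print nothing because the search box contains no text. On an element that does contain text, `.strings` yields every text node (including whitespace-only ones), while `.stripped_strings` trims each one and skips the empty results. A sketch on made-up markup:

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup('<div>\n <p> Hello </p>\n <p>World</p>\n</div>', 'html.parser')

raw = list(doc.div.strings)             # includes the whitespace between tags
clean = list(doc.div.stripped_strings)  # trimmed, whitespace-only nodes dropped

print(raw)
print(clean)
```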

In [75]:
searchcontent.parent

<div id="__next"><div class="w-full z-10 sticky bg-white top-0 border-b border-grey-light"><div class="w-full flex flex-wrap items-center lg:justify-between mt-0 py-4 justify-center"><div class="mx-0 px-0 lg:pl-4 flex items-center lg:mx-4"><span class="text-teal-700 no-underline hover:no-underline font-bold text-xl text-purple-800"><a href="/">CodeWithHarry</a></span></div><div class="pr-4 absolute top-1 right-1 pt-3"><button class="lg:hidden flex items-center px-3 py-2 border rounded text-grey border-grey-dark hover:text-black hover:border-purple appearance-none focus:outline-none"><svg class="fill-current h-3 w-3" viewbox="0 0 20 20" xmlns="http://www.w3.org/2000/svg"><title>Menu</title><path d="M0 3h20v2H0V3zm0 6h20v2H0V9zm0 6h20v2H0v-2z"></path></svg></button></div><div class="w-full flex-grow lg:flex lg:flex-1 lg:content-center lg:justify-end lg:w-auto h-0 lg:h-auto overflow-hidden mt-2 lg:mt-0 z-20 transition-all" id="nav-content"><ul class="flex items-center flex-col lg:flex-row

In [76]:
searchcontent.parents

<generator object parents at 0x0000029078AB3200>

In [77]:
for item in searchcontent.parents:
    print(item)

<div id="__next"><div class="w-full z-10 sticky bg-white top-0 border-b border-grey-light"><div class="w-full flex flex-wrap items-center lg:justify-between mt-0 py-4 justify-center"><div class="mx-0 px-0 lg:pl-4 flex items-center lg:mx-4"><span class="text-teal-700 no-underline hover:no-underline font-bold text-xl text-purple-800"><a href="/">CodeWithHarry</a></span></div><div class="pr-4 absolute top-1 right-1 pt-3"><button class="lg:hidden flex items-center px-3 py-2 border rounded text-grey border-grey-dark hover:text-black hover:border-purple appearance-none focus:outline-none"><svg class="fill-current h-3 w-3" viewbox="0 0 20 20" xmlns="http://www.w3.org/2000/svg"><title>Menu</title><path d="M0 3h20v2H0V3zm0 6h20v2H0V9zm0 6h20v2H0v-2z"></path></svg></button></div><div class="w-full flex-grow lg:flex lg:flex-1 lg:content-center lg:justify-end lg:w-auto h-0 lg:h-auto overflow-hidden mt-2 lg:mt-0 z-20 transition-all" id="nav-content"><ul class="flex items-center flex-col lg:flex-row

In [78]:
for item in searchcontent.parents:
    print(item.name)

div
body
html
[document]


In [79]:
searchcontent.next_sibling

<div class="mx-auto my-1"></div>

In [80]:
searchcontent.next_sibling.next_sibling

<div class="" style="position:fixed;top:0;left:0;height:2px;background:transparent;z-index:99999999999;width:100%"><div class="" style="height:100%;background:purple;transition:all 500ms ease;width:0%"><div style="box-shadow:0 0 10px purple, 0 0 10px purple;width:5%;opacity:1;position:absolute;height:100%;transition:all 500ms ease;transform:rotate(3deg) translate(0px, -4px);left:-10rem"></div></div></div>

In [81]:
searchcontent.previous_sibling

<div class="w-full z-10 sticky bg-white top-0 border-b border-grey-light"><div class="w-full flex flex-wrap items-center lg:justify-between mt-0 py-4 justify-center"><div class="mx-0 px-0 lg:pl-4 flex items-center lg:mx-4"><span class="text-teal-700 no-underline hover:no-underline font-bold text-xl text-purple-800"><a href="/">CodeWithHarry</a></span></div><div class="pr-4 absolute top-1 right-1 pt-3"><button class="lg:hidden flex items-center px-3 py-2 border rounded text-grey border-grey-dark hover:text-black hover:border-purple appearance-none focus:outline-none"><svg class="fill-current h-3 w-3" viewbox="0 0 20 20" xmlns="http://www.w3.org/2000/svg"><title>Menu</title><path d="M0 3h20v2H0V3zm0 6h20v2H0V9zm0 6h20v2H0v-2z"></path></svg></button></div><div class="w-full flex-grow lg:flex lg:flex-1 lg:content-center lg:justify-end lg:w-auto h-0 lg:h-auto overflow-hidden mt-2 lg:mt-0 z-20 transition-all" id="nav-content"><ul class="flex items-center flex-col lg:flex-row"><div class="sea

In [82]:
searchcontent.previous_sibling.previous_sibling
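The last cell shows nothing because there is no sibling that far back, so the expression evaluates to `None`. Two things are worth keeping in mind when walking siblings: the result can be `None`, and it can be a whitespace text node rather than a tag. The sketch below, on made-up markup, shows both cases along with `find_next_sibling`, which skips text nodes.

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup('<div><p id="a">A</p>\n<p id="b">B</p></div>', 'html.parser')
first = doc.find('p', id='a')

raw_next = first.next_sibling            # the '\n' text node, not the second <p>
next_tag = first.find_next_sibling('p')  # skips text nodes and returns the next <p>
no_previous = first.previous_sibling     # None: nothing comes before the first child

print(repr(raw_next), next_tag, no_previous)
```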

In [83]:
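# Spaces are descendant combinators, so this matches nothing (see the empty result);
# chain the classes to match one element: '.container.mx-auto.py-4.text-black'
elem = soup.select('.container mx-auto py-4 text-black')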
elem

[]
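The empty list above comes from how CSS selectors treat spaces: a space means "descendant of", so the selector looks for tags named `mx-auto`, `py-4`, and `text-black`. To match a single element carrying several classes, the classes are chained with dots. A sketch on a trimmed-down copy of the search box markup:

```python
from bs4 import BeautifulSoup

doc = BeautifulSoup(
    '<div class="container mx-auto py-4 text-black">'
    '<input id="searchfield" type="search"/></div>',
    'html.parser',
)

spaced = doc.select('.container mx-auto py-4 text-black')   # descendant selector: no match
chained = doc.select('.container.mx-auto.py-4.text-black')  # one element with all four classes
nested = doc.select('div.container input#searchfield')      # an actual descendant lookup
first_match = doc.select_one('.container.mx-auto')          # first match or None

print(spaced, len(chained), nested[0]['type'])
```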

### Example

Extracting a single day's weather forecast.

In [84]:
page = requests.get("https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Today
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Today: Sunny, with a high near 68. North wind around 6 mph. " class="forecast-icon" src="newimages/medium/few.png" title="Today: Sunny, with a high near 68. North wind around 6 mph. "/>
 </p>
 <p class="short-desc">
  Sunny
 </p>
 <p class="temp temp-high">
  High: 68 °F
 </p>
</div>


In [85]:
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()
print(period)
print(short_desc)
print(temp)

Today
Sunny
High: 68 °F
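`find()` returns `None` when nothing matches, so calling `.get_text()` directly on its result fails if a forecast card lacks one of the fields. A small guard can make the extraction more forgiving. This is a sketch only: the `text_or_none` helper and the `wind-speed` class are hypothetical, and the markup is a shortened copy of the card printed above.

```python
from bs4 import BeautifulSoup

html = '''
<div class="tombstone-container">
 <p class="period-name">Today<br/><br/></p>
 <p class="short-desc">Sunny</p>
 <p class="temp temp-high">High: 68 °F</p>
</div>
'''
card = BeautifulSoup(html, 'html.parser').find(class_='tombstone-container')


def text_or_none(parent, class_name):
    """Return the stripped text of the first match, or None if nothing matches."""
    match = parent.find(class_=class_name)
    return match.get_text(strip=True) if match is not None else None


period = text_or_none(card, 'period-name')
short_desc = text_or_none(card, 'short-desc')
temp = text_or_none(card, 'temp')
missing = text_or_none(card, 'wind-speed')  # hypothetical class that is not in the card

print(period, short_desc, temp, missing)
```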


In [86]:
img = tonight.find("img")
desc = img['title']
print(desc)

Today: Sunny, with a high near 68. North wind around 6 mph. 


Extracting all the information from the page

In [87]:
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

['Today',
 'Tonight',
 'Tuesday',
 'TuesdayNight',
 'Wednesday',
 'WednesdayNight',
 'Thursday',
 'ThursdayNight',
 'Friday']

In [88]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
print(short_descs)
print(temps)
print(descs)

['Sunny', 'Mostly Clear', 'Sunny', 'Mostly Clear', 'Sunny', 'Clear', 'Sunny', 'Mostly Clear', 'Sunny']
['High: 68 °F', 'Low: 47 °F', 'High: 70 °F', 'Low: 49 °F', 'High: 74 °F', 'Low: 51 °F', 'High: 74 °F', 'Low: 50 °F', 'High: 73 °F']
['Today: Sunny, with a high near 68. North wind around 6 mph. ', 'Tonight: Mostly clear, with a low around 47. North wind 5 to 10 mph. ', 'Tuesday: Sunny, with a high near 70. North northwest wind 6 to 9 mph. ', 'Tuesday Night: Mostly clear, with a low around 49. West wind 5 to 7 mph becoming calm  in the evening. ', 'Wednesday: Sunny, with a high near 74. North northeast wind 5 to 9 mph. ', 'Wednesday Night: Clear, with a low around 51.', 'Thursday: Sunny, with a high near 74.', 'Thursday Night: Mostly clear, with a low around 50.', 'Friday: Sunny, with a high near 73.']


In [89]:
import pandas as pd
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc":descs
})
weather

| | period | short_desc | temp | desc |
|---|---|---|---|---|
| 0 | Today | Sunny | High: 68 °F | Today: Sunny, with a high near 68. North wind ... |
| 1 | Tonight | Mostly Clear | Low: 47 °F | Tonight: Mostly clear, with a low around 47. N... |
| 2 | Tuesday | Sunny | High: 70 °F | Tuesday: Sunny, with a high near 70. North nor... |
| 3 | TuesdayNight | Mostly Clear | Low: 49 °F | Tuesday Night: Mostly clear, with a low around... |
| 4 | Wednesday | Sunny | High: 74 °F | Wednesday: Sunny, with a high near 74. North n... |
| 5 | WednesdayNight | Clear | Low: 51 °F | Wednesday Night: Clear, with a low around 51. |
| 6 | Thursday | Sunny | High: 74 °F | Thursday: Sunny, with a high near 74. |
| 7 | ThursdayNight | Mostly Clear | Low: 50 °F | Thursday Night: Mostly clear, with a low aroun... |
| 8 | Friday | Sunny | High: 73 °F | Friday: Sunny, with a high near 73. |
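The `temp` column holds labels such as `High: 68 °F`, which are awkward to sort or average. One possible follow-up is to split each label into its kind and a numeric value with a regular expression. The `parse_temp` helper below is an illustrative assumption, not part of the original notebook, and it uses a few of the labels shown above as sample input.

```python
import re

temps = ['High: 68 °F', 'Low: 47 °F', 'High: 70 °F']


def parse_temp(label):
    """Split a label such as 'High: 68 °F' into ('High', 68); return None if it does not fit."""
    match = re.match(r'(High|Low):\s*(-?\d+)', label)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


parsed = [parse_temp(t) for t in temps]
print(parsed)
```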



Scraping a website and printing its quotes

In [90]:
import requests
from bs4 import BeautifulSoup

URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib')

# The container div that holds every quote card
table = soup.find('div', attrs={'id': 'all_quotes'})

quotes = set()  # a set to store unique quotes
for row in table.find_all('div',
                          attrs={'class': 'col-6 col-lg-3 text-center margin-30px-bottom sm-margin-30px-top'}):
    # The quote text is in the image's alt attribute, before the " #" suffix
    quote = row.img['alt'].split(" #")[0]
    quotes.add(quote)

for quote in quotes:
    print(quote)

Bad things do happen; how I respond to them defines my character and the quality of my life. I can choose to sit in perpetual sadness... or I can choose to rise from the pain and treasure the most precious gift I have—life itself.
Having somewhere to go is home, having someone to love is family, having both is blessing.
Failure cannot cope with persistence.
Show me someone who has done something worthwhile, and I’ll show you someone who has overcome adversity.
To persist with a goal, you must treasure the dream more than the costs of sacrifice to attain it.
Do your little bit of good where you are; its those little bits of good put together that overwhelm the world.
At 211 degrees, water is hot. At 212 degrees, it boils. And with boiling water, comes steam. And with steam, you can power a train. 
You keep putting one foot in front of the other, and then one day you look back and you've climbed a mountain.
The creative is the place where no one else has ever been. You have to leave the 