# Beautiful Soup

## Objective: To learn Web Scraping using Beautiful Soup

#### Here, I scraped the [Beautiful Soup Tutorials Point webpage](https://www.tutorialspoint.com/beautiful_soup/index.htm) and used it to extract details, soup the page, navigate by tags, search the tree, and modify the tree.

In [1]:
from bs4 import BeautifulSoup
import requests

In [2]:
url = 'https://www.tutorialspoint.com/beautiful_soup/index.htm'
req = requests.get(url)

# Parse the response text with the BeautifulSoup constructor, using Python's built-in html.parser
soup = BeautifulSoup(req.text, 'html.parser')

In [3]:
print(soup.title)

<title>Beautiful Soup Tutorial - Tutorialspoint</title>


In [4]:
# Dot access on the soup looks up a tag by name; there is no <href> tag, so this prints None
print(soup.href)

None


In [7]:
for link in soup.find_all('a'):
    print(link.get('href'))

https://www.tutorialspoint.com/index.htm
https://www.tutorialspoint.com/about/about_careers.htm
https://www.tutorialspoint.com/online_dev_tools.htm
https://www.tutorialspoint.com/codingground.htm
https://www.tutorialspoint.com/current_affairs.htm
https://www.tutorialspoint.com/upsc_ias_exams.htm
https://www.tutorialspoint.com/tutor_connect/index.php
https://www.tutorialspoint.com/whiteboard.htm
https://www.tutorialspoint.com/netmeeting.php
https://www.tutorix.com
/videotutorials/login.php
/videotutorials/subscription.php
https://www.facebook.com/tutorialspointindia
https://www.twitter.com/tutorialspoint
https://www.linkedin.com/company/tutorialspoint
https://www.youtube.com/channel/UCVLbzhxVTiTLiVKeGV7WEBg
https://www.tutorialspoint.com/index.htm
None
/academic_tutorials.htm
/big_data_tutorials.htm
/computer_programming_tutorials.htm
/computer_science_tutorials.htm
/database_tutorials.htm
/devops_tutorials.htm
/digital_marketing_tutorials.htm
/engineering_tutorials.htm
/upsc_ias_exams.
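
The list above mixes absolute URLs, site-relative paths, and a `None` for an anchor without an `href`. A minimal sketch of one way to normalise such a list, assuming a small made-up snippet (`sample_html`) in place of the live page and using `urljoin` from the standard library; the variable names are illustrative only:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Base URL of the page the links were scraped from
base_url = "https://www.tutorialspoint.com/beautiful_soup/index.htm"

# Hypothetical sample markup standing in for the real page
sample_html = """
<a href="https://www.tutorix.com">Tutorix</a>
<a href="/videotutorials/login.php">Login</a>
<a>Anchor without an href</a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute,
# so no None values reach urljoin
absolute_links = [
    urljoin(base_url, a["href"])
    for a in sample_soup.find_all("a", href=True)
]

for link in absolute_links:
    print(link)
```

Filtering with `href=True` drops anchors that have no link at all, and `urljoin` leaves already-absolute URLs unchanged while resolving relative paths against the page URL.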

In [15]:
# Souping the Page
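with open("./Beautiful Soup Tutorial - Tutorialspoint.html", encoding="utf-8") as fp:
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(fp, "html.parser")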
print(soup)

<!DOCTYPE html>
<title>Beautiful Soup Tutorial - Tutorialspoint</title>
<meta content="Beautiful Soup Tutorial - In this tutorial, we will show you, how to perform web scraping in Python using Beautiful Soup 4 for getting data out of HTML, XML and other markup languages. I" name="description"/>
<meta content="C, C++, Python, Java, HTML, CSS, JavaScript, SQL, PHP, jQuery, XML, DOM, Bootstrap, Tutorials, Articles, Programming, training, learning, quiz, preferences, examples, code" name="keywords"/>
<link href="https://www.tutorialspoint.com/beautiful_soup/index.htm" rel="canonical"/>
<link href="https://www.tutorialspoint.com/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<meta content="width=device-width,initial-scale=1.0,user-scalable=yes" name="viewport"/>
<link href="./Beautiful Soup Tutorial - Tutorialspoint_files/style-min-v1.css" rel="stylesheet"/>
<script async="" src="./Beautiful Soup Tutorial - Tutorialspoint_files/analytics.js.download" type="text/javascript"></script>

In [16]:
print(soup.title)

<title>Beautiful Soup Tutorial - Tutorialspoint</title>


In [17]:
print(soup.header)

<header id="header">
<!-- pop-up -->
<div class="pop-modal overlay-pop popdiv" style="display: none;">
<div class="modal-window small">
<span class="close" title="close">×</span>
<div class="pop-content">
</div>
<span class="msg"></span>
</div>
</div>
<div class="wrap_loader">
<div class="imgLoader"><img alt="Tutorialspoint" height="70" src="./Beautiful Soup Tutorial - Tutorialspoint_files/loader.gif" width="70"/></div>
</div>
<input id="vu" name="vu" type="hidden" value=""/>
<!-- pop-up -->
<!-- Top sub-menu Starts Here -->
<div class="mui-appbar mui-container-fulid top-menu">
<div class="mui-container">
<div class="top-menu-item home">
<a href="https://www.tutorialspoint.com/index.htm" target="_blank" title="TutorialsPoint - Home"><i class="fal fa-home"></i> <span>Home</span></a>
</div>
<div class="top-menu-item qa">
<a href="https://www.tutorialspoint.com/about/about_careers.htm" target="_blank" title="Job @ Tutorials Point"><i class="fa fa-suitcase"></i> <span>Jobs</span></a>
</di

In [19]:
print(soup.body.p)

<p>In this tutorial, we will show you, how to perform web scraping in Python using Beautiful Soup 4 for getting data out of HTML, XML and other markup languages. In this we will try to scrap webpage from various different websites (including IMDB). We will cover beautiful soup 4, python basic tools for efficiently and clearly navigating, searching and parsing HTML web page. We have tried to cover almost all the functionalities of Beautiful Soup 4 in this tutorial. You can combine multiple functionalities introduced in this tutorial into one bigger program to capture multiple meaningful data from the website into some other sub-program as input.</p>


In [20]:
# Kinds of Objects
tag = soup.html
type(tag)

bs4.element.Tag

In [21]:
## Name (tag.name)
tag.name

'html'

In [23]:
tag.name = 'Strong'
tag

<title>Beautiful Soup Tutorial - Tutorialspoint</title>
<meta content="Beautiful Soup Tutorial - In this tutorial, we will show you, how to perform web scraping in Python using Beautiful Soup 4 for getting data out of HTML, XML and other markup languages. I" name="description"/>
<meta content="C, C++, Python, Java, HTML, CSS, JavaScript, SQL, PHP, jQuery, XML, DOM, Bootstrap, Tutorials, Articles, Programming, training, learning, quiz, preferences, examples, code" name="keywords"/>
<link href="https://www.tutorialspoint.com/beautiful_soup/index.htm" rel="canonical"/>
<link href="https://www.tutorialspoint.com/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<meta content="width=device-width,initial-scale=1.0,user-scalable=yes" name="viewport"/>
<link href="./Beautiful Soup Tutorial - Tutorialspoint_files/style-min-v1.css" rel="stylesheet"/>
<script async="" src="./Beautiful Soup Tutorial - Tutorialspoint_files/analytics.js.download" type="text/javascript"></script><script async=""

In [26]:
# Attributes (tag.attrs)
attribute1 = BeautifulSoup('<div class="TutorialsP"></div>', 'lxml')
tag2 = attribute1.div
print(tag2['class'])

['TutorialsP']


In [27]:
tag2['class'] = 'Online Learning'
tag2['style'] = 'background-color:red'
print(tag2)

<div class="Online Learning" style="background-color:red"></div>


In [28]:
del tag2['style']

In [29]:
print(tag2)

<div class="Online Learning"></div>


In [33]:
print(soup.p.prettify())

<p>
 In this tutorial, we will show you, how to perform web scraping in Python using Beautiful Soup 4 for getting data out of HTML, XML and other markup languages. In this we will try to scrap webpage from various different websites (including IMDB). We will cover beautiful soup 4, python basic tools for efficiently and clearly navigating, searching and parsing HTML web page. We have tried to cover almost all the functionalities of Beautiful Soup 4 in this tutorial. You can combine multiple functionalities introduced in this tutorial into one bigger program to capture multiple meaningful data from the website into some other sub-program as input.
</p>



In [34]:
print(attribute1.prettify())

<html>
 <body>
  <div class="Online Learning">
  </div>
 </body>
</html>


In [35]:
print(attribute1.div.prettify())

<div class="Online Learning">
</div>


In [36]:
# Navigating by Tags
html_doc = """
<html><head><title>Tutorials Point</title></head>
<body>
<p class="title"><b>The Biggest Online Tutorials Library, It's all Free</b></p>
<p class="prog">Top 5 most used Programming Languages are:
<a href="https://www.tutorialspoint.com/java/java_overview.htm" class="prog" id="link1">Java</a>,
<a href="https://www.tutorialspoint.com/cprogramming/index.htm" class="prog" id="link2">C</a>,
<a href="https://www.tutorialspoint.com/python/index.htm" class="prog" id="link3">Python</a>,
<a href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" class="prog" id="link4">JavaScript</a> and
<a href="https://www.tutorialspoint.com/ruby/index.htm" class="prog" id="link5">C</a>;
as per online survey.</p>
<p class="prog">Programming Languages</p>
"""

In [37]:
soup = BeautifulSoup(html_doc, 'html.parser')

In [41]:
soup.html.prettify()

'<html>\n <head>\n  <title>\n   Tutorials Point\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The Biggest Online Tutorials Library, It\'s all Free\n   </b>\n  </p>\n  <p class="prog">\n   Top 5 most used Programming Languages are:\n   <a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">\n    Java\n   </a>\n   ,\n   <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">\n    C\n   </a>\n   ,\n   <a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">\n    Python\n   </a>\n   ,\n   <a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">\n    JavaScript\n   </a>\n   and\n   <a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">\n    C\n   </a>\n   ;\nas per online survey.\n  </p>\n  <p class="prog">\n   Programming Languages\n  </p>\n </body>\n</html>'

In [42]:
print(soup.html.prettify())

<html>
 <head>
  <title>
   Tutorials Point
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Biggest Online Tutorials Library, It's all Free
   </b>
  </p>
  <p class="prog">
   Top 5 most used Programming Languages are:
   <a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">
    Java
   </a>
   ,
   <a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">
    C
   </a>
   ,
   <a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">
    Python
   </a>
   ,
   <a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">
    JavaScript
   </a>
   and
   <a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">
    C
   </a>
   ;
as per online survey.
  </p>
  <p class="prog">
   Programming Languages
  </p>
 </body>
</html>


In [43]:
print(soup.a)

<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>


In [44]:
Htag = soup.head
print(Htag.contents)

[<title>Tutorials Point</title>]


In [46]:
Ttag = Htag.contents[0]
print(Ttag)

<title>Tutorials Point</title>


In [47]:
print(Ttag.contents)

['Tutorials Point']


In [50]:
print(len(Ttag))

1


In [51]:
print(len(Ttag.contents[0]))

15


In [53]:
text = Ttag.contents[0]
print(Ttag.contents)

['Tutorials Point']


In [55]:
for child in Htag.children:
    print(child)

<title>Tutorials Point</title>


In [56]:
for child in Ttag.children:
    print(child)

Tutorials Point


In [57]:
# descendants
for child in Htag.descendants:
    print(child)

<title>Tutorials Point</title>
Tutorials Point


In [59]:
print(len(list(soup.children)))


2


In [60]:
print(len(list(soup.descendants)))

33


In [61]:
# Strings
print(Ttag.string)

Tutorials Point


In [63]:
for string in soup.strings:
    print(repr(string))

'\n'
'Tutorials Point'
'\n'
'\n'
"The Biggest Online Tutorials Library, It's all Free"
'\n'
'Top 5 most used Programming Languages are:\n'
'Java'
',\n'
'C'
',\n'
'Python'
',\n'
'JavaScript'
' and\n'
'C'
';\nas per online survey.'
'\n'
'Programming Languages'
'\n'


In [64]:
# Searching the Tree
print(soup.find_all('p'))

[<p class="title"><b>The Biggest Online Tutorials Library, It's all Free</b></p>, <p class="prog">Top 5 most used Programming Languages are:
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>,
<a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>,
<a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>,
<a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a> and
<a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>;
as per online survey.</p>, <p class="prog">Programming Languages</p>]


In [65]:
print(soup.find_all(True))

[<html><head><title>Tutorials Point</title></head>
<body>
<p class="title"><b>The Biggest Online Tutorials Library, It's all Free</b></p>
<p class="prog">Top 5 most used Programming Languages are:
<a class="prog" href="https://www.tutorialspoint.com/java/java_overview.htm" id="link1">Java</a>,
<a class="prog" href="https://www.tutorialspoint.com/cprogramming/index.htm" id="link2">C</a>,
<a class="prog" href="https://www.tutorialspoint.com/python/index.htm" id="link3">Python</a>,
<a class="prog" href="https://www.tutorialspoint.com/javascript/javascript_overview.htm" id="link4">JavaScript</a> and
<a class="prog" href="https://www.tutorialspoint.com/ruby/index.htm" id="link5">C</a>;
as per online survey.</p>
<p class="prog">Programming Languages</p>
</body></html>, <head><title>Tutorials Point</title></head>, <title>Tutorials Point</title>, <body>
<p class="title"><b>The Biggest Online Tutorials Library, It's all Free</b></p>
<p class="prog">Top 5 most used Programming Languages are:
<a 

In [66]:
for tag in soup.find_all(True):
    print(tag.name)

html
head
title
body
p
b
p
a
a
a
a
a
p


Every HTML or XML document is written in a specific encoding, such as ASCII or UTF-8. When you load the document into Beautiful Soup, however, it is converted to Unicode.
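
The conversion to Unicode can be checked with a small sketch. It assumes a made-up byte string (`raw_bytes`) rather than the Tutorialspoint page, and passes `from_encoding` only to make the source encoding explicit; without it, Beautiful Soup tries to detect the encoding itself.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, encoded to bytes the way a file or HTTP body would be
raw_bytes = "<p>Café × Soup</p>".encode("utf-8")

# from_encoding tells Beautiful Soup which encoding the bytes use
byte_soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="utf-8")

decoded_text = byte_soup.p.string
print(isinstance(decoded_text, str))  # strings in the tree are Unicode str objects
print(decoded_text)
print(byte_soup.original_encoding)    # the encoding the input was decoded from
```

Whatever encoding the input bytes use, the strings in the parsed tree are ordinary Python `str` objects, so non-ASCII characters such as `é` and `×` survive intact as long as the bytes are decoded with the right encoding.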

In [67]:
# Comparing Objects for Equality
markup = "<p>Learn Python and <b>Java</b> and advanced <b>Java</b>! from Tutorialspoint</p>"
soup = BeautifulSoup(markup, "html.parser")
first_b, second_b = soup.find_all('b')


In [69]:
print(first_b == second_b)

True


In [70]:
print(first_b.previous_element == second_b.previous_element)

False


In [71]:
print(first_b is second_b)

False


---

The remaining topic, SoupStrainer, is not covered here.
## Thank you!