# <u>BeautifulSoup</u>
***

## INTRODUCTION TO HTML
***
Basic HTML document layout:
<img src="https://ischool.syr.edu/infospace/wp-content/files/2012/03/htmlintro.png" />


In [1]:
# Open the HTML layout demo in a new browser tab
import webbrowser
webbrowser.open_new_tab('htmllayout.html')

True

# Classes and IDs in HTML
***
The HTML <code>class</code> attribute assigns the same styles to every element that shares a class name, so all elements with the same class get the same format and style.
An <code>id</code>, by contrast, should identify a single element on the page.


Blog post: <a href="https://css-tricks.com/the-difference-between-id-and-class/">The difference between classes and IDs</a>
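
The class/id distinction matters later when searching a page. Here is a minimal sketch, assuming a made-up three-paragraph snippet (the names `demo`, `notes`, and `summary` are illustrative only): a class can match several elements, while an id is expected to match one.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two elements share a class, one element has an id.
demo = BeautifulSoup(
    """
    <p class="note">First note</p>
    <p class="note">Second note</p>
    <p id="summary">Summary</p>
    """,
    "html.parser",
)

notes = demo.find_all(class_="note")   # every element with class="note"
summary = demo.find(id="summary")      # the single element with id="summary"

print(len(notes))
print(summary.get_text())
```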

In [2]:
# Open cl.html in a new browser tab
webbrowser.open_new_tab('cl.html')

True

In [None]:
# Open The Numbers daily box office chart in a new browser tab
webbrowser.open_new_tab("https://www.the-numbers.com/daily-box-office-chart")

In [3]:
from bs4 import BeautifulSoup
import requests

url = "https://www.the-numbers.com/daily-box-office-chart"
req = requests.get(url)
print(BeautifulSoup(req.content,'html.parser').prettify())

<!DOCTYPE html>
<html>
 <head>
  <!-- Global site tag (gtag.js) - Google Analytics -->
  <script async="" src="https://www.googletagmanager.com/gtag/js?id=UA-1343128-1">
  </script>
  <script>
   window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'UA-1343128-1');
  </script>
  <meta content='(PICS-1.1 "https://www.icra.org/ratingsv02.html" l gen true for "https://www.the-numbers.com/" r (cb 1 lz 1 nz 1 oz 1 vz 1) "https://www.rsac.org/ratingsv01.html" l gen true for "https://www.the-numbers.com/" r (n 0 s 0 v 0 l 0))' http-equiv="PICS-Label"/>
  <!--<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" >-->
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="telephone=no" name="format-detection"/>
  <!-- for apple mobile -->
  <meta content="521546213" property="fb:admins">
   <meta content="initial-scale=1" name="viewport"/>
   <meta content="Dail

# Beautiful Soup

Beautiful Soup is a Python package for parsing HTML and XML documents. It builds a parse tree from each page, which can then be used to extract data, making it useful for web scraping.

## Installation
<code>pip install beautifulsoup4</code>

## Let's Begin
***

### BeautifulSoup(HTML_Source , parser)

<code>soup = BeautifulSoup(markup,'html.parser')</code>


### Parsers

<b>html.parser</b> - <code>BeautifulSoup(markup, "html.parser")</code>

Advantages: Batteries included, decent speed, lenient (as of Python 2.7.3 and 3.2.2)

Disadvantages: Not very lenient (before Python 2.7.3 or 3.2.2)

<b>lxml</b> - <code>BeautifulSoup(markup, "lxml")</code>

Advantages: Very fast, Lenient

Disadvantages: External C dependency

<b>html5lib</b> - <code>BeautifulSoup(markup, "html5lib")</code>

Advantages: Extremely lenient, Parses pages the same way a web browser does, Creates valid HTML5

Disadvantages: Very slow, External Python dependency
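
The parsers differ mainly in how they repair invalid markup. The sketch below feeds the same deliberately broken fragment to each parser and prints what comes back. It assumes `lxml` and `html5lib` may not be installed, so a missing parser is reported rather than treated as an error; the variable names are illustrative.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately malformed markup: the closing tag does not match the opening tag.
broken = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # lxml and html5lib are optional third-party packages.
        results[parser] = None

for parser, rendered in results.items():
    shown = rendered if rendered is not None else "not installed"
    print(f"{parser:12} -> {shown}")
```

Running this on your own machine is the quickest way to see how each installed parser repairs the fragment.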

##### What is returned to soup?

BeautifulSoup() returns a navigable tree:

<img src="https://cdn-images-1.medium.com/max/1600/0*ETFzXPCNHkPpqNv_.png"/>
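
To make the tree concrete, this small sketch inspects the kinds of objects it contains. It assumes a one-line fragment rather than a real page, and `doc` is just an illustrative name.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

doc = BeautifulSoup("<p class='title'><b>The Dormouse's story</b></p>", "html.parser")

# The root object represents the whole document.
print(type(doc).__name__)

# Elements inside the document are Tag objects.
print(isinstance(doc.p, Tag))

# Text inside an element is a NavigableString.
print(isinstance(doc.b.string, NavigableString))
```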

In [4]:
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title>
</head><body><p class="title">
<b>The Dormouse's story</b></p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">
Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
; and they lived at the bottom of a well.
</p> </body> </html>"""

soup = BeautifulSoup(html_doc,'html.parser')

In [5]:
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
 </body>
</html>


# Some Important Concepts
***
### Tags:
A Tag object corresponds to an XML or HTML tag in the original document.

<b><code>p</code>,<code>b</code>,<code>a</code>,<code>head</code>,<code>title</code></b> and many more.


To access a tag, use <code>soup.tag_name</code>, which returns the first tag with that name:
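
One detail worth knowing before relying on attribute-style access: if no tag with that name exists, the result is `None` rather than an error. A minimal sketch, assuming a made-up fragment with two links (`snippet` and `first_link` are illustrative names):

```python
from bs4 import BeautifulSoup

snippet = BeautifulSoup(
    '<p><a href="/one" id="first">One</a><a href="/two">Two</a></p>',
    "html.parser",
)

first_link = snippet.a           # only the first <a> is returned
print(first_link.name)
print(first_link["href"])

missing = snippet.table          # no <table> in the fragment
print(missing)

# .get() returns None for a missing attribute; first_link["title"] would raise KeyError.
print(first_link.get("title"))
```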

In [6]:
tag = soup.a
tag

<a class="sister" href="http://example.com/elsie" id="link1">
Elsie</a>

In [7]:
print("Getting attributes : ",tag['href'],tag['class'],tag['id'],tag.contents,tag.string,sep='\n')

Getting attributes : 
http://example.com/elsie
['sister']
link1
['\nElsie']

Elsie


In [8]:
print("Navigating the Tree\n\n")

print("soup.head :\n",soup.head)
print("\nsoup.title :\n",soup.title)

# soup.a returns only the first <a> tag; soup('a') is shorthand for soup.find_all('a') and returns every match
print("\nsoup.a :\n",soup.a)
print("\nsoup('a') :\n",soup('a'))
print("\nsoup.find_all('a') :\n",soup.find_all('a'))

print("\nsoup.p :\n",soup.p)
print("\nsoup.get_text() :\n",soup.get_text())

Navigating the Tree


soup.head :
 <head><title>The Dormouse's story</title>
</head>

soup.title :
 <title>The Dormouse's story</title>

soup.a :
 <a class="sister" href="http://example.com/elsie" id="link1">
Elsie</a>

soup('a') :
 [<a class="sister" href="http://example.com/elsie" id="link1">
Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">
Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all('a') :
 [<a class="sister" href="http://example.com/elsie" id="link1">
Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">
Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.p :
 <p class="title">
<b>The Dormouse's story</b></p>

soup.get_text() :
 
The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were

In [9]:
print(soup.a.name)
print(soup.a.contents)
print(soup.a.string)

a
['\nElsie']

Elsie


## Parents and Children

In [10]:
block = soup.a

for elem in block.parents:
    print(elem.name)

p
body
html
[document]


In [42]:
for elem in soup.body.children:
    print(elem)

<p class="title">
<b>The Dormouse's story</b></p>


<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">
Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
; and they lived at the bottom of a well.
</p>
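
The blank lines in the output above are whitespace text nodes, which count as children too. The sketch below, assuming a small made-up fragment, contrasts `.children` (direct children only) with `.descendants` (everything nested below) and filters out the text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

tree = BeautifulSoup("<body><p>One <b>bold</b></p>\n<p>Two</p></body>", "html.parser")

# Direct children include the newline string between the two paragraphs.
direct = list(tree.body.children)
print(len(direct))

# Keep only Tag objects to skip text nodes.
child_tags = [node.name for node in direct if isinstance(node, Tag)]
print(child_tags)

# .descendants walks the whole subtree, so the nested <b> shows up as well.
all_tags = [node.name for node in tree.body.descendants if isinstance(node, Tag)]
print(all_tags)
```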
 


## Find

In [12]:
soup.find('b')

<b>The Dormouse's story</b>

In [13]:
soup.find(id="link3")

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

In [14]:
soup.find(class_="story")

<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">
Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
; and they lived at the bottom of a well.
</p>

### Next/Previous Siblings

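<code>.next_sibling</code> and <code>.previous_sibling</code> point to the next or previous node at the same level of the tree, that is, nodes that share the same parent. Text between tags counts as a node, so a sibling may be a string rather than a tag.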

In [15]:
elem = soup.a
elem

<a class="sister" href="http://example.com/elsie" id="link1">
Elsie</a>

In [16]:
elem.next_sibling

','

In [17]:
elem.next_sibling.next_sibling

<a class="sister" href="http://example.com/lacie" id="link2">
Lacie</a>
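
Chaining `.next_sibling` twice works here but depends on exactly one text node sitting between the links. As a hedged alternative, `find_next_sibling()` skips straight to the next matching tag. The fragment and variable names below are illustrative.

```python
from bs4 import BeautifulSoup

story = BeautifulSoup(
    '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a> and\n<a id="link3">Tillie</a></p>',
    "html.parser",
)

first = story.a

# The immediate sibling is the text between the two links, not a tag.
print(repr(first.next_sibling))

# find_next_sibling("a") ignores text nodes and returns the next <a>.
second = first.find_next_sibling("a")
print(second["id"])

# find_next_siblings("a") returns every later <a> at the same level.
later_ids = [a["id"] for a in first.find_next_siblings("a")]
print(later_ids)
```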

### Next/Previous Element

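<code>.next_element</code> and <code>.previous_element</code> point to whatever was parsed immediately after or before the current node, regardless of level. The next element is often a child, such as the tag's own text.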

In [18]:
elem = soup.a
elem

<a class="sister" href="http://example.com/elsie" id="link1">
Elsie</a>

In [19]:
elem.next_element

'\nElsie'

In [20]:
elem.next_element.next_element

','

In [21]:
elem.next_element.next_element.next_element

<a class="sister" href="http://example.com/lacie" id="link2">
Lacie</a>

In [22]:
elem.previous_element

'\nOnce upon a time there were three little sisters; and their names were\n'
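
The difference from siblings is easiest to see side by side. A minimal sketch on a made-up two-tag fragment (`frag` and `bold` are illustrative names):

```python
from bs4 import BeautifulSoup

frag = BeautifulSoup("<p><b>one</b><i>two</i></p>", "html.parser")
bold = frag.b

# Sibling: the next node with the same parent, here the <i> tag.
print(repr(bold.next_sibling))

# Element: the next thing parsed, here the text inside <b> itself.
print(repr(bold.next_element))

# Stepping once more from that text reaches the <i> tag.
print(repr(bold.next_element.next_element))
```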

### Searching in the Tree
Beautiful Soup defines many methods for searching the parse tree, but they are all very similar. The two most popular are <code>find()</code>, which returns the first match, and <code>find_all()</code>, which returns every match.
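
Before the examples, a short sketch of how the two methods behave, including the no-match case. It assumes a made-up fragment with three links; all variable names are illustrative.

```python
from bs4 import BeautifulSoup

page = BeautifulSoup(
    '<p><a href="/elsie">Elsie</a> <a href="/lacie">Lacie</a> <a href="/tillie">Tillie</a></p>',
    "html.parser",
)

first_link = page.find("a")               # a single Tag
all_links = page.find_all("a")            # a list-like ResultSet of Tags
two_links = page.find_all("a", limit=2)   # stop after two matches

no_table = page.find("table")             # None when nothing matches
no_tables = page.find_all("table")        # empty result when nothing matches

print(first_link["href"], len(all_links), len(two_links))
print(no_table, len(no_tables))
```

Because `find()` can return `None`, it is worth checking the result before reading attributes from it.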

In [23]:
soup.find_all('p')

[<p class="title">
 <b>The Dormouse's story</b></p>, <p class="story">
 Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">
 Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">
 Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
 ; and they lived at the bottom of a well.
 </p>]

In [25]:
soup.find_all(["a", "b"])

[<b>The Dormouse's story</b>,
 <a class="sister" href="http://example.com/elsie" id="link1">
 Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">
 Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [26]:
for sp in soup.find_all(True):
    print(sp)

<html><head><title>The Dormouse's story</title>
</head><body><p class="title">
<b>The Dormouse's story</b></p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">
Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
; and they lived at the bottom of a well.
</p> </body> </html>
<head><title>The Dormouse's story</title>
</head>
<title>The Dormouse's story</title>
<body><p class="title">
<b>The Dormouse's story</b></p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">
Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
; and they lived at the bottom of a well.
</p> </body>
<p class=

In [27]:
for sp in soup.find_all(True):
    print(sp.name)

html
head
title
body
p
b
p
a
a
a


### Search via CSS selectors 
<table style="width:100% ; font-size:18px">
<tbody><tr>
    <th>Selector</th>
    <th>Example</th>
    <th>Example description</th>
  </tr>
  <tr>
    <td>.<i>class</i></td>
    <td class="notranslate">.intro</td>
    <td>Selects all elements with class="intro"</td>
  </tr>
  <tr>
    <td>#<i>id</i></td>
    <td class="notranslate">#firstname</td>
    <td>Selects the element with id="firstname"</td>
  </tr> 
    </tbody>
    </table>

In [55]:
css_soup = BeautifulSoup(
'''
<p class="body strikeout"></p>
<a id = "link" class = "my"></a>
'''
,'html.parser')

# select() takes a CSS selector: ".strikeout" matches by class and "#link" matches by id.
# To match tags that carry two or more classes at once, chain them, e.g. ".body.strikeout".
print(css_soup.select(".strikeout"))
print(css_soup.select("#link"))

[<p class="body strikeout"></p>]
[<a class="my" id="link"></a>]
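
To match tags that carry two or more classes at once, the class selectors can be chained with no space between them. A minimal sketch, assuming three made-up paragraphs with different class combinations:

```python
from bs4 import BeautifulSoup

classes_soup = BeautifulSoup(
    '<p class="body strikeout"></p><p class="body"></p><p class="strikeout"></p>',
    "html.parser",
)

# Chained classes: the tag must have BOTH classes.
both = classes_soup.select("p.body.strikeout")
print(both)

# Comma-separated selectors: the tag may match EITHER selector.
either = classes_soup.select("p.body, p.strikeout")
print(len(either))
```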


In [30]:
soup.select("head > title")

[<title>The Dormouse's story</title>]

In [31]:
soup.select("p > #link1")

[<a class="sister" href="http://example.com/elsie" id="link1">
 Elsie</a>]

In [32]:
soup.select("a#link1")

[<a class="sister" href="http://example.com/elsie" id="link1">
 Elsie</a>]

In [33]:
soup.select_one(".sister")

<a class="sister" href="http://example.com/elsie" id="link1">
Elsie</a>

### Some other examples of Searching

In [34]:
soup.find(class_ = "sister" , id='link1')

<a class="sister" href="http://example.com/elsie" id="link1">
Elsie</a>

In [35]:
soup.find_all('a',{'class':'sister'})

[<a class="sister" href="http://example.com/elsie" id="link1">
 Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">
 Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [36]:
soup.find_all('a',{'class':'sister' , 'id' : 'link3'})

[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [37]:
#Find tags that match any selector from a list of selectors:
soup.select("#link1,#link2")

[<a class="sister" href="http://example.com/elsie" id="link1">
 Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">
 Lacie</a>]

In [38]:
#Test for the existence of an attribute:
soup.select('a[href]')

[<a class="sister" href="http://example.com/elsie" id="link1">
 Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">
 Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [39]:
soup.select('a[href="http://example.com/elsie"]')

[<a class="sister" href="http://example.com/elsie" id="link1">
 Elsie</a>]

In [40]:
# A common task is extracting all the links from a page
print("\nUsing Get to extract links:")
for link in soup.find_all('a'):
    print(link.get('href'))


Using Get to extract links:
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
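
On real pages many links are relative, and some `<a>` tags have no `href` at all. The sketch below is one way to handle both, assuming a hypothetical base URL and a made-up fragment: `href=True` keeps only tags that have the attribute, and `urljoin` from the standard library resolves each link against the base.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL, standing in for the address the page was fetched from.
base_url = "http://example.com/stories/"

listing = BeautifulSoup(
    '<a href="elsie">Elsie</a><a href="/lacie">Lacie</a><a name="anchor">no link</a>',
    "html.parser",
)

# href=True filters out <a> tags without an href attribute.
links = [urljoin(base_url, a["href"]) for a in listing.find_all("a", href=True)]

for link in links:
    print(link)
```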
