# <font color='blue'>Basic Beautiful Soup - DOM vs. Infinite Scroll</font>

### The Document Object Model (DOM) is a programming interface for HTML, XML and SVG documents. 

It provides a structured representation of the document as a tree. The DOM defines methods for accessing that tree, so that scripts can change the document's structure, style, and content.

<b> The DOM provides a representation of the document as a structured group of nodes and objects, possessing various properties and methods. Nodes can also have event handlers attached to them, and once an event is triggered, the event handlers get executed. </b>

Essentially, it connects web pages to scripts or programming languages.

<img src="http://www.openbookproject.net/tutorials/getdown/css/images/lesson4/HTMLDOMTree.png" width=700>

<img src="http://www.cs.toronto.edu/~shiva/cscb07/img/dom/treeStructure.png" width=700>
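
To make the tree idea concrete, here is a minimal sketch that walks a small document with Python's standard-library `xml.dom.minidom`, which implements a subset of the W3C DOM interface. The sample markup and the `walk` helper are assumptions made for illustration; they are not part of the examples that follow.

```python
from xml.dom import minidom

# minidom is an XML parser, so the sample must be well-formed (all tags closed).
doc = minidom.parseString(
    "<html><body><p class='greeting'>Hello World</p></body></html>"
)

def walk(node, depth=0):
    """Return one indented line per node, depth-first (hypothetical helper)."""
    if node.nodeType == node.TEXT_NODE:
        label = "#text: " + node.data
    else:
        label = node.nodeName
    lines = ["  " * depth + label]
    for child in node.childNodes:
        lines.extend(walk(child, depth + 1))
    return lines

tree_lines = walk(doc.documentElement)
print("\n".join(tree_lines))

# The DOM also lets a script change the document: here an attribute is edited.
p = doc.getElementsByTagName("p")[0]
p.setAttribute("class", "farewell")
print(p.getAttribute("class"))
```

Each element and each run of text is its own node, which is the same parent/child structure the diagrams above show.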

In [43]:
# Install BeautifulSoup
# !pip freeze | grep beautiful
# !pip install beautifulsoup4
# this might also be a good time to install Xpath Viewer on Chrome

In [1]:
from bs4 import BeautifulSoup

In [51]:
helloworld = "<p>Hello World</p>"
soup_string = BeautifulSoup(helloworld, features="lxml")
print(soup_string)

<html><body><p>Hello World</p></body></html>


In [52]:
print(soup_string.prettify())

<html>
 <body>
  <p>
   Hello World
  </p>
 </body>
</html>


In [53]:
soup_xml = BeautifulSoup(helloworld, features="xml")
print(soup_xml)

<?xml version="1.0" encoding="utf-8"?>
<p>Hello World</p>


In [54]:
print(soup_string.body)

<body><p>Hello World</p></body>


In [55]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [56]:

soup1 = BeautifulSoup(html_doc, 'html.parser')
soup2 = BeautifulSoup(html_doc, 'lxml')

# print(soup1.prettify())
soup1 == soup2

False

In [57]:
soup1


<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

In [58]:
soup2

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

In [59]:
soup = soup1
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


In [60]:
soup1.title
# <title>The Dormouse's story</title>

<title>The Dormouse's story</title>

In [61]:
soup1.title.name
# u'title'

u'title'

In [62]:
soup.title.text == soup.title.string  # True here; not displayed, since only the last expression is echoed
soup.title.text
# u"The Dormouse's story"

u"The Dormouse's story"

In [63]:
soup.title.string
# u"The Dormouse's story"

u"The Dormouse's story"

In [64]:
soup.title.parent.name
# u'head'

u'head'

In [65]:
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
for i in soup.find_all('p'):
    print(len(i.text), i.text)

20 The Dormouse's story
135 Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
3 ...


In [66]:
soup.p

<p class="title"><b>The Dormouse's story</b></p>

In [67]:
soup.p['class']
# [u'title']  (class is a multi-valued attribute, so a list is returned)

[u'title']
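
The list returned for `class` can be surprising, so here is a minimal sketch of how attribute access behaves. The `snippet` markup and the variable names are assumptions made up for this illustration, and `html.parser` is used only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document with a two-valued class and a single id.
snippet = '<p class="story intro" id="first">Once upon a time</p>'
soup_attrs = BeautifulSoup(snippet, "html.parser")
p_tag = soup_attrs.p

classes = p_tag["class"]       # multi-valued attribute: a list of strings
tag_id = p_tag["id"]           # single-valued attribute: a plain string
missing = p_tag.get("href")    # .get() returns None instead of raising KeyError

print(classes, tag_id, missing)
```

Using `.get()` is the safer choice when an attribute may be absent, since `p_tag["href"]` would raise a `KeyError`.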

In [68]:
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [69]:
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [31]:
soup.find(id="link3")['id']  # u'link3'; not displayed, since only the last expression is echoed
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
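
`find_all` and `find` accept filters beyond the tag name. The sketch below shows a few common ones on a shortened copy of the sisters markup; the extra `link4` anchor and the variable names are assumptions added only so that the class filter has something to exclude.

```python
from bs4 import BeautifulSoup

sisters_html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/other" class="other" id="link4">Other</a>
</p>
"""
soup_filters = BeautifulSoup(sisters_html, "html.parser")

# class_ (with a trailing underscore) filters on the CSS class.
sister_names = [a.get_text() for a in soup_filters.find_all("a", class_="sister")]

# limit caps the number of results returned by find_all.
first_only = soup_filters.find_all("a", limit=1)

# find() returns the first match itself, or None when nothing matches.
not_found = soup_filters.find(id="link99")

print(sister_names, first_only, not_found)
```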

In [33]:
html_atag = """<html><body><p>Test html a tag example</p>
<a href="http://www.packtpub.com">Home</a>
<a href="http://www.packtpub.com/books">Books</a>
</body>
</html>"""
soup  = BeautifulSoup(html_atag,"lxml")
atag = soup.a
print(atag)

<a href="http://www.packtpub.com">Home</a>


In [34]:
import re
# A plain regex matches every letter "a" in the text, not just the <a> tags
re.findall('a', html_atag)

['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a']

In [35]:
import re
# '=+' means one or more '=' characters, so this finds each literal "href="
re.findall(r'href=+', html_atag)

['href=', 'href=']
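
The two regex cells above match raw characters and know nothing about tags. As a contrast, here is a minimal sketch that pulls the link targets out with a real parser, using only the standard library's `html.parser.HTMLParser`. The `LinkCollector` class is a hypothetical helper written for this example, and `sample` is a copy of the test markup.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

sample = """<html><body><p>Test html a tag example</p>
<a href="http://www.packtpub.com">Home</a>
<a href="http://www.packtpub.com/books">Books</a>
</body>
</html>"""

collector = LinkCollector()
collector.feed(sample)
print(collector.links)
```

Beautiful Soup does the same job with far less code, which is what the `find_all('a')` loop in the next cell shows.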

In [33]:
for link in soup1.find_all('a'):
    print(link.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie


In [34]:
print(soup1.get_text())


The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...



In [35]:
print(soup1.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


In [36]:
BeautifulSoup("Sacr&eacute; bleu!", "lxml")

# <html><body><p>Sacré bleu!</p></body></html>  (lxml wraps the bare text in <p>)

<html><body><p>Sacr\xe9 bleu!</p></body></html>
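
The `\xe9` in the output is the Python 2 repr of `é`: the parser has converted the `&eacute;` entity into a Unicode character. The same conversion can be seen in isolation with the standard library's `html.unescape` (Python 3). This is a sketch of the entity-decoding step only, not of how Beautiful Soup implements it internally.

```python
import html

# Named entity reference -> Unicode character.
decoded = html.unescape("Sacr&eacute; bleu!")
print(decoded)

# A numeric character reference for the same code point decodes identically.
same = html.unescape("Sacr&#233; bleu!") == decoded
print(same)
```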

In [40]:
bs = BeautifulSoup('<b class="boldest">Extremely bold</b>', "lxml")
tag = bs.b
type(tag)  # bs4.element.Tag; not displayed, since only the last expression is echoed
bs.b

<b class="boldest">Extremely bold</b>

In [41]:
tag.name

'b'

In [42]:
tag.name = "blockquote"
tag.name

'blockquote'
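
Renaming a tag is one case of the DOM idea from the top of this notebook: the tree can be changed, and the change shows up when the document is printed again. Below is a minimal sketch of a few such edits; the `id` value and the replacement text are assumptions chosen for illustration, and `html.parser` is used so that no `<html>`/`<body>` wrapper is added.

```python
from bs4 import BeautifulSoup

bold_soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
b_tag = bold_soup.b

b_tag.name = "blockquote"    # rename the element
b_tag["id"] = "quote1"       # add an attribute
del b_tag["class"]           # remove an attribute
b_tag.string = "Less bold"   # replace the text content

result = str(bold_soup)
print(result)
```

After the rename, `bold_soup.b` no longer finds anything, because the tree now contains a `blockquote` element instead.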