### BeautifulSoup

BeautifulSoup is a Python library that helps you navigate an XML tree and extract attributes and values from its nodes.
> Remember that HTML documents are very similar to XML. Today we are going to assume they are the same, but if you're interested, you should read about the differences between XML, HTML, and XHTML.

In this part we will see how we can navigate the tree.

In [96]:
from pathlib import Path

from bs4 import BeautifulSoup

In [10]:
html_doc = Path('./data/sample-page.html')

In [120]:
html_doc = """
<html>
<body class="body">
    <span>
    <i id="qqyyzz-1" class="italic">Lorem ipsum-1</i>
    <i id="qqyyzz-2" class="italic">Lorem ipsum-2</i>
    </span>
</body>
</html>
"""
# removing some newline noise
html_doc = "".join(line.strip() for line in html_doc.split("\n"))
parser = BeautifulSoup(html_doc, 'html.parser')

<i class="italic" id="qqyyzz-1">Lorem ipsum-1</i>

### Navigating The Tree

#### Going Down Using Tag names

In [143]:
# You can go down by using the tag's name, e.g.
# to select the <body>
parser.body

<body class="body">
 <span>
  <i class="italic" id="qqyyzz-1">
   Lorem ipsum-1
  </i>
  <i class="italic" id="qqyyzz-2">
   Lorem ipsum-2
  </i>
 </span>
</body>


In [50]:
# You can use the same trick to navigate further down
# html > body > span
parser.body.span


<span>
<i class="italic" id="qqyyzz-1">Lorem ipsum-1</i>
<i class="italic" id="qqyyzz-2">Lorem ipsum-2</i>
</span>

In [52]:
# it will only select the FIRST matching element of the parent node if there are multiple entries
# use the find_all() method or the .children attribute to get all the elements there
parser.body.span.i


<i class="italic" id="qqyyzz-1">Lorem ipsum-1</i>
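
The comment above mentions `find_all()` without showing it, so here is a minimal sketch of how it could be used on the same sample markup. The variable names `italic_tags`, `ids`, and `texts` are only illustrative, and the markup is rebuilt here so the cell runs on its own.

```python
from bs4 import BeautifulSoup

# Same sample markup as before, written without newlines
# so there are no whitespace-only text nodes in the tree.
html_doc = (
    '<html><body class="body"><span>'
    '<i id="qqyyzz-1" class="italic">Lorem ipsum-1</i>'
    '<i id="qqyyzz-2" class="italic">Lorem ipsum-2</i>'
    '</span></body></html>'
)
parser = BeautifulSoup(html_doc, 'html.parser')

# find_all() returns every matching descendant, not just the first one
italic_tags = parser.body.span.find_all('i')

# attributes are read like dictionary keys, text with get_text()
ids = [tag['id'] for tag in italic_tags]
texts = [tag.get_text() for tag in italic_tags]

print(ids)
print(texts)
```

Unlike `parser.body.span.i`, which stops at the first `<i>`, this gives back a list that you can loop over or index.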

In [136]:
for idx, tag in enumerate(parser.body.span.children):
    print('Tag', idx, ':', tag.prettify())

Tag 0 : <i class="italic" id="qqyyzz-1">
 Lorem ipsum-1
</i>

Tag 1 : <i class="italic" id="qqyyzz-2">
 Lorem ipsum-2
</i>


#### Going Up

We can use the `.parent` or `.parents` attributes.

In [137]:
# going body > span, then back up to body
parent = parser.body.span.parent
print(parent.prettify())

<body class="body">
 <span>
  <i class="italic" id="qqyyzz-1">
   Lorem ipsum-1
  </i>
  <i class="italic" id="qqyyzz-2">
   Lorem ipsum-2
  </i>
 </span>
</body>
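
The cell above only uses `.parent`. As a rough sketch of `.parents`, which walks all the way up the tree, the example below collects the name of every ancestor of the first `<i>` tag. The names `first_i` and `ancestor_names` are made up for this illustration, and the markup is rebuilt so the cell is self-contained.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<html><body class="body"><span>'
    '<i id="qqyyzz-1" class="italic">Lorem ipsum-1</i>'
    '<i id="qqyyzz-2" class="italic">Lorem ipsum-2</i>'
    '</span></body></html>'
)
parser = BeautifulSoup(html_doc, 'html.parser')

first_i = parser.body.span.i

# .parents is an iterator over every ancestor, nearest first;
# the last one is the BeautifulSoup object itself, named "[document]"
ancestor_names = [ancestor.name for ancestor in first_i.parents]
print(ancestor_names)
# ['span', 'body', 'html', '[document]']
```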


#### Left And Right
Use `.next_sibling` / `.previous_sibling` to go left or right.

In [153]:
# get the first <i> inside the span
iTag = next(parser.body.span.children)
print(iTag)

<i class="italic" id="qqyyzz-1">Lorem ipsum-1</i>


In [155]:
next_item = iTag.next_sibling
next_item


<i class="italic" id="qqyyzz-2">Lorem ipsum-2</i>
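
Going left works the same way with `.previous_sibling`. One thing to watch for: the sibling attributes return whatever node comes next, including whitespace-only text, which is why the newlines were stripped from the sample earlier. The sketch below assumes a smaller version of the sample markup with the whitespace left in, and uses `find_next_sibling()` / `find_previous_sibling()` to jump straight to the neighbouring tag. The names `raw_sibling`, `second_i`, and `back_again` are only illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Smaller sample, this time WITHOUT stripping newlines and indentation
html_doc = """
<span>
    <i id="qqyyzz-1" class="italic">Lorem ipsum-1</i>
    <i id="qqyyzz-2" class="italic">Lorem ipsum-2</i>
</span>
"""
parser = BeautifulSoup(html_doc, 'html.parser')
first_i = parser.span.i

# the immediate sibling is the whitespace between the two tags
raw_sibling = first_i.next_sibling
print(repr(raw_sibling))

# find_next_sibling() skips text nodes and returns the next matching tag
second_i = first_i.find_next_sibling('i')
print(second_i)

# and find_previous_sibling() goes back to the left
back_again = second_i.find_previous_sibling('i')
print(back_again)
```

If the markup has no whitespace between tags, as in the stripped sample used in this notebook, `.next_sibling` and `.previous_sibling` return the neighbouring tags directly.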