### BeautifulSoup

As discussed in the introduction, the HTML DOM is a tree structure in which each node is a tag.
Each `Tag` has a name and, potentially, attributes and child nodes.

`BeautifulSoup` is a popular Python library that helps you navigate, find, and extract these elements.

```python
from bs4 import BeautifulSoup

# html_doc is a string that contains the HTML to parse.

# 'html.parser' names the parser Beautiful Soup should use.
# If you are wondering where it comes from:
# html.parser is a module in Python's standard library,
# so it works without installing anything extra.
# If you omit the argument, Beautiful Soup picks the best
# parser installed on your system (and emits a warning).
soup = BeautifulSoup(html_doc, 'html.parser')
```
> Note: you can see a list of all the available parsers [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser).

From there, if the HTML is valid, you can navigate through the tree using the library's methods.

> Note: It's easiest to think of the `soup` variable as pointing to the root of the tree.
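
To make the idea of a `Tag` concrete, here is a minimal sketch that inspects the name, attributes, and children of a single node. The `snippet` string is a made-up example, not part of the sample page used later.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for this illustration
snippet = '<p id="intro" class="lead">Hello <b>world</b></p>'
soup = BeautifulSoup(snippet, "html.parser")

p = soup.p          # first <p> tag in the document
print(p.name)       # the tag name: p
print(p.attrs)      # the attributes, as a dict
print(p.contents)   # the direct children (text and tags)
```

Note that `class` comes back as a list, because an element can carry several classes at once.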


In [10]:
from pathlib import Path
html_doc = Path("./data/sample-page.html")

Below we are going to use a simple page.
 - The `<html>` tag has two children:
    - a `<head>` tag,
        - which has a `<title>` tag
    - and a `<body>` tag,
        - which has a `<span>` tag
            - which has two `<i>` tags


In [3]:
from bs4 import BeautifulSoup
html_doc = """
<html>
<head>
    <title>Title-1</title>
</head>
<body class="body">
    <span>
        <i id="qqyyzz-1" class="italic">Lorem ipsum-1</i>
        <i id="qqyyzz-2" class="italic">Lorem ipsum-2</i>
    </span>
</body>
</html>
"""
# removing some newline noise
html_doc = "".join(line.strip() for line in html_doc.split("\n"))
soup = BeautifulSoup(html_doc, "html.parser")

In [4]:
soup
# <html>...</html>

<html><head><title>Title-1</title></head><body class="body"><span><i class="italic" id="qqyyzz-1">Lorem ipsum-1</i><i class="italic" id="qqyyzz-2">Lorem ipsum-2</i></span></body></html>

### Navigating The Tree

#### Reference the immediate children of a node by name


From the starting node, we can get a reference to one of its children by using the `tag name` of the node we want to reach as an attribute.

~~~python
body = soup.body
head = soup.head
~~~



In [8]:
soup.title

<title>Title-1</title>

Each access by `tag name` returns a reference to the `first` node with that tag name. One advantage of this is that you can chain the accesses to reach the node you want.

For example, if you want to get a reference to the `span` tag that is inside the `body` tag, you can do the following:
~~~python
span = soup.body.span
~~~

Keep in mind that this only selects the FIRST matching element when there are multiple entries. To get all of them, use `find_all()` or iterate over the node's `.children`.

> Note: the `.children` attribute returns an iterator over the direct children of the node, so you can loop over it. If you need to pick a specific element by index (and you know which one you want), use `.contents`, which is a `list`.
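
The differences between these options can be sketched as follows. This is a minimal example that rebuilds a trimmed copy of the sample page (without the `<head>`), so it can run on its own. It contrasts attribute access, `find_all()`, and `.contents`, which holds the direct children as a list.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample page used in this notebook
html_doc = (
    '<html><body class="body"><span>'
    '<i id="qqyyzz-1" class="italic">Lorem ipsum-1</i>'
    '<i id="qqyyzz-2" class="italic">Lorem ipsum-2</i>'
    "</span></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")
span = soup.body.span

first_i = span.i             # attribute access: only the first <i>
all_i = span.find_all("i")   # every <i> found below the span
kids = span.contents         # direct children, as a list
second = kids[1]             # a list can be indexed

print(first_i["id"])                  # qqyyzz-1
print([tag["id"] for tag in all_i])   # ['qqyyzz-1', 'qqyyzz-2']
print(second.get_text())              # Lorem ipsum-2
```

`find_all()` searches all descendants, while `.contents` and `.children` only cover the immediate children. In this page the two happen to give the same tags.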

In [9]:
#  https://www.crummy.com/software/BeautifulSoup/bs4/doc/#pretty-printing
for idx, tag in enumerate(soup.body.span.children):
    print("Tag", idx, ":", tag.prettify())


Tag 0 : <i class="italic" id="qqyyzz-1">
 Lorem ipsum-1
</i>

Tag 1 : <i class="italic" id="qqyyzz-2">
 Lorem ipsum-2
</i>


#### Going Up

We can use the `.parent` attribute to get a reference to the parent of a node, or the `.parents` attribute to iterate over all of its ancestors.

In the example below we start from the root,
- then go down to `body`,
- then to the `span`,
- and then up again, twice, ending at `html`.

In [12]:
# going down body > span, then up twice: span > body > html
parent = soup.body.span.parent.parent
print(parent.prettify())

<html>
 <body class="body">
  <span>
   <i class="italic" id="qqyyzz-1">
    Lorem ipsum-1
   </i>
   <i class="italic" id="qqyyzz-2">
    Lorem ipsum-2
   </i>
  </span>
 </body>
</html>
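
While `.parent` moves up a single level, `.parents` walks the whole chain of ancestors. A minimal sketch, assuming a reduced version of the sample page with a single `<i>` tag:

```python
from bs4 import BeautifulSoup

# Reduced version of the sample page, for illustration only
html_doc = "<html><body><span><i>Lorem ipsum-1</i></span></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

i_tag = soup.body.span.i
print(i_tag.parent.name)   # span

# .parents yields every ancestor, up to the BeautifulSoup object itself
ancestors = [parent.name for parent in i_tag.parents]
print(ancestors)           # ['span', 'body', 'html', '[document]']
```

The last entry, `[document]`, is the name of the `BeautifulSoup` object, which sits above the `<html>` tag.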


#### Left And Right
We can use the `.previous_sibling` / `.next_sibling` attributes to move left or right in the tree.

In [11]:
# get the first <i> inside the span
iTag = next(soup.body.span.children)
print(iTag)
next_item = iTag.next_sibling
next_item

<i class="italic" id="qqyyzz-1">Lorem ipsum-1</i>


<i class="italic" id="qqyyzz-2">Lorem ipsum-2</i>
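
One thing to watch for: whitespace between tags is a node too, which is why the newlines were stripped from the sample page earlier. The sketch below uses a made-up snippet that keeps a newline between the two `<i>` tags, and shows `find_next_sibling()` as one way to skip over text nodes.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with a newline between the two <i> tags
raw = "<span><i>one</i>\n<i>two</i></span>"
soup = BeautifulSoup(raw, "html.parser")

first = soup.span.i
print(repr(first.next_sibling))       # '\n' (a text node, not a tag)
print(first.find_next_sibling("i"))   # <i>two</i>
print(first.previous_sibling)         # None: nothing to the left
```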

There are many other methods that you can use to navigate the tree. If you want to know more, you can check the [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree).

In the next [notebook](./searching-the-dom.ipynb) we are going to explore how to search for specific nodes in the tree.