Finding a specific node, or group of nodes, inside the DOM is a daunting task. Modern websites can have thousands of nodes, organised in whatever way suits their needs. 
Locating a specific node, or group of nodes, by hand could take hours, if not days. 
This is where the power of BeautifulSoup comes in. BeautifulSoup allows us to `search` the DOM and find the nodes we are looking for in a matter of seconds. 

In this notebook, we will explore the different ways we can search the DOM, and find the nodes we are looking for.

For this example we will use the sample page [simple-page.html](../data/simple-page.html), stored in the `data` folder. 

The page is a mock of a search result, where two employees are listed and their relevant information is displayed in a card.

> Note: In a browser you can right-click on the page and select `View Page Source` to view the actual HTML string, or `Inspect` to see the DOM of the page.

#### Ways to search the DOM
##### Using the `find` or the `find_all` method

The `.find()` and `.find_all()` methods are the most common ways to search the DOM. The biggest difference between the two is that `.find()` returns only the first match (or `None` if nothing matches), while `.find_all()` returns all the matches.
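
As a minimal sketch of that difference, the snippet below parses a small inline HTML string and calls both methods. The markup is made up for this illustration and is not the content of `simple-page.html`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html = "<ul><li>Programmer</li><li>System Architect</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first_item = soup.find("li")          # a single Tag: the first match
all_items = soup.find_all("li")       # a list-like ResultSet with every match
missing = soup.find("table")          # None when nothing matches
none_found = soup.find_all("table")   # an empty result when nothing matches

print(first_item.text)
print([li.text for li in all_items])
print(missing, len(none_found))
```

Because `.find()` can return `None`, it is usually worth checking the result before accessing attributes on it.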

The formal definition of the .find() and .find_all() methods is as follows:

    find(name, attrs, recursive, string, **kwargs)
    find_all(name, attrs, recursive, string, limit, **kwargs)
    
 - `name` - the name of the tag to search for, e.g. `title`, `span`, `div`
    - if `name` is not provided, bs4 will consider every tag, whatever its name.
 - `recursive` - whether or not to search inside all descendants of the current node
    - if `recursive` is set to `False`, bs4 will only consider the current node's direct children. 
 - `string` - the text to search for
    - if `string` is provided, bs4 will only return nodes that pass the filter. The filter could be a string, a regular expression, or a function that returns true or false.
 - `limit` - (`find_all` only) the maximum number of matches to return
 - `**kwargs` - any other keyword arguments passed to the `find` or `find_all` methods will be used to filter nodes based on their `attributes`
    - if we want to search for tags that have id = 'id-123' then we can pass `id='id-123'` as a keyword argument. As with `string`, the value doesn't need to be a specific string; it can be a regular expression, or a function that returns true or false.
 - `attrs` - a dictionary of attributes to search for
    - some attributes, like `data-foo`, can't be expressed as kwargs because their names are not valid Python identifiers. In cases like this, we can use the `attrs` argument to pass a dictionary of attributes to search for, e.g. `attrs={'data-foo': 'bar'}` 
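
The arguments described above can be sketched as follows. The inline HTML, its ids and its `data-foo` attribute are assumptions made up for this example; only the `find` / `find_all` arguments themselves come from bs4.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html = (
    '<div id="id-123" data-foo="bar"><p>Room: 1234-A</p>'
    '<section><p>Room: 1234-C</p></section></div>'
    '<div id="id-456"><p>Building: AAAAA</p></div>'
)
soup = BeautifulSoup(html, "html.parser")

# **kwargs: filter on an attribute with an exact value or a regular expression.
by_id = soup.find_all(id="id-123")
by_regex = soup.find_all(id=re.compile(r"^id-\d+$"))

# attrs: needed for attribute names that are not valid Python identifiers.
by_attrs = soup.find_all(attrs={"data-foo": "bar"})

# string: filter on the tag's text with a regular expression or a function.
by_string = soup.find_all("p", string=re.compile("^Room"))
by_function = soup.find_all(
    "p", string=lambda s: s is not None and s.endswith("AAAAA")
)

# recursive=False: only look at the direct children of the starting node.
outer = soup.find(id="id-123")
direct_only = outer.find_all("p", recursive=False)

# limit: stop after the given number of matches.
first_two = soup.find_all("p", limit=2)

print(len(by_id), len(by_regex), len(by_attrs))
print([p.text for p in by_string], [p.text for p in by_function])
print([p.text for p in direct_only], len(first_two))
```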


#### CSS Selectors
CSS selectors are string patterns that describe nodes in the DOM. They serve a similar purpose to XPath expressions, but are usually shorter and easier to read.  
> Note: A full list of CSS selectors is available [here](https://www.w3schools.com/cssref/css_selectors.asp).
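
To give a feel for these patterns, here is a small sketch that runs a handful of common selectors through bs4's `select` method and counts the matches. The inline HTML is an assumption: it borrows ids and class names in the style of the sample page, but it is not the actual file.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html = (
    '<div id="result-item-1" class="result-item solid-green">'
    '<ul class="solid-green data"><li>Occupation: Programmer</li>'
    '<li>Room: 1234-A</li></ul>'
    '</div>'
    '<div id="result-item-2" class="result-item solid-green">'
    '<ul class="solid-green data"><li>Occupation: System Architect</li></ul>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

selectors = {
    "tag": "ul",
    "class": ".result-item",
    "id": "#result-item-2",
    "tag with class": "ul.data",
    "attribute prefix": 'div[id^="result-item"]',
    "descendant": "#result-item-1 li",
    "direct child": "ul > li",
}

# Count how many nodes each selector matches.
counts = {label: len(soup.select(css)) for label, css in selectors.items()}
for label, css in selectors.items():
    print(f"{label:18} {css:26} {counts[label]}")
```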

BeautifulSoup uses the SoupSieve package to run a CSS selector against a parsed document and return all the matching elements. If it's not already installed, you can install it using the following command:

~~~bash
    pip install soupsieve
~~~
> Note: That if that 

##### Using the `select` and `select_one` methods
Like `find_all` and `find`, the `select` method returns all the matches, while the `select_one` method returns just the first match.
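
A minimal sketch of the two return types, assuming a made-up inline HTML string rather than the sample page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html = '<ul class="data"><li>Room: 1234-A</li><li>Room: 1234-C</li></ul>'
soup = BeautifulSoup(html, "html.parser")

every_item = soup.select("ul.data li")      # all matching tags
first_item = soup.select_one("ul.data li")  # only the first matching tag
no_match = soup.select_one("ul.data a")     # None when nothing matches
no_matches = soup.select("ul.data a")       # empty result when nothing matches

print([li.text for li in every_item])
print(first_item.text)
print(no_match, len(no_matches))
```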

In [10]:
from pathlib import Path
from bs4 import BeautifulSoup
html_doc = Path("../data/simple-page.html").read_text()
html_doc = "".join(line.strip() for line in html_doc.split("\n"))
soup = BeautifulSoup(html_doc, "html.parser")
soup 
# <html>...</html>

### Using the ID attribute
Each tag _could_ contain an `id` attribute. This attribute is meant to be _unique_ within the HTML document (although this is not enforced) and can be used to find a specific tag.  
Unfortunately, as HTML is designed to be fault tolerant, browsers will happily render a page with duplicated ids without complaining. In a CSS selector, an id is matched using the `#` character.
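
Since uniqueness is not enforced, it can be worth checking how many nodes actually carry a given id. The sketch below uses deliberately invalid, made-up markup in which the hypothetical id `name-1` appears twice.

```python
from bs4 import BeautifulSoup

# Hypothetical, deliberately invalid markup: the same id is used twice.
html = (
    '<div id="name-1"><span>first card</span></div>'
    '<div id="name-1"><span>second card</span></div>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one("#name-1")   # silently returns the first match
duplicates = soup.select("#name-1")  # returns every tag carrying that id

print(first.span.text)
print(len(duplicates))
```

If `select` returns more than one node for an id, `select_one` is picking the first one in document order, which may not be the node you intended.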

In [6]:
from pathlib import Path
from bs4 import BeautifulSoup
html_doc = Path("../data/simple-page.html").read_text()
html_doc = "".join(line.strip() for line in html_doc.split("\n"))
soup = BeautifulSoup(html_doc, "html.parser")  # name the parser explicitly so the result does not depend on which parsers are installed
person = soup.select_one("#name-QTV9J")


In [7]:
person
# <div class="card-header solid-orange" id="name-QTV9J"><h>Person:</h><span>Jane Doe</span></div>


In [69]:
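# `p` is not defined in the cells shown; it is assumed here to be the container holding the result items
p = soup.select_one("#result-item-1").parent
for c in p.children:
    print(c.attrs)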

{'id': 'banner', 'class': ['solid-green', 'result-banner']}
{'id': 'result-item-1', 'class': ['result-item', 'solid-green']}
{'id': 'result-item-2', 'class': ['result-item', 'solid-green']}


In [26]:
from pathlib import Path
from bs4 import BeautifulSoup
html_doc = Path("../data/simple-page.html").read_text()
html_doc = "".join(line.strip() for line in html_doc.split("\n"))
soup = BeautifulSoup(html_doc, "html.parser")

# Print every <ul class="data"> node, followed by the text of each of its children
for ul in soup.select('ul.data'):
    print(ul.name, ul.attrs)
    for item in ul.children:
        print('\t', item.text)

ul {'class': ['solid-green', 'data']}
	 Occupation: Programmer
	 Room: 1234-A
	 Building: AAAAA
ul {'class': ['solid-green', 'data']}
	 Occupation: System Architect
	 Room: 1234-C
	 Building: BBBBB
