<b>Using sample.html in Data</b>

In [18]:
from bs4 import BeautifulSoup

In [19]:
file_name = "../../Data/sample.html"

In [20]:
with open(file_name, "r", encoding="utf-8") as file:
    html_data = file.read()

- Using `html_data`, create a soup object from the HTML document

In [23]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_data, "html.parser")
print(soup)

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Sample Page for Scraping</title>
</head>
<body>
<h1 id="main-title">Welcome to My Sample Page</h1>
<p class="intro">This page is created for practicing <b>BeautifulSoup</b>.</p>
<div class="content">
<h2>Articles</h2>
<p class="article">Article 1: <a href="https://www.usfca.edu/arts-sciences/programs/graduate/data-science-artificial-intelligence">USF MSDSAI</a></p>
<p class="article">Article 2: <a href="https://catalog.usfca.edu/preview_program.php?catoid=37&amp;poid=35409&amp;returnto=8545">USF MSDSAI Classes</a></p>
<p class="article">Article 3: <a href="https://www.usfca.edu/koret">Koret Center</a></p>
</div>
<div id="sidebar">
<h2>Resources</h2>
<ul>
<li><a class="resource" href="https://docs.python.org">Python Docs</a></li>
<li><a class="resource" href="https://beautiful-soup-4.readthedocs.io">BeautifulSoup Docs</a></li>
<li><a class="resource" href="https://pandas.pydata.org">Pandas</a></li>
</ul>
</div>
<foo

Accessing Objects using a "Tag" or "NavigableString"
- A Tag corresponds to an HTML element, and its name is the letters that come right after the opening `<`. Ex. ```<h1> Header </h1> ```: h1 is the tag. You can access the tag's name, children, and attributes.
- A NavigableString object is a special Python string that holds the text content within a tag. You can access it via `.string`. In the example above, `soup.h1.string` would hold the text `Header`.
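
As a minimal sketch of the two object types described above, the cell below parses a hypothetical one-line document (not the sample page) and inspects the tag's name, attributes, and string. The variable names `demo` and `h1` are assumptions made for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical one-line document, used only for illustration.
demo = BeautifulSoup('<h1 id="main-title">Header</h1>', "html.parser")

h1 = demo.h1
print(isinstance(h1, Tag))                     # the element itself is a Tag
print(h1.name)                                 # the tag's name
print(h1.attrs)                                # the tag's attributes as a dict
print(isinstance(h1.string, NavigableString))  # the text is a NavigableString
print(str(h1.string))                          # convert it to a plain Python str
```

Converting with `str()` is useful when the text should outlive the soup object, since a NavigableString keeps a reference to the parse tree.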

- Navigating the Tree

>>>  <small>* .tag_name : Find By Tag

In [48]:
soup.title

<title>Sample Page for Scraping</title>

In [49]:
soup.title.string

'Sample Page for Scraping'

>>> <small> * find() : Find First Occurrence

In [31]:
soup.find("h1")

<h1 id="main-title">Welcome to My Sample Page</h1>

>>> <small> * find_all(): Find All Occurrences

In [None]:
soup.find_all("p")

[<p class="intro">This page is created for practicing <b>BeautifulSoup</b>.</p>,
 <p class="article">Article 1: <a href="https://www.usfca.edu/arts-sciences/programs/graduate/data-science-artificial-intelligence">USF MSDSAI</a></p>,
 <p class="article">Article 2: <a href="https://catalog.usfca.edu/preview_program.php?catoid=37&amp;poid=35409&amp;returnto=8545">USF MSDSAI Classes</a></p>,
 <p class="article">Article 3: <a href="https://www.usfca.edu/koret">Koret Center</a></p>,
 <p>Contact: <span id="email">dwoodbridge@usfca.edu</span></p>]

>>> <small> * select(): Uses CSS selectors to find all matching elements. This is often more flexible and powerful for complex queries.

In [62]:
soup.select("div#sidebar")

[<div id="sidebar">
 <h2>Resources</h2>
 <ul>
 <li><a class="resource" href="https://docs.python.org">Python Docs</a></li>
 <li><a class="resource" href="https://beautiful-soup-4.readthedocs.io">BeautifulSoup Docs</a></li>
 <li><a class="resource" href="https://pandas.pydata.org">Pandas</a></li>
 </ul>
 </div>]

In [63]:
soup.select("div#sidebar li")

[<li><a class="resource" href="https://docs.python.org">Python Docs</a></li>,
 <li><a class="resource" href="https://beautiful-soup-4.readthedocs.io">BeautifulSoup Docs</a></li>,
 <li><a class="resource" href="https://pandas.pydata.org">Pandas</a></li>]
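
A few more selector forms may help here. The sketch below uses a hypothetical fragment that mirrors the sidebar of the sample page; it shows a class selector and `select_one()`, which returns only the first match or `None`. The names `html`, `demo`, `links`, `first`, and `missing` are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mirroring the sidebar of the sample page.
html = """
<div id="sidebar">
<ul>
<li><a class="resource" href="https://docs.python.org">Python Docs</a></li>
<li><a class="resource" href="https://pandas.pydata.org">Pandas</a></li>
</ul>
</div>
"""
demo = BeautifulSoup(html, "html.parser")

# "a.resource" matches <a> tags whose class list contains "resource".
links = demo.select("a.resource")
print([a.get_text() for a in links])

# select_one() returns the first match only.
first = demo.select_one("div#sidebar a")
print(first["href"])

# When nothing matches, select_one() returns None instead of an empty list.
missing = demo.select_one("p.article")
print(missing)
```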

>>> <small> * `.get()` or dictionary-style indexing (`tag["attr"]`) to access attributes

In [70]:
for a_tag in soup.select("div#sidebar li a"):
    print(a_tag.get("href"))

https://docs.python.org
https://beautiful-soup-4.readthedocs.io
https://pandas.pydata.org


In [71]:
for a_tag in soup.select("div#sidebar li a"):
    print(a_tag["href"])

https://docs.python.org
https://beautiful-soup-4.readthedocs.io
https://pandas.pydata.org
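
The two access styles differ when an attribute is missing. The sketch below, on a hypothetical single `<a>` tag, shows that `.get()` returns `None` (or a supplied default) while dictionary-style indexing raises `KeyError`. The attribute name `title` is chosen only because the tag does not have it.

```python
from bs4 import BeautifulSoup

# Hypothetical single-link document, used only for illustration.
demo = BeautifulSoup(
    '<a class="resource" href="https://docs.python.org">Python Docs</a>',
    "html.parser",
)
a_tag = demo.a

print(a_tag.get("href"))          # attribute exists: returns its value
print(a_tag.get("title"))         # attribute missing: returns None
print(a_tag.get("title", "n/a"))  # attribute missing: returns the default

try:
    a_tag["title"]                # attribute missing: raises KeyError
except KeyError:
    print("no title attribute")

# Multi-valued attributes such as class come back as a list.
print(a_tag["class"])
```

`.get()` is usually the safer choice when looping over many tags that may not all carry the attribute.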


- Navigating the Tree (Down, Up, and Sideways)

>>> <small> * Navigating Down using `.children`

In [78]:
for div in soup.find_all("div"):
    for child in div.children:
        print(child.text)
    print("--------")



Articles


Article 1: USF MSDSAI


Article 2: USF MSDSAI Classes


Article 3: Koret Center


--------


Resources



Python Docs
BeautifulSoup Docs
Pandas



--------


In [76]:
soup.find_all("div")

[<div class="content">
 <h2>Articles</h2>
 <p class="article">Article 1: <a href="https://www.usfca.edu/arts-sciences/programs/graduate/data-science-artificial-intelligence">USF MSDSAI</a></p>
 <p class="article">Article 2: <a href="https://catalog.usfca.edu/preview_program.php?catoid=37&amp;poid=35409&amp;returnto=8545">USF MSDSAI Classes</a></p>
 <p class="article">Article 3: <a href="https://www.usfca.edu/koret">Koret Center</a></p>
 </div>,
 <div id="sidebar">
 <h2>Resources</h2>
 <ul>
 <li><a class="resource" href="https://docs.python.org">Python Docs</a></li>
 <li><a class="resource" href="https://beautiful-soup-4.readthedocs.io">BeautifulSoup Docs</a></li>
 <li><a class="resource" href="https://pandas.pydata.org">Pandas</a></li>
 </ul>
 </div>]

>>> <small> * Navigating Up using `.parent`

In [85]:
for li in soup.find_all("li"):
    print("=============")
    print(li.text)    # text inside the <li>
    print(li.parent)  # the enclosing <ul>

=============
Python Docs
<ul>
<li><a class="resource" href="https://docs.python.org">Python Docs</a></li>
<li><a class="resource" href="https://beautiful-soup-4.readthedocs.io">BeautifulSoup Docs</a></li>
<li><a class="resource" href="https://pandas.pydata.org">Pandas</a></li>
</ul>
=============
BeautifulSoup Docs
<ul>
<li><a class="resource" href="https://docs.python.org">Python Docs</a></li>
<li><a class="resource" href="https://beautiful-soup-4.readthedocs.io">BeautifulSoup Docs</a></li>
<li><a class="resource" href="https://pandas.pydata.org">Pandas</a></li>
</ul>
=============
Pandas
<ul>
<li><a class="resource" href="https://docs.python.org">Python Docs</a></li>
<li><a class="resource" href="https://beautiful-soup-4.readthedocs.io">BeautifulSoup Docs</a></li>
<li><a class="resource" href="https://pandas.pydata.org">Pandas</a></li>
</ul>


>>> <small> * Navigating siblings using `.next_siblings` or `.previous_siblings`
>>>> * Note: The `.next_sibling` or `.previous_sibling` of a tag is often a string containing only whitespace (the line break between tags). To reach the neighboring tag, you may need `.next_sibling.next_sibling` or `.previous_sibling.previous_sibling`.

In [122]:
soup.div

<div class="content">
<h2>Articles</h2>
<p class="article">Article 1: <a href="https://www.usfca.edu/arts-sciences/programs/graduate/data-science-artificial-intelligence">USF MSDSAI</a></p>
<p class="article">Article 2: <a href="https://catalog.usfca.edu/preview_program.php?catoid=37&amp;poid=35409&amp;returnto=8545">USF MSDSAI Classes</a></p>
<p class="article">Article 3: <a href="https://www.usfca.edu/koret">Koret Center</a></p>
</div>

In [121]:
soup.div.next_sibling.next_sibling

<div id="sidebar">
<h2>Resources</h2>
<ul>
<li><a class="resource" href="https://docs.python.org">Python Docs</a></li>
<li><a class="resource" href="https://beautiful-soup-4.readthedocs.io">BeautifulSoup Docs</a></li>
<li><a class="resource" href="https://pandas.pydata.org">Pandas</a></li>
</ul>
</div>
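
As an alternative to chaining `.next_sibling` twice, `find_next_sibling()` skips over the whitespace string and returns the next matching tag. The sketch below uses a hypothetical two-div fragment rather than the sample page; `first_div` and `nxt_div` are illustrative names.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two sibling divs separated by a line break.
html = """<body>
<div class="content"><h2>Articles</h2></div>
<div id="sidebar"><h2>Resources</h2></div>
</body>"""
demo = BeautifulSoup(html, "html.parser")

first_div = demo.div

# The immediate sibling is the whitespace between the two tags.
print(repr(first_div.next_sibling))

# find_next_sibling("div") returns the next sibling that is a <div> tag.
nxt_div = first_div.find_next_sibling("div")
print(nxt_div["id"])
```

This avoids depending on exactly one whitespace string sitting between the two tags.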

In [95]:
for sibling in soup.find("a").previous_siblings:
    print(sibling)

Article 1: 


In [165]:
for item in soup.find_all("a"):
    siblings = list(item.previous_siblings) + list(item.next_siblings)
    print(f"{item}: {siblings}")

<a href="https://www.usfca.edu/arts-sciences/programs/graduate/data-science-artificial-intelligence">USF MSDSAI</a>: ['Article 1: ']
<a href="https://catalog.usfca.edu/preview_program.php?catoid=37&amp;poid=35409&amp;returnto=8545">USF MSDSAI Classes</a>: ['Article 2: ']
<a href="https://www.usfca.edu/koret">Koret Center</a>: ['Article 3: ']
<a class="resource" href="https://docs.python.org">Python Docs</a>: []
<a class="resource" href="https://beautiful-soup-4.readthedocs.io">BeautifulSoup Docs</a>: []
<a class="resource" href="https://pandas.pydata.org">Pandas</a>: []


In [136]:
soup.find("a")

<a href="https://www.usfca.edu/arts-sciences/programs/graduate/data-science-artificial-intelligence">USF MSDSAI</a>