# Beautiful Soup

Beautiful Soup is a popular Python library used for parsing HTML and XML documents. It creates a parse tree from page source code that can be used to extract data easily.

## Install

Beautiful Soup is distributed on PyPI as [beautifulsoup4](https://pypi.org/project/beautifulsoup4/). Install it with `pip install beautifulsoup4`.

## Basic Usage

In [None]:
from bs4 import BeautifulSoup

# Sample HTML content
html_content = """
<html>
<head>
    <title>Sample Page</title>
</head>
<body>
    <h1>Welcome to My Website</h1>
    <p>This is a paragraph of text.</p>
    <ul>
        <li>Item 1</li>
        <li>Item 2</li>
        <li>Item 3</li>
    </ul>
    <p>This is the second paragraph of text.</p>
    <ul class="table-class">
        <li>Item 4</li>
        <li>Item 5</li>
        <li>Item 6</li>
    </ul>
    <h1>Another h1 title</h1>
</body>
</html>
"""

# Create a BeautifulSoup object
soup = BeautifulSoup(html_content, 'html.parser')

In [None]:
# Access the title tag
soup.title

In [None]:
# Extract the title
soup.title.string

You can also use the `.text` attribute. In short, `.string` returns `None` when a tag has more than one child, while `.text` concatenates all the text nested inside the tag. The difference is discussed in more detail in this [Stack Overflow post](https://stackoverflow.com/questions/25327693/difference-between-string-and-text-beautifulsoup). Under the hood, `.text` calls `.get_text()`, a method that accepts several arguments (including a separator).
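To make the difference concrete, here is a minimal sketch on a small snippet that is not part of the sample page above (the `snippet` and `demo` names are illustrative assumptions). It shows a case where `.string` gives `None` and how `get_text()` can take a separator.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for this illustration
snippet = "<div><p>Hello <b>world</b></p><p>Bye</p></div>"
demo = BeautifulSoup(snippet, "html.parser")

first_p = demo.find("p")
b_tag = demo.find("b")

# <b> has a single string child, so .string returns it
string_of_b = b_tag.string

# The first <p> has two children (a string and a <b> tag), so .string is None
string_of_p = first_p.string

# .text concatenates every string nested inside the tag
text_of_p = first_p.text

# get_text() can join the pieces with a separator and strip whitespace
joined = demo.div.get_text(separator=" | ", strip=True)

print(string_of_b, string_of_p, text_of_p, joined, sep="\n")
```

As a rule of thumb, `.string` is convenient for leaf tags such as `<title>` or `<li>`, while `.text` or `.get_text()` is safer when a tag may contain nested markup.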

In [None]:
# Extract the title text with .text
soup.title.text

In [None]:
# Extract the title text with .get_text()
soup.title.get_text()

In [None]:
# Extract the first paragraph
soup.find('p').string
# soup.find('p').get_text()  # alternative
# soup.find('p').text  # alternative

In [None]:
# Extract all list items
li_tags = soup.find_all('li')
for li in li_tags:
    print(li.string)

In [None]:
# find() finds the first occurrence of a tag
soup.find('h1')

In [None]:
# find_all() finds all occurrences of a tag.
soup.find_all('h1')

In [None]:
# The same works for <p> tags
soup.find_all('p')
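The following sketch contrasts `find()` and `find_all()` a little further, on a small hypothetical snippet rather than the sample page. It assumes you may want to search for several tag names at once or cap the number of results, which `find_all()` supports through a list of names and the `limit` argument.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for this illustration
snippet = "<h1>A</h1><p>one</p><p>two</p><h1>B</h1>"
demo = BeautifulSoup(snippet, "html.parser")

# A list of names matches any of them, in document order
mixed = [tag.name for tag in demo.find_all(["h1", "p"])]

# limit stops the search after the given number of matches
limited = demo.find_all("p", limit=1)

# When nothing matches, find() returns None and find_all() an empty list
no_tag = demo.find("h2")
no_tags = demo.find_all("h2")

print(mixed, limited, no_tag, no_tags)
```

Because `find()` can return `None`, it is worth checking the result before chaining `.string` or `.text` onto it.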

In [None]:
# select() uses CSS selectors to find elements.
soup.select('ul li')
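CSS selectors can also filter by class. The sketch below uses a reduced, hypothetical version of the sample page to compare `select()` with `select_one()`, which returns only the first match (or `None` if there is none).

```python
from bs4 import BeautifulSoup

# Reduced, hypothetical version of the sample page
snippet = """
<ul><li>Item 1</li><li>Item 2</li></ul>
<ul class="table-class"><li>Item 4</li><li>Item 5</li></ul>
"""
demo = BeautifulSoup(snippet, "html.parser")

# Every <li> nested inside any <ul>
all_items = [li.get_text() for li in demo.select("ul li")]

# Only the <li> tags inside <ul class="table-class">
class_items = [li.get_text() for li in demo.select("ul.table-class li")]

# select_one() returns the first match instead of a list
first_item = demo.select_one("ul.table-class li").get_text()

# No <ol> in the snippet, so this is None
missing = demo.select_one("ol li")

print(all_items, class_items, first_item, missing)
```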

In [None]:
# The 'attrs' argument accepts a dict of attributes to match
soup.find_all(attrs={'class':'table-class'})
# soup.find(attrs={'class':'table-class'}) works the same way if you want to retrieve only the first element
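Since `class` is a reserved word in Python, Beautiful Soup also accepts a `class_` keyword argument as a shortcut for the `attrs` dict. Here is a minimal sketch on a hypothetical snippet, which also shows that a tag with several classes matches on any one of them.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the first <ul> carries two classes
snippet = (
    '<ul class="table-class wide"><li>Item 4</li></ul>'
    '<ul class="other"><li>Item 7</li></ul>'
)
demo = BeautifulSoup(snippet, "html.parser")

# Both calls look for tags whose class list contains "table-class"
by_attrs = demo.find_all(attrs={"class": "table-class"})
by_keyword = demo.find_all("ul", class_="table-class")

# find() takes the same arguments and returns only the first match
first = demo.find("ul", class_="table-class")

# The class attribute is exposed as a list of class names
classes = first["class"]

print(len(by_attrs), len(by_keyword), classes)
```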

In [None]:
# Navigating the Parse Tree
soup.p.parent  # Get the parent of the first <p> tag
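Beyond `.parent`, the tree can be walked sideways and downwards. The sketch below, on a hypothetical snippet, illustrates one pitfall: with newlines between tags, `.next_sibling` is often a whitespace string rather than a tag, so `find_next_sibling()` is usually the more convenient choice.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with newlines between tags, as in real pages
snippet = "<body>\n<h1>Title</h1>\n<p>First</p>\n<ul><li>A</li><li>B</li></ul>\n</body>"
demo = BeautifulSoup(snippet, "html.parser")

p = demo.p

# Going up: the parent of the first <p> is <body>
parent_name = p.parent.name

# Going sideways: the raw next sibling is the newline string, not the <ul>
raw_sibling = p.next_sibling

# find_next_sibling() / find_previous_sibling() skip to the tag you ask for
next_list = p.find_next_sibling("ul")
previous_heading = p.find_previous_sibling("h1")

# Going down: direct child tags only (True matches any tag)
child_names = [child.name for child in demo.body.find_all(True, recursive=False)]

print(parent_name, repr(raw_sibling), next_list.li.string, previous_heading.string, child_names)
```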

## Exercise

❓ **>>>** Reproduce the result from the previous notebook on the Books to Scrape website (every item in the left-hand menu), this time using the Beautiful Soup library.

In [None]:
import requests
from bs4 import BeautifulSoup

URL = 'https://books.toscrape.com/'
r = requests.get(URL)
soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly to avoid a warning

In [None]:
# Code here!


## Exercise

❓ **>>>** Reproduce the result from the previous notebook on the IMDb website (the top 250 movies with rank, name, and average rating), this time using the Beautiful Soup library. You can also use the `json` module if you want to make things easier.

In [None]:
# Code here!
