## Task 3: Web Scraping with Beautiful Soup
**Goal**: Learn how to scrape a web page with Beautiful Soup.

**Learning Outcomes**: Use Beautiful Soup to scrape different websites.

**Prerequisites**: Basic understanding of Python.

### Part 1: Introduction to Beautiful Soup
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with different parsers to provide ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Here is an example HTML document:
```python
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
```

You can import BeautifulSoup and create a BeautifulSoup object as follows:
```python
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
```

Now you can try several ways of navigating the BeautifulSoup data structure:

```python
# Continues with the html_doc string and soup object created above.
# The comment under each print() shows what it prints.

print(soup.title)
# <title>The Dormouse's story</title>

print(soup.title.name)
# title

print(soup.title.string)
# The Dormouse's story

print(soup.title.parent.name)
# head

print(soup.p)
# <p class="title"><b>The Dormouse's story</b></p>

# class can hold several values, so it is returned as a list
print(soup.p['class'])
# ['title']

print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.find(id="link3"))
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
```

Output:
```
<title>The Dormouse's story</title>
title
The Dormouse's story
head
<p class="title"><b>The Dormouse's story</b></p>
['title']
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
```
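Besides navigating by tag name, you will often want to search by CSS class or with a CSS selector. The sketch below repeats the sample document so it runs on its own. The variable names (`sister_names`, `story_ids`, `second_href`) are only illustrative.

```python
from bs4 import BeautifulSoup

# A copy of the sample document so this block is self-contained
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# class is a reserved word in Python, so the keyword argument is class_
sister_links = soup.find_all('a', class_='sister')
sister_names = [a.get_text() for a in sister_links]

# CSS selector: every <a> nested inside a <p class="story">
story_links = soup.select('p.story a')
story_ids = [a['id'] for a in story_links]

# select_one returns the first match, or None if nothing matches
second = soup.select_one('#link2')
second_href = second['href'] if second is not None else None

print(sister_names)
print(story_ids)
print(second_href)
```

`find_all` with keyword filters is convenient for simple lookups, while `select` tends to read better once the condition involves nesting.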


Two common tasks are:
1. Extracting all the URLs from a page
2. Extracting all the text from a page

The examples below show how to do both:

```python
print("Extracting all the URLs")
for link in soup.find_all('a'):
    print(link.get('href'))

print("Extracting all the text from a page")
print(soup.get_text())
```

Output:
```
Extracting all the URLs
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
Extracting all the text from a page

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
```
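On real pages, links are often relative and the extracted text carries a lot of stray whitespace. The sketch below shows one way to handle both. The page fragment and `base_url` are made up for illustration; `urljoin` comes from Python's standard library.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical page fragment and base URL, used only for illustration
base_url = "http://example.com/stories/index.html"
html = """
<p class="story">Three sisters:
<a href="elsie">Elsie</a>,
<a href="/lacie">Lacie</a> and
<a href="http://example.com/tillie">Tillie</a>.
<a>No link here</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True skips <a> tags that have no href attribute;
# urljoin resolves relative links against the page's own URL
absolute_urls = [
    urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)
]

# strip=True trims each text fragment and drops empty ones;
# separator is placed between the remaining fragments
clean_text = soup.get_text(separator=" ", strip=True)

for url in absolute_urls:
    print(url)
print(clean_text)
```

Filtering with `href=True` avoids printing `None` for anchors that carry no link.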



### Part 2: Scraping from the Web
Where do the HTML/XML pages come from? The `requests` library for Python handles HTTP communication by sending requests (GET, POST, etc.) to web servers, which lets us retrieve the raw HTML of a page. It also reports network-related problems, such as failed connections and timeouts, by raising exceptions.

```python
import requests
response = requests.get('http://books.toscrape.com/')
```

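```python
from bs4 import BeautifulSoup

# Parse the HTML returned by the request above
soup = BeautifulSoup(response.text, 'html.parser')

title = soup.title.text               # text inside the <title> tag
all_paragraphs = soup.find_all('p')   # list of every <p> element
# find() returns None if the page has no <div class="content">
specific_div = soup.find('div', class_='content')
```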
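Requests can fail, so it helps to wrap the call in a small function. The sketch below is one possible approach, not the only one: `fetch_html` is a hypothetical helper name and the 10-second timeout is an arbitrary choice.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page's HTML as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx status codes into requests.exceptions.HTTPError
        response.raise_for_status()
    except requests.exceptions.RequestException as exc:
        # RequestException is the base class for connection errors,
        # timeouts, invalid URLs, and the HTTP errors raised above
        print(f"Request failed: {exc}")
        return None
    return response.text

# An invalid URL is rejected before any network traffic is sent
result = fetch_html("not-a-valid-url")
print(result)
```

Returning `None` keeps the calling code simple: check the result before passing it to `BeautifulSoup`.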

### Part 3: Build an inventory by scraping a website

Your task is to 
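As one possible starting point, an inventory can be built by turning each repeated product element on a page into a dictionary. The sketch below runs on a made-up HTML snippet, so the tag names, class names (`product_pod`, `price_color`, `availability`), and the `build_inventory` helper are all assumptions. Inspect the real page and adjust them before use.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on a book-listing page.
# Check the real page before relying on these tag and class names.
sample_html = """
<article class="product_pod">
  <h3><a href="catalogue/book-one/index.html" title="Book One">Book One</a></h3>
  <p class="price_color">£51.77</p>
  <p class="instock availability">In stock</p>
</article>
<article class="product_pod">
  <h3><a href="catalogue/book-two/index.html" title="Book Two">Book Two</a></h3>
  <p class="price_color">£13.99</p>
  <p class="instock availability">In stock</p>
</article>
"""

def build_inventory(html):
    """Return one dict per product element found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    inventory = []
    for product in soup.find_all("article", class_="product_pod"):
        link = product.find("a", title=True)
        price = product.find("p", class_="price_color")
        stock = product.find("p", class_="availability")
        if link is None or price is None:
            continue  # skip entries missing a title or a price
        inventory.append({
            "title": link["title"],
            "price": float(price.get_text(strip=True).lstrip("£")),
            "in_stock": stock is not None and "In stock" in stock.get_text(),
        })
    return inventory

inventory = build_inventory(sample_html)
for item in inventory:
    print(item)
```

A list of dictionaries like this can later be written to CSV or loaded into a pandas DataFrame.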

```python
import plotly.graph_objects as go

# NOTE: `quakes` is not defined in this section. The code assumes it is a
# pandas DataFrame with integer `year` values and `Latitude`, `Longitude`,
# and `Magnitude` columns.

# Create frames for each year
frames = []
min_year = quakes['year'].min()
max_year = quakes['year'].max()

for year in range(min_year, max_year + 1):
    plot_df = quakes.query("year == @year")
    frames.append(
        go.Frame(
            data=[go.Densitymap(
                lat=plot_df.Latitude,
                lon=plot_df.Longitude,
                z=plot_df.Magnitude,
                radius=10,
                zmin=3,
                zmax=9,
                zauto=False,
                name=str(year)
            )],
            name=str(year)
        )
    )

# Create base figure with first year's data
initial_df = quakes.query("year == @min_year")
fig = go.Figure(
    data=[go.Densitymap(
        lat=initial_df.Latitude,
        lon=initial_df.Longitude,
        z=initial_df.Magnitude,
        radius=10
    )],
    frames=frames
)

# Add slider and play button
fig.update_layout(
    updatemenus=[{
        'type': 'buttons',
        'showactive': False,
        'buttons': [{
            'label': 'Play',
            'method': 'animate',
            'args': [None, {'frame': {'duration': 500, 'redraw': True}, 'fromcurrent': True}]
        }]
    }],
    sliders=[{
        'currentvalue': {'prefix': 'Year: '},
        'steps': [{'args': [[str(year)]], 'label': str(year), 'method': 'animate'}
                  for year in range(min_year, max_year + 1)]
    }]
)

# Update layout
fig.update_layout(
    map_style="open-street-map",
    map_center_lon=180,
    margin={"r": 0, "t": 30, "l": 10, "b": 20}  # Adjusted margins to accommodate controls
)

fig.show()
```