# A very brief intro to web scraping

Let's start with a [video](https://www.youtube.com/watch?v=Ct8Gxo8StBU).

- A lot of data isn't available through data sets or APIs, even though it exists on the internet as web pages. One way to access that data without waiting for the provider to create an API is a technique called web scraping.

- Web scraping loads a web page into Python so we can extract the information we want. We can then work with the data using standard analysis tools like `pandas` and `numpy`.

- Before we can do web scraping, we need to understand the structure of the web page we're working with and then find a way to extract parts of that structure in a manner that makes sense.

- We'll use the `requests` library often as we learn about web scraping; it lets us download a web page. We'll also use the `BeautifulSoup` library (installed as `beautifulsoup4` and imported from `bs4`) to extract the relevant parts of the web page.

## Web page structure

Web pages use HyperText Markup Language (HTML). HTML isn't a programming language like Python. It's a markup language with its own syntax and rules. When a web browser like Chrome or Firefox downloads a web page, it reads the HTML to determine how to render and display it.

Here's the HTML for a very simple web page:

```html
<html>
    <head>
        <title> A simple example page </title>
    </head>
    <body>
        <p> Here is some simple content for this page </p>
    </body>
</html>
```

HTML consists of tags. We open a tag like this:

```html
<p>
```

We close a tag like this:

```html
</p>
```

Anything in between the opening and closing of a tag is the content of that tag. We can nest tags to create complex formatting rules. Here's an example:

```html
<p><b>This is a bold text</b></p>
```

The `b` tag bolds the text inside it, and the `p` tag creates a new paragraph. The HTML above will display as a bold paragraph because the `b` tag is inside the `p` tag. In other words, the `b` tag is nested within the `p` tag.

HTML documents contain a few major sections. The `head` section contains information that's useful to the web browser rendering the page; it isn't displayed in the page itself. The `body` section contains the bulk of the content you will see in your browser.

Different tags have different purposes. For example, the `title` tag tells the browser what to display at the top of your tab. The `p` tag indicates that the content inside it is a single paragraph.
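
To see this nesting from Python, here is a minimal sketch that uses the standard library's `html.parser` module. That choice is an assumption made for illustration (the rest of this tutorial uses BeautifulSoup), and `TagOutline` and `PAGE` are hypothetical names. The sketch records each opening tag together with its nesting depth and prints an indented outline of the page.

```python
from html.parser import HTMLParser

# The simple page from above, stored as a string so the example runs offline.
PAGE = """<html>
    <head>
        <title> A simple example page </title>
    </head>
    <body>
        <p> Here is some simple content for this page </p>
    </body>
</html>"""


class TagOutline(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # Store the tag at the current depth, then go one level deeper.
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag brings us back up one level.
        self.depth -= 1


outliner = TagOutline()
outliner.feed(PAGE)
outliner.close()

# Print one tag per line, indented by its depth.
for depth, tag in outliner.outline:
    print("    " * depth + tag)
```

The outline shows `head` and `body` one level below `html`, with `title` and `p` one level further down.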

Let's start with a very [simple](https://dataquestio.github.io/web-scraping-pages/simple.html) web page.

In [2]:
import requests

In [9]:
response = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html")
content = response.content
content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'
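
The `b'...'` prefix in this output means `response.content` is a `bytes` object, not a string. The sketch below decodes a hard-coded copy of those bytes so it runs without a network connection. UTF-8 is an assumption here, and `html_text` is a name introduced for illustration. With `requests`, `response.text` returns an already-decoded string.

```python
# Hard-coded copy of the bytes shown above, so the example runs offline.
content = (
    b"<!DOCTYPE html>\n<html>\n    <head>\n"
    b"        <title>A simple example page</title>\n    </head>\n"
    b"    <body>\n        <p>Here is some simple content for this page.</p>\n"
    b"    </body>\n</html>"
)

print(type(content))    # <class 'bytes'>

# Decode the bytes into text. UTF-8 is assumed; a real page may
# declare a different encoding.
html_text = content.decode("utf-8")

print(type(html_text))  # <class 'str'>
print(html_text)        # the newlines are now rendered as line breaks
```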

In [4]:
# conda install beautifulsoup4

## BeautifulSoup

Downloading the page is the easy part. Let's say that we want to get the text in the first paragraph. Now we need to parse the page and extract the information we want.

We'll use the `BeautifulSoup` library to parse the web page with Python. This library allows us to extract tags from an HTML document.

We can think of HTML documents as "trees," and the nested tags as "branches" (similar to a family tree). BeautifulSoup represents a parsed document the same way.

In our simple page, for example, the root of the "tree" is the `html` tag:

```html
<html>
    <head>
        <title> A simple example page </title>
    </head>
    <body>
        <p> Here is some simple content for this page </p>
    </body>
</html>
```

The `html` tag contains two "branches," `head` and `body`. `head` contains one "branch," `title`, and `body` contains one "branch," `p`. Drilling down through these branches is one way to parse a web page.
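
A minimal sketch of listing those branches with BeautifulSoup, using the page as an inline string so no download is needed. `child_tags` is a hypothetical helper introduced for this example. It relies on `find_all(True, recursive=False)`, which returns only the tags nested directly inside an element.

```python
from bs4 import BeautifulSoup

page = """<html>
    <head>
        <title> A simple example page </title>
    </head>
    <body>
        <p> Here is some simple content for this page </p>
    </body>
</html>"""

soup = BeautifulSoup(page, "html.parser")


def child_tags(tag):
    """Return the names of the tags nested directly inside `tag`."""
    # True matches every tag; recursive=False limits the search
    # to direct children instead of all descendants.
    return [child.name for child in tag.find_all(True, recursive=False)]


print(child_tags(soup.html))  # ['head', 'body']
print(child_tags(soup.head))  # ['title']
print(child_tags(soup.body))  # ['p']
```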

To extract the text inside the `p` tag, we need to get the `body` element, then the `p` element, and then finally the text inside the `p` element.

In [12]:
from bs4 import BeautifulSoup

# Initialize the parser, and pass in the content we grabbed earlier.
parser = BeautifulSoup(content, 'html.parser')

# Get the body tag from the document.
# Since we passed in the top level of the document to the parser, we need to pick a branch off of the root.
# With BeautifulSoup, we can access branches by using tag types as attributes.
body = parser.body

# Get the p tag from the body.
p = body.p

# Print the text inside the p tag.
# Text is a property that gets the inside text of a tag.
print(p.text)

# Get the text from the title tag, which is nested inside the head tag.
title_text = parser.head.title.text
print(title_text)


Here is some simple content for this page.
A simple example page


## Applying methods

Using the tag type as an attribute isn't always the best way to parse a document, because it only returns the first matching tag. It's usually better to be explicit and use the `find_all` method, which finds all occurrences of a tag in the current element and returns them as a list.

If we only want the first occurrence of an item, we'll need to index the list to get it. Aside from this difference, the process is the same as using the tag type as an attribute.

In [14]:
# Get a list of all occurrences of the body tag in the element.
body = parser.find_all("body")

# Get the paragraph tag.
p = body[0].find_all("p")

# Get the text.
print(p[0].text)

Here is some simple content for this page.
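
When only the first match is needed, BeautifulSoup also provides a `find` method that returns the first matching tag directly. The sketch below compares it with `find_all` on a small inline page, which is an assumption made for illustration and not one of the pages used in this tutorial.

```python
from bs4 import BeautifulSoup

# Small illustrative page with two paragraphs.
page = "<html><body><p>First</p><p>Second</p></body></html>"
soup = BeautifulSoup(page, "html.parser")

# find_all returns every match, so we index the result to get one tag.
all_paragraphs = soup.find_all("p")
print(len(all_paragraphs))     # 2
print(all_paragraphs[0].text)  # First

# find returns the first match directly.
first_paragraph = soup.find("p")
print(first_paragraph.text)    # First

# When nothing matches, find returns None and find_all returns an empty list.
print(soup.find("table"))           # None
print(len(soup.find_all("table")))  # 0
```

Checking for `None` before reading `.text` avoids an `AttributeError` on pages where the tag is missing.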


## Element IDs

HTML allows elements to have IDs. Because an ID should be unique within a page, we can use it to refer to a specific element.

Here's an example page:

```html
<html>
    <head>
        <title> A simple example page </title>
    </head>
    <body>
        <div>
            <p id="first">
                First paragraph
            </p>
        </div>
        <p id="second">
            <b>
                Second paragraph
            </b>
        </p>
    </body>
</html>
```

HTML uses the `div` tag to create a division, a generic container that splits the page into logical units. We can think of a `div` as a "box" that contains content. For example, separate `div` elements might hold a web page's footer, sidebar, and horizontal menu.

There are two paragraphs on this page. The first is nested inside a `div`. Luckily, the paragraphs have IDs. This means we can access them easily, even though they're nested.

In [36]:
# Get the page content and set up a new parser.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple_ids.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')


# Pass in the ID attribute to only get the element with that specific ID.
first_paragraph = parser.find_all("p", id="first")[0]
print(first_paragraph.text)


                First paragraph.
            
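The printed text keeps the indentation and line breaks from the HTML source. A minimal sketch of two ways to remove that whitespace, using an inline fragment that mimics the page (an assumption, so the example runs offline):

```python
from bs4 import BeautifulSoup

# Inline fragment that mimics the indentation of the example page.
fragment = """<div>
            <p id="first">
                First paragraph.
            </p>
        </div>"""
soup = BeautifulSoup(fragment, "html.parser")
first_paragraph = soup.find("p", id="first")

# repr makes the surrounding newlines and spaces visible.
print(repr(first_paragraph.text))

# Option 1: let BeautifulSoup strip each piece of text.
print(first_paragraph.get_text(strip=True))  # First paragraph.

# Option 2: use the ordinary string method.
print(first_paragraph.text.strip())          # First paragraph.
```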


In [34]:
for elem in parser.find_all("p"):
    print(elem.text)


                First paragraph.
            


                Second paragraph.
            
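Since every paragraph on this page has an ID, the loop can also collect the results into a dictionary keyed by ID. This is a sketch on an inline copy of the page. `paragraphs_by_id` is a name introduced for illustration, and `get_text(strip=True)` is used to drop the surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Inline copy of the example page, so no download is needed.
page = """<html>
    <body>
        <div>
            <p id="first">
                First paragraph.
            </p>
        </div>
        <p id="second">
            <b>
                Second paragraph.
            </b>
        </p>
    </body>
</html>"""
soup = BeautifulSoup(page, "html.parser")

paragraphs_by_id = {}
for elem in soup.find_all("p"):
    # .get("id") returns None instead of raising if the attribute is missing.
    elem_id = elem.get("id")
    if elem_id is not None:
        paragraphs_by_id[elem_id] = elem.get_text(strip=True)

print(paragraphs_by_id)
# {'first': 'First paragraph.', 'second': 'Second paragraph.'}
```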



## Pandas to the rescue!

The pandas `read_html()` function is a quick and convenient way to turn an HTML table into a pandas DataFrame. This function can be useful for quickly incorporating tables from various websites without figuring out how to scrape the site’s HTML. However, there can be some challenges in cleaning and formatting the data before analyzing it. <img src="https://pbpython.com/images/html-to-pandas-header.png" width="400">

You can find a tutorial [here](https://pbpython.com/pandas-html-table.html).

In [29]:
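import pandas as pd
# Spanish Wikipedia list of Peru's ministers of economy.
URL = "https://es.wikipedia.org/wiki/Anexo:Ministros_de_Econom%C3%ADa_del_Per%C3%BA"
# read_html returns a list with one DataFrame per HTML table found on the page.
MEF = pd.read_html(URL)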

In [36]:
MEF[1].tail(10) # Starting in 1969

| | Titular: Imagen | Titular: Nombre | Partido | Presidente | Periodo |
|---|---|---|---|---|---|
| 39 | | Claudia María Cooper Fort | Independiente | Pedro Pablo Kuczynski | 17 de septiembre de 2017-2 de abril de 2018 |
| 40 | | David Tuesta Cárdenas | Independiente | Martín Vizcarra | 2 de abril de 2018-4 de junio de 2018 |
| 41 | | Carlos Oliva | Independiente | Martín Vizcarra | 7 de junio de 2018-30 de septiembre de 2019 |
| 42 | | María Antonieta Alva | Independiente | Martín Vizcarra | 3 de octubre de 2019-9 de noviembre de 2020 |
| 43 | | José Arista | Alianza Regional Juntos por Amazonas | Manuel Merino | 12 de noviembre de 2020-17 de noviembre de 2020 |
| 44 | | Waldo Mendoza Bellido | Independiente | Francisco Sagasti | 18 de noviembre de 2020 - 29 de julio de 2021 |
| 45 | | Pedro Francke | Nuevo Perú | Pedro Castillo | 30 de julio de 2021 - 1 de febrero de 2022 |
| 46 | | Óscar Graham | Independiente | Pedro Castillo | 1 de febrero de 2022 - 5 de agosto de 2022 |
| 47 | | Kurt Burneo Farfán | Independiente | Pedro Castillo | 5 de agosto de 2022 - 7 de diciembre de 2022 |
| 48 | | Alex Contreras Miranda | Independiente | Dina Boluarte | 10 de diciembre de 2022 - |
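
`read_html` returned this table with a two-level column header, where `Titular` spans `Imagen` and `Nombre`, and an image column that holds no data. As one example of the cleaning mentioned earlier, the sketch below flattens the header and drops the empty column. It uses a small hand-built frame with the last two rows of the output, so it runs without downloading the page. It assumes the leftmost unnamed column is the row index, and `df` and `flat` are names introduced for illustration.

```python
import pandas as pd

# Two-level header mimicking the one returned by read_html.
columns = pd.MultiIndex.from_tuples([
    ("Titular", "Imagen"),
    ("Titular", "Nombre"),
    ("Partido", "Partido"),
    ("Presidente", "Presidente"),
    ("Periodo", "Periodo"),
])

# Last two rows of the output, copied by hand.
df = pd.DataFrame(
    [
        [None, "Kurt Burneo Farfán", "Independiente", "Pedro Castillo",
         "5 de agosto de 2022 - 7 de diciembre de 2022"],
        [None, "Alex Contreras Miranda", "Independiente", "Dina Boluarte",
         "10 de diciembre de 2022 -"],
    ],
    index=[47, 48],
    columns=columns,
)

# Keep only the second, more specific header level.
flat = df.copy()
flat.columns = flat.columns.get_level_values(1)

# Drop the image column, which carries no data.
flat = flat.drop(columns="Imagen")

print(flat.columns.tolist())  # ['Nombre', 'Partido', 'Presidente', 'Periodo']
print(flat)
```

With the real data, the same two steps would be applied to `MEF[1]` instead of the hand-built frame.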
