# A very brief intro to web scraping

Let's start with a [video](https://www.youtube.com/watch?v=Ct8Gxo8StBU).

- A lot of data isn't accessible through data sets or APIs. This data may exist on the internet as web pages, however. One way to access the data without waiting for the provider to create an API is to use a technique called web scraping.

- Web scraping loads a web page into Python so we can extract the information we want. We can then work with the data using standard analysis tools like `pandas` and `numpy`.

- Before we can do web scraping, we need to understand the structure of the web page we're working with and then find a way to extract parts of that structure in a manner that makes sense.

- We'll use the `requests` library often as we learn about web scraping. (This library enables us to download a web page.) We'll also use the `beautifulsoup4` library (imported as `bs4`) to extract the relevant parts of the web page.

## Web page structure

Web pages use HyperText Markup Language (HTML). HTML isn't a programming language like Python. It's a markup language with its own syntax and rules. When a web browser like Chrome or Firefox downloads a web page, it reads the HTML to determine how to render and display it.

Here's the HTML for a very simple web page:

```html
<html>
    <head>
        <title> A simple example page </title>
    </head>
    <body>
        <p> Here is some simple content for this page </p>
    </body>
</html>
```

HTML consists of tags. We open a tag like this:

```html
<p>
```

We close a tag like this:

```html
</p>
```

Anything in between the opening and closing of a tag is the content of that tag. We can nest tags to create complex formatting rules. Here's an example:

```html
<p><b>This is a bold text</b></p>
```

The `b` tag bolds the text inside it, and the `p` tag creates a new paragraph. The HTML above will display as a bold paragraph because the `b` tag is inside the `p` tag. In other words, the `b` tag is nested within the `p` tag.

HTML documents contain a few major sections. The `head` section contains information that's useful to the web browser that's rendering the page. (The user doesn't see it.) The `body` section contains the bulk of the content you will see in your browser.

Different tags have different purposes. For example, the `title` tag tells the browser what to display at the top of your tab. The `p` tag indicates that the content inside it is a single paragraph.
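To make the nesting concrete, here is a minimal sketch that walks the simple page above with Python's built-in `html.parser` module and records how deep each tag sits. The `page` string and the `DepthRecorder` class are names made up for this illustration, and the depth counter assumes every tag is explicitly closed, which holds for this page but not for every real website.

```python
from html.parser import HTMLParser

# The simple example page from above, as a Python string.
page = """
<html>
    <head>
        <title> A simple example page </title>
    </head>
    <body>
        <p> Here is some simple content for this page </p>
    </body>
</html>
"""


class DepthRecorder(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tags = []  # list of (tag name, depth) pairs

    def handle_starttag(self, tag, attrs):
        # Store the tag at the current depth, then go one level deeper.
        self.tags.append((tag, self.depth))
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag brings us back up one level.
        self.depth -= 1


recorder = DepthRecorder()
recorder.feed(page)

# Indent each tag by its depth to show the nesting.
for tag, depth in recorder.tags:
    print("    " * depth + tag)
```

The indented printout mirrors the structure of the HTML: `head` and `body` sit inside `html`, while `title` and `p` sit one level deeper.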

Let's start with a very [simple](https://dataquestio.github.io/web-scraping-pages/simple.html) website.

In [1]:
import requests

In [2]:
response = requests.get("https://www.bcrp.gob.pe/")
content = response.content
content

b'<!DOCTYPE html>\n<html lang="es">\n<head>\n\t<meta charset="utf-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0"/>\n\t\n\n    <link rel="stylesheet" href="/templates/rc_bcrp/css/bootstrap.min.css" rel="stylesheet" />\n\t<!--link href="/css/template-rc.css" rel="stylesheet" /-->\n    <link rel="stylesheet" href="/templates/rc_bcrp/css/rc-main.css?v=1.0.0" media="screen">\n    <!--link rel="stylesheet" type="text/css" href="/templates/rc_bcrp/css/main_ingles.css?v=1.0.0" -->\n\t\n\t<link rel="stylesheet" href="/templates/rc_bcrp/css/indicadores.css" />\n\t<link rel="stylesheet" href="/templates/rc_bcrp/css/overlay-loader.css">\n\t\n    <script src="/templates/rc_bcrp/js/jquery-3.7.0.min.js"></script>\n    <script src="/templates/rc_bcrp/js/bootstrap.bundle.min.js"></script>\n\n    <!--script src="/templates/rc_bcrp/js/buscador.js"></script-->\n    <script src="/templates/rc_bcrp/js/indicadores.js">

In [None]:
# conda install beautifulsoup4

## BeautifulSoup

Downloading the page is the easy part. Let's say that we want to get the text in the first paragraph. Now we need to parse the page and extract the information we want.

We'll use the `BeautifulSoup` library to parse the web page with Python. This library allows us to extract tags from an HTML document.

We can think of HTML documents as "trees," and the nested tags as "branches" (similar to a family tree). BeautifulSoup works the same way.

In our simple page, for example, the root of the "tree" is the `html` tag.

```html
<html>
    <head>
        <title> A simple example page </title>
    </head>
    <body>
        <p> Here is some simple content for this page </p>
    </body>
</html>
```

The `html` tag contains two "branches": `head` and `body`. `head` contains one "branch," `title`, and `body` contains one "branch," `p`. Drilling down through these multiple branches is one way to parse a web page.

To extract the text inside the `p` tag, we need to get the `body` element, then the `p` element, and then finally the text inside the `p` element.

In [11]:
from bs4 import BeautifulSoup

# Initialize the parser, and pass in the content we grabbed earlier.
parser = BeautifulSoup(content, 'html.parser')

# # Get the body tag from the document.
# # Since we passed in the top level of the document to the parser, we need to pick a branch off of the root.
# # With BeautifulSoup, we can access branches by using tag types as attributes.
# body = parser.body

# # Get the p tag from the body.
# p = body.p

# # Print the text inside the p tag.
# # Text is a property that gets the inside text of a tag.
# print(p.text)

# # Get the text from the title tag
# title_text = parser.head.title.text
# print(title_text)
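
The commented-out steps above can be tried without downloading anything by parsing the simple example page from a string. This is only a minimal sketch: the names `simple_page`, `simple_parser`, `body_tag`, and `p_tag` are introduced here for illustration.

```python
from bs4 import BeautifulSoup

# The simple example page, written out as a string.
simple_page = """
<html>
    <head>
        <title> A simple example page </title>
    </head>
    <body>
        <p> Here is some simple content for this page </p>
    </body>
</html>
"""

simple_parser = BeautifulSoup(simple_page, "html.parser")

# Walk down the tree: html -> body -> p.
body_tag = simple_parser.body
p_tag = body_tag.p

# .text returns the text inside the tag, including the surrounding spaces.
print(p_tag.text)

# The same approach reaches the title inside the head.
title_text = simple_parser.head.title.text
print(title_text)
```

Accessing a tag as an attribute returns the first tag of that type, so this style works best on small pages where the path to the element is unambiguous.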


## Applying methods

Using the tag type as a property is not always the best way to parse a document. It's usually better to be more specific by using the `find_all` method. This method will find all occurrences of a tag in the current element, and return a list.

If we only want the first occurrence of an item, we'll need to index the list to get it. Aside from this difference, the process is the same as passing in the tag type as an attribute.
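
As a small illustration of the difference, the sketch below parses a made-up HTML snippet (the `links_page` string is an assumption for this example, not a real site). It also uses `find`, a BeautifulSoup method that returns only the first match, so no indexing is needed.

```python
from bs4 import BeautifulSoup

# A made-up page with three links, only for illustration.
links_page = """
<html>
    <body>
        <a href="/one">First link</a>
        <a href="/two">Second link</a>
        <a href="/three">Third link</a>
    </body>
</html>
"""

links_parser = BeautifulSoup(links_page, "html.parser")

# find_all returns a list-like collection of every matching tag.
all_links = links_parser.find_all("a")
print(len(all_links))

# Index the result to get one element, as with any list.
third_link = all_links[2]
print(third_link.text)

# find returns only the first match (or None if nothing matches).
first_link = links_parser.find("a")
print(first_link["href"])
```

Because `find` returns `None` when nothing matches, it is worth checking the result before reading `.text` from it on pages we don't control.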

In [20]:
# Get a list of all occurrences of the body tag in the element.
body = parser.find_all("body")

# Get every link (a tag) inside the body.
a = body[0].find_all("a")

# Print the text of the third link.
print(a[2].text)


          Sign in
        


In [17]:
print(a)

[<a class="px-2 py-4 color-bg-accent-emphasis color-fg-on-emphasis show-on-focus js-skip-to-content" href="#start-of-content">Skip to content</a>, <a aria-label="Homepage" class="mr-lg-3 color-fg-inherit flex-order-2" data-ga-click="(Logged out) Header, go to homepage, icon:logo-wordmark" href="https://github.com/">
<svg aria-hidden="true" class="octicon octicon-mark-github" data-view-component="true" height="32" version="1.1" viewbox="0 0 16 16" width="32">
<path d="M8 0c4.42 0 8 3.58 8 8a8.013 8.013 0 0 1-5.45 7.59c-.4.08-.55-.17-.55-.38 0-.27.01-1.13.01-2.2 0-.75-.25-1.23-.54-1.48 1.78-.2 3.65-.88 3.65-3.95 0-.88-.31-1.59-.82-2.15.08-.2.36-1.02-.08-2.12 0 0-.67-.22-2.2.82-.64-.18-1.32-.27-2-.27-.68 0-1.36.09-2 .27-1.53-1.03-2.2-.82-2.2-.82-.44 1.1-.16 1.92-.08 2.12-.51.56-.82 1.28-.82 2.15 0 3.06 1.86 3.75 3.64 3.95-.23.2-.44.55-.51 1.07-.46.21-1.61.55-2.33-.66-.15-.24-.6-.83-1.23-.82-.67.01-.27.38.01.53.34.19.73.9.82 1.13.16.45.68 1.31 2.69.94 0 .67.01 1.3.01 1.49 0 .21-.15.45-.55.

## Element IDs

HTML allows elements to have IDs. Because an ID should be unique within a page, we can use it to refer to a specific element.

Here's an example page:

```html
<html>
    <head>
        <title> A simple example page </title>
    </head>
    <body>
        <div>
            <p id="first">
                First paragraph
            </p>
        </div>
        <p id="second">
            <b>
                Second paragraph
            </b>
        </p>
    </body>
</html>
```

HTML uses the `div` tag to create a divider that splits the page into logical units. We can think of a divider as a "box" that contains content. For example, different dividers hold a web page's footer, sidebar, and horizontal menu.

There are two paragraphs on this page. The first is nested inside a `div`. Luckily, the paragraphs have IDs. This means we can access them easily, even though they're nested.
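
Here is a minimal offline sketch of the ID lookup, assuming the example page above is stored in a string. The names `ids_page` and `ids_parser` are made up for this illustration. It also uses `get_text(strip=True)`, which removes the indentation and newlines that would otherwise surround the text.

```python
from bs4 import BeautifulSoup

# The example page with IDs, written out as a string.
ids_page = """
<html>
    <head>
        <title> A simple example page </title>
    </head>
    <body>
        <div>
            <p id="first">
                First paragraph
            </p>
        </div>
        <p id="second">
            <b>
                Second paragraph
            </b>
        </p>
    </body>
</html>
"""

ids_parser = BeautifulSoup(ids_page, "html.parser")

# Look up each paragraph by its ID, regardless of how deeply it is nested.
first = ids_parser.find("p", id="first")
second = ids_parser.find("p", id="second")

# get_text(strip=True) strips the whitespace around each piece of text.
first_text = first.get_text(strip=True)
second_text = second.get_text(strip=True)
print(first_text)
print(second_text)
```

The lookup is the same for both paragraphs even though one sits inside a `div` and the other wraps its text in a `b` tag.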

In [21]:
# Get the page content and set up a new parser.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple_ids.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')


# Pass in the ID attribute to only get the element with that specific ID.
first_paragraph = parser.find_all("p", id="first")[0]
print(first_paragraph.text)


                First paragraph.
            


In [22]:
for elem in parser.find_all("p"):
    print(elem.text)


                First paragraph.
            


                Second paragraph.
            



## Pandas to the rescue!

The pandas `read_html()` function is a quick and convenient way to turn an HTML table into a pandas DataFrame. This function can be useful for quickly incorporating tables from various websites without figuring out how to scrape the site’s HTML. However, there can be some challenges in cleaning and formatting the data before analyzing it. <img src="https://pbpython.com/images/html-to-pandas-header.png" width="400">

You can find a tutorial [here](https://pbpython.com/pandas-html-table.html).
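
To see what `read_html()` returns without depending on a live website, the sketch below feeds it a tiny made-up table. The `table_html` string and its values are invented for this illustration. It assumes an HTML parser library such as `lxml` is installed, which `read_html()` needs, and it wraps the string in `StringIO` because recent pandas versions expect a file-like object rather than a raw HTML string.

```python
from io import StringIO

import pandas as pd

# A made-up table, only for illustration.
table_html = """
<table>
    <tr><th>Item</th><th>Count</th></tr>
    <tr><td>apples</td><td>3</td></tr>
    <tr><td>pears</td><td>5</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(table_html))
print(len(tables))

# Pick the table we want by its position in the list.
items = tables[0]
print(items)
```

This is why the cells below index the result (`MEF[1]`, for example): a page with several tables produces a list with several DataFrames.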

In [3]:
import pandas as pd 
URL = "https://es.wikipedia.org/wiki/Anexo:Ministros_de_Econom%C3%ADa_del_Per%C3%BA"
MEF = pd.read_html(URL) 

In [5]:
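# BCRP is not defined in the cells shown here; it is assumed to hold
# the list of tables returned by an earlier pd.read_html call.
BCRP[0]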

|   | Unnamed: 0 | Ene. 10 | Ene. 09 |
|---|------------|---------|---------|
| 0 | Mínimo     | 36910   | 36960   |
| 1 | Máximo     | 37100   | 37140   |
| 2 | Promedio   | 37002   | 37060   |


In [26]:
MEF[1] # Starting in 1969

|   | Titular / Imagen | Titular / Nombre | Partido | Presidente | Periodo |
|---|------------------|------------------|---------|------------|---------|
| 0 |  | Francisco Morales Bermúdez | s/p | Juan Velasco Alvarado | 1 de marzo de 1969-2 de enero de 1974 |
| 1 |  | Guillermo Marcó del Pont | s/p | Juan Velasco Alvarado | 2 de enero de 1974-18 de julio de 1974 |
| 2 |  | Amílcar Vargas Gavilano | s/p | Juan Velasco Alvarado | 18 de julio de 1974-30 de agosto de 1975 |
| 3 |  | Luis Barúa Castañeda | Independiente | Francisco Morales Bermúdez | 2 de septiembre de 1975-16 de mayo de 1977 |
| 4 |  | Walter Piazza Tanguis | Independiente | Francisco Morales Bermúdez | 16 de mayo de 1977-6 de julio de 1977 |
| 5 |  | Alcibiades Sáenz Barsallo | s/p | Francisco Morales Bermúdez | 6 de julio de 1977-15 de mayo de 1978 |
| 6 |  | Javier Silva Ruete | Independiente | Francisco Morales Bermúdez | 15 de mayo de 1978-28 de julio de 1980 |
| 7 |  | Manuel Ulloa Elías | Acción Popular | Fernando Belaúnde Terry | 28 de julio de 1980-3 de enero de 1983 |
| 8 |  | Carlos Rodríguez Pastor Mendoza | Independiente | Fernando Belaúnde Terry | 3 de enero de 1983-21 de marzo de 1984 |
| 9 |  | José Benavides Muñoz | Acción Popular | Fernando Belaúnde Terry | 21 de marzo de 1984-29 de enero de 1985 |
