# 1. About the process of Web Scraping

Web scraping is the process of extracting data from websites. It allows analysts to automatically collect large volumes of data, which can then be processed and analyzed. This data is essential for deploying statistical models, deriving insights, and informing decision-making processes. Web scraping fits into this workflow as follows:

1. **Identification of Data Sources:**
   Determine the web sources that contain relevant and valuable data for the model.

2. **Web Scraping for Data Collection:**
   Programmatically extract data from identified websites, focusing on data that can provide meaningful features for the model.

3. **Data Cleaning and Transformation:**
   Clean the extracted data to remove noise and irrelevant information. Transform the data into a structured format suitable for modeling. Perform feature engineering to enhance the dataset and improve model performance.

4. **Data Integration:**
   Integrate scraped data with other relevant datasets to create a comprehensive dataset for modeling.

5. **Model Development and Training:**
   Use statistical techniques and machine learning algorithms to develop models based on the prepared dataset. Train models with a focus on predictive accuracy, generalization, and adherence to econometric principles.

6. **Model Evaluation and Validation:**
   Evaluate the model using appropriate metrics to ensure its reliability and validity in real-world scenarios. Perform validation techniques like cross-validation to test the model's robustness.

7. **Model Deployment and Application:**
   Deploy the model for practical use, such as forecasting, trend analysis, or decision support. Continuously monitor and update the model with new data to maintain its accuracy and relevance.


The web scraping process encompasses the following steps:

| Step | Description |
|------|-------------|
| 1. Identification of the website | Involves knowing the required information and verifying that the selected website contains this information. |
| 2. Inspection of the structure | Entails exploring and inspecting the structure and organization of the website's HTML code. |
| 3. Selection of the tool | Involves choosing a tool to carry out the web scraping process. In Python, popular tools include BeautifulSoup and Selenium. |
| 4. Writing the code | Using the selected tool, instructions are written to access and download information from the web page. |
| 5. Testing | Although the first script may seem functional, errors and unforeseen issues often arise, necessitating iterative testing and troubleshooting to reach the most effective solution. |
| 6. Data extraction | Once the final version of the script is written, it is run to completion to extract the data. |
| 7. Data storage or Processing | Involves organizing and processing the information so that it is ready for use. |


# 2. Fundamentals of HTML and CSS

Web scraping involves extracting data from websites, and most of the web's content is structured using HTML and styled using CSS. Understanding these languages enables us to identify and select the specific data we need efficiently.

**HTML (HyperText Markup Language)**:
   - HTML is the standard markup language for creating web pages. It structures web content and lays the foundation for data extraction.

**CSS (Cascading Style Sheets)**:
   - CSS is used for designing and customizing the appearance of web pages. CSS selectors are vital in web scraping for pinpointing specific elements from which data will be scraped.

## 2.1. HTML common structures

In [10]:
from IPython.display import HTML

### 2.1.1. Tables (`<table>`, `<tr>`, `<td>`, `<th>`)

Tables often hold structured data, ideal for scraping.

In [2]:
tables = """
<table>
    <tr>
        <th>Name</th>
        <th>Email</th>
    </tr>
    <tr>
        <td>John Doe</td>
        <td>johndoe@example.com</td>
    </tr>
    <tr>
        <td>Jane Smith</td>
        <td>janesmith@example.com</td>
    </tr>
</table>
"""

HTML( tables )

Name,Email
John Doe,johndoe@example.com
Jane Smith,janesmith@example.com
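To see why tables are convenient to scrape, here is a minimal sketch that turns the same markup into Python lists. It uses the standard-library `html.parser` as a stand-in for the scraping tools named earlier, so it runs without extra installs; `TableParser` and `sample_table` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows
        self._row = None       # row being built, or None outside <tr>
        self._in_cell = False  # True while inside <th> or <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between tags is ignored.
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


sample_table = """
<table>
    <tr><th>Name</th><th>Email</th></tr>
    <tr><td>John Doe</td><td>johndoe@example.com</td></tr>
    <tr><td>Jane Smith</td><td>janesmith@example.com</td></tr>
</table>
"""

parser = TableParser()
parser.feed(sample_table)
header, *records = parser.rows
print(header)
print(records)
```

The first row carries the `<th>` cells and becomes the header; the remaining rows are the records.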


### 2.1.2. Lists (`<ul>`, `<ol>`, `<li>`)

Lists are used for grouping similar items, often found in menus or summary sections.

In [3]:
lists = """
<ul>
    <li>Home</li>
    <li>About</li>
    <li>Contact</li>
</ul>
"""

HTML( lists )

### 2.1.3. Divisions (`<div>`)

`<div>` elements are containers that divide the webpage into sections, often used for layout.

In [4]:
divisions = """
<div class="article">
    <h2>Article Title</h2>
    <p>Article content...</p>
</div>
"""

HTML( divisions )

### 2.1.4. Anchors (`<a>`)

Anchor tags define hyperlinks and are crucial for navigating and extracting links.

In [5]:
anchors = """
<a href="https://apps5.mineco.gob.pe/transparencia/Navegador/default.aspx?y=2023&ap=ActProy">Consulta amigable</a>
"""
HTML( anchors )
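As a rough illustration of extracting links, the sketch below collects the `href` of each anchor and splits the URL into its parts. It relies only on the standard library (`html.parser` and `urllib.parse`) rather than a scraping framework; `LinkParser` and `sample_anchor` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, parse_qs


class LinkParser(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


sample_anchor = (
    '<a href="https://apps5.mineco.gob.pe/transparencia/Navegador/'
    'default.aspx?y=2023&ap=ActProy">Consulta amigable</a>'
)

link_parser = LinkParser()
link_parser.feed(sample_anchor)

# Split the first link into host and query parameters.
first_link = urlsplit(link_parser.links[0])
print(first_link.netloc)
print(parse_qs(first_link.query))
```

Seeing the query parameters as a dictionary makes it easier to spot which parts of a URL select the year or the type of query.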

### 2.1.5. Tags summary

| Tag | Description |
|---|---|
| `<html>` | Defines the start and end of an HTML document |
| `<head>` | Contains metadata about the HTML document |
| `<title>` | Defines the title of the HTML document |
| `<body>` | Contains the visible content of the HTML document |
| `<span>` | Defines an inline section or a fragment of text |
| `<h1>` to `<h6>` | Define HTML headings in order of importance |
| `<p>` | Defines a paragraph |
| `<a>` | Creates a link to another page |
| `<img>` | Displays an image in the document |
| `<table>` | Creates a table with rows and columns |
| `<tr>` | Defines a row in a table |
| `<th>` | Defines a header cell in a table |
| `<td>` | Defines a data cell in a table |
| `<ul>` | Creates an unordered list |
| `<ol>` | Creates an ordered list |
| `<li>` | Defines a list item |
| `<input>` | Creates an input field in a form |
| `<button>` | Creates an interactive button |
| `<option>` | Defines an option in a drop-down menu |
| `<label>` | Labels a form element |
| `<iframe>` or `<frame>` | Embeds content from another HTML document within the page |

## 2.2. CSS common structures

### 2.2.1. Classes and IDs

An ID identifies a single element on a page, while a class groups elements that share a role or style. Both are among the most convenient hooks for selecting specific items.

In [7]:
classes = """
<div class="user-profile">
    <span id="username">JohnD</span>
</div>
"""
HTML( classes )

### 2.2.2. Pseudo-Classes (`:nth-child`, `:first-child`, `:last-child`)

Pseudo-classes are used to target elements based on their state or position.


```
li:first-child { font-weight: bold; }
li:nth-child(2) { color: blue; }
```
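Position-based selection also works when extracting data, not only when styling. The sketch below uses the limited XPath subset of the standard-library `xml.etree.ElementTree` to pick list items by position. This is an illustration on a small, well-formed snippet (real pages usually need a more tolerant parser), and `menu` is a hypothetical name. Note that `li[1]` counts only among `<li>` siblings, which matches `:first-child` here because every child is an `<li>`.

```python
import xml.etree.ElementTree as ET

# Well-formed snippet, so the XML parser can read it.
menu = ET.fromstring("<ul><li>Home</li><li>About</li><li>Contact</li></ul>")

first_item  = menu.find("li[1]").text        # like li:first-child
second_item = menu.find("li[2]").text        # like li:nth-child(2)
last_item   = menu.find("li[last()]").text   # like li:last-child

print(first_item, second_item, last_item)
```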

### 2.2.3. Attribute Selectors

Attribute selectors target elements based on their attributes, like href in anchors.

```
a[href*="profile"] { color: green; }
```
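The same idea can be used to filter elements by attribute when extracting data. The following sketch, again based on the standard-library `xml.etree.ElementTree` and a made-up `<nav>` snippet, shows an exact attribute match and a substring match; `page`, `about_links` and `profile_links` are hypothetical names.

```python
import xml.etree.ElementTree as ET

page = ET.fromstring(
    "<nav>"
    '<a href="/profile/jane">Jane</a>'
    '<a href="/about">About</a>'
    '<a href="/profile/john">John</a>'
    "</nav>"
)

# Exact match on an attribute, like the CSS selector a[href="/about"].
about_links = [a.text for a in page.findall(".//a[@href='/about']")]

# Substring match, like a[href*="profile"]. It is done in Python because
# ElementTree's XPath subset has no contains() function.
profile_links = [
    a.get("href") for a in page.iter("a") if "profile" in a.get("href", "")
]

print(about_links)
print(profile_links)
```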

## 2.3. Full example

In [8]:
full_example = """
<!DOCTYPE html>
<html>
<head>
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0-beta3/css/all.min.css">
    <style>
        body { font-family: Arial, sans-serif; }
        .container { 
            width: 80%; 
            margin: auto;
        }
        .header-image { 
            width: 100%;
            max-width: 1220px;
            height: 200px;
            background-image: url('https://drive.google.com/uc?export=view&id=1QYjJ5NNxcLieeDLYLlZoqkOLbuOn6foV'); 
            background-size: contain;
            background-repeat: no-repeat;
            background-position: center;
            margin: auto;
        }       
        .news-title-center { 
            text-align: center;
        }
        .news-item { 
            display: flex; 
            align-items: flex-start; 
            margin-bottom: 20px; 
            border-bottom: 1px solid #FF6F61; /* Coral line */
            padding-bottom: 20px;
        }
        .news-image { width: 25%; height: auto; margin-right: 20px; }
        .news-content { width: 75%; }
        .news-title { 
            font-size: 18px; 
            font-weight: bold;
            color: #0047AB; /* Cobalt blue */
            text-align: justify;
        }
        .news-summary { 
            font-size: 14px; 
            margin-top: 10px;
            text-align: justify;
        }
        .news-link { 
            text-decoration: none; 
            color: blue; 
            margin-top: 10px; 
            display: inline-block; 
            font-weight: bold;
        }
        .contact-section { 
            margin-top: 30px; 
            font-size: 14px; 
            background-color: #E0F7FA;
            padding: 15px;
            text-align: center;
        }
        .contact-detail { margin-top: 10px; }
        .social-links { list-style: none; padding: 0; }
        .social-links li { display: inline; margin-right: 10px; }
        .social-links .fa-twitter { color: #1DA1F2; }
        .social-links .fa-facebook { color: #3b5998; }
        .social-links .fa-linkedin { color: #0077b5; }
        .social-links .fa-youtube { color: #FF0000; }
        .social-links .fa-instagram { color: #E4405F; }
        .footer { 
            margin-top: 30px; 
            font-size: 12px; 
            text-align: center; 
            padding-top: 20px;
            border-top: 1px solid #ddd;
        }
    </style>
</head>
<body>
    <div class="container">
        <h1 class="news-title-center">Boletín de Noticias</h1>
        <h2 class='news-title-center'>Sección 1</h2>
        <div class="news-item">
            <img src="http://a.files.bbci.co.uk/worldservice/live/assets/images/2014/09/22/140922074747_museo_britanico_144x81_getty_nocredit.jpg" class="news-image">
            <div class="news-content">
                <div class="news-title">Recrearán el Museo Británico en el video juego Minecraft</div>
                <div class="news-summary">El proyecto es parte del programa Futuro del museo, que tiene el objetivo de llegar a un público más amplio. Se espera que la primera etapa de construcción digital se complete para el 16 de octubre.</div>
                <a href="http://www.bbc.co.uk/mundo/ultimas_noticias/2014/09/140922_ultnot_museo_britanico_minecraft_nc.shtml" class="news-link">Leer más</a>
            </div>
        </div>
        
        <div class="news-item">
            <img src="https://elcomercio.pe/resizer/HVHyfa9QmGL2em6SI76sK6PksK4=/cloudfront-us-east-1.images.arcpublishing.com/elcomercio/HBZ36MMTDFFGHEYGD2MFCQ2ONU.jpg" class="news-image">
            <div class="news-content">
                <div class="news-title">Línea 1 del Metro de Lima suspende servicios tras reportar fallas en el sistema</div>
                <div class="news-summary">El Metro de Lima recomendó a los pasajeros seguir las indicaciones del personal de estaciones.</div>
                <a href="https://elcomercio.pe/lima/transporte/linea-1-del-metro-de-lima-suspende-servicios-tras-reportar-fallas-en-el-sistema-ultimas-noticia/" class="news-link">Leer más</a>
            </div>
        </div>
        <h2 class='news-title-center'>Sección 2</h2>
        <div class="news-item">
            <img src="http://a.files.bbci.co.uk/worldservice/live/assets/images/2014/09/23/140923122531_sp_siria_reuters_144.jpg" class="news-image">
            <div class="news-content">
                <div class="news-title">Boletín: Primer ataque de EE.UU. y aliados árabes contra Estado Islámico en Siria y otras noticias</div>
                <div class="news-summary">Además, Israel mata a dos palestinos que presuntamente asesinaron a tres jóvenes israelíes en junio, OMS advierte que para noviembre podría haber 20.000 casos de ébola, y arranca cumbre del clima de la ONU en Nueva York. La actualidad en 1 minuto.</div>
                <a href="http://www.bbc.co.uk/mundo/video_fotos/2014/09/140923_video_boletin_noticias_re.shtml" class="news-link">Leer más</a>
            </div>
        </div>
        
        <div class="footer">
            Este boletín se genera automáticamente. Por favor, no responda a este correo electrónico.
        </div>
    </div>
</body>
</html>
"""

HTML( full_example )

## 3.1. Getting started with Selenium

To begin working with Selenium, it is essential to install the required libraries. To prevent conflicts with other libraries, it's advisable to create a new environment when working on a project. In Anaconda Prompt, execute the following commands to create and activate a new environment named "wb":

```bash
conda create -n wb
conda activate wb
```

Then we can install the required libraries by running the following commands:


In [9]:
!pip install openpyxl
!pip install pandas
!pip install numpy
!pip install lxml
!pip install unidecode



In [11]:
# Install Selenium (uncomment to run)
# !pip install selenium

# Install Webdriver Manager (uncomment to run)
# !pip install webdriver-manager

# Check the installed Selenium version
import selenium
selenium.__version__

'4.26.1'

In [12]:
# Optionally, update Selenium to the latest version
# !pip install -U selenium

## 3.2. Web driver

A Web Driver serves as a bridge between the code and the browser, enabling automation of actions such as clicking buttons, filling out forms, and extracting information. Initializing the Web Driver in Python involves creating an instance of the WebDriver for the desired browser, such as Chrome or Firefox, and then using methods to facilitate navigation and manipulation of elements on the web page.

In [12]:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

In [19]:
url     = 'https://apps5.mineco.gob.pe/transparencia/Navegador/default.aspx?y=2023&ap=Proyecto'
driver  = webdriver.Chrome()        
driver.get( url )

We can maximize the window. This matters because some web elements are only visible or interactable when the browser window is large enough.

In [23]:
driver.maximize_window()

In [18]:
#driver.quit()

## 3.3. First operations

We can obtain and store the web page title:

In [20]:
driver.title

'Consulta Amigable - Navegador'

We can also get the current URL:

In [21]:
driver.current_url

'https://apps5.mineco.gob.pe/transparencia/Navegador/default.aspx?y=2023&ap=Proyecto'

And even take a screenshot:

In [30]:
driver.save_screenshot(r'C:\Users\alexa\OneDrive\Documentos\Session1\screenshot.png')

True

We can also refresh the page:

In [31]:
driver.refresh()

In [32]:
driver.quit()

We can also use headless mode, which runs browser automation tasks without opening a visible browser window. It is useful for saving system resources and running scripts faster, especially on servers that have no display.

In [13]:
from selenium.webdriver.chrome.options import Options

In [34]:
url     = 'https://apps5.mineco.gob.pe/transparencia/Navegador/default.aspx?y=2023&ap=Proyecto'
options = Options()
options.add_argument( '--headless' )

driver  = webdriver.Chrome( options = options )
driver.get( url )

Page properties and elements remain accessible in headless mode:

In [35]:
driver.title

'Consulta Amigable - Navegador'

In [36]:
driver.quit()

## 3.4. Location methods

In Selenium, location methods are used to find elements on a web page. These methods allow you to select elements based on their HTML attributes, such as ID, name, XPath, tag name, class name, or CSS selector. Using the right locator ensures that your automation script interacts with the correct elements on the page.

|Method|Description|
|---|---|
|`find_element( By.XPATH, "xpath" )` | Finds an element by its XPath|
|`find_element( By.ID, "id" )` | Finds an element by its `id` attribute|
|`find_element( By.NAME, "name" )` | Finds an element by its `name` attribute|
|`find_element( By.TAG_NAME, "tag name" )` | Finds an element by its HTML tag|
|`find_element( By.CLASS_NAME, "class name" )` | Finds an element by its class name|
|`find_element( By.CSS_SELECTOR, "css selector" )`| Finds an element by a CSS selector|

We will try these location methods on the Congressional Bills page.

In [17]:
import pandas as pd
import numpy as np
from selenium.webdriver.common.by import By
from io import StringIO
import os
import time
import requests

In [36]:
url     = 'https://wb2server.congreso.gob.pe/spley-portal/#/expediente/search'
driver  = webdriver.Chrome()        
driver.get( url )
driver.maximize_window()

We can extract the Congressional Bills table statically, that is, only from the page currently displayed:

In [38]:
table_element = driver.find_element( By.XPATH, '/html/body/app-root/app-publico/div[3]/app-search/section/div/div/p-table/div/div/table' )
table_html    = table_element.get_attribute( 'outerHTML' )
table_html_io = StringIO( table_html )
table_df      = pd.read_html( table_html_io )[ 0 ]

In [39]:
path_static = r'C:\Users\alexa\OneDrive\Documentos\Session1'
os.makedirs( path_static, exist_ok = True )

path_table = path_static + '/Congressional_Bills.xlsx'
table_df.to_excel( path_table, index = False )

In [40]:
table_df.head( 5 )

Unnamed: 0,PROYECTOS DE LEY,FECHA DE PRESENTACIÓN,TÍTULO,ESTADO PROCESAL,PROPONENTE,AUTORES
0,Proyecto de Ley 09537/2024-CR,Fecha de Presentación19/11/2024,TítuloLEY DE FOMENTO AL BIENESTAR ANIMAL Y PRO...,Estado ProcesalPRESENTADO,ProponenteCongreso,"AutoresOlivos Martínez, Vivian Castillo Rivas,..."
1,Proyecto de Ley 09536/2024-CR,Fecha de Presentación19/11/2024,TítuloLEY QUE MODIFICA EL ARTÍCULO 26 DE LA LE...,Estado ProcesalPRESENTADO,ProponenteCongreso,"AutoresUgarte Mamani, Jhakeline Katy Gutiérrez..."
2,Proyecto de Ley 09535/2024-CR,Fecha de Presentación19/11/2024,TítuloLEY QUE DECLARA DE INTERÉS NACIONAL Y NE...,Estado ProcesalPRESENTADO,ProponenteCongreso,"AutoresTrigozo Reátegui, Cheryl Muñante Barrio..."
3,Proyecto de Ley 09534/2024-CR,Fecha de Presentación19/11/2024,TítuloLEY QUE PROPONE EL ACUERDO PATRIMONIAL S...,Estado ProcesalPRESENTADO,ProponenteCongreso,"AutoresMuñante Barrios, Alejandro Trigozo Reát..."
4,Proyecto de Ley 09533/2024-CR,Fecha de Presentación19/11/2024,TítuloLEY QUE ESTABLECE EL RÉGIMEN DEL TRABAJO...,Estado ProcesalPRESENTADO,ProponenteCongreso,"AutoresBazán Narro, Sigrid Tesoro Cortez Aguir..."
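In the output above, each cell still carries its column label as a prefix (for example `Fecha de Presentación19/11/2024`). A minimal cleaning sketch, assuming the prefix always equals the column name up to letter case, could look like this. `strip_label` and `row` are hypothetical names, and the row below is a hand-copied subset of the first record.

```python
def strip_label(value, label):
    """Remove a leading column label from a cell, ignoring case."""
    if value.lower().startswith(label.lower()):
        return value[len(label):].strip()
    return value


# One row shaped like the scraped output (only some columns shown).
row = {
    "PROYECTOS DE LEY": "Proyecto de Ley 09537/2024-CR",
    "FECHA DE PRESENTACIÓN": "Fecha de Presentación19/11/2024",
    "ESTADO PROCESAL": "Estado ProcesalPRESENTADO",
    "PROPONENTE": "ProponenteCongreso",
}

clean_row = {column: strip_label(cell, column) for column, cell in row.items()}
print(clean_row)
```

The first column is left untouched because its text does not start with the exact column name. On a DataFrame, the same function could be applied column by column, for instance with `Series.map`.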


We can also extract the Congressional Bills tables dynamically, moving through the first five pages:

In [41]:
path_dynamic = path_static + '/dynamic'
os.makedirs( path_dynamic, exist_ok = True )

for index in range( 1, 6 ):

    # Read the table currently displayed on the page
    table_element = driver.find_element( By.XPATH, '/html/body/app-root/app-publico/div[3]/app-search/section/div/div/p-table/div/div/table' )
    table_html    = table_element.get_attribute( 'outerHTML' )
    table_html_io = StringIO( table_html )
    table_df_d    = pd.read_html( table_html_io )[ 0 ]

    # Move to the next page and give it time to load
    next_button   = driver.find_element( By.XPATH, '/html/body/app-root/app-publico/div[3]/app-search/section/div/div/p-table/div/p-paginator[1]/div/button[3]' )
    next_button.click()

    time.sleep( 2 )

    # Save the table that was read before moving on
    path_table = path_dynamic + f'/Congressional_Bills_n_{ index }.xlsx'
    table_df_d.to_excel( path_table, index = False )
    
    print( f'Success at table: { index }' )

Success at table: 1
Success at table: 2
Success at table: 3
Success at table: 4
Success at table: 5
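A fixed `time.sleep( 2 )` can be too short on a slow connection and wasteful on a fast one. Selenium provides `WebDriverWait` for this purpose; the sketch below only illustrates the underlying polling idea in plain Python so that it runs without a browser. `wait_until` and `table_is_ready` are hypothetical names, and the fake condition stands in for a real page check.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.25):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Stand-in for a page that finishes loading on the third check.
checks = {"count": 0}


def table_is_ready():
    checks["count"] += 1
    return "table loaded" if checks["count"] >= 3 else None


status = wait_until(table_is_ready, timeout=2.0, poll=0.01)
print(status)
```

With a live driver, the condition could wrap a `find_elements` call, which returns an empty (falsy) list until the element exists.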



To determine the total number of pages, we jump to the last page and read the text of the last numbered pagination button.

In [42]:
last_page_button = driver.find_element( By.XPATH, '/html/body/app-root/app-publico/div[3]/app-search/section/div/div/p-table/div/p-paginator[1]/div/button[4]' )
last_page_button.click()

time.sleep( 2 )

last_pagination_button = driver.find_element( By.XPATH, '/html/body/app-root/app-publico/div[3]/app-search/section/div/div/p-table/div/p-paginator[1]/div/span[2]/button[5]' )
n_pagination_buttons   = last_pagination_button.text
print( f'N. Pagination Buttons is: { n_pagination_buttons }' )

N. Pagination Buttons is: 191


In [43]:
driver.quit()

We can download the PDF documents for Congressional Bills as well. We start by importing the `Options` class:

In [44]:
from selenium.webdriver.chrome.options import Options

We then retrieve the ID for each bill listed in `table_df`. Additionally, we establish the base URL.

In [45]:
bills_id = table_df[ 'PROYECTOS DE LEY' ].apply( lambda x: x.split( 'Ley ' )[ 1 ].split( '/' )[ 0 ] ).tolist()

base_url   = 'https://wb2server.congreso.gob.pe/spley-portal/#/expediente/2021/'

In [46]:
path_pdf = path_static + '/pdf_docs'
os.makedirs( path_pdf, exist_ok = True )

In [55]:
n_items =  5

for i, bill in enumerate( bills_id[ :n_items ] ):
    bill_url = base_url + bill
    print( f'Base: { base_url }' )
    print( f'Generated: { bill_url }' )

Base: https://wb2server.congreso.gob.pe/spley-portal/#/expediente/2021/
Generated: https://wb2server.congreso.gob.pe/spley-portal/#/expediente/2021/09537
Base: https://wb2server.congreso.gob.pe/spley-portal/#/expediente/2021/
Generated: https://wb2server.congreso.gob.pe/spley-portal/#/expediente/2021/09536
Base: https://wb2server.congreso.gob.pe/spley-portal/#/expediente/2021/
Generated: https://wb2server.congreso.gob.pe/spley-portal/#/expediente/2021/09535
Base: https://wb2server.congreso.gob.pe/spley-portal/#/expediente/2021/
Generated: https://wb2server.congreso.gob.pe/spley-portal/#/expediente/2021/09534
Base: https://wb2server.congreso.gob.pe/spley-portal/#/expediente/2021/
Generated: https://wb2server.congreso.gob.pe/spley-portal/#/expediente/2021/09533
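Chained `split` calls raise an `IndexError` as soon as a cell does not contain the expected text. As a hedged alternative, a regular expression can extract the bill number and return `None` when the label does not match. The pattern assumes labels shaped like `Proyecto de Ley 09537/2024-CR`, as in the rows shown earlier; `BILL_PATTERN`, `parse_bill` and `labels` are hypothetical names.

```python
import re

# "Proyecto de Ley 09537/2024-CR" -> number "09537", year "2024"
BILL_PATTERN = re.compile(r"(\d+)/(\d{4})-CR")


def parse_bill(label):
    """Return (number, year) from a bill label, or None if it does not match."""
    match = BILL_PATTERN.search(label)
    return match.groups() if match else None


labels = ["Proyecto de Ley 09537/2024-CR", "Sin número"]
parsed = [parse_bill(label) for label in labels]
print(parsed)
```

Rows that return `None` can then be logged or skipped instead of stopping the whole run.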


In [57]:
chrome_options = Options()

prefs = {
    'download.default_directory'        : r'C:\Users\alexa\OneDrive\Documentos\Session1',
    'download.prompt_for_download'      : False,  # Skip the confirmation dialog before each download.
    'download.directory_upgrade'        : True,   # Allow the download directory to be updated if it changes.
    'plugins.always_open_pdf_externally': True    # Download PDF files instead of opening them in Chrome's built-in viewer.
}

chrome_options.add_experimental_option( 'prefs', prefs )

driver = webdriver.Chrome( options = chrome_options )
url = 'https://wb2server.congreso.gob.pe/spley-portal/#/expediente/search'
driver.get(url)
driver.maximize_window()

n_items = 8

for i, bill in enumerate( bills_id[ :n_items ] ):
    bill_url = base_url + bill
    driver.get( bill_url )

    time.sleep( 5 )

    try:
        pdf_button = driver.find_element( By.XPATH, '//*[@id="p-tabpanel-0"]/p-table/div/div/table/tbody/tr/td[5]/button' )
        pdf_button.click()

        print( f'Success downloading bill: { bill }' )
        
    except Exception as e:
       
        print( f'We did not find PDF button for { bill_url }: {e}' )

    time.sleep( 5 )

    if i == ( n_items - 1 ):

        time.sleep( 10 )

driver.quit()

Success downloading bill: 09537
Success downloading bill: 09536
Success downloading bill: 09535
Success downloading bill: 09534
Success downloading bill: 09533
Success downloading bill: 09532
Success downloading bill: 09531
Success downloading bill: 09530


Now let's try to scrape the SIAF web page

In [65]:
url     = 'https://apps5.mineco.gob.pe/transparencia/Navegador/default.aspx?y=2023&ap=ActProy'
driver  = webdriver.Chrome()        
driver.get( url )
driver.maximize_window()

In [66]:
government_levels_button = driver.find_element( By.XPATH, '//*[@id="ctl00_CPH1_BtnTipoGobierno"]' )
government_levels_button.click()

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id="ctl00_CPH1_BtnTipoGobierno"]"}
  (Session info: chrome=131.0.6778.85); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception


An "iframe", short for "inline frame", is an HTML element used to embed another web page within a parent page. In web scraping, it's important to identify iframes because they have their own separate DOM (Document Object Model). Elements within an iframe cannot be directly accessed from the main page's DOM. Therefore, to interact with or extract data from elements inside an iframe, one must first switch to the iframe's context. Failure to do so will result in an inability to locate and manipulate these elements, making understanding and handling iframes essential for successful scraping of complex web pages.

In [None]:
frames = driver.find_elements( By.TAG_NAME, "frame" )
if frames:
    print( 'There are frames in this web page.' )
    for i, frame in enumerate( frames ):
        frame_html = frame.get_attribute( 'outerHTML' )
        print( f'HTML code of frame { i }:' )
        print( frame_html )
else:
    print('No frames found on the page.' )

Once the frame or iframe has been identified, we can switch to it with the `switch_to.frame` method:

In [67]:
frame = driver.find_element( By.ID, "frame0" )
driver.switch_to.frame( frame )

Now we can click the Government Levels button

In [68]:
government_levels_button = driver.find_element( By.XPATH, '//*[@id="ctl00_CPH1_BtnTipoGobierno"]' )
government_levels_button.click()

Let's access some data from Regional Governments

In [69]:
regional_governments_level_button = driver.find_element( By.XPATH, '//*[@id="ctl00_CPH1_RptData_ctl03_TD0"]/input' )
regional_governments_level_button.click()

time.sleep( 2 )

sector_button = driver.find_element( By.XPATH, '//*[@id="ctl00_CPH1_BtnSector"]' )
sector_button.click()

time.sleep( 2 )

regional_governments_button = driver.find_element( By.XPATH, '//*[@id="ctl00_CPH1_RptData_ctl02_TD0"]/input' )
regional_governments_button.click()

time.sleep( 2 )

pliego_button = driver.find_element( By.XPATH, '//*[@id="ctl00_CPH1_BtnPliego"]' )
pliego_button.click()

In [70]:
table_element = driver.find_element( By.XPATH, "//table[@class='Data']" )
table_html    = table_element.get_attribute( 'outerHTML' )
table_html_io = StringIO( table_html )
table_df      = pd.read_html( table_html_io )[ 0 ]

In [71]:
table_df.head( 5 )

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,,440: GOBIERNO REGIONAL DEL DEPARTAMENTO DE AMA...,1179543064,1540608616,1532841226,1506168202,1501812047,1500163314,1434931313,97.4
1,,441: GOBIERNO REGIONAL DEL DEPARTAMENTO DE ANCASH,2535471411,3034507325,2700692982,2574885651,2573348946,2573099281,2571839978,84.8
2,,442: GOBIERNO REGIONAL DEL DEPARTAMENTO DE APU...,1219427445,1481066763,1450345046,1439674071,1427867460,1424568391,1424164126,96.2
3,,443: GOBIERNO REGIONAL DEL DEPARTAMENTO DE ARE...,2380402279,3199066208,3048096940,2988856508,2957285909,2956176809,2955298615,92.4
4,,444: GOBIERNO REGIONAL DEL DEPARTAMENTO DE AYA...,1739949908,2245901927,2188831480,2177688126,2166802546,2163858251,2163654261,96.3


The column titles (headers) are missing. We can recover them by locating the two header rows and collecting their `<td>` cells by tag name.

In [72]:
headers_elements_r0 = driver.find_element( By.XPATH, '//*[@id="ctl00_CPH1_Mt0_Row0"]' ).find_elements( By.TAG_NAME, 'td' )
headers_r0          = [ element.text for element in headers_elements_r0 ]

headers_elements_r1 = driver.find_element( By.XPATH, '//*[@id="ctl00_CPH1_Mt0_Row1"]' ).find_elements( By.TAG_NAME, 'td' )
headers_r1          = [ element.text for element in headers_elements_r1 ]

print( headers_r0, headers_r1, sep = '\n' )

[' ', 'Pliego', 'PIA', 'PIM', 'Certificación', 'Compromiso Anual', 'Ejecución', 'Avance % ']
['Atención de Compromiso Mensual ', 'Devengado ', 'Girado ']


In [73]:
driver.quit()

In [74]:
headers_r0 = [ header for header in headers_r0 if header not in [ 'Ejecución', 'Avance % ' ] ]
headers = headers_r0 + headers_r1 + [ 'Avance %' ]
print( headers )

[' ', 'Pliego', 'PIA', 'PIM', 'Certificación', 'Compromiso Anual', 'Atención de Compromiso Mensual ', 'Devengado ', 'Girado ', 'Avance %']


In [75]:
table_df.columns = headers
table_df = table_df.drop( table_df.columns[ 0 ], axis = 1 )
table_df.head( 5 )

Unnamed: 0,Pliego,PIA,PIM,Certificación,Compromiso Anual,Atención de Compromiso Mensual,Devengado,Girado,Avance %
0,440: GOBIERNO REGIONAL DEL DEPARTAMENTO DE AMA...,1179543064,1540608616,1532841226,1506168202,1501812047,1500163314,1434931313,97.4
1,441: GOBIERNO REGIONAL DEL DEPARTAMENTO DE ANCASH,2535471411,3034507325,2700692982,2574885651,2573348946,2573099281,2571839978,84.8
2,442: GOBIERNO REGIONAL DEL DEPARTAMENTO DE APU...,1219427445,1481066763,1450345046,1439674071,1427867460,1424568391,1424164126,96.2
3,443: GOBIERNO REGIONAL DEL DEPARTAMENTO DE ARE...,2380402279,3199066208,3048096940,2988856508,2957285909,2956176809,2955298615,92.4
4,444: GOBIERNO REGIONAL DEL DEPARTAMENTO DE AYA...,1739949908,2245901927,2188831480,2177688126,2166802546,2163858251,2163654261,96.3


In [77]:
path_siaf          = r'C:\Users\alexa\OneDrive\Documentos\Session1'
path_siaf_regional = path_siaf + '/regional'
os.makedirs( path_siaf_regional, exist_ok = True )

In [78]:
table_path = path_siaf_regional + '/regional_table.xlsx'
table_df.to_excel( table_path, index = False )

## 3.5. Generating URLs in the SIAF Database

We can manipulate SIAF URLs to directly access data at the district and provincial levels. The SIAF database encodes each query in URL parameters, which we call concatenators here because they are joined by the `&` character. By modifying these concatenators, we can jump straight to the data we need. The concatenators have the following meanings:

| Concatenator   | Description                                          |
|----------------|------------------------------------------------------|
| `&_tgt=xls`    | download Excel files                                 |
| `&_uhc=yes`    | download accumulated results                         |
| `&0=`          | total                                                |
| `&1=M`         | local governments                                    |
| `&37=M`        | municipalities                                       |
| `&5=01`        | UBIGEO for departments (empty for all departments)   |
| `&6=01`        | UBIGEO for provinces (empty for all provinces)       |
| `&7=07`        | UBIGEO for districts (empty for all districts)       |
| `&y=2023`      | year                                                 |
| `&ap=Proyecto` | type of query (Proyecto, Actividad, ActProy)         |
| `&cpage=1`     | page number                                          |
| `&psize=500`   | number of rows per page                              |

Now let's access data at the provincial level. The default navigator URL looks like this:

https://apps5.mineco.gob.pe/transparencia/Navegador/default.aspx?y=2023&ap=Proyecto


The navigator URL itself (which can be found in the Sources box) looks like this:


https://apps5.mineco.gob.pe/transparencia/Navegador/Navegar_7.aspx?y=2023&ap=Proyecto

In the page source for the navigator URL, the export URL line reads:

```
Navegar_7.aspx?_tgt=xls&_uhc=yes&0=&1=M&37=M&5=01&6=&y=2023&ap=Proyecto&cpage=1&psize=400
```

We can then build the full URL by prepending the base URL:

https://apps5.mineco.gob.pe/transparencia/Navegador/Navegar_7.aspx?_tgt=xls&_uhc=yes&0=&1=M&37=M&5=01&6=&y=2023&ap=Proyecto&cpage=1&psize=400

We then modify the concatenators to fit our requirements. Here we remove the download concatenators and the department-level concatenator, and leave the provincial concatenator empty so that all provinces are returned:

https://apps5.mineco.gob.pe/transparencia/Navegador/Navegar_7.aspx?0=&1=M&37=M&6=&y=2023&ap=Proyecto&cpage=1&psize=400

We can replicate this process for the district level:

https://apps5.mineco.gob.pe/transparencia/Navegador/Navegar_7.aspx?0=&1=M&37=M&7=&y=2023&ap=Proyecto&cpage=1&psize=2000
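
The manual edits above can also be expressed in code. The sketch below rebuilds the provincial and district URLs with `urllib.parse.urlencode`. The helper `build_siaf_url` and the `LEVEL_KEYS` mapping are hypothetical names introduced for illustration, and the parameter values are the ones shown in this section.

```python
from urllib.parse import urlencode

BASE_URL = 'https://apps5.mineco.gob.pe/transparencia/Navegador/Navegar_7.aspx'

# Concatenator key for each geographic level (taken from the table above)
LEVEL_KEYS = { 'province': '6', 'district': '7' }

def build_siaf_url( level, year = 2023, activity = 'Proyecto', page_size = 2000 ):
    '''
    Hypothetical helper: returns a navigator URL that lists every unit
    at the requested level by leaving that level's concatenator empty.
    '''
    params = {
        '0'                : '',
        '1'                : 'M',
        '37'               : 'M',
        LEVEL_KEYS[ level ]: '',
        'y'                : year,
        'ap'               : activity,
        'cpage'            : 1,
        'psize'            : page_size,
    }
    # urlencode keeps the insertion order and writes empty values as 'key='
    return f'{ BASE_URL }?{ urlencode( params ) }'

district_url = build_siaf_url( 'district' )
province_url = build_siaf_url( 'province', page_size = 400 )
print( district_url )
print( province_url )
```

Building the URL from a dictionary makes it harder to drop an `&` or misplace a value when changing the year or the activity type.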

With every district listed on a single page, we can scrape more efficiently. This time we iterate over the districts and extract each one's spending by function.

In [79]:
path_siaf_districts = path_siaf + '/districts'
os.makedirs( path_siaf_districts, exist_ok = True )

In [80]:
url     = 'https://apps5.mineco.gob.pe/transparencia/Navegador/Navegar_7.aspx?0=&1=M&37=M&7=&y=2023&ap=Proyecto&cpage=1&psize=2000'
driver  = webdriver.Chrome()        
driver.get( url )
driver.maximize_window()

In [81]:
districts_list = driver.find_elements( By.XPATH, "//tr[contains(@id, 'tr')]" )

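for index, district in enumerate( districts_list[ : 5 ] ):

    # Re-locate the button and the rows on every pass: references found
    # before a page reload can go stale
    functions_button = driver.find_element( By.XPATH, '//*[@id="ctl00_CPH1_BtnFuncion"]' )
    district_button  = driver.find_elements( By.XPATH, "//tr[contains(@id, 'tr')]" )[ index ]
    district_name    = district_button.find_element( By.XPATH, './td[2]' ).text.strip().split( ':' )[ 1 ]

    # Select the district row, then open its breakdown by function
    district_button.click()    
    functions_button.click()

    # Fixed pause to let the results table load
    time.sleep( 5 )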

    table_element    = driver.find_element( By.XPATH, "//table[@class='Data']" )
    table_html       = table_element.get_attribute( 'outerHTML' )
    table_html_io    = StringIO( table_html )
    table_df         = pd.read_html( table_html_io )[ 0 ]
    table_df.columns = headers
    table_df         = table_df.drop( table_df.columns[ 0 ], axis = 1 )

    folder_path = os.path.join( path_siaf_districts, district_name )
    os.makedirs( folder_path, exist_ok = True )
    file_path   = os.path.join( folder_path, f'{ district_name }.xlsx' )
    table_df.to_excel( file_path )

    print( f'Extracted: { district_name }' )

    come_back_button = driver.find_element( By.XPATH, '//*[@id="ctl00_CPH1_RptHistory_ctl04_TD0"]' )
    come_back_button.click()

    time.sleep( 5 )

Extracted:  MUNICIPALIDAD PROVINCIAL DE CHACHAPOYAS
Extracted:  MUNICIPALIDAD DISTRITAL DE ASUNCION
Extracted:  MUNICIPALIDAD DISTRITAL DE BALSAS
Extracted:  MUNICIPALIDAD DISTRITAL DE CHETO
Extracted:  MUNICIPALIDAD DISTRITAL DE CHILIQUIN
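
The loop above takes the entity name with `.split( ':' )[ 1 ]`, which keeps the space that follows the colon. This is why the printed names show a double space, and the folder names inherit that leading space. A minimal sketch of a cleaner split follows. The helper name `split_entity_label` is hypothetical, and the numeric code in the example is a made-up placeholder, not a real SIAF code.

```python
def split_entity_label( label ):
    '''
    Hypothetical helper: splits a 'code: name' label into its two parts
    and trims the whitespace around each one.
    '''
    code, _, name = label.strip().partition( ':' )
    return code.strip(), name.strip()

# '000000' is a placeholder code used only for this example
code, name = split_entity_label( '000000: MUNICIPALIDAD DISTRITAL DE BALSAS' )
print( repr( code ), repr( name ) )
```

Unlike indexing the result of `split`, `partition` does not raise an error when the label has no colon; it returns the whole text as the first part and an empty name.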


In [51]:
# driver.quit()

## 3.6. Explicit and implicit waits

In Selenium for web scraping, there are two main types of waits: implicit and explicit.

1. **Implicit Waits**:
   - Set a default wait time for the entire session.
   - Selenium polls the page for up to the specified time when locating an element, then raises `NoSuchElementException` if it is still missing.
   - Useful for a general wait time for all elements.

2. **Explicit Waits**:
   - Set wait conditions for specific elements.
   - Selenium waits for certain conditions (like visibility or clickability) to be met.
   - Checks for the condition at intervals until it's met or time runs out, in which case a `TimeoutException` is raised.
   - Ideal for elements that load slowly or need specific conditions to be met.
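
The polling idea behind an explicit wait can be illustrated without a browser. The sketch below is a simplified stand-in written in plain Python, not Selenium's own implementation: `wait_until` and `element_is_ready` are hypothetical names, and the "element" is simulated by a counter.

```python
import time

def wait_until( condition, timeout = 10, poll_interval = 0.5 ):
    '''
    Call `condition` repeatedly until it returns a truthy value,
    or raise TimeoutError once `timeout` seconds have passed.
    '''
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError( f'Condition not met after { timeout } seconds' )
        time.sleep( poll_interval )

# Stand-in for a slow page: the "element" only appears on the third check
attempts = { 'count': 0 }

def element_is_ready():
    attempts[ 'count' ] += 1
    return 'table' if attempts[ 'count' ] >= 3 else None

found = wait_until( element_is_ready, timeout = 2, poll_interval = 0.1 )
print( found, attempts[ 'count' ] )
```

Compared with a fixed `time.sleep( 5 )`, this pattern returns as soon as the condition holds and only spends the full timeout when something is actually wrong.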


The table below outlines the main conditions used with explicit waits in Selenium. The [Selenium Python documentation on waits](https://selenium-python.readthedocs.io/waits.html) gives the full list.


| Method                                  | Description |
|-----------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `presence_of_element_located`           | Waits until an element is present in the page's DOM. Does not guarantee that the element is visible.                                                                              |
| `visibility_of_element_located`         | Waits until an element is visible on the page, meaning it's not hidden and has a height and width greater than zero.                                                             |
| `element_to_be_clickable`               | Waits until an element is visible and clickable, which implies the element is visible and enabled for actions like clicks.                                                       |
| `visibility_of`                         | Waits until an element is visible and has a size greater than zero. Similar to `visibility_of_element_located`, but uses a direct element instead of a location.                  |
| `presence_of_all_elements_located`      | Waits until all elements matching a specified selector are present in the page's DOM.                                                                                                    |


We are now going to revise our scraper for the SIAF to incorporate explicit waits.


In [56]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from io import StringIO
import pandas as pd
import os

We also package the scraper as a function, so it can be rerun for any year, spending type, and activity type.

In [57]:
def siaf_scraper( year, spending_type, activity_type, path ):

    '''
    Objective:
    
        Scrape data from the SIAF (Sistema Integrado de Administración Financiera) 
        website for a given year, spending type, and activity type. The data is 
        saved in Excel format.

    Input:
        year (int or str)   : The year for which the data is to be scraped. 
        
        spending_type (str) : The category of spending to scrape. Valid options are:
                                - 'Categoría Presupuestal'
                                - 'Producto o Proyecto'
                                - 'Función'
                                
        activity_type (str) : The type of activity to scrape. Valid options are:
                                - 'Actividades'
                                - 'Proyectos'
                                - 'Actividades y proyectos'
        
        path (str)          : The file path where the scraped data will be saved.

    Output:
    
        The function saves the scraped data into an Excel file for each entity within 
        the specified year and spending type. The files are saved in the directory 
        specified by the 'path' argument. The function also prints and logs any errors 
        encountered during scraping in a .txt file.
    '''

    activity_mappings = {
                'Actividades'            : 'Actividad',
                'Proyectos'              : 'Proyecto',
                'Actividades y proyectos': 'ActProy'
    }

    activity_type = activity_mappings.get( activity_type )

    url = f'https://apps5.mineco.gob.pe/transparencia/Navegador/Navegar_7.aspx?0=&1=M&37=M&7=&y={ year }&ap={ activity_type }&cpage=1&psize=2000'
    
    driver = webdriver.Chrome()        
    driver.get( url )
    driver.maximize_window()
    wait = WebDriverWait(driver, 10)

    xpath_mappings = {
                'Categoría Presupuestal': '//*[@id="ctl00_CPH1_BtnProgramaPpto"]',
                'Producto o Proyecto'   : '//*[@id="ctl00_CPH1_BtnProdProy"]',
                'Función'               : '//*[@id="ctl00_CPH1_BtnFuncion"]'    
                }

    xpath = xpath_mappings.get( spending_type )

    os.makedirs( path, exist_ok = True )    
    path_txt = os.path.join( path, f'{ year }_{ spending_type }_records.txt' )

    with open(path_txt, 'w') as f:
        
        n_iterations = driver.find_elements( By.XPATH, "//tr[contains(@id, 'tr')]" )
        
        for index, district in enumerate( n_iterations ):
            
            try:
                entities_list     = wait.until( EC.presence_of_all_elements_located( ( By.XPATH, "//tr[contains(@id, 'tr')]" ) ) )
                entity_button     = entities_list[ index ]
                full_entity_name  = entity_button.find_element( By.XPATH, './td[2]' ).text.strip()
                short_entity_name = full_entity_name.split( ':' )[ 1 ]
                entity_button.click()
                
                spending_type_button = wait.until( EC.element_to_be_clickable( ( By.XPATH, xpath ) ) )
                spending_type_button.click()
                
                headers_elements_r0 = wait.until( EC.presence_of_element_located( ( By.XPATH, '//*[@id="ctl00_CPH1_Mt0_Row0"]' ) ) ).find_elements( By.TAG_NAME, 'td' )
                headers_elements_r1 = wait.until( EC.presence_of_element_located( ( By.XPATH, '//*[@id="ctl00_CPH1_Mt0_Row1"]' ) ) ).find_elements( By.TAG_NAME, 'td' )
                headers_r0          = [ element.text for element in headers_elements_r0 if element.text not in [ 'Ejecución', 'Avance % ' ] ]
                headers_r1          = [ element.text for element in headers_elements_r1 ]
                headers             = headers_r0 + headers_r1 + [ 'Avance %', 'Municipalidad', 'year', 'tipo_actividad' ]
        
                table_element          = wait.until( EC.element_to_be_clickable( ( By.XPATH, "//table[@class='Data']" ) ) )
                table_html             = table_element.get_attribute( 'outerHTML' )
                table_html_io          = StringIO( table_html )
                df                     = pd.read_html( table_html_io )[ 0 ]
                df[ 'Municipalidad' ]  = full_entity_name
                df[ 'year' ]           = year
                df[ 'tipo_actividad' ] = activity_type
                df.columns             = headers
                df                     = df.drop( df.columns[ 0 ], axis = 1 )
                df                     = df[ df.columns[ -3 : ].to_list() + df.columns[: -3 ].to_list() ]

        
        
                folder_path = os.path.join( f'{ path }', f'{ year }_{ spending_type }' )
                os.makedirs( folder_path, exist_ok = True )
                file_path   = os.path.join( folder_path, f'{ year }_{ short_entity_name }.xlsx' )
                df.to_excel( file_path, index = False )

                print(f'Extracted: { short_entity_name }\n' )
                f.write(f'Extracted: { short_entity_name }\n' )

                come_back_button = wait.until( EC.element_to_be_clickable( ( By.XPATH, '//*[@id="ctl00_CPH1_RptHistory_ctl04_TD0"]' ) ) )
                come_back_button.click()

            except Exception as e:
                
                print( f'Error at index { index }: { e }\n' )
                f.write( f'Error at index: { index }: { e }\n' )
                continue

    driver.quit()
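
Because `siaf_scraper` writes one `Error at index: ...` line to the records file for each failed entity, that file can be read back to find which rows need a second attempt. The sketch below is one possible way to do this. The helper `failed_indices` is a hypothetical name, and the sample log is invented for the demonstration.

```python
import os
import re
import tempfile

def failed_indices( log_path ):
    '''
    Hypothetical helper: returns the row indices that the scraper
    logged as errors, in the order they appear in the file.
    '''
    # Matches lines written as 'Error at index: <n>: <message>'
    pattern = re.compile( r'^Error at index: (\d+):', flags = re.MULTILINE )
    with open( log_path ) as f:
        return [ int( i ) for i in pattern.findall( f.read() ) ]

# Invented sample log, used only to demonstrate the helper
sample_log = (
    'Extracted:  MUNICIPALIDAD DISTRITAL DE BALSAS\n'
    'Error at index: 3: Message: timeout\n'
    'Extracted:  MUNICIPALIDAD DISTRITAL DE CHETO\n'
    'Error at index: 7: Message: timeout\n'
)

with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join( tmp, 'sample_records.txt' )
    with open( log_path, 'w' ) as f:
        f.write( sample_log )
    result = failed_indices( log_path )

print( result )
```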

Now, we can use our function by specifying the necessary parameters.

In [58]:
path_siaf_complete = path_siaf + '/complete'
os.makedirs( path_siaf_complete, exist_ok = True )

In [63]:
year          = '2021'
spending_type = 'Categoría Presupuestal'
activity_type = 'Actividades y proyectos'
path          =  path_siaf_complete

siaf_scraper( year, spending_type, activity_type, path )

Extracted:  MUNICIPALIDAD PROVINCIAL DE CHACHAPOYAS

Extracted:  MUNICIPALIDAD DISTRITAL DE ASUNCION

Extracted:  MUNICIPALIDAD DISTRITAL DE BALSAS

Extracted:  MUNICIPALIDAD DISTRITAL DE CHETO

Extracted:  MUNICIPALIDAD DISTRITAL DE CHILIQUIN

Extracted:  MUNICIPALIDAD DISTRITAL DE CHUQUIBAMBA

Extracted:  MUNICIPALIDAD DISTRITAL DE GRANADA

Extracted:  MUNICIPALIDAD DISTRITAL DE HUANCAS



KeyboardInterrupt: 

In [62]:
# driver.quit()

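We can also loop over several years to collect the information for each one. Note that `np.arange( 2015, 2021, 1 )` excludes its end value, so the loop below covers 2015 through 2020.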

In [173]:
# years_list = np.arange( 2015, 2021, 1 )

# for year in years_list:
    
#     spending_type = 'Función'
#     activity_type = 'Actividades y proyectos'
#     path          = path_siaf_complete   
#     siaf_scraper( year, spending_type, activity_type, path )
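
The commented loop above fixes the spending type and varies only the year. If several spending types are needed as well, the combinations can be listed first with `itertools.product`. This is a sketch of the idea: the `jobs` list is an illustrative name, and the scraper call is left as a comment because it opens a browser.

```python
from itertools import product

years          = range( 2015, 2021 )   # 2015 through 2020; the end value is excluded
spending_types = [ 'Categoría Presupuestal', 'Producto o Proyecto', 'Función' ]

# Every ( year, spending_type ) pair, ordered by year first
jobs = list( product( years, spending_types ) )

for year, spending_type in jobs[ : 4 ]:
    print( year, spending_type )

# Each pair could then be passed to the scraper, for example:
# siaf_scraper( year, spending_type, 'Actividades y proyectos', path_siaf_complete )
```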

In [21]:
# Libraries template

# Selenium
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.action_chains import ActionChains

# Options driver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import Select

# Dataframes
import pandas as pd
import itertools
import os

# Simulating human behavior
import time
from time import sleep
import random

# Clean text (strip accents)
import unidecode

# Json files
import json
import re
import numpy as np
from pandas import json_normalize

# To use explicit waits
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Download files
import urllib.request
import requests
from openpyxl import Workbook

# # pytesseract
# from PIL import Image
# from io import BytesIO
# import pytesseract
# pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"