# 1. SIAF web page

Sistema Integrado de Administración Financiera del Sector Público (SIAF-SP).
The SIAF-SP is the official system for recording, processing, and generating information related to the financial administration of the public sector.

#### PORTAL DE TRANSPARENCIA ECONÓMICA CONSULTA AMIGABLE

To begin working with Selenium, it is essential to install the required libraries. To prevent conflicts with other libraries, it's advisable to create a new environment when working on a project. In Anaconda Prompt, execute the following commands to create and activate a new environment named "wb":

```bash
conda create -n wb
conda activate wb
```

Then we can install the required library, `selenium`, in the new environment, for example with `pip install selenium`.


A Web Driver serves as a bridge between the code and the browser, enabling automation of actions such as clicking buttons, filling out forms, and extracting information. Initializing the Web Driver in Python involves creating an instance of the WebDriver for the desired browser, such as Chrome or Firefox, and then using methods to facilitate navigation and manipulation of elements on the web page.

We can maximize the window. This matters because some web elements are only visible or interactable when the browser window is large enough.

We can also employ headless mode, which runs the browser automation without opening a visible browser window. It is useful for saving system resources and running scripts faster, especially on servers that have no display.
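Both options can be chosen when the driver is created. The following is a minimal sketch, assuming Chrome; the helper names `chrome_arguments` and `make_driver` are hypothetical, and `--headless=new`, `--window-size`, and `--start-maximized` are Chrome command-line switches (the first one on recent Chrome versions), not Selenium features. Since a headless browser has no window to maximize, the sketch sets an explicit window size instead.

```python
def chrome_arguments(headless=True, window_size=(1920, 1080)):
    """Return the Chrome command-line switches for the requested mode."""
    if headless:
        width, height = window_size
        # No visible window exists in headless mode, so fix the viewport size.
        return ['--headless=new', f'--window-size={width},{height}']
    # With a visible window, ask Chrome to start maximized.
    return ['--start-maximized']


def make_driver(headless=True):
    """Create a Chrome driver configured with the switches above."""
    # Imported here so that chrome_arguments() can be used without Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in chrome_arguments(headless):
        options.add_argument(argument)
    return webdriver.Chrome(options=options)


print(chrome_arguments(headless=True))
```

Calling `make_driver(headless=False)` would open a regular maximized window, which is usually easier while developing a scraper.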

In Selenium, location methods are used to find elements on a web page. These methods allow you to select elements based on their HTML attributes, such as ID, name, XPath, tag name, class name, or CSS selector. Using the right locator ensures that your automation script interacts with the correct elements on the page.

|Method|Description|
|---|---|
|find_element( By.XPATH, "xpath" ) | Used for finding an element by its Xpath|
|find_element( By.ID, "id" ) | Used for finding an element by its id|
|find_element( By.NAME, "name" ) | Used for finding an element by its name attribute|
|find_element( By.TAG_NAME, "tag name" ) | Used for finding an element by its HTML tag|
|find_element( By.CLASS_NAME, "class name" ) | Used for finding an element by its class name|
|find_element( By.CSS_SELECTOR, "css selector" )| Used for finding an element by its CSS selector|

We will try these location methods on the Congressional Bills page.

In [4]:
import pandas as pd
import numpy as np
from selenium import webdriver
from selenium.webdriver.common.by import By
from io import StringIO
import os
import time
import requests

In [22]:
url     = 'https://wb2server.congreso.gob.pe/spley-portal/#/expediente/search'
driver  = webdriver.Chrome()        
driver.get( url )
driver.maximize_window()

We can access the Congressional Bills table in a static way, reading only the page that is currently displayed.

In [24]:
table_element = driver.find_element( By.XPATH, '/html/body/app-root/app-publico/div[3]/app-search/section/div/div/p-table/div/div/table' )
table_html    = table_element.get_attribute( 'outerHTML' )
table_html_io = StringIO( table_html )
table_df      = pd.read_html( table_html_io )[ 0 ]

In [25]:
path_static = '../data/Part_I/Congressional_Bills'
os.makedirs( path_static, exist_ok = True )

path_table = path_static + '/Congressional_Bills.xlsx'
table_df.to_excel( path_table, index = False )

In [26]:
table_df.head( 5 )

| | PROYECTOS DE LEY | FECHA DE PRESENTACIÓN | TÍTULO | ESTADO PROCESAL | PROPONENTE | AUTORES |
|---|---|---|---|---|---|---|
| 0 | Proyecto de Ley 06691/2023-CR | Fecha de Presentación18/12/2023 | TítuloLEY DE EXPULSIÓN DE EXTRANJEROS DETENIDO... | Estado ProcesalPRESENTADO | ProponenteCongreso | AutoresOlivos Martínez, Vivian Ventura Ángel, ... |
| 1 | Proyecto de Ley 06690/2023-CR | Fecha de Presentación18/12/2023 | TítuloLEY QUE CREA LA UNIVERSIDAD NACIONAL AUT... | Estado ProcesalPRESENTADO | ProponenteCongreso | AutoresBalcázar Zelada, José María Coayla Juár... |
| 2 | Proyecto de Ley 06689/2023-CR | Fecha de Presentación18/12/2023 | TítuloLEY MARCO PARA LA PROTECCIÓN Y FORTALECI... | Estado ProcesalPRESENTADO | ProponenteCongreso | AutoresJáuregui Martínez de Aguayo, María de l... |
| 3 | Proyecto de Ley 06688/2023-CR | Fecha de Presentación18/12/2023 | TítuloLEY QUE ESTABLECE PARÁMETROS PARA EL SAN... | Estado ProcesalPRESENTADO | ProponenteCongreso | AutoresLimachi Quispe, Nieves Esmeralda Bazán ... |
| 4 | Proyecto de Ley 06687/2023-CR | Fecha de Presentación15/12/2023 | TítuloLEY QUE MODIFICA EL DECRETO LEGISLATIVO ... | Estado ProcesalPRESENTADO | ProponenteCongreso | AutoresAnderson Ramírez, Carlos Antonio |


We can also access the Congressional Bills tables in a dynamic way, clicking through the paginator to save the first five pages.

In [8]:
path_dinamic = path_static + '/dinamic'
os.makedirs( path_dinamic, exist_ok = True )

for index in np.arange( 1, 6, step = 1 ):

    table_element = driver.find_element( By.XPATH, '/html/body/app-root/app-publico/div[3]/app-search/section/div/div/p-table/div/div/table' )
    table_html    = table_element.get_attribute( 'outerHTML' )
    table_html_io = StringIO( table_html )
    table_df_d    = pd.read_html( table_html_io )[ 0 ]

    next_button   = driver.find_element( By.XPATH, '/html/body/app-root/app-publico/div[3]/app-search/section/div/div/p-table/div/p-paginator[1]/div/button[3]' )
    next_button.click()

    time.sleep( 2 )

    path_table = path_dinamic + f'/Congressional_Bills_n_{ index }.xlsx'
    table_df_d.to_excel( path_table, index = False )


To determine the total number of pages, we jump to the last page and read the text of the last pagination button.

In [9]:
pagination_next_button = driver.find_element( By.XPATH, '/html/body/app-root/app-publico/div[3]/app-search/section/div/div/p-table/div/p-paginator[1]/div/button[4]' )
pagination_next_button.click()

time.sleep( 2 )

last_pagination_button = driver.find_element( By.XPATH, '/html/body/app-root/app-publico/div[3]/app-search/section/div/div/p-table/div/p-paginator[1]/div/span[2]/button[5]' )
n_pagination_buttons   = last_pagination_button.text
print( f'N. Pagination Buttons is: { n_pagination_buttons }' )

N. Pagination Buttons is: 134


In [10]:
# driver.quit()

We can download the PDF documents for the Congressional Bills as well. We start by importing the `Options` class.

In [11]:
from selenium.webdriver.chrome.options import Options

We then retrieve the ID for each bill listed in `table_df`. Additionally, we establish the base URL.

In [28]:
bills_id = table_df[ 'PROYECTOS DE LEY' ].apply( lambda x: x.split( 'Ley ' )[ 1 ].split( '/' )[ 0 ] ).tolist()

base_url   = 'https://wb2server.congreso.gob.pe/spley-portal/#/expediente/2021/'
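The chained `split` calls raise an `IndexError` when a label does not contain `'Ley '`. A regular expression is one more forgiving alternative. This is a minimal sketch: the helper name `extract_bill_id` is hypothetical, and the pattern assumes every label follows the `number/year-CR` form seen in the rows above.

```python
import re

# Assumed label format, e.g. 'Proyecto de Ley 06691/2023-CR'.
BILL_PATTERN = re.compile(r'(\d+)/(\d{4})-CR')


def extract_bill_id(label):
    """Return the digits before the slash, or None if the label does not match."""
    match = BILL_PATTERN.search(label)
    return match.group(1) if match else None


print(extract_bill_id('Proyecto de Ley 06691/2023-CR'))  # prints 06691
```

The function could be applied with `table_df[ 'PROYECTOS DE LEY' ].map( extract_bill_id )`, leaving `None` for any row whose label has an unexpected format.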

In [30]:
path_pdf = path_static + '/pdf_docs'
os.makedirs( path_pdf, exist_ok = True )

In [37]:
chrome_options = Options()

prefs = {
    'download.default_directory'        : os.path.abspath( path_pdf ),  # absolute form of path_pdf
    'download.prompt_for_download'      : False,
    'download.directory_upgrade'        : True,
    'plugins.always_open_pdf_externally': True
}

chrome_options.add_experimental_option( 'prefs', prefs )

driver = webdriver.Chrome( options = chrome_options )
url = 'https://wb2server.congreso.gob.pe/spley-portal/#/expediente/search'
driver.get(url)
driver.maximize_window()

n_items = 5

for i, bill in enumerate( bills_id[ :n_items ] ):
    bill_url = base_url + bill
    driver.get( bill_url )

    time.sleep( 5 )

    try:
        pdf_button = driver.find_element( By.XPATH, '//*[@id="p-tabpanel-0"]/p-table/div/div/table/tbody/tr/td[5]/button' )
        pdf_button.click()

        print( f'Success downloading bill: { bill }' )
        
    except Exception as e:
       
        print( f'We did not find PDF button for { bill_url }: {e}' )

    time.sleep( 5 )

    if i == ( n_items - 1 ):

        time.sleep( 10 )

driver.quit()

Success downloading bill: 06691
Success downloading bill: 06690
Success downloading bill: 06689
Success downloading bill: 06688
Success downloading bill: 06687
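The fixed pauses in the loop are a guess at how long each download takes. One alternative is to poll the download folder, relying on the fact that Chrome keeps an unfinished download under a temporary `.crdownload` extension. The sketch below uses only the standard library; the helper name `wait_for_downloads` and its default timings are assumptions. It should be called a moment after the click, since a download that has not started yet leaves no temporary file to detect.

```python
import os
import time


def wait_for_downloads(folder, timeout=60, poll=0.5):
    """Return True once `folder` holds no '.crdownload' files, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        pending = [name for name in os.listdir(folder) if name.endswith('.crdownload')]
        if not pending:
            return True
        # Some download is still in progress: wait a little and check again.
        time.sleep(poll)
    return False
```

With this helper, the final `time.sleep( 10 )` could be replaced by `wait_for_downloads( path_pdf )` before calling `driver.quit()`.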


Now let's try to scrape the SIAF web page.

In [38]:
url     = 'https://apps5.mineco.gob.pe/transparencia/Navegador/default.aspx?y=2023&ap=ActProy'
driver  = webdriver.Chrome()        
driver.get( url )
driver.maximize_window()

In [39]:
goverment_levels_button = driver.find_element( By.XPATH, '//*[@id="ctl00_CPH1_BtnTipoGobierno"]' )
goverment_levels_button.click()

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id="ctl00_CPH1_BtnTipoGobierno"]"}
  (Session info: chrome=120.0.6099.109); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
	GetHandleVerifier [0x00007FF7AA4A2142+3514994]
	(No symbol) [0x00007FF7AA0C0CE2]
	(No symbol) [0x00007FF7A9F676AA]
	(No symbol) [0x00007FF7A9FB1860]
	(No symbol) [0x00007FF7A9FB197C]
	(No symbol) [0x00007FF7A9FF4EE7]
	(No symbol) [0x00007FF7A9FD602F]
	(No symbol) [0x00007FF7A9FF28F6]
	(No symbol) [0x00007FF7A9FD5D93]
	(No symbol) [0x00007FF7A9FA4BDC]
	(No symbol) [0x00007FF7A9FA5C64]
	GetHandleVerifier [0x00007FF7AA4CE16B+3695259]
	GetHandleVerifier [0x00007FF7AA526737+4057191]
	GetHandleVerifier [0x00007FF7AA51E4E3+4023827]
	GetHandleVerifier [0x00007FF7AA1F04F9+689705]
	(No symbol) [0x00007FF7AA0CC048]
	(No symbol) [0x00007FF7AA0C8044]
	(No symbol) [0x00007FF7AA0C81C9]
	(No symbol) [0x00007FF7AA0B88C4]
	BaseThreadInitThunk [0x00007FF8C2E17344+20]
	RtlUserThreadStart [0x00007FF8C32E26B1+33]


An "iframe", short for "inline frame", is an HTML element used to embed another web page within a parent page. In web scraping, it is important to identify iframes because each one has its own separate DOM (Document Object Model). Elements within an iframe cannot be accessed directly from the main page's DOM, so to interact with them or extract their data we must first switch to the iframe's context. Otherwise Selenium cannot locate them, which is exactly the `NoSuchElementException` raised above. The same applies to the older `<frame>` element, which is what this page uses.

In [41]:
frames = driver.find_elements( By.TAG_NAME, "frame" )
if frames:
    print( 'There are frames in this web page.' )
    for i, frame in enumerate( frames ):
        frame_html = frame.get_attribute( 'outerHTML' )
        print( f'HTML code of frame { i }:' )
        print( frame_html )
else:
    print('No frames found on the page.' )

There are frames in this web page.
HTML code of frame 0:
<frame name="frame0" id="frame0" src="Navegar.aspx?y=2023&amp;ap=ActProy" scrolling="yes">


Once the frame is identified, we can switch into it with the `driver.switch_to.frame()` method.

In [42]:
frame = driver.find_element( By.ID, "frame0" )
driver.switch_to.frame( frame )
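After switching, the driver stays inside the frame until it is told to return with `driver.switch_to.default_content()`. A small context manager can make that round trip explicit. This is only a sketch: the name `inside_frame` is hypothetical, and it relies solely on the `switch_to.frame` and `switch_to.default_content` methods of a Selenium driver.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame_reference):
    """Switch into a frame for the duration of a with-block, then switch back."""
    # frame_reference may be a WebElement, a frame name or id, or an index.
    driver.switch_to.frame(frame_reference)
    try:
        yield driver
    finally:
        # Return to the top-level document even if the body raised an error.
        driver.switch_to.default_content()
```

It would be used as `with inside_frame( driver, 'frame0' ): ...`, placing the `find_element` calls for the frame's content inside the block.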

Now we can click the Government Levels button.

In [43]:
goverment_levels_button = driver.find_element( By.XPATH, '//*[@id="ctl00_CPH1_BtnTipoGobierno"]' )
goverment_levels_button.click()

Let's access some data from the Regional Governments.

In [44]:
regional_governments_level_button = driver.find_element( By.XPATH, '//*[@id="ctl00_CPH1_RptData_ctl03_TD0"]/input' )
regional_governments_level_button.click()

time.sleep( 2 )

sector_button = driver.find_element( By.XPATH, '//*[@id="ctl00_CPH1_BtnSector"]' )
sector_button.click()

time.sleep( 2 )

regional_goverments_button = driver.find_element( By.XPATH, '//*[@id="ctl00_CPH1_RptData_ctl02_TD0"]/input' )
regional_goverments_button.click()

time.sleep( 2 )

pliego_button = driver.find_element( By.XPATH, '//*[@id="ctl00_CPH1_BtnPliego"]' )
pliego_button.click()

In [45]:
table_element = driver.find_element( By.XPATH, "//table[@class='Data']" )
table_html    = table_element.get_attribute( 'outerHTML' )
table_html_io = StringIO( table_html )
table_df      = pd.read_html( table_html_io )[ 0 ]

In [46]:
table_df.head( 5 )

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | 440: GOBIERNO REGIONAL DEL DEPARTAMENTO DE AMA... | 1179543064 | 1504888953 | 1463829153 | 1332523428 | 1308523245 | 1279150903 | 1178443440 | 85.0 |
| 1 | | 441: GOBIERNO REGIONAL DEL DEPARTAMENTO DE ANCASH | 2535471411 | 2996975486 | 2720233624 | 2448148507 | 2402360853 | 2292571159 | 2271082289 | 76.5 |
| 2 | | 442: GOBIERNO REGIONAL DEL DEPARTAMENTO DE APU... | 1219427445 | 1513009025 | 1375547684 | 1331036632 | 1311524288 | 1257485229 | 1234622502 | 83.1 |
| 3 | | 443: GOBIERNO REGIONAL DEL DEPARTAMENTO DE ARE... | 2380402279 | 3153553160 | 2838882882 | 2714483554 | 2672481083 | 2589568212 | 2577943145 | 82.1 |
| 4 | | 444: GOBIERNO REGIONAL DEL DEPARTAMENTO DE AYA... | 1739949908 | 2250344232 | 2084312392 | 2025983577 | 1941676557 | 1885327467 | 1865571064 | 83.8 |


The table is missing its column titles (headers). The header spans two rows, so we locate each header row by its ID and collect its `td` cells by tag name.

In [47]:
headers_elements_r0 = driver.find_element( By.XPATH, '//*[@id="ctl00_CPH1_Mt0_Row0"]' ).find_elements( By.TAG_NAME, 'td' )
headers_r0          = [ element.text for element in headers_elements_r0 ]

headers_elements_r1 = driver.find_element( By.XPATH, '//*[@id="ctl00_CPH1_Mt0_Row1"]' ).find_elements( By.TAG_NAME, 'td' )
headers_r1          = [ element.text for element in headers_elements_r1 ]

print( headers_r0, headers_r1, sep = '\n' )

[' ', 'Pliego', 'PIA', 'PIM', 'Certificación', 'Compromiso Anual', 'Ejecución', 'Avance % ']
['Atención de Compromiso Mensual ', 'Devengado ', 'Girado ']


In [48]:
driver.quit()

In [49]:
headers_r0 = [ header for header in headers_r0 if header not in [ 'Ejecución', 'Avance % ' ] ]
headers = headers_r0 + headers_r1 + [ 'Avance %' ]
print( headers )

[' ', 'Pliego', 'PIA', 'PIM', 'Certificación', 'Compromiso Anual', 'Atención de Compromiso Mensual ', 'Devengado ', 'Girado ', 'Avance %']
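The two header rows exist because `Ejecución` is a group header that spans the three sub-columns of the second row. An alternative to removing and re-appending labels is to expand the group header in place, which keeps the column order without naming `Avance %` explicitly. This is a minimal sketch; the function name `flatten_headers` is hypothetical, and it also strips the trailing spaces visible in the scraped labels.

```python
def flatten_headers(top_row, sub_row, group_label='Ejecución'):
    """Replace the group header in `top_row` with the labels of `sub_row`."""
    flat = []
    for label in top_row:
        if label.strip() == group_label:
            # Expand the spanning header into its sub-columns.
            flat.extend(sub.strip() for sub in sub_row)
        else:
            flat.append(label.strip())
    return flat


top = [' ', 'Pliego', 'PIA', 'PIM', 'Certificación', 'Compromiso Anual', 'Ejecución', 'Avance % ']
sub = ['Atención de Compromiso Mensual ', 'Devengado ', 'Girado ']
print(flatten_headers(top, sub))
```

The result has the same ten labels, in the same order, as the `headers` list built in the notebook, except that surrounding whitespace is removed.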


In [50]:
table_df.columns = headers
table_df = table_df.drop( table_df.columns[ 0 ], axis = 1 )
table_df.head( 5 )

| | Pliego | PIA | PIM | Certificación | Compromiso Anual | Atención de Compromiso Mensual | Devengado | Girado | Avance % |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 440: GOBIERNO REGIONAL DEL DEPARTAMENTO DE AMA... | 1179543064 | 1504888953 | 1463829153 | 1332523428 | 1308523245 | 1279150903 | 1178443440 | 85.0 |
| 1 | 441: GOBIERNO REGIONAL DEL DEPARTAMENTO DE ANCASH | 2535471411 | 2996975486 | 2720233624 | 2448148507 | 2402360853 | 2292571159 | 2271082289 | 76.5 |
| 2 | 442: GOBIERNO REGIONAL DEL DEPARTAMENTO DE APU... | 1219427445 | 1513009025 | 1375547684 | 1331036632 | 1311524288 | 1257485229 | 1234622502 | 83.1 |
| 3 | 443: GOBIERNO REGIONAL DEL DEPARTAMENTO DE ARE... | 2380402279 | 3153553160 | 2838882882 | 2714483554 | 2672481083 | 2589568212 | 2577943145 | 82.1 |
| 4 | 444: GOBIERNO REGIONAL DEL DEPARTAMENTO DE AYA... | 1739949908 | 2250344232 | 2084312392 | 2025983577 | 1941676557 | 1885327467 | 1865571064 | 83.8 |


In [57]:
path_siaf          = '../data/Part_I/SIAF'
path_siaf_regional = path_siaf + '/regional'
os.makedirs( path_siaf_regional, exist_ok = True )

In [52]:
table_path = path_siaf_regional + '/regional_table.xlsx'
table_df.to_excel( table_path, index = False )

## 3.5. Generating URLs in the SIAF Database

We can manipulate SIAF URLs to access data at the district and provincial levels directly, without clicking through the navigator. The SIAF database encodes each selection as a query-string parameter, and these parameters (called concatenators here) are joined with the `&` character in its URLs. By modifying them, we can request a given level directly. Let's consider what each concatenator means:

| Concatenator   | Description                                          |
|----------------|------------------------------------------------------|
| `&_tgt=xls`    | download Excel files                                 |
| `&_uhc=yes`    | download accumulated results                         |
| `&0=`          | total                                                |
| `&1=M`         | local governments                                    |
| `&37=M`        | municipalities                                       |
| `&5=01`        | UBIGEO for departments (empty for all departments)   |
| `&6=01`        | UBIGEO for provinces (empty for all provinces)       |
| `&7=07`        | UBIGEO for districts (empty for all districts)       |
| `&y=2023`      | year                                                 |
| `&ap=Proyecto` | type of query (Proyecto, Actividad, ActProy)         |
| `&cpage=1`     | page number                                          |
| `&psize=500`   | number of observations per page                      |

Now let's access data at the provincial level. The default navigator URL looks like this:

https://apps5.mineco.gob.pe/transparencia/Navegador/default.aspx?y=2023&ap=Proyecto


The URL of the navigator itself (which can be found in the Sources box) looks like this:


https://apps5.mineco.gob.pe/transparencia/Navegador/Navegar_7.aspx?y=2023&ap=Proyecto

In the Sources box for the navigator URL, the export URL line reads:

```
Navegar_7.aspx?_tgt=xls&_uhc=yes&0=&1=M&37=M&5=01&6=&y=2023&ap=Proyecto&cpage=1&psize=400
```

We can then build the full URL by prepending the base URL:

https://apps5.mineco.gob.pe/transparencia/Navegador/Navegar_7.aspx?_tgt=xls&_uhc=yes&0=&1=M&37=M&5=01&6=&y=2023&ap=Proyecto&cpage=1&psize=400

Then we modify the concatenators to suit our needs. We drop the export concatenators (`_tgt`, `_uhc`) and the department concatenator (`5`), and we leave the province concatenator (`6`) empty so that all provinces are returned:

https://apps5.mineco.gob.pe/transparencia/Navegador/Navegar_7.aspx?0=&1=M&37=M&6=&y=2023&ap=Proyecto&cpage=1&psize=400

We can repeat this process for the district level, using the district concatenator (`7`) and a larger page size:

https://apps5.mineco.gob.pe/transparencia/Navegador/Navegar_7.aspx?0=&1=M&37=M&7=&y=2023&ap=Proyecto&cpage=1&psize=2000
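Rather than editing these strings by hand, the URLs can be assembled from their parts. The sketch below uses `urllib.parse.urlencode` from the standard library; the function name `build_siaf_url`, its argument names, and the `LEVEL_PARAMETER` mapping are hypothetical and simply restate the concatenator table.

```python
from urllib.parse import urlencode

BASE_URL = 'https://apps5.mineco.gob.pe/transparencia/Navegador/Navegar_7.aspx'

# Concatenator that carries the UBIGEO code at each geographic level.
LEVEL_PARAMETER = {'department': '5', 'province': '6', 'district': '7'}


def build_siaf_url(year, activity, level, ubigeo='', page=1, page_size=400):
    """Build a navigator URL; an empty `ubigeo` requests every unit at that level."""
    params = [
        ('0', ''),    # total
        ('1', 'M'),   # local governments
        ('37', 'M'),  # municipalities
        (LEVEL_PARAMETER[level], ubigeo),
        ('y', year),
        ('ap', activity),
        ('cpage', page),
        ('psize', page_size),
    ]
    return f'{BASE_URL}?{urlencode(params)}'


print(build_siaf_url(2023, 'Proyecto', 'district', page_size=2000))
```

Called with `level='province'` and the default page size, the same function reproduces the provincial URL shown earlier.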

Now we can scrape in an efficient way. This time we'll scrape through districts to find their spending functions.

In [58]:
path_siaf_districts = path_siaf + '/districts'
os.makedirs( path_siaf_districts, exist_ok = True )

In [59]:
url     = 'https://apps5.mineco.gob.pe/transparencia/Navegador/Navegar_7.aspx?0=&1=M&37=M&7=&y=2023&ap=Proyecto&cpage=1&psize=2000'
driver  = webdriver.Chrome()        
driver.get( url )
driver.maximize_window()

In [60]:
districts_list = driver.find_elements( By.XPATH, "//tr[contains(@id, 'tr')]" )

for index, district in enumerate(  districts_list[ : 5 ] ):

    functions_button = driver.find_element( By.XPATH, '//*[@id="ctl00_CPH1_BtnFuncion"]' )
    district_button  = driver.find_elements( By.XPATH, "//tr[contains(@id, 'tr')]" )[ index ]
    district_name    = district_button.find_element( By.XPATH, './td[2]' ).text.strip().split( ':' )[ 1 ]

    district_button.click()    
    functions_button.click()

    time.sleep( 5 )

    table_element    = driver.find_element( By.XPATH, "//table[@class='Data']" )
    table_html       = table_element.get_attribute( 'outerHTML' )
    table_html_io    = StringIO( table_html )
    table_df         = pd.read_html( table_html_io )[ 0 ]
    table_df.columns = headers
    table_df         = table_df.drop( table_df.columns[ 0 ], axis = 1 )

    folder_path = os.path.join( path_siaf_districts, district_name )
    os.makedirs( folder_path, exist_ok = True )
    file_path   = os.path.join( folder_path, f'{ district_name }.xlsx' )
    table_df.to_excel( file_path )

    print( f'Extracted: { district_name }' )

    come_back_button = driver.find_element( By.XPATH, '//*[@id="ctl00_CPH1_RptHistory_ctl04_TD0"]' )
    come_back_button.click()

    time.sleep( 5 )

Extracted:  MUNICIPALIDAD PROVINCIAL DE CHACHAPOYAS
Extracted:  MUNICIPALIDAD DISTRITAL DE ASUNCION
Extracted:  MUNICIPALIDAD DISTRITAL DE BALSAS
Extracted:  MUNICIPALIDAD DISTRITAL DE CHETO
Extracted:  MUNICIPALIDAD DISTRITAL DE CHILIQUIN


KeyboardInterrupt: 

In [61]:
# driver.quit()

## 3.6. Explicit and implicit waits

In Selenium for web scraping, there are two main types of waits: implicit and explicit.

1. **Implicit Waits**:
   - Set a default wait time for the entire session.
   - Selenium waits up to the specified time for an element to appear before raising an error.
   - Useful for a general wait time for all elements.

2. **Explicit Waits**:
   - Set wait conditions for specific elements.
   - Selenium waits for certain conditions (like visibility or clickability) to be met.
   - Checks for the condition at intervals until it's met or time runs out.
   - Ideal for elements that load slowly or need specific conditions to be met.
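The polling behaviour behind an explicit wait can be illustrated without a browser. The function below is a simplified sketch of the idea, not Selenium's implementation: the name `wait_until` and its defaults are assumptions, and unlike `WebDriverWait` it raises Python's built-in `TimeoutError` and does not ignore any exceptions raised by the condition.

```python
import time


def wait_until(condition, timeout=10, poll=0.5):
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # The condition is met: hand back whatever it returned.
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        # Not ready yet: pause before checking again.
        time.sleep(poll)
```

The key difference from `time.sleep( 5 )` is that the wait ends as soon as the condition holds, so a fast page costs almost no time while a slow page still gets the full timeout.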


Below is a table outlining the main conditions used for explicit waits in Selenium. For a comprehensive list, see the [Selenium with Python documentation on waits](https://selenium-python.readthedocs.io/waits.html).


| Method                                  | Description |
|-----------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `presence_of_element_located`           | Waits until an element is present in the page's DOM. Does not guarantee that the element is visible.                                                                              |
| `visibility_of_element_located`         | Waits until an element is visible on the page, meaning it's not hidden and has a height and width greater than zero.                                                             |
| `element_to_be_clickable`               | Waits until an element is visible and enabled, so that actions like clicks can be performed on it.                                                                                |
| `visibility_of`                         | Waits until an element is visible and has a size greater than zero. Similar to `visibility_of_element_located`, but uses a direct element instead of a location.                  |
| `presence_of_all_elements_located`      | Waits until all elements matching a specified selector are present in the page's DOM.                                                                                                    |


We are now going to revise our scraper for the SIAF to incorporate explicit waits.


In [62]:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import os

Next, we are going to incorporate explicit waits into our SIAF scraper. Additionally, we will construct it in the form of a function.

In [65]:
def siaf_scraper( year, spending_type, activity_type, path ):

    '''
    Objective:
    
        Scrape data from the SIAF (Sistema Integrado de Administración Financiera) 
        website for a given year, spending type, and activity type. The data is 
        saved in Excel format.

    Input:
        year (int)          : The year for which the data is to be scraped. 
        
        spending_type (str) : The category of spending to scrape. Valid options are:
                                - 'Categoría Presupuestal'
                                - 'Producto o Proyecto'
                                - 'Función'
                                
        activity_type (str) : The type of activity to scrape. Valid options are:
                                - 'Actividades'
                                - 'Proyectos'
                                - 'Actividades y proyectos'
        
        path (str)          : The file path where the scraped data will be saved.

    Output:
    
        The function saves the scraped data into an Excel file for each entity within 
        the specified year and spending type. The files are saved in the directory 
        specified by the 'path' argument. The function also prints and logs any errors 
        encountered during scraping in a .txt file.
    '''

    activity_mappings = {
                'Actividades'            : 'Actividad',
                'Proyectos'              : 'Proyecto',
                'Actividades y proyectos': 'ActProy'
    }

    activity_type = activity_mappings.get( activity_type )

    url = f'https://apps5.mineco.gob.pe/transparencia/Navegador/Navegar_7.aspx?0=&1=M&37=M&7=&y={ year }&ap={ activity_type }&cpage=1&psize=2000'
    
    driver = webdriver.Chrome()        
    driver.get( url )
    driver.maximize_window()
    wait = WebDriverWait(driver, 10)

    xpath_mappings = {
                'Categoría Presupuestal': '//*[@id="ctl00_CPH1_BtnProgramaPpto"]',
                'Producto o Proyecto'   : '//*[@id="ctl00_CPH1_BtnProdProy"]',
                'Función'               : '//*[@id="ctl00_CPH1_BtnFuncion"]'    
                }

    xpath = xpath_mappings.get( spending_type )

    os.makedirs( path, exist_ok = True )    
    path_txt = os.path.join( path, f'{ year }_{ spending_type }_records.txt' )

    with open(path_txt, 'w') as f:
        
        n_iterations = driver.find_elements( By.XPATH, "//tr[contains(@id, 'tr')]" )
        
        for index, district in enumerate( n_iterations ):
            
            try:
                entities_list     = wait.until( EC.presence_of_all_elements_located( ( By.XPATH, "//tr[contains(@id, 'tr')]" ) ) )
                entity_button     = entities_list[ index ]
                full_entity_name  = entity_button.find_element( By.XPATH, './td[2]' ).text.strip()
                short_entity_name = entity_button.find_element( By.XPATH, './td[2]' ).text.strip().split( ':' )[ 1 ]
                entity_button.click()
                
                spending_type_button = wait.until( EC.element_to_be_clickable( ( By.XPATH, xpath ) ) )
                spending_type_button.click()
                
                headers_elements_r0 = wait.until( EC.presence_of_element_located( ( By.XPATH, '//*[@id="ctl00_CPH1_Mt0_Row0"]' ) ) ).find_elements( By.TAG_NAME, 'td' )
                headers_elements_r1 = wait.until( EC.presence_of_element_located( ( By.XPATH, '//*[@id="ctl00_CPH1_Mt0_Row1"]' ) ) ).find_elements( By.TAG_NAME, 'td' )
                headers_r0          = [ element.text for element in headers_elements_r0 if element.text not in [ 'Ejecución', 'Avance % ' ] ]
                headers_r1          = [ element.text for element in headers_elements_r1 ]
                headers             = headers_r0 + headers_r1 + [ 'Avance %', 'Municipalidad', 'year', 'tipo_actividad' ]
        
                table_element          = wait.until( EC.element_to_be_clickable( ( By.XPATH, "//table[@class='Data']" ) ) )
                table_html             = table_element.get_attribute( 'outerHTML' )
                table_html_io          = StringIO( table_html )
                df                     = pd.read_html( table_html_io )[ 0 ]
                df[ 'Municipalidad' ]  = full_entity_name
                df[ 'year' ]           = year
                df[ 'tipo_actividad' ] = activity_type
                df.columns             = headers
                df                     = df.drop( df.columns[ 0 ], axis = 1 )
                df                     = df[ df.columns[ -3 : ].to_list() + df.columns[: -3 ].to_list() ]

        
        
                folder_path = os.path.join( f'{ path }', f'{ year }_{ spending_type }' )
                os.makedirs( folder_path, exist_ok = True )
                file_path   = os.path.join( folder_path, f'{ year }_{ short_entity_name }.xlsx' )
                df.to_excel( file_path, index = False )

                print(f'Extracted: { short_entity_name }\n' )
                f.write(f'Extracted: { short_entity_name }\n' )

                come_back_button = wait.until( EC.element_to_be_clickable( ( By.XPATH, '//*[@id="ctl00_CPH1_RptHistory_ctl04_TD0"]' ) ) )
                come_back_button.click()

            except Exception as e:
                
                print( f'Error at index { index }: { e }\n' )
                f.write( f'Error at index { index }: { e }\n' )
                continue

    driver.quit()

Now, we can use our function by specifying the necessary parameters.

In [66]:
path_siaf_complete = path_siaf + '/complete'
os.makedirs( path_siaf_complete, exist_ok = True )

In [67]:
year          = 2020
spending_type = 'Función'
activity_type = 'Actividades y proyectos'
path          =  path_siaf_complete

siaf_scraper( year, spending_type, activity_type, path )

Extracted:  MUNICIPALIDAD PROVINCIAL DE CHACHAPOYAS

Extracted:  MUNICIPALIDAD DISTRITAL DE ASUNCION

Extracted:  MUNICIPALIDAD DISTRITAL DE BALSAS

Extracted:  MUNICIPALIDAD DISTRITAL DE CHETO

Extracted:  MUNICIPALIDAD DISTRITAL DE CHILIQUIN

Extracted:  MUNICIPALIDAD DISTRITAL DE CHUQUIBAMBA

Extracted:  MUNICIPALIDAD DISTRITAL DE GRANADA

Extracted:  MUNICIPALIDAD DISTRITAL DE HUANCAS

Extracted:  MUNICIPALIDAD DISTRITAL DE LA JALCA




KeyboardInterrupt



In [68]:
driver.quit()

We can also iterate through several years to collect the information for each one.

In [173]:
years_list = np.arange( 2015, 2021, 1 )

for year in years_list:
    
    spending_type = 'Función'
    activity_type = 'Actividades y proyectos'
    path          = path_siaf_complete   
    siaf_scraper( year, spending_type, activity_type, path )