# Scraper of public procurement processes

This notebook contains the scraper for all the procurement processes for medicine in Ecuador, under the corporate purchase mechanism, launched in 2022.

The webpage with the data on all the medicine contracts, which can be found [here](https://modulocomprascorporativas.compraspublicas.gob.ec/ProcesoContratacion/compras/PC/buscarProceso.cpe?sg=1#), only allows searches over 6-month periods and is protected by a captcha. This scraper downloads all of the purchasing processes using Playwright and bs4, while the dates and the captcha are filled in manually.

Purchasing processes were launched in 2022 and 2023; there are no processes for 2024 and 2025.
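
Since the search form only accepts six-month windows, the date ranges to type in for each pass can be listed up front. The following is a minimal sketch, assuming the windows line up with calendar half-years (January to June, July to December); `semester_windows` is a hypothetical helper and not part of the scraper itself.

```python
from datetime import date

def semester_windows(years):
    """Return (label, start, end) tuples, two per year, one per calendar half."""
    windows = []
    for year in years:
        windows.append((f"First {year}", date(year, 1, 1), date(year, 6, 30)))
        windows.append((f"Second {year}", date(year, 7, 1), date(year, 12, 31)))
    return windows

# Print the ranges to enter by hand in the search form.
for label, start, end in semester_windows([2022, 2023]):
    print(label, start.isoformat(), end.isoformat())
```

The labels match the ones used in the `semesters` list of the scraping loop, so each printed range corresponds to one manual pass.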

In [1]:
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
import pandas as pd
import asyncio
import os
from datetime import datetime

## List of contracts

First, I will run a scraper through Sercop's website to get a list of procurement processes and a URL to each one of them.

In [3]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()
await page.goto("https://modulocomprascorporativas.compraspublicas.gob.ec/ProcesoContratacion/compras/PC/buscarProceso.cpe?sg=1#")

<Response url='https://modulocomprascorporativas.compraspublicas.gob.ec/ProcesoContratacion/compras/PC/buscarProceso.cpe?sg=1' request=<Request url='https://modulocomprascorporativas.compraspublicas.gob.ec/ProcesoContratacion/compras/PC/buscarProceso.cpe?sg=1' method='GET'>>

In [4]:
all_rows = []

semesters = [
    ("First 2022", "Waiting for dates and captcha manually"),
    ("Second 2022", "Waiting for dates and captcha manually"),
    ("First 2023", "Waiting for dates and captcha manually"),
    ("Second 2023", "Waiting for dates and captcha manually"),
]

for label, instruction in semesters:
    print(f"Scraping {label} — {instruction}")
    input("---- Press enter when completed")

    offset = 0

    while True:
        try:
            await page.wait_for_selector("table", timeout=10000)
            html = await page.content()
            soup = BeautifulSoup(html, "lxml")

            rows = soup.select("table tr")[1:]  # Skip header
            for row in rows:
                cols = row.find_all("td")
                if not cols or len(cols) < 7:
                    continue
                try:
                    codigo = cols[0].text.strip()
                    detalle_url = cols[0].find("a")["href"] if cols[0].find("a") else None
                    entidad = cols[1].text.strip()
                    objeto = cols[2].text.strip()
                    estado = cols[3].text.strip()
                    ubicacion = cols[4].text.strip()
                    presupuesto = cols[5].text.strip()
                    fecha_publicacion = cols[6].text.strip()

                    all_rows.append({
                        "codigo": codigo,
                        "detalle_url": detalle_url,
                        "entidad": entidad,
                        "objeto": objeto,
                        "estado": estado,
                        "ubicacion": ubicacion,
                        "presupuesto": presupuesto,
                        "fecha_publicacion": fecha_publicacion,
                        "semestre": label
                    })
                except Exception as e:
                    print("Error parsing row:", e)

            try:
                siguiente_button = await page.query_selector("a:has-text('Siguiente')")
                if not siguiente_button:
                    print("No 'Siguiente' button found — ending this semester")
                    break

                offset += 20
                await asyncio.sleep(2)
                await page.evaluate(f"presentarProcesos({offset})")

            except Exception as e:
                print("Error checking or clicking 'Siguiente':", e)
                break

        except Exception as e:
            print("Unexpected error while scraping:", e)
            break

    print(f"Finished scraping for: {label}")
    await asyncio.sleep(5)

Scraping First 2022 — Waiting for dates and captcha manually


---- Press enter when completed 


No 'Siguiente' button found — ending this semester
Finished scraping for: First 2022
Scraping Second 2022 — Waiting for dates and captcha manually


---- Press enter when completed 


No 'Siguiente' button found — ending this semester
Finished scraping for: Second 2022
Scraping First 2023 — Waiting for dates and captcha manually


---- Press enter when completed 


No 'Siguiente' button found — ending this semester
Finished scraping for: First 2023
Scraping Second 2023 — Waiting for dates and captcha manually


---- Press enter when completed 


No 'Siguiente' button found — ending this semester
Finished scraping for: Second 2023
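
The loop above collects every `<tr>` of every table on the page, so rows that apparently belong to the search form end up in `all_rows` with no `detalle_url`. One possible alternative is to keep only rows whose first cell contains a link. The sketch below runs offline on a made-up, heavily simplified HTML fragment; `SAMPLE_HTML` and `extract_process_rows` are hypothetical, and the real page layout may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the results page: a form table and a results table.
SAMPLE_HTML = """
<table>
  <tr><td>Entidad Contratante</td><td>Buscar Entidad</td></tr>
</table>
<table>
  <tr><th>Código</th><th>Entidad Contratante</th></tr>
  <tr>
    <td><a href="informacionProcesoContratacion2.cpe?idSoliCompra=EXAMPLE">SICM-000-2022</a></td>
    <td>SERVICIO NACIONAL DE CONTRATACION PUBLICA</td>
  </tr>
</table>
"""

def extract_process_rows(html):
    """Keep only rows whose first cell links to a process detail page."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for tr in soup.select("table tr"):
        cells = tr.find_all("td")
        if not cells:
            continue  # header row made of <th> cells
        link = cells[0].find("a", href=True)
        if link is None:
            continue  # search-form row, not a process
        records.append({
            "codigo": cells[0].get_text(strip=True),
            "detalle_url": link["href"],
            "entidad": cells[1].get_text(strip=True) if len(cells) > 1 else None,
        })
    return records

print(extract_process_rows(SAMPLE_HTML))
```

Filtering at parse time is a design choice rather than a requirement: dropping the rows with a missing `detalle_url` afterwards gives the same list of processes.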


In [5]:
df = pd.DataFrame(all_rows)
len(df)

2692

In [6]:
df.head(10)

Unnamed: 0,codigo,detalle_url,entidad,objeto,estado,ubicacion,presupuesto,fecha_publicacion,semestre
0,,,Entidad Contratante,Buscar Entidad,,Buscar Entidad,Buscar Entidad,Buscar Entidad,First 2022
1,,,Por Fechas de Publicación (*),Desde: \n Hasta:,Desde:,,,Hasta:,First 2022
2,,,,,,,,,First 2022
3,,,,,,,,,First 2022
4,,,,,,,,,First 2022
5,Código,,Entidad Contratante,Objeto del Proceso,Estado del Proceso,Provincia/Cantón,Presupuesto Referencial Unitario(sin iva),Fecha de Publicación,First 2022
6,SICM-499-2022,informacionProcesoContratacion2.cpe?idSoliComp...,SERVICIO NACIONAL DE CONTRATACION PUBLICA,DCI: LEVONORGESTREL - FORMA FARMACÉUTICA: SÓLI...,Desierto,PICHINCHA / QUITO,$62.31600,2022-06-23 08:00:00,First 2022
7,SICM-514-2022,informacionProcesoContratacion2.cpe?idSoliComp...,SERVICIO NACIONAL DE CONTRATACION PUBLICA,DCI: PREDNISOLONA - FORMA FARMACÉUTICA: SÓLIDO...,Desierto,PICHINCHA / QUITO,$0.06000,2022-06-23 08:00:00,First 2022
8,SICM-500-2022,informacionProcesoContratacion2.cpe?idSoliComp...,SERVICIO NACIONAL DE CONTRATACION PUBLICA,DCI: LIDOCAÍNA SIN EPINEFRINA - FORMA FARMACÉU...,Adjudicado oferente ganador,PICHINCHA / QUITO,$5.00000,2022-06-23 08:00:00,First 2022
9,SICM-515-2022,informacionProcesoContratacion2.cpe?idSoliComp...,SERVICIO NACIONAL DE CONTRATACION PUBLICA,DCI: PROGESTERONA - FORMA FARMACÉUTICA: SÓLIDO...,Desierto,PICHINCHA / QUITO,$0.23000,2022-06-23 08:00:00,First 2022


In [7]:
df["detalle_url"].isna().value_counts()

detalle_url
True     1706
False     986
Name: count, dtype: int64

In [8]:
df = df[df["detalle_url"].notna()]
len(df)

986

In [10]:
df.head()

Unnamed: 0,codigo,detalle_url,entidad,objeto,estado,ubicacion,presupuesto,fecha_publicacion,semestre
6,SICM-499-2022,informacionProcesoContratacion2.cpe?idSoliComp...,SERVICIO NACIONAL DE CONTRATACION PUBLICA,DCI: LEVONORGESTREL - FORMA FARMACÉUTICA: SÓLI...,Desierto,PICHINCHA / QUITO,$62.31600,2022-06-23 08:00:00,First 2022
7,SICM-514-2022,informacionProcesoContratacion2.cpe?idSoliComp...,SERVICIO NACIONAL DE CONTRATACION PUBLICA,DCI: PREDNISOLONA - FORMA FARMACÉUTICA: SÓLIDO...,Desierto,PICHINCHA / QUITO,$0.06000,2022-06-23 08:00:00,First 2022
8,SICM-500-2022,informacionProcesoContratacion2.cpe?idSoliComp...,SERVICIO NACIONAL DE CONTRATACION PUBLICA,DCI: LIDOCAÍNA SIN EPINEFRINA - FORMA FARMACÉU...,Adjudicado oferente ganador,PICHINCHA / QUITO,$5.00000,2022-06-23 08:00:00,First 2022
9,SICM-515-2022,informacionProcesoContratacion2.cpe?idSoliComp...,SERVICIO NACIONAL DE CONTRATACION PUBLICA,DCI: PROGESTERONA - FORMA FARMACÉUTICA: SÓLIDO...,Desierto,PICHINCHA / QUITO,$0.23000,2022-06-23 08:00:00,First 2022
10,SICM-516-2022,informacionProcesoContratacion2.cpe?idSoliComp...,SERVICIO NACIONAL DE CONTRATACION PUBLICA,DCI: RITUXIMAB - FORMA FARMACÉUTICA: LÍQUIDO P...,Desierto,PICHINCHA / QUITO,$1406.37000,2022-06-23 08:00:00,First 2022


In [11]:
base_url = "https://modulocomprascorporativas.compraspublicas.gob.ec/ProcesoContratacion/compras/PC/"
df["detalle_url"] = base_url + df["detalle_url"].astype(str)

In [12]:
df.head(1)

Unnamed: 0,codigo,detalle_url,entidad,objeto,estado,ubicacion,presupuesto,fecha_publicacion,semestre
6,SICM-499-2022,https://modulocomprascorporativas.compraspubli...,SERVICIO NACIONAL DE CONTRATACION PUBLICA,DCI: LEVONORGESTREL - FORMA FARMACÉUTICA: SÓLI...,Desierto,PICHINCHA / QUITO,$62.31600,2022-06-23 08:00:00,First 2022
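
Prefixing `base_url` by string concatenation works here because every `detalle_url` is relative to the same directory. A slightly more defensive sketch uses the standard library's `urljoin`, which also leaves already-absolute links untouched. `absolute_url` and the example link are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://modulocomprascorporativas.compraspublicas.gob.ec/ProcesoContratacion/compras/PC/"

def absolute_url(href):
    """Resolve a relative href against the listing directory."""
    return urljoin(BASE_URL, href)

# Hypothetical relative link; the real ones carry an opaque idSoliCompra token.
print(absolute_url("informacionProcesoContratacion2.cpe?idSoliCompra=EXAMPLE"))
```

With a DataFrame, this could be applied as `df["detalle_url"].map(absolute_url)`.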


In [13]:
len(df)

986

In [16]:
now = datetime.now()
timestamp = now.strftime("%Y-%m-%d_%H-%M")
filename = f"scraped_results_{timestamp}.csv"

In [17]:
print(filename)

scraped_results_2025-07-09_12-47.csv


In [76]:
# makedirs creates the parent "data" directory as well
os.makedirs("data/raw", exist_ok=True)

In [79]:
df.to_csv(f"data/raw/{filename}", index=False)

## Details on each contract

Now I will scrape the details on each procurement process, by looping through each URL and getting the details about each one. For those contracts that are awarded ("adjudicado"), I will also get the data on the company that got the award.

In [2]:
df = pd.read_csv("data/raw/scraped_results_2025-07-09_12-47.csv")

In [3]:
url = df["detalle_url"][1]
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"

In [4]:
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page(user_agent=user_agent)
await page.goto(url)
await asyncio.sleep(3)
await page.goto(url)

<Response url='https://modulocomprascorporativas.compraspublicas.gob.ec/ProcesoContratacion/compras/PC/informacionProcesoContratacion2.cpe?idSoliCompra=JD66Gdou-LQnI0nGPoE0wKiirN5WPYwr_xSJ3h1_dAk,' request=<Request url='https://modulocomprascorporativas.compraspublicas.gob.ec/ProcesoContratacion/compras/PC/informacionProcesoContratacion2.cpe?idSoliCompra=JD66Gdou-LQnI0nGPoE0wKiirN5WPYwr_xSJ3h1_dAk,' method='GET'>>

In [5]:
os.makedirs('data/htmls/', exist_ok=True)

In [6]:
data = []

for i, row in df.iterrows():
    
    url = row["detalle_url"]
    code = row["codigo"]

    await asyncio.sleep(1)
    print(f"\rProcessing index: {i}, {url}", end="")
        
    await page.goto(url, timeout=10000)
    await asyncio.sleep(2)
            
    await page.wait_for_selector("#tab3", state="visible", timeout=10000)
    await page.eval_on_selector("#tab3", "el => el.click()")
    await asyncio.sleep(1)
    
    html = await page.content()
    soup = BeautifulSoup(html, "lxml")

    now = datetime.now()
    timestamp = now.strftime("%Y-%m-%d_%H-%M")
    filename = f"{code}_{timestamp}.html"
    
    with open(f'data/htmls/{filename}', 'w', encoding='utf-8') as f:
        f.write(html)

    details_table = soup.select_one("table#rounded-corner tbody")
    details_row = details_table.find_all("td")
    
    item = {}
    
    item["cum_id"] = details_row[0].get_text(strip=True)
    item["principio_activo"] = details_row[1].get_text(strip=True)
    item["forma_farmaceutica"] = details_row[2].get_text(strip=True)
    item["concentracion"] = details_row[3].get_text(strip=True)
    item["presentacion"] = details_row[4].get_text(strip=True)
    item["cantidad"] = details_row[5].get_text(strip=True)
    item["precio_referencial"] = details_row[6].get_text(strip=True)
    item["subtotal"] = details_row[7].get_text(strip=True)

    if "Adjudicado" in row["estado"]:
        try:
            await page.locator("a:has-text('Ver Resultados')").click()
            await asyncio.sleep(3)
       
            html = await page.content()
            soup = BeautifulSoup(html, "lxml")
    
            now = datetime.now()
            timestamp = now.strftime("%Y-%m-%d_%H-%M")
            filename = f"adjudicacion_{code}_{timestamp}.html"
        
            with open(f'data/htmls/{filename}', 'w', encoding='utf-8') as f:
                f.write(html)

            adju_table = soup.select_one("table#tableadjudicacion")
            adju_rows = adju_table.find_all("tr")

            item["proveedor_adjudicado"] = adju_rows[0].find_all("td")[1].get_text(strip=True)
            item["valor_adjudicado"] = adju_rows[1].find_all("td")[1].get_text(strip=True)
            item["fecha_adjudicacion"] = adju_rows[3].find_all("td")[1].get_text(strip=True)

        except Exception:
            print(f"{code} no tiene 'ver resultados'")


    data.append(item)

Processing index: 293, https://modulocomprascorporativas.compraspublicas.gob.ec/ProcesoContratacion/compras/PC/informacionProcesoContratacion2.cpe?idSoliCompra=4Lqpmu2fCeOpDLalJJ0-PiKmGuOjwRb1hfoPvWKW6Bo,SICM-237-2022 no tiene 'ver resultados'
Processing index: 985, https://modulocomprascorporativas.compraspublicas.gob.ec/ProcesoContratacion/compras/PC/informacionProcesoContratacion2.cpe?idSoliCompra=J3BXaW7XUlDunC55iRiV1jsBQKoicYYiO2SUKFVTqUs,
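
The detail loop above has no error handling around `page.goto` or `wait_for_selector`, so a single timeout would stop a run of 986 pages. A small retry wrapper is one way to make it more tolerant. This is a sketch under assumptions: `with_retries` is a hypothetical helper, and `flaky_goto` only simulates a navigation that times out twice before succeeding.

```python
import asyncio

async def with_retries(make_attempt, attempts=3, delay_seconds=2.0):
    """Await make_attempt() up to `attempts` times, sleeping between failures."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return await make_attempt()
        except Exception as error:  # broad on purpose: timeouts, navigation errors
            last_error = error
            print(f"Attempt {attempt} failed: {error}")
            if attempt < attempts:
                await asyncio.sleep(delay_seconds)
    raise last_error

# Stand-in for `page.goto(url, timeout=10000)`: fails twice, then succeeds.
calls = {"count": 0}

async def flaky_goto():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"

result = asyncio.run(with_retries(flaky_goto, delay_seconds=0))
print(result, calls["count"])
```

Inside the notebook, where an event loop is already running, the call would be `await with_retries(lambda: page.goto(url, timeout=10000))` rather than `asyncio.run(...)`.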

In [7]:
for item, (_, row) in zip(data, df.iterrows()):
    item["codigo"] = row["codigo"]

In [11]:
now = datetime.now()
timestamp = now.strftime("%Y-%m-%d_%H-%M")

In [80]:
import json

with open(f"data/raw/scraped_details_{timestamp}.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

## Merge datasets

Finally, I will merge both datasets into one CSV file for analysis.

In [18]:
details = pd.DataFrame(data)

In [63]:
merged = df.join(details, rsuffix="_scraped")

In [64]:
merged.head()

Unnamed: 0,codigo,detalle_url,entidad,objeto,estado,ubicacion,presupuesto,fecha_publicacion,semestre,cum_id,...,forma_farmaceutica,concentracion,presentacion,cantidad,precio_referencial,subtotal,codigo_scraped,proveedor_adjudicado,valor_adjudicado,fecha_adjudicacion
0,SICM-499-2022,https://modulocomprascorporativas.compraspubli...,SERVICIO NACIONAL DE CONTRATACION PUBLICA,DCI: LEVONORGESTREL - FORMA FARMACÉUTICA: SÓLI...,Desierto,PICHINCHA / QUITO,$62.31600,2022-06-23 08:00:00,First 2022,G03AC03SPS113I1,...,Sólido parenteral (Implante subdérmico),150 mg (2 varillas de 75 mg),Caja x implante (s) de 75 mg c/u + trocar (es),23933,USD 62.316000,"USD 1,491,408.828000",SICM-499-2022,,,
1,SICM-514-2022,https://modulocomprascorporativas.compraspubli...,SERVICIO NACIONAL DE CONTRATACION PUBLICA,DCI: PREDNISOLONA - FORMA FARMACÉUTICA: SÓLIDO...,Desierto,PICHINCHA / QUITO,$0.06000,2022-06-23 08:00:00,First 2022,H02AB06SOR140X0,...,Sólido oral,20 mg,Caja x blíster/ristra,595189,USD 0.060000,"USD 35,711.340000",SICM-514-2022,,,
2,SICM-500-2022,https://modulocomprascorporativas.compraspubli...,SERVICIO NACIONAL DE CONTRATACION PUBLICA,DCI: LIDOCAÍNA SIN EPINEFRINA - FORMA FARMACÉU...,Adjudicado oferente ganador,PICHINCHA / QUITO,$5.00000,2022-06-23 08:00:00,First 2022,N01BB02SCT220X0,...,Sólido cutáneo (Parche transdérmico),5 %,Caja x parche/parches,184631,USD 5.000000,"USD 923,155.000000",SICM-500-2022,GRUNENTHAL ECUATORIANA CIA LTDA,USD 2.990000,2022-09-05 13:11:06
3,SICM-515-2022,https://modulocomprascorporativas.compraspubli...,SERVICIO NACIONAL DE CONTRATACION PUBLICA,DCI: PROGESTERONA - FORMA FARMACÉUTICA: SÓLIDO...,Desierto,PICHINCHA / QUITO,$0.23000,2022-06-23 08:00:00,First 2022,G03DA04SOR083X0,...,Sólido oral,100 mg,Caja x blíster/ristra,3459277,USD 0.230000,"USD 795,633.710000",SICM-515-2022,,,
4,SICM-516-2022,https://modulocomprascorporativas.compraspubli...,SERVICIO NACIONAL DE CONTRATACION PUBLICA,DCI: RITUXIMAB - FORMA FARMACÉUTICA: LÍQUIDO P...,Desierto,PICHINCHA / QUITO,$1406.37000,2022-06-23 08:00:00,First 2022,L01XC02LPR104D7,...,Líquido parenteral,"1 400 mg/11,7 mL","Caja x vial x 11,7 mL",2230,"USD 1,406.370000","USD 3,136,205.100000",SICM-516-2022,,,
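
`df.join(details, ...)` aligns the two tables by index position, which is correct as long as `data` holds exactly one entry per row of `df`, in the same order. A key-based merge is a possible alternative that does not depend on row order. The sketch below uses toy frames with made-up values and assumes `codigo` is unique per process, which has not been verified here.

```python
import pandas as pd

# Toy stand-ins for `df` and `details`; the values are made up.
listing = pd.DataFrame({
    "codigo": ["SICM-001-2022", "SICM-002-2022"],
    "estado": ["Desierto", "Adjudicado oferente ganador"],
})
scraped = pd.DataFrame({
    "codigo": ["SICM-002-2022", "SICM-001-2022"],  # deliberately out of order
    "cantidad": ["200", "100"],
})

# validate="one_to_one" raises an error if a codigo repeats on either side.
merged_by_key = listing.merge(scraped, on="codigo", how="left", validate="one_to_one")
print(merged_by_key)
```

Merging on the key also avoids the duplicated `codigo_scraped` column that the index-based join produces.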


In [68]:
merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 986 entries, 0 to 985
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   codigo                986 non-null    object
 1   detalle_url           986 non-null    object
 2   entidad               986 non-null    object
 3   objeto                986 non-null    object
 4   estado                986 non-null    object
 5   ubicacion             986 non-null    object
 6   presupuesto           986 non-null    object
 7   fecha_publicacion     986 non-null    object
 8   semestre              986 non-null    object
 9   cum_id                986 non-null    object
 10  principio_activo      986 non-null    object
 11  forma_farmaceutica    986 non-null    object
 12  concentracion         986 non-null    object
 13  presentacion          986 non-null    object
 14  cantidad              986 non-null    object
 15  precio_referencial    986 non-null    ob

In [69]:
merged = merged.drop(columns=["codigo_scraped"])

In [70]:
# Check the one item that raised an error while scraping
merged.iloc[293]

codigo                                                      SICM-237-2022
detalle_url             https://modulocomprascorporativas.compraspubli...
entidad                         SERVICIO NACIONAL DE CONTRATACION PUBLICA
objeto                  DCI: METRONIDAZOL - FORMA FARMACÉUTICA: SÓLIDO...
estado                                        Adjudicado oferente ganador
ubicacion                                               PICHINCHA / QUITO
presupuesto                                                      $0.12500
fecha_publicacion                                     2022-05-03 08:00:00
semestre                                                       First 2022
cum_id                                                    G01AF01SVG241X0
principio_activo                                             Metronidazol
forma_farmaceutica                                         Sólido vaginal
concentracion                                                      500 mg
presentacion                          

In [71]:
# Check the URL manually: it is an awarded contract, but the supplier was not captured
merged.iloc[293]["detalle_url"]

'https://modulocomprascorporativas.compraspublicas.gob.ec/ProcesoContratacion/compras/PC/informacionProcesoContratacion2.cpe?idSoliCompra=4Lqpmu2fCeOpDLalJJ0-PiKmGuOjwRb1hfoPvWKW6Bo,'

In [81]:
merged.to_csv("data/raw/complete_scraped_data.csv", index=False)
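
The `merged.info()` output shows the columns as `object`, that is, text, so amounts such as `presupuesto` or `subtotal` need converting before any arithmetic. A minimal sketch, assuming the comma is a thousands separator and the period a decimal point, as the sample rows suggest; `parse_amount` is a hypothetical helper.

```python
from decimal import Decimal

def parse_amount(text):
    """Convert strings such as '$62.31600' or 'USD 1,491,408.828000' to Decimal."""
    cleaned = text.replace("USD", "").replace("$", "").replace(",", "").strip()
    return Decimal(cleaned)

# Amount formats as they appear in the listing and detail tables.
for raw in ["$62.31600", "USD 62.316000", "USD 1,491,408.828000"]:
    print(raw, "->", parse_amount(raw))
```

`Decimal` is used here instead of `float` to keep the six-decimal unit prices exact; for quick exploratory work a float conversion may be enough.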