<a href="https://colab.research.google.com/github/valeromora/TAM_2025-1/blob/main/Final%20Proyect/Final_proyect_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

In [None]:
# Install necessary libraries if they are not already installed
!pip install PyDrive

In [19]:
!pip install nltk



In [26]:
!pip install -U sentence-transformers faiss-cpu


In [28]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
import zipfile
import os
import pandas as pd
import nltk
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Database

In [6]:
# https://drive.google.com/file/d/1x6HASgMM9WqaglFR-zH2_LU6eRs4l_tq/view?usp=sharing

# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# Download the file
file_id = '1x6HASgMM9WqaglFR-zH2_LU6eRs4l_tq'
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('downloaded_file.zip')

# Unzip the file
with zipfile.ZipFile('downloaded_file.zip', 'r') as zip_ref:
  zip_ref.extractall('/content/')

# List the extracted files to confirm
!ls /content/

'12EE54A_6.2.1 - Safely managed sanitation & hand-washing'
'1548EA3_6.1.1 - Safely managed drinking water'
'16BBF41_3.4.2 - Suicide'
'1772666_3.1.2 - Births attended by skilled health personnel'
'1F96863_3.4.1 - cardiovascular disease, cancer, diabetes or chronic respiratory disease'
'217795A_3.C.1 - Health worker density and distribution'
'2322814_3.2.1 - Under-five mortality rate'
'2D6FBE4_3.3.5 - Neglected tropical diseases'
'361734E_16.1.1 - Intentional homicide'
'442CEA8_3.3.3 - Malaria'
'45CA7C8_3.C.1 - Health worker density and distribution'
'5C8435F_3.C.1 - Health worker density and distribution'
'5F8A486_2.2.1 - Stunting'
 608DE39
'6A64C9A_7.1.2 - Clean fuels'
'75DDA77_3.A.1 - Age-standardized prevalence of tobacco use'
'77D059C_3.3.1 - HIV infections'
'8074BD9_3.7.1 - Women satisfied with modern methods'
'84FD3DE_3.9.3 - Unintentional poisoning'
'A37BDD6_6.3.1 - Safely treated wastewater'
'A4C49D3_3.2.2- Neonatal mortality rate'
'AC597B1_3.1.1 - Maternal mortality ratio'
'B9C

# File preparation
* Read each file according to its type (code list, dataset, metadata, dictionary).

* Convert each one to readable text: pandas loads the file, and then each type is rendered differently:

  * `Metadata`: treated as key-value pairs.

  * `Code list`: treated as a glossary.

  * `Data Dictionary`: column definitions.

  * `Dataset`: automatically generated sentences (summaries).

* Prepare the resulting texts for vectorization.
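
As an illustration of the row-to-text step, here is a minimal sketch that renders any CSV row as `column: value` pairs instead of relying on a sentence template tied to one indicator. It is not the notebook's method: it uses the standard-library `csv` module rather than pandas to stay dependency-free, the helper name `rows_to_text` is hypothetical, and the sample rows (including the `VALUE` column) are made up.

```python
import csv
import io


def rows_to_text(csv_text: str, label: str) -> str:
    """Turn every CSV row into one 'column: value' line, skipping empty cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = [f"{label}:"]
    for row in reader:
        pairs = [f"{col}: {val}" for col, val in row.items() if val not in (None, "")]
        if pairs:
            lines.append("; ".join(pairs))
    return "\n".join(lines)


# Hypothetical two-row sample; the VALUE column and all values are invented.
sample = "GEO_NAME_SHORT,DIM_TIME,VALUE\nColombia,2018,\nColombia,2020,12.5\n"
text = rows_to_text(sample, "Dataset sample")
print(text)
```

Because the column names come from the file itself, the same function can be applied to datasets with different schemas without producing placeholder values.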

In [14]:
# Base path where the extracted folders live
base_path = "/content"

# Final list of documents (text ready to be vectorized)
documents = []

# Walk through every folder. Note: /content also holds Colab's sample_data
# folder, whose CSV files are reported as unknown types.
for folder in os.listdir(base_path):
    folder_path = os.path.join(base_path, folder)
    if not os.path.isdir(folder_path):
        continue  # skip loose files

    # Walk through the files inside each folder
    for filename in os.listdir(folder_path):
        if not filename.endswith(".csv"):
            continue

        file_path = os.path.join(folder_path, filename)

        # Try to read the file
        try:
            df = pd.read_csv(file_path)
        except Exception as e:
            print(f"No se pudo leer {filename}: {e}")
            continue

        text_content = ""
        if "Metadata" in filename:
            text_content += f"🧾 Metadata extraída de {filename}:\n"
            for row in df.itertuples(index=False):
                if len(row) >= 2:
                    text_content += f"{row[0]}: {row[1]}\n"

        elif "Code list" in filename:
            text_content += f"📘 Glosario de códigos extraído de {filename}:\n"
            for row in df.itertuples(index=False):
                if len(row) >= 4:
                    dim, key, name, desc = row[:4]
                    text_content += f"{key} ({dim}): {name} → {desc}\n"

        elif "Dictionary" in filename:  # also matches "Data Dictionary"
            text_content += f"📚 Diccionario de variables de {filename}:\n"
            for row in df.itertuples(index=False):
                if len(row) >= 2:
                    var, definition = row[:2]
                    text_content += f"{var}: {definition}\n"

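        elif "Dataset" in filename:
            text_content += f"📊 Datos resumidos de {filename}:\n"
            # NOTE: this sentence template is written for the clean-fuels indicator.
            # It is applied to every dataset, and any column missing from the file
            # falls back to the placeholder defaults passed to getattr.
            for row in df.itertuples(index=False):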
                try:
                    country = getattr(row, "GEO_NAME_SHORT", "un país")
                    year = getattr(row, "DIM_TIME", "año desconocido")
                    urb = getattr(row, "Degree_of_urbanization", "zona")
                    percent = getattr(row, "PERCENT_POP_N", "X")
                    text_content += f"En {year}, el {percent}% de la población en {urb} de {country} usaba combustibles limpios.\n"
                except Exception:
                    continue

        else:
            print(f"Archivo sin tipo conocido: {filename}")
            continue

        # Store the content as one document
        documents.append({
            "source": f"{folder}/{filename}",
            "content": text_content
        })

Archivo sin tipo conocido: mnist_test.csv
Archivo sin tipo conocido: california_housing_train.csv
Archivo sin tipo conocido: california_housing_test.csv
Archivo sin tipo conocido: mnist_train_small.csv


In [18]:
print(f"Total documentos extraídos: {len(documents)}")
print(documents[0]['source'])
print(documents[0]['content'][:1000])  # show the first 1000 characters of the first document

Total documentos extraídos: 135
1772666_3.1.2 - Births attended by skilled health personnel/170_1772666_Metadata_2025-04-07.csv
🧾 Metadata extraída de 170_1772666_Metadata_2025-04-07.csv:
Name: Proportion of births attended by skilled health personnel (%)
Short name: Proportion of births attended by skilled health personnel (%)
Indicator unique identifier: 1772666
Indicator codes: MDG_0000000025
Also known as: SDG indicator 3.1.2
SDG Goal: 3.1.2 – Births attended by skilled health personnel
Short description: Proportion of births attended by skilled health personnel.
(SDG 3.1.2)
Definition: Proportion of births attended by skilled health personnel (generally doctors, nurses or midwives but can refer to other health professionals providing childbirth care) is the proportion of childbirths attended by professional health personnel.
According to the current definition (1) these are competent maternal and newborn health (MNH) professionals educated, trained and regulated to national and 

# Chunking

The goal is to split each document's content into smaller pieces.
* Split the text into groups of up to 500 words.

* Repeat the last 50 words of the previous chunk at the start of the next one (to preserve context).
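
The two rules above can be sketched as a small standalone function. This is a minimal illustration under the same 500/50 settings; the name `chunk_words` and the toy input are hypothetical. The window advances by `max_words - overlap` words, which is what makes consecutive chunks share 50 words.

```python
def chunk_words(text: str, max_words: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of max_words words that share `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far the window start moves each time
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), step)]


# Toy document of 1000 numbered "words" so the chunk boundaries are easy to inspect.
toy = " ".join(f"w{n}" for n in range(1000))
toy_chunks = chunk_words(toy)

print(len(toy_chunks))                 # windows start at words 0, 450 and 900
print(toy_chunks[1].split()[0])        # first word of the second chunk
print(len(toy_chunks[2].split()))      # the final chunk is shorter than 500 words
```

Note that the last window can be much shorter than the others; whether to merge such a tail into the previous chunk is a design choice this sketch leaves open.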

In [24]:
MAX_CHUNK_WORDS = 500
OVERLAP = 50

chunks = []

for doc in documents:
    source = doc["source"]
    words = doc["content"].split()

    for i in range(0, len(words), MAX_CHUNK_WORDS - OVERLAP):
        chunk_words = words[i:i + MAX_CHUNK_WORDS]
        chunk_text = ' '.join(chunk_words)

        chunks.append({
            "source": source,
            "content": chunk_text
        })

In [25]:
print(f"Total de chunks: {len(chunks)}")
print("Ejemplo de chunk:\n", chunks[0]["content"][:700])

Total de chunks: 261
Ejemplo de chunk:
 🧾 Metadata extraída de 170_1772666_Metadata_2025-04-07.csv: Name: Proportion of births attended by skilled health personnel (%) Short name: Proportion of births attended by skilled health personnel (%) Indicator unique identifier: 1772666 Indicator codes: MDG_0000000025 Also known as: SDG indicator 3.1.2 SDG Goal: 3.1.2 – Births attended by skilled health personnel Short description: Proportion of births attended by skilled health personnel. (SDG 3.1.2) Definition: Proportion of births attended by skilled health personnel (generally doctors, nurses or midwives but can refer to other health professionals providing childbirth care) is the proportion of childbirths attended by professional heal


# Vectorization (embeddings)
The retrieval step in a RAG system does not look for answers by exact word matching; it compares meanings. To do that, each text fragment (chunk) must be represented as a vector in a multidimensional space. The most relevant texts can then be found by comparing semantic distances.

We use the all-MiniLM-L6-v2 model from the sentence-transformers library, which turns each chunk of text into a 384-dimensional numeric vector that summarizes its content semantically. We then extract the text of the previously generated chunks and vectorize it.
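
To make "comparing meanings" concrete, the sketch below computes cosine similarity, one common way to compare embedding vectors, in plain Python. The three-dimensional vectors are invented for illustration (real all-MiniLM-L6-v2 vectors have 384 dimensions), and the function name is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings" standing in for real model output.
water_query = [0.9, 0.1, 0.0]
water_chunk = [0.8, 0.2, 0.1]
homicide_chunk = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(water_query, water_chunk)
sim_unrelated = cosine_similarity(water_query, homicide_chunk)
print(sim_related > sim_unrelated)
```

A vector pointing in nearly the same direction as the query scores close to 1, while an unrelated one scores close to 0.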

In [29]:
# Load the model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Extract only the text of each chunk
texts = [chunk['content'] for chunk in chunks]

# Vectorize the texts
embeddings = model.encode(texts, show_progress_bar=True)

# Convert to a float32 NumPy matrix, the dtype FAISS expects
embeddings_np = np.asarray(embeddings, dtype="float32")


We use FAISS (Facebook AI Similarity Search) to store the vectors and run similarity searches. This lets us retrieve the chunks closest in meaning to a query.
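
Conceptually, a flat L2 index performs an exhaustive comparison of the query against every stored vector. The sketch below reproduces that idea in plain Python so the arithmetic is visible; it is not FAISS code, and `l2_search` and the 2-dimensional vectors are hypothetical. It ranks by squared Euclidean distance, which as far as I know is also what `IndexFlatL2` reports.

```python
def l2_search(vectors, query, k):
    """Return (squared_distances, indices) of the k vectors closest to query."""
    scored = sorted(
        (sum((v - q) ** 2 for v, q in zip(vec, query)), idx)
        for idx, vec in enumerate(vectors)
    )
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Three made-up stored vectors and one query.
stored = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
dists, ids = l2_search(stored, [1.0, 1.0], k=2)
print(ids, dists)
```

Smaller distances mean closer matches, so results come back ordered from most to least similar.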

In [30]:
# Create a FAISS index for similarity search (exact L2 distance)
dimension = embeddings_np.shape[1]
index = faiss.IndexFlatL2(dimension)

# Add the vectors
index.add(embeddings_np)

In [31]:
# Keep the metadata (can be useful for citing sources in answers)
chunk_sources = [chunk["source"] for chunk in chunks]

With the index built, we can now query the system in natural language. FAISS returns the chunks whose embeddings are closest to the query embedding, which serves as a proxy for semantic similarity between the query and the vectorized documents.
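
When inspecting results it can help to keep the distance next to each source, since the distance shows how strong a match actually is. The sketch below assumes inputs shaped like the `(distances, indices)` pair returned by a search for a single query; `collect_hits` and the demo data are hypothetical, and plain lists stand in for the NumPy arrays.

```python
def collect_hits(distances, indices, chunks, preview=80):
    """Pair each returned index with its distance, source and a text preview."""
    hits = []
    for dist, idx in zip(distances[0], indices[0]):
        if idx < 0:  # assumption: -1 marks a slot with no result (fewer than k vectors)
            continue
        hits.append({
            "distance": float(dist),
            "source": chunks[idx]["source"],
            "preview": chunks[idx]["content"][:preview],
        })
    return hits


# Made-up stand-ins shaped like search output for one query with k=3.
demo_chunks = [
    {"source": "a/Metadata.csv", "content": "Drinking water definition"},
    {"source": "b/Dataset.csv", "content": "Rows about sanitation"},
]
demo_distances = [[0.42, 1.37, 3.4e38]]
demo_indices = [[1, 0, -1]]

hits = collect_hits(demo_distances, demo_indices, demo_chunks)
for hit in hits:
    print(f"{hit['distance']:.2f}  {hit['source']}  {hit['preview']}")
```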

In [32]:
# Example search
# The query is Spanish for "Percentage of the population with access to drinking water"
query = "Porcentaje de población con acceso a agua potable"
query_vec = model.encode([query])

# Find the 5 most similar chunks
distances, indices = index.search(query_vec, k=5)

# Show the results
for i in indices[0]:
    print("🔹 Source:", chunks[i]["source"])
    print(chunks[i]["content"][:300])
    print("-" * 80)

🔹 Source: 8074BD9_3.7.1 - Women satisfied with modern methods/170_8074BD9_Dataset_2024-01-08.csv
📊 Datos resumidos de 170_8074BD9_Dataset_2024-01-08.csv: En 2009-2010, el X% de la población en zona de Colombia usaba combustibles limpios. En 2015-2016, el X% de la población en zona de Colombia usaba combustibles limpios. En 2000, el X% de la población en zona de Colombia usaba combustibles limpi
--------------------------------------------------------------------------------
🔹 Source: E0D4E17_5.2.2 - Sexual violence by persons other than an intimate partner (last 12 months)/170_E0D4E17_Dataset_2024-01-08.csv
📊 Datos resumidos de 170_E0D4E17_Dataset_2024-01-08.csv: En 2018, el X% de la población en zona de Colombia usaba combustibles limpios.
--------------------------------------------------------------------------------
🔹 Source: A37BDD6_6.3.1 - Safely treated wastewater/170_A37BDD6_Metadata_2024-01-08.csv
are covered by the use of private systems using non-public/drinking water supply