# Smart Consultation of Colombia's National Development Plan

This notebook showcases the development of an intelligent query tool based on Retrieval Augmented Generation, combining text generation capabilities of a large language model (OpenAI `gpt-4o-mini`), an embedding model for vectorization (OpenAI `text-embedding-3-large`) and the efficiency of a vector database (`ChromaDB`).

The goal of this tool is to make [Colombia's National Development Plan](https://colaboracion.dnp.gov.co/CDT/Prensa/Publicaciones/plan-nacional-de-desarrollo-2022-2026-colombia-potencia-mundial-de-la-vida.pdf) easier to access and understand by allowing users to ask questions in natural language and receive clear, precise and context-aware answers about the document's topics, goals and policies.

This project is part of the final assignment for Kaggle's [Gen AI Intensive Course 2025Q1](https://www.kaggle.com/competitions/gen-ai-intensive-course-capstone-2025q1#submission-instructions). The main concepts covered in the course and applied in this project are:

✅ Embeddings (for text vectorization)

✅ Vector search/vector store/vector database (with ChromaDB as the vector database)

✅ Retrieval Augmented Generation (RAG)

✅ Grounding


### Setup

1. Run the following command to install the required dependencies.

In [None]:
!pip install langchain langchain_openai langchain_community langchain_text_splitters pypdf langchain_chroma

2. The data used for this project is Colombia's National Development Plan (2022-2026). Download it from this [link](https://colaboracion.dnp.gov.co/CDT/Prensa/Publicaciones/plan-nacional-de-desarrollo-2022-2026-colombia-potencia-mundial-de-la-vida.pdf) and put it in the same folder as this notebook.

3. Set the API key used by the OpenAI embedding model and the LLM. Store it in the `OPENAI_API_KEY` environment variable rather than pasting it into the notebook.

In [1]:
import os

# Read the key from the environment so it is never stored in the notebook
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

### Loading the pdf document

In [2]:
from langchain_community.document_loaders import PyPDFLoader

file = "plan-nacional-de-desarrollo-2022-2026-colombia-potencia-mundial-de-la-vida.pdf"
loader = PyPDFLoader(file)
doc = loader.load()

print(len(doc)) # Number of pages in the PDF

848


### Split the document into chunks for vectorization

Since the source document is too large to process as a whole (848 pages), it's necessary to split it into smaller chunks before generating embeddings. This chunking process enables efficient retrieval and allows accurate search when answering user queries.

In [3]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

chunks = text_splitter.split_documents(doc)
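
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of overlapping windows. It is a simplification, not the splitter's actual algorithm: `RecursiveCharacterTextSplitter` tries to break on separators such as paragraphs and line breaks first, while the hypothetical `sliding_window_chunks` helper below cuts at fixed character offsets.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap (illustrative only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return windows


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(demo_chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each window repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. That is the motivation for the 200-character overlap used above.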

### Create the vector database

A local ChromaDB **Vector Database** is created to store the document's embeddings for later retrieval.

In [5]:
import chromadb

persistent_vectorstore = chromadb.PersistentClient(path="./vectordb")

### Create the embeddings model

In [4]:
from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY, model = "text-embedding-3-large")

### Vectorize the chunks and save them in the Chroma database

In the following section, the chunks from the source document are transformed into **Embeddings**, using OpenAI's `text-embedding-3-large` model. The resulting embeddings are then stored in a local ChromaDB **Vector Database**.

In [7]:
from langchain_chroma import Chroma

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings_model,
    persist_directory="./vectordb"
)

### Retrieve information from the Vector Database

When a user queries the source document, we run a **Vector Search** in the Chroma vector database to find the three chunks most relevant to the user's query, as shown in the section below.

In [8]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

retriever.invoke("Que se planea hacer para proteger a los ciudadanos del desempleo?")

[Document(id='f068835d-658d-4258-8207-76681b3d9821', metadata={'creationdate': '2023-12-27T17:04:13-05:00', 'creator': 'Adobe InDesign 16.4 (Windows)', 'moddate': '2024-02-12T14:56:59+00:00', 'page': 81, 'page_label': '82', 'producer': 'iLovePDF', 'source': 'plan-nacional-de-desarrollo-2022-2026-colombia-potencia-mundial-de-la-vida.pdf', 'total_pages': 848, 'trapped': '/False'}, page_content='82\nPLAN NACIONAL DE DESARROLLO • 2022-2026\nd.  Esquema de protección al desempleo\nSe diseñará un esquema de protección contra el desempleo redefiniendo el Mecanismo \nde Protección al Cesante (MPC), que responderá a las necesidades de la población \ndesempleada y cesante, incluyendo trabajadores formales e informales. Se tendrán \nen cuenta las brechas que existen en las distintas poblaciones (como jóvenes, \nmujeres, personas mayores, con discapacidad y personas LGBTIQ+, entre otras). \nDicho esquema contemplará: (i) La exploración de nuevas formas de financiamiento \npara quienes no acceden a
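
Conceptually, the retriever embeds the query and returns the `k` stored chunks whose vectors are closest to it. The sketch below illustrates that idea with hand-made three-dimensional vectors and cosine similarity. Both are assumptions for illustration: real embeddings have far more dimensions, and the distance metric and index that Chroma actually uses depend on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the ids of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy "embeddings" keyed by a hypothetical chunk topic
toy_vectors = {
    "unemployment": [0.9, 0.1, 0.0],
    "water": [0.0, 0.2, 0.9],
    "labor": [0.7, 0.3, 0.1],
    "energy": [0.1, 0.9, 0.2],
}
query_vector = [1.0, 0.2, 0.0]  # pretend embedding of a question about jobs
nearest = top_k(query_vector, toy_vectors, k=3)
print(nearest)
# ['unemployment', 'labor', 'energy']
```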

### Create the LLM to chat with the document

In the following section, we use OpenAI's `gpt-4o-mini` model to formulate a clear, precise and context-aware answer that the user can easily understand.

In [9]:
from langchain_openai import ChatOpenAI

llm_model = ChatOpenAI(model="gpt-4o-mini", temperature=0.2, openai_api_key=OPENAI_API_KEY)

### Define the prompt for the LLM model

In [42]:
from langchain_core.prompts import ChatPromptTemplate  
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

system_message = (
    '''Eres un asistente experto en políticas públicas encargado de responder preguntas de los ciudadanos 
    con respecto al plan de desarrollo de colombia. 
    A partir de la siguiente informacion extraída del plan de desarrollo:

    {context}

    Responde de manera clara y concisa, utilizando un lenguaje accesible para el público en general, las preguntas de un ciudadano.

    '''
)


prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_message), 
        ("human", "{input}"),
     ]
)



In the section below, we use LangChain chains to connect the retriever, which searches the vector database, with the LLM, which receives the retrieved information and writes an easy-to-understand answer, thus forming the **RAG** pipeline.

In [43]:
chain = create_stuff_documents_chain(llm_model, prompt)
rag_chain = create_retrieval_chain(retriever, chain)
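
As a rough mental model of what the two chained steps do, the sketch below retrieves chunks, "stuffs" them into the `{context}` slot of the system prompt, and passes the result to a generator. It is a simplified stand-in under stated assumptions, not LangChain's implementation: `toy_rag`, `fake_retrieve` and `fake_generate` are hypothetical names, and the blank-line separator between chunks is assumed. The returned keys mirror the `input`, `context` and `answer` keys seen in the notebook's results.

```python
def toy_rag(question, retrieve, generate, system_template):
    """Retrieve chunks, stuff them into the prompt, then generate an answer."""
    docs = retrieve(question)
    context = "\n\n".join(docs)  # assumed separator between stuffed chunks
    messages = [
        ("system", system_template.format(context=context)),
        ("human", question),
    ]
    return {"input": question, "context": docs, "answer": generate(messages)}


# Hypothetical stand-ins for the retriever and the LLM
def fake_retrieve(question):
    return ["Chunk about unemployment protection.", "Chunk about job training."]


def fake_generate(messages):
    system_text = messages[0][1]
    return f"Answer based on {system_text.count('Chunk')} chunks."


demo = toy_rag(
    "What is planned against unemployment?",
    fake_retrieve,
    fake_generate,
    "Use this context:\n\n{context}",
)
print(demo["answer"])
# Answer based on 2 chunks.
```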

### Chat with the RAG

In [None]:
# Replace with your user input
user_input = "Que se planea hacer para proteger a los ciudadanos del desempleo?" 
# In English: "What is planned to protect citizens from unemployment?"

In [100]:
# Replace with your user input
user_input = "What does the plan say about the vital minimum of water?"
# In Spanish: "¿Qué dice el plan sobre el mínimo vital del agua?"


**Note**: this project was designed with Colombian citizens as its users. They speak Spanish and Colombia's National Development Plan is also written in Spanish, but Kaggle's Capstone Project is required to be in English. The following function therefore translates the user input into Spanish, so that the retriever and the model can work with it, and translates the model's output into English to meet the capstone project requirements.

In [73]:
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

llm_translator = ChatOpenAI(model="gpt-4o-mini", temperature=0, openai_api_key=OPENAI_API_KEY)

def translate(text, target_lang):
    m = [
        SystemMessage(content=f"You are a translation assistant that translates text to {target_lang}. First, you have to verify if the text given to you is already in the target language. If it is, you don't have to do anything else, just return the text as it was given to you. If it is not, you have to translate it to the target language."),
        HumanMessage(content=f"Please translate the following text: {text}")
    ]

    response_t = llm_translator.invoke(m)
    return response_t.content.strip()



Translate the user input to Spanish for the model to understand it, in case the user writes it in another language:

In [101]:
target_lang = "Spanish"
translated_input = translate(user_input, target_lang)

In [102]:
translated_input

'¿Qué dice el plan sobre el mínimo vital de agua?'

Use the RAG chain to search the vector database for the answer to the user's query:

In [106]:
results = rag_chain.invoke({"input": translated_input})

In [107]:
results

{'input': '¿Qué dice el plan sobre el mínimo vital de agua?',
 'context': [Document(id='c226bb89-6a02-4f3a-8905-8766889402ee', metadata={'creationdate': '2023-12-27T17:04:13-05:00', 'creator': 'Adobe InDesign 16.4 (Windows)', 'moddate': '2024-02-12T14:56:59+00:00', 'page': 733, 'page_label': '734', 'producer': 'iLovePDF', 'source': 'plan-nacional-de-desarrollo-2022-2026-colombia-potencia-mundial-de-la-vida.pdf', 'total_pages': 848, 'trapped': '/False'}, page_content='734\nPLAN NACIONAL DE DESARROLLO • 2022-2026\nEn todo caso, en la determinación de las tarifas, se observarán los principios de \nequidad, suficiencia y moderación y se podrán establecer rangos diferenciales según la \nnaturaleza de los riesgos.\nARTÍCULO 192.o. GARANTÍA DEL ACCESO A AGUA Y SANEAMIENTO \nBÁSICO. El Ministerio de Vivienda, Ciudad y Territorio definirá las condiciones \npara asegurar de manera efectiva al acceso a agua y al saneamiento básico en \naquellos eventos en donde no sea posible mediante la prestaci

To enhance transparency and usability, **Grounding** was applied in the following section. The response generated by the model is accompanied by references to the specific pages of the source document, allowing users to easily verify and explore the original content behind the answer.

In [108]:
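# Single quotes inside the f-string keep this valid on Python versions before 3.12
answer_spanish = f"Respuesta: {results['answer']}\n\n"
pages = ""
# Collect the page label of every retrieved chunk used as context
for source_doc in results["context"]:
    pages += source_doc.metadata.get("page_label", "") + " "
answer_spanish += f"Tomado de las páginas: {pages}"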



This outputs the answer of the RAG in Spanish:

In [109]:
print(answer_spanish)

Respuesta: El Plan Nacional de Desarrollo de Colombia establece que el derecho humano al agua debe ser garantizado de manera integral, asegurando su disponibilidad, acceso y calidad. Para esto, se implementará el concepto de "mínimo vital de agua", que busca satisfacer las necesidades de la población más vulnerable. Esto implica que se desarrollarán propuestas normativas que establezcan lineamientos para garantizar el acceso al agua y al saneamiento básico en el país, utilizando esquemas diferenciales y asegurando el suministro adecuado.

Tomado de las páginas: 734 734 113 
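
The page list above shows 734 twice because two of the three retrieved chunks come from the same page. A minimal sketch of one way to collapse such repeats while keeping the retrieval order is shown below; `format_page_references` is a hypothetical helper, not part of the notebook.

```python
def format_page_references(page_labels):
    """Drop empty and repeated page labels while preserving first-seen order."""
    unique_labels = list(dict.fromkeys(label for label in page_labels if label))
    return ", ".join(unique_labels)


labels = ["734", "734", "113"]
print(format_page_references(labels))
# 734, 113
```

In the notebook, the labels could be collected with `[d.metadata.get("page_label", "") for d in results["context"]]` before being passed to the helper.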


Translate the answer to English:

In [110]:
target_lang = "English"
answer_english = translate(answer_spanish, target_lang)

In [111]:
print(answer_english)

Response: The National Development Plan of Colombia establishes that the human right to water must be guaranteed in a comprehensive manner, ensuring its availability, access, and quality. To this end, the concept of "vital minimum of water" will be implemented, which seeks to meet the needs of the most vulnerable population. This implies that regulatory proposals will be developed to establish guidelines to guarantee access to water and basic sanitation in the country, using differential schemes and ensuring adequate supply.

Taken from pages: 734 734 113
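
The individual cells above can be combined into a single helper. The sketch below takes the translation and RAG steps as parameters so that it runs with simple stand-ins. `answer_question`, `fake_translate` and `fake_rag` are hypothetical names introduced for illustration.

```python
from types import SimpleNamespace


def answer_question(question, translate_fn, rag_fn, answer_lang="English"):
    """Translate the question to Spanish, run RAG, add page references, translate back."""
    spanish_question = translate_fn(question, "Spanish")
    results = rag_fn({"input": spanish_question})
    pages = " ".join(d.metadata.get("page_label", "") for d in results["context"])
    answer_spanish = f"Respuesta: {results['answer']}\n\nTomado de las páginas: {pages}"
    return translate_fn(answer_spanish, answer_lang)


# Stand-ins so the sketch runs without an API key or a vector database
def fake_translate(text, target_lang):
    return f"[{target_lang}] {text}"


def fake_rag(payload):
    return {
        "input": payload["input"],
        "answer": "respuesta de prueba",
        "context": [
            SimpleNamespace(metadata={"page_label": "734"}),
            SimpleNamespace(metadata={"page_label": "113"}),
        ],
    }


demo_answer = answer_question("What about water?", fake_translate, fake_rag)
print(demo_answer)
```

With the objects defined in this notebook, the call would be `answer_question(user_input, translate, rag_chain.invoke)`.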
