<a href="https://colab.research.google.com/github/ygmanuelog/Task-1-news-query_RPP-lab/blob/main/Task_1_news_query_RPP_lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q feedparser tiktoken sentence-transformers chromadb langchain pandas tqdm matplotlib seaborn
!pip install -U langchain-community

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-community 0.4 requires langchain-core<2.0.0,>=1.0.0, but you have langchain-core 0.3.79 which is incompatible.
langchain-classic 1.0.0 requires langchain-core<2.0.0,>=1.0.0, but you have langchain-core 0.3.79 which is incompatible.
langchain-classic 1.0.0 requires langchain-text-splitters<2.0.0,>=1.0.0, but you have langchain-text-splitters 0.3.11 which is incompatible.
Collecting langchain-core<2.0.0,>=1.0.0 (from langchain-community)
  Using cached langchain_core-1.0.0-py3-none-any.whl.metadata (3.4 kB)
Collecting langchain-text-splitters<2.0.0,>=1.0.0 (from langchain-classic<2.0.0,>=1.0.0->langchain-community)
  Using cached langchain_text_splitters-1.0.0-py3-none-any.whl.metadata (2.6 kB)
Using cached langchain_core-1.0.0-py3-none-any.whl (467 kB)
Using cached langchain_text_splitters-1

In [2]:
import feedparser
import pandas as pd
import tiktoken
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
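# These integrations live in langchain-community, not in the core `langchain` package
from langchain_community.vectorstores import Chroma as LCChroma
from langchain_community.embeddings import HuggingFaceEmbeddings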
from tqdm.auto import tqdm
import numpy as np
import shutil
import os

***Load Data***

In [3]:
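def load_rss(feed_url="https://rpp.pe/rss", limit=50):
    """Fetch an RSS feed and return up to `limit` entries as a DataFrame."""
    feed = feedparser.parse(feed_url)
    data = []
    for entry in feed.entries[:limit]:
        # .get() avoids an AttributeError when an entry lacks a field
        data.append({
            "title": entry.get("title", ""),
            "description": entry.get("description", ""),
            "link": entry.get("link", ""),
            "published": entry.get("published", "")
        })
    df = pd.DataFrame(data)
    print(f"{len(df)} noticias cargadas desde RPP")  # "news items loaded from RPP"
    return df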
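
RSS descriptions sometimes carry HTML tags or escaped entities, which would end up in the embedded text. The sketch below shows one way to normalize a description before it is stored. `clean_description` is a hypothetical helper, not part of the notebook, and whether the RPP feed actually contains markup is an assumption that has not been checked here.

```python
import html
import re

# Matches anything that looks like an HTML tag, e.g. <p> or <img src="...">
_TAG_RE = re.compile(r"<[^>]+>")

def clean_description(raw):
    """Strip HTML tags, decode entities, and collapse runs of whitespace.

    Hypothetical helper for illustration; accepts None and returns "".
    """
    no_tags = _TAG_RE.sub(" ", raw or "")
    decoded = html.unescape(no_tags)
    return " ".join(decoded.split())
```

If it proves useful, it could be applied inside `load_rss` with `clean_description(entry.get("description", ""))`.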


***Tokenization***

In [4]:
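def count_tokens(text, model="gpt-3.5-turbo"):
    """Return how many tokens `text` occupies under the tiktoken encoding for `model`."""
    # Note: this counts OpenAI-style tokens, not the tokens seen by the MiniLM embedder
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))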


***Embedding***

In [5]:
model_name = "sentence-transformers/all-MiniLM-L6-v2"
embedder = SentenceTransformer(model_name)
print(f"Modelo de embeddings cargado: {model_name}")


Modelo de embeddings cargado: sentence-transformers/all-MiniLM-L6-v2


***Create or Upsert Chroma Collection***

In [6]:
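# In-memory client: the collection is lost when the runtime restarts
client = chromadb.Client()
collection = client.get_or_create_collection("rpp_news")

def store_embeddings(df):
    """Embed each description and upsert it, with its metadata, into the collection."""
    texts = df["description"].tolist()
    embeddings = embedder.encode(texts)
    collection.upsert(
        documents=texts,
        embeddings=embeddings.tolist(),
        # Positional IDs: a later run overwrites id_0..id_{n-1} with the new feed
        ids=[f"id_{i}" for i in range(len(df))],
        metadatas=df.to_dict(orient="records")
    )
    print(f"{len(df)} embeddings almacenados en ChromaDB")  # "embeddings stored in ChromaDB"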
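
Because `store_embeddings` builds its IDs from row position (`id_{i}`), the same article can receive a different ID each time the feed changes. One possible alternative is to derive the ID from the article link, so that an upsert updates the same record across runs. The sketch below assumes links are unique per article; `stable_id` and the `rpp_` prefix are hypothetical names introduced for illustration.

```python
import hashlib

def stable_id(link):
    """Derive a short, deterministic ID from an article URL.

    Hypothetical helper: the same link always maps to the same ID,
    regardless of where the article sits in the feed.
    """
    digest = hashlib.sha256(link.strip().encode("utf-8")).hexdigest()
    return f"rpp_{digest[:16]}"
```

In `store_embeddings`, this might replace the positional list with `ids=[stable_id(link) for link in df["link"]]`. If two rows share a link, the IDs would collide within one batch, so such rows would need to be dropped first.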


***Query Results***

In [7]:
def query_news(query, n_results=5):
    """Return the `n_results` stored items closest to `query` as a DataFrame."""
    query_emb = embedder.encode([query])
    results = collection.query(
        query_embeddings=query_emb.tolist(),
        n_results=n_results
    )
    # Chroma returns one result list per query; [0] selects the single query sent
    metas = results["metadatas"][0]
    df = pd.DataFrame({
        "title": [meta.get("title", "") for meta in metas],
        "description": results["documents"][0],
        "link": [meta.get("link", "") for meta in metas],
        "date_published": [meta.get("published", "") for meta in metas]
    })
    return df
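
To make the idea behind `query_news` concrete, the sketch below ranks a handful of vectors against a query vector by cosine similarity in plain Python. It illustrates nearest-neighbour retrieval in general and does not describe Chroma's internals; the collection may be configured with a different distance metric. `cosine_similarity` and `top_k` are hypothetical names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=5):
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

With real data, `doc_vecs` would be the rows returned by `embedder.encode(texts)` and `query_vec` the single row returned for the query.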


***Orchestrate the Pipeline***

In [11]:
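def rpp_pipeline(query="Últimas noticias de economía"):  # "Latest economy news"
    """Run the full flow: load the feed, count tokens, store embeddings, query."""
    df = load_rss()
    # Show the first description and its token count as a sanity check
    print("\nEjemplo:")  # "Example:"
    print(df['description'].iloc[0][:200], "...")
    print(f"Tokens: {count_tokens(df['description'].iloc[0])}")
    store_embeddings(df)
    print(f"\nConsulta: {query}")  # "Query:"
    results_df = query_news(query)
    print("\nResultados:\n")  # "Results:"
    return results_df

df_results = rpp_pipeline("Últimas noticias de economía")
df_results.head(10)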


50 noticias cargadas desde RPP

Ejemplo:
Galatasaray y FK Bodo/Glimt se miden  en el estadio Rams Park mañana a las 11:45 horas y Michael Oliver es el elegido para dirigir el partido. ...
Tokens: 42
50 embeddings almacenados en ChromaDB

Consulta: Últimas noticias de economía

Resultados:



| | title | description | link | date_published |
|---|---|---|---|---|
| 0 | ¿Seguirá cayendo el dólar o se estabilizará? E... | El dólar acumula una fuerte caída en las últim... | https://rpp.pe/videos/economia/seguira-cayendo... | Mon, 20 Oct 2025 19:07:26 -0500 |
| 1 | ¿Seguirá cayendo el dólar o se estabilizará? E... | El dólar acumula una fuerte caída en las últim... | https://rpp.pe/economia/economia/precio-del-do... | Mon, 20 Oct 2025 19:10:31 -0500 |
| 2 | Alta rotación en el MEF, SUNAT y Petroperú gen... | La alta rotación de ministros y directivos en ... | https://rpp.pe/economia/economia/ipe-alta-rota... | Mon, 20 Oct 2025 19:00:14 -0500 |
| 3 | Rafael Vela: "El Tribunal Constitucional lamen... | El fiscal manifestó que tienen el derecho a cr... | https://rpp.pe/politica/judiciales/rafael-vela... | Mon, 20 Oct 2025 20:28:44 -0500 |
| 4 | Abogada de Keiko Fujimori sobre caso Cócteles:... | Giulliana Loza manifestó que el caso 'Cócteles... | https://rpp.pe/politica/judiciales/keiko-fujim... | Mon, 20 Oct 2025 23:03:06 -0500 |
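
Rows 0 and 1 above appear to carry the same title and description under two different links, as far as the truncated display shows. If such near-duplicates are unwanted, one option is to keep only the first occurrence of each title and description pair. The sketch below works on plain records such as those produced by `df_results.to_dict("records")`; `dedupe_records` is a hypothetical helper, not part of the notebook.

```python
def dedupe_records(records, keys=("title", "description")):
    """Keep the first record for each distinct combination of `keys`, preserving order."""
    seen = set()
    unique = []
    for rec in records:
        signature = tuple(rec.get(k, "") for k in keys)
        if signature not in seen:
            seen.add(signature)
            unique.append(rec)
    return unique
```

Since results arrive ranked by similarity, keeping the first occurrence retains the best-ranked copy of each duplicated item.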
