<a href="https://colab.research.google.com/github/torresnat/hitchikers_guide_pr/blob/main/NewsArticlesRAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📜 Data Extraction & Preprocessing for News Articles
The El Mundo news articles report on historical events, which can inform travel recommendations. They are stored as raw .txt files, so the steps below show how to load, clean, embed, and index them:

**1️⃣ Load & Inspect the Data (Monday Morning)**

✅ Steps:

- Load the .txt files and inspect the structure.
- Check if they contain metadata (dates, locations, etc.).
- Identify encoding issues (e.g., UTF-8 vs. ISO-8859-1).
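
The encoding check in the last step can be sketched as a small helper. This is a minimal sketch, not part of the original notebook: `decode_with_fallback` is a hypothetical name, and the candidate list simply mirrors the two encodings mentioned above.

```python
def decode_with_fallback(raw, encodings=("utf-8", "iso-8859-1")):
    """Decode bytes with the first encoding that works; return (text, encoding_used)."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate
    # Reached only if every candidate failed; keep the text but mark it as lossy
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


# A byte string that is valid ISO-8859-1 but not valid UTF-8
text, used = decode_with_fallback("año".encode("iso-8859-1"))
print(text, used)
```

To apply it to a file, read the file in binary mode (`open(file_path, "rb").read()`) and pass the bytes in. Because ISO-8859-1 accepts any byte sequence, it should stay last in the list.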

In [5]:
import os

folder_path = "/content/EnglishNewsArticles"  # Adjust path

# List all files in the folder and take only the first 10
file_list = sorted(os.listdir(folder_path))[:10]

for file_name in file_list:
    file_path = os.path.join(folder_path, file_name)
    # Check if the item is a file before opening
    if os.path.isfile(file_path):
        with open(file_path, "r", encoding="utf-8") as f:
            print(f"--- {file_name} ---\n", f.read()[:500])  # Preview first 500 chars

--- 19200103_1.txt ---
 ELWNDO
8 pages 3 ctvs. Semester, $4.00 One year, $7.50
Offices: Salvador Brau. 81 Tel. 833 P. O. Box 345
NEWSPAPER OF THE TANGLE.
EXCEPT SUNDAYS
SAN JUAN, PUERTO RICO. - 71 i. 1- ■- ■- r- - _¡
SATURDAY, JANUARY 3, 1920. r
i NUMBER 271.
ENTERED AS SECOND CLASS MATTER, FEBRUARY 21, 1919. AT THE POST OFFICE AT SAN JEAN. PORTO RICO. UNDER THE ACT OF MARCH 3, 1879.
Plot to assassinate the Prince of Serbia. 192,000 tons in dykes to be delivered by 
--- 19200110_1.txt ---
 ELM1NDO
8 pages 3 ctvs. 'Semester, $4.00 One Year, $7.50
i Offices: ¡ = Saved: E1 I I I Ttl. 632 P. O Be" 345 |
DAILY TIDE.
EXCEPT SUNDAYS
ARO II
SAN JUAN, PUERTO RICO.
SATURDAY i$ JANUARY 1*20.
ENTERED AS SECOND CLASS MATTER, FEBRUARY 21,
l "M9, AT THE POSt OFFICE AT SAN JUAN, PORTO
Ni MI RO 277.
RICO, INDER THE ACT OF MARCH 3.
More "bolshevikis" sent to Ellis Island. Japan releases German prisoners.
Taltemte with Mr. i Sewell about the i present conflict t " : . " 1 Tétala no tay esparauas da 
--- 

📌 Goal: Understand the data structure before further processing.

**2️⃣ Clean & Structure the Text**

✅ Steps:
- Remove irrelevant whitespace, headers, and footers.
- Detect and extract dates if present.
- Identify paragraphs and segment them properly.
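
The date-extraction step can be sketched as follows. It assumes that file names start with a `YYYYMMDD` stamp (as `19200103_1.txt` in the preview suggests) and that the masthead prints the issue date as `WEEKDAY, MONTH D, YYYY`. Both are assumptions drawn from the preview, and the function names are hypothetical.

```python
import re
from datetime import date, datetime

MONTHS = ["JANUARY", "FEBRUARY", "MARCH", "APRIL", "MAY", "JUNE", "JULY",
          "AUGUST", "SEPTEMBER", "OCTOBER", "NOVEMBER", "DECEMBER"]
WEEKDAYS = ["MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY", "FRIDAY",
            "SATURDAY", "SUNDAY"]

# Requiring a weekday skips other dates in the masthead, such as the postal notice
MASTHEAD_DATE = re.compile(
    r"\b(?:" + "|".join(WEEKDAYS) + r"),?\s+("
    + "|".join(MONTHS) + r")\s+(\d{1,2}),\s*(\d{4})\b",
    re.IGNORECASE,
)


def date_from_filename(filename):
    """Parse a leading YYYYMMDD stamp; return None if absent or invalid."""
    match = re.match(r"(\d{8})(?!\d)", filename)
    if not match:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d").date()
    except ValueError:
        return None


def date_from_masthead(text):
    """Return the first 'WEEKDAY, MONTH D, YYYY' date in the text, or None."""
    match = MASTHEAD_DATE.search(text)
    if not match:
        return None
    month = MONTHS.index(match.group(1).upper()) + 1
    try:
        return date(int(match.group(3)), month, int(match.group(2)))
    except ValueError:
        return None


print(date_from_filename("19200103_1.txt"))
print(date_from_masthead("SATURDAY, JANUARY 3, 1920. ENTERED FEBRUARY 21, 1919."))
```

OCR noise can break the masthead pattern (the second preview reads `SATURDAY i$ JANUARY 1*20.`), so the file name is probably the more reliable source, with the masthead as a cross-check.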

In [8]:
#!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.10.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.4 kB)
Downloading faiss_cpu-1.10.0-cp311-cp311-manylinux_2_28_x86_64.whl (30.7 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m30.7/30.7 MB[0m [31m38.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: faiss-cpu
Successfully installed faiss-cpu-1.10.0


In [9]:
import os
import re
import openai
import numpy as np
import faiss
import requests
from sentence_transformers import SentenceTransformer
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [10]:
import os
import re

def clean_text(text):
    """Collapse whitespace and drop characters outside a small punctuation whitelist."""
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse newlines, tabs, and repeated spaces into one space
    text = re.sub(r'[^\w\s.,;!?-]', '', text)  # Drop other symbols (this also removes $, quotes, and OCR noise such as ■)
    return text

# Read and clean all text files
chunks = []
filenames = []

for filename in os.listdir(folder_path):
    if filename.endswith(".txt"):  # Ensure only text files are processed
        file_path = os.path.join(folder_path, filename)
        with open(file_path, "r", encoding="utf-8") as f:
            raw_text = f.read()
        cleaned_text = clean_text(raw_text)
        chunks.append(cleaned_text)
        filenames.append(filename)  # Store filenames for reference

print(f"Loaded {len(chunks)} articles.")

Loaded 821 articles.


📌 Goal: Have well-formatted articles ready for processing.

**3️⃣ Chunking & Splitting** (skipped: the dataset changed, so this step is no longer required)

Since news articles can be long, chunking is necessary for effective retrieval in the RAG system.

✅ Best Strategies:
- Fixed-Length Splitting (e.g., 512 tokens per chunk).
- Semantic Splitting (split by topic using embeddings).

📌 Goal: Ensure each chunk is meaningful and retrievable.
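
If chunking turns out to be needed after all, fixed-length splitting can be sketched as below. One reason it might be: as far as I know, `all-MiniLM-L6-v2` truncates input beyond 256 word pieces by default, so only the start of a long page would be embedded. The sketch counts whitespace-separated words as a rough stand-in for tokens, and the function name and default sizes are assumptions rather than tuned values.

```python
def split_into_chunks(text, max_words=200, overlap=40):
    """Split text into windows of at most max_words words.

    Neighbouring windows share `overlap` words so that a sentence cut at a
    boundary still appears whole in one of them.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    pieces = []
    for start in range(0, len(words), step):
        pieces.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return pieces


demo = " ".join(str(i) for i in range(10))
print(split_into_chunks(demo, max_words=4, overlap=1))
```

When splitting real articles, it helps to keep a parallel list of source file names (one entry per piece) so each retrieved piece can be traced back to its article.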

**5️⃣ Embedding & Storing for RAG**

✅ Steps:
- Convert the chunks into vector embeddings (Multilingual SBERT or OpenAI Embeddings).
- Store them in Pinecone/FAISS for fast retrieval.

In [11]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a lightweight sentence embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")  # Faster and good for retrieval

# Encode all articles in one batched call; encode() returns a NumPy array by default
embeddings = model.encode(chunks)

print("Embeddings shape:", embeddings.shape)

Embeddings shape: (821, 384)


**6️⃣ Retrieval & Querying (RAG)**

✅ Final Step:
- Connect the vector database to your chatbot.
- Given a user interest (e.g., history, politics), retrieve relevant articles.
- Use semantic search to return contextual news snippets.

💡 Why is this better?

✅ It returns a distance for each result (FAISS `IndexFlatL2` reports squared L2 distance, so lower means more similar), which lets you see how relevant the results are

✅ You can tune k to get more or fewer results
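
To make the scores easier to read, a squared L2 distance can be converted to cosine similarity when the embeddings are unit-length. I believe `all-MiniLM-L6-v2` normalizes its output, but that is worth checking with `np.linalg.norm(embeddings, axis=1)`. The sketch below uses plain Python to show the identity `cos = 1 - d² / 2`, and the helper names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance, the quantity IndexFlatL2 reports."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity of two arbitrary vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def distance_to_cosine(squared_distance):
    """Convert a squared L2 distance between unit vectors to cosine similarity."""
    return 1.0 - squared_distance / 2.0


a, b = [1.0, 2.0, 3.0], [2.0, 0.0, 1.0]
d = squared_l2(normalize(a), normalize(b))
print(distance_to_cosine(d), cosine(a, b))  # the two values agree
```

Under that unit-length assumption, a reported distance of 0.8189 corresponds to a cosine similarity of roughly 0.59.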

In [12]:
import faiss

# Create a FAISS index for fast similarity search
index = faiss.IndexFlatL2(embeddings.shape[1])  # Exact search on squared L2 distance (smaller = closer)
index.add(embeddings)

print(f"Stored {len(embeddings)} embeddings in FAISS.")

Stored 821 embeddings in FAISS.


In [13]:
def search_news(query, k=3):
    """Return the k nearest articles as (text, squared L2 distance) pairs; lower means closer."""
    query_embedding = model.encode([query])
    distances, indices = index.search(query_embedding, k)
    return [(chunks[i], distances[0][j]) for j, i in enumerate(indices[0])]

# Example Query
results = search_news("hurricane impact in Puerto Rico", k=5)

# Print results
for res, score in results:
    print(f"Score: {score:.4f}\n{res}\n---")

Score: 0.8189
elmúndo MORNING NEWSPAPER 1 HS. i IB Office Puerto Rico Hat A W trado Bldg. -Tel. 122. 1 Pages 3 Ctvs. 1  Semester - - - - 5.00 A V A Year 9.50. DIARY OF. 8 LA MAÑANA ao X San Juan, Puerto Rico. Published tic distributed under permit No 801 uta Ized by the Act or 0 tci.er 6 1917 on file t the Post Office at Sun Juan. Porto Rco By order of the President. A S. Burleson Posmaster Ger.e-a. Entered ai econd c.as, matter february 21 1919 at the Pest Office at San Juan, Porto Rico. United Stacs of A-.nerica under the Act of march s, 1879. i Saturday, September 22, 1928. I Number 3150. The steamer Bridge bringing a large quantity of provisions, will arrive next Tuesday. Hy and ir?ñ?na steamers will arrive from Santo Domingo carrying viyeres. Horrific news of the effects of the cyclone continue to arrive from the interior of the island. THE DAMAGE CAUSED BY THE CYCLONE IN JAYUYA M From our Orre-i on i M The storm began to azular REMATE ON TUESDAY. SEPT. 25. 1923 . AT 2OO P. M. AT 

In [14]:
# Example Query
results = search_news("economic situation in San Juan", k=5)

# Print results
for res, score in results:
    print(f"Score: {score:.4f}\n{res}\n---")

Score: 0.7662
THE MlNDO WEATHER FORECASTS FOR THE ISLAND. BOY Cloudy with a few scattered showers. - IN SAN JUAN, YESTERDAY - High, 88 degrees; low, 76 degrees. Barometric pressure at sea level, at 480 p.m., 39.88 inches of mercury. NURVA YORK, wtuhr ta. Pü. - In 1 afternoon today a reported the following temperature New York, 87; Chicago, 64; Wáahlngton, 84; Mia. mi, 88. Forecast for mafia in New York and neighboring rindadea cloudy with scattered shower. the highest temperature between M and 70. Wind moderate wind from the bat. TOMORROW DAILY f YEAR XXX Entered ae second class matter. Post Office. San Juan. P. B SAN JUAN, PUERTO RICO - SATURDAY, OCTOBER 15, 1949 NUMBER 13815 FIVE CENTS Interior Requests 815,606,500 For 1950 Substantial amount to be devoted to roads Plans to isolate land ceded by Navy to avoid slums Fnr S. GAIvm Mataran Editor of EL MUNDO More than five million dollars has been requested by the Department of the Interior for the first phase of the Seventh Economic Pro

In [15]:
# Example Query
results = search_news("cultural festivals in Puerto Rico", k=5)

# Print results
for res, score in results:
    print(f"Score: {score:.4f}\n{res}\n---")

Score: 0.9572
THE MONDO 1 á Pages 3 Ctvs. f UJf Semester - e- 6.00 Ml 1 Jn Year - ---- -i 19.50 QJ Pages 3 Ctvs. f Rico DusM Mtrado Bldg. -TeL 1222 DAILY MORNING MORNING Year IX, I San Juan, Puerto Rico.. .ublMhed and diatributad andar permit No. 801 autortad by the Aet. of October 8, 1917, on filo at the Poet Office at dan Juan, Porto Bieo. By order of the Preaident. A. B. Burdegou, Poitmaatar General. Entered m aeeond cllu matter, ebrnary 81. 1919 at the Poat Office at San Juan. P.rto Kuo. United States of America under the Act of trareb 8. 187. Saturday, December 17, 1927.1 I Number 2359. THE UNIVERSITY OF COLUMBIA INVITES DON JOSE COLL Y CU- CHI TO GIVE A LECTURE NEXT FEBRUARY M The cultural event will be held under the auspices of the Instituto de las Españas, of New York M Columbia University has done the high honor to our compatriot Don José Coll y Cuchí of inviting him to occupy the rostrum of that high cultural center, giving a lecture. The event will be held under the auspice

📌 Goal: Return relevant news articles when a traveler asks about history, politics, or past events.

# **🛠️ Phase 2: Extracting Key Travel Information from News Articles**
💡 Goal: Extract landmarks, events, places of interest, beaches, forests, trails, etc. from the news dataset and structure it into a knowledge base for retrieval.

## 📝 Extraction Approach
**1️⃣ Named Entity Recognition (NER) for Key Elements**

✅ Use spaCy or Hugging Face Transformers to extract:
- Dates 📅
- Locations 📍
- People 👥
- Event types 🔥

🔹 Example (spaCy for NER Extraction):

In [None]:
import spacy

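# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")  # English model, since the loaded articles are in English
# Note: spaCy's Spanish pipelines (e.g. es_core_news_md) only emit PER, LOC, ORG, and MISC,
# so the DATE, PERSON, and EVENT lists would always stay empty with them.

def extract_entities(text):
    doc = nlp(text)
    # GPE covers countries, cities, and states; LOC covers other places such as mountains or rivers
    entities = {"DATE": [], "GPE": [], "LOC": [], "PERSON": [], "EVENT": []}

    for ent in doc.ents:
        if ent.label_ in entities:
            entities[ent.label_].append(ent.text)

    return entities

sample_text = "On September 20, 2017, Hurricane María devastated Puerto Rico."
print(extract_entities(sample_text))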

**2️⃣ Categorization & Sentiment Analysis**

✅ Classify news articles into event types

✅ Use sentiment analysis to gauge the tone (negative, neutral, positive)

🔹 Example (Text Classification & Sentiment with transformers):

In [None]:
from transformers import pipeline

classifier = pipeline("text-classification", model="nlptown/bert-base-multilingual-uncased-sentiment")

text = "Hurricane María caused devastation in Puerto Rico and left thousands of victims."
print(classifier(text))


💡 The model predicts a rating from 1 to 5 stars; a 1-star rating suggests a negative-impact event.
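
To turn the star rating into the three tones listed above, a small mapping is enough. This is a sketch that assumes the classifier's labels look like `"1 star"` through `"5 stars"`; the thresholds and the name `stars_to_tone` are illustrative choices, not something the model defines.

```python
def stars_to_tone(label):
    """Map a label such as '1 star' or '4 stars' to negative, neutral, or positive."""
    stars = int(label.split()[0])  # the leading number of the label
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


# Hypothetical prediction in the shape returned by a text-classification pipeline
prediction = {"label": "1 star", "score": 0.71}
print(stars_to_tone(prediction["label"]))
```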

**3️⃣ Matching Historical Events to Tourist Attractions**

✅ Cross-reference extracted locations with known attractions

✅ Use a GIS API (Google Maps, OpenStreetMap) to match historical places

🔹 Example (Google Maps API for Locations)

In [None]:
import googlemaps

gmaps = googlemaps.Client(key="YOUR_API_KEY")

place = "Vieques, Puerto Rico"
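geocode_result = gmaps.geocode(place)
if geocode_result:  # geocode() returns an empty list when nothing matches
    print(geocode_result[0]['geometry']['location'])
else:
    print(f"No geocoding result for {place!r}")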
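
The cross-referencing step can also be sketched offline, before any API call. The example below matches extracted locations against a tiny hand-written gazetteer. The gazetteer entries and function names are illustrative assumptions; a real list would come from a curated attractions dataset. The normalization also maps the historical spelling "Porto Rico", which appears in the mastheads, to "Puerto Rico".

```python
import unicodedata

# Illustrative mini gazetteer: attraction -> municipality (hypothetical sample data)
KNOWN_ATTRACTIONS = {
    "Castillo San Felipe del Morro": "San Juan",
    "Mosquito Bay": "Vieques",
    "Flamenco Beach": "Culebra",
}


def normalize_place(name):
    """Lowercase, strip accents, collapse spaces, and modernize 'Porto Rico'."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = " ".join(text.lower().split())
    return text.replace("porto rico", "puerto rico")


def match_attractions(locations, gazetteer=KNOWN_ATTRACTIONS):
    """Return {original location: [attractions]} for locations in the gazetteer."""
    by_place = {}
    for attraction, place in gazetteer.items():
        by_place.setdefault(normalize_place(place), []).append(attraction)
    matches = {}
    for loc in locations:
        # Keep only the town part of strings like "San Juan, Porto Rico"
        town = normalize_place(loc).split(",")[0].strip()
        if town in by_place:
            matches[loc] = by_place[town]
    return matches


print(match_attractions(["SAN JUAN, PORTO RICO", "Jayuya"]))
```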

## 🚀 Final Steps: Building a Dataset
**📌 Create a Structured Dataset for RAG**

1️⃣ Extract events from news articles.

2️⃣ Structure the data into JSON/CSV.

3️⃣ Store embeddings for semantic search.

4️⃣ Match events to travel locations.

🔹 Example JSON Format for RAG Storage:

In [None]:
{
  "event": "Hurricane María",
  "date": "2017-09-20",
  "location": "Puerto Rico",
  "type": "Natural Disaster",
  "people": ["Government Officials"],
  "impact": "Massive destruction, migration wave",
  "tourist_relevance": "Yes"
}
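
Records in this shape can be built and serialized with the standard library. The sketch below mirrors the fields of the JSON example; the class name `EventRecord`, the default values, and the JSON Lines layout (one record per line) are assumptions, not a fixed schema.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class EventRecord:
    event: str
    date: str            # ISO format, e.g. "2017-09-20"
    location: str
    type: str
    people: list = field(default_factory=list)
    impact: str = ""
    tourist_relevance: str = "Unknown"


def to_jsonl(records):
    """Serialize records as JSON Lines, keeping accented characters readable."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


record = EventRecord(
    event="Hurricane María",
    date="2017-09-20",
    location="Puerto Rico",
    type="Natural Disaster",
    people=["Government Officials"],
    impact="Massive destruction, migration wave",
    tourist_relevance="Yes",
)
print(to_jsonl([record]))
```

One record per line keeps the file easy to append to and to stream when the embeddings are rebuilt.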
