# 📜 Data Extraction & Preprocessing for News Articles
The El Mundo news articles report historical events, which can be useful for travel recommendations. Since they come as raw .txt files, here’s how to process them:

**1️⃣ Load & Inspect the Data (Monday Morning)** 

✅ Steps:

- Load the .txt files and inspect the structure.
- Check if they contain metadata (dates, locations, etc.).
- Identify encoding issues (e.g., UTF-8 vs. ISO-8859-1).
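
The encoding check in the last step can be sketched as a read with a fallback. This is a minimal sketch, not part of the original notebook: `read_with_fallback` is a hypothetical helper, and the candidate list simply mirrors the two encodings named above.

```python
from pathlib import Path


def read_with_fallback(path, encodings=("utf-8", "iso-8859-1")):
    """Return (text, encoding) for the first encoding that decodes without error.

    Hypothetical helper for illustration. Note that ISO-8859-1 maps every
    byte to a character, so it never raises: it can only serve as a last
    resort and may silently mis-decode text written in another encoding.
    """
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # Try the next candidate encoding
    raise ValueError(f"None of {encodings} could decode {path}")
```

Calling it on each file and counting how often the second encoding is needed gives a quick picture of whether the corpus is consistently UTF-8.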

In [1]:
import os

folder_path = "/Users/paolarivera/Documents/Ironhack/Week 9 - Final Project/project-dsml-interactive-travel-planner-main/data/elmundo_chunked_en_page1_15years"  # Adjust path

# List the .txt files in the folder (sorted by name) and take only the first 10
file_list = sorted(f for f in os.listdir(folder_path) if f.endswith(".txt"))[:10]

for file_name in file_list:
    with open(os.path.join(folder_path, file_name), "r", encoding="utf-8") as f:
        print(f"--- {file_name} ---\n", f.read()[:500])  # Preview first 500 chars

--- 19200103_1.txt ---
 ELWNDO
8 pages 3 ctvs. Semester, $4.00 One year, $7.50
Offices: Salvador Brau. 81 Tel. 833 P. O. Box 345
NEWSPAPER OF THE TANGLE.
EXCEPT SUNDAYS
SAN JUAN, PUERTO RICO. - 71 i. 1- ■- ■- r- - _¡
SATURDAY, JANUARY 3, 1920. r
i NUMBER 271.
ENTERED AS SECOND CLASS MATTER, FEBRUARY 21, 1919. AT THE POST OFFICE AT SAN JEAN. PORTO RICO. UNDER THE ACT OF MARCH 3, 1879.
Plot to assassinate the Prince of Serbia. 192,000 tons in dykes to be delivered by 
--- 19200110_1.txt ---
 ELM1NDO
8 pages 3 ctvs. 'Semester, $4.00 One Year, $7.50
i Offices: ¡ = Saved: E1 I I I Ttl. 632 P. O Be" 345 |
DAILY TIDE.
EXCEPT SUNDAYS
ARO II
SAN JUAN, PUERTO RICO.
SATURDAY i$ JANUARY 1*20.
ENTERED AS SECOND CLASS MATTER, FEBRUARY 21,
l "M9, AT THE POSt OFFICE AT SAN JUAN, PORTO
Ni MI RO 277.
RICO, INDER THE ACT OF MARCH 3.
More "bolshevikis" sent to Ellis Island. Japan releases German prisoners.
Taltemte with Mr. i Sewell about the i present conflict t " : . " 1 Tétala no tay esparauas da 
--- 

📌 Goal: Understand the data structure before further processing.

**2️⃣ Clean & Structure the Text**

✅ Steps:
- Remove irrelevant whitespace, headers, and footers.
- Detect and extract dates if present.
- Identify paragraphs and segment them properly.
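
For the date step, the file names in the preview (`19200103_1.txt`, whose masthead reads "SATURDAY, JANUARY 3, 1920") suggest that each name starts with the issue date as `YYYYMMDD`. The sketch below relies on that assumption, which is cheaper and more robust than parsing the OCR-damaged mastheads. `date_from_filename` is a hypothetical helper, not part of the original notebook.

```python
import re
from datetime import datetime

# Assumed naming scheme: <YYYYMMDD>_<chunk number>.txt, e.g. 19200103_1.txt
FILENAME_DATE = re.compile(r"^(\d{8})_\d+\.txt$")


def date_from_filename(filename):
    """Return the issue date encoded in the file name, or None if it does not fit."""
    match = FILENAME_DATE.match(filename)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d").date()
    except ValueError:
        return None  # Eight digits that do not form a valid calendar date


print(date_from_filename("19200103_1.txt"))  # 1920-01-03
```

Files for which the function returns `None` are worth listing separately, since they break the assumed naming scheme.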

In [3]:
import os
import re

# Define folder paths
folder_path = "/Users/paolarivera/Documents/Ironhack/Week 9 - Final Project/project-dsml-interactive-travel-planner-main/data/elmundo_chunked_en_page1_15years"  # Change to your input folder path
output_folder = "EnglishCleanedArticles"  # Define output folder

# Ensure the output folder exists
os.makedirs(output_folder, exist_ok=True)

def clean_text(text):
    # Drop everything except word characters, whitespace, and basic punctuation
    text = re.sub(r'[^\w\s.,;!?-]', '', text)
    # Collapse newlines and repeated whitespace into single spaces (done last,
    # so gaps left by removed characters do not survive as double spaces)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Read, clean, and save text files
chunks = []
filenames = []

for filename in sorted(os.listdir(folder_path)):  # Sorted for a reproducible order
    if filename.endswith(".txt"):  # Ensure only text files are processed
        file_path = os.path.join(folder_path, filename)
        with open(file_path, "r", encoding="utf-8") as f:
            raw_text = f.read()
        
        cleaned_text = clean_text(raw_text)
        chunks.append(cleaned_text)
        filenames.append(filename)  # Store filenames for reference
        
        # Save cleaned text
        output_path = os.path.join(output_folder, filename)
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(cleaned_text)

print(f"Loaded and cleaned {len(chunks)} articles. Cleaned files saved in '{output_folder}'.")

Loaded and cleaned 821 articles. Cleaned files saved in 'EnglishCleanedArticles'.
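
The cleaning above keeps the newspaper masthead (title, price, office address, postal notice) at the start of every file. In the two files previewed earlier, the masthead ends with the postal-registration line containing "ACT OF MARCH 3". A minimal sketch of header removal under that assumption follows; `strip_masthead` and the 15-line search window are hypothetical choices, and the marker has only been seen in those two previews, so it should be checked against more files before being trusted.

```python
import re

# Assumption: the masthead ends at the line mentioning "ACT OF MARCH 3"
MASTHEAD_END = re.compile(r"ACT OF MARCH 3", re.IGNORECASE)


def strip_masthead(raw_text, max_header_lines=15):
    """Drop every line up to and including the masthead's closing line.

    Only the first `max_header_lines` lines are searched, so a later mention
    of the phrase inside an article does not cut the text. If the marker is
    not found, the text is returned unchanged.
    """
    lines = raw_text.splitlines()
    for index, line in enumerate(lines[:max_header_lines]):
        if MASTHEAD_END.search(line):
            return "\n".join(lines[index + 1:]).strip()
    return raw_text
```

Because it works line by line, this has to run on the raw text before `clean_text`, which flattens all newlines.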
