### **HTML Information Extraction Toolkit**
#### **Introduction**

This notebook demonstrates how to preprocess HTML pages and extract useful textual information. It uses two Python libraries: 
- `trafilatura` for general-purpose text extraction.
- `news-please` for structured information extraction, especially for news articles.

We will walk through the process step-by-step, extracting clean and meaningful text from raw HTML files.

#### **Required Libraries**
Make sure the necessary libraries are installed. If not, run the following command in a notebook cell (in a terminal, omit the leading `!`):

```bash
!pip install trafilatura news-please
```

### **Step 1: Import Necessary Libraries**

Here, we import the required libraries to handle file operations, multiprocessing for efficiency, and text extraction.

In [None]:
import os
import json
import multiprocessing
import re
import trafilatura
from newsplease import NewsPlease

### **Step 2: Set the Directory Path**


Define the base directory where all your HTML files are stored. Make sure you have a folder named `html` containing the files you want to process.

In [None]:
BASE_URI = 'html/'  # Update this path if your folder is elsewhere
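
Since every file in this folder will be processed, it can help to check what the folder actually contains first. The sketch below is illustrative only: the helper name `list_html_files` and the `.html`/`.htm` filter are assumptions, not part of the original notebook, and the demo builds a throwaway folder so that it runs on its own.

```python
import tempfile
from pathlib import Path

def list_html_files(base_dir):
    """Return sorted names of regular files ending in .html or .htm (hypothetical helper)."""
    return sorted(
        p.name for p in Path(base_dir).iterdir()
        if p.is_file() and p.suffix.lower() in {".html", ".htm"}
    )

# Demo on a temporary folder so the example does not depend on your data
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.html").write_text("<html></html>", encoding="utf-8")
    (Path(tmp) / "b.HTM").write_text("<html></html>", encoding="utf-8")
    (Path(tmp) / "notes.txt").write_text("not HTML", encoding="utf-8")
    (Path(tmp) / "subfolder").mkdir()
    demo_files = list_html_files(tmp)

print(demo_files)
```

With real data, `list_html_files(BASE_URI)` would return only the HTML files, leaving out subfolders and unrelated files.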

### **Step 3: Create a Function to Clean Text**


The `clean_text` function collapses every run of whitespace (spaces, tabs, newlines) into a single space, trims both ends, and removes hyphens that are followed by a space, which rejoins words that were split across lines.

In [None]:
def clean_text(text):
    """
    Cleans the extracted text by collapsing whitespace and rejoining words hyphenated across line breaks.
    """
    cleaned_text = re.sub(r"\s+", " ", text)  # Collapse any run of whitespace (spaces, tabs, newlines) into one space
    cleaned_text = cleaned_text.strip()       # Strip leading and trailing whitespace
    cleaned_text = cleaned_text.replace("- ", "")  # Rejoin "exam- ple" as "example"; this also drops the hyphen in "a - b"
    return cleaned_text
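
A quick check on a made-up string shows what the function does in practice. The function body is repeated here only so that the example runs on its own, and the sample strings are invented for illustration.

```python
import re

def clean_text(text):
    # Same logic as the cell above, repeated so this example is self-contained
    cleaned_text = re.sub(r"\s+", " ", text)
    cleaned_text = cleaned_text.strip()
    cleaned_text = cleaned_text.replace("- ", "")
    return cleaned_text

# A word split across a line break is rejoined
sample = "  Informa-\ntion   extraction\tis  use-\nful.  "
print(clean_text(sample))           # Information extraction is useful.

# Caveat: a free-standing hyphen is removed as well
print(clean_text("pages 3 - 5"))    # pages 3 5

# Hyphens inside a word are left alone
print(clean_text("state-of-the-art"))  # state-of-the-art
```

If free-standing hyphens matter in your data, consider restricting the last replacement to hyphens that directly follow a letter.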

### **Step 4: Create a Function to Extract Text Using Trafilatura**


The `extract_text_trafilatura` function reads an HTML file and uses Trafilatura to extract its main text and, from the page metadata, its title.

In [None]:
def extract_text_trafilatura(filename):
    """
    Extracts text and metadata from an HTML file using Trafilatura.
    """
    json_object = {}
    print(f"Processing (Trafilatura): {filename}")
    try:
        # Open inside the try block so that missing or unreadable files are reported too
        with open(os.path.join(BASE_URI, filename), 'r', encoding='utf-8') as file:
            html_content = file.read()
        result = trafilatura.extract(
            html_content,
            no_fallback=True,
            include_links=False,
            include_comments=False,
            include_tables=False,
            include_images=False,
            include_formatting=True
        )
        metadata = trafilatura.extract_metadata(html_content)
        # Depending on the Trafilatura version, metadata is a dict or an object with as_dict()
        if hasattr(metadata, 'as_dict'):
            metadata = metadata.as_dict()
        json_object['title'] = metadata.get('title') if metadata else None
        json_object['main_text'] = clean_text(result) if result else None
        json_object['filename'] = filename
    except Exception as e:
        print(f"Error processing {filename}: {e}")

    return json_object

### **Step 5: Create a Function to Extract Text Using NewsPlease**

The `extract_text_newsplease` function processes an HTML file to extract structured news information using NewsPlease.

In [None]:
def extract_text_newsplease(filename):
    """
    Extracts text and metadata from an HTML file using NewsPlease.
    """
    json_object = {}
    print(f"Processing (NewsPlease): {filename}")
    try:
        # Open inside the try block so that missing or unreadable files are reported too
        with open(os.path.join(BASE_URI, filename), 'r', encoding='utf-8') as file:
            news = NewsPlease.from_html(file.read())
        json_object['title'] = news.title
        json_object['description'] = news.description
        json_object['main_text'] = news.maintext
        json_object['language'] = news.language
        json_object['filename'] = filename
    except Exception as e:
        print(f"Error processing {filename}: {e}")

    return json_object
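
Both extraction functions read files as strict UTF-8, so a page saved in another encoding raises a `UnicodeDecodeError` and ends up as an empty record. One possible workaround, sketched below, is to replace undecodable bytes instead of failing. The helper name `read_html` is hypothetical, and whether lossy decoding is acceptable depends on your data.

```python
import tempfile
from pathlib import Path

def read_html(path):
    """Read a file as UTF-8, replacing undecodable bytes instead of raising (hypothetical helper)."""
    with open(path, "r", encoding="utf-8", errors="replace") as file:
        return file.read()

with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "latin1.html"
    # "café" encoded as Latin-1 is not valid UTF-8
    page.write_bytes("<p>café</p>".encode("latin-1"))
    html_content = read_html(page)

# The invalid byte shows up as the Unicode replacement character U+FFFD
print(html_content)
```

The rest of the page stays intact, so the extractors can still process it; only the affected characters are lost.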

### **Step 6: Process Files Using Multiprocessing**

We use Python's `multiprocessing` to process multiple HTML files in parallel for efficiency. This block processes all files in the `html` folder using both `Trafilatura` and `NewsPlease`.

In [None]:
if __name__ == '__main__':
    # List the input files once, skipping subfolders such as .ipynb_checkpoints
    filenames = sorted(
        f for f in os.listdir(BASE_URI)
        if os.path.isfile(os.path.join(BASE_URI, f))
    )

    # Use a multiprocessing pool to process files in parallel;
    # the context manager shuts the worker processes down afterwards
    with multiprocessing.Pool(os.cpu_count()) as pool:
        # Extract text using Trafilatura
        print("Starting Trafilatura extraction...")
        trafilatura_results = pool.map(extract_text_trafilatura, filenames)
        with open('html_json_trafilatura.json', 'w', encoding='utf-8') as f:
            json.dump(trafilatura_results, f, ensure_ascii=False, indent=4)
        print("Trafilatura extraction completed. Output saved to 'html_json_trafilatura.json'.")

        # Extract text using NewsPlease
        print("Starting NewsPlease extraction...")
        newsplease_results = pool.map(extract_text_newsplease, filenames)
        with open('html_json_newsplease.json', 'w', encoding='utf-8') as f:
            json.dump(newsplease_results, f, ensure_ascii=False, indent=4)
        print("NewsPlease extraction completed. Output saved to 'html_json_newsplease.json'.")
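
Because the extraction functions return an empty dictionary when an exception is caught, the result lists can contain records without any text. A minimal sketch for separating usable records from failed ones follows; the helper name `split_results` and the demo records are assumptions made for illustration.

```python
def split_results(results):
    """Separate records with extracted text from those without (hypothetical helper)."""
    succeeded = [r for r in results if r.get("main_text")]
    failed = [r for r in results if not r.get("main_text")]
    return succeeded, failed

# Stand-in records shaped like the extractor output
demo_results = [
    {"title": "A", "main_text": "Some text", "filename": "a.html"},
    {},  # what the functions return when an exception was caught
    {"title": None, "main_text": None, "filename": "c.html"},  # nothing extracted
]
succeeded, failed = split_results(demo_results)
print(len(succeeded), len(failed))
```

Applied to `trafilatura_results` or `newsplease_results` before `json.dump`, this would let you save only the usable records and inspect the failures separately.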

### **Final Notes**

This notebook demonstrates how to extract clean and structured text from HTML files using two methods. The results are saved as JSON files:

1. `html_json_trafilatura.json`: Output from Trafilatura.
2. `html_json_newsplease.json`: Output from NewsPlease.

You can analyze these JSON files further for your research or application needs.
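
As one possible starting point for such an analysis, the sketch below compares how much text each method extracted per file. The helper names and the demo records are made up for illustration; only the `filename` and `main_text` keys come from the output format above.

```python
import json
import tempfile
from pathlib import Path

def load_by_filename(path):
    """Load a results file and index its records by filename, skipping empty records."""
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    return {r["filename"]: r for r in records if r.get("filename")}

def compare_lengths(trafilatura_path, newsplease_path):
    """Return {filename: (trafilatura_chars, newsplease_chars)} for files present in both outputs."""
    left = load_by_filename(trafilatura_path)
    right = load_by_filename(newsplease_path)
    return {
        name: (
            len(left[name].get("main_text") or ""),
            len(right[name].get("main_text") or ""),
        )
        for name in sorted(left.keys() & right.keys())
    }

# Demo with made-up records written to a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    t_path = Path(tmp) / "trafilatura.json"
    n_path = Path(tmp) / "newsplease.json"
    t_path.write_text(
        json.dumps([{"filename": "a.html", "main_text": "Hello world"}, {}]),
        encoding="utf-8",
    )
    n_path.write_text(
        json.dumps([{"filename": "a.html", "main_text": None}]),
        encoding="utf-8",
    )
    lengths = compare_lengths(t_path, n_path)

print(lengths)  # {'a.html': (11, 0)}
```

With the real outputs, the call would be `compare_lengths('html_json_trafilatura.json', 'html_json_newsplease.json')`. Large differences for a file are a hint that one of the extractors missed or over-collected content on that page.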