### **HTML Information Extraction Toolkit**
#### Introduction

This notebook demonstrates how to preprocess HTML pages and extract useful textual information. It uses two Python libraries: 
- `trafilatura` for general-purpose text extraction.
- `news-please` for structured information extraction, especially for news articles.

We will walk through the process step-by-step, extracting clean and meaningful text from raw HTML files.

#### **Required Libraries**
Make sure the necessary libraries are installed. If they are not, run the following command in a terminal (in a notebook cell, prefix it with `!`, as in the cell below):

```bash
pip install trafilatura news-please
```

In [1]:
!pip install trafilatura news-please


Collecting trafilatura
  Using cached trafilatura-2.0.0-py3-none-any.whl.metadata (12 kB)
Collecting news-please
  Using cached news_please-1.6.15-py3-none-any.whl.metadata (2.8 kB)
Collecting charset_normalizer>=3.4.0 (from trafilatura)
  Downloading charset_normalizer-3.4.2-cp39-cp39-macosx_10_9_universal2.whl.metadata (35 kB)
Collecting courlan>=1.3.2 (from trafilatura)
  Using cached courlan-1.3.2-py3-none-any.whl.metadata (17 kB)
Collecting htmldate>=1.9.2 (from trafilatura)
  Using cached htmldate-1.9.3-py3-none-any.whl.metadata (10 kB)
Collecting justext>=3.0.1 (from trafilatura)
  Using cached justext-3.0.2-py2.py3-none-any.whl.metadata (7.3 kB)
Collecting lxml>=5.3.0 (from trafilatura)
  Downloading lxml-6.0.0-cp39-cp39-macosx_10_9_universal2.whl.metadata (6.6 kB)
Collecting Scrapy>=1.1.0 (from news-please)
  Using cached scrapy-2.13.3-py3-none-any.whl.metadata (4.4 kB)
Collecting PyMySQL>=0.7.9 (from news-please)
  Using cached PyMySQL-1.1.1-py3-none-any.whl.metadata (4.4 kB)

### **Step 1: Import Necessary Libraries**

Here, we import the required libraries to handle file operations, multiprocessing for efficiency, and text extraction.

In [2]:
import os
import json
import multiprocessing
import re
import trafilatura
from newsplease import NewsPlease
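# The installed packages are `trafilatura` and `newsplease` (imported above).
# The wildcard imports below resolve only if local helper modules named
# Trafilatura.py and NewsPlease.py sit next to this notebook.
try:
    from Trafilatura import *
    from NewsPlease import *
except ImportError:
    pass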

### **Step 2: Set the Directory Path**


Define the base directory where all your HTML files are stored. Make sure you have a folder named `html` containing the files you want to process.

In [3]:
BASE_URI = 'html/'  # Update this path if your folder is elsewhere
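
`os.listdir` returns every entry in a folder, including hidden files such as `.DS_Store`, so it can help to select only the HTML files before processing. A minimal sketch, assuming the files end in `.html` or `.htm`; the helper name `list_html_files` is hypothetical and not part of the original notebook:

```python
from pathlib import Path

def list_html_files(base_dir):
    """Return sorted names of .html/.htm files directly inside base_dir."""
    base = Path(base_dir)
    if not base.is_dir():
        raise FileNotFoundError(f"Directory not found: {base_dir}")
    return sorted(
        p.name for p in base.iterdir()
        if p.is_file() and p.suffix.lower() in {".html", ".htm"}
    )
```

Calling `list_html_files(BASE_URI)` would then give a filtered list that can be passed to the processing step in place of `os.listdir(BASE_URI)`.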

### **Step 3: Create a Function to Clean Text**


The `clean_text` function removes unnecessary whitespace, redundant hyphens, and formatting inconsistencies to produce clean text.

In [4]:
def clean_text(text):
    """
    Cleans the extracted text by removing extra whitespace, unnecessary hyphens, etc.
    """
    cleaned_text = re.sub(r"\s+", " ", text)  # Collapse any run of whitespace (spaces, tabs, newlines) into one space
    cleaned_text = cleaned_text.strip()       # Strip leading and trailing whitespace
    cleaned_text = cleaned_text.replace("- ", "")  # Remove hyphens followed by a space (e.g. words split across lines)
    return cleaned_text
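
To see what these three steps do in practice, the short example below runs the function on a few made-up strings. The function is repeated so the snippet runs on its own, and the sample strings are invented for illustration:

```python
import re

def clean_text(text):
    # Same logic as the function above, repeated so the example is self-contained
    cleaned_text = re.sub(r"\s+", " ", text)
    cleaned_text = cleaned_text.strip()
    cleaned_text = cleaned_text.replace("- ", "")
    return cleaned_text

samples = [
    "  Hello \n\n   world  ",      # mixed whitespace and newlines
    "inter-\nnational news",        # word hyphenated across a line break
    "price - quality trade-off",    # free-standing dash is removed as well
]
for s in samples:
    print(repr(clean_text(s)))
# 'Hello world'
# 'international news'
# 'price quality trade-off'
```

The last sample shows a side effect worth keeping in mind: a dash used as punctuation is dropped too, while hyphens inside words such as `trade-off` are left alone.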


### **Step 4: Create a Function to Extract Text Using Trafilatura**


The `extract_text` function processes an HTML file to extract its main content and metadata using Trafilatura.

In [10]:

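def extract_text(filename):
    """
    Extracts the title and main text of one HTML file using Trafilatura.

    Returns a dict with 'title', 'main_text', and 'filename'; the dict may be
    empty or partial if extraction fails.
    """
    json_object = {}
    print(filename)
    with open(os.path.join(BASE_URI, filename), 'r', encoding='utf-8') as file:
        try:
            html_content = file.read()
            # Keep only the main text: no links, comments, tables, or images
            result = trafilatura.extract(html_content, no_fallback=True, include_links=False, include_comments=False,
                                         include_tables=False, include_images=False, include_formatting=True)
            metadata = trafilatura.extract_metadata(html_content)
            json_object['title'] = metadata.title
            json_object['main_text'] = clean_text(result)
            json_object['filename'] = filename
        except Exception as e:
            # Report the failure and continue with the remaining files
            print("Error processing", filename)
            print("Exception:", str(e))

    return json_object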
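
`trafilatura.extract` returns `None` when it finds no main content. In that case `clean_text(result)` raises, and the record ends up incomplete. One way to keep every record complete is to assemble it through a small helper that substitutes empty values. This is only a sketch: `build_record` is a hypothetical name, and the `SimpleNamespace` stands in for the metadata object so the example runs without the library installed.

```python
import re
from types import SimpleNamespace

def build_record(filename, result, metadata):
    """Assemble one output record, tolerating missing text or metadata."""
    title = getattr(metadata, "title", None) or ""
    text = re.sub(r"\s+", " ", result).strip() if result else ""
    return {"title": title, "main_text": text, "filename": filename}

# Stand-ins for what the extractor might return (assumed shapes)
ok = build_record("a.html", "Some  text\n here", SimpleNamespace(title="A"))
empty = build_record("b.html", None, None)
print(ok)
print(empty)
```

With this approach, a page with no extractable content still yields a record with its filename, which makes failed files easier to spot in the JSON output.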

### **Step 5: Create a Function to Extract Text Using NewsPlease**

The `news_it` function processes an HTML file to extract structured news information using NewsPlease.

In [6]:
def news_it(filename):
    """
    Extracts news information from HTML files and creates a JSON object.

    Args:
        filename (str): The name of the HTML file to process.

    Returns:
        dict: A dictionary containing extracted news information or an empty dictionary if extraction fails.
    """
    json_object = {}  # Initialize an empty dictionary to store news information
    print(filename)

    # Open and read the HTML file
    with open(os.path.join(BASE_URI, filename), 'r', encoding='utf-8') as file:
        try:
            # Use NewsPlease to parse the HTML content and extract news information
            news = NewsPlease.from_html(file.read())
            json_object['title'] = news.title
            json_object['description'] = news.description
            json_object['main_text'] = news.maintext
            json_object['language'] = news.language
            json_object['filename'] = filename
        except Exception as e:
            print("Error processing", filename)
            print("Exception:", str(e))

    return json_object


### **Step 6: Process Files Using Multiprocessing**

We use Python's `multiprocessing` to process multiple HTML files in parallel for efficiency. This block processes all files in the `html` folder using both `Trafilatura` and `NewsPlease`.

In [11]:
# Create a multiprocessing pool for the extract_text function
with multiprocessing.Pool(os.cpu_count()) as pool:
    # Process the list of HTML files using the extract_text function in parallel
    extract_results = pool.map(extract_text, os.listdir(BASE_URI))
    # Write the extracted text information to a JSON file
    with open('html_json_trafilatura.json', 'w', encoding='utf-8') as f:
        json.dump(extract_results, f, ensure_ascii=False, indent=4)
    print("JSON data written to 'html_json_trafilatura.json'")

# Create a separate multiprocessing pool for the news_it function
with multiprocessing.Pool(os.cpu_count()) as pool:
    # Process the list of HTML files using the news_it function in parallel
    news_results = pool.map(news_it, os.listdir(BASE_URI))
    # Write the extracted news information to a JSON file
    with open('html_json_news.json', 'w', encoding='utf-8') as f:
        json.dump(news_results, f, ensure_ascii=False, indent=4)
    print("JSON data written to 'html_json_news.json'")

Process SpawnPoolWorker-46:
Process SpawnPoolWorker-43:
Process SpawnPoolWorker-44:
Process SpawnPoolWorker-42:
Process SpawnPoolWorker-41:
Process SpawnPoolWorker-45:
Traceback (most recent call last):
  File "/opt/anaconda3/envs/uni/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/anaconda3/envs/uni/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
Traceback (most recent call last):
  File "/opt/anaconda3/envs/uni/lib/python3.9/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/opt/anaconda3/envs/uni/lib/python3.9/multiprocessing/queues.py", line 367, in get
    return _ForkingPickler.loads(res)
AttributeError: Can't get attribute 'extract_text' on <module '__main__' (built-in)>
  File "/opt/anaconda3/envs/uni/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/anaconda3/envs/uni/lib/python3.9/multiprocessing/process.py"

KeyboardInterrupt: 
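
The `AttributeError` above is typical of running `multiprocessing.Pool` from a notebook when worker processes are started with the spawn method (the default on macOS since Python 3.8). Each worker starts a fresh interpreter and cannot find functions that exist only in notebook cells. Two common remedies are to move the worker functions into a separate `.py` module and import them, or to use threads, which share the notebook's memory and need no pickling. Below is a sketch of the second option, with the caveat that threads may give less speed-up than processes for CPU-heavy parsing. The names `process_all` and `fake_extract` are hypothetical; `fake_extract` merely stands in for `extract_text` or `news_it`.

```python
from concurrent.futures import ThreadPoolExecutor
import os

def process_all(func, filenames, max_workers=None):
    """Apply func to each filename using a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers or os.cpu_count()) as executor:
        return list(executor.map(func, filenames))

def fake_extract(filename):
    # Placeholder for extract_text / news_it so the sketch runs on its own
    return {"filename": filename, "main_text": f"text of {filename}"}

results = process_all(fake_extract, ["a.html", "b.html", "c.html"])
print([r["filename"] for r in results])
```

In the notebook, `process_all(extract_text, os.listdir(BASE_URI))` would take the place of the `pool.map` call, and the JSON-writing code could stay unchanged.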

### **Final Notes**

This notebook demonstrates how to extract clean and structured text from HTML files using two methods. The results are saved as JSON files:

1. `html_json_trafilatura.json`: Output from Trafilatura.
2. `html_json_news.json`: Output from NewsPlease.

You can analyze these JSON files further for your research or application needs.
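
As a starting point for such analysis, the sketch below loads one output file and reports how many records are empty and the average length of the extracted text. The helper `summarize` is hypothetical, and the demo writes a small made-up sample file rather than assuming real output exists.

```python
import json
import os
import tempfile

def summarize(json_path):
    """Return basic counts for a list of extraction records stored as JSON."""
    with open(json_path, "r", encoding="utf-8") as f:
        records = json.load(f)
    lengths = [len(r.get("main_text") or "") for r in records]
    non_empty = [n for n in lengths if n > 0]
    return {
        "total": len(records),
        "empty": len(records) - len(non_empty),
        "avg_chars": sum(non_empty) / len(non_empty) if non_empty else 0.0,
    }

# Demo with made-up records; an empty dict mimics a file that failed to process
sample = [{"title": "A", "main_text": "abcd", "filename": "a.html"},
          {},
          {"title": "C", "main_text": "ab", "filename": "c.html"}]
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "sample.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    stats = summarize(path)
print(stats)
```

Running `summarize('html_json_trafilatura.json')` and `summarize('html_json_news.json')` side by side gives a quick first comparison of the two extractors.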