# Exploring Enhanced Semantic Analysis in HTML using RAG: Beyond Text

## Overview of HTML-based Retrieval Augmented Generation

This notebook demonstrates the concepts and implementation of using HTML structure for enhanced semantic analysis in RAG systems. We'll explore how preserving HTML markup can improve knowledge retrieval compared to plain text approaches.

## Setup and Requirements

First, let's import the required libraries and set up our environment.

In [None]:
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

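# Set plotting style. sns.set_theme() applies seaborn's defaults; the bare
# 'seaborn' matplotlib style name was deprecated in matplotlib 3.6 and later
# removed, so plt.style.use('seaborn') fails on recent versions.
sns.set_theme()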

## HTML Cleaning and Processing Functions

Here we implement key functions for processing HTML documents while preserving semantic structure.

In [None]:
def clean_html(html_content):
    """Clean HTML while preserving important semantic elements"""
    try:
        soup = BeautifulSoup(html_content, 'html.parser')
        
        # Remove script and style elements
        for script in soup(['script', 'style']):
            script.decompose()
            
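        # Preserve important semantic tags. Skip tags nested inside another
        # preserved tag (e.g. <li> inside <ul>) so their content is not
        # emitted twice.
        preserved_tags = ['p', 'h1', 'h2', 'h3', 'ul', 'ol', 'li', 'table']
        top_level = [tag for tag in soup.find_all(preserved_tags)
                     if tag.find_parent(preserved_tags) is None]
        clean_text = ' '.join(str(tag) for tag in top_level)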
        
        return clean_text
    except Exception as e:
        print(f"Error cleaning HTML: {str(e)}")
        return None
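
To make the contrast with plain text concrete, the following is a minimal sketch that runs a small page through both approaches. The sample page and the names `clean_html_sketch`, `PRESERVED_TAGS`, `sample_html`, `plain_text`, and `structured` are made up for illustration. `clean_html_sketch` is a compact variant of `clean_html` so that the cell runs on its own; keeping only the outermost preserved tags is an assumption made here to avoid repeating nested list items.

```python
from bs4 import BeautifulSoup

PRESERVED_TAGS = ['p', 'h1', 'h2', 'h3', 'ul', 'ol', 'li', 'table']

def clean_html_sketch(html_content):
    """Compact variant of clean_html so this cell is self-contained."""
    soup = BeautifulSoup(html_content, 'html.parser')
    # Drop non-content elements
    for tag in soup(['script', 'style']):
        tag.decompose()
    # Keep only the outermost preserved tags (assumption: avoids duplicates)
    kept = [tag for tag in soup.find_all(PRESERVED_TAGS)
            if tag.find_parent(PRESERVED_TAGS) is None]
    return ' '.join(str(tag) for tag in kept)

# Hypothetical sample page, made up for illustration
sample_html = """
<html><head><style>p { color: red; }</style></head>
<body>
<h1>Install</h1>
<p>Run the installer.</p>
<ul><li>Linux</li><li>macOS</li></ul>
<script>console.log("tracking");</script>
</body></html>
"""

# Plain-text approach: all markup is discarded
plain_text = BeautifulSoup(sample_html, 'html.parser').get_text(' ', strip=True)

# Structure-preserving approach: headings, paragraphs, and lists keep their tags
structured = clean_html_sketch(sample_html)

print("Plain text:", plain_text)
print("Structured:", structured)
```

In the plain-text result, nothing marks "Install" as a heading or "Linux" and "macOS" as list items. The structured result keeps those tags, which a downstream retrieval step could use as semantic cues.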