# Web Audit Tool - Google Colab

This notebook sets up and runs a comprehensive web audit tool with AI-powered insights from a local GPT-OSS model served by Ollama (falling back to llama3 if GPT-OSS is unavailable). Results are saved automatically to Google Drive as timestamped Excel files.

## Features:
- 🔍 **Deep Content Analysis**: Full page scraping and content extraction
- 📊 **Comprehensive SEO Audit**: Meta tags, schema markup, Open Graph
- ⚡ **Performance Analysis**: Load times, resource analysis, Core Web Vitals
- 🌐 **Multi-page Crawling**: Discovers and analyzes internal pages
- 📈 **Visual Reports**: Charts and graphs for audit results
- 🤖 **AI Insights**: Powered by local LLM models
- 💾 **Auto-save**: Results exported to Excel in Google Drive

In [None]:
# Install system dependencies and packages
!apt-get update
!apt-get install -y chromium-browser chromium-chromedriver wget curl

# Install Python packages
!pip install selenium beautifulsoup4 requests pandas openpyxl xlsxwriter
!pip install lighthouse playwright accessibility-checker
!pip install Pillow matplotlib seaborn plotly lxml
!pip install validators tldextract

In [None]:
# Install Ollama
!curl -fsSL https://ollama.ai/install.sh | sh

# Start Ollama service in background
import subprocess
import time
import threading
import os

def start_ollama():
    env = os.environ.copy()
    env['OLLAMA_HOST'] = '0.0.0.0:11434'
    subprocess.Popen(['ollama', 'serve'], env=env, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

# Start Ollama in a separate thread
ollama_thread = threading.Thread(target=start_ollama, daemon=True)
ollama_thread.start()

# Wait for Ollama to start
time.sleep(15)
print("✅ Ollama service started")
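
The fixed 15-second sleep above assumes the server is ready by then. A minimal sketch of an alternative is to poll the `/api/tags` endpoint (the same endpoint the connection test uses later) until it answers. `wait_until` and `ollama_is_up` are hypothetical helper names introduced here for illustration; they are not part of Ollama or of this notebook.

```python
import time
import urllib.error
import urllib.request


def wait_until(check, timeout_s=30.0, interval_s=0.5,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() repeatedly until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def ollama_is_up(base_url="http://localhost:11434"):
    """Return True if the Ollama HTTP API answers on /api/tags (assumed default port)."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

Under these assumptions, `wait_until(ollama_is_up, timeout_s=60)` could stand in for `time.sleep(15)`, and its boolean result says whether the service actually came up.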

In [None]:
# Pull GPT-OSS model
import subprocess
import time

print("📥 Pulling GPT-OSS model (this may take 10-15 minutes)...")

try:
    # Try to pull the model. stderr is merged into stdout so that all output
    # is shown and an unread stderr pipe cannot fill up and block the process.
    process = subprocess.Popen(['ollama', 'pull', 'gpt-oss'],
                              stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    
    # Monitor the process
    for line in iter(process.stdout.readline, ''):
        if line.strip():
            print(f"📦 {line.strip()}")
    
    process.wait()
    
    if process.returncode == 0:
        print("✅ GPT-OSS model pulled successfully!")
    else:
        print("⚠️ GPT-OSS not available, falling back to llama3")
        subprocess.run(['ollama', 'pull', 'llama3'], check=True)
        
except Exception as e:
    print(f"⚠️ Model pull failed, using llama3: {e}")
    subprocess.run(['ollama', 'pull', 'llama3'], check=True)

# List available models
result = subprocess.run(['ollama', 'list'], capture_output=True, text=True)
print("\n📋 Available models:")
print(result.stdout)
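
The pull cell streams output from a long-running subprocess. As a rough sketch, the same pattern can be wrapped in a small reusable helper that merges stderr into stdout, echoes each line, and returns the exit code. `stream_command` is a hypothetical name used only for this illustration.

```python
import subprocess
import sys


def stream_command(cmd):
    """Run cmd, echo its combined stdout/stderr line by line, return (exit_code, lines)."""
    lines = []
    with subprocess.Popen(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True) as proc:
        for line in proc.stdout:
            line = line.rstrip()
            if line:
                print(line)
                lines.append(line)
    # Leaving the `with` block waits for the process, so returncode is set here
    return proc.returncode, lines


if __name__ == "__main__":
    # Harmless demo: run a tiny Python child process instead of `ollama pull`
    code, _ = stream_command([sys.executable, "-c", "print('hello from child')"])
    print(f"exit code: {code}")
```

With this helper, the model pull could be written as `code, _ = stream_command(['ollama', 'pull', 'gpt-oss'])`, with the llama3 fallback triggered when `code` is non-zero.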

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Create results directory
results_base_path = '/content/drive/MyDrive/Web_Audit_Results'
os.makedirs(results_base_path, exist_ok=True)
print(f"📁 Results will be saved to: {results_base_path}")

In [None]:
# Import all required libraries
import os
import sys
import json
import time
import requests
import pandas as pd
from datetime import datetime
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup
import re

# Selenium imports
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException

# Set up Chrome driver for Colab
def setup_chrome_driver():
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--window-size=1920,1080')
    chrome_options.add_argument('--disable-extensions')
    chrome_options.add_argument('--disable-plugins')
    
    driver = webdriver.Chrome(options=chrome_options)
    return driver

print("✅ All libraries imported successfully")

In [None]:
# Ollama AI Integration
OLLAMA_BASE_URL = "http://localhost:11434"

def test_ollama_connection():
    """Test if Ollama is running and models are available"""
    try:
        response = requests.get(f"{OLLAMA_BASE_URL}/api/tags", timeout=5)
        if response.status_code == 200:
            models = response.json().get('models', [])
            available_models = [model.get('name', '') for model in models]
            
            # Check for preferred models
            if any('gpt-oss' in model for model in available_models):
                return True, 'gpt-oss'
            elif any('llama3' in model for model in available_models):
                return True, 'llama3'
            else:
                return False, None
        return False, None
    except Exception as e:
        print(f"❌ Ollama connection error: {e}")
        return False, None

def generate_ai_insights(audit_data, model_name):
    """Generate AI insights from audit data"""
    
    prompt = f"""
    Analyze this website audit and provide actionable recommendations:
    
    Website: {audit_data.get('url', 'Unknown')}
    Title: {audit_data.get('title', 'N/A')}
    Load Time: {audit_data.get('load_time_ms', 'N/A')} ms
    Page Size: {audit_data.get('page_size_kb', 'N/A')} KB
    
    SEO Analysis:
    - H1 tags: {audit_data.get('h1_count', 0)}
    - H2 tags: {audit_data.get('h2_count', 0)}
    - Meta description: {audit_data.get('meta_description_length', 0)} characters
    - Images without alt: {audit_data.get('images_no_alt', 0)}
    
    Technical Issues:
    - Broken links: {audit_data.get('broken_links', 0)}
    - Missing viewport: {'Yes' if not audit_data.get('has_viewport') else 'No'}
    - HTTPS: {'Yes' if audit_data.get('is_https') else 'No'}
    
    Provide 5 specific, actionable recommendations to improve this website.
    """
    
    try:
        payload = {
            "model": model_name,
            "prompt": prompt,
            "stream": False,
            "options": {
                "temperature": 0.1,
                "top_p": 0.9
            }
        }
        
        response = requests.post(f"{OLLAMA_BASE_URL}/api/generate", 
                               json=payload, timeout=60)
        
        if response.status_code == 200:
            return response.json().get('response', 'No insights generated')
        else:
            return f"Error generating insights: {response.status_code}"
            
    except Exception as e:
        return f"Error generating insights: {e}"

# Test connection
connected, model = test_ollama_connection()
if connected:
    print(f"✅ Ollama connected with model: {model}")
else:
    print("⚠️ Ollama not available - AI insights will be disabled")
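
The connection test selects a model with a substring check, which would also accept any installed model whose name merely contains `gpt-oss` or `llama3`. A slightly stricter sketch compares the base name before the tag suffix, since Ollama lists models with names such as `llama3:latest`. `pick_model` is a hypothetical helper, and the preference order is taken from the cell above.

```python
def pick_model(available, preferred=("gpt-oss", "llama3")):
    """Return the first installed model whose base name matches a preference.

    Model names are compared on the part before the colon, so
    'llama3:latest' matches the preference 'llama3'.
    """
    for wanted in preferred:
        for name in available:
            if name.split(":", 1)[0] == wanted:
                return name
    return None
```

The function returns the full installed name (tag included), which can be passed as the `model` field of the generate request, or `None` when no preferred model is installed.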

In [None]:
# Enhanced Web Audit Functions
import validators
import tldextract
from collections import defaultdict, Counter
import matplotlib.pyplot as plt
import seaborn as sns

class ComprehensiveWebAuditor:
    def __init__(self, max_pages=5, timeout=30):
        self.driver = None
        self.session = requests.Session()
        self.results = {}
        self.max_pages = max_pages
        self.timeout = timeout
        self.discovered_pages = set()
        self.crawled_pages = []
        
    def start_session(self):
        """Initialize browser session"""
        self.driver = setup_chrome_driver()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        })
        
    def end_session(self):
        """Close browser session"""
        if self.driver:
            self.driver.quit()
        self.session.close()
            
    def audit_website(self, url):
        """Perform comprehensive website audit"""
        
        if not url.startswith(('http://', 'https://')):
            url = 'https://' + url
            
        if not validators.url(url):
            return {'error': f'Invalid URL: {url}'}
            
        print(f"🔍 Starting comprehensive audit for: {url}")
        
        try:
            # Initialize results structure
            self.results = {
                'audit_date': datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
                'primary_url': url,
                'pages_analyzed': [],
                'summary_metrics': {},
                'seo_analysis': {},
                'performance_metrics': {},
                'content_analysis': {},
                'technical_analysis': {},
                'accessibility_analysis': {},
                'security_analysis': {},
                'issues_found': [],
                'recommendations': []
            }
            
            # Discover internal pages
            print("🕷️ Discovering internal pages...")
            self._discover_pages(url)
            
            # Audit each discovered page. A set has no stable order, so put the
            # primary URL first to guarantee it is always among the audited pages.
            other_pages = sorted(self.discovered_pages - {url})
            pages_to_audit = ([url] + other_pages)[:self.max_pages]
            print(f"📄 Analyzing {len(pages_to_audit)} pages...")
            
            for i, page_url in enumerate(pages_to_audit, 1):
                print(f"  📋 Page {i}/{len(pages_to_audit)}: {page_url}")
                page_results = self._audit_single_page(page_url)
                self.crawled_pages.append(page_results)
                
            # Compile comprehensive results
            self._compile_results()
            
            print("✅ Comprehensive audit completed successfully")
            return self.results
            
        except Exception as e:
            print(f"❌ Audit failed: {e}")
            return {'error': str(e), 'url': url}
    
    def _discover_pages(self, start_url):
        """Discover internal pages through sitemap and crawling"""
        domain = tldextract.extract(start_url).registered_domain
        self.discovered_pages.add(start_url)
        
        try:
            # Sitemaps and robots.txt live at the site root, not under the start path
            parsed = urlparse(start_url)
            site_root = f"{parsed.scheme}://{parsed.netloc}"
            sitemap_urls = [
                f"{site_root}/sitemap.xml",
                f"{site_root}/sitemap_index.xml",
                f"{site_root}/robots.txt"
            ]
            
            for sitemap_url in sitemap_urls:
                try:
                    response = self.session.get(sitemap_url, timeout=10)
                    if response.status_code == 200:
                        # endswith('.xml') also covers sitemap_index.xml
                        if sitemap_url.endswith('.xml'):
                            self._parse_sitemap(response.text, domain)
                        elif sitemap_url.endswith('robots.txt'):
                            self._parse_robots(response.text, start_url, domain)
                except requests.RequestException:
                    continue
            
            # Crawl homepage for internal links
            self._crawl_for_links(start_url, domain)
            
        except Exception as e:
            print(f"⚠️ Page discovery error: {e}")
    
    def _parse_sitemap(self, sitemap_content, domain):
        """Parse sitemap XML for URLs"""
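        try:
            soup = BeautifulSoup(sitemap_content, 'xml')
            for loc in soup.find_all('loc'):
                url = loc.text.strip()
                # Compare registered domains; a substring test would also match
                # unrelated URLs that merely contain the domain text
                same_site = tldextract.extract(url).registered_domain == domain
                if same_site and len(self.discovered_pages) < self.max_pages * 2:
                    self.discovered_pages.add(url)
        except Exception:
            # A malformed sitemap is not fatal; discovery continues with link crawling
            pass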
    
    def _parse_robots(self, robots_content, base_url, domain):
        """Parse robots.txt for sitemap references"""
        try:
            for line in robots_content.split('\n'):
                if line.lower().startswith('sitemap:'):
                    sitemap_url = line.split(':', 1)[1].strip()
                    response = self.session.get(sitemap_url, timeout=10)
                    if response.status_code == 200:
                        self._parse_sitemap(response.text, domain)
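        except Exception:
            # Ignore unreadable robots.txt entries; a bare except would also
            # swallow KeyboardInterrupt
            pass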
    
    def _crawl_for_links(self, url, domain):
        """Crawl page for internal links"""
        try:
            self.driver.get(url)
            WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((By.TAG_NAME, "body"))
            )
            
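            links = self.driver.find_elements(By.TAG_NAME, "a")
            for link in links:
                href = link.get_attribute("href")
                # Skip empty and non-web links (mailto:, tel:, javascript:)
                if not href or not href.startswith(('http://', 'https://')):
                    continue
                # Match on the registered domain rather than a substring of the URL
                same_site = tldextract.extract(href).registered_domain == domain
                if same_site and len(self.discovered_pages) < self.max_pages * 2:
                    self.discovered_pages.add(href)
                    
        except Exception as e:
            print(f"⚠️ Link crawling error: {e}")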
    
    def _audit_single_page(self, url):
        """Perform detailed audit of a single page"""
        page_results = {
            'url': url,
            'audit_timestamp': datetime.now().isoformat(),
            'load_metrics': {},
            'seo_data': {},
            'content_data': {},
            'technical_data': {},
            'accessibility_data': {},
            'issues': [],
            'page_source': ''
        }
        
        try:
            # Load page and measure performance
            start_time = time.time()
            self.driver.get(url)
            
            # Wait for page load
            WebDriverWait(self.driver, self.timeout).until(
                EC.presence_of_element_located((By.TAG_NAME, "body"))
            )
            
            load_time = (time.time() - start_time) * 1000
            page_results['load_metrics']['load_time_ms'] = round(load_time, 2)
            
            # Get page source for BeautifulSoup analysis
            page_source = self.driver.page_source
            page_results['page_source'] = page_source[:1000] + '...' if len(page_source) > 1000 else page_source
            soup = BeautifulSoup(page_source, 'html.parser')
            
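            # Comprehensive analysis. _analyze_content() removes <script> and
            # <style> elements from the soup in place, so it runs after the
            # analyses that inspect or count those elements.
            page_results['seo_data'] = self._analyze_seo(soup)
            page_results['technical_data'] = self._analyze_technical(soup)
            page_results['accessibility_data'] = self._analyze_accessibility(soup)
            page_results['content_data'] = self._analyze_content(soup)  # mutates soup; keep last
            page_results['load_metrics'].update(self._analyze_performance())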
            
            # Identify issues
            page_results['issues'] = self._identify_page_issues(page_results)
            
        except Exception as e:
            page_results['error'] = str(e)
            print(f"⚠️ Error auditing {url}: {e}")
        
        return page_results
    
    def _analyze_seo(self, soup):
        """Comprehensive SEO analysis"""
        seo_data = {}
        
        # Title analysis
        title_tag = soup.find('title')
        if title_tag:
            title_text = title_tag.get_text().strip()
            seo_data['title'] = title_text
            seo_data['title_length'] = len(title_text)
            seo_data['title_optimized'] = 30 <= len(title_text) <= 60
        else:
            seo_data['title'] = 'Missing'
            seo_data['title_length'] = 0
            seo_data['title_optimized'] = False
        
        # Meta tags analysis
        meta_desc = soup.find('meta', attrs={'name': 'description'})
        if meta_desc and meta_desc.get('content'):
            desc_content = meta_desc['content'].strip()
            seo_data['meta_description'] = desc_content
            seo_data['meta_description_length'] = len(desc_content)
            seo_data['meta_desc_optimized'] = 120 <= len(desc_content) <= 160
        else:
            seo_data['meta_description'] = 'Missing'
            seo_data['meta_description_length'] = 0
            seo_data['meta_desc_optimized'] = False
        
        # Keywords
        meta_keywords = soup.find('meta', attrs={'name': 'keywords'})
        seo_data['meta_keywords'] = meta_keywords.get('content', 'Not specified') if meta_keywords else 'Not specified'
        
        # Heading structure
        headings = {}
        for i in range(1, 7):
            h_tags = soup.find_all(f'h{i}')
            headings[f'h{i}_count'] = len(h_tags)
            headings[f'h{i}_text'] = [h.get_text().strip() for h in h_tags[:3]]  # First 3
        
        seo_data.update(headings)
        seo_data['proper_h1_usage'] = headings['h1_count'] == 1
        
        # Open Graph tags
        og_tags = {}
        for meta in soup.find_all('meta', property=lambda x: x and x.startswith('og:')):
            og_tags[meta.get('property')] = meta.get('content')
        seo_data['open_graph'] = og_tags
        
        # Twitter Card tags
        twitter_tags = {}
        for meta in soup.find_all('meta', attrs={'name': lambda x: x and x.startswith('twitter:')}):
            twitter_tags[meta.get('name')] = meta.get('content')
        seo_data['twitter_cards'] = twitter_tags
        
        # Schema markup
        schema_scripts = soup.find_all('script', type='application/ld+json')
        seo_data['schema_markup_count'] = len(schema_scripts)
        seo_data['has_schema'] = len(schema_scripts) > 0
        
        # Image SEO
        images = soup.find_all('img')
        images_with_alt = [img for img in images if img.get('alt')]
        seo_data['total_images'] = len(images)
        seo_data['images_with_alt'] = len(images_with_alt)
        seo_data['images_without_alt'] = len(images) - len(images_with_alt)
        seo_data['alt_text_score'] = round(len(images_with_alt) / max(len(images), 1) * 100, 2)
        
        return seo_data
    
    def _analyze_content(self, soup):
        """Comprehensive content analysis"""
        content_data = {}
        
        # Remove script and style elements
        for script in soup(["script", "style"]):
            script.decompose()
        
        # Text content analysis
        text_content = soup.get_text()
        words = text_content.split()
        word_count = len(words)
        
        content_data['word_count'] = word_count
        content_data['character_count'] = len(text_content)
        content_data['reading_time_minutes'] = round(word_count / 200, 1)  # Average reading speed
        
        # Content quality indicators
        content_data['content_quality_score'] = min(word_count / 300 * 100, 100)
        
        # Link analysis
        links = soup.find_all('a', href=True)
        internal_links = []
        external_links = []
        
        current_domain = tldextract.extract(self.driver.current_url).registered_domain
        
        for link in links:
            href = link['href']
            if href.startswith(('http://', 'https://')):
                link_domain = tldextract.extract(href).registered_domain
                if link_domain == current_domain:
                    internal_links.append(href)
                else:
                    external_links.append(href)
            elif href.startswith('/') or not href.startswith(('mailto:', 'tel:', '#')):
                internal_links.append(href)
        
        content_data['total_links'] = len(links)
        content_data['internal_links'] = len(internal_links)
        content_data['external_links'] = len(external_links)
        content_data['internal_link_list'] = internal_links[:10]  # First 10
        content_data['external_link_list'] = external_links[:10]  # First 10
        
        # Media content
        content_data['video_count'] = len(soup.find_all('video'))
        content_data['audio_count'] = len(soup.find_all('audio'))
        content_data['iframe_count'] = len(soup.find_all('iframe'))
        
        # Lists and tables
        content_data['list_count'] = len(soup.find_all(['ul', 'ol']))
        content_data['table_count'] = len(soup.find_all('table'))
        
        return content_data
    
    def _analyze_technical(self, soup):
        """Technical SEO and structure analysis"""
        technical_data = {}
        
        # URL structure
        current_url = self.driver.current_url
        technical_data['url'] = current_url
        technical_data['is_https'] = current_url.startswith('https://')
        technical_data['url_length'] = len(current_url)
        technical_data['url_clean'] = '?' not in current_url and '#' not in current_url
        
        # Meta tags
        viewport = soup.find('meta', attrs={'name': 'viewport'})
        technical_data['has_viewport'] = viewport is not None
        technical_data['viewport_content'] = viewport.get('content') if viewport else 'Missing'
        
        # Canonical URL
        canonical = soup.find('link', rel='canonical')
        technical_data['has_canonical'] = canonical is not None
        technical_data['canonical_url'] = canonical.get('href') if canonical else 'Missing'
        
        # Robots meta
        robots = soup.find('meta', attrs={'name': 'robots'})
        technical_data['robots_meta'] = robots.get('content') if robots else 'Not specified'
        
        # Language
        html_tag = soup.find('html')
        technical_data['html_lang'] = html_tag.get('lang') if html_tag else 'Not specified'
        
        # Favicon
        favicon = soup.find('link', rel=['icon', 'shortcut icon'])
        technical_data['has_favicon'] = favicon is not None
        
        # CSS and JS resources
        stylesheets = soup.find_all('link', rel='stylesheet')
        scripts = soup.find_all('script')
        
        technical_data['stylesheet_count'] = len(stylesheets)
        technical_data['script_count'] = len(scripts)
        technical_data['inline_styles'] = len(soup.find_all(style=True))
        
        # Page size
        page_size = len(self.driver.page_source.encode('utf-8'))
        technical_data['page_size_bytes'] = page_size
        technical_data['page_size_kb'] = round(page_size / 1024, 2)
        
        return technical_data
    
    def _analyze_accessibility(self, soup):
        """Accessibility analysis"""
        accessibility_data = {}
        
        # Form accessibility
        forms = soup.find_all('form')
        inputs = soup.find_all('input')
        labels = soup.find_all('label')
        
        accessibility_data['form_count'] = len(forms)
        accessibility_data['input_count'] = len(inputs)
        accessibility_data['label_count'] = len(labels)
        
        # ARIA attributes
        aria_elements = soup.find_all(attrs={'aria-label': True})
        aria_elements.extend(soup.find_all(attrs={'aria-labelledby': True}))
        aria_elements.extend(soup.find_all(attrs={'role': True}))
        
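        # Count distinct elements by identity: an element matching several ARIA
        # attributes appears more than once in the combined list, while two
        # separate elements with identical markup should still count twice
        accessibility_data['aria_elements'] = len({id(el) for el in aria_elements})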
        
        # Alt text for images
        images = soup.find_all('img')
        images_with_alt = [img for img in images if img.get('alt')]
        accessibility_data['image_alt_coverage'] = round(len(images_with_alt) / max(len(images), 1) * 100, 2)
        
        # In-page anchor links (href starting with '#'); a rough proxy for skip links
        skip_links = soup.find_all('a', href=lambda x: x and x.startswith('#'))
        accessibility_data['skip_links'] = len(skip_links)
        
        # Stylesheet presence only; color contrast itself is not measured here
        accessibility_data['has_css'] = len(soup.find_all('link', rel='stylesheet')) > 0
        
        return accessibility_data
    
    def _analyze_performance(self):
        """Performance metrics analysis"""
        performance_data = {}
        
        try:
            # Navigation timing
            timing = self.driver.execute_script("""
                var timing = performance.timing;
                var navigation = performance.navigation;
                return {
                    'dns_lookup': timing.domainLookupEnd - timing.domainLookupStart,
                    'tcp_connect': timing.connectEnd - timing.connectStart,
                    'server_response': timing.responseEnd - timing.requestStart,
                    'dom_processing': timing.domComplete - timing.domLoading,
                    'total_load_time': timing.loadEventEnd - timing.navigationStart,
                    'navigation_type': navigation.type,
                    'redirect_count': navigation.redirectCount
                };
            """)
            
            performance_data.update(timing)
            
            # Resource timing
            resources = self.driver.execute_script("""
                return performance.getEntriesByType('resource').map(function(r) {
                    return {
                        'name': r.name,
                        'type': r.initiatorType,
                        'duration': r.duration,
                        'size': r.transferSize || 0
                    };
                });
            """)
            
            if resources:
                resource_summary = defaultdict(list)
                for resource in resources:
                    resource_summary[resource['type']].append(resource)
                
                performance_data['resource_summary'] = {
                    rtype: {
                        'count': len(entries),
                        'total_size': sum(r['size'] for r in entries),
                        'avg_duration': sum(r['duration'] for r in entries) / len(entries)
                    }
                    for rtype, entries in resource_summary.items()
                }
            
        except Exception as e:
            print(f"⚠️ Performance analysis error: {e}")
        
        return performance_data
    
    def _identify_page_issues(self, page_results):
        """Identify issues and recommendations for a page"""
        issues = []
        
        seo = page_results.get('seo_data', {})
        technical = page_results.get('technical_data', {})
        content = page_results.get('content_data', {})
        load_metrics = page_results.get('load_metrics', {})
        
        # SEO issues
        if not seo.get('title_optimized', False):
            issues.append({
                'type': 'SEO',
                'severity': 'High',
                'issue': f"Title length ({seo.get('title_length', 0)} chars) not optimal (30-60 chars)",
                'recommendation': 'Optimize title length to 30-60 characters'
            })
        
        if not seo.get('meta_desc_optimized', False):
            issues.append({
                'type': 'SEO',
                'severity': 'High',
                'issue': f"Meta description length ({seo.get('meta_description_length', 0)} chars) not optimal (120-160 chars)",
                'recommendation': 'Write compelling meta description between 120-160 characters'
            })
        
        if not seo.get('proper_h1_usage', False):
            issues.append({
                'type': 'SEO',
                'severity': 'Medium',
                'issue': f"Improper H1 usage ({seo.get('h1_count', 0)} H1 tags found)",
                'recommendation': 'Use exactly one H1 tag per page'
            })
        
        # Technical issues
        if not technical.get('is_https', False):
            issues.append({
                'type': 'Security',
                'severity': 'High',
                'issue': 'Page not served over HTTPS',
                'recommendation': 'Implement SSL certificate and redirect HTTP to HTTPS'
            })
        
        if not technical.get('has_viewport', False):
            issues.append({
                'type': 'Mobile',
                'severity': 'High',
                'issue': 'Missing viewport meta tag',
                'recommendation': 'Add viewport meta tag for mobile responsiveness'
            })
        
        # Performance issues
        load_time = load_metrics.get('load_time_ms', 0)
        if load_time > 3000:
            issues.append({
                'type': 'Performance',
                'severity': 'High',
                'issue': f'Slow page load time ({load_time}ms)',
                'recommendation': 'Optimize images, reduce server response time, enable compression'
            })
        
        # Content issues
        if content.get('word_count', 0) < 300:
            issues.append({
                'type': 'Content',
                'severity': 'Medium',
                'issue': f"Low word count ({content.get('word_count', 0)} words)",
                'recommendation': 'Add more valuable content (aim for 300+ words)'
            })
        
        return issues
    
    def _compile_results(self):
        """Compile comprehensive results from all pages"""
        if not self.crawled_pages:
            return
        
        # Summary metrics
        self.results['summary_metrics'] = {
            'total_pages_analyzed': len(self.crawled_pages),
            'avg_load_time': round(sum(page.get('load_metrics', {}).get('load_time_ms', 0) 
                                     for page in self.crawled_pages) / len(self.crawled_pages), 2),
            'total_issues_found': sum(len(page.get('issues', [])) for page in self.crawled_pages),
            'pages_with_issues': len([page for page in self.crawled_pages if page.get('issues')]),
            'avg_word_count': round(sum(page.get('content_data', {}).get('word_count', 0) 
                                      for page in self.crawled_pages) / len(self.crawled_pages), 0)
        }
        
        # Aggregate SEO data
        all_issues = []
        for page in self.crawled_pages:
            all_issues.extend(page.get('issues', []))
        
        issue_counts = Counter(issue['type'] for issue in all_issues)
        severity_counts = Counter(issue['severity'] for issue in all_issues)
        
        self.results['issues_summary'] = {
            'by_type': dict(issue_counts),
            'by_severity': dict(severity_counts),
            'all_issues': all_issues
        }
        
        # Store detailed page results
        self.results['pages_analyzed'] = self.crawled_pages

print("✅ Comprehensive WebAuditor class created")
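
Page discovery collects links as they appear in the page, so the same page can be queued more than once under different fragments or relative forms. A minimal sketch of link normalization using only the standard library follows. `normalize_internal_link` is a hypothetical helper, and it compares exact hostnames, which is stricter than the registered-domain comparison the class performs with `tldextract` (subdomains count as external here).

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_internal_link(page_url, href):
    """Resolve href against page_url.

    Returns an absolute, fragment-free URL when the link points to the same
    host over http(s); returns None for empty, external, or non-web links.
    """
    if not href:
        return None
    absolute, _fragment = urldefrag(urljoin(page_url, href.strip()))
    target = urlparse(absolute)
    if target.scheme not in ("http", "https"):
        return None
    if target.hostname != urlparse(page_url).hostname:
        return None
    return absolute
```

Adding only the non-`None` results to a set would leave one entry per distinct page, which keeps the `max_pages` budget from being spent on duplicates.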

In [None]:
# Enhanced Results saving and reporting functions
def create_comprehensive_report(audit_results):
    """Create comprehensive visual and text reports"""
    
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    
    # Extract domain for filename
    primary_url = audit_results.get('primary_url', 'unknown')
    domain = tldextract.extract(primary_url).registered_domain or 'unknown'
    
    # Create Excel report with multiple sheets
    filename = f"comprehensive_audit_{domain}_{timestamp}.xlsx"
    filepath = os.path.join(results_base_path, filename)
    
    with pd.ExcelWriter(filepath, engine='xlsxwriter') as writer:
        workbook = writer.book
        
        # Define formats
        header_format = workbook.add_format({
            'bold': True, 'bg_color': '#D7E4BC', 'border': 1, 'text_wrap': True
        })
        
        critical_format = workbook.add_format({
            'bg_color': '#FFE6E6', 'border': 1
        })
        
        warning_format = workbook.add_format({
            'bg_color': '#FFF2E6', 'border': 1
        })
        
        good_format = workbook.add_format({
            'bg_color': '#E6F7E6', 'border': 1
        })
        
        # 1. Executive Summary
        summary_data = {
            'Metric': [
                'Audit Date', 'Primary URL', 'Pages Analyzed', 'Total Issues Found',
                'Average Load Time (ms)', 'Pages with Issues', 'Average Word Count'
            ],
            'Value': [
                audit_results.get('audit_date', 'N/A'),
                audit_results.get('primary_url', 'N/A'),
                audit_results.get('summary_metrics', {}).get('total_pages_analyzed', 0),
                audit_results.get('summary_metrics', {}).get('total_issues_found', 0),
                audit_results.get('summary_metrics', {}).get('avg_load_time', 0),
                audit_results.get('summary_metrics', {}).get('pages_with_issues', 0),
                audit_results.get('summary_metrics', {}).get('avg_word_count', 0)
            ]
        }
        
        summary_df = pd.DataFrame(summary_data)
        summary_df.to_excel(writer, sheet_name='Executive Summary', index=False)
        
        # Format summary sheet
        summary_sheet = writer.sheets['Executive Summary']
        summary_sheet.set_column('A:A', 25)
        summary_sheet.set_column('B:B', 40)
        
        for col_num, value in enumerate(summary_df.columns.values):
            summary_sheet.write(0, col_num, value, header_format)
        
        # 2. Issues by Page
        issues_data = []
        pages_analyzed = audit_results.get('pages_analyzed', [])
        
        for page in pages_analyzed:
            page_url = page.get('url', 'Unknown')
            for issue in page.get('issues', []):
                issues_data.append({
                    'Page URL': page_url,
                    'Issue Type': issue.get('type', 'Unknown'),
                    'Severity': issue.get('severity', 'Unknown'),
                    'Issue Description': issue.get('issue', 'No description'),
                    'Recommendation': issue.get('recommendation', 'No recommendation')
                })
        
        if issues_data:
            issues_df = pd.DataFrame(issues_data)
            issues_df.to_excel(writer, sheet_name='Issues Found', index=False)
            
            # Format issues sheet
            issues_sheet = writer.sheets['Issues Found']
            for col_num, value in enumerate(issues_df.columns.values):
                issues_sheet.write(0, col_num, value, header_format)
            
            # Apply conditional formatting based on severity
            for row_num, row in issues_df.iterrows():
                severity = row['Severity']
                row_format = critical_format if severity == 'High' else warning_format if severity == 'Medium' else good_format
                
                for col_num in range(len(issues_df.columns)):
                    issues_sheet.write(row_num + 1, col_num, row.iloc[col_num], row_format)
        
        # 3. Page Details
        page_details = []
        for page in pages_analyzed:
            seo_data = page.get('seo_data', {})
            content_data = page.get('content_data', {})
            technical_data = page.get('technical_data', {})
            load_metrics = page.get('load_metrics', {})
            
            page_details.append({
                'URL': page.get('url', 'Unknown'),
                'Title': seo_data.get('title', 'Missing'),
                'Title Length': seo_data.get('title_length', 0),
                'Meta Description Length': seo_data.get('meta_description_length', 0),
                'Word Count': content_data.get('word_count', 0),
                'Load Time (ms)': load_metrics.get('load_time_ms', 0),
                'Page Size (KB)': technical_data.get('page_size_kb', 0),
                'Internal Links': content_data.get('internal_links', 0),
                'External Links': content_data.get('external_links', 0),
                'Images Total': seo_data.get('total_images', 0),
                'Images with Alt': seo_data.get('images_with_alt', 0),
                'H1 Count': seo_data.get('h1_count', 0),
                'Issues Count': len(page.get('issues', []))
            })
        
        if page_details:
            details_df = pd.DataFrame(page_details)
            details_df.to_excel(writer, sheet_name='Page Details', index=False)
            
            # Format details sheet
            details_sheet = writer.sheets['Page Details']
            for col_num, value in enumerate(details_df.columns.values):
                details_sheet.write(0, col_num, value, header_format)
        
        # 4. SEO Analysis Summary
        seo_summary = []
        for page in pages_analyzed:
            seo_data = page.get('seo_data', {})
            seo_summary.append({
                'URL': page.get('url', 'Unknown'),
                'Has Title': 'Yes' if seo_data.get('title', 'Missing') != 'Missing' else 'No',
                'Title Optimized': 'Yes' if seo_data.get('title_optimized', False) else 'No',
                'Has Meta Description': 'Yes' if seo_data.get('meta_description', 'Missing') != 'Missing' else 'No',
                'Meta Desc Optimized': 'Yes' if seo_data.get('meta_desc_optimized', False) else 'No',
                'Proper H1 Usage': 'Yes' if seo_data.get('proper_h1_usage', False) else 'No',
                'Has Open Graph': 'Yes' if seo_data.get('open_graph', {}) else 'No',
                'Has Schema Markup': 'Yes' if seo_data.get('has_schema', False) else 'No',
                'Alt Text Score': f"{seo_data.get('alt_text_score', 0)}%"
            })
        
        if seo_summary:
            seo_df = pd.DataFrame(seo_summary)
            seo_df.to_excel(writer, sheet_name='SEO Analysis', index=False)
            
            # Format SEO sheet
            seo_sheet = writer.sheets['SEO Analysis']
            for col_num, value in enumerate(seo_df.columns.values):
                seo_sheet.write(0, col_num, value, header_format)
    
    print(f"📊 Comprehensive report saved to: {filepath}")
    return filepath

def create_visual_charts(audit_results):
    """Create visual charts for audit results"""
    
    try:
        # Set up the plotting style
        plt.style.use('default')
        sns.set_palette("husl")
        
        # Create figure with subplots
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        fig.suptitle('Website Audit Dashboard', fontsize=16, fontweight='bold')
        
        # 1. Issues by Type
        issues_summary = audit_results.get('issues_summary', {})
        by_type = issues_summary.get('by_type', {})
        
        if by_type:
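            # Pass plain lists: pie() converts its input to a NumPy array, which dict views do not support reliably
            axes[0, 0].pie(list(by_type.values()), labels=list(by_type.keys()), autopct='%1.1f%%')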
            axes[0, 0].set_title('Issues by Type')
        else:
            axes[0, 0].text(0.5, 0.5, 'No Issues Found', ha='center', va='center', transform=axes[0, 0].transAxes)
            axes[0, 0].set_title('Issues by Type')
        
        # 2. Issues by Severity
        by_severity = issues_summary.get('by_severity', {})
        
        if by_severity:
            colors = {'High': '#FF6B6B', 'Medium': '#FFE66D', 'Low': '#4ECDC4'}
            severity_colors = [colors.get(sev, '#95A5A6') for sev in by_severity.keys()]
            axes[0, 1].bar(by_severity.keys(), by_severity.values(), color=severity_colors)
            axes[0, 1].set_title('Issues by Severity')
            axes[0, 1].set_ylabel('Number of Issues')
        else:
            axes[0, 1].text(0.5, 0.5, 'No Issues Found', ha='center', va='center', transform=axes[0, 1].transAxes)
            axes[0, 1].set_title('Issues by Severity')
        
        # 3. Page Load Times
        pages = audit_results.get('pages_analyzed', [])
        if pages:
            load_times = [page.get('load_metrics', {}).get('load_time_ms', 0) for page in pages]
            page_names = [urlparse(page.get('url', '')).path or '/' for page in pages]
            
            axes[1, 0].bar(range(len(load_times)), load_times, color='skyblue')
            axes[1, 0].set_title('Page Load Times (ms)')
            axes[1, 0].set_ylabel('Load Time (ms)')
            axes[1, 0].set_xticks(range(len(page_names)))
            axes[1, 0].set_xticklabels(page_names, rotation=45, ha='right')
            
            # Add horizontal line for 3-second rule
            axes[1, 0].axhline(y=3000, color='red', linestyle='--', alpha=0.7, label='3s target')
            axes[1, 0].legend()
        
        # 4. SEO Score Distribution
        if pages:
            seo_scores = []
            for page in pages:
                seo_data = page.get('seo_data', {})
                score = 0
                score += 20 if seo_data.get('title_optimized', False) else 0
                score += 20 if seo_data.get('meta_desc_optimized', False) else 0
                score += 20 if seo_data.get('proper_h1_usage', False) else 0
                score += 20 if seo_data.get('alt_text_score', 0) > 80 else 0
                score += 20 if seo_data.get('has_schema', False) else 0
                seo_scores.append(score)
            
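            # Scores take six values (0, 20, ..., 100), so center one bin on each value
            axes[1, 1].hist(seo_scores, bins=list(range(-10, 111, 20)), color='lightgreen', alpha=0.7, edgecolor='black')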
            axes[1, 1].set_title('SEO Score Distribution')
            axes[1, 1].set_xlabel('SEO Score (%)')
            axes[1, 1].set_ylabel('Number of Pages')
        
        plt.tight_layout()
        
        # Save chart
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        primary_url = audit_results.get('primary_url', 'unknown')
        domain = tldextract.extract(primary_url).registered_domain or 'unknown'
        
        chart_filename = f"audit_dashboard_{domain}_{timestamp}.png"
        chart_filepath = os.path.join(results_base_path, chart_filename)
        
        plt.savefig(chart_filepath, dpi=300, bbox_inches='tight')
        plt.show()
        
        print(f"📈 Dashboard saved to: {chart_filepath}")
        return chart_filepath
        
    except Exception as e:
        print(f"⚠️ Error creating charts: {e}")
        return None

def generate_comprehensive_report(audit_results):
    """Generate comprehensive text report"""
    print("\n" + "="*80)
    print("📊 COMPREHENSIVE WEBSITE AUDIT REPORT")
    print("="*80)
    
    summary = audit_results.get('summary_metrics', {})
    issues_summary = audit_results.get('issues_summary', {})
    
    print(f"🌐 Primary URL: {audit_results.get('primary_url', 'N/A')}")
    print(f"📅 Audit Date: {audit_results.get('audit_date', 'N/A')}")
    print(f"📄 Pages Analyzed: {summary.get('total_pages_analyzed', 0)}")
    print(f"⚡ Average Load Time: {summary.get('avg_load_time', 0)} ms")
    print(f"📝 Average Word Count: {summary.get('avg_word_count', 0)}")
    
    print(f"\n🚨 ISSUES SUMMARY:")
    print(f"  • Total Issues Found: {summary.get('total_issues_found', 0)}")
    print(f"  • Pages with Issues: {summary.get('pages_with_issues', 0)}")
    
    by_severity = issues_summary.get('by_severity', {})
    if by_severity:
        print(f"  • High Priority: {by_severity.get('High', 0)}")
        print(f"  • Medium Priority: {by_severity.get('Medium', 0)}")
        print(f"  • Low Priority: {by_severity.get('Low', 0)}")
    
    by_type = issues_summary.get('by_type', {})
    if by_type:
        print(f"\n📊 ISSUES BY CATEGORY:")
        for issue_type, count in by_type.items():
            print(f"  • {issue_type}: {count}")
    
    # Top issues
    all_issues = issues_summary.get('all_issues', [])
    high_priority = [issue for issue in all_issues if issue.get('severity') == 'High']
    
    if high_priority:
        print(f"\n🔥 TOP PRIORITY ISSUES:")
        for i, issue in enumerate(high_priority[:5], 1):
            print(f"  {i}. {issue.get('issue', 'Unknown issue')}")
            print(f"     💡 {issue.get('recommendation', 'No recommendation')}")
    
    # Performance insights
    pages = audit_results.get('pages_analyzed', [])
    if pages:
        slow_pages = [page for page in pages 
                     if page.get('load_metrics', {}).get('load_time_ms', 0) > 3000]
        
        if slow_pages:
            print(f"\n🐌 SLOW LOADING PAGES:")
            for page in slow_pages[:3]:
                url = page.get('url', 'Unknown')
                load_time = page.get('load_metrics', {}).get('load_time_ms', 0)
                print(f"  • {url}: {load_time} ms")
    
    print("\n" + "="*80)

print("✅ Enhanced reporting functions ready")

In [None]:
# Enhanced main audit function
def comprehensive_audit(url, max_pages=5):
    """Main function to perform comprehensive website audit"""
    
    auditor = ComprehensiveWebAuditor(max_pages=max_pages)
    
    try:
        # Start browser session
        print("🚀 Starting comprehensive website audit...")
        auditor.start_session()
        
        # Perform comprehensive audit
        results = auditor.audit_website(url)
        
        if 'error' not in results:
            # Generate comprehensive report
            generate_comprehensive_report(results)
            
            # Create visual charts
            print("\n📈 Creating visual dashboard...")
            chart_path = create_visual_charts(results)
            
            # Generate AI insights if available
            if connected and model:
                print(f"\n🤖 Generating AI insights with {model}...")
                
                # Prepare comprehensive data for AI
                summary_for_ai = {
                    'url': results.get('primary_url'),
                    'pages_analyzed': results.get('summary_metrics', {}).get('total_pages_analyzed', 0),
                    'total_issues': results.get('summary_metrics', {}).get('total_issues_found', 0),
                    'avg_load_time': results.get('summary_metrics', {}).get('avg_load_time', 0),
                    'issues_by_type': results.get('issues_summary', {}).get('by_type', {}),
                    'issues_by_severity': results.get('issues_summary', {}).get('by_severity', {}),
                    'top_issues': [issue.get('issue', '') for issue in 
                                 results.get('issues_summary', {}).get('all_issues', [])[:5]]
                }
                
                insights = generate_ai_insights(summary_for_ai, model)
                results['ai_insights'] = insights
                print(f"\n💡 AI INSIGHTS:")
                print(insights)
            
            # Save comprehensive results
            print(f"\n💾 Saving comprehensive report...")
            excel_path = create_comprehensive_report(results)
            
            print(f"\n✅ Comprehensive audit completed successfully!")
            print(f"📁 Results saved to Google Drive: {results_base_path}")
            
            return results, excel_path, chart_path
        else:
            print(f"❌ Audit failed: {results['error']}")
            return results, None, None
            
    except Exception as e:
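        print(f"❌ Audit error: {e}")
        # Keep the URL in the error result so batch summaries can identify the failed site
        return {'primary_url': url, 'error': str(e)}, None, None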
        
    finally:
        # Always close browser
        auditor.end_session()

# Batch audit function for multiple websites
def batch_comprehensive_audit(urls, max_pages_per_site=3):
    """Perform comprehensive audit on multiple websites"""
    all_results = []
    
    for i, url in enumerate(urls, 1):
        print(f"\n{'='*100}")
        print(f"🔍 COMPREHENSIVE AUDIT {i}/{len(urls)}: {url}")
        print('='*100)
        
        try:
            results, excel_path, chart_path = comprehensive_audit(url, max_pages=max_pages_per_site)
            
            # Add file paths to results
            if excel_path:
                results['excel_report_path'] = excel_path
            if chart_path:
                results['chart_path'] = chart_path
                
            all_results.append(results)
            
            # Brief pause between audits
            if i < len(urls):
                print("\n⏳ Waiting 5 seconds before next audit...")
                time.sleep(5)
            
        except Exception as e:
            print(f"❌ Failed to audit {url}: {e}")
            all_results.append({
                'primary_url': url,
                'error': str(e),
                'audit_date': datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            })
    
    # Create batch summary report
    if all_results:
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        batch_filename = f"batch_audit_summary_{timestamp}.xlsx"
        batch_filepath = os.path.join(results_base_path, batch_filename)
        
        # Create batch summary
        batch_summary = []
        for result in all_results:
            summary = result.get('summary_metrics', {})
            batch_summary.append({
                'URL': result.get('primary_url', 'Unknown'),
                'Status': 'Success' if 'error' not in result else 'Failed',
                'Pages Analyzed': summary.get('total_pages_analyzed', 0),
                'Total Issues': summary.get('total_issues_found', 0),
                'Avg Load Time (ms)': summary.get('avg_load_time', 0),
                'High Priority Issues': result.get('issues_summary', {}).get('by_severity', {}).get('High', 0),
                'Error': result.get('error', 'None')
            })
        
        batch_df = pd.DataFrame(batch_summary)
        batch_df.to_excel(batch_filepath, index=False)
        
        print(f"\n📋 Batch summary saved to: {batch_filepath}")
    
    return all_results

print("🚀 Comprehensive Web Audit Tool Ready!")
print("\n📖 Enhanced Usage:")
print("  • Single comprehensive audit: comprehensive_audit('example.com')")
print("  • Batch comprehensive audit: batch_comprehensive_audit(['site1.com', 'site2.com'])")
print("  • Adjust pages per site: comprehensive_audit('example.com', max_pages=10)")

## 🚀 How to Use the Enhanced Web Audit Tool

### Single Comprehensive Website Audit
```python
# Comprehensive audit with deep analysis
results, excel_path, chart_path = comprehensive_audit('example.com', max_pages=5)
```

### Batch Comprehensive Audit
```python
# Audit multiple websites with comprehensive analysis
websites = [
    'google.com',
    'github.com', 
    'stackoverflow.com'
]
batch_results = batch_comprehensive_audit(websites, max_pages_per_site=3)
```
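
`batch_comprehensive_audit` returns a list with one result dictionary per site, and failed audits carry an `error` key. The following is a minimal sketch of splitting that list into successes and failures. The helper `split_batch_results` and the sample data are hypothetical illustrations and not part of the tool.

```python
def split_batch_results(batch_results):
    """Separate successful audits from failed ones (hypothetical helper)."""
    succeeded, failed = [], []
    for result in batch_results:
        url = result.get('primary_url', 'Unknown')
        if 'error' in result:
            failed.append((url, result['error']))
        else:
            issues = result.get('summary_metrics', {}).get('total_issues_found', 0)
            succeeded.append((url, issues))
    return succeeded, failed


# Sample data shaped like the dictionaries the batch function collects
sample_results = [
    {'primary_url': 'https://example.com',
     'summary_metrics': {'total_issues_found': 4}},
    {'primary_url': 'https://example.org', 'error': 'Timed out'},
]

succeeded, failed = split_batch_results(sample_results)
for url, issues in succeeded:
    print(f"{url}: {issues} issues")
for url, error in failed:
    print(f"{url}: FAILED ({error})")
```

In the notebook, the same helper could be applied directly to `batch_results` from the cell above.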

### Enhanced Features:
- ✅ **Deep Content Scraping**: Full page content extraction with BeautifulSoup
- ✅ **Multi-page Discovery**: Automatically discovers internal pages via sitemap and crawling
- ✅ **Comprehensive SEO Analysis**: Title, meta, headings, alt text, Open Graph, Schema markup
- ✅ **Performance Metrics**: Load times, resource analysis, Core Web Vitals simulation
- ✅ **Content Analysis**: Word count, reading time, link analysis (internal/external)
- ✅ **Technical SEO**: HTTPS, viewport, canonical URLs, robots, favicon, HTML lang
- ✅ **Accessibility Check**: Form labels, ARIA attributes, skip links, alt text coverage
- ✅ **Issue Detection**: Automatic identification of SEO, performance, and technical issues
- ✅ **Visual Dashboard**: Charts showing issues by type/severity, load times, SEO scores
- ✅ **AI Insights**: Powered by GPT-OSS/Llama3 via Ollama with comprehensive data
- ✅ **Multi-sheet Excel Reports**: Executive summary, issues, page details, SEO analysis
- ✅ **Auto-save**: All results saved to Google Drive with timestamps
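
The SEO scores on the dashboard come from five equally weighted checks, each worth 20 points. The sketch below restates that scoring as a standalone function so it can be tried on a single page. The name `seo_score` and the example dictionary are hypothetical; the dashboard cell computes the same sum inline.

```python
def seo_score(seo_data):
    """Score a page from 0 to 100, awarding 20 points per passed check."""
    checks = [
        seo_data.get('title_optimized', False),
        seo_data.get('meta_desc_optimized', False),
        seo_data.get('proper_h1_usage', False),
        seo_data.get('alt_text_score', 0) > 80,  # alt text coverage must exceed 80%
        seo_data.get('has_schema', False),
    ]
    return 20 * sum(1 for passed in checks if passed)


# Hypothetical page: optimized title, proper H1 usage, 92% alt text coverage
example_page = {
    'title_optimized': True,
    'meta_desc_optimized': False,
    'proper_h1_usage': True,
    'alt_text_score': 92,
    'has_schema': False,
}
print(seo_score(example_page))  # 60
```

Because every check carries the same weight, a page can only score 0, 20, 40, 60, 80, or 100.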

### Report Outputs:
- **Excel Report**: Multi-sheet comprehensive analysis
- **Visual Dashboard**: Charts and graphs (PNG format)
- **Console Report**: Detailed text summary
- **AI Insights**: Strategic recommendations

### Results Location:
All audit results are automatically saved to: `Google Drive/Web_Audit_Results/`
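
To locate the most recent reports in that folder, a sketch along these lines may help. The helper `latest_reports` is hypothetical, and the mount path shown is an assumption based on the usual Colab Drive mount; use the notebook's `results_base_path` if it points elsewhere.

```python
from pathlib import Path


def latest_reports(folder, pattern='comprehensive_audit_*.xlsx', limit=5):
    """Return up to `limit` files in `folder` matching `pattern`, newest first."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    # Sort by modification time, because file names start with the domain
    matches = sorted(folder.glob(pattern),
                     key=lambda path: path.stat().st_mtime,
                     reverse=True)
    return matches[:limit]


# Assumed Colab mount point for the results folder
results_folder = '/content/drive/MyDrive/Web_Audit_Results'
for report in latest_reports(results_folder):
    print(report.name)
```

Passing `pattern='audit_dashboard_*.png'` or `pattern='batch_audit_summary_*.xlsx'` lists the dashboards or batch summaries instead.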

### Customization:
- `max_pages`: Number of pages to analyze per website (default: 5)
- `timeout`: Page load timeout in seconds (default: 30)
- Reports include color-coded issue severity (Red=High, Orange=Medium, Green=Low)
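
The color coding can be summarized as a small lookup. This sketch uses the background colors defined for the Excel formats in the report cell; the names `SEVERITY_COLORS`, `DEFAULT_COLOR`, and `severity_color` are hypothetical. Note that, as in the report code, any severity other than High or Medium falls back to green.

```python
# Background colors used by the Excel report's row formats
SEVERITY_COLORS = {
    'High': '#FFE6E6',    # light red
    'Medium': '#FFF2E6',  # light orange
}
DEFAULT_COLOR = '#E6F7E6'  # light green, used for Low and anything unrecognized


def severity_color(severity):
    """Return the row background color for an issue severity."""
    return SEVERITY_COLORS.get(severity, DEFAULT_COLOR)


for level in ('High', 'Medium', 'Low', 'Unknown'):
    print(f"{level}: {severity_color(level)}")
```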

In [None]:
# Quick test - uncomment to try comprehensive audit
# comprehensive_audit('example.com', max_pages=3)

# For testing with a real website
# comprehensive_audit('https://github.com', max_pages=2)