# Data.gov.hk Web Crawling and API Access Starter Notebook

## 🎯 Project Objective
This notebook helps you explore data.gov.hk to find datasets that:
- Support **regression analysis** or **simulation modeling**
- Connect to a specific **government policy or decision**
- Can motivate **data governance improvement** recommendations

## 📋 Team Requirements
Your team must:
1. **Be patient** - systematically explore multiple datasets
2. **Select one specific dataset** connected to government policy
3. **Choose modeling approach:**
   - If data supports it: regression/simulation analysis
   - If descriptive only: data governance improvement recommendations

## 🌟 Why This Topic?
- **More versatile and open-ended** than other topics
- **Maximum flexibility** for creative exploration
- **Exciting with some uncertainty** - perfect for adventurous teams!
- **No prior foundation** - you're building something completely new

## Please note that the code and instructions in this notebook were generated by AI and may contain errors. Use an AI assistant to revise and adapt them, and ask your human teachers for help if needed.

## 🛠️ Setup: Install Required Libraries

In [None]:
# Install required packages
!pip install requests beautifulsoup4 pandas matplotlib seaborn plotly lxml openpyxl

## 📚 Import Libraries

In [None]:
import requests
import pandas as pd
import json
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from datetime import datetime, timedelta
import time
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("✅ All libraries imported successfully!")
print(f"📅 Notebook initialized on: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

# 🔍 Phase 1: Discover Available Datasets

## Step 1: Explore Data.gov.hk Categories

In [None]:
# Function to scrape data.gov.hk main categories
def explore_data_gov_categories():
    """
    Scrape the main categories from data.gov.hk
    """
    base_url = "https://data.gov.hk/en/"
    
    try:
        response = requests.get(base_url, timeout=10)
        response.raise_for_status()
        
        soup = BeautifulSoup(response.content, 'html.parser')
        print("🌐 Successfully connected to data.gov.hk!")
        print(f"📄 Page title: {soup.title.string if soup.title else 'No title found'}")
        
        # Look for category links or sections
        categories = []
        
        # Collect links whose href mentions "dataset". Filter first, then cap the
        # result at 20, so matches beyond the first 20 anchors are not missed.
        for link in soup.find_all('a', href=True):
            if 'dataset' in link['href'].lower():
                categories.append({
                    'text': link.get_text(strip=True),
                    'href': link['href']
                })
            if len(categories) >= 20:
                break
        
        return categories
        
    except requests.RequestException as e:
        print(f"❌ Error connecting to data.gov.hk: {e}")
        return []

# Explore categories
categories = explore_data_gov_categories()
print(f"\n📊 Found {len(categories)} potential category links:")
for i, cat in enumerate(categories[:10], 1):
    print(f"{i}. {cat['text'][:50]}... -> {cat['href'][:50]}...")
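
Links scraped from a page are often relative, so they cannot be requested directly. The sketch below resolves them against the page URL with the standard library's `urljoin`. The helper name `absolutize_links` and the sample hrefs are illustrative assumptions, not actual data.gov.hk paths.

```python
from urllib.parse import urljoin

def absolutize_links(links, base_url="https://data.gov.hk/en/"):
    """Return copies of the scraped link dicts with absolute 'href' values."""
    resolved = []
    for link in links:
        resolved.append({
            'text': link['text'],
            # urljoin leaves absolute URLs unchanged and resolves relative ones
            'href': urljoin(base_url, link['href']),
        })
    return resolved

# Hypothetical hrefs, for illustration only
sample_links = [
    {'text': 'Example A', 'href': '/example/path'},
    {'text': 'Example B', 'href': 'https://example.org/other'},
]
for link in absolutize_links(sample_links):
    print(link['text'], '->', link['href'])
```

The output of `explore_data_gov_categories()` has the same `text`/`href` shape, so it could be passed to this helper as-is.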

## Step 2: Access Data.gov.hk API

In [None]:
# Data.gov.hk API exploration
def explore_data_gov_api():
    """
    Explore the data.gov.hk API to find available datasets
    """
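    # Candidate endpoints to probe. These URLs are guesses: some may not exist
    # or may return HTML, so check the API documentation on data.gov.hk.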
    api_endpoints = [
        "https://api.data.gov.hk/v1/datasets",
        "https://data.gov.hk/api/",
        "https://data.gov.hk/en/help/api-spec"
    ]
    
    results = {}
    
    for endpoint in api_endpoints:
        try:
            print(f"🔍 Trying endpoint: {endpoint}")
            response = requests.get(endpoint, timeout=10)
            
            if response.status_code == 200:
                print(f"✅ Success! Status: {response.status_code}")
                
                # Try to parse as JSON first
                try:
                    data = response.json()
                    results[endpoint] = {
                        'type': 'json',
                        'data': data,
                        'keys': list(data.keys()) if isinstance(data, dict) else 'list',
                        'length': len(data) if isinstance(data, (list, dict)) else 'unknown'
                    }
                    print(f"📄 JSON response with {len(data) if isinstance(data, (list, dict)) else 'unknown'} items")
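                except ValueError:
                    # response.json() raises a ValueError subclass on invalid JSON
                    # (the exact class depends on the requests version).
                    # If not JSON, treat as HTML/text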
                    soup = BeautifulSoup(response.content, 'html.parser')
                    results[endpoint] = {
                        'type': 'html',
                        'title': soup.title.string if soup.title else 'No title',
                        'content_length': len(response.text)
                    }
                    print(f"📄 HTML response ({len(response.text)} chars)")
            else:
                print(f"❌ Failed: Status {response.status_code}")
                
        except requests.RequestException as e:
            print(f"❌ Error: {e}")
        
        print("-" * 50)
        time.sleep(1)  # Be respectful to the server
    
    return results

# Explore API endpoints
api_results = explore_data_gov_api()
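
The JSON branch above records only the top-level keys, which says little when a payload is nested. The sketch below is one way to summarize the structure recursively. The function name `summarize_json` and the sample payload are made up for illustration; real responses will look different.

```python
def summarize_json(data, max_depth=2, _depth=0):
    """Describe the structure of a parsed JSON value without printing its contents."""
    if isinstance(data, dict):
        if _depth >= max_depth:
            return f"dict({len(data)} keys)"
        return {key: summarize_json(value, max_depth, _depth + 1)
                for key, value in data.items()}
    if isinstance(data, list):
        if not data or _depth >= max_depth:
            return f"list({len(data)} items)"
        # Assume the list is homogeneous and describe only its first element
        return [summarize_json(data[0], max_depth, _depth + 1)]
    return type(data).__name__

# Made-up payload for illustration only
sample = {"count": 2, "results": [{"id": "a", "tags": ["x"]}, {"id": "b", "tags": []}]}
print(summarize_json(sample))
```

Raising `max_depth` shows more of the nesting, at the cost of a longer summary.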

## Step 3: Manual Dataset Categories

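Based on the typical structure of data.gov.hk, the categories below are suggested starting points. The example dataset names are illustrative, so confirm on the portal that each one actually exists.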

In [None]:
# Define key dataset categories for systematic exploration
dataset_categories = {
    "🚌 Transport": {
        "description": "Traffic patterns, public transport, road safety",
        "policy_connection": "Transport policy, infrastructure planning",
        "modeling_potential": "High - time series, regression analysis",
        "examples": ["Bus route usage", "Traffic accident data", "MTR ridership", "Parking space utilization"]
    },
    "🌱 Environment": {
        "description": "Air quality, waste management, energy",
        "policy_connection": "Environmental protection, sustainability",
        "modeling_potential": "High - correlation analysis, trend prediction",
        "examples": ["Air pollution indices", "Waste collection data", "Energy consumption", "Weather patterns"]
    },
    "🏥 Health": {
        "description": "Disease surveillance, healthcare utilization",
        "policy_connection": "Public health policy, resource allocation",
        "modeling_potential": "Medium-High - epidemiological modeling",
        "examples": ["Infectious disease cases", "Hospital bed occupancy", "Vaccination rates", "Health expenditure"]
    },
    "🎓 Education": {
        "description": "School performance, enrollment, resources",
        "policy_connection": "Education policy, funding allocation",
        "modeling_potential": "Medium - performance analysis",
        "examples": ["School enrollment", "Academic performance", "Teacher-student ratios", "Education spending"]
    },
    "🏠 Housing": {
        "description": "Property prices, public housing, development",
        "policy_connection": "Housing policy, urban planning",
        "modeling_potential": "High - price prediction, demand modeling",
        "examples": ["Property transactions", "Public housing waiting times", "Construction permits", "Rental prices"]
    },
    "💼 Economy": {
        "description": "Business licenses, employment, tourism",
        "policy_connection": "Economic development, business regulation",
        "modeling_potential": "High - economic indicator modeling",
        "examples": ["Business registrations", "Employment rates", "Tourism arrivals", "GDP indicators"]
    }
}

# Display categories with modeling potential
print("🎯 DATASET CATEGORIES FOR SYSTEMATIC EXPLORATION")
print("=" * 60)

for category, info in dataset_categories.items():
    print(f"\n{category}")
    print(f"📝 Description: {info['description']}")
    print(f"🏛️ Policy Connection: {info['policy_connection']}")
    print(f"📊 Modeling Potential: {info['modeling_potential']}")
    print(f"💡 Examples: {', '.join(info['examples'][:2])}...")
    print("-" * 40)

# 🔬 Phase 2: Dataset Selection Framework

## Step 4: Dataset Evaluation Criteria

In [None]:
# Dataset evaluation framework
def evaluate_dataset_potential(dataset_info):
    """
    Evaluate a dataset's potential for regression/simulation analysis
    
    Returns: dictionary with evaluation scores and recommendations
    """
    evaluation = {
        'dataset_name': dataset_info.get('name', 'Unknown'),
        'scores': {},
        'recommendations': [],
        'modeling_approach': None
    }
    
    # Evaluation criteria (1-5 scale)
    criteria = {
        'data_volume': 'How much data is available? (1=very little, 5=extensive)',
        'time_series': 'Does it have temporal dimension? (1=no, 5=rich time series)',
        'numerical_variables': 'Are there quantitative variables? (1=mostly categorical, 5=many numerical)',
        'policy_relevance': 'How connected to government decisions? (1=weak, 5=direct impact)',
        'data_quality': 'How complete and accurate? (1=poor, 5=excellent)',
        'update_frequency': 'How often updated? (1=rarely, 5=real-time)'
    }
    
    print(f"📊 EVALUATING DATASET: {evaluation['dataset_name']}")
    print("=" * 50)
    
    total_score = 0
    for criterion, description in criteria.items():
        print(f"\n{criterion.replace('_', ' ').title()}: {description}")
        score = input("Enter score (1-5): ")
        try:
            score = int(score)
            if 1 <= score <= 5:
                evaluation['scores'][criterion] = score
                total_score += score
            else:
                print("⚠️ Invalid score, setting to 3 (neutral)")
                evaluation['scores'][criterion] = 3
                total_score += 3
        except ValueError:
            print("⚠️ Invalid input, setting to 3 (neutral)")
            evaluation['scores'][criterion] = 3
            total_score += 3
    
    evaluation['total_score'] = total_score
    evaluation['average_score'] = total_score / len(criteria)
    
    # Generate recommendations based on scores
    if evaluation['average_score'] >= 4.0:
        evaluation['modeling_approach'] = "High Potential for Regression/Simulation"
        evaluation['recommendations'].append("✅ Excellent candidate for quantitative modeling")
        evaluation['recommendations'].append("🎯 Focus on regression analysis or simulation")
    elif evaluation['average_score'] >= 3.0:
        evaluation['modeling_approach'] = "Moderate Potential - Mixed Approach"
        evaluation['recommendations'].append("⚡ Good for basic modeling with governance focus")
        evaluation['recommendations'].append("🔄 Combine simple analysis with governance recommendations")
    else:
        evaluation['modeling_approach'] = "Focus on Data Governance Improvements"
        evaluation['recommendations'].append("🔧 Emphasize data governance improvement analysis")
        evaluation['recommendations'].append("📋 Recommend better data collection/curation practices")
    
    return evaluation

# Example usage (interactive)
print("🎯 DATASET EVALUATION FRAMEWORK")
print("Use this framework when you find a potential dataset!")
print("\n📝 Instructions:")
print("1. Find a dataset on data.gov.hk")
print("2. Download sample data")
print("3. Run the evaluation function below")
print("4. Make decision based on recommendations")
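
Because `evaluate_dataset_potential` reads scores through `input()`, it is awkward to rerun or to compare several datasets at once. The sketch below applies the same averaging and thresholds to a dictionary of scores. The name `score_dataset` and the example scores are assumptions for illustration, not an assessment of any real dataset.

```python
def score_dataset(scores):
    """Average 1-5 criterion scores and map the result to a suggested approach."""
    if not scores:
        raise ValueError("at least one criterion score is required")
    for criterion, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{criterion}: score must be between 1 and 5, got {score}")
    average = sum(scores.values()) / len(scores)
    # Same thresholds as the interactive function above
    if average >= 4.0:
        approach = "High Potential for Regression/Simulation"
    elif average >= 3.0:
        approach = "Moderate Potential - Mixed Approach"
    else:
        approach = "Focus on Data Governance Improvements"
    return {'average_score': average, 'modeling_approach': approach}

# Illustrative scores only
example_scores = {
    'data_volume': 4, 'time_series': 5, 'numerical_variables': 4,
    'policy_relevance': 3, 'data_quality': 3, 'update_frequency': 2,
}
print(score_dataset(example_scores))
```

Keeping each team member's scores in such a dictionary also makes it easy to record and discuss disagreements.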

## Step 5: Quick Dataset Download Function

In [None]:
# Function to quickly download and preview datasets
import io

def quick_dataset_preview(url, dataset_name="Unknown Dataset"):
    """
    Download and preview a dataset from data.gov.hk
    Supports CSV, Excel, and JSON formats
    """
    try:
        print(f"📥 Downloading: {dataset_name}")
        print(f"🔗 URL: {url}")
        
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        
        # Parse the bytes already downloaded instead of fetching the URL a second time
        url_lower = url.lower()
        content_type = response.headers.get('content-type', '').lower()
        buffer = io.BytesIO(response.content)
        
        # Determine file type from URL or content type
        if url_lower.endswith('.csv') or 'csv' in content_type:
            df = pd.read_csv(buffer)
            file_type = "CSV"
        elif (url_lower.endswith(('.xlsx', '.xls'))
              or 'excel' in content_type or 'spreadsheet' in content_type):
            # .xlsx files are served with a "spreadsheetml" content type
            df = pd.read_excel(buffer)
            file_type = "Excel"
        elif url_lower.endswith('.json') or 'json' in content_type:
            data = response.json()
            if isinstance(data, list):
                df = pd.DataFrame(data)
            else:
                df = pd.json_normalize(data)
            file_type = "JSON"
        else:
            # Try CSV as default
            df = pd.read_csv(buffer)
            file_type = "CSV (assumed)"
        
        print(f"✅ Successfully loaded {file_type} file!")
        
        # Generate preview report
        report = {
            'dataset_name': dataset_name,
            'file_type': file_type,
            'shape': df.shape,
            'columns': list(df.columns),
            'numeric_columns': list(df.select_dtypes(include=['number']).columns),
            'datetime_columns': list(df.select_dtypes(include=['datetime']).columns),
            'missing_data': df.isnull().sum().sum(),
            'sample_data': df.head()
        }
        
        # Display report
        print(f"\n📊 DATASET PREVIEW REPORT")
        print("=" * 40)
        print(f"📁 Dataset: {report['dataset_name']}")
        print(f"📄 Type: {report['file_type']}")
        print(f"📏 Shape: {report['shape'][0]:,} rows × {report['shape'][1]} columns")
        print(f"🔢 Numeric columns: {len(report['numeric_columns'])}/{len(report['columns'])}")
        print(f"📅 DateTime columns: {len(report['datetime_columns'])}")
        print(f"❓ Missing values: {report['missing_data']:,}")
        
        print(f"\n📋 Columns: {', '.join(report['columns'][:5])}{'...' if len(report['columns']) > 5 else ''}")
        
        if report['numeric_columns']:
            print(f"🔢 Numeric: {', '.join(report['numeric_columns'][:5])}{'...' if len(report['numeric_columns']) > 5 else ''}")
        
        print(f"\n📋 Sample Data:")
        display(report['sample_data'])
        
        # Modeling potential assessment
        modeling_score = 0
        if report['shape'][0] > 100: modeling_score += 1
        if report['shape'][0] > 1000: modeling_score += 1
        if len(report['numeric_columns']) >= 2: modeling_score += 1
        if len(report['datetime_columns']) >= 1: modeling_score += 1
        # df.size is rows × columns; the check guards against an empty table
        if df.size > 0 and report['missing_data'] / df.size < 0.1: modeling_score += 1
        
        print(f"\n⭐ Modeling Potential Score: {modeling_score}/5")
        
        if modeling_score >= 4:
            print("🎯 HIGH POTENTIAL for regression/simulation analysis!")
        elif modeling_score >= 3:
            print("⚡ MODERATE POTENTIAL - consider mixed approach")
        else:
            print("🔧 FOCUS ON DATA GOVERNANCE improvements")
        
        return df, report
        
    except Exception as e:
        print(f"❌ Error loading dataset: {e}")
        return None, None

# Example usage instructions
print("🔍 QUICK DATASET PREVIEW FUNCTION")
print("="*40)
print("Usage: df, report = quick_dataset_preview(url, 'Dataset Name')")
print("\n💡 Tips:")
print("• Look for datasets with >1000 rows for better modeling")
print("• Prefer datasets with multiple numeric columns")
print("• Time series data (dates) enable trend analysis")
print("• Low missing data (<10%) is ideal")
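
`pd.read_csv` leaves date strings as plain text unless told otherwise, so the `datetime_columns` check in the preview will usually find nothing for CSV files. The sketch below uses an in-memory CSV to show the effect of `parse_dates`. The column names and values are made up for illustration, and the helper `datetime_columns` is not part of the notebook above.

```python
import io
import pandas as pd

# Made-up sample; column names and values are for illustration only
csv_text = """date,district,count
2024-01-01,A,10
2024-01-02,B,12
2024-01-03,A,
"""

raw = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])

def datetime_columns(df):
    """List columns that pandas stores with a datetime dtype."""
    return list(df.select_dtypes(include=['datetime']).columns)

print("Without parse_dates:", datetime_columns(raw))
print("With parse_dates:   ", datetime_columns(parsed))
print("Missing fraction:   ", parsed.isnull().sum().sum() / parsed.size)
```

Once you know which columns hold dates, reloading with `parse_dates` (or converting with `pd.to_datetime`) makes the time-series checks meaningful.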

# 📊 Phase 3: Example Dataset Exploration

## Step 6: Sample Data Analysis Template

In [None]:
# Template for analyzing any dataset you select
def comprehensive_dataset_analysis(df, dataset_name="Your Dataset"):
    """
    Comprehensive analysis template for any dataset
    """
    print(f"🔬 COMPREHENSIVE ANALYSIS: {dataset_name}")
    print("=" * 60)
    
    # 1. Basic Information
    print("\n1️⃣ BASIC DATASET INFORMATION")
    print(f"📏 Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
    print(f"💾 Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    
    # 2. Data Types Analysis
    print("\n2️⃣ DATA TYPES BREAKDOWN")
    dtype_counts = df.dtypes.value_counts()
    for dtype, count in dtype_counts.items():
        print(f"📊 {dtype}: {count} columns")
    
    # 3. Missing Data Analysis
    print("\n3️⃣ MISSING DATA ANALYSIS")
    missing = df.isnull().sum()
    missing_pct = (missing / len(df)) * 100
    missing_summary = pd.DataFrame({
        'Missing Count': missing,
        'Missing %': missing_pct
    })
    missing_summary = missing_summary[missing_summary['Missing Count'] > 0].sort_values('Missing %', ascending=False)
    
    if len(missing_summary) > 0:
        print(f"⚠️ {len(missing_summary)} columns have missing data:")
        display(missing_summary.head(10))
    else:
        print("✅ No missing data found!")
    
    # 4. Numeric Variables Analysis
    numeric_cols = df.select_dtypes(include=['number']).columns
    if len(numeric_cols) > 0:
        print(f"\n4️⃣ NUMERIC VARIABLES ANALYSIS ({len(numeric_cols)} columns)")
        display(df[numeric_cols].describe())
        
        # Correlation analysis if multiple numeric columns
        if len(numeric_cols) > 1:
            print("\n🔗 CORRELATION MATRIX")
            plt.figure(figsize=(10, 8))
            correlation_matrix = df[numeric_cols].corr()
            sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
            plt.title(f'Correlation Matrix - {dataset_name}')
            plt.tight_layout()
            plt.show()
    
    # 5. Categorical Variables Analysis
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns
    if len(categorical_cols) > 0:
        print(f"\n5️⃣ CATEGORICAL VARIABLES ANALYSIS ({len(categorical_cols)} columns)")
        for col in categorical_cols[:5]:  # Show first 5 categorical columns
            unique_count = df[col].nunique()
            print(f"📊 {col}: {unique_count} unique values")
            if unique_count <= 10:  # Show value counts for small categories
                print(df[col].value_counts().head())
            print("-" * 30)
    
    # 6. Time Series Check
    print("\n6️⃣ TIME SERIES ANALYSIS")
    date_cols = []
    for col in df.columns:
        col_name = str(col).lower()  # column labels are not always strings
        if pd.api.types.is_datetime64_any_dtype(df[col]) or 'date' in col_name or 'time' in col_name:
            date_cols.append(col)
    
    if date_cols:
        print(f"📅 Found {len(date_cols)} potential date columns: {date_cols}")
        for col in date_cols[:2]:  # Analyze first 2 date columns
            try:
                # Note: this converts the column in place; unparseable values become NaT
                df[col] = pd.to_datetime(df[col], errors='coerce')
                print(f"📊 {col}: Range from {df[col].min()} to {df[col].max()}")
            except (ValueError, TypeError):
                print(f"⚠️ Could not parse {col} as datetime")
    else:
        print("❌ No clear date/time columns found")
    
    # 7. Modeling Recommendations
    print("\n7️⃣ MODELING RECOMMENDATIONS")
    recommendations = []
    
    if len(numeric_cols) >= 3:
        recommendations.append("✅ REGRESSION ANALYSIS: Multiple numeric variables available")
    
    if date_cols and len(numeric_cols) >= 1:
        recommendations.append("✅ TIME SERIES ANALYSIS: Date columns with numeric targets")
    
    if len(categorical_cols) >= 1 and len(numeric_cols) >= 1:
        recommendations.append("✅ CLASSIFICATION/SEGMENTATION: Mix of categorical and numeric variables")
    
    if df.shape[0] >= 1000:
        recommendations.append("✅ SIMULATION MODELING: Sufficient sample size for robust analysis")
    
    if len(recommendations) == 0:
        recommendations.append("🔧 DATA GOVERNANCE FOCUS: Limited quantitative modeling potential")
        recommendations.append("📋 Recommend improvements in data collection and curation")
    
    for rec in recommendations:
        print(rec)
    
    return {
        'numeric_columns': list(numeric_cols),
        'categorical_columns': list(categorical_cols),
        'date_columns': date_cols,
        'missing_data_summary': missing_summary,
        'recommendations': recommendations
    }

# Instructions for use
print("🎯 COMPREHENSIVE ANALYSIS TEMPLATE")
print("Use this after loading your chosen dataset:")
print("analysis_results = comprehensive_dataset_analysis(df, 'Your Dataset Name')")
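
When the template recommends regression analysis, a one-variable least-squares fit is a reasonable first check. The sketch below implements ordinary least squares in plain Python on synthetic points. The function name `fit_simple_ols` and the data are illustrative assumptions, not results from any data.gov.hk dataset.

```python
def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, with at least 2 points")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - residual sum of squares / total sum of squares
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot if ss_tot > 0 else float('nan')
    return slope, intercept, r_squared

# Synthetic points lying exactly on y = 2x + 1, for illustration only
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 9]
slope, intercept, r2 = fit_simple_ols(x, y)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r2:.2f}")
```

With a real DataFrame, two numeric columns can be passed as `df['x_col'].tolist()` and `df['y_col'].tolist()` after dropping rows with missing values; the column names here are placeholders.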

# 🏛️ Phase 4: Policy Connection Framework

## Step 7: Linking Data to Government Decisions

In [None]:
# Framework for connecting datasets to government policy
def policy_connection_analysis(dataset_name, data_summary):
    """
    Help identify policy connections for your chosen dataset
    """
    print(f"🏛️ POLICY CONNECTION ANALYSIS: {dataset_name}")
    print("=" * 60)
    
    # Policy areas framework
    policy_areas = {
        "🚌 Transport Policy": {
            "keywords": ["transport", "traffic", "bus", "mtr", "road", "parking", "vehicle"],
            "government_depts": ["Transport Department", "Highways Department"],
            "decisions": ["Route planning", "Traffic management", "Infrastructure investment"],
            "metrics": ["ridership", "accidents", "congestion", "emissions"]
        },
        "🌱 Environmental Policy": {
            "keywords": ["air", "pollution", "waste", "energy", "emission", "environment"],
            "government_depts": ["Environmental Protection Department", "Development Bureau"],
            "decisions": ["Pollution control", "Waste management", "Conservation measures"],
            "metrics": ["pollution index", "waste volume", "energy consumption"]
        },
        "🏥 Health Policy": {
            "keywords": ["health", "hospital", "disease", "medical", "clinic", "patient"],
            "government_depts": ["Department of Health", "Hospital Authority"],
            "decisions": ["Resource allocation", "Service planning", "Disease prevention"],
            "metrics": ["bed occupancy", "waiting times", "infection rates"]
        },
        "🏠 Housing Policy": {
            "keywords": ["housing", "property", "rent", "apartment", "building", "estate"],
            "government_depts": ["Housing Department", "Development Bureau"],
            "decisions": ["Public housing allocation", "Land use planning", "Rent control"],
            "metrics": ["housing prices", "waiting lists", "occupancy rates"]
        },
        "💼 Economic Policy": {
            "keywords": ["business", "employment", "income", "gdp", "economy", "trade"],
            "government_depts": ["Commerce and Economic Development Bureau", "Labour Department"],
            "decisions": ["Business regulation", "Employment support", "Economic development"],
            "metrics": ["unemployment rate", "business registrations", "economic indicators"]
        },
        "🎓 Education Policy": {
            "keywords": ["school", "student", "education", "teacher", "university", "learning"],
            "government_depts": ["Education Bureau"],
            "decisions": ["School funding", "Curriculum planning", "Resource allocation"],
            "metrics": ["enrollment", "performance", "teacher ratios"]
        }
    }
    
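    # Analyze dataset for policy connections.
    # Accept either a 'columns' list or the per-type lists returned by
    # comprehensive_dataset_analysis(), which has no 'columns' key.
    column_names = list(data_summary.get('columns', []))
    for key in ('numeric_columns', 'categorical_columns', 'date_columns'):
        column_names.extend(data_summary.get(key, []))
    dataset_text = f"{dataset_name} {' '.join(str(c) for c in column_names)}".lower()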
    
    policy_matches = {}
    for policy_area, details in policy_areas.items():
        matches = 0
        matched_keywords = []
        
        for keyword in details["keywords"]:
            if keyword in dataset_text:
                matches += 1
                matched_keywords.append(keyword)
        
        if matches > 0:
            policy_matches[policy_area] = {
                'match_score': matches,
                'matched_keywords': matched_keywords,
                'details': details
            }
    
    # Display results
    if policy_matches:
        print("✅ POLICY CONNECTIONS IDENTIFIED:")
        sorted_matches = sorted(policy_matches.items(), key=lambda x: x[1]['match_score'], reverse=True)
        
        for policy_area, match_info in sorted_matches[:3]:  # Show top 3 matches
            print(f"\n{policy_area} (Score: {match_info['match_score']})")
            print(f"🔑 Keywords found: {', '.join(match_info['matched_keywords'])}")
            print(f"🏛️ Relevant departments: {', '.join(match_info['details']['government_depts'])}")
            print(f"📋 Potential decisions: {', '.join(match_info['details']['decisions'])}")
            print("-" * 40)
    
    else:
        print("⚠️ No clear policy connections identified automatically.")
        print("💡 Consider these general approaches:")
        print("• Look for regulatory compliance aspects")
        print("• Consider resource allocation decisions")
        print("• Examine service delivery efficiency")
        print("• Evaluate public interest implications")
    
    # Policy questions generator
    print("\n❓ SUGGESTED POLICY RESEARCH QUESTIONS:")
    if policy_matches:
        top_policy = sorted_matches[0]
        # Drop the leading emoji from a key such as "🚌 Transport Policy"
        policy_name = top_policy[0].split(' ', 1)[-1].strip()
        
        questions = [
            f"How can {policy_name.lower()} be optimized using data-driven insights?",
            f"What patterns in the data suggest improvements to current {policy_name.lower()}?",
            f"How do data governance practices affect {policy_name.lower()} effectiveness?",
            f"What additional data should government collect to improve {policy_name.lower()}?"
        ]
    else:
        questions = [
            "How can government data collection be improved for this domain?",
            "What data governance challenges are evident in this dataset?",
            "How could better data curation support policy decisions?",
            "What data quality improvements would enhance government decision-making?"
        ]
    
    for i, question in enumerate(questions, 1):
        print(f"{i}. {question}")
    
    return policy_matches

# Example usage
print("🎯 POLICY CONNECTION FRAMEWORK")
print("Use this to connect your dataset to government decisions:")
print("policy_analysis = policy_connection_analysis('Dataset Name', analysis_results)")
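
The keyword check in `policy_connection_analysis` uses substring matching, so "bus" also matches "business" and "rent" matches "current". The sketch below shows whole-word matching with a regular expression. The helper name `match_keywords` is an assumption and not part of the function above.

```python
import re

def match_keywords(text, keywords):
    """Return the keywords that occur in text as whole words (case-insensitive)."""
    # Underscores count as word characters in regex, so split names like "bus_route"
    normalized = text.replace('_', ' ')
    matched = []
    for keyword in keywords:
        # \b anchors the match at word boundaries; re.escape guards special characters
        if re.search(rf"\b{re.escape(keyword)}\b", normalized, flags=re.IGNORECASE):
            matched.append(keyword)
    return matched

print(match_keywords("Business registrations by district", ["bus", "business"]))
print(match_keywords("bus_route usage", ["bus", "business"]))
```

The trade-off is that plurals such as "buses" no longer match "bus", so the keyword lists may need both forms.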

# 🔧 Phase 5: Data Governance Analysis Framework

## Step 8: Governance Assessment Template

In [None]:
# Data governance assessment framework
def data_governance_assessment(dataset_info, dataset_url):
    """
    Assess data governance quality and generate improvement recommendations
    """
    print("🔧 DATA GOVERNANCE ASSESSMENT")
    print("=" * 50)
    
    # Governance criteria checklist
    governance_criteria = {
        "📊 Data Quality": {
            "completeness": "How complete is the dataset? (missing values, coverage)",
            "accuracy": "How accurate is the data? (errors, inconsistencies)",
            "timeliness": "How current is the data? (update frequency, lag time)",
            "consistency": "How consistent is the format? (standardization, conventions)"
        },
        "📋 Documentation": {
            "metadata": "Are variables clearly defined? (data dictionary available)",
            "methodology": "Is data collection method documented? (process transparency)",
            "context": "Is policy context provided? (purpose, use cases)",
            "limitations": "Are data limitations acknowledged? (known issues, scope)"
        },
        "🌐 Accessibility": {
            "format": "Is data in machine-readable format? (CSV, JSON vs PDF)",
            "availability": "Is data easily discoverable? (search, navigation)",
            "download": "Is download process straightforward? (no barriers)",
            "api": "Is API access available? (programmatic access)"
        },
        "🔄 Maintenance": {
            "updates": "How frequently is data updated? (regular schedule)",
            "versioning": "Is historical data preserved? (version control)",
            "contact": "Is there clear contact for questions? (support available)",
            "feedback": "Is there mechanism for user feedback? (improvement process)"
        }
    }
    
    assessment_results = {}
    total_score = 0
    max_score = 0
    
    # Interactive assessment
    print(f"Assessing dataset: {dataset_info.get('name', 'Unknown')}")
    print(f"URL: {dataset_url}")
    print("\nRate each aspect (1=Poor, 2=Fair, 3=Good, 4=Excellent, 0=Unknown):")
    print("=" * 60)
    
    for category, criteria in governance_criteria.items():
        print(f"\n{category}")
        category_scores = {}
        
        for criterion, description in criteria.items():
            print(f"  {criterion}: {description}")
            while True:
                try:
                    score = int(input(f"    Rate {criterion} (0-4): "))
                    if 0 <= score <= 4:
                        category_scores[criterion] = score
                        if score > 0:  # Only count towards total if not "Unknown"
                            total_score += score
                            max_score += 4
                        break
                    else:
                        print("    Please enter a number between 0-4")
                except ValueError:
                    print("    Please enter a valid number")
        
        assessment_results[category] = category_scores
    
    # Calculate overall governance score
    overall_score = (total_score / max_score * 100) if max_score > 0 else 0
    
    print(f"\n📊 GOVERNANCE ASSESSMENT RESULTS")
    print("=" * 40)
    print(f"Overall Governance Score: {overall_score:.1f}%")
    
    # Category breakdown
    for category, scores in assessment_results.items():
        known_scores = [s for s in scores.values() if s > 0]  # 0 means "Unknown"
        category_avg = sum(known_scores) / len(known_scores) if known_scores else 0
        print(f"{category}: {category_avg:.1f}/4.0")
    
    # Generate recommendations
    print(f"\n💡 IMPROVEMENT RECOMMENDATIONS")
    print("=" * 40)
    
    recommendations = []
    
    # Specific recommendations based on low scores
    for category, scores in assessment_results.items():
        for criterion, score in scores.items():
            if score == 1:  # Poor scores get specific recommendations
                if criterion == "completeness":
                    recommendations.append("🔧 Improve data completeness: Implement validation checks and mandatory field requirements")
                elif criterion == "metadata":
                    recommendations.append("📋 Create comprehensive data dictionary with variable definitions and units")
                elif criterion == "format":
                    recommendations.append("💾 Provide data in machine-readable formats (CSV, JSON) instead of PDF")
                elif criterion == "updates":
                    recommendations.append("🔄 Establish regular update schedule and communicate timing to users")
                elif criterion == "api":
                    recommendations.append("🌐 Develop API access for programmatic data retrieval")
    
    # General recommendations based on overall score
    if overall_score >= 80:
        recommendations.append("✅ Excellent governance practices - consider as best practice example")
    elif overall_score >= 60:
        recommendations.append("⚡ Good foundation - focus on specific improvement areas identified above")
    elif overall_score >= 40:
        recommendations.append("🔧 Significant improvements needed - prioritize data quality and documentation")
    else:
        recommendations.append("🚨 Major governance gaps - comprehensive reform needed")
    
    # Display recommendations
    if recommendations:
        for i, rec in enumerate(recommendations, 1):
            print(f"{i}. {rec}")
    
    return {
        'overall_score': overall_score,
        'category_results': assessment_results,
        'recommendations': recommendations
    }

print("🎯 DATA GOVERNANCE ASSESSMENT")
print("Use this to evaluate and improve data governance:")
print("governance_results = data_governance_assessment(dataset_info, dataset_url)")
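
Once several candidate datasets have been assessed, the returned dictionaries can be compared side by side. The following is a minimal sketch, assuming only the return structure shown above (`overall_score`, `category_results`, `recommendations`). The helper name `summarize_governance_results`, the category name, and all sample values are hypothetical and are not real assessment output.

```python
def summarize_governance_results(results_by_dataset):
    """Build one summary row per dataset, sorted by overall score (highest first).

    results_by_dataset maps a dataset label to the dict returned by
    data_governance_assessment().
    """
    rows = []
    for label, result in results_by_dataset.items():
        row = {
            "dataset": label,
            "overall_score": result["overall_score"],
            "n_recommendations": len(result["recommendations"]),
        }
        # Average the criterion scores within each category
        for category, scores in result["category_results"].items():
            row[f"{category} (avg)"] = sum(scores.values()) / len(scores)
        rows.append(row)
    return sorted(rows, key=lambda r: r["overall_score"], reverse=True)


# Hypothetical results for illustration only
sample_results = {
    "dataset_a": {
        "overall_score": 50.0,
        "category_results": {"Quality": {"completeness": 1, "metadata": 3}},
        "recommendations": ["example recommendation", "another example"],
    },
    "dataset_b": {
        "overall_score": 75.0,
        "category_results": {"Quality": {"completeness": 3, "metadata": 3}},
        "recommendations": ["example recommendation"],
    },
}

governance_summary = summarize_governance_results(sample_results)
for row in governance_summary:
    print(row)
```

In the notebook, the sample dictionary would be replaced by real results, for example `{"my_dataset": governance_results}`.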

# 🎯 Phase 6: Your Project Development

## Step 9: Project Checklist and Next Steps

In [None]:
# Project development checklist
def project_checklist():
    """
    Print a phase-by-phase checklist for your open data exploration project
    and return it as a dict mapping each phase to its list of tasks.
    """
    checklist_items = {
        "🔍 Discovery Phase": [
            "Browse data.gov.hk systematically by category",
            "Identify 3-5 potential datasets of interest", 
            "Download sample data for initial assessment",
            "Document dataset URLs and basic information"
        ],
        "📊 Dataset Selection": [
            "Use quick_dataset_preview() for each candidate",
            "Run comprehensive_dataset_analysis() on top choices",
            "Evaluate modeling potential (regression/simulation feasibility)",
            "Select ONE dataset for your project focus"
        ],
        "🏛️ Policy Connection": [
            "Run policy_connection_analysis() on chosen dataset",
            "Research relevant government departments and decisions",
            "Formulate specific policy research questions",
            "Connect analysis to real government decision-making"
        ],
        "🔧 Governance Assessment": [
            "Complete data_governance_assessment() evaluation",
            "Identify specific governance improvement opportunities",
            "Research best practices from other jurisdictions",
            "Develop actionable recommendations"
        ],
        "📈 Analysis Approach": [
            "Choose primary method: regression/simulation OR governance focus",
            "Set up analysis framework using provided templates",
            "Plan visualizations and key findings presentation",
            "Prepare policy implications and recommendations"
        ],
        "📋 Documentation": [
            "Document data discovery process and selection rationale",
            "Create comprehensive analysis notebook",
            "Prepare policy brief with recommendations",
            "Plan final presentation for class discussion"
        ]
    }
    
    print("✅ OPEN DATA EXPLORATION PROJECT CHECKLIST")
    print("=" * 60)
    print("Use this checklist to track your progress:\n")
    
    for phase, items in checklist_items.items():
        print(f"{phase}")
        for item in items:
            print(f"  ☐ {item}")
        print()
    
    print("🎯 SUCCESS CRITERIA:")
    print("✅ Clear policy relevance with government connection")
    print("✅ Appropriate analytical approach (regression/simulation OR governance)")
    print("✅ Actionable recommendations for policy improvements")
    print("✅ Evidence of systematic data exploration process")
    print("✅ Professional presentation of findings")
    
    print("\n💡 REMEMBER:")
    print("• Be patient - good datasets require systematic exploration")
    print("• Focus on ONE specific dataset with clear policy connection")
    print("• Choose approach based on data characteristics")
    print("• Some uncertainty makes this topic more exciting - embrace the exploration!")
    
    return checklist_items

# Display the checklist
checklist = project_checklist()
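
Because `project_checklist()` returns the checklist as a dictionary, the team can track progress against it. The sketch below assumes that structure (phase name mapped to a list of task strings). The helper `checklist_progress`, the shortened `sample_checklist`, and the set of completed tasks are hypothetical stand-ins for illustration.

```python
# Stand-in with the same shape as the dict returned by project_checklist():
# phase name -> list of task descriptions (shortened here for illustration)
sample_checklist = {
    "Discovery Phase": [
        "Browse data.gov.hk systematically by category",
        "Identify 3-5 potential datasets of interest",
    ],
    "Dataset Selection": [
        "Select ONE dataset for your project focus",
    ],
}


def checklist_progress(checklist_items, completed):
    """Return {phase: (done, total)} given a set of completed task strings."""
    progress = {}
    for phase, items in checklist_items.items():
        done = sum(1 for item in items if item in completed)
        progress[phase] = (done, len(items))
    return progress


# Tasks the team has finished so far (hypothetical)
completed_tasks = {"Browse data.gov.hk systematically by category"}

progress = checklist_progress(sample_checklist, completed_tasks)
for phase, (done, total) in progress.items():
    print(f"{phase}: {done}/{total} complete")
```

In the notebook itself, pass the `checklist` variable returned above instead of `sample_checklist`.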

# 🚀 Ready to Start Your Exploration!

## Quick Start Guide

1. **Start Exploring:** Browse data.gov.hk and use the functions above to analyze potential datasets
2. **Be Patient:** Remember, finding the right dataset takes time and systematic exploration
3. **Choose Your Approach:** 
   - **High modeling potential:** Focus on regression/simulation analysis
   - **Limited modeling potential:** Focus on data governance improvements
4. **Connect to Policy:** Always link your analysis to specific government decisions
5. **Document Everything:** Keep track of your exploration process
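
One lightweight way to document the exploration process is to keep a log of every dataset the team looks at. The following is a minimal sketch using only the standard library. The field names, the helper functions `log_dataset` and `save_log`, and the example entries (including the placeholder URLs) are assumptions for illustration, not part of data.gov.hk or of this notebook's earlier functions.

```python
import csv
import io
from datetime import date

LOG_FIELDS = ["date", "dataset_name", "url", "format", "notes"]


def log_dataset(log, dataset_name, url, data_format, notes=""):
    """Append one exploration record (a dict) to the in-memory log list."""
    log.append({
        "date": date.today().isoformat(),
        "dataset_name": dataset_name,
        "url": url,
        "format": data_format,
        "notes": notes,
    })


def save_log(log, file_obj):
    """Write the whole log as CSV to an open, writable text file object."""
    writer = csv.DictWriter(file_obj, fieldnames=LOG_FIELDS)
    writer.writeheader()
    writer.writerows(log)


# Hypothetical entries with placeholder URLs
exploration_log = []
log_dataset(exploration_log, "Example dataset A", "https://example.org/dataset-a",
            "CSV", "Monthly figures; candidate for regression")
log_dataset(exploration_log, "Example dataset B", "https://example.org/dataset-b",
            "PDF", "Descriptive only; candidate for governance focus")

# Write to an in-memory buffer here; a real file works the same way
buffer = io.StringIO()
save_log(exploration_log, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To keep the log on disk, pass a file opened with `open("exploration_log.csv", "w", newline="")` to `save_log` instead of the in-memory buffer.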

## Remember: Some Uncertainty Makes This Topic More Exciting!

Unlike topics that come with an established foundation, this one asks you to build something completely new. Embrace the exploration process and its uncertainty - that is what makes this option exciting and rewarding!

**Good luck with your data exploration journey! 🎯**