# Hong Kong Immigration Passenger Traffic Analysis

## 1. Project Overview

### 1.1 Introduction
This project analyzes daily passenger traffic at Hong Kong immigration checkpoints with four machine learning methods: Linear Regression, Logistic Regression, SVM, and K-means. The goal is to predict passenger counts, classify high-traffic days, and group days with similar traffic patterns to support operational planning.

### 1.2 Objectives
- Predict future passenger traffic using Linear Regression
- Classify days as high/low traffic using Logistic Regression and SVM
- Cluster similar traffic patterns using K-means
- Provide actionable insights for immigration authorities
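
The classification objective needs a binary high/low label for each day. The rule for that label is not stated in this section, so the following is only a minimal sketch: it assumes a `passenger_count` series and a median cutoff, and the helper name `label_high_traffic` is hypothetical.

```python
import pandas as pd


def label_high_traffic(counts: pd.Series, quantile: float = 0.5) -> pd.Series:
    """Return 1 for days above the chosen quantile of passenger counts, else 0.

    The median (quantile=0.5) is an assumed cutoff, not one taken from the notebook.
    """
    threshold = counts.quantile(quantile)
    return (counts > threshold).astype(int)


# Toy values for illustration only
counts = pd.Series([100_000, 150_000, 220_000, 310_000], name="passenger_count")
labels = label_high_traffic(counts)
print(labels.tolist())
```

A quantile-based cutoff keeps the two classes roughly balanced, which makes accuracy easier to interpret for Logistic Regression and SVM. A fixed operational threshold would be an equally reasonable choice.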

## 2. Data Loading and Exploration

### 2.1 Import Libraries

In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (mean_squared_error, r2_score, 
                           accuracy_score, confusion_matrix, 
                           classification_report, silhouette_score)

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

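# Utilities
import warnings
# Silence only FutureWarning noise; other warnings (e.g. convergence issues) stay visible
warnings.filterwarnings('ignore', category=FutureWarning)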

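# Set visualization style
# 'seaborn-v0_8-darkgrid' needs matplotlib >= 3.6; older releases call it 'seaborn-darkgrid'
style = 'seaborn-v0_8-darkgrid'
if style not in plt.style.available:
    style = 'seaborn-darkgrid'
plt.style.use(style)
sns.set_palette("husl")
print("✅ Libraries imported successfully!")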

### 2.2 Load Dataset

In [None]:
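# Load the dataset
# Note: download the CSV from the data.gov.hk page printed below and save it to this path
from IPython.display import display  # built into notebooks; imported so the cell also runs as a script

data_path = "data/raw/passenger_traffic.csv"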

try:
    df = pd.read_csv(data_path)
    print("✅ Data loaded successfully!")
    print(f"   Shape: {df.shape}")
    print(f"   Columns: {list(df.columns)}")
    print("\nFirst 5 rows:")
    display(df.head())
except FileNotFoundError:
    print("⚠️  Data file not found. Please download from:")
    print("https://data.gov.hk/en-data/dataset/hk-immd-set5-statistics-daily-passenger-traffic")
    print("\nCreating sample data for demonstration...")
    
    # Create sample data for demonstration
    dates = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
    np.random.seed(42)
    
    df = pd.DataFrame({
        'date': dates,
        'immigration_point': np.random.choice(['Airport', 'Lo Wu', 'Lok Ma Chau', 'Shenzhen Bay'], len(dates)),
        'passenger_count': np.random.randint(50000, 300000, len(dates)) + 
                          (dates.dayofweek >= 5).astype(int) * 50000 +  # Weekend effect
                          (dates.month.isin([7, 8, 12])).astype(int) * 80000,  # Holiday months
        'passenger_type': np.random.choice(['Resident', 'Visitor', 'Transit'], len(dates))
    })
    
    print("Sample data created for demonstration purposes.")
    display(df.head())
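
Before modelling, it can help to confirm that the loaded frame has the expected columns and to get a quick per-checkpoint summary. The sketch below assumes the column names used in the sample data above (`date`, `immigration_point`, `passenger_count`). The downloaded CSV may name its columns differently, in which case they would need renaming first. The helper `summarize_traffic` is hypothetical.

```python
import pandas as pd


def summarize_traffic(df: pd.DataFrame) -> pd.DataFrame:
    """Validate the expected columns and return per-checkpoint row count and mean."""
    required = {"date", "immigration_point", "passenger_count"}
    missing = required - set(df.columns)
    if missing:
        raise KeyError(f"Missing expected columns: {sorted(missing)}")

    out = df.copy()
    # Parse dates so that a malformed value fails here rather than during modelling
    out["date"] = pd.to_datetime(out["date"])

    return (
        out.groupby("immigration_point")["passenger_count"]
        .agg(["count", "mean"])
        .sort_values("mean", ascending=False)
    )


# Toy frame for illustration only
demo = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-02", "2023-01-03"],
    "immigration_point": ["Airport", "Lo Wu", "Lo Wu"],
    "passenger_count": [200_000, 100_000, 120_000],
})
summary = summarize_traffic(demo)
print(summary)
```

Calling `summarize_traffic(df)` on the notebook's frame gives a first look at which checkpoints dominate traffic, and the explicit `KeyError` flags a schema mismatch early.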