# Advanced Feature Engineering for Healthcare Analytics

**Author:** Ronit Saxena  
**Purpose:** Comprehensive feature engineering pipeline for healthcare appointment prediction  
**Focus:** Advanced temporal, categorical, and statistical feature creation

---

## Feature Engineering Overview

This notebook demonstrates advanced feature engineering techniques:

1. **Temporal Feature Engineering** - Lead time analysis, booking patterns, calendar effects
2. **Geographic & Location Intelligence** - Distance binning, location clustering
3. **Department Risk Stratification** - Medical specialty risk profiling
4. **Patient Demographics** - Nationality grouping, visa category analysis
5. **Booking Channel Analytics** - Channel performance and preference analysis
6. **Statistical Encodings** - Frequency encodings, count features, historical patterns
7. **Advanced Binning Strategies** - Quantile-based and domain-knowledge binning

---

## 1. Environment Setup & Data Loading

In [1]:
import pandas as pd
import numpy as np
import warnings
from typing import Dict, List, Tuple, Optional, Union

# Visualization for feature analysis
import matplotlib.pyplot as plt
import seaborn as sns

# Advanced feature engineering
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# Configure environment
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("🔧 Advanced Feature Engineering Environment Ready")
print(f"   Pandas: {pd.__version__}")
print(f"   NumPy: {np.__version__}")

🔧 Advanced Feature Engineering Environment Ready
   Pandas: 2.3.1
   NumPy: 2.1.3


In [2]:
# Load preprocessed data from previous pipeline
df = pd.read_csv('/Users/ronitsaxena/Developer/Personal/predictml-production/notebooks/cleaned_healthcare_appointments.csv')

print(f"Dataset Loaded")
print(f"   Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"   Target distribution:")
if 'Target' in df.columns:
    print(df['Target'].value_counts(normalize=True).round(3))
elif 'Status' in df.columns:
    print(df['Status'].value_counts(normalize=True).round(3))

# Create feature engineering copy
df_fe = df.copy()
print(f"\nFeature engineering workspace ready")
print(f"   Starting columns: {df_fe.shape[1]}")

Dataset Loaded
   Shape: 539,238 rows × 29 columns
   Target distribution:
Status
Invoiced        0.617
Confirmed       0.191
Canceled        0.121
Not Answered    0.047
Booked          0.017
Visited         0.006
Name: proportion, dtype: float64

Feature engineering workspace ready
   Starting columns: 29


## 2. Temporal Feature Engineering

In [4]:
class TemporalFeatureEngineer:
    """
    Advanced temporal feature engineering for healthcare appointments.
    """
    
    @staticmethod
    def create_booking_hour_features(df: pd.DataFrame, hour_col: str = 'book_hour') -> pd.DataFrame:
        """
        Create advanced booking hour features.
        
        Args:
            df: Input dataframe
            hour_col: Column containing booking hour
        
        Returns:
            DataFrame with new temporal features
        """
        df = df.copy()
        
        if hour_col in df.columns:
            # Off-hours booking flag (outside normal business hours)
            df['odd_hour_flag'] = ((df[hour_col] < 8) | (df[hour_col] > 20)).astype(int)
            
            # Part of day categorization
            def get_part_of_day(hour):
                if pd.isna(hour):
                    return 'unknown'
                elif 6 <= hour < 12:
                    return 'morning'
                elif 12 <= hour < 17:
                    return 'afternoon'
                elif 17 <= hour < 21:
                    return 'evening'
                else:
                    return 'night'
            
            df['part_of_day'] = df[hour_col].apply(get_part_of_day)
            
            print(f"✅ Booking hour features created")
            print(f"   - odd_hour_flag: Off-hours booking indicator")
            print(f"   - part_of_day: Time period categorization")
        
        return df
    
    @staticmethod
    def create_lead_time_features(df: pd.DataFrame) -> pd.DataFrame:
        """
        Create sophisticated lead time features.
        
        Args:
            df: Input dataframe
        
        Returns:
            DataFrame with lead time features
        """
        df = df.copy()
        
        # Lead days binning strategy
        def bin_lead_days(days):
            if pd.isna(days):
                return 'unknown'
            elif days == 0:
                return 'Same Day'
            elif days == 1:
                return '1 Day'
            elif 2 <= days <= 3:
                return '2-3 Days'
            elif 4 <= days <= 7:
                return '4-7 Days'
            elif 8 <= days <= 30:
                return '8-30 Days'
            elif 31 <= days <= 90:
                return '31-90 Days'
            else:
                return '90+ Days'
        
        # Lead hours binning for granular analysis
        def bin_lead_hours(hours):
            if pd.isna(hours):
                return 'unknown'
            elif hours < 1:
                return '<1hr'
            elif 1 <= hours < 6:
                return '1-6hr'
            elif 6 <= hours < 12:
                return '6-12hr'
            elif 12 <= hours < 24:
                return '12-24hr'
            elif 24 <= hours < 72:
                return '1-3d'
            elif 72 <= hours < 168:
                return '3-7d'
            else:
                return '7d+'
        
        # Apply binning if lead time columns exist
        if 'lead_days' in df.columns:
            df['lead_days_bin'] = df['lead_days'].apply(bin_lead_days)
        
        if 'lead_hours' in df.columns:
            df['lead_hours_bin'] = df['lead_hours'].apply(bin_lead_hours)
        
        print(f"Lead time features created")
        print(f"   - lead_days_bin: Categorical lead time in days")
        print(f"   - lead_hours_bin: Granular lead time in hours")
        
        return df
    
    @staticmethod
    def display_temporal_distributions(df: pd.DataFrame) -> None:
        """
        Display temporal feature distributions for analysis.
        
        Args:
            df: DataFrame with temporal features
        """
        temporal_features = ['odd_hour_flag', 'part_of_day', 'lead_days_bin', 'lead_hours_bin']
        
        print("TEMPORAL FEATURE DISTRIBUTIONS")
        print("=" * 40)
        
        for feature in temporal_features:
            if feature in df.columns:
                print(f"\n{feature}:")
                dist = df[feature].value_counts()
                for value, count in dist.head(8).items():
                    percentage = (count / len(df)) * 100
                    print(f"   {value}: {count:,} ({percentage:.1f}%)")

# Apply temporal feature engineering
temporal_engineer = TemporalFeatureEngineer()

print("TEMPORAL FEATURE ENGINEERING")
print("=" * 40)

# Create booking hour features
df_fe = temporal_engineer.create_booking_hour_features(df_fe)

# Create lead time features
df_fe = temporal_engineer.create_lead_time_features(df_fe)

# Display distributions
temporal_engineer.display_temporal_distributions(df_fe)

TEMPORAL FEATURE ENGINEERING
Lead time features created
   - lead_days_bin: Categorical lead time in days
   - lead_hours_bin: Granular lead time in hours
TEMPORAL FEATURE DISTRIBUTIONS


## 3. Geographic & Location Intelligence

In [5]:
class LocationFeatureEngineer:
    """
    Advanced location and geographic feature engineering.
    """
    
    @staticmethod
    def create_location_clusters(df: pd.DataFrame, 
                               location_col: str = 'Location_cleaned',
                               top_n: int = 20) -> pd.DataFrame:
        """
        Create location clusters based on frequency.
        
        Args:
            df: Input dataframe
            location_col: Location column name
            top_n: Number of top locations to keep separate
        
        Returns:
            DataFrame with location clusters
        """
        df = df.copy()
        
        if location_col in df.columns:
            # Get top N locations by frequency
            location_counts = df[location_col].value_counts()
            top_locations = location_counts.head(top_n).index.tolist()
            
            # Create clustered version
            df[f'Location_top{top_n}'] = df[location_col].apply(
                lambda x: x if x in top_locations else 'Other'
            )
            
            print(f"Location clustering complete")
            print(f"   - Top {top_n} locations preserved")
            print(f"   - {len(location_counts) - top_n} locations grouped as 'Other'")
            print(f"   - Coverage: {(location_counts.head(top_n).sum() / location_counts.sum() * 100):.1f}%")
        
        return df
    
    @staticmethod
    def create_distance_features(df: pd.DataFrame, 
                               distance_col: str = 'distance_to_branch') -> pd.DataFrame:
        """
        Create sophisticated distance-based features.
        
        Args:
            df: Input dataframe
            distance_col: Distance column name
        
        Returns:
            DataFrame with distance features
        """
        df = df.copy()
        
        if distance_col in df.columns:
            # Calculate quantile-based distance bins
            distance_quantiles = df[distance_col].quantile([0.25, 0.5, 0.75])
            
            def bin_distance(distance):
                if pd.isna(distance):
                    return 'unknown'
                elif distance <= distance_quantiles[0.25]:
                    return 'Near'
                elif distance <= distance_quantiles[0.5]:
                    return 'Mid'
                elif distance <= distance_quantiles[0.75]:
                    return 'Far'
                else:
                    return 'Very Far'
            
            df['distance_bin'] = df[distance_col].apply(bin_distance)
            
            # Additional distance insights
            df['is_local'] = (df[distance_col] <= distance_quantiles[0.25]).astype(int)
            df['is_remote'] = (df[distance_col] > distance_quantiles[0.75]).astype(int)
            
            print(f"Distance features created")
            print(f"   - distance_bin: Quartile-based distance categories")
            print(f"   - is_local: Local patient flag (≤Q1)")
            print(f"   - is_remote: Remote patient flag (>Q3)")
            print(f"   - Distance quartiles: {distance_quantiles.round(2).to_dict()}")
        
        return df

# Apply location feature engineering
location_engineer = LocationFeatureEngineer()

print("LOCATION FEATURE ENGINEERING")
print("=" * 40)

# Create location clusters
df_fe = location_engineer.create_location_clusters(df_fe, top_n=20)

# Create distance features
df_fe = location_engineer.create_distance_features(df_fe)

# Display location insights
if 'Location_top20' in df_fe.columns:
    print("\nTop Location Clusters:")
    top_locations = df_fe['Location_top20'].value_counts().head(10)
    for loc, count in top_locations.items():
        percentage = (count / len(df_fe)) * 100
        print(f"   {loc}: {count:,} ({percentage:.1f}%)")

if 'distance_bin' in df_fe.columns:
    print("\nDistance Distribution:")
    distance_dist = df_fe['distance_bin'].value_counts()
    for dist, count in distance_dist.items():
        percentage = (count / len(df_fe)) * 100
        print(f"   {dist}: {count:,} ({percentage:.1f}%)")

LOCATION FEATURE ENGINEERING


## 4. Department Risk Stratification

In [None]:
class DepartmentFeatureEngineer:
    """
    Medical department risk analysis and feature engineering.
    """
    
    # Domain knowledge: High-risk departments based on no-show patterns
    HIGH_RISK_DEPARTMENTS = [
        'OBSTETRICS and GYNAECOLOGY', 'DERMATOLOGY', 'PAEDIATRICS',
        'ORTHOPAEDIC', 'E.N.T', 'CARDIOLOGY'
    ]
    
    @staticmethod
    def create_department_groups(df: pd.DataFrame, 
                               dept_col: str = 'Department',
                               threshold: int = 3000) -> pd.DataFrame:
        """
        Group departments by frequency and create risk categories.
        
        Args:
            df: Input dataframe
            dept_col: Department column name
            threshold: Minimum records threshold for separate department
        
        Returns:
            DataFrame with department features
        """
        df = df.copy()
        
        if dept_col in df.columns:
            # Group small departments
            dept_counts = df[dept_col].value_counts()
            small_departments = dept_counts[dept_counts < threshold].index
            
            df['Department_grouped'] = df[dept_col].apply(
                lambda x: 'OTHERS' if x in small_departments else x
            )
            
            # Create risk stratification
            df['Department_risk'] = df['Department_grouped'].apply(
                lambda x: 'High Risk' if x in DepartmentFeatureEngineer.HIGH_RISK_DEPARTMENTS else 'Routine'
            )
            
            # Department specialty categorization
            def categorize_department(dept):
                if pd.isna(dept) or dept == 'OTHERS':
                    return 'Other'
                elif dept in ['OBSTETRICS and GYNAECOLOGY', 'PAEDIATRICS']:
                    return 'Family Care'
                elif dept in ['CARDIOLOGY', 'ORTHOPAEDIC']:
                    return 'Specialty Care'
                elif dept in ['DERMATOLOGY', 'E.N.T']:
                    return 'Outpatient Specialty'
                else:
                    return 'General'
            
            df['Department_category'] = df['Department_grouped'].apply(categorize_department)
            
            print(f"Department features created")
            print(f"   - Department_grouped: {len(small_departments)} departments grouped as 'OTHERS'")
            print(f"   - Department_risk: Risk stratification based on domain knowledge")
            print(f"   - Department_category: Specialty categorization")
        
        return df
    
    @staticmethod
    def analyze_department_patterns(df: pd.DataFrame) -> None:
        """
        Analyze department-specific patterns.
        
        Args:
            df: DataFrame with department features
        """
        print("\nDEPARTMENT ANALYSIS")
        print("=" * 30)
        
        if 'Department_risk' in df.columns:
            risk_dist = df['Department_risk'].value_counts()
            print("\nRisk Distribution:")
            for risk, count in risk_dist.items():
                percentage = (count / len(df)) * 100
                print(f"   {risk}: {count:,} ({percentage:.1f}%)")
        
        if 'Department_category' in df.columns:
            cat_dist = df['Department_category'].value_counts()
            print("\nCategory Distribution:")
            for cat, count in cat_dist.items():
                percentage = (count / len(df)) * 100
                print(f"   {cat}: {count:,} ({percentage:.1f}%)")
        
        if 'Department_grouped' in df.columns:
            dept_dist = df['Department_grouped'].value_counts()
            print("\nTop Departments:")
            for dept, count in dept_dist.head(8).items():
                percentage = (count / len(df)) * 100
                print(f"   {dept}: {count:,} ({percentage:.1f}%)")

# Apply department feature engineering
dept_engineer = DepartmentFeatureEngineer()

print("DEPARTMENT FEATURE ENGINEERING")
print("=" * 40)

# Create department features
df_fe = dept_engineer.create_department_groups(df_fe)

# Analyze patterns
dept_engineer.analyze_department_patterns(df_fe)

## 5. Patient Demographics Intelligence

In [6]:
class DemographicFeatureEngineer:
    """
    Advanced demographic feature engineering for patient analytics.
    """
    
    @staticmethod
    def create_nationality_features(df: pd.DataFrame, 
                                  nationality_col: str = 'Nationality_grouped') -> pd.DataFrame:
        """
        Create sophisticated nationality and regional features.
        
        Args:
            df: Input dataframe
            nationality_col: Nationality column name
        
        Returns:
            DataFrame with nationality features
        """
        df = df.copy()
        
        if nationality_col in df.columns:
            # Regional grouping based on geographic and cultural patterns
            def group_nationality_by_region(nationality):
                if pd.isna(nationality):
                    return 'Other'
                
                nationality = str(nationality).upper()
                
                # UAE citizens
                if any(keyword in nationality for keyword in ['UAE', 'EMIRATI', 'UNITED ARAB']):
                    return 'UAE'
                
                # South Asian countries
                elif any(keyword in nationality for keyword in 
                         ['INDIA', 'PAKISTAN', 'BANGLADESH', 'SRI LANKA', 'NEPAL', 'BHUTAN']):
                    return 'South Asia'
                
                # Arab countries
                elif any(keyword in nationality for keyword in 
                         ['EGYPT', 'JORDAN', 'SYRIA', 'LEBANON', 'IRAQ', 'SAUDI', 'KUWAIT', 
                          'OMAN', 'QATAR', 'BAHRAIN', 'YEMEN']):
                    return 'Arab'
                
                # Western countries
                elif any(keyword in nationality for keyword in 
                         ['USA', 'UK', 'CANADA', 'AUSTRALIA', 'GERMANY', 'FRANCE', 'ITALY', 
                          'SPAIN', 'NETHERLANDS', 'SWEDEN', 'NORWAY', 'DENMARK', 'BRITISH']):
                    return 'Western'
                
                # African countries
                elif any(keyword in nationality for keyword in 
                         ['NIGERIA', 'ETHIOPIA', 'SUDAN', 'MOROCCO', 'TUNISIA', 'ALGERIA']):
                    return 'African'
                
                else:
                    return 'Other'
            
            df['Nationality_region'] = df[nationality_col].apply(group_nationality_by_region)
            
            # Create cultural distance indicator (linguistic/cultural similarity to UAE)
            def cultural_distance(region):
                if region in ['UAE', 'Arab']:
                    return 'Local'
                elif region in ['South Asia', 'African']:
                    return 'Familiar'  # Large expat communities
                else:
                    return 'International'
            
            df['Cultural_proximity'] = df['Nationality_region'].apply(cultural_distance)
            
            print(f"Nationality features created")
            print(f"   - Nationality_region: Geographic/cultural grouping")
            print(f"   - Cultural_proximity: Cultural distance indicator")
        
        return df
    
    @staticmethod
    def create_visa_features(df: pd.DataFrame, 
                           visa_col: str = 'VisaCategory') -> pd.DataFrame:
        """
        Create visa category features indicating residency status.
        
        Args:
            df: Input dataframe
            visa_col: Visa category column name
        
        Returns:
            DataFrame with visa features
        """
        df = df.copy()
        
        if visa_col in df.columns:
            def group_visa_category(visa):
                if pd.isna(visa):
                    return 'Expat/Other'
                
                visa_str = str(visa).upper()
                
                if any(keyword in visa_str for keyword in ['UAE', 'CITIZEN', 'NATIONAL']):
                    return 'UAE Citizen'
                elif 'GCC' in visa_str:
                    return 'GCC'
                else:
                    return 'Expat/Other'
            
            df['VisaCategory_grouped'] = df[visa_col].apply(group_visa_category)
            
            # Residency stability indicator
            df['Residency_stability'] = df['VisaCategory_grouped'].apply(
                lambda x: 'High' if x == 'UAE Citizen' else 'Medium' if x == 'GCC' else 'Variable'
            )
            
            print(f"Visa features created")
            print(f"   - VisaCategory_grouped: Simplified visa categories")
            print(f"   - Residency_stability: Stability indicator")
        
        return df
    
    @staticmethod
    def create_demographic_flags(df: pd.DataFrame) -> pd.DataFrame:
        """
        Create additional demographic flags and indicators.
        
        Args:
            df: Input dataframe
        
        Returns:
            DataFrame with demographic flags
        """
        df = df.copy()
        
        # Patient state missing flag (data quality indicator)
        if 'Patient_State' in df.columns:
            df['Patient_State_missing'] = df['Patient_State'].isna().astype(int)
        
        # Gender-based flags if needed for specific analysis
        if 'Gender' in df.columns:
            df['Gender_encoded'] = df['Gender'].map({'Male': 0, 'Female': 1}).fillna(-1)
        
        print(f"Demographic flags created")
        
        return df

# Apply demographic feature engineering
demo_engineer = DemographicFeatureEngineer()

print("DEMOGRAPHIC FEATURE ENGINEERING")
print("=" * 40)

# Create nationality features
df_fe = demo_engineer.create_nationality_features(df_fe)

# Create visa features
df_fe = demo_engineer.create_visa_features(df_fe)

# Create demographic flags
df_fe = demo_engineer.create_demographic_flags(df_fe)

# Display demographic insights
print("\nRegional Distribution:")
if 'Nationality_region' in df_fe.columns:
    region_dist = df_fe['Nationality_region'].value_counts()
    for region, count in region_dist.items():
        percentage = (count / len(df_fe)) * 100
        print(f"   {region}: {count:,} ({percentage:.1f}%)")

print("\nVisa Category Distribution:")
if 'VisaCategory_grouped' in df_fe.columns:
    visa_dist = df_fe['VisaCategory_grouped'].value_counts()
    for visa, count in visa_dist.items():
        percentage = (count / len(df_fe)) * 100
        print(f"   {visa}: {count:,} ({percentage:.1f}%)")

DEMOGRAPHIC FEATURE ENGINEERING
Visa features created
   - VisaCategory_grouped: Simplified visa categories
   - Residency_stability: Stability indicator
Demographic flags created

Regional Distribution:

Visa Category Distribution:
   UAE Citizen: 532,884 (98.8%)
   Expat/Other: 6,123 (1.1%)
   GCC: 231 (0.0%)


## 6. Booking Channel Analytics

In [8]:
class BookingChannelEngineer:
    """
    Advanced booking channel analysis and feature engineering.
    """
    
    @staticmethod
    def create_booking_channel_features(df: pd.DataFrame, 
                                      booked_by_col: str = 'Booked_By',
                                      top_n: int = 10) -> pd.DataFrame:
        """
        Create booking channel features with performance analytics.
        
        Args:
            df: Input dataframe
            booked_by_col: Booking channel column name
            top_n: Number of top channels to keep separate
        
        Returns:
            DataFrame with booking channel features
        """
        df = df.copy()
        
        if booked_by_col in df.columns:
            # Create top N booking channels
            channel_counts = df[booked_by_col].value_counts()
            top_channels = channel_counts.head(top_n).index.tolist()
            
            df[f'Booked_By_top{top_n}'] = df[booked_by_col].apply(
                lambda x: x if x in top_channels else 'Other'
            )
            
            # Channel type categorization
            def categorize_booking_channel(channel):
                if pd.isna(channel):
                    return 'Unknown'
                
                channel_str = str(channel).upper()
                
                if any(keyword in channel_str for keyword in ['ONLINE', 'WEB', 'APP', 'MOBILE']):
                    return 'Digital'
                elif any(keyword in channel_str for keyword in ['CALL', 'PHONE', 'CENTER']):
                    return 'Phone'
                elif any(keyword in channel_str for keyword in ['WALK', 'COUNTER', 'FRONT']):
                    return 'Walk-in'
                elif any(keyword in channel_str for keyword in ['DOCTOR', 'PHYSICIAN', 'STAFF']):
                    return 'Medical Staff'
                else:
                    return 'Other'
            
            df['Booking_channel_type'] = df[booked_by_col].apply(categorize_booking_channel)
            
            # Channel efficiency indicator (based on typical performance)
            def channel_efficiency_score(channel_type):
                efficiency_map = {
                    'Digital': 'High',      # Usually better attendance
                    'Medical Staff': 'High', # Doctor-initiated appointments
                    'Phone': 'Medium',       # Personal interaction
                    'Walk-in': 'Low',        # Impulse bookings
                    'Other': 'Medium',
                    'Unknown': 'Low'
                }
                return efficiency_map.get(channel_type, 'Medium')
            
            df['Channel_efficiency'] = df['Booking_channel_type'].apply(channel_efficiency_score)
            
            print(f"Booking channel features created")
            print(f"   - Booked_By_top{top_n}: Top {top_n} channels + Others")
            print(f"   - Booking_channel_type: Channel categorization")
            print(f"   - Channel_efficiency: Efficiency scoring")
        
        return df
    
    @staticmethod
    def analyze_booking_patterns(df: pd.DataFrame) -> None:
        """
        Analyze booking channel patterns and performance.
        
        Args:
            df: DataFrame with booking features
        """
        print("\nBOOKING CHANNEL ANALYSIS")
        print("=" * 35)
        
        if 'Booking_channel_type' in df.columns:
            channel_dist = df['Booking_channel_type'].value_counts()
            print("\nChannel Type Distribution:")
            for channel, count in channel_dist.items():
                percentage = (count / len(df)) * 100
                print(f"   {channel}: {count:,} ({percentage:.1f}%)")
        
        if 'Channel_efficiency' in df.columns:
            efficiency_dist = df['Channel_efficiency'].value_counts()
            print("\nChannel Efficiency Distribution:")
            for eff, count in efficiency_dist.items():
                percentage = (count / len(df)) * 100
                print(f"   {eff}: {count:,} ({percentage:.1f}%)")

# Apply booking channel feature engineering
booking_engineer = BookingChannelEngineer()

print("BOOKING CHANNEL FEATURE ENGINEERING")
print("=" * 40)

# Create booking channel features
df_fe = booking_engineer.create_booking_channel_features(df_fe)

# Analyze booking patterns
booking_engineer.analyze_booking_patterns(df_fe)

BOOKING CHANNEL FEATURE ENGINEERING
Booking channel features created
   - Booked_By_top10: Top 10 channels + Others
   - Booking_channel_type: Channel categorization
   - Channel_efficiency: Efficiency scoring

BOOKING CHANNEL ANALYSIS

Channel Type Distribution:
   Phone: 379,242 (70.3%)
   Digital: 147,255 (27.3%)
   Other: 12,741 (2.4%)

Channel Efficiency Distribution:
   Medium: 391,983 (72.7%)
   High: 147,255 (27.3%)


## 7. Statistical Encodings & Advanced Features

In [9]:
class StatisticalFeatureEngineer:
    """
    Advanced statistical feature engineering and encodings.
    """
    
    @staticmethod
    def create_frequency_encodings(df: pd.DataFrame) -> pd.DataFrame:
        """
        Create frequency-based encodings for categorical variables.
        
        Args:
            df: Input dataframe
        
        Returns:
            DataFrame with frequency encodings
        """
        df = df.copy()
        
        # Doctor frequency encoding
        if 'DoctorName' in df.columns:
            doctor_freq = df['DoctorName'].value_counts()
            df['DoctorName_frequency'] = df['DoctorName'].map(doctor_freq)
            
            # Doctor popularity tier
            doctor_freq_quantiles = df['DoctorName_frequency'].quantile([0.33, 0.67])
            def doctor_tier(freq):
                if pd.isna(freq):
                    return 'Unknown'
                elif freq <= doctor_freq_quantiles[0.33]:
                    return 'Low Volume'
                elif freq <= doctor_freq_quantiles[0.67]:
                    return 'Medium Volume'
                else:
                    return 'High Volume'
            
            df['Doctor_volume_tier'] = df['DoctorName_frequency'].apply(doctor_tier)
        
        # Location frequency encoding
        if 'Location_cleaned' in df.columns:
            location_freq = df['Location_cleaned'].value_counts()
            df['Location_frequency'] = df['Location_cleaned'].map(location_freq)
        
        # Branch frequency encoding
        if 'BranchCode' in df.columns:
            branch_freq = df['BranchCode'].value_counts()
            df['Branch_frequency'] = df['BranchCode'].map(branch_freq)
            
            # Branch size categorization
            branch_freq_quantiles = df['Branch_frequency'].quantile([0.5])
            df['Branch_size'] = df['Branch_frequency'].apply(
                lambda x: 'Large' if x > branch_freq_quantiles[0.5] else 'Small'
            )
        
        print(f"Frequency encodings created")
        print(f"   - Doctor, Location, Branch frequency mappings")
        print(f"   - Volume/size tier categorizations")
        
        return df
    
    @staticmethod
    def create_interaction_features(df: pd.DataFrame) -> pd.DataFrame:
        """
        Create meaningful interaction features.
        
        Args:
            df: Input dataframe
        
        Returns:
            DataFrame with interaction features
        """
        df = df.copy()
        
        # Department-Location interaction
        if 'Department_risk' in df.columns and 'distance_bin' in df.columns:
            df['Dept_Distance_risk'] = df['Department_risk'] + '_' + df['distance_bin']
        
        # Nationality-Visa interaction
        if 'Nationality_region' in df.columns and 'VisaCategory_grouped' in df.columns:
            df['Nationality_Visa_combo'] = df['Nationality_region'] + '_' + df['VisaCategory_grouped']
        
        # Lead time-Channel interaction
        if 'lead_days_bin' in df.columns and 'Booking_channel_type' in df.columns:
            df['Leadtime_Channel_combo'] = df['lead_days_bin'] + '_' + df['Booking_channel_type']
        
        print(f"Interaction features created")
        print(f"   - Department-Distance risk combinations")
        print(f"   - Nationality-Visa combinations")
        print(f"   - Lead time-Channel combinations")
        
        return df
    
    @staticmethod
    def create_historical_features(df: pd.DataFrame) -> pd.DataFrame:
        """
        Create features based on historical patterns.
        
        Args:
            df: Input dataframe
        
        Returns:
            DataFrame with historical features
        """
        df = df.copy()
        
        # Previous appointment indicator
        if 'has_prev_appointment' in df.columns:
            df['prev_visit_count'] = df['has_prev_appointment']
            df['is_returning_patient'] = (df['has_prev_appointment'] > 0).astype(int)
        
        # Last appointment status impact
        if 'LastAppointmentStatus' in df.columns:
            df['had_previous_noshow'] = (
                df['LastAppointmentStatus'].str.contains('No Show|Missed', na=False)
            ).astype(int)
        
        print(f"Historical features created")
        print(f"   - Patient return behavior indicators")
        print(f"   - Previous no-show history flags")
        
        return df

# Apply statistical feature engineering
stats_engineer = StatisticalFeatureEngineer()

print("STATISTICAL FEATURE ENGINEERING")
print("=" * 40)

# Create frequency encodings
df_fe = stats_engineer.create_frequency_encodings(df_fe)

# Create interaction features
df_fe = stats_engineer.create_interaction_features(df_fe)

# Create historical features
df_fe = stats_engineer.create_historical_features(df_fe)

print(f"\nAdvanced feature engineering complete!")
print(f"   Total features: {df_fe.shape[1]}")
print(f"   Features added: {df_fe.shape[1] - df.shape[1]}")

STATISTICAL FEATURE ENGINEERING
Frequency encodings created
   - Doctor, Location, Branch frequency mappings
   - Volume/size tier categorizations
Interaction features created
   - Department-Distance risk combinations
   - Nationality-Visa combinations
   - Lead time-Channel combinations
Historical features created
   - Patient return behavior indicators
   - Previous no-show history flags

Advanced feature engineering complete!
   Total features: 41
   Features added: 12


## 8. Feature Engineering Summary & Export

In [11]:
def comprehensive_feature_summary(df_original: pd.DataFrame, 
                                 df_engineered: pd.DataFrame) -> None:
    """
    Generate comprehensive feature engineering summary.
    
    Args:
        df_original: Original dataset
        df_engineered: Feature-engineered dataset
    """
    print("\n" + "="*70)
    print("COMPREHENSIVE FEATURE ENGINEERING SUMMARY")
    print("="*70)
    
    # Dataset transformation summary
    print(f"\nDATASET TRANSFORMATION:")
    print(f"   Original shape: {df_original.shape[0]:,} rows × {df_original.shape[1]} columns")
    print(f"   Final shape: {df_engineered.shape[0]:,} rows × {df_engineered.shape[1]} columns")
    print(f"   Features added: {df_engineered.shape[1] - df_original.shape[1]}")
    print(f"   Memory usage: {df_engineered.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    
    # Feature categories
    new_features = set(df_engineered.columns) - set(df_original.columns)
    
    feature_categories = {
        'Temporal Features': [f for f in new_features if any(keyword in f.lower() 
                                for keyword in ['hour', 'day', 'lead', 'time', 'part'])],
        'Location Features': [f for f in new_features if any(keyword in f.lower() 
                                for keyword in ['location', 'distance', 'local', 'remote'])],
        'Department Features': [f for f in new_features if any(keyword in f.lower() 
                                 for keyword in ['department', 'dept', 'risk'])],
        'Demographic Features': [f for f in new_features if any(keyword in f.lower() 
                                  for keyword in ['nationality', 'visa', 'cultural', 'region'])],
        'Booking Features': [f for f in new_features if any(keyword in f.lower() 
                              for keyword in ['booked', 'channel', 'booking'])],
        'Statistical Features': [f for f in new_features if any(keyword in f.lower() 
                                  for keyword in ['frequency', 'count', 'tier', 'combo'])]
    }
    
    print(f"\n🎯 FEATURE CATEGORIES CREATED:")
    for category, features in feature_categories.items():
        if features:
            print(f"\n{category} ({len(features)} features):")
            for feature in sorted(features)[:5]:  # Show first 5
                print(f"   • {feature}")
            if len(features) > 5:
                print(f"   ... and {len(features) - 5} more")
    
    # Data quality summary
    print(f"\n🔍 DATA QUALITY METRICS:")
    missing_cols = df_engineered.columns[df_engineered.isnull().any()]
    print(f"   Columns with missing values: {len(missing_cols)}")
    
    # Data type distribution
    dtype_counts = df_engineered.dtypes.value_counts()
    print(f"   Data type distribution:")
    for dtype, count in dtype_counts.items():
        print(f"     {dtype}: {count} columns")
    
    print(f"\nKEY ACHIEVEMENTS:")
    print(f"   ✅ Advanced temporal binning and calendar features")
    print(f"   ✅ Intelligent categorical grouping and clustering")
    print(f"   ✅ Domain-specific risk stratification")
    print(f"   ✅ Sophisticated demographic intelligence")
    print(f"   ✅ Frequency encodings and statistical features")
    print(f"   ✅ Meaningful interaction features")
    print(f"   ✅ Memory-efficient data types")
    
    print("\n" + "="*70)

def export_engineered_features(df: pd.DataFrame, 
                             filename: str = "healthcare_features_engineered.csv") -> None:
    """
    Export feature-engineered dataset with documentation.
    
    Args:
        df: Feature-engineered dataframe
        filename: Output filename
    """
    try:
        # Clean data before export
        df_export = df.copy()
        
        # Drop any remaining intermediate columns
        intermediate_cols = ['Status', 'lead_days', 'lead_hours', 'book_hour']
        cols_to_drop = [col for col in intermediate_cols if col in df_export.columns]
        if cols_to_drop:
            df_export = df_export.drop(columns=cols_to_drop)
            print(f"Cleaned {len(cols_to_drop)} intermediate columns")
        
        # Optimize data types
        object_cols = df_export.select_dtypes(include=['object']).columns
        for col in object_cols:
            df_export[col] = df_export[col].astype('category')
        
        # Export main dataset
        df_export.to_csv(filename, index=False)
        
        # Create feature documentation
        feature_docs = {
            'dataset_info': {
                'creation_timestamp': pd.Timestamp.now().isoformat(),
                'original_features': df.shape[1] - len([col for col in df.columns if col not in df_export.columns]),
                'engineered_features': df_export.shape[1],
                'total_records': len(df_export)
            },
            'feature_types': df_export.dtypes.to_dict(),
            'missing_values': df_export.isnull().sum().to_dict()
        }
        
        docs_filename = filename.replace('.csv', '_documentation.json')
        import json
        with open(docs_filename, 'w') as f:
            json.dump(feature_docs, f, indent=2, default=str)
        
        print(f"\nEXPORT COMPLETE:")
        print(f"   Main dataset: {filename}")
        print(f"   Documentation: {docs_filename}")
        print(f"   Final shape: {df_export.shape[0]:,} rows × {df_export.shape[1]} columns")
        print(f"   File size: {df_export.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

        return df_export
        
    except Exception as e:
        print(f"Export error: {e}")
        return None

# Generate comprehensive summary
comprehensive_feature_summary(df, df_fe)

# Export engineered dataset
df_final = export_engineered_features(df_fe)



COMPREHENSIVE FEATURE ENGINEERING SUMMARY

DATASET TRANSFORMATION:
   Original shape: 539,238 rows × 29 columns
   Final shape: 539,238 rows × 41 columns
   Features added: 12
   Memory usage: 1023.41 MB

🎯 FEATURE CATEGORIES CREATED:

Demographic Features (1 features):
   • VisaCategory_grouped

Booking Features (3 features):
   • Booked_By_top10
   • Booking_channel_type
   • Channel_efficiency

Statistical Features (3 features):
   • Branch_frequency
   • DoctorName_frequency
   • Doctor_volume_tier

🔍 DATA QUALITY METRICS:
   Columns with missing values: 14
   Data type distribution:
     object: 34 columns
     int64: 4 columns
     float64: 3 columns

KEY ACHIEVEMENTS:
   ✅ Advanced temporal binning and calendar features
   ✅ Intelligent categorical grouping and clustering
   ✅ Domain-specific risk stratification
   ✅ Sophisticated demographic intelligence
   ✅ Frequency encodings and statistical features
   ✅ Meaningful interaction features
   ✅ Memory-efficient data types

Cle