# Stroke Prediction Analysis - Data Loading and Quality Assessment

## 📋 Notebook Overview

This notebook performs comprehensive data loading and quality assessment for the stroke prediction dataset using high-performance Polars library and advanced data quality metrics.

### 🎯 Objectives

1. **Load Dataset Efficiently**: Use Polars for high-performance data loading
2. **Assess Data Quality**: Comprehensive analysis of data completeness, accuracy, and consistency
3. **Identify Issues**: Missing values, outliers, duplicates, and data type problems
4. **Generate Insights**: Initial understanding of the stroke prediction dataset
5. **Create Baseline**: Establish foundation for subsequent analysis

### 📊 Dataset Information

- **Source**: Healthcare Stroke Prediction Dataset
- **Target Variable**: `stroke` (binary: 0=no stroke, 1=stroke)
- **Features**: Demographics, medical history, lifestyle factors, clinical measurements
- **Expected Size**: ~5,000 records with 12 features

### 🔍 Quality Assessment Framework

We'll evaluate data quality across multiple dimensions:
- **Completeness**: Missing value analysis
- **Validity**: Data type and range validation
- **Consistency**: Duplicate detection and logical consistency
- **Accuracy**: Outlier detection and clinical validation
- **Timeliness**: Data freshness and relevance

## 🔧 Library Imports and Setup

Setting up our analysis environment with optimized libraries for data processing, analysis, and visualization.

In [5]:
# Core data processing libraries
import pandas as pd
import polars as pl
import numpy as np
from pathlib import Path

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

# Statistical analysis
from scipy import stats
from scipy.stats import chi2_contingency, shapiro, kstest

# Utilities
import warnings
from typing import Dict, List, Tuple, Optional
from datetime import datetime
import sys

# Configure settings
warnings.filterwarnings('ignore')
plt.style.use('default')
sns.set_palette("husl")

# Configure Polars for better display
pl.Config.set_tbl_rows(20)
pl.Config.set_tbl_cols(15)
pl.Config.set_tbl_width_chars(120)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print("✅ All libraries imported successfully!")
print(f"Polars version: {pl.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

✅ All libraries imported successfully!
Polars version: 1.31.0
Pandas version: 2.3.0
NumPy version: 2.3.0


## 📂 High-Performance Dataset Loading

Using our custom Polars-based dataset reader for optimal performance and comprehensive quality assessment.

In [21]:
class StrokeDatasetLoader:
    """
    Specialized loader for stroke prediction dataset with comprehensive quality assessment
    """
    
    def __init__(self, show_progress: bool = True):
        self.show_progress = show_progress
        self.data_info = {}
        self.quality_metrics = {}
        
    def load_dataset(self, file_path: str) -> Tuple[pl.DataFrame, Dict]:
        """
        Load stroke dataset with comprehensive quality assessment
        
        Parameters:
        -----------
        file_path : str
            Path to the stroke dataset file
            
        Returns:
        --------
        Tuple[pl.DataFrame, Dict]
            Loaded dataset and quality assessment report
        """
        
        if self.show_progress:
            print("="*80)
            print("🏥 STROKE PREDICTION DATASET - LOADING & QUALITY ASSESSMENT")
            print("="*80)
        
        # Verify file exists
        file_path = Path(file_path)
        if not file_path.exists():
            raise FileNotFoundError(f"Dataset file not found: {file_path}")
        
        # File information
        file_size_mb = file_path.stat().st_size / (1024 * 1024)
        
        if self.show_progress:
            print(f"\n📂 DATASET FILE INFORMATION:")
            print(f"   File: {file_path.name}")
            print(f"   Size: {file_size_mb:.2f} MB")
            print(f"   Path: {file_path.absolute()}")
        
        # Load with optimized settings for medical data
        try:
            df = pl.read_csv(
                file_path,
                infer_schema_length=10000,  # Scan more rows for better type inference
                try_parse_dates=True,
                null_values=['', 'NULL', 'null', 'NA', 'N/A', 'nan', 'NaN', 'missing'],
                ignore_errors=False,  # Strict parsing for data quality
                low_memory=False,  # Use more memory for speed
                rechunk=True  # Optimize memory layout
            )
            
            if self.show_progress:
                print(f"✅ Dataset loaded successfully!")
                print(f"   Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
                print(f"   Memory usage: {df.estimated_size('mb'):.2f} MB")
            
        except Exception as e:
            print(f"❌ Error loading dataset: {str(e)}")
            raise
        
        # Store basic information
        self.data_info = {
            'file_path': str(file_path),
            'file_size_mb': file_size_mb,
            'shape': df.shape,
            'columns': df.columns,
            'dtypes': {col: str(dtype) for col, dtype in zip(df.columns, df.dtypes)},
            'memory_usage_mb': df.estimated_size('mb'),
            'load_timestamp': datetime.now().isoformat()
        }
        
        return df, self.data_info
    
    def comprehensive_quality_assessment(self, df: pl.DataFrame) -> Dict:
        """
        Perform comprehensive data quality assessment
        
        Parameters:
        -----------
        df : pl.DataFrame
            Dataset to assess
            
        Returns:
        --------
        Dict
            Comprehensive quality assessment report
        """
        
        if self.show_progress:
            print(f"\n🔍 COMPREHENSIVE DATA QUALITY ASSESSMENT")
            print("-" * 60)
        
        quality_report = {}
        
        # 1. Basic Dataset Metrics
        basic_metrics = self._assess_basic_metrics(df)
        quality_report['basic_metrics'] = basic_metrics
        
        # 2. Missing Values Analysis
        missing_analysis = self._analyze_missing_values(df)
        quality_report['missing_values'] = missing_analysis
        
        # 3. Data Types Assessment
        dtype_analysis = self._assess_data_types(df)
        quality_report['data_types'] = dtype_analysis
        
        # 4. Duplicate Analysis
        duplicate_analysis = self._analyze_duplicates(df)
        quality_report['duplicates'] = duplicate_analysis
        
        # 5. Statistical Summary
        statistical_summary = self._generate_statistical_summary(df)
        quality_report['statistical_summary'] = statistical_summary
        
        # 6. Medical Domain Validation
        medical_validation = self._medical_domain_validation(df)
        quality_report['medical_validation'] = medical_validation
        
        # 7. Target Variable Analysis
        target_analysis = self._analyze_target_variable(df)
        quality_report['target_analysis'] = target_analysis
        
        # 8. Overall Quality Score
        quality_score = self._calculate_overall_quality_score(quality_report)
        quality_report['overall_score'] = quality_score
        
        self.quality_metrics = quality_report
        return quality_report
    
    def _assess_basic_metrics(self, df: pl.DataFrame) -> Dict:
        """Assess basic dataset metrics"""
        
        if self.show_progress:
            print(f"\n📊 BASIC DATASET METRICS:")
            print("-" * 30)
        
        metrics = {
            'total_rows': df.height,
            'total_columns': df.width,
            'total_cells': df.height * df.width,
            'memory_usage_mb': df.estimated_size('mb'),
            'column_names': df.columns,
            'data_types_count': len(set(str(dtype) for dtype in df.dtypes))
        }
        
        if self.show_progress:
            print(f"   Total Rows: {metrics['total_rows']:,}")
            print(f"   Total Columns: {metrics['total_columns']}")
            print(f"   Total Cells: {metrics['total_cells']:,}")
            print(f"   Memory Usage: {metrics['memory_usage_mb']:.2f} MB")
            print(f"   Unique Data Types: {metrics['data_types_count']}")
        
        return metrics
    
    def _analyze_missing_values(self, df: pl.DataFrame) -> Dict:
        """Comprehensive missing values analysis"""
        
        if self.show_progress:
            print(f"\n🔍 MISSING VALUES ANALYSIS:")
            print("-" * 30)
        
        # Calculate missing values per column
        missing_per_column = []
        total_rows = df.height
        
        for col in df.columns:
            missing_count = df[col].null_count()
            missing_percentage = (missing_count / total_rows) * 100
            
            missing_per_column.append({
                'column': col,
                'missing_count': missing_count,
                'missing_percentage': missing_percentage,
                'data_type': str(df[col].dtype),
                'non_null_count': total_rows - missing_count
            })
        
        # Sort by missing percentage
        missing_df = pd.DataFrame(missing_per_column).sort_values('missing_percentage', ascending=False)
        
        # Summary statistics
        total_missing = missing_df['missing_count'].sum()
        columns_with_missing = len(missing_df[missing_df['missing_count'] > 0])
        overall_missing_rate = (total_missing / (total_rows * len(df.columns))) * 100
        
        # Categorize missing value severity
        if overall_missing_rate < 1:
            missing_severity = "EXCELLENT"
        elif overall_missing_rate < 5:
            missing_severity = "GOOD"
        elif overall_missing_rate < 15:
            missing_severity = "MODERATE"
        elif overall_missing_rate < 30:
            missing_severity = "POOR"
        else:
            missing_severity = "CRITICAL"
        
        if self.show_progress:
            print(f"   Total Missing Values: {total_missing:,}")
            print(f"   Overall Missing Rate: {overall_missing_rate:.2f}%")
            print(f"   Columns with Missing: {columns_with_missing}/{len(df.columns)}")
            print(f"   Missing Severity: {missing_severity}")
            
            if columns_with_missing > 0:
                print(f"\n   Top columns with missing values:")
                for _, row in missing_df.head(5).iterrows():
                    if row['missing_count'] > 0:
                        print(f"     • {row['column']}: {row['missing_count']:,} ({row['missing_percentage']:.1f}%)")
        
        return {
            'missing_summary': missing_df.to_dict('records'),
            'total_missing': int(total_missing),
            'overall_missing_rate': overall_missing_rate,
            'columns_with_missing': columns_with_missing,
            'missing_severity': missing_severity,
            'missing_patterns': self._analyze_missing_patterns(df)
        }
    
    def _analyze_missing_patterns(self, df: pl.DataFrame) -> Dict:
        """Analyze patterns in missing values"""
        
        # Convert to pandas for missing pattern analysis
        df_pd = df.to_pandas()
        
        # Find columns with missing values
        missing_cols = [col for col in df_pd.columns if df_pd[col].isnull().any()]
        
        if not missing_cols:
            return {'message': 'No missing values to analyze patterns'}
        
        # Analyze missing value patterns
        missing_patterns = df_pd[missing_cols].isnull()
        pattern_counts = missing_patterns.value_counts().head(10)
        
        return {
            'columns_with_missing': missing_cols,
            'top_missing_patterns': pattern_counts.to_dict(),
            'pattern_analysis': 'Complete - missing values appear independent' if len(pattern_counts) > 5 else 'Simple - few missing patterns'
        }
    
    def _assess_data_types(self, df: pl.DataFrame) -> Dict:
        """Assess data types and suggest optimizations"""
        
        if self.show_progress:
            print(f"\n🔍 DATA TYPES ASSESSMENT:")
            print("-" * 30)
        
        dtype_summary = {}
        optimization_suggestions = []
        
        for col in df.columns:
            dtype = str(df[col].dtype)
            unique_count = df[col].n_unique()
            
            dtype_summary[col] = {
                'current_type': dtype,
                'unique_values': unique_count,
                'null_count': df[col].null_count()
            }
            
            # Suggest optimizations
            if dtype == 'Utf8':  # String type
                if unique_count <= 10:
                    optimization_suggestions.append({
                        'column': col,
                        'current_type': dtype,
                        'suggested_type': 'Categorical',
                        'reason': f'Only {unique_count} unique values',
                        'memory_savings': 'Significant for repeated strings'
                    })
            
            elif dtype in ['Int64', 'Float64']:
                # Check for potential boolean
                non_null_values = df[col].drop_nulls().unique().sort()
                if len(non_null_values) == 2 and set(non_null_values.to_list()) <= {0, 1}:
                    optimization_suggestions.append({
                        'column': col,
                        'current_type': dtype,
                        'suggested_type': 'Boolean',
                        'reason': 'Only contains 0/1 values',
                        'memory_savings': 'Moderate'
                    })
        
        # Data type distribution
        dtype_counts = {}
        for dtype_info in dtype_summary.values():
            dtype = dtype_info['current_type']
            dtype_counts[dtype] = dtype_counts.get(dtype, 0) + 1
        
        if self.show_progress:
            print(f"   Data Type Distribution:")
            for dtype, count in sorted(dtype_counts.items()):
                print(f"     • {dtype}: {count} columns")
            
            if optimization_suggestions:
                print(f"\n   Optimization Suggestions:")
                for suggestion in optimization_suggestions[:5]:
                    print(f"     • {suggestion['column']}: {suggestion['current_type']} → {suggestion['suggested_type']}")
        
        return {
            'dtype_summary': dtype_summary,
            'dtype_distribution': dtype_counts,
            'optimization_suggestions': optimization_suggestions
        }
    
    def _analyze_duplicates(self, df: pl.DataFrame) -> Dict:
        """Analyze duplicate rows"""
        
        if self.show_progress:
            print(f"\n🔍 DUPLICATE ANALYSIS:")
            print("-" * 30)
        
        total_rows = df.height
        unique_rows = df.unique().height
        duplicate_rows = total_rows - unique_rows
        duplicate_percentage = (duplicate_rows / total_rows) * 100
        
        # Assess duplicate severity
        if duplicate_percentage == 0:
            duplicate_severity = "EXCELLENT"
        elif duplicate_percentage < 1:
            duplicate_severity = "GOOD"
        elif duplicate_percentage < 5:
            duplicate_severity = "MODERATE"
        elif duplicate_percentage < 15:
            duplicate_severity = "POOR"
        else:
            duplicate_severity = "CRITICAL"
        
        if self.show_progress:
            print(f"   Total Rows: {total_rows:,}")
            print(f"   Unique Rows: {unique_rows:,}")
            print(f"   Duplicate Rows: {duplicate_rows:,}")
            print(f"   Duplicate Rate: {duplicate_percentage:.2f}%")
            print(f"   Duplicate Severity: {duplicate_severity}")
        
        return {
            'total_rows': total_rows,
            'unique_rows': unique_rows,
            'duplicate_rows': duplicate_rows,
            'duplicate_percentage': duplicate_percentage,
            'duplicate_severity': duplicate_severity
        }
    
    def _generate_statistical_summary(self, df: pl.DataFrame) -> Dict:
        """Generate comprehensive statistical summary"""
        
        if self.show_progress:
            print(f"\n🔍 STATISTICAL SUMMARY:")
            print("-" * 30)
        
        # Identify numerical columns
        numerical_cols = []
        categorical_cols = []
        
        for col in df.columns:
            if df[col].dtype in [pl.Int8, pl.Int16, pl.Int32, pl.Int64, 
                               pl.Float32, pl.Float64, pl.UInt8, pl.UInt16, pl.UInt32, pl.UInt64]:
                numerical_cols.append(col)
            else:
                categorical_cols.append(col)
        
        # Numerical statistics
        numerical_stats = {}
        for col in numerical_cols:
            col_data = df[col].drop_nulls()
            
            if len(col_data) > 0:
                numerical_stats[col] = {
                    'count': len(col_data),
                    'mean': col_data.mean(),
                    'median': col_data.median(),
                    'std': col_data.std(),
                    'min': col_data.min(),
                    'max': col_data.max(),
                    'q25': col_data.quantile(0.25),
                    'q75': col_data.quantile(0.75),
                    'iqr': col_data.quantile(0.75) - col_data.quantile(0.25),
                    'skewness': self._calculate_skewness(col_data),
                    'kurtosis': self._calculate_kurtosis(col_data),
                    'cv': col_data.std() / col_data.mean() if col_data.mean() != 0 else 0  # Coefficient of variation
                }
        
        # Categorical statistics
        categorical_stats = {}
        for col in categorical_cols:
            value_counts = df[col].value_counts()
            total_count = df[col].count()
            
            categorical_stats[col] = {
                'unique_count': df[col].n_unique(),
                'most_frequent': value_counts[0, col] if len(value_counts) > 0 else None,
                'most_frequent_count': value_counts[0, 'count'] if len(value_counts) > 0 else 0,
                'least_frequent': value_counts[-1, col] if len(value_counts) > 0 else None,
                'concentration_ratio': value_counts[0, 'count'] / total_count if total_count > 0 and len(value_counts) > 0 else 0
            }
        
        if self.show_progress:
            print(f"   Numerical Columns: {len(numerical_cols)}")
            print(f"   Categorical Columns: {len(categorical_cols)}")
            
            if numerical_cols:
                print(f"\n   Sample Numerical Statistics:")
                for col in numerical_cols[:3]:  # Show first 3
                    stats = numerical_stats[col]
                    print(f"     • {col}: μ={stats['mean']:.2f}, σ={stats['std']:.2f}, range=[{stats['min']:.1f}, {stats['max']:.1f}]")
            
            if categorical_cols:
                print(f"\n   Sample Categorical Statistics:")
                for col in categorical_cols[:3]:  # Show first 3
                    stats = categorical_stats[col]
                    print(f"     • {col}: {stats['unique_count']} unique values, top: '{stats['most_frequent']}'")
        
        return {
            'numerical_columns': numerical_cols,
            'categorical_columns': categorical_cols,
            'numerical_statistics': numerical_stats,
            'categorical_statistics': categorical_stats
        }
    
    def _medical_domain_validation(self, df: pl.DataFrame) -> Dict:
        """Validate data against medical domain knowledge"""
        
        if self.show_progress:
            print(f"\n🏥 MEDICAL DOMAIN VALIDATION:")
            print("-" * 30)
        
        validation_results = {}
        issues = []
        
        # Age validation
        if 'age' in df.columns:
            age_issues = []
            invalid_ages = df.filter((pl.col('age') < 0) | (pl.col('age') > 120)).height
            if invalid_ages > 0:
                age_issues.append(f"Invalid ages found: {invalid_ages} records")
            
            mean_age = df['age'].mean()
            if mean_age < 20 or mean_age > 80:
                age_issues.append(f"Unusual mean age: {mean_age:.1f} years")
            
            validation_results['age'] = {
                'invalid_count': invalid_ages,
                'mean_age': mean_age,
                'issues': age_issues
            }
            issues.extend(age_issues)
        
        # BMI validation
        if 'bmi' in df.columns:
            bmi_issues = []
            invalid_bmi = df.filter((pl.col('bmi') < 10) | (pl.col('bmi') > 60)).height
            if invalid_bmi > 0:
                bmi_issues.append(f"Extreme BMI values: {invalid_bmi} records")
            
            validation_results['bmi'] = {
                'invalid_count': invalid_bmi,
                'issues': bmi_issues
            }
            issues.extend(bmi_issues)
        
        # Glucose validation
        if 'avg_glucose_level' in df.columns:
            glucose_issues = []
            invalid_glucose = df.filter((pl.col('avg_glucose_level') < 50) | (pl.col('avg_glucose_level') > 500)).height
            if invalid_glucose > 0:
                glucose_issues.append(f"Extreme glucose values: {invalid_glucose} records")
            
            validation_results['glucose'] = {
                'invalid_count': invalid_glucose,
                'issues': glucose_issues
            }
            issues.extend(glucose_issues)
        
        # Binary variables validation (should be 0/1)
        binary_vars = ['hypertension', 'heart_disease', 'stroke']
        for var in binary_vars:
            if var in df.columns:
                unique_vals = set(df[var].drop_nulls().unique().to_list())
                if not unique_vals.issubset({0, 1}):
                    issues.append(f"{var} contains non-binary values: {unique_vals}")
        
        # Overall validation status
        if not issues:
            validation_status = "PASSED"
            validation_score = 100
        elif len(issues) <= 2:
            validation_status = "MINOR_ISSUES"
            validation_score = 80
        elif len(issues) <= 5:
            validation_status = "MODERATE_ISSUES"
            validation_score = 60
        else:
            validation_status = "MAJOR_ISSUES"
            validation_score = 30
        
        if self.show_progress:
            print(f"   Validation Status: {validation_status}")
            print(f"   Validation Score: {validation_score}/100")
            
            if issues:
                print(f"   Issues Found:")
                for issue in issues[:5]:  # Show top 5 issues
                    print(f"     • {issue}")
        
        return {
            'validation_status': validation_status,
            'validation_score': validation_score,
            'issues': issues,
            'detailed_results': validation_results
        }
    
    def _analyze_target_variable(self, df: pl.DataFrame, target_col: str = 'stroke') -> Dict:
        """Comprehensive analysis of target variable"""
        
        if self.show_progress:
            print(f"\n🎯 TARGET VARIABLE ANALYSIS ({target_col}):")
            print("-" * 40)
        
        if target_col not in df.columns:
            return {'error': f'Target column {target_col} not found'}
        
        target_data = df[target_col]
        
        # Basic statistics
        value_counts = target_data.value_counts().sort('count', descending=True)
        total_count = target_data.count()
        null_count = target_data.null_count()
        
        # Calculate class distribution
        class_distribution = {}
        imbalance_ratio = 0
        
        if len(value_counts) > 0:
            for row in value_counts.iter_rows(named=True):
                class_val = row[target_col]
                count = row['count']
                percentage = (count / total_count) * 100
                class_distribution[class_val] = {
                    'count': count,
                    'percentage': percentage
                }
            
            # Calculate imbalance ratio
            counts_list = [info['count'] for info in class_distribution.values()]
            if len(counts_list) >= 2:
                imbalance_ratio = max(counts_list) / min(counts_list)
        
        # Assess class balance
        minority_percentage = min([info['percentage'] for info in class_distribution.values()]) if class_distribution else 0
        
        if minority_percentage >= 40:
            balance_status = "WELL_BALANCED"
        elif minority_percentage >= 20:
            balance_status = "SLIGHTLY_IMBALANCED"
        elif minority_percentage >= 10:
            balance_status = "MODERATELY_IMBALANCED"
        elif minority_percentage >= 5:
            balance_status = "SEVERELY_IMBALANCED"
        else:
            balance_status = "EXTREMELY_IMBALANCED"
        
        if self.show_progress:
            print(f"   Unique Values: {len(class_distribution)}")
            print(f"   Total Records: {total_count:,}")
            print(f"   Missing Values: {null_count}")
            
            print(f"\n   Class Distribution:")
            for class_val, info in class_distribution.items():
                print(f"     • Class {class_val}: {info['count']:,} ({info['percentage']:.1f}%)")
            
            print(f"\n   Imbalance Ratio: {imbalance_ratio:.1f}:1")
            print(f"   Balance Status: {balance_status}")
            
            # Clinical interpretation for stroke data
            if target_col == 'stroke':
                print(f"\n   Clinical Interpretation:")
                if minority_percentage < 10:
                    print(f"     ✅ Realistic stroke prevalence for general population")
                    print(f"     ⚠️  Severe imbalance will require specialized ML techniques")
                else:
                    print(f"     ⚠️  Higher than expected stroke rate - verify data source")
        
        return {
            'unique_values': len(class_distribution),
            'total_count': total_count,
            'null_count': null_count,
            'class_distribution': class_distribution,
            'imbalance_ratio': imbalance_ratio,
            'balance_status': balance_status,
            'minority_percentage': minority_percentage
        }
    
    def _calculate_overall_quality_score(self, quality_report: Dict) -> Dict:
        """Calculate comprehensive data quality score"""
        
        if self.show_progress:
            print(f"\n🏆 OVERALL DATA QUALITY ASSESSMENT:")
            print("-" * 40)
        
        score = 100.0
        deductions = []
        
        # Missing values penalty
        missing_rate = quality_report['missing_values']['overall_missing_rate']
        if missing_rate > 30:
            penalty = 40
            deductions.append(f"Critical missing values ({missing_rate:.1f}%): -{penalty} points")
        elif missing_rate > 15:
            penalty = 25
            deductions.append(f"High missing values ({missing_rate:.1f}%): -{penalty} points")
        elif missing_rate > 5:
            penalty = 10
            deductions.append(f"Moderate missing values ({missing_rate:.1f}%): -{penalty} points")
        elif missing_rate > 1:
            penalty = 5
            deductions.append(f"Low missing values ({missing_rate:.1f}%): -{penalty} points")
        else:
            penalty = 0
        score -= penalty
        
        # Duplicate penalty
        duplicate_rate = quality_report['duplicates']['duplicate_percentage']
        if duplicate_rate > 15:
            penalty = 25
            deductions.append(f"High duplicate rate ({duplicate_rate:.1f}%): -{penalty} points")
        elif duplicate_rate > 5:
            penalty = 15
            deductions.append(f"Moderate duplicate rate ({duplicate_rate:.1f}%): -{penalty} points")
        elif duplicate_rate > 1:
            penalty = 5
            deductions.append(f"Low duplicate rate ({duplicate_rate:.1f}%): -{penalty} points")
        else:
            penalty = 0
        score -= penalty
        
        # Medical validation penalty
        medical_score = quality_report['medical_validation']['validation_score']
        if medical_score < 100:
            penalty = (100 - medical_score) * 0.3  # 30% weight for medical validation
            deductions.append(f"Medical validation issues: -{penalty:.1f} points")
            score -= penalty
        
        # Class imbalance consideration (informational, not penalty for medical data)
        if 'target_analysis' in quality_report:
            imbalance_ratio = quality_report['target_analysis']['imbalance_ratio']
            if imbalance_ratio > 20:
                deductions.append(f"Severe class imbalance ({imbalance_ratio:.1f}:1): Consider specialized techniques")
        
        # Ensure score doesn't go below 0
        score = max(0, score)
        
        # Determine grade
        if score >= 95:
            grade = "EXCELLENT"
            recommendation = "Dataset is ready for analysis"
        elif score >= 85:
            grade = "VERY_GOOD"
            recommendation = "Minor issues, suitable for analysis with minimal preprocessing"
        elif score >= 75:
            grade = "GOOD"
            recommendation = "Some issues present, address before analysis"
        elif score >= 65:
            grade = "FAIR"
            recommendation = "Multiple issues, significant preprocessing required"
        elif score >= 50:
            grade = "POOR"
            recommendation = "Major quality issues, extensive cleaning needed"
        else:
            grade = "VERY_POOR"
            recommendation = "Critical quality issues, consider data source reliability"
        
        if self.show_progress:
            print(f"   Overall Quality Score: {score:.1f}/100")
            print(f"   Quality Grade: {grade}")
            print(f"   Recommendation: {recommendation}")
            
            if deductions:
                print(f"\n   Score Deductions:")
                for deduction in deductions:
                    print(f"     • {deduction}")
        
        return {
            'score': score,
            'grade': grade,
            'recommendation': recommendation,
            'deductions': deductions
        }
    
    def _calculate_skewness(self, series: pl.Series) -> float:
        """Calculate skewness of a Polars series"""
        try:
            values = series.to_numpy()
            if len(values) < 3:
                return 0.0
            
            mean_val = np.mean(values)
            std_val = np.std(values, ddof=0)
            
            if std_val == 0:
                return 0.0
            
            skew_val = np.mean(((values - mean_val) / std_val) ** 3)
            return float(skew_val)
        except:
            return 0.0
    
    def _calculate_kurtosis(self, series: pl.Series) -> float:
        """Calculate kurtosis of a Polars series"""
        try:
            values = series.to_numpy()
            if len(values) < 4:
                return 0.0
            
            mean_val = np.mean(values)
            std_val = np.std(values, ddof=0)
            
            if std_val == 0:
                return 0.0
            
            kurt_val = np.mean(((values - mean_val) / std_val) ** 4) - 3
            return float(kurt_val)
        except:
            return 0.0

# Initialize the loader

In [22]:
# Initialize the loader
loader = StrokeDatasetLoader(show_progress=True)
print("🔧 StrokeDatasetLoader initialized successfully!")

🔧 StrokeDatasetLoader initialized successfully!


## 📂 Loading the Stroke Prediction Dataset

Loading the healthcare stroke dataset with comprehensive error handling and initial assessment.

In [23]:
# Specify dataset path (adjust as needed)
DATASET_PATH = "../data/healthcare-dataset-stroke-data.csv"

# Alternative paths to try if the main path doesn't work
ALTERNATIVE_PATHS = [
    "data/healthcare-dataset-stroke-data.csv",
    "../data/healthcare-dataset-stroke-data.csv",
    "datasets/healthcare-dataset-stroke-data.csv"
]

# Try to load the dataset
dataset_loaded = False
df = None
data_info = None

print("🔍 Attempting to load stroke prediction dataset...")

# Try main path first
try:
    if Path(DATASET_PATH).exists():
        df, data_info = loader.load_dataset(DATASET_PATH)
        dataset_loaded = True
        print(f"✅ Dataset loaded from: {DATASET_PATH}")
    else:
        print(f"❌ Dataset not found at: {DATASET_PATH}")
except Exception as e:
    print(f"❌ Error loading from main path: {e}")

# Try alternative paths if main path failed
if not dataset_loaded:
    print("\n🔍 Trying alternative paths...")
    for alt_path in ALTERNATIVE_PATHS:
        try:
            if Path(alt_path).exists():
                df, data_info = loader.load_dataset(alt_path)
                dataset_loaded = True
                print(f"✅ Dataset loaded from: {alt_path}")
                break
        except Exception as e:
            print(f"❌ Failed to load from {alt_path}: {e}")

# If still not loaded, provide instructions
if not dataset_loaded:
    print("\n❌ DATASET NOT FOUND!")
    print("\n📋 Instructions to resolve:")
    print("1. Ensure the stroke dataset file is in the project directory")
    print("2. Verify the filename matches exactly: 'healthcaredatasetstrokedata 1.csv'")
    print("3. Check file permissions")
    print("4. Update DATASET_PATH variable above with correct path")
    print("\n🔍 Expected file structure:")
    print("   project/")
    print("   ├── healthcaredatasetstrokedata 1.csv  ← Dataset should be here")
    print("   ├── 01_data_loading_and_quality.ipynb")
    print("   └── ...")
    
    # Create a small sample dataset for demonstration
    print("\n🔧 Creating sample dataset for demonstration...")
    
    sample_data = {
        'id': range(1, 101),
        'gender': ['Male', 'Female'] * 50,
        'age': np.random.normal(45, 15, 100).clip(18, 85),
        'hypertension': np.random.choice([0, 1], 100, p=[0.7, 0.3]),
        'heart_disease': np.random.choice([0, 1], 100, p=[0.8, 0.2]),
        'ever_married': ['Yes', 'No'] * 50,
        'work_type': np.random.choice(['Private', 'Self-employed', 'Govt_job'], 100),
        'Residence_type': ['Urban', 'Rural'] * 50,
        'avg_glucose_level': np.random.normal(105, 30, 100).clip(70, 250),
        'bmi': np.random.normal(28, 5, 100).clip(15, 45),
        'smoking_status': np.random.choice(['never smoked', 'formerly smoked', 'smokes'], 100),
        'stroke': np.random.choice([0, 1], 100, p=[0.95, 0.05])
    }
    
    df = pl.DataFrame(sample_data)
    print("✅ Sample dataset created for demonstration")
    print("⚠️  Remember to replace with actual dataset for real analysis!")

else:
    print(f"\n🎉 Dataset successfully loaded!")
    print(f"📊 Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")

🔍 Attempting to load stroke prediction dataset...
🏥 STROKE PREDICTION DATASET - LOADING & QUALITY ASSESSMENT

📂 DATASET FILE INFORMATION:
   File: healthcare-dataset-stroke-data.csv
   Size: 0.30 MB
   Path: /Users/sourangshupal/Downloads/Module1_FinalProject/notebooks/../data/healthcare-dataset-stroke-data.csv
✅ Dataset loaded successfully!
   Shape: 5,110 rows × 12 columns
   Memory usage: 0.43 MB
✅ Dataset loaded from: ../data/healthcare-dataset-stroke-data.csv

🎉 Dataset successfully loaded!
📊 Shape: 5,110 rows × 12 columns


## 🔍 Initial Data Exploration

Quick overview of the dataset structure, columns, and basic properties.

In [24]:
print("="*80)
print("📊 INITIAL DATA EXPLORATION")
print("="*80)

# Dataset shape and basic info
print(f"\n📏 Dataset Dimensions:")
print(f"   Rows: {df.shape[0]:,}")
print(f"   Columns: {df.shape[1]}")
print(f"   Total Cells: {df.shape[0] * df.shape[1]:,}")

# Column information
print(f"\n📋 Column Information:")
print("-" * 50)

column_info = []
for i, col in enumerate(df.columns, 1):
    dtype = str(df[col].dtype)
    unique_vals = df[col].n_unique()
    null_count = df[col].null_count()
    
    column_info.append({
        'Index': i,
        'Column': col,
        'Data_Type': dtype,
        'Unique_Values': unique_vals,
        'Null_Count': null_count,
        'Null_Percentage': f"{(null_count/df.shape[0]*100):.1f}%"
    })

# Display column information
column_df = pd.DataFrame(column_info)
print(column_df.to_string(index=False))

# Data types summary
print(f"\n🔧 Data Types Summary:")
dtype_counts = {}
for col in df.columns:
    dtype = str(df[col].dtype)
    dtype_counts[dtype] = dtype_counts.get(dtype, 0) + 1

for dtype, count in sorted(dtype_counts.items()):
    print(f"   {dtype}: {count} columns")

# Sample data preview
print(f"\n👀 First 5 Rows:")
print("-" * 50)
df_sample = df.head(5).to_pandas()
print(df_sample.to_string(index=False))

print(f"\n👀 Last 5 Rows:")
print("-" * 50)
df_sample_tail = df.tail(5).to_pandas()
print(df_sample_tail.to_string(index=False))

# Memory usage information
print(f"\n💾 Memory Usage:")
print(f"   Dataset Size: {df.estimated_size('mb'):.2f} MB")
print(f"   Average Row Size: {df.estimated_size('mb')*1024/df.shape[0]:.2f} KB")

📊 INITIAL DATA EXPLORATION

📏 Dataset Dimensions:
   Rows: 5,110
   Columns: 12
   Total Cells: 61,320

📋 Column Information:
--------------------------------------------------
 Index            Column Data_Type  Unique_Values  Null_Count Null_Percentage
     1                id     Int64           5110           0            0.0%
     2            gender    String              3           0            0.0%
     3               age   Float64            104           0            0.0%
     4      hypertension     Int64              2           0            0.0%
     5     heart_disease     Int64              2           0            0.0%
     6      ever_married    String              2           0            0.0%
     7         work_type    String              5           0            0.0%
     8    Residence_type    String              2           0            0.0%
     9 avg_glucose_level   Float64           3979           0            0.0%
    10               bmi   Float64         

## 🔍 Comprehensive Data Quality Assessment

Performing in-depth analysis of data quality across multiple dimensions including completeness, validity, consistency, and medical domain validation.

In [25]:
# Run comprehensive quality assessment
print("🚀 Starting comprehensive data quality assessment...")
print("This may take a moment for large datasets...")

quality_report = loader.comprehensive_quality_assessment(df)

print(f"\n✅ Quality assessment completed!")
print(f"📊 Overall Quality Score: {quality_report['overall_score']['score']:.1f}/100")
print(f"🏆 Quality Grade: {quality_report['overall_score']['grade']}")

🚀 Starting comprehensive data quality assessment...
This may take a moment for large datasets...

🔍 COMPREHENSIVE DATA QUALITY ASSESSMENT
------------------------------------------------------------

📊 BASIC DATASET METRICS:
------------------------------
   Total Rows: 5,110
   Total Columns: 12
   Total Cells: 61,320
   Memory Usage: 0.43 MB
   Unique Data Types: 3

🔍 MISSING VALUES ANALYSIS:
------------------------------
   Total Missing Values: 201
   Overall Missing Rate: 0.33%
   Columns with Missing: 1/12
   Missing Severity: EXCELLENT

   Top columns with missing values:
     • bmi: 201 (3.9%)

🔍 DATA TYPES ASSESSMENT:
------------------------------
   Data Type Distribution:
     • Float64: 3 columns
     • Int64: 4 columns
     • String: 5 columns

   Optimization Suggestions:
     • hypertension: Int64 → Boolean
     • heart_disease: Int64 → Boolean
     • stroke: Int64 → Boolean

🔍 DUPLICATE ANALYSIS:
------------------------------
   Total Rows: 5,110
   Unique Rows: 5,11

## 🔍 Missing Values Deep Dive Analysis

Detailed analysis of missing value patterns, impact assessment, and recommendations for handling.

In [26]:
print("="*80)
print("🕳️  MISSING VALUES DEEP DIVE ANALYSIS")
print("="*80)

missing_report = quality_report['missing_values']

# Missing values overview
print(f"\n📊 Missing Values Overview:")
print(f"   Total Missing Values: {missing_report['total_missing']:,}")
print(f"   Overall Missing Rate: {missing_report['overall_missing_rate']:.2f}%")
print(f"   Missing Severity: {missing_report['missing_severity']}")
print(f"   Columns Affected: {missing_report['columns_with_missing']}/{len(df.columns)}")

# Detailed missing values by column
print(f"\n📋 Missing Values by Column:")
print("-" * 60)

missing_df = pd.DataFrame(missing_report['missing_summary'])
if len(missing_df) > 0:
    # Sort by missing percentage
    missing_df_sorted = missing_df.sort_values('missing_percentage', ascending=False)
    
    print(f"{'Column':<20} {'Missing Count':<12} {'Missing %':<10} {'Data Type':<15}")
    print("-" * 60)
    
    for _, row in missing_df_sorted.iterrows():
        if row['missing_count'] > 0:
            print(f"{row['column']:<20} {row['missing_count']:<12} {row['missing_percentage']:<10.2f} {row['data_type']:<15}")
    
    # Missing values recommendations
    print(f"\n💡 Missing Values Recommendations:")
    print("-" * 40)
    
    for _, row in missing_df_sorted.iterrows():
        if row['missing_count'] > 0:
            col_name = row['column']
            missing_pct = row['missing_percentage']
            
            if col_name.lower() in ['bmi']:
                print(f"   • {col_name}: Use KNN imputation with age, gender, and health status")
            elif col_name.lower() in ['smoking_status']:
                print(f"   • {col_name}: Create 'Unknown' category (clinically meaningful)")
            elif missing_pct > 50:
                print(f"   • {col_name}: Consider dropping column ({missing_pct:.1f}% missing)")
            elif missing_pct > 20:
                print(f"   • {col_name}: Use advanced imputation techniques")
            elif missing_pct > 5:
                print(f"   • {col_name}: Use median/mode imputation or predictive models")
            else:
                print(f"   • {col_name}: Simple imputation acceptable")
else:
    print("✅ No missing values found in the dataset!")

# Missing value patterns
if 'missing_patterns' in missing_report and missing_report['missing_patterns'].get('columns_with_missing'):
    print(f"\n🔍 Missing Value Patterns:")
    print("-" * 30)
    
    patterns = missing_report['missing_patterns']
    print(f"   Columns with missing values: {patterns['columns_with_missing']}")
    print(f"   Pattern complexity: {patterns['pattern_analysis']}")

🕳️  MISSING VALUES DEEP DIVE ANALYSIS

📊 Missing Values Overview:
   Total Missing Values: 201
   Overall Missing Rate: 0.33%
   Missing Severity: EXCELLENT
   Columns Affected: 1/12

📋 Missing Values by Column:
------------------------------------------------------------
Column               Missing Count Missing %  Data Type      
------------------------------------------------------------
bmi                  201          3.93       Float64        

💡 Missing Values Recommendations:
----------------------------------------
   • bmi: Use KNN imputation with age, gender, and health status

🔍 Missing Value Patterns:
------------------------------
   Columns with missing values: ['bmi']
   Pattern complexity: Simple - few missing patterns


## 📊 Statistical Summary and Distribution Analysis

Comprehensive statistical analysis of numerical and categorical variables with distribution assessment.

In [27]:
print("="*80)
print("📈 STATISTICAL SUMMARY AND DISTRIBUTION ANALYSIS")
print("="*80)

stats_report = quality_report['statistical_summary']

# Numerical variables analysis
print(f"\n🔢 Numerical Variables Analysis:")
print("-" * 40)

numerical_stats = stats_report['numerical_statistics']
if numerical_stats:
    print(f"Number of numerical columns: {len(numerical_stats)}")
    print(f"\nDetailed Statistics:")
    print("-" * 80)
    
    # Create a comprehensive statistics table
    stats_table = []
    for col, stats in numerical_stats.items():
        stats_table.append({
            'Column': col,
            'Count': f"{stats['count']:,}",
            'Mean': f"{stats['mean']:.2f}",
            'Median': f"{stats['median']:.2f}",
            'Std': f"{stats['std']:.2f}",
            'Min': f"{stats['min']:.2f}",
            'Max': f"{stats['max']:.2f}",
            'IQR': f"{stats['iqr']:.2f}",
            'Skewness': f"{stats['skewness']:.2f}",
            'Kurtosis': f"{stats['kurtosis']:.2f}"
        })
    
    stats_df = pd.DataFrame(stats_table)
    print(stats_df.to_string(index=False))
    
    # Distribution analysis
    print(f"\n📊 Distribution Analysis:")
    print("-" * 30)
    
    for col, stats in numerical_stats.items():
        skew = stats['skewness']
        kurt = stats['kurtosis']
        cv = stats['cv']
        
        print(f"\n{col}:")
        
        # Skewness interpretation
        if abs(skew) < 0.5:
            skew_desc = "approximately symmetric"
        elif abs(skew) < 1.0:
            skew_desc = "moderately skewed"
        else:
            skew_desc = "highly skewed"
        
        skew_direction = "right" if skew > 0 else "left"
        if abs(skew) > 0.5:
            skew_desc = f"{skew_desc} ({skew_direction})"
        
        # Kurtosis interpretation
        if abs(kurt) < 0.5:
            kurt_desc = "normal tailedness"
        elif kurt > 0.5:
            kurt_desc = "heavy-tailed"
        else:
            kurt_desc = "light-tailed"
        
        # Coefficient of variation interpretation
        if cv < 0.1:
            variability = "low variability"
        elif cv < 0.3:
            variability = "moderate variability"
        else:
            variability = "high variability"
        
        print(f"   Distribution: {skew_desc}")
        print(f"   Tail behavior: {kurt_desc}")
        print(f"   Variability: {variability} (CV = {cv:.2f})")
        
        # Clinical interpretation for medical variables
        if col.lower() == 'age':
            if stats['mean'] < 30:
                print(f"   Clinical note: Young population (mean age {stats['mean']:.1f})")
            elif stats['mean'] > 65:
                print(f"   Clinical note: Elderly population (mean age {stats['mean']:.1f})")
            else:
                print(f"   Clinical note: Adult population (mean age {stats['mean']:.1f})")
        
        elif col.lower() == 'bmi':
            if stats['mean'] < 18.5:
                print(f"   Clinical note: Underweight population (mean BMI {stats['mean']:.1f})")
            elif stats['mean'] < 25:
                print(f"   Clinical note: Normal weight population (mean BMI {stats['mean']:.1f})")
            elif stats['mean'] < 30:
                print(f"   Clinical note: Overweight population (mean BMI {stats['mean']:.1f})")
            else:
                print(f"   Clinical note: Obese population (mean BMI {stats['mean']:.1f})")
        
        elif 'glucose' in col.lower():
            if stats['mean'] < 100:
                print(f"   Clinical note: Normal glucose levels (mean {stats['mean']:.1f} mg/dL)")
            elif stats['mean'] < 126:
                print(f"   Clinical note: Prediabetic range (mean {stats['mean']:.1f} mg/dL)")
            else:
                print(f"   Clinical note: Diabetic range (mean {stats['mean']:.1f} mg/dL)")

else:
    print("No numerical variables found in the dataset.")

# Categorical variables analysis
print(f"\n🏷️  Categorical Variables Analysis:")
print("-" * 40)

categorical_stats = stats_report['categorical_statistics']
if categorical_stats:
    print(f"Number of categorical columns: {len(categorical_stats)}")
    
    for col, stats in categorical_stats.items():
        print(f"\n{col}:")
        print(f"   Unique values: {stats['unique_count']}")
        print(f"   Most frequent: '{stats['most_frequent']}' ({stats['most_frequent_count']} occurrences)")
        print(f"   Least frequent: '{stats['least_frequent']}'")
        print(f"   Concentration ratio: {stats['concentration_ratio']:.2f}")
        
        # Assess diversity
        if stats['concentration_ratio'] > 0.8:
            diversity = "Low diversity (highly concentrated)"
        elif stats['concentration_ratio'] > 0.5:
            diversity = "Moderate diversity"
        else:
            diversity = "High diversity (well distributed)"
        
        print(f"   Diversity assessment: {diversity}")
else:
    print("No categorical variables found in the dataset.")

📈 STATISTICAL SUMMARY AND DISTRIBUTION ANALYSIS

🔢 Numerical Variables Analysis:
----------------------------------------
Number of numerical columns: 7

Detailed Statistics:
--------------------------------------------------------------------------------
           Column Count     Mean   Median      Std   Min      Max      IQR Skewness Kurtosis
               id 5,110 36517.83 36932.00 21161.72 67.00 72940.00 36955.00    -0.02    -1.21
              age 5,110    43.23    45.00    22.61  0.08    82.00    36.00    -0.14    -0.99
     hypertension 5,110     0.10     0.00     0.30  0.00     1.00     0.00     2.71     5.37
    heart_disease 5,110     0.05     0.00     0.23  0.00     1.00     0.00     3.95    13.57
avg_glucose_level 5,110   106.15    91.88    45.28 55.12   271.74    36.85     1.57     1.68
              bmi 4,909    28.89    28.10     7.85 10.30    97.60     9.60     1.06     3.36
           stroke 5,110     0.05     0.00     0.22  0.00     1.00     0.00     4.19    15.57


## 🎯 Target Variable Analysis (Stroke)

In-depth analysis of the target variable including class distribution, imbalance assessment, and clinical interpretation.

In [28]:
print("="*80)
print("🎯 TARGET VARIABLE ANALYSIS")
print("="*80)

target_report = quality_report['target_analysis']

if 'error' not in target_report:
    print(f"Target Variable: stroke")
    print(f"Analysis Type: Binary Classification")
    
    # Basic statistics
    print(f"\n📊 Basic Statistics:")
    print(f"   Total Records: {target_report['total_count']:,}")
    print(f"   Missing Values: {target_report['null_count']}")
    print(f"   Unique Classes: {target_report['unique_values']}")
    
    # Class distribution
    print(f"\n📈 Class Distribution:")
    print("-" * 40)
    
    class_dist = target_report['class_distribution']
    
    print(f"{'Class':<10} {'Count':<10} {'Percentage':<12} {'Description':<20}")
    print("-" * 52)
    
    for class_val, info in class_dist.items():
        description = "No Stroke" if class_val == 0 else "Stroke" if class_val == 1 else str(class_val)
        print(f"{class_val:<10} {info['count']:<10} {info['percentage']:<12.2f} {description:<20}")
    
    # Imbalance analysis
    print(f"\n⚖️  Class Imbalance Analysis:")
    print("-" * 30)
    
    imbalance_ratio = target_report['imbalance_ratio']
    balance_status = target_report['balance_status']
    minority_pct = target_report['minority_percentage']
    
    print(f"   Imbalance Ratio: {imbalance_ratio:.1f}:1")
    print(f"   Minority Class %: {minority_pct:.2f}%")
    print(f"   Balance Status: {balance_status}")
    
    # Clinical interpretation
    print(f"\n🏥 Clinical Interpretation:")
    print("-" * 30)
    
    stroke_rate = minority_pct if len(class_dist) == 2 and min(class_dist.keys()) == 0 else minority_pct
    
    if stroke_rate < 2:
        clinical_assessment = "Very low stroke rate - typical for young, healthy population"
    elif stroke_rate < 5:
        clinical_assessment = "Low stroke rate - typical for general adult population"
    elif stroke_rate < 10:
        clinical_assessment = "Moderate stroke rate - may indicate higher-risk population"
    elif stroke_rate < 20:
        clinical_assessment = "High stroke rate - likely high-risk or clinical population"
    else:
        clinical_assessment = "Very high stroke rate - verify data source and population"
    
    print(f"   Stroke Rate: {stroke_rate:.2f}%")
    print(f"   Assessment: {clinical_assessment}")
    
    # Machine learning implications
    print(f"\n🤖 Machine Learning Implications:")
    print("-" * 35)
    
    if imbalance_ratio > 20:
        ml_recommendation = "Severe imbalance - require SMOTE, class weights, and specialized metrics"
    elif imbalance_ratio > 10:
        ml_recommendation = "High imbalance - use balanced sampling and cost-sensitive learning"
    elif imbalance_ratio > 3:
        ml_recommendation = "Moderate imbalance - consider class weights and balanced validation"
    else:
        ml_recommendation = "Balanced dataset - standard ML techniques applicable"
    
    print(f"   Imbalance Severity: {balance_status}")
    print(f"   Recommendation: {ml_recommendation}")
    
    # Evaluation metrics recommendations
    print(f"\n📏 Recommended Evaluation Metrics:")
    print("-" * 35)
    
    recommended_metrics = [
        "ROC-AUC (primary metric)",
        "Precision-Recall AUC (for imbalanced data)",
        "Sensitivity/Recall (medical screening priority)",
        "Specificity (false positive control)",
        "F1-Score (balanced precision/recall)"
    ]
    
    if imbalance_ratio > 10:
        recommended_metrics.append("Balanced Accuracy")
        recommended_metrics.append("Matthews Correlation Coefficient")
    
    for i, metric in enumerate(recommended_metrics, 1):
        print(f"   {i}. {metric}")
    
    # Avoid accuracy note
    if imbalance_ratio > 5:
        print(f"\n⚠️  Important Note:")
        print(f"   Accuracy can be misleading with {imbalance_ratio:.1f}:1 imbalance")
        print(f"   A model predicting 'no stroke' for all cases would achieve {100-minority_pct:.1f}% accuracy!")

else:
    print(f"❌ Error in target variable analysis: {target_report['error']}")

🎯 TARGET VARIABLE ANALYSIS
Target Variable: stroke
Analysis Type: Binary Classification

📊 Basic Statistics:
   Total Records: 5,110
   Missing Values: 0
   Unique Classes: 2

📈 Class Distribution:
----------------------------------------
Class      Count      Percentage   Description         
----------------------------------------------------
0          4861       95.13        No Stroke           
1          249        4.87         Stroke              

⚖️  Class Imbalance Analysis:
------------------------------
   Imbalance Ratio: 19.5:1
   Minority Class %: 4.87%
   Balance Status: EXTREMELY_IMBALANCED

🏥 Clinical Interpretation:
------------------------------
   Stroke Rate: 4.87%
   Assessment: Low stroke rate - typical for general adult population

🤖 Machine Learning Implications:
-----------------------------------
   Imbalance Severity: EXTREMELY_IMBALANCED
   Recommendation: High imbalance - use balanced sampling and cost-sensitive learning

📏 Recommended Evaluation Metrics

## 🏥 Medical Domain Validation

Validation of data against medical domain knowledge and clinical standards.

In [29]:
print("="*80)
print("🏥 MEDICAL DOMAIN VALIDATION")
print("="*80)

medical_report = quality_report['medical_validation']

print(f"Validation Status: {medical_report['validation_status']}")
print(f"Validation Score: {medical_report['validation_score']}/100")

if medical_report['issues']:
    print(f"\n⚠️  Issues Identified:")
    print("-" * 20)
    
    for i, issue in enumerate(medical_report['issues'], 1):
        print(f"   {i}. {issue}")
    
    print(f"\n💡 Recommendations:")
    print("-" * 15)
    
    for issue in medical_report['issues']:
        if 'age' in issue.lower():
            print("   • Review age values: Check for data entry errors or outliers")
            print("   • Consider age validation rules: 0-120 years for general population")
        elif 'bmi' in issue.lower():
            print("   • Review BMI values: Check for extreme values outside 10-60 range")
            print("   • Consider BMI validation: May indicate measurement errors")
        elif 'glucose' in issue.lower():
            print("   • Review glucose values: Check for values outside 50-500 mg/dL")
            print("   • Consider measurement context: Fasting vs. random glucose")
        elif 'binary' in issue.lower():
            print("   • Review binary variables: Ensure only 0/1 values present")
            print("   • Check data encoding: Verify proper binary encoding")

else:
    print("✅ All medical domain validations passed!")
    print("   Data values are within expected clinical ranges")
    print("   Binary variables properly encoded")
    print("   No obvious medical inconsistencies detected")

# Detailed medical validation results
if 'detailed_results' in medical_report:
    print(f"\n📋 Detailed Validation Results:")
    print("-" * 35)
    
    detailed = medical_report['detailed_results']
    
    for variable, results in detailed.items():
        if 'invalid_count' in results:
            print(f"\n{variable.upper()}:")
            print(f"   Invalid records: {results['invalid_count']}")
            
            if results['issues']:
                for issue in results['issues']:
                    print(f"   Issue: {issue}")
            else:
                print(f"   ✅ All values within expected range")

🏥 MEDICAL DOMAIN VALIDATION
Validation Status: MINOR_ISSUES
Validation Score: 80/100

⚠️  Issues Identified:
--------------------
   1. Extreme BMI values: 13 records

💡 Recommendations:
---------------
   • Review BMI values: Check for extreme values outside 10-60 range
   • Consider BMI validation: May indicate measurement errors

📋 Detailed Validation Results:
-----------------------------------

AGE:
   Invalid records: 0
   ✅ All values within expected range

BMI:
   Invalid records: 13
   Issue: Extreme BMI values: 13 records

GLUCOSE:
   Invalid records: 0
   ✅ All values within expected range


## 📊 Data Quality Visualizations

Creating comprehensive visualizations to illustrate data quality findings and patterns.

In [30]:
print("="*80)
print("📊 CREATING DATA QUALITY VISUALIZATIONS")
print("="*80)

# 1. Missing Values Visualization
print("Creating missing values visualization...")

missing_df = pd.DataFrame(quality_report['missing_values']['missing_summary'])

if len(missing_df[missing_df['missing_count'] > 0]) > 0:
    # Missing values bar chart
    missing_viz_df = missing_df[missing_df['missing_count'] > 0].sort_values('missing_percentage', ascending=True)
    
    fig_missing = px.bar(
        missing_viz_df,
        x='missing_percentage',
        y='column',
        orientation='h',
        title='Missing Values by Column (%)',
        labels={'missing_percentage': 'Missing Percentage (%)', 'column': 'Column'},
        color='missing_percentage',
        color_continuous_scale='Reds',
        text='missing_count'
    )
    
    fig_missing.update_traces(texttemplate='%{text}', textposition='outside')
    fig_missing.update_layout(
        height=400,
        showlegend=False,
        xaxis_title="Missing Percentage (%)",
        yaxis_title="Columns"
    )
    
    fig_missing.show()
else:
    print("✅ No missing values to visualize!")

# 2. Data Types Distribution
print("Creating data types distribution...")

dtype_counts = {}
for col in df.columns:
    dtype = str(df[col].dtype)
    dtype_counts[dtype] = dtype_counts.get(dtype, 0) + 1

fig_dtypes = px.pie(
    values=list(dtype_counts.values()),
    names=list(dtype_counts.keys()),
    title='Data Types Distribution',
    color_discrete_sequence=px.colors.qualitative.Set3
)

fig_dtypes.update_traces(textposition='inside', textinfo='percent+label')
fig_dtypes.update_layout(height=500)
fig_dtypes.show()

# 3. Target Variable Distribution
print("Creating target variable distribution...")

if 'target_analysis' in quality_report and 'error' not in quality_report['target_analysis']:
    target_dist = quality_report['target_analysis']['class_distribution']
    
    classes = list(target_dist.keys())
    counts = [info['count'] for info in target_dist.values()]
    percentages = [info['percentage'] for info in target_dist.values()]
    
    # Create labels
    labels = ['No Stroke' if c == 0 else 'Stroke' if c == 1 else str(c) for c in classes]
    
    fig_target = go.Figure()
    
    # Add bar chart
    fig_target.add_trace(go.Bar(
        x=labels,
        y=counts,
        text=[f'{count:,}<br>({pct:.1f}%)' for count, pct in zip(counts, percentages)],
        textposition='auto',
        marker_color=['lightblue' if c == 0 else 'coral' for c in classes],
        name='Count'
    ))
    
    fig_target.update_layout(
        title='Target Variable Distribution (Stroke)',
        xaxis_title='Class',
        yaxis_title='Count',
        height=400,
        showlegend=False
    )
    
    # Add imbalance ratio annotation
    imbalance_ratio = quality_report['target_analysis']['imbalance_ratio']
    fig_target.add_annotation(
        text=f"Imbalance Ratio: {imbalance_ratio:.1f}:1",
        xref="paper", yref="paper",
        x=0.02, y=0.98,
        xanchor="left", yanchor="top",
        showarrow=False,
        font=dict(size=12, color="black"),
        bgcolor="white",
        bordercolor="black",
        borderwidth=1
    )
    
    fig_target.show()

# 4. Numerical Variables Distribution
print("Creating numerical variables distributions...")

numerical_cols = quality_report['statistical_summary']['numerical_columns']

if len(numerical_cols) > 0:
    # Create subplots for numerical distributions
    n_cols = min(3, len(numerical_cols))
    n_rows = (len(numerical_cols) + n_cols - 1) // n_cols
    
    fig_dist = make_subplots(
        rows=n_rows, cols=n_cols,
        subplot_titles=numerical_cols,
        vertical_spacing=0.08,
        horizontal_spacing=0.08
    )
    
    for i, col in enumerate(numerical_cols):
        row = (i // n_cols) + 1
        col_pos = (i % n_cols) + 1
        
        # Get data for histogram
        col_data = df[col].drop_nulls().to_pandas()
        
        fig_dist.add_trace(
            go.Histogram(
                x=col_data,
                name=col,
                showlegend=False,
                nbinsx=30,
                marker_color='lightblue',
                opacity=0.7
            ),
            row=row, col=col_pos
        )
    
    fig_dist.update_layout(
        title='Distribution of Numerical Variables',
        height=300 * n_rows,
        showlegend=False
    )
    
    fig_dist.show()

# 5. Overall Quality Score Visualization
print("Creating overall quality score visualization...")

# Quality score gauge
score = quality_report['overall_score']['score']
grade = quality_report['overall_score']['grade']

fig_quality = go.Figure(go.Indicator(
    mode = "gauge+number+delta",
    value = score,
    domain = {'x': [0, 1], 'y': [0, 1]},
    title = {'text': f"Data Quality Score<br><span style='font-size:0.8em;color:gray'>{grade}</span>"},
    delta = {'reference': 80},
    gauge = {
        'axis': {'range': [None, 100]},
        'bar': {'color': "darkblue"},
        'steps': [
            {'range': [0, 50], 'color': "lightgray"},
            {'range': [50, 75], 'color': "yellow"},
            {'range': [75, 90], 'color': "lightgreen"},
            {'range': [90, 100], 'color': "green"}
        ],
        'threshold': {
            'line': {'color': "red", 'width': 4},
            'thickness': 0.75,
            'value': 90
        }
    }
))

fig_quality.update_layout(height=400)
fig_quality.show()

print("✅ All visualizations created successfully!")

📊 CREATING DATA QUALITY VISUALIZATIONS
Creating missing values visualization...


Creating data types distribution...


Creating target variable distribution...


Creating numerical variables distributions...


Creating overall quality score visualization...


✅ All visualizations created successfully!


## 📋 Data Quality Summary Report

Comprehensive summary of all data quality findings with actionable recommendations.

In [31]:
print("="*100)
print("📋 COMPREHENSIVE DATA QUALITY SUMMARY REPORT")
print("="*100)

# Header information
print(f"\nDataset: Stroke Prediction Analysis")
print(f"Analysis Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Analysis Tool: Polars + Custom Quality Framework")

# Overall assessment
overall_score = quality_report['overall_score']
print(f"\n🏆 OVERALL ASSESSMENT:")
print("-" * 25)
print(f"Quality Score: {overall_score['score']:.1f}/100")
print(f"Quality Grade: {overall_score['grade']}")
print(f"Recommendation: {overall_score['recommendation']}")

# Dataset overview
basic_metrics = quality_report['basic_metrics']
print(f"\n📊 DATASET OVERVIEW:")
print("-" * 20)
print(f"Total Records: {basic_metrics['total_rows']:,}")
print(f"Total Features: {basic_metrics['total_columns']}")
print(f"Memory Usage: {basic_metrics['memory_usage_mb']:.2f} MB")
print(f"Data Density: {((basic_metrics['total_rows'] * basic_metrics['total_columns']) - quality_report['missing_values']['total_missing']) / (basic_metrics['total_rows'] * basic_metrics['total_columns']) * 100:.1f}%")

# Key findings by category
print(f"\n🔍 KEY FINDINGS BY CATEGORY:")
print("-" * 30)

# 1. Completeness
missing_info = quality_report['missing_values']
print(f"\n1. DATA COMPLETENESS:")
print(f"   Overall Missing Rate: {missing_info['overall_missing_rate']:.2f}%")
print(f"   Severity Level: {missing_info['missing_severity']}")
print(f"   Columns Affected: {missing_info['columns_with_missing']}/{basic_metrics['total_columns']}")

if missing_info['columns_with_missing'] > 0:
    print(f"   Action Required: Handle missing values before analysis")
else:
    print(f"   Status: ✅ Complete dataset")

# 2. Validity
medical_info = quality_report['medical_validation']
print(f"\n2. DATA VALIDITY:")
print(f"   Medical Validation Score: {medical_info['validation_score']}/100")
print(f"   Validation Status: {medical_info['validation_status']}")
if medical_info['issues']:
    print(f"   Issues Found: {len(medical_info['issues'])}")
    print(f"   Action Required: Review and clean invalid values")
else:
    print(f"   Status: ✅ All values within expected ranges")

# 3. Consistency
duplicate_info = quality_report['duplicates']
print(f"\n3. DATA CONSISTENCY:")
print(f"   Duplicate Rate: {duplicate_info['duplicate_percentage']:.2f}%")
print(f"   Duplicate Records: {duplicate_info['duplicate_rows']:,}")
print(f"   Consistency Level: {duplicate_info['duplicate_severity']}")

if duplicate_info['duplicate_rows'] > 0:
    print(f"   Action Required: Review and remove duplicates")
else:
    print(f"   Status: ✅ No duplicate records found")

# 4. Target Variable
target_info = quality_report['target_analysis']
if 'error' not in target_info:
    print(f"\n4. TARGET VARIABLE QUALITY:")
    print(f"   Class Balance: {target_info['balance_status']}")
    print(f"   Imbalance Ratio: {target_info['imbalance_ratio']:.1f}:1")
    print(f"   Minority Class: {target_info['minority_percentage']:.2f}%")
    
    if target_info['imbalance_ratio'] > 10:
        print(f"   Action Required: Use specialized techniques for imbalanced data")
    else:
        print(f"   Status: ✅ Manageable class distribution")

# Priority action items
print(f"\n🎯 PRIORITY ACTION ITEMS:")
print("-" * 25)

action_items = []
priority_level = 1

# High priority items
if missing_info['overall_missing_rate'] > 10:
    action_items.append(f"{priority_level}. HIGH PRIORITY: Address high missing value rate ({missing_info['overall_missing_rate']:.1f}%)")
    priority_level += 1

if medical_info['validation_score'] < 80:
    action_items.append(f"{priority_level}. HIGH PRIORITY: Fix medical validation issues")
    priority_level += 1

if duplicate_info['duplicate_percentage'] > 5:
    action_items.append(f"{priority_level}. HIGH PRIORITY: Remove duplicate records ({duplicate_info['duplicate_percentage']:.1f}%)")
    priority_level += 1

# Medium priority items
if missing_info['overall_missing_rate'] > 1:
    action_items.append(f"{priority_level}. MEDIUM PRIORITY: Handle remaining missing values")
    priority_level += 1

if 'error' not in target_info and target_info['imbalance_ratio'] > 10:
    action_items.append(f"{priority_level}. MEDIUM PRIORITY: Prepare for imbalanced classification")
    priority_level += 1

# Low priority items
dtype_suggestions = quality_report['data_types']['optimization_suggestions']
if dtype_suggestions:
    action_items.append(f"{priority_level}. LOW PRIORITY: Optimize data types for memory efficiency")
    priority_level += 1

if not action_items:
    action_items.append("✅ No immediate action items - dataset is ready for analysis")

for item in action_items:
    print(f"   {item}")

# Recommendations for next steps
print(f"\n💡 RECOMMENDATIONS FOR NEXT STEPS:")
print("-" * 35)

next_steps = [
    "1. Address high-priority data quality issues identified above",
    "2. Implement appropriate missing value imputation strategies",
    "3. Validate data cleaning results",
    "4. Proceed with exploratory data analysis (EDA)",
    "5. Begin statistical hypothesis testing",
    "6. Develop feature engineering strategies",
    "7. Prepare for machine learning model development"
]

if 'error' not in target_info and target_info['imbalance_ratio'] > 10:
    next_steps.insert(4, "4. Plan imbalanced classification strategy")

for step in next_steps:
    print(f"   {step}")

# Dataset readiness assessment
print(f"\n🚦 DATASET READINESS ASSESSMENT:")
print("-" * 35)

if overall_score['score'] >= 85:
    readiness = "✅ READY FOR ANALYSIS"
    readiness_note = "Dataset meets quality standards for machine learning analysis"
elif overall_score['score'] >= 70:
    readiness = "⚠️  READY WITH MINOR PREPROCESSING"
    readiness_note = "Dataset requires minor cleaning but is suitable for analysis"
elif overall_score['score'] >= 50:
    readiness = "🔧 REQUIRES SIGNIFICANT PREPROCESSING"
    readiness_note = "Dataset needs substantial cleaning before analysis"
else:
    readiness = "❌ NOT READY - MAJOR ISSUES"
    readiness_note = "Dataset has critical quality issues requiring extensive remediation"

print(f"Status: {readiness}")
print(f"Assessment: {readiness_note}")

# Save quality report flag
print(f"\n💾 SAVE QUALITY REPORT:")
print("-" * 20)
print("To save this quality report to file, run:")
print("loader.save_data_quality_report('stroke_data_quality_report.txt')")

print(f"\n" + "="*100)
print("📋 DATA QUALITY ASSESSMENT COMPLETED")
print("="*100)

📋 COMPREHENSIVE DATA QUALITY SUMMARY REPORT

Dataset: Stroke Prediction Analysis
Analysis Date: 2025-06-19 11:21:23
Analysis Tool: Polars + Custom Quality Framework

🏆 OVERALL ASSESSMENT:
-------------------------
Quality Score: 94.0/100
Quality Grade: VERY_GOOD
Recommendation: Minor issues, suitable for analysis with minimal preprocessing

📊 DATASET OVERVIEW:
--------------------
Total Records: 5,110
Total Features: 12
Memory Usage: 0.43 MB
Data Density: 99.7%

🔍 KEY FINDINGS BY CATEGORY:
------------------------------

1. DATA COMPLETENESS:
   Overall Missing Rate: 0.33%
   Severity Level: EXCELLENT
   Columns Affected: 1/12
   Action Required: Handle missing values before analysis

2. DATA VALIDITY:
   Medical Validation Score: 80/100
   Validation Status: MINOR_ISSUES
   Issues Found: 1
   Action Required: Review and clean invalid values

3. DATA CONSISTENCY:
   Duplicate Rate: 0.00%
   Duplicate Records: 0
   Consistency Level: EXCELLENT
   Status: ✅ No duplicate records found

4.

## 💾 Export Data Quality Report

Save the comprehensive quality assessment report for documentation and future reference.

In [32]:
print("💾 Saving comprehensive data quality report...")

# Create a detailed text report
report_filename = f"stroke_data_quality_report_{datetime.now().strftime('%Y%m%d_%H%M%S')}.txt"

try:
    with open(report_filename, 'w') as f:
        f.write("="*100 + "\n")
        f.write("STROKE PREDICTION DATASET - COMPREHENSIVE QUALITY ASSESSMENT REPORT\n")
        f.write("="*100 + "\n\n")
        
        f.write(f"Analysis Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
        f.write(f"Analysis Tool: Polars + Custom Quality Assessment Framework\n\n")
        
        # Overall assessment
        overall = quality_report['overall_score']
        f.write("OVERALL ASSESSMENT:\n")
        f.write("-" * 20 + "\n")
        f.write(f"Quality Score: {overall['score']:.1f}/100\n")
        f.write(f"Quality Grade: {overall['grade']}\n")
        f.write(f"Recommendation: {overall['recommendation']}\n\n")
        
        # Dataset overview
        basic = quality_report['basic_metrics']
        f.write("DATASET OVERVIEW:\n")
        f.write("-" * 17 + "\n")
        f.write(f"Total Records: {basic['total_rows']:,}\n")
        f.write(f"Total Features: {basic['total_columns']}\n")
        f.write(f"Memory Usage: {basic['memory_usage_mb']:.2f} MB\n")
        f.write(f"Columns: {', '.join(basic['column_names'])}\n\n")
        
        # Missing values
        missing = quality_report['missing_values']
        f.write("MISSING VALUES ANALYSIS:\n")
        f.write("-" * 24 + "\n")
        f.write(f"Overall Missing Rate: {missing['overall_missing_rate']:.2f}%\n")
        f.write(f"Total Missing Values: {missing['total_missing']:,}\n")
        f.write(f"Columns Affected: {missing['columns_with_missing']}\n")
        f.write(f"Severity: {missing['missing_severity']}\n\n")
        
        if missing['columns_with_missing'] > 0:
            f.write("Missing Values by Column:\n")
            missing_df = pd.DataFrame(missing['missing_summary'])
            for _, row in missing_df[missing_df['missing_count'] > 0].iterrows():
                f.write(f"  {row['column']}: {row['missing_count']} ({row['missing_percentage']:.1f}%)\n")
            f.write("\n")
        
        # Medical validation
        medical = quality_report['medical_validation']
        f.write("MEDICAL VALIDATION:\n")
        f.write("-" * 18 + "\n")
        f.write(f"Validation Score: {medical['validation_score']}/100\n")
        f.write(f"Status: {medical['validation_status']}\n")
        if medical['issues']:
            f.write("Issues Found:\n")
            for issue in medical['issues']:
                f.write(f"  - {issue}\n")
        f.write("\n")
        
        # Target variable
        target = quality_report['target_analysis']
        if 'error' not in target:
            f.write("TARGET VARIABLE ANALYSIS:\n")
            f.write("-" * 26 + "\n")
            f.write(f"Variable: stroke\n")
            f.write(f"Imbalance Ratio: {target['imbalance_ratio']:.1f}:1\n")
            f.write(f"Balance Status: {target['balance_status']}\n")
            f.write(f"Minority Class: {target['minority_percentage']:.2f}%\n\n")
            
            f.write("Class Distribution:\n")
            for class_val, info in target['class_distribution'].items():
                class_name = "No Stroke" if class_val == 0 else "Stroke"
                f.write(f"  {class_name}: {info['count']:,} ({info['percentage']:.1f}%)\n")
            f.write("\n")
        
        # Quality score deductions
        if overall['deductions']:
            f.write("QUALITY SCORE DEDUCTIONS:\n")
            f.write("-" * 25 + "\n")
            for deduction in overall['deductions']:
                f.write(f"  - {deduction}\n")
            f.write("\n")
        
        f.write("="*100 + "\n")
        f.write("END OF QUALITY ASSESSMENT REPORT\n")
        f.write("="*100 + "\n")
    
    print(f"✅ Quality report saved successfully!")
    print(f"📄 Report file: {report_filename}")
    print(f"📍 Location: {Path(report_filename).absolute()}")

except Exception as e:
    print(f"❌ Error saving report: {e}")

# Save dataset info as JSON for programmatic access
import json

try:
    json_filename = f"stroke_data_info_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
    
    # Convert quality report to JSON-serializable format
    json_report = {
        'analysis_timestamp': datetime.now().isoformat(),
        'dataset_path': DATASET_PATH if dataset_loaded else 'sample_data',
        'basic_metrics': quality_report['basic_metrics'],
        'overall_score': quality_report['overall_score'],
        'missing_values_summary': {
            'total_missing': quality_report['missing_values']['total_missing'],
            'overall_rate': quality_report['missing_values']['overall_missing_rate'],
            'severity': quality_report['missing_values']['missing_severity']
        },
        'medical_validation_summary': {
            'score': quality_report['medical_validation']['validation_score'],
            'status': quality_report['medical_validation']['validation_status'],
            'issues_count': len(quality_report['medical_validation']['issues'])
        }
    }
    
    if 'error' not in quality_report['target_analysis']:
        json_report['target_analysis_summary'] = {
            'imbalance_ratio': quality_report['target_analysis']['imbalance_ratio'],
            'balance_status': quality_report['target_analysis']['balance_status'],
            'minority_percentage': quality_report['target_analysis']['minority_percentage']
        }
    
    with open(json_filename, 'w') as f:
        json.dump(json_report, f, indent=2)
    
    print(f"✅ JSON summary saved!")
    print(f"📄 JSON file: {json_filename}")

except Exception as e:
    print(f"⚠️  Warning: Could not save JSON summary: {e}")

💾 Saving comprehensive data quality report...
✅ Quality report saved successfully!
📄 Report file: stroke_data_quality_report_20250619_112125.txt
📍 Location: /Users/sourangshupal/Downloads/Module1_FinalProject/notebooks/stroke_data_quality_report_20250619_112125.txt
✅ JSON summary saved!
📄 JSON file: stroke_data_info_20250619_112125.json


## 🎉 Data Loading and Quality Assessment Complete!

### ✅ What We've Accomplished

1. **High-Performance Data Loading**: Successfully loaded the stroke prediction dataset using Polars for optimal performance
2. **Comprehensive Quality Assessment**: Evaluated data quality across multiple dimensions:
   - Completeness (missing values)
   - Validity (medical domain validation)
   - Consistency (duplicates)
   - Target variable analysis
3. **Statistical Overview**: Generated detailed statistical summaries for all variables
4. **Interactive Visualizations**: Created informative plots to visualize data quality patterns
5. **Actionable Insights**: Identified specific data quality issues and provided recommendations

### 📊 Key Findings Summary

- **Dataset Size**: {df.shape[0]:,} records × {df.shape[1]} columns
- **Overall Quality Score**: {quality_report['overall_score']['score']:.1f}/100 ({quality_report['overall_score']['grade']})
- **Missing Values**: {quality_report['missing_values']['overall_missing_rate']:.2f}% overall missing rate
- **Target Variable**: {quality_report['target_analysis']['imbalance_ratio']:.1f}:1 imbalance ratio (typical for medical data)
- **Medical Validation**: {quality_report['medical_validation']['validation_score']}/100 validation score

### 🚀 Next Steps

1. **Address Data Quality Issues**: Implement recommended fixes for missing values and validation errors
2. **Statistical Hypothesis Testing**: Proceed to `02_hypothesis_testing.ipynb` for rigorous statistical analysis
3. **Feature Engineering**: Use insights from this analysis to guide feature creation
4. **Model Development**: Apply appropriate techniques for the identified class imbalance

### 📁 Generated Files

- **Quality Report**: `stroke_data_quality_report_[timestamp].txt`
- **JSON Summary**: `stroke_data_info_[timestamp].json`
- **Enhanced Dataset**: Ready for subsequent analysis phases

The dataset is now thoroughly understood and ready for the next phase of analysis! 🎯

In [33]:
print(f"\n🎉 DATA LOADING AND QUALITY ASSESSMENT COMPLETED SUCCESSFULLY!")
print(f"📋 Notebook execution finished - ready for next analysis phase!")

# Final memory cleanup
import gc
gc.collect()
print(f"💾 Memory cleanup completed")


🎉 DATA LOADING AND QUALITY ASSESSMENT COMPLETED SUCCESSFULLY!
📋 Notebook execution finished - ready for next analysis phase!
💾 Memory cleanup completed
