# Unit Cooler Digital Twin - Exploratory Data Analysis

**Project:** HVAC Unit Cooler Digital Twin  
**Date:** 2025-11-18  
**Dataset:** datos_combinados_entrenamiento_20251118_105234.csv

## Objective
Perform a comprehensive exploratory data analysis of the consolidated Unit Cooler experimental dataset to understand its data quality, patterns, and variable relationships ahead of model development.

In [None]:
# Import libraries
import sys
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from data.data_loader import load_and_preprocess, DataLoader
from utils.eda_utils import EDAAnalyzer, print_eda_summary
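# Import the plotting helpers explicitly so it is clear where each one comes from
from utils.visualization import (
    plot_missing_values, plot_distributions, plot_boxplots,
    plot_correlation_heatmap, plot_target_correlations, plot_time_series,
)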

# Settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)

%matplotlib inline
%load_ext autoreload
%autoreload 2

## 1. Data Loading

In [None]:
# Load data
data_path = '../data/raw/datos_combinados_entrenamiento_20251118_105234.csv'
df, metadata = load_and_preprocess(data_path)

print(f"\nDataset shape: {df.shape}")
print(f"Rows: {df.shape[0]:,}")
print(f"Columns: {df.shape[1]}")

## 2. Dataset Overview

In [None]:
# Display first few rows
print("First 5 rows:")
df.head()

In [None]:
# Column names and types
print("Column names and data types:")
df.dtypes

In [None]:
# Basic statistics
print("Descriptive statistics:")
df.describe().T

## 3. Data Quality Assessment

In [None]:
# Missing values analysis
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({
    'Missing_Count': missing,
    'Missing_Pct': missing_pct
}).sort_values('Missing_Pct', ascending=False)

print("Missing values summary:")
print(missing_df[missing_df['Missing_Count'] > 0])

In [None]:
# Visualize missing values
plot_missing_values(df)

In [None]:
# Data quality issues
analyzer = EDAAnalyzer(df)
issues = analyzer.check_data_quality_issues()

print("Data quality issues:")
for issue_type, issue_data in issues.items():
    print(f"\n{issue_type}:")
    if isinstance(issue_data, dict):
        for k, v in list(issue_data.items())[:10]:
            print(f"  {k}: {v}")
    elif isinstance(issue_data, list):
        for item in issue_data[:10]:
            print(f"  {item}")

## 4. Target Variables Analysis

Key target variables:
- **UCAOT**: Unit Cooler Air Outlet Temperature
- **UCWOT**: Unit Cooler Water Outlet Temperature
- **UCAF**: Unit Cooler Air Flow

In [None]:
# Target variables statistics
target_vars = ['UCAOT', 'UCWOT', 'UCAF']
print("Target variables statistics:")
df[target_vars].describe().T

In [None]:
# Distribution of target variables
plot_distributions(df, target_vars)

In [None]:
# Boxplots for target variables
plot_boxplots(df, target_vars)

## 5. Input Variables Analysis

In [None]:
# Key input variables
input_vars = ['UCWIT', 'UCAIT', 'UCWF', 'UCAIH', 'AMBT']
available_inputs = [v for v in input_vars if v in df.columns]

print("Input variables statistics:")
df[available_inputs].describe().T

In [None]:
# Distribution of input variables
plot_distributions(df, available_inputs)

## 6. Correlation Analysis

In [None]:
# Correlation matrix for key variables
key_vars = target_vars + available_inputs
corr_matrix = df[key_vars].corr()

print("Correlation matrix:")
corr_matrix

In [None]:
# Visualize correlation heatmap
plot_correlation_heatmap(df, variables=key_vars)

In [None]:
# Identify highly correlated pairs
high_corr = analyzer.identify_highly_correlated_pairs(threshold=0.8)

print("Highly correlated variable pairs (|r| >= 0.8):")
for var1, var2, corr in high_corr:
    print(f"  {var1} <-> {var2}: {corr:.3f}")

In [None]:
# Correlations with target variables
target_corrs = analyzer.analyze_target_correlations(target_vars)

print("Top correlations with each target:")
target_corrs.head(15)

In [None]:
# Visualize target correlations
plot_target_correlations(df, target_vars, top_n=15)

## 7. Time Series Analysis

In [None]:
# Plot time series for key variables
plot_time_series(df, key_vars, sample_size=5000)

## 8. Outlier Analysis

In [None]:
# Detect outliers using IQR method
outliers_iqr = analyzer.detect_outliers(method='iqr', threshold=1.5)

print("Outlier counts (IQR method, threshold=1.5):")
outliers_sorted = sorted(outliers_iqr.items(), key=lambda x: x[1], reverse=True)
for var, count in outliers_sorted[:15]:
    pct = count / len(df) * 100
    print(f"  {var:15s}: {count:6,} ({pct:5.2f}%)")

## 9. Feature Engineering Preparation

Calculate physics-based features for model development.

In [None]:
# Calculate temperature differences
df_features = df.copy()

# Temperature deltas (both are positive when the water gives up heat to the air)
if 'UCWIT' in df.columns and 'UCWOT' in df.columns:
    df_features['delta_T_water'] = df['UCWIT'] - df['UCWOT']

if 'UCAIT' in df.columns and 'UCAOT' in df.columns:
    df_features['delta_T_air'] = df['UCAOT'] - df['UCAIT']

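# Calculate thermal power (simplified): Q = m_dot * Cp * delta_T
# NOTE: the /1000 conversion to kW assumes UCWF and UCAF are mass flows in kg/s.
# If a sensor reports volumetric flow (e.g. m³/h), convert with the fluid density first.
Cp_water = 4186.0  # J/(kg·K)
Cp_air = 1005.0    # J/(kg·K)

if 'UCWF' in df.columns and 'delta_T_water' in df_features.columns:
    df_features['Q_water_calc'] = df['UCWF'] * Cp_water * df_features['delta_T_water'] / 1000  # kW

if 'UCAF' in df.columns and 'delta_T_air' in df_features.columns:
    df_features['Q_air_calc'] = df['UCAF'] * Cp_air * df_features['delta_T_air'] / 1000  # kW

# Calculate efficiency as the ratio of air-side to water-side power (if both are available).
# Near-zero denominators are set to NaN: adding a small epsilon would still blow up
# for tiny or slightly negative Q_water_calc values.
if 'Q_air_calc' in df_features.columns and 'Q_water_calc' in df_features.columns:
    q_water = df_features['Q_water_calc'].where(df_features['Q_water_calc'].abs() > 1e-6)
    df_features['efficiency'] = df_features['Q_air_calc'] / q_water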

print("Engineered features:")
new_features = ['delta_T_water', 'delta_T_air', 'Q_water_calc', 'Q_air_calc', 'efficiency']
available_new = [f for f in new_features if f in df_features.columns]
df_features[available_new].describe().T

## 10. Summary and Key Findings

In [None]:
print_eda_summary(df)

## Key Findings

### Data Quality
1. **Dataset Size**: 56,211 rows × 32 columns
2. **Missing Values**: Significant missing data in many columns (23-76%)
   - UCSDP: 76.42% missing
   - UCFMC: 75.75% missing
   - UCFMV: 75.07% missing
   - UCAIH: 72.06% missing
   - Most other variables: ~23-33% missing
3. **Negative Flow Values**: 12,620 negative values in UCWF (water flow)
4. **Outliers**: Many variables have 10-30% of their values flagged as outliers (IQR method, threshold=1.5)
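
The negative-flow and missing-value findings above can be handled together by treating physically impossible readings as missing before any imputation. The following is a minimal sketch on a small synthetic frame. The helper name `mask_negative_flows` and the choice to convert negatives to NaN (rather than clip them or drop the rows) are assumptions for illustration, not the project's actual pipeline.

```python
import numpy as np
import pandas as pd


def mask_negative_flows(frame, flow_cols):
    """Return a copy where negative values in flow_cols are set to NaN."""
    out = frame.copy()
    for col in flow_cols:
        if col in out.columns:
            # Series.mask replaces values with NaN wherever the condition is True
            out[col] = out[col].mask(out[col] < 0)
    return out


# Tiny synthetic example (values are made up for illustration)
demo = pd.DataFrame({
    "UCWF": [0.5, -0.2, 0.7, np.nan, -1.0],
    "UCWIT": [60.0, 61.0, np.nan, 59.5, 60.2],
})
cleaned = mask_negative_flows(demo, ["UCWF"])

# Missing percentage per column after masking
missing_pct = cleaned.isna().mean().mul(100).round(2)
print(missing_pct)
```

Masking first keeps the later imputation step from learning from invalid readings, at the cost of raising the missing share of the flow column.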

### Target Variables
1. **UCAOT** (Air Outlet Temp): Mean=34.6°C, Std=58.8°C
2. **UCWOT** (Water Outlet Temp): Mean=103.1°C, Std=211.7°C (extreme variance)
3. **UCAF** (Air Flow): Mean=6,259, Std=17,841 (high variance)
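
For all three targets the reported standard deviation is larger than the mean, so mean and std are likely dominated by extreme readings. A robust summary (median and IQR) may describe normal operation better. The sketch below uses synthetic values; `robust_summary` is a hypothetical helper, and the 1.5 × IQR fence mirrors the threshold used in the outlier section.

```python
import pandas as pd


def robust_summary(frame, columns):
    """Median, IQR, and share of values outside the 1.5 * IQR fences per column."""
    rows = {}
    for col in columns:
        s = frame[col].dropna()
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        outside = (s < lower) | (s > upper)
        rows[col] = {
            "median": s.median(),
            "iqr": iqr,
            "outlier_pct": 100 * outside.mean(),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Synthetic example: nine plausible temperatures and one extreme reading
demo = pd.DataFrame({
    "UCAOT": [20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 500.0]
})
summary = robust_summary(demo, ["UCAOT"])
print(summary)
print(demo["UCAOT"].agg(["mean", "std"]))
```

In this toy frame a single extreme value pulls the mean far above the median, which is the same pattern the target statistics hint at.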

### Correlations
1. **UCAIH** (Air Inlet Humidity) is negatively correlated with both temperature targets (UCAIH is 72.06% missing, so these values rest on a minority of rows):
   - UCAOT: r=-0.624
   - UCWOT: r=-0.658
2. **High multicollinearity** between flow measurements:
   - UCFMS ↔ UCFMV: r=0.996
   - UCAF ↔ UCFMV: r=0.977
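
One simple way to act on the multicollinearity finding is a greedy filter that drops one column from each highly correlated pair. This is a sketch under assumptions: `drop_correlated` is a hypothetical helper, the column names and data are synthetic, and the rule "drop the later column" is arbitrary. With the real data it may be preferable to keep whichever column of a pair has fewer missing values (UCFMV, for example, is about 75% missing).

```python
import numpy as np
import pandas as pd


def drop_correlated(frame, threshold=0.95):
    """Greedily drop the later column of each pair with |r| above threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop


rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "flow_a": base,
    # Near-duplicate of flow_a, standing in for two redundant flow sensors
    "flow_b": base * 2.0 + rng.normal(scale=0.01, size=200),
    # Independent signal that should be kept
    "temp": rng.normal(size=200),
})
reduced, dropped = drop_correlated(demo, threshold=0.95)
print("Dropped:", dropped)
print("Kept:", list(reduced.columns))
```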

### Recommendations
1. **Data Cleaning**: Address negative flow values and extreme outliers
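2. **Imputation Strategy**: Develop robust imputation for the ~23-33% missing data in most variables, and decide separately whether to impute or drop the columns with more than 70% missing (UCSDP, UCFMC, UCFMV, UCAIH)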
3. **Feature Selection**: Remove highly correlated features to reduce multicollinearity
4. **Physics Constraints**: Implement constraints to ensure physical validity
5. **Stratified Sampling**: Ensure train/val/test splits represent all operational regimes
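
The stratified-sampling recommendation could be approximated by binning one operating variable into quantiles and splitting within each bin. The sketch below is only an illustration: `stratified_split` is a hypothetical helper, quantile bins of a single variable are an assumed stand-in for "operational regimes", and the 70/15/15 fractions are arbitrary.

```python
import numpy as np
import pandas as pd


def stratified_split(frame, regime_col, n_bins=4, fractions=(0.7, 0.15), seed=0):
    """Split rows into train/val/test within each quantile bin of regime_col.

    fractions gives the (train, val) shares; the remainder goes to test.
    Rows where regime_col is NaN fall into no bin and are left out of all splits.
    """
    rng = np.random.default_rng(seed)
    bins = pd.qcut(frame[regime_col], q=n_bins, labels=False, duplicates="drop")
    parts = {"train": [], "val": [], "test": []}
    for _, idx in frame.groupby(bins).groups.items():
        shuffled = rng.permutation(idx.to_numpy())
        n_train = int(round(fractions[0] * len(shuffled)))
        n_val = int(round(fractions[1] * len(shuffled)))
        parts["train"].extend(shuffled[:n_train])
        parts["val"].extend(shuffled[n_train:n_train + n_val])
        parts["test"].extend(shuffled[n_train + n_val:])
    return {name: frame.loc[ids] for name, ids in parts.items()}


# Synthetic inlet temperatures (made-up range) to exercise the helper
demo_rng = np.random.default_rng(1)
demo = pd.DataFrame({"UCWIT": demo_rng.uniform(20.0, 80.0, size=400)})
splits = stratified_split(demo, "UCWIT")
for name, part in splits.items():
    print(name, len(part))
```

Because the measurements are time-ordered, shuffling within bins may place neighbouring samples in different splits. Splitting by contiguous time blocks within each regime is a possible alternative if that leakage matters.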

### Next Steps
1. Develop comprehensive data preprocessing pipeline
2. Implement physics-based feature engineering
3. Create Physics-Informed Neural Network (PINN) architecture
4. Validate against baseline models (LinearRegression, RandomForest)
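
As a starting point for the baseline comparison in step 4, the sketch below fits scikit-learn's `LinearRegression` and `RandomForestRegressor` on synthetic data and reports the mean absolute error of each. The data-generating formula, feature ranges, and hyperparameters are assumptions chosen only to make the example runnable; they are not derived from the Unit Cooler dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in: a target that depends on two inputs plus noise
X = rng.uniform(low=[10.0, 0.1], high=[40.0, 1.0], size=(500, 2))
y = 0.8 * X[:, 0] + 5.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

baselines = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
}
scores = {}
for name, model in baselines.items():
    model.fit(X_train, y_train)
    scores[name] = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: MAE = {scores[name]:.3f}")
```

A later PINN would be evaluated on the same held-out split and metric so that its error can be compared directly with these baselines.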