# Exploratory Data Analysis (EDA)

**Project:** AIDD Final Project
**Author:** [Your Name]
**Date:** November 9, 2025

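This notebook explores the project dataset: it inspects structure and data types, checks for missing values and duplicates, plots feature distributions, measures correlations, and examines how features relate to the target. Until the project data is loaded, the cells run on seaborn's built-in `iris` sample dataset.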

## 1. Setup and Imports

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Import custom modules
import sys
sys.path.append('../src')
from data_loader import DataLoader
from data_cleaner import DataCleaner

print("Libraries imported successfully!")

## 2. Load Data

In [None]:
# Initialize data loader
loader = DataLoader(data_dir='../data')

# Load dataset (modify this to load your actual data)
# Example: df = loader.load_csv('your_data.csv')
# For demo purposes, we'll use a seaborn sample dataset
# (seaborn is already imported as sns in the setup cell; the data is downloaded on first use)
df = sns.load_dataset('iris')  # Replace with your actual dataset

print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")

## 3. Initial Data Inspection

In [None]:
# Display first few rows
print("First 5 rows of the dataset:")
df.head()

In [None]:
# Display basic information
print("Dataset Information:")
df.info()

In [None]:
# Statistical summary
print("Statistical Summary:")
df.describe()

In [None]:
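# Check for missing values
print("Missing Values:")
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Percentage': missing_pct
})
missing_df = missing_df[missing_df['Missing Count'] > 0]
# Print an explicit message instead of an empty DataFrame when nothing is missing
print(missing_df if not missing_df.empty else "No missing values found.")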

In [None]:
# Check for duplicates
duplicates = df.duplicated().sum()
print(f"\nDuplicate rows: {duplicates}")
print(f"Percentage: {(duplicates / len(df)) * 100:.2f}%")

## 4. Data Distribution Analysis

In [None]:
# Analyze numerical columns
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print(f"Numerical columns: {numerical_cols}")

# Plot distributions
if len(numerical_cols) > 0:
    n_cols = min(3, len(numerical_cols))
    n_rows = (len(numerical_cols) + n_cols - 1) // n_cols
    
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5*n_rows))
    axes = axes.flatten() if len(numerical_cols) > 1 else [axes]
    
    for idx, col in enumerate(numerical_cols):
        if idx < len(axes):
            axes[idx].hist(df[col].dropna(), bins=30, edgecolor='black', alpha=0.7)
            axes[idx].set_title(f'Distribution of {col}')
            axes[idx].set_xlabel(col)
            axes[idx].set_ylabel('Frequency')
    
    # Hide unused subplots
    for idx in range(len(numerical_cols), len(axes)):
        axes[idx].set_visible(False)
    
    plt.tight_layout()
    plt.show()

In [None]:
# Box plots for outlier detection (first 3 numerical columns only)
if len(numerical_cols) > 0:
    box_cols = numerical_cols[:3]
    # squeeze=False always returns a 2D array of axes, even for a single plot
    fig, axes = plt.subplots(1, len(box_cols), figsize=(15, 5), squeeze=False)
    
    for ax, col in zip(axes.flatten(), box_cols):
        ax.boxplot(df[col].dropna())
        ax.set_title(f'Box Plot: {col}')
        ax.set_ylabel(col)
    
    plt.tight_layout()
    plt.show()
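
The box plots show outliers visually. To get a number for each column, the same rule can be applied directly: matplotlib's default whiskers reach 1.5 times the interquartile range (IQR) beyond the quartiles, and points past them are drawn as outliers. The sketch below is a minimal version of that count. The helper name `iqr_outlier_counts` and the `demo` frame are hypothetical, introduced only for illustration; in the notebook you would pass `df` instead.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons against NaN are False, so missing values are never counted.
    outside = numeric.lt(lower, axis="columns") | numeric.gt(upper, axis="columns")
    return outside.sum()


# Hypothetical demo frame: column 'b' contains one obvious outlier.
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [10.0, 11.0, 12.0, 13.0, 100.0],
})
outlier_counts = iqr_outlier_counts(demo)
print(outlier_counts)
```

Unlike the plotting cell, this covers every numeric column, so it can flag columns that the three-plot limit leaves out.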

In [None]:
# Analyze categorical columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
print(f"\nCategorical columns: {categorical_cols}")

if len(categorical_cols) > 0:
    for col in categorical_cols[:3]:  # Show first 3
        print(f"\n{col} value counts:")
        print(df[col].value_counts())
        
        # Plot
        plt.figure(figsize=(10, 4))
        df[col].value_counts().plot(kind='bar', color='steelblue')
        plt.title(f'Distribution of {col}')
        plt.xlabel(col)
        plt.ylabel('Count')
        plt.xticks(rotation=45)
        plt.tight_layout()
        plt.show()

## 5. Correlation Analysis

In [None]:
# Correlation matrix
if len(numerical_cols) > 1:
    plt.figure(figsize=(10, 8))
    correlation_matrix = df[numerical_cols].corr()
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
                square=True, linewidths=1, cbar_kws={"shrink": 0.8})
    plt.title('Correlation Matrix')
    plt.tight_layout()
    plt.show()
    
    # Find highly correlated pairs
    print("\nHighly Correlated Pairs (|correlation| > 0.7):")
    high_corr = []
    for i in range(len(correlation_matrix.columns)):
        for j in range(i+1, len(correlation_matrix.columns)):
            if abs(correlation_matrix.iloc[i, j]) > 0.7:
                high_corr.append((
                    correlation_matrix.columns[i],
                    correlation_matrix.columns[j],
                    correlation_matrix.iloc[i, j]
                ))
    
    if high_corr:
        for col1, col2, corr in high_corr:
            print(f"{col1} <-> {col2}: {corr:.3f}")
    else:
        print("No highly correlated pairs found.")

## 6. Pairwise Relationships

In [None]:
# Scatter plot matrix (pairplot) for numerical features
if 1 < len(numerical_cols) <= 5:
    # Only plot if we have a reasonable number of columns
    target_col = categorical_cols[0] if categorical_cols else None
    
    # hue=None is seaborn's default, so a single call covers both cases
    g = sns.pairplot(df, vars=numerical_cols, hue=target_col, diag_kind='hist',
                     plot_kws={'alpha': 0.6}, height=2.5)
    
    # pairplot manages its own layout, so title the returned grid's figure directly
    g.figure.suptitle('Pairwise Relationships', y=1.02)
    plt.show()
else:
    print("Pairplot skipped (too many or too few numerical columns)")

## 7. Target Variable Analysis

In [None]:
# Analyze target variable (modify 'target' to your actual target column name)
# For demo, we'll use the first categorical column as target
if categorical_cols:
    target_col = categorical_cols[0]  # Change this to your actual target
    print(f"Target Variable: {target_col}")
    print(f"\nTarget Distribution:")
    print(df[target_col].value_counts())
    print(f"\nTarget Proportions:")
    print(df[target_col].value_counts(normalize=True))
    
    # Visualize target distribution
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # Count plot
    df[target_col].value_counts().plot(kind='bar', ax=ax1, color='steelblue')
    ax1.set_title(f'Target Distribution: {target_col}')
    ax1.set_xlabel(target_col)
    ax1.set_ylabel('Count')
    ax1.tick_params(axis='x', rotation=45)
    
    # Pie chart
    df[target_col].value_counts().plot(kind='pie', ax=ax2, autopct='%1.1f%%')
    ax2.set_title(f'Target Proportions: {target_col}')
    ax2.set_ylabel('')
    
    plt.tight_layout()
    plt.show()
else:
    print("No categorical target variable found in the dataset.")

## 8. Feature vs Target Analysis

In [None]:
# Analyze how features relate to target
if categorical_cols and numerical_cols:
    target_col = categorical_cols[0]
    
    # Box plots of features by target
    n_features = min(3, len(numerical_cols))
    fig, axes = plt.subplots(1, n_features, figsize=(15, 5))
    axes = axes.flatten() if n_features > 1 else [axes]
    
    for idx, col in enumerate(numerical_cols[:n_features]):
        df.boxplot(column=col, by=target_col, ax=axes[idx])
        axes[idx].set_title(f'{col} by {target_col}')
        axes[idx].set_xlabel(target_col)
        axes[idx].set_ylabel(col)
    
    plt.suptitle('')  # Remove default title
    plt.tight_layout()
    plt.show()
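
A numeric table is a useful companion to the grouped box plots, since it makes the size of the differences between classes explicit. The sketch below groups by the target and reports mean, median, and standard deviation per feature. The `demo` frame and its column names are hypothetical stand-ins; in the notebook, `df`, `target_col`, and `numerical_cols` would take their place.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's df.
demo = pd.DataFrame({
    "species": ["a", "a", "b", "b"],
    "petal_length": [1.0, 2.0, 5.0, 7.0],
    "petal_width": [0.2, 0.4, 1.5, 2.5],
})
target = "species"
features = demo.select_dtypes(include="number").columns.tolist()

# One row per class, one (feature, statistic) column pair per aggregate.
group_summary = demo.groupby(target)[features].agg(["mean", "median", "std"])
print(group_summary.round(2))
```

Features whose class means differ by much more than the within-class standard deviation are likely to separate the classes well.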

## 9. Key Insights and Findings

### Summary of Findings:

1. **Dataset Overview:**
   - Total observations: [Number]
   - Total features: [Number]
   - Missing values: [Description]

2. **Data Quality:**
   - Duplicate rows: [Number]
   - Outliers detected in: [Columns]
   - Data type issues: [If any]

3. **Feature Insights:**
   - Most important correlations: [List]
   - Features with high variance: [List]
   - Potential feature engineering opportunities: [Ideas]

4. **Target Variable:**
   - Distribution: [Balanced/Imbalanced]
   - Class proportions: [Details]
   - Relationships with features: [Key findings]

5. **Recommendations:**
   - Data cleaning steps needed: [List]
   - Feature engineering suggestions: [List]
   - Modeling approach recommendations: [Ideas]
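
Several of the placeholders above can be computed rather than copied by hand from earlier cell outputs. The following is a minimal sketch of that idea; the helper `eda_overview`, its dictionary keys, and the `demo` frame are all hypothetical names chosen for illustration.

```python
from typing import Optional

import pandas as pd


def eda_overview(frame: pd.DataFrame, target: Optional[str] = None) -> dict:
    """Collect the headline numbers used in the findings summary."""
    missing = frame.isnull().sum()
    overview = {
        "observations": len(frame),
        "features": frame.shape[1],
        # Only columns that actually have gaps, mapped to their missing count.
        "columns_with_missing": missing[missing > 0].to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
    }
    if target is not None:
        # Class shares help judge whether the target is balanced.
        overview["class_proportions"] = (
            frame[target].value_counts(normalize=True).to_dict()
        )
    return overview


# Hypothetical demo frame with one missing value and one duplicated row.
demo = pd.DataFrame({"x": [1.0, 1.0, None, 4.0], "y": ["a", "a", "b", "b"]})
overview = eda_overview(demo, target="y")
print(overview)
```

In the notebook this would be called as `eda_overview(df, target=target_col)`, assuming the target column chosen in Section 7.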

## 10. Next Steps

Based on this EDA, the next steps are:

1. **Data Cleaning:**
   - Handle missing values
   - Remove or cap outliers
   - Remove duplicates

2. **Feature Engineering:**
   - Create interaction features
   - Encode categorical variables
   - Scale numerical features

3. **Model Development:**
   - Split data into train/test sets
   - Try multiple algorithms
   - Perform hyperparameter tuning

4. **Model Evaluation:**
   - Compare model performance
   - Analyze feature importance
   - Validate on test set
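
The first three steps can be sketched with plain pandas and NumPy. This is one possible approach under stated assumptions, not the project's method: the project's own `DataCleaner` may already cover the cleaning step, but its API is not shown in this notebook, so it is not used here. Every function name, the median fill, the IQR capping rule, and the `demo` frame are illustrative choices. Scaling statistics come from the training split only, so no information from the test set leaks into the features.

```python
import numpy as np
import pandas as pd


def basic_clean(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Drop duplicate rows, fill numeric gaps with the median, cap at the IQR fences."""
    cleaned = frame.drop_duplicates().copy()
    for col in cleaned.select_dtypes(include="number").columns:
        cleaned[col] = cleaned[col].fillna(cleaned[col].median())
        q1, q3 = cleaned[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        cleaned[col] = cleaned[col].clip(q1 - k * iqr, q3 + k * iqr)
    return cleaned.reset_index(drop=True)


def split_frame(X, y, test_size=0.2, seed=42):
    """Shuffle row positions and split them into train and test parts."""
    order = np.random.default_rng(seed).permutation(len(X))
    n_test = max(1, int(round(len(X) * test_size)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X.iloc[train_idx], X.iloc[test_idx], y.iloc[train_idx], y.iloc[test_idx]


def scale_with_train_stats(X_train, X_test, numeric_cols):
    """Standardize numeric columns using the training mean and standard deviation."""
    mean = X_train[numeric_cols].mean()
    std = X_train[numeric_cols].std(ddof=0).replace(0, 1.0)  # guard constant columns
    X_train, X_test = X_train.copy(), X_test.copy()
    X_train[numeric_cols] = (X_train[numeric_cols] - mean) / std
    X_test[numeric_cols] = (X_test[numeric_cols] - mean) / std
    return X_train, X_test


# Hypothetical demo data: one duplicate row, one missing value, one extreme value.
demo = pd.DataFrame({
    "length": [1.0, 1.0, 2.0, np.nan, 3.0, 4.0, 50.0],
    "kind": ["a", "a", "b", "a", "b", "b", "a"],
    "label": ["no", "no", "yes", "no", "yes", "yes", "no"],
})

cleaned = basic_clean(demo)
numeric_cols = ["length"]
X = pd.get_dummies(cleaned.drop(columns=["label"]))  # one-hot encode 'kind'
y = cleaned["label"]

X_train, X_test, y_train, y_test = split_frame(X, y, test_size=0.33)
X_train, X_test = scale_with_train_stats(X_train, X_test, numeric_cols)
print(X_train)
```

Capping keeps every row, which suits a small dataset; dropping outlier rows is the alternative listed above and is the better choice when the extreme values are known to be errors.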