# Exploratory Data Analysis (EDA)

**Objective:** Perform a deep-dive analysis of the housing dataset using statistical visualizations. We aim to identify patterns, correlations, and anomalies that will inform our machine learning models.

#### 1. Environment Setup and Data Loading

We import our modular visualization toolbox from the `src` directory. To ensure our analysis is accurate, we load the processed dataset generated in the previous notebook, which has already been cleaned of missing values and extreme outliers.

In [None]:
# Make the parent directory importable so that `src` resolves from this notebook
import sys
sys.path.append("..")
import pandas as pd
from src.visualization import (
    plot_distribution,
    plot_correlation_heatmap,
    plot_scatter_with_trend,
    plot_outlier_boxplot,
    plot_violin_by_category
)

# Load the cleaned dataset
df = pd.read_csv('../data/processed/cleaned_housing.csv')

# Verify data loaded correctly
df.head()
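
Since the analysis relies on the data being free of missing values, a quick numeric check can complement `df.head()`. The following is a minimal sketch, not part of the original notebook: the helper name `summarize_quality` is hypothetical, and the small `demo` frame is synthetic stand-in data rather than the real dataset.

```python
import pandas as pd


def summarize_quality(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the dtype and missing-value count for every column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
    })


# Synthetic example data (made up for illustration, not from the housing file)
demo = pd.DataFrame({
    "median_income": [2.5, 3.1, None, 8.0],
    "Price": [150000.0, 180000.0, 210000.0, 450000.0],
})

report = summarize_quality(demo)
print(report)
```

On the real data, `summarize_quality(df)` should report zero missing values in every column if the cleaning step in the previous notebook worked as described.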

#### 2. Distribution Analysis (Visual Type 1)

We examine the distribution of our target variable, `Price`. This helps us understand the spread of housing costs and check for any remaining skewness that might affect our regression algorithms.

In [None]:
# Visualize the distribution of the House Price
plot_distribution(df, 'Price')
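
The visual check for skewness can be backed by a number. The sketch below, which is an addition and not part of the original notebook, compares the sample skewness of a series before and after a `log1p` transform. The function name `skew_report` is hypothetical and the prices are synthetic.

```python
import numpy as np
import pandas as pd


def skew_report(series: pd.Series) -> dict:
    """Sample skewness of the raw values and of their log1p transform."""
    return {
        "raw_skew": float(series.skew()),
        "log1p_skew": float(np.log1p(series).skew()),
    }


# Synthetic, right-skewed prices (made up for illustration)
demo_price = pd.Series(
    [100_000, 120_000, 130_000, 150_000, 160_000, 500_000], dtype=float
)

result = skew_report(demo_price)
print(result)
```

A positive `raw_skew` indicates a long right tail. If the log-transformed value is much closer to zero, a log transform of the target may be worth testing in the modeling notebook.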

#### 3. Relationship and Correlation Analysis (Visual Type 2)

Using a correlation heatmap, we investigate the linear relationships between all numerical features. This is a critical step for feature selection, allowing us to identify which variables (like `median_income`) have the strongest linear association with house prices. Correlation alone does not establish a causal effect, so we treat these values as a screening tool.

In [None]:
# Generate correlation heatmap for all numerical features
plot_correlation_heatmap(df)
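
To read the heatmap as a ranked list, the correlations with the target can be sorted by absolute value. This is a minimal sketch under assumptions: `rank_correlations` is a hypothetical helper, it uses the pandas default Pearson correlation (the heatmap function may use something else), and the demo frame is synthetic.

```python
import pandas as pd


def rank_correlations(frame: pd.DataFrame, target: str) -> pd.Series:
    """Correlation of each numeric column with `target`, strongest first."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # Order by absolute strength while keeping the sign of each value
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Synthetic example data (made up for illustration)
demo = pd.DataFrame({
    "median_income": [1, 2, 3, 4, 5],
    "households": [5, 3, 4, 1, 2],
    "Price": [100, 210, 290, 405, 500],
})

ranked = rank_correlations(demo, "Price")
print(ranked)
```

Calling `rank_correlations(df, 'Price')` on the real data would give the same information as the `Price` row of the heatmap, in sorted form.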

#### 4. Pattern Identification: Income vs Price (Visual Type 3)

Based on our heatmap, `median_income` appears to be the feature most strongly associated with `Price`. We create a scatter plot with a regression trend line to visualize this relationship and see how tightly the data follows a linear pattern.

In [None]:
# Analyze relationship between income and price with a trend line
plot_scatter_with_trend(df, 'median_income', 'Price')
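
How tightly the points follow the trend line can be quantified with the slope and the coefficient of determination. The sketch below is an illustration only: `linear_trend` is a hypothetical helper that fits an ordinary least-squares line with `numpy.polyfit`, which may differ from what `plot_scatter_with_trend` does internally, and the demo values are synthetic.

```python
import numpy as np
import pandas as pd


def linear_trend(frame: pd.DataFrame, x_col: str, y_col: str) -> dict:
    """Fit y = slope * x + intercept by least squares and report R^2."""
    x = frame[x_col].to_numpy(dtype=float)
    y = frame[y_col].to_numpy(dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return {
        "slope": float(slope),
        "intercept": float(intercept),
        "r_squared": 1.0 - ss_res / ss_tot,
    }


# Synthetic data lying exactly on y = 2x + 1 (made up for illustration)
demo = pd.DataFrame({"median_income": [1, 2, 3, 4], "Price": [3, 5, 7, 9]})

fit = linear_trend(demo, "median_income", "Price")
print(fit)
```

An `r_squared` close to 1 means the line explains most of the variation. Real housing data will typically fall well below that, which is what the scatter around the trend line shows visually.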

#### 5. Outlier and Spread Analysis (Visual Type 4)

We use box plots to analyze the spread and identify anomalies in our **room counts** and **household metrics**. Even after cleaning, visualizing the quartiles helps us understand the typical range of these per-district counts in the California housing data.

In [None]:
# Outlier analysis for room and household metrics
plot_outlier_boxplot(df, ['total_rooms', 'total_bedrooms', 'households'])
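
The points a box plot draws beyond its whiskers can also be counted directly. This sketch assumes the conventional 1.5 × IQR whisker rule. Whether `plot_outlier_boxplot` uses the same rule is not confirmed by the source. The helper name `iqr_outlier_counts` is hypothetical and the demo data is synthetic.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, columns: list, k: float = 1.5) -> dict:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    counts = {}
    for col in columns:
        q1 = frame[col].quantile(0.25)
        q3 = frame[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        outside = (frame[col] < lower) | (frame[col] > upper)
        counts[col] = int(outside.sum())
    return counts


# Synthetic example data with one obvious outlier in total_rooms
demo = pd.DataFrame({
    "total_rooms": [10, 12, 11, 13, 12, 11, 100],
    "households": [4, 5, 4, 5, 4, 5, 4],
})

counts = iqr_outlier_counts(demo, ["total_rooms", "households"])
print(counts)
```

Running the same helper on `df` with `['total_rooms', 'total_bedrooms', 'households']` would show how many districts still fall outside the whiskers after cleaning.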

#### 6. Multivariate Analysis (Visual Type 5)

To satisfy the multivariate analysis requirement, we analyze how the distribution of `Price` varies across a category. Because `ocean_proximity` was encoded in Notebook 02, the original categorical column is no longer available, so we group by one of the resulting binary indicator columns instead.

In [None]:
# Distribution density of Price across a specific encoded category
# We check the INLAND category as it often shows a distinct price difference
plot_violin_by_category(df, 'ocean_proximity_INLAND', 'Price')
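
The comparison shown by the violin plot can be summarized in a small table of per-group statistics. This is a minimal sketch, not part of the original notebook: `price_by_flag` is a hypothetical helper, and the demo frame is synthetic, with `1` standing for inland rows.

```python
import pandas as pd


def price_by_flag(frame: pd.DataFrame, flag_col: str, value_col: str) -> pd.DataFrame:
    """Row count, median, and standard deviation of `value_col` per group."""
    return frame.groupby(flag_col)[value_col].agg(["count", "median", "std"])


# Synthetic example data (made up for illustration)
demo = pd.DataFrame({
    "ocean_proximity_INLAND": [0, 0, 0, 1, 1, 1],
    "Price": [300, 400, 500, 150, 160, 170],
})

summary = price_by_flag(demo, "ocean_proximity_INLAND", "Price")
print(summary)
```

On the real data, `price_by_flag(df, 'ocean_proximity_INLAND', 'Price')` gives the numbers behind any claim about median price and spread for inland versus non-inland districts.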

#### 7. Key Insights Summary

Based on the visualizations above, we have identified the following patterns:

* **Strongest Predictor:** `median_income` shows the strongest positive correlation with `Price` among the numerical features, making it the leading candidate predictor for our model.
* **Geographical Impact:** The violin plot indicates that "Inland" properties have a noticeably lower median price and a narrower price spread than properties closer to the coast.
* **Derived Feature Value:** Our engineered feature `Rooms_Per_Household` shows a moderate relationship with `Price`, offering a more informative signal than the raw `total_rooms` count.
* **Price Capping:** The distribution analysis shows a slight accumulation at the upper end of the price scale, which is consistent with the capping (top-coding) of house values in the original California housing data.
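
The price-capping observation can be checked by measuring how many rows sit exactly at the maximum value. The sketch below is an illustration under assumptions: `share_at_max` is a hypothetical helper, and the demo series is synthetic.

```python
import pandas as pd


def share_at_max(series: pd.Series) -> float:
    """Fraction of rows whose value equals the series maximum."""
    return float((series == series.max()).mean())


# Synthetic prices where two of five rows sit at the cap
demo_price = pd.Series([100, 200, 300, 500, 500], dtype=float)

share = share_at_max(demo_price)
print(share)
```

For an uncapped continuous variable this share is usually tiny, so a noticeably larger value for `share_at_max(df['Price'])` would support the capping interpretation.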