## 📊 Lecture 2-1: EDA & Visualization (Lecture + Demo)

This notebook demonstrates **Exploratory Data Analysis (EDA) and Visualization** through a structured, step-by-step workflow using **code cells**, each paired with a detailed markdown explanation.

Each markdown cell explains:
- **What** the corresponding code cell does  
- **Why** the step matters in EDA and data mining  
- **What to look for** in the resulting output or visualization  

### Datasets Used (No Internet Required)
- **Iris Dataset**: small, labeled dataset suitable for class-wise comparisons and basic statistical visualization  
- **California Housing Dataset**: larger, real-world dataset suitable for dense plots, trends, spatial patterns, and uncertainty (scikit-learn downloads it on the first `fetch_california_housing` call and caches it locally, so later runs work offline)  

### How to Use This Notebook
- Run the notebook **top to bottom** to ensure variables and figures are created in the correct order  
- If a plot appears **too dense**, lower the opacity (use a smaller `alpha`) or visualize a **subset of the data**  
- Both techniques are demonstrated to illustrate best practices for scalable visualization  
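
The subsetting idea can be sketched as follows. This is a minimal illustration on synthetic data, not a cell from the lecture: `df_big` and `df_sub` are hypothetical names, and the sample size of 2,000 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a large dataset (values are illustrative only)
rng = np.random.default_rng(42)
df_big = pd.DataFrame({
    "x": rng.normal(size=20_000),
    "y": rng.normal(size=20_000),
})

# Draw a reproducible random subset; random_state fixes which rows are chosen
df_sub = df_big.sample(n=2_000, random_state=42)

print(df_big.shape, "->", df_sub.shape)
```

Plotting `df_sub` instead of `df_big` keeps the overall shape of the point cloud while reducing overplotting; fixing `random_state` means every rerun shows the same points.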

This notebook is designed to support **live lecture demos**, **guided self-study**, and **post-lecture review**.

---
## 📘 Lecture Outline: Visualization Methods

- Environment setup and reproducibility
- Data inspection and summary statistics
- Visualization with **Matplotlib** (basic plots)
- Visualization with **Seaborn** (statistical plots)
- Distribution, relationship, and spatial visualizations
- Correlation and multivariate analysis


---
## 🎯 Learning Objectives

- **Understand** the purpose of EDA and visualization in data analysis  
- **Apply** Matplotlib and Seaborn to visualize data patterns  
- **Analyze** data quality and feature relationships using visualizations  
- **Evaluate** insights from EDA to guide preprocessing decisions  


---
## 🐧 Python on Campus Linux Machines (Conda Environment Setup)

Python and Conda are already installed on the campus Linux machines.  
For this course, everyone will use a **single shared conda environment** defined in  
`dmenvsp26.yaml`.  

Using the same environment ensures:
- identical package versions for all students
- reproducible results
- compatibility with instructor demos and grading


---

## 1) Open a Terminal (Local or SSH)

You can complete the setup in two ways:

- **On a campus Linux workstation:** open the **Terminal**
- **Via SSH** from your laptop into the campus servers:
  - `guardian.it.mtu.edu`
  - `colossus.it.mtu.edu`

✅ After logging in, you should be in your **home directory** (`~`).


---

## 2) Ensure `dmenvsp26.yaml` Is in Your Home Directory

Make sure the file **`dmenvsp26.yaml`** is available in your Linux home directory  
(or in a subfolder within it).

The environment creation command must be run from the directory
where `dmenvsp26.yaml` is located.


---

## 3) Create the Conda Environment (Run Once)

Navigate to the directory containing `dmenvsp26.yaml`, then run:

```bash
conda env create -n dmsp26 --file dmenvsp26.yaml
```


---

## 4) Verify and Activate the Conda Environment

After creating the environment, confirm that `dmsp26` appears in the list of environments, then activate it:

```bash
conda env list
conda activate dmsp26
```



---

### ✅ Jupyter Kernel Integration

## 5) Add the Environment as a Jupyter Kernel

To use this environment inside **Jupyter Notebook** or **JupyterLab**, register it as a kernel while the environment is active:

```bash
python -m ipykernel install --user --name=dmsp26
```


---


## 6) Deactivate and Manage Disk Space

When you are finished working, deactivate the environment:

```bash
conda deactivate
```


---

## 🧹 Saving Disk Space with Conda (Recommended)

Creating the environment installs approximately **4 GB** of packages.
Conda also stores cached tarballs and unused packages, which can consume your disk quota over time.

You can safely free disk space by removing the cached tarballs:

```bash
conda clean -t
```


---
# EDA & Visualization

## 1) Imports + global plotting helper  
**Purpose:** Load core libraries and define a small helper (`show()`) so every plot renders cleanly.  
**Why:** Consistent visuals (tight layout, predictable rendering) reduce "plot noise" during EDA.  
**Look for:** A printed confirmation of whether SciPy is available (it enables true KDE contours).

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid")

from sklearn.datasets import load_iris, fetch_california_housing
from sklearn.preprocessing import StandardScaler

# Optional SciPy (for KDE)
try:
    from scipy.stats import gaussian_kde
    HAS_SCIPY = True
except ImportError:  # SciPy is optional; only a failed import should be caught
    HAS_SCIPY = False

def show():
    plt.tight_layout()
    plt.show()

print("SciPy available for KDE:", HAS_SCIPY)


### 📥 Data Loading & Initial Inspection

This cell loads the dataset into a pandas DataFrame and performs a first inspection. The goal is to confirm the dataset shape, feature names, and basic structure before any preprocessing or visualization is attempted. Early inspection helps identify schema mismatches and prevents downstream errors.

## 2) Load Iris dataset into a DataFrame  
**Purpose:** Convert scikit-learn Iris into a tidy pandas DataFrame and map class labels.  
**Why:** DataFrames make EDA easier (describe, groupby, plotting).  
**Look for:** 4 numeric features + a `species` label column.

In [None]:
iris = load_iris(as_frame=True)
df_iris = iris.frame.copy()
df_iris["species"] = df_iris["target"].map(dict(enumerate(iris.target_names)))
df_iris = df_iris.drop(columns=["target"])

df_iris.head()
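
The label-mapping step above can be looked at in isolation. The sketch below assumes small stand-in values (`target_names` and `codes` are hypothetical here, mimicking `iris.target_names` and the integer `target` column) and shows what `dict(enumerate(...))` produces before `.map()` applies it.

```python
import pandas as pd

# Hypothetical stand-ins for iris.target_names and the integer target column
target_names = ["setosa", "versicolor", "virginica"]
codes = pd.Series([0, 2, 1, 0], name="target")

# enumerate() pairs each position with its name; dict() turns the pairs into a lookup table
code_to_name = dict(enumerate(target_names))

# map() replaces every integer code with the matching species name
species = codes.map(code_to_name)

print(code_to_name)
print(species.tolist())
```

Any code that is missing from the lookup table would become `NaN`, so a quick `species.isna().sum()` is a cheap way to confirm the mapping covered every label.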

### 🧩 Missing Value Analysis

This cell examines the presence and proportion of missing values in each feature. Understanding missingness is critical in data mining because it determines whether imputation, removal, or specialized models are required. Here we quantify missingness before applying any correction strategy.

## 3) Structural checks: shape, types, missingness  
**Purpose:** Quick "sanity check" before any statistics/plots.  
**Why:** Missing values, wrong dtypes, or unexpected shapes can invalidate plots.  
**Look for:** No missing values; numeric dtypes for features.

In [None]:
print("Shape:", df_iris.shape)
print("Dtypes:", df_iris.dtypes)
print("Missing values per column:", df_iris.isna().sum())
df_iris.sample(5, random_state=7)
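
Iris has no missing values, so the proportion of missingness is easier to see on a toy example. The sketch below uses a hypothetical frame `df_toy` with deliberately injected gaps; the column names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately injected gaps (values are illustrative, not from Iris)
df_toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# isna().sum() counts gaps per column; isna().mean() gives the fraction of rows that are missing
missing = pd.DataFrame({
    "n_missing": df_toy.isna().sum(),
    "pct_missing": df_toy.isna().mean() * 100,
}).sort_values("pct_missing", ascending=False)

print(missing)
```

Sorting by percentage puts the columns that need the most attention at the top, which is helpful once a dataset has dozens of features.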

## 4) Summary statistics (describe)  
**Purpose:** Get mean/std/min/max and quartiles quickly.  
**Why:** Baseline understanding of spread + potential anomalies.  
**Look for:** A much larger standard deviation for petal length than for the sepal features; the species-wise plots later show that this spread comes from differences between species.

In [None]:
df_iris.describe()

### 📦 Boxplot Analysis

Boxplots summarize feature distributions using quartiles and highlight potential outliers. They are particularly useful for comparing scale and spread across multiple features in a compact form.

## 5) Five-number summary + IQR (manual)  
**Purpose:** Compute min, Q1, median, Q3, max and IQR explicitly.  
**Why:** IQR is robust and is used for outlier detection (boxplot logic).  
**Look for:** Petal length has the widest IQR of the four features, while sepal width has the narrowest.

In [None]:
num_cols_iris = df_iris.select_dtypes(include="number").columns
five_num = df_iris[num_cols_iris].quantile([0, 0.25, 0.5, 0.75, 1]).T
five_num.columns = ["min", "Q1", "median", "Q3", "max"]
five_num["IQR"] = five_num["Q3"] - five_num["Q1"]
five_num

### 📊 Distribution Visualization

This visualization explores the empirical distribution of a feature using histograms and/or KDE curves. Distributional analysis reveals skewness, outliers, and potential transformations (e.g., log-scaling) needed for modeling.

## 6) Histograms for all Iris features  
**Purpose:** Visualize distributions feature-by-feature.  
**Why:** Histograms reveal skew, multimodality, and range at a glance.  
**Look for:** Petal features often show multi-modal structure due to class mixture.

In [None]:
df_iris[num_cols_iris].hist(bins=20, figsize=(12, 7))
show()

## 7) Boxplots for all Iris features  
**Purpose:** Compare spread + detect outliers (via whiskers).  
**Why:** Compact distribution summary; useful for quick screening.  
**Look for:** Outliers (points beyond whiskers) and different spreads across features.

In [None]:
feature_cols = df_iris.select_dtypes(include="number").columns

iris_long = df_iris.melt(
    id_vars=None,
    value_vars=feature_cols,
    var_name="feature",
    value_name="value"
)

plt.figure(figsize=(10, 4))
sns.boxplot(data=iris_long, x="feature", y="value")
plt.xticks(rotation=20)
plt.title("Iris: Boxplots of Numeric Features (Seaborn)")
show()


## 8) Violin plots for all Iris features  
**Purpose:** Show distribution shape (density) + central tendency.  
**Why:** Violin plots can reveal multimodality that boxplots hide.  
**Look for:** Wider areas indicate higher density; compare shapes across features.

In [None]:
# Melt only the numeric feature columns so the species labels never end up in `value`
iris_long = df_iris.melt(
    value_vars=df_iris.select_dtypes(include="number").columns,
    var_name="feature",
    value_name="value",
)

plt.figure(figsize=(10, 4))
sns.violinplot(data=iris_long, x="feature", y="value", inner="quartile", cut=0)
plt.xticks(rotation=20)
plt.title("Iris: Violin Plots of Numeric Features (Seaborn)")
show()


## 9) Overlaid histograms by species (petal length)  
**Purpose:** Compare a feature distribution across classes.  
**Why:** Class-wise separation suggests predictability for classification.  
**Look for:** Setosa is typically well-separated on petal length.

In [None]:
feature = "petal length (cm)"
plt.figure(figsize=(10, 4))
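# hue="species" draws one overlaid histogram (and KDE curve) per class
sns.histplot(data=df_iris, x=feature, hue="species", bins=30, stat="count", kde=True)
plt.title("Iris: Petal Length Histograms by Species (Seaborn)")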
plt.xlabel(feature); plt.ylabel("count")
show()


## 10) KDE (density) view for petal length (SciPy if available)  
**Purpose:** Smooth distribution estimate.  
**Why:** KDE avoids binning artifacts from histograms.  
**Look for:** Peaks that correspond to class clusters; if the KDE call fails, the fallback draws a plain density histogram.

In [None]:
x = df_iris[feature].values

plt.figure(figsize=(10, 4))
# Try Seaborn's KDE first; if it raises, fall back to a density histogram without a KDE overlay
try:
    sns.kdeplot(x=x, fill=True)
    plt.title("Iris: KDE Density Estimate (Seaborn)")
    plt.xlabel(feature); plt.ylabel("density")
except Exception:
    sns.histplot(x=x, bins=30, stat="density")
    plt.title("Iris: Density Histogram (fallback)")
    plt.xlabel(feature); plt.ylabel("density")

show()
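
To make the idea of a kernel density estimate concrete, here is a minimal NumPy sketch of a one-dimensional Gaussian KDE. It is not how Seaborn or SciPy implement it internally; the function name `gaussian_kde_1d`, the synthetic bimodal sample, and the use of Scott's rule of thumb for the bandwidth are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n = samples.size
    if bandwidth is None:
        # Scott's rule of thumb for one-dimensional data (an assumed default)
        bandwidth = samples.std(ddof=1) * n ** (-1 / 5)
    # One standard-normal bump per sample, evaluated at every grid point
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Average the bumps and rescale by the bandwidth so the curve integrates to 1
    return kernels.sum(axis=1) / (n * bandwidth)

rng = np.random.default_rng(0)
# Synthetic bimodal sample, loosely imitating a mixture of two classes
samples = np.concatenate([rng.normal(1.5, 0.2, 50), rng.normal(5.0, 0.8, 100)])
grid = np.linspace(-2, 10, 600)
density = gaussian_kde_1d(samples, grid)

# Riemann-sum check: the area under a density curve should be close to 1
area = density.sum() * (grid[1] - grid[0])
print("approximate area under the curve:", round(area, 3))
```

A smaller bandwidth produces a bumpier curve that follows individual points, while a larger one smooths the two modes together, which is the KDE analogue of choosing the number of histogram bins.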


## 11) Covariance matrix (Iris)  
**Purpose:** Quantify joint variability between features.  
**Why:** Covariance depends on scale; we typically prefer correlation for comparability.  
**Look for:** Larger covariance magnitudes for features with larger scales.

In [None]:
df_iris[num_cols_iris].cov()

## 12) Correlation matrix (Iris)  
**Purpose:** Scale-free measure of linear dependence (-1 to +1).  
**Why:** Helps spot redundant features or strong linear relationships.  
**Look for:** Strong positive correlation between petal length and petal width.

In [None]:
corr_iris = df_iris[num_cols_iris].corr(method="pearson")
corr_iris
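
The scale dependence of covariance, and the scale invariance of correlation, can be checked numerically. The sketch below uses synthetic measurements (the names `length_cm`, `width_cm`, and `length_mm` are hypothetical) and simply converts one variable from centimeters to millimeters.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic "length" and "width" measurements in centimeters (illustrative values)
length_cm = rng.normal(5.0, 1.0, 200)
width_cm = 0.4 * length_cm + rng.normal(0.0, 0.2, 200)

# The same lengths expressed in millimeters
length_mm = length_cm * 10

cov_cm = np.cov(length_cm, width_cm)[0, 1]
cov_mm = np.cov(length_mm, width_cm)[0, 1]
corr_cm = np.corrcoef(length_cm, width_cm)[0, 1]
corr_mm = np.corrcoef(length_mm, width_cm)[0, 1]

# Pearson r is the covariance divided by the product of the standard deviations
r_manual = cov_cm / (length_cm.std(ddof=1) * width_cm.std(ddof=1))

print("covariance ratio (mm / cm):", cov_mm / cov_cm)
print("correlation unchanged:", np.isclose(corr_cm, corr_mm))
print("manual r matches corrcoef:", np.isclose(r_manual, corr_cm))
```

Changing units multiplies the covariance by the same factor but leaves the correlation untouched, which is why correlation is the better choice for comparing feature pairs measured on different scales.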

### 🔥 Correlation Heatmap

This heatmap visualizes pairwise correlations between numeric features. It is a key step in multivariate EDA, helping identify redundant features, strong predictors, and multicollinearity risks.

## 13) Correlation heatmap (Iris)  
**Purpose:** Visual encoding of correlation magnitudes.  
**Why:** Heatmaps make patterns easier to see than tables.  
**Look for:** Blocks of high correlation (bright/dark regions).

In [None]:
plt.figure(figsize=(6, 5))
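# center=0 with vmin/vmax pins the diverging palette so positive and negative correlations are comparable
sns.heatmap(corr_iris, annot=True, fmt=".2f", cmap="vlag", center=0, vmin=-1, vmax=1, square=True, cbar=True)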
plt.title("Iris: Correlation Heatmap (Seaborn)")
show()


### 🔗 Relationship Exploration

Scatter plots visualize pairwise relationships between variables. They help identify correlations, nonlinear patterns, and heteroscedasticity, which inform feature selection and model choice.

## 14) Scatter plot with class color (Iris)  
**Purpose:** Visualize x-y relationships and class separability.  
**Why:** Scatter plots reveal clusters/outliers/non-linearity.  
**Look for:** Clear separation among species for petal measurements.

In [None]:
x_col, y_col = "petal length (cm)", "petal width (cm)"
plt.figure(figsize=(7, 5))

for sp in df_iris["species"].unique():
    sub = df_iris[df_iris["species"] == sp]
    plt.scatter(sub[x_col], sub[y_col], label=sp, alpha=0.8)

plt.xlabel(x_col)
plt.ylabel(y_col)
plt.title("Iris: Scatter Plot Colored by Species")
plt.legend()
show()

## 15) Scatter matrix (pairwise)  
**Purpose:** Pairwise relationships among all features.  
**Why:** One of the fastest ways to see separation + correlation patterns.  
**Look for:** Which feature pairs show the cleanest class separation.

In [None]:
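# Color points by species so class separation is visible in every panel
species_codes = df_iris["species"].astype("category").cat.codes
pd.plotting.scatter_matrix(
    df_iris[num_cols_iris], figsize=(10, 10), diagonal="hist", c=species_codes, alpha=0.8
)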
show()

## 16) IQR-based outlier counts (Iris)  
**Purpose:** Count potential outliers for each feature using the 1.5×IQR rule.  
**Why:** Outliers can bias means, correlations, and models.  
**Look for:** Features with more outliers (could be measurement noise or rare cases).

In [None]:
outlier_counts = {}
for c in num_cols_iris:
    q1, q3 = df_iris[c].quantile([0.25, 0.75])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outlier_counts[c] = ((df_iris[c] < low) | (df_iris[c] > high)).sum()

pd.Series(outlier_counts).sort_values(ascending=False)
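
The same rule can be wrapped in a small reusable helper. This is a sketch rather than part of the lecture code: `iqr_outlier_mask` is a hypothetical name, and the five-value series is a toy input chosen so the result is easy to verify by hand.

```python
import pandas as pd

def iqr_outlier_mask(s, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Toy input: Q1 = 2, Q3 = 4, IQR = 2, so the fences are -1 and 7
values = pd.Series([1, 2, 3, 4, 100])
mask = iqr_outlier_mask(values)

print(values[mask].tolist())  # [100]
```

Returning a mask instead of a count makes it possible to inspect the flagged rows directly before deciding whether they are errors or genuine rare cases.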

## 17) Build Anscombe-style datasets (summary stats)  
**Purpose:** Show that identical summary stats can hide different patterns.  
**Why:** Reinforces: "Always plot your data."  
**Look for:** Similar means/variances/correlations across datasets.

In [None]:
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.0, 6.9, 7.6, 8.8, 8.3, 9.9, 7.2, 4.3, 10.8, 4.8, 5.7])
y2 = np.array([9.1, 8.1, 8.7, 7.8, 9.3, 8.8, 6.1, 3.1, 9.1, 7.3, 4.7])
y3 = np.array([7.5, 6.8, 12.7, 7.1, 7.8, 8.8, 6.1, 5.4, 8.2, 6.4, 5.7])
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
y4 = np.array([6.6, 5.8, 7.7, 8.8, 7.1, 6.4, 5.7, 12.5, 5.6, 7.9, 6.9])

datasets = [("I", x, y1), ("II", x, y2), ("III", x, y3), ("IV", x4, y4)]

stats = []
for name, xv, yv in datasets:
    stats.append({
        "set": name,
        "mean_x": xv.mean(),
        "mean_y": yv.mean(),
        # ddof=1 gives the sample variance, matching the pandas default
        "var_x": xv.var(ddof=1),
        "var_y": yv.var(ddof=1),
        "corr": np.corrcoef(xv, yv)[0, 1]
    })

pd.DataFrame(stats)

## 18) Plot the Anscombe-style datasets  
**Purpose:** Visual proof that the datasets differ substantially.  
**Why:** Prevents false confidence in correlation/mean alone.  
**Look for:** Curvature, outliers, vertical-line pattern, etc.

In [None]:
plt.figure(figsize=(10, 7))
for i, (name, xv, yv) in enumerate(datasets, start=1):
    plt.subplot(2, 2, i)
    plt.scatter(xv, yv)
    m, b = np.polyfit(xv, yv, 1)
    xx = np.linspace(xv.min(), xv.max(), 100)
    plt.plot(xx, m * xx + b)
    plt.title(f"Dataset {name}")
    plt.xlabel("x"); plt.ylabel("y")
show()
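
The plots can be backed up numerically by comparing the fitted lines with the residuals they leave behind. The sketch below reuses the `x`, `y1`, and `y3` arrays from the notebook (datasets I and III); the `fits` dictionary is a hypothetical helper introduced only for this illustration.

```python
import numpy as np

# Datasets I and III from the notebook's Anscombe-style arrays
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.0, 6.9, 7.6, 8.8, 8.3, 9.9, 7.2, 4.3, 10.8, 4.8, 5.7])
y3 = np.array([7.5, 6.8, 12.7, 7.1, 7.8, 8.8, 6.1, 5.4, 8.2, 6.4, 5.7])

fits = {}
for name, yv in [("I", y1), ("III", y3)]:
    # Least-squares line, then the residual of every point from that line
    slope, intercept = np.polyfit(x, yv, 1)
    residuals = yv - (slope * x + intercept)
    fits[name] = {
        "slope": slope,
        "intercept": intercept,
        "max_abs_residual": np.abs(residuals).max(),
    }
    print(f"Dataset {name}: slope={slope:.2f}, intercept={intercept:.2f}, "
          f"largest |residual|={fits[name]['max_abs_residual']:.2f}")
```

The two slopes are nearly identical, but dataset III has one residual that is far larger than anything in dataset I, which is the single outlier that the scatter plot makes obvious at a glance.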

## 19) Load California Housing (real-world dataset)  
**Purpose:** Move to a larger dataset to demonstrate dense plots, binning, uncertainty, and practical issues.  
**Why:** Real data introduces skew, heavy tails, and complex relationships.  
**Look for:** Numeric features + target `median_house_value`.

In [None]:
# Required access method:
data = fetch_california_housing(as_frame=True)

X = data.data  # pandas DataFrame of shape (20640, 8)
y = data.target  # pandas Series

# Give the target a proper name
y = y.rename("median_house_value")

# Concatenate column-wise
df_house = pd.concat([X, y], axis=1)

df_house.head()

In [None]:
df_house.shape

In [None]:
df_house.columns

## 20) Housing structural checks  
**Purpose:** Confirm shape, missing values, and basic summary.  
**Why:** Avoid plotting wrong columns or misreading scales.  
**Look for:** Typically no missing values in this dataset.

In [None]:
print("Shape:", df_house.shape)
print("Missing values (top 10):", df_house.isna().sum().sort_values(ascending=False).head(10))
df_house.describe().T.head(10)

## 21) Target distribution (histogram)  
**Purpose:** Visualize the distribution of house values.  
**Why:** Detect skew, censoring, and long tails.  
**Look for:** A heavy right tail; potential "cap" at the high end.

In [None]:
target = "median_house_value"  # the renamed target column, not the MedInc feature
plt.figure(figsize=(10, 4))
sns.histplot(data=df_house, x=target, bins=40, stat="count")
plt.title("California Housing: Target Distribution of median_house_value (Seaborn)")
plt.xlabel(target); plt.ylabel("count")
show()


## 22) Log transform demonstration  
**Purpose:** Show how log1p compresses heavy tails.  
**Why:** Helps interpret skewed variables and can linearize relationships.  
**Look for:** More symmetric shape after log transform.

In [None]:
# Select the target column explicitly so the labels below always match the data
x = df_house["median_house_value"].values
x_log = np.log1p(x)

plt.figure(figsize=(10, 4))
plt.hist(x_log, bins=40)
plt.title("California Housing: log1p(median_house_value) Distribution")
plt.xlabel("log1p(median_house_value)"); plt.ylabel("count")
show()
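
The effect of the transform can also be quantified with the sample skewness. The sketch below uses synthetic log-normal data as a stand-in for a heavy-tailed variable; the names `raw` and `logged` and the distribution parameters are assumptions for illustration, not values from the housing data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic heavy-tailed variable (log-normal), standing in for a skewed feature
raw = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5_000))

# log1p(x) = log(1 + x) is safe at x = 0 and compresses large values
logged = np.log1p(raw)

skew_raw = raw.skew()
skew_log = logged.skew()
print(f"skewness before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

A skewness closer to zero after the transform indicates a more symmetric distribution; comparing the two numbers is a quick complement to comparing the two histograms.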

## 23) Scatter: median_income vs median_house_value (raw)  
**Purpose:** Inspect a key predictor relationship.  
**Why:** Scatter shows non-linearities, heteroskedasticity, and censoring.  
**Look for:** Increasing trend + density changes; potential value cap.

In [None]:
plt.figure(figsize=(7, 5))
sns.scatterplot(data=df_house, x="MedInc", y="median_house_value", alpha=0.15, s=15, edgecolor=None)
plt.title("California Housing: median_income vs median_house_value (Seaborn)")
plt.xlabel("median_income"); plt.ylabel("median_house_value")
show()
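
When a scatter plot is this dense, a binned summary can make the trend easier to read. The sketch below is one possible approach on synthetic data: `df_syn`, the linear relationship, and the choice of five quantile bins are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Synthetic predictor and response with a noisy increasing relationship
income = rng.uniform(1, 10, 5_000)
value = 0.5 * income + rng.normal(0, 0.5, 5_000)
df_syn = pd.DataFrame({"income": income, "value": value})

# Quantile bins hold roughly equal numbers of points, so each median is equally reliable
df_syn["income_bin"] = pd.qcut(df_syn["income"], q=5)
trend = df_syn.groupby("income_bin", observed=True)["value"].median()

print(trend)
```

Plotting the per-bin medians on top of a low-`alpha` scatter shows the central trend without hiding how much the individual points vary around it.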


## 24) Correlation matrix + absolute correlation with target  
**Purpose:** Rapidly rank features by linear association with target.  
**Why:** Helps prioritize deeper EDA on top drivers.  
**Look for:** `MedInc` (median income) usually ranks highest; note that correlation doesn't imply causation.

In [None]:
corr_h = df_house.corr()
corr_to_target = corr_h["median_house_value"].drop("median_house_value").abs().sort_values(ascending=False)
corr_to_target
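
Taking the absolute value discards the direction of each relationship. One way to rank by strength while keeping the sign is sketched below on a toy frame; `df_toy` and its column names are hypothetical and chosen so the expected ordering is obvious.

```python
import numpy as np
import pandas as pd

t = np.arange(10, dtype=float)
# Toy frame: one feature is a perfect negative line, the other only weakly related
df_toy = pd.DataFrame({
    "target": t,
    "strong_negative": -2.0 * t,
    "weak_positive": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

corr = df_toy.corr()["target"].drop("target")

# Sort by magnitude, but keep the signed values in the result
ranked = corr.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

The strongest feature appears first with its negative sign intact, so the ranking and the direction of the association can be read from a single table.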

## 25) Correlation heatmap (Housing)  
**Purpose:** Visualize all pairwise correlations.  
**Why:** Spot redundant features and potential multicollinearity.  
**Look for:** Blocks of correlated features (e.g., AveRooms vs AveBedrms).

In [None]:
plt.figure(figsize=(7, 6))
sns.heatmap(corr_h, annot=False, cmap="vlag", center=0, square=False, cbar=True)
plt.title("California Housing: Correlation Heatmap (Seaborn)")
show()
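
Without annotations, strongly correlated pairs have to be read off the colors. A short helper can list them explicitly; the sketch below is one possible approach, where `strong_pairs`, the 0.8 threshold, and the synthetic frame `df_syn` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def strong_pairs(corr, threshold=0.8):
    """List feature pairs whose absolute correlation is at least `threshold`."""
    # Keep only the strict upper triangle so each pair appears once and the diagonal is skipped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    # NaN entries (masked cells) compare as False here, so they are filtered out too
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

rng = np.random.default_rng(3)
a = rng.normal(size=500)
df_syn = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),  # near-duplicate of a
    "c": rng.normal(size=500),                 # independent of a and b
})

pairs = strong_pairs(df_syn.corr())
print(pairs)
```

Applied to the housing correlation matrix, a helper like this would turn the visual "blocks" into a short list of candidate redundant features to examine more closely.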


# See you next lecture!