# Short Lab: Visualizing Data Mining Techniques

This lab is based on **Lecture 7.1 (Data Mining)** and **Lecture 7.2 (Decision Trees)**.

You will practice and visualize:
- Classification and clustering (core data mining tasks)
- Decision Tree splits with **Gini** vs **Entropy**
- Feature importance and model quality
- Missing-data imputation and its effect on model accuracy


## 1. Setup

If you run this in a fresh environment, install dependencies from `requirements.txt` first.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    ConfusionMatrixDisplay,
    adjusted_rand_score,
    classification_report,
)

RANDOM_STATE = 42
sns.set_theme(style="whitegrid", context="notebook")


## 2. Load Open-Source Dataset

We use the **Wine** dataset from scikit-learn (originally from the UCI Machine Learning Repository). It contains 178 samples, 13 numeric features, and 3 classes.


In [None]:
wine = load_wine(as_frame=True)
df = wine.frame.copy()
df["target_name"] = df["target"].map(lambda x: wine.target_names[x])

print("Shape:", df.shape)
print("Classes:", list(wine.target_names))
df.head()


## 3. Quick EDA (Data Visualization)


In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Class distribution
sns.countplot(data=df, x="target_name", hue="target_name", palette="Set2", ax=axes[0], legend=False)
axes[0].set_title("Class Distribution")
axes[0].set_xlabel("Wine class")
axes[0].set_ylabel("Count")

# Correlation heatmap
corr = df.drop(columns=["target_name"]).corr(numeric_only=True)
sns.heatmap(corr, cmap="coolwarm", center=0, ax=axes[1])
axes[1].set_title("Feature Correlation Heatmap")

plt.tight_layout()
plt.show()


In [None]:
selected_features = ["alcohol", "malic_acid", "flavanoids", "color_intensity", "target_name"]
sns.pairplot(df[selected_features], hue="target_name", corner=True, diag_kind="kde")
plt.suptitle("Feature Relationships by Class", y=1.02)
plt.show()


## 4. Clustering Visualization (K-Means)

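Clustering groups samples by feature similarity without using the class labels. Here K-Means runs on standardized features with `n_clusters=3`, and the Adjusted Rand Index (ARI) measures how closely the clusters match the true classes: 1.0 is a perfect match, and values near 0 mean the agreement is no better than chance.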


In [None]:
X = df.drop(columns=["target", "target_name"])
y = df["target"]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=3, random_state=RANDOM_STATE, n_init=20)
cluster_labels = kmeans.fit_predict(X_scaled)

# Reduce to 2D only for visualization
pca = PCA(n_components=2, random_state=RANDOM_STATE)
X_pca = pca.fit_transform(X_scaled)

ari = adjusted_rand_score(y, cluster_labels)
print(f"Adjusted Rand Index (clusters vs true classes): {ari:.3f}")
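
The ARI printed above compares groupings, not label values, so it does not matter which integer K-Means assigns to which cluster. A minimal sketch on made-up toy labels (the variable names are illustrative and not part of the lab code):

```python
from sklearn.metrics import adjusted_rand_score

# Toy labels, invented for illustration only
true_labels = [0, 0, 1, 1, 2, 2]
same_groups_renamed = [2, 2, 0, 0, 1, 1]  # identical grouping, different ids
one_big_cluster = [0, 0, 0, 0, 0, 0]      # everything lumped together

ari_renamed = adjusted_rand_score(true_labels, same_groups_renamed)
ari_single = adjusted_rand_score(true_labels, one_big_cluster)

print(f"Renamed clusters: {ari_renamed:.3f}")  # perfect agreement
print(f"Single cluster:   {ari_single:.3f}")   # no better than chance
```

This is why cluster ids from K-Means can be compared with `df["target"]` directly, without first matching cluster numbers to class numbers.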


In [None]:
fig, axes = plt.subplots(1, 2, figsize=(13, 5))

sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=df["target_name"], palette="Set2", ax=axes[0])
axes[0].set_title("True Classes (PCA Projection)")
axes[0].set_xlabel("PC1")
axes[0].set_ylabel("PC2")

sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=cluster_labels, palette="tab10", ax=axes[1])
axes[1].set_title("K-Means Clusters (PCA Projection)")
axes[1].set_xlabel("PC1")
axes[1].set_ylabel("PC2")

plt.tight_layout()
plt.show()
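
A 2D PCA projection discards part of the information in the 13 features, so the scatter plots should be read with that loss in mind. The sketch below is self-contained and reloads the data under hypothetical `*_demo` names so it does not overwrite the lab variables. It reports how much variance the first two components retain.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Reload and standardize the features (demo names are illustrative)
X_demo = load_wine().data
X_demo_scaled = StandardScaler().fit_transform(X_demo)

pca_demo = PCA(n_components=2).fit(X_demo_scaled)
ratios = pca_demo.explained_variance_ratio_

print("Variance explained by PC1, PC2:", ratios.round(3))
print(f"Total retained in 2D: {ratios.sum():.1%}")
```

If the retained share is low, clusters that overlap in the plot may still be well separated in the full feature space.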


## 5. Decision Trees: Gini vs Entropy

From Lecture 7.2: compare split criteria and visualize the learned tree.
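
Both criteria measure how mixed the class labels in a node are. The sketch below uses the standard textbook definitions, not scikit-learn's internal code, and the helper names `gini` and `entropy` are illustrative.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Return the share of each class among the labels in a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2)."""
    return 1.0 - sum(p ** 2 for p in class_proportions(labels))

def entropy(labels):
    """Shannon entropy in bits: sum(-p_k * log2(p_k))."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))

pure = ["a"] * 6               # one class only
mixed = ["a"] * 3 + ["b"] * 3  # 50/50 split between two classes

print(gini(pure), entropy(pure))    # both are 0 for a pure node
print(gini(mixed), entropy(mixed))  # 0.5 and 1.0 for a 50/50 node
```

Both measures are zero for a pure node and largest for an even class mix, so they usually rank candidate splits similarly. Entropy has a larger range when there are more than two classes.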


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=RANDOM_STATE, stratify=y
)

def train_and_eval(criterion, depth=4):
    model = DecisionTreeClassifier(
        criterion=criterion,
        max_depth=depth,
        random_state=RANDOM_STATE,
    )
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    acc = accuracy_score(y_test, preds)
    return model, preds, acc

model_gini, preds_gini, acc_gini = train_and_eval("gini", depth=4)
model_entropy, preds_entropy, acc_entropy = train_and_eval("entropy", depth=4)

print(f"Gini accuracy:    {acc_gini:.3f}")
print(f"Entropy accuracy: {acc_entropy:.3f}")


In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

ConfusionMatrixDisplay(confusion_matrix(y_test, preds_gini)).plot(ax=axes[0], colorbar=False)
axes[0].set_title("Confusion Matrix: Gini")

ConfusionMatrixDisplay(confusion_matrix(y_test, preds_entropy)).plot(ax=axes[1], colorbar=False)
axes[1].set_title("Confusion Matrix: Entropy")

plt.tight_layout()
plt.show()

print("Classification report (Entropy):")
print(classification_report(y_test, preds_entropy, target_names=wine.target_names))


In [None]:
plt.figure(figsize=(20, 10))
plot_tree(
    model_entropy,
    feature_names=list(X.columns),        # plain lists work across scikit-learn versions
    class_names=list(wine.target_names),
    filled=True,
    rounded=True,
    fontsize=9,
)
plt.title("Decision Tree Visualization (Entropy, max_depth=4)")
plt.show()


## 6. Model Complexity and Overfitting

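Train trees with `max_depth` from 1 to 10 and compare train and test accuracy. A widening gap between the two curves, with train accuracy approaching 1.0 while test accuracy stalls or drops, indicates overfitting.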


In [None]:
depths = range(1, 11)
train_scores_gini, test_scores_gini = [], []
train_scores_entropy, test_scores_entropy = [], []

for d in depths:
    m_g = DecisionTreeClassifier(criterion="gini", max_depth=d, random_state=RANDOM_STATE)
    m_e = DecisionTreeClassifier(criterion="entropy", max_depth=d, random_state=RANDOM_STATE)

    m_g.fit(X_train, y_train)
    m_e.fit(X_train, y_train)

    train_scores_gini.append(m_g.score(X_train, y_train))
    test_scores_gini.append(m_g.score(X_test, y_test))

    train_scores_entropy.append(m_e.score(X_train, y_train))
    test_scores_entropy.append(m_e.score(X_test, y_test))

plt.figure(figsize=(10, 5))
plt.plot(depths, train_scores_gini, marker="o", label="Train (Gini)")
plt.plot(depths, test_scores_gini, marker="o", label="Test (Gini)")
plt.plot(depths, train_scores_entropy, marker="s", label="Train (Entropy)")
plt.plot(depths, test_scores_entropy, marker="s", label="Test (Entropy)")
plt.xlabel("Max depth")
plt.ylabel("Accuracy")
plt.title("Decision Tree Depth vs Accuracy")
plt.legend()
plt.show()


## 7. Feature Importance


In [None]:
importance = pd.Series(model_entropy.feature_importances_, index=X.columns).sort_values(ascending=False)

plt.figure(figsize=(10, 5))
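# Pass hue explicitly: recent seaborn versions warn when palette is used without hue
sns.barplot(x=importance.values, y=importance.index, hue=importance.index, palette="viridis", legend=False)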
plt.title("Feature Importance (Decision Tree, Entropy)")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.show()

importance.head(10)


## 8. Missing Data and Imputation Techniques

Real datasets often have missing values. Below we simulate missingness and compare common imputation techniques.
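
Before running the comparison, it helps to see what a single imputer does to one column. A minimal sketch on a made-up toy column with one outlier (the array and variable names are illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with one missing value and one outlier (100.0)
toy = np.array([[1.0], [2.0], [np.nan], [100.0]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(toy)
median_filled = SimpleImputer(strategy="median").fit_transform(toy)

print(f"Mean fill:   {mean_filled[2, 0]:.2f}")    # mean of 1, 2, 100
print(f"Median fill: {median_filled[2, 0]:.2f}")  # median of 1, 2, 100
```

The mean is pulled toward the outlier, while the median stays near the typical values. This is one reason the two strategies can give different downstream accuracy.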


In [None]:
# Simulate missing values: each feature cell is dropped independently with probability 0.10
missing_rate = 0.10
rng = np.random.default_rng(RANDOM_STATE)

X_missing = X.copy()
mask = rng.random(X_missing.shape) < missing_rate
X_missing = X_missing.mask(mask)

missing_total = int(X_missing.isna().sum().sum())
missing_pct = 100 * missing_total / X_missing.size
print(f"Missing cells: {missing_total} ({missing_pct:.1f}%)")

X_missing.isna().mean().sort_values(ascending=False)


In [None]:
X_train_miss, X_test_miss, y_train_miss, y_test_miss = train_test_split(
    X_missing, y, test_size=0.25, random_state=RANDOM_STATE, stratify=y
)

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "most_frequent": SimpleImputer(strategy="most_frequent"),
    "knn": KNNImputer(n_neighbors=5),
}

rows = []
for name, imputer in imputers.items():
    X_train_imp = imputer.fit_transform(X_train_miss)
    X_test_imp = imputer.transform(X_test_miss)

    clf = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=RANDOM_STATE)
    clf.fit(X_train_imp, y_train_miss)
    preds = clf.predict(X_test_imp)

    rows.append(
        {
            "imputer": name,
            "accuracy": accuracy_score(y_test_miss, preds),
            "remaining_missing_train": int(np.isnan(X_train_imp).sum()),
            "remaining_missing_test": int(np.isnan(X_test_imp).sum()),
        }
    )

imputation_results = pd.DataFrame(rows).sort_values("accuracy", ascending=False)
imputation_results


In [None]:
feature = "flavanoids"
feature_idx = X.columns.get_loc(feature)

plt.figure(figsize=(10, 5))
sns.kdeplot(X[feature], label="original (no missing)", linewidth=2)

for name, imputer in imputers.items():
    X_imp_full = imputer.fit_transform(X_missing)
    sns.kdeplot(X_imp_full[:, feature_idx], label=f"imputed: {name}", alpha=0.85)

plt.title(f"Distribution Comparison for '{feature}'")
plt.xlabel(feature)
plt.ylabel("Density")
plt.legend()
plt.show()


## 9. Variant Tasks for 10 Students

Assign one variant to each student. Each variant consists of three tasks.

### Variant 1
1. Run K-Means with `n_clusters=2`, `3`, and `4`.
2. Compare ARI values in a small table.
3. Explain which `k` works best and why.

### Variant 2
1. Keep `k=3` and test K-Means with and without `StandardScaler`.
2. Compare ARI and PCA cluster plots.
3. Write a short conclusion about scaling impact.

### Variant 3
1. Build decision trees with `max_depth=2`, `4`, and `8` using `criterion='gini'`.
2. Compare train/test accuracy in one plot.
3. Explain where overfitting starts.

### Variant 4
1. Build decision trees with `criterion='gini'` and `criterion='entropy'` at `max_depth=4`.
2. Compare confusion matrices and macro F1.
3. Decide which criterion is better on this dataset.

### Variant 5
1. Simulate missingness at `10%` and `25%`.
2. Compare `SimpleImputer(strategy="mean")` and `SimpleImputer(strategy="median")`.
3. Report which method is more stable as missingness increases.

### Variant 6
1. Simulate missingness at `15%`.
2. Compare `SimpleImputer(strategy="most_frequent")` vs `KNNImputer`.
3. Evaluate both using decision-tree test accuracy.

### Variant 7
1. After imputation, compare feature distributions for two features (not only `flavanoids`).
2. Show KDE plots for original vs imputed data.
3. Identify which feature is most sensitive to imputation choice.

### Variant 8
1. Train an entropy tree (`max_depth=4`).
2. Extract top-5 most important features.
3. Remove the top-1 feature, retrain, and compare accuracy.

### Variant 9
1. Add 5-fold cross-validation for decision trees (`gini` and `entropy`).
2. Report mean and std accuracy for each criterion.
3. Compare CV results with the single train/test split.
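
Cross-validation is not demonstrated elsewhere in this lab, so here is one possible starting point. It assumes stratified, shuffled folds and `max_depth=4`. Both are illustrative choices that the task does not prescribe, and the names `X_cv`, `y_cv`, and `cv_scores` are hypothetical.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_cv, y_cv = load_wine(return_X_y=True)

# Stratified folds keep the class balance similar in every fold
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = {}
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=4, random_state=42)
    cv_scores[criterion] = cross_val_score(tree, X_cv, y_cv, cv=folds, scoring="accuracy")
    print(
        f"{criterion}: mean={cv_scores[criterion].mean():.3f}, "
        f"std={cv_scores[criterion].std():.3f}"
    )
```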

### Variant 10
1. Create one final comparison table with the best result from clustering, tree tuning, and imputation.
2. Select your recommended pipeline for this dataset.
3. Provide a 5-7 sentence final summary for a non-technical audience.
