
# Lab 3: Decision Tree Experiments  
(repeats items 2–4 from Lab 1)

This notebook covers items **2–4**:
- **2. Building a baseline and evaluating its quality** (sklearn)
- **3. Improving the baseline** (hypotheses → validation → improved baseline)
- **4. Implementing the algorithm** (Decision Tree) **from scratch** + comparisons

## Open datasets loaded by URL (UCI)
- **Classification:** Banknote Authentication  
  `https://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt`
- **Regression:** Auto MPG  
  `https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data`

## Metrics
- Classification: **accuracy**, **F1-macro**, **ROC-AUC**
- Regression: **MAE**, **RMSE**, **R²**
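
To make the metric definitions concrete, here is a minimal pure-Python sketch that computes F1-macro, MAE, RMSE and R² by hand. The helper names `f1_macro` and `regression_metrics` and the toy arrays are made up for illustration; the notebook itself uses the `sklearn.metrics` functions.

```python
import math

def f1_macro(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equally long sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_tot
    return mae, math.sqrt(mse), r2

# Toy data, not taken from the datasets above.
print(f1_macro([0, 0, 1, 1], [0, 1, 1, 1]))
print(regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]))
```

F1-macro weights every class equally regardless of its size, which is why it is reported next to accuracy.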


In [1]:

import numpy as np
import pandas as pd

from dataclasses import dataclass
from typing import Optional, Literal
import inspect

from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold, KFold
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.metrics import (
    accuracy_score, f1_score, roc_auc_score,
    mean_absolute_error, mean_squared_error, r2_score
)

import matplotlib.pyplot as plt

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

def rmse(y_true, y_pred) -> float:
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Version-agnostic OneHotEncoder (avoids errors caused by the sparse -> sparse_output rename)
def make_ohe_dense():
    sig = inspect.signature(OneHotEncoder)
    if "sparse_output" in sig.parameters:
        return OneHotEncoder(handle_unknown="ignore", sparse_output=False)
    return OneHotEncoder(handle_unknown="ignore", sparse=False)

pd.set_option("display.max_columns", 60)


## Loading the data (by URL)

In [2]:

# ===== Banknote Authentication (classification) =====
banknote_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt"
banknote_cols = ["variance", "skewness", "curtosis", "entropy", "class"]
df_cls = pd.read_csv(banknote_url, header=None, names=banknote_cols)

# ===== Auto MPG (regression) =====
auto_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
auto_cols = ["mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year", "origin", "car_name"]
df_reg = pd.read_csv(
    auto_url,
    sep=r"\s+",  # whitespace-separated file; replaces the deprecated delim_whitespace=True
    header=None,
    names=auto_cols,
    na_values="?"
)

display(df_cls.head())
display(df_reg.head())

print("Banknote shape:", df_cls.shape)
print("Auto MPG shape:", df_reg.shape)
print("\nMissing (Auto MPG):")
display(df_reg.isna().sum().to_frame("missing"))




| | variance | skewness | curtosis | entropy | class |
|---|---|---|---|---|---|
| 0 | 3.6216 | 8.6661 | -2.8073 | -0.44699 | 0 |
| 1 | 4.5459 | 8.1674 | -2.4586 | -1.4621 | 0 |
| 2 | 3.866 | -2.6383 | 1.9242 | 0.10645 | 0 |
| 3 | 3.4566 | 9.5228 | -4.0112 | -3.5944 | 0 |
| 4 | 0.32924 | -4.4552 | 4.5718 | -0.9888 | 0 |


| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | car_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 | ford torino |


Banknote shape: (1372, 5)
Auto MPG shape: (398, 9)

Missing (Auto MPG):


| | missing |
|---|---|
| mpg | 0 |
| cylinders | 0 |
| displacement | 0 |
| horsepower | 6 |
| weight | 0 |
| acceleration | 0 |
| model_year | 0 |
| origin | 0 |
| car_name | 0 |



## 2. Building a baseline and evaluating its quality (sklearn)

### 2.1 Train/test split
- Classification: `stratify` by class.
- Regression: a plain random split.
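
As an illustration of what stratification achieves, the sketch below splits indices class by class so that the test set keeps the class shares of the full data. The function `stratified_split` and the toy labels are hypothetical; sklearn's `train_test_split(..., stratify=y)` differs in its details.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with class shares roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within each class
        n_test = round(len(indices) * test_size)   # same share taken from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [0] * 60 + [1] * 40                       # toy labels with a 60/40 balance
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in test_idx))        # the test set keeps the 60/40 balance
```

With a plain random split, a small test set can by chance end up with a noticeably different class balance, which makes class-sensitive metrics such as F1-macro noisier.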


In [3]:

# ===== Classification =====
X_cls = df_cls.drop(columns=["class"]).values
y_cls = df_cls["class"].values

X_cls_train, X_cls_test, y_cls_train, y_cls_test = train_test_split(
    X_cls, y_cls,
    test_size=0.2,
    random_state=RANDOM_STATE,
    stratify=y_cls
)

# ===== Regression =====
# drop car_name (a string feature)
df_reg_base = df_reg.drop(columns=["car_name"]).copy()
X_reg = df_reg_base.drop(columns=["mpg"])
y_reg = df_reg_base["mpg"]

X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(
    X_reg, y_reg,
    test_size=0.2,
    random_state=RANDOM_STATE
)

print("cls train/test:", X_cls_train.shape, X_cls_test.shape)
print("reg train/test:", X_reg_train.shape, X_reg_test.shape)


cls train/test: (1097, 4) (275, 4)
reg train/test: (318, 7) (80, 7)



### 2.2 Baseline: DecisionTreeClassifier and DecisionTreeRegressor

- For classification, the tree does **not** need feature scaling.
- For regression, missing values must be handled (`horsepower` = '?') → median imputation.
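
Median imputation can be sketched in a few lines: the median is learned from the training column only and then reused for the test column. The helpers `fit_median` and `impute` and the toy values are illustrative assumptions; the notebook relies on `SimpleImputer(strategy="median")`.

```python
from statistics import median

def fit_median(column):
    """Median of the observed (non-None) values of a training column."""
    return median(v for v in column if v is not None)

def impute(column, fill_value):
    """Replace missing entries (None) with the given fill value."""
    return [fill_value if v is None else v for v in column]

train_hp = [130.0, 165.0, None, 150.0, 140.0]   # toy values; None marks a missing entry
test_hp = [None, 95.0]

hp_median = fit_median(train_hp)                 # learned from the training data only
print(impute(train_hp, hp_median))
print(impute(test_hp, hp_median))                # the same value is reused for the test data
```

Fitting the imputer on the training part only keeps test-set information out of the model, which is also why the imputer sits inside the `Pipeline`.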


In [4]:

# ===== Baseline: Classification =====
dtc_base = DecisionTreeClassifier(random_state=RANDOM_STATE)
dtc_base.fit(X_cls_train, y_cls_train)

y_cls_pred = dtc_base.predict(X_cls_test)
y_cls_proba = dtc_base.predict_proba(X_cls_test)[:, 1]

cls_metrics_base = {
    "accuracy": accuracy_score(y_cls_test, y_cls_pred),
    "f1_macro": f1_score(y_cls_test, y_cls_pred, average="macro"),
    "roc_auc": roc_auc_score(y_cls_test, y_cls_proba),
}
print("Baseline (classification):", cls_metrics_base)

# ===== Baseline: Regression (simple imputation, origin as numeric) =====
dtr_base = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("model", DecisionTreeRegressor(random_state=RANDOM_STATE)),
])
dtr_base.fit(X_reg_train, y_reg_train)
y_reg_pred = dtr_base.predict(X_reg_test)

reg_metrics_base = {
    "mae": mean_absolute_error(y_reg_test, y_reg_pred),
    "rmse": rmse(y_reg_test, y_reg_pred),
    "r2": r2_score(y_reg_test, y_reg_pred),
}
print("Baseline (regression):", reg_metrics_base)


Baseline (classification): {'accuracy': 0.9927272727272727, 'f1_macro': 0.992645485665383, 'roc_auc': 0.9934640522875817}
Baseline (regression): {'mae': 2.2225, 'rmse': 3.3371769506575464, 'r2': 0.7928680190978783}



## 3. Improving the baseline

### 3.1 Improvement hypotheses

**Classification:**
1. Tuning the pre-pruning hyperparameters (`max_depth`, `min_samples_split`, `min_samples_leaf`) will reduce overfitting and improve test quality.
2. Changing the split criterion (`gini` / `entropy`) may give a gain.

**Regression:**
1. Imputation remains mandatory.
2. `origin` is better treated as a **categorical** feature → one-hot.
3. Tuning `max_depth`, `min_samples_leaf`, `min_samples_split` will reduce overfitting.
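
The one-hot hypothesis can be illustrated with a small sketch. With integer codes a tree can only split `origin` by a threshold (for example `origin <= 1.5`), so isolating the middle code takes two splits; with indicator columns any single category can be separated in one split. The helpers `fit_categories` and `one_hot` are hypothetical stand-ins for `OneHotEncoder`.

```python
def fit_categories(values):
    """Sorted list of the distinct categories seen in the training data."""
    return sorted(set(values))

def one_hot(values, categories):
    """One indicator column per known category.

    Unknown categories become an all-zero row, similar in spirit to
    handle_unknown="ignore".
    """
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]

train_origin = [1, 3, 2, 1]              # toy origin codes
cats = fit_categories(train_origin)
encoded = one_hot([1, 2, 3, 4], cats)    # 4 was never seen during fitting
for row in encoded:
    print(row)
```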


In [5]:

# ===== Improved: Classification (GridSearchCV) =====
cls_param_grid = {
    "max_depth": [None, 2, 3, 4, 5, 7, 10],
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 2, 5, 10],
    "criterion": ["gini", "entropy"],
}

cv_cls = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
cls_search = GridSearchCV(
    DecisionTreeClassifier(random_state=RANDOM_STATE),
    cls_param_grid,
    cv=cv_cls,
    scoring="f1_macro",
    n_jobs=-1
)
cls_search.fit(X_cls_train, y_cls_train)

print("Best params (classification):", cls_search.best_params_)
print("CV best f1_macro:", cls_search.best_score_)

dtc_best = cls_search.best_estimator_
y_cls_pred_best = dtc_best.predict(X_cls_test)
y_cls_proba_best = dtc_best.predict_proba(X_cls_test)[:, 1]

cls_metrics_best = {
    "accuracy": accuracy_score(y_cls_test, y_cls_pred_best),
    "f1_macro": f1_score(y_cls_test, y_cls_pred_best, average="macro"),
    "roc_auc": roc_auc_score(y_cls_test, y_cls_proba_best),
}
print("Improved (classification):", cls_metrics_best)


# ===== Improved: Regression (one-hot origin + GridSearchCV) =====
num_cols = ["cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year"]
cat_cols = ["origin"]

reg_preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), num_cols),
    ("cat", make_ohe_dense(), cat_cols),
], remainder="drop")

reg_pipe = Pipeline([
    ("prep", reg_preprocess),
    ("model", DecisionTreeRegressor(random_state=RANDOM_STATE)),
])

reg_param_grid = {
    "model__max_depth": [None, 2, 3, 4, 5, 7, 10, 15],
    "model__min_samples_split": [2, 5, 10, 20],
    "model__min_samples_leaf": [1, 2, 5, 10],
    "model__criterion": ["squared_error"],  # keep the criterion that the custom implementation also supports
}

cv_reg = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
reg_search = GridSearchCV(
    reg_pipe,
    reg_param_grid,
    cv=cv_reg,
    scoring="neg_root_mean_squared_error",
    n_jobs=-1
)
reg_search.fit(X_reg_train, y_reg_train)

print("Best params (regression):", reg_search.best_params_)
print("CV best (neg RMSE):", reg_search.best_score_)

dtr_best = reg_search.best_estimator_
y_reg_pred_best = dtr_best.predict(X_reg_test)

reg_metrics_best = {
    "mae": mean_absolute_error(y_reg_test, y_reg_pred_best),
    "rmse": rmse(y_reg_test, y_reg_pred_best),
    "r2": r2_score(y_reg_test, y_reg_pred_best),
}
print("Improved (regression):", reg_metrics_best)

compare = pd.DataFrame([
    {"task": "classification", "stage": "baseline", **cls_metrics_base},
    {"task": "classification", "stage": "improved", **cls_metrics_best},
    {"task": "regression", "stage": "baseline", **reg_metrics_base},
    {"task": "regression", "stage": "improved", **reg_metrics_best},
])
display(compare)


Best params (classification): {'criterion': 'entropy', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2}
CV best f1_macro: 0.9917111496228085
Improved (classification): {'accuracy': 0.9927272727272727, 'f1_macro': 0.992645485665383, 'roc_auc': 0.9934640522875817}
Best params (regression): {'model__criterion': 'squared_error', 'model__max_depth': 5, 'model__min_samples_leaf': 10, 'model__min_samples_split': 2}
CV best (neg RMSE): -3.3909546777838755
Improved (regression): {'mae': 1.9866984376910846, 'rmse': 2.7235988230803807, 'r2': 0.862033081756782}


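| | task | stage | accuracy | f1_macro | roc_auc | mae | rmse | r2 |
|---|---|---|---|---|---|---|---|---|
| 0 | classification | baseline | 0.992727 | 0.992645 | 0.993464 | NaN | NaN | NaN |
| 1 | classification | improved | 0.992727 | 0.992645 | 0.993464 | NaN | NaN | NaN |
| 2 | regression | baseline | NaN | NaN | NaN | 2.2225 | 3.337177 | 0.792868 |
| 3 | regression | improved | NaN | NaN | NaN | 1.986698 | 2.723599 | 0.862033 |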



## 4. Implementing a decision tree (from scratch)

Below is a compact implementation of a binary tree:
- **Classification:** `gini` or `entropy` criterion
- **Regression:** `mse` (squared error) criterion

For speed, the best split is found by sorting each feature and using prefix sums (faster than re-evaluating every candidate threshold with boolean masks).
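
For reference, the two classification criteria and the weighted impurity of a split can be worked through on a tiny example. This is a standalone sketch with made-up class counts; the function names here are illustrative and separate from the notebook's `_gini` and `_entropy` helpers.

```python
import math

def gini(counts):
    """Gini impurity 1 - sum(p_k^2) for a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy -sum(p_k * log2 p_k), skipping empty classes."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def weighted_impurity(left, right, impurity):
    """Size-weighted average impurity of the two children of a split."""
    n_left, n_right = sum(left), sum(right)
    return (n_left * impurity(left) + n_right * impurity(right)) / (n_left + n_right)

parent = [5, 5]                  # toy node: 5 samples of each class
left, right = [4, 1], [1, 4]     # class counts after a candidate split
print(gini(parent), entropy(parent))
print(weighted_impurity(left, right, gini))
```

A split is worth making when the weighted impurity of the children is lower than the impurity of the parent, as it is for this candidate.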
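
The prefix-sum idea can be sketched for a single feature in the regression case. After sorting, the running sums of `y` and `y^2` give the squared error of both children at every candidate threshold in constant time, using the identity sum((y - mean)^2) = sum(y^2) - (sum y)^2 / n. The function `best_threshold` and the toy data are illustrative; the notebook's implementation uses NumPy and additionally enforces `min_samples_leaf`.

```python
from itertools import accumulate

def best_threshold(xs, ys):
    """Best split of one feature by weighted MSE, using prefix sums."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    xs_s = [xs[i] for i in order]
    ys_s = [ys[i] for i in order]
    n = len(xs_s)

    pref = list(accumulate(ys_s))                    # running sum of y
    pref2 = list(accumulate(y * y for y in ys_s))    # running sum of y^2
    total, total2 = pref[-1], pref2[-1]

    best_loss, best_thr = float("inf"), None
    for i in range(n - 1):
        if xs_s[i] == xs_s[i + 1]:
            continue                                 # no threshold between equal values
        n_l, n_r = i + 1, n - i - 1
        sse_l = pref2[i] - pref[i] ** 2 / n_l
        sse_r = (total2 - pref2[i]) - (total - pref[i]) ** 2 / n_r
        loss = (sse_l + sse_r) / n                   # weighted MSE of the two children
        if loss < best_loss:
            best_loss, best_thr = loss, (xs_s[i] + xs_s[i + 1]) / 2.0
    return best_thr, best_loss

thr, loss = best_threshold([1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 20.0, 20.0])
print(thr, loss)
```

Per feature this costs one sort plus a linear scan, whereas rebuilding a boolean mask for every threshold makes the scan quadratic in the number of samples.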


In [6]:

def _gini(counts: np.ndarray) -> float:
    # counts: (n_classes,)
    total = counts.sum()
    if total <= 0:
        return 0.0
    p = counts / total
    return float(1.0 - np.sum(p ** 2))

def _entropy(counts: np.ndarray) -> float:
    total = counts.sum()
    if total <= 0:
        return 0.0
    p = counts / total
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def _mse_from_sums(sum_y: float, sum_y2: float, n: int) -> float:
    if n <= 0:
        return 0.0
    mean = sum_y / n
    # E[y^2] - (E[y])^2
    return float(sum_y2 / n - mean * mean)

@dataclass
class _Node:
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["_Node"] = None
    right: Optional["_Node"] = None
    value: Optional[float] = None  # regression: leaf mean; classification: encoded class index (int)

class DecisionTreeClassifierCustom:
    def __init__(self,
                 max_depth: Optional[int] = None,
                 min_samples_split: int = 2,
                 min_samples_leaf: int = 1,
                 criterion: Literal["gini", "entropy"] = "gini"):
        self.max_depth = max_depth
        self.min_samples_split = int(min_samples_split)
        self.min_samples_leaf = int(min_samples_leaf)
        self.criterion = criterion
        self.root_ = None
        self.classes_ = None

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=int)
        self.classes_, y_enc = np.unique(y, return_inverse=True)
        self.root_ = self._build(X, y_enc, depth=0)
        return self

    def _impurity(self, counts):
        return _gini(counts) if self.criterion == "gini" else _entropy(counts)

    def _best_split(self, X, y):
        n, d = X.shape
        n_classes = int(y.max() + 1)

        # class counts of the whole node (used to derive the right-child counts)
        total_counts = np.bincount(y, minlength=n_classes).astype(float)
        best_imp = float("inf")
        best_f = None
        best_thr = None

        for f in range(d):
            xs = X[:, f]
            order = np.argsort(xs, kind="mergesort")
            xs_sorted = xs[order]
            y_sorted = y[order]

            # split candidates exist only where the sorted feature value changes
            diffs = xs_sorted[1:] != xs_sorted[:-1]
            if not np.any(diffs):
                continue

            # prefix sums of the class counts
            left_counts = np.zeros((n - 1, n_classes), dtype=float)
            running = np.zeros(n_classes, dtype=float)
            for i in range(n - 1):
                running[y_sorted[i]] += 1.0
                left_counts[i] = running

            right_counts = total_counts - left_counts

            # min_samples_leaf constraints
            left_n = np.arange(1, n)
            right_n = n - left_n
            valid = diffs & (left_n >= self.min_samples_leaf) & (right_n >= self.min_samples_leaf)
            if not np.any(valid):
                continue

            # impurity for each i (split between positions i and i+1)
            imp_left = np.array([self._impurity(c) for c in left_counts])
            imp_right = np.array([self._impurity(c) for c in right_counts])
            weighted = (left_n * imp_left + right_n * imp_right) / n

            # pick the best valid candidate
            weighted[~valid] = np.inf
            i_best = int(np.argmin(weighted))
            if weighted[i_best] < best_imp:
                best_imp = float(weighted[i_best])
                best_f = f
                best_thr = float((xs_sorted[i_best] + xs_sorted[i_best + 1]) / 2.0)

        return best_f, best_thr

    def _leaf_value(self, y):
        # most frequent class
        counts = np.bincount(y)
        return int(np.argmax(counts))

    def _build(self, X, y, depth):
        n = X.shape[0]

        # stopping criteria: depth limit, too few samples, or a pure node
        if (
            (self.max_depth is not None and depth >= self.max_depth)
            or n < self.min_samples_split
            or len(np.unique(y)) == 1
        ):
            return _Node(value=self._leaf_value(y))

        f, thr = self._best_split(X, y)
        if f is None:
            return _Node(value=self._leaf_value(y))

        left_mask = X[:, f] <= thr
        right_mask = ~left_mask

        if left_mask.sum() < self.min_samples_leaf or right_mask.sum() < self.min_samples_leaf:
            return _Node(value=self._leaf_value(y))

        node = _Node(feature=f, threshold=thr)
        node.left = self._build(X[left_mask], y[left_mask], depth + 1)
        node.right = self._build(X[right_mask], y[right_mask], depth + 1)
        return node

    def _predict_one(self, x, node: _Node):
        while node.feature is not None:
            node = node.left if x[node.feature] <= node.threshold else node.right
        return node.value

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        preds_enc = np.array([self._predict_one(x, self.root_) for x in X], dtype=int)
        return self.classes_[preds_enc]

    def predict_proba(self, X):
        # Simplified proba: leaves store only the predicted class, not the class
        # distribution, so we return hard 0/1 "probabilities" (adequate for a
        # baseline ROC-AUC). Storing per-leaf class counts would give soft scores.
        y_hat = self.predict(X)
        # binary case with labels 0/1
        proba_1 = (y_hat == 1).astype(float)
        return np.vstack([1 - proba_1, proba_1]).T


class DecisionTreeRegressorCustom:
    def __init__(self,
                 max_depth: Optional[int] = None,
                 min_samples_split: int = 2,
                 min_samples_leaf: int = 1):
        self.max_depth = max_depth
        self.min_samples_split = int(min_samples_split)
        self.min_samples_leaf = int(min_samples_leaf)
        self.root_ = None

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        self.root_ = self._build(X, y, depth=0)
        return self

    def _best_split(self, X, y):
        n, d = X.shape
        best_loss = float("inf")
        best_f = None
        best_thr = None

        total_sum = float(np.sum(y))
        total_sum2 = float(np.sum(y ** 2))

        for f in range(d):
            xs = X[:, f]
            order = np.argsort(xs, kind="mergesort")
            xs_sorted = xs[order]
            y_sorted = y[order]

            diffs = xs_sorted[1:] != xs_sorted[:-1]
            if not np.any(diffs):
                continue

            # prefix sums of y and y^2
            prefix_sum = np.cumsum(y_sorted[:-1])
            prefix_sum2 = np.cumsum((y_sorted[:-1]) ** 2)

            left_n = np.arange(1, n)
            right_n = n - left_n

            right_sum = total_sum - prefix_sum
            right_sum2 = total_sum2 - prefix_sum2

            valid = diffs & (left_n >= self.min_samples_leaf) & (right_n >= self.min_samples_leaf)
            if not np.any(valid):
                continue

            left_mse = np.array([_mse_from_sums(prefix_sum[i], prefix_sum2[i], int(left_n[i])) for i in range(n - 1)])
            right_mse = np.array([_mse_from_sums(right_sum[i], right_sum2[i], int(right_n[i])) for i in range(n - 1)])
            weighted = (left_n * left_mse + right_n * right_mse) / n

            weighted[~valid] = np.inf
            i_best = int(np.argmin(weighted))
            if weighted[i_best] < best_loss:
                best_loss = float(weighted[i_best])
                best_f = f
                best_thr = float((xs_sorted[i_best] + xs_sorted[i_best + 1]) / 2.0)

        return best_f, best_thr

    def _leaf_value(self, y):
        return float(np.mean(y)) if y.size else 0.0

    def _build(self, X, y, depth):
        n = X.shape[0]

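        # stopping criteria: depth limit or too few samples to split.
        # Note: `<=` also stops a node with exactly 2 * min_samples_leaf samples,
        # although it could still be split into two valid leaves (`<` would allow that).
        if (
            (self.max_depth is not None and depth >= self.max_depth)
            or n < self.min_samples_split
            or n <= 2 * self.min_samples_leaf
        ):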
            return _Node(value=self._leaf_value(y))

        f, thr = self._best_split(X, y)
        if f is None:
            return _Node(value=self._leaf_value(y))

        left_mask = X[:, f] <= thr
        right_mask = ~left_mask

        if left_mask.sum() < self.min_samples_leaf or right_mask.sum() < self.min_samples_leaf:
            return _Node(value=self._leaf_value(y))

        node = _Node(feature=f, threshold=thr)
        node.left = self._build(X[left_mask], y[left_mask], depth + 1)
        node.right = self._build(X[right_mask], y[right_mask], depth + 1)
        return node

    def _predict_one(self, x, node: _Node):
        while node.feature is not None:
            node = node.left if x[node.feature] <= node.threshold else node.right
        return node.value

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        return np.array([self._predict_one(x, self.root_) for x in X], dtype=float)



### 4.1 Custom models vs. the baseline (item 2)

For a fair comparison:
- **Classification:** raw features (as in item 2).
- **Regression:** median imputation for numeric features (as in item 2).


In [7]:

# ===== Custom baseline: classification =====
custom_dtc = DecisionTreeClassifierCustom(
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    criterion="gini"
).fit(X_cls_train, y_cls_train)

y_cls_pred_c = custom_dtc.predict(X_cls_test)
y_cls_proba_c = custom_dtc.predict_proba(X_cls_test)[:, 1]

cls_metrics_custom_base = {
    "accuracy": accuracy_score(y_cls_test, y_cls_pred_c),
    "f1_macro": f1_score(y_cls_test, y_cls_pred_c, average="macro"),
    "roc_auc": roc_auc_score(y_cls_test, y_cls_proba_c),
}
print("Custom baseline (classification):", cls_metrics_custom_base)

# ===== Custom baseline: regression =====
imp = SimpleImputer(strategy="median")
Xr_train_imp = imp.fit_transform(X_reg_train)
Xr_test_imp = imp.transform(X_reg_test)

custom_dtr = DecisionTreeRegressorCustom(
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1
).fit(Xr_train_imp, y_reg_train.values)

y_reg_pred_c = custom_dtr.predict(Xr_test_imp)

reg_metrics_custom_base = {
    "mae": mean_absolute_error(y_reg_test, y_reg_pred_c),
    "rmse": rmse(y_reg_test, y_reg_pred_c),
    "r2": r2_score(y_reg_test, y_reg_pred_c),
}
print("Custom baseline (regression):", reg_metrics_custom_base)

display(pd.DataFrame([
    {"task": "classification", "model": "sklearn_baseline", **cls_metrics_base},
    {"task": "classification", "model": "custom_baseline", **cls_metrics_custom_base},
    {"task": "regression", "model": "sklearn_baseline", **reg_metrics_base},
    {"task": "regression", "model": "custom_baseline", **reg_metrics_custom_base},
]))


Custom baseline (classification): {'accuracy': 0.9927272727272727, 'f1_macro': 0.992645485665383, 'roc_auc': 0.9934640522875817}
Custom baseline (regression): {'mae': 2.25625, 'rmse': 3.379497003993346, 'r2': 0.7875812643829545}


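| | task | model | accuracy | f1_macro | roc_auc | mae | rmse | r2 |
|---|---|---|---|---|---|---|---|---|
| 0 | classification | sklearn_baseline | 0.992727 | 0.992645 | 0.993464 | NaN | NaN | NaN |
| 1 | classification | custom_baseline | 0.992727 | 0.992645 | 0.993464 | NaN | NaN | NaN |
| 2 | regression | sklearn_baseline | NaN | NaN | NaN | 2.2225 | 3.337177 | 0.792868 |
| 3 | regression | custom_baseline | NaN | NaN | NaN | 2.25625 | 3.379497 | 0.787581 |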



### 4.2 Adding the improved-baseline techniques (item 3) to the custom models

We take the best hyperparameters from GridSearchCV and apply the same transformations:
- Classification: the best `max_depth/min_samples_*` and criterion.
- Regression: one-hot(origin) + imputation + the best `max_depth/min_samples_*`.

> Note: the custom tree's `predict_proba` is simplified (hard 0/1), so its ROC-AUC can be lower than sklearn's whenever the leaves are impure.
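
A small sketch of why hard 0/1 scores can cost ROC-AUC: the metric only depends on how samples are ranked, and hard scores cannot rank samples inside the same predicted class. The per-leaf class counts, the sample-to-leaf assignment and both helper functions below are hypothetical and only illustrate what storing class distributions in the leaves would enable.

```python
def roc_auc_pairwise(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def leaf_proba(class_counts):
    """Share of class 1 among the training samples of a leaf."""
    return class_counts[1] / sum(class_counts)

# Toy pruned tree: training class counts [n_class0, n_class1] per leaf.
leaf_counts = {"A": [9, 1], "B": [3, 7], "C": [4, 6]}
sample_leaf = ["A", "A", "B", "B", "C", "C"]   # leaf reached by each test sample
y_true = [0, 0, 1, 1, 1, 0]

soft = [leaf_proba(leaf_counts[leaf]) for leaf in sample_leaf]
hard = [1.0 if p >= 0.5 else 0.0 for p in soft]

auc_soft = roc_auc_pairwise(y_true, soft)
auc_hard = roc_auc_pairwise(y_true, hard)
print(auc_soft, auc_hard)
```

When every leaf is pure, as in a fully grown tree, the soft scores are themselves 0 or 1 and the two variants coincide.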


In [8]:

# ===== Custom improved: classification (best params) =====
bp = cls_search.best_params_
custom_dtc_best = DecisionTreeClassifierCustom(
    max_depth=bp["max_depth"],
    min_samples_split=bp["min_samples_split"],
    min_samples_leaf=bp["min_samples_leaf"],
    criterion=bp["criterion"]
).fit(X_cls_train, y_cls_train)

y_cls_pred_cb = custom_dtc_best.predict(X_cls_test)
y_cls_proba_cb = custom_dtc_best.predict_proba(X_cls_test)[:, 1]

cls_metrics_custom_improved = {
    "accuracy": accuracy_score(y_cls_test, y_cls_pred_cb),
    "f1_macro": f1_score(y_cls_test, y_cls_pred_cb, average="macro"),
    "roc_auc": roc_auc_score(y_cls_test, y_cls_proba_cb),
}
print("Custom improved (classification):", cls_metrics_custom_improved)


# ===== Custom improved: regression (same preprocess + best params) =====
bp_r = reg_search.best_params_
max_depth = bp_r["model__max_depth"]
min_split = bp_r["model__min_samples_split"]
min_leaf = bp_r["model__min_samples_leaf"]

# Transform the features the same way as in the improved pipeline
X_reg_train_p = reg_preprocess.fit_transform(X_reg_train)
X_reg_test_p = reg_preprocess.transform(X_reg_test)

custom_dtr_best = DecisionTreeRegressorCustom(
    max_depth=max_depth,
    min_samples_split=min_split,
    min_samples_leaf=min_leaf
).fit(X_reg_train_p, y_reg_train.values)

y_reg_pred_cb = custom_dtr_best.predict(X_reg_test_p)

reg_metrics_custom_improved = {
    "mae": mean_absolute_error(y_reg_test, y_reg_pred_cb),
    "rmse": rmse(y_reg_test, y_reg_pred_cb),
    "r2": r2_score(y_reg_test, y_reg_pred_cb),
}
print("Custom improved (regression):", reg_metrics_custom_improved)


summary = pd.DataFrame([
    {"task": "classification", "stage": "sklearn_baseline", **cls_metrics_base},
    {"task": "classification", "stage": "sklearn_improved", **cls_metrics_best},
    {"task": "classification", "stage": "custom_baseline", **cls_metrics_custom_base},
    {"task": "classification", "stage": "custom_improved", **cls_metrics_custom_improved},

    {"task": "regression", "stage": "sklearn_baseline", **reg_metrics_base},
    {"task": "regression", "stage": "sklearn_improved", **reg_metrics_best},
    {"task": "regression", "stage": "custom_baseline", **reg_metrics_custom_base},
    {"task": "regression", "stage": "custom_improved", **reg_metrics_custom_improved},
])
display(summary)


Custom improved (classification): {'accuracy': 0.9927272727272727, 'f1_macro': 0.992645485665383, 'roc_auc': 0.9934640522875817}
Custom improved (regression): {'mae': 2.0115734376910845, 'rmse': 2.7627362883715176, 'r2': 0.858039489563882}


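| | task | stage | accuracy | f1_macro | roc_auc | mae | rmse | r2 |
|---|---|---|---|---|---|---|---|---|
| 0 | classification | sklearn_baseline | 0.992727 | 0.992645 | 0.993464 | NaN | NaN | NaN |
| 1 | classification | sklearn_improved | 0.992727 | 0.992645 | 0.993464 | NaN | NaN | NaN |
| 2 | classification | custom_baseline | 0.992727 | 0.992645 | 0.993464 | NaN | NaN | NaN |
| 3 | classification | custom_improved | 0.992727 | 0.992645 | 0.993464 | NaN | NaN | NaN |
| 4 | regression | sklearn_baseline | NaN | NaN | NaN | 2.2225 | 3.337177 | 0.792868 |
| 5 | regression | sklearn_improved | NaN | NaN | NaN | 1.986698 | 2.723599 | 0.862033 |
| 6 | regression | custom_baseline | NaN | NaN | NaN | 2.25625 | 3.379497 | 0.787581 |
| 7 | regression | custom_improved | NaN | NaN | NaN | 2.011573 | 2.762736 | 0.858039 |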



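## Conclusions (brief, by item)

**Classification:**
1. An unrestricted tree can overfit by memorizing the training set, but on Banknote Authentication the baseline already reaches accuracy 0.9927 and F1-macro 0.9926 on the test set.
2. Tuning `max_depth/min_samples_leaf/min_samples_split` did not help here: GridSearchCV kept the unrestricted settings (`max_depth=None`, `min_samples_leaf=1`, `min_samples_split=2`), and the test metrics are unchanged.
3. The search preferred `entropy` over `gini`, but this made no difference on the test set.
4. The custom implementation follows the CART idea and matches sklearn on all three test metrics under the same tree constraints. In general, small differences can come from implementation details and from the hard 0/1 probabilities.

**Regression:**
1. Without depth limits the tree overfits: the baseline has RMSE 3.34 and R² 0.793 on the test set.
2. Imputation is required because `horsepower` has 6 missing values.
3. `origin` is one-hot encoded because its codes 1/2/3 are categories with no "greater/less" ordering. Its effect was not measured separately from hyperparameter tuning.
4. One-hot encoding plus pre-pruning (`max_depth=5`, `min_samples_leaf=10`) lowers RMSE from 3.34 to 2.72 and MAE from 2.22 to 1.99, and raises R² from 0.793 to 0.862 for sklearn. The custom tree shows a similar gain (RMSE from 3.38 to 2.76).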
