**1. Answering the questions**

**Leave-One-Out Cross-Validation (LOOCV)**

How it works:

From all n observations, pick exactly one for testing and use the remaining n-1 for training.

Train the model on those n-1 observations and test it on the single held-out one.

Repeat the process n times, so that every observation serves as the test point exactly once.

The mean of the metric over all iterations is the estimate of model quality.

Pros:
The data is used as fully as possible, the estimate does not depend on a random or unlucky split, and the error estimate has low bias because each model is trained on almost the whole dataset (n-1 of n observations).

Cons:
It is computationally expensive, since the model has to be trained n times, which can take a long time. The estimate also has high variance: each iteration is scored on a single observation, and the n training sets are almost identical.
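
The procedure above can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: `loocv_mae`, `fit_mean` and `predict_mean` are hypothetical names, and the "model" simply predicts the mean of the training targets.

```python
from statistics import mean


def loocv_mae(xs, ys, fit, predict):
    """Leave-one-out CV: hold out each point once, train on the other n-1."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        errors.append(abs(ys[i] - predict(model, xs[i])))
    # The quality estimate is the average over the n iterations.
    return mean(errors)


# Hypothetical toy model: always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return mean(train_y)


def predict_mean(model, x):
    return model


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
score = loocv_mae(xs, ys, fit_mean, predict_mean)
print(score)
```

The model is fitted four times here, once per observation, which is where the cost of LOOCV comes from on larger datasets.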

**Grid Search**

Define a grid of candidate values for each hyperparameter.

The model is trained with every combination of values from the grid, and the metrics are computed for each combination.

The model with the best combination of parameters, judged by those metrics, is selected.
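
A minimal sketch of that exhaustive loop, assuming a hypothetical `toy_score` function in place of real model training and validation (higher is better):

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every combination in the grid and keep the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical validation score that peaks at alpha=0.1, l1_ratio=0.5.
def toy_score(alpha, l1_ratio):
    return -((alpha - 0.1) ** 2) - (l1_ratio - 0.5) ** 2


grid = {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.25, 0.5, 0.75]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

The number of evaluations is the product of the grid sizes (9 here), so the cost grows quickly with every added parameter.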

**Randomized Grid Search**

Works like Grid Search, but the parameter combinations are sampled at random, and the number of sampled combinations is set by the <i>n_iter</i> parameter.
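
A minimal sketch of the random variant, again with a hypothetical `toy_score` standing in for training and validation. For simplicity it samples with replacement, so the same combination can be drawn more than once.

```python
import random


def random_search(param_space, score_fn, n_iter=20, seed=42):
    """Score n_iter randomly drawn combinations and keep the best one."""
    rng = random.Random(seed)  # fixed seed keeps the search reproducible
    best_params, best_score = None, float("-inf")
    history = []
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = score_fn(**params)
        history.append((params, score))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, history


# Hypothetical validation score that peaks at alpha=0.1, l1_ratio=0.5.
def toy_score(alpha, l1_ratio):
    return -((alpha - 0.1) ** 2) - (l1_ratio - 0.5) ** 2


space = {
    "alpha": [0.001, 0.01, 0.1, 1.0, 10.0],
    "l1_ratio": [0.1, 0.3, 0.5, 0.7, 0.9],
}
best_params, best_score, history = random_search(space, toy_score, n_iter=20)
print(best_params, best_score)
```

Unlike the exhaustive grid, the budget is fixed by `n_iter` regardless of how many values each parameter has.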

**Bayesian Optimization**

Evaluates a few random parameter combinations first, then fits a surrogate ("draft") model of the metric as a function of the parameters. An acquisition function, which balances exploring uncertain regions against improving on the best result so far, picks the next combination to evaluate, and the surrogate is updated with the new result. The last two steps repeat until the iteration budget is exhausted.
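
The loop can be illustrated with a deliberately crude sketch. It is not how Optuna or any other library implements Bayesian optimization: the surrogate here is a nearest-neighbour lookup rather than a Gaussian process or TPE, the initial points are the two ends of the grid rather than random draws (to keep the run deterministic), and `objective`, `acquisition`, `kappa` and the candidate grid are all hypothetical.

```python
def objective(x):
    # Hypothetical expensive-to-evaluate loss with its minimum at x = 2.
    return (x - 2.0) ** 2


def surrogate(x, observed):
    # Crude surrogate: predict the loss of the closest evaluated point and
    # treat the distance to it as the uncertainty of that prediction.
    x_near, y_near = min(observed, key=lambda p: abs(p[0] - x))
    return y_near, abs(x_near - x)


def acquisition(x, observed, kappa):
    # A low predicted loss (exploitation) or a large distance from every
    # evaluated point (exploration) both make a candidate attractive.
    predicted, uncertainty = surrogate(x, observed)
    return -predicted + kappa * uncertainty


def sequential_search(candidates, n_iter=10, kappa=2.0):
    # Initial design: the two ends of the candidate grid.
    observed = [(x, objective(x)) for x in (candidates[0], candidates[-1])]
    for _ in range(n_iter):
        tried = {x for x, _ in observed}
        remaining = [x for x in candidates if x not in tried]
        if not remaining:
            break
        # Pick the candidate with the highest acquisition value, evaluate it,
        # and add the result to the data the surrogate relies on.
        x_next = max(remaining, key=lambda x: acquisition(x, observed, kappa))
        observed.append((x_next, objective(x_next)))
    best = min(observed, key=lambda p: p[1])
    return best, observed


candidates = [i * 0.25 for i in range(21)]  # grid on [0, 5]
best, observed = sequential_search(candidates, n_iter=10)
print(best)
```

A larger `kappa` spends more of the budget on unexplored regions, a smaller one concentrates evaluations around the best point seen so far.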

**Classification of feature selection methods**

Filters: score each feature with a statistic and keep the top-k.

Wrappers: repeatedly train the model while searching over feature subsets.

Embedded: the model selects features during training (e.g., Lasso, Elastic Net).

Post-hoc (model-agnostic): explain an already trained model (Permutation Importance, SHAP).

**Pearson**

$$
r_{X,Y} \;=\;
\frac{\displaystyle \sum_{i=1}^{n} (x_i-\bar{x})(y_i-\bar{y})}
{\displaystyle \sqrt{\sum_{i=1}^{n} (x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i-\bar{y})^2}}
$$

The coefficient measures how strongly two variables vary together linearly: the stronger the positive correlation, the closer r is to 1, and the stronger the negative correlation, the closer r is to -1.
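
The formula translates almost term by term into plain Python. A minimal sketch, with `pearson_r` as a hypothetical helper name:

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance term over the product of spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return numerator / (spread_x * spread_y)


increasing = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])    # perfectly linear, rising
decreasing = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])    # perfectly linear, falling
unrelated = pearson_r([1, 2, 3], [1, 0, 1])           # no linear trend
print(increasing, decreasing, unrelated)
```

The last case shows that r only captures linear dependence: y is fully determined by x there, yet the coefficient is zero.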

**Chi2**

$$
\chi^2 \;=\;
\sum_{i=1}^{n} \frac{(O_i-E_i)^2}{E_i}
$$

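The test checks whether the class frequencies depend on the values of the feature, where $O_i$ are the observed counts and $E_i$ the counts expected if feature and class were independent. The larger the chi2 value, the more useful the feature.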
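
A minimal sketch of the statistic for a feature-by-class contingency table, where the expected count of a cell is computed from the row and column totals. `chi2_statistic` is a hypothetical helper name and the counts are made up for illustration.

```python
def chi2_statistic(table):
    """Chi-square statistic of a contingency table (rows: feature values,
    columns: classes)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of feature and class.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


dependent = chi2_statistic([[10, 20], [20, 10]])    # class mix differs by row
independent = chi2_statistic([[10, 10], [20, 20]])  # same class mix in each row
print(dependent, independent)
```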

**Lasso**

$$
\min_{w,b} \; \frac{1}{2n}\,\displaystyle\sum_{i=1}^{n}\!\bigl(y_i-(x_i^\top w + b)\bigr)^2 \;+\; \alpha\,\displaystyle\sum_{j=1}^{p} |w_j|
$$

The L1 penalty shrinks all weights toward zero and drives the weights of uninformative features exactly to zero. Only the weights whose contribution to the fit outweighs the penalty stay non-zero.
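
Why weights become exactly zero is easiest to see in a special case. Assuming standardized, mutually uncorrelated features (an orthonormal design), the Lasso solution for the objective above is the ordinary least-squares weight passed through the soft-thresholding operator. The sketch below shows only that operator, not a full solver; `soft_threshold` and the sample weights are hypothetical.

```python
def soft_threshold(w_ols, alpha):
    """Shrink a least-squares weight by alpha; weights within [-alpha, alpha]
    are set exactly to zero."""
    if w_ols > alpha:
        return w_ols - alpha
    if w_ols < -alpha:
        return w_ols + alpha
    return 0.0


ols_weights = [2.0, -1.5, 0.3, -0.1]
lasso_weights = [soft_threshold(w, alpha=0.5) for w in ols_weights]
print(lasso_weights)
```

Large weights survive with a reduced magnitude, while the two small ones are removed, which is the feature selection effect of the penalty.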

**Permutation Importance**

A way to measure how strongly the model relies on a feature: if shuffling the values of that feature across objects makes the model's quality drop, the feature is important.

Take a trained model and compute its metrics on the validation set, shuffle the values of one feature across rows, and recompute the metrics. The importance is the difference between the baseline metrics and the metrics after shuffling.

**SHAP**

SHapley Additive exPlanations

A way to explain a model's predictions through per-feature contributions based on Shapley values.

$$
f(x) \;=\; \mathbb{E}[f(X)] \;+\; \sum_{j=1}^{p} \phi_j(x)
$$

where $\phi_j(x)$ is the contribution of feature $j$ to the prediction $f(x)$: a positive contribution pushes the prediction above the baseline $\mathbb{E}[f(X)]$, and a negative one pushes it below.
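
For a linear model the decomposition can be written down directly. Assuming independent features, the contribution of feature $j$ is its weight times the deviation of $x_j$ from the feature mean. The sketch below checks the additivity property on made-up numbers; `linear_shap` is a hypothetical helper, not a call from the `shap` package.

```python
def linear_shap(weights, bias, x, feature_means):
    """Contributions for a linear model f(x) = bias + sum_j w_j * x_j,
    assuming independent features."""
    # Baseline E[f(X)]: the prediction at the feature means.
    base_value = bias + sum(w * m for w, m in zip(weights, feature_means))
    # Contribution of feature j: weight times deviation from its mean.
    phi = [w * (xj - m) for w, xj, m in zip(weights, x, feature_means)]
    return base_value, phi


weights = [2.0, -1.0, 0.5]
bias = 10.0
feature_means = [1.0, 4.0, 0.0]
x = [3.0, 2.0, 4.0]

base_value, phi = linear_shap(weights, bias, x, feature_means)
prediction = bias + sum(w * xj for w, xj in zip(weights, x))
print(base_value, phi, prediction)
```

The baseline plus the sum of the contributions reproduces the prediction, which is the additivity expressed by the formula.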

**2. Introduction (Preprocessing)** 


In [1]:
from sklearn.model_selection import train_test_split, KFold, GroupKFold, StratifiedKFold, TimeSeriesSplit, cross_val_score
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error, get_scorer
from typing import List, Tuple, Union, Any
from scipy.stats import ks_2samp, chi2_contingency
from time import perf_counter
import optuna
import numpy as np
import pandas as pd
import warnings
import shap

warnings.filterwarnings('ignore')



In [2]:
df_train = pd.read_json('../datasets/data/train.json')
df_test = pd.read_json('../datasets/data/test.json')

In [3]:
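# Remove outliers by price and coordinates.
# Note: as written, the train set is filtered by latitude only and the test set by longitude only.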

df_train = df_train[(df_train['price'] >= 500) & (df_train['price'] <= 20000)]
df_test = df_test[(df_test['price'] >= 500) & (df_test['price'] <= 20000)]

df_train = df_train[(df_train['latitude'] > 40) & (df_train['latitude'] < 41.5)]
df_test = df_test[(df_test['longitude'] > -75) & (df_test['longitude'] < -72)]

In [4]:
mapping = {'low': 0, 'medium': 1, 'high': 2}
df_train['interest_level'] = df_train['interest_level'].map(mapping)

In [5]:
feature_list = ['Elevator', 'HardwoodFloors', 'CatsAllowed', 
                'DogsAllowed', 'Doorman', 'Dishwasher', 'NoFee', 
                'LaundryinBuilding', 'FitnessCenter', 'Pre-War', 
                'LaundryinUnit', 'RoofDeck', 'OutdoorSpace', 'DiningRoom', 
                'HighSpeedInternet', 'Balcony', 'SwimmingPool', 'LaundryInBuilding', 
                'NewConstruction', 'Terrace']

In [6]:
for feature in feature_list:
    df_train[feature] = df_train['features'].apply(lambda x: int(feature in [f.replace(" ", "") for f in x]))
    df_test[feature] = df_test['features'].apply(lambda x: int(feature in [f.replace(" ", "") for f in x]))

In [7]:
X = df_train[feature_list].join(df_train['created'])
X = X.join(df_train['building_id'])

y = df_train['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

**3. Implementing split methods**

In [8]:
def train_val_test_split(validation_size=0.2, test_size=0.2, random_state=42, shuffle=True):
    X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state, shuffle=shuffle)
    
    val_size_adj = validation_size / (1 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=val_size_adj, random_state=random_state, shuffle=shuffle)
    
    return X_train, y_train, X_test, y_test, X_val, y_val

In [9]:
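def s21_date_split(date_split):
    # Reads the module-level X and y. Assigning to X inside the function
    # (X = X.copy()) would make X local and raise UnboundLocalError.
    col = pd.to_datetime(X['created'], errors="raise")
    split_ts = pd.to_datetime(date_split)

    # Everything created on or before the split timestamp goes to train.
    mask_train = col <= split_ts

    X_train, X_test = X.loc[mask_train], X.loc[~mask_train]
    y_train, y_test = y.loc[mask_train], y.loc[~mask_train]

    return X_train, X_test, y_train, y_test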

In [10]:
def s21_val_test_split(X, validation_date, test_date):
    X = X.copy()
    date_col = X['created']

    date_col = pd.to_datetime(date_col, errors='raise')
    split_val_ts = pd.to_datetime(validation_date)
    split_test_ts = pd.to_datetime(test_date)

    mask_train = date_col < split_val_ts
    mask_val = (date_col >= split_val_ts) & (date_col < split_test_ts)
    mask_test = date_col >= split_test_ts
    
    X_train, y_train = X.loc[mask_train], y.loc[mask_train]
    X_val, y_val = X.loc[mask_val], y.loc[mask_val]
    X_test, y_test = X.loc[mask_test], y.loc[mask_test]

    return (X_train, y_train), (X_val, y_val), (X_test, y_test)

In [11]:
train_samples, val_samples, test_samples = s21_val_test_split(X, '2016-04-01 23:26:07', '2016-06-29 17:56:12')

Split determinism means that the partitions are reproducible for a fixed random_state.
A deterministic split procedure produces the same partition of the data for the same parameters. To achieve this, pass a fixed random_state to every random operation and fix the global seeds.
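
A small illustration of the idea with the standard library only. `shuffled_indices` is a hypothetical helper that owns its generator, so the result depends only on the seed passed in and not on any global state.

```python
import random


def shuffled_indices(n, seed):
    # A dedicated generator seeded explicitly; global random state is untouched.
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return idx


first = shuffled_indices(10, seed=42)
second = shuffled_indices(10, seed=42)
print(first == second)
```

Calling the helper with a different seed will typically give a different permutation of the same indices.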

In [12]:
np.random.seed(42)

**4. Implementing cv methods**

In [13]:
def s21_k_fold_indices(
    X: pd.DataFrame,
    k: int = 5,
    shuffle: bool = True,
    random_state: int = 42
) -> List[Tuple[np.ndarray, np.ndarray]]:

    n_samples = len(X)
    if k < 2 or k > n_samples:
        raise ValueError(f"k must be between 2 and {n_samples}")

    idx = np.arange(n_samples)

    if shuffle:
        rng = np.random.default_rng(random_state)
        rng.shuffle(idx)

    fold_sizes = np.full(k, n_samples // k, dtype=int)
    fold_sizes[: (n_samples % k)] += 1

    folds = []
    start = 0
    for fold_size in fold_sizes:
        stop = start + fold_size
        test_idx = idx[start:stop]
        train_idx = np.concatenate((idx[:start], idx[stop:]))
        folds.append((train_idx, test_idx))
        start = stop

    return folds

In [14]:
def s21_group_k_fold_indices(
    X: pd.DataFrame,
    groups: Union[str, pd.Series, np.ndarray, list],
    k: int = 5,
    shuffle: bool = True,
    random_state: int = 42,
    balance_by: str = "samples"  
) -> List[Tuple[np.ndarray, np.ndarray]]:

    n_samples = len(X)

    if isinstance(groups, str):
        g = X[groups].to_numpy()
    else:
        g = np.asarray(groups)

    uniq, inv = np.unique(g, return_inverse=True) 
    n_groups = len(uniq)
    if k < 2 or k > n_groups:
        raise ValueError(f"k must be between 2 and {n_groups} (unique groups)")

    rows_by_group = [np.where(inv == gi)[0] for gi in range(n_groups)]
    group_sizes = np.array([len(ix) for ix in rows_by_group])

    order = np.arange(n_groups)
    rng = np.random.default_rng(random_state) if shuffle else None
    if balance_by == "samples":
        if shuffle:
            rng.shuffle(order)                 
        order = order[np.argsort(group_sizes[order])[::-1]]  
    else:
        if shuffle:
            rng.shuffle(order)

    fold_bins = [[] for _ in range(k)]
    fold_loads = np.zeros(k, dtype=int)
    for gi in order:
        target = int(np.argmin(fold_loads))
        fold_bins[target].append(gi)
        fold_loads[target] += group_sizes[gi]

    folds: List[Tuple[np.ndarray, np.ndarray]] = []
    all_idx = np.arange(n_samples)
    for fold_groups in fold_bins:
        if fold_groups:
            test_idx = np.concatenate([rows_by_group[gi] for gi in fold_groups])
        else:
            test_idx = np.array([], dtype=int)
        train_idx = np.setdiff1d(all_idx, test_idx, assume_unique=False)
        folds.append((train_idx, test_idx))

    return folds

In [15]:
def s21_stratified_k_fold_indices(
    X: pd.DataFrame,
    stratify_field: Union[str, pd.Series, np.ndarray, list],
    k: int = 5,
    shuffle: bool = True,
    random_state: int = 42
) -> List[Tuple[np.ndarray, np.ndarray]]:

    n_samples = len(X)

    y = X[stratify_field].to_numpy() if isinstance(stratify_field, str) else np.asarray(stratify_field)
    if y.shape[0] != n_samples:
        raise ValueError("len(stratify_field) must be the same as number of lines in X")
    _, counts = np.unique(y, return_counts=True)
    if counts.min() < k:
        raise ValueError(f"Some classes have < {k} samples.")

    rng = np.random.default_rng(random_state) if shuffle else None

    classes = np.unique(y)
    idx_by_class = {}
    for c in classes:
        idx = np.where(y == c)[0]
        if shuffle:
            rng.shuffle(idx)
        idx_by_class[c] = idx

    test_bins: List[List[int]] = [[] for _ in range(k)]

    for c in classes:
        idx = idx_by_class[c]
        m = len(idx)
        base = m // k
        rem = m % k
        sizes = np.array([base + (1 if i < rem else 0) for i in range(k)], dtype=int)
        start = 0
        for i, sz in enumerate(sizes):
            if sz:
                part = idx[start:start+sz]
                test_bins[i].extend(part.tolist())
                start += sz

    folds: List[Tuple[np.ndarray, np.ndarray]] = []
    all_idx = np.arange(n_samples)
    for i in range(k):
        test_idx = np.array(test_bins[i], dtype=int)
        if shuffle and len(test_idx) > 1:
            rng.shuffle(test_idx)
        train_idx = np.setdiff1d(all_idx, test_idx, assume_unique=False)
        folds.append((train_idx, test_idx))

    return folds

In [16]:
def s21_time_series_split_indices(
    X: pd.DataFrame,
    date_field: Union[str, pd.Series, np.ndarray, list],
    k: int = 5,
) -> List[Tuple[np.ndarray, np.ndarray]]:
    
    d = pd.to_datetime(X[date_field] if isinstance(date_field, str)
                       else pd.Series(date_field, index=X.index), errors="coerce")
    valid = d.notna().to_numpy()
    order = np.arange(len(X))[valid][np.argsort(d.to_numpy()[valid], kind="mergesort")]
    m = len(order)
    
    if m < 3:
        return []

    k = int(max(1, min(k, m-1)))
    cuts = [(m * i) // (k + 1) for i in range(k + 1)] + [m]

    folds: List[Tuple[np.ndarray, np.ndarray]] = []
    for i in range(k):
        a = cuts[i+1]
        b = cuts[i+2]
        train_idx = order[:a]
        test_idx  = order[a:b]

        if len(train_idx) and len(test_idx):
            folds.append((train_idx.copy(), test_idx.copy()))

    return folds

**5. Cross-validation comparison**

In [17]:
def kf_stratifier(X, n_splits, shuffle, random_state):
    folder = KFold(n_splits=n_splits, shuffle=shuffle, random_state=random_state)
    folds = [(tr, te) for tr, te in folder.split(np.arange(len(X)))]

    return folds

In [18]:
def gkf_stratifier(X, n_splits, group_name):
    groups = X[group_name].to_numpy()

    folder = GroupKFold(n_splits=n_splits)
    folds = [(tr, te) for tr, te in folder.split(np.arange(len(X)), groups=groups)]

    return folds

In [19]:
def skf_stratifier(X, y_bins, n_splits, shuffle, random_state):
    yb = pd.Series(y_bins, index=X.index)
    mask = yb.notna()

    y_labels = yb.loc[mask].astype(str).to_numpy()
    idx = np.arange(mask.sum())

    folder = StratifiedKFold(n_splits=n_splits, shuffle=shuffle, random_state=random_state)
    tmp = [(tr, te) for tr, te in folder.split(idx, y_labels)]

    orig_idx = np.flatnonzero(mask.to_numpy())
    folds = [(orig_idx[tr], orig_idx[te]) for tr, te in tmp]  

    return folds

In [20]:
def tss_stratifier(X, n_splits):
    d = pd.to_datetime(X["created"], errors="raise")
    order = d.to_numpy().argsort(kind="mergesort")  

    tscv = TimeSeriesSplit(n_splits=n_splits)
    tmp = [(tr, te) for tr, te in tscv.split(np.arange(len(order)))]
    folds = [(order[tr], order[te]) for tr, te in tmp]
    
    return folds

In [21]:
default_folds = s21_k_fold_indices(X, k=5, shuffle=True, random_state=42)

groupped_folds = s21_group_k_fold_indices(X, groups="building_id", k=5, shuffle=True, random_state=42)

y_bins = pd.qcut(np.log1p(y), q=8, duplicates="drop")  # 8 quantile bins
stratified_folds = s21_stratified_k_fold_indices(X, stratify_field=y_bins, k=5, shuffle=True, random_state=42)

times_split_folds = s21_time_series_split_indices(X, date_field="created", k=5)

kf = kf_stratifier(X, n_splits=5, shuffle=True, random_state=42)

gkf = gkf_stratifier(X, n_splits=5, group_name="building_id")

skf = skf_stratifier(X, y_bins, n_splits=5, shuffle=True, random_state=42)

tss = tss_stratifier(X, n_splits=5)

In [22]:
def _to_hashable(x: Any) -> Any:
    if isinstance(x, np.ndarray):
        return tuple(_to_hashable(v) for v in x.tolist())
    if isinstance(x, (list, tuple)):
        return tuple(_to_hashable(v) for v in x)
    if isinstance(x, set):
        return tuple(sorted(_to_hashable(v) for v in x))
    if isinstance(x, dict):
        return tuple(sorted((k, _to_hashable(v)) for k, v in x.items()))
    return x


def _safe_cat_series(s: pd.Series) -> pd.Series:
    def _is_hashable(v):
        try:
            hash(v)
            return True
        except TypeError:
            return False

    out = s.copy()
    out = out.where(~out.isna(), "__NaN__")
    sample = out.dropna().iloc[0] if out.size and out.dropna().size else None
    if sample is not None and _is_hashable(sample):
        return out
    return out.map(lambda v: "__NaN__" if v == "__NaN__" else _to_hashable(v))


def compare_train_distribution(
    X: pd.DataFrame,
    tr_a, 
    tr_b,  
    topn: int = 15,
    cat_max_card: int = 50
) -> pd.DataFrame:
    
    A = X.iloc[tr_a] if isinstance(tr_a, (np.ndarray, list, tuple)) else tr_a
    B = X.iloc[tr_b] if isinstance(tr_b, (np.ndarray, list, tuple)) else tr_b

    num_cols = X.select_dtypes(include=[np.number]).columns.tolist()
    cat_cols = [c for c in X.columns if c not in num_cols]

    rows: List[Tuple] = []

    for c in num_cols:
        a = A[c].to_numpy(dtype=float)
        b = B[c].to_numpy(dtype=float)
        a_f = a[~np.isnan(a)]; b_f = b[~np.isnan(b)]
        if len(a_f) and len(b_f):
            ks = ks_2samp(a_f, b_f, alternative="two-sided", mode="auto")
            ks_stat, ks_p = float(ks.statistic), float(ks.pvalue)
            mean_A = float(np.nanmean(a)); mean_B = float(np.nanmean(b))
            std_A  = float(np.nanstd(a, ddof=1)) if len(a_f) > 1 else np.nan
            std_B  = float(np.nanstd(b, ddof=1)) if len(b_f) > 1 else np.nan
        else:
            ks_stat = ks_p = mean_A = mean_B = std_A = std_B = np.nan
        rows.append(("numeric", c, mean_A, mean_B, std_A, std_B, ks_stat, ks_p))

    for c in cat_cols:
        sA = _safe_cat_series(A[c])
        sB = _safe_cat_series(B[c])

        if max(sA.nunique(dropna=False), sB.nunique(dropna=False)) > cat_max_card:
            rows.append(("categorical(HIGH_CARD)", c, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan))
            continue

        vc_a = sA.value_counts(dropna=False)
        vc_b = sB.value_counts(dropna=False)
        cats = sorted(set(vc_a.index).union(vc_b.index), key=lambda x: str(x))

        ca = vc_a.reindex(cats, fill_value=0).to_numpy()
        cb = vc_b.reindex(cats, fill_value=0).to_numpy()

        try:
            _, chi2_p, _, _ = chi2_contingency(np.vstack([ca, cb]), correction=False)
        except Exception:
            chi2_p = np.nan

        pa = ca / max(1, ca.sum()); pb = cb / max(1, cb.sum())
        l1 = 0.5 * float(np.abs(pa - pb).sum())
        rows.append(("categorical", c, np.nan, np.nan, np.nan, np.nan, l1, chi2_p))

    res = pd.DataFrame(rows, columns=[
        "type","feature","mean_A","mean_B","std_A","std_B","stat","p_or_l1"
    ])


    def _score(row):
        if row["type"] == "numeric":
            return np.nan_to_num(row["stat"], nan=0.0)
        elif row["type"] == "categorical":
            return np.nan_to_num(row["stat"], nan=0.0)
        return 0.0
    res["_score"] = res.apply(_score, axis=1)


    return res.sort_values("_score", ascending=False).drop(columns="_score").head(topn)

In [23]:
tr_a, _ = default_folds[0]
tr_b, _ = kf[0]

compare_train_distribution(X, tr_a, tr_b)

| | type | feature | mean_A | mean_B | std_A | std_B | stat | p_or_l1 |
|---|---|---|---|---|---|---|---|---|
| 15 | numeric | Balcony | 0.061246 | 0.058629 | 0.239784 | 0.234933 | 0.002616 | 0.999238 |
| 9 | numeric | Pre-War | 0.186684 | 0.184423 | 0.389663 | 0.387834 | 0.002261 | 0.999959 |
| 12 | numeric | OutdoorSpace | 0.107402 | 0.105929 | 0.309628 | 0.307751 | 0.001473 | 1.0 |
| 19 | numeric | Terrace | 0.046284 | 0.045115 | 0.210101 | 0.207559 | 0.001169 | 1.0 |
| 13 | numeric | DiningRoom | 0.10344 | 0.102347 | 0.304536 | 0.303108 | 0.001092 | 1.0 |
| 3 | numeric | DogsAllowed | 0.447569 | 0.446553 | 0.49725 | 0.497142 | 0.001016 | 1.0 |
| 2 | numeric | CatsAllowed | 0.478433 | 0.477544 | 0.499541 | 0.499502 | 0.000889 | 1.0 |
| 8 | numeric | FitnessCenter | 0.26876 | 0.267972 | 0.44332 | 0.442909 | 0.000787 | 1.0 |
| 11 | numeric | RoofDeck | 0.132399 | 0.131611 | 0.338928 | 0.338072 | 0.000787 | 1.0 |
| 1 | numeric | HardwoodFloors | 0.475766 | 0.476553 | 0.499419 | 0.499456 | 0.000787 | 1.0 |


In [24]:
tr_a, _ = groupped_folds[0]
tr_b, _ = gkf[0]

compare_train_distribution(X, tr_a, tr_b)

| | type | feature | mean_A | mean_B | std_A | std_B | stat | p_or_l1 |
|---|---|---|---|---|---|---|---|---|
| 4 | numeric | Doorman | 0.4637 | 0.462506 | 0.498687 | 0.498599 | 0.001194 | 1.0 |
| 5 | numeric | Dishwasher | 0.475436 | 0.474343 | 0.499403 | 0.499348 | 0.001092 | 1.0 |
| 0 | numeric | Elevator | 0.598257 | 0.597165 | 0.490257 | 0.490474 | 0.001092 | 1.0 |
| 11 | numeric | RoofDeck | 0.157217 | 0.156404 | 0.36401 | 0.363243 | 0.000813 | 1.0 |
| 8 | numeric | FitnessCenter | 0.294289 | 0.293604 | 0.455728 | 0.455418 | 0.000686 | 1.0 |
| 7 | numeric | LaundryinBuilding | 0.3801 | 0.37949 | 0.485417 | 0.485266 | 0.00061 | 1.0 |
| 1 | numeric | HardwoodFloors | 0.544785 | 0.544226 | 0.497997 | 0.498047 | 0.000559 | 1.0 |
| 17 | numeric | LaundryInBuilding | 0.053219 | 0.052761 | 0.224472 | 0.223559 | 0.000457 | 1.0 |
| 18 | numeric | NewConstruction | 0.060585 | 0.060179 | 0.238571 | 0.237821 | 0.000406 | 1.0 |
| 6 | numeric | NoFee | 0.396865 | 0.39651 | 0.489254 | 0.489179 | 0.000356 | 1.0 |


In [25]:
tr_a, _ = stratified_folds[0]
tr_b, _ = skf[0]

compare_train_distribution(X, tr_a, tr_b)

| | type | feature | mean_A | mean_B | std_A | std_B | stat | p_or_l1 |
|---|---|---|---|---|---|---|---|---|
| 1 | numeric | HardwoodFloors | 0.476017 | 0.474191 | 0.499431 | 0.49934 | 0.001827 | 1.0 |
| 6 | numeric | NoFee | 0.365022 | 0.366052 | 0.481442 | 0.48173 | 0.00103 | 1.0 |
| 9 | numeric | Pre-War | 0.186119 | 0.185312 | 0.389207 | 0.388556 | 0.000806 | 1.0 |
| 7 | numeric | LaundryinBuilding | 0.330039 | 0.330793 | 0.470233 | 0.470505 | 0.000754 | 1.0 |
| 2 | numeric | CatsAllowed | 0.478304 | 0.479043 | 0.499535 | 0.499567 | 0.000739 | 1.0 |
| 5 | numeric | Dishwasher | 0.413419 | 0.412717 | 0.492453 | 0.492329 | 0.000702 | 1.0 |
| 8 | numeric | FitnessCenter | 0.268279 | 0.268811 | 0.443069 | 0.443347 | 0.000532 | 1.0 |
| 0 | numeric | Elevator | 0.524516 | 0.52426 | 0.499405 | 0.499417 | 0.000257 | 1.0 |
| 18 | numeric | NewConstruction | 0.052182 | 0.052355 | 0.222397 | 0.222744 | 0.000173 | 1.0 |
| 16 | numeric | SwimmingPool | 0.055079 | 0.055251 | 0.228136 | 0.228472 | 0.000172 | 1.0 |


In [26]:
tr_a, _ = times_split_folds[0]
tr_b, _ = tss[0]

compare_train_distribution(X, tr_a, tr_b)

| | type | feature | mean_A | mean_B | std_A | std_B | stat | p_or_l1 |
|---|---|---|---|---|---|---|---|---|
| 0 | numeric | Elevator | 0.519327 | 0.5192 | 0.499657 | 0.499662 | 0.000127 | 1.0 |
| 1 | numeric | HardwoodFloors | 0.487014 | 0.486895 | 0.499862 | 0.499859 | 0.000119 | 1.0 |
| 5 | numeric | Dishwasher | 0.415559 | 0.415458 | 0.492848 | 0.492831 | 0.000101 | 1.0 |
| 6 | numeric | NoFee | 0.368492 | 0.368402 | 0.482425 | 0.482401 | 9e-05 | 1.0 |
| 9 | numeric | Pre-War | 0.170955 | 0.171035 | 0.376492 | 0.376562 | 8e-05 | 1.0 |
| 8 | numeric | FitnessCenter | 0.265699 | 0.265635 | 0.441732 | 0.441697 | 6.5e-05 | 1.0 |
| 10 | numeric | LaundryinUnit | 0.175101 | 0.175058 | 0.380076 | 0.38004 | 4.3e-05 | 1.0 |
| 7 | numeric | LaundryinBuilding | 0.330326 | 0.330367 | 0.470359 | 0.470374 | 4.1e-05 | 1.0 |
| 11 | numeric | RoofDeck | 0.127789 | 0.127758 | 0.333875 | 0.333841 | 3.1e-05 | 1.0 |
| 12 | numeric | OutdoorSpace | 0.103768 | 0.103743 | 0.304978 | 0.304945 | 2.5e-05 | 1.0 |


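**Conclusion**

The train-fold distributions from the custom splitters match those of the scikit-learn counterparts (KS statistics close to 0 and p-values close to 1 for the listed features), which suggests the implementations are correct.
TimeSeriesSplit is the best choice here: the listings carry a `created` timestamp, and a time-ordered split keeps later observations out of the training folds.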

**6. Feature Selection**

In [27]:
speed_results = []

val_ratio = 0.60   # cumulative fractions
test_ratio = 0.80  

created_dt = pd.to_datetime(X['created'], errors='raise')
val_ts  = created_dt.quantile(val_ratio)
test_ts = created_dt.quantile(test_ratio)

val_date = pd.Timestamp(val_ts).strftime('%Y-%m-%d %H:%M:%S')
test_date = pd.Timestamp(test_ts).strftime('%Y-%m-%d %H:%M:%S')

t0_baseline = perf_counter()
(X_tr, y_tr), (X_val, y_val), (X_te, y_te) = s21_val_test_split(X, val_date, test_date)
t1_baseline = perf_counter() - t0_baseline
speed_results.append(t1_baseline)

In [28]:
for parts in (X_tr, X_val, X_te):
    parts.drop(columns=['created', 'building_id'], inplace=True)

# num_cols = X_tr.select_dtypes(include=[np.number, "bool"]).columns.tolist()

alphas = np.logspace(-3, 1, 25)
best_alpha, best_val_mae, lasso_model = None, np.inf, None

for alpha in alphas:
    pipe = Pipeline([
        ("sc", StandardScaler()),
        ("lasso", Lasso(alpha=alpha, max_iter=10000, random_state=42)),
    ])
    pipe.fit(X_tr, y_tr)
    val_mae = mean_absolute_error(y_val, pipe.predict(X_val))
    if val_mae < best_val_mae:
        best_val_mae = val_mae
        best_alpha   = alpha
        lasso_model   = pipe

print(f"Best alpha: {best_alpha:.5f} | MAE(val)={best_val_mae:.3f}")

Best alpha: 6.81292 | MAE(val)=1136.750


In [29]:
def build_pipeline(features, X, y):
    preprocess = ColumnTransformer(
        transformers=[
        ("num", Pipeline([
            ("sc",  StandardScaler()),
        ]), features),
    ],
    remainder="drop"
    )
    
    model = Pipeline([
        ("sc", preprocess),
        ("lasso", Lasso(alpha=best_alpha, max_iter=10000, random_state=42))
    ])

    model_fit = model.fit(X, y)

    return model_fit

In [30]:
t0_sorted_by_weight = perf_counter()
feat_names = lasso_model.named_steps["sc"].get_feature_names_out()

coefs = pd.Series(lasso_model.named_steps['lasso'].coef_, index=feat_names)
top_10_by_weight = coefs.abs().sort_values(ascending=False).head(10).index.tolist()

lasso_model_top10 = build_pipeline(top_10_by_weight, X_tr, y_tr)
t1_sorted_by_weight = perf_counter() - t0_sorted_by_weight
speed_results.append(t1_sorted_by_weight)

In [31]:
def evaluate_model(model, X, y, method_name=None):
    y_pred = model.predict(X)
    
    mae = mean_absolute_error(y, y_pred)
    mse = mean_squared_error(y, y_pred)
    r2 = r2_score(y, y_pred)

    rmse = float(np.sqrt(mse))
    
    result = mae, rmse, r2, method_name

    return result

result_default = evaluate_model(lasso_model, X_val, y_val, "baseline all")
result_top_10_by_weight = evaluate_model(lasso_model_top10, X_val, y_val, "top10 by weight")

evaluation_results = []
evaluation_results.append(result_default)
evaluation_results.append(result_top_10_by_weight)

In [32]:
def feature_selection(X, y):
    y_series = pd.Series(y, index=X.index)
    corrs = {}
    for col in X.columns:
        s = X[col]
        m = s.notna() & y_series.notna()
        corrs[col] = s[m].corr(y_series[m], method='pearson')
    
    abs_corr = pd.Series(corrs).abs()

    rank_table = pd.DataFrame({
        "abs_corr_to_y": abs_corr.reindex(X.columns),
    }).sort_values("abs_corr_to_y", ascending=False)

    return rank_table

In [33]:
t0_feature_selection = perf_counter()
RANK = feature_selection(X_tr, y_tr)
top_10_features = RANK.head(10).index.tolist()
t1_feature_selection = perf_counter() - t0_feature_selection
speed_results.append(t1_feature_selection)
print("Top-10:", top_10_features, "\n")
print(RANK)

Top-10: ['LaundryinUnit', 'Doorman', 'DiningRoom', 'Dishwasher', 'FitnessCenter', 'Elevator', 'OutdoorSpace', 'Terrace', 'LaundryinBuilding', 'NoFee'] 

                   abs_corr_to_y
LaundryinUnit           0.276073
Doorman                 0.268324
DiningRoom              0.238796
Dishwasher              0.218743
FitnessCenter           0.217903
Elevator                0.199779
OutdoorSpace            0.152180
Terrace                 0.141916
LaundryinBuilding       0.129637
NoFee                   0.126967
RoofDeck                0.125620
SwimmingPool            0.123419
Balcony                 0.119184
HardwoodFloors          0.101472
HighSpeedInternet       0.089499
NewConstruction         0.072650
DogsAllowed             0.057830
CatsAllowed             0.047402
Pre-War                 0.023118
LaundryInBuilding       0.018516


In [34]:
lasso_model_top10_corr = build_pipeline(top_10_features, X_tr, y_tr)
result_by_corr = evaluate_model(lasso_model_top10_corr, X_val, y_val, "top10 by correlation")

evaluation_results.append(result_by_corr)

In [35]:
def permutation_importance(model, X, y, n_repeats=10, random_state=None, sort_by="r2"):
    rng = np.random.default_rng(random_state)

    y_base = model.predict(X)
    base_r2   = r2_score(y, y_base)
    base_mae  = mean_absolute_error(y, y_base)
    base_rmse = np.sqrt(mean_squared_error(y, y_base))

    rows = []
    for col in X.columns:
        d_r2 = d_mae = d_rmse = 0.0
        col_vals = X[col].to_numpy()
        for _ in range(n_repeats):
            idx = rng.permutation(len(X))
            Xp = X.copy(deep=True)
            Xp[col] = col_vals[idx]
            y_hat = model.predict(Xp)
            d_r2   += (base_r2   - r2_score(y, y_hat))
            d_mae  += (mean_absolute_error(y, y_hat) - base_mae)
            d_rmse += (np.sqrt(mean_squared_error(y, y_hat)) - base_rmse)
        rows.append({
            "feature": col,
            "imp_r2":   d_r2 / n_repeats,
            "imp_mae":  d_mae / n_repeats,
            "imp_rmse": d_rmse / n_repeats,
        })

    key = {"r2":"imp_r2","mae":"imp_mae","rmse":"imp_rmse"}[sort_by]
    perm_result = pd.DataFrame(rows).sort_values(key, ascending=False).reset_index(drop=True)
    
    return perm_result

In [36]:
t0_permutation_importance = perf_counter()
importance = permutation_importance(lasso_model, X_val, y_val, n_repeats=20, random_state=42, sort_by='r2')
importance_head = importance.head(10)

print(importance_head)

feature_names_10_perm = importance_head['feature'].tolist()
t1_permutation_importance = perf_counter() - t0_permutation_importance
speed_results.append(t1_permutation_importance)

             feature    imp_r2    imp_mae   imp_rmse
0      LaundryinUnit  0.081809  66.261221  87.114907
1            Doorman  0.060739  57.788909  65.068570
2         DiningRoom  0.044440  32.734899  47.832559
3     HardwoodFloors  0.015189  15.523144  16.490620
4         Dishwasher  0.011686  14.982560  12.700893
5      FitnessCenter  0.009669   9.824373  10.515821
6            Terrace  0.007726   8.159498   8.407423
7  LaundryinBuilding  0.007418   9.300744   8.072370
8           Elevator  0.006024   5.968099   6.558086
9  HighSpeedInternet  0.001667   3.209551   1.817172


In [37]:
lasso_model_top10_perm = build_pipeline(feature_names_10_perm, X_tr, y_tr)
result_by_perm = evaluate_model(lasso_model_top10_perm, X_val, y_val, "top10 by permutation")

evaluation_results.append(result_by_perm)

In [38]:
t0_shap = perf_counter()
prep = lasso_model.named_steps['sc']
model = lasso_model.named_steps['lasso']

bckgrn_tr = prep.transform(X_tr.iloc[:1000])  # background sample that defines the baseline feature distribution
bckgrn_val = prep.transform(X_val)

feat_full = prep.get_feature_names_out().tolist()
feat_base = [n.split('__', 1)[-1] for n in feat_full]

expl = shap.LinearExplainer(model, bckgrn_tr)
sv   = expl.shap_values(bckgrn_val)                       
imp  = np.abs(sv).mean(axis=0)  

rank = (pd.DataFrame({'feature': feat_base, 'mean_abs_shap': imp})
          .groupby('feature', as_index=False)['mean_abs_shap'].sum()
          .sort_values('mean_abs_shap', ascending=False)
          .reset_index(drop=True))

top_10_shap = rank.head(10)['feature'].tolist()
t1_shap = perf_counter() - t0_shap
speed_results.append(t1_shap)

In [39]:
lasso_model_top10_shap = build_pipeline(top_10_shap, X_tr, y_tr)
result_by_shap = evaluate_model(lasso_model_top10_shap, X_val, y_val, "top10 by shap")

evaluation_results.append(result_by_shap)

In [40]:
result_rows = []
for i, (mae, rmse, r2, method) in enumerate(evaluation_results):
    result_rows.append({"method": method, "MAE": mae, "RMSE": rmse, "R2": r2, "TIME": speed_results[i]})

compare_df = pd.DataFrame(result_rows)

In [41]:
compare_df.head()

| | method | MAE | RMSE | R2 | TIME |
|---|---|---|---|---|---|
| 0 | baseline all | 1136.750393 | 1792.013892 | 0.178553 | 0.008732 |
| 1 | top10 by weight | 1138.017098 | 1795.392465 | 0.175452 | 0.005649 |
| 2 | top10 by correlation | 1145.081381 | 1803.800301 | 0.167711 | 0.008241 |
| 3 | top10 by permutation | 1138.016604 | 1795.392801 | 0.175452 | 0.396348 |
| 4 | top10 by shap | 1138.016787 | 1795.392847 | 0.175452 | 0.002462 |


**7. Hyperparameter optimization**

In [42]:
def s21_grid_search_cv(
    X,
    y,
    alphas=None,
    l1_ratios=None,
    scoring=None,       
    cv=5,             
    shuffle=True,
    random_state=42,
):

    if alphas is None:
        alphas = np.logspace(-4, 1, 30)
    if l1_ratios is None:
        l1_ratios = np.linspace(0.05, 0.95, 19)

    splitter = KFold(n_splits=cv, shuffle=shuffle, random_state=random_state) 

    if scoring is None:
        def scorer(est, Xb, yb):
            return est.score(Xb, yb)
    elif isinstance(scoring, str):
        _sk = get_scorer(scoring)
        def scorer(est, Xb, yb):               
            return _sk(est, Xb, yb)
    else:
        def scorer(est, Xb, yb):
            return scoring(est, Xb, yb) 

    rows = []
    best_idx = None
    best_mean = -np.inf

    for a in alphas:
        for l1 in l1_ratios:
            split_scores = []
            for tr_idx, val_idx in splitter.split(X, y):
                X_tr_i, X_val_i = X.iloc[tr_idx], X.iloc[val_idx]
                y_tr_i, y_val_i = y.iloc[tr_idx], y.iloc[val_idx]

                pipe = Pipeline([
                    ("sc", StandardScaler()),
                    ("enet", ElasticNet(alpha=a, l1_ratio=l1, max_iter=10000, random_state=random_state))
                ])
                pipe.fit(X_tr_i, y_tr_i)
                split_scores.append(float(scorer(pipe, X_val_i, y_val_i)))

            split_scores = np.asarray(split_scores, float)
            mean_score = float(split_scores.mean())
            std_score  = float(split_scores.std(ddof=1)) if len(split_scores) > 1 else 0.0

            rows.append({
                "mean_test_score": mean_score,
                "std_test_score": std_score,
                **{f"split{i}_test_score": float(s) for i, s in enumerate(split_scores)},
                "alpha": a,
                "l1_ratio": l1,
            })

            if mean_score > best_mean:
                best_mean = mean_score
                best_idx = len(rows) - 1

    cv_results = pd.DataFrame(rows)
    cv_results["rank_test_score"] = cv_results["mean_test_score"].rank(method="min", ascending=False).astype(int)
    cv_results = cv_results.sort_values(["rank_test_score", "alpha", "l1_ratio"]).reset_index(drop=True)

    best_row = rows[best_idx]
    best_params = {
        "alpha": float(best_row["alpha"]),
        "l1_ratio": float(best_row["l1_ratio"]),
        "mean_test_score": best_row["mean_test_score"],
    }

    return best_params, cv_results

In [43]:
gridsearch_params, gridsearch_tab = s21_grid_search_cv(
    X_tr, y_tr,
    cv=5,
    random_state=42
)

gridsearch_tab.head()

| | mean_test_score | std_test_score | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | alpha | l1_ratio | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.167455 | 0.008201 | 0.177669 | 0.155604 | 0.164508 | 0.171338 | 0.168155 | 0.017433 | 0.6 | 1 |
| 1 | 0.167455 | 0.0082 | 0.177668 | 0.155605 | 0.164508 | 0.171338 | 0.168155 | 0.011721 | 0.4 | 2 |
| 2 | 0.167455 | 0.008204 | 0.177672 | 0.1556 | 0.164508 | 0.171339 | 0.168154 | 0.00788 | 0.15 | 3 |
| 3 | 0.167455 | 0.0082 | 0.177667 | 0.155605 | 0.164508 | 0.171338 | 0.168155 | 0.00788 | 0.1 | 4 |
| 4 | 0.167455 | 0.008206 | 0.177676 | 0.155597 | 0.164508 | 0.171338 | 0.168154 | 0.025929 | 0.75 | 5 |
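For comparison, the hand-written loop in `s21_grid_search_cv` follows the same pattern as scikit-learn's `GridSearchCV`. The sketch below is only an illustration under stated assumptions: it runs on synthetic data from `make_regression` (the names `X_demo` and `y_demo` are hypothetical stand-ins for the notebook's `X_tr` and `y_tr`) and uses a much smaller grid than the one above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): the notebook's X_tr / y_tr are not reproduced here.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

pipe = Pipeline([
    ("sc", StandardScaler()),
    ("enet", ElasticNet(max_iter=10000, random_state=42)),
])

# Step-prefixed names route each value to the ElasticNet step of the pipeline.
param_grid = {
    "enet__alpha": np.logspace(-4, 1, 6),
    "enet__l1_ratio": [0.1, 0.5, 0.9],
}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring=None,  # None falls back to the estimator's .score(), i.e. R2 for regressors
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 4))
```

`search.cv_results_` holds per-split and mean scores in columns named like those of `gridsearch_tab`. One difference to keep in mind: `GridSearchCV` reports a population standard deviation, while the function above uses `ddof=1`, so the `std_test_score` values would not match exactly.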


In [44]:
def s21_random_search(
    X, y,
    n_iter=100,
    alpha_low=1e-4, alpha_high=1e-1,
    scoring=None,
    cv=5, shuffle=True,
    random_state=42
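):
    """Random search over ElasticNet: alpha is drawn log-uniformly, l1_ratio uniformly on [0, 1).

    Returns (best_params, cv_results); higher scores are treated as better.
    """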

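    # KFold raises ValueError if random_state is set while shuffle=False, so pass it only when shuffling.
    splitter = KFold(n_splits=cv, shuffle=shuffle, random_state=random_state if shuffle else None)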

    if scoring is None:
        def scorer(est, Xb, yb): return est.score(Xb, yb)
    elif isinstance(scoring, str):
        _sk = get_scorer(scoring)
        def scorer(est, Xb, yb): return _sk(est, Xb, yb)
    else:
        def scorer(est, Xb, yb): return scoring(est, Xb, yb)
    
    rng = np.random.default_rng(random_state)
    
    rows, best_idx, best_mean = [], None, -np.inf
    
    for i in range(n_iter):
        a = float(10 ** rng.uniform(np.log10(alpha_low), np.log10(alpha_high)))
        l1 = float(rng.uniform(0.0, 1.0))

        split_scores = []
        for tr_idx, val_idx in splitter.split(X, y):
            X_tr_i, X_val_i = X.iloc[tr_idx], X.iloc[val_idx]
            y_tr_i, y_val_i = y.iloc[tr_idx], y.iloc[val_idx]

            pipe = Pipeline([
                ('scaler', StandardScaler()),
                ('model', ElasticNet(alpha=a, l1_ratio=l1, random_state=random_state))
            ])

            pipe.fit(X_tr_i, y_tr_i)
            split_scores.append(float(scorer(pipe, X_val_i, y_val_i)))

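        mean_score = float(np.mean(split_scores))
        # Population std (ddof=0); s21_grid_search_cv uses the sample std (ddof=1),
        # so the std_test_score columns of the two tables are not directly comparable.
        std_score = float(np.std(split_scores))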

        rows.append({
            "iter": i,
            "mean_test_score": mean_score,
            "std_test_score": std_score,
            # Use k for the split index so it does not shadow the outer iteration counter i.
            **{f"split{k}_test_score": float(s) for k, s in enumerate(split_scores)},
            "alpha": a,
            "l1_ratio": l1,
        })

        if mean_score > best_mean:
            best_mean = mean_score
            best_idx = len(rows) - 1

    cv_results = pd.DataFrame(rows)
    cv_results["rank_test_score"] = cv_results["mean_test_score"].rank(method="min", ascending=False).astype(int)
    cv_results = cv_results.sort_values(["rank_test_score", "iter"]).reset_index(drop=True)

    best_row = rows[best_idx]
    best_params = {
        "alpha": best_row["alpha"],
        "l1_ratio": best_row["l1_ratio"],
        "mean_test_score": float(best_row["mean_test_score"]),
    }

    return best_params, cv_results     

In [45]:
best_params, randomsearch_results = s21_random_search(
    X_tr, y_tr,
    n_iter=100,
    cv=5,
    random_state=42
)

randomsearch_results.head()

| | iter | mean_test_score | std_test_score | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | alpha | l1_ratio | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 92 | 0.167455 | 0.007335 | 0.177668 | 0.155605 | 0.164508 | 0.171337 | 0.168155 | 0.020703 | 0.661661 | 1 |
| 1 | 22 | 0.167455 | 0.007342 | 0.177681 | 0.155592 | 0.164508 | 0.171338 | 0.168154 | 0.031389 | 0.804764 | 2 |
| 2 | 52 | 0.167455 | 0.007342 | 0.17768 | 0.155592 | 0.164508 | 0.171339 | 0.168154 | 0.02169 | 0.71689 | 3 |
| 3 | 83 | 0.167454 | 0.007344 | 0.177684 | 0.155589 | 0.164508 | 0.171339 | 0.168153 | 0.030858 | 0.808251 | 4 |
| 4 | 26 | 0.167454 | 0.007328 | 0.177658 | 0.155615 | 0.164508 | 0.171336 | 0.168155 | 0.022949 | 0.664851 | 5 |
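The random search draws `alpha` as `10 ** uniform(log10(alpha_low), log10(alpha_high))`, that is, uniformly in log space. The following standard-library sketch illustrates why this matters for a range spanning several orders of magnitude. The helper `sample_log_uniform` and the sample size are hypothetical choices made for this illustration only.

```python
import math
import random

def sample_log_uniform(rng, low, high):
    """Draw a value whose log10 is uniform on [log10(low), log10(high)]."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))

rng = random.Random(42)
low, high = 1e-4, 1e-1  # same bounds as the defaults of s21_random_search
n_samples = 30_000      # hypothetical sample size for the illustration

log_samples = [sample_log_uniform(rng, low, high) for _ in range(n_samples)]
lin_samples = [rng.uniform(low, high) for _ in range(n_samples)]

def decade_counts(values):
    """Count values per decade: [1e-4, 1e-3), [1e-3, 1e-2), [1e-2, 1e-1]."""
    counts = [0, 0, 0]
    for value in values:
        decade = int(math.floor(math.log10(value) - math.log10(low)))
        counts[max(0, min(decade, 2))] += 1  # clamp to guard against rounding at the edges
    return counts

log_counts = decade_counts(log_samples)
lin_counts = decade_counts(lin_samples)

print("log-uniform per decade:", log_counts)
print("plain uniform per decade:", lin_counts)
```

With log-uniform draws each decade receives roughly a third of the samples, whereas plain uniform draws on the same interval land mostly in the top decade and rarely try small values of `alpha`.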


In [46]:
# Refit with the best (alpha, l1_ratio) found by the grid search above (rank 1 in gridsearch_tab).
elasticnet_model = Pipeline([
    ("num", StandardScaler()),
    ("enet", ElasticNet(alpha=0.01743328822199989, l1_ratio=0.6, random_state=42))
])

elasticnet_model.fit(X_tr, y_tr)

Pipeline(steps=[('num', ...), ('enet', ...)], transform_input=None, memory=None, verbose=False)
StandardScaler(copy=True, with_mean=True, with_std=True)
ElasticNet(alpha=0.01743328822199989, l1_ratio=0.6, fit_intercept=True, precompute=False, max_iter=1000, copy_X=True, tol=0.0001, warm_start=False, positive=False, random_state=42)


In [47]:
def optuna_elasticnet(
    X, y,
    scoring="neg_mean_absolute_error",   
    cv=None,                            
    X_val=None, y_val=None,      
    n_trials=50,
    alpha_low=1e-4, alpha_high=1e1,
    l1_low=0.0,  l1_high=1.0,
    random_state=42,
    n_jobs=-1,
    silence_logs=True,
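):
    """Tune ElasticNet (alpha, l1_ratio) with Optuna's TPE sampler.

    Trials are scored by cross-validation when cv is given, otherwise on the
    holdout set (X_val, y_val). Returns (best_model, best_params, trials_df, study).
    """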
    if silence_logs:
        optuna.logging.set_verbosity(optuna.logging.WARNING)

    num_cols = X.select_dtypes(include=[np.number, "bool"]).columns.tolist()
    if not num_cols:
        raise ValueError("X has no numeric or boolean features.")

    scorer = get_scorer(scoring) if isinstance(scoring, str) else scoring

    def make_pipe(alpha, l1_ratio):
        return Pipeline([
            ("prep", ColumnTransformer([("num", StandardScaler(), num_cols)], remainder="drop")),
            ("enet", ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10000, random_state=random_state))
        ])

    def objective(trial: optuna.Trial) -> float:
        alpha    = trial.suggest_float("alpha",    alpha_low, alpha_high, log=True)
        l1_ratio = trial.suggest_float("l1_ratio", l1_low,    l1_high)
        pipe = make_pipe(alpha, l1_ratio)

        if cv is not None:
            scores = cross_val_score(pipe, X, y, scoring=scorer, cv=cv, n_jobs=n_jobs)
            return float(scores.mean())  # sklearn scorers (neg-MAE, R2, ...) follow "higher is better"
        else:
            if X_val is None or y_val is None:
                raise ValueError("Holdout mode requires X_val and y_val.")
            pipe.fit(X, y)
            if scorer is None:
                return float(pipe.score(X_val, y_val))
            return float(scorer(pipe, X_val, y_val))

    study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=random_state))
    study.optimize(objective, n_trials=n_trials)

    best_alpha  = study.best_params["alpha"]
    best_l1r    = study.best_params["l1_ratio"]
    best_model  = make_pipe(best_alpha, best_l1r).fit(X, y)

    trials_df = study.trials_dataframe(attrs=("number","value","params","state","datetime_start","datetime_complete"))
    best_params = {"alpha": best_alpha, "l1_ratio": best_l1r, "best_score": study.best_value}
    
    return best_model, best_params, trials_df, study

In [48]:
cv = KFold(n_splits=5, shuffle=True, random_state=42)
model_cv, params_cv, trials_cv, _ = optuna_elasticnet(
    X_tr, y_tr,
    scoring="neg_mean_absolute_error",
    cv=cv, n_trials=50, random_state=42
)

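# Holdout variant: each trial is scored on (X_val, y_val), so metrics computed
# later on that same set are optimistic for this model.
model_ho, params_ho, trials_ho, _ = optuna_elasticnet(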
    X_tr, y_tr,
    scoring="neg_mean_absolute_error",
    cv=None, X_val=X_val, y_val=y_val,
    n_trials=50, random_state=42
)

print(f'Optuna CV: {params_cv}')
print(f'Optuna no CV: {params_ho}')
print(f'Metrics CV: {evaluate_model(model_cv, X_val, y_val)}')
print(f'Metrics no CV: {evaluate_model(model_ho, X_val, y_val)}')

Optuna CV: {'alpha': 0.24604299061871288, 'l1_ratio': 0.5256193055092931, 'best_score': -1117.7500055160203}
Optuna no CV: {'alpha': 0.23557608772583644, 'l1_ratio': 0.5936455307820954, 'best_score': -1135.8130495680796}
Metrics CV: (1135.8421238821604, 1794.771241305407, 0.1760227766343616, None)
Metrics no CV: (1135.8130495680796, 1793.9865767857568, 0.1767430961314822, None)
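In the holdout run, `best_score` (-1135.813) equals the negated MAE later reported on `X_val`, because the same set was used both to pick the hyperparameters and to evaluate them. The sketch below is a small standard-library simulation of why a score chosen as the best of many noisy estimates tends to look better than the true score. All constants are hypothetical and chosen only for illustration; they are not measurements from this notebook.

```python
import random
import statistics

rng = random.Random(0)

TRUE_SCORE = -1136.0  # hypothetical true score shared by every candidate
NOISE_SD = 5.0        # hypothetical noise of a single validation estimate
N_TRIALS = 50         # same trial budget as the Optuna runs above
N_REPEATS = 2000      # number of simulated tuning runs

best_scores = []
for _ in range(N_REPEATS):
    # Every candidate is equally good; only the validation noise differs.
    noisy_scores = [rng.gauss(TRUE_SCORE, NOISE_SD) for _ in range(N_TRIALS)]
    # Tuning keeps the candidate with the highest observed score.
    best_scores.append(max(noisy_scores))

optimism = statistics.mean(best_scores) - TRUE_SCORE
print(f"average optimism of the selected score: {optimism:.2f}")
```

The selected score is systematically above the true one even though no candidate is actually better. A set that took no part in tuning would be needed for an unbiased estimate of the tuned model.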
