### 1. Questions

1. Data splitting methods:

| Method                   | Description                                                                        | Pros                                                    | Cons                                                                        |
|--------------------------|------------------------------------------------------------------------------------|---------------------------------------------------------|-----------------------------------------------------------------------------|
| Leave-One-Out (LOO)      | Each iteration holds out a single observation as the test set.                     | Uses almost all of the data for training.               | Computationally expensive (one fit per observation); high-variance estimate. |
| Hold-Out                 | A single train/test split (for example, 80/20).                                    | Fast and simple.                                        | The result depends on the particular split.                                 |
| K-Fold Cross-Validation  | The data is divided into K parts; each fold serves as the test set in turn.        | More stable estimate of model quality.                  | Requires training the model K times.                                        |
| Stratified K-Fold        | Like K-Fold, but every fold preserves the class proportions of the full dataset.   | Suited to imbalanced data.                              | Slightly slower than plain K-Fold.                                          |
| Time Series Split        | Splits respect temporal order: train on earlier data, test on later data.          | Suited to time series.                                  | Old data may not reflect future trends.                                     |
| Group K-Fold             | All objects of one group land either in train or in test, never in both.           | Prevents leakage between related objects.               | Less flexible splits.                                                       |
| Leave-P-Out              | Every possible subset of P objects serves as the test set in turn.                 | Exhaustive estimate, like LOO.                          | Very expensive: the number of splits, C(n, P), grows combinatorially.       |
| Monte Carlo Split        | Repeated random train/test splits.                                                 | Flexible choice of split sizes and number of repeats.   | Test sets can overlap between repeats; some objects may never be tested.    |
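
The cost column follows from how many train/test splits each scheme generates. A minimal sketch using scikit-learn's splitters on a hypothetical 6-row array (the array and the parameter values are illustrative assumptions, not part of the assignment):

```python
import numpy as np
from math import comb
from sklearn.model_selection import KFold, LeaveOneOut, LeavePOut, ShuffleSplit

X_toy = np.arange(12).reshape(6, 2)  # hypothetical data: 6 samples, 2 features

n_kfold = KFold(n_splits=3).get_n_splits(X_toy)   # one split per fold
n_loo = LeaveOneOut().get_n_splits(X_toy)         # one split per sample
n_lpo = LeavePOut(p=2).get_n_splits(X_toy)        # one split per pair of samples: C(6, 2)
n_mc = ShuffleSplit(n_splits=5, test_size=2, random_state=0).get_n_splits(X_toy)

print(n_kfold, n_loo, n_lpo, n_mc, comb(6, 2))
```

Even with P = 2 the Leave-P-Out count already exceeds the sample count, which is why it is rarely practical beyond small datasets.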


2. Hyperparameter tuning methods:

| Method                | Description                                                                       | Pros                                                                   | Cons                                                                                     |
|-----------------------|-----------------------------------------------------------------------------------|------------------------------------------------------------------------|------------------------------------------------------------------------------------------|
| Grid Search           | Exhaustive search over every combination in a predefined grid.                    | Finds the best combination within the grid; simple to implement.       | Very expensive; only the listed values are tried, so it is inefficient for continuous parameters. |
| Randomized Search     | Samples hyperparameter values at random from given ranges or distributions.       | Faster than Grid Search; covers more distinct values of each parameter. | No guarantee of finding the optimum.                                                     |
| Bayesian Optimization | Builds a probabilistic model of how the metric depends on the hyperparameters.    | More efficient than exhaustive search; uses the results of earlier trials. | Harder to implement; fitting the surrogate model adds computational overhead.            |

3. Feature selection methods:
- Filter methods score features independently of any machine learning model. They use statistical or information-theoretic measures to rate each feature and drop those with the weakest relationship to the target:
    - Pearson correlation: detects a linear dependence between a feature and the target.
    - Chi-squared test: applies to categorical data and measures the dependence between a feature and the target.
- Wrapper methods use a machine learning model to evaluate subsets of features. They train the model on different subsets and compare performance to pick the best subset:
    - Recursive Feature Elimination (RFE): repeatedly trains the model and removes the least important features at each step.
    - Stepwise selection (Forward/Backward Selection): forward selection starts from an empty set and adds features one at a time; backward selection starts from the full set and removes them one at a time, in both cases based on the effect on model performance.
- Embedded methods select features during model training. The learning algorithm itself estimates feature importance as it fits:
    - Lasso (L1 regularization): shrinks coefficients and drives some of them exactly to zero, which removes the less important features.
    - Ridge (L2 regularization): shrinks coefficients but does not set them exactly to zero, so on its own it only down-weights features rather than removing them.
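
The difference between L1 and L2 shrinkage can be seen on synthetic data. This is a minimal sketch, assuming a hypothetical dataset in which only the first two of six columns carry signal; the `alpha` values are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 6))
# Only the first two columns carry signal; the other four are pure noise.
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X_toy, y_toy)
ridge = Ridge(alpha=0.5).fit(X_toy, y_toy)

# Lasso is expected to zero out noise columns; Ridge only shrinks them.
print("Lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
print("Ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))
```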

### 2. Introduction

In [105]:
import pandas as pd
import numpy as np
import shap
import math
import optuna
import logging
from collections import Counter
from sklearn.model_selection import train_test_split, KFold, GroupKFold, StratifiedKFold, TimeSeriesSplit
from sklearn.linear_model import LassoCV, ElasticNet
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV

In [2]:
df = pd.read_json('data/train.json')
df.head(3)

Unnamed: 0,bathrooms,bedrooms,building_id,created,description,display_address,features,latitude,listing_id,longitude,manager_id,photos,price,street_address,interest_level
4,1.0,1,8579a0b0d54db803821a35a4a615e97a,2016-06-16 05:55:27,Spacious 1 Bedroom 1 Bathroom in Williamsburg!...,145 Borinquen Place,"[Dining Room, Pre-War, Laundry in Building, Di...",40.7108,7170325,-73.9539,a10db4590843d78c784171a107bdacb4,[https://photos.renthop.com/2/7170325_3bb5ac84...,2400,145 Borinquen Place,medium
6,1.0,2,b8e75fc949a6cd8225b455648a951712,2016-06-01 05:44:33,BRAND NEW GUT RENOVATED TRUE 2 BEDROOMFind you...,East 44th,"[Doorman, Elevator, Laundry in Building, Dishw...",40.7513,7092344,-73.9722,955db33477af4f40004820b4aed804a0,[https://photos.renthop.com/2/7092344_7663c19a...,3800,230 East 44th,low
9,1.0,2,cd759a988b8f23924b5a2058d5ab2b49,2016-06-14 15:19:59,**FLEX 2 BEDROOM WITH FULL PRESSURIZED WALL**L...,East 56th Street,"[Doorman, Elevator, Laundry in Building, Laund...",40.7575,7158677,-73.9625,c8b10a317b766204f08e613cef4ce7a0,[https://photos.renthop.com/2/7158677_c897a134...,3495,405 East 56th Street,medium


In [3]:
lower = df['price'].quantile(0.01)
upper = df['price'].quantile(0.99)
df = df[(df['price'] > lower) & (df['price'] < upper)]

In [4]:
pd.set_option('future.no_silent_downcasting', True)
df['interest_level'] = df['interest_level'].replace({'low': 0, 'medium': 1, 'high': 2}).astype(int)
features_list = [item for row in df['features'] for item in row]
most_common = [name for name, count in Counter(features_list).most_common(20)]

In [5]:
result_df = df[['bathrooms', 'bedrooms', 'interest_level', 'created', 'price', 'features']].copy()

for feature in most_common:
    result_df[feature] = df['features'].apply(lambda x: 1 if feature in x else 0)

result_df = result_df.drop('features', axis=1)

In [6]:
result_df.head()

Unnamed: 0,bathrooms,bedrooms,interest_level,created,price,Elevator,Hardwood Floors,Cats Allowed,Dogs Allowed,Doorman,...,Laundry in Unit,Roof Deck,Outdoor Space,Dining Room,High Speed Internet,Balcony,Swimming Pool,Laundry In Building,New Construction,Terrace
4,1.0,1,1,2016-06-16 05:55:27,2400,0,1,1,1,0,...,0,0,0,1,0,0,0,0,0,0
6,1.0,2,0,2016-06-01 05:44:33,3800,1,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
9,1.0,2,1,2016-06-14 15:19:59,3495,1,1,0,0,1,...,1,0,0,0,0,0,0,0,0,0
10,1.5,3,1,2016-06-24 07:54:24,3000,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
15,1.0,0,0,2016-06-28 03:50:23,2795,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
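
The indicator-column step can be checked on a tiny frame. A sketch on hypothetical listings (three made-up rows, not the real dataset); binding the loop variable as a default argument `n=name` is a defensive choice here, not something the notebook requires:

```python
from collections import Counter
import pandas as pd

# Hypothetical listings, each with a list of amenities.
toy = pd.DataFrame({"features": [["Elevator", "Doorman"], ["Elevator"], []]})

counts = Counter(item for row in toy["features"] for item in row)
top = [name for name, _ in counts.most_common(2)]

for name in top:
    # 1 if the amenity is listed for this row, otherwise 0
    toy[name] = toy["features"].apply(lambda items, n=name: int(n in items))

print(toy.drop(columns="features"))
```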


### 3. Implementing data splits

In [7]:
X = result_df.drop('price', axis=1)
y = result_df['price']

In [8]:
def split_into_two(X, y, test_size=0.2):
    # Shuffle row positions using NumPy's global random state (not seeded here)
    indices = np.arange(len(X))
    np.random.shuffle(indices)
    # Note: the trailing "- 1" makes the test set one row smaller than int(len(X) * test_size)
    test_size = int(len(X) * test_size) - 1

    test_indices = indices[:test_size]
    train_indices = indices[test_size:]
    X_train, X_test = X.iloc[train_indices], X.iloc[test_indices]
    y_train, y_test = y.iloc[train_indices], y.iloc[test_indices]

    return X_train, X_test, y_train, y_test

In [9]:
X_train, X_test, y_train, y_test = split_into_two(X, y, test_size=0.2)

In [10]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(38676, 24) (9667, 24) (38676,) (9667,)
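
`np.random.shuffle` draws from NumPy's global random state, so the hold-out split changes between runs unless a seed is set. A minimal sketch of a reproducible variant; the function name `split_into_two_seeded` and the `seed` parameter are hypothetical additions, and the toy frame is made up:

```python
import numpy as np
import pandas as pd

def split_into_two_seeded(X, y, test_size=0.2, seed=21):
    """Hold-out split driven by a local, seeded random generator."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X.iloc[train_idx], X.iloc[test_idx], y.iloc[train_idx], y.iloc[test_idx]

X_toy = pd.DataFrame({"a": range(10)})
y_toy = pd.Series(range(10))
first = split_into_two_seeded(X_toy, y_toy)
second = split_into_two_seeded(X_toy, y_toy)
# The same seed yields the same test rows on every call.
print(first[1].index.tolist() == second[1].index.tolist())
```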


In [11]:
def split_into_three(X, y, test_size=0.2, validation_size=0.2):
    # No shuffling: rows are taken in their original order
    indices = np.arange(len(X))

    test_size = int(len(X) * test_size)
    valid_size = int(len(X) * validation_size)

    # First block is test, second is validation, the remainder is train
    test_indices = indices[:test_size]
    validation_indices = indices[test_size:test_size + valid_size]
    train_indices = indices[test_size + valid_size:]

    X_train, X_valid, X_test = X.iloc[train_indices], X.iloc[validation_indices], X.iloc[test_indices]
    y_train, y_valid, y_test = y.iloc[train_indices], y.iloc[validation_indices], y.iloc[test_indices]

    return X_train, X_valid, X_test, y_train, y_valid, y_test

In [12]:
X_train, X_valid, X_test, y_train, y_valid, y_test = split_into_three(X, y, test_size=0.2, validation_size=0.2)

In [13]:
print(X_train.shape, X_valid.shape, X_test.shape, y_train.shape, y_valid.shape, y_test.shape)

(29007, 24) (9668, 24) (9668, 24) (29007,) (9668,) (9668,)


In [14]:
def split_date_two(X, y, date_split='2016-06-01'):
    # Convert on a copy so the caller's DataFrame is not modified
    X = X.assign(created=pd.to_datetime(X['created'])).sort_values(by='created')
    train_df = X[X['created'] < date_split]
    test_df = X[X['created'] >= date_split]

    X_train, X_test = X.loc[train_df.index], X.loc[test_df.index]
    y_train, y_test = y.loc[train_df.index], y.loc[test_df.index]

    return X_train, X_test, y_train, y_test

In [15]:
X_train, X_test, y_train, y_test = split_date_two(X, y, date_split='2016-06-01')

In [16]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(31551, 24) (16792, 24) (31551,) (16792,)
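
The defining property of a date-based split is that every training row precedes every test row. A small sketch on three hypothetical rows (timestamps and prices are made up for illustration), comparing against an explicit `pd.Timestamp` rather than a string:

```python
import pandas as pd

toy = pd.DataFrame({
    "created": ["2016-05-30 10:00:00", "2016-06-01 00:00:00", "2016-06-02 08:30:00"],
    "price": [2400, 3800, 3495],
})
created = pd.to_datetime(toy["created"])
cutoff = pd.Timestamp("2016-06-01")

# Rows strictly before the cutoff go to train; the cutoff itself belongs to test.
train, test = toy[created < cutoff], toy[created >= cutoff]
print(len(train), len(test))
```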


In [17]:
def split_date_three(X, y, validation_date='2016-05-01', test_date='2016-06-01'):
    # Convert on a copy so the caller's DataFrame is not modified
    X = X.assign(created=pd.to_datetime(X['created'])).sort_values(by='created')
    train_df = X[X['created'] < validation_date]
    valid_df = X[(X['created'] >= validation_date) & (X['created'] < test_date)]
    test_df = X[X['created'] >= test_date]

    X_train, X_valid, X_test = X.loc[train_df.index], X.loc[valid_df.index], X.loc[test_df.index]
    y_train, y_valid, y_test = y.loc[train_df.index], y.loc[valid_df.index], y.loc[test_df.index]

    return X_train, X_valid, X_test, y_train, y_valid, y_test

In [18]:
X_train, X_valid, X_test, y_train, y_valid, y_test = split_date_three(X, y, validation_date='2016-05-01',
                                                                      test_date='2016-06-01')

### 4. Cross-validation methods

In [19]:
def my_kfold(X, k=4):
    indices = np.arange(len(X))
    # Spread the remainder over the first folds so that every row is tested exactly once
    fold_sizes = np.full(k, len(X) // k, dtype=int)
    fold_sizes[:len(X) % k] += 1

    current = 0
    for fold_size in fold_sizes:
        start, stop = current, current + fold_size
        test_index = indices[start:stop]
        train_index = np.concatenate([indices[:start], indices[stop:]])
        current = stop
        yield train_index, test_index


for fold, (train_index, test_index) in enumerate(my_kfold(X, k=4)):
    print(f'Fold {fold + 1}')
    print(f'Train index {train_index}')
    print(f'Test index {test_index}')

Fold 1
Train index [12086 12087 12088 ... 48340 48341 48342]
Test index [    0     1     2 ... 12083 12084 12085]
Fold 2
Train index [    0     1     2 ... 48340 48341 48342]
Test index [12086 12087 12088 ... 24169 24170 24171]
Fold 3
Train index [    0     1     2 ... 48340 48341 48342]
Test index [24172 24173 24174 ... 36255 36256 36257]
Fold 4
Train index [    0     1     2 ... 36255 36256 36257]
Test index [36258 36259 36260 ... 48340 48341 48342]


In [20]:
def my_group_kfold(X, group, k=3):
    indices = np.arange(len(X))
    unique_groups = np.unique(group)

    for i in range(k):
        # Take the i-th slice of the unique groups as the test groups of this fold
        test_groups = unique_groups[i * len(unique_groups) // k: (i + 1) * len(unique_groups) // k]
        # Positions of the rows that belong to the test groups
        test_index = np.where(np.isin(group, test_groups))[0]
        # Training positions are all remaining rows
        train_index = np.setdiff1d(indices, test_index)

        yield train_index, test_index


for fold, (train_index, test_index) in enumerate(my_group_kfold(X, group=X['interest_level'], k=3)):
    print(f'Fold {fold + 1}')
    print(f'Train index {train_index}, Group {X.iloc[train_index]["interest_level"].values}')
    print(f'Test index {test_index}, Group {X.iloc[test_index]["interest_level"].values}')

Fold 1
Train index [    0     2     3 ... 48340 48341 48342], Group [1 1 1 ... 1 1 2]
Test index [    1     4     5 ... 48336 48337 48338], Group [0 0 0 ... 0 0 0]
Fold 2
Train index [    1     4     5 ... 48337 48338 48342], Group [0 0 0 ... 0 0 2]
Test index [    0     2     3 ... 48339 48340 48341], Group [1 1 1 ... 1 1 1]
Fold 3
Train index [    0     1     2 ... 48339 48340 48341], Group [1 0 1 ... 1 1 1]
Test index [    7    17    31 ... 48303 48332 48342], Group [2 2 2 ... 2 2 2]
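
In the cell above the groups are the three `interest_level` values, so each test fold contains exactly one level. Group-wise splitting is more commonly applied to an identifier with many values. The sketch below uses scikit-learn's `GroupKFold` with hypothetical manager-style labels (`m1` to `m4` are invented, not taken from the dataset) and checks that no group appears on both sides of a split:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical group labels: two rows per group.
groups = np.array(["m1", "m1", "m2", "m2", "m3", "m3", "m4", "m4"])
X_toy = np.zeros((len(groups), 1))

overlaps = []
for train_idx, test_idx in GroupKFold(n_splits=2).split(X_toy, groups=groups):
    shared = set(groups[train_idx]) & set(groups[test_idx])
    overlaps.append(len(shared))
    print(sorted(set(groups[test_idx])), "shared groups:", len(shared))
```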


In [21]:
def my_stratify_kfold(X, stratify, k=3):
    unique_classes = np.unique(stratify)

    # Build the folds class by class
    for i in range(k):
        test_index = []
        train_index = []

        # Split each class separately
        for cls in unique_classes:
            cls_indices = np.where(stratify == cls)[0]
            # Integer division: the last len(cls_indices) % k rows of a class never reach a test fold
            fold_size = len(cls_indices) // k

            # Take the i-th slice of this class for testing, the rest for training
            test_cls_indices = cls_indices[i * fold_size: (i + 1) * fold_size]
            train_cls_indices = np.setdiff1d(cls_indices, test_cls_indices)

            test_index.extend(test_cls_indices)
            train_index.extend(train_cls_indices)

        yield np.sort(train_index), np.sort(test_index)


for fold, (train_index, test_index) in enumerate(my_stratify_kfold(X, stratify=y, k=3)):
    print(f'Fold {fold + 1}')
    print(f'Train index {train_index}')
    print(f'Test index {test_index}')

Fold 1
Train index [   23    70    82 ... 48340 48341 48342]
Test index [    0     1     2 ... 38901 42672 43645]
Fold 2
Train index [    0     1     2 ... 48340 48341 48342]
Test index [  615  1400  1796 ... 46799 46820 47205]
Fold 3
Train index [    0     1     2 ... 48337 48340 48342]
Test index [ 3560  3691  4176 ... 48338 48339 48341]
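
Because `fold_size` is computed with integer division, a few rows of each class can be left out of every test fold. One way to cover all rows is to cut each class into k nearly equal chunks with `np.array_split`. This is a sketch under that assumption; the name `stratified_kfold_indices` and the toy labels are hypothetical:

```python
import numpy as np

def stratified_kfold_indices(labels, k=3):
    """Yield (train, test) positions; each class is spread over the k folds."""
    labels = np.asarray(labels)
    # For every class, cut its positions into k nearly equal chunks (no row is dropped).
    chunks = [np.array_split(np.where(labels == cls)[0], k) for cls in np.unique(labels)]
    all_idx = np.arange(len(labels))
    for i in range(k):
        test_idx = np.sort(np.concatenate([per_class[i] for per_class in chunks]))
        yield np.setdiff1d(all_idx, test_idx), test_idx

labels = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])  # hypothetical imbalanced labels
folds = list(stratified_kfold_indices(labels, k=3))
tested = np.concatenate([test for _, test in folds])
# Every row should appear in exactly one test fold.
print(np.array_equal(np.sort(tested), np.arange(len(labels))))
```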


In [22]:
def my_time_split(X, date, k=3):
    X_sorted = X.sort_values(by=date)
    indices = np.arange(len(X_sorted))
    fold_size = len(X_sorted) // k

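    for i in range(k):
        start = i * fold_size
        stop = (i + 1) * fold_size if i != k - 1 else len(X_sorted)

        # Indices are positions in X_sorted, not in the original X
        test_index = indices[start:stop]
        # Note: the training range ends at `stop`, so it includes the test rows of the same fold
        train_index = indices[:stop]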

        yield train_index, test_index


for fold, (train_index, test_index) in enumerate(my_time_split(X, date='created', k=3)):
    print(f'Fold {fold + 1}')
    print(f'Train index {train_index}')
    print(f'Test index {test_index}')

Fold 1
Train index [    0     1     2 ... 16111 16112 16113]
Test index [    0     1     2 ... 16111 16112 16113]
Fold 2
Train index [    0     1     2 ... 32225 32226 32227]
Test index [16114 16115 16116 ... 32225 32226 32227]
Fold 3
Train index [    0     1     2 ... 48340 48341 48342]
Test index [32228 32229 32230 ... 48340 48341 48342]


### 5. Comparing the methods

In [23]:
X = result_df.drop('price', axis=1)
y = result_df['price']
print(X.shape, y.shape)

(48343, 24) (48343,)


In [24]:
X_train, X_test, y_train, y_test = split_into_two(X, y, test_size=0.2)
print(f'Size of train set: {len(X_train)}')
print(f'Size of test set: {len(y_test)}')

Size of train set: 38676
Size of test set: 9667


In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(f'Size of train set: {len(X_train)}')
print(f'Size of test set: {len(y_test)}')

Size of train set: 38674
Size of test set: 9669


In [26]:
for i, (train_index, test_index) in enumerate(my_kfold(X, k=4)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")

Fold 0:
  Train: index=[12086 12087 12088 ... 48340 48341 48342]
  Test:  index=[    0     1     2 ... 12083 12084 12085]
Fold 1:
  Train: index=[    0     1     2 ... 48340 48341 48342]
  Test:  index=[12086 12087 12088 ... 24169 24170 24171]
Fold 2:
  Train: index=[    0     1     2 ... 48340 48341 48342]
  Test:  index=[24172 24173 24174 ... 36255 36256 36257]
Fold 3:
  Train: index=[    0     1     2 ... 36255 36256 36257]
  Test:  index=[36258 36259 36260 ... 48340 48341 48342]


In [27]:
sk_kf = KFold(n_splits=4)

for i, (train_index, test_index) in enumerate(sk_kf.split(X)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")

Fold 0:
  Train: index=[12086 12087 12088 ... 48340 48341 48342]
  Test:  index=[    0     1     2 ... 12083 12084 12085]
Fold 1:
  Train: index=[    0     1     2 ... 48340 48341 48342]
  Test:  index=[12086 12087 12088 ... 24169 24170 24171]
Fold 2:
  Train: index=[    0     1     2 ... 48340 48341 48342]
  Test:  index=[24172 24173 24174 ... 36255 36256 36257]
Fold 3:
  Train: index=[    0     1     2 ... 36255 36256 36257]
  Test:  index=[36258 36259 36260 ... 48340 48341 48342]


In [28]:
for i, (train_index, test_index) in enumerate(my_group_kfold(X, group=X['interest_level'], k=3)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")

Fold 0:
  Train: index=[    0     2     3 ... 48340 48341 48342]
  Test:  index=[    1     4     5 ... 48336 48337 48338]
Fold 1:
  Train: index=[    1     4     5 ... 48337 48338 48342]
  Test:  index=[    0     2     3 ... 48339 48340 48341]
Fold 2:
  Train: index=[    0     1     2 ... 48339 48340 48341]
  Test:  index=[    7    17    31 ... 48303 48332 48342]


In [29]:
sk_gkf = GroupKFold(n_splits=3)

for i, (train_index, test_index) in enumerate(sk_gkf.split(X, groups=X['interest_level'])):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")

Fold 0:
  Train: index=[    0     2     3 ... 48340 48341 48342]
  Test:  index=[    1     4     5 ... 48336 48337 48338]
Fold 1:
  Train: index=[    1     4     5 ... 48337 48338 48342]
  Test:  index=[    0     2     3 ... 48339 48340 48341]
Fold 2:
  Train: index=[    0     1     2 ... 48339 48340 48341]
  Test:  index=[    7    17    31 ... 48303 48332 48342]


In [30]:
for i, (train_index, test_index) in enumerate(my_stratify_kfold(X, stratify=X['interest_level'], k=3)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")

Fold 0:
  Train: index=[15996 15997 15998 ... 48340 48341 48342]
  Test:  index=[    0     1     2 ... 16453 16456 16461]
Fold 1:
  Train: index=[    0     1     2 ... 48340 48341 48342]
  Test:  index=[15996 15997 15998 ... 32352 32358 32361]
Fold 2:
  Train: index=[    0     1     2 ... 48340 48341 48342]
  Test:  index=[32169 32171 32172 ... 48337 48338 48339]


In [31]:
sk_skf = StratifiedKFold(n_splits=3)

for i, (train_index, test_index) in enumerate(sk_skf.split(X, y=X['interest_level'])):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")

Fold 0:
  Train: index=[15996 15997 15998 ... 48340 48341 48342]
  Test:  index=[    0     1     2 ... 16456 16461 16465]
Fold 1:
  Train: index=[    0     1     2 ... 48340 48341 48342]
  Test:  index=[15996 15997 15998 ... 32361 32367 32371]
Fold 2:
  Train: index=[    0     1     2 ... 32361 32367 32371]
  Test:  index=[32169 32171 32172 ... 48340 48341 48342]


In [32]:
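def my_time_split(X, date, k=3):
    X_sorted = X.sort_values(by=date)
    indices = np.arange(len(X_sorted))  # positions in X_sorted, not in the original X
    fold_size = len(X_sorted) // k

    for i in range(k):
        # Train on everything before `start`, test on the block that follows
        start = (i + 1) * fold_size
        end = (i + 2) * fold_size  # may run past the data: the last fold keeps only the leftover rows

        train_index = indices[:start]
        test_index = indices[start:end]

        yield train_index, test_index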

In [33]:
for i, (train_index, test_index) in enumerate(my_time_split(X, date='created', k=3)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")

Fold 0:
  Train: index=[    0     1     2 ... 16111 16112 16113]
  Test:  index=[16114 16115 16116 ... 32225 32226 32227]
Fold 1:
  Train: index=[    0     1     2 ... 32225 32226 32227]
  Test:  index=[32228 32229 32230 ... 48339 48340 48341]
Fold 2:
  Train: index=[    0     1     2 ... 48339 48340 48341]
  Test:  index=[48342]
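
The last fold above contains a single row because the data is cut into k blocks while k test blocks plus an initial training block are needed. A sketch of an expanding-window split that cuts the data into k + 1 blocks instead; the function name `expanding_window_indices` is hypothetical, and the comparison with `TimeSeriesSplit` assumes its default settings (no gap, no maximum train size):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

def expanding_window_indices(n_samples, k=3):
    """Yield (train, test) positions for data already sorted by time."""
    indices = np.arange(n_samples)
    test_size = n_samples // (k + 1)        # k test blocks plus one initial train block
    first_test = n_samples - k * test_size  # leftover rows go to the first train block
    for start in range(first_test, n_samples, test_size):
        yield indices[:start], indices[start:start + test_size]

ours = [(tr.tolist(), te.tolist()) for tr, te in expanding_window_indices(11, k=3)]
theirs = [(tr.tolist(), te.tolist())
          for tr, te in TimeSeriesSplit(n_splits=3).split(np.zeros(11))]
print(ours == theirs)
```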


In [34]:
sk_tss = TimeSeriesSplit(n_splits=3)

for i, (train_index, test_index) in enumerate(sk_tss.split(X, y=X['created'])):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")

Fold 0:
  Train: index=[    0     1     2 ... 12085 12086 12087]
  Test:  index=[12088 12089 12090 ... 24170 24171 24172]
Fold 1:
  Train: index=[    0     1     2 ... 24170 24171 24172]
  Test:  index=[24173 24174 24175 ... 36255 36256 36257]
Fold 2:
  Train: index=[    0     1     2 ... 36255 36256 36257]
  Test:  index=[36258 36259 36260 ... 48340 48341 48342]


### 6. Parameter search

In [35]:
X = result_df.drop(['created', 'price'], axis=1)
y = result_df['price']

In [36]:
X_train, X_valid, X_test, y_train, y_valid, y_test = split_into_three(X, y, test_size=0.2,
                                                                      validation_size=0.2)  # Info: shuffle=False

In [37]:
metrics = pd.DataFrame(columns=['name', 'metric', 'train', 'valid', 'test'])

In [38]:
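def write_metrics(model, name, X_train, X_valid, X_test):
    # Relies on the global y_train, y_valid and y_test for the true targets
    y_pred_train = model.predict(X_train)
    y_pred_valid = model.predict(X_valid)
    y_pred_test = model.predict(X_test)

    # NOTE: each appended row holds [R2, MAE, RMSE] for ONE split, so the labels do not
    # match the contents: the row tagged 'R2' is the train split, 'MAE' is the valid split
    # and 'RMSE' is the test split, while the columns 'train', 'valid' and 'test' hold
    # R2, MAE and RMSE respectively.
    if name not in metrics['name'].values:
        metrics.loc[len(metrics)] = [name, 'R2', r2_score(y_train, y_pred_train),
                                     mean_absolute_error(y_train, y_pred_train),
                                     np.sqrt(mean_squared_error(y_train, y_pred_train))]
        metrics.loc[len(metrics)] = [name, 'MAE', r2_score(y_valid, y_pred_valid),
                                     mean_absolute_error(y_valid, y_pred_valid),
                                     np.sqrt(mean_squared_error(y_valid, y_pred_valid))]
        metrics.loc[len(metrics)] = [name, 'RMSE', r2_score(y_test, y_pred_test),
                                     mean_absolute_error(y_test, y_pred_test),
                                     np.sqrt(mean_squared_error(y_test, y_pred_test))]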
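
A layout in which every row is one metric and every column is one split makes the `metric`, `train`, `valid` and `test` labels line up with the values. The sketch below is one possible way to build such a table; the helper `metrics_by_split`, its signature and the toy data are hypothetical and are not the function used to produce the recorded outputs in this notebook:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def metrics_by_split(model, name, splits):
    """Return one row per metric with one column per split.

    `splits` maps a split name to an (X, y) pair.
    """
    scorers = {
        "R2": r2_score,
        "MAE": mean_absolute_error,
        "RMSE": lambda y_true, y_pred: np.sqrt(mean_squared_error(y_true, y_pred)),
    }
    rows = []
    for metric, scorer in scorers.items():
        row = {"name": name, "metric": metric}
        for split, (X_part, y_part) in splits.items():
            row[split] = scorer(y_part, model.predict(X_part))
        rows.append(row)
    return pd.DataFrame(rows)

# Toy data with an exact linear relationship, so R2 is 1 and the errors are close to 0.
X_toy = np.arange(10, dtype=float).reshape(-1, 1)
y_toy = 2.0 * X_toy.ravel() + 1.0
model = LinearRegression().fit(X_toy[:6], y_toy[:6])
table = metrics_by_split(model, "toy", {"train": (X_toy[:6], y_toy[:6]),
                                        "valid": (X_toy[6:8], y_toy[6:8]),
                                        "test": (X_toy[8:], y_toy[8:])})
print(table)
```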


In [39]:
lasso_cv = LassoCV(cv=10)
lasso_cv.fit(X_train, y_train)
print(f'Best alpha: {lasso_cv.alpha_}')
print(f'Coefficients: {lasso_cv.coef_}')
write_metrics(lasso_cv, 'lasso_default', X_train, X_valid, X_test)

Best alpha: 0.9548653759868562
Coefficients: [1484.22165089  476.97931121 -405.39409014  219.40302491  -83.53871496
   -6.22907031   62.64397499  549.44293187  136.78643156  -82.39140043
 -181.05643333  189.17720428  -49.64214862  452.44198693  -97.45896539
  -46.68470313   97.16973705 -181.86160668  -60.08743073   -0.
 -173.82045417  -96.08519856  149.51257286]


In [40]:
coefficients = np.abs(lasso_cv.coef_)
indices = np.argsort(coefficients)[::-1]
top_10_indices = indices[:10]
top_10_indices

array([ 0,  7,  1, 13,  2,  3, 11, 17, 10, 20])

In [41]:
X_train_10_weight = X_train.iloc[:, top_10_indices]
X_valid_10_weight = X_valid.iloc[:, top_10_indices]
X_test_10_weight = X_test.iloc[:, top_10_indices]

In [42]:
lasso_cv = LassoCV(cv=10)
lasso_cv.fit(X_train_10_weight, y_train)
write_metrics(lasso_cv, 'X_train_10_weight', X_train_10_weight, X_valid_10_weight, X_test_10_weight)

In [43]:
corr_matrix = result_df.drop('created', axis=1).corr()
corr_matrix = corr_matrix.sort_values(by='price', ascending=False)
top_10_corr = corr_matrix.index[1:11].tolist()
print(top_10_corr)

['bathrooms', 'bedrooms', 'Doorman', 'Laundry in Unit', 'Fitness Center', 'Dishwasher', 'Dining Room', 'Elevator', 'Outdoor Space', 'Laundry in Building']
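
Sorting by the signed correlation ranks strongly negative features last, even though they can be just as informative as positive ones. A sketch on a hypothetical frame (all values invented) that compares ranking by sign with ranking by absolute value:

```python
import pandas as pd

# Hypothetical frame: 'age' is perfectly negatively related to price, 'noise' only weakly related.
toy = pd.DataFrame({
    "price": [1000, 2000, 3000, 4000, 5000],
    "rooms": [1, 2, 2, 3, 4],
    "age":   [50, 40, 30, 20, 10],
    "noise": [3, 1, 4, 1, 5],
})

corr_with_price = toy.corr()["price"].drop("price")
by_sign = corr_with_price.sort_values(ascending=False).index.tolist()
by_strength = corr_with_price.abs().sort_values(ascending=False).index.tolist()
print(by_sign)      # 'age' comes last despite its perfect linear relationship
print(by_strength)  # 'age' comes first
```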


In [44]:
X_train_corr = X_train.loc[:, top_10_corr]
X_valid_corr = X_valid.loc[:, top_10_corr]
X_test_corr = X_test.loc[:, top_10_corr]

In [45]:
lasso_cv = LassoCV(cv=10)
lasso_cv.fit(X_train_corr, y_train)
write_metrics(lasso_cv, 'X_train_10_corr', X_train_corr, X_valid_corr, X_test_corr)

In [46]:
lasso_cv = LassoCV(cv=10)
lasso_cv.fit(X_train, y_train)
importance = permutation_importance(lasso_cv, X_train, y_train, n_repeats=10)
importance_mean = importance.importances_mean
sort_importance = np.argsort(importance_mean)[::-1]
top_10_importance = sort_importance.tolist()[:10]
top_10_importance

[0, 1, 7, 2, 13, 3, 11, 10, 8, 17]
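
Permutation importance can also be computed on rows the model has not seen, which measures how much each feature matters for generalization rather than for the fit itself. A sketch on synthetic data with a fixed `random_state`; the data, the `alpha` value and the 200/100 split are illustrative assumptions:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4))
# Column 0 has a strong effect, column 1 a weak one, columns 2 and 3 none.
y_toy = 5.0 * X_toy[:, 0] + 1.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=300)
X_fit, X_hold, y_fit, y_hold = X_toy[:200], X_toy[200:], y_toy[:200], y_toy[200:]

model = Lasso(alpha=0.01).fit(X_fit, y_fit)
# Shuffle one column at a time on the held-out rows and record the drop in R2.
result = permutation_importance(model, X_hold, y_hold, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking[:2])
```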

In [47]:
X_train_10_importance = X_train.iloc[:, top_10_importance]
X_valid_10_importance = X_valid.iloc[:, top_10_importance]
X_test_10_importance = X_test.iloc[:, top_10_importance]

In [48]:
lasso_cv = LassoCV(cv=10)
lasso_cv.fit(X_train_10_importance, y_train)
write_metrics(lasso_cv, 'X_train_10_importance', X_train_10_importance, X_valid_10_importance, X_test_10_importance)

In [49]:
lasso_cv = LassoCV(cv=10)
lasso_cv.fit(X_train, y_train)
explainer = shap.Explainer(lasso_cv, X_train)
shap_values = explainer(X_train)

In [50]:
importance_values = np.abs(shap_values.values).mean(axis=0)
top_10_snap = np.argsort(importance_values)[::-1][:10].tolist()
top_10_snap

[0, 1, 7, 2, 13, 3, 10, 11, 8, 4]
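
For a linear model, and under the assumption that features are independent, the SHAP value of feature j for a row x reduces to `coef_j * (x_j - mean_j)`. The sketch below computes that quantity directly with NumPy on synthetic data instead of calling `shap`, so it illustrates the idea behind the ranking rather than reproducing the library's output:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 4.0 * X_toy[:, 0] - 1.0 * X_toy[:, 2] + rng.normal(scale=0.1, size=200)
model = Lasso(alpha=0.01).fit(X_toy, y_toy)

# Per-row contribution of each feature relative to the average row.
contrib = model.coef_ * (X_toy - X_toy.mean(axis=0))
importance = np.abs(contrib).mean(axis=0)
order = np.argsort(importance)[::-1]
print(order)

# Contributions add up to the gap between each prediction and the mean prediction.
pred = model.predict(X_toy)
print(np.allclose(contrib.sum(axis=1) + pred.mean(), pred))
```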

In [51]:
X_train_10_snap = X_train.iloc[:, top_10_snap]
X_valid_10_snap = X_valid.iloc[:, top_10_snap]
X_test_10_snap = X_test.iloc[:, top_10_snap]

In [52]:
lasso_cv = LassoCV(cv=10)
lasso_cv.fit(X_train_10_snap, y_train)
write_metrics(lasso_cv, 'X_train_10_snap', X_train_10_snap, X_valid_10_snap, X_test_10_snap)

In [59]:
print(metrics)

                     name metric     train       valid         test
0           lasso_default     R2  0.603749  681.920140   993.645153
1           lasso_default    MAE  0.619118  683.922809   995.992775
2           lasso_default   RMSE  0.591943  694.276119  1008.083576
3       X_train_10_weight     R2  0.600161  684.848289   998.133997
4       X_train_10_weight    MAE  0.614983  687.842610  1001.385659
5       X_train_10_weight   RMSE  0.587363  697.708778  1013.724763
6         X_train_10_corr     R2  0.571026  710.923084  1033.859890
7         X_train_10_corr    MAE  0.582163  720.474016  1043.193529
8         X_train_10_corr   RMSE  0.552046  732.899287  1056.215368
9   X_train_10_importance     R2  0.600060  684.400541   998.259653
10  X_train_10_importance    MAE  0.614722  686.883612  1001.724218
11  X_train_10_importance   RMSE  0.587072  697.032927  1014.082464
12        X_train_10_snap     R2  0.599060  685.891285   999.506703
13        X_train_10_snap    MAE  0.614239  687.

In [58]:
print(metrics.pivot(columns='metric', index='name', values=['train', 'valid', 'test']))

                          train                           valid              \
metric                      MAE        R2      RMSE         MAE          R2   
name                                                                          
X_train_10_corr        0.582163  0.571026  0.552046  720.474016  710.923084   
X_train_10_importance  0.614722  0.600060  0.587072  686.883612  684.400541   
X_train_10_snap        0.614239  0.599060  0.587071  687.308995  685.891285   
X_train_10_weight      0.614983  0.600161  0.587363  687.842610  684.848289   
lasso_default          0.619118  0.603749  0.591943  683.922809  681.920140   

                                          test                            
metric                       RMSE          MAE           R2         RMSE  
name                                                                      
X_train_10_corr        732.899287  1043.193529  1033.859890  1056.215368  
X_train_10_importance  697.032927  1001.724218   998.259653  1014.0

### 7. Hyperparameter optimization

In [78]:
params = {
    'alpha': np.logspace(-2, 0, 50),  # logarithmic scale for alpha
    'l1_ratio': np.linspace(0.1, 1.0, 10)  # wider range for l1_ratio
}

model = ElasticNet(random_state=21)
grid = GridSearchCV(model, param_grid=params, cv=10, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_)

{'alpha': 0.29470517025518095, 'l1_ratio': 1.0}
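
The cost of the exhaustive search follows directly from the grid size. A short sketch that counts the candidates with scikit-learn's `ParameterGrid`, using the same grid and the same 10-fold setting as the cell above:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

params = {
    "alpha": np.logspace(-2, 0, 50),
    "l1_ratio": np.linspace(0.1, 1.0, 10),
}
n_candidates = len(ParameterGrid(params))
n_fits = n_candidates * 10  # 10-fold cross-validation, excluding the final refit
print(n_candidates, n_fits)
```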


In [79]:
random_search = RandomizedSearchCV(model, param_distributions=params, cv=10)
random_search.fit(X_train, y_train)
print(random_search.best_params_)

{'l1_ratio': 1.0, 'alpha': 0.9102981779915218}
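
Randomized search can also sample from continuous distributions instead of a fixed list of values, and a `random_state` makes the draw repeatable. A sketch on synthetic data, assuming `scipy.stats.loguniform` and `uniform` as the distributions; the ranges mirror the grid above, while `n_iter=20`, `cv=5` and the toy data are illustrative choices:

```python
import numpy as np
from scipy.stats import loguniform, uniform
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 5))
y_toy = X_toy @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=120)

distributions = {
    "alpha": loguniform(1e-2, 1e0),  # continuous, log-scaled range for alpha
    "l1_ratio": uniform(0.1, 0.9),   # uniform on [0.1, 1.0] (loc=0.1, scale=0.9)
}
search = RandomizedSearchCV(ElasticNet(random_state=21), distributions,
                            n_iter=20, cv=5, random_state=21)
search.fit(X_toy, y_toy)
print(search.best_params_)
```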


In [80]:
best_grid = grid.best_estimator_
write_metrics(best_grid, 'GridSearchCV', X_train, X_valid, X_test)

In [81]:
best_random = random_search.best_estimator_
write_metrics(best_random, 'RandomizedSearchCV', X_train, X_valid, X_test)

In [90]:
def objective(trial):
    alpha = trial.suggest_float('alpha', 1e-5, 1e1, log=True)
    l1_ratio = trial.suggest_float('l1_ratio', 0.0, 1.0)
    
    optuna_model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=21)
    score = cross_val_score(optuna_model, X_train, y_train, cv=10, scoring='r2').mean()
    
    return score

logging.getLogger('optuna').setLevel(logging.WARNING)
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100, show_progress_bar=True)

print(study.best_params)

{'alpha': 0.2760316739507387, 'l1_ratio': 0.9989878323145379}


In [92]:
opt_model = ElasticNet(random_state=21, alpha=study.best_params['alpha'], l1_ratio=study.best_params['l1_ratio'])
opt_model.fit(X_train, y_train)  # fit the tuned model before scoring it
write_metrics(opt_model, 'Optuna', X_train, X_valid, X_test)

In [99]:
print(metrics.iloc[-9:, :])

                  name metric     train       valid         test
15        GridSearchCV     R2  0.603876  682.245074   993.485985
16        GridSearchCV    MAE  0.619238  684.286567   995.835770
17        GridSearchCV   RMSE  0.592073  694.596368  1007.922898
18  RandomizedSearchCV     R2  0.603883  682.303873   993.477097
19  RandomizedSearchCV    MAE  0.619245  684.354688   995.826879
20  RandomizedSearchCV   RMSE  0.592080  694.659923  1007.913875
21              Optuna     R2  0.603761  681.935209   993.629661
22              Optuna    MAE  0.619131  683.939568   995.976788
23              Optuna   RMSE  0.591956  694.289610  1008.066584


In [102]:
metrics[metrics['metric'] == 'R2'].sort_values(by='test', ascending=True)

|    | name                  | metric | train    | valid      | test       |
|----|-----------------------|--------|----------|------------|------------|
| 18 | RandomizedSearchCV    | R2     | 0.603883 | 682.303873 | 993.477097 |
| 15 | GridSearchCV          | R2     | 0.603876 | 682.245074 | 993.485985 |
| 21 | Optuna                | R2     | 0.603761 | 681.935209 | 993.629661 |
| 0  | lasso_default         | R2     | 0.603749 | 681.92014  | 993.645153 |
| 3  | X_train_10_weight     | R2     | 0.600161 | 684.848289 | 998.133997 |
| 9  | X_train_10_importance | R2     | 0.60006  | 684.400541 | 998.259653 |
| 12 | X_train_10_snap       | R2     | 0.59906  | 685.891285 | 999.506703 |
| 6  | X_train_10_corr       | R2     | 0.571026 | 710.923084 | 1033.85989 |


In [103]:
metrics[metrics['metric'] == 'MAE'].sort_values(by='test', ascending=True)

|    | name                  | metric | train    | valid      | test        |
|----|-----------------------|--------|----------|------------|-------------|
| 19 | RandomizedSearchCV    | MAE    | 0.619245 | 684.354688 | 995.826879  |
| 16 | GridSearchCV          | MAE    | 0.619238 | 684.286567 | 995.83577   |
| 22 | Optuna                | MAE    | 0.619131 | 683.939568 | 995.976788  |
| 1  | lasso_default         | MAE    | 0.619118 | 683.922809 | 995.992775  |
| 4  | X_train_10_weight     | MAE    | 0.614983 | 687.84261  | 1001.385659 |
| 10 | X_train_10_importance | MAE    | 0.614722 | 686.883612 | 1001.724218 |
| 13 | X_train_10_snap       | MAE    | 0.614239 | 687.308995 | 1002.352189 |
| 7  | X_train_10_corr       | MAE    | 0.582163 | 720.474016 | 1043.193529 |


In [104]:
metrics[metrics['metric'] == 'RMSE'].sort_values(by='test', ascending=True)

|    | name                  | metric | train    | valid      | test        |
|----|-----------------------|--------|----------|------------|-------------|
| 20 | RandomizedSearchCV    | RMSE   | 0.59208  | 694.659923 | 1007.913875 |
| 17 | GridSearchCV          | RMSE   | 0.592073 | 694.596368 | 1007.922898 |
| 23 | Optuna                | RMSE   | 0.591956 | 694.28961  | 1008.066584 |
| 2  | lasso_default         | RMSE   | 0.591943 | 694.276119 | 1008.083576 |
| 5  | X_train_10_weight     | RMSE   | 0.587363 | 697.708778 | 1013.724763 |
| 11 | X_train_10_importance | RMSE   | 0.587072 | 697.032927 | 1014.082464 |
| 14 | X_train_10_snap       | RMSE   | 0.587071 | 696.615633 | 1014.082924 |
| 8  | X_train_10_corr       | RMSE   | 0.552046 | 732.899287 | 1056.215368 |
