## 1. Answer the questions from the introduction

- What is leave-one-out? Provide limitations and strengths.
- How do Grid Search, Randomized Grid Search, and Bayesian optimization work?
- Explain classification of feature selection methods. Explain how Pearson and Chi2 work. Explain how Lasso works. Explain what permutation significance is. Become familiar with SHAP.

1) LOO (leave-one-out) is a validation method, a special case of cross-validation in which 1 fold = 1 sample. The model is trained n times, each time on n-1 samples, and the validation metric is computed on the remaining one. Strengths: almost all the data is used for training, and the procedure involves no randomness. Limitations: it requires n model fits, which is expensive on large datasets, and the resulting estimate can have high variance.  
2) Grid Search tunes hyperparameters over a grid: every combination is evaluated, and the one with the best validation score is chosen.  
Random search: a distribution is specified for each continuous parameter, and values are sampled from it at random. In the same time budget, RS can cover more diverse combinations than GS.  
Bayesian optimization is a method for optimizing a function without differentiating it. It has two components: 
    - a probabilistic (surrogate) model that approximates the distribution of the objective's values from the evaluations made so far (Gaussian processes are a common choice)
    - an acquisition function, which uses statistics of the probabilistic model to indicate the next point at which to evaluate the objective, e.g. a(x) = mu(x) + b*std(x), where mu(x) and std(x) are the model's predicted mean and standard deviation and b is a weight. 
At each iteration, the acquisition function selects the next point at which to evaluate the original function.
Then the function is evaluated there, and the probabilistic model is updated.
3) Feature selection methods fall into groups:
    - for unsupervised learning  
    - for supervised learning:  
        - Wrappers: evaluate combinations of features
        - Filters: use statistics
        - Embedded: feature selection is built into model training   
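
As a rough illustration of leave-one-out, the sketch below scores a least-squares fit on a small synthetic dataset using NumPy only. The helper name `loo_mse` and the data are assumptions made for this example and are not part of the assignment.

```python
import numpy as np

def loo_mse(X, y):
    """Leave-one-out MSE of an ordinary least-squares fit (hypothetical helper)."""
    n = len(X)
    errors = []
    for i in range(n):
        train = np.arange(n) != i              # boolean mask: every sample except i
        # Fit on the n-1 training samples (intercept added as a column of ones)
        A = np.column_stack([np.ones(n - 1), X[train]])
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        # Score on the single held-out sample
        pred = coef[0] + X[i] @ coef[1:]
        errors.append((y[i] - pred) ** 2)
    return float(np.mean(errors))

rng = np.random.default_rng(21)
X = rng.normal(size=(30, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=30)
print(loo_mse(X, y))  # n = 30 model fits, one per held-out sample
```

The loop makes the cost visible: one fit per sample, which is why LOO is usually reserved for small datasets.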

Pearson correlation:  
$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

    If r is close to 1, there is a strong positive linear correlation; if it is close to -1, a strong negative one; values near 0 indicate no linear relationship.  
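
The formula can be checked term by term with a few lines of NumPy. This is a minimal sketch; `pearson_r` is a name chosen for the example, and the vectors are made up.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed directly from the definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))   # perfectly linear, increasing
print(pearson_r(x, [10, 8, 6, 4, 2]))   # perfectly linear, decreasing
print(pearson_r(x, [1, 4, 9, 16, 25]))  # monotonic but not linear, so r < 1
```

The last call shows the limitation that matters for feature filtering: r measures linear association only, so a strong nonlinear dependence can still produce a value below 1.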

Chi-squared test:  

$$
\chi^2 = \sum_{i=1}^{R}\sum_{j=1}^{C} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
$$

    A contingency table is built. O are the observed frequencies and E the expected ones under independence. If χ² is large, the features depend on each other (compare it against the critical value from a table, having first computed the degrees of freedom (rows-1)*(cols-1)).
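
A small worked example may help here. The sketch below computes the expected counts, the statistic, and the degrees of freedom for a hypothetical 2x2 table; the function name `chi2_statistic` and the counts are assumptions for illustration.

```python
import numpy as np

def chi2_statistic(observed):
    """Return (chi2, degrees of freedom) for an R x C contingency table."""
    O = np.asarray(observed, dtype=float)
    row_totals = O.sum(axis=1, keepdims=True)
    col_totals = O.sum(axis=0, keepdims=True)
    # Expected counts under independence: E_ij = row_i * col_j / total
    E = row_totals @ col_totals / O.sum()
    chi2 = float(((O - E) ** 2 / E).sum())
    dof = (O.shape[0] - 1) * (O.shape[1] - 1)
    return chi2, dof

# Hypothetical 2x2 table: rows = feature present / absent, columns = two classes
table = [[10, 20],
         [20, 10]]
chi2, dof = chi2_statistic(table)
print(chi2, dof)
```

Every expected count in this table is 15, so χ² = 4 · (5² / 15) ≈ 6.67 with 1 degree of freedom, which exceeds the 3.84 critical value at the 5% level.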

LASSO:
$$
\min_{\beta} \Big( \sum (y_i - \hat{y}_i)^2 + \lambda \sum |\beta_j| \Big)
$$

    The absolute values of the weights are added to the loss function, which constrains them. Because the L1 constraint region in weight space is a cross-polytope (a diamond in 2D) with corners on the coordinate axes, weights are often driven exactly to zero.
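
To see the zeroing effect concretely, here is a minimal coordinate-descent sketch for the objective above (no intercept, synthetic data). It is an illustration under those assumptions, not how sklearn's `Lasso` is implemented or parameterized; `lasso_cd` and `soft_threshold` are names chosen for the example.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t; values with |rho| <= t become exactly 0."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for sum((y - X b)^2) + lam * sum(|b|), no intercept."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution removed
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            z = X[:, j] @ X[:, j]
            # Closed-form one-dimensional minimizer of the objective in beta_j
            beta[j] = soft_threshold(rho, lam / 2) / z
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.01, size=200)  # only feature 0 matters

print(lasso_cd(X, y, lam=0.0))   # ordinary least squares: small nonzero noise weights
print(lasso_cd(X, y, lam=50.0))  # L1 penalty: irrelevant weights are exactly 0
```

The soft-thresholding step is where sparsity comes from: any coordinate whose correlation with the residual stays below `lam / 2` is set to exactly zero rather than merely made small.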

Permutation importance:  

    Train the model and measure its quality, then shuffle the values of one feature and see how much the quality drops.
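
The idea fits in a few lines on toy data. In the sketch below the "model" is a least-squares fit, the error metric is MSE, and the data are synthetic; all of these are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(21)
X = rng.normal(size=(500, 2))
y = 4.0 * X[:, 0] + rng.normal(scale=0.1, size=500)  # feature 1 is pure noise

# "Model": least-squares weights fitted once on the first 400 rows
w, *_ = np.linalg.lstsq(X[:400], y[:400], rcond=None)
X_val, y_val = X[400:], y[400:]

def mse(X_eval):
    return float(np.mean((y_val - X_eval @ w) ** 2))

base = mse(X_val)
importance = []
for j in range(X_val.shape[1]):
    X_perm = X_val.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link between feature j and y
    importance.append(mse(X_perm) - base)         # error increase = importance

print(importance)  # large for feature 0, near zero for feature 1
```

Note that the model is fitted only once; each feature costs just one extra prediction pass, which is what makes the method model-agnostic and relatively cheap.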

## 2. Introduction — do all the preprocessing from the previous lesson

- Read all the data.
- Preprocess the "Interest Level" feature.
- Create features: 'Elevator', 'HardwoodFloors', 'CatsAllowed', 'DogsAllowed', 'Doorman', 'Dishwasher', 'NoFee', 'LaundryinBuilding', 'FitnessCenter', 'Pre-War', 'LaundryinUnit', 'RoofDeck', 'OutdoorSpace', 'DiningRoom', 'HighSpeedInternet', 'Balcony', 'SwimmingPool', 'LaundryInBuilding', 'NewConstruction', 'Terrace'.

In [525]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

In [544]:
train = pd.read_json("data/train.json")
train.head(3)

Unnamed: 0,bathrooms,bedrooms,building_id,created,description,display_address,features,latitude,listing_id,longitude,manager_id,photos,price,street_address,interest_level
4,1.0,1,8579a0b0d54db803821a35a4a615e97a,2016-06-16 05:55:27,Spacious 1 Bedroom 1 Bathroom in Williamsburg!...,145 Borinquen Place,"[Dining Room, Pre-War, Laundry in Building, Di...",40.7108,7170325,-73.9539,a10db4590843d78c784171a107bdacb4,[https://photos.renthop.com/2/7170325_3bb5ac84...,2400,145 Borinquen Place,medium
6,1.0,2,b8e75fc949a6cd8225b455648a951712,2016-06-01 05:44:33,BRAND NEW GUT RENOVATED TRUE 2 BEDROOMFind you...,East 44th,"[Doorman, Elevator, Laundry in Building, Dishw...",40.7513,7092344,-73.9722,955db33477af4f40004820b4aed804a0,[https://photos.renthop.com/2/7092344_7663c19a...,3800,230 East 44th,low
9,1.0,2,cd759a988b8f23924b5a2058d5ab2b49,2016-06-14 15:19:59,**FLEX 2 BEDROOM WITH FULL PRESSURIZED WALL**L...,East 56th Street,"[Doorman, Elevator, Laundry in Building, Laund...",40.7575,7158677,-73.9625,c8b10a317b766204f08e613cef4ce7a0,[https://photos.renthop.com/2/7158677_c897a134...,3495,405 East 56th Street,medium


In [545]:
test = pd.read_json("data/test.json")

In [546]:
df_encoded = train.copy().reset_index(drop=True)
df_encoded['interest_level'] = LabelEncoder().fit_transform(train['interest_level'])

In [547]:
def clear_feature(l):
    symbs = set(["]", "[", "'", '"', " "])
    for i, word in enumerate(l):
        for symb in symbs:
            word = word.replace(symb, "")
        l[i] = word
    return l

In [548]:
res = []
for i, row in df_encoded.iterrows():
    cleared = clear_feature(row['features'])
    res.extend(cleared)

In [549]:
# number of unique values:
len(set(res))

1546

In [551]:
from collections import Counter

res_counter = Counter(res).most_common(20)
res_counter

[('Elevator', 25915),
 ('CatsAllowed', 23540),
 ('HardwoodFloors', 23527),
 ('DogsAllowed', 22035),
 ('Doorman', 20898),
 ('Dishwasher', 20426),
 ('NoFee', 18062),
 ('LaundryinBuilding', 16344),
 ('FitnessCenter', 13252),
 ('Pre-War', 9148),
 ('LaundryinUnit', 8738),
 ('RoofDeck', 6542),
 ('OutdoorSpace', 5268),
 ('DiningRoom', 5136),
 ('HighSpeedInternet', 4299),
 ('Balcony', 2992),
 ('SwimmingPool', 2730),
 ('LaundryInBuilding', 2593),
 ('NewConstruction', 2559),
 ('Terrace', 2283)]

In [552]:
res_labels = set(x[0] for x in res_counter)

In [553]:
df_encoded

Unnamed: 0,bathrooms,bedrooms,building_id,created,description,display_address,features,latitude,listing_id,longitude,manager_id,photos,price,street_address,interest_level
0,1.0,1,8579a0b0d54db803821a35a4a615e97a,2016-06-16 05:55:27,Spacious 1 Bedroom 1 Bathroom in Williamsburg!...,145 Borinquen Place,"[DiningRoom, Pre-War, LaundryinBuilding, Dishw...",40.7108,7170325,-73.9539,a10db4590843d78c784171a107bdacb4,[https://photos.renthop.com/2/7170325_3bb5ac84...,2400,145 Borinquen Place,2
1,1.0,2,b8e75fc949a6cd8225b455648a951712,2016-06-01 05:44:33,BRAND NEW GUT RENOVATED TRUE 2 BEDROOMFind you...,East 44th,"[Doorman, Elevator, LaundryinBuilding, Dishwas...",40.7513,7092344,-73.9722,955db33477af4f40004820b4aed804a0,[https://photos.renthop.com/2/7092344_7663c19a...,3800,230 East 44th,1
2,1.0,2,cd759a988b8f23924b5a2058d5ab2b49,2016-06-14 15:19:59,**FLEX 2 BEDROOM WITH FULL PRESSURIZED WALL**L...,East 56th Street,"[Doorman, Elevator, LaundryinBuilding, Laundry...",40.7575,7158677,-73.9625,c8b10a317b766204f08e613cef4ce7a0,[https://photos.renthop.com/2/7158677_c897a134...,3495,405 East 56th Street,2
3,1.5,3,53a5b119ba8f7b61d4e010512e0dfc85,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,[],40.7145,7211212,-73.9425,5ba989232d0489da1b5f2c45f6688adc,[https://photos.renthop.com/2/7211212_1ed4542e...,3000,792 Metropolitan Avenue,2
4,1.0,0,bfb9405149bfff42a92980b594c28234,2016-06-28 03:50:23,Over-sized Studio w abundant closets. Availabl...,East 34th Street,"[Doorman, Elevator, FitnessCenter, LaundryinBu...",40.7439,7225292,-73.9743,2c3b41f588fbb5234d8a1e885a436cfa,[https://photos.renthop.com/2/7225292_901f1984...,2795,340 East 34th Street,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49347,1.0,3,92bbbf38baadfde0576fc496bd41749c,2016-04-05 03:58:33,There is 700 square feet of recently renovated...,W 171 Street,"[Elevator, Dishwasher, HardwoodFloors]",40.8433,6824800,-73.9396,a61e21da3ba18c7a3d54cfdcc247e1f8,[https://photos.renthop.com/2/6824800_0682be16...,2800,620 W 171 Street,1
49348,1.0,2,5565db9b7cba3603834c4aa6f2950960,2016-04-02 02:25:31,"2 bedroom apartment with updated kitchen, rece...",Broadway,"[CommonOutdoorSpace, CatsAllowed, DogsAllowed,...",40.8198,6813268,-73.9578,8f90e5e10e8a2d7cf997f016d89230eb,[https://photos.renthop.com/2/6813268_1e6fcc32...,2395,3333 Broadway,2
49349,1.0,1,67997a128056ee1ed7d046bbb856e3c7,2016-04-26 05:42:03,No Brokers Fee * Never Lived 1 Bedroom 1 Bathr...,210 Brighton 15th St,"[DiningRoom, Elevator, Pre-War, LaundryinBuild...",40.5765,6927093,-73.9554,a10db4590843d78c784171a107bdacb4,[https://photos.renthop.com/2/6927093_93a52104...,1850,210 Brighton 15th St,2
49350,1.0,2,3c0574a740154806c18bdf1fddd3d966,2016-04-19 02:47:33,Wonderful Bright Chelsea 2 Bedroom apartment o...,West 21st Street,"[Pre-War, LaundryinUnit, Dishwasher, NoFee, Ou...",40.7448,6892816,-74.0017,c3cd45f4381ac371507090e9ffabea80,[https://photos.renthop.com/2/6892816_1a8d087a...,4195,350 West 21st Street,2


In [554]:
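# One binary indicator column per top-20 feature label, initialised to 0
df = pd.DataFrame(0, index=range(len(df_encoded)), columns=list(res_labels))

# Writing through the row yielded by iterrows() is not guaranteed to modify df,
# so set each cell explicitly with .at
for i, words in enumerate(df_encoded['features']):
    for word in words:
        if word in res_labels:
            df.at[i, word] = 1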

In [555]:
df = pd.concat([df_encoded[['bathrooms', 'bedrooms', 'interest_level']].reset_index(drop=True), df.reset_index(drop=True)], axis=1)
df.head()

Unnamed: 0,bathrooms,bedrooms,interest_level,LaundryinBuilding,Elevator,OutdoorSpace,FitnessCenter,Terrace,CatsAllowed,Dishwasher,...,NewConstruction,LaundryInBuilding,RoofDeck,HardwoodFloors,Balcony,LaundryinUnit,DogsAllowed,Pre-War,NoFee,SwimmingPool
0,1.0,1,2,1,0,0,0,0,1,1,...,0,0,0,1,0,0,1,1,0,0
1,1.0,2,1,1,1,0,0,0,0,1,...,0,0,0,1,0,0,0,0,1,0
2,1.0,2,2,1,1,0,0,0,0,1,...,0,0,0,1,0,1,0,0,0,0
3,1.5,3,2,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1.0,0,1,1,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [556]:
feature_list = list(df.columns)

## 3. Implement the next methods:

- Split data into 2 parts randomly with parameter test_size (ratio from 0 to 1), return training and test samples.
- Randomly split data into 3 parts with parameters validation_size and test_size, return train, validation and test samples.
- Split data into 2 parts with parameter date_split, return train and test samples split by date_split param.
- Split data into 3 parts with parameters validation_date and test_date, return train, validation and test samples split by input params.
- Make split procedure deterministic. What does it mean?

In [557]:
X = df
y = df_encoded['price']

In [445]:
def my_train_test(X, y, test_size=0.2, random_state=21):
    if test_size <= 0 or test_size >= 1:
        raise ValueError("test_size must be between 0 and 1")
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of samples")

    n = len(X)
    n_test = int(n*test_size)

    rng = np.random.default_rng(seed=random_state)
    indexes = np.arange(n)
    rng.shuffle(indexes)

    test_idx = indexes[:n_test]
    train_idx = indexes[n_test:]

    return X.iloc[train_idx], X.iloc[test_idx], y.iloc[train_idx], y.iloc[test_idx]

In [446]:
X_train, X_test, y_train, y_test = my_train_test(X, y)

len(X_train), len(X_test), len(y_train), len(y_test)

(39482, 9870, 39482, 9870)

In [447]:
def my_train_val_test(X, y, val_size=0.2, test_size=0.2, random_state=21):
    if test_size <= 0 or test_size >= 1:
        raise ValueError("test_size must be between 0 and 1")
    if val_size <= 0 or val_size >= 1:
        raise ValueError("val_size must be between 0 and 1")
    if val_size + test_size >= 1:
        raise ValueError("val_size + test_size must be less than 1")
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of samples")
    
    n = len(X)
    n_test = int(n*test_size)
    n_val = int(n*val_size)

    rng = np.random.default_rng(seed=random_state)
    indexes = np.arange(n)
    rng.shuffle(indexes)

    test_idx = indexes[:n_test]
    val_idx = indexes[n_test:n_test+n_val]
    train_idx = indexes[n_test+n_val:]

    return X.iloc[train_idx], X.iloc[val_idx], X.iloc[test_idx], y.iloc[train_idx], y.iloc[val_idx], y.iloc[test_idx]

In [448]:
X_train, X_val, X_test, y_train, y_val, y_test = my_train_val_test(X, y)

len(X_train), len(X_val), len(X_test), len(y_train), len(y_val), len(y_test)

(29612, 9870, 9870, 29612, 9870, 9870)

In [449]:
def my_data_train_test(X, y, date_col, date_split):
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of samples")
    
    sorted_idx = X[date_col].argsort()

    X_sorted = X.iloc[sorted_idx]
    y_sorted = y.iloc[sorted_idx]
    
    train_mask = X_sorted[date_col] <= date_split
    test_mask = ~train_mask

    return X_sorted[train_mask], X_sorted[test_mask], y_sorted[train_mask], y_sorted[test_mask]
    

In [450]:
# The feature frame from the previous project has no date column, so we lightly transform the original train set instead
X_tmp = train.copy()
y_tmp = X_tmp['price']
X_tmp['created'] = pd.to_datetime(X_tmp['created'])
X_tmp = X_tmp.drop(columns=['price'])

X_train, X_test, y_train, y_test = my_data_train_test(X_tmp, y_tmp, 'created', '2016-06-12 08:17:43')

len(X_train), len(X_test), len(y_train), len(y_test)

(39481, 9871, 39481, 9871)

In [451]:
def my_data_train_test(X, y, date_col, validation_date, test_date):
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of samples")
    # Equal dates would leave the validation set empty, so reject them too
    if validation_date >= test_date:
        raise ValueError("validation_date must be earlier than test_date")
    
    sorted_idx = X[date_col].argsort()

    X_sorted = X.iloc[sorted_idx]
    y_sorted = y.iloc[sorted_idx]
    
    train_mask = X_sorted[date_col] < validation_date
    val_mask = (X_sorted[date_col] >= validation_date) & (X_sorted[date_col] < test_date)
    test_mask = X_sorted[date_col] >= test_date

    return X_sorted[train_mask], X_sorted[val_mask], X_sorted[test_mask], y_sorted[train_mask], y_sorted[val_mask], y_sorted[test_mask]
    

In [452]:
X_train, X_val, X_test, y_train, y_val, y_test = my_data_train_test(X_tmp, y_tmp, 'created', '2016-05-25 01:10:42', '2016-06-12 08:17:43')

len(X_train), len(X_val), len(X_test), len(y_train), len(y_val), len(y_test)

(29610, 9870, 9872, 29610, 9870, 9872)

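Make split procedure deterministic. What does it mean?  
It means that the same inputs always produce the same split. The random splits seed their generator with random_state, so the same random_state always yields the same partition; the date-based splits involve no randomness, so they are deterministic by construction.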
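
A quick way to see the effect of the seed is to shuffle the same index range twice. The helper below is a hypothetical stand-in for the shuffling step inside `my_train_test`.

```python
import numpy as np

def shuffled_indices(n, random_state):
    """Permutation of range(n) driven only by the seed (hypothetical helper)."""
    rng = np.random.default_rng(seed=random_state)
    indexes = np.arange(n)
    rng.shuffle(indexes)
    return indexes

a = shuffled_indices(10, random_state=21)
b = shuffled_indices(10, random_state=21)
c = shuffled_indices(10, random_state=22)

print(np.array_equal(a, b))  # same seed: identical order
print(np.array_equal(a, c))  # different seed: almost surely a different order
```

Creating a local generator inside the function, rather than relying on global random state, keeps the result independent of whatever other code ran before the call.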

## 4. Implement the next cross-validation methods:

- K-Fold, where k is the input parameter, returns a list of train and test indices.
- Grouped K-Fold, where k and group_field are input parameters, returns list of train and test indices.
- Stratified K-fold, where k and stratify_field are input parameters, returns list of train and test indices.
- Time series split, where k and date_field are input parameters, returns list of train and test indices.

In [None]:
def my_KFold(X, y, k=5, random_state=21):
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of samples")
    
    rng = np.random.RandomState(seed=random_state)
    indexes = np.arange(len(X))
    rng.shuffle(indexes)

    folds = np.array_split(indexes, k)
    splits = []

    for i in range(k):
        test_idx = np.sort(folds[i])          
        train_idx = np.sort(np.hstack(folds[:i] + folds[i+1:]))
        splits.append((train_idx, test_idx))
        
    return splits

In [454]:
for train_idx, test_idx in my_KFold(X, y):
    print(len(train_idx), len(test_idx))


39481 9871
39481 9871
39482 9870
39482 9870
39482 9870


In [455]:
def my_GroupKFold(X, y, groups, k=5, random_state=21):
    if len(X) != len(y) or len(X) != len(groups):
        raise ValueError("X, y and groups must have the same number of samples")
    
    unique_groups = np.unique(groups)

    if len(unique_groups) < k:
        raise ValueError("Number of unique groups must be at least k")
    
    rng = np.random.RandomState(seed=random_state)
    rng.shuffle(unique_groups)

    folds = np.array_split(unique_groups, k)
    splits = []

    for i in range(k):
        test_groups = folds[i]
        train_groups = np.hstack(folds[:i] + folds[i+1:])

        test_idx = np.sort(np.where(np.isin(groups, test_groups))[0])
        train_idx = np.sort(np.where(np.isin(groups, train_groups))[0])
        splits.append((train_idx, test_idx))
        
    return splits

In [456]:
for train_idx, test_idx in my_GroupKFold(X_tmp, y_tmp, X_tmp['manager_id']):
    print(len(train_idx), len(test_idx))

39059 10293
41742 7610
39880 9472
39085 10267
37642 11710


In [457]:
def my_StratifiedKFold(X, y, stratify_field, k=5, random_state=21):
    if len(X) != len(y) or len(X) != len(stratify_field):
        raise ValueError("X, y and stratify_field must have the same number of samples")
    
    rng = np.random.RandomState(seed=random_state)
    stratify_field = np.array(stratify_field)

    class_indexes = {}
    for cls in np.unique(stratify_field):
        cls_idx = np.where(stratify_field == cls)[0]
        rng.shuffle(cls_idx)
        class_indexes[cls] = np.array_split(cls_idx, k)

    splits = []

    for i in range(k):
        test_idx = np.sort(np.hstack([class_indexes[cls][i] for cls in class_indexes]))
        train_idx = np.sort(np.hstack([
            np.hstack(class_indexes[cls][:i] + class_indexes[cls][i+1:]) for cls in class_indexes
        ]))
        splits.append((train_idx, test_idx))
    return splits

In [458]:
# for this test, y has to be made categorical

y_cat = pd.cut(y, [0, 1000, 10000, 100000, 10000000], labels=['1', '2', '3', '4'])
for train_idx, test_idx in my_StratifiedKFold(X, y_cat, stratify_field=y_cat):
    print(len(train_idx), len(test_idx))

39479 9873
39481 9871
39482 9870
39483 9869
39483 9869


In [459]:
def my_TimeSeriesSplit(X, y, date_field, k=5, random_state=21):
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of samples")
    
    sorted_indexes = np.array(X[date_field].reset_index(drop=True).argsort(kind='stable'))

    folds = np.array_split(sorted_indexes, k+1)
    splits = []

    for i in range(1, k+1):
        test_idx = folds[i]
        train_idx = np.hstack(folds[:i])
        splits.append((train_idx, test_idx))
        
    return splits

In [460]:
for train_idx, test_idx in my_TimeSeriesSplit(X_tmp, y_tmp, date_field='created'):
    print(len(train_idx), len(test_idx))

8226 8226
16452 8225
24677 8225
32902 8225
41127 8225


## 5. Cross-validation comparison

- Apply all the validation methods implemented above to our dataset. To apply Stratified algorithm you should preprocess target.
- Apply the appropriate methods from sklearn.
- Compare the resulting feature distributions for the training part of the dataset between sklearn and your implementation.
- Compare all validation schemes. Choose the best one. Explain your choice.

In [461]:
from sklearn.model_selection import KFold, GroupKFold, StratifiedKFold, TimeSeriesSplit

In [462]:
kf = KFold(n_splits=5, shuffle=True, random_state=21)
for (train1, test1), (train2, test2) in zip(my_KFold(X, y), kf.split(X, y)):
    print(np.array_equal(train1, train2), np.array_equal(test1, test2))

True True
True True
True True
True True
True True


In [463]:
kf = GroupKFold(n_splits=5, shuffle=True, random_state=21)
for (train1, test1), (train2, test2) in zip(my_GroupKFold(X_tmp, y_tmp, X_tmp['manager_id']), kf.split(X_tmp, y_tmp, X_tmp['manager_id'])):
    print(np.array_equal(train1, train2), np.array_equal(test1, test2))

True True
True True
True True
True True
True True


In [464]:
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=21)
for (train1, test1), (train2, test2) in zip(my_StratifiedKFold(X, y_cat, stratify_field=y_cat), kf.split(X, y_cat, y_cat)):
    print(np.array_equal(train1, train2), np.array_equal(test1, test2))

False False
False False
False False
False False
False False




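Unfortunately, I could not reproduce sklearn's splits exactly: its StratifiedKFold assigns samples to folds with a different internal procedure, so the same seed yields different indices. Both approaches are still meant to preserve the class proportions in every fold.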
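
Index-for-index equality is a strict criterion; a looser check is whether every test fold keeps the class proportions of the full dataset. The sketch below applies the same per-class shuffle-and-split idea as `my_StratifiedKFold` to hypothetical labels (the 90/10 class mix and the name `stratified_folds` are assumptions for illustration).

```python
import numpy as np

def stratified_folds(labels, k=5, random_state=21):
    """Assign each sample to one of k test folds, class by class (sketch)."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(random_state)
    fold_of = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]
        rng.shuffle(idx)
        # Spread this class's samples as evenly as possible over the k folds
        for fold, part in enumerate(np.array_split(idx, k)):
            fold_of[part] = fold
    return fold_of

# Hypothetical imbalanced target: 90% class "a", 10% class "b"
labels = np.array(["a"] * 900 + ["b"] * 100)
fold_of = stratified_folds(labels)

for fold in range(5):
    test_labels = labels[fold_of == fold]
    print(fold, len(test_labels), np.mean(test_labels == "b"))
```

If two implementations both pass this proportion check, they are interchangeable for validation purposes even when their indices differ.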

In [465]:
idx = X_tmp['created'].argsort(kind='stable')
X_tmp_sorted = X_tmp.iloc[idx]
y_tmp_sorted = y_tmp.iloc[idx]

kf = TimeSeriesSplit(n_splits=5)
for (train1, test1), (train2, test2) in zip(my_TimeSeriesSplit(X_tmp_sorted, y_tmp_sorted, date_field='created'), kf.split(X_tmp_sorted, y_tmp_sorted)):
    print(np.array_equal(train1, train2), np.array_equal(test1, test2))

False False
True True
True True
True True
True True


Good: only the first folds differ, because the two implementations distribute the samples left over from the division by k+1 differently.
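
The mismatch in the first fold comes down to integer arithmetic, which can be reproduced without any data. The sketch below assumes sklearn's default `TimeSeriesSplit` behaviour (no `test_size` or `gap` set), where each test fold has n // (k + 1) samples.

```python
import numpy as np

n, k = 49352, 5  # number of rows in train and number of splits used above

# my_TimeSeriesSplit: np.array_split gives the leftover samples to the first chunks
mine = [len(chunk) for chunk in np.array_split(np.arange(n), k + 1)]

# sklearn's TimeSeriesSplit (default test_size): every test fold has n // (k + 1)
# samples, and the leftover samples stay in the first training fold
test_size = n // (k + 1)
first_train = n - k * test_size

print(mine)                     # chunk sizes: the first chunk is the initial train set
print(first_train, test_size)   # sklearn: initial train size, test fold size
```

Both schemes reach the same boundary after the first test fold (8226 + 8226 = 8227 + 8225 = 16452), which is why folds two through five match exactly.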

Summary:  
Plain K-Fold suits our task best. However, if the distribution of the target variable has a long tail, stratified splitting is preferable, after first binning the target into classes.

## 6. Feature Selection

- Fit a Lasso regression model with normalized features. Use your method for splitting samples into 3 parts by field created with 60/20/20 ratio — train/validation/test.
- Sort features by weight coefficients from model, fit model to top 10 features and compare quality.
- Implement method for simple feature selection by nan-ratio in feature and correlation. Apply this method to feature set and take top 10 features, refit model and measure quality.
- Implement permutation importance method and take top 10 features, refit model and measure quality.
- Import Shap and also refit model on top 10 features.
- Compare the quality of these methods for different aspects — speed, metrics and stability.

In [558]:
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import r2_score

In [559]:
X_train, X_val, X_test, y_train, y_val, y_test = my_train_val_test(X, y, val_size=0.2, test_size=0.2)

In [564]:
scaler = MinMaxScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_val_scaled = pd.DataFrame(scaler.transform(X_val), columns=X_train.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_train.columns)


In [565]:
lasso_0 = Lasso(random_state=21)
lasso_0.fit(X_train_scaled, y_train)
y_pred = lasso_0.predict(X_val_scaled)

r2_score(y_val, y_pred)

0.4016848532965548

In [566]:
top_10 = pd.Series(np.abs(lasso_0.coef_)).sort_values(ascending=False).head(10).index
top_10

Index([0, 1, 10, 14, 6, 3, 4, 19, 18, 21], dtype='int64')

In [567]:
X.columns[top_10]

Index(['bathrooms', 'bedrooms', 'Doorman', 'LaundryInBuilding',
       'FitnessCenter', 'LaundryinBuilding', 'Elevator', 'DogsAllowed',
       'LaundryinUnit', 'NoFee'],
      dtype='object')

In [568]:
X_train_scaled_top = X_train_scaled.iloc[:, top_10]
X_val_scaled_top = X_val_scaled.iloc[:, top_10]
X_test_scaled_top = X_test_scaled.iloc[:, top_10]

In [569]:
lasso = Lasso(random_state=21)
lasso.fit(X_train_scaled_top, y_train)
y_pred = lasso.predict(X_val_scaled_top)

r2_score(y_val, y_pred)

0.4035558099985217

In [570]:
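def simple_selection(X, y, nan_threshold=0.2, top_k=10):
    # Keep only columns whose share of missing values is below the threshold
    clear_columns = X.columns[X.isna().mean() < nan_threshold]
    # Rank the remaining columns by absolute Pearson correlation with the target
    top_columns = X[clear_columns].corrwith(y).abs().sort_values(ascending=False).head(top_k).index
    return top_columns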



top_columns = simple_selection(X_train, y_train)
print(top_columns)

X_train_selected1 = X_train_scaled[top_columns]
X_val_selected1 = X_val_scaled[top_columns]
X_test_selected1 = X_test_scaled[top_columns]

lasso_1 = Lasso(random_state=21)
lasso_1.fit(X_train_selected1, y_train)
y_pred = lasso_1.predict(X_val_selected1)

r2_score(y_val, y_pred)

Index(['bathrooms', 'bedrooms', 'Doorman', 'Elevator', 'LaundryinUnit',
       'DiningRoom', 'DogsAllowed', 'CatsAllowed', 'Terrace', 'FitnessCenter'],
      dtype='object')


0.41296844733292615

In [571]:
def permutation_importance(model, X_train, X_val, y_train, y_val, top_k=10, random_state=21):
    np.random.seed(random_state)

    model.fit(X_train, y_train)
    y_pred_base = model.predict(X_val)
    base_score = r2_score(y_val, y_pred_base)

    importance = {}

    for col in X_val.columns:
        X_val_cp = X_val.copy()
        X_val_cp[col] = np.random.permutation(X_val_cp[col])

        y_pred_permuted = model.predict(X_val_cp)
        permuted_score = r2_score(y_val, y_pred_permuted)

        diff = base_score - permuted_score

        importance[col] = diff

    sorted_importance = sorted(importance.items(), reverse=True, key=lambda item: item[1])
    top_k_cols = list(map(lambda x: x[0], sorted_importance[:top_k]))
    return top_k_cols


lasso_2 = Lasso()
top_columns = permutation_importance(lasso_2, X_train_scaled,  X_val_scaled, y_train, y_val, top_k=10)
print(top_columns)

X_train_selected2 = X_train_scaled[top_columns]
X_val_selected2 = X_val_scaled[top_columns]
X_test_selected2 = X_test_scaled[top_columns]

lasso_2 = Lasso(random_state=21)
lasso_2.fit(X_train_selected2, y_train)
y_pred = lasso_2.predict(X_val_selected2)

r2_score(y_val, y_pred)


['bathrooms', 'Doorman', 'bedrooms', 'FitnessCenter', 'LaundryinUnit', 'LaundryinBuilding', 'Elevator', 'LaundryInBuilding', 'DogsAllowed', 'CatsAllowed']


0.4036657635053378

In [574]:
import shap

lasso_3 = Lasso()
lasso_3.fit(X_train_scaled, y_train)

explainer = shap.LinearExplainer(lasso_3, X_train_scaled)
shap_values = explainer(X_val_scaled)
top_cols = np.argsort(np.abs(shap_values.values).sum(axis=0))[-10:]
top_cols = X.columns[top_cols]
print(top_cols)


X_train_selected3 = X_train_scaled[top_cols]
X_val_selected3 = X_val_scaled[top_cols]
X_test_selected3 = X_test_scaled[top_cols]

lasso_3.fit(X_train_selected3, y_train)
y_pred = lasso_3.predict(X_val_selected3)
r2_score(y_val, y_pred)

Index(['LaundryinUnit', 'HardwoodFloors', 'NoFee', 'DogsAllowed', 'Elevator',
       'LaundryinBuilding', 'FitnessCenter', 'bedrooms', 'bathrooms',
       'Doorman'],
      dtype='object')


0.40324559228566703

In [573]:
%%timeit
# Measure the execution time of each method separately

pd.Series(np.abs(lasso_0.coef_)).sort_values(ascending=False).head(10).index

127 μs ± 26.7 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [575]:
%%timeit
simple_selection(X_train, y_train)

22.7 ms ± 764 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [576]:
%%timeit
permutation_importance(lasso_2, X_train_scaled,  X_val_scaled, y_train, y_val, top_k=10)

120 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [577]:
lasso_3 = Lasso()
lasso_3.fit(X_train_scaled, y_train)

Lasso()

In [578]:
%%timeit

explainer = shap.LinearExplainer(lasso_3, X_train_scaled)
shap_values = explainer(X_val_scaled)
top_cols = np.argsort(np.abs(shap_values.values).sum(axis=0))[-10:]
top_cols = X.columns[top_cols]


341 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


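Summary:  
the fastest selection method is simple selection (22.7 ms); ranking the Lasso coefficients takes only 127 μs, but that timing excludes fitting the model   
the most stable method is SHAP  
the best validation R² comes from simple selection (0.413)  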
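
One way to put a number on stability is the Jaccard overlap between top-k sets. Ideally it would be computed across repeated runs of one method on different seeds or resamples, which this notebook does not do. As a rough sketch, the code below applies the same measure to the three top-10 lists printed above, so it shows agreement between methods rather than stability of a single one. The helper names are assumptions for illustration.

```python
from itertools import combinations

def jaccard(a, b):
    """Overlap of two feature sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def selection_stability(selections):
    """Mean pairwise Jaccard similarity over several top-k feature lists."""
    pairs = list(combinations(selections, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Top-10 lists printed above for Lasso weights, permutation importance and SHAP
lasso_top = ['bathrooms', 'bedrooms', 'Doorman', 'LaundryInBuilding', 'FitnessCenter',
             'LaundryinBuilding', 'Elevator', 'DogsAllowed', 'LaundryinUnit', 'NoFee']
perm_top = ['bathrooms', 'Doorman', 'bedrooms', 'FitnessCenter', 'LaundryinUnit',
            'LaundryinBuilding', 'Elevator', 'LaundryInBuilding', 'DogsAllowed', 'CatsAllowed']
shap_top = ['LaundryinUnit', 'HardwoodFloors', 'NoFee', 'DogsAllowed', 'Elevator',
            'LaundryinBuilding', 'FitnessCenter', 'bedrooms', 'bathrooms', 'Doorman']

print(jaccard(lasso_top, perm_top))
print(selection_stability([lasso_top, perm_top, shap_top]))
```

To measure stability proper, `selection_stability` could be fed the top-k lists that one selector returns on several different train/validation splits.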


## 7. Hyperparameter optimization

- Implement grid search and random search methods for alpha and l1_ratio for sklearn's ElasticNet model.
- Find the best combination of model hyperparameters.
- Fit the resulting model.
- Import optuna and configure the same experiment with ElasticNet.
- Estimate metrics and compare approaches.
- Run optuna on one of the cross-validation schemes.

In [579]:
from sklearn.linear_model import ElasticNet

In [584]:
#GS

alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
l1_ratios = [0.1, 0.3, 0.5, 0.7, 0.9, 1.0]

best_score = float("-inf")
best_params = None

for alpha in alphas:
    for l1_ratio in l1_ratios:
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_val_scaled)

        score = r2_score(y_val, y_pred)

        if score > best_score:
            best_score = score
            best_params = {"alpha": alpha, "l1_ratio": l1_ratio}

print(f"best score = {best_score}")
print(f"best params : {best_params}")


best score = 0.4096162045952626
best params : {'alpha': 10.0, 'l1_ratio': 1.0}


In [586]:
# RS

best_score = float("-inf")
best_params = None

for _ in range(30):
    # alpha is sampled log-uniformly from [1e-3, 10], l1_ratio uniformly from [0, 1]
    alpha = 10**np.random.uniform(-3, 1)
    l1_ratio = np.random.uniform(0, 1)
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_val_scaled)

    score = r2_score(y_val, y_pred)

    if score > best_score:
        best_score = score
        best_params = {"alpha": alpha, "l1_ratio": l1_ratio}

print(f"best score = {best_score}")
print(f"best params : {best_params}")



best score = 0.39639695692883126
best params : {'alpha': 0.0028023953199331217, 'l1_ratio': 0.8564793123158471}
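
The random search above draws from NumPy's global random state without a seed, so its result changes from run to run. A minimal sketch of a reproducible sampler follows; `sample_params` is a hypothetical helper, and the ranges simply mirror the ones used in the loop.

```python
import numpy as np

def sample_params(n_trials, random_state=21):
    """Draw (alpha, l1_ratio) candidates reproducibly (hypothetical helper)."""
    rng = np.random.default_rng(random_state)
    # Log-uniform alpha in [1e-3, 10]: uniform in the exponent, as in the loop above
    alphas = 10 ** rng.uniform(-3, 1, size=n_trials)
    l1_ratios = rng.uniform(0, 1, size=n_trials)
    return list(zip(alphas, l1_ratios))

candidates = sample_params(30)
print(candidates[:3])

# Roughly a quarter of the draws should land in each decade of alpha
decades = np.floor(np.log10([a for a, _ in candidates])).astype(int)
print(np.unique(decades, return_counts=True))
```

Sampling the exponent rather than alpha itself matters here: a plain uniform draw over [0.001, 10] would put about 90% of the candidates above 1 and barely explore the small-alpha region.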


In [588]:
model = ElasticNet(alpha=10, l1_ratio=1).fit(X_train_scaled, y_train)

In [602]:
# optuna
import optuna

def objective(trial):
    alpha = trial.suggest_float("alpha", 1e-3, 10, log=True)
    l1_ratio = trial.suggest_float("l1_ratio", 0, 1)

    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_val_scaled)
    return r2_score(y_val, y_pred)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)


print(f"best score = {study.best_value}")
print(f"best params : {study.best_params}")


[I 2025-09-10 02:10:21,976] A new study created in memory with name: no-name-9e2b391f-5218-41ac-bc1f-cef6dc5ac517
[I 2025-09-10 02:10:22,031] Trial 0 finished with value: 0.15575893469386082 and parameters: {'alpha': 0.2750047549938825, 'l1_ratio': 0.5355991123586312}. Best is trial 0 with value: 0.15575893469386082.
[I 2025-09-10 02:10:22,142] Trial 1 finished with value: 0.3696279568362255 and parameters: {'alpha': 0.005308529566827734, 'l1_ratio': 0.441828743036598}. Best is trial 1 with value: 0.3696279568362255.
[I 2025-09-10 02:10:22,245] Trial 2 finished with value: 0.3935506105699489 and parameters: {'alpha': 0.0016690283619973404, 'l1_ratio': 0.6010147052603093}. Best is trial 2 with value: 0.3935506105699489.
[I 2025-09-10 02:10:22,259] Trial 3 finished with value: 0.16795224666594222 and parameters: {'alpha': 0.34151141203130997, 'l1_ratio': 0.6925234362168898}. Best is trial 2 with value: 0.3935506105699489.
[I 2025-09-10 02:10:22,270] Trial 4 finished with value: 0.1064176

best score = 0.39881446072460336
best params : {'alpha': 0.013484189312570602, 'l1_ratio': 0.9882409146019352}


In [603]:
# cv optuna
from sklearn.model_selection import cross_val_score, KFold


def objective(trial):
    alpha = trial.suggest_float("alpha", 1e-3, 10, log=True)
    l1_ratio = trial.suggest_float("l1_ratio", 0, 1)

    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
    kf = KFold(n_splits=5, random_state=21, shuffle=True)
    scores = cross_val_score(model, X_train_scaled, y_train, cv=kf, scoring="r2")
    return np.mean(scores)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)


print(f"best score = {study.best_value}")
print(f"best params : {study.best_params}")

[I 2025-09-10 02:10:37,266] A new study created in memory with name: no-name-6e812982-f7c1-4376-b7c3-606113556f62
[I 2025-09-10 02:10:37,568] Trial 0 finished with value: 0.05680215754720939 and parameters: {'alpha': 0.002367984038468648, 'l1_ratio': 0.39230723775232745}. Best is trial 0 with value: 0.05680215754720939.
[I 2025-09-10 02:10:37,746] Trial 1 finished with value: 0.0541042797536059 and parameters: {'alpha': 0.005609886841440991, 'l1_ratio': 0.2749811498456607}. Best is trial 0 with value: 0.05680215754720939.
[I 2025-09-10 02:10:38,283] Trial 2 finished with value: 0.057867347092354124 and parameters: {'alpha': 0.0016677217254864132, 'l1_ratio': 0.6746002421691982}. Best is trial 2 with value: 0.057867347092354124.
[I 2025-09-10 02:10:38,400] Trial 3 finished with value: 0.051624728461037425 and parameters: {'alpha': 0.04071381634522204, 'l1_ratio': 0.8196041358809014}. Best is trial 2 with value: 0.057867347092354124.
[I 2025-09-10 02:10:38,686] Trial 4 finished with valu

best score = 0.05837537204782135
best params : {'alpha': 0.06478510546728673, 'l1_ratio': 0.999728612327941}
