# Зачётное задание. **Тетерин Т.**

В рамках данной работы вам будет предложено самостоятельно выполнить те этапы машинного обучения, которые были изучены на парах.

Работа разбита на несколько заданий. Каждое задание наобходимо выполнять последовательно, комментируя все выполняемые действия в текстовых ячейках. При крайней необходимости, возможно вставить комментарий непосредственно в коде. Комментарии должны быть лаконичными, но при этом полными.


**Обратите внимание!**
1. Работы, в которых будет выявлен плагиат, оцениваются в **0 баллов**.
2. Перед проверкой работы по существу, запускается выполнение всех ячеек ноутбука (Среда выполнения -> Выполнить всё). Если хотя бы одна ячейка выполнится с ошибкой, то за работу выставляется **0 баллов**.
3. Если не указано иное, считать, что все необходимые файлы расположены в одной папке с ноутбуком.
4. Данный ноутбук является шаблоном для вашей работы. Пожалуйста, не удаляйте ячейки, которые были в исходном ноутбуке.



## Описание задания

Вам необходимо на основе данных о бронировании гостиниц обучить модель машинного обучения, которая предсказывает, будет ли отменена бронь. Описание колонок:

* `Booking_ID` – уникальный идентификатор бронирования
* `no_of_adults` - кол-во взрослых
* `no_of_children` – кол-во детей
* `no_of_weekend_nights` – кол-во ночей в выходные, которые входят в бронирование
* `no_of_week_nights` – кол-во ночей в будние дни, которые входят в бронирование
* `type_of_meal_plan` – тип питания
* `required_car_parking_space` – необходима ли парковка?
* `room_type_reserved` – тип зарезервированного номера
* `lead_time` – кол-во дней между датой бронирования и датой прибытия
* `arrival_year` – год прибытия
* `arrival_month` – месяц прибытия
* `arrival_date` – день прибытия
* `market_segment_type` – маркетинговый сегмент
* `repeated_guest` – является ли клиент постоянным гостем?
* `no_of_previous_cancellations` – кол-во отменённых бронирований перед текущим
* `no_of_previous_bookings_not_canceled` – кол-во предыдущих бронирований, которые не были отменены
* `avg_price_per_room` – средняя стоимость бронирования (в евро)
* `no_of_special_requests` – кол-во специальных условий в бронировании (например, кондиционер, номер на первом этаже, поздний заезд)
* `booking_status` – было ли бронирование отменено?


In [1]:
import pandas as pd
import numpy as np
import itertools
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import SGDRegressor, Ridge, Lasso
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, GradientBoostingClassifier
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

import time
start_time = time.time()

## Задание 1

Прочитайте файл датасета `hotel_reservations.csv` в объект DataFrame, выведите первые 10 строк и последние 10 строк. Выведите информацию об объекте DataFrame.

In [2]:
df = pd.read_csv('hotel_reservations.csv', sep=',', low_memory=False)

In [3]:
df.head(10)

Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
0,INN00001,2.0,0,1,2,Meal Plan 1,0,Room_Type 1,224,2017,10,2,Offline,0,0,0,65.0,0,Not_Canceled
1,INN00002,2.0,0,2,3,Not Selected,0,Room_Type 1,5,2018,11,6,Online,0,0,0,106.68,1,Not_Canceled
2,INN00003,1.0,0,2,1,Meal Plan 1,0,Room_Type 1,1,2018,2,28,Online,0,0,0,60.0,0,Canceled
3,INN00004,2.0,0,0,2,Meal Plan 1,0,Room_Type 1,211,2018,5,20,Online,0,0,0,100.0,0,Canceled
4,INN00005,2.0,0,1,1,Not Selected,0,Room_Type 1,48,2018,4,11,Online,0,0,0,94.5,0,Canceled
5,INN00006,2.0,0,0,2,Meal Plan 2,0,Room_Type 1,346,2018,9,13,Online,0,0,0,115.0,1,Canceled
6,INN00007,2.0,0,1,3,Meal Plan 1,0,Room_Type 1,34,2017,10,15,Online,0,0,0,107.55,1,Not_Canceled
7,INN00008,2.0,0,1,3,Meal Plan 1,0,Room_Type 4,83,2018,12,26,Online,0,0,0,105.61,1,Not_Canceled
8,INN00009,3.0,0,0,4,Meal Plan 1,0,Room_Type 1,121,2018,7,6,Offline,0,0,0,96.9,1,Not_Canceled
9,INN00010,2.0,0,0,5,Meal Plan 1,0,Room_Type 4,44,2018,10,18,Online,0,0,0,133.44,3,Not_Canceled


In [4]:
df.tail(10)

Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
36271,INN36267,2.0,0,2,2,Meal Plan 1,0,Room_Type 2,8,2018,3,4,Online,0,0,0,85.96,1,Canceled
36272,INN36268,2.0,0,1,0,Not Selected,0,Room_Type 1,49,2018,7,11,Online,0,0,0,93.15,0,Canceled
36273,INN36269,1.0,0,0,3,Meal Plan 1,0,Room_Type 1,166,2018,11,1,Offline,0,0,0,110.0,0,Canceled
36274,INN36270,2.0,2,0,1,Meal Plan 1,0,Room_Type 6,0,2018,10,6,Online,0,0,0,216.0,0,Canceled
36275,INN36271,3.0,0,2,6,Meal Plan 1,0,Room_Type 4,85,2018,8,3,Online,0,0,0,167.8,1,Not_Canceled
36276,INN36272,2.0,0,1,3,Meal Plan 1,0,Room_Type 1,228,2018,10,17,Online,0,0,0,90.95,2,Canceled
36277,INN36273,2.0,0,2,6,Meal Plan 1,0,Room_Type 1,148,2018,7,1,Online,0,0,0,98.39,2,Not_Canceled
36278,INN36274,2.0,0,0,3,Not Selected,0,Room_Type 1,63,2018,4,21,Online,0,0,0,94.5,0,Canceled
36279,INN36275,2.0,0,1,2,Meal Plan 1,0,Room_Type 1,207,2018,12,30,Offline,0,0,0,161.67,0,Not_Canceled
36280,INN35327,2.0,0,1,0,Not Selected,0,Room_Type 1,69,2018,9,12,Online,0,0,0,125.1,1,Not_Canceled


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36281 entries, 0 to 36280
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Booking_ID                            36281 non-null  object 
 1   no_of_adults                          36278 non-null  float64
 2   no_of_children                        36281 non-null  int64  
 3   no_of_weekend_nights                  36281 non-null  int64  
 4   no_of_week_nights                     36281 non-null  int64  
 5   type_of_meal_plan                     36281 non-null  object 
 6   required_car_parking_space            36281 non-null  int64  
 7   room_type_reserved                    36277 non-null  object 
 8   lead_time                             36281 non-null  int64  
 9   arrival_year                          36281 non-null  object 
 10  arrival_month                         36281 non-null  int64  
 11  arrival_date   

## Задание 2

В датасете могут быть проблемы с данными. Проверьте набор данных:

* на некорректные значения,
* на пропуски,
* на дубликаты (частичные и полные).

При наличии проблем, исправьте их.

*при наличии пропусков их обязательно заполнять, удаление строк или столбцов с пропусками запрещено.

Проверка значений на пропуски

In [6]:
df['arrival_year'] = pd.to_numeric(df['arrival_year'], errors='coerce')

In [7]:
columns_with_skips = []
for df_column in df.columns:
    count_isna_cells = pd.isna(df[df_column]).sum()
    if count_isna_cells > 0:
        columns_with_skips.append(df_column)
        print(f'Колонка "{df_column}" имеет ошибки ({count_isna_cells})')

Колонка "no_of_adults" имеет ошибки (3)
Колонка "room_type_reserved" имеет ошибки (4)
Колонка "arrival_year" имеет ошибки (4)


In [8]:
# Проверим номера в которых никого не заселяется
df.query("no_of_children == 0 and no_of_adults == 0")

Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status


In [9]:
# Дети без взрослых
display(df.query("no_of_children > 0 and no_of_adults == 0"))

# Некоторые дети могут ездить по доверенностям, однако здесь возраст детей неизвестен
# Будем считать что у таких детей есть хотя бы 1 взрослый или 1 сопровождающий

df.loc[df.eval("no_of_children > 0 and no_of_adults == 0"), 'no_of_adults'] = 1

Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
32,INN00033,0.0,2,0,3,Meal Plan 1,0,Room_Type 2,56,2018.0,12,7,Online,0,0,0,82.44,1,Not_Canceled
287,INN00288,0.0,2,2,2,Meal Plan 1,0,Room_Type 1,68,2018.0,4,24,Online,0,0,0,108.38,1,Canceled
653,INN00654,0.0,2,1,2,Meal Plan 1,0,Room_Type 2,78,2018.0,8,19,Online,0,0,0,115.68,1,Not_Canceled
937,INN00938,0.0,2,0,3,Meal Plan 1,0,Room_Type 2,40,2018.0,1,14,Online,0,0,0,6.67,1,Not_Canceled
954,INN00955,0.0,2,1,1,Meal Plan 1,0,Room_Type 2,92,2018.0,10,29,Online,0,0,0,81.50,2,Not_Canceled
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34725,INN34721,0.0,2,0,3,Meal Plan 1,0,Room_Type 2,76,2018.0,9,21,Online,0,0,0,127.38,3,Not_Canceled
34735,INN34731,0.0,2,1,1,Meal Plan 1,0,Room_Type 2,178,2018.0,8,27,Online,0,0,0,88.77,0,Canceled
34895,INN34891,0.0,2,2,2,Meal Plan 1,0,Room_Type 2,31,2018.0,9,16,Online,0,0,0,124.25,2,Not_Canceled
35696,INN35692,0.0,2,2,1,Meal Plan 1,0,Room_Type 2,75,2018.0,3,19,Online,0,0,0,78.00,0,Canceled


In [10]:
# Поскольку в большинстве случаев предполагается заполнение ячейки да/нет
# есть строка где это не так
display(df.query("required_car_parking_space != 0 and required_car_parking_space != 1"))

# Будем считать что им было необходимо парковочное место

df.loc[df.eval("required_car_parking_space != 0 and required_car_parking_space != 1"), 'required_car_parking_space'] = 1

Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
21993,INN00033,1.0,2,0,3,Meal Plan 1,4,Room_Type 2,56,2018.0,12,7,Online,0,0,0,82.44,2,Not_Canceled


In [11]:
# Корректность заполнения месяцев
display(df.query("arrival_month > 12 or arrival_month < 1"))

# Корректность заполнения чисел
display(df.query("arrival_date > 31 or arrival_date < 1"))

# Проверка значения "является ли гость постоянником"
display(df.query("repeated_guest != 0 and repeated_guest != 1"))

Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status


Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status


Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status


In [12]:
df[df.isna().any(axis=1)]

Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
4985,INN04985,2.0,0,0,2,Meal Plan 1,0,Room_Type 1,37,,11,17,Online,0,0,0,104.0,2,Not_Canceled
8082,INN08082,1.0,0,1,0,Meal Plan 1,0,Room_Type 1,0,,1,25,Corporate,0,0,0,79.0,0,Not_Canceled
20215,INN20214,2.0,1,1,2,Meal Plan 1,0,Room_Type 1,9,,6,17,Offline,0,0,0,95.0,1,Not_Canceled
20529,INN20528,1.0,0,0,3,Meal Plan 1,0,Room_Type 1,174,,9,22,Offline,0,0,0,95.67,0,Not_Canceled
20577,INN20576,,0,0,2,Meal Plan 1,0,Room_Type 4,20,2018.0,6,28,Online,0,0,0,156.0,1,Canceled
21027,INN21026,,0,0,3,Meal Plan 1,0,Room_Type 1,279,2018.0,10,12,Offline,0,0,0,110.0,0,Canceled
21506,INN21505,,0,1,4,Meal Plan 1,0,Room_Type 1,35,2018.0,8,29,Online,0,0,0,111.42,2,Canceled
21981,INN21980,2.0,0,0,2,Meal Plan 1,0,,151,2018.0,1,19,Offline,0,0,0,86.5,0,Not_Canceled
22417,INN22415,2.0,0,0,2,Meal Plan 2,0,,346,2018.0,9,13,Offline,0,0,0,115.0,1,Canceled
22807,INN22805,2.0,0,0,2,Meal Plan 1,0,,74,2017.0,10,28,Online,0,0,0,89.25,1,Not_Canceled


In [13]:
df['no_of_adults'] = df['no_of_adults'].fillna(df['no_of_adults'].mode()[0])
df['room_type_reserved'] = df['room_type_reserved'].fillna(df['room_type_reserved'].mode()[0])
df['arrival_year'] = df['arrival_year'].fillna(df['arrival_year'].mode()[0])

Проверка значений на полные дубликаты

In [14]:
df[df.duplicated(keep = False)].sort_values('Booking_ID')

Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
4524,INN04524,2.0,2,1,3,Meal Plan 1,0,Room_Type 6,73,2018.0,8,8,Online,0,0,0,207.9,0,Canceled
30373,INN04524,2.0,2,1,3,Meal Plan 1,0,Room_Type 6,73,2018.0,8,8,Online,0,0,0,207.9,0,Canceled
4520,INN08771,2.0,1,0,1,Meal Plan 1,0,Room_Type 1,45,2018.0,9,15,Online,0,0,0,152.1,1,Not_Canceled
8772,INN08771,2.0,1,0,1,Meal Plan 1,0,Room_Type 1,45,2018.0,9,15,Online,0,0,0,152.1,1,Not_Canceled
35331,INN35327,2.0,0,1,0,Not Selected,0,Room_Type 1,69,2018.0,9,12,Online,0,0,0,125.1,1,Not_Canceled
36280,INN35327,2.0,0,1,0,Not Selected,0,Room_Type 1,69,2018.0,9,12,Online,0,0,0,125.1,1,Not_Canceled


In [15]:
df = df.drop_duplicates(keep='first')
df = df.reset_index(drop=True)

Проверка значений на частичные дубликаты

In [16]:
columns_without_id = list(set(df.columns) - set(['Booking_ID']))
df.groupby(columns_without_id, as_index=False).size()

Unnamed: 0,booking_status,required_car_parking_space,no_of_weekend_nights,arrival_year,room_type_reserved,no_of_week_nights,no_of_adults,repeated_guest,avg_price_per_room,arrival_month,no_of_children,arrival_date,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,lead_time,no_of_special_requests,type_of_meal_plan,market_segment_type,size
0,Canceled,0,0,2017.0,Room_Type 1,0,2.0,0,0.00,9,0,14,0,0,256,0,Meal Plan 1,Online,1
1,Canceled,0,0,2017.0,Room_Type 1,1,1.0,0,0.00,10,0,17,0,0,289,0,Meal Plan 1,Online,1
2,Canceled,0,0,2017.0,Room_Type 1,1,1.0,0,62.00,11,0,17,0,0,22,0,Meal Plan 1,Corporate,1
3,Canceled,0,0,2017.0,Room_Type 1,1,1.0,0,63.75,12,0,8,0,0,101,1,Meal Plan 1,Online,1
4,Canceled,0,0,2017.0,Room_Type 1,1,1.0,0,65.00,9,0,15,0,0,34,0,Meal Plan 1,Corporate,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25998,Not_Canceled,1,2,2018.0,Room_Type 6,3,2.0,0,156.90,12,2,3,0,0,50,1,Meal Plan 1,Online,1
25999,Not_Canceled,1,2,2018.0,Room_Type 6,5,2.0,0,207.64,9,2,7,0,0,61,1,Meal Plan 1,Online,1
26000,Not_Canceled,1,3,2018.0,Room_Type 1,5,2.0,0,81.26,8,0,21,0,0,122,2,Meal Plan 1,Offline,1
26001,Not_Canceled,1,3,2018.0,Room_Type 1,5,3.0,0,105.90,8,0,21,0,0,122,0,Meal Plan 1,Offline,1


При удалении частичных дубликатов (находим по множеству колонок с исключенным идентификатором) предлагается к удалению больше 10000 записей. Поэтому их не трогаем.

## Задание 3

Подготовьте данные к обучению: отделите целевую переменную, закодируйте категориальные переменные и сохраните. Посчитайте коэффициент линейной корреляции, удалите высококоррелирующие признаки.

In [17]:
df.loc[df['booking_status'] == 'Not_Canceled', 'booking_status'] = 1
df.loc[df['booking_status'] == 'Canceled', 'booking_status'] = 0

target = df['booking_status']

features = df.drop(columns=['booking_status'])
features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36278 entries, 0 to 36277
Data columns (total 18 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Booking_ID                            36278 non-null  object 
 1   no_of_adults                          36278 non-null  float64
 2   no_of_children                        36278 non-null  int64  
 3   no_of_weekend_nights                  36278 non-null  int64  
 4   no_of_week_nights                     36278 non-null  int64  
 5   type_of_meal_plan                     36278 non-null  object 
 6   required_car_parking_space            36278 non-null  int64  
 7   room_type_reserved                    36278 non-null  object 
 8   lead_time                             36278 non-null  int64  
 9   arrival_year                          36278 non-null  float64
 10  arrival_month                         36278 non-null  int64  
 11  arrival_date   

In [18]:
features.drop(columns=['Booking_ID'], inplace=True)

In [19]:
encoder = LabelEncoder()
features['type_of_meal_plan'] = encoder.fit_transform(features['type_of_meal_plan'])
features['market_segment_type'] = encoder.fit_transform(features['market_segment_type'])
features['room_type_reserved'] = encoder.fit_transform(features['room_type_reserved'])


In [20]:
corr_bound = 0.75
corr_of_features = features.corr()

display(corr_of_features)

corr_of_features[((corr_of_features >= corr_bound) | (corr_of_features <= -corr_bound)) & (corr_of_features != 1.000)]

Unnamed: 0,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests
no_of_adults,1.0,0.015597,0.106448,0.108678,0.022496,0.010364,0.277627,0.100402,0.080301,0.021643,0.028571,0.324248,-0.197268,-0.048831,-0.122161,0.298091,0.196168
no_of_children,0.015597,1.0,0.029352,0.02448,-0.086794,0.034946,0.363993,-0.04711,0.046022,-0.00288,0.025345,0.130664,-0.036356,-0.016392,-0.021193,0.337544,0.124671
no_of_weekend_nights,0.106448,0.029352,1.0,0.179562,-0.027304,-0.031236,0.057371,0.046622,0.055334,-0.009969,0.027372,0.129101,-0.067098,-0.020687,-0.026307,-0.004489,0.060566
no_of_week_nights,0.108678,0.02448,0.179562,1.0,-0.083401,-0.048662,0.094157,0.14968,0.032646,0.037429,-0.009346,0.112973,-0.099751,-0.030076,-0.049337,0.022741,0.046014
type_of_meal_plan,0.022496,-0.086794,-0.027304,-0.083401,1.0,-0.013056,-0.209161,-0.060245,0.071375,0.008547,0.004847,0.203355,-0.062988,-0.011619,-0.038179,-0.069243,0.022069
required_car_parking_space,0.010364,0.034946,-0.031236,-0.048662,-0.013056,1.0,0.038798,-0.06646,0.015745,-0.015266,-0.000188,-0.003612,0.110838,0.027085,0.06377,0.061186,0.088151
room_type_reserved,0.277627,0.363993,0.057371,0.094157,-0.209161,0.038798,1.0,-0.107772,0.103357,-0.005955,0.032911,0.156622,-0.025822,-0.007935,-0.008137,0.46989,0.145046
lead_time,0.100402,-0.04711,0.046622,0.14968,-0.060245,-0.06646,-0.107772,1.0,0.143412,0.136779,0.006503,-0.006891,-0.135974,-0.045719,-0.078131,-0.062578,-0.101642
arrival_year,0.080301,0.046022,0.055334,0.032646,0.071375,0.015745,0.103357,0.143412,1.0,-0.339644,0.018834,0.15,-0.018182,0.003917,0.026418,0.178589,0.05322
arrival_month,0.021643,-0.00288,-0.009969,0.037429,0.008547,-0.015266,-0.005955,0.136779,-0.339644,1.0,-0.042917,-0.006383,0.000336,-0.038614,-0.01072,0.054355,0.110565


Unnamed: 0,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests
no_of_adults,,,,,,,,,,,,,,,,,
no_of_children,,,,,,,,,,,,,,,,,
no_of_weekend_nights,,,,,,,,,,,,,,,,,
no_of_week_nights,,,,,,,,,,,,,,,,,
type_of_meal_plan,,,,,,,,,,,,,,,,,
required_car_parking_space,,,,,,,,,,,,,,,,,
room_type_reserved,,,,,,,,,,,,,,,,,
lead_time,,,,,,,,,,,,,,,,,
arrival_year,,,,,,,,,,,,,,,,,
arrival_month,,,,,,,,,,,,,,,,,


Высококоррелирующих признаков не обнаружено

## Задание 4

Подготовьте данные для обучения линейных моделей: с помощью нормализации приведите все непрерывные признаки к промежутку `[0, 1]`, все категориальные признаки закодируйте с помощью One-Hot Encoding или Label Encoding. Обучите несколько линейных моделей с гиперпараметрами по умолчанию с помощью:
* разделения выборки на обучающую и тестовую;
* *кросс-валидации на 5 фолдов.

Подберите гиперпараметры для выбранных моделей. Выберите 3 наилучшие обученные модели по выбранным выше метрикам.

In [21]:
features_before_normalization = features.copy()

for features_column in features.columns:
    min_value_of_feature = min(features[features_column])
    max_value_of_feature = max(features[features_column])

    features[features_column] = (features[features_column] - min_value_of_feature) / (max_value_of_feature - min_value_of_feature)
    print(f'{features_column}: Min = {min_value_of_feature}, Max = {max_value_of_feature}')

display(features)

no_of_adults: Min = 1.0, Max = 4.0
no_of_children: Min = 0, Max = 10
no_of_weekend_nights: Min = 0, Max = 7
no_of_week_nights: Min = 0, Max = 17
type_of_meal_plan: Min = 0, Max = 3
required_car_parking_space: Min = 0, Max = 1
room_type_reserved: Min = 0, Max = 6
lead_time: Min = 0, Max = 443
arrival_year: Min = 2017.0, Max = 2018.0
arrival_month: Min = 1, Max = 12
arrival_date: Min = 1, Max = 31
market_segment_type: Min = 0, Max = 4
repeated_guest: Min = 0, Max = 1
no_of_previous_cancellations: Min = 0, Max = 13
no_of_previous_bookings_not_canceled: Min = 0, Max = 58
avg_price_per_room: Min = 0.0, Max = 540.0
no_of_special_requests: Min = 0, Max = 5


Unnamed: 0,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests
0,0.333333,0.0,0.142857,0.117647,0.0,0.0,0.0,0.505643,0.0,0.818182,0.033333,0.75,0.0,0.0,0.0,0.120370,0.0
1,0.333333,0.0,0.285714,0.176471,1.0,0.0,0.0,0.011287,1.0,0.909091,0.166667,1.00,0.0,0.0,0.0,0.197556,0.2
2,0.000000,0.0,0.285714,0.058824,0.0,0.0,0.0,0.002257,1.0,0.090909,0.900000,1.00,0.0,0.0,0.0,0.111111,0.0
3,0.333333,0.0,0.000000,0.117647,0.0,0.0,0.0,0.476298,1.0,0.363636,0.633333,1.00,0.0,0.0,0.0,0.185185,0.0
4,0.333333,0.0,0.142857,0.058824,1.0,0.0,0.0,0.108352,1.0,0.272727,0.333333,1.00,0.0,0.0,0.0,0.175000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36273,0.666667,0.0,0.285714,0.352941,0.0,0.0,0.5,0.191874,1.0,0.636364,0.066667,1.00,0.0,0.0,0.0,0.310741,0.2
36274,0.333333,0.0,0.142857,0.176471,0.0,0.0,0.0,0.514673,1.0,0.818182,0.533333,1.00,0.0,0.0,0.0,0.168426,0.4
36275,0.333333,0.0,0.285714,0.352941,0.0,0.0,0.0,0.334086,1.0,0.545455,0.000000,1.00,0.0,0.0,0.0,0.182204,0.4
36276,0.333333,0.0,0.000000,0.176471,1.0,0.0,0.0,0.142212,1.0,0.272727,0.666667,1.00,0.0,0.0,0.0,0.175000,0.0


In [22]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((25394, 17), (25394,), (10884, 17), (10884,))

In [23]:
models_rating = []

#### Обучение моделей на основе разделения выборки на обучающую и тестовую

1. SGDRegressor

In [24]:
sgd_regressor = SGDRegressor(random_state=42)
sgd_regressor.fit(X_train, y_train)
y_pred =  [round(x) for x in sgd_regressor.predict(X_test)]

accuracy = accuracy_score(list(y_test.values), y_pred)
print(accuracy)

models_rating.append({
    'model_name': 'sgd_regressor',
    'split_method': 'train-test',
    'params': 'default',
    'accuracy': accuracy
})

0.7771040058801911


SGDRegressor может принимать следующие гиперпараметры:

    * alpha
    * penalty
    * l1_ratio
    
Сгенерируем возможные варианты для проверки по этим параметрам

Будем использовать следующие следущие наборы значений для каждого из них:

In [25]:
alpha_various = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1]
penalty_various = ['l2', 'l1', 'elasticnet', None]
l1_ratio_various = [0.1, 0.3, 0.5, 0.7, 0.9]

In [26]:
sgd_regressor_params_set = [alpha_various, penalty_various, l1_ratio_various]
sgd_regressor_params = list(itertools.product(*sgd_regressor_params_set))
print(f'{len(sgd_regressor_params)} кобинаций')

120 кобинаций


In [27]:
# Подбор гиперпараметров

current_max_accuracy = 0
for sgd_param in sgd_regressor_params:
    alpha = sgd_param[0]
    penalty = sgd_param[1]
    l1_ratio = sgd_param[2]
    sgd_regressor = SGDRegressor(random_state=42, alpha=alpha, penalty=penalty, l1_ratio=l1_ratio)
    sgd_regressor.fit(X_train, y_train)
    y_pred = [round(x) for x in sgd_regressor.predict(X_test)]
    accuracy = accuracy_score(list(y_test.values), y_pred)

    if accuracy > current_max_accuracy:
        print(f'Alpha = {alpha}, penalty = {penalty}, l1_ratio = {l1_ratio}, accuracy_score is {accuracy}')
        models_rating.append({
            'model_name': 'sgd_regressor',
            'split_method': 'train-test',
            'params': f'alpha = {alpha}, penalty = {penalty}, l1_ratio = {l1_ratio}',
            'accuracy': accuracy
        })
        current_max_accuracy = accuracy

Alpha = 0.0001, penalty = l2, l1_ratio = 0.1, accuracy_score is 0.7771040058801911


2. Ridge

In [28]:
ridge = Ridge(random_state=42)
ridge.fit(X_train, y_train)
y_pred =  [round(x) for x in ridge.predict(X_test)]

accuracy = accuracy_score(list(y_test.values), y_pred)
print(accuracy)

models_rating.append({
    'model_name': 'ridge',
    'split_method': 'train-test',
    'params': 'default',
    'accuracy': accuracy
})

0.7860161705255421


In [29]:
current_max_accuracy = 0
for alpha_param in alpha_various:
    ridge = Ridge(random_state=42, alpha=alpha_param)
    ridge.fit(X_train, y_train)
    y_pred = [round(x) for x in ridge.predict(X_test)]
    accuracy = accuracy_score(list(y_test.values), y_pred)

    if accuracy > current_max_accuracy:
        print(f'Alpha = {alpha}, accuracy_score is {accuracy}')
        models_rating.append({
            'model_name': 'ridge',
            'split_method': 'train-test',
            'params': f'alpha = {alpha}',
            'accuracy': accuracy
        })
        current_max_accuracy = accuracy

Alpha = 0.1, accuracy_score is 0.7861080485115767


3. Lasso

In [30]:
lasso = Lasso(random_state=42)
lasso.fit(X_train, y_train)
y_pred =  [round(x) for x in lasso.predict(X_test)]

accuracy = accuracy_score(list(y_test.values), y_pred)
print(accuracy)

models_rating.append({
    'model_name': 'lasso',
    'split_method': 'train-test',
    'params': 'default',
    'accuracy': accuracy
})

0.6752113193678795


In [31]:
current_max_accuracy = 0
for alpha_param in alpha_various:
    lasso = Lasso(random_state=42, alpha=alpha_param)
    lasso.fit(X_train, y_train)
    y_pred = [round(x) for x in lasso.predict(X_test)]
    accuracy = accuracy_score(list(y_test.values), y_pred)

    if accuracy > current_max_accuracy:
        print(f'Alpha = {alpha}, accuracy_score is {accuracy}')
        models_rating.append({
            'model_name': 'lasso',
            'split_method': 'train-test',
            'params': f'alpha = {alpha}',
            'accuracy': accuracy
        })
        current_max_accuracy = accuracy

Alpha = 0.1, accuracy_score is 0.785832414553473
Alpha = 0.1, accuracy_score is 0.7884968761484749


#### Обучение моделей на основе кросс-валидации на 5 фолдов

1. SGDRegressor

In [32]:
# Параметры по умолчанию
sgd_regressor = SGDRegressor(random_state=42)
five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
fold_num = 1
accuracy_sum = 0
for train, test in five_folds_model_selection.split(features, target):
    X_train = features.iloc[train]
    y_train = target.iloc[train]

    X_test = features.iloc[test]
    y_test = target.iloc[test]

    sgd_regressor.fit(X_train, y_train)
    y_pred = [round(x) for x in sgd_regressor.predict(X_test)]
    accuracy = accuracy_score(list(y_test.values), y_pred)

    accuracy_sum += accuracy
    fold_num += 1
    
average_accuracy = accuracy_sum / 5
print(f'Average accuracy with default hyper parameters: {average_accuracy}')
models_rating.append({
    'model_name': 'sgd_regressor',
    'split_method': 'kfolds',
    'params': f'default',
    'accuracy': average_accuracy
})

Average accuracy with default hyper parameters: 0.7791774786350439


In [None]:
# Подбор гиперпараметров
current_max_accuracy = 0
for sgd_param in sgd_regressor_params:
    alpha = sgd_param[0]
    penalty = sgd_param[1]
    l1_ratio = sgd_param[2]

    sgd_regressor = SGDRegressor(random_state=42, alpha=alpha, penalty=penalty, l1_ratio=l1_ratio)

    fold_num = 1
    accuracy_sum = 0
    five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
    for train, test in five_folds_model_selection.split(features, target):
        X_train = features.iloc[train]
        y_train = target.iloc[train]

        X_test = features.iloc[test]
        y_test = target.iloc[test]

        sgd_regressor.fit(X_train, y_train)
        y_pred = [round(x) for x in sgd_regressor.predict(X_test)]
        accuracy = accuracy_score(list(y_test.values), y_pred)

        accuracy_sum += accuracy
        fold_num += 1

    average_accuracy = accuracy_sum / 5
    accuracy_sum = 0
    
    if average_accuracy > current_max_accuracy:
        current_max_accuracy = average_accuracy
        # Выводим если находим значение average_accuracy больше
        print(f'Average accuracy with alpha = {alpha}, penalty = {penalty}, l1_ratio = {l1_ratio}: {average_accuracy}')
    models_rating.append({
        'model_name': 'sgd_regressor',
        'split_method': 'kfolds',
        'params': f'alpha = {alpha}, penalty = {penalty}, l1_ratio = {l1_ratio}',
        'accuracy': average_accuracy
    })

Average accuracy with alpha = 0.0001, penalty = l2, l1_ratio = 0.1: 0.7791774786350439


2. Ridge

In [None]:
# Параметры по умолчанию
fold_num = 1
accuracy_sum = 0
ridge = Ridge(random_state=42)
five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
for train, test in five_folds_model_selection.split(features, target):
    X_train = features.iloc[train]
    y_train = target.iloc[train]

    X_test = features.iloc[test]
    y_test = target.iloc[test]

    ridge.fit(X_train, y_train)
    y_pred = [round(x) for x in ridge.predict(X_test)]
    accuracy = accuracy_score(list(y_test.values), y_pred)
    
    accuracy_sum += accuracy
    fold_num += 1
    
average_accuracy = accuracy_sum / 5
print(f'Average accuracy with default hyper parameters: {average_accuracy}')
models_rating.append({
    'model_name': 'ridge',
    'split_method': 'kfolds',
    'params': f'default',
    'accuracy': average_accuracy
})

In [None]:
# Подбор гиперпараметров
current_max_accuracy = 0
for alpha_param in alpha_various:
    ridge = Ridge(random_state=42, alpha=alpha_param)

    fold_num = 1
    accuracy_sum = 0
    five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
    for train, test in five_folds_model_selection.split(features, target):
        X_train = features.iloc[train]
        y_train = target.iloc[train]

        X_test = features.iloc[test]
        y_test = target.iloc[test]

        ridge.fit(X_train, y_train)
        y_pred = [round(x) for x in ridge.predict(X_test)]
        accuracy = accuracy_score(list(y_test.values), y_pred)

        accuracy_sum += accuracy
        fold_num += 1
        
    average_accuracy = accuracy_sum / 5
    accuracy_sum = 0
        
    if average_accuracy > current_max_accuracy:
        current_max_accuracy = average_accuracy
        # Выводим если находим значение average_accuracy больше
        print(f'Average accuracy with alpha = {alpha_param} is {average_accuracy}')
    models_rating.append({
        'model_name': 'ridge',
        'split_method': 'kfolds',
        'params': f'alpha = {alpha_param}',
        'accuracy': average_accuracy
    })

3. Lasso

In [None]:
# Параметры по умолчанию
fold_num = 1
accuracy_sum = 0
lasso = Lasso(random_state=42)
five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
for train, test in five_folds_model_selection.split(features, target):
    X_train = features.iloc[train]
    y_train = target.iloc[train]

    X_test = features.iloc[test]
    y_test = target.iloc[test]

    lasso.fit(X_train, y_train)
    y_pred = [round(x) for x in lasso.predict(X_test)]
    accuracy = accuracy_score(list(y_test.values), y_pred)

    accuracy_sum += accuracy
    fold_num += 1

average_accuracy = accuracy_sum / 5
print(f'Average accuracy with default hyper parameters is {average_accuracy}')
models_rating.append({
    'model_name': 'lasso',
    'split_method': 'kfolds',
    'params': f'default',
    'accuracy': average_accuracy
})

In [None]:
# Подбор гиперпараметров
current_max_accuracy = 0
for alpha_param in alpha_various:
    lasso = Lasso(random_state=42, alpha=alpha_param)

    fold_num = 1
    accuracy_sum = 0
    five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
    for train, test in five_folds_model_selection.split(features, target):
        X_train = features.iloc[train]
        y_train = target.iloc[train]

        X_test = features.iloc[test]
        y_test = target.iloc[test]

        lasso.fit(X_train, y_train)
        y_pred = [round(x) for x in lasso.predict(X_test)]
        accuracy = accuracy_score(list(y_test.values), y_pred)

        accuracy_sum += accuracy
        fold_num += 1
        
    average_accuracy = accuracy_sum / 5
    accuracy_sum = 0

    if average_accuracy > current_max_accuracy:
        current_max_accuracy = average_accuracy
        # Выводим если находим значение average_accuracy больше
        print(f'Average accuracy with alpha = {alpha_param} is {average_accuracy}')
    models_rating.append({
        'model_name': 'lasso',
        'split_method': 'kfolds',
        'params': f'alpha = {alpha_param}',
        'accuracy': average_accuracy
    })

#### Лучшие результаты

In [None]:
# Выводим лучшие результаты среди линейных моделей обученных 
# на выборках разделенных на тестовую и обучающую и
# через применение кросс валидации на 5 фолдов
models_rating.sort(key=lambda x: x['accuracy'], reverse=True)
top_3 = models_rating[:3]

for result in top_3:
    print(f"Model name: {result['model_name']}, split_method: {result['split_method']}, params: {result['params']}, accuracy: {result['accuracy']}")

## Задание 5

С помощью
* разделения выборки на обучающую и тестовую,
* *кросс-валидации на 5 фолдов,

подберите гиперпараметры для моделей дерева решений, случайного леса, бэггинга и градиентного бустинга на деревьях решений, обучая модели на исходном датасете и на датасете, на котором обучались линейные модели. Учитывая результаты предыдущего задания, выберите 3 наилучшие обученные модели по метрикам.

#### Разделение выборки на случайную и обучающую

In [None]:
X_train_features, X_test_features, y_train_features, y_test_features = train_test_split(features, target, test_size=0.3, random_state=42)
y_train_features = np.asarray(y_train_features, dtype=np.float64)
X_train_features.shape, y_train_features.shape, X_test_features.shape, y_test_features.shape

In [None]:
X_train_before_norm, X_test_before_norm, y_train_before_norm, y_test_before_norm = train_test_split(features_before_normalization, target, test_size=0.3, random_state=42)
y_train_before_norm = np.asarray(y_train_before_norm, dtype=np.float64)
X_train_before_norm.shape, y_train_before_norm.shape, X_test_before_norm.shape, y_test_before_norm.shape

1. Дерево решений

DecisionTreeClassifier может принимать следующие гиперпараметры:

    * max_depth
    * criterion
    
Сгенерируем возможные варианты для проверки по этим параметрам

Будем использовать следующие следущие наборы значений для каждого из них:

In [None]:
max_depth = [2, 4, 6, 8, 10, 12]
criterion = ['gini', 'entropy', 'log_loss']

In [None]:
dtc_params_set = [max_depth, criterion]
dtc_params = list(itertools.product(*dtc_params_set))
print(f'{len(dtc_params)} кобинаций')

In [None]:
display(X_train_features)

In [None]:
# Нормализованные данные
current_max_accuracy = 0
for dtc_param in dtc_params:
    max_depth = dtc_param[0]
    criterion = dtc_param[1]

    dtc = DecisionTreeClassifier(random_state=42, max_depth=max_depth, criterion=criterion)
    dtc.fit(X_train_features, y_train_features)
    y_pred_features = [round(x) for x in dtc.predict(X_test_features)]
    accuracy = accuracy_score(list(y_test_features.values), y_pred_features)

    if accuracy > current_max_accuracy:
        print(f'max_depth = {max_depth}, criterion = {criterion}, accuracy_score is {accuracy}')
        top_3.append({
            'model_name': 'dtc',
            'split_method': 'train-test',
            'params': f'normalized data, max_depth = {max_depth}, criterion = {criterion}',
            'accuracy': accuracy
        })
        current_max_accuracy = accuracy

In [None]:
display(X_train_before_norm)

In [None]:
# Данные до нормализации
current_max_accuracy = 0
for dtc_param in dtc_params:
    max_depth = dtc_param[0]
    criterion = dtc_param[1]

    dtc = DecisionTreeClassifier(random_state=42, max_depth=max_depth, criterion=criterion)
    dtc.fit(X_train_before_norm, y_train_before_norm)
    y_pred_before_norm = [round(x) for x in dtc.predict(X_test_before_norm)]
    accuracy = accuracy_score(list(y_test_before_norm.values), y_pred_before_norm)

    if accuracy > current_max_accuracy:
        print(f'max_depth = {max_depth}, criterion = {criterion}, accuracy_score is {accuracy}')
        top_3.append({
            'model_name': 'dtc',
            'split_method': 'train-test',
            'params': f'data before normalization, max_depth = {max_depth}, criterion = {criterion}',
            'accuracy': accuracy
        })
        current_max_accuracy = accuracy

2. Случайный лес

RandomForestClassifier может принимать следующие гиперпараметры:

    * max_depth
    * criterion
    
Сгенерируем возможные варианты для проверки по этим параметрам

Будем использовать следующие следущие наборы значений для каждого из них:

In [None]:
max_depth = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22]
criterion = ['gini', 'entropy', 'log_loss']

In [None]:
rfc_params_set = [max_depth, criterion]
rfc_params = list(itertools.product(*rfc_params_set))
print(f'{len(rfc_params)} кобинаций')

In [None]:
# Нормализованные данные
current_max_accuracy = 0
for rfc_param in rfc_params:
    max_depth = rfc_param[0]
    criterion = rfc_param[1]

    rfc = RandomForestClassifier(random_state=42, max_depth=max_depth, criterion=criterion)
    rfc.fit(X_train_features, y_train_features)
    y_pred_features = [round(x) for x in rfc.predict(X_test_features)]
    accuracy = accuracy_score(list(y_test_features.values), y_pred_features)

    if accuracy > current_max_accuracy:
        print(f'max_depth = {max_depth}, criterion = {criterion}, accuracy_score is {accuracy}')
        top_3.append({
            'model_name': 'rfc',
            'split_method': 'train-test',
            'params': f'normalized data, max_depth = {max_depth}, criterion = {criterion}',
            'accuracy': accuracy
        })
        current_max_accuracy = accuracy

In [None]:
# Данные до нормализации
current_max_accuracy = 0
for rfc_param in rfc_params:
    max_depth = rfc_param[0]
    criterion = rfc_param[1]

    rfc = RandomForestClassifier(random_state=42, max_depth=max_depth, criterion=criterion)
    rfc.fit(X_train_before_norm, y_train_before_norm)
    y_pred_before_norm = [round(x) for x in rfc.predict(X_test_before_norm)]
    accuracy = accuracy_score(list(y_test_before_norm.values), y_pred_before_norm)

    if accuracy > current_max_accuracy:
        print(f'max_depth = {max_depth}, criterion = {criterion}, accuracy_score is {accuracy}')
        top_3.append({
            'model_name': 'rfc',
            'split_method': 'train-test',
            'params': f'data before normalization, max_depth = {max_depth}, criterion = {criterion}',
            'accuracy': accuracy
        })
        current_max_accuracy = accuracy

3. Бэггинг

BaggingClassifier может принимать следующие гиперпараметры:

    * max_samples
    * n_estimators
    
Сгенерируем возможные варианты для проверки по этим параметрам

Будем использовать следующие следущие наборы значений для каждого из них:

In [None]:
max_samples = [2, 4, 6, 8, 10]
n_estimators = [5, 10, 15, 20]

In [None]:
bagging_params_set = [max_samples, n_estimators]
bagging_params = list(itertools.product(*bagging_params_set))
print(f'{len(bagging_params)} кобинаций')

In [None]:
# Нормализованные данные
current_max_accuracy = 0
for bagging_param in bagging_params:
    max_samples = bagging_param[0]
    n_estimators = bagging_param[1]

    bagging = BaggingClassifier(random_state=42, max_samples=max_samples, n_estimators=n_estimators)
    bagging.fit(X_train_features, y_train_features)
    y_pred_features = [round(x) for x in bagging.predict(X_test_features)]
    accuracy = accuracy_score(list(y_test_features.values), y_pred_features)

    if accuracy > current_max_accuracy:
        print(f'max_samples = {max_samples}, n_estimators = {n_estimators}, accuracy_score is {accuracy}')
        top_3.append({
            'model_name': 'bagging',
            'split_method': 'train-test',
            'params': f'normalized data, max_samples = {max_samples}, n_estimators = {n_estimators}',
            'accuracy': accuracy
        })
        current_max_accuracy = accuracy

In [None]:
# Данные до нормализации
current_max_accuracy = 0
for bagging_param in bagging_params:
    max_depth = bagging_param[0]
    criterion = bagging_param[1]

    bagging = BaggingClassifier(random_state=42, max_samples=max_samples, n_estimators=n_estimators)
    bagging.fit(X_train_before_norm, y_train_before_norm)
    y_pred_before_norm = [round(x) for x in bagging.predict(X_test_before_norm)]
    accuracy = accuracy_score(list(y_test_before_norm.values), y_pred_before_norm)

    if accuracy > current_max_accuracy:
        print(f'max_samples = {max_samples}, n_estimators = {n_estimators}, accuracy_score is {accuracy}')
        top_3.append({
            'model_name': 'bagging',
            'split_method': 'train-test',
            'params': f'data before normalization, max_samples = {max_samples}, n_estimators = {n_estimators}',
            'accuracy': accuracy
        })
        current_max_accuracy = accuracy

4. Градиентный бустинг

GradientBoostingClassifier может принимать следующие гиперпараметры:

    * max_features
    * n_estimators
    
Сгенерируем возможные варианты для проверки по этим параметрам

Будем использовать следующие следущие наборы значений для каждого из них:

In [None]:
max_features = ['sqrt', 'log2']
n_estimators = [5, 10, 15, 20]

In [None]:
gbc_params_set = [max_features, n_estimators]
gbc_params = list(itertools.product(*gbc_params_set))
print(f'{len(gbc_params)} кобинаций')

In [None]:
# Нормализоваенные данные
current_max_accuracy = 0
for gbc_param in gbc_params:
    max_features = gbc_param[0]
    n_estimators = gbc_param[1]

    gbc = GradientBoostingClassifier(random_state=42, max_features=max_features, n_estimators=n_estimators)
    gbc.fit(X_train_features, y_train_features)
    y_pred_features = [round(x) for x in gbc.predict(X_test_features)]
    accuracy = accuracy_score(list(y_test_features.values), y_pred_features)

    if accuracy > current_max_accuracy:
        print(f'max_features = {max_features}, n_estimators = {n_estimators}, accuracy_score is {accuracy}')
        top_3.append({
            'model_name': 'gbc',
            'split_method': 'train-test',
            'params': f'normalized data, max_features = {max_features}, n_estimators = {n_estimators}',
            'accuracy': accuracy
        })
        current_max_accuracy = accuracy

In [None]:
# Данные до нормализации
current_max_accuracy = 0
for gbc_param in gbc_params:
    max_features = gbc_param[0]
    criterion = gbc_param[1]

    gbc = GradientBoostingClassifier(random_state=42, max_features=max_features, n_estimators=n_estimators)
    gbc.fit(X_train_before_norm, y_train_before_norm)
    y_pred_before_norm = [round(x) for x in gbc.predict(X_test_before_norm)]
    accuracy = accuracy_score(list(y_test_before_norm.values), y_pred_before_norm)

    if accuracy > current_max_accuracy:
        print(f'max_features = {max_features}, n_estimators = {n_estimators}, accuracy_score is {accuracy}')
        top_3.append({
            'model_name': 'gbc',
            'split_method': 'train-test',
            'params': f'data before normalization, max_features = {max_features}, n_estimators = {n_estimators}',
            'accuracy': accuracy
        })
        current_max_accuracy = accuracy

#### Кросс-валидация на 5 фолдов

1. Дерево решений

DecisionTreeClassifier может принимать следующие гиперпараметры:

    * max_depth
    * criterion
    
Сгенерируем возможные варианты для проверки по этим параметрам

Будем использовать следующие следущие наборы значений для каждого из них:

In [None]:
max_depth = [2, 4, 6, 8, 10, 12]
criterion = ['gini', 'entropy', 'log_loss']

In [None]:
dtc_params_set = [max_depth, criterion]
dtc_params = list(itertools.product(*dtc_params_set))
print(f'{len(dtc_params)} кобинаций')

In [None]:
# Кросс валидация
# Подбор гиперпараметров на нормализованном датасете
current_max_accuracy = 0
for dtc_param in dtc_params:
    max_depth = dtc_param[0]
    criterion = dtc_param[1]

    dtc = DecisionTreeClassifier(random_state=42, max_depth=max_depth, criterion=criterion)
    fold_num = 1
    accuracy_sum = 0
    five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
    # Используем нормализованный датасет
    for train, test in five_folds_model_selection.split(features, target):
        X_train = features.iloc[train]
        y_train =  np.asarray(target.iloc[train], dtype=np.float64)

        X_test = features.iloc[test]
        y_test = target.iloc[test]

        dtc.fit(X_train, y_train)
        y_pred = [round(x) for x in dtc.predict(X_test)]
        accuracy = accuracy_score(list(y_test.values), y_pred)

        accuracy_sum += accuracy
        fold_num += 1
        
    average_accuracy = accuracy_sum / 5
    accuracy_sum = 0
    
    if average_accuracy > current_max_accuracy:
        current_max_accuracy = average_accuracy
        print(f'Average accuracy with max_depth = {max_depth}, criterion = {criterion} is {average_accuracy}')
    top_3.append({
        'model_name': 'dtc',
        'split_method': 'kfolds',
        'params': f'normalized data, max_depth = {max_depth}, criterion = {criterion}',
        'accuracy': average_accuracy
    })

In [None]:
# Кросс валидация
# Подбор гиперпараметров на датасете до нормализации
current_max_accuracy = 0
for dtc_param in dtc_params:
    max_depth = dtc_param[0]
    criterion = dtc_param[1]

    dtc = DecisionTreeClassifier(random_state=42, max_depth=max_depth, criterion=criterion)
    fold_num = 1
    accuracy_sum = 0
    five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
    # Используем датасет до нормализации
    for train, test in five_folds_model_selection.split(features_before_normalization, target):
        X_train = features_before_normalization.iloc[train]
        y_train =  np.asarray(target.iloc[train], dtype=np.float64)

        X_test = features_before_normalization.iloc[test]
        y_test = target.iloc[test]

        dtc.fit(X_train, y_train)
        y_pred = [round(x) for x in dtc.predict(X_test)]
        accuracy = accuracy_score(list(y_test.values), y_pred)

        accuracy_sum += accuracy
        fold_num += 1
        
    average_accuracy = accuracy_sum / 5
    accuracy_sum = 0
        
    if average_accuracy > current_max_accuracy:
        current_max_accuracy = average_accuracy
        print(f'Average accuracy with max_depth = {max_depth}, criterion = {criterion} is {average_accuracy}')
    top_3.append({
        'model_name': 'dtc',
        'split_method': 'kfolds',
        'params': f'data before normalization, max_depth = {max_depth}, criterion = {criterion}',
        'accuracy': average_accuracy
    })

2. Случайный лес

RandomForestClassifier может принимать следующие гиперпараметры:

    * max_depth
    * criterion
    
Сгенерируем возможные варианты для проверки по этим параметрам

Будем использовать следующие следущие наборы значений для каждого из них:

In [None]:
max_depth = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22]
criterion = ['gini', 'entropy', 'log_loss']

In [None]:
rfc_params_set = [max_depth, criterion]
rfc_params = list(itertools.product(*rfc_params_set))
print(f'{len(rfc_params)} кобинаций')

In [None]:
# Кросс валидация
# Подбор гиперпараметров на нормализованном датасете
current_max_accuracy = 0
for rfc_param in rfc_params:
    max_depth = rfc_param[0]
    criterion = rfc_param[1]

    rfc = RandomForestClassifier(random_state=42, max_depth=max_depth, criterion=criterion)
    fold_num = 1
    accuracy_sum = 0
    five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
    # Используем нормализованный датасет
    for train, test in five_folds_model_selection.split(features, target):
        X_train = features.iloc[train]
        y_train =  np.asarray(target.iloc[train], dtype=np.float64)

        X_test = features.iloc[test]
        y_test = target.iloc[test]

        rfc.fit(X_train, y_train)
        y_pred = [round(x) for x in rfc.predict(X_test)]
        accuracy = accuracy_score(list(y_test.values), y_pred)

        accuracy_sum += accuracy
        fold_num += 1
        
    average_accuracy = accuracy_sum / 5
    accuracy_sum = 0
        
    if average_accuracy > current_max_accuracy:
        current_max_accuracy = average_accuracy
        print(f'Average accuracy with max_depth = {max_depth}, criterion = {criterion} is {average_accuracy}')
    top_3.append({
        'model_name': 'rfc',
        'split_method': 'kfolds',
        'params': f'normalized data, max_depth = {max_depth}, criterion = {criterion}',
        'accuracy': average_accuracy
    })

In [None]:
# Кросс валидация
# Подбор гиперпараметров на датасете до нормализации
current_max_accuracy = 0
for rfc_param in rfc_params:
    max_depth = rfc_param[0]
    criterion = rfc_param[1]

    rfc = RandomForestClassifier(random_state=42, max_depth=max_depth, criterion=criterion)
    fold_num = 1
    accuracy_sum = 0
    five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
    # Используем датасет до нормализации
    for train, test in five_folds_model_selection.split(features_before_normalization, target):
        X_train = features_before_normalization.iloc[train]
        y_train =  np.asarray(target.iloc[train], dtype=np.float64)

        X_test = features_before_normalization.iloc[test]
        y_test = target.iloc[test]

        rfc.fit(X_train, y_train)
        y_pred = [round(x) for x in rfc.predict(X_test)]
        accuracy = accuracy_score(list(y_test.values), y_pred)

        accuracy_sum += accuracy
        fold_num += 1
        
    average_accuracy = accuracy_sum / 5
    accuracy_sum = 0
        
    if average_accuracy > current_max_accuracy:
        current_max_accuracy = average_accuracy
        print(f'Average accuracy with max_depth = {max_depth}, criterion = {criterion} is {average_accuracy}')
    top_3.append({
        'model_name': 'rfc',
        'split_method': 'kfolds',
        'params': f'data before normalization, max_depth = {max_depth}, criterion = {criterion}',
        'accuracy': average_accuracy
    })

3. Бэггинг

BaggingClassifier может принимать следующие гиперпараметры:

    * max_samples
    * n_estimators
    
Сгенерируем возможные варианты для проверки по этим параметрам

Будем использовать следующие следущие наборы значений для каждого из них:

In [None]:
max_samples = [2, 6, 10, 14, 18]
n_estimators = [5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

In [None]:
bagging_params_set = [max_samples, n_estimators]
bagging_params = list(itertools.product(*bagging_params_set))
print(f'{len(bagging_params)} кобинаций')

In [None]:
# Кросс валидация
# Подбор гиперпараметров на нормализованном датасете
current_max_accuracy = 0
for bagging_param in bagging_params:
    max_samples = bagging_param[0]
    n_estimators = bagging_param[1]

    bagging = BaggingClassifier(random_state=42, max_samples=max_samples, n_estimators=n_estimators)
    fold_num = 1
    accuracy_sum = 0
    five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
    # Используем нормализованный датасет
    for train, test in five_folds_model_selection.split(features, target):
        X_train = features.iloc[train]
        y_train =  np.asarray(target.iloc[train], dtype=np.float64)

        X_test = features.iloc[test]
        y_test = target.iloc[test]

        bagging.fit(X_train, y_train)
        y_pred = [round(x) for x in bagging.predict(X_test)]
        accuracy = accuracy_score(list(y_test.values), y_pred)

        accuracy_sum += accuracy
        fold_num += 1
    
    average_accuracy = accuracy_sum / 5
    accuracy_sum = 0
        
    if average_accuracy > current_max_accuracy:
        current_max_accuracy = average_accuracy
        print(f'Average accuracy with max_samples = {max_samples}, n_estimators = {n_estimators} is {average_accuracy}')
    top_3.append({
        'model_name': 'bagging',
        'split_method': 'kfolds',
        'params': f'normalized data, max_samples = {max_samples}, n_estimators = {n_estimators}',
        'accuracy': average_accuracy
    })

In [None]:
# Кросс валидация
# Подбор гиперпараметров на датасете до нормализации
current_max_accuracy = 0
for bagging_param in bagging_params:
    max_samples = bagging_param[0]
    n_estimators = bagging_param[1]

    bagging = BaggingClassifier(random_state=42, max_samples=max_samples, n_estimators=n_estimators)
    fold_num = 1
    accuracy_sum = 0
    five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
    # Используем датасет до нормализации
    for train, test in five_folds_model_selection.split(features_before_normalization, target):
        X_train = features_before_normalization.iloc[train]
        y_train =  np.asarray(target.iloc[train], dtype=np.float64)

        X_test = features_before_normalization.iloc[test]
        y_test = target.iloc[test]

        bagging.fit(X_train, y_train)
        y_pred = [round(x) for x in bagging.predict(X_test)]
        accuracy = accuracy_score(list(y_test.values), y_pred)

        accuracy_sum += accuracy
        fold_num += 1
        
    average_accuracy = accuracy_sum / 5
    accuracy_sum = 0
        
    if average_accuracy > current_max_accuracy:
        current_max_accuracy = average_accuracy
        print(f'Average accuracy with max_samples = {max_samples}, n_estimators = {n_estimators} is {average_accuracy}')
    top_3.append({
        'model_name': 'bagging',
        'split_method': 'kfolds',
        'params': f'data before normalization, max_samples = {max_samples}, n_estimators = {n_estimators}',
        'accuracy': average_accuracy
    })

4. Градиентный бустинг

GradientBoostingClassifier может принимать следующие гиперпараметры:

    * max_features
    * n_estimators
    
Сгенерируем возможные варианты для проверки по этим параметрам

Будем использовать следующие следущие наборы значений для каждого из них:

In [None]:
max_features = ['sqrt', 'log2']
n_estimators = [5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, 100]

In [None]:
gbc_params_set = [max_features, n_estimators]
gbc_params = list(itertools.product(*gbc_params_set))
print(f'{len(gbc_params)} кобинаций')

In [None]:
# Кросс валидация
# Подбор гиперпараметров на нормализованном датасете
current_max_accuracy = 0
for gbc_param in gbc_params:
    max_features = gbc_param[0]
    n_estimators = gbc_param[1]

    gbc = GradientBoostingClassifier(random_state=42, max_features=max_features, n_estimators=n_estimators)
    fold_num = 1
    accuracy_sum = 0
    five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
    # Используем нормализованный датасет
    for train, test in five_folds_model_selection.split(features, target):
        X_train = features.iloc[train]
        y_train =  np.asarray(target.iloc[train], dtype=np.float64)

        X_test = features.iloc[test]
        y_test = target.iloc[test]

        gbc.fit(X_train, y_train)
        y_pred = [round(x) for x in gbc.predict(X_test)]
        accuracy = accuracy_score(list(y_test.values), y_pred)

        accuracy_sum += accuracy
        fold_num += 1
        
    average_accuracy = accuracy_sum / 5
    accuracy_sum = 0

    if average_accuracy > current_max_accuracy:
        current_max_accuracy = average_accuracy
        print(f'Average accuracy with max_features = {max_features}, n_estimators = {n_estimators} is {average_accuracy}')
    top_3.append({
        'model_name': 'gbc',
        'split_method': 'kfolds',
        'params': f'normalized data, max_features = {max_features}, n_estimators = {n_estimators}',
        'accuracy': average_accuracy
    })

In [None]:
# Кросс валидация
# Подбор гиперпараметров на датасете до нормализации
current_max_accuracy = 0
for gbc_param in gbc_params:
    max_features = gbc_param[0]
    n_estimators = gbc_param[1]

    gbc = GradientBoostingClassifier(random_state=42, max_features=max_features, n_estimators=n_estimators)
    fold_num = 1
    accuracy_sum = 0
    five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
     # Используем датасет до нормализации
    for train, test in five_folds_model_selection.split(features_before_normalization, target):
        X_train = features_before_normalization.iloc[train]
        y_train =  np.asarray(target.iloc[train], dtype=np.float64)

        X_test = features_before_normalization.iloc[test]
        y_test = target.iloc[test]

        gbc.fit(X_train, y_train)
        y_pred = [round(x) for x in gbc.predict(X_test)]
        accuracy = accuracy_score(list(y_test.values), y_pred)

        accuracy_sum += accuracy 
        fold_num += 1
        
    average_accuracy = accuracy_sum / 5
    accuracy_sum = 0
        
    if average_accuracy > current_max_accuracy:
        current_max_accuracy = average_accuracy
        print(f'Average accuracy  with max_features = {max_features}, n_estimators = {n_estimators} is {average_accuracy}')
    top_3.append({
        'model_name': 'gbc',
        'split_method': 'kfolds',
        'params': f'data before normalization, max_features = {max_features}, n_estimators = {n_estimators}',
        'accuracy': average_accuracy
    })

#### Лучшие результаты

In [None]:
top_3.sort(key=lambda x: x['accuracy'], reverse=True)
top_3_new = top_3[:3]

for result in top_3_new:
    print(f"Model name: {result['model_name']}, split_method: {result['split_method']} params: {result['params']}, accuracy: {result['accuracy']}")

## Задание 6*

В данных имеется дисбалланс по целевой переменной. Сбалансируйте датасет с помощью одной из техник:
* Oversampling
* Undersampling

и обучите наилучшими алгоритмами 3 модели. Сравните результаты с предыдущими обученными моделями.


#### Oversampling

1. Random Over Sampler

In [None]:
random_over_sampler = RandomOverSampler(random_state = 42)

features_over, target_over = random_over_sampler.fit_resample(features, np.asarray(target, dtype=np.float64))
features_over.shape, features.shape

In [None]:
X_train_features_over, X_test_features_over, y_train_features_over, y_test_features_over = train_test_split(features_over, target_over, test_size=0.3, random_state=42)

X_train_features_over.shape, y_train_features_over.shape, X_test_features_over.shape, y_test_features_over.shape

In [None]:
# 1. RandomForestClassifier, max_depth = 20, criterion = gini
# Результат обучения на данных ранее accuracy_score: 0.9026131770888343

rfc = RandomForestClassifier(random_state=42, max_depth=20, criterion='gini')
rfc.fit(X_train_features_over, y_train_features_over)

pred_features = [round(x) for x in rfc.predict(X_test_features_over)]
accuracy = accuracy_score(y_test_features_over, pred_features)
print(accuracy)

In [None]:
# 2. DecisionTreeClassifier, max_depth = 12, criterion = gini
# Результат обучения на данных ранее accuracy_score: 0.8774464479881952

dtc = DecisionTreeClassifier(random_state=42, max_depth=12, criterion='gini')
dtc.fit(X_train_features_over, y_train_features_over)

pred_features = [round(x) for x in dtc.predict(X_test_features_over)]
accuracy = accuracy_score(y_test_features_over, pred_features)
print(accuracy)

In [None]:
# 3. Lasso, params: alpha = 0.1
# Результат обучения на данных ранее accuracy_score: 0.7883131201764058

lasso = Lasso(random_state=42, alpha = 0.1)
lasso.fit(X_train_features_over, y_train_features_over)

pred_features = [round(x) for x in lasso.predict(X_test_features_over)]
accuracy = accuracy_score(y_test_features_over, pred_features)
print(accuracy)

Модели выбрал не только опиарясь на полученные "топы". Взял из разных групп для интереса.

Получилось что RandomForestClassifier сыграл в лучшую сторону. \
Результаты DecisionTreeClassifier чуть лучше чем были. \
Но у Lasso результаты сильно просели. 

Если выбрать alpha уже 0.01 результат получится ближе к тому какой в был в экспериментах с Lasso.

In [None]:
# 3*. Lasso, params: alpha = 0.01
# Результат обучения на данных ранее accuracy_score: 0.7826326671261199

lasso = Lasso(random_state=42, alpha = 0.01)
lasso.fit(X_train_features_over, y_train_features_over)

pred_features = [round(x) for x in lasso.predict(X_test_features_over)]
accuracy = accuracy_score(y_test_features_over, pred_features)
print(accuracy)

2. SMOTE

In [None]:
smote = SMOTE(random_state = 42)

features_over, target_over = smote.fit_resample(features, np.asarray(target, dtype=np.float64))
features_over.shape, features.shape

In [None]:
X_train_features_over, X_test_features_over, y_train_features_over, y_test_features_over = train_test_split(features_over, target_over, test_size=0.3, random_state=42)

X_train_features_over.shape, y_train_features_over.shape, X_test_features_over.shape, y_test_features_over.shape

In [None]:
# 1. RandomForestClassifier, max_depth = 20, criterion = gini
# Результат обучения на данных ранее accuracy_score: 0.9026131770888343

rfc = RandomForestClassifier(random_state=42, max_depth=22, criterion='entropy')
rfc.fit(X_train_features_over, y_train_features_over)

pred_features = [round(x) for x in rfc.predict(X_test_features_over)]
accuracy = accuracy_score(y_test_features_over, pred_features)
print(accuracy)

In [None]:
# 2. DecisionTreeClassifier, max_depth = 12, criterion = gini
# Результат обучения на данных ранее accuracy_score: 0.8774464479881952

dtc = DecisionTreeClassifier(random_state=42, max_depth=12, criterion='gini')
dtc.fit(X_train_features_over, y_train_features_over)

pred_features = [round(x) for x in dtc.predict(X_test_features_over)]
accuracy = accuracy_score(y_test_features_over, pred_features)
print(accuracy)

In [None]:
# 3. Lasso, params: alpha = 0.01
# Результат обучения на данных ранее accuracy_score: 0.7826326671261199

lasso = Lasso(random_state=42, alpha = 0.01)
lasso.fit(X_train_features_over, y_train_features_over)

pred_features = [round(x) for x in lasso.predict(X_test_features_over)]
accuracy = accuracy_score(y_test_features_over, pred_features)
print(accuracy)

Применение Random Over Sampler показало лучшие результаты

#### Undersampling

1. RandomUnderSampler

In [None]:
rus = RandomUnderSampler(random_state=42)

features_under, target_under = rus.fit_resample(features, np.asarray(target, dtype=np.float64))
features_under.shape, features.shape

In [None]:
X_train_features_under, X_test_features_under, y_train_features_under, y_test_features_under = train_test_split(features_under, target_under, test_size=0.3, random_state=42)

X_train_features_under.shape, y_train_features_under.shape, X_test_features_under.shape, y_test_features_under.shape

In [None]:
# 1. RandomForestClassifier, max_depth = 20, criterion = gini
# Результат обучения на данных ранее accuracy_score: 0.9026131770888343

rfc = RandomForestClassifier(random_state=42, max_depth=22, criterion='entropy')
rfc.fit(X_train_features_under, y_train_features_under)

pred_features = [round(x) for x in rfc.predict(X_test_features_under)]
accuracy = accuracy_score(y_test_features_under, pred_features)
print(accuracy)

In [None]:
# 2. DecisionTreeClassifier, max_depth = 12, criterion = gini
# Результат обучения на данных ранее accuracy_score: 0.8774464479881952

dtc = DecisionTreeClassifier(random_state=42, max_depth=12, criterion='gini')
dtc.fit(X_train_features_under, y_train_features_under)

pred_features = [round(x) for x in dtc.predict(X_test_features_under)]
accuracy = accuracy_score(y_test_features_under, pred_features)
print(accuracy)

In [None]:
# 3. Lasso, params: alpha = 0.01
# Результат обучения на данных ранее accuracy_score: 0.7826326671261199

lasso = Lasso(random_state=42, alpha = 0.01)
lasso.fit(X_train_features_under, y_train_features_under)

pred_features = [round(x) for x in lasso.predict(X_test_features_under)]
accuracy = accuracy_score(y_test_features_under, pred_features)
print(accuracy)

При Undersampling результаты обучения и работы моделей снизились

## Задание 7*

С помощью наилучшей линейной модели и наилучшей модели случайного леса, определите важность признаков.

In [None]:
# Ridge
# 0.001, accuracy: 0.789045755616968
ridge = Ridge(random_state=42, alpha=0.001)

five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
for train, test in five_folds_model_selection.split(features, target):
    X_train = features.iloc[train]
    y_train = target.iloc[train]

    X_test = features.iloc[test]
    y_test = target.iloc[test]

    ridge.fit(X_train, y_train)
    y_pred = ridge.predict(X_test)

ridge_coefs = ridge.coef_
display(ridge_coefs)

In [None]:
# RandomForestClassifier
# max_depth = 20, criterion = gini, accuarcy: 0.9026131770888343
rfc = RandomForestClassifier(random_state=42, criterion='gini', max_depth=20)

five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
for train, test in five_folds_model_selection.split(features, target):
    X_train = features.iloc[train]
    y_train = np.asarray(target.iloc[train], dtype=np.float64)

    X_test = features.iloc[test]
    y_test = target.iloc[test]

    rfc.fit(X_train, y_train)
    y_pred = rfc.predict(X_test)

rfc_coefs = rfc.feature_importances_
display(rfc_coefs)

In [None]:
features.columns

In [None]:
# Ridge
zipped = zip(features.columns, ridge_coefs, rfc_coefs)
ridge_sorted_features = sorted(zipped, key = lambda t: -t[1])
for feature in ridge_sorted_features:
    print(feature)

In [None]:
# RandomForestClassifier
zipped = zip(features.columns, ridge_coefs, rfc_coefs)
rfc_sorted_features = sorted(zipped, key = lambda t: -t[2])
for feature in rfc_sorted_features:
    print(feature)

Опираясь на результаты важности признаков полученные через Ridge и RandomForestClassifier, в топе совпадают:
 - **no_of_special_requests** - кол-во специальных условий в бронировании (например, кондиционер, номер на первом этаже, поздний заезд)
 - **arrival_month** - месяц прибытия
 
Опираясь на результаты важности признаков полученные только на основании RandomForestClassifier:
 - **lead_time** - кол-во дней между датой бронирования и датой прибытия

In [None]:
print("--- %s seconds ---" % (time.time() - start_time))