# Зачётное задание. **Тетерин Т.**

В рамках данной работы вам будет предложено самостоятельно выполнить те этапы машинного обучения, которые были изучены на парах.

Работа разбита на несколько заданий. Каждое задание наобходимо выполнять последовательно, комментируя все выполняемые действия в текстовых ячейках. При крайней необходимости, возможно вставить комментарий непосредственно в коде. Комментарии должны быть лаконичными, но при этом полными.


**Обратите внимание!**
1. Работы, в которых будет выявлен плагиат, оцениваются в **0 баллов**.
2. Перед проверкой работы по существу, запускается выполнение всех ячеек ноутбука (Среда выполнения -> Выполнить всё). Если хотя бы одна ячейка выполнится с ошибкой, то за работу выставляется **0 баллов**.
3. Если не указано иное, считать, что все необходимые файлы расположены в одной папке с ноутбуком.
4. Данный ноутбук является шаблоном для вашей работы. Пожалуйста, не удаляйте ячейки, которые были в исходном ноутбуке.



## Описание задания

Вам необходимо на основе данных о бронировании гостиниц обучить модель машинного обучения, которая предсказывает, будет ли отменена бронь. Описание колонок:

* `Booking_ID` – уникальный идентификатор бронирования
* `no_of_adults` - кол-во взрослых
* `no_of_children` – кол-во детей
* `no_of_weekend_nights` – кол-во ночей в выходные, которые входят в бронирование
* `no_of_week_nights` – кол-во ночей в будние дни, которые входят в бронирование
* `type_of_meal_plan` – тип питания
* `required_car_parking_space` – необходима ли парковка?
* `room_type_reserved` – тип зарезервированного номера
* `lead_time` – кол-во дней между датой бронирования и датой прибытия
* `arrival_year` – год прибытия
* `arrival_month` – месяц прибытия
* `arrival_date` – день прибытия
* `market_segment_type` – маркетинговый сегмент
* `repeated_guest` – является ли клиент постоянным гостем?
* `no_of_previous_cancellations` – кол-во отменённых бронирований перед текущим
* `no_of_previous_bookings_not_canceled` – кол-во предыдущих бронирований, которые не были отменены
* `avg_price_per_room` – средняя стоимость бронирования (в евро)
* `no_of_special_requests` – кол-во специальных условий в бронировании (например, кондиционер, номер на первом этаже, поздний заезд)
* `booking_status` – было ли бронирование отменено?


In [5]:
import pandas as pd
import numpy as np
import itertools
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import SGDRegressor, Ridge, Lasso
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, GradientBoostingClassifier
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

import time
start_time = time.time()

## Задание 1

Прочитайте файл датасета `hotel_reservations.csv` в объект DataFrame, выведите первые 10 строк и последние 10 строк. Выведите информацию об объекте DataFrame.

In [6]:
df = pd.read_csv('hotel_reservations.csv', sep=',', low_memory=False)

In [7]:
df.head(10)

Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
0,INN00001,2.0,0,1,2,Meal Plan 1,0,Room_Type 1,224,2017,10,2,Offline,0,0,0,65.0,0,Not_Canceled
1,INN00002,2.0,0,2,3,Not Selected,0,Room_Type 1,5,2018,11,6,Online,0,0,0,106.68,1,Not_Canceled
2,INN00003,1.0,0,2,1,Meal Plan 1,0,Room_Type 1,1,2018,2,28,Online,0,0,0,60.0,0,Canceled
3,INN00004,2.0,0,0,2,Meal Plan 1,0,Room_Type 1,211,2018,5,20,Online,0,0,0,100.0,0,Canceled
4,INN00005,2.0,0,1,1,Not Selected,0,Room_Type 1,48,2018,4,11,Online,0,0,0,94.5,0,Canceled
5,INN00006,2.0,0,0,2,Meal Plan 2,0,Room_Type 1,346,2018,9,13,Online,0,0,0,115.0,1,Canceled
6,INN00007,2.0,0,1,3,Meal Plan 1,0,Room_Type 1,34,2017,10,15,Online,0,0,0,107.55,1,Not_Canceled
7,INN00008,2.0,0,1,3,Meal Plan 1,0,Room_Type 4,83,2018,12,26,Online,0,0,0,105.61,1,Not_Canceled
8,INN00009,3.0,0,0,4,Meal Plan 1,0,Room_Type 1,121,2018,7,6,Offline,0,0,0,96.9,1,Not_Canceled
9,INN00010,2.0,0,0,5,Meal Plan 1,0,Room_Type 4,44,2018,10,18,Online,0,0,0,133.44,3,Not_Canceled


In [8]:
df.tail(10)

Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
36271,INN36267,2.0,0,2,2,Meal Plan 1,0,Room_Type 2,8,2018,3,4,Online,0,0,0,85.96,1,Canceled
36272,INN36268,2.0,0,1,0,Not Selected,0,Room_Type 1,49,2018,7,11,Online,0,0,0,93.15,0,Canceled
36273,INN36269,1.0,0,0,3,Meal Plan 1,0,Room_Type 1,166,2018,11,1,Offline,0,0,0,110.0,0,Canceled
36274,INN36270,2.0,2,0,1,Meal Plan 1,0,Room_Type 6,0,2018,10,6,Online,0,0,0,216.0,0,Canceled
36275,INN36271,3.0,0,2,6,Meal Plan 1,0,Room_Type 4,85,2018,8,3,Online,0,0,0,167.8,1,Not_Canceled
36276,INN36272,2.0,0,1,3,Meal Plan 1,0,Room_Type 1,228,2018,10,17,Online,0,0,0,90.95,2,Canceled
36277,INN36273,2.0,0,2,6,Meal Plan 1,0,Room_Type 1,148,2018,7,1,Online,0,0,0,98.39,2,Not_Canceled
36278,INN36274,2.0,0,0,3,Not Selected,0,Room_Type 1,63,2018,4,21,Online,0,0,0,94.5,0,Canceled
36279,INN36275,2.0,0,1,2,Meal Plan 1,0,Room_Type 1,207,2018,12,30,Offline,0,0,0,161.67,0,Not_Canceled
36280,INN35327,2.0,0,1,0,Not Selected,0,Room_Type 1,69,2018,9,12,Online,0,0,0,125.1,1,Not_Canceled


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36281 entries, 0 to 36280
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Booking_ID                            36281 non-null  object 
 1   no_of_adults                          36278 non-null  float64
 2   no_of_children                        36281 non-null  int64  
 3   no_of_weekend_nights                  36281 non-null  int64  
 4   no_of_week_nights                     36281 non-null  int64  
 5   type_of_meal_plan                     36281 non-null  object 
 6   required_car_parking_space            36281 non-null  int64  
 7   room_type_reserved                    36277 non-null  object 
 8   lead_time                             36281 non-null  int64  
 9   arrival_year                          36281 non-null  object 
 10  arrival_month                         36281 non-null  int64  
 11  arrival_date   

## Задание 2

В датасете могут быть проблемы с данными. Проверьте набор данных:

* на некорректные значения,
* на пропуски,
* на дубликаты (частичные и полные).

При наличии проблем, исправьте их.

*при наличии пропусков их обязательно заполнять, удаление строк или столбцов с пропусками запрещено.

Проверка значений на пропуски

In [10]:
df['arrival_year'] = pd.to_numeric(df['arrival_year'], errors='coerce')

In [11]:
columns_with_skips = []
for df_column in df.columns:
    count_isna_cells = pd.isna(df[df_column]).sum()
    if count_isna_cells > 0:
        columns_with_skips.append(df_column)
        print(f'Колонка "{df_column}" имеет пропуски ({count_isna_cells})')

Колонка "no_of_adults" имеет пропуски (3)
Колонка "room_type_reserved" имеет пропуски (4)
Колонка "arrival_year" имеет пропуски (4)


In [12]:
df[df.isna().any(axis=1)]

Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
4985,INN04985,2.0,0,0,2,Meal Plan 1,0,Room_Type 1,37,,11,17,Online,0,0,0,104.0,2,Not_Canceled
8082,INN08082,1.0,0,1,0,Meal Plan 1,0,Room_Type 1,0,,1,25,Corporate,0,0,0,79.0,0,Not_Canceled
20215,INN20214,2.0,1,1,2,Meal Plan 1,0,Room_Type 1,9,,6,17,Offline,0,0,0,95.0,1,Not_Canceled
20529,INN20528,1.0,0,0,3,Meal Plan 1,0,Room_Type 1,174,,9,22,Offline,0,0,0,95.67,0,Not_Canceled
20577,INN20576,,0,0,2,Meal Plan 1,0,Room_Type 4,20,2018.0,6,28,Online,0,0,0,156.0,1,Canceled
21027,INN21026,,0,0,3,Meal Plan 1,0,Room_Type 1,279,2018.0,10,12,Offline,0,0,0,110.0,0,Canceled
21506,INN21505,,0,1,4,Meal Plan 1,0,Room_Type 1,35,2018.0,8,29,Online,0,0,0,111.42,2,Canceled
21981,INN21980,2.0,0,0,2,Meal Plan 1,0,,151,2018.0,1,19,Offline,0,0,0,86.5,0,Not_Canceled
22417,INN22415,2.0,0,0,2,Meal Plan 2,0,,346,2018.0,9,13,Offline,0,0,0,115.0,1,Canceled
22807,INN22805,2.0,0,0,2,Meal Plan 1,0,,74,2017.0,10,28,Online,0,0,0,89.25,1,Not_Canceled


In [13]:
df['no_of_adults'] = df['no_of_adults'].fillna(df['no_of_adults'].mode()[0])
df['room_type_reserved'] = df['room_type_reserved'].fillna(df['room_type_reserved'].mode()[0])
df['arrival_year'] = df['arrival_year'].fillna(df['arrival_year'].mode()[0])

Проверка значений на полные дубликаты

In [14]:
df[df.duplicated(keep = False)].sort_values('Booking_ID')

Unnamed: 0,Booking_ID,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
4524,INN04524,2.0,2,1,3,Meal Plan 1,0,Room_Type 6,73,2018.0,8,8,Online,0,0,0,207.9,0,Canceled
30373,INN04524,2.0,2,1,3,Meal Plan 1,0,Room_Type 6,73,2018.0,8,8,Online,0,0,0,207.9,0,Canceled
4520,INN08771,2.0,1,0,1,Meal Plan 1,0,Room_Type 1,45,2018.0,9,15,Online,0,0,0,152.1,1,Not_Canceled
8772,INN08771,2.0,1,0,1,Meal Plan 1,0,Room_Type 1,45,2018.0,9,15,Online,0,0,0,152.1,1,Not_Canceled
35331,INN35327,2.0,0,1,0,Not Selected,0,Room_Type 1,69,2018.0,9,12,Online,0,0,0,125.1,1,Not_Canceled
36280,INN35327,2.0,0,1,0,Not Selected,0,Room_Type 1,69,2018.0,9,12,Online,0,0,0,125.1,1,Not_Canceled


In [15]:
df = df.drop_duplicates(keep='first')

Проверка значений на частичные дубликаты

In [16]:
columns_without_id = list(set(df.columns) - set(['Booking_ID']))
df.groupby(columns_without_id, as_index=False).size()

Unnamed: 0,no_of_children,no_of_weekend_nights,arrival_year,type_of_meal_plan,no_of_special_requests,repeated_guest,no_of_previous_cancellations,lead_time,arrival_month,room_type_reserved,market_segment_type,avg_price_per_room,no_of_week_nights,no_of_adults,booking_status,no_of_previous_bookings_not_canceled,arrival_date,required_car_parking_space,size
0,0,0,2017.0,Meal Plan 1,0,0,0,0,8,Room_Type 1,Complementary,0.00,1,1.0,Not_Canceled,0,5,0,1
1,0,0,2017.0,Meal Plan 1,0,0,0,0,8,Room_Type 1,Complementary,0.00,1,2.0,Not_Canceled,0,6,0,1
2,0,0,2017.0,Meal Plan 1,0,0,0,0,8,Room_Type 1,Complementary,0.00,1,2.0,Not_Canceled,0,11,0,1
3,0,0,2017.0,Meal Plan 1,0,0,0,0,8,Room_Type 1,Complementary,0.00,2,2.0,Not_Canceled,0,4,0,1
4,0,0,2017.0,Meal Plan 1,0,0,0,0,8,Room_Type 1,Complementary,0.00,2,2.0,Not_Canceled,0,11,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25998,3,2,2017.0,Meal Plan 1,0,0,0,0,8,Room_Type 6,Online,153.00,0,2.0,Not_Canceled,0,9,0,1
25999,3,2,2018.0,Meal Plan 1,1,0,0,103,8,Room_Type 7,Online,198.98,5,2.0,Not_Canceled,0,2,0,1
26000,9,2,2017.0,Meal Plan 1,0,0,0,11,10,Room_Type 1,Corporate,95.00,1,1.0,Not_Canceled,0,11,0,1
26001,9,2,2017.0,Meal Plan 1,1,0,0,8,8,Room_Type 2,Online,76.50,5,2.0,Canceled,0,13,0,1


При удалении частичных дубликатов (находим по множеству колонок с исключенным идентификатором) предлагается к удалению больше 10000 записей. Поэтому их не трогаем.

## Задание 3

Подготовьте данные к обучению: отделите целевую переменную, закодируйте категориальные переменные и сохраните. Посчитайте коэффициент линейной корреляции, удалите высококоррелирующие признаки.

In [17]:
df.loc[df['booking_status'] == 'Not_Canceled', 'booking_status'] = 1
df.loc[df['booking_status'] == 'Canceled', 'booking_status'] = 0

target = df['booking_status']

features = df.drop(columns=['booking_status'])
features.info()

<class 'pandas.core.frame.DataFrame'>
Index: 36278 entries, 0 to 36279
Data columns (total 18 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Booking_ID                            36278 non-null  object 
 1   no_of_adults                          36278 non-null  float64
 2   no_of_children                        36278 non-null  int64  
 3   no_of_weekend_nights                  36278 non-null  int64  
 4   no_of_week_nights                     36278 non-null  int64  
 5   type_of_meal_plan                     36278 non-null  object 
 6   required_car_parking_space            36278 non-null  int64  
 7   room_type_reserved                    36278 non-null  object 
 8   lead_time                             36278 non-null  int64  
 9   arrival_year                          36278 non-null  float64
 10  arrival_month                         36278 non-null  int64  
 11  arrival_date        

In [18]:
features.drop(columns=['Booking_ID'], inplace=True)

In [19]:
encoder = LabelEncoder()
features['type_of_meal_plan'] = encoder.fit_transform(features['type_of_meal_plan'])
features['market_segment_type'] = encoder.fit_transform(features['market_segment_type'])
features['room_type_reserved'] = encoder.fit_transform(features['room_type_reserved'])


In [20]:
corr_bound = 0.75
corr_of_features = features.corr()

corr_of_features[((corr_of_features >= corr_bound) | (corr_of_features <= -corr_bound)) & (corr_of_features != 1.000)]

Unnamed: 0,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests
no_of_adults,,,,,,,,,,,,,,,,,
no_of_children,,,,,,,,,,,,,,,,,
no_of_weekend_nights,,,,,,,,,,,,,,,,,
no_of_week_nights,,,,,,,,,,,,,,,,,
type_of_meal_plan,,,,,,,,,,,,,,,,,
required_car_parking_space,,,,,,,,,,,,,,,,,
room_type_reserved,,,,,,,,,,,,,,,,,
lead_time,,,,,,,,,,,,,,,,,
arrival_year,,,,,,,,,,,,,,,,,
arrival_month,,,,,,,,,,,,,,,,,


Высококоррелирующих признаков не обнаружено

## Задание 4

Подготовьте данные для обучения линейных моделей: с помощью нормализации приведите все непрерывные признаки к промежутку `[0, 1]`, все категориальные признаки закодируйте с помощью One-Hot Encoding или Label Encoding. Обучите несколько линейных моделей с гиперпараметрами по умолчанию с помощью:
* разделения выборки на обучающую и тестовую;
* *кросс-валидации на 5 фолдов.

Подберите гиперпараметры для выбранных моделей. Выберите 3 наилучшие обученные модели по выбранным выше метрикам.

In [21]:
features_before_normalization = features.copy()

for features_column in features.columns:
    min_value_of_feature = min(features[features_column])
    max_value_of_feature = max(features[features_column])

    features[features_column] = (features[features_column] - min_value_of_feature) / (max_value_of_feature - min_value_of_feature)
    print(f'{features_column}: Min = {min_value_of_feature}, Max = {max_value_of_feature}')

display(features)

no_of_adults: Min = 0.0, Max = 4.0
no_of_children: Min = 0, Max = 10
no_of_weekend_nights: Min = 0, Max = 7
no_of_week_nights: Min = 0, Max = 17
type_of_meal_plan: Min = 0, Max = 3
required_car_parking_space: Min = 0, Max = 4
room_type_reserved: Min = 0, Max = 6
lead_time: Min = 0, Max = 443
arrival_year: Min = 2017.0, Max = 2018.0
arrival_month: Min = 1, Max = 12
arrival_date: Min = 1, Max = 31
market_segment_type: Min = 0, Max = 4
repeated_guest: Min = 0, Max = 1
no_of_previous_cancellations: Min = 0, Max = 13
no_of_previous_bookings_not_canceled: Min = 0, Max = 58
avg_price_per_room: Min = 0.0, Max = 540.0
no_of_special_requests: Min = 0, Max = 5


Unnamed: 0,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests
0,0.50,0.0,0.142857,0.117647,0.0,0.0,0.0,0.505643,0.0,0.818182,0.033333,0.75,0.0,0.0,0.0,0.120370,0.0
1,0.50,0.0,0.285714,0.176471,1.0,0.0,0.0,0.011287,1.0,0.909091,0.166667,1.00,0.0,0.0,0.0,0.197556,0.2
2,0.25,0.0,0.285714,0.058824,0.0,0.0,0.0,0.002257,1.0,0.090909,0.900000,1.00,0.0,0.0,0.0,0.111111,0.0
3,0.50,0.0,0.000000,0.117647,0.0,0.0,0.0,0.476298,1.0,0.363636,0.633333,1.00,0.0,0.0,0.0,0.185185,0.0
4,0.50,0.0,0.142857,0.058824,1.0,0.0,0.0,0.108352,1.0,0.272727,0.333333,1.00,0.0,0.0,0.0,0.175000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36275,0.75,0.0,0.285714,0.352941,0.0,0.0,0.5,0.191874,1.0,0.636364,0.066667,1.00,0.0,0.0,0.0,0.310741,0.2
36276,0.50,0.0,0.142857,0.176471,0.0,0.0,0.0,0.514673,1.0,0.818182,0.533333,1.00,0.0,0.0,0.0,0.168426,0.4
36277,0.50,0.0,0.285714,0.352941,0.0,0.0,0.0,0.334086,1.0,0.545455,0.000000,1.00,0.0,0.0,0.0,0.182204,0.4
36278,0.50,0.0,0.000000,0.176471,1.0,0.0,0.0,0.142212,1.0,0.272727,0.666667,1.00,0.0,0.0,0.0,0.175000,0.0


In [22]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((25394, 17), (25394,), (10884, 17), (10884,))

In [23]:
models_rating = []

#### Обучение моделей на основе разделения выборки на обучающую и тестовую

1. SGDRegressor

In [24]:
sgd_regressor = SGDRegressor(random_state=42)
sgd_regressor.fit(X_train, y_train)
y_pred =  [round(x) for x in sgd_regressor.predict(X_test)]

accuracy = accuracy_score(list(y_test.values), y_pred)
print(accuracy)

models_rating.append({
    'model_name': 'sgd_regressor',
    'split_method': 'train-test',
    'params': 'default',
    'accuracy': accuracy
})

0.7771958838662256


SGDRegressor может принимать следующие гиперпараметры:

    * alpha
    * penalty
    * l1_ratio
    
Сгенерируем возможные варианты для проверки по этим параметрам

Будем использовать следующие следущие наборы значений для каждого из них:

In [25]:
alpha_various = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1]
penalty_various = ['l2', 'l1', 'elasticnet', None]
l1_ratio_various = [0.1, 0.3, 0.5, 0.7, 0.9]

In [26]:
sgd_regressor_params_set = [alpha_various, penalty_various, l1_ratio_various]
sgd_regressor_params = list(itertools.product(*sgd_regressor_params_set))
print(f'{len(sgd_regressor_params)} кобинаций')

120 кобинаций


In [27]:
# Подбор гиперпараметров

current_max_accuracy = 0
for sgd_param in sgd_regressor_params:
    alpha = sgd_param[0]
    penalty = sgd_param[1]
    l1_ratio = sgd_param[2]
    sgd_regressor = SGDRegressor(random_state=42, alpha=alpha, penalty=penalty, l1_ratio=l1_ratio)
    sgd_regressor.fit(X_train, y_train)
    y_pred = [round(x) for x in sgd_regressor.predict(X_test)]
    accuracy = accuracy_score(list(y_test.values), y_pred)

    if accuracy > current_max_accuracy:
        print(f'Alpha = {alpha}, penalty = {penalty}, l1_ratio = {l1_ratio}, accuracy_score is {accuracy}')
        models_rating.append({
            'model_name': 'sgd_regressor',
            'split_method': 'train-test',
            'params': f'alpha = {alpha}, penalty = {penalty}, l1_ratio = {l1_ratio}',
            'accuracy': accuracy
        })
        current_max_accuracy = accuracy

Alpha = 0.0001, penalty = l2, l1_ratio = 0.1, accuracy_score is 0.7771958838662256
Alpha = 0.0001, penalty = l1, l1_ratio = 0.1, accuracy_score is 0.7774715178243293


2. Ridge

In [28]:
ridge = Ridge(random_state=42)
ridge.fit(X_train, y_train)
y_pred =  [round(x) for x in ridge.predict(X_test)]

accuracy = accuracy_score(list(y_test.values), y_pred)
print(accuracy)

models_rating.append({
    'model_name': 'ridge',
    'split_method': 'train-test',
    'params': 'default',
    'accuracy': accuracy
})

0.7860161705255421


In [29]:
current_max_accuracy = 0
for alpha_param in alpha_various:
    ridge = Ridge(random_state=42, alpha=alpha_param)
    ridge.fit(X_train, y_train)
    y_pred = [round(x) for x in ridge.predict(X_test)]
    accuracy = accuracy_score(list(y_test.values), y_pred)

    if accuracy > current_max_accuracy:
        print(f'Alpha = {alpha}, accuracy_score is {accuracy}')
        models_rating.append({
            'model_name': 'ridge',
            'split_method': 'train-test',
            'params': f'alpha = {alpha}',
            'accuracy': accuracy
        })
        current_max_accuracy = accuracy

Alpha = 0.1, accuracy_score is 0.7861080485115767


3. Lasso

In [30]:
lasso = Lasso(random_state=42)
lasso.fit(X_train, y_train)
y_pred =  [round(x) for x in lasso.predict(X_test)]

accuracy = accuracy_score(list(y_test.values), y_pred)
print(accuracy)

models_rating.append({
    'model_name': 'lasso',
    'split_method': 'train-test',
    'params': 'default',
    'accuracy': accuracy
})

0.6752113193678795


In [31]:
current_max_accuracy = 0
for alpha_param in alpha_various:
    lasso = Lasso(random_state=42, alpha=alpha_param)
    lasso.fit(X_train, y_train)
    y_pred = [round(x) for x in lasso.predict(X_test)]
    accuracy = accuracy_score(list(y_test.values), y_pred)

    if accuracy > current_max_accuracy:
        print(f'Alpha = {alpha}, accuracy_score is {accuracy}')
        models_rating.append({
            'model_name': 'lasso',
            'split_method': 'train-test',
            'params': f'alpha = {alpha}',
            'accuracy': accuracy
        })
        current_max_accuracy = accuracy

Alpha = 0.1, accuracy_score is 0.785832414553473
Alpha = 0.1, accuracy_score is 0.7883131201764058


#### Обучение моделей на основе кросс-валидации на 5 фолдов

1. SGDRegressor

In [32]:
sgd_regressor = SGDRegressor(random_state=42)
five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
fold_num = 1
for train, test in five_folds_model_selection.split(features, target):
    X_train = features.iloc[train]
    y_train = target.iloc[train]

    X_test = features.iloc[test]
    y_test = target.iloc[test]

    sgd_regressor.fit(X_train, y_train)
    y_pred = [round(x) for x in sgd_regressor.predict(X_test)]
    accuracy = accuracy_score(list(y_test.values), y_pred)

    print(f'Accuracy for fold {fold_num} with default hyper parameters is {accuracy}')

    models_rating.append({
        'model_name': 'sgd_regressor',
        'split_method': 'kfolds',
        'params': f'default, fold = {fold_num}',
        'accuracy': accuracy
    })

    fold_num += 1

Accuracy for fold 1 with default hyper parameters is 0.7767364939360529
Accuracy for fold 2 with default hyper parameters is 0.7779768467475193
Accuracy for fold 3 with default hyper parameters is 0.778941565600882
Accuracy for fold 4 with default hyper parameters is 0.7798759476223295
Accuracy for fold 5 with default hyper parameters is 0.7786354238456237


In [33]:
current_max_accuracy = 0
for sgd_param in sgd_regressor_params:
    alpha = sgd_param[0]
    penalty = sgd_param[1]
    l1_ratio = sgd_param[2]

    sgd_regressor = SGDRegressor(random_state=42, alpha=alpha, penalty=penalty, l1_ratio=l1_ratio)

    fold_num = 1
    five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
    for train, test in five_folds_model_selection.split(features, target):
        X_train = features.iloc[train]
        y_train = target.iloc[train]

        X_test = features.iloc[test]
        y_test = target.iloc[test]

        sgd_regressor.fit(X_train, y_train)
        y_pred = [round(x) for x in sgd_regressor.predict(X_test)]
        accuracy = accuracy_score(list(y_test.values), y_pred)

        if accuracy > current_max_accuracy:
            print(f'Accuracy for fold {fold_num} with alpha = {alpha}, penalty = {penalty}, l1_ratio = {l1_ratio} is {accuracy}')
            models_rating.append({
                'model_name': 'sgd_regressor',
                'split_method': 'kfolds',
                'params': f'alpha = {alpha}, penalty = {penalty}, l1_ratio = {l1_ratio}, fold = {fold_num}',
                'accuracy': accuracy
            })
            current_max_accuracy = accuracy
        fold_num += 1

Accuracy for fold 1 with alpha = 0.0001, penalty = l2, l1_ratio = 0.1 is 0.7767364939360529
Accuracy for fold 2 with alpha = 0.0001, penalty = l2, l1_ratio = 0.1 is 0.7779768467475193
Accuracy for fold 3 with alpha = 0.0001, penalty = l2, l1_ratio = 0.1 is 0.778941565600882
Accuracy for fold 4 with alpha = 0.0001, penalty = l2, l1_ratio = 0.1 is 0.7798759476223295
Accuracy for fold 4 with alpha = 0.0001, penalty = elasticnet, l1_ratio = 0.5 is 0.7800137835975189
Accuracy for fold 4 with alpha = 0.0005, penalty = l2, l1_ratio = 0.1 is 0.7801516195727085


2. Ridge

In [34]:
fold_num = 1
ridge = Ridge(random_state=42)
five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
for train, test in five_folds_model_selection.split(features, target):
    X_train = features.iloc[train]
    y_train = target.iloc[train]

    X_test = features.iloc[test]
    y_test = target.iloc[test]

    ridge.fit(X_train, y_train)
    y_pred = [round(x) for x in ridge.predict(X_test)]
    accuracy = accuracy_score(list(y_test.values), y_pred)

    print(f'Accuracy for fold {fold_num} with default hyper parameters is {accuracy}')

    models_rating.append({
        'model_name': 'ridge',
        'split_method': 'kfolds',
        'params': f'default, fold = {fold_num}',
        'accuracy': accuracy
    })

    fold_num += 1

Accuracy for fold 1 with default hyper parameters is 0.7872105843439912
Accuracy for fold 2 with default hyper parameters is 0.7902425578831312
Accuracy for fold 3 with default hyper parameters is 0.7867971334068358
Accuracy for fold 4 with default hyper parameters is 0.7910406616126809
Accuracy for fold 5 with default hyper parameters is 0.7891109579600276


In [35]:
current_max_accuracy = 0
for alpha_param in alpha_various:
    ridge = Ridge(random_state=42, alpha=alpha_param)

    fold_num = 1
    five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
    for train, test in five_folds_model_selection.split(features, target):
        X_train = features.iloc[train]
        y_train = target.iloc[train]

        X_test = features.iloc[test]
        y_test = target.iloc[test]

        ridge.fit(X_train, y_train)
        y_pred = [round(x) for x in ridge.predict(X_test)]
        accuracy = accuracy_score(list(y_test.values), y_pred)

        if accuracy > current_max_accuracy:
            print(f'Accuracy for fold {fold_num} with alpha = {alpha} is {accuracy}')
            models_rating.append({
                'model_name': 'ridge',
                'split_method': 'kfolds',
                'params': f'alpha = {alpha}, fold = {fold_num}',
                'accuracy': accuracy
            })
            current_max_accuracy = accuracy
        fold_num += 1

Accuracy for fold 1 with alpha = 0.1 is 0.787348401323043
Accuracy for fold 2 with alpha = 0.1 is 0.7902425578831312
Accuracy for fold 4 with alpha = 0.1 is 0.7911784975878704


3. Lasso

In [36]:
fold_num = 1
lasso = Lasso(random_state=42)
five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
for train, test in five_folds_model_selection.split(features, target):
    X_train = features.iloc[train]
    y_train = target.iloc[train]

    X_test = features.iloc[test]
    y_test = target.iloc[test]

    lasso.fit(X_train, y_train)
    y_pred = [round(x) for x in lasso.predict(X_test)]
    accuracy = accuracy_score(list(y_test.values), y_pred)

    print(f'Accuracy for fold {fold_num} with default hyper parameters is {accuracy}')

    models_rating.append({
        'model_name': 'lasso',
        'split_method': 'kfolds',
        'params': f'default, fold = {fold_num}',
        'accuracy': accuracy
    })

    fold_num += 1

Accuracy for fold 1 with default hyper parameters is 0.6733737596471885
Accuracy for fold 2 with default hyper parameters is 0.668412348401323
Accuracy for fold 3 with default hyper parameters is 0.6702039691289967
Accuracy for fold 4 with default hyper parameters is 0.6767746381805652
Accuracy for fold 5 with default hyper parameters is 0.6729152308752584


In [37]:
current_max_accuracy = 0
for alpha_param in alpha_various:
    lasso = Lasso(random_state=42, alpha=alpha_param)

    fold_num = 1
    five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
    for train, test in five_folds_model_selection.split(features, target):
        X_train = features.iloc[train]
        y_train = target.iloc[train]

        X_test = features.iloc[test]
        y_test = target.iloc[test]

        lasso.fit(X_train, y_train)
        y_pred = [round(x) for x in lasso.predict(X_test)]
        accuracy = accuracy_score(list(y_test.values), y_pred)

        if accuracy > current_max_accuracy:
            print(f'Accuracy for fold {fold_num} with alpha = {alpha} is {accuracy}')
            models_rating.append({
                'model_name': 'lasso',
                'split_method': 'kfolds',
                'params': f'alpha = {alpha}, fold = {fold_num}',
                'accuracy': accuracy
            })
            current_max_accuracy = accuracy
        fold_num += 1

Accuracy for fold 1 with alpha = 0.1 is 0.7865214994487321
Accuracy for fold 2 with alpha = 0.1 is 0.790380374862183
Accuracy for fold 4 with alpha = 0.1 is 0.7911784975878704


#### Лучшие результаты

In [38]:
models_rating.sort(key=lambda x: x['accuracy'], reverse=True)
top_3 = models_rating[:3]

for result in top_3:
    print(f"Model name: {result['model_name']}, params: {result['params']}, accuracy: {result['accuracy']}")

Model name: ridge, params: alpha = 0.1, fold = 4, accuracy: 0.7911784975878704
Model name: lasso, params: alpha = 0.1, fold = 4, accuracy: 0.7911784975878704
Model name: ridge, params: default, fold = 4, accuracy: 0.7910406616126809


## Задание 5

С помощью
* разделения выборки на обучающую и тестовую,
* *кросс-валидации на 5 фолдов,

подберите гиперпараметры для моделей дерева решений, случайного леса, бэггинга и градиентного бустинга на деревьях решений, обучая модели на исходном датасете и на датасете, на котором обучались линейные модели. Учитывая результаты предыдущего задания, выберите 3 наилучшие обученные модели по метрикам.

#### Разделение выборки на случайную и обучающую

In [39]:
X_train_features, X_test_features, y_train_features, y_test_features = train_test_split(features, target, test_size=0.3, random_state=42)
y_train_features = np.asarray(y_train_features, dtype=np.float64)
X_train_features.shape, y_train_features.shape, X_test_features.shape, y_test_features.shape

((25394, 17), (25394,), (10884, 17), (10884,))

In [40]:
X_train_before_norm, X_test_before_norm, y_train_before_norm, y_test_before_norm = train_test_split(features_before_normalization, target, test_size=0.3, random_state=42)
y_train_before_norm = np.asarray(y_train_before_norm, dtype=np.float64)
X_train_before_norm.shape, y_train_before_norm.shape, X_test_before_norm.shape, y_test_before_norm.shape

((25394, 17), (25394,), (10884, 17), (10884,))

1. Дерево решений

DecisionTreeClassifier может принимать следующие гиперпараметры:

    * max_depth
    * criterion
    
Сгенерируем возможные варианты для проверки по этим параметрам

Будем использовать следующие следущие наборы значений для каждого из них:

In [41]:
max_depth = [2, 4, 6, 8, 10, 12]
criterion = ['gini', 'entropy', 'log_loss']

In [42]:
dtc_params_set = [max_depth, criterion]
dtc_params = list(itertools.product(*dtc_params_set))
print(f'{len(dtc_params)} кобинаций')

18 кобинаций


In [43]:
display(X_train_features)

Unnamed: 0,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests
6810,0.50,0.0,0.000000,0.058824,1.000000,0.0,0.0,0.002257,1.0,0.727273,0.766667,1.00,0.0,0.0,0.0,0.333333,0.0
27783,0.25,0.1,0.142857,0.058824,0.333333,0.0,0.0,0.295711,1.0,0.545455,0.800000,1.00,0.0,0.0,0.0,0.304167,0.0
4026,0.50,0.0,0.285714,0.294118,1.000000,0.0,0.0,0.582393,1.0,0.818182,0.033333,1.00,0.0,0.0,0.0,0.166852,0.0
18651,0.25,0.0,0.285714,0.058824,0.333333,0.0,0.0,0.006772,0.0,0.545455,0.333333,0.75,0.0,0.0,0.0,0.134259,0.0
6325,0.50,0.0,0.142857,0.000000,0.000000,0.0,0.0,0.198646,1.0,0.545455,0.766667,1.00,0.0,0.0,0.0,0.195000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16851,0.25,0.0,0.000000,0.176471,0.000000,0.0,0.0,0.374718,1.0,0.909091,0.000000,0.75,0.0,0.0,0.0,0.203704,0.0
6265,0.50,0.0,0.142857,0.176471,0.000000,0.0,0.0,0.013544,1.0,1.000000,0.133333,1.00,0.0,0.0,0.0,0.163704,0.2
11285,0.50,0.0,0.285714,0.117647,0.000000,0.0,0.0,0.002257,0.0,0.818182,0.733333,1.00,0.0,0.0,0.0,0.178333,0.2
860,0.50,0.0,0.000000,0.176471,0.000000,0.0,0.0,0.480813,1.0,0.454545,0.200000,1.00,0.0,0.0,0.0,0.240741,0.0


In [44]:
current_max_accuracy = 0
for dtc_param in dtc_params:
    max_depth = dtc_param[0]
    criterion = dtc_param[1]

    dtc = DecisionTreeClassifier(random_state=42, max_depth=max_depth, criterion=criterion)
    dtc.fit(X_train_features, y_train_features)
    y_pred_features = [round(x) for x in dtc.predict(X_test_features)]
    accuracy = accuracy_score(list(y_test_features.values), y_pred_features)

    if accuracy > current_max_accuracy:
        print(f'max_depth = {max_depth}, criterion = {criterion}, accuracy_score is {accuracy}')
        top_3.append({
            'model_name': 'dtc',
            'split_method': 'train-test',
            'params': f'normalized data, max_depth = {max_depth}, criterion = {criterion}',
            'accuracy': accuracy
        })
        current_max_accuracy = accuracy

max_depth = 2, criterion = gini, accuracy_score is 0.7612091142962146
max_depth = 4, criterion = gini, accuracy_score is 0.8263506063947078
max_depth = 6, criterion = gini, accuracy_score is 0.8403160602719588
max_depth = 8, criterion = gini, accuracy_score is 0.8582322675486953
max_depth = 10, criterion = gini, accuracy_score is 0.8656743844174936
max_depth = 12, criterion = gini, accuracy_score is 0.8742190371187064


In [45]:
display(X_train_before_norm)

Unnamed: 0,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests
6810,2.0,0,0,1,3,0,0,1,2018.0,9,24,4,0,0,0,180.00,0
27783,1.0,1,1,1,1,0,0,131,2018.0,7,25,4,0,0,0,164.25,0
4026,2.0,0,2,5,3,0,0,258,2018.0,10,2,4,0,0,0,90.10,0
18651,1.0,0,2,1,1,0,0,3,2017.0,7,11,3,0,0,0,72.50,0
6325,2.0,0,1,0,0,0,0,88,2018.0,7,24,4,0,0,0,105.30,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16851,1.0,0,0,3,0,0,0,166,2018.0,11,1,3,0,0,0,110.00,0
6265,2.0,0,1,3,0,0,0,6,2018.0,12,5,4,0,0,0,88.40,1
11285,2.0,0,2,2,0,0,0,1,2017.0,10,23,4,0,0,0,96.30,1
860,2.0,0,0,3,0,0,0,213,2018.0,6,7,4,0,0,0,130.00,0


In [46]:
current_max_accuracy = 0
for dtc_param in dtc_params:
    max_depth = dtc_param[0]
    criterion = dtc_param[1]

    dtc = DecisionTreeClassifier(random_state=42, max_depth=max_depth, criterion=criterion)
    dtc.fit(X_train_before_norm, y_train_before_norm)
    y_pred_before_norm = [round(x) for x in dtc.predict(X_test_before_norm)]
    accuracy = accuracy_score(list(y_test_before_norm.values), y_pred_before_norm)

    if accuracy > current_max_accuracy:
        print(f'max_depth = {max_depth}, criterion = {criterion}, accuracy_score is {accuracy}')
        top_3.append({
            'model_name': 'dtc',
            'split_method': 'train-test',
            'params': f'data before normalization, max_depth = {max_depth}, criterion = {criterion}',
            'accuracy': accuracy
        })
        current_max_accuracy = accuracy

max_depth = 2, criterion = gini, accuracy_score is 0.7612091142962146
max_depth = 4, criterion = gini, accuracy_score is 0.8263506063947078
max_depth = 6, criterion = gini, accuracy_score is 0.8403160602719588
max_depth = 8, criterion = gini, accuracy_score is 0.8582322675486953
max_depth = 10, criterion = gini, accuracy_score is 0.8658581403895627
max_depth = 12, criterion = gini, accuracy_score is 0.8744027930907754


2. Случайный лес

RandomForestClassifier может принимать следующие гиперпараметры:

    * max_depth
    * criterion
    
Сгенерируем возможные варианты для проверки по этим параметрам

Будем использовать следующие следущие наборы значений для каждого из них:

In [47]:
max_depth = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22]
criterion = ['gini', 'entropy', 'log_loss']

In [48]:
rfc_params_set = [max_depth, criterion]
rfc_params = list(itertools.product(*rfc_params_set))
print(f'{len(rfc_params)} кобинаций')

33 кобинаций


In [49]:
current_max_accuracy = 0
for rfc_param in rfc_params:
    max_depth = rfc_param[0]
    criterion = rfc_param[1]

    rfc = RandomForestClassifier(random_state=42, max_depth=max_depth, criterion=criterion)
    rfc.fit(X_train_features, y_train_features)
    y_pred_features = [round(x) for x in rfc.predict(X_test_features)]
    accuracy = accuracy_score(list(y_test_features.values), y_pred_features)

    if accuracy > current_max_accuracy:
        print(f'max_depth = {max_depth}, criterion = {criterion}, accuracy_score is {accuracy}')
        top_3.append({
            'model_name': 'rfc',
            'split_method': 'train-test',
            'params': f'normalized data, max_depth = {max_depth}, criterion = {criterion}',
            'accuracy': accuracy
        })
        current_max_accuracy = accuracy

max_depth = 2, criterion = gini, accuracy_score is 0.7215178243292907
max_depth = 4, criterion = gini, accuracy_score is 0.8104557148107313
max_depth = 6, criterion = gini, accuracy_score is 0.8401323042998897
max_depth = 8, criterion = gini, accuracy_score is 0.8589672914369717
max_depth = 10, criterion = gini, accuracy_score is 0.8743109151047409
max_depth = 12, criterion = gini, accuracy_score is 0.8824880558618156
max_depth = 14, criterion = gini, accuracy_score is 0.8868981991914737
max_depth = 16, criterion = gini, accuracy_score is 0.8939728041161338
max_depth = 18, criterion = gini, accuracy_score is 0.8956266078647556
max_depth = 20, criterion = gini, accuracy_score is 0.8962697537669975
max_depth = 20, criterion = entropy, accuracy_score is 0.8967291436971702
max_depth = 22, criterion = gini, accuracy_score is 0.9006798970966556


In [50]:
current_max_accuracy = 0
for rfc_param in rfc_params:
    max_depth = rfc_param[0]
    criterion = rfc_param[1]

    rfc = RandomForestClassifier(random_state=42, max_depth=max_depth, criterion=criterion)
    rfc.fit(X_train_before_norm, y_train_before_norm)
    y_pred_before_norm = [round(x) for x in rfc.predict(X_test_before_norm)]
    accuracy = accuracy_score(list(y_test_before_norm.values), y_pred_before_norm)

    if accuracy > current_max_accuracy:
        print(f'max_depth = {max_depth}, criterion = {criterion}, accuracy_score is {accuracy}')
        top_3.append({
            'model_name': 'rfc',
            'split_method': 'train-test',
            'params': f'data before normalization, max_depth = {max_depth}, criterion = {criterion}',
            'accuracy': accuracy
        })
        current_max_accuracy = accuracy

max_depth = 2, criterion = gini, accuracy_score is 0.7215178243292907
max_depth = 4, criterion = gini, accuracy_score is 0.8104557148107313
max_depth = 6, criterion = gini, accuracy_score is 0.8401323042998897
max_depth = 8, criterion = gini, accuracy_score is 0.8590591694230062
max_depth = 10, criterion = gini, accuracy_score is 0.8744027930907754
max_depth = 12, criterion = gini, accuracy_score is 0.8823042998897465
max_depth = 14, criterion = gini, accuracy_score is 0.8868063212054392
max_depth = 16, criterion = gini, accuracy_score is 0.8936971701580302
max_depth = 18, criterion = gini, accuracy_score is 0.8950753399485484
max_depth = 20, criterion = gini, accuracy_score is 0.8962697537669975
max_depth = 20, criterion = entropy, accuracy_score is 0.8966372657111357
max_depth = 22, criterion = gini, accuracy_score is 0.9002205071664829


3. Бэггинг

BaggingClassifier может принимать следующие гиперпараметры:

    * max_samples
    * n_estimators
    
Сгенерируем возможные варианты для проверки по этим параметрам

Будем использовать следующие следущие наборы значений для каждого из них:

In [51]:
max_samples = [2, 4, 6, 8, 10]
n_estimators = [5, 10, 15, 20]

In [52]:
bagging_params_set = [max_samples, n_estimators]
bagging_params = list(itertools.product(*bagging_params_set))
print(f'{len(bagging_params)} кобинаций')

20 кобинаций


In [53]:
current_max_accuracy = 0
for bagging_param in bagging_params:
    max_samples = bagging_param[0]
    n_estimators = bagging_param[1]

    bagging = BaggingClassifier(random_state=42, max_samples=max_samples, n_estimators=n_estimators)
    bagging.fit(X_train_features, y_train_features)
    y_pred_features = [round(x) for x in bagging.predict(X_test_features)]
    accuracy = accuracy_score(list(y_test_features.values), y_pred_features)

    if accuracy > current_max_accuracy:
        print(f'max_samples = {max_samples}, n_estimators = {n_estimators}, accuracy_score is {accuracy}')
        top_3.append({
            'model_name': 'bagging',
            'split_method': 'train-test',
            'params': f'normalized data, max_samples = {max_samples}, n_estimators = {n_estimators}',
            'accuracy': accuracy
        })
        current_max_accuracy = accuracy

max_samples = 2, n_estimators = 5, accuracy_score is 0.6752113193678795
max_samples = 4, n_estimators = 5, accuracy_score is 0.6866960676221977
max_samples = 8, n_estimators = 10, accuracy_score is 0.6988239617787578
max_samples = 8, n_estimators = 15, accuracy_score is 0.7375045938993017
max_samples = 8, n_estimators = 20, accuracy_score is 0.7450385887541345


In [54]:
current_max_accuracy = 0
for bagging_param in bagging_params:
    max_depth = bagging_param[0]
    criterion = bagging_param[1]

    bagging = BaggingClassifier(random_state=42, max_samples=max_samples, n_estimators=n_estimators)
    bagging.fit(X_train_before_norm, y_train_before_norm)
    y_pred_before_norm = [round(x) for x in bagging.predict(X_test_before_norm)]
    accuracy = accuracy_score(list(y_test_before_norm.values), y_pred_before_norm)

    if accuracy > current_max_accuracy:
        print(f'max_samples = {max_samples}, n_estimators = {n_estimators}, accuracy_score is {accuracy}')
        top_3.append({
            'model_name': 'bagging',
            'split_method': 'train-test',
            'params': f'data before normalization, max_samples = {max_samples}, n_estimators = {n_estimators}',
            'accuracy': accuracy
        })
        current_max_accuracy = accuracy

max_samples = 10, n_estimators = 20, accuracy_score is 0.7352076442484381


4. Градиентный бустинг

GradientBoostingClassifier может принимать следующие гиперпараметры:

    * max_features
    * n_estimators
    
Сгенерируем возможные варианты для проверки по этим параметрам

Будем использовать следующие следущие наборы значений для каждого из них:

In [55]:
max_features = ['sqrt', 'log2']
n_estimators = [5, 10, 15, 20]

In [56]:
gbc_params_set = [max_features, n_estimators]
gbc_params = list(itertools.product(*gbc_params_set))
print(f'{len(gbc_params)} кобинаций')

8 кобинаций


In [57]:
current_max_accuracy = 0
for gbc_param in gbc_params:
    max_features = gbc_param[0]
    n_estimators = gbc_param[1]

    gbc = GradientBoostingClassifier(random_state=42, max_features=max_features, n_estimators=n_estimators)
    gbc.fit(X_train_features, y_train_features)
    y_pred_features = [round(x) for x in gbc.predict(X_test_features)]
    accuracy = accuracy_score(list(y_test_features.values), y_pred_features)

    if accuracy > current_max_accuracy:
        print(f'max_features = {max_features}, n_estimators = {n_estimators}, accuracy_score is {accuracy}')
        top_3.append({
            'model_name': 'gbc',
            'split_method': 'train-test',
            'params': f'normalized data, max_features = {max_features}, n_estimators = {n_estimators}',
            'accuracy': accuracy
        })
        current_max_accuracy = accuracy

max_features = sqrt, n_estimators = 5, accuracy_score is 0.6754869533259831
max_features = sqrt, n_estimators = 10, accuracy_score is 0.7711319367879456
max_features = sqrt, n_estimators = 15, accuracy_score is 0.7861080485115767
max_features = sqrt, n_estimators = 20, accuracy_score is 0.8075156192576258


In [58]:
current_max_accuracy = 0
for gbc_param in gbc_params:
    max_features = gbc_param[0]
    criterion = gbc_param[1]

    gbc = GradientBoostingClassifier(random_state=42, max_features=max_features, n_estimators=n_estimators)
    gbc.fit(X_train_before_norm, y_train_before_norm)
    y_pred_before_norm = [round(x) for x in gbc.predict(X_test_before_norm)]
    accuracy = accuracy_score(list(y_test_before_norm.values), y_pred_before_norm)

    if accuracy > current_max_accuracy:
        print(f'max_features = {max_features}, n_estimators = {n_estimators}, accuracy_score is {accuracy}')
        top_3.append({
            'model_name': 'gbc',
            'split_method': 'train-test',
            'params': f'data before normalization, max_features = {max_features}, n_estimators = {n_estimators}',
            'accuracy': accuracy
        })
        current_max_accuracy = accuracy

max_features = sqrt, n_estimators = 20, accuracy_score is 0.8075156192576258


#### Кросс-валидация на 5 фолдов

1. Дерево решений

DecisionTreeClassifier может принимать следующие гиперпараметры:

    * max_depth
    * criterion
    
Сгенерируем возможные варианты для проверки по этим параметрам

Будем использовать следующие следущие наборы значений для каждого из них:

In [59]:
max_depth = [2, 4, 6, 8, 10, 12]
criterion = ['gini', 'entropy', 'log_loss']

In [60]:
dtc_params_set = [max_depth, criterion]
dtc_params = list(itertools.product(*dtc_params_set))
print(f'{len(dtc_params)} кобинаций')

18 кобинаций


In [61]:
current_max_accuracy = 0
for dtc_param in dtc_params:
    max_depth = dtc_param[0]
    criterion = dtc_param[1]

    dtc = DecisionTreeClassifier(random_state=42, max_depth=max_depth, criterion=criterion)
    fold_num = 1
    five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
    # Используем нормализованный датасет
    for train, test in five_folds_model_selection.split(features, target):
        X_train = features.iloc[train]
        y_train =  np.asarray(target.iloc[train], dtype=np.float64)

        X_test = features.iloc[test]
        y_test = target.iloc[test]

        dtc.fit(X_train, y_train)
        y_pred = [round(x) for x in dtc.predict(X_test)]
        accuracy = accuracy_score(list(y_test.values), y_pred)

        if accuracy > current_max_accuracy:
            print(f'Accuracy for fold {fold_num} with max_depth = {max_depth}, criterion = {criterion} is {accuracy}')
            top_3.append({
                'model_name': 'dtc',
                'split_method': 'kfolds',
                'params': f'normalized data, max_depth = {max_depth}, criterion = {criterion}',
                'accuracy': accuracy
            })
            current_max_accuracy = accuracy
        fold_num += 1

Accuracy for fold 1 with max_depth = 2, criterion = gini is 0.7619900771775082
Accuracy for fold 4 with max_depth = 2, criterion = gini is 0.763749138525155
Accuracy for fold 1 with max_depth = 4, criterion = gini is 0.8205622932745315
Accuracy for fold 3 with max_depth = 4, criterion = gini is 0.8238699007717751
Accuracy for fold 1 with max_depth = 6, criterion = gini is 0.8430264608599779
Accuracy for fold 3 with max_depth = 6, criterion = gini is 0.8519845644983461
Accuracy for fold 1 with max_depth = 8, criterion = gini is 0.8561190738699008
Accuracy for fold 2 with max_depth = 8, criterion = gini is 0.8595644983461963
Accuracy for fold 3 with max_depth = 8, criterion = gini is 0.8668687982359427
Accuracy for fold 1 with max_depth = 10, criterion = gini is 0.86742006615215
Accuracy for fold 2 with max_depth = 10, criterion = gini is 0.8707276736493936
Accuracy for fold 3 with max_depth = 10, criterion = gini is 0.8806504961411246


In [62]:
current_max_accuracy = 0
for dtc_param in dtc_params:
    max_depth = dtc_param[0]
    criterion = dtc_param[1]

    dtc = DecisionTreeClassifier(random_state=42, max_depth=max_depth, criterion=criterion)
    fold_num = 1
    five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
    # Используем датасет до нормализации
    for train, test in five_folds_model_selection.split(features_before_normalization, target):
        X_train = features_before_normalization.iloc[train]
        y_train =  np.asarray(target.iloc[train], dtype=np.float64)

        X_test = features_before_normalization.iloc[test]
        y_test = target.iloc[test]

        dtc.fit(X_train, y_train)
        y_pred = [round(x) for x in dtc.predict(X_test)]
        accuracy = accuracy_score(list(y_test.values), y_pred)

        if accuracy > current_max_accuracy:
            print(f'Accuracy for fold {fold_num} with max_depth = {max_depth}, criterion = {criterion} is {accuracy}')
            top_3.append({
                'model_name': 'dtc',
                'split_method': 'kfolds',
                'params': f'data before normalization, max_depth = {max_depth}, criterion = {criterion}',
                'accuracy': accuracy
            })
            current_max_accuracy = accuracy
        fold_num += 1

Accuracy for fold 1 with max_depth = 2, criterion = gini is 0.7619900771775082
Accuracy for fold 4 with max_depth = 2, criterion = gini is 0.763749138525155
Accuracy for fold 1 with max_depth = 4, criterion = gini is 0.8205622932745315
Accuracy for fold 3 with max_depth = 4, criterion = gini is 0.8238699007717751
Accuracy for fold 1 with max_depth = 6, criterion = gini is 0.8430264608599779
Accuracy for fold 3 with max_depth = 6, criterion = gini is 0.8519845644983461
Accuracy for fold 1 with max_depth = 8, criterion = gini is 0.8561190738699008
Accuracy for fold 2 with max_depth = 8, criterion = gini is 0.8595644983461963
Accuracy for fold 3 with max_depth = 8, criterion = gini is 0.8668687982359427
Accuracy for fold 1 with max_depth = 10, criterion = gini is 0.86742006615215
Accuracy for fold 2 with max_depth = 10, criterion = gini is 0.8707276736493936
Accuracy for fold 3 with max_depth = 10, criterion = gini is 0.8806504961411246


2. Случайный лес

RandomForestClassifier может принимать следующие гиперпараметры:

    * max_depth
    * criterion
    
Сгенерируем возможные варианты для проверки по этим параметрам

Будем использовать следующие следущие наборы значений для каждого из них:

In [63]:
max_depth = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22]
criterion = ['gini', 'entropy', 'log_loss']

In [64]:
rfc_params_set = [max_depth, criterion]
rfc_params = list(itertools.product(*rfc_params_set))
print(f'{len(rfc_params)} кобинаций')

33 кобинаций


In [65]:
current_max_accuracy = 0
for rfc_param in rfc_params:
    max_depth = rfc_param[0]
    criterion = rfc_param[1]

    rfc = RandomForestClassifier(random_state=42, max_depth=max_depth, criterion=criterion)
    fold_num = 1
    five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
    # Используем нормализованный датасет
    for train, test in five_folds_model_selection.split(features, target):
        X_train = features.iloc[train]
        y_train =  np.asarray(target.iloc[train], dtype=np.float64)

        X_test = features.iloc[test]
        y_test = target.iloc[test]

        rfc.fit(X_train, y_train)
        y_pred = [round(x) for x in rfc.predict(X_test)]
        accuracy = accuracy_score(list(y_test.values), y_pred)

        if accuracy > current_max_accuracy:
            print(f'Accuracy for fold {fold_num} with max_depth = {max_depth}, criterion = {criterion} is {accuracy}')
            top_3.append({
                'model_name': 'rfc',
                'split_method': 'kfolds',
                'params': f'normalized data, max_depth = {max_depth}, criterion = {criterion}',
                'accuracy': accuracy
            })
            current_max_accuracy = accuracy
        fold_num += 1

Accuracy for fold 1 with max_depth = 2, criterion = gini is 0.717199558985667
Accuracy for fold 4 with max_depth = 2, criterion = gini is 0.7206064782908339
Accuracy for fold 1 with max_depth = 4, criterion = gini is 0.8103638368246968
Accuracy for fold 4 with max_depth = 4, criterion = gini is 0.8117160578911096
Accuracy for fold 1 with max_depth = 6, criterion = gini is 0.8397188533627343
Accuracy for fold 2 with max_depth = 6, criterion = gini is 0.8409592061742006
Accuracy for fold 3 with max_depth = 6, criterion = gini is 0.8455071664829107
Accuracy for fold 1 with max_depth = 8, criterion = gini is 0.8588754134509372
Accuracy for fold 2 with max_depth = 8, criterion = gini is 0.8623208379272327
Accuracy for fold 3 with max_depth = 8, criterion = gini is 0.8708654906284454
Accuracy for fold 1 with max_depth = 10, criterion = gini is 0.8770672546857773
Accuracy for fold 3 with max_depth = 10, criterion = gini is 0.88409592061742
Accuracy for fold 3 with max_depth = 10, criterion = 

In [66]:
current_max_accuracy = 0
for rfc_param in rfc_params:
    max_depth = rfc_param[0]
    criterion = rfc_param[1]

    rfc = RandomForestClassifier(random_state=42, max_depth=max_depth, criterion=criterion)
    fold_num = 1
    five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
    # Используем датасет до нормализации
    for train, test in five_folds_model_selection.split(features_before_normalization, target):
        X_train = features_before_normalization.iloc[train]
        y_train =  np.asarray(target.iloc[train], dtype=np.float64)

        X_test = features_before_normalization.iloc[test]
        y_test = target.iloc[test]

        rfc.fit(X_train, y_train)
        y_pred = [round(x) for x in rfc.predict(X_test)]
        accuracy = accuracy_score(list(y_test.values), y_pred)

        if accuracy > current_max_accuracy:
            print(f'Accuracy for fold {fold_num} with max_depth = {max_depth}, criterion = {criterion} is {accuracy}')
            top_3.append({
                'model_name': 'rfc',
                'split_method': 'kfolds',
                'params': f'data before normalization, max_depth = {max_depth}, criterion = {criterion}',
                'accuracy': accuracy
            })
            current_max_accuracy = accuracy
        fold_num += 1

Accuracy for fold 1 with max_depth = 2, criterion = gini is 0.717199558985667
Accuracy for fold 4 with max_depth = 2, criterion = gini is 0.7206064782908339
Accuracy for fold 1 with max_depth = 4, criterion = gini is 0.8103638368246968
Accuracy for fold 4 with max_depth = 4, criterion = gini is 0.8117160578911096
Accuracy for fold 1 with max_depth = 6, criterion = gini is 0.8397188533627343
Accuracy for fold 2 with max_depth = 6, criterion = gini is 0.8409592061742006
Accuracy for fold 3 with max_depth = 6, criterion = gini is 0.8455071664829107
Accuracy for fold 1 with max_depth = 8, criterion = gini is 0.859013230429989
Accuracy for fold 2 with max_depth = 8, criterion = gini is 0.8623208379272327
Accuracy for fold 3 with max_depth = 8, criterion = gini is 0.8708654906284454
Accuracy for fold 1 with max_depth = 10, criterion = gini is 0.8769294377067255
Accuracy for fold 3 with max_depth = 10, criterion = gini is 0.8839581036383682
Accuracy for fold 3 with max_depth = 10, criterion =

3. Бэггинг

BaggingClassifier может принимать следующие гиперпараметры:

    * max_samples
    * n_estimators
    
Сгенерируем возможные варианты для проверки по этим параметрам

Будем использовать следующие следущие наборы значений для каждого из них:

In [67]:
max_samples = [2, 4, 6, 8, 10]
n_estimators = [5, 10, 15, 20]

In [68]:
bagging_params_set = [max_samples, n_estimators]
bagging_params = list(itertools.product(*bagging_params_set))
print(f'{len(bagging_params)} кобинаций')

20 кобинаций


In [69]:
current_max_accuracy = 0
for bagging_param in bagging_params:
    max_samples = bagging_param[0]
    n_estimators = bagging_param[1]

    bagging = BaggingClassifier(random_state=42, max_samples=max_samples, n_estimators=n_estimators)
    fold_num = 1
    five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
    # Используем нормализованный датасет
    for train, test in five_folds_model_selection.split(features, target):
        X_train = features.iloc[train]
        y_train =  np.asarray(target.iloc[train], dtype=np.float64)

        X_test = features.iloc[test]
        y_test = target.iloc[test]

        bagging.fit(X_train, y_train)
        y_pred = [round(x) for x in bagging.predict(X_test)]
        accuracy = accuracy_score(list(y_test.values), y_pred)

        if accuracy > current_max_accuracy:
            print(f'Accuracy for fold {fold_num} with max_samples = {max_samples}, n_estimators = {n_estimators} is {accuracy}')
            top_3.append({
                'model_name': 'bagging',
                'split_method': 'kfolds',
                'params': f'normalized data, max_samples = {max_samples}, n_estimators = {n_estimators}',
                'accuracy': accuracy
            })
            current_max_accuracy = accuracy
        fold_num += 1

Accuracy for fold 1 with max_samples = 2, n_estimators = 5 is 0.5569184123484013
Accuracy for fold 3 with max_samples = 2, n_estimators = 5 is 0.6702039691289967
Accuracy for fold 4 with max_samples = 2, n_estimators = 5 is 0.6767746381805652
Accuracy for fold 2 with max_samples = 4, n_estimators = 5 is 0.7013506063947078
Accuracy for fold 2 with max_samples = 6, n_estimators = 5 is 0.7038313120176406
Accuracy for fold 3 with max_samples = 6, n_estimators = 15 is 0.705071664829107
Accuracy for fold 5 with max_samples = 6, n_estimators = 20 is 0.7076498966230186
Accuracy for fold 2 with max_samples = 8, n_estimators = 10 is 0.7198180815876516
Accuracy for fold 2 with max_samples = 8, n_estimators = 15 is 0.7265711135611908
Accuracy for fold 2 with max_samples = 8, n_estimators = 20 is 0.7493109151047409
Accuracy for fold 5 with max_samples = 10, n_estimators = 15 is 0.7494141971054445
Accuracy for fold 5 with max_samples = 10, n_estimators = 20 is 0.7523087525844245


In [70]:
current_max_accuracy = 0
for bagging_param in bagging_params:
    max_samples = bagging_param[0]
    n_estimators = bagging_param[1]

    bagging = BaggingClassifier(random_state=42, max_samples=max_samples, n_estimators=n_estimators)
    fold_num = 1
    five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
    # Используем датасет до нормализации
    for train, test in five_folds_model_selection.split(features_before_normalization, target):
        X_train = features_before_normalization.iloc[train]
        y_train =  np.asarray(target.iloc[train], dtype=np.float64)

        X_test = features_before_normalization.iloc[test]
        y_test = target.iloc[test]

        bagging.fit(X_train, y_train)
        y_pred = [round(x) for x in bagging.predict(X_test)]
        accuracy = accuracy_score(list(y_test.values), y_pred)

        if accuracy > current_max_accuracy:
            print(f'Accuracy for fold {fold_num} with max_samples = {max_samples}, n_estimators = {n_estimators} is {accuracy}')
            top_3.append({
                'model_name': 'bagging',
                'split_method': 'kfolds',
                'params': f'data before normalization, max_samples = {max_samples}, n_estimators = {n_estimators}',
                'accuracy': accuracy
            })
            current_max_accuracy = accuracy
        fold_num += 1

Accuracy for fold 1 with max_samples = 2, n_estimators = 5 is 0.5569184123484013
Accuracy for fold 3 with max_samples = 2, n_estimators = 5 is 0.6702039691289967
Accuracy for fold 4 with max_samples = 2, n_estimators = 5 is 0.6767746381805652
Accuracy for fold 2 with max_samples = 4, n_estimators = 5 is 0.7013506063947078
Accuracy for fold 2 with max_samples = 6, n_estimators = 5 is 0.7038313120176406
Accuracy for fold 3 with max_samples = 6, n_estimators = 15 is 0.7101708930540243
Accuracy for fold 2 with max_samples = 8, n_estimators = 10 is 0.7198180815876516
Accuracy for fold 2 with max_samples = 8, n_estimators = 15 is 0.7265711135611908
Accuracy for fold 2 with max_samples = 8, n_estimators = 20 is 0.7493109151047409
Accuracy for fold 5 with max_samples = 10, n_estimators = 20 is 0.7520330806340455


4. Градиентный бустинг

GradientBoostingClassifier может принимать следующие гиперпараметры:

    * max_features
    * n_estimators
    
Сгенерируем возможные варианты для проверки по этим параметрам

Будем использовать следующие следущие наборы значений для каждого из них:

In [71]:
max_features = ['sqrt', 'log2']
n_estimators = [5, 10, 15, 20]

In [72]:
gbc_params_set = [max_features, n_estimators]
gbc_params = list(itertools.product(*gbc_params_set))
print(f'{len(gbc_params)} кобинаций')

8 кобинаций


In [73]:
current_max_accuracy = 0
for gbc_param in gbc_params:
    max_features = gbc_param[0]
    n_estimators = gbc_param[1]

    gbc = GradientBoostingClassifier(random_state=42, max_features=max_features, n_estimators=n_estimators)
    fold_num = 1
    five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
    # Используем нормализованный датасет
    for train, test in five_folds_model_selection.split(features, target):
        X_train = features.iloc[train]
        y_train =  np.asarray(target.iloc[train], dtype=np.float64)

        X_test = features.iloc[test]
        y_test = target.iloc[test]

        gbc.fit(X_train, y_train)
        y_pred = [round(x) for x in gbc.predict(X_test)]
        accuracy = accuracy_score(list(y_test.values), y_pred)

        if accuracy > current_max_accuracy:
            print(f'Accuracy for fold {fold_num} with max_features = {max_features}, n_estimators = {n_estimators} is {accuracy}')
            top_3.append({
                'model_name': 'gbc',
                'split_method': 'kfolds',
                'params': f'normalized data, max_features = {max_features}, n_estimators = {n_estimators}',
                'accuracy': accuracy
            })
            current_max_accuracy = accuracy
        fold_num += 1

Accuracy for fold 1 with max_features = sqrt, n_estimators = 5 is 0.6735115766262404
Accuracy for fold 4 with max_features = sqrt, n_estimators = 5 is 0.6771881461061336
Accuracy for fold 1 with max_features = sqrt, n_estimators = 10 is 0.7721885336273429
Accuracy for fold 1 with max_features = sqrt, n_estimators = 15 is 0.7924476295479603
Accuracy for fold 4 with max_features = sqrt, n_estimators = 15 is 0.8012405237767057
Accuracy for fold 1 with max_features = sqrt, n_estimators = 20 is 0.8071940463065049
Accuracy for fold 3 with max_features = sqrt, n_estimators = 20 is 0.8161521499448732
Accuracy for fold 4 with max_features = sqrt, n_estimators = 20 is 0.825499655410062


In [74]:
current_max_accuracy = 0
for gbc_param in gbc_params:
    max_features = gbc_param[0]
    n_estimators = gbc_param[1]

    gbc = GradientBoostingClassifier(random_state=42, max_features=max_features, n_estimators=n_estimators)
    fold_num = 1
    five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
     # Используем датасет до нормализации
    for train, test in five_folds_model_selection.split(features_before_normalization, target):
        X_train = features_before_normalization.iloc[train]
        y_train =  np.asarray(target.iloc[train], dtype=np.float64)

        X_test = features_before_normalization.iloc[test]
        y_test = target.iloc[test]

        gbc.fit(X_train, y_train)
        y_pred = [round(x) for x in gbc.predict(X_test)]
        accuracy = accuracy_score(list(y_test.values), y_pred)

        if accuracy > current_max_accuracy:
            print(f'Accuracy for fold {fold_num} with max_features = {max_features}, n_estimators = {n_estimators} is {accuracy}')
            top_3.append({
                'model_name': 'gbc',
                'split_method': 'kfolds',
                'params': f'data before normalization, max_features = {max_features}, n_estimators = {n_estimators}',
                'accuracy': accuracy
            })
            current_max_accuracy = accuracy
        fold_num += 1

Accuracy for fold 1 with max_features = sqrt, n_estimators = 5 is 0.6735115766262404
Accuracy for fold 4 with max_features = sqrt, n_estimators = 5 is 0.6771881461061336
Accuracy for fold 1 with max_features = sqrt, n_estimators = 10 is 0.7721885336273429
Accuracy for fold 1 with max_features = sqrt, n_estimators = 15 is 0.7924476295479603
Accuracy for fold 4 with max_features = sqrt, n_estimators = 15 is 0.8012405237767057
Accuracy for fold 1 with max_features = sqrt, n_estimators = 20 is 0.8071940463065049
Accuracy for fold 3 with max_features = sqrt, n_estimators = 20 is 0.8161521499448732
Accuracy for fold 4 with max_features = sqrt, n_estimators = 20 is 0.825499655410062


#### Лучшие результаты

In [75]:
top_3.sort(key=lambda x: x['accuracy'], reverse=True)
top_3_new = top_3[:3]

for result in top_3_new:
    print(f"Model name: {result['model_name']}, params: {result['params']}, accuracy: {result['accuracy']}")

Model name: rfc, params: data before normalization, max_depth = 22, criterion = entropy, accuracy: 0.9079382579933848
Model name: rfc, params: normalized data, max_depth = 22, criterion = entropy, accuracy: 0.907800441014333
Model name: rfc, params: normalized data, max_depth = 20, criterion = gini, accuracy: 0.9073869900771775


## Задание 6*

В данных имеется дисбалланс по целевой переменной. Сбалансируйте датасет с помощью одной из техник:
* Oversampling
* Undersampling

и обучите наилучшими алгоритмами 3 модели. Сравните результаты с предыдущими обученными моделями.


#### Oversampling

1. Random Over Sampler

In [76]:
random_over_sampler = RandomOverSampler(random_state = 42)

features_over, target_over = random_over_sampler.fit_resample(features, np.asarray(target, dtype=np.float64))
features_over.shape, features.shape

((48782, 17), (36278, 17))

In [77]:
X_train_features_over, X_test_features_over, y_train_features_over, y_test_features_over = train_test_split(features_over, target_over, test_size=0.3, random_state=42)

X_train_features_over.shape, y_train_features_over.shape, X_test_features_over.shape, y_test_features_over.shape

((34147, 17), (34147,), (14635, 17), (14635,))

In [78]:
# 1. RandomForestClassifier, max_depth = 22, criterion = entropy
# Результат обучения на данных ранее accuracy_score: 0.907800441014333

rfc = RandomForestClassifier(random_state=42, max_depth=22, criterion='entropy')
rfc.fit(X_train_features_over, y_train_features_over)

pred_features = [round(x) for x in rfc.predict(X_test_features_over)]
accuracy = accuracy_score(y_test_features_over, pred_features)
print(accuracy)

0.9298257601639904


In [79]:
# 2. DecisionTreeClassifier, max_depth = 12, criterion = gini
# Результат обучения на данных ранее accuracy_score: 0.8742190371187064

dtc = DecisionTreeClassifier(random_state=42, max_depth=12, criterion='gini')
dtc.fit(X_train_features_over, y_train_features_over)

pred_features = [round(x) for x in dtc.predict(X_test_features_over)]
accuracy = accuracy_score(y_test_features_over, pred_features)
print(accuracy)

0.8694226170140075


In [80]:
# 3. Lasso, params: alpha = 0.1
# Результат обучения на данных ранее accuracy_score: 0.7883131201764058

lasso = Lasso(random_state=42, alpha = 0.1)
lasso.fit(X_train_features_over, y_train_features_over)

pred_features = [round(x) for x in lasso.predict(X_test_features_over)]
accuracy = accuracy_score(y_test_features_over, pred_features)
print(accuracy)

0.4968910146908097


Модели выбрал не только опиарясь на полученные "топы". Взял из разных групп для интереса.

Получилось что RandomForestClassifier сыграл в лучшую сторону. \
Результаты DecisionTreeClassifier чуть лучше чем были. \
Но у Lasso результаты сильно просели. 

Если выбрать alpha уже 0.01 результат получится ближе к тому какой в был в экспериментах с Lasso.

In [81]:
# 3*. Lasso, params: alpha = 0.01
# Результат обучения на данных ранее accuracy_score: 0.7826326671261199

lasso = Lasso(random_state=42, alpha = 0.01)
lasso.fit(X_train_features_over, y_train_features_over)

pred_features = [round(x) for x in lasso.predict(X_test_features_over)]
accuracy = accuracy_score(y_test_features_over, pred_features)
print(accuracy)

0.737068670994192


2. SMOTE

In [82]:
smote = SMOTE(random_state = 42)

features_over, target_over = smote.fit_resample(features, np.asarray(target, dtype=np.float64))
features_over.shape, features.shape

((48782, 17), (36278, 17))

In [83]:
X_train_features_over, X_test_features_over, y_train_features_over, y_test_features_over = train_test_split(features_over, target_over, test_size=0.3, random_state=42)

X_train_features_over.shape, y_train_features_over.shape, X_test_features_over.shape, y_test_features_over.shape

((34147, 17), (34147,), (14635, 17), (14635,))

In [84]:
# 1. RandomForestClassifier, max_depth = 22, criterion = entropy
# Результат обучения на данных ранее accuracy_score: 0.907800441014333

rfc = RandomForestClassifier(random_state=42, max_depth=22, criterion='entropy')
rfc.fit(X_train_features_over, y_train_features_over)

pred_features = [round(x) for x in rfc.predict(X_test_features_over)]
accuracy = accuracy_score(y_test_features_over, pred_features)
print(accuracy)

0.916159890673044


In [85]:
# 2. DecisionTreeClassifier, max_depth = 12, criterion = gini
# Результат обучения на данных ранее accuracy_score: 0.8742190371187064

dtc = DecisionTreeClassifier(random_state=42, max_depth=12, criterion='gini')
dtc.fit(X_train_features_over, y_train_features_over)

pred_features = [round(x) for x in dtc.predict(X_test_features_over)]
accuracy = accuracy_score(y_test_features_over, pred_features)
print(accuracy)

0.8707892039631021


In [86]:
# 3. Lasso, params: alpha = 0.01
# Результат обучения на данных ранее accuracy_score: 0.7826326671261199

lasso = Lasso(random_state=42, alpha = 0.01)
lasso.fit(X_train_features_over, y_train_features_over)

pred_features = [round(x) for x in lasso.predict(X_test_features_over)]
accuracy = accuracy_score(y_test_features_over, pred_features)
print(accuracy)

0.7337888623163649


Применение Random Over Sampler показало лучшие результаты

#### Undersampling

1. RandomUnderSampler

In [87]:
rus = RandomUnderSampler(random_state=42)

features_under, target_under = rus.fit_resample(features, np.asarray(target, dtype=np.float64))
features_under.shape, features.shape

((23774, 17), (36278, 17))

In [88]:
X_train_features_under, X_test_features_under, y_train_features_under, y_test_features_under = train_test_split(features_under, target_under, test_size=0.3, random_state=42)

X_train_features_under.shape, y_train_features_under.shape, X_test_features_under.shape, y_test_features_under.shape

((16641, 17), (16641,), (7133, 17), (7133,))

In [89]:
# 1. RandomForestClassifier, max_depth = 22, criterion = entropy
# Результат обучения на данных ранее accuracy_score: 0.907800441014333

rfc = RandomForestClassifier(random_state=42, max_depth=22, criterion='entropy')
rfc.fit(X_train_features_under, y_train_features_under)

pred_features = [round(x) for x in rfc.predict(X_test_features_under)]
accuracy = accuracy_score(y_test_features_under, pred_features)
print(accuracy)

0.8776111033225852


In [90]:
# 2. DecisionTreeClassifier, max_depth = 12, criterion = gini
# Результат обучения на данных ранее accuracy_score: 0.8742190371187064

dtc = DecisionTreeClassifier(random_state=42, max_depth=12, criterion='gini')
dtc.fit(X_train_features_under, y_train_features_under)

pred_features = [round(x) for x in dtc.predict(X_test_features_under)]
accuracy = accuracy_score(y_test_features_under, pred_features)
print(accuracy)

0.8483106687228376


In [91]:
# 3. Lasso, params: alpha = 0.01
# Результат обучения на данных ранее accuracy_score: 0.7826326671261199

lasso = Lasso(random_state=42, alpha = 0.01)
lasso.fit(X_train_features_under, y_train_features_under)

pred_features = [round(x) for x in lasso.predict(X_test_features_under)]
accuracy = accuracy_score(y_test_features_under, pred_features)
print(accuracy)

0.7381186036730688


При Undersampling результаты обучения и работы моделей снизились

## Задание 7*

С помощью наилучшей линейной модели и наилучшей модели случайного леса, определите важность признаков.

In [92]:
ridge = Ridge(random_state=42, alpha=0.1)

five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
for train, test in five_folds_model_selection.split(features, target):
    X_train = features.iloc[train]
    y_train = target.iloc[train]

    X_test = features.iloc[test]
    y_test = target.iloc[test]

    ridge.fit(X_train, y_train)
    y_pred = ridge.predict(X_test)

ridge_coefs = ridge.coef_
display(ridge_coefs)

array([-1.32153106e-04, -1.74690098e-01, -1.29115825e-01, -1.18468982e-01,
       -6.12131189e-02,  4.72448277e-01, -1.52786730e-02, -1.00422697e+00,
       -8.84015394e-02,  3.22234836e-02, -1.03149348e-02, -4.36032592e-01,
       -1.03944244e-01,  2.30571117e-01, -3.38790271e-01, -1.12056777e+00,
        8.81181688e-01])

In [93]:
rfc = RandomForestClassifier(random_state=42, criterion='entropy', max_depth=22)

five_folds_model_selection = KFold(n_splits=5, shuffle=True, random_state=42)
for train, test in five_folds_model_selection.split(features, target):
    X_train = features.iloc[train]
    y_train = np.asarray(target.iloc[train], dtype=np.float64)

    X_test = features.iloc[test]
    y_test = target.iloc[test]

    rfc.fit(X_train, y_train)
    y_pred = rfc.predict(X_test)

rfc_coefs = rfc.feature_importances_
display(rfc_coefs)

array([0.02427866, 0.00716888, 0.03730255, 0.05110294, 0.01769102,
       0.00810976, 0.01505756, 0.30889879, 0.02642136, 0.08916499,
       0.08636062, 0.06034089, 0.00503806, 0.000597  , 0.00242452,
       0.16264468, 0.09739771])

In [94]:
features.columns

Index(['no_of_adults', 'no_of_children', 'no_of_weekend_nights',
       'no_of_week_nights', 'type_of_meal_plan', 'required_car_parking_space',
       'room_type_reserved', 'lead_time', 'arrival_year', 'arrival_month',
       'arrival_date', 'market_segment_type', 'repeated_guest',
       'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled',
       'avg_price_per_room', 'no_of_special_requests'],
      dtype='object')

In [95]:
# Ridge
zipped = zip(features.columns, ridge_coefs, rfc_coefs)
ridge_sorted_features = sorted(zipped, key = lambda t: -t[1])
for feature in ridge_sorted_features:
    print(feature)

('no_of_special_requests', 0.8811816878176904, 0.09739771206957497)
('required_car_parking_space', 0.47244827680555096, 0.008109755680544022)
('no_of_previous_cancellations', 0.2305711169816426, 0.0005970007472866272)
('arrival_month', 0.03222348355790558, 0.08916498681853123)
('no_of_adults', -0.00013215310551898193, 0.02427865962165891)
('arrival_date', -0.010314934761942538, 0.0863606248065927)
('room_type_reserved', -0.015278672989139932, 0.015057560786443916)
('type_of_meal_plan', -0.061213118933541665, 0.01769101852320537)
('arrival_year', -0.08840153935630789, 0.026421364101151788)
('repeated_guest', -0.10394424433105211, 0.005038063208482925)
('no_of_week_nights', -0.11846898173117547, 0.051102940917281243)
('no_of_weekend_nights', -0.12911582452944137, 0.037302551993007194)
('no_of_children', -0.1746900982730445, 0.007168882180617358)
('no_of_previous_bookings_not_canceled', -0.33879027050893873, 0.0024245248037136494)
('market_segment_type', -0.43603259197231364, 0.0603408921

In [96]:
# RandomForestClassifier
zipped = zip(features.columns, ridge_coefs, rfc_coefs)
rfc_sorted_features = sorted(zipped, key = lambda t: -t[2])
for feature in rfc_sorted_features:
    print(feature)

('lead_time', -1.0042269720217236, 0.3088987860658319)
('avg_price_per_room', -1.1205677739772104, 0.16264467549999664)
('no_of_special_requests', 0.8811816878176904, 0.09739771206957497)
('arrival_month', 0.03222348355790558, 0.08916498681853123)
('arrival_date', -0.010314934761942538, 0.0863606248065927)
('market_segment_type', -0.43603259197231364, 0.06034089217607965)
('no_of_week_nights', -0.11846898173117547, 0.051102940917281243)
('no_of_weekend_nights', -0.12911582452944137, 0.037302551993007194)
('arrival_year', -0.08840153935630789, 0.026421364101151788)
('no_of_adults', -0.00013215310551898193, 0.02427865962165891)
('type_of_meal_plan', -0.061213118933541665, 0.01769101852320537)
('room_type_reserved', -0.015278672989139932, 0.015057560786443916)
('required_car_parking_space', 0.47244827680555096, 0.008109755680544022)
('no_of_children', -0.1746900982730445, 0.007168882180617358)
('repeated_guest', -0.10394424433105211, 0.005038063208482925)
('no_of_previous_bookings_not_can

Опираясь на результаты важности признаков полученные через Ridge и RandomForestClassifier, в топе совпадают:
 - **no_of_special_requests** - кол-во специальных условий в бронировании (например, кондиционер, номер на первом этаже, поздний заезд)
 - **arrival_month** - месяц прибытия
 
Опираясь на результаты важности признаков полученные только на основании RandomForestClassifier:
 - **lead_time** - кол-во дней между датой бронирования и датой прибытия

In [97]:
print("--- %s seconds ---" % (time.time() - start_time))

--- 968.3093876838684 seconds ---
