# Laboratory Work No. 6

## Task

Classify the chosen dataset using CatBoost.

Importing libraries

In [19]:
from sklearn.preprocessing import LabelEncoder, StandardScaler
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from catboost import CatBoostClassifier

Loading the dataset

In [20]:
df = pd.read_csv('weather.csv', encoding='utf-8')
# Note: values absent from the mapping become NaN, and astype(bool) turns NaN into True
df['RainToday'] = df['RainToday'].map({'Yes': True, 'No': False}).astype(bool)
df['RainTomorrow'] = df['RainTomorrow'].map({'Yes': True, 'No': False}).astype(bool)
print(df.head(5))
print(df.dtypes)

         Date Location  MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine  \
0  2008-12-01   Albury     13.4     22.9       0.6          NaN       NaN   
1  2008-12-02   Albury      7.4     25.1       0.0          NaN       NaN   
2  2008-12-03   Albury     12.9     25.7       0.0          NaN       NaN   
3  2008-12-04   Albury      9.2     28.0       0.0          NaN       NaN   
4  2008-12-05   Albury     17.5     32.3       1.0          NaN       NaN   

  WindGustDir  WindGustSpeed WindDir9am  ... Humidity9am  Humidity3pm  \
0           W           44.0          W  ...        71.0         22.0   
1         WNW           44.0        NNW  ...        44.0         25.0   
2         WSW           46.0          W  ...        38.0         30.0   
3          NE           24.0         SE  ...        45.0         16.0   
4           W           41.0        ENE  ...        82.0         33.0   

   Pressure9am  Pressure3pm  Cloud9am  Cloud3pm  Temp9am  Temp3pm  RainToday  \
0       1007.7    
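
One detail of the boolean conversion above is worth checking: `Series.map` returns `NaN` for values that are not in the mapping, and `astype(bool)` then turns `NaN` into `True`. The sketch below reproduces this on a toy series (the data is made up for illustration) and shows one possible workaround, dropping the missing labels before the cast.

```python
import numpy as np
import pandas as pd

# Toy column with one missing label (illustrative data, not from weather.csv)
s = pd.Series(['Yes', 'No', np.nan])

# map() leaves NaN where the value is not a key of the dictionary
mapped = s.map({'Yes': True, 'No': False})

# astype(bool) uses truthiness, and NaN is truthy, so the gap becomes True
as_bool = mapped.astype(bool)
print(as_bool.tolist())

# One possible workaround: drop missing labels first, then cast
safe = mapped.dropna().astype(bool)
print(safe.tolist())
```

If this applies to the real data, rows with a missing `RainToday` or `RainTomorrow` would silently be counted as rainy days instead of being removed by the later `dropna` call.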

Dropping unneeded columns from the dataset

In [21]:
columns_to_drop = ['Humidity3pm', 'Pressure3pm', 'Cloud3pm', 'Temp3pm', 'WindDir9am']
df = df.drop(columns=columns_to_drop)
print(df.head(5))

         Date Location  MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine  \
0  2008-12-01   Albury     13.4     22.9       0.6          NaN       NaN   
1  2008-12-02   Albury      7.4     25.1       0.0          NaN       NaN   
2  2008-12-03   Albury     12.9     25.7       0.0          NaN       NaN   
3  2008-12-04   Albury      9.2     28.0       0.0          NaN       NaN   
4  2008-12-05   Albury     17.5     32.3       1.0          NaN       NaN   

  WindGustDir  WindGustSpeed WindDir3pm  WindSpeed9am  WindSpeed3pm  \
0           W           44.0        WNW          20.0          24.0   
1         WNW           44.0        WSW           4.0          22.0   
2         WSW           46.0        WSW          19.0          26.0   
3          NE           24.0          E          11.0           9.0   
4           W           41.0         NW           7.0          20.0   

   Humidity9am  Pressure9am  Cloud9am  Temp9am  RainToday  RainTomorrow  
0         71.0       1007.7       8.

Dropping rows with missing values

In [22]:
df.dropna(inplace=True)
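
Because `dropna` removes every row that has at least one missing value, it can help to see how much data each column would cost before dropping. A minimal sketch on a made-up frame (the column names mirror the dataset, the values are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative frame with gaps in two columns
demo = pd.DataFrame({
    'MinTemp': [13.4, 7.4, 12.9, 9.2],
    'Evaporation': [np.nan, np.nan, 5.0, 6.2],
    'Sunshine': [np.nan, 8.1, 9.0, np.nan],
})

# Share of missing values per column
missing_share = demo.isna().mean()
print(missing_share)

# Rows kept after dropping any row with a gap
rows_before = len(demo)
cleaned = demo.dropna()
rows_after = len(cleaned)
print(f'{rows_before} rows before, {rows_after} rows after dropna')
```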

Data preprocessing

In [23]:
# Encode categorical features
label_encoders = {}
for column in df.select_dtypes(include=['object']).columns:
    label_encoders[column] = LabelEncoder()
    df[column] = label_encoders[column].fit_transform(df[column])
    
print(df.head(5))

      Date  Location  MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine  \
6049   418         4     17.9     35.2       0.0         12.0      12.3   
6050   419         4     18.4     28.9       0.0         14.8      13.0   
6052   421         4     19.4     37.6       0.0         10.8      10.6   
6053   422         4     21.9     38.4       0.0         11.4      12.2   
6054   423         4     24.2     41.0       0.0         11.2       8.4   

      WindGustDir  WindGustSpeed  WindDir3pm  WindSpeed9am  WindSpeed3pm  \
6049           11           48.0          12           6.0          20.0   
6050            8           37.0          10          19.0          19.0   
6052            5           46.0           6          30.0          15.0   
6053           14           31.0          15           6.0           6.0   
6054           14           35.0          14          17.0          13.0   

      Humidity9am  Pressure9am  Cloud9am  Temp9am  RainToday  RainTomorrow  
6049         20
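
The encoders stored in `label_encoders` can also map the integer codes back to the original labels. The sketch below illustrates this on a toy column (the values are taken from the `WindGustDir` preview shown earlier; the variable names are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column using the wind directions from the preview output
demo = pd.DataFrame({'WindGustDir': ['W', 'WNW', 'WSW', 'NE', 'W']})

encoder = LabelEncoder()
codes = encoder.fit_transform(demo['WindGustDir'])

# Codes follow the sorted order of the unique labels
print(list(encoder.classes_))
print(list(codes))

# inverse_transform recovers the original strings
restored = encoder.inverse_transform(codes)
print(list(restored))
```

Note that the codes impose an arbitrary order on the categories (here alphabetical), which has no meteorological meaning.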

In [24]:
# Scale numeric features
scaler = StandardScaler()
numeric_features = df.select_dtypes(include=['int32', 'int64', 'float32', 'float64']).columns
df[numeric_features] = scaler.fit_transform(df[numeric_features])

print(df.head(5))

          Date  Location   MinTemp   MaxTemp  Rainfall  Evaporation  Sunshine  \
6049 -1.545272 -1.182653  0.716750  1.589217 -0.302062     1.771778  1.217662   
6050 -1.544105 -1.182653  0.793979  0.686505 -0.302062     2.529392  1.404005   
6052 -1.541771 -1.182653  0.948437  1.933107 -0.302062     1.447086  0.765115   
6053 -1.540604 -1.182653  1.334582  2.047737 -0.302062     1.609432  1.191042   
6054 -1.539437 -1.182653  1.689834  2.420285 -0.302062     1.555317  0.179465   

      WindGustDir  WindGustSpeed  WindDir3pm  WindSpeed9am  WindSpeed3pm  \
6049     0.728337       0.553696    0.933445     -1.077916      0.043157   
6050     0.102162      -0.269995    0.509257      0.435573     -0.074484   
6052    -0.524013       0.403934   -0.339118      1.716217     -0.545047   
6053     1.354511      -0.719280    1.569726     -1.077916     -1.603814   
6054     1.354511      -0.419757    1.357632      0.202728     -0.780329   

      Humidity9am  Pressure9am  Cloud9am   Temp9am  Rain
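
The scaler above is fitted on the whole dataset before the train/test split, so test-set statistics influence the transformation. A common alternative, sketched here on synthetic data (the column names and distributions are assumptions for illustration only), is to fit the scaler on the training part and only apply it to the test part:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numeric weather features
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'MinTemp': rng.normal(12, 6, 200),
    'Humidity9am': rng.normal(68, 19, 200),
})
y_demo = rng.random(200) < 0.22

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit on the training split only, then reuse the same statistics for the test split
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0))  # approximately zero by construction
print(X_te_scaled.mean(axis=0))  # close to zero, but generally not exactly
```

For tree-based models such as CatBoost, feature scaling generally has little effect on the result, so this mostly matters if the same preprocessing is reused with other model types.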

Splitting the data into training and test sets

In [25]:
X = df.drop('RainTomorrow', axis=1)
Y = df['RainTomorrow']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
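
The target is imbalanced (rainy days are the minority), so a stratified split may be worth considering: passing `stratify` to `train_test_split` keeps the class proportions the same in both parts. A minimal sketch on a toy target with 20% positives (the data is made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 20 positive and 80 negative
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([True] * 20 + [False] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both parts keep the 20% share of positives
print(y_tr.mean(), y_te.mean())
```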

##### Training the CatBoost model (gradient boosting on decision trees)

In [26]:
# create the model with default parameters
catboost = CatBoostClassifier()

# fit the model on the training data
catboost.fit(X_train, Y_train)

Learning rate set to 0.053641
0:	learn: 0.6525600	total: 159ms	remaining: 2m 38s
1:	learn: 0.6167360	total: 165ms	remaining: 1m 22s
2:	learn: 0.5876869	total: 172ms	remaining: 57s
3:	learn: 0.5620279	total: 179ms	remaining: 44.7s
4:	learn: 0.5407152	total: 187ms	remaining: 37.3s
5:	learn: 0.5204479	total: 196ms	remaining: 32.5s
6:	learn: 0.5036455	total: 205ms	remaining: 29.1s
7:	learn: 0.4873853	total: 212ms	remaining: 26.3s
8:	learn: 0.4744263	total: 219ms	remaining: 24.2s
9:	learn: 0.4618982	total: 228ms	remaining: 22.6s
10:	learn: 0.4531031	total: 235ms	remaining: 21.2s
11:	learn: 0.4438144	total: 242ms	remaining: 20s
12:	learn: 0.4369847	total: 250ms	remaining: 19s
13:	learn: 0.4308341	total: 258ms	remaining: 18.1s
14:	learn: 0.4248911	total: 265ms	remaining: 17.4s
15:	learn: 0.4188755	total: 272ms	remaining: 16.7s
16:	learn: 0.4143544	total: 279ms	remaining: 16.1s
17:	learn: 0.4098210	total: 287ms	remaining: 15.7s
18:	learn: 0.4054271	total: 306ms	remaining: 15.8s
19:	learn: 0.40

<catboost.core.CatBoostClassifier at 0x2769c4b09e0>

Evaluating the default model on the test set

In [27]:
# predict on the test set
catboost_pred = catboost.predict(X_test)

# evaluate the model
print(f'Classification report: \n{classification_report(Y_test, catboost_pred)}\n')
print(f'Confusion matrix: \n{confusion_matrix(Y_test, catboost_pred)}\n')

Classification report: 
              precision    recall  f1-score   support

       False       0.88      0.95      0.91      9326
        True       0.74      0.55      0.63      2589

    accuracy                           0.86     11915
   macro avg       0.81      0.75      0.77     11915
weighted avg       0.85      0.86      0.85     11915


Confusion matrix: 
[[8817  509]
 [1176 1413]]
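
The figures in the report for the `True` class can be reproduced directly from the confusion matrix above, where rows are true classes and columns are predicted classes. A short worked sketch using the numbers from the output:

```python
import numpy as np

# Confusion matrix from the output above: rows = actual, columns = predicted
cm = np.array([[8817, 509],
               [1176, 1413]])
tn, fp, fn, tp = (int(v) for v in cm.ravel())

precision = tp / (tp + fp)          # share of predicted rainy days that were rainy
recall = tp / (tp + fn)             # share of rainy days that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f'precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}')
```

The low recall shows that the model misses a little under half of the rainy days (1176 of 2589), even though overall accuracy looks high because of the class imbalance.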



Searching for optimal parameters using cross-validation

In [28]:
%%time

# define the candidate values for each parameter
parameters = {
    'depth': [5, 10], # maximum tree depth
    'learning_rate': [0.01, 0.1], # learning rate
    'iterations': [10, 100], # number of boosting iterations
    'l2_leaf_reg': [1, 10], # L2 regularization coefficient for leaf values
    'border_count': [1, 255], # number of splits (borders) for numeric features
    'loss_function': ['Logloss', 'CrossEntropy', 'MultiClass', 'MultiClassOneVsAll'], # loss function
    'random_strength': [0, 1], # amount of randomness used when scoring splits
    'bagging_temperature': [0, 1], # temperature of the Bayesian bootstrap
    'od_type': ['IncToDec', 'Iter', 'None'], # overfitting detector type
    'od_wait': [10, 100] # iterations to continue after the best metric value before the detector stops training
}

# randomized search over the parameters (n_iter defaults to 10 sampled combinations)
grid_catboost = RandomizedSearchCV(
    CatBoostClassifier(),
    parameters,
    scoring='f1',
    n_jobs=-1
)
# run the search on the training data
grid_catboost.fit(X_train, Y_train)

0:	learn: 0.6457115	total: 16.9ms	remaining: 1.68s
1:	learn: 0.6053309	total: 65.3ms	remaining: 3.2s
2:	learn: 0.5712594	total: 112ms	remaining: 3.6s
3:	learn: 0.5431520	total: 154ms	remaining: 3.69s
4:	learn: 0.5188510	total: 194ms	remaining: 3.68s
5:	learn: 0.4988626	total: 238ms	remaining: 3.73s
6:	learn: 0.4809000	total: 294ms	remaining: 3.9s
7:	learn: 0.4650267	total: 340ms	remaining: 3.92s
8:	learn: 0.4514019	total: 383ms	remaining: 3.87s
9:	learn: 0.4396561	total: 425ms	remaining: 3.82s
10:	learn: 0.4288889	total: 471ms	remaining: 3.81s
11:	learn: 0.4200462	total: 514ms	remaining: 3.77s
12:	learn: 0.4115636	total: 555ms	remaining: 3.72s
13:	learn: 0.4045165	total: 598ms	remaining: 3.68s
14:	learn: 0.3981815	total: 643ms	remaining: 3.64s
15:	learn: 0.3922016	total: 694ms	remaining: 3.65s
16:	learn: 0.3872981	total: 741ms	remaining: 3.62s
17:	learn: 0.3824423	total: 786ms	remaining: 3.58s
18:	learn: 0.3781952	total: 830ms	remaining: 3.54s
19:	learn: 0.3743420	total: 882ms	remainin
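
To put the randomized search in perspective, the sketch below counts how many combinations the parameter dictionary defines and draws a sample the way `RandomizedSearchCV` does internally. It uses scikit-learn's `ParameterGrid` and `ParameterSampler`; the `random_state` value is an arbitrary choice for reproducibility, not something the original cell sets.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same candidate values as in the search cell
parameters = {
    'depth': [5, 10],
    'learning_rate': [0.01, 0.1],
    'iterations': [10, 100],
    'l2_leaf_reg': [1, 10],
    'border_count': [1, 255],
    'loss_function': ['Logloss', 'CrossEntropy', 'MultiClass', 'MultiClassOneVsAll'],
    'random_strength': [0, 1],
    'bagging_temperature': [0, 1],
    'od_type': ['IncToDec', 'Iter', 'None'],
    'od_wait': [10, 100],
}

# Total number of distinct combinations in the full grid
n_combinations = len(ParameterGrid(parameters))
print(n_combinations)

# Randomized search evaluates only n_iter of them (10 by default)
sampled = list(ParameterSampler(parameters, n_iter=10, random_state=0))
print(len(sampled))
print(sampled[0])
```

With the default 5-fold cross-validation, 10 sampled combinations mean 50 model fits plus one final refit, so only a small fraction of the grid is explored. Setting `random_state` on `RandomizedSearchCV` would make the sampled combinations reproducible between runs.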

Retrieving the best parameters

In [30]:
# cross-validation results
print(f'Best parameters:\n{grid_catboost.best_params_}\n')
print("F1 score of the best model: {:.2f}%".
      format(grid_catboost.best_score_ * 100) + '\n')

# predict on the test set
grid_catboost_pred = grid_catboost.predict(X_test)

# evaluate the model
print(f'Classification report: \n{classification_report(Y_test, grid_catboost_pred)}\n')
print(f'Confusion matrix: \n{confusion_matrix(Y_test, grid_catboost_pred)}\n')

Best parameters:
{'random_strength': 1, 'od_wait': 100, 'od_type': 'None', 'loss_function': 'MultiClass', 'learning_rate': 0.1, 'l2_leaf_reg': 1, 'iterations': 100, 'depth': 10, 'border_count': 255, 'bagging_temperature': 0}

F1 score of the best model: 59.93%

Classification report: 
              precision    recall  f1-score   support

       False       0.88      0.95      0.91      9326
        True       0.73      0.52      0.61      2589

    accuracy                           0.85     11915
   macro avg       0.80      0.73      0.76     11915
weighted avg       0.84      0.85      0.84     11915


Confusion matrix: 
[[8820  506]
 [1241 1348]]
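
Comparing the two confusion matrices suggests that the tuned model does not improve on the default one for the `True` class. Note also that `best_score_` is the mean cross-validated F1 on the training folds, so it is not expected to match the test-set F1 exactly. A short sketch that computes the test F1 for both models from the matrices printed above (the helper name `f1_from_cm` is illustrative):

```python
import numpy as np

def f1_from_cm(cm):
    """F1 of the positive class from a 2x2 confusion matrix (rows = actual)."""
    tn, fp, fn, tp = (int(v) for v in np.asarray(cm).ravel())
    return 2 * tp / (2 * tp + fp + fn)

# Matrices copied from the two evaluation outputs
cm_default = [[8817, 509], [1176, 1413]]
cm_tuned = [[8820, 506], [1241, 1348]]

f1_default = f1_from_cm(cm_default)
f1_tuned = f1_from_cm(cm_tuned)

print(f'default: {f1_default:.2f}, tuned: {f1_tuned:.2f}')
```

A plausible reason is that the search caps `iterations` at 100, while the default model trains for 1000 iterations, as its training log indicates.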

