# Practical Assignment No. 12

## Task 1

Find a dataset for a classification or a regression task.


In astronomy, stellar spectral classification is the classification of stars based on the characteristics of their spectra.
The classification scheme for galaxies, quasars, and stars is one of the most fundamental in astronomy.

The data below contain 100,000 records obtained by the SDSS (Sloan Digital Sky Survey).
The project is named after the Alfred P. Sloan Foundation and is a large-scale multi-spectral imaging and spectroscopic redshift survey of stars and galaxies, carried out with a 2.5-meter wide-angle telescope at Apache Point Observatory in New Mexico.
Each record consists of 17 features plus one column that gives the class of the object: star, galaxy, or quasar.

Description of each parameter:
- `obj_ID` = Object Identifier, the unique value that identifies the object in the image catalog used by the CAS
- `alpha` = Right Ascension angle (at J2000 epoch)
- `delta` = Declination angle (at J2000 epoch)
- `u` = Ultraviolet filter in the photometric system
- `g` = Green filter in the photometric system
- `r` = Red filter in the photometric system
- `i` = Near Infrared filter in the photometric system
- `z` = Infrared filter in the photometric system
- `run_ID` = Run Number used to identify the specific scan
- `rerun_ID` = Rerun Number to specify how the image was processed
- `cam_col` = Camera column to identify the scanline within the run
- `field_ID` = Field number to identify each field
- `spec_obj_ID` = Unique ID used for optical spectroscopic objects (this means that 2 different observations with the same `spec_obj_ID` must share the output class)
- `class` = object class (galaxy, star or quasar object)
- `redshift` = redshift value based on the increase in wavelength
- `plate` = plate ID, identifies each plate in SDSS
- `MJD` = Modified Julian Date, used to indicate when a given piece of SDSS data was taken
- `fiber_ID` = fiber ID that identifies the fiber that pointed the light at the focal plane in each observation

Loading the data:

In [1]:
import pandas as pd

data = pd.read_csv('star_classification.csv')
data

Unnamed: 0,obj_ID,alpha,delta,u,g,r,i,z,run_ID,rerun_ID,cam_col,field_ID,spec_obj_ID,class,redshift,plate,MJD,fiber_ID
0,1.237661e+18,135.689107,32.494632,23.87882,22.27530,20.39501,19.16573,18.79371,3606,301,2,79,6.543777e+18,GALAXY,0.634794,5812,56354,171
1,1.237665e+18,144.826101,31.274185,24.77759,22.83188,22.58444,21.16812,21.61427,4518,301,5,119,1.176014e+19,GALAXY,0.779136,10445,58158,427
2,1.237661e+18,142.188790,35.582444,25.26307,22.66389,20.60976,19.34857,18.94827,3606,301,2,120,5.152200e+18,GALAXY,0.644195,4576,55592,299
3,1.237663e+18,338.741038,-0.402828,22.13682,23.77656,21.61162,20.50454,19.25010,4192,301,3,214,1.030107e+19,GALAXY,0.932346,9149,58039,775
4,1.237680e+18,345.282593,21.183866,19.43718,17.58028,16.49747,15.97711,15.54461,8102,301,3,137,6.891865e+18,GALAXY,0.116123,6121,56187,842
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,1.237679e+18,39.620709,-2.594074,22.16759,22.97586,21.90404,21.30548,20.73569,7778,301,2,581,1.055431e+19,GALAXY,0.000000,9374,57749,438
99996,1.237679e+18,29.493819,19.798874,22.69118,22.38628,20.45003,19.75759,19.41526,7917,301,1,289,8.586351e+18,GALAXY,0.404895,7626,56934,866
99997,1.237668e+18,224.587407,15.700707,21.16916,19.26997,18.20428,17.69034,17.35221,5314,301,4,308,3.112008e+18,GALAXY,0.143366,2764,54535,74
99998,1.237661e+18,212.268621,46.660365,25.35039,21.63757,19.91386,19.07254,18.62482,3650,301,4,131,7.601080e+18,GALAXY,0.455040,6751,56368,470


Get a brief summary of the data:

In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 18 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   obj_ID       100000 non-null  float64
 1   alpha        100000 non-null  float64
 2   delta        100000 non-null  float64
 3   u            100000 non-null  float64
 4   g            100000 non-null  float64
 5   r            100000 non-null  float64
 6   i            100000 non-null  float64
 7   z            100000 non-null  float64
 8   run_ID       100000 non-null  int64  
 9   rerun_ID     100000 non-null  int64  
 10  cam_col      100000 non-null  int64  
 11  field_ID     100000 non-null  int64  
 12  spec_obj_ID  100000 non-null  float64
 13  class        100000 non-null  object 
 14  redshift     100000 non-null  float64
 15  plate        100000 non-null  int64  
 16  MJD          100000 non-null  int64  
 17  fiber_ID     100000 non-null  int64  
dtypes: float64(10), int64(7), object(1)

Get descriptive statistics for the data:

In [3]:
data.describe()

Unnamed: 0,obj_ID,alpha,delta,u,g,r,i,z,run_ID,rerun_ID,cam_col,field_ID,spec_obj_ID,redshift,plate,MJD,fiber_ID
count,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,1.237665e+18,177.629117,24.135305,21.980468,20.531387,19.645762,19.084854,18.66881,4481.36606,301.0,3.51161,186.13052,5.783882e+18,0.576661,5137.00966,55588.6475,449.31274
std,8438560000000.0,96.502241,19.644665,31.769291,31.750292,1.85476,1.757895,31.728152,1964.764593,0.0,1.586912,149.011073,3.324016e+18,0.730707,2952.303351,1808.484233,272.498404
min,1.237646e+18,0.005528,-18.785328,-9999.0,-9999.0,9.82207,9.469903,-9999.0,109.0,301.0,1.0,11.0,2.995191e+17,-0.009971,266.0,51608.0,1.0
25%,1.237659e+18,127.518222,5.146771,20.352353,18.96523,18.135828,17.732285,17.460677,3187.0,301.0,2.0,82.0,2.844138e+18,0.054517,2526.0,54234.0,221.0
50%,1.237663e+18,180.9007,23.645922,22.179135,21.099835,20.12529,19.405145,19.004595,4188.0,301.0,4.0,146.0,5.614883e+18,0.424173,4987.0,55868.5,433.0
75%,1.237668e+18,233.895005,39.90155,23.68744,22.123767,21.044785,20.396495,19.92112,5326.0,301.0,5.0,241.0,8.332144e+18,0.704154,7400.25,56777.0,645.0
max,1.237681e+18,359.99981,83.000519,32.78139,31.60224,29.57186,32.14147,29.38374,8162.0,301.0,6.0,989.0,1.412694e+19,7.011245,12547.0,58932.0,1000.0


The statistics show that the features `u`, `g`, and `z` contain the value `-9999.0`, which lies far outside the range of the other values and inflates the standard deviation of these columns to roughly 32.
The minimum of `alpha` (about 0.006) is several orders of magnitude below its mean (about 177.6).
The maximum of `redshift` (about 7.01) is roughly 12 times the dataset mean (about 0.58).

This may indicate that the data contain outliers.

First, handle the features whose minimum is `-9999.0` by replacing these values with the median of the corresponding feature:

In [4]:
data.loc[data['u'] == -9999, 'u'] = data.u.median()
data.loc[data['g'] == -9999, 'g'] = data.g.median()
data.loc[data['z'] == -9999, 'z'] = data.z.median()
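The medians in the cell above are computed while the `-9999` placeholders are still present in the column. A slightly more careful variant, sketched here on a toy frame, first marks the placeholder as missing so that it cannot influence the median. The names `toy`, `cleaned`, and `SENTINEL` and the sample values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

SENTINEL = -9999.0  # placeholder value observed in the u, g and z columns

# Toy data standing in for the photometric columns (hypothetical values).
toy = pd.DataFrame({
    "u": [23.9, SENTINEL, 22.1, 19.4],
    "g": [22.3, 22.8, SENTINEL, 17.6],
})

# Turn the sentinel into NaN so that median() ignores it ...
cleaned = toy.replace(SENTINEL, np.nan)
# ... then fill every gap with the median of the remaining valid values.
cleaned = cleaned.fillna(cleaned.median())
```

With only a handful of placeholder rows among 100,000 records the two approaches give nearly identical medians, so this is mainly a matter of hygiene.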

Next, draw a box plot (box-and-whisker plot) of the `alpha` feature:

In [6]:
import matplotlib.pyplot as plt

%matplotlib notebook

plt.boxplot(data.alpha)
plt.show()

The plot shows no outliers for this feature.
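The same conclusion can be checked numerically. With matplotlib's default settings, a box plot marks as outliers the points that lie more than 1.5 interquartile ranges beyond the quartiles. The sketch below applies that rule to the `alpha` quartiles and extremes reported by `describe()`; the helper name `whisker_bounds` is an assumption introduced for illustration.

```python
def whisker_bounds(q1, q3, whis=1.5):
    """Return the (lower, upper) limits beyond which a box plot marks outliers."""
    iqr = q3 - q1
    return q1 - whis * iqr, q3 + whis * iqr


# Quartiles and extremes of `alpha` taken from the describe() output above.
lower, upper = whisker_bounds(127.518222, 233.895005)
alpha_min, alpha_max = 0.005528, 359.99981

# An outlier exists only if an extreme value falls outside the limits.
has_outliers = alpha_min < lower or alpha_max > upper
print(f"bounds: [{lower:.2f}, {upper:.2f}], outliers: {has_outliers}")
```

Because `alpha` is an angle between 0 and 360 degrees, both extremes fall inside the limits, which agrees with the plot.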

Many known objects have a cosmological redshift greater than one. For example, as of April 2022 the galaxy with the largest known redshift was HD1 (the oldest galaxy known to science), with a value of 13.27.

This suggests that a `redshift` value greater than 1 is most likely a real measurement rather than an outlier.

Now get the descriptive statistics again:

In [7]:
data.describe()

Unnamed: 0,obj_ID,alpha,delta,u,g,r,i,z,run_ID,rerun_ID,cam_col,field_ID,spec_obj_ID,redshift,plate,MJD,fiber_ID
count,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,1.237665e+18,177.629117,24.135305,22.08068,20.631588,19.645762,19.084854,18.76899,4481.36606,301.0,3.51161,186.13052,5.783882e+18,0.576661,5137.00966,55588.6475,449.31274
std,8438560000000.0,96.502241,19.644665,2.251057,2.037374,1.85476,1.757895,1.765973,1964.764593,0.0,1.586912,149.011073,3.324016e+18,0.730707,2952.303351,1808.484233,272.498404
min,1.237646e+18,0.005528,-18.785328,10.99623,10.4982,9.82207,9.469903,9.612333,109.0,301.0,1.0,11.0,2.995191e+17,-0.009971,266.0,51608.0,1.0
25%,1.237659e+18,127.518222,5.146771,20.35243,18.965245,18.135828,17.732285,17.4609,3187.0,301.0,2.0,82.0,2.844138e+18,0.054517,2526.0,54234.0,221.0
50%,1.237663e+18,180.9007,23.645922,22.179138,21.099882,20.12529,19.405145,19.004598,4188.0,301.0,4.0,146.0,5.614883e+18,0.424173,4987.0,55868.5,433.0
75%,1.237668e+18,233.895005,39.90155,23.68744,22.123767,21.044785,20.396495,19.92112,5326.0,301.0,5.0,241.0,8.332144e+18,0.704154,7400.25,56777.0,645.0
max,1.237681e+18,359.99981,83.000519,32.78139,31.60224,29.57186,32.14147,29.38374,8162.0,301.0,6.0,989.0,1.412694e+19,7.011245,12547.0,58932.0,1000.0


Print the unique class values. There should be three:

In [8]:
data['class'].unique()

array(['GALAXY', 'QSO', 'STAR'], dtype=object)
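Later in the notebook the class column is encoded with `pd.factorize`, and only the integer codes are kept. The sketch below, on a made-up label column, shows how the class shares can be inspected and how the mapping from codes back to class names can be preserved. The variable names and sample labels are illustrative assumptions.

```python
import pandas as pd

# Toy label column using the three class names printed above.
labels = pd.Series(["GALAXY", "QSO", "GALAXY", "STAR", "GALAXY", "QSO"])

# Share of each class; a strongly skewed result would argue for stratified splitting.
shares = labels.value_counts(normalize=True)

# factorize() returns integer codes plus the array that maps codes back to names.
codes, uniques = pd.factorize(labels)
code_to_name = dict(enumerate(uniques))
```

Keeping `code_to_name` around makes it possible to label the axes of a confusion matrix with class names instead of bare integers.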

Drop the columns that do not directly describe the spectral characteristics of the objects, such as identifiers and the observation date:

In [9]:
data.drop(
    columns=['obj_ID', 'run_ID', 'rerun_ID', 'cam_col', 'field_ID', 'spec_obj_ID', 'plate', 'MJD', 'fiber_ID'],
    inplace=True,
)
data

Unnamed: 0,alpha,delta,u,g,r,i,z,class,redshift
0,135.689107,32.494632,23.87882,22.27530,20.39501,19.16573,18.79371,GALAXY,0.634794
1,144.826101,31.274185,24.77759,22.83188,22.58444,21.16812,21.61427,GALAXY,0.779136
2,142.188790,35.582444,25.26307,22.66389,20.60976,19.34857,18.94827,GALAXY,0.644195
3,338.741038,-0.402828,22.13682,23.77656,21.61162,20.50454,19.25010,GALAXY,0.932346
4,345.282593,21.183866,19.43718,17.58028,16.49747,15.97711,15.54461,GALAXY,0.116123
...,...,...,...,...,...,...,...,...,...
99995,39.620709,-2.594074,22.16759,22.97586,21.90404,21.30548,20.73569,GALAXY,0.000000
99996,29.493819,19.798874,22.69118,22.38628,20.45003,19.75759,19.41526,GALAXY,0.404895
99997,224.587407,15.700707,21.16916,19.26997,18.20428,17.69034,17.35221,GALAXY,0.143366
99998,212.268621,46.660365,25.35039,21.63757,19.91386,19.07254,18.62482,GALAXY,0.455040


Remove any duplicates that may have appeared:

In [10]:
data.drop_duplicates(keep='first', inplace=True)
data

Unnamed: 0,alpha,delta,u,g,r,i,z,class,redshift
0,135.689107,32.494632,23.87882,22.27530,20.39501,19.16573,18.79371,GALAXY,0.634794
1,144.826101,31.274185,24.77759,22.83188,22.58444,21.16812,21.61427,GALAXY,0.779136
2,142.188790,35.582444,25.26307,22.66389,20.60976,19.34857,18.94827,GALAXY,0.644195
3,338.741038,-0.402828,22.13682,23.77656,21.61162,20.50454,19.25010,GALAXY,0.932346
4,345.282593,21.183866,19.43718,17.58028,16.49747,15.97711,15.54461,GALAXY,0.116123
...,...,...,...,...,...,...,...,...,...
99995,39.620709,-2.594074,22.16759,22.97586,21.90404,21.30548,20.73569,GALAXY,0.000000
99996,29.493819,19.798874,22.69118,22.38628,20.45003,19.75759,19.41526,GALAXY,0.404895
99997,224.587407,15.700707,21.16916,19.26997,18.20428,17.69034,17.35221,GALAXY,0.143366
99998,212.268621,46.660365,25.35039,21.63757,19.91386,19.07254,18.62482,GALAXY,0.455040


## Preparing for the computations

In [11]:
params = {
    'feature': 'Метод',
    'time': 'Время',
    'score': 'Счет',
}
algos = {
    'DecisionTree': ['predict', 'train', 'test'],
    'Bagging': ['predict', 'train', 'test'],
    'Random Forest': ['predict', 'train', 'test'],
    'CatBoosting CPU': ['predict', 'train', 'test'],
    'CatBoosting GPU': ['predict', 'train', 'test'],
    'Gradient Boosting': ['predict', 'train', 'test'],
    'Hist Gradient Boosting': ['predict', 'train', 'test'],
    'XGBoosting CPU': ['predict', 'train', 'test'],
    'XGBoosting GPU': ['predict', 'train', 'test'],
}
random_state = 42

In [12]:
# One row per (algorithm, stage) pair; time and score start as placeholders.
mult_algos = [alg for alg, stages in algos.items() for _ in stages]
all_stages = [stage for stages in algos.values() for stage in stages]

results = pd.DataFrame(data={
    params['feature']: all_stages,
    params['time']: "—",
    params['score']: "—",
}, index=pd.Index(
    data=mult_algos, name="Тип анс. обучения"
))

results

Тип анс. обучения,Метод,Время,Счет
DecisionTree,predict,—,—
DecisionTree,train,—,—
DecisionTree,test,—,—
Bagging,predict,—,—
Bagging,train,—,—
Bagging,test,—,—
Random Forest,predict,—,—
Random Forest,train,—,—
Random Forest,test,—,—
CatBoosting CPU,predict,—,—


In [13]:
from time import perf_counter

def time_for(alg, feature):
    """Record the wall-clock time of the wrapped call in `results`."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            mask = (
                (results[params['feature']] == feature)
                & (results.index == alg)
            )
            # Call the function once and time that single call.
            start = perf_counter()
            result = func(*args, **kwargs)
            results.loc[mask, params['time']] = perf_counter() - start
            return result
        return wrapper
    return decorator


In [14]:
def score_for(alg, feature):
    """Store the value returned by the wrapped call in `results`."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            mask = (
                (results[params['feature']] == feature)
                & (results.index == alg)
            )
            # Evaluate once and reuse the value.
            score = func(*args, **kwargs)
            results.loc[mask, params['score']] = score
            return score
        return wrapper
    return decorator


In [15]:
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import GridSearchCV
import plotly.express as px

def draw_conf_matrix(test, predict):
    # confusion_matrix puts true classes on rows (y axis)
    # and predicted classes on columns (x axis).
    figure = px.imshow(confusion_matrix(test, predict), text_auto=True)
    figure.update_layout(xaxis_title='Prediction', yaxis_title='Target')
    figure.show()
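For reference, scikit-learn's `confusion_matrix(y_true, y_pred)` places true classes on rows and predicted classes on columns, and `px.imshow` draws rows along the vertical axis. The small sketch below uses made-up labels to show how the matrix is laid out; `y_true` and `y_pred` are hypothetical values chosen for illustration.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 2]   # hypothetical ground-truth labels
y_pred = [0, 0, 1, 1, 1, 2]   # hypothetical predictions

matrix = confusion_matrix(y_true, y_pred)
# Row 0 describes the three samples whose true class is 0:
# two were predicted as class 0 and one as class 1.
print(matrix)
```

Reading a row therefore answers "what was predicted for objects of this true class", while reading a column answers "what were the true classes of objects given this prediction".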


## Splitting the data into training and test sets

In [16]:
from sklearn import preprocessing as preprc

predictors = data.drop(['class'], axis=1)

predictors = pd.DataFrame(
    preprc.MinMaxScaler().fit_transform(predictors),
    columns=predictors.columns,
)
predictors

Unnamed: 0,alpha,delta,u,g,r,i,z,redshift
0,0.376905,0.503802,0.591347,0.558050,0.535344,0.427665,0.464377,0.091831
1,0.402286,0.491812,0.632603,0.584423,0.646203,0.515986,0.607035,0.112389
2,0.394960,0.534139,0.654888,0.576463,0.546218,0.435729,0.472194,0.093170
3,0.940947,0.180600,0.511384,0.629186,0.596946,0.486717,0.487460,0.134210
4,0.959118,0.392679,0.387463,0.335579,0.337999,0.287021,0.300043,0.017959
...,...,...,...,...,...,...,...,...
99995,0.110044,0.159072,0.512797,0.591245,0.611752,0.522045,0.562598,0.001420
99996,0.081913,0.379072,0.536831,0.563308,0.538130,0.453770,0.495813,0.059087
99997,0.623848,0.338810,0.466966,0.415644,0.424420,0.362588,0.391468,0.021839
99998,0.589629,0.642974,0.658896,0.527831,0.510982,0.423554,0.455834,0.066229


In [17]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    predictors, pd.factorize(data['class'])[0], train_size=0.01,
    shuffle=True, random_state=random_state,
)

print(
    f'Training features shape: {X_train.shape}',
    f'Test features shape: {X_test.shape}',
    f'Training target shape: {y_train.shape}',
    f'Test target shape: {y_test.shape}',
    sep='\n',
)

Training features shape: (1000, 8)
Test features shape: (99000, 8)
Training target shape: (1000,)
Test target shape: (99000,)
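In the cells above the scaler is fitted on the whole dataset before the split, so the test rows contribute to the learned minimum and maximum. A common alternative, sketched below on synthetic data, is to split first and fit the scaler on the training part only. The arrays, the `stratify` argument, and the 25% training share are assumptions made for this illustration, not the notebook's settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))        # synthetic stand-in for the predictors
y = rng.integers(0, 3, size=200)     # synthetic three-class target

# stratify=y keeps the class shares equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.25, stratify=y, random_state=42,
)

scaler = MinMaxScaler().fit(X_tr)    # min/max are learned from the training part only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te) # test values may fall slightly outside [0, 1]
```

For tree-based models the choice of scaling rarely changes the result, since splits depend only on the ordering of values, but the pattern matters for scale-sensitive models.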


## Task 2

Implement bagging, the standard example of which is the random forest.
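To make the idea concrete, here is a minimal hand-written sketch of bagging, assuming synthetic data from `make_classification` and a simple majority vote. It is an illustration of the principle only; library implementations such as `BaggingClassifier` differ in details, for example in how the individual predictions are combined.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data (an assumption for this sketch).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, random_state=42,
)
rng = np.random.default_rng(42)

# Fit each tree on a bootstrap sample: rows drawn with replacement.
trees = []
for _ in range(25):
    rows = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[rows], y[rows]))

# Aggregate by majority vote over the individual tree predictions.
votes = np.stack([tree.predict(X) for tree in trees])   # shape (25, 300)
ensemble_pred = np.array(
    [np.bincount(col, minlength=3).argmax() for col in votes.T]
)
```

A random forest adds one more source of randomness on top of this scheme: each split considers only a random subset of the features.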


### Decision Tree

In [18]:
from sklearn.tree import DecisionTreeClassifier

decision_params = {
    'max_depth': [12, 15, 18],
    # Floats are fractions of the features; the integer 1 means exactly one feature.
    'max_features': [.5, .75, 1],
    'min_samples_leaf': [3, 5, 10],
    'min_samples_split': [6, 9, 12],
}

@time_for('DecisionTree', 'predict')
def fit_decision():
    return GridSearchCV(
        estimator=DecisionTreeClassifier(random_state=random_state),
        param_grid=decision_params, scoring='f1_macro', cv=4,
    ).fit(X_train, y_train)

decision_tree = fit_decision()

In [19]:
decision_model = decision_tree.best_estimator_
decision_model

In [20]:
decision_y_train = decision_model.predict(X_train)

In [72]:
@score_for('DecisionTree', 'train')
@time_for('DecisionTree', 'train')
def train_decision():
    return f1_score(y_train, decision_y_train, average='macro')

print(
    "F1 score on the training data with "
    "DecisionTreeClassifier:", train_decision(),
)
draw_conf_matrix(y_train, decision_y_train)

F1 score on the training data with DecisionTreeClassifier: 0.9676284485078227
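The macro-averaged F1 score used throughout is the unweighted mean of the per-class F1 values, where each per-class value equals 2·TP / (2·TP + FP + FN). The sketch below reimplements it with NumPy on made-up labels; `macro_f1` is a hypothetical helper written for illustration and is not scikit-learn's implementation.

```python
import numpy as np


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for label in np.union1d(y_true, y_pred):
        tp = np.sum((y_true == label) & (y_pred == label))
        fp = np.sum((y_true != label) & (y_pred == label))
        fn = np.sum((y_true == label) & (y_pred != label))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))


y_true = [0, 0, 0, 1, 1, 2]   # hypothetical labels
y_pred = [0, 0, 1, 1, 1, 2]   # hypothetical predictions
score = macro_f1(y_true, y_pred)
swapped = macro_f1(y_pred, y_true)
```

Because the formula treats false positives and false negatives symmetrically, swapping the two arguments leaves the value unchanged, so `score` and `swapped` coincide.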


In [22]:
decision_y_test = decision_model.predict(X_test)

In [73]:
@score_for('DecisionTree', 'test')
@time_for('DecisionTree', 'test')
def test_decision():
    return f1_score(y_test, decision_y_test, average='macro')

print(
    "F1 score on the test data with "
    "DecisionTreeClassifier:", test_decision(),
)
draw_conf_matrix(y_test, decision_y_test)

F1 score on the test data with DecisionTreeClassifier: 0.9332125078871915


### Bagging

In [24]:
from sklearn.ensemble import BaggingClassifier

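bagging_params = {
    # Floats are fractions; the integer 1 means one sample (or one feature),
    # whereas 1.0 would mean all of them.
    'max_samples': [.5, .75, 1],
    'max_features': [.5, .75, 1],
    'n_estimators': [10, 50, 100],
}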

@time_for('Bagging', 'predict')
def fit_bagging():
    return GridSearchCV(
        estimator=BaggingClassifier(random_state=random_state),
        param_grid=bagging_params, scoring='f1_macro', cv=4,
    ).fit(X_train, y_train)

bagging = fit_bagging()

In [25]:
bagging_model = bagging.best_estimator_
bagging_model

In [27]:
bagging_y_train = bagging_model.predict(X_train)

In [74]:
@score_for('Bagging', 'train')
@time_for('Bagging', 'train')
def train_bagging():
    return f1_score(y_train, bagging_y_train, average='macro')

print(
    "F1 score on the training data with "
    "BaggingClassifier:", train_bagging(),
)
draw_conf_matrix(y_train, bagging_y_train)

F1 score on the training data with BaggingClassifier: 0.9976728763489481


In [29]:
bagging_y_test = bagging_model.predict(X_test)

In [75]:
@score_for('Bagging', 'test')
@time_for('Bagging', 'test')
def test_bagging():
    return f1_score(y_test, bagging_y_test, average='macro')

print(
    "F1 score on the test data with "
    "BaggingClassifier:", test_bagging(),
)
draw_conf_matrix(y_test, bagging_y_test)

F1 score on the test data with BaggingClassifier: 0.9616708744091462


### Random Forest

In [31]:
from sklearn.ensemble import RandomForestClassifier

forest_params = {
    'n_estimators': [10, 100, 500],
    'max_depth': [None, 12, 15],
    'min_samples_leaf': [1, 3, 5],
    'min_samples_split': [2, 6, 9],
}

@time_for('Random Forest', 'predict')
def fit_forest():
    return GridSearchCV(
        estimator=RandomForestClassifier(random_state=random_state),
        param_grid=forest_params, scoring='f1_macro', cv=4,
    ).fit(X_train, y_train)

random_forest = fit_forest()

In [32]:
forest_model = random_forest.best_estimator_
forest_model

In [33]:
forest_y_train = forest_model.predict(X_train)

In [76]:
@score_for('Random Forest', 'train')
@time_for('Random Forest', 'train')
def train_forest():
    return f1_score(y_train, forest_y_train, average='macro')

print(
    "F1 score on the training data with "
    "RandomForestClassifier:", train_forest(),
)
draw_conf_matrix(y_train, forest_y_train)

F1 score on the training data with RandomForestClassifier: 1.0


In [35]:
forest_y_test = forest_model.predict(X_test)

In [77]:
@score_for('Random Forest', 'test')
@time_for('Random Forest', 'test')
def test_forest():
    return f1_score(y_test, forest_y_test, average='macro')

print(
    "F1 score on the test data with "
    "RandomForestClassifier:", test_forest(),
)
draw_conf_matrix(y_test, forest_y_test)

F1 score on the test data with RandomForestClassifier: 0.9575933526756595


## Task 3

Implement boosting on the same data that was used for bagging.
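As a rough illustration of the principle, the sketch below builds a tiny gradient-boosting ensemble by hand. To keep it short it assumes a regression problem with squared error on synthetic data, whereas the task in this notebook is classification; the data, the learning rate, and the number of stages are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)   # synthetic regression target

learning_rate = 0.1
prediction = np.full_like(y, y.mean())    # stage 0: constant model
initial_mse = np.mean((y - prediction) ** 2)

stumps = []
for _ in range(100):
    residuals = y - prediction            # negative gradient of squared error
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    prediction += learning_rate * stump.predict(X)   # small corrective step
    stumps.append(stump)

final_mse = np.mean((y - prediction) ** 2)
```

Unlike bagging, where the trees are trained independently, each stage here depends on the errors left by all previous stages.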


### Cat Boosting

Using the CPU


In [37]:
from catboost import CatBoostClassifier

@time_for('CatBoosting CPU', 'predict')
def fit_cat_cpu():
    return CatBoostClassifier(
        random_state=random_state,
    ).fit(X_train, y_train)

cat_cpu_model = fit_cat_cpu()

Learning rate set to 0.079127
0:	learn: 0.9971944	total: 152ms	remaining: 2m 31s
1:	learn: 0.8988531	total: 158ms	remaining: 1m 18s
2:	learn: 0.8163142	total: 164ms	remaining: 54.5s
3:	learn: 0.7496475	total: 171ms	remaining: 42.5s
4:	learn: 0.6889320	total: 176ms	remaining: 35.1s
5:	learn: 0.6377499	total: 183ms	remaining: 30.4s
6:	learn: 0.5944032	total: 190ms	remaining: 26.9s
7:	learn: 0.5544519	total: 196ms	remaining: 24.3s
8:	learn: 0.5185654	total: 204ms	remaining: 22.5s
9:	learn: 0.4876519	total: 211ms	remaining: 20.9s
10:	learn: 0.4608737	total: 217ms	remaining: 19.5s
11:	learn: 0.4367177	total: 223ms	remaining: 18.4s
12:	learn: 0.4144205	total: 230ms	remaining: 17.4s
13:	learn: 0.3929261	total: 237ms	remaining: 16.7s
14:	learn: 0.3737588	total: 243ms	remaining: 15.9s
15:	learn: 0.3571414	total: 249ms	remaining: 15.3s
16:	learn: 0.3417432	total: 256ms	remaining: 14.8s
17:	learn: 0.3283899	total: 267ms	remaining: 14.6s
18:	learn: 0.3143855	total: 274ms	remaining: 14.1s
19:	learn

In [38]:
cat_cpu_y_train = cat_cpu_model.predict(X_train)

In [78]:
@score_for('CatBoosting CPU', 'train')
@time_for('CatBoosting CPU', 'train')
def train_cat_cpu():
    return f1_score(y_train, cat_cpu_y_train, average='macro')

print(
    "F1 score on the training data with "
    "CatBoostClassifier on the CPU:", train_cat_cpu(),
)
draw_conf_matrix(y_train, cat_cpu_y_train)

F1 score on the training data with CatBoostClassifier on the CPU: 1.0


In [40]:
cat_cpu_y_test = cat_cpu_model.predict(X_test)

In [79]:
@score_for('CatBoosting CPU', 'test')
@time_for('CatBoosting CPU', 'test')
def test_cat_cpu():
    return f1_score(y_test, cat_cpu_y_test, average='macro')

print(
    "F1 score on the test data with "
    "CatBoostClassifier on the CPU:", test_cat_cpu(),
)
draw_conf_matrix(y_test, cat_cpu_y_test)

F1 score on the test data with CatBoostClassifier on the CPU: 0.9610094547682362


### Cat Boosting

Using the GPU


In [43]:
from catboost import CatBoostClassifier

@time_for('CatBoosting GPU', 'predict')
def fit_cat_gpu():
    return CatBoostClassifier(
        task_type='GPU', devices='0',
        random_state=random_state,
    ).fit(X_train, y_train)

cat_gpu_model = fit_cat_gpu()

Learning rate set to 0.064906
0:	learn: 1.0015128	total: 25.6ms	remaining: 25.6s
1:	learn: 0.9169174	total: 84.7ms	remaining: 42.3s
2:	learn: 0.8447158	total: 280ms	remaining: 1m 33s
3:	learn: 0.7815021	total: 288ms	remaining: 1m 11s
4:	learn: 0.7260631	total: 297ms	remaining: 59.1s
5:	learn: 0.6780486	total: 305ms	remaining: 50.6s
6:	learn: 0.6347123	total: 313ms	remaining: 44.4s
7:	learn: 0.5953649	total: 320ms	remaining: 39.7s
8:	learn: 0.5607136	total: 328ms	remaining: 36.1s
9:	learn: 0.5279078	total: 337ms	remaining: 33.4s
10:	learn: 0.4980567	total: 348ms	remaining: 31.3s
11:	learn: 0.4716894	total: 357ms	remaining: 29.4s
12:	learn: 0.4467183	total: 365ms	remaining: 27.7s
13:	learn: 0.4231300	total: 375ms	remaining: 26.4s
14:	learn: 0.4020272	total: 424ms	remaining: 27.8s
15:	learn: 0.3832922	total: 440ms	remaining: 27.1s
16:	learn: 0.3660352	total: 449ms	remaining: 25.9s
17:	learn: 0.3497797	total: 459ms	remaining: 25s
18:	learn: 0.3336754	total: 467ms	remaining: 24.1s
19:	learn

In [44]:
cat_gpu_y_train = cat_gpu_model.predict(X_train)

In [80]:
@score_for('CatBoosting GPU', 'train')
@time_for('CatBoosting GPU', 'train')
def train_cat_gpu():
    return f1_score(y_train, cat_gpu_y_train, average='macro')

print(
    "F1 score on the training data with "
    "CatBoostClassifier on the GPU:", train_cat_gpu(),
)
draw_conf_matrix(y_train, cat_gpu_y_train)

F1 score on the training data with CatBoostClassifier on the GPU: 1.0


In [47]:
cat_gpu_y_test = cat_gpu_model.predict(X_test)

In [81]:
@score_for('CatBoosting GPU', 'test')
@time_for('CatBoosting GPU', 'test')
def test_cat_gpu():
    return f1_score(y_test, cat_gpu_y_test, average='macro')

print(
    "F1 score on the test data with "
    "CatBoostClassifier on the GPU:", test_cat_gpu(),
)
draw_conf_matrix(y_test, cat_gpu_y_test)

F1 score on the test data with CatBoostClassifier on the GPU: 0.9604606522486456


### Gradient Boosting

In [51]:
from sklearn.ensemble import GradientBoostingClassifier

@time_for('Gradient Boosting', 'predict')
def fit_gradient():
    return GradientBoostingClassifier(
        random_state=random_state,
    ).fit(X_train, y_train)

gradient_model = fit_gradient()

In [52]:
gradient_y_train = gradient_model.predict(X_train)

In [82]:
@score_for('Gradient Boosting', 'train')
@time_for('Gradient Boosting', 'train')
def train_gradient():
    return f1_score(y_train, gradient_y_train, average='macro')

print(
    "F1 score on the training data with "
    "GradientBoostingClassifier:", train_gradient(),
)
draw_conf_matrix(y_train, gradient_y_train)

F1 score on the training data with GradientBoostingClassifier: 0.9988384942853769


In [54]:
gradient_y_test = gradient_model.predict(X_test)

In [83]:
@score_for('Gradient Boosting', 'test')
@time_for('Gradient Boosting', 'test')
def test_gradient():
    return f1_score(y_test, gradient_y_test, average='macro')

print(
    "F1 score on the test data with "
    "GradientBoostingClassifier:", test_gradient(),
)
draw_conf_matrix(y_test, gradient_y_test)

F1 score on the test data with GradientBoostingClassifier: 0.960838685258659


### Hist Gradient Boosting

In [56]:
from sklearn.ensemble import HistGradientBoostingClassifier

@time_for('Hist Gradient Boosting', 'predict')
def fit_hist_grad():
    return HistGradientBoostingClassifier(
        random_state=random_state,
    ).fit(X_train, y_train)

hist_grad_model = fit_hist_grad()

In [57]:
hist_grad_y_train = hist_grad_model.predict(X_train)

In [84]:
@score_for('Hist Gradient Boosting', 'train')
@time_for('Hist Gradient Boosting', 'train')
def train_hist_grad():
    return f1_score(y_train, hist_grad_y_train, average='macro')

print(
    "F1 score on the training data with "
    "HistGradientBoostingClassifier:", train_hist_grad(),
)
draw_conf_matrix(y_train, hist_grad_y_train)

F1 score on the training data with HistGradientBoostingClassifier: 1.0


In [59]:
hist_grad_y_test = hist_grad_model.predict(X_test)

In [85]:
@score_for('Hist Gradient Boosting', 'test')
@time_for('Hist Gradient Boosting', 'test')
def test_hist_grad():
    return f1_score(y_test, hist_grad_y_test, average='macro')

print(
    "F1 score on the test data with "
    "HistGradientBoostingClassifier:", test_hist_grad(),
)
draw_conf_matrix(y_test, hist_grad_y_test)

F1 score on the test data with HistGradientBoostingClassifier: 0.956678790248597


### XG Boosting

Using the CPU


In [61]:
from xgboost import XGBClassifier

@time_for('XGBoosting CPU', 'predict')
def fit_xgb_cpu():
    return XGBClassifier(
        random_state=random_state,
    ).fit(X_train, y_train)

xgb_cpu_model = fit_xgb_cpu()

In [62]:
xgb_cpu_y_train = xgb_cpu_model.predict(X_train)

In [86]:
@score_for('XGBoosting CPU', 'train')
@time_for('XGBoosting CPU', 'train')
def train_xgb_cpu():
    return f1_score(y_train, xgb_cpu_y_train, average='macro')

print(
    "F1 score on the training data with "
    "XGBClassifier on the CPU:", train_xgb_cpu(),
)
draw_conf_matrix(y_train, xgb_cpu_y_train)

F1 score on the training data with XGBClassifier on the CPU: 1.0


In [64]:
xgb_cpu_y_test = xgb_cpu_model.predict(X_test)

In [87]:
@score_for('XGBoosting CPU', 'test')
@time_for('XGBoosting CPU', 'test')
def test_xgb_cpu():
    return f1_score(y_test, xgb_cpu_y_test, average='macro')

print(
    "F1 score on the test data with "
    "XGBClassifier on the CPU:", test_xgb_cpu(),
)
draw_conf_matrix(y_test, xgb_cpu_y_test)

F1 score on the test data with XGBClassifier on the CPU: 0.9594437472100448


### XG Boosting

Using the GPU


In [66]:
from xgboost import XGBClassifier

@time_for('XGBoosting GPU', 'predict')
def fit_xgb_gpu():
    return XGBClassifier(
        tree_method='gpu_hist', gpu_id=0,
        random_state=random_state,
    ).fit(X_train, y_train)

xgb_gpu_model = fit_xgb_gpu()

In [67]:
xgb_gpu_y_train = xgb_gpu_model.predict(X_train)

In [88]:
@score_for('XGBoosting GPU', 'train')
@time_for('XGBoosting GPU', 'train')
def train_xgb_gpu():
    return f1_score(y_train, xgb_gpu_y_train, average='macro')

print(
    "F1 score on the training data with "
    "XGBClassifier on the GPU:", train_xgb_gpu(),
)
draw_conf_matrix(y_train, xgb_gpu_y_train)

F1 score on the training data with XGBClassifier on the GPU: 1.0


In [69]:
xgb_gpu_y_test = xgb_gpu_model.predict(X_test)

In [89]:
@score_for('XGBoosting GPU', 'test')
@time_for('XGBoosting GPU', 'test')
def test_xgb_gpu():
    return f1_score(y_test, xgb_gpu_y_test, average='macro')

print(
    "F1 score on the test data with "
    "XGBClassifier on the GPU:", test_xgb_gpu(),
)
draw_conf_matrix(y_test, xgb_gpu_y_test)

F1 score on the test data with XGBClassifier on the GPU: 0.9581598414426468


## Task 4

Compare the results of the algorithms (running time and model quality).
Draw conclusions.


In [90]:
results

Тип анс. обучения,Метод,Время,Счет
DecisionTree,predict,7.366282,—
DecisionTree,train,0.001177,0.967628
DecisionTree,test,0.037815,0.933213
Bagging,predict,29.935607,—
Bagging,train,0.001767,0.997673
Bagging,test,0.048112,0.961671
Random Forest,predict,179.286113,—
Random Forest,train,0.001453,1.0
Random Forest,test,0.0384,0.957593
CatBoosting CPU,predict,8.388958,—
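One possible way to compare model quality is to put the training and test scores side by side and look at the gap between them. The sketch below does this with the macro-F1 values printed by the cells above, rounded to six digits; the frame `scores` and the `gap` column are names introduced for this illustration, and timings are left out because the table above is only partially shown.

```python
import pandas as pd

# Macro-F1 values copied from the printed cell outputs above (rounded to 6 digits).
scores = pd.DataFrame(
    {
        "train": [0.967628, 0.997673, 1.0, 1.0, 1.0, 0.998838, 1.0, 1.0, 1.0],
        "test": [0.933213, 0.961671, 0.957593, 0.961009, 0.960461,
                 0.960839, 0.956679, 0.959444, 0.958160],
    },
    index=["DecisionTree", "Bagging", "Random Forest", "CatBoosting CPU",
           "CatBoosting GPU", "Gradient Boosting", "Hist Gradient Boosting",
           "XGBoosting CPU", "XGBoosting GPU"],
)

# A larger train/test gap points to stronger overfitting.
scores["gap"] = scores["train"] - scores["test"]
ranking = scores.sort_values("test", ascending=False)
print(ranking)
```

On these numbers all ensemble methods land within about half a percentage point of each other on the test set, while the single decision tree trails by roughly three points.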
