# Lab Assignment 1 (Experiments with the KNN Algorithm)

### Choosing the initial setup
#### Classification task
- Dataset: Heart Failure Prediction Dataset
- Link: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction/
- Description: the dataset contains various health indicators for patients.
- Possible task: predicting the risk of heart failure.
- Rationale: this practical task can be useful in medicine for early assessment of disease risk.
#### Regression task
- Dataset: Melbourne Housing Snapshot
- Link: https://www.kaggle.com/datasets/dansbecker/melbourne-housing-snapshot/
- Description: the dataset contains various data on real estate in Melbourne.
- Possible task: predicting the sale price of a property from its characteristics.
- Rationale: forecasting real estate prices matters to agents, buyers, banks, and market analysts.

### Choosing the metrics
For the classification task:
- Accuracy: the primary metric; it gives a simple way to compare models.
- Recall: a secondary metric. In a medical task it is important to minimize missed patients, and recall (sensitivity) shows what share of actual cases the model correctly identifies.

For the regression task:
- R-squared: the primary metric; convenient for comparing models.
- Mean Absolute Error (MAE): a secondary metric; it gives the average absolute deviation of the predictions, in the units of the target.
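
To make these definitions concrete, here is a minimal sketch that computes the four metrics by hand with NumPy. The helper names and the toy arrays are illustrative assumptions and are not taken from the datasets; the notebook itself uses the corresponding functions from `sklearn.metrics`.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Share of predictions that equal the true label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


def recall(y_true, y_pred, positive=1):
    """Share of actual positives that the model found: TP / (TP + FN)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == positive) & (y_pred == positive))
    fn = np.sum((y_true == positive) & (y_pred != positive))
    return float(tp / (tp + fn))


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot: share of target variance explained by the model."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def mae(y_true, y_pred):
    """Mean absolute deviation, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Toy data (made up for illustration).
labels_true = [1, 1, 1, 0, 0, 1]
labels_pred = [1, 0, 1, 0, 1, 1]
prices_true = [100.0, 200.0, 300.0]
prices_pred = [110.0, 190.0, 330.0]

print(accuracy(labels_true, labels_pred))   # 4 of 6 labels are correct
print(recall(labels_true, labels_pred))     # 3 of 4 positives are found
print(r_squared(prices_true, prices_pred))
print(mae(prices_true, prices_pred))
```

In this toy example one of the four actual positives is missed, which lowers recall to 0.75 even though two thirds of all labels are correct.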

### Building the baseline and evaluating its quality

Import the required libraries.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.metrics import (
    accuracy_score, classification_report, mean_absolute_error,
    mean_squared_error, r2_score, recall_score,
)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.utils.class_weight import compute_class_weight

Data preparation.

In [2]:
path = 'drive/MyDrive/ai_data/'
heart = pd.read_csv(path + 'heart.csv')
heart.shape, heart.columns

((918, 12),
 Index(['Age', 'Sex', 'ChestPainType', 'RestingBP', 'Cholesterol', 'FastingBS',
        'RestingECG', 'MaxHR', 'ExerciseAngina', 'Oldpeak', 'ST_Slope',
        'HeartDisease'],
       dtype='object'))

Let us inspect the table, check for missing values, and look at the class balance of HeartDisease.

In [3]:
display(heart.head())
display(heart.info())
display(heart.describe())
print("Missing per column:\n", heart.isna().sum())
print(heart['HeartDisease'].value_counts(normalize=True))

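|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |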


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


None

|   | Age | RestingBP | Cholesterol | FastingBS | MaxHR | Oldpeak | HeartDisease |
|---|---|---|---|---|---|---|---|
| count | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 |
| mean | 53.510893 | 132.396514 | 198.799564 | 0.233115 | 136.809368 | 0.887364 | 0.553377 |
| std | 9.432617 | 18.514154 | 109.384145 | 0.423046 | 25.460334 | 1.06657 | 0.497414 |
| min | 28.0 | 0.0 | 0.0 | 0.0 | 60.0 | -2.6 | 0.0 |
| 25% | 47.0 | 120.0 | 173.25 | 0.0 | 120.0 | 0.0 | 0.0 |
| 50% | 54.0 | 130.0 | 223.0 | 0.0 | 138.0 | 0.6 | 1.0 |
| 75% | 60.0 | 140.0 | 267.0 | 0.0 | 156.0 | 1.5 | 1.0 |
| max | 77.0 | 200.0 | 603.0 | 1.0 | 202.0 | 6.2 | 1.0 |


Missing per column:
 Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64
HeartDisease
1    0.553377
0    0.446623
Name: proportion, dtype: float64


We repeat the same steps for the second dataset.

In [4]:
melbourne = pd.read_csv(path + 'melb_data.csv')
melbourne.shape, melbourne.columns

((13580, 21),
 Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
        'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
        'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
        'Longtitude', 'Regionname', 'Propertycount'],
       dtype='object'))

In [5]:
display(melbourne.head())
display(melbourne.info())
display(melbourne.describe())
print(melbourne.isna().sum())
print(melbourne['Price'].describe())

|   | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbotsford | 85 Turner St | 2 | h | 1480000.0 | S | Biggin | 3/12/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 202.0 | NaN | NaN | Yarra | -37.7996 | 144.9984 | Northern Metropolitan | 4019.0 |
| 1 | Abbotsford | 25 Bloomburg St | 2 | h | 1035000.0 | S | Biggin | 4/02/2016 | 2.5 | 3067.0 | ... | 1.0 | 0.0 | 156.0 | 79.0 | 1900.0 | Yarra | -37.8079 | 144.9934 | Northern Metropolitan | 4019.0 |
| 2 | Abbotsford | 5 Charles St | 3 | h | 1465000.0 | SP | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 0.0 | 134.0 | 150.0 | 1900.0 | Yarra | -37.8093 | 144.9944 | Northern Metropolitan | 4019.0 |
| 3 | Abbotsford | 40 Federation La | 3 | h | 850000.0 | PI | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 1.0 | 94.0 | NaN | NaN | Yarra | -37.7969 | 144.9969 | Northern Metropolitan | 4019.0 |
| 4 | Abbotsford | 55a Park St | 4 | h | 1600000.0 | VB | Nelson | 4/06/2016 | 2.5 | 3067.0 | ... | 1.0 | 2.0 | 120.0 | 142.0 | 2014.0 | Yarra | -37.8072 | 144.9941 | Northern Metropolitan | 4019.0 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         13580 non-null  object 
 1   Address        13580 non-null  object 
 2   Rooms          13580 non-null  int64  
 3   Type           13580 non-null  object 
 4   Price          13580 non-null  float64
 5   Method         13580 non-null  object 
 6   SellerG        13580 non-null  object 
 7   Date           13580 non-null  object 
 8   Distance       13580 non-null  float64
 9   Postcode       13580 non-null  float64
 10  Bedroom2       13580 non-null  float64
 11  Bathroom       13580 non-null  float64
 12  Car            13518 non-null  float64
 13  Landsize       13580 non-null  float64
 14  BuildingArea   7130 non-null   float64
 15  YearBuilt      8205 non-null   float64
 16  CouncilArea    12211 non-null  object 
 17  Lattitude      13580 non-null  float64
 18  Longti

None

|   | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13518.0 | 13580.0 | 7130.0 | 8205.0 | 13580.0 | 13580.0 | 13580.0 |
| mean | 2.937997 | 1075684.0 | 10.137776 | 3105.301915 | 2.914728 | 1.534242 | 1.610075 | 558.416127 | 151.96765 | 1964.684217 | -37.809203 | 144.995216 | 7454.417378 |
| std | 0.955748 | 639310.7 | 5.868725 | 90.676964 | 0.965921 | 0.691712 | 0.962634 | 3990.669241 | 541.014538 | 37.273762 | 0.07926 | 0.103916 | 4378.581772 |
| min | 1.0 | 85000.0 | 0.0 | 3000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1196.0 | -38.18255 | 144.43181 | 249.0 |
| 25% | 2.0 | 650000.0 | 6.1 | 3044.0 | 2.0 | 1.0 | 1.0 | 177.0 | 93.0 | 1940.0 | -37.856822 | 144.9296 | 4380.0 |
| 50% | 3.0 | 903000.0 | 9.2 | 3084.0 | 3.0 | 1.0 | 2.0 | 440.0 | 126.0 | 1970.0 | -37.802355 | 145.0001 | 6555.0 |
| 75% | 3.0 | 1330000.0 | 13.0 | 3148.0 | 3.0 | 2.0 | 2.0 | 651.0 | 174.0 | 1999.0 | -37.7564 | 145.058305 | 10331.0 |
| max | 10.0 | 9000000.0 | 48.1 | 3977.0 | 20.0 | 8.0 | 10.0 | 433014.0 | 44515.0 | 2018.0 | -37.40853 | 145.52635 | 21650.0 |


Suburb              0
Address             0
Rooms               0
Type                0
Price               0
Method              0
SellerG             0
Date                0
Distance            0
Postcode            0
Bedroom2            0
Bathroom            0
Car                62
Landsize            0
BuildingArea     6450
YearBuilt        5375
CouncilArea      1369
Lattitude           0
Longtitude          0
Regionname          0
Propertycount       0
dtype: int64
count    1.358000e+04
mean     1.075684e+06
std      6.393107e+05
min      8.500000e+04
25%      6.500000e+05
50%      9.030000e+05
75%      1.330000e+06
max      9.000000e+06
Name: Price, dtype: float64


We drop the columns with many missing values (BuildingArea and YearBuilt) and fill the columns with few missing values: Car with its median and CouncilArea with its mode.

In [6]:
columns_to_drop = ['BuildingArea', 'YearBuilt']
melbourne_clean = melbourne.drop(columns=columns_to_drop)

car_median = melbourne_clean['Car'].median()
melbourne_clean['Car'] = melbourne_clean['Car'].fillna(car_median)

council_mode = melbourne_clean['CouncilArea'].mode()[0]
melbourne_clean['CouncilArea'] = melbourne_clean['CouncilArea'].fillna(council_mode)

print(melbourne_clean.isna().sum())

Suburb           0
Address          0
Rooms            0
Type             0
Price            0
Method           0
SellerG          0
Date             0
Distance         0
Postcode         0
Bedroom2         0
Bathroom         0
Car              0
Landsize         0
CouncilArea      0
Lattitude        0
Longtitude       0
Regionname       0
Propertycount    0
dtype: int64
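
In the cell above, the median and the mode are computed on the full table before the train/test split, so test rows contribute to the fill values. With only 62 missing Car values the effect is probably small, but a stricter variant derives the fill values from the training part only. The sketch below shows the idea on a made-up table; the toy values and the simple positional split are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up table with gaps in a numeric and a categorical column.
toy = pd.DataFrame({
    "Car": [1.0, 2.0, np.nan, 2.0, 4.0, 3.0, np.nan, 10.0],
    "CouncilArea": ["Yarra", "Moreland", "Yarra", np.nan,
                    "Yarra", "Moreland", np.nan, "Moreland"],
})

# Positional split: first six rows for training, last two for testing.
train, test = toy.iloc[:6].copy(), toy.iloc[6:].copy()

# Fill values are learned from the training rows only.
fill_values = {
    "Car": train["Car"].median(),
    "CouncilArea": train["CouncilArea"].mode()[0],
}

# The same values are then applied to both parts.
train = train.fillna(fill_values)
test = test.fillna(fill_values)

print(fill_values)
print(test)
```

Here the training median of Car is 2.0, while the median over the whole toy table would be 2.5 because of the large value in the test part; that difference is the information that would otherwise leak.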


Split the data into training and test sets:

In [7]:
X_heart = heart.drop('HeartDisease', axis=1)
y_heart = heart['HeartDisease']
X_heart_train, X_heart_test, y_heart_train, y_heart_test = train_test_split(
    X_heart, y_heart, test_size=0.2, random_state=42, stratify=y_heart
)


X_melbourne = melbourne_clean.drop('Price', axis=1)
y_melbourne = melbourne_clean['Price']
X_melbourne_train, X_melbourne_test, y_melbourne_train, y_melbourne_test = train_test_split(
    X_melbourne, y_melbourne, test_size=0.2, random_state=42
)

KNN computes distances between numeric feature vectors, so the categorical variables must be encoded first. We start from copies of the splits:

In [8]:
X_heart_train_processed = X_heart_train.copy()
X_heart_test_processed = X_heart_test.copy()
X_melbourne_train_processed = X_melbourne_train.copy()
X_melbourne_test_processed = X_melbourne_test.copy()

Encoding the categorical variables of the Heart dataset:

In [9]:
categorical_heart = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']
for col in categorical_heart:
    print(f"{col}: {X_heart_train[col].unique()}")

label_encoders_heart = {}
for col in categorical_heart:
    le = LabelEncoder()
    X_heart_train_processed[col] = le.fit_transform(X_heart_train[col])
    X_heart_test_processed[col] = le.transform(X_heart_test[col])
    label_encoders_heart[col] = le

print("\nResult:")
print(X_heart_train_processed[categorical_heart].head())
print(f"{X_heart_train_processed[categorical_heart].dtypes}")

Sex: ['M' 'F']
ChestPainType: ['ATA' 'ASY' 'NAP' 'TA']
RestingECG: ['ST' 'Normal' 'LVH']
ExerciseAngina: ['Y' 'N']
ST_Slope: ['Flat' 'Up' 'Down']

Result:
     Sex  ChestPainType  RestingECG  ExerciseAngina  ST_Slope
485    1              1           2               1         1
486    1              1           2               0         2
117    0              0           2               1         1
361    1              0           1               1         1
296    1              0           1               1         1
Sex               int64
ChestPainType     int64
RestingECG        int64
ExerciseAngina    int64
ST_Slope          int64
dtype: object
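
One caveat about the encoding above: `LabelEncoder` assigns integer codes in sorted order, so for ChestPainType the codes run from ASY=0 to TA=3, and a distance-based model then treats ASY and TA as three times farther apart than ASY and ATA. The scikit-learn documentation also describes `LabelEncoder` as intended for target values rather than input features. A possible alternative for nominal features is one-hot encoding, sketched below on a few made-up values; this is an illustration, not the preprocessing used in this notebook.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up ChestPainType values, shaped as a single-column feature matrix.
train = np.array([["ATA"], ["ASY"], ["NAP"], ["TA"], ["ASY"]], dtype=object)
test = np.array([["NAP"], ["ATA"]], dtype=object)

# handle_unknown="ignore" encodes unseen test categories as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train).toarray()
test_encoded = encoder.transform(test).toarray()

print(encoder.categories_[0])
print(test_encoded)

# With integer codes the gap depends on alphabetical position.
label_gap_asy_ata = abs(0 - 1)
label_gap_asy_ta = abs(0 - 3)

# With one-hot vectors every pair of distinct categories is equally distant.
onehot = np.eye(4)
onehot_gap_asy_ata = np.linalg.norm(onehot[0] - onehot[1])
onehot_gap_asy_ta = np.linalg.norm(onehot[0] - onehot[3])
print(label_gap_asy_ata, label_gap_asy_ta)
print(onehot_gap_asy_ata, onehot_gap_asy_ta)
```

Whether this changes the KNN metrics on the Heart dataset would have to be checked experimentally.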


Encoding the categorical variables of the Melbourne dataset. We apply label encoding to the low-cardinality features (Type, Method) and one-hot encoding to the higher-cardinality ones (CouncilArea, Regionname), then drop the non-numeric columns that we will not use.

In [10]:
categorical_melbourne = ['Type', 'Method', 'CouncilArea', 'Regionname']
for col in categorical_melbourne:
    print(f"{col}: {X_melbourne_train[col].nunique()}")

simple_categorical = ['Type', 'Method']
complex_categorical = ['CouncilArea', 'Regionname']

label_encoders_melbourne = {}
for col in simple_categorical:
    le = LabelEncoder()
    X_melbourne_train_processed[col] = le.fit_transform(X_melbourne_train[col])
    X_melbourne_test_processed[col] = le.transform(X_melbourne_test[col])
    label_encoders_melbourne[col] = le


council_dummies_train = pd.get_dummies(X_melbourne_train['CouncilArea'], prefix='Council')
region_dummies_train = pd.get_dummies(X_melbourne_train['Regionname'], prefix='Region')
council_dummies_test = pd.get_dummies(X_melbourne_test['CouncilArea'], prefix='Council')
region_dummies_test = pd.get_dummies(X_melbourne_test['Regionname'], prefix='Region')

council_dummies_test = council_dummies_test.reindex(columns=council_dummies_train.columns, fill_value=0)
region_dummies_test = region_dummies_test.reindex(columns=region_dummies_train.columns, fill_value=0)

X_melbourne_train_processed = X_melbourne_train_processed.drop(['CouncilArea', 'Regionname'], axis=1)
X_melbourne_test_processed = X_melbourne_test_processed.drop(['CouncilArea', 'Regionname'], axis=1)

X_melbourne_train_processed = pd.concat([X_melbourne_train_processed, council_dummies_train, region_dummies_train], axis=1)
X_melbourne_test_processed = pd.concat([X_melbourne_test_processed, council_dummies_test, region_dummies_test], axis=1)

columns_to_drop_melbourne = ['Suburb', 'Address', 'SellerG', 'Date']
X_melbourne_train_processed = X_melbourne_train_processed.drop(columns=columns_to_drop_melbourne, errors='ignore')
X_melbourne_test_processed = X_melbourne_test_processed.drop(columns=columns_to_drop_melbourne, errors='ignore')

print(f"Total: {X_melbourne_train_processed.shape[1]} features")

Type: 3
Method: 5
CouncilArea: 33
Regionname: 8
Total: 53 features


Final data shapes and dtypes:

In [11]:
print(f"Heart train: {X_heart_train_processed.shape}")
print(f"Heart test: {X_heart_test_processed.shape}")
print(f"Melbourne train: {X_melbourne_train_processed.shape}")
print(f"Melbourne test: {X_melbourne_test_processed.shape}")

print(f"Heart: {X_heart_train_processed.dtypes.unique()}")
print(f"Melbourne: {X_melbourne_train_processed.dtypes.unique()}")

Heart train: (734, 11)
Heart test: (184, 11)
Melbourne train: (10864, 53)
Melbourne test: (2716, 53)
Heart: [dtype('int64') dtype('float64')]
Melbourne: [dtype('int64') dtype('float64') dtype('bool')]


Scale the features:

In [12]:
scaler_heart = StandardScaler()
X_heart_train_scaled = scaler_heart.fit_transform(X_heart_train_processed)
X_heart_test_scaled = scaler_heart.transform(X_heart_test_processed)

scaler_melbourne = StandardScaler()
X_melbourne_train_scaled = scaler_melbourne.fit_transform(X_melbourne_train_processed)
X_melbourne_test_scaled = scaler_melbourne.transform(X_melbourne_test_processed)
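
Scaling matters for KNN because features with a larger numeric range dominate the distance. The following minimal sketch, on made-up (Age, Cholesterol) points, shows the nearest neighbor changing once both features are standardized; the points and the helper name `nearest_index` are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up (Age, Cholesterol) points; Cholesterol has the larger range.
train = np.array([[41.0, 300.0], [70.0, 240.0], [55.0, 180.0]])
query = np.array([[40.0, 250.0]])


def nearest_index(points, q):
    """Index of the row of `points` closest to `q` in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(points - q, axis=1)))


# Raw features: the 10-unit cholesterol gap outweighs the 30-year age gap.
nearest_raw = nearest_index(train, query)

# Standardized features: both columns contribute on a comparable scale.
scaler = StandardScaler().fit(train)
nearest_scaled = nearest_index(scaler.transform(train), scaler.transform(query))

print(nearest_raw, nearest_scaled)
```

On the raw values the closest point is the 70-year-old patient; after standardization it is the 41-year-old one.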

#### Training the models

In [13]:
knn_classifier = KNeighborsClassifier(n_neighbors=5)
knn_classifier.fit(X_heart_train_scaled, y_heart_train)

y_heart_pred = knn_classifier.predict(X_heart_test_scaled)
y_heart_pred_proba = knn_classifier.predict_proba(X_heart_test_scaled)[:, 1]

knn_regressor = KNeighborsRegressor(n_neighbors=5)
knn_regressor.fit(X_melbourne_train_scaled, y_melbourne_train)

y_melbourne_pred = knn_regressor.predict(X_melbourne_test_scaled)

#### Evaluating model quality

In [14]:
accuracy = accuracy_score(y_heart_test, y_heart_pred)
recall = recall_score(y_heart_test, y_heart_pred)

print("Classification")
print(f"Accuracy: {accuracy:.4f}")
print(f"Recall:   {recall:.4f}")

r2 = r2_score(y_melbourne_test, y_melbourne_pred)
mae = mean_absolute_error(y_melbourne_test, y_melbourne_pred)

print("Regression")
print(f"R-squared: {r2:.4f}")
print(f"MAE:       {mae:,.2f}")

Classification
Accuracy: 0.8913
Recall:   0.9118
Regression
R-squared: 0.7046
MAE:       215,821.60


### Improving the baseline

Hypotheses
1) Tuning the number of neighbors (n_neighbors)  
The baseline uses k=5, but this value may not be optimal.

2) Distance-weighted voting will improve prediction quality by giving more weight to closer objects.

3) Using cross-validation when selecting hyperparameters will lead to a better result.

Testing the hypotheses:

In [15]:
param_grid_class = {
    'n_neighbors': range(1, 31),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

grid_class = GridSearchCV(
    KNeighborsClassifier(),
    param_grid_class,
    cv=5,
    scoring=['accuracy', 'recall'],
    refit='accuracy',
    n_jobs=-1
)
grid_class.fit(X_heart_train_scaled, y_heart_train)

print("Best parameters for classification:")
print(grid_class.best_params_)
print(f"Best cross-validation accuracy: {grid_class.best_score_:.4f}")


param_grid_reg = {
    'n_neighbors': range(1, 31),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

grid_reg = GridSearchCV(
    KNeighborsRegressor(),
    param_grid_reg,
    cv=5,
    scoring=['r2', 'neg_mean_absolute_error'],
    refit='r2',
    n_jobs=-1
)
grid_reg.fit(X_melbourne_train_scaled, y_melbourne_train)

print("Best parameters for regression:")
print(grid_reg.best_params_)
print(f"Best cross-validation R2: {grid_reg.best_score_:.4f}")

Best parameters for classification:
{'metric': 'manhattan', 'n_neighbors': 15, 'weights': 'uniform'}
Best cross-validation accuracy: 0.8637
Best parameters for regression:
{'metric': 'manhattan', 'n_neighbors': 16, 'weights': 'distance'}
Best cross-validation R2: 0.7139
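
A note on the search setup: the scaler was fitted on the whole training set before cross-validation, so each validation fold influenced the scaling statistics. A common way to avoid this is to put the scaler and the model into one `Pipeline` and search over that, so the scaler is refit on the training part of every fold. The sketch below illustrates the pattern on synthetic data from `make_classification` with a reduced grid; the data, the step names `scaler` and `knn`, and the grid values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an unscaled training set.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Scaling becomes part of the estimator, so it is refit inside each CV fold.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Pipeline parameters are addressed as "<step name>__<parameter>".
param_grid = {
    "knn__n_neighbors": [3, 5, 9, 15],
    "knn__weights": ["uniform", "distance"],
    "knn__metric": ["euclidean", "manhattan"],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(f"{search.best_score_:.4f}")
```

With this layout the fitted `search.best_estimator_` accepts unscaled test features directly, because the scaler travels with the model.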


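The search selected values that differ from the baseline defaults (k=15-16 instead of k=5, Manhattan distance), which supports hypothesis 1. Hypothesis 2 held only for regression: the best classifier uses uniform weights. Whether the tuned models actually beat the baseline is checked on the test set below. We build the improved baseline from the best estimators found: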

In [16]:
best_knn_classifier = grid_class.best_estimator_
best_knn_regressor = grid_reg.best_estimator_

y_heart_pred_improved = best_knn_classifier.predict(X_heart_test_scaled)
y_melbourne_pred_improved = best_knn_regressor.predict(X_melbourne_test_scaled)

We evaluate the improved baseline and compare it with the original baseline:

In [17]:
accuracy_improved = accuracy_score(y_heart_test, y_heart_pred_improved)
recall_improved = recall_score(y_heart_test, y_heart_pred_improved)

r2_improved = r2_score(y_melbourne_test, y_melbourne_pred_improved)
mae_improved = mean_absolute_error(y_melbourne_test, y_melbourne_pred_improved)

print("\nClassification:")
print(f"Baseline - Accuracy: {accuracy:.4f}, Recall: {recall:.4f}")
print(f"Improved - Accuracy: {accuracy_improved:.4f}, Recall: {recall_improved:.4f}")

print("\nRegression:")
print(f"Baseline - R2: {r2:.4f}, MAE: {mae:,.2f}")
print(f"Improved - R2: {r2_improved:.4f}, MAE: {mae_improved:,.2f}")

accuracy_diff = accuracy_improved - accuracy
recall_diff = recall_improved - recall
r2_diff = r2_improved - r2
mae_diff = mae_improved - mae


print(f"Accuracy: {accuracy_diff:+.4f}")
print(f"Recall: {recall_diff:+.4f}")
print(f"R2: {r2_diff:+.4f}")
print(f"MAE: {mae_diff:+,.2f}")


Classification:
Baseline - Accuracy: 0.8913, Recall: 0.9118
Improved - Accuracy: 0.9130, Recall: 0.9314

Regression:
Baseline - R2: 0.7046, MAE: 215,821.60
Improved - R2: 0.7452, MAE: 200,354.22
Accuracy: +0.0217
Recall: +0.0196
R2: +0.0406
MAE: -15,467.39


With the improved baseline the quality of both models increased on the test set (accuracy +0.0217, recall +0.0196, R2 +0.0406, MAE -15,467.39). Because k, the metric, and the weighting changed together, the gain cannot be attributed to any one of them alone, but the results suggest the following:
- The number of neighbors matters: the search preferred k=15-16 over the default k=5.
- The choice of distance metric affects the result: Manhattan distance performed best for both tasks.
- The weighting strategy depends on the task: distance weighting worked better for regression, while uniform weighting gave the best result for classification.
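
The difference between the two weighting strategies can be seen in a small worked example. The sketch below takes three made-up neighbors, one very close and two farther away, and compares uniform and inverse-distance weighting for both voting and averaging; all numbers are illustrative assumptions.

```python
import numpy as np

# Three made-up neighbors: one close, two farther away.
distances = np.array([0.5, 2.0, 2.5])
labels = np.array([1, 0, 0])              # class labels for classification
values = np.array([100.0, 200.0, 300.0])  # target values for regression

# Inverse-distance weights: closer neighbors count more.
weights = 1.0 / distances                 # 2.0, 0.5, 0.4

# Classification: majority vote versus weighted vote.
uniform_vote = int(np.bincount(labels).argmax())
weighted_vote = int(np.bincount(labels, weights=weights).argmax())

# Regression: plain mean versus weighted mean.
uniform_mean = float(values.mean())
weighted_mean = float(np.average(values, weights=weights))

print(uniform_vote, weighted_vote)
print(uniform_mean, round(weighted_mean, 2))
```

Under uniform weights the two distant neighbors outvote the close one (class 0, mean 200), while under distance weights the close neighbor dominates (class 1, mean of about 144.83).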

### Implementing the machine learning algorithm

KNN implemented as classes for classification and regression:

In [18]:
class CustomKNNClassifier:
    def __init__(self, n_neighbors=5, weights='uniform', metric='euclidean'):
        self.n_neighbors = n_neighbors
        self.weights = weights
        self.metric = metric
        self.X_train = None
        self.y_train = None

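    def _calculate_distance(self, x1, x2):
        # Distance between two feature vectors under the configured metric.
        if self.metric == 'euclidean':
            return np.sqrt(np.sum((x1 - x2) ** 2))
        elif self.metric == 'manhattan':
            return np.sum(np.abs(x1 - x2))
        # Fail loudly instead of silently returning None.
        raise ValueError(f"Unsupported metric: {self.metric!r}")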

    def fit(self, X, y):
        self.X_train = np.array(X)
        self.y_train = np.array(y)
        return self

    def predict(self, X):
        X = np.array(X)
        predictions = []

        for x in X:
            distances = []
            for i in range(len(self.X_train)):
                dist = self._calculate_distance(x, self.X_train[i])
                distances.append((dist, self.y_train[i]))

            distances.sort(key=lambda pair: pair[0])  # nearest first
            neighbors = distances[:self.n_neighbors]

            if self.weights == 'uniform':
                neighbor_labels = [label for _, label in neighbors]
                prediction = max(set(neighbor_labels), key=neighbor_labels.count)
            else:
                label_weights = {}
                for dist, label in neighbors:
                    weight = 1 / (dist + 1e-8)  # epsilon avoids division by zero
                    if label in label_weights:
                        label_weights[label] += weight
                    else:
                        label_weights[label] = weight
                prediction = max(label_weights, key=label_weights.get)

            predictions.append(prediction)

        return np.array(predictions)


class CustomKNNRegressor:
    def __init__(self, n_neighbors=5, weights='uniform', metric='euclidean'):
        self.n_neighbors = n_neighbors
        self.weights = weights
        self.metric = metric
        self.X_train = None
        self.y_train = None

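    def _calculate_distance(self, x1, x2):
        # Distance between two feature vectors under the configured metric.
        if self.metric == 'euclidean':
            return np.sqrt(np.sum((x1 - x2) ** 2))
        elif self.metric == 'manhattan':
            return np.sum(np.abs(x1 - x2))
        # Fail loudly instead of silently returning None.
        raise ValueError(f"Unsupported metric: {self.metric!r}")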

    def fit(self, X, y):
        self.X_train = np.array(X)
        self.y_train = np.array(y)
        return self

    def predict(self, X):
        X = np.array(X)
        predictions = []

        for x in X:
            distances = []
            for i in range(len(self.X_train)):
                dist = self._calculate_distance(x, self.X_train[i])
                distances.append((dist, self.y_train[i]))

            distances.sort(key=lambda pair: pair[0])  # nearest first
            neighbors = distances[:self.n_neighbors]

            if self.weights == 'uniform':
                neighbor_values = [value for _, value in neighbors]
                prediction = np.mean(neighbor_values)
            else:
                total_weight = 0
                weighted_sum = 0
                for dist, value in neighbors:
                    weight = 1 / (dist + 1e-8)  # epsilon avoids division by zero
                    weighted_sum += weight * value
                    total_weight += weight
                prediction = weighted_sum / total_weight

            predictions.append(prediction)

        return np.array(predictions)
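
The classes above compute distances one pair at a time in Python loops, which is easy to follow but slow on a dataset the size of Melbourne. As a possible optimization, the sketch below computes distances for a batch of queries at once with NumPy broadcasting. The function names `pairwise_distances` and `knn_regress`, the `batch_size` parameter, and the random data are hypothetical and only cover uniform-weight regression; this is not the implementation evaluated in this notebook.

```python
import numpy as np


def pairwise_distances(X_query, X_train, metric="euclidean"):
    """Distance matrix of shape (n_query, n_train), built by broadcasting."""
    diff = X_query[:, None, :] - X_train[None, :, :]
    if metric == "euclidean":
        return np.sqrt(np.sum(diff ** 2, axis=2))
    if metric == "manhattan":
        return np.sum(np.abs(diff), axis=2)
    raise ValueError(f"Unsupported metric: {metric!r}")


def knn_regress(X_train, y_train, X_query, n_neighbors=5,
                metric="euclidean", batch_size=256):
    """Uniform-weight KNN regression, processing queries in batches."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    predictions = []
    # Batching bounds the size of the (batch, n_train, n_features) tensor.
    for start in range(0, len(X_query), batch_size):
        batch = X_query[start:start + batch_size]
        dist = pairwise_distances(batch, X_train, metric)
        # Indices of the k smallest distances in every row.
        nearest = np.argsort(dist, axis=1, kind="stable")[:, :n_neighbors]
        predictions.append(y_train[nearest].mean(axis=1))
    return np.concatenate(predictions)


# Random data standing in for scaled features (illustration only).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))
y_train = rng.normal(size=50)
X_query = rng.normal(size=(4, 3))

predictions = knn_regress(X_train, y_train, X_query, n_neighbors=5)
print(predictions)
```

Batching is a deliberate choice here: for roughly 2,700 queries, 10,900 training rows, and 53 features, a single difference tensor would hold about 1.5 billion floats, which is unlikely to fit in memory.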

Let us look at the results and compare them with the original baseline:

In [19]:
custom_knn_classifier = CustomKNNClassifier(n_neighbors=5, weights='uniform', metric='euclidean')
custom_knn_classifier.fit(X_heart_train_scaled, y_heart_train)
y_heart_pred_custom = custom_knn_classifier.predict(X_heart_test_scaled)

custom_knn_regressor = CustomKNNRegressor(n_neighbors=5, weights='uniform', metric='euclidean')
custom_knn_regressor.fit(X_melbourne_train_scaled, y_melbourne_train)
y_melbourne_pred_custom = custom_knn_regressor.predict(X_melbourne_test_scaled)

accuracy_custom = accuracy_score(y_heart_test, y_heart_pred_custom)
recall_custom = recall_score(y_heart_test, y_heart_pred_custom)

r2_custom = r2_score(y_melbourne_test, y_melbourne_pred_custom)
mae_custom = mean_absolute_error(y_melbourne_test, y_melbourne_pred_custom)

print("Custom implementation:")
print(f"Classification - Accuracy: {accuracy_custom:.4f}, Recall: {recall_custom:.4f}")
print(f"Regression - R2: {r2_custom:.4f}, MAE: {mae_custom:,.2f}")

print("\nComparison with the original baseline:")
print(f"Classification - Sklearn Accuracy: {accuracy:.4f}, Custom Accuracy: {accuracy_custom:.4f}")
print(f"Regression - Sklearn R2: {r2:.4f}, Custom R2: {r2_custom:.4f}")

Custom implementation:
Classification - Accuracy: 0.8913, Recall: 0.9118
Regression - R2: 0.7046, MAE: 215,821.97

Comparison with the original baseline:
Classification - Sklearn Accuracy: 0.8913, Custom Accuracy: 0.8913
Regression - Sklearn R2: 0.7046, Custom R2: 0.7046
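
Metrics rounded to four decimals can hide small disagreements between two implementations: the custom regressor's MAE above is 215,821.97, while the sklearn baseline reported 215,821.60. A more direct check is to compare the prediction arrays element by element. The helper `compare_predictions` and the toy arrays below are hypothetical and only sketch the idea.

```python
import numpy as np


def compare_predictions(reference, candidate, atol=1e-6):
    """Count predictions that differ by more than `atol` and report the largest gap."""
    reference = np.asarray(reference, dtype=float)
    candidate = np.asarray(candidate, dtype=float)
    abs_diff = np.abs(reference - candidate)
    return {
        "n_different": int(np.sum(abs_diff > atol)),
        "max_abs_diff": float(abs_diff.max()),
    }


# Toy predictions (made up): the two arrays disagree on one sample.
reference = np.array([100.0, 250.0, 300.0, 410.0])
candidate = np.array([100.0, 250.0, 305.0, 410.0])

report = compare_predictions(reference, candidate)
print(report)
```

In the notebook the same helper could be called on `y_melbourne_pred` and `y_melbourne_pred_custom` to see how many test rows actually differ.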


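The metrics match the first baseline at the reported precision, except for a small difference in the regression MAE (215,821.97 versus 215,821.60 for sklearn); one possible cause is a different choice among equidistant neighbors.  
Now we apply the parameters from the improved baseline to the custom models: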

In [20]:
custom_knn_classifier_improved = CustomKNNClassifier(
    n_neighbors=15,
    weights='uniform',
    metric='manhattan'
)
custom_knn_classifier_improved.fit(X_heart_train_scaled, y_heart_train)
y_heart_pred_custom_improved = custom_knn_classifier_improved.predict(X_heart_test_scaled)

custom_knn_regressor_improved = CustomKNNRegressor(
    n_neighbors=16,
    weights='distance',
    metric='manhattan'
)
custom_knn_regressor_improved.fit(X_melbourne_train_scaled, y_melbourne_train)
y_melbourne_pred_custom_improved = custom_knn_regressor_improved.predict(X_melbourne_test_scaled)

accuracy_custom_improved = accuracy_score(y_heart_test, y_heart_pred_custom_improved)
recall_custom_improved = recall_score(y_heart_test, y_heart_pred_custom_improved)

r2_custom_improved = r2_score(y_melbourne_test, y_melbourne_pred_custom_improved)
mae_custom_improved = mean_absolute_error(y_melbourne_test, y_melbourne_pred_custom_improved)

print("Custom implementation with improved parameters:")
print(f"Classification - Accuracy: {accuracy_custom_improved:.4f}, Recall: {recall_custom_improved:.4f}")
print(f"Regression - R2: {r2_custom_improved:.4f}, MAE: {mae_custom_improved:,.2f}")

print("\nComparison with the improved baseline:")
print(f"Classification - Sklearn Improved Accuracy: {accuracy_improved:.4f}, Custom Improved Accuracy: {accuracy_custom_improved:.4f}")
print(f"Regression - Sklearn Improved R2: {r2_improved:.4f}, Custom Improved R2: {r2_custom_improved:.4f}")

Custom implementation with improved parameters:
Classification - Accuracy: 0.9130, Recall: 0.9314
Regression - R2: 0.7452, MAE: 200,354.22

Comparison with the improved baseline:
Classification - Sklearn Improved Accuracy: 0.9130, Custom Improved Accuracy: 0.9130
Regression - Sklearn Improved R2: 0.7452, Custom Improved R2: 0.7452


### Conclusions

The comparison shows that the custom implementation and sklearn deliver the same quality at the reported precision: with the tuned parameters all metrics coincide, and with the default parameters only the regression MAE differs slightly (215,821.97 versus 215,821.60). The improved baselines achieved somewhat higher quality than the original baseline.