# Binary classifier

## Dataset description

Per the assigned variant, this work uses the dataset ["Estimating the likelihood of a diabetes diagnosis"](https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset). The original dataset has 3 classes: 0 (no diabetes), 1 (prediabetes), and 2 (diabetes). Because the classifier is binary, classes 1 and 2 are merged into one. To download the dataset from Kaggle, we install its API packages.

In [None]:
pip install kaggle kagglehub kagglehub[hf-datasets]

Collecting datasets (from kagglehub[hf-datasets])
  Downloading datasets-3.3.2-py3-none-any.whl.metadata (19 kB)
Collecting xxhash (from datasets->kagglehub[hf-datasets])
  Downloading xxhash-3.5.0-cp312-cp312-win_amd64.whl.metadata (13 kB)
Collecting multiprocess<0.70.17 (from datasets->kagglehub[hf-datasets])
  Downloading multiprocess-0.70.16-py312-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.3.2-py3-none-any.whl (485 kB)
Downloading multiprocess-0.70.16-py312-none-any.whl (146 kB)
Downloading xxhash-3.5.0-cp312-cp312-win_amd64.whl (30 kB)
Installing collected packages: xxhash, multiprocess, datasets
Successfully installed datasets-3.3.2 multiprocess-0.70.16 xxhash-3.5.0
Note: you may need to restart the kernel to use updated packages.


In [None]:
# import os
# os.environ['KAGGLE_USERNAME'] = "USERNAME" # username from the json file
# os.environ['KAGGLE_KEY'] = "KEY" # key from the json file

In [None]:
import kaggle

# Dataset name
dataset_name = "alexteboul/diabetes-health-indicators-dataset"

# Download and unzip into the current folder
kaggle.api.dataset_download_files(dataset_name, path="./", unzip=True)


Dataset URL: https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset


In [None]:
import pandas as pd

df = pd.read_csv('diabetes_012_health_indicators_BRFSS2015.csv')
df.head()

Unnamed: 0,Diabetes_012,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0.0,1.0,1.0,1.0,40.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,5.0,18.0,15.0,1.0,0.0,9.0,4.0,3.0
1,0.0,0.0,0.0,0.0,25.0,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,3.0,0.0,0.0,0.0,0.0,7.0,6.0,1.0
2,0.0,1.0,1.0,1.0,28.0,0.0,0.0,0.0,0.0,1.0,...,1.0,1.0,5.0,30.0,30.0,1.0,0.0,9.0,4.0,8.0
3,0.0,1.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,11.0,3.0,6.0
4,0.0,1.0,1.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,3.0,0.0,0.0,0.0,11.0,5.0,4.0


Next, we collapse the three classes into two.

In [None]:
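# Class counts before merging
print(df['Diabetes_012'].value_counts())
# Merge prediabetes (1) and diabetes (2) into a single positive class
df['Diabetes_012'] = df['Diabetes_012'].apply(lambda x: 1 if x > 0 else 0)
# Double quotes outside: reusing the same quote inside an f-string needs Python 3.12+
print(f"New column:\n{df['Diabetes_012'].value_counts()}")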

Diabetes_012
0.0    213703
2.0     35346
1.0      4631
Name: count, dtype: int64
New column:
Diabetes_012
0    213703
1     39977
Name: count, dtype: int64
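
The same merge can also be written as a vectorized comparison instead of a per-row `apply`. The sketch below uses a small hypothetical Series (`labels`) in place of the real `Diabetes_012` column, so the values are illustrative only.

```python
import pandas as pd

# Toy labels standing in for the Diabetes_012 column (hypothetical values)
labels = pd.Series([0.0, 1.0, 2.0, 0.0, 2.0])

# Any non-zero class (prediabetes or diabetes) becomes the positive class
binary = (labels > 0).astype(int)

print(binary.tolist())
```

On a column with a few hundred thousand rows the vectorized form usually runs faster than `apply` with a lambda, while producing the same 0/1 labels.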


In [None]:
from sklearn.model_selection import train_test_split

y = df['Diabetes_012']
X = df.drop(['Diabetes_012'], axis = 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
print(f'X_train shape: {X_train.shape}\n y_train shape: {y_train.shape}\n X_test shape: {X_test.shape}\n y_test shape: {y_test.shape}')

X_train shape: (202944, 21)
 y_train shape: (202944,)
 X_test shape: (50736, 21)
 y_test shape: (50736,)
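
The classes are imbalanced, which is why the split above passes `stratify=y`. As a rough illustration of what that argument does, the sketch below splits hypothetical toy labels (`y_toy`, 20% positives) and checks that both parts keep the same share of the positive class.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 zeros and 20 ones
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.2, random_state=42
)

# With stratify, the positive-class share is preserved in both parts
print(y_tr.mean(), y_te.mean())
```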


We fit MinMaxScaler on the training set only, so that no information from the test set leaks into training.

In [None]:
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()

X_train = min_max_scaler.fit_transform(X_train)
X_test = min_max_scaler.transform(X_test)
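
To make the effect of fitting on the training set concrete, here is a minimal sketch on hypothetical one-column arrays (`train`, `test`). The scaler learns the minimum and maximum from `train` alone, so a test value outside that range maps outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: training range is 0..10, test value lies above it
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[12.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=0, max=10 from train
test_scaled = scaler.transform(test)        # reuses the training min/max

print(train_scaled.ravel(), test_scaled.ravel())
```

Values slightly outside [0, 1] in the test set are expected with this approach and are the price of keeping the test data unseen during fitting.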

# Multiclass classifier

## Dataset description

Per the assigned variant, this work uses the dataset ["Physical performance assessment for people of different ages"](https://www.kaggle.com/datasets/kukuroo3/body-performance-data). Each person in this dataset is assigned one of four physical performance grades: A, B, C, or D. This grade is the target we predict.

In [2]:
import kaggle

# Dataset name
dataset_name = "kukuroo3/body-performance-data"

# Download and unzip into the current folder
kaggle.api.dataset_download_files(dataset_name, path="./", unzip=True)


Dataset URL: https://www.kaggle.com/datasets/kukuroo3/body-performance-data


In [11]:
import pandas as pd

df = pd.read_csv('bodyPerformance.csv')
df.head()

Unnamed: 0,age,gender,height_cm,weight_kg,body fat_%,diastolic,systolic,gripForce,sit and bend forward_cm,sit-ups counts,broad jump_cm,class
0,27.0,M,172.3,75.24,21.3,80.0,130.0,54.9,18.4,60.0,217.0,C
1,25.0,M,165.0,55.8,15.7,77.0,126.0,36.4,16.3,53.0,229.0,A
2,31.0,M,179.6,78.0,20.1,92.0,152.0,44.8,12.0,49.0,181.0,C
3,32.0,M,174.5,71.1,18.4,76.0,147.0,41.4,15.2,53.0,219.0,B
4,28.0,M,173.8,67.7,17.1,70.0,127.0,43.5,27.1,45.0,217.0,B


We recode the "gender" column from M/F to 1/0 using a label encoder.

In [12]:
from sklearn.preprocessing import LabelEncoder

print(df['gender'].unique())
le = LabelEncoder()
df['gender'] = le.fit_transform(df['gender'])
print(df['gender'].unique())

['M' 'F']
[1 0]


As a result, male is encoded as 1 and female as 0, because LabelEncoder assigns codes in sorted order of the labels (F before M).
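
The mapping can be checked directly on the fitted encoder. The sketch below uses a hypothetical toy list of labels and a separate encoder (`le_toy`) so it does not touch the notebook's `le`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels in the same M/F format as the gender column
le_toy = LabelEncoder()
codes = le_toy.fit_transform(['M', 'F', 'M'])

print(list(le_toy.classes_))                      # learned classes, sorted
print(codes.tolist())                             # integer code per input label
print(le_toy.inverse_transform([0, 1]).tolist())  # codes mapped back to labels
```

The position of a label in `classes_` is its integer code, and `inverse_transform` recovers the original labels when they are needed for reporting.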

In [13]:
from sklearn.model_selection import train_test_split

y = df['class']
X = df.drop(['class'], axis = 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
print(f'X_train shape: {X_train.shape}\n y_train shape: {y_train.shape}\n X_test shape: {X_test.shape}\n y_test shape: {y_test.shape}')

X_train shape: (10714, 11)
 y_train shape: (10714,)
 X_test shape: (2679, 11)
 y_test shape: (2679,)


We fit MinMaxScaler on the training set only, so that no information from the test set leaks into training.

In [None]:
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()

X_train = min_max_scaler.fit_transform(X_train)
X_test = min_max_scaler.transform(X_test)


# Regressor

## Dataset description

Per the assigned variant, this work uses the [Bike Sharing](https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset) dataset. It contains hourly and daily counts of rented bikes in the Capital Bikeshare system for 2011 through 2012, together with the corresponding weather and seasonal information. The task in this work is to predict the hourly number of rented bikes.

In [1]:
pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7
Note: you may need to restart the kernel to use updated packages.


In [5]:
import pandas as pd
from ucimlrepo import fetch_ucirepo

# Load the dataset
bike_sharing = fetch_ucirepo(id=275)

# Convert to a DataFrame
df = pd.DataFrame(bike_sharing.data.features, columns=bike_sharing.metadata.features)
df['target'] = bike_sharing.data.targets  # If a target variable is present

# Check the result
print(df.head())


       dteday  season  yr  mnth  hr  holiday  weekday  workingday  weathersit  \
0  2011-01-01       1   0     1   0        0        6           0           1   
1  2011-01-01       1   0     1   1        0        6           0           1   
2  2011-01-01       1   0     1   2        0        6           0           1   
3  2011-01-01       1   0     1   3        0        6           0           1   
4  2011-01-01       1   0     1   4        0        6           0           1   

   temp   atemp   hum  windspeed  target  
0  0.24  0.2879  0.81        0.0      16  
1  0.22  0.2727  0.80        0.0      40  
2  0.22  0.2727  0.80        0.0      32  
3  0.24  0.2879  0.75        0.0      13  
4  0.24  0.2879  0.75        0.0       1  


In [9]:
df.columns

Index(['dteday', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday',
       'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed',
       'target'],
      dtype='object')

In [7]:
df['weekday'].unique()

array([6, 0, 1, 2, 3, 4, 5], dtype=int64)

Since we are not going to use models with memory, we drop the date from this dataset.
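
One detail worth keeping in mind when removing a column: `DataFrame.drop` returns a new frame and leaves the original untouched unless the result is assigned back. A minimal sketch on a hypothetical one-row frame (`toy`):

```python
import pandas as pd

# Hypothetical toy frame with the same kind of columns as the bike data
toy = pd.DataFrame({'dteday': ['2011-01-01'], 'hr': [0], 'target': [16]})

toy.drop('dteday', axis=1)          # result discarded, toy is unchanged
assert 'dteday' in toy.columns

toy = toy.drop('dteday', axis=1)    # assign back to keep the change
print(list(toy.columns))
```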

In [11]:
df = df.drop('dteday', axis=1)  # assign back: drop returns a new DataFrame
df

Unnamed: 0,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,target
0,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0000,16
1,1,0,1,1,0,6,0,1,0.22,0.2727,0.80,0.0000,40
2,1,0,1,2,0,6,0,1,0.22,0.2727,0.80,0.0000,32
3,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0000,13
4,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0000,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
17374,1,1,12,19,0,1,1,2,0.26,0.2576,0.60,0.1642,119
17375,1,1,12,20,0,1,1,2,0.26,0.2576,0.60,0.1642,89
17376,1,1,12,21,0,1,1,1,0.26,0.2576,0.60,0.1642,90
17377,1,1,12,22,0,1,1,1,0.26,0.2727,0.56,0.1343,61


In [14]:
from sklearn.model_selection import train_test_split

y = df['target']
X = df.drop(['target'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f'X_train shape: {X_train.shape}\n y_train shape: {y_train.shape}\n X_test shape: {X_test.shape}\n y_test shape: {y_test.shape}')

X_train shape: (13903, 12)
 y_train shape: (13903,)
 X_test shape: (3476, 12)
 y_test shape: (3476,)
