### scikit-learn
Library website: [https://scikit-learn.org](https://scikit-learn.org)  

Documentation / User Guide: [https://scikit-learn.org/stable/user_guide.html](https://scikit-learn.org/stable/user_guide.html)

The core machine learning library for Python.

To install scikit-learn, run the command below:
```
!pip install scikit-learn
```
To upgrade scikit-learn to the latest version, run the command below:
```
!pip install --upgrade scikit-learn
```
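This course was originally created with version `0.22.1`. The OneHotEncoder cells below use `sparse_output` (scikit-learn 1.2 or later) and `get_feature_names_out` (1.0 or later), so they require a newer release.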

### Data preprocessing:
1. [Importing libraries](#0)
2. [Generating the data](#1)
3. [Creating a copy of the data](#2)
4. [Changing data types and initial exploration](#3)
5. [LabelEncoder](#4)
6. [OneHotEncoder](#5)
7. [Pandas *get_dummies()*](#6)
8. [Standardization - StandardScaler](#7)
9. [Preparing the data for a model](#8)



### <a name='0'></a> Importing libraries

In [None]:
import numpy as np
import pandas as pd
import sklearn

sklearn.__version__

### <a name='1'></a> Generating the data

In [None]:
data = {
    'size': ['XL', 'L', 'M', 'L', 'M'],
    'color': ['red', 'green', 'blue', 'green', 'red'],
    'gender': ['female', 'male', 'male', 'female', 'female'],
    'price': [199.0, 89.0, 99.0, 129.0, 79.0],
    'weight': [500, 450, 300, 380, 410],
    'bought': ['yes', 'no', 'yes', 'no', 'yes']
}

df_raw = pd.DataFrame(data=data)
df_raw

### <a name='2'></a> Creating a copy of the data



In [None]:
df = df_raw.copy()
df.info()

### <a name='3'></a> Changing data types and initial exploration



In [None]:
for col in ['size', 'color', 'gender', 'bought']:
    df[col] = df[col].astype('category')

df['weight'] = df['weight'].astype('float')

df.info()

In [None]:
df.describe()

In [None]:
df.describe().T

In [None]:
df.describe(include=['category']).T

In [None]:
df

### <a name='4'></a> LabelEncoder



In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(df['bought'])
le.transform(df['bought'])

In [None]:
le.fit_transform(df['bought'])

In [None]:
le.classes_

In [None]:
df['bought'] = le.fit_transform(df['bought'])
df

In [None]:
le.inverse_transform(df['bought'])

In [None]:
df['bought'] = le.inverse_transform(df['bought'])
df
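
`LabelEncoder` sorts the class labels and encodes each one as its index in `classes_`. The sketch below is a minimal, standalone illustration that reuses the `bought` values from the data above; the `mapping` dictionary is a helper introduced here only for display.

```python
from sklearn.preprocessing import LabelEncoder

bought = ['yes', 'no', 'yes', 'no', 'yes']

le = LabelEncoder()
codes = le.fit_transform(bought)

# classes_ is sorted, and every label is encoded as its position in classes_
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}

print(mapping)         # {'no': 0, 'yes': 1}
print(codes.tolist())  # [1, 0, 1, 0, 1]

# inverse_transform maps the integer codes back to the original labels
restored = le.inverse_transform(codes).tolist()
```

Because the codes follow alphabetical order, they carry no meaning beyond identity, which is why the encoder is applied here to the target column only.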

### <a name='5'></a> OneHotEncoder

In [None]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)
encoder.fit(df[['size']])

In [None]:
encoder.transform(df[['size']])

In [None]:
encoder.categories_

In [None]:
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoder.fit(df[['size']])
encoder.transform(df[['size']])

In [None]:
encoder.categories_

In [None]:
df

In [None]:
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoder.fit(df[['size']])
ohe_array = encoder.transform(df[['size']])  # numpy ndarray

cols = encoder.get_feature_names_out(['size'])
df_ohe = pd.DataFrame(ohe_array, columns=cols, index=df.index)

# replace the 'size' column with the one-hot columns
df = df.drop(columns=['size']).join(df_ohe)
df
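
With `drop='first'`, the encoder removes the first entry of `categories_` for each feature, and a row of all zeros then stands for that dropped category. The sketch below is a standalone illustration on the `size` values from above. As an assumption made for portability, it calls `.toarray()` on the default sparse output instead of passing `sparse_output=False`, so it does not depend on a particular scikit-learn release.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

sizes = pd.DataFrame({'size': ['XL', 'L', 'M', 'L', 'M']})

encoder = OneHotEncoder(drop='first')
# the default output is a sparse matrix; toarray() converts it to a dense ndarray
encoded = encoder.fit_transform(sizes).toarray()

categories = encoder.categories_[0].tolist()   # sorted category labels
dropped = categories[encoder.drop_idx_[0]]     # the category without its own column

print(categories)  # ['L', 'M', 'XL']
print(dropped)     # L

# inverse_transform maps all-zero rows back to the dropped category
decoded = encoder.inverse_transform(encoded).ravel().tolist()
```

The remaining two columns correspond to `M` and `XL`, in that order.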

### <a name='6'></a> Pandas *get_dummies()*

In [None]:
df = df_raw.copy()
df

In [None]:
pd.get_dummies(data=df)

In [None]:
pd.get_dummies(data=df, drop_first=True)

In [None]:
pd.get_dummies(data=df, drop_first=True, prefix='new')

In [None]:
pd.get_dummies(data=df, drop_first=True, prefix_sep='-')

In [None]:
pd.get_dummies(data=df, drop_first=True, columns=['size'])

In [None]:
pd.get_dummies(data=df, drop_first=True, columns=['size'], dtype=int)  # 0 and 1
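
To see which level `drop_first=True` removes, it helps to compare the column sets with and without the option. The sketch below is a small standalone check on the `color` values from above; the variable names are illustrative only.

```python
import pandas as pd

colors = pd.DataFrame({'color': ['red', 'green', 'blue', 'green', 'red']})

full = pd.get_dummies(colors, dtype=int)
reduced = pd.get_dummies(colors, drop_first=True, dtype=int)

# the dummy column present in `full` but missing from `reduced`
dropped = sorted(set(full.columns) - set(reduced.columns))

print(list(full.columns))     # ['color_blue', 'color_green', 'color_red']
print(list(reduced.columns))  # ['color_green', 'color_red']
print(dropped)                # ['color_blue']
```

The dummy columns are created in sorted order, so the dropped level is the alphabetically first one (`blue` here), not the first value that appears in the data (`red`).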

### <a name='7'></a> Standardization - StandardScaler

##### A digression on standard deviation

`std()` in pandas: unbiased estimator by default (`ddof=1`, divides by n - 1)  
`std()` in NumPy: biased estimator by default (`ddof=0`, divides by n)
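
The difference comes down to the `ddof` argument, which both libraries accept. The sketch below is a minimal standalone comparison on the `price` values from the data above.

```python
import numpy as np
import pandas as pd

price = pd.Series([199.0, 89.0, 99.0, 129.0, 79.0])

sample_std = price.std()                    # pandas default: ddof=1
population_std = np.std(price.to_numpy())   # NumPy default: ddof=0

# either library reproduces the other once ddof is set explicitly
pandas_ddof0 = price.std(ddof=0)
numpy_ddof1 = np.std(price.to_numpy(), ddof=1)

print(f"{sample_std:.2f} {population_std:.2f}")  # 48.48 43.36
print(np.isclose(pandas_ddof0, population_std))   # True
print(np.isclose(numpy_ddof1, sample_std))        # True
```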

In [None]:
print(f"{df['price']}\n")
print(f"Mean: {df['price'].mean()}")
print(f"Standard deviation (pandas): {df['price'].std():.2f}")

In [None]:
print(f"{df['price']}\n")
print(f"Mean: {np.mean(df['price'])}")
print(f"Standard deviation (NumPy): {np.std(df['price']):.2f}")

In [None]:
df['price']

In [None]:
(df['price'] - df['price'].mean()) / df['price'].std()

In [None]:
def standardize(x):
    return (x - x.mean()) / x.std()

standardize(df['price'])

In [None]:
from sklearn.preprocessing import scale

scale(df['price'])

In [None]:
(df['price'] - df['price'].mean()) / np.std(df['price'])

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df[['price']])
scaler.transform(df[['price']])

In [None]:
scaler = StandardScaler()
df[['price', 'weight']] = scaler.fit_transform(df[['price', 'weight']])
df
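
`StandardScaler` divides by the population standard deviation (`ddof=0`), so its output matches the manual formula only when `np.std` or `std(ddof=0)` is used. The sketch below is a standalone check on the `price` and `weight` values from above; the `frame` and `manual` names are introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

frame = pd.DataFrame({
    'price': [199.0, 89.0, 99.0, 129.0, 79.0],
    'weight': [500.0, 450.0, 300.0, 380.0, 410.0],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(frame)

# manual standardization with the population standard deviation
manual = ((frame - frame.mean()) / frame.std(ddof=0)).to_numpy()

print(np.allclose(scaled, manual))             # True
print(np.allclose(scaled.mean(axis=0), 0.0))   # True: each column has mean 0
print(np.allclose(scaled.std(axis=0), 1.0))    # True: each column has std 1 (ddof=0)
```

The fitted statistics remain available as `scaler.mean_` and `scaler.scale_`, which is what makes `inverse_transform` possible later.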

### <a name='8'></a> Preparing the data for a model

In [None]:
df = df_raw.copy()
df

In [None]:
le = LabelEncoder()

df['bought'] = le.fit_transform(df['bought'])

scaler = StandardScaler()
df[['price', 'weight']] = scaler.fit_transform(df[['price', 'weight']])

df = pd.get_dummies(data=df, drop_first=True, dtype=int)
df

In [None]:

pd.set_option('display.float_format', lambda x: f'{x:.1f}')

# 1. Undo StandardScaler
df[['price', 'weight']] = scaler.inverse_transform(df[['price', 'weight']])

# 2. Undo LabelEncoder
if pd.api.types.is_numeric_dtype(df['bought']):
    df['bought'] = le.inverse_transform(df['bought'].astype(int))

# 3. Undo get_dummies: restore the categorical columns.
# drop_first=True dropped the alphabetically first category of each column,
# so a row whose dummy columns are all 0 belongs to that dropped category.
for col in ['size', 'color', 'gender']:
    prefix = f'{col}_'
    dummy_cols = [c for c in df.columns if c.startswith(prefix)]
    if dummy_cols:
        dropped = sorted(df_raw[col].unique())[0]
        restored = df[dummy_cols].idxmax(axis=1).str.replace(prefix, '', regex=False)
        restored[df[dummy_cols].sum(axis=1) == 0] = dropped
        df[col] = restored
        df = df.drop(columns=dummy_cols)

# Convert the types back to category
for col in ['size', 'color', 'gender', 'bought']:
    if col in df.columns:
        df[col] = df[col].astype('category')

# Restore the original column order (copy to avoid SettingWithCopyWarning)
df = df[['size', 'color', 'gender', 'price', 'weight', 'bought']].copy()

# Round the numeric values back to their original precision
df['price'] = df['price'].round(1)
df['weight'] = df['weight'].round(0).astype(int)

df

In [None]:
df_raw
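
Undoing `get_dummies(..., drop_first=True)` requires knowing which category was dropped, because that category is represented only by a row of zeros. The sketch below is one possible way to do this, not part of pandas: `undo_dummies` is a hypothetical helper that assumes the dummy columns use the default `<column>_<category>` naming and that the full list of original categories is known.

```python
import pandas as pd

def undo_dummies(encoded, column, categories):
    """Rebuild `column` from dummies created with drop_first=True.

    get_dummies sorts the categories and drops the first one, so a row
    with all dummy columns equal to 0 belongs to that first category.
    """
    first = sorted(categories)[0]
    prefix = f'{column}_'
    dummy_cols = [c for c in encoded.columns if c.startswith(prefix)]
    # start from the dropped category, then overwrite rows that have a 1
    restored = pd.Series(first, index=encoded.index)
    for c in dummy_cols:
        restored[encoded[c] == 1] = c[len(prefix):]
    return encoded.drop(columns=dummy_cols).assign(**{column: restored})

raw = pd.DataFrame({
    'size': ['XL', 'L', 'M', 'L', 'M'],
    'gender': ['female', 'male', 'male', 'female', 'female'],
})
encoded = pd.get_dummies(raw, drop_first=True, dtype=int)

decoded = encoded
for col in raw.columns:
    decoded = undo_dummies(decoded, col, raw[col].unique())

# the round trip reproduces the original values (dtypes are not compared)
pd.testing.assert_frame_equal(decoded[raw.columns], raw, check_dtype=False)
```

Note that `gender` is encoded as a single `gender_male` column, so taking `idxmax` over the dummy columns alone cannot recover `female`; the all-zero check is what restores it.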