### scikit-learn
Library website: [https://scikit-learn.org](https://scikit-learn.org)  

Documentation/User Guide: [https://scikit-learn.org/stable/user_guide.html](https://scikit-learn.org/stable/user_guide.html)

A fundamental machine learning library for Python.

To install scikit-learn, use the command below:
```
!pip install scikit-learn
```
To upgrade scikit-learn to the latest version, use the command below:
```
!pip install --upgrade scikit-learn
```

### Data preprocessing:
1. [Importing libraries](#0)
2. [Generating the data](#1)
3. [Creating a copy of the data](#2)
4. [Changing data types and initial exploration](#3)
5. [LabelEncoder](#4)
6. [OneHotEncoder](#5)
7. [Pandas *get_dummies()*](#6)
8. [Standardization - StandardScaler](#7)
9. [Preparing the data for a model](#8)



### <a name='0'></a> Importing libraries

In [40]:
import numpy as np
import pandas as pd
import sklearn

sklearn.__version__

'1.8.0'

### <a name='1'></a> Generating the data

In [41]:
data = {
    'size': ['XL', 'L', 'M', 'L', 'M'],
    'color': ['red', 'green', 'blue', 'green', 'red'],
    'gender': ['female', 'male', 'male', 'female', 'female'],
    'price': [199.0, 89.0, 99.0, 129.0, 79.0],
    'weight': [500, 450, 300, 380, 410],
    'bought': ['yes', 'no', 'yes', 'no', 'yes']
}

df_raw = pd.DataFrame(data=data)
df_raw

Unnamed: 0,size,color,gender,price,weight,bought
0,XL,red,female,199.0,500,yes
1,L,green,male,89.0,450,no
2,M,blue,male,99.0,300,yes
3,L,green,female,129.0,380,no
4,M,red,female,79.0,410,yes


### <a name='2'></a> Creating a copy of the data



In [88]:
df = df_raw.copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   size    5 non-null      object 
 1   color   5 non-null      object 
 2   gender  5 non-null      object 
 3   price   5 non-null      float64
 4   weight  5 non-null      int64  
 5   bought  5 non-null      object 
dtypes: float64(1), int64(1), object(4)
memory usage: 372.0+ bytes


### <a name='3'></a> Changing data types and initial exploration



In [57]:
for col in ['size', 'color', 'gender', 'bought']:
    df[col] = df[col].astype('category')

df['weight'] = df['weight'].astype('float')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   size    5 non-null      category
 1   color   5 non-null      category
 2   gender  5 non-null      category
 3   price   5 non-null      float64 
 4   weight  5 non-null      float64 
 5   bought  5 non-null      category
dtypes: category(4), float64(2)
memory usage: 744.0 bytes


In [54]:
type(df['size'])

pandas.core.series.Series

In [55]:

type(df[['size']])

pandas.core.frame.DataFrame
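
The difference between `df['size']` (a 1-D Series) and `df[['size']]` (a 2-D DataFrame) matters later on: feature transformers such as `OneHotEncoder` and `StandardScaler` expect 2-D input of shape `(n_samples, n_features)`, while `LabelEncoder` takes a 1-D array of labels. A minimal sketch using a small stand-in frame (the name `demo` is made up for this illustration):

```python
import pandas as pd

# Small stand-in frame; `demo` is a hypothetical name used only here
demo = pd.DataFrame({'size': ['XL', 'L', 'M', 'L', 'M']})

series_shape = demo['size'].shape    # single brackets -> 1-D Series
frame_shape = demo[['size']].shape   # double brackets -> 2-D DataFrame

print(series_shape)  # (5,)
print(frame_shape)   # (5, 1)
```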

In [8]:
df

Unnamed: 0,size,color,gender,price,weight,bought
0,XL,red,female,199.0,500.0,yes
1,L,green,male,89.0,450.0,no
2,M,blue,male,99.0,300.0,yes
3,L,green,female,129.0,380.0,no
4,M,red,female,79.0,410.0,yes


In [6]:
df.describe()

Unnamed: 0,price,weight
count,5.0,5.0
mean,119.0,408.0
std,48.476799,75.299402
min,79.0,300.0
25%,89.0,380.0
50%,99.0,410.0
75%,129.0,450.0
max,199.0,500.0


In [9]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
price,5.0,119.0,48.476799,79.0,89.0,99.0,129.0,199.0
weight,5.0,408.0,75.299402,300.0,380.0,410.0,450.0,500.0


In [10]:
df.describe(include=['category']).T

Unnamed: 0,count,unique,top,freq
size,5,3,L,2
color,5,3,green,2
gender,5,2,female,3
bought,5,2,yes,3


In [11]:
df

Unnamed: 0,size,color,gender,price,weight,bought
0,XL,red,female,199.0,500.0,yes
1,L,green,male,89.0,450.0,no
2,M,blue,male,99.0,300.0,yes
3,L,green,female,129.0,380.0,no
4,M,red,female,79.0,410.0,yes


### <a name='4'></a> LabelEncoder



In [12]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(df['bought'])
le.transform(df['bought'])

array([1, 0, 1, 0, 1])

In [13]:
le.fit_transform(df['bought'])

array([1, 0, 1, 0, 1])

In [14]:
le.classes_

array(['no', 'yes'], dtype=object)

In [15]:
df['bought'] = le.fit_transform(df['bought'])
df

Unnamed: 0,size,color,gender,price,weight,bought
0,XL,red,female,199.0,500.0,1
1,L,green,male,89.0,450.0,0
2,M,blue,male,99.0,300.0,1
3,L,green,female,129.0,380.0,0
4,M,red,female,79.0,410.0,1


In [16]:
le.inverse_transform(df['bought'])

array(['yes', 'no', 'yes', 'no', 'yes'], dtype=object)

In [17]:
df['bought'] = le.inverse_transform(df['bought'])
df

Unnamed: 0,size,color,gender,price,weight,bought
0,XL,red,female,199.0,500.0,yes
1,L,green,male,89.0,450.0,no
2,M,blue,male,99.0,300.0,yes
3,L,green,female,129.0,380.0,no
4,M,red,female,79.0,410.0,yes


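`LabelEncoder` applied to the `size` column (classes are sorted alphabetically, so the codes do not follow the size order M < L < XL):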

In [60]:
le = LabelEncoder()
le.fit(df['size'])
le.transform(df['size'])

array([2, 0, 1, 0, 1])
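
`LabelEncoder` sorts classes alphabetically (L=0, M=1, XL=2), which does not match the natural order of sizes. When the order matters, one option is `OrdinalEncoder` with an explicit category list. The sketch below assumes the ordering M < L < XL and uses a stand-in frame called `demo`; both are assumptions made for illustration only:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Stand-in data; `demo` is a hypothetical name used only here
demo = pd.DataFrame({'size': ['XL', 'L', 'M', 'L', 'M']})

# Assumed ordering for this illustration: M < L < XL
ordinal = OrdinalEncoder(categories=[['M', 'L', 'XL']])
size_codes = ordinal.fit_transform(demo[['size']])  # 2-D input, 2-D output

print(size_codes.ravel())  # [2. 1. 0. 1. 0.]
```

Unlike `LabelEncoder`, `OrdinalEncoder` is meant for feature columns and therefore works on 2-D input.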

### <a name='5'></a> OneHotEncoder

In [18]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)
encoder.fit(df[['size']])

OneHotEncoder(sparse_output=False)


In [19]:
encoder.transform(df[['size']])

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.]])

In [20]:
encoder.categories_

[array(['L', 'M', 'XL'], dtype=object)]

In [21]:
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoder.fit(df[['size']])
encoder.transform(df[['size']])

array([[0., 1.],
       [0., 0.],
       [1., 0.],
       [0., 0.],
       [1., 0.]])

In [22]:
encoder.categories_

[array(['L', 'M', 'XL'], dtype=object)]

In [23]:
df

Unnamed: 0,size,color,gender,price,weight,bought
0,XL,red,female,199.0,500.0,yes
1,L,green,male,89.0,450.0,no
2,M,blue,male,99.0,300.0,yes
3,L,green,female,129.0,380.0,no
4,M,red,female,79.0,410.0,yes




In [38]:
encoder = OneHotEncoder(sparse_output=False)
encoder.fit(df[['size']])
ohe_array = encoder.transform(df[['size']])  # numpy ndarray

cols = encoder.get_feature_names_out(['size'])
df_ohe = pd.DataFrame(ohe_array, columns=cols, index=df.index)

# replace the 'size' column with the one-hot columns
df = df.drop(columns=['size']).join(df_ohe)
df

Unnamed: 0,color,gender,price,weight,bought,size_L,size_M,size_XL
0,red,female,199.0,500.0,yes,0.0,0.0,1.0
1,green,male,89.0,450.0,no,1.0,0.0,0.0
2,blue,male,99.0,300.0,yes,0.0,1.0,0.0
3,green,female,129.0,380.0,no,1.0,0.0,0.0
4,red,female,79.0,410.0,yes,0.0,1.0,0.0


In [35]:
# Variant with drop='first' (left commented out; the output below matches this variant):
# encoder = OneHotEncoder(drop='first', sparse_output=False)
# encoder.fit(df[['size']])
# ohe_array = encoder.transform(df[['size']])  # numpy ndarray

# cols = encoder.get_feature_names_out(['size'])
# df_ohe = pd.DataFrame(ohe_array, columns=cols, index=df.index)

# # replace the 'size' column with the one-hot columns
# df = df.drop(columns=['size']).join(df_ohe)
# df

Unnamed: 0,color,gender,price,weight,bought,size_M,size_XL
0,red,female,199.0,500.0,yes,0.0,1.0
1,green,male,89.0,450.0,no,0.0,0.0
2,blue,male,99.0,300.0,yes,1.0,0.0
3,green,female,129.0,380.0,no,0.0,0.0
4,red,female,79.0,410.0,yes,1.0,0.0


### <a name='6'></a> Pandas *get_dummies()*

In [61]:
df = df_raw.copy()
df

Unnamed: 0,size,color,gender,price,weight,bought
0,XL,red,female,199.0,500,yes
1,L,green,male,89.0,450,no
2,M,blue,male,99.0,300,yes
3,L,green,female,129.0,380,no
4,M,red,female,79.0,410,yes


In [62]:
pd.get_dummies(data=df)

Unnamed: 0,price,weight,size_L,size_M,size_XL,color_blue,color_green,color_red,gender_female,gender_male,bought_no,bought_yes
0,199.0,500,False,False,True,False,False,True,True,False,False,True
1,89.0,450,True,False,False,False,True,False,False,True,True,False
2,99.0,300,False,True,False,True,False,False,False,True,False,True
3,129.0,380,True,False,False,False,True,False,True,False,True,False
4,79.0,410,False,True,False,False,False,True,True,False,False,True


In [63]:
pd.get_dummies(data=df, drop_first=True)

Unnamed: 0,price,weight,size_M,size_XL,color_green,color_red,gender_male,bought_yes
0,199.0,500,False,True,False,True,False,True
1,89.0,450,False,False,True,False,True,False
2,99.0,300,True,False,False,False,True,True
3,129.0,380,False,False,True,False,False,False
4,79.0,410,True,False,False,True,False,True


In [64]:
pd.get_dummies(data=df, drop_first=True, prefix='new')

Unnamed: 0,price,weight,new_M,new_XL,new_green,new_red,new_male,new_yes
0,199.0,500,False,True,False,True,False,True
1,89.0,450,False,False,True,False,True,False
2,99.0,300,True,False,False,False,True,True
3,129.0,380,False,False,True,False,False,False
4,79.0,410,True,False,False,True,False,True


In [65]:
pd.get_dummies(data=df, drop_first=True, prefix_sep='-')

Unnamed: 0,price,weight,size-M,size-XL,color-green,color-red,gender-male,bought-yes
0,199.0,500,False,True,False,True,False,True
1,89.0,450,False,False,True,False,True,False
2,99.0,300,True,False,False,False,True,True
3,129.0,380,False,False,True,False,False,False
4,79.0,410,True,False,False,True,False,True


In [66]:
pd.get_dummies(data=df, drop_first=True, columns=['size'])

Unnamed: 0,color,gender,price,weight,bought,size_M,size_XL
0,red,female,199.0,500,yes,False,True
1,green,male,89.0,450,no,False,False
2,blue,male,99.0,300,yes,True,False
3,green,female,129.0,380,no,False,False
4,red,female,79.0,410,yes,True,False


In [67]:
pd.get_dummies(data=df, drop_first=True, columns=['size'], dtype=int)  # 0 and 1

Unnamed: 0,color,gender,price,weight,bought,size_M,size_XL
0,red,female,199.0,500,yes,0,1
1,green,male,89.0,450,no,0,0
2,blue,male,99.0,300,yes,1,0
3,green,female,129.0,380,no,0,0
4,red,female,79.0,410,yes,1,0


In [68]:
pd.get_dummies(data=df, dtype=int)  # 0 and 1

Unnamed: 0,price,weight,size_L,size_M,size_XL,color_blue,color_green,color_red,gender_female,gender_male,bought_no,bought_yes
0,199.0,500,0,0,1,0,0,1,1,0,0,1
1,89.0,450,1,0,0,0,1,0,0,1,1,0
2,99.0,300,0,1,0,1,0,0,0,1,0,1
3,129.0,380,1,0,0,0,1,0,1,0,1,0
4,79.0,410,0,1,0,0,0,1,1,0,0,1
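
`pd.get_dummies()` builds its columns from the values present in the frame it receives, so a later batch with fewer categories produces fewer columns. A minimal sketch of one way to keep the layout stable is to reindex the new dummies to the training columns. The frames `train` and `new` are hypothetical and used only for illustration:

```python
import pandas as pd

train = pd.DataFrame({'color': ['red', 'green', 'blue']})
new = pd.DataFrame({'color': ['green', 'green']})  # hypothetical later batch

train_dummies = pd.get_dummies(train, dtype=int)

# Without alignment the new batch would only contain 'color_green';
# reindex adds the missing columns and fills them with 0
new_dummies = pd.get_dummies(new, dtype=int).reindex(
    columns=train_dummies.columns, fill_value=0
)

print(new_dummies)
```

A fitted `OneHotEncoder` avoids this step because it remembers the categories seen during `fit`.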


### <a name='7'></a> Standardization - StandardScaler

##### Digression on the standard deviation

`std()` in pandas: unbiased estimator by default (divides by n - 1, `ddof=1`)  
`np.std()` in NumPy: biased estimator by default (divides by n, `ddof=0`)

In [69]:
print(f"{df['price']}\n")
print(f"Mean: {df['price'].mean()}")
print(f"Standard deviation (pandas): {df['price'].std():.2f}")

0    199.0
1     89.0
2     99.0
3    129.0
4     79.0
Name: price, dtype: float64

Mean: 119.0
Standard deviation (pandas): 48.48


In [70]:
print(f"{df['price']}\n")
print(f"Mean: {np.mean(df['price'])}")
print(f"Standard deviation (NumPy): {np.std(df['price']):.2f}")

0    199.0
1     89.0
2     99.0
3    129.0
4     79.0
Name: price, dtype: float64

Mean: 119.0
Standard deviation (NumPy): 43.36
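
The two results (48.48 and 43.36) differ only in the divisor: n - 1 for pandas and n for NumPy. The sketch below, which assumes nothing beyond the price values used above, shows that either library can reproduce the other's result by passing `ddof` explicitly:

```python
import numpy as np
import pandas as pd

price = pd.Series([199.0, 89.0, 99.0, 129.0, 79.0])

sample_std = price.std()                    # pandas default: ddof=1 (divide by n - 1)
population_std = np.std(price.to_numpy())   # NumPy default: ddof=0 (divide by n)

# Setting ddof explicitly makes the two libraries agree
assert np.isclose(price.std(ddof=0), population_std)
assert np.isclose(np.std(price.to_numpy(), ddof=1), sample_std)

print(f"{sample_std:.2f} {population_std:.2f}")  # 48.48 43.36
```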


In [71]:
df['price']

0    199.0
1     89.0
2     99.0
3    129.0
4     79.0
Name: price, dtype: float64

In [74]:
(df['price'] - df['price'].mean()) / df['price'].std()

0    1.650274
1   -0.618853
2   -0.412568
3    0.206284
4   -0.825137
Name: price, dtype: float64

In [75]:
def standardize(x):
    return (x - x.mean()) / x.std()

standardize(df['price'])

0    1.650274
1   -0.618853
2   -0.412568
3    0.206284
4   -0.825137
Name: price, dtype: float64

In [76]:
from sklearn.preprocessing import scale

scale(df['price'])

array([ 1.84506242, -0.69189841, -0.4612656 ,  0.2306328 , -0.92253121])

In [77]:
(df['price'] - df['price'].mean()) / np.std(df['price'])

0    1.845062
1   -0.691898
2   -0.461266
3    0.230633
4   -0.922531
Name: price, dtype: float64

In [79]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df[['price']])
scaler.transform(df[['price']])

array([[ 1.84506242],
       [-0.69189841],
       [-0.4612656 ],
       [ 0.2306328 ],
       [-0.92253121]])

In [80]:
scaler = StandardScaler()
df[['price', 'weight']] = scaler.fit_transform(df[['price', 'weight']])
df

Unnamed: 0,size,color,gender,price,weight,bought
0,XL,red,female,1.845062,1.366002,yes
1,L,green,male,-0.691898,0.62361,no
2,M,blue,male,-0.461266,-1.603567,yes
3,L,green,female,0.230633,-0.41574,no
4,M,red,female,-0.922531,0.029696,yes
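
`StandardScaler` stores the statistics it learned in `mean_` and `scale_`, so the same transformation can be applied to data it has not seen. A minimal sketch, where the frame `new` is a hypothetical observation added for illustration:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({'price': [199.0, 89.0, 99.0, 129.0, 79.0]})
new = pd.DataFrame({'price': [119.0]})  # hypothetical new observation

scaler = StandardScaler().fit(train)
print(scaler.mean_)   # mean learned per column
print(scaler.scale_)  # population standard deviation (ddof=0) learned per column

# transform() reuses the training statistics instead of recomputing them
new_scaled = scaler.transform(new)
print(new_scaled)
```

Because 119.0 equals the training mean, the scaled value is 0. Fitting on the training data and only calling `transform` on new data keeps both on the same scale.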


##### Normalization (0-1)

In [89]:
df = df_raw.copy()  # start again from the unscaled data
df

Unnamed: 0,size,color,gender,price,weight,bought
0,XL,red,female,199.0,500,yes
1,L,green,male,89.0,450,no
2,M,blue,male,99.0,300,yes
3,L,green,female,129.0,380,no
4,M,red,female,79.0,410,yes


In [90]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(df[['price']])
scaler.transform(df[['price']])

array([[1.        ],
       [0.08333333],
       [0.16666667],
       [0.41666667],
       [0.        ]])
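
With its default `feature_range=(0, 1)`, `MinMaxScaler` computes `(x - min) / (max - min)` for each column. The sketch below checks this against a manual calculation on the same price values:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

price = pd.DataFrame({'price': [199.0, 89.0, 99.0, 129.0, 79.0]})

# Manual min-max scaling: (x - min) / (max - min)
manual = (price - price.min()) / (price.max() - price.min())

scaled = MinMaxScaler().fit_transform(price)

# Both approaches give the same values
assert np.allclose(manual.to_numpy(), scaled)
print(manual['price'].round(4).tolist())
```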

### <a name='8'></a> Preparing the data for a model

In [85]:
df = df_raw.copy()
df

Unnamed: 0,size,color,gender,price,weight,bought
0,XL,red,female,199.0,500,yes
1,L,green,male,89.0,450,no
2,M,blue,male,99.0,300,yes
3,L,green,female,129.0,380,no
4,M,red,female,79.0,410,yes


In [86]:
le = LabelEncoder()

df['bought'] = le.fit_transform(df['bought'])

scaler = StandardScaler()
df[['price', 'weight']] = scaler.fit_transform(df[['price', 'weight']])

df = pd.get_dummies(data=df, dtype=int)
df

Unnamed: 0,price,weight,bought,size_L,size_M,size_XL,color_blue,color_green,color_red,gender_female,gender_male
0,1.845062,1.366002,1,0,0,1,0,0,1,1,0
1,-0.691898,0.62361,0,1,0,0,0,1,0,0,1
2,-0.461266,-1.603567,1,0,1,0,1,0,0,0,1
3,0.230633,-0.41574,0,1,0,0,0,1,0,1,0
4,-0.922531,0.029696,1,0,1,0,0,0,1,1,0


In [None]:

pd.set_option('display.float_format', lambda x: f'{x:.1f}')

# 1. Undo StandardScaler
df[['price', 'weight']] = scaler.inverse_transform(df[['price', 'weight']])

# 2. Undo LabelEncoder
if df['bought'].dtype in ['int64', 'float64']:
    df['bought'] = le.inverse_transform(df['bought'].astype(int))

# 3. Undo get_dummies: restore the categorical columns
color_cols = [col for col in df.columns if col.startswith('color_')]
gender_cols = [col for col in df.columns if col.startswith('gender_')]
size_cols = [col for col in df.columns if col.startswith('size_')]

# Mapping for color
if color_cols:
    df['color'] = df[color_cols].idxmax(axis=1).str.replace('color_', '')
    df = df.drop(columns=color_cols)

# Mapping for gender
if gender_cols:
    df['gender'] = df[gender_cols].idxmax(axis=1).str.replace('gender_', '')
    df = df.drop(columns=gender_cols)

# Mapping for size (get_dummies was called without drop_first,
# so every row has exactly one 1 and idxmax is unambiguous)
if size_cols:
    df['size'] = df[size_cols].idxmax(axis=1).str.replace('size_', '')
    df = df.drop(columns=size_cols)

# Convert the types back to category
for col in ['size', 'color', 'gender', 'bought']:
    if col in df.columns:
        df[col] = df[col].astype('category')

# Restore the original column order (copy to avoid SettingWithCopyWarning below)
df = df[['size', 'color', 'gender', 'price', 'weight', 'bought']].copy()

# Round the numeric values back to the original integers / one decimal place
df['price'] = df['price'].round(1)
df['weight'] = df['weight'].round(0).astype(int)

df

In [None]:
df_raw
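
The manual `idxmax` approach above can also be replaced by `pd.from_dummies()`, assuming pandas 1.5 or newer and category names that do not themselves contain the separator. A minimal sketch on a stand-in frame called `raw` (a hypothetical name):

```python
import pandas as pd

raw = pd.DataFrame({
    'size': ['XL', 'L', 'M', 'L', 'M'],
    'color': ['red', 'green', 'blue', 'green', 'red'],
})

dummies = pd.get_dummies(raw, dtype=int)

# sep='_' splits each column name into the original column and the category
restored = pd.from_dummies(dummies, sep='_')

print(restored)
```

`from_dummies()` expects a frame that contains only dummy columns, so numeric columns such as `price` and `weight` would need to be set aside first.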