Load the `adult.csv` dataset. Explore the data, then prepare it for modeling.  

Apply several classification models to predict the income group (`<=50K`/`>50K`) and choose suitable hyperparameters. Make sure you use appropriate metrics to evaluate the models. You can use a validation set or cross-validation (see: [`cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html), [`cross_val_score`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html), [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)).
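
As a minimal sketch of evaluating with more than one metric, `cross_validate` accepts a list of scorer names and returns one array of fold scores per metric. The data here is synthetic (`make_classification`) and the names `X_demo`, `y_demo` and `cv_results` are illustrative assumptions, not part of the task.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic, imbalanced stand-in for the real data (roughly 75% / 25%).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.75, 0.25], random_state=17
)

metrics = ["roc_auc", "f1", "accuracy"]
cv_results = cross_validate(
    LogisticRegression(max_iter=1000),
    X_demo,
    y_demo,
    cv=5,
    scoring=metrics,
)

# Each metric is reported under the key "test_<metric>", one value per fold.
for metric in metrics:
    scores = cv_results[f"test_{metric}"]
    print(f"{metric}: {scores.mean():.3f} (std {scores.std():.3f})")
```

With imbalanced classes, accuracy alone can look good for a model that rarely predicts the minority class, so it helps to read it next to ROC AUC and F1.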

In [58]:
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score, confusion_matrix

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Data Preparation

### Dropping NA values

In [26]:
df = pd.read_csv("adult.csv", na_values=["?"])
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,,103497,Some-college,10,Never-married,,Own-child,White,Female,0,0,30,United-States,<=50K


In [27]:
df_nona = df.dropna()
df_nona.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
5,34,Private,198693,10th,6,Never-married,Other-service,Not-in-family,White,Male,0,0,30,United-States,<=50K


### Dropping columns
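Drop `education`, since `educational-num` encodes the same information as an ordinal number.

Drop `fnlwgt`: it is a census sampling weight (roughly, how many people a record represents), so it is unlikely to be predictive of an individual's income and can add noise.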
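
Before dropping `education`, it may be worth confirming that it really maps one-to-one onto `educational-num`. The sketch below shows one way to check this on a small hand-made frame (`toy` is an assumption built from the rows displayed above); on the real data the same check would run on `df_nona`.

```python
import pandas as pd

# Toy frame mirroring a few rows shown above (illustrative only).
toy = pd.DataFrame({
    "education": ["11th", "HS-grad", "Assoc-acdm", "Some-college", "Some-college", "HS-grad"],
    "educational-num": [7, 9, 12, 10, 10, 9],
})

# Each label should map to exactly one code, and each code to exactly one label.
codes_per_label = toy.groupby("education")["educational-num"].nunique()
labels_per_code = toy.groupby("educational-num")["education"].nunique()

is_one_to_one = bool((codes_per_label == 1).all() and (labels_per_code == 1).all())
print(f"education <-> educational-num is one-to-one: {is_one_to_one}")
```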

In [28]:
df_drop = df_nona.drop(columns=["education", "fnlwgt"])
df_drop

Unnamed: 0,age,workclass,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
5,34,Private,6,Never-married,Other-service,Not-in-family,White,Male,0,0,30,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27,Private,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
48838,40,Private,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
48839,58,Private,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
48840,22,Private,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


### Aggregating less common values

In [29]:
unique_values = df_drop['workclass'].unique()
print(df_drop['workclass'].value_counts())
print()

government_jobs = ["Local-gov", "State-gov", "Federal-gov"]

df_drop["workclass"] = df_drop["workclass"].replace(government_jobs, "Government")
print(df_drop['workclass'].value_counts())

workclass
Private             33307
Self-emp-not-inc     3796
Local-gov            3100
State-gov            1946
Self-emp-inc         1646
Federal-gov          1406
Without-pay            21
Name: count, dtype: int64

workclass
Private             33307
Government           6452
Self-emp-not-inc     3796
Self-emp-inc         1646
Without-pay            21
Name: count, dtype: int64


In [30]:
unique_values = df_drop['marital-status'].unique()
print(df_drop['marital-status'].value_counts())
print()

married = ["Married-civ-spouse", "Married-AF-spouse", "Married-spouse-absent"]
formerly_married = ["Divorced", "Separated", "Widowed"]
df_drop["marital-status"] = df_drop["marital-status"].replace(married, "Married")
df_drop["marital-status"] = df_drop["marital-status"].replace(formerly_married, "Formerly-Married")

print(df_drop['marital-status'].value_counts())
df_drop

marital-status
Married-civ-spouse       21055
Never-married            14598
Divorced                  6297
Separated                 1411
Widowed                   1277
Married-spouse-absent      552
Married-AF-spouse           32
Name: count, dtype: int64

marital-status
Married             21639
Never-married       14598
Formerly-Married     8985
Name: count, dtype: int64


Unnamed: 0,age,workclass,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,9,Married,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Government,12,Married,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,10,Married,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
5,34,Private,6,Never-married,Other-service,Not-in-family,White,Male,0,0,30,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27,Private,12,Married,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
48838,40,Private,9,Married,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
48839,58,Private,9,Formerly-Married,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
48840,22,Private,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [31]:
unique_values = df_drop['occupation'].unique()
print(df_drop['occupation'].value_counts())
print()

other = ["Priv-house-serv", "Armed-Forces"]
df_drop["occupation"] = df_drop["occupation"].replace(other, "Other")
print(df_drop['occupation'].value_counts())

occupation
Craft-repair         6020
Prof-specialty       6008
Exec-managerial      5984
Adm-clerical         5540
Sales                5408
Other-service        4808
Machine-op-inspct    2970
Transport-moving     2316
Handlers-cleaners    2046
Farming-fishing      1480
Tech-support         1420
Protective-serv       976
Priv-house-serv       232
Armed-Forces           14
Name: count, dtype: int64

occupation
Craft-repair         6020
Prof-specialty       6008
Exec-managerial      5984
Adm-clerical         5540
Sales                5408
Other-service        4808
Machine-op-inspct    2970
Transport-moving     2316
Handlers-cleaners    2046
Farming-fishing      1480
Tech-support         1420
Protective-serv       976
Other                 246
Name: count, dtype: int64


In [32]:
unique_values = df_drop['relationship'].unique()
print(df_drop['relationship'].value_counts())
print()

relationship
Husband           18666
Not-in-family     11702
Own-child          6626
Unmarried          4788
Wife               2091
Other-relative     1349
Name: count, dtype: int64



In [33]:
unique_values = df_drop['race'].unique()
print(df_drop['race'].value_counts())
print()

other_races = ["Amer-Indian-Eskimo", "Other"]
df_drop["race"] = df_drop["race"].replace(other_races, "Other-races")
print(df_drop['race'].value_counts())

race
White                 38903
Black                  4228
Asian-Pac-Islander     1303
Amer-Indian-Eskimo      435
Other                   353
Name: count, dtype: int64

race
White                 38903
Black                  4228
Asian-Pac-Islander     1303
Other-races             788
Name: count, dtype: int64


In [34]:
unique_values = df_drop['gender'].unique()
print(df_drop['gender'].value_counts())
print()

gender
Male      30527
Female    14695
Name: count, dtype: int64



In [35]:
unique_values = df_drop['native-country'].unique()
print(df_drop['native-country'].value_counts())
print()

countries_to_keep = ['United-States','Mexico','Philippines','Germany','Puerto-Rico','Canada']

df_drop.loc[~df_drop['native-country'].isin(countries_to_keep), 'native-country'] = 'Other-Native-Country'
print(df_drop['native-country'].value_counts())

native-country
United-States                 41292
Mexico                          903
Philippines                     283
Germany                         193
Puerto-Rico                     175
Canada                          163
El-Salvador                     147
India                           147
Cuba                            133
England                         119
China                           113
Jamaica                         103
South                           101
Italy                           100
Dominican-Republic               97
Japan                            89
Guatemala                        86
Vietnam                          83
Columbia                         82
Poland                           81
Haiti                            69
Portugal                         62
Iran                             56
Taiwan                           55
Greece                           49
Nicaragua                        48
Peru                             45
Ecuador      

### Encoding income

In [36]:
df_drop['income'] = df_drop['income'].map({'<=50K': 0, '>50K': 1})

### Dropping duplicates
They will not benefit the training process

In [37]:
num_duplicates = df_drop.duplicated().sum()
print(f"Number of duplicate rows: {num_duplicates}")
df_drop.drop_duplicates(keep='first', inplace=True)
print(f"Shape of DataFrame after dropping duplicates: {df_drop.shape}")
df_drop.head()

Number of duplicate rows: 6387
Shape of DataFrame after dropping duplicates: (38835, 13)


Unnamed: 0,age,workclass,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,0
1,38,Private,9,Married,Farming-fishing,Husband,White,Male,0,0,50,United-States,0
2,28,Government,12,Married,Protective-serv,Husband,White,Male,0,0,40,United-States,1
3,44,Private,10,Married,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,1
5,34,Private,6,Never-married,Other-service,Not-in-family,White,Male,0,0,30,United-States,0


### Encoding categorical columns

In [38]:
df_drop.info()

<class 'pandas.core.frame.DataFrame'>
Index: 38835 entries, 0 to 48841
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              38835 non-null  int64 
 1   workclass        38835 non-null  object
 2   educational-num  38835 non-null  int64 
 3   marital-status   38835 non-null  object
 4   occupation       38835 non-null  object
 5   relationship     38835 non-null  object
 6   race             38835 non-null  object
 7   gender           38835 non-null  object
 8   capital-gain     38835 non-null  int64 
 9   capital-loss     38835 non-null  int64 
 10  hours-per-week   38835 non-null  int64 
 11  native-country   38835 non-null  object
 12  income           38835 non-null  int64 
dtypes: int64(6), object(7)
memory usage: 4.1+ MB


In [39]:
categorical_columns = df_drop.select_dtypes(include=['object']).columns

df_processed = pd.get_dummies(df_drop, columns=categorical_columns, drop_first=True)
df_processed.head()

Unnamed: 0,age,educational-num,capital-gain,capital-loss,hours-per-week,income,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,workclass_Without-pay,...,race_Black,race_Other-races,race_White,gender_Male,native-country_Germany,native-country_Mexico,native-country_Other-Native-Country,native-country_Philippines,native-country_Puerto-Rico,native-country_United-States
0,25,7,0,0,40,0,True,False,False,False,...,True,False,False,True,False,False,False,False,False,True
1,38,9,0,0,50,0,True,False,False,False,...,False,False,True,True,False,False,False,False,False,True
2,28,12,0,0,40,1,False,False,False,False,...,False,False,True,True,False,False,False,False,False,True
3,44,10,7688,0,40,1,True,False,False,False,...,True,False,False,True,False,False,False,False,False,True
5,34,6,0,0,30,0,True,False,False,False,...,False,False,True,True,False,False,False,False,False,True


In [40]:
df_processed.info()

<class 'pandas.core.frame.DataFrame'>
Index: 38835 entries, 0 to 48841
Data columns (total 39 columns):
 #   Column                               Non-Null Count  Dtype
---  ------                               --------------  -----
 0   age                                  38835 non-null  int64
 1   educational-num                      38835 non-null  int64
 2   capital-gain                         38835 non-null  int64
 3   capital-loss                         38835 non-null  int64
 4   hours-per-week                       38835 non-null  int64
 5   income                               38835 non-null  int64
 6   workclass_Private                    38835 non-null  bool 
 7   workclass_Self-emp-inc               38835 non-null  bool 
 8   workclass_Self-emp-not-inc           38835 non-null  bool 
 9   workclass_Without-pay                38835 non-null  bool 
 10  marital-status_Married               38835 non-null  bool 
 11  marital-status_Never-married         38835 non-null  bool 


### Splitting into test and train sets

In [41]:
X = df_processed.drop(columns=['income'])
y = df_processed.income

# Stratify - means that the split will maintain the same proportion of classes in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=17, stratify=y)
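
The effect of `stratify` can be illustrated on a toy label vector. This is a sketch with made-up data (`X_toy`, `y_toy`); it only shows that both parts of the split keep the original class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 toy samples: 75 of class 0 and 25 of class 1.
y_toy = np.array([0] * 75 + [1] * 25)
X_toy = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=17, stratify=y_toy
)

# Share of class 1 in each part of the split.
print(f"train: {y_tr.mean():.2f}, test: {y_te.mean():.2f}")
```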

In [42]:
X_train

Unnamed: 0,age,educational-num,capital-gain,capital-loss,hours-per-week,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,workclass_Without-pay,marital-status_Married,...,race_Black,race_Other-races,race_White,gender_Male,native-country_Germany,native-country_Mexico,native-country_Other-Native-Country,native-country_Philippines,native-country_Puerto-Rico,native-country_United-States
1902,30,13,0,2559,40,True,False,False,False,False,...,False,False,True,True,False,False,False,False,False,True
17751,61,15,0,0,5,True,False,False,False,True,...,False,False,True,True,False,False,False,False,False,True
2803,24,7,0,0,36,True,False,False,False,False,...,False,False,True,False,False,False,False,False,False,True
37052,23,9,0,0,40,True,False,False,False,False,...,False,False,True,True,False,False,False,False,False,True
25869,53,14,0,0,40,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42195,51,14,0,1977,50,True,False,False,False,True,...,False,False,True,True,False,False,False,False,False,True
25161,36,13,0,0,40,True,False,False,False,True,...,False,False,True,True,False,False,False,False,False,True
14996,45,7,0,0,40,True,False,False,False,False,...,False,False,True,False,False,False,False,False,True,False
18223,24,9,0,0,40,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,True


### Checking class balance

In [43]:
# We have unbalanced classes
print("Class distribution in training set:")
print(y_train.value_counts(normalize=True))

Class distribution in training set:
income
0    0.745333
1    0.254667
Name: proportion, dtype: float64
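
This imbalance is why the models in this notebook are created with `class_weight='balanced'`. As a rough sketch of what that setting does, scikit-learn weights each class by `n_samples / (n_classes * class_count)`. The toy labels below are an assumption for illustration; the second part applies the same formula to the proportions printed above.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with a 3:1 imbalance (illustrative, not the real y_train).
y_toy = np.array([0, 0, 0, 1])
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_toy
)
print(dict(zip([0, 1], weights)))

# Same formula expressed with class proportions: weight = 1 / (n_classes * proportion).
proportions = {0: 0.745333, 1: 0.254667}
approx_weights = {label: 1 / (2 * p) for label, p in proportions.items()}
print(approx_weights)
```

The minority class (`>50K`) ends up with roughly three times the weight of the majority class, which pushes the models toward higher recall on it at the cost of precision.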


# Modeling

### Preprocessor

In [None]:
# Numerical columns (those that existed BEFORE one-hot encoding and are present in X_train)
numerical_cols = ['age', 'educational-num', 'capital-gain', 'capital-loss', 'hours-per-week']

# Preprocessor: scales the numerical columns and passes the rest (already one-hot encoded) through.
# remainder='passthrough' is essential so that the binary columns from pd.get_dummies are kept.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols)  # Apply StandardScaler to the numerical columns only
    ],
    remainder='passthrough'  # The remaining (one-hot encoded) columns are left untouched
)

### Logistic Regression

In [None]:
# Pipeline: preprocessor + model (baseline settings)
baseline_pipeline_logreg = Pipeline([
    ('preprocessor', preprocessor),
    # C and penalty are left at their defaults here; their effect on the score is examined in the grid search below
    ('classifier', LogisticRegression(max_iter=1000, random_state=17, class_weight='balanced', solver='liblinear'))
])

print("Baseline model (Logistic Regression) evaluated with 5-fold cross-validation on X_train:")

# We pass X_train (which contains both the numerical columns to be scaled and the already one-hot encoded columns). The pipeline handles the rest.
cv_scores_roc_auc = cross_val_score(baseline_pipeline_logreg, X_train, y_train, cv=5, scoring='roc_auc', n_jobs=-1)

print(f"Cross-validation ROC AUC scores: {cv_scores_roc_auc}")
print(f"Mean ROC AUC: {cv_scores_roc_auc.mean():.4f} (+/- {cv_scores_roc_auc.std() * 2:.4f})")

Baseline model (Logistic Regression) evaluated with 5-fold cross-validation on X_train:
Cross-validation ROC AUC scores: [0.89749775 0.9082584  0.89643972 0.90186394 0.89461678]
Mean ROC AUC: 0.8997 (+/- 0.0098)




In [None]:
# C - inverse of the regularization strength. The smaller the value, the stronger the regularization (a larger penalty on model complexity, which counters overfitting).
# penalty - the type of regularization: 'l1' (Lasso) or 'l2' (Ridge).
# l1 (Lasso) can drive some coefficients to exactly zero, which is useful with many features (feature selection).
# l2 (Ridge) shrinks the coefficients but does not zero them out; it is the most commonly used in practice.
param_grid_logreg = {
    'classifier__C': [0.01, 0.1, 1, 10, 100],  
    'classifier__penalty': ['l1', 'l2']       
}

grid_search_logreg = GridSearchCV(
    estimator=baseline_pipeline_logreg, 
    param_grid=param_grid_logreg, 
    cv=5,                    
    scoring='roc_auc',       
    n_jobs=-1,               
    verbose=1                
)

# Fit on X_train and y_train. The pipeline inside grid_search_logreg applies the preprocessor (including scaling) correctly within each CV fold.
grid_search_logreg.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
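
The difference between the two penalties can be made concrete by counting zeroed coefficients. The sketch below uses synthetic data with only a few informative features and a fairly strong regularization (`C=0.05`); both choices are assumptions made for illustration and are not taken from the grid above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 20 features, of which only 3 carry signal (synthetic data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=3, n_redundant=0, random_state=17
)

zero_counts = {}
for penalty in ["l1", "l2"]:
    model = LogisticRegression(penalty=penalty, C=0.05, solver="liblinear", max_iter=1000)
    model.fit(X_demo, y_demo)
    # Count coefficients that are exactly zero after fitting.
    zero_counts[penalty] = int(np.sum(model.coef_ == 0))

print(zero_counts)
```

On data like this, `l1` is expected to zero out many of the uninformative coefficients, while `l2` only shrinks them.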



In [52]:
print("Best hyperparameters found by GridSearchCV:")
print(grid_search_logreg.best_params_)

print("\nBest cross-validated ROC AUC inside GridSearchCV (on the training data):")
print(f"{grid_search_logreg.best_score_:.4f}")

best_model_logreg = grid_search_logreg.best_estimator_

Best hyperparameters found by GridSearchCV:
{'classifier__C': 1, 'classifier__penalty': 'l2'}

Best cross-validated ROC AUC inside GridSearchCV (on the training data):
0.8997


In [None]:
print("Final evaluation of the BEST model (after GridSearchCV) on the test set:")

# We use X_test. The pipeline in 'best_model_logreg' handles the scaling itself
# (using the parameters learned on X_train).
y_pred_final = best_model_logreg.predict(X_test)
y_pred_proba_final = best_model_logreg.predict_proba(X_test)[:, 1]

print(f"Accuracy: {accuracy_score(y_test, y_pred_final):.4f}")
print(f"ROC AUC Score: {roc_auc_score(y_test, y_pred_proba_final):.4f}")

print("\nConfusion Matrix (Final Model):")
cm_final = confusion_matrix(y_test, y_pred_final)
print(cm_final)
tn_final, fp_final, fn_final, tp_final = cm_final.ravel()
print(f"True Negatives: {tn_final}, False Positives: {fp_final}")
print(f"False Negatives: {fn_final}, True Positives: {tp_final}")

print("\nClassification Report (Final Model):\n")
print(classification_report(y_test, y_pred_final, target_names=['<=50K (0)', '>50K (1)']))

Final evaluation of the BEST model (after GridSearchCV) on the test set:
Accuracy: 0.8048
ROC AUC Score: 0.8982

Confusion Matrix (Final Model):
[[4597 1192]
 [ 324 1654]]
True Negatives: 4597, False Positives: 1192
False Negatives: 324, True Positives: 1654

Classification Report (Final Model):

              precision    recall  f1-score   support

   <=50K (0)       0.93      0.79      0.86      5789
    >50K (1)       0.58      0.84      0.69      1978

    accuracy                           0.80      7767
   macro avg       0.76      0.82      0.77      7767
weighted avg       0.84      0.80      0.81      7767
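
The figures in the classification report follow directly from the confusion matrix. As a quick check, the sketch below recomputes them for the positive class (`>50K`) from the four counts printed above; the variable names are illustrative.

```python
# Counts taken from the logistic regression confusion matrix above.
tn, fp, fn, tp = 4597, 1192, 324, 1654

precision = tp / (tp + fp)  # share of predicted >50K that really are >50K
recall = tp / (tp + fn)     # share of actual >50K that the model found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
```

The high recall with modest precision is consistent with `class_weight='balanced'`: the model flags more people as `>50K`, catching most of them but producing more false positives.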





### SVC (Support Vector Machines)

In [None]:
# Pipeline for SVC
pipeline_svc = Pipeline([
    # Same preprocessor as for Logistic Regression
    ('preprocessor', preprocessor),
    
    # probability=True enables predict_proba, which the final test-set evaluation uses for roc_auc_score; it slows down training
    # Kernel, C and gamma are left at their defaults here; they are tuned in the grid search below
    ('classifier', SVC(probability=True, random_state=17, class_weight='balanced')) 
])

print("\nBaseline model (SVC) evaluated with 5-fold cross-validation on X_train:")
cv_scores_svc = cross_val_score(pipeline_svc, X_train, y_train, cv=5, scoring='roc_auc', n_jobs=-1)

print(f"Cross-validation ROC AUC scores for SVC: {cv_scores_svc}")
print(f"Mean ROC AUC for SVC: {cv_scores_svc.mean():.4f} (+/- {cv_scores_svc.std() * 2:.4f})")


Baseline model (SVC) evaluated with 5-fold cross-validation on X_train:
Cross-validation ROC AUC scores for SVC: [0.89937992 0.90658439 0.89833169 0.89999559 0.89307717]
Mean ROC AUC for SVC: 0.8995 (+/- 0.0086)


In [None]:
# Hyperparameter grid for SVC
param_grid_svc = {
    # Regularization parameter, analogous to C in Logistic Regression
    'classifier__C': [0.1, 1, 10],
    # Kernel coefficient (used by the 'rbf' kernel, ignored by 'linear')
    'classifier__gamma': ['scale', 'auto', 0.01, 0.1],
    # Kernel types
    'classifier__kernel': ['rbf', 'linear']
}

# GridSearchCV for SVC
grid_search_svc = GridSearchCV(
    estimator=pipeline_svc,
    param_grid=param_grid_svc,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1
)

grid_search_svc.fit(X_train, y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
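
Because `gamma` has no effect on the linear kernel, a single dictionary grid repeats the same linear model once per `gamma` value. One possible refinement, shown here only as a sketch, is to pass `GridSearchCV` a list of dictionaries so that each kernel gets only the parameters it uses. `ParameterGrid` is used below just to count the candidates; `split_grid` is a hypothetical name.

```python
from sklearn.model_selection import ParameterGrid

# The grid as defined above: every combination, including gamma for 'linear'.
original_grid = {
    "classifier__C": [0.1, 1, 10],
    "classifier__gamma": ["scale", "auto", 0.01, 0.1],
    "classifier__kernel": ["rbf", "linear"],
}

# Alternative: one sub-grid per kernel, with gamma only where it matters.
split_grid = [
    {
        "classifier__kernel": ["rbf"],
        "classifier__C": [0.1, 1, 10],
        "classifier__gamma": ["scale", "auto", 0.01, 0.1],
    },
    {
        "classifier__kernel": ["linear"],
        "classifier__C": [0.1, 1, 10],
    },
]

n_original = len(ParameterGrid(original_grid))
n_split = len(ParameterGrid(split_grid))
print(n_original, n_split)  # 24 15
```

With 5-fold cross-validation this would mean 75 fits instead of 120, which is noticeable for an SVC trained with `probability=True`.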





In [62]:
print("Best hyperparameters found by GridSearchCV:")
print(grid_search_svc.best_params_)

print("\nBest cross-validated ROC AUC inside GridSearchCV (on the training data):")
print(f"{grid_search_svc.best_score_:.4f}")

best_model_svc = grid_search_svc.best_estimator_

Best hyperparameters found by GridSearchCV:
{'classifier__C': 10, 'classifier__gamma': 0.01, 'classifier__kernel': 'rbf'}

Best cross-validated ROC AUC inside GridSearchCV (on the training data):
0.9023


In [63]:
print("Final evaluation of the BEST model (after GridSearchCV) on the test set:")

# We use X_test. The pipeline in 'best_model_svc' handles the scaling itself
# (using the parameters learned on X_train).
y_pred_final = best_model_svc.predict(X_test)
y_pred_proba_final = best_model_svc.predict_proba(X_test)[:, 1]

print(f"Accuracy: {accuracy_score(y_test, y_pred_final):.4f}")
print(f"ROC AUC Score: {roc_auc_score(y_test, y_pred_proba_final):.4f}")

print("\nConfusion Matrix (Final Model):")
cm_final = confusion_matrix(y_test, y_pred_final)
print(cm_final)
tn_final, fp_final, fn_final, tp_final = cm_final.ravel()
print(f"True Negatives: {tn_final}, False Positives: {fp_final}")
print(f"False Negatives: {fn_final}, True Positives: {tp_final}")

print("\nClassification Report (Final Model):\n")
print(classification_report(y_test, y_pred_final, target_names=['<=50K (0)', '>50K (1)']))

Final evaluation of the BEST model (after GridSearchCV) on the test set:
Accuracy: 0.7970
ROC AUC Score: 0.9005

Confusion Matrix (Final Model):
[[4498 1291]
 [ 286 1692]]
True Negatives: 4498, False Positives: 1291
False Negatives: 286, True Positives: 1692

Classification Report (Final Model):

              precision    recall  f1-score   support

   <=50K (0)       0.94      0.78      0.85      5789
    >50K (1)       0.57      0.86      0.68      1978

    accuracy                           0.80      7767
   macro avg       0.75      0.82      0.77      7767
weighted avg       0.85      0.80      0.81      7767



### Random Forest

In [64]:
# Pipeline for Random Forest
pipeline_rf = Pipeline([
    # Same preprocessor as for Logistic Regression
    ('preprocessor', preprocessor),
    
    ('classifier', RandomForestClassifier(random_state=17, class_weight='balanced'))
])

print("\nBaseline model (Random Forest) evaluated with 5-fold cross-validation on X_train:")
cv_scores_rf = cross_val_score(pipeline_rf, X_train, y_train, cv=5, scoring='roc_auc', n_jobs=-1)

print(f"Cross-validation ROC AUC scores for Random Forest: {cv_scores_rf}")
print(f"Mean ROC AUC for Random Forest: {cv_scores_rf.mean():.4f} (+/- {cv_scores_rf.std() * 2:.4f})")


Baseline model (Random Forest) evaluated with 5-fold cross-validation on X_train:
Cross-validation ROC AUC scores for Random Forest: [0.88300984 0.88776807 0.8792194  0.88193415 0.87946406]
Mean ROC AUC for Random Forest: 0.8823 (+/- 0.0062)


In [65]:
# Hyperparameter grid for RandomForest
param_grid_rf = {
    # Number of trees in the forest
    'classifier__n_estimators': [100, 200, 300],
    # Maximum depth of a tree
    'classifier__max_depth': [None, 10, 20, 30],
    # Minimum number of samples required to split a node
    'classifier__min_samples_split': [2, 5, 10],
    # Minimum number of samples required in a leaf
    'classifier__min_samples_leaf': [1, 2, 4]
}


# GridSearchCV for RandomForest
grid_search_rf = GridSearchCV(
    estimator=pipeline_rf,
    param_grid=param_grid_rf,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1
)

grid_search_rf.fit(X_train, y_train)

Fitting 5 folds for each of 108 candidates, totalling 540 fits
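
An exhaustive grid of this size costs 540 fits. If that is too slow, one alternative (not used in this notebook) is `RandomizedSearchCV`, which samples a fixed number of combinations from the same kind of grid. The sketch below runs on synthetic data with deliberately small forests so that it finishes quickly; all names and values in it are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic, imbalanced dataset standing in for the real one.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=17
)

param_distributions = {
    "n_estimators": [10, 20, 30],  # kept small so the sketch runs fast
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=17, class_weight="balanced"),
    param_distributions=param_distributions,
    n_iter=5,          # evaluate only 5 sampled combinations out of 81
    cv=3,
    scoring="roc_auc",
    random_state=17,
)
search.fit(X_demo, y_demo)

print(f"Candidates evaluated: {len(search.cv_results_['params'])}")
print(search.best_params_)
```

The trade-off is that a random search may miss the best combination in the grid, so it is mainly useful as a first pass before a narrower exhaustive search.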





In [66]:
print("Best hyperparameters found by GridSearchCV:")
print(grid_search_rf.best_params_)

print("\nBest cross-validated ROC AUC inside GridSearchCV (on the training data):")
print(f"{grid_search_rf.best_score_:.4f}")

best_model_rf = grid_search_rf.best_estimator_

Best hyperparameters found by GridSearchCV:
{'classifier__max_depth': 20, 'classifier__min_samples_leaf': 4, 'classifier__min_samples_split': 10, 'classifier__n_estimators': 300}

Best cross-validated ROC AUC inside GridSearchCV (on the training data):
0.9149


In [67]:
print("Final evaluation of the BEST model (after GridSearchCV) on the test set:")

# We use X_test. The pipeline in 'best_model_rf' handles the scaling itself
# (using the parameters learned on X_train).
y_pred_final = best_model_rf.predict(X_test)
y_pred_proba_final = best_model_rf.predict_proba(X_test)[:, 1]

print(f"Accuracy: {accuracy_score(y_test, y_pred_final):.4f}")
print(f"ROC AUC Score: {roc_auc_score(y_test, y_pred_proba_final):.4f}")

print("\nConfusion Matrix (Final Model):")
cm_final = confusion_matrix(y_test, y_pred_final)
print(cm_final)
tn_final, fp_final, fn_final, tp_final = cm_final.ravel()
print(f"True Negatives: {tn_final}, False Positives: {fp_final}")
print(f"False Negatives: {fn_final}, True Positives: {tp_final}")

print("\nClassification Report (Final Model):\n")
print(classification_report(y_test, y_pred_final, target_names=['<=50K (0)', '>50K (1)']))

Final evaluation of the BEST model (after GridSearchCV) on the test set:
Accuracy: 0.8250
ROC AUC Score: 0.9147

Confusion Matrix (Final Model):
[[4732 1057]
 [ 302 1676]]
True Negatives: 4732, False Positives: 1057
False Negatives: 302, True Positives: 1676

Classification Report (Final Model):

              precision    recall  f1-score   support

   <=50K (0)       0.94      0.82      0.87      5789
    >50K (1)       0.61      0.85      0.71      1978

    accuracy                           0.83      7767
   macro avg       0.78      0.83      0.79      7767
weighted avg       0.86      0.83      0.83      7767

