# Starting point

We want to compare two approaches to classifying observations from a set of predictors.

Chosen approaches:
- kNN classifier
- Naive Bayes classifier

The comparison is based on the accuracy, recall, precision, and F1 score of the models.
The goal is to find the best model for predicting "income" from meaningful categorical features in the *census* data.

In [113]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV

# Part 1

## Data preparation

For both models the census.csv data will be used.

In [47]:
dat = pd.read_csv("census.csv")
dat.head()

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


The target is **income**.

In [48]:
dat[["income"]].value_counts()


income
<=50K     24720
>50K       7841
Name: count, dtype: int64

Our target variable is imbalanced: roughly 76% of the rows are `<=50K` and 24% are `>50K`. We should therefore take care to preserve these proportions when splitting the data into training and test samples.

We need to choose a few meaningful categorical features as predictors.


In [49]:
def analyze_categorical_columns(df):
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns

    if len(categorical_cols) == 0:
        print("No categorical columns in DataFrame.")
        return

    for col in categorical_cols:
        print(f"Number of unique values: {df[col].nunique()}")
        print(f"{df[col].value_counts()}")
        print("-" * 30)

analyze_categorical_columns(dat)

Number of unique values: 9
workclass
Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: count, dtype: int64
------------------------------
Number of unique values: 16
education
HS-grad         10501
Some-college     7291
Bachelors        5355
Masters          1723
Assoc-voc        1382
11th             1175
Assoc-acdm       1067
10th              933
7th-8th           646
Prof-school       576
9th               514
12th              433
Doctorate         413
5th-6th           333
1st-4th           168
Preschool          51
Name: count, dtype: int64
------------------------------
Number of unique values: 7
marital.status
Married-civ-spouse       14976
Never-married            10683
Divorced                  4443
Separated                 1025
Widowed                    993
Married-spouse-absent      4

Based on the information above, the following categorical features were selected:

- workclass (9 classes)
- education (16 classes)
- marital.status (7 classes)
- occupation (15 classes)
- sex (2 classes)

In [50]:
whole_dataset = dat[["workclass","education","marital.status", "occupation", "sex", "income"]]
whole_dataset.head(3)

Unnamed: 0,workclass,education,marital.status,occupation,sex,income
0,State-gov,Bachelors,Never-married,Adm-clerical,Male,<=50K
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Male,<=50K
2,Private,HS-grad,Divorced,Handlers-cleaners,Male,<=50K


In [51]:
whole_dataset.isna().sum()

workclass         0
education         0
marital.status    0
occupation        0
sex               0
income            0
dtype: int64

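There are no missing (NaN) values, good. Note, however, that `workclass` contains 1836 rows with the placeholder "?", which `isna()` does not count; here it is simply treated as a category of its own.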
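The `isna()` check only detects real NaN values, while the `workclass` counts above show that unknown entries are stored as the string "?". The following is a minimal sketch of how both kinds of gaps could be counted side by side. The small `demo` frame is made up for illustration and only mirrors the "?" convention of the census data.

```python
import pandas as pd

# Tiny illustrative frame; the values are invented, only the "?" convention
# mirrors what the census data uses for unknown entries.
demo = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "?"],
    "sex": ["Male", "Female", "Male", "Female"],
})

nan_counts = demo.isna().sum()            # real NaN values per column
placeholder_counts = (demo == "?").sum()  # "?" placeholders per column

print(nan_counts.to_dict())
print(placeholder_counts.to_dict())
```

If the placeholders should be treated as missing values instead of as a category, one option is to pass `na_values="?"` to `pd.read_csv`. Whether that is preferable depends on the model, so it is left as a design choice here.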

## One-hot encoding

In [101]:

# Use pd.get_dummies() to one-hot encode the categorical columns
ds_encoded = pd.get_dummies(whole_dataset, columns=["workclass","education","marital.status", "occupation", "sex"], drop_first=True)
ds_encoded.iloc[:,:5].head(1)

Unnamed: 0,income,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private
0,<=50K,False,False,False,False


Only the first few columns are shown here.
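To make the effect of `drop_first=True` more concrete, here is a small sketch on an invented three-row frame (the `demo` data is an assumption for illustration, not part of the census file). It compares the columns produced with and without dropping the first level of each feature.

```python
import pandas as pd

# Invented toy data with two categorical columns.
demo = pd.DataFrame({
    "sex": ["Male", "Female", "Female"],
    "workclass": ["Private", "?", "State-gov"],
})

# Full encoding: one indicator column per category level.
full = pd.get_dummies(demo, columns=["sex", "workclass"])

# Reduced encoding: the first (alphabetically smallest) level of each
# feature is dropped and becomes the implicit reference level.
reduced = pd.get_dummies(demo, columns=["sex", "workclass"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With `drop_first=True`, a feature with n levels yields n - 1 indicator columns. This is consistent with the encoded census data, where `workclass` (9 classes) contributes 8 columns and there is no `workclass_?` column.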


In [108]:
print(f"Total number of columns: {len(ds_encoded.columns)}")
print(ds_encoded.columns)

Total number of columns: 45
Index(['income', 'workclass_Federal-gov', 'workclass_Local-gov',
       'workclass_Never-worked', 'workclass_Private', 'workclass_Self-emp-inc',
       'workclass_Self-emp-not-inc', 'workclass_State-gov',
       'workclass_Without-pay', 'education_11th', 'education_12th',
       'education_1st-4th', 'education_5th-6th', 'education_7th-8th',
       'education_9th', 'education_Assoc-acdm', 'education_Assoc-voc',
       'education_Bachelors', 'education_Doctorate', 'education_HS-grad',
       'education_Masters', 'education_Preschool', 'education_Prof-school',
       'education_Some-college', 'marital.status_Married-AF-spouse',
       'marital.status_Married-civ-spouse',
       'marital.status_Married-spouse-absent', 'marital.status_Never-married',
       'marital.status_Separated', 'marital.status_Widowed',
       'occupation_Adm-clerical', 'occupation_Armed-Forces',
       'occupation_Craft-repair', 'occupation_Exec-managerial',
       'occupation_Farming-fis

The total number of columns is 45.

Now the data is ready to be split into training and test samples.

## Train-test split

In [109]:
X = ds_encoded.iloc[:,1:]
y = ds_encoded.iloc[:,0]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=471
)

In [110]:
pd.concat({"train": y_train.value_counts(), "test": y_test.value_counts()}, axis=1)

income,train,test
<=50K,16518,8202
>50K,5189,2652


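The split was not explicitly stratified, but the class proportions ended up very similar: `>50K` makes up about 23.9% of the training sample and 24.4% of the test sample.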
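The split above relies on chance to keep the class proportions similar. As a minimal sketch, `train_test_split` can also enforce them through its `stratify` argument. The toy arrays `X_demo` and `y_demo` below are made up (about 24% positives, loosely imitating the census target) and are not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the census target: 24% ">50K", 76% "<=50K" (illustrative only).
rng = np.random.default_rng(0)
y_demo = np.array([">50K"] * 240 + ["<=50K"] * 760)
X_demo = rng.normal(size=(1000, 3))

# stratify=y_demo keeps the class proportions equal in both samples.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=1/3, random_state=471, stratify=y_demo
)

train_share = np.mean(y_tr == ">50K")
test_share = np.mean(y_te == ">50K")
print(f"train: {train_share:.3f}, test: {test_share:.3f}")
```

Applied to the notebook, this would amount to adding `stratify=y` to the existing call. Doing so would change the exact train and test counts reported in this section.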

## k-NN-model

As in the previous homework, we can use grid search to find a good k and determine the best kNN model. We already know that this method gives a reasonably reliable choice, because every candidate k is evaluated with cross-validation (in our case with 10 folds).

In [112]:
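# Base estimator (assumed default settings; its original definition is not shown).
# GridSearchCV sets n_neighbors for each candidate.
knn = KNeighborsClassifier()
gs = GridSearchCV(estimator = knn,
        param_grid = {'n_neighbors' : list(range(1,10))},
        scoring = 'accuracy',
        cv = 10,
        refit = True)
# refit=True retrains the best candidate on the full training sample
best_knn_model = gs.fit(X_train, y_train).best_estimator_
print('Best k : %d' % best_knn_model.get_params()['n_neighbors'])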

Best k : 9


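Grid search tells us that the best k is 9. Since 9 is the largest value in the searched range (1 to 9), a larger k might perform even better, so widening the grid would be a sensible follow-up.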
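Besides the single best k, the fitted grid search object also stores the cross-validated score of every candidate in `cv_results_`. The sketch below shows one way to inspect it. It runs on synthetic data from `make_classification` and uses 5 folds to stay fast, so both the data and the names `X_demo`, `y_demo`, and `gs_demo` are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; not the census sample.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

gs_demo = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 10))},
    scoring="accuracy",
    cv=5,
)
gs_demo.fit(X_demo, y_demo)

# One row per candidate k with its mean and standard deviation across folds.
scores = pd.DataFrame(gs_demo.cv_results_)[
    ["param_n_neighbors", "mean_test_score", "std_test_score"]
]
print(scores.sort_values("mean_test_score", ascending=False).head(3))
print("Best k:", gs_demo.best_params_["n_neighbors"])
```

Looking at the whole table helps to judge whether the winning k is clearly better than its neighbours or whether the differences are within the fold-to-fold variation.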

We can calculate some performance metrics to check how good the model actually is.

In [114]:

pred = best_knn_model.predict(X_test)
print(confusion_matrix(y_test, pred))


[[7442  760]
 [1284 1368]]


In [116]:
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

       <=50K       0.85      0.91      0.88      8202
        >50K       0.64      0.52      0.57      2652

    accuracy                           0.81     10854
   macro avg       0.75      0.71      0.73     10854
weighted avg       0.80      0.81      0.80     10854
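To connect the report with the confusion matrix, the following sketch recomputes the metrics for the `>50K` class by hand. The four counts are copied from the confusion matrix printed above. The variable names (`tn`, `fp`, `fn`, `tp`) are introduced here for illustration and treat `>50K` as the positive class.

```python
# Counts copied from the confusion matrix above.
# Rows are true classes, columns are predicted classes, in the order [<=50K, >50K].
tn, fp = 7442, 760    # true <=50K: predicted <=50K / predicted >50K
fn, tp = 1284, 1368   # true >50K:  predicted <=50K / predicted >50K

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all correct predictions
precision = tp / (tp + fp)                   # correct among predicted >50K
recall = tp / (tp + fn)                      # found among actual >50K
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.81 precision=0.64 recall=0.52 f1=0.57
```

These values match the `>50K` row and the accuracy line of the report. The recall of 0.52 means that the model misses almost half of the high-income cases, even though the overall accuracy looks good.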

