## Note on Execution Counts and Kernel Restart

During this project I ran into a problem with my Jupyter Notebook. After many iterations and a lot of experimentation (execution counts had reached the 300s), my laptop slowed down significantly. Some cells stayed at `[*]` and never finished, even for basic operations like importing libraries. The last cell to complete had execution count 358; it was the cell just before hyperparameter tuning, i.e. the one before importing GridSearchCV. After that cell ran, the laptop slowed down and no further cells executed. Saving and reopening the file did not help, so I shut everything down and started Jupyter again from the command prompt.

To resolve this, I had to restart the Jupyter kernel. Consequently, the execution counts of all cells were reset. Therefore, the current execution counts might appear lower than expected, typically starting from 1 or 2 after the restart. 

Please note that the extensive work done to preprocess, feature engineer, and fine-tune models has been preserved, but the cell execution counts do not reflect the initial high numbers due to the kernel restart.

This project involved thorough and iterative attempts to improve the model, which is why the initial execution counts reached such high numbers before the restart was necessary.

The restart changed the accuracy values slightly, but not significantly.


Re-Edit :- It turns out that if the laptop powers off (not sleep, an actual power-off or the battery reaching 0), the kernel disconnects and I have to restart Jupyter Notebook to reconnect it. That is probably what happened earlier too, rather than the cause I described above. As of now, hyperparameter tuning has been done for Random Forest and Gradient Boosting. I decided not to tune Support Vector Machines: it takes a lot of time, and the accuracy gap between Gradient Boosting and SVM is already large enough that tuning probably won't close it. I was interested in seeing how much the SVM accuracy would improve, but it takes too long and is probably not worth it. (If you run this on an online platform that doesn't depend on your laptop's specifications, it might take less time and so be worth trying.)

# Libraries

In [28]:
import pandas as pd
import numpy as np
import warnings
from sklearn.impute import SimpleImputer
import seaborn as sns
import matplotlib.pyplot as plt
import joblib
from joblib import dump
warnings.filterwarnings('ignore')

In [24]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score

In [25]:
from sklearn.ensemble import RandomForestClassifier

In [26]:
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Dataset

In [5]:
temp = pd.read_csv("adult.data", skipinitialspace = True)

In [6]:
temp1 = pd.read_csv("adult.test", skipinitialspace = True)

In [33]:
dataset = pd.concat([temp, temp1], axis = 0, ignore_index = True)

#### Dataset shown for reference

In [36]:
dataset

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K.
48838,64,?,321403,HS-grad,9,Widowed,?,Other-relative,Black,Male,0,0,40,United-States,<=50K.
48839,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K.
48840,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K.


#### -> 'occupation', 'native-country' and 'workclass' had missing values (marked "?")
#### -> Missing 'native-country' values ("?") were replaced with United-States, since that was the most frequent country for every race
#### -> 'occupation' was set to "Never-worked" wherever 'workclass' was "Never-worked"
#### -> The remaining missing 'occupation' and 'workclass' values were filled in using Random Forest models
#### -> Some features that might be related to occupation (even if controversial) were given to the model to predict occupation
#### -> Since 'workclass' cannot be tied completely to any specific columns, all remaining columns were used to predict its missing values
#### -> The 'education' column was left out since 'education-num' encodes the same information
#### -> The 'income' column was converted to a binary column with values -1 and 1
#### -> The label encoders and the standard scaler were saved with joblib so that any new data can be transformed the same way
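
The income conversion has to handle two spellings of each label, because the rows coming from `adult.test` carry a trailing period (`<=50K.`, `>50K.`). Below is a minimal sketch of one way to normalise them in a single step. It runs on a small hand-made Series rather than the real data, and the variable names are illustrative, not taken from this notebook.

```python
import pandas as pd

# Stand-in for dataset['income']; the trailing "." mimics the adult.test rows.
raw_income = pd.Series(["<=50K", ">50K", "<=50K.", ">50K."], name="income")

# Strip the trailing period, then map the two remaining labels to -1 / 1.
income = raw_income.str.rstrip(".").map({"<=50K": -1, ">50K": 1}).astype(int)

print(income.tolist())
```

Done this way, the column comes out with an integer dtype directly, so a later `astype(int)` should not be needed.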

In [9]:
dataset.loc[(dataset['income'] == "<=50K") | (dataset['income'] == "<=50K."), 'income'] = -1
dataset.loc[(dataset['income'] == ">50K") | (dataset['income'] == ">50K."), 'income'] = 1
dataset.loc[(dataset['workclass'] == "Never-worked"), 'occupation'] = "Never-worked"
dataset.loc[(dataset['native-country'] == '?'), 'native-country'] = "United-States"
categorical = ['marital-status', 'relationship', 'race', 'sex', 'native-country']
numerical = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
label = {}
ss = StandardScaler()
for col in categorical:
    le = LabelEncoder()
    dataset[col] = le.fit_transform(dataset[col])
    label[col] = le

dataset[numerical] = ss.fit_transform(dataset[numerical])

for col, le in label.items():
    joblib.dump(le, f'{col}_label_encoder.pkl')
joblib.dump(ss, 'standard_scaler.pkl')

['standard_scaler.pkl']

In [10]:
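# .copy() gives independent frames, so the column assignments below do not act on views of `dataset`
dataset_no_missing = dataset[(dataset['occupation'] != '?') & (dataset['workclass'] != '?')].copy()
dataset_missing = dataset[(dataset['occupation'] == '?') | (dataset['workclass'] == '?')].copy()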

In [11]:
categorical1 = ['workclass', 'occupation']
for col in categorical1:
    le = LabelEncoder()
    dataset_no_missing[col] = le.fit_transform(dataset_no_missing[col])
    joblib.dump(le, f'{col}_label_encoder.pkl')

In [12]:
# Features that might be related to occupation
x_train1 = dataset_no_missing[['education-num', 'race', 'sex', 'fnlwgt', 'hours-per-week', 'income']].values
y_train1 = dataset_no_missing['occupation'].values  # 1-D target, the shape scikit-learn expects
rf0 = RandomForestClassifier(random_state = 42)
rf0.fit(x_train1, y_train1)

In [13]:
x_test1 = dataset_missing[['education-num', 'race', 'sex', 'fnlwgt', 'hours-per-week', 'income']]
x_test1 = x_test1.iloc[:, :].values
dataset_missing['occupation'] = rf0.predict(x_test1)

In [14]:
# Use every remaining column (including the imputed occupation) to predict workclass
x_train2 = dataset_no_missing.drop(['workclass', 'education'], axis = 1).values
y_train2 = dataset_no_missing['workclass'].values  # 1-D target, the shape scikit-learn expects
rf2 = RandomForestClassifier(random_state = 42)
rf2.fit(x_train2, y_train2)
x_test2 = dataset_missing.drop(['workclass', 'education'], axis = 1).values
dataset_missing['workclass'] = rf2.predict(x_test2)

In [15]:
dataset_pp = pd.concat([dataset_missing, dataset_no_missing], axis = 0, ignore_index = True)

In [16]:
dataset_pp.values

array([[1.1200583962298325, 3, -0.0895158241693816, ...,
        1.5799464539233603, 34, 1],
       [-0.48456647364404626, 3, 0.9873954372655107, ...,
        -0.03408696347500956, 38, -1],
       [-0.9951289322402804, 3, 0.10432346817242755, ...,
        -0.03408696347500956, 38, -1],
       ...,
       [-0.04694150913298843, 3, 1.7548645689912852, ...,
        0.7729297452241753, 38, -1],
       [0.3906834553780694, 3, -1.0016116052325499, ...,
        -0.03408696347500956, 38, -1],
       [-0.26575399138851735, 4, -0.07117353255892316, ...,
        1.5799464539233603, 38, 1]], dtype=object)

In [17]:
missing_values = dataset_pp.isnull()

if missing_values.any().any():
    print("There are missing values in the dataset.")
else:
    print("No missing values found in the dataset.")

if missing_values.any(axis=1).any():
    print("There are rows with missing values.")
else:
    print("No rows contain missing values.")

No missing values found in the dataset.
No rows contain missing values.


In [32]:
missing_counts = {}
for col in dataset_pp.columns:
    missing_count = (dataset_pp[col] == "?").sum()
    if missing_count > 0:
        missing_counts[col] = missing_count

if missing_counts:
    print("Columns with missing values and their counts:")
    for col, count in missing_counts.items():
        print(f"{col}: {count}")
else:
    print("No missing values found in the dataset.")

joblib.dump(dataset_pp, 'dataset_pp.pkl')

No missing values found in the dataset.


['dataset_pp.pkl']

### Dataset without any missing values has been prepared
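
Because the prepared frame is written to `dataset_pp.pkl`, a restarted kernel could reload it (for example with `joblib.load`) instead of repeating the imputation cells. The sketch below shows the general load-or-compute pattern using only the standard library `pickle` module and a stand-in object. The helper name `load_or_compute` and the function `expensive_step` are hypothetical and not part of this notebook.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(path, compute):
    """Return the object pickled at `path`; compute and save it if the file is absent."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = compute()
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []


def expensive_step():
    # Stand-in for the preprocessing cells; records every time it really runs.
    calls.append(1)
    return {"status": "preprocessed"}


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "dataset_pp.pkl"
    first = load_or_compute(checkpoint, expensive_step)   # computes and saves
    second = load_or_compute(checkpoint, expensive_step)  # loads from disk

print(first == second, len(calls))
```

The second call returns the saved object without running `expensive_step` again, which is the behaviour you would want after a kernel restart.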

## Training, Testing Split

In [63]:
x_train, x_test, y_train, y_test = train_test_split(dataset_pp.drop(['education', 'income'], axis = 1).values, dataset_pp[['income']].values, test_size = 0.2, random_state = 42)
y_train = y_train.astype(int)
y_test = y_test.astype(int)

# Models

In [42]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

## Basic Models

### Logistic Regression

In [21]:
lr = LogisticRegression()
scores_lr = cross_val_score(lr, x_train, y_train, cv = 10, scoring = 'accuracy')
lr.fit(x_train, y_train)
y_test_pred_lr = lr.predict(x_test)
accuracy_score(y_test, y_test_pred_lr), scores_lr.mean()

(0.8210666393694339, 0.8255829889873019)

### Decision Tree

In [22]:
dt = DecisionTreeClassifier(random_state = 42)
scores_dt = cross_val_score(dt, x_train, y_train, cv = 10, scoring = 'accuracy')
dt.fit(x_train, y_train)
y_test_pred_dt = dt.predict(x_test)
accuracy_score(y_test, y_test_pred_dt), scores_dt.mean()

(0.808475790766711, 0.8142958574471614)

## Ensemble Methods

### Random Forest

In [23]:
rf = RandomForestClassifier(random_state = 42)
scores_rf = cross_val_score(rf, x_train, y_train, cv = 10, scoring = 'accuracy')
rf.fit(x_train, y_train)
y_test_pred_rf = rf.predict(x_test)
accuracy_score(y_test, y_test_pred_rf), scores_rf.mean()

(0.8548469648889344, 0.8616179159312773)

### Gradient Boosting

In [24]:
gsb = GradientBoostingClassifier()
scores_gsb = cross_val_score(gsb, x_train, y_train, cv = 10, scoring = 'accuracy')
gsb.fit(x_train, y_train)
y_test_pred_gsb = gsb.predict(x_test)
accuracy_score(y_test, y_test_pred_gsb), scores_gsb.mean()

(0.8623195823523391, 0.8661479382857161)

## Advanced Models

### Support Vector Machines

In [25]:
sv = SVC(random_state = 42)
scores_sv = cross_val_score(sv, x_train, y_train, cv = 10, scoring = 'accuracy')
sv.fit(x_train, y_train)
y_test_pred_sv = sv.predict(x_test)
accuracy_score(y_test, y_test_pred_sv), scores_sv.mean()

(0.8082710615211383, 0.8128630042028859)

### I'll try out Neural Networks in the next project with an image dataset or some basic neural network project

## Specialized Models

### K-Nearest Neighbors (KNN)

In [26]:
knn = KNeighborsClassifier()
scores_knn = cross_val_score(knn, x_train, y_train, cv = 10, scoring = 'accuracy')
knn.fit(x_train, y_train)
y_test_pred_knn = knn.predict(x_test)
accuracy_score(y_test, y_test_pred_knn), scores_knn.mean()

(0.8274132459821886, 0.8339518943376177)

### Naive Bayes

In [27]:
nb = GaussianNB()
scores_nb = cross_val_score(nb, x_train, y_train, cv = 10, scoring = 'accuracy')
nb.fit(x_train, y_train)
y_test_pred_nb = nb.predict(x_test)
accuracy_score(y_test, y_test_pred_nb), scores_nb.mean()

(0.7951683898044836, 0.8023185165643693)

## Testing if removing 'marital-status' and 'relationship' affects accuracy score

In [28]:
x_train1, x_test1, y_train1, y_test1 = train_test_split(dataset_pp.drop(['education', 'income', 'marital-status', 'relationship'], axis = 1).values, dataset_pp[['income']].values, test_size = 0.2, random_state = 42)
y_train1 = y_train1.astype(int)
y_test1 = y_test1.astype(int)

## Basic Models

### Logistic Regression

In [29]:
lr1 = LogisticRegression()
scores_lr1 = cross_val_score(lr1, x_train1, y_train1, cv = 10, scoring = 'accuracy')
lr1.fit(x_train1, y_train1)
y_test_pred_lr1 = lr1.predict(x_test1)
accuracy_score(y_test1, y_test_pred_lr1), scores_lr1.mean()

(0.8200429931415703, 0.8252757759148933)

### Decision Tree

In [30]:
dt1 = DecisionTreeClassifier(random_state = 42)
scores_dt1 = cross_val_score(dt1, x_train1, y_train1, cv = 10, scoring = 'accuracy')
dt1.fit(x_train1, y_train1)
y_test_pred_dt1 = dt1.predict(x_test1)
accuracy_score(y_test1, y_test_pred_dt1), scores_dt1.mean()

(0.7794042378953834, 0.7887799016488527)

## Ensemble Methods

### Random Forest Classifier

In [31]:
rf1 = RandomForestClassifier(random_state = 42)
scores_rf1 = cross_val_score(rf1, x_train1, y_train1, cv = 10, scoring = 'accuracy')
rf1.fit(x_train1, y_train1)
y_test_pred_rf1 = rf1.predict(x_test1)
accuracy_score(y_test1, y_test_pred_rf1), scores_rf1.mean()

(0.8292558091923431, 0.8427558310032722)

### Gradient Boosting Classifier

In [32]:
gsb1 = GradientBoostingClassifier()
scores_gsb1 = cross_val_score(gsb1, x_train1, y_train1, cv = 10, scoring = 'accuracy')
gsb1.fit(x_train1, y_train1)
y_test_pred_gsb1 = gsb1.predict(x_test1)
accuracy_score(y_test1, y_test_pred_gsb1), scores_gsb1.mean()

(0.8450199611014434, 0.8532746449631518)

## Advanced Models

### Support Vector Machines

In [33]:
sv1 = SVC(random_state = 42)
scores_sv1 = cross_val_score(sv1, x_train1, y_train1, cv = 10, scoring = 'accuracy')
sv1.fit(x_train1, y_train1)
y_test_pred_sv1 = sv1.predict(x_test1)
accuracy_score(y_test1, y_test_pred_sv1), scores_sv1.mean()

(0.8037670181185382, 0.8084354604325386)

## Specialized Models

### KNN

In [34]:
knn1 = KNeighborsClassifier()
scores_knn1 = cross_val_score(knn1, x_train1, y_train1, cv = 10, scoring = 'accuracy')
knn1.fit(x_train1, y_train1)
y_test_pred_knn1= knn1.predict(x_test1)
accuracy_score(y_test1, y_test_pred_knn1), scores_knn1.mean()

(0.8015149964172382, 0.8138613107880012)

### Naive Bayes

In [35]:
nb1 = GaussianNB()
scores_nb1 = cross_val_score(nb1, x_train1, y_train1, cv = 10, scoring = 'accuracy')
nb1.fit(x_train1, y_train1)
y_test_pred_nb1 = nb1.predict(x_test1)
accuracy_score(y_test1, y_test_pred_nb1), scores_nb1.mean()

(0.789128877060088, 0.7973278743582562)

### Checking combined accuracy for clarification

In [36]:
scores_all_features = scores_lr.mean() + scores_dt.mean() + scores_rf.mean() + scores_gsb.mean() + scores_knn.mean() + scores_nb.mean() + scores_sv.mean()
scores_less_features = scores_lr1.mean() + scores_dt1.mean() + scores_rf1.mean() + scores_gsb1.mean() + scores_knn1.mean() + scores_nb1.mean() + scores_sv1.mean()

In [37]:
scores_all_features/7, scores_less_features/7

(0.8309683022509043, 0.8185301141584237)

## This means that 'marital-status' and 'relationship' did have some relation to 'income': removing them lowers the mean cross-validation accuracy from about 0.831 to about 0.819
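
The two averages hide how differently the individual models react to dropping 'marital-status' and 'relationship'. The sketch below lines up the mean 10-fold cross-validation accuracies copied from the cell outputs above and prints the per-model drop. The dictionary names are illustrative.

```python
# Mean 10-fold CV accuracies copied from the cell outputs in this notebook.
all_features = {
    "Logistic Regression": 0.8255829889873019,
    "Decision Tree": 0.8142958574471614,
    "Random Forest": 0.8616179159312773,
    "Gradient Boosting": 0.8661479382857161,
    "SVM": 0.8128630042028859,
    "KNN": 0.8339518943376177,
    "Naive Bayes": 0.8023185165643693,
}
fewer_features = {
    "Logistic Regression": 0.8252757759148933,
    "Decision Tree": 0.7887799016488527,
    "Random Forest": 0.8427558310032722,
    "Gradient Boosting": 0.8532746449631518,
    "SVM": 0.8084354604325386,
    "KNN": 0.8138613107880012,
    "Naive Bayes": 0.7973278743582562,
}

# Positive drop = the model was more accurate with all features.
drops = {name: all_features[name] - fewer_features[name] for name in all_features}
for name, drop in sorted(drops.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {drop:+.4f}")

mean_all = sum(all_features.values()) / len(all_features)
mean_fewer = sum(fewer_features.values()) / len(fewer_features)
print(mean_all, mean_fewer)
```

On these numbers every model loses some accuracy. The tree-based models and KNN lose the most (roughly 1.3 to 2.6 percentage points), while Logistic Regression barely changes.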

## Trying out hyperparameter tuning on specific models to see if it makes a difference

This process takes a lot of time. If you only want to check the parameters, I recommend plugging the best parameters reported below into the model directly and checking their accuracy, rather than re-running the search. If you do want to run the search, add verbose = 2 to GridSearchCV so you can follow its progress. I have added it for SVM but not for Random Forest or Gradient Boosting.
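
To get a feel for why the search is slow, you can count how many model fits an exhaustive grid search performs: the product of the option counts, multiplied by the number of folds. The sketch below does this for the Gradient Boosting and SVM grids used in this notebook. The helper `n_fits` is a hypothetical name, not a scikit-learn function.

```python
from math import prod


def n_fits(param_grid, cv):
    """Fits done by an exhaustive grid search, not counting the final refit."""
    return prod(len(values) for values in param_grid.values()) * cv


parameter_grid_gsb = {
    'n_estimators': [50, 100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2, 0.3],
    'max_depth': [3, 5, 7, 9],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'subsample': [0.6, 0.8, 1.0],
}
parameter_grid_svm = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'gamma': [0.001, 0.01, 0.1, 1],
}

print(n_fits(parameter_grid_gsb, cv=5))  # 1728 combinations x 5 folds = 8640
print(n_fits(parameter_grid_svm, cv=5))  # 64 combinations x 5 folds = 320
```

The SVM grid needs far fewer fits, but each individual SVC fit on a dataset of this size tends to be much slower, so the fit count alone does not predict the total running time.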

In [38]:
from sklearn.model_selection import GridSearchCV

### Random Forest Classifier

In [45]:
parameter_grid_rf = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],  # 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3
    'bootstrap': [True, False]
}
rf2 = RandomForestClassifier(random_state = 42)
grid_search_rf = GridSearchCV(estimator = rf2, param_grid = parameter_grid_rf, cv = 5, scoring = 'accuracy', n_jobs = -1)
grid_search_rf.fit(x_train, y_train)

print("Best parameters: ", grid_search_rf.best_params_)
best_rf = grid_search_rf.best_estimator_
y_test_pred_rf2 = best_rf.predict(x_test)
print("Test Accuracy: ", accuracy_score(y_test, y_test_pred_rf2))
print("Best Cross-Validation Score: ", grid_search_rf.best_score_)

Best parameters:  {'bootstrap': True, 'max_depth': 20, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 100}
Test Accuracy:  0.8623195823523391
Best Cross-Validation Score:  0.8688352958688746


### Gradient Boosting Classifier

In [46]:
parameter_grid_gsb = {
    'n_estimators': [50, 100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2, 0.3], 
    'max_depth': [3, 5, 7, 9],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'subsample': [0.6, 0.8, 1.0]
}
gsb2 = GradientBoostingClassifier()
grid_search_gsb = GridSearchCV(estimator = gsb2, param_grid = parameter_grid_gsb, cv = 5, scoring = 'accuracy', n_jobs = -1)
grid_search_gsb.fit(x_train, y_train)

print("Best parameters: ", grid_search_gsb.best_params_)
best_gsb = grid_search_gsb.best_estimator_
y_test_pred_gsb2 = best_gsb.predict(x_test)
print("Test Accuracy: ", accuracy_score(y_test, y_test_pred_gsb2))
print("Best Cross-Validation Score: ", grid_search_gsb.best_score_)

Best parameters:  {'learning_rate': 0.1, 'max_depth': 5, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 200, 'subsample': 1.0}
Test Accuracy:  0.8697921998157436
Best Cross-Validation Score:  0.876257284487495


### Support Vector Machines

The code below sets up hyperparameter tuning for SVM. I haven't run it, for the reasons given in the first markdown cell of this file.

In [None]:
parameter_grid_svm = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'gamma': [0.001, 0.01, 0.1, 1]
}
svm2 = SVC(random_state = 42)
grid_search_svm = GridSearchCV(estimator = svm2, param_grid = parameter_grid_svm, cv = 5, scoring = 'accuracy', n_jobs = -1, verbose = 2)
grid_search_svm.fit(x_train, y_train)

print("Best parameters: ", grid_search_svm.best_params_)
best_svm = grid_search_svm.best_estimator_
y_test_pred_svm2 = best_svm.predict(x_test)
print("Test Accuracy: ", accuracy_score(y_test, y_test_pred_svm2))
print("Best Cross-Validation Score: ", grid_search_svm.best_score_)
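
If the full grid is too slow, one alternative (not something tried in this notebook) is scikit-learn's `RandomizedSearchCV`, which samples only a fixed number of combinations from the same grid. The sketch below runs on a small synthetic dataset from `make_classification` as a stand-in for `x_train` / `y_train`. The sample size, `n_iter`, `cv` and `max_iter` values are assumptions chosen only to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Small synthetic stand-in for x_train / y_train so the sketch runs quickly.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

param_distributions = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'gamma': [0.001, 0.01, 0.1, 1],
}

search = RandomizedSearchCV(
    # max_iter caps the solver so no single fit can run for long (assumption for this sketch)
    estimator=SVC(random_state=42, max_iter=10000),
    param_distributions=param_distributions,
    n_iter=6,          # evaluate 6 of the 64 possible combinations
    cv=3,
    scoring='accuracy',
    random_state=42,
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

The trade-off is that the best combination may not be among the sampled ones, so the result is a cheaper estimate rather than a replacement for the exhaustive search.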

## You can try out the model on your data if you want, the steps are below

In [51]:
dataset_sample = pd.concat([temp, temp1], axis = 0, ignore_index = True)

In [52]:
for col in dataset_sample.columns:
    print(dataset_sample[col].value_counts())
    print()

age
36    1348
35    1337
33    1335
23    1329
31    1325
      ... 
88       6
85       5
87       3
89       2
86       1
Name: count, Length: 74, dtype: int64

workclass
Private             33906
Self-emp-not-inc     3862
Local-gov            3136
?                    2799
State-gov            1981
Self-emp-inc         1695
Federal-gov          1432
Without-pay            21
Never-worked           10
Name: count, dtype: int64

fnlwgt
203488    21
120277    19
190290    19
125892    18
126569    18
          ..
286983     1
185942     1
234220     1
214706     1
350977     1
Name: count, Length: 28523, dtype: int64

education
HS-grad         15784
Some-college    10878
Bachelors        8025
Masters          2657
Assoc-voc        2061
11th             1812
Assoc-acdm       1601
10th             1389
7th-8th           955
Prof-school       834
9th               756
12th              657
Doctorate         594
5th-6th           509
1st-4th           247
Preschool          83
Name: count

In [67]:
label_encoders = {}
categorical = ['marital-status', 'relationship', 'race', 'sex', 'native-country']
categorical1 = ['workclass', 'occupation']
numerical = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']

for col in categorical + categorical1:
    label_encoders[col] = joblib.load(f'{col}_label_encoder.pkl')

ss = joblib.load('standard_scaler.pkl')

sample_data = {
    'age': [30] * 5,
    'workclass': ['Private'] * 5,
    'fnlwgt': [750000] * 5,  # placeholder value; fnlwgt is the census sampling weight ("final weight"), not a personal attribute
    'education-num': [14] * 5,  # Corresponding to 'Masters'
    'marital-status': ['Never-married'] * 5,
    'occupation': ['Prof-specialty'] * 5,
    'relationship': ['Not-in-family'] * 5,
    'race': ['Asian-Pac-Islander'] * 5,  # 'Asian-Pac-Islander' for Indian
    'sex': ['Male'] * 5,
    'capital-gain': [0] * 5,
    'capital-loss': [0] * 5,
    'hours-per-week': [40, 46, 38, 48, 40], 
    'native-country': ['United-States', 'Japan', 'England', 'Scotland', 'India']
}

sample_df = pd.DataFrame(sample_data)

# Transform categorical features using the saved encoders
for col in categorical + categorical1:
    sample_df[col] = label_encoders[col].transform(sample_df[col])

# Transform numerical features using the saved scaler
sample_df[numerical] = ss.transform(sample_df[numerical])

### This sample_df can be used to predict income if you want. The census is very old though, so it won't be that accurate.
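
One thing to watch for when encoding your own data: `LabelEncoder.transform` raises a `ValueError` for any category the encoder was not fitted on. Below is a minimal sketch of checking new values against `classes_` first. It uses a freshly fitted stand-in encoder with a made-up category list; in this notebook the encoder would instead come from `joblib.load`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in for a saved encoder (the real one is loaded from a .pkl file).
encoder = LabelEncoder().fit(["England", "India", "Japan", "United-States"])

# "Atlantis" is a deliberately invalid value to show the check.
new_values = pd.Series(["India", "Japan", "Atlantis"], name="native-country")

# Find values the encoder has never seen before calling transform().
unseen = sorted(set(new_values) - set(encoder.classes_))
known_mask = new_values.isin(encoder.classes_)

# Only the known values are safe to encode.
encoded_known = encoder.transform(new_values[known_mask])

print(unseen)
print(encoded_known.tolist())
```

How to treat the unseen values (drop the rows, or map them to a common category such as the most frequent one) is a modelling decision left to you.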
### Now, finally testing the best model found and checking the accuracy

In [66]:
# {'learning_rate': 0.1, 'max_depth': 5, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 200, 'subsample': 1.0}
model = GradientBoostingClassifier(
    learning_rate=0.1,
    max_depth=5,
    min_samples_leaf=4,
    min_samples_split=2,
    n_estimators=200,
    subsample=1.0
)

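# Stored under its own name so the `temp` DataFrame loaded earlier is not overwritten
scores_final = cross_val_score(model, x_train, y_train, cv = 10, scoring = 'accuracy')
model.fit(x_train, y_train)
y_test_pred_final = model.predict(x_test)
accuracy_score(y_test, y_test_pred_final)  # arguments in (y_true, y_pred) order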

0.8697921998157436

## 86.98% accuracy achieved by Gradient Boosting Classifier with the following parameters
#### {'learning_rate': 0.1, 'max_depth': 5, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 200, 'subsample': 1.0}