# Assignment 1

You are given a dataset that describes individuals using multiple features. Data source:
[https://archive.ics.uci.edu/ml/datasets/Census+Income]

Your task is to predict whether a given individual's income is <=50K or >50K (USD).

Recommendations:

- Since this is a classification problem, you could use the Random Forest model.
- You need to remove the “target” feature from the training set.
- Draw graphs where relevant to examine the data.
- Split the data and target into training and test datasets using “train_test_split”.
- Use “OrdinalEncoder” to convert non-numerical features such as “workclass”, “occupation” etc. into numerical features.
- Use “RandomizedSearchCV” to tune the model's hyperparameters.

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import OrdinalEncoder
import matplotlib.pyplot as plt

In [2]:
data = pd.read_csv('data/census_data.csv')

In [3]:
data.head()
data.info()
data.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   age               32561 non-null  int64 
 1   workclass         32561 non-null  object
 2   final_weight      32561 non-null  int64 
 3   education         32561 non-null  object
 4   education_length  32561 non-null  int64 
 5   marital_status    32561 non-null  object
 6   occupation        32561 non-null  object
 7   relationship      32561 non-null  object
 8   race              32561 non-null  object
 9   gender            32561 non-null  object
 10  capital_gain      32561 non-null  int64 
 11  capital_loss      32561 non-null  int64 
 12  hours/week        32561 non-null  int64 
 13  native_country    32561 non-null  object
 14  target            32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


|  | age | final_weight | education_length | capital_gain | capital_loss | hours/week |
|---|---|---|---|---|---|---|
| count | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 |
| mean | 38.581647 | 189778.4 | 10.080679 | 1077.648844 | 87.30383 | 40.437456 |
| std | 13.640433 | 105550.0 | 2.57272 | 7385.292085 | 402.960219 | 12.347429 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 117827.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178356.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 237051.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1484705.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |
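
The recommendations ask for graphs where relevant, and `matplotlib` is imported but not used. The following is a minimal sketch of one possible overview plot, not part of the original notebook. The helper name `plot_income_overview` and the small `sample` frame are made up for illustration; in the notebook the real `data` frame would be passed instead.

```python
import pandas as pd
import matplotlib.pyplot as plt


def plot_income_overview(df, target_col="target"):
    """Draw the class balance and the age distribution per income class."""
    fig, (ax_counts, ax_age) = plt.subplots(1, 2, figsize=(10, 4))

    # Left panel: how many rows fall into each income class.
    df[target_col].value_counts().plot.bar(ax=ax_counts, rot=0)
    ax_counts.set_title("Class balance")
    ax_counts.set_ylabel("rows")

    # Right panel: overlaid age histograms, one per income class.
    for label, group in df.groupby(target_col):
        ax_age.hist(group["age"], bins=20, alpha=0.5, label=str(label).strip())
    ax_age.set_title("Age by income class")
    ax_age.set_xlabel("age")
    ax_age.legend()

    fig.tight_layout()
    return fig


# Tiny made-up frame so the sketch runs on its own (hypothetical values).
sample = pd.DataFrame({
    "age": [25, 38, 52, 46, 31, 60, 29, 44],
    "target": [" <=50K", " <=50K", " >50K", " >50K",
               " <=50K", " >50K", " <=50K", " <=50K"],
})
fig = plot_income_overview(sample)
```

Calling `plot_income_overview(data)` in the notebook would draw the same two panels for the full dataset.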


Split the dataset into features (X) and target (y)

In [4]:
X = data.drop('target', axis=1)
y = data['target']

Convert non-numerical features into numerical features using OrdinalEncoder

In [5]:
encoder = OrdinalEncoder()
X = pd.DataFrame(encoder.fit_transform(X), columns=X.columns)
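
The cell above passes every column through the encoder, including the columns that are already numeric. One possible alternative, sketched here on made-up rows (`X_demo` is hypothetical), encodes only the non-numeric columns and maps categories unseen at fit time to -1. It is an illustration under those assumptions, not the approach that produced the results in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up rows that mimic a few of the census columns.
X_demo = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": [" State-gov", " Private", " Private", " Self-emp-inc"],
    "occupation": [" Adm-clerical", " Sales", " Sales", " Exec-managerial"],
    "hours/week": [40, 13, 40, 45],
})

# Only non-numeric columns need encoding; numeric columns stay as they are.
categorical_cols = X_demo.select_dtypes(exclude="number").columns

# Map categories unseen at fit time to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoded = encoder.fit_transform(X_demo[categorical_cols])

# Replace each categorical column with its encoded values.
X_encoded = X_demo.copy()
for i, col in enumerate(categorical_cols):
    X_encoded[col] = encoded[:, i]

print(X_encoded)
```

Categories are numbered in sorted order per column, so the codes carry no meaning beyond identity.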

Split the dataset into training and test sets using train_test_split

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
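
Because the two income classes are not equally frequent, it may be worth keeping their proportions the same in both splits. The sketch below shows the `stratify` argument of `train_test_split` on made-up data (`X_demo` and `y_demo` are hypothetical); the notebook's own split does not use it.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 5 rows in the minority class, 15 in the majority class.
X_demo = pd.DataFrame({"age": range(20, 40)})
y_demo = pd.Series([" >50K"] * 5 + [" <=50K"] * 15)

# stratify=y_demo keeps the 1:3 class ratio in both the training and test split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(y_tr.value_counts())
print(y_te.value_counts())
```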

Train the model using a random forest classifier

In [7]:
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

Use the trained model to predict the income of individuals in the test set

In [8]:
y_pred = rfc.predict(X_test)

Evaluate the performance of the model using metrics such as accuracy, precision, recall, and F1-score

In [9]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred, pos_label=' >50K'))
print('Recall:', recall_score(y_test, y_pred, pos_label=' >50K'))
print('F1-score:', f1_score(y_test, y_pred, pos_label=' >50K'))

Accuracy: 0.8613542146476278
Precision: 0.7477744807121661
Recall: 0.6416295353278166
F1-score: 0.6906474820143884
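
The four scores above summarize the positive class only. A confusion matrix and a per-class report can show where the errors fall. The following sketch uses made-up labels (`y_true_demo`, `y_pred_demo`) rather than the notebook's predictions.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up true and predicted labels, written like the dataset's target values.
y_true_demo = [" <=50K", " <=50K", " <=50K", " >50K", " >50K", " <=50K"]
y_pred_demo = [" <=50K", " <=50K", " >50K", " >50K", " <=50K", " <=50K"]

# Fix the label order so rows and columns are always (<=50K, >50K).
labels = [" <=50K", " >50K"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
print(cm)

# Precision, recall and F1 for each class, plus averages.
print(classification_report(y_true_demo, y_pred_demo, labels=labels))
```

In the notebook, the same two calls could be applied to `y_test` and `y_pred`.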


Use RandomizedSearchCV to find the best hyperparameters for the model

In [10]:
param_distributions = {
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}

rfc = RandomForestClassifier()
random_search = RandomizedSearchCV(rfc, param_distributions=param_distributions, n_iter=100, cv=5, n_jobs=-1)
random_search.fit(X_train, y_train)
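
`RandomizedSearchCV` also accepts distributions instead of fixed lists, and a `random_state` makes the sampled candidates repeatable. The sketch below illustrates both on synthetic data from `make_classification`; the ranges, `n_iter` and `cv` values are arbitrary assumptions chosen to keep it fast, not tuned recommendations for the census data.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the census training split.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# randint(low, high) samples integers from low up to high - 1.
param_distributions = {
    "n_estimators": randint(20, 61),    # integers 20..60
    "max_depth": randint(3, 11),        # integers 3..10
    "min_samples_leaf": randint(1, 5),  # integers 1..4
    "max_features": ["sqrt", "log2"],   # lists are sampled uniformly
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled parameter settings
    cv=3,              # 3-fold cross-validation per setting
    random_state=0,    # repeatable sampling of candidates
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```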

Print the best hyperparameters found by RandomizedSearchCV

In [11]:
print('Best hyperparameters:', random_search.best_params_)

Best hyperparameters: {'n_estimators': 150, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 15}


Train a new model using the best hyperparameters found by RandomizedSearchCV

In [12]:
best_rfc = RandomForestClassifier(**random_search.best_params_)
best_rfc.fit(X_train, y_train)
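
With the default `refit=True`, the search object already holds a model refitted on the whole training set with the best parameters, available as `best_estimator_`. The sketch below shows this on synthetic data; the small parameter grid is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [10, 20], "max_depth": [3, 5]},
    n_iter=3,
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

# refit=True is the default, so this estimator is already trained on all of X_demo.
tuned_model = search.best_estimator_
predictions = tuned_model.predict(X_demo[:5])
print(predictions)
```

Retraining from `best_params_`, as in the cell above, gives a model with the same settings but a separate fit.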

Use the new model to predict the income of individuals in the test set

In [13]:
y_pred = best_rfc.predict(X_test)

Evaluate the performance of the new model

In [14]:
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred, pos_label=' >50K'))
print('Recall:', recall_score(y_test, y_pred, pos_label=' >50K'))
print('F1-score:', f1_score(y_test, y_pred, pos_label=' >50K'))

Accuracy: 0.8651926915399969
Precision: 0.7842493847415914
Recall: 0.6085295989815405
F1-score: 0.6853046594982078
