# Adult Census Income Classification

The goal is to predict whether a person's income exceeds 50K a year.

This is a binary classification problem: each person is assigned either to the more-than-50K group or to the less-than-or-equal-to-50K group.

The dataset contains 48,842 rows and 15 columns (14 features plus the `income` target). After standard data cleaning and feature engineering, the data was fed to our classifiers for training and testing.

# Importing required libraries

In [1]:
import os
import csv
import json
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

# Ignoring warnings
import warnings 
warnings.filterwarnings(action='ignore')

# Extracting Data From UCI

In [2]:
pip install ucimlrepo

Note: you may need to restart the kernel to use updated packages.


In [3]:
from ucimlrepo import fetch_ucirepo 

In [4]:
# Fetch dataset 
uci_data = fetch_ucirepo(id=20) 

In [5]:
uci_data

{'data': {'ids': None,
  'features':        age         workclass  fnlwgt  education  education-num  \
  0       39         State-gov   77516  Bachelors             13   
  1       50  Self-emp-not-inc   83311  Bachelors             13   
  2       38           Private  215646    HS-grad              9   
  3       53           Private  234721       11th              7   
  4       28           Private  338409  Bachelors             13   
  ...    ...               ...     ...        ...            ...   
  48837   39           Private  215419  Bachelors             13   
  48838   64               NaN  321403    HS-grad              9   
  48839   38           Private  374983  Bachelors             13   
  48840   44           Private   83891  Bachelors             13   
  48841   35      Self-emp-inc  182148  Bachelors             13   
  
             marital-status         occupation    relationship  \
  0           Never-married       Adm-clerical   Not-in-family   
  1      Marri

# Check if the file already exists

In [6]:
data = uci_data['data']['original']
# Write the CSV only if it is not already on disk
if not os.path.exists('census_income.csv'):
    data.to_csv('census_income.csv', index=False)

In [7]:
df = pd.read_csv('census_income.csv')

In [8]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


# Step 1: Preliminary Analysis

## Number of independent and dependent variables:

In [9]:
independent_vars = df.columns[:-1]
dependent_var = df.columns[-1]
print("Number of independent variables:", len(independent_vars))
print("Number of dependent variables:", 1)

Number of independent variables: 14
Number of dependent variables: 1


## Number of records:

In [10]:
print("Number of records:", len(df))

Number of records: 48842


In [11]:
df.shape

(48842, 15)

## Data types of variables:

In [12]:
print("Data types of variables:")
print(df.dtypes)

Data types of variables:
age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
income            object
dtype: object


## Summary Statistics

In [13]:
# Display summary statistics for the whole dataframe
df.describe(include='all')

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
count,48842.0,47879,48842.0,48842,48842.0,48842,47876,48842,48842,48842,48842.0,48842.0,48842.0,48568,48842
unique,,9,,16,,7,15,6,5,2,,,,42,4
top,,Private,,HS-grad,,Married-civ-spouse,Prof-specialty,Husband,White,Male,,,,United-States,<=50K
freq,,33906,,15784,,22379,6172,19716,41762,32650,,,,43832,24720
mean,38.643585,,189664.1,,10.078089,,,,,,1079.067626,87.502314,40.422382,,
std,13.71051,,105604.0,,2.570973,,,,,,7452.019058,403.004552,12.391444,,
min,17.0,,12285.0,,1.0,,,,,,0.0,0.0,1.0,,
25%,28.0,,117550.5,,9.0,,,,,,0.0,0.0,40.0,,
50%,37.0,,178144.5,,10.0,,,,,,0.0,0.0,40.0,,
75%,48.0,,237642.0,,12.0,,,,,,0.0,0.0,45.0,,


In [14]:
# Display summary statistics for only numeric values in dataframe
df.describe()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
count,48842.0,48842.0,48842.0,48842.0,48842.0,48842.0
mean,38.643585,189664.1,10.078089,1079.067626,87.502314,40.422382
std,13.71051,105604.0,2.570973,7452.019058,403.004552,12.391444
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117550.5,9.0,0.0,0.0,40.0
50%,37.0,178144.5,10.0,0.0,0.0,40.0
75%,48.0,237642.0,12.0,0.0,0.0,45.0
max,90.0,1490400.0,16.0,99999.0,4356.0,99.0


# Step 2: Data Cleaning

In [15]:
# Check Null and Dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       47879 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      47876 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  48568 non-null  object
 14  income          48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [16]:
df.isnull().sum()

age                 0
workclass         963
fnlwgt              0
education           0
education-num       0
marital-status      0
occupation        966
relationship        0
race                0
sex                 0
capital-gain        0
capital-loss        0
hours-per-week      0
native-country    274
income              0
dtype: int64

In [17]:
df.nunique()

age                  74
workclass             9
fnlwgt            28523
education            16
education-num        16
marital-status        7
occupation           15
relationship          6
race                  5
sex                   2
capital-gain        123
capital-loss         99
hours-per-week       96
native-country       42
income                4
dtype: int64

In [18]:
df.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income'],
      dtype='object')

## Checking for low variance:

In [19]:
# Checking for columns with a single unique value
low_variance_cols = [col for col in df.columns if df[col].nunique() <= 1]
print("Columns with low variance (possible candidates for removal):", low_variance_cols)

Columns with low variance (possible candidates for removal): []
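
The check above only flags columns with a single unique value. A slightly broader sketch, run on toy data rather than the census data, reports the share of the most frequent value per column so that near-constant columns can also be spotted. The helper name `dominant_share` and the `threshold` of 0.95 are assumptions made for illustration.

```python
import pandas as pd

def dominant_share(frame: pd.DataFrame) -> pd.Series:
    """Return, per column, the fraction of rows taken by the most frequent value."""
    return frame.apply(
        lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
    )

# Toy data: 'b' is constant, 'c' is dominated by one value.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": ["x", "x", "x", "x"],
    "c": [0, 0, 0, 1],
})

shares = dominant_share(toy)
threshold = 0.95  # hypothetical cut-off; tune it to the data
near_constant = shares[shares >= threshold].index.tolist()
print(near_constant)
```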


## Remove Duplicates

In [20]:
# Checking for duplicates
if df.duplicated().any():
    print("Duplicates found:", df.duplicated().sum())
    df = df.drop_duplicates()
    print("Duplicates have been removed.")
else:
    print("No duplicates found.")

Duplicates found: 29
Duplicates have been removed.


## Handling Missing Values

In [21]:
# Checking for missing values in all columns
missing_data = df.isnull().sum()
missing_data = missing_data[missing_data > 0]

if not missing_data.empty:
    print("Missing values found in the following columns:")
    print(missing_data)
    # Impute missing values: median for numeric columns, mode for categorical columns
    for column in df.columns:
        if pd.api.types.is_numeric_dtype(df[column]):
            df[column] = df[column].fillna(df[column].median())
        else:
            df[column] = df[column].fillna(df[column].mode()[0])
    print("Missing values have been handled.")
else:
    print("No missing values found.")

Missing values found in the following columns:
workclass         963
occupation        966
native-country    274
dtype: int64
Missing values have been handled.


## Unique counts of categories in columns

In [22]:
df['workclass'].value_counts()

workclass
Private             34842
Self-emp-not-inc     3861
Local-gov            3136
State-gov            1981
?                    1836
Self-emp-inc         1694
Federal-gov          1432
Without-pay            21
Never-worked           10
Name: count, dtype: int64

In [23]:
df['occupation'].value_counts()

occupation
Prof-specialty       7133
Craft-repair         6107
Exec-managerial      6084
Adm-clerical         5608
Sales                5504
Other-service        4919
Machine-op-inspct    3019
Transport-moving     2355
Handlers-cleaners    2071
?                    1843
Farming-fishing      1487
Tech-support         1445
Protective-serv       983
Priv-house-serv       240
Armed-Forces           15
Name: count, dtype: int64

In [24]:
df['native-country'].value_counts()

native-country
United-States                 44084
Mexico                          947
?                               582
Philippines                     295
Germany                         206
Puerto-Rico                     184
Canada                          182
El-Salvador                     155
India                           151
Cuba                            138
England                         127
China                           122
South                           115
Jamaica                         106
Italy                           105
Dominican-Republic              103
Japan                            92
Poland                           87
Guatemala                        86
Vietnam                          86
Columbia                         85
Haiti                            75
Portugal                         67
Taiwan                           65
Iran                             59
Greece                           49
Nicaragua                        49
Peru         

In [25]:
df['marital-status'].value_counts()

marital-status
Married-civ-spouse       22372
Never-married            16098
Divorced                  6630
Separated                 1530
Widowed                   1518
Married-spouse-absent      628
Married-AF-spouse           37
Name: count, dtype: int64

In [26]:
df['sex'].value_counts()

sex
Male      32631
Female    16182
Name: count, dtype: int64

In [27]:
df['race'].value_counts()

race
White                 41736
Black                  4683
Asian-Pac-Islander     1518
Amer-Indian-Eskimo      470
Other                   406
Name: count, dtype: int64

In [28]:
df['education'].value_counts()

education
HS-grad         15777
Some-college    10869
Bachelors        8020
Masters          2656
Assoc-voc        2060
11th             1812
Assoc-acdm       1601
10th             1389
7th-8th           954
Prof-school       834
9th               756
12th              656
Doctorate         594
5th-6th           508
1st-4th           245
Preschool          82
Name: count, dtype: int64

In [29]:
df['income'].value_counts()

income
<=50K     24698
<=50K.    12430
>50K       7839
>50K.      3846
Name: count, dtype: int64

## Treating `?` as a missing value

In [30]:
# Replacing '?' with NaN for accurate missing data handling
columns_with_question = ['workclass', 'occupation', 'native-country']
for column in columns_with_question:
    df[column] = df[column].replace('?', np.nan)
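
The same substitution can also be made when a CSV file is read. A minimal sketch, assuming unknown values are written as a bare `?`; the inline CSV text is toy data, not the census file.

```python
import io
import pandas as pd

raw = io.StringIO(
    "age,workclass,occupation\n"
    "39,State-gov,Adm-clerical\n"
    "54,?,?\n"
    "28,Private,?\n"
)

# na_values lists extra strings that pandas should parse as NaN.
toy = pd.read_csv(raw, na_values=["?"])
missing = toy.isnull().sum()
print(missing)
```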

## Recalculate missing values after replacement

In [31]:
missing_data_updated = df.isnull().sum()
print("Updated missing values per column after '?' replacement:")
print(missing_data_updated)

Updated missing values per column after '?' replacement:
age                  0
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     582
income               0
dtype: int64


In [32]:
for column in columns_with_question:
    df[column] = df[column].fillna(df[column].mode()[0])
print("Missing values have been filled with the most frequent value.")

Missing values have been filled with the most frequent value.


## Correcting Income Feature

In [33]:
# income
df.income = df.income.replace('<=50K.', '<=50K')
df.income = df.income.replace('>50K.', '>50K')

In [34]:
df['income'].value_counts()

income
<=50K    37128
>50K     11685
Name: count, dtype: int64
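
The merged counts agree with the four original labels: 24698 + 12430 = 37128 and 7839 + 3846 = 11685. An equivalent way to normalize the labels is to strip the trailing period, sketched here on a toy Series rather than the census data.

```python
import pandas as pd

income = pd.Series(["<=50K", "<=50K.", ">50K", ">50K."])

# Remove the trailing '.' so both spellings collapse to one label.
cleaned = income.str.rstrip(".")
label_counts = cleaned.value_counts().to_dict()
print(label_counts)
```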

## Copy of Original Data

In [35]:
# .copy() creates an independent DataFrame; plain assignment would only alias df
df_copy = df.copy()
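
Plain assignment binds a second name to the same DataFrame, whereas `DataFrame.copy()` creates an independent object. A toy sketch of the difference (the frame and variable names are illustrative only):

```python
import pandas as pd

original = pd.DataFrame({"age": [39, 50]})

alias = original               # same object, no data is copied
independent = original.copy()  # separate object with its own data

alias.loc[0, "age"] = 0         # also changes `original`
independent.loc[1, "age"] = 99  # leaves `original` untouched

print(original["age"].tolist())
print(independent["age"].tolist())
```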

In [36]:
print(df_copy.dtypes)

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
income            object
dtype: object


In [37]:
df_copy.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [38]:
from sklearn.preprocessing import MinMaxScaler

In [39]:
# Create an instance of MinMaxScaler
scaler = MinMaxScaler()

In [40]:
# Select only the numeric columns
numeric_features = df_copy.select_dtypes(include=['int64', 'float64']).columns

In [41]:
# Apply the MinMaxScaler only to the numeric feature columns
df_copy[numeric_features] = scaler.fit_transform(df_copy[numeric_features])

In [42]:
df_copy.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,0.30137,State-gov,0.044131,Bachelors,0.8,Never-married,Adm-clerical,Not-in-family,White,Male,0.02174,0.0,0.397959,United-States,<=50K
1,0.452055,Self-emp-not-inc,0.048052,Bachelors,0.8,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,0.122449,United-States,<=50K
2,0.287671,Private,0.137581,HS-grad,0.533333,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,0.397959,United-States,<=50K
3,0.493151,Private,0.150486,11th,0.4,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,0.397959,United-States,<=50K
4,0.150685,Private,0.220635,Bachelors,0.8,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,0.397959,Cuba,<=50K
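
`MinMaxScaler` rescales each column with `(x - min) / (max - min)`. As a check, the sketch below recomputes the first row by hand, using the minimum and maximum values reported by `df.describe()` earlier. The helper name `min_max` is an assumption for illustration.

```python
def min_max(x, lo, hi):
    """Rescale x from the range [lo, hi] to [0, 1]."""
    return (x - lo) / (hi - lo)

# First record: age 39, education-num 13, hours-per-week 40, capital-gain 2174.
age = min_max(39, 17, 90)
education_num = min_max(13, 1, 16)
hours_per_week = min_max(40, 1, 99)
capital_gain = min_max(2174, 0, 99999)

print(round(age, 6), round(education_num, 6),
      round(hours_per_week, 6), round(capital_gain, 6))
```

The rounded results match the scaled values in the first row of the table above.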


## Labeling the target variable

In [43]:
# income
df_copy.income = df_copy.income.replace('<=50K', 0)
df_copy.income = df_copy.income.replace('>50K', 1)

## Identify the independent variables and the dependent variable

In [44]:
# Column names only; the feature matrix X and target vector y are built after encoding
feature_columns = df_copy.columns[:-1]  # every column except 'income', which is last
target_column = df_copy.columns[-1]

## Label Encoding and Feature Scaling

In [45]:
from sklearn.preprocessing import StandardScaler, LabelEncoder

In [46]:
df1 = df_copy.copy()
# Encode every column as integer codes (this also re-codes the numeric columns)
df1 = df1.apply(LabelEncoder().fit_transform)
df1.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,22,6,3461,9,12,4,0,1,4,1,27,0,39,38,0
1,33,5,3788,9,12,2,3,0,4,1,0,0,12,38,0
2,21,3,18342,11,8,0,5,1,4,1,0,0,39,38,0
3,36,3,19995,1,6,2,5,0,2,1,0,0,39,38,0
4,11,3,25405,9,12,2,9,5,2,0,0,0,39,4,0
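
`LabelEncoder` sorts the distinct values of a column and assigns each value its position in that sorted order. The sketch below reproduces the mapping with `numpy.unique` for the `workclass` categories listed earlier (without `?`, which was imputed). It is an illustration of the encoding rule, not the notebook's own code.

```python
import numpy as np

workclass_values = np.array([
    "State-gov", "Self-emp-not-inc", "Private", "Federal-gov",
    "Local-gov", "Self-emp-inc", "Without-pay", "Never-worked",
])

# np.unique returns the sorted distinct values and, for each input element,
# the index of its value in that sorted array.
classes, codes = np.unique(workclass_values, return_inverse=True)
mapping = {str(c): i for i, c in enumerate(classes)}
print(mapping)
```

The codes for `State-gov`, `Self-emp-not-inc` and `Private` come out as 6, 5 and 3, which matches the `workclass` column in the first three rows of the encoded table.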


In [47]:
ss= StandardScaler().fit(df1.drop('income', axis=1))

In [48]:
X= ss.transform(df1.drop('income', axis=1))
y= df_copy['income']

# Splitting data into training and testing sets

In [49]:
from sklearn.model_selection import train_test_split

In [50]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Print the number of records in the training and testing data

In [53]:
print(f"Training records: {len(X_train)}, Testing records: {len(X_test)}")

Training records: 34169, Testing records: 14644
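
The split sizes follow from the record count after duplicate removal. With a fractional `test_size`, scikit-learn rounds the number of test samples up and assigns the remainder to the training set, which this short check reproduces:

```python
import math

n_records = 48842 - 29  # rows left after dropping the 29 duplicates
test_size = 0.3

n_test = math.ceil(test_size * n_records)  # test share is rounded up
n_train = n_records - n_test

print(n_train, n_test)
```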


# Save training data to CSV

In [None]:
# # Convert X_train and y_train into pandas DataFrame if they are not already
# X_train_df = pd.DataFrame(X_train, columns=[list of your feature names])  # Replace [list of your feature names] with actual column names
# y_train_df = pd.DataFrame(y_train, columns=['income'])  # Assuming 'income' is the target column name

In [54]:
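# y_train keeps its shuffled index, so attach it by position rather than by label
train_data = pd.DataFrame(X_train, columns=df1.columns.drop('income'))
train_data['income'] = y_train.to_numpy()
train_data.to_csv('cencus_training_data.csv', index=False)
print("Training data has been saved to 'cencus_training_data.csv'.")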

Training data has been saved to 'cencus_training_data.csv'.


# Save testing data to CSV

In [55]:
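# y_test keeps its shuffled index, so attach it by position rather than by label
test_data = pd.DataFrame(X_test, columns=df1.columns.drop('income'))
test_data['income'] = y_test.to_numpy()
test_data.to_csv('cencus_testing_data.csv', index=False)
print("Testing data has been saved to 'cencus_testing_data.csv'.")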

Testing data has been saved to 'cencus_testing_data.csv'.
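
A note on row alignment when saving the splits: here `X_train` and `X_test` are NumPy arrays, while `y_train` and `y_test` are pandas Series that keep the shuffled index of the original data. `pd.concat(..., axis=1)` aligns rows on index labels, so the pieces only line up if their indexes match. A toy sketch of the effect, using made-up values:

```python
import numpy as np
import pandas as pd

X_part = np.array([[0.1], [0.2], [0.3]])        # array: no index information
y_part = pd.Series([1, 0, 1], index=[7, 2, 5])  # labels left over from a shuffled split

# Column-wise concat aligns on index labels (0, 1, 2 versus 7, 2, 5).
misaligned = pd.concat([pd.DataFrame(X_part), pd.DataFrame(y_part)], axis=1)

# Resetting the index first pairs the rows by position instead.
aligned = pd.concat(
    [pd.DataFrame(X_part), y_part.reset_index(drop=True).rename("income")],
    axis=1,
)
print(misaligned.shape, aligned.shape)
```

In the misaligned frame the row count grows and missing values appear, because most labels exist on one side only.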


# Logistic Regression

In [56]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [57]:
# Initializing the Logistic Regression model
log_reg = LogisticRegression(max_iter=1000)

In [58]:
# Fitting the model
model=log_reg.fit(X_train, y_train)

In [59]:
# Predict on the test data
model_predictions = model.predict(X_test)

In [62]:
print("Accuracy on training data: {:,.3f}".format(log_reg.score(X_train, y_train)))
print("Accuracy on test data: {:,.3f}".format(log_reg.score(X_test, y_test)))

Accuracy on training data: 0.823
Accuracy on test data: 0.824


In [63]:
from sklearn.metrics import accuracy_score, classification_report

In [65]:
# Evaluate the model
model_accuracy = accuracy_score(y_test, model_predictions)
print("Logistic Regression Accuracy:", model_accuracy)
print("Classification Report:\n", classification_report(y_test, model_predictions))

Logistic Regression Accuracy: 0.8242283529090413
Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.95      0.89     11078
           1       0.73      0.45      0.55      3566

    accuracy                           0.82     14644
   macro avg       0.78      0.70      0.72     14644
weighted avg       0.81      0.82      0.81     14644



##### The model accuracy is approximately 82.42%, a reasonable starting point given that always predicting class 0 would already score about 75.6% (11078 of 14644 test records).

### Class '0' : income <=50K

Precision: 0.84 - This indicates that 84% of the instances predicted as class 0 were actually class 0.

Recall: 0.95 - This indicates that the model captured 95% of actual class 0 instances.

F1-score: 0.89 - The F1 score is the harmonic mean of precision and recall; a value this high indicates good performance for class 0.

Support: 11078 instances

### Class '1' : income >50K

Precision: 0.73 - This indicates that 73% of the instances predicted as class 1 were actually class 1.

Recall: 0.45 - A relatively low value: the model captured only 45% of actual class 1 instances.

F1-score: 0.55 - This score is moderately low, pulled down by the low recall for class 1.

Support: 3566 instances
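
The F1 scores can be recomputed from the precision and recall values as `2 * P * R / (P + R)`. The sketch below uses the rounded figures from the classification report, so it is only an approximate check. The helper name `f1_score_from` is an assumption for illustration.

```python
def f1_score_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded precision and recall taken from the classification report.
f1_class0 = f1_score_from(0.84, 0.95)
f1_class1 = f1_score_from(0.73, 0.45)

print(round(f1_class0, 3), round(f1_class1, 3))
```

Class 0 comes out close to 0.89. Class 1 comes out near 0.56 rather than the reported 0.55, because the report computes F1 from unrounded precision and recall.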

##### The support values indicate a significant class imbalance: class 0 has roughly three times as many test instances as class 1 (11078 vs. 3566).

##### Either oversampling class 1 or undersampling class 0 could be considered to improve recall on class 1.
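
A minimal sketch of random oversampling with NumPy, on toy data rather than the census features. The array names and the seed are assumptions for illustration, and this is one possible approach, not the method used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy imbalanced data: 8 rows of class 0 and 2 rows of class 1.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

minority_idx = np.flatnonzero(y_toy == 1)
majority_idx = np.flatnonzero(y_toy == 0)

# Draw minority rows with replacement until both classes have the same size.
extra_idx = rng.choice(
    minority_idx, size=len(majority_idx) - len(minority_idx), replace=True
)
balanced_idx = np.concatenate([majority_idx, minority_idx, extra_idx])

X_balanced, y_balanced = X_toy[balanced_idx], y_toy[balanced_idx]
print(np.bincount(y_balanced))
```

Resampling of this kind belongs on the training split only, so that the test set keeps the original class distribution. Passing `class_weight='balanced'` to the classifier is an alternative that avoids resampling altogether.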

# Random Forest

In [66]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()

model1 = rfc.fit(X_train, y_train)
prediction1 = model1.predict(X_test)

print("Acc on training data: {:,.3f}".format(rfc.score(X_train, y_train)))
print("Acc on test data: {:,.3f}".format(rfc.score(X_test, y_test)))

Acc on training data: 1.000
Acc on test data: 0.855


In [67]:
# Evaluate Random Forest
print("Random Forest Accuracy:", accuracy_score(y_test, prediction1))
print("Random Forest Classification Report:\n", classification_report(y_test, prediction1))

Random Forest Accuracy: 0.8548210871346626
Random Forest Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.93      0.91     11078
           1       0.75      0.61      0.67      3566

    accuracy                           0.85     14644
   macro avg       0.81      0.77      0.79     14644
weighted avg       0.85      0.85      0.85     14644



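##### The overall accuracy of the Random Forest model is about 85.48%, an improvement over the Logistic Regression model (82.42%). The training accuracy of 1.000 against a test accuracy of 0.855 suggests that the default forest overfits the training data.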

### Class '0' : income <=50K

Precision: 0.88 - A high value indicating that 88% of the predictions for class 0 are correct.

Recall: 0.93 - A high value showing that the model identifies 93% of all actual class 0 instances.

F1-Score: 0.91 - A strong score showing the balance between precision and recall.

Support: 11078 instances

### Class '1' : income >50K

Precision: 0.75 - Suggests that 75% of the model's class 1 predictions are correct.

Recall: 0.61 - Indicates the model identifies 61% of all actual class 1 instances, which is moderately good but could be improved.

F1-Score: 0.67 - Reflects a moderate balance between precision and recall for class 1.

Support: 3566 instances, reflecting the same class imbalance observed with the logistic regression model.