# Census Income Project 

# Problem statement:
In this project, you first preprocess the data and then develop an understanding of its features through exploratory analysis and visualizations. Then, with sufficient knowledge of the attributes, you perform a classification task: predicting whether an individual makes more than 50K a year, using different machine learning algorithms.

In [1]:
#import packages
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

In [2]:
from time import time

In [3]:
#to find the average inference time of a model over several runs
def timer(function, inputs, num_times=10):
    times_list = []
    for _ in range(num_times):
        start_time = time()
        function(inputs)  #the prediction is discarded; only the elapsed time matters
        times_list.append(time() - start_time)
    return np.mean(times_list)
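
As a hedged variation on the helper above, `time.perf_counter` is a monotonic, high-resolution clock intended for measuring short durations, so it may give steadier numbers than `time.time` for sub-millisecond calls. The sketch below is an alternative under stated assumptions, not the function used for the timings reported in this notebook; `timer_perf` and `warmup` are hypothetical names.

```python
from statistics import mean
from time import perf_counter


def timer_perf(function, inputs, num_times=10, warmup=1):
    """Return the average wall-clock seconds per call of function(inputs)."""
    # Warm-up calls are executed but not timed (hypothetical extra step)
    for _ in range(warmup):
        function(inputs)
    durations = []
    for _ in range(num_times):
        start = perf_counter()
        function(inputs)
        durations.append(perf_counter() - start)
    return mean(durations)


# Usage with a stand-in for model.predict
avg = timer_perf(sum, list(range(1000)), num_times=5)
```

Any callable that takes a single argument, such as `model.predict`, can be passed in place of `sum`.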

In [4]:
#load dataset 
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per- week', 'native-country', 'income']
#sep=',' with skipinitialspace=True parses the ", " separator without falling back to the slower python engine
CI = pd.read_csv('census-income.csv', names=columns, header=0, sep=',', skipinitialspace=True)
CI_original = CI.copy()
CI.head()


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per- week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


# Tasks to be done:
1. Data Preprocessing:
a) Replace all the missing values with NA.
b) Remove all the rows that contain NA values. 

In [5]:
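#check missing values
#Note: only the last expression of a cell is displayed, so the isna() counts are not shown below.
#Missing entries in this file are encoded as '?' (see the workclass counts in 2g), which isna() does not detect.
CI.isna().sum()
CI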

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per- week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [6]:
#1a.
#CI.replace(np.nan, 'NA')
#or
#fillna returns a new DataFrame; CI itself stays unchanged because the result is not assigned
CI.fillna('NA')

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per- week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [7]:
#1b.
#dropna also returns a new DataFrame; the result is displayed but not stored back into CI
CI.dropna()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per- week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K
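
Because the later outputs still contain `?` entries (for example in `workclass`), the two preprocessing steps can be sketched on a toy frame as follows. This is a minimal illustration, assuming `?` is the only placeholder for missing data; the names `toy`, `toy_na` and `toy_clean` are hypothetical, and the rest of the notebook does not use this cleaned data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for CI; '?' marks a missing entry (assumption)
toy = pd.DataFrame({
    "age": [39, 61, 28],
    "workclass": ["State-gov", "?", "Private"],
    "occupation": ["Adm-clerical", "?", "Prof-specialty"],
})

# 1a: turn the '?' placeholder into a real missing value (NaN)
toy_na = toy.replace("?", np.nan)

# Count missing values per column now that they are detectable
missing_per_column = toy_na.isna().sum()

# 1b: drop every row with at least one missing value and keep the result
toy_clean = toy_na.dropna().reset_index(drop=True)
```

Assigning the result (rather than only displaying it) is what makes the cleaning carry over to later cells.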


# 2. Data Manipulation:  
a) Extract the “education” column and store it in “census_ed” .
b) Extract all the columns from “age” to “relationship” and store it in “census_seq”.
c) Extract the column number “5”, “8”, “11” and store it in “census_col”.
d) Extract all the male employees who work in state-gov and store it in “male_gov”.
e) Extract all the 39 year olds who either have a bachelor's degree or who are native of the United States and store the result in “census_us”.
f) Extract 200 random rows from the “census” data frame and store it in “census_200”.
g) Get the count of different levels of the “workclass” column.
h) Calculate the mean of the “capital.gain” column grouped according to “workclass”.
i) Create a separate dataframe with the details of males and females from the census data that has income more than 50,000. 
j) Calculate the percentage of people from the United States who are private employees and earn less than 50,000 annually. 
k) Calculate the percentage of married people in the census data.
l) Calculate the percentage of high school graduates earning more than 50,000 annually. 

In [8]:
#2a.
census_ed = CI['education']
census_ed

0         Bachelors
1         Bachelors
2           HS-grad
3              11th
4         Bachelors
            ...    
32556    Assoc-acdm
32557       HS-grad
32558       HS-grad
32559       HS-grad
32560       HS-grad
Name: education, Length: 32561, dtype: object

In [9]:
#2b.
census_seq = CI.loc[:,'age':'relationship']
census_seq

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife
...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child


In [10]:
#2c.
census_col = CI.iloc[:,[4,7,10]]
census_col

Unnamed: 0,education-num,relationship,capital-gain
0,13,Not-in-family,2174
1,13,Husband,0
2,9,Not-in-family,0
3,7,Husband,0
4,13,Wife,0
...,...,...,...
32556,12,Wife,0
32557,9,Husband,0
32558,9,Unmarried,0
32559,9,Own-child,0
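
The task numbers the columns from 1, while `iloc` counts from 0, which is why 5, 8 and 11 become 4, 7 and 10 above. A small sketch of that conversion on a toy frame (the column names `c1` to `c12` are placeholders, not columns of the census data):

```python
import pandas as pd

# One-row toy frame whose column names record their 1-based position
toy = pd.DataFrame(
    [list(range(1, 13))],
    columns=[f"c{i}" for i in range(1, 13)],
)

one_based = [5, 8, 11]                    # positions as stated in the task
zero_based = [n - 1 for n in one_based]   # positions as iloc expects them
subset = toy.iloc[:, zero_based]
```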


In [11]:
#2d.
male_gov = CI[(CI['sex'] == 'Male') & (CI['workclass'] == 'State-gov')]
male_gov

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per- week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
11,30,State-gov,141297,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,40,India,>50K
34,22,State-gov,311512,Some-college,10,Married-civ-spouse,Other-service,Husband,Black,Male,0,0,15,United-States,<=50K
48,41,State-gov,101603,Assoc-voc,11,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,<=50K
123,29,State-gov,267989,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,>50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32163,36,State-gov,135874,Bachelors,13,Married-civ-spouse,Sales,Husband,White,Male,0,0,40,United-States,<=50K
32241,45,State-gov,231013,Bachelors,13,Divorced,Protective-serv,Not-in-family,White,Male,0,0,40,United-States,<=50K
32321,54,State-gov,138852,HS-grad,9,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,40,United-States,<=50K
32324,42,State-gov,138162,Some-college,10,Divorced,Adm-clerical,Own-child,White,Male,0,0,40,United-States,<=50K


In [12]:
#2e.
census_us = CI[(CI['age']==39) & ((CI['education']=='Bachelors') | (CI['native-country']=='United-States'))]
census_us

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per- week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
28,39,Private,367260,HS-grad,9,Divorced,Exec-managerial,Not-in-family,White,Male,0,0,80,United-States,<=50K
129,39,Private,365739,Some-college,10,Divorced,Craft-repair,Not-in-family,White,Male,0,0,40,United-States,<=50K
166,39,Federal-gov,235485,Assoc-acdm,12,Never-married,Exec-managerial,Not-in-family,White,Male,0,0,42,United-States,<=50K
320,39,Self-emp-not-inc,174308,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32146,39,Private,117381,Some-college,10,Divorced,Transport-moving,Not-in-family,White,Male,0,0,65,United-States,<=50K
32260,39,Federal-gov,232036,Some-college,10,Married-civ-spouse,Adm-clerical,Husband,White,Male,0,0,40,United-States,>50K
32428,39,Federal-gov,110622,Bachelors,13,Married-civ-spouse,Adm-clerical,Wife,Asian-Pac-Islander,Female,0,0,40,Philippines,<=50K
32468,39,Self-emp-not-inc,193689,HS-grad,9,Never-married,Exec-managerial,Not-in-family,White,Male,0,0,65,United-States,<=50K


In [13]:
#2f.
census_200 = CI.sample(200)
census_200

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per- week,native-country,income
17137,43,Self-emp-inc,151089,HS-grad,9,Married-civ-spouse,Sales,Husband,White,Male,0,0,50,United-States,<=50K
7741,19,Private,323605,7th-8th,4,Never-married,Other-service,Not-in-family,White,Male,0,0,60,United-States,>50K
7465,54,Private,174102,Some-college,10,Never-married,Exec-managerial,Not-in-family,White,Male,0,0,40,United-States,>50K
7946,29,Private,110134,Some-college,10,Divorced,Machine-op-inspct,Not-in-family,White,Male,0,0,40,United-States,<=50K
11419,39,State-gov,122353,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,45,United-States,>50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14957,23,Private,141323,Some-college,10,Never-married,Other-service,Own-child,White,Male,0,0,40,United-States,<=50K
10035,61,?,135285,HS-grad,9,Married-civ-spouse,?,Husband,White,Male,0,2603,32,United-States,<=50K
12888,33,Self-emp-not-inc,203784,Assoc-voc,11,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,62,United-States,<=50K
4736,25,Private,163620,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,7298,0,84,United-States,>50K
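
`sample(200)` draws a different set of rows on every run, so the table above will not be reproduced exactly. If repeatable output is wanted, passing `random_state` is one option; the sketch below uses a toy frame, and the seed value 1 is an arbitrary choice.

```python
import pandas as pd

# Toy frame standing in for the census data
toy = pd.DataFrame({"row_id": range(1000)})

# The same seed yields the same 200 rows each time
sample_a = toy.sample(n=200, random_state=1)
sample_b = toy.sample(n=200, random_state=1)
```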


In [14]:
#2g.
CI['workclass'].value_counts()

Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: workclass, dtype: int64

In [15]:
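#2h.
#Note: grouping by both columns averages the other numeric columns per (workclass, capital-gain) pair.
#The mean of capital-gain per workclass is CI.groupby('workclass')['capital-gain'].mean()
CI.groupby(['workclass','capital-gain']).mean()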


Unnamed: 0_level_0,Unnamed: 1_level_0,age,fnlwgt,education-num,capital-loss,hours-per- week
workclass,capital-gain,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
?,0,39.991827,188209.934034,9.226503,65.123176,31.828371
?,401,90.000000,39824.000000,9.000000,0.000000,4.000000
?,594,25.400000,196454.800000,9.600000,0.000000,22.000000
?,991,81.000000,273222.000000,13.000000,0.000000,8.500000
?,1055,19.250000,181701.750000,8.500000,0.000000,33.000000
...,...,...,...,...,...,...
State-gov,25236,49.000000,283653.666667,15.666667,0.000000,48.333333
State-gov,99999,49.000000,423222.000000,14.000000,0.000000,80.000000
Without-pay,0,48.500000,163704.083333,9.083333,0.000000,33.166667
Without-pay,2414,65.000000,172949.000000,9.000000,0.000000,20.000000
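
The task asks for the mean of capital gain within each `workclass`, which means grouping by `workclass` only and selecting the `capital-gain` column before aggregating. A minimal sketch on made-up numbers (the values are illustrative, not taken from the census file):

```python
import pandas as pd

# Toy data: two workclasses with two rows each
toy = pd.DataFrame({
    "workclass": ["Private", "Private", "State-gov", "State-gov"],
    "capital-gain": [0, 1000, 2174, 0],
})

# One mean per workclass, indexed by the workclass label
mean_gain = toy.groupby("workclass")["capital-gain"].mean()
```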


In [16]:
#2i.
CI2 = CI[CI['income']=='>50K']
CI2

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per- week,native-country,income
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K
10,37,Private,280464,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,Black,Male,0,0,80,United-States,>50K
11,30,State-gov,141297,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,40,India,>50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32539,71,?,287372,Doctorate,16,Married-civ-spouse,?,Husband,White,Male,0,0,10,United-States,>50K
32545,39,Local-gov,111499,Assoc-acdm,12,Married-civ-spouse,Adm-clerical,Wife,White,Female,0,0,20,United-States,>50K
32554,53,Private,321865,Masters,14,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,United-States,>50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K


In [17]:
#2j.
Prvtemp_earnless = len(CI[(CI['workclass'] == 'Private') & (CI['native-country'] == 'United-States') & (CI['income'] == '<=50K')])/ len(CI) * 100 
Prvtemp_earnless

47.891649519363654

In [18]:
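#2k.
#counts everyone whose marital-status is not 'Never-married', so e.g. Divorced and Widowed rows are included
marr_census = len(CI[CI['marital-status'] != 'Never-married']) / len(CI['marital-status']) * 100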
marr_census

67.19081109302539

In [19]:
#2l.
hs_grad_income = len(CI[(CI['education'] == 'HS-grad') & (CI['income'] == '>50K') ]) / len(CI[(CI['education'] == 'HS-grad')]) * 100
hs_grad_income

15.950861822683555
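
The percentages in 2j to 2l can also be written as the mean of a boolean mask, which makes the denominator explicit. The sketch below uses a toy frame; treating only statuses that begin with `Married` as married is an assumption that differs from the `!= 'Never-married'` rule used in 2k, so it would give a different figure on the real data.

```python
import pandas as pd

toy = pd.DataFrame({
    "education": ["HS-grad", "HS-grad", "Bachelors", "HS-grad"],
    "marital-status": ["Married-civ-spouse", "Divorced",
                       "Never-married", "Married-AF-spouse"],
    "income": [">50K", "<=50K", ">50K", "<=50K"],
})

# Share of HS graduates earning >50K: restrict to HS-grad rows first,
# then average the boolean mask (True counts as 1, False as 0)
hs = toy[toy["education"] == "HS-grad"]
hs_pct = (hs["income"] == ">50K").mean() * 100

# Narrower definition of "married" (assumption): status starts with "Married"
married_pct = toy["marital-status"].str.startswith("Married").mean() * 100
```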

# 3. Linear Regression:
a) Build a simple linear regression model as follows:

i) Divide the dataset into training and test sets in 70:30 ratio.
ii) Build a linear model on the training set where the dependent variable is “hours.per.week” and the independent variable is “education.num”.
iii) Predict the values on the test set and find the error in prediction. 
iv) Find the root-mean-square error (RMSE).

In [20]:
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression 
from sklearn.metrics import mean_squared_error

In [21]:
#create a dataframe to store the dependent and independent variables 
data = CI[['education-num', 'hours-per- week']]
data

Unnamed: 0,education-num,hours-per- week
0,13,40
1,13,13
2,9,40
3,7,40
4,13,40
...,...,...
32556,12,38
32557,9,40
32558,9,40
32559,9,20


In [22]:
#check data types for variables "education.num" and "hours.per.week"
data.dtypes

education-num      int64
hours-per- week    int64
dtype: object

In [23]:
#x = independent variable and y = dependent variable
x = CI['education-num']
y = CI['hours-per- week']

In [24]:
#Divide the dataset into training and test sets 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

In [25]:
#Initialize the Linear Regression Model
model = LinearRegression()

In [26]:
#Fit the model on the training data
model.fit(
    np.expand_dims(np.array(x_train) , axis=1),
    y_train
)

In [27]:
#Predict the values on the test set
y_pred = model.predict(np.expand_dims(np.array(x_test) , axis=1))

In [28]:
time_input = np.expand_dims(np.array(x_test) , axis=1)

In [29]:
avg_time = timer(model.predict, time_input[:1], num_times=10)
print(f"Average time for LinearRegression model for 1 input sample is {avg_time}")

Average time for LinearRegression model for 1 input sample is 0.00017092227935791015


In [30]:
#Find the root-mean-square error (RMSE) on the test set
RMSE = np.sqrt(mean_squared_error(y_test, y_pred)) 
print('Root Mean Squared Error:', RMSE)

Root Mean Squared Error: 12.13064789640857
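
RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the units of the target (hours per week here). To judge whether a value is good, it helps to compare it with a constant baseline that always predicts the training mean. The sketch below shows both on synthetic data; the data, the `rmse` helper and the use of `np.polyfit` in place of `LinearRegression` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-ins for education-num and hours per week
x = rng.integers(1, 17, size=200).astype(float)
y = 30.0 + 0.8 * x + rng.normal(0.0, 10.0, size=200)

# 70:30 split by position (the data is already random)
x_train, x_test = x[:140], x[140:]
y_train, y_test = y[:140], y[140:]

# Least-squares line fitted on the training part only
slope, intercept = np.polyfit(x_train, y_train, deg=1)
y_pred = slope * x_test + intercept


def rmse(actual, predicted):
    """Root of the mean squared difference between two arrays."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


model_rmse = rmse(y_test, y_pred)
# Baseline: always predict the mean of the training target
baseline_rmse = rmse(y_test, np.full_like(y_test, y_train.mean()))
```

If the model's RMSE is close to the baseline's, the single feature explains little of the variation in the target.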


# 4. Logistic Regression:
a) Build a simple logistic regression model as follows:
i) Divide the dataset into training and test sets in 65:35 ratio.
ii) Build a logistic regression model where the dependent variable is “X” (yearly income) and the independent variable is “occupation”.
iii) Predict the values on the test set.
iv) Build a confusion matrix and find the accuracy.

b) Build a multiple logistic regression model as follows:
i) Divide the dataset into training and test sets in 80:20 ratio.
ii) Build a logistic regression model where the dependent variable is “X” (yearly income) and independent variables are “age”, “workclass”, and “education”.
iii) Predict the values on the test set.
iv) Build a confusion matrix and find the accuracy.


In [31]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

In [32]:
#4a.
# Define dependent and independent variables
x_simple = CI[['occupation']] 
y_simple = CI['income']

In [33]:
x_simple.value_counts()

occupation       
Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
?                    1843
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       149
Armed-Forces            9
dtype: int64

In [34]:
y_simple.value_counts()

<=50K    24720
>50K      7841
Name: income, dtype: int64

In [35]:
X_simple = pd.get_dummies(x_simple)
X_simple

Unnamed: 0,occupation_?,occupation_Adm-clerical,occupation_Armed-Forces,occupation_Craft-repair,occupation_Exec-managerial,occupation_Farming-fishing,occupation_Handlers-cleaners,occupation_Machine-op-inspct,occupation_Other-service,occupation_Priv-house-serv,occupation_Prof-specialty,occupation_Protective-serv,occupation_Sales,occupation_Tech-support,occupation_Transport-moving
0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
32557,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
32558,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
32559,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0


In [36]:
#Divide the dataset into training and test sets 
X_simple_train, X_simple_test, y_simple_train, y_simple_test = train_test_split(X_simple, y_simple, test_size=0.35, random_state=1)

In [37]:
#Initialize the Logistic Regression Model
simple_model = LogisticRegression()

In [38]:
#Fit the model on the training data
simple_model.fit(X_simple_train, y_simple_train)

In [39]:
#Predict the values on the test set
y_simple_pred = simple_model.predict(X_simple_test)

In [40]:
avg_time = timer(simple_model.predict, X_simple_test[:1], num_times=10)
print(f"Average time for LogisticRegression model for 1 input sample is {avg_time}")

Average time for LogisticRegression model for 1 input sample is 0.0014028310775756835


In [41]:
X_simple_test.shape

(11397, 15)

In [42]:
#Build a confusion matrix and find the accuracy
conf_matrix_simple = confusion_matrix(y_simple_test, y_simple_pred) 
accuracy_simple = accuracy_score(y_simple_test, y_simple_pred)
print('Confusion Matrix for Simple Logistic Regression:\n', conf_matrix_simple) 
print('Accuracy for Simple Logistic Regression:', accuracy_simple)

Confusion Matrix for Simple Logistic Regression:
 [[8800    0]
 [2597    0]]
Accuracy for Simple Logistic Regression: 0.7721330174607353
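
The second column of the confusion matrix is all zeros, meaning the model predicts `<=50K` for every test row. Its accuracy therefore equals the share of the majority class in the test set. The sketch below recomputes this from the matrix printed above, assuming scikit-learn's default layout (rows are actual classes, columns are predicted classes, labels in sorted order).

```python
import numpy as np

# Confusion matrix reported above; label order: ['<=50K', '>50K']
cm = np.array([[8800, 0],
               [2597, 0]])

# Accuracy: correct predictions (diagonal) over all predictions
accuracy = np.trace(cm) / cm.sum()

# Share of the largest actual class in the test set
majority_share = cm.sum(axis=1).max() / cm.sum()

# Recall per actual class: diagonal over row totals
recall_per_class = np.diag(cm) / cm.sum(axis=1)
```

A recall of 0 for `>50K` shows that accuracy alone hides the fact that no high earner is identified.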


In [43]:
#Creating a dataframe with actual and predicted values
predicted_values = pd.DataFrame ({'Actual' :y_simple_test, 'Predicted' :y_simple_pred})
predicted_values

Unnamed: 0,Actual,Predicted
9646,<=50K,<=50K
709,<=50K,<=50K
7385,>50K,<=50K
16671,<=50K,<=50K
21932,<=50K,<=50K
...,...,...
25155,<=50K,<=50K
13055,<=50K,<=50K
2515,<=50K,<=50K
30833,>50K,<=50K


In [43]:
#4b.
# Define dependent and independent variables
x_multiple = CI[['age', 'workclass', 'education']] 
y_multiple = CI['income']

In [44]:
X_multiple = pd.get_dummies(x_multiple[['workclass', 'education']])
X_multiple

Unnamed: 0,workclass_?,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,workclass_State-gov,workclass_Without-pay,education_10th,...,education_9th,education_Assoc-acdm,education_Assoc-voc,education_Bachelors,education_Doctorate,education_HS-grad,education_Masters,education_Preschool,education_Prof-school,education_Some-college
0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,0,0,0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
32557,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
32558,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
32559,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [45]:
X_multiple['age'] = CI.age
X_multiple.columns

Index(['workclass_?', 'workclass_Federal-gov', 'workclass_Local-gov',
       'workclass_Never-worked', 'workclass_Private', 'workclass_Self-emp-inc',
       'workclass_Self-emp-not-inc', 'workclass_State-gov',
       'workclass_Without-pay', 'education_10th', 'education_11th',
       'education_12th', 'education_1st-4th', 'education_5th-6th',
       'education_7th-8th', 'education_9th', 'education_Assoc-acdm',
       'education_Assoc-voc', 'education_Bachelors', 'education_Doctorate',
       'education_HS-grad', 'education_Masters', 'education_Preschool',
       'education_Prof-school', 'education_Some-college', 'age'],
      dtype='object')

In [46]:
#Divide the dataset into training and test sets 
X_multiple_train, X_multiple_test, y_multiple_train, y_multiple_test = train_test_split(X_multiple, y_multiple, test_size=0.20, random_state=1)

In [47]:
#Initialize the Multiple Logistic Regression Model
multiple_model = LogisticRegression()

In [48]:
#Fit the model on the training data
multiple_model.fit(X_multiple_train, y_multiple_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
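
The warning above suggests two remedies: raising `max_iter` or scaling the data. A hedged sketch of both, on synthetic data rather than the census features, is shown below. The feature construction and the value `max_iter=1000` are assumptions, and the results reported in this notebook come from the unscaled model with default settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
age = rng.integers(17, 91, size=400).astype(float)   # unscaled numeric feature
flag = rng.integers(0, 2, size=400).astype(float)    # stand-in for a one-hot column
X = np.column_stack([age, flag])
# Synthetic binary target that depends on both features
y = (age + 20.0 * flag + rng.normal(0.0, 10.0, size=400) > 60.0).astype(int)

# Scaling happens inside the pipeline, so it is learned from the
# data passed to fit and reapplied automatically in predict
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)

# Number of solver iterations actually used
iterations_used = clf.named_steps["logisticregression"].n_iter_[0]
```

Keeping the scaler in a pipeline avoids fitting it on the test set by accident.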


In [49]:
#Predict the values on the test set
y_multiple_pred = multiple_model.predict(X_multiple_test)

In [50]:
avg_time = timer(multiple_model.predict, X_multiple_test[:1], num_times=10)
print(f"Average time for LogisticRegression model for 1 input sample is {avg_time}")

Average time for LogisticRegression model for 1 input sample is 0.0012395143508911132


In [51]:
#Build a confusion matrix and find the accuracy
conf_matrix_multiple = confusion_matrix(y_multiple_test, y_multiple_pred) 
accuracy_multiple = accuracy_score(y_multiple_test, y_multiple_pred)
print('Confusion Matrix for Multiple Logistic Regression:\n', conf_matrix_multiple) 
print('Accuracy for Multiple Logistic Regression:', accuracy_multiple)

Confusion Matrix for Multiple Logistic Regression:
 [[4739  287]
 [1052  435]]
Accuracy for Multiple Logistic Regression: 0.7944111776447106


In [52]:
#Creating a dataframe with actual and predicted values
predicted_values = pd.DataFrame ({'Actual' :y_multiple_test, 'Predicted' :y_multiple_pred})
predicted_values

Unnamed: 0,Actual,Predicted
9646,<=50K,<=50K
709,<=50K,<=50K
7385,>50K,<=50K
16671,<=50K,<=50K
21932,<=50K,<=50K
...,...,...
5889,>50K,<=50K
25723,<=50K,<=50K
29514,<=50K,<=50K
1600,<=50K,<=50K


# 5. Decision Tree:
a) Build a decision tree model as follows:

Divide the dataset into training and test sets in 70:30 ratio.
Build a decision tree model where the dependent variable is “X”(Yearly Income) and the rest of the variables as independent variables.
Predict the values on the test set.
Build a confusion matrix and calculate the accuracy.

In [53]:
from sklearn.tree import DecisionTreeClassifier

In [54]:
CI3 = pd.read_csv("census-income - census-income2.csv")
CI3.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [55]:
#Define dependent and independent variables
y_tree = CI3['income']
X_tree = CI3.drop('income', axis=1)

In [56]:
#One-hot encode using sparse DataFrame
CI3_encoded = pd.get_dummies(X_tree, drop_first=True, sparse=True)
#CI3_encoded contains independent or feature variables and y_tree contains dependent or target variables 
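
With `drop_first=True`, `get_dummies` omits the first level of each categorical column, so that level is represented by zeros in the remaining dummy columns. One consequence is that new data encoded separately may end up with different columns; `reindex` can align them. The sketch below illustrates this on a toy frame with hypothetical names (`toy`, `new`, `new_aligned`).

```python
import pandas as pd

toy = pd.DataFrame({"age": [39, 50, 38], "sex": ["Male", "Female", "Male"]})

# 'Female' is the first level alphabetically, so only sex_Male is kept
encoded = pd.get_dummies(toy, drop_first=True)

# A new row encoded on its own sees a single level and gets no dummy column
new = pd.DataFrame({"age": [22], "sex": ["Female"]})
new_encoded = pd.get_dummies(new, drop_first=True)

# Align to the training columns, filling absent dummies with 0
new_aligned = new_encoded.reindex(columns=encoded.columns, fill_value=0)
```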

In [57]:
CI3_encoded

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,39,77516,13,2174,0,40,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,50,83311,13,0,0,13,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,38,215646,9,0,0,40,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
3,53,234721,7,0,0,40,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
4,28,338409,13,0,0,40,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,257302,12,0,0,38,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
32557,40,154374,9,0,0,40,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
32558,58,151910,9,0,0,40,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
32559,22,201490,9,0,0,20,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0


In [58]:
#Divide the dataset into training and test sets 
X_tree_train, X_tree_test, y_tree_train, y_tree_test = train_test_split(CI3_encoded, y_tree, test_size=0.3, random_state=1)

In [59]:
#Initialize the model
model_tree = DecisionTreeClassifier()

In [60]:
#Train the model
model_tree.fit(X_tree_train, y_tree_train)



In [61]:
#Predict the values on the test set
y_pred_tree = model_tree.predict(X_tree_test)



In [62]:
avg_time = timer(model_tree.predict,X_tree_test[:1], num_times=10)
print(f"Average time for DecisionTreeClassifier model for 1 input sample is {avg_time}")

Average time for DecisionTreeClassifier model for 1 input sample is 0.0033193588256835937




In [63]:
#Build a confusion matrix and find the accuracy
confusion_matrix_tree = confusion_matrix(y_tree_test, y_pred_tree) 
accuracy_tree = accuracy_score(y_tree_test, y_pred_tree)
print('Confusion Matrix for Decision Tree Model:\n', confusion_matrix_tree) 
print('Accuracy for Decision Tree Model:', accuracy_tree)

Confusion Matrix for Decision Tree Model:
 [[6574  976]
 [ 822 1397]]
Accuracy for Decision Tree Model: 0.8159484082301157


In [64]:
#Creating a dataframe with actual and predicted values
predicted_values = pd.DataFrame ({'Actual' :y_tree_test, 'Predicted' :y_pred_tree})
predicted_values

Unnamed: 0,Actual,Predicted
9646,<=50K,<=50K
709,<=50K,<=50K
7385,>50K,>50K
16671,<=50K,>50K
21932,<=50K,<=50K
...,...,...
29663,>50K,>50K
29310,<=50K,<=50K
29661,<=50K,>50K
19491,<=50K,<=50K


# 6. Random Forest:
 a) Build a random forest model as follows:
Divide the dataset into training and test sets in 80:20 ratio.
Build a random forest model where the dependent variable is “X”(Yearly Income) and the rest of the variables as independent variables and number of trees as 300.
Predict values on the test set
Build a confusion matrix and calculate the accuracy

In [65]:
from sklearn.ensemble import RandomForestClassifier

In [66]:
#Divide the dataset into training and test sets 
X_randomforest_train, X_randomforest_test, y_randomforest_train, y_randomforest_test = train_test_split(CI3_encoded, y_tree, test_size=0.20, random_state=1)

In [67]:
#Initialize the Random Forest Classifier Model with 300 trees
randomforest_model = RandomForestClassifier(n_estimators=300, random_state=1)

In [68]:
#Train the model
randomforest_model.fit(X_randomforest_train, y_randomforest_train)



In [69]:
#Predict the values on the test set
y_pred_randomforest = randomforest_model.predict(X_randomforest_test)



In [70]:
test_input_random_forest = np.array(X_randomforest_test[:1])

In [71]:
avg_time = timer(randomforest_model.predict, test_input_random_forest, num_times=10)
print(f"Average time for RandomForestClassifier model for 1 input sample is {avg_time}")

Average time for RandomForestClassifier model for 1 input sample is 0.008323407173156739




In [72]:
#Build a confusion matrix and find the accuracy
confusion_matrix_randomforest = confusion_matrix(y_randomforest_test, y_pred_randomforest) 
accuracy_randomforest = accuracy_score(y_randomforest_test, y_pred_randomforest)
print('Confusion Matrix for Random Forest:\n', confusion_matrix_randomforest) 
print('Accuracy for Random Forest:', accuracy_randomforest)

Confusion Matrix for Random Forest:
 [[4652  374]
 [ 531  956]]
Accuracy for Random Forest: 0.8610471364962383
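
To compare the four classifiers on more than accuracy, the confusion matrices printed in this notebook can be reused to compute recall for the `>50K` class. The sketch below copies the reported matrices; note that the models were evaluated on different test splits (35%, 20%, 30% and 20%), so the comparison is indicative only.

```python
import numpy as np

# Confusion matrices as reported above; label order: ['<=50K', '>50K']
reported = {
    "simple logistic regression": [[8800, 0], [2597, 0]],
    "multiple logistic regression": [[4739, 287], [1052, 435]],
    "decision tree": [[6574, 976], [822, 1397]],
    "random forest": [[4652, 374], [531, 956]],
}

summary = {}
for name, matrix in reported.items():
    cm = np.array(matrix)
    accuracy = np.trace(cm) / cm.sum()
    # Recall for '>50K': correctly predicted high earners over all actual high earners
    recall_high = cm[1, 1] / cm[1].sum()
    summary[name] = (accuracy, recall_high)
    print(f"{name}: accuracy={accuracy:.3f}, recall(>50K)={recall_high:.3f}")
```

On these numbers the tree-based models identify roughly 63 to 64% of actual `>50K` earners, while the multiple logistic regression identifies about 29%.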


In [73]:
#Creating a dataframe with actual and predicted values
predicted_values = pd.DataFrame ({'Actual' :y_randomforest_test, 'Predicted' :y_pred_randomforest})
predicted_values

Unnamed: 0,Actual,Predicted
9646,<=50K,<=50K
709,<=50K,<=50K
7385,>50K,>50K
16671,<=50K,<=50K
21932,<=50K,<=50K
...,...,...
5889,>50K,>50K
25723,<=50K,<=50K
29514,<=50K,<=50K
1600,<=50K,<=50K


# Conclusion:


This marks the end of the process. Of the models trained to predict whether a person earns more than 50K a year, the random forest performed best, with a test accuracy of ~86%.
We moved step by step, exploring and modeling the data, and applied several machine learning models to reach these predictions. Test accuracy rose from ~77% for the single-feature logistic regression to ~79% for the multiple logistic regression, ~82% for the decision tree, and ~86% for the random forest with 300 trees.