<a href="https://colab.research.google.com/github/rupaidutta66/MACHINE-LEARNING-PROJECTS-/blob/main/Census_Income_Project(Capstone).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Income Classification Model**

###**Introduction**

The income dataset was extracted from the 1994 U.S. Census database.

###**Objective of the project**

The goal of this machine learning project is to predict whether a person earns more than 50K a year from their demographic attributes. Several classification techniques are explored, and the random forest model yields the best prediction result.

*Source:*

[adult data set](https://archive.ics.uci.edu/ml/datasets/adult/)

[Income dataset](http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data)






In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn 
%matplotlib inline

In [None]:
cens = pd.read_csv('/content/census-income.csv')

In [None]:
cens.head()


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,Unnamed: 15
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
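
The column names and string values in this file start with a space (for example `' workclass'` and `' State-gov'`), which is why the later cells index with a leading space. As an optional alternative, and assuming the same comma-separated layout, `skipinitialspace=True` tells `pd.read_csv` to drop whitespace that follows each delimiter. The inline sample below is a hypothetical stand-in for `census-income.csv`, not the real file.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same layout as the census file:
# every field after the first is preceded by a space.
raw = (
    "age, workclass, education, sex\n"
    "39, State-gov, Bachelors, Male\n"
    "50, Self-emp-not-inc, Bachelors, Male\n"
)

# Default parsing keeps the leading spaces in names and values.
with_spaces = pd.read_csv(io.StringIO(raw))

# skipinitialspace=True strips the whitespace that follows each comma.
stripped = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(list(with_spaces.columns))
print(list(stripped.columns))
```

The rest of this notebook keeps the default parsing, so its column names and values retain the leading space.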


In [None]:
cens.rename(columns={' ':'Income'}, inplace=True)

In [None]:
cens.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,Income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [None]:
cens.shape

(32561, 15)

#**1. Data Preprocessing:**

a) Replace all the missing values with NA.

b) Remove all the rows that contain NA values. 


In [None]:
cens.isnull().sum()

age                0
 workclass         0
 fnlwgt            0
 education         0
 education-num     0
 marital-status    0
 occupation        0
 relationship      0
 race              0
 sex               0
 capital-gain      0
 capital-loss      0
 hours-per-week    0
 native-country    0
Income             0
dtype: int64
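
`isnull()` reports zero everywhere because the file marks missing entries with the string `' ?'` rather than a true NA (the `workclass` counts further down show 1836 such rows). Steps a) and b) could be sketched as follows, on a small hypothetical frame rather than the full data:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the raw file: values carry a leading space and
# missing entries are written as ' ?'.
toy = pd.DataFrame({
    'age': [39, 50, 38, 53],
    ' workclass': [' State-gov', ' ?', ' Private', ' Private'],
    ' occupation': [' Adm-clerical', ' ?', ' Sales', ' ?'],
})

# a) Replace the ' ?' placeholder with a real missing value.
toy_na = toy.replace(' ?', np.nan)

# b) Drop every row that contains at least one NA.
toy_clean = toy_na.dropna()

print(toy_na.isnull().sum())
print(toy_clean)
```

This notebook does not drop those rows; it keeps them and relabels the placeholder later in the feature engineering section.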

#**2. Data Manipulation:**


In [None]:
# a) Extract the “education” column and store it in “census_ed”.

census_ed = cens[[' education']]
census_ed

Unnamed: 0,education
0,Bachelors
1,Bachelors
2,HS-grad
3,11th
4,Bachelors
...,...
32556,Assoc-acdm
32557,HS-grad
32558,HS-grad
32559,HS-grad


In [None]:
# b) Extract all the columns from “age” to “relationship” and store them in “census_seq”.

census_seq = cens.loc[:,'age':' relationship']
census_seq

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife
...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child


In [None]:
# c) Extract column numbers “5”, “8” and “11” and store them in “census_col”.

census_col = cens.iloc[:,[4,7,10]]  # Note: iloc positions are 0-based, so each index is one less than the column number.
census_col

Unnamed: 0,education-num,relationship,capital-gain
0,13,Not-in-family,2174
1,13,Husband,0
2,9,Not-in-family,0
3,7,Husband,0
4,13,Wife,0
...,...,...,...
32556,12,Wife,0
32557,9,Husband,0
32558,9,Unmarried,0
32559,9,Own-child,0
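
The off-by-one between the task's column numbers and `iloc` positions can be made explicit instead of hard-coded. A small sketch on a hypothetical one-row frame that reuses the first eleven census column names (without the leading spaces):

```python
import pandas as pd

# Hypothetical frame with the first eleven census column names.
cols = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
        'marital-status', 'occupation', 'relationship', 'race', 'sex',
        'capital-gain']
toy = pd.DataFrame([list(range(len(cols)))], columns=cols)

# 1-based column numbers from the task statement.
wanted = [5, 8, 11]

# iloc is 0-based, so subtract one from each.
positions = [n - 1 for n in wanted]
selected = toy.iloc[:, positions]

print(list(selected.columns))
```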


In [None]:
#### d) Extract all the male employees who work in state-gov and store it in “male_gov”.

male_gov = cens[(cens[' sex']==' Male') & (cens[' workclass']==' State-gov')]
male_gov

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,Income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
11,30,State-gov,141297,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,40,India,>50K
34,22,State-gov,311512,Some-college,10,Married-civ-spouse,Other-service,Husband,Black,Male,0,0,15,United-States,<=50K
48,41,State-gov,101603,Assoc-voc,11,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,<=50K
123,29,State-gov,267989,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,>50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32163,36,State-gov,135874,Bachelors,13,Married-civ-spouse,Sales,Husband,White,Male,0,0,40,United-States,<=50K
32241,45,State-gov,231013,Bachelors,13,Divorced,Protective-serv,Not-in-family,White,Male,0,0,40,United-States,<=50K
32321,54,State-gov,138852,HS-grad,9,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,40,United-States,<=50K
32324,42,State-gov,138162,Some-college,10,Divorced,Adm-clerical,Own-child,White,Male,0,0,40,United-States,<=50K


In [None]:
# e) Extract all the 39-year-olds who either have a bachelor's degree or are natives of the United States and store the result in “census_us”.

census_us = cens[(cens.age==39) & ((cens[' education']==' Bachelors') | (cens[' native-country']==' United-States'))]
census_us

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,Income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
28,39,Private,367260,HS-grad,9,Divorced,Exec-managerial,Not-in-family,White,Male,0,0,80,United-States,<=50K
129,39,Private,365739,Some-college,10,Divorced,Craft-repair,Not-in-family,White,Male,0,0,40,United-States,<=50K
166,39,Federal-gov,235485,Assoc-acdm,12,Never-married,Exec-managerial,Not-in-family,White,Male,0,0,42,United-States,<=50K
320,39,Self-emp-not-inc,174308,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32146,39,Private,117381,Some-college,10,Divorced,Transport-moving,Not-in-family,White,Male,0,0,65,United-States,<=50K
32260,39,Federal-gov,232036,Some-college,10,Married-civ-spouse,Adm-clerical,Husband,White,Male,0,0,40,United-States,>50K
32428,39,Federal-gov,110622,Bachelors,13,Married-civ-spouse,Adm-clerical,Wife,Asian-Pac-Islander,Female,0,0,40,Philippines,<=50K
32468,39,Self-emp-not-inc,193689,HS-grad,9,Never-married,Exec-managerial,Not-in-family,White,Male,0,0,65,United-States,<=50K


In [None]:
## f) Extract 200 random rows from the “census” data frame and store it in “census_200”.

census_200 = cens.sample(200)
census_200

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,Income
12793,33,Private,236396,Bachelors,13,Married-civ-spouse,Sales,Husband,White,Male,0,1902,55,United-States,>50K
8076,50,Self-emp-inc,136913,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
489,38,Private,91039,Bachelors,13,Married-civ-spouse,Sales,Husband,White,Male,15024,0,60,United-States,>50K
27326,21,Private,222993,HS-grad,9,Never-married,Machine-op-inspct,Own-child,White,Male,0,0,40,United-States,<=50K
30849,42,State-gov,212027,Bachelors,13,Divorced,Prof-specialty,Not-in-family,Black,Male,0,0,40,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8111,21,Private,34918,HS-grad,9,Never-married,Sales,Own-child,White,Female,0,0,30,United-States,<=50K
14057,24,Private,52242,HS-grad,9,Married-civ-spouse,Sales,Wife,White,Female,0,0,40,United-States,>50K
15937,34,State-gov,287908,HS-grad,9,Never-married,Other-service,Own-child,Black,Male,0,0,42,United-States,<=50K
21237,37,Private,305259,Assoc-acdm,12,Divorced,Exec-managerial,Not-in-family,White,Female,0,0,48,United-States,<=50K
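
`sample(200)` draws a different set of rows on every run, so the rows shown above will not repeat. If a repeatable draw is wanted, `DataFrame.sample` accepts a `random_state` seed. A sketch on a hypothetical frame (the seed value 42 is arbitrary):

```python
import pandas as pd

# Hypothetical stand-in for the census frame.
toy = pd.DataFrame({'row_id': range(1000)})

# Same seed -> same 200 rows on every run.
first = toy.sample(200, random_state=42)
second = toy.sample(200, random_state=42)

print(first.index.equals(second.index))
print(len(first))
```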


In [None]:
## g) Get the count of different levels of the “workclass” column

cens[' workclass'].value_counts()

 Private             22696
 Self-emp-not-inc     2541
 Local-gov            2093
 ?                    1836
 State-gov            1298
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name:  workclass, dtype: int64

In [None]:
# h) Calculate the mean of the “capital.gain” column grouped according to “workclass”.

cens.groupby(' workclass')[' capital-gain'].mean()

 workclass
 ?                    606.795752
 Federal-gov          833.232292
 Local-gov            880.202580
 Never-worked           0.000000
 Private              889.217792
 Self-emp-inc        4875.693548
 Self-emp-not-inc    1886.061787
 State-gov            701.699538
 Without-pay          487.857143
Name:  capital-gain, dtype: float64

In [None]:
# i) Create separate dataframes with the details of males and females from the census data who have an income of more than 50,000.

cens['Income'] = cens.Income.replace(' <=50K', 0)
cens['Income'] = cens.Income.replace(' >50K', 1)

In [None]:
male_50k=cens[(cens[' sex']==' Male') & (cens['Income']==1)]
male_50k

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,Income
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,1
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,1
10,37,Private,280464,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,Black,Male,0,0,80,United-States,1
11,30,State-gov,141297,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,40,India,1
14,40,Private,121772,Assoc-voc,11,Married-civ-spouse,Craft-repair,Husband,Asian-Pac-Islander,Male,0,0,40,?,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32532,34,Private,204461,Doctorate,16,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,60,United-States,1
32533,54,Private,337992,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,Asian-Pac-Islander,Male,0,0,50,Japan,1
32539,71,?,287372,Doctorate,16,Married-civ-spouse,?,Husband,White,Male,0,0,10,United-States,1
32554,53,Private,321865,Masters,14,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,United-States,1


In [None]:
female_50k=cens[(cens[' sex']==' Female') & (cens['Income']==1)]
female_50k

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,Income
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,1
19,43,Self-emp-not-inc,292175,Masters,14,Divorced,Exec-managerial,Unmarried,White,Female,0,0,45,United-States,1
52,47,Private,51835,Prof-school,15,Married-civ-spouse,Prof-specialty,Wife,White,Female,0,1902,60,Honduras,1
67,53,Private,169846,HS-grad,9,Married-civ-spouse,Adm-clerical,Wife,White,Female,0,0,40,United-States,1
84,44,Private,343591,HS-grad,9,Divorced,Craft-repair,Not-in-family,White,Female,14344,0,40,United-States,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32530,35,?,320084,Bachelors,13,Married-civ-spouse,?,Wife,White,Female,0,0,55,United-States,1
32536,34,Private,160216,Bachelors,13,Never-married,Exec-managerial,Not-in-family,White,Female,0,0,55,United-States,1
32538,38,Private,139180,Bachelors,13,Divorced,Prof-specialty,Unmarried,Black,Female,15020,0,45,United-States,1
32545,39,Local-gov,111499,Assoc-acdm,12,Married-civ-spouse,Adm-clerical,Wife,White,Female,0,0,20,United-States,1


In [None]:
### j) Calculate the percentage of people from the United States who are private employees and earn less than 50,000 annually.

total = cens[' workclass'].value_counts().sum()
p_le50k=cens[(cens[' native-country']==' United-States') & (cens[' workclass']==' Private') &(cens['Income']==0)]

percentage = (len(p_le50k)/total)*100
print(' The percentage of people from the United States who are private employees and earn less than 50,000 annually is {}'.format(percentage))

 The percentage of people from the United States who are private employees and earn less than 50,000 annually is 47.891649519363654


In [None]:
## k) Calculate the percentage of married people in the census data.

cens[' marital-status']= cens[' marital-status'].replace([' Married-civ-spouse', ' Married-AF-spouse',' Married-spouse-absent'], 'married')
cens[' marital-status'].value_counts()

married           15417
 Never-married    10683
 Divorced          4443
 Separated         1025
 Widowed            993
Name:  marital-status, dtype: int64

In [None]:
total=len(cens[' marital-status'])
n_married = len(cens[cens[' marital-status']=='married'])
perc_married = (n_married/total)*100
perc_married

47.34805442093302

In [None]:
# l) Calculate the percentage of high school graduates earning more than 50,000 annually.

hs_m50k=len(cens[(cens[' education']==' HS-grad') &(cens['Income']==1)])
total=len(cens[' education'])
perc_edu = (hs_m50k/total)*100
perc_edu

5.144190903227788
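
The figure above divides by all rows, so it is the share of the whole sample that are high school graduates earning more than 50K. If the question is read as "what fraction of high school graduates earn more than 50K", the denominator would be the HS-grad rows only. A sketch on hypothetical data showing both readings:

```python
import pandas as pd

# Hypothetical rows; Income is already encoded as 0/1.
toy = pd.DataFrame({
    'education': ['HS-grad', 'HS-grad', 'HS-grad', 'HS-grad',
                  'Bachelors', 'Bachelors', 'Masters', 'Masters'],
    'Income': [1, 0, 0, 0, 1, 0, 1, 1],
})

is_hs = toy['education'] == 'HS-grad'
high_income = toy['Income'] == 1

# Share of ALL rows that are HS-grad and high income (the calculation used above).
share_of_all = (is_hs & high_income).mean() * 100

# Share of HS-grad rows that are high income (conditional percentage).
share_of_hs = high_income[is_hs].mean() * 100

print(share_of_all, share_of_hs)
```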

#**Feature Engineering**

In [None]:
# education Category
cens[' education'] = cens[' education'].replace([' Preschool', ' 1st-4th', ' 5th-6th', ' 7th-8th', ' 9th',' 10th', ' 11th', ' 12th'], 'school')
cens[' education'] = cens[' education'].replace([' HS-grad'], 'high school')
cens[' education'] = cens[' education'].replace([' Bachelors'], 'undergrad')
cens[' education'] = cens[' education'].replace([' Masters'], 'grad')
cens[' education'] = cens[' education'].replace([' Doctorate'], 'doc')

In [None]:
# marital status

cens[' marital-status']= cens[' marital-status'].replace([' Never-married'], 'not-married')
cens[' marital-status']= cens[' marital-status'].replace([' Divorced', ' Separated', ' Widowed'], 'other')  # raw values keep their leading space

In [None]:
cens.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,Income
0,39,State-gov,77516,undergrad,13,not-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,83311,undergrad,13,married,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,215646,high school,9,other,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,234721,school,7,married,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,338409,undergrad,13,married,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0


In [None]:
cens[' workclass'].value_counts()

 Private             22696
 Self-emp-not-inc     2541
 Local-gov            2093
 ?                    1836
 State-gov            1298
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name:  workclass, dtype: int64

In [None]:
cens[' occupation'].value_counts()

 Prof-specialty       4140
 Craft-repair         4099
 Exec-managerial      4066
 Adm-clerical         3770
 Sales                3650
 Other-service        3295
 Machine-op-inspct    2002
 ?                    1843
 Transport-moving     1597
 Handlers-cleaners    1370
 Farming-fishing       994
 Tech-support          928
 Protective-serv       649
 Priv-house-serv       149
 Armed-Forces            9
Name:  occupation, dtype: int64

In [None]:
cens[' native-country'].value_counts()

 United-States                 29170
 Mexico                          643
 ?                               583
 Philippines                     198
 Germany                         137
 Canada                          121
 Puerto-Rico                     114
 El-Salvador                     106
 India                           100
 Cuba                             95
 England                          90
 Jamaica                          81
 South                            80
 China                            75
 Italy                            73
 Dominican-Republic               70
 Vietnam                          67
 Guatemala                        64
 Japan                            62
 Poland                           60
 Columbia                         59
 Taiwan                           51
 Haiti                            44
 Iran                             43
 Portugal                         37
 Nicaragua                        34
 Peru                             31
 

In [None]:
# Change the " ?" placeholder to "Unknown" (raw values in this file carry a leading space)

change_columns = [' workclass', ' occupation', ' native-country']
for column in change_columns:
    cens[column] = cens[column].replace({' ?': 'Unknown'})

#**Data Preprocessing**

In [None]:
# Work on a copy so the cleaned dataset stays available for later use; the copy is prepared for modeling.
cens_prep = cens.copy()

In [None]:
from sklearn.preprocessing import MinMaxScaler
numerical = ['age', ' capital-gain', ' capital-loss', ' hours-per-week', ' fnlwgt']

scaler = MinMaxScaler()
cens_prep[numerical] = scaler.fit_transform(cens_prep[numerical])

In [None]:
cens_prep[' sex'] = cens_prep[' sex'].replace({" Female": 0, " Male": 1})

In [None]:
cens_prep.sample(3)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,Income
15694,0.219178,Private,0.115942,Some-college,10,not-married,Other-service,Not-in-family,White,1,0.0,0.0,0.234694,United-States,0
18370,0.178082,Private,0.015864,school,7,married,Handlers-cleaners,Husband,White,1,0.0,0.0,0.5,United-States,0
15957,0.082192,Private,0.344974,Some-college,10,not-married,Prof-specialty,Not-in-family,White,0,0.0,0.0,0.44898,United-States,0
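
With its default `feature_range=(0, 1)`, `MinMaxScaler` maps each value `x` in a column to `(x - min) / (max - min)`. The sketch below applies that formula by hand. The bounds 1 and 99 for `hours-per-week` are an assumption, chosen because they reproduce scaled values shown in this notebook (40 hours appears as 0.397959 and 13 hours as 0.122449).

```python
def min_max_scale(x, lo, hi):
    """Scale x from [lo, hi] to [0, 1]."""
    return (x - lo) / (hi - lo)

def min_max_unscale(z, lo, hi):
    """Invert the scaling: map z in [0, 1] back to [lo, hi]."""
    return z * (hi - lo) + lo

HOURS_MIN, HOURS_MAX = 1, 99  # assumed column bounds, not stated in the source

for hours in (13, 40):
    print(hours, round(min_max_scale(hours, HOURS_MIN, HOURS_MAX), 6))
```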


#**3. Linear Regression:**

a) Build a simple linear regression model as follows:

●	Divide the dataset into training and test sets in 70:30 ratio.

●	Build a linear model on the training set where the dependent variable is “hours.per.week” and the independent variable is “education.num”.

●	Predict the values on the test set and find the error in prediction. 

●	Find the root-mean-square error (RMSE).

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [None]:
#independent variable is “education.num”.
x=cens_prep[[' education-num']]
#dependent variable is “hours.per.week”
y=cens_prep[' hours-per-week']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.30, random_state=1)

In [None]:
lr = LinearRegression()

In [None]:
lr.fit(x_train,y_train)

LinearRegression()

In [None]:
y_pred = lr.predict(x_test)

In [None]:
error = y_test - y_pred
error

9646     0.306580
709     -0.134278
7385     0.075842
16671    0.003789
21932    0.018783
           ...   
29663   -0.018702
29310    0.003789
29661   -0.003708
19491   -0.011205
2861     0.056268
Name:  hours-per-week, Length: 9769, dtype: float64

In [None]:
print('mean_squared_error :',mean_squared_error(y_test,y_pred))

print('root-mean-square error :',np.sqrt(mean_squared_error(y_test,y_pred)))

mean_squared_error : 0.015322013576285046
root-mean-square error : 0.12378212139192415
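
Since `hours-per-week` in `cens_prep` was min-max scaled before the model was fit, this RMSE is in scaled units, not hours. Under min-max scaling, an error in scaled units converts back by multiplying by the column range. The sketch below assumes a range of 98 hours (bounds 1 and 99), which is inferred from the scaled values shown earlier in this notebook and is not stated in the source.

```python
rmse_scaled = 0.12378212139192415  # value printed above
hours_range = 99 - 1               # assumed max - min of hours-per-week

# Min-max scaling divides by the range, so multiply to undo it.
rmse_hours = rmse_scaled * hours_range
print(round(rmse_hours, 2))
```

Under that assumption the typical prediction error is roughly 12 hours per week.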


#**4. Logistic Regression:**

a) Build a simple logistic regression model as follows:

●	Divide the dataset into training and test sets in 65:35 ratio.

●	Build a logistic regression model where the dependent variable is “X”(yearly income) and the independent variable is “occupation”.

●	Predict the values on the test set.

●	Build a confusion matrix and find the accuracy.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

In [None]:
x=cens[' occupation']
x=pd.DataFrame(x)

In [None]:
x

Unnamed: 0,occupation
0,Adm-clerical
1,Exec-managerial
2,Handlers-cleaners
3,Handlers-cleaners
4,Prof-specialty
...,...
32556,Tech-support
32557,Machine-op-inspct
32558,Adm-clerical
32559,Adm-clerical


In [None]:
le=LabelEncoder()
X=le.fit_transform(x[' occupation'])  # pass the 1-D column; a one-column DataFrame triggers a column_or_1d warning


In [None]:
X=pd.DataFrame(X)

In [None]:
X.head()

Unnamed: 0,0
0,1
1,4
2,6
3,6
4,10


In [None]:

y=cens['Income']
y.values.reshape(-1,1)


array([[0],
       [0],
       [0],
       ...,
       [0],
       [0],
       [1]])

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.35, random_state=1)

lo=LogisticRegression()

In [None]:
lo.fit(X_train,y_train)


LogisticRegression()

In [None]:
y_pred=lo.predict(X_test)

In [None]:
print('confusion_matrix :')
print(confusion_matrix(y_test,y_pred))  # sklearn expects (y_true, y_pred): rows are true classes, columns are predictions
print('accuracy_score :',accuracy_score(y_test,y_pred))

confusion_matrix :
[[8800    0]
 [2597    0]]
accuracy_score : 0.7721330174607353
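
The printed counts show that the model assigned every test row to class 0: 8800 rows whose true class is 0 and 2597 whose true class is 1, with no row predicted as 1. The accuracy therefore equals the share of the majority class, which can be checked by hand:

```python
true_0_pred_0 = 8800  # true <=50K, predicted <=50K
true_1_pred_0 = 2597  # true >50K, predicted <=50K
pred_1_total = 0      # no test row was predicted >50K

n_test = true_0_pred_0 + true_1_pred_0 + pred_1_total
accuracy = true_0_pred_0 / n_test

# Accuracy of always predicting the majority class (<=50K).
majority_baseline = max(true_0_pred_0, true_1_pred_0) / n_test

print(n_test, accuracy, majority_baseline)
```

In other words, with occupation as the only feature this model does no better than always answering "<=50K".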


b) Build a multiple logistic regression model as follows:

●	Divide the dataset into training and test sets in 80:20 ratio.

●	Build a logistic regression model where the dependent variable is “X” (yearly income) and the independent variables are “age”, “workclass”, and “education”.

●	Predict the values on the test set.

●	Build a confusion matrix and find the accuracy.

In [None]:
census=cens[['age',' workclass',' education']]
census

Unnamed: 0,age,workclass,education
0,39,State-gov,undergrad
1,50,Self-emp-not-inc,undergrad
2,38,Private,high school
3,53,Private,school
4,28,Private,undergrad
...,...,...,...
32556,27,Private,Assoc-acdm
32557,40,Private,high school
32558,58,Private,high school
32559,22,Private,high school


In [None]:
x=census.apply(le.fit_transform)
y=cens['Income']
y.values.reshape(-1,1)

array([[0],
       [0],
       [0],
       ...,
       [0],
       [0],
       [1]])

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.20, random_state=1)
lo=LogisticRegression()
lo.fit(x_train,y_train)
y_pred=lo.predict(x_test)

print('confusion_matrix :')
print(confusion_matrix(y_test,y_pred))  # sklearn expects (y_true, y_pred): rows are true classes, columns are predictions
print('accuracy_score :',accuracy_score(y_test,y_pred))

confusion_matrix :
[[4919  107]
 [1457   30]]
accuracy_score : 0.7598648856133886
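
`LabelEncoder` numbers categories in sorted order, so a logistic regression fed those codes treats them as positions on a single numeric axis, even though `workclass` and `occupation` have no natural order. One common alternative, not used in this notebook, is one-hot encoding, where each category gets its own 0/1 column. A minimal sketch with `pd.get_dummies` on hypothetical rows:

```python
import pandas as pd

# Hypothetical rows; column names are written without the leading space.
toy = pd.DataFrame({
    'age': [39, 50, 38, 28],
    'workclass': ['State-gov', 'Self-emp-not-inc', 'Private', 'Private'],
})

# One indicator column per workclass level; numeric columns pass through.
encoded = pd.get_dummies(toy, columns=['workclass'], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```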


#**5. Decision Tree:**
a) Build a decision tree model as follows:

●	Divide the dataset into training and test sets in 70:30 ratio.

●	Build a decision tree model where the dependent variable is “X”(Yearly Income) and the rest of the variables as independent variables.

●	Predict the values on the test set.

●	Build a confusion matrix and calculate the accuracy.


In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
cens_prep.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,Income
0,0.30137,State-gov,0.044302,undergrad,13,not-married,Adm-clerical,Not-in-family,White,1,0.02174,0.0,0.397959,United-States,0
1,0.452055,Self-emp-not-inc,0.048238,undergrad,13,married,Exec-managerial,Husband,White,1,0.0,0.0,0.122449,United-States,0
2,0.287671,Private,0.138113,high school,9,other,Handlers-cleaners,Not-in-family,White,1,0.0,0.0,0.397959,United-States,0
3,0.493151,Private,0.151068,school,7,married,Handlers-cleaners,Husband,Black,1,0.0,0.0,0.397959,United-States,0
4,0.150685,Private,0.221488,undergrad,13,married,Prof-specialty,Wife,Black,0,0.0,0.0,0.397959,Cuba,0


In [None]:
data=cens_prep.apply(le.fit_transform)
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,Income
0,22,7,2671,8,12,2,1,1,4,1,25,0,39,39,0
1,33,6,2926,8,12,1,4,0,4,1,0,0,12,39,0
2,21,4,14086,6,8,3,6,1,4,1,0,0,39,39,0
3,36,4,15336,7,6,1,6,0,2,1,0,0,39,39,0
4,11,4,19355,8,12,1,10,5,2,0,0,0,39,5,0


In [None]:
#independent
x=data.iloc[:,:-1]
#dependent
y=data.iloc[:,-1]

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.30, random_state=1)
dt=DecisionTreeClassifier()
dt.fit(x_train,y_train)
y_pred=dt.predict(x_test)

print('confusion_matrix :')
print(confusion_matrix(y_test,y_pred))  # sklearn expects (y_true, y_pred): rows are true classes, columns are predictions
print('accuracy_score :',accuracy_score(y_test,y_pred))

confusion_matrix :
[[6548 1002]
 [ 855 1364]]
accuracy_score : 0.8099088954857201


#**6. Random Forest:**
a) Build a random forest model as follows:

●	Divide the dataset into training and test sets in 80:20 ratio.

●	Build a random forest model where the dependent variable is “X”(Yearly Income) and the rest of the variables as independent variables and number of trees as 300.

●	Predict values on the test set

●	Build a confusion matrix and calculate the accuracy.

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.20, random_state=1)
rf=RandomForestClassifier(n_estimators=300)
rf.fit(x_train,y_train)
y_pred=rf.predict(x_test)

print('confusion_matrix :')
print(confusion_matrix(y_test,y_pred))  # sklearn expects (y_true, y_pred): rows are true classes, columns are predictions
print('accuracy_score :',accuracy_score(y_test,y_pred))

confusion_matrix :
[[4654  372]
 [ 517  970]]
accuracy_score : 0.8635037617073545
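
Accuracy alone hides how the errors split between the two classes. From the four counts printed above (4654 and 970 correct, 517 rows with true class 1 predicted as 0, 372 rows with true class 0 predicted as 1), precision and recall for the >50K class follow directly. A sketch of the arithmetic:

```python
tn, fn = 4654, 517   # predicted <=50K: truly <=50K / truly >50K
fp, tp = 372, 970    # predicted >50K:  truly <=50K / truly >50K

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of rows predicted >50K, how many are right
recall = tp / (tp + fn)      # of rows truly >50K, how many are found

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```

So although the random forest has the highest accuracy of the models tried here, it still misses roughly a third of the >50K rows in this test split.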
