# Adult Census Income Analysis: Decision Tree, Random Forest, Cross-Validation, and Model Tuning with Ensemble Techniques (Bagging, AdaBoost)


### A stable and optimized model to predict the income of a given population, labelled as <=50K or >50K. The attributes (predictors) are age, working class type, marital status, gender, race, etc.
#### The steps are as follows:
#### 1. Clean and prepare the data
#### 2. Analyze the data
#### 3. Label-encode the categorical variables
#### 4. Build a decision tree and a random forest with default hyperparameters
#### 5. Build several classifier models to compare and cross-validate, and combine them in a voting classifier
#### 6. Choose the optimal hyperparameters using grid search cross-validation
#### 7. Build an optimized random forest model with the tuned hyperparameters from the grid search
#### 8. Try to increase accuracy by applying the bagging ensemble technique to the tuned random forest model
#### 9. Try to increase accuracy by applying the AdaBoost ensemble technique to the tuned random forest model
#### I hope you enjoy this notebook and find it useful!

## Clean & Analyze Data

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# sklearn.cross_validation was removed from scikit-learn; cross_val_score and
# train_test_split now live in sklearn.model_selection (imported in the cells below)
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import warnings
warnings.filterwarnings('ignore')

In [3]:
from sklearn.model_selection import cross_val_score

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
data =  pd.read_csv("./adult.csv")

In [6]:
data.head()

| | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 90 | ? | 77053 | HS-grad | 9 | Widowed | ? | Not-in-family | White | Female | 0 | 4356 | 40 | United-States | <=50K |
| 1 | 82 | Private | 132870 | HS-grad | 9 | Widowed | Exec-managerial | Not-in-family | White | Female | 0 | 4356 | 18 | United-States | <=50K |
| 2 | 66 | ? | 186061 | Some-college | 10 | Widowed | ? | Unmarried | Black | Female | 0 | 4356 | 40 | United-States | <=50K |
| 3 | 54 | Private | 140359 | 7th-8th | 4 | Divorced | Machine-op-inspct | Unmarried | White | Female | 0 | 3900 | 40 | United-States | <=50K |
| 4 | 41 | Private | 264663 | Some-college | 10 | Separated | Prof-specialty | Own-child | White | Female | 0 | 3900 | 40 | United-States | <=50K |


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education.num     32561 non-null int64
marital.status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital.gain      32561 non-null int64
capital.loss      32561 non-null int64
hours.per.week    32561 non-null int64
native.country    32561 non-null object
income            32561 non-null object
dtypes: int64(6), object(9)
memory usage: 2.6+ MB


In [8]:
data.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education.num     0
marital.status    0
occupation        0
relationship      0
race              0
sex               0
capital.gain      0
capital.loss      0
hours.per.week    0
native.country    0
income            0
dtype: int64

In [9]:
# select all categorical variables
df_categorical = data.select_dtypes(include=['object'])

# checking whether any other columns contain a "?"
df_categorical.apply(lambda x: x=="?", axis=0).sum()

workclass         1836
education            0
marital.status       0
occupation        1843
relationship         0
race                 0
sex                  0
native.country     583
income               0
dtype: int64
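The earlier `isnull()` check reports zero missing values because this file marks missing entries with the string `?` rather than with an empty field. As a minimal sketch, assuming a small in-memory CSV (`sample_csv`, a hypothetical stand-in for `adult.csv`), pandas can be told to treat `?` as missing at read time through the `na_values` parameter of `read_csv`:

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for adult.csv; "?" marks a missing entry.
sample_csv = io.StringIO(
    "age,workclass,occupation,income\n"
    "90,?,?,<=50K\n"
    "82,Private,Exec-managerial,<=50K\n"
    "66,?,Adm-clerical,>50K\n"
)

# na_values tells read_csv which strings to convert to NaN while parsing.
sample = pd.read_csv(sample_csv, na_values="?")

# With real NaN values, isnull() now reports the gaps per column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# dropna() then removes every row that has at least one missing value.
cleaned = sample.dropna()
print(len(cleaned))
```

With this approach the three separate `!= "?"` filters used later could be replaced by a single `dropna()` call, though the notebook's explicit filters give the same rows.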

In [10]:
data[data['workclass'] == '?' ].count()

age               1836
workclass         1836
fnlwgt            1836
education         1836
education.num     1836
marital.status    1836
occupation        1836
relationship      1836
race              1836
sex               1836
capital.gain      1836
capital.loss      1836
hours.per.week    1836
native.country    1836
income            1836
dtype: int64

In [11]:
data[data['occupation'] == '?' ].count()

age               1843
workclass         1843
fnlwgt            1843
education         1843
education.num     1843
marital.status    1843
occupation        1843
relationship      1843
race              1843
sex               1843
capital.gain      1843
capital.loss      1843
hours.per.week    1843
native.country    1843
income            1843
dtype: int64

In [12]:
data[data['native.country'] == '?' ].count()

age               583
workclass         583
fnlwgt            583
education         583
education.num     583
marital.status    583
occupation        583
relationship      583
race              583
sex               583
capital.gain      583
capital.loss      583
hours.per.week    583
native.country    583
income            583
dtype: int64

In [13]:
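# Share of rows with a missing workclass, in percent
round((1836 / 32561) * 100, 2)

5.64

### About 5.6% of rows have a missing workclass (the other two affected columns are similar or lower), so we will drop the rows with missing values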

In [14]:
data.count()

age               32561
workclass         32561
fnlwgt            32561
education         32561
education.num     32561
marital.status    32561
occupation        32561
relationship      32561
race              32561
sex               32561
capital.gain      32561
capital.loss      32561
hours.per.week    32561
native.country    32561
income            32561
dtype: int64

In [15]:
data = data[data["workclass"] != "?" ]

In [16]:
data = data[data["occupation"] != "?" ]

In [17]:
data = data[data["native.country"] != "?" ]

In [18]:
data.count()

age               30162
workclass         30162
fnlwgt            30162
education         30162
education.num     30162
marital.status    30162
occupation        30162
relationship      30162
race              30162
sex               30162
capital.gain      30162
capital.loss      30162
hours.per.week    30162
native.country    30162
income            30162
dtype: int64
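The row counts before and after filtering show how much data was actually removed. The short calculation below uses only numbers printed in this notebook; the variable names are illustrative.

```python
# Row counts reported by data.count() before and after filtering.
rows_before = 32561
rows_after = 30162

# Per-column counts of "?" reported earlier in the notebook.
question_marks = {"workclass": 1836, "occupation": 1843, "native.country": 583}

# Share of rows affected by each column, in percent.
pct_missing = {col: round(n / rows_before * 100, 2) for col, n in question_marks.items()}

# Rows actually removed; smaller than the sum of the three counts because
# a single row can have "?" in more than one column.
rows_dropped = rows_before - rows_after
pct_dropped = round(rows_dropped / rows_before * 100, 2)

print(pct_missing)
print(rows_dropped, pct_dropped)
```

The three per-column counts overlap heavily, so the rows dropped are far fewer than their sum.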

In [19]:
data.head()

| | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 82 | Private | 132870 | HS-grad | 9 | Widowed | Exec-managerial | Not-in-family | White | Female | 0 | 4356 | 18 | United-States | <=50K |
| 3 | 54 | Private | 140359 | 7th-8th | 4 | Divorced | Machine-op-inspct | Unmarried | White | Female | 0 | 3900 | 40 | United-States | <=50K |
| 4 | 41 | Private | 264663 | Some-college | 10 | Separated | Prof-specialty | Own-child | White | Female | 0 | 3900 | 40 | United-States | <=50K |
| 5 | 34 | Private | 216864 | HS-grad | 9 | Divorced | Other-service | Unmarried | White | Female | 0 | 3770 | 45 | United-States | <=50K |
| 6 | 38 | Private | 150601 | 10th | 6 | Separated | Adm-clerical | Unmarried | White | Male | 0 | 3770 | 40 | United-States | <=50K |


In [20]:
data["income"].unique()

array(['<=50K', '>50K'], dtype=object)

In [21]:
data["income"] = data["income"].map({'<=50K' : 0, '>50K': 1})
data.head()

| | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 82 | Private | 132870 | HS-grad | 9 | Widowed | Exec-managerial | Not-in-family | White | Female | 0 | 4356 | 18 | United-States | 0 |
| 3 | 54 | Private | 140359 | 7th-8th | 4 | Divorced | Machine-op-inspct | Unmarried | White | Female | 0 | 3900 | 40 | United-States | 0 |
| 4 | 41 | Private | 264663 | Some-college | 10 | Separated | Prof-specialty | Own-child | White | Female | 0 | 3900 | 40 | United-States | 0 |
| 5 | 34 | Private | 216864 | HS-grad | 9 | Divorced | Other-service | Unmarried | White | Female | 0 | 3770 | 45 | United-States | 0 |
| 6 | 38 | Private | 150601 | 10th | 6 | Separated | Adm-clerical | Unmarried | White | Male | 0 | 3770 | 40 | United-States | 0 |


In [22]:
data["income"].unique()

array([0, 1], dtype=int64)

## Label Encoding

In [23]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

catogorical_data = data.select_dtypes(include =['object'])

In [24]:
catogorical_data.head()

| | workclass | education | marital.status | occupation | relationship | race | sex | native.country |
|---|---|---|---|---|---|---|---|---|
| 1 | Private | HS-grad | Widowed | Exec-managerial | Not-in-family | White | Female | United-States |
| 3 | Private | 7th-8th | Divorced | Machine-op-inspct | Unmarried | White | Female | United-States |
| 4 | Private | Some-college | Separated | Prof-specialty | Own-child | White | Female | United-States |
| 5 | Private | HS-grad | Divorced | Other-service | Unmarried | White | Female | United-States |
| 6 | Private | 10th | Separated | Adm-clerical | Unmarried | White | Male | United-States |


In [25]:
catogorical_data = catogorical_data.apply(le.fit_transform)

In [26]:
catogorical_data.head()

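| | workclass | education | marital.status | occupation | relationship | race | sex | native.country |
|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 11 | 6 | 3 | 1 | 4 | 0 | 38 |
| 3 | 2 | 5 | 0 | 6 | 4 | 4 | 0 | 38 |
| 4 | 2 | 15 | 5 | 9 | 3 | 4 | 0 | 38 |
| 5 | 2 | 11 | 0 | 7 | 4 | 4 | 0 | 38 |
| 6 | 2 | 0 | 5 | 0 | 4 | 4 | 1 | 38 |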
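`LabelEncoder` numbers the distinct labels of a column in sorted order, which is why `Female` becomes 0 and `Male` becomes 1 above. The pure-Python sketch below mimics that convention with a hypothetical helper, `fit_label_codes`, which is not part of scikit-learn. It keeps one mapping per column so codes can be translated back to labels later. The single shared `le` object cannot do this, since `apply` refits it on every column and it only remembers the last one.

```python
def fit_label_codes(values):
    """Map each distinct label to an integer, in sorted order of the labels."""
    return {label: code for code, label in enumerate(sorted(set(values)))}

# Hypothetical miniature columns using labels that appear in the notebook.
columns = {
    "sex": ["Female", "Female", "Male", "Female"],
    "marital.status": ["Widowed", "Divorced", "Separated", "Divorced"],
}

# One mapping per column, so each column can be decoded independently.
mappings = {name: fit_label_codes(vals) for name, vals in columns.items()}
encoded = {name: [mappings[name][v] for v in vals] for name, vals in columns.items()}

# Inverse mappings translate integer codes back to the original labels.
inverse = {name: {code: label for label, code in m.items()} for name, m in mappings.items()}

print(mappings["sex"])
print(encoded["marital.status"])
print(inverse["sex"][1])
```

The integer codes only identify categories; their numeric order follows the alphabet, not any property of the data.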


In [27]:
data = data.drop(catogorical_data.columns, axis=1)
data = pd.concat([data, catogorical_data], axis=1)
data.head()

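| | age | fnlwgt | education.num | capital.gain | capital.loss | hours.per.week | income | workclass | education | marital.status | occupation | relationship | race | sex | native.country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 82 | 132870 | 9 | 0 | 4356 | 18 | 0 | 2 | 11 | 6 | 3 | 1 | 4 | 0 | 38 |
| 3 | 54 | 140359 | 4 | 0 | 3900 | 40 | 0 | 2 | 5 | 0 | 6 | 4 | 4 | 0 | 38 |
| 4 | 41 | 264663 | 10 | 0 | 3900 | 40 | 0 | 2 | 15 | 5 | 9 | 3 | 4 | 0 | 38 |
| 5 | 34 | 216864 | 9 | 0 | 3770 | 45 | 0 | 2 | 11 | 0 | 7 | 4 | 4 | 0 | 38 |
| 6 | 38 | 150601 | 6 | 0 | 3770 | 40 | 0 | 2 | 0 | 5 | 0 | 4 | 4 | 1 | 38 |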


In [28]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30162 entries, 1 to 32560
Data columns (total 15 columns):
age               30162 non-null int64
fnlwgt            30162 non-null int64
education.num     30162 non-null int64
capital.gain      30162 non-null int64
capital.loss      30162 non-null int64
hours.per.week    30162 non-null int64
income            30162 non-null int64
workclass         30162 non-null int32
education         30162 non-null int32
marital.status    30162 non-null int32
occupation        30162 non-null int32
relationship      30162 non-null int32
race              30162 non-null int32
sex               30162 non-null int32
native.country    30162 non-null int32
dtypes: int32(8), int64(7)
memory usage: 2.8 MB


In [29]:
data['income'] = data['income'].astype('category')


## Decision Tree Model with Default parameters

In [30]:
x=data.drop('income',axis=1)
y=data['income']
#Train & Test split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.30,random_state= 476)

In [31]:
tree = DecisionTreeClassifier()
model_tree = tree.fit(x_train,y_train)
model_tree

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [32]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [33]:
model_tree = tree.fit(x_train,y_train)
pred_tree = tree.predict(x_test)
a1 = accuracy_score(y_test,pred_tree)
print("The Accuracy of Decision Tree is ", a1)

The Accuracy of Decision Tree is  0.8118024091059786


In [34]:
confusion_matrix(y_test,pred_tree)

array([[5932,  856],
       [ 847, 1414]], dtype=int64)

In [35]:
print(classification_report(y_test, pred_tree))

             precision    recall  f1-score   support

          0       0.88      0.87      0.87      6788
          1       0.62      0.63      0.62      2261

avg / total       0.81      0.81      0.81      9049
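Every figure in the classification report can be recomputed from the confusion matrix, which helps when reading the two side by side. The sketch below uses the decision tree's matrix as printed above and follows scikit-learn's layout, where rows are true classes and columns are predicted classes. The helper function names are illustrative.

```python
# Confusion matrix reported above: rows are true classes, columns are predictions.
cm = [[5932, 856],
      [847, 1414]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

def precision(cm, k):
    """Correct predictions of class k divided by all predictions of class k."""
    predicted_k = sum(row[k] for row in cm)
    return cm[k][k] / predicted_k

def recall(cm, k):
    """Correct predictions of class k divided by all true members of class k."""
    return cm[k][k] / sum(cm[k])

def f1(cm, k):
    """Harmonic mean of precision and recall for class k."""
    p, r = precision(cm, k), recall(cm, k)
    return 2 * p * r / (p + r)

for k in (0, 1):
    print(k, round(precision(cm, k), 2), round(recall(cm, k), 2), round(f1(cm, k), 2))
print(round(accuracy, 4))
```

The class 1 row (income >50K) is the weaker one: roughly 37% of true high earners in the test set are misclassified as class 0.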



## Random Forest Model with Default parameters

In [36]:
rf = RandomForestClassifier()
model_rf = rf.fit(x_train,y_train)
pred_rf = rf.predict(x_test)
a2 = accuracy_score(y_test, pred_rf)
print("The Accuracy of Random Forest is ", a2)

The Accuracy of Random Forest is  0.8437396397391977


## Logistic Regression & KNN model

In [37]:
from sklearn.linear_model import LogisticRegression
lg = LogisticRegression()

model_lg = lg.fit(x_train,y_train)
pred_lg = lg.predict(x_test)
a3 = accuracy_score(y_test, pred_lg)
print("The Accuracy of logistic regression is ", a3)

The Accuracy of logistic regression is  0.7821858768924743


In [38]:
from sklearn.neighbors import KNeighborsClassifier 
knn = KNeighborsClassifier()

In [39]:
model_knn =knn.fit(x_train,y_train) 
pred_knn = knn.predict(x_test)
a4 = accuracy_score(y_test, pred_knn)
print("The Accuracy of KNN is ", a4)

The Accuracy of KNN is  0.7596419493866725
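Because the two income classes are unbalanced, it helps to compare these accuracies with a trivial baseline that always predicts the majority class. The sketch below takes the class counts from the `support` column of the classification report above and the accuracies from the printed outputs, rounded to four decimals.

```python
# Test-set class counts, taken from the "support" column of the report above.
support = {0: 6788, 1: 2261}
n_test = sum(support.values())

# Accuracy of a trivial model that always predicts the most common class.
baseline = max(support.values()) / n_test

# Test accuracies printed by the notebook (rounded to four decimals).
accuracies = {
    "decision tree": 0.8118,
    "random forest": 0.8437,
    "logistic regression": 0.7822,
    "knn": 0.7596,
}

print(round(baseline, 4))
for name, acc in accuracies.items():
    # Gain over the baseline, in percentage points.
    print(name, round((acc - baseline) * 100, 2))
```

Seen this way, KNN and logistic regression are only slightly better than always predicting <=50K, while the random forest gains the most.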


# Build an Optimized Random Forest Model with Tuned Hyperparameters from Grid Search

In [40]:
rf_param = {
    "n_estimators": [25,50,100],
    "criterion" : ["gini"],
    "max_depth" : [3,4,5,6],
    "max_features" : ["auto","sqrt","log2"],
    "random_state" : [123]
}

In [41]:
GridSearchCV(rf, rf_param, cv = 5)

GridSearchCV(cv=5, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_estimators': [25, 50, 100], 'criterion': ['gini'], 'max_depth': [3, 4, 5, 6], 'max_features': ['auto', 'sqrt', 'log2'], 'random_state': [123]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
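Grid search trains one model per hyperparameter combination per fold, so it is worth counting the fits before running it. The sketch below enumerates the same grid with `itertools.product`; the count of one extra fit assumes `refit=True`, as shown in the printed `GridSearchCV` settings.

```python
from itertools import product

# Same grid as rf_param in the notebook.
rf_param = {
    "n_estimators": [25, 50, 100],
    "criterion": ["gini"],
    "max_depth": [3, 4, 5, 6],
    "max_features": ["auto", "sqrt", "log2"],
    "random_state": [123],
}
cv_folds = 5

# Every combination of one value per hyperparameter.
combinations = [dict(zip(rf_param, values)) for values in product(*rf_param.values())]

n_candidates = len(combinations)
# One fit per candidate per fold, plus one final refit on all training data (refit=True).
n_fits = n_candidates * cv_folds + 1

print(n_candidates, n_fits)
```

In the scikit-learn version used here, `max_features="auto"` means the square root of the feature count for `RandomForestClassifier`, so the `"auto"` and `"sqrt"` candidates train identical models. Dropping one of them would cut the grid by a third.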

In [42]:
grid =GridSearchCV(rf, rf_param, cv = 5)

In [43]:
grid.fit(x_train,y_train).best_params_

{'criterion': 'gini',
 'max_depth': 6,
 'max_features': 'auto',
 'n_estimators': 100,
 'random_state': 123}

In [44]:
rf1 = RandomForestClassifier(criterion = 'gini',
    max_depth = 6,
    max_features = 'auto',
    n_estimators = 100,
    random_state = 123)
model_rf1 = rf1.fit(x_train,y_train)
pred_rf1 = rf1.predict(x_test)
accuracy_score(y_test, pred_rf1)

0.8462813570560282
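Retyping the best parameters into a new `RandomForestClassifier` works, but a fitted grid search already holds the retrained winner. The sketch below runs on synthetic stand-in data from `make_classification` with a deliberately tiny grid, both of which are assumptions made for illustration. It reads the model from `best_estimator_` and the cross-validated score from `best_score_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the notebook would use x_train, x_test, y_train, y_test.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Deliberately tiny grid so the sketch runs quickly.
param_grid = {"n_estimators": [10, 20], "max_depth": [3, 6]}
grid = GridSearchCV(RandomForestClassifier(random_state=123), param_grid, cv=3)
grid.fit(X_tr, y_tr)

# With refit=True (the default), the best candidate is already retrained on
# the full training split and exposed as best_estimator_.
best_model = grid.best_estimator_
print(grid.best_params_)
print(round(grid.best_score_, 3))              # mean cross-validated accuracy
print(round(best_model.score(X_te, y_te), 3))  # accuracy on the held-out split
```

In the notebook's own search, the selected `max_depth` of 6 and `n_estimators` of 100 are the largest values offered, so a wider grid might find a better setting. That is not tested here.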

# K-Fold Cross-Validation

In [45]:
cross_val_score(tree,x_train,y_train,scoring= "accuracy", cv=10)

array([0.79308712, 0.81392045, 0.7907197 , 0.79782197, 0.79640152,
       0.80729167, 0.80672667, 0.8028436 , 0.79952607, 0.8014218 ])
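Ten separate fold scores are easier to compare across models when summarized as a mean and a spread. The sketch below does this for the decision tree scores printed above, using only the standard library.

```python
from statistics import mean, stdev

# The ten fold accuracies printed by cross_val_score above.
fold_scores = [0.79308712, 0.81392045, 0.7907197, 0.79782197, 0.79640152,
               0.80729167, 0.80672667, 0.8028436, 0.79952607, 0.8014218]

cv_mean = mean(fold_scores)
cv_std = stdev(fold_scores)  # sample standard deviation across folds

print(f"{cv_mean:.3f} +/- {cv_std:.3f}")
```

A small spread across folds suggests the model's accuracy does not depend much on which rows land in the validation fold.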

In [46]:
cross_val_score(tree,x,y,scoring= "accuracy", cv=5).mean()

0.7647386410219651

In [47]:
cross_val_score(rf,x_train,y_train,scoring= "accuracy", cv=5).mean()

0.8439355042558343

In [48]:
cross_val_score(lg,x_train,y_train,scoring= "accuracy", cv=5).mean()

0.7911251894805552

In [49]:
cross_val_score(knn,x_train,y_train,scoring= "accuracy", cv=5).mean()

0.7642686401045582

# Voting Classifier model

In [50]:
from sklearn.ensemble import VotingClassifier

In [51]:
model_vote = VotingClassifier(estimators=[('logistic_regression', lg), ('random_forest', rf), ('knn', knn), ('decision_tree', tree)], voting='soft')
model_vote = model_vote.fit(x_train, y_train)

In [52]:
vote_pred = model_vote.predict(x_test)

In [53]:
a5 =  accuracy_score(y_test, vote_pred)
print("The Accuracy of voting classifier is ", a5)

The Accuracy of voting classifier is  0.8354514310973589


In [54]:
print(classification_report(y_test, vote_pred))

             precision    recall  f1-score   support

          0       0.85      0.94      0.90      6788
          1       0.75      0.51      0.61      2261

avg / total       0.83      0.84      0.82      9049
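The voting classifier above uses `voting='soft'`, which averages the predicted class probabilities of the member models instead of counting their individual votes. The pure-Python sketch below contrasts the two rules on hypothetical probabilities for a single row. The helper functions are illustrative and are not scikit-learn's implementation.

```python
from collections import Counter

def soft_vote(probas):
    """Average the class probabilities of all models and pick the largest."""
    n_classes = len(probas[0])
    avg = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c]), avg

def hard_vote(probas):
    """Let each model vote for its own most likely class and take the majority."""
    votes = [max(range(len(p)), key=lambda c: p[c]) for p in probas]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical [P(class 0), P(class 1)] outputs of three models for one row.
probas = [[0.1, 0.9], [0.6, 0.4], [0.6, 0.4]]

soft_label, avg = soft_vote(probas)
hard_label = hard_vote(probas)
print(soft_label, [round(a, 3) for a in avg])
print(hard_label)
```

In this example one confident model outweighs two lukewarm ones under soft voting, while hard voting sides with the majority. This also means soft voting is sensitive to poorly calibrated probabilities from any one member.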



# Ensemble Technique: Bagging

## Try to increase accuracy by applying the bagging ensemble technique to our tuned random forest model

In [55]:
from sklearn.ensemble import BaggingClassifier

In [56]:
bagg = BaggingClassifier(base_estimator=rf1,n_estimators=15)

In [57]:
model_bagg =bagg.fit(x_train,y_train) 
pred_bagg = bagg.predict(x_test)

In [58]:
a6 = accuracy_score(y_test, pred_bagg)
print("The Accuracy of BAGGING is ", a6)

The Accuracy of BAGGING is  0.8458393192617969


In [59]:
confusion_matrix(y_test,pred_bagg)

array([[6496,  292],
       [1103, 1158]], dtype=int64)

In [60]:
print(classification_report(y_test, pred_bagg))

             precision    recall  f1-score   support

          0       0.85      0.96      0.90      6788
          1       0.80      0.51      0.62      2261

avg / total       0.84      0.85      0.83      9049
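With its default settings, `BaggingClassifier` trains each of its 15 copies of the model on a bootstrap sample, meaning rows drawn with replacement from the training set. The sketch below draws one such sample using only the standard library, with a hypothetical row count and a hypothetical helper name, to show how much of the data a single copy typically sees.

```python
import random

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices uniformly at random, with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

rng = random.Random(0)       # fixed seed so the sketch is repeatable
n_rows = 10_000              # hypothetical training-set size

sample = bootstrap_indices(n_rows, rng)
unique_fraction = len(set(sample)) / n_rows

# Each bootstrap sample repeats some rows and leaves others out entirely.
print(len(sample), round(unique_fraction, 3))
```

For large samples the expected share of distinct rows is about 1 - 1/e, or roughly 63%. The random forest already bootstraps its own trees (`bootstrap=True` in the printed settings), so wrapping it in bagging adds a second layer of resampling. This may be why the accuracy here barely moves.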



# Ensemble Technique: AdaBoost

## Try to increase accuracy by applying the AdaBoost ensemble technique to our tuned random forest model

In [61]:
from sklearn.ensemble import AdaBoostClassifier

In [62]:
Adaboost = AdaBoostClassifier(base_estimator=rf1, n_estimators=15)

In [63]:
model_boost =Adaboost.fit(x_train,y_train) 
pred_boost = Adaboost.predict(x_test)

In [64]:
a7 = accuracy_score(y_test, pred_boost)
print("The Accuracy of BOOSTING is ", a7)

The Accuracy of BOOSTING is  0.8663940766935573
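Boosting differs from bagging in that each new model focuses on the rows the previous one got wrong. The sketch below shows one round of the classic discrete AdaBoost weight update for two classes, on hypothetical weights. It illustrates the idea only and may differ from the variant this notebook's scikit-learn version uses internally.

```python
import math

def adaboost_reweight(weights, misclassified):
    """One round of the discrete AdaBoost weight update for two classes."""
    # Weighted error of the current model.
    err = sum(w for w, miss in zip(weights, misclassified) if miss) / sum(weights)
    # Weight of this model in the final vote; larger when the error is smaller.
    alpha = math.log((1 - err) / err)
    # Up-weight the rows the model got wrong, then renormalize to sum to 1.
    raw = [w * math.exp(alpha) if miss else w for w, miss in zip(weights, misclassified)]
    total = sum(raw)
    return alpha, [w / total for w in raw]

# Hypothetical round: five equally weighted rows, the last one misclassified.
weights = [0.2] * 5
misclassified = [False, False, False, False, True]

alpha, new_weights = adaboost_reweight(weights, misclassified)
print(round(alpha, 4), [round(w, 3) for w in new_weights])
```

After the update the single misclassified row carries as much weight as the other four together, so the next model is pushed to get it right.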


In [65]:
confusion_matrix(y_test,pred_boost)

array([[6394,  394],
       [ 815, 1446]], dtype=int64)

In [66]:
print(classification_report(y_test, pred_boost))

             precision    recall  f1-score   support

          0       0.89      0.94      0.91      6788
          1       0.79      0.64      0.71      2261

avg / total       0.86      0.87      0.86      9049
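Accuracy differences in the third decimal place are easier to judge as counts of test rows. The sketch below converts the accuracies printed in this notebook for the tuned random forest, bagging, and AdaBoost into numbers of correct predictions on the 9049-row test split.

```python
n_test = 9049  # rows in the test split, from the classification reports

# Test accuracies printed by the notebook.
accuracies = {
    "tuned random forest": 0.8462813570560282,
    "bagging": 0.8458393192617969,
    "adaboost": 0.8663940766935573,
}

# Accuracy times the number of test rows gives the count of correct predictions.
correct = {name: round(acc * n_test) for name, acc in accuracies.items()}

reference = correct["tuned random forest"]
for name, n_correct in correct.items():
    print(name, n_correct, n_correct - reference)
```

On this split bagging gets a few rows fewer right than the tuned forest alone, while AdaBoost gets noticeably more. The gain comes mostly from class 1, whose recall rises from 0.51 under bagging to 0.64.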

