## <center>Problem Statement</center>

In this assignment, students need to predict whether a person makes over 50K per year
from the classic Adult dataset using XGBoost. The dataset is described as
follows:

Data Set Information:
Extraction was done by Barry Becker from the 1994 Census database. A set of
reasonably clean records was extracted using the following conditions: ((AAGE>16) &&
(AGI>100) && (AFNLWGT>1) && (HRSWK>0))

Attribute Information:<br>
Target classes (wage_class): >50K, <=50K.

age: continuous.

workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov,
Without-pay, Never-worked.

fnlwgt: continuous.

education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc,
9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

education-num: continuous.

marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed,
Married-spouse-absent, Married-AF-spouse.

occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial,
Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing,
Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

sex: Female, Male.

capital-gain: continuous.

capital-loss: continuous.

hours-per-week: continuous.

native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany,
Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras,
Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France,
Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala,
Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong,
Holand-Netherlands.

## Import Libraries

In [249]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Import Data

In [299]:
train_set = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header = None)

test_set = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test', skiprows = 1, header = None)

col_labels = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status',
              'occupation','relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week',
              'native_country', 'wage_class']
                       
train_set.columns = col_labels
test_set.columns = col_labels

## Data Exploration

In [300]:
train_set.head()

| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | wage_class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [301]:
train_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education_num     32561 non-null int64
marital_status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital_gain      32561 non-null int64
capital_loss      32561 non-null int64
hours_per_week    32561 non-null int64
native_country    32561 non-null object
wage_class        32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [302]:
train_set.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
wage_class        0
dtype: int64

In [303]:
cat_var=train_set.select_dtypes(include='object').columns.values
cat_var

array(['workclass', 'education', 'marital_status', 'occupation',
       'relationship', 'race', 'sex', 'native_country', 'wage_class'],
      dtype=object)

* `isnull()` reports no null values in the training dataset; entries recorded as `?` are ordinary strings and are not counted as null
* <b>Categorical variables</b> are 'workclass', 'education', 'marital_status', 'occupation',
'relationship', 'race', 'sex', 'native_country', 'wage_class'
* <b>wage_class</b>, the target variable, needs to be converted to integer labels (0/1) in both the training and test datasets
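
Because `isnull()` only detects real null values, a separate check is needed for placeholder strings. The following is a minimal sketch on a small hypothetical frame (the rows are made up for illustration and are not taken from the dataset); it assumes unknown values appear as `?` with a leading space, as the raw strings in this notebook do.

```python
import pandas as pd

# Hypothetical sample mimicking the raw file: text values carry a leading
# space and unknown entries are written as '?'.
sample = pd.DataFrame({
    'workclass': [' Private', ' ?', ' State-gov'],
    'occupation': [' Sales', ' ?', ' ?'],
    'age': [39, 18, 50],
})

# isnull() finds nothing, because '?' is an ordinary string.
null_counts = sample.isnull().sum()

# Count '?' placeholders per text column after stripping whitespace.
text_cols = ['workclass', 'occupation']
placeholder_counts = sample[text_cols].apply(
    lambda s: (s.str.strip() == '?').sum()
)

print(null_counts.sum())
print(placeholder_counts)
```

The same pattern could be applied to `train_set` and `test_set` by passing the names of their text columns.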


### Convert wage_class into integers

In [304]:
#Check unique values for wage_class in training dataset

train_set['wage_class'].unique()

array([' <=50K', ' >50K'], dtype=object)

In [305]:
#Convert the wage_class into integers under 'wage'

train_set['wage']=train_set['wage_class'].replace([' <=50K',' >50K'],[0,1])
train_set.head(10)

| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | wage_class | wage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K | 0 |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K | 0 |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K | 0 |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K | 0 |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K | 0 |
| 5 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K | 0 |
| 6 | 49 | Private | 160187 | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | <=50K | 0 |
| 7 | 52 | Self-emp-not-inc | 209642 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | >50K | 1 |
| 8 | 31 | Private | 45781 | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States | >50K | 1 |
| 9 | 42 | Private | 159449 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States | >50K | 1 |


In [306]:
# Check unique values for 'wage_class' in test_set

test_set['wage_class'].unique()

array([' <=50K.', ' >50K.'], dtype=object)

In [307]:
# Convert wage_class to integers

test_set['wage']=test_set['wage_class'].replace([' <=50K.', ' >50K.'],[0,1])
test_set.head()

| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | wage_class | wage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K. | 0 |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K. | 0 |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K. | 1 |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K. | 1 |
| 4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K. | 0 |
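
The training labels (`' <=50K'`, `' >50K'`) and the test labels (`' <=50K.'`, `' >50K.'`) differ only in a trailing period, so one helper could handle both files. The sketch below is one possible way to do this; `encode_wage_class` is a hypothetical helper name, not part of the original notebook.

```python
import pandas as pd

def encode_wage_class(labels: pd.Series) -> pd.Series:
    """Map raw wage_class strings to 0/1 (hypothetical helper)."""
    # Remove surrounding whitespace, then the trailing '.' used in the test file.
    cleaned = labels.str.strip().str.rstrip('.')
    return cleaned.map({'<=50K': 0, '>50K': 1})

# Label formats as they appear in the two files.
train_labels = pd.Series([' <=50K', ' >50K'])
test_labels = pd.Series([' <=50K.', ' >50K.'])

train_encoded = encode_wage_class(train_labels).tolist()
test_encoded = encode_wage_class(test_labels).tolist()
print(train_encoded, test_encoded)
```

Any label that matches neither key maps to `NaN`, which makes an unexpected spelling easy to spot.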


### Convert the categorical data into integers

In [308]:
for col in train_set.columns: 
    if train_set[col].dtype == 'object': 
        train_set[col] = pd.Categorical(train_set[col]).codes 

for col in test_set.columns: 
    if test_set[col].dtype == 'object': 
        test_set[col] = pd.Categorical(test_set[col]).codes
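
One caveat with encoding the two datasets independently: `pd.Categorical(...).codes` numbers the categories found in each column separately, so if a value occurs in one dataset but not the other, the same string can receive different codes in training and test. The sketch below illustrates this on made-up columns and shows one possible remedy, which is to reuse the training categories for the test column. It is an illustration under these assumptions, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical columns: the training column has one category the test column lacks.
train_col = pd.Series([' Cuba', ' Holand-Netherlands', ' United-States'])
test_col = pd.Series([' United-States', ' Cuba'])

# Independent encoding: each column is numbered by its own sorted categories.
train_codes_independent = pd.Categorical(train_col).codes
test_codes_independent = pd.Categorical(test_col).codes

# Shared encoding: take the categories from the training column and reuse them.
categories = pd.Categorical(train_col).categories
train_codes = pd.Categorical(train_col, categories=categories).codes
test_codes = pd.Categorical(test_col, categories=categories).codes

print(list(test_codes_independent))  # ' United-States' gets a different code than in training
print(list(test_codes))              # codes now agree with the training column
```

With shared categories, a test value never seen in training is encoded as -1 instead of silently shifting the other codes.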

In [309]:
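# X_train is only created in the next cell, so list the feature columns from train_set
train_set.drop(columns=['wage_class', 'wage']).columns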

Index(['age', 'workclass', 'fnlwgt', 'education', 'education_num',
       'marital_status', 'occupation', 'relationship', 'race', 'sex',
       'capital_gain', 'capital_loss', 'hours_per_week', 'native_country'],
      dtype='object')

### Prepare X (independent variables) and y (target variable) in both training and test datasets

In [310]:
X_train=train_set.drop(columns=['wage_class','wage'])
X_test=test_set.drop(columns=['wage_class','wage'])
y_train = train_set.pop('wage')
y_test = test_set.pop('wage')

In [311]:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

## Model Building

### Model # 1

In [312]:
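# Hyperparameters searched by the grid (3 x 3 = 9 combinations)
cv = {'max_depth': [3, 5, 7], 'min_child_weight': [1, 3, 5]}
# Hyperparameters held fixed for every candidate
ind = {'learning_rate': 0.1, 'n_estimators': 20, 'seed': 0, 'subsample': 0.8,
       'colsample_bytree': 0.8, 'objective': 'binary:logistic'}
# 5-fold cross-validated grid search scored by accuracy, using all CPU cores
grid = GridSearchCV(xgb.XGBClassifier(**ind), cv,
                    scoring='accuracy', cv=5, n_jobs=-1)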

In [313]:
model=grid.fit(X_train,y_train)

In [265]:
model.best_params_

{'max_depth': 7, 'min_child_weight': 1}

In [316]:
print('Accuracy Score Using This Model = {}'.format(grid.best_score_))

Accuracy Score Using This Model = 0.8607229507693253
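
The reported `best_score_` is the mean cross-validated accuracy of the best parameter combination, not a score on the test set. To make the cost of the search concrete, the sketch below enumerates the same grid with plain Python; it only counts candidates and fits and does not train any model.

```python
from itertools import product

# Same search space and fold count as the first grid search.
param_grid = {'max_depth': [3, 5, 7], 'min_child_weight': [1, 3, 5]}
n_folds = 5

# Every combination of the listed values is one candidate model.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is fitted once per fold during cross-validation.
n_cv_fits = len(candidates) * n_folds
print(len(candidates), n_cv_fits)
```

With scikit-learn's default `refit=True`, one further fit of the best candidate on the full training set follows the cross-validation fits.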


### Model # 2

In [317]:
# Search learning rate and row subsampling (2 x 3 = 6 combinations)
cv_1 = {'learning_rate': [0.1, 0.01], 'subsample': [0.7, 0.8, 0.9]}
# Held fixed for every candidate; note max_depth is 3 here, not the 7 selected by the first search
ind_1 = {'n_estimators': 20, 'seed': 0, 'colsample_bytree': 0.8,
         'objective': 'binary:logistic', 'max_depth': 3, 'min_child_weight': 1}

grid_1 = GridSearchCV(xgb.XGBClassifier(**ind_1), cv_1,
                      scoring='accuracy', cv=5, n_jobs=-1)
model_1 = grid_1.fit(X_train, y_train)

In [318]:
print('Accuracy Score Using This Model = {}'.format(model_1.best_score_))

Accuracy Score Using This Model = 0.8491139707011456


## Observation
We observe that the first model has a higher cross-validated accuracy (0.8607) than the second (0.8491), so we use the first model to make predictions on the test set.

## Model Validation

In [270]:
from sklearn.metrics import accuracy_score
y_pred = model.predict(X_test)  # Predict class labels for the test set with the best model from the first grid search

In [271]:
# predict() already returns 0/1 class labels, so this thresholding leaves y_pred unchanged
y_pred[y_pred > 0.5] = 1
y_pred[y_pred <= 0.5] = 0
y_pred

array([0, 0, 0, ..., 1, 0, 1], dtype=int64)
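
A 0.5 cutoff changes the result only when it is applied to probabilities, for example those returned by a classifier's `predict_proba`. The sketch below uses a made-up probability array to show the thresholding step on its own; the values are hypothetical and do not come from the fitted model.

```python
import numpy as np

# Hypothetical positive-class probabilities (e.g. one column of predict_proba output).
proba = np.array([0.12, 0.50, 0.51, 0.93])

# Values strictly above 0.5 become class 1, everything else class 0.
labels = (proba > 0.5).astype(int)
print(labels)
```

Working from probabilities also makes it possible to try cutoffs other than 0.5 if precision and recall need to be traded off.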

In [321]:
acc = accuracy_score(y_test, y_pred)  # scikit-learn's signature is (y_true, y_pred)
print('Accuracy achieved while applying the model (i.e., the first model) to the test dataset = {}'.format(acc))

Accuracy achieved while applying the model (i.e., the first model) to the test dataset = 0.8626619986487316
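
Accuracy is easier to interpret next to a baseline, since the two wage classes are not equally frequent. The sketch below computes accuracy and the accuracy of always predicting the most common class, using plain Python on made-up labels; `majority_baseline_accuracy` and `accuracy` are hypothetical helper names, and the demo lists are not taken from the dataset.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(y_true)
    return max(counts.values()) / len(y_true)

# Hypothetical labels for illustration only.
y_true_demo = [0, 0, 0, 1, 1, 0, 0, 1]
y_pred_demo = [0, 0, 1, 1, 0, 0, 0, 1]

model_acc = accuracy(y_true_demo, y_pred_demo)
baseline_acc = majority_baseline_accuracy(y_true_demo)
print(model_acc, baseline_acc)
```

Calling `majority_baseline_accuracy(list(y_test))` would give the corresponding baseline for the test set used above.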
