## ML Classification Assignment | Shivam Negi

This data was extracted from the Census Bureau database found at
http://www.census.gov/ftp/pub/DES/www/welcome.html

Donor: Ronny Kohavi and Barry Becker, Data Mining and Visualization,
Silicon Graphics (e-mail: ronnyk@sgi.com for questions).

The data was split into train and test sets using MLC++ GenCVFiles (2/3, 1/3 random).
The full dataset has 48842 instances with a mix of continuous and discrete attributes (train=32561, test=16281),
or 45222 instances if those with unknown values are removed (train=30162, test=15060).
Duplicate or conflicting instances: 6.

Class probabilities for the adult.all file:

#### Probability for the label '>50K' : 23.93% / 24.78% (without unknowns)
#### Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)

Extraction was done by Barry Becker from the 1994 Census database. A set of
reasonably clean records was extracted using the following conditions:
((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)).

The prediction task is to determine whether a person makes over 50K a year.
The original data was converted as follows:

1. Discretize gross income into two ranges with a threshold of 50,000.
2. Convert U.S. to US to avoid periods.
3. Convert Unknown to "?".
4. Run MLC++ GenCVFiles to generate data, test.
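
The first three conversion steps above can be sketched in pandas. This is a minimal illustration, not the original MLC++ pipeline: the frame, its column names (`gross_income`, `native_country`, `wage_class`) and its values are hypothetical.

```python
import pandas as pd

# Hypothetical raw records; column names and values are illustrative only.
raw = pd.DataFrame({
    "gross_income": [32000, 50000, 81000],
    "native_country": ["U.S.", "Unknown", "Cuba"],
})

# Step 1: discretize income into two ranges around the 50,000 threshold.
raw["wage_class"] = raw["gross_income"].gt(50000).map({True: ">50K", False: "<=50K"})

# Steps 2 and 3: drop the periods from "U.S." and mark unknown values with "?".
raw["native_country"] = raw["native_country"].replace({"U.S.": "US", "Unknown": "?"})

print(raw)
```

An income of exactly 50,000 falls into the '<=50K' range here, which matches the two label names used in the dataset.
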

#### Description of fnlwgt (final weight)

The weights on the CPS files are controlled to independent estimates of the civilian
noninstitutional population of the US. These are prepared monthly for us by Population
Division here at the Census Bureau. We use 3 sets of controls.
These are:

1. A single cell estimate of the population 16+ for each state.
2. Controls for Hispanic Origin by age and sex.
3. Controls by Race, age and sex.

We use all three sets of controls in our weighting program and "rake" through them 6
times so that by the end we come back to all the controls we used.
The term estimate refers to population totals derived from CPS by creating "weighted
tallies" of any specified socio-economic characteristics of the population. People with
similar demographic characteristics should have similar weights. There is one important
caveat to remember about this statement: since the CPS sample is actually a
collection of 51 state samples, each with its own probability of selection, the statement
only applies within a state.
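
The "raking" mentioned above is commonly implemented as iterative proportional fitting: weights are rescaled to match one set of control totals, then the next, and the cycle is repeated. The sketch below is a simplified two-dimensional illustration with made-up numbers, not the Census Bureau's weighting program; the function name `rake`, the sample table, and the target totals are all assumptions.

```python
import numpy as np

def rake(weights, row_targets, col_targets, passes=6):
    """Alternately rescale rows and columns so the margins approach the targets."""
    w = np.asarray(weights, dtype=float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)
    for _ in range(passes):
        w *= (row_targets / w.sum(axis=1))[:, None]  # match the row totals
        w *= (col_targets / w.sum(axis=0))[None, :]  # match the column totals
    return w

# Hypothetical sample counts cross-classified by two demographic dimensions.
sample = np.array([[20.0, 30.0],
                   [25.0, 25.0]])
row_targets = np.array([480.0, 520.0])  # assumed population totals per row
col_targets = np.array([450.0, 550.0])  # assumed population totals per column

raked = rake(sample, row_targets, col_targets)
print(raked.sum(axis=1), raked.sum(axis=0))
```

Six passes mirror the "rake through them 6 times" wording; in this toy table both margins are already very close to their targets by then.
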

#### Dataset Link
https://archive.ics.uci.edu/ml/machine-learning-databases/adult/

#### Problem 1:
Prediction task is to determine whether a person makes over 50K a year.

#### Problem 2:
Which factors are important?

#### Problem 3:
Which algorithms are best for this dataset?

### Problem 1

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import auc, accuracy_score, confusion_matrix, mean_squared_error
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold, RandomizedSearchCV, train_test_split
import xgboost as xgb

train_data = pd.read_csv('adult.data', header = None)
test_data = pd.read_csv('adult.test' , skiprows = 1, header = None)
col_labels = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship', 'race',
              'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'wage_class']
train_data.columns = col_labels
test_data.columns = col_labels

train_data

| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | wage_class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32556 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States | <=50K |
| 32557 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 32558 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 32559 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |


In [5]:
print("------------------------------------------------------------------------------\nSTATISTICAL DESCRIPTION\n\n")
print(train_data.describe())
print("------------------------------------------------------------------------------\nSHAPE\n\n")
print(train_data.shape)
print("------------------------------------------------------------------------------\nDTYPES\n\n")
print(train_data.dtypes)
print("------------------------------------------------------------------------------\nUNIQUE COUNT\n\n")
print(train_data.nunique())
print("------------------------------------------------------------------------------\nCOLUMNS\n\n")
print(train_data.columns)
print("------------------------------------------------------------------------------\nCORRELATION\n\n")
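#numeric_only=True keeps corr() from failing on the string columns in recent pandas versions
print(train_data.corr(numeric_only=True))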
print("------------------------------------------------------------------------------\nMISSING VALUES\n\n")
print(train_data.isnull().sum())

------------------------------------------------------------------------------
STATISTICAL DESCRIPTION


                age        fnlwgt  education_num  capital_gain  capital_loss  \
count  32561.000000  3.256100e+04   32561.000000  32561.000000  32561.000000   
mean      38.581647  1.897784e+05      10.080679   1077.648844     87.303830   
std       13.640433  1.055500e+05       2.572720   7385.292085    402.960219   
min       17.000000  1.228500e+04       1.000000      0.000000      0.000000   
25%       28.000000  1.178270e+05       9.000000      0.000000      0.000000   
50%       37.000000  1.783560e+05      10.000000      0.000000      0.000000   
75%       48.000000  2.370510e+05      12.000000      0.000000      0.000000   
max       90.000000  1.484705e+06      16.000000  99999.000000   4356.000000   

       hours_per_week  
count    32561.000000  
mean        40.437456  
std         12.347429  
min          1.000000  
25%         40.000000  
50%         40.000000  
75%   
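
The MISSING VALUES check in the cell above relies on `isnull()`, but the dataset description says unknown values were converted to "?", so they would not be counted as nulls. A minimal sketch of counting those placeholders follows. It uses a tiny stand-in frame rather than `train_data`, and it assumes the raw text values may carry a leading space, which is why they are stripped first.

```python
import pandas as pd

# Tiny stand-in frame; in the notebook this role would be played by train_data.
df = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov"],
    "occupation": [" ?", " ?", " Sales"],
    "age": [39, 50, 38],
})

# Text columns to inspect (listed explicitly for this illustration).
text_cols = ["workclass", "occupation"]

# Strip padding, then count the "?" placeholders in each column.
unknown_counts = df[text_cols].apply(lambda s: s.str.strip().eq("?").sum())

print(unknown_counts)
print(df.isnull().sum())  # reports zero nulls even though unknowns exist
```
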

In [6]:
col_num = ['age', 'fnlwgt' , 'education_num', 'capital_gain' , 'capital_loss' , 'hours_per_week']
categorical = [ 'workclass', 'education', 'marital_status', 'occupation','relationship', 'race', 'sex','wage_class']

#Label Encoding
le = LabelEncoder()
for col in categorical:
    train_data[col] = le.fit_transform(train_data[col])

y = train_data['wage_class']
train_data = train_data.drop(['wage_class','native_country'], axis = 1)
X = train_data
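
The loop above refits a single `LabelEncoder` for every column, so the fitted mappings are overwritten as it goes. If the same encoding later has to be applied to `test_data`, one option is to keep a separate encoder per column. The sketch below shows that idea on small hypothetical frames; the `encoders` dictionary and the sample values are assumptions, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature train/test frames with two categorical columns.
train = pd.DataFrame({"sex": ["Male", "Female", "Male"],
                      "race": ["White", "Black", "White"]})
test = pd.DataFrame({"sex": ["Female", "Male"],
                     "race": ["Black", "White"]})

encoders = {}
for col in ["sex", "race"]:
    # Fit on the training column only, then reuse the same mapping for test.
    encoders[col] = LabelEncoder().fit(train[col])
    train[col] = encoders[col].transform(train[col])
    test[col] = encoders[col].transform(test[col])

print(train)
print(test)
print(list(encoders["race"].classes_))
```

Note that `transform` raises an error for categories it did not see during `fit`, so this only works as written when the test set contains no new labels.
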

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

## Model Training

In [8]:
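#Wrap the splits in DMatrix, XGBoost's native data container
dmat=xgb.DMatrix(X_train,y_train)
test_dmat=xgb.DMatrix(X_test)

#Shallow trees with row subsampling and strong L2 regularization
final_p={'colsample_bytree': 1.0,
         'max_depth': 3,
         'min_child_weight': 0,
         'subsample': 0.5,
         'reg_lambda': 100.0,
         'objective':'binary:logistic',
         'eta': 0.1
        }

#Train for a fixed 837 boosting rounds
final_clf=xgb.train(params=final_p,dtrain=dmat,num_boost_round=837)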



## Model Evaluation

In [9]:
pred=final_clf.predict(test_dmat)
print(pred)
pred[pred > 0.5 ] = 1
pred[pred <= 0.5] = 0
print(pred)
print(accuracy_score(y_test,pred)*100)



[0.01042143 0.0327157  0.1666489  ... 0.9837209  0.21541658 0.0026273 ]
[0. 0. 0. ... 1. 0. 0.]
86.25243115979117
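
Because roughly 76% of the records carry the '<=50K' label, a model that always predicts the majority class would already be about 76% accurate, so the accuracy above is easier to judge next to precision and recall. The helper below is a minimal sketch of that comparison; `binary_report` and the toy labels are hypothetical and the numbers are not results from this notebook.

```python
import numpy as np

def binary_report(y_true, y_prob, threshold=0.5):
    """Accuracy, precision, recall and the majority-class baseline."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    positive_rate = float(np.mean(y_true))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "majority_baseline": max(positive_rate, 1.0 - positive_rate),
    }

# Toy labels and predicted probabilities, for illustration only.
report = binary_report([0, 0, 0, 1, 1, 0], [0.1, 0.6, 0.2, 0.9, 0.4, 0.3])
print(report)
```

In the notebook, the same function could be called with `y_test` and the raw probabilities from `final_clf.predict(test_dmat)`, taken before they are overwritten by the thresholding step.
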


### Problem 2 | Important Features

In [10]:
final_clf.get_score(importance_type='gain')

{'relationship': 31.02135889180455,
 'education_num': 15.640361863235018,
 'capital_gain': 15.868374233108117,
 'hours_per_week': 4.575870823506827,
 'age': 4.53134963893835,
 'fnlwgt': 1.9398026767249887,
 'marital_status': 6.968755884104636,
 'occupation': 3.8354956765669694,
 'capital_loss': 4.987470732825,
 'workclass': 2.2422549917513197,
 'education': 2.1530877917229905,
 'race': 1.7644120225232136,
 'sex': 3.672781610959493}

### Important Features (by gain):
  #### 1. Relationship
  #### 2. Capital Gain
  #### 3. Education (education_num)
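
The ranking can be read directly from the gain scores by sorting the dictionary returned by `get_score`. The sketch below copies the values printed above into a plain dictionary (`gain`) so that it runs without the trained model.

```python
# Gain scores copied from the get_score(importance_type='gain') output above.
gain = {
    'relationship': 31.02135889180455,
    'education_num': 15.640361863235018,
    'capital_gain': 15.868374233108117,
    'hours_per_week': 4.575870823506827,
    'age': 4.53134963893835,
    'fnlwgt': 1.9398026767249887,
    'marital_status': 6.968755884104636,
    'occupation': 3.8354956765669694,
    'capital_loss': 4.987470732825,
    'workclass': 2.2422549917513197,
    'education': 2.1530877917229905,
    'race': 1.7644120225232136,
    'sex': 3.672781610959493,
}

# Sort features from highest to lowest gain and show the top three.
ranked = sorted(gain.items(), key=lambda kv: kv[1], reverse=True)
for name, value in ranked[:3]:
    print(f"{name}: {value:.2f}")
# relationship: 31.02
# capital_gain: 15.87
# education_num: 15.64
```

Gain is only one of the importance types XGBoost reports, so the order may differ under other types such as `weight` or `cover`.
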

In [2]:
#import libraries

from pycaret.classification import *

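#define target label and parameters
#note: train_data must still contain 'wage_class' here, so this cell runs before that column is dropped
exp1 = setup(train_data, target = 'wage_class', feature_selection = True)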

| | Description | Value |
|---|---|---|
| 0 | session_id | 4000 |
| 1 | Target | wage_class |
| 2 | Target Type | Binary |
| 3 | Label Encoded | <=50K: 0, >50K: 1 |
| 4 | Original Data | (32561, 15) |
| 5 | Missing Values | False |
| 6 | Numeric Features | 5 |
| 7 | Categorical Features | 9 |
| 8 | Ordinal Features | False |
| 9 | High Cardinality Features | False |


### Problem 3 | Model Selection

In [3]:
compare_models(fold = 5, turbo = True)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| catboost | CatBoost Classifier | 0.873 | 0.9288 | 0.6532 | 0.7862 | 0.7135 | 0.6328 | 0.6373 | 37.786 |
| lightgbm | Light Gradient Boosting Machine | 0.8715 | 0.9267 | 0.6548 | 0.7793 | 0.7116 | 0.6298 | 0.6337 | 0.944 |
| xgboost | Extreme Gradient Boosting | 0.8704 | 0.9255 | 0.6521 | 0.7771 | 0.709 | 0.6265 | 0.6306 | 9.244 |
| gbc | Gradient Boosting Classifier | 0.8632 | 0.9188 | 0.5856 | 0.7953 | 0.6745 | 0.5904 | 0.6015 | 7.846 |
| ada | Ada Boost Classifier | 0.8603 | 0.9146 | 0.6215 | 0.7578 | 0.6829 | 0.5944 | 0.5992 | 2.358 |
| rf | Random Forest Classifier | 0.8548 | 0.9047 | 0.6188 | 0.739 | 0.6736 | 0.5811 | 0.5849 | 3.582 |
| lda | Linear Discriminant Analysis | 0.84 | 0.8928 | 0.566 | 0.7141 | 0.6315 | 0.5311 | 0.5369 | 1.002 |
| ridge | Ridge Classifier | 0.8394 | 0.0 | 0.5106 | 0.746 | 0.6062 | 0.5097 | 0.5242 | 0.258 |
| et | Extra Trees Classifier | 0.833 | 0.8786 | 0.5967 | 0.6757 | 0.6337 | 0.526 | 0.5278 | 4.526 |
| dt | Decision Tree Classifier | 0.8147 | 0.7501 | 0.6249 | 0.6156 | 0.6202 | 0.4976 | 0.4977 | 0.59 |


<catboost.core.CatBoostClassifier at 0x1b862c1c4c8>

### Top 3 Models for the Dataset:
   #### 1. CatBoost
   #### 2. LightGBM
   #### 3. XGBoost