# US Census Model

## Introduction

In this report I show the different models I tried for predicting which people have an income higher than 50K.

In [1]:
import numpy as np
import pandas as pd
import sklearn.metrics
from sklearn.preprocessing import scale
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier

In [2]:
# Load the learning set
learn_X = pd.read_csv("./data/census_income_learn_data.csv")
learn_Y = pd.read_csv("./data/census_income_learn_label.csv")

In [3]:
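# Check the independence of our features: count the feature pairs with correlation above 0.65
# (subtract the 24 diagonal entries, then halve because the correlation matrix is symmetric).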
((learn_X.corr() > 0.65).values.sum() - 24) / 2

6.0

We can see that 6 pairs of features, among the 24 features, have a correlation above 0.65 with each other.
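The count above says how many pairs are strongly correlated but not which ones. A minimal sketch of how the pairs could be listed, assuming a hypothetical helper `correlated_pairs` and toy data in place of `learn_X`:

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame, threshold=0.65):
    """Return (feature_a, feature_b, correlation) for each pair above threshold."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (always 1.0) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs > threshold]
    return [(a, b, float(value)) for (a, b), value in pairs.items()]


# Toy data (hypothetical): "b" is an exact multiple of "a", "c" is unrelated.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [1.0, -1.0, 1.0, -1.0, 1.0],
})
toy_pairs = correlated_pairs(toy)
print(toy_pairs)
```

Calling `correlated_pairs(learn_X)` would presumably return the 6 pairs counted above. Like the original cell, this sketch only looks at positive correlations.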

In [4]:
learn_X.corrwith(learn_Y.label)

age                                                   0.135720
wage_per_hour                                         0.024528
capital_gains                                         0.240725
capital_losses                                        0.147417
divdends_from_stocks                                  0.175779
num_persons_worked_for_employer                       0.222684
weeks_worked_in_year                                  0.262316
fill_inc_questionnaire_for_veteran_admin_yes_or_no    0.022586
public_worker                                         0.079106
private_worker                                        0.123361
well_paid_occupation                                  0.328371
university_degree                                     0.214652
no_university                                        -0.206469
is_female                                            -0.157610
veterans_benefits_1_or_2                              0.143601
married_civilian_spouse_present                       0

We can see that every feature shown is only weakly correlated with the class label; the strongest, well_paid_occupation, reaches only 0.33.

In [5]:
# Load the test dataset
test_X = pd.read_csv("./data/census_income_test_data.csv")
test_Y = pd.read_csv("./data/census_income_test_label.csv")

In [39]:
print(sklearn.metrics.classification_report(test_Y.values, np.random.choice([-1,1], len(test_Y), p=[0.938, 0.062])))

             precision    recall  f1-score   support

         -1       0.94      0.94      0.94     93576
          1       0.06      0.06      0.06      6186

avg / total       0.88      0.88      0.88     99762



In this example I simply assign the labels at random, using the same class distribution as the training data (93.8% / 6.2%). This dummy classifier can be used as a baseline.
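The same kind of baseline can also be obtained from scikit-learn's `DummyClassifier`, which is imported at the top of the notebook. The sketch below is only an illustration on synthetic labels (all `toy_*` names are hypothetical), not the cell that produced the report above:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

# Synthetic stand-in for the labels (hypothetical data): about 6% positives.
rng = np.random.RandomState(0)
toy_y_train = np.where(rng.rand(5000) < 0.062, 1, -1)
toy_y_test = np.where(rng.rand(5000) < 0.062, 1, -1)
# DummyClassifier ignores the features, so zeros are enough here.
toy_X_train = np.zeros((len(toy_y_train), 1))
toy_X_test = np.zeros((len(toy_y_test), 1))

# "stratified" draws predictions from the class distribution seen in fit().
baseline = DummyClassifier(strategy="stratified", random_state=0)
baseline.fit(toy_X_train, toy_y_train)
baseline_pred = baseline.predict(toy_X_test)
print(classification_report(toy_y_test, baseline_pred))
```

With the `stratified` strategy the class proportions are learned from the training labels instead of being typed in by hand.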

## Decision Tree and Random Forest

The first experiment uses a decision tree to build a classifier. We will also try a random forest classifier.

In [27]:
decision_tree = tree.DecisionTreeClassifier()
decision_tree = decision_tree.fit(learn_X.values, learn_Y.values)
print(sklearn.metrics.classification_report(test_Y.values, decision_tree.predict(test_X.values)))

             precision    recall  f1-score   support

         -1       0.96      0.97      0.96     93576
          1       0.47      0.44      0.45      6186

avg / total       0.93      0.93      0.93     99762



We can see that the precision (0.47), recall (0.44) and F1 score (0.45) for people with an income above 50K are well above the baseline, but still not good enough.
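As a reminder of how these numbers relate, the F1 score is the harmonic mean of precision and recall. A small sketch (the helper name `f1_from` is mine) that recomputes it from the rounded values in the report above:

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall reported above for class 1 (decision tree).
tree_f1 = f1_from(0.47, 0.44)
print(round(tree_f1, 2))  # 0.45, matching the f1-score column
```

Because the harmonic mean is pulled towards the smaller of the two values, a model cannot get a good F1 score by trading all of its recall for precision, or the reverse.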

In [28]:
decision_tree_res = pd.DataFrame(decision_tree.feature_importances_,
                                 index=learn_X.columns.values,
                                 columns=['importance'])
decision_tree_res.sort_values('importance', ascending=False)

| feature | importance |
|---|---|
| age | 0.204061 |
| divdends_from_stocks | 0.134219 |
| well_paid_occupation | 0.123811 |
| capital_gains | 0.120739 |
| num_persons_worked_for_employer | 0.06139 |
| is_female | 0.052977 |
| weeks_worked_in_year | 0.046671 |
| capital_losses | 0.038842 |
| householders | 0.027171 |
| wage_per_hour | 0.02507 |


In [29]:
random_forest = RandomForestClassifier()
random_forest = random_forest.fit(learn_X, learn_Y.label)
print(sklearn.metrics.classification_report(test_Y.values, random_forest.predict(test_X.values)))

             precision    recall  f1-score   support

         -1       0.96      0.98      0.97     93576
          1       0.59      0.40      0.48      6186

avg / total       0.94      0.95      0.94     99762



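We get slightly better results using the random forest classifier: the F1 score for the positive class rises from 0.45 to 0.48, with higher precision (0.59) but lower recall (0.40).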

In [30]:
random_forest_res = pd.DataFrame(random_forest.feature_importances_,
                                 index=learn_X.columns.values,
                                 columns=['importance'])
random_forest_res.sort_values('importance', ascending=False)

| feature | importance |
|---|---|
| age | 0.252969 |
| divdends_from_stocks | 0.148888 |
| capital_gains | 0.106322 |
| well_paid_occupation | 0.07587 |
| num_persons_worked_for_employer | 0.071542 |
| weeks_worked_in_year | 0.055309 |
| is_female | 0.04388 |
| capital_losses | 0.041679 |
| wage_per_hour | 0.025622 |
| university_degree | 0.024872 |


We can see that the random forest and the decision tree have similar feature importances: age and divdends_from_stocks rank first and second in both.
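To make this comparison less dependent on eyeballing two tables, the importances can be placed side by side and their rankings compared. This is a sketch on synthetic data (the `toy_*` names and the `feature_i` columns are hypothetical), not a rerun of the models above:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the census features (hypothetical column names).
X_arr, y_arr = make_classification(n_samples=500, n_features=6,
                                   n_informative=3, random_state=0)
toy_X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

toy_tree = DecisionTreeClassifier(random_state=0).fit(toy_X, y_arr)
toy_forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(toy_X, y_arr)

# One row per feature, one column per model.
importances = pd.DataFrame({
    "decision_tree": toy_tree.feature_importances_,
    "random_forest": toy_forest.feature_importances_,
}, index=toy_X.columns)

# Pearson correlation of the ranks is the Spearman rank correlation:
# 1.0 means both models order the features identically.
rank_agreement = importances["decision_tree"].rank().corr(
    importances["random_forest"].rank())
print(importances.sort_values("random_forest", ascending=False))
print("rank agreement:", round(rank_agreement, 2))
```

On the real data, the same frame could be built from `decision_tree.feature_importances_` and `random_forest.feature_importances_` with `learn_X.columns` as the index.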

## Logistic Regression

The second experiment uses logistic regression to build the model.

In [31]:
logi_regr = LogisticRegression()
logi_regr = logi_regr.fit(scale(learn_X.values), learn_Y.label)
print(sklearn.metrics.classification_report(test_Y.label.tolist(),
                                            logi_regr.predict(scale(test_X.values)).tolist()))



             precision    recall  f1-score   support

         -1       0.95      0.99      0.97     93576
          1       0.70      0.29      0.41      6186

avg / total       0.94      0.95      0.94     99762





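The logistic regression gives results comparable to the random forest: higher precision for the positive class (0.70 vs 0.59) but lower recall (0.29 vs 0.40), for an F1 score of 0.41 vs 0.48.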
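One caveat about the cell above: `scale` standardizes whatever array it is given, so the test set is standardized with its own mean and variance and not with those of the training set. A common alternative, sketched here on synthetic data (the `toy_*` names are hypothetical, and this is not the code that produced the report above), is to fit a `StandardScaler` on the training set inside a pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): two features on very different scales.
rng = np.random.RandomState(0)
toy_X_train = rng.normal(loc=[0.0, 100.0], scale=[1.0, 50.0], size=(400, 2))
toy_y_train = np.where(toy_X_train[:, 0] + toy_X_train[:, 1] / 50.0 > 2.0, 1, -1)
toy_X_test = rng.normal(loc=[0.0, 100.0], scale=[1.0, 50.0], size=(200, 2))
toy_y_test = np.where(toy_X_test[:, 0] + toy_X_test[:, 1] / 50.0 > 2.0, 1, -1)

# The scaler learns mean and variance on the training set only,
# then reuses them when the pipeline predicts on the test set.
scaled_logit = make_pipeline(StandardScaler(), LogisticRegression())
scaled_logit.fit(toy_X_train, toy_y_train)
toy_accuracy = scaled_logit.score(toy_X_test, toy_y_test)
print("test accuracy:", round(toy_accuracy, 3))
```

With train and test sets as large as the ones used here, the two approaches probably give very similar numbers, but the pipeline guarantees that both sets go through exactly the same transformation.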

In [32]:
logi_regr_coef = pd.DataFrame(logi_regr.coef_.T, index=learn_X.columns.values, columns=['coef'])
logi_regr_coef.sort_values('coef', ascending=False)

| feature | coef |
|---|---|
| weeks_worked_in_year | 0.909527 |
| age | 0.827404 |
| well_paid_occupation | 0.545399 |
| capital_gains | 0.538056 |
| num_persons_worked_for_employer | 0.467549 |
| divdends_from_stocks | 0.402244 |
| veterans_benefits_1_or_2 | 0.364615 |
| tax_joint_both_under_65 | 0.319429 |
| university_degree | 0.18147 |
| capital_losses | 0.168993 |


For logistic regression the most important features are:
* veterans_benefits_1_or_2
* tax_nonfiler
* is_female
* well_paid_occupation
* no_university
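The coefficient table above is sorted by signed value, so features with a large negative coefficient end up at the bottom and are easy to miss. Since the inputs were standardized, one rough way to rank features is by the absolute value of the coefficient. A minimal sketch with made-up coefficients (the values and feature names below are hypothetical, not the fitted ones):

```python
import pandas as pd

# Hypothetical coefficients for illustration only, not the fitted values.
toy_coef = pd.DataFrame(
    {"coef": [0.9, -1.2, 0.1, -0.4]},
    index=["feature_a", "feature_b", "feature_c", "feature_d"],
)

# Sort by magnitude so strong negative effects are not pushed to the bottom.
toy_coef["abs_coef"] = toy_coef["coef"].abs()
ranked = toy_coef.sort_values("abs_coef", ascending=False)
print(ranked)
```

The same two lines could be applied to `logi_regr_coef`, keeping the signed `coef` column to read the direction of each effect.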

## Conclusion

During this work I first looked at every feature and tried to select features manually in order to reduce their number. As part of this, one of my attempts was to extract the most relevant categories and to group some categories together.

Secondly, I fitted different models on the dataset to predict whether the income is above 50K. The two classes are highly unbalanced: about 6% of the people in the dataset have an income higher than 50K. I tried different techniques to balance the data, but none of them gave better results.
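For reference, one standard way to handle this kind of imbalance in scikit-learn is the `class_weight="balanced"` option. The sketch below uses synthetic data (all `toy_*` names are hypothetical) and is not necessarily one of the techniques tried for this report:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic imbalanced data (hypothetical): roughly 6% positives.
rng = np.random.RandomState(0)
n = 4000
toy_y = np.where(rng.rand(n) < 0.06, 1, -1)
# One noisy feature whose mean is shifted for the positive class.
toy_X = (rng.normal(size=n) + 1.5 * (toy_y == 1)).reshape(-1, 1)

plain = LogisticRegression().fit(toy_X, toy_y)
# class_weight="balanced" reweights samples inversely to class frequency.
weighted = LogisticRegression(class_weight="balanced").fit(toy_X, toy_y)

# Recall is measured on the training data here, only to show the effect.
plain_recall = recall_score(toy_y, plain.predict(toy_X), pos_label=1)
weighted_recall = recall_score(toy_y, weighted.predict(toy_X), pos_label=1)
print("recall without weights:", round(plain_recall, 2))
print("recall with balanced weights:", round(weighted_recall, 2))
```

Reweighting typically raises recall on the rare class at the cost of precision, so it does not have to improve the F1 score, which would be consistent with what I observed.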

Weeks worked in a year, age and sex are correlated with an income above 50K. These variables interact with each other: children do not work, and the share of women who have a job is lower.
The second factor is what we can call capital activity, made up of capital gains and losses together with dividends from stocks.