### Getting Started
In this project, you will employ several supervised algorithms of your choice to accurately model individuals' income using data collected from the 1994 U.S. Census. You will then choose the best candidate algorithm from preliminary results and further optimize this algorithm to best model the data. Your goal with this implementation is to construct a model that accurately predicts whether an individual makes more than $50,000. This sort of task can arise in a non-profit setting, where organizations survive on donations. Understanding an individual's income can help a non-profit decide how large a donation to request, or whether to reach out at all. While it can be difficult to determine an individual's general income bracket directly from public sources, we can (as we will see) infer this value from other publicly available features.

The dataset for this project originates from the UCI Machine Learning Repository. The dataset was donated by Ron Kohavi and Barry Becker, after being published in the article "Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid". You can find the article by Ron Kohavi online. The data we investigate here is a lightly modified version of the original dataset: the 'fnlwgt' feature has been removed, along with records that had missing or ill-formatted entries.



### Exploring the Data
Run the code cell below to load necessary Python libraries and load the census data. Note that the last column from this dataset, 'income', will be our target label (whether an individual makes more than, or at most, $50,000 annually). All other columns are features about each individual in the census database.

In [5]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from time import time

# Allows the use of display() for DataFrames
from IPython.display import display

# Import supplementary visualization code visuals.py
import visuals as vs

# Pretty display for notebooks
%matplotlib inline

# Load the Census dataset
data = pd.read_csv("./data/census.csv")

# Success - Display the first record
display(data.head(n=1))

|   | age | workclass | education_level | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | 13.0 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174.0 | 0.0 | 40.0 | United-States | <=50K |


In [6]:
# Reload the Census dataset, reporting a missing file instead of raising
try:
    data = pd.read_csv("./data/census.csv")
except FileNotFoundError:
    print("Dataset could not be loaded. Is the dataset missing?")


In [7]:
data.head(1)

|   | age | workclass | education_level | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | 13.0 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174.0 | 0.0 | 40.0 | United-States | <=50K |


### Implementation: Data Exploration


A cursory investigation of the dataset will show how many individuals fall into each income group and what percentage of them make more than \$50,000. In the code cell below, you will need to compute the following:

* The total number of records, 'n_records'
* The number of individuals making more than \$50,000 annually, 'n_greater_50k'.
* The number of individuals making at most \$50,000 annually, 'n_at_most_50k'.
* The percentage of individuals making more than \$50,000 annually, 'greater_percent'.

In [39]:
data.income.unique()

array(['<=50K', '>50K'], dtype=object)

In [45]:
data.groupby(['income']).size()

income
<=50K    34014
>50K     11208
dtype: int64

In [46]:
# Total number of records
n_records = data.shape[0]

# Number of records in each income group (labels are '>50K' and '<=50K')
n_greater_50k = data[data['income'] == '>50K'].shape[0]
n_at_most_50k = data[data['income'] == '<=50K'].shape[0]

# Percentage of individuals whose income is more than $50,000
greater_percent = n_greater_50k / n_records * 100

# Print the results
print("Total number of records: {}".format(n_records))
print("Individuals making more than $50,000: {}".format(n_greater_50k))
print("Individuals making at most $50,000: {}".format(n_at_most_50k))
print("Percentage of individuals making more than $50,000: {:.2f}%".format(greater_percent))

Total number of records: 45222
Individuals making more than $50,000: 11208
Individuals making at most $50,000: 34014
Percentage of individuals making more than $50,000: 24.78%
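
As a cross-check on these counts, the same figures can be derived from `value_counts()`. The sketch below is not part of the original notebook: it rebuilds the income column from the class sizes reported by `groupby` (34014 and 11208) instead of reading `census.csv`, so it runs on its own. The variable name `income` is an assumption made for illustration.

```python
import pandas as pd

# Stand-in for data['income'], rebuilt from the class sizes reported by groupby
income = pd.Series(["<=50K"] * 34014 + [">50K"] * 11208, name="income")

# Count the records carrying each label
counts = income.value_counts()

n_records = len(income)
n_greater_50k = int(counts[">50K"])
n_at_most_50k = int(counts["<=50K"])
greater_percent = n_greater_50k / n_records * 100

# The two groups should account for every record
assert n_greater_50k + n_at_most_50k == n_records

print("{} of {} records ({:.2f}%) carry the '>50K' label".format(
    n_greater_50k, n_records, greater_percent))
```

Indexing the counts by label, rather than by position, keeps the two groups from being swapped by accident.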


In [50]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45222 entries, 0 to 45221
Data columns (total 14 columns):
age                45222 non-null int64
workclass          45222 non-null object
education_level    45222 non-null object
education-num      45222 non-null float64
marital-status     45222 non-null object
occupation         45222 non-null object
relationship       45222 non-null object
race               45222 non-null object
sex                45222 non-null object
capital-gain       45222 non-null float64
capital-loss       45222 non-null float64
hours-per-week     45222 non-null float64
native-country     45222 non-null object
income             45222 non-null object
dtypes: float64(4), int64(1), object(9)
memory usage: 4.8+ MB
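
The dtype summary above suggests a natural split into numerical and categorical columns. A minimal sketch of that split with `select_dtypes` follows; it is an illustration, not a step from the original notebook. It uses a one-row frame built from the first record shown earlier so that it runs without `census.csv`, and the names `sample`, `numerical` and `categorical` are assumptions.

```python
import pandas as pd

# One-row stand-in for the census data, copied from the first record shown earlier
sample = pd.DataFrame({
    "age": [39],
    "workclass": ["State-gov"],
    "education_level": ["Bachelors"],
    "education-num": [13.0],
    "marital-status": ["Never-married"],
    "occupation": ["Adm-clerical"],
    "relationship": ["Not-in-family"],
    "race": ["White"],
    "sex": ["Male"],
    "capital-gain": [2174.0],
    "capital-loss": [0.0],
    "hours-per-week": [40.0],
    "native-country": ["United-States"],
    "income": ["<=50K"],
})

# Columns holding numbers (int64 and float64 in the summary above)
numerical = sample.select_dtypes(include="number").columns.tolist()

# Everything else holds text labels, including the 'income' target
categorical = sample.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical)
print("Categorical:", categorical)
```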


In [51]:
# Split the data into features and target label
income_raw = data['income']
features_raw = data.drop('income', axis = 1)

# Visualize skewed continuous features of original data
#vs.distribution(data)
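
Since `vs.distribution` comes from the supplementary `visuals.py` module, a rough numerical stand-in for spotting skewed features is pandas' own `skew()`. The sketch below is an assumption about how one might check this, not the notebook's method. It runs on a toy frame whose values are made up for illustration, and it repeats the feature/target split so that it is self-contained.

```python
import pandas as pd

# Toy rows for illustration only; these values are made up, not census records
data = pd.DataFrame({
    "age": [39, 45, 31, 52, 27],
    "capital-gain": [2174.0, 0.0, 0.0, 0.0, 0.0],
    "hours-per-week": [40.0, 13.0, 40.0, 40.0, 40.0],
    "income": ["<=50K", "<=50K", "<=50K", "<=50K", ">50K"],
})

# Split the data into features and target label
income_raw = data["income"]
features_raw = data.drop("income", axis=1)

# Sample skewness of each numerical feature; large positive values
# indicate a long right tail
skewness = features_raw.select_dtypes(include="number").skew()
print(skewness.sort_values(ascending=False))
```

On the real data, features with skewness far from zero would be the ones worth plotting first.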