## Visualize dataset using categorical embedding

The dataset for this project originates from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Census+Income). The dataset was donated by Ron Kohavi and Barry Becker, after being published in the article _"Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid"_. The article by Ron Kohavi is available [online](https://www.aaai.org/Papers/KDD/1996/KDD96-033.pdf). The data investigated here differs slightly from the original dataset; for example, the `'fnlwgt'` feature and records with missing or ill-formatted entries have been removed.

In [25]:
import warnings
warnings.simplefilter('ignore')

# Import libraries necessary for this project
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames

# Import sklearn.preprocessing.MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Import train_test_split
from sklearn.model_selection import train_test_split

# Import two metrics from sklearn - fbeta_score and accuracy_score
from sklearn.metrics import fbeta_score
from sklearn.metrics import accuracy_score

# Import the three supervised learning models from sklearn
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Import 'GridSearchCV', 'make_scorer', and any other necessary libraries
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

# Import functionality for cloning a model
from sklearn.base import clone

import altair as alt
alt.data_transformers.enable('json')

# Pretty display for notebooks
%matplotlib inline

from gensim.models import Word2Vec

# TSNE
import time
from sklearn.manifold import TSNE

In [26]:
# Load the Census dataset
data = pd.read_csv("census.csv")

data.head()

Unnamed: 0,age,workclass,education_level,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,United-States,<=50K
2,38,Private,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,United-States,<=50K
3,53,Private,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,United-States,<=50K
4,28,Private,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K


**Featureset Exploration**

* **age**: Continuous. 
* **workclass**: Categorical - Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. 
* **education_level**: Categorical - Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. 
* **education-num**: Continuous. 
* **marital-status**: Categorical - Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. 
* **occupation**: Categorical - Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. 
* **relationship**: Categorical - Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. 
* **race**: Categorical - Black, White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other. 
* **sex**: Categorical - Female, Male. 
* **capital-gain**: Continuous. 
* **capital-loss**: Continuous. 
* **hours-per-week**: Continuous. 
* **native-country**: Categorical - United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

----
## Preparing the Data
Before data can be used as input for machine learning algorithms, it often must be cleaned, formatted, and restructured. Fortunately, this dataset has no invalid or missing entries to deal with. However, some qualities of certain features must be adjusted.

### Normalizing Numerical Features
In addition to transforming highly skewed features, it is often good practice to scale numerical features. Scaling does not change the shape of a feature's distribution (for example, that of `'capital-gain'` or `'capital-loss'`); however, normalization ensures that each feature is treated equally when applying supervised learners. Note that once scaling is applied, the values no longer carry their original meaning in raw units.
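
As a minimal sketch of what min-max scaling does, the snippet below applies the formula `(x - min) / (max - min)` by hand with NumPy. The four ages and the bounds 17 and 90 are assumed for illustration and are not read from the dataset; with those bounds, an age of 39 maps to about 0.3014.

```python
import numpy as np

# Illustrative ages; the bounds 17 and 90 are assumed, not taken from the dataset.
ages = np.array([17.0, 39.0, 50.0, 90.0])

# Min-max scaling: x' = (x - min) / (max - min), mapping values into [0, 1].
scaled = (ages - ages.min()) / (ages.max() - ages.min())
print(scaled)

# The rescaling is linear, so ratios of gaps between values are unchanged,
# which is why the shape of the distribution is preserved.
gap_ratio_raw = (ages[2] - ages[1]) / (ages[3] - ages[0])
gap_ratio_scaled = (scaled[2] - scaled[1]) / (scaled[3] - scaled[0])
print(np.isclose(gap_ratio_raw, gap_ratio_scaled))
```

`MinMaxScaler` with its default range applies the same formula column by column.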

In [27]:
# Initialize a scaler, then apply it to the features
scaler = MinMaxScaler() # default=(0, 1)
numerical = ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
numerical_normalized = ['age_normalized', 'education-num_normalized', 'capital-gain_normalized',
                        'capital-loss_normalized', 'hours-per-week_normalized']

# Work on a copy so the original DataFrame is left untouched
normalized_data = data.copy()
normalized_data[numerical_normalized] = data[numerical]

normalized_data[numerical_normalized] = scaler.fit_transform(normalized_data[numerical_normalized])

normalized_data.head()

Unnamed: 0,age,workclass,education_level,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,age_normalized,education-num_normalized,capital-gain_normalized,capital-loss_normalized,hours-per-week_normalized
0,39,State-gov,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,United-States,<=50K,0.30137,0.8,0.02174,0.0,0.397959
1,50,Self-emp-not-inc,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,United-States,<=50K,0.452055,0.8,0.0,0.0,0.122449
2,38,Private,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,United-States,<=50K,0.287671,0.533333,0.0,0.0,0.397959
3,53,Private,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,United-States,<=50K,0.493151,0.4,0.0,0.0,0.397959
4,28,Private,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K,0.150685,0.8,0.0,0.0,0.397959


In [28]:
categorical = ['workclass', 'education_level', 'marital-status', 'occupation', 'relationship',
               'race', 'sex', 'native-country']

def getSentence(x):
    arr = []
    for col in categorical:
        arr.append(x[col].strip())
    return arr

In [29]:
normalized_data['combined-categories'] = normalized_data.apply(getSentence, axis=1)
normalized_data.head()

Unnamed: 0,age,workclass,education_level,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,age_normalized,education-num_normalized,capital-gain_normalized,capital-loss_normalized,hours-per-week_normalized,combined-categories
0,39,State-gov,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,United-States,<=50K,0.30137,0.8,0.02174,0.0,0.397959,"[State-gov, Bachelors, Never-married, Adm-cler..."
1,50,Self-emp-not-inc,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,United-States,<=50K,0.452055,0.8,0.0,0.0,0.122449,"[Self-emp-not-inc, Bachelors, Married-civ-spou..."
2,38,Private,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,United-States,<=50K,0.287671,0.533333,0.0,0.0,0.397959,"[Private, HS-grad, Divorced, Handlers-cleaners..."
3,53,Private,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,United-States,<=50K,0.493151,0.4,0.0,0.0,0.397959,"[Private, 11th, Married-civ-spouse, Handlers-c..."
4,28,Private,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K,0.150685,0.8,0.0,0.0,0.397959,"[Private, Bachelors, Married-civ-spouse, Prof-..."


In [30]:
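# Each row's "sentence" has 8 tokens, so a window size of 8 includes all words in context
# ns_exponent of 0.0 samples all words equally
# Note: gensim >= 4.0 renames `size` to `vector_size` and `iter` to `epochs`
model = Word2Vec(list(normalized_data['combined-categories']), min_count=1, size=32, window=8, ns_exponent=0.0,
                 workers=8, iter=100)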
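
The effect of `ns_exponent` can be sketched with NumPy alone. The assumption here is that negative samples are drawn with probability proportional to `count ** ns_exponent`; the token counts are hypothetical, and `negative_sampling_probs` is an illustrative helper, not part of gensim.

```python
import numpy as np

# Hypothetical token counts, for illustration only.
counts = np.array([30000.0, 9000.0, 500.0, 10.0])

def negative_sampling_probs(counts, ns_exponent):
    """Probability of drawing each token as a negative sample."""
    weights = counts ** ns_exponent
    return weights / weights.sum()

weighted = negative_sampling_probs(counts, 0.75)  # frequency-weighted sampling
uniform = negative_sampling_probs(counts, 0.0)    # every token equally likely

print(weighted)
print(uniform)
```

With an exponent of 0.0, a rare category value is drawn as a negative sample as often as a very common one such as `Private`.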

In [31]:
# test vector for one of the categorical values
# (model.wv[...] is the supported lookup; indexing the model directly is deprecated)
model.wv['Private']

array([ 2.0701118 , -2.3951406 ,  1.140145  , -0.7317447 , -3.6734076 ,
       -0.31535238, -0.19988257, -1.0696453 ,  3.1696587 , -0.5462873 ,
       -2.1698182 , -2.845269  , -0.1661925 , -0.32693258,  0.6593459 ,
        0.1522853 ,  1.9685335 ,  1.3522162 , -0.94327897,  2.2531059 ,
       -0.46765828,  0.22242904, -0.28588045,  0.23324323, -2.2066123 ,
        0.82089067,  1.8995258 , -1.2735388 , -2.9351838 , -1.0066811 ,
       -0.7470288 ,  0.585649  ], dtype=float32)

In [32]:
def getCategoryArray(x):
    arr = []
    for col in categorical:
        arr.append(model.wv[x[col].strip()])
    return np.mean(np.array(arr), axis=0)
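
The averaging step can be sketched on its own. The lookup table below uses hypothetical 4-dimensional vectors in place of the learned 32-dimensional ones, and `mean_embedding` is an illustrative helper name.

```python
import numpy as np

# Hypothetical 4-dimensional vectors standing in for the learned 32-dimensional ones.
toy_vectors = {
    "Private": np.array([1.0, 0.0, 2.0, -1.0]),
    "Bachelors": np.array([0.0, 2.0, 0.0, 1.0]),
    "Male": np.array([2.0, 1.0, 1.0, 0.0]),
}

def mean_embedding(tokens, vectors):
    """Average the vectors of a row's category tokens into one row vector."""
    return np.mean([vectors[t] for t in tokens], axis=0)

row_vector = mean_embedding(["Private", "Bachelors", "Male"], toy_vectors)
print(row_vector)  # [1. 1. 1. 0.]
```

Averaging keeps the row vector at the same dimensionality as a single token vector, whatever the number of categorical columns.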

In [33]:
normalized_data['categories_arr'] = normalized_data.apply(getCategoryArray, axis=1)
normalized_data.head()

Unnamed: 0,age,workclass,education_level,education-num,marital-status,occupation,relationship,race,sex,capital-gain,...,hours-per-week,native-country,income,age_normalized,education-num_normalized,capital-gain_normalized,capital-loss_normalized,hours-per-week_normalized,combined-categories,categories_arr
0,39,State-gov,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,...,40.0,United-States,<=50K,0.30137,0.8,0.02174,0.0,0.397959,"[State-gov, Bachelors, Never-married, Adm-cler...","[-0.68047947, -0.089676395, -0.45149243, 0.225..."
1,50,Self-emp-not-inc,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,...,13.0,United-States,<=50K,0.452055,0.8,0.0,0.0,0.122449,"[Self-emp-not-inc, Bachelors, Married-civ-spou...","[0.25474602, -0.21321414, -0.20013303, -0.4185..."
2,38,Private,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,...,40.0,United-States,<=50K,0.287671,0.533333,0.0,0.0,0.397959,"[Private, HS-grad, Divorced, Handlers-cleaners...","[0.2578815, 0.17632365, -0.24483004, 0.1211661..."
3,53,Private,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,...,40.0,United-States,<=50K,0.493151,0.4,0.0,0.0,0.397959,"[Private, 11th, Married-civ-spouse, Handlers-c...","[1.041547, 0.060513243, -0.21621233, -0.933989..."
4,28,Private,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,...,40.0,Cuba,<=50K,0.150685,0.8,0.0,0.0,0.397959,"[Private, Bachelors, Married-civ-spouse, Prof-...","[-0.044647295, 0.3955099, 0.51228225, 0.025548..."


In [34]:
normalizer = np.amax(list(normalized_data['categories_arr'])) - np.amin(list(normalized_data['categories_arr']))

In [35]:
def getCombinedArray(x):
    arr = x['categories_arr']
    for col in numerical_normalized:
        arr = np.append(arr, x[col]*normalizer)
    return arr
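
The scaling by `normalizer` stretches the min-max scaled numeric features, which lie in [0, 1], to the overall span of the embedding values, presumably so that both groups of dimensions contribute on a comparable scale to distance computations. A vectorized sketch of the same idea on hypothetical values:

```python
import numpy as np

# Hypothetical averaged category embeddings for three rows (illustrative values).
category_vectors = np.array([
    [-0.7, 0.2, 1.1],
    [0.3, -0.4, 0.5],
    [1.0, 0.1, -0.9],
])
# Min-max scaled numeric features for the same rows, each in [0, 1].
numeric_scaled = np.array([
    [0.30, 0.80],
    [0.45, 0.80],
    [0.29, 0.53],
])

# Span of the embedding values: largest minus smallest entry overall.
value_range = category_vectors.max() - category_vectors.min()

# Stretch the numeric columns to that span, then concatenate per row.
combined = np.hstack([category_vectors, numeric_scaled * value_range])
print(value_range)
print(combined.shape)
```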

In [36]:
normalized_data['combined_arr'] = normalized_data.apply(getCombinedArray, axis=1)
normalized_data.head()

Unnamed: 0,age,workclass,education_level,education-num,marital-status,occupation,relationship,race,sex,capital-gain,...,native-country,income,age_normalized,education-num_normalized,capital-gain_normalized,capital-loss_normalized,hours-per-week_normalized,combined-categories,categories_arr,combined_arr
0,39,State-gov,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,...,United-States,<=50K,0.30137,0.8,0.02174,0.0,0.397959,"[State-gov, Bachelors, Never-married, Adm-cler...","[-0.68047947, -0.089676395, -0.45149243, 0.225...","[-0.6804794669151306, -0.0896763950586319, -0...."
1,50,Self-emp-not-inc,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,...,United-States,<=50K,0.452055,0.8,0.0,0.0,0.122449,"[Self-emp-not-inc, Bachelors, Married-civ-spou...","[0.25474602, -0.21321414, -0.20013303, -0.4185...","[0.2547460198402405, -0.21321414411067963, -0...."
2,38,Private,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,...,United-States,<=50K,0.287671,0.533333,0.0,0.0,0.397959,"[Private, HS-grad, Divorced, Handlers-cleaners...","[0.2578815, 0.17632365, -0.24483004, 0.1211661...","[0.2578814923763275, 0.17632365226745605, -0.2..."
3,53,Private,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,...,United-States,<=50K,0.493151,0.4,0.0,0.0,0.397959,"[Private, 11th, Married-civ-spouse, Handlers-c...","[1.041547, 0.060513243, -0.21621233, -0.933989...","[1.0415469408035278, 0.060513243079185486, -0...."
4,28,Private,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,...,Cuba,<=50K,0.150685,0.8,0.0,0.0,0.397959,"[Private, Bachelors, Married-civ-spouse, Prof-...","[-0.044647295, 0.3955099, 0.51228225, 0.025548...","[-0.04464729502797127, 0.39550989866256714, 0...."


### Visualize

In [37]:
def performTSNE(df, col, out_x='x', out_y='y', verbose=1, perplexity=40, n_iter=500):
    '''
    Perform t-SNE (t-distributed Stochastic Neighbor Embedding), a tool to visualize high-dimensional data.
    It converts similarities between data points to joint probabilities and tries to minimize the
    Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding
    and the high-dimensional data.
    input:
        df - data frame
        col - name of the column with embedding data
        out_x - name of column to store x-coordinates
        out_y - name of column to store y-coordinates
        verbose - verbosity level passed to TSNE; if greater than 0, the elapsed time is also printed
        perplexity - related to the number of nearest neighbors - usually a value between 5 and 50
        n_iter - number of iterations
    output:
        input df with 2 additional columns for x and y coordinates
    '''
    X = np.array(list(df[col]))
    time_start = time.time()
    tsne = TSNE(n_components=2, verbose=verbose, random_state=32, perplexity=perplexity, n_iter=n_iter)
    tsne_results = tsne.fit_transform(X)
    if verbose > 0:
        print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
    df[out_x] = tsne_results[:,0]
    df[out_y] = tsne_results[:,1]
    
    return df
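
For a quick check of the t-SNE call before running it on all rows, a small synthetic example can help. The data below is randomly generated for illustration, and the iteration count is left at the library default.

```python
import numpy as np
from sklearn.manifold import TSNE

# Two synthetic, well-separated clusters in 10 dimensions (illustrative data).
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, scale=0.5, size=(30, 10))
cluster_b = rng.normal(loc=5.0, scale=0.5, size=(30, 10))
points = np.vstack([cluster_a, cluster_b])

# perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=10, random_state=32)
embedded = tsne.fit_transform(points)

print(embedded.shape)
```

Each input row yields one pair of coordinates, which is what `performTSNE` writes into the `x` and `y` columns.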

In [38]:
normalized_data = performTSNE(normalized_data, 'combined_arr')

[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 45222 samples in 0.657s...
[t-SNE] Computed neighbors for 45222 samples in 90.299s...
[t-SNE] Computed conditional probabilities for sample 1000 / 45222
[t-SNE] Computed conditional probabilities for sample 2000 / 45222
[t-SNE] Computed conditional probabilities for sample 3000 / 45222
[t-SNE] Computed conditional probabilities for sample 4000 / 45222
[t-SNE] Computed conditional probabilities for sample 5000 / 45222
[t-SNE] Computed conditional probabilities for sample 6000 / 45222
[t-SNE] Computed conditional probabilities for sample 7000 / 45222
[t-SNE] Computed conditional probabilities for sample 8000 / 45222
[t-SNE] Computed conditional probabilities for sample 9000 / 45222
[t-SNE] Computed conditional probabilities for sample 10000 / 45222
[t-SNE] Computed conditional probabilities for sample 11000 / 45222
[t-SNE] Computed conditional probabilities for sample 12000 / 45222
...

In [39]:
normalized_data.head()

Unnamed: 0,age,workclass,education_level,education-num,marital-status,occupation,relationship,race,sex,capital-gain,...,age_normalized,education-num_normalized,capital-gain_normalized,capital-loss_normalized,hours-per-week_normalized,combined-categories,categories_arr,combined_arr,x,y
0,39,State-gov,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,...,0.30137,0.8,0.02174,0.0,0.397959,"[State-gov, Bachelors, Never-married, Adm-cler...","[-0.68047947, -0.089676395, -0.45149243, 0.225...","[-0.6804794669151306, -0.0896763950586319, -0....",-17.525459,-14.9901
1,50,Self-emp-not-inc,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,...,0.452055,0.8,0.0,0.0,0.122449,"[Self-emp-not-inc, Bachelors, Married-civ-spou...","[0.25474602, -0.21321414, -0.20013303, -0.4185...","[0.2547460198402405, -0.21321414411067963, -0....",-4.725045,-3.83611
2,38,Private,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,...,0.287671,0.533333,0.0,0.0,0.397959,"[Private, HS-grad, Divorced, Handlers-cleaners...","[0.2578815, 0.17632365, -0.24483004, 0.1211661...","[0.2578814923763275, 0.17632365226745605, -0.2...",-3.914472,-10.037725
3,53,Private,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,...,0.493151,0.4,0.0,0.0,0.397959,"[Private, 11th, Married-civ-spouse, Handlers-c...","[1.041547, 0.060513243, -0.21621233, -0.933989...","[1.0415469408035278, 0.060513243079185486, -0....",-9.508018,4.769815
4,28,Private,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,...,0.150685,0.8,0.0,0.0,0.397959,"[Private, Bachelors, Married-civ-spouse, Prof-...","[-0.044647295, 0.3955099, 0.51228225, 0.025548...","[-0.04464729502797127, 0.39550989866256714, 0....",21.247725,-12.948339


Now that we have the x and y coordinates, we can drop the intermediate columns.

In [40]:
to_drop = ['age_normalized', 'education-num_normalized', 'capital-gain_normalized', 'capital-loss_normalized',
           'hours-per-week_normalized', 'categories_arr', 'combined_arr']

# Keep the combined vectors as a feature matrix for modeling before dropping the column
features = pd.DataFrame(normalized_data['combined_arr'].tolist())
normalized_data.drop(columns=to_drop, inplace=True)

In [41]:
scatter_chart = alt.Chart(normalized_data).mark_circle().encode(
                        x='x',
                        y='y',
                        color = 'income',
                        tooltip=['age', 'workclass', 'education_level', 'education-num',
                                 'marital-status', 'occupation', 'relationship', 'race', 'sex',
                                 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']
                    ).properties(
                        width=700,
                        height=700
                    ).interactive()

In [42]:
scatter_chart.display()

In [43]:
scatter_chart.save('scatter_chart.html')

### Effect of Gender, Marital Status and Race

In [44]:
sex_selection = alt.selection_multi(fields=['sex'], name='sex')
marital_selection = alt.selection_multi(fields=['marital-status'], name='marital')
race_selection = alt.selection_multi(fields=['race'], name='race')

scatter_color = alt.condition(sex_selection | marital_selection | race_selection,
                      alt.Color('income:N'),
                      alt.value('lightgray'))

sex_color = alt.condition(sex_selection,
                      alt.Color('income:N'),
                      alt.value('lightgray'))

marital_color = alt.condition(marital_selection,
                      alt.Color('income:N'),
                      alt.value('lightgray'))

race_color = alt.condition(race_selection,
                      alt.Color('income:N'),
                      alt.value('lightgray'))


sex_bar = alt.Chart(normalized_data).mark_bar().encode(
                            x= alt.X('count()', axis=alt.Axis(title='')),
                            y=  alt.Y('sex:N', sort='-x', 
                                      axis=alt.Axis(title='', 
                                                    labelFontSize=12,
                                                    ticks=False)),
                            color=sex_color
                        ).properties(
                            width=200,
                            height=60,
                            title="Sex"
                        ).add_selection(
                                sex_selection
                        )

marital_bar = alt.Chart(normalized_data).mark_bar().encode(
                            x= alt.X('count()', axis=alt.Axis(title='')),
                            y=  alt.Y('marital-status:N', sort='-x', 
                                      axis=alt.Axis(title='', 
                                                    labelFontSize=12,
                                                    ticks=False)),
                            color=marital_color
                        ).properties(
                            width=200,
                            height=225,
                            title="Marital Status"
                        ).add_selection(
                                marital_selection
                        )

race_bar = alt.Chart(normalized_data).mark_bar().encode(
                            x= alt.X('count()', axis=alt.Axis(title='')),
                            y=  alt.Y('race:N', sort='-x', 
                                      axis=alt.Axis(title='', 
                                                    labelFontSize=12,
                                                    ticks=False)),
                            color=race_color
                        ).properties(
                            width=200,
                            height=125,
                            title="Race"
                        ).add_selection(
                                race_selection
                        )

scatter = alt.Chart(normalized_data).mark_circle().encode(
                        x='x',
                        y='y',
                        color = scatter_color,
                        tooltip=['age', 'workclass', 'education_level', 'education-num',
                                 'marital-status', 'occupation', 'relationship', 'race', 'sex',
                                 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']
                    ).properties(
                        width=500,
                        height=500
                    ).add_selection(alt.selection_single())
                    # selection_single is a workaround for a vega-lite bug, so that the tooltip is shown
    
gender_marital_race_chart = (scatter | (sex_bar & marital_bar & race_bar)).configure_legend(
                        orient='bottom'
                    )

In [45]:
gender_marital_race_chart.display()

In [46]:
gender_marital_race_chart.save('gender_marital_race_chart.html')

### Effect of Education Level, Workclass and Occupation

In [47]:
workclass_selection = alt.selection_multi(fields=['workclass'], name='workclass')
education_level_selection = alt.selection_multi(fields=['education_level'], name='education_level')
occupation_selection = alt.selection_multi(fields=['occupation'], name='occupation')

scatter_color = alt.condition(workclass_selection | education_level_selection | occupation_selection,
                      alt.Color('income:N'),
                      alt.value('lightgray'))

workclass_color = alt.condition(workclass_selection,
                      alt.Color('income:N'),
                      alt.value('lightgray'))

education_level_color = alt.condition(education_level_selection,
                      alt.Color('income:N'),
                      alt.value('lightgray'))

occupation_color = alt.condition(occupation_selection,
                      alt.Color('income:N'),
                      alt.value('lightgray'))


workclass_bar = alt.Chart(normalized_data).mark_bar().encode(
                            x= alt.X('count()', axis=alt.Axis(title='')),
                            y=  alt.Y('workclass:N', sort='-x', 
                                      axis=alt.Axis(title='', 
                                                    labelFontSize=12,
                                                    ticks=False)),
                            color=workclass_color
                        ).properties(
                            width=200,
                            height=80,
                            title="Workclass"
                        ).add_selection(
                                workclass_selection
                        )

education_level_bar = alt.Chart(normalized_data).mark_bar().encode(
                            x= alt.X('count()', axis=alt.Axis(title='')),
                            y=  alt.Y('education_level:N', sort='-x', 
                                      axis=alt.Axis(title='', 
                                                    labelFontSize=12,
                                                    ticks=False)),
                            color=education_level_color
                        ).properties(
                            width=200,
                            height=225,
                            title="Education Level"
                        ).add_selection(
                                education_level_selection
                        )

occupation_bar = alt.Chart(normalized_data).mark_bar().encode(
                            x= alt.X('count()', axis=alt.Axis(title='')),
                            y=  alt.Y('occupation:N', sort='-x', 
                                      axis=alt.Axis(title='', 
                                                    labelFontSize=12,
                                                    ticks=False)),
                            color=occupation_color
                        ).properties(
                            width=200,
                            height=150,
                            title="Occupation"
                        ).add_selection(
                                occupation_selection
                        )

scatter = alt.Chart(normalized_data).mark_circle().encode(
                        x='x',
                        y='y',
                        color = scatter_color,
                        tooltip=['age', 'workclass', 'education_level', 'education-num',
                                 'marital-status', 'occupation', 'relationship', 'race', 'sex',
                                 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']
                    ).properties(
                        width=500,
                        height=500
                    ).add_selection(alt.selection_single())
                    # selection_single is a workaround for a vega-lite bug, so that the tooltip is shown
    
workclass_education_occupation_chart = (scatter | (workclass_bar & education_level_bar & occupation_bar)).configure_legend(
                        orient='bottom'
                    )

In [48]:
workclass_education_occupation_chart.display()

In [49]:
workclass_education_occupation_chart.save('workclass_education_occupation_chart.html')

### Modeling

In [25]:
income = normalized_data['income'].map({'<=50K':0, '>50K':1})

### Shuffle and Split Data

In [26]:
# Split the 'features' and 'income' data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, 
                                                    income, 
                                                    test_size = 0.2, 
                                                    random_state = 0)

# Show the results of the split
print("Training set has {:,} samples.".format(X_train.shape[0]))
print("Testing set has {:,} samples.".format(X_test.shape[0]))

Training set has 36,177 samples.
Testing set has 9,045 samples.
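
The reported split sizes can be reproduced by hand. This sketch assumes that, for a float `test_size` with no `train_size` given, scikit-learn rounds the test count up and assigns the remaining rows to the training set. The row count of 45,222 comes from the t-SNE log above.

```python
import math

n_samples = 45222   # row count reported in the t-SNE log
test_size = 0.2

# Assumed rule: the test count is rounded up, the training set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 36177 9045
```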


In [27]:
# Initialize the classifier
clf = RandomForestClassifier(random_state=23)

# Create the parameters list you wish to tune, using a dictionary if needed.
parameters = {'n_estimators':[60, 75],'max_depth':[12, 14],'min_samples_leaf':[5, 6]}

# Make an fbeta_score scoring object using make_scorer()
scorer = make_scorer(fbeta_score, beta=0.5)

# Perform grid search on the classifier using 'scorer' as the scoring method using GridSearchCV()
grid_obj = GridSearchCV(clf, parameters, scoring=scorer)

# Fit the grid search object to the training data and find the optimal parameters using fit()
grid_fit = grid_obj.fit(X_train,y_train)

# Get the estimator
best_clf = grid_fit.best_estimator_

print(grid_fit.best_estimator_)

# Make predictions using the unoptimized and optimized models
predictions = (clf.fit(X_train, y_train)).predict(X_test)
time_start = time.time()
best_predictions = (best_clf.fit(X_train, y_train)).predict(X_test)
time_end = time.time()

# Report the before-and-after scores
print("Unoptimized model\n------")
print("Accuracy score on testing data: {:.4f}".format(accuracy_score(y_test, predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, predictions, beta = 0.5)))
print("\nOptimized Model\n------")
print("Final accuracy score on the testing data: {:.4f}".format(accuracy_score(y_test, best_predictions)))
print("Final F-score on the testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta = 0.5)))
print("Time taken {:.4f}".format(time_end - time_start))

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=14, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=5, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=75,
                       n_jobs=None, oob_score=False, random_state=23, verbose=0,
                       warm_start=False)
Unoptimized model
------
Accuracy score on testing data: 0.8377
F-score on testing data: 0.6700

Optimized Model
------
Final accuracy score on the testing data: 0.8605
Final F-score on the testing data: 0.7290
Time taken 5.1834
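
The F-score used above with `beta=0.5` weights precision more heavily than recall. A minimal sketch of the standard formula, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), follows; the precision and recall values are hypothetical and `fbeta` is an illustrative helper, not the scikit-learn function.

```python
def fbeta(precision, recall, beta):
    """F-beta score computed from precision and recall."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return 0.0 if denom == 0 else (1 + b2) * precision * recall / denom

# Hypothetical precision/recall pairs with the two values swapped.
high_precision = fbeta(0.8, 0.5, beta=0.5)
high_recall = fbeta(0.5, 0.8, beta=0.5)

print(round(high_precision, 4))
print(round(high_recall, 4))
```

With `beta=0.5`, the pair with the higher precision scores higher, even though both pairs would have the same F1 score.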
