## Visualize dataset using categorical embedding

The dataset for this project originates from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Census+Income). The dataset was donated by Ron Kohavi and Barry Becker, after being published in the article _"Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid"_. The article by Ron Kohavi is available [online](https://www.aaai.org/Papers/KDD/1996/KDD96-033.pdf). The data investigated here differs slightly from the original dataset: the `'fnlwgt'` feature and records with missing or ill-formatted entries have been removed.

In [1]:
import warnings
warnings.simplefilter('ignore')

# Import libraries necessary for this project
import numpy as np
import pandas as pd

# Import sklearn.preprocessing.MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Import train_test_split
from sklearn.model_selection import train_test_split

# Import two metrics from sklearn - fbeta_score and accuracy_score
from sklearn.metrics import fbeta_score
from sklearn.metrics import accuracy_score

# Import the three supervised learning models from sklearn
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Import 'GridSearchCV', 'make_scorer', and any other necessary libraries
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

# Import functionality for cloning a model
from sklearn.base import clone

import altair as alt
alt.data_transformers.enable('json')

from gensim.models import Word2Vec

# TSNE
import time
from sklearn.manifold import TSNE

In [2]:
# Load the Census dataset
data = pd.read_csv("census.csv")

data.head()

| | age | workclass | education_level | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | 13.0 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174.0 | 0.0 | 40.0 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | Bachelors | 13.0 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0.0 | 0.0 | 13.0 | United-States | <=50K |
| 2 | 38 | Private | HS-grad | 9.0 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0.0 | 0.0 | 40.0 | United-States | <=50K |
| 3 | 53 | Private | 11th | 7.0 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0.0 | 0.0 | 40.0 | United-States | <=50K |
| 4 | 28 | Private | Bachelors | 13.0 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0.0 | 0.0 | 40.0 | Cuba | <=50K |


**Featureset Exploration**

* **age**: Continuous. 
* **workclass**: Categorical - Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. 
* **education_level**: Categorical - Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. 
* **education-num**: Continuous. 
* **marital-status**: Categorical - Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. 
* **occupation**: Categorical - Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. 
* **relationship**: Categorical - Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. 
* **race**: Categorical - Black, White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other. 
* **sex**: Categorical - Female, Male. 
* **capital-gain**: Continuous. 
* **capital-loss**: Continuous. 
* **hours-per-week**: Continuous. 
* **native-country**: Categorical - United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

### Analyze the columns

#### Age

In [3]:
alt.Chart(data).mark_bar().encode(
    alt.X("age", bin=alt.Bin(maxbins=25)),
    y='count()',
    color='income'
)

For the dashboard, I would like to combine the higher ages into a single bucket.

In [4]:
def categorical_age(x):
    if (x < 20):
        return '< 20'
    if (x < 30):
        return '20 - 30'
    if (x < 40):
        return '30 - 40'
    if (x < 50):
        return '40 - 50'
    if (x < 60):
        return '50 - 60'
    if (x < 70):
        return '60 - 70'
    if (x >= 70):
        return '70+'
    
data['_age'] = data['age'].apply(categorical_age)
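
The same bucketing can also be written with `pd.cut`, which avoids the chain of `if` statements. This is a minimal sketch rather than the notebook's own code: `age_bins`, `age_labels`, and the toy `ages` series are illustrative names, and `right=False` is chosen so that each bin includes its lower edge, matching the `<` comparisons above.

```python
import numpy as np
import pandas as pd

# Illustrative bin edges and labels mirroring categorical_age above.
age_bins = [-np.inf, 20, 30, 40, 50, 60, 70, np.inf]
age_labels = ['< 20', '20 - 30', '30 - 40', '40 - 50', '50 - 60', '60 - 70', '70+']

# Toy ages, including values that sit exactly on a bin edge.
ages = pd.Series([17, 20, 39, 40, 69, 70, 90])

# right=False makes each interval closed on the left: [20, 30), [30, 40), ...
age_buckets = pd.cut(ages, bins=age_bins, labels=age_labels, right=False).astype(str)
print(age_buckets.tolist())
```

The same pattern would apply to `hours-per-week`, which uses identical bucket boundaries.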

#### Capital-Gain

In [5]:
alt.Chart(data).mark_bar().encode(
    alt.X("capital-gain", bin=True),
    y='count()',
    color='income'
)

For the dashboard, I would like just two buckets for capital-gain: 0 or >0.

In [6]:
def categorical_gain(x):
    if (x > 0):
        return '>0'
    else:
        return '=0'
    
data['capital_gain'] = data['capital-gain'].apply(categorical_gain)

#### Capital-Loss

In [7]:
alt.Chart(data).mark_bar().encode(
    alt.X("capital-loss", bin=True),
    y='count()',
    color='income'
)

Similar to capital-gain, I would like to keep only 2 buckets for capital-loss.

In [8]:
def categorical_loss(x):
    if (x > 0):
        return '>0'
    else:
        return '=0'
    
data['capital_loss'] = data['capital-loss'].apply(categorical_loss)
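
Since `categorical_gain` and `categorical_loss` apply the same rule, both columns could be derived from one vectorized helper. A sketch, assuming a hypothetical helper name `zero_or_positive` and a toy frame standing in for `data`:

```python
import numpy as np
import pandas as pd

def zero_or_positive(series):
    """Label each value '>0' if it is positive, otherwise '=0'."""
    return pd.Series(np.where(series > 0, '>0', '=0'), index=series.index)

# Toy stand-in for the census frame.
toy = pd.DataFrame({'capital-gain': [2174.0, 0.0, 0.0],
                    'capital-loss': [0.0, 0.0, 1902.0]})

toy['capital_gain'] = zero_or_positive(toy['capital-gain'])
toy['capital_loss'] = zero_or_positive(toy['capital-loss'])
print(toy[['capital_gain', 'capital_loss']])
```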

#### Hours per week

In [9]:
alt.Chart(data).mark_bar().encode(
    alt.X("hours-per-week", bin=True),
    y='count()',
    color='income'
)

I would merge the following ranges, each group into a single category:
1. 0-10 and 10-20
2. 70-80, 80-90, and 90-100

In [10]:
def categorical_hours(x):
    if (x < 20):
        return '< 20'
    if (x < 30):
        return '20 - 30'
    if (x < 40):
        return '30 - 40'
    if (x < 50):
        return '40 - 50'
    if (x < 60):
        return '50 - 60'
    if (x < 70):
        return '60 - 70'
    if (x >= 70):
        return '70+'
    
data['hours_per_week'] = data['hours-per-week'].apply(categorical_hours)

#### Education Number and Education Level

In [11]:
alt.Chart(data).mark_bar().encode(
    alt.X("education-num", bin=alt.Bin(maxbins=16)),
    y='count()',
    color='income'
)

In [12]:
alt.Chart(data).mark_bar().encode(
    alt.X("education-num", bin=alt.Bin(maxbins=16)),
    y='count()',
    color='education_level:N'
)

`education-num` and `education_level` are highly correlated, so I will use only the education level in the dashboard. Further, I'll combine:
1. 5th-6th and 7th-8th into one category
2. 9th and 10th into one
3. 11th and 12th into one

In [13]:
def categorical_education(x):
    x = x.strip()
    to_return = x
    if x == '5th-6th' or x == '7th-8th':
        to_return = '5th-8th'
    if x == '9th' or x == '10th':
        to_return = '9th-10th'
    if x == '11th' or x == '12th':
        to_return = '11th-12th'
        
    return to_return
    
data['_education_level'] = data['education_level'].apply(categorical_education)
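
The three merges listed above can also be expressed as a lookup table. This is only a sketch: `education_merge` and the toy `levels` series are illustrative, and the leading spaces in the toy values are an assumption meant to show why the values are stripped first.

```python
import pandas as pd

# Illustrative lookup table: only the levels that get merged are listed.
education_merge = {
    '5th-6th': '5th-8th', '7th-8th': '5th-8th',
    '9th': '9th-10th', '10th': '9th-10th',
    '11th': '11th-12th', '12th': '11th-12th',
}

levels = pd.Series([' Bachelors', ' 11th', ' 7th-8th', ' HS-grad', ' 10th'])

# Strip whitespace first, then replace merged levels; other levels pass through unchanged.
merged = levels.str.strip().replace(education_merge)
print(merged.tolist())
```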

#### Workclass

In [14]:
alt.Chart(data).mark_bar().encode(
                            x= alt.X('count()', axis=alt.Axis(title='')),
                            y=  alt.Y('workclass:N', sort='-x', 
                                      axis=alt.Axis(title='', 
                                                    labelFontSize=12,
                                                    ticks=False)),
                            color='income'
                        )

#### Occupation

In [15]:
alt.Chart(data).mark_bar().encode(
                            x= alt.X('count()', axis=alt.Axis(title='')),
                            y=  alt.Y('occupation:N', sort='-x', 
                                      axis=alt.Axis(title='', 
                                                    labelFontSize=12,
                                                    ticks=False)),
                            color='income'
                        )

#### Marital Status

In [16]:
alt.Chart(data).mark_bar().encode(
                            x= alt.X('count()', axis=alt.Axis(title='')),
                            y=  alt.Y('marital-status:N', sort='-x', 
                                      axis=alt.Axis(title='', 
                                                    labelFontSize=12,
                                                    ticks=False)),
                            color='income'
                        )

#### Relationship

In [17]:
alt.Chart(data).mark_bar().encode(
                            x= alt.X('count()', axis=alt.Axis(title='')),
                            y=  alt.Y('relationship:N', sort='-x', 
                                      axis=alt.Axis(title='', 
                                                    labelFontSize=12,
                                                    ticks=False)),
                            color='income'
                        )

#### Gender

In [18]:
alt.Chart(data).mark_bar().encode(
                            x= alt.X('count()', axis=alt.Axis(title='')),
                            y=  alt.Y('sex:N', sort='-x', 
                                      axis=alt.Axis(title='', 
                                                    labelFontSize=12,
                                                    ticks=False)),
                            color='income'
                        )

#### Race

In [19]:
alt.Chart(data).mark_bar().encode(
                            x= alt.X('count()', axis=alt.Axis(title='')),
                            y=  alt.Y('race:N', sort='-x', 
                                      axis=alt.Axis(title='', 
                                                    labelFontSize=12,
                                                    ticks=False)),
                            color='income'
                        )

#### Native Country

In [20]:
alt.Chart(data).mark_bar().encode(
                            x= alt.X('count()', axis=alt.Axis(title='')),
                            y=  alt.Y('native-country:N', sort='-x', 
                                      axis=alt.Axis(title='', 
                                                    labelFontSize=12,
                                                    ticks=False)),
                            color = 'income'
                        )

For the purposes of the dashboard, I'll keep the top 9 countries and combine the rest as 'Others'.

In [21]:
top_countries = ['United-States', 'Mexico', 'Philippines', 'Germany', 'Puerto-Rico', 'Canada', 'India', 
                 'El-Salvador', 'Cuba']

def categorical_country(x):
    x = x.strip()
    if x in top_countries:
        return x
    else:
        return 'Others'
    
data['native_country'] = data['native-country'].apply(categorical_country)
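
Instead of hard-coding the list, the most frequent countries could be derived from the data itself. A sketch on toy values, where `n_top`, `countries`, and `grouped` are illustrative names and 3 is used instead of 9 only to keep the example small:

```python
import pandas as pd

countries = pd.Series(['United-States'] * 5 + ['Mexico'] * 3 + ['Cuba'] * 2 + ['Peru', 'Laos'])

n_top = 3  # the notebook keeps 9; 3 keeps the toy example small
top = countries.value_counts().nlargest(n_top).index

# Keep values that are in the top list; everything else becomes 'Others'.
grouped = countries.where(countries.isin(top), 'Others')
print(grouped.value_counts().to_dict())
```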

----
## Preparing the Data
Before data can be used as input for machine learning algorithms, it often must be cleaned, formatted, and restructured. Fortunately, this dataset has no invalid or missing entries to deal with. However, some features have qualities that must be adjusted.

### Normalizing Numerical Features
In addition to performing transformations on features that are highly skewed, it is often good practice to scale numerical features. Scaling does not change the shape of a feature's distribution (such as `'capital-gain'` or `'capital-loss'` above); however, normalization ensures that each feature is treated equally when applying supervised learners. Note that once scaling is applied, the values no longer carry their original meaning when inspected directly.

In [22]:
# Initialize a scaler, then apply it to the features
scaler = MinMaxScaler() # default=(0, 1)
numerical = ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
numerical_normalized = ['age_normalized', 'education-num_normalized', 'capital-gain_normalized',
                        'capital-loss_normalized', 'hours-per-week_normalized']

normalized_data = data.copy()  # copy so the raw frame is not modified in place
normalized_data[numerical_normalized] = data[numerical]
normalized_data[numerical_normalized] = scaler.fit_transform(normalized_data[numerical_normalized])
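
With its default range of (0, 1), `MinMaxScaler` maps each column value `x` to `(x - min) / (max - min)`. The sketch below reproduces that arithmetic with plain NumPy on a few made-up `hours` values. Because the mapping is linear, the ordering and relative spacing of the values are preserved.

```python
import numpy as np

hours = np.array([40.0, 13.0, 40.0, 60.0, 99.0, 1.0])

# Min-max scaling to [0, 1]: subtract the column minimum, then divide by the range.
hours_scaled = (hours - hours.min()) / (hours.max() - hours.min())
print(hours_scaled.round(3))
```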

In [23]:
categorical = ['workclass', 'education_level', 'marital-status', 'occupation', 'relationship',
               'race', 'sex', 'native-country']

def getSentence(x):
    arr = []
    for col in categorical:
        arr.append(x[col].strip())
    return arr

normalized_data['combined-categories'] = normalized_data.apply(getSentence, axis=1)
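
Each record becomes one "sentence" whose words are its eight categorical values. The sketch below builds the same structure column-wise on a single toy row taken from the `data.head()` output. The leading spaces are an assumption, added to show the effect of `strip()`, and `toy`, `stripped`, and `sentences` are illustrative names.

```python
import pandas as pd

categorical = ['workclass', 'education_level', 'marital-status', 'occupation',
               'relationship', 'race', 'sex', 'native-country']

# One toy row; values are padded with a leading space to show why strip() is used.
toy = pd.DataFrame([{
    'workclass': ' State-gov', 'education_level': ' Bachelors',
    'marital-status': ' Never-married', 'occupation': ' Adm-clerical',
    'relationship': ' Not-in-family', 'race': ' White',
    'sex': ' Male', 'native-country': ' United-States',
}])

# Strip every categorical column, then turn each row into a list of tokens.
stripped = toy[categorical].apply(lambda col: col.str.strip())
sentences = stripped.values.tolist()
print(sentences[0])
```

Every sentence has exactly eight tokens, which is why a window of 8 covers the whole record.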

In [24]:
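# window size of 8 includes all words in context
# ns_exponent of 0.0 samples all words equally
# note: `size` and `iter` are the gensim 3.x names; gensim 4.x renamed them to `vector_size` and `epochs`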
model = Word2Vec(list(normalized_data['combined-categories']), min_count=1, size=32, window=8, ns_exponent=0.0,
                 workers=8, iter=100)

In [25]:
# test vector for one of the categorical values
model.wv['Private']

array([ 1.4610063 , -2.6515975 , -1.9365681 , -0.941228  , -0.70973283,
       -1.3962952 , -0.86609143,  0.54141504, -1.8923876 , -1.5956795 ,
       -1.9748574 ,  0.21719351, -1.9046342 , -0.93459994, -0.040706  ,
        1.8496356 , -0.04559053,  0.43774414, -0.7138819 ,  0.5931256 ,
       -0.9088906 ,  0.11516427,  1.671683  ,  2.1076114 ,  0.82342803,
       -2.8486516 ,  3.059075  , -0.28296688,  2.365132  ,  1.5559391 ,
        3.4933639 ,  1.1827399 ], dtype=float32)

In [26]:
def getCategoryArray(x):
    arr = []
    for col in categorical:
        arr.append(model.wv[x[col].strip()])
    return np.mean(np.array(arr), axis=0)

In [27]:
normalized_data['categories_arr'] = normalized_data.apply(getCategoryArray, axis=1)
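
`getCategoryArray` represents a record by the element-wise mean of its category vectors. A toy illustration, using hypothetical 3-dimensional vectors in place of the trained 32-dimensional embeddings:

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the trained Word2Vec vectors.
toy_vectors = {
    'Private':   np.array([1.0, 0.0, 2.0]),
    'Bachelors': np.array([0.0, 2.0, 2.0]),
    'Male':      np.array([2.0, 1.0, -1.0]),
}

record = ['Private', 'Bachelors', 'Male']

# Stack to shape (n_categories, dim) and average over the category axis.
record_vector = np.mean(np.array([toy_vectors[token] for token in record]), axis=0)
print(record_vector)
```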

In [28]:
normalizer = np.amax(list(normalized_data['categories_arr'])) - np.amin(list(normalized_data['categories_arr']))

In [29]:
def getCombinedArray(x):
    arr = x['categories_arr']
    for col in numerical_normalized:
        arr = np.append(arr, x[col]*normalizer)
    return arr

In [30]:
normalized_data['combined_arr'] = normalized_data.apply(getCombinedArray, axis=1)
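
The min-max scaled numeric features lie in [0, 1], while the averaged embedding values span a wider range. Multiplying the numeric features by `normalizer` appears intended to put both on a comparable scale before they are concatenated. A sketch with made-up numbers, where `categories`, `numerics`, and `value_range` are illustrative names:

```python
import numpy as np

# Toy averaged category vectors (2 records, 3 dimensions) and min-max scaled numerics.
categories = np.array([[-2.0, 0.5, 1.0],
                       [0.0, 3.0, -1.0]])
numerics = np.array([[0.0, 1.0],
                     [0.5, 0.25]])

# Global range of the embedding values, as computed for `normalizer` above.
value_range = categories.max() - categories.min()

# Stretch the [0, 1] numerics to that range, then append them column-wise.
combined = np.hstack([categories, numerics * value_range])
print(combined.shape)
```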

In [31]:
def performTSNE(df, col, out_x='x', out_y='y', verbose=1, perplexity=40, n_iter=500):
    '''
    Perform t-SNE (t-distributed Stochastic Neighbor Embedding), a tool for visualizing high-dimensional data.
    It converts similarities between data points to joint probabilities and tries to minimize the
    Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding
    and the high-dimensional data.
    input:
        df - data frame
        col - name of the column holding the embedding vectors
        out_x - name of column to store x-coordinates
        out_y - name of column to store y-coordinates
        verbose - verbosity level; values above 0 print progress and the elapsed time
        perplexity - related to the number of nearest neighbors - usually a value between 5 and 50
        n_iter - number of iterations
    output:
        input df with 2 additional columns for the x and y coordinates
    '''
    X = np.array(list(df[col]))
    time_start = time.time()
    tsne = TSNE(n_components=2, verbose=verbose, random_state=32, perplexity=perplexity, n_iter=n_iter)
    tsne_results = tsne.fit_transform(X)
    if verbose > 0:
        print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
    df[out_x] = tsne_results[:,0]
    df[out_y] = tsne_results[:,1]
    
    return df
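
The objective named in the docstring is the Kullback-Leibler divergence, KL(P || Q) = sum over pairs of p_ij * log(p_ij / q_ij). The sketch below evaluates it for two small made-up joint distributions. It only illustrates the quantity being minimized and is not how scikit-learn implements t-SNE. `kl_divergence`, `P`, and `Q` are illustrative names.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) = sum_ij p_ij * log(p_ij / q_ij), skipping zero entries of P."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Toy joint probabilities over pairs of three points (zero diagonal, entries sum to 1).
P = np.array([[0.00, 0.25, 0.15],
              [0.25, 0.00, 0.10],
              [0.15, 0.10, 0.00]])
Q = np.array([[0.00, 0.20, 0.20],
              [0.20, 0.00, 0.10],
              [0.20, 0.10, 0.00]])

print(kl_divergence(P, P))  # identical distributions give zero divergence
print(kl_divergence(P, Q))  # a mismatched embedding gives a positive value
```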

In [32]:
normalized_data = performTSNE(normalized_data, 'combined_arr')

[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 45222 samples in 0.471s...
[t-SNE] Computed neighbors for 45222 samples in 81.830s...
[t-SNE] Computed conditional probabilities for sample 1000 / 45222
[t-SNE] Computed conditional probabilities for sample 2000 / 45222
[t-SNE] Computed conditional probabilities for sample 3000 / 45222
[t-SNE] Computed conditional probabilities for sample 4000 / 45222
[t-SNE] Computed conditional probabilities for sample 5000 / 45222
[t-SNE] Computed conditional probabilities for sample 6000 / 45222
[t-SNE] Computed conditional probabilities for sample 7000 / 45222
[t-SNE] Computed conditional probabilities for sample 8000 / 45222
[t-SNE] Computed conditional probabilities for sample 9000 / 45222
[t-SNE] Computed conditional probabilities for sample 10000 / 45222
[t-SNE] Computed conditional probabilities for sample 11000 / 45222
[t-SNE] Computed conditional probabilities for sample 12000 / 45222
[t-SNE] Computed conditional probabilities for s

Now that we have the x and y co-ordinates, we can get rid of the extra columns.

In [33]:
to_drop = ['age_normalized', 'education-num_normalized', 'capital-gain_normalized', 'capital-loss_normalized',
           'hours-per-week_normalized', 'categories_arr', 'combined-categories', 'combined_arr']

normalized_data.drop(columns=to_drop, inplace=True)

### Creating the Dashboard

In [34]:
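def get_bar_chart(data, col, color_col, title, sel=None, height=None):
    # alt.condition needs a selection, so fall back to a plain color encoding when none is given
    if sel is None:
        color = alt.Color(color_col + ':N')
    else:
        color = alt.condition(sel, alt.Color(color_col + ':N'), alt.value('lightgray'))

    chart = alt.Chart(data).mark_bar().encode(
                                x= alt.X('count()', axis=alt.Axis(title='')),
                                y=  alt.Y(col +':N', sort='-x', 
                                          axis=alt.Axis(title='', 
                                                        labelFontSize=11,
                                                        ticks=False)),
                                color=color
                            )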
    
    if height is None:
        chart = chart.properties(
            width=200,
            title=title
        )
    else:
        chart = chart.properties(
            width=200,
            height=height,
            title=title
        )
        
    if sel is not None:
        chart = chart.add_selection(
                        sel
                )
        
    return chart

In [35]:
normalized_data.columns

Index(['age', 'workclass', 'education_level', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income', '_age', 'capital_gain', 'capital_loss', 'hours_per_week',
       '_education_level', 'native_country', 'x', 'y'],
      dtype='object')

In [36]:
sex_sel = alt.selection_multi(fields=['sex'])
sex_chart = get_bar_chart(normalized_data, 'sex', 'income', 'Gender', sex_sel, 30)

race_sel = alt.selection_multi(fields=['race'])
race_chart = get_bar_chart(normalized_data, 'race', 'income', 'Race', race_sel, 75)

capital_gain_sel = alt.selection_multi(fields=['capital_gain'])
capital_gain_chart = get_bar_chart(normalized_data, 'capital_gain', 'income', 'Capital Gain', capital_gain_sel, 30)

capital_loss_sel = alt.selection_multi(fields=['capital_loss'])
capital_loss_chart = get_bar_chart(normalized_data, 'capital_loss', 'income', 'Capital Loss', capital_loss_sel, 30)

age_sel = alt.selection_multi(fields=['_age'])
age_chart = get_bar_chart(normalized_data, '_age', 'income', 'Age', age_sel, 100)

education_level_sel = alt.selection_multi(fields=['_education_level'])
education_chart = get_bar_chart(normalized_data, '_education_level', 'income', 'Education', education_level_sel, 200)

occupation_sel = alt.selection_multi(fields=['occupation'])
occupation_chart = get_bar_chart(normalized_data, 'occupation', 'income', 'Occupation', occupation_sel, 200)

workclass_sel = alt.selection_multi(fields=['workclass'])
workclass_chart = get_bar_chart(normalized_data, 'workclass', 'income', 'Workclass', workclass_sel, 120)

hours_per_week_sel = alt.selection_multi(fields=['hours_per_week'])
hours_chart = get_bar_chart(normalized_data, 'hours_per_week', 'income', 'Hours/week', hours_per_week_sel, 100)

marital_status_sel = alt.selection_multi(fields=['marital-status'])
marital_chart = get_bar_chart(normalized_data, 'marital-status', 'income', 'Marital Status', marital_status_sel)

relationship_sel = alt.selection_multi(fields=['relationship'])
relationship_chart = get_bar_chart(normalized_data, 'relationship', 'income', 'Relationship', relationship_sel)

native_country_sel = alt.selection_multi(fields=['native_country'])
country_chart = get_bar_chart(normalized_data, 'native_country', 'income', 'Native Country', native_country_sel)

income_chart = alt.Chart(normalized_data).mark_bar().encode(
                                x= alt.X('count()', axis=alt.Axis(title='')),
                                y=  alt.Y('income:N', sort='-x', 
                                          axis=alt.Axis(title='', 
                                                        labelFontSize=11,
                                                        ticks=False)),
                                color='income:N'
                            ).properties(
                                width=500
                            )
text = income_chart.mark_text(
                align='left',
                baseline='middle',
                dx=3  # Nudges text to right so it doesn't appear on top of the bar
            ).encode(
                text=alt.Text('count()', format=',d')
            )

scatter = alt.Chart(normalized_data).mark_circle().encode(
                        x=alt.X('x', axis=alt.Axis(title='')),
                        y=alt.Y('y', axis=alt.Axis(title='')),
                        color = 'income:N',
                        tooltip=['age', 'workclass', 'education_level', 'education-num',
                                 'marital-status', 'occupation', 'relationship', 'race', 'sex',
                                 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']
                    ).properties(
                        width=500,
                        height=500
                    ).add_selection(alt.selection_single())
                    # selection_single is a workaround for a vega-lite bug, so that the tooltip is shown

scatter_and_income =  (scatter & (income_chart + text)).transform_filter(
                        sex_sel
                    ).transform_filter(
                        race_sel
                    ).transform_filter(
                        capital_gain_sel
                    ).transform_filter(
                        capital_loss_sel
                    ).transform_filter(
                        age_sel
                    ).transform_filter(
                        education_level_sel
                    ).transform_filter(
                        occupation_sel
                    ).transform_filter(
                        workclass_sel
                    ).transform_filter(
                        hours_per_week_sel
                    ).transform_filter(
                        marital_status_sel
                    ).transform_filter(
                        relationship_sel
                    ).transform_filter(
                        native_country_sel
                    )

dashboard = (scatter_and_income
             | (sex_chart & race_chart & capital_gain_chart & capital_loss_chart & age_chart & hours_chart)
             | (education_chart & occupation_chart & workclass_chart)
             | (marital_chart & relationship_chart & country_chart)
                            ).configure_legend(orient='top'
                            ).configure_title(fontSize=12)
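
The long chain of `transform_filter` calls in the cell above could be folded into a loop over the selections. The sketch below shows the pattern with `functools.reduce`. `apply_filters` is a hypothetical helper, and `RecordingChart` is a stand-in class used only so the example runs without Altair. Any object with a `transform_filter` method that returns a new chart should work the same way.

```python
from functools import reduce

def apply_filters(chart, selections):
    """Call chart.transform_filter(sel) once per selection, in order."""
    return reduce(lambda acc, sel: acc.transform_filter(sel), selections, chart)

class RecordingChart:
    """Hypothetical stand-in for a chart object; it just records the filters applied."""
    def __init__(self, filters=()):
        self.filters = tuple(filters)

    def transform_filter(self, sel):
        # Return a new object rather than mutating in place
        # (an Altair chart is assumed to behave the same way).
        return RecordingChart(self.filters + (sel,))

filtered = apply_filters(RecordingChart(), ['sex_sel', 'race_sel', 'age_sel'])
print(filtered.filters)
```

In the notebook, the call would presumably look like `apply_filters(scatter & (income_chart + text), [sex_sel, race_sel, ...])`.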

In [37]:
dashboard.display()

In [38]:
dashboard.save('dashboard.html')

### Alternate Dashboard

In [39]:
def get_alternate_bar_chart(data, col, color_col, title, height=None):
    chart = alt.Chart(data).mark_bar().encode(
                                x= alt.X('count()', axis=alt.Axis(title='')),
                                y=  alt.Y(col +':N', sort='-x', 
                                          axis=alt.Axis(title='', 
                                                        labelFontSize=11,
                                                        ticks=False)),
                                color=color_col + ':N'
                            )
    
    if height is None:
        chart = chart.properties(
            width=200,
            title=title
        )
    else:
        chart = chart.properties(
            width=200,
            height=height,
            title=title
        )
    chart = chart.transform_filter(brush)  # `brush` is the global interval selection defined in the next cell
    return chart

In [40]:
brush = alt.selection_interval()

sex_chart = get_alternate_bar_chart(normalized_data, 'sex', 'income', 'Gender')
race_chart = get_alternate_bar_chart(normalized_data, 'race', 'income', 'Race')
capital_gain_chart = get_alternate_bar_chart(normalized_data, 'capital_gain', 'income', 'Capital Gain')
capital_loss_chart = get_alternate_bar_chart(normalized_data, 'capital_loss', 'income', 'Capital Loss')
age_chart = get_alternate_bar_chart(normalized_data, '_age', 'income', 'Age')
education_chart = get_alternate_bar_chart(normalized_data, '_education_level', 'income', 'Education')
occupation_chart = get_alternate_bar_chart(normalized_data, 'occupation', 'income', 'Occupation')
workclass_chart = get_alternate_bar_chart(normalized_data, 'workclass', 'income', 'Workclass')
hours_chart = get_alternate_bar_chart(normalized_data, 'hours_per_week', 'income', 'Hours/week')
marital_chart = get_alternate_bar_chart(normalized_data, 'marital-status', 'income', 'Marital Status')
relationship_chart = get_alternate_bar_chart(normalized_data, 'relationship', 'income', 'Relationship')
country_chart = get_alternate_bar_chart(normalized_data, 'native_country', 'income', 'Native Country')

income_chart = alt.Chart(normalized_data).mark_bar().encode(
                                x= alt.X('count()', axis=alt.Axis(title='')),
                                y=  alt.Y('income:N', sort='-x', 
                                          axis=alt.Axis(title='', 
                                                        labelFontSize=11,
                                                        ticks=False)),
                                color='income:N'
                            ).properties(
                                width=500
                            )
                                 
text = income_chart.mark_text(
                align='left',
                baseline='middle',
                dx=3  # Nudges text to right so it doesn't appear on top of the bar
            ).encode(
                text=alt.Text('count()', format=',d')
            )

income_and_text = (income_chart + text).transform_filter(
                        brush
                    )
                                 
scatter = alt.Chart(normalized_data).mark_circle().encode(
                        x=alt.X('x', axis=alt.Axis(title='', labels=False)),
                        y=alt.Y('y', axis=alt.Axis(title='', labels=False)),
                        color = alt.condition(brush, alt.Color('income:N'), alt.value('lightgray')),
                        tooltip=['age', 'workclass', 'education_level', 'education-num',
                                 'marital-status', 'occupation', 'relationship', 'race', 'sex',
                                 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']
                    ).properties(
                        width=500,
                        height=500
                    ).add_selection(brush)
                                 
alternate_dashboard = ((scatter & income_and_text)
             | (sex_chart & race_chart & capital_gain_chart & capital_loss_chart & age_chart & hours_chart)
             | (education_chart & occupation_chart & workclass_chart)
             | (marital_chart & relationship_chart & country_chart)
                            ).configure_legend(orient='top'
                            ).configure_title(fontSize=12)

In [41]:
alternate_dashboard.display()

In [42]:
alternate_dashboard.save('alternate_dashboard.html')