Adult Dataset - Domain
===================

In this notebook, we infer the domain (the set of distinct values) of each attribute of the Adult dataset.

In [1]:
import json
from pathlib import Path

import numpy as np
import pandas as pd

### Read the dataset

**TODO**: set `DATA_PATH` to the directory that contains `adult.csv`. The domain file is saved to the same directory.

In [2]:
DATA_PATH = Path('/home/nap/Workspace/Privacy-DARC/rmckenna/data')  # Directory holding adult.csv; the domain file is saved here too
ADULT_DATASET_PATH = DATA_PATH / 'adult.csv'

df = pd.read_csv(ADULT_DATASET_PATH)
df.sample(10)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
14878,31,Private,234500,Bachelors,13,Married-civ-spouse,Adm-clerical,Wife,White,Female,0,0,40,United-States,<=50K
15419,28,Private,139903,Bachelors,13,Never-married,Machine-op-inspct,Unmarried,Black,Female,0,0,40,United-States,<=50K
19533,26,Private,86483,Some-college,10,Never-married,Exec-managerial,Not-in-family,White,Female,0,0,40,United-States,<=50K
9984,28,Private,118861,10th,6,Married-civ-spouse,Craft-repair,Wife,Other,Female,0,0,48,Guatemala,<=50K
9780,49,Federal-gov,105586,Assoc-voc,11,Married-civ-spouse,Craft-repair,Husband,Asian-Pac-Islander,Male,0,0,40,United-States,>50K
26563,50,Private,198362,Assoc-voc,11,Widowed,Exec-managerial,Unmarried,White,Female,0,0,40,United-States,<=50K
13022,54,Private,257765,7th-8th,4,Divorced,Machine-op-inspct,Not-in-family,White,Male,0,0,40,Guatemala,<=50K
28393,35,Private,35945,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,United-States,>50K
292,29,Local-gov,220419,Bachelors,13,Never-married,Protective-serv,Not-in-family,White,Male,0,0,56,United-States,<=50K
8819,24,Private,207940,HS-grad,9,Married-civ-spouse,Other-service,Wife,White,Female,0,0,30,United-States,<=50K


### Infer the columns (= attributes)

In [3]:
columns = list(df.columns)
columns

['age',
 'workclass',
 'fnlwgt',
 'education',
 'education-num',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'capital-gain',
 'capital-loss',
 'hours-per-week',
 'native-country',
 'salary']

### Infer the domain of each attribute

Note that every attribute is treated as categorical here, including the numeric ones: the domain of a column is simply the sorted list of its distinct values.

In [4]:
domain = {}

for column in columns:
    domain[column] = sorted(df[column].unique())
    print(f'{column} has {len(domain[column])} distinct values')

age has 73 distinct values
workclass has 9 distinct values
fnlwgt has 21648 distinct values
education has 16 distinct values
education-num has 16 distinct values
marital-status has 7 distinct values
occupation has 15 distinct values
relationship has 6 distinct values
race has 5 distinct values
sex has 2 distinct values
capital-gain has 119 distinct values
capital-loss has 92 distinct values
hours-per-week has 94 distinct values
native-country has 42 distinct values
salary has 2 distinct values
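The counts above show that some numeric columns (for example `fnlwgt`) have far too many distinct values to be used as categories directly. One way to list the candidates for discretization is to select the numeric columns and look at their cardinality. The sketch below runs on a small made-up frame rather than on `adult.csv`; the names `toy`, `numeric_columns` and `cardinality` are illustrative only, and a numeric column with few values (such as `education-num`) may well be left as it is.

```python
import pandas as pd

# Toy rows for illustration only (not a sample of adult.csv)
toy = pd.DataFrame({
    'age': [31, 28, 26, 49],
    'workclass': ['Private', 'Private', 'Private', 'Federal-gov'],
    'capital-gain': [0, 0, 0, 0],
})

# Columns with a numeric dtype are the candidates for discretization
numeric_columns = toy.select_dtypes(include='number').columns.tolist()

# Number of distinct values per candidate column
cardinality = toy[numeric_columns].nunique().to_dict()

print(numeric_columns)
print(cardinality)
```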


### Discretize the integer/float attributes

**TODO**: choose which attributes to discretize and define their bins.

In [5]:
# Age:  0 => [17, 26[  |  1 => [26, 62[  |  2 => [62, 91[
domain['age'] = list(range(3))

# Final weight: 0 => [12285, 100000[  |  1 => [100000, 200000[  |  2 => [200000, 300000[  |  3 => [300000, 400000[
#               4 => [400000, 500000[  |  5 => [500000, 1484706[
domain['fnlwgt'] = list(range(6))

# Capital gain: 0 => [0, 5000[  |  1 => [5000, 10000[  |  2 => [10000, 20000[  |  3 => [20000, 100000[
domain['capital-gain'] = list(range(4))

# Capital loss: 0 => [0, 1000[  |  1 => [1000, 2000[  |  2 => [2000, 3000[  |  3 => [3000, 4357[
domain['capital-loss'] = list(range(4))

# Hours per week: 0 => [0, 25[  |  1 => [25, 50[  |  2 => [50, 75[  |  3 => [75, 100[
domain['hours-per-week'] = list(range(4))
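The cell above only replaces the domain entries with bin indices; the values in `df` are not modified. If the data itself should match the new domain, each value has to be mapped to the index of its bin. A minimal sketch with `pandas.cut` follows, assuming the bin edges written in the comments above are the intended ones. `BIN_EDGES` and `discretize` are hypothetical names, and the rows in `toy` are made up for illustration.

```python
import pandas as pd

# Bin edges copied from the comments in the cell above.
# Intervals are left-closed and right-open, i.e. [low, high[.
BIN_EDGES = {
    'age': [17, 26, 62, 91],
    'fnlwgt': [12285, 100000, 200000, 300000, 400000, 500000, 1484706],
    'capital-gain': [0, 5000, 10000, 20000, 100000],
    'capital-loss': [0, 1000, 2000, 3000, 4357],
    'hours-per-week': [0, 25, 50, 75, 100],
}


def discretize(frame, bin_edges=BIN_EDGES):
    """Return a copy of `frame` where each listed column holds its bin index."""
    out = frame.copy()
    for column, edges in bin_edges.items():
        if column in out.columns:
            # right=False gives [low, high[ intervals;
            # labels=False returns the integer bin index instead of an Interval
            out[column] = pd.cut(out[column], bins=edges, right=False, labels=False)
    return out


# Toy rows chosen to hit the boundaries of the bins
toy = pd.DataFrame({'age': [17, 26, 90], 'hours-per-week': [40, 50, 99]})
binned = discretize(toy)
print(binned)
```

Values that fall outside the outermost edges become `NaN` with `pandas.cut`, so the edges should cover the full range of each column.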

### Save the domain in JSON format

In [6]:
print(domain)  # Final version of the domains

{'age': [0, 1, 2], 'workclass': ['?', 'Federal-gov', 'Local-gov', 'Never-worked', 'Private', 'Self-emp-inc', 'Self-emp-not-inc', 'State-gov', 'Without-pay'], 'fnlwgt': [0, 1, 2, 3, 4, 5], 'education': ['10th', '11th', '12th', '1st-4th', '5th-6th', '7th-8th', '9th', 'Assoc-acdm', 'Assoc-voc', 'Bachelors', 'Doctorate', 'HS-grad', 'Masters', 'Preschool', 'Prof-school', 'Some-college'], 'education-num': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16], 'marital-status': ['Divorced', 'Married-AF-spouse', 'Married-civ-spouse', 'Married-spouse-absent', 'Never-married', 'Separated', 'Widowed'], 'occupation': ['?', 'Adm-clerical', 'Armed-Forces', 'Craft-repair', 'Exec-managerial', 'Farming-fishing', 'Handlers-cleaners', 'Machine-op-inspct', 'Other-service', 'Priv-house-serv', 'Prof-specialty', 'Protective-serv', 'Sales', 'Tech-support', 'Transport-moving'], 'relationship': ['Husband', 'Not-in-family', 'Other-relative', 'Own-child', 'Unmarried', 'Wife'], 'race': ['Amer-Indian-Eskimo', 'As

In [7]:
class NpEncoder(json.JSONEncoder):
    """
    Custom Encoder to support numpy types.
    
    Source: https://stackoverflow.com/a/57915246/4075096.
    """
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        elif isinstance(obj, np.floating):
            return float(obj)
        elif isinstance(obj, np.ndarray):
            return obj.tolist()
        else:
            return super().default(obj)
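The custom encoder is needed because `sorted(df[column].unique())` yields NumPy scalars for integer columns, and the standard `json` module rejects `np.int64`. The sketch below reproduces the failure on a small made-up array and shows an alternative to a custom encoder: converting to native Python types with `tolist()` before dumping. This is a different technique from `NpEncoder`, shown only for comparison.

```python
import json

import numpy as np

# sorted() over a NumPy array returns a list of np.int64 scalars,
# like sorted(df[column].unique()) does for an integer column
values = sorted(np.array([3, 1, 2], dtype=np.int64))

try:
    json.dumps(values)
    failed = False
except TypeError:
    # np.int64 is not a subclass of int, so json cannot serialize it
    failed = True

# Alternative: convert to native Python ints before serializing
native = np.array(values).tolist()
encoded = json.dumps(native)

print(failed, encoded)
```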

In [8]:
# Use a context manager so the file is flushed and closed after writing
with open(DATA_PATH / 'adult-domain.json', 'w') as f:
    json.dump(domain, f, cls=NpEncoder)
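To check the saved file, it can be read back and compared with the in-memory domain. The sketch below does this round trip in a temporary directory with a small made-up domain, so it does not touch `DATA_PATH`. The `sizes` dictionary is an assumption about what a consumer of the file might want (the number of values per attribute), not something this notebook prescribes.

```python
import json
import tempfile
from pathlib import Path

# Small stand-in for the real domain (illustration only)
domain_example = {'age': [0, 1, 2], 'sex': ['Female', 'Male']}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'adult-domain.json'

    # Write the domain, then read it back from disk
    with open(path, 'w') as f:
        json.dump(domain_example, f)
    with open(path) as f:
        loaded = json.load(f)

# Number of values per attribute
sizes = {name: len(values) for name, values in loaded.items()}
print(sizes)
```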