# Build a Linear Model with Estimators

Using TensorFlow's ```tf.estimator``` API, we will use census data about people's age, education, marital status, and occupation to predict whether they make more than $50,000 a year. We will train a *logistic regression* model that outputs a number between 0 and 1: the estimated probability that a person makes more than $50K a year.
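
To make "a number between 0 and 1" concrete, here is a minimal sketch of what a logistic regression model computes: a weighted sum of the features passed through a sigmoid. The weights, bias, and feature values are made up for illustration and are not learned from the census data.

```python
import math

def sigmoid(z):
    """Map any real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    """Logistic regression: sigmoid of a weighted sum of the features."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical weights for two already-scaled features (illustration only).
weights = [0.8, -0.5]
bias = -1.0
p = predict_probability([1.5, 0.2], weights, bias)
print(round(p, 3))
```

A score of 0 maps to a probability of exactly 0.5; positive scores lean towards ">50K" and negative scores towards "<=50K".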

In [1]:
import tensorflow as tf
import tensorflow.feature_column as fc

import os
import sys

import matplotlib.pyplot as plt
from IPython.display import clear_output

In [2]:
tf.enable_eager_execution()

Download the official implementation (TF's "wide and deep" model from the TensorFlow models repository):

In [3]:
! pip install -q requests
! git clone --depth 1 https://github.com/tensorflow/models

fatal: destination path 'models' already exists and is not an empty directory.


Add the root directory of the repository to the Python path

In [4]:
models_path = os.path.join(os.getcwd(), 'models')

sys.path.append(models_path)

Download the dataset

In [5]:
from official.wide_deep import census_dataset
from official.wide_deep import census_main

census_dataset.download("/tmp/census_data/")

## Command Line Usage

To run the repo's included program and experiment with this type of model, add ```tensorflow/models``` to ```PYTHONPATH```:

In [6]:
if "PYTHONPATH" in os.environ:
    os.environ["PYTHONPATH"] += os.pathsep + models_path
else:
    os.environ["PYTHONPATH"] = models_path

In [7]:
!python -m official.wide_deep.census_main --help

Train DNN on census income dataset.
flags:

/home/waydegg/development/projects/loftyai/_learning/tensorflow/models/official/wide_deep/census_main.py:
  -bs,--batch_size:
    Batch size for training and evaluation. When using multiple gpus, this is
    the
    global batch size for all devices. For example, if the batch size is 32 and
    there are 4 GPUs, each GPU will get 8 examples on each step.
    (default: '40')
    (an integer)
  --[no]clean:
    If set, model_dir will be removed if it exists.
    (default: 'false')
  -dd,--data_dir:
    The location of the input data.
    (default: '/tmp/census_data')
  --[no]download_if_missing:
    Download data to data_dir if it is not already present.
    (default: 'true')
  -ebe,--epochs_between_evals:
    The number of training epochs to run between evaluations.
    (default: '2')
    (an integer)
  -ed,--export_dir:
    If set, a SavedModel serialization of the model will be exported to this
    directory at the 

In [8]:
# Run the model
!python -m official.wide_deep.census_main --model_type=wide --train_epochs=2

I0619 20:52:58.103579 140369188927296 estimator.py:201] Using config: {'_model_dir': '/tmp/census_model', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': device_count {
  key: "GPU"
  value: 0
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fa9d6140e80>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
W0619 20:52:58.104118 140369188927296 tf_logging.py:161] 'cpuinfo' not imported. CPU info will not be logged.
2019-06-19 20:52:58.104257: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU s

I0619 20:53:01.924414 140369188927296 basic_session_run_hooks.py:680] global_step/sec: 367.778
I0619 20:53:01.925095 140369188927296 basic_session_run_hooks.py:247] average_loss = 0.29963666, loss = 11.985467 (0.272 sec)
I0619 20:53:01.925290 140369188927296 basic_session_run_hooks.py:247] loss = 11.985467, step = 2030 (0.272 sec)
I0619 20:53:02.197356 140369188927296 basic_session_run_hooks.py:680] global_step/sec: 366.386
I0619 20:53:02.198033 140369188927296 basic_session_run_hooks.py:247] average_loss = 0.25066406, loss = 10.026562 (0.273 sec)
I0619 20:53:02.198242 140369188927296 basic_session_run_hooks.py:247] loss = 10.026562, step = 2130 (0.273 sec)
I0619 20:53:02.463029 140369188927296 basic_session_run_hooks.py:680] global_step/sec: 376.382
I0619 20:53:02.463628 140369188927296 basic_session_run_hooks.py:247] average_loss = 0.33679628, loss = 13.471851 (0.266 sec)
I0619 20:53:02.463823 140369188927296 basic_session_run_hooks.py:247] loss = 13.471851, step = 2230 (0.266 sec)
I
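
A quick check on the log lines above: `loss` appears to be the per-batch sum and `average_loss` the per-example mean. This sketch assumes the run used the default batch size of 40 listed in the `--help` output (the command above does not override it).

```python
# (average_loss, loss) pairs copied from the training log above.
logged = [
    (0.29963666, 11.985467),
    (0.25066406, 10.026562),
    (0.33679628, 13.471851),
]
batch_size = 40  # default from --help; assumed unchanged for this run

for average_loss, loss in logged:
    reconstructed = average_loss * batch_size
    print(f"{average_loss} * {batch_size} = {reconstructed:.6f} (logged loss: {loss})")
```

This is why `loss` looks large next to `average_loss`; when comparing runs with different batch sizes, `average_loss` is the safer number to look at.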

## Data

We'll use ```census_dataset.py```, which helps with downloading and cleaning up the data.

In [9]:
! ls /tmp/census_data

adult.data  adult.test


In [10]:
train_file = "/tmp/census_data/adult.data"
test_file = "/tmp/census_data/adult.test"

In [11]:
import pandas as pd

train_df = pd.read_csv(train_file, header=None, names=census_dataset._CSV_COLUMNS)
test_df = pd.read_csv(test_file, header=None, names=census_dataset._CSV_COLUMNS)

In [12]:
train_df.head()

| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | gender | capital_gain | capital_loss | hours_per_week | native_country | income_bracket |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


Create an ```input_function``` that prepares the data so it can be fed to the model. This function returns a ```tf.data.Dataset``` of batches of (```features-dict```, ```label```) pairs. It is not called until it is passed to a ```tf.estimator.Estimator``` method such as ```train``` or ```evaluate```.

In [13]:
def easy_input_function(df, label_key, num_epochs, shuffle, batch_size):
    label = df[label_key]
    ds = tf.data.Dataset.from_tensor_slices((dict(df), label))
    
    if shuffle:
        ds = ds.shuffle(10000)
    
    ds = ds.batch(batch_size).repeat(num_epochs)
    
    return ds
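
The order of operations in `easy_input_function` (shuffle, then batch, then repeat) can be imitated with plain Python lists. This is only a toy stand-in for `tf.data`: it does a full shuffle per epoch rather than using a fixed-size shuffle buffer, and `toy_pipeline` and `seed` are names made up for this sketch.

```python
import random

def toy_pipeline(rows, num_epochs, shuffle, batch_size, seed=0):
    """Yield batches using the same order of operations: shuffle, batch, repeat."""
    rng = random.Random(seed)
    for _ in range(num_epochs):                      # repeat(num_epochs)
        epoch_rows = list(rows)
        if shuffle:
            rng.shuffle(epoch_rows)                  # stands in for ds.shuffle(...)
        for start in range(0, len(epoch_rows), batch_size):
            yield epoch_rows[start:start + batch_size]   # batch(batch_size)

batches = list(toy_pipeline(range(5), num_epochs=2, shuffle=False, batch_size=2))
print(batches)  # [[0, 1], [2, 3], [4], [0, 1], [2, 3], [4]]
```

Because batching happens before repeating, a batch never mixes rows from two epochs; the price is a short final batch whenever the row count is not a multiple of the batch size.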

With eager execution enabled, we can easily inspect the result of the input function:

In [14]:
ds = easy_input_function(train_df, label_key='income_bracket', num_epochs=5, shuffle=True, batch_size=10)

for feature_batch, label_batch in ds.take(1):
    print('Some feature keys: ', list(feature_batch.keys())[:5])
    print()
    print("A batch of Ages : ", feature_batch["age"])
    print()
    print("A batch of Labels: ", label_batch)

W0619 20:53:10.015953 140536260691776 deprecation.py:323] From /home/waydegg/anaconda3/envs/fastai-course-v3/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py:532: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.


Some feature keys:  ['age', 'workclass', 'fnlwgt', 'education', 'education_num']

A batch of Ages :  tf.Tensor([48 30 53 36 51 31 50 39 43 31], shape=(10,), dtype=int32)

A batch of Labels:  tf.Tensor(
[b'<=50K' b'<=50K' b'<=50K' b'>50K' b'>50K' b'>50K' b'<=50K' b'<=50K'
 b'>50K' b'<=50K'], shape=(10,), dtype=string)


Let's look at the `input_fn` from `census_dataset`. Larger datasets should be *streamed from disk*, and this function does just that:

In [15]:
import inspect
print(inspect.getsource(census_dataset.input_fn))

def input_fn(data_file, num_epochs, shuffle, batch_size):
  """Generate an input function for the Estimator."""
  assert tf.gfile.Exists(data_file), (
      '%s not found. Please make sure you have run census_dataset.py and '
      'set the --data_dir argument to the correct path.' % data_file)

  def parse_csv(value):
    tf.logging.info('Parsing {}'.format(data_file))
    columns = tf.decode_csv(value, record_defaults=_CSV_COLUMN_DEFAULTS)
    features = dict(zip(_CSV_COLUMNS, columns))
    labels = features.pop('income_bracket')
    classes = tf.equal(labels, '>50K')  # binary classification
    return features, classes

  # Extract lines from input files using the Dataset API.
  dataset = tf.data.TextLineDataset(data_file)

  if shuffle:
    dataset = dataset.shuffle(buffer_size=_NUM_EXAMPLES['train'])

  dataset = dataset.map(parse_csv, num_parallel_calls=5)

  # We call repeat after shuffling, rather than before, to prevent separate
  # epochs from blending together.
  dataset = 

The `Estimator` expects an input function that takes no arguments, so we wrap `input_fn` with `functools.partial` to bind its arguments ahead of time:

In [16]:
import functools

train_inpf = functools.partial(census_dataset.input_fn, train_file, num_epochs=2, shuffle=True, batch_size=64)
test_inpf = functools.partial(census_dataset.input_fn, test_file, num_epochs=1, shuffle=False, batch_size=64)
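
A small sketch of what `functools.partial` does here, using a stand-in `input_fn` that simply echoes its arguments (the real one builds a `tf.data.Dataset`):

```python
import functools

def input_fn(data_file, num_epochs, shuffle, batch_size):
    """Stand-in for census_dataset.input_fn that just echoes its arguments."""
    return (data_file, num_epochs, shuffle, batch_size)

train_inpf = functools.partial(input_fn, "/tmp/census_data/adult.data",
                               num_epochs=2, shuffle=True, batch_size=64)

# The wrapped object needs no arguments at call time.
print(train_inpf())
```

Nothing runs when the partial is created; the wrapped function is only executed once something calls it with no arguments.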

### Feature Columns

**Base Feature Columns**

Numeric Columns (for Continuous features)

In [17]:
age = fc.numeric_column('age')

In [18]:
# Inspect the feature column result
fc.input_layer(feature_batch, [age]).numpy()

W0619 20:53:10.141360 140536260691776 deprecation.py:323] From /home/waydegg/anaconda3/envs/fastai-course-v3/lib/python3.7/site-packages/tensorflow/python/feature_column/feature_column.py:205: NumericColumn._get_dense_tensor (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed after 2018-11-30.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.


W0619 20:53:10.142695 140536260691776 deprecation.py:323] From /home/waydegg/anaconda3/envs/fastai-course-v3/lib/python3.7/site-packages/tensorflow/python/feature_column/feature_column.py:2121: NumericColumn._transform_feature (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed after 2018-11-30.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.


W0619 20:53:10.143920 140536260691776 deprecation.py:323] From /home/waydegg/anaconda3/envs/fastai-course-v3/lib/python3.7/site-packages/tensorflow/python/feature_column/feature_column_v2.py:2703: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.


W0619 20:53:10.144951 140536260691776 deprecation.py:323] From /home/waydegg/anaconda3/envs/fastai-course-v3/lib/python3.7/site-packages/tensorflow/python/feature_column/feature_column.py:206: NumericColumn._variable_shape (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed after 2018-11-30.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.


array([[48.],
       [30.],
       [53.],
       [36.],
       [51.],
       [31.],
       [50.],
       [39.],
       [43.],
       [31.]], dtype=float32)

Train and evaluate a model using only the `age` feature:

In [19]:
classifier = tf.estimator.LinearClassifier(feature_columns=[age])
classifier.train(train_inpf)
result = classifier.evaluate(test_inpf)

clear_output() # used for display in notebook
print(result)

{'accuracy': 0.74737424, 'accuracy_baseline': 0.76377374, 'auc': 0.67835975, 'auc_precision_recall': 0.31139234, 'average_loss': 0.52639776, 'label/mean': 0.23622628, 'loss': 33.60895, 'precision': 0.17675544, 'prediction/mean': 0.27435714, 'recall': 0.01898076, 'global_step': 1018}
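
Reading this result: `accuracy_baseline` looks like the accuracy of always predicting the more common class, which can be recomputed from `label/mean` (the fraction of positive labels). The numbers are copied from the output above; the interpretation is an assumption that the arithmetic supports.

```python
label_mean = 0.23622628   # 'label/mean' from the evaluation result above
accuracy = 0.74737424     # 'accuracy' from the same result

# Always predicting the more common class is right this fraction of the time.
baseline = max(label_mean, 1.0 - label_mean)

print(f"baseline = {baseline:.8f}")
print("beats baseline:", accuracy > baseline)
```

With `age` alone the model scores below the baseline, so on accuracy it does no better than always answering "<=50K".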


We can define a `NumericColumn` for each continuous feature that we want to use in the model:

In [20]:
train_df.head(2)

| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | gender | capital_gain | capital_loss | hours_per_week | native_country | income_bracket |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |


In [21]:
education_num = tf.feature_column.numeric_column('education_num')
capital_gain = tf.feature_column.numeric_column('capital_gain')
capital_loss = tf.feature_column.numeric_column('capital_loss')
hours_per_week = tf.feature_column.numeric_column('hours_per_week')

my_numeric_columns = [age, education_num, capital_gain, capital_loss, hours_per_week]

fc.input_layer(feature_batch, my_numeric_columns).numpy()

array([[  48.,    0.,    0.,    9.,   40.],
       [  30.,    0., 1719.,   10.,   25.],
       [  53.,    0.,    0.,    4.,   40.],
       [  36.,    0., 1902.,   13.,   50.],
       [  51.,    0., 1564.,   16.,   70.],
       [  31.,    0.,    0.,   13.,   40.],
       [  50.,    0.,    0.,    7.,   40.],
       [  39.,    0.,    0.,    9.,   40.],
       [  43.,    0.,    0.,   10.,   50.],
       [  31.,    0.,    0.,   12.,   40.]], dtype=float32)
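
Note that the columns in the array above do not follow the order of `my_numeric_columns`: the second column is all zeros and the fourth holds small integers that look like `education_num`. This is consistent with `input_layer` laying the columns out in alphabetical order of their names, which we can check with a plain sort:

```python
names = ['age', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']
ordered = sorted(names)
print(ordered)
# ['age', 'capital_gain', 'capital_loss', 'education_num', 'hours_per_week']
```

Read this way, the second row (`30, 0, 1719, 10, 25`) is age 30, no capital gain, a capital loss of 1719, 10 years of education, and 25 hours per week.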

Retrain the model on these features by changing the `feature_columns` argument to the constructor:

In [25]:
classifier = tf.estimator.LinearClassifier(feature_columns=my_numeric_columns)
classifier.train(train_inpf)

result = classifier.evaluate(test_inpf)

clear_output()

for key,value in sorted(result.items()):
    print('%s, %s' % (key, value))

accuracy, 0.78140163
accuracy_baseline, 0.76377374
auc, 0.7933894
auc_precision_recall, 0.56146234
average_loss, 1.2828498
global_step, 1018
label/mean, 0.23622628
loss, 81.90619
precision, 0.56889105
prediction/mean, 0.34047222
recall, 0.30811232


Categorical columns

In [29]:
relationship = fc.categorical_column_with_vocabulary_list(
    'relationship',
    ['Husband', 'Not-in-family', 'Wife', 'Own-child', 'Unmarried', 'Other-relative']
)

Since the `input_layer` function is designed for DNN (deep neural network) models, it expects dense inputs. So we must wrap the categorical column in a `tf.feature_column.indicator_column` to produce the dense one-hot output (linear `Estimators` can often skip this step).

In [31]:
fc.input_layer(feature_batch, [age, fc.indicator_column(relationship)])

<tf.Tensor: id=4893, shape=(10, 7), dtype=float32, numpy=
array([[48.,  1.,  0.,  0.,  0.,  0.,  0.],
       [30.,  0.,  1.,  0.,  0.,  0.,  0.],
       [53.,  1.,  0.,  0.,  0.,  0.,  0.],
       [36.,  1.,  0.,  0.,  0.,  0.,  0.],
       [51.,  0.,  1.,  0.,  0.,  0.,  0.],
       [31.,  1.,  0.,  0.,  0.,  0.,  0.],
       [50.,  0.,  1.,  0.,  0.,  0.,  0.],
       [39.,  1.,  0.,  0.,  0.,  0.,  0.],
       [43.,  1.,  0.,  0.,  0.,  0.,  0.],
       [31.,  0.,  0.,  0.,  0.,  1.,  0.]], dtype=float32)>
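
The six columns after `age` are a one-hot encoding over the vocabulary list. A plain-Python sketch of that encoding (the all-zeros row for unknown values is this sketch's assumption about out-of-vocabulary handling):

```python
vocabulary = ['Husband', 'Not-in-family', 'Wife', 'Own-child', 'Unmarried', 'Other-relative']

def one_hot(value, vocabulary):
    """Return a dense one-hot row; values outside the vocabulary map to all zeros."""
    return [1.0 if value == entry else 0.0 for entry in vocabulary]

print(one_hot('Husband', vocabulary))
print(one_hot('Unmarried', vocabulary))
```

So the first row above (`48, 1, 0, 0, 0, 0, 0`) is a 48-year-old whose `relationship` is `Husband`, and the last row has `Unmarried` at index 4.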

If we don't know the set of possible values in advance (or there are just too many), we can use `categorical_column_with_hash_bucket` instead:

In [32]:
occupation = tf.feature_column.categorical_column_with_hash_bucket(
    'occupation', hash_bucket_size=1000)

In [33]:
for item in feature_batch['occupation'].numpy():
    print(item.decode())

Craft-repair
?
Craft-repair
Prof-specialty
Prof-specialty
Sales
Other-service
Craft-repair
Exec-managerial
Other-service


With the hashed feature column, the output shape is `(batch_size, hash_bucket_size)`:

In [36]:
occupation_result = fc.input_layer(feature_batch, [fc.indicator_column(occupation)])

occupation_result.numpy().shape

(10, 1000)

In [37]:
tf.argmax(occupation_result, axis=1).numpy()

array([466,  65, 466, 979, 979, 631, 527, 466, 800, 527])
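
The idea behind a hash bucket column can be sketched in plain Python: hash the string and take the result modulo the number of buckets. MD5 is used here only as a convenient stable hash; TensorFlow uses its own hash function, so these indices will not match the `466`, `65`, ... shown above.

```python
import hashlib

def hash_bucket(value, hash_bucket_size):
    """Map a string to a stable bucket index in [0, hash_bucket_size)."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % hash_bucket_size

occupations = ['Craft-repair', '?', 'Craft-repair', 'Prof-specialty']
indices = [hash_bucket(o, 1000) for o in occupations]
print(indices)
```

Equal strings always land in the same bucket, just as both `Craft-repair` rows share index 466 above. Different strings can collide, and the chance of that grows as `hash_bucket_size` shrinks.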

Create the feature columns for the other categorical features:

In [40]:
education = tf.feature_column.categorical_column_with_vocabulary_list(
    'education', [
        'Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
        'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
        '5th-6th', '10th', '1st-4th', 'Preschool', '12th'])

marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
    'marital_status', [
        'Married-civ-spouse', 'Divorced', 'Married-spouse-absent',
        'Never-married', 'Separated', 'Married-AF-spouse', 'Widowed'])

workclass = tf.feature_column.categorical_column_with_vocabulary_list(
    'workclass', [
        'Self-emp-not-inc', 'Private', 'State-gov', 'Federal-gov',
        'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'])


my_categorical_columns = [relationship, occupation, education, marital_status, workclass]

A linear classifier with both numerical and categorical feature columns:

In [41]:
classifier = tf.estimator.LinearClassifier(feature_columns=my_numeric_columns+my_categorical_columns)
classifier.train(train_inpf)
result = classifier.evaluate(test_inpf)

clear_output()

for key,value in sorted(result.items()):
    print("%s: %s" % (key, value))

accuracy: 0.83969045
accuracy_baseline: 0.76377374
auc: 0.885083
auc_precision_recall: 0.71800274
average_loss: 0.36432135
global_step: 1018
label/mean: 0.23622628
loss: 23.260847
precision: 0.70752186
prediction/mean: 0.2146328
recall: 0.5478419


**Derived Feature Columns**

*Make Continuous Features Categorical through Bucketization*

In [42]:
age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])

When bucketing, the model sees each bucket as a one-hot feature:

In [45]:
fc.input_layer(feature_batch, [age, age_buckets]).numpy()

array([[48.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [30.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [53.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [36.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [51.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [31.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [50.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [39.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [43.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [31.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]],
      dtype=float32)
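
Ten boundaries give eleven buckets, which is why eleven columns follow `age` above. A plain-Python sketch of the lookup using `bisect` (a stand-in for what `bucketized_column` computes, checked against the rows above):

```python
import bisect

boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

def bucketize(age, boundaries):
    """Return a one-hot row with len(boundaries) + 1 buckets."""
    # A value equal to a boundary falls into the bucket above it.
    index = bisect.bisect_right(boundaries, age)
    row = [0.0] * (len(boundaries) + 1)
    row[index] = 1.0
    return row

for age in (48, 30, 50):
    print(age, bucketize(age, boundaries).index(1.0))
# 48 6
# 30 3
# 50 7
```

These indices agree with the array above: age 30 lands in bucket 3 and age 50 in bucket 7, so each bucket includes its lower boundary.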

*Learn complex relationships with crossed columns*

In [46]:
education_x_occupation = tf.feature_column.crossed_column(
    ['education', 'occupation'], hash_bucket_size=1000)
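
Conceptually, a crossed column treats each *combination* of values as its own category and hashes it into a bucket, so the model can learn a separate weight per combination. A rough plain-Python sketch; the `_X_` separator and the MD5 hash are choices made for this illustration and are not how TensorFlow computes the cross:

```python
import hashlib

def crossed_bucket(values, hash_bucket_size):
    """Hash a combination of categorical values into one bucket index."""
    key = '_X_'.join(values)  # separator chosen for this sketch only
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    return int(digest, 16) % hash_bucket_size

a = crossed_bucket(['Bachelors', 'Prof-specialty'], 1000)
b = crossed_bucket(['HS-grad', 'Prof-specialty'], 1000)
c = crossed_bucket(['Bachelors', 'Prof-specialty'], 1000)
print(a, b, c)
```

The same (education, occupation) pair always maps to the same bucket, while a different pair usually maps elsewhere. There are 16 education values in the vocabulary above, so with more than a few dozen occupations the number of possible pairs exceeds 1000 buckets and some collisions are unavoidable.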

In [48]:
?fc.crossed_column