# TensorFlow Linear Model Tutorial

In this tutorial, we will use the `tf.estimator` API in TensorFlow to solve a
binary classification problem: Given census data about a person such as age,
education, marital status, and occupation (the features), we will try to predict
whether or not the person earns more than 50,000 dollars a year (the target
label). We will train a **logistic regression** model, and given an individual's
information our model will output a number between 0 and 1, which can be
interpreted as the probability that the individual has an annual income of over
50,000 dollars.

## Setup

To try the code for this tutorial:

[Install TensorFlow](https://www.tensorflow.org/install) if you haven't already.

Next, import the relevant packages:

In [42]:
import tensorflow as tf
import tensorflow.feature_column as fc 
tf.enable_eager_execution()

import os
import sys
from IPython.display import clear_output


Download the [tutorial code from GitHub](https://github.com/tensorflow/models/tree/master/official/wide_deep/),
add the root directory to your Python path, and change to the `wide_deep` directory:

In [2]:
if "wide_deep" not in os.getcwd():
    ! git clone --depth 1 https://github.com/tensorflow/models
    models_path = os.path.join(os.getcwd(), 'models')
    sys.path.append(models_path)   
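    # Avoid a KeyError when PYTHONPATH is not already set.
    os.environ['PYTHONPATH'] = os.pathsep.join(filter(None, [os.environ.get('PYTHONPATH'), models_path]))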
    os.chdir("models/official/wide_deep")



Execute the data download script:

In [3]:
import census_dataset
import census_main

census_dataset.download("/tmp/census_data/")

Run the tutorial code with the following command to train the model described in this tutorial:

In [4]:
output = !python -m census_main --model_type=wide --train_epochs=2
print([line for line in output if 'accuracy:' in line])

['I0711 14:47:25.747490 139708077598464 tf_logging.py:115] accuracy: 0.833794']


Read on to find out how this code builds its linear model.

## Reading The Census Data

The dataset we're using is the
[Census Income Dataset](https://archive.ics.uci.edu/ml/datasets/Census+Income).
We have provided
[census_dataset.py](https://github.com/tensorflow/models/tree/master/official/wide_deep/census_dataset.py)
which downloads the data and performs some additional cleanup.

Since the task is a binary classification problem, we'll construct a label
column named "label" whose value is 1 if the income is over 50K, and 0
otherwise. For reference, see `input_fn` in
[census_main.py](https://github.com/tensorflow/models/tree/master/official/wide_deep/census_main.py).

Next, let's take a look at the data and see which columns we can use to
predict the target label. 

In [5]:
!ls  /tmp/census_data/

adult.data  adult.test


In [6]:
train_file = "/tmp/census_data/adult.data"
test_file = "/tmp/census_data/adult.test"

In [7]:
import pandas
train_df = pandas.read_csv(train_file, header = None, names = census_dataset._CSV_COLUMNS)
test_df = pandas.read_csv(test_file, header = None, names = census_dataset._CSV_COLUMNS)

train_df.head()

| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | gender | capital_gain | capital_loss | hours_per_week | native_country | income_bracket |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


The columns can be grouped into two types—categorical
and continuous columns:

*   A column is called **categorical** if its value can only be one of the
    categories in a finite set. For example, the relationship status of a person
    (wife, husband, unmarried, etc.) or the education level (high school,
    college, etc.) are categorical columns.
*   A column is called **continuous** if its value can be any numerical value in
    a continuous range. For example, the capital gain of a person (e.g. $14,084)
    is a continuous column.

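The Census Income dataset contains the following columns:

*   Continuous: `age`, `fnlwgt`, `education_num`, `capital_gain`,
    `capital_loss`, `hours_per_week`
*   Categorical: `workclass`, `education`, `marital_status`, `occupation`,
    `relationship`, `race`, `gender`, `native_country`, `income_bracket`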

## Converting Data into Tensors

When building a `tf.estimator` model, the input data is specified by means of an
input function or `input_fn`. This builder function returns a `tf.data.Dataset`
of batches of `(features-dict, label)` pairs. It will not be called until it is
later passed to `tf.estimator.Estimator` methods such as `train` and `evaluate`.

In more detail, the input builder function returns the following as a pair:

1.  `features`: A dict from feature names to `Tensors` or
    `SparseTensors` containing batches of features.
2.  `labels`: A `Tensor` containing batches of labels.

The keys of the `features` will be used to configure the model's input layer.

Note that the input function will be called while
constructing the TensorFlow graph, not while running the graph. What it
returns is a representation of the input data as a sequence of TensorFlow graph
operations.
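
To make the `(features, labels)` structure concrete, here is a minimal pure-Python sketch of what an input function produces, written without TensorFlow. The generator name `batched_examples` is hypothetical, and the sample rows are three rows from the table above reduced to three columns. A real `input_fn` returns a `tf.data.Dataset` rather than a Python generator.

```python
def batched_examples(rows, label_key, batch_size, num_epochs):
    """Yield (features, labels) batches from a list of row dicts.

    Illustrative stand-in for an input function: `features` maps each
    column name to a list of values, and `labels` is a parallel list.
    """
    for _ in range(num_epochs):
        for start in range(0, len(rows), batch_size):
            batch = rows[start:start + batch_size]
            labels = [row[label_key] for row in batch]
            feature_names = [k for k in batch[0] if k != label_key]
            features = {k: [row[k] for row in batch] for k in feature_names}
            yield features, labels

# Three rows from the census table, reduced to three columns.
rows = [
    {'age': 39, 'education': 'Bachelors', 'income_bracket': '<=50K'},
    {'age': 50, 'education': 'Bachelors', 'income_bracket': '<=50K'},
    {'age': 38, 'education': 'HS-grad', 'income_bracket': '<=50K'},
]

batches = list(batched_examples(rows, 'income_bracket', batch_size=2, num_epochs=1))
first_features, first_labels = batches[0]
print(first_features)  # {'age': [39, 50], 'education': ['Bachelors', 'Bachelors']}
print(first_labels)    # ['<=50K', '<=50K']
```

The keys of each `features` dict are the column names, which is what the model's input layer is later configured against.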

For small problems like this it's easy to make a `tf.data.Dataset` by slicing the `pandas.DataFrame`:

In [8]:
def easy_input_function(df, label_key, num_epochs, shuffle, batch_size):
  df = df.copy()
  label = df.pop(label_key)
  ds = tf.data.Dataset.from_tensor_slices((dict(df),label))

  if shuffle:
    ds = ds.shuffle(10000)

  ds = ds.batch(batch_size).repeat(num_epochs)

  return ds

Since we have eager execution enabled, it is easy to inspect the resulting dataset:

In [9]:
ds = easy_input_function(train_df, label_key='income_bracket', num_epochs=5, shuffle=True, batch_size=10)

for feature_batch, label_batch in ds:
    break
    
print('Some feature keys:', list(feature_batch.keys())[:5])
print()
print('A batch of Ages  :', feature_batch['age'])
print()
print('A batch of Labels:', label_batch )

Some feature keys: ['capital_gain', 'occupation', 'gender', 'capital_loss', 'workclass']

A batch of Ages  : tf.Tensor([61 18 37 47 47 32 18 23 28 37], shape=(10,), dtype=int32)

A batch of Labels: tf.Tensor(
[b'>50K' b'<=50K' b'>50K' b'>50K' b'>50K' b'>50K' b'<=50K' b'<=50K'
 b'<=50K' b'<=50K'], shape=(10,), dtype=string)


But this approach has severely limited scalability. Larger datasets should be streamed from disk.
The `census_dataset.input_fn` provides an example of how to do this using `tf.decode_csv` and `tf.data.TextLineDataset`:


In [10]:
import inspect
print(inspect.getsource(census_dataset.input_fn))

def input_fn(data_file, num_epochs, shuffle, batch_size):
  """Generate an input function for the Estimator."""
  assert tf.gfile.Exists(data_file), (
      '%s not found. Please make sure you have run census_dataset.py and '
      'set the --data_dir argument to the correct path.' % data_file)

  def parse_csv(value):
    tf.logging.info('Parsing {}'.format(data_file))
    columns = tf.decode_csv(value, record_defaults=_CSV_COLUMN_DEFAULTS)
    features = dict(zip(_CSV_COLUMNS, columns))
    labels = features.pop('income_bracket')
    classes = tf.equal(labels, '>50K')  # binary classification
    return features, classes

  # Extract lines from input files using the Dataset API.
  dataset = tf.data.TextLineDataset(data_file)

  if shuffle:
    dataset = dataset.shuffle(buffer_size=_NUM_EXAMPLES['train'])

  dataset = dataset.map(parse_csv, num_parallel_calls=5)

  # We call repeat after shuffling, rather than before, to prevent separate
  # epochs from blending together.
  dataset = dataset.repeat(num_epochs)
  dataset = dataset.batch(batch_size)
  return dataset

This `input_fn` gives equivalent output:

In [11]:
ds = census_dataset.input_fn(train_file, num_epochs=5, shuffle=True, batch_size=10)

INFO:tensorflow:Parsing /tmp/census_data/adult.data


In [12]:
for feature_batch, label_batch in ds:
    break
    
print('Feature keys:', list(feature_batch.keys())[:5])
print()
print('Age batch   :', feature_batch['age'])
print()
print('Label batch :', label_batch )

Feature keys: ['capital_gain', 'occupation', 'gender', 'capital_loss', 'workclass']

Age batch   : tf.Tensor([46 38 42 37 29 48 46 40 73 49], shape=(10,), dtype=int32)

Label batch : tf.Tensor([False False False False False False False False  True False], shape=(10,), dtype=bool)


Because `Estimators` expect an `input_fn` that takes no arguments, we typically wrap a configurable input function in an object with the expected signature. For this notebook, configure `train_inpf` to iterate over the data twice:

In [13]:
import functools
train_inpf = functools.partial(census_dataset.input_fn, train_file, num_epochs=2, shuffle=True, batch_size=64)
test_inpf = functools.partial(census_dataset.input_fn, test_file, num_epochs=1, shuffle=False, batch_size=64)

## Selecting and Engineering Features for the Model

Estimators use a system called `feature_columns` to describe how the model
should interpret each of the raw input features. An Estimator expects a vector
of numeric inputs, and feature columns describe how the model should convert
each feature.

Selecting and crafting the right set of feature columns is key to learning an
effective model. A **feature column** can be either one of the raw columns in
the original dataframe (let's call them **base feature columns**), or any new
columns created based on some transformations defined over one or multiple base
columns (let's call them **derived feature columns**). Basically, "feature
column" is an abstract concept of any raw or derived variable that can be used
to predict the target label.

### Base Feature Columns

#### Numeric columns

The simplest `feature_column` is `numeric_column`. This indicates that a feature is a numeric value that should be input to the model directly. For example:

In [14]:
age = fc.numeric_column('age')

The model will use the `feature_column` definitions to build the model input. You can inspect the resulting output using the `input_layer` function:

In [15]:
fc.input_layer(feature_batch, [age])

<tf.Tensor: id=237, shape=(10, 1), dtype=float32, numpy=
array([[46.],
       [38.],
       [42.],
       [37.],
       [29.],
       [48.],
       [46.],
       [40.],
       [73.],
       [49.]], dtype=float32)>

The following code will train and evaluate a model on only the `age` feature.

In [16]:
classifier = tf.estimator.LinearClassifier(feature_columns=[age], n_classes=2)
classifier.train(train_inpf)
result = classifier.evaluate(test_inpf)

clear_output()
print(result)

{'precision': 0.29166666, 'auc_precision_recall': 0.31132147, 'average_loss': 0.5239897, 'label/mean': 0.23622628, 'auc': 0.6781367, 'loss': 33.4552, 'prediction/mean': 0.22513431, 'accuracy': 0.7631595, 'recall': 0.0018200728, 'global_step': 1018, 'accuracy_baseline': 0.76377374}


Similarly, we can define a `NumericColumn` for each continuous feature column
that we want to use in the model:

In [17]:
education_num = tf.feature_column.numeric_column('education_num')
capital_gain = tf.feature_column.numeric_column('capital_gain')
capital_loss = tf.feature_column.numeric_column('capital_loss')
hours_per_week = tf.feature_column.numeric_column('hours_per_week')

In [18]:
my_numeric_columns = [age, education_num, capital_gain, capital_loss, hours_per_week]

In [19]:
fc.input_layer(feature_batch, my_numeric_columns)

<tf.Tensor: id=2160, shape=(10, 5), dtype=float32, numpy=
array([[4.600e+01, 0.000e+00, 0.000e+00, 6.000e+00, 4.000e+01],
       [3.800e+01, 4.508e+03, 0.000e+00, 1.300e+01, 4.000e+01],
       [4.200e+01, 0.000e+00, 0.000e+00, 1.400e+01, 4.000e+01],
       [3.700e+01, 0.000e+00, 0.000e+00, 1.100e+01, 4.000e+01],
       [2.900e+01, 0.000e+00, 0.000e+00, 9.000e+00, 4.000e+01],
       [4.800e+01, 0.000e+00, 0.000e+00, 1.300e+01, 5.500e+01],
       [4.600e+01, 0.000e+00, 0.000e+00, 9.000e+00, 5.000e+01],
       [4.000e+01, 0.000e+00, 0.000e+00, 9.000e+00, 4.000e+01],
       [7.300e+01, 6.418e+03, 0.000e+00, 4.000e+00, 9.900e+01],
       [4.900e+01, 0.000e+00, 0.000e+00, 4.000e+00, 4.000e+01]],
      dtype=float32)>

You could retrain a model on these features just by changing the `feature_columns` argument to the constructor:

In [20]:
classifier = tf.estimator.LinearClassifier(feature_columns=my_numeric_columns, n_classes=2)
classifier.train(train_inpf)

result = classifier.evaluate(test_inpf)

clear_output()
for key,value in sorted(result.items()):
  print('%s: %s' % (key, value))

accuracy: 0.7817087
accuracy_baseline: 0.76377374
auc: 0.8027547
auc_precision_recall: 0.5611528
average_loss: 1.0698086
global_step: 1018
label/mean: 0.23622628
loss: 68.30414
precision: 0.57025987
prediction/mean: 0.36397633
recall: 0.30811232


#### Categorical columns

To define a feature column for a categorical feature, we can create a
`CategoricalColumn` using one of the `tf.feature_column.categorical_column*` functions.

If you know the set of all possible feature values of a column and there are only a few of them, you can use `categorical_column_with_vocabulary_list`. Each key in the list is assigned an auto-incrementing ID starting from 0. For example, for the `relationship` column we can assign the feature string `Husband` to an integer ID of 0, `Not-in-family` to 1, and so on:

In [21]:
relationship = fc.categorical_column_with_vocabulary_list(
    'relationship', [
        'Husband', 'Not-in-family', 'Wife', 'Own-child', 'Unmarried',
        'Other-relative'])


This will create a sparse one-hot vector from the raw input feature.
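
The lookup and one-hot steps can be sketched in plain Python. This is only an illustration of the idea, not how TensorFlow implements it: the function names `vocabulary_id` and `one_hot` are hypothetical, the output is a dense list rather than a sparse tensor, and mapping unknown values to -1 is an assumption of this sketch.

```python
RELATIONSHIP_VOCAB = [
    'Husband', 'Not-in-family', 'Wife', 'Own-child', 'Unmarried',
    'Other-relative']

def vocabulary_id(value, vocabulary):
    """Return the position of `value` in `vocabulary`, or -1 if it is absent."""
    try:
        return vocabulary.index(value)
    except ValueError:
        return -1

def one_hot(index, depth):
    """Return a dense one-hot list; an out-of-range index gives all zeros."""
    return [1.0 if i == index else 0.0 for i in range(depth)]

idx = vocabulary_id('Wife', RELATIONSHIP_VOCAB)
print(idx)                                    # 2
print(one_hot(idx, len(RELATIONSHIP_VOCAB)))  # [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```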

The `input_layer` function we're using for demonstration is designed for DNN models, and so expects dense inputs. To demonstrate the categorical column we must wrap it in a `tf.feature_column.indicator_column` to create the dense one-hot output (Linear `Estimators` can often skip this dense-step).

Note: the other sparse-to-dense option is `tf.feature_column.embedding_column`.

Run the input layer, configured with both the `age` and `relationship` columns:

In [23]:
fc.input_layer(feature_batch, [age, fc.indicator_column(relationship)])

<tf.Tensor: id=4490, shape=(10, 7), dtype=float32, numpy=
array([[46.,  0.,  0.,  0.,  0.,  1.,  0.],
       [38.,  1.,  0.,  0.,  0.,  0.,  0.],
       [42.,  0.,  1.,  0.,  0.,  0.,  0.],
       [37.,  1.,  0.,  0.,  0.,  0.,  0.],
       [29.,  1.,  0.,  0.,  0.,  0.,  0.],
       [48.,  1.,  0.,  0.,  0.,  0.,  0.],
       [46.,  1.,  0.,  0.,  0.,  0.,  0.],
       [40.,  1.,  0.,  0.,  0.,  0.,  0.],
       [73.,  1.,  0.,  0.,  0.,  0.,  0.],
       [49.,  1.,  0.,  0.,  0.,  0.,  0.]], dtype=float32)>

What if we don't know the set of possible values in advance? Not a problem. We
can use `categorical_column_with_hash_bucket` instead:

In [24]:
occupation = tf.feature_column.categorical_column_with_hash_bucket(
    'occupation', hash_bucket_size=1000)

What will happen is that each possible value in the feature column `occupation`
will be hashed to an integer ID as we encounter them in training. The example batch has a few different occupations:

In [25]:
for item in feature_batch['occupation'].numpy():
    print(item.decode())

Machine-op-inspct
Transport-moving
Prof-specialty
Adm-clerical
Handlers-cleaners
Prof-specialty
Other-service
Farming-fishing
Farming-fishing
Handlers-cleaners


If we run `input_layer` with the hashed column, we see that the output shape is `(batch_size, hash_bucket_size)`.

In [27]:
occupation_result = fc.input_layer(feature_batch, [fc.indicator_column(occupation)])

occupation_result.numpy().shape

(10, 1000)

It's easier to see the actual results if we take the `tf.argmax` over the `hash_bucket_size` dimension.

In the output below, note how any duplicate occupations are mapped to the same pseudo-random index:

Note: Hash collisions are unavoidable, but often have minimal impact on model quality. The effect may be noticeable if the hash buckets are being used to compress the input space. See [this notebook](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/outreach/blogs/housing_prices.ipynb) for a more visual example of the effect of these hash collisions.

In [28]:
tf.argmax(occupation_result, axis=1).numpy()

array([911, 420, 979,  96,  10, 979, 527, 936, 936,  10])
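
The bucket assignment can be imitated with a few lines of standard-library Python. The sketch below uses SHA-256 purely as a stable stand-in hash. TensorFlow uses its own fingerprint function, so the bucket IDs here will not match the output above. The helper name `hash_bucket` is hypothetical.

```python
import hashlib

def hash_bucket(value, num_buckets):
    """Map a string to a stable bucket ID in the range [0, num_buckets)."""
    digest = hashlib.sha256(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_buckets

occupations = ['Prof-specialty', 'Adm-clerical', 'Prof-specialty', 'Farming-fishing']
bucket_ids = [hash_bucket(o, 1000) for o in occupations]
print(bucket_ids)

# Equal strings always land in the same bucket; different strings may collide.
print(bucket_ids[0] == bucket_ids[2])  # True
```

No vocabulary is needed up front, which is why this approach works for columns whose values are not known in advance.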

No matter which way we choose to define a `CategoricalColumn`, each feature string
will be mapped into an integer ID by looking up a fixed mapping or by hashing.
Under the hood, the `LinearModel` class is responsible for
managing the mapping and creating a `tf.Variable` to store the model parameters
(also known as model weights) for each feature ID. The model parameters will be
learned through the model training process we'll go through later.

We'll use the same approach to define the other categorical features:

In [29]:
education = tf.feature_column.categorical_column_with_vocabulary_list(
    'education', [
        'Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
        'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
        '5th-6th', '10th', '1st-4th', 'Preschool', '12th'])

marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
    'marital_status', [
        'Married-civ-spouse', 'Divorced', 'Married-spouse-absent',
        'Never-married', 'Separated', 'Married-AF-spouse', 'Widowed'])

workclass = tf.feature_column.categorical_column_with_vocabulary_list(
    'workclass', [
        'Self-emp-not-inc', 'Private', 'State-gov', 'Federal-gov',
        'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'])


In [30]:
my_categorical_columns = [relationship, occupation, education, marital_status, workclass]

It's easy to use both sets of columns to configure a model that uses all these features:

In [31]:
classifier = tf.estimator.LinearClassifier(feature_columns=my_numeric_columns+my_categorical_columns, n_classes=2)
classifier.train(train_inpf)
result = classifier.evaluate(test_inpf)

clear_output()
for key,value in sorted(result.items()):
  print('%s: %s' % (key, value))

accuracy: 0.83342546
accuracy_baseline: 0.76377374
auc: 0.8807037
auc_precision_recall: 0.6601031
average_loss: 0.8671454
global_step: 1018
label/mean: 0.23622628
loss: 55.36468
precision: 0.6496042
prediction/mean: 0.2628341
recall: 0.6401456


### Derived feature columns

#### Making Continuous Features Categorical through Bucketization

Sometimes the relationship between a continuous feature and the label is not
linear. As a hypothetical example, a person's income may grow with age in the
early stage of one's career, then the growth may slow at some point, and finally
the income decreases after retirement. In this scenario, using the raw `age` as
a real-valued feature column might not be a good choice because the model can
only learn one of the three cases:

1.  Income always increases at some rate as age grows (positive correlation),
1.  Income always decreases at some rate as age grows (negative correlation), or
1.  Income stays the same no matter at what age (no correlation)

If we want to learn the fine-grained correlation between income and each age
group separately, we can leverage **bucketization**. Bucketization is a process
of dividing the entire range of a continuous feature into a set of consecutive
bins/buckets, and then converting the original numerical feature into a bucket
ID (as a categorical feature) depending on which bucket that value falls into.
So, we can define a `bucketized_column` over `age` as:

In [32]:
age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])

where the `boundaries` is a list of bucket boundaries. In this case, there are
10 boundaries, resulting in 11 age group buckets (from age 17 and below, 18-24,
25-29, ..., to 65 and over).

With bucketing, the model sees each bucket as a one-hot feature:

In [33]:
fc.input_layer(feature_batch, [age, age_buckets]).numpy()

array([[46.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [38.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [42.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [37.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [29.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [48.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [46.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [40.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [73.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.],
       [49.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.]],
      dtype=float32)
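
The same bucket assignment can be reproduced with the standard `bisect` module. This is a sketch of the idea rather than TensorFlow's implementation. It assumes a value equal to a boundary falls into the bucket above it, which agrees with the output shown (age 40 lands in the 40-44 bucket). The helper names `bucketize` and `one_hot` are hypothetical.

```python
import bisect

BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

def bucketize(value, boundaries):
    """Return the index of the bucket that `value` falls into.

    Bucket 0 holds everything below boundaries[0]; the last bucket holds
    everything at or above boundaries[-1].
    """
    return bisect.bisect_right(boundaries, value)

def one_hot(index, depth):
    """Return a dense one-hot list of length `depth`."""
    return [1.0 if i == index else 0.0 for i in range(depth)]

# Ten boundaries give eleven buckets.
for age in [46, 38, 29, 73]:
    bucket = bucketize(age, BOUNDARIES)
    print(age, bucket, one_hot(bucket, len(BOUNDARIES) + 1))
```

For the ages 46, 38, 29 and 73 this yields bucket indices 6, 4, 2 and 10, matching the position of the 1 in the corresponding rows above.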

#### Learn complex relationships with crossed column

Using each base feature column separately may not be enough to explain the data.
For example, the correlation between education and the label (earning > 50,000
dollars) may be different for different occupations. Therefore, if we only learn
a single model weight for `education="Bachelors"` and `education="Masters"`, we
won't be able to capture every single education-occupation combination (e.g.
distinguishing between `education="Bachelors" AND occupation="Exec-managerial"`
and `education="Bachelors" AND occupation="Craft-repair"`). To learn the
differences between different feature combinations, we can add **crossed feature
columns** to the model.

In [34]:
education_x_occupation = tf.feature_column.crossed_column(
    ['education', 'occupation'], hash_bucket_size=1000)

We can also create a `crossed_column` over more than two columns. Each
constituent column can be either a base feature column that is categorical
(`CategoricalColumn`), a bucketized real-valued feature column, or even another
`CrossedColumn`. Here's an example:

In [35]:
age_buckets_x_education_x_occupation = tf.feature_column.crossed_column(
    [age_buckets, 'education', 'occupation'], hash_bucket_size=1000)

These crossed columns always use hash buckets to avoid the exponential explosion in the number of categories, and put control over the number of model weights in the hands of the user.

For a visual example of the effect of hash buckets with crossed columns, see [this notebook](https://colab.research.google.com/github/tensorflow/models/blob/master/samples/outreach/blogs/housing_prices.ipynb).
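
As a rough sketch of the idea, a cross can be imitated by joining the constituent values into one key and hashing that key into a fixed number of buckets. The SHA-256 hash and the `_X_` separator are assumptions made for illustration. TensorFlow combines its inputs differently, so the resulting IDs will not match. The helper name `crossed_bucket` is hypothetical.

```python
import hashlib

def crossed_bucket(values, num_buckets):
    """Hash a combination of categorical values to a bucket ID in [0, num_buckets)."""
    key = '_X_'.join(values)
    digest = hashlib.sha256(key.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_buckets

a = crossed_bucket(['Bachelors', 'Exec-managerial'], 1000)
b = crossed_bucket(['Bachelors', 'Craft-repair'], 1000)
print(a, b)
```

Each education-occupation combination gets its own ID (subject to collisions), so the model can learn a separate weight per combination, while the total number of weights stays bounded by the bucket count.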



## Defining The Logistic Regression Model

After processing the input data and defining all the feature columns, we're now
ready to put them all together and build a Logistic Regression model. In the
previous section we've seen several types of base and derived feature columns,
including:

*   `CategoricalColumn`
*   `NumericColumn`
*   `BucketizedColumn`
*   `CrossedColumn`

All of these are subclasses of the abstract `FeatureColumn` class, and can be
added to the `feature_columns` field of a model:

In [36]:
import tempfile

base_columns = [
    education, marital_status, relationship, workclass, occupation,
    age_buckets,
]
crossed_columns = [
    tf.feature_column.crossed_column(
        ['education', 'occupation'], hash_bucket_size=1000),
    tf.feature_column.crossed_column(
        [age_buckets, 'education', 'occupation'], hash_bucket_size=1000),
]

model_dir = tempfile.mkdtemp()
model = tf.estimator.LinearClassifier(
    model_dir=model_dir, feature_columns=base_columns + crossed_columns)

INFO:tensorflow:Using default config.


INFO:tensorflow:Using config: {'_global_id_in_cluster': 0, '_is_chief': True, '_keep_checkpoint_every_n_hours': 10000, '_tf_random_seed': None, '_num_worker_replicas': 1, '_device_fn': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fc03341f668>, '_evaluation_master': '', '_train_distribute': None, '_model_dir': '/tmp/tmpligbanno', '_session_config': None, '_save_checkpoints_steps': None, '_master': '', '_num_ps_replicas': 0, '_task_type': 'worker', '_log_step_count_steps': 100, '_save_summary_steps': 100, '_service': None, '_task_id': 0, '_save_checkpoints_secs': 600, '_keep_checkpoint_max': 5}


The model also automatically learns a bias term, which controls the prediction
one would make without observing any features (see the section [How Logistic
Regression Works](#how_it_works) for more explanations). The learned model files will be stored
in `model_dir`.

## Training and evaluating our model

After adding all the features to the model, now let's look at how to actually
train the model. Training a model is just a single command using the
`tf.estimator` API:

In [38]:
model.train(train_inpf)
clear_output()

After the model is trained, we can evaluate how good our model is at predicting
the labels of the holdout data:

In [39]:
results = model.evaluate(test_inpf)
clear_output()
for key in sorted(results):
  print('%s: %0.2f' % (key, results[key]))

accuracy: 0.84
accuracy_baseline: 0.76
auc: 0.88
auc_precision_recall: 0.70
average_loss: 0.35
global_step: 1018.00
label/mean: 0.24
loss: 22.37
precision: 0.69
prediction/mean: 0.24
recall: 0.57


The first line of the final output should be something like
`accuracy: 0.84`, which means the accuracy is 84%. Feel free to try more
features and transformations and see if you can do even better!

After the model is evaluated, we can use it to predict whether an individual has an annual income of over
50,000 dollars, given that individual's information.

Let's look in more detail at how the model did:

In [40]:
import numpy as np
predict_df = test_df[:20].copy()

pred_iter = model.predict(
    lambda:easy_input_function(predict_df, label_key='income_bracket',
                               num_epochs=1, shuffle=False, batch_size=10))

classes = np.array(['<=50K', '>50K'])
pred_class_id = []
for pred_dict in pred_iter:
  pred_class_id.append(pred_dict['class_ids'])

predict_df['predicted_class'] = classes[np.array(pred_class_id)]
predict_df['correct'] = predict_df['predicted_class'] == predict_df['income_bracket']

clear_output()
predict_df[['income_bracket','predicted_class', 'correct']]

| | income_bracket | predicted_class | correct |
|---|---|---|---|
| 0 | <=50K | <=50K | True |
| 1 | <=50K | <=50K | True |
| 2 | >50K | <=50K | False |
| 3 | >50K | <=50K | False |
| 4 | <=50K | <=50K | True |
| 5 | <=50K | <=50K | True |
| 6 | <=50K | <=50K | True |
| 7 | >50K | >50K | True |
| 8 | <=50K | <=50K | True |
| 9 | <=50K | <=50K | True |

If you'd like to see a working end-to-end example, you can download our
[example code](https://github.com/tensorflow/models/tree/master/official/wide_deep/census_main.py)
and set the `model_type` flag to `wide`.

## Adding Regularization to Prevent Overfitting

Regularization is a technique used to avoid **overfitting**. Overfitting happens
when your model does well on the data it is trained on, but worse on test data
that the model has not seen before, such as live traffic. Overfitting generally
occurs when a model is excessively complex, such as having too many parameters
relative to the number of observed training examples. Regularization allows you
to control your model's complexity and makes the model more generalizable to
unseen data.

In the Linear Model library, you can add L1 and L2 regularizations to the model
as:

In [41]:
model = tf.estimator.LinearClassifier(
    model_dir=model_dir, feature_columns=base_columns + crossed_columns,
    optimizer=tf.train.FtrlOptimizer(
        learning_rate=0.1,
        l1_regularization_strength=0.1,
        l2_regularization_strength=0.1))

model.train(train_inpf)

results = model.evaluate(test_inpf)
clear_output()
for key in sorted(results):
  print('%s: %0.2f' % (key, results[key]))

accuracy: 0.84
accuracy_baseline: 0.76
auc: 0.89
auc_precision_recall: 0.70
average_loss: 0.35
global_step: 2036.00
label/mean: 0.24
loss: 22.29
precision: 0.69
prediction/mean: 0.24
recall: 0.56


One important difference between L1 and L2 regularization is that L1
regularization tends to make model weights stay at zero, creating sparser
models, whereas L2 regularization also tries to make the model weights closer to
zero but not necessarily zero. Therefore, if you increase the strength of L1
regularization, you will have a smaller model size because many of the model
weights will be zero. This is often desirable when the feature space is very
large but sparse, and when there are resource constraints that prevent you from
serving a model that is too large.

In practice, you should try various combinations of L1 and L2 regularization
strengths and find the parameters that best control overfitting and give
you a desirable model size.
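
The difference between the two penalties can be seen in a small sketch. It applies one generic shrinkage step to a weight vector: soft-thresholding for L1 and multiplicative shrinkage for L2. These are textbook proximal updates chosen for illustration, not the update rule used by `FtrlOptimizer`, and the function names and sample weights are hypothetical.

```python
def l1_shrink(weights, strength):
    """Soft-threshold: move each weight toward zero by `strength`, stopping at zero."""
    shrunk = []
    for w in weights:
        magnitude = max(abs(w) - strength, 0.0)
        sign = 1.0 if w > 0 else -1.0
        shrunk.append(sign * magnitude if magnitude > 0 else 0.0)
    return shrunk

def l2_shrink(weights, strength):
    """Scale each weight toward zero; nonzero weights stay nonzero."""
    return [w / (1.0 + strength) for w in weights]

weights = [0.05, -0.3, 0.8, -0.02]
l1_result = l1_shrink(weights, 0.1)
l2_result = l2_shrink(weights, 0.1)
print(l1_result)  # the two small weights become exactly zero
print(l2_result)  # every weight shrinks, but none reaches zero
```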

<a id="how_it_works"> </a>
## How Logistic Regression Works

Finally, let's take a minute to talk about what the Logistic Regression model
actually looks like in case you're not already familiar with it. We'll denote
the label as \\(Y\\), and the set of observed features as a feature vector
\\(\mathbf{x}=[x_1, x_2, ..., x_d]\\). We define \\(Y=1\\) if an individual
earned > 50,000 dollars and \\(Y=0\\) otherwise. In Logistic Regression, the
probability of the label being positive (\\(Y=1\\)) given the features
\\(\mathbf{x}\\) is given as:

$$ P(Y=1|\mathbf{x}) = \frac{1}{1+\exp(-(\mathbf{w}^T\mathbf{x}+b))}$$

where \\(\mathbf{w}=[w_1, w_2, ..., w_d]\\) are the model weights for the
features \\(\mathbf{x}=[x_1, x_2, ..., x_d]\\). \\(b\\) is a constant that is
often called the **bias** of the model. The equation consists of two parts, a
linear model and a logistic function:

*   **Linear Model**: First, we can see that \\(\mathbf{w}^T\mathbf{x}+b = b +
    w_1x_1 + ... +w_dx_d\\) is a linear model where the output is a linear
    function of the input features \\(\mathbf{x}\\). The bias \\(b\\) is the
    prediction one would make without observing any features. The model weight
    \\(w_i\\) reflects how the feature \\(x_i\\) is correlated with the positive
    label. If \\(x_i\\) is positively correlated with the positive label, the
    learned weight \\(w_i\\) tends to be positive, so larger values of \\(x_i\\)
    push the probability \\(P(Y=1|\mathbf{x})\\) closer to 1. On the other hand,
    if \\(x_i\\) is negatively correlated with the positive label, the learned
    weight \\(w_i\\) tends to be negative, and larger values of \\(x_i\\) push
    the probability \\(P(Y=1|\mathbf{x})\\) closer to 0.

*   **Logistic Function**: Second, we can see that there's a logistic function
    (also known as the sigmoid function) \\(S(t) = 1/(1+\exp(-t))\\) being
    applied to the linear model. The logistic function is used to convert the
    output of the linear model \\(\mathbf{w}^T\mathbf{x}+b\\) from any real
    number into the range \\((0, 1)\\), which can be interpreted as a
    probability.
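
The formula translates directly into a few lines of Python. The weights, bias and feature values below are made-up numbers for illustration only, and the function names are hypothetical.

```python
import math

def sigmoid(t):
    """Logistic function S(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def predict_probability(weights, bias, features):
    """Return P(Y=1 | x) = S(w^T x + b) for one example."""
    linear_output = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(linear_output)

weights = [0.5, -0.25]  # hypothetical w_1, w_2
bias = 0.0              # hypothetical b

print(predict_probability(weights, bias, [0.0, 0.0]))  # 0.5, only the bias contributes
print(predict_probability(weights, bias, [4.0, 0.0]))  # above 0.5, pushed up by w_1 > 0
print(predict_probability(weights, bias, [0.0, 4.0]))  # below 0.5, pushed down by w_2 < 0
```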

Model training is an optimization problem: The goal is to find a set of model
weights (i.e. model parameters) to minimize a **loss function** defined over the
training data, such as logistic loss for Logistic Regression models. The loss
function measures the discrepancy between the ground-truth label and the model's
prediction. If the prediction is very close to the ground-truth label, the loss
value will be low; if the prediction is very far from the label, the loss
value will be high.
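
For a single example, the logistic loss (also called log loss or cross-entropy) can be sketched as follows. The function name and the sample probabilities are hypothetical.

```python
import math

def logistic_loss(label, probability):
    """Logistic loss for one example.

    `label` is 0 or 1, and `probability` is the predicted P(Y=1 | x),
    assumed to lie strictly between 0 and 1.
    """
    return -(label * math.log(probability)
             + (1 - label) * math.log(1.0 - probability))

print(logistic_loss(1, 0.9))  # confident and correct: small loss
print(logistic_loss(1, 0.1))  # confident and wrong: large loss
```

Training searches for the weights and bias that make the average of this quantity over the training data as small as possible.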

## What Next

For more about estimators:

- The [TensorFlow Hub transfer-learning tutorial](https://www.tensorflow.org/hub/tutorials/text_classification_with_tf_hub)
- The [Gradient-boosted-trees estimator tutorial](https://github.com/tensorflow/models/tree/master/official/boosted_trees)
- This [blog post](https://medium.com/tensorflow/classifying-text-with-tensorflow-estimators) on processing text with `Estimators`
- How to [build a custom CNN estimator](https://www.tensorflow.org/tutorials/estimators/cnn)