# Use the Boston dataset to estimate the median price of a house

In [1]:
# Import the necessary libraries
import pandas as pd
from sklearn import datasets
import tensorflow as tf
import itertools

## Step 1) Import the data with pandas

You define the column names and store it in COLUMNS. You can use pd.read_csv() to import the data.

In [2]:
COLUMNS = ["crim", "zn", "indus", "nox", "rm", "age", 
          "dis", "tax", "ptratio", "medv"]

training_set = pd.read_csv("D:/boston/boston_train.csv", skipinitialspace = True, skiprows = 1, names = COLUMNS)

test_set = pd.read_csv("D:/boston/boston_test.csv", skipinitialspace = True, skiprows = 1, names = COLUMNS)

prediction_set = pd.read_csv("D:/boston/boston_predict.csv", skipinitialspace = True, skiprows = 1, names = COLUMNS)

# You can print the shape of the data.
print(training_set.shape, test_set.shape, prediction_set.shape)

(400, 10) (100, 10) (6, 10)


Note that the label, i.e. your y, is included in the dataset. So you need to define two other lists. One containing only the features and one with the name of the label only. These two lists will tell your estimator what are the features in the dataset and what column name is the label

It is done with the code below.

In [3]:
FEATURES = ["crim", "zn", "indus", "nox", "rm", "age", 
          "dis", "tax", "ptratio"]
LABEL = "medv"

## Step 2) Convert the data

You need to convert the numeric variables in the proper format. Tensorflow provides a method to convert continuous variable: tf.feature_column.numeric_column().

In the previous step, you define a list a feature you want to include in the model. Now you can use this list to convert them into numeric data. If you want to exclude features in your model, feel free to drop one or more variables in the list FEATURES before you construct the feature_cols

Note that you will use Python list comprehension with the list FEATURES to create a new list named feature_cols. It helps you avoid writing nine times tf.feature_column.numeric_column(). A list comprehension is a faster and cleaner way to create new lists

In [4]:
training_set.head()

Unnamed: 0,crim,zn,indus,nox,rm,age,dis,tax,ptratio,medv
0,2.3004,0.0,19.58,0.605,6.319,96.1,2.1,403,14.7,23.8
1,13.3598,0.0,18.1,0.693,5.887,94.7,1.7821,666,20.2,12.7
2,0.12744,0.0,6.91,0.448,6.77,2.9,5.7209,233,17.9,26.6
3,0.15876,0.0,10.81,0.413,5.961,17.5,5.2873,305,19.2,21.7
4,0.03768,80.0,1.52,0.404,7.274,38.3,7.309,329,12.6,34.6


In [5]:
training_set.dtypes

crim       float64
zn         float64
indus      float64
nox        float64
rm         float64
age        float64
dis        float64
tax          int64
ptratio    float64
medv       float64
dtype: object

In [6]:
feature_cols = [tf.feature_column.numeric_column(k) for k in FEATURES]

# Step 3) Define the estimator

In this step, you need to define the estimator. Tensorflow currently provides 6 pre-built estimators, including 3 for classification task and 3 for regression task:

- Regressor
 - DNNRegressor
 - LinearRegressor
 - DNNLineaCombinedRegressor
- Classifier
 - DNNClassifier
 - LinearClassifier
 - DNNLineaCombinedClassifier
 
In this tutorial, you will use the Linear Regressor. To access this function, you need to use tf.estimator.

The function needs two arguments:

* feature_columns: Contains the variables to include in the model
* model_dir: path to store the graph, save the model parameters, etc

Tensorflow will automatically create a file named train in your working directory. You need to use this path to access the Tensorboard.

In [7]:
estimator = tf.estimator.LinearRegressor(feature_columns = feature_cols, model_dir = "train")

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'train', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000002095868D788>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


The tricky part with TensorFlow is the way to feed the model. Tensorflow is designed to work with parallel computing and very large dataset. Due to the limitation of the machine resources, it is impossible to feed the model with all the data at once. For that, you need to feed a batch of data each time. Note that, we are talking about huge dataset with millions or more records. If you don't add batch, you will end up with a memory error.

For instance, if your data contains 100 observations and you define a batch size of 10, it means the model will see 10 observations for each iteration (10*10).

When the model has seen all the data, it finishes one **epoch**. An epoch defines how many times you want the model to see the data. It is better to set this step to none and let the model performs iteration number of time.

A second information to add is if you want to shuffle the data before each iteration. During the training, it is important to shuffle the data so that the model does not learn specific pattern of the dataset. If the model learns the details of the underlying pattern of the data, it will have difficulties to generalize the prediction for unseen data. This is called **overfitting**. The model performs well on the training data but cannot predict correctly for unseen data.

TensorFlow makes this two steps easy to do. When the data goes to the pipeline, it knows how many observations it needs (batch) and if it has to shuffle the data.

To instruct Tensorflow how to feed the model, you can use pandas_input_fn. This object needs 5 parameters:

- x: feature data
- y: label data
- batch_size: batch. By default 128
- num_epoch: Number of epoch, by default 1
- shuffle: Shuffle or not the data. By default, None

You need to feed the model many times so you define a function to repeat this process. all this function get_input_fn.

In [8]:
def get_input_fn(data_set, num_epochs = None, n_batch = 128, shuffle = True):
    return tf.compat.v1.estimator.inputs.pandas_input_fn(
    x = pd.DataFrame({k: data_set[k].values for k in FEATURES}),
    y = pd.Series(data_set[LABEL].values),
    batch_size = n_batch,
    num_epochs = num_epochs,
    shuffle = shuffle)

The usual method to evaluate the performance of a model is to:

- Train the model
- Evaluate the model in a different dataset
- Make prediction

Tensorflow estimator provides three different functions to carry out this three steps easily.

# Step 4) Train the model

You can use the estimator train to evaluate the model. The train estimator needs an input_fn and a number of steps. You can use the function you created above to feed the model. Then, you instruct the model to iterate 1000 times. Note that, you don't specify the number of epochs, you let the model iterates 1000 times. If you set the number of epoch to 1, then the model will iterate 4 times: There are 400 records in the training set, and the batch size is 128

1. 128 rows
2. 128 rows
3. 128 rows
4. 16 rows

Therefore, it is easier to set the number of epoch to none and define the number of iteration.

In [9]:
estimator.train(input_fn = get_input_fn(training_set, 
                                        num_epochs = None, 
                                        n_batch = 128, 
                                        shuffle = False), 
                                        steps = 1000)

Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
INFO:tensorflow:Calling model_fn.


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

Instructions for updating:
Please use `layer.add_weight` method instead.
Instructions for updating:
Use `tf.cast` instead.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
INFO:tensorflow:Done calli

<tensorflow_estimator.python.estimator.canned.linear.LinearRegressorV2 at 0x2095868dbc8>

You can check the Tensorboard through the following command:
activate hello-tf
- For MacOS

tensorboard --logdir=./train
- For Windows

tensorboard --logdir=train	

## Step 5) Evaluate your model

You can evaluate the fit of your model on the test set with the code below:

In [10]:
ev = estimator.evaluate(input_fn = get_input_fn(test_set, 
                                                num_epochs = 1,
                                                n_batch = 128,
                                                shuffle = False))

INFO:tensorflow:Calling model_fn.


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-10-23T17:01:24Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from train\model.ckpt-4000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-10-23-17:01:25
INFO:tensorflow:Saving dict for global step 4000: average_loss = 20.341146, global_step = 4000, label/mean = 22.08, loss = 20.341146, prediction/mean = 22.887001
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 4000: train\model.ckpt-4000


You can print the loss with the code below:

In [11]:
loss_score = ev["loss"]
print("Loss: {0:f}".format(loss_score))

Loss: 20.341146


The model has a loss of 25.544739. You can check the summary statistic to get an idea of how big the error is.

In [12]:
training_set['medv'].describe()

count    400.000000
mean      22.625500
std        9.572593
min        5.000000
25%       16.600000
50%       21.400000
75%       25.025000
max       50.000000
Name: medv, dtype: float64

From the summary statistic above, you know that the average price for a house is 22 thousand, with a minimum price of 9 thousands and maximum of 50 thousand. The model makes a typical error of 25k dollars.

## Step 6) Make the prediction

Finally, you can use the estimator predict to estimate the value of 6 Boston houses.

In [13]:
y = estimator.predict(input_fn = get_input_fn(prediction_set,
                                             num_epochs = 1,
                                             n_batch = 128,
                                             shuffle = False))

To print the estimated values of , you can use this code:

In [14]:
predictions = list(p["predictions"] for p in itertools.islice(y, 6))
print("Predictions: {}".format(str(predictions)))

INFO:tensorflow:Calling model_fn.


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from train\model.ckpt-4000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Predictions: [array([35.025585], dtype=float32), array([19.02856], dtype=float32), array([24.776585], dtype=float32), array([33.31087], dtype=float32), array([14.747737], dtype=float32), array([20.348122], dtype=float32)]


The model forecast the following values:

| House  | Prediction |
| ------------- | ------------- |
| 1  | 33.803963  |
| 2  | 18.942837  |
| 3  | 26.030296  |
| 4  | 31.343994  |
| 5  | 15.666693  |
| 6  | 20.952637  |

## Numpy Solution

his section explains how to train the model using a numpy estimator to feed the data. The method is the same exept that you will use numpy_input_fn estimator.

In [15]:
training_set_n = pd.read_csv("D:/boston/boston_train.csv").values
test_set_n = pd.read_csv("D:/boston/boston_test.csv").values
prediction_set_n = pd.read_csv("D:/boston/boston_predict.csv").values

## Step 1) Import the data

First of all, you need to differentiate the feature variables from the label. You need to do this for the training data and evaluation. It is faster to define a function to split the data.

In [16]:
def prepare_data(df):
    X_train = df[:, :-3]
    Y_train = df[:, -3]
    return X_train, Y_train

You can use the function to split the label from the features of the train/evaluate dataset

In [17]:
X_train, Y_train = prepare_data(training_set_n)
X_test, Y_test = prepare_data(test_set_n)
print(X_train.shape, Y_train.shape)

(400, 9) (400,)


You need to exclude the last column of the prediction dataset because it contains only NaN

In [18]:
X_predict = prediction_set_n[:, :-2]

Confirm the shape of the array. Note that, the label should not have a dimension, it means (400,).

In [19]:
print(X_train.shape, Y_train.shape, X_predict.shape)

(400, 9) (400,) (6, 9)


You can construct the feature columns as follow:

In [20]:
feature_columns = [tf.feature_column.numeric_column('x', shape = X_train.shape[1:])]

The estimator is defined as before, you instruct the feature columns and where to save the graph.

In [21]:
estimator = tf.estimator.LinearRegressor(feature_columns = feature_columns,
                                        model_dir = "train1")

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'train1', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x00000209586A5F08>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


You can use the numpy estimapor to feed the data to the model and then train the model. Note that, we define the input_fn function before to ease the readability.

In [22]:
train_input = tf.compat.v1.estimator.inputs.numpy_input_fn(x = {"x": X_train}, 
                                                 y = Y_train,
                                                batch_size = 128,
                                                shuffle = False,
                                                num_epochs = None)
estimator.train(input_fn = train_input, steps = 5000)

INFO:tensorflow:Calling model_fn.


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from train1\model.ckpt-5000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 5000 into train1\model.ckpt.
INFO:tensorflow:loss = 62.03431, step = 5000
INFO:tensorflow:global_step/sec: 1066.69
INFO:tensorflow:loss = 61.8862, step = 5100 (0.095 sec)
INFO:tensorflow:global_step/sec: 1253.34
INFO:tensorflow:loss = 61.744713, step = 5200 (0.079 sec)
INFO:tensorflow:global_step/sec: 1352.5
INFO:tensorflow:loss = 61.609512, step = 5300 (0

<tensorflow_estimator.python.estimator.canned.linear.LinearRegressorV2 at 0x209586a5988>

In [23]:
eval_input = tf.compat.v1.estimator.inputs.numpy_input_fn(x = {"x": X_test},
                                                          y = Y_test,
                                                          shuffle = False,
                                                          batch_size = 128,
                                                          num_epochs = 1)
estimator.evaluate(eval_input, steps = None)

INFO:tensorflow:Calling model_fn.


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-10-23T17:01:30Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from train1\model.ckpt-10000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-10-23-17:01:31
INFO:tensorflow:Saving dict for global step 10000: average_loss = 17.154873, global_step = 10000, label/mean = 22.08, loss = 17.154873, prediction/mean = 23.208712
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 10000: train1\model.ckpt-10000


{'average_loss': 17.154873,
 'label/mean': 22.08,
 'loss': 17.154873,
 'prediction/mean': 23.208712,
 'global_step': 10000}

Finaly, you can compute the prediction. It should be the similar as pandas.

In [24]:
test_input = tf.compat.v1.estimator.inputs.numpy_input_fn(x = {"x": X_predict},
                                                          batch_size = 128,
                                                          num_epochs = 1,
                                                          shuffle = False)
y = estimator.predict(test_input)
predictions = list(p["predictions"] for p in itertools.islice(y, 6))
print("Predictions: {}".format(str(predictions)))

INFO:tensorflow:Calling model_fn.


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from train1\model.ckpt-10000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Predictions: [array([35.186127], dtype=float32), array([19.197412], dtype=float32), array([23.930864], dtype=float32), array([34.54175], dtype=float32), array([13.562332], dtype=float32), array([19.704182], dtype=float32)]
