# TensorFlow solution

The last section is dedicated to a TensorFlow solution. This method is slightly more complicated than the other one.

Note that if you use Jupyter notebook, you need to restart the kernel and clear its state before running this section.

TensorFlow provides the Dataset API to pass data into the pipeline. In this section, you will build the input_fn function yourself.

## Step 1) Define the path and the format of the data

First of all, you declare two variables holding the paths of the csv files. Note that you have two files, one for the training set and one for the testing set.

In [1]:
import tensorflow.compat.v1 as tf 
tf.disable_v2_behavior()

df_train = "D:/boston/boston_train.csv"
df_eval = "D:/boston/boston_test.csv"

Instructions for updating:
non-resource variables are not supported in the long term


Then, you need to define the columns you want to use from the csv file. You will use all of them. After that, you need to declare the type of each column.

Float variables are declared with the default value [0.0].

In [2]:
COLUMNS = ["crim", "zn", "indus", "nox", "rm", "age",
                "dis", "tax", "ptratio", "medv"]
RECORDS_ALL = [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0]]
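
Because every column is a float, the list of defaults can also be derived from the column names, which keeps the two lists the same length if a column is added or removed. A minimal sketch, reusing the names COLUMNS and RECORDS_ALL from the cell above:

```python
COLUMNS = ["crim", "zn", "indus", "nox", "rm", "age",
           "dis", "tax", "ptratio", "medv"]

# One [0.0] default per column: it declares the column as float
# and supplies 0.0 when a field is empty.
RECORDS_ALL = [[0.0] for _ in COLUMNS]

# decode_csv expects exactly one default per column in the file.
assert len(RECORDS_ALL) == len(COLUMNS)
print(RECORDS_ALL[:2])  # [[0.0], [0.0]]
```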

## Step 2) Define the input_fn function

The function can be broken into three parts:

1. Import the data
2. Create the iterator
3. Consume the data

Below is the overall code to define the function. The code is explained afterwards.

In [3]:
def input_fn(data_file, batch_size, num_epoch=None):
    # Step 1: import the data
    def parse_csv(value):
        # Decode one csv line into one float tensor per column
        columns = tf.decode_csv(value, record_defaults=RECORDS_ALL)
        features = dict(zip(COLUMNS, columns))
        # Separate the label (medv) from the features
        labels = features.pop('medv')
        return features, labels

    # Extract lines from the input file using the Dataset API, skipping the header
    dataset = tf.data.TextLineDataset(data_file).skip(1).map(parse_csv)

    # Repeat the data (indefinitely if num_epoch is None) and group it into batches
    dataset = dataset.repeat(num_epoch)
    dataset = dataset.batch(batch_size)
    # Step 2: create the iterator
    iterator = dataset.make_one_shot_iterator()
    # Step 3: get the next batch of features and labels to consume
    features, labels = iterator.get_next()
    return features, labels

**Import the data**

For a csv file, the dataset reads one line at a time. To build the dataset, you use the object TextLineDataset. Your file has a header, so you need skip(1) to skip the first line. At this point, the pipeline only reads the data and excludes the header. To feed the model, you also need to separate the features from the label. The method used to apply any transformation to the data is map.

The map method calls a function that you write to describe how each line is transformed. In a nutshell, you pass the data to the TextLineDataset object, exclude the header, and apply a transformation defined by a function.

Code explanation:

- tf.data.TextLineDataset(data_file): reads the csv file line by line
- .skip(1): skips the header
- .map(parse_csv): parses the records into tensors. You need to define a function for map to apply. You can call this function parse_csv.

This function parses the csv file with the method tf.decode_csv and declares the features and the label. The features can be declared as a dictionary or a tuple. You use the dictionary method because it is more convenient.

Code explanation:

- tf.decode_csv(value, record_defaults = RECORDS_ALL): the method decode_csv takes each line produced by the TextLineDataset and splits it into columns. record_defaults tells TensorFlow the type of each column.
- dict(zip(COLUMNS, columns)): populates the dictionary with all the columns extracted during this data processing
- features.pop('medv'): removes the target variable from the features and stores it as the label
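
To see what parse_csv does to a single record without building a graph, the same logic can be mimicked in plain Python. This is only an analogue: parse_line is a hypothetical helper, the csv module stands in for tf.decode_csv, and the sample line reuses the values of the first batch printed in this section (its exact formatting in the file is an assumption).

```python
import csv

COLUMNS = ["crim", "zn", "indus", "nox", "rm", "age",
           "dis", "tax", "ptratio", "medv"]

def parse_line(line):
    """Plain-Python analogue of parse_csv for one csv line."""
    fields = next(csv.reader([line]))
    # Convert each field to float; an empty field falls back to 0.0,
    # mirroring the [0.0] record defaults.
    values = [float(v) if v else 0.0 for v in fields]
    features = dict(zip(COLUMNS, values))
    # Remove the target from the features and keep it as the label.
    label = features.pop("medv")
    return features, label

line = "2.3004,0.0,19.58,0.605,6.319,96.1,2.1,403.0,14.7,23.8"
features, label = parse_line(line)
print(label)          # 23.8
print(len(features))  # 9
```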

The dataset needs two more elements to feed the tensors iteratively. First, you add the method repeat so that the dataset can keep feeding the model indefinitely. Without it, the pipeline makes a single pass over the data and then raises an error because no more data is available.

After that, you control the batch size with the batch method, which sets how many rows pass through the pipeline at each iteration. A larger batch size makes each iteration slower and uses more memory.
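
The interaction between repeat and batch can be illustrated with ordinary Python iterators. The sketch below is an imitation of the semantics only, not of the tf.data implementation; batches is a hypothetical helper. Since repeat comes before batch, a batch may span the boundary between two passes over the data.

```python
from itertools import cycle, islice

def batches(rows, batch_size, num_epoch=None):
    """Yield lists of rows, imitating repeat(num_epoch).batch(batch_size)."""
    if num_epoch is None:
        stream = cycle(rows)  # repeat forever
    else:
        stream = (r for _ in range(num_epoch) for r in rows)
    while True:
        batch = list(islice(stream, batch_size))
        if not batch:
            return  # data exhausted
        yield batch

rows = [1, 2, 3, 4, 5]

# One pass: the last batch is smaller than batch_size.
print(list(batches(rows, batch_size=2, num_epoch=1)))
# [[1, 2], [3, 4], [5]]

# Endless repeat: the third batch mixes the end and the start of the data.
print(list(islice(batches(rows, batch_size=2), 4)))
# [[1, 2], [3, 4], [5, 1], [2, 3]]
```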

**Create the iterator**

Now you are ready for the second step: create an iterator to return the elements in the dataset.

The simplest way of creating an iterator is with the method make_one_shot_iterator.

After that, you can create the features and labels from the iterator.

## Step 3) Consume the data

You can check what the input_fn function returns. You need to call the function and run its output in a session to consume the data. Try it with a batch size equal to 1.

Note that it prints the features as a dictionary and the label as an array.

It shows the first data row of the csv file. You can run this code several times with different batch sizes.

In [4]:
next_batch = input_fn(df_train, batch_size = 1, num_epoch = None)
with tf.Session() as sess:
    first_batch = sess.run(next_batch)
    print(first_batch)

Instructions for updating:
Use `for ... in dataset:` to iterate over a dataset. If using `tf.estimator`, return the `Dataset` object directly from your input function. As a last resort, you can use `tf.compat.v1.data.make_one_shot_iterator(dataset)`.
({'crim': array([2.3004], dtype=float32), 'zn': array([0.], dtype=float32), 'indus': array([19.58], dtype=float32), 'nox': array([0.605], dtype=float32), 'rm': array([6.319], dtype=float32), 'age': array([96.1], dtype=float32), 'dis': array([2.1], dtype=float32), 'tax': array([403.], dtype=float32), 'ptratio': array([14.7], dtype=float32)}, array([23.8], dtype=float32))


## Step 4) Define the feature column

You need to define the numeric columns as follows:

In [5]:
X1= tf.feature_column.numeric_column('crim')
X2= tf.feature_column.numeric_column('zn')
X3= tf.feature_column.numeric_column('indus')
X4= tf.feature_column.numeric_column('nox')
X5= tf.feature_column.numeric_column('rm')
X6= tf.feature_column.numeric_column('age')
X7= tf.feature_column.numeric_column('dis')
X8= tf.feature_column.numeric_column('tax')
X9= tf.feature_column.numeric_column('ptratio')

Note that you need to combine all the feature columns in a list:

In [6]:
base_columns = [X1, X2, X3,X4, X5, X6,X7, X8, X9]
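
The nine definitions can also be generated from the column names. The sketch below only derives the feature names in plain Python; LABEL and FEATURES are hypothetical names, and the TensorFlow call is left as a comment because it assumes tf is imported as in the first cell.

```python
COLUMNS = ["crim", "zn", "indus", "nox", "rm", "age",
           "dis", "tax", "ptratio", "medv"]
LABEL = "medv"

# Every column except the label is a numeric feature.
FEATURES = [name for name in COLUMNS if name != LABEL]
print(FEATURES)

# With TensorFlow available, X1..X9 could then be collapsed into one line:
# base_columns = [tf.feature_column.numeric_column(k) for k in FEATURES]
```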

## Step 5) Build the model

You can build the model with the estimator LinearRegressor.

In [7]:
model = tf.estimator.LinearRegressor(feature_columns = base_columns, model_dir = 'train2')

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'train2', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000001E51C664DC8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


You need to use a lambda function to pass arguments to the function input_fn. The estimator calls input_fn itself and does not supply the file path or the batch size, so without a wrapper such as a lambda you cannot train the model.

In [8]:
# Train the estimator
model.train(steps =1000,    
          input_fn= lambda : input_fn(df_train, batch_size=128, num_epoch = None))

Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
INFO:tensorflow:Calling model_fn.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Instructions for updating:
Please use `layer.add_weight` method instead.
Instructions for updating:
Use `tf.cast` instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into train2\model.ckpt.
INFO:tensorflow:loss = 83729.64, step = 1
INFO:tensorflow:global_step/sec: 119.532
INFO:tensorflow:loss = 13909.657, step = 101 (0.838 sec)
INFO:tensorflow:global_step/sec: 128.055
INFO:tensorflow:loss = 12881.449, step = 201 (0.782 sec)
INFO:tensorflow:global_step/sec: 129.378
INFO:tensorflow:loss = 12391.541, step = 301 (0.772 sec)
INFO:te

<tensorflow_estimator.python.estimator.canned.linear.LinearRegressor at 0x1e51c734dc8>
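
The lambda used for training only freezes the arguments of input_fn so that it can be called later without them. functools.partial from the standard library does the same thing. In the sketch below, input_fn is a stand-in that echoes its arguments and call_like_estimator is a hypothetical helper imitating a caller that does not pass the data arguments.

```python
from functools import partial

def input_fn(data_file, batch_size, num_epoch=None):
    # Stand-in for the real input_fn: it only echoes its arguments.
    return data_file, batch_size, num_epoch

def call_like_estimator(fn):
    # Hypothetical helper: calls the function without data arguments.
    return fn()

path = "D:/boston/boston_train.csv"
with_lambda = call_like_estimator(lambda: input_fn(path, batch_size=128))
with_partial = call_like_estimator(partial(input_fn, path, batch_size=128))

print(with_lambda)                  # ('D:/boston/boston_train.csv', 128, None)
print(with_lambda == with_partial)  # True
```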

You can evaluate the fit of your model on the test set with the code below:

In [10]:
results = model.evaluate(steps = None, 
                       input_fn = lambda: input_fn(df_eval, batch_size = 128, num_epoch = 1))
for key in results:
    print(" {}, was: {}".format(key, results[key]))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-10-23T17:21:25Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from train2\model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-10-23-17:21:26
INFO:tensorflow:Saving dict for global step 1000: average_loss = 32.15896, global_step = 1000, label/mean = 22.08, loss = 3215.896, prediction/mean = 22.404533
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1000: train2\model.ckpt-1000
 average_loss, was: 32.158958435058594
 label/mean, was: 22.079999923706055
 loss, was: 3215.89599609375
 prediction/mean, was: 22.40453338623047
 global_step, was: 1000
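
The reported metrics are easier to interpret after a little arithmetic. Assuming average_loss is the mean squared error (the default loss of LinearRegressor) and loss is the sum over the batch, the square root of average_loss gives the typical prediction error in the units of medv:

```python
import math

average_loss = 32.15896  # from the evaluation log above
loss = 3215.896          # from the evaluation log above

# Root mean squared error, in the same units as the label medv.
rmse = math.sqrt(average_loss)
print(round(rmse, 2))              # 5.67

# Ratio between the summed loss and the average loss.
print(round(loss / average_loss))  # 100
```

A ratio of 100 would be consistent with all evaluation rows fitting into a single batch of 128, although the number of rows in the test file is not stated here.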


The last step is predicting the value of y based on the value of X, the matrix of features. You can write a dictionary with the values you want to predict. Your model has 9 features, so you need to provide a value for each. The model will return one prediction for each row of values.

The code below writes out the values of each feature contained in the df_predict csv file.

You need to write a new input_fn function because there is no label in the dataset. You can use the method from_tensors of the Dataset API, which treats the whole dictionary as a single batch.

In [11]:
prediction_input = {
          'crim': [0.03359,5.09017,0.12650,0.05515,8.15174,0.24522],
          'zn': [75.0,0.0,25.0,33.0,0.0,0.0],
          'indus': [2.95,18.10,5.13,2.18,18.10,9.90],
          'nox': [0.428,0.713,0.453,0.472,0.700,0.544],
          'rm': [7.024,6.297,6.762,7.236,5.390,5.782],
          'age': [15.8,91.8,43.4,41.1,98.9,71.7],
          'dis': [5.4011,2.3682,7.9809,4.0220,1.7281,4.0317],
          'tax': [252,666,284,222,666,304],
          'ptratio': [18.3,20.2,19.7,18.4,20.2,18.4]
     }

def test_input_fn():
    dataset = tf.data.Dataset.from_tensors(prediction_input)    
    return dataset
     
# Predict all our prediction_input
pred_results = model.predict(input_fn = test_input_fn)

Finally, you print the predictions.

In [13]:
for pre in enumerate(pred_results):
    print(pre)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from train2\model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
(0, {'predictions': array([32.297546], dtype=float32)})
(1, {'predictions': array([18.96125], dtype=float32)})
(2, {'predictions': array([27.270979], dtype=float32)})
(3, {'predictions': array([29.299236], dtype=float32)})
(4, {'predictions': array([16.436684], dtype=float32)})
(5, {'predictions': array([21.460876], dtype=float32)})
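
To read the predictions next to their inputs, the dictionary of columns can be turned into one dictionary per row. This sketch uses only three of the nine features to stay short, and copies the predicted values from the output above rather than calling the model.

```python
# Subset of prediction_input (three of the nine features).
prediction_input = {
    "crim": [0.03359, 5.09017, 0.12650, 0.05515, 8.15174, 0.24522],
    "rm": [7.024, 6.297, 6.762, 7.236, 5.390, 5.782],
    "tax": [252, 666, 284, 222, 666, 304],
}
# Values copied from the printed predictions.
predictions = [32.297546, 18.96125, 27.270979,
               29.299236, 16.436684, 21.460876]

# Transpose the column-wise dictionary into a list of row dictionaries.
rows = [dict(zip(prediction_input, values))
        for values in zip(*prediction_input.values())]

for row, pred in zip(rows, predictions):
    print(row, "->", round(pred, 2))
```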


# Summary

To train a model, you need to:

- Define the features: Independent variables: X
- Define the label: Dependent variable: y
- Construct a train/test set
- Define the initial weight
- Define the loss function: MSE
- Optimize the model: Gradient descent
- Define:
  - Learning rate
  - Number of epochs
  - Batch size
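
The items in this list can be put together in a small plain-Python sketch of mini-batch gradient descent for a one-feature linear regression. It is an illustration of the general recipe under simple assumptions (noise-free toy data, zero initial weights), not the optimizer used by LinearRegressor; train_linear and its parameters are hypothetical names.

```python
import random

def train_linear(xs, ys, learning_rate=0.05, num_epoch=500,
                 batch_size=2, seed=0):
    """Fit y = w * x + b by mini-batch gradient descent on the MSE."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0  # initial weights
    data = list(zip(xs, ys))
    for _ in range(num_epoch):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            n = len(batch)
            # Gradients of mean((w*x + b - y)^2) over the batch.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

# Toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = train_linear(xs, ys)
print(round(w, 2), round(b, 2))  # should be close to 2 and 1
```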
 
In this tutorial, you learned how to use the high level API for a linear regression estimator. You need to define:

1. Feature columns. If continuous: tf.feature_column.numeric_column(). You can populate the list with a Python list comprehension.
2. The estimator: tf.estimator.LinearRegressor(feature_columns, model_dir)
3. A function to import the data, the batch size and epoch: input_fn()

After that, you are ready to train, evaluate and make predictions with train(), evaluate() and predict().