
This problem is slightly different from the earlier example: the training and test sets are provided as two separate data sets. They differ from the data used in the example we have been following, but the concepts are the same and the task is still the California housing problem.

The first step is to load the data with pandas. Again, note the use of two separate variables, df_train and df_test.

In [1]:
# needed to create the data frame
import pandas as pd

# create data frame from csv file we hosted on our github
df_train = pd.read_csv('https://raw.githubusercontent.com/rih28/dataAnalytics/master/FinalTS.csv', index_col=0)
df_test = pd.read_csv('https://raw.githubusercontent.com/rih28/dataAnalytics/master/FinalTST.csv', index_col=0)

print(df_train[:6])
print(df_test[:6])

   longitude  latitude  ...  population_per_household  median_house_value
1    -122.23     37.88  ...                  1.819209              452600
2    -122.22     37.86  ...                  3.362745              358500
4    -122.25     37.85  ...                  1.388060              341300
5    -122.25     37.85  ...                  1.207265              342200
6    -122.25     37.85  ...                  1.501818              269700
7    -122.25     37.84  ...                  2.755668              299200

[6 rows x 13 columns]
    longitude  latitude  ...  population_per_household  median_house_value
3     -122.24     37.85  ...                  2.802260              352100
10    -122.25     37.84  ...                  2.172269              261100
11    -122.26     37.85  ...                  2.263682              281500
13    -122.26     37.85  ...                  2.346154              213500
20    -122.27     37.84  ...                  2.509091              162900
28    -12

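Next we use pandas' iloc to select 11 columns as our predictors, leaving the final column as our target values. Because index_col=0 already turned the id column into the index, the slice 1:12 starts at latitude and skips longitude (position 0), as the printed output shows. We need to do this for both the training and test datasets.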

In [2]:
# needed to help with speedy maths based calculations
import numpy as np

# iloc selects rows by integer position. Indexing with a random permutation of the positions shuffles the training rows.
df_train = df_train.iloc[np.random.permutation(len(df_train))]

predictors_train = df_train.iloc[:,1:12]
predictors_test = df_test.iloc[:,1:12]
print(predictors_test)


       latitude  ...  population_per_household
3         37.85  ...                  2.802260
10        37.84  ...                  2.172269
11        37.85  ...                  2.263682
13        37.85  ...                  2.346154
20        37.84  ...                  2.509091
...         ...  ...                       ...
20614     39.09  ...                  3.039062
20615     39.08  ...                  3.069620
20617     39.08  ...                  3.085333
20622     39.01  ...                  3.082803
20630     39.12  ...                  3.801980

[4126 rows x 11 columns]
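
To make the column selection concrete, here is a minimal sketch on a small made-up frame. The column names `c0` to `c12` and the values are placeholders, not the housing data. It shows that `iloc[:, 1:12]` picks positions 1 through 11 and leaves out position 0, and that indexing with `np.random.permutation` reorders the rows without dropping any.

```python
import numpy as np
import pandas as pd

# Toy frame with 13 columns, standing in for the housing data (values are made up).
columns = ['c{}'.format(i) for i in range(13)]
toy = pd.DataFrame(np.arange(5 * 13).reshape(5, 13), columns=columns)

# Positions 1..11 become predictors; position 0 is skipped, position 12 is the target.
predictors = toy.iloc[:, 1:12]
targets = toy.iloc[:, 12:13]   # slicing with 12:13 keeps a one-column DataFrame

# Shuffle rows: permutation(n) returns the integers 0..n-1 in random order.
shuffled = toy.iloc[np.random.permutation(len(toy))]

print(list(predictors.columns))
print(targets.shape)
print(sorted(shuffled.index) == list(toy.index))
```

If the first column should also be a predictor, the slice would need to start at 0 instead of 1.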


Similarly, we select the targets, i.e. the median_house_value column.

In [3]:
targets_train = df_train.iloc[:,12:13]
targets_test = df_test.iloc[:,12:13]
print(targets_train)

       median_house_value
5104               120000
17992              234300
14968              153800
5654               268500
4720               500000
...                   ...
934                211300
10354              202200
9562               110200
19342              160000
1239               103800

[16514 rows x 1 columns]


We divide the targets by a SCALE value of 1000000 because the values in median_house_value are large (in the hundreds of thousands). This helps the model train, as neural network nodes tend to train better with values between 0 and 1.

In [4]:
SCALE = 1000000.0 # Need this to scale the results.
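
As a small illustration of what the scaling does, the sketch below divides a few of the target values printed earlier by `SCALE` and then multiplies them back. The variable names `raw`, `scaled` and `restored` are illustrative only.

```python
import numpy as np

SCALE = 1000000.0

# A few median_house_value entries taken from the printed training targets.
raw = np.array([120000, 234300, 153800, 268500, 500000], dtype=np.float64)

# Divide before training so the targets are small numbers.
scaled = raw / SCALE

# Multiply back afterwards to return to the original units.
restored = scaled * SCALE

print(scaled)
print(np.allclose(restored, raw))
```

The same multiplication is what has to be applied to the model's predictions, since the model only ever sees the scaled targets.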

Get the size of the training and test datasets. Here, we use a single column to measure the length. It doesn't matter which column you use.

In [5]:
# len() already returns an int, so no conversion is needed.
trainsize = len(df_train['median_house_value'])
testsize = len(df_test['median_house_value'])

Get the number of input values, i.e. nppredictors, which is 11. We also set the number of outputs: we only want 1, a prediction of the median_house_value.

In [6]:
nppredictors = len(predictors_train.columns)
print(len(predictors_test.columns))
noutputs = 1

11


Set up TensorFlow for training a DNN Regressor. The only part of the setup that differs from the Linear Regressor (apart from the name) is hidden_units.

This sets up a deep neural network with an input of 11, where hidden layer 1 has 20 nodes, hidden layer 2 has 18 nodes and hidden layer 3 has 14 nodes, with an output of 1. You will read more about this configuration soon.
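
To get a feel for the size of this configuration, the sketch below counts the weights and biases in a fully connected 11-20-18-14-1 network and pushes one random input through it with NumPy. It assumes plain dense layers with a ReLU activation on the hidden layers and a linear output. It illustrates the layer shapes only and is not the estimator's actual implementation.

```python
import numpy as np

# Layer sizes: 11 inputs, hidden layers of 20, 18 and 14 nodes, 1 output.
layer_sizes = [11, 20, 18, 14, 1]

rng = np.random.default_rng(0)

# One weight matrix and one bias vector per pair of adjacent layers.
weights = [rng.normal(size=(n_in, n_out)) * 0.1
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

# Total number of trainable values in this sketch.
n_params = sum(w.size + b.size for w, b in zip(weights, biases))


def forward(x):
    """Forward pass: ReLU on hidden layers, linear output (an assumption)."""
    for i, (w, b) in enumerate(zip(weights, biases)):
        x = x @ w + b
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)
    return x


x = rng.normal(size=(1, 11))   # one made-up example with 11 predictor values
y = forward(x)

print(n_params)
print(y.shape)
```

Each layer contributes (inputs × outputs) weights plus one bias per output node, so widening any hidden layer changes the size of the two weight matrices next to it.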

In [7]:
# import tensorflow
%tensorflow_version 1.x
import tensorflow as tf

# check the version
print(tf.__version__)

# needed for high-level file management
import shutil  

# logging for tensorflow
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# removes a saved model from the last training attempt.
shutil.rmtree('/tmp/DNN_house_regression_trained_model', ignore_errors=True)

# Build the DNN regressor: three hidden layers, Adam optimizer, checkpoints saved in model_dir.
estimator = tf.contrib.learn.SKCompat(
    tf.contrib.learn.DNNRegressor(
        model_dir='/tmp/DNN_house_regression_trained_model',
        hidden_units=[20, 18, 14],
        optimizer=tf.train.AdamOptimizer(learning_rate=0.01),
        enable_centered_bias=False,
        feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_train.values)))

# Prints a log to show model is starting to train
print("starting to train")

# Train the model. Pass in predictor values and target values.
estimator.fit(predictors_train.values, targets_train.values.reshape(trainsize, noutputs)/SCALE, steps=10000)

# Next, we can check our predictions based on our predictors.
preds = estimator.predict(x=predictors_train.values)

# Multiply by SCALE to undo the scaling applied to the targets during training,
# so the predictions are in the same units as the original targets.
predslistscale = preds['scores']*SCALE

# pred = format(str(predslistscale)) # useful for checking outputs and printing.

# Calculate RMSE, i.e. how well the model fits, using the predictions and targets:
# take the difference between the actual and the predicted value, square it,
# average all the squares and then take the square root.
# The RMSE puts a heavier weight on larger errors.
# Flatten both arrays first: targets_train.values has shape (n, 1), and subtracting
# a 1-D array of predictions from it would broadcast to an (n, n) matrix.
rmse = np.sqrt(np.mean((targets_train.values.ravel() - np.ravel(predslistscale))**2))
print('DNNRegression has RMSE of {0}'.format(rmse))


# Calculate the mean of the median house values.
avg = np.mean(df_train['median_house_value'])

# Calculate the RMSE of a baseline that always predicts the mean of all target values.
# The fit of a proposed regression model should be better than the fit of this mean model.
# In this run that did not appear to be the case, but the result varies from run to run.
rmse = np.sqrt(np.mean((df_train['median_house_value'] - avg)**2))
print('Just using average = {0} has RMSE of {1}'.format(avg, rmse))


TensorFlow 1.x selected.
1.15.2
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

Instructions for updating:
Please specify feature columns explicitly.
Instructions for updating:
Please use tensorflow/transform or tf.data.
Instructions for updating:
Please feed input to tf.data to support dask.
Instructions for updating:
Please access pandas data directly.
Instructions for updating:
Please use tensorflow/transform or tf.data.
Instructions for updating:
Please convert numpy dtypes explicitly.
Instructions for updating:
Please specify feature columns explicitly.
Instructions for updating:
Please switch to tf.contrib.estimator.*_head.
Instructions for updating:
Please replace uses
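
The RMSE calculation in the training cell is sensitive to array shapes. The sketch below uses made-up numbers to show how subtracting a 1-D prediction array from an (n, 1) target array broadcasts to an (n, n) matrix, and how flattening both arrays gives the intended element-wise RMSE. It also computes the mean-only baseline in the same way.

```python
import numpy as np

# Made-up targets with shape (n, 1), as produced by .values on a one-column DataFrame,
# and made-up predictions with shape (n,).
targets = np.array([[100.0], [200.0], [300.0], [400.0]])
preds = np.array([110.0, 190.0, 320.0, 380.0])

# Pitfall: (4, 1) minus (4,) broadcasts to a (4, 4) matrix of all pairwise differences.
wrong_diff = targets - preds

# Flatten both so the subtraction is element-wise.
diff = targets.ravel() - preds.ravel()
rmse = np.sqrt(np.mean(diff ** 2))

# Baseline: always predict the mean of the targets.
baseline_rmse = np.sqrt(np.mean((targets.ravel() - targets.mean()) ** 2))

print(wrong_diff.shape, diff.shape)
print(rmse, baseline_rmse)
```

Checking the shape of the difference before averaging is a cheap way to catch this kind of mistake.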

The same caveats about training and overfitting apply, so just re-run the cells if the values aren't reasonable. However, this is mostly just an exercise and is not really required. Note that we run the model differently here, with no user-constructed input data frame. The set-up is the same as in the training part above, but there is no need for the optimizer, because the trained parameters are restored from model_dir rather than trained again.

We pass in predictors_test.values, which are the test set input values.

The overall output is two arrays (only a few values of each are shown). The predicted values seem reasonable and are not all the same, so the model has learned something, but they differ quite a bit from the target values (the actual outputs).

predicted [239933.05 247445.73 215407.52 ... 149025.11 162564.19 160469.2 ]
targets   [352100    261100    281500    ... 55100     77500     108300   ]

Remember, this is real data and very noisy. But the predictions aren't quite as bad as they seem: higher predicted values line up with higher targets. The numbers aren't exact, but they could certainly be used to make rough estimates, i.e. to say whether a given set of input values gives a median_house_value that is low, medium or high.
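
One way to put a number on "higher predictions match higher targets" is a correlation coefficient. The sketch below uses only the six predicted and six target values visible in the printed arrays, not the full test set, so the results are purely illustrative.

```python
import numpy as np

# The six values visible in the printed output (the elided middle part is not used).
predicted = np.array([239933.05, 247445.73, 215407.52,
                      149025.11, 162564.19, 160469.2])
targets = np.array([352100, 261100, 281500, 55100, 77500, 108300], dtype=np.float64)

# Pearson correlation: close to 1 means predictions rise and fall with the targets.
corr = np.corrcoef(predicted, targets)[0, 1]

# RMSE on the same six values, in the original units.
rmse = np.sqrt(np.mean((targets - predicted) ** 2))

print('correlation:', corr)
print('rmse:', rmse)
```

A high correlation together with a large RMSE fits the description above: the ordering is roughly right even though the individual values are off.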

In [14]:
# Rebuild the estimator with the same model_dir so the trained parameters are restored.
estimator = tf.contrib.learn.SKCompat(
    tf.contrib.learn.DNNRegressor(
        model_dir='/tmp/DNN_house_regression_trained_model',
        hidden_units=[20, 18, 14],
        enable_centered_bias=False,
        feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(predictors_test.values)))

preds = estimator.predict(x=predictors_test.values)
predslistscale = preds['scores']*SCALE
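# Convert the array to its printed form (NumPy abbreviates long arrays with "...").
pred = str(predslistscale)
#np.set_printoptions(threshold=np.inf)  # uncomment to print every value instead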
print("predicted", pred)
print("targets", targets_test['median_house_value'].values)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fcb96fbda20>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/DNN_house_regression_trained_model', '_session_creation_timeout_secs': 7200}
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/DNN_house_regression_trained_model/mo

**Go back to the course text**