# Simple TensorFlow Regression
MPG Dataset Regression

In [5]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import pandas as pd
import numpy as np
from sklearn import metrics

df = pd.read_csv("http://blackboxai.us/ML/auto-mpg.csv", na_values=['NA', '?'])
df[0:5]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,1,ford torino


In [9]:
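# Keep the car names so we can display them next to the predictions later
cars = df['name']

# Fill missing horsepower values with the column median
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())

# Confirm that no missing horsepower values remain
df['horsepower'].isnull().values.any()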

False

In [0]:
# break into x and y for training

x = df[['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin']].values
y = df['mpg'].values

In [16]:
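# Build the model: two hidden layers (25 and 10 neurons) and one output neuron

model = Sequential()
model.add(Dense(25, input_dim=x.shape[1], activation='relu'))  # Hidden layer 1
model.add(Dense(10, activation='relu'))  # Hidden layer 2
model.add(Dense(1))  # Output layer: a single predicted MPG value

# Mean squared error is the loss minimized during training
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(x, y, verbose=2, epochs=100)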

Train on 398 samples
Epoch 1/100
398/398 - 1s - loss: 4648.5295
Epoch 2/100
398/398 - 0s - loss: 1201.6101
Epoch 3/100
398/398 - 0s - loss: 234.3136
Epoch 4/100
398/398 - 0s - loss: 114.5348
Epoch 5/100
398/398 - 0s - loss: 36.9385
Epoch 6/100
398/398 - 0s - loss: 18.2309
Epoch 7/100
398/398 - 0s - loss: 15.1617
Epoch 8/100
398/398 - 0s - loss: 15.0090
Epoch 9/100
398/398 - 0s - loss: 15.6932
Epoch 10/100
398/398 - 0s - loss: 14.1259
Epoch 11/100
398/398 - 0s - loss: 14.1898
Epoch 12/100
398/398 - 0s - loss: 14.2392
Epoch 13/100
398/398 - 0s - loss: 13.5241
Epoch 14/100
398/398 - 0s - loss: 14.4278
Epoch 15/100
398/398 - 0s - loss: 13.9355
Epoch 16/100
398/398 - 0s - loss: 13.2591
Epoch 17/100
398/398 - 0s - loss: 13.6646
Epoch 18/100
398/398 - 0s - loss: 13.5684
Epoch 19/100
398/398 - 0s - loss: 14.0241
Epoch 20/100
398/398 - 0s - loss: 13.2699
Epoch 21/100
398/398 - 0s - loss: 13.3896
Epoch 22/100
398/398 - 0s - loss: 13.5435
Epoch 23/100
398/398 - 0s - loss: 14.8222
Epoch 24/100
398

<tensorflow.python.keras.callbacks.History at 0x7fa5cb9bac88>

# Neural Network Hyperparameters
If you look at the code above, you will see that the neural network is made up of four layers: an input layer, two hidden layers, and an output layer. The input layer is specified by input_dim, which is set to the number of inputs in the dataset. One input neuron is needed for every input (including dummy variables). The two hidden layers have 25 and 10 neurons, respectively, and the output layer has a single neuron that produces the MPG prediction. You might be wondering how the hidden neuron counts were chosen. This is one of the most common questions about neural networks. Unfortunately, there is no good answer. These counts are hyperparameters: settings that affect neural network performance, yet have no clearly defined method for choosing them.

In general, more hidden neurons mean more capacity to fit complex problems. However, too many neurons can lead to overfitting and lengthy training times, while too few can lead to underfitting and sacrifice accuracy. The number of layers is another hyperparameter. In general, more layers allow the neural network to perform more of its own feature engineering and data preprocessing, but this also comes at the expense of longer training times and a greater risk of overfitting. Neuron counts typically start out larger near the input layer and shrink toward the output layer in a roughly triangular fashion.
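
One rough way to see how layer sizes affect capacity is to count the trainable parameters they imply. The following is a minimal sketch, assuming fully connected layers with one bias per neuron; the helper name dense_param_count and the wider 50/20 configuration are hypothetical and used only for comparison.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the neuron count of every layer, input first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        # n_in * n_out weights plus one bias per neuron in the next layer
        total += n_in * n_out + n_out
    return total


# The network above: 7 inputs, hidden layers of 25 and 10, 1 output
baseline = dense_param_count([7, 25, 10, 1])

# A hypothetical wider network, for comparison only
wider = dense_param_count([7, 50, 20, 1])

print(f"baseline parameters: {baseline}")
print(f"wider parameters:    {wider}")
```

Doubling both hidden layers roughly triples the number of parameters, which is one reason larger networks take longer to train and are more prone to overfitting on a dataset of only 398 rows.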

# Controlling the Amount of Output

One line is produced for each training epoch. You can reduce or eliminate this output with the verbose argument of the fit command:

* verbose=0 - No progress output (use with Jupyter if you do not want output)
* verbose=1 - Display progress bar, does not work well with Jupyter
* verbose=2 - Summary progress output (use with Jupyter if you want to know the loss at each epoch)


# Regression Prediction

Next, we will perform actual predictions. These predictions are assigned to the pred variable and are all MPG predictions from the neural network. Notice that pred is a 2D array. You can always see the dimensions of what is returned by printing pred.shape. Neural networks can return multiple values per prediction, so the result is always a 2D array with one row per prediction. Here the neural network returns only one value per prediction (there are 398 cars, so 398 predictions), giving a shape of (398, 1).


In [17]:
pred = model.predict(x)
print("Shape: {}".format(pred.shape))
print(pred[0:5])


Shape: (398, 1)
[[16.667439]
 [14.977112]
 [16.95792 ]
 [17.119595]
 [16.950394]]
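
Because the predictions are 2D while the target values are 1D, it can help to flatten the predictions before doing your own arithmetic on them. The following is a minimal sketch that uses the first five predictions and MPG values shown above as stand-ins for pred and y; the names pred_sample and y_sample are hypothetical.

```python
import numpy as np

# First five predictions and actual MPG values from the output above
pred_sample = np.array([[16.667439], [14.977112], [16.95792],
                        [17.119595], [16.950394]], dtype=np.float32)
y_sample = np.array([18.0, 15.0, 18.0, 16.0, 17.0])

# Subtracting a (5,) array from a (5, 1) array broadcasts to (5, 5),
# which is almost never what you want
broadcast_shape = (pred_sample - y_sample).shape

# Flattening gives one prediction per car, aligned with y_sample
flat = pred_sample.flatten()
errors = flat - y_sample

print(f"Broadcast shape: {broadcast_shape}")
print(f"Flattened shape: {flat.shape}")
print(f"Per-car errors: {errors}")
```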


We would like to see how good these predictions are. We know what the correct MPG is for each car, so we can measure how close the neural network was.

In [18]:
# Measure RMSE error.  RMSE is common for regression.
score = np.sqrt(metrics.mean_squared_error(y, pred))
print(f"Final score (RMSE): {score}")

Final score (RMSE): 3.5091692396876355


This means that the predictions were typically off by about 3.5 MPG from the correct value. This is not very good, but we will soon see how to improve it.
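
RMSE is the square root of the mean of the squared differences between the actual and predicted values. As a minimal sketch, it can be computed by hand with the standard library. The rmse helper below is hypothetical, and the five sample values are the first five cars from the output above, so the result is not the same as the score for the full dataset.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Actual and predicted MPG for the first five cars
actual_mpg = [18.0, 15.0, 18.0, 16.0, 17.0]
predicted_mpg = [16.667439, 14.977112, 16.95792, 17.119595, 16.950394]

sample_rmse = rmse(actual_mpg, predicted_mpg)
print(f"RMSE on the first five cars: {sample_rmse:.3f}")
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones do.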

We can also print out the first 10 cars, with their predicted and actual MPG:

In [21]:
# Sample predictions
for i in range(10):
    print(f"{i+1}. Car name: {cars[i]}, MPG: {y[i]}, predicted MPG: {pred[i]}")

1. Car name: chevrolet chevelle malibu, MPG: 18.0, predicted MPG: [16.667439]
2. Car name: buick skylark 320, MPG: 15.0, predicted MPG: [14.977112]
3. Car name: plymouth satellite, MPG: 18.0, predicted MPG: [16.95792]
4. Car name: amc rebel sst, MPG: 16.0, predicted MPG: [17.119595]
5. Car name: ford torino, MPG: 17.0, predicted MPG: [16.950394]
6. Car name: ford galaxie 500, MPG: 15.0, predicted MPG: [9.825745]
7. Car name: chevrolet impala, MPG: 14.0, predicted MPG: [9.46888]
8. Car name: plymouth fury iii, MPG: 14.0, predicted MPG: [9.819969]
9. Car name: pontiac catalina, MPG: 14.0, predicted MPG: [9.017757]
10. Car name: amc ambassador dpl, MPG: 15.0, predicted MPG: [13.393219]
