# Neural network model to approximate a car's miles per gallon (mpg) from its features

- Read the dataset
- Clean the dataset
- Create X and y from the dataset
- Perform train test split
- Create the model
- Fit the model
- Verify the model
- Save the model to disk

In [67]:
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping
from sklearn import metrics
import numpy as np

## Read the dataset

In [53]:
df = pd.read_csv('auto-mpg.csv', na_values=['NA', '?'])
df.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 | ford torino |


## Clean the dataset

Count the null values in each column

In [54]:
df.isnull().sum()

mpg             0
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
year            0
origin          0
name            0
dtype: int64

Replace the six missing `horsepower` values with the column median

In [55]:
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())

In [56]:
df.isnull().sum()

mpg             0
cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
year            0
origin          0
name            0
dtype: int64
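
The median fill used above can be checked on a toy column. This is a minimal sketch with made-up horsepower values (the series `hp` is hypothetical, not taken from `auto-mpg.csv`); it relies on `Series.median()` skipping missing values by default.

```python
import pandas as pd

# Hypothetical toy column with two missing entries (not real dataset rows)
hp = pd.Series([130.0, float("nan"), 150.0, 140.0, float("nan")])

# median() ignores NaN by default, so it is computed from 130, 140 and 150
median = hp.median()

# fillna returns a new Series; the original is left untouched
filled = hp.fillna(median)

print(median)
print(filled.tolist())
print(filled.isnull().sum())
```

The median is less sensitive to a few extreme values than the mean, which is one common reason to prefer it for a skewed column.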

## Create X and y from the dataset

In [57]:
y = df.pop('mpg').values
X = df.drop('name', axis=1).values

In [58]:
y[:5]

array([18., 15., 18., 16., 17.])

In [59]:
X[:5]

array([[8.000e+00, 3.070e+02, 1.300e+02, 3.504e+03, 1.200e+01, 7.000e+01,
        1.000e+00],
       [8.000e+00, 3.500e+02, 1.650e+02, 3.693e+03, 1.150e+01, 7.000e+01,
        1.000e+00],
       [8.000e+00, 3.180e+02, 1.500e+02, 3.436e+03, 1.100e+01, 7.000e+01,
        1.000e+00],
       [8.000e+00, 3.040e+02, 1.500e+02, 3.433e+03, 1.200e+01, 7.000e+01,
        1.000e+00],
       [8.000e+00, 3.020e+02, 1.400e+02, 3.449e+03, 1.050e+01, 7.000e+01,
        1.000e+00]])
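
One detail of the cell above is easy to miss: `df.pop('mpg')` removes the column from `df` in place, which is why the following `drop` only has to remove `name`. The sketch below shows this on a hypothetical two-row frame (`toy` and its reduced column set are assumptions for illustration).

```python
import pandas as pd

# Hypothetical two-row frame with a target, one feature and a text column
toy = pd.DataFrame({
    "mpg": [18.0, 15.0],
    "cylinders": [8, 8],
    "name": ["chevrolet chevelle malibu", "buick skylark 320"],
})

# pop returns the column and deletes it from the frame in place
y_toy = toy.pop("mpg").values
columns_after_pop = list(toy.columns)

# drop returns a new frame, so toy itself still keeps the 'name' column
X_toy = toy.drop("name", axis=1).values

print(columns_after_pop)
print(y_toy, X_toy.shape)
```

Because `pop` mutates the frame, re-running that cell without reloading the data would raise a `KeyError` for `mpg`.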

## Perform train test split

In [60]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
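
The training log in the fit step reports 298 training and 100 validation samples, 398 rows in total. The arithmetic below reproduces those sizes. It is a sketch that assumes scikit-learn rounds a fractional `test_size` up to the next whole sample and gives the remainder to the training set.

```python
import math

n_samples = 298 + 100   # total rows, taken from the training log
test_size = 0.25

# Assumption: the fractional test size is rounded up to a whole sample
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```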

## Create the model

In [61]:
model = Sequential([
    Dense(25, input_dim=X_train.shape[1], activation='relu'),
    Dense(10, activation='relu'),
    Dense(1)
])

In [62]:
model.compile(loss='mean_squared_error', optimizer='adam')
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_3 (Dense)              (None, 25)                200       
_________________________________________________________________
dense_4 (Dense)              (None, 10)                260       
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 11        
Total params: 471
Trainable params: 471
Non-trainable params: 0
_________________________________________________________________
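
The parameter counts in the summary can be reproduced by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` below is a hypothetical name introduced for illustration; the seven input features match the width of `X` shown earlier.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

n_features = 7              # columns in X
layer_units = [25, 10, 1]   # units in the three Dense layers

counts = []
n_inputs = n_features
for units in layer_units:
    counts.append(dense_params(n_inputs, units))
    n_inputs = units        # this layer's output feeds the next layer

print(counts, sum(counts))  # [200, 260, 11] 471
```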


In [63]:
early_stopping = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5, verbose=1, mode='auto', restore_best_weights=True)
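
To make the `min_delta` and `patience` settings concrete, here is a simplified pure-Python sketch of the stopping rule. It is not Keras's implementation: the function name and the loss values are made up, and it only reports which epoch would be restored rather than restoring any weights.

```python
def early_stopping_epoch(val_losses, min_delta=1e-3, patience=5):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    stop_epoch is None if the patience limit is never reached.
    """
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        # Count as an improvement only if the loss drops by more than min_delta
        if loss < best - min_delta:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Hypothetical validation losses: the best value is reached at epoch 3
losses = [10.0, 8.0, 7.0, 7.0005, 7.2, 7.1, 7.3, 7.4, 6.0]
stop_epoch, best_epoch = early_stopping_epoch(losses)
print(stop_epoch, best_epoch)  # 8 3
```

In this toy run training stops at epoch 8, before the later drop to 6.0 is ever seen, which illustrates the trade-off a small `patience` makes.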

## Fit the model

In [64]:
model.fit(X_train, y_train, validation_data=(X_test, y_test), callbacks=[early_stopping], verbose=2, epochs=1000)

Train on 298 samples, validate on 100 samples
Epoch 1/1000
298/298 - 1s - loss: 2947.3437 - val_loss: 1932.5704
Epoch 2/1000
298/298 - 0s - loss: 773.0710 - val_loss: 860.4151
Epoch 3/1000
298/298 - 0s - loss: 419.5301 - val_loss: 430.5986
Epoch 4/1000
298/298 - 0s - loss: 261.5668 - val_loss: 242.1064
Epoch 5/1000
298/298 - 0s - loss: 191.2243 - val_loss: 213.9000
Epoch 6/1000
298/298 - 0s - loss: 182.6175 - val_loss: 164.0566
Epoch 7/1000
298/298 - 0s - loss: 158.2214 - val_loss: 152.4150
Epoch 8/1000
298/298 - 0s - loss: 145.4029 - val_loss: 141.7545
Epoch 9/1000
298/298 - 0s - loss: 133.7412 - val_loss: 134.8809
Epoch 10/1000
298/298 - 0s - loss: 128.8909 - val_loss: 130.1175
Epoch 11/1000
298/298 - 0s - loss: 125.0012 - val_loss: 127.2711
Epoch 12/1000
298/298 - 0s - loss: 120.4884 - val_loss: 121.8039
Epoch 13/1000
298/298 - 0s - loss: 117.0906 - val_loss: 114.6618
Epoch 14/1000
298/298 - 0s - loss: 109.8408 - val_loss: 108.7218
Epoch 15/1000
298/298 - 0s - loss: 105.7079 - val_l

<tensorflow.python.keras.callbacks.History at 0x7ffc86449490>
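
The fit call above passes the test set as `validation_data`, so early stopping selects the epoch that looks best on the same data later used to report the error. One common alternative is to hold out a third subset for validation. The sketch below uses plain NumPy index shuffling rather than the notebook's `train_test_split`; the helper name `three_way_split` and the 60/20/20 proportions are assumptions.

```python
import numpy as np

def three_way_split(n_samples, val_frac=0.2, test_frac=0.2, seed=42):
    """Return disjoint index arrays for train, validation and test subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(round(test_frac * n_samples))
    n_val = int(round(val_frac * n_samples))
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(398)
print(len(train_idx), len(val_idx), len(test_idx))
```

With such a split, `X[val_idx]` and `y[val_idx]` would go to `validation_data`, and the test indices would be used only once, for the final error.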

## Verify the model

In [71]:
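y_test_pred = model.predict(X_test)
# Root mean squared error (RMSE) on the test set, in mpg.
# scikit-learn's convention is mean_squared_error(y_true, y_pred).
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_test_pred))
print(f'The error on the test set is {rmse}')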

The error on the test set is 5.602953722643528
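
The error printed above is a root mean squared error, so it is in the same units as the target (mpg). The sketch below computes the same quantity by hand on made-up values. It also shows a shape pitfall: a network ending in `Dense(1)` predicts an array of shape `(n, 1)` while `y_test` has shape `(n,)`, so a hand-written version should flatten the predictions first.

```python
import numpy as np

# Hypothetical targets and predictions (not results from the notebook)
y_true = np.array([18.0, 15.0, 16.0])        # shape (3,)
y_pred = np.array([[20.0], [14.0], [18.0]])  # shape (3, 1), like a Dense(1) output

# Subtracting directly broadcasts to a (3, 3) matrix, which is not what we want
naive_diff = y_pred - y_true

# Flatten the predictions so the difference is element-wise
diff = y_pred.ravel() - y_true
rmse = np.sqrt(np.mean(diff ** 2))

print(naive_diff.shape)
print(rmse)
```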


## Save the model to disk

In [72]:
model.save('mpg_model.h5')