# Apply Neural Networks to a Regression Problem

<font color='steelblue'>

<span style="font-family:verdana; font-size:1.6em;">
    <b>Predict Fuel Efficiency using Neural Networks</b><br><br>
</span>
<span style="font-family:verdana; font-size:1.4em;">
    <b>The following steps are included in the processing:</b><em>
    <ol>
        <li>Check the versions of TensorFlow and Keras</li>
        <li>Load the Auto MPG dataset</li>
        <li>Clean the data and encode the categorical feature</li>
        <li>Standardize the numeric features</li>
        <li>Create a neural network and build a model</li>
        <li>Train the model on the training dataset</li>
        <li>Plot the loss and error metrics for the model</li>
        <li>Evaluate the error of the model on the test dataset</li>
    </ol></em>
</span>

</font>

<span style="font-family:verdana; font-size:1.2em;">
    In a <i>regression</i> problem, we aim to predict a continuous value, such as a price or a probability. Contrast this with a <i>classification</i> problem, where we aim to select a class from a list of classes (for example, recognizing whether a picture contains an apple or an orange).

This notebook uses the classic [Auto MPG](https://archive.ics.uci.edu/ml/datasets/auto+mpg) dataset and builds a model to predict the fuel efficiency of late 1970s and early 1980s automobiles. To do this, we'll provide the model with a description of many automobiles from that time period. This description includes attributes such as cylinders, displacement, horsepower, and weight.

This example uses the `tf.keras` API; see [this guide](https://www.tensorflow.org/guide/keras) for details.
</span>

In [None]:
%config IPCompleter.greedy = True

In [None]:
import pathlib

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
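# Grid style for the plots; the seaborn styles were renamed in matplotlib 3.6
style = 'seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid'
plt.style.use(style)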

In [None]:
# make sure tensorflow is properly installed
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

tf.__version__, tf.keras.__version__

## Locate the dataset

In [None]:
dataset_path = keras.utils.get_file("auto-mpg.data", 
                                    "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data")
dataset_path

## Import the dataset from the local copy

In [None]:
column_names = ['MPG','Cylinders','Displacement','Horsepower','Weight',
                'Acceleration', 'Model Year', 'Origin']
df = pd.read_csv(dataset_path, names=column_names,
                      na_values = "?", comment='\t',
                      sep=" ", skipinitialspace=True)

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.shape

## Clean Data
### Find unknown values, then drop them

In [None]:
df.isna().sum()

In [None]:
df = df.dropna()
df.shape

## Handle Categorical features
### Origin is a categorical column; convert it to numeric
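As a minimal sketch of this step, the toy frame below (made-up values, not rows from the dataset) shows how mapping the numeric codes to names and then one-hot encoding them yields one indicator column per region. The `columns` and `dtype=float` arguments are assumptions added here so that the sketch behaves the same across pandas versions, where the default indicator dtype may be boolean.

```python
import pandas as pd

# Toy data: hypothetical rows, not taken from the Auto MPG file
toy = pd.DataFrame({"MPG": [18.0, 26.0, 31.0], "Origin": [1, 2, 3]})

# Replace the numeric codes with readable labels
toy["Origin"] = toy["Origin"].map({1: "USA", 2: "Europe", 3: "Japan"})

# One-hot encode Origin: each region becomes its own 0/1 column
encoded = pd.get_dummies(toy, columns=["Origin"], prefix="", prefix_sep="",
                         dtype=float)
print(encoded)
```

Each row has exactly one of the three indicator columns set to 1, so the model no longer sees an artificial ordering between the regions.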

In [None]:
df['Origin'] = df['Origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})

In [None]:
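# One-hot encode Origin; dtype=float keeps the indicator columns numeric
# (recent pandas versions default to bool, which yields an object array later)
df = pd.get_dummies(df, prefix='', prefix_sep='', dtype=float)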

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.describe().transpose()

## Explore Data

In [None]:
sns.pairplot(df[["MPG", "Cylinders", "Displacement", "Weight"]], diag_kind="kde")
plt.show()

In [None]:
# Move the target variable to the end of the DataFrame
mpgs = df.pop("MPG")
df['MPG'] = mpgs

In [None]:
df.head()

## Standardize features
### Define the features that need to be standardized. Then apply scaling to those features
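Standardization rescales each feature to zero mean and unit variance: `z = (x - mean) / std`. The sketch below reproduces that arithmetic with NumPy on made-up numbers, using the population standard deviation (`ddof=0`), which is what `StandardScaler` uses. It also illustrates a common variant that is assumed here and is not what the cells below do: computing the statistics on the training rows only and reusing them for the test rows, so that no test-set information enters the preprocessing.

```python
import numpy as np

# Hypothetical feature matrix: rows are cars, columns are two features
X_train_toy = np.array([[4.0, 2000.0], [6.0, 3000.0], [8.0, 4000.0]])
X_test_toy = np.array([[6.0, 2500.0]])

# Statistics come from the training rows only
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population std (ddof=0)

# z = (x - mean) / std, applied with the same statistics to both sets
X_train_std = (X_train_toy - mean) / std
X_test_std = (X_test_toy - mean) / std

print(X_train_std.mean(axis=0))  # approximately 0 for each column
print(X_train_std.std(axis=0))   # approximately 1 for each column
```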

In [None]:
tostd = ['Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Model Year']
tostd

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[tostd] = scaler.fit_transform(df[tostd])
df.head()

## Create Training and Test datasets
<span style="font-family:verdana; font-size:1.2em;">
    <ol>
        <li>Create the feature column list</li>
        <li>Create X and y</li>
        <li>Create Training and Test datasets</li>
    </ol>    
</span>
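The idea behind the hold-out split can be sketched on toy data as follows. This is a NumPy-only illustration of a shuffled 80/20 split, not the `train_test_split` implementation; the array contents and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Hypothetical data: 10 samples, 3 features, one continuous target
X_toy = rng.normal(size=(10, 3))
y_toy = rng.normal(size=10)

# Shuffle the row indices, then hold out the first 20% as the test set
indices = rng.permutation(len(X_toy))
n_test = int(round(0.20 * len(X_toy)))
test_idx, train_idx = indices[:n_test], indices[n_test:]

X_tr, X_te = X_toy[train_idx], X_toy[test_idx]
y_tr, y_te = y_toy[train_idx], y_toy[test_idx]

print(X_tr.shape, X_te.shape)  # (8, 3) (2, 3)
```

Because the same shuffled indices select rows from both arrays, every feature row stays paired with its target value.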

In [None]:
features = list(df.columns)
features.remove('MPG')
features

In [None]:
X = df[features].values

In [None]:
y = df['MPG'].values

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, 
                                                    random_state = 2345)

# Create Neural Network
<span style="font-family:verdana; font-size:1.2em;">
        <ol>
        <li>Build a sequential model with 2 densely connected hidden layers</li>
        <li>Add an output layer that returns a single continuous value</li>
        <li>Define the optimizer with a learning rate</li>
        <li>Compile the model with MSE as the loss function and MAE and MSE as metrics</li>
    </ol>    
<i><ul>
<li>Mean Squared Error (MSE) is a common loss function used for regression problems (different loss functions are used for classification problems) </li>
<li>Similarly, evaluation metrics used for regression differ from classification. A common regression metric is Mean Absolute Error (MAE)</li>
<li>If there is not much training data, one technique is to prefer a small network with few hidden layers to avoid overfitting</li>
    </ul></i>
</span>
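For reference, both quantities are simple averages over the examples: MSE is the mean of the squared errors and MAE is the mean of the absolute errors. A small NumPy sketch with made-up values shows how differently they weigh one large error:

```python
import numpy as np

# Hypothetical true and predicted MPG values
y_true = np.array([20.0, 30.0, 25.0, 15.0])
y_pred = np.array([22.0, 29.0, 25.0, 9.0])

errors = y_pred - y_true          # [2, -1, 0, -6]

mse = np.mean(errors ** 2)        # (4 + 1 + 0 + 36) / 4 = 10.25
mae = np.mean(np.abs(errors))     # (2 + 1 + 0 + 6) / 4 = 2.25

print(f"MSE: {mse:.2f} MPG^2, MAE: {mae:.2f} MPG")
```

The single 6 MPG miss contributes 36 of the 41 squared-error units, which is why MSE penalizes outliers more heavily, while MAE stays in the same units as the target.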

In [None]:
# Instantiate the model
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=[len(features)]),
    layers.Dense(128, activation='relu'),
    layers.Dense(1)
])

In [None]:
# Create optimizer
optimizer = tf.keras.optimizers.RMSprop(0.001)

In [None]:
model.compile(loss='mse', optimizer=optimizer,
              metrics=['mae', 'mse'])

In [None]:
model.summary()

# Train the neural network

In [None]:
# Train the model and hold out the last 20% of the training data for validation
# Capturing the returned history enables you to plot the change in
# loss and error metrics over time
EPOCHS = 1000
history = model.fit(X_train, y_train, validation_split=0.2, 
                    epochs = EPOCHS, verbose = 1)

## Model performance on training dataset

In [None]:
metrics_names = model.metrics_names
metrics_names

In [None]:
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch

In [None]:
hist.head()

In [None]:
import matplotlib.pyplot as plt

def plot_graphs(history, string, ylim=(1, 10)):
    """Plot a training metric and its validation counterpart per epoch."""
    plt.plot(history.history[string])
    plt.plot(history.history['val_'+string])
    plt.title('Training and validation')
    plt.ylim(ylim)    # call plt.ylim; assigning to it would overwrite the function
    plt.xlabel('Epochs')
    plt.ylabel(string)
    plt.legend([string, 'val_'+string])
    plt.show()

In [None]:
plot_graphs(history, 'mae')

In [None]:
plot_graphs(history, 'mse', ylim = [1, 20])

In [None]:
# Use the test data to evaluate the model, data that the model has never seen

loss, mae, mse = model.evaluate(X_test, y_test, verbose=2)

print("Testing set Mean Abs Error: {:5.2f} MPG".format(mae))
print("Testing set Mean Squared Error: {:5.2f} MPG^2".format(mse))

In [None]:
test_predictions = model.predict(X_test).flatten()
#plt.figure(figsize = (6,6))
a = plt.axes(aspect='equal')
plt.scatter(y_test, test_predictions)
plt.xlabel('True Values [MPG]')
plt.ylabel('Predictions [MPG]')
# for the line
plt.plot([0,50], [0, 50], 'r')
plt.title("True Values vs. Predictions")
plt.show()

In [None]:
error = test_predictions - y_test
plt.hist(error, bins = 25)
plt.xlabel("Prediction Error [MPG]")
plt.ylabel("Count")
plt.show()

#### The error distribution is not quite Gaussian; a larger dataset might give a smoother curve

<span style="font-family:Arial; font-size:1.4em;">
<font color='tomato'>
    <h2>Neural Networks vs. Linear Regression</h2>
    <ol>
        <li>Training and Test datasets are already created</li>
        <li>Use Linear Regression to train the model</li>
        <li>Make predictions using the model built</li>
        <li>Compare results between the two models</li>
    </ol>
</font>
</span>
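One way the comparison outlined above might look is sketched below. It assumes scikit-learn's `LinearRegression` as the baseline and uses synthetic stand-in data so that the block runs on its own; in the notebook, the existing `X_train`, `X_test`, `y_train`, and `y_test` arrays would take the place of the demo arrays. The names `linreg`, `lr_mae`, and `lr_mse` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: a linear signal plus noise (not the Auto MPG data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 25.0 + X_demo @ np.array([-3.0, 1.5, 0.5]) + rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20,
                                          random_state=2345)

# Fit ordinary least squares on the training split
linreg = LinearRegression()
linreg.fit(X_tr, y_tr)

# Evaluate with the same metrics used for the neural network
lr_predictions = linreg.predict(X_te)
lr_mae = mean_absolute_error(y_te, lr_predictions)
lr_mse = mean_squared_error(y_te, lr_predictions)

print(f"Linear regression MAE: {lr_mae:.2f}")
print(f"Linear regression MSE: {lr_mse:.2f}")
```

Comparing `lr_mae` and `lr_mse` with the `mae` and `mse` values returned by `model.evaluate` on the same test split gives a like-for-like comparison between the two models.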