## The Boston Housing Price dataset using Keras - Regression

#### We will attempt to predict the median price of homes in a given Boston suburb in the mid-1970s, given data points about the suburb at the time, such as the crime rate, the local property tax rate, and so on. 
#### It has relatively few data points: only 506, split between *404 training samples and 102 test samples*. And each feature in the input data (for example, the crime rate) has a different scale. For instance, some values are proportions, which take values between 0 and 1; others take values between 1 and 12, others between 0 and 100, and so on.

In [1]:
import tensorflow as tf  # TensorFlow provides the Keras API used below
import numpy as np
from tensorflow import keras  # same Keras implementation as the `layers` import
from tensorflow.keras import layers
import matplotlib.pyplot as plt

### Loading Data

In [1]:
from tensorflow.keras.datasets import boston_housing
(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()

In [1]:
train_data.shape, test_data.shape

In [1]:
train_targets

#### As you can see, you have 404 training samples and 102 test samples, each with 13 numerical features, such as per capita crime rate, average number of rooms per dwelling, accessibility to highways, and so on.
#### The targets are the median values of owner-occupied homes, in thousands of dollars:

#### The prices are typically between $10,000 and $50,000. If that sounds cheap, remember that this was the mid-1970s, and these prices aren’t adjusted for inflation.


In [1]:
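# Normalizing the data

# Compute the statistics on the training data only
mean = train_data.mean(axis=0)
train_data -= mean
std = train_data.std(axis=0)
train_data /= std
# Reuse the training mean and std for the test data
test_data -= mean
test_data /= std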

#### It would be problematic to feed into a neural network values that all take wildly different ranges. The model might be able to automatically adapt to such heterogeneous data, but it would definitely make learning more difficult. A widespread best practice to deal with such data is to do feature-wise normalization: for each feature in the input data (a column in the input data matrix), you subtract the mean of the feature and divide by the standard deviation, so that the feature is centered around 0 and has a unit standard deviation. This is easily done using NumPy.
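#### As a minimal sketch of feature-wise normalization, the snippet below applies the same recipe to a small made-up matrix (the toy values and variable names are illustrative assumptions, not part of the Boston data). Note that the test row is scaled with the mean and standard deviation computed on the training rows, mirroring the cell above.

```python
import numpy as np

# Toy data (made up for illustration): 4 samples, 2 features on very different scales.
toy_train = np.array([[0.1, 200.0],
                      [0.4, 300.0],
                      [0.2, 250.0],
                      [0.3, 450.0]])
toy_test = np.array([[0.25, 300.0]])

# Per-feature (column-wise) statistics, computed on the training rows only.
feature_mean = toy_train.mean(axis=0)
feature_std = toy_train.std(axis=0)

# Center and scale both sets with the training statistics.
toy_train_norm = (toy_train - feature_mean) / feature_std
toy_test_norm = (toy_test - feature_mean) / feature_std

print(toy_train_norm.mean(axis=0))  # each column is centered around 0
print(toy_train_norm.std(axis=0))   # each column has unit standard deviation
```

#### Unlike the in-place `-=` and `/=` used in the notebook, this sketch returns new arrays and leaves the toy inputs untouched.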


### Building our Model

In [1]:
# Model definition

def build_model():
    model = keras.Sequential([#1
      layers.Dense(64, activation='relu'),
      layers.Dense(64, activation='relu'),
      layers.Dense(1)
    ])
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    return model

#### 1.) Because you’ll need to instantiate the same model multiple times, you use a function to construct it.

#### The model ends with a single unit and no activation (it will be a linear layer). This is a typical setup for scalar regression (a regression where you’re trying to predict a single continuous value). Applying an activation function would constrain the range the output can take; for instance, if you applied a sigmoid activation function to the last layer, the model could only learn to predict values between 0 and 1. Here, because the last layer is purely linear, the model is free to learn to predict values in any range.
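#### To make the range constraint concrete, here is a small illustrative sketch (the `sigmoid` helper and the sample values are assumptions for demonstration, not part of the model code). Whatever raw value the last layer produces, a sigmoid would squash it into the interval between 0 and 1, so it could never match a target such as 22.5 (thousand dollars).

```python
import math

def sigmoid(x):
    """Logistic sigmoid: maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical raw outputs of a final layer with no activation.
raw_outputs = [-5.0, 0.0, 5.0, 22.5]

# The same values after a sigmoid activation.
squashed = [sigmoid(x) for x in raw_outputs]

for raw, s in zip(raw_outputs, squashed):
    print(f"linear output: {raw:6.2f}   sigmoid output: {s:.4f}")
```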

### Choosing a Loss Function

#### Note that we compile the model with the mse loss function: mean squared error, the average of the squared differences between the predictions and the targets. This is a widely used loss function for regression problems.
#### We are also monitoring a new metric during training: mean absolute error (MAE). It’s the average of the absolute differences between the predictions and the targets. For instance, because the targets are expressed in thousands of dollars, an MAE of 0.5 on this problem would mean your predictions are off by $500 on average.
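#### The two quantities can be worked out by hand on a few made-up values (the numbers below are illustrative assumptions, chosen so that the MAE comes out at exactly 0.5). This is only a sketch of the arithmetic, not how Keras computes them internally.

```python
import numpy as np

# Hypothetical predictions and targets, in thousands of dollars.
toy_predictions = np.array([21.0, 30.5, 15.0, 24.0])
toy_targets = np.array([20.0, 31.0, 15.5, 24.0])

errors = toy_predictions - toy_targets

mse = np.mean(errors ** 2)     # mean squared error (the loss)
mae = np.mean(np.abs(errors))  # mean absolute error (the metric)

print("MSE:", mse)
print("MAE:", mae)
print("Average error in dollars:", mae * 1000)
```

#### MSE penalizes large errors more heavily because each error is squared, whereas MAE stays in the same units as the targets, which makes it easier to interpret.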


In [1]:
# K-Fold Validation
k = 4
num_val_samples = len(train_data) // k
num_epochs = 100
all_scores = []
for i in range(k):
    print('processing fold #%d' % i)
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples] #1
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
    partial_train_data = np.concatenate( #2
        [train_data[:i * num_val_samples],
         train_data[(i + 1) * num_val_samples:]],
        axis=0)
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i + 1) * num_val_samples:]],
        axis=0)
    model = build_model() #3
    model.fit(partial_train_data, partial_train_targets, #4
              epochs=num_epochs, batch_size=1, verbose=0)
    val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)#5
    all_scores.append(val_mae)

1. Prepares the validation data: data from partition #i
2. Prepares the training data: data from all other partitions 
3. Builds the Keras model (already compiled)
4. Trains the model (in silent mode, verbose = 0) 
5. Evaluates the model on the validation data

### Validating our approach using K-fold validation

#### To evaluate your model while you keep adjusting its parameters (such as the number of epochs used for training), you could split the data into a training set and a validation set, as you did in the previous examples. But because you have so few data points, the validation set would end up being very small (for instance, about 100 examples). As a consequence, the validation scores might change a lot depending on which data points you chose to use for validation and which you chose for training: the validation scores might have a high variance with regard to the validation split. This would prevent you from reliably evaluating your model. 
#### The best practice in such situations is to use K-fold cross-validation (see figure 3.11). It consists of splitting the available data into K partitions (typically K = 4 or 5), instantiating K identical models, and training each one on K – 1 partitions while evaluating on the remaining partition. The validation score for the model used is then the average of the K validation scores obtained. In terms of code, this is straightforward.
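#### The partitioning step can be sketched on its own, independent of any model. The helper name `k_fold_indices` below is hypothetical (it is not a Keras or NumPy function); it only shows how contiguous folds can be carved out of 404 samples, assuming the same integer-division fold size used in the notebook.

```python
import numpy as np

def k_fold_indices(num_samples, k):
    """Yield (train_indices, val_indices) for each of k contiguous folds."""
    fold_size = num_samples // k
    indices = np.arange(num_samples)
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        val_idx = indices[start:stop]                                   # held-out partition
        train_idx = np.concatenate([indices[:start], indices[stop:]])  # all other partitions
        yield train_idx, val_idx

folds = list(k_fold_indices(num_samples=404, k=4))
for fold_number, (train_idx, val_idx) in enumerate(folds):
    # each fold: 303 training samples and 101 validation samples
    print(fold_number, len(train_idx), len(val_idx))
```

#### Because 404 is divisible by 4, every sample is used for validation exactly once. With a sample count that is not a multiple of K, the leftover samples at the end would only ever appear in the training portion.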

In [1]:
all_scores

In [1]:
np.mean(all_scores)

#### The different runs do indeed show rather different validation scores, from 2.1 to 2.6. The average (2.2) is a much more reliable metric than any single score — that’s the entire point of K-fold cross-validation. In this case, you’re off by $2,200 on average, which is significant considering that the prices range from $10,000 to $50,000.

#### Let’s try training the model a bit longer: 500 epochs. To keep a record of how well the model does at each epoch, you’ll modify the training loop to save the per-epoch validation score log.

In [1]:
num_epochs = 500
all_mae_histories = []
for i in range(k):
    print('processing fold #%d' % i)
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
         train_data[(i + 1) * num_val_samples:]],
        axis=0)
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i + 1) * num_val_samples:]],
        axis=0)
    model = build_model()
    history = model.fit(partial_train_data, partial_train_targets,
                        validation_data=(val_data, val_targets),
                        epochs=num_epochs, batch_size=1, verbose=0)
    mae_history = history.history['val_mae']
    all_mae_histories.append(mae_history)


1. Prepares the validation data: data from partition #i
2. Prepares the training data: data from all other partitions
3. Builds the Keras model (already compiled)
4. Trains the model (in silent mode, verbose=0) and records the validation MAE after each epoch

#### Compute the average of the per-epoch MAE scores across all folds.

In [1]:
average_mae_history = [
    np.mean([x[i] for x in all_mae_histories]) for i in range(num_epochs)]
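
#### Assuming every fold recorded the same number of epochs, the same average can also be taken along the fold axis with a single NumPy call. The sketch below uses made-up histories (two folds, three epochs) purely to show that both forms agree.

```python
import numpy as np

# Hypothetical per-epoch validation MAE for two folds (three epochs each).
toy_histories = [
    [3.0, 2.5, 2.0],
    [4.0, 3.5, 3.0],
]
toy_num_epochs = 3

# List-comprehension form, as in the notebook.
loop_average = [
    np.mean([history[i] for history in toy_histories])
    for i in range(toy_num_epochs)
]

# Vectorized form: axis 0 runs over folds, so the result has one value per epoch.
vectorized_average = np.mean(toy_histories, axis=0)

print(loop_average)
print(vectorized_average)
```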

#### Plotting Validation Scores

In [1]:
plt.plot(range(1, len(average_mae_history) + 1), average_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()

#### It may be a little difficult to read the plot, due to scaling issues and relatively high variance. Let’s do the following:

1. Omit the first 10 data points, which are on a different scale than the rest of the curve.
2. Replace each point with an exponential moving average of the previous points, to obtain a smooth curve.


In [1]:
def smooth_curve(points, factor=0.9):
    # Exponential moving average: each output blends the previous
    # smoothed value (weight `factor`) with the new point (weight 1 - factor).
    smoothed_points = []
    for point in points:
        if smoothed_points:
            previous = smoothed_points[-1]
            smoothed_points.append(previous * factor + point * (1 - factor))
        else:
            smoothed_points.append(point)
    return smoothed_points

# Skip the first 10 epochs, then smooth the rest of the curve
smooth_mae_history = smooth_curve(average_mae_history[10:])
plt.plot(range(1, len(smooth_mae_history) + 1), smooth_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()

#### According to this plot, validation MAE stops improving significantly after 80 epochs. Past that point, you start overfitting.
#### Once we are finished tuning other parameters of the model (in addition to the number of epochs, we could also adjust the size of the intermediate layers), we can train a final production model on all of the training data, with the best parameters, and then look at its performance on the test data.
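#### Rather than reading the epoch off the plot by eye, one could also locate the minimum of the averaged validation curve programmatically. The curve below is made up for illustration, and the variable names are assumptions; on real, noisy data it would be reasonable to apply the same idea to the smoothed curve instead.

```python
import numpy as np

# Hypothetical averaged validation MAE, one value per epoch (epoch 1 first).
toy_val_mae = [4.0, 3.1, 2.7, 2.5, 2.6, 2.8]

# argmin returns a 0-based index, so add 1 to get the epoch number.
best_epoch = int(np.argmin(toy_val_mae)) + 1
best_mae = toy_val_mae[best_epoch - 1]

print("Lowest validation MAE:", best_mae, "at epoch", best_epoch)
```

#### If some leading points were dropped before searching (as with the first 10 above), that offset would need to be added back to recover the true epoch number.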


In [1]:
model = build_model()
model.fit(train_data, train_targets,
          epochs=80, batch_size=16, verbose=0)
test_mse_score, test_mae_score = model.evaluate(test_data, test_targets)

1. Gets a fresh, compiled model 
2. Trains it on the entirety of the data

### Final results 

In [1]:
test_mae_score

#### You’re still off by about $2,750.

#### Generating predictions on new data

#### With this scalar regression model, predict() returns the model’s guess for the sample’s price in thousands of dollars:

In [1]:
predictions = model.predict(test_data)

In [1]:
predictions[0]

#### The first house in the test set is predicted to have a price of about $9,500.
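#### Since predict() returns one row per sample with a single column, turning a prediction into a dollar figure takes an index and a multiplication. The array below is a made-up stand-in for the model output (the values are assumptions, not actual results).

```python
import numpy as np

# Hypothetical stand-in for model.predict(test_data): shape (num_samples, 1),
# with prices expressed in thousands of dollars.
toy_predictions = np.array([[9.5], [18.7], [21.3]])

# Take the single value of the first row and convert it to dollars.
first_price_dollars = float(toy_predictions[0, 0]) * 1000

print(f"${first_price_dollars:,.0f}")
```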

### Takeaways from this example:
1. Regression is done using different loss functions than what we used for classification. Mean squared error (MSE) is a loss function commonly used for regression.

2. Similarly, evaluation metrics to be used for regression differ from those used for classification; naturally, the concept of accuracy doesn’t apply for regression. A common regression metric is mean absolute error (MAE).

3. When features in the input data have values in different ranges, each feature should be scaled independently as a preprocessing step.

4. When there is little data available, using K-fold validation is a great way to reliably evaluate a model.

5. When little training data is available, it’s preferable to use a small model with few intermediate layers (typically only one or two), in order to avoid severe overfitting.
