<a href="https://colab.research.google.com/github/udlbook/iclimbtrees/blob/main/notebooks/MachineLearning/NonparametericModel_Answers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# **Non-parametric models answers**

This notebook investigates a non-parametric model for estimating the price of a car from its age and mileage.

You can save a local copy of this notebook in your Google account and work through it in Colab (recommended) or you can download the notebook and run it locally using Jupyter notebook or similar.

Contact me at iclimbtreesmail@gmail.com if you find any mistakes or have any suggestions.

In [1]:
# Import math library
import numpy as np
# Suppress annoying warnings
import warnings
warnings.filterwarnings('ignore', message='Mean of empty slice.', category=RuntimeWarning)
warnings.filterwarnings('ignore', message='invalid value encountered in scalar divide', category=RuntimeWarning)

In [2]:
# This is the data from the figures.  We'll use this to train the model
ages = np.array([3.9, 1.7, 4.5, 0.1, 8.5,    0.6, 2.9, 5.2, 5.7, 7.2,   2.2, 1.2, 3.7, 9.7, 3.1,  4.4, 1.9, 8.0, 6.7, 4.9 ])
mileages = np.array([37.2, 19.8, 35.0, 3.1, 152.2,  21.7, 40.2, 80.9, 101.5, 105.6,  30.4, 10.5, 71.2, 172.1, 61.0,  105.5, 20.1, 79.8, 79.1, 74.7])
prices = np.array([23.6, 27.0, 22.4, 33.4, 15.8,   31.2, 24.0, 21.4, 21.0, 17.0,  27.4, 29.1, 22.8, 10.5, 23.1,  21.9, 28.4, 18.2, 18.7, 22.5])

# Here is some separate "test" data, that we can use to see how well our model is working
ages_test = np.array([1.912, 6.14, 6.31, 3.13, 2.60, 4.085, 7.22, 7.15, 2.59,  1.25,  8.41,  0.10, 5.05, 2.76,  6.38,  7.75,  0.10, 8.46, 4.81, 4.81])
mileages_test = np.array([24.1, 48.3, 115.3, 21.5, 41.8,  50.1, 101.7, 124.3, 67.2, 16.4,  72.6, 13.2, 76.3, 70.2, 58.9,  106.0, 22.0, 108.0, 50.5, 49.7])
prices_test = np.array([27.3, 20.7, 17.8, 27.6, 26.9, 24.3, 17.9, 17.5, 27.0, 30.5, 16.7, 27.8, 20.8, 25.3, 20.7, 18.1, 30.4, 16.9, 21.0, 21.5])

First, let's write routines to assign each data example to the appropriate bin.

For "age", we have 8 bins, corresponding to 0-1, 1-2, ..., 6-7, 7+ years.

For "mileage", we have 8 bins, corresponding to 0-20k, 20-40k, ..., 120-140k, 140k+ miles.

In [3]:
# Takes a vector of ages, and returns a vector of bins with values 0,1,2,3,4,5,6,7
def bin_age(ages):
    # Truncate to whole years, then put everything aged 7+ into the last bin
    bins = ages.astype(int)
    bins = np.clip(bins, None, 7)
    return bins

# Takes a vector of mileages, and returns a vector of bins with values 0,1,2,3,4,5,6,7
def bin_mileage(mileages):
    # Each bin spans 20 (thousand miles); everything at 140+ goes into the last bin
    bins = (mileages/20).astype(int)
    bins = np.clip(bins, None, 7)
    return bins
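
As a cross-check on the binning rule described above, here is a minimal sketch that produces the same bin indices with `np.digitize`. The names `age_edges`, `mileage_edges`, `bin_age_digitize` and `bin_mileage_digitize` are hypothetical and are not used elsewhere in this notebook; the edges are an assumption chosen to match the bins listed in the text.

```python
import numpy as np

# Assumed bin edges: 1,2,...,7 years and 20,40,...,140 thousand miles
age_edges = np.arange(1, 8)
mileage_edges = np.arange(20, 160, 20)

def bin_age_digitize(ages):
    # np.digitize returns how many edges are <= each value,
    # which gives bins 0..7 with everything aged 7+ in bin 7
    return np.digitize(ages, age_edges)

def bin_mileage_digitize(mileages):
    return np.digitize(mileages, mileage_edges)

# A few values taken from the training data
example_ages = np.array([0.1, 3.9, 7.2, 9.7])
example_mileages = np.array([3.1, 37.2, 105.6, 172.1])

print(bin_age_digitize(example_ages))          # [0 3 7 7]
print(bin_mileage_digitize(example_mileages))  # [0 1 5 7]
```

For non-negative inputs this should agree with truncating and clipping; it simply makes the bin edges explicit.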

# 1D Model

Now let's build the 1D model that uses only age to predict the price.  First, we "train" the model.  This means binning the ages and computing the mean price for each bin.  We'll store the result in a variable $\boldsymbol\phi_{age}$.

In [4]:
def train_model(ages, prices):
  bin_ages = bin_age(ages)
  bin_means = np.zeros(8)
  for i in range(8):
     bin_means[i] = np.mean(prices[bin_ages==i])
  return bin_means

phi_age = train_model(ages, prices)
print(np.array2string(phi_age, precision=1, floatmode='fixed'))

[32.3 28.2 25.7 23.2 22.3 21.2 18.7 15.4]
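
To make the first entry of this output concrete, the sketch below recomputes the mean for bin 0 (cars less than one year old) by hand. The variable names `in_bin_0` and `bin_0_mean` are introduced here purely for illustration.

```python
import numpy as np

# Training data copied from the notebook
ages = np.array([3.9, 1.7, 4.5, 0.1, 8.5, 0.6, 2.9, 5.2, 5.7, 7.2,
                 2.2, 1.2, 3.7, 9.7, 3.1, 4.4, 1.9, 8.0, 6.7, 4.9])
prices = np.array([23.6, 27.0, 22.4, 33.4, 15.8, 31.2, 24.0, 21.4, 21.0, 17.0,
                   27.4, 29.1, 22.8, 10.5, 23.1, 21.9, 28.4, 18.2, 18.7, 22.5])

# Boolean mask selecting the cars whose age falls in bin 0 (0-1 years)
in_bin_0 = ages.astype(int) == 0

# Two cars land in this bin; their mean price is the model parameter for bin 0
bin_0_mean = prices[in_bin_0].mean()
print(prices[in_bin_0])      # [33.4 31.2]
print(f"{bin_0_mean:.1f}")   # 32.3
```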


Now let's write a routine for inference.  This takes a vector of ages and returns the estimated prices.

In [6]:
def perform_inference(ages, phi_age):
  bin_ages = bin_age(ages)
  prices = np.zeros(len(ages))
  for i in range(len(ages)):
    prices[i] = phi_age[bin_ages[i]]
  return prices

# Show the predicted prices, the true prices, and the absolute error
predicted_prices = perform_inference(ages, phi_age )
print(np.array2string(np.column_stack((predicted_prices, prices, np.abs(predicted_prices-prices))), precision=1, floatmode='fixed'))

[[23.2 23.6  0.4]
 [28.2 27.0  1.2]
 [22.3 22.4  0.1]
 [32.3 33.4  1.1]
 [15.4 15.8  0.4]
 [32.3 31.2  1.1]
 [25.7 24.0  1.7]
 [21.2 21.4  0.2]
 [21.2 21.0  0.2]
 [15.4 17.0  1.6]
 [25.7 27.4  1.7]
 [28.2 29.1  0.9]
 [23.2 22.8  0.4]
 [15.4 10.5  4.9]
 [23.2 23.1  0.1]
 [22.3 21.9  0.4]
 [28.2 28.4  0.2]
 [15.4 18.2  2.8]
 [18.7 18.7  0.0]
 [22.3 22.5  0.2]]


Let's also see the predictions for the test data

In [7]:
# Show the predicted prices, the true prices, and the absolute error
predicted_prices_test = perform_inference(ages_test, phi_age )
print(np.array2string(np.column_stack((predicted_prices_test, prices_test, np.abs(predicted_prices_test-prices_test))), precision=1, floatmode='fixed'))

[[28.2 27.3  0.9]
 [18.7 20.7  2.0]
 [18.7 17.8  0.9]
 [23.2 27.6  4.4]
 [25.7 26.9  1.2]
 [22.3 24.3  2.0]
 [15.4 17.9  2.5]
 [15.4 17.5  2.1]
 [25.7 27.0  1.3]
 [28.2 30.5  2.3]
 [15.4 16.7  1.3]
 [32.3 27.8  4.5]
 [21.2 20.8  0.4]
 [25.7 25.3  0.4]
 [18.7 20.7  2.0]
 [15.4 18.1  2.7]
 [32.3 30.4  1.9]
 [15.4 16.9  1.5]
 [22.3 21.0  1.3]
 [22.3 21.5  0.8]]


You can see that the errors are generally a bit bigger; this is expected, since we didn't use this data to train the model.
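
One way to summarize this comparison in a single number is the mean absolute error on each data set. The sketch below is an illustration rather than part of the original notebook: the helper `to_bin` and the names `mae_train` and `mae_test` are hypothetical, and using the mean absolute error as the summary is an assumption.

```python
import numpy as np

# Training and test data copied from the notebook
ages = np.array([3.9, 1.7, 4.5, 0.1, 8.5, 0.6, 2.9, 5.2, 5.7, 7.2,
                 2.2, 1.2, 3.7, 9.7, 3.1, 4.4, 1.9, 8.0, 6.7, 4.9])
prices = np.array([23.6, 27.0, 22.4, 33.4, 15.8, 31.2, 24.0, 21.4, 21.0, 17.0,
                   27.4, 29.1, 22.8, 10.5, 23.1, 21.9, 28.4, 18.2, 18.7, 22.5])
ages_test = np.array([1.912, 6.14, 6.31, 3.13, 2.60, 4.085, 7.22, 7.15, 2.59, 1.25,
                      8.41, 0.10, 5.05, 2.76, 6.38, 7.75, 0.10, 8.46, 4.81, 4.81])
prices_test = np.array([27.3, 20.7, 17.8, 27.6, 26.9, 24.3, 17.9, 17.5, 27.0, 30.5,
                        16.7, 27.8, 20.8, 25.3, 20.7, 18.1, 30.4, 16.9, 21.0, 21.5])

def to_bin(a):
    # Same rule as bin_age: truncate to whole years, cap at bin 7
    return np.clip(a.astype(int), None, 7)

# Train: mean price per age bin (every age bin contains training data here)
phi = np.array([prices[to_bin(ages) == i].mean() for i in range(8)])

# Inference by indexing the bin means, then average the absolute errors
mae_train = np.mean(np.abs(phi[to_bin(ages)] - prices))
mae_test = np.mean(np.abs(phi[to_bin(ages_test)] - prices_test))
print(f"train MAE: {mae_train:.2f}, test MAE: {mae_test:.2f}")
```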

# 2D Model

Now let's build the 2D model that exploits both age and mileage.  We'll store the means of the bins in the variable $\boldsymbol\phi_{2D}$.

In [10]:
def train_model_2D(ages, mileages, prices):
  bin_ages = bin_age(ages)
  bin_mileages = bin_mileage(mileages)
  # Rows index the mileage bin, columns index the age bin
  bin_means = np.zeros((8,8))
  for i in range(8):
    for j in range(8):
      # Empty bins have no prices to average, so their mean is nan
      bin_means[i,j] = np.mean(prices[(bin_mileages==i) & (bin_ages==j)])
  return bin_means

phi_2D = train_model_2D(ages, mileages, prices)
print(np.array2string(phi_2D, precision=1, floatmode='fixed'))

[[33.4 28.1  nan  nan  nan  nan  nan  nan]
 [31.2 28.4 27.4 23.6 22.4  nan  nan  nan]
 [ nan  nan 24.0  nan  nan  nan  nan  nan]
 [ nan  nan  nan 23.0 22.5  nan 18.7 18.2]
 [ nan  nan  nan  nan  nan 21.4  nan  nan]
 [ nan  nan  nan  nan 21.9 21.0  nan 17.0]
 [ nan  nan  nan  nan  nan  nan  nan  nan]
 [ nan  nan  nan  nan  nan  nan  nan 13.2]]


Now let's write a routine to perform inference with the 2D model.  It will take the age and mileage and return the price.

In [11]:
def perform_inference_2D(ages, mileages, phi_2D):
  bin_ages = bin_age(ages)
  bin_mileages = bin_mileage(mileages)
  prices = np.zeros(len(ages))
  for i in range(len(ages)):
    # phi_2D is indexed as [mileage bin, age bin]
    prices[i] = phi_2D[bin_mileages[i], bin_ages[i]]
  return prices


# Show the predicted prices, the true prices, and the absolute error
predicted_prices_2D = perform_inference_2D(ages, mileages, phi_2D )
print(np.array2string(np.column_stack((predicted_prices_2D, prices, np.abs(predicted_prices_2D-prices))), precision=1, floatmode='fixed'))

[[23.6 23.6  0.0]
 [28.1 27.0  1.1]
 [22.4 22.4  0.0]
 [33.4 33.4  0.0]
 [13.2 15.8  2.7]
 [31.2 31.2  0.0]
 [24.0 24.0  0.0]
 [21.4 21.4  0.0]
 [21.0 21.0  0.0]
 [17.0 17.0  0.0]
 [27.4 27.4  0.0]
 [28.1 29.1  1.1]
 [23.0 22.8  0.2]
 [13.2 10.5  2.7]
 [23.0 23.1  0.1]
 [21.9 21.9  0.0]
 [28.4 28.4  0.0]
 [18.2 18.2  0.0]
 [18.7 18.7  0.0]
 [22.5 22.5  0.0]]


Often, the error is zero because there was only one training example in the bin, so the mean is exactly the value of that data point.
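
To check this explanation, the sketch below counts how many 2D bins contain exactly one training example. It is illustrative only; the names `pairs`, `unique_pairs`, `counts` and `n_singletons` are hypothetical.

```python
import numpy as np

# Training inputs copied from the notebook
ages = np.array([3.9, 1.7, 4.5, 0.1, 8.5, 0.6, 2.9, 5.2, 5.7, 7.2,
                 2.2, 1.2, 3.7, 9.7, 3.1, 4.4, 1.9, 8.0, 6.7, 4.9])
mileages = np.array([37.2, 19.8, 35.0, 3.1, 152.2, 21.7, 40.2, 80.9, 101.5, 105.6,
                     30.4, 10.5, 71.2, 172.1, 61.0, 105.5, 20.1, 79.8, 79.1, 74.7])

# Same binning rules as bin_age and bin_mileage
age_bins = np.clip(ages.astype(int), None, 7)
mileage_bins = np.clip((mileages / 20).astype(int), None, 7)

# One (mileage bin, age bin) pair per training example
pairs = np.column_stack((mileage_bins, age_bins))

# Count how many training examples fall into each occupied bin
unique_pairs, counts = np.unique(pairs, axis=0, return_counts=True)
n_singletons = int(np.sum(counts == 1))

print(len(unique_pairs))  # 17 occupied bins
print(n_singletons)       # 14 bins hold a single training example
```

The 14 single-example bins correspond to the 14 rows with an error of 0.0 in the table above.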

Now, let's look at the results for the test data:

In [12]:
# Show the predicted prices, the true prices, and the absolute error
predicted_prices_test_2D = perform_inference_2D(ages_test, mileages_test, phi_2D )
print(np.array2string(np.column_stack((predicted_prices_test_2D, prices_test, np.abs(predicted_prices_test_2D-prices_test))), precision=1, floatmode='fixed'))

[[28.4 27.3  1.1]
 [ nan 20.7  nan]
 [ nan 17.8  nan]
 [23.6 27.6  4.0]
 [24.0 26.9  2.9]
 [ nan 24.3  nan]
 [17.0 17.9  0.9]
 [ nan 17.5  nan]
 [ nan 27.0  nan]
 [28.1 30.5  2.4]
 [18.2 16.7  1.5]
 [33.4 27.8  5.6]
 [ nan 20.8  nan]
 [ nan 25.3  nan]
 [ nan 20.7  nan]
 [17.0 18.1  1.1]
 [31.2 30.4  0.8]
 [17.0 16.9  0.1]
 [ nan 21.0  nan]
 [ nan 21.5  nan]]


Often, the result is "nan" because the test data point falls into a bin that did not have a training point. We only manage to generate a prediction for 10/20 test points.
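
The sketch below makes this count explicit by marking which of the 64 bins contain training data and then checking which test points land in an occupied bin. It is an illustration under the notebook's binning rules; the helper `to_bins` and the names `occupied` and `covered` are hypothetical.

```python
import numpy as np

# Training and test inputs copied from the notebook
ages = np.array([3.9, 1.7, 4.5, 0.1, 8.5, 0.6, 2.9, 5.2, 5.7, 7.2,
                 2.2, 1.2, 3.7, 9.7, 3.1, 4.4, 1.9, 8.0, 6.7, 4.9])
mileages = np.array([37.2, 19.8, 35.0, 3.1, 152.2, 21.7, 40.2, 80.9, 101.5, 105.6,
                     30.4, 10.5, 71.2, 172.1, 61.0, 105.5, 20.1, 79.8, 79.1, 74.7])
ages_test = np.array([1.912, 6.14, 6.31, 3.13, 2.60, 4.085, 7.22, 7.15, 2.59, 1.25,
                      8.41, 0.10, 5.05, 2.76, 6.38, 7.75, 0.10, 8.46, 4.81, 4.81])
mileages_test = np.array([24.1, 48.3, 115.3, 21.5, 41.8, 50.1, 101.7, 124.3, 67.2, 16.4,
                          72.6, 13.2, 76.3, 70.2, 58.9, 106.0, 22.0, 108.0, 50.5, 49.7])

def to_bins(a, m):
    # Returns (mileage bin, age bin), matching the index order of phi_2D
    age_bins = np.clip(a.astype(int), None, 7)
    mileage_bins = np.clip((m / 20).astype(int), None, 7)
    return mileage_bins, age_bins

# Mark every bin that contains at least one training example
occupied = np.zeros((8, 8), dtype=bool)
occupied[to_bins(ages, mileages)] = True

# A test point gets a prediction only if its bin is occupied
covered = occupied[to_bins(ages_test, mileages_test)]

print(occupied.sum(), "of 64 bins contain training data")    # 17
print(covered.sum(), "of 20 test points get a prediction")   # 10
```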

Let's try to rectify this by training with more data.  Since we have increased the number of bins by a factor of 8 (from 8 to 64), it's reasonable to try with 64 data points, which gives one training example per bin on average.  Here's a routine that generates fake data that is similar to the original data.

In [13]:
# Generates fake data (you don't need to understand this)
def generate(n_samples):
    data = np.random.multivariate_normal(np.array([4.305,65.08]),  np.array([[11.02,170.37], [170.37, 3215.95]]), n_samples)
    ages = np.clip(data[:,0],0.1,None)
    mileage = np.clip(data[:,1], 0.5, None)
    prices = 31.42 -1.60768419 *ages -0.02352476 * mileage +np.random.normal(0,1.14,len(ages))
    return ages, mileage, prices

# Sets the random seed so we get the same results each time
np.random.seed(1)
# Generate new train data
ages_64, mileages_64, prices_64 = generate(64)

np.random.seed(2)
# Generate new test data
ages_test_64, mileages_test_64, prices_test_64 = generate(20)

Now we fit the model to the new data:

In [14]:
phi_2D_64 = train_model_2D(ages_64, mileages_64, prices_64)
print(np.array2string(phi_2D_64, precision=1, floatmode='fixed'))

[[30.7  nan 26.8  nan 22.9  nan  nan  nan]
 [ nan  nan 25.2 25.0  nan  nan  nan  nan]
 [26.1  nan 26.8 24.5  nan 21.5 20.1  nan]
 [ nan  nan  nan 23.2 22.9 21.6 20.1  nan]
 [ nan  nan  nan 23.8 23.8 21.8 19.3 18.3]
 [ nan  nan  nan  nan 22.1 20.6 18.8 17.4]
 [ nan  nan  nan  nan  nan  nan 19.7 15.8]
 [ nan  nan  nan  nan  nan  nan  nan 15.0]]


Let's look at the test results for this model:

In [15]:
predicted_prices_test_2D_64 = perform_inference_2D(ages_test_64, mileages_test_64, phi_2D_64 )
print(np.array2string(np.column_stack((predicted_prices_test_2D_64, prices_test_64, np.abs(predicted_prices_test_2D_64-prices_test_64))), precision=1, floatmode='fixed'))

[[21.8 19.9  1.9]
 [15.0 14.4  0.6]
 [15.0  7.9  7.1]
 [ nan 25.2  nan]
 [15.8 16.1  0.2]
 [ nan 30.1  nan]
 [21.6 21.4  0.2]
 [25.0 25.0  0.0]
 [ nan 24.6  nan]
 [18.8 18.5  0.3]
 [17.4 18.3  0.9]
 [ nan 21.9  nan]
 [21.8 19.9  1.9]
 [17.4 16.3  1.2]
 [15.0 14.4  0.6]
 [ nan 27.8  nan]
 [15.0  7.5  7.5]
 [ nan 28.9  nan]
 [25.0 24.4  0.6]
 [23.2 22.1  1.2]]


The situation is a bit better; we now manage to make predictions for 14 of the 20 test points, but there are still test points in bins that contained no training data.

This is the curse of dimensionality in action!  As the dimensionality increases, the size of the input space quickly dwarfs the number of training examples.  If you like, you can experiment to find how many training data points you need to make a prediction for all of the new test data points.  It's a surprisingly large number, even for 2D data.
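
As a starting point for that experiment, here is a minimal sketch that measures the fraction of test points landing in an occupied bin as the training set grows. It reuses the distribution parameters from the `generate` routine but draws samples with `np.random.default_rng`, so its numbers will not match the seeded results above. The helpers `generate_inputs`, `to_bins` and `coverage` are hypothetical, and prices are omitted because coverage depends only on the inputs.

```python
import numpy as np

def generate_inputs(rng, n_samples):
    # Same mean and covariance as the notebook's generate routine
    mean = np.array([4.305, 65.08])
    cov = np.array([[11.02, 170.37], [170.37, 3215.95]])
    data = rng.multivariate_normal(mean, cov, n_samples)
    ages = np.clip(data[:, 0], 0.1, None)
    mileages = np.clip(data[:, 1], 0.5, None)
    return ages, mileages

def to_bins(ages, mileages):
    # Returns (mileage bin, age bin), matching the index order of phi_2D
    age_bins = np.clip(ages.astype(int), None, 7)
    mileage_bins = np.clip((mileages / 20).astype(int), None, 7)
    return mileage_bins, age_bins

def coverage(n_train, n_test=20, seed=0):
    # Fraction of test points whose bin contains at least one training point
    rng = np.random.default_rng(seed)
    occupied = np.zeros((8, 8), dtype=bool)
    occupied[to_bins(*generate_inputs(rng, n_train))] = True
    return occupied[to_bins(*generate_inputs(rng, n_test))].mean()

# Sweep over training set sizes (the sizes here are arbitrary choices)
results = {n_train: coverage(n_train) for n_train in [20, 64, 256, 1024, 4096]}
for n_train, fraction in results.items():
    print(f"{n_train:5d} training points -> coverage {fraction:.2f}")
```

Averaging the coverage over several seeds would give a steadier estimate, since any single draw of 20 test points is noisy.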