# Predicting MPG with OLS

**Author:** Richard Hemphill<br>
**ID:** 903877709<br>
**Class:** ECE5268 Theory of Neural Networks<br>
**Instructor:** Dr. Georgios C. Anagnostopoulos<br>
**Description:** Use characteristics of various cars to predict fuel economy in miles per gallon (_mpg_). The prediction equation is determined using Ordinary Least Squares (OLS) regression.

In [58]:
# CONSTANTS
DATASET_FILE = 'autompg_dataset.csv'
NUMBER_FOR_TRAINING = 200
NUMBER_FOR_VALIDATION = 100

In [59]:
# LIBRARIES
import numpy as np                  # matrix manipulation
import random                       # shuffle data
import matplotlib.pyplot as plt     # surface plot

In [60]:
# FUNCTIONS
# Create Augmented Design Matrix
def AugmentedDesignMatrix(dataSet, features):
    # Create the design matrix.
    adm = dataSet[features[0]]
    for feature in features[1:]:
        adm = np.column_stack((adm,dataSet[feature]))
    # Augment the design matrix with a column of ones to accommodate the bias term.
    adm = np.column_stack((adm,np.ones(len(adm))))
    return adm
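
To make the shape of the augmented design matrix concrete, here is a minimal sketch on a three-row structured array. The function `augmented_design_matrix` is a hypothetical compact variant of the one above, and the toy values are made up for illustration; they are not taken from the data set.

```python
import numpy as np

def augmented_design_matrix(data_set, features):
    # Collect the selected feature columns, then append a column of ones for the bias.
    columns = [data_set[name] for name in features]
    columns.append(np.ones(len(data_set)))
    return np.column_stack(columns)

# Hypothetical toy rows (made-up values), stored as a structured array
# like the one np.genfromtxt(..., names=True) returns.
toy = np.array(
    [(100.0, 3000.0, 20.0), (150.0, 4000.0, 15.0), (80.0, 2000.0, 30.0)],
    dtype=[('horsepower', 'f8'), ('weight', 'f8'), ('mpg', 'f8')],
)

X_toy = augmented_design_matrix(toy, ['horsepower', 'weight'])
print(X_toy.shape)   # one row per sample, one column per feature plus the bias column
print(X_toy)
```

Each row holds the feature values for one car followed by a constant 1, so the last weight in the parameter vector acts as the bias.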

In [61]:
# Calculate Mean Squared Error
def MSE(actual, predicted):
    return np.square(np.subtract(actual, predicted)).mean()
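
As a quick sanity check of the mean squared error formula, the sketch below evaluates it on three made-up values that are easy to verify by hand. The arrays are hypothetical and unrelated to the data set.

```python
import numpy as np

def mse(actual, predicted):
    # Mean of the squared differences between actual and predicted values.
    return np.square(np.subtract(actual, predicted)).mean()

# Hypothetical values: the errors are -2, 3 and 0, so the squared errors are 4, 9 and 0.
actual = np.array([20.0, 30.0, 25.0])
predicted = np.array([22.0, 27.0, 25.0])

error = mse(actual, predicted)
print(error)   # (4 + 9 + 0) / 3
```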

In [62]:
# Build a printable prediction equation from the feature names and weights
def PredictionEquation(y, xs, w):
    eq = '{} = '.format(y)
    wfmat = lambda i: ('+' if i > 0 else '') + '{:0.6}'.format(i)
    for idx, x in enumerate(xs):
        eq = eq + '{}*{}'.format(wfmat(w[idx]), x)
    eq = eq + wfmat(w[-1])
    return eq

In [63]:
# Load data file
with open(DATASET_FILE, 'r') as csvFile:
    dataSet = np.genfromtxt(csvFile, delimiter=',', names=True, case_sensitive=True)

In [64]:
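# Shuffle the rows so that training does not use the same sets on every run.
# np.random.shuffle is used because random.shuffle can duplicate rows of a NumPy structured array.
np.random.shuffle(dataSet)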

In [65]:
# Split the data set into groups for training, validation and test.
# Slices are half-open, so each group starts exactly where the previous one ends.
trainData = dataSet[:NUMBER_FOR_TRAINING]
valData = dataSet[NUMBER_FOR_TRAINING:NUMBER_FOR_TRAINING+NUMBER_FOR_VALIDATION]
testData = dataSet[NUMBER_FOR_TRAINING+NUMBER_FOR_VALIDATION:]
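
Python slices are half-open, so consecutive slices that share an endpoint partition an array without dropping or repeating rows. A minimal sketch, assuming hypothetical sizes of 5 training rows and 3 validation rows out of 10:

```python
import numpy as np

n_train, n_val = 5, 3      # hypothetical sizes, chosen only for illustration
rows = np.arange(10)       # stand-in for ten data rows

train = rows[:n_train]                   # rows 0..4
val = rows[n_train:n_train + n_val]      # rows 5..7
test = rows[n_train + n_val:]            # rows 8..9

print(len(train), len(val), len(test))

# Every row lands in exactly one group.
recombined = np.concatenate((train, val, test))
print(np.array_equal(recombined, rows))
```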

In [66]:
# Specify the output feature
OUTPUT_FEATURE='mpg'

## Part (a):
Use OLS regression on the training data to predict _mpg_ based on _horsepower_ and _weight_.

In [67]:
# Specify the input features to be used.
inputFeatures1=['horsepower', 'weight']

In [68]:
# Create the output vector
Y = trainData[OUTPUT_FEATURE]

In [69]:
# Create the augmented design matrix
X = AugmentedDesignMatrix(dataSet=trainData,features=inputFeatures1)

In [70]:
# Initialize the augmented model parameter vector (overwritten by the OLS solution in the next cell).
W = np.ones(len(inputFeatures1)+1)

In [71]:
# Calculate the augmented model parameter vector using OLS
R = np.dot(X.T, X)
Rinv = np.linalg.inv(R)
W = np.dot(np.dot(Rinv, X.T), Y)
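
The cell above evaluates the normal-equations solution $W = (X^T X)^{-1} X^T Y$ with an explicit inverse. As a cross-check, the sketch below compares that formula with `np.linalg.lstsq`, which solves the same least-squares problem without forming the inverse. The data here is synthetic, and the names `X_demo`, `y_demo` and `w_true` are assumptions introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two random features plus a bias column of ones.
X_demo = np.column_stack((rng.normal(size=(50, 2)), np.ones(50)))
w_true = np.array([-2.0, 0.5, 30.0])                     # made-up weights
y_demo = X_demo @ w_true + 0.1 * rng.normal(size=50)     # targets with small noise

# Normal equations with an explicit inverse, as in the cell above.
w_normal = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# Least-squares solver; rcond=None selects NumPy's default cutoff.
w_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(w_normal)
print(w_lstsq)
print(np.allclose(w_normal, w_lstsq))
```

For a well-conditioned design matrix both routes should agree to numerical precision; the solver tends to be the safer choice when features are strongly correlated.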

### i. Prediction Equation

In [72]:
print(PredictionEquation(y=OUTPUT_FEATURE, xs=inputFeatures1, w=W))

mpg = -0.0246242*horsepower-0.00439616*weight+36.312


### Observation
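The prediction equation makes sense.  Efficient cars have an _mpg_ in the 30s, which matches the bias term.  The negative _horsepower_ coefficient reflects that sports cars (high _horsepower_) are less fuel efficient.  The negative _weight_ coefficient reflects that the heavier the car, the more fuel energy it needs to move.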
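
To see how the equation is applied, the sketch below evaluates it for one hypothetical car with 100 horsepower and a weight of 3000. The car is made up for illustration; the coefficients are copied from the printed equation above.

```python
import numpy as np

# Coefficients from the printed equation: horsepower, weight, bias.
w = np.array([-0.0246242, -0.00439616, 36.312])

# Hypothetical car: horsepower = 100, weight = 3000, plus the constant 1 for the bias.
car = np.array([100.0, 3000.0, 1.0])

# -0.0246242*100 - 0.00439616*3000 + 36.312 = -2.46242 - 13.18848 + 36.312
predicted_mpg = float(car @ w)
print(round(predicted_mpg, 2))   # 20.66
```

Horsepower contributes about 2.5 mpg of the reduction from the bias and weight about 13.2 mpg, so for this car weight dominates the prediction.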