Exercise 4 - Polynomial Regression
========

Sometimes our data doesn't have a linear relationship, but we still want to predict an outcome.

Suppose we want to predict how satisfied people might be with a piece of fruit. We would expect satisfaction to be low if the fruit was underripe or overripe, and high somewhere in between.

This is not something linear regression will help us with, so we can turn to polynomial regression to help us make predictions for these more complex non-linear relationships!

Step 1
------

In this exercise we will look at a dataset analysing internet traffic over the course of the day. Observations were made every hour over the course of several days. Suppose we want to predict the level of traffic we might see at any time during the day. How might we do this?

Let's start by opening up our data and having a look at it.

In [None]:
# This sets up the graphing configuration
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as graph
%matplotlib inline
graph.rcParams['figure.figsize'] = (15,5)
graph.rcParams["font.family"] = "DejaVu Sans"
graph.rcParams["font.size"] = "12"
graph.rcParams['image.cmap'] = 'rainbow'
graph.rcParams['axes.facecolor'] = 'white'
graph.rcParams['figure.facecolor'] = 'white'

In [None]:
import numpy as np
import pandas as pd

data = pd.read_csv('Data/traffic_by_hour.csv')

###--- BELOW, WRITE print(data) TO VIEW THE DATASET ---###

###

Step 2
-----

Next we're going to flip the data with the transpose method - our rows will become columns and our columns will become rows. Transpose is commonly used to reshape data so we can use it. Let's try it out.
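As a small illustration of what transposing does, here is a sketch on a made-up table. The `toy` DataFrame and its values are hypothetical and are not part of the traffic dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 2 days of observations (rows) at 3 hours (columns)
toy = pd.DataFrame({'00': [10, 12], '01': [20, 22], '02': [30, 32]})

# Swap rows and columns; for a DataFrame this gives the same result as toy.T
toy_t = np.transpose(toy)

print(toy.shape)    # (2, 3): 2 rows (days), 3 columns (hours)
print(toy_t.shape)  # (3, 2): 3 rows (hours), 2 columns (days)

# Each column of the transposed table now holds one day's readings, hour by hour
print(toy_t[0].tolist())  # [10, 20, 30]
```

After transposing, selecting a single column gives one day's worth of readings, which is a convenient shape for plotting one line per day.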

In [None]:
### REPLACE THE ??? BELOW WITH transpose
data_t = np.???(data)

print(data_t)

In [None]:
# Let's visualise the data!

###--- REPLACE ??? BELOW WITH sample ---###
for sample in range(0,data_t.shape[1]):
    graph.plot(data.columns.values, data_t[???])
###

graph.xlabel('Time of day')
graph.ylabel('Internet traffic (Gbps)')
graph.show()

Step 3
-----

This all looks a bit busy. Let's see if we can draw out a clearer pattern by taking the __average values__ for each hour.
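To make the idea of an hourly average concrete, here is a minimal sketch on a made-up table. The `toy` DataFrame is hypothetical; the real dataset has more hours and more days.

```python
import pandas as pd

# Hypothetical toy data: each column is an hour, each row is one day's observation
toy = pd.DataFrame({'00': [10, 12], '01': [20, 22], '02': [30, 32]})

# Average each hour's column across all days
hourly_means = [float(toy[hour].mean()) for hour in toy.columns.values]
print(hourly_means)  # [11.0, 21.0, 31.0]

# Column names are strings, so convert them to numbers before using them as a feature
hours_as_numbers = [int(hour) for hour in toy.columns.values]
print(hours_as_numbers)  # [0, 1, 2]
```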

In [None]:
# We want to look at the mean values for each hour.

hours = data.columns.values

###--- REPLACE THE ???s BELOW WITH hour ---###
y = [data[???].mean() for ??? in hours]  # This will be our outcome we measure - amount of internet traffic
x = np.transpose([int(???) for ??? in hours]) # This is our feature - time of day
###

# This makes our graph, don't edit!
graph.scatter(x, y)
for sample in range(0,data_t.shape[1]):
    graph.plot(hours, data_t[sample], alpha=0.25)
graph.xlabel('Time of day')
graph.ylabel('Internet traffic (Gbps)')
graph.show()

This alone could help us make a prediction if we wanted to know the expected traffic exactly on the hour.

But, we'll need to be a bit more clever if we want to make a __good__ prediction of times in-between.

Step 4
------

To estimate the traffic at points in between the hours, we need to model the relationship between the time of day and the amount of internet traffic.

NumPy's `polyfit(x,y,d)` function allows us to do just this. We specify our feature $x$ (time of day), our outcome $y$ (the amount of traffic), and the degree $d$ of the polynomial (the higher the degree, the more bends the curve can have).
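Before fitting the traffic data, it may help to see `polyfit` on points where we already know the answer. This is only a sketch: `x_demo` and `y_demo` are hypothetical values generated from the curve $y = 2x^2 - 3x + 1$, not from the dataset.

```python
import numpy as np

# Hypothetical data generated from a known curve: y = 2x^2 - 3x + 1
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = 2 * x_demo**2 - 3 * x_demo + 1

# Fit a degree-2 polynomial; coefficients are returned highest power first
coeffs = np.polyfit(x_demo, y_demo, 2)
print(np.round(coeffs, 6))  # approximately [ 2. -3.  1.]

# A degree-1 fit to the same points gives the best straight line,
# which has only a slope and an intercept and cannot follow the curve
line = np.polyfit(x_demo, y_demo, 1)
print(len(line))  # 2
```

A polynomial of degree $d$ always comes back as $d + 1$ coefficients, ordered from the highest power down to the constant term.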

In [None]:
# Polynomials of degree 1 are linear!
# Let's include this one just for comparison

###--- REPLACE THE ??? BELOW WITH 1 ---###
p1 = np.polyfit(x, y, ???)
###

In [None]:
# Let's also compare a few higher-degree polynomials

###--- REPLACE THE ???'s BELOW WITH 2, 3, AND THEN 4 ---###
p2 = np.polyfit(x, y, ???)
p3 = np.polyfit(x, y, ???)
p4 = np.polyfit(x, y, ???)
###

# Let's plot it!
graph.scatter(x, y)
xp = np.linspace(0, 24, 100)

# black dashed linear degree 1
graph.plot(xp, np.polyval(p1, xp), 'k--')
# red degree 2
graph.plot(xp, np.polyval(p2, xp), 'r-')
# blue degree 3
graph.plot(xp, np.polyval(p3, xp), 'b-') 
# yellow degree 4
graph.plot(xp, np.polyval(p4, xp), 'y-') 

graph.xticks(x, data.columns.values)
graph.xlabel('time of day')
graph.ylabel('internet traffic (Gbps)')
graph.show()

None of these polynomials do a great job of generalising the data. Let's try a few more.

In [None]:
###--- REPLACE THE ???'s BELOW WITH 5, 6, AND THEN 7 ---###
p5 = np.polyfit(x, y, ???)
p6 = np.polyfit(x, y, ???)
p7 = np.polyfit(x, y, ???)
###

# Let's plot it!
graph.scatter(x, y)
xp = np.linspace(0, 24, 100)

# black dashed linear degree 1
graph.plot(xp, np.polyval(p1, xp), 'k--')
# red degree 5
graph.plot(xp, np.polyval(p5, xp), 'r-') 
# blue degree 6
graph.plot(xp, np.polyval(p6, xp), 'b-') 
# yellow degree 7
graph.plot(xp, np.polyval(p7, xp), 'y-') 

graph.xticks(x, data.columns.values)
graph.xlabel('Time of day')
graph.ylabel('Internet traffic (Gbps)')
graph.show()

It looks like the 5th- and 6th-degree polynomials produce almost identical curves. This looks like a good curve to use.

We could perhaps use an even higher degree polynomial to fit it even more tightly, but we don't want to overfit the curve, since we want just a generalisation of the relationship.
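One way to see why a tighter fit is not automatically a better model is to watch the error on the fitted points as the degree grows. The sketch below uses hypothetical noisy data (a sine curve plus random noise, with a fixed seed), not the traffic dataset.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical noisy data: a smooth curve plus random noise
x_demo = np.linspace(0, 1, 12)
y_demo = np.sin(2 * np.pi * x_demo) + rng.normal(0, 0.2, x_demo.size)

# Sum of squared errors on the same points we fitted to, for increasing degree
errors = []
for degree in [1, 3, 5, 7]:
    coeffs = np.polyfit(x_demo, y_demo, degree)
    residuals = y_demo - np.polyval(coeffs, x_demo)
    errors.append(float(np.sum(residuals**2)))
    print(degree, round(errors[-1], 4))
```

The error on the fitted points does not go up as the degree increases, because a higher-degree polynomial can always reproduce a lower-degree one. Past a certain point the extra bends are mostly following the noise, which is why a low error on the fitted points alone is not a reason to pick the highest degree.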

Let's see how our degree 6 polynomial compares to the real data.

In [None]:
for row in range(0,data_t.shape[1]):
    graph.plot(data.columns.values, data_t[row], alpha = 0.5)

###--- REPLACE ??? BELOW WITH p6 - THE POLYNOMIAL WE WISH TO VISUALIZE ---###    
graph.plot(xp, np.polyval(???, xp), 'k-')
###

graph.xlabel('Time of day')
graph.ylabel('Internet traffic (Gbps)')
graph.show()

Step 5
------

Now let's try using this model to make a prediction for a time between 00 and 24.
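Predictions come from `np.polyval`, which evaluates a polynomial at whatever input we give it. As a quick sketch with hypothetical coefficients (`demo_coeffs` is made up for illustration and is not one of the fitted models):

```python
import numpy as np

# Hypothetical coefficients for 1*t^2 + 2*t + 3, highest power first
demo_coeffs = [1, 2, 3]

# Evaluate at a single value of t
single = np.polyval(demo_coeffs, 2)
print(single)  # 1*4 + 2*2 + 3 = 11

# Or evaluate at several values at once, including ones between whole numbers
several = np.polyval(demo_coeffs, np.array([0.0, 0.5, 1.0]))
print(several)  # values 3.0, 4.25 and 6.0
```

Because the polynomial is defined for any input, it can be evaluated at fractional times such as 12.5 as easily as at whole hours.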

In [None]:
###--- REPLACE ??? BELOW WITH 12.5 (this represents the time 12:30) ---###
t = ???
###

###--- REPLACE ??? BELOW WITH p6 SO WE CAN MAKE A PREDICTION WITH THE 6TH DEGREE POLYNOMIAL MODEL. ---###
pred = np.polyval(???, t)
###

print("at t=%s, predicted internet traffic is %s Gbps"%(t,pred))

# Now let's visualise it
graph.plot(xp, np.polyval(p6, xp), 'y-')

graph.plot(t, pred, 'ko') # result point
graph.plot(np.linspace(0, t, 2), np.full([2], pred), dashes=[6, 3], color='black') # dashed lines (to y-axis)
graph.plot(np.full([2], t), np.linspace(0, pred, 2), dashes=[6, 3], color='black') # dashed lines (to x-axis)

graph.xticks(x, data.columns.values)
graph.ylim(0, 60)
graph.title('expected traffic throughout the day')
graph.xlabel('time of day')
graph.ylabel('internet traffic (Gbps)')

graph.show()

Conclusion
-----

And there we have it! You have made a polynomial regression model and used it for analysis! This model gives us a prediction for the level of internet traffic we should expect to see at any given time of day.

You can go back to the course and either click __'Next Step'__ to start an optional step with tips on how to better work with AI models, or you can go to the next module where instead of predicting numbers we predict categories.