## Assignment



Learning Outcomes:

-   Practice matrix reshaping
-   Learn how to add confidence intervals to an ordinary least squares
    regression model
-   Learn how to draw polygons with matplotlib
-   Combine all your Python skills



### Instructions



In the previous module, we used linear regression analysis to build a
simple linear model of the storks versus babies data (see
Fig. [storksandbabies](#storksandbabies)).

![img](storks.png "The storks versus babies data") 
This figure shows the data, the linear regression
model, and a shaded area that represents the 95% confidence
interval.

In the previous assignment, you added noise to a linear data set and
explored how this noise affects the accuracy of the regression
analysis. You observed that with increasing data density, the match
between the model and the original data improved. The confidence
interval above is the area that you can be 95% sure contains the true
regression line. I.e., in your previous assignment, all orange
lines would fit inside this envelope. In other words, your regression
analysis finds the best fit, but depending on how noisy your data is,
uncertainty remains as to whether the best fit is also the true fit. This
uncertainty is expressed by the shaded confidence interval.

Note that this is different from the uncertainty of a prediction you
base on your regression model. This so-called prediction interval is
much larger. But back to our original problem. We want to create some
code that allows us to plot this confidence interval. There is no
ready-made solution, but with a bit of numpy magic, we can create our
own!

The shaded area in the above figure is a polygon. If you were to draw
this by hand, you would first draw the outline and then fill it with
color. The following figure illustrates this approach:

![img](polygon_n.png "A polygon is drawn by following its outline either clockwise or anticlockwise")

The key here is that we have to draw in a consistent direction,
either clockwise or anticlockwise (do you remember those drawing-by-numbers
books?). We will look into the details of this below. But
first we need to get the data from our regression model.
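
As an aside, the idea of a consistent drawing direction can be checked numerically. The following is a minimal sketch, not part of the assignment: the helper `signed_area` is a hypothetical name, and it uses the shoelace formula, which yields a positive area for an anticlockwise outline, a negative area for a clockwise one, and a wrong value if the vertices are visited out of order.

```python
import numpy as np


def signed_area(vertices: np.ndarray) -> float:
    """Shoelace formula: positive if the outline runs anticlockwise."""
    x = vertices[:, 0]
    y = vertices[:, 1]
    # np.roll(..., -1) pairs each vertex with the next one (wrapping around)
    return 0.5 * float(np.sum(x * np.roll(y, -1) - np.roll(x, -1) * y))


# unit square, corners visited anticlockwise
square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])

# same corners, but the last two are swapped so the outline crosses itself
bowtie = square[[0, 1, 3, 2]]

area_square = signed_area(square)          # anticlockwise
area_reversed = signed_area(square[::-1])  # clockwise
area_bowtie = signed_area(bowtie)          # inconsistent order

print(area_square, area_reversed, area_bowtie)
```

The self-crossing outline no longer encloses the square, which is why the order of the vertices matters when we build the polygon for the confidence interval.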

Add code in the box below which reads the sedimentation rate versus
sulfate reduction rate data from the previous exercise.



In [1]:
# read data

Next, create a linear regression analysis of this data which treats
the sedimentation rate as the independent variable (let's call it `sedrate`), and the sulfate
reduction rate as the dependent variable (`srr`).



In [1]:
# regression model

Now, we create the model predictions and then extract the
confidence intervals that we will use for our polygon. I recommend starting
with a small number of predicted values, say 5. If all works well, you
can increase the number for your final submission. Below is the relevant
code to handle the stats model. By the way, execute the `dict(sedrate=nx)` piece
in a separate cell to see what it does.
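
As a hint, here is a minimal sketch of what the `dict()` call produces. The values in `nx` are made up for illustration and are not the sedimentation rate data. The keyword becomes the dictionary key, which presumably has to match the name of the independent variable used in your regression model.

```python
import numpy as np

# hypothetical x-values, standing in for your own data
nx = np.array([0.0, 0.5, 1.0])

# dict(sedrate=nx) is shorthand for {"sedrate": nx}
exog = dict(sedrate=nx)

print(exog)
print(list(exog.keys()))
```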



In [1]:
# create predictions
nx :np.ndarray =  # create 5 values between min-x and max-x

# create a prediction at the nx locations see the documentation for
# statsmodels.regression.linear_model.OLSResults.get_prediction
# the function returns a linear_model.PredictionResults type
# which is so long that I dropped it from the type hint. 
prediction = results.get_prediction(exog=dict(sedrate=nx))

# extract the confidence intervals. The default is the 95% interval
# you can set other values, see the documentation for
# statsmodels.regression.linear_model.PredictionResults.conf_int

ci :np.ndarray = prediction.conf_int()
print(f" ci = {ci}")

This should yield a 2-dimensional array with 10 numbers (5 rows and 2
columns). The numbers below are from the stork example, so your numbers
will be different, but the output should look like this:

    ci = [[  25.60903207  424.44834033]
     [ 262.42790764  619.53409124]
     [ 393.92939094  919.93723441]
     [ 483.47868295 1262.29256887]
     [ 560.36720228 1617.30867601]]

The first column contains the lower boundary values at `nx[i]`, and the
second column contains the upper boundary values at `nx[i]`. This data
is missing the x-coordinates, and it is not yet in a format that can
be used by the polygon function of matplotlib.
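
To see how the two boundaries sit inside such an array, here is a small sketch that uses the first three rows of the stork output above. The variable names `lower` and `upper` are only suggestions.

```python
import numpy as np

# first three rows of the stork example output shown above
ci = np.array([
    [25.60903207, 424.44834033],
    [262.42790764, 619.53409124],
    [393.92939094, 919.93723441],
])

print(ci.shape)  # (number of x-values, 2)

lower = ci[:, 0]  # all rows, first column
upper = ci[:, 1]  # all rows, second column

print(lower)
print(upper)
```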

As we have seen above, in order to plot a polygon, we need a sequence
of coordinate pairs (typically called vertices), one for each polygon point,
and this sequence must be ordered in such a way that it describes the
polygon in either clockwise or counterclockwise direction (see the
above figure).

Here is what needs to be done:



In [1]:
# extract lower and upper boundary values from ci

In [1]:
# combine them into a new 1-d array in such a way that they describe the
# polygon in counterclockwise fashion

In [1]:
# Create a new vector of x-values which corresponds to the polygon
# values in the previous cell

In [1]:
# combine your new x and y values into a 2-dimensional array of
# vertices.  I.e., the first column contains the x-values, and the 2nd
# column contains the y-values.

The below numbers are from the stork example, so your numbers will be
different, but you can see how this array describes the locations of
the numbers 1 to 6 in the above figure:

    coords = [[0.00000000e+00 2.56090321e+01]
              [7.50000000e+03 2.62427908e+02]
              [1.50000000e+04 3.93929391e+02]
              [2.25000000e+04 4.83478683e+02]
              [3.00000000e+04 5.60367202e+02]
              [3.00000000e+04 1.61730868e+03]
              [2.25000000e+04 1.26229257e+03]
              [1.50000000e+04 9.19937234e+02]
              [7.50000000e+03 6.19534091e+02]
              [0.00000000e+00 4.24448340e+02]]

Last but not least, let's create the final figure. A couple of pointers, though:

1.  First create your regular regression graph (6 by 4 inches, ggplot
    style, proper axis labels with units, no title, no legend,
    but include the regression parameters similar to
    Fig. [storksandbabies](#storksandbabies), and the regression line).
2.  The polygon method is not part of matplotlib's pyplot interface, so
    you need to import it separately (see the code snippet below).
3.  The polygon is made transparent with the alpha parameter (from 0 to
    1). You still need to pay attention to what is plotted first and last
4.  Adding a polygon to a plot does not update the plot axis limits. So
    if there is no other data in the plot you need to set the axis
    limits explicitly.
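
The last point can be sketched as follows. This is a minimal illustration with a made-up triangle rather than your confidence interval, and it assumes that nothing else has been plotted on the axes:

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.patches import Polygon

# made-up vertices, only for illustration
triangle = np.array([[10.0, 10.0], [30.0, 10.0], [20.0, 25.0]])

fig, ax = plt.subplots(figsize=(6, 4))
ax.add_patch(Polygon(triangle, alpha=0.2))

# the polygon is the only artist here, so set the axis limits by hand
ax.set_xlim(triangle[:, 0].min(), triangle[:, 0].max())
ax.set_ylim(triangle[:, 1].min(), triangle[:, 1].max())

print(ax.get_xlim(), ax.get_ylim())
```

In your own figure, the regression line and data points will usually define sensible limits already, so this step only matters if the polygon extends beyond them.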



In [1]:
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon

# create the polygon
pol :Polygon = Polygon(coords,alpha=0.2)

# add the polygon to the plot (ax is the axes object of your regression plot)
ax.add_patch(pol)

print(type(pol))

For the second part of your assignment, create:

-   A function which takes the output of `prediction.conf_int()` as an
    argument and returns a list of vertices which can be used by the
    polygon function.
-   The function should be fully annotated, have a doc-string, use type
    hinting etc.
-   Recast your code so that it uses this function and place your code into
    a single cell using the template below.
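
To illustrate what "fully annotated" could look like, here is a sketch of a small, unrelated helper. The function `interval_width` is hypothetical and is not the function you are asked to write. It merely shows a docstring, type hints, and a basic input check.

```python
import numpy as np


def interval_width(ci: np.ndarray) -> np.ndarray:
    """Return the width of each confidence interval.

    Parameters
    ----------
    ci : np.ndarray
        Array of shape (n, 2) with the lower bounds in the first
        column and the upper bounds in the second column.

    Returns
    -------
    np.ndarray
        Array of shape (n,) with upper bound minus lower bound.
    """
    if ci.ndim != 2 or ci.shape[1] != 2:
        raise ValueError("ci must have shape (n, 2)")

    return ci[:, 1] - ci[:, 0]


widths: np.ndarray = interval_width(np.array([[1.0, 4.0], [2.0, 7.0]]))
print(widths)
```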



In [1]:
"""
Description:
Purpose:
Author:
Date:
"""
# ----------- third party library imports ------------------

# ----------- functions definitions  -----------------------

# ----------- main program ---------------------------------
# --- variable declarations

# --- code starts here
#

### Marking Scheme



Total points: 23 pts

-   read data: 2pts
-   regression model: 2pts
-   predictions: 2pts
-   upper and lower boundary values: 2pts
-   Vector of increasing and decreasing x-values: 2pts
-   2d-arrays with correct vertices: 2pts
-   regular regression plot with regression line (1pt), regression parameters (1pt), correct layout and labels (1pt)
-   correctly plotted polygon: 4pts
-   correctly working function with docstring, type hints, etc.: 4pts



### Submission Instructions



Create a new (or copy an existing) notebook in your `submissions`
folder before editing it. Otherwise, your edits may be overwritten the
next time you log into syzygy. Please name your copy
"Assignment-Name-FirstName-LastName": 

-   Replace `Assignment-Name` with the name of the assignment
    (i.e., the filename of the respective Jupyter Notebook)
-   Replace `FirstName-LastName` with your own name.

Note: If the notebook contains images, you need to copy the image files as well!

Your notebook/pdf must start with the following lines 

**Assignment Title**

**Date:**

**First Name:**

**Last Name:**

**Student ID:**

Before submitting your assignment:

-   Check the marking scheme, and make sure you have covered all requirements.
-   Re-read the learning outcomes and verify that you are comfortable
    with each concept. If not, please speak up on the discussion board
    and ask for further clarification. I can guarantee that if you feel
    uncertain about a concept, at least half the class will be in the
    same boat. So don't be shy!

To submit your assignment, you need to download it in `ipynb` notebook
format **and** in `pdf` format. The best way to export your notebook as
pdf is to select `print`, and then `print to pdf`.  Please submit
both files on Quercus. Note that the pdf export can fail if your file
contains invalid markup/python code. So you need to check that the pdf
export is complete and does not miss any sections. If you have export
problems, please contact the course instructor directly.

Notebooks typically have empty code cells in which you have to enter
Python code. Please use the respective cell below each question, or
create a Python cell where necessary. Add text cells to enter your
answers where appropriate. Your answers will only count if the code
executes without error. It is thus recommended to run your solutions
before submitting the assignment.

**Note: Unless specifically requested, do not type your answers by**
**hand. Instead, write code that produces the answer. Your pdf file**
**should show the code as well as the results of the code execution.**

