# I, slope

This exercise gives you some hands-on practice with linear regression.

In [None]:
# Run this cell, but please don't change it.

# These lines import the NumPy and Pandas modules, and SciPy's minimize function.
import numpy as np
import pandas as pd
from scipy.optimize import minimize

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# These lines load the tests.
from client.api.notebook import Notebook
ok = Notebook('isloping.ok')

## Hemoglobin and serum creatinine

We start by loading the [data on chronic kidney disease](https://matthew-brett.github.io/cfd2020/data/chronic_kidney_disease).

In [None]:
ckd = pd.read_csv('ckd_clean.csv')
ckd.head()

Our interest here is in two variables / columns:

* "Hemoglobin" : the concentration of hemoglobin, the oxygen-carrying protein
  in red blood cells.  This tends to go down in chronic kidney disease.
* "Serum Creatinine" : this is a measure of how well the kidney is clearing
  waste products from the blood.  If your kidneys are working well, your
  creatinine should be low.

In [None]:
hgb = np.array(ckd['Hemoglobin'])
creat = np.array(ckd['Serum Creatinine'])

First we do a simple plot.

In [None]:
ckd.plot.scatter('Hemoglobin', 'Serum Creatinine')

To guess at the intercept, we make our job easier by forcing the plot library to include the x=0, y=0 point on the plot.

In [None]:
ckd.plot.scatter('Hemoglobin', 'Serum Creatinine')
# Make sure 0, 0 is on the plot.
plt.axis([0, 18, 0, 15])

Looking at this plot, we see that a straight line is not a very good predictor
for the data.  It looks as though a straight line would be useful for the kidney
patient values, with relatively low hemoglobin and high creatinine, but it
would not be a good prediction for the pack of patients without kidney disease
at the bottom right of the plot, with high hemoglobin and low creatinine.

Ignoring that problem for now, we eyeball the plot, and guess that a good straight line would have an intercept of around 12, and a slope of around -1.

The line gives us the following predictions for the y (creatinine) values:

In [None]:
predicted = 12 + hgb * -1

The errors are:

In [None]:
errors = creat - predicted
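
To make the arithmetic concrete, here is a small sketch of the same two steps on three made-up points. The `toy_` arrays are invented for illustration and do not come from `ckd_clean.csv`; only the intercept of 12 and slope of -1 are taken from the guess above.

```python
import numpy as np

# Made-up values for illustration only; these are NOT from the data file.
toy_hgb = np.array([8.0, 10.0, 11.0])
toy_creat = np.array([5.0, 1.5, 2.0])

# Predictions from the guessed line: intercept 12, slope -1.
# 12 - 8 = 4, 12 - 10 = 2, 12 - 11 = 1.
toy_predicted = 12 + toy_hgb * -1

# Each error is the actual value minus the predicted value.
# 5 - 4 = 1, 1.5 - 2 = -0.5, 2 - 1 = 1.
toy_errors = toy_creat - toy_predicted

print(toy_predicted)
print(toy_errors)
```

A positive error means the actual value lies above the line, and a negative error means it lies below.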

We plot this guessed line on the data.

In [None]:
# Don't worry about this code, it's just to plot the line, and the errors.
plt.plot(hgb, creat, 'o')
plt.plot(hgb, predicted, 'o', color='red')
# Draw a line between predicted and actual
for i in np.arange(len(hgb)):
    x = hgb[i]
    y_0 = predicted[i]
    y_1 = creat[i]
    plt.plot([x, x], [y_0, y_1], ':', color='black', linewidth=1)

Now your job is to find the best (least-squares) line fitting `hgb` (on the x axis) to `creat` (on the y axis).

To help you, here is the `ss_any_line` function from [using minimize](https://matthew-brett.github.io/cfd2020/mean-slopes/using_minimize).

In [None]:
def ss_any_line(c_s, x_values, y_values):
    c, s = c_s
    predicted = c + x_values * s
    error = y_values - predicted
    return np.sum(error ** 2)
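
As a rough illustration of how this function behaves, the sketch below tries it on four made-up points that lie exactly on the line `y = 1 + 2 * x`. The `toy_x` and `toy_y` arrays and the starting guess of `[0, 1]` are assumptions chosen for the example, not values from the kidney data. The function definition is repeated only so that the sketch runs on its own.

```python
import numpy as np
from scipy.optimize import minimize


def ss_any_line(c_s, x_values, y_values):
    # Same function as above, repeated so this sketch is self-contained.
    c, s = c_s
    predicted = c + x_values * s
    error = y_values - predicted
    return np.sum(error ** 2)


# Made-up points lying exactly on the line y = 1 + 2 * x.
toy_x = np.array([0.0, 1.0, 2.0, 3.0])
toy_y = np.array([1.0, 3.0, 5.0, 7.0])

# Intercept 1 and slope 2 match the points exactly, so the sum of squares is 0.
ss_exact = ss_any_line([1, 2], toy_x, toy_y)

# Intercept 0 and slope 2 miss every point by 1, so the sum of squares is 4.
ss_off = ss_any_line([0, 2], toy_x, toy_y)

# minimize searches for the intercept and slope giving the smallest sum of
# squares, starting from the guess [0, 1].  The extra arguments for the
# function (the x and y values) go in `args`.
toy_result = minimize(ss_any_line, [0, 1], args=(toy_x, toy_y))
toy_inter, toy_slope = toy_result.x
print(ss_exact, ss_off)
print(toy_inter, toy_slope)
```

With these toy points, `minimize` should return an intercept close to 1 and a slope close to 2, though not necessarily exactly those values, because it searches numerically.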

Use this function to calculate the sum of squared errors for a line with
intercept 12 and slope -1.

In [None]:
ss_guessed = ...
ss_guessed

In [None]:
_ = ok.grade('q_1_guessed_line')

Now use `minimize` to calculate the best fit intercept and slope:

In [None]:
best_inter, best_slope = ...
print(best_inter)
print(best_slope)

In [None]:
_ = ok.grade('q_2_best_line')

## Done

You're finished with the assignment!  Be sure to...

- **run all the tests** (the next cell has a shortcut for that),
- **Save and Checkpoint** from the "File" menu.
- Finally, **restart** the kernel for this notebook, and **run all the cells**,
  to check that the notebook still works without errors.  Use the
  "Kernel" menu, and choose "Restart and run all".  If you find any
  problems, go back and fix them, save the notebook, and restart / run
  all again, before submitting.  When you do this, you make sure that
  we, your humble markers, will be able to mark your notebook.

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [ok.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]