In [None]:
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
%matplotlib inline

# Week 10: I scream, you scream, we all scream for ice cream and shark attacks
---

Tonight we'll be looking at Ice Cream, Fudge, and Sharks—cool!

In [None]:
tbl = Table.read_table("ice-cream-stats.csv")
tbl

In [None]:
# Let's set some variables to the columns for easier reference
ice = tbl.column("Ice Cream Sales (cones)")
fudge = tbl.column("Fudge Sale Volume (g)")
shark = tbl.column("Shark Attacks")

# First, let's cement some knowledge
---

Alrighty, before we get into it, let's make sure we are clear on a few things related to distributions, confidence intervals, hypothesis testing, etc.  Let's draw it all out on the chalkboard:

1. **Population**
   - Distribution
   - Population Parameters
     - Mean, SD, others
     
     
2. **Sample**
   - How to obtain (without replacement from pop)
   - Distribution
   - Sample Statistics
     - Mean, SD, others
     
     
3. **Distribution of Sample Statistics**
   - How to obtain (with replacement from sample, or mathematically)
   - Distribution
     - What is the statistic?
   - Features
     - Mean, SD
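
The three levels in the outline above can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the course data: the exponential `population`, the sample size of 500, and the 2000 resamples are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Population: synthetic values standing in for a real population (an assumption).
population = rng.exponential(scale=10, size=100_000)
population_mean = np.mean(population)

# 2. Sample: drawn WITHOUT replacement from the population.
sample = rng.choice(population, size=500, replace=False)
sample_mean = np.mean(sample)

# 3. Distribution of a sample statistic (here, the mean):
#    resample WITH replacement from the sample, many times.
bootstrap_means = np.array([
    np.mean(rng.choice(sample, size=len(sample), replace=True))
    for _ in range(2000)
])

print("Population mean:        ", population_mean)
print("Sample mean:            ", sample_mean)
print("Mean of bootstrap means:", np.mean(bootstrap_means))
print("SD of bootstrap means:  ", np.std(bootstrap_means))
```

The bootstrap means should center near the sample mean, and their SD should be much smaller than the SD of the sample itself.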

# Let's talk about relationships
---

Let's consider the following table:

|x|y|
|---|---|
|1|1|
|2|4|
|4|8|
|5|10|

If we were to receive an $x=3$, what would we expect the corresponding $y$ to equal? Why?

Does this expectation seem like a solid choice?

Now, let's consider what we expect the value of $y$ to equal when we receive an $x=6$.

Does this expectation seem like a solid choice?
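
One way to make these two guesses concrete is to fit a straight line to the four points and evaluate it at both inputs. This is only a sketch: it assumes a least-squares line (via `np.polyfit`) is a reasonable summary of the table, which is one choice among several.

```python
import numpy as np

# The four (x, y) pairs from the table above.
x = np.array([1, 2, 4, 5])
y = np.array([1, 4, 8, 10])

# Least-squares line through the points (degree-1 polynomial fit).
slope, intercept = np.polyfit(x, y, 1)

for new_x in (3, 6):
    # An input inside the observed x range is interpolation; outside is extrapolation.
    inside = x.min() <= new_x <= x.max()
    kind = "interpolation" if inside else "extrapolation"
    prediction = slope * new_x + intercept
    print(f"x = {new_x}: predicted y = {prediction:.2f} ({kind})")
```

The prediction at $x=3$ is surrounded by observed points, while the one at $x=6$ relies on the pattern continuing past the data we have.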

What about the following graph of `Ice Cream Sales` versus `Fudge Sales Volume`?

In [None]:
tbl.scatter("Ice Cream Sales (cones)", "Fudge Sale Volume (g)")

What, roughly, would we expect $Fudge\ Sales\ Volume$ to equal provided that $Ice\ Cream\ Sales = 100$?

Let's focus on our justification for coming up with a value for $Fudge$.  We saw a strong *relationship* between $Ice \ Cream$ and $Fudge$.  What does that mean?

If we're speaking statistically, then we consider `relationship == association`.  And an association really boils down to the following claim:  

<center>"When this changes value, then that changes value too."</center>

# Correlation
---

What does correlation (or the correlation coefficient) tell us?

Correlation can be positive or negative, and it can be perfect, strong, moderate, weak, or zero.  We can also quantify it using the correlation coefficient, $r$.

So, going back to our table of $x$ and $y$ above, what correlation do we see?  What about for $Ice\ Cream$ and $Fudge$?

In [None]:
# What is the correlation (qualitatively) between x and y?

In [None]:
# Ice Cream and Fudge are strongly associated.
# What is the correlation between Ice Cream and Fudge?  Is it a strong correlation?

So, how do we go about measuring the correlation (finding the correlation coefficient)?

We must first put our x and y values into *"standard units"*.  Recall, to put a point into standard units, what formula is used?

In [None]:
# First, what does it mean (in words) for a point to be `n` in standard units?

For some data set, $X$, let $x_i$ be some data point in $X$.

Then, $x_{i, SU} =$ ??


And, replacing $x_i$ with all of our datapoints, $X$, we can convert the entire data set into standard units.

In [None]:
# Before we convert Ice Cream and Fudge to SU, what do we know about what the scatter plot
# will look like after converting both?

In [None]:
# Let's write a quick standard units function
def standard_units(array):
    return (array - np.mean(array)) / np.std(array)

In [None]:
ice_su = standard_units(ice)
fudge_su = standard_units(fudge)

plt.scatter(ice_su, fudge_su)
plt.xlabel("Ice Cream Sales (cones) In Standard Units"); plt.ylabel("Fudge Sale Volume (g) In Standard Units");

In [None]:
# What changed between the two plots?

Out of curiosity, what do you expect the mean and standard deviation of $Ice\ Cream_{SU}$ to be?  What about $Fudge_{SU}$?

In [None]:
np.mean(ice_su).round(5)

In [None]:
np.std(ice_su)

Now, to find our correlation coefficient between two data sets, $X$ and $Y$, there is one more formula.  It's written as?

In [None]:
# Calculate the correlation coefficient between Ice Cream and Fudge

r = np.mean(ice_su * fudge_su)
r
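
As a sanity check, the "mean of the products in standard units" calculation should agree with NumPy's built-in `np.corrcoef`. The sketch below uses synthetic arrays `a` and `b` as stand-ins (an assumption, since it does not read the course CSV) and redefines `standard_units` so that it runs on its own.

```python
import numpy as np

def standard_units(array):
    """Convert an array to standard units: (value - mean) / SD."""
    return (array - np.mean(array)) / np.std(array)

rng = np.random.default_rng(1)

# Synthetic stand-ins for two columns (assumed data, not the course CSV).
a = rng.normal(50, 10, size=200)
b = 3 * a + rng.normal(0, 15, size=200)

# Correlation as the mean of the products of the standard units.
r_manual = np.mean(standard_units(a) * standard_units(b))

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r.
r_numpy = np.corrcoef(a, b)[0, 1]

print("Manual r:     ", r_manual)
print("np.corrcoef r:", r_numpy)
```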

Recalling what the correlation between two data sets tells us, what does this really mean?

In [None]:
# In words, what does our correlation coefficient mean?

# Onwards, to regression!
---

So, we have a correlation coefficient now.  What can we do with it?

One idea is to plot it somehow.  But how?

In [None]:
# Let's try incorporating the correlation coefficient into our plot

plt.scatter(ice_su, fudge_su)
plt.plot(np.linspace(-2, 2, 2), np.linspace(-2, 2, 2)*r, c='red')
plt.xlabel("Ice Cream Sales (cones) In Standard Units"); plt.ylabel("Fudge Sale Volume (g) In Standard Units");

In [None]:
# Now that we've incorporated r into our plot, what does r represent?

And that's the premise of linear regression!

So, before we go much further, let's figure something out: *What is the purpose of regression?*

Well, it's very useful since it can help us **predict** a value when given an input.  Even if we don't have a data point we can still predict what the corresponding value would be.

We've been looking just at linear regression—stuff that falls in a line—but the same concepts carry over to higher powers!

How about we take a look at some data that's a bit more linear?  Take a look at the plot of $Ice\ Cream\ Sales$ vs $Shark\ Attacks$.

In [None]:
tbl.scatter("Ice Cream Sales (cones)", "Shark Attacks")

Okay, let's find the correlation coefficient so that we can plot a best fit line.

In [None]:
# Before calculating anything, what do you think the correlation coefficient should be?

In [None]:
# We can use the standard_units function we created
ice_su = standard_units(ice)
shark_su = standard_units(shark)

r = np.mean(ice_su * shark_su)
r

In [None]:
# Let's plot it!

xrange = np.linspace(min(ice), max(ice), 2)

plt.scatter(ice, shark)
plt.plot(xrange, xrange * r, c='r')
plt.xlabel("Ice Cream Sales (cones)"); plt.ylabel("Shark Attacks");

Uh-oh.  What's wrong with that plot?

In [None]:
# Let's fix the plot

xrange = np.linspace(min(ice_su), max(ice_su), 2)
plt.scatter(ice_su, shark_su)
plt.plot(xrange, xrange * r, c='r')
plt.xlabel("Ice Cream Sales (cones) In Standard Units"); plt.ylabel("Shark Attacks In Standard Units");

In [None]:
# What is the equation of the line we plotted?

Cool.  I'm glad we fixed that plot.  We should be able to plug in values into our best-fit line now.  But I'm not really satisfied with the plot we ended up creating.

In [None]:
# Why am I sad? (No, not because it's week 10.)  In other words: how can we make the plot more useful?















Let's go ahead and solve that problem.

In order to come up with an equation for the best-fit line in original units (not standard units), what needs to be done?

Well, let's examine what we'd need to change.  For starters, the best-fit line in standard units passes through (0, 0).  Second of all, it has a different slope than what we'd expect.

In standard units, we have $y = rx$.  We want to put this back into original units as the form $y=mx+b$.

$$m = r\cdot \frac{SD_y}{SD_x}$$

$$b = mean_y - m\cdot mean_x$$

In [None]:
# Let's derive this


$$y_{SU} = r\cdot x_{SU}$$

$$\frac{y - mean_y}{SD_y} = r\cdot \frac{x-mean_x}{SD_x}$$

$$\hspace{2.85cm} y-mean_y = r\cdot \frac{1}{SD_x} \cdot (x-mean_x) \cdot SD_y$$

$$\hspace{1.6cm} y-mean_y = \frac{r\cdot SD_y}{SD_x} \cdot (x-mean_x)$$

$$\hspace{2.45cm} y-mean_y = \frac{r\cdot SD_y}{SD_x}x - \frac{r\cdot SD_y}{SD_x}mean_x$$

$$\hspace{6.2cm} y = \frac{r\cdot SD_y}{SD_x}x + \left(mean_y - \frac{r\cdot SD_y}{SD_x}mean_x\right)$$
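
To check the result of this derivation numerically, we can compare the slope $m = r\cdot \frac{SD_y}{SD_x}$ and intercept $b = mean_y - m\cdot mean_x$ against a direct least-squares fit. This is a sketch on synthetic data; the arrays `x` and `y` and their parameters are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic, roughly linear data (assumed for illustration only).
x = rng.uniform(0, 100, size=300)
y = 0.4 * x + 5 + rng.normal(0, 6, size=300)

# Slope and intercept from the correlation-based formulas.
r = np.corrcoef(x, y)[0, 1]
m = r * np.std(y) / np.std(x)
b = np.mean(y) - m * np.mean(x)

# Slope and intercept from a direct least-squares line fit.
fit_slope, fit_intercept = np.polyfit(x, y, 1)

print("Formula:   ", m, b)
print("np.polyfit:", fit_slope, fit_intercept)
```

The two pairs should match up to floating-point error, since the regression line in standard units is the least-squares line expressed in different units.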


Great, so let's get our line into original units.  Then, we can start predicting values!

In [None]:
# Write the equation of the best-fit line between
# Ice Cream and Shark Attacks in original units.

slope = r * np.std(shark) / np.std(ice)
intercept = np.mean(shark) - slope * np.mean(ice)

"Shark Attacks = {} * Ice Cream Sales (cones) + {}".format(slope, intercept)

In [None]:
# And plot it out
xrange = np.linspace(min(ice), max(ice), 2)

plt.scatter(ice, shark)
plt.plot(xrange, slope * xrange + intercept, c='r')
plt.xlabel("Ice Cream Sales (cones)"); plt.ylabel("Shark Attacks");

In [None]:
# Alright, now, given an Ice Cream value of 100,
# what do we expect the corresponding y value to be?

value = slope * 100 + intercept
value

In [None]:
# What if Ice Cream has a value of 0?

value = slope * 0 + intercept
value

```
It's impossible to have negative shark attacks!  Regression doesn't work perfectly on data outside of our data set!
```

In [None]:
# Interpret the best-fit line equation in words

# And residuals
---


A residual is defined as the actual y value minus the predicted y value.  This is often written as $y-\hat{y}$

Let's take a look at our residuals for $Ice\ Cream$ and $Shark\ Attacks$, and then look at our residuals for $Ice\ Cream$ and $Fudge\ Sales$.

In [None]:
resids_ice_shark = shark - (slope * ice + intercept)

plt.scatter(ice, resids_ice_shark)

# Let's draw a handy line at y=0 (spot on prediction!)
plt.axhline(0, c='k')
plt.xlabel("Ice Cream Sales (cones)"); plt.ylabel("Residuals of Shark Attacks");

In [None]:
corr_ice_fudge = np.mean(ice_su * fudge_su)
slope_ice_fudge = corr_ice_fudge * np.std(fudge) / np.std(ice)
int_ice_fudge = np.mean(fudge) - slope_ice_fudge * np.mean(ice)

resids_ice_fudge = fudge - (slope_ice_fudge * ice + int_ice_fudge)

plt.scatter(ice, resids_ice_fudge)
plt.axhline(0, c='k')
plt.xlabel("Ice Cream Sales (cones)"); plt.ylabel("Residuals of Fudge Sales");

In order for our "best-fit" to actually be even a *good* fit, there's a key feature of the residual plot that we should observe.

In [None]:
# What needs to be true about our residual plot for our best-fit to be good?

Out of curiosity, what should the mean of the residual points be?

In [None]:
print("Ice Cream vs Fudge:\t\t", np.round(np.mean(resids_ice_fudge), 10),
      "\nIce Cream vs Shark Attacks:\t", np.round(np.mean(resids_ice_shark), 10))

What about the standard deviation?  Which do we expect has a higher SD, the residual plot for Ice Cream and Fudge or for Ice Cream and Sharks?

In [None]:
np.std(resids_ice_fudge), np.std(resids_ice_shark)

In [None]:
# Why is this the case?
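
One way to explore this is with the identity $SD_{residuals} = \sqrt{1 - r^2}\cdot SD_y$, which holds for the least-squares line. The sketch below checks it on two synthetic data sets, one tightly clustered around a line and one loosely clustered; the data and the helper `residual_summary` are hypothetical, made up for this example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Same x values, two levels of scatter around the same line (assumed data).
x = rng.uniform(0, 100, size=400)
y_tight = 2 * x + rng.normal(0, 5, size=400)
y_loose = 2 * x + rng.normal(0, 60, size=400)

def residual_summary(x, y):
    """Return (r, mean of residuals, SD of residuals, sqrt(1 - r^2) * SD of y)."""
    r = np.corrcoef(x, y)[0, 1]
    slope = r * np.std(y) / np.std(x)
    intercept = np.mean(y) - slope * np.mean(x)
    residuals = y - (slope * x + intercept)
    return r, np.mean(residuals), np.std(residuals), np.sqrt(1 - r**2) * np.std(y)

tight = residual_summary(x, y_tight)
loose = residual_summary(x, y_loose)

for name, summary in (("tight", tight), ("loose", loose)):
    r, mean_resid, sd_resid, sd_from_r = summary
    print(f"{name}: r = {r:.3f}, mean residual = {mean_resid:.2e}, "
          f"SD residual = {sd_resid:.3f}, sqrt(1 - r^2) * SD_y = {sd_from_r:.3f}")
```

In both cases the residuals average out to zero (up to rounding), and the weaker the correlation, the larger the spread of the residuals.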

# Ice Cream &Rightarrow; Shark Attacks?
---

Well, we've done it!  We've figured out that there's a very strong positive correlation between ice cream sales and shark attacks.  Does this mean that ice cream sales cause increased shark attacks?

In [None]:
# Does the strong correlation imply that Ice Cream Sales cause increased Shark Attacks?
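
A small simulation can help frame this question. In the sketch below, a hypothetical third variable, `temperature`, drives both simulated ice cream sales and simulated shark attacks, and neither of those depends on the other. Every number here is invented for illustration; nothing is taken from the course data set.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000

# Hypothetical lurking variable: daily temperature in degrees Celsius.
temperature = rng.uniform(10, 35, size=n)

# Both outcomes depend on temperature only; neither depends on the other.
ice_sim = 20 * temperature + rng.normal(0, 40, size=n)
shark_sim = 0.5 * temperature + rng.normal(0, 2, size=n)

def residuals_after(x, y):
    """Residuals of y after removing its least-squares line on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Raw correlation between the two simulated outcomes.
r_raw = np.corrcoef(ice_sim, shark_sim)[0, 1]

# Correlation that remains once temperature is accounted for in both.
r_given_temp = np.corrcoef(residuals_after(temperature, ice_sim),
                           residuals_after(temperature, shark_sim))[0, 1]

print("Raw correlation:                 ", r_raw)
print("Correlation after removing temp: ", r_given_temp)
```

The raw correlation is strong even though the simulation contains no causal link between the two outcomes, and it largely vanishes once the shared driver is removed.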

# Bootstrapping For Regression

We use bootstrapping in regression for two reasons:
1. To determine if a correlation is significant
2. To determine what range of values could be predicted given an input

## 1. Determine if a Correlation is Significant

In [None]:
# Ice Cream vs Fudge

correlations = make_array()
for i in range(10000):
    indexes = np.random.choice(len(ice), len(ice))
    
    bootstrap_ice_su = standard_units(ice[indexes])
    bootstrap_fudge_su = standard_units(fudge[indexes])
    
    correlations = np.append(correlations, np.mean(bootstrap_ice_su * bootstrap_fudge_su))

In [None]:
percentile(2.5, correlations), percentile(97.5, correlations)

In [None]:
# Ice Cream vs Shark Attacks

correlations = make_array()
for i in range(10000):
    indexes = np.random.choice(len(ice), len(ice))
    
    bootstrap_ice_su = standard_units(ice[indexes])
    bootstrap_shark_su = standard_units(shark[indexes])
    
    correlations = np.append(correlations, np.mean(bootstrap_ice_su * bootstrap_shark_su))

In [None]:
percentile(2.5, correlations), percentile(97.5, correlations)
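
The two loops above follow the same pattern, so they could be wrapped in a helper. The function below, `bootstrap_correlation_ci`, is a hypothetical name and a sketch only. It uses `np.percentile`, which interpolates between values and can differ slightly from the `percentile` function in the `datascience` library, and it runs on synthetic data rather than the course CSV.

```python
import numpy as np

def bootstrap_correlation_ci(x, y, repetitions=2000, level=95, seed=0):
    """Approximate confidence interval for r by resampling (x, y) pairs."""
    rng = np.random.default_rng(seed)
    n = len(x)
    correlations = np.empty(repetitions)
    for i in range(repetitions):
        # Resample row indexes with replacement so each x stays paired with its y.
        idx = rng.integers(0, n, size=n)
        correlations[i] = np.corrcoef(x[idx], y[idx])[0, 1]
    tail = (100 - level) / 2
    return np.percentile(correlations, [tail, 100 - tail])

# Synthetic, positively associated data (assumed for illustration).
rng = np.random.default_rng(6)
x = rng.uniform(0, 100, size=200)
y = 0.5 * x + rng.normal(0, 10, size=200)

lower, upper = bootstrap_correlation_ci(x, y)
print("Approximate 95% interval for r:", lower, upper)
```

If the interval does not contain 0, the data are inconsistent with a true correlation of zero at the chosen level.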

## 2. Determine the range of predictions for an input

In [None]:
# Use loop-local names so the r, slope, and intercept fitted on the full data are not overwritten
preds = make_array()
for i in range(10000):
    indexes = np.random.choice(len(ice), len(ice))

    boot_ice, boot_shark = ice[indexes], shark[indexes]

    boot_r = np.mean(standard_units(boot_ice) * standard_units(boot_shark))
    boot_slope = boot_r * np.std(boot_shark) / np.std(boot_ice)
    boot_intercept = np.mean(boot_shark) - boot_slope * np.mean(boot_ice)

    # Prediction at Ice Cream Sales = 100 for this resample
    preds = np.append(preds, boot_slope * 100 + boot_intercept)

In [None]:
percentile(2.5, preds), percentile(97.5, preds)

# Bonus!
---
In addition to the correlation coefficient, $r$, there is also a *"coefficient of determination"*, $r^2$, which is defined as:
> The proportion of the variance in the dependent variable that is explained by the best-fit line.

Let's interpret it in terms of Ice Cream Sales and Shark Attacks!

In [None]:
# Interpret (in words) the coefficient of determination

r2 = r**2
r2

```
The best-fit line explains 96% of the variation in the number of Shark Attacks.
```
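
The "proportion of variance explained" reading can be checked directly: for the least-squares line, $r^2 = 1 - \frac{Var(residuals)}{Var(y)}$. The sketch below verifies this on synthetic data; the arrays and their parameters are assumptions for the example, not the course data.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic, roughly linear data (assumed for illustration only).
x = rng.uniform(0, 100, size=300)
y = 1.5 * x + rng.normal(0, 20, size=300)

# Best-fit line from the correlation-based formulas.
r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
residuals = y - (slope * x + intercept)

# Two routes to the coefficient of determination.
r_squared = r ** 2
explained_share = 1 - np.var(residuals) / np.var(y)

print("r squared:                  ", r_squared)
print("1 - Var(residuals) / Var(y):", explained_share)
```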

# Thank you for a great quarter!
### TA Evaluations are open