In [None]:
import babypandas as bpd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
%matplotlib inline

# Normal CIs

## Question

100 people were asked about the duration of last night's sleep. Here are the results:

In [None]:
np.random.seed(42)
sleep = np.random.lognormal(2, .2, 100)

Construct a 95% confidence interval for the population's mean sleep duration.
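
One possible approach, sketched under the assumption that the Central Limit Theorem applies to the mean of a sample of this size: center the interval at the sample mean and extend it 1.96 standard deviations of the sample mean in each direction. The names `sd_of_mean`, `ci_mean_lower`, and `ci_mean_upper` are introduced here for illustration only.

```python
import numpy as np

# Same sample as in the cell above (seed and parameters taken from the notebook).
np.random.seed(42)
sleep = np.random.lognormal(2, .2, 100)

n = len(sleep)
sample_mean = sleep.mean()

# Estimated SD of the sample mean: sample SD divided by sqrt(sample size).
sd_of_mean = np.std(sleep) / np.sqrt(n)

# About 95% of a normal curve lies within 1.96 SDs of its center.
ci_mean_lower = sample_mean - 1.96 * sd_of_mean
ci_mean_upper = sample_mean + 1.96 * sd_of_mean
print(ci_mean_lower, ci_mean_upper)
```

The sample SD stands in for the unknown population SD here, which is a reasonable approximation only when the sample is fairly large.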

## Question

How would you construct a 95% CI for the population's *median* sleep duration?
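
The Central Limit Theorem describes sample means and sums, not medians, so a normal-curve interval does not directly apply here. A minimal sketch of one alternative, the bootstrap percentile method, follows. The 5,000 repetitions and the variable names are arbitrary choices made for this illustration.

```python
import numpy as np

# Same sample as in the notebook.
np.random.seed(42)
sleep = np.random.lognormal(2, .2, 100)

# Resample with replacement many times, recording each resample's median.
boot_medians = np.array([
    np.median(np.random.choice(sleep, len(sleep), replace=True))
    for _ in range(5000)
])

# The middle 95% of the bootstrapped medians forms the interval.
ci_median_lower = np.percentile(boot_medians, 2.5)
ci_median_upper = np.percentile(boot_medians, 97.5)
print(ci_median_lower, ci_median_upper)
```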

## Question

A scientist claims that the mean sleep duration is 8 hours. Your alternative hypothesis is that 8 hours of sleep is too high. Do you accept the alternative hypothesis? Use a 9% level of confidence.

## Question

600 people are polled about an upcoming election. 342 said they would vote for candidate A, while the rest said they would vote for candidate B.

- What is the 95% confidence interval for the population proportion of people who would vote for A?
- What is the 99% confidence interval?
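
A sample proportion is the mean of a collection of 0s and 1s, so the normal approximation can be used for it as well. The sketch below assumes that approximation is adequate at $n = 600$ and uses the rounded multipliers 1.96 and 2.58 for 95% and 99%. The variable names are illustrative.

```python
import numpy as np

n = 600
p_hat = 342 / n  # sample proportion voting for candidate A

# SD of a 0/1 population with proportion p is sqrt(p * (1 - p));
# the sample proportion is a mean, so divide by sqrt(n).
sd_of_p_hat = np.sqrt(p_hat * (1 - p_hat)) / np.sqrt(n)

# 1.96 and 2.58 are rounded normal-curve multipliers for 95% and 99%.
ci_95 = (p_hat - 1.96 * sd_of_p_hat, p_hat + 1.96 * sd_of_p_hat)
ci_99 = (p_hat - 2.58 * sd_of_p_hat, p_hat + 2.58 * sd_of_p_hat)
print(ci_95)
print(ci_99)
```

The 99% interval is wider than the 95% interval: more confidence requires covering more of the curve.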

# Correlation

In [None]:
tbl = bpd.read_csv("ice-cream-stats.csv")
tbl

In [None]:
# Let's set some variables to the columns for easier reference
ice = tbl.get('Ice Cream Sales (cones)')
fudge = tbl.get('Fudge Sale Volume (g)')
shark = tbl.get('Shark Attacks')

# First: relationships
---

Let's consider the following table:

|x|y|
|---|---|
|1|1|
|2|4|
|4|8|
|5|10|

**Question**: If we were given $x = 3$, what would we expect the corresponding $y$ to be? Why?

**Question**: Given $x = 6$, what would we expect $y$ to be?

## Question

Roughly what would we expect $Fudge\ Sale\ Volume$ to equal when $Ice\ Cream\ Sales = 100$?

Let's focus on our justification for coming up with a value for $Fudge$.  We saw a strong *relationship* between $Ice \ Cream$ and $Fudge$.  What does that mean?

If we're speaking statistically, then we consider `relationship == association`.  And an association really boils down to the following claim:  

<center>"When this changes value, then that changes value too."</center>

# Correlation
---

What does correlation (or the correlation coefficient) tell us?

## Question

Are $x$ and $y$ correlated? What do you think the correlation coefficient is (roughly)?

|x|y|
|---|---|
|1|1|
|2|4|
|4|8|
|5|10|

## Question

Are ice cream and fudge *associated*? Are they *correlated*?

## Question

What is the correlation coefficient between $x$ and $y$?

|x|y|
|---|---|
|1|1|
|2|4|
|4|8|
|5|10|
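
As a sketch of how this could be computed, the correlation coefficient is the mean of the products of the two variables in standard units. The helper `standard_units` and the names `x_tbl`, `y_tbl`, and `r_table` are hypothetical and introduced only for this example.

```python
import numpy as np

# The four rows of the table above.
x_tbl = np.array([1, 2, 4, 5])
y_tbl = np.array([1, 4, 8, 10])

def standard_units(values):
    """Convert an array to standard units (mean 0, SD 1)."""
    return (values - values.mean()) / np.std(values)

# r is the average product of the two variables in standard units.
r_table = (standard_units(x_tbl) * standard_units(y_tbl)).mean()
print(r_table)
```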

## Question

What is the correlation coefficient between ice cream and fudge?

## Question

Suppose we have converted the ice cream column to standard units. What is the mean of the new column? What about the standard deviation?
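
A quick numerical check, using made-up cone counts in place of the real column (the CSV is not assumed to be available here):

```python
import numpy as np

# Hypothetical cone counts, standing in for the real ice cream column.
cones = np.array([12.0, 30.0, 45.0, 51.0, 77.0])

# Standard units: subtract the mean, then divide by the SD.
cones_su = (cones - cones.mean()) / np.std(cones)

mean_su = cones_su.mean()
sd_su = np.std(cones_su)
print(mean_su, sd_su)
```

Up to floating-point rounding, the same two values come out for any column that is not constant.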

## Question

In words, what does our correlation coefficient mean?

# Regression

In [None]:
x = (shark - shark.mean()) / np.std(shark)

In [None]:
y = (ice - ice.mean()) / np.std(ice)

## Question

Are shark attacks and ice cream sales *associated*? What about *correlated*?

## Question

What is the *correlation coefficient*?

# Regression

There is a linear relationship between ice cream sales and shark attacks.

In [None]:
tbl.plot(kind='scatter', x="Ice Cream Sales (cones)", y="Shark Attacks")

It is still there if we convert to standard units:

In [None]:
x = (ice - ice.mean()) / np.std(ice)
y = (shark - shark.mean()) / np.std(shark)

In [None]:
plt.scatter(x, y)

## Correlation

The correlation coefficient is:

In [None]:
r = (x * y).mean()
r

What does the correlation coefficient have to do with the slope of this line?

## Deriving the slope

- We believe that $y_i \approx m x_i$, but what is the slope, $m$?
- We know that $\frac{1}{n} \sum_{i=1}^{n} x_i y_i = 0.98 = r$
- So
$$
\begin{align*}
    &y_i \approx m x_i
    \\
    &\implies
    x_i y_i \approx m x_i^2
    \\
    &\implies
    \sum x_i y_i \approx \sum m x_i^2
    \\
    &\implies
    \frac 1n \sum x_i y_i \approx \frac 1n \sum m x_i^2
    \\
    &\implies
    \frac 1n \sum x_i y_i \approx m \cdot \frac 1n \sum x_i^2
    \\
    &\implies
    r \approx m \times \operatorname{Variance}(x)
    \\
    &\implies
    r \approx m
\end{align*}
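$$

The last step holds because $x$ is in standard units: its mean is 0, so $\frac 1n \sum x_i^2$ is its variance, which equals 1.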
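
The conclusion can be checked numerically. The sketch below uses synthetic data (not the ice cream table) and `np.polyfit`, a least-squares fit that the notebook itself does not use, to confirm that the fitted slope in standard units matches $r$. All names and the data-generating numbers are assumptions made for illustration.

```python
import numpy as np

# Synthetic, roughly linear data standing in for the real columns.
rng = np.random.default_rng(0)
raw_x = rng.normal(50, 10, 200)
raw_y = 0.3 * raw_x + rng.normal(0, 2, 200)

# Convert both variables to standard units.
x_su = (raw_x - raw_x.mean()) / np.std(raw_x)
y_su = (raw_y - raw_y.mean()) / np.std(raw_y)

r_check = (x_su * y_su).mean()

# (1/n) * sum of x_i^2 is the variance of x in standard units.
mean_sq = (x_su ** 2).mean()

# Degree-1 least-squares fit; polyfit returns (slope, intercept).
slope_su, intercept_su = np.polyfit(x_su, y_su, 1)
print(r_check, mean_sq, slope_su, intercept_su)
```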

In [None]:
# Let's plot it!

xrange = np.linspace(x.min(), x.max(), 2)

plt.scatter(x, y)
plt.plot(xrange, xrange * r, c='r')
# Both axes are in standard units, not cones or attack counts.
plt.xlabel("Ice Cream Sales (standard units)")
plt.ylabel("Shark Attacks (standard units)");

## Question

Suppose there are 40 ice cream cones sold on a given day. What is the predicted number of shark attacks?

## Question

This was all in standard units. What is the slope of the line in *original* units? The intercept?

$$m = r\cdot \frac{SD_y}{SD_x}$$

$$b = mean_y - m\cdot mean_x$$
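
A minimal sketch of these two formulas in code, again on synthetic data rather than the real table. The arrays `cones` and `attacks` and the names `m_orig`, `b_orig`, and `predicted` are hypothetical. The comparison with `np.polyfit` is only a sanity check.

```python
import numpy as np

# Synthetic stand-ins for the ice cream and shark attack columns.
rng = np.random.default_rng(1)
cones = rng.normal(60, 15, 150)
attacks = 0.1 * cones + rng.normal(0, 1, 150)

def standard_units(values):
    """Convert an array to standard units (mean 0, SD 1)."""
    return (values - values.mean()) / np.std(values)

r_orig = (standard_units(cones) * standard_units(attacks)).mean()

# Slope and intercept in original units, from the formulas above.
m_orig = r_orig * np.std(attacks) / np.std(cones)
b_orig = attacks.mean() - m_orig * cones.mean()

# Prediction for a day with 10 cones sold (in this synthetic data).
predicted = m_orig * 10 + b_orig

# Sanity check against a direct least-squares fit.
fit_slope, fit_intercept = np.polyfit(cones, attacks, 1)
print(m_orig, b_orig, predicted)
```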

## Question

Suppose there are 10 ice cream cones sold on a given day. What is the predicted number of shark attacks?