In [None]:
from datascience import *
import numpy as np

In [None]:
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

In this lecture, I am going to use more interactive plots (they look better) so I am using the plotly.express library.  We won't test you on this but it's good to know.

In [None]:
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# Lecture 10

## Review: Standard Units

**Task**
Convert the children's heights to standard units

In [None]:
family_heights = Table.read_table('data/family_heights.csv')
family_heights
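One possible way to approach the task is sketched below. The heights in `example_heights` are made-up values for illustration only, not rows from the dataset. With the real table, the same function could be applied to `family_heights.column('child')`.

```python
import numpy as np

def standard_units(x):
    """Convert an array of numbers to standard units (mean 0, SD 1)."""
    x = np.asarray(x, dtype=float)
    return (x - np.mean(x)) / np.std(x)

# Made-up adult heights in inches, used only to illustrate the conversion.
example_heights = np.array([61.0, 64.5, 66.0, 68.5, 70.0, 73.2])
example_su = standard_units(example_heights)

print(np.round(example_su, 2))
# After conversion the mean is (approximately) 0 and the SD is (approximately) 1.
print(np.mean(example_su), np.std(example_su))
```

A value in standard units tells us how many standard deviations the original value lies above (positive) or below (negative) the mean.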

## Predicting Child Heights

Recall, long ago, in lecture 10 we built a function to predict child heights.  We started with [Galton's height dataset](https://galton.org/essays/1880-1889/galton-1886-jaigi-regression-stature.pdf), which contains the full-grown height of each child and the heights of both of their parents. We then computed the average height of the parents of each child.

The following is a simplified version of the data containing just the parents' average height and the child's height.

In [None]:
# Note: Child heights are the **adult** heights of children in a family
families = Table.read_table('data/family_heights.csv')
parent_avgs = (families.column('father') + families.column('mother'))/2
heights = Table().with_columns(
    'Parent Average', parent_avgs,
    'Child', families.column('child'),
)
heights

What was the relationship between height of the full grown child and the height of the parents?

In [None]:
#heights.iscatter('Parent Average', 'Child')
heights.scatter('Parent Average', 'Child')

### The Nearest Neighbor Predictions

Could we use this data to help us predict the adult height of a newborn child given the parents' heights?

In lecture 10, we developed a simple but effective process for predicting the height of a child given the average height of both their parents.  We looked at children in our data whose parents had similar average heights and then took the average height of those children.

In [None]:
def nearest_neighbor_predictor(parent_average, window=0.5):
    lower_bound = parent_average - window
    upper_bound = parent_average + window
    similar_child_heights = (
        heights
            .where("Parent Average", are.between(lower_bound, upper_bound))
            .column("Child")
    )
    return np.mean(similar_child_heights)

In [None]:
my_height = 5*12 + 11 # 5 ft 11 inches
spouse_height = 5*12 + 7 # 5 ft 7 inches
our_average = (my_height + spouse_height) / 2.0
our_average

In [None]:
window = 0.5
lower_bound = our_average - window
upper_bound = our_average + window
print(lower_bound, upper_bound)

In [None]:
nearest_neighbor_predictor(our_average)

In pictures, this nearest neighbor predictor looks like this:

In [None]:
heights.scatter('Parent Average', 'Child')
# You don't need to know the details of this plotting code yet.
plots.plot([lower_bound, lower_bound], [50, 85], color='red', lw=2)
plots.plot([our_average, our_average], [50, 85], color='orange', lw=2);
plots.plot([upper_bound, upper_bound], [50, 85], color='red', lw=2);

In [None]:
# Interactive plot

In [None]:
fig = px.scatter(x=heights.column('Parent Average'), y=heights.column('Child'))
fig.add_vline(our_average - 0.5)
fig.add_vline(our_average + 0.5)
fig.add_scatter(x=[our_average], y=[nearest_neighbor_predictor(our_average)], 
                name="Prediction", marker_size=10)

To get a sense as to how well our predictor works, we can apply it to each of the records in our dataset. Of course, we already know the height of the children for each of the records but this gives us a simple way to evaluate our predictions when we know the answer.

In [None]:
prediction = heights.apply(nearest_neighbor_predictor, 'Parent Average')
heights_with_predictions = heights.with_columns('Prediction', prediction)

In [None]:
#heights_with_predictions.iscatter('Parent Average')
heights_with_predictions.scatter('Parent Average')

The yellow line above is not actually a line but a curve.  It is a fairly advanced model capable of capturing complex non-linear relationships.  However, for many activities in data science we will be interested in a simple line that approximates the yellow curve. In this and the next few lectures we will build an intuition for the properties of this line and its derivation.

We will start with a mathematical description of the **linear association** between two variables: *a numerical measure of how closely two variables follow a line.*

---
<center>Return to Slides</center>

---

## Association

We already saw one example of an association between two variables:

In [None]:
heights_with_predictions.scatter('Parent Average')

Let's look at another dataset consisting of hybrid cars.  This dataset contains the vehicle model, the year it was released, the manufacturer's suggested retail price (msrp), the acceleration (in km/h/s so bigger *is* better), fuel efficiency (mpg), and the type of car (class).

In [None]:
hybrid = Table.read_table('data/hybrid.csv')
hybrid.show(5)

There are some expensive hybrids...

In [None]:
hybrid.sort('msrp', descending=True)

The first step in studying an association is to visualize the data.

In [None]:
hybrid.scatter('msrp', 'mpg')

In [None]:
hybrid.scatter('msrp', 'acceleration')

We could even consider plotting MPG, acceleration, and price all at once.

In [None]:
# Price on the x-axis, acceleration on the y-axis, and MPG as the marker size.
px.scatter(hybrid.to_df(), 
           x="msrp", 
           y="acceleration", 
           size="mpg",
           hover_name="vehicle", 
           color="class")

What kinds of associations do we observe?

---
<center>Return to Slides</center>

---

## Correlation

Correlation is a measure of the linear relationship between two variables. Before we show you how to compute correlation, let's build an intuition for what it means.  To do that we will use the following helper function to generate data with different correlation values.

This is a helper function that generates and plots synthetic data with a given $r$ value. You are not expected to understand how this function works as it is well beyond the scope of this class.

In [None]:
def make_correlated_data(r, n=500):
    "Generate a table with columns x and y with a correlation of approximately r"
    x = np.random.normal(0, 1, n)
    z = np.random.normal(0, 1, n)
    y = r*x + (np.sqrt(1-r**2))*z
    return Table().with_columns("x", x, "y", y)

def r_scatter(r, n=500, ax=None):
    "Generate a scatter plot with a correlation of approximately r"
    x = np.random.normal(0, 1, n)
    z = np.random.normal(0, 1, n)
    y = r*x + (np.sqrt(1-r**2))*z
    if ax is not None:
        # Draw on the axes supplied by the caller (e.g., one panel of a grid).
        ax.scatter(x, y, color='darkblue', s=20)
        ax.set_xlim(-4, 4)
        ax.set_ylim(-4, 4)
    else:
        # No axes supplied: create a new square figure for this plot.
        plots.figure(figsize=(5,5))
        plots.scatter(x, y, color='darkblue', s=20)
        plots.xlim(-4, 4)
        plots.ylim(-4, 4)

In [None]:
r_scatter(0.0)

In [None]:
fig, ax = plots.subplots(2, 3, dpi=80, figsize=(16,9))
n = 500
r_scatter(0.2, n, ax[0,0])
r_scatter(0.5, n, ax[0,1])
r_scatter(0.8, n, ax[0,2])
r_scatter(-0.2, n, ax[1,0])
r_scatter(-0.5, n, ax[1,1])
r_scatter(-0.8, n, ax[1,2])
fig.tight_layout(pad=1)
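Why does mixing `x` and `z` this way produce a correlation of about $r$? Since `x` and `z` are independent with mean 0 and SD 1, the variance of `y = r*x + sqrt(1-r**2)*z` is $r^2 + (1 - r^2) = 1$, and the average product of `x` and `y` is $r$. The sketch below checks this empirically. The helper name `sample_correlation`, the seed, and the sample size are my own choices for illustration, and `np.corrcoef` is NumPy's built-in correlation function.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for reproducibility

def sample_correlation(r, n=100_000):
    """Generate data with the same recipe as the helper and measure its correlation."""
    x = rng.normal(0, 1, n)
    z = rng.normal(0, 1, n)
    y = r * x + np.sqrt(1 - r**2) * z
    # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation.
    return np.corrcoef(x, y)[0, 1]

for r in [-0.8, -0.2, 0.0, 0.5, 0.8]:
    print(r, round(sample_correlation(r), 3))
```

The measured values should land close to, but not exactly on, the requested $r$ because the data are random.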

### Computing the Correlation

To derive the correlation, we start by converting our data to standard units.

Recall in previous lectures we introduced a function to transform our data into standard units.

In [None]:
def standard_units(x):
    "Convert any array of numbers to standard units."
    return (x - np.average(x)) / np.std(x)

In [None]:
t = Table().with_columns('x', [1,2,3,4,5,6], 'y', [1,3,3.2,5,5.1,7])
t

In [None]:
t.scatter('x', 'y')

**Question** 
What is the correlation coefficient of these two variables?

**Tasks**
- add the standard units
- multiply the standard units of x and y for each row
- calculate the average of the products

--> The average of the products of the standard units IS the correlation coefficient
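The three tasks can be sketched with plain NumPy arrays, using the same six `x` and `y` values as the small table above. The variable names are my own, and the last line compares the result against NumPy's built-in `np.corrcoef` as a sanity check.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([1, 3, 3.2, 5, 5.1, 7])

# Task 1: convert each variable to standard units.
x_su = (x - np.mean(x)) / np.std(x)
y_su = (y - np.mean(y)) / np.std(y)

# Task 2: multiply the standard units of x and y for each row.
products = x_su * y_su

# Task 3: average the products. This average is the correlation coefficient.
r = np.mean(products)
print(r)

# Sanity check against NumPy's built-in correlation.
print(np.corrcoef(x, y)[0, 1])
```

The points lie close to an upward-sloping line, so the result is close to 1.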

Let's add standard unit (SU) versions of the `mpg`, `msrp`, and `acceleration` columns to our hybrid table.

In [None]:
hybrid = hybrid.with_columns(
    "mpg (SU)", standard_units(hybrid.column('mpg')),
    "msrp (SU)", standard_units(hybrid.column('msrp')),
    "acceleration (SU)", standard_units(hybrid.column('acceleration')),
)
hybrid.show(5)

How does this change the plots?

In [None]:
hybrid.scatter('msrp', 'acceleration')

In [None]:
hybrid.scatter("msrp (SU)", "acceleration (SU)")

In the earlier interactive plot, I could not use standard units for the marker size because marker sizes cannot be negative, and every value below the mean is negative in standard units.

The correlation is the **average** of the **product** of the **standard units** of each variable.

\begin{align}
r & = \frac{1}{n}\sum_{i=1}^n \left( \frac{x_i - \text{Mean}(x)}{\text{Stdev}(x)} \right) * \left( \frac{y_i - \text{Mean}(y)}{\text{Stdev}(y)} \right) \\
 & = \frac{1}{n} \sum_{i=1}^n \text{StandardUnits}(x_i) *  \text{StandardUnits}(y_i)\\
 & = \text{Mean}\left(\text{StandardUnits}(x) *  \text{StandardUnits}(y)\right)
\end{align}


In [None]:
np.mean(hybrid.column("acceleration (SU)") * hybrid.column("msrp (SU)"))

A positive correlation close to 1 would mean that when acceleration is larger than its mean, msrp tends to be larger than its mean as well, so most of the products are positive.  Looking at the histogram of the products we see:

In [None]:
Table().with_column("Product", hybrid.column("acceleration (SU)") * hybrid.column("msrp (SU)")).hist("Product", bins=20)

### Defining the Correlation Function

Let's define a function that computes the correlation between two columns in a table.
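One way to write it, as a minimal sketch that follows the formula above directly. The signature `correlation(t, x, y)` is an assumption inferred from how the function is called in the cells that follow: `t` is a `Table` and `x` and `y` are column labels. `standard_units` is repeated here only so the cell runs on its own.

```python
import numpy as np

def standard_units(x):
    """Convert any array of numbers to standard units."""
    return (x - np.average(x)) / np.std(x)

def correlation(t, x, y):
    """Correlation between columns x and y of table t.

    Assumes t provides a .column(label) method that returns a NumPy array,
    as the datascience Table does.
    """
    x_su = standard_units(t.column(x))
    y_su = standard_units(t.column(y))
    # The correlation is the average of the products of the standard units.
    return np.mean(x_su * y_su)
```

Because the function only multiplies and averages, it returns the same number no matter which column is passed first.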

Can you guess the value of the correlation for each of the following relationships:

In [None]:
fig = make_subplots(1,3)
fig.add_scatter(x=hybrid.column("msrp"), y=hybrid.column("mpg"), mode="markers", row=1, col=1)
fig.add_scatter(x=hybrid.column("msrp"), y=hybrid.column("acceleration"), mode="markers", row=1, col=2)
fig.add_scatter(x=hybrid.column("mpg"), y=hybrid.column("acceleration"), mode="markers", row=1, col=3)
fig.update_xaxes(title_text="msrp", row=1, col=1)
fig.update_yaxes(title_text="mpg", row=1, col=1)
fig.update_xaxes(title_text="msrp", row=1, col=2)
fig.update_yaxes(title_text="acceleration", row=1, col=2)
fig.update_xaxes(title_text="mpg", row=1, col=3)
fig.update_yaxes(title_text="acceleration", row=1, col=3)
fig.update_layout(showlegend=False)

In [None]:
correlation(hybrid, "msrp", "mpg")

In [None]:
correlation(hybrid, "msrp", "acceleration")

In [None]:
correlation(hybrid, "mpg", "acceleration")

### Switching Axes

What happens if we swap the axes?

In [None]:
hybrid.scatter("msrp", "acceleration")

In [None]:
hybrid.scatter("acceleration", "msrp")

In [None]:
correlation(hybrid, "msrp", "acceleration")

In [None]:
correlation(hybrid, "acceleration", "msrp")

<details><summary>Solution</summary>

Switching axes doesn't affect the correlation.  It is a symmetric function.

</details>

---

<center>Return to Slides</center>

---

## Care when Interpreting Correlation

When computing correlation it is important to always visualize your data first and then consider each of the following issues.


### Correlation does Not Imply Causation

We have covered this one extensively at this point.  

### Nonlinearity

Low correlation does not imply the absence of a relationship. Correlation measures only linear relationships.  Data with a strong nonlinear relationship may have very low correlation.

In [None]:
new_x = np.arange(-4, 4.1, 0.5)
nonlinear = Table().with_columns('x', new_x, 'y', new_x**2)
nonlinear.scatter('x', 'y')

There is clearly a relationship in this data.  Given the value of $x$ you can easily predict the value of $y$.  What is the correlation?

In [None]:
correlation(nonlinear, 'x', 'y')
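To see why the correlation comes out as essentially zero here, it helps to look at the individual products of standard units. The sketch below rebuilds the same `x` and `y` values with plain NumPy (the variable names are my own). Because the `x` values are symmetric around 0 and $y = x^2$ is the same for $x$ and $-x$, every product on the right side is cancelled by an equal and opposite product on the left.

```python
import numpy as np

x = np.arange(-4, 4.1, 0.5)
y = x**2

x_su = (x - np.mean(x)) / np.std(x)
y_su = (y - np.mean(y)) / np.std(y)
products = x_su * y_su

# Each product is mirrored by one of equal size and opposite sign,
# so the average of the products (the correlation) is zero up to rounding.
print(np.round(products, 3))
print(np.mean(products))
```

So a correlation near zero here says only that there is no *linear* trend, not that $x$ tells us nothing about $y$.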

As a quick aside, how would our nearest neighbor predictor work on this nonlinear data?

In [None]:
def nn_predictor(x):
    return np.mean(nonlinear.where("x", are.between(x-0.51, x+0.51)).column("y"))

In [None]:
nonlinear.with_column("Prediction", nonlinear.apply(nn_predictor, "x")).scatter("x")

### Outliers

Outliers can have a significant effect on correlation.  

In [None]:
line = Table().with_columns(
        'x', make_array(1, 2, 3, 4),
        'y', make_array(1, 2, 3, 4)
    )
line.scatter('x', 'y')

In [None]:
correlation(line, 'x', 'y')

In [None]:
outlier = Table().with_columns(
        'x', make_array(1, 2, 3, 4, 5),
        'y', make_array(1, 2, 3, 4, 0)
    )
outlier.scatter('x', 'y')

In [None]:
correlation(outlier, 'x', 'y')

### Ecological Correlations

The correlation between aggregated variables (e.g., after grouping) may be much higher than the correlation between the underlying variables.

In [None]:
sat2014 = Table.read_table('data/sat2014.csv').sort('State')
sat2014

In [None]:
sat2014.scatter('Critical Reading', 'Math')

In [None]:
correlation(sat2014, 'Critical Reading', 'Math')

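That is a very strong correlation.  However, each data point is a state average that summarizes a large cloud of individual test takers, and the scores of those individuals vary far more than the state averages do.  The correlation between individual reading and math scores may therefore be much lower.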
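We do not have individual SAT scores, but a small simulation can illustrate the effect. Everything below is synthetic: the number of groups, the group sizes, and the score spreads are assumptions I picked for illustration, not properties of the SAT data. Each hypothetical group has its own average level, and each person's two scores are that level plus a lot of independent noise.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, chosen arbitrarily
n_groups, n_per_group = 50, 200  # hypothetical groups and people per group

# Each group has an average level that is shared by both scores.
group_level = rng.normal(500, 40, n_groups)

# Individual scores: the group level plus large, independent person-level noise.
reading_scores = group_level[:, None] + rng.normal(0, 100, (n_groups, n_per_group))
math_scores = group_level[:, None] + rng.normal(0, 100, (n_groups, n_per_group))

# Correlation across all individuals.
individual_r = np.corrcoef(reading_scores.ravel(), math_scores.ravel())[0, 1]

# Correlation across group averages (the "ecological" correlation).
group_r = np.corrcoef(reading_scores.mean(axis=1), math_scores.mean(axis=1))[0, 1]

print("individual-level r:", round(individual_r, 3))
print("group-level r:     ", round(group_r, 3))
```

Averaging within each group washes out the person-level noise, so the group averages line up much more tightly than the individuals do.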

### Bonus: Understanding the SAT data
While we have the data loaded, does anyone have a guess which dots correspond to which states?

In [None]:
def rate_code(x):
    if x<25:
        return 'low'
    elif x<50:
        return 'low-moderate'
    elif x<75: 
        return 'moderate-high'
    else:
        return 'high'
    
rate_codes = sat2014.apply(rate_code, 'Participation Rate')
sat2014 = sat2014.with_column('Rate Code', rate_codes)
sat2014

In [None]:
sat2014.scatter('Critical Reading', 'Math', group='Rate Code')

In [None]:
px.scatter(sat2014.to_df(), 
           x = "Critical Reading",
           y = "Math",
           hover_name = "State",
           size = "Participation Rate")