In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

# Lecture 09 

In today's lecture, we will:
1. Review functions and applying functions to tables by building a simple but sophisticated prediction function.
2. Introduce the group operation.


## Prediction

Can we predict how tall a child will grow based on the height of their parents?

To do this, we will use [Galton's famous height dataset](https://galton.org/essays/1880-1889/galton-1886-jaigi-regression-stature.pdf), which was collected to demonstrate the connection between parents' heights and the heights of their children.

In [None]:
families = Table.read_table('data/family_heights.csv')
families

**Discussion:** This data was collected for Europeans living in the late 1800s.  What are some of the potential issues with this data?

### Exploring the Data

**Exercise:** Add a column `"parent average"` containing the average height of both parents.

<details> <summary>Click for Solution</summary>

```python
families = families.with_column(
    "parent average", (families.column('father') + families.column('mother'))/2.0
)
families
```
</details>

What is the relationship between a child's height and the average height of their parents?

**Exercise:** Make a scatter plot showing the relationship between the `"parent average"` and the `"child"` height.

<details> <summary>Click for Solution</summary> <br><br>
    
```python

families.scatter("parent average", "child")

```
    
<br><br></details>

**Questions:**
1. Do we observe a relationship between child and parent height?
2. Would a line plot help reveal that relationship? 
3. Could we learn something from a histogram?

### Making a Prediction

If we wanted to predict the height of a child given the heights of the parents, we could look at the heights of children whose parents have a similar average height.


In [None]:
my_height = 5*12 + 8 # 5 ft 8 inches
spouse_height = 5*12 + 7 # 5 ft 7 inches

In [None]:
our_average = (my_height + spouse_height) / 2.0
our_average

Let's look at parents whose average height is within 1 inch of our average height.

In [None]:
window = 1 
lower_bound = our_average - window
upper_bound = our_average + window

In [None]:
families.scatter('parent average', 'child')
# You don't need to know the details of this plotting code yet.
plots.plot([lower_bound, lower_bound], [50, 85], color='red', lw=2)
plots.plot([our_average, our_average], [50, 85], color='orange', lw=2);
plots.plot([upper_bound, upper_bound], [50, 85], color='red', lw=2);

**Exercise:** Create a function that takes the average of the parents' heights and returns *an __array__ of all the children's heights* whose parents' average height is within the window of that value.

In [None]:
def similar_child_heights(parent_average):
    pass

<details> <summary>Click for Solution</summary> <br><br>   

```python
def similar_child_heights(parent_average):
    lower_bound = parent_average - window
    upper_bound = parent_average + window
    return (
        families
            .where("parent average", are.between(lower_bound, upper_bound))
            .column("child")
    )
```

<br><br></details>

Testing the function:

In [None]:
# window = 1.0
similar_child_heights(our_average)

**Exercise:** Create a function to predict the child's height as the average of the heights of the children whose parents' average height is within the window of the given parent average.

In [None]:
def predict_child_height(parent_average):
    pass 

<details> <summary>Click for Solution</summary> <br><br>   

```python
def predict_child_height(parent_average):
    return np.average(similar_child_heights(parent_average))
```

<br><br></details>

In [None]:
predict_child_height(our_average)

Let's plot the predicted height as well as the distribution of children's heights:

In [None]:
# window = 1.0
similar = similar_child_heights(our_average)
predicted_height = predict_child_height(our_average)

print("Mean:", predicted_height)
Table().with_column("child", similar).hist("child", bins=20)
plots.plot([predicted_height, predicted_height], [0, .1], color="red")

**Discussion:** Is this a good predictor? How would I know? What happens when I change window size?
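
To get a feel for the window-size question, here is a minimal plain-Python sketch of the same windowed-average idea. The `toy_families` data and the `toy_predict` helper are made up for illustration and are not part of the Galton dataset or the `datascience` library. The sketch uses a half-open interval (lower bound included, upper bound excluded), which is my understanding of how `are.between` behaves.

```python
from statistics import mean

# Made-up (parent average, child height) pairs in inches; NOT the Galton data.
toy_families = [
    (64.0, 63.0), (66.0, 66.5), (67.0, 65.0), (67.5, 68.0),
    (68.0, 70.0), (69.0, 67.0), (70.0, 71.0), (72.0, 73.0),
]

def toy_predict(parent_average, window):
    """Average the child heights whose parent average is within the window.

    Returns the prediction and the number of children that were averaged.
    """
    lower_bound = parent_average - window
    upper_bound = parent_average + window
    similar = [
        child
        for parent, child in toy_families
        if lower_bound <= parent < upper_bound  # half-open interval (assumption)
    ]
    return mean(similar), len(similar)

# Try a narrow, a medium, and a wide window around the same parent average.
for window in [0.5, 1.5, 4]:
    prediction, count = toy_predict(67.5, window)
    print("window:", window, "children used:", count, "prediction:", round(prediction, 2))
```

A narrow window averages only a few children, so the prediction can jump around from one parent average to the next. A wide window uses more children but includes families that are less similar to ours. Note that `mean` raises an error if no children fall inside the window.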

### Evaluating the Predictions

To evaluate the predictions, let's see how the predictions compare to the actual heights of all the children in our dataset.  


**Exercise:** Apply the function (using `apply`) to all the parent averages in the table and save the result to the `"predicted"` column.

<details> <summary>Click for Solution</summary> <br><br>   

```python
# window = 0.5
families = families.with_column(
    "predicted", families.apply(predict_child_height, "parent average"))
families
```

<br><br></details>
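
Conceptually, `apply` calls the function once per row, passing in the values from the named columns, and collects the results. The following plain-Python sketch illustrates that idea with lists standing in for table columns. The data and the `double` and `difference` helpers are made up for illustration, and the real `apply` returns an array rather than a list.

```python
# Made-up columns standing in for a table; names and values are illustrative only.
parent_average = [66.0, 67.5, 70.0]
child = [65.0, 69.0, 71.5]

def double(x):
    """A one-argument function, used with a single column."""
    return 2 * x

def difference(a, b):
    """A two-argument function, used with two columns."""
    return a - b

# One input column: call the function once for each value in the column.
one_column_result = [double(x) for x in parent_average]

# Two input columns: pair up the values row by row, then call the function on each pair.
two_column_result = [difference(a, b) for a, b in zip(parent_average, child)]

print(one_column_result)
print(two_column_result)
```

The order of the column names matters: the first named column supplies the first argument, the second supplies the second argument, and so on.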

**Exercise:** Construct a scatter plot with the `"parent average"` height on the x-axis and the `"child"` height and the `"predicted"` height on the y-axis. 

In [None]:
(
    families
    .select('parent average','child', 'predicted')
    .scatter('parent average')
)

<details> <summary>Click for Solution</summary> <br><br>   

```python
(
    families
    .select('parent average','child', 'predicted')
    .scatter('parent average')
)
```

<br><br></details>

**Discussion:** What do we see in this plot? What trends stand out?

**Exercise:** Define a function to compute the error (the difference) between the predicted value and the true value and apply that function to the table adding a column containing the `"error"`.  Then construct a histogram of the errors.


In [None]:
def error(predicted, true_value):
    pass

families = families.with_column("error", families.apply(error, "predicted", "child"))
families

<details> <summary>Click for Solution</summary> <br><br>   

```python
def error(predicted, true_value):
    return predicted - true_value

families = families.with_column(
    "error", families.apply(error, "predicted", "child"))
families
```

<br><br></details>

Visualizing the distribution of the errors:

In [None]:
families.hist('error')

**Discussion:** Is this good?
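
One possible way to make "good" more concrete is to summarize the errors with a few numbers. The sketch below uses made-up errors (not values computed from the Galton data), and the three summaries are common choices rather than anything this lecture prescribes.

```python
from statistics import mean

# Made-up errors (predicted - actual) in inches; NOT computed from the Galton data.
toy_errors = [2.0, -1.0, 0.5, -3.0, 1.5]

# Signed errors can cancel out, so a mean near zero does not mean every prediction is close.
mean_error = mean(toy_errors)

# The mean absolute error describes the typical size of a miss.
mean_absolute_error = mean(abs(e) for e in toy_errors)

# The root mean squared error penalizes large misses more heavily.
root_mean_squared_error = mean(e ** 2 for e in toy_errors) ** 0.5

print("mean error:", mean_error)
print("mean absolute error:", mean_absolute_error)
print("root mean squared error:", round(root_mean_squared_error, 3))
```

In this toy example the signed errors average to zero even though individual predictions miss by up to 3 inches, which is why the histogram of errors is worth looking at alongside any single summary number.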

### Split by female and male

**Exercise:** Overlay the histograms of the error for male and female children.

*Hint:* Use the keyword argument `group` in `hist()`.

<details> <summary>Click for Solution</summary> <br><br>   

```python
families.hist('error', group='sex')
```

<br><br></details>

**Discussion:** What do we observe?

### Building a Better Predictor

Based on what we observed, let's build a better predictor. 

**Exercise:** Implement a new height prediction function that averages the heights of children with the same sex whose parents had a similar average height.

*Hint:* Here is the previous predictor written as a single function:
```python
def predict_child_height(parent_average):
    lower_bound = parent_average - window
    upper_bound = parent_average + window
    # Average the heights of children whose parents' average is within the window.
    return np.average(
        families
            .where("parent average", are.between(lower_bound, upper_bound))
            .column("child")
    )
```

<details> <summary>Click for Solution</summary> <br><br>   

```python
def predict_child_height_with_sex(parent_average, sex):
    lower_bound = parent_average - window
    upper_bound = parent_average + window
    return np.average(
        families
        .where("sex", sex)
        .where("parent average", are.between(lower_bound, upper_bound))
        .column("child")
    )
```

<br><br></details>

Let's test it out.

In [None]:
predict_child_height_with_sex(our_average, "male")

In [None]:
predict_child_height_with_sex(our_average, "female")

**Exercise:** Apply the better predictor to the table and save the predictions in a column called `"predicted with sex"`.

In [None]:
predicted_with_sex = families.apply(predict_child_height_with_sex, "parent average", "sex")
families = families.with_column("predicted with sex", predicted_with_sex)
families

<details> <summary>Click for Solution</summary> <br><br>   

```python
families = families.with_column(
    "predicted with sex", families.apply(predict_child_height_with_sex, "parent average", "sex"))
families
```

<br><br></details>

**Exercise:** Construct a histogram of the new errors broken down by the sex of the child.

In [None]:
error_with_sex = families.apply(error, "predicted with sex", "child")
families = families.with_column("error with sex",  error_with_sex)

families.hist("error with sex", group="sex")

As a point of comparison, here are the errors from the original predictor:

In [None]:
families.hist("error", group="sex")

---
<center> Return to slides </center>

---

## Grouping

For this part of the notebook we will use the following toy data:

In [None]:
cones = Table.read_table('data/cones.csv')
cones

**Exercise:** Use the `group` function to determine the number of cones with each flavor.

<details> <summary>Click for Solution</summary> <br><br>   

```python
cones.group('Flavor')
```

<br><br></details>

**Exercise:** Use the `group` function to compute the average price of cones for each flavor.

<details> <summary>Click for Solution</summary> <br><br>   

```python
cones.group('Flavor', np.average)
```

<br><br></details>

**Exercise:** Use the `group` function to compute the minimum price of cones for each flavor.

**Question:**
Why does the color column also have a min?

<details> <summary>Click for Solution</summary> <br><br>   

```python
cones.group('Flavor', np.min)
```

<br><br></details>
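
One thing to keep in mind when thinking about the question above is that strings can be ordered too. The sketch below uses Python's built-in `min` on made-up values (the actual contents of `cones.csv` may differ) to show that the minimum of a list of strings is the one that sorts first alphabetically.

```python
# Made-up values for illustration; the actual contents of cones.csv may differ.
toy_prices = [3.55, 4.75, 5.25, 5.25]
toy_colors = ["pink", "light brown", "dark brown", "pink"]

# For numbers, min returns the smallest value.
smallest_price = min(toy_prices)

# For strings, min returns the one that comes first in alphabetical order.
first_color = min(toy_colors)

print(smallest_price)
print(first_color)
```

Whether such a minimum is meaningful is a separate question: the cheapest price is useful, while the alphabetically first color usually is not.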

What is really going on:

In [None]:
cones

In [None]:
def my_grp(grp):
    print(grp)
    return grp

cones.group("Flavor", my_grp)
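
As the printed output suggests, `group` with a second argument gathers the values belonging to each group and calls the function once per group. A minimal plain-Python sketch of that idea follows. The `toy_cones` rows and the `toy_group` helper are hypothetical names made up for illustration, and this sketch handles only a single value column, whereas the real `group` does this for every remaining column.

```python
# Made-up rows of (flavor, price); the actual contents of cones.csv may differ.
toy_cones = [
    ("strawberry", 3.55), ("chocolate", 4.75), ("chocolate", 5.25),
    ("strawberry", 5.25), ("chocolate", 5.25), ("bubblegum", 4.75),
]

def toy_group(rows, collect):
    """Gather the values for each key, then call `collect` once per key."""
    values_by_key = {}
    for key, value in rows:
        values_by_key.setdefault(key, []).append(value)
    # Sort the keys so the result has a predictable order.
    return {key: collect(values_by_key[key]) for key in sorted(values_by_key)}

# Counting: the function receives all prices for one flavor and returns how many there are.
counts = toy_group(toy_cones, len)

# Averaging: the function receives all prices for one flavor and returns their mean.
averages = toy_group(toy_cones, lambda prices: sum(prices) / len(prices))

# Like my_grp above: return each group unchanged to see what the function receives.
raw_groups = toy_group(toy_cones, lambda prices: prices)

print(counts)
print(averages)
print(raw_groups)
```

The key point is that the function passed to `group` sees a whole collection of values at a time, not one value at a time, so it should be a function that reduces a collection to a single result.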