# Re-Galtoning

In [None]:
# Don't change this cell; just run it.
import numpy as np  # The array library.
import pandas as pd
# Safe settings for Pandas.
pd.set_option('mode.chained_assignment', 'raise')

In this exercise, you will very likely find yourselves using

* [groupby](/useful-pandas/groupby)
* [merge](/useful-pandas/merge)

as well as some of your other Pandas skills.
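As a reminder of what these two methods do, here is a minimal sketch on toy data. The `orders` and `customers` tables and their column names are invented for illustration; they have nothing to do with the Galton files.

```python
import pandas as pd

# Hypothetical toy tables, unrelated to the Galton data.
orders = pd.DataFrame({
    'customer': ['a', 'a', 'b'],
    'amount': [10.0, 20.0, 5.0],
})
customers = pd.DataFrame({
    'customer': ['a', 'b'],
    'city': ['Leeds', 'York'],
})

# groupby: one summary value (here, the total amount) per customer.
totals = orders.groupby('customer')['amount'].sum().reset_index()

# merge: attach each customer's city to every matching order row.
orders_with_city = orders.merge(customers, on='customer')
```

`totals` has one row per customer, while `orders_with_city` keeps one row per order and gains a `city` column.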

## The data

The data for your task relates to a very famous paper by [Francis
Galton](https://en.wikipedia.org/wiki/Francis_Galton), published in 1886.
Galton was an extraordinarily versatile scientist who laid the groundwork for
early statistics, and particularly regression and correlation.  The paper we
are interested in here is:

> Galton, F. (1886). [Regression Towards Mediocrity in Hereditary Stature](
https://galton.org/essays/1880-1889/galton-1886-jaigi-regression-stature.pdf)
Journal of the Anthropological Institute, 15, 246-263

In fact, this paper is the origin of the term *regression* for fitting
prediction lines to data.

Galton was a keen eugenicist, and was very interested in inheritance.  In this
case he studied the relationship of children's heights to the heights of their
parents.

Galton asked families to give him data about:

* The father's height
* The mother's height
* The height and gender of each adult child in the family.

You can read more about the data files at the [Galton heights datasets
page](https://github.com/odsti/datasets/tree/regalton/galtons_heights).

## Reconstructing a dataframe

First, here is the data frame that you are aiming to reconstruct.  Your task is
to rebuild this table, including its data and column names, from the component data frames you will see further below.

In [None]:
# Data frame you are aiming to reconstruct.
combined = pd.read_csv('galton_combined.csv')
combined.head()

As you can see, this combined data frame has one row per adult child, along
with their parents' heights, and a unique identifier for the family, in the
`family` column.  We will come to `midparentHeight` later.

The components you will be using to reconstruct the `combined` data frame are the following data frames:

In [None]:
# Data frame with data about families.
families = pd.read_csv('galton_families.csv')
families.head()

This data frame has information about the families, but no information about the children.  Next:

In [None]:
# Data frame with data about the children.
children = pd.read_csv('galton_children.csv')
children.head()

## Mid-parent height

Galton wanted to predict the height of the adult children from the heights of
the parents.  He wanted one number to encapsulate the height of both parents,
and this number is `midparentHeight` in the `combined` data frame.

Women are not as tall as men, on average.  To adjust for this, Galton
multiplied the mother's height by 1.08 before averaging with the father's
height, to give `midparentHeight`.
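The calculation described above can be sketched on a toy table. The heights and the column names `dad`, `mum` and `mid` are made up for illustration, and are not the column names in the data files.

```python
import pandas as pd

# Hypothetical parent heights in inches, chosen only for illustration.
parents = pd.DataFrame({
    'dad': [70.0, 68.5],
    'mum': [65.0, 64.0],
})

# Scale the mother's height by 1.08, then average with the father's height.
parents['mid'] = (parents['dad'] + 1.08 * parents['mum']) / 2
```

Because Pandas arithmetic works on whole columns, the same expression gives one mid-parent value per row.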

## Ready, set

To recap: your task is to reconstruct the data of the `combined` data frame,
using the data from the `families` and `children` data frames.  Call the reconstructed data frame `reconstructed`.

Try to get the values in `reconstructed` to match `combined` as well as you
can.  Rename the columns to match the columns of `combined`.
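One possible way to check how close you are is sketched below, using `pd.testing.assert_frame_equal`. The `target` and `attempt` data frames are hypothetical stand-ins for `combined` and `reconstructed`, with invented values and column names.

```python
import pandas as pd

# Hypothetical stand-ins for `combined` and `reconstructed`.
target = pd.DataFrame({'family': ['001', '002'], 'height': [70.1, 65.5]})
attempt = pd.DataFrame({'height': [70.1, 65.5], 'family': ['001', '002']})

# Do the two data frames have the same column names, ignoring order?
same_columns = set(attempt.columns) == set(target.columns)

# Put the columns in the target's order before comparing values.
attempt = attempt[target.columns]

# Raises an AssertionError describing the first difference, if there is one.
# By default, floating point values are compared with a small tolerance.
pd.testing.assert_frame_equal(attempt, target)
```

If the rows of your data frame are in a different order from the target, you may need to sort both and reset their indices before a comparison like this.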

You will almost certainly find yourself using the `groupby` and `merge` methods from the links above.

Good luck!

In [None]:
#- Your code here.

In [None]:
# Make more cells as you need them

In [None]:
reconstructed = ...