# High School Heights Dataset

You will find three datasets containing the heights of high school students.

All heights are in __inches__. 

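__The data is simulated__. The heights are drawn from two normal distributions, one for boys and one for girls, each with its own mean and standard deviation. The table below lists the parameters used for generation, not statistics computed from the samples.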

|   Height Statistics (inches)    | Boys| Girls |
| ----------- | ----------- | ----------- |
| Mean       | 67       | 62 |
| Standard Deviation   | 2.9        | 2.2 |

There are 500 measurements for each gender.
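As a quick worked example, the parameters above also determine the mean and standard deviation of the pooled population (boys and girls together, in equal numbers). The sketch below applies the standard formula for an equal-weight mixture of two distributions. The variable names are illustrative and not part of the original notebook.

```python
import math

# Generation parameters from the table above (inches).
mean_boys, sd_boys = 67.0, 2.9
mean_girls, sd_girls = 62.0, 2.2

# Equal weights, because there are 500 boys and 500 girls.
mix_mean = 0.5 * mean_boys + 0.5 * mean_girls

# Mixture variance = average within-group variance + between-group variance.
within = 0.5 * (sd_boys**2 + sd_girls**2)
between = 0.25 * (mean_boys - mean_girls) ** 2
mix_sd = math.sqrt(within + between)

print(f"pooled mean = {mix_mean:.2f}, pooled SD = {mix_sd:.2f}")
```

The pooled standard deviation (about 3.59 inches) is larger than either group's, because the 5-inch gap between the group means adds to the spread.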

Here are the datasets:

* __hs_heights.csv__: contains a single column with heights for all boys and girls. The values are shuffled, so there is no way to tell which ones belong to boys and which to girls.

* __hs_heights_pair.csv__: has two columns. The first column contains the boys' heights. The second column contains the girls' heights.

* __hs_heights_flag.csv__: has two columns. The first column is the flag __is_girl__. The second column, __height__, contains a girl's height if the flag is __1__ and a boy's height if the flag is __0__.
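
The three layouts carry the same measurements, so the flag layout can be converted into the other two. Here is a minimal sketch, assuming the column names `is_girl` and `height` for the flag file and `boys` and `girls` for the pair file. The four rows are made up for illustration and stand in for the real CSV.

```python
import io
import pandas as pd

# Tiny stand-in for hs_heights_flag.csv (values are made up, not real rows).
flag_csv = io.StringIO("is_girl,height\n0,67.1\n1,62.3\n0,70.4\n1,60.9\n")
flag_df = pd.read_csv(flag_csv)

# Split the flag layout into one series per group.
boys = flag_df.loc[flag_df["is_girl"] == 0, "height"].reset_index(drop=True)
girls = flag_df.loc[flag_df["is_girl"] == 1, "height"].reset_index(drop=True)

# Two-column "pair" layout: one column per group.
pair_df = pd.DataFrame({"boys": boys, "girls": girls})

# Single-column layout: drop the flag and keep only the heights.
combined = flag_df["height"]

print(pair_df)
```

The reverse conversion is not possible from the single-column file, since the group labels are not stored there.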



## How to (re)generate these datasets

Here's the code to create these datasets:
  

In [1]:
import numpy as np
import pandas as pd

### Generate heights from normal distributions

In [2]:
# Other seeds that also gave good samples: 109 or 100.
# Possible alternatives: 148 (tallest), 151, 122.
np.random.seed(180)

boys = np.random.normal(loc=67, scale=2.9, size=500)
girls = np.random.normal(loc=62, scale=2.2, size=500)

boys = boys.round(2)
girls = girls.round(2)
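
`np.random.seed` sets NumPy's legacy global random state, which the later `np.random.shuffle` call also draws from. As an alternative, the sketch below uses a local `Generator` created with `np.random.default_rng`. The helper name `generate_heights` is hypothetical, and this generator produces a different stream than the legacy one, so the values will not match the published CSV files.

```python
import numpy as np

def generate_heights(seed=180, n=500):
    """Return (boys, girls) height arrays in inches (hypothetical helper)."""
    rng = np.random.default_rng(seed)  # local generator, no global state
    boys = rng.normal(loc=67, scale=2.9, size=n).round(2)
    girls = rng.normal(loc=62, scale=2.2, size=n).round(2)
    return boys, girls

# The same seed always yields the same arrays.
boys_a, girls_a = generate_heights()
boys_b, girls_b = generate_heights()
print(np.array_equal(boys_a, boys_b), np.array_equal(girls_a, girls_b))
```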

### Dataset: `hs_heights.csv`

In [3]:
heights_combined = np.concatenate([boys, girls])

np.random.shuffle(heights_combined)

pd.DataFrame(heights_combined).to_csv("hs_heights.csv", index=False)

In [4]:
pd.read_csv('hs_heights.csv')

|     | 0 |
| --- | --- |
| 0 | 61.53 |
| 1 | 63.55 |
| 2 | 63.96 |
| 3 | 64.66 |
| 4 | 63.88 |
| ... | ... |
| 995 | 66.19 |
| 996 | 67.77 |
| 997 | 64.65 |
| 998 | 63.57 |
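
The column header in this file is `0` because the DataFrame was built from a bare array, so pandas labels the column with the integer 0. The sketch below shows that behavior on three of the values above, and a hypothetical variant with an explicit `height` column name. The variant is not what the notebook does, and it would change the file's header.

```python
import io
import numpy as np
import pandas as pd

heights = np.array([61.53, 63.55, 63.96])  # first three values shown above

# An unnamed column is labeled 0, which is written as the header "0".
unnamed_csv = pd.DataFrame(heights).to_csv(index=False)

# Hypothetical alternative: name the column explicitly.
named_csv = pd.DataFrame({"height": heights}).to_csv(index=False)

# Reading the original layout back: the column label is the string "0".
df = pd.read_csv(io.StringIO(unnamed_csv))
print(df["0"].tolist())
```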


### Dataset: `hs_heights_pair.csv`

In [5]:
df = pd.DataFrame({
    'boys':boys,
    'girls':girls
})

df.to_csv("hs_heights_pair.csv", index=False)

In [6]:
df.describe().round(2)

|       | boys | girls |
| --- | --- | --- |
| count | 500.0 | 500.0 |
| mean | 67.16 | 61.98 |
| std | 2.89 | 2.12 |
| min | 59.16 | 56.68 |
| 25% | 65.18 | 60.6 |
| 50% | 67.13 | 62.0 |
| 75% | 68.95 | 63.38 |
| max | 77.15 | 69.35 |
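
The sample means differ slightly from the generation parameters (67 and 62), which is expected for a random sample of 500. As a rough check, the sketch below expresses each difference in units of the standard error of the mean, sigma / sqrt(n). The variable names are illustrative.

```python
import math

n = 500

# (sample mean, generation mean, generation SD) taken from this README.
groups = {
    "boys": (67.16, 67.0, 2.9),
    "girls": (61.98, 62.0, 2.2),
}

z_scores = {}
for name, (sample_mean, mu, sigma) in groups.items():
    std_error = sigma / math.sqrt(n)  # standard error of the mean
    z_scores[name] = (sample_mean - mu) / std_error
    print(f"{name}: {z_scores[name]:+.2f} standard errors from the target mean")
```

Both sample means fall within about 1.3 standard errors of their targets, so the simulated samples are consistent with the stated parameters.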


In [7]:
pd.read_csv('hs_heights_pair.csv')

|     | boys | girls |
| --- | --- | --- |
| 0 | 64.44 | 62.47 |
| 1 | 67.40 | 63.27 |
| 2 | 63.93 | 62.99 |
| 3 | 69.29 | 64.60 |
| 4 | 66.96 | 60.73 |
| ... | ... | ... |
| 495 | 70.89 | 65.69 |
| 496 | 62.13 | 58.06 |
| 497 | 72.97 | 62.42 |
| 498 | 67.23 | 62.18 |


### Dataset: `hs_heights_flag.csv`

In [8]:
boys_with_flag  = [(0, boy) for boy in boys]
girls_with_flag  = [(1, girl) for girl in girls]

df = pd.DataFrame(boys_with_flag + girls_with_flag).sample(frac=1, random_state=180)
df.columns = ['is_girl', 'height']
df.reset_index(drop=True, inplace=True)
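
The `sample(frac=1, random_state=180)` call returns every row in a reproducible random order, and `reset_index(drop=True)` then renumbers the rows from 0. A small sketch of that shuffle pattern on made-up rows (not values from the dataset):

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "is_girl": [0, 0, 1, 1],
    "height": [67.1, 70.4, 62.3, 60.9],
})

# frac=1 keeps all rows; random_state fixes the permutation.
shuffled = df.sample(frac=1, random_state=180).reset_index(drop=True)

# Each flag stays attached to its height; only the row order changes.
same_rows = set(zip(df["is_girl"], df["height"])) == set(
    zip(shuffled["is_girl"], shuffled["height"])
)
print(same_rows)
```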

In [9]:
df.groupby('is_girl').describe().T.round(2)

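|        | is_girl | 0 | 1 |
| --- | --- | --- | --- |
| height | count | 500.0 | 500.0 |
| height | mean | 67.16 | 61.98 |
| height | std | 2.89 | 2.12 |
| height | min | 59.16 | 56.68 |
| height | 25% | 65.18 | 60.6 |
| height | 50% | 67.13 | 62.0 |
| height | 75% | 68.95 | 63.38 |
| height | max | 77.15 | 69.35 |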


In [10]:
df.to_csv("hs_heights_flag.csv", index=False)

In [11]:
pd.read_csv("hs_heights_flag.csv")

|     | is_girl | height |
| --- | --- | --- |
| 0 | 0 | 65.22 |
| 1 | 1 | 62.50 |
| 2 | 0 | 66.80 |
| 3 | 0 | 70.86 |
| 4 | 1 | 65.92 |
| ... | ... | ... |
| 995 | 1 | 63.46 |
| 996 | 0 | 67.21 |
| 997 | 0 | 65.83 |
| 998 | 1 | 64.00 |
