# Height vs Weight


## Table of Contents
1. Introduction
1. Load data 
1. Understand the dataset
1. Examine the data distribution (1-dimensional)
    1. Height distribution by gender
    1. Exercise: Weight distribution by gender
1. Examine the data distribution (2-dimensional)
    1. The relationship between height and weight
    1. Calculate the Pearson's correlation coefficient between the men's height and weight
    1. Exercise: Correlation between the women's height and weight
1. References

---

## Introduction
In this exercise, we will use a dataset that contains 10,000 (artificial) measurements of height and weight for men and women, taken from the book "Machine Learning for Hackers" by Drew Conway and John Myles White.

Playing the role of statisticians/data scientists, we will examine this dataset to answer two questions:
- Is there any weight/height difference between men and women in this dataset?
- For each gender, is there any relationship between height and weight? For example, does a taller person tend to be heavier?

In this exercise, you will learn:
1. How to load a `.csv` file using the popular `Pandas` package.
1. How to generate basic statistics to describe a dataset. 
1. How to plot graphs using the `Seaborn` package.
1. How to calculate the correlation between two variables using the `Scipy` package.



---

## Load data
*(Check out the left navigation pane)* 

The dataset is stored in a file named `height_vs_weight_data.csv` in the popular comma-separated values (`.csv`) format.
Our first task in this section is to load the data using Python and examine the contents of the dataset.

In [None]:
# enable interactive matplotlib figures in the notebook (requires the ipympl package)
%matplotlib widget

# import python packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression


# set the default figure size for the graphs
plt.rcParams['figure.figsize'] = (6, 4)

# load data
data = pd.read_csv("height_vs_weight_data.csv")
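
As a quick sanity check after loading, it helps to look at the table's shape and column names. The sketch below does this on a tiny in-memory CSV; the four rows and the column order are made up for illustration and are not taken from the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file (values are made up).
csv_text = """Gender,Height,Weight
Male,187.6,109.7
Male,174.7,73.6
Female,149.6,46.3
Female,165.7,64.1
"""

# pd.read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))  # column names taken from the header row
print(sample.head(2))        # first two rows of the table
```

The same three calls can be applied to `data` once the real file has been loaded.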

---

## Understand the dataset
After loading the data, the first step is to explore the dataset to understand:
- the number of variables collected
- the variable type (e.g. categorical or continuous)

In [None]:
# number of men and women samples
total_men_samples = data.query("Gender=='Male'")["Gender"].count()
total_women_samples = data.query("Gender=='Female'")["Gender"].count()
print(f"""The dataset contains height/weight measurements from {total_men_samples} men and {total_women_samples} women.""")

In [None]:
# examine data structure
# show 10 random samples from the dataset
data.sample(10)

From the table above, we can observe that there are three variables (or table columns) collected for each sample (or table row): `Gender`, `Weight` and `Height`.


- `Gender` is a **categorical** variable with only two values: `Male` or `Female`.
- `Weight` and `Height` are **numerical** variables that can take any continuous value.

Understanding the variable types helps us decide which statistical models, tests, and graphs to use in a data analysis. However, this topic is beyond the scope of this exercise. For further reading, see the article by D. Richard, *"Types of data"* (url: https://www.nature.com/articles/6400501), to learn more about variable types.
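
The split between categorical and numerical variables can also be read from the table programmatically. This is a minimal sketch on a hypothetical four-row table (`toy` and its values are assumptions for illustration), which treats every non-numeric column as categorical.

```python
import pandas as pd

# Hypothetical miniature of the dataset (values are made up).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Height": [187.6, 149.6, 165.7, 174.7],
    "Weight": [109.7, 46.3, 64.1, 73.6],
})

# Columns stored as numbers are the numerical variables.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else (here: text labels) is treated as categorical.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numerical:", numeric_cols)
print("categorical:", categorical_cols)
print(toy["Gender"].value_counts())  # distinct categories and their counts
print(toy.describe())                # count, mean, std, min, quartiles, max
```

`describe()` is one way to generate basic statistics for all numerical columns at once.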
 

---
## Examine the data distribution (1-dimensional)
A histogram is a way to examine the distribution of a variable.

### Height distribution by gender

In [None]:
# plot a histogram where the x-axis represents Height and the colour represents Gender
plt.figure()
g = sns.histplot(data=data, x="Height", hue="Gender")
g.set_xlabel("Height (cm)")

In the histogram above, each bar represents the number of samples that fall within a particular height range (sometimes called a "bin"). For example, in the women's data (orange bars), 508 out of 5000 women have heights between 135 and 140 cm.
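
To make the idea of a bin concrete, the sketch below counts samples per 5 cm bin with `np.histogram`. The eight heights are made up for illustration. `sns.histplot` picks its bins automatically unless `bins` or `binwidth` is given, so the bins in the plot need not match these edges.

```python
import numpy as np

# Hypothetical heights in cm (made up for illustration).
heights = np.array([136.0, 138.5, 139.9, 141.2, 144.0, 147.3, 149.9, 151.0])

# Bin edges every 5 cm: 135, 140, 145, 150, 155.
edges = np.arange(135, 160, 5)

# counts[i] is the number of heights in [edges[i], edges[i+1]);
# only the last bin also includes its right edge.
counts, edges = np.histogram(heights, bins=edges)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right} cm: {n} samples")
```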

In [None]:
# calculate the mean and std of height for men and women
variable = "Height"
mean_height = dict()
std_height = dict()

for gender in ["Male", "Female"]:
    #1. find the data belong to a gender, 
    #2. then grab the value from the relevant variable,
    #3. calculate the mean/std of the variable
    #4. round the mean/std value to 2 decimal points.
    mean_height[gender] = np.round(data.query(f"Gender=='{gender}'")[variable].mean(),2)
    std_height[gender] = np.round(data.query(f"Gender=='{gender}'")[variable].std(),2)
    
    print(f"Height for '{gender}', mean: {mean_height[gender]} cm, std: {std_height[gender]} cm")
print("\n")
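
The same per-gender statistics can be obtained without an explicit loop by using `groupby`. The sketch below is an alternative under assumed toy values (the `toy` table is hypothetical), not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical miniature of the dataset (values are made up).
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "Height": [180.0, 190.0, 200.0, 130.0, 140.0, 150.0],
})

# Split the rows by gender, then compute the mean and the
# sample standard deviation of Height within each group.
summary = toy.groupby("Gender")["Height"].agg(["mean", "std"]).round(2)
print(summary)
```

Note that pandas' `std` uses the sample formula (dividing by n - 1), whereas `np.std` divides by n by default, so the two can differ slightly on small samples.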

The `mean` measures the central tendency of a variable. From the calculated means, we can tell that the average height of the men is 187.02 cm, which is higher than the women's average height of 135.86 cm.

The standard deviation (`std`) measures the spread of a variable's values around the mean. The standard deviation of 19.78 cm tells us that not all men are 187.02 cm tall. In fact, assuming the data are Gaussian distributed, about 68% of the men's heights fall within 187.02 ± 19.78 cm, i.e. between 167.24 and 206.80 cm.
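
The 68% figure can be checked by simulation. The sketch below draws synthetic Gaussian samples using the mean and spread quoted above (the samples are simulated, not taken from the dataset) and measures the fraction that lies within one standard deviation of the mean.

```python
import numpy as np

# Simulated heights; only the two parameters come from the text above.
rng = np.random.default_rng(seed=0)
mean, std = 187.02, 19.78
heights = rng.normal(loc=mean, scale=std, size=100_000)

# Fraction of samples inside [mean - std, mean + std].
within_one_std = np.mean(np.abs(heights - mean) <= std)
print(f"fraction within one std: {within_one_std:.3f}")
```

The printed fraction should be close to 0.68. Replacing `heights` with the real men's heights would show how well the Gaussian assumption holds for the dataset.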


### Exercise: Weight distribution by gender
In this exercise, we would like to understand whether there is any body **weight** difference between men and women.

**Hint**: Replace the "__" in the following code with the proper variable name

In [None]:
plt.figure()
g = sns.histplot(data=data, x="__", hue="Gender") # hint: replace "__" here
g.set_xlabel("Weight (kg)")

**Hint**: Replace the "__" in the following code with the proper variable name

In [None]:
# Here, we use simpler code than in the height example.
average_male_weight = data.query("Gender=='Male'")["__"].mean()  # hint: replace "__" here
average_female_weight = data.query("Gender=='Female'")["__"].mean()  # hint: replace "__" here

print(f"""The average weight for:
male:   {average_male_weight:.2f} kg, 
female: {average_female_weight:.2f} kg
weight difference = {average_male_weight-average_female_weight:.2f} kg""")


**Questions**:
1. Is the men's weight different from the women's weight?
2. (Extra) Is the difference large? How large must a difference be before we consider it large (or significant)?

---

## Examine the data distribution (2-dimensional)
### The relationship between height and weight
One way to examine the relationship between height and weight is to draw a scatterplot.

In [None]:
# Examine the data distribution
plt.figure()
g = sns.scatterplot(data=data, x="Weight", y="Height", hue="Gender", alpha=.5)
g.set_xlabel("Weight (kg)")
g.set_ylabel("Height (cm)")

g.hlines(y=175, xmin=50, xmax=80, linestyle='--', color='r')
g.hlines(y=100, xmin=50, xmax=80, linestyle='--', color='b')

In the above graph, each point represents a sample with its weight and height. The graph region with a higher number of points means that there are more samples distributed around that particular region. 

From the graph, we can observe that for both men and women, taller people tend to be heavier.

For example:
- For a woman with a height of 100 cm (*blue horizontal line*), her body weight is roughly between 55 and 63 kg.
- For a woman with a height of 175 cm (*red horizontal line*), her body weight is roughly between 65 and 70 kg.
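
Reading a weight range off the graph at a fixed height can be approximated numerically by selecting the samples in a narrow height band. The sketch below uses simulated data and a hypothetical helper, `weight_range_near`; the slope, noise level, and band width are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Simulated data with a linear height-weight trend plus noise (made up).
rng = np.random.default_rng(seed=0)
height = rng.normal(160.0, 10.0, size=5_000)
weight = 0.9 * height - 85.0 + rng.normal(0.0, 4.0, size=5_000)
toy = pd.DataFrame({"Height": height, "Weight": weight})


def weight_range_near(df, target_height, half_width=2.0):
    """Return the 5th and 95th weight percentiles of the samples whose
    height lies within target_height +/- half_width (hypothetical helper)."""
    band = df[df["Height"].between(target_height - half_width,
                                   target_height + half_width)]
    low, high = band["Weight"].quantile([0.05, 0.95])
    return low, high


low_a, high_a = weight_range_near(toy, 150.0)
low_b, high_b = weight_range_near(toy, 170.0)
print(f"near 150 cm: {low_a:.1f} to {high_a:.1f} kg")
print(f"near 170 cm: {low_b:.1f} to {high_b:.1f} kg")
```

With a positive trend, the range for the taller band sits higher than the range for the shorter band, which is the pattern described for the two horizontal lines.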

**Extra:**

From the graph, the elliptical shape of both the men's and the women's point clouds suggests that the data are approximately Gaussian (normally) distributed. Knowing the data distribution is useful when choosing a suitable statistical test or model. However, a discussion of the types of data distribution and their associated statistical tests is beyond the scope of this exercise.

### Calculate the Pearson's correlation coefficient between the men's height and weight
While we observed a trend between height and weight in both genders, is there a way to quantify the relationship?
In this section, we use Pearson's correlation to measure the strength of the linear relationship between height and weight.

Pearson's correlation gives us an ***r*** value between -1.0 and 1.0, which can be interpreted as follows:

| r    | Interpretation                  |
|------|---------------------------------|
| -1.0 | a perfect negative relationship |
| 0.0  | no linear relationship          |
| 1.0  | a perfect positive relationship |
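
To see where these three reference values come from, the sketch below computes r directly from its definition: the sum of products of deviations from the means, divided by the square root of the product of the summed squared deviations. The function `pearson_r` and the toy arrays are illustrative assumptions; in the notebook itself, `pearsonr` from `scipy.stats` does this calculation.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson's r computed from its definition (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2)))


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r_pos = pearson_r(x, 2 * x + 1)      # y rises with x along a straight line
r_neg = pearson_r(x, -3 * x)         # y falls with x along a straight line
r_zero = pearson_r(x, (x - 3) ** 2)  # symmetric curve: no linear trend

print(r_pos, r_neg, r_zero)
```

The last case shows why r only measures linear association: `y` is fully determined by `x`, yet the correlation is zero because the curve has no overall upward or downward trend.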

In [None]:
# select the men's data
male_data = data.query("Gender=='Male'")

# calculate Pearson's correlation between the weight and height of the men
corr, _ = pearsonr(male_data["Weight"], male_data["Height"])
print(f"Pearson's correlation, r: {corr:.3f}")

The correlation value of **0.863** suggests that there is a strong positive relationship between body weight and height in men. In other words, men's weight tends to increase with their height.

### Exercise: Correlation between the women's height and weight
Now, repeat the steps above to find out whether there is any relationship between women's height and weight.

**Hint**: Replace the "__" in the following code with the proper gender value

In [None]:
# select the women's data
subset = data.query("Gender=='__'")  # hint: replace "__" here

# calculate Pearson's correlation between the weight and height of the women
corr, _ = pearsonr(subset["Weight"], subset["Height"])
print(f"Pearson's correlation, r: {corr:.3f}")

---
## References
- Steven Buechler, "Study of Height vs Weight", from the "Computing with Data" seminar (https://www3.nd.edu/~steve/computing_with_data/2_Motivation/motivate_ht_wt.html)
- Drew Conway and John Myles White, "Height vs Weight" dataset from the "Machine Learning for Hackers" textbook