# Simulating a Dataset

## 1. Obesity - A Growing Problem

### Key Facts About Obesity [1]
- Worldwide obesity has nearly tripled since 1975.
- In 2016, more than 1.9 billion adults, 18 years and older, were overweight. Of these, over 650 million were obese.
- 39% of adults aged 18 years and over were overweight in 2016, and 13% were obese.
- Most of the world's population live in countries where overweight and obesity kills more people than underweight.
- 41 million children under the age of 5 were overweight or obese in 2016.
- Over 340 million children and adolescents aged 5-19 were overweight or obese in 2016.
- Obesity is preventable.

Obesity is increasingly affecting humans of all ages, shapes and sizes in the modern world.
According to irishhealth.com [2] and Medscape [3], weight classifications based on Body Mass Index (BMI) range from "Below Normal Weight" to "Class 3 Obese", as shown in the table below:

In [1]:
import pandas as pd

d = {'BMI (kg/m2)' : ['< 18', '18-24.9', '25-29.9', '30-34.9', '35-39.9', '> 40'],'Category' : ['Below Normal Weight', 'Normal Weight', 'Overweight', 'Class 1 Obese', 'Class 2 Obese', 'Class 3 Obese']}
df = pd.DataFrame(data = d, columns=['BMI (kg/m2)','Category'])
df

Unnamed: 0,BMI (kg/m2),Category
0,< 18,Below Normal Weight
1,18-24.9,Normal Weight
2,25-29.9,Overweight
3,30-34.9,Class 1 Obese
4,35-39.9,Class 2 Obese
5,> 40,Class 3 Obese


When thinking about the risk factors that impact obesity levels, the main indicator used to determine a person's weight classification is their Body Mass Index (BMI).

Ideally, an assessment based on BMI would also take into account a person's general fitness, muscle mass and waist size, as well as their height and weight. For the purposes of this notebook, the simpler, generalised version of BMI, which uses height and weight only, will be used.

In [2]:
# example BMI calculation, for purposes of demonstrating method to be used in this notebook
height_cm = 179

# For the purposes of BMI calculation, height must be provided in metres
height_m = height_cm / 100
weight_kg = 120

# BMI calculation is weight (kg) divided by height(m)^2
bmi = weight_kg/height_m**2
bmi

# This example person, Mr. Joe Bloggs, with a weight of 120 kg and height of 1.79 m, has a BMI of ~37.45
# This puts Joe Bloggs in the "Class 2 Obese" category, according to the table given above

37.452014606285694
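
The calculation above and the category table can be combined into a small helper. This is only a sketch: the function names `bmi_from_cm_kg` and `bmi_category` are hypothetical, and it assumes that a value falling exactly on a boundary (for example 25.0 or 40.0) belongs to the higher category, which the table itself does not specify.

```python
from bisect import bisect_right

# Lower bounds (kg/m^2) at which each successive category starts, following the table above.
# Assumption: boundary values such as 25.0 or 40.0 belong to the higher category.
BMI_BOUNDS = [18, 25, 30, 35, 40]
BMI_CATEGORIES = [
    'Below Normal Weight', 'Normal Weight', 'Overweight',
    'Class 1 Obese', 'Class 2 Obese', 'Class 3 Obese',
]

def bmi_from_cm_kg(height_cm, weight_kg):
    """Return BMI in kg/m^2 from height in centimetres and weight in kilograms."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2

def bmi_category(bmi):
    """Map a BMI value to its category label from the table."""
    return BMI_CATEGORIES[bisect_right(BMI_BOUNDS, bmi)]

joe_bmi = bmi_from_cm_kg(179, 120)
print(round(joe_bmi, 2), bmi_category(joe_bmi))  # 37.45 Class 2 Obese
```

Keeping the bounds and labels in two parallel lists means the classification can be adjusted by editing data rather than a chain of `if` statements.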

BMI is a crude indicator of obesity and cannot be relied upon as accurate in all circumstances [4]. For the purposes of this exercise, however, it is sufficient to allow the simulation of a dataset.

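The dataset used below was downloaded from [5] and contains Gender, Height (cm), Weight (kg) and a BMI category index (an integer from 0 to 5, ordered from lowest to highest BMI) for 500 people. The dataset's own category boundaries do not appear to coincide exactly with the table above: for example, the second record shown below (189 cm, 87 kg) has a BMI of about 24.4 but is assigned index 2.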

In [3]:
# Dataset is stored in csv form and was downloaded from [5] and saved locally before importing
# First read the csv data using Pandas
gend_bmi_data = pd.read_csv('500_Person_Gender_Height_Weight_Index.csv', header=None, skiprows=1, names=['Gender','Height_cm','Weight_kg', 'BMI_Category'])

# Create Pandas dataframe to hold data for investigation
# dataframe automatically assigned an index column
df_gend_bmi = pd.DataFrame(data=gend_bmi_data)
df_gend_bmi.index = df_gend_bmi.index+1
# Print the data to notebook
# Summary information at the end shows there are 500 records, each with 4 columns of data (not including the auto-generated index column)
df_gend_bmi

Unnamed: 0,Gender,Height_cm,Weight_kg,BMI_Category
1,Male,174,96,4
2,Male,189,87,2
3,Female,185,110,4
4,Female,195,104,3
5,Male,149,61,3
6,Male,189,104,3
7,Male,147,92,5
8,Male,154,111,5
9,Male,174,90,3
10,Female,169,103,4


To check that none of the data in the dataframe is corrupted or incorrect, some basic validation checks follow:

In [4]:
df_gend_bmi.dtypes

Gender          object
Height_cm        int64
Weight_kg        int64
BMI_Category     int64
dtype: object

In [52]:
# Take a subset of the Dataframe df_gend_bmi where the gender is Male only
# Save a count of this subset as 'male'
df_male = df_gend_bmi.loc[df_gend_bmi['Gender'] == 'Male']
male = len(df_male.index) # https://stackoverflow.com/questions/15943769/how-do-i-get-the-row-count-of-a-pandas-dataframe

# Similar for 'female'
df_female = df_gend_bmi.loc[df_gend_bmi['Gender'] == 'Female']
female = len(df_female.index)

# displays that there are 245 male, 255 female giving a total of 500 correct genders out of 500.
print(male, female, male+female)

245 255 500


In [6]:
# Compute the minimum and maximum of each numeric column (the text column 'Gender' is excluded)
numeric_cols = ['Height_cm', 'Weight_kg', 'BMI_Category']
max_vals, min_vals = df_gend_bmi[numeric_cols].max(), df_gend_bmi[numeric_cols].min()

# Look values up by column label rather than by position
height_max, height_min = max_vals['Height_cm'], min_vals['Height_cm']
weight_max, weight_min = max_vals['Weight_kg'], min_vals['Weight_kg']
bmi_max, bmi_min = max_vals['BMI_Category'], min_vals['BMI_Category']

print("height range: ", height_min, "to", height_max, "cm")
print("weight range: ", weight_min, "to", weight_max, "kg")
print("BMI category range: ", bmi_min, "to", bmi_max)

height range:  140 to 199 cm
weight range:  50 to 160 kg
BMI category range:  0 to 5


There are no numeric values outside what seem to be reasonable upper and lower bounds for each column of data.
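
The same kind of validation can be sketched in a more general form: counting missing values, listing the distinct gender labels, and flagging rows outside a set of plausibility bounds. The frame `sample` below holds only the first five records displayed earlier as stand-in data, and the bounds are an assumption (here simply the ranges reported above), not an authoritative definition of valid values.

```python
import pandas as pd

# First five records of the dataset as displayed earlier, used here as stand-in data
sample = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Female', 'Male'],
    'Height_cm': [174, 189, 185, 195, 149],
    'Weight_kg': [96, 87, 110, 104, 61],
    'BMI_Category': [4, 2, 4, 3, 3],
})

# Count of missing values per column (all zeros means nothing is missing)
missing = sample.isna().sum()

# Frequency of each gender label; an unexpected label would appear as an extra entry
gender_counts = sample['Gender'].value_counts()

# Rows whose height or weight falls outside the assumed bounds (140-199 cm, 50-160 kg)
in_range = sample['Height_cm'].between(140, 199) & sample['Weight_kg'].between(50, 160)
out_of_range = sample[~in_range]

print(missing.sum(), gender_counts.to_dict(), len(out_of_range))
```

Applied to the full data, `sample` would be replaced by `df_gend_bmi`; an empty `out_of_range` frame and a zero missing-value count would support the conclusion above.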

## 2. Variables to be Considered

### Gender
For simplicity, and in line with the source dataset, which records only two values for this variable, we will assume that there are two gender categories available for selection:
- Male
- Female

Since we are randomly selecting our data, we assume the probability of selecting a male or a female is constant at 0.5 for each trial, which is close to the 245/255 split in the source dataset.

Each selection is a trial with only two possible outcomes, so the number of males (or females) in a sample of fixed size follows a binomial distribution [6].
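
The link between two-outcome trials and the binomial distribution can be sketched with NumPy's `Generator` interface. The seed and the variable names are arbitrary choices for illustration; the sample size of 100 and the probability of 0.5 follow the assumptions stated above.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seed chosen arbitrarily, for reproducibility only

n_people, p_male = 100, 0.5

# Option 1: draw the number of males directly from a binomial distribution
n_male = rng.binomial(n=n_people, p=p_male)

# Option 2: draw one label per person; the count of 'Male' labels is binomially distributed
genders = rng.choice(['Male', 'Female'], size=n_people, p=[p_male, 1 - p_male])
n_male_from_labels = int((genders == 'Male').sum())

print(n_male, n_male_from_labels)
```

Option 2 is the more useful form for building a dataset, because it yields one gender value per simulated person rather than only a total.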

In [71]:
import numpy as np

a = ['Male', 'Female']

df_rand_gender = pd.DataFrame(data=np.random.choice(a, 100, replace=True))
df_rand_gender.index = df_rand_gender.index+1

df_rand_male = df_rand_gender.loc[df_rand_gender[0] == 'Male']
rand_male = len(df_rand_male.index)

df_rand_female = df_rand_gender.loc[df_rand_gender[0] == 'Female']
rand_female = len(df_rand_female.index)

rand_gender_count = [rand_male, rand_female]
rand_gender_count

[42, 58]

In [72]:
df_rand_gender

Unnamed: 0,0
1,Male
2,Female
3,Female
4,Male
5,Male
6,Male
7,Female
8,Male
9,Female
10,Female


### Height & Weight
For these variables we will generate arrays of random integer values.
We will use summary statistics from our sample set of 500 records to generate our random height and weight data. We already know:
- the range of heights recorded is 140 cm - 199 cm
- the range of weights recorded is 50 kg - 160 kg.

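We will assume, as a simplification, that height and weight are each approximately normally distributed about their mean value. This is a modelling assumption rather than a consequence of sample size: the common "more than 30 datapoints" rule of thumb relates to the distribution of sample means, not to the distribution of the individual values.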

Before we can generate a random sample of normally distributed data, we need:
- the mean value of each set of data
- the standard deviation of each set of data

In [73]:
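# Select the numeric columns explicitly so the text column 'Gender' is excluded
mean_data = gend_bmi_data[['Height_cm', 'Weight_kg']].mean()
# ddof=0 gives the population standard deviation, matching the default of np.std
sdev_data = gend_bmi_data[['Height_cm', 'Weight_kg']].std(ddof=0)

# Truncate to whole numbers, using column labels rather than positions
mean_h = int(mean_data['Height_cm'])
mean_w = int(mean_data['Weight_kg'])
sdev_h = int(sdev_data['Height_cm'])
sdev_w = int(sdev_data['Weight_kg'])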

summStats = [mean_h, mean_w, sdev_h, sdev_w]
summStats

[169, 106, 16, 32]

From the above, we now have:
- mean height: 169cm
- mean weight: 106kg
- std. dev. height: 16cm
- std. dev. weight: 32kg
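
One rough way to sanity-check the normality assumption is to measure what share of values lies within one standard deviation of the mean: about 68% for normal data, but about 58% (1/√3) for uniformly distributed data. The sketch below illustrates this on two synthetic reference samples built from the height statistics above; the helper name `fraction_within_one_sd` and the seed are hypothetical.

```python
import numpy as np

def fraction_within_one_sd(values):
    """Return the share of values lying within one standard deviation of the mean."""
    values = np.asarray(values, dtype=float)
    mean, sd = values.mean(), values.std()
    return float(np.mean(np.abs(values - mean) <= sd))

rng = np.random.default_rng(seed=0)  # arbitrary seed

# Two synthetic reference samples based on the height statistics reported above
normal_sample = rng.normal(169, 16, 100_000)     # mean 169 cm, std. dev. 16 cm
uniform_sample = rng.uniform(140, 199, 100_000)  # observed range 140-199 cm

print(fraction_within_one_sd(normal_sample))   # approximately 0.68
print(fraction_within_one_sd(uniform_sample))  # approximately 0.58
```

Calling `fraction_within_one_sd(gend_bmi_data['Height_cm'])` on the real data would indicate which of the two shapes the recorded heights resemble more closely.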

Now we can generate a random sample of data for each variable:

In [85]:
rand_weight = np.random.normal(mean_w, sdev_w, 100)
rand_weight = [int(x) for x in rand_weight]

df_rand_weight = pd.DataFrame(rand_weight)
df_rand_weight.index = df_rand_weight.index+1
df_rand_weight

Unnamed: 0,0
1,127
2,72
3,88
4,71
5,104
6,142
7,67
8,119
9,124
10,112


In [86]:
rand_height = np.random.normal(mean_h, sdev_h, 100)
rand_height = [int(x) for x in rand_height]

df_rand_height = pd.DataFrame(rand_height)
df_rand_height.index = df_rand_height.index+1
df_rand_height

Unnamed: 0,0
1,159
2,153
3,178
4,195
5,169
6,163
7,154
8,189
9,153
10,155
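
Because a normal distribution has no upper or lower limit, samples drawn this way can fall outside the ranges observed in the source data (140-199 cm and 50-160 kg). A possible variation, sketched below, seeds the generator for reproducibility and clips the values to those ranges. The helper `sample_clipped_normal` is hypothetical, and it rounds to the nearest integer rather than truncating as the cells above do.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed so the sample can be regenerated

def sample_clipped_normal(mean, sdev, low, high, size, rng):
    """Draw normal values, round to whole numbers and clip them to [low, high]."""
    values = rng.normal(mean, sdev, size)
    return np.clip(np.rint(values), low, high).astype(int)

# Summary statistics and observed ranges reported earlier in this notebook
synth_height = sample_clipped_normal(169, 16, 140, 199, 100, rng)
synth_weight = sample_clipped_normal(106, 32, 50, 160, 100, rng)

print(synth_height.min(), synth_height.max(), synth_weight.min(), synth_weight.max())
```

Clipping is a blunt tool: every out-of-range draw is moved onto a boundary, so the minimum and maximum values become slightly over-represented. Redrawing out-of-range values would avoid that at the cost of extra code.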


To pull the three variables together into a single dataframe:

In [91]:
BMI_Data_synth = pd.concat([df_rand_gender, df_rand_height, df_rand_weight], axis=1, ignore_index=True)
BMI_Data_synth.columns = ['Gender', 'Height_cm', 'Weight_kg']
BMI_Data_synth

Unnamed: 0,Gender,Height_cm,Weight_kg
1,Male,159,127
2,Female,153,72
3,Female,178,88
4,Male,195,71
5,Male,169,104
6,Male,163,142
7,Female,154,67
8,Male,189,119
9,Female,153,124
10,Female,155,112


### Body Mass Index (BMI)
As discussed earlier, an individual's BMI is a simple calculation based on that individual's height and weight.
This variable is linked to the height and weight variables, already discussed, by the following equation:

$$\mathrm{BMI} = \frac{\text{weight (kg)}}{\text{height (m)}^{2}}$$

In [93]:
BMI_Data_synth['BMI'] = (BMI_Data_synth.Weight_kg*(100**2))/(BMI_Data_synth.Height_cm**2) #https://stackoverflow.com/questions/45393123/adding-calculated-column-in-pandas
BMI_Data_synth

Unnamed: 0,Gender,Height_cm,Weight_kg,BMI
1,Male,159,127,50.235355
2,Female,153,72,30.757401
3,Female,178,88,27.774271
4,Male,195,71,18.671926
5,Male,169,104,36.413291
6,Male,163,142,53.445745
7,Female,154,67,28.250970
8,Male,189,119,33.313737
9,Female,153,124,52.971079
10,Female,155,112,46.618106
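
Once a BMI column exists, a category label could be attached to each row with `pd.cut`. This is a sketch only: the frame `demo` holds the first four synthesised rows shown above, the bin edges follow the BMI table in section 1, and treating a BMI of exactly 40 as "Class 3 Obese" is an assumption.

```python
import numpy as np
import pandas as pd

# The first four rows of the synthesised table above
demo = pd.DataFrame({
    'Height_cm': [159, 153, 178, 195],
    'Weight_kg': [127, 72, 88, 71],
})
demo['BMI'] = demo['Weight_kg'] / (demo['Height_cm'] / 100) ** 2

# Bin edges follow the BMI table in section 1; right=False makes each bin [lower, upper)
# Assumption: a BMI of exactly 40 is treated as Class 3 Obese
edges = [-np.inf, 18, 25, 30, 35, 40, np.inf]
labels = ['Below Normal Weight', 'Normal Weight', 'Overweight',
          'Class 1 Obese', 'Class 2 Obese', 'Class 3 Obese']
demo['Category'] = pd.cut(demo['BMI'], bins=edges, labels=labels, right=False)

print(demo)
```

The same two lines defining `edges` and `labels` could be applied to `BMI_Data_synth['BMI']` to label the full synthetic dataset.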


## 3. Data Synthesis

## 4. References

1. http://www.who.int/news-room/fact-sheets/detail/obesity-and-overweight
2. http://www.irishhealth.com/calc/bmi01.html
3. https://reference.medscape.com/calculator/body-mass-index-bmi
4. https://www.nhs.uk/live-well/healthy-weight/bmi-calculator/#limitations-of-the-bmi
5. https://www.kaggle.com/yersever/500-person-gender-height-weight-bodymassindex#500_Person_Gender_Height_Weight_Index.csv
6. https://stattrek.com/probability-distributions/binomial.aspx
7. https://www.oecd.org/els/health-systems/Obesity-Update-2017.pdf