# Simulating a Dataset

## 1. Obesity - A Growing Problem

### Key Facts About Obesity [1]
- Worldwide obesity has nearly tripled since 1975
- In 2016, more than 1.9 billion adults, 18 years and older, were overweight. Of these over 650 million were obese.
- 39% of adults aged 18 years and over were overweight in 2016, and 13% were obese.
- Most of the world's population live in countries where overweight and obesity kills more people than underweight.
- 41 million children under the age of 5 were overweight or obese in 2016.
- Over 340 million children and adolescents aged 5-19 were overweight or obese in 2016.
- Obesity is preventable.

Obesity is increasingly affecting humans of all ages, shapes and sizes in the modern world.
According to irishhealth.com [2] and medscape [3], there are a number of classifications, based on Body Mass Index (BMI), ranging from "Below Normal Weight" to "Class 3 Obese", as seen in the table below:

In [31]:
import pandas as pd

d = {'BMI (kg/m2)' : ['< 18', '18-24.9', '25-29.9', '30-34.9', '35-39.9', '> 40'],'Category' : ['Below Normal Weight', 'Normal Weight', 'Overweight', 'Class 1 Obese', 'Class 2 Obese', 'Class 3 Obese']}
df = pd.DataFrame(data = d, columns=['BMI (kg/m2)','Category'])
df

Unnamed: 0,BMI (kg/m2),Category
0,< 18,Below Normal Weight
1,18-24.9,Normal Weight
2,25-29.9,Overweight
3,30-34.9,Class 1 Obese
4,35-39.9,Class 2 Obese
5,> 40,Class 3 Obese


When thinking about the risk factors that impact obesity levels, the main indicator used to determine a person's weight classification is their Body Mass Index (BMI).

Ideally, BMI would be interpreted alongside a person's general fitness, muscle mass and waist size, as well as their height and weight. For the purposes of this notebook, the more generalised version of BMI, based on height and weight only, will be used.

In [32]:
# example BMI calculation, for purposes of demonstrating method to be used in this notebook
height_cm = 179

# For the purposes of BMI calculation, height must be provided in metres
height_m = height_cm / 100
weight_kg = 120

# BMI calculation is weight (kg) divided by height(m)^2
bmi = weight_kg/height_m**2
bmi

# This example person, Mr. Joe Bloggs, with a weight of 120 kg and height of 1.79 m, has a BMI of ~37.45
# This puts Joe Bloggs in the "Class 2 Obese" category, according to the table given above

37.452014606285694

Although BMI is a crude indicator of obesity that cannot be relied upon as accurate in all circumstances [4], it will suffice for the purposes of this exercise, as it allows a dataset to be simulated.

The dataset used below was downloaded from [5] and contains Gender, Height (cm), Weight (kg) and BMI Index Category (as per above table index) for 500 people.

In [33]:
# Dataset is stored in CSV form and was downloaded from [5] and saved locally before importing
# First read the csv data using Pandas
gend_bmi_data = pd.read_csv('500_Person_Gender_Height_Weight_Index.csv', header=None, skiprows=1, names=['Gender','Height_cm','Weight_kg', 'BMI_Category'])

# Create Pandas dataframe to hold data for investigation
# dataframe automatically assigned an index column
df_gend_bmi = pd.DataFrame(data=gend_bmi_data)
df_gend_bmi.index = df_gend_bmi.index+1
# Print the data to notebook
# Summary information at the end shows there are 500 records, each with 4 columns of data (not including the auto-generated index column)
df_gend_bmi

To check that none of the data in the dataset/dataframe has been corrupted or is incorrect, some basic validation checks are carried out below:

In [34]:
df_gend_bmi.dtypes

Gender          object
Height_cm        int64
Weight_kg        int64
BMI_Category     int64
dtype: object

In [35]:
# Take a subset of the Dataframe df_gend_bmi where the gender is Male only
# Save a count of this subset as 'male'
df_male = df_gend_bmi.loc[df_gend_bmi['Gender'] == 'Male']
male = len(df_male.index) # https://stackoverflow.com/questions/15943769/how-do-i-get-the-row-count-of-a-pandas-dataframe

# Similar for 'female'
df_female = df_gend_bmi.loc[df_gend_bmi['Gender'] == 'Female']
female = len(df_female.index)

total = male + female
# We can now calculate the percentage split of the dataset between genders
print("Dataset is", str((male*100)/total) + "% male and", str((female*100)/total) + "% female.")

Dataset is 49.0% male and 51.0% female.
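
The same split can be obtained more compactly with pandas' `value_counts`. The sketch below uses a small made-up `gender` Series (an assumption for illustration, not the real dataset) so that it runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for the 'Gender' column of df_gend_bmi
gender = pd.Series(['Male', 'Female', 'Female', 'Male', 'Female'], name='Gender')

# normalize=True returns proportions rather than raw counts
split_pct = gender.value_counts(normalize=True) * 100

print(split_pct.round(1))
```

With the real data, `df_gend_bmi['Gender']` would take the place of the made-up Series.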


In [36]:
# Return the maximum/minimum value of each numeric column, selected by column name
height_max, height_min = df_gend_bmi['Height_cm'].max(), df_gend_bmi['Height_cm'].min()
weight_max, weight_min = df_gend_bmi['Weight_kg'].max(), df_gend_bmi['Weight_kg'].min()
bmi_max, bmi_min = df_gend_bmi['BMI_Category'].max(), df_gend_bmi['BMI_Category'].min()

print("height range: ", height_min, "to", height_max, "cm")
print("weight range: ", weight_min, "to", weight_max, "kg")
print("BMI category range: ", bmi_min, "to", bmi_max)

height range:  140 to 199 cm
weight range:  50 to 160 kg
BMI category range:  0 to 5


There are no numeric values outside what seem to be reasonable upper and lower bounds for each column of data.
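
A few further checks could be added at this point, such as looking for missing values and unexpected labels. The following is a minimal sketch, assuming a dataframe laid out like the imported one; the rows in `sample` are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows in the same layout as the imported dataset
sample = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female'],
    'Height_cm': [180, 165, 150],
    'Weight_kg': [90, 70, 55],
    'BMI_Category': [2, 2, 1],
})

# Count missing values per column; all zeros means nothing is missing
missing = sample.isna().sum()

# Confirm that only the expected gender labels and category codes occur
genders_ok = sample['Gender'].isin(['Male', 'Female']).all()
categories_ok = sample['BMI_Category'].between(0, 5).all()

print(missing)
print(genders_ok, categories_ok)
```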

## 2. Variables to be Considered

### Gender
For simplicity, and in line with the categories recorded in the source dataset, we will assume that there are two gender values available for selection:
- Male
- Female

Since we are randomly selecting our data, the chances of selecting either a male or female with each trial will be constant at 0.5.

Such a trial, with only two possible outcomes, would suggest a binomial distribution for this variable [6].
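
To make the link to the binomial distribution concrete, the number of males in a sample of 100 can be drawn directly. This is a sketch only: the seed and the variable names (`rng`, `n_people`, `p_male`, `many_counts`) are assumptions introduced here, not part of the notebook's own cells.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seed chosen arbitrarily for repeatability

n_people = 100   # sample size used in this notebook
p_male = 0.5     # assumed probability of selecting a male on each trial

# A single binomial draw gives the number of males in one sample of 100 people
n_male = rng.binomial(n=n_people, p=p_male)
n_female = n_people - n_male

# Repeating the experiment many times shows the counts clustering around n * p = 50
many_counts = rng.binomial(n=n_people, p=p_male, size=10_000)

print(n_male, n_female, many_counts.mean())
```

Each individual selection is a single two-outcome trial; it is the count of males across all 100 trials that follows the binomial distribution.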

In [37]:
import numpy as np

a = ['Male', 'Female']

df_rand_gender = pd.DataFrame(data=np.random.choice(a, 100, replace=True))
df_rand_gender.index = df_rand_gender.index+1

df_rand_male = df_rand_gender.loc[df_rand_gender[0] == 'Male']
rand_male = len(df_rand_male.index)

df_rand_female = df_rand_gender.loc[df_rand_gender[0] == 'Female']
rand_female = len(df_rand_female.index)

rand_gender_count = [rand_male, rand_female]
rand_gender_count

[59, 41]

In [38]:
# run this code cell if you want to see the dataset containing randomly generated Gender data
df_rand_gender

### Height & Weight
For these variables we will generate an array of random integer values.
We will use summary statistics, from our sample set of 500 records, to generate our random height data. We already know:
- the range of heights recorded is 140cm - 199cm
- the range of weights recorded is 50kg - 160kg.

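We will assume, as a simplification, that height and weight are each approximately normally distributed about their mean values. Note that this is a modelling assumption: the rule of thumb about samples of more than 30 datapoints comes from the central limit theorem, which concerns the distribution of sample means rather than of the raw data.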

Before we can generate a random sample of normally distributed data, we need:
- the mean value of each set of data
- the standard deviation of each set of data

In [39]:
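# Select the columns by name so that the non-numeric 'Gender' column is not involved
# ddof=0 gives the population standard deviation, matching NumPy's default for np.std
mean_h = int(gend_bmi_data['Height_cm'].mean())
mean_w = int(gend_bmi_data['Weight_kg'].mean())
sdev_h = int(gend_bmi_data['Height_cm'].std(ddof=0))
sdev_w = int(gend_bmi_data['Weight_kg'].std(ddof=0))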

summStats = [mean_h, mean_w, sdev_h, sdev_w]
summStats

[169, 106, 16, 32]

From the above, we now have:
- mean height: 169cm
- mean weight: 106kg
- std. dev. height: 16cm
- std. dev. weight: 32kg

Now we can generate a random sample of data for each variable.

However, we would expect a relationship between a person's height and weight. Before we simply generate two random samples of numerical data, we should investigate this relationship and employ it in generating our data. Otherwise, the chances are that our BMI data will not make sense, as there will be no relationship between our two random sets of data.

What we can do here is to define a relationship stating that a person's weight is dependent in some way on their height.
To define that relationship mathematically, we will revert to the original dataset '500_Person_Gender_Height_Weight_Index.csv'.

Next we will perform linear regression analysis on the height and weight data to define a regression equation, which will form the basis of the relationship between the two variables. The cell below fits the female subset; the fits for the full dataset and for the male subset are left commented out:

In [41]:
# https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.linregress.html
from scipy import stats

# reg_slope, reg_intercept, rvalue, pvalue, stderr = stats.linregress(gend_bmi_data['Height_cm'], gend_bmi_data['Weight_kg'])
# [reg_slope, reg_intercept, rvalue]

# reg_slope, reg_intercept, rvalue, pvalue, stderr = stats.linregress(df_male['Height_cm'], df_male['Weight_kg'])
# [reg_slope, reg_intercept, rvalue]

reg_slope, reg_intercept, rvalue, pvalue, stderr = stats.linregress(df_female['Height_cm'], df_female['Weight_kg'])
[reg_slope, reg_intercept, rvalue]

[0.05783654433087901, 95.85267170072629, 0.02756862409201924]
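
To put the `rvalue` above in context, it can be squared to give the proportion of variance explained by the fit. The sketch below does this and also shows how a straight-line fit can be cross-checked with NumPy alone. The heights and weights it generates are invented (a deliberately strong relationship with slope 0.5 and intercept 20), so only the method, not the numbers, carries over to the real data.

```python
import numpy as np

# rvalue reported by linregress above; its square is the share of variance explained
rvalue = 0.02756862409201924
print(rvalue**2)  # well below 0.001, i.e. under 0.1% of the variance

# Hypothetical data with a known linear relationship, for illustration only
rng = np.random.default_rng(seed=0)
x = rng.normal(169, 16, 500)               # made-up heights in cm
y = 0.5 * x + 20 + rng.normal(0, 5, 500)   # made-up weights in kg

# A degree-1 polynomial fit returns [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)

# Pearson correlation coefficient, comparable to linregress's rvalue
r = np.corrcoef(x, y)[0, 1]
r_squared = r**2

print(slope, intercept, r_squared)
```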

Now we have some values that we can use to define a relationship between a person's height and their weight, based on the original dataset imported to the notebook. Note that the correlation coefficient (an rvalue of approximately 0.03) is very close to zero, so the linear relationship in this dataset is weak.

The slope value will act as the coefficient of our synthesized 'Height_cm' variable and we will then add the intercept to this result, in order to get a value for our synthesized 'Weight_kg'.

First, we generate a random sample of heights, from a normal probability distribution:

In [43]:
rand_height = np.random.normal(mean_h, sdev_h, 100)
rand_height = [int(x) for x in rand_height]

df_rand_height = pd.DataFrame(rand_height)
df_rand_height.index = df_rand_height.index+1
df_rand_height
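
Draws from a normal distribution are unbounded, so some synthesized heights can fall outside the 140 cm to 199 cm range observed in the source data. One optional refinement, which is not part of the original notebook, is to clip the values to that range. The seed and the variable names `raw_heights` and `clipped_heights` are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed so the sketch is repeatable

mean_h, sdev_h = 169, 16             # summary statistics calculated above
height_min, height_max = 140, 199    # observed range in the source dataset

raw_heights = rng.normal(mean_h, sdev_h, 100)

# Round to whole centimetres, then force every value into the observed range
clipped_heights = np.clip(np.rint(raw_heights), height_min, height_max).astype(int)

print(clipped_heights.min(), clipped_heights.max())
```

Clipping piles the out-of-range values up at the two boundaries, so it distorts the tails slightly. Redrawing any out-of-range value would be an alternative if that matters.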

Next, we will pull together our two synthesized variable samples, into a single Pandas dataframe:

In [44]:
BMI_Data_synth = pd.concat([df_rand_gender, df_rand_height], axis=1, ignore_index=True)
BMI_Data_synth.columns = ['Gender', 'Height_cm']
BMI_Data_synth

Unnamed: 0,Gender,Height_cm
1,Female,152
2,Male,186
3,Male,166
4,Male,172
5,Female,174
6,Female,172
7,Male,158
8,Male,174
9,Female,188
10,Male,183


In order to generate our weight data, now we will add a new column to BMI_Data_synth, called Weight_kg. We will populate it using our regression equation, as defined above:

In [29]:
BMI_Data_synth['Weight_kg'] = (BMI_Data_synth.Height_cm*reg_slope)+reg_intercept #https://stackoverflow.com/questions/45393123/adding-calculated-column-in-pandas
BMI_Data_synth

Unnamed: 0,Gender,Height_cm,Weight_kg
1,Female,181,105.760970
2,Female,169,106.345921
3,Male,184,105.614732
4,Male,160,106.784634
5,Female,134,108.052027
6,Female,180,105.809716
7,Female,173,106.150937
8,Male,182,105.712224
9,Female,170,106.297175
10,Male,181,105.760970
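
Because the regression line is applied directly, every person of a given height receives exactly the same weight, and the synthesized weights show almost no spread. One possible refinement, offered here as an assumption rather than as the notebook's method, is to add normally distributed scatter about the line. The residual standard deviation is estimated below as `sdev_w * sqrt(1 - rvalue**2)`, using the rounded values reported earlier; the `heights` array is a made-up stand-in for `BMI_Data_synth['Height_cm']`.

```python
import numpy as np

rng = np.random.default_rng(seed=7)  # arbitrary seed for repeatability

# Values reported earlier in this notebook (rounded)
reg_slope, reg_intercept, rvalue = 0.0578, 95.85, 0.0276
sdev_w = 32

# Hypothetical heights standing in for BMI_Data_synth['Height_cm']
heights = np.array([152, 186, 166, 172, 174])

# Standard deviation of the scatter about the regression line
resid_sd = sdev_w * np.sqrt(1 - rvalue**2)

# Deterministic part from the regression line, plus random scatter
line_weights = reg_slope * heights + reg_intercept
noisy_weights = line_weights + rng.normal(0, resid_sd, size=heights.size)

print(np.round(noisy_weights, 1))
```

With a correlation this close to zero, almost all of the variation in weight comes from the noise term rather than from height.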


### Body Mass Index (BMI)
As discussed earlier, an individual's BMI is a simple calculation based on that individual's height and weight.
This variable is linked to the height and weight variables, already discussed, by the following equation:

$$\mathrm{BMI} = \frac{\text{weight (kg)}}{\text{height (m)}^{2}}$$
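
Since heights in this notebook are stored in centimetres, the formula needs a unit conversion. The sketch below shows two equivalent ways of writing it; the function names `bmi_from_cm` and `bmi_scaled` are hypothetical and introduced only for illustration.

```python
def bmi_from_cm(weight_kg, height_cm):
    """Return BMI given weight in kilograms and height in centimetres."""
    height_m = height_cm / 100  # convert centimetres to metres
    return weight_kg / height_m**2


def bmi_scaled(weight_kg, height_cm):
    """Same calculation kept in centimetres: dividing by (h/100)**2 equals multiplying by 100**2 / h**2."""
    return weight_kg * 100**2 / height_cm**2


# The Joe Bloggs example from earlier: 120 kg and 179 cm gives a BMI of about 37.45
print(bmi_from_cm(120, 179))
print(bmi_scaled(120, 179))
```

The second form is the one that suits a dataframe column holding heights in centimetres.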

In [26]:
BMI_Data_synth['BMI'] = (BMI_Data_synth.Weight_kg*(100**2))/(BMI_Data_synth.Height_cm**2) #https://stackoverflow.com/questions/45393123/adding-calculated-column-in-pandas
BMI_Data_synth

Unnamed: 0,Gender,Height_cm,Weight_kg,BMI
1,Male,138,170,89.266961
2,Male,136,113,61.094291
3,Female,212,80,17.799929
4,Male,186,123,35.553243
5,Female,174,95,31.377989
6,Male,149,153,68.915815
7,Male,157,100,40.569597
8,Male,162,119,45.343698
9,Female,173,97,32.410037
10,Female,186,96,27.748873


In [38]:
def getCat(BMI):
    # Chained comparisons are used because the bitwise '&' operator binds
    # more tightly than '>=' and '<', which would give the wrong result
    if 0 <= BMI < 18:
        cat = 0
    elif 18 <= BMI < 25:
        cat = 1
    elif 25 <= BMI < 30:
        cat = 2
    elif 30 <= BMI < 35:
        cat = 3
    elif 35 <= BMI < 40:
        cat = 4
    elif BMI >= 40:
        cat = 5
    else:
        cat = None  # negative or missing BMI
    return cat

# Apply the function to each row's BMI, rather than assigning one constant to the whole column
BMI_Data_synth['Category'] = BMI_Data_synth['BMI'].apply(getCat)
BMI_Data_synth

Unnamed: 0,Gender,Height_cm,Weight_kg,BMI,Category
1,Male,138,170,89.266961,5
2,Male,136,113,61.094291,5
3,Female,212,80,17.799929,0
4,Male,186,123,35.553243,4
5,Female,174,95,31.377989,3
6,Male,149,153,68.915815,5
7,Male,157,100,40.569597,5
8,Male,162,119,45.343698,5
9,Female,173,97,32.410037,3
10,Female,186,96,27.748873,2
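
The category lookup can also be carried out for a whole column in one step with `pd.cut`. This is an alternative sketch, not the notebook's own approach; the `bmi` values are made up for illustration, and the bin edges are taken from the BMI table at the top of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical BMI values standing in for BMI_Data_synth['BMI']
bmi = pd.Series([17.8, 18.0, 27.7, 31.4, 35.6, 40.0, 89.3])

# Bin edges taken from the BMI table at the top of this notebook
edges = [0, 18, 25, 30, 35, 40, np.inf]

# right=False makes each bin closed on the left: [0, 18), [18, 25), ...
# labels=False returns the integer position of the bin (0 to 5)
category = pd.cut(bmi, bins=edges, right=False, labels=False)

print(category.tolist())  # [0, 1, 2, 3, 4, 5, 5]
```

Using left-closed bins means a BMI of exactly 18 falls into category 1 and a BMI of exactly 40 into category 5, consistent with the `>=` comparisons in `getCat`.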


## 3. Data Synthesis

## 4. References

### 1. http://www.who.int/news-room/fact-sheets/detail/obesity-and-overweight
### 2. http://www.irishhealth.com/calc/bmi01.html
### 3. https://reference.medscape.com/calculator/body-mass-index-bmi
### 4. https://www.nhs.uk/live-well/healthy-weight/bmi-calculator/#limitations-of-the-bmi
### 5. https://www.kaggle.com/yersever/500-person-gender-height-weight-bodymassindex#500_Person_Gender_Height_Weight_Index.csv
### 6. https://stattrek.com/probability-distributions/binomial.aspx
### 7. https://www.oecd.org/els/health-systems/Obesity-Update-2017.pdf