# Simulating a Dataset

## 1. Obesity - A Growing Problem

### Key Facts About Obesity [1]
- Worldwide obesity has nearly tripled since 1975
- In 2016, more than 1.9 billion adults, 18 years and older, were overweight. Of these over 650 million were obese.
- 39% of adults aged 18 years and over were overweight in 2016, and 13% were obese.
- Most of the world's population live in countries where overweight and obesity kills more people than underweight.
- 41 million children under the age of 5 were overweight or obese in 2016.
- Over 340 million children and adolescents aged 5-19 were overweight or obese in 2016.
- Obesity is preventable.

Obesity is increasingly affecting humans of all ages, shapes and sizes in the modern world.
According to irishhealth.com [2] and medscape [3], there are a number of classifications, based on Body Mass Index (BMI), ranging from "Below Normal Weight" to "Class 3 Obese", as seen in the table below:

In [2]:
import pandas as pd

# here we are creating a dataset to store in the variable 'd'
# we are creating 2 columns 'BMI (kg/m2)' and 'Category'
# each columns then has a data array associated with it
d = {'BMI (kg/m2)' : ['< 18', '18-24.9', '25-29.9', '30-34.9', '35-39.9', '> 40'],'Category' : ['Below Normal Weight', 'Normal Weight', 'Overweight', 'Class 1 Obese', 'Class 2 Obese', 'Class 3 Obese']}

# now we use pd.DataFrame to import the created dataset and display the columns noted
df = pd.DataFrame(data = d, columns=['BMI (kg/m2)','Category'])
df

Unnamed: 0,BMI (kg/m2),Category
0,< 18,Below Normal Weight
1,18-24.9,Normal Weight
2,25-29.9,Overweight
3,30-34.9,Class 1 Obese
4,35-39.9,Class 2 Obese
5,> 40,Class 3 Obese


When thinking about the risk factors that impact obesity levels, the main indicator used to determine a person's weight classification is their Body Mass Index (BMI).

BMI should ideally be calculated taking into account a person's general fitness, muscle mass and waist size, as well as their height and weight. For the purposes of this notebook, however, the more generalised version of BMI, based on height and weight alone, will be used.

In [3]:
# example BMI calculation, for purposes of demonstrating method to be used in this notebook
height_cm = 179

# For the purposes of BMI calculation, height must be provided in metres
height_m = height_cm / 100
weight_kg = 120

# BMI calculation is weight (kg) divided by height(m)^2
bmi = weight_kg/height_m**2
bmi

# This example person, Mr. Joe Bloggs, with a weight of 120 kg and height of 1.79 m, has a BMI of ~37.45
# This puts Joe Bloggs in the "Class 2 Obese" category, according to the table given above

37.452014606285694

BMI is a crude indicator of obesity and cannot be relied upon as accurate in all circumstances [4]. For the purposes of this exercise, however, it is sufficient to allow the simulation of a dataset.

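The dataset used below was downloaded from [5] and contains Gender, Height (cm), Weight (kg) and a BMI category index (an integer from 0 to 5) for 500 people. Note that the sample rows shown below suggest the dataset's index uses its own cut-offs, which do not line up exactly with the table above: for example, a person of 189 cm and 87 kg has a BMI of about 24.4 but is coded 2.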

In [4]:
# Dataset is stored in csv form and was downloaded [5] and saved locally before importing
# First read the csv data using Pandas
gend_bmi_data = pd.read_csv('500_Person_Gender_Height_Weight_Index.csv', header=None, skiprows=1, names=['Gender','Height_cm','Weight_kg', 'BMI_Category'])

# read_csv already returns a Pandas dataframe; take a copy to hold data for investigation
df_gend_bmi = gend_bmi_data.copy()
# shift the automatically assigned index so that it starts at 1 rather than 0
df_gend_bmi.index = df_gend_bmi.index+1
# Display the data in the notebook
# Summary information at the end shows there are 500 records, each with 4 columns of data (not including the auto-generated index column)
df_gend_bmi

Unnamed: 0,Gender,Height_cm,Weight_kg,BMI_Category
1,Male,174,96,4
2,Male,189,87,2
3,Female,185,110,4
4,Female,195,104,3
5,Male,149,61,3
6,Male,189,104,3
7,Male,147,92,5
8,Male,154,111,5
9,Male,174,90,3
10,Female,169,103,4


To check that none of the data in the dataset/dataframe has been corrupted or is incorrect, some basic validation checks are carried out below:

In [5]:
# displays the datatypes of Dataframe columns. This is dictated by the type of data stored in the column
df_gend_bmi.dtypes

Gender          object
Height_cm        int64
Weight_kg        int64
BMI_Category     int64
dtype: object

In [6]:
# Take a subset of the Dataframe df_gend_bmi where the gender is Male only
# Save a count of this subset as 'male'
df_male = df_gend_bmi.loc[df_gend_bmi['Gender'] == 'Male']
male = len(df_male.index) # https://stackoverflow.com/questions/15943769/how-do-i-get-the-row-count-of-a-pandas-dataframe

# Similar for 'female'
df_female = df_gend_bmi.loc[df_gend_bmi['Gender'] == 'Female']
female = len(df_female.index)

total = male+female

# We can now calculate the percentage split of the dataset between genders
d = {'Male' : [(male*100)/total],'Female' : [(female*100)/total]}
df = pd.DataFrame(data = d, index = ['%'], columns=['Male','Female'])
df

Unnamed: 0,Male,Female
%,49.0,51.0
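
The same percentage split can be obtained more compactly with `Series.value_counts(normalize=True)`. The sketch below runs on a small hypothetical `Gender` column rather than the downloaded file, so its numbers are illustrative only:

```python
import pandas as pd

# Hypothetical stand-in for df_gend_bmi['Gender'] (not the real data)
gender = pd.Series(['Male', 'Female', 'Female', 'Male', 'Female'], name='Gender')

# normalize=True returns proportions; multiply by 100 for percentages
gender_pct = gender.value_counts(normalize=True) * 100
print(gender_pct)
```

Applied to the real dataframe, the equivalent call would be `df_gend_bmi['Gender'].value_counts(normalize=True) * 100`.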


In [7]:
# return the maximum/minimum values in each column, stored as Series indexed by column name
max_vals, min_vals = df_gend_bmi.max(), df_gend_bmi.min()

# split the values out into individual variables, looking each one up by column name
# (positional lookups such as max_vals[1] are deprecated on label-indexed Series)
height_max, height_min = max_vals['Height_cm'], min_vals['Height_cm']
weight_max, weight_min = max_vals['Weight_kg'], min_vals['Weight_kg']
bmi_max, bmi_min = max_vals['BMI_Category'], min_vals['BMI_Category']

# use data in variables to create new dataframe displaying max/min values for each category of data
d = {'Min' : [height_min, weight_min, bmi_min],'Max' : [height_max, weight_max, bmi_max]}
df = pd.DataFrame(data = d, index=['Height', 'Weight', 'BMI'], columns=['Min','Max'])
df

Unnamed: 0,Min,Max
Height,140,199
Weight,50,160
BMI,0,5


There are no numeric values outside what seem to be reasonable upper and lower bounds for each column of data.
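
Two further checks that are cheap to run are a count of missing values and a scan for unexpected labels. The following is a minimal sketch on three rows copied from the output above; the variable names (`sample`, `bad_gender`, `bad_category`) are introduced here for illustration and are not part of the original notebook:

```python
import pandas as pd

# Three rows copied from the displayed output, using the same column names
sample = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male'],
    'Height_cm': [174, 185, 149],
    'Weight_kg': [96, 110, 61],
    'BMI_Category': [4, 4, 3],
})

# Count missing values per column
missing = sample.isna().sum()

# Flag any rows whose labels fall outside the expected sets
bad_gender = ~sample['Gender'].isin(['Male', 'Female'])
bad_category = ~sample['BMI_Category'].between(0, 5)

print(missing)
print('Rows with unexpected values:', int((bad_gender | bad_category).sum()))
```

The same three checks could be run on `df_gend_bmi` directly once the csv file has been loaded.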

## 2. Variables to be Considered

### Gender
For simplicity, and in line with how the source dataset is split above, we will assume that there are two gender types available for selection:
- Male
- Female

Since we are randomly selecting our data, the probability of selecting a male or a female on each trial is assumed to be constant at 0.5, close to the 49/51 split seen in the source dataset.

Each such trial has only two possible outcomes, so the number of males (or females) in a sample of fixed size follows a binomial distribution [6].
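
As a rough sketch of that distribution in code, the count of males can be drawn in one step with NumPy's `Generator.binomial`. The seed and the sample size of 100 are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seed chosen arbitrarily for repeatability

n_people = 100   # number of selections (illustrative)
p_male = 0.5     # probability of selecting a male on each trial

# One draw from Binomial(n, p): the number of males among n_people selections
n_male = rng.binomial(n=n_people, p=p_male)
n_female = n_people - n_male

print(f'Male: {n_male}, Female: {n_female}')
```

This yields only the counts. Sampling the labels themselves, one per person, produces the same distribution of counts while also giving a row-by-row gender column.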

In [8]:
# numpy will allow generation of random data for multiple probability distributions
import numpy as np

# declare and initiate string array variable
a = ['Male', 'Female']

# create dataframe using data sampled randomly from the list of options declared above
df_rand_gender = pd.DataFrame(data=np.random.choice(a, 100, replace=True))
df_rand_gender.index = df_rand_gender.index+1

# split above out into two gender specific dataframes, using loc and conditions
df_rand_male = df_rand_gender.loc[df_rand_gender[0] == 'Male']
rand_male = len(df_rand_male.index)

df_rand_female = df_rand_gender.loc[df_rand_gender[0] == 'Female']
rand_female = len(df_rand_female.index)

total = rand_female+rand_male
male_pct, female_pct = (rand_male*100)/total, (rand_female*100)/total
d = {'Male' : [male_pct],'Female' : [female_pct]}
df = pd.DataFrame(data = d, index = ['Synthesized %'], columns=['Male','Female'])
df

Unnamed: 0,Male,Female
Synthesized %,48.0,52.0


In [9]:
# run this code cell if you want to see the dataset containing randomly generated Gender data
df_rand_gender

Unnamed: 0,0
1,Male
2,Male
3,Female
4,Male
5,Female
6,Male
7,Female
8,Female
9,Female
10,Female


We will now split out the synthesized array into two gender-specific arrays:

In [11]:
# select the male rows and renumber them from 1 (drop=True discards the old index)
df_rand_male = df_rand_gender.loc[df_rand_gender[0] == 'Male'].reset_index(drop=True)
df_rand_male.index = df_rand_male.index+1

# same again for the female rows
df_rand_female = df_rand_gender.loc[df_rand_gender[0] == 'Female'].reset_index(drop=True)
df_rand_female.index = df_rand_female.index+1

### Height & Weight
For these variables we will generate an array of random integer values.
We will use summary statistics, from our sample set of 500 records, to generate our random height and weight data. We already know:
- the range of heights recorded is 140cm - 199cm
- the range of weights recorded is 50kg - 160kg.

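As a simplifying assumption, we will treat height and weight as approximately normally distributed about their mean values. (The common rule of thumb about samples of more than 30 datapoints concerns the distribution of sample means, not the raw data, so this remains a modelling assumption rather than a guarantee.)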

Before we can generate a random sample of normally distributed data, we need:
- the mean value of each set of data
- the standard deviation of each set of data

In [12]:
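# restrict to the numeric columns: the mean of the text 'Gender' column is undefined
num_cols = ['Height_cm', 'Weight_kg']
mean_data = gend_bmi_data[num_cols].mean()
# ddof=0 gives the population standard deviation, matching np.std's default
sdev_data = gend_bmi_data[num_cols].std(ddof=0)

# truncate to whole numbers, looking each value up by column name
mean_h = int(mean_data['Height_cm'])
mean_w = int(mean_data['Weight_kg'])
sdev_h = int(sdev_data['Height_cm'])
sdev_w = int(sdev_data['Weight_kg'])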

d = {'Height' : [mean_h, sdev_h],'Weight' : [mean_w, sdev_w]}
df = pd.DataFrame(data = d, index = ['Mean', 'Std Deviation'], columns=['Height','Weight'])
df

Unnamed: 0,Height,Weight
Mean,169,106
Std Deviation,16,32


From the above, we now have:
- mean height: 169cm
- mean weight: 106kg
- std. dev. height: 16cm
- std. dev. weight: 32kg

Now we can generate a random sample of data for each variable.
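
Because a normal distribution is unbounded, a model with these parameters will place a few percent of its draws outside the ranges observed in the source data. The sketch below estimates that share with `scipy.stats.norm`, using the rounded figures listed above; the dictionary layout and variable names are introduced here for illustration:

```python
from scipy import stats

# Rounded summary statistics and observed ranges reported above
params = {
    'Height_cm': {'mean': 169, 'sd': 16, 'low': 140, 'high': 199},
    'Weight_kg': {'mean': 106, 'sd': 32, 'low': 50, 'high': 160},
}

outside = {}
for name, p in params.items():
    dist = stats.norm(loc=p['mean'], scale=p['sd'])
    # Probability mass below the observed minimum plus above the observed maximum
    outside[name] = dist.cdf(p['low']) + dist.sf(p['high'])
    print(f"{name}: {outside[name]:.1%} of draws expected outside the observed range")
```

Whether that matters depends on the use case; it is worth keeping in mind when inspecting any synthesized values that look implausible.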

However, there may be a relationship between a person's height and their weight. Before we simply generate two random samples of numerical data, we should investigate this relationship and employ it in generating our data. Otherwise, the chances are that our BMI data will not make sense, as there will be no relationship between our two random sets of data.

What we can do here is to define a relationship stating that a person's weight is dependent in some way on their height.
To define that relationship mathematically, we will revert to the original dataset: '500_Person_Gender_Height_Weight_Index.csv'.

Next we will perform linear regression analysis on the height and weight data to define a regression equation, which will form the basis of the relationship between the two variables:

In [13]:
# https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.linregress.html
from scipy import stats

reg_slope, reg_intercept, rvalue, pvalue, stderr = stats.linregress(df_gend_bmi['Height_cm'], df_gend_bmi['Weight_kg'])

d = {'Regression Stats' : [reg_slope, reg_intercept, rvalue]}
df = pd.DataFrame(data = d, index = ['Slope', 'Intercept', 'Correlation Coefficient'], columns=['Regression Stats'])
df

Unnamed: 0,Regression Stats
Slope,0.000882
Intercept,105.850131
Correlation Coefficient,0.000446


The above figures suggest that:
1. <b>slope:</b> the regression equation is that of a virtually horizontal line. The slope is extremely small, which means that the line is rising very slowly. In terms of our variables, this means that our height variable is having virtually no effect on the corresponding weight, or that the weight is more or less a constant value regardless of height.
2. <b>intercept:</b> because the slope is so close to zero, the predicted weight stays at around 105 to 106 kg, regardless of what height is input to the equation.
3. <b>correlation coefficient:</b> this value is so close to zero as to suggest that there is no correlation whatsoever between the height and corresponding weight values in the dataset.
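
To make the first two points concrete, the fitted line can be evaluated at the shortest and tallest heights in the dataset. This is a small worked sketch using the slope and intercept reported above; the helper `predicted_weight` is introduced here for illustration:

```python
# Slope and intercept as reported in the regression output above
reg_slope = 0.000882
reg_intercept = 105.850131

def predicted_weight(height_cm):
    """Weight (kg) predicted by the fitted line for a given height (cm)."""
    return reg_slope * height_cm + reg_intercept

# Evaluate at the shortest and tallest heights in the dataset
w_short = predicted_weight(140)
w_tall = predicted_weight(199)
print(f'140 cm -> {w_short:.2f} kg, 199 cm -> {w_tall:.2f} kg')
```

Across the full 59 cm range of heights, the predicted weight changes by only about 0.05 kg.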

What we can do is to check the same regression stats on a gender specific basis.
First for males:

In [14]:
reg_slope, reg_intercept, rvalue, pvalue, stderr = stats.linregress(df_male['Height_cm'], df_male['Weight_kg'])

d = {'Regression Stats' : [reg_slope, reg_intercept, rvalue]}
df = pd.DataFrame(data = d, index = ['Slope', 'Intercept', 'Correlation Coefficient'], columns=['Regression Stats'])
df

Unnamed: 0,Regression Stats
Slope,-0.048746
Intercept,114.583977
Correlation Coefficient,-0.026133


and for females:

In [15]:
reg_slope, reg_intercept, rvalue, pvalue, stderr = stats.linregress(df_female['Height_cm'], df_female['Weight_kg'])

d = {'Regression Stats' : [reg_slope, reg_intercept, rvalue]}
df = pd.DataFrame(data = d, index = ['Slope', 'Intercept', 'Correlation Coefficient'], columns=['Regression Stats'])
df

Unnamed: 0,Regression Stats
Slope,0.057837
Intercept,95.852672
Correlation Coefficient,0.027569


In both cases it is clear that there is very weak or no relationship between the variables. Therefore we cannot use this to generate weight data connected to height.
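
One way to quantify how weak these relationships are is to square each correlation coefficient, which gives the share of the variance in weight that a straight-line fit on height would explain. A short sketch using the three coefficients reported above:

```python
# Correlation coefficients reported in the three regression outputs above
r_values = {'All': 0.000446, 'Male': -0.026133, 'Female': 0.027569}

# r squared is the share of variance in weight explained by a linear fit on height
r_squared = {group: r ** 2 for group, r in r_values.items()}

for group, r2 in r_squared.items():
    print(f'{group}: r^2 = {r2:.6f}')
```

In every case the figure is below 0.001, that is, under 0.1% of the variance.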

What we can do now is to generate summary statistics for the gender specific datasets and use these parameters to generate random height and weight data by sampling from a normal probability distribution.

In [16]:
# return the maximum/minimum values in each column, stored as Series indexed by column name
male_max, female_max, male_min, female_min = df_male.max(), df_female.max(), df_male.min(), df_female.min()

# look each value up by column name rather than by position
m_height_max, m_height_min, f_height_max, f_height_min = male_max['Height_cm'], male_min['Height_cm'], female_max['Height_cm'], female_min['Height_cm']
m_weight_max, m_weight_min, f_weight_max, f_weight_min = male_max['Weight_kg'], male_min['Weight_kg'], female_max['Weight_kg'], female_min['Weight_kg']
m_bmi_max, m_bmi_min, f_bmi_max, f_bmi_min = male_max['BMI_Category'], male_min['BMI_Category'], female_max['BMI_Category'], female_min['BMI_Category']

d = {'Height (m)' : [m_height_min, m_height_max],'Height (f)' : [f_height_min, f_height_max], 'Weight (m)' : [m_weight_min, m_weight_max],'Weight (f)' : [f_weight_min, f_weight_max], 'BMI (m)' : [m_bmi_min, m_bmi_max],'BMI (f)' : [f_bmi_min, f_bmi_max]}
df = pd.DataFrame(data = d, index=['Min', 'Max'], columns=['Height (m)','Height (f)', 'Weight (m)','Weight (f)', 'BMI (m)','BMI (f)'])
df

Unnamed: 0,Height (m),Height (f),Weight (m),Weight (f),BMI (m),BMI (f)
Min,140,140,50,50,0,0
Max,199,199,160,160,5,5


According to the summary above, it seems that there is no difference in the max/min values for height and weight, regardless of gender.

Next we will gather the mean and standard deviation for each gender specific set:

In [17]:
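# restrict to the numeric columns and use ddof=0 to match np.std's default
num_cols = ['Height_cm', 'Weight_kg']
mean_male = df_male[num_cols].mean()
mean_female = df_female[num_cols].mean()
sdev_male = df_male[num_cols].std(ddof=0)
sdev_female = df_female[num_cols].std(ddof=0)

# truncate to whole numbers, looking each value up by column name
mean_ht_m = int(mean_male['Height_cm'])
mean_wt_m = int(mean_male['Weight_kg'])
mean_ht_f = int(mean_female['Height_cm'])
mean_wt_f = int(mean_female['Weight_kg'])
sdev_ht_m = int(sdev_male['Height_cm'])
sdev_wt_m = int(sdev_male['Weight_kg'])
sdev_ht_f = int(sdev_female['Height_cm'])
sdev_wt_f = int(sdev_female['Weight_kg'])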

d = {'Height' : [mean_ht_m, sdev_ht_m, mean_ht_f, sdev_ht_f],'Weight' : [mean_wt_m, sdev_wt_m, mean_wt_f, sdev_wt_f]}
df = pd.DataFrame(data = d, index = ['Mean (m)', 'Std Deviation (m)', 'Mean (f)', 'Std Deviation (f)'], columns=['Height','Weight'])
df

Unnamed: 0,Height,Weight
Mean (m),169,106
Std Deviation (m),17,31
Mean (f),170,105
Std Deviation (f),15,32


Using the above measures, we will generate normally distributed height and weight values for each gender-specific synthesized dataset:

In [18]:
# sample sizes are the number of synthesized people of each gender
# (the percentages only equal these counts because the sample size is 100)
n_male, n_female = len(df_rand_male), len(df_rand_female)

rand_ht_m = np.random.normal(mean_ht_m, sdev_ht_m, n_male)
rand_ht_m = [int(x) for x in rand_ht_m]

rand_wt_m = np.random.normal(mean_wt_m, sdev_wt_m, n_male)
rand_wt_m = [int(x) for x in rand_wt_m]

df_rand_ht_m = pd.DataFrame(rand_ht_m)
df_rand_ht_m.index = df_rand_ht_m.index+1

df_rand_wt_m = pd.DataFrame(rand_wt_m)
df_rand_wt_m.index = df_rand_wt_m.index+1

rand_ht_f = np.random.normal(mean_ht_f, sdev_ht_f, n_female)
rand_ht_f = [int(x) for x in rand_ht_f]

rand_wt_f = np.random.normal(mean_wt_f, sdev_wt_f, n_female)
rand_wt_f = [int(x) for x in rand_wt_f]

df_rand_ht_f = pd.DataFrame(rand_ht_f)
df_rand_ht_f.index = df_rand_ht_f.index+1

df_rand_wt_f = pd.DataFrame(rand_wt_f)
df_rand_wt_f.index = df_rand_wt_f.index+1
df_rand_ht_m

Unnamed: 0,0
1,176
2,176
3,124
4,183
5,151
6,180
7,190
8,142
9,171
10,172
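
The sample above includes a height of 124 cm, below the 140 cm minimum observed in the source data, because a normal distribution has no bounds. If that matters, one option (not used in this notebook) is to draw from a truncated normal instead. The following is a sketch with `scipy.stats.truncnorm`, using the male height parameters reported above; the seed and the sample size of 48 are assumptions for illustration:

```python
import numpy as np
from scipy import stats

# Male height parameters and observed range reported above
mean_ht, sdev_ht = 169, 17
low, high = 140, 199

# truncnorm takes its bounds in standard-deviation units relative to loc
a, b = (low - mean_ht) / sdev_ht, (high - mean_ht) / sdev_ht

rng = np.random.default_rng(seed=1)  # arbitrary seed for repeatability
heights = stats.truncnorm.rvs(a, b, loc=mean_ht, scale=sdev_ht, size=48, random_state=rng)
heights = heights.astype(int)  # truncate to whole centimetres, as in the cell above

print(heights.min(), heights.max())
```

Truncation keeps every value within 140 to 199 cm, at the cost of a slightly smaller spread than the untruncated standard deviation would suggest.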


Next, we will pull together our synthesized gender, height and weight samples into a single Pandas dataframe for each gender:

In [19]:
BMI_Data_Male = pd.concat([df_rand_male, df_rand_ht_m, df_rand_wt_m], axis=1, ignore_index=True)
BMI_Data_Male.columns = ['Gender', 'Height_cm', 'Weight_kg']

BMI_Data_Female = pd.concat([df_rand_female, df_rand_ht_f, df_rand_wt_f], axis=1, ignore_index=True)
BMI_Data_Female.columns = ['Gender', 'Height_cm', 'Weight_kg']

Now to combine both genders back into one single data frame:

In [20]:
BMI_Data_Full = pd.concat([BMI_Data_Male, BMI_Data_Female], axis=0, ignore_index=True)
BMI_Data_Full.index = BMI_Data_Full.index+1
BMI_Data_Full

Unnamed: 0,Gender,Height_cm,Weight_kg
1,Male,176,103
2,Male,176,121
3,Male,124,146
4,Male,183,124
5,Male,151,83
6,Male,180,126
7,Male,190,93
8,Male,142,123
9,Male,171,34
10,Male,172,93


### Body Mass Index (BMI)
As discussed earlier, an individual's BMI is a simple calculation based on that individual's height and weight.
This variable is linked to the height and weight variables, already discussed, by the following equation:

$$\text{BMI} = \frac{\text{weight (kg)}}{\text{height (m)}^{2}}$$
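
Because our synthesized heights are stored in centimetres, the conversion to metres can be folded into the formula. Writing $w$ for weight in kilograms and $h_{cm}$ for height in centimetres (symbols introduced here for illustration):

$$
\text{BMI} = \frac{w}{\left(h_{cm}/100\right)^{2}} = \frac{w \times 100^{2}}{h_{cm}^{2}}
$$

For the first synthesized row above (103 kg, 176 cm), this gives $103 \times 10000 / 176^{2} \approx 33.25$.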

In [21]:
BMI_Data_Full['BMI'] = (BMI_Data_Full.Weight_kg*(100**2))/(BMI_Data_Full.Height_cm**2) #https://stackoverflow.com/questions/45393123/adding-calculated-column-in-pandas
BMI_Data_Full

Unnamed: 0,Gender,Height_cm,Weight_kg,BMI
1,Male,176,103,33.251550
2,Male,176,121,39.062500
3,Male,124,146,94.953174
4,Male,183,124,37.027084
5,Male,151,83,36.401912
6,Male,180,126,38.888889
7,Male,190,93,25.761773
8,Male,142,123,60.999802
9,Male,171,34,11.627509
10,Male,172,93,31.435911
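
Each BMI value can then be mapped back to a category from the classification table at the start of the notebook. This is a sketch using `pd.cut` on a few of the values shown above; how values of exactly 18 and 40 are treated is an assumption, since the table leaves those boundaries open:

```python
import numpy as np
import pandas as pd

# A few BMI values copied from the output above
bmi = pd.Series([33.251550, 39.062500, 94.953174, 25.761773, 11.627509], name='BMI')

# Bin edges follow the classification table; each bin includes its lower edge
# (boundary handling at exactly 18 and 40 is an assumption)
edges = [-np.inf, 18, 25, 30, 35, 40, np.inf]
labels = ['Below Normal Weight', 'Normal Weight', 'Overweight',
          'Class 1 Obese', 'Class 2 Obese', 'Class 3 Obese']

category = pd.cut(bmi, bins=edges, labels=labels, right=False)
print(pd.DataFrame({'BMI': bmi, 'Category': category}))
```

Applied to the full dataframe, the same call on `BMI_Data_Full['BMI']` would add a category column alongside the calculated BMI.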


## 4. References

1. http://www.who.int/news-room/fact-sheets/detail/obesity-and-overweight
2. http://www.irishhealth.com/calc/bmi01.html
3. https://reference.medscape.com/calculator/body-mass-index-bmi
4. https://www.nhs.uk/live-well/healthy-weight/bmi-calculator/#limitations-of-the-bmi
5. https://www.kaggle.com/yersever/500-person-gender-height-weight-bodymassindex#500_Person_Gender_Height_Weight_Index.csv
6. https://stattrek.com/probability-distributions/binomial.aspx
7. https://www.oecd.org/els/health-systems/Obesity-Update-2017.pdf