# Simulating a Dataset

## 1. Obesity - A Growing Problem

### Key Facts About Obesity [1]
- Worldwide obesity has nearly tripled since 1975.
- In 2016, more than 1.9 billion adults, 18 years and older, were overweight. Of these, over 650 million were obese.
- 39% of adults aged 18 years and over were overweight in 2016, and 13% were obese.
- Most of the world's population live in countries where overweight and obesity kill more people than underweight.
- 41 million children under the age of 5 were overweight or obese in 2016.
- Over 340 million children and adolescents aged 5-19 were overweight or obese in 2016.
- Obesity is preventable.

Obesity is increasingly affecting humans of all ages, shapes and sizes in the modern world.
According to irishhealth.com [2] and Medscape [3], weight classifications based on Body Mass Index (BMI) range from "Below Normal Weight" to "Class 3 Obese", as shown in the table below:

In [21]:
import pandas as pd

d = {'BMI (kg/m2)' : ['< 18', '18-24.9', '25-29.9', '30-34.9', '35-39.9', '> 40'],'Category' : ['Below Normal Weight', 'Normal Weight', 'Overweight', 'Class 1 Obese', 'Class 2 Obese', 'Class 3 Obese']}
df = pd.DataFrame(data = d, columns=['BMI (kg/m2)','Category'])
df

| | BMI (kg/m2) | Category |
|---|---|---|
| 0 | < 18 | Below Normal Weight |
| 1 | 18-24.9 | Normal Weight |
| 2 | 25-29.9 | Overweight |
| 3 | 30-34.9 | Class 1 Obese |
| 4 | 35-39.9 | Class 2 Obese |
| 5 | > 40 | Class 3 Obese |


When thinking about the risk factors that impact obesity levels, the main indicator used to determine a person's weight classification is their Body Mass Index (BMI).

A fuller assessment would take into account a person's general fitness, muscle mass and waist size as well as their height and weight, but for the purposes of this notebook the more generalised version of BMI, based on height and weight alone, will be used.

In [22]:
# example BMI calculation, for purposes of demonstrating method to be used in this notebook
height_cm = 179

# For the purposes of BMI calculation, height must be provided in metres
height_m = height_cm / 100
weight_kg = 120

# BMI calculation is weight (kg) divided by height(m)^2
bmi = weight_kg/height_m**2
bmi

# This example person, Mr. Joe Bloggs, with a weight of 120 kg and height of 1.79 m, has a BMI of ~37.45
# This puts Joe Bloggs in the "Class 2 Obese" category, according to the table given above

37.452014606285694
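
The lookup from a BMI value to its category in the table above can also be done in code. The following is a minimal sketch, assuming each range's lower bound is inclusive (the table does not say where a value of exactly 25 or 40 falls). `classify_bmi`, `BMI_LOWER_BOUNDS` and `BMI_CATEGORIES` are hypothetical names introduced here for illustration.

```python
from bisect import bisect_right

# Lower bounds (kg/m^2) of every category after the first, taken from the table above.
# Assumption: each lower bound is inclusive, so a BMI of exactly 25 counts as "Overweight".
BMI_LOWER_BOUNDS = [18, 25, 30, 35, 40]
BMI_CATEGORIES = ['Below Normal Weight', 'Normal Weight', 'Overweight',
                  'Class 1 Obese', 'Class 2 Obese', 'Class 3 Obese']


def classify_bmi(bmi):
    """Return the category label for a BMI value (hypothetical helper)."""
    # bisect_right counts how many lower bounds are <= bmi,
    # which is also the position of the matching category
    return BMI_CATEGORIES[bisect_right(BMI_LOWER_BOUNDS, bmi)]


# The worked example above: 120 kg at 1.79 m
print(classify_bmi(120 / 1.79 ** 2))  # Class 2 Obese
```

The position returned by `bisect_right` is the same as the row index of the table, so the function could return that number instead of the label if an integer category is preferred.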

Although BMI is a crude indicator of obesity that cannot be relied upon as accurate in all circumstances [4], it will suffice for the purposes of this exercise, which is to simulate a dataset.

The dataset used below was downloaded from [5] and contains Gender, Height (cm), Weight (kg) and a BMI category index (as per the index of the table above) for 500 people.

In [23]:
# Dataset is stored in csv form and was downloaded from [5] and saved locally before importing
# Read the csv data using Pandas, replacing the file's own header row with the column names below
# read_csv already returns a dataframe, which is automatically assigned an index column
df_gend_bmi = pd.read_csv('500_Person_Gender_Height_Weight_Index.csv', header=0, names=['Gender','Height_cm','Weight_kg', 'BMI_Category'])

# Shift the index so that the first record is numbered 1 rather than 0
df_gend_bmi.index = df_gend_bmi.index+1
# Display the data in the notebook
# Summary information at the end shows there are 500 records, each with 4 columns of data (not including the auto-generated index column)
df_gend_bmi

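| | Gender | Height_cm | Weight_kg | BMI_Category |
|---|---|---|---|---|
| 1 | Male | 174 | 96 | 4 |
| 2 | Male | 189 | 87 | 2 |
| 3 | Female | 185 | 110 | 4 |
| 4 | Female | 195 | 104 | 3 |
| 5 | Male | 149 | 61 | 3 |
| 6 | Male | 189 | 104 | 3 |
| 7 | Male | 147 | 92 | 5 |
| 8 | Male | 154 | 111 | 5 |
| 9 | Male | 174 | 90 | 3 |
| 10 | Female | 169 | 103 | 4 |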
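
Since the dataset stores height and weight but only a category index, a numeric BMI column can be derived with the same formula used for the single-person example earlier. The sketch below applies it to the first three rows displayed above; `sample` and the `BMI` column name are assumptions made for illustration, not part of the downloaded file.

```python
import pandas as pd

# First three rows of the dataset, as displayed above
sample = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female'],
    'Height_cm': [174, 189, 185],
    'Weight_kg': [96, 87, 110],
})

# BMI = weight (kg) / height (m)^2, applied to whole columns at once
sample['BMI'] = sample['Weight_kg'] / (sample['Height_cm'] / 100) ** 2

# Round only the new column for display
print(sample.round({'BMI': 2}))
```

Working on whole columns avoids an explicit loop over the 500 records.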


To ensure that none of the data in the dataframe has been corrupted or is incorrect, some basic validation checks are carried out below:

In [24]:
df_gend_bmi.dtypes

Gender          object
Height_cm        int64
Weight_kg        int64
BMI_Category     int64
dtype: object

In [44]:
# Take a subset of the Dataframe df_gend_bmi where the gender is Male only
# Save a count of this subset as 'male'
df_male = df_gend_bmi.loc[df_gend_bmi['Gender'] == 'Male']
male = len(df_male.index) # https://stackoverflow.com/questions/15943769/how-do-i-get-the-row-count-of-a-pandas-dataframe

# Similar for 'female'
df_female = df_gend_bmi.loc[df_gend_bmi['Gender'] == 'Female']
female = len(df_female.index)

# displays that there are 245 male, 255 female giving a total of 500 correct genders out of 500.
print(male, female, male+female)

245 255 500
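
The same gender check can be done in a single call with `value_counts()`, which also reveals any unexpected labels instead of silently leaving them out of the total. This is a minimal sketch on a small hypothetical sample, not the real dataset; the misspelt label is deliberate.

```python
import pandas as pd

# Hypothetical sample; the last label is deliberately misspelt
genders = pd.Series(['Male', 'Female', 'Female', 'Male', 'female'], name='Gender')

# Count how many times each distinct label occurs
counts = genders.value_counts()
print(counts)

# Any label other than the two expected ones points to a data-entry problem
unexpected = set(counts.index) - {'Male', 'Female'}
print(unexpected)  # {'female'}
```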


In [61]:
# Return the maximum/minimum values of each column as two Series indexed by column name
max_vals, min_vals = df_gend_bmi.max(), df_gend_bmi.min()

# Look the values up by column label rather than by position
height_max, height_min = max_vals['Height_cm'], min_vals['Height_cm']
weight_max, weight_min = max_vals['Weight_kg'], min_vals['Weight_kg']
bmi_max, bmi_min = max_vals['BMI_Category'], min_vals['BMI_Category']

print("height range: ", height_min, "to", height_max, "cm")
print("weight range: ", weight_min, "to", weight_max, "kg")
print("BMI category range: ", bmi_min, "to", bmi_max)

height range:  140 to 199 cm
weight range:  50 to 160 kg
BMI category range:  0 to 5


There are no numeric values outside what seem to be reasonable upper and lower bounds for each column of data.
The next step is to check that every value lies between these bounds, which rules out any empty or non-numeric values:

In [79]:
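# Select the rows whose height lies within the bounds found above
# Each comparison needs its own parentheses because & binds more tightly than >= and <=
df_height = df_gend_bmi.loc[(df_gend_bmi.Height_cm >= height_min) & (df_gend_bmi.Height_cm <= height_max)]
print(len(df_height.index), "of", len(df_gend_bmi.index), "rows have a height within the bounds")

# Count the non-empty values in each column
df_gend_bmi.count()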
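
Range comparisons only work on numeric columns, so empty or non-numeric entries can be surfaced by coercing the column to numbers first. The following is a minimal sketch on a hypothetical column, using the height bounds of 140 and 199 cm reported above; the sample values are invented for illustration.

```python
import pandas as pd

# Hypothetical column with one non-numeric entry, one empty entry and one out-of-range value
heights = pd.Series(['174', '189', 'n/a', None, '250'], name='Height_cm')

# Coerce to numbers; anything that cannot be parsed becomes NaN
numeric = pd.to_numeric(heights, errors='coerce')

# between() is inclusive at both ends by default, and NaN never counts as in range
in_range = numeric.between(140, 199)

print(numeric.isna().sum())  # 2
print(heights[~in_range])    # the entries that are empty, non-numeric or out of range
```

A column that passes this check has as many in-range rows as it has rows in total.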

From a quick glance at the data above, row 493 contains an Index value of

To identify whether the data is biased towards one gender, I will:

1. Find what proportion of the entire dataset is male and what proportion is female
2. Identify what proportion of each gender is obese (Index values {3, 4, 5})
3. Compare these proportions to identify whether a significant difference exists between genders

Gender    239
Height    239
Weight    239
Index     239
dtype: int64

In [16]:
df_female_bmi = df_gend_bmi.loc[df_gend_bmi['Gender'] == 'Female']
df_female_bmi.count()

Gender    248
Height    248
Weight    248
Index     248
dtype: int64
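
The three steps listed earlier for checking gender bias can be sketched as follows. This is an illustration on a small hypothetical sample rather than the real dataset, and the `Obese` column, `gender_share`, `obese_share` and `difference` are names introduced here. It computes only the raw difference in proportions; deciding whether that difference is significant would need a statistical test, which is not shown.

```python
import pandas as pd

# Hypothetical sample, not the real dataset
df = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Female', 'Male'],
    'BMI_Category': [4, 2, 5, 3, 1, 4, 2, 3],
})

# Step 1: proportion of each gender in the whole sample
gender_share = df['Gender'].value_counts(normalize=True)

# Step 2: proportion of each gender that is obese (category values 3, 4 or 5)
df['Obese'] = df['BMI_Category'].isin([3, 4, 5])
obese_share = df.groupby('Gender')['Obese'].mean()

# Step 3: raw difference between the two proportions
difference = obese_share['Male'] - obese_share['Female']

print(gender_share)
print(obese_share)
print(difference)  # 0.25
```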

## 2. Variables to be Considered

### Risk Factors


### Prediction Variable
Body Mass Index (BMI)



## 3. Data Synthesis

## 4. References

### 1.  http://www.who.int/news-room/fact-sheets/detail/obesity-and-overweight
### 2. http://www.irishhealth.com/calc/bmi01.html
### 3. https://reference.medscape.com/calculator/body-mass-index-bmi
### 4. https://www.nhs.uk/live-well/healthy-weight/bmi-calculator/#limitations-of-the-bmi
### 5. https://www.kaggle.com/yersever/500-person-gender-height-weight-bodymassindex#500_Person_Gender_Height_Weight_Index.csv