Project Intro:
This investigation uses synthetic data to explore the connections between the fitness level, heart rate, age and body mass index (BMI) of 100 individuals. The generated data will help us look at how these factors relate to one another and the impact each has on health and fitness.

In [22]:
#importing libraries for use in project code below
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [23]:
#Packages References:
#https://numpy.org
#https://pandas.pydata.org
#https://matplotlib.org
#https://seaborn.pydata.org

Variables & Distributions:

`workout_hours` - categorical/discrete variable (set values of 1 to 5 hours of exercise each week). This data is synthesised with `np.random.choice`, which picks at random from the five hour options.
`heart_rate` - continuous variable - as heart rate can take any value within a range. For reference, 60 - 200 bpm is the range of optimal heart rates for people between 20 and 80 (it changes depending on age, as we will hopefully see). The values here are generated using a normal distribution centred on a base/resting heart rate of 70 bpm with a standard deviation of 10.
`age` - discrete variable - whole-number ages between 18 and 80, generated with `np.random.randint`.
`bmi` - continuous variable like heart_rate - it is assumed that BMI decreases as exercise hours increase. The values follow a normal distribution similar to heart_rate, with a mean that depends on `workout_hours`.

In [24]:
#References Variables & Distributions
#https://pandas.pydata.org/docs/user_guide/categorical.html
#https://www.codecademy.com/learn/stats-variable-types/modules/stats-variable-types/cheatsheet#
#https://seaborn.pydata.org/tutorial/categorical.html
#https://www.codecademy.com/learn/mlef-exploratory-data-analysis-in-python/modules/mlef-variable-types-for-data-science/cheatsheet
#https://levelup.gitconnected.com/when-and-how-to-categorize-continuous-variables-in-python-f4c3b357dd46
#https://subscription.packtpub.com/book/data/9781789806311/1/ch01lvl1sec03/identifying-numerical-and-categorical-variables
#https://www.health.harvard.edu/heart-health/what-your-heart-rate-is-telling-you
#https://uihc.org/health-topics/target-heart-rate-exercise
#https://sparkbyexamples.com/python/how-to-use-numpy-random-choice-in-python/

Dataset Generation and Synthetic Data:

In [25]:
#Data creation for health dataset
#Hours of exercise per week
workout_hours = np.random.choice([1, 2, 3, 4, 5], size=100)

`np.random.choice` draws random samples from a 1-D array, here the five options for hours of exercise (1-5). `size=100` sets the shape of the output: one value for each of our 100 people/data points.
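
As a quick illustration of the call described above, the sketch below draws the 100 values and inspects the result. The `np.random.seed(0)` line is an assumption added only so the illustration is repeatable; it is not part of the notebook.

```python
import numpy as np

# Seed chosen only to make this illustration repeatable (assumption, not in the notebook)
np.random.seed(0)

# Draw 100 values, each picked uniformly at random from the five hour options
workout_hours = np.random.choice([1, 2, 3, 4, 5], size=100)

print(workout_hours.shape)       # (100,) : one value per person
print(np.unique(workout_hours))  # the distinct hour values that were drawn
```

By default every option is equally likely; `np.random.choice` also accepts a `p` argument if some hour values should be more common than others.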

In [26]:
#https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html

In [27]:
#Heart Rate
heart_rate = np.random.normal(70, 10, size=100)
#mean hr of 70 bpm (based on search) with a standard deviation of 10; this is also the base/resting heart rate.
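
To see what a normal distribution with mean 70 and standard deviation 10 produces, the sketch below summarises one sample. The seed and the helper names `within_2sd` and `below_60` are assumptions made for this illustration only.

```python
import numpy as np

# Seed chosen only to make this illustration repeatable (assumption, not in the notebook)
np.random.seed(0)

# Same call as in the notebook: mean 70 bpm, standard deviation 10, 100 values
heart_rate = np.random.normal(70, 10, size=100)

# Sample mean and standard deviation should sit close to 70 and 10
print(heart_rate.mean(), heart_rate.std())

# Share of values within two standard deviations of the mean (50 to 90 bpm)
within_2sd = np.mean((heart_rate >= 50) & (heart_rate <= 90))

# Share of values below 60 bpm; a normal distribution has no hard lower limit
below_60 = np.mean(heart_rate < 60)

print(within_2sd, below_60)
```

Roughly 95% of values from a normal distribution fall within two standard deviations of the mean, so most generated heart rates land between 50 and 90 bpm, and some fall below 60.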

In [28]:
#https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html

In [29]:
#age range
age = np.random.randint(18, 81, size=100)
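
The upper bound passed to `np.random.randint` is exclusive, which is why the call uses 81 to include age 80. A minimal check, with a seed added only as an assumption for repeatability:

```python
import numpy as np

# Seed chosen only to make this illustration repeatable (assumption, not in the notebook)
np.random.seed(0)

# low=18 is included, high=81 is excluded, so ages run from 18 to 80
age = np.random.randint(18, 81, size=100)

print(age.min(), age.max())  # both values lie between 18 and 80
print(age.dtype)             # an integer type, so ages are whole numbers
```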

In [30]:
#https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html

In [31]:
#bmi
meanbmi = 27 - (workout_hours * 0.5)
bmi = np.random.normal(meanbmi, 3, size=100)
#assumed that the more you work out, the lower your BMI
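
Because `meanbmi` is an array, `np.random.normal` draws each person's BMI around that person's own mean. The sketch below is one way to check that the assumed relationship shows up in the data; the seed and the names `check` and `summary` are assumptions introduced for this illustration.

```python
import numpy as np
import pandas as pd

# Seed chosen only to make this illustration repeatable (assumption, not in the notebook)
np.random.seed(0)

workout_hours = np.random.choice([1, 2, 3, 4, 5], size=100)

# Mean BMI drops by 0.5 for every extra hour of exercise: 26.5 at 1 hour, 24.5 at 5 hours
meanbmi = 27 - (workout_hours * 0.5)

# One draw per person, each centred on that person's mean, with standard deviation 3
bmi = np.random.normal(meanbmi, 3, size=100)

# Compare the target mean with the average generated BMI for each hour group
check = pd.DataFrame({"workout_hours": workout_hours, "meanbmi": meanbmi, "bmi": bmi})
summary = check.groupby("workout_hours")[["meanbmi", "bmi"]].mean()
print(summary)
```

With a standard deviation of 3 and only about 20 people per group, the group averages of `bmi` will wander around the target means, so the downward trend may look noisy in any single run.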

In [32]:
print(meanbmi)

[25.  25.  24.5 26.5 26.5 25.  24.5 24.5 24.5 24.5 26.5 25.  25.5 25.
 25.5 24.5 25.5 25.5 26.  25.  25.  24.5 26.5 24.5 26.  25.  25.5 25.
 24.5 26.  26.5 24.5 25.  26.5 25.5 25.  24.5 25.5 25.  25.5 24.5 25.5
 26.  26.  25.  26.  26.  25.5 24.5 24.5 25.5 25.5 26.  25.  25.5 25.5
 26.  25.5 25.  24.5 26.5 26.5 26.  24.5 26.5 24.5 25.  25.5 26.5 25.
 24.5 25.5 25.5 26.5 25.5 25.5 26.  26.5 25.  25.5 26.5 25.  24.5 25.5
 25.5 25.5 24.5 25.  25.5 25.5 24.5 25.  24.5 24.5 26.  25.  25.  26.
 25.  25. ]


In [33]:
print(bmi)

[19.2693846  15.46198842 27.57092738 30.96304074 22.14276411 26.98915539
 28.8965735  24.71773333 27.24704176 24.10520291 25.95859302 25.65172335
 19.31548083 25.1817803  25.25850198 26.29809543 23.85692887 27.49104482
 27.79282824 24.91696165 23.7264832  26.76833571 29.5414791  23.83698721
 26.22849387 20.88095808 25.84894929 27.27208078 21.38825124 25.98019181
 25.26643273 23.18127631 25.32406945 24.87684299 24.47864138 25.65848398
 21.8010856  26.56337823 26.07049608 26.55165791 20.33595582 24.01754611
 27.02066333 26.58459646 23.39187567 21.20669163 26.9166702  26.83945081
 24.64902114 24.49860013 20.48591634 26.30867957 25.40290726 25.36133085
 31.27475444 23.77603402 23.58069707 23.13849334 24.01988344 25.72311241
 29.72766856 27.22434473 28.86117895 25.15092623 18.26966574 33.31160916
 26.7849648  20.15385466 33.87990801 21.46591973 22.57348132 27.73764023
 26.11886691 30.80774802 27.90255122 26.94154615 22.2934054  28.00325634
 20.63984892 24.4376474  22.33909421 26.37057101 21

The two printouts above check that the variables are being generated as expected and that the kernel is working.

## Data Frame Creation

In [34]:
data = pd.DataFrame({
    'Exercise_Habits': workout_hours,
    'BMI': bmi,
    'Heart_Rate': heart_rate,
    'Age': age
})

In [35]:
#https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
#https://www.w3schools.com/python/pandas/pandas_dataframes.asp

#### Test that the dataframe was created and is functioning using the `head()` method. It displays the first 5 rows by default, but I have put 100 in between the brackets to see 100 data points:


In [36]:
print(data.head(100))

    Exercise_Habits        BMI  Heart_Rate  Age
0                 4  19.269385   75.275127   43
1                 4  15.461988   57.347680   20
2                 5  27.570927   83.316228   70
3                 1  30.963041   63.029273   55
4                 1  22.142764   73.562739   71
..              ...        ...         ...  ...
95                4  27.578819   67.258131   73
96                4  26.430634   56.701320   49
97                2  27.659520   76.162185   43
98                4  19.868436   65.740383   24
99                4  23.564872   76.561455   37

[100 rows x 4 columns]
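
The printout above is abbreviated with `..` because pandas shortens frames longer than its `display.max_rows` option (60 by default) when printing. The sketch below shows the default behaviour of `head()` and one way to get every row as text; the seed and the name `full_text` are assumptions added for this illustration.

```python
import numpy as np
import pandas as pd

# Seed chosen only to make this illustration repeatable (assumption, not in the notebook)
np.random.seed(0)

workout_hours = np.random.choice([1, 2, 3, 4, 5], size=100)
data = pd.DataFrame({
    'Exercise_Habits': workout_hours,
    'BMI': np.random.normal(27 - (workout_hours * 0.5), 3, size=100),
    'Heart_Rate': np.random.normal(70, 10, size=100),
    'Age': np.random.randint(18, 81, size=100),
})

print(len(data.head()))     # 5 : head() returns the first five rows by default
print(len(data.head(100)))  # 100 : all rows are returned, even if print() abbreviates them

# to_string() renders every row, regardless of the display.max_rows setting
full_text = data.head(100).to_string()
print(full_text)
```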


In [37]:
#https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html

#### Summary of data using the `describe()` method

In [38]:
#summarize data
print(data.describe())

       Exercise_Habits         BMI  Heart_Rate         Age
count       100.000000  100.000000  100.000000  100.000000
mean          3.320000   25.073694   70.295400   49.550000
std           1.317175    3.238593   10.379848   17.938377
min           1.000000   15.461988   50.499135   18.000000
25%           2.000000   23.339226   62.260099   34.750000
50%           3.000000   25.382119   69.522292   49.000000
75%           4.000000   26.997032   76.315368   67.000000
max           5.000000   33.879908   95.223160   80.000000
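
Alongside `describe()`, a correlation matrix gives a first numeric look at how the variables relate. This is a sketch rather than part of the original analysis; the seed and the name `corr` are assumptions. Given how the data is generated, only `BMI` and `Exercise_Habits` are linked by construction, so the other pairs should show correlations close to zero.

```python
import numpy as np
import pandas as pd

# Seed chosen only to make this illustration repeatable (assumption, not in the notebook)
np.random.seed(0)

workout_hours = np.random.choice([1, 2, 3, 4, 5], size=100)
data = pd.DataFrame({
    'Exercise_Habits': workout_hours,
    'BMI': np.random.normal(27 - (workout_hours * 0.5), 3, size=100),
    'Heart_Rate': np.random.normal(70, 10, size=100),
    'Age': np.random.randint(18, 81, size=100),
})

# Pearson correlation between every pair of columns (values between -1 and 1)
corr = data.corr()
print(corr.round(2))
```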


In [39]:
#https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html

## Plotting

Plotting various graphs to show the relationships between the variables in the synthetic data.
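
As a starting point, the sketch below draws two such graphs with matplotlib only: a box plot of BMI for each exercise-hour group and a scatter plot of heart rate against age. The choice of plots, the seed and the names `ax_box`, `ax_scatter`, `hours` and `groups` are assumptions for illustration; seaborn offers equivalent plots.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Seed chosen only to make this illustration repeatable (assumption, not in the notebook)
np.random.seed(0)

workout_hours = np.random.choice([1, 2, 3, 4, 5], size=100)
data = pd.DataFrame({
    'Exercise_Habits': workout_hours,
    'BMI': np.random.normal(27 - (workout_hours * 0.5), 3, size=100),
    'Heart_Rate': np.random.normal(70, 10, size=100),
    'Age': np.random.randint(18, 81, size=100),
})

fig, (ax_box, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Left: distribution of BMI within each exercise-hour group
hours = sorted(data['Exercise_Habits'].unique())
groups = [data.loc[data['Exercise_Habits'] == h, 'BMI'].to_numpy() for h in hours]
ax_box.boxplot(groups)
ax_box.set_xticklabels([str(h) for h in hours])
ax_box.set_xlabel('Exercise hours per week')
ax_box.set_ylabel('BMI')
ax_box.set_title('BMI by exercise hours')

# Right: heart rate against age (generated independently, so no trend is expected)
ax_scatter.scatter(data['Age'], data['Heart_Rate'], alpha=0.6)
ax_scatter.set_xlabel('Age')
ax_scatter.set_ylabel('Heart rate (bpm)')
ax_scatter.set_title('Heart rate vs age')

fig.tight_layout()
```

In a notebook the figure is displayed when the cell finishes; in a plain script, add `plt.show()` at the end.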