Project Intro:
This investigation uses synthetic data to explore the connections between the fitness level, heart rate, age and body mass index (BMI) of 100 individuals. The data will help us look at how these factors relate to one another and the impact each has on health and fitness.

In [None]:
#importing libraries for use in project code below
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#Suppresses the FutureWarning experienced on line 250 (see the Stack Overflow reference below).
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
print('x' in np.arange(5))   #returns False, without Warning


: 

In [None]:
#Packages References:
#https://numpy.org
#https://pandas.pydata.org
#https://matplotlib.org
#https://seaborn.pydata.org
#https://stackoverflow.com/questions/40659212/futurewarning-elementwise-comparison-failed-returning-scalar-but-in-the-futur

: 

Variables & Distributions:

`workout_hours` - categorical/discrete variable (set values indicating hours of exercise, 1-5 each week). This data is synthesised using `np.random.choice` to pick between the hour options.
`heart_rate` - continuous variable, as heart rate can take any value within a range. 60 - 200 bpm is taken as the range of heart rates for people between 20 and 80 (it changes depending on age, as we will hopefully see). The values are generated using a normal distribution centred on a resting heart rate of 70 bpm with a standard deviation of 10.
`age` - discrete variable - generated as whole numbers between 18 and 80 using `np.random.randint`.
`bmi` - continuous variable like heart_rate - It is assumed that BMI decreases with increasing exercise hours, and the values are drawn from a normal distribution, similar to heart_rate.

In [None]:
#References Variables & Distributions
#https://pandas.pydata.org/docs/user_guide/categorical.html
#https://www.codecademy.com/learn/stats-variable-types/modules/stats-variable-types/cheatsheet#
#https://seaborn.pydata.org/tutorial/categorical.html
#https://www.codecademy.com/learn/mlef-exploratory-data-analysis-in-python/modules/mlef-variable-types-for-data-science/cheatsheet
#https://levelup.gitconnected.com/when-and-how-to-categorize-continuous-variables-in-python-f4c3b357dd46
#https://subscription.packtpub.com/book/data/9781789806311/1/ch01lvl1sec03/identifying-numerical-and-categorical-variables
#https://www.health.harvard.edu/heart-health/what-your-heart-rate-is-telling-you
#https://uihc.org/health-topics/target-heart-rate-exercise
#https://sparkbyexamples.com/python/how-to-use-numpy-random-choice-in-python/

: 

Dataset Generation and Synthetic Data:

In [None]:
#Data creation for health dataset
#Hours of exercise per week
workout_hours = np.random.choice([1, 2, 3, 4, 5], size=100)

: 

`np.random.choice` draws random samples from a 1-D array, here the 5 options for hours of exercise (1-5). The `size=100` argument sets the shape of the output, which gives one value for each of our 100 people/data points.
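
As a rough illustration of the sampling described above, the sketch below draws 100 values from the five hour options and counts how often each one appears. The seeded generator `rng` and the names `hours_options` and `sample_hours` are assumptions made for this example only; the notebook itself calls `np.random.choice` without a seed.

```python
import numpy as np

# Assumption: a seeded Generator is used only so this sketch is repeatable.
rng = np.random.default_rng(42)

hours_options = [1, 2, 3, 4, 5]                  # the 1-D array to sample from
sample_hours = rng.choice(hours_options, size=100)

print(sample_hours.shape)                        # (100,) - one value per person

# Count how many people fall into each workout-hours category.
values, counts = np.unique(sample_hours, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
```

Each option is equally likely by default, so the counts should be roughly even but will not be identical.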

In [None]:
#https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html

: 

In [None]:
#Heart Rate
heart_rate = np.random.normal(70, 10, size=100)
#Mean heart rate of 70 bpm (based on search) with a standard deviation of 10; this is also the base/resting heart rate.

: 
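
A normal distribution with mean 70 and standard deviation 10 can produce values below the 60 - 200 bpm range mentioned in the variable descriptions. The sketch below is one possible way to check for this and, if wanted, to clip the values into that range. The seed and the names `resting_hr`, `share_below_60` and `clipped_hr` are hypothetical, and clipping is an assumption of this example rather than a step the notebook takes.

```python
import numpy as np

# Assumption: seeded Generator so the sketch is repeatable.
rng = np.random.default_rng(0)

# Same parameters as the notebook: mean 70 bpm, standard deviation 10.
resting_hr = rng.normal(loc=70, scale=10, size=100)

# Share of simulated people whose heart rate falls below 60 bpm.
share_below_60 = np.mean(resting_hr < 60)
print(f"Share below 60 bpm: {share_below_60:.2f}")

# Optional: force every value into the 60-200 bpm range.
clipped_hr = np.clip(resting_hr, 60, 200)
print(clipped_hr.min(), clipped_hr.max())
```

Clipping piles the out-of-range values up at exactly 60, so it slightly distorts the bell shape; whether that matters depends on how the data is used afterwards.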

In [None]:
#https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html

: 

In [None]:
#age range
age = np.random.randint(18, 81, size=100)

: 
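
The upper bound passed to `np.random.randint` is exclusive, which is why 81 is used to allow an age of 80. The sketch below shows the same half-open convention using NumPy's newer `Generator.integers`. The seed and the name `ages` are assumptions for illustration.

```python
import numpy as np

# Assumption: seeded Generator so the sketch is repeatable.
rng = np.random.default_rng(1)

# Low bound 18 is included, high bound 81 is excluded, so ages run 18..80.
ages = rng.integers(18, 81, size=1000)

print(ages.min(), ages.max())   # both stay inside 18..80
print(ages.dtype.kind)          # 'i' - whole numbers only
```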

In [None]:
#https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html

: 

In [None]:
#bmi
meanbmi = 27 - (workout_hours * 0.5)
bmi = np.random.normal(meanbmi, 3, size=100)
#Assumed that the more you work out, the lower your BMI: each extra workout hour lowers the mean BMI by 0.5

: 
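
Because `meanbmi` is an array, `np.random.normal` uses a different mean for each person rather than one shared mean. The sketch below illustrates this under assumed names (`hours`, `mean_bmi`, `sample_bmi`) and a hypothetical seed.

```python
import numpy as np

# Assumption: seeded Generator so the sketch is repeatable.
rng = np.random.default_rng(2)

hours = rng.choice([1, 2, 3, 4, 5], size=100)

# One mean per person: 26.5 for 1 hour down to 24.5 for 5 hours.
mean_bmi = 27 - hours * 0.5
print(sorted(set(mean_bmi.tolist())))

# loc is an array of 100 means, so each draw uses its own person's mean.
sample_bmi = rng.normal(loc=mean_bmi, scale=3, size=100)
print(sample_bmi.shape)
```

With a standard deviation of 3 and a spread of only 2 BMI points between the lowest and highest means, the downward trend is likely to look weak in a sample of 100.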

In [None]:
print(meanbmi)

: 

In [None]:
print(bmi)

: 

The two print cells above check that the variables and the kernel are working.

## Data Frame Creation

In [None]:
data = pd.DataFrame({
    'Workout Hours': workout_hours,
    'BMI': bmi,
    'Heart_Rate': heart_rate,
    'Age': age
})

: 

In [None]:
#https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
#https://www.w3schools.com/python/pandas/pandas_dataframes.asp

: 

#### Test that the dataframe was created and is functioning using the `head()` method. It displays the first 5 rows by default, but I have passed a larger number in the brackets so that all 100 data points are shown:


In [None]:
print(data.head(101))

: 

In [None]:
#https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html

: 

#### Summary of data using the `describe()` method

In [None]:
#summarize data
print(data.describe())

: 

In [None]:
#https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html

: 
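
Alongside `describe()`, a grouped summary is one way to check whether the assumed link between workout hours and BMI shows up in the generated data. The sketch below rebuilds a small frame with the same column names. The seed and the names `frame` and `mean_by_hours` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: seeded Generator so the sketch is repeatable.
rng = np.random.default_rng(3)

hours = rng.choice([1, 2, 3, 4, 5], size=100)
frame = pd.DataFrame({
    'Workout Hours': hours,
    'BMI': rng.normal(27 - hours * 0.5, 3, size=100),
})

# Average BMI within each workout-hours category.
mean_by_hours = frame.groupby('Workout Hours')['BMI'].mean()
print(mean_by_hours)
```

The group means should drift downwards as hours increase, although with only around 20 people per group the order will not always be strictly decreasing.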

## Plotting

Plotting various graphs to show the relationships between the variables in the synthetic data.

Scatter plots - `pairplot()` from seaborn shows the pairwise relationships between the variables.

In [None]:
sns.pairplot(data)
plt.show()

: 
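
The pairplot gives a visual impression of the relationships; a correlation matrix puts numbers on the same pairs. The sketch below is a minimal example using the same column names as the dataframe above. The seed and the names `frame` and `corr_matrix` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: seeded Generator so the sketch is repeatable.
rng = np.random.default_rng(4)

hours = rng.choice([1, 2, 3, 4, 5], size=100)
frame = pd.DataFrame({
    'Workout Hours': hours,
    'BMI': rng.normal(27 - hours * 0.5, 3, size=100),
    'Heart_Rate': rng.normal(70, 10, size=100),
    'Age': rng.integers(18, 81, size=100),
})

# Pearson correlation between every pair of columns (values from -1 to 1).
corr_matrix = frame.corr()
print(corr_matrix.round(2))
```

Only Workout Hours and BMI are linked by construction, so the other pairs should show correlations close to zero apart from random noise.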

In [None]:
#Three scatter plots side by side, using the dataframe column names
plt.figure(figsize=(12, 5))

plt.subplot(1, 3, 1)
sns.scatterplot(x='Workout Hours', y='BMI', data=data)
plt.title('Workout Hours Versus BMI')

plt.subplot(1, 3, 2)
sns.scatterplot(x='Age', y='Heart_Rate', data=data)
plt.title('Age Versus Heart Rate')

plt.subplot(1, 3, 3)
sns.scatterplot(x='BMI', y='Heart_Rate', data=data)
plt.title('BMI Versus Heart Rate')

plt.tight_layout()
plt.show()

: 

In [None]:
#Boxplot of BMI for each workout-hours category
sns.catplot(x='Workout Hours', y='BMI', data=data, kind="box")
plt.xlabel('Workout Hours')
plt.ylabel('BMI')
plt.title('Boxplot of BMI Versus Workout Hours')
plt.show()

#Boxplot of heart rate for each age
sns.catplot(x='Age', y='Heart_Rate', data=data, kind="box")
plt.xlabel('Age')
plt.ylabel('Heart Rate')
plt.title('Boxplot of Heart Rate Versus Age')
plt.show()

: 
