# Programming for Data Analysis

## Project 1 - HDIP in Data Analytics 2023

<br>

<details>
<summary> Student Information </summary>
    
| Info | Details |
| -------- | -------- |
| Course: | KDATG_L08_Y1 |
| Author: | Rebecca Hannah Quinn |
| Student Number: | G00425671 |
    
</details>

---

### An Investigation into BMI and Personal Fitness

For this project I will investigate a real-world phenomenon by synthesising data, then calculating and plotting the results with code from various Python libraries. The topic is exercise and body mass index (BMI): how BMI is lowered, maintained or raised depending on the hours spent working out each week, age, and heart rate in beats per minute (BPM).

#### Overall Health and BMI

Body Mass Index (BMI) is a way to assess an individual's health with regard to their weight. It is a tool that uses height and weight to classify a person's weight status. However, BMI alone may not provide enough information to determine overall health. Other factors such as age, sex, activity level, current body fat and muscle density should be considered as well. Even so, the BMI scale can give a good idea of the direction a person should head in, or whether anything further needs to be assessed. <sup> [^1] [^2] [^3] </sup>

**BMI Formula:**
`BMI = weight (kg) / height (m)²`
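
The formula above divides weight in kilograms by the square of height in metres. As a minimal sketch, it could be written as a small Python function. The function name `bmi_from_measurements` and the example person (70 kg, 1.75 m) are hypothetical and are not part of the project data.

```python
def bmi_from_measurements(weight_kg: float, height_m: float) -> float:
    """Return BMI: weight in kilograms divided by height in metres squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


# Hypothetical example person, not taken from the dataset
example_bmi = bmi_from_measurements(70, 1.75)
print(round(example_bmi, 1))
```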

#### Heart Rate, Exercise Habits and Age

In addition to overall health evaluations, heart rate (BPM) is analyzed both during activity and at rest. For this investigation, heart rate is assumed to be measured at rest and related to exercise hours per week. Regular exercise, and aerobic exercise in particular, is highly recommended to improve heart strength, which in turn benefits everything from nutrition to physical performance and, ultimately, BMI. The working assumption here is that BMI is regulated or lowered by more exercise per week.

Heart rate health is also assessed through target zones and maximum heart rates, both of which depend on age.

For example:

| Age | Target Heart Rate Zone | Max Heart Rate |
| --- | ---------------------- | -------------- |
| 20 | 100 - 170 BPM | 200 BPM |
| 80 | 70 - 119 BPM | 140 BPM |

To find your ideal BPM during exercise, check that your heart rate is within 50% - 85% of your maximum BPM for your age. <sup> [^4] [^5] </sup>
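
The two table rows are consistent with a maximum heart rate of 220 minus age and a target zone of 50% to 85% of that maximum. Both rules are assumptions inferred from the table values, not formulas stated in this notebook, and the function names are hypothetical. A minimal sketch that reproduces the rows:

```python
def max_heart_rate(age: int) -> int:
    # Assumption: the "220 minus age" rule of thumb, which matches the table rows
    return 220 - age


def target_zone(age: int, low: float = 0.50, high: float = 0.85) -> tuple[int, int]:
    # Assumption: the zone runs from 50% to 85% of the maximum heart rate
    max_hr = max_heart_rate(age)
    return round(max_hr * low), round(max_hr * high)


for example_age in (20, 80):
    print(example_age, max_heart_rate(example_age), target_zone(example_age))
```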


<details>

<summary>

Introduction References
    
</summary>

[^1]: [BMI limitations: Age and sex, body composition, and health](https://www.medicalnewstoday.com/articles/323543#takeaway) <br>
[^2]: [Effects of Exercise Interventions on Weight, Body Mass Index, Lean Body Mass and Accumulated Visceral Fat in Overweight and Obese Individuals: A Systematic Review and Meta-Analysis of Randomized Controlled Trials - PMC](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7967650/#:~:text=Decreasing%20accumulation%20of%20fat%20is,low%20visceral%20fat%20%5B7%5D.) <br>
[^3]: [What is the body mass index (BMI)? - NHS](https://www.nhs.uk/common-health-questions/lifestyle/what-is-the-body-mass-index-bmi/) <br>
[^4]: [Target heart rate for exercise | University of Iowa Hospitals & Clinics](https://uihc.org/health-topics/target-heart-rate-exercise) <br>
[^5]: [What your heart rate is telling you - Harvard Health](https://www.health.harvard.edu/heart-health/what-your-heart-rate-is-telling-you) <br>

</details>

---


In [None]:
#importing libraries for use in project code below
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#Suppress FutureWarning messages (added for the error experienced on line 250).
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
print('x' in np.arange(5))   #returns False, without a warning


<details>
<summary> Python Library References: </summary>
    
[NumPy](https://numpy.org) <br>
[Pandas](https://pandas.pydata.org) <br>
[Matplotlib](https://matplotlib.org) <br>
[Seaborn](https://seaborn.pydata.org) <br>

To use these libraries in your own files, install each one first and then import it as shown in the previous section.

</details>

---


### Variables & Distributions:

#### `workout_hours` 
A categorical variable with set values giving the hours of exercise per week, ranging from 1 to 5. 
<sup> [^6] [^7] [^8] </sup>

<br>

#### `heart_rate` 
A continuous variable, since a heart rate, whether measured in real life or synthetically generated, can take any value within its range. For this investigation I will use 60 - 200 BPM, as this is the minimum and maximum any person between the ages of 18 and 80 may experience with exercise, depending on their fitness level. The data is generated with NumPy's `np.random.normal` function, using 70 as the mean, 10 as the standard deviation and a size of 100 (people). <sup> [^9] [^10] </sup>

<br>

#### `age` 
A discrete variable. Ages between 18 and 80 are generated with the `randint()` function from the `np.random` subpackage. <sup> [^11] </sup>

<br>

#### `bmi` 
A continuous variable, like `heart_rate`. For this investigation it is assumed that BMI decreases, or stays at a healthy level, with increasing or continued exercise hours, and that it follows a normal distribution similar to `heart_rate`. <sup> [^12] </sup>

<br>

Using these variables and their distributions, I hope to find that increased weekly exercise is beneficial at any age, and that older people may exercise less and have a higher BMI, as would be assumed in reality.
<sup> [^13] </sup>

<details>  
<summary> 

###### References - Variables & Distributions 
    
</summary>

[^6]: [Categorical data — pandas 2.1.3 documentation](https://pandas.pydata.org/docs/user_guide/categorical.html) <br>
[^7]: [Variable Types: Variable Types Cheatsheet - Quantitative Vs. Categorical Variables | Codecademy](https://www.codecademy.com/learn/stats-variable-types/modules/stats-variable-types/cheatsheet#) <br>
[^8]: [Visualizing categorical data — seaborn 0.13.0 documentation](https://seaborn.pydata.org/tutorial/categorical.html) <br>
[^9]: [When and How to Categorize Continuous Variables in Python | by Max Hilsdorf | Level Up Coding](https://levelup.gitconnected.com/when-and-how-to-categorize-continuous-variables-in-python-f4c3b357dd46) <br>
[^10]: [What your heart rate is telling you - Harvard Health](https://www.health.harvard.edu/heart-health/what-your-heart-rate-is-telling-you) <br>
[^11]: [How to Use NumPy Random choice() in Python?](https://sparkbyexamples.com/python/how-to-use-numpy-random-choice-in-python/) <br>
[^12]: [Target heart rate for exercise | University of Iowa Hospitals & Clinics](https://uihc.org/health-topics/target-heart-rate-exercise) <br>
[^13]: [Python Feature Engineering Cookbook](https://subscription.packtpub.com/book/data/9781789806311/1/ch01lvl1sec03/identifying-numerical-and-categorical-variables) <br>
    
</details>


---

### Dataset Generation and Synthetic Data:

In [None]:
#Data creation for health dataset
#Hours of exercise per week
workout_hours = np.random.choice([1, 2, 3, 4, 5], size=100)

`np.random.choice` returns a 1-D array of random samples drawn from the five values given (1 to 5 hours of exercise). The `size` argument (100) sets the shape of the data: our 100 people, or data points, to generate.

In [None]:
#14 https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html

In [None]:
#Heart Rate
heart_rate = np.random.normal(70, 10, size=100)
#mean heart rate of 70 (based on search) with a standard deviation of 10; this is also the base/resting heart rate.

In [None]:
#15 https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html
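
The description of `heart_rate` gives a range of 60 to 200 BPM, but draws from a normal distribution are unbounded, so with a mean of 70 and a standard deviation of 10 some values can fall below 60. The notebook does not enforce the range. A minimal sketch of one way to do so, assuming clipping is acceptable, uses `np.clip`. The seed and the variable names are arbitrary choices for illustration.

```python
import numpy as np

# Arbitrary seed, used only so the sketch is repeatable
rng = np.random.default_rng(seed=42)

# Same parameters as the notebook: mean 70, standard deviation 10, 100 people
raw_heart_rate = rng.normal(loc=70, scale=10, size=100)

# Assumption: values outside 60-200 BPM are pulled back to the nearest bound
clipped_heart_rate = np.clip(raw_heart_rate, 60, 200)

print("draws below 60 before clipping:", int((raw_heart_rate < 60).sum()))
print("minimum after clipping:", float(clipped_heart_rate.min()))
```

Clipping piles the low draws up at exactly 60, which distorts the lower tail. It is shown here only as the simplest option.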

In [None]:
#age range
age = np.random.randint(18, 81, size=100)

In [None]:
#16 https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html
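
The upper bound of `np.random.randint` is exclusive, which is why the cell above passes 81 to include age 80. As a small sketch, the check below confirms the bounds and shows the equivalent call on NumPy's newer `Generator` interface. The seed and the sample size of 10,000 are arbitrary choices for illustration.

```python
import numpy as np

# Legacy interface, as used in the notebook: high=81 is exclusive, so the maximum is 80
ages_legacy = np.random.randint(18, 81, size=10_000)

# Newer Generator interface: endpoint=True makes the upper bound inclusive
rng = np.random.default_rng(seed=7)  # arbitrary seed
ages_new = rng.integers(18, 80, size=10_000, endpoint=True)

print(ages_legacy.min(), ages_legacy.max())
print(ages_new.min(), ages_new.max())
```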

In [None]:
#bmi
meanbmi = 27 - (workout_hours * 0.5)
bmi = np.random.normal(meanbmi, 3, size=100)
#assumed that the more you work out, the lower your BMI 

To check that the variables and the kernel are working:

In [None]:
print(meanbmi)

In [None]:
print(bmi)
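
Printing the arrays confirms that they exist, but not that BMI falls with workout hours as intended. A minimal sketch of a further check compares the average simulated BMI in each workout group with the intended mean of `27 - 0.5 * hours`. It assumes a much larger sample than the notebook's 100 people so that the averages settle, and the seed and variable names are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed
n_people = 100_000                   # assumption: far larger than the notebook's 100

# Same generating rules as the notebook
hours = rng.choice([1, 2, 3, 4, 5], size=n_people)
mean_bmi = 27 - hours * 0.5
sim_bmi = rng.normal(mean_bmi, 3, size=n_people)

# Average simulated BMI for each number of weekly workout hours
group_means = {h: float(sim_bmi[hours == h].mean()) for h in range(1, 6)}
for h, observed in group_means.items():
    print(h, round(observed, 2), "intended:", 27 - 0.5 * h)
```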

## Data Frame Creation

In [None]:
data = pd.DataFrame({
    'Workout_Hours': workout_hours,
    'BMI': bmi,
    'Heart_Rate': heart_rate,
    'Age': age
})

In [None]:
#17 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
#18 https://www.w3schools.com/python/pandas/pandas_dataframes.asp

To test that the dataframe was created and is functioning, I used the `head()` method. It displays the first 5 rows by default, but I have put 100 between the brackets to see all 100 data points:


In [None]:
print(data.head(100))

In [None]:
#19 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html

#### Summary of data using the `describe()` method

In [None]:
#summarize data
print(data.describe())

In [None]:
#20 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html

---

## Plotting

By selecting the right type of plot to visualize the above data, we can gain valuable insight into the relationships among the four variables, and in particular how workout hours, heart rate and age relate to BMI. 

#### Pairplot

A pairplot is used to identify correlations and trends quickly for the initial investigation and can assist in directing which variables to focus on or investigate further. 

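#### Boxplot

A boxplot is used to show the spread of BMI for each number of weekly workout hours.

#### Heatmap

A heatmap of the correlations is used to show the stronger relationships between the variables.

Scatter plots: `pairplot()` from seaborn shows the relationships between each pair of variables.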

In [None]:
sns.pairplot(data)
plt.show()

In [None]:
#three scatter plots side by side, using the dataframe column names
plt.figure(figsize=(12, 5))

plt.subplot(1, 3, 1)
sns.scatterplot(x='Workout_Hours', y='BMI', data=data)
plt.title('Workout Hours Versus BMI')

plt.subplot(1, 3, 2)
sns.scatterplot(x='Age', y='Heart_Rate', data=data)
plt.title('Age Versus Heart Rate')

plt.subplot(1, 3, 3)
sns.scatterplot(x='BMI', y='Heart_Rate', data=data)
plt.title('BMI Versus Heart Rate')

plt.tight_layout()
plt.show()

In [None]:
#boxplot of BMI for each number of weekly workout hours
sns.catplot(x='Workout_Hours', y='BMI', data=data, kind="box")
plt.xlabel('Workout Hours')
plt.ylabel('BMI')
plt.title('Boxplot of BMI Versus Workout Hours')
plt.show()

In [None]:
#heatmap for showing stronger relations between variables

#calculate the pairwise correlations of the data in the dataframe generated earlier
correlation = data.corr()

#plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation, annot=True, cmap='autumn', fmt='.2f')
plt.title('Variables Correlation')
plt.show()
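
The generating rules also fix what correlation to expect between workout hours and BMI. Workout hours are drawn uniformly from 1 to 5 (variance 2), BMI has a mean of `27 - 0.5 * hours` and noise with a standard deviation of 3, so the expected Pearson correlation is -1/√19, about -0.23. A minimal sketch of that calculation, with a large simulated sample as a check, follows. The seed, sample size and variable names are assumptions made for illustration.

```python
import numpy as np

# Expected correlation derived from the generating rules
var_hours = (5**2 - 1) / 12              # variance of a uniform draw from 1..5
cov_hours_bmi = -0.5 * var_hours         # BMI mean drops 0.5 per workout hour
var_bmi = 0.5**2 * var_hours + 3**2      # slope part plus noise (sd = 3)
expected_r = cov_hours_bmi / np.sqrt(var_hours * var_bmi)

# Simulation check with a large sample (assumption: 200,000 rather than 100)
rng = np.random.default_rng(seed=1)      # arbitrary seed
hours = rng.choice([1, 2, 3, 4, 5], size=200_000)
sim_bmi = rng.normal(27 - 0.5 * hours, 3)
sample_r = np.corrcoef(hours, sim_bmi)[0, 1]

print(round(float(expected_r), 3), round(float(sample_r), 3))
```

With only 100 rows the sampling error of a correlation this weak is roughly 0.1, so the value shown in the heatmap can differ noticeably from run to run.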