## Data Preparation

#### Data Extraction

In [13]:
# reading csv using pandas
import pandas as pd
health_data = pd.read_csv("health_data.csv", header=0, sep=",")
print(health_data)

# showing top 5 rows
print(health_data.head())

    Duration Average_Pulse Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
0       30.0            80       120            240.0        10.0          7.0
1       45.0            85       120            250.0        10.0          7.0
2       45.0            90       130            260.0         8.0          7.0
3       60.0            95       130            270.0         8.0          7.0
4       60.0           100       140            280.0         0.0          7.0
5        NaN           NaN       NaN              NaN         NaN          NaN
6       60.0           105       140            290.0         7.0          8.0
7       60.0           110       145            300.0         7.0          8.0
8       45.0           NaN        AF              NaN         8.0          8.0
9       45.0           115       145            310.0         8.0          8.0
10      60.0           120       150            320.0         0.0          8.0
11      60.0         9 000       130              Na

#### Code Explanation
- `import pandas as pd` imports the Pandas library.
- The data frame is named `health_data`.
- `header=0` means that the variable names are taken from the first row of the file.
- `sep=","` means that a comma is used as the separator between the values in the .csv file.
- If we have a large CSV file, we can use the `head()` function to show only the first 5 rows.
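
To try these parameters without the original file, here is a minimal sketch that reads a small in-memory CSV. The variable names `csv_text` and `sample` and the values are made up for illustration; only `read_csv()`, `header`, `sep` and `head()` come from the cell above.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for health_data.csv
csv_text = "Duration,Average_Pulse\n30,80\n45,85\n45,90\n60,95\n60,100\n60,105\n"

# header=0: the first line holds the column names; sep=",": values are comma separated
sample = pd.read_csv(io.StringIO(csv_text), header=0, sep=",")

print(sample.columns.tolist())  # column names taken from the first line
print(len(sample.head()))       # head() returns at most 5 rows by default
print(len(sample.head(2)))      # pass a number to change how many rows are returned
```

Both `header=0` and `sep=","` are the defaults when column names are inferred, so `pd.read_csv(...)` without them should behave the same way here. Writing them out mainly documents the file layout.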

#### Data Cleaning

- Looking at the imported data, we can see that it is "dirty": some values are wrong and some were never recorded.
- There are some blank fields.
- An average pulse of 9000 is not possible.
- "9 000" is treated as non-numeric because of the space used as a thousands separator.
- One observation of max pulse is recorded as "AF", which does not make sense.
- So we must clean the data before we can analyze it.
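
One way to find such values in code rather than by eye is sketched below. It is not part of the original notebook: the series `raw` is a made-up column that mixes the problems listed above, and `pd.to_numeric(..., errors="coerce")` is used here only as a detection aid.

```python
import pandas as pd

# Hypothetical column containing a blank, a space-separated number and a text entry
raw = pd.Series(["80", "9 000", None, "AF", "115"], name="Pulse")

# errors="coerce" turns every entry that cannot be parsed as a number into NaN
parsed = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but could not be parsed
bad = raw[parsed.isna() & raw.notna()]
print(bad.tolist())
```

Under these assumptions, `bad` holds the two non-numeric strings, while the blank entry is excluded because it was missing to begin with.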

#### Removing Blank Rows
- We see that the non-numeric values (9 000 and AF) are in the same rows as missing values.
- **We can remove the rows with missing observations to fix the problem.**
- When we load a data set using Pandas, all blank cells are automatically converted into "NaN" values.
- So, removing the rows that contain NaN cells gives us a clean data set that can be analyzed.
- **We can use the `dropna()` function to remove the NaNs. `axis=0` means that we want to remove all rows that have a NaN value.** `inplace=True` modifies `health_data` itself instead of returning a new data frame.
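
The behavior of `dropna()` can be checked on a small example. The frame `toy` below is made up for illustration and is not the health data set.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is entirely blank, row 2 has one blank cell
toy = pd.DataFrame({
    "Duration": [30.0, np.nan, 45.0],
    "Max_Pulse": ["120", np.nan, np.nan],
})

# Without inplace=True, dropna() returns a new frame and leaves toy unchanged
cleaned = toy.dropna(axis=0)

print(cleaned.index.tolist())  # only rows without any NaN remain
print(len(toy))                # the original frame still has all its rows

# Surviving rows keep their original labels; reset_index renumbers them from 0
renumbered = cleaned.reset_index(drop=True)
```

Because row labels are kept, the index of a cleaned data frame can have gaps. Whether to renumber with `reset_index(drop=True)` is a matter of preference.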

In [14]:
# dropping rows that contain NaN values
health_data.dropna(axis=0, inplace=True)
print(health_data)

    Duration Average_Pulse Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
0       30.0            80       120            240.0        10.0          7.0
1       45.0            85       120            250.0        10.0          7.0
2       45.0            90       130            260.0         8.0          7.0
3       60.0            95       130            270.0         8.0          7.0
4       60.0           100       140            280.0         0.0          7.0
6       60.0           105       140            290.0         7.0          8.0
7       60.0           110       145            300.0         7.0          8.0
9       45.0           115       145            310.0         8.0          8.0
10      60.0           120       150            320.0         0.0          8.0
12      45.0           125       150            330.0         8.0          8.0


## Data Categories

#### Data types
- We can use the `info()` function to list the data types within our data set.

In [16]:
# printing data types of the dataframe
print(health_data.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 12
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Duration         10 non-null     float64
 1   Average_Pulse    10 non-null     object 
 2   Max_Pulse        10 non-null     object 
 3   Calorie_Burnage  10 non-null     float64
 4   Hours_Work       10 non-null     float64
 5   Hours_Sleep      10 non-null     float64
dtypes: float64(4), object(2)
memory usage: 560.0+ bytes
None


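#### Code Explanation
- We see that this data set has two different types of data:
  - `float64`
  - `object`
- The trailing `None` in the output appears because `info()` prints its report itself and returns `None`, which `print()` then displays.
- We cannot use object columns for calculations and analysis here.
- We must convert the object columns to float64 (a 64-bit floating-point number, that is, a number with a decimal part).
- We can use the `astype()` function to convert the data to float64.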
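
As a minimal sketch of what `astype()` does, the example below converts a made-up object column of numeric strings and shows that a value such as "AF" cannot be converted. The names `pulse`, `converted` and `conversion_failed` are hypothetical.

```python
import pandas as pd

# Hypothetical object column holding numbers stored as text
pulse = pd.Series(["80", "85", "90"], dtype=object)
print(pulse.dtype)

converted = pulse.astype(float)
print(converted.dtype)
print(converted.mean())  # arithmetic works once the column is numeric

# A non-numeric entry makes the conversion raise an error
conversion_failed = False
try:
    pd.Series(["120", "AF"], dtype=object).astype(float)
except ValueError as err:
    conversion_failed = True
    print("conversion failed:", err)
```

This is why the rows containing non-numeric entries have to be removed before the conversion: `astype(float)` only succeeds when every value in the column can be parsed as a number.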

#### Data type conversion

In [17]:
# converting the Average_Pulse and Max_Pulse columns from object to float
health_data["Average_Pulse"] = health_data["Average_Pulse"].astype(float)
health_data["Max_Pulse"] = health_data["Max_Pulse"].astype(float)
print(health_data.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 12
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Duration         10 non-null     float64
 1   Average_Pulse    10 non-null     float64
 2   Max_Pulse        10 non-null     float64
 3   Calorie_Burnage  10 non-null     float64
 4   Hours_Work       10 non-null     float64
 5   Hours_Sleep      10 non-null     float64
dtypes: float64(6)
memory usage: 560.0 bytes
None


## Analyze the data

- Now that we have cleaned the data set, we can start analyzing the data.
- We can use the Pandas `describe()` function to summarize the data.

#### Code

In [26]:
pd.set_option('display.max_columns', None)
print(health_data.describe())

        Duration  Average_Pulse   Max_Pulse  Calorie_Burnage  Hours_Work  \
count  10.000000      10.000000   10.000000        10.000000   10.000000   
mean   51.000000     102.500000  137.000000       285.000000    6.600000   
std    10.488088      15.138252   11.352924        30.276504    3.627059   
min    30.000000      80.000000  120.000000       240.000000    0.000000   
25%    45.000000      91.250000  130.000000       262.500000    7.000000   
50%    52.500000     102.500000  140.000000       285.000000    8.000000   
75%    60.000000     113.750000  145.000000       307.500000    8.000000   
max    60.000000     125.000000  150.000000       330.000000   10.000000   

       Hours_Sleep  
count    10.000000  
mean      7.500000  
std       0.527046  
min       7.000000  
25%       7.000000  
50%       7.500000  
75%       8.000000  
max       8.000000  


#### Explanation
- `pd.set_option('display.max_columns', <number_of_columns>)` sets the maximum number of columns shown when a dataframe is printed; columns beyond that limit are replaced by `...`.
- `pd.set_option('display.max_columns', 40)` will display up to 40 columns.
- `pd.set_option('display.max_columns', None)` will display all columns.
- `health_data.describe()` computes summary statistics for each numeric column: count, mean, standard deviation, minimum, percentiles and maximum.
- Here:
    - **Count** - Counts the number of observations
    - **Mean** - The average value
    - **Std** - Standard deviation (explained in the statistics chapter)
    - **Min** - The lowest value
    - **25%, 50% and 75%** - are percentiles (explained in the statistics chapter)
    - **Max** - The highest value
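
To connect these rows to individual calculations, the sketch below runs `describe()` on a small made-up series and reads single entries from the result. The values and the names `durations` and `summary` are hypothetical.

```python
import pandas as pd

# Hypothetical values, not taken from the health data set
durations = pd.Series([30.0, 45.0, 45.0, 60.0])

# describe() on a series returns another series indexed by statistic name
summary = durations.describe()
print(summary)

# Each entry corresponds to a dedicated method
print(summary["mean"], durations.mean())
print(summary["50%"], durations.median())  # the 50% percentile is the median
print(summary["std"], durations.std())     # sample standard deviation (ddof=1)
```

Reading values from `summary` by label is convenient when only one or two statistics are needed in later calculations.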