# **Week-8 Assignment**
## **Heart Rate**
---
This dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to submit personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between outputs reflects the use of different Fitbit trackers and individual tracking behaviors/preferences.

---
This Notebook contains:
### `Dataset : Heart Rate`

# **Importing Libraries**


So, in order to do anything with the data, we must first import the libraries and set the display view of the dataset.

This code snippet imports the necessary Python libraries, sets `display options for Pandas`, and prepares the environment for data analysis and visualization.

In [1]:
# Importing required libraries for data analysis and visualization
import pandas as pd                       # Pandas for data manipulation and analysis
import numpy as np                        # NumPy for numerical operations
import matplotlib.pyplot as plt           # Matplotlib for basic plotting
import seaborn as sns                     # Seaborn for statistical data visualization
import plotly.express as px               # Plotly Express for interactive visualizations
import warnings                           # Warnings module to control warning output
warnings.filterwarnings('ignore')         # Suppress all warnings to keep the notebook output clean

# Setting display options for Pandas to show two decimal places for floating-point numbers
pd.set_option('display.float_format', lambda x: '%.2f' % x)
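
The float format above changes only how numbers are printed, not the values stored in the DataFrame. A minimal sketch of that behavior, using hypothetical sample values and `pd.option_context` so the setting applies only inside the `with` block:

```python
import pandas as pd

# Hypothetical sample values, not taken from the dataset
sample = pd.DataFrame({"Value": [77.333333, 19.4]})

# option_context applies the display format only inside the with-block
with pd.option_context("display.float_format", "{:.2f}".format):
    formatted = sample.to_string()

print(formatted)                  # rendered with two decimal places
print(sample["Value"].iloc[0])    # the stored value keeps its full precision
```
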

# **Loading Dataset**

After importing the libraries, we will import the data using the `GitHub` link of the raw file.

We continue the setup by adjusting the `Pandas display options` and then load the dataset from a `URL` into a `Pandas` DataFrame.

In [2]:
# Display all columns without truncation
pd.set_option('display.max_columns', None)

# Load the heart rate dataset from URL into 'df' DataFrame
url = 'https://raw.githubusercontent.com/vignay21/Prepinsta-Winter-Internship/main/PrepInsta-Week8/Raw%20CSV/heartrate_seconds_merged.csv'
df = pd.read_csv(url, encoding='unicode_escape')

# Display first two rows of the loaded DataFrame
df.head(2)

|   | Id | Time | Value |
|---|---|---|---|
| 0 | 2022484408 | 4/12/2016 7:21:00 AM | 97 |
| 1 | 2022484408 | 4/12/2016 7:21:05 AM | 102 |
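
Since the file is large, it can be handy to preview just a few rows before loading everything. The sketch below uses the `nrows` parameter of `pd.read_csv`; the in-memory `csv_text` is only a stand-in for the remote file (an assumption for illustration), built from the two rows shown above.

```python
import io
import pandas as pd

# Stand-in for the remote file: header plus the two rows shown above
csv_text = (
    "Id,Time,Value\n"
    "2022484408,4/12/2016 7:21:00 AM,97\n"
    "2022484408,4/12/2016 7:21:05 AM,102\n"
)

# nrows limits how many data rows are parsed
preview = pd.read_csv(io.StringIO(csv_text), nrows=1)
print(preview)
```

With the real file, the same `nrows` argument could be passed alongside the `url`.
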


# **Preliminary Data Inspection**

Before performing any operation on the dataset, we need to get familiar with it.

First, we show the data type of each column in the '`df`' DataFrame.

In [3]:
df.dtypes        # Display data types of columns in the 'df' DataFrame

Id        int64
Time     object
Value     int64
dtype: object

Detailed information about the '`df`' DataFrame, including *data types* and *memory usage*

In [4]:
df.info(verbose=True)        # Display concise information about 'df' DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2483658 entries, 0 to 2483657
Data columns (total 3 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   Id      int64 
 1   Time    object
 2   Value   int64 
dtypes: int64(2), object(1)
memory usage: 56.8+ MB
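
The memory figure above comes largely from storing every integer column as `int64`. As an optional aside, and not part of the original workflow, heart rate readings fit comfortably in a smaller integer type. A minimal sketch on a toy frame (the `Id` values are hypothetical):

```python
import pandas as pd

# Toy frame; Id values are hypothetical, dtype set explicitly to int64
toy = pd.DataFrame({"Id": [1, 1, 1], "Value": [97, 102, 58]}, dtype="int64")

before = toy["Value"].memory_usage(index=False)   # bytes used by the int64 column
toy["Value"] = toy["Value"].astype("int16")       # int16 holds values up to 32767
after = toy["Value"].memory_usage(index=False)    # bytes used after downcasting

print(before, after)
```
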


Showing the number of rows and columns

In [5]:
df.shape         # Display the shape (rows, columns)

(2483658, 3)

Using `df.describe()` to generate descriptive statistics for the numerical columns in the '`df`' DataFrame.

In [6]:
df.describe()      # Generate descriptive statistics for numerical columns

|   | Id | Value |
|---|---|---|
| count | 2483658.0 | 2483658.0 |
| mean | 5513764629.27 | 77.33 |
| std | 1950223760.95 | 19.4 |
| min | 2022484408.0 | 36.0 |
| 25% | 4388161847.0 | 63.0 |
| 50% | 5553957443.0 | 73.0 |
| 75% | 6962181067.0 | 88.0 |
| max | 8877689391.0 | 203.0 |
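
Note that '`Id`' is an identifier, so its mean and standard deviation carry no meaning; only the '`Value`' statistics are informative. One possible refinement, sketched here on hypothetical readings for two made-up users, is to describe heart rate separately per user:

```python
import pandas as pd

# Hypothetical readings for two made-up user Ids
toy = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2],
    "Value": [60, 70, 80, 90, 110],
})

# Describe heart rate per user; Id serves only as the grouping key
per_user = toy.groupby("Id")["Value"].describe()
print(per_user[["count", "mean", "min", "max"]])
```
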


# **Data Cleaning and Trimming**

We now modify each column as needed so that the later analysis and data visualization run smoothly.

### 1. Standardize Date Column

Here we convert the '`Time`' column to datetime format. We then extract the date into a new '`Day`' column and reduce '`Time`' to hours and minutes (`HH:MM`). Next, we move the '`Day`' column so that it sits directly after '`Id`'. Finally, we display the first rows of the DataFrame to observe the changes.

In [7]:
# Convert the 'Time' column to datetime format
df['Time'] = pd.to_datetime(df['Time'], format='%m/%d/%Y %I:%M:%S %p')

# Extract the calendar date into 'Day' and keep only hour:minute in 'Time'
df['Day'] = df['Time'].dt.date
df['Time'] = df['Time'].dt.strftime('%H:%M')

# Move 'Day' so that it sits right after 'Id'
date_column = df.pop('Day')
df.insert(1, 'Day', date_column)

# Display the first rows to see the changes
df.head(5)

|   | Id | Day | Time | Value |
|---|---|---|---|---|
| 0 | 2022484408 | 2016-04-12 | 07:21 | 97 |
| 1 | 2022484408 | 2016-04-12 | 07:21 | 102 |
| 2 | 2022484408 | 2016-04-12 | 07:21 | 105 |
| 3 | 2022484408 | 2016-04-12 | 07:21 | 103 |
| 4 | 2022484408 | 2016-04-12 | 07:21 | 101 |
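
To see the conversion in isolation, here is a small sketch on three timestamps. The first two are copied from the raw data shown earlier; the third (a PM time) is a hypothetical value added only to show the 12-hour to 24-hour conversion.

```python
import pandas as pd

raw = pd.DataFrame({
    "Time": [
        "4/12/2016 7:21:00 AM",   # from the raw data
        "4/12/2016 7:21:05 AM",   # from the raw data
        "4/12/2016 2:40:00 PM",   # hypothetical PM example
    ]
})

# %I is the 12-hour clock and %p the AM/PM marker
parsed = pd.to_datetime(raw["Time"], format="%m/%d/%Y %I:%M:%S %p")

raw["Day"] = parsed.dt.date                 # datetime.date objects
raw["Time"] = parsed.dt.strftime("%H:%M")   # zero-padded 24-hour strings

print(raw)
```

After this step '`Time`' holds strings, not datetimes, and the seconds are dropped, which is why several rows share the same minute.
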


Here we replace the '`Value`' column with the median value within each group of '`Id`', '`Day`', and '`Time`', so that every minute has a single representative reading. We then drop duplicate rows based on '`Id`', '`Day`', and '`Time`', keeping the first occurrence. We rename the '`Value`' column to '`Heartrate`' and convert it to integer type. Finally, we add a '`DailyMeanHeartrate`' column holding the mean heart rate per '`Id`' and '`Day`'.

In [8]:
# Replace each reading with the median of its (Id, Day, Time) minute group
df['Value'] = df.groupby(['Id', 'Day', 'Time'])['Value'].transform('median')

# Keep one row per minute; .copy() gives an independent DataFrame to modify
df = df.drop_duplicates(subset=['Id', 'Day', 'Time'], keep='first').copy()

# Rename 'Value' to 'Heartrate' and cast to integer (fractional medians are truncated)
df = df.rename(columns={'Value': 'Heartrate'})
df['Heartrate'] = df['Heartrate'].astype(int)

# Create a new column 'DailyMeanHeartrate' to store the mean heart rate per day per ID
df['DailyMeanHeartrate'] = df.groupby(['Id', 'Day'])['Heartrate'].transform('mean')
df

|   | Id | Day | Time | Heartrate | DailyMeanHeartrate |
|---|---|---|---|---|---|
| 0 | 2022484408 | 2016-04-12 | 07:21 | 102 | 74.05 |
| 5 | 2022484408 | 2016-04-12 | 07:22 | 92 | 74.05 |
| 14 | 2022484408 | 2016-04-12 | 07:23 | 58 | 74.05 |
| 20 | 2022484408 | 2016-04-12 | 07:24 | 58 | 74.05 |
| 26 | 2022484408 | 2016-04-12 | 07:25 | 57 | 74.05 |
| ... | ... | ... | ... | ... | ... |
| 2483626 | 8877689391 | 2016-05-12 | 14:40 | 56 | 69.92 |
| 2483635 | 8877689391 | 2016-05-12 | 14:41 | 57 | 69.92 |
| 2483642 | 8877689391 | 2016-05-12 | 14:42 | 56 | 69.92 |
| 2483649 | 8877689391 | 2016-05-12 | 14:43 | 57 | 69.92 |
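
The per-minute median can be checked by hand on the five 07:21 readings shown in the earlier preview (97, 102, 105, 103, 101), whose median is 102. The sketch below reaches the same per-minute value in a single aggregation step; this is an alternative formulation for illustration, not the method used in this notebook.

```python
import pandas as pd

# The five 07:21 readings shown in the earlier preview
sample = pd.DataFrame({
    "Id": [2022484408] * 5,
    "Day": ["2016-04-12"] * 5,
    "Time": ["07:21"] * 5,
    "Value": [97, 102, 105, 103, 101],
})

# One row per (Id, Day, Time) group, holding the median reading
per_minute = (
    sample.groupby(["Id", "Day", "Time"], as_index=False)["Value"]
    .median()
    .rename(columns={"Value": "Heartrate"})
)
print(per_minute)
```

Unlike `transform` followed by `drop_duplicates`, this version resets the index and orders rows by the grouping keys. Also keep in mind that a minute with an even number of readings can have a fractional median, which `astype(int)` truncates rather than rounds.
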


### 2. Handling missing values

Now, we will calculate and display the percentage of missing values for each column in the '`df`' DataFrame. <br>This information helps in understanding the completeness of the dataset and identifies columns with **missing data**.

In [9]:
# Calculate the percentage of missing values for each column in 'df' DataFrame
row_size = df.shape[0]
for i in df.columns:
    if df[i].isnull().sum() > 0:
        print(i, "----------", (df[i].isnull().sum() / row_size) * 100)
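
The loop above prints only the columns that contain missing values, so an empty output means none were found. The same percentages can be computed without a loop; here is a minimal sketch on a hypothetical frame with one missing reading:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing heart rate value
toy = pd.DataFrame({
    "Id": [1, 1, 2, 2],
    "Heartrate": [60.0, np.nan, 75.0, 80.0],
})

# isnull() gives True/False per cell; the column mean is the missing fraction
missing_pct = toy.isnull().mean() * 100

# Show only the columns that actually have missing values
print(missing_pct[missing_pct > 0])
```
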

Next, we check for and print the number of duplicate rows in the '`df`' DataFrame, which helps to identify and address potential data duplication issues.

In [10]:
# Check for duplicate rows in 'df' DataFrame
df.duplicated().sum()

0

Remove any duplicate rows from the '`df`' DataFrame and confirm that none remain.

In [11]:
# Drop duplicate rows from the 'df' DataFrame
df = df.drop_duplicates()
df.duplicated().sum()

0
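
A count of 0 is expected here: the earlier `drop_duplicates` on '`Id`', '`Day`', and '`Time`' leaves at most one row per user per minute, so no two full rows can be identical. To show what these calls do when duplicates are present, here is a small sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical rows; the second row repeats the first exactly
toy = pd.DataFrame({
    "Id": [1, 1, 1],
    "Time": ["07:21", "07:21", "07:22"],
    "Heartrate": [102, 102, 92],
})

n_dupes = toy.duplicated().sum()   # counts rows that repeat an earlier row
deduped = toy.drop_duplicates()    # keeps the first occurrence of each row

print(n_dupes, len(deduped))
```
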

# **Viewing & Saving Clean Data**

Now we sort the DataFrame by '`Id`', '`Day`', and '`Time`'.

In [12]:
# Sort the DataFrame by 'Id', 'Day', and 'Time'
df = df.sort_values(by=['Id', 'Day', 'Time'])
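
Sorting on '`Time`' works even though the column holds strings, because zero-padded `HH:MM` values sort alphabetically in the same order as chronologically. A minimal sketch with hypothetical rows:

```python
import datetime
import pandas as pd

# Hypothetical, deliberately unordered rows
toy = pd.DataFrame({
    "Id": [2, 1, 1, 1],
    "Day": [
        datetime.date(2016, 4, 12),
        datetime.date(2016, 4, 13),
        datetime.date(2016, 4, 12),
        datetime.date(2016, 4, 12),
    ],
    "Time": ["09:00", "07:21", "14:40", "07:22"],
})

# Sort by user, then day, then minute; reset_index is only for a tidy printout
ordered = toy.sort_values(by=["Id", "Day", "Time"]).reset_index(drop=True)
print(ordered)
```
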

Finally, we save the cleaned data in `.csv` format.

Here, the modified '`df`' DataFrame is saved to a **CSV file** named '`heart_rate_cleaned.csv`' without including the index column.

In [13]:
df.to_csv('heart_rate_cleaned.csv', index=False)   # Save the modified 'df' DataFrame to a CSV file named 'heart_rate_cleaned.csv'
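
When the saved file is read back later, '`Day`' comes back as plain text unless it is parsed explicitly. The sketch below illustrates this with an in-memory buffer standing in for the file (an assumption so the example runs anywhere) and two rows taken from the cleaned table above:

```python
import io
import pandas as pd

# Two rows from the cleaned table, without the daily mean column
cleaned = pd.DataFrame({
    "Id": [2022484408, 2022484408],
    "Day": ["2016-04-12", "2016-04-12"],
    "Time": ["07:21", "07:22"],
    "Heartrate": [102, 92],
})

# Write without the index, then read back from the same buffer
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates turns 'Day' back into datetimes; 'Time' stays as HH:MM strings
restored = pd.read_csv(buffer, parse_dates=["Day"])
print(restored.dtypes)
```
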