# **Week-8 Assignment**
## **`Hourly` Stats Merge**
### *By Arijit Dhali [Linkedin](https://www.linkedin.com/in/arijit-dhali-b255b0138/)*
---
This dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to submit personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between outputs reflects the use of different Fitbit trackers and individual tracking behaviors and preferences.

---
This Notebook contains:
### `Dataset : Hourly Stats Merge`

# **Importing Libraries**


Before working with the data, we need to import the required libraries and set the display options for the dataset.

This code snippet imports necessary Python libraries, `sets display options for Pandas`, and prepares the environment for data analysis and visualization.

In [10]:
# Importing required libraries for data analysis and visualization
import pandas as pd                       # Pandas for data manipulation and analysis
import numpy as np                        # NumPy for numerical operations
import matplotlib.pyplot as plt           # Matplotlib for basic plotting
import seaborn as sns                     # Seaborn for statistical data visualization
import plotly.express as px               # Plotly Express for interactive visualizations
import re                                 # Import the regular expression module
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

# Setting display options for Pandas to show two decimal places for floating-point numbers
pd.set_option('display.float_format', lambda x: '%.2f' % x)

# **Loading Dataset**

After importing the libraries, we load the data using the `GitHub` links to the raw files.

We continue the setup by adjusting the `Pandas display options`, then load each dataset from its `URL` into a `Pandas` DataFrame.

In [11]:
# Display all columns without truncation
pd.set_option('display.max_columns', None)

# Load each related dataset from its URL into its own DataFrame
url_1 = 'https://raw.githubusercontent.com/ArijitDhali/PrepInsta-DA-Week-8/main/Clean%20Files/hourly_calories_cleaned.csv'
calories = pd.read_csv(url_1, encoding='unicode_escape')
url_2 = 'https://raw.githubusercontent.com/ArijitDhali/PrepInsta-DA-Week-8/main/Clean%20Files/hourly_intensities_cleaned.csv'
intensity = pd.read_csv(url_2, encoding='unicode_escape')
url_3 = 'https://raw.githubusercontent.com/ArijitDhali/PrepInsta-DA-Week-8/main/Clean%20Files/hourly_steps_cleaned.csv'
steps = pd.read_csv(url_3, encoding='unicode_escape')


# Optionally preview five random rows of a loaded DataFrame
# calories.sample(5)

Inspecting the columns present in every dataset.

In [12]:
# Display one random row from each DataFrame
print(calories.sample())
print(intensity.sample())
print(steps.sample())

               Id ActivityDay ActivityHour  Calories
15246  6290855005  2016-05-03        10:00       142
               Id ActivityDay ActivityHour  TotalIntensity  AverageIntensity
10845  4445114986  2016-04-29        05:00               0              0.00
               Id ActivityDay ActivityHour  StepTotal
17435  7086361926  2016-04-15        23:00          0
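
The sample rows above suggest that the three DataFrames share the columns `Id`, `ActivityDay`, and `ActivityHour`. As a minimal sketch, the shared columns can also be found programmatically. The one-row frames below are stand-ins built from the sample output, not the full dataset, and the variable `shared` is a name introduced only for this illustration.

```python
import pandas as pd

# One-row stand-ins with the same column layout as the loaded DataFrames
calories = pd.DataFrame({"Id": [6290855005], "ActivityDay": ["2016-05-03"],
                         "ActivityHour": ["10:00"], "Calories": [142]})
intensity = pd.DataFrame({"Id": [4445114986], "ActivityDay": ["2016-04-29"],
                          "ActivityHour": ["05:00"], "TotalIntensity": [0],
                          "AverageIntensity": [0.0]})
steps = pd.DataFrame({"Id": [7086361926], "ActivityDay": ["2016-04-15"],
                      "ActivityHour": ["23:00"], "StepTotal": [0]})

# Columns present in every frame are the candidate merge keys
shared = [col for col in calories.columns
          if col in intensity.columns and col in steps.columns]
print(shared)  # ['Id', 'ActivityDay', 'ActivityHour']
```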


# **Processing & Merging Dataset**

### 1. Intensity Dataframe

Left-join `intensity` onto `calories` using the shared key columns `Id`, `ActivityDay`, and `ActivityHour`.

In [13]:
# Merge DataFrames on specified columns
calories_intensity = pd.merge(calories, intensity, on=['Id', 'ActivityDay', 'ActivityHour'], how='left')

# Display the first 5 rows of the merged DataFrame
calories_intensity.head(5)

| | Id | ActivityDay | ActivityHour | Calories | TotalIntensity | AverageIntensity |
|---|---|---|---|---|---|---|
| 0 | 1503960366 | 2016-04-12 | 00:00 | 81 | 20 | 0.33 |
| 1 | 1503960366 | 2016-04-12 | 01:00 | 61 | 8 | 0.13 |
| 2 | 1503960366 | 2016-04-12 | 02:00 | 59 | 7 | 0.12 |
| 3 | 1503960366 | 2016-04-12 | 03:00 | 47 | 0 | 0.0 |
| 4 | 1503960366 | 2016-04-12 | 04:00 | 48 | 0 | 0.0 |
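
A left join keeps every row of the left DataFrame and fills the right-hand columns with `NaN` where no matching key exists. The sketch below is not part of the original notebook. It uses small made-up frames (`left`, `right`) to show how the `indicator` and `validate` parameters of `pd.merge` can surface unmatched rows and guard against duplicate keys.

```python
import pandas as pd

# Hypothetical data: Id 2 has no matching row in the right-hand frame
left = pd.DataFrame({"Id": [1, 1, 2],
                     "ActivityHour": ["00:00", "01:00", "00:00"],
                     "Calories": [81, 61, 70]})
right = pd.DataFrame({"Id": [1, 1],
                      "ActivityHour": ["00:00", "01:00"],
                      "TotalIntensity": [20, 8]})

merged = pd.merge(
    left, right,
    on=["Id", "ActivityHour"],
    how="left",
    validate="one_to_one",  # raises MergeError if a key repeats on either side
    indicator=True,         # adds a '_merge' column describing each row's origin
)

# Rows that exist only in the left frame carry NaN in the right-hand columns
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))  # 3 1
```

If `unmatched` is empty on the real data, the left join lost nothing relative to an inner join.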


### 2. Steps Dataframe

Left-join `steps` onto the `calories_intensity` DataFrame using the same key columns.

In [14]:
# Merge DataFrames on specified columns
final = pd.merge(calories_intensity, steps, on=['Id', 'ActivityDay', 'ActivityHour'], how='left')

# Display the first 5 rows of the merged DataFrame
final.head(5)

| | Id | ActivityDay | ActivityHour | Calories | TotalIntensity | AverageIntensity | StepTotal |
|---|---|---|---|---|---|---|---|
| 0 | 1503960366 | 2016-04-12 | 00:00 | 81 | 20 | 0.33 | 373 |
| 1 | 1503960366 | 2016-04-12 | 01:00 | 61 | 8 | 0.13 | 160 |
| 2 | 1503960366 | 2016-04-12 | 02:00 | 59 | 7 | 0.12 | 151 |
| 3 | 1503960366 | 2016-04-12 | 03:00 | 47 | 0 | 0.0 | 0 |
| 4 | 1503960366 | 2016-04-12 | 04:00 | 48 | 0 | 0.0 | 0 |
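
Because both merges use the same keys and the same join type, they could also be chained in one step. The sketch below is an alternative under that assumption, not the notebook's own method. It uses `functools.reduce` on two-row frames copied from the first two rows of the output above, and the name `combined` is introduced here for illustration.

```python
from functools import reduce

import pandas as pd

keys = ["Id", "ActivityDay", "ActivityHour"]

# Two-row stand-ins taken from the first two rows of the merged output
base = {"Id": [1503960366, 1503960366],
        "ActivityDay": ["2016-04-12", "2016-04-12"],
        "ActivityHour": ["00:00", "01:00"]}
calories = pd.DataFrame({**base, "Calories": [81, 61]})
intensity = pd.DataFrame({**base, "TotalIntensity": [20, 8],
                          "AverageIntensity": [0.33, 0.13]})
steps = pd.DataFrame({**base, "StepTotal": [373, 160]})

# Fold the list of frames into one by left-joining each onto the running result
combined = reduce(
    lambda left, right: pd.merge(left, right, on=keys, how="left"),
    [calories, intensity, steps],
)
print(combined.columns.tolist())
```

This form scales to more hourly tables without adding intermediate variables, at the cost of hiding the intermediate result that the two-step version lets you inspect.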


## **Final DataFrame Process**

Here we count the number of duplicate rows in the `final` DataFrame using the `duplicated()` method.

In [15]:
# Count the number of duplicate rows in the 'final' DataFrame
final.duplicated().sum()

0
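
Note that `duplicated()` without arguments only flags rows that are identical in every column. As a hedged illustration on made-up data, the sketch below shows how passing `subset` also catches rows that repeat the same key (`Id`, `ActivityDay`, `ActivityHour`) but differ in a measured value. The frame `df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame: the first two rows share a key but differ in Calories
df = pd.DataFrame({
    "Id": [1, 1, 2],
    "ActivityDay": ["2016-04-12", "2016-04-12", "2016-04-12"],
    "ActivityHour": ["00:00", "00:00", "00:00"],
    "Calories": [81, 82, 70],
})

# Full-row comparison finds nothing, because Calories differs
full_row_dupes = int(df.duplicated().sum())

# Key-only comparison flags the second occurrence of the repeated key
key_dupes = int(df.duplicated(subset=["Id", "ActivityDay", "ActivityHour"]).sum())

print(full_row_dupes, key_dupes)  # 0 1
```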

Here we count the number of null values in each column of the `final` DataFrame using the `isnull()` method.

In [16]:
# Count the number of null values in each column of the 'final' DataFrame
final.isnull().sum()

Id                  0
ActivityDay         0
ActivityHour        0
Calories            0
TotalIntensity      0
AverageIntensity    0
StepTotal           0
dtype: int64

Here we round all numeric columns in the `final` DataFrame to two decimal places. Then, we standardize the column names by converting camel case to snake case using a regular expression.

In [17]:
# Round all numeric columns to two decimal places
final = final.round(2)

# Standardize column names by converting camel case to snake case
final.columns = [re.sub('([a-z0-9])([A-Z])', r'\1_\2', col).lower() for col in final.columns]

final.sample()

| | id | activity_day | activity_hour | calories | total_intensity | average_intensity | step_total |
|---|---|---|---|---|---|---|---|
| 4729 | 2026352035 | 2016-04-27 | 05:00 | 47 | 0 | 0.0 | 0 |
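
To see what the regular expression does on its own, the sketch below applies the same pattern to the column names from this notebook. The helper name `camel_to_snake` and the extra test name `HTTPCode` are hypothetical and used only for illustration.

```python
import re

def camel_to_snake(name):
    # Insert an underscore wherever a lowercase letter or digit is
    # immediately followed by an uppercase letter, then lowercase everything
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', name).lower()

columns = ["Id", "ActivityDay", "ActivityHour", "TotalIntensity", "StepTotal"]
converted = [camel_to_snake(col) for col in columns]
print(converted)
# ['id', 'activity_day', 'activity_hour', 'total_intensity', 'step_total']

# Limitation: a run of capitals is not split, since no lowercase letter precedes them
print(camel_to_snake("HTTPCode"))  # httpcode
```

The pattern is sufficient for the column names here. Names containing acronyms would need an additional rule.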


Finally, we save the merged data to a CSV file.

In [18]:
final.to_csv('hourly_stats_data.csv', index=False)  # Save a copy of the final data without the index column
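
Passing `index=False` keeps the row index out of the file, so reloading the CSV does not produce an extra `Unnamed: 0` column. The round-trip sketch below illustrates this with an in-memory buffer instead of a file on disk. The one-row frame is a stand-in built from the sample row above, and `reloaded` is a name introduced here.

```python
import io

import pandas as pd

# One-row stand-in for the final DataFrame
final = pd.DataFrame({
    "id": [2026352035],
    "activity_day": ["2016-04-27"],
    "activity_hour": ["05:00"],
    "calories": [47],
    "total_intensity": [0],
    "average_intensity": [0.0],
    "step_total": [0],
})

# Write to an in-memory text buffer, then read it back
buffer = io.StringIO()
final.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The reloaded frame has the same columns and no stray index column
print(reloaded.columns.tolist() == final.columns.tolist())
print("Unnamed: 0" in reloaded.columns)
```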