# **Importing Libraries**


Before we can work with the data, we need to import the required libraries and configure how Pandas displays the dataset.

The cell below imports the necessary Python libraries, sets the float display format for Pandas, and prepares the environment for data analysis and visualization.

In [24]:
# Importing required libraries for data analysis and visualization
import pandas as pd                       # Pandas for data manipulation and analysis
import numpy as np                        # NumPy for numerical operations
import matplotlib.pyplot as plt           # Matplotlib for basic plotting
import seaborn as sns                     # Seaborn for statistical data visualization
import plotly.express as px               # Plotly Express for interactive visualizations
import re                                 # Regular expressions, used later to rename columns
import warnings                           # Warning control

# Suppress all warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Setting display options for Pandas to show two decimal places for floating-point numbers
pd.set_option('display.float_format', lambda x: '%.2f' % x)
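
As a quick illustration of what this option does, here is a minimal sketch on hypothetical toy values (not taken from the dataset). It suggests that `display.float_format` only changes how floats are rendered; the stored values stay as they are.

```python
import pandas as pd

# Hypothetical toy data, not taken from the notebook's CSV files
demo = pd.DataFrame({"Calories": [0.7865, 1.23456]})

# Same option as in the cell above: render floats with two decimals
pd.set_option('display.float_format', lambda x: '%.2f' % x)

rendered = demo.to_string()          # the text that print(demo) would show
stored = demo["Calories"].iloc[0]    # the underlying value is unchanged

print(rendered)
print(stored)

pd.reset_option('display.float_format')  # restore the default formatting
```

If the values themselves need to be rounded, `DataFrame.round` is the tool for that, which is what happens later in this notebook.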

# **Loading Dataset**

Next, we adjust the `Pandas` display options so that all columns are shown, then load each CSV file into its own `Pandas` DataFrame.

In [25]:
# Display all columns without truncation
pd.set_option('display.max_columns', None)

# Load each related dataset into its own DataFrame

met = pd.read_csv('minute_MET_cleaned.csv', encoding='unicode_escape')
calories = pd.read_csv('minute_calories_cleaned.csv', encoding='unicode_escape')
intensity = pd.read_csv('minute_intensities_cleaned.csv', encoding='unicode_escape')
sleep = pd.read_csv('minute_sleep_cleaned.csv', encoding='unicode_escape')
steps = pd.read_csv('minute_steps_cleaned.csv', encoding='unicode_escape')



We inspect the columns present in each dataset by printing one random row from every DataFrame.

In [26]:
# Display one random row from each DataFrame
print(met.sample())
print(calories.sample())
print(intensity.sample())
print(sleep.sample())
print(steps.sample())

                 Id ActivityDay ActivityMinute  METs
1230115  8583815059  2016-05-04       04:55:00    10
                Id ActivityDay ActivityMinute  Calories
978239  6962181067  2016-04-24       11:59:00      0.99
                Id ActivityDay ActivityMinute  Intensity
845516  6117666160  2016-04-12       22:56:00          0
               Id ActivityDay ActivityMinute  Sleep
91162  4702921684  2016-04-15       03:37:30      1
                Id ActivityDay ActivityMinute  Steps
866518  6117666160  2016-04-27       12:58:00     41


# **Processing & Merging Dataset**

### 1. Calories DataFrame

We left-join the calories DataFrame onto the MET DataFrame using the key columns `Id`, `ActivityDay`, and `ActivityMinute`.

In [27]:
# Merge DataFrames on specified columns
met_calories = pd.merge(met, calories, on=['Id', 'ActivityDay', 'ActivityMinute'], how='left')

# Display the first 5 rows of the merged DataFrame
met_calories.head(5)

           Id ActivityDay ActivityMinute  METs  Calories
0  1503960366  2016-04-12       00:00:00    10      0.79
1  1503960366  2016-04-12       00:01:00    10      0.79
2  1503960366  2016-04-12       00:02:00    10      0.79
3  1503960366  2016-04-12       00:03:00    10      0.79
4  1503960366  2016-04-12       00:04:00    10      0.79
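
To make the behaviour of a left join concrete, here is a minimal sketch on hypothetical toy frames that mimic the column layout above. The `validate='one_to_one'` argument is an optional safeguard that is not used in the notebook; it makes `pd.merge` raise an error if a key combination appears more than once on either side.

```python
import pandas as pd

keys = ['Id', 'ActivityDay', 'ActivityMinute']

# Hypothetical toy frames that mimic the column layout shown above
met_demo = pd.DataFrame({
    'Id': [1, 1, 1],
    'ActivityDay': ['2016-04-12'] * 3,
    'ActivityMinute': ['00:00:00', '00:01:00', '00:02:00'],
    'METs': [10, 10, 12],
})
calories_demo = pd.DataFrame({
    'Id': [1, 1],
    'ActivityDay': ['2016-04-12'] * 2,
    'ActivityMinute': ['00:00:00', '00:01:00'],
    'Calories': [0.79, 0.79],
})

# Left join: every row of met_demo is kept; minutes without a match get NaN.
# validate='one_to_one' raises MergeError if a key is duplicated on either side.
merged_demo = pd.merge(met_demo, calories_demo, on=keys, how='left',
                       validate='one_to_one')
print(merged_demo)
```

The row count of the result equals the row count of the left frame, and the minute without a calories record ends up with `NaN` in `Calories`.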


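### 2. Intensity DataFrame

We left-join the intensity DataFrame onto the merged MET and calories DataFrame using the key columns `Id`, `ActivityDay`, and `ActivityMinute`.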

In [28]:
# Merge DataFrames on specified columns
met_calories_intensity = pd.merge(met_calories, intensity, on=['Id', 'ActivityDay', 'ActivityMinute'], how='left')

# Display the first 5 rows of the merged DataFrame
met_calories_intensity.head(5)

           Id ActivityDay ActivityMinute  METs  Calories  Intensity
0  1503960366  2016-04-12       00:00:00    10      0.79          0
1  1503960366  2016-04-12       00:01:00    10      0.79          0
2  1503960366  2016-04-12       00:02:00    10      0.79          0
3  1503960366  2016-04-12       00:03:00    10      0.79          0
4  1503960366  2016-04-12       00:04:00    10      0.79          0


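### 3. Sleep DataFrame

We left-join the sleep DataFrame onto the merged DataFrame using the key columns `Id`, `ActivityDay`, and `ActivityMinute`. Minutes without a matching sleep record receive `NaN` in the `Sleep` column.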

In [29]:
# Merge DataFrames on specified columns
met_calories_intensity_sleep = pd.merge(met_calories_intensity, sleep, on=['Id', 'ActivityDay', 'ActivityMinute'], how='left')

# Display the first 5 rows of the merged DataFrame
met_calories_intensity_sleep.head(5)

           Id ActivityDay ActivityMinute  METs  Calories  Intensity  Sleep
0  1503960366  2016-04-12       00:00:00    10      0.79          0    NaN
1  1503960366  2016-04-12       00:01:00    10      0.79          0    NaN
2  1503960366  2016-04-12       00:02:00    10      0.79          0    NaN
3  1503960366  2016-04-12       00:03:00    10      0.79          0    NaN
4  1503960366  2016-04-12       00:04:00    10      0.79          0    NaN
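
The `Sleep` column is empty in the rows above, so it can be worth checking how many rows actually found a match. The sampled sleep row printed earlier carries the timestamp `03:37:30`; if such off-minute timestamps are common in the sleep data (an assumption based on a single sample, not verified here), they cannot match keys that fall on whole minutes. The sketch below uses hypothetical toy data and the `indicator=True` option of `pd.merge` to count matched and unmatched rows.

```python
import pandas as pd

keys = ['Id', 'ActivityDay', 'ActivityMinute']

# Hypothetical toy data; the second sleep row is stamped 30 seconds past the minute
activity_demo = pd.DataFrame({
    'Id': [1, 1, 1],
    'ActivityDay': ['2016-04-12'] * 3,
    'ActivityMinute': ['03:36:00', '03:37:00', '03:38:00'],
    'Intensity': [0, 0, 0],
})
sleep_demo = pd.DataFrame({
    'Id': [1, 1],
    'ActivityDay': ['2016-04-12'] * 2,
    'ActivityMinute': ['03:36:00', '03:37:30'],
    'Sleep': [1, 1],
})

# An outer join with indicator=True labels each row as both / left_only / right_only
check = pd.merge(activity_demo, sleep_demo, on=keys, how='outer', indicator=True)
match_counts = check['_merge'].value_counts()
print(match_counts)
```

In this toy case one sleep row ends up as `right_only`, meaning a left join would silently drop it. Running the same check on the real DataFrames would show whether that happens in practice.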


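### 4. Steps DataFrame

We left-join the steps DataFrame onto the merged DataFrame using the key columns `Id`, `ActivityDay`, and `ActivityMinute`. The result is the `final` DataFrame.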

In [30]:
# Merge DataFrames on specified columns
final = pd.merge(met_calories_intensity_sleep, steps, on=['Id', 'ActivityDay', 'ActivityMinute'], how='left')

# Display the first 5 rows of the merged DataFrame
final.head(5)

           Id ActivityDay ActivityMinute  METs  Calories  Intensity  Sleep  Steps
0  1503960366  2016-04-12       00:00:00    10      0.79          0    NaN      0
1  1503960366  2016-04-12       00:01:00    10      0.79          0    NaN      0
2  1503960366  2016-04-12       00:02:00    10      0.79          0    NaN      0
3  1503960366  2016-04-12       00:03:00    10      0.79          0    NaN      0
4  1503960366  2016-04-12       00:04:00    10      0.79          0    NaN      0
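
The four merges above all share the same keys and join type, so they could also be written as one fold over a list of DataFrames. This is only an alternative sketch, not what the notebook does; `make_frame` is a hypothetical helper that builds toy data.

```python
from functools import reduce

import pandas as pd

keys = ['Id', 'ActivityDay', 'ActivityMinute']

def make_frame(column, values):
    """Build a hypothetical two-row frame sharing the same key columns."""
    return pd.DataFrame({
        'Id': [1, 1],
        'ActivityDay': ['2016-04-12'] * 2,
        'ActivityMinute': ['00:00:00', '00:01:00'],
        column: values,
    })

frames = [
    make_frame('METs', [10, 10]),
    make_frame('Calories', [0.79, 0.79]),
    make_frame('Intensity', [0, 0]),
    make_frame('Steps', [0, 0]),
]

# Fold the list left to right, left-joining each frame onto the running result
combined = reduce(
    lambda left, right: pd.merge(left, right, on=keys, how='left'),
    frames,
)
print(combined.columns.tolist())
```

The step-by-step version used in this notebook has the advantage that each intermediate result can be inspected, which is useful while exploring.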


## **Final DataFrame Process**

Here we count the number of duplicate rows in the `final` DataFrame using the `duplicated()` method.

In [31]:
# Count the number of duplicate rows in the 'final' DataFrame
final.duplicated().sum()

0

Here we count the number of null values in each column of the `final` DataFrame using the `isnull()` method.

In [32]:
# Count the number of null values in each column of the 'final' DataFrame
final.isnull().sum()

Id                      0
ActivityDay             0
ActivityMinute          0
METs                    0
Calories                0
Intensity               0
Sleep             1200220
Steps                   0
dtype: int64

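Only `Sleep` contains null values; these are the minutes for which the left join found no matching sleep record. We fill the `NaN` values in `Sleep` with `0` and convert the column to an integer type.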

In [33]:
# Replace NaN values in the Sleep column with 0
final['Sleep'] = final['Sleep'].fillna(0)

# Convert the Sleep column to integer type
final['Sleep'] = final['Sleep'].astype(int)

# Confirm that no null values remain
final.isnull().sum()

Id                0
ActivityDay       0
ActivityMinute    0
METs              0
Calories          0
Intensity         0
Sleep             0
Steps             0
dtype: int64
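
The cast to `int` is needed because a column that contains `NaN` is stored as `float64`. The sketch below shows this on hypothetical toy frames, together with an alternative that the notebook does not use: pandas' nullable `Int64` dtype, which keeps missing values distinct from a real `0`.

```python
import pandas as pd

# Hypothetical toy frames: only minute 1 has a sleep record
left = pd.DataFrame({'minute': [0, 1, 2]})
right = pd.DataFrame({'minute': [1], 'Sleep': [1]})

merged = left.merge(right, on='minute', how='left')
dtype_before = merged['Sleep'].dtype      # float64, because NaN cannot be stored in int64

# Option A (as in the notebook): treat missing as 0, then cast to int
filled = merged['Sleep'].fillna(0).astype(int)

# Option B: keep missing values distinct with pandas' nullable integer dtype
nullable = merged['Sleep'].astype('Int64')

print(dtype_before, filled.tolist(), nullable.isna().sum())
```

Option A is simpler for downstream numeric work, but it makes "no sleep record" indistinguishable from a recorded value of `0`. Which option fits better depends on how `Sleep` is interpreted later.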

Here we round all numeric columns in the `final` DataFrame to two decimal places. Then we standardize the column names by converting camel case to snake case with a regular expression.

In [34]:
# Round all numeric columns to two decimal places
final = final.round(2)

# Standardize column names by converting camel case to snake case
final.columns = [re.sub('([a-z0-9])([A-Z])', r'\1_\2', col).lower() for col in final.columns]

final.sample()

                id activity_day activity_minute  mets  calories  intensity  sleep  steps
534016  4057192912   2016-04-12        16:16:00    56      6.91          1      0     82
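
To see what the regular expression does on its own, here is a small sketch that applies the same pattern to a list of names. `to_snake_case` is a hypothetical helper name, and `HTTPResponse` is an invented example that shows a limitation of this pattern.

```python
import re

def to_snake_case(name):
    """Insert '_' between a lowercase letter or digit and an uppercase letter, then lowercase."""
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', name).lower()

columns = ['Id', 'ActivityDay', 'ActivityMinute', 'METs', 'Calories']
converted = [to_snake_case(col) for col in columns]
print(converted)

# A run of capitals has no lowercase-to-uppercase boundary, so no '_' is inserted
acronym = to_snake_case('HTTPResponse')
print(acronym)
```

The pattern only splits where a lowercase letter or digit is directly followed by an uppercase letter, which is why `METs` becomes `mets` rather than `me_ts`.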


Finally, we save the processed DataFrame to a CSV file.

In [35]:
# Save a copy of the final data without the index column
final.to_csv('minute_stats_data.csv', index=False)