#**Week-8 Assignment**
##**`Daily` Stats Merge**
---
This dataset was generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to submit personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between outputs reflects the use of different Fitbit trackers and individual tracking behaviors/preferences.

---
This Notebook contains:
### `Dataset : Daily Stats Merge`

# **Importing Libraries**


Before doing anything with the data, we need to import the libraries and set the display options for the dataset.

This code snippet imports necessary Python libraries, `sets display options for Pandas`, and prepares the environment for data analysis and visualization.

In [1]:
# Importing required libraries for data analysis and visualization
import pandas as pd                       # Pandas for data manipulation and analysis
import numpy as np                        # NumPy for numerical operations
import matplotlib.pyplot as plt           # Matplotlib for basic plotting
import seaborn as sns                     # Seaborn for statistical data visualization
import plotly.express as px               # Plotly Express for interactive visualizations
import re                                 # Import the regular expression module
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

# Setting display options for Pandas to show two decimal places for floating-point numbers
pd.set_option('display.float_format', lambda x: '%.2f' % x)

# **Loading Dataset**

After importing the libraries, we load the data using the `GitHub` links to the raw files.

This cell continues the setup by adjusting the `Pandas display options` and then loads each dataset from its `URL` into a `Pandas` DataFrame.

In [2]:
# Display all columns without truncation
pd.set_option('display.max_columns', None)

# Load each related dataset from its URL into its own DataFrame
url_1 = 'https://raw.githubusercontent.com/vignay21/Prepinsta-Winter-Internship/main/PrepInsta-Week8/Clean%20CSV/daily_activity_cleaned.csv'
activity = pd.read_csv(url_1, encoding='unicode_escape')
#url_2 = 'https://raw.githubusercontent.com/vignay21/Prepinsta-Winter-Internship/main/PrepInsta-Week8/Clean%20CSV/daily_calories_cleaned.csv'
#calories = pd.read_csv(url_2, encoding='unicode_escape')
#url_3 = 'https://raw.githubusercontent.com/vignay21/Prepinsta-Winter-Internship/main/PrepInsta-Week8/Clean%20CSV/daily_intensities_cleaned.csv'
#intensity = pd.read_csv(url_3, encoding='unicode_escape')
#url_4 = 'https://raw.githubusercontent.com/vignay21/Prepinsta-Winter-Internship/main/PrepInsta-Week8/Clean%20CSV/daily_steps_cleaned.csv'
#steps = pd.read_csv(url_4, encoding='unicode_escape')
url_5 = 'https://raw.githubusercontent.com/vignay21/Prepinsta-Winter-Internship/main/PrepInsta-Week8/Clean%20CSV/heart_rate_cleaned.csv'
heart = pd.read_csv(url_5, encoding='unicode_escape')
url_6 = 'https://raw.githubusercontent.com/vignay21/Prepinsta-Winter-Internship/main/PrepInsta-Week8/Clean%20CSV/sleep_day_data_cleaned.csv'
sleep = pd.read_csv(url_6, encoding='unicode_escape')
url_7 = 'https://raw.githubusercontent.com/vignay21/Prepinsta-Winter-Internship/main/PrepInsta-Week8/Clean%20CSV/weight_log_info_data_cleaned.csv'
weight = pd.read_csv(url_7, encoding='unicode_escape')


# Preview five random rows of a loaded DataFrame (uncomment to run)
#activity.sample(5)
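
The loads above repeat the same `pd.read_csv` call once per file. A minimal sketch of the same idea as a loop is shown here; the helper name `load_tables` is hypothetical, and small in-memory CSV strings with made-up values stand in for the GitHub URLs so the example runs offline.

```python
import io
import pandas as pd

def load_tables(sources):
    """Read every CSV in `sources` (name -> path, URL, or file-like object)
    into a dict of DataFrames keyed by the same names."""
    return {name: pd.read_csv(src) for name, src in sources.items()}

# Tiny in-memory stand-ins for the remote CSV files (made-up values).
sources = {
    "activity": io.StringIO("Id,ActivityDate,TotalSteps\n1,2016-04-12,100\n"),
    "sleep": io.StringIO("Id,SleepDay,TotalMinutesAsleep\n1,2016-04-12,400\n"),
}

tables = load_tables(sources)
for name, frame in tables.items():
    print(name, frame.shape)
```

With real data, the dictionary values would be the raw-file URLs, and any extra `read_csv` arguments (such as `encoding`) would be passed inside the helper.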

Inspecting the columns present in every dataset.

In [3]:
# Display one random row from each DataFrame
print(activity.sample())
print(heart.sample())
print(sleep.sample())
print(weight.sample())

             Id ActivityDate  TotalSteps  TotalDistance  TrackerDistance  \
845  8378563200   2016-05-09        8382           6.65             6.65   

     LoggedActivitiesDistance  VeryActiveDistance  ModeratelyActiveDistance  \
845                      2.09                1.27                      0.66   

     LightActiveDistance  SedentaryActiveDistance  VeryActiveMinutes  \
845                 4.72                     0.00                 71   

     FairlyActiveMinutes  LightlyActiveMinutes  SedentaryMinutes  Calories  
845                   13                   171               772      3721  
               Id         Day   Time  Heartrate  DailyMeanHeartrate
33077  2347167796  2016-04-21  08:46         85               71.58
             Id    SleepDay  TotalMinutesAsleep  TotalSleepRecords  \
157  4388161847  2016-05-01                 547                  1   

     TotalTimeInBed  
157             597  
            Id Date Weight   BMI  WeightKg  WeightPounds
15  6962181

# **Processing & Merging Dataset**

###1. Heart Rate Dataframe

In [4]:
# Specify column names to drop
columns_to_drop = ['Heartrate', 'Time']

# Drop the specified columns from the DataFrame
heart = heart.drop(columns=columns_to_drop)

# Drop duplicate rows based on all columns
heart = heart.drop_duplicates()

heart.head(5)

Unnamed: 0,Id,Day,DailyMeanHeartrate
0,2022484408,2016-04-12,74.05
718,2022484408,2016-04-13,78.64
1474,2022484408,2016-04-14,70.55
2244,2022484408,2016-04-15,78.63
2954,2022484408,2016-04-16,74.5
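
To see why dropping the per-minute columns and then deduplicating leaves one row per user and day, here is a small sketch on made-up readings. It assumes, as in the cell above, that the daily mean is repeated on every minute-level row; the toy values are not taken from the dataset.

```python
import pandas as pd

# Made-up minute-level readings: the daily mean is repeated on every row.
minute_level = pd.DataFrame({
    "Id": [1, 1, 1, 1],
    "Day": ["2016-04-12", "2016-04-12", "2016-04-13", "2016-04-13"],
    "Time": ["08:00", "08:01", "08:00", "08:01"],
    "Heartrate": [70, 80, 60, 64],
    "DailyMeanHeartrate": [75.0, 75.0, 62.0, 62.0],
})

# Without the per-minute columns, rows of the same day become identical,
# so drop_duplicates() keeps only the first row of each day.
daily = minute_level.drop(columns=["Heartrate", "Time"]).drop_duplicates()

# Each (Id, Day) pair should now appear exactly once.
assert not daily.duplicated(subset=["Id", "Day"]).any()
print(daily)
```

`drop_duplicates()` keeps the original index label of each surviving row, which is why the index in the output above jumps (0, 718, 1474, ...) instead of counting up from zero.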


In [5]:
# Merge DataFrames on specified columns
activity_heart = pd.merge(activity, heart, left_on=['Id', 'ActivityDate'], right_on=['Id', 'Day'], how='left')

# Move the 'DailyMeanHeartrate' column to position 2, right after 'ActivityDate'
moving_column = activity_heart.pop('DailyMeanHeartrate')
activity_heart.insert(2, 'DailyMeanHeartrate', moving_column)

# Rename the 'DailyMeanHeartrate' column to 'DailyAverageHeartrate'
activity_heart.rename(columns={'DailyMeanHeartrate': 'DailyAverageHeartrate'}, inplace=True)

activity_heart.head(5)

Unnamed: 0,Id,ActivityDate,DailyAverageHeartrate,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories,Day
0,1503960366,2016-04-12,,13162,8.5,8.5,0.0,1.88,0.55,6.06,0.0,25,13,328,728,1985,
1,1503960366,2016-04-13,,10735,6.97,6.97,0.0,1.57,0.69,4.71,0.0,21,19,217,776,1797,
2,1503960366,2016-04-14,,10460,6.74,6.74,0.0,2.44,0.4,3.91,0.0,30,11,181,1218,1776,
3,1503960366,2016-04-15,,9762,6.28,6.28,0.0,2.14,1.26,2.83,0.0,29,34,209,726,1745,
4,1503960366,2016-04-16,,12669,8.16,8.16,0.0,2.71,0.41,5.04,0.0,36,10,221,773,1863,
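
The empty heart-rate values above come from the left join: every activity row is kept, and rows without a matching `(Id, Day)` pair get `NaN`. The sketch below reproduces this on made-up frames. The `indicator=True` argument is an addition that the notebook does not use; it adds a `_merge` column showing which rows found a match.

```python
import pandas as pd

# Made-up stand-ins for the activity and heart-rate tables.
activity = pd.DataFrame({
    "Id": [1, 1, 2],
    "ActivityDate": ["2016-04-12", "2016-04-13", "2016-04-12"],
    "TotalSteps": [1000, 2000, 3000],
})
heart = pd.DataFrame({
    "Id": [2],
    "Day": ["2016-04-12"],
    "DailyMeanHeartrate": [71.5],
})

# Left join: all rows of `activity` survive; unmatched rows get NaN.
merged = pd.merge(
    activity, heart,
    left_on=["Id", "ActivityDate"], right_on=["Id", "Day"],
    how="left", indicator=True,
)

# 'both' = matched in the right table, 'left_only' = no heart-rate record.
print(merged["_merge"].value_counts())
```

Because the date columns have different names on each side, both `ActivityDate` and `Day` appear in the result, which is why the redundant key column has to be dropped later.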


###2. Sleep Day Dataframe

In [6]:
# Merge DataFrames on specified columns
activity_heart_sleep = pd.merge(activity_heart, sleep, left_on=['Id', 'ActivityDate'], right_on=['Id', 'SleepDay'], how='left')
activity_heart_sleep.head(5)

Unnamed: 0,Id,ActivityDate,DailyAverageHeartrate,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories,Day,SleepDay,TotalMinutesAsleep,TotalSleepRecords,TotalTimeInBed
0,1503960366,2016-04-12,,13162,8.5,8.5,0.0,1.88,0.55,6.06,0.0,25,13,328,728,1985,,2016-04-12,327.0,1.0,346.0
1,1503960366,2016-04-13,,10735,6.97,6.97,0.0,1.57,0.69,4.71,0.0,21,19,217,776,1797,,2016-04-13,384.0,2.0,407.0
2,1503960366,2016-04-14,,10460,6.74,6.74,0.0,2.44,0.4,3.91,0.0,30,11,181,1218,1776,,,,,
3,1503960366,2016-04-15,,9762,6.28,6.28,0.0,2.14,1.26,2.83,0.0,29,34,209,726,1745,,2016-04-15,412.0,1.0,442.0
4,1503960366,2016-04-16,,12669,8.16,8.16,0.0,2.71,0.41,5.04,0.0,36,10,221,773,1863,,2016-04-16,340.0,2.0,367.0


###3. Weight Dataframe

In [7]:
# Merge DataFrames on specified columns
final = pd.merge(activity_heart_sleep, weight, left_on=['Id', 'ActivityDate'], right_on=['Id', 'Date Weight'], how='left')
final.head(5)

Unnamed: 0,Id,ActivityDate,DailyAverageHeartrate,TotalSteps,TotalDistance,TrackerDistance,LoggedActivitiesDistance,VeryActiveDistance,ModeratelyActiveDistance,LightActiveDistance,SedentaryActiveDistance,VeryActiveMinutes,FairlyActiveMinutes,LightlyActiveMinutes,SedentaryMinutes,Calories,Day,SleepDay,TotalMinutesAsleep,TotalSleepRecords,TotalTimeInBed,Date Weight,BMI,WeightKg,WeightPounds
0,1503960366,2016-04-12,,13162,8.5,8.5,0.0,1.88,0.55,6.06,0.0,25,13,328,728,1985,,2016-04-12,327.0,1.0,346.0,,,,
1,1503960366,2016-04-13,,10735,6.97,6.97,0.0,1.57,0.69,4.71,0.0,21,19,217,776,1797,,2016-04-13,384.0,2.0,407.0,,,,
2,1503960366,2016-04-14,,10460,6.74,6.74,0.0,2.44,0.4,3.91,0.0,30,11,181,1218,1776,,,,,,,,,
3,1503960366,2016-04-15,,9762,6.28,6.28,0.0,2.14,1.26,2.83,0.0,29,34,209,726,1745,,2016-04-15,412.0,1.0,442.0,,,,
4,1503960366,2016-04-16,,12669,8.16,8.16,0.0,2.71,0.41,5.04,0.0,36,10,221,773,1863,,2016-04-16,340.0,2.0,367.0,,,,


In [8]:
final.shape

(940, 25)
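
A left merge preserves the row count of the left table only when the join keys are unique on the right side. One way to guard against accidental row duplication is the `validate` argument of `pd.merge`, sketched below on made-up frames; this check is an assumption about how one might verify the merges, not something the notebook does.

```python
import pandas as pd

# Made-up frames: `right_dup` repeats the same (Id, Date) key twice.
left = pd.DataFrame({"Id": [1, 1], "ActivityDate": ["2016-04-12", "2016-04-13"]})
right_ok = pd.DataFrame({"Id": [1], "Date": ["2016-04-12"], "WeightKg": [52.6]})
right_dup = pd.concat([right_ok, right_ok], ignore_index=True)

keys = dict(left_on=["Id", "ActivityDate"], right_on=["Id", "Date"], how="left")

# Unique keys on both sides: the merge succeeds and keeps the row count.
ok = pd.merge(left, right_ok, validate="one_to_one", **keys)
assert len(ok) == len(left)

# Duplicate keys on the right: pandas raises instead of silently adding rows.
try:
    pd.merge(left, right_dup, validate="one_to_one", **keys)
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)
```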

##**Final DataFrame Process**

In [9]:
# Redundant date key columns left over from the merges
columns_to_drop = ['Date Weight', 'SleepDay', 'Day']

# Drop the specified columns
final = final.drop(columns=columns_to_drop)
final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 22 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Id                        940 non-null    int64  
 1   ActivityDate              940 non-null    object 
 2   DailyAverageHeartrate     334 non-null    float64
 3   TotalSteps                940 non-null    int64  
 4   TotalDistance             940 non-null    float64
 5   TrackerDistance           940 non-null    float64
 6   LoggedActivitiesDistance  940 non-null    float64
 7   VeryActiveDistance        940 non-null    float64
 8   ModeratelyActiveDistance  940 non-null    float64
 9   LightActiveDistance       940 non-null    float64
 10  SedentaryActiveDistance   940 non-null    float64
 11  VeryActiveMinutes         940 non-null    int64  
 12  FairlyActiveMinutes       940 non-null    int64  
 13  LightlyActiveMinutes      940 non-null    int64  
 14  SedentaryM

Here we count the number of duplicate rows in the '`final`' DataFrame using the `duplicated()` method.

In [10]:
# Count the number of duplicate rows in the 'final' DataFrame
final.duplicated().sum()

0

Here we count the number of null values in each column of the '`final`' DataFrame using the `isnull()` method.

In [11]:
# Count the number of null values in each column of the 'final' DataFrame
final.isnull().sum()

Id                            0
ActivityDate                  0
DailyAverageHeartrate       606
TotalSteps                    0
TotalDistance                 0
TrackerDistance               0
LoggedActivitiesDistance      0
VeryActiveDistance            0
ModeratelyActiveDistance      0
LightActiveDistance           0
SedentaryActiveDistance       0
VeryActiveMinutes             0
FairlyActiveMinutes           0
LightlyActiveMinutes          0
SedentaryMinutes              0
Calories                      0
TotalMinutesAsleep          530
TotalSleepRecords           530
TotalTimeInBed              530
BMI                         873
WeightKg                    873
WeightPounds                873
dtype: int64
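
Raw counts are easier to interpret as shares of the 940 rows: 606 missing heart-rate values are about 64% of the rows, 530 missing sleep values about 56%, and 873 missing weight values about 93%. A minimal sketch of computing such shares directly, on a made-up frame:

```python
import numpy as np
import pandas as pd

# Made-up frame: one complete column and one mostly-missing column.
toy = pd.DataFrame({
    "total_steps": [100, 200, 300, 400],
    "weight_kg": [np.nan, 52.6, np.nan, np.nan],
})

# The mean of a boolean mask is the fraction of True values,
# i.e. the share of missing entries in each column.
missing_share = toy.isnull().mean()
print(missing_share)
```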

Here we round all numeric columns in the `final` DataFrame to two decimal places. Then, we standardize the column names by converting camel case to snake case using a regular expression.

In [12]:
# Round all numeric columns to two decimal places
final = final.round(2)

# Standardize column names by converting camel case to snake case
final.columns = [re.sub('([a-z0-9])([A-Z])', r'\1_\2', col).lower() for col in final.columns]

final.head()

Unnamed: 0,id,activity_date,daily_average_heartrate,total_steps,total_distance,tracker_distance,logged_activities_distance,very_active_distance,moderately_active_distance,light_active_distance,sedentary_active_distance,very_active_minutes,fairly_active_minutes,lightly_active_minutes,sedentary_minutes,calories,total_minutes_asleep,total_sleep_records,total_time_in_bed,bmi,weight_kg,weight_pounds
0,1503960366,2016-04-12,,13162,8.5,8.5,0.0,1.88,0.55,6.06,0.0,25,13,328,728,1985,327.0,1.0,346.0,,,
1,1503960366,2016-04-13,,10735,6.97,6.97,0.0,1.57,0.69,4.71,0.0,21,19,217,776,1797,384.0,2.0,407.0,,,
2,1503960366,2016-04-14,,10460,6.74,6.74,0.0,2.44,0.4,3.91,0.0,30,11,181,1218,1776,,,,,,
3,1503960366,2016-04-15,,9762,6.28,6.28,0.0,2.14,1.26,2.83,0.0,29,34,209,726,1745,412.0,1.0,442.0,,,
4,1503960366,2016-04-16,,12669,8.16,8.16,0.0,2.71,0.41,5.04,0.0,36,10,221,773,1863,340.0,2.0,367.0,,,
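
The regular expression inserts an underscore wherever a lowercase letter or digit is directly followed by an uppercase letter, and `.lower()` then lowercases the result. The sketch below wraps the same pattern in a helper named `to_snake` (a hypothetical name used only for illustration) and applies it to a few column names from this notebook.

```python
import re

def to_snake(name):
    """Insert '_' between a lowercase letter/digit and a following uppercase
    letter, then lowercase everything (same pattern as the cell above)."""
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', name).lower()

for col in ["Id", "ActivityDate", "DailyAverageHeartrate", "WeightKg", "BMI"]:
    print(col, "->", to_snake(col))
```

An all-uppercase name such as `BMI` contains no lowercase-to-uppercase boundary, so it is only lowercased. The pattern does not touch spaces, so a name like `Date Weight` would become `date weight` rather than snake case; that column was already dropped before this step.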


Finally, we save the merged data to a CSV file.

In [13]:
final.to_csv('daily_stats_data.csv', index=False) #save a copy of final data
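
As a quick sanity check, a saved file can be read back and compared with what was written. The sketch below does this with a made-up frame and a temporary directory, so it does not touch the real `daily_stats_data.csv`.

```python
import os
import tempfile
import pandas as pd

# Made-up frame standing in for the merged data.
toy = pd.DataFrame({
    "id": [1, 2],
    "activity_date": ["2016-04-12", "2016-04-13"],
    "bmi": [22.65, None],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "daily_stats_data.csv")
    toy.to_csv(path, index=False)   # index=False: do not write the row index
    restored = pd.read_csv(path)

print(restored.columns.tolist())
print(restored.shape)
```

Writing with `index=False` matters here: without it, the row index is stored as an extra unnamed column and shows up as `Unnamed: 0` when the file is loaded again.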