## Smartwatch Data Analysis using Python

There is a lot of competition among brands in the smartwatch industry. Smartwatches are popular with people who like to keep track of their fitness. Analyzing the fitness data they collect is one of the use cases of data science in healthcare. So if you want to learn how to analyze smartwatch fitness data, this article is for you.

Original project by Aman Kharwal.

The dataset for smartwatch data analysis is publicly available on Kaggle. It was originally collected from 30 female users of the Fitbit smartwatch.

Now, let's start by importing the necessary Python libraries:

In [1]:

import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go


In [16]:
import pandas as pd
data = pd.read_csv("FitBit data.csv")

In [17]:
print(data.head())

           Id ActivityDate  TotalSteps  TotalDistance  TrackerDistance  \
0  1503960366    3/25/2016       11004           7.11             7.11   
1  1503960366    3/26/2016       17609          11.55            11.55   
2  1503960366    3/27/2016       12736           8.53             8.53   
3  1503960366    3/28/2016       13231           8.93             8.93   
4  1503960366    3/29/2016       12041           7.85             7.85   

   LoggedActivitiesDistance  VeryActiveDistance  ModeratelyActiveDistance  \
0                       0.0                2.57                      0.46   
1                       0.0                6.92                      0.73   
2                       0.0                4.66                      0.16   
3                       0.0                3.19                      0.79   
4                       0.0                2.16                      1.09   

   LightActiveDistance  SedentaryActiveDistance  VeryActiveMinutes  \
0                 4.07

Before moving forward, let's check whether this dataset has any null values:

In [18]:
print(data.isnull().sum())

Id                          0
ActivityDate                0
TotalSteps                  0
TotalDistance               0
TrackerDistance             0
LoggedActivitiesDistance    0
VeryActiveDistance          0
ModeratelyActiveDistance    0
LightActiveDistance         0
SedentaryActiveDistance     0
VeryActiveMinutes           0
FairlyActiveMinutes         0
LightlyActiveMinutes        0
SedentaryMinutes            0
Calories                    0
dtype: int64


So the dataset does not have any null values. Let's have a look at the information about the columns in the dataset:

In [19]:
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 457 entries, 0 to 456
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Id                        457 non-null    int64  
 1   ActivityDate              457 non-null    object 
 2   TotalSteps                457 non-null    int64  
 3   TotalDistance             457 non-null    float64
 4   TrackerDistance           457 non-null    float64
 5   LoggedActivitiesDistance  457 non-null    float64
 6   VeryActiveDistance        457 non-null    float64
 7   ModeratelyActiveDistance  457 non-null    float64
 8   LightActiveDistance       457 non-null    float64
 9   SedentaryActiveDistance   457 non-null    float64
 10  VeryActiveMinutes         457 non-null    int64  
 11  FairlyActiveMinutes       457 non-null    int64  
 12  LightlyActiveMinutes      457 non-null    int64  
 13  SedentaryMinutes          457 non-null    int64  
 14  Calories  

The column containing the date of each record has the `object` dtype, meaning the dates are stored as plain strings. We may need to use dates in our analysis, so let's convert this column into a datetime column:

In [20]:
# Changing datatype of ActivityDate

data["ActivityDate"] = pd.to_datetime(data["ActivityDate"],
                                      format="%m/%d/%Y")
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 457 entries, 0 to 456
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   Id                        457 non-null    int64         
 1   ActivityDate              457 non-null    datetime64[ns]
 2   TotalSteps                457 non-null    int64         
 3   TotalDistance             457 non-null    float64       
 4   TrackerDistance           457 non-null    float64       
 5   LoggedActivitiesDistance  457 non-null    float64       
 6   VeryActiveDistance        457 non-null    float64       
 7   ModeratelyActiveDistance  457 non-null    float64       
 8   LightActiveDistance       457 non-null    float64       
 9   SedentaryActiveDistance   457 non-null    float64       
 10  VeryActiveMinutes         457 non-null    int64         
 11  FairlyActiveMinutes       457 non-null    int64         
 12  LightlyActiveMinutes  
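
When parsing dates with an explicit format, it can help to check that nothing failed to parse. The following is a minimal sketch on a made-up three-row frame (the `sample` frame and its invalid third value are assumptions for illustration, not part of the dataset). It uses `errors="coerce"` so that unparseable strings become `NaT` instead of raising an error:

```python
import pandas as pd

# Hypothetical sample in the same month/day/year layout as ActivityDate;
# the third value is deliberately invalid
sample = pd.DataFrame({"ActivityDate": ["3/25/2016", "3/26/2016", "13/45/2016"]})

# errors="coerce" converts strings that do not match the format into NaT
parsed = pd.to_datetime(sample["ActivityDate"], format="%m/%d/%Y", errors="coerce")

# Count the rows that could not be parsed
n_bad = int(parsed.isna().sum())
print(parsed)
print("unparseable rows:", n_bad)
```

If the count is zero on the real column, the strict conversion used above is safe.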

If you look at the columns, you will see very active, fairly active, lightly active, and sedentary minutes in the dataset. Let's combine these four columns into total minutes before moving forward:

In [21]:
# Sum the four minute columns into the total tracked minutes per day
data["TotalMinutes"] = (data["VeryActiveMinutes"] + data["FairlyActiveMinutes"]
                        + data["LightlyActiveMinutes"] + data["SedentaryMinutes"])
# sample() picks random rows, so the printed values vary from run to run
print(data["TotalMinutes"].sample(5))
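
A quick sanity check on a total like this is that a day has only 24 × 60 = 1440 minutes. The sketch below uses a small made-up frame (the `demo` values are hypothetical) with the dataset's four minute columns. It sums them row-wise and checks that no total exceeds one day:

```python
import pandas as pd

# Hypothetical rows using the dataset's four minute columns
demo = pd.DataFrame({
    "VeryActiveMinutes": [33, 0, 25],
    "FairlyActiveMinutes": [12, 0, 13],
    "LightlyActiveMinutes": [205, 0, 328],
    "SedentaryMinutes": [804, 1440, 728],
})

minute_cols = ["VeryActiveMinutes", "FairlyActiveMinutes",
               "LightlyActiveMinutes", "SedentaryMinutes"]

# Row-wise sum of the four minute columns
demo["TotalMinutes"] = demo[minute_cols].sum(axis=1)

# A day has 24 * 60 = 1440 minutes, so no total should exceed that
MINUTES_PER_DAY = 24 * 60
within_a_day = demo["TotalMinutes"].le(MINUTES_PER_DAY)

print(demo["TotalMinutes"].tolist())
print("all totals fit in one day:", bool(within_a_day.all()))
```

Totals well below 1440 are not necessarily errors; they may simply mean the watch was not worn for the whole day.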


Now let's have a look at the descriptive statistics of the dataset:

In [22]:
print(data.describe())

                 Id                   ActivityDate    TotalSteps  \
count  4.570000e+02                            457    457.000000   
mean   4.628595e+09  2016-04-04 04:40:26.258205696   6546.562363   
min    1.503960e+09            2016-03-12 00:00:00      0.000000   
25%    2.347168e+09            2016-04-02 00:00:00   1988.000000   
50%    4.057193e+09            2016-04-05 00:00:00   5986.000000   
75%    6.391747e+09            2016-04-08 00:00:00  10198.000000   
max    8.877689e+09            2016-04-12 00:00:00  28497.000000   
std    2.293781e+09                            NaN   5398.493064   

       TotalDistance  TrackerDistance  LoggedActivitiesDistance  \
count     457.000000       457.000000                457.000000   
mean        4.663523         4.609847                  0.179427   
min         0.000000         0.000000                  0.000000   
25%         1.410000         1.280000                  0.000000   
50%         4.090000         4.090000               

### Let's Analyze the Smartwatch Data
The dataset has a "Calories" column, which contains the number of calories burned in a day. Let's have a look at the relationship between calories burned and the total steps walked in a day:

In [23]:
figure = px.scatter(data_frame=data, x="Calories",
                    y="TotalSteps", size="VeryActiveMinutes",
                    trendline="ols",
                    title="Relationship between Calories & Total Steps")
figure.show()
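
To back the visual trend line with numbers, one option is to compute the Pearson correlation and a least-squares line directly. This is only a sketch on made-up values (the `steps` and `calories` arrays are hypothetical, not taken from the dataset). It uses NumPy rather than the `ols` trendline, and it fits calories as a function of steps, which is the reverse of the axes in the plot:

```python
import numpy as np

# Hypothetical (steps, calories) pairs for illustration only
steps = np.array([2000, 5000, 8000, 11000, 14000], dtype=float)
calories = np.array([1600, 1900, 2250, 2500, 2850], dtype=float)

# Pearson correlation coefficient between the two variables
r = np.corrcoef(steps, calories)[0, 1]

# Least-squares straight line: calories ~ slope * steps + intercept
slope, intercept = np.polyfit(steps, calories, deg=1)

print(f"r = {r:.3f}")
print(f"slope = {slope:.4f} calories per step, intercept = {intercept:.1f}")
```

On the real data, the same two calls could be applied to `data["TotalSteps"]` and `data["Calories"]`.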

You can see that there is a roughly linear, positive relationship between the total number of steps and the number of calories burned in a day. Now, let's look at the average number of minutes per day spent at each activity level:

In [26]:
label = ["Very Active Minutes", "Fairly Active Minutes",
         "Lightly Active Minutes", "Inactive Minutes"]
counts = data[["VeryActiveMinutes", "FairlyActiveMinutes",
         "LightlyActiveMinutes", "SedentaryMinutes"]].mean()
colors = ['gold', 'lightgreen', "pink", "blue"]

fig = go.Figure(data=[go.Pie(labels=label, values=counts)])
fig.update_layout(title_text='Total Active Minutes')
fig.update_traces(hoverinfo='label+percent', textinfo='value', textfont_size=30,
                  marker=dict(colors=colors, line=dict(color='black', width=3)))

fig.show()

### Observations
1. Inactive minutes account for 83.3% of the total minutes in a day.
2. Lightly active minutes account for 14.2% of the day.
3. On average, only 21 minutes (1.39%) were very active.
4. Fairly active minutes account for 1.09% of the day (13 minutes).
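
The percentages in observations like these come from dividing each category's average by the sum of all four averages. A minimal sketch, using made-up averages (the `avg_minutes` values are hypothetical and are not the dataset's actual means):

```python
import pandas as pd

# Hypothetical average minutes per category (not the dataset's exact means)
avg_minutes = pd.Series({
    "Very Active Minutes": 20.0,
    "Fairly Active Minutes": 15.0,
    "Lightly Active Minutes": 200.0,
    "Inactive Minutes": 1000.0,
})

# Share of each category as a percentage of the combined average minutes
share = (avg_minutes / avg_minutes.sum() * 100).round(2)

print(share.sort_values(ascending=False))
```

With the real data, `avg_minutes` would be the `counts` series computed for the pie chart.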

We converted the ActivityDate column to the datetime type above. Let's use it to find the weekday of each record and add it to the dataset as a new column called "Day":

In [27]:
data["Day"] = data["ActivityDate"].dt.day_name()
print(data["Day"].head())

0      Friday
1    Saturday
2      Sunday
3      Monday
4     Tuesday
Name: Day, dtype: object
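
Day names are plain strings, so they sort alphabetically rather than in calendar order. One way to keep charts and tables in weekday order is an ordered categorical. The sketch below uses three made-up dates, and the `week_order` list is an assumption about the desired ordering (Monday first):

```python
import pandas as pd

# Three hypothetical dates: a Friday, a Saturday and a Monday
dates = pd.Series(pd.to_datetime(["2016-03-25", "2016-03-26", "2016-03-28"]))
day_names = dates.dt.day_name()

# Assumed calendar ordering, starting the week on Monday
week_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday"]

# An ordered categorical sorts by position in week_order, not alphabetically
ordered_days = pd.Series(pd.Categorical(day_names, categories=week_order,
                                        ordered=True))
sorted_days = ordered_days.sort_values().tolist()

print(day_names.tolist())
print(sorted_days)
```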


Now let's have a look at the very active, fairly active, and lightly active minutes on each day of the week:

In [28]:
fig = go.Figure()
fig.add_trace(go.Bar(
    x=data["Day"],
    y=data["VeryActiveMinutes"],
    name='Very Active',
    marker_color='purple'
))

fig.add_trace(go.Bar(
    x=data["Day"],
    y=data["FairlyActiveMinutes"],
    name='Fairly Active',
    marker_color='green'
))

fig.add_trace(go.Bar(
    x=data["Day"],
    y=data["LightlyActiveMinutes"],
    name='Lightly Active',
    marker_color='pink'
))

fig.update_layout(barmode='group', xaxis_tickangle=-45)
fig.show()

Now let's have a look at the number of inactive minutes on each day of the week:

In [29]:
# Total sedentary minutes per weekday, so labels and values line up one-to-one
inactive = data.groupby("Day")["SedentaryMinutes"].sum()
label = inactive.index
counts = inactive.values
colors = ['gold', 'lightgreen', "pink", "blue", "skyblue", "cyan", "orange"]

fig = go.Figure(data=[go.Pie(labels=label, values=counts)])
fig.update_layout(title_text='Inactive Minutes Daily')
fig.update_traces(hoverinfo='label+percent', textinfo='value', textfont_size=30,
                  marker=dict(colors=colors, line=dict(color='black', width=3)))
fig.show()
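
Because weekdays can have different numbers of records, it may be worth comparing per-day averages alongside per-day totals before naming the most inactive day. A minimal sketch on made-up rows (the `demo` frame is hypothetical):

```python
import pandas as pd

# Hypothetical records with an uneven number of rows per weekday
demo = pd.DataFrame({
    "Day": ["Monday", "Monday", "Monday", "Tuesday", "Tuesday", "Wednesday"],
    "SedentaryMinutes": [700, 800, 750, 1000, 1100, 950],
})

# Total, average and number of records for each weekday
per_day = demo.groupby("Day")["SedentaryMinutes"].agg(["sum", "mean", "count"])

top_by_sum = per_day["sum"].idxmax()
top_by_mean = per_day["mean"].idxmax()

print(per_day)
print("highest total:", top_by_sum)
print("highest average:", top_by_mean)
```

In this toy example the day with the highest total is not the day with the highest average, simply because it has more records.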

In the original analysis, Wednesday comes out as the most inactive day for the individuals in the dataset. Now let's have a look at the number of calories burned on each day of the week:

In [30]:
# Total calories burned per weekday, so labels and values line up one-to-one
calories = data.groupby("Day")["Calories"].sum()
label = calories.index
counts = calories.values
colors = ['gold', 'lightgreen', "pink", "blue", "skyblue", "cyan", "orange"]

fig = go.Figure(data=[go.Pie(labels=label, values=counts)])
fig.update_layout(title_text='Calories Burned Daily')
fig.update_traces(hoverinfo='label+percent', textinfo='value', textfont_size=30,
                  marker=dict(colors=colors, line=dict(color='black', width=3)))
fig.show()

In the original analysis, Friday comes out as one of the most active days for the individuals in the dataset, as the highest number of calories was burned on Fridays.

So this is how you can analyze smartwatch data using the Python programming language. There is a lot more you can do with this dataset. For example, you can use it to predict the number of calories burned in a day.