# **Data Exploration for Assignment 2**
**Objective**: Analyze the daily and minute step count data for three individuals to identify patterns and insights related to their physical activity.

## Initial Assumptions and Predictions 
1. The data for each individual will have variations in step counts. 
2. Weekends might have different step count patterns compared to weekdays. 
3. There will be missing data points or days with no recorded steps.



---
## Data Loading and Initial Exploration
We first import necessary libraries and load the data for each individual. 

In [1]:
import pandas as pd

# Loading data for each individual (assuming CSV files as an example)
daily_steps = pd.read_csv('dailySteps_merged.csv')
hourly_steps = pd.read_csv('hourlySteps_merged.csv')
minute_steps = pd.read_csv('minuteStepsWide_merged.csv')


---
## Daily Step Count Analysis

### Dataset introduction
By looking at the first records, we can see the structure of the dataset and the types of data it contains.

In [2]:
daily_steps.head(100)

Unnamed: 0,Id,ActivityDay,StepTotal
0,1503960366,4/12/2016,13162
1,1503960366,4/13/2016,10735
2,1503960366,4/14/2016,10460
3,1503960366,4/15/2016,9762
4,1503960366,4/16/2016,12669
...,...,...,...
95,1844505072,4/15/2016,3844
96,1844505072,4/16/2016,3414
97,1844505072,4/17/2016,4525
98,1844505072,4/18/2016,4597


### Data Filtering
From the dataset, we extract data for three unique IDs only (the 18th to 20th in order of appearance, selected with the slice `[17:20]`) to simplify our analysis.

In [3]:
# Filter the data to include only three unique IDs
unique_ids = daily_steps["Id"].unique()
three_ids = unique_ids[17:20]

filtered_daily = daily_steps[daily_steps["Id"].isin(three_ids)]
filtered_daily

Unnamed: 0,Id,ActivityDay,StepTotal
474,4558609924,4/12/2016,5135
475,4558609924,4/13/2016,4978
476,4558609924,4/14/2016,6799
477,4558609924,4/15/2016,7795
478,4558609924,4/16/2016,7289
...,...,...,...
562,5553957443,5/8/2016,6083
563,5553957443,5/9/2016,11611
564,5553957443,5/10/2016,16358
565,5553957443,5/11/2016,4926
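
The slice used above is positional, so which individuals are selected depends on the order in which IDs first appear in the file. A minimal sketch of the same filtering pattern on made-up data (the `toy_daily_ids` frame and its IDs are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data: five made-up IDs, two rows each
toy_daily_ids = pd.DataFrame({
    "Id": [101, 101, 102, 102, 103, 103, 104, 104, 105, 105],
    "StepTotal": [5000, 6000, 0, 7000, 8000, 9000, 1000, 2000, 3000, 4000],
})

# unique() keeps IDs in order of first appearance
toy_ids = toy_daily_ids["Id"].unique()

# Positional slice: elements at positions 1, 2 and 3 (the end index is excluded)
picked_ids = toy_ids[1:4]

# Keep only the rows that belong to the picked IDs
toy_filtered = toy_daily_ids[toy_daily_ids["Id"].isin(picked_ids)]

print(list(picked_ids))
print(toy_filtered["Id"].nunique())
```

If the file were sorted differently, the same slice would pick different people, so printing the selected IDs is a cheap sanity check.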


### Data Cleaning
Days with 0 steps are treated as outliers, most likely days on which the tracker wasn't worn. Such records are therefore removed from the dataset.

In [4]:
filtered_daily_clean = filtered_daily[filtered_daily['StepTotal'] != 0]
filtered_daily_clean

Unnamed: 0,Id,ActivityDay,StepTotal
474,4558609924,4/12/2016,5135
475,4558609924,4/13/2016,4978
476,4558609924,4/14/2016,6799
477,4558609924,4/15/2016,7795
478,4558609924,4/16/2016,7289
...,...,...,...
562,5553957443,5/8/2016,6083
563,5553957443,5/9/2016,11611
564,5553957443,5/10/2016,16358
565,5553957443,5/11/2016,4926
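
Before dropping rows, it can help to see how many zero-step days each individual has, since the truncated table above does not show which rows were removed. A minimal sketch on made-up data (the `toy_daily` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: two made-up IDs, three days each
toy_daily = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2, 2],
    "ActivityDay": ["4/12/2016", "4/13/2016", "4/14/2016"] * 2,
    "StepTotal": [5135, 0, 6799, 4978, 7795, 7289],
})

# Boolean mask marking the zero-step days
is_zero_day = toy_daily["StepTotal"] == 0

# Number of zero-step days per Id (True counts as 1 when summed)
zero_days_per_id = is_zero_day.groupby(toy_daily["Id"]).sum()

# Drop the zero-step days, as in the cell above
toy_daily_clean = toy_daily[~is_zero_day]

print(zero_days_per_id)
print(len(toy_daily), "->", len(toy_daily_clean))
```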


### Data Engineering
The ActivityDay column is converted to a datetime format, and a new column, DayName, is created to store the name of the day (like Monday, Tuesday, etc.). 

In [5]:
# Create a copy of filtered_daily_clean to avoid changes to original dataset
filtered_daily_clean = filtered_daily_clean.copy()

# Convert "ActivityDay" to datetime and extract the day name
filtered_daily_clean.loc[:, 'DayName'] = pd.to_datetime(filtered_daily_clean['ActivityDay']).dt.day_name()
filtered_daily_clean


Unnamed: 0,Id,ActivityDay,StepTotal,DayName
474,4558609924,4/12/2016,5135,Tuesday
475,4558609924,4/13/2016,4978,Wednesday
476,4558609924,4/14/2016,6799,Thursday
477,4558609924,4/15/2016,7795,Friday
478,4558609924,4/16/2016,7289,Saturday
...,...,...,...,...
562,5553957443,5/8/2016,6083,Sunday
563,5553957443,5/9/2016,11611,Monday
564,5553957443,5/10/2016,16358,Tuesday
565,5553957443,5/11/2016,4926,Wednesday
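
When no format is given, `pd.to_datetime` infers one from the data. A minimal sketch, assuming the dates are always written month/day/year as in the rows above, that passes the format explicitly and also derives a weekend flag (the `toy_days` frame and the `IsWeekend` column are illustrative additions, not part of the original analysis):

```python
import pandas as pd

# Three dates copied from the output above
toy_days = pd.DataFrame({"ActivityDay": ["4/12/2016", "4/16/2016", "5/8/2016"]})

# An explicit month/day/year format removes any ambiguity (12 April vs. 4 December)
parsed_days = pd.to_datetime(toy_days["ActivityDay"], format="%m/%d/%Y")

# Day name, plus a boolean weekend flag (dayofweek: Monday=0 ... Sunday=6)
toy_days["DayName"] = parsed_days.dt.day_name()
toy_days["IsWeekend"] = parsed_days.dt.dayofweek >= 5

print(toy_days)
```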


In [6]:
# Count the number of recorded days for each Id (before zero-step days are removed)
days_count = filtered_daily.groupby('Id')['ActivityDay'].count()

print(days_count)

Id
4558609924    31
4702921684    31
5553957443    31
Name: ActivityDay, dtype: int64


### Summary Statistics 
Summary statistics give a sense of the general activity levels of the selected individuals through their average, maximum, and minimum daily step counts.

In [7]:
# Average step count per day
average_steps = round(filtered_daily_clean['StepTotal'].mean())
print("Average step count per day:", average_steps)

# Maximum step count
max_steps = filtered_daily_clean['StepTotal'].max()
print("Maximum step count:", max_steps)

# Minimum step count
min_steps = filtered_daily_clean['StepTotal'].min()
print("Minimum step count:", min_steps)

Average step count per day: 8380
Maximum step count: 17022
Minimum step count: 655
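
The figures above pool all three individuals together. A minimal sketch of how the same statistics could be broken down per individual with `groupby(...).agg(...)`, using made-up numbers (the `toy_steps` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: two made-up IDs, three days each
toy_steps = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2, 2],
    "StepTotal": [5000, 7000, 9000, 1000, 2000, 6000],
})

# One row per Id with the mean, maximum and minimum daily step count
per_id_summary = toy_steps.groupby("Id")["StepTotal"].agg(["mean", "max", "min"])

print(per_id_summary)
```

A per-individual table makes it easier to see whether one person is driving the pooled maximum or minimum.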


### Weekend Analysis
Weekend physical activity can be different from weekday patterns. For many, weekends can either be a time of relaxation and reduced activity or a chance to engage in recreational physical activities, sports, or outdoor events. By comparing the average step counts on weekends, we can get a snapshot of these behaviors.

In [8]:
# Another observation: check which person is more active during the weekends

# Filter data for only Saturday and Sunday
weekend_data = filtered_daily_clean[filtered_daily_clean['DayName'].isin(['Saturday', 'Sunday'])]


# Group by 'Id' and take the mean of the combined Saturday and Sunday steps
# (selecting 'StepTotal' before mean() avoids averaging the non-numeric columns)
average_steps_weekend_combined = weekend_data.groupby('Id')['StepTotal'].mean().round().astype(int)
average_steps_weekend_combined

Id
4558609924     7613
4702921684    13054
5553957443     3333
Name: StepTotal, dtype: int64
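
The weekend averages are easier to interpret next to the weekday averages. A minimal sketch of such a comparison on made-up data (the `toy_week` frame and the `DayType` column are hypothetical, introduced only for illustration):

```python
import pandas as pd

# Hypothetical toy data: two made-up IDs, four days each
toy_week = pd.DataFrame({
    "Id": [1, 1, 1, 1, 2, 2, 2, 2],
    "DayName": ["Friday", "Saturday", "Sunday", "Monday"] * 2,
    "StepTotal": [4000, 8000, 10000, 6000, 9000, 3000, 5000, 7000],
})

# Label each row as Weekend or Weekday
is_weekend = toy_week["DayName"].isin(["Saturday", "Sunday"])
toy_week["DayType"] = is_weekend.map({True: "Weekend", False: "Weekday"})

# Mean steps per Id and day type, pivoted so that each row is one Id
day_type_means = toy_week.groupby(["Id", "DayType"])["StepTotal"].mean().unstack()

print(day_type_means)
```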

---
## Minute Step Count Analysis

### Dataset introduction
By looking at the first records, we can see the structure of the dataset and the types of data it contains.

In [9]:
minute_steps.head()

Unnamed: 0,Id,ActivityHour,Steps00,Steps01,Steps02,Steps03,Steps04,Steps05,Steps06,Steps07,...,Steps50,Steps51,Steps52,Steps53,Steps54,Steps55,Steps56,Steps57,Steps58,Steps59
0,1503960366,4/13/2016 12:00:00 AM,4,16,0,0,0,9,0,17,...,0,9,8,0,20,1,0,0,0,0
1,1503960366,4/13/2016 1:00:00 AM,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1503960366,4/13/2016 2:00:00 AM,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1503960366,4/13/2016 3:00:00 AM,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1503960366,4/13/2016 4:00:00 AM,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Data Filtering
From the dataset, we extract data for three unique IDs only (the 18th to 20th in order of appearance, selected with the slice `[17:20]`) to simplify our analysis.

In [10]:
# Filter the data to include only three unique IDs
unique_ids = minute_steps["Id"].unique()
three_ids = unique_ids[17:20]

filtered_minute = minute_steps[minute_steps["Id"].isin(three_ids)]
filtered_minute

Unnamed: 0,Id,ActivityHour,Steps00,Steps01,Steps02,Steps03,Steps04,Steps05,Steps06,Steps07,...,Steps50,Steps51,Steps52,Steps53,Steps54,Steps55,Steps56,Steps57,Steps58,Steps59
10992,4558609924,4/13/2016 12:00:00 AM,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10993,4558609924,4/13/2016 1:00:00 AM,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10994,4558609924,4/13/2016 2:00:00 AM,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10995,4558609924,4/13/2016 3:00:00 AM,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10996,4558609924,4/13/2016 4:00:00 AM,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13136,5553957443,5/12/2016 5:00:00 AM,0,0,0,0,0,0,0,0,...,0,11,0,0,0,0,0,0,0,0
13137,5553957443,5/12/2016 6:00:00 AM,0,0,0,0,0,0,0,0,...,0,0,30,9,46,24,0,0,9,8
13138,5553957443,5/12/2016 7:00:00 AM,0,0,9,11,93,110,109,108,...,33,54,44,0,0,0,0,0,0,0
13139,5553957443,5/12/2016 8:00:00 AM,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Analyzing non-zero minute activity
By checking for non-zero values, we aim to find out which specific minutes consistently see activity. This can help in understanding patterns like consistent breaks or routines followed by individuals.

In [11]:
# Using a for loop to check for non-zero values for every minute, but first ensuring the column exists
non_zero_minute = []

for i in range(60):  # 60 minute steps from Steps00 to Steps59
    column_name = f"Steps{i:02d}"
    if column_name in filtered_minute.columns and (filtered_minute[column_name] != 0).all():
        non_zero_minute.append(column_name)

non_zero_minute
        
# Count the minute columns that are non-zero in every row
len(non_zero_minute)

0

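**From the result, no minute column is non-zero in every recorded hour: each of the 60 minute slots has at least one hour with zero steps. This suggests that none of the individuals is active at the same fixed minute of every hour.**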
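
The same check can be written without a loop, and the share of non-zero values per minute column gives a more graded view than an all-or-nothing test. A minimal sketch on made-up data with three minute columns instead of 60 (the `toy_minutes` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: four hourly rows, three minute columns
toy_minutes = pd.DataFrame({
    "Id": [1, 1, 2, 2],
    "Steps00": [0, 5, 0, 0],
    "Steps01": [3, 7, 1, 2],
    "Steps02": [0, 0, 0, 0],
})

# Select the minute columns by their name prefix
minute_cols = [c for c in toy_minutes.columns if c.startswith("Steps")]

# True where a step count is non-zero
is_active = toy_minutes[minute_cols] != 0

# Columns that are non-zero in every row (the vectorized form of the loop above)
always_active = is_active.all()
always_active_cols = always_active[always_active].index.tolist()

# Fraction of rows with activity in each minute column
active_share = is_active.mean()

print(always_active_cols)
print(active_share)
```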

### Data Cleaning
Missing values can influence the outcome of our analysis, leading to inaccurate results. This step ensures that subsequent analyses are based on complete and reliable data.

In [12]:
# Check for missing data in the DataFrame
missing_data = filtered_minute.isnull().sum()
missing_data_summary = missing_data[missing_data > 0]
missing_data_summary

# Count the columns that contain missing values
len(missing_data_summary)

0

### Summary Statistics (Average)
The filtered_minute dataset provides an in-depth view of activity levels, breaking down step counts for each minute within an hour. By calculating the average step count per minute, we aim to analyse how active individuals are within these small time frames.

In [13]:
# Calculate average step per minute
avg_steps_per_minute = filtered_minute.filter(like="Steps").mean().mean()
avg_steps_per_minute

5.868985574685898
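
Taking the mean of the column means equals the overall mean only when every column has the same number of non-missing values, which holds here because the check above found no missing data. A minimal sketch of how the two can differ when values are missing, using made-up numbers (the `toy_gaps` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in the second column
toy_gaps = pd.DataFrame({
    "Steps00": [0.0, 10.0, 20.0, 30.0],
    "Steps01": [4.0, np.nan, np.nan, np.nan],
})

# Mean of the two column means: (15 + 4) / 2
mean_of_column_means = toy_gaps.mean().mean()

# Mean over all non-missing cells: (0 + 10 + 20 + 30 + 4) / 5
overall_mean = np.nanmean(toy_gaps.to_numpy())

print(mean_of_column_means, overall_mean)
```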

### Summary Statistics (Max/Min)
In datasets that capture activity levels, understanding extremes can be particularly informative. These values can show peak activity or complete inactivity, which is useful for performance evaluation.

In [14]:
# Finding the maximum and minimum step values across the "StepsXX" columns
max_step_value = filtered_minute.filter(like="Steps").max().max()
min_step_value = filtered_minute.filter(like="Steps").min().min()

max_step_value, min_step_value

(207, 0)

### Summary Statistics (Activity Levels at 9AM)
Activity levels at 9 AM can be an indication of an individual's morning routine. Some people might be more active due to exercise, commuting to work, or early work tasks, while others might be less active, possibly due to a later start to the day or a more sedentary morning routine.

In [15]:
# Another observation: check which person is more active at 9 AM

# Filter rows for 9 AM and calculate the average for Steps00 to Steps59 for each unique ID
df_9am_filtered = filtered_minute[filtered_minute["ActivityHour"].str.contains("9:00:00 AM")]
# numeric_only=True skips the text column ActivityHour when averaging
avg_steps_9am_per_id = df_9am_filtered.groupby("Id").mean(numeric_only=True).filter(like="Steps").mean(axis=1).round().astype(int)
avg_steps_9am_per_id

Id
4558609924     6
4702921684    10
5553957443     2
dtype: int64
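
An alternative to matching the timestamp as text is to parse it and filter on the hour. A minimal sketch, assuming the timestamps always follow the 12-hour `month/day/year hour:minute:second AM/PM` pattern seen above, with made-up data and two minute columns instead of 60 (the `toy_hours` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: two made-up IDs, two minute columns
toy_hours = pd.DataFrame({
    "Id": [1, 1, 2, 2],
    "ActivityHour": [
        "4/13/2016 9:00:00 AM", "4/13/2016 9:00:00 PM",
        "4/13/2016 9:00:00 AM", "4/13/2016 10:00:00 AM",
    ],
    "Steps00": [6, 50, 2, 40],
    "Steps01": [10, 60, 4, 80],
})

# Parse the 12-hour timestamps and extract the hour of day (0 to 23)
hour_of_day = pd.to_datetime(
    toy_hours["ActivityHour"], format="%m/%d/%Y %I:%M:%S %p"
).dt.hour

# Keep only the 9 AM rows (9 PM becomes hour 21 and is excluded)
toy_9am = toy_hours[hour_of_day == 9]

# Average steps per minute at 9 AM for each Id
toy_avg_9am = toy_9am.groupby("Id")[["Steps00", "Steps01"]].mean().mean(axis=1)

print(toy_avg_9am)
```

Filtering on the parsed hour also makes it straightforward to repeat the comparison for other hours of the day.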

### Summary Statistics (Max hourly steps across three unique ids)
One hour of moderate walking is typically equivalent to 3500 steps for healthy adults. By computing the maximum hourly steps for each individual, we can check whether each of the three individuals reached at least one hour of moderate walking during the roughly one-month collection period.

In [16]:
# A second observation: check whether each individual achieved moderate hourly walking at least once across the month.

# Extract columns corresponding to the minutes Steps00 to Steps59
step_columns = [f"Steps{i:02d}" for i in range(60)]

# Keep only the step columns that actually exist in the dataframe
step_columns_present = [col for col in step_columns if col in filtered_minute.columns]

# Group by the 'Id' and 'ActivityHour' columns and calculate the sum specifically for these columns
total_steps_per_hour_per_id = filtered_minute.groupby(['Id', 'ActivityHour'])[step_columns_present].sum().sum(axis=1).reset_index(name="TotalStepsPerHour")

# Group by the 'Id' column and compute the max across all hours for each individual
max_steps_per_id = total_steps_per_hour_per_id.groupby('Id')['TotalStepsPerHour'].max()

# Rename the series
max_steps_per_id.name = "Max hourly steps across three unique ids"

max_steps_per_id

Id
4558609924    4688
4702921684    3962
5553957443    5808
Name: Max hourly steps across three unique ids, dtype: int64
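
The comparison with the 3500-step threshold can also be made explicit in code. A minimal sketch on made-up data, where each toy row has a constant per-minute rate so the hourly totals are easy to verify (the `toy_wide` frame, `toy_rates` and the `MODERATE_HOUR_STEPS` constant are hypothetical names):

```python
import pandas as pd

# Threshold quoted in the text: steps in one hour of moderate walking
MODERATE_HOUR_STEPS = 3500

# Hypothetical toy data: four hourly rows, 60 minute columns,
# each row filled with a constant made-up per-minute rate
toy_step_cols = [f"Steps{i:02d}" for i in range(60)]
toy_rates = [70, 10, 30, 50]
toy_wide = pd.DataFrame({col: toy_rates for col in toy_step_cols})
toy_wide.insert(0, "Id", [1, 1, 2, 2])

# Total steps in each hour: sum across the 60 minute columns of a row
hourly_total = toy_wide[toy_step_cols].sum(axis=1)

# Best hour per individual, then compare with the threshold
max_hourly = hourly_total.groupby(toy_wide["Id"]).max()
reached_moderate = max_hourly >= MODERATE_HOUR_STEPS

print(max_hourly)
print(reached_moderate)
```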

---
## Conclusion
Through our exploratory data analysis, we aimed to understand the underlying patterns and characteristics of our dataset, and the following key insights were gleaned:

Data Quality: The dataset was largely complete, and the missing-value check on the minute-level data found none. One outlier was identified in the daily steps, while none was found in the minute steps.

Variability: The analysis highlighted variability among the three individuals, pointing to the importance of segmenting our data for any further predictive modeling.

Driving Problem: Our driving problem concerns the amount of moderate activity achieved by each of the three individuals. According to https://www.regainedwellness.com/how-many-steps-in-1-hour-walk/, one hour of moderate walking is typically equivalent to 3500 steps for healthy adults. We found that all three individuals had at least one hour of moderate walking activity during the month.