# Assignment 3: Fitness Tracking Analysis

## 📌 Overview
In this assignment, you will analyze a fitness tracking dataset using **Pandas** and **NumPy** only.
The dataset `fitness_tracking.csv` contains daily activity data such as steps, distance, active minutes, calories, and heart rate.

Your goal is to practice:
- Cleaning and preparing data with Pandas and NumPy
- Handling missing values
- Creating new features
- Detecting outliers with NumPy
- Aggregating and ranking data with Pandas

---

## 📊 Dataset Description
The dataset contains the following columns:

- **Date** → Day of the activity
- **DayOfWeek** → Day name (e.g., Monday, Tuesday)
- **Steps** → Number of steps taken
- **Distance_km** → Distance covered in kilometers
- **Active_Minutes** → Number of active minutes
- **Calories** → Calories burned
- **HeartRate_Avg** → Average heart rate

---

## 📝 Tasks

### Task 1: Load and Inspect Data
- Load the dataset into Pandas.
- Display the first 5 rows.
- Show the shape and data types of the dataset.
- Count missing values in each column.

---

### Task 2: Handle Missing Data
- Use **NumPy** to replace missing numeric values with the column mean.

---

### Task 3: Feature Engineering
- Create a new column `Cals_per_Min` = `Calories / Active_Minutes` (handle divide-by-zero safely using NumPy).
- Create a new column `Intense_Day` = `True` if `Steps >= 10000`, else `False`.

---

### Task 4: Outlier Detection
- Using NumPy, calculate **z-scores** for `Steps` and `Calories`:
  \[z = \frac{x - \text{mean}}{\text{std}} \]

- Create boolean columns `Steps_Outlier` and `Calories_Outlier` where `|z| > 2`.

---

### Task 5: Aggregation and Ranking
- Group the data by `DayOfWeek` and calculate the average `Steps`, `Calories`, and `Cals_per_Min`.
- Find the **top 3 days** (dates) with the highest number of steps.


# Assignment 3 Solution: Fitness Tracking Analysis


In [2]:
import numpy as np
import pandas as pd
df = pd.read_csv('fitness_tracking.csv')

## Task 1

In [3]:
df

Unnamed: 0,Date,DayOfWeek,Steps,Distance_km,Active_Minutes,Calories,HeartRate_Avg
0,2025-07-03,Thursday,,6.657977,50.0,687.783888,74.0
1,2025-07-29,Tuesday,9317.0,6.694944,121.0,1203.063311,92.0
2,2025-07-14,Monday,12014.0,8.806486,123.0,1301.055694,85.0
3,2025-07-11,Friday,9080.0,6.10021,145.0,1149.922935,75.0
4,2025-07-27,Sunday,7530.0,5.598196,84.0,795.213183,80.0
5,2025-07-25,Friday,12269.0,9.22687,51.0,680.38096,79.0
6,2025-07-28,Monday,12784.0,8.26817,123.0,1270.503099,78.0
7,2025-07-12,Saturday,14695.0,10.912283,143.0,1324.595936,90.0
8,2025-07-18,Friday,30000.0,2.702643,142.0,1165.691373,75.0
9,2025-07-23,Wednesday,4781.0,3.025357,16.0,253.402025,73.0


In [4]:
# First 5 rows
df.head()

Unnamed: 0,Date,DayOfWeek,Steps,Distance_km,Active_Minutes,Calories,HeartRate_Avg
0,2025-07-03,Thursday,,6.657977,50.0,687.783888,74.0
1,2025-07-29,Tuesday,9317.0,6.694944,121.0,1203.063311,92.0
2,2025-07-14,Monday,12014.0,8.806486,123.0,1301.055694,85.0
3,2025-07-11,Friday,9080.0,6.10021,145.0,1149.922935,75.0
4,2025-07-27,Sunday,7530.0,5.598196,84.0,795.213183,80.0


In [7]:
# Print the shape and the data type of each column
print(df.shape)
print(df.dtypes)

(30, 7)
Date               object
DayOfWeek          object
Steps             float64
Distance_km       float64
Active_Minutes    float64
Calories          float64
HeartRate_Avg     float64
dtype: object


In [8]:
# Count the missing values in each column
print(df.isnull().sum())

Date              0
DayOfWeek         0
Steps             2
Distance_km       2
Active_Minutes    2
Calories          2
HeartRate_Avg     2
dtype: int64


## Task 2

In [9]:
df

Unnamed: 0,Date,DayOfWeek,Steps,Distance_km,Active_Minutes,Calories,HeartRate_Avg
0,2025-07-03,Thursday,,6.657977,50.0,687.783888,74.0
1,2025-07-29,Tuesday,9317.0,6.694944,121.0,1203.063311,92.0
2,2025-07-14,Monday,12014.0,8.806486,123.0,1301.055694,85.0
3,2025-07-11,Friday,9080.0,6.10021,145.0,1149.922935,75.0
4,2025-07-27,Sunday,7530.0,5.598196,84.0,795.213183,80.0
5,2025-07-25,Friday,12269.0,9.22687,51.0,680.38096,79.0
6,2025-07-28,Monday,12784.0,8.26817,123.0,1270.503099,78.0
7,2025-07-12,Saturday,14695.0,10.912283,143.0,1324.595936,90.0
8,2025-07-18,Friday,30000.0,2.702643,142.0,1165.691373,75.0
9,2025-07-23,Wednesday,4781.0,3.025357,16.0,253.402025,73.0


In [10]:
for col in df.select_dtypes(include=np.number).columns:
    mean_val = np.nanmean(df[col])
    df[col] = np.where(df[col].isnull(), mean_val, df[col])
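
The imputation loop can be checked on a small frame whose column means are easy to compute by hand. The sketch below is only an illustration: the `toy` frame and its values are made up and are not taken from `fitness_tracking.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the real dataset).
toy = pd.DataFrame({
    "DayOfWeek": ["Monday", "Tuesday", "Wednesday"],
    "Steps": [8000.0, np.nan, 12000.0],
    "Calories": [900.0, 1100.0, np.nan],
})

for col in toy.select_dtypes(include=np.number).columns:
    values = toy[col].to_numpy()
    col_mean = np.nanmean(values)  # mean that ignores NaN entries
    toy[col] = np.where(np.isnan(values), col_mean, values)

print(toy)
print(toy.isnull().sum())
```

The missing `Steps` entry should become 10000.0 and the missing `Calories` entry 1000.0, the means of the two observed values in each column.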

In [11]:
df

Unnamed: 0,Date,DayOfWeek,Steps,Distance_km,Active_Minutes,Calories,HeartRate_Avg
0,2025-07-03,Thursday,10070.321429,6.657977,50.0,687.783888,74.0
1,2025-07-29,Tuesday,9317.0,6.694944,121.0,1203.063311,92.0
2,2025-07-14,Monday,12014.0,8.806486,123.0,1301.055694,85.0
3,2025-07-11,Friday,9080.0,6.10021,145.0,1149.922935,75.0
4,2025-07-27,Sunday,7530.0,5.598196,84.0,795.213183,80.0
5,2025-07-25,Friday,12269.0,9.22687,51.0,680.38096,79.0
6,2025-07-28,Monday,12784.0,8.26817,123.0,1270.503099,78.0
7,2025-07-12,Saturday,14695.0,10.912283,143.0,1324.595936,90.0
8,2025-07-18,Friday,30000.0,2.702643,142.0,1165.691373,75.0
9,2025-07-23,Wednesday,4781.0,3.025357,16.0,253.402025,73.0


## Task 3


In [12]:
# Calories per Active Minute (handle divide by zero safely)
df["Cals_per_Min"] = np.where(df["Active_Minutes"] == 0, 0, df["Calories"] / df["Active_Minutes"])
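
Note that `np.where` receives both branches already evaluated, so the division is still computed for rows where `Active_Minutes` is zero and the result is only masked afterwards. One possible alternative, sketched here on made-up arrays rather than the real dataset, is `np.divide` with its `where` and `out` arguments, which skips the division for the masked positions entirely.

```python
import numpy as np

# Hypothetical values; two entries have zero active minutes.
calories = np.array([687.8, 0.0, 1203.1, 500.0])
active_minutes = np.array([50.0, 0.0, 121.0, 0.0])

cals_per_min = np.divide(
    calories,
    active_minutes,
    out=np.zeros_like(calories),  # value kept wherever the division is skipped
    where=active_minutes != 0,    # divide only where the denominator is non-zero
)

print(cals_per_min)
```

Whether 0 is the right fill value for days with no active minutes is a modelling choice; `np.nan` would be an equally defensible placeholder.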

In [15]:
# Flag days with at least 10,000 steps
df["Intense_Day"] = df['Steps'] >= 10000

In [16]:
df

Unnamed: 0,Date,DayOfWeek,Steps,Distance_km,Active_Minutes,Calories,HeartRate_Avg,Cals_per_Min,Intense_Day
0,2025-07-03,Thursday,10070.321429,6.657977,50.0,687.783888,74.0,13.755678,True
1,2025-07-29,Tuesday,9317.0,6.694944,121.0,1203.063311,92.0,9.942672,False
2,2025-07-14,Monday,12014.0,8.806486,123.0,1301.055694,85.0,10.577689,True
3,2025-07-11,Friday,9080.0,6.10021,145.0,1149.922935,75.0,7.930503,False
4,2025-07-27,Sunday,7530.0,5.598196,84.0,795.213183,80.0,9.466824,False
5,2025-07-25,Friday,12269.0,9.22687,51.0,680.38096,79.0,13.340803,True
6,2025-07-28,Monday,12784.0,8.26817,123.0,1270.503099,78.0,10.329293,True
7,2025-07-12,Saturday,14695.0,10.912283,143.0,1324.595936,90.0,9.262909,True
8,2025-07-18,Friday,30000.0,2.702643,142.0,1165.691373,75.0,8.209094,True
9,2025-07-23,Wednesday,4781.0,3.025357,16.0,253.402025,73.0,15.837627,False


## Task 4


In [23]:
# Compute z-scores with NumPy: (x - mean) / std
def z_score(series):
    return (series - np.mean(series)) / np.std(series)
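
One detail worth keeping in mind when comparing z-scores across tools: `np.std` defaults to the population standard deviation (`ddof=0`), while the pandas `Series.std` method defaults to the sample standard deviation (`ddof=1`). The sketch below uses a made-up three-value series to show how the two conventions differ.

```python
import numpy as np
import pandas as pd

# Hypothetical step counts chosen so the arithmetic is easy to follow.
steps = pd.Series([4000.0, 8000.0, 12000.0])

pop_std = np.std(steps.to_numpy())  # NumPy default: ddof=0 (population)
sample_std = steps.std()            # pandas default: ddof=1 (sample)

z_pop = (steps - steps.mean()) / pop_std
z_sample = (steps - steps.mean()) / sample_std

print(pop_std, sample_std)
print(z_pop.to_numpy())
print(z_sample.to_numpy())
```

With only 30 rows the gap between the two conventions is small, but it can change which points cross a fixed threshold such as `|z| > 2`.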

In [24]:
df['Steps_z'] = z_score(df['Steps'])
df['Calories_z'] = z_score(df['Calories'])

In [25]:
df

Unnamed: 0,Date,DayOfWeek,Steps,Distance_km,Active_Minutes,Calories,HeartRate_Avg,Cals_per_Min,Intense_Day,Steps_z,Calories_z
0,2025-07-03,Thursday,10070.321429,6.657977,50.0,687.783888,74.0,13.755678,True,3.789721e-16,-0.616258
1,2025-07-29,Tuesday,9317.0,6.694944,121.0,1203.063311,92.0,9.942672,False,-0.1569486,0.13385
2,2025-07-14,Monday,12014.0,8.806486,123.0,1301.055694,85.0,10.577689,True,0.4049502,0.2765
3,2025-07-11,Friday,9080.0,6.10021,145.0,1149.922935,75.0,7.930503,False,-0.2063257,0.056492
4,2025-07-27,Sunday,7530.0,5.598196,84.0,795.213183,80.0,9.466824,False,-0.529256,-0.45987
5,2025-07-25,Friday,12269.0,9.22687,51.0,680.38096,79.0,13.340803,True,0.4580774,-0.627035
6,2025-07-28,Monday,12784.0,8.26817,123.0,1270.503099,78.0,10.329293,True,0.5653736,0.232024
7,2025-07-12,Saturday,14695.0,10.912283,143.0,1324.595936,90.0,9.262909,True,0.9635154,0.310768
8,2025-07-18,Friday,30000.0,2.702643,142.0,1165.691373,75.0,8.209094,True,4.152192,0.079446
9,2025-07-23,Wednesday,4781.0,3.025357,16.0,253.402025,73.0,15.837627,False,-1.101989,-1.248601


In [26]:
# Flag rows whose absolute z-score exceeds 2
df["Steps_Outlier"] = np.abs(df["Steps_z"])>2
df["Calories_Outlier"] = np.abs(df["Calories_z"])>2

In [27]:
df

Unnamed: 0,Date,DayOfWeek,Steps,Distance_km,Active_Minutes,Calories,HeartRate_Avg,Cals_per_Min,Intense_Day,Steps_z,Calories_z,Steps_Outlier,Calories_Outlier
0,2025-07-03,Thursday,10070.321429,6.657977,50.0,687.783888,74.0,13.755678,True,3.789721e-16,-0.616258,False,False
1,2025-07-29,Tuesday,9317.0,6.694944,121.0,1203.063311,92.0,9.942672,False,-0.1569486,0.13385,False,False
2,2025-07-14,Monday,12014.0,8.806486,123.0,1301.055694,85.0,10.577689,True,0.4049502,0.2765,False,False
3,2025-07-11,Friday,9080.0,6.10021,145.0,1149.922935,75.0,7.930503,False,-0.2063257,0.056492,False,False
4,2025-07-27,Sunday,7530.0,5.598196,84.0,795.213183,80.0,9.466824,False,-0.529256,-0.45987,False,False
5,2025-07-25,Friday,12269.0,9.22687,51.0,680.38096,79.0,13.340803,True,0.4580774,-0.627035,False,False
6,2025-07-28,Monday,12784.0,8.26817,123.0,1270.503099,78.0,10.329293,True,0.5653736,0.232024,False,False
7,2025-07-12,Saturday,14695.0,10.912283,143.0,1324.595936,90.0,9.262909,True,0.9635154,0.310768,False,False
8,2025-07-18,Friday,30000.0,2.702643,142.0,1165.691373,75.0,8.209094,True,4.152192,0.079446,True,False
9,2025-07-23,Wednesday,4781.0,3.025357,16.0,253.402025,73.0,15.837627,False,-1.101989,-1.248601,False,False


## Task 5


In [33]:
# Group by DayOfWeek and average the selected columns
agg_results = df.groupby("DayOfWeek").agg({
    "Steps": "mean",
    "Calories": "mean",
    "Cals_per_Min": "mean"
}).reset_index()

In [34]:
print("\nAverage Stats by Day of Week:")
print(agg_results)


Average Stats by Day of Week:
   DayOfWeek         Steps     Calories  Cals_per_Min
0     Friday  14833.500000  1052.305720      9.777298
1     Monday   9281.500000  1176.065335      9.719679
2   Saturday  10806.330357  1055.568360     12.294104
3     Sunday  10146.750000   955.056521      9.448655
4   Thursday   9186.580357  1738.708080     65.659371
5    Tuesday   8550.200000   955.683327      9.997673
6  Wednesday   8468.000000   928.852407     11.779987


In [35]:
# Top 3 dates by step count
df.nlargest(3, "Steps")[["Date", "Steps"]]


Unnamed: 0,Date,Steps
8,2025-07-18,30000.0
7,2025-07-12,14695.0
14,2025-07-24,14084.0
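
As a cross-check on the ranking step, `nlargest` can be compared with an explicit descending sort followed by `head`. The sketch below uses a made-up four-row frame, not the real dataset.

```python
import pandas as pd

# Hypothetical dates and step counts for illustration only.
toy = pd.DataFrame({
    "Date": ["2025-07-01", "2025-07-02", "2025-07-03", "2025-07-04"],
    "Steps": [9000.0, 15000.0, 7000.0, 12000.0],
})

top_nlargest = toy.nlargest(3, "Steps")[["Date", "Steps"]]
top_sorted = toy.sort_values("Steps", ascending=False).head(3)[["Date", "Steps"]]

print(top_nlargest)
print(top_sorted)
```

Both approaches should list the same three dates in the same order here. Because the real `Steps` column was mean-imputed in Task 2, an imputed row could in principle appear in such a ranking, so it may be worth checking the top rows against the original missing-value mask.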
