# **Recovery Status Data - Exploratory Data Analysis**

*Recovery status is measured using several tests and metrics to inform recovery strategies throughout the season.
This dataset contains simulated data for one player.*

This notebook is organized in the following sections:

* [Part 0 - Preliminary Steps](#0)
    * [Part 0.1 - Importing the Necessary Libraries](#0.1)
    * [Part 0.2 - Reading the Recovery Status Data Dataset](#0.2)

* [Part 1 - Data Cleaning](#1)
    * [Part 1.1 - Preliminary Analysis of the Dataset](#1.1)
    * [Part 1.2 - Dealing with Duplicates](#1.2)
    * [Part 1.3 - Ensuring Correct Data Types](#1.3)
    * [Part 1.4 - Dealing with Null/Missing Values](#1.4)
    * [Part 1.5 - Final Checks](#1.5)

* [Part 2 - Exploratory Data Analysis](#2)

<a id='0'></a>
## Part 0 - Preliminary Steps

<a id='0.1'></a>
### Part 0.1 - Importing the Necessary Libraries

In [1]:
import pandas as pd

<a id='0.2'></a>
### Part 0.2 - Reading the Recovery Status Data Dataset

In [2]:
recovery_status_data = pd.read_csv('../data/CFC Recovery status Data (1).csv')

<a id='1'></a>
## Part 1 - Data Cleaning

<a id='1.1'></a>
### Part 1.1 - Preliminary Analysis of the Dataset

Each row pairs a category with one of its metrics. The exception is `emboss_baseline_score`, a pre-calculated aggregate in the `total` category that represents the overall recovery score. Each day therefore has 13 rows: 6 categories × 2 metrics (`completeness` and `composite`), plus the overall recovery score.
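The row count per day can be checked against this structure. The sketch below is illustrative only: it builds a small toy frame (`toy`, made-up rows for two days) with the same column names and metric naming pattern as the dataset, then confirms that every day has 6 × 2 + 1 = 13 distinct metrics. The same `groupby` check should carry over to `recovery_status_data`, assuming its metric names follow the pattern seen in the preview.

```python
import pandas as pd

# Toy frame mimicking the layout described above (not the real data).
categories = ["bio", "msk_joint_range", "msk_load_tolerance",
              "sleep", "soreness", "subjective"]
metric_types = ["completeness", "composite"]

rows = []
for day in ["02/07/2023", "03/07/2023"]:
    for cat in categories:
        for mtype in metric_types:
            rows.append({"sessionDate": day,
                         "metric": f"{cat}_baseline_{mtype}",
                         "category": cat})
    # One extra row per day for the overall recovery score.
    rows.append({"sessionDate": day,
                 "metric": "emboss_baseline_score",
                 "category": "total"})
toy = pd.DataFrame(rows)

# 6 categories * 2 metrics + 1 overall score
expected_rows_per_day = len(categories) * len(metric_types) + 1

# Number of distinct metrics recorded on each day.
rows_per_day = toy.groupby("sessionDate")["metric"].nunique()
structure_ok = bool((rows_per_day == expected_rows_per_day).all())
print(expected_rows_per_day, structure_ok)
```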

In [None]:
# Note to self: during feature engineering, consider deriving a new column from `metric` that holds only the metric type (completeness or composite).
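
One possible way to act on this note, assuming the intent is a new column derived from the suffix of `metric`, is sketched below. The column name `metric_type` is hypothetical, and the toy frame stands in for `recovery_status_data`. The pattern also captures `score` so that the overall recovery row does not end up as a missing value.

```python
import pandas as pd

# Toy stand-in for recovery_status_data (metric names copied from the preview).
toy = pd.DataFrame({
    "metric": [
        "bio_baseline_completeness",
        "bio_baseline_composite",
        "emboss_baseline_score",
    ]
})

# Extract the trailing token of the metric name into a new column.
# `metric_type` is an illustrative name, not part of the dataset.
toy["metric_type"] = toy["metric"].str.extract(
    r"_(completeness|composite|score)$", expand=False
)
print(toy)
```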

In [6]:
recovery_status_data.head(24)

Unnamed: 0,sessionDate,seasonName,metric,category,value
0,02/07/2023,2023/2024,bio_baseline_completeness,bio,0.0
1,02/07/2023,2023/2024,bio_baseline_composite,bio,
2,02/07/2023,2023/2024,emboss_baseline_score,total,
3,02/07/2023,2023/2024,msk_joint_range_baseline_completeness,msk_joint_range,0.0
4,02/07/2023,2023/2024,msk_joint_range_baseline_composite,msk_joint_range,
5,02/07/2023,2023/2024,msk_load_tolerance_baseline_completeness,msk_load_tolerance,0.0
6,02/07/2023,2023/2024,msk_load_tolerance_baseline_composite,msk_load_tolerance,
7,02/07/2023,2023/2024,sleep_baseline_completeness,sleep,0.0
8,02/07/2023,2023/2024,sleep_baseline_composite,sleep,
9,02/07/2023,2023/2024,soreness_baseline_completeness,soreness,0.0


In [14]:
recovery_status_data['sessionDate'].value_counts().min()

13

In [10]:
recovery_status_data['sessionDate'].value_counts().max()

13

In [15]:
recovery_status_data.tail(24)

Unnamed: 0,sessionDate,seasonName,metric,category,value
8049,12/03/2025,2024/2025,emboss_baseline_score,total,-0.012167
8050,12/03/2025,2024/2025,msk_joint_range_baseline_completeness,msk_joint_range,0.0
8051,12/03/2025,2024/2025,msk_joint_range_baseline_composite,msk_joint_range,
8052,12/03/2025,2024/2025,msk_load_tolerance_baseline_completeness,msk_load_tolerance,0.0
8053,12/03/2025,2024/2025,msk_load_tolerance_baseline_composite,msk_load_tolerance,
8054,12/03/2025,2024/2025,sleep_baseline_completeness,sleep,0.806452
8055,12/03/2025,2024/2025,sleep_baseline_composite,sleep,-0.0208
8056,12/03/2025,2024/2025,soreness_baseline_completeness,soreness,0.048387
8057,12/03/2025,2024/2025,soreness_baseline_composite,soreness,-0.1
8058,12/03/2025,2024/2025,subjective_baseline_completeness,subjective,0.806452


In [24]:
recovery_status_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8073 entries, 0 to 8072
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   sessionDate  8073 non-null   datetime64[ns]
 1   seasonName   8073 non-null   object        
 2   metric       8073 non-null   object        
 3   category     8073 non-null   object        
 4   value        5261 non-null   float64       
dtypes: datetime64[ns](1), float64(1), object(3)
memory usage: 315.5+ KB


<a id='1.2'></a>
### Part 1.2 - Dealing with Duplicates

We checked for duplicate rows and found none.

In [16]:
recovery_status_data.duplicated().any()

False

In [17]:
# Another check for duplicates - just in case
recovery_status_data.duplicated().sum()

0
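
The checks above only flag rows that are identical in every column. A complementary check, sketched here on toy data, looks for repeated (`sessionDate`, `metric`) pairs, which would indicate two different values recorded for the same metric on the same day. Treating that pair as the row key is an assumption based on the structure described in Part 1.1.

```python
import pandas as pd

# Toy data: the first two rows share a date and metric but differ in value.
toy = pd.DataFrame({
    "sessionDate": ["02/07/2023", "02/07/2023", "02/07/2023"],
    "metric": ["sleep_baseline_composite",
               "sleep_baseline_composite",
               "sleep_baseline_completeness"],
    "value": [0.10, 0.25, 0.80],
})

# Full-row duplicates: none, because the values differ.
full_row_dupes = int(toy.duplicated().sum())

# Key-level duplicates: the repeated (sessionDate, metric) pair is flagged.
key_dupes = int(toy.duplicated(subset=["sessionDate", "metric"]).sum())
print(full_row_dupes, key_dupes)
```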

<a id='1.3'></a>
### Part 1.3 - Ensuring Correct Data Types

Next, we checked whether the data types of all columns were correct.

In [21]:
recovery_status_data.dtypes

sessionDate    datetime64[ns]
seasonName             object
metric                 object
category               object
value                 float64
dtype: object

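The only column with an incorrect data type was `sessionDate`, which `read_csv` loads as a string (`object`), so we converted it to datetime. The `dtypes` output above already shows `datetime64[ns]` because that cell was run after the conversion below (see the execution counts).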

In [20]:
# Convert columns to the correct data type

## Convert sessionDate from a dd/mm/yyyy string to datetime
recovery_status_data['sessionDate'] = pd.to_datetime(recovery_status_data['sessionDate'], format='%d/%m/%Y')
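
The explicit `format` matters because dates such as `02/07/2023` are ambiguous between day-first and month-first. The sketch below uses a few hand-written strings (the third is deliberately invalid) to show the day-first parse. It also shows `errors="coerce"`, which turns unparseable entries into `NaT` instead of raising. Whether to coerce or fail loudly is a design choice, not something the notebook does.

```python
import pandas as pd

# Two dates taken from the previews, plus one deliberately invalid date.
raw = pd.Series(["02/07/2023", "12/03/2025", "31/02/2024"])

# Day-first parsing; invalid dates become NaT rather than raising.
parsed = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")

print(parsed)
print("Unparseable entries:", int(parsed.isna().sum()))
```

Here `02/07/2023` is read as 2 July 2023 and `12/03/2025` as 12 March 2025.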

<a id='1.4'></a>
### Part 1.4 - Dealing with Null/Missing Values

There are 2,812 null values in the `value` column, which is about 34.8% of its 8,073 rows. A null value means that the data was not collected for that metric on that day.

In [23]:
recovery_status_data.isna().sum()

sessionDate       0
seasonName        0
metric            0
category          0
value          2812
dtype: int64

In [30]:
(recovery_status_data.isna().sum() / len(recovery_status_data)) * 100

sessionDate     0.000000
seasonName      0.000000
metric          0.000000
category        0.000000
value          34.832157
dtype: float64
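
Because all missing values sit in a single column, a per-metric breakdown is likely more informative than the overall share. The sketch below shows one way to compute it on toy data. The values are made up and `missing_share` is an illustrative name. The same expression should work on `recovery_status_data`.

```python
import pandas as pd

nan = float("nan")

# Toy data: 1 of 4 sleep values and 3 of 4 bio values are missing.
toy = pd.DataFrame({
    "metric": ["sleep_baseline_composite"] * 4 + ["bio_baseline_composite"] * 4,
    "value": [0.1, nan, 0.3, 0.2, nan, nan, nan, 0.5],
})

# Share of missing values per metric, in percent, highest first.
missing_share = (
    toy["value"].isna()
    .groupby(toy["metric"])
    .mean()
    .mul(100)
    .sort_values(ascending=False)
)
print(missing_share)
```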

In [None]:
# Keep only the rows that hold the overall recovery score
emboss = recovery_status_data[recovery_status_data['metric'] == 'emboss_baseline_score']
emboss['value'].isna().sum()
# 250 rows have no overall recovery score.
# Open question: drop these rows, or keep them and impute from the previous day's value?

250
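
The two options raised in the comment can be compared side by side. The sketch below is not a recommendation. It runs on a toy frame with made-up values, and `value_ffill` is a hypothetical column name. Forward filling carries the last observed score into later days, so any gap at the start of the series stays missing. That matters here because the first day in the preview (02/07/2023) has no overall score.

```python
import pandas as pd

nan = float("nan")

# Toy overall-score series: missing on the first and third day.
toy = pd.DataFrame({
    "sessionDate": pd.to_datetime(
        ["2023-07-02", "2023-07-03", "2023-07-04", "2023-07-05"]
    ),
    "metric": "emboss_baseline_score",
    "value": [nan, 0.20, nan, -0.10],
})

# Work on a sorted copy so forward fill follows calendar order
# and the original frame is left untouched.
emboss_toy = (
    toy[toy["metric"] == "emboss_baseline_score"]
    .sort_values("sessionDate")
    .copy()
)

# Option 1: drop days without an overall score.
dropped = emboss_toy.dropna(subset=["value"])

# Option 2: carry the previous day's score forward.
emboss_toy["value_ffill"] = emboss_toy["value"].ffill()

print(len(dropped))
print(emboss_toy[["sessionDate", "value", "value_ffill"]])
```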

<a id='1.5'></a>
### Part 1.5 - Final Checks

<a id='2'></a>
## Part 2 - Exploratory Data Analysis