# **Physical Capability Data - Exploratory Data Analysis**

*Physical Capability is measured using a battery of tests which measure different qualities and expressions of force.
The dataset provided contains the longitudinal data for 1 player for the past 2+ seasons. The data has been aggregated to the MOVEMENT, QUALITY and EXPRESSION level,
so scores are not available for the specific tests and metrics that lie in the layers underneath.
Where sufficient data exists, the “BenchmarkPct” value will be available. This is a pre-calculated aggregate expressed as a percentage. There is 1 row per
movement/quality/expression per day. If no new data has been recorded on a given day, data from the previous day is carried forward.*

This notebook is organized in the following sections:

* [Part 0 - Preliminary Steps](#0)
    * [Part 0.1 - Importing the Necessary Libraries](#0.1)
    * [Part 0.2 - Reading the Physical Capability Data Dataset](#0.2)

* [Part 1 - Data Cleaning](#1)
    * [Part 1.1 - Preliminary Analysis of the Dataset](#1.1)
    * [Part 1.2 - Dealing with Duplicates](#1.2)
    * [Part 1.3 - Ensuring Correct Data Types](#1.3)
    * [Part 1.4 - Dealing with Null/Missing Values](#1.4)
    * [Part 1.5 - Creating New Columns to Enhance the Analysis](#1.5)
    * [Part 1.6 - Final Checks](#1.6)

* [Part 2 - Exploratory Data Analysis](#2)

<a id='0'></a>
## Part 0 - Preliminary Steps

<a id='0.1'></a>
### Part 0.1 - Importing the Necessary Libraries

In [1]:
import pandas as pd

<a id='0.2'></a>
### Part 0.2 - Reading the Physical Capability Data Dataset

In [3]:
physical_capability_data = pd.read_csv('../data/raw/CFC Physical Capability Data_ (1).csv')

<a id='1'></a>
## Part 1 - Data Cleaning

<a id='1.1'></a>
### Part 1.1 - Preliminary Analysis of the Dataset

In [4]:
physical_capability_data.head(20)

| | testDate | expression | movement | quality | benchmarkPct |
|---|---|---|---|---|---|
| 0 | 03/07/2023 | isometric | upper body | pull | |
| 1 | 04/07/2023 | dynamic | agility | acceleration | 0.32 |
| 2 | 10/07/2023 | dynamic | agility | deceleration | 0.867 |
| 3 | 18/07/2023 | isometric | jump | take off | |
| 4 | 20/07/2023 | dynamic | upper body | pull | 0.8525 |
| 5 | 30/07/2023 | isometric | upper body | grapple | |
| 6 | 31/07/2023 | isometric | jump | land | |
| 7 | 11/08/2023 | isometric | sprint | acceleration | |
| 8 | 18/08/2023 | dynamic | agility | deceleration | 0.868 |
| 9 | 25/08/2023 | dynamic | upper body | push | 0.4 |


In [None]:
# Idea: give every day a value for each test combination, since values can be carried forward.

# This would require listing every possible movement/quality/expression combination
# and then filling the null values, perhaps with a rolling average or a shift.
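The idea in the cell above can be sketched as follows. This is a minimal illustration on a toy frame, not the notebook's actual implementation: the `toy` data and the `benchmarkPct_ffill` column name are assumptions, and it uses a plain forward fill rather than a rolling average, because the dataset description says that values are carried forward from the previous day.

```python
import pandas as pd

# Toy stand-in for physical_capability_data (values are illustrative only).
toy = pd.DataFrame({
    "testDate": pd.to_datetime(["2023-07-03", "2023-07-05", "2023-07-04"]),
    "movement": ["jump", "jump", "sprint"],
    "quality": ["land", "land", "acceleration"],
    "expression": ["dynamic", "dynamic", "dynamic"],
    "benchmarkPct": [0.50, 0.60, 0.40],
})

keys = ["movement", "quality", "expression"]

# Every calendar day between the first and last test date.
all_days = pd.DataFrame({
    "testDate": pd.date_range(toy["testDate"].min(), toy["testDate"].max(), freq="D")
})

# Every movement/quality/expression combination that occurs in the data.
combos = toy[keys].drop_duplicates()

# Cross join: one row per combination per day (requires pandas >= 1.2).
grid = combos.merge(all_days, how="cross")

# Attach the recorded values, then carry the last known value forward
# within each combination. Days before the first recording stay NaN.
daily = (
    grid.merge(toy, on=keys + ["testDate"], how="left")
        .sort_values(keys + ["testDate"])
)
daily["benchmarkPct_ffill"] = daily.groupby(keys)["benchmarkPct"].ffill()
```

Keeping the filled values in a separate column leaves the original `benchmarkPct` intact, so days with a recorded value remain distinguishable from days with a carried-forward one.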

In [5]:
physical_capability_data.tail(50)

| | testDate | expression | movement | quality | benchmarkPct |
|---|---|---|---|---|---|
| 12350 | 16/04/2024 | dynamic | jump | land | 0.655 |
| 12351 | 27/04/2024 | isometric | agility | acceleration | 0.52 |
| 12352 | 15/05/2024 | isometric | upper body | push | 0.704 |
| 12353 | 16/05/2024 | dynamic | sprint | max velocity | 0.625 |
| 12354 | 17/05/2024 | isometric | sprint | max velocity | 0.1995 |
| 12355 | 18/05/2024 | isometric | agility | deceleration | 1.1365 |
| 12356 | 19/05/2024 | isometric | upper body | grapple | |
| 12357 | 28/05/2024 | dynamic | sprint | acceleration | 0.43 |
| 12358 | 02/06/2024 | isometric | upper body | pull | |
| 12359 | 05/06/2024 | isometric | upper body | pull | |


In [5]:
physical_capability_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12400 entries, 0 to 12399
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   testDate      12400 non-null  object 
 1   expression    12400 non-null  object 
 2   movement      12400 non-null  object 
 3   quality       12400 non-null  object 
 4   benchmarkPct  9839 non-null   float64
dtypes: float64(1), object(4)
memory usage: 484.5+ KB


<a id='1.2'></a>
### Part 1.2 - Dealing with Duplicates

We checked the dataset for duplicate rows and found none.

In [6]:
physical_capability_data.duplicated().any()

False

In [7]:
# Another check for duplicates - just in case
physical_capability_data.duplicated().sum()

0
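The checks above compare entire rows. Because the dataset description states that there is one row per movement/quality/expression per day, a stricter check could look for repeated keys even when `benchmarkPct` differs. The sketch below runs on a toy frame; the `toy` data and the `key` list are assumptions for illustration.

```python
import pandas as pd

# Toy frame: the first two rows share the same key but differ in benchmarkPct.
toy = pd.DataFrame({
    "testDate": ["03/07/2023", "03/07/2023", "04/07/2023"],
    "movement": ["jump", "jump", "jump"],
    "quality": ["land", "land", "land"],
    "expression": ["dynamic", "dynamic", "dynamic"],
    "benchmarkPct": [0.50, 0.55, 0.60],
})

key = ["testDate", "movement", "quality", "expression"]

# Full-row check: counts rows that are identical in every column.
full_row_dupes = toy.duplicated().sum()

# Key-level check: keep=False marks every row whose key occurs more than once.
key_dupes = toy[toy.duplicated(subset=key, keep=False)]
```

On this toy frame the full-row check finds nothing, while the key-level check returns the two conflicting rows, which a full-row comparison alone would miss.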

<a id='1.3'></a>
### Part 1.3 - Ensuring Correct Data Types

Next, we checked whether the data types of all columns were correct.

In [14]:
physical_capability_data.dtypes

testDate        datetime64[ns]
expression              object
movement                object
quality                 object
benchmarkPct           float64
dtype: object

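The only column with an incorrect data type was `testDate`, which was read as `object` (see the `info()` output in Part 1.1), so we converted it to datetime using the day-first format `%d/%m/%Y`. The `dtypes` output above was captured after the conversion cell below had been run, which is why it already shows `datetime64[ns]`.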

In [11]:
# Transforming the columns into the correct data type

## Transforming the testDate column into datetime format
physical_capability_data['testDate'] = pd.to_datetime(physical_capability_data['testDate'], format='%d/%m/%Y')
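With an explicit `format`, `pd.to_datetime` raises an error on the first string that does not match. If malformed dates were a concern, one option would be to coerce them to `NaT` and inspect the failures, as in the sketch below. The `raw_dates` series is a made-up example, not taken from the dataset.

```python
import pandas as pd

# Made-up example: two valid day-first dates and one malformed entry.
raw_dates = pd.Series(["03/07/2023", "31/07/2023", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y", errors="coerce")

# Entries that were present in the raw data but failed to parse.
failed = raw_dates[parsed.isna() & raw_dates.notna()]
```

An empty `failed` series would confirm that every date matched the expected format.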

<a id='1.4'></a>
### Part 1.4 - Dealing with Null/Missing Values

In [16]:
physical_capability_data.isna().sum()

testDate           0
expression         0
movement           0
quality            0
benchmarkPct    2561
dtype: int64

In [15]:
(physical_capability_data.isna().sum() / len(physical_capability_data)) * 100

testDate         0.000000
expression       0.000000
movement         0.000000
quality          0.000000
benchmarkPct    20.653226
dtype: float64
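Roughly a fifth of the `benchmarkPct` values are missing overall. To see whether the gaps concentrate in particular test combinations, the missing share could be broken down by group. The sketch below uses a toy frame; the `toy` data and the `is_missing` helper column are assumptions for illustration.

```python
import pandas as pd

# Toy frame (values are illustrative only).
toy = pd.DataFrame({
    "movement": ["upper body", "upper body", "agility", "agility"],
    "expression": ["isometric", "isometric", "dynamic", "dynamic"],
    "benchmarkPct": [None, None, 0.32, 0.867],
})

# Percentage of missing benchmarkPct values per movement/expression pair.
missing_share = (
    toy.assign(is_missing=toy["benchmarkPct"].isna())
       .groupby(["movement", "expression"])["is_missing"]
       .mean()
       .mul(100)
       .sort_values(ascending=False)
)
```

The same pattern extends to `quality`, or to all three grouping columns at once, by adding them to the `groupby` list.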

<a id='1.5'></a>
### Part 1.5 - Creating New Columns to Enhance the Analysis

<a id='1.6'></a>
### Part 1.6 - Final Checks

<a id='2'></a>
## Part 2 - Exploratory Data Analysis