# **GPS Data - Exploratory Data Analysis**

*GPS performance metrics track movement demands, including speed, distance, and acceleration, to assess workload and physical output.
This dataset contains simulated data for 1 player.*

This notebook is organized in the following sections:

* [Part 0 - Preliminary Steps](#0)
    * [Part 0.1 - Importing the Necessary Libraries](#0.1)
    * [Part 0.2 - Reading the GPS Data Dataset](#0.2)

* [Part 1 - Data Cleaning](#1)
    * [Part 1.1 - Preliminary Analysis of the Dataset](#1.1)
    * [Part 1.2 - Dealing with Duplicates](#1.2)
    * [Part 1.3 - Ensuring Correct Data Types](#1.3)
    * [Part 1.4 - Dealing with Null/Missing Values](#1.4)
    * [Part 1.5 - Creating New Columns to Enhance the Analysis](#1.5)
    * [Part 1.6 - Final Checks](#1.6)

* [Part 2 - Exploratory Data Analysis](#2)

<a id='0'></a>
## Part 0 - Preliminary Steps

<a id='0.1'></a>
### Part 0.1 - Importing the Necessary Libraries

In [1]:
import pandas as pd

<a id='0.2'></a>
### Part 0.2 - Reading the GPS Data Dataset

In [2]:
gps_data = pd.read_csv('data/CFC GPS Data (1).csv', encoding='ISO-8859-1')

<a id='1'></a>
## Part 1 - Data Cleaning

<a id='1.1'></a>
### Part 1.1 - Preliminary Analysis of the Dataset

In [3]:
gps_data.head()

Unnamed: 0,date,opposition_code,opposition_full,md_plus_code,md_minus_code,season,distance,distance_over_21,distance_over_24,distance_over_27,accel_decel_over_2_5,accel_decel_over_3_5,accel_decel_over_4_5,day_duration,peak_speed,hr_zone_1_hms,hr_zone_2_hms,hr_zone_3_hms,hr_zone_4_hms,hr_zone_5_hms
0,02/08/2022,,,10,-4,2022/2023,4524.085076,89.27853,85.690318,61.634335,119.108101,32.636928,8.557443,76.242369,30.7559,00:03:40,00:17:29,00:19:20,00:11:23,00:00:02
1,03/08/2022,,,10,-3,2022/2023,5264.645855,245.861691,91.348143,20.210588,45.974019,6.30973,3.09599,65.21783,28.67495,00:06:44,00:16:40,00:15:35,00:06:08,00:00:01
2,04/08/2022,,,10,-2,2022/2023,6886.542272,199.18026,84.634735,22.58547,97.488512,24.40018,3.825869,105.139759,29.2172,00:17:29,00:37:09,00:23:49,00:06:30,00:00:02
3,05/08/2022,,,10,-1,2022/2023,2622.552016,68.389321,11.795402,6.360193,43.750265,14.642925,2.189602,64.588434,28.703,00:07:34,00:15:51,00:07:31,00:01:51,00:00:00
4,06/08/2022,EVE,Everton,0,0,2022/2023,5654.028319,447.090545,164.576671,82.74643,122.568127,49.748446,22.201737,46.048353,30.29812,00:01:09,00:01:04,00:11:34,00:13:15,00:02:30


In [4]:
gps_data.tail()

Unnamed: 0,date,opposition_code,opposition_full,md_plus_code,md_minus_code,season,distance,distance_over_21,distance_over_24,distance_over_27,accel_decel_over_2_5,accel_decel_over_3_5,accel_decel_over_4_5,day_duration,peak_speed,hr_zone_1_hms,hr_zone_2_hms,hr_zone_3_hms,hr_zone_4_hms,hr_zone_5_hms
857,07/04/2025,,,1,-5,2024/2025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00
858,08/04/2025,,,2,-4,2024/2025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00
859,09/04/2025,,,3,-3,2024/2025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00
860,10/04/2025,,,4,-2,2024/2025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00
861,11/04/2025,,,5,-1,2024/2025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00


The GPS Data dataset has 862 rows and 20 columns. Only 2 columns have null values: opposition_code and opposition_full, each with 147 non-null entries.

In [10]:
gps_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 862 entries, 0 to 861
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   date                  862 non-null    object 
 1   opposition_code       147 non-null    object 
 2   opposition_full       147 non-null    object 
 3   md_plus_code          862 non-null    int64  
 4   md_minus_code         862 non-null    int64  
 5   season                862 non-null    object 
 6   distance              862 non-null    float64
 7   distance_over_21      862 non-null    float64
 8   distance_over_24      862 non-null    float64
 9   distance_over_27      862 non-null    float64
 10  accel_decel_over_2_5  862 non-null    float64
 11  accel_decel_over_3_5  862 non-null    float64
 12  accel_decel_over_4_5  862 non-null    float64
 13  day_duration          862 non-null    float64
 14  peak_speed            862 non-null    float64
 15  hr_zone_1_hms         8
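
The null counts can also be read off directly instead of from the `info()` summary. The snippet below is a minimal sketch on a small stand-in frame (`sample`, built from three rows shown in the `head()` output above, with only a few columns); it is not the full dataset, and the variable names are illustrative.

```python
import pandas as pd

# Stand-in for gps_data: three rows and four columns taken from the head() output
sample = pd.DataFrame({
    "date": ["04/08/2022", "05/08/2022", "06/08/2022"],
    "opposition_code": [None, None, "EVE"],
    "opposition_full": [None, None, "Everton"],
    "distance": [6886.542272, 2622.552016, 5654.028319],
})

# Number of missing values in each column
null_counts = sample.isna().sum()

# Keep only the columns that actually contain missing values
columns_with_nulls = null_counts[null_counts > 0]
print(columns_with_nulls)
```

Running the same two lines on `gps_data` would list every column with at least one null, which is a quick way to confirm the count stated above.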

<a id='1.2'></a>
### Part 1.2 - Dealing with Duplicates

We checked for duplicate rows and found none.

In [5]:
gps_data.duplicated().any()

False

In [None]:
# Another check for duplicates - just in case
gps_data.duplicated().sum()

0
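
`duplicated()` with no arguments only flags rows that match on every column. Since each row appears to represent one calendar day, a stricter check is whether any date occurs more than once. The sketch below illustrates the difference on a small hypothetical frame (`sample`); the repeated date and the third distance value are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical frame: the second and third rows share a date but differ in distance
sample = pd.DataFrame({
    "date": ["02/08/2022", "03/08/2022", "03/08/2022"],
    "distance": [4524.085076, 5264.645855, 5300.0],
})

# Full-row check: rows are only duplicates if every column matches
full_row_dupes = int(sample.duplicated().sum())

# Key-based check: flag rows whose date has already appeared
date_dupes = int(sample.duplicated(subset=["date"]).sum())

# keep=False marks every row involved in a repeated date, for inspection
repeated_dates = sample[sample.duplicated(subset=["date"], keep=False)]

print(full_row_dupes, date_dupes)
print(repeated_dates)
```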

<a id='1.3'></a>
### Part 1.3 - Ensuring Correct Data Types

Next, we checked whether the data types of all columns were appropriate.

In [12]:
gps_data.head()

Unnamed: 0,date,opposition_code,opposition_full,md_plus_code,md_minus_code,season,distance,distance_over_21,distance_over_24,distance_over_27,accel_decel_over_2_5,accel_decel_over_3_5,accel_decel_over_4_5,day_duration,peak_speed,hr_zone_1_hms,hr_zone_2_hms,hr_zone_3_hms,hr_zone_4_hms,hr_zone_5_hms
0,02/08/2022,,,10,-4,2022/2023,4524.085076,89.27853,85.690318,61.634335,119.108101,32.636928,8.557443,76.242369,30.7559,00:03:40,00:17:29,00:19:20,00:11:23,00:00:02
1,03/08/2022,,,10,-3,2022/2023,5264.645855,245.861691,91.348143,20.210588,45.974019,6.30973,3.09599,65.21783,28.67495,00:06:44,00:16:40,00:15:35,00:06:08,00:00:01
2,04/08/2022,,,10,-2,2022/2023,6886.542272,199.18026,84.634735,22.58547,97.488512,24.40018,3.825869,105.139759,29.2172,00:17:29,00:37:09,00:23:49,00:06:30,00:00:02
3,05/08/2022,,,10,-1,2022/2023,2622.552016,68.389321,11.795402,6.360193,43.750265,14.642925,2.189602,64.588434,28.703,00:07:34,00:15:51,00:07:31,00:01:51,00:00:00
4,06/08/2022,EVE,Everton,0,0,2022/2023,5654.028319,447.090545,164.576671,82.74643,122.568127,49.748446,22.201737,46.048353,30.29812,00:01:09,00:01:04,00:11:34,00:13:15,00:02:30


In [13]:
gps_data.dtypes

date                     object
opposition_code          object
opposition_full          object
md_plus_code              int64
md_minus_code             int64
season                   object
distance                float64
distance_over_21        float64
distance_over_24        float64
distance_over_27        float64
accel_decel_over_2_5    float64
accel_decel_over_3_5    float64
accel_decel_over_4_5    float64
day_duration            float64
peak_speed              float64
hr_zone_1_hms            object
hr_zone_2_hms            object
hr_zone_3_hms            object
hr_zone_4_hms            object
hr_zone_5_hms            object
dtype: object

The following columns were stored as object (string) and needed a more suitable type:
* date --> should be in datetime format (%d/%m/%Y)
* hr_zone_1_hms --> time value in %H:%M:%S format
* hr_zone_2_hms --> time value in %H:%M:%S format
* hr_zone_3_hms --> time value in %H:%M:%S format
* hr_zone_4_hms --> time value in %H:%M:%S format
* hr_zone_5_hms --> time value in %H:%M:%S format

The hr_zone columns appear to hold elapsed time rather than a time of day, so a duration (timedelta) type is arguably a better fit for them than datetime.




We then converted the date column to the datetime type:

In [None]:
gps_data['date'] = pd.to_datetime(gps_data['date'], format='%d/%m/%Y')
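
The hr_zone columns can be handled in a similar way. The notebook does not show this step, so the following is only a sketch of one possible approach: it assumes the values are durations and parses them with `pd.to_timedelta` on a small stand-in frame (`sample`, using values from the first two rows of the `head()` output). The derived column name `hr_zone_1_min` is hypothetical.

```python
import pandas as pd

# Stand-in for gps_data: two hr_zone columns with values from the head() output
sample = pd.DataFrame({
    "hr_zone_1_hms": ["00:03:40", "00:06:44"],
    "hr_zone_2_hms": ["00:17:29", "00:16:40"],
})

# Select the duration columns by their shared prefix
hr_cols = [col for col in sample.columns if col.startswith("hr_zone_")]

# Parse each "HH:MM:SS" string into a timedelta
for col in hr_cols:
    sample[col] = pd.to_timedelta(sample[col])

# Durations can then be expressed numerically, e.g. in minutes (hypothetical column name)
sample["hr_zone_1_min"] = sample["hr_zone_1_hms"].dt.total_seconds() / 60

print(sample.dtypes)
```

A timedelta column supports sums and averages directly, which a datetime parsed from `%H:%M:%S` would not, since pandas attaches a default date (1900-01-01) to it.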

<a id='1.4'></a>
### Part 1.4 - Dealing with Null/Missing Values

<a id='1.5'></a>
### Part 1.5 - Creating New Columns to Enhance the Analysis

<a id='1.6'></a>
### Part 1.6 - Final Checks

<a id='2'></a>
## Part 2 - Exploratory Data Analysis