# Premier League All Players Stats 23/24: Cleaning


## 📂 Dataset Info
* **Source:** [Premier League All Players Stats 23/24](https://www.kaggle.com/datasets/orkunaktas/premier-league-all-players-stats-2324)
* **File:** `premier_league_23_24_raw.csv`

This dataset contains detailed statistics for all footballers in the 2023/24 Premier League season. The columns are:

- `Player`: The name of the player.
- `Nation`: The player's nationality.
- `Pos`: The player's position (e.g., forward, midfielder, defender).
- `Age`: The player's age.
- `MP` (Matches Played): Number of matches the player appeared in.
- `Starts`: Number of matches the player started.
- `Min` (Minutes): Total minutes played by the player.
- `90s` (90s Played): The equivalent of 90-minute matches played by the player (e.g., 1.5 = 135 minutes).
- `Gls` (Goals): Total number of goals scored by the player.
- `Ast` (Assists): Total number of assists made by the player.
- `G+A` (Goals + Assists): Total number of goals and assists combined.
- `G-PK` (Goals - Penalty Kicks): Total number of goals scored excluding penalty kicks.
- `PK` (Penalty Kicks): Number of penalty goals scored by the player.
- `PKatt` (Penalty Kicks Attempted): Number of penalty kicks attempted by the player.
- `CrdY` (Yellow Cards): Number of yellow cards received by the player.
- `CrdR` (Red Cards): Number of red cards received by the player.
- `xG` (Expected Goals): The expected number of goals from the player's shots.
- `npxG` (Non-Penalty Expected Goals): Expected goals excluding penalties.
- `xAG` (Expected Assisted Goals): The expected goals from shots that follow the player's passes, i.e. the expected number of assists.
- `npxG+xAG` (Non-Penalty xG + xAG): Total of non-penalty expected goals and expected assists.
- `PrgC` (Progressive Carries): Number of times the player carried the ball significantly forward.
- `PrgP` (Progressive Passes): Number of passes made by the player that moved the ball significantly forward.
- `PrgR` (Progressive Passes Received): Number of progressive passes the player received.
- `Gls_90` (Goals per 90): Goals scored per 90 minutes played.
- `Ast_90` (Assists per 90): Assists made per 90 minutes played.
- `G+A_90` (Goals + Assists per 90): Goals and assists combined per 90 minutes played.
- `G-PK_90` (Goals - Penalty Kicks per 90): Goals excluding penalty kicks per 90 minutes played.
- `G+A-PK_90` (Goals + Assists - Penalty Kicks per 90): Goals and assists minus penalty goals per 90 minutes played.
- `xG_90` (Expected Goals per 90): Expected goals per 90 minutes played.
- `xAG_90` (Expected Assisted Goals per 90): Expected assists per 90 minutes played.
- `xG+xAG_90` (Expected Goals + Expected Assisted Goals per 90): Total expected goals and assists per 90 minutes played.
- `npxG_90` (Non-Penalty Expected Goals per 90): Expected goals excluding penalties per 90 minutes played.
- `npxG+xAG_90` (Non-Penalty xG + xAG per 90): Non-penalty expected goals plus expected assists per 90 minutes played.
- `Team`: The club the player played for.
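The `90s` column and the `_90` columns (visible in the column list further down) appear to be derived from the season totals. The sketch below checks that on two rows copied from the dataset preview. It assumes, without confirmation from the source, that `90s` is `Min / 90` rounded to one decimal and that a per-90 value is the total divided by `90s`, rounded to two decimals. The `_check` column names are made up for this illustration.

```python
import pandas as pd

# Sample values copied from the first two rows of the dataset preview
sample = pd.DataFrame(
    {
        "Player": ["Rodri", "Phil Foden"],
        "Min": [2931.0, 2857.0],
        "90s": [32.6, 31.7],
        "Ast": [9.0, 8.0],
        "Ast_90": [0.28, 0.25],
    }
)

# Assumption: 90s = Min / 90, rounded to one decimal place
sample["90s_check"] = (sample["Min"] / 90).round(1)

# Assumption: a per-90 column is the season total divided by 90s
sample["Ast_90_check"] = (sample["Ast"] / sample["90s"]).round(2)

print(sample[["Player", "90s", "90s_check", "Ast_90", "Ast_90_check"]])
```

If both assumptions hold for the full file, the per-90 columns carry no extra information and could be recomputed from the totals whenever needed.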

## Loading data and libraries

In [9]:
# Libraries
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('premier_league_23_24_raw.csv')

# Number of rows and columns
print(f"Loaded Dataset: {df.shape[0]} rows, {df.shape[1]} columns")

Loaded Dataset: 580 rows, 34 columns


## Basic data inspection

In [10]:
df.head()

Unnamed: 0,Player,Nation,Pos,Age,MP,Starts,Min,90s,Gls,Ast,...,Ast_90,G+A_90,G-PK_90,G+A-PK_90,xG_90,xAG_90,xG+xAG_90,npxG_90,npxG+xAG_90,Team
0,Rodri,es ESP,MF,27.0,34,34,2931.0,32.6,8.0,9.0,...,0.28,0.52,0.25,0.52,0.12,0.12,0.24,0.12,0.24,Manchester City
1,Phil Foden,eng ENG,"FW,MF",23.0,35,33,2857.0,31.7,19.0,8.0,...,0.25,0.85,0.6,0.85,0.33,0.26,0.59,0.33,0.59,Manchester City
2,Ederson,br BRA,GK,29.0,33,33,2785.0,30.9,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Manchester City
3,Julián Álvarez,ar ARG,"MF,FW",23.0,36,31,2647.0,29.4,11.0,8.0,...,0.27,0.65,0.31,0.58,0.44,0.22,0.66,0.39,0.61,Manchester City
4,Kyle Walker,eng ENG,DF,33.0,32,30,2767.0,30.7,0.0,4.0,...,0.13,0.13,0.0,0.13,0.01,0.09,0.1,0.01,0.1,Manchester City


In [11]:
# Printing names of columns
print(df.columns.tolist())

# Renaming columns for consistency
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

# Checking the correctness of names
print(df.columns.tolist())

['Player', 'Nation', 'Pos', 'Age', 'MP', 'Starts', 'Min', '90s', 'Gls', 'Ast', 'G+A', 'G-PK', 'PK', 'PKatt', 'CrdY', 'CrdR', 'xG', 'npxG', 'xAG', 'npxG+xAG', 'PrgC', 'PrgP', 'PrgR', 'Gls_90', 'Ast_90', 'G+A_90', 'G-PK_90', 'G+A-PK_90', 'xG_90', 'xAG_90', 'xG+xAG_90', 'npxG_90', 'npxG+xAG_90', 'Team']
['player', 'nation', 'pos', 'age', 'mp', 'starts', 'min', '90s', 'gls', 'ast', 'g+a', 'g-pk', 'pk', 'pkatt', 'crdy', 'crdr', 'xg', 'npxg', 'xag', 'npxg+xag', 'prgc', 'prgp', 'prgr', 'gls_90', 'ast_90', 'g+a_90', 'g-pk_90', 'g+a-pk_90', 'xg_90', 'xag_90', 'xg+xag_90', 'npxg_90', 'npxg+xag_90', 'team']
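Two things are worth checking after a rename like this: that no two columns ended up with the same name, and how to access names such as `g+a` that are not valid Python identifiers. A minimal sketch on a stand-in frame (the padded and mixed-case names are invented for illustration, and `demo` is a hypothetical variable):

```python
import pandas as pd

# Small stand-in frame; the padded and mixed-case names are made up for illustration
demo = pd.DataFrame(columns=[" Player", "G+A", "npxG+xAG_90", "Team "])

# Same normalisation chain as in the cell above
demo.columns = demo.columns.str.strip().str.lower().str.replace(" ", "_")

# Normalisation should never collapse two columns into one name
assert demo.columns.is_unique

# Names such as 'g+a' are not valid identifiers, so use bracket access;
# attribute access (demo.g+a) would be parsed as (demo.g) + a
print(demo["g+a"])
print(demo.columns.tolist())
```

Keeping the `+` and `-` characters preserves the familiar stat abbreviations, at the cost of always needing bracket access for those columns.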


In [14]:
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 580 entries, 0 to 579
Data columns (total 34 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   player       580 non-null    object 
 1   nation       580 non-null    object 
 2   pos          580 non-null    object 
 3   age          580 non-null    float64
 4   mp           580 non-null    int64  
 5   starts       580 non-null    int64  
 6   min          580 non-null    float64
 7   90s          580 non-null    float64
 8   gls          580 non-null    float64
 9   ast          580 non-null    float64
 10  g+a          580 non-null    float64
 11  g-pk         580 non-null    float64
 12  pk           580 non-null    float64
 13  pkatt        580 non-null    float64
 14  crdy         580 non-null    float64
 15  crdr         580 non-null    float64
 16  xg           580 non-null    float64
 17  npxg         580 non-null    float64
 18  xag          580 non-null    float64
 19  npxg+xag

### Duplicates and missing values

#### Duplicates

In [15]:
print(f"Number of duplicated rows: {df.duplicated().sum()}")

Number of duplicated rows: 0
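`df.duplicated()` only flags rows that are identical in every column. A player who moved clubs mid-season could still appear twice with different `team` values. Whether that happens in this dataset is not established here; the sketch below uses hypothetical rows to show how such cases could be surfaced.

```python
import pandas as pd

# Hypothetical rows: "Player A" appears for two clubs, as after a mid-season transfer
demo = pd.DataFrame(
    {
        "player": ["Player A", "Player A", "Player B"],
        "team": ["Club X", "Club Y", "Club X"],
        "mp": [10, 12, 30],
    }
)

# Full-row check, as in the cell above
full_row_dupes = demo.duplicated().sum()

# Name-only check; keep=False marks every occurrence of a repeated name
repeated_names = demo[demo.duplicated(subset="player", keep=False)]

print(f"Full-row duplicates: {full_row_dupes}")
print(repeated_names)
```

Repeated names are not necessarily errors (two different players can share a name), so the result is a list to inspect rather than rows to drop.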


#### Missing values

In [16]:
print(f"Number of missing values:\n{df.isna().sum()}")

Number of missing values:
player         0
nation         0
pos            0
age            0
mp             0
starts         0
min            0
90s            0
gls            0
ast            0
g+a            0
g-pk           0
pk             0
pkatt          0
crdy           0
crdr           0
xg             0
npxg           0
xag            0
npxg+xag       0
prgc           0
prgp           0
prgr           0
gls_90         0
ast_90         0
g+a_90         0
g-pk_90        0
g+a-pk_90      0
xg_90          0
xag_90         0
xg+xag_90      0
npxg_90        0
npxg+xag_90    0
team           0
dtype: int64


Based on the output of `df.info()` and `df.isna().sum()`, the dataset is structurally clean:

* **Completeness:** The dataset contains **580 rows** and **34 columns**. No missing values (nulls) were detected: every column shows 580 non-null entries.
* **Data Types:** Each column has a type that matches its content:
    * **Categorical variables** (e.g., `player`, `nation`, `pos`, `team`) are stored as `object` (strings).
    * **Numerical variables** (e.g., `age`, `mp`, `gls`, `xg`) are stored as `int64` or `float64`. Some whole-number columns, such as `age` and `gls`, are stored as `float64`.

**Conclusion:** No immediate data type conversion is required. The dataset is technically consistent and ready for further cleaning (e.g., string parsing) or export.
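Although no conversion is required, some `float64` columns hold only whole numbers. The sketch below shows one way to find such columns and optionally cast them to `int64`. The sample values are copied from the first three rows of the preview; `sample`, `whole_number_cols`, and `converted` are hypothetical names.

```python
import pandas as pd

# Sample values copied from the dataset preview (first three rows)
sample = pd.DataFrame(
    {
        "age": [27.0, 23.0, 29.0],
        "gls": [8.0, 19.0, 0.0],
        "90s": [32.6, 31.7, 30.9],
    }
)

# Float columns whose values all have a zero fractional part
float_cols = sample.select_dtypes(include="float64").columns
whole_number_cols = [col for col in float_cols if (sample[col] % 1 == 0).all()]
print(whole_number_cols)

# Optional cast of only those columns; safe here because there are no NaNs
converted = sample.astype({col: "int64" for col in whole_number_cols})
print(converted.dtypes)
```

The cast relies on the columns having no missing values, which matches the null check above; with NaNs present, pandas' nullable `Int64` dtype would be the alternative.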

## Data Cleaning 

### Player

In [18]:
print(df['player'])

0                  Rodri
1             Phil Foden
2                Ederson
3         Julián Álvarez
4            Kyle Walker
             ...        
575           Sam Curtis
576      Daniel Jebbison
577    Antwoine Hackford
578           Sydie Peck
579             Ryan One
Name: player, Length: 580, dtype: object


### Nation

In [20]:
print(df['nation'].head())

0     es ESP
1    eng ENG
2     br BRA
3     ar ARG
4    eng ENG
Name: nation, dtype: object


In [22]:
# Cleaning the 'nation' column to keep only the uppercase country code (e.g., "es ESP" -> "ESP")
df['nation'] = df['nation'].str.split(' ').str.get(1)

# Check the results
print(df['nation'].head())

0    ESP
1    ENG
2    BRA
3    ARG
4    ENG
Name: nation, dtype: object
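Splitting on a space and taking the second element assumes every value has the form `"<prefix> <CODE>"`. A value with no prefix would silently become `NaN`. As a hedged alternative, the sketch below extracts the trailing run of uppercase letters with a regular expression. The bare `"BRA"` entry is hypothetical and only there to show the difference.

```python
import pandas as pd

# First two values come from the dataset preview; the bare "BRA" is hypothetical
nation = pd.Series(["es ESP", "eng ENG", "BRA"])

# Approach used above: second element after splitting on a space
by_split = nation.str.split(" ").str.get(1)

# Alternative: capture the uppercase letters at the end of the string
by_regex = nation.str.extract(r"([A-Z]+)$", expand=False)

print(pd.DataFrame({"raw": nation, "split": by_split, "regex": by_regex}))
```

Either approach gives the same result when all values follow the two-part pattern; a quick `isna().sum()` on the cleaned column confirms that nothing was lost.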


### Position

In [23]:
print(df['pos'].head())

0       MF
1    FW,MF
2       GK
3    MF,FW
4       DF
Name: pos, dtype: object


In [25]:
# Cleaning the 'pos' column: keep only the first listed position, treated here as primary (e.g., "FW,MF" -> "FW")
df['pos'] = df['pos'].str.split(',').str.get(0)

# Check the results
print(df['pos'].head())

0    MF
1    FW
2    GK
3    MF
4    DF
Name: pos, dtype: object
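Keeping only the first position discards the second one. If that information might be useful later, it can be stored in a separate column before `pos` is overwritten. A minimal sketch using the five values from the preview; the `secondary` name is an assumption, not part of the original notebook.

```python
import pandas as pd

# Values copied from the 'pos' preview before cleaning
pos = pd.Series(["MF", "FW,MF", "GK", "MF,FW", "DF"], name="pos")

# expand=True returns one column per position; missing entries are null
parts = pos.str.split(",", expand=True)

primary = parts[0]
secondary = parts[1]  # hypothetical extra column for the second listed position

print(pd.DataFrame({"pos": primary, "secondary": secondary}))
```

Treating the first listed position as the primary one is itself an assumption about how the source orders positions.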


### Team

In [29]:
# Printing the list of team names
print(df['team'].unique().tolist())

# Checking the number of unique teams (expected: 20)
print(df['team'].nunique())

['Manchester City', 'Liverpool', 'Arsenal', 'Chelsea', 'Newcastle United', 'Tottenham Hotspur', 'Manchester United', 'Aston Villa', 'West Ham United', 'Crystal Palace', 'Fulham', 'Everton', 'Brighton', 'Bournemouth', 'Wolverhampton', 'Brentford', 'Nottingham Forest', 'Luton Town', 'Burnley', 'Sheffield United']
20
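A count of 20 matches the number of Premier League clubs, which suggests there are no spelling variants. When the count comes out higher than expected, stray whitespace is a common cause. The sketch below illustrates this on hypothetical values (the trailing space is invented) and shows `value_counts()` as a way to inspect squad sizes.

```python
import pandas as pd

# Hypothetical values: one name carries a trailing space
teams = pd.Series(["Arsenal", "Arsenal ", "Chelsea"], name="team")

# The padded spelling is counted as a separate team
print(teams.nunique())

# Stripping whitespace merges the two spellings
cleaned = teams.str.strip()
print(cleaned.nunique())

# Rows per team, useful for spotting unusually small or large squads
print(cleaned.value_counts())
```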


### Numerical Variables

#### Integers

#### Floats