In [30]:
import sqlite3
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from joblib import dump
import sys

In [31]:
# Import the packages as modules so their versions can be inspected
import sklearn
import joblib

print("Python:", sys.version)
print("scikit-learn:", sklearn.__version__)
print("NumPy:", np.__version__)
print("joblib:", joblib.__version__)

Python: 3.11.5 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:26:23) [MSC v.1916 64 bit (AMD64)]

# Preprocessing and Cleaning

This section of the project builds on, and is heavily influenced by, the work of **Biswajit Basak**, who also used this dataset (in particular the preprocessing pipelines).

Link to the file used - https://github.com/thecuriousjuel/Machine-Learning-Project-iNeuron/blob/main/Jupyter%20Notebook%20File/ML%20Project%20Final.ipynb


## Data Quality Assessment 

To guide our preprocessing and cleaning, we need to understand what form our data is in and what needs to be changed.

## First we will deal with null values

In particular, we will try to identify rows that are majority null (more than 21 of the 42 attributes missing) and remove them.

In [14]:
connection = sqlite3.connect('playerdata.sqlite')
player_data = pd.read_sql_query("SELECT * FROM Player_Attributes", connection)

null_threshold = 21

num_nulls_per_row = player_data.isnull().sum(axis=1)
mostly_null_rows = num_nulls_per_row > null_threshold
print(mostly_null_rows.sum())

num_fully_null_rows = player_data.isnull().all(axis=1).sum()
print(num_fully_null_rows)

836
0


### Insight

After experimenting with the null threshold, I found that 836 rows were majority null, each with around 37 null entries, and no row had more nulls than that. There were also 2319 rows with 8 of 42 values null, which I judged workable. I therefore drop rows with `dropna(thresh=9)`, which keeps only rows that have at least 9 non-null values: the 836 mostly-null rows (around 5 non-null values each) are removed, while the rows with 8 nulls are kept. The result is a better-quality dataset.
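
The `thresh` argument of `dropna` counts non-null values rather than nulls, which is easy to misread. The following is a minimal sketch on a made-up four-column frame (the column names and values are purely illustrative and are not taken from the player table). It shows that `dropna(thresh=k)` keeps rows with at least `k` non-null values, and how the same filter could be phrased in terms of null counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 4 columns (not the real player data)
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, np.nan, np.nan],
    "d": [4.0, 6.0, 7.0],
})

# thresh counts NON-null values: keep rows that have at least 2 of them
kept = toy.dropna(thresh=2)

# The same filter expressed with null counts:
# a row survives if it has at most (number of columns - 2) nulls
max_nulls = toy.shape[1] - 2
kept_by_nulls = toy[toy.isnull().sum(axis=1) <= max_nulls]

print(len(toy), len(kept))
```

By the same arithmetic, `thresh=9` on a 42-column table would tolerate up to 33 nulls per row.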

In [15]:
player_data_cleaned = player_data.copy()
# thresh=9 keeps only rows that have at least 9 non-null values
player_data_cleaned = player_data_cleaned.dropna(thresh=9)
print(len(player_data))
print(len(player_data_cleaned))

183978
183142


The remaining dataset still contains null values that need fixing, so next I will show which of the remaining attributes have null values and how many each one has.

In [16]:
# Count the null values in each column
null = player_data_cleaned.isnull().sum()
filtered_null = null[null>0]
# Print the null counts for the columns that still contain nulls, largest first
print(filtered_null.sort_values(ascending=False))

attacking_work_rate    2394
volleys                1877
curve                  1877
agility                1877
balance                1877
jumping                1877
vision                 1877
sliding_tackle         1877
dtype: int64


We can see here that 8 attributes contain null values: 7 of them are continuous (volleys, curve, agility, balance, jumping, vision, sliding_tackle) and one, attacking_work_rate, is categorical.

**How can we impute values for these nulls?**

For <u>numerical/continuous</u> attributes, the options include using the median or mean of the column, or more sophisticated approaches such as K-Nearest Neighbors (KNN) imputation, which considers the similarity between rows. **I will use the mean.**

For <u>categorical</u> attributes, the most common approach is to assign the mode value. **I will use the mode.**
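
To make the options above concrete, here is a small sketch that compares mean, KNN, and mode imputation on a made-up frame. The column values are invented for illustration, and `n_neighbors=2` is an arbitrary choice for the toy data. This is only an illustration of the techniques, not the imputation applied to the player table.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: one missing numeric value, one missing category
toy = pd.DataFrame({
    "agility": [60.0, 62.0, 90.0, 61.0],
    "balance": [58.0, np.nan, 88.0, 59.0],
    "work_rate": ["medium", "high", np.nan, "medium"],
})
numeric_cols = ["agility", "balance"]

# Mean imputation: every gap in a column receives that column's mean
mean_filled = toy[numeric_cols].fillna(toy[numeric_cols].mean())

# KNN imputation: each gap is filled from the most similar rows instead
knn = KNNImputer(n_neighbors=2)
knn_filled = pd.DataFrame(
    knn.fit_transform(toy[numeric_cols]),
    columns=numeric_cols,
    index=toy.index,
)

# Mode imputation for the categorical column
mode_filled = toy["work_rate"].fillna(toy["work_rate"].mode()[0])

print(mean_filled.loc[1, "balance"], knn_filled.loc[1, "balance"])
print(mode_filled.tolist())
```

In this toy case the mean is pulled upwards by the one outlying row, whereas the KNN estimate stays close to the rows that resemble the incomplete one. That is the trade-off between the simple and the more sophisticated approach.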


In [17]:
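# Fill numeric nulls with the column mean (assign back rather than using inplace on a column)
for column in ['volleys', 'curve', 'agility', 'balance', 'jumping', 'vision', 'sliding_tackle']:
    player_data_cleaned[column] = player_data_cleaned[column].fillna(player_data_cleaned[column].mean())

# Fill categorical nulls with the most frequent value
mode_value = player_data_cleaned['attacking_work_rate'].mode()[0]
player_data_cleaned['attacking_work_rate'] = player_data_cleaned['attacking_work_rate'].fillna(mode_value)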


Now that all the null values have been dealt with, we can move on to the next part of the preprocessing stage.

## Feature Removal 

1. In the next stage I diverge greatly from Biswajit Basak's proposed method: I have opted to keep multiple records for each player, leveraging the variability in their performance across different matches, rather than averaging the attributes of every player.

   **Why?** This approach captures the dynamic nature of football performance, which is influenced by factors such as match context, opponent strength, and player condition, and so provides a richer dataset for the model.

2. The next step is to remove the player IDs and match dates from the dataset. This minimizes the focus on individual player identities and shifts the analysis towards a broader understanding of how various attributes affect overall performance in football.

In [19]:
player_data_cleaned = player_data_cleaned.drop(['player_fifa_api_id', 'player_api_id', 'id','date'], axis=1)
print(player_data_cleaned.head())

   overall_rating  potential preferred_foot attacking_work_rate  \
0            67.0       71.0          right              medium   
1            67.0       71.0          right              medium   
2            62.0       66.0          right              medium   
3            61.0       65.0          right              medium   
4            61.0       65.0          right              medium   

  defensive_work_rate  crossing  finishing  heading_accuracy  short_passing  \
0              medium      49.0       44.0              71.0           61.0   
1              medium      49.0       44.0              71.0           61.0   
2              medium      49.0       44.0              71.0           61.0   
3              medium      48.0       43.0              70.0           60.0   
4              medium      48.0       43.0              70.0           60.0   

   volleys  ...  vision  penalties  marking  standing_tackle  sliding_tackle  \
0     44.0  ...    54.0       48.0     65.

## Further Cleaning

After the initial data analysis, I also decided to **drop** the 'curve' and 'potential' attributes from the dataset.

1. **'Curve'**, as it is too specific and not universally applicable across all player roles.
2. **'Potential'**, as it is speculative and does not reflect current ability or performance.

This decision allows for a more focused analysis on attributes directly observable and quantifiable in terms of their effect on player performance, potentially leading to a model that is more practical and grounded in real-world applicability.

In [20]:
player_data_cleaned = player_data_cleaned.drop(['curve', 'potential'], axis=1)
print(player_data_cleaned.head())

   overall_rating preferred_foot attacking_work_rate defensive_work_rate  \
0            67.0          right              medium              medium   
1            67.0          right              medium              medium   
2            62.0          right              medium              medium   
3            61.0          right              medium              medium   
4            61.0          right              medium              medium   

   crossing  finishing  heading_accuracy  short_passing  volleys  dribbling  \
0      49.0       44.0              71.0           61.0     44.0       51.0   
1      49.0       44.0              71.0           61.0     44.0       51.0   
2      49.0       44.0              71.0           61.0     44.0       51.0   
3      48.0       43.0              70.0           60.0     43.0       50.0   
4      48.0       43.0              70.0           60.0     43.0       50.0   

   ...  vision  penalties  marking  standing_tackle  sliding_tackle 

## Categorical Attribute cleaning

Now that we have cleaned and preprocessed the numerical attributes, we need to tackle the categorical ones.

In [21]:
print("Before Cleaning")
print(player_data_cleaned['preferred_foot'].unique())
print(player_data_cleaned['attacking_work_rate'].unique())
print(player_data_cleaned['defensive_work_rate'].unique(),"\n")

# Clean attacking_work_rate
player_data_cleaned['attacking_work_rate'] = player_data_cleaned['attacking_work_rate'].replace({
    'None': 'medium', 'le': 'low', 'norm': 'medium', 'stoc': 'medium', 'y': 'high'
})

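# Standardize defensive_work_rate: map known codes to low/medium/high;
# any value not listed below falls back to 'medium'
low_codes = ['_0', '1', '2', '3', 'ean', 'o', 'tocky', '0', 'low']
medium_codes = ['4', '5', '6', 'ormal', 'medium']
high_codes = ['7', '8', '9', 'es', 'high']
defensive_map = {**dict.fromkeys(low_codes, 'low'), **dict.fromkeys(medium_codes, 'medium'), **dict.fromkeys(high_codes, 'high')}
player_data_cleaned['defensive_work_rate'] = player_data_cleaned['defensive_work_rate'].map(defensive_map).fillna('medium')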

print("After Cleaning")
print(player_data_cleaned['preferred_foot'].unique())
print(player_data_cleaned['attacking_work_rate'].unique())
print(player_data_cleaned['defensive_work_rate'].unique(),"\n")


Before Cleaning
['right' 'left']
['medium' 'high' 'low' 'None' 'le' 'norm' 'stoc' 'y']
['medium' 'high' 'low' '_0' '5' 'ean' 'o' '1' 'ormal' '7' '2' '8' '4'
 'tocky' '0' '3' '6' '9' 'es'] 

After Cleaning
['right' 'left']
['medium' 'high' 'low']
['medium' 'high' 'low'] 



Now that we have removed all the erroneous entries from our categorical features, we can use one-hot encoding (`OneHotEncoder`) to transform them into a form that can be fed into an ML model.

In [22]:
categorical_columns = ['preferred_foot', 'defensive_work_rate', 'attacking_work_rate']

# scikit-learn >= 1.2 names this argument sparse_output (older releases used sparse)
encoder = OneHotEncoder(sparse_output=False)

encoded_data = encoder.fit_transform(player_data_cleaned[categorical_columns])
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(categorical_columns))

player_data_cleaned = player_data_cleaned.drop(columns=categorical_columns)
player_data_cleaned = pd.concat([player_data_cleaned.reset_index(drop=True), encoded_df.reset_index(drop=True)], axis=1)



In [23]:
print(player_data_cleaned.head())

   overall_rating  crossing  finishing  heading_accuracy  short_passing  \
0            67.0      49.0       44.0              71.0           61.0   
1            67.0      49.0       44.0              71.0           61.0   
2            62.0      49.0       44.0              71.0           61.0   
3            61.0      48.0       43.0              70.0           60.0   
4            61.0      48.0       43.0              70.0           60.0   

   volleys  dribbling  free_kick_accuracy  long_passing  ball_control  ...  \
0     44.0       51.0                39.0          64.0          49.0  ...   
1     44.0       51.0                39.0          64.0          49.0  ...   
2     44.0       51.0                39.0          64.0          49.0  ...   
3     43.0       50.0                38.0          63.0          48.0  ...   
4     43.0       50.0                38.0          63.0          48.0  ...   

   gk_positioning  gk_reflexes  preferred_foot_left  preferred_foot_right  \
0  

## Standard Scaling

The next step is feature scaling. Scaling can significantly affect the performance of models that are sensitive to input scales, such as models trained with gradient descent, k-nearest neighbors, and models that use regularization. We only need to scale the numerical data, as the categorical data is already one-hot encoded.
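
For reference, standard scaling replaces each value with `(x - mean) / std`, computed per column. The sketch below checks this on a small made-up array (the numbers are illustrative only) by comparing `StandardScaler` against the manual formula.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: two features on different scales
values = np.array([[40.0, 10.0],
                   [50.0, 20.0],
                   [60.0, 30.0]])

toy_scaler = StandardScaler()
scaled = toy_scaler.fit_transform(values)

# Manual version: subtract the column mean, divide by the population std (ddof=0)
manual = (values - values.mean(axis=0)) / values.std(axis=0)

# inverse_transform maps scaled values back to the original units
restored = toy_scaler.inverse_transform(scaled)

print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1 (up to floating-point error), so the feature with the larger raw scale no longer dominates.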

In [24]:
numerical_attributes = [
    'crossing', 'finishing', 'heading_accuracy', 'short_passing', 'volleys', 'dribbling',
    'free_kick_accuracy', 'long_passing', 'ball_control', 'acceleration', 'sprint_speed',
    'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina', 'strength',
    'long_shots', 'aggression', 'interceptions', 'positioning', 'vision', 'penalties',
    'marking', 'standing_tackle', 'sliding_tackle', 'gk_diving', 'gk_handling',
    'gk_kicking', 'gk_positioning', 'gk_reflexes',
]

numerical_data = player_data_cleaned[numerical_attributes]
mean_values = numerical_data.mean()
print(mean_values)

scaler = StandardScaler()
scaled_numerical_data = scaler.fit_transform(numerical_data)
scaled_numerical_df = pd.DataFrame(scaled_numerical_data, columns=numerical_attributes, index=numerical_data.index)

player_data_cleaned = player_data_cleaned.drop(columns=numerical_attributes)

player_data_cleaned_scaled = pd.concat([player_data_cleaned, scaled_numerical_df], axis=1)

crossing              55.086883
finishing             49.921078
heading_accuracy      57.266023
short_passing         62.429672
volleys               49.468436
dribbling             59.175154
free_kick_accuracy    49.380950
long_passing          57.069880
ball_control          63.388879
acceleration          67.659357
sprint_speed          68.051244
agility               65.970910
reactions             66.103706
balance               65.189496
shot_power            61.808427
jumping               66.969045
stamina               67.038544
strength              67.424529
long_shots            53.339431
aggression            60.948046
interceptions         52.009271
positioning           55.786504
vision                57.873550
penalties             55.003986
marking               46.772242
standing_tackle       50.351257
sliding_tackle        48.001462
gk_diving             14.704393
gk_handling           16.063612
gk_kicking            20.998362
gk_positioning        16.132154
gk_refle

In [25]:
dump(encoder, 'encoder.joblib')
dump(scaler, 'scaler.joblib')

['scaler.joblib']

## Test/Train split

Now that our data has been cleaned and preprocessed we are nearly ready for model training. All we need to do is split the target attribute from the rest of the features and then split into test and train sets so we can accurately and fairly evaluate our model. 

We do this by first splitting the target and features then randomly picking 80% of the data to be in our training set and 20% to be in our test.
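
The split described above can be sketched on synthetic data as follows. The toy frame and the `feature_a` column are made up for illustration. The sketch shows the 80/20 proportions and that fixing `random_state` makes the partition reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with 100 rows: one target column and one feature
toy = pd.DataFrame({
    "overall_rating": np.arange(100, dtype=float),
    "feature_a": np.arange(100, dtype=float) * 2,
})

toy_target = toy["overall_rating"]
toy_features = toy.drop("overall_rating", axis=1)

# 80% of the rows go to training, 20% to testing
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    toy_features, toy_target, test_size=0.2, random_state=42
)

# Repeating the call with the same random_state yields the same partition
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    toy_features, toy_target, test_size=0.2, random_state=42
)

print(len(Xa_train), len(Xa_test))
```

Features and target are split together, so each row's target stays aligned with its features, and no row appears in both sets.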

In [None]:
target = player_data_cleaned_scaled['overall_rating']
features = player_data_cleaned_scaled.drop('overall_rating', axis=1)

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)


### Finally, convert to CSV for the next step

In [None]:
X_train.to_csv('X_train.csv', index=False)
X_test.to_csv('X_test.csv', index=False)
y_train.to_csv('y_train.csv', index=False)
y_test.to_csv('y_test.csv', index=False)

In [26]:
from joblib import load

model = load('predictionmodel.joblib')
test_prediction = model.predict([[0, 1, 0, 0, 1, 0, 0, 1, -0.005, 0.004, -0.016, -0.030, -0.025, -0.009, -0.021, -0.004, -0.025, 0.026, -0.799, 0.002, -0.011, -0.014, 0.011, 0.002, -0.002, -0.035, -0.018, 0.003, -0.0004, 0.011, 0.008, -0.0002, 0.010, -0.016, -0.00006, -0.041, -0.004, 0.00007, -0.008, -0.025]])
print(test_prediction)

[59.59666667]


