# 3 - Feature Engineering and Data Preprocessing

## Project Goal
This notebook preprocesses the dataset saved by the EDA notebook (imputation, encoding, and scaling) so that it can be passed directly to machine learning models.

## Objectives
1.  Load the data from the EDA phase.
2.  Handle missing values through imputation.
3.  One-hot encode the categorical feature `Position` (`Club` and `Nationality` are dropped rather than encoded).
4.  Standardize numerical features.
5.  Save the final **model-ready dataset**.

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Load the data saved from the 02_eda.ipynb notebook
FILE_PATH = '../data/02_eda_data.csv'
df_modeling = pd.read_csv(FILE_PATH)

print(f"Data successfully loaded. Shape: {df_modeling.shape}")

Data successfully loaded. Shape: (120, 34)


In [6]:
# Identify the numerical and categorical columns that will be processed
numerical_features = [
    'League Rating', 'UCL Rating', 'WC Rating', 'Other Rating', 
    'Time-Weighted Google Trend Score', 'Wiki Page Views',
    'League', 'UCL', 'WC', 'Continental Cup', 
    'League Top Scorer', 'League Top Assist Provider', 'League Best Goalkeeper', 'League Best Player', 
    'UCL Top Scorer', 'UCL Top Assist Provider', 'UCL Best Goalkeeper', 'UCL Best Player',
    'Continental Cup Rating', 'Continental Cup Top Scorer', 'Continental Cup Top Assist Provider', 
    'Continental Cup Best Goalkeeper', 'Continental Cup Best Player',
    'WC Top Scorer', 'WC Top Assist Provider', 'WC Best Goalkeeper', 'WC Best Player', 
]

categorical_features = ['Position']

# Exclude the identifiers (Player, Rank, Year), the target (Vote Share),
# and the categorical columns that are not encoded (Nationality, Club)
X = df_modeling.drop(columns=['Player', 'Rank', 'Vote Share', 'Nationality', 'Club', 'Year'])

In [7]:
# Missing values are generally due to non-participation (e.g., WC Rating) or absence (e.g., Continental Cup).

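# Fill missing values with 0 in every WC and Continental Cup column.
# The substring match covers the binary/count columns and also the two
# rating columns ('WC Rating', 'Continental Cup Rating').
zero_fill_cols = [col for col in numerical_features if 'WC' in col or 'Continental' in col]
X[zero_fill_cols] = X[zero_fill_cols].fillna(0)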

# Check for remaining missing values (should be zero in the numerical columns of interest)
print("Remaining Missing Values after Imputation:")
print(X[numerical_features].isnull().sum().loc[lambda x: x>0])

Remaining Missing Values after Imputation:
Series([], dtype: int64)
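
To make the effect of the substring filter visible, the sketch below applies the same condition to the `numerical_features` list copied from the cell above and splits the matches into rating columns and the rest. The names `zero_filled`, `rating_cols`, and `flag_cols` are introduced here for illustration only and do not appear in the notebook.

```python
# Column names copied from the notebook's numerical_features list
numerical_features = [
    'League Rating', 'UCL Rating', 'WC Rating', 'Other Rating',
    'Time-Weighted Google Trend Score', 'Wiki Page Views',
    'League', 'UCL', 'WC', 'Continental Cup',
    'League Top Scorer', 'League Top Assist Provider', 'League Best Goalkeeper', 'League Best Player',
    'UCL Top Scorer', 'UCL Top Assist Provider', 'UCL Best Goalkeeper', 'UCL Best Player',
    'Continental Cup Rating', 'Continental Cup Top Scorer', 'Continental Cup Top Assist Provider',
    'Continental Cup Best Goalkeeper', 'Continental Cup Best Player',
    'WC Top Scorer', 'WC Top Assist Provider', 'WC Best Goalkeeper', 'WC Best Player',
]

# Same condition as the imputation cell
zero_filled = [c for c in numerical_features if 'WC' in c or 'Continental' in c]

# Illustrative split: rating columns versus binary/count columns
rating_cols = [c for c in zero_filled if c.endswith('Rating')]
flag_cols = [c for c in zero_filled if not c.endswith('Rating')]

print(len(zero_filled), rating_cols)
```

Whether 0 is an appropriate fill value for the two rating columns is a modelling choice. The sketch only shows that they are included in the fill.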


In [8]:
# Initialize One-Hot Encoder
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Fit and transform the categorical features
X_cat_encoded = ohe.fit_transform(X[categorical_features])

# Create a DataFrame for the encoded features with proper column names
onehot_cols = ohe.get_feature_names_out(categorical_features)
X_cat_encoded_df = pd.DataFrame(X_cat_encoded, columns=onehot_cols, index=X.index)

# Drop the original categorical columns and concatenate the encoded ones.
# Both frames keep X's index so that pd.concat(axis=1) aligns them row by row.
X_processed_df = X.drop(columns=categorical_features)
X_processed_df = pd.concat([X_processed_df, X_cat_encoded_df], axis=1)

print(f"Categorical features successfully encoded. Total features: {X_processed_df.shape[1]}")
X_processed_df.head()

Categorical features successfully encoded. Total features: 31


Unnamed: 0,League,League Rating,League Top Scorer,League Top Assist Provider,League Best Goalkeeper,League Best Player,UCL,UCL Rating,UCL Top Scorer,UCL Top Assist Provider,...,WC Top Assist Provider,WC Best Goalkeeper,WC Best Player,Other Rating,Time-Weighted Google Trend Score,Wiki Page Views,Position_Defender,Position_Forward,Position_Goalkeeper,Position_Midfielder
0,1,7.73,1,0,0,1,1,8.08,1,0,...,0.0,0.0,0.0,7.46,25.183089,3635167,0.0,1.0,0.0,0.0
1,0,7.22,0,0,0,0,0,7.03,0,0,...,0.0,0.0,0.0,7.56,8.700908,2615768,0.0,1.0,0.0,0.0
2,1,7.78,0,0,0,1,0,7.27,0,0,...,0.0,0.0,0.0,7.85,12.795947,1645548,0.0,0.0,0.0,1.0
3,1,7.47,1,0,0,0,0,7.84,0,0,...,0.0,0.0,0.0,7.69,28.143955,3785756,0.0,1.0,0.0,0.0
4,0,7.14,1,1,0,0,0,7.0,0,0,...,0.0,0.0,0.0,7.11,18.823201,4237344,0.0,1.0,0.0,0.0


In [9]:
# Initialize StandardScaler
scaler = StandardScaler()

# Scale every column listed in numerical_features; the one-hot Position columns stay as 0/1
cols_to_scale = list(numerical_features)

# Note: numerical_features also contains the binary/count trophy and award columns,
# so those are standardized together with the ratings and popularity scores
X_processed_df[cols_to_scale] = scaler.fit_transform(X_processed_df[cols_to_scale])

print("Numerical features successfully scaled (StandardScaler applied).")
X_processed_df[cols_to_scale].head()

Numerical features successfully scaled (StandardScaler applied).


Unnamed: 0,League Rating,UCL Rating,WC Rating,Other Rating,Time-Weighted Google Trend Score,Wiki Page Views,League,UCL,WC,Continental Cup,...,UCL Best Player,Continental Cup Rating,Continental Cup Top Scorer,Continental Cup Top Assist Provider,Continental Cup Best Goalkeeper,Continental Cup Best Player,WC Top Scorer,WC Top Assist Provider,WC Best Goalkeeper,WC Best Player
0,1.205121,0.68019,-0.485899,0.321297,1.598513,0.239192,0.889407,1.855921,-0.185695,-0.284747,...,5.385165,-0.589839,-0.160128,-0.09167,-0.09167,-0.130189,-0.09167,-0.160128,-0.09167,-0.09167
1,-0.561851,0.250704,-0.485899,0.621809,-0.159285,-0.0185,-1.124345,-0.538816,-0.185695,3.511885,...,-0.185695,1.743554,-0.160128,-0.09167,-0.09167,7.681146,-0.09167,-0.160128,-0.09167,-0.09167
2,1.378353,0.348872,-0.485899,1.493293,0.277445,-0.263759,0.889407,-0.538816,-0.185695,-0.284747,...,-0.185695,-0.589839,-0.160128,-0.09167,-0.09167,-0.130189,-0.09167,-0.160128,-0.09167,-0.09167
3,0.304312,0.582022,-0.485899,1.012474,1.914285,0.277259,0.889407,-0.538816,-0.185695,-0.284747,...,-0.185695,-0.589839,-0.160128,-0.09167,-0.09167,-0.130189,-0.09167,-0.160128,-0.09167,-0.09167
4,-0.839023,0.238433,-0.485899,-0.730494,0.920242,0.391415,-1.124345,-0.538816,-0.185695,-0.284747,...,-0.185695,1.649339,-0.160128,-0.09167,-0.09167,-0.130189,-0.09167,-0.160128,-0.09167,-0.09167
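
Because the binary columns are standardized as well, rare flags end up with small negative values for 0 and large positive values for 1. The sketch below reproduces that arithmetic in plain Python, using the population standard deviation that `StandardScaler` applies. The helper `binary_flag_z_scores` is hypothetical, and the counts passed to it (1 and 4 positives out of 120 rows) are inferred from the printed values, not stated in the notebook.

```python
import math

def binary_flag_z_scores(n_rows, n_positive):
    """Return (z-score of 0, z-score of 1) for a 0/1 column after standardization."""
    p = n_positive / n_rows            # column mean of a 0/1 flag
    std = math.sqrt(p * (1 - p))       # population standard deviation (ddof=0)
    return (0 - p) / std, (1 - p) / std

# A flag that is set in a single row out of 120
z0_one, z1_one = binary_flag_z_scores(n_rows=120, n_positive=1)

# A flag that is set in four rows out of 120
z0_four, z1_four = binary_flag_z_scores(n_rows=120, n_positive=4)

print(round(z0_one, 6), round(z1_one, 6))
print(round(z0_four, 6), round(z1_four, 6))
```

Under these assumed counts the results agree with the table: about -0.09167 for the zeros of a flag set once, and about -0.185695 and 5.385165 for the zeros and ones of a flag set four times, as seen in `UCL Best Player`.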


## 4. Final Checkpoint: Model-Ready Data

The final step is to recombine the processed features with the identifiers (`Player`, `Rank`, `Year`) and the target variable (`Vote Share`), then save the result as the model-ready dataset.

In [10]:
# Re-attach the identifiers and the target variable to the processed features
df_final_model = pd.concat([
    df_modeling[['Player', 'Rank', 'Vote Share', 'Year']], # Identifiers and Target
    X_processed_df
], axis=1)

print(f"Final Model-Ready Dataset Shape: {df_final_model.shape}")
df_final_model.head()

Final Model-Ready Dataset Shape: (120, 35)


Unnamed: 0,Player,Rank,Vote Share,Year,League,League Rating,League Top Scorer,League Top Assist Provider,League Best Goalkeeper,League Best Player,...,WC Top Assist Provider,WC Best Goalkeeper,WC Best Player,Other Rating,Time-Weighted Google Trend Score,Wiki Page Views,Position_Defender,Position_Forward,Position_Goalkeeper,Position_Midfielder
0,Karim Benzema,1,1.0,2022,0.889407,1.205121,2.171241,-0.301511,-0.09167,2.171241,...,-0.160128,-0.09167,-0.09167,0.321297,1.598513,0.239192,0.0,1.0,0.0,0.0
1,Sadio Mané,2,0.351548,2022,-1.124345,-0.561851,-0.460566,-0.301511,-0.09167,-0.460566,...,-0.160128,-0.09167,-0.09167,0.621809,-0.159285,-0.0185,0.0,1.0,0.0,0.0
2,Kevin De Bruyne,3,0.318761,2022,0.889407,1.378353,-0.460566,-0.301511,-0.09167,2.171241,...,-0.160128,-0.09167,-0.09167,1.493293,0.277445,-0.263759,0.0,0.0,0.0,1.0
3,Robert Lewandowski,4,0.309654,2022,0.889407,0.304312,2.171241,-0.301511,-0.09167,-0.460566,...,-0.160128,-0.09167,-0.09167,1.012474,1.914285,0.277259,0.0,1.0,0.0,0.0
4,Mohamed Salah,5,0.211293,2022,-1.124345,-0.839023,2.171241,3.316625,-0.09167,-0.460566,...,-0.160128,-0.09167,-0.09167,-0.730494,0.920242,0.391415,0.0,1.0,0.0,0.0
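
`pd.concat(..., axis=1)` matches rows by index label, not by position. In this notebook every frame derives from the same `read_csv` call, so the labels line up. The sketch below uses two hypothetical toy frames (`ids` and `feats`) to show what could happen if the indices ever differed, for example after filtering rows.

```python
import pandas as pd

# Hypothetical identifier frame with a non-default index (e.g. left over from filtering)
ids = pd.DataFrame({"Player": ["A", "B", "C"]}, index=[10, 11, 12])

# Hypothetical feature frame with the default 0..2 index
feats = pd.DataFrame({"x": [0.1, 0.2, 0.3]})

# Index labels do not overlap, so the result has six rows padded with NaN
misaligned = pd.concat([ids, feats], axis=1)

# Resetting the index first restores row-by-row alignment
aligned = pd.concat([ids.reset_index(drop=True), feats], axis=1)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```

A shape check like the one printed by the cell above (120 rows in, 120 rows out) is a cheap way to catch this kind of misalignment.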


In [11]:
# Save the model-ready dataset
df_final_model.to_csv('../data/03_processed_data.csv', index=False)

print("\nFinal processed dataset saved to '../data/03_processed_data.csv'.")
print("The next notebook will be 04_model_and_analyze.ipynb!")


Final processed dataset saved to '../data/03_processed_data.csv'.
The next notebook will be 04_model_and_analyze.ipynb!
