In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("shared/data/fifa22.csv")
df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'shared/data/fifa22.csv'

The unit of analysis appears to be an individual soccer player.

In [None]:
observations, features = df.shape
print(f"Number of observations: {observations}")
print(f"Number of features: {features}")

There are 19,630 observations and 20 features in the dataset. 

In [None]:
df['gender'].value_counts()

There are 19,239 male players and 391 female players in the dataset. 

In [None]:
# drop only rows where the column passing contains a NaN for practice
df_dropped = df.dropna(subset=['passing'])

# display shape
df_dropped.shape

In [None]:
import statsmodels.api as sm

In [None]:
X = df[['passing', 'attacking', 'defending', 'skill']]
y = df['rank']
X = sm.add_constant(X)
model = sm.OLS(y, X, missing='drop').fit()
model.summary()

In [None]:
fitted_values = model.fittedvalues
residuals = model.resid

plt.scatter(fitted_values, residuals)
plt.title('Residual vs. Fitted Plot')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.show()

In the residual-versus-fitted plot, heteroskedasticity does not appear to be a problem: the residuals are scattered around zero with no clear trend, and their spread does not visibly widen or narrow across the fitted values.
One concern is the group of outliers with residuals in the 15-20 range (at roughly x=65). Because a residual is the difference between the observed and predicted values, unusually large residuals suggest that the model's accuracy is inconsistent for some players.
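
A rough numeric check can back up the visual read of the plot: sort the residuals by fitted value, split them into bins, and compare the residual standard deviation in each bin. The sketch below uses synthetic data rather than the FIFA dataset, and `residual_spread_by_bin` is a hypothetical helper, not part of statsmodels. With the model above, one would pass `model.fittedvalues` and `model.resid` instead.

```python
import numpy as np

def residual_spread_by_bin(fitted, resid, n_bins=4):
    """Residual standard deviation within each bin of fitted values (hypothetical helper)."""
    fitted = np.asarray(fitted, dtype=float)
    resid = np.asarray(resid, dtype=float)
    order = np.argsort(fitted)
    # Split the residuals, ordered by fitted value, into roughly equal-sized bins
    bins = np.array_split(resid[order], n_bins)
    return np.array([b.std(ddof=1) for b in bins])

# Synthetic illustration only: constant noise vs. noise that grows with the fitted value
rng = np.random.default_rng(0)
fitted = rng.uniform(40, 90, size=4000)
homoskedastic = rng.normal(0, 3, size=4000)
heteroskedastic = rng.normal(0, 1, size=4000) * (fitted / 20)

spread_const = residual_spread_by_bin(fitted, homoskedastic)
spread_grow = residual_spread_by_bin(fitted, heteroskedastic)
print("constant-variance spreads:", spread_const.round(2))
print("growing-variance spreads: ", spread_grow.round(2))
```

Roughly equal spreads across bins are consistent with constant variance. A spread that rises or falls steadily from the first bin to the last would point to heteroskedasticity.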

The share of the variation in rank that is explained by our features is given by the R-squared statistic, which is 0.705. In other words, about 70.5% of the variation in rank is explained by the features in our model.
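
To make the R-squared statistic concrete, here is a minimal sketch that computes it by hand as 1 - SS_res / SS_tot. It uses synthetic data and NumPy least squares instead of the statsmodels fit, and `r_squared` is a helper name introduced only for this illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R-squared = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data (assumption: one feature plus noise, not the FIFA data)
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2.0 + 1.5 * x + rng.normal(size=500)

# OLS with an intercept, solved by least squares
design = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
r2 = r_squared(y, design @ beta)
print(f"R-squared: {r2:.3f}")
```

For the model above, the same quantity is what the summary table reports as R-squared.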

Holding passing, attacking, and defending constant, a 1-unit increase in “skill” is associated with a 0.0066 increase in rank. I got this value from the coefficient on skill in the model summary. Note, however, that the p-value for skill is 0.465, so this coefficient is not statistically distinguishable from zero at conventional significance levels.

A 95% confidence interval for the effect of a 1-unit increase in skill (holding passing, attacking, and defending constant) on rank is [-0.011, 0.024], as seen in the model summary.

Since the 95% confidence interval for the effect of a 1-unit increase in skill (holding other features constant) on rank is [-0.011, 0.024], we can glean several things. First, if we were to repeatedly collect random samples from the population and calculate a confidence interval for the skill coefficient from each one, then about 95% of those intervals would contain the true coefficient. Second, the 95% confidence interval contains zero, meaning that a 1-unit increase in skill (holding other features constant) could correspond to a negative effect, a positive effect, or no effect at all. Because the interval includes zero, the null hypothesis that the coefficient of skill is zero cannot be rejected at the 5% level.
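
The repeated-sampling interpretation can be checked with a small simulation. The sketch below is an illustration under stated assumptions: the data are synthetic, the "true" slope of 0.5 is chosen arbitrarily, and the interval uses the normal approximation (1.96) rather than the exact t critical value that statsmodels reports.

```python
import numpy as np

rng = np.random.default_rng(42)
true_slope = 0.5          # assumed population coefficient for this simulation
n, n_sims = 200, 2000
covered = 0

for _ in range(n_sims):
    # Draw a fresh random sample from the same population
    x = rng.normal(size=n)
    y = 1.0 + true_slope * x + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 2)            # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)       # covariance of the coefficient estimates
    se = np.sqrt(cov[1, 1])
    lo, hi = beta[1] - 1.96 * se, beta[1] + 1.96 * se
    covered += int(lo <= true_slope <= hi)

coverage = covered / n_sims
print(f"Share of intervals containing the true slope: {coverage:.3f}")
```

The printed share should land close to 0.95: the true coefficient stays fixed, and it is the interval that changes from sample to sample.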

Based on the OLS regression, we can expect the four features (passing, attacking, defending, and skill) to do fairly well at predicting rank for out-of-sample data, for a few reasons. First, the R-squared value is 0.705, meaning that 70.5% of the variance in rank is explained by the four features, although this is an in-sample figure and out-of-sample performance still has to be checked on held-out data. Second, if we choose an alpha level of 0.05, passing, attacking, and defending are statistically significant predictors of rank. However, skill has a p-value of 0.465, indicating that it is not statistically significant.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Create an X dataframe with just four features: passing, attacking, defending, and skill
X = df[['passing', 'attacking', 'defending', 'skill']]
# Create a Y dataframe (or series) with just the “rank” variable
Y = df['rank']

In [None]:
# X first five rows
X.head()

In [None]:
# Y first five rows
Y.head()

In [None]:
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.25, random_state=123)
X_train.head()

In [None]:
from sklearn.linear_model import LinearRegression
X_train_clean = X_train.dropna()
Y_train_clean = Y_train.loc[X_train_clean.index]

# Train a linear regression model
linear_model = LinearRegression()
linear_model.fit(X_train_clean, Y_train_clean)

# Get intercept and coefs
intercept = linear_model.intercept_
coefficients = linear_model.coef_

intercept, coefficients

In [None]:
# Get difference between Attacking coefficients from both models
statsmodels_coef = 0.6109
sklearn_coef = coefficients[1]
print(f"Difference between Attacking coefficient: {statsmodels_coef - sklearn_coef}")

The coefficients are very similar. In the statsmodels OLS model, the coefficients were: Passing: -0.0247, Attacking: 0.6109, Defending: 0.1719, Skill: 0.0066. The scikit-learn linear regression fit above has the following coefficients: Passing: -0.02738345, Attacking: 0.61570071, Defending: 0.17364093, Skill: 0.00464593. Notice that the "Attacking" coefficient is roughly 0.0048 higher in the scikit-learn model than in the statsmodels one (the difference printed above, statsmodels minus scikit-learn, is about -0.0048).
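
Small differences like these are expected, because the statsmodels model was fit on all rows of `df` (minus missing values) while the scikit-learn model was fit only on the 75% training split. The sketch below illustrates the effect with synthetic data and plain NumPy least squares. The helper `ols_coefs` and the data-generating coefficients are assumptions made for the illustration.

```python
import numpy as np

def ols_coefs(X, y):
    """Least-squares coefficients, intercept first (hypothetical helper)."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

# Synthetic data with known coefficients (not the FIFA data)
rng = np.random.default_rng(123)
n = 5000
X = rng.normal(size=(n, 2))
y = 1.0 + 0.6 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Fit once on the full sample and once on a random 75% subset
train_idx = rng.permutation(n)[: int(0.75 * n)]
beta_full = ols_coefs(X, y)
beta_train = ols_coefs(X[train_idx], y[train_idx])

print("full sample:", beta_full.round(4))
print("75% subset: ", beta_train.round(4))
print("difference: ", (beta_full - beta_train).round(4))
```

Both fits estimate the same underlying relationship, so the coefficients agree to a couple of decimal places but are not identical.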

In [None]:
# use linear_model to predict X validation set data
Y_val_predictions = linear_model.predict(X_val.dropna())

# Display the first three predicted values
Y_val_predictions[:3]

In [None]:
Y_val_clean = Y_val.loc[X_val.dropna().index]
# Scatterplot of actual vs predicted Y values
plt.figure(figsize=(8, 6))
plt.scatter(Y_val_clean, Y_val_predictions)
# Plot x = y to compare with observed vs. predicted
plt.plot([Y_val_clean.min(), Y_val_clean.max()], [Y_val_clean.min(), Y_val_clean.max()], color='red') 
plt.title("Actual vs Predicted Y Values")
plt.xlabel("Actual Y Values (Validation Data)")
plt.ylabel("Predicted Y Values")
plt.grid(True)
plt.show()

In [None]:
from sklearn.metrics import mean_squared_error
import numpy as np
rmse = np.sqrt(mean_squared_error(Y_val_clean, Y_val_predictions))
rmse

As the calculation above shows, the Root Mean Squared Error is about 3.7308. This means that the typical size of the error between the predicted and observed Y-values is roughly 3.73 rank units, with larger errors weighted more heavily than in a simple average. Based on this, the model seems to provide fairly accurate predictions, although there is still room for improvement.

As mentioned above, I think this model provides fairly accurate predictions for player rank. The RMSE is about 3.7308, meaning that predictions are typically off by approximately 3.73 rank units. The scatterplot of predicted vs. observed Y-values shows that predictions generally align well with the x=y line, suggesting that the model captures the underlying pattern nicely. While there are some outliers, the model performs well overall in predicting player rank.
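
Whether an RMSE is "good" is easier to judge next to a reference point. The sketch below computes RMSE and mean absolute error by hand on a few made-up values (not the notebook's predictions) and compares them with a naive baseline that always predicts the mean. The helper names `rmse` and `mae` are introduced only for this example.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: large errors are weighted more heavily."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error: the plain average size of the errors."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# Made-up values for illustration only
y_true = np.array([60.0, 65.0, 70.0, 75.0, 80.0])
y_pred = np.array([62.0, 64.0, 70.0, 71.0, 83.0])

model_rmse = rmse(y_true, y_pred)
model_mae = mae(y_true, y_pred)
# Naive baseline: predict the mean of the observed values for everyone
baseline_rmse = rmse(y_true, np.full_like(y_true, y_true.mean()))

print(f"model RMSE:    {model_rmse:.3f}")
print(f"model MAE:     {model_mae:.3f}")
print(f"baseline RMSE: {baseline_rmse:.3f}")
```

In the notebook, the analogous baseline would predict the mean of `Y_train_clean` for every row of the validation set. The further the model's RMSE falls below that baseline, the more the features are contributing.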

In [None]:
df['preferred_foot'].value_counts()

In [None]:
right_foot_percentage = (df['preferred_foot'].value_counts()['Right'] / df.shape[0]) * 100
print(f"% of right-footed players: {round(right_foot_percentage, 2)}")

In [None]:
X_classifier = df[['shooting', 'passing', 'dribbling', 'defending', 'attacking', 
                          'skill', 'movement', 'power', 'mentality', 'goalkeeping']]
X_classifier.head()

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_normalized = scaler.fit_transform(X_classifier)
X_normalized_df = pd.DataFrame(X_normalized, columns=X_classifier.columns)

# Display the first five rows of the normalized data
X_normalized_df.head()

In [None]:
Y_classifier = df['preferred_foot']
X_train_classifier, X_val_classifier, Y_train_classifier, Y_val_classifier = train_test_split(
    X_normalized_df, Y_classifier, test_size=0.30, random_state=456)

# Display first 5 rows of X training data
X_train_classifier.head()

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Although it may not be strictly necessary, let's impute missing values rather than dropping the rows that contain them.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_train_classifier_imputed = imputer.fit_transform(X_train_classifier)
X_val_classifier_imputed = imputer.transform(X_val_classifier)
# Initialize list for k_values and accuracies - will append accuracies list later
k_values = list(range(1, 31))
accuracies = []
# for loop through list and calculate accuracy
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_classifier_imputed, Y_train_classifier)
    # Predict on the validation set
    Y_val_pred = knn.predict(X_val_classifier_imputed)
    accuracy = accuracy_score(Y_val_classifier, Y_val_pred)
    accuracies.append(accuracy)

# Widen the figure; otherwise the x-axis tick labels are squashed
plt.figure(figsize=(10, 6))
plt.plot(k_values, accuracies, marker='o')
plt.title('KNN Classifier Accuracy vs. Number of Neighbors (k)')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Accuracy')
plt.xticks(k_values)
plt.grid(True)
plt.show()

In [None]:
# Look up the best k instead of hard-coding it in the message
best_k = k_values[accuracies.index(max(accuracies))]
print(f"Best value for K: {best_k}")
print(f"Accuracy for K-value {best_k}: {round(max(accuracies), 4)}")

As you can see above, the most reasonable k-value is 29, because it has the highest validation accuracy: roughly 0.7709 when rounded to four decimal places.

In [None]:
knn_final = KNeighborsClassifier(n_neighbors=29)
knn_final.fit(X_train_classifier_imputed, Y_train_classifier)
Y_val_final_pred = knn_final.predict(X_val_classifier_imputed)
# Display (at least) the first 3 predictions for “preferred foot.”
Y_val_final_pred[:3]

In [None]:
from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix(Y_val_classifier, Y_val_final_pred)
# Extract num for True Lefts predicted as Right
true_left_pred_right = conf_matrix[0, 1]
true_left_pred_right

In [None]:
from sklearn.metrics import classification_report
class_report = classification_report(Y_val_classifier, Y_val_final_pred)
print(class_report)

The recall here (0.05) means that the model correctly identified only 5% of the true left-footers in the validation set. One must always be careful when classes are very unbalanced. We previously saw that the soccer players in this study skew heavily right-footed, so a KNN model trained on this dataset can be biased towards predicting right-footers. Notice how the recall for right-footed players is very high, around 99%.
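
To show where the recall figures in a classification report come from, here is a minimal sketch that derives recall, precision, and accuracy from a 2x2 confusion matrix. The counts are illustrative only and are not the notebook's actual confusion matrix. The layout (rows are true classes, columns are predicted classes, labels in sorted order) matches what `confusion_matrix` returns.

```python
import numpy as np

labels = ["Left", "Right"]
# Illustrative counts only (rows = true class, columns = predicted class)
conf = np.array([[5, 95],
                 [3, 297]])

recall = conf.diagonal() / conf.sum(axis=1)      # correct / all true members of the class
precision = conf.diagonal() / conf.sum(axis=0)   # correct / all predictions of the class
accuracy = conf.diagonal().sum() / conf.sum()

for name, r, p in zip(labels, recall, precision):
    print(f"{name:>5}: recall={r:.2f}, precision={p:.2f}")
print(f"accuracy={accuracy:.3f}")
```

With these made-up counts, accuracy looks respectable even though almost every true left-footer is misclassified, which is the same pattern described above.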

Overall, the model does a poor job of predicting a player's preferred foot. As we calculated earlier, simply guessing that a player is right-footed every time would yield an accuracy of 76.34%, which is already quite high. After optimizing k for KNN, the model achieved only a marginal improvement, with an accuracy of 77.09%. While this may seem satisfactory at first, the recall for left-footers was extremely low. The model predicts right-footers well, but this is largely because it guesses 'Right' most of the time due to the class imbalance. Accuracy alone is not a sufficient metric for evaluating models when classes are unbalanced. The model's inability to identify left-footed players makes it inadequate for predicting a player's preferred foot.
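
The point about accuracy under class imbalance can be illustrated with a majority-class baseline. The sketch below uses hypothetical labels with roughly the same imbalance as the dataset (the 76/24 split is an assumption for illustration) and compares plain accuracy with balanced accuracy, which is the average of the per-class recalls.

```python
import numpy as np

# Hypothetical labels: 76 right-footed and 24 left-footed players
y_true = np.array(["Right"] * 76 + ["Left"] * 24)
# Baseline "model" that always guesses the majority class
y_pred = np.full(y_true.shape, "Right")

accuracy = np.mean(y_true == y_pred)
# Recall per class: share of each true class that was predicted correctly
recalls = [np.mean(y_pred[y_true == c] == c) for c in ("Left", "Right")]
balanced_accuracy = np.mean(recalls)

print(f"accuracy:          {accuracy:.2f}")
print(f"recall (Left):     {recalls[0]:.2f}")
print(f"recall (Right):    {recalls[1]:.2f}")
print(f"balanced accuracy: {balanced_accuracy:.2f}")
```

A classifier that learns nothing still scores 0.76 on accuracy here, while its balanced accuracy is 0.50. That is why a metric that weights both classes equally is a fairer yardstick for this task.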