This notebook explores a dataset of Rocket League matches and trains machine learning models (XGBoost and logistic regression) to predict the match outcome.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import xgboost
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import metrics
from sklearn.metrics import classification_report

In [None]:
df = pd.read_excel('tracker_data_merged.xlsx', index_col=0)

In [None]:
df.head()

In [None]:
df = df.dropna()
df.reset_index(drop=True, inplace=True)

In [None]:
df.info()

---------------------

## Exploratory Data Analysis

In [None]:
df.describe().transpose()

In [None]:
df['player1_outcome'].value_counts()

In [None]:
df[df['p2_wins'] > 20000]

In [None]:
sns.pairplot(df)

In [None]:
sns.lmplot(x='p2_mmr',y='p2_gs_ratio',data=df)
plt.title('Goal/Shot Ratio vs. Rating')

The plot shows that goal/shot ratio decreases as skill rating increases. A likely explanation is that G/S ratio drifts down over a player's lifetime: more experienced players, who tend to have higher ratings, have taken far more total shots.
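
To put a number on the trend in the plot, a correlation coefficient can be computed next to it. The sketch below is only an illustration: the helper name `trend_strength` is hypothetical and the values in `toy` are made up, not taken from the dataset. It reports the Pearson correlation and a rank-based (Spearman-style) correlation, obtained by ranking both columns before correlating them.

```python
import pandas as pd

def trend_strength(frame: pd.DataFrame, x: str, y: str) -> dict:
    """Return Pearson and rank-based (Spearman-style) correlation of two columns."""
    pearson = frame[x].corr(frame[y])
    # Ranking both columns and then taking Pearson gives Spearman's rho.
    spearman = frame[x].rank().corr(frame[y].rank())
    return {"pearson": pearson, "spearman": spearman}

# Toy values for illustration only; NOT taken from tracker_data_merged.xlsx.
toy = pd.DataFrame({
    "p2_mmr": [600, 750, 900, 1050, 1200],
    "p2_gs_ratio": [0.62, 0.55, 0.50, 0.47, 0.41],
})
result = trend_strength(toy, "p2_mmr", "p2_gs_ratio")
print(result)
```

On the real data the same call would be `trend_strength(df, 'p2_mmr', 'p2_gs_ratio')`. A negative value would agree with the downward slope of the fitted line.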

In [None]:
sns.histplot(data=df,x='p1_wins')

In [None]:
plt.figure(figsize=(10,10))
sns.scatterplot(x='p1_wins',y='p2_wins',hue='player1_outcome',data=df)
plt.title('Player 1 Total Wins Vs. Player 2 Total Wins, Split by Win or Loss')

In [None]:
f, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,6))
sns.histplot(data=df,x='p2_wins',ax=axes[0]).set_title('Player 2 Win Distribution, Full')
sns.histplot(data=df,x='p2_wins',binrange=(0,10000),ax=axes[1]).set_title('Player 2 Win Distribution, Zoomed')

Why does p2_wins have such a skewed distribution? Is the data correct?

In [None]:
df['p2_wins'].value_counts()

Interesting side note - there are a few players with >20,000 lifetime wins. Holy crap!
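
One way to probe whether the skew comes from a few extreme players or from a data problem is to compare the median, the mean and a high quantile, and to count the rows above a cutoff. This is a minimal sketch: `summarize_tail` is a hypothetical helper and `toy_wins` holds made-up numbers, not values from the dataset.

```python
import pandas as pd

def summarize_tail(series: pd.Series, threshold: float) -> dict:
    """Summarize how heavy the right tail of a numeric column is."""
    return {
        "median": series.median(),
        "mean": series.mean(),
        "p99": series.quantile(0.99),
        "skew": series.skew(),  # a positive value indicates a long right tail
        "n_above": int((series > threshold).sum()),
    }

# Toy values for illustration only (not from the dataset).
toy_wins = pd.Series([120, 300, 450, 800, 1500, 2100, 25000])
summary = summarize_tail(toy_wins, threshold=20000)
print(summary)
```

On the real data this would be called as `summarize_tail(df['p2_wins'], threshold=20000)`. A mean far above the median with only a handful of rows over the cutoff would point to a few genuine outliers rather than a systematic error.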

---------------------------------------

## Feature Engineering

To build an ML model, I need to do some feature engineering to create predictors from the given columns. Specifically, I don't want to use p1_wins and p2_wins separately to predict the likelihood of p1 winning. Instead, I use the difference win_diff = p1_wins - p2_wins as the predictor, and do the same for MMR and goal/shot ratio.

In [None]:
df.head()

In [None]:
df['win_diff'] = df['p1_wins'] - df['p2_wins']

In [None]:
df['mmr_diff'] = df['p1_mmr'] - df['p2_mmr']

In [None]:
df['gs_ratio_diff'] = df['p1_gs_ratio'] - df['p2_gs_ratio']
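
The three difference columns follow the same pattern (player 1 minus player 2), so they could also be built in a loop. The helper below is a hypothetical sketch, not part of the original notebook, and the `toy` frame contains made-up values. It names the columns `<stat>_diff`, so it produces `wins_diff` rather than the `win_diff` used in this notebook.

```python
import pandas as pd

def add_diff_features(frame: pd.DataFrame, stats: list) -> pd.DataFrame:
    """Return a copy with a '<stat>_diff' column (player 1 minus player 2) per stat."""
    out = frame.copy()  # leave the caller's frame untouched
    for stat in stats:
        out[f"{stat}_diff"] = out[f"p1_{stat}"] - out[f"p2_{stat}"]
    return out

# Toy values for illustration only (not from the dataset).
toy = pd.DataFrame({
    "p1_wins": [100, 900], "p2_wins": [400, 300],
    "p1_mmr": [800, 1100], "p2_mmr": [850, 1000],
})
features = add_diff_features(toy, ["wins", "mmr"])
print(features[["wins_diff", "mmr_diff"]])
```

A positive difference means player 1 leads on that stat and a negative one means player 2 leads, so a single column carries the comparison between the two players.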

Now create a new df, "data", to be used for the ML model.

In [None]:
data = df[['win_diff','mmr_diff','gs_ratio_diff','player1_outcome']]

In [None]:
data.head(10)

-----------------

## Machine Learning - Model Fitting

First, split the data into features (X) and the target (y):

In [None]:
# Features: win_diff and mmr_diff (gs_ratio_diff is left out of the predictors here)
X = data.drop(['player1_outcome','gs_ratio_diff'],axis=1)
# Target: player 1's outcome
y = data['player1_outcome']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
model = XGBClassifier()
model.fit(X_train,y_train)

In [None]:
y_pred = model.predict(X_test)

In [None]:
accuracy = accuracy_score(y_test,y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
cr = classification_report(y_test, y_pred)
print(cr)
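
An accuracy figure is easier to interpret next to a naive baseline. The sketch below computes the accuracy of always predicting the most common training label. `majority_baseline` is a hypothetical helper and the toy labels are made up, so the printed number says nothing about this dataset.

```python
import pandas as pd

def majority_baseline(y_train: pd.Series, y_test: pd.Series) -> float:
    """Accuracy obtained by always predicting the most common training label."""
    majority_label = y_train.mode().iloc[0]
    return float((y_test == majority_label).mean())

# Toy labels for illustration only (not from the dataset).
toy_train = pd.Series([1, 1, 1, 0, 0])
toy_test = pd.Series([1, 0, 1, 1])
baseline = majority_baseline(toy_train, toy_test)
print("Baseline accuracy: %.2f%%" % (baseline * 100.0))
```

With the notebook's variables the call would be `majority_baseline(y_train, y_test)`. A model is only informative to the extent that it beats this number.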

In [None]:
xgboost.plot_importance(model)

Now try using scikit-learn's Logistic Regression:

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logReg = LogisticRegression()

In [None]:
logReg.fit(X_train, y_train)

In [None]:
predictions = logReg.predict(X_test)

In [None]:
score = logReg.score(X_test, y_test)
print(score)
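
The two features can sit on quite different scales (win differences may run into the thousands), and logistic regression's solver and regularization are sensitive to feature scale. A possible variation, sketched below on synthetic data, standardizes the features inside a scikit-learn pipeline before fitting. The data, the rule that generates the labels, and the names `X_toy`, `y_toy` and `scaled_logreg` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (win_diff, mmr_diff); NOT the notebook's data.
rng = np.random.default_rng(42)
X_toy = np.column_stack([
    rng.normal(0, 3000, size=400),  # win_diff-like scale (thousands)
    rng.normal(0, 100, size=400),   # mmr_diff-like scale (hundreds)
])
# Made-up rule: the outcome depends on the mmr-like column plus noise.
y_toy = (X_toy[:, 1] + rng.normal(0, 50, size=400) > 0).astype(int)

# Standardize each feature to zero mean and unit variance, then fit.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression())
scaled_logreg.fit(X_toy[:300], y_toy[:300])
toy_score = scaled_logreg.score(X_toy[300:], y_toy[300:])
print(toy_score)
```

Because the scaler is part of the pipeline, it is fitted on the training rows only and then applied to the test rows, which avoids leaking test statistics into training. On the notebook's data the pipeline would be fitted with `X_train, y_train` and scored on `X_test, y_test`.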

In [None]:
cm = metrics.confusion_matrix(y_test, predictions)
print(cm)

In [None]:
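plt.figure(figsize=(9,9))
# Confusion-matrix cells are integer counts, so annotate them as integers
sns.heatmap(cm, annot=True, fmt="d", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
# Show the accuracy rounded to three decimals in the title
all_sample_title = 'Accuracy Score: {0:.3f}'.format(score)
plt.title(all_sample_title, size = 15);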
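
The counts in the confusion matrix are enough to derive accuracy, precision and recall by hand. The sketch below assumes a binary outcome and the layout that scikit-learn's `confusion_matrix` uses for two classes, `[[TN, FP], [FN, TP]]` (rows are actual labels, columns are predicted labels). `binary_metrics` is a hypothetical helper and `toy_cm` holds made-up counts, not results from this notebook.

```python
import numpy as np

def binary_metrics(cm: np.ndarray) -> dict:
    """Derive metrics from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    tn, fp, fn, tp = cm.ravel()
    return {
        "accuracy": (tp + tn) / cm.sum(),
        "precision": tp / (tp + fp),  # share of predicted positives that are correct
        "recall": tp / (tp + fn),     # share of actual positives that are found
    }

# Toy counts for illustration only (not results from this notebook).
toy_cm = np.array([[40, 10],
                   [5, 45]])
toy_metrics = binary_metrics(toy_cm)
print(toy_metrics)
```

Calling `binary_metrics(cm)` on the matrix computed above should reproduce the accuracy returned by `logReg.score`, which makes it a quick consistency check.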