# Machine Learning Prototype

This notebook takes all the data collected and compares different machine learning algorithms to determine which one performs best. This would not have been possible without Payton Soicher's machine learning writeup as a reference. You can find it here: https://towardsdatascience.com/can-you-accurately-predict-mlb-games-based-on-home-and-away-records-8a9a919bad29

The model looks at each head-to-head matchup and predicts whether the home team wins.

In [1]:
#imports
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

import xgboost as xgb
from constants import ID_TO_NAME

In [2]:
#getting the schedule data and team stats
df = pd.read_csv("data2019.csv")
stats = pd.read_csv("stats.csv")

#setting index to team name for easier referencing
stats = stats.set_index("Team Name")

In [3]:
#filtering out the data
colums = ["stage", "away id","away","away score", "home id", "home score", "home", "winner", "winner id", "winner label",
         "Map 1 Name", "Map 1 Type", "Map 1 Away Points", "Map 1 Home Points", "Map 1 Winner",
        "Map 2 Name", "Map 2 Type", "Map 2 Away Points", "Map 2 Home Points", "Map 2 Winner",
        "Map 3 Name", "Map 3 Type", "Map 3 Away Points", "Map 3 Home Points", "Map 3 Winner", 
        "Map 4 Name", "Map 4 Type", "Map 4 Away Points", "Map 4 Home Points", "Map 4 Winner",
        "Map 5 Name", "Map 5 Type", "Map 5 Away Points", "Map 5 Home Points", "Map 5 Winner",
         ]

sdf = df[colums]

With the team stats, I collected both map-specific stats and general map-type stats. For now, I am sticking to map-type stats, with the goal of eventually using map-specific stats based on which maps are being played.

In [4]:
#filter out to map type stats
import itertools
cols = ["Points Earned", "Points Lost", "Points Differential", "Points Differential Rank", "True Win %", "Map Potential %", "Map Potential % Rank"]
types = ["Average Assault", "Average Control", "Average Hybrid", "Average Escort"]
columns = []
for i in types:
    columns.append([i + " " + j for j in cols])

columns = list(itertools.chain.from_iterable(columns))
sta = stats[columns]
sta

Team Name,Average Assault Points Earned,Average Assault Points Lost,Average Assault Points Differential,Average Assault Points Differential Rank,Average Assault True Win %,Average Assault Map Potential %,Average Assault Map Potential % Rank,Average Control Points Earned,Average Control Points Lost,Average Control Points Differential,...,Average Hybrid True Win %,Average Hybrid Map Potential %,Average Hybrid Map Potential % Rank,Average Escort Points Earned,Average Escort Points Lost,Average Escort Points Differential,Average Escort Points Differential Rank,Average Escort True Win %,Average Escort Map Potential %,Average Escort Map Potential % Rank
Atlanta Reign,2.3416,2.15,0.1916,6.0,55.0,0.751,11.0,1.2904,1.0764,0.214,...,57.3334,0.7708,7.0,2.3945,2.385667,0.008833,8.0,51.865,0.799333,4.0
Boston Uprising,2.3466,2.6266,-0.28,14.0,33.5,0.7144,14.0,0.818,1.5638,-0.7458,...,40.3334,0.6908,15.0,1.8945,2.222167,-0.327667,13.0,33.333333,0.6235,20.0
Chengdu Hunters,2.2584,2.3084,-0.05,11.0,45.8334,0.789,8.0,1.25,1.21,0.04,...,49.6666,0.7758,6.0,1.938833,2.272167,-0.333333,14.0,35.0,0.663167,15.0
Dallas Fuel,2.2092,2.7736,-0.5644,20.0,28.2858,0.6474,19.0,1.128,1.1584,-0.0304,...,32.8572,0.6958,14.0,2.241667,2.791667,-0.55,19.0,31.666667,0.652167,16.0
Florida Mayhem,2.4058,2.6104,-0.2046,13.0,38.3332,0.7464,13.0,0.7534,1.5706,-0.8172,...,31.7142,0.7178,12.0,1.863833,2.325,-0.461167,17.0,41.111167,0.633333,19.0
Guangzhou Charge,2.2058,2.1962,0.0096,10.0,50.6192,0.712,15.0,1.3128,1.0682,0.2446,...,48.7858,0.7446,10.0,2.062667,2.338833,-0.276167,12.0,43.531667,0.686833,10.0
Hangzhou Spark,2.1294,2.0376,0.0918,8.0,56.8572,0.7934,6.0,1.3236,1.1042,0.2194,...,45.9166,0.7078,13.0,2.261333,2.1695,0.091833,7.0,54.0875,0.766667,7.0
Houston Outlaws,1.7492,2.0878,-0.3386,16.0,36.0,0.549,20.0,1.1562,1.2646,-0.1084,...,34.1666,0.5592,20.0,1.841667,2.211167,-0.3695,16.0,38.888833,0.684167,12.0
London Spitfire,2.262,2.0416,0.2204,5.0,57.6786,0.7848,9.0,1.2546,1.2016,0.053,...,50.2976,0.727,11.0,2.038833,2.211167,-0.172333,11.0,48.1945,0.680333,13.0
Los Angeles Gladiators,2.1894,1.9448,0.2446,3.0,57.1516,0.801,4.0,1.1678,1.1632,0.0046,...,69.3334,0.8632,3.0,2.1625,2.0625,0.1,6.0,49.166667,0.7285,8.0
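
The nested loop in cell 4 builds one column name for every combination of map type and stat. As a minimal sketch, the same list (in the same order) could be built with `itertools.product`; the variable names below simply mirror the ones in the cell.

```python
from itertools import product

cols = ["Points Earned", "Points Lost", "Points Differential",
        "Points Differential Rank", "True Win %", "Map Potential %",
        "Map Potential % Rank"]
types = ["Average Assault", "Average Control", "Average Hybrid", "Average Escort"]

# product() varies the last argument fastest, so all stats for one map type
# are emitted before moving on to the next map type.
columns = [f"{map_type} {stat}" for map_type, stat in product(types, cols)]

print(len(columns))
print(columns[0])
```

This yields 4 map types x 7 stats = 28 column names and avoids the separate flattening step.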


In [5]:
#turning all important categorical data into numeric data
def get_team_stats(team):
    teamrow = sta.loc[team, :]
    return teamrow

def home_team_winner(row):
    if row['home'] == row['winner']:
        return 1 
    else:
        return 0
    

finaldf = []
noplay = sdf[["stage", "away", "away id", "away score", "home score", "home id", "home", "winner", "winner id"]]

#combining stats and schedule dataframes
for index, row in noplay.iterrows():
    awayrow = get_team_stats(row["away"])
    homerow = get_team_stats(row["home"])
    awayrow = awayrow.rename(lambda x: "Away " + x)
    homerow = homerow.rename(lambda x: "Home " + x)
    test = pd.concat([row, awayrow, homerow], )
    finaldf.append(test)
finaldf = pd.DataFrame(finaldf)
finaldf.insert(finaldf.columns.get_loc("winner"), 'HomeTeamWin', finaldf.apply(home_team_winner, axis = 1))

#dropping all categorical data
finaldf = finaldf.drop(["stage", "home", "away", 'away score', 'home score', "winner", "winner id"], axis = 1)
finaldf = finaldf.loc[:, ~finaldf.columns.str.contains("Rank")]
finaldf

Unnamed: 0,away id,home id,HomeTeamWin,Away Average Assault Points Earned,Away Average Assault Points Lost,Away Average Assault Points Differential,Away Average Assault True Win %,Away Average Assault Map Potential %,Away Average Control Points Earned,Away Average Control Points Lost,...,Home Average Hybrid Points Earned,Home Average Hybrid Points Lost,Home Average Hybrid Points Differential,Home Average Hybrid True Win %,Home Average Hybrid Map Potential %,Home Average Escort Points Earned,Home Average Escort Points Lost,Home Average Escort Points Differential,Home Average Escort True Win %,Home Average Escort Map Potential %
0,4524,4410,0,2.3196,2.6138,-0.2942,36.9446,0.7500,1.0026,1.4804,...,2.2656,2.5166,-0.2510,50.2976,0.7270,2.038833,2.211167,-0.172333,48.194500,0.680333
1,4403,4402,0,2.1624,1.9222,0.2402,59.9884,0.8234,1.3690,1.0052,...,2.2316,2.4650,-0.2334,40.3334,0.6908,1.894500,2.222167,-0.327667,33.333333,0.623500
2,4409,4406,0,2.3574,2.2094,0.1480,53.3056,0.7998,1.3700,1.1900,...,2.4400,2.0434,0.3966,69.3334,0.8632,2.162500,2.062500,0.100000,49.166667,0.728500
3,4408,7693,1,2.6066,2.5466,0.0600,55.0000,0.7742,1.3450,1.1692,...,2.4034,2.3216,0.0818,45.9166,0.7078,2.261333,2.169500,0.091833,54.087500,0.766667
4,7695,4525,0,1.9016,2.2484,-0.3468,38.8334,0.6884,1.0394,1.3994,...,1.6952,2.3442,-0.6490,34.1666,0.5592,1.841667,2.211167,-0.369500,38.888833,0.684167
5,7698,4407,0,2.3416,2.1500,0.1916,55.0000,0.7510,1.2904,1.0764,...,2.0524,2.5780,-0.5256,31.7142,0.7178,1.863833,2.325000,-0.461167,41.111167,0.633333
6,4523,4404,1,2.2092,2.7736,-0.5644,28.2858,0.6474,1.1280,1.1584,...,2.8054,1.8916,0.9138,75.8938,0.9046,2.845833,1.380333,1.465500,90.773833,0.931500
7,7692,7699,0,2.2584,2.3084,-0.0500,45.8334,0.7890,1.2500,1.2100,...,2.3514,2.3986,-0.0472,48.7858,0.7446,2.062667,2.338833,-0.276167,43.531667,0.686833
8,4410,7694,1,2.2620,2.0416,0.2204,57.6786,0.7848,1.2546,1.2016,...,2.5834,2.7400,-0.1566,45.2500,0.7556,2.511167,2.361167,0.150000,56.944500,0.804833
9,7697,4403,1,1.5908,1.9764,-0.3856,41.0714,0.6902,0.8500,1.6034,...,2.5336,2.0938,0.4398,63.1558,0.8286,2.105333,1.776167,0.329167,63.273833,0.792333
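
The row-by-row loop in cell 5 attaches the away and home team stats to every match. A possible alternative, sketched here on made-up toy data rather than the real `sdf` and `sta` frames, is to let pandas do the lookup with two `DataFrame.join` calls. The team names and the single stat column are placeholders for illustration only.

```python
import pandas as pd

# Toy stand-ins for the schedule and stats frames (values are invented).
schedule = pd.DataFrame({
    "away": ["Team A", "Team B"],
    "home": ["Team B", "Team C"],
    "winner": ["Team B", "Team B"],
})
stats = pd.DataFrame(
    {"Average Control True Win %": [55.0, 40.0, 62.5]},
    index=pd.Index(["Team A", "Team B", "Team C"], name="Team Name"),
)

# join(on=...) looks up each team name in the index of the stats frame;
# add_prefix() plays the role of the "Away " / "Home " renaming in the loop.
merged = (
    schedule
    .join(stats.add_prefix("Away "), on="away")
    .join(stats.add_prefix("Home "), on="home")
)

# Same target as home_team_winner(): 1 if the home team won, else 0.
merged["HomeTeamWin"] = (merged["home"] == merged["winner"]).astype(int)
print(merged)
```

Because the join is vectorised, it avoids building one Series per match and keeps the numeric columns as numeric dtypes.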


In [6]:
#splitting into training and testing
X_train, X_test, y_train, y_test = train_test_split(finaldf.loc[:, ~finaldf.columns.isin(['HomeTeamWin'])]
                                                   , finaldf.loc[:, 'HomeTeamWin']
                                                   , random_state = 42
                                                   , stratify = finaldf.loc[:, 'HomeTeamWin'])

Here, I decided to use a variety of different models to find which one worked best. It's worth noting that the Overwatch League changes its ruleset rapidly, so a team can be terrible at the beginning of the season and do really well towards the end. The stats collected cover the overall season, which means games played early in the season aren't predicted as properly as games towards the end. This is something I hope to fix in the future.
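
One way the season-wide stats problem might be approached is to compute each team's stats only from matches played before the one being predicted. The sketch below is not part of this notebook's pipeline: the match list, team names, and the single "prior win rate" feature are hypothetical, and it assumes the matches are already in chronological order.

```python
from collections import defaultdict

# Hypothetical, chronologically ordered results: (away, home, winner).
matches = [
    ("Team A", "Team B", "Team B"),
    ("Team B", "Team C", "Team C"),
    ("Team A", "Team C", "Team A"),
    ("Team B", "Team A", "Team A"),
]

wins = defaultdict(int)
played = defaultdict(int)


def prior_win_rate(team):
    """Win rate from earlier matches only; None until the team has played."""
    return wins[team] / played[team] if played[team] else None


rows = []
for away, home, winner in matches:
    # Record the features before this match's result is counted.
    rows.append({
        "away": away,
        "home": home,
        "away_prior_win_rate": prior_win_rate(away),
        "home_prior_win_rate": prior_win_rate(home),
        "HomeTeamWin": int(winner == home),
    })
    # Only now update the running totals with this match.
    for team in (away, home):
        played[team] += 1
    wins[winner] += 1

for row in rows:
    print(row)
```

The same idea would extend to the points-based stats used here, at the cost of needing per-match (not season-aggregate) source data.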

In [7]:
rfc = RandomForestClassifier(500, random_state = 534)
rfc.fit(X_train, y_train)
print('-- Random Forest -- ')
print('Training Accuracy: ', accuracy_score(y_train, rfc.predict(X_train)))
print('Testing Accuracy: ', accuracy_score(y_test, rfc.predict(X_test)))
print('Whole Dataset: ', accuracy_score(finaldf['HomeTeamWin'],rfc.predict(finaldf.loc[:, X_train.columns])))
print('\n')

lr = LogisticRegression(random_state = 534)
lr.fit(X_train, y_train)
print('-- Logistic Regression -- ')
print('Training Accuracy: ', accuracy_score(y_train, lr.predict(X_train)))
print('Testing Accuracy: ', accuracy_score(y_test, lr.predict(X_test)))
print('Whole Dataset: ', accuracy_score(finaldf['HomeTeamWin'],lr.predict(finaldf.loc[:, X_train.columns])))
print('\n')

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
print('-- K Nearest Neighbors -- ')
print('Training Accuracy: ', accuracy_score(y_train, knn.predict(X_train)))
print('Testing Accuracy: ', accuracy_score(y_test, knn.predict(X_test)))
print('Whole Dataset: ', accuracy_score(finaldf['HomeTeamWin'],knn.predict(finaldf.loc[:, X_train.columns])))
print('\n')

sv = SVC()
sv.fit(X_train, y_train)
print('-- SVC -- ')
print('Training Accuracy: ', accuracy_score(y_train, sv.predict(X_train)))
print('Testing Accuracy: ', accuracy_score(y_test, sv.predict(X_test)))
print('Whole Dataset: ', accuracy_score(finaldf['HomeTeamWin'],sv.predict(finaldf.loc[:, X_train.columns])))
print('\n')

xgboost = xgb.XGBClassifier(seed = 82)
xgboost.fit(X_train, y_train)
print('-- XGBoost --')
print('Training Accuracy: ', accuracy_score(y_train, xgboost.predict(X_train)))
print('Testing Accuracy: ', accuracy_score(y_test, xgboost.predict(X_test)))
print('Whole Dataset: ', accuracy_score(finaldf['HomeTeamWin'],xgboost.predict(finaldf.loc[:, X_train.columns])))

-- Random Forest -- 
Training Accuracy:  0.9707112970711297
Testing Accuracy:  0.7375
Whole Dataset:  0.9122257053291536


-- Logistic Regression -- 
Training Accuracy:  0.7112970711297071
Testing Accuracy:  0.725
Whole Dataset:  0.7147335423197492


-- K Nearest Neighbors -- 
Training Accuracy:  0.7782426778242678
Testing Accuracy:  0.675
Whole Dataset:  0.7523510971786834


-- SVC -- 
Training Accuracy:  0.9707112970711297
Testing Accuracy:  0.5
Whole Dataset:  0.8526645768025078


-- XGBoost --
Training Accuracy:  0.8744769874476988
Testing Accuracy:  0.6875
Whole Dataset:  0.8275862068965517


As seen, the models vary in their predictions. Testing accuracy never goes above 80% (the best here is Random Forest at about 74%), while accuracy on the whole dataset looks much better, in part because that figure includes the rows the models were trained on. This is still good accuracy overall: the Overwatch League is in such a constant state of flux that anything above 50% is considered good. Ideally, with more relevant features, the accuracy can go up.
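
To see how the "Whole Dataset" number relates to the other two, it can be reproduced as a weighted combination of training and testing accuracy. The row counts below are not stated in the notebook; they are inferred from the printed fractions (0.7375 = 59/80 and 0.9707... = 232/239), so treat them as an assumption.

```python
# Row counts inferred from the printed accuracies (an assumption):
# 239 training rows and 80 testing rows, 319 matches in total.
n_train, n_test = 239, 80


def whole_dataset_accuracy(train_acc, test_acc):
    """Accuracy over train + test rows, given the two separate accuracies."""
    correct = round(train_acc * n_train) + round(test_acc * n_test)
    return correct / (n_train + n_test)


# Random Forest figures from the output above.
rf_whole = whole_dataset_accuracy(0.9707112970711297, 0.7375)
print(rf_whole)  # about 0.9122, matching the printed "Whole Dataset" value
```

Since roughly three quarters of the rows are training rows, the whole-dataset figure mostly reflects training accuracy; the testing accuracy is the better guide to how a model handles unseen matches.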

In [8]:
#takes any two Overwatch League teams (away, home) and prints each model's predicted winner
def test_predict(away, home):

    #reverse lookup: team name -> team id (ID_TO_NAME maps id -> name)
    name_to_id = {name: team_id for team_id, name in ID_TO_NAME.items()}

    #turn a 0/1 prediction back into a team name
    def convert_prediction(prediction):
        #1 = home won, 0 = away won
        return home if prediction[0] == 1 else away

    #get all stats for each team and prefix them by team placement
    awayrow = get_team_stats(away).rename(lambda x: "Away " + x)
    homerow = get_team_stats(home).rename(lambda x: "Home " + x)

    #team ids first, then stats, labelled like the training columns
    #(raises KeyError if a team name is not in ID_TO_NAME)
    ids = pd.Series([name_to_id[away], name_to_id[home]], index=["away id", "home id"])
    testrow = pd.concat([ids, awayrow, homerow])
    testrow = testrow[~testrow.index.str.contains("Rank")]

    #one-row frame in the same column order the models were fit on
    features = pd.DataFrame([testrow])[X_train.columns]

    #predictions
    rfcprediction = rfc.predict(features)
    lrprediction = lr.predict(features)
    knnprediction = knn.predict(features)
    svprediction = sv.predict(features)

    print("Random Forest Prediction: ", convert_prediction(rfcprediction))
    print(" ")
    print("Logistic Regression Prediction: ", convert_prediction(lrprediction))
    print(" ")
    print("K Nearest Neighbors Prediction: ", convert_prediction(knnprediction))
    print(" ")
    print("SVC Prediction: ", convert_prediction(svprediction))
    print(" ")


test_predict("Seoul Dynasty", "Dallas Fuel")

Random Forest Prediction:  Dallas Fuel
 
Logistic Regression Prediction:  Seoul Dynasty
 
K Nearest Neighbors Prediction:  Dallas Fuel
 
SVC Prediction:  Dallas Fuel
 


In [19]:
#listing features for Random Forest
pd.DataFrame(list(zip(rfc.feature_importances_, X_train.columns)), columns = ['Feature Importance','Feature']
            ).sort_values('Feature Importance',ascending = False)

Unnamed: 0,Feature Importance,Feature
9,0.041263,Away Average Control Points Differential
10,0.038966,Away Average Control True Win %
11,0.035386,Away Average Control Map Potential %
7,0.03365,Away Average Control Points Earned
8,0.030483,Away Average Control Points Lost
14,0.029669,Away Average Hybrid Points Differential
21,0.028427,Away Average Escort Map Potential %
30,0.028357,Home Average Control True Win %
28,0.027924,Home Average Control Points Lost
24,0.02785,Home Average Assault Points Differential
