**PURPOSE**

The goal of this workbook is to document the initial process by which I generated predictions for the WNCAA 2018 championship. My final predictions were created by combining this with the Glicko score and variance of each team, and other details available from WNCAATourneyCompactResults.csv.

**THEORY OF OPERATION**

In a given basketball game, team1 scores a given number of points through offense and prevents a number of points through defense. The same goes for team2.

Let O1 be the offense of team1, O2 be the offense of team2, D1 be the defense of team1, and D2 be the defense of team2. All of these are random variables in units of points.

As a result, if O1 - D2 > O2 - D1, or equivalently O1 + D1 > O2 + D2, then team1 should be expected to win. To capture this, let us define another variable *L*:

Let *L* = O1 + D1 - O2 - D2

Thus P(*L* > 0) = P(team1 wins) and P(*L* < 0) = P(team2 wins).

Now to find a suitable measure of offense and defense.
I made the assumptions that over the course of a season a given team's opponents would be representative of the entire league, and that transitivity holds, i.e. if team1 is likely to beat team2 and team2 is likely to beat team3, then team1 is likely to beat team3, where "likely" means being expected to win more than 50% of the time.

As such, I chose O1 to be the average number of points scored in a game by team1, and D1 to be the negative of the average score of team1's opponents in a game. As it turned out, both of these random variables appeared to be roughly normally distributed, so with some simple statistics a description of *L* becomes easy. In principle, however, any distribution should work with the appropriate math.
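To make these definitions concrete, here is a minimal sketch on a made-up results table. The three games and the helper name `offense_defense` are hypothetical; only the column names follow the compact results files used later in this workbook.

```python
import pandas as pd

# Hypothetical toy results table (column names as in the compact results files)
games = pd.DataFrame({
    "WTeamID": [1, 2, 1],
    "WScore":  [80, 70, 75],
    "LTeamID": [2, 1, 3],
    "LScore":  [60, 65, 70],
})

def offense_defense(games, team_id):
    """Return (O, D): mean points scored and the negative of mean points allowed."""
    won = games["WTeamID"] == team_id
    lost = games["LTeamID"] == team_id
    # Points the team scored: WScore in its wins, LScore in its losses
    scored = pd.concat([games.loc[won, "WScore"], games.loc[lost, "LScore"]])
    # Points the team allowed: LScore in its wins, WScore in its losses
    allowed = pd.concat([games.loc[won, "LScore"], games.loc[lost, "WScore"]])
    return scored.mean(), -allowed.mean()

o1, d1 = offense_defense(games, 1)
print(o1, d1)
```

With this sign convention a stingier defense gives a larger (less negative) D, so O + D is simply the team's average scoring margin.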

E(*L*) = E(O1) + E(D1) - E(O2) - E(D2) 

and, treating the four variables as independent,

var(*L*) = var(O1) + var(D1) + var(O2) + var(D2) 
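As a worked illustration of these two formulas, the sketch below plugs in made-up numbers. The helper name `win_probability` and all of the inputs are hypothetical, and it assumes the four variables are independent and normal, so that *L* is normal as well and P(*L* > 0) is the standard normal CDF evaluated at E(*L*) / sqrt(var(*L*)).

```python
import math

def win_probability(o1, d1, o2, d2, var_o1, var_d1, var_o2, var_d2):
    """P(L > 0) for L = O1 + D1 - O2 - D2, assuming independent normal terms."""
    mean_l = o1 + d1 - o2 - d2
    var_l = var_o1 + var_d1 + var_o2 + var_d2
    z = mean_l / math.sqrt(var_l)
    # Standard normal CDF written with the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical teams: team1 scores 75 and allows 65, team2 scores 70 and allows 68,
# each with a variance of 100 (a standard deviation of 10 points)
p = win_probability(75, -65, 70, -68, 100, 100, 100, 100)
print(round(p, 3))
```

Here E(*L*) = 8 and var(*L*) = 400, so the z-score is 0.4 and team1 is given roughly a 66% chance of winning.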

**NOTE:** The code below exists to make the later implementation easier to read.


In [1]:
import pandas as pd 
import numpy as np
import math
import scipy.stats as st

class Season():
    def __init__(self, year):
        self.year = year
        seasonResults = pd.read_csv("../input/WRegularSeasonCompactResults.csv")
        yearMask = seasonResults.loc[:,'Season'] == self.year
        self.games = seasonResults.loc[yearMask,:].copy()
        self.__setTeams()
        self.expectedScore = (self.games.loc[:,"WScore"].mean() + self.games.loc[:,"LScore"].mean()) / 2
        self.__calcErrors()
        
    def __setTeams(self, filename = "../input/WNCAATourneySeeds.csv"):
        """
        Pulls all of the teams due to compete in the tournament in the given year,
        plus every team that played a regular-season game that year
        """
        temp = pd.read_csv(filename)
        temp = temp.query('Season == {}'.format(self.year)).copy()
        teamset = set()
        teamset.update(temp["TeamID"].unique())
        self.tournTeams = list(teamset)
        self.tournTeams.sort()
        
        allteamset = set(self.games.loc[:,"WTeamID"].unique())
        allteamset.update(self.games.loc[:,"LTeamID"].unique())
        self.allTeams = list(allteamset)
        self.allTeams.sort()
        
    def __calcErrors(self):
        scores = self.games.loc[:,"WScore"].copy().values
        scores = np.append(self.games.loc[:,"LScore"],scores)
        scores = np.std(scores)
        self.stdScore = scores
        
    def getTeamRecords(self,teamId):
        # All regular-season games in which the given team either won or lost
        return self.games.query('(WTeamID == {} | LTeamID == {})'.format(teamId,teamId))


class Team():
    def __init__(self, teamId, season):
        self.teamId = teamId
        self.record = season.getTeamRecords(self.teamId)
        self.__setOffense()
        
    def play(self, opponent, gameData ):
        return 0
        
    def __str__(self):
        # __str__ should return the text rather than print it
        return "{}\nTeam Scoring: {}\nOpponent Scoring: {}".format(
            self.teamId, self.teamStats, self.oppStats)
        
    def __setOffense(self):
        # Split the team's games into wins and losses
        Wmask = self.record['WTeamID'] == self.teamId
        Lmask = self.record['LTeamID'] == self.teamId

        # Points scored by this team (Series.append was removed in pandas 2.0, so use pd.concat)
        teamScoreSeries = pd.concat([self.record.loc[Wmask,'WScore'],
                                     self.record.loc[Lmask,'LScore']], ignore_index=True)

        # Points scored against this team
        oppScoreSeries = pd.concat([self.record.loc[Wmask,'LScore'],
                                    self.record.loc[Lmask,'WScore']], ignore_index=True)

        self.teamStats = {"PPG":teamScoreSeries.mean(), "STD":teamScoreSeries.std()}
        self.oppStats = {"PPG":oppScoreSeries.mean(), "STD":oppScoreSeries.std()}

class Matchup(object):
    def __init__(self,team1,team2,season):
        self.season = season.year
        self.team1 = team1.teamId
        self.team2 = team2.teamId
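        # EstimateProb (defined further down) takes the eight summary statistics, not Team objects
        self.PTeam1Win = EstimateProb(
            team1.teamStats['PPG'], team1.oppStats['PPG'], team1.teamStats['STD'], team1.oppStats['STD'],
            team2.teamStats['PPG'], team2.oppStats['PPG'], team2.teamStats['STD'], team2.oppStats['STD'])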
        
    def tostr(self):
        return "{}_{}_{},{}\n".format(
                self.season,self.team1,self.team2,self.PTeam1Win)

With the backbone in place, we now move on to the computational heavy lifting: for each year specified in `years` we analyze each team's performance and output a DataFrame with all of the features.

In [10]:
years = list(range(1998, 2018)) # seasons 1998 through 2017 (the upper bound 2018 is excluded)

# Collect one row per team and season, then build the DataFrame once at the end
rows = []
for year in years:
    season = Season(year)
    teamList = [Team(ID,season) for ID in season.allTeams]

    for team in teamList: 
        rows.append({'Year': year, 'TeamId': team.teamId, 
                     'AverageScore': team.teamStats['PPG'], 'ScoreSTD': team.teamStats['STD'],
                     'OpponentAverageScore': team.oppStats['PPG'], 
                     'OpponentScoreSTD': team.oppStats['STD']})
    print(year, 'Done')
teamData = pd.DataFrame(rows, columns=['Year', 'TeamId', 'AverageScore', 'ScoreSTD',
                                       'OpponentAverageScore', 'OpponentScoreSTD'])
teamData['Year'] = teamData['Year'].astype('int', copy=False)
teamData['TeamId'] = teamData['TeamId'].astype('int', copy=False)
teamData['UID'] = teamData['Year'].astype(str) + '_' + teamData['TeamId'].astype(str)
teamData = teamData.set_index('UID')
teamData.head()


Now the DataFrame teamData contains the requisite stats for each team in each season. Converting these into predictions based on statistical performance is straightforward. However, note that if a team has only one recorded game during the season, a standard deviation cannot be computed, so there may be missing data. For convenience, matchups in which a team has an unknown variance will be assigned a probability of 50%.

Since the model is built from regular-season performance alone, the entirety of the historical tournament data can be used for benchmarking, without the need to split the data into training and testing sets.
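A small sketch of the single-game case follows. The scores and z-scores are made up for illustration, and the normal CDF is written out with `erf` only to keep the snippet short; it is a rough demonstration of the fallback, not the workbook's actual pipeline.

```python
from math import erf, sqrt

import numpy as np
import pandas as pd

# A team with one recorded game: the sample standard deviation is undefined
single_game = pd.Series([68])
std = single_game.std()
print(std)

# Hypothetical z-scores: the second one is built from the undefined std, so it is NaN
z_score = np.array([0.8, (72.0 - 68.0) / std])

# Replace NaN with 0, which corresponds to a 50% prediction
z_score = np.nan_to_num(z_score)
probs = [0.5 * (1 + erf(z / sqrt(2))) for z in z_score]
print(probs)
```

Note that `np.nan_to_num` returns a new array by default, so its result has to be assigned back for the fallback to take effect.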

In [3]:
tourney = pd.read_csv("../input/WNCAATourneyCompactResults.csv")
tourney.head()

In [4]:
tourney['WUID'] = tourney['Season'].astype(str) + '_' + tourney['WTeamID'].astype(str)
tourney['LUID'] = tourney['Season'].astype(str) + '_' + tourney['LTeamID'].astype(str)
tourney.head()

We have yet to write a function that estimates the probability of a given team winning, so we do that below.

In [26]:
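def EstimateProb(team1PPG, team1OPPG, team1STD, team1OSTD, team2PPG, team2OPPG, team2STD, team2OSTD):
    # E(L): difference between the two teams' average scoring margins
    expVal = team1PPG - team2PPG
    expVal -= team1OPPG - team2OPPG
    
    # var(L): sum of the four variances
    variance = team1STD**2 + team1OSTD**2 
    variance += team2STD**2 + team2OSTD**2
    
    std = np.sqrt(variance)
    zScore = expVal / std
    # nan_to_num returns a new array, so keep the result: missing stats give z = 0, i.e. a 50% prediction
    zScore = np.nan_to_num(zScore)
    # P(L > 0) = 1 - cdf(-z)
    return 1 - st.norm.cdf(-1 * zScore)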

In [53]:
# The Requisite team1 statistics
team1PPG = tourney['WUID'].map(teamData['AverageScore']).values
team1OPPG = tourney['WUID'].map(teamData['OpponentAverageScore']).values
team1STD = tourney['WUID'].map(teamData['ScoreSTD']).values
team1OSTD = tourney['WUID'].map(teamData['OpponentScoreSTD']).values

# The Requisite team2 statistics
team2PPG = tourney['LUID'].map(teamData['AverageScore']).values
team2OPPG = tourney['LUID'].map(teamData['OpponentAverageScore']).values
team2STD = tourney['LUID'].map(teamData['ScoreSTD']).values
team2OSTD = tourney['LUID'].map(teamData['OpponentScoreSTD']).values

tourney['Prediction'] = EstimateProb(team1PPG, team1OPPG, team1STD, team1OSTD, team2PPG, team2OPPG, team2STD, team2OSTD)
tourney['Result'] = 1
tourney.head()

In [54]:
# team1 is always the actual winner, so a prediction above 0.5 counts as correct
correct = (tourney['Prediction'] > .5).sum()
incorrect = (tourney['Prediction'] <= .5).sum()
print("{:.4f}".format(correct / (correct + incorrect)))

So this method has historically predicted WNCAA tournament games with 71% accuracy. What about log loss?

In [57]:
from sklearn.metrics import log_loss
print("{:.4f}".format(log_loss(tourney['Result'].values, tourney['Prediction'].values, labels = [0,1])))

In [None]:
A fairly high log loss on its own; however, it is a valuable piece of information nonetheless.
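For a sense of scale, the sketch below computes log loss by hand for the situation in this workbook, where every row has Result = 1, so the loss reduces to the mean of -log(p). The function name `log_loss_all_wins`, the clipping constant `eps`, and the example predictions are assumptions made for illustration.

```python
import math

def log_loss_all_wins(predictions, eps=1e-15):
    """Mean of -log(p), where p is the probability given to the team that actually won."""
    # Clip to avoid log(0) when a prediction is exactly 0 or 1 (an assumption of this sketch)
    clipped = [min(max(p, eps), 1 - eps) for p in predictions]
    return -sum(math.log(p) for p in clipped) / len(clipped)

# Always predicting 50% gives ln(2), roughly 0.693, a natural baseline to compare against
baseline = log_loss_all_wins([0.5, 0.5, 0.5])

# Hypothetical predictions: two confident correct calls and one confident miss
example = log_loss_all_wins([0.9, 0.8, 0.3])
print(round(baseline, 4), round(example, 4))
```

A model only beats the coin-flip baseline if its confident misses are rare enough, since a single prediction near 0 for the eventual winner is penalized heavily.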