# Markov Chain Win Prediction

## By Will Walters

This tutorial gives a brief introduction to a method for predicting the outcome of events that depend on a large number of factors: build a simple model of the event using many simplifying assumptions, then run that model many times to see whether the results trend towards a general outcome.
   
An example of this, given in class, is the methodology that the website 538 uses to forecast the presidential election. The model they use to simulate a single iteration of the election is not what I would personally call simple: it uses polls to predict each candidate's polling baseline in each state, and then past results to correlate polling error between states. Even so, it is still a much simplified model of the actual election and makes use of several simplifying assumptions.
   
The power of their model, and of other models built on the same conceit, comes from running this partially random simulation several thousand times, which points at some underlying, nonobvious truth about the data: in this case, which candidate is more likely to win the election.

![Image](https://33x5bs39d3jhnwvvr3uyk6zm-wpengine.netdna-ssl.com/wp-content/uploads/pix/2016/08/Nate-Silver-Election-Forecast-873x898.png)

In this tutorial, we will apply the same idea to predicting the outcome of a basketball game. (Note that this tutorial presupposes some knowledge of the fundamentals of basketball.) Specifically, we will give a probability for each team to win a game given the identities of and statistics about each team, the current score, and the time left in the game. (Including these in-game specifics will let us compute the changing win percentage over the course of a given game, as we'll see later.)

## Getting Statistics

First, we'll need statistics for each team. I found these at http://www.basketball-reference.com/, which makes its tables of statistics available for download in CSV format. These stats are from the 2015-2016 regular season.

In [1]:
import pandas as pd

In [2]:
selfStats = pd.read_csv("team_stats.csv")
otherStats = pd.read_csv("opponent_stats.csv")

print selfStats.head()
print otherStats.head()

   Rk                Team   G     MP    FG   FGA    FG%   3P   3PA    3P%  \
0   1      Atlanta Hawks*  82  19830  3168  6923  0.458  815  2326  0.350   
1   2     Boston Celtics*  82  19780  3216  7318  0.439  717  2142  0.335   
2   3       Brooklyn Nets  82  19755  3136  6920  0.453  531  1508  0.352   
3   4  Charlotte Hornets*  82  19855  3036  6922  0.439  873  2410  0.362   
4   5       Chicago Bulls  82  19905  3165  7170  0.441  651  1753  0.371   

   ...    ORB   DRB   TRB   AST  STL  BLK   TOV    PF   PTS   PS/G  
0  ...    679  2772  3451  2100  747  486  1226  1570  8433  102.8  
1  ...    950  2733  3683  1981  752  348  1127  1796  8669  105.7  
2  ...    863  2614  3477  1829  627  332  1212  1476  8089   98.6  
3  ...    734  2869  3603  1778  595  438  1030  1487  8479  103.4  
4  ...    907  2889  3796  1870  495  470  1141  1545  8335  101.6  

[5 rows x 26 columns]
   Rk                Team   G     MP    FG   FGA    FG%   3P   3PA    3P%  \
0   1      Atlanta Hawk

Some explanation for these tables: the first, selfStats, holds statistics on each team's own performance. The second, otherStats, holds statistics on the performance of the opponents that team played against. Both tables are ordered alphabetically by team name.

We will only be using a few of the above stats to build our model. The goal of these stats is to help assign probabilities to our Markov chain-like model of a basketball game. From selfStats, we will use 3P% (three point percentage), 2P% (two point percentage), ORB (offensive rebounds), DRB (defensive rebounds), 3PA (three point attempts), and 2PA (two point attempts). From otherStats, we will use 3P% (three point percentage [of the opponent]) and 2P% (two point percentage [of the opponent]). We will also keep track of the team's name.

The function below takes an int corresponding to the lexicographical rank of the team's name (0-indexed), and returns a 9-member list: the team's name followed by the eight statistics listed above.

In [3]:
def getStats(i):
    
    info = []
    
    name = selfStats.iloc[i]['Team']
    if name[-1] == '*':
        name = name[:-1]
    info.append(name)
    
    info.append(selfStats.iloc[i]['3P%'])
    info.append(selfStats.iloc[i]['2P%'])
    info.append(selfStats.iloc[i]['ORB'])
    info.append(selfStats.iloc[i]['DRB'])
    info.append(selfStats.iloc[i]['3PA'])
    info.append(selfStats.iloc[i]['2PA'])
    info.append(otherStats.iloc[i]['3P%'])
    info.append(otherStats.iloc[i]['2P%'])
    
    return info

print getStats(0)

['Atlanta Hawks', 0.34999999999999998, 0.51200000000000001, 679, 2772, 2326, 4597, 0.33799999999999997, 0.46899999999999997]


Now, we will write a class that combines the stats of two teams into what we'll call a 'matchup'. A matchup consists of the probabilities for the various events we will use to simulate a possession of basketball. These are:
- Three Point Attempt
- Two Point Attempt
- Three Point Made
- Two Point Made
- Offensive Rebound
- Defensive Rebound

These stats will have different values for each team, so that a matchup consists of, in total, 12 different stats. Here's how we calculate these:

### Three and Two Point Attempts

We know from the stats the total number of times a team attempted a three point shot (3PA) and a two point shot (2PA). If we assume that a given game will follow this ratio, it stands to reason that we can calculate the probabilities of each type of shot being attempted as:

$$\text{Three Point Attempt }= \frac{3PA}{3PA + 2PA}$$
$$\text{Two Point Attempt }= \frac{2PA}{3PA + 2PA}$$
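
As a quick check of these two formulas, the sketch below plugs in the Atlanta Hawks' attempt totals from the `getStats(0)` output above (2326 three point attempts and 4597 two point attempts). The function name `attempt_probabilities` is an illustrative assumption and not part of the tutorial's own code.

```python
def attempt_probabilities(three_pa, two_pa):
    """Return (P(three point attempt), P(two point attempt)) from season totals."""
    total = float(three_pa + two_pa)
    return three_pa / total, two_pa / total

# Atlanta Hawks season totals, taken from the getStats(0) output above
p_three, p_two = attempt_probabilities(2326, 4597)
print("%.3f %.3f" % (p_three, p_two))
```

Under this assumption, roughly one in three Hawks shots is a three point attempt, and the two probabilities sum to one by construction.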

### Three and Two Point Attempt Made

We are given a value for a team's success in making a given two or three point shot from the csv, as 3P% and 2P%. However, this does not take into account the defense of the opposing team. To estimate a team's shooting percentage against a certain opponent, we will use a technique known as "regressing towards the mean". This means that we will take the default value of a team's shooting percentage and move it by a fixed ratio towards the defense's average allowed percentage (the 3P% or 2P% from otherStats). Using $3P\%_{off}$ for the offensive rate and $3P\%_{def}$ for the defensive rate, we get the updated percentages as:

$$\text{Three Point Percentage }= 3P\%_{off} - \alpha\cdot(3P\%_{off} - 3P\%_{def})$$
$$\text{Two Point Percentage }= 2P\%_{off} - \alpha\cdot(2P\%_{off} - 2P\%_{def})$$

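where $\alpha$ is a constant between 0 and 1: $\alpha = 0$ leaves the offense's percentage unchanged, and $\alpha = 1$ replaces it with the percentage the defense usually allows.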
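
The adjustment can be sketched as a one-line function. The shooting percentages below are hypothetical numbers chosen for illustration, and the helper name `adjusted_percentage` is an assumption. The default of 0.35 for $\alpha$ matches the default used by the `Matchup` class later in this tutorial.

```python
def adjusted_percentage(off_pct, def_pct, alpha=0.35):
    """Move the offense's shooting percentage a fraction alpha toward
    the percentage the defense usually allows."""
    return off_pct - alpha * (off_pct - def_pct)

# Hypothetical example: a 40% shooting offense against a defense allowing 34%
adjusted = adjusted_percentage(0.40, 0.34)
print("%.3f" % adjusted)
```

Here the offense's 40% is pulled 35% of the way toward 34%, landing at about 37.9%. The same formula can be read as a weighted average, $(1 - \alpha)\cdot 3P\%_{off} + \alpha\cdot 3P\%_{def}$.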

Note that this is a very oversimplified way of calculating the change in a team's shooting percentage. This is deliberate, as a goal of this tutorial is to demonstrate the utility of a highly simple model. If this model were to be improved, we could take every game played against a given team and plot each opponent's season-average shooting percentage against that opponent's shooting percentage in that game. Fitting a regression to these points would give a more accurate idea of the effect a defense can have on shooting percentage.

### Rebounds

To estimate each team's chance of getting rebounds, we will simply compare one team's gross number of offensive rebounds (ORB) to the other team's gross number of defensive rebounds (DRB), like so (with $ORB_1$ and $DRB_1$ being the first team's gross numbers, and $ORB_2$ and $DRB_2$ belonging to the second team):

$$\text{Offensive Rebound Chance for Team 1 }= \frac{ORB_1}{ORB_1 + DRB_2}$$
$$\text{Defensive Rebound Chance for Team 1 }= \frac{DRB_1}{DRB_1 + ORB_2}$$

Team 2's chances are computed with the analogous formulas.
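
To make the rebound formulas concrete, the sketch below uses the ORB and DRB totals for the Atlanta Hawks and Boston Celtics shown in the selfStats output above. The function names are illustrative assumptions, not part of the tutorial's code.

```python
def off_rebound_chance(orb_self, drb_other):
    """Chance the shooting team recovers its own miss."""
    return orb_self / float(orb_self + drb_other)

def def_rebound_chance(drb_self, orb_other):
    """Chance the defending team recovers the other team's miss."""
    return drb_self / float(drb_self + orb_other)

# Season totals from the selfStats output above
# Hawks (team 1): ORB 679, DRB 2772; Celtics (team 2): ORB 950, DRB 2733
hawks_orb = off_rebound_chance(679, 2733)
hawks_drb = def_rebound_chance(2772, 950)
celtics_orb = off_rebound_chance(950, 2772)

print("%.3f %.3f %.3f" % (hawks_orb, hawks_drb, celtics_orb))
```

With these totals the Hawks recover about 20% of their own misses. Their defensive rebound chance (about 0.745) and the Celtics' offensive rebound chance (about 0.255) add up to one, since both describe the same missed Celtics shot.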

Now that those formulas are established, we will build the class to contain all of these. Note that I'm making a class mostly for readability, as we could just as easily pack all of these values into a list as I did with the raw statistics earlier.

Note as well that I do not explicitly store either team's two point attempt percentage or defensive rebound percentage. Each has a counterpart (the three point attempt percentage and the other team's offensive rebound percentage, respectively) that is its complement (e.g. three point attempt + two point attempt = 1), so it is easily recovered.

In [17]:
class Matchup:
    def __init__(self, team1, team2, a=.35):
        self.name1 = team1[0]
        self.name2 = team2[0]
        
        self.threePA1 = float(team1[5]) / (team1[5] + team1[6])
        self.threePA2 = float(team2[5]) / (team2[5] + team2[6])
        
        self.threePP1 = team1[1] - (a * (team1[1] - team2[7]))
        self.threePP2 = team2[1] - (a * (team2[1] - team1[7]))
        
        self.twoPP1 = team1[2] - (a * (team1[2] - team2[8]))
        self.twoPP2 = team2[2] - (a * (team2[2] - team1[8]))
        
        self.orb1 = float(team1[3]) / (team1[3] + team2[4])
        self.orb2 = float(team2[3]) / (team2[3] + team1[4])
        
    def name(self, i):
        if i == 1:
            return self.name1
        else:
            return self.name2
        
    def threePointAttempt(self, i):
        if i == 1:
            return self.threePA1
        else:
            return self.threePA2
        
    def threePointPercent(self, i):
        if i == 1:
            return self.threePP1
        else:
            return self.threePP2
        
    def twoPointPercent(self, i):
        if i == 1:
            return self.twoPP1
        else:
            return self.twoPP2
        
    def offRebound(self, i):
        if i == 1:
            return self.orb1
        else:
            return self.orb2

In [28]:
# make a matchup between the Golden State Warriors and the Cleveland Cavaliers
# with a = .35
gstateAndCavs = Matchup(getStats(9), getStats(5))
print gstateAndCavs.name(1), gstateAndCavs.name(2)
print gstateAndCavs.threePointPercent(1), gstateAndCavs.threePointPercent(2)

Golden State Warriors Cleveland Cavaliers
0.39185 0.3515


## Simulating a Game

We're now ready to simulate a full game. A game is made up of a series of plays, and each play is essentially a Markov chain. A picture of the chain we will be using is given below, along with the transition percentages.

![Image](chain.png)

Note that a made 3 point or 2 point attempt results in a gain of 3 or 2 points respectively for the team which makes it. The second major component of this simulation is the time: we will simulate only 2880 seconds (48 minutes) of play, the same length as an NBA game. Each step of the Markov chain has a time associated with it. The times used here are arbitrary in magnitude but not in relation to each other: for example, a three point attempt will take less time than a two point attempt to set up, but it will take longer to rebound a three point attempt.

Each time will also have a normal random variable added to it, to create variation in the time cost.
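
To get a feel for what adding a normal random variable does, the sketch below draws many samples of a cost of the form `mean + spread * Z`, where `Z` is standard normal. It uses Python's standard `random` module rather than `numpy.random`, and the helper name `time_cost`, the seed, and the sample count are illustrative assumptions. The values 3 and 3 match the three point attempt step in the `play` function.

```python
import random

def time_cost(mean, spread, rng):
    """Seconds used by one step: a fixed mean plus scaled standard normal noise."""
    return mean + spread * rng.gauss(0.0, 1.0)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [time_cost(3, 3, rng) for _ in range(100000)]

average = sum(samples) / len(samples)
negative_share = sum(1 for s in samples if s < 0) / float(len(samples))
print("average cost: %.2f s, negative draws: %.1f%%" % (average, 100 * negative_share))
```

The average stays close to the mean of 3 seconds, but because normal noise is unbounded, roughly 16% of draws with these parameters come out negative (the chance that a standard normal falls below -1). In the simulation, such a draw briefly adds time back to the clock, which is one of the simplifications this model accepts.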

When all 2880 seconds have been exhausted, the score will be checked. The winner is the team with the larger score. If the score is tied, 300 seconds (five minutes) of additional time will be played, in the same way that real games can go to overtime.

In [38]:
import numpy.random as nran
from collections import Counter

def play(match, time=2880, score1=0, score2=0):

    pos = nran.choice([1,2])
    
    while (time > 0):
        
        time -= (5 + nran.normal())
        
        # taking a three point shot
        if (nran.uniform() <= match.threePointAttempt(pos)):
            
            time -= (3 + 3 * nran.normal())
            
            # we made it
            if (nran.uniform() <= match.threePointPercent(pos)):
                
                # possession changes and score
                if pos == 1:
                    score1 += 3
                    pos = 2
                    
                else:
                    score2 += 3
                    pos = 1
                    
            # we miss
            else:
                
                time -= (4 + 2 * nran.normal())
                
                # we get the rebound
                if (nran.uniform() <= match.offRebound(pos)):
                    
                    # we still have possession
                    pass
                
                # other team gets rebound
                else:
                    
                    # possession changes
                    if pos == 1:
                        pos = 2
                    
                    else:
                        pos = 1
                        
        # taking a two point shot
        else:
            
            time -= (5 + 2 * nran.normal())
            
            # we made it
            if (nran.uniform() <= match.twoPointPercent(pos)):
                
                # possession changes and score
                if pos == 1:
                    score1 += 2
                    pos = 2
                    
                else:
                    score2 += 2
                    pos = 1
                    
            # we miss
            else:
                
                time -= (2 + 2 * nran.normal())
                
                # we get the rebound
                if (nran.uniform() <= match.offRebound(pos)):
                    
                    # we still have possession
                    pass
                
                # other team gets rebound
                else:
                    
                    # possession changes
                    if pos == 1:
                        pos = 2
                    
                    else:
                        pos = 1
                        
        # check if we need to play more
        if (time <= 0) and (score1 == score2): 
            time += 300
            
    # the game is over
    if (score1 > score2):
        return 1
    
    else:
        return 2
    
def winPerc(match, iters=1000, time=2880, score1=0, score2=0):
    
    results = [play(match, time, score1, score2) for i in range(0, iters)]
    tally = Counter(results)
    
    return [tally[1] / float(iters), tally[2] / float(iters)]

In [39]:
winPerc(gstateAndCavs)

[0.658, 0.342]

From this, we can see that the model predicts the Warriors to have about a two-thirds chance to win over the Cavaliers.
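
Since the estimate comes from a finite number of random simulations, it carries some sampling noise. A rough sketch of its size, assuming the simulated games are independent and using the usual normal approximation for a proportion, is below. The function name `standard_error` is an illustrative assumption.

```python
import math

def standard_error(p, n):
    """Standard error of a proportion estimated from n independent simulations."""
    return math.sqrt(p * (1.0 - p) / n)

# Estimated Warriors win share from the run above, with 1000 iterations
p_hat, n = 0.658, 1000
se = standard_error(p_hat, n)
low, high = p_hat - 1.96 * se, p_hat + 1.96 * se
print("standard error: %.3f, approx. 95%% interval: %.3f to %.3f" % (se, low, high))
```

With 1000 iterations the standard error is about 0.015, so repeated runs can easily differ by a few percentage points. Raising the count to 10000 iterations shrinks it to about 0.005. This only measures simulation noise; it says nothing about how well the model's assumptions match real basketball.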

## Percent Over Time

Because of the way the model is set up, we can alter the initial parameters and see what effect they have on the game. For instance, we can simulate the game with 5 minutes left and one team leading 88 to 82. This means we can simulate a single game from multiple points, in effect giving a continuously updated win percentage.

The data below is taken from Game 1 of the 2016 NBA Finals, which was between the Golden State Warriors and the Cleveland Cavaliers. Each entry contains the time remaining in seconds, followed by the Warriors' score and the Cavaliers' score, at two-minute intervals.

In [45]:
gameData = [
            [2880,0,0],[2760,5,5],[2640,11,9],[2520,16,13],[2400,21,17],[2280,24,19],[2160,28,24],
            [2040,32,26],[1920,39,28],[1800,43,30],[1680,47,36],[1560,49,41],[1440,52,43],
            [1320,54,45],[1200,56,52],[1080,61,57],[960,63,62],[840,67,68],[720,74,68],
            [600,82,70],[480,88,74],[360,94,76],[240,96,82],[120,104,87],[0,104,89]
            ]
percents = map(lambda x: winPerc(gstateAndCavs, 10000, time=x[0], score1=x[1], score2=x[2]), gameData)

In [46]:
percents

[[0.6655, 0.3345],
 [0.6605, 0.3395],
 [0.6972, 0.3028],
 [0.7121, 0.2879],
 [0.7346, 0.2654],
 [0.7533, 0.2467],
 [0.7272, 0.2728],
 [0.7764, 0.2236],
 [0.8618, 0.1382],
 [0.8907, 0.1093],
 [0.8686, 0.1314],
 [0.8127, 0.1873],
 [0.8331, 0.1669],
 [0.8441, 0.1559],
 [0.727, 0.273],
 [0.7343, 0.2657],
 [0.6356, 0.3644],
 [0.5574, 0.4426],
 [0.8074, 0.1926],
 [0.9458, 0.0542],
 [0.9815, 0.0185],
 [0.9989, 0.0011],
 [0.9968, 0.0032],
 [1.0, 0.0],
 [1.0, 0.0]]

Let's plot these and see how the Golden State Warriors' chance of winning changed over time. (Note that the code below does not run in-screen, and will open a popup window.)

In [49]:
import matplotlib.pyplot as plt

gsPercs = [i[0] for i in percents]
cavPercs = [i[1] for i in percents]
times = [2880 - i[0] for i in gameData]

plt.plot(times, gsPercs, 'r-')
plt.plot(times, cavPercs, 'b-')
plt.axis([0, 2880, 0, 1])
plt.ylabel('Win Percentage')
plt.xlabel('Time Elapsed')
plt.title('GState versus Cavs')
plt.show()

![title](plot.png)

The chart above shows the Warriors in red and the Cavaliers in blue. Comparing the probabilities to the box score shows that they respond exactly as you'd expect to changes in the score: the lines draw close when the Cavaliers gain a 1 point lead near the end of the third quarter, and fly apart when the Warriors score 7 unanswered points to end the quarter.

This demonstrates the level of predictive capability that can be reached with a very simple model of a complicated event. While some decisions and numerical values included here are somewhat arbitrary and could be fine-tuned by comparing real-life results to the model's predictions, there is something to be said for the fact that, even with such caveats, we were easily able to obtain a model that gives results in line with what we would intuitively expect.