# Summary
2019-03-29

The analysis will attempt to choose the winner of the 2019 NCAA Men's Basketball tournament by ranking teams using the [PageRank](https://en.wikipedia.org/wiki/PageRank) algorithm based on their records before the tournament started.

Many thanks go out to:
- https://www.youtube.com/watch?v=F5fcEtqysGs&list=WL&index=50&t=0s for PageRank Tutorial
- https://www.printyourbrackets.com/fillable-ncaa-tournament-bracket.html for customizable bracket
- https://www.sports-reference.com/cbb/ for team season records

In [1]:
#load needed libraries
import pandas as pd
import numpy as np

### Load Data

In [2]:
data = pd.read_csv('NCAA_dataset.csv', header=0)
data.head()

Unnamed: 0,tm1,tm2,gm_type,s1,s2,w1,w2
0,abilene-christian,arlington-baptist,REG,107,54,1,0
1,abilene-christian,arkansas-state,REG,94,73,1,0
2,abilene-christian,denver,REG,67,61,1,0
3,abilene-christian,elon,REG,72,56,1,0
4,abilene-christian,pacific,REG,73,71,1,0


In [3]:
#get a complete list of unique teams in dataset
all_teams = set(data['tm1'].unique()) | set(data['tm2'].unique())
all_teams = list(all_teams)
len(all_teams)

391

### Start Analysis

Outline Steps:
1. Create a numpy matrix of 391 X 391 because there are 391 teams.
2. We will count every loss as an outgoing link for that team and ignore wins to avoid double counting. However, because our dataset holds full season records only for the teams that made the NCAA tournament, not for all D1 teams, this approach leaves the records of non-tournament teams incomplete. The columns of the matrix will represent the losses (outgoing links) a team had, and the rows will represent the wins a team had.
3. Each column will be normalized by dividing it by its sum.
4. Using the matrix, compute the rank with the PageRank algorithm with damping. Use an iterative approach, stopping when the values become stable and change only slightly between iterations.
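
The four steps above can be sketched on a tiny made-up season. Everything in this block is hypothetical (the three teams `a`, `b`, `c`, their results, and the iteration cap of 100); it is only meant to show how a loss becomes a column entry and how the damped iteration settles, not to reproduce the analysis below.

```python
import numpy as np

# Hypothetical toy season: (winner, loser) pairs among three made-up teams.
games = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
teams = ["a", "b", "c"]
key = {team: i for i, team in enumerate(teams)}
n = len(teams)

# Steps 1-2: each loss is a link from the loser (column) to the winner (row).
m = np.zeros((n, n))
for winner, loser in games:
    m[key[winner], key[loser]] += 1

# Step 3: divide each column by its sum, skipping all-zero columns.
col_sums = m.sum(axis=0)
nonzero = col_sums > 0
m[:, nonzero] /= col_sums[nonzero]

# Step 4: damped power iteration until the update is tiny.
d = 0.9
r = np.full((n, 1), 1 / n)
for _ in range(100):
    r_next = (1 - d) / n + d * (m @ r)
    converged = np.square(r_next - r).sum() <= 1e-9
    r = r_next
    if converged:
        break

# Teams sorted from highest to lowest rank.
ranking = sorted(zip(teams, r.ravel()), key=lambda pair: pair[1], reverse=True)
print(ranking)
```

In this toy season team `a` comes out on top even though it lost once, because its two wins bring in more rank than its single loss gives away.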

##### Create Matrix

In [4]:
l = len(all_teams)
m = np.zeros((l, l))
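#map each team name to its row/column index
key = dict(zip(all_teams, range(l)))

#m[winner, loser] += 1: a loss by tm1 is a link from tm1's column to tm2's row
for i in range(len(data)):
    if data['w2'][i] == 1:
        m[ key[data['tm2'][i]], key[data['tm1'][i]] ] += 1
#normalize each column to sum to 1; all-zero columns stay zero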
t = m.sum(axis=0)
for i in range(len(t)):
    if t[i] == 0:
        continue
    else:
        m[:, i] = m[:, i] / t[i]

Our dataset contains the records of only the teams in the tournament. To avoid double counting, we can track either the total number of wins or the total number of losses for those teams. This avoids having to write some sort of hashing function that recognizes when a game appears twice in the dataset, once for each team. Because we track only losses, the total number of wins is not correct for some teams.
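
For comparison, here is a rough sketch of what such a game-matching key could look like. It is not used anywhere in this analysis. The rows and team names are made up, and `game_key` is a hypothetical helper. It also assumes that two listings of the same game share the same pair of (team, score) values, so two meetings that ended with identical scores would wrongly collapse into one (a date column, which the data shown here does not have, would be needed to tell them apart).

```python
import pandas as pd

# Hypothetical rows: the first two list the same game from each team's side.
games = pd.DataFrame({
    "tm1": ["team-a", "team-b", "team-c"],
    "tm2": ["team-b", "team-a", "team-d"],
    "s1": [72, 70, 80],
    "s2": [70, 72, 78],
})

def game_key(row):
    # Sort the (team, score) pairs so both listings of a game map to the same key.
    return tuple(sorted([(row["tm1"], row["s1"]), (row["tm2"], row["s2"])]))

games["game_key"] = games.apply(game_key, axis=1)
unique_games = games.drop_duplicates(subset="game_key").reset_index(drop=True)
print(len(games), "rows ->", len(unique_games), "unique games")
```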

##### Apply PageRank Algo.

In [5]:
#initialize rank vector
r = np.ones((l, 1)) * 1/l
d = .9
#iterate, adjusting the rank until it is stable
for i in range(50):
    r_init = r
    r = (1 - d)/l + d*(np.dot(m, r))
    ssd = np.square(r_init - r).sum() #sum of squared differences from prev. iteration
    print(ssd)
    if ssd <= 1e-09:
        break

0.00158332945869
3.08594808598e-05
5.70494925626e-06
2.208947065e-06
9.4994807279e-07
4.50326316718e-07
2.17147139724e-07
1.05316962055e-07
5.17674182282e-08
2.55042613318e-08
1.25644375992e-08
6.19727863408e-09
3.05682699302e-09
1.50730982062e-09
7.43192275285e-10
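
One side effect of the update rule above: a column of all zeros (a team with no recorded losses) passes no rank on, so some rank is lost at every iteration and the entries of `r` can sum to less than 1. A common variant of PageRank spreads the rank held by such "dangling" columns evenly over all teams. The sketch below shows that variant on a toy matrix. It is not what this notebook does, and the function name and its defaults are assumptions made for illustration.

```python
import numpy as np

def pagerank_with_dangling(m, d=0.9, tol=1e-9, max_iter=100):
    """Power iteration that spreads rank from all-zero columns evenly.

    m is assumed to be column-normalized, with all-zero columns allowed.
    """
    n = m.shape[0]
    dangling = m.sum(axis=0) == 0      # columns with no outgoing links
    r = np.full(n, 1 / n)
    for _ in range(max_iter):
        leaked = r[dangling].sum()     # rank currently held by dangling teams
        r_next = (1 - d) / n + d * (m @ r + leaked / n)
        if np.square(r_next - r).sum() <= tol:
            return r_next
        r = r_next
    return r

# Toy matrix: team 2 has no recorded losses, so its column is all zeros.
m_toy = np.array([[0.0, 0.5, 0.0],
                  [0.0, 0.0, 0.0],
                  [1.0, 0.5, 0.0]])
r_toy = pagerank_with_dangling(m_toy)
print(r_toy, r_toy.sum())
```

With this variant the ranks keep summing to 1, which makes values easier to compare across runs with different numbers of games.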


##### Put Sorted Results in DataFrame

In [6]:
z = zip(all_teams, r.ravel())
df = pd.DataFrame(list(z), columns=['team', 'pr'])
df.sort_values(by=['pr'], ascending=False).head(30)

Unnamed: 0,team,pr
207,duke,0.002235
298,north-carolina,0.001802
17,kansas,0.001357
98,kentucky,0.001356
326,virginia,0.001222
52,florida-state,0.001217
370,michigan-state,0.001212
160,tennessee,0.001196
371,michigan,0.001022
69,texas,0.000983


##### Run Tournament

Based on the rankings, run the tournament to see who wins! The higher-ranked team in each game moves on to the next round.

In [7]:
east = ['duke', 'north-dakota-state', 'virginia-commonwealth', 'central-florida', 
        'mississippi-state', 'liberty', 'virginia-tech', 'saint-louis', 
        'maryland', 'belmont', 'louisiana-state', 'yale', 
        'louisville', 'minnesota', 'michigan-state', 'bradley']
west = ['gonzaga', 'fairleigh-dickinson', 'syracuse', 'baylor', 
        'marquette', 'murray-state', 'florida-state', 'vermont',
        'buffalo', 'arizona-state', 'texas-tech', 'northern-kentucky',
        'nevada', 'florida', 'michigan', 'montana']
south = ['virginia', 'gardner-webb', 'mississippi', 'oklahoma', 
         'wisconsin', 'oregon', 'kansas-state', 'california-irvine',
         'villanova', 'saint-marys-ca', 'purdue', 'old-dominion',
         'cincinnati', 'iowa', 'tennessee', 'colgate']
midwest = ['north-carolina', 'iona', 'utah-state', 'washington',
          'auburn', 'new-mexico-state', 'kansas', 'northeastern', 
           'iowa-state', 'ohio-state', 'houston', 'georgia-state', 
           'wofford', 'seton-hall', 'kentucky', 'abilene-christian'
          ]

In [8]:
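def run_tournament_round(list_of_teams=None):
    #pair adjacent teams; the higher PageRank advances (ties go to the second team)
    list_of_teams = list_of_teams or []
    next_round = []
    for i in range(0, len(list_of_teams), 2):
        #.iloc[0] pulls the single matching value out of the filtered Series
        r1 = float(df.loc[df.team == list_of_teams[i], 'pr'].iloc[0])
        r2 = float(df.loc[df.team == list_of_teams[i+1], 'pr'].iloc[0])
        if r1 > r2:
            next_round.append(list_of_teams[i])
        else:
            next_round.append(list_of_teams[i+1])
    return next_round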

def run_tournament(left_bracket, right_bracket):
    r_l = [left_bracket]
    r_r = [right_bracket]
    for i in range(5):
        r_l.append(run_tournament_round(r_l[i]))
        r_r.append(run_tournament_round(r_r[i]))
    return r_l, r_r, run_tournament_round(r_l[5] + r_r[5])

The tournament function returns a tuple with 3 items. The first 2 are lists of lists that hold the predicted winners of each round on the left and right sides of the tournament bracket. The final item is a one-element list holding the name of the predicted overall winner.
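
To make the shape of that return value concrete, here is a small stand-alone sketch on an 8-team bracket. The team names and rank values are made up, `play_round` and `play_bracket` are hypothetical stand-ins for the functions above, and both sides are assumed to hold the same power-of-two number of teams. Ranks come from a plain dict rather than the DataFrame.

```python
def play_round(teams, ranks):
    """Pair adjacent teams and keep the higher-ranked team from each pair."""
    return [a if ranks[a] > ranks[b] else b
            for a, b in zip(teams[::2], teams[1::2])]

def play_bracket(left, right, ranks):
    """Return per-round survivors for each side plus the overall winner."""
    rounds_left, rounds_right = [left], [right]
    while len(rounds_left[-1]) > 1:
        rounds_left.append(play_round(rounds_left[-1], ranks))
        rounds_right.append(play_round(rounds_right[-1], ranks))
    champion = play_round(rounds_left[-1] + rounds_right[-1], ranks)
    return rounds_left, rounds_right, champion

# Made-up ranks for an 8-team bracket (4 teams per side).
ranks = {"a": 0.9, "b": 0.1, "c": 0.4, "d": 0.5,
         "e": 0.3, "f": 0.2, "g": 0.8, "h": 0.7}
left, right, champion = play_bracket(["a", "b", "c", "d"],
                                     ["e", "f", "g", "h"], ranks)
print(left)      # [['a', 'b', 'c', 'd'], ['a', 'd'], ['a']]
print(right)     # [['e', 'f', 'g', 'h'], ['e', 'g'], ['g']]
print(champion)  # ['a']
```

Each inner list is one round, starting with the full field, so slicing off the first entries (as done with `[2:]` below) skips the early rounds.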

In [9]:
o = run_tournament(east + west, south + midwest)
#Display predicted winners of the sweet 16, elite 8, final 4, and champion
o[0][2:], o[1][2:], o[2]

([['duke',
   'virginia-tech',
   'louisiana-state',
   'michigan-state',
   'gonzaga',
   'florida-state',
   'texas-tech',
   'michigan'],
  ['duke', 'michigan-state', 'florida-state', 'michigan'],
  ['duke', 'florida-state'],
  ['duke']],
 [['virginia',
   'kansas-state',
   'purdue',
   'tennessee',
   'north-carolina',
   'kansas',
   'iowa-state',
   'kentucky'],
  ['virginia', 'tennessee', 'north-carolina', 'kentucky'],
  ['virginia', 'north-carolina'],
  ['north-carolina']],
 ['duke'])

### Result

The predicted overall winner is Duke. This is no big surprise, as they were the No. 1 overall seed coming into the tournament. I started this analysis while teams were playing to make the Elite 8 and finished just before the Final 4 games started. Of the 60 games completed so far, this version of the algorithm predicted 37 correctly (about 62%). (To see the full predicted bracket, [click here](https://github.com/stubberf/NCAATour2019/blob/master/fillable-march-madness-V1.pdf)!) Not perfect, but better than 50%. Additionally, under the common scoring system this bracket would beat the [average bracket](https://www.ncaa.com/news/basketball-men/2019-02-27/march-madness-how-do-your-past-brackets-stack-competition) over the last 3 years. I consider that performance pretty good. Based on the current algorithm and the teams in the Final 4, Michigan-State and Virginia are predicted (see below) to move on to the final round, with Virginia ranked higher than Michigan-State by the slimmest of margins (0.001222 vs. 0.001212). Let’s see if these predictions are right.

Possible improvements:
 1. Run the analysis again with the tournament game results through to the Final 4.
 2. Add simulation to the analysis: take the standard deviation of the total points a team scores, randomly add or subtract it from the game scores, and add those simulated games to the analysis. This is an attempt to estimate who is likely to win if the same two teams played multiple times.
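
The simulation idea in item 2 is only outlined, so the following is one possible reading of it rather than a settled design. It uses the five rows shown in the data preview, perturbs each score with normal noise scaled by that team's own score standard deviation, and recomputes the winner flags. The function name `simulate_games`, the choice of normal noise, and the fixed seed are all assumptions.

```python
import numpy as np
import pandas as pd

# The five rows shown in the data preview earlier in this notebook.
games = pd.DataFrame({
    "tm1": ["abilene-christian"] * 5,
    "tm2": ["arlington-baptist", "arkansas-state", "denver", "elon", "pacific"],
    "s1": [107, 94, 67, 72, 73],
    "s2": [54, 73, 61, 56, 71],
    "w1": [1, 1, 1, 1, 1],
    "w2": [0, 0, 0, 0, 0],
})

def simulate_games(games, rng):
    """Return a perturbed copy of games with the winner flags recomputed."""
    sim = games.copy()
    # Per-team spread of points scored; teams with a single game get 0.
    sd1 = games.groupby("tm1")["s1"].transform("std").fillna(0.0)
    sd2 = games.groupby("tm2")["s2"].transform("std").fillna(0.0)
    sim["s1"] = games["s1"] + rng.normal(size=len(games)) * sd1
    sim["s2"] = games["s2"] + rng.normal(size=len(games)) * sd2
    sim["w1"] = (sim["s1"] > sim["s2"]).astype(int)
    sim["w2"] = 1 - sim["w1"]
    return sim

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
sim = simulate_games(games, rng)
print(sim[["tm1", "tm2", "s1", "s2", "w1", "w2"]])
```

Rows produced this way could then be appended to the dataset before the matrix is built, so that close games sometimes flip and lopsided games rarely do.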
 

In [10]:
df.loc[(df.team=='texas-tech')|(df.team=='michigan-state'),:]

Unnamed: 0,team,pr
193,texas-tech,0.000708
370,michigan-state,0.001212


In [11]:
df.loc[(df.team=='virginia')|(df.team=='auburn'),:]

Unnamed: 0,team,pr
211,auburn,0.000924
326,virginia,0.001222


In [12]:
df.loc[(df.team=='virginia')|(df.team=='michigan-state'),:]

Unnamed: 0,team,pr
326,virginia,0.001222
370,michigan-state,0.001212


### Improvement Idea #1

Let's try improvement idea #1: re-run the analysis with the tournament game results through to the Final 4.

In [13]:
l = len(all_teams)
m = np.zeros((l, l))
key = dict(zip(all_teams, range(l)))

for i in range(len(data)):
    if data['w2'][i] == 1:
        m[ key[data['tm2'][i]], key[data['tm1'][i]] ] += 1


In [14]:
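#Add in the results of the tournament through the Elite Eight (the games that decided the Final Four)
#note: the entries below are indexed m[loser, winner], the reverse of the
#m[winner, loser] order used when the matrix was built from the dataset above
#first round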
m[ key['north-dakota-state'], key['duke']] += 1
m[ key['virginia-commonwealth'], key['central-florida']] += 1
m[ key['mississippi-state'], key['liberty']] += 1
m[ key['saint-louis'], key['virginia-tech']] += 1
m[ key['belmont'], key['maryland']] += 1
m[ key['yale'], key['louisiana-state']] += 1
m[ key['louisville'], key['minnesota']] += 1
m[ key['bradley'], key['michigan-state']] += 1
m[ key['fairleigh-dickinson'], key['gonzaga']] += 1
m[ key['syracuse'], key['baylor']] += 1
m[ key['marquette'], key['murray-state']] += 1
m[ key['vermont'], key['florida-state']] += 1
m[ key['northern-kentucky'], key['texas-tech']] += 1
m[ key['arizona-state'], key['buffalo']] += 1
m[ key['nevada'], key['florida']] += 1
m[ key['montana'], key['michigan']] += 1
m[ key['gardner-webb'], key['virginia']] += 1
m[ key['mississippi'], key['oklahoma']] += 1
m[ key['wisconsin'], key['oregon']] += 1
m[ key['kansas-state'], key['california-irvine']] += 1
m[ key['saint-marys-ca'], key['villanova']] += 1
m[ key['old-dominion'], key['purdue']] += 1
m[ key['cincinnati'], key['iowa']] += 1
m[ key['colgate'], key['tennessee']] += 1
m[ key['iona'], key['north-carolina']] += 1
m[ key['utah-state'], key['washington']] += 1
m[ key['new-mexico-state'], key['auburn']] += 1
m[ key['northeastern'], key['kansas']] += 1
m[ key['iowa-state'], key['ohio-state']] += 1
m[ key['georgia-state'], key['houston']] += 1
m[ key['seton-hall'], key['wofford']] += 1
m[ key['abilene-christian'], key['kentucky']] += 1

#second-round
m[ key['central-florida'], key['duke']] += 1
m[ key['liberty'], key['virginia-tech']] += 1
m[ key['maryland'], key['louisiana-state']] += 1
m[ key['minnesota'], key['michigan-state']] += 1
m[ key['baylor'], key['gonzaga']] += 1
m[ key['murray-state'], key['florida-state']] += 1
m[ key['buffalo'], key['texas-tech']] += 1
m[ key['florida'], key['michigan']] += 1
m[ key['oklahoma'], key['virginia']] += 1
m[ key['california-irvine'], key['oregon']] += 1
m[ key['villanova'], key['purdue']] += 1
m[ key['iowa'], key['tennessee']] += 1
m[ key['washington'], key['north-carolina']] += 1
m[ key['kansas'], key['auburn']] += 1
m[ key['ohio-state'], key['houston']] += 1
m[ key['wofford'], key['kentucky']] += 1

#sweet-sixteen
m[ key['virginia-tech'], key['duke']] += 1
m[ key['louisiana-state'], key['michigan-state']] += 1
m[ key['florida-state'], key['gonzaga']] += 1
m[ key['michigan'], key['texas-tech']] += 1
m[ key['oregon'], key['virginia']] += 1
m[ key['tennessee'], key['purdue']] += 1
m[ key['north-carolina'], key['auburn']] += 1
m[ key['houston'], key['kentucky']] += 1

#elite-eight (winners advance to the Final Four)
m[ key['duke'], key['michigan-state']] += 1
m[ key['gonzaga'], key['texas-tech']] += 1
m[ key['purdue'], key['virginia']] += 1
m[ key['kentucky'], key['auburn']] += 1
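
A long run of one-line updates like this can also be written as a list of (winner, loser) pairs. The sketch below is an assumption about how that might look, not the code that produced the outputs in this notebook. `add_results` is a hypothetical helper, the four-team index is a small stand-in for the full `key` dictionary, and entries are written as `m[winner, loser]`, the order used when the matrix was first built from the dataset.

```python
import numpy as np

def add_results(m, key, results):
    """Add one link per game from the loser's column to the winner's row."""
    for winner, loser in results:
        m[key[winner], key[loser]] += 1
    return m

# Small stand-in for the full team index used in the analysis.
teams = ["duke", "north-dakota-state", "central-florida", "virginia-commonwealth"]
key_small = {team: i for i, team in enumerate(teams)}
m_small = np.zeros((len(teams), len(teams)))

# (winner, loser) pairs from the 2019 first and second rounds.
results = [
    ("duke", "north-dakota-state"),
    ("central-florida", "virginia-commonwealth"),
    ("duke", "central-florida"),
]
add_results(m_small, key_small, results)
print(m_small.sum(axis=0))  # losses recorded per team (column sums)
print(m_small.sum(axis=1))  # wins recorded per team (row sums)
```

Naming the two roles in each tuple makes it harder to swap the row and column index by accident.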

In [15]:
#normalize the matrix
t = m.sum(axis=0)
for i in range(len(t)):
    if t[i] == 0:
        continue
    else:
        m[:, i] = m[:, i] / t[i]

In [16]:
#initialize rank vector
r = np.ones((l, 1)) * 1/l
d = .9
#iterate, adjusting the rank until it is stable
for i in range(50):
    r_init = r
    r = (1 - d)/l + d*(np.dot(m, r))
    ssd = np.square(r_init - r).sum() #sum of squared differences from prev. iteration
    print(ssd)
    if ssd <= 1e-09:
        break

0.00155734155089
2.74839458129e-05
5.85176562218e-06
2.12569530933e-06
8.29510750609e-07
3.34128753113e-07
1.34729484843e-07
5.43144276233e-08
2.19511634579e-08
8.86928935211e-09
3.58404953622e-09
1.44864429496e-09
5.85549539323e-10


In [17]:
z = zip(all_teams, r.ravel())
df2 = pd.DataFrame(list(z), columns=['team', 'pr'])
df2.sort_values(by=['pr'], ascending=False).head(30)

Unnamed: 0,team,pr
207,duke,0.001336
17,kansas,0.001216
370,michigan-state,0.00109
298,north-carolina,0.001066
98,kentucky,0.001016
371,michigan,0.000939
54,iowa-state,0.000932
205,houston,0.000923
326,virginia,0.000921
52,florida-state,0.00091


With the game results up to the Final Four included in the analysis, Duke is still ranked number 1. Notably, however, North Carolina has dropped from 2nd to 4th and Michigan-State has climbed from 7th to 3rd.
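
Position changes like these can be computed rather than read off the tables. The sketch below copies the values for five teams from the two tables above, so the positions it reports are relative to that five-team subset rather than the full field. `add_position` is a hypothetical helper.

```python
import pandas as pd

# Values copied from the two ranking tables above (five teams present in both).
before = pd.DataFrame({
    "team": ["duke", "north-carolina", "kansas", "kentucky", "michigan-state"],
    "pr": [0.002235, 0.001802, 0.001357, 0.001356, 0.001212],
})
after = pd.DataFrame({
    "team": ["duke", "kansas", "michigan-state", "north-carolina", "kentucky"],
    "pr": [0.001336, 0.001216, 0.001090, 0.001066, 0.001016],
})

def add_position(frame):
    """Add a 1-based position column, 1 being the highest PageRank."""
    out = frame.copy()
    out["pos"] = out["pr"].rank(ascending=False, method="first").astype(int)
    return out

moves = add_position(before).merge(add_position(after), on="team",
                                   suffixes=("_before", "_after"))
moves["change"] = moves["pos_before"] - moves["pos_after"]  # positive = moved up
print(moves[["team", "pos_before", "pos_after", "change"]])
```

Passing the full `df` and `df2` frames in place of the two small ones would give the change in position for every team.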


Let's look at the Final Four matchup with this new data.

In [18]:
df2.loc[(df2.team=='texas-tech')|(df2.team=='michigan-state'),:]

Unnamed: 0,team,pr
193,texas-tech,0.000703
370,michigan-state,0.00109


In [19]:
df2.loc[(df2.team=='virginia')|(df2.team=='auburn'),:]

Unnamed: 0,team,pr
211,auburn,0.000726
326,virginia,0.000921


In [20]:
df2.loc[(df2.team=='virginia')|(df2.team=='michigan-state'),:]

Unnamed: 0,team,pr
326,virginia,0.000921
370,michigan-state,0.00109


__Michigan-State and Virginia are still predicted to reach the finals. However, now Michigan-State is the favorite to win it all!__