# Summary
2019-03-29

Perform PageRank on NCAA data

Useful websites:
- https://www.youtube.com/watch?v=F5fcEtqysGs&list=WL&index=50&t=0s
- https://en.wikipedia.org/wiki/PageRank
- https://www.printyourbrackets.com/fillable-ncaa-tournament-bracket.html
- https://www.sports-reference.com/cbb/

In [1]:
# load needed libraries
import pandas as pd
import numpy as np

### Load Data

In [2]:
data = pd.read_csv('NCAA_dataset.csv', header=0)
data.head()

Unnamed: 0,tm1,tm2,gm_type,s1,s2,w1,w2
0,abilene-christian,arlington-baptist,REG,107,54,1,0
1,abilene-christian,arkansas-state,REG,94,73,1,0
2,abilene-christian,denver,REG,67,61,1,0
3,abilene-christian,elon,REG,72,56,1,0
4,abilene-christian,pacific,REG,73,71,1,0


In [3]:
#get a complete list of unique teams in dataset
all_teams = set(data['tm1'].unique()) | set(data['tm2'].unique())
all_teams = list(all_teams)
len(all_teams)

391

### Start Analysis

Outline Steps:
1. Create a 391 x 391 NumPy matrix, with one row and one column per team.
2. Count every loss as an outgoing link from the losing team to the winning team. Wins are ignored to avoid double counting. Because the dataset contains records only for the teams that made the NCAA tournament, not for all D1 basketball teams, the teams outside the tournament will not have a full record. Each column of the matrix holds a team's losses (its outgoing links), and each row holds a team's wins.
3. Normalize each column by its sum.
4. Using the matrix, compute the ranks with the PageRank algorithm with damping. Use an iterative approach, stopping when the values become stable and change only slightly between iterations.
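
As a compact restatement of steps 2 to 4, the update can be written as follows. The symbols $A$, $M$, $N$, $\mathbf{1}$, $t$ and $\varepsilon$ are notation introduced here for illustration; $d$ is the damping factor.

$$
M_{ij} = \frac{A_{ij}}{\sum_{k} A_{kj}}, \qquad
r^{(t+1)} = \frac{1-d}{N}\,\mathbf{1} + d\,M\,r^{(t)}, \qquad
\text{stop when } \sum_{i}\left(r^{(t+1)}_i - r^{(t)}_i\right)^2 \le \varepsilon
$$

Here $A_{ij}$ is the number of counted games in which team $i$ beat team $j$, $N$ is the number of teams, and $r^{(0)}_i = 1/N$. A column of $A$ whose sum is zero (a team with no counted losses) is left as all zeros.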

##### Create Matrix

In [68]:
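l = len(all_teams)
m = np.zeros((l, l))
# map each team name to its row/column index
key = dict(zip(all_teams, range(l)))

# each loss by tm1 is a link from tm1 (column) to the winner tm2 (row)
for i in range(len(data)):
    if data['w2'][i] == 1:
        m[key[data['tm2'][i]], key[data['tm1'][i]]] += 1

# normalize each column by its total; columns with no losses stay all zero
t = m.sum(axis=0)
for i in range(len(t)):
    if t[i] != 0:
        m[:, i] = m[:, i] / t[i]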

Our dataset contains the records of only those teams that are in the tournament, so a game between two of those teams can show up twice, once in each team's record. To avoid double counting, we can track either the wins or the losses of those teams, but not both. This avoids having to write some sort of hashing function that recognizes that two rows describe the same game. The cost of tracking only losses is that the total number of wins is not correct for some teams.
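
For comparison, the "hashing function" route could look something like the sketch below. It is not what this notebook does. The toy rows are made up, the helper names (`keys`, `unique_games`) are hypothetical, and it assumes that a game can be identified by its two (team, score) pairs, which would wrongly merge two meetings of the same teams that ended with identical scores.

```python
import pandas as pd

# Toy rows in the notebook's column layout (made-up scores).
# Rows 0 and 1 are the same game, listed once from each team's side.
games = pd.DataFrame({
    'tm1': ['duke', 'kansas', 'duke'],
    'tm2': ['kansas', 'duke', 'virginia'],
    's1':  [80, 75, 70],
    's2':  [75, 80, 65],
})

# Order-independent key: sort the two (team, score) pairs so that
# both listings of a game produce the same tuple.
keys = pd.Series(
    [tuple(sorted([(a, x), (b, y)]))
     for a, b, x, y in zip(games['tm1'], games['tm2'], games['s1'], games['s2'])],
    index=games.index,
)

# Keep only the first listing of every game.
unique_games = games[~keys.duplicated()]
print(unique_games)
```

With deduplicated games, both wins and losses could be counted without counting any game twice.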

##### Apply PageRank Algo.

In [70]:
# initialize rank vector uniformly
r = np.ones((l, 1)) / l
d = .9  # damping factor
# iterate, adjusting the ranks until they are stable
for i in range(50):
    r_init = r
    r = (1 - d)/l + d*(np.dot(m, r))
    ssd = np.square(r_init - r).sum()  # sum of squared differences from previous iteration
    print(ssd)
    if ssd <= 1e-09:
        break

0.00158332945869
3.08594808598e-05
5.70494925626e-06
2.208947065e-06
9.4994807279e-07
4.50326316718e-07
2.17147139724e-07
1.05316962055e-07
5.17674182282e-08
2.55042613318e-08
1.25644375992e-08
6.19727863408e-09
3.05682699302e-09
1.50730982062e-09
7.43192275285e-10
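
To see the iteration on something small enough to check by hand, here is a minimal sketch that wraps the same update in a function and runs it on a made-up three-team season. The function name `pagerank`, its parameters, and the toy `wins` matrix are assumptions for illustration only.

```python
import numpy as np

def pagerank(m, d=0.9, tol=1e-9, max_iter=50):
    """Damped PageRank on a win matrix (rows = winners, columns = losers)."""
    n = m.shape[0]
    col_sums = m.sum(axis=0)
    # Column-normalize; columns that sum to zero are left as all zeros.
    m = np.divide(m, col_sums, out=np.zeros_like(m, dtype=float),
                  where=col_sums != 0)
    r = np.full((n, 1), 1.0 / n)
    for _ in range(max_iter):
        r_prev = r
        r = (1 - d) / n + d * (m @ r)
        if np.square(r_prev - r).sum() <= tol:
            break
    return r.ravel()

# Toy season: team 0 beat teams 1 and 2, team 1 beat team 2.
wins = np.zeros((3, 3))
wins[0, 1] = 1
wins[1, 2] = 1
wins[0, 2] = 1

ranks = pagerank(wins)
print(ranks)  # team 0 ranks highest, team 2 lowest
```

Team 0 never lost, so its column is all zero and the ranks do not sum to 1. Only the ordering of the ranks is used to pick winners.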


##### Put Sorted Results in DataFrame

In [6]:
z = zip(all_teams, r.ravel())
df = pd.DataFrame(list(z), columns=['team', 'pr'])
df.sort_values(by=['pr'], ascending=False).head(30)

Unnamed: 0,team,pr
177,duke,0.049204
64,kansas,0.035346
383,north-carolina,0.034495
50,houston,0.029902
222,virginia,0.025156
328,kentucky,0.024683
18,iowa-state,0.024141
59,florida-state,0.023306
234,cincinnati,0.021247
171,marquette,0.020885


##### Run Tournament

Run the tournament using the rankings to see who wins! In each matchup, the team with the higher PageRank moves on to the next round.

In [7]:
east = ['duke', 'north-dakota-state', 'virginia-commonwealth', 'central-florida', 
        'mississippi-state', 'liberty', 'virginia-tech', 'saint-louis', 
        'maryland', 'belmont', 'louisiana-state', 'yale', 
        'louisville', 'minnesota', 'michigan-state', 'bradley']
west = ['gonzaga', 'fairleigh-dickinson', 'syracuse', 'baylor', 
        'marquette', 'murray-state', 'florida-state', 'vermont',
        'buffalo', 'arizona-state', 'texas-tech', 'northern-kentucky',
        'nevada', 'florida', 'michigan', 'montana']
south = ['virginia', 'gardner-webb', 'mississippi', 'oklahoma', 
         'wisconsin', 'oregon', 'kansas-state', 'california-irvine',
         'villanova', 'saint-marys-ca', 'purdue', 'old-dominion',
         'cincinnati', 'iowa', 'tennessee', 'colgate']
midwest = ['north-carolina', 'iona', 'utah-state', 'washington',
          'auburn', 'new-mexico-state', 'kansas', 'northeastern', 
           'iowa-state', 'ohio-state', 'houston', 'georgia-state', 
           'wofford', 'seton-hall', 'kentucky', 'abilene-christian'
          ]

In [8]:
def run_tournament_round(list_of_teams):
    """Pair adjacent teams; the team with the higher PageRank advances."""
    next_round = []
    for i in range(0, len(list_of_teams), 2):
        # .iloc[0] pulls the scalar out of the one-row selection
        r1 = df.loc[df.team == list_of_teams[i], 'pr'].iloc[0]
        r2 = df.loc[df.team == list_of_teams[i+1], 'pr'].iloc[0]
        if r1 > r2:
            next_round.append(list_of_teams[i])
        else:
            next_round.append(list_of_teams[i+1])
    return next_round

def run_tournament(left_bracket, right_bracket):
    """Play both 32-team halves down to one team each, then the final."""
    r_l = [left_bracket]
    r_r = [right_bracket]
    for i in range(5):  # 32 -> 16 -> 8 -> 4 -> 2 -> 1
        r_l.append(run_tournament_round(r_l[i]))
        r_r.append(run_tournament_round(r_r[i]))
    return r_l, r_r, run_tournament_round(r_l[5] + r_r[5])

The tournament function returns a tuple with 3 items. The first 2 are lists of lists holding the teams on the left and right sides of the bracket: index 0 is the starting field, and each later entry holds the predicted winners of a round. The final item is a one-element list with the name of the predicted overall winner.
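
The shape of that return value is easier to see on a tiny field. The sketch below uses a hypothetical four-team bracket, with made-up ranks in a plain dict instead of `df`; the names `play_round` and `play_bracket` are illustrative and not part of the notebook.

```python
# Hypothetical ranks for a four-team field; real values would come from df['pr'].
ranks = {'team-a': 0.40, 'team-b': 0.10, 'team-c': 0.20, 'team-d': 0.30}

def play_round(teams, ranks):
    """Pair adjacent teams and keep the higher-ranked one (second team wins ties)."""
    return [a if ranks[a] > ranks[b] else b
            for a, b in zip(teams[::2], teams[1::2])]

def play_bracket(teams, ranks):
    """Return the field at every stage, from the starting list down to one team."""
    rounds = [teams]
    while len(rounds[-1]) > 1:
        rounds.append(play_round(rounds[-1], ranks))
    return rounds

rounds = play_bracket(['team-a', 'team-b', 'team-c', 'team-d'], ranks)
print(rounds)
# [['team-a', 'team-b', 'team-c', 'team-d'], ['team-a', 'team-d'], ['team-a']]
```

As in the notebook's version, the order of the input list decides who meets whom, so teams must be listed in bracket order.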

In [9]:
o = run_tournament(east + west, south + midwest)
#Display predicted winners of the sweet 16, elite 8, final 4, and champion
o[0][2:], o[1][2:], o[2]

([['duke',
   'virginia-tech',
   'louisiana-state',
   'michigan-state',
   'baylor',
   'florida-state',
   'texas-tech',
   'michigan'],
  ['duke', 'michigan-state', 'florida-state', 'michigan'],
  ['duke', 'florida-state'],
  ['duke']],
 [['virginia',
   'kansas-state',
   'villanova',
   'cincinnati',
   'north-carolina',
   'kansas',
   'houston',
   'kentucky'],
  ['virginia', 'cincinnati', 'kansas', 'houston'],
  ['virginia', 'kansas'],
  ['kansas']],
 ['duke'])

### Result

The predicted champion is Duke. This is no big surprise, as they were the number 1 rated seed coming into the tournament. I started this analysis while teams were playing for a spot in the Elite 8 and finished just before the Final 4 games began. Of the 60 games completed so far, this version of the algorithm predicted 37 correctly (about 62%). (To see the full predicted bracket, click here!) Not perfect, but better than 50%. I consider that performance OK.

Possible improvements:
 - Add simulation to the analysis: take the standard deviation of the points each team scores, randomly add or subtract it from the game scores, and add the resulting simulated games to the analysis. The goal is to estimate who would likely win if the same two teams played multiple times.
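
One possible reading of that idea is sketched below. It is an assumption-heavy illustration, not a tested extension of the notebook: instead of adding or subtracting exactly one standard deviation, it draws normally distributed noise with that standard deviation, and the function name `simulate_games`, its parameters, and the toy scores are all made up.

```python
import numpy as np
import pandas as pd

def simulate_games(games, n_sims=100, seed=0):
    """Replay every game n_sims times with noisy scores and recompute the winner."""
    rng = np.random.default_rng(seed)
    # Pool the points each team scored, whether it was listed as tm1 or tm2.
    points = pd.concat([
        games[['tm1', 's1']].rename(columns={'tm1': 'team', 's1': 'pts'}),
        games[['tm2', 's2']].rename(columns={'tm2': 'team', 's2': 'pts'}),
    ])
    # Per-team scoring spread; teams with a single game fall back to the overall spread.
    spread = points.groupby('team')['pts'].std().fillna(points['pts'].std())

    sims = games.loc[games.index.repeat(n_sims)].reset_index(drop=True)
    sims['s1'] = sims['s1'] + rng.normal(0, sims['tm1'].map(spread).to_numpy())
    sims['s2'] = sims['s2'] + rng.normal(0, sims['tm2'].map(spread).to_numpy())
    sims['w1'] = (sims['s1'] > sims['s2']).astype(int)
    sims['w2'] = 1 - sims['w1']
    return sims

# Toy games with made-up scores, in the notebook's column layout.
games = pd.DataFrame({
    'tm1': ['team-a', 'team-a', 'team-b', 'team-b'],
    'tm2': ['team-b', 'team-c', 'team-c', 'team-a'],
    's1':  [70, 82, 64, 75],
    's2':  [65, 60, 71, 77],
})

sims = simulate_games(games, n_sims=5)
print(sims[['tm1', 'tm2', 'w1', 'w2']].head())
```

The simulated rows have the same `tm1`, `tm2`, `w1`, `w2` columns as the original data, so they could be fed through the same matrix-building loop.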