# Comparing Models to Predict March Madness Rankings


March Madness, also known as the NCAA Division I Men's Basketball Tournament happens annually in the month of March. Depending on which teams performed the best in the season, the top 32 are selected to compete in the tournament and play each other in a bracket for the winners trophy. Although March Madness 2018 is already over, our team wanted to see which model would do a better job at predicting team rankings/winners. We decided to look at the Elo Model as well as use logistic regression with features we extracted in order to find a trend. 

### Scraping the Data

In order to scrape data, we used www.sports-reference.com/cbb (cbb = college basketball). We utilized the BeautifulSoup library to extract the features that we thought we would need for both models. In the following code the features extracted are in the _featuresWanted_ set. A typical page that we would scrape from looks like the following: 


<img src="files/cbbstatsex.png">

This data displays Villanova's game history for the year 2018 [found here](https://www.sports-reference.com/cbb/schools/villanova/2018-schedule.html). We used Beautiful Soup to gather all the table data and format it in a data frame. Because the scraping usually takes ~10 minutes, the code was run once and put into a csv file, which we later used to do our data analysis. 


In [37]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd


def getSchools():
    url = "https://www.sports-reference.com/cbb/seasons/2018-school-stats.html"
    page = urlopen(url).read()
    soup = BeautifulSoup(page)
    count  = 0
    table = soup.find("tbody")
    school_dict = dict()
    for row in table.findAll('td', {"data-stat": "school_name"}):
        school_name = row.getText()
        for a in row.find_all('a', href=True):
            link = a['href'].strip()
            name = link[13:].split("/")[0]
            school_dict[name] = school_name
            
    return school_dict

def getDfs():
    school_set = getSchools()
    dfs = []
    final_df=pd.DataFrame()
    for school in school_set: 
        url = "https://www.sports-reference.com/cbb/schools/" + school + "/2018-schedule.html"
        page = urlopen(url).read()
        soup = BeautifulSoup(page)
        count = 0 
        pre_df = dict()
        school_set = getSchools()
        table = soup.find("tbody")
        featuresWanted =  {'opp_name', 'pts', 'opp_pts', 
                           'game_location','game_result','overtimes','wins','losses', 'date_game'} #add more features here!!

        rows = table.findChildren(['tr'])
        for row in rows:
            if (row.find('th', {"scope":"row"}) != None):

                for f in featuresWanted:
                    cell = row.find("td",{"data-stat": f})

                    a = cell.text.strip().encode()
                    text=a.decode("utf-8")
                    if f in pre_df:
                        pre_df[f].append(text)
                    else:
                        pre_df[f]=[text]
            
        df = pd.DataFrame.from_dict(pre_df)
        df["opp_name"]= df["opp_name"].apply(lambda row: (row.split("(")[0]).rstrip())
        df["school_name"]=school_set[school]
        df["school_name"] = df["school_name"].apply(lambda row: (row.split("(")[0]).rstrip())
        final_df=pd.concat([final_df,df])
    return final_df


def csvDump():
    df=getDfs()
    df.to_csv("scraped_data.csv")
    
    
csvDump()





 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "lxml")

  markup_type=markup_type))


351
{'bryant', 'wake-forest', 'arkansas-state', 'georgia', 'st-bonaventure', 'high-point', 'southern-mississippi', 'seton-hall', 'norfolk-state', 'saint-peters', 'stetson', 'alabama-birmingham', 'cal-state-bakersfield', 'new-hampshire', 'south-alabama', 'albany-ny', 'stephen-f-austin', 'jacksonville', 'pittsburgh', 'iowa-state', 'southern-utah', 'kent-state', 'texas', 'old-dominion', 'cincinnati', 'fairleigh-dickinson', 'tennessee-martin', 'oklahoma-state', 'tennessee-tech', 'colorado', 'green-bay', 'college-of-charleston', 'florida', 'furman', 'longwood', 'central-michigan', 'delaware', 'california-riverside', 'george-mason', 'william-mary', 'bradley', 'texas-southern', 'massachusetts', 'california', 'prairie-view', 'tennessee-state', 'charlotte', 'gardner-webb', 'louisiana-lafayette', 'central-arkansas', 'james-madison', 'texas-tech', 'drake', 'monmouth', 'northern-illinois', 'south-carolina-state', 'clemson', 'rutgers', 'kentucky', 'minnesota', 'south-carolina-upstate', 'seattle', '

After creating the csv, our csv (in this same folder, called scraped_data.csv contained data about all games that were played in the 2017-2018 season. 

## The Elo Model 

The Elo Model is a way of creating a rating system for zero-sum games - games that only have one winner and one loser (e.g. basketball, hockey, football, tennis, etc.) The system uses the following method:

The algorithm works in the following way: 

Each team begins with the same ranking. The standard across most sports is ~1000-1500. We started out with *1200*, which was a common trend amongst others across the internet who had also used Elo Rankings for other sports. We then calculate the probability of each team winning with the following equation:

**Team1 Probability = (1.0 / (1.0 + 10^((Team1_Rating – Team2_Rating) / 400)))**

**Team2 Probability  = (1.0 / (1.0 + 10^((Team2_Rating – Team1_Rating) / 400)))**


We can see that Team1 Probabilty + Team2 Probabiilty = 1.0. The '400' is a standardized constant in Elo Rankings[(1)](https://en.wikipedia.org/wiki/Elo_rating_system)

When a game is played, we can update the rankings of both teams using the following equation: 

**Team1_Rating = Team1_Rating + K*(Team1_Score – Team1_Probability)**

**Team2_Rating = Team2_Rating + K*(Team2_Score – Team2_Probability)**

Here, the scores are determined by the outcome of the game:

win = 1.0
draw = 0.5
loss = 0.0

The K factor is a numerical value that "determines how much the Elo rating should change following a match result"[(2)](www.betfair.com.au). Across literature and the internet, a common k-factor for basketball has been 20 (Used by FiveThirtyEight and others). We can actually create a K=factor that depeonds on the nubmer of matches played. (More on this later). 


The following Elo class creates an Elo ranking for each team and updates it everytime a game is played. It will be used for data analysis on the data we scraped earlier.  

In [3]:
'''
WIN = 1.
DRAW = 0.5
LOSS = 0.

https://www.geeksforgeeks.org/elo-rating-algorithm/

'''
#: Default K-factor.
K_FACTOR = 25
#: Default rating class.
RATING_CLASS = float
#: Default initial rating.
INITIAL_RATING = 1200
#: Default Beta value.
BETA = 200


class Elo(object):
    #initialize object
    def __init__(self, teamName, kFactor = K_FACTOR, rating = INITIAL_RATING, beta = BETA):
        self.teamName = teamName
        self.kFactor = kFactor
        self.rating = rating 
        self.pWin = None
        self.beta = 2*BETA
        self.matches = 0 

    def calcPWin(self, oppRating): #expected
        pwin = 1/(1+1000.00**((self.rating - oppRating)/self.beta))
        self.pWin = pwin
        return pwin

    def game(self, outcome, oppRating): #1 for win, 0 for loss, 
        pwin =self.calcPWin(oppRating)
        self.rating = self.rating - self.kFactor*(outcome - pwin)
        self.matches+=1
        return True

    def getPWin(self):
        return self.pWin

    def getRating(self):
        return self.rating

    def setKFactor(self, k):
        self.kFactor = k 


# TEST

# elo = Elo("villanova")
# print(elo.kFactor)
# x = elo.game(0, 1200)
# print(elo.getRating())


## Ranking the teams:

here we do blah blah blah 

In [7]:
import pandas as pd
from datetime import datetime


def removeNCAA(x):
    if("NCAA" in x):
        return x[:-5]
    else:
        return x


def getKey(item):
    return item[1]

def ranking(schoolDictionary): 
    outputs = []
    for key in schoolDictionary: 
        eloObject = schoolDictionary[key]
        item = (eloObject.teamName, eloObject.rating)
        outputs.append(item)
    return sorted(outputs,  key=getKey)
     
    
def main(): 
    df = pd.read_csv("scraped_data.csv")
    df.drop(['Unnamed: 0'], axis = 1, inplace=True)
    df['date_game'] =pd.to_datetime(df.date_game)
    df["school_name"].apply(removeNCAA)
    schoolDict = {} 
    schools = set(df['school_name'])


    holla = set()
    counter = 0 
    for school in schools: 
        if school not in schoolDict: 
            schoolDict[school] = Elo(school)
    for index, row in df.iterrows(): 
        homeSchool = row["school_name"]
        oppSchool = row["opp_name"]
        if oppSchool not in schoolDict:
            #even if opponent is not in school, compute the games using a baseline rating
            holla.add(oppSchool)
            counter += 1 
            #was messing around with this value ????? 
            oppRating = 100
            oppObj = None 
        else: 
            oppObj = schoolDict[oppSchool]
            oppRating = oppObj.getRating()
        #getting the rating before we update with the outcome, so that we can use the before game ELO to correctly 
        #reflect the outcome for the opponent after changing home team win
        
        schoolObj = schoolDict[homeSchool]
        schoolRating = schoolObj.getRating() 
    
        result = row["game_result"]
        if result == 'W': 
            schoolObj.game(1, oppRating)
            if oppObj != None: 
                oppObj.game(0, schoolRating)
        else: 
            schoolObj.game(0, oppRating)
            if oppObj != None: 
                oppObj.game(1, schoolRating)
        schoolDict[homeSchool] = schoolObj
    ranks = ranking(schoolDict)
    d = dict()
    for i in range(len(ranks)):
        d[ranks[i]]= i+1
        
    return d
        
    #return ranks
#     return sup[:50]
    
d1 = main() 



{0: ('Villanova\xa0NCAA', 584.041914378557),
 1: ('Kansas\xa0NCAA', 697.3853257361252),
 2: ('Xavier\xa0NCAA', 715.1488862508481),
 3: ('Virginia\xa0NCAA', 747.9153516817996),
 4: ('Texas Tech\xa0NCAA', 793.0203615462041),
 5: ('North Carolina\xa0NCAA', 797.1815630666526),
 6: ('West Virginia\xa0NCAA', 810.9384753571296),
 7: ('Duke\xa0NCAA', 819.6052912348774),
 8: ('Kentucky\xa0NCAA', 821.0381684746703),
 9: ('Michigan\xa0NCAA', 825.6474137225725),
 10: ('Kansas State\xa0NCAA', 844.578625365501),
 11: ('Tennessee\xa0NCAA', 848.3024946249897),
 12: ('Texas A&M\xa0NCAA', 867.2832446849899),
 13: ('Purdue\xa0NCAA', 876.1527360740388),
 14: ('Texas Christian\xa0NCAA', 895.4909056678247),
 15: ('Seton Hall\xa0NCAA', 897.1550521151629),
 16: ('Providence\xa0NCAA', 900.1765377822684),
 17: ('Clemson\xa0NCAA', 900.3035540087043),
 18: ('Syracuse\xa0NCAA', 906.8661248853806),
 19: ('Texas\xa0NCAA', 930.3869780442485),
 20: ('Cincinnati\xa0NCAA', 936.0725461762386),
 21: ('Gonzaga\xa0NCAA', 94

In [None]:
def get_original_MM_rankings():
    pass

In [None]:
#rank1, rank2: school --> rank

def mean_squares(rank1, rank2):
    ranks = []
    for school in rank1:
        d1_rank = rank1[school]
        if (school in rank2):
            d2_rank = rank2[school]
            rank_diff = abs(d1_rank-d2_rank)
            ranks.append(rank_diff)
    mse = reduce(lambda x, y: x + y, ranks) / len(ranks)
    return mse
        
    