# Comparing Models to Predict March Madness Rankings


March Madness, also known as the NCAA Division I Men's Basketball Tournament, happens annually in March. Based on their performance during the season, 68 teams are selected to compete in the tournament and play each other in a single-elimination bracket for the winner's trophy. Although March Madness 2018 is already over, our team wanted to see which model would do a better job of predicting team rankings and winners. We decided to compare the Elo model with a simple logistic regression on features we extracted, in order to find a trend.

### Scraping the Data

We scraped our data from www.sports-reference.com/cbb (cbb = college basketball), using the BeautifulSoup library to extract the features we expected to need for both models. In the code below, the extracted features are listed in the _featuresWanted_ set. A typical page that we scraped looks like the following:


<img src="files/cbbstatsex.png">

This page shows Villanova's game history for the 2017-2018 season ([found here](https://www.sports-reference.com/cbb/schools/villanova/2018-schedule.html)). We used BeautifulSoup to gather the table data and format it as a data frame. Because the scraping takes roughly 10 minutes, we ran the code once and saved the result to a CSV file, which we then used for our data analysis.


In [None]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd


def getSchools():
    """Return the set of school slugs listed on the 2018 season stats page."""
    url = "https://www.sports-reference.com/cbb/seasons/2018-school-stats.html"
    page = urlopen(url).read()
    soup = BeautifulSoup(page, "html.parser")  # name the parser explicitly
    table = soup.find("tbody")
    school_set = set()
    for cell in table.find_all('td', {"data-stat": "school_name"}):
        for a in cell.find_all('a', href=True):
            # Skip the 13-character "/cbb/schools/" prefix and keep the slug after it
            link = a['href'].strip()
            name = link[13:].split("/")[0]
            school_set.add(name)

    return school_set

def getDfs():
    """Scrape every school's 2018 schedule page into one DataFrame."""
    school_set = getSchools()  # fetch the school list once, not once per school
    # Columns to pull from each schedule row (matched on the data-stat attribute)
    featuresWanted = {'opp_name', 'pts', 'opp_pts',
                      'game_location', 'game_result', 'overtimes', 'wins', 'losses', 'date_game'}  # add more features here!!
    dfs = []
    for school in school_set:
        url = "https://www.sports-reference.com/cbb/schools/" + school + "/2018-schedule.html"
        page = urlopen(url).read()
        soup = BeautifulSoup(page, "html.parser")
        table = soup.find("tbody")
        pre_df = {f: [] for f in featuresWanted}

        for row in table.find_all('tr'):
            # Only rows that carry a <th scope="row"> cell are kept
            if row.find('th', {"scope": "row"}) is None:
                continue
            for f in featuresWanted:
                cell = row.find("td", {"data-stat": f})
                # Use an empty string for a missing cell so the columns stay aligned
                pre_df[f].append(cell.text.strip() if cell is not None else "")

        df = pd.DataFrame.from_dict(pre_df)
        # Drop everything from the first "(" onward in the opponent name
        df["opp_name"] = df["opp_name"].apply(lambda name: name.split("(")[0].rstrip())
        df["school_name"] = school
        dfs.append(df)
    # Concatenate once instead of growing a DataFrame inside the loop
    return pd.concat(dfs) if dfs else pd.DataFrame()


def csvDump():
    df=getDfs()
    df.to_csv("scraped_data.csv")
    
    
csvDump() #dump df into csv 


After running the scraper, the resulting CSV (scraped_data.csv, in this same folder) contained data on all games that were played in the 2017-2018 season.
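As a rough sketch of how the cached file might be read back for analysis, the snippet below loads a CSV with the scraped columns and converts the score columns to numbers. The helper name `load_games` and the two sample rows are made up for illustration (they are not real game results), and the in-memory buffer only stands in for scraped_data.csv.

```python
import io

import pandas as pd


def load_games(source):
    """Load the scraped games from a CSV path or file-like object (hypothetical helper)."""
    # to_csv() was called without index=False, so the first column holds the old index
    games = pd.read_csv(source, index_col=0)
    # Scores were scraped as text; coerce anything unparseable to NaN
    for col in ("pts", "opp_pts"):
        games[col] = pd.to_numeric(games[col], errors="coerce")
    return games.reset_index(drop=True)


# Made-up rows in the scraped column layout, standing in for scraped_data.csv
sample = io.StringIO(
    ",opp_name,pts,opp_pts,game_location,game_result,overtimes,wins,losses,date_game,school_name\n"
    "0,Team B,80,70,,W,,1,0,2017-11-10,team-a\n"
    "1,Team C,65,72,@,L,,1,1,2017-11-14,team-a\n"
)
games = load_games(sample)
```

With the real file, the call would presumably be `load_games("scraped_data.csv")`.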

## The Elo Model 

The Elo model is a rating system for zero-sum games, that is, games that have exactly one winner and one loser (e.g. basketball, hockey, football, tennis).
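To make the idea concrete, here is a minimal sketch of the standard Elo expected-score and rating-update formulas. The function names are hypothetical, and the scale of 400 and K-factor of 32 are the conventional chess defaults, used here only as assumptions; they are not necessarily the values used in our analysis.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Probability that team A beats team B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def update_ratings(rating_a, rating_b, a_won, k=32.0):
    """Return the new (rating_a, rating_b) after one game.

    k=32 is an assumed K-factor; it controls how far one result moves a rating.
    """
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    # Whatever A gains, B loses, so the total rating is conserved
    return rating_a + delta, rating_b - delta


# Two evenly rated teams: each is expected to win half the time
even_odds = expected_score(1500, 1500)
new_a, new_b = update_ratings(1500, 1500, a_won=True)
```

Because the update is symmetric, an upset (a low-rated team beating a high-rated one) moves both ratings more than an expected result does.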
