## **Final Project:** Forecasting the 2023 NCAA Basketball Tournaments
**Course:** Spring 2023, MSDS 565 Predictive Modeling & Analytics <br>
**Authors:** Aleesa Mann and Cyruss Tsurgeon | **Date:** 20 Mar 2023

### **Instructions** (Kaggle Competition: March Machine Learning Mania 2023)

The evaluation methodology for 2023 has changed from prior editions of this competition. Submissions are now evaluated on the **Brier score** between the predicted probabilities and the actual game outcomes (this is equivalent to mean squared error in this context). This change was made to reduce the competitive "distractions" caused by the 0 and 1 boundaries of the previous log-loss metric (e.g. submitting rounded predictions to gamble on a given upset, or caring deeply about the 0.99 vs 0.999 distinction that log loss would reward/punish).
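
As a rough illustration of the scoring change, the sketch below computes a Brier score by hand and compares how the two metrics penalize an overconfident miss. The helper name `brier_score` and the sample probabilities are made up for illustration; this is not the competition's scoring code.

```python
import math


def brier_score(predictions, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(predictions) != len(outcomes):
        raise ValueError("predictions and outcomes must have the same length")
    squared_errors = [(p - o) ** 2 for p, o in zip(predictions, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical predictions for three games (1 = the first team won, 0 = it lost)
preds = [0.9, 0.5, 0.2]
actual = [1, 0, 0]
print(f"Brier score: {brier_score(preds, actual):.3f}")

# An overconfident miss: predicting 0.99 or 0.999 for a team that then loses
for p in (0.99, 0.999):
    brier_penalty = (p - 0) ** 2
    log_loss_penalty = -math.log(1 - p)
    print(f"p={p}: Brier penalty {brier_penalty:.3f}, log loss penalty {log_loss_penalty:.3f}")
```

Under the Brier score the two misses cost almost the same, because the penalty for a single game is bounded by 1, whereas the log loss penalty keeps growing as the prediction approaches 1.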

**Submission File**

The submission file format has also been revised for 2023:

1. **Kaggle has combined the Men's and Women's tournaments into one single competition.** Our submission file should contain predictions for both.

2. **We will be predicting the hypothetical results for every possible team matchup, not just teams that are selected for the NCAA tournament.** This change was enacted to provide a longer time window to submit predictions for the 2023 tournament. Previously, the short time between Selection Sunday and the tournament tipoffs would require participants to quickly turn around updated predictions. By forecasting every possible outcome between every team, we can now submit a valid prediction at any point leading up to the tournaments.

3. We are allowed as many predictions as we wish before the tournaments start, but **must select no more than two submissions we want to count towards scoring in the Kaggle competition**. Do not rely on automatic selection to pick your submissions, as there is no public leaderboard score and the system will select your earliest two submissions.
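
To make "every possible team matchup" concrete, here is a minimal sketch that enumerates all unordered pairs of teams. The ID layout `Season_LowerTeamID_HigherTeamID`, the two-column row shape, and the helper name `matchup_ids` are assumptions for illustration and should be checked against the sample submission file.

```python
from itertools import combinations


def matchup_ids(season, team_ids):
    """Build one ID per unordered pair of teams, lower TeamID first."""
    pairs = combinations(sorted(team_ids), 2)
    return [f"{season}_{low}_{high}" for low, high in pairs]


# Hypothetical TeamIDs; a real run would use the IDs from the team files
toy_team_ids = [1104, 1101, 1103, 1102]
ids = matchup_ids(2023, toy_team_ids)

# Assumed row shape: an ID and a predicted probability (0.5 = no information)
submission_rows = [(matchup_id, 0.5) for matchup_id in ids]
print(len(submission_rows), submission_rows[0])
```

For `n` teams this produces `n * (n - 1) / 2` rows, which is why the file grows quickly once every team is included rather than only the tournament field.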


### **Problem Understanding and Definition**

#### **Problem Understanding**

**Overview:**   <Enter Text Here>

**Assumptions:**  <Enter Text Here>

**Assignment:** <Enter Text Here>

#### **Problem Definition**

**Problem:** <Enter Text Here>

**Goals (High-level Solution):**  <Enter Text Here>

**Methodology:** <Enter Text Here>

**Metrics:**  <Enter Text Here>

#### **Libraries**

In [None]:
# For working with notebook
import os
import warnings

# For manipulating data
import pandas as pd
import numpy as np

#### **Functions**

### **Settings and Setup**

##### **Import Libraries**

In [None]:
# For working with files
import os
import re
import csv
import glob

# For manipulating data
import pandas as pd

# For getting geographic locations and distances
from geopy.distance import geodesic
from geopy.exc import GeocoderTimedOut
from geopy.geocoders import Nominatim

##### **Set File Path**

In [None]:
DATA_DIR = 'march-machine-learning-mania-2023/'

### **Data Collection and Preparation**

#### **Custom Functions**

##### **`MTeamSpellings.csv` and `WTeamSpellings.csv` File**

In [None]:
def store_spellings(csvfile):
    """
    Converts a CSV file with two columns (TeamNameSpelling, TeamID) to a dictionary with TeamNameSpelling as the key and TeamID as the value.

    Parameters:
    csvfile (str): The path and name of the CSV file to convert.

    Returns:
    Returns a dictionary with TeamNameSpelling as the key and TeamID as the value.

    Use:
    Use to get the TeamID for a team using an alternate spelling of the name (eg. mnames['mt-st-marys'])
    """

    # Create an empty dictionary to store the data
    spellings = {}

    # Open the csv file and read the contents into a list of dictionaries
    with open(csvfile, 'r', encoding='latin-1') as csv_file:
        csv_reader = csv.DictReader(csv_file)

        # Use a dictionary comprehension to create a dictionary from the csv data
        spellings = {row['TeamNameSpelling']: row['TeamID'] for row in csv_reader}

    # Return dictionary
    return spellings

##### **`MSeasons.csv` and `WSeasons.csv` File**

In [None]:
def store_seasons(csvfile):
    """
    Converts a CSV file with six columns (Season, DayZero, RegionW, RegionX, RegionY, RegionZ) to TWO dictionaries: 
    (1) with Season as the key and DayZero as the value; and 
    (2) with Season as the key and the four regions names as the value (in tuple)

    Parameters:
    csvfile (str): The path and name of the CSV file to convert.

    Returns:
    Returns two dictionaries for DayZero and (RegionW, RegionX, RegionY, RegionZ) using Season as the key.

    Use:
    (1) Use to get day zero (the date from which DayNum is counted) for any season, using Season as a string key, e.g. mens_day0['2022'] gives the 2022 season's day zero (2021-11-01)
    (2) Use to get the four regions for any season, using Season as a string key, e.g. mens_regions['2022'] gives (East, West, Midwest, South)
    """

    # Create two empty dictionaries: one for storing day zero and another for storing regions
    dict_day_zero = {}
    dict_regions = {}

    with open(csvfile, 'r', encoding='latin-1') as csv_file:
        reader = csv.reader(csv_file)

        # Skip the header row so that it is not stored as if it were a season
        next(reader, None)

        for row in reader:
            # Extract the values from the row
            season, day_zero, region_w, region_x, region_y, region_z = row

            # Populate the dictionaries
            dict_day_zero[season] = day_zero
            dict_regions[season] = (region_w, region_x, region_y, region_z)

    # Return the two dictionaries
    return dict_day_zero, dict_regions

##### **`MTeamConferences.csv` and `WTeamConferences.csv` File**

In [None]:
def store_team_conference(csvfile):
    """
    Converts a CSV file with three columns (Season, TeamID, ConfAbbrev) to a nested dictionary with Season as the outer key, TeamID as the
    inner key, and ConfAbbrev as the value.

    Parameters:
    csvfile (str): The path and name of the CSV file to convert.

    Returns:
    Returns a nested dictionary with Season as the outer key, TeamID as the inner key, and ConfAbbrev as the value.

    Use:
    Use to get the conference abbreviation for a team in a specific season (eg. mens_conferences['1985']['1449'])
    """

    # Create an empty nested dictionary
    conferences = {}

    # Open the CSV file and read the contents into a list of dictionaries
    with open(csvfile, 'r', encoding='latin-1') as csv_file:
        csv_reader = csv.DictReader(csv_file)

        # Iterate over the rows in the CSV file
        for row in csv_reader:
            # Extract the season, team ID, and conference abbreviation from the current row
            year = row['Season']
            team_id = row['TeamID']
            abbrv = row['ConfAbbrev']

            # Create the inner dictionary for this season if it doesn't exist
            if year not in conferences:
                conferences[year] = {}

            # Store the conference abbreviation under the team ID for this season
            conferences[year][team_id] = abbrv

    return conferences

##### **`MNCAATourneySeeds.csv` and `WNCAATourneySeeds.csv`**

In [None]:
def store_tourney_seeds(csvfile):
    """
    Converts a CSV file with three columns (Season, Seed, TeamID) to a nested dictionary with Season as the outer key, TeamID as the
    inner key, and Seed as the value.

    Parameters:
    csvfile (str): The path and name of the CSV file to convert.

    Returns:
    Returns a nested dictionary with Season as the outer key, TeamID as the inner key, and Seed as the value.

    Use:
    Use to get the tournament seed for a team in a specific season (e.g. mens_seeds['1985']['1449'])
    """

    # Create an empty nested dictionary
    seeds = {}

    # Open the CSV file and read the contents into a list of dictionaries
    with open(csvfile, 'r', encoding='latin-1') as csv_file:
        csv_reader = csv.DictReader(csv_file)

        # Iterate over the rows in the CSV file
        for row in csv_reader:
            # Extract the season, seed, and team ID from the current row
            year = row['Season']
            seed = row['Seed']
            team_id = row['TeamID']

            # Create the inner dictionary for this season if it doesn't exist
            if year not in seeds:
                seeds[year] = {}

            # Store the seed under the team ID for this season
            seeds[year][team_id] = seed

    return seeds

##### **`Cities.csv` File**

In [None]:
def store_cities(csvfile):
    """
    Converts a CSV file with three columns (CityID, City, State) to a dictionary with CityID as the key and both City, State as the value.

    Parameters:
    csvfile (str): The path and name of the CSV file to convert.

    Returns:
    Returns a dictionary for City and State using CityID as the key.

    Use:
    Use to get the City and State for a game using the CityID (e.g. cities['4030'])
    """

    # Create an empty dictionary to store the data
    cities = {}

    # Open the csv file and read the contents into a list of dictionaries
    with open(csvfile, 'r', encoding='latin-1') as csv_file:
        csv_reader = csv.DictReader(csv_file)

        # Iterate over each row in the csv data
        for row in csv_reader:
            # Extract the relevant fields from the row
            city_id = row['CityID']
            city = row['City']
            state = row['State']

            # Add the data to the cities dictionary
            cities[city_id] = f"{city}, {state}"

    # Return the dictionary
    return cities

In [None]:
def get_distance(city1, city2):
    """
    Gets the distance in miles between two locations given as 'City, State' strings.

    Returns:
    Returns the geodesic distance in miles (float), or None if either location could not be geocoded.
    """

    # Use custom function to get coordinates for each city
    coord_1 = get_coordinates(city1)  # Latitude and longitude of city1
    coord_2 = get_coordinates(city2)  # Latitude and longitude of city2

    # get_coordinates returns (None, None) when geocoding fails, so there is no distance to compute
    if None in coord_1 or None in coord_2:
        return None

    # Calculate distance between city1 and city2 using geopy
    distance_mi = geodesic(coord_1, coord_2).miles

    # Return the distance as a number (float)
    return distance_mi

In [None]:
def get_coordinates(city):
    # Function to get coordinates for a single location (city, state)
    
    # Create an instance of Nominatim class
    geolocator = Nominatim(user_agent='fake_useragent')

    # Use geocode method to get the location
    try:
        location_data = geolocator.geocode(city, timeout=10)
        latitude = location_data.latitude
        longitude = location_data.longitude

    # Add an except to return 'None' when location can not be found (to prevent error)
    except (AttributeError, GeocoderTimedOut):
        latitude = None
        longitude = None

    # Return (latitude, longitude) as tuple
    return (latitude, longitude)

##### **`Conferences.csv`**

In [None]:
def store_conferences(csvfile):
    """
    Converts a CSV file with two columns (ConfAbbrev, Description) to a dictionary with ConfAbbrev as the key and Description as the value.

    Parameters:
    csvfile (str): The path and name of the CSV file to convert.

    Returns:
    Returns a dictionary with ConfAbbrev as the key and Description as the value.

    Use:
    Use to get the Description (full name) for a conference using its abbreviation (eg. conf_abrv['a_sun'])
    """

    # Create an empty dictionary to store the data
    conferences = {}

    # Open the csv file and read the contents into a list of dictionaries
    with open(csvfile, 'r', encoding='latin-1') as csv_file:
        csv_reader = csv.DictReader(csv_file)

        # Use a dictionary comprehension to create a dictionary from the csv data
        conferences = {row['ConfAbbrev']: row['Description'] for row in csv_reader}

    # Return dictionary
    return conferences

##### **`MSecondaryTourneyTeams.csv`**

In [None]:
def store_sec_tourney_teams(csvfile):
    """
    Converts a CSV file with three columns (Season, SecondaryTourney, TeamID) to a nested dictionary with Season as the outer key, TeamID as the
    inner key, and SecondaryTourney as the value.

    Parameters:
    csvfile (str): The path and name of the CSV file to convert.

    Returns:
    Returns a nested dictionary with Season as the outer key, TeamID as the inner key, and SecondaryTourney as the value.

    Use:
    Use to get the secondary tournament a team played in during a specific season (e.g. sec_tourney['2019']['1400'])
    """

    # Create an empty nested dictionary
    secondary = {}

    # Open the CSV file and read the contents into a list of dictionaries
    with open(csvfile, 'r', encoding='latin-1') as csv_file:
        csv_reader = csv.DictReader(csv_file)

        # Iterate over the rows in the CSV file
        for row in csv_reader:
            # Extract the season, secondary tournament, and team ID from the current row
            year = row['Season']
            second = row['SecondaryTourney']
            team_id = row['TeamID']

            # Create the inner dictionary for this season if it doesn't exist
            if year not in secondary:
                secondary[year] = {}

            # Store the secondary tournament under the team ID for this season
            secondary[year][team_id] = second

    return secondary

#### **Background on Dataset**

The data for this project are a collection of **30 csv files** distributed to the public as part of the Kaggle 2023 March Machine Learning Mania (Forecast the 2023 NCAA Basketball Tournaments) competition. By convention, when we identify a particular season, we will reference the year that the season ends (e.g. current season is 2023, not 2022 or 22-23).

`NOTE:` In many cases there are more data (or additional files) provided for men's basketball than women's basketball.

#### **Data Section 1 - Teams**

##### **Team files:** `MTeams.csv` and `WTeams.csv`

These files identify the different college teams present in the dataset (MTeams is for the men's teams and WTeams is for the women's teams). There are 363 teams currently in Men's Division-I and 361 teams currently in Women's Division-I. There will be some teams listed in the data only for historical seasons and not for the current season, and thus there are more than 363 men's teams and more than 361 women's teams listed.

In [None]:
# Creates the Teams tables
Mens = pd.read_csv(DATA_DIR + 'MTeams.csv')
Womens = pd.read_csv(DATA_DIR + 'WTeams.csv')
print("Done. Loaded team tables as 'Mens' and 'Womens'.")

##### **Team Spelling files:** `MTeamSpellings.csv` and `WTeamSpellings.csv`

These files indicate alternative spellings of many team names. They are intended for use in associating external spellings against our own TeamID numbers, thereby helping to relate the external data properly with our datasets.

In [None]:
# Creates a dictionary to store alternate team name spellings
mnames = store_spellings(DATA_DIR + 'MTeamSpellings.csv') # use with an alternate spelling as key, e.g. mnames['mt-st-marys']
wnames = store_spellings(DATA_DIR + 'WTeamSpellings.csv') # use with an alternate spelling as key, e.g. wnames['mt-st-marys']
print('e.g. output - mens:', mnames['mt-st-marys'], 'womens:', wnames['mt-st-marys'])
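
When matching names from an external source, a small lookup helper can normalize the name before consulting the spellings dictionary. The sketch below assumes the spellings are stored in lower case (as the `'mt-st-marys'` example suggests); `lookup_team_id` and the toy TeamIDs are hypothetical.

```python
def lookup_team_id(name, spellings):
    """Return the TeamID for an external team name, or None if no spelling matches."""
    key = name.strip().lower()
    return spellings.get(key)


# Toy mapping with made-up TeamIDs, standing in for the dictionary from store_spellings
toy_spellings = {"mt-st-marys": "9001", "mount st. mary's": "9001"}

print(lookup_team_id("  Mt-St-Marys ", toy_spellings))
print(lookup_team_id("unknown college", toy_spellings))
```

Returning `None` for an unmatched name, rather than raising `KeyError`, makes it easier to collect the names that still need a manual mapping.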

##### **Coaches file:** `MTeamCoaches.csv`

This file indicates the head coach for each team in each season, including a start/finish range of DayNum values to indicate a mid-season coaching change.

In [None]:
# Creates the Team Coaches table (men's only):
MCoaches = pd.read_csv(DATA_DIR + 'MTeamCoaches.csv')
print("Done. Loaded team coaches table as 'MCoaches', mens only.")
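
The start/finish DayNum range makes it possible to ask who was coaching a team on a particular day. The sketch below assumes the column names `Season`, `TeamID`, `FirstDayNum`, `LastDayNum`, and `CoachName`; these, the helper `coach_on_day`, and the toy rows are illustrative assumptions rather than confirmed details of the file.

```python
def coach_on_day(rows, season, team_id, day_num):
    """Return the coach whose DayNum range covers day_num, or None if no row matches."""
    for row in rows:
        in_range = row["FirstDayNum"] <= day_num <= row["LastDayNum"]
        if row["Season"] == season and row["TeamID"] == team_id and in_range:
            return row["CoachName"]
    return None


# Toy rows describing a made-up mid-season coaching change
toy_rows = [
    {"Season": 2023, "TeamID": 9001, "FirstDayNum": 0, "LastDayNum": 60, "CoachName": "coach_a"},
    {"Season": 2023, "TeamID": 9001, "FirstDayNum": 61, "LastDayNum": 154, "CoachName": "coach_b"},
]

print(coach_on_day(toy_rows, 2023, 9001, 30))
print(coach_on_day(toy_rows, 2023, 9001, 100))
```

If the column names match, `MCoaches.to_dict("records")` should yield rows in the same shape as `toy_rows`.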

#### **Data Section 1 - Season Summaries**

##### **Season files:** `MSeasons.csv` and `WSeasons.csv`

These files identify the different seasons included in the historical data, along with certain season-level properties. There are separate files for men's data (MSeasons) and women's data (WSeasons).

In [None]:
# Creates two dictionaries to store day zero for each season and the region names
mens_day0, mens_regions = store_seasons(DATA_DIR + 'MSeasons.csv'); # use with Season as e.g. mens_day0['2022'] or mens_regions['2022']
womens_day0, womens_regions = store_seasons(DATA_DIR + 'WSeasons.csv'); # use with Season as e.g. womens_day0['2022'] or womens_regions['2022']
print('e.g. output - mens:', mens_day0['2022'], mens_regions['2022'], 'womens:', womens_day0['2022'], womens_regions['2022'])
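
Day zero is mainly useful for turning a game's DayNum offset into a calendar date. The sketch below assumes the day zero string is in `MM/DD/YYYY` form; that format and the helper name `game_date` are assumptions, so the `fmt` argument is left adjustable.

```python
from datetime import datetime, timedelta


def game_date(day_zero, day_num, fmt="%m/%d/%Y"):
    """Convert a season's day zero string plus a DayNum offset into a calendar date."""
    start = datetime.strptime(day_zero, fmt)
    return (start + timedelta(days=int(day_num))).date()


# Day zero of the 2022 season (1 Nov 2021) plus a DayNum of 132
print(game_date("11/01/2021", 132))

# The same helper with an ISO-formatted day zero
print(game_date("2021-11-01", 0, fmt="%Y-%m-%d"))
```

In the notebook this could be called as `game_date(mens_day0['2022'], day_num)`, provided `fmt` matches however the dates are stored in the season files.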

##### **Team Conferences files:** `MTeamConferences.csv` and `WTeamConferences.csv`

These files indicate the conference affiliations for each team during each season.

In [None]:
# Creates a nested dictionary to store each team's conference abbreviation by season
mens_conferences = store_team_conference(DATA_DIR + 'MTeamConferences.csv') # use as mens_conferences[year][team_id]
womens_conferences = store_team_conference(DATA_DIR + 'WTeamConferences.csv') # use as womens_conferences[year][team_id]
print('e.g. output - mens:', mens_conferences['2022']['1449'], 'womens:', womens_conferences['2022']['3449'])

#### **Data Section 1 - Box Scores**

##### **Season files:** `MRegularSeasonDetailedResults.csv` and `WRegularSeasonDetailedResults.csv` | `MRegularSeasonCompactResults.csv` and `WRegularSeasonCompactResults.csv`

These files identify the game-by-game regular season results for all seasons of historical data. The Detailed Results files provide team-level box scores for many regular seasons of historical data, starting with the 2003 season (men) or the 2010 season (women).

Team Box Scores are provided in "Detailed Results" files rather than "Compact Results" files. However, the two files are strongly related. In a Detailed Results file, the first eight columns (Season, DayNum, WTeamID, WScore, LTeamID, LScore, WLoc, and NumOT) are exactly the same as a Compact Results file. However, in a Detailed Results file, there are many additional columns.

In [None]:
# Creates Detailed Regular Season tables:
MSeason = pd.read_csv(DATA_DIR + 'MRegularSeasonDetailedResults.csv')
WSeason = pd.read_csv(DATA_DIR + 'WRegularSeasonDetailedResults.csv')
print("Done. Loaded detailed regular season tables as 'MSeason' and 'WSeason'.")

In [None]:
# Creates Compact Regular Season tables (Optional):
MSeason_compact = pd.read_csv(DATA_DIR + 'MRegularSeasonCompactResults.csv')
WSeason_compact = pd.read_csv(DATA_DIR + 'WRegularSeasonCompactResults.csv')
print("Done. Loaded compact regular season tables as 'MSeason_compact' and 'WSeason_compact'.")
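
The relationship between the Detailed and Compact files described above can be checked programmatically. The sketch below uses two toy frames with made-up rows; the box score column names `WFGM` and `LFGM` are assumed examples of the additional columns.

```python
import pandas as pd

# The eight columns shared by Compact and Detailed Results files
SHARED_COLUMNS = ["Season", "DayNum", "WTeamID", "WScore", "LTeamID", "LScore", "WLoc", "NumOT"]

# Toy compact results with made-up teams and scores
compact = pd.DataFrame(
    [[2003, 10, 9001, 68, 9002, 62, "N", 0],
     [2003, 11, 9003, 70, 9004, 63, "H", 1]],
    columns=SHARED_COLUMNS,
)

# Toy detailed results: the same eight columns plus extra box score columns
detailed = compact.copy()
detailed["WFGM"] = [27, 26]
detailed["LFGM"] = [22, 24]

# Check that the leading columns line up and that their values agree
same_header = list(detailed.columns[:8]) == SHARED_COLUMNS
same_values = detailed[SHARED_COLUMNS].equals(compact)
print(same_header, same_values)
```

With the real tables, the compact frame would first need to be filtered to the seasons that the detailed frame covers, since the detailed data starts later.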

##### **Game Cities files:** `MGameCities.csv` and `WGameCities.csv`

These files identify all games, starting with the 2010 season, along with the city that each game was played in. Games from the regular season, the NCAA® tourney, and other post-season tournaments (men's data only), are all listed together.

In [None]:
# Creates Game Cities tables:
MGCities = pd.read_csv(DATA_DIR + 'MGameCities.csv')
WGCities = pd.read_csv(DATA_DIR + 'WGameCities.csv')
print("Done. Loaded game cities tables as 'MGCities' and 'WGCities'.")

#### **Data Section 1 - Tournament Games**

##### **Tourney files:** `MNCAATourneySeeds.csv` and `WNCAATourneySeeds.csv`

These files identify the seeds for all teams in each NCAA® tournament, for all seasons of historical data.

In [None]:
# Creates a dictionary to store team tournament seeding
mens_seeds = store_tourney_seeds(DATA_DIR + 'MNCAATourneySeeds.csv') # use as mens_seeds[year][team_id]
womens_seeds = store_tourney_seeds(DATA_DIR + 'WNCAATourneySeeds.csv') # use as womens_seeds[year][team_id]
print('e.g. output - mens:', mens_seeds['1985']['1449'], 'womens:', womens_seeds['2001']['3449'])
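
Seed values are stored as strings, so a numeric seed has to be parsed out before it can be used as a model feature. The sketch below assumes a seed looks like `W01` or `X16a`: a region letter (W, X, Y, or Z), a two-digit seed number, and an optional `a`/`b` suffix for play-in teams. That layout and the helper `parse_seed` are assumptions for illustration.

```python
import re

# Assumed layout: region letter, two-digit seed number, optional play-in suffix
SEED_PATTERN = re.compile(r"^([WXYZ])(\d{2})([ab]?)$")


def parse_seed(seed):
    """Split a seed string into (region, seed_number, play_in_suffix or None)."""
    match = SEED_PATTERN.match(seed)
    if match is None:
        raise ValueError(f"Unrecognized seed format: {seed!r}")
    region, number, suffix = match.groups()
    return region, int(number), suffix or None


print(parse_seed("W01"))
print(parse_seed("X16a"))
```

The numeric part could then feed a simple feature such as the seed difference between two teams.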

##### **Tourney files:** `MNCAATourneySlots.csv` and `WNCAATourneySlots.csv`

These files identify the mechanism by which teams are paired against each other, depending upon their seeds, as the tournament proceeds through its rounds.

In [None]:
# Creates Tournament Seed Matchup tables:
MTourney_seeds = pd.read_csv(DATA_DIR + 'MNCAATourneySlots.csv')
WTourney_seeds = pd.read_csv(DATA_DIR + 'WNCAATourneySlots.csv')
print("Done. Loaded tournament seed matchup tables as 'MTourney_seeds' and 'WTourney_seeds'.")
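
The pairing mechanism can be pictured as a lookup in which each slot names its two feeders, either seeds (first round) or earlier slots whose winners advance. The sketch below assumes each slot row provides a slot name plus a strong and a weak feeder, with slot names like `R1W1`; the naming scheme, the helper `slot_participants`, and the TeamIDs are all hypothetical.

```python
def slot_participants(slot, slots, seed_to_team, slot_winners):
    """Return the two TeamIDs that meet in `slot` (None where a feeder game is undecided)."""
    strong, weak = slots[slot]

    def occupant(label):
        # A label is either a seed (first round) or an earlier slot whose winner advances
        if label in seed_to_team:
            return seed_to_team[label]
        return slot_winners.get(label)

    return occupant(strong), occupant(weak)


# Toy bracket fragment with made-up TeamIDs
toy_slots = {
    "R1W1": ("W01", "W16"),
    "R1W8": ("W08", "W09"),
    "R2W1": ("R1W1", "R1W8"),
}
toy_seed_to_team = {"W01": 9001, "W16": 9016, "W08": 9008, "W09": 9009}
toy_slot_winners = {"R1W1": 9001, "R1W8": 9009}

print(slot_participants("R1W1", toy_slots, toy_seed_to_team, toy_slot_winners))
print(slot_participants("R2W1", toy_slots, toy_seed_to_team, toy_slot_winners))
```

Applying the same lookup round by round is one way to simulate a full bracket from a set of predicted winners.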

##### **Tourney file:** `MNCAATourneySeedRoundSlots.csv`

This file represents the men's bracket structure: for a given tournament seed, it identifies the bracket slot that seed would play in during each round of the tournament.

In [None]:
# Creates a Tourney Seeding by Round table 
Tourney_rounds = pd.read_csv(DATA_DIR + 'MNCAATourneySeedRoundSlots.csv')
print("Done. Loaded tournament seeding for each round in table as 'Tourney_rounds'.")

##### **Tourney files:** `MNCAATourneyDetailedResults.csv` and `WNCAATourneyDetailedResults.csv` | `MNCAATourneyCompactResults.csv` and `WNCAATourneyCompactResults.csv`

These files identify the game-by-game NCAA® tournament results for all seasons of historical data. These files provide team-level box scores for many NCAA® tournaments, starting with the 2003 season (men) or starting with the 2010 season (women).

The data is formatted exactly like the corresponding Regular Season Results data.

In [None]:
# Creates Detailed Tournament Results tables:
MTourney = pd.read_csv(DATA_DIR + 'MNCAATourneyDetailedResults.csv')
WTourney = pd.read_csv(DATA_DIR + 'WNCAATourneyDetailedResults.csv')
print("Done. Loaded detailed tournament tables as 'MTourney' and 'WTourney'.")

In [None]:
# Creates Compact Tournament Results tables (Optional):
MTourney_compact = pd.read_csv(DATA_DIR + 'MNCAATourneyCompactResults.csv')
WTourney_compact = pd.read_csv(DATA_DIR + 'WNCAATourneyCompactResults.csv')
print("Done. Loaded compact tournament tables as 'MTourney_compact' and 'WTourney_compact'.")

##### **Secondary Tourney file:** `MSecondaryTourneyCompactResults.csv`

This file identifies the game-by-game results of post-season men's tournaments other than the NCAA® Tournament (such events would run in parallel with the NCAA® Tournament). The teams in these games were not invited to the NCAA® Tournament and instead were invited to some other tournament.

In [None]:
# Creates a Compact Secondary Tournament Results table:
MSecondary = pd.read_csv(DATA_DIR + 'MSecondaryTourneyCompactResults.csv')
print("Done. Loaded secondary tournament table as 'MSecondary'.")

#### **Data Section 1 - Supplementary Data**

##### **Cities file:** `Cities.csv`

This file provides a master list of cities that have been locations for games played.

In [None]:
# Creates a dictionary to store cities as city, state
cities = store_cities(DATA_DIR + 'Cities.csv') # use with CityID as e.g. cities['4030']
print('e.g. output -', cities['4030'])

##### **Ratings file:** `MMasseyOrdinals.csv`

This file lists out rankings (e.g. #1, #2, #3, ..., #N) of men's teams going back to the 2002-2003 season, under a large number of different ranking system methodologies.
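
Because each ranking system publishes rankings repeatedly during a season, a common first step is to keep only a system's most recent ranking per team. The sketch below assumes the columns `Season`, `RankingDayNum`, `SystemName`, `TeamID`, and `OrdinalRank`; those names, the helper `latest_ranks`, and the toy rows (including the system names) are illustrative assumptions.

```python
import pandas as pd

# Toy rows with made-up teams, ranks, and system names
toy_ordinals = pd.DataFrame({
    "Season":        [2023, 2023, 2023, 2023, 2023],
    "RankingDayNum": [100, 100, 128, 128, 128],
    "SystemName":    ["SYS_A", "SYS_A", "SYS_A", "SYS_A", "SYS_B"],
    "TeamID":        [9001, 9002, 9001, 9002, 9001],
    "OrdinalRank":   [5, 9, 3, 12, 7],
})


def latest_ranks(ordinals, season, system):
    """Return a TeamID -> OrdinalRank Series from the system's most recent ranking day."""
    subset = ordinals[(ordinals["Season"] == season) & (ordinals["SystemName"] == system)]
    last_day = subset["RankingDayNum"].max()
    latest = subset[subset["RankingDayNum"] == last_day]
    return latest.set_index("TeamID")["OrdinalRank"]


print(latest_ranks(toy_ordinals, 2023, "SYS_A").to_dict())
```

The real table could be loaded in the same way as the other files, for example with `pd.read_csv(DATA_DIR + 'MMasseyOrdinals.csv')`, and passed to the helper in place of `toy_ordinals`.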

##### **Conferences file:** `Conferences.csv`

This file indicates the Division I conferences that have existed over the years since 1985. Each conference is listed with an abbreviation and a longer name.

In [None]:
# Creates a dictionary to store conference full names
conf_abrv = store_conferences(DATA_DIR + 'Conferences.csv'); # use with conference abbrv as e.g. conf_abrv['a_sun']
print('e.g. output - ', conf_abrv['a_sun'])

##### **Conference Tourney file:** `MConferenceTourneyGames.csv`

This file indicates which games were part of each year's post-season men's conference tournaments (all of which finished on Selection Sunday or earlier), starting from the 2001 season. Many of these conference tournament games are held on neutral sites, and many of the games are played by tournament-caliber teams just a few days before the NCAA® tournament. Thus these games could be considered as very similar to NCAA® tournament games, and (depending on your methodology) may be of use in optimizing your predictions.

In [None]:
# Creates a Conference Tourney Games table 
CTourney = pd.read_csv(DATA_DIR + 'MConferenceTourneyGames.csv')
print("Done. Loaded conference tournament games table as 'CTourney'.")

##### **Secondary Teams file:** `MSecondaryTourneyTeams.csv`

This file identifies the teams that participated in post-season men's tournaments other than the NCAA® Tournament (such events would run in parallel with the NCAA® Tournament). These are teams that were not invited to the NCAA® Tournament and instead were invited to some other tournament.

In [None]:
# Creates a nested dictionary to store the secondary tournament each team played in, by season
sec_tourney = store_sec_tourney_teams(DATA_DIR + 'MSecondaryTourneyTeams.csv') # use as sec_tourney[year][team_id]
print('e.g. output -', sec_tourney['2019']['1400'])

#### **Load Data**

#### **Attribute Summary**

##### **`Comments:`** attribute summary

#### **Missing Values**

##### **`Comments:`** missing values

#### **Prepared Data Summary**

##### **`Comments:`** data preparation summary

### **Data Understanding Using Exploratory Data Analysis (EDA)**

#### **Import Additional Libraries**

#### **Plots of Features**

##### **`Comments:`** feature plots

#### **Feature Correlation**

##### **`Comments:`** feature correlation

#### **Correlation Maps**

##### **`Comments:`** correlation map

### **Model Building**

### **Model Evaluation**

### **Communication** (and/or Deployment)

### **References**

[1] 

[2] 
