# 2020 NCAA Men's and Women's Exploratory Data Analysis, All files explained

<a href="https://en.wikipedia.org/wiki/March_Madness_(disambiguation)">March Madness</a> refers to the collegiate men's and women's basketball tournaments in the US, held by the NCAA (National Collegiate Athletic Association).

We will build a model to **predict which team wins for every possible matchup**. A large amount of historical data about college basketball games and teams is provided.

This is a two-stage competition:
 - **Stage 1** - You should submit predicted probabilities for every possible matchup in the past 5 NCAA® tournaments (seasons 2015-2019).
 - **Stage 2** - You should submit predicted probabilities for every possible matchup before the 2020 tournament begins.
 
I will also demonstrate how to get a **logloss=0 score for Stage 1**.

[Update] I also wrote the following kernels, please check them out as well!
 - [2020 NCAAM: Fast data loading with feather](https://www.kaggle.com/corochann/2020-ncaam-fast-data-loading-with-feather)
 - [2020 NCAAW: Fast data loading with feather](https://www.kaggle.com/corochann/2020-ncaaw-fast-data-loading-with-feather)


![](https://www.ncaa.com/sites/default/files/public/styles/original/public-s3/images/2019/06/27/2020-NCAA-bracket-March-Madness.jpg?itok=ZFsTQ3uO)

In [None]:
import gc
import os
from pathlib import Path
import random
import sys

from tqdm.notebook import tqdm
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns

from IPython.display import display, HTML

# --- plotly ---
from plotly import tools, subplots
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff

# --- models ---
from sklearn import preprocessing
from sklearn.model_selection import KFold
import lightgbm as lgb
import xgboost as xgb
import catboost as cb

# --- setup ---
pd.set_option('display.max_columns', 50)

I will explain all the files provided in the NCAAM (men's) competition. Most of the explanation should apply to the NCAAW (women's) competition as well.

First, let's see what kind of files are provided as data. There are 26 files for this competition, which is a lot!<br/>
The [Data description page](https://www.kaggle.com/c/google-cloud-ncaa-march-madness-2020-division-1-mens-tournament/data) is correspondingly long. I will look into all of the files in this notebook.

In [None]:
# Input data files are available in the "../input/" directory.
import os

file_count = 0
for dirname, _, filenames in os.walk('/kaggle/input'):
    filenames.sort()
    for filename in filenames:
        print(os.path.join(dirname, filename))
        file_count += 1
print(f'Total {file_count} files!')

In [None]:
datadir = Path('/kaggle/input/google-cloud-ncaa-march-madness-2020-division-1-mens-tournament')
stage1dir = datadir/'MDataFiles_Stage1'

Let's look at the data carefully, one file at a time. I will follow the order of the [Data Description](https://www.kaggle.com/c/google-cloud-ncaa-march-madness-2020-division-1-mens-tournament/data) page.

Before we begin, a **special note about "Season" numbers**: the college basketball season runs from November until March, so it spans two calendar years. The season year refers to the year in which **the season ends**, not the year in which it starts.<br/>
For example, **"season 2020" means 2019-2020**.

# Data Section 1 - The Basics

This section provides everything you need to build a simple prediction model and submit predictions.

## Teams data

Team names and their 4-digit IDs are provided. In total, 367 teams have participated since 1985, although the 2020 tournament is held with 68 teams.

In [None]:
teams_df = pd.read_csv(stage1dir/'MTeams.csv')

print('teams_df', teams_df.shape)
teams_df.head()

Each team's first Division-I season `FirstD1Season` and last Division-I season `LastD1Season` are provided. Because March Madness is a Division-I tournament, a team may appear or disappear from year to year.

The Gantt chart below visualizes each team's first and last participation years. Note that it does not show "missing participation" in intermediate years.

In [None]:
tmp_df = teams_df[['TeamName', 'FirstD1Season', 'LastD1Season']].copy()
tmp_df.columns = ['Task', 'Start', 'Finish']

# Only plot first 20 teams
fig = ff.create_gantt(tmp_df.iloc[:20])
py.plot(fig, filename='gantt.html')
# fig.show()  # It causes kaggle kernel error when committed somehow...

## Seasons data

**DayZero**: The date corresponding to `DayNum=0` in that season. All game dates have been aligned on a common scale so that, each year, the Monday championship game of the men's tournament is on **`DayNum=154`**.

**RegionW, RegionX, RegionY, RegionZ**: The tournament divides the field into 4 regions, but the region names change from year to year. The RegionW, X, Y, Z columns give the names of the 4 regions for each year. Note that the champions of regions W and X, and of Y and Z, meet in the national semifinals. The winners of the WX and YZ semifinals then play the championship game.

In [None]:
seasons_df = pd.read_csv(stage1dir/'MSeasons.csv')

In [None]:
print(seasons_df.shape)
seasons_df.head()
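
As a small sketch of how `DayZero` and `DayNum` fit together: a calendar date is simply `DayZero` plus `DayNum` days. The toy row below is written by hand for illustration (I am assuming the `MM/DD/YYYY` string format and this particular value rather than reading them from `MSeasons.csv`), and `daynum_to_date` is a hypothetical helper name.

```python
import pandas as pd

# Toy stand-in for seasons_df; the DayZero value and its format are assumptions.
toy_seasons_df = pd.DataFrame({"Season": [1985], "DayZero": ["10/29/1984"]})


def daynum_to_date(seasons, season, daynum):
    """Return the calendar date of `daynum` in `season` (DayZero + daynum days)."""
    day_zero = seasons.loc[seasons["Season"] == season, "DayZero"].iloc[0]
    return pd.to_datetime(day_zero, format="%m/%d/%Y") + pd.Timedelta(days=daynum)


championship_date = daynum_to_date(toy_seasons_df, 1985, 154)
print(championship_date.date(), championship_date.day_name())
```

With the real `seasons_df` in place of the toy frame, the same helper should convert any game's `DayNum` to a date, provided the column format matches the assumption above.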

## Tournament seeds data

The `Seed` string consists of **Region (W, X, Y, Z) + seed number (01 ~ 16)**. Seed number 01 is considered the strongest. For teams in play-in games, which take place before the main tournament, a fourth character (**a or b**) is added.

The tournament slot can be understood by the figure provided in [official NCAA website](https://www.ncaa.com/news/basketball-men/ncaa-bracket-march-madness), which is introduced in [this kernel by @headsortails](https://www.kaggle.com/headsortails/jump-shot-to-conclusions-march-madness-eda).

![](https://www.ncaa.com/sites/default/files/public/styles/original/public-s3/images/2019/04/09/ncaa-tournament-bracket-2019-scores-games-virginia-texas-tech.png?itok=0E3VNWmI)

Before the main tournament, the 68 teams are reduced to 64, so 8 teams play 4 pre-tournament (play-in) games. "a" and "b" mark the two teams that meet in each of those games.

In [None]:
tourney_seeds_df = pd.read_csv(stage1dir/'MNCAATourneySeeds.csv')
tourney_seeds_df
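
The seed format described above can be split into its parts with plain string slicing. This is only a sketch on hand-written toy seeds; the new column names (`Region`, `SeedNum`, `IsPlayIn`) are my own choice and not part of the data.

```python
import pandas as pd

# Toy seeds in the Region + number (+ optional a/b) format described above.
toy_seeds_df = pd.DataFrame({"Seed": ["W01", "X16a", "X16b", "Z11"]})

# First character is the region, the next two are the seed number,
# and a fourth character exists only for play-in teams.
toy_seeds_df["Region"] = toy_seeds_df["Seed"].str[0]
toy_seeds_df["SeedNum"] = toy_seeds_df["Seed"].str[1:3].astype(int)
toy_seeds_df["IsPlayIn"] = toy_seeds_df["Seed"].str.len() == 4
print(toy_seeds_df)
```

A numeric `SeedNum` is convenient because seed differences between two teams can then be used directly as a model feature.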

## Regular Season Results data & Tournament Results data

Each year starts with the **regular season**, after which only 68 teams advance to the **NCAA tournament, also called March Madness**. (In parallel, some teams that do not make March Madness take part in a **secondary tournament**.) The game results for these three are stored in separate files.

"Compact results" data stores the essential win/loss information, while "Detailed results" data (explained later) stores much more detailed statistics. Let's look at the compact results data here.

In [None]:
regular_season_results_df = pd.read_csv(stage1dir/'MRegularSeasonCompactResults.csv')
tournament_results_df = pd.read_csv(stage1dir/'MNCAATourneyCompactResults.csv')

Both files have the same format and store the game date plus the winner's and loser's team ID and score.<br/>

Columns for the winning team start with "W" and columns for the losing team start with "L".

`WScore` and `LScore` are the winner's and loser's scores at the end of the game.

`WLoc` stores the "location" of the winning team, one of `["H", "A", "N"]`.<br/>
"H" is home, "A" is away (visiting the opponent's site), and "N" is a neutral court.

`NumOT` is the number of overtime periods (in basketball, when the score is tied at the end of regulation, the game continues with overtime periods).

The 4 columns **Season, DayNum, WTeamID and LTeamID** uniquely identify a game. This fact is very important when you want to merge information from other files.

In [None]:
regular_season_results_df.head()
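
To illustrate why the 4-column key matters, here is a minimal merge sketch. Both frames are toy data with made-up values; the second one just stands in for any other per-game file that shares the same key columns.

```python
import pandas as pd

GAME_KEYS = ["Season", "DayNum", "WTeamID", "LTeamID"]

# Toy rows (made-up values) standing in for a results file and another per-game file.
toy_results_df = pd.DataFrame({
    "Season": [2010, 2010], "DayNum": [7, 7],
    "WTeamID": [1143, 1314], "LTeamID": [1293, 1198],
    "WScore": [75, 88], "LScore": [70, 72],
})
toy_game_cities_df = pd.DataFrame({
    "Season": [2010, 2010], "DayNum": [7, 7],
    "WTeamID": [1314, 1143], "LTeamID": [1198, 1293],
    "CityID": [4066, 4027],
})

# validate="one_to_one" raises if the key does not identify a game uniquely.
merged_df = toy_results_df.merge(
    toy_game_cities_df, on=GAME_KEYS, how="left", validate="one_to_one"
)
print(merged_df)
```

Passing `validate` is optional, but it is a cheap way to confirm the uniqueness claim on the real files.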

For the tournament, all games show up as neutral-site games (so `WLoc` is always "N").

The NCAA tournament schedule is consistent, so the days of Rounds 1, 2, 3, 4, 5 (semifinals), and 6 (national championship) are fixed. Please refer to the [data description](https://www.kaggle.com/c/google-cloud-ncaa-march-madness-2020-division-1-mens-tournament/data) for details.

In [None]:
tournament_results_df.head()

As expected, the regular season has far more games than the tournament, since many more teams play many more matchups.

In [None]:
print('regular season', regular_season_results_df.shape, 'tournament', tournament_results_df.shape)

## Sample submission

It lists the matchups to predict for submission. In Stage 1, we need to predict matchups from 2015 through 2019.

> During Stage 1, you are asked to make predictions for all possible matchups from the past five NCAA® tournaments (seasons 2015, 2016, 2017, 2018, 2019).
> When there are 68 teams in the tournament, there are `68*67/2=2,278` predictions to make for that year, so a Stage 1 submission file will have `2,278*5=11,390` data rows.

> ID - this is a 14-character string of the format SSSS_XXXX_YYYY, where SSSS is the four digit season number, XXXX is the four-digit TeamID of the lower-ID team, and YYYY is the four-digit TeamID of the higher-ID team.
> Pred - this contains the predicted winning percentage for the first team identified in the ID field, the one represented above by XXXX.


We already saw that the 2015-2019 tournament results are provided, which means **we have the ground truth answers for Stage 1**. It is therefore easy to get a `logloss=0` score on the leaderboard during Stage 1. See the section "Create stage 1 submission file with ground truth data" at the bottom for details.

In [None]:
sample_submission = pd.read_csv(datadir/'MSampleSubmissionStage1_2020.csv')
sample_submission
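
The ID format and the row count quoted above can be checked with a short sketch. The 68 TeamIDs below are hypothetical placeholders, not a real tournament field, and `parse_matchup_id` is a helper name I made up.

```python
from itertools import combinations

# Hypothetical tournament field: 68 made-up TeamIDs, not a real bracket.
toy_team_ids = list(range(1101, 1169))
season = 2019

# combinations() on a sorted list yields each (lower, higher) pair exactly once.
matchup_ids = [
    f"{season}_{low}_{high}" for low, high in combinations(sorted(toy_team_ids), 2)
]
print(len(matchup_ids), matchup_ids[0])


def parse_matchup_id(matchup_id):
    """Split 'SSSS_XXXX_YYYY' into integer (season, lower TeamID, higher TeamID)."""
    season_str, low_str, high_str = matchup_id.split("_")
    return int(season_str), int(low_str), int(high_str)
```

Because the lower TeamID always comes first, `Pred` is always the win probability of the lower-ID team, regardless of which team is the favorite.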

Note: so what should we do in this competition? Stage 1 is only for testing your model; Stage 2 is the main part of the competition, where you need to predict the future 2020 matchup results.<br/>
We can simulate this by training the prediction model on data up to season XXXX and testing on season XXXX + 1, where XXXX = 2015, 2016, 2017, 2018, etc.
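
The season-by-season validation idea can be sketched as follows. The frame is toy data with made-up rows, and `season_splits` is a hypothetical helper: for each test season it yields everything strictly earlier as training data.

```python
import pandas as pd

# Toy per-game table (made-up rows); only the Season column matters here.
toy_games_df = pd.DataFrame({
    "Season": [2012, 2013, 2014, 2015, 2015, 2016, 2017],
    "Label": [1, 0, 1, 1, 0, 0, 1],
})


def season_splits(games, test_seasons):
    """Yield (test_season, train_df, test_df), training only on earlier seasons."""
    for test_season in test_seasons:
        train_df = games[games["Season"] < test_season]
        test_df = games[games["Season"] == test_season]
        yield test_season, train_df, test_df


for test_season, train_df, test_df in season_splits(toy_games_df, [2015, 2016, 2017]):
    print(test_season, len(train_df), len(test_df))
```

Splitting by season rather than by random rows keeps future games out of the training set, which mirrors the Stage 2 situation.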

# Data Section 2 - Team Box Scores

## Detailed results

This section provides game-by-game stats at a team level (free throws attempted, defensive rebounds, turnovers, etc.) for all regular season, conference tournament, and NCAA® tournament games since the 2002-03 season.

Team box scores are provided in the "Detailed Results" files rather than the "Compact Results" files.

Compact results have been collected since 1985, but detailed results have been collected only since 2003, so the number of rows differs.

However, 8 columns (Season, DayNum, WTeamID, WScore, LTeamID, LScore, WLoc, and NumOT) hold exactly the same values as in the CompactResults files.
The additional columns store more detailed information:

 - WFGM - field goals made (by the winning team)
 - WFGA - field goals attempted (by the winning team)
 - WFGM3 - three pointers made (by the winning team)
 - WFGA3 - three pointers attempted (by the winning team)
 - WFTM - free throws made (by the winning team)
 - WFTA - free throws attempted (by the winning team)
 - WOR - offensive rebounds (pulled by the winning team)
 - WDR - defensive rebounds (pulled by the winning team)
 - WAst - assists (by the winning team)
 - WTO - turnovers committed (by the winning team)
 - WStl - steals (accomplished by the winning team)
 - WBlk - blocks (accomplished by the winning team)
 - WPF - personal fouls committed (by the winning team)
 
The same set of stats is provided from the perspective of the losing team, in columns starting with "L".

In [None]:
regular_season_detailed_results_df = pd.read_csv(stage1dir/'MRegularSeasonDetailedResults.csv')
tournament_detailed_results_df = pd.read_csv(stage1dir/'MNCAATourneyDetailedResults.csv')

For example, the top row has WFGM=27, WFGM3=3, WFTM=11 and WScore=68.

This means (27-3)=24 two-point field goals, 3 three-point field goals, and 11 one-point free throws.<br/>
In total, `24 * 2 + 3 * 3 + 11 * 1 = 68`.

In [None]:
regular_season_detailed_results_df.head()
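
The score arithmetic above generalizes to `WScore = 2 * WFGM + WFGM3 + WFTM`, because `WFGM` already counts the three-pointers once. Here is a quick sketch on toy rows: the first row uses the numbers quoted above, the second is made up.

```python
import pandas as pd

# Toy rows: the first uses the numbers quoted above, the second is made up.
toy_detailed_df = pd.DataFrame({
    "WFGM": [27, 30], "WFGM3": [3, 8], "WFTM": [11, 10], "WScore": [68, 78],
})

# FGM counts both 2- and 3-point field goals, so each three adds one extra point.
reconstructed = (
    2 * toy_detailed_df["WFGM"] + toy_detailed_df["WFGM3"] + toy_detailed_df["WFTM"]
)
score_matches = reconstructed == toy_detailed_df["WScore"]
print(score_matches.all())
```

Running the same comparison on the real detailed results would be a simple sanity check of the box-score columns.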

In [None]:
tournament_detailed_results_df.head()

The number of rows is smaller than in the CompactResults data, since detailed information has been provided only since 2003.

In [None]:
print('regular', regular_season_detailed_results_df.shape, 'tournament', tournament_detailed_results_df.shape)

# Data Section 3 - Geography

This section provides city locations of all regular season, conference tournament, and NCAA® tournament games since the 2010 season.

In [None]:
cities_df = pd.read_csv(stage1dir/'Cities.csv')
mgame_cities_df = pd.read_csv(stage1dir/'MGameCities.csv')

## City information

City information and conference information are shared by the men's and women's contests.

City data stores the **city ID**, **city name** and **US state**.

In [None]:
cities_df

## Game city

The first 4 columns (**Season, DayNum, WTeamID and LTeamID**) are enough to uniquely identify the game.

**CRType** - one of `["Regular", "NCAA", "Secondary"]`, indicating which results file contains the game: "Regular" is the regular season, "NCAA" is the NCAA tournament, and "Secondary" is a secondary tournament.<br/>
**CityID** - the ID of the city where the game was played; look it up in `Cities.csv` above.

In [None]:
mgame_cities_df

# Data Section 4 - Public Rankings
This section provides weekly team rankings for dozens of top rating systems - Pomeroy, Sagarin, RPI, ESPN, etc., since the 2002-2003 season.

The information was gathered by Kenneth Massey and provided on his [College Basketball Ranking Composite page](https://www.masseyratings.com/cb/compare.htm).

In [None]:
massey_df = pd.read_csv(stage1dir/'MMasseyOrdinals.csv')
massey_df

`SystemName` identifies the rating system, and there are many of them. In general, we should only compare ranks produced by the same system.

In [None]:
massey_df['SystemName'].unique()


A ranking is meaningful only when we compare teams on the **same day (Season & RankingDayNum)** within the **same system (SystemName)**.

The example below shows the "SEL" ranking on DayNum=35 in 2003.

In [None]:
tmp_df = massey_df[(massey_df['Season'] == 2003) & (massey_df['RankingDayNum'] == 35) & (massey_df['SystemName'] == 'SEL')][['TeamID', 'OrdinalRank']]
# Only shows first 20 teams.
tmp_df.sort_values('OrdinalRank').iloc[:20].plot(kind='barh', x='TeamID', y='OrdinalRank')
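
A common way to use these ordinals is to take each team's most recent rank in one system up to a given day. The sketch below does that on toy data; the ranks are made up, and `latest_ranks` is a hypothetical helper, not something provided with the competition.

```python
import pandas as pd

# Toy ordinals (made-up ranks) in the same column layout as massey_df.
toy_massey_df = pd.DataFrame({
    "Season": [2003] * 6,
    "RankingDayNum": [35, 35, 128, 128, 133, 133],
    "SystemName": ["SEL"] * 6,
    "TeamID": [1102, 1103, 1102, 1103, 1102, 1103],
    "OrdinalRank": [159, 229, 140, 180, 132, 175],
})


def latest_ranks(ordinals, system, max_daynum):
    """Latest rank per (Season, TeamID) for one system, using days <= max_daynum."""
    mask = (ordinals["SystemName"] == system) & (ordinals["RankingDayNum"] <= max_daynum)
    subset = ordinals[mask].sort_values("RankingDayNum")
    return subset.groupby(["Season", "TeamID"], as_index=False).last()


print(latest_ranks(toy_massey_df, "SEL", 130))
```

Filtering on `RankingDayNum` before taking the last row keeps rankings published after the cutoff day out of the features.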

# Data Section 5 - Play-by-play
This section provides play-by-play event logs for more than 99.5% of each year's regular season, NCAA® tournament, and secondary tournament games since the 2014-15 season - including plays by individual players.

## Events data

This file stores the most fine-grained information for each game: every action by each player during the game.

In [None]:
event2015_df = pd.read_csv(datadir/'MEvents2015.csv')
# event2016_df = pd.read_csv(datadir/'MEvents2016.csv')
# event2017_df = pd.read_csv(datadir/'MEvents2017.csv')
# event2018_df = pd.read_csv(datadir/'MEvents2018.csv')
# event2019_df = pd.read_csv(datadir/'MEvents2019.csv')

The 4 columns (Season, DayNum, WTeamID, LTeamID) identify the game. Events from the same game share the same values in these columns.

The first 5 rows below can be read as follows.

EventID 1: Team 1103 Player 100 attempted a 3-point shot, which was missed.<br/>
EventID 2: Team 1420 Player 11784 got the rebound, so the defense got the ball and possession changed.<br/>

EventID 3: Team 1420 Player 11789 attempted a 2-point DUNK shot, which was made!<br/>
EventID 4: Team 1420 Player 11803 assisted on that dunk.<br/>

EventID 5: Team 1103 Player 87 attempted a 2-point shot, which was made.<br/>

In [None]:
event2015_df.head(10)

Event Types and Subtypes:

 - **assist** - an assist was credited on a made shot
 - **block** - a blocked shot was recorded
 - **steal** - a steal was recorded
 - **sub** - a substitution occurred, with one of the following subtypes: in=player entered the game; out=player exited the game; start=player started the game
 - **timeout** - a timeout was called, with one of the following subtypes: unk=unknown type of timeout; comm=commercial timeout; full=full timeout; short= short timeout
 - **turnover** - a turnover was recorded, with one of the following subtypes: unk=unknown type of turnover; 10sec=10 second violation; 3sec=3 second violation; 5sec=5 second violation; bpass=bad pass turnover; dribb=dribbling turnover; lanev=lane violation; lostb=lost ball; offen=offensive turnover (?); offgt=offensive goaltending; other=other type of turnover; shotc=shot clock violation; trav=travelling
 - **foul** - a foul was committed, with one of the following subtypes: unk=unknown type of foul; admT=administrative technical; benT=bench technical; coaT=coach technical; off=offensive foul; pers=personal foul; tech=technical foul
 - **fouled** - a player was fouled
 - **reb** - a rebound was recorded, with one of the following subtypes: deadb=a deadball rebound; def=a defensive rebound; defdb=a defensive deadball rebound; off=an offensive rebound; offdb=an offensive deadball rebound
 - **made1, miss1** - a one-point free throw was made or missed, with one of the following subtypes: 1of1=the only free throw of the trip to the line; 1of2=the first of two free throw attempts; 2of2=the second of two free throw attempts; 1of3=the first of three free throw attempts; 2of3=the second of three free throw attempts; 3of3=the third of three free throw attempts; unk=unknown what the free throw sequence is
 - **made2, miss2** - a two-point field goal was made or missed, with one of the following subtypes: unk=unknown type of two-point shot; dunk=dunk; lay=layup; tip=tip-in; jump=jump shot; alley=alley-oop; drive=driving layup; hook=hook shot; stepb=step-back jump shot; pullu=pull-up jump shot; turna=turn-around jump shot; wrong=wrong basket
 - **made3, miss3** - a three-point field goal was made or missed, with one of the following subtypes: unk=unknown type of three-point shot; jump=jump shot; stepb=step-back jump shot; pullu=pull-up jump shot; turna=turn-around jump shot; wrong=wrong basket
 - **jumpb** - a jumpball was called or resolved, with one of the following subtypes: start=start period; block=block tie-up; heldb=held ball; lodge=lodged ball; lost=jump ball lost; outof=out of bounds; outrb=out of bounds rebound; won=jump ball won
 - **X, Y** - for games where it is available, this describes an X/Y position on the court where the lower-left corner of the court is (0,0), the upper-right corner of the court is (100,100), and so on. The X/Y position is provided for fouls, turnovers, and field-goal attempts (either 2-point or 3-point).
 - **Area** - for events where an X/Y position is provided, this position is more generally categorized into one of 13 "areas" of the court, as follows: 1=under basket; 2=in the paint; 3=inside right wing; 4=inside right; 5=inside center; 6=inside left; 7=inside left wing; 8=outside right wing; 9=outside right; 10=outside center; 11=outside left; 12=outside left wing; 13=backcourt
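
As a small illustration of working with the event types listed above, the sketch below turns `made1` / `made2` / `made3` events into points per team. The rows are made up, and the column names `EventTeamID` and `EventType` are assumed to match the events file.

```python
import pandas as pd

# Toy event log (made-up rows); EventTeamID / EventType are assumed column names.
toy_events_df = pd.DataFrame({
    "EventTeamID": [1103, 1420, 1420, 1420, 1103, 1103],
    "EventType": ["miss3", "reb", "made2", "assist", "made2", "made1"],
})

# Only made1 / made2 / made3 events score; every other event type maps to 0 points.
points_by_type = {"made1": 1, "made2": 2, "made3": 3}
toy_events_df["Points"] = (
    toy_events_df["EventType"].map(points_by_type).fillna(0).astype(int)
)

team_points = toy_events_df.groupby("EventTeamID")["Points"].sum()
print(team_points)
```

Summed over a whole game, these points should reproduce the final scores, which gives a simple consistency check for the event data.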

## Players data

Player's **id**, **name**, and their **team** information is stored.

In [None]:
players_df = pd.read_csv(datadir/'MPlayers.csv')
players_df

# Data Section 6 - Supplements

This section contains additional supporting information, including coaches, conference affiliations, alternative team name spellings, bracket structure, and game results for NIT and other postseason tournaments.

## Coach data

Coach information stores the name of each team's head coach and the range of days he or she was in charge. Usually one coach holds the role for the whole season, so `FirstDayNum=0` and `LastDayNum=154`, but sometimes a team has multiple head coaches within one season.

For example, `tom_pugliese` and `mark_slonaker` were both head coaches of Team ID 1209 in Season 1985.

In [None]:
team_coaches_df = pd.read_csv(stage1dir/'MTeamCoaches.csv')

print('team_coaches_df', team_coaches_df.shape)
team_coaches_df.iloc[80:85]
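
When a team has more than one coach in a season, the day range tells us who was in charge for a given game. The sketch below uses toy rows: the coach names are placeholders and the day split is invented, and `coach_on_day` is a hypothetical helper.

```python
import pandas as pd

# Toy rows in the MTeamCoaches layout; names and the DayNum split are made up.
toy_coaches_df = pd.DataFrame({
    "Season": [1985, 1985],
    "TeamID": [1209, 1209],
    "FirstDayNum": [0, 60],
    "LastDayNum": [59, 154],
    "CoachName": ["coach_a", "coach_b"],
})


def coach_on_day(coaches, season, team_id, daynum):
    """Return the CoachName whose [FirstDayNum, LastDayNum] range contains daynum."""
    mask = (
        (coaches["Season"] == season)
        & (coaches["TeamID"] == team_id)
        & (coaches["FirstDayNum"] <= daynum)
        & (coaches["LastDayNum"] >= daynum)
    )
    return coaches.loc[mask, "CoachName"].iloc[0]


print(coach_on_day(toy_coaches_df, 1985, 1209, 100))
```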

## Conference data

`Conferences.csv` is shared by the men's and women's competitions and stores each conference's **abbreviation** and **full name**.<br/>

In [None]:
conferences_df = pd.read_csv(stage1dir/'Conferences.csv')
team_conferences_df = pd.read_csv(stage1dir/'MTeamConferences.csv')

In [None]:
conferences_df

`MTeamConferences.csv` stores each team's conference affiliation. A team's conference may change over time, so this data contains a "Season" column to indicate the year.

In [None]:
team_conferences_df.head()

For example, TeamID 1102 (=Air Force) changed its conference from `wac` (Western Athletic Conference) to `mwc` (Mountain West Conference) in 2000.

In [None]:
team_conferences_df[team_conferences_df['TeamID'] == 1102]

`MConferenceTourneyGames.csv` links each conference tournament game to its conference.<br/>
The 4 columns (Season, DayNum, WTeamID, LTeamID) uniquely identify the game, and `ConfAbbrev` stores the conference for that game.

In [None]:
conference_tourney_games_df = pd.read_csv(stage1dir/'MConferenceTourneyGames.csv')
conference_tourney_games_df

## Secondary Tournament

In parallel with the NCAA tournament, secondary tournaments are held. `SecondaryTourney` is one of `["NIT", "CBI", "CIT", "V16"]`.<br/>
"NIT" is the most prominent of these 4 tournaments.

First, `MSecondaryTourneyTeams.csv` records which tournament each team participated in for each season.

In [None]:
secondary_tourney_teams_df = pd.read_csv(stage1dir/'MSecondaryTourneyTeams.csv')
secondary_tourney_teams_df

`MSecondaryTourneyCompactResults.csv` contains the result of each secondary tournament game.

Note that **detailed results or event data is not provided**. We can only use compact results for secondary tournament.

In [None]:
secondary_tourney_results_df = pd.read_csv(stage1dir/'MSecondaryTourneyCompactResults.csv')
secondary_tourney_results_df

## Team spelling data

[Update] Based on the discussion [Team spelling Unicode](https://www.kaggle.com/c/march-madness-analytics-2020/discussion/130705) by @wjholst, this file can be opened by specifying `encoding="cp1252"`.

It lists each `TeamID` together with the various spellings of the team's name.

In [None]:
# Reading with the default encoding runs into an encoding problem:
# team_spellings_df = pd.read_csv(stage1dir/'MTeamSpellings.csv')

# encoding="cp1252" works!
team_spellings_df = pd.read_csv(stage1dir/'MTeamSpellings.csv', encoding="cp1252")

In [None]:
team_spellings_df.head()

We can see that several spellings exist for one team.
For example, the team with TeamID=1394 is named "TAM C. Christi" in `teams_df`, but it is also spelled "a&m-corpus chris" or "a&m-corpus christi" according to `team_spellings_df`.

In [None]:
teams_df.query("TeamID == 1394").TeamName

## Tournament slots data

The NCAA tournament has a consistent bracket structure every year.<br/>
The "slots" information below indicates where in the bracket each seeded team plays.

"GameSlot" is named after the strongest seed for that slot, so when the W11 team beats W6 in the previous round, its game slot in the later round is still recorded as W6.

Details are also written in the [data description](https://www.kaggle.com/c/google-cloud-ncaa-march-madness-2020-division-1-mens-tournament/data) page.

In [None]:
tourney_slots_df = pd.read_csv(stage1dir/'MNCAATourneySlots.csv')
tourney_seed_round_slots_df = pd.read_csv(stage1dir/'MNCAATourneySeedRoundSlots.csv')

In [None]:
tourney_slots_df

For example, Round 1 of Region W in Season 1985 consists of the following 8 matchups.

In [None]:
tourney_slots_df[(tourney_slots_df['Season'] == 1985) & (tourney_slots_df['Slot'].str.startswith('R1W'))]

`MNCAATourneySeedRoundSlots.csv` stores further information about each `GameRound` and `DayNum`.

In [None]:
tourney_seed_round_slots_df

# Create stage 1 submission file with ground truth data

As we already saw, the Stage 1 prediction target is the 2015-2019 tournament results, which are provided in the data. I will use them to demonstrate and check our understanding of the Stage 1 submission file.

First, we only need the tournament results from 2015 through 2019.

In [None]:
tournament_results2015_df = tournament_results_df.query("Season >= 2015")
tournament_results2015_df

As a reminder, the sample submission format is as follows, consisting of ID and Pred.

In [None]:
sample_submission.head()

Then we set **Pred = 1.0** for **ID = Season_WTeamID_LTeamID** (when the winner has the lower TeamID), or **Pred = 0.0** for **ID = Season_LTeamID_WTeamID** (when the loser has the lower TeamID).
The code below only overwrites the `Pred` values in the sample_submission data.

In [None]:

for key, row in tournament_results2015_df.iterrows():
    if row['WTeamID'] < row['LTeamID']:
        # Winner has the lower TeamID: ID is Season_WTeamID_LTeamID and the first team won
        id_name = str(row['Season']) + '_' + str(row['WTeamID']) + '_' + str(row['LTeamID'])
        sample_submission.loc[sample_submission['ID'] == id_name, 'Pred'] = 1.0
    else:
        # Loser has the lower TeamID: ID is Season_LTeamID_WTeamID and the first team lost
        id_name = str(row['Season']) + '_' + str(row['LTeamID']) + '_' + str(row['WTeamID'])
        sample_submission.loc[sample_submission['ID'] == id_name, 'Pred'] = 0.0

Save the created submission file.

In [None]:
sample_submission.to_csv('submission.csv', index=False)

Check the histogram. Only a few rows are changed to Pred 1.0 or 0.0, while most rows remain at 0.5, because most of the **matchups were never actually played**. The score is evaluated only on the games that were actually played.

In [None]:
sample_submission['Pred'].hist()
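
To see why ground-truth predictions score essentially 0 while a constant 0.5 does not, here is a sketch of the usual binary log loss. The outcomes are toy values, and the clipping constant `eps` is my own assumption rather than something taken from the competition page.

```python
import numpy as np


def log_loss_score(y_true, y_pred, eps=1e-15):
    """Binary log loss; predictions are clipped to [eps, 1 - eps] to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    losses = y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    return float(-np.mean(losses))


# Toy outcomes: 1 means the lower-ID team won.
outcomes = [1, 0, 1, 1]

baseline_loss = log_loss_score(outcomes, [0.5, 0.5, 0.5, 0.5])  # constant 0.5 guess
perfect_loss = log_loss_score(outcomes, outcomes)               # ground truth as prediction
print(baseline_loss, perfect_loss)
```

The constant guess gives `ln(2)`, roughly 0.693, which is the baseline any real model should beat. A confident wrong prediction is punished heavily, so copying this 0/1 style into a genuine forecast would be risky.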

Of course, this approach **does not work at all for Stage 2**, where we need to predict the 2020 tournament results before any 2020 tournament data is available.<br/>
This is just a demonstration of our understanding of the data.

When you refer to other people's kernels, please check whether they use 2015-2019 tournament data to train the model; if so, the target is **leaked for Stage 1**. In that case essentially any score is achievable, just as I could get a `logloss=0` score here.