Welcome to the **[30 Days of ML competition](https://www.kaggle.com/c/30-days-of-ml/overview)**!  In this notebook, you"ll learn how to make your first submission.

Before getting started, make your own editable copy of this notebook by clicking on the **Copy and Edit** button.

# Step 1: Import helpful libraries

We begin by importing the libraries we"ll need.  Some of them will be familiar from the **[Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning)** course and the **[Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)** course.

In [1]:
from typing import List, Tuple
# Familiar imports
import numpy as np
import pandas as pd

pd.set_option("display.max_columns", None)

def load_raw_data(filename: str, stage: int = 1) -> pd.DataFrame:
    path = f"/kaggle/input/mens-march-mania-2022/MDataFiles_Stage{stage}/"
    return pd.read_csv(path+filename)

stage = 1

# Step 2: Load the data

Next, we"ll load the raw data.

## Regular Season Detailed Results

This file provides team-level box scores for many regular seasons of historical data, starting with the 2003 season. All games listed in the MRegularSeasonCompactResults file since the 2003 season should exactly be present in the MRegularSeasonDetailedResults file.

### Regular Season Compact Results

This file identifies the game-by-game results for many seasons of historical data, starting with the 1985 season (the first year the NCAA® had a 64-team tournament). For each season, the file includes all games played from DayNum 0 through 132. It is important to realize that the "Regular Season" games are simply defined to be all games played on DayNum=132 or earlier (DayNum=132 is Selection Sunday, and there are always a few conference tournament finals actually played early in the day on Selection Sunday itself). Thus a game played on or before Selection Sunday will show up here whether it was a pre-season tournament, a non-conference game, a regular conference game, a conference tournament game, or whatever.

* Season - this is the year of the associated entry in MSeasons.csv (the year in which the final tournament occurs). For example, during the 2016 season, there were regular season games played between November 2015 and March 2016, and all of those games will show up with a Season of 2016.
* DayNum - this integer always ranges from 0 to 132, and tells you what day the game was played on. It represents an offset from the "DayZero" date in the "MSeasons.csv" file. For example, the first game in the file was DayNum=20. Combined with the fact from the "MSeasons.csv" file that day zero was 10/29/1984 that year, this means the first game was played 20 days later, or 11/18/1984. There are no teams that ever played more than one game on a given date, so you can use this fact if you need a unique key (combining Season and DayNum and WTeamID). In order to accomplish this uniqueness, we had to adjust one game's date. In March 2008, the SEC postseason tournament had to reschedule one game (Georgia-Kentucky) to a subsequent day because of a tornado, so Georgia had to actually play two games on the same day. In order to enforce this uniqueness, we moved the game date for the Georgia-Kentucky game back to its original scheduled date.
* WTeamID - this identifies the id number of the team that won the game, as listed in the "MTeams.csv" file. No matter whether the game was won by the home team or visiting team, or if it was a neutral-site game, the "WTeamID" always identifies the winning team.
* WScore - this identifies the number of points scored by the winning team.
* LTeamID - this identifies the id number of the team that lost the game.
* LScore - this identifies the number of points scored by the losing team. Thus you can be confident that WScore will be greater than LScore for all games listed.
* WLoc - this identifies the "location" of the winning team. If the winning team was the home team, this value will be "H". If the winning team was the visiting team, this value will be "A". If it was played on a neutral court, then this value will be "N". Sometimes it is unclear whether the site should be considered neutral, since it is near one team's home court, or even on their court during a tournament, but for this determination we have simply used the Kenneth Massey data in its current state, where the "@" sign is either listed with the winning team, the losing team, or neither team. If you would like to investigate this factor more closely, we invite you to explore Data Section 3, which provides the city that each game was played in, irrespective of whether it was considered to be a neutral site.
* NumOT - this indicates the number of overtime periods in the game, an integer 0 or higher.


In [2]:
MRegularSeasonDetailedResults = load_raw_data("MRegularSeasonDetailedResults.csv", stage)
MRegularSeasonDetailedResults.tail()

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,WFGM,WFGA,WFGM3,WFGA3,WFTM,WFTA,WOR,WDR,WAst,WTO,WStl,WBlk,WPF,LFGM,LFGA,LFGM3,LFGA3,LFTM,LFTA,LOR,LDR,LAst,LTO,LStl,LBlk,LPF
100418,2022,98,1400,79,1242,76,H,0,28,67,3,20,20,23,14,18,8,6,7,2,21,28,48,5,13,15,23,5,24,10,15,3,5,21
100419,2022,98,1411,66,1126,63,A,0,24,59,2,20,16,28,12,27,9,19,10,5,19,20,49,8,21,15,24,5,23,10,19,13,2,23
100420,2022,98,1422,68,1441,49,A,0,23,56,13,32,9,13,11,22,11,15,11,1,13,18,53,5,24,8,11,10,18,5,16,8,2,12
100421,2022,98,1438,69,1181,68,A,0,31,65,2,12,5,9,10,20,16,5,10,2,17,22,52,6,17,18,22,11,25,14,14,3,9,11
100422,2022,98,1439,74,1338,47,H,0,29,55,13,27,3,6,9,26,20,8,6,1,12,15,40,10,20,7,9,0,18,11,11,3,6,14


## Tourney Detailed Results
This file provides team-level box scores for many NCAA® tournaments, starting with the 2003 season. All games listed in the MNCAATourneyCompactResults file since the 2003 season should exactly be present in the MNCAATourneyDetailedResults file.

### Tourney Compact Results

This file identifies the game-by-game NCAA® tournament results for all seasons of historical data. The data is formatted exactly like the MRegularSeasonCompactResults data. All games will show up as neutral site (so WLoc is always N). Note that this tournament game data also includes the play-in games (which always occurred on day 134/135) for those years that had play-in games. Thus each season you will see between 63 and 67 games listed, depending on how many play-in games there were.

Because of the consistent structure of the NCAA® tournament schedule, you can actually tell what round a game was, depending on the exact DayNum. Thus:

* DayNum=134 or 135 (Tue/Wed) - play-in games to get the tournament field down to the final 64 teams
* DayNum=136 or 137 (Thu/Fri) - Round 1, to bring the tournament field from 64 teams to 32 teams
* DayNum=138 or 139 (Sat/Sun) - Round 2, to bring the tournament field from 32 teams to 16 teams
* DayNum=143 or 144 (Thu/Fri) - Round 3, otherwise known as "Sweet Sixteen", to bring the tournament field from 16 teams to 8 teams
* DayNum=145 or 146 (Sat/Sun) - Round 4, otherwise known as "Elite Eight" or "regional finals", to bring the tournament field from 8 teams to 4 teams
* DayNum=152 (Sat) - Round 5, otherwise known as "Final Four" or "national semifinals", to bring the tournament field from 4 teams to 2 teams
* DayNum=154 (Mon) - Round 6, otherwise known as "national final" or "national championship", to bring the tournament field from 2 teams to 1 champion team

Special note: Each year, there are also going to be other games that happened after Selection Sunday, which are not part of the NCAA® Tournament. This includes tournaments like the postseason NIT, the CBI, the CIT, and the Vegas 16. Such games are not listed in the Regular Season or the NCAA® Tourney files; they can be found in the "Secondary Tourney" data files within Data Section 6. Although they would not be games you would ever be predicting directly for the NCAA® tournament, and they would not be games you would have data from at the time of predicting NCAA® tournament outcomes, you may nevertheless wish to make use of these games for model optimization, depending on your methodology. The more games that you can test your predictions against, the better your optimized model might eventually become, depending on how applicable all those games are. A similar argument might be advanced in favor of optimizing your predictions against conference tournament games, which might be viewed as reasonable proxies for NCAA® tournament games.

In [3]:
MNCAATourneyDetailedResults = load_raw_data("MNCAATourneyDetailedResults.csv", stage)
MNCAATourneyDetailedResults.tail()

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,WFGM,WFGA,WFGM3,WFGA3,WFTM,WFTA,WOR,WDR,WAst,WTO,WStl,WBlk,WPF,LFGM,LFGA,LFGM3,LFGA3,LFTM,LFTA,LOR,LDR,LAst,LTO,LStl,LBlk,LPF
1176,2021,148,1211,85,1425,66,N,0,33,66,7,21,12,17,11,27,21,9,6,3,16,24,62,4,15,14,19,7,20,9,9,7,0,13
1177,2021,148,1417,51,1276,49,N,0,21,54,3,13,6,7,6,21,12,8,5,2,14,20,51,3,11,6,11,8,24,12,14,5,3,11
1178,2021,152,1124,78,1222,59,N,0,29,55,11,24,9,13,11,17,23,8,6,0,18,21,55,6,19,11,16,13,12,10,10,4,5,10
1179,2021,152,1211,93,1417,90,N,1,37,63,7,21,12,20,4,19,25,10,8,3,16,34,59,8,17,14,21,7,24,21,9,4,1,16
1180,2021,154,1124,86,1211,70,N,0,30,67,10,23,16,18,14,20,18,7,8,5,19,25,49,5,17,15,21,1,16,16,14,4,3,17


## Rankings
This file lists out rankings (e.g. #1, #2, #3, ..., #N) of teams going back to the 2002-2003 season, under a large number of different ranking system methodologies. The information was gathered by Kenneth Massey and provided on his College Basketball Ranking Composite page.

Note that a rating system is more precise than a ranking system, because a rating system can provide insight about the strength gap between two adjacently-ranked teams. A ranking system will just tell you who is #1 or who is #2, but a rating system might tell you whether the gap between #1 and #2 is large or small. Nevertheless, it can be hard to compare two different rating systems that are expressed in different scales, so it can be very useful to express all the systems in terms of their ordinal ranking (1, 2, 3, ..., N) of teams.

* Season - this is the year of the associated entry in MSeasons.csv (the year in which the final tournament occurs)
* RankingDayNum - this integer always ranges from 0 to 133, and is expressed in the same terms as a game's DayNum (where DayZero is found in the MSeasons.csv file). The RankingDayNum is intended to tell you the first day that it is appropriate to use the rankings for predicting games. For example, if RankingDayNum is 110, then the rankings ought to be based upon game outcomes up through DayNum=109, and so you can use the rankings to make predictions of games on DayNum=110 or later. The final pre-tournament rankings each year have a RankingDayNum of 133, and can thus be used to make predictions of the games from the NCAA® tournament, which start on DayNum=134 (the Tuesday after Selection Sunday).
* SystemName - this is the (usually) 3-letter abbreviation for each distinct ranking system. These systems may evolve from year to year, but as a general rule they retain their meaning across the years. Near the top of the Massey composite page, you can find slightly longer labels describing each system, along with links to the underlying pages where the latest rankings are provided (and sometimes the calculation is described).
* TeamID - this is the ID of the team being ranked, as described in MTeams.csv.
* OrdinalRank - this is the overall ranking of the team in the underlying system. Most systems from recent seasons provide a complete ranking from #1 through #351, but more recently they go higher because additional teams were added to Division I in recent years.

Disclaimer: you ought to be careful about your methodology when using or evaluating these ranking systems. They are presented on a weekly basis, and given a consistent date on the Massey Composite page that typically is a Sunday; that is how the ranking systems can be compared against each other on this page. However, these systems each follow their own timeline and some systems may be released on a Sunday and others on a Saturday or Monday or even Tuesday. You should remember that if a ranking is released on a Tuesday, and was calculated based on games played through Monday, it will make the system look unusually good at predicting if you use that system to forecast the very games played on Monday that already inform the rankings. To avoid this methodological trap, we have typically used a conservative RankingDayNum of Wednesday to represent the rankings that were released at approximately the end of the weekend, a few days before, even though those rankings are represented on the composite page as being on a Sunday. For some of the older years, a more precise timestamp was known for each ranking system that allowed a more precise assignment of a RankingDayNum. By convention, the final pre-tournament rankings are always expressed as RankingDayNum=133, even though sometimes the rankings for individual systems are not released until Tuesday (DayNum=134) or even Wednesday or Thursday. If you decide to use some rankings from these Massey Ordinals to inform your predictions, be forewarned that we have no control over when they are released, and not all systems may turn out to be available in time to make pre-tournament predictions by our submission deadline. In such a situation, you may wish to use the rankings from DayNum=128 or you may need to dig into the details of the actual source of the rankings, by following the respective links on the Massey Composite Page. We may also be able to provide partial releases of the final pre-tournament Massey Ordinals on the forums, so that as systems come in on Monday or Tuesday you can use them right away.


In [4]:
MMasseyOrdinals = load_raw_data("MMasseyOrdinals.csv", stage)
MMasseyOrdinals.tail()

Unnamed: 0,Season,RankingDayNum,SystemName,TeamID,OrdinalRank
4521715,2022,100,WOL,1468,183
4521716,2022,100,WOL,1469,259
4521717,2022,100,WOL,1470,209
4521718,2022,100,WOL,1471,270
4521719,2022,100,WOL,1472,296


## Tourney Seeds

This file identifies the seeds for all teams in each NCAA® tournament, for all seasons of historical data. Thus, there are between 64-68 rows for each year, depending on whether there were any play-in games and how many there were. In recent years the structure has settled at 68 total teams, with four "play-in" games leading to the final field of 64 teams entering Round 1 on Thursday of the first week (by definition, that is DayNum=136 each season). We will not know the seeds of the respective tournament teams, or even exactly which 68 teams it will be, until Selection Sunday on March 13, 2022 (DayNum=132).

* Season - the year that the tournament was played in
* Seed - this is a 3/4-character identifier of the seed, where the first character is either W, X, Y, or Z (identifying the region the team was in) and the next two digits (either 01, 02, ..., 15, or 16) tell you the seed within the region. For play-in teams, there is a fourth character (a or b) to further distinguish the seeds, since teams that face each other in the play-in games will have seeds with the same first three characters. The "a" and "b" are assigned based on which Team ID is lower numerically. As an example of the format of the seed, the first record in the file is seed W01 from 1985, which means we are looking at the #1 seed in the W region (which we can see from the "MSeasons.csv" file was the East region).
* TeamID - this identifies the id number of the team, as specified in the MTeams.csv file


In [5]:
MNCAATourneySeeds = load_raw_data("MNCAATourneySeeds.csv", stage)
MNCAATourneySeeds.tail()

Unnamed: 0,Season,Seed,TeamID
2349,2021,Z12,1457
2350,2021,Z13,1317
2351,2021,Z14,1159
2352,2021,Z15,1331
2353,2021,Z16,1216


# Step 3: Prepare the data



## Feature Engineering

### Detailed Results

In [6]:
def process_detailed_results(df_in: pd.DataFrame) -> pd.DataFrame:
    df = df_in.copy()
    df = clean_detailed_results(df)
    df = aggregate_detailed_results(df)
    df = compute_percentages(df)
    return df

def clean_detailed_results(df: pd.DataFrame) -> pd.DataFrame:
    return df.drop(["WLoc", "DayNum"], axis=1)

def reshape_detailed_results(df: pd.DataFrame) -> pd.DataFrame:
    winner_columns, looser_columns = split_winner_and_looser_columns(df)
    df_winner = df.copy()
    df_winner = df_winner[winner_columns]
    df_winner.columns = clean_column_names(df_winner)
    df_winner["Win"] = 1
    df_looser = df.copy()
    df_looser = df_looser[looser_columns]
    df_looser.columns = clean_column_names(df_looser)
    df_looser["Win"] = 0
    return pd.concat([df_winner, df_looser], ignore_index=True)

def aggregate_detailed_results(df: pd.DataFrame) -> pd.DataFrame:
    df = reshape_detailed_results(df)
    df_agg = df.groupby(["Season", "TeamID"]).agg("mean")
    return df_agg.reset_index()

def compute_percentages(df_in: pd.DataFrame) -> pd.DataFrame:
    df = df_in.copy()
    df["FGP"] =  df["FGM"] / df["FGA"]
    df["FGP3"] =  df["FGM3"] / df["FGA3"]
    df["FTP"] =  df["FTM"] / df["FTA"]
    return df

def split_winner_and_looser_columns(df: pd.DataFrame) -> Tuple[List[str], List[str]]:
    winner_columns = [name for name in df.columns if not name.startswith("L")]
    looser_columns = [name for name in df.columns if not name.startswith("W")]
    return winner_columns, looser_columns

def clean_column_names(df: pd.DataFrame) -> List[str]:
    column_names = [
        name[1:] if 
        name.startswith("L") or name.startswith("W")
        else name 
        for name in df.columns
    ]
    return column_names

# Test data
test_df = pd.DataFrame([
    {"Season": 1, "WTeamID": "A", "LTeamID": "B", "stat1": 1, "Wstat2": 2, "Lstat2": 3 },
    {"Season": 1, "WTeamID": "A", "LTeamID": "B", "stat1": 4, "Wstat2": 5, "Lstat2": 6 },
])
expected_column_names = [
    "Season", "TeamID", "TeamID", "stat1", "stat2", "stat2"
]
expected_column_split = (
    ["Season", "WTeamID", "stat1", "Wstat2"], 
    ["Season", "LTeamID", "stat1", "Lstat2"]
)
expected_reshaped_df = pd.DataFrame([
    { "Season": 1, "TeamID": "A", "stat1": 1, "stat2": 2, "Win": 1 },
    { "Season": 1, "TeamID": "A", "stat1": 4, "stat2": 5, "Win": 1 },
    { "Season": 1, "TeamID": "B", "stat1": 1, "stat2": 3, "Win": 0 },
    { "Season": 1, "TeamID": "B", "stat1": 4, "stat2": 6, "Win": 0 },
    
])
expected_aggregated_df = pd.DataFrame([
    {"Season": 1, "TeamID": "A", "stat1": 2.5, "stat2": 3.5, "Win": 1.0 },
    {"Season": 1, "TeamID": "B","stat1": 2.5, "stat2": 4.5, "Win": 0.0 },
])
test_df_copy = test_df.copy()

# Tests
assert clean_column_names(test_df) == expected_column_names, "Function clean_column_names failed."
assert split_winner_and_looser_columns(test_df) == expected_column_split, "Function split_winner_and_looser_columns failed."
assert expected_reshaped_df.equals(reshape_detailed_results(test_df)), "Function reshape_detailed_results failed."
assert expected_aggregated_df.equals(aggregate_detailed_results(test_df)), "Function aggregate_detailed_results failed."

In [7]:
ProcessedMRegularSeasonDetailedResults = process_detailed_results(
    MRegularSeasonDetailedResults
)
ProcessedMRegularSeasonDetailedResults.tail()

Unnamed: 0,Season,TeamID,Score,NumOT,FGM,FGA,FGM3,FGA3,FTM,FTA,OR,DR,Ast,TO,Stl,Blk,PF,Win,FGP,FGP3,FTP
6887,2022,1468,66.6,0.05,25.05,54.85,7.15,22.1,9.35,12.55,5.9,20.65,13.3,9.75,4.85,1.6,16.05,0.45,0.4567,0.323529,0.74502
6888,2022,1469,69.526316,0.0,24.105263,58.473684,6.473684,21.631579,14.842105,21.473684,8.736842,23.631579,14.631579,15.052632,6.210526,2.315789,19.842105,0.368421,0.412241,0.29927,0.691176
6889,2022,1470,63.428571,0.047619,22.619048,54.904762,5.142857,16.666667,13.047619,17.571429,7.380952,19.095238,10.142857,10.190476,8.190476,2.142857,18.380952,0.380952,0.411969,0.308571,0.742547
6890,2022,1471,67.1,0.05,22.7,52.4,8.1,23.7,13.6,18.6,4.8,21.95,11.95,13.25,5.5,1.6,15.05,0.4,0.433206,0.341772,0.731183
6891,2022,1472,72.736842,0.0,26.105263,60.0,10.736842,30.052632,9.789474,12.631579,5.736842,19.368421,11.842105,7.947368,4.842105,1.157895,17.0,0.263158,0.435088,0.357268,0.775


In [8]:
ProcessedMNCAATourneyDetailedResults = process_detailed_results(
    MNCAATourneyDetailedResults
)
ProcessedMNCAATourneyDetailedResults.tail()

Unnamed: 0,Season,TeamID,Score,NumOT,FGM,FGA,FGM3,FGA3,FTM,FTA,OR,DR,Ast,TO,Stl,Blk,PF,Win,FGP,FGP3,FTP
1194,2021,1439,70.0,1.0,24.0,57.0,7.0,23.0,15.0,21.0,6.0,16.0,11.0,11.0,6.0,2.0,24.0,0.0,0.421053,0.304348,0.714286
1195,2021,1452,78.0,0.0,29.0,66.5,10.0,22.0,10.0,13.0,13.0,17.5,17.0,10.0,9.0,2.5,15.5,0.5,0.43609,0.454545,0.769231
1196,2021,1455,52.0,0.0,19.0,56.0,3.0,18.0,11.0,22.0,10.0,23.0,7.0,8.0,7.0,7.0,13.0,0.0,0.339286,0.166667,0.5
1197,2021,1457,63.0,0.0,21.0,58.0,7.0,22.0,14.0,19.0,9.0,25.0,13.0,10.0,0.0,0.0,23.0,0.0,0.362069,0.318182,0.736842
1198,2021,1458,74.0,0.0,28.0,58.0,10.5,24.0,7.5,9.5,7.0,24.5,14.0,9.5,2.5,6.5,14.5,0.5,0.482759,0.4375,0.789474


### Rankings

In [9]:
def process_rankings(df_in:pd.DataFrame) -> pd.DataFrame:
    df = df_in.copy()
    mask = df["RankingDayNum"] == df["RankingDayNum"].max()
    df = df[mask]
    df.drop(["SystemName", "RankingDayNum"], axis=1, inplace=True)
    df = df.groupby(["Season", "TeamID"]).agg("mean")
    return df.reset_index()

In [10]:
ProcessedMMasseyOrdinals = process_rankings(MMasseyOrdinals)
ProcessedMMasseyOrdinals.tail()

Unnamed: 0,Season,TeamID,OrdinalRank
6175,2021,1467,239.823529
6176,2021,1468,180.62
6177,2021,1469,315.142857
6178,2021,1470,254.367347
6179,2021,1471,255.62


### Seeds

In [11]:
def process_seeds(df_in: pd.DataFrame) -> pd.DataFrame:
    df = df_in.copy()
    mask = df["Season"] > 2002
    df = df[mask]
    df["Seed"] = df["Seed"].str.replace(r"\D+","")
    df["Seed"] = df["Seed"].astype(int)
    return df

In [12]:
ProcessedMNCAATourneySeeds = process_seeds(MNCAATourneySeeds)
ProcessedMNCAATourneySeeds.tail()

Unnamed: 0,Season,Seed,TeamID
2349,2021,12,1457
2350,2021,13,1317
2351,2021,14,1159
2352,2021,15,1331
2353,2021,16,1216


## Merge features

In [13]:
features = pd.merge(
    ProcessedMRegularSeasonDetailedResults,
    ProcessedMNCAATourneyDetailedResults,
    how="inner",
    on=["Season", "TeamID"],
    suffixes=("Reg", "Tou")
)

features = features.merge(
    ProcessedMMasseyOrdinals,
    how="inner",
    on=["Season", "TeamID"]
)

features = features.merge(
    ProcessedMNCAATourneySeeds,
    how="inner",
    on=["Season", "TeamID"]
)

features.tail()

Unnamed: 0,Season,TeamID,ScoreReg,NumOTReg,FGMReg,FGAReg,FGM3Reg,FGA3Reg,FTMReg,FTAReg,ORReg,DRReg,AstReg,TOReg,StlReg,BlkReg,PFReg,WinReg,FGPReg,FGP3Reg,FTPReg,ScoreTou,NumOTTou,FGMTou,FGATou,FGM3Tou,FGA3Tou,FTMTou,FTATou,ORTou,DRTou,AstTou,TOTou,StlTou,BlkTou,PFTou,WinTou,FGPTou,FGP3Tou,FTPTou,OrdinalRank,Seed
1194,2021,1439,72.142857,0.095238,25.428571,56.333333,8.238095,23.095238,13.047619,18.714286,7.857143,24.52381,14.380952,11.047619,4.761905,3.952381,2.285714,0.714286,0.451395,0.356701,0.697201,70.0,1.0,24.0,57.0,7.0,23.0,15.0,21.0,6.0,16.0,11.0,11.0,6.0,2.0,24.0,0.0,0.421053,0.304348,0.714286,47.388889,10
1195,2021,1452,77.296296,0.111111,26.333333,61.481481,7.259259,20.333333,17.37037,24.222222,11.888889,23.0,13.777778,11.62963,7.481481,2.851852,4.222222,0.666667,0.428313,0.357013,0.717125,78.0,0.0,29.0,66.5,10.0,22.0,10.0,13.0,13.0,17.5,17.0,10.0,9.0,2.5,15.5,0.5,0.43609,0.454545,0.769231,20.222222,3
1196,2021,1455,71.894737,0.105263,23.894737,59.0,8.368421,24.263158,15.736842,22.526316,9.0,23.473684,12.842105,10.578947,5.736842,3.421053,3.578947,0.736842,0.404996,0.344902,0.698598,52.0,0.0,19.0,56.0,3.0,18.0,11.0,22.0,10.0,23.0,7.0,8.0,7.0,7.0,13.0,0.0,0.339286,0.166667,0.5,57.307692,11
1197,2021,1457,79.541667,0.0,28.083333,60.958333,8.458333,23.958333,14.916667,21.75,11.208333,26.458333,15.041667,13.666667,7.75,2.375,3.5,0.958333,0.460697,0.353043,0.685824,63.0,0.0,21.0,58.0,7.0,22.0,14.0,19.0,9.0,25.0,13.0,10.0,0.0,0.0,23.0,0.0,0.362069,0.318182,0.736842,66.711538,12
1198,2021,1458,69.62069,0.068966,24.517241,58.275862,8.551724,23.758621,12.034483,15.689655,6.689655,24.034483,13.310345,8.586207,5.862069,3.793103,4.172414,0.586207,0.42071,0.359942,0.767033,74.0,0.0,28.0,58.0,10.5,24.0,7.5,9.5,7.0,24.5,14.0,9.5,2.5,6.5,14.5,0.5,0.482759,0.4375,0.789474,27.961538,9


## Build Dataset

In [14]:
def get_outcomes(df):
    input_rows = df.to_records()
    output_rows = [parse_row(input_row) for input_row in input_rows]
    out_df = pd.DataFrame(output_rows)
    return out_df

def parse_row(row):
    season = row['Season']
    winning_team_id = row['WTeamID']
    losing_team_id = row['LTeamID']
    if winning_team_id < losing_team_id:
        small_id = winning_team_id
        big_id = losing_team_id
        outcome = 1
    elif losing_team_id < winning_team_id:
        small_id = losing_team_id
        big_id = winning_team_id
        outcome = 0
    record = {
        "ID": f"{season}_{small_id}_{big_id}",
        'Season': season,
        'LowID': small_id,
        'HighID': big_id,
        'Win': outcome
    }
    return record

In [15]:
outcomes = get_outcomes(MNCAATourneyDetailedResults)
outcomes.tail()

Unnamed: 0,ID,Season,LowID,HighID,Win
1176,2021_1211_1425,2021,1211,1425,1
1177,2021_1276_1417,2021,1276,1417,0
1178,2021_1124_1222,2021,1124,1222,1
1179,2021_1211_1417,2021,1211,1417,1
1180,2021_1124_1211,2021,1124,1211,1


In [16]:
data = pd.merge(
    outcomes, 
    features, 
    how="inner", 
    left_on=["Season", "HighID"], 
    right_on=["Season", "TeamID"]
)
data = pd.merge(
    data, 
    features, 
    how="inner", 
    left_on=["Season", "LowID"], 
    right_on=["Season", "TeamID"],
    suffixes=("High", "Low")
)
data.drop(
    ["Season", "HighID", "LowID","TeamIDHigh","TeamIDLow"], 
    axis=1, 
    inplace=True
)
data.set_index("ID", inplace=True)
data.tail()

Unnamed: 0_level_0,Win,ScoreRegHigh,NumOTRegHigh,FGMRegHigh,FGARegHigh,FGM3RegHigh,FGA3RegHigh,FTMRegHigh,FTARegHigh,ORRegHigh,DRRegHigh,AstRegHigh,TORegHigh,StlRegHigh,BlkRegHigh,PFRegHigh,WinRegHigh,FGPRegHigh,FGP3RegHigh,FTPRegHigh,ScoreTouHigh,NumOTTouHigh,FGMTouHigh,FGATouHigh,FGM3TouHigh,FGA3TouHigh,FTMTouHigh,FTATouHigh,ORTouHigh,DRTouHigh,AstTouHigh,TOTouHigh,StlTouHigh,BlkTouHigh,PFTouHigh,WinTouHigh,FGPTouHigh,FGP3TouHigh,FTPTouHigh,OrdinalRankHigh,SeedHigh,ScoreRegLow,NumOTRegLow,FGMRegLow,FGARegLow,FGM3RegLow,FGA3RegLow,FTMRegLow,FTARegLow,ORRegLow,DRRegLow,AstRegLow,TORegLow,StlRegLow,BlkRegLow,PFRegLow,WinRegLow,FGPRegLow,FGP3RegLow,FTPRegLow,ScoreTouLow,NumOTTouLow,FGMTouLow,FGATouLow,FGM3TouLow,FGA3TouLow,FTMTouLow,FTATouLow,ORTouLow,DRTouLow,AstTouLow,TOTouLow,StlTouLow,BlkTouLow,PFTouLow,WinTouLow,FGPTouLow,FGP3TouLow,FTPTouLow,OrdinalRankLow,SeedLow
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1
2021_1242_1425,0,74.758621,0.137931,27.172414,58.172414,6.310345,18.137931,14.103448,21.793103,10.482759,25.827586,13.655172,12.103448,4.793103,5.241379,4.482759,0.758621,0.467101,0.347909,0.647152,76.25,0.0,29.0,57.5,7.75,17.0,10.5,15.75,7.5,24.5,15.5,10.75,4.5,3.75,12.0,0.75,0.504348,0.455882,0.666667,16.796296,6,72.518519,0.037037,26.111111,59.740741,7.259259,21.481481,13.037037,18.222222,10.074074,24.925926,13.62963,11.962963,6.703704,4.037037,3.518519,0.703704,0.437074,0.337931,0.715447,72.0,0.0,26.0,66.5,9.0,27.5,11.0,14.0,8.5,20.5,16.0,6.0,7.5,2.5,16.5,0.5,0.390977,0.327273,0.785714,16.092593,3
2021_1332_1425,0,74.758621,0.137931,27.172414,58.172414,6.310345,18.137931,14.103448,21.793103,10.482759,25.827586,13.655172,12.103448,4.793103,5.241379,4.482759,0.758621,0.467101,0.347909,0.647152,76.25,0.0,29.0,57.5,7.75,17.0,10.5,15.75,7.5,24.5,15.5,10.75,4.5,3.75,12.0,0.75,0.504348,0.455882,0.666667,16.796296,6,74.384615,0.0,27.5,58.230769,8.384615,22.115385,11.0,15.615385,8.192308,22.5,13.307692,11.0,7.461538,3.615385,5.192308,0.769231,0.472259,0.37913,0.704433,81.5,0.0,32.0,68.5,8.0,23.0,9.5,11.5,11.5,18.0,20.0,10.5,7.0,4.5,13.5,0.5,0.467153,0.347826,0.826087,28.735849,7
2021_1329_1333,0,69.107143,0.071429,23.857143,56.071429,6.892857,19.892857,14.5,19.071429,9.214286,21.535714,14.178571,11.0,5.821429,3.178571,5.535714,0.571429,0.425478,0.346499,0.7603,69.0,0.0,22.75,51.25,6.75,17.5,16.75,21.25,8.25,28.25,13.5,11.75,3.75,5.0,18.25,0.75,0.443902,0.385714,0.788235,76.788462,12,77.071429,0.178571,27.678571,59.142857,6.392857,18.892857,15.321429,21.5,9.285714,25.928571,13.178571,15.642857,7.285714,4.357143,5.142857,0.714286,0.467995,0.338374,0.712625,69.5,0.0,20.5,59.5,5.5,20.5,23.0,33.5,10.5,20.5,6.5,10.0,9.5,4.5,23.0,0.5,0.344538,0.268293,0.686567,21.759259,4
2021_1260_1333,0,69.107143,0.071429,23.857143,56.071429,6.892857,19.892857,14.5,19.071429,9.214286,21.535714,14.178571,11.0,5.821429,3.178571,5.535714,0.571429,0.425478,0.346499,0.7603,69.0,0.0,22.75,51.25,6.75,17.5,16.75,21.25,8.25,28.25,13.5,11.75,3.75,5.0,18.25,0.75,0.443902,0.385714,0.788235,76.788462,12,70.538462,0.076923,25.961538,52.192308,6.884615,19.307692,11.730769,16.0,6.423077,23.692308,15.538462,11.384615,7.115385,2.307692,4.538462,0.846154,0.497421,0.356574,0.733173,66.666667,0.0,23.0,52.666667,6.666667,20.0,14.0,20.333333,9.666667,19.666667,16.0,9.0,7.0,2.666667,15.0,0.666667,0.436709,0.333333,0.688525,23.851852,8
2021_1234_1332,0,74.384615,0.0,27.5,58.230769,8.384615,22.115385,11.0,15.615385,8.192308,22.5,13.307692,11.0,7.461538,3.615385,5.192308,0.769231,0.472259,0.37913,0.704433,81.5,0.0,32.0,68.5,8.0,23.0,9.5,11.5,11.5,18.0,20.0,10.5,7.0,4.5,13.5,0.5,0.467153,0.347826,0.826087,28.735849,7,83.758621,0.034483,29.758621,63.448276,9.758621,25.275862,14.482759,20.310345,10.172414,27.827586,19.0,9.137931,5.586207,4.517241,3.724138,0.724138,0.469022,0.386085,0.713073,83.0,0.0,30.0,61.0,9.0,23.5,14.0,19.0,9.0,21.5,19.5,8.5,5.5,5.0,14.0,0.5,0.491803,0.382979,0.736842,8.055556,2


## Train Test Split 

In [17]:
# For splitting data
from sklearn.model_selection import train_test_split

# Create train, validate, and test sets.
X = data.copy().drop("Win", axis=1)
y = data["Win"]
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

X_train.shape

(885, 80)

### Partition Features 
Here we partition our features into three groups:
 - Numerical
 - Categorical High Carnality
 - Categorical Low Cardnality

In [18]:
def partition_features(df: pd.DataFrame, cardnality_threshold: int) -> Tuple[pd.DataFrame]:
    cat_cols = [name for name, data_type in df.dtypes.items() if data_type == object]
    num_cols = list(set(df.columns) - set(cat_cols))
    col_cardnality = {col_name: X_train[col_name].nunique() for col_name in cat_cols}
    cat_cols_high = []
    cat_cols_low = []
    for name, cardnality in col_cardnality.items():
        if cardnality > cardnality_threshold:
            cat_cols_high.append(name)
        else:
            cat_cols_low.append(name)
    return num_cols, cat_cols_high, cat_cols_low

In [19]:
num_cols, cat_cols_high, cat_cols_low = partition_features(X_train, 10)
print(num_cols)
print(cat_cols_high)
print(cat_cols_low)

['FGM3RegLow', 'WinTouLow', 'NumOTRegLow', 'BlkTouLow', 'BlkRegHigh', 'ORRegLow', 'FTMRegHigh', 'PFTouHigh', 'FGMRegHigh', 'FTMTouLow', 'NumOTTouHigh', 'FTPTouLow', 'FGP3TouLow', 'BlkTouHigh', 'SeedHigh', 'ORRegHigh', 'DRTouLow', 'AstRegLow', 'PFRegHigh', 'FGPRegHigh', 'FGP3TouHigh', 'FGATouLow', 'NumOTRegHigh', 'FGM3RegHigh', 'TOTouLow', 'AstTouHigh', 'FGPTouHigh', 'AstRegHigh', 'TORegHigh', 'FGP3RegHigh', 'BlkRegLow', 'DRRegLow', 'FGMRegLow', 'FTMTouHigh', 'NumOTTouLow', 'StlTouHigh', 'FGA3RegLow', 'StlTouLow', 'SeedLow', 'FGARegHigh', 'FGM3TouHigh', 'ScoreRegLow', 'FTPRegHigh', 'FTMRegLow', 'FTATouHigh', 'FGM3TouLow', 'ScoreRegHigh', 'WinRegLow', 'ScoreTouHigh', 'FGA3TouLow', 'OrdinalRankHigh', 'ScoreTouLow', 'OrdinalRankLow', 'FGMTouLow', 'StlRegLow', 'FTARegHigh', 'StlRegHigh', 'FTPRegLow', 'FTARegLow', 'TORegLow', 'FGARegLow', 'WinTouHigh', 'PFRegLow', 'DRRegHigh', 'FGA3TouHigh', 'ORTouLow', 'FGPTouLow', 'FGA3RegHigh', 'FGATouHigh', 'FGMTouHigh', 'AstTouLow', 'DRTouHigh', 'PFTouL

### Data preperation pipelines

In [20]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy="median")

# Preprocessing for high cardnality categorical data
cat_high_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OrdinalEncoder(handle_unknown="ignore"))
])

# Preprocessing for low cardnality categorical data
cat_low_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_transformer, num_cols),
        ("cat_high", cat_high_transformer, cat_cols_high),
        ("cat_low", cat_low_transformer, cat_cols_low)
    ])

### Preprocess Data

In [21]:
X_train_processed = preprocessor.fit_transform(X_train)
X_valid_processed = preprocessor.transform(X_valid)

# Step 4: Train a model

Now that the data is prepared, the next step is to train a model.  

If you took the **[Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning)** courses, then you learned about **[Random Forests](https://www.kaggle.com/dansbecker/random-forests)**.  In the code cell below, we fit a random forest model to the data.

### Setup Hyperparameter Tuning
See https://www.kaggle.com/prashant111/a-guide-on-xgboost-hyperparameters-tuning

In [22]:
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from sklearn.metrics import mean_squared_error, log_loss
from xgboost import XGBClassifier
import numpy as np
import warnings
warnings.filterwarnings("ignore",category=Warning)

In [23]:
space={
    "n_estimators": hp.quniform("n_estimators", 1000, 2000, 250),
    "learning_rate": hp.uniform("learning_rate", 0.1, 0.2),
    "max_depth": hp.quniform("max_depth", 2, 8, 1),
    "reg_lambda": hp.uniform("reg_lambda", 50, 150),
    "colsample_bytree": hp.uniform("colsample_bytree", 0.5, 1),
    "tree_method": "gpu_hist",
    "random_state": 42
}

def objective(space):
    regressor=XGBClassifier(
        n_estimators = int(space["n_estimators"]),
        learning_rate = space["learning_rate"],
        max_depth = int(space["max_depth"]),
        reg_lambda = space["reg_lambda"],
        colsample_bytree = space["colsample_bytree"],
        tree_method = "gpu_hist",
        random_state = 42,
    )
    evaluation = [(X_valid_processed, y_valid)]
    regressor.fit(
        X_train_processed, y_train,
        eval_set=evaluation,
        eval_metric="logloss",
        early_stopping_rounds=10,
        verbose=False
    )
    preds = regressor.predict(X_valid_processed)
    score = log_loss(y_valid, preds)
    return {"loss": score, "status": STATUS_OK }

In [24]:
trials = Trials()

best_hyperparams = fmin(fn = objective,
                        space = space,
                        algo = tpe.suggest,
                        max_evals = 100,
                        trials = trials)

100%|██████████| 100/100 [02:31<00:00,  1.51s/trial, best loss: 0.11668505538821274]


In [25]:
# Define model
model = XGBClassifier(
    n_estimators = int(best_hyperparams["n_estimators"]),
    max_depth = int(best_hyperparams["max_depth"]),
    learning_rate = best_hyperparams["learning_rate"],
    colsample_bytree = best_hyperparams["colsample_bytree"],
    reg_lambda = best_hyperparams["reg_lambda"], 
    tree_method = "gpu_hist",
    random_state = 42
)
model.fit(X_train_processed, y_train,
          early_stopping_rounds=10, 
          eval_set=[(X_valid_processed, y_valid)],
          verbose=False)
preds = model.predict(X_valid_processed)
print("RMSE:", mean_squared_error(y_valid, preds, squared=False))

RMSE: 0.05812381937190964


In the code cell above, we set `squared=False` to get the root mean squared error (RMSE) on the validation data.

# Step 5: Submit to the competition

We"ll begin by using the trained model to generate predictions, which we"ll save to a CSV file.

In [26]:
# Use the model to generate predictions

predictions = model.predict(X)

# Save the predictions to a CSV file
output = pd.DataFrame({"ID": X.index,
                       "Pred": predictions})
output.to_csv("submission.csv", index=False)

Once you have run the code cell above, follow the instructions below to submit to the competition:
1. Begin by clicking on the **Save Version** button in the top right corner of the window.  This will generate a pop-up window.  
2. Ensure that the **Save and Run All** option is selected, and then click on the **Save** button.
3. This generates a window in the bottom left corner of the notebook.  After it has finished running, click on the number to the right of the **Save Version** button.  This pulls up a list of versions on the right of the screen.  Click on the ellipsis **(...)** to the right of the most recent version, and select **Open in Viewer**.  This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
4. Click on the **Output** tab on the right of the screen.  Then, click on the file you would like to submit, and click on the **Submit** button to submit your results to the leaderboard.

You have now successfully submitted to the competition!

If you want to keep working to improve your performance, select the **Edit** button in the top right of the screen. Then you can change your code and repeat the process. There"s a lot of room to improve, and you will climb up the leaderboard as you work.

# Step 6: Keep Learning!

If you"re not sure what to do next, you can begin by trying out more model types!
1. If you took the **[Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)** course, then you learned about **[XGBoost](https://www.kaggle.com/alexisbcook/xgboost)**.  Try training a model with XGBoost, to improve over the performance you got here.

2. Take the time to learn about **Light GBM (LGBM)**, which is similar to XGBoost, since they both use gradient boosting to iteratively add decision trees to an ensemble.  In case you"re not sure how to get started, **[here"s a notebook](https://www.kaggle.com/svyatoslavsokolov/tps-feb-2021-lgbm-simple-version)** that trains a model on a similar dataset.