**What we need to predict**

**Stage 1** - We submit predicted probabilities for every possible matchup in the past 5 NCAA® tournaments (seasons 2015-2019, as listed in the sample submission description further down).

**Stage 2** - We submit predicted probabilities for every possible matchup before the 2022 tournament begins.



In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# **BASICS**

## **MTeams.csv**

This file identifies the different college teams present in the dataset. Each school is uniquely identified by a 4-digit id number. You will not see games present for all teams in all seasons, because the games listing is only for matchups where both teams are Division-I teams. There are 357 teams currently in Division-I, and an overall total of 371 teams in our team listing (each year, some teams might start being Division-I programs, and others might stop being Division-I programs). This year there are four teams that are new to Division I: Bellarmine (TeamID=1468), North Alabama (TeamID=1469), Tarleton State (TeamID=1470), and UC_San Diego (TeamID=1471), so you will not see any historical data for these teams prior to the current season. In addition, some teams opted not to play during the 2021 season due to the impact of COVID-19 and will not have any games listed.

* **TeamID** - a 4 digit id number, from 1000-1999, uniquely identifying each NCAA® men's team. A school's TeamID does not change from one year to the next, so for instance the Duke men's TeamID is 1181 for all seasons. To avoid possible confusion between the men's data and the women's data, all of the men's team ID's range from 1000-1999, whereas all of the women's team ID's range from 3000-3999.

* **TeamName** - a compact spelling of the team's college name, 16 characters or fewer. There are no commas or double-quotes in the team names, but you will see some characters that are not letters or spaces, e.g., Texas A&M, St Mary's CA, TAM C. Christi, and Bethune-Cookman.

* **FirstD1Season** - the first season in our dataset that the school was a Division-I school. For instance, FL Gulf Coast (famously) was not a Division-I school until the 2008 season, despite their two wins just five years later in the 2013 NCAA® tourney. Of course, many schools were Division-I far earlier than 1985, but since we don't have any data included prior to 1985, all such teams are listed with a FirstD1Season of 1985.

* **LastD1Season** - the last season in our dataset that the school was a Division-I school. For any teams that are currently Division-I, they will be listed with LastD1Season=2021, and you can confirm there are 357 such teams.
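As a rough sketch of how the `FirstD1Season` / `LastD1Season` columns can be used, the snippet below counts the teams that are Division-I in the current season and lists the teams that were Division-I in an arbitrary season. The frame `teams_demo` and the helpers `count_current_d1` and `d1_in_season` are made up for illustration; with the real data you would pass the `teams` frame loaded from MTeams.csv.

```python
import pandas as pd

# Hypothetical miniature of MTeams.csv (IDs, names and seasons are invented).
teams_demo = pd.DataFrame({
    "TeamID": [1001, 1002, 1003, 1004],
    "TeamName": ["Team A", "Team B", "Team C", "Team D"],
    "FirstD1Season": [1985, 1985, 2008, 2021],
    "LastD1Season": [2021, 1998, 2021, 2021],
})


def count_current_d1(teams: pd.DataFrame, current_season: int) -> int:
    """Count teams whose LastD1Season equals the current season."""
    return int((teams["LastD1Season"] == current_season).sum())


def d1_in_season(teams: pd.DataFrame, season: int) -> list:
    """Return TeamIDs that were Division-I during the given season."""
    mask = teams["FirstD1Season"].le(season) & teams["LastD1Season"].ge(season)
    return teams.loc[mask, "TeamID"].tolist()


print(count_current_d1(teams_demo, 2021))
print(d1_in_season(teams_demo, 1990))
```

On the real file, `count_current_d1(teams, 2021)` is the check the description refers to when it says you can confirm the number of current Division-I teams.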

In [None]:
teams = pd.read_csv("/kaggle/input/mens-march-mania-2022/MDataFiles_Stage1/MTeams.csv")
teams.head(6)

In [None]:
teams.shape

In [None]:
print('Unique Team IDs: {}'.format(teams.TeamID.nunique()))
print('Unique TeamName: {}'.format(teams.TeamName.nunique()))

In [None]:
fig = plt.figure(figsize=(5, 10))
sns.countplot(data=teams, y='FirstD1Season');
plt.show()

In [None]:
fig = plt.figure(figsize=(5, 10))
sns.countplot(data=teams, y='LastD1Season');
plt.show()

## **MSeasons.csv**

This file identifies the different seasons included in the historical data, along with certain season-level properties.

* **Season** - indicates the year in which the tournament was played. Remember that the current season counts as 2021.

* **DayZero** - tells you the date corresponding to DayNum=0 during that season. All game dates have been aligned upon a common scale so that (each year) the Monday championship game of the men's tournament is on DayNum=154. Working backward, the national semifinals are always on DayNum=152, the "play-in" games are on days 134 and 135, Selection Sunday is on day 132, the final day of the regular season is also day 132, and so on. All game data includes the day number in order to make it easier to perform date calculations. If you need to know the exact date a game was played on, you can combine the game's "DayNum" with the season's "DayZero". For instance, since day zero during the 2011-2012 season was 10/31/2011, if we know that the earliest regular season games that year were played on DayNum=7, they were therefore played on 11/07/2011.

* **RegionW, RegionX, RegionY, RegionZ** - by our contests' convention, each of the four regions in the final tournament is assigned a letter of W, X, Y, or Z. Whichever region's name comes first alphabetically, that region will be Region W. And whichever region plays against Region W in the national semifinals, that will be Region X. For the other two regions, whichever region's name comes first alphabetically, that region will be Region Y, and the other will be Region Z. This allows us to identify the regions and brackets in a standardized way in other files, even if the region names change from year to year. For instance, during the 2012 tournament, the four regions were East, Midwest, South, and West. Being the first alphabetically, East becomes W. Since the East regional champion (Ohio State) played against the Midwest regional champion (Kansas) in the national semifinals, that makes Midwest Region X. For the other two (South and West), since South comes first alphabetically, that makes South Y and therefore West is Z. So for that season, the W/X/Y/Z are East, Midwest, South, West. And so for instance, Ohio State, the #2 seed in the East, is listed in the MNCAATourneySeeds file that year with a seed of W02, meaning they were the #2 seed in the W region (the East region). We will not know the final W/X/Y/Z designations until Selection Sunday, because the national semifinal pairings in the Final Four will depend upon the overall ranks of the four #1 seeds.
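The two rules described above (DayZero plus DayNum gives the calendar date, and the W/X/Y/Z lettering of regions) can be sketched in a few lines. This is only an illustration: the helper names `game_date` and `region_letters` are made up, and the `MM/DD/YYYY` date format is an assumption based on the dates quoted in the description.

```python
import pandas as pd


def game_date(day_zero: str, day_num: int) -> pd.Timestamp:
    """Calendar date of a game: the season's DayZero plus DayNum days.

    Assumes DayZero is written as MM/DD/YYYY, as in the examples above.
    """
    return pd.to_datetime(day_zero, format="%m/%d/%Y") + pd.Timedelta(days=day_num)


def region_letters(regions, semifinal_opponent):
    """Assign W/X/Y/Z following the convention described above.

    regions: the four region names.
    semifinal_opponent: dict mapping each region to the region it meets
    in the national semifinals.
    """
    w = min(regions)                       # first alphabetically
    x = semifinal_opponent[w]              # plays Region W in the semifinals
    y, z = sorted(r for r in regions if r not in (w, x))
    return {"W": w, "X": x, "Y": y, "Z": z}


# Worked examples taken from the description.
print(game_date("10/31/2011", 7).date())
pairs_2012 = {"East": "Midwest", "Midwest": "East", "South": "West", "West": "South"}
print(region_letters(["East", "Midwest", "South", "West"], pairs_2012))
```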

In [None]:
seasons = pd.read_csv("/kaggle/input/mens-march-mania-2022/MDataFiles_Stage1/MSeasons.csv")
seasons.head(6)

In [None]:
seasons.shape

In [None]:
print('Total Seasons: {}'.format(seasons.Season.nunique()))

In [None]:
fig = plt.figure(figsize=(10, 5))
sns.countplot(data=seasons, x='RegionW');
plt.show()

In [None]:
fig = plt.figure(figsize=(10, 5))
sns.countplot(data=seasons, x='RegionX');
plt.show()

In [None]:
fig = plt.figure(figsize=(10, 5))
sns.countplot(data=seasons, x='RegionY');
plt.show()

In [None]:
fig = plt.figure(figsize=(10, 5))
sns.countplot(data=seasons, x='RegionZ');
plt.show()

In [None]:
seasons["Day_of_year"] = pd.to_datetime(seasons["DayZero"]).dt.dayofyear  # day of the year on which DayNum=0 falls
fig = plt.figure(figsize=(10, 5))
sns.lineplot(x=seasons.Season, y=seasons.Day_of_year);
plt.show()

## **MNCAATourneySeeds.csv**

This file identifies the seeds for all teams in each NCAA® tournament, for all seasons of historical data. Thus, there are between 64 and 68 rows for each year, depending on whether there were any play-in games and how many there were. In recent years the structure has settled at 68 total teams, with four "play-in" games leading to the final field of 64 teams entering Round 1 on Thursday of the first week (by definition, that is DayNum=136 each season). We will not know the seeds of the respective tournament teams, or even exactly which 68 teams it will be, until Selection Sunday on March 14, 2021 (DayNum=132).

* **Season** - the year that the tournament was played in

* **Seed** - this is a 3- or 4-character identifier of the seed, where the first character is either W, X, Y, or Z (identifying the region the team was in) and the next two digits (either 01, 02, ..., 15, or 16) tell you the seed within the region. For play-in teams, there is a fourth character (a or b) to further distinguish the seeds, since teams that face each other in the play-in games will have seeds with the same first three characters. The "a" and "b" are assigned based on which Team ID is lower numerically. As an example of the format of the seed, the first record in the file is seed W01 from 1985, which means we are looking at the #1 seed in the W region (which we can see from the "MSeasons.csv" file was the East region).

* **TeamID** - this identifies the id number of the team, as specified in the MTeams.csv file
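Since the seed string packs three pieces of information together, it is usually handy to split it. The following is a minimal sketch, assuming the format is exactly as described above (one region letter, two digits, optional `a`/`b`); `parse_seed` is a hypothetical helper name.

```python
import re

# Region letter, two-digit seed number, optional play-in suffix.
SEED_PATTERN = re.compile(r"^([WXYZ])(\d{2})([ab]?)$")


def parse_seed(seed: str):
    """Split a seed such as 'W01' or 'X16a' into (region, number, play_in)."""
    match = SEED_PATTERN.match(seed)
    if match is None:
        raise ValueError(f"Unexpected seed format: {seed!r}")
    region, number, play_in = match.groups()
    # play_in is None for teams that did not go through a play-in game.
    return region, int(number), play_in or None


print(parse_seed("W01"))
print(parse_seed("X16a"))
```

With the real frame, something like `MNCAATourneySeeds["Seed"].str[1:3].astype(int)` would give the numeric seed as a column.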

In [None]:
MNCAATourneySeeds = pd.read_csv("/kaggle/input/mens-march-mania-2022/MDataFiles_Stage1/MNCAATourneySeeds.csv")
MNCAATourneySeeds.head(6)

In [None]:
MNCAATourneySeeds.shape

In [None]:
print("Total Seasons: {}".format(MNCAATourneySeeds.Season.nunique()))
print("Seeds Count: {}".format(MNCAATourneySeeds.Seed.nunique()))
print("Unique TeamID: {}".format(MNCAATourneySeeds.TeamID.nunique()))

In [None]:
fig = plt.figure(figsize=(5, 10))
sns.countplot(data=MNCAATourneySeeds, y='Season');
plt.show()

In [None]:
fig = plt.figure(figsize=(5, 20))
sns.countplot(data=MNCAATourneySeeds, y='Seed');
plt.show()

## **MRegularSeasonCompactResults.csv**

This file identifies the game-by-game results for many seasons of historical data, starting with the 1985 season (the first year the NCAA® had a 64-team tournament). For each season, the file includes all games played from DayNum 0 through 132. It is important to realize that the "Regular Season" games are simply defined to be all games played on DayNum=132 or earlier (DayNum=132 is Selection Sunday, and there are always a few conference tournament finals actually played early in the day on Selection Sunday itself). Thus a game played on or before Selection Sunday will show up here whether it was a pre-season tournament, a non-conference game, a regular conference game, a conference tournament game, or whatever.

* **Season** - this is the year of the associated entry in MSeasons.csv (the year in which the final tournament occurs). For example, during the 2016 season, there were regular season games played between November 2015 and March 2016, and all of those games will show up with a Season of 2016.

* **DayNum** - this integer always ranges from 0 to 132, and tells you what day the game was played on. It represents an offset from the "DayZero" date in the "MSeasons.csv" file. For example, the first game in the file was DayNum=20. Combined with the fact from the "MSeasons.csv" file that day zero was 10/29/1984 that year, this means the first game was played 20 days later, or 11/18/1984. There are no teams that ever played more than one game on a given date, so you can use this fact if you need a unique key (combining Season and DayNum and WTeamID). In order to accomplish this uniqueness, we had to adjust one game's date. In March 2008, the SEC postseason tournament had to reschedule one game (Georgia-Kentucky) to a subsequent day because of a tornado, so Georgia had to actually play two games on the same day. In order to enforce this uniqueness, we moved the game date for the Georgia-Kentucky game back to its original scheduled date.

* **WTeamID** - this identifies the id number of the team that won the game, as listed in the "MTeams.csv" file. No matter whether the game was won by the home team or visiting team, or if it was a neutral-site game, the "WTeamID" always identifies the winning team.

* **WScore** - this identifies the number of points scored by the winning team.

* **LTeamID** - this identifies the id number of the team that lost the game.

* **LScore** - this identifies the number of points scored by the losing team. Thus you can be confident that WScore will be greater than LScore for all games listed.

* **WLoc** - this identifies the "location" of the winning team. If the winning team was the home team, this value will be "H". If the winning team was the visiting team, this value will be "A". If it was played on a neutral court, then this value will be "N". Sometimes it is unclear whether the site should be considered neutral, since it is near one team's home court, or even on their court during a tournament, but for this determination we have simply used the Kenneth Massey data in its current state, where the "@" sign is either listed with the winning team, the losing team, or neither team. If you would like to investigate this factor more closely, we invite you to explore Data Section 3, which provides the city that each game was played in, irrespective of whether it was considered to be a neutral site.

* **NumOT** - this indicates the number of overtime periods in the game, an integer 0 or higher.
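Because each game is stored once, from the winner's point of view, per-team statistics are often easier to compute after reshaping to one row per team per game. The sketch below does that on a tiny made-up frame and then checks the uniqueness property mentioned above (no team plays twice on the same day). `games_demo` and `to_team_rows` are illustrative names, and the scores are invented.

```python
import pandas as pd

# Hypothetical miniature of a Compact Results file (values are invented).
games_demo = pd.DataFrame({
    "Season": [2016, 2016, 2016],
    "DayNum": [10, 10, 12],
    "WTeamID": [1181, 1314, 1112],
    "WScore": [80, 75, 68],
    "LTeamID": [1112, 1101, 1314],
    "LScore": [70, 60, 66],
})


def to_team_rows(games: pd.DataFrame) -> pd.DataFrame:
    """Return two rows per game: one for the winner, one for the loser."""
    cols = ["Season", "DayNum", "TeamID", "OppID", "Score", "OppScore", "Win"]
    winners = games.rename(columns={
        "WTeamID": "TeamID", "LTeamID": "OppID",
        "WScore": "Score", "LScore": "OppScore",
    }).assign(Win=1)
    losers = games.rename(columns={
        "LTeamID": "TeamID", "WTeamID": "OppID",
        "LScore": "Score", "WScore": "OppScore",
    }).assign(Win=0)
    return pd.concat([winners[cols], losers[cols]], ignore_index=True)


team_rows = to_team_rows(games_demo)
# (Season, DayNum, TeamID) should be unique if no team plays twice in a day.
key_is_unique = not team_rows.duplicated(["Season", "DayNum", "TeamID"]).any()
print(team_rows)
print(key_is_unique)
```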

In [None]:
MRegularSeasonCompactResults = pd.read_csv("/kaggle/input/mens-march-mania-2022/MDataFiles_Stage1/MRegularSeasonCompactResults.csv")
MRegularSeasonCompactResults.head(6)

In [None]:
MRegularSeasonCompactResults.shape

In [None]:
print("Total Seasons: {}".format(MRegularSeasonCompactResults.Season.nunique()))
MRegularSeasonCompactResults.Season.value_counts().sort_index()

In [None]:
fig = plt.figure(figsize=(5, 5))
sns.countplot(data=MRegularSeasonCompactResults, x='WLoc');
plt.show()

In [None]:
fig = plt.figure(figsize=(5, 5))
sns.countplot(data=MRegularSeasonCompactResults, x='NumOT');
plt.show()

In [None]:
fig = plt.figure(figsize=(5, 10))
sns.countplot(data=MRegularSeasonCompactResults, y='Season');
plt.show()

In [None]:
fig = plt.figure(figsize=(5, 60))
sns.countplot(data=MRegularSeasonCompactResults, y='WTeamID');
plt.show()

In [None]:
fig = plt.figure(figsize=(10, 5))
# Number of regular season wins per team across all seasons
sns.histplot(MRegularSeasonCompactResults.groupby("WTeamID").size().values, bins=25);
plt.title("Distribution of total number of wins per team")
plt.show();

In [None]:
fig = plt.figure(figsize=(10, 5))
# Number of regular season losses per team across all seasons
sns.histplot(MRegularSeasonCompactResults.groupby("LTeamID").size().values, bins=25);
plt.title("Distribution of total number of losses per team")
plt.show();

In [None]:
fig = plt.figure(figsize=(10, 5))
sns.histplot(data=MRegularSeasonCompactResults, x='WScore');
plt.show();

In [None]:
fig = plt.figure(figsize=(10, 5))
sns.histplot(data=MRegularSeasonCompactResults, x='LScore');
plt.show();

In [None]:
MRegularSeasonCompactResults["score_difference"] = \
    MRegularSeasonCompactResults["WScore"] -\
    MRegularSeasonCompactResults["LScore"]

fig = plt.figure(figsize=(10, 5))
sns.histplot(data=MRegularSeasonCompactResults, x='score_difference');
plt.show();

In [None]:
fig = plt.figure(figsize=(10, 5))
sns.histplot(data=MRegularSeasonCompactResults, x='DayNum');
plt.show();

## **MNCAATourneyCompactResults.csv**

This file identifies the game-by-game NCAA® tournament results for all seasons of historical data. The data is formatted exactly like the MRegularSeasonCompactResults data. All games will show up as neutral site (so WLoc is always N). Note that this tournament game data also includes the play-in games (which always occurred on day 134/135) for those years that had play-in games. Thus each season you will see between 63 and 67 games listed, depending on how many play-in games there were.

Because of the consistent structure of the NCAA® tournament schedule, you can actually tell what round a game was, depending on the exact DayNum. Thus:

* **DayNum=134 or 135 (Tue/Wed)** - play-in games to get the tournament field down to the final 64 teams
* **DayNum=136 or 137 (Thu/Fri)** - Round 1, to bring the tournament field from 64 teams to 32 teams
* **DayNum=138 or 139 (Sat/Sun)** - Round 2, to bring the tournament field from 32 teams to 16 teams
* **DayNum=143 or 144 (Thu/Fri)** - Round 3, otherwise known as "Sweet Sixteen", to bring the tournament field from 16 teams to 8 teams
* **DayNum=145 or 146 (Sat/Sun)** - Round 4, otherwise known as "Elite Eight" or "regional finals", to bring the tournament field from 8 teams to 4 teams
* **DayNum=152 (Sat)** - Round 5, otherwise known as "Final Four" or "national semifinals", to bring the tournament field from 4 teams to 2 teams
* **DayNum=154 (Mon)** - Round 6, otherwise known as "national final" or "national championship", to bring the tournament field from 2 teams to 1 champion team
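The DayNum-to-round table above translates directly into a lookup. This is a minimal sketch: `ROUND_BY_DAYNUM` and `tourney_round` are hypothetical names, round 0 is used here to mean the play-in games, and any season whose schedule deviates from the table simply yields `None`.

```python
# DayNum -> round number, following the table above (0 = play-in games).
ROUND_BY_DAYNUM = {
    134: 0, 135: 0,
    136: 1, 137: 1,
    138: 2, 139: 2,
    143: 3, 144: 3,
    145: 4, 146: 4,
    152: 5,
    154: 6,
}


def tourney_round(day_num: int):
    """Return the tournament round for a DayNum, or None if it is not listed."""
    return ROUND_BY_DAYNUM.get(day_num)


print(tourney_round(136))
print(tourney_round(154))
```

On the real frame this could be applied with `MNCAATourneyCompactResults["DayNum"].map(ROUND_BY_DAYNUM)`.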

In [None]:
MNCAATourneyCompactResults = pd.read_csv("/kaggle/input/mens-march-mania-2022/MDataFiles_Stage1/MNCAATourneyCompactResults.csv")
MNCAATourneyCompactResults.head(6)

In [None]:
MNCAATourneyCompactResults.shape

In [None]:
print("Seasons Count: {}".format(MNCAATourneyCompactResults.Season.nunique()))
MNCAATourneyCompactResults.Season.value_counts().sort_index()

In [None]:
fig = plt.figure(figsize=(10, 5))
sns.countplot(data=MNCAATourneyCompactResults, x='DayNum');
plt.show()

In [None]:
fig = plt.figure(figsize=(10, 5))
# Number of tournament wins per team across all seasons
sns.histplot(MNCAATourneyCompactResults.groupby("WTeamID").size().values, bins=25);
plt.title("Distribution of total number of wins per team")
plt.show();

In [None]:
fig = plt.figure(figsize=(10, 5))
# Number of tournament losses per team across all seasons
sns.histplot(MNCAATourneyCompactResults.groupby("LTeamID").size().values, bins=25);
plt.title("Distribution of total number of losses per team")
plt.show();

In [None]:
fig = plt.figure(figsize=(10, 5))
sns.histplot(data=MNCAATourneyCompactResults, x='WScore');
plt.show();

In [None]:
fig = plt.figure(figsize=(10, 5))
sns.histplot(data=MNCAATourneyCompactResults, x='LScore');
plt.show();

In [None]:
MNCAATourneyCompactResults["score_difference"] = \
    MNCAATourneyCompactResults["WScore"] -\
    MNCAATourneyCompactResults["LScore"]

fig = plt.figure(figsize=(10, 5))
sns.histplot(data=MNCAATourneyCompactResults, x='score_difference');
plt.show();

## **MSampleSubmissionStage1.csv**

This file illustrates the submission file format for Stage 1. It is the simplest possible submission: a 50% winning percentage is predicted for each possible matchup.

A submission file lists every possible matchup between tournament teams for one or more years. During Stage 1, you are asked to make predictions for all possible matchups from the past five NCAA® tournaments (seasons 2015, 2016, 2017, 2018, 2019). In Stage 2, you will be asked to make predictions for all possible matchups from the current NCAA® tournament (season 2021).

When there are 68 teams in the tournament, there are 68×67/2 = 2,278 predictions to make for that year, so a Stage 1 submission file will have 2,278×5 = 11,390 data rows.
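The row counts quoted above are just "68 choose 2" per season, which can be checked quickly. The variable names below are illustrative only.

```python
from itertools import combinations
from math import comb

n_teams = 68
n_seasons = 5

# Unordered pairs of distinct teams in one tournament field.
pairs_per_season = comb(n_teams, 2)
total_rows = pairs_per_season * n_seasons

# The same count obtained by explicitly enumerating the pairs.
enumerated = sum(1 for _ in combinations(range(n_teams), 2))

print(pairs_per_season, total_rows, enumerated)
```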

* **ID** - this is a 14-character string of the format SSSS_XXXX_YYYY, where SSSS is the four digit season number, XXXX is the four-digit TeamID of the lower-ID team, and YYYY is the four-digit TeamID of the higher-ID team.

* **Pred** - this contains the predicted winning percentage for the first team identified in the ID field, the one represented above by XXXX.

* **Example #1:** You want to make a prediction for Duke (TeamID=1181) against Arizona (TeamID=1112) in the 2017 tournament, with Duke given a 53% chance to win and Arizona given a 47% chance to win. In this case, Arizona has the lower numerical ID so they would be listed first, and the winning percentage would be expressed from Arizona's perspective (47%):

2017_1112_1181,0.47

* **Example #2:** You want to make a prediction for Duke (TeamID=1181) against North Carolina (TeamID=1314) in the 2018 tournament, with Duke given a 51.6% chance to win and North Carolina given a 48.4% chance to win. In this case, Duke has the lower numerical ID so they would be listed first, and the winning percentage would be expressed from Duke's perspective (51.6%):

2018_1181_1314,0.516

Also note that a single prediction row serves as a prediction for each of the two teams' winning chances. So for instance, in Example #1, the submission row of "2017_1112_1181,0.47" specifically gives a 47% chance for Arizona to win, and doesn't explicitly mention Duke's 53% chance to win. However, our evaluation utility will automatically infer the winning percentage in the other direction, so a 47% prediction for Arizona to win also means a 53% prediction for Duke to win. And similarly, because the submission row in Example #2 gives Duke a 51.6% chance to beat North Carolina, we will automatically figure out that this also means North Carolina has a 48.4% chance to beat Duke.
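The ID convention (lower TeamID first, probability expressed from that team's perspective) is easy to get backwards, so here is a small sketch that builds a row from either ordering. `submission_row` is a hypothetical helper, not part of the competition tooling.

```python
def submission_row(season: int, team_a: int, team_b: int, prob_a_wins: float):
    """Build (ID, Pred) with the lower TeamID first.

    prob_a_wins is the probability that team_a beats team_b; it is flipped
    to 1 - prob_a_wins when team_a turns out to be the higher-ID team.
    """
    if team_a == team_b:
        raise ValueError("A team cannot play itself")
    if team_a < team_b:
        low, high, pred = team_a, team_b, prob_a_wins
    else:
        low, high, pred = team_b, team_a, 1.0 - prob_a_wins
    return f"{season}_{low}_{high}", pred


# Example #1 above: Duke (1181) at 53% against Arizona (1112) in 2017.
row_id, pred = submission_row(2017, 1181, 1112, 0.53)
print(row_id, round(pred, 3))

# Example #2 above: Duke (1181) at 51.6% against North Carolina (1314) in 2018.
print(submission_row(2018, 1181, 1314, 0.516))
```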

In [None]:
MSampleSubmissionStage1 = pd.read_csv("/kaggle/input/mens-march-mania-2022/MDataFiles_Stage1/MSampleSubmissionStage1.csv")
MSampleSubmissionStage1.head(6)

In [None]:
MSampleSubmissionStage1.tail(6)

This section provides game-by-game stats at a team level (free throws attempted, defensive rebounds, turnovers, etc.) for all regular season, conference tournament, and NCAA® tournament games since the 2002-03 season.

## **MRegularSeasonDetailedResults.csv**

In a Detailed Results file, the first eight columns (Season, DayNum, WTeamID, WScore, LTeamID, LScore, WLoc, and NumOT) are exactly the same as a Compact Results file. However, in a Detailed Results file, there are many additional columns. The column names should be self-explanatory to basketball fans (as above, "W" or "L" refers to the winning or losing team):

* **WFGM** - field goals made (by the winning team)
* **WFGA** - field goals attempted (by the winning team)
* **WFGM3** - three pointers made (by the winning team)
* **WFGA3** - three pointers attempted (by the winning team)
* **WFTM** - free throws made (by the winning team)
* **WFTA** - free throws attempted (by the winning team)
* **WOR** - offensive rebounds (pulled by the winning team)
* **WDR** - defensive rebounds (pulled by the winning team)
* **WAst** - assists (by the winning team)
* **WTO** - turnovers committed (by the winning team)
* **WStl** - steals (accomplished by the winning team)
* **WBlk** - blocks (accomplished by the winning team)
* **WPF** - personal fouls committed (by the winning team)

(and then the same set of stats from the perspective of the losing team: LFGM is the number of field goals made by the losing team, and so on up to LPF).

Note: by convention, "field goals made" (either WFGM or LFGM) refers to the total number of field goals made by a team, a combination of both two-point field goals and three-point field goals. And "three point field goals made" (either WFGM3 or LFGM3) is just the three-point field goals made, of course. So if you want to know specifically about two-point field goals, you have to subtract one from the other (e.g., WFGM - WFGM3). And the total number of points scored is most simply expressed as 2*FGM + FGM3 + FTM.
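The points formula follows from counting two-pointers and three-pointers separately: 2*(FGM - FGM3) + 3*FGM3 + FTM simplifies to 2*FGM + FGM3 + FTM. A minimal sketch on an invented box score (the frame `box_demo` and its numbers are made up for illustration):

```python
import pandas as pd

# Hypothetical winning-team box score lines (values are invented).
box_demo = pd.DataFrame({
    "WFGM": [25, 30],
    "WFGM3": [7, 10],
    "WFTM": [14, 5],
})

# Two-point field goals are total field goals minus three-pointers.
box_demo["WFGM2"] = box_demo["WFGM"] - box_demo["WFGM3"]

# Points from the simplified formula and from the itemised one.
box_demo["PointsSimple"] = 2 * box_demo["WFGM"] + box_demo["WFGM3"] + box_demo["WFTM"]
box_demo["PointsItemised"] = (
    2 * box_demo["WFGM2"] + 3 * box_demo["WFGM3"] + box_demo["WFTM"]
)

print(box_demo)
```

On the real Detailed Results file, comparing `2*WFGM + WFGM3 + WFTM` against `WScore` is a cheap consistency check.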

In [None]:
MRegularSeasonDetailedResults = pd.read_csv("/kaggle/input/mens-march-mania-2022/MDataFiles_Stage1/MRegularSeasonDetailedResults.csv")
MRegularSeasonDetailedResults.head(6)

In [None]:
MRegularSeasonDetailedResults.shape

In [None]:
MRegularSeasonDetailedResults.Season.value_counts().sort_index()

In [None]:
# Box score columns for the winning team, and their losing-team counterparts
wnew_cols = ['WFGM', 'WFGA', 'WFGM3', 'WFGA3', 'WFTM', 'WFTA', 'WOR', 'WDR',
             'WAst', 'WTO', 'WStl', 'WBlk', 'WPF']
lnew_cols = ['LFGM', 'LFGA', 'LFGM3', 'LFGA3', 'LFTM', 'LFTA', 'LOR', 'LDR',
             'LAst', 'LTO', 'LStl', 'LBlk', 'LPF']

for i, j in zip(wnew_cols, lnew_cols):
    fig, axes = plt.subplots(1, 2, figsize=(20, 5))
    sns.histplot(ax=axes[0], data=MRegularSeasonDetailedResults, x=i);
    sns.histplot(ax=axes[1], data=MRegularSeasonDetailedResults, x=j);
    plt.show();

## **MNCAATourneyDetailedResults.csv**

In [None]:
MNCAATourneyDetailedResults = pd.read_csv("/kaggle/input/mens-march-mania-2022/MDataFiles_Stage1/MNCAATourneyDetailedResults.csv")
MNCAATourneyDetailedResults.head(6)

In [None]:
MNCAATourneyDetailedResults.shape

In [None]:
MNCAATourneyDetailedResults.Season.value_counts().sort_index()

In [None]:
for i, j in zip(wnew_cols, lnew_cols):
    fig, axes = plt.subplots(1, 2, figsize=(20, 5))
    sns.histplot(ax=axes[0], data=MNCAATourneyDetailedResults, x=i);
    sns.histplot(ax=axes[1], data=MNCAATourneyDetailedResults, x=j);
    plt.show();

## **Cities.csv**

This file provides a master list of cities that have been locations for games played. Please notice that the Cities and Conferences files are the only two that don't start with an M; this is because the data files are identical between men's and women's data, so you don't need to maintain separate listings of cities or conferences across the two contests. Also note that if you created any supplemental data last year on cities (latitude/longitude, altitude, etc.), the CityID's match between last year and this year, so you should be able to re-use that information.

* **CityID** - a four-digit ID number uniquely identifying a city.
* **City** - the text name of the city.
* **State** - the state abbreviation of the state that the city is in. In a few rare cases, the game location is not inside one of the 50 U.S. states and so other abbreviations are used. For instance Cancun, Mexico has a state abbreviation of MX.

In [None]:
Cities = pd.read_csv("/kaggle/input/mens-march-mania-2022/MDataFiles_Stage1/Cities.csv")
Cities.head(6)

In [None]:
Cities.shape

In [None]:
print("Total Cities {}".format(Cities.City.nunique()))
print("Divided in {} States".format(Cities.State.nunique()))

In [None]:
fig = plt.figure(figsize=(5, 15))
sns.countplot(data=Cities, y='State');
plt.show()

## **MGameCities.csv**

This file identifies all games, starting with the 2010 season, along with the city that the game was played in. Games from the regular season, the NCAA® tourney, and other post-season tournaments, are all listed together. There should be no games since the 2010 season where the CityID is not known. Games from the 2009 season and before are not listed in this file.

* **Season, DayNum, WTeamID, LTeamID** - these four columns are sufficient to uniquely identify each game. Additional data, such as the score of the game and other stats, can be found in the corresponding Compact Results and/or Detailed Results file.
* **CRType** - this can be either Regular or NCAA or Secondary. If it is Regular, you can find more about the game in the MRegularSeasonCompactResults.csv and MRegularSeasonDetailedResults.csv files. If it is NCAA, you can find more about the game in the MNCAATourneyCompactResults.csv and MNCAATourneyDetailedResults.csv files. If it is Secondary, you can find more about the game in the MSecondaryTourneyCompactResults file.
* **CityID** - the ID of the city where the game was played, as specified by the CityID column in the Cities.csv file.
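To attach a city name and state to each game, the game listing can be joined to the city master list on `CityID`. The sketch below uses two tiny invented frames in place of Cities.csv and MGameCities.csv; the city names, IDs and games are placeholders, not real records.

```python
import pandas as pd

# Hypothetical stand-ins for Cities.csv and MGameCities.csv.
cities_demo = pd.DataFrame({
    "CityID": [4001, 4002],
    "City": ["Example City", "Sample Town"],
    "State": ["TX", "MX"],
})
game_cities_demo = pd.DataFrame({
    "Season": [2010, 2010, 2010],
    "DayNum": [7, 9, 136],
    "WTeamID": [1181, 1314, 1112],
    "LTeamID": [1112, 1181, 1314],
    "CRType": ["Regular", "Regular", "NCAA"],
    "CityID": [4001, 4002, 4001],
})

# Left join keeps every game; validate guards against duplicate CityIDs.
games_with_city = game_cities_demo.merge(
    cities_demo, on="CityID", how="left", validate="many_to_one"
)

print(games_with_city[["Season", "DayNum", "CRType", "City", "State"]])
```

The same `Season, DayNum, WTeamID, LTeamID` key can then be used to join the result onto a Compact or Detailed Results frame.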

In [None]:
MGameCities = pd.read_csv("/kaggle/input/mens-march-mania-2022/MDataFiles_Stage1/MGameCities.csv")
MGameCities.head(6)

In [None]:
MGameCities.shape

In [None]:
MGameCities.Season.value_counts().sort_index()

In [None]:
fig = plt.figure(figsize=(10, 5))
sns.countplot(data=MGameCities, x='CRType');
plt.show()

In [None]:
fig = plt.figure(figsize=(5, 60))
sns.countplot(data=MGameCities, y='CityID');
plt.show()

## **MMasseyOrdinals.csv**

In [None]:
MMasseyOrdinals = pd.read_csv("/kaggle/input/mens-march-mania-2022/MDataFiles_Stage1/MMasseyOrdinals.csv")
MMasseyOrdinals.head(6)

In [None]:
print("Unique TeamIDs: {}".format(MMasseyOrdinals.TeamID.nunique()))
print("Unique SystemName: {}".format(MMasseyOrdinals.SystemName.nunique()))

In [None]:
fig = plt.figure(figsize=(5, 40))
sns.countplot(data=MMasseyOrdinals, y='SystemName');
plt.show()

In [None]:
co_mat = pd.crosstab(MMasseyOrdinals.Season, MMasseyOrdinals.SystemName)
pd.set_option('display.max_columns', 200)
co_mat

## **MTeamCoaches.csv**

This file indicates the head coach for each team in each season, including a start/finish range of DayNum's to indicate a mid-season coaching change. For scenarios where a team had the same head coach the entire season, they will be listed with a DayNum range of 0 to 154 for that season. For head coaches whose term lasted many seasons, there will be many rows listed, most of which have a DayNum range of 0 to 154 for the corresponding season.

* **Season** - this is the year of the associated entry in MSeasons.csv (the calendar year in which the final tournament occurs)
* **TeamID** - this is the TeamID of the team that was coached, as described in MTeams.csv.
* **FirstDayNum, LastDayNum** - this defines a continuous range of days within the season, during which the indicated coach was the head coach of the team. In most cases, a data row will either have FirstDayNum=0 (meaning they started the year as head coach) and/or LastDayNum=154 (meaning they ended the year as head coach), but in some cases there were multiple new coaches during a team's season, or a head coach who went on leave and then returned (in which case there would be multiple records in that season for that coach, indicating the continuous ranges of days when they were the head coach).
* **CoachName** - this is a text representation of the coach's full name, in the format first_last, with underscores substituted in for spaces.
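Given the FirstDayNum/LastDayNum ranges, finding who coached a team on a particular game day is a range lookup. The following is a sketch on invented rows; `coaches_demo`, `coach_on_day` and the coach names are placeholders.

```python
import pandas as pd

# Hypothetical rows in the MTeamCoaches.csv layout (coach names are invented).
coaches_demo = pd.DataFrame({
    "Season": [2020, 2020, 2020],
    "TeamID": [1181, 1112, 1112],
    "FirstDayNum": [0, 0, 91],
    "LastDayNum": [154, 90, 154],
    "CoachName": ["first_coach", "second_coach", "third_coach"],
})


def coach_on_day(coaches: pd.DataFrame, season: int, team_id: int, day_num: int):
    """Return the head coach of team_id on day_num of season, or None."""
    mask = (
        (coaches["Season"] == season)
        & (coaches["TeamID"] == team_id)
        & (coaches["FirstDayNum"] <= day_num)
        & (coaches["LastDayNum"] >= day_num)
    )
    names = coaches.loc[mask, "CoachName"]
    return names.iloc[0] if len(names) else None


print(coach_on_day(coaches_demo, 2020, 1112, 50))
print(coach_on_day(coaches_demo, 2020, 1112, 120))
```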

In [None]:
MTeamCoaches = pd.read_csv("/kaggle/input/mens-march-mania-2022/MDataFiles_Stage1/MTeamCoaches.csv")
MTeamCoaches.head(6)

In [None]:
MTeamCoaches.shape

In [None]:
# Length of each coaching stint in days (LastDayNum minus FirstDayNum)
MTeamCoaches['TotalDays'] = MTeamCoaches['LastDayNum'] - MTeamCoaches['FirstDayNum']
fig = plt.figure(figsize=(10, 5))
sns.histplot(MTeamCoaches.groupby('CoachName')['TotalDays'].sum().values);
plt.title("Distribution of total number of days as head coach, per coach")
plt.show();

I will share any insights that might be relevant.

This is still a WIP.