# Introduction
The analysis below will be conducted on both the regular season and tournament statistics. It will begin by taking a high-level view of the NCAA championships - who has won them and how. It will analyse pre-tournament seedings and will also look at some of the Massey Ordinals data, specifically the Pomeroy and Sagarin rankings, to analyse what a difference in ranking means for a team’s performance.

It will then analyse each team’s averages per regular season and plot the correlation of those season averages with wins.

The analysis will then include a detailed look at Advanced Metrics and their relationship with team performance, both at a per-game level and across the overall season. It will highlight the performance of the champion team in each season with regard to these metrics. This section will show the importance of some of these advanced metrics, and how they can help explain basketball games better than the traditional box score statistics.

It will continue highlighting the champion team’s performance with regard to traditional box score stats for the season.

The analysis will then move on to comparing the differences in team stats between regular season play and tournament play to see whether the tournament games yield a different style of play.

Finally, per-game statistics for 2014 - 2018 will be compared between winning and losing teams, to see whether winners and losers differ in both tournament and regular season games.

I will build upon this kernel throughout the competition. Feel free to add in any suggestions. If you like the analysis, an upvote would also be greatly appreciated!

Update:

My model was very strong on Wofford - it even has them progressing to the final four! I couldn’t understand why, so I have added in some analysis on 2019 data and visualised Wofford’s advanced stats for 2019 and compared theirs to those of past winners. Hope you enjoy!

Update 2:

I have included an analysis of the upcoming Final Four Games. I hope it’s helpful to you!

In [None]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# for visualizations
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

# Men's Basketball Datasets

## Data Section 1
### Data Section 1.1

### Men's Teams

*Data Info*

**TeamID** - a 4 digit id number, from 1000-1999 (for men) or 3000-3999 (for women), uniquely identifying each NCAA® team. A school's TeamID does not change from one year to the next, so for instance the Duke men's TeamID is 1181 and women's TeamID is 3181 for all seasons. Keeping the men's and women's ID ranges separate avoids confusion between the two datasets.

**TeamName** - a compact spelling of the team's college name, 16 characters or fewer. There are no commas or double-quotes in the team names, but you will see some characters that are not letters or spaces, e.g., Texas A&M, St Mary's CA, TAM C. Christi, and Bethune-Cookman. Also note that several teams have had their team names changed slightly since last year: "Albany NY" is now "SUNY Albany", "Santa Barbara" is now "UC Santa Barbara", "VA Commonwealth" is now "VCU", "Edwardsville" is now "SIUE", "Cal Poly SLO" is now "Cal Poly", "IPFW" is now "PFW", "Long Island" is now "LIU Brooklyn", and "ULL" is now "Louisiana".

**FirstD1Season** (men's file only) - the first season in our dataset that the school was a Division-I school. For instance, FL Gulf Coast (famously) was not a Division-I school until the 2008 season, despite their two wins just five years later in the 2013 NCAA® tourney. Of course, many schools were Division-I far earlier than 1985, but since we don't have any data included prior to 1985, all such teams are listed with a FirstD1Season of 1985.

**LastD1Season** (men's file only) - the last season in our dataset that the school was a Division-I school. For any teams that are currently Division-I, they will be listed with LastD1Season=2020, and you can confirm there are 353 such teams. Since Savannah St is no longer a Division-I school, you can see that their last Division 1 season was 2019 rather than 2020.
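The check on currently active Division-I teams described above can be sketched as a simple filter on `LastD1Season`. This is a minimal illustration only: the helper name `active_d1_teams` and the three-row frame are assumptions made for the example (only Duke's TeamID of 1181 comes from the description), so the count it prints is not the real figure.

```python
import pandas as pd

# Toy stand-in for MTeams.csv. "Example A" and "Example B" are made-up rows
# used only to illustrate the filter; they are not real teams or IDs.
teams = pd.DataFrame({
    "TeamID": [1181, 1901, 1902],
    "TeamName": ["Duke", "Example A", "Example B"],
    "FirstD1Season": [1985, 2008, 1985],
    "LastD1Season": [2020, 2020, 2019],
})

def active_d1_teams(teams: pd.DataFrame, current_season: int = 2020) -> pd.DataFrame:
    """Return the rows whose LastD1Season equals the current season."""
    return teams[teams["LastD1Season"] == current_season]

active = active_d1_teams(teams)
print(len(active))

# Sanity check: every men's TeamID should fall in the 1000-1999 range.
ids_in_men_range = bool(teams["TeamID"].between(1000, 1999).all())
print(ids_in_men_range)
```

Running the same filter on the real `MTeams` frame should give the team count quoted in the description.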


In [None]:
MTeams = pd.read_csv("/kaggle/input/march-madness-analytics-2020/2020DataFiles/2020-Mens-Data/MDataFiles_Stage1/MTeams.csv")

In [None]:
MTeams.columns

In [None]:
MTeams.describe()

In [None]:
MTeams.head()

### Men's Seasons

*Data Info*


**Season** - indicates the year in which the tournament was played. Remember that the current season counts as 2020.

**DayZero** - tells you the date corresponding to DayNum=0 during that season. All game dates have been aligned on a common scale so that (each year) the Monday championship game of the men's tournament is on DayNum=154. Working backward, the national semifinals are always on DayNum=152, the "play-in" games are on days 134/135, Selection Sunday is on day 132, the final day of the regular season is also day 132, and so on. All game data includes the day number in order to make it easier to perform date calculations. If you need to know the exact date a game was played on, you can combine the game's "DayNum" with the season's "DayZero". For instance, since day zero during the 2011-2012 season was 10/31/2011, if we know that the earliest regular season games that year were played on DayNum=7, they were therefore played on 11/07/2011.
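The DayZero/DayNum arithmetic can be sketched in a few lines. The helper name `game_date` is hypothetical, and the sketch assumes DayZero is stored as an `MM/DD/YYYY` string, as in the example date quoted above.

```python
import pandas as pd

def game_date(day_zero: str, day_num: int) -> pd.Timestamp:
    """Convert a season's DayZero (MM/DD/YYYY) plus a DayNum offset to a calendar date."""
    start = pd.to_datetime(day_zero, format="%m/%d/%Y")
    return start + pd.Timedelta(days=day_num)

# Example from the description: DayZero 10/31/2011 and DayNum=7
first_games = game_date("10/31/2011", 7)
print(first_games.strftime("%m/%d/%Y"))
```

For a whole results table, the same idea vectorises by merging `DayZero` onto the games by `Season` and adding `pd.to_timedelta(games["DayNum"], unit="D")` to the parsed dates.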

**RegionW, RegionX, RegionY, RegionZ** - by our contests' convention, each of the four regions in the final tournament is assigned a letter of W, X, Y, or Z. We will not know the final W/X/Y/Z designations until Selection Sunday, because the national semifinal pairings in the Final Four will depend upon the overall ranks of the four #1 seeds.

In [None]:
MSeasons = pd.read_csv("/kaggle/input/march-madness-analytics-2020/2020DataFiles/2020-Mens-Data/MDataFiles_Stage1/MSeasons.csv")

In [None]:
MSeasons.columns

In [None]:
MSeasons.head()

### Men's NCAA Tourney Seeds

*Data Info*

**Season** - the year that the tournament was played in

**Seed** - this is a 3/4-character identifier of the seed, where the first character is either W, X, Y, or Z (identifying the region the team was in) and the next two digits (either 01, 02, ..., 15, or 16) tell you the seed within the region. For play-in teams (men's contest only), there is a fourth character (a or b) to further distinguish the seeds, since teams that face each other in the play-in games will have seeds with the same first three characters. The "a" and "b" are assigned based on which Team ID is lower numerically. As an example of the format of the seed, the first record in the file is seed W01 from 1985, which means we are looking at the #1 seed in the W region (which we can see from the "MSeasons.csv" file was the East region).
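The seed format described above can be split into its parts with a small regular expression. This is a sketch under the stated format only; `ParsedSeed`, `SEED_PATTERN`, and `parse_seed` are names introduced for illustration and are not part of the dataset or any library.

```python
import re
from typing import NamedTuple

class ParsedSeed(NamedTuple):
    region: str    # one of W, X, Y, Z
    number: int    # seed within the region, 1-16
    play_in: str   # "a" or "b" for play-in teams, "" otherwise

# Region letter, two-digit seed, optional play-in suffix
SEED_PATTERN = re.compile(r"^([WXYZ])(\d{2})([ab]?)$")

def parse_seed(seed: str) -> ParsedSeed:
    """Split a seed such as 'W01' or 'X16b' into region, number and play-in flag."""
    match = SEED_PATTERN.match(seed)
    if match is None:
        raise ValueError(f"Unrecognised seed format: {seed!r}")
    region, number, play_in = match.groups()
    return ParsedSeed(region, int(number), play_in)

print(parse_seed("W01"))
print(parse_seed("X16b"))
```

Having the numeric seed as its own field makes it easier to group or sort by seed strength regardless of region.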

**TeamID** - this identifies the id number of the team, as specified in the MTeams.csv file


In [None]:
MTourneySeed = pd.read_csv("/kaggle/input/march-madness-analytics-2020/2020DataFiles/2020-Mens-Data/MDataFiles_Stage1/MNCAATourneySeeds.csv")

In [None]:
MTourneySeed.columns

In [None]:
MTourneySeed.shape

In [None]:
MTourneySeed.head()

In [None]:
MTourneySeed['Seed'] = MTourneySeed['Seed'].astype(str)
MTourneySeed['Seed'].unique()

In [None]:
sns.set(font_scale=1.4)

In [None]:
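# Apply the grid style before the axes are created so it takes effect on this plot
sns.set_style("whitegrid")
plt.figure(figsize=(25,10))
plt.xticks(rotation=85)

# value_counts() already sorts in descending order, so its index gives the bar order
sns.countplot(x="Seed", data=MTourneySeed, order=MTourneySeed['Seed'].value_counts().index)

plt.title("Seed Counts")
plt.xlabel("Seeds")
plt.ylabel("Counts")
plt.show()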

In [None]:

sns.set_style("whitegrid")
plt.figure(figsize=(20,30))
plt.xticks(rotation=85)

# Seasons on the x-axis, seed identifiers on the y-axis
sns.scatterplot(x=MTourneySeed['Season'], y=MTourneySeed['Seed'], hue=MTourneySeed['Season'], s=150)

plt.title("Seeds by Season")
plt.xlabel("Season")
plt.ylabel("Seed")
plt.show()

### Merging Men's Tourney Seeds with Men's Teams

In [None]:
MTS_MTeams = pd.merge(MTourneySeed,
                 MTeams[['TeamID','TeamName']],
                 on='TeamID')
MTS_MTeams

In [None]:

plt.figure(figsize=(20,100))
plt.xticks( rotation=85)

sns.scatterplot(x=MTS_MTeams['Season'],y=MTS_MTeams['TeamName'],hue=MTS_MTeams['Season'],s=150)
sns.set_style("whitegrid")


plt.title("Teams by Each Season") 
plt.ylabel("Teams") 
plt.xlabel("Seasons") 
plt.show()

In [None]:
sns.set_style("whitegrid")
plt.figure(figsize=(20,80))
plt.xticks(rotation=0)

# Each row of the seeds table is one tournament appearance, so this counts appearances per team
sns.countplot(y="TeamName", data=MTS_MTeams, order=MTS_MTeams['TeamName'].value_counts().index)

plt.title("Tournament Appearances by Team")
plt.xlabel("Appearances")
plt.ylabel("Teams")
plt.show()

## To be Continued