# <div style="color:white;display:fill;border-radius:5px;background-color:#75B7BF;letter-spacing:0.1px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:center">Import</p></div>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datetime
from datetime import timedelta
import matplotlib.pyplot as plt
import seaborn as sns
DATA_PATH = '../input/mens-march-mania-2022/MDataFiles_Stage1/'

# <div style="color:white;display:fill;border-radius:5px;background-color:#75B7BF;letter-spacing:0.1px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:center">Data Section 1 - The Basics</p></div>

### <div style="color:white;display:fill;border-radius:5px;background-color:#75B7BF;letter-spacing:0.1px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:left">MTeams.csv</p></div>
****
This file identifies the different college teams present in the dataset. Each school is uniquely identified by a 4 digit id number. You will not see games present for all teams in all seasons, because the games listing is only for matchups where both teams are Division-I teams. There are 358 teams currently in Division-I, and an overall total of 372 teams in our team listing (each year, some teams might start being Division-I programs, and others might stop being Division-I programs).

In [None]:
MTeams = pd.read_csv(DATA_PATH + 'MTeams.csv')
MTeams.head()

**TeamID** 

a 4 digit id number, from 1000-1999, uniquely identifying each NCAA® men's team. A school's TeamID does not change from one year to the next, so for instance the Duke men's TeamID is 1181 for all seasons. To avoid possible confusion between the men's data and the women's data, all of the men's team ID's range from 1000-1999, whereas all of the women's team ID's range from 3000-3999.
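The ID ranges described above can be turned into a small sanity check. This is only a sketch: `classify_team_id` is a hypothetical helper name, not something provided with the competition data.

```python
def classify_team_id(team_id: int) -> str:
    """Return which dataset a TeamID belongs to, based on the documented ranges."""
    if 1000 <= team_id <= 1999:
        return "men"
    if 3000 <= team_id <= 3999:
        return "women"
    return "unknown"

# 1181 is the Duke men's TeamID mentioned above
print(classify_team_id(1181))
```

A check like this can catch accidental mixing of the men's and women's files before any joins are made.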

In [None]:
MTeams.TeamID.describe()

**TeamName**  

a compact spelling of the team's college name, 16 characters or fewer. There are no commas or double-quotes in the team names, but you will see some characters that are not letters or spaces, e.g., Texas A&M, St Mary's CA, TAM C. Christi, and Bethune-Cookman.

In [None]:
MTeams.TeamName.describe()

In [None]:
MTeams[MTeams.TeamName.str.contains('&')]

In [None]:
MTeams[MTeams.TeamName.str.contains("'")]

In [None]:
MTeams[MTeams.TeamName.str.contains("-")]

### <div style="color:white;display:fill;border-radius:5px;background-color:#75B7BF;letter-spacing:0.1px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:left">MSeasons.csv</p></div>
****
This file identifies the different seasons included in the historical data, along with certain season-level properties.

In [None]:
MSeasons = pd.read_csv(DATA_PATH + 'MSeasons.csv')
MSeasons.head()

**Season**

indicates the year in which the tournament was played. Remember that the current season counts as 2022.

In [None]:
MSeasons.Season.describe()

**DayZero** 

- tells you the date corresponding to DayNum=0 during that season. All game dates have been aligned upon a common scale so that (each year) the Monday championship game of the men's tournament is on DayNum=154. Working backward, the national semifinals are always on DayNum=152, the "play-in" games are on days 135, Selection Sunday is on day 132, the final day of the regular season is also day 132, and so on. All game data includes the day number in order to make it easier to perform date calculations. If you need to know the exact date a game was played on, you can combine the game's "DayNum" with the season's "DayZero". For instance, since day zero during the 2011-2012 season was 10/31/2011, if we know that the earliest regular season games that year were played on DayNum=7, they were therefore played on 11/07/2011.
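The date arithmetic described above can be sketched as follows. The helper name `daynum_to_date` is hypothetical, and the sketch assumes day zero is written as MM/DD/YYYY, as in the example in the text.

```python
from datetime import datetime, timedelta

def daynum_to_date(day_zero: str, day_num: int) -> datetime:
    """Convert a DayNum offset into a calendar date.

    Assumes day_zero is formatted as MM/DD/YYYY.
    """
    return datetime.strptime(day_zero, "%m/%d/%Y") + timedelta(days=day_num)

# The 2011-2012 example from the text: day zero 10/31/2011, DayNum=7
game_date = daynum_to_date("10/31/2011", 7)
print(game_date.strftime("%m/%d/%Y"))
```

The same helper applies to any season once its `DayZero` value has been looked up in `MSeasons`.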

In [None]:
dayzero_1985 = MSeasons.DayZero[MSeasons.Season == 1985]
dayzero_1985

**RegionW, RegionX, RegionY, RegionZ** 

- by our contests' convention, each of the four regions in the final tournament is assigned a letter of W, X, Y, or Z. Whichever region's name comes first alphabetically, that region will be Region W. And whichever Region plays against Region W in the national semifinals, that will be Region X. For the other two regions, whichever region's name comes first alphabetically, that region will be Region Y, and the other will be Region Z. This allows us to identify the regions and brackets in a standardized way in other files, even if the region names change from year to year. For instance, during the 2012 tournament, the four regions were East, Midwest, South, and West. Being the first alphabetically, East becomes W. Since the East regional champion (Ohio State) played against the Midwest regional champion (Kansas) in the national semifinals, that makes Midwest be region X. For the other two (South and West), since South comes first alphabetically, that makes South Y and therefore West is Z. So for that season, the W/X/Y/Z are East,Midwest,South,West. And so for instance, Ohio State, the #2 seed in the East, is listed in the MNCAATourneySeeds file that year with a seed of W02, meaning they were the #2 seed in the W region (the East region). We will not know the final W/X/Y/Z designations until Selection Sunday, because the national semifinal pairings in the Final Four will depend upon the overall ranks of the four #1 seeds.
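The lettering convention can be expressed as a short function. This is a sketch under stated assumptions: `assign_region_letters` and its `semifinal_pairs` argument are hypothetical names introduced here for illustration.

```python
def assign_region_letters(regions, semifinal_pairs):
    """Map region names to W/X/Y/Z following the convention described above.

    regions: the four region names for one season.
    semifinal_pairs: dict mapping each region to the region it meets
                     in the national semifinals.
    """
    w = min(regions)                 # first alphabetically becomes W
    x = semifinal_pairs[w]           # W's national semifinal opponent becomes X
    y, z = sorted(r for r in regions if r not in (w, x))  # remaining two, alphabetically
    return {"W": w, "X": x, "Y": y, "Z": z}

# The 2012 example from the text: East met Midwest in the national semifinals
letters_2012 = assign_region_letters(
    ["East", "Midwest", "South", "West"],
    {"East": "Midwest", "Midwest": "East", "South": "West", "West": "South"},
)
print(letters_2012)
```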


In [None]:
regionW_1985 = MSeasons.RegionW[MSeasons.Season == 1985]
regionW_1985

### <div style="color:white;display:fill;border-radius:5px;background-color:#75B7BF;letter-spacing:0.1px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:left">MRegularSeasonCompactResults.csv</p></div>
****
This file identifies the game-by-game results for many seasons of historical data, starting with the 1985 season (the first year the NCAA® had a 64-team tournament). For each season, the file includes all games played from DayNum 0 through 132. It is important to realize that the "Regular Season" games are simply defined to be all games played on DayNum=132 or earlier (DayNum=132 is Selection Sunday, and there are always a few conference tournament finals actually played early in the day on Selection Sunday itself). Thus a game played on or before Selection Sunday will show up here whether it was a pre-season tournament, a non-conference game, a regular conference game, a conference tournament game, or whatever.

In [None]:
MRegularSeasonCompactResults = pd.read_csv(DATA_PATH + 'MRegularSeasonCompactResults.csv')
MRegularSeasonCompactResults.head()

**DayNum**

- this integer always ranges from 0 to 132, and tells you what day the game was played on. It represents an offset from the "DayZero" date in the "MSeasons.csv" file. For example, the first game in the file was DayNum=20. Combined with the fact from the "MSeasons.csv" file that day zero was 10/29/1984 that year, this means the first game was played 20 days later, or 11/18/1984. There are no teams that ever played more than one game on a given date, so you can use this fact if you need a unique key (combining Season and DayNum and WTeamID). In order to accomplish this uniqueness, we had to adjust one game's date. In March 2008, the SEC postseason tournament had to reschedule one game (Georgia-Kentucky) to a subsequent day because of a tornado, so Georgia had to actually play two games on the same day. In order to enforce this uniqueness, we moved the game date for the Georgia-Kentucky game back to its original scheduled date.
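The two ideas above, the unique key and the date offset, can be sketched on a toy frame. The rows below are made up for illustration and only reuse the column names of the results file; the day zero of 10/29/1984 is the one quoted in the text.

```python
import pandas as pd

# Toy rows with the same key columns as the results file (values are illustrative only)
games = pd.DataFrame({
    "Season":  [1985, 1985, 1985],
    "DayNum":  [20, 20, 25],
    "WTeamID": [1228, 1106, 1228],
    "LTeamID": [1328, 1354, 1106],
})

# Check that (Season, DayNum, WTeamID) identifies each game uniquely
key = ["Season", "DayNum", "WTeamID"]
has_duplicates = games.duplicated(subset=key).any()
print("Duplicate keys found:", has_duplicates)

# Convert DayNum to a calendar date using the 1985 season's day zero (10/29/1984)
day_zero = pd.Timestamp("1984-10-29")
games["GameDate"] = day_zero + pd.to_timedelta(games["DayNum"], unit="D")
print(games[["DayNum", "GameDate"]])
```

Running the same `duplicated` check on the full file is a quick way to confirm the key before using it in joins.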

In [None]:
MRegularSeasonCompactResults.DayNum.describe()

In [None]:
df_1985 = pd.DataFrame(MRegularSeasonCompactResults[MRegularSeasonCompactResults.Season==1985], columns=['DayNum'])

In [None]:
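# Look up day zero for the 1985 season explicitly; pd.to_datetime infers the stored date format
dateZero_1985 = pd.to_datetime(MSeasons.loc[MSeasons.Season == 1985, 'DayZero'].iloc[0])
# Each game's calendar date is day zero plus its DayNum offset
df_1985['GameDate'] = dateZero_1985 + pd.to_timedelta(df_1985.DayNum, unit='D')
df_1985.head()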

### Winner and Loser Team

In [None]:
fig = plt.figure(figsize=(30,10))
fig.add_subplot(1,2,1)
MRegularSeasonCompactResults.groupby('WTeamID').size().hist(bins=100, color='blue')
plt.xlabel("Number of wins per team")
plt.ylabel("Frequency")
fig.add_subplot(1,2,2)
MRegularSeasonCompactResults.groupby('LTeamID').size().hist(bins=100, color='red')
plt.xlabel("Number of losses per team")
plt.ylabel("Frequency")

plt.show()

### Winner and Loser Score

In [None]:
fig = plt.figure(figsize=(30,10))
fig.add_subplot(1,2,1)
MRegularSeasonCompactResults.WScore.hist(bins=186, color='blue')
plt.xlabel("Winner score")
plt.ylabel("Frequency")
fig.add_subplot(1,2,2)
MRegularSeasonCompactResults.LScore.hist(bins=150, color='red')
plt.xlabel("Loser score")
plt.ylabel("Frequency")

plt.show()

In [None]:
print(f'Mean and std of Winner Score:\n Mean = {np.round(np.mean(MRegularSeasonCompactResults.WScore), 2)} \n Std = {np.round(np.std(MRegularSeasonCompactResults.WScore), 2)}')

In [None]:
print(f'Mean and std of Loser Score:\n Mean = {np.round(np.mean(MRegularSeasonCompactResults.LScore), 2)} \n Std = {np.round(np.std(MRegularSeasonCompactResults.LScore), 2)}')

In [None]:
# Plot the difference Score between Winner and Loser 
MRegularSeasonCompactResults['DScore'] = MRegularSeasonCompactResults.WScore - MRegularSeasonCompactResults.LScore
plt.figure(figsize=(10,7))
MRegularSeasonCompactResults.DScore.hist(bins=100, color='orange')
plt.xlabel("Score Difference")
plt.ylabel("Frequency")
plt.show()

In [None]:
print(f'Mean and std of Difference Score:\n Mean = {np.round(np.mean(MRegularSeasonCompactResults.DScore), 2)} \n Std = {np.round(np.std(MRegularSeasonCompactResults.DScore), 2)}')

In [None]:
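# Mean score difference per winning team. Each team contributes a single value,
# so there is no spread for an error band to show.
data = MRegularSeasonCompactResults.groupby('WTeamID').agg(score=('DScore','mean')).reset_index()
sns.relplot(x='WTeamID', y='score', data=data, kind='line')
plt.gcf().set_size_inches(20, 10)
plt.show()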

**NumOT** 
- this indicates the number of overtime periods in the game, an integer 0 or higher.

In [None]:
print(f'There are {MRegularSeasonCompactResults.NumOT.nunique()} distinct overtime counts:')
print(f'{np.sort(MRegularSeasonCompactResults.NumOT.unique())}')

In [None]:
pie, ax = plt.subplots(figsize=(10,6))
MRegularSeasonCompactResults.groupby('NumOT').size().plot(kind='pie',
                                                         ax = ax, 
                                                         title='Overtime Distribution')
plt.show()

In [None]:
# Share of games that ended in regulation (NumOT == 0), as a percentage
overtime_counts = MRegularSeasonCompactResults.groupby('NumOT').size()
noOvertime = np.round(100 * overtime_counts[0] / overtime_counts.sum(), 1)
print(f'Percentage of games with no overtime = {noOvertime} %')

**WLoc : Winner Location** 
- this identifies the "location" of the winning team. 
- If the winning team was the home team, this value will be **"H"**. 
- If the winning team was the visiting team, this value will be **"A"**. 
- If it was played on a neutral court, then this value will be **"N"**. 

Sometimes it is unclear whether the site should be considered neutral, since it is near one team's home court, or even on their court during a tournament, but for this determination we have simply used the Kenneth Massey data in its current state, where the "@" sign is either listed with the winning team, the losing team, or neither team. If you would like to investigate this factor more closely, we invite you to explore Data Section 3, which provides the city that each game was played in, irrespective of whether it was considered to be a neutral site.
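As a small illustration of how the three `WLoc` codes can be summarized, the sketch below counts them and computes a home win rate over non-neutral games. The list of values is hypothetical and is not taken from the dataset.

```python
from collections import Counter

# Hypothetical WLoc values for six games (illustrative only, not real data)
wloc = ["H", "H", "A", "N", "H", "A"]

counts = Counter(wloc)
shares = {loc: counts[loc] / len(wloc) for loc in ("H", "A", "N")}

# Among games that were not on a neutral court, how often did the home team win?
home_win_rate = counts["H"] / (counts["H"] + counts["A"])
print(shares)
print(home_win_rate)
```

Excluding neutral-site games from the denominator keeps the home win rate from being diluted by games where neither team is at home.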

In [None]:
pie, ax = plt.subplots(figsize = [10,6])
MRegularSeasonCompactResults.groupby('WLoc').size().plot(kind='pie',
                                                        ax=ax,
                                                         title='Winner Location Distribution',
                                                         rotatelabels=True,)
plt.show()

Home teams account for the largest share of wins, which suggests that playing at home increases the chance of winning.

**MNCAATourneySeeds.csv**

This file identifies the seeds for all teams in each NCAA® tournament, for all seasons of historical data. Thus, there are between 64 and 68 rows for each year, depending on whether there were any play-in games and how many there were. In recent years the structure has settled at 68 total teams, with four "play-in" games leading to the final field of 64 teams entering Round 1 on Thursday of the first week (by definition, that is DayNum=136 each season). We will not know the seeds of the respective tournament teams, or even exactly which 68 teams it will be, until Selection Sunday on March 13, 2022 (DayNum=132).

In [None]:
seeds = pd.read_csv(DATA_PATH + "MNCAATourneySeeds.csv")
seeds.head()

The Tournament Selection Committee seeds every team in the NCAA Tournament ranging from 1 (the best teams) to 16 (the worst ones). The tournament started with 8 teams in 1939 and has since expanded to 68 teams today (4 play-in games). On Selection Sunday (March 13, 2022), the committee releases the bracket with the seeds.

* First character : Region (W, X, Y, or Z)
* Next two digits : Seed within the region (01 to 16)
* Last character (optional): Distinguishes teams between play-ins (a or b)
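The seed format listed above can be split into its parts with a regular expression. This is a minimal sketch; `parse_seed` and `SEED_PATTERN` are hypothetical names introduced for illustration.

```python
import re

# Region letter, two-digit seed, optional play-in suffix
SEED_PATTERN = re.compile(r"^([WXYZ])(\d{2})([ab]?)$")

def parse_seed(seed: str):
    """Split a seed string such as 'W02' or 'X16a' into (region, seed number, play-in suffix)."""
    match = SEED_PATTERN.match(seed)
    if match is None:
        raise ValueError(f"Unexpected seed format: {seed!r}")
    region, number, play_in = match.groups()
    return region, int(number), play_in

print(parse_seed("W02"))
print(parse_seed("X16a"))
```

Converting the middle two digits to an integer makes the seed usable as a numeric feature, for example the seed difference between two teams.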

In [None]:
seeds[seeds.Season==1985]

## <div style="color:white;display:fill;border-radius:5px;background-color:#75B7BF;letter-spacing:0.1px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:center">DATA SECTION 2 - TEAM BOX SCORES</p></div>

### <div style="color:white;display:fill;border-radius:5px;background-color:#75B7BF;letter-spacing:0.1px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:left">MRegularSeasonDetailedResults.csv</p></div>
****

This file provides team-level box scores for many regular seasons of historical data, starting with the *2003* season. Every game listed in the MRegularSeasonCompactResults file since the 2003 season should also be present in the MRegularSeasonDetailedResults file.
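One way to verify that claim is a left merge with an indicator column. The sketch below uses two toy frames with made-up rows; only the column names mirror the real files, and the choice of key columns is an assumption.

```python
import pandas as pd

# Toy stand-ins for the compact and detailed files (rows are illustrative only)
compact = pd.DataFrame({
    "Season":  [2002, 2003, 2003],
    "DayNum":  [10, 10, 11],
    "WTeamID": [1104, 1104, 1272],
    "LTeamID": [1328, 1328, 1266],
})
detailed = pd.DataFrame({
    "Season":  [2003, 2003],
    "DayNum":  [10, 11],
    "WTeamID": [1104, 1272],
    "LTeamID": [1328, 1266],
    "WFGM":    [27, 26],
})

key = ["Season", "DayNum", "WTeamID", "LTeamID"]
recent = compact[compact.Season >= 2003]

# Rows marked 'left_only' are compact games with no detailed counterpart
check = recent.merge(detailed[key], on=key, how="left", indicator=True)
missing = check[check["_merge"] == "left_only"]
print(f"{len(missing)} of {len(recent)} games since 2003 are missing box scores")
```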

In [None]:
MRegularSeasonDetailedResults = pd.read_csv(DATA_PATH + "MRegularSeasonDetailedResults.csv")
MRegularSeasonDetailedResults.head()

**WFGM** - field goals made (by the winning team)

**WFGA** - field goals attempted (by the winning team)

**WFGM3** - three pointers made (by the winning team)

**WFGA3** - three pointers attempted (by the winning team)

**WFTM** - free throws made (by the winning team)

**WFTA** - free throws attempted (by the winning team)

**WOR** - offensive rebounds (pulled by the winning team)

**WDR** - defensive rebounds (pulled by the winning team)

**WAst** - assists (by the winning team)

**WTO** - turnovers committed (by the winning team)

**WStl** - steals (accomplished by the winning team)

**WBlk** - blocks (accomplished by the winning team)

**WPF** - personal fouls committed (by the winning team)

In [None]:
fig = plt.figure(figsize=(30,80))
columns_ = MRegularSeasonDetailedResults.columns.drop(['Season', 'DayNum', 'WTeamID', 'WScore', 'LTeamID', 'LScore', 'WLoc','NumOT'])
# Compute the per-team means once, over the numeric box-score columns only
# (WLoc is a string column and cannot be averaged)
team_means = MRegularSeasonDetailedResults.groupby('WTeamID')[list(columns_)].mean()
for x, i in enumerate(columns_, start=1):
    fig.add_subplot(15,5,x)
    plt.title(i, fontsize=18)
    plt.plot(team_means.index, team_means[i])

In [None]:
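MRegularSeasonDetailedResults['DScore'] = MRegularSeasonDetailedResults.WScore - MRegularSeasonDetailedResults.LScore
# Take x and y from the same grouped Series so each TeamID lines up with its own mean;
# unique() returns IDs in order of appearance, while groupby sorts them
mean_dscore = MRegularSeasonDetailedResults.groupby('WTeamID')['DScore'].mean()
qx = sns.jointplot(x = mean_dscore.index.to_numpy(), 
                   y = mean_dscore.to_numpy(), 
                   kind="reg", 
                   height=12, 
                   joint_kws={'line_kws':{'color':'red'}})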
qx.ax_joint.set_xlabel('WTeamID')
qx.ax_joint.set_ylabel('DScore')
plt.show()

In [None]:
# Number of wins per team, attached to the per-team means of the numeric columns
Win_time = MRegularSeasonDetailedResults.WTeamID.value_counts()
teams = MRegularSeasonDetailedResults.groupby('WTeamID').mean(numeric_only=True).reset_index()
# map() aligns the win counts by TeamID, so no row-by-row concat is needed
teams['Win_time'] = teams.WTeamID.map(Win_time)
    
sns.relplot(x='DScore',
            y='Win_time',
            data=teams)
plt.gcf().set_size_inches(8, 8)

In [None]:
fig = plt.figure(figsize=(30,80))
r=1 
for x in columns_:    
        fig.add_subplot(10,5, r)
        plt.title(x,fontsize=18)
        plt.scatter(x='Win_time',
                    y=x,
                    data=teams)
        r+=1 

### <div style="color:white;display:fill;border-radius:5px;background-color:#75B7BF;letter-spacing:0.1px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:left">MNCAATourneyDetailedResults.csv + MNCAATourneyCompactResults.csv</p></div>
****

The MNCAATourneyDetailedResults file provides team-level box scores for many NCAA® tournaments, starting with the 2003 season. Every game listed in the MNCAATourneyCompactResults file since the 2003 season should also be present in the MNCAATourneyDetailedResults file.

In [None]:
MNCAATourneyDetailedResults = pd.read_csv(DATA_PATH + "MNCAATourneyDetailedResults.csv")
MNCAATourneyDetailedResults.head()

In [None]:
MNCAATourneyCompactResults = pd.read_csv(DATA_PATH + 'MNCAATourneyCompactResults.csv')
MNCAATourneyCompactResults.head()

In [None]:
MNCAATourneyCompactResults.Season.unique()

## <div style="color:white;display:fill;border-radius:5px;background-color:#75B7BF;letter-spacing:0.1px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:center">DATA SECTION 3- GEOGRAPHY</p></div>

### <div style="color:white;display:fill;border-radius:5px;background-color:#75B7BF;letter-spacing:0.1px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:left">Cities.csv</p></div>
****

This file provides a master list of cities that have been locations for games played. Please notice that the Cities and Conferences files are the only two that don't start with an M; this is because the data files are identical between men's and women's data, so you don't need to maintain separate listings of cities or conferences across the two contests. Also note that if you created any supplemental data last year on cities (latitude/longitude, altitude, etc.), the CityID's match between last year and this year, so you should be able to re-use that information.

In [None]:
Cities = pd.read_csv(DATA_PATH + 'Cities.csv')
Cities.head()

### <div style="color:white;display:fill;border-radius:5px;background-color:#75B7BF;letter-spacing:0.1px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:left">MGameCities.csv</p></div>
****
This file identifies all games, starting with the 2010 season, along with the city that the game was played in. Games from the regular season, the NCAA® tourney, and other post-season tournaments, are all listed together. There should be no games since the 2010 season where the CityID is not known. Games from the 2009 season and before are not listed in this file.
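To attach a city name to each game, the game listing can be joined to the city master list on `CityID`. The sketch below uses toy frames with made-up rows, and it assumes the city list has `City` and `State` columns alongside `CityID`.

```python
import pandas as pd

# Toy stand-ins for Cities.csv and MGameCities.csv (rows are illustrative only)
cities = pd.DataFrame({
    "CityID": [4001, 4002],
    "City":   ["Abilene", "Akron"],
    "State":  ["TX", "OH"],
})
game_cities = pd.DataFrame({
    "Season":  [2010, 2010],
    "DayNum":  [7, 7],
    "WTeamID": [1143, 1314],
    "LTeamID": [1293, 1198],
    "CityID":  [4002, 4001],
})

# Many games map to one city; validate raises if CityID is duplicated in the city list
games_with_city = game_cities.merge(cities, on="CityID", how="left", validate="many_to_one")
print(games_with_city)
```

A left join keeps every game even if a `CityID` had no match, which would then show up as a missing `City` value.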

In [None]:
MGameCities = pd.read_csv(DATA_PATH + 'MGameCities.csv')
MGameCities.head()

## <div style="color:white;display:fill;border-radius:5px;background-color:#75B7BF;letter-spacing:0.1px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:center">DATA SECTION 4- PUBLIC RANKINGS</p></div>

### <div style="color:white;display:fill;border-radius:5px;background-color:#75B7BF;letter-spacing:0.1px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:left">MMasseyOrdinals.csv</p></div>
****
This file lists out rankings (e.g. #1, #2, #3, ..., #N) of teams going back to the 2002-2003 season, under a large number of different ranking system methodologies. The information was gathered by Kenneth Massey and provided on his [College Basketball Ranking Composite page](https://www.masseyratings.com/cb/compare.htm).

Note that a rating system is more precise than a ranking system, because a rating system can provide insight about the strength gap between two adjacently-ranked teams. A ranking system will just tell you who is #1 or who is #2, but a rating system might tell you whether the gap between #1 and #2 is large or small. Nevertheless, it can be hard to compare two different rating systems that are expressed in different scales, so it can be very useful to express all the systems in terms of their ordinal ranking (1, 2, 3, ..., N) of teams.
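The conversion from ratings to ordinal ranks can be sketched with `DataFrame.rank`. The two rating systems and their values below are hypothetical, chosen only to show scales that are not directly comparable.

```python
import pandas as pd

ratings = pd.DataFrame({
    "Team": ["A", "B", "C"],
    "SystemOne": [95.2, 94.9, 80.0],  # hypothetical ratings on a 0-100 scale
    "SystemTwo": [2.1, 2.4, -0.5],    # hypothetical ratings on a different scale
}).set_index("Team")

# Higher rating means better team, so rank in descending order; 1 is the best
ranks = ratings.rank(ascending=False, method="min").astype(int)
print(ranks)
```

The ranks are comparable across systems, but the size of the gap between teams A and B in `SystemOne` is no longer visible, which is the trade-off described above.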

In [None]:
MMasseyOrdinals = pd.read_csv(DATA_PATH + 'MMasseyOrdinals.csv')
MMasseyOrdinals.head()

# To be continued