# Introduction

The purpose of this section is to quickly check that there are no missing values and no strange outliers in the datasets listed under 'Data Section 1'. The following section performs a more thorough analysis and visualization of the dataset. This kernel also serves as practice in data analysis and visualization.

# Table Of Contents
* [Import and load data](#import)
* [Initial Data Analysis](#init-data) 
    * [Analyzing Teams Table](#data-analysis-team)
    * [Analyzing Seasons Table](#data-analysis-season)
    * [Analyzing NCAA Tourney Seeds Table](#data-analysis-ncaats)
    * [Analyzing Regular Season Compact Results Table](#data-analysis-reg-season)
    * [Analyzing NCAA Tourney Compact Results Table](#data-analysis-ncaa-tour)
* [Data Visualization and Analysis](#data-viz)
    * [Analyzing Teams Table](#data-viz-team)
    * [Analyzing Seasons Table](#data-viz-season)
    * [Analyzing NCAA Tourney Seeds Table](#data-viz-ncaats)
    * [Analyzing Regular Season Compact Results Table](#data-viz-reg-season)
    * [Analyzing NCAA Tourney Compact Results Table](#data-viz-ncaa-tour)
* [Conclusion](#conclusion)

## <a id='import'></a> Import and load data

In [116]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.options.display.max_rows = 999
pd.options.display.max_columns = 999
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))
# Any results you write to the current directory are saved as output.

In [117]:
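# Load every CSV file in the input directory into a dict keyed by file name (without extension)
all_csv = {}
for file_name in os.listdir("../input"):
    # splitext copes with names that have no extension or contain extra dots
    name, ext = os.path.splitext(file_name)
    if ext == ".csv":
        all_csv[name] = pd.read_csv(os.path.join("../input", file_name), encoding='ISO-8859-1')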

print(all_csv.keys())

# <a id='init-data'></a> Initial Data Analysis

## <a id='data-analysis-team'></a> Analyzing Teams Table

In [118]:
all_csv['Teams'].head()

In [119]:
print(all_csv['Teams'].info())
print(100*"*")
print(all_csv['Teams'].describe(include = ['O']))
print(100*"*")
print(all_csv['Teams'].describe(exclude =['O']))
print(100*"*")
print(all_csv['Teams'].isnull().sum())

No missing values. All team names are unique.
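
Both observations above can also be checked programmatically instead of being read off the printed summaries. A minimal sketch, assuming a small made-up frame (`teams_demo`, hypothetical) that stands in for `all_csv['Teams']`:

```python
import pandas as pd

# Hypothetical stand-in for all_csv['Teams']; the rows are made up for illustration.
teams_demo = pd.DataFrame({
    "TeamID": [1, 2, 3],
    "TeamName": ["Alpha", "Beta", "Gamma"],
    "FirstD1Season": [1985, 1985, 1990],
    "LastD1Season": [2018, 2018, 2018],
})

# Total number of missing cells across every column.
n_missing = int(teams_demo.isnull().sum().sum())

# True when no team name appears more than once.
names_unique = bool(teams_demo["TeamName"].is_unique)

print(n_missing, names_unique)
```

Swapping `teams_demo` for `all_csv['Teams']` should give the same two checks on the real table.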

## <a id='data-analysis-season'></a> Analyzing Seasons Table

In [120]:
all_csv['Seasons'].head()

In [121]:
print(all_csv['Seasons'].info())
print(100*"*")
print(all_csv['Seasons'].describe(include = ['O']))
print(100*"*")
print(all_csv['Seasons'].describe(exclude =['O']))
print(100*"*")
print(all_csv['Seasons'].isnull().sum())

No missing values in any column. Next, look at the distribution of region names: which regions are predominant in each Final Four bracket slot?

In [122]:
print(all_csv['Seasons'].loc[:,'RegionW'].value_counts())
print(100*'*')
print(all_csv['Seasons'].loc[:,'RegionX'].value_counts())
print(100*'*')
print(all_csv['Seasons'].loc[:,'RegionY'].value_counts())
print(100*'*')
print(all_csv['Seasons'].loc[:,'RegionZ'].value_counts())

## <a id='data-analysis-ncaats'></a> Analyzing NCAA Tourney Seeds Table

In [123]:
all_csv['NCAATourneySeeds'].head()

In [124]:
print(all_csv['NCAATourneySeeds'].info())
print(100*"*")
print(all_csv['NCAATourneySeeds'].describe(include = ['O']))
print(100*"*")
print(all_csv['NCAATourneySeeds'].describe(exclude =['O']))
print(100*"*")
print(all_csv['NCAATourneySeeds'].isnull().sum())

No missing values. 

## <a id='data-analysis-reg-season'></a> Analyzing Regular Season Compact Results Table

In [125]:
all_csv['RegularSeasonCompactResults'].head()

In [126]:
print(all_csv['RegularSeasonCompactResults'].info())
print(100*"*")
print(all_csv['RegularSeasonCompactResults'].describe(include = ['O']))
print(100*"*")
print(all_csv['RegularSeasonCompactResults'].describe(exclude =['O']))
print(100*"*")
print(all_csv['RegularSeasonCompactResults'].isnull().sum())

In [127]:
all_csv['RegularSeasonCompactResults'].loc[:,'WLoc'].value_counts()

No missing values. The distribution of WLoc makes sense, since most teams tend to play better at home.
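
To compare locations as shares rather than raw counts, `value_counts` can be normalized. A small sketch on a made-up `WLoc` column (`wloc_demo` is hypothetical, not the real data):

```python
import pandas as pd

# Hypothetical sample of the WLoc column (H = home, A = away, N = neutral).
wloc_demo = pd.Series(["H", "H", "H", "A", "N", "H", "A", "H"], name="WLoc")

# normalize=True returns fractions of the total instead of raw counts.
wloc_share = wloc_demo.value_counts(normalize=True)

print(wloc_share)
```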

## <a id='data-analysis-ncaa-tour'></a> Analyzing NCAA Tourney Compact Results Table

In [128]:
all_csv['NCAATourneyCompactResults'].head()

In [153]:
print(all_csv['NCAATourneyCompactResults'].info())
print(100*"*")
print(all_csv['NCAATourneyCompactResults'].describe(include = ['O']))
print(100*"*")
print(all_csv['NCAATourneyCompactResults'].describe(exclude =['O']))
print(100*"*")
print(all_csv['NCAATourneyCompactResults'].isnull().sum())

In [154]:
print(all_csv['NCAATourneyCompactResults'].loc[:,'WLoc'].value_counts())

No missing values. Note that all of the games are played at neutral sites.

# <a id='data-viz'></a> Data Visualization and Analysis

In [155]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## <a id='data-viz-team'></a> Analyzing Teams Table

In [156]:
all_csv['Teams'].head()

In [186]:
fig1, ax1 = plt.subplots(2,2)
fig1.set_size_inches(10,10)
sns.countplot(y = 'FirstD1Season', data = all_csv['Teams'], ax = ax1[0][0])
sns.countplot(y = 'FirstD1Season', data = all_csv["Teams"].loc[all_csv['Teams']['FirstD1Season'] != 1985, ['FirstD1Season']], ax = ax1[0][1])
sns.countplot(y = 'LastD1Season', data = all_csv['Teams'], ax = ax1[1][0])
sns.countplot(y = 'LastD1Season', data = all_csv["Teams"].loc[all_csv['Teams']['LastD1Season'] !=2018, ['LastD1Season']], ax = ax1[1][1])
plt.tight_layout()

As can be seen in the countplots for 'FirstD1Season', very few new teams have joined since 1985. Note that these columns record the first and last seasons in which a school was in Division I, not NCAA tourney participation. 2011 was the most recent season in which a team dropped out and did not return.

In [188]:
all_csv['Teams'][all_csv['Teams']['LastD1Season']==2011]

## <a id='data-viz-season'></a> Analyzing Seasons Table

In [158]:
all_csv['Seasons'].head()

In [159]:
fig2, ax2 = plt.subplots(4,1)
fig2.set_size_inches(8,12)
#sns.countplot(x = 'RegionW', data = all_csv['Seasons'])
sns.countplot(x = 'RegionW', data = all_csv['Seasons'], ax = ax2[0])
sns.countplot(x = 'RegionX', data = all_csv['Seasons'], ax = ax2[1])
sns.countplot(x = 'RegionY', data = all_csv['Seasons'], ax = ax2[2])
sns.countplot(x = 'RegionZ', data = all_csv['Seasons'], ax = ax2[3])
plt.tight_layout()

The countplots tell us that 'RegionW' and 'RegionZ' are predominantly assigned the East and West regions of the United States, while 'RegionX' and 'RegionY' cover the remaining regions.
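
The same comparison can be collected into a single table with one column per bracket slot. A minimal sketch, assuming a made-up `seasons_demo` frame (hypothetical) with the same four region columns as `all_csv['Seasons']`:

```python
import pandas as pd

# Hypothetical stand-in for all_csv['Seasons']; the rows are made up.
seasons_demo = pd.DataFrame({
    "Season": [2001, 2002, 2003],
    "RegionW": ["East", "East", "East"],
    "RegionX": ["Midwest", "Midwest", "South"],
    "RegionY": ["South", "South", "Midwest"],
    "RegionZ": ["West", "West", "West"],
})

region_cols = ["RegionW", "RegionX", "RegionY", "RegionZ"]

# One row per region name, one column per bracket slot;
# combinations that never occur become 0.
region_counts = (
    seasons_demo[region_cols]
    .apply(pd.Series.value_counts)
    .fillna(0)
    .astype(int)
)

print(region_counts)
```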

## <a id='data-viz-ncaats'></a> Analyzing NCAA Tourney Seeds Table

In [160]:
all_csv['NCAATourneySeeds'].head()

Next, look at the seeds and count how many times each team has been a #1 seed, a #2 seed, and so on.

In [161]:
all_csv['NCAATourneySeeds']['Seed_NoReg'] = [i[1:] for i in all_csv['NCAATourneySeeds'].Seed]
all_csv['NCAATourneySeeds'].head()

In [162]:
temp = pd.merge(all_csv['NCAATourneySeeds'],all_csv['Teams'].loc[:,['TeamID','TeamName']],on = 'TeamID')

In [163]:
temp1 = temp.groupby(['TeamName','Seed_NoReg'])['Seed_NoReg'].count()

In [164]:
grid_kws = {"height_ratios": (.95, .01), "hspace": .03}
fig3, (ax3,cbar3) = plt.subplots(2, gridspec_kw=grid_kws)
fig3.set_size_inches((50,200))
ax3.yaxis.label.set_size(60)
ax3.xaxis.label.set_size(60)
ax3.tick_params(labelsize=35)
ax = sns.heatmap(data = temp1.unstack().fillna(0), ax = ax3, cbar_ax=cbar3, linewidth=2,vmax = 14,\
                  cbar_kws={"orientation": "horizontal","ticks":[0,14]})
cbar3.tick_params(labelsize=40)
cbar3.set_title(label = 'Number of Appearances',fontsize=50,loc = 'left')
plt.show()

What stands out in the heatmap above is that since 1985, the first season in the dataset:
1. Arizona has been a number 1 seed 6 times and a number 2 seed 7 times
2. Connecticut has been a number 1 seed 5 times and a number 2 seed 6 times
3. Duke has been a number 1 seed 13 times and a number 2 seed 10 times
4. Kansas has been a number 1 seed 13 times and a number 2 seed 7 times
5. Kentucky has been a number 1 seed 10 times and a number 2 seed 6 times
6. North Carolina has been a number 1 seed 13 times and a number 2 seed 6 times
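
Counts like the ones listed above can also be read from a table instead of the heatmap. The sketch below is one possible way to do it, not the method used in this kernel: it assumes seeds of the form region letter plus two digits (optionally followed by a play-in suffix) and, unlike `Seed_NoReg`, drops that suffix. The frame `seeds_demo` and its team names are made up.

```python
import pandas as pd

# Hypothetical seeds table already merged with team names; rows are made up.
seeds_demo = pd.DataFrame({
    "Season": [2001, 2001, 2002, 2002, 2003],
    "Seed": ["W01", "X02", "W01", "Y16a", "Z02"],
    "TeamName": ["Alpha", "Beta", "Alpha", "Gamma", "Alpha"],
})

# Keep only the two digits after the region letter and convert to int,
# so 'Y16a' and 'Y16b' would both count as seed 16.
seeds_demo["SeedNum"] = seeds_demo["Seed"].str[1:3].astype(int)

# Rows are teams, columns are seed numbers, cells are appearance counts.
seed_counts = pd.crosstab(seeds_demo["TeamName"], seeds_demo["SeedNum"])

# Teams ranked by how often they were a number 1 seed.
top_one_seeds = seed_counts[1].sort_values(ascending=False)

print(top_one_seeds.head())
```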

## <a id='data-viz-reg-season'></a> Analyzing Regular Season Compact Results Table

In [165]:
all_csv['RegularSeasonCompactResults']['WMargin'] = all_csv['RegularSeasonCompactResults']["WScore"] - \
all_csv['RegularSeasonCompactResults']['LScore']

In [166]:
all_csv['RegularSeasonCompactResults'].head()

In [167]:
fig5, ax5 = plt.subplots(1,3)
fig5.set_size_inches((12,4))
sns.distplot(all_csv['RegularSeasonCompactResults'].WScore,label = "Winning Team",\
            ax = ax5[0],hist = True, kde = True)
sns.distplot(all_csv['RegularSeasonCompactResults'].LScore,label = "Losing Team",\
            ax = ax5[0], hist = True, kde = True)
sns.distplot(all_csv['RegularSeasonCompactResults'].WMargin,label = "Score Margin",\
            ax = ax5[0], hist = True, kde = True)
sns.regplot(x = 'Season',y = 'WScore', ax = ax5[1],fit_reg = False,label = 'Winning Team',\
          data = all_csv['RegularSeasonCompactResults'].groupby(["Season"])['WScore'].mean().reset_index())
sns.regplot(x = 'Season',y = 'LScore', ax = ax5[1],fit_reg = False,label = 'Losing Team',\
          data = all_csv['RegularSeasonCompactResults'].groupby(["Season"])['LScore'].mean().reset_index())
sns.regplot(x = 'Season',y = 'WMargin', ax = ax5[2],fit_reg = False,label = 'Score Margin',\
          data = all_csv['RegularSeasonCompactResults'].groupby(["Season"])['WMargin'].mean().reset_index())
#sns.boxplot(x="Season", y="Wscore", ax = ax5[2], \
#            data = all_csv['RegularSeasonCompactResults'].groupby(["Season"])['WScore','LScore'].mean().reset_index())
plt.tight_layout()
ax5[0].set_xlabel('Score')
ax5[1].set_ylabel('Yearly Average Score')
ax5[2].set_ylabel('Winning Team Margin - Yearly Average')
ax5[0].legend()
ax5[1].legend()
ax5[2].legend()
ax5[2].set_ylim([9,15])

As can be seen in the figures above (histograms), there were some blowouts but the yearly average score margin was between 10 and 13 points. 

In [168]:
fig6, ax6 = plt.subplots(1,1)
fig6.set_size_inches((6,4))
sns.boxplot(x = 'WLoc', y = 'WMargin',data = all_csv['RegularSeasonCompactResults'],ax = ax6)
sns.lmplot(x = 'WScore', y = 'LScore',hue = 'WLoc', fit_reg = False,\
              data = all_csv['RegularSeasonCompactResults'])
plt.tight_layout()
ax6.set_ylabel('Winning Team Margin')

The figures above tell us that the location of the game does make a difference: the median score margin for the winning team varies depending on whether the game is played at home, away, or at a neutral site.
Also, looking at the scatter plot, it makes sense that the trend for home wins is not as steep as the trend for away wins, since winning teams at home tend to blow out their opponents, while games won at an away site are more competitive.
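
The medians shown in the boxplot can be put into numbers with a `groupby`. A minimal sketch on made-up results (`results_demo` is hypothetical), using the same `WMargin` definition as above:

```python
import pandas as pd

# Hypothetical game results; scores are made up for illustration.
results_demo = pd.DataFrame({
    "WScore": [80, 75, 90, 70, 65, 88],
    "LScore": [60, 70, 72, 68, 60, 80],
    "WLoc": ["H", "A", "H", "A", "N", "N"],
})
results_demo["WMargin"] = results_demo["WScore"] - results_demo["LScore"]

# Median and mean winning margin, plus number of games, per location.
margin_by_loc = results_demo.groupby("WLoc")["WMargin"].agg(["median", "mean", "count"])

print(margin_by_loc)
```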

In [169]:
all_csv['Teams'][['TeamID','TeamName']].head()

In [170]:
temp4 = all_csv['RegularSeasonCompactResults'].merge(all_csv['Teams'][['TeamID','TeamName']],left_on = 'WTeamID',right_on = 'TeamID')
temp4.drop(['TeamID'],axis = 1,inplace = True)
temp4.head()

In [172]:
grid_kws = {"height_ratios": (.95, .01), "hspace": .03}
fig7, (ax7,cbar7) = plt.subplots(2, gridspec_kw=grid_kws)
fig7.set_size_inches((25,150))
ax7.yaxis.label.set_size(25)
ax7.xaxis.label.set_size(25)
ax7.tick_params(labelsize=15)
ax = sns.heatmap(data = temp4.groupby(['Season','TeamName'])['WMargin'].mean().unstack().fillna(0).transpose(), ax = ax7, cbar_ax=cbar7, linewidth=2,\
                  cbar_kws={"orientation": "horizontal"})
cbar7.tick_params(labelsize=15)
cbar7.set_title(label = 'Winning Team Margin',fontsize=30,loc = 'left')
plt.show()

Although there are a few seasons where a team wins by 25+ points on average, most teams win by between 10 and 15 points. There is not much variance in the winning margin, so this feature, along with the winning team score and the losing team score, might not be useful for prediction.
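
One way to put a number on "not much variance" is to compare the spread of the per-season averages with the spread of individual games. A rough sketch on made-up margins (`games_demo` is hypothetical):

```python
import pandas as pd

# Hypothetical winning margins; values are made up for illustration.
games_demo = pd.DataFrame({
    "Season": [2001, 2001, 2002, 2002, 2003, 2003],
    "WMargin": [10, 14, 11, 13, 9, 15],
})

# Average winning margin per season.
season_mean = games_demo.groupby("Season")["WMargin"].mean()

# Sample standard deviation of the season averages vs. of single games.
spread_of_means = season_mean.std()
spread_of_games = games_demo["WMargin"].std()

print(spread_of_means, spread_of_games)
```

If the season averages barely move while single games vary a lot, the average margin carries little season-to-season signal.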

## <a id='data-viz-ncaa-tour'></a> Analyzing NCAA Tourney Compact Results Table

In [173]:
all_csv['NCAATourneyCompactResults']['WMargin'] = all_csv['NCAATourneyCompactResults']["WScore"] - \
all_csv['NCAATourneyCompactResults']['LScore']

In [174]:
all_csv['NCAATourneyCompactResults'].head()

In [175]:
fig8, ax8 = plt.subplots(1,3)
fig8.set_size_inches((12,4))
sns.distplot(all_csv['NCAATourneyCompactResults'].WScore,label = "Winning Team",\
            ax = ax8[0],hist = True, kde = True)
sns.distplot(all_csv['NCAATourneyCompactResults'].LScore,label = "Losing Team",\
            ax = ax8[0], hist = True, kde = True)
sns.distplot(all_csv['NCAATourneyCompactResults'].WMargin,label = "Score Margin",\
            ax = ax8[0], hist = True, kde = True)
sns.regplot(x = 'Season',y = 'WScore', ax = ax8[1],fit_reg = False,label = 'Winning Team',\
          data = all_csv['NCAATourneyCompactResults'].groupby(["Season"])['WScore'].mean().reset_index())
sns.regplot(x = 'Season',y = 'LScore', ax = ax8[1],fit_reg = False,label = 'Losing Team',\
          data = all_csv['NCAATourneyCompactResults'].groupby(["Season"])['LScore'].mean().reset_index())
sns.regplot(x = 'Season',y = 'WMargin', ax = ax8[2],fit_reg = False,label = 'Score Margin',\
          data = all_csv['NCAATourneyCompactResults'].groupby(["Season"])['WMargin'].mean().reset_index())
#sns.boxplot(x="Season", y="Wscore", ax = ax5[2], \
#            data = all_csv['RegularSeasonCompactResults'].groupby(["Season"])['WScore','LScore'].mean().reset_index())
plt.tight_layout()
ax8[0].set_xlabel('Score')
ax8[1].set_ylabel('Yearly Average Score')
ax8[2].set_ylabel('Winning Team Margin - Yearly Average')
ax8[0].legend()
ax8[1].legend()
ax8[2].legend()
ax8[2].set_ylim([7,18])

Since this is the NCAA tourney, the 'Score Margin' distribution is not as right-skewed as in the regular season, considering that the best 64 teams are playing each other. Notice that the variance of the yearly average score margin increases, probably because there are fewer games to average over and smooth out the value. It could also be that, although there are not many blowouts (as the histogram indicates), the top-tier teams have their way against lower-ranked teams in the earlier rounds.
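
Both points above (skew of the margin and number of games per season) can be checked numerically. A minimal sketch, assuming two made-up frames (`regular_demo` and `tourney_demo`, both hypothetical) with `Season` and `WMargin` columns:

```python
import pandas as pd

# Hypothetical margins: the regular season has a long right tail, the tourney does not.
regular_demo = pd.DataFrame({
    "Season": [2001] * 6 + [2002] * 6,
    "WMargin": [1, 2, 3, 4, 5, 30, 2, 3, 4, 5, 6, 28],
})
tourney_demo = pd.DataFrame({
    "Season": [2001] * 3 + [2002] * 3,
    "WMargin": [2, 6, 10, 4, 8, 12],
})

# Sample skewness: positive values mean a longer right tail.
regular_skew = regular_demo["WMargin"].skew()
tourney_skew = tourney_demo["WMargin"].skew()

# Number of games that go into each yearly average.
games_per_season = pd.DataFrame({
    "regular": regular_demo.groupby("Season").size(),
    "tourney": tourney_demo.groupby("Season").size(),
})

print(regular_skew, tourney_skew)
print(games_per_season)
```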

In [176]:
fig9, ax9 = plt.subplots(1,1)
fig9.set_size_inches((6,4))
sns.boxplot(x = 'WLoc', y = 'WMargin',data = all_csv['NCAATourneyCompactResults'],ax = ax9)
sns.lmplot(x = 'WScore', y = 'LScore',hue = 'WLoc', fit_reg = False,\
              data = all_csv['NCAATourneyCompactResults'])
plt.tight_layout()
ax9.set_ylabel('Winning Team Margin')

Games in the NCAA tourney are played at neutral sites, and the score margin is similar to that of regular-season games played at neutral sites. Notice that the scatter plot of winning team score against losing team score resembles the regular-season scatter plot for away wins. This is probably because the best 64 teams are pitted against each other, so we would expect tighter games.

In [177]:
temp5 = all_csv['NCAATourneyCompactResults'].merge(all_csv['Teams'][['TeamID','TeamName']],left_on = 'WTeamID',right_on = 'TeamID')
temp5.drop(['TeamID'],axis = 1,inplace = True)
temp5.head()

In [179]:
grid_kws = {"height_ratios": (.95, .01), "hspace": .03}
fig10, (ax10,cbar10) = plt.subplots(2, gridspec_kw=grid_kws)
fig10.set_size_inches((25,150))
ax10.yaxis.label.set_size(25)
ax10.xaxis.label.set_size(25)
ax10.tick_params(labelsize=15)
ax = sns.heatmap(data = temp5.groupby(['Season','TeamName'])['WMargin'].mean().unstack().fillna(0).transpose(), ax = ax10, cbar_ax=cbar10, linewidth=2,\
                  cbar_kws={"orientation": "horizontal"})
cbar10.tick_params(labelsize=15)
cbar10.set_title(label = 'Winning Team Margin',fontsize=30,loc = 'left')
plt.show()

As in the regular season, we should expect score margins for the winning team to be quite low (apart from a few spots). Note that a few teams have score margins above 50 for specific seasons. It will be interesting to see whether those teams won the title that year.
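
The open question above could be explored along the following lines. This is only a sketch under two assumptions: that the table has a `DayNum` column, and that the champion is the winner of the last game (largest `DayNum`) of each season. The frame `tourney_demo` and its team IDs are made up.

```python
import pandas as pd

# Hypothetical tourney results; team IDs and scores are made up.
tourney_demo = pd.DataFrame({
    "Season":  [2001, 2001, 2001, 2002, 2002, 2002],
    "DayNum":  [136, 152, 154, 136, 152, 154],
    "WTeamID": [1, 1, 2, 3, 4, 3],
    "WScore":  [95, 70, 68, 80, 75, 77],
    "LScore":  [55, 65, 66, 60, 70, 70],
})
tourney_demo["WMargin"] = tourney_demo["WScore"] - tourney_demo["LScore"]

# Assumption: the champion is the winner of the last game of each season.
final_rows = tourney_demo.loc[tourney_demo.groupby("Season")["DayNum"].idxmax()]
champions = final_rows.set_index("Season")["WTeamID"]

# Team with the largest average winning margin in each season.
avg_margin = tourney_demo.groupby(["Season", "WTeamID"])["WMargin"].mean().reset_index()
top_rows = avg_margin.loc[avg_margin.groupby("Season")["WMargin"].idxmax()]
top_margin_team = top_rows.set_index("Season")["WTeamID"]

# Side-by-side comparison, aligned on Season.
comparison = pd.DataFrame({"Champion": champions, "TopMarginTeam": top_margin_team})
comparison["Same"] = comparison["Champion"] == comparison["TopMarginTeam"]

print(comparison)
```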

# <a id='conclusion'></a> Conclusion

Although the data analysis and visualization performed here are only for 'data section 1,' I intend to complete and publish another kernel for the remaining data sections in the following week before performing some machine learning. Feel free to comment and let me know where I can improve in terms of data analysis and visualization!  