# In this notebook we will perform an Exploratory Data Analysis of the College Basketball Dataset

### Description of the columns

* TEAM: The Division I college basketball school

* CONF: The Athletic Conference in which the school participates

* G: Number of games played

* W: Number of games won

* ADJOE: Adjusted Offensive Efficiency (An estimate of the offensive efficiency (points scored per 100 possessions) a team would have against the average Division I defense)

* ADJDE: Adjusted Defensive Efficiency (An estimate of the defensive efficiency (points allowed per 100 possessions) a team would have against the average Division I offense)

* BARTHAG: Power Rating (Chance of beating an average Division I team)

* EFG_O: Effective Field Goal Percentage Shot

* EFG_D: Effective Field Goal Percentage Allowed

* TOR: Turnover Percentage Allowed (Turnover Rate)

* TORD: Turnover Percentage Committed (Steal Rate)

* ORB: Offensive Rebound Percentage

* DRB: Defensive Rebound Percentage

* FTR : Free Throw Rate (How often the given team shoots Free Throws)

* FTRD: Free Throw Rate Allowed

* 2P_O: Two-Point Shooting Percentage

* 2P_D: Two-Point Shooting Percentage Allowed

* 3P_O: Three-Point Shooting Percentage

* 3P_D: Three-Point Shooting Percentage Allowed

* ADJ_T: Adjusted Tempo (An estimate of the tempo (possessions per 40 minutes) a team would have against the team that wants to play at an average Division I tempo)

* WAB: Wins Above Bubble (The bubble refers to the cut off between making the NCAA March Madness Tournament and not making it)

* POSTSEASON: Round where the given team was eliminated or where their season ended 

* SEED: Seed in the NCAA March Madness Tournament

* YEAR: Season

## Importing Dependencies

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv("../input/college-basketball-dataset/cbb.csv")

In [None]:
df.describe()

In [None]:
print(df.shape)

In [None]:
print(df.info())

### We can see that the dataset does not contain any null values
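
`df.info()` reports non-null counts, but a per-column count of missing values is easier to read. The sketch below runs on a small made-up frame (`toy` is a hypothetical stand-in for `df`, and its values are not taken from the dataset). Since POSTSEASON and SEED are only defined for teams that reached the tournament, those are the columns where gaps would be most likely to appear.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `df`; the values are made up for illustration.
toy = pd.DataFrame({
    "TEAM": ["A", "B", "C"],
    "G": [30, 31, 29],
    "POSTSEASON": ["R64", np.nan, np.nan],
    "SEED": [12.0, np.nan, np.nan],
})

# Count missing values per column and keep only the columns that have any.
null_counts = toy.isnull().sum()
print(null_counts[null_counts > 0])
```

On the real data, the same check is `df.isnull().sum()`.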

In [None]:
df.columns

#### Let us calculate the win ratio (games won / games played) for each row

In [None]:
df['W_ratio'] = df['W'] / df['G']

In [None]:
df.head()

#### Suppose our task is to predict the win ratio of a team
#### We will now look at how W_ratio relates to the other parameters

#### Let's delete the columns G and W, which are no longer needed because W_ratio is derived from them

In [None]:
del df['G']

In [None]:
del df['W']

In [None]:
df.columns

In [None]:
df.shape

In [None]:
df.head(20)

In [None]:
df['POSTSEASON'].unique()

In [None]:
df['POSTSEASON'].value_counts()

### Since the 'POSTSEASON' column is a categorical variable, we can map its values to numeric codes

In [None]:
# Each round gets its own code, matching the legend listed later in this notebook
d = {'Champions' : 1, '2ND' : 2, 'F4' : 3, 'E8' : 4, 'S16' : 5, 'R32' : 6, 'R64' : 7, 'R68' : 8}
df['POSTSEASON'] = df['POSTSEASON'].map(d)
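
`Series.map` returns NaN for any value that is not a key of the dictionary, and for rows that are already missing. The sketch below, on a made-up Series (`toy` and the label `"N/A"` are hypothetical), uses the codes from the legend in this notebook and shows one way to spot labels the dictionary does not cover before they silently turn into NaN.

```python
import numpy as np
import pandas as pd

# Codes as listed in the notebook's legend.
rounds = {"Champions": 1, "2ND": 2, "F4": 3, "E8": 4,
          "S16": 5, "R32": 6, "R64": 7, "R68": 8}

# Toy column: three valid labels, one missing value, one unexpected label.
toy = pd.Series(["Champions", "E8", "R68", np.nan, "N/A"], name="POSTSEASON")

# Labels present in the data but absent from the dictionary.
unmapped_labels = set(toy.dropna().unique()) - set(rounds)

# Unknown labels and missing values both become NaN after mapping.
mapped = toy.map(rounds)
print(unmapped_labels)
print(mapped)
```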

In [None]:
df.head(10)

In [None]:
df['CONF'].value_counts()

In [None]:
df['Win_prob_.5'] = df['W_ratio'] >= 0.5

In [None]:
df.head(10)

In [None]:
df['Win_prob_.5'].value_counts()

In [None]:
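# Restrict to numeric and boolean columns; text columns such as TEAM and CONF cannot be correlated
df.select_dtypes(include=['number', 'bool']).corr()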

In [None]:
pd.crosstab(df['Win_prob_.5'], df['POSTSEASON'])

* Champions : 1
* 2ND : 2
* F4 : 3
* E8 : 4
* R68 : 8
* S16 : 5
* R32 : 6
* R64 : 7

#### We can see that if a team belongs to R64 then its win probability is less than 0.5
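
Raw counts in a crosstab are hard to compare when the rounds contain very different numbers of teams. A minimal sketch, assuming made-up data (`toy` is hypothetical), of the same table normalized per column, so that each POSTSEASON code shows the share of teams above and below the 0.5 threshold:

```python
import pandas as pd

# Toy data: win flag and numeric POSTSEASON code (7 = R64, 8 = R68, ...).
toy = pd.DataFrame({
    "Win_prob_.5": [True, True, True, False, True, False],
    "POSTSEASON": [7, 7, 6, 7, 1, 8],
})

# normalize="columns" makes every POSTSEASON column sum to 1.
shares = pd.crosstab(toy["Win_prob_.5"], toy["POSTSEASON"], normalize="columns")
print(shares)
```

On the real data, replacing `toy` with `df` gives the share of teams with a win ratio of at least 0.5 within each round.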

In [None]:
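# Restrict to numeric and boolean columns; text columns such as TEAM and CONF cannot be correlated
corr_mat = df.select_dtypes(include=['number', 'bool']).corr()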

In [None]:
corr_mat['Win_prob_.5']

In [None]:
plt.figure(figsize=(20,10))
sns.heatmap(corr_mat)

In [None]:
corr_mat['W_ratio']

We can see that the win ratio is strongly correlated with ADJOE, BARTHAG, EFG_O, 2P_O, and 3P_O
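
Reading a long correlation column by eye is error-prone, and strong negative correlations are easy to overlook. The sketch below, on made-up numbers (`toy`, `corr_with_target`, and `ranked` are hypothetical names), ranks the features by the absolute value of their correlation with `W_ratio` while keeping the sign visible.

```python
import pandas as pd

# Toy data: one target and three candidate features.
toy = pd.DataFrame({
    "W_ratio": [0.2, 0.4, 0.5, 0.7, 0.9],
    "ADJOE":   [95.0, 100.0, 104.0, 110.0, 118.0],
    "ADJDE":   [110.0, 105.0, 101.0, 97.0, 92.0],
    "ADJ_T":   [68.0, 71.0, 66.0, 70.0, 67.0],
})

# Correlation of every feature with the target, excluding the target itself.
corr_with_target = toy.corr()["W_ratio"].drop("W_ratio")

# Order by strength (absolute value) but keep the signed values.
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

On the real data the same ranking can be applied to `corr_mat['W_ratio']`.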

In [None]:
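# distplot is deprecated in recent seaborn releases; histplot with kde=True draws the same kind of plot
sns.histplot(df['ADJOE'], kde=True)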

In [None]:
sns.boxplot(x = "Win_prob_.5", y = "ADJOE", data = df)

#### We can see that the median Adjusted Offensive Efficiency of winning teams is higher than that of losing teams

In [None]:
sns.boxplot(x = "Win_prob_.5", y = "ADJDE", data = df)

#### We can see that the median Adjusted Defensive Efficiency of winning teams is lower than that of losing teams (lower is better, since it measures points allowed)

* ## From this we can conclude that if a team's ADJOE is greater than its ADJDE, its chances of winning are higher
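
The two boxplots look at ADJOE and ADJDE separately, so the conclusion about their difference can be checked more directly. A minimal sketch on made-up values (`toy`, `eff_margin`, and `positive_margin` are hypothetical names, not columns of the dataset): compute the margin per team and compare the share of winning teams on either side of zero.

```python
import pandas as pd

# Toy data: offensive and defensive efficiency plus the win flag.
toy = pd.DataFrame({
    "ADJOE": [112.0, 98.0, 105.0, 101.0],
    "ADJDE": [95.0, 104.0, 99.0, 106.0],
    "Win_prob_.5": [True, False, True, False],
})

# Efficiency margin: points scored minus points allowed per 100 possessions.
toy["eff_margin"] = toy["ADJOE"] - toy["ADJDE"]
toy["positive_margin"] = toy["eff_margin"] > 0

# Share of teams with a win ratio >= 0.5 in each margin group.
win_share = toy.groupby("positive_margin")["Win_prob_.5"].mean()
print(win_share)
```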

In [None]:
df_melt = pd.melt(frame = df , id_vars = ['Win_prob_.5'], value_vars = ['ADJOE'])

In [None]:
df_melt.head()

In [None]:
ax = sns.violinplot(x = "variable", y = "value", hue = "Win_prob_.5", data = df_melt , palette="muted", split=True)

Now let's look at the relationship between Win_prob_.5 and Power Rating (BARTHAG)

In [None]:
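# distplot is deprecated in recent seaborn releases; histplot with kde=True draws the same kind of plot
sns.histplot(df['BARTHAG'], kde=True)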

In [None]:
sns.boxplot(x = "Win_prob_.5", y = "BARTHAG", data = df)

* ## We can also conclude from the box plot above that the higher the power rating, the higher the chances of winning

In [None]:
sns.boxplot(x = "Win_prob_.5", y = "EFG_O", data = df)

In [None]:
sns.boxplot(x = "Win_prob_.5", y = "EFG_D", data = df)

In [None]:
df['shot_diff'] = df['EFG_O'] - df['EFG_D']

In [None]:
sns.boxplot(x = "Win_prob_.5", y = "shot_diff", data = df)

* ## We can also conclude that if a team's Effective Field Goal Percentage Shot is higher than its Effective Field Goal Percentage Allowed, its probability of winning is higher

Now let's see the relationship between Win_prob_.5 and Turnover Percentage

In [None]:
sns.boxplot(x = "Win_prob_.5", y = "TOR", data = df)

In [None]:
sns.boxplot(x = "Win_prob_.5", y = "TORD", data = df)

* ## There is no strong relationship between Win_prob_.5 and either turnover measure

Relationship between Win_prob_.5 and Rebound Percentage

* Offensive Rebound Percentage

In [None]:
sns.boxplot(x = "Win_prob_.5", y = "ORB", data = df)

* Defensive Rebound Percentage

In [None]:
sns.boxplot(x = "Win_prob_.5", y = "DRB", data = df)

* Let's see how the difference between these two relates to Win_prob_.5

In [None]:
df['rebound_diff'] = df['ORB'] - df['DRB']

In [None]:
sns.boxplot(x = "Win_prob_.5", y = "rebound_diff", data = df)

* ## We cannot directly infer a relationship from this plot

Let's see the relationship between Win_prob_.5 and Free Throw Rate

In [None]:
sns.boxplot(x = 'Win_prob_.5' , y = 'FTR', data = df)

In [None]:
sns.boxplot(x = 'Win_prob_.5', y = 'FTRD', data = df)

In [None]:
df['throw_diff'] = df['FTR'] - df['FTRD']

In [None]:
sns.boxplot(x = 'Win_prob_.5', y = 'throw_diff', data = df)

* ## We cannot infer a clear relationship between the two variables

Now let's see the relationship between winning and two-point and three-point shooting

In [None]:
sns.boxplot(x = 'Win_prob_.5', y = '2P_O', data = df)

In [None]:
sns.boxplot(x = 'Win_prob_.5', y = '2P_D', data = df)

In [None]:
df['2p_diff'] = df['2P_O'] - df['2P_D']

In [None]:
sns.boxplot(x = 'Win_prob_.5', y = '2p_diff', data = df)

In [None]:
sns.boxplot(x = 'Win_prob_.5', y = '3P_O', data = df)

In [None]:
sns.boxplot(x = 'Win_prob_.5', y = '3P_D', data = df)

In [None]:
df.columns

In [None]:
df['3p_diff'] = df['3P_O'] - df['3P_D']

In [None]:
sns.boxplot(x = 'Win_prob_.5', y = '3p_diff', data = df)

In [None]:
df['Win_prob_.5'] = df['Win_prob_.5'].astype(int)

In [None]:
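# distplot is deprecated in recent seaborn releases; histplot with kde=True draws the same kind of plot
sns.histplot(df[df['Win_prob_.5'] == 1]['3p_diff'], kde=True)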

In [None]:
df_win = df[df['Win_prob_.5'] == 1]

In [None]:
df_win['3p_diff'].hist()

In [None]:
df_loss = df[df['Win_prob_.5'] == 0]

In [None]:
df_loss['3p_diff'].hist()

In [None]:
df_win['2p_diff'].hist()

In [None]:
df_loss['2p_diff'].hist()

In [None]:
df['2p_diff'].describe()

* ## We can also conclude that if a team's Two-Point and Three-Point Shooting Percentages are higher than its Two-Point and Three-Point Shooting Percentages Allowed, its chances of winning are higher

Let's see how the winning probability and Wins Above Bubble are related

In [None]:
sns.boxplot(x = 'Win_prob_.5' , y = 'WAB' , data = df)

* ## We can clearly see that the higher the WAB, the higher the chances of winning

# CONCLUSION OF EDA

* ### The higher the WAB, the higher the chances of winning

* ### If a team's Two-Point and Three-Point Shooting Percentages are higher than its Two-Point and Three-Point Shooting Percentages Allowed, its chances of winning are higher

* ### If a team's Effective Field Goal Percentage Shot is higher than its Effective Field Goal Percentage Allowed, its probability of winning is higher

* ### The higher the power rating, the higher the chances of winning

* ### If a team's ADJOE is greater than its ADJDE, its chances of winning are higher
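
The conclusions above were read off individual boxplots. A rough sketch of how the same comparison could be put into a single table, using made-up values (`toy`, `medians`, and `gap` are hypothetical names): group by the win flag, take the median of each feature, and look at the gap between the two groups.

```python
import pandas as pd

# Toy data: win flag (1 = win ratio >= 0.5) and a few of the features used above.
toy = pd.DataFrame({
    "Win_prob_.5": [1, 1, 0, 0],
    "shot_diff": [4.0, 2.0, -1.0, -3.0],
    "2p_diff": [3.0, 1.0, -2.0, 0.0],
    "WAB": [2.5, 0.5, -6.0, -8.0],
})

# Median of every feature within each group.
medians = toy.groupby("Win_prob_.5").median()

# Positive gap: the winning group has the higher median for that feature.
gap = medians.loc[1] - medians.loc[0]
print(gap)
```

The features are on different scales, so the gaps are comparable in sign but not in size unless the columns are standardized first.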