**In this notebook, the goal is to see whether we can find trends that explain why teams win, and whether those trends also translate into success in the NCAA Tournament.**

Going in, I have one hypothesis: the greater the difference between offensive and defensive efficiency, the more wins overall and the more success in the NCAA Tournament (playoffs). The dataset provides a variety of stats, but I believe that ADJOE (Adjusted Offensive Efficiency) and ADJDE (Adjusted Defensive Efficiency) sum up the majority of the columns. Therefore, I will focus on ADJOE, ADJDE, wins, WAB (Wins Above Bubble), and POSTSEASON.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import plotly.express as px

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

**When I read the data into a DataFrame, I will set `index_col=0` so that the first column (the team name) becomes the index and we can see the data by team instead of by row number.**

In [None]:
df = pd.read_csv('/kaggle/input/college-basketball-dataset/cbb.csv', index_col = 0)

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

**Using the describe function, we can see some interesting things. First, the average team has about 16 wins in 31 games, a little above a 50% win ratio. The remaining columns in the DataFrame are consistent with this: the average offensive and defensive efficiency are both right around 103, meaning that teams score and give up roughly the same number of points per 100 possessions. The other interesting column to me is WAB (Wins Above Bubble), which measures how many more or fewer games a team won than a typical bubble team (one right at the tournament cutoff) would be expected to win against the same schedule. Since most teams do not make the tournament, we should expect the average WAB to be negative.**

In [None]:
df.describe()

**Let's add columns for the win ratio and for the difference between ADJOE and ADJDE. We would expect that if the efficiency difference is positive, the team has won more games than it lost, and vice versa.**

In [None]:
df['WIN_RATIO'] = df['W'] / df['G']

In [None]:
df['Eff_DIFF'] = df['ADJOE'] - df['ADJDE']

**Now let's re-examine the data to see if anything new stands out.**

In [None]:
df.head(10)

**After using the head function, we see that these teams are all Champions or runners-up in the NCAA Tournament. I also see that they have an efficiency difference of at least 24 and a win ratio of at least 0.8. This was to be expected. Let's now look at the mean of our new columns.**

In [None]:
df.describe()

**Nothing new; this just confirms our prior assumptions. What about a random sample of teams?**

In [None]:
df.sample(20)

**To me, the sample data is more interesting. One thing I see that is even more interesting than win ratio vs. efficiency difference is efficiency difference vs. WAB. Let's see if we can visualize a relationship.**

In [None]:
px.scatter(df, x='WIN_RATIO', y='Eff_DIFF',trendline='ols',color='W')

In [None]:
px.scatter(df, x='WAB', y='Eff_DIFF', trendline='ols',color='W')

**It looks like the relationship between WAB and efficiency difference is more linear than the one between win ratio and efficiency difference.**
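
One way to put a number on "more linear" is the Pearson correlation of each variable with the efficiency difference. The sketch below uses a small made-up table (the values are hypothetical, not taken from `cbb.csv`) with the same column names as the notebook; in practice the same `corr` call could be run on `df` directly.

```python
import pandas as pd

# Hypothetical toy rows for illustration; the real notebook would use the df loaded from cbb.csv.
toy = pd.DataFrame(
    {
        "W": [33, 28, 20, 15, 10, 6],
        "G": [38, 35, 32, 31, 30, 30],
        "ADJOE": [123.0, 117.5, 108.0, 103.0, 98.0, 93.5],
        "ADJDE": [94.0, 95.5, 100.0, 103.5, 106.0, 110.0],
        "WAB": [10.5, 6.0, -1.5, -6.0, -11.0, -16.5],
    }
)
toy["WIN_RATIO"] = toy["W"] / toy["G"]
toy["Eff_DIFF"] = toy["ADJOE"] - toy["ADJDE"]

# Pearson correlation matrix, then keep only each variable's correlation with Eff_DIFF.
corr = toy[["WIN_RATIO", "WAB", "Eff_DIFF"]].corr(method="pearson")
eff_corr = corr["Eff_DIFF"].drop("Eff_DIFF")
print(eff_corr)
```

A correlation closer to 1 (or -1) indicates a tighter straight-line relationship, so comparing the two values gives a quick check on what the scatter plots suggest.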

**I wonder if we can predict whether a team will make the tournament based on efficiency difference and WAB.**
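
As a first, very rough pass at that question, here is a sketch of a naive rule rather than a fitted model: predict a tournament bid whenever WAB is positive, assuming a positive WAB means the team finished above the tournament cutoff. The rows and the postseason labels below are hypothetical; a missing POSTSEASON value stands for a team that did not make the tournament.

```python
import pandas as pd

# Hypothetical toy rows; POSTSEASON is missing for teams that did not make the tournament.
toy = pd.DataFrame(
    {
        "WAB": [9.0, 4.5, 1.0, -0.5, -3.0, -8.0, -12.0, 2.0],
        "POSTSEASON": ["Champions", "S16", "R64", "R68", None, None, None, "R32"],
    }
)
toy["MADE_TOURNEY"] = toy["POSTSEASON"].notna()

# Naive rule (an assumption, not a trained classifier): predict a bid when WAB > 0.
toy["PREDICTED"] = toy["WAB"] > 0

# Share of teams where the rule agrees with what actually happened.
accuracy = (toy["PREDICTED"] == toy["MADE_TOURNEY"]).mean()
confusion = pd.crosstab(toy["MADE_TOURNEY"], toy["PREDICTED"])
print(f"accuracy: {accuracy:.3f}")
print(confusion)
```

A baseline like this is mainly useful as a yardstick: a real classifier that also uses the efficiency difference would need to beat it to be worth the extra complexity.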

**Now let's visualize how total wins relate to WAB.**

In [None]:
px.scatter(df, x='WAB', y='W',trendline='ols')

**Again, WAB seems to be a pretty good predictor of success in a season.** 

**Maybe we can find out whether offensive or defensive efficiency is a better predictor of WAB.**

In [None]:
px.scatter(df, x='WAB', y='ADJOE', trendline='ols', color='WAB')

In [None]:
px.scatter(df, x='WAB', y='ADJDE', trendline='ols', color='WAB')

**R-squared seems to be a bit higher for offensive efficiency. For fun, let's see if we can spot any trends between efficiency and wins.**
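
Instead of reading R-squared off the plot hover text, it can be computed directly. The helper below, `r_squared`, is a hypothetical name introduced here; it fits a one-variable least-squares line with NumPy and returns the share of variance explained. The sample values are made up for illustration.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of a one-variable least-squares line fitted to (x, y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # polyfit with deg=1 returns the slope first, then the intercept.
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical values for illustration only (not taken from cbb.csv).
wab = [10.5, 6.0, -1.5, -6.0, -11.0, -16.5]
adjoe = [123.0, 117.5, 108.0, 103.0, 98.0, 93.5]
adjde = [94.0, 97.5, 99.0, 104.5, 104.0, 110.0]

r2_offense = r_squared(wab, adjoe)
r2_defense = r_squared(wab, adjde)
print(f"ADJOE vs WAB: R^2 = {r2_offense:.3f}")
print(f"ADJDE vs WAB: R^2 = {r2_defense:.3f}")
```

On the real data this would be called as `r_squared(df['WAB'], df['ADJOE'])` and `r_squared(df['WAB'], df['ADJDE'])`, which makes the offense vs. defense comparison explicit.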

In [None]:
px.scatter(df, x='ADJOE', y='W', trendline='ols', color='ADJDE')

In [None]:
px.scatter(df, x='ADJDE', y='W',trendline='ols',color='ADJOE')

In [None]:
px.scatter(df, x='Eff_DIFF', y= 'W',color='ADJDE', trendline='ols')

**As the scatter plots show pretty clearly, efficiency difference and WAB seem to be the best predictors of wins.**

**Instead of looking at all the teams together, let's add another dimension to the data: did the team make the tournament? First we need to fill in the null values in the POSTSEASON column, which we saw earlier using the info function.**

In [None]:
df['POSTSEASON'] = df['POSTSEASON'].fillna('Did not make tourney')

In [None]:
df['POSTSEASON'].sample(20)

In [None]:
px.histogram(df, x='POSTSEASON', color ='W')

**Now that the null values are filled, POSTSEASON can be used as a category for every team. Can we find anything interesting?**

In [None]:
px.scatter(df,x='ADJDE', y='ADJOE',color='POSTSEASON')

**Using this scatter plot, we can definitely see some clustering. In the legend on the right, you can choose which data points to show based on postseason placement.**

**One thing that jumps out to me is that ADJOE seems to be significant for winning the Championship. Comparing the teams that came in 2nd with the teams that won the Championship, we see that most of the time the teams with a lower ADJDE (defensive efficiency) don't win the Championship.**
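
The champions vs. runners-up comparison can also be summarized numerically rather than read off the plot. This is a sketch on hypothetical rows; the labels `"Champions"` and `"2ND"` are assumed to match the POSTSEASON values in the dataset, and the numbers are invented, so the output is not evidence for the claim above.

```python
import pandas as pd

# Hypothetical toy rows; labels are assumed to match the POSTSEASON values in cbb.csv.
toy = pd.DataFrame(
    {
        "ADJOE": [123.5, 121.0, 125.0, 116.0, 118.5, 114.0],
        "ADJDE": [92.0, 93.5, 94.0, 90.0, 91.5, 89.5],
        "POSTSEASON": ["Champions", "Champions", "Champions", "2ND", "2ND", "2ND"],
    }
)

# Keep only the two finalists' groups, then summarize offense and defense per group.
finalists = toy[toy["POSTSEASON"].isin(["Champions", "2ND"])]
summary = finalists.groupby("POSTSEASON")[["ADJOE", "ADJDE"]].agg(["mean", "median"])
print(summary)
```

With only one champion and one runner-up per season, the groups in the real data are small, so the medians are worth looking at alongside the means.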

In [None]:
px.scatter(df,x='Eff_DIFF', y='W',color='POSTSEASON')

**Finally, let's see what the top ADJOE and ADJDE are including postseason placement.** 

In [None]:
place_ADJOE = df[['ADJOE', 'POSTSEASON']]
place_ADJOE.sort_values(by='ADJOE', ascending=False).head(20)

In [None]:
place_ADJDE = df[['ADJDE', 'POSTSEASON']]
place_ADJDE.sort_values(by='ADJDE', ascending=True).head(20)

From the scatter plots and the last two tables, it looks like **defense gets you to the tournament, but offense wins you championships.**

Out of curiosity, I wonder if there is a relationship between offensive efficiency (ADJOE) and tempo (ADJ_T). Intuitively, I would assume that a higher tempo leads to more points scored per 100 possessions.

In [None]:
px.scatter(df, x='ADJOE', y='ADJ_T', trendline="ols", color='W')

No real trend between tempo and offensive efficiency. This was surprising to me, personally. What about on defense? Maybe a slower tempo means fewer points given up, since the team's time of possession should be higher than its opponent's.

In [None]:
px.scatter(df, x='ADJ_T', y='ADJDE', trendline="ols", color ='W')

**Tempo doesn't seem to make a difference in how efficient a team is.**
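
A flat trendline can hide differences between groups, so another way to check this is to bin teams by tempo and compare average efficiency per bin. The sketch below uses made-up values and three equal-sized bins from `pd.qcut`; the bin labels are my own choice, not something defined in the dataset.

```python
import pandas as pd

# Hypothetical toy rows for illustration only (not taken from cbb.csv).
toy = pd.DataFrame(
    {
        "ADJ_T": [62.0, 63.5, 65.0, 66.0, 67.5, 68.0, 69.5, 71.0, 72.5],
        "ADJOE": [104.0, 99.5, 108.0, 101.0, 106.5, 98.0, 103.5, 107.0, 100.5],
        "ADJDE": [101.0, 105.5, 98.0, 103.0, 99.5, 106.0, 102.0, 100.0, 104.5],
    }
)

# Split teams into three equal-sized tempo groups (terciles).
toy["TEMPO_BIN"] = pd.qcut(toy["ADJ_T"], q=3, labels=["slow", "medium", "fast"])

# Average offensive and defensive efficiency within each tempo group.
by_tempo = toy.groupby("TEMPO_BIN", observed=True)[["ADJOE", "ADJDE"]].mean()
print(by_tempo)
```

If the per-bin averages on the real data are close to each other, that would support the conclusion that tempo has little to do with efficiency.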