# Let's work together

## Exploratory Data Analysis on USA College Basketball (2015-2019) dataset.

![NCAA_Champions](http://richiewong.co.uk/wp-content/uploads/2019/12/usa_today_10761836.0.jpg)

Read more: https://www.ncaa.com/history/basketball-men/d1

Welcome to this notebook, which is intended to be easy to read and understand, with intuitive visualisations.

I have been playing and watching basketball from a young age, and I am keen to explore this dataset on college basketball in the USA.

The end goal is to learn the history and trends of college basketball and to find the formula for a successful team. As I am from the UK, I hope this is as interesting for you as it is for me.

Feel free to fork this notebook for your own learning, edit the code, or use it in your own submissions. If it enriched your learning in the slightest, please **upvote** it as an encouragement for me to continue writing notebooks! :)

Thanks to Andrew Sundberg for uploading this dataset, which is very clean and easy to work with.

Table of Contents
* [Objectives](#intro)
    - [Types of Data in Dataset](#typesofdata)


* [Exploratory Data Analysis](#exploratorydataanalysis)
    - [Import Libraries and file](#import)
    - [First look at 2019 Dataset](#df2019)
    - [Number of games played](#gamesplayed)
    - [Number of teams in each conference](#conferences)
    - [Number of teams that qualify for March Madness](#qualify)
    - [Heatmap Matrix of the correlations](#correlation)
    - [Winners of the NCAA Basketball Tournament](#winners)
    - [Scatterplot for how to qualify to NCAA Tournament](#relationshipqualify)
    
        
        
* [What's Next](#conclusion)

<a id="intro"></a>
### Objectives

**EDA - Curious Questions - interested to know what you would suggest in the discussion below**
* How many teams are there in each conference for 2019?
* How many teams make it to March Madness 2019?
* Who are the winners in March Madness for 2019 and for the historical years? (PostSeason Field)

**What are the success factors for a contending team to qualify or reach the Final Four, and could an underdog be successful in March Madness 2015-2019?**
* Correlation matrix of the non-categorical fields

**What is more valuable, offensive or defensive rating?**
* Scatter plot of the teams in terms of Offensive rating and Defensive Rating - Highlight the qualifiers in colour (Effective Field Goal Percentage Shot/Allowed)

<a id="typesofdata"></a>
### Table of feature types


Description of the columns:

TEAM: The Division I college basketball school

CONF: The athletic conference in which the school participates

G: Number of games played

W: Number of games won

ADJOE: Adjusted Offensive Efficiency (An estimate of the offensive efficiency (points scored per 100 possessions) a team would have against the average Division I defense)

ADJDE: Adjusted Defensive Efficiency (An estimate of the defensive efficiency (points allowed per 100 possessions) a team would have against the average Division I offense)

BARTHAG: Power Rating (Chance of beating an average Division I team)

EFG_O: Effective Field Goal Percentage Shot

EFG_D: Effective Field Goal Percentage Allowed

TOR: Turnover Percentage Allowed (Turnover Rate)

TORD: Turnover Percentage Committed (Steal Rate)

ORB: Offensive Rebound Percentage

DRB: Defensive Rebound Percentage

FTR: Free Throw Rate (How often the given team shoots Free Throws)

FTRD: Free Throw Rate Allowed

2P_O: Two-Point Shooting Percentage

2P_D: Two-Point Shooting Percentage Allowed

3P_O: Three-Point Shooting Percentage

3P_D: Three-Point Shooting Percentage Allowed

ADJ_T: Adjusted Tempo (An estimate of the tempo (possessions per 40 minutes) a team would have against the team that wants to play at an average Division I tempo)

WAB: Wins Above Bubble (The bubble refers to the cut off between making the NCAA March Madness Tournament and not making it)

POSTSEASON: Round where the given team was eliminated or where their season ended

SEED: Seed in the NCAA March Madness Tournament

YEAR: Season

<a id="import"></a>
# Import Libraries and file

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv("../input/college-basketball-dataset/cbb.csv")

# Only this file needs to be loaded, as this CSV contains all the years.

<a id="df2019"></a>
# College basketball in 2019

Let's first look at 2019 data to understand the format and number of teams

In [None]:
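# Filter the dataframe for the 2019 season.
# Assumes YEAR is stored as an integer; comparing against the string '2019' would match nothing.
# .copy() lets us add columns later without a SettingWithCopyWarning.
df_19 = df.loc[df['YEAR'] == 2019].copy()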

In [None]:
df_19.info()

In [None]:
df_19.sample(15)

In [None]:
df_19.describe()

In [None]:
print(df_19.shape)

In [None]:
print(df_19.info())

In [None]:
df_19.columns

<a id="gamesplayed"></a>
### Number of games played

Let's see if all teams have played the same number of games

In [None]:
df_19['Games_Played'] = df_19['G']  # G already holds the number of games played

In [None]:
sns.histplot(df_19['Games_Played'])  # histplot replaces the deprecated distplot

It's good to bear in mind that not all teams have played the same number of games

In [None]:
df_19['W_ratio'] = df_19['W'] / df_19['G']

In [None]:
df_19.sort_values(by='W_ratio', ascending=False).head(20)

<a id="conferences"></a>
### Number of teams in each conference

How many teams are there in each conference?

In [None]:
df_19['CONF'].value_counts()

In [None]:
df_19['CONF'].count()

<a id="qualify"></a>
### Number of teams that qualify for March Madness

How many teams make it to March Madness?

In [None]:
df_19['SEED'].notna().sum()
# df['SEED'].count()

In [None]:
d=df_19['SEED'].notna().sum()/df_19['TEAM'].count()
print("Percentage of college teams that make it to March Madness: {:.2%}".format(d))

In [None]:
df_19['POSTSEASON'].unique()

In [None]:
df_19['POSTSEASON'].value_counts()

In [None]:
df_19['SEED'].value_counts()

### Encoding data

As POSTSEASON is categorical data, it is better to encode it as integer values for further analysis. This can also be done for the SEED column.

In [None]:
# Ordinal encoding: 1 = went furthest (Champions), 8 = eliminated earliest (R68, the First Four)
d = {'Champions': 1, '2ND': 2, 'F4': 3, 'E8': 4, 'S16': 5, 'R32': 6, 'R64': 7, 'R68': 8}
df_19['POSTSEASON_Value'] = df_19['POSTSEASON'].map(d)

In [None]:
df_19.head(10)
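
The same idea can be applied to the SEED column. Here is a minimal sketch, assuming SEED holds the seed number for qualifying teams and is missing otherwise. The toy DataFrame and the `SEED_Value` column name are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are invented).
toy = pd.DataFrame({
    'TEAM': ['Team A', 'Team B', 'Team C'],
    'SEED': ['1', '16', None],  # None = did not qualify
})

# Convert to numbers; anything that is not a valid number becomes NaN.
# The nullable 'Int64' dtype keeps whole numbers alongside missing values.
toy['SEED_Value'] = pd.to_numeric(toy['SEED'], errors='coerce').astype('Int64')

print(toy)
```

Keeping the missing values as missing, rather than filling them with 0, avoids treating non-qualifiers as if they had a better seed than the top teams.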

<a id="correlation"></a>
# Heatmap Matrix of the correlations
What is more valuable, offensive or defensive rating?

In [None]:
plt.figure(figsize=(20, 20))  # set the figure size to 20 by 20 inches
# numeric_only=True skips text columns such as TEAM and CONF (needs pandas >= 1.5)
p = sns.heatmap(df_19.corr(numeric_only=True), annot=True, cmap='RdYlGn', square=True)

In [None]:
corr_mat = df_19.corr(numeric_only=True)

In [None]:
corr_mat['W_ratio']

so = corr_mat['W_ratio'].sort_values(kind="quicksort", ascending=False)

print(so)

We can see that the winning ratio is highly correlated with ADJOE and BARTHAG.
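
Because ADJDE counts points allowed, a strong defence shows up as a negative correlation with the win ratio, so sorting the raw values pushes it to the bottom of the list. One way to compare offence and defence on an equal footing is to rank by absolute correlation. Here is a minimal sketch on invented numbers; the toy values below are not from the dataset.

```python
import pandas as pd

# Invented values, chosen only to illustrate the ranking step.
toy = pd.DataFrame({
    'ADJOE':   [120.0, 112.0, 105.0, 100.0, 95.0],
    'ADJDE':   [90.0, 94.0, 99.0, 103.0, 108.0],
    'ADJ_T':   [68.0, 71.0, 66.0, 70.0, 67.0],
    'W_ratio': [0.90, 0.75, 0.55, 0.40, 0.20],
})

# Correlation of every column with the win ratio (the self-correlation is dropped).
corr_with_wins = toy.corr()['W_ratio'].drop('W_ratio')

# Order by strength of the relationship, ignoring its sign.
order = corr_with_wins.abs().sort_values(ascending=False).index
ranked = corr_with_wins.reindex(order)

print(ranked)
```

On the real data, the same two steps could be applied to `corr_mat['W_ratio']`.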

<a id="winners"></a>
# Winners of the NCAA Basketball Tournament

In [None]:
array1 = ['Champions']
df1=df.loc[df['POSTSEASON'].isin(array1)]

df1.sort_values(by='YEAR', ascending=False) #Filter for the Champions by each year

We can see that Villanova were champions in 2016 and 2018; they are the team pictured celebrating at the top of the notebook :-)

In [None]:
array2 = ['Virginia', 'Villanova', 'North Carolina','Duke']
df2=df.loc[df['TEAM'].isin(array2)]

df2.sort_values(['TEAM', 'YEAR'], ascending=[False, False]) #Filter by champion team history

From this query, we can see that the ADJDE and BARTHAG values for these teams are above 80 and 84% respectively. Bear in mind that ADJDE counts points allowed, so a lower value indicates a stronger defence.

ADJDE: Adjusted Defensive Efficiency (An estimate of the defensive efficiency (points allowed per 100 possessions) a team would have against the average Division I offense)

BARTHAG: Power Rating (Chance of beating an average Division I team)

<a id="relationshipqualify"></a>
# Scatterplot for how to qualify to NCAA Tournament

Let's first reduce the 2015-2019 data to a binary flag showing whether or not each team qualified.

In [None]:
df['Not_Qualified'] = pd.isna(df['SEED'])

In [None]:
sns.scatterplot(y=df['EFG_O'], x=df['EFG_D'], hue=df['Not_Qualified'])

The scatter plot suggests that, across 2015-2019, teams with a high Effective Field Goal Percentage Shot (EFG_O) and a low Effective Field Goal Percentage Allowed (EFG_D) have a higher chance of qualifying.
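
To put numbers behind the visual impression, the two groups can be compared directly. Here is a minimal sketch on invented values (the toy DataFrame is illustrative only), built around the same kind of `Not_Qualified` flag as above.

```python
import numpy as np
import pandas as pd

# Invented values: two qualifiers (with a seed) and two non-qualifiers (no seed).
toy = pd.DataFrame({
    'EFG_O': [55.0, 53.0, 49.0, 48.0],
    'EFG_D': [46.0, 47.0, 51.0, 52.0],
    'SEED':  [1.0, 8.0, np.nan, np.nan],
})
toy['Not_Qualified'] = toy['SEED'].isna()

# Average shooting efficiency for and against, split by qualification status.
summary = toy.groupby('Not_Qualified')[['EFG_O', 'EFG_D']].mean()

print(summary)
```

Running the same `groupby` on `df` would show how large the gap between the two groups is in the real data.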

<a id="conclusion"></a>

# Conclusion
There are a lot of interesting columns to explore for relationships with success.

We have explored the fields to identify the high-level metrics behind a team's success, e.g. the scatterplot of how to qualify for the NCAA tournament, which could be drilled into further and tracked across the years.

It's worth exploring how the NCAA tournament works, its rules, and how the teams match up against each other. That would help decide whether success should mean finishing well at the NCAA tournament or having the best win/game ratio.

As a basketball enthusiast, I know there are waaay more factors that determine a team's success; even the champion teams have lost some games! It can come down to the players (note that they generally play college basketball for 1-3 years), the coaching facilities, the head coach, and even how adaptable the team is during a game.

The game of basketball has changed and now puts much more emphasis on the 3-point shot, a shift that is actually driven by data analytics. This is why I find data analytics and data science so interesting: they have so many applications.

Thank you all for reading - Richie

You can connect with me on LinkedIn:
https://www.linkedin.com/in/richieone/

### What's Next
I intend to use the dataset to work on the following questions.

* Finding the underdog teams and how they progressed over the years.
* How college basketball has progressed over the years, e.g. the number of 3-point shots taken and how the average offensive and defensive ratings have changed.
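
As a possible starting point for the second question, season-by-season averages can be computed with a `groupby` on `YEAR`. Here is a minimal sketch on invented values; the real analysis would use `df` as loaded above. Note that `3P_O` is three-point shooting percentage, so it tracks accuracy rather than the number of shots taken.

```python
import pandas as pd

# Invented values for two seasons, two teams each.
toy = pd.DataFrame({
    'YEAR':  [2015, 2015, 2016, 2016],
    'ADJOE': [100.0, 104.0, 103.0, 107.0],
    'ADJDE': [101.0, 99.0, 102.0, 100.0],
    '3P_O':  [33.0, 35.0, 34.0, 36.0],
})

# Mean of each metric per season, one row per year.
yearly = toy.groupby('YEAR')[['ADJOE', 'ADJDE', '3P_O']].mean()

print(yearly)
```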