The objectives are:
* Data Processing
* Exploratory Data Analysis
* Outlier Treatment
* Visualisation
* Categorical Data Transformation


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing
import matplotlib.pyplot as plt # data visualisation
import seaborn as sns  # data visualisation
%matplotlib inline

In [None]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
#importing combined data: 
# cbb.csv

CBB = pd.read_csv("../input/college-basketball-dataset/cbb.csv")

In [None]:
CBB.head()

# Variable Insights

## Description of the columns

TEAM: The Division I college basketball school

CONF: The Athletic Conference in which the school participates

G: Number of games played

W: Number of games won

ADJOE: Adjusted Offensive Efficiency (An estimate of the offensive efficiency (points scored per 100 possessions) a team would have against the average Division I defense)

ADJDE: Adjusted Defensive Efficiency (An estimate of the defensive efficiency (points allowed per 100 possessions) a team would have against the average Division I offense)

BARTHAG: Power Rating (Chance of beating an average Division I team)

EFG_O: Effective Field Goal Percentage Shot

EFG_D: Effective Field Goal Percentage Allowed

TOR: Turnover Percentage Allowed (Turnover Rate)

TORD: Turnover Percentage Committed (Steal Rate)

ORB: Offensive Rebound Percentage

DRB: Defensive Rebound Percentage

FTR: Free Throw Rate (How often the given team shoots Free Throws)

FTRD: Free Throw Rate Allowed

2P_O: Two-Point Shooting Percentage

2P_D: Two-Point Shooting Percentage Allowed

3P_O: Three-Point Shooting Percentage

3P_D: Three-Point Shooting Percentage Allowed

ADJ_T: Adjusted Tempo (An estimate of the tempo (possessions per 40 minutes) a team would have against the team that wants to play at an average Division I tempo)

WAB: Wins Above Bubble (The bubble refers to the cut off between making the NCAA March Madness Tournament and not making it)

POSTSEASON: Round where the given team was eliminated or where their season ended

SEED: Seed in the NCAA March Madness Tournament

YEAR: Season

# Exploratory Data Analysis

Checking the data type of each column

In [None]:
CBB.info()

We can infer that three columns hold categorical data and the rest are numerical.

In [None]:
CBB.shape

In [None]:
#Checking for missing values

def missing_check(df):
    total = df.isnull().sum().sort_values(ascending=False)   # total number of null values per column
    percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)  # fraction of values that are null (multiply by 100 for a percentage)
    missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])  # putting the above two together
    return missing_data # return the dataframe
missing_check(CBB)

Two columns contain missing values:
* POSTSEASON has 1417 null values (80.6488%). It records the round where the given team was eliminated or where their season ended, so presumably it is empty for teams that did not play in the tournament.
* SEED has 1417 null values (80.6488%). It records the seed in the NCAA March Madness Tournament, so presumably it is empty for the same teams.
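
As a minimal sketch of what the `Percent` column holds, the toy frame below (made-up values, not taken from cbb.csv) computes the same quantity with `isnull().mean()`. The result is a fraction between 0 and 1, so a value such as 0.806488 reads as 80.6488%.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "SEED": [1.0, np.nan, np.nan, np.nan],
    "W": [30, 12, 18, 9],
})

total = toy.isnull().sum()        # number of nulls per column
fraction = toy.isnull().mean()    # share of nulls per column (0 to 1)
summary = pd.concat([total, fraction], axis=1, keys=["Total", "Percent"])
print(summary)
```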

To get insights year by year, we convert the YEAR column to a categorical data type.

In [None]:
CBB["YEAR"] = pd.Categorical(CBB["YEAR"])

In [None]:
CBB.describe()

In [None]:
#checking for skewness in the data
CBB.skew(numeric_only=True)  # numeric_only skips the text and categorical columns
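
A small sketch of how to read the sign of the skewness, using two made-up series rather than the basketball data: a long right tail gives a positive value, while symmetric data gives a value close to zero.

```python
import pandas as pd

# Hypothetical values: one large value stretches the right tail
right_tailed = pd.Series([1, 2, 2, 3, 3, 3, 20])
symmetric = pd.Series([1, 2, 3, 4, 5])

print(right_tailed.skew())  # positive: long tail on the right
print(symmetric.skew())     # approximately 0 for symmetric data
```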

In [None]:
CBB.columns

In [None]:
#checking for outliers in the data through boxplots
plt.figure(figsize= (25,10))
plt.subplot(19,1,1)
sns.boxplot(x=CBB.G , color='blue')

plt.subplot(19,1,2)
sns.boxplot(x= CBB.W, color='red')

plt.subplot(19,1,3)
sns.boxplot(x= CBB.ADJOE, color='green')

plt.subplot(19,1,4)
sns.boxplot(x=CBB.EFG_O , color='blue')

plt.subplot(19,1,5)
sns.boxplot(x= CBB.ADJDE, color='red')

plt.subplot(19,1,6)
sns.boxplot(x= CBB.BARTHAG, color='green')

plt.subplot(19,1,7)
sns.boxplot(x=CBB.EFG_D , color='blue')

plt.subplot(19,1,8)
sns.boxplot(x= CBB.TOR, color='red')

plt.subplot(19,1,9)
sns.boxplot(x= CBB.TORD, color='green')

plt.subplot(19,1,10)
sns.boxplot(x= CBB.ORB, color='red')

plt.subplot(19,1,11)
sns.boxplot(x= CBB.DRB, color='green')

plt.subplot(19,1,12)
sns.boxplot(x=CBB.FTR , color='blue')

plt.subplot(19,1,13)
sns.boxplot(x= CBB.FTRD, color='red')

plt.subplot(19,1,14)
sns.boxplot(x= CBB['2P_O'], color='green')

plt.subplot(19,1,15)
sns.boxplot(x=CBB['2P_D'], color='blue')

plt.subplot(19,1,16)
sns.boxplot(x= CBB['3P_O'], color='red')

plt.subplot(19,1,17)
sns.boxplot(x= CBB['3P_D'], color='green')

plt.subplot(19,1,18)
sns.boxplot(x=CBB.ADJ_T , color='blue')

plt.subplot(19,1,19)
sns.boxplot(x= CBB.WAB, color='red')

plt.show()

* The boxplots above show that outliers are present in all the variables.
* Hence, outlier treatment has to be done.
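
For context, seaborn's boxplot by default draws as individual points any value lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below reproduces that rule with NumPy on a made-up sample; the variable names are only illustrative.

```python
import numpy as np

# Hypothetical sample with one extreme value
values = np.array([20, 22, 23, 24, 25, 26, 27, 45])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the ones a boxplot shows as separate points
flagged = values[(values < lower_fence) | (values > upper_fence)]
print(flagged)
```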

In [None]:
CBB_Outlier_Treatment = CBB.drop(columns = ["TEAM", "CONF", "POSTSEASON","SEED","YEAR"])
CBB_Outlier_Treatment

In [None]:
from scipy import stats
z = np.abs(stats.zscore(CBB_Outlier_Treatment))   # get the z-score of every value with respect to their columns
print(z)

Looking at the code and the output above, it is difficult to say which data points are outliers. Let's define a threshold to identify them.

In [None]:
threshold = 3 # in a normal distribution, about 99.7% of values lie within 3 standard deviations of the mean
print("Row and column locations of outlier values:")
np.where(z > threshold)
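
To make the output of `np.where` easier to read, here is a minimal sketch on made-up data. It computes the z-scores by hand with NumPy (population standard deviation, which is also the default of `stats.zscore`) and shows that the result is a pair of arrays: row positions first, column positions second.

```python
import numpy as np

# Hypothetical 20 x 2 data set with a single extreme value at row 7, column 1
col0 = np.arange(20, dtype=float)
col1 = np.tile([10.0, 11.0, 12.0, 13.0], 5)
col1[7] = 100.0
data = np.column_stack([col0, col1])

# Absolute z-score of every value with respect to its own column
z_toy = np.abs((data - data.mean(axis=0)) / data.std(axis=0))

rows, cols = np.where(z_toy > 3)
print(rows, cols)  # row indices and column indices of the flagged cells
```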

In [None]:
print(z[0][0]) # for example

In [None]:
CBB_copy = CBB_Outlier_Treatment.copy()   # make a deep copy of the dataframe

# Replace all the outliers with the median of their column. This will create some new outliers, but we will ignore them

for i, j in zip(*np.where(z > threshold)):  # i iterates over rows and j over columns
    CBB_copy.iloc[i,j] = CBB_Outlier_Treatment.iloc[:,j].median()  # replace the (i, j)th element with the median of column j

In [None]:
z = np.abs(stats.zscore(CBB_copy))
np.where(z > threshold)  # New outliers detected after imputing the original outliers
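
Why do new outliers appear? Replacing extreme values shrinks the standard deviation, so values that were within 3 standard deviations before can fall outside afterwards. The sketch below illustrates this on a made-up one-dimensional sample; `abs_z` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

# Hypothetical sample: 28 small values, one moderate value (index 28), one extreme value (index 29)
values = np.array([-1.0, 1.0] * 14 + [10.0, 100.0])

def abs_z(x):
    """Absolute z-scores using the population standard deviation."""
    return np.abs((x - x.mean()) / x.std())

before = np.where(abs_z(values) > 3)[0]   # outliers in the raw sample

treated = values.copy()
treated[before] = np.median(values)       # replace them with the median

after = np.where(abs_z(treated) > 3)[0]   # outliers after the replacement
print(before, after)
```

In this sample only the extreme value is flagged at first; once it is replaced, the spread becomes much smaller and the moderate value is flagged instead.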


# Univariate Visualisation

In [None]:
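sns.histplot(CBB_Outlier_Treatment.G, kde=True, stat="density");  # histplot replaces the deprecated distplot

In [None]:
sns.histplot(CBB_Outlier_Treatment.W, kde=True, stat="density");

In [None]:
sns.histplot(CBB_Outlier_Treatment.ADJOE, kde=True, stat="density");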

* Each variable can be visualised like this.
* The plot shows a histogram with a kernel density estimate (KDE) curve superimposed, drawn with the seaborn package.
* seaborn chooses the class intervals (bins) automatically. The number of bins can also be set manually.
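
To see what a manually chosen number of class intervals means, the sketch below uses `np.histogram` on made-up values so that the bin edges and counts can be inspected directly. The `bins` argument of seaborn's histogram functions plays the same role; the data here is purely hypothetical.

```python
import numpy as np

# Hypothetical numbers of games played
games = np.array([28, 30, 31, 31, 32, 33, 33, 34, 35, 40])

# Ask for exactly 4 equal-width class intervals
counts, edges = np.histogram(games, bins=4)
print(edges)   # 5 edges delimit 4 bins
print(counts)  # number of values falling in each bin
```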

### Bivariate and univariate visualisation (the diagonals show the univariate plots)

In [None]:
sns.pairplot(CBB_Outlier_Treatment, kind= "reg"); 

Analysing the correlation between two variables

The Pearson correlation coefficient is a statistic that measures the linear correlation between two variables X and Y. It takes a value between +1 and −1.

* The closer the absolute value is to 1, the stronger the relationship, and vice versa.
* A negative value indicates a negative relationship.
* A positive value indicates a positive relationship.
* A value of zero means no linear relationship.
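
As a rough illustration of how the coefficient is obtained, the sketch below computes it by hand on two made-up arrays and compares the result with `np.corrcoef`. The values of `x` and `y` are hypothetical.

```python
import numpy as np

# Hypothetical paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Deviations from the mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson r: sum of products of deviations over the product of their magnitudes
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(r)
print(np.corrcoef(x, y)[0, 1])  # same value from NumPy
```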

In [None]:
CBB_Outlier_Treatment.corr() # Method = Pearson

Visualising the relationships between variables with a heatmap

In [None]:
plt.figure(figsize= (30,20))
sns.heatmap(CBB_Outlier_Treatment.corr(), annot = True);

# Building a pivot table for the categorical columns

In [None]:
pd.crosstab([CBB.TEAM,CBB.CONF,CBB.YEAR], CBB['W']).head(10)

In [None]:
pd.crosstab([CBB.TEAM,CBB.CONF,CBB.YEAR], CBB['W']).tail(10)
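
`pd.crosstab` counts how many rows fall into each combination of its row and column keys. A minimal sketch on a made-up frame (the conference labels and years are only illustrative):

```python
import pandas as pd

# Hypothetical rows, one per team season
toy = pd.DataFrame({
    "CONF": ["ACC", "ACC", "B10", "B10", "B10"],
    "YEAR": [2018, 2019, 2018, 2018, 2019],
})

# Rows: conference, columns: year, cells: number of matching rows
table = pd.crosstab(toy["CONF"], toy["YEAR"])
print(table)
```

In the cells above, the same idea applies with W as the column key, so each cell counts the rows of a given team, conference and year that have that number of wins.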


* Handling non-numeric (categorical) data through one-hot encoding
* One-hot encoding creates dummy variables: each category of a categorical variable becomes its own feature, set to 1 or 0 according to whether that category is present in the record.
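
A minimal sketch of the idea on a made-up frame, using the same `pd.get_dummies` call pattern as the cell that follows. The team names are hypothetical.

```python
import pandas as pd

# Hypothetical data: three rows, two distinct years
toy = pd.DataFrame({"TEAM": ["A", "B", "C"], "YEAR": [2018, 2019, 2018]})

# YEAR is replaced by one indicator column per distinct year
encoded = pd.get_dummies(toy, prefix="year", columns=["YEAR"])
print(encoded.columns.tolist())

# Recent pandas versions return booleans; cast to int to see 1/0
print(encoded["year_2018"].astype(int).tolist())
```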

In [None]:
CBB_dummies = pd.get_dummies(CBB, prefix='year', columns=['YEAR']) # get_dummies one-hot encodes the YEAR column

In [None]:
CBB_dummies.head()

Another way of doing this is label encoding, which is available as LabelEncoder in the sklearn library.
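
A short sketch of label encoding with sklearn's `LabelEncoder`, applied to a made-up list of years rather than the notebook's data. Each distinct value is mapped to an integer code, so no extra columns are created.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical season labels
years = [2018, 2019, 2018, 2020]

encoder = LabelEncoder()
codes = encoder.fit_transform(years)   # one integer code per distinct value

print(codes)
print(encoder.classes_)  # the original values, in code order
```

Note that the integer codes imply an ordering, which may or may not be meaningful for a given variable.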

In [None]:
CBB_dummies.corr(numeric_only=True) # now we can analyse the relationships between variables year-wise; numeric_only skips the text columns

In [None]:
plt.figure(figsize= (30,20))
sns.heatmap(CBB_dummies.corr(numeric_only=True), annot = True);

One major disadvantage of one-hot encoding (dummy variables), however, is that it adds extra columns.

Happy Learning