# **Basic Data Analytics**

## Objectives

* In this notebook, we will do some preliminary statistical analysis, such as a correlation study, as well as various visualizations.

## Inputs

* The input for this is the cleaned data from the last notebook, namely `'game_data_clean.csv'`.

## Outputs

* At the end, we will have various plots displaying the statistical relationship between different features of our dataset.



---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you will need to change the working directory.

We need to change the working directory from the notebook's current folder to its parent folder.
* We can check the current directory with `os.getcwd()`.

In [None]:
import os
home_dir = '/workspace/pp5-ml-dashboard'
csv_dir ='/workspace/pp5-ml-dashboard/outputs/datasets/clean/csv' 
os.chdir(home_dir)
current_dir = os.getcwd()
print(current_dir)

We now load our cleaned dataset as well as some of the packages that we will be using.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from src.utils import get_df

game_data = get_df('game_data_clean', 'datasets/clean/csv')


## Section 1: Profile report
We first examine a profile report based on the data.

In [None]:
game_data.head()

We are going to modify the data frame before doing any exploratory data analysis. We will drop metadata columns such as `'game_id'`, `'season_id'`, `'min'`, and the home and away `'team_id'` columns, as well as `'Day'` and `'Month'`. We will leave `'Year'` just in case something interesting shows up. We will also replace the column `'wl_home'` with a binary column `'home_wins'` that is 1 when the home team wins and 0 otherwise.

In [None]:
# Drop identifier and date metadata columns
game_eda = game_data.drop(columns=['game_id', 'min', 'season_id', 'team_id_home', 'team_id_away', 'Day', 'Month'])
# Encode the target: 1 when the home team wins, 0 otherwise
game_eda['home_wins'] = (game_eda['wl_home'] == 'W').astype(int)
game_eda = game_eda.drop(columns=['wl_home'])


Let's look at a profile report produced by `ydata_profiling`.

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=game_eda, minimal=True)
pandas_report.to_notebook_iframe()

It is interesting that many of the distributions have the shape of a normal distribution, while some are skewed. You will notice that there are alerts in the report about zero values. These will not bother us: it took time for 3-point shots to become common, and blocks are infrequent in general. The fact that the `'home_wins'` column is 0 approximately 40% of the time is actually a good sign, since it suggests the two outcomes are reasonably balanced.

Let us investigate further whether any of these features are normally distributed. A standard procedure for checking whether a distribution is normal is to use the `normality` function from the Pingouin library. If the p-value is larger than 0.05, we fail to reject the hypothesis of normality and treat the distribution as close enough to normal.

In [None]:
import pingouin as pg

normality_eda_nt = pg.normality(game_eda, method='normaltest', alpha=0.05)
print(normality_eda_nt.query('normal == True'))
print("Max p-value Normal Test: ", normality_eda_nt['pval'].max())

From this test, it is clear that these statistics are not normally distributed. However, the distributions certainly looked normal. Let's look at some Q-Q plots. We will focus on two groups of statistics.

Appear normally distributed:
* attempted field goals
* defensive rebounds
* assists

Do not appear normally distributed:
* made and attempted 3-pointers
* blocks

We expect the Q-Q plots of the first group to follow the reference line closely and those of the second group to deviate from it.

In [None]:
from scipy.stats import probplot

good_dists = ['fga','dreb','ast']
bad_dists = ['fg3a','fg3m','blk']

def prob_plot_home_away(df, stat_names):
    """Draw side-by-side Q-Q plots for the home and away versions of each statistic."""
    for stat in stat_names:
        fig, axes = plt.subplots(ncols=2, nrows=1, figsize=(8, 4))
        for ax, side in zip(axes, ('home', 'away')):
            column = f"{stat}_{side}"
            # probplot compares sample quantiles with theoretical normal quantiles
            probplot(df[column], plot=ax)
            ax.set_xlabel(column)
        plt.show()

prob_plot_home_away(game_eda,good_dists)
prob_plot_home_away(game_eda,bad_dists)

    

At least the curves look similar for home and away teams. Hopefully, we will be able to engineer the features a bit so that they more closely resemble normal distributions.
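
One possible reason the formal test rejects normality even though the histograms look bell-shaped is sample size: with many rows, a normality test can flag even a slight skew. The following is a minimal sketch on synthetic Poisson counts (an assumption made for illustration, not our game data) showing how the p-value of `scipy.stats.normaltest` behaves for a small and a large sample drawn from the same mildly skewed distribution.

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(0)

# Synthetic stand-in data: Poisson counts with mean 40 look bell-shaped
# but carry a slight right skew.
small_sample = rng.poisson(lam=40, size=200)
large_sample = rng.poisson(lam=40, size=50_000)

p_small = normaltest(small_sample).pvalue
p_large = normaltest(large_sample).pvalue

print(f"n=200:   p-value = {p_small:.4f}")
print(f"n=50000: p-value = {p_large:.3g}")
```

For the large sample the p-value falls far below 0.05, while small samples from the same distribution are often not rejected. This is why Q-Q plots are a useful complement to the test.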

## Section 2: Correlation study
Now that we have a basic idea of what the distributions look like, we will compute correlation coefficients between the features. We will then focus on correlation with respect to `'home_wins'`.

In [None]:
pearson_corr = game_eda.corr(method='pearson')
spearman_corr = game_eda.corr(method='spearman')
spearman_corr.head()

# Used in Notebook 05_Model_Selection 
# print(spearman_corr.filter([ 'reb_home']).loc[['dreb_home','oreb_home']])
# spearman_corr.filter([ 'reb_away']).loc[['dreb_away','oreb_away']]
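
As a reminder of why we compute both coefficients: Pearson measures linear association, while Spearman measures monotonic association through ranks. Here is a small sketch on synthetic data (the arrays `x` and `y` are made up for illustration) where the relationship is perfectly monotonic but not linear.

```python
import numpy as np
import pandas as pd

# Synthetic example: y grows monotonically with x, but not linearly
x = np.linspace(0, 5, 50)
toy = pd.DataFrame({'x': x, 'y': np.exp(x)})

pearson_xy = toy.corr(method='pearson').loc['x', 'y']
spearman_xy = toy.corr(method='spearman').loc['x', 'y']

print(f"Pearson:  {pearson_xy:.3f}")
print(f"Spearman: {spearman_xy:.3f}")
```

Spearman reports a perfect relationship here while Pearson does not, which is one reason to prefer Spearman when the features are not normally distributed.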


We will look at how the number of feature pairs whose correlation meets a given threshold changes as the threshold increases, in order to select a reasonable threshold. We will work with the Spearman correlation.

In [None]:
from src.utils import count_threshold_changes

print(spearman_corr.shape)
thresholds = [i/100 for i in range(20,85)]
changes = count_threshold_changes(spearman_corr, thresholds)
sns.lineplot(x=[change[0] for change in changes], y=[change[1]/2 for change in changes])
plt.show()
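
The helper `count_threshold_changes` lives in `src.utils` and is not shown in this notebook. As a rough sketch of the idea only (the function `count_pairs_above` below is a hypothetical stand-in, not the actual implementation), one could count the unique feature pairs whose absolute correlation meets each threshold like this:

```python
import numpy as np
import pandas as pd

def count_pairs_above(corr, thresholds):
    """Hypothetical stand-in: count unique feature pairs with |corr| >= threshold."""
    # Keep only the strict upper triangle so each pair is counted once
    # and the diagonal (self-correlation) is ignored.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    values = corr.abs().to_numpy()[upper]
    return [(t, int((values >= t).sum())) for t in thresholds]

# Tiny made-up data frame, used only to exercise the function
toy = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                    'b': [2, 4, 6, 8, 11],
                    'c': [5, 3, 4, 1, 2]})
toy_corr = toy.corr(method='spearman')

counts = count_pairs_above(toy_corr, [0.5, 0.9])
print(counts)  # [(0.5, 3), (0.9, 1)]
```

Working on the upper triangle directly avoids the need to halve the counts afterwards, which is what the division by 2 in the plot above accounts for.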


There are approximately 20 unique pairs of features that have a correlation coefficient of at least 0.7. Let us see what some of these pairs are and look at the associated scatter plots. Some of these variables should be strongly correlated, like shots attempted and shots made for each type of shot.

In [None]:
from src.utils import get_pairs

pairs = get_pairs(spearman_corr,0.7)
print(len(pairs))

# Show the first five and last five pairs, two scatter plots per row
parts = pairs[:5] + pairs[-5:]
for index in range(len(parts)//2):
    fig, axes = plt.subplots(ncols=2, nrows=1, figsize=(11, 5))
    for ax, pair in zip(axes, parts[2*index:2*index+2]):
        # Plot the underlying game data rather than the correlation matrix
        sns.scatterplot(x=pair[0], y=pair[1], data=game_eda, ax=ax)
        ax.set_title(f"Corr: {pair[2]}")
        ax.set_xlabel(pair[0])
        ax.set_ylabel(pair[1])
    plt.show()





We find it quite interesting that `'Year'` has such a strong correlation with 3-point shots, both made and attempted. On reflection this isn't so surprising, and we would wager that it is related to the impact Steph Curry has had on the game.

We will now focus on what correlates with the wins. Remember that this is recorded as when the home team wins, so statistics for the away team are functionally statistics for the opponent.

In [None]:
pearson_corr_wins = pearson_corr['home_wins'].sort_values(key=abs, ascending=False)[1:]
print(pearson_corr_wins[:11])
spearman_corr_wins = spearman_corr['home_wins'].sort_values(key=abs, ascending=False)[1:]
print(spearman_corr_wins[:11])


The `'plus_minus_home'` score is the difference in points between the home team and the away team. Naturally this will correlate most strongly with winning. Similarly, `'pts'` will correlate quite strongly with winning since that is how the winner of the game is actually determined.

We will focus on the next 6 features that have the strongest correlation with winning. Note that both Pearson and Spearman produce the same list of features.


In [None]:
vars_to_study = list(pearson_corr_wins[3:9].index)
vars_to_study.sort()
print(vars_to_study)


The following are the statistics with the least correlation with wins. It is slightly reassuring that the correlation is so weak, but it is a bit odd that it is more strongly correlated with winning than offensive rebounds.

In [None]:
pearson_corr_wins[-6:]

This tells an interesting story. Aside from points, the statistic with the highest correlation to wins is the defensive rebounds of the opposing team.

Let's look at the distribution of these statistics. We will color the histograms according to which team won, and place a Q-Q plot next to each histogram.

In [None]:

for var in vars_to_study:
    fig, axes = plt.subplots(ncols=2, nrows=1, figsize=(8, 4))
    # Left: histogram split by game outcome; right: Q-Q plot against a normal distribution
    sns.histplot(data=game_eda, x=var, hue='home_wins', kde=True, element="step", ax=axes[0])
    probplot(game_eda[var], plot=axes[1])
    plt.show()


Note the symmetry in these distributions. Swapping `'home'` to `'away'` is statistically equivalent to reflection across the y-axis. This is good, and we would expect such symmetry. It is not necessarily present, though, as each game only appears once in our data. While these distributions appear normal, we know from the analysis above that they are not. In the next notebook we will attempt to transform them to bring them more in line with normal distributions.

## Section 3: Predictive Power Score
There is also the Predictive Power Score (PPS). Unlike a correlation coefficient, it is able to measure asymmetric relationships.

In [None]:
import ppscore as pps

pps_raw = pps.matrix(df=game_eda)



In [None]:
pps_results = pps_raw.query('case != "predict_itself"').sort_values('ppscore', ascending=False)
pps_results = pps_results.filter(['x','y','ppscore'])
pps_results.head()

The score measures the ability of feature `x` to predict the value of feature `y`. It is unsurprising then that `'plus_minus_home'` is completely capable of predicting who wins.
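
To illustrate what "asymmetric" means here, the following is a deliberately simplified, in-sample sketch of a PPS-like score. It is not the algorithm used by `ppscore`, which fits a decision tree with cross-validation; the helper `simple_pps` and the toy data are assumptions made purely for illustration. The score compares the error of a simple predictor against a naive median baseline.

```python
import pandas as pd

def simple_pps(df, x, y):
    """Simplified, in-sample PPS-like score (hypothetical helper, not ppscore's algorithm)."""
    # Naive baseline: always predict the overall median of y
    naive_mae = (df[y] - df[y].median()).abs().mean()
    # "Model": predict the median of y within each distinct value of x
    model_pred = df.groupby(x)[y].transform('median')
    model_mae = (df[y] - model_pred).abs().mean()
    if naive_mae == 0:
        return 0.0
    # 1 means perfect prediction, 0 means no better than the baseline
    return max(0.0, 1 - model_mae / naive_mae)

# Toy data: y is fully determined by x, but x is not determined by y
toy = pd.DataFrame({'x': [-2, -1, 0, 1, 2] * 4})
toy['y'] = toy['x'] ** 2

score_xy = simple_pps(toy, 'x', 'y')
score_yx = simple_pps(toy, 'y', 'x')
print(f"x -> y: {score_xy:.2f}")  # x -> y: 1.00
print(f"y -> x: {score_yx:.2f}")  # y -> x: 0.00
```

Knowing `x` pins down `y` exactly, but knowing `y` does not reveal the sign of `x`, so the two directions receive different scores. A correlation coefficient would assign a single value to the pair.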

In [None]:
# Copy so that the assignment below does not trigger a SettingWithCopyWarning
pps_rounded = pps_results.query('ppscore != 0').copy()
pps_rounded['ppscore'] = pps_rounded['ppscore'].round(4)

print("mean: ",pps_rounded['ppscore'].mean())
print("mode: ",pps_rounded['ppscore'].mode())
print("median: ",pps_rounded['ppscore'].median())
print("standard deviation: ",pps_rounded['ppscore'].std())



We are interested in predicting when the home team wins. So let's focus on that.

In [None]:
pps_wins = pps_results.query('y == "home_wins"')
# sort_values returns a new frame, so the result must be assigned
pps_wins = pps_wins.sort_values('ppscore', ascending=False)
print(pps_wins.head())

It seems that many of the values are able to predict one another. However, predicting wins seems much more difficult if we don't use the `'plus_minus_home'` value, which is 100% effective. Let's try to determine a reasonable threshold for the PPS.

In [None]:
sns.histplot(data=pps_rounded, x='ppscore', kde=True)
plt.show()

If we look at the change in the number of rows as the threshold increases, it may help us determine an appropriate cut off.

In [None]:
drop_off = count_threshold_changes(pps_rounded['ppscore'],[i/100 for i in range(100)],corr=False)
jumps = [(drop_off[index+1][0],-drop_off[index+1][1]+drop_off[index][1]) for index in range(len(drop_off)-1)]
print(f"The number of rows changes {len(jumps)} times.")
plt.plot([jump[0] for jump in jumps],[jump[1] for jump in jumps])
plt.show()

We will look at pps values larger than 0.3 and see what statistics show up.

In [None]:
pps_truncated = pps_rounded.query('ppscore > 0.3')
print(pps_truncated.shape)
print(pps_truncated.head())
print(pps_truncated.iloc[-5:])
pps_table = pps_truncated.pivot(columns='x', index='y', values='ppscore')
sns.heatmap(pps_table, annot=True, cmap="YlGnBu", linewidth=0.05,linecolor='lightgrey', xticklabels=True, yticklabels=True)
plt.show()

Above, we see a lot of obvious relationships: plus/minus score vs wins, attempted shots vs made shots, and personal fouls vs free throws. One interesting relationship is that between year and 3-point attempts, which we also saw in the correlation study. It is also interesting how little predictive power some statistics have on each other, such as field goals made and field goals attempted. There is also a surprising amount of symmetry in the statistics.

Let's consider the predictive power of the features we highlighted in the correlation study.

In [None]:
print(vars_to_study)
pps_home_wins = pps_results[pps_results['x'].isin(vars_to_study)].query('y=="home_wins"')
pps_home_wins

None of these statistics that have a strong correlation seem to have predictive power. Thus, we don't feel there is a lot of insight to be gained from the `'ppscore'`. Perhaps this is because it is easy to imagine how these statistics influence each other.

---

## Next Steps

In the next notebook, we will do some feature engineering. In the previous notebook we truncated the data so that we have no missing values. This means we won't have much cleaning to do. We will spend a lot of our time seeing if we can transform our features so that they are closer to normal distributions.

## Conclusions

Both the correlation and PPS values confirm basketball common sense: attempts correlate with successes. One standout is the relationship between year and 3-point shots (both made and attempted). We have highlighted some statistics:
* Assists (home and away teams)
* Defensive Rebounds (home and away teams)
* Field Goals Made (home and away teams)

In particular, the opposing team's number of defensive rebounds had a comparatively strong correlation with winning.