# **Basic Data Analytics**

## Objectives

* In this notebook, we will do some preliminary statistical analysis, such as a correlation study, as well as various visualizations.

## Inputs

* The input for this is the cleaned data from the last notebook, namely `'game_data_clean.csv'`.

## Outputs

* At the end, we will have various plots displaying the statistical relationship between different features of our dataset.



---

# Change working directory

* The notebooks are stored in a subfolder, so when running a notebook in the editor we need to change the working directory from the notebook's folder to its parent folder, the project root.
* We access the current directory with `os.getcwd()` and change it with `os.chdir()`.

In [1]:
import os
home_dir = '/workspace/pp5-ml-dashboard'
# Build the data folder from the project root instead of repeating the full path.
csv_dir = os.path.join(home_dir, 'outputs', 'datasets', 'clean', 'csv')
os.chdir(home_dir)
current_dir = os.getcwd()
print(current_dir)

/workspace/pp5-ml-dashboard
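
The cell above hard-codes the project root. A minimal sketch of the parent-folder approach described in the text, assuming the notebook sits exactly one level below the project root. The helper name `parent_dir` and the folder name `jupyter_notebooks` are hypothetical and only serve the illustration.

```python
import os

def parent_dir(path):
    """Return the parent folder of `path` (hypothetical helper)."""
    return os.path.dirname(os.path.abspath(path))

# Illustration on an assumed notebook folder; prints its parent folder.
print(parent_dir('/workspace/pp5-ml-dashboard/jupyter_notebooks'))

# In the notebook itself, this would become:
# os.chdir(parent_dir(os.getcwd()))
```

This avoids an absolute path, but it only works if the notebook is started from its own folder, which is why the hard-coded version can be the safer choice in a fixed workspace.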


We now load our cleaned dataset as well as some of the packages that we will be using.

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from ydata_profiling import ProfileReport
from src.utils import get_df

game_data = get_df('game_data_clean', csv_dir)
game_data.head()

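| Unnamed: 0 | season_id | team_id_home | game_id | wl_home | min | fgm_home | fga_home | fg3m_home | fg3a_home | ftm_home | ... | ast_away | stl_away | blk_away | tov_away | pf_away | pts_away | plus_minus_away | Day | Month | Year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 21985 | 1610612737 | 28500005 | L | 240 | 41.0 | 92.0 | 0.0 | 3.0 | 9.0 | ... | 21.0 | 11.0 | 7.0 | 17.0 | 19.0 | 100.0 | 9 | 25 | 10 | 1985 |
| 1 | 21985 | 1610612758 | 28500006 | L | 240 | 39.0 | 88.0 | 0.0 | 2.0 | 26.0 | ... | 19.0 | 7.0 | 7.0 | 18.0 | 32.0 | 108.0 | 4 | 25 | 10 | 1985 |
| 2 | 21985 | 1610612765 | 28500010 | W | 240 | 39.0 | 88.0 | 0.0 | 1.0 | 40.0 | ... | 27.0 | 10.0 | 7.0 | 20.0 | 32.0 | 116.0 | -2 | 25 | 10 | 1985 |
| 3 | 21985 | 1610612762 | 28500011 | L | 240 | 42.0 | 82.0 | 0.0 | 2.0 | 24.0 | ... | 23.0 | 10.0 | 7.0 | 19.0 | 28.0 | 112.0 | 4 | 25 | 10 | 1985 |
| 4 | 21985 | 1610612744 | 28500008 | L | 240 | 36.0 | 91.0 | 0.0 | 4.0 | 33.0 | ... | 26.0 | 11.0 | 3.0 | 22.0 | 40.0 | 119.0 | 14 | 25 | 10 | 1985 |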


# Section 1: EDA and visualization
We hypothesize that many of the statistics will be correlated with each other. After dropping certain categorical features, let's look at the correlations.

Note: add a playoff/regular-season column to see whether anything correlates there. Maybe leave year in, or add season as an ordinal feature. Maybe make team ordinal as well. Drop day, and maybe drop month too.

In [None]:
game_data.head()

We are going to modify the data frame before doing any exploratory data analysis. We will drop metadata columns such as `'game_id'`, `'season_id'`, `'min'` and the `'team_id'` columns, as well as `'Day'` and `'Month'`. We will leave `'Year'` in, just in case something interesting shows up. We will also convert the column `'wl_home'` into a binary column `'home_wins'`.

In [3]:
# Drop identifier, minutes and calendar columns that we do not want to correlate.
game_eda = game_data.drop(columns=['game_id', 'min', 'season_id', 'team_id_home', 'team_id_away', 'Day', 'Month'])
# Encode the home result as a binary target: 1 for a win, 0 for a loss.
game_eda['home_wins'] = (game_eda['wl_home'] == 'W').astype(int)
game_eda = game_eda.drop(columns=['wl_home', 'wl_away'])

Let's look at a profile report produced by `ydata_profiling`.

In [4]:
# ProfileReport was already imported in the loading cell above.
pandas_report = ProfileReport(df=game_eda, minimal=True)
pandas_report.to_notebook_iframe()


It is interesting that many of the distributions are approximately normal in shape, while some are skewed. You will notice that the report raises alerts about zero values. These are not alarming: some events simply do not happen in a given game. We know from our previous inspection of the data that there are no missing values. The `'home_wins'` column is 0 approximately 40% of the time, which is a good sign, since neither outcome dominates the target. Now that we have a basic idea of what the distributions look like, we will focus on the correlation coefficients with respect to the statistic `'home_wins'`.
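
The share of home wins and losses can be checked directly rather than read off the report. A minimal sketch on made-up values standing in for `game_eda['home_wins']` (the toy series is an assumption, not the real data):

```python
import pandas as pd

# Toy stand-in for game_eda['home_wins']; the values are invented for illustration.
home_wins = pd.Series([1, 1, 0, 1, 0, 1, 1, 0, 1, 0], name='home_wins')

# normalize=True returns proportions instead of raw counts.
class_share = home_wins.value_counts(normalize=True).sort_index()
print(class_share)

# A naive "always predict a home win" rule scores the majority-class share,
# which is a useful baseline for any later model.
baseline_accuracy = class_share.max()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

With a roughly 60/40 split, any classifier trained later should be compared against an accuracy of about 0.60 rather than 0.50.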

In [5]:

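# Correlation of every feature with 'home_wins', sorted by absolute value.
# The first entry is 'home_wins' itself (correlation 1), so we skip it.
pearson_corr = game_eda.corr(method='pearson')['home_wins'].sort_values(key=abs, ascending=False).iloc[1:]
print(pearson_corr.head(11))
spearman_corr = game_eda.corr(method='spearman')['home_wins'].sort_values(key=abs, ascending=False).iloc[1:]
print(spearman_corr.head(11))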

plus_minus_away   -0.798495
plus_minus_home    0.798495
pts_home           0.404435
pts_away          -0.386229
dreb_away         -0.343750
fgm_home           0.331596
ast_home           0.315810
dreb_home          0.307699
fgm_away          -0.299504
ast_away          -0.266257
reb_away          -0.241296
Name: home_wins, dtype: float64
plus_minus_away   -0.846898
plus_minus_home    0.846898
pts_home           0.407112
pts_away          -0.385031
dreb_away         -0.342127
fgm_home           0.331037
ast_home           0.314473
dreb_home          0.305353
fgm_away          -0.295041
ast_away          -0.260685
reb_away          -0.240049
Name: home_wins, dtype: float64


The `'plus_minus'` scores are the point differentials between the teams. Naturally, these correlate most strongly with winning. Similarly, `'pts'` correlates quite strongly with winning, since that is how the winner of the game is actually determined. If we trained a model on the strongest features, we would undoubtedly get a model that simply looked at the point differentials or the points scored by each team.
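
The point differential determines the winner exactly, yet its correlation with `'home_wins'` is about 0.80 rather than 1. The sketch below reproduces this effect on synthetic data. The normal distribution, its mean of 0 and its standard deviation of 13 are assumptions chosen for illustration, not properties measured on the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic point differentials (assumed normal, mean 0, sd 13; made-up values).
diff = pd.Series(rng.normal(loc=0.0, scale=13.0, size=100_000), name='plus_minus_home')
# The winner is fully determined by the sign of the differential.
wins = (diff > 0).astype(int).rename('home_wins')

pearson = diff.corr(wins)                  # Pearson is the default method
spearman = diff.rank().corr(wins.rank())   # Spearman is Pearson computed on ranks

print(f"Pearson:    {pearson:.3f}")
print(f"Spearman:   {spearman:.3f}")
print(f"sqrt(2/pi): {np.sqrt(2 / np.pi):.3f}")
```

For a zero-mean normal variable, the Pearson correlation with its own sign is sqrt(2/pi), roughly 0.798. The real differentials are integer-valued and not centered at zero, so the closeness to the value reported above should be read as suggestive only.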

The following are the statistics with the weakest correlation with wins. It is slightly reassuring that the correlation with `'Year'` is so weak, but it is a bit odd that `'Year'` is more strongly correlated with winning than home offensive rebounds (`'oreb_home'`).

In [6]:
print(pearson_corr[-5:])
print(spearman_corr[-5:])

fg3a_home   -0.066227
Year        -0.051915
fga_home    -0.049170
fg3a_away   -0.036388
oreb_home   -0.019643
Name: home_wins, dtype: float64
fg3m_home    0.063330
Year        -0.051613
fga_home    -0.047646
fg3a_away   -0.035033
oreb_home   -0.016766
Name: home_wins, dtype: float64


This tells an interesting story. Aside from points and point differentials, the statistic with the highest correlation to wins is the defensive rebounds of the opposing team. What about predictive power? Remember that predictive power, unlike correlation, is not symmetric.



This is a lot of data, so let's look at a heat map restricted to the top 9 correlation coefficients. The `'plus_minus'` statistic is just the difference in points, so it clearly has the highest correlation with who wins. Its coefficient is still not 1 because `'home_wins'` is binary: it is a step function of the point differential rather than a linear one, and its many tied values also keep the rank correlation below 1. The Spearman and Pearson correlation coefficients indicate the same features for further study. We may also remove the `'pts'` statistics.

In [None]:
vars_to_study = ['ast_home', 'ast_away', 'dreb_away', 'dreb_home', 'fgm_away', 'fgm_home', 'pts_away', 'pts_home']
# 'home_wins' only exists in game_eda, so we select from game_eda rather than game_data.
top_corr = game_eda[vars_to_study + ['home_wins']]
top_corr.head()
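
The heat map mentioned in the text can be drawn from a correlation matrix. A sketch with seaborn on random toy data (not the real game statistics); the column subset and the toy values are assumptions, and in the notebook the toy frame would be replaced by the filtered data frame.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy stand-in for the filtered data frame: random numbers, NOT real game data.
rng = np.random.default_rng(0)
cols = ['ast_home', 'dreb_away', 'fgm_home', 'pts_home']
toy = pd.DataFrame(rng.normal(size=(200, len(cols))), columns=cols)
# Made-up binary target so that the matrix includes a 'home_wins' row.
toy['home_wins'] = (toy['pts_home'] > 0).astype(int)

corr_matrix = toy.corr(method='spearman')

# Mask the upper triangle (above the diagonal) so each pair appears only once.
mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f',
            cmap='coolwarm', vmin=-1, vmax=1, ax=ax)
ax.set_title('Spearman correlation (toy data)')
plt.show()
```

Fixing `vmin` and `vmax` at -1 and 1 keeps the colour scale symmetric, so positive and negative correlations of equal strength look equally intense.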

---

These are all numerical fields, so we will plot the distribution of the values and use hue to distinguish between wins and losses. Notice that there is a certain symmetry in the distributions: if the roles of home and away are swapped, win and loss swap as well. The symmetry can also be thought of as a reflection across the y-axis. This is a genuine symmetry in the data, since each game is listed only once. Its presence is reassuring and very intuitive.

In [None]:
# game_eda holds both the features and the 'home_wins' target used for the hue.
for var in vars_to_study:
    plt.figure(figsize=(8, 5))
    sns.histplot(data=game_eda, x=var, hue='home_wins', kde=True, element="step")
    plt.title(f"{var}", fontsize=20, y=1.05)
    plt.show()
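
The home/away symmetry described above can be made explicit by viewing each game from the other team's side. A minimal sketch, assuming every `_home` column has a matching `_away` column and that the frame contains `'home_wins'`; the helper `swap_home_away` and the toy values are hypothetical.

```python
import pandas as pd

def swap_home_away(df):
    """Return a copy of `df` seen from the away team's side (hypothetical helper)."""
    def flip(col):
        if col.endswith('_home'):
            return col[:-len('_home')] + '_away'
        if col.endswith('_away'):
            return col[:-len('_away')] + '_home'
        return col

    swapped = df.rename(columns=flip)
    # The new "home" side is the original away side, so the result flips too.
    swapped['home_wins'] = 1 - swapped['home_wins']
    # Restore the original column order so both frames line up.
    return swapped[df.columns]

# Made-up games for illustration only.
toy = pd.DataFrame({
    'pts_home': [100, 95],
    'pts_away': [90, 101],
    'home_wins': [1, 0],
})
print(swap_home_away(toy))
```

Applying the swap twice returns the original frame, which is one quick way to check that the helper is consistent.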


---

# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
    # create your folder here
    # os.makedirs(name='')
    pass  # placeholder so the otherwise empty try block is valid Python
except Exception as e:
    print(e)
