# NBA Data Analysis

The data used in this notebook was downloaded from [Kaggle](https://www.kaggle.com/drgilermo/nba-players-stats#Seasons_Stats.csv).  The original source of the data is [Basketball-reference](http://www.basketball-reference.com/).


## General Intro EDA

In [None]:
%load_ext nb_black

In [None]:
import pandas as pd
import numpy as np

from scipy import stats
from sklearn.preprocessing import StandardScaler, MinMaxScaler

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

data_url = "https://docs.google.com/spreadsheets/d/1m0jaYL1KGjxW1cKJUQxVTcPOnm7v7NZEBKRZADCmc68/export?format=csv"
nba = pd.read_csv(data_url, index_col=0)
nba.head()

Preprocessing from before: drop the `blank2` and `blanl` columns, then drop rows missing `Year`, `Player`, `Pos`, or `Tm`.

In [None]:
nba = nba.drop(columns=["blank2", "blanl"])
nba = nba.dropna(subset=["Year", "Player", "Pos", "Tm"])

## Feature Engineering

We have a lot of useful data here, but most of the predictive models we'll look at only accept numeric input.  To keep using the non-numeric information, we have to reformat it a little.

### One Hot Encoding / Dummy Encoding

For example, for the team column, we might "one-hot encode" it (also known as creating dummy variables).  This creates one True/False indicator column for each category.

Create a dataframe that is a subset of the `nba` dataframe.  Only include in this subset:

* Columns: `PTS`, `Player`, & `Tm`
* Rows: a random selection of 15 rows (use 42 as the `random_state`)

In [None]:
# subset columns
nba_sub = ____

# subset rows
nba_sub = ____

nba_sub
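
The two subsetting steps can be sketched on a small made-up dataframe. The `toy` data and the sample size of 3 are assumptions for illustration only; the exercise above still uses `nba` and 15 rows.

```python
import pandas as pd

# Hypothetical toy data standing in for the real `nba` dataframe
toy = pd.DataFrame(
    {
        "Player": ["A", "B", "C", "D", "E", "F"],
        "Tm": ["LAL", "BOS", "LAL", "NYK", "BOS", "CHI"],
        "PTS": [10, 25, 7, 31, 18, 4],
        "Age": [22, 30, 27, 25, 33, 21],
    }
)

# Subset columns by passing a list of column names
toy_sub = toy[["PTS", "Player", "Tm"]]

# Subset rows: draw 3 rows at random; random_state makes the draw repeatable
toy_sub = toy_sub.sample(n=3, random_state=42)

toy_sub
```

Re-running the cell with the same `random_state` should return the same rows each time.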

Use `pd.get_dummies()` on the subset.

* What happened?
* What might we change about this and why?
* What does the `drop_first` argument of `pd.get_dummies()` do and why?
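
As a reference point for these questions, here is a minimal sketch of `pd.get_dummies()` on a made-up dataframe (the `toy` data is an assumption, not part of the NBA dataset). It shows the columns produced with and without `drop_first`.

```python
import pandas as pd

# Hypothetical toy data; only Tm is categorical
toy = pd.DataFrame({"PTS": [10, 25, 7, 31], "Tm": ["LAL", "BOS", "LAL", "NYK"]})

# One indicator column per team; the original Tm column is replaced
dummies = pd.get_dummies(toy, columns=["Tm"])
print(dummies.columns.tolist())
# ['PTS', 'Tm_BOS', 'Tm_LAL', 'Tm_NYK']

# drop_first=True drops the first category ("BOS" here), leaving k-1 columns.
# A row with 0 in every remaining team column must belong to the dropped team.
dummies_k1 = pd.get_dummies(toy, columns=["Tm"], drop_first=True)
print(dummies_k1.columns.tolist())
# ['PTS', 'Tm_LAL', 'Tm_NYK']
```

Dropping one level removes a column that is fully determined by the others, which matters for models that are sensitive to perfectly correlated features.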

There are some issues that come up with using `pd.get_dummies` in a machine learning workflow.  For today, we'll stick with it due to its ease of use compared to more powerful options.

Using [`sklearn.preprocessing.OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) overcomes the issues that `pd.get_dummies` can run into, but it is a little more complex to use.
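
One commonly cited issue is that `pd.get_dummies` builds its columns from whatever categories happen to be in the data it is given, so new data can end up with different columns. The sketch below, on made-up `train` and `new` frames (both assumptions), shows how `OneHotEncoder` learns the categories once and reuses them.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "CHI" appears only in the new data
train = pd.DataFrame({"Tm": ["LAL", "BOS", "NYK", "LAL"]})
new = pd.DataFrame({"Tm": ["BOS", "CHI"]})

# handle_unknown="ignore" encodes unseen categories as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["Tm"]])

# transform returns a sparse matrix by default; convert it for display
encoded_new = encoder.transform(new[["Tm"]]).toarray()

print(encoder.categories_)  # categories learned from train
print(encoded_new)  # the "CHI" row is all zeros
```

The encoded output always has one column per category seen during `fit`, regardless of which teams show up later.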

### Binary Encoding

Create a binary column named `is_old` that indicates whether the `Year` variable is before 1980.

Create a binary column named `is_california` that indicates whether a team is located in California.

In [None]:
ca_teams = ["LAL", "LAC", "GSW", "SAC"]
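
A minimal sketch of both binary encodings on a made-up dataframe (the `toy` data is an assumption; the same pattern should carry over to `nba`):

```python
import pandas as pd

# Hypothetical toy data with the two columns the encodings need
toy = pd.DataFrame(
    {"Year": [1975, 1980, 1999, 1962], "Tm": ["LAL", "BOS", "GSW", "NYK"]}
)
ca_teams = ["LAL", "LAC", "GSW", "SAC"]

# A comparison yields True/False; astype(int) turns that into 1/0
toy["is_old"] = (toy["Year"] < 1980).astype(int)

# isin checks membership in the list of California teams
toy["is_california"] = toy["Tm"].isin(ca_teams).astype(int)

toy
```

Leaving the columns as booleans also works for most models; the cast to integers is a stylistic choice.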



### Ordinal Encoding

Let's make up some data to be ordinal encoded.

* Using the `grades` list, create a random sample of 20 letter grades
* Create a one-column DataFrame from this sample

In [None]:
np.random.seed(42)

grades = ["A", "B", "C", "D", "F"]
rand_grades = ____

grade_df = ____
grade_df.head()

Create a variable that is an ordinal encoding of grade.  Have `A` be 1 and `F` be 5.
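
One way to do this, sketched on a few hand-typed grades, is to map each letter to its rank with a dictionary. The `toy_grades` frame and the `grade_ord` column name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical hand-typed grades rather than the random sample above
toy_grades = pd.DataFrame({"grade": ["B", "A", "F", "C", "A", "D"]})

# The dictionary spells out the order explicitly: A is 1, F is 5
grade_order = {"A": 1, "B": 2, "C": 3, "D": 4, "F": 5}

# map looks up each grade in the dictionary
toy_grades["grade_ord"] = toy_grades["grade"].map(grade_order)

toy_grades
```

Any value missing from the dictionary would become `NaN`, which makes unexpected categories easy to spot.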

### Scaling

Some methods we'll see are sensitive to variables being on different scales.  For example, if you have variables for a person's height and their annual income, the height values will be much smaller than the income values.  In some methods, this makes income a louder signal than height.  Larger-magnitude variables can drown out smaller-magnitude ones, which is a problem if we think height is an important predictor.

To address this issue, we can scale the variables so they are on equal footing.  The scalers we use here only shift and rescale each variable, so the shape of each distribution doesn't change.  That means the patterns within and between the variables are preserved; only the units have been standardized.
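
The claim that scaling preserves patterns can be checked on made-up data. In this sketch, the `people` dataframe and its height and income values are simulated assumptions; it compares the correlation between the two variables and the skew of each before and after scaling.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Simulated data: income is loosely related to height, on a much larger scale
rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=200)
income = 300 * height + rng.normal(0, 5000, size=200)
people = pd.DataFrame({"height_cm": height, "income": income})

# Apply each scaler and rebuild a dataframe with the original column names
standard = pd.DataFrame(
    StandardScaler().fit_transform(people), columns=people.columns
)
minmax = pd.DataFrame(MinMaxScaler().fit_transform(people), columns=people.columns)

# Relationship between variables: the correlation should match in all three
for name, df in [("raw", people), ("standard", standard), ("minmax", minmax)]:
    print(name, df.corr().loc["height_cm", "income"])

# Shape within each variable: skew should also match before and after
print(people.skew())
print(standard.skew())
```

Only the axis values change: the raw columns differ by orders of magnitude, while the scaled columns share a comparable range.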

* Create a subset of the `nba` dataset that has the columns `PTS` and `Age`.
* Drop all rows with missing values.
* Use the pandas boxplot method on the resulting data.
* Plot these variables on a scatter plot.

We're going to split into groups to evaluate 2 different scalers.  The code below assigns the groups.

In [None]:
# fmt: off
data_scientists = ["Seiedeshiva", "Christopher", "Jason", "Francis", 
                   "Tizeta", "Matthew", "Jason", "Scott", "Dæyva", 
                   "Michael", "Cristina", "Alex", "Mike", "Taylor"]
# fmt: on

# Randomize order
np.random.shuffle(data_scientists)

n = len(data_scientists) // 2
print(f"Use StandardScaler: {data_scientists[:n]}")
print(f"Use MinMaxScaler: {data_scientists[n:]}")

In [None]:
# Pick your poison (comment out the one that your group isn't doing)
scaler = StandardScaler()
scaler = MinMaxScaler()

* Use a scaler to scale the `PTS` and `Age` data.
* The output of the scaler is a NumPy array; convert it back to a dataframe.
* Recreate the same box plots from before.
  * What's the same?
  * What's different?
  * What's the minimum value of the numeric axis? The maximum?

In [None]:
# .fit() methods 'learn' something from your data,
# but they don't apply what they learned.
# For a scaler, we apply it by calling .transform().
# Alternatively, .fit_transform() does both
# of these things in one step.
scaler.fit(____)

scaled = scaler.transform(____)

scaled_df = ____
scaled_df.head()
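
The fit-then-transform pattern described in the comments above can be sketched on a small made-up dataframe. The `toy` data and variable names are assumptions, and `StandardScaler` is used only as an example; the same calls should work for `MinMaxScaler`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with the same column names as the exercise
toy = pd.DataFrame(
    {"PTS": [100.0, 250.0, 400.0, 50.0], "Age": [22.0, 30.0, 27.0, 35.0]}
)

toy_scaler = StandardScaler()

# fit learns what the scaler needs from the data; nothing is scaled yet
toy_scaler.fit(toy)

# transform applies what was learned and returns a NumPy array
scaled_array = toy_scaler.transform(toy)

# Rebuild a dataframe, reusing the original column names and index
toy_scaled = pd.DataFrame(scaled_array, columns=toy.columns, index=toy.index)

# fit_transform on a fresh scaler should give the same values in one step
same = np.allclose(scaled_array, StandardScaler().fit_transform(toy))
print(same)
```

Keeping `fit` and `transform` separate matters once there is new data: the scaler can be fit on one dataset and then applied unchanged to another.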

* Bonus: what attributes does your scaler have? What is the significance of these?
* Bonus Bonus: can you recreate this same scaling from scratch?