# Title: "Data Preprocessing: Cleaning, Transforming, and Preparing the Dataset"

**Description:** 

This notebook focuses on the crucial step of data preprocessing in a machine learning project. It covers various techniques and tasks involved in preparing the dataset for analysis and model training. The notebook provides insights into handling missing data, outlier detection and treatment, feature scaling, encoding categorical variables, and handling imbalanced datasets. 

By following this guide, I want to internalize how to effectively preprocess data to improve the quality and reliability of machine learning models.

In [15]:
# Import the relevant libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

In [16]:
# Import the data (semicolon-delimited, ISO-8859-1 encoded)
df = pd.read_csv('/Users/robertkurtz/Desktop/NBA 2023 Dataplayground/data/2022-2023 NBA Player Stats - Regular.csv', delimiter=';', encoding='ISO-8859-1')

In [17]:
df

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1,Precious Achiuwa,C,23,TOR,33,9,23.0,4.0,8.2,...,0.697,2.1,4.3,6.4,1.1,0.7,0.7,1.2,2.2,10.4
1,2,Steven Adams,C,29,MEM,42,42,27.0,3.7,6.3,...,0.364,5.1,6.5,11.5,2.3,0.9,1.1,1.9,2.3,8.6
2,3,Bam Adebayo,C,25,MIA,52,52,35.3,8.6,15.7,...,0.806,2.7,7.3,10.1,3.3,1.2,0.8,2.6,2.8,21.6
3,4,Ochai Agbaji,SG,22,UTA,35,1,14.0,1.5,3.2,...,0.625,0.6,1.0,1.6,0.5,0.1,0.1,0.3,1.4,4.1
4,5,Santi Aldama,PF,22,MEM,52,18,22.0,3.4,7.0,...,0.730,1.0,3.7,4.7,1.2,0.7,0.7,0.7,1.9,9.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
548,501,Delon Wright,PG,30,WAS,26,2,22.1,2.1,5.1,...,0.903,0.9,2.2,3.1,3.7,1.9,0.3,0.8,1.4,6.0
549,502,McKinley Wright IV,PG,24,DAL,19,1,10.2,1.0,2.4,...,0.600,0.3,1.1,1.4,1.7,0.4,0.2,0.6,0.9,2.4
550,503,Thaddeus Young,PF,34,TOR,45,9,16.1,2.2,4.0,...,0.692,1.4,1.9,3.4,1.5,1.1,0.1,0.8,1.8,5.0
551,504,Trae Young,PG,24,ATL,50,50,35.5,8.5,19.8,...,0.887,0.7,2.3,3.0,10.2,1.0,0.2,4.2,1.5,26.9


## Step 4: Cleaning the Data

### Dropping duplicates

In [18]:
# Drop duplicate rows based on the 'Player' column so that every player appears only once.
# 48 duplicate rows are dropped; drop_duplicates keeps the first occurrence by default.

df = df.drop_duplicates(subset=['Player'])
df

# TODO: consider creating an average for each player instead of dropping the duplicates.
# Were the duplicates identical, or did their values differ? --> take this into consideration

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1,Precious Achiuwa,C,23,TOR,33,9,23.0,4.0,8.2,...,0.697,2.1,4.3,6.4,1.1,0.7,0.7,1.2,2.2,10.4
1,2,Steven Adams,C,29,MEM,42,42,27.0,3.7,6.3,...,0.364,5.1,6.5,11.5,2.3,0.9,1.1,1.9,2.3,8.6
2,3,Bam Adebayo,C,25,MIA,52,52,35.3,8.6,15.7,...,0.806,2.7,7.3,10.1,3.3,1.2,0.8,2.6,2.8,21.6
3,4,Ochai Agbaji,SG,22,UTA,35,1,14.0,1.5,3.2,...,0.625,0.6,1.0,1.6,0.5,0.1,0.1,0.3,1.4,4.1
4,5,Santi Aldama,PF,22,MEM,52,18,22.0,3.4,7.0,...,0.730,1.0,3.7,4.7,1.2,0.7,0.7,0.7,1.9,9.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
548,501,Delon Wright,PG,30,WAS,26,2,22.1,2.1,5.1,...,0.903,0.9,2.2,3.1,3.7,1.9,0.3,0.8,1.4,6.0
549,502,McKinley Wright IV,PG,24,DAL,19,1,10.2,1.0,2.4,...,0.600,0.3,1.1,1.4,1.7,0.4,0.2,0.6,0.9,2.4
550,503,Thaddeus Young,PF,34,TOR,45,9,16.1,2.2,4.0,...,0.692,1.4,1.9,3.4,1.5,1.1,0.1,0.8,1.8,5.0
551,504,Trae Young,PG,24,ATL,50,50,35.5,8.5,19.8,...,0.887,0.7,2.3,3.0,10.2,1.0,0.2,4.2,1.5,26.9
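
The two open questions in the cell above (were the duplicate rows identical, and would an average per player be better than dropping?) could be explored along the following lines. This is only a sketch on a toy DataFrame with hypothetical values; weighting the average by games played (`G`) is an assumption about what a sensible per-player average would look like, not something the notebook has settled on.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values: player "A" appears twice with different stats.
toy = pd.DataFrame({
    "Player": ["A", "A", "B"],
    "Tm": ["X", "Y", "Z"],
    "G": [10, 30, 40],
    "PTS": [8.0, 12.0, 20.0],
})

# 1) Are the duplicated rows identical, or do their values differ?
dupes = toy[toy.duplicated(subset=["Player"], keep=False)]
differing = dupes.groupby("Player")["PTS"].nunique().gt(1)
print(differing)

# 2) Instead of dropping, average PTS per player, weighted by games played (G).
def weighted_pts(group: pd.DataFrame) -> float:
    return float(np.average(group["PTS"], weights=group["G"]))

rows = [
    {"Player": name, "G": int(group["G"].sum()), "PTS": weighted_pts(group)}
    for name, group in toy.groupby("Player", sort=True)
]
per_player = pd.DataFrame(rows)
print(per_player)
```

On the real dataset, the same check would use `df` before the call to `drop_duplicates`, and the weighted average would have to be repeated for every per-game column that is kept.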


Checking whether dropping LeBron James makes a difference --> he is an outlier...

In [19]:
# dropping the king
# (the name must match the spelling in the 'Player' column exactly, otherwise no row is dropped)
# df = df.drop(df[df['Player'] == 'LeBron James'].index)
# it seems that it doesn't make a difference if I drop him or not
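
Whether a player really is an outlier can be checked with numbers instead of by name. The sketch below uses Tukey's 1.5 × IQR rule, which is a swapped-in technique and not something this notebook applies; the points-per-game values are hypothetical, with the last one made deliberately extreme.

```python
import pandas as pd

# Hypothetical points-per-game values; the last one is deliberately extreme.
pts = pd.Series([4.1, 8.6, 9.5, 10.4, 21.6, 60.0], name="PTS")

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = pts.quantile(0.25), pts.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag every value outside the fences.
is_outlier = (pts < lower) | (pts > upper)
print(pts[is_outlier])
```

Applied to the real data, `pts` would be `df['PTS']`, and the flagged rows could be inspected before deciding whether to drop anything.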


### Checking for missing values

In [20]:
# Check for missing data
missing_values = df.isnull().sum()
print(missing_values)
# Data is complete --> all good!

Rk        0
Player    0
Pos       0
Age       0
Tm        0
G         0
GS        0
MP        0
FG        0
FGA       0
FG%       0
3P        0
3PA       0
3P%       0
2P        0
2PA       0
2P%       0
eFG%      0
FT        0
FTA       0
FT%       0
ORB       0
DRB       0
TRB       0
AST       0
STL       0
BLK       0
TOV       0
PF        0
PTS       0
dtype: int64


## Scaling & Normalizing the dataset

* Carsten mentioned that it might be necessary to normalize my values to get more robust outcomes.
* The problem at the moment: when I normalize, the outcome is not better at first sight, and the predictions are normalized as well, which means I need to invert the scaling to get realistic PTS averages. It's complex, and is it really necessary?
* I need to ask Carsten; for the moment I won't use it.
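
Getting scaled predictions back into real PTS values does not have to be complicated. One option, sketched here with hypothetical values, is to give the target its own `StandardScaler` and call its `inverse_transform` on the predictions. The variable names and the stand-in "predictions" are assumptions for illustration; no model is trained in this sketch.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical PTS values, shaped (n_samples, 1) as scikit-learn expects.
pts = np.array([[4.1], [8.6], [9.5], [10.4], [21.6]])

# Use a dedicated scaler for the target so it can be inverted on its own.
target_scaler = StandardScaler()
pts_scaled = target_scaler.fit_transform(pts)

# Stand-in for model output: here simply the scaled targets themselves.
predictions_scaled = pts_scaled.copy()

# Map the predictions back to points per game.
predictions = target_scaler.inverse_transform(predictions_scaled)
print(predictions.ravel())
```

If the model turns out to be a plain ordinary least squares regression without regularization, standardizing the features changes the coefficients but not the fitted predictions, so skipping the scaling step may be a reasonable choice.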

In [21]:
# Parametrize the target variable
target = 'PTS'

# Create a StandardScaler object (standardizes each column to zero mean and unit variance)
scaler = StandardScaler()

# Select the variables to standardize (note: this list includes the target, PTS)
vars_to_normalize = ['PTS', 'FGA', '2PA', 'FTA', 'MP', 'TOV', 'AST', '3PA', 'GS', 'DRB']

# Standardize the selected variables on a copy of the DataFrame
df_normalized = df.copy()
df_normalized[vars_to_normalize] = scaler.fit_transform(df_normalized[vars_to_normalize])

# Inspect the standardized DataFrame (not displayed here, since it is not the last expression in the cell)
df_normalized

# Correlation between the numerical columns and the target variable (PTS);
# numeric_only=True skips string columns such as Player, Pos, and Tm
df_normalized.corr(numeric_only=True)[target].sort_values(ascending=False)


PTS     1.000000
FG      0.992316
FGA     0.982662
2PA     0.936654
2P      0.918001
FT      0.911350
FTA     0.901151
MP      0.880077
TOV     0.866696
GS      0.759026
AST     0.745556
3PA     0.735407
DRB     0.732581
3P      0.716671
TRB     0.654074
STL     0.640825
PF      0.613166
G       0.551349
FT%     0.419345
BLK     0.363295
ORB     0.317511
3P%     0.242823
eFG%    0.144796
FG%     0.138224
Age     0.121805
2P%     0.119473
Rk     -0.039788
Name: PTS, dtype: float64

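The result is essentially the same as for the unnormalized dataset. That is expected, since Pearson correlation is unchanged by the shift-and-scale transformation that StandardScaler applies. So we don't change anything...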
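
That the correlations barely move is not a coincidence: Pearson correlation does not change when a column is shifted by a constant and divided by a positive constant, which is all that standardization does. A quick check on synthetic data (the column names are borrowed from the dataset, the values are random and purely illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic data: PTS depends roughly linearly on FGA, plus noise.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"FGA": rng.uniform(1, 20, size=50)})
toy["PTS"] = 1.2 * toy["FGA"] + rng.normal(0, 2, size=50)

# Standardize by hand: subtract the mean, divide by the standard deviation.
standardized = (toy - toy.mean()) / toy.std(ddof=0)

# Compare the correlation before and after standardization.
raw_corr = toy.corr().loc["FGA", "PTS"]
scaled_corr = standardized.corr().loc["FGA", "PTS"]
print(raw_corr, scaled_corr)
```

Both printed values agree up to floating-point precision, so a correlation ranking is not a useful way to judge whether scaling helps.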

In [22]:
# zero_ppg_players = df[df['PTS'] == 0]
# zero_ppg_players

## Encoding the data

* In this case, using one-hot encoding

**Encoding:** A player's position is a string variable, but for modeling it is better represented numerically (binary 0 or 1 dummy variables). I believe one-hot encoding is the right way to make this information usable. For example, does a point guard score more than a center on average? --> Maybe!

In [23]:
# Perform one-hot encoding on the 'Pos' column
# (dtype=int yields 0/1 columns; newer pandas versions default to True/False)
encoded_df = pd.get_dummies(df['Pos'], prefix='Pos', dtype=int)

# Concatenate the encoded columns with the original DataFrame
df_encoded = pd.concat([df, encoded_df], axis=1)

# Drop the original 'Pos' column, which is no longer needed
df_encoded = df_encoded.drop(columns='Pos')

# Print the first rows of the updated DataFrame with encoded columns
print(df_encoded.head())

   Rk            Player  Age   Tm   G  GS    MP   FG   FGA    FG%  ...  TOV  \
0   1  Precious Achiuwa   23  TOR  33   9  23.0  4.0   8.2  0.489  ...  1.2   
1   2      Steven Adams   29  MEM  42  42  27.0  3.7   6.3  0.597  ...  1.9   
2   3       Bam Adebayo   25  MIA  52  52  35.3  8.6  15.7  0.546  ...  2.6   
3   4      Ochai Agbaji   22  UTA  35   1  14.0  1.5   3.2  0.486  ...  0.3   
4   5      Santi Aldama   22  MEM  52  18  22.0  3.4   7.0  0.486  ...  0.7   

    PF   PTS  Pos_C  Pos_PF  Pos_PG  Pos_SF  Pos_SF-SG  Pos_SG  Pos_SG-PG  
0  2.2  10.4      1       0       0       0          0       0          0  
1  2.3   8.6      1       0       0       0          0       0          0  
2  2.8  21.6      1       0       0       0          0       0          0  
3  1.4   4.1      0       0       0       0          0       1          0  
4  1.9   9.5      0       1       0       0          0       0          0  

[5 rows x 36 columns]


### Player positions do not provide robust results in the regression

* **As a short explanation:** I encoded the player position column to use it in the regression and see whether a player's position plays a role in their average PTS. None of the position variables reached a sufficient t-value to be considered, so I am keeping the original, unencoded dataset.


**These were the positions:**
'Pos_C','Pos_PF', 'Pos_PG', 'Pos_SF', 'Pos_SF-SG', 'Pos_SG', 'Pos_SG-PG'
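
One thing that may be worth ruling out before discarding the position variables: if all seven dummy columns enter a regression together with an intercept, they sum to 1 in every row and are perfectly collinear with the intercept (the "dummy variable trap"), which can make individual t-values unreliable. Whether this applied to the regression mentioned above is an assumption, since that regression is not shown in this notebook. A common remedy is to drop one category as the baseline, sketched here on hypothetical positions:

```python
import pandas as pd

# Hypothetical positions for five players.
toy = pd.DataFrame({"Pos": ["C", "PG", "SG", "C", "PF"]})

# drop_first=True omits the first category (here "C"), which becomes the baseline.
dummies = pd.get_dummies(toy["Pos"], prefix="Pos", drop_first=True, dtype=int)
print(dummies.columns.tolist())
print(dummies)
```

Each remaining coefficient is then read relative to the baseline position, so a row of all zeros stands for a center in this toy example.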

In [24]:
# review whole dataset before exporting it
# df_encoded

## Step 5: Saving the preprocessed DataFrame back to CSV

* This file will be the one used for the model training in 03.


In [25]:
# Specify the file path where you want to save the CSV file
file_path = '/Users/robertkurtz/Desktop/NBA 2023 Dataplayground/data/df_preprocessed.csv'

# Save the deduplicated DataFrame (df, without the position dummies) to CSV
df.to_csv(file_path, index=False)