# Chapter 1 & Chapter 2

### Data Preprocessing

- Comes after data cleaning and Exploratory Data Analysis (EDA)
- Prerequisite for modeling
- Helps to:
    - produce more reliable results
    - Improve model performance
- Inspect the dataset
- See summary statistics
- Deal with missing values
- Convert to specified column types
- Split into training and testing sets (take class imbalance into account)
    - Data leakage: information from non-training data is used to train the model, so split before fitting any transformation
- Standardize data: transform continuous numeric data so that it looks closer to normally distributed and sits on a comparable scale
    - A feature with much higher variance than the others can bias the model toward that feature
    - Large differences in scale among features can lead to a poor fit (for example, underfitting)
    - Methods: log normalization, standard scaling
    - Tree-based models can be trained without standardization
    - Other models, such as linear models, and high-dimensional datasets generally require standardization
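
To see why the split should take class imbalance into account, here is a minimal sketch on made-up data (the `feature` and `label` names and the 90/10 class ratio are assumptions for illustration only). With `stratify=y`, both splits keep roughly the same class proportions as the full dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 90 samples of class 0, 10 of class 1.
X = pd.DataFrame({"feature": np.arange(100)})
y = pd.Series([0] * 90 + [1] * 10, name="label")

# stratify=y asks the splitter to preserve the class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

# Compare class proportions in the training and testing labels.
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```

Without `stratify`, a small test set could end up with very few (or no) minority-class samples, which makes evaluation unreliable.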

```
# Inspect dataset
df.head()
df.info()
df.describe() # Summary stats

# Deal with missing values (each call returns a new DataFrame; df itself is unchanged)
df.drop([1, 2, 3]) # Drop the rows with index labels 1, 2 and 3
df.dropna(thresh=2) # Keep only rows with at least 2 non-missing values
df.dropna(subset=['C']) # Drop rows where column 'C' is missing

# Convert column types
df["C"] = df["C"].astype("float")

# Verify class imbalance
y.value_counts()

# Split into training and testing data (Consider class imbalance)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

# Standardize dataset
df.var() # Columns with much higher variance than the others are candidates for log normalization
```
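
The standardization step can be sketched as follows. This is a minimal illustration, not a definitive recipe: the toy columns `age` and `income`, the random labels, and the derived `income_log` column are all assumptions. It applies log normalization with `np.log` to the high-variance column, then fits scikit-learn's `StandardScaler` on the training split only, so that no test-set statistics leak into training.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: "income" has a far larger variance than "age".
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=200).astype(float),
    "income": rng.lognormal(mean=10, sigma=1, size=200),
})
y = pd.Series(rng.integers(0, 2, size=200), name="label")

# Log normalization: the natural log compresses the high-variance column.
# np.log requires strictly positive values.
df["income_log"] = np.log(df["income"])
print(df[["income", "income_log"]].var())

# Split first, so the scaler never sees the test data.
X = df[["age", "income_log"]]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

# Standard scaling: learn mean and standard deviation from the training set only,
# then apply the same transformation to the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
```

Calling `fit_transform` on the full dataset before splitting would be an instance of the data leakage described above, because the test rows would influence the learned mean and standard deviation.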