# General Flow for Training/Fitting Models

In [None]:
!wget -q https://github.com/DM-GY-9103-2024F-H/9103-utils/raw/main/src/data_utils.py

In [None]:
import pandas as pd

from sklearn.model_selection import train_test_split

from data_utils import PCA, RandomForestClassifier, StandardScaler
from data_utils import classification_error, display_confusion_matrix
from data_utils import object_from_json_url

### 3 Stages
- Data Prep: Encoding, Scaling, PCA, sometimes Splitting into train/test datasets
- Modeling: `fit()` classifier
- Evaluation: predict and measure error

#### Data Prep:
Do we need to split our data, or is it already split into train/test sets?

If it's already split, we prepare the Encoding, Scaling and PCA objects using the `train` data (usually with the `fit_transform()` function), and then use those same objects to encode, scale and apply PCA to the `test` data (usually with the `transform()` function).
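As a minimal sketch of that pattern, the cell below uses scikit-learn's own `StandardScaler` and `PCA` on synthetic data. The array names and sizes are made up for illustration, and the `data_utils` wrappers imported above may take different inputs (for example DataFrames), so treat this as an approximation of the idea rather than the course API.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 80 train rows, 20 test rows, 4 features
rng = np.random.default_rng(0)
train_features = rng.normal(loc=5.0, scale=2.0, size=(80, 4))
test_features = rng.normal(loc=5.0, scale=2.0, size=(20, 4))

scaler = StandardScaler()
pca = PCA(n_components=2)

# fit_transform(): learn means, variances and components from train, then apply them
train_scaled = scaler.fit_transform(train_features)
train_pca = pca.fit_transform(train_scaled)

# transform(): reuse what was learned from train, without refitting on test
test_scaled = scaler.transform(test_features)
test_pca = pca.transform(test_scaled)

print(train_pca.shape, test_pca.shape)
```

Note that the scaled `test` columns will usually not have a mean of exactly zero, because the scaler's parameters came from the `train` rows only.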

If the data is not split into two datasets, we could first split it and then follow the steps above. Alternatively, we could perform Encoding, Scaling and PCA with `fit_transform()` on the entire dataset and only then split the already encoded, scaled, PCA'ed data. This is a bit easier to perform, but it biases the encoder, scaler and PCA objects, and in turn the model, because they get to see information from rows that later end up in the `test` set.
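The two orderings can be compared side by side. This is only a sketch on synthetic data using scikit-learn's `StandardScaler` and `train_test_split()`; the variable names are hypothetical, and the size of the difference will depend on the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 100 rows, 3 numeric features
rng = np.random.default_rng(1)
features = rng.normal(loc=10.0, scale=3.0, size=(100, 3))

# Option A: process, then split. The scaler sees every row, including future test rows.
scaler_all = StandardScaler()
scaled_all = scaler_all.fit_transform(features)
train_a, test_a = train_test_split(scaled_all, test_size=0.2, random_state=0)

# Option B: split, then process. The scaler only ever sees the training rows.
train_raw, test_raw = train_test_split(features, test_size=0.2, random_state=0)
scaler_train = StandardScaler()
train_b = scaler_train.fit_transform(train_raw)
test_b = scaler_train.transform(test_raw)

# The two scalers learn slightly different per-column means
print(scaler_all.mean_)
print(scaler_train.mean_)
```

The gap between the two sets of means is the information from the `test` rows that leaks into the scaler in Option A.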

#### Modeling
Once we have `train` and `test` datasets that have been encoded, scaled, PCA'ed, we can use the `train` dataset to fit a supervised model (classifier, regression, etc).

Here we will usually call a `fit()` function with the training dataset's features and, separately, its labels or outcome variable values. Something like `fit(features, labels)`.
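A minimal sketch of that call, using scikit-learn's `RandomForestClassifier` directly on synthetic data. The data, the `n_estimators` value and the variable names are assumptions for illustration; the `data_utils` version of the classifier may accept different argument types.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (hypothetical): two classes centered at different points
rng = np.random.default_rng(2)
class_0 = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
class_1 = rng.normal(loc=4.0, scale=1.0, size=(50, 2))

train_features = np.vstack([class_0, class_1])
train_labels = np.array([0] * 50 + [1] * 50)

# Features and labels are passed separately to fit()
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(train_features, train_labels)

# After fitting, the model remembers which labels it saw
print(model.classes_)
```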

#### Evaluation
We have a model we trained/fitted with the `train` dataset. Now we can measure how well it actually performs once it's used without the correct labels.

Here we usually call `predict()` with a dataset's features to get label or regression predictions.

We want to call `predict()` for both the `train` and `test` dataset, and then measure how close those predictions are to the actual labels and values that we have in our dataset.

Evaluating with the `train` dataset will tell us if the model is capable of learning anything about the data. Evaluating with the `test` dataset will tell us if the model is capable of learning patterns and trends beyond the data that is fed to it.

It's common for the model to perform better with the `train` data since it was trained using that data and labels, but the `test` dataset error is what's more important because it will tell us what kind of error to expect from data that the model hasn't seen.
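The evaluation step could look something like the sketch below. It uses scikit-learn's `accuracy_score()` and defines error as the fraction of wrong predictions; that definition, the synthetic data and the variable names are assumptions, and the `classification_error()` helper from `data_utils` may compute or report its value differently.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): two partly overlapping classes
rng = np.random.default_rng(3)
features = np.vstack([
    rng.normal(loc=0.0, scale=1.5, size=(100, 2)),
    rng.normal(loc=2.0, scale=1.5, size=(100, 2)),
])
labels = np.array([0] * 100 + [1] * 100)

train_x, test_x, train_y, test_y = train_test_split(
    features, labels, test_size=0.25, random_state=0, stratify=labels
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(train_x, train_y)

# predict() only receives features; the labels are used afterwards to score it
train_pred = model.predict(train_x)
test_pred = model.predict(test_x)

# Error here is the fraction of rows that were classified incorrectly
train_error = 1.0 - accuracy_score(train_y, train_pred)
test_error = 1.0 - accuracy_score(test_y, test_pred)

print(f"train error: {train_error:.3f}")
print(f"test error:  {test_error:.3f}")
```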

### Example

Classifying penguins based on measurements.

Let's load a dataset and look.

In [None]:
PENGUIN_URL = "https://raw.githubusercontent.com/DM-GY-9103-2024F-H/9103-utils/refs/heads/main/datasets/json/penguins.json"
penguin_data = object_from_json_url(PENGUIN_URL)

display(penguin_data)

It doesn't have separate train and test data, so we can either:

#### Pre-process and then split:

<img src="./imgs/datasplit-00.jpg" width="720px"/>

OR

#### Split and then process:

<img src="./imgs/datasplit-01.jpg" width="720px"/>

In [None]:
# TODO: Put in DataFrames

# TODO: Process + Split

# TODO: Split + Process

#### Check Sizes of DataFrames / Datasets

In [None]:
# TODO: check sizes, look at distributions

### Split the Data

Using `train_test_split()`
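For reference, here is a small sketch of how `train_test_split()` behaves with a DataFrame. The table, its column names and the `test_size` value are hypothetical stand-ins, not the actual penguin data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in table: two measurements and one label column
df = pd.DataFrame({
    "length_mm": [39.1, 46.5, 49.3, 38.2, 45.1, 50.0, 37.8, 47.2, 51.3, 40.3],
    "mass_g": [3750, 5200, 3800, 3600, 5000, 3900, 3400, 5400, 3700, 3550],
    "label": ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"],
})

# Separate the features from the label column before splitting
features = df.drop(columns=["label"])
labels = df["label"]

# Features and labels are shuffled and split together, so rows stay aligned
train_x, test_x, train_y, test_y = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

print(train_x.shape, test_x.shape)
print(train_y.shape, test_y.shape)
```

Setting `random_state` makes the shuffle repeatable, which helps when comparing results between runs.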

In [None]:
# TODO: split with train_test_split()

### Check Sizes

In [None]:
# TODO: Check sizes

### Model/Fit

We can train our model now. We're going to use a `RandomForestClassifier` and `fit()` it with the training data.

In [None]:
# TODO: fit RandomForestClassifier

### Evaluate

We can now run predictions for both `train` and `test` data and measure error.

In [None]:
# TODO: predict() train and test

Before measuring the error we can check to see if these predictions have the right shapes and values.

In [None]:
# TODO: check sizes

### Measure Error

In [None]:
# TODO: measure classification error with classification_error()

### Look at Confusion (Matrix)

`display_confusion_matrix(labels, predictions, display_labels=unique_labels)`
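To get a sense of what the matrix contains, the sketch below builds one with scikit-learn's `confusion_matrix()` from a handful of made-up labels and predictions. The values are hypothetical, and `display_confusion_matrix()` from `data_utils` presumably draws a plot rather than returning an array like this.

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions for three classes
labels = ["A", "A", "B", "B", "C", "C"]
predictions = ["A", "B", "B", "B", "C", "A"]
unique_labels = ["A", "B", "C"]

# Rows are the true labels, columns are the predicted labels
matrix = confusion_matrix(labels, predictions, labels=unique_labels)
print(matrix)

# Values on the diagonal are the correct predictions
correct = int(matrix.trace())
print(correct, "of", len(labels), "correct")
```

Anything off the diagonal shows which classes get confused with which: here one `A` was predicted as `B`, and one `C` was predicted as `A`.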

In [None]:
# TODO: look at confusion matrices