# General Flow for Training/Fitting Models

In [None]:
!wget -q https://github.com/DM-GY-9103-2024F-H/9103-utils/raw/main/src/data_utils.py

In [None]:
import pandas as pd

from sklearn import datasets
from sklearn.model_selection import train_test_split

from data_utils import PCA, RandomForestClassifier, StandardScaler
from data_utils import classification_error, display_confusion_matrix

### 3 Stages
- Data Prep: Encoding, Scaling, PCA, sometimes Splitting into train/test datasets
- Modeling: `fit()` classifier
- Evaluation: predict and measure error

#### Data Prep:
Do we need to split our data, or is it already split into train/test sets?

If it's already split we prepare the Encoding, Scaling, PCA objects using the `train` data (usually with the `fit_transform()` function), and then we use those same objects to encode, scale, PCA the `test` data (usually with the `transform()` function).
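The difference between `fit_transform()` and `transform()` can be sketched with a toy scaler. This is only an illustration using the standard library: `TinyScaler` is a hypothetical class, not the `StandardScaler` from `data_utils`, and the numbers are made up.

```python
from statistics import fmean, pstdev


class TinyScaler:
    """Hypothetical one-column scaler, for illustration only."""

    def fit(self, values):
        # Learn the mean and (population) standard deviation from the data.
        self.mean_ = fmean(values)
        self.std_ = pstdev(values)
        return self

    def transform(self, values):
        # Apply the statistics learned in fit(); nothing is re-learned here.
        return [(v - self.mean_) / self.std_ for v in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)


train_values = [1.0, 2.0, 3.0, 4.0, 5.0]
test_values = [3.0, 6.0]

scaler = TinyScaler()
train_scaled = scaler.fit_transform(train_values)  # learns mean and std from train
test_scaled = scaler.transform(test_values)        # reuses the train mean and std
```

Note that the `test` values are scaled with the mean and standard deviation of the `train` values, so the scaled `test` column will generally not have a mean of exactly $0$.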

If the data is not split into two datasets, we could first split it and repeat the steps above. Alternatively, we could perform Encoding, Scaling, PCA with `fit_transform()` on the entire dataset and only then split the already encoded, scaled, PCA'ed data. This is a bit easier to perform, but it lets information from the eventual `test` rows leak into the encoder, scaler and PCA objects, and in turn into the model, so the measured `test` error can look slightly better than it should.

#### Modeling
Once we have `train` and `test` datasets that have been encoded, scaled, PCA'ed, we can use the `train` dataset to fit a supervised model (classifier, regression, etc).

Here we will usually call a `fit()` function with the training dataset's features and, separately, its labels or outcome variable values. Something like `fit(features, labels)`.
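To make the `fit(features, labels)` pattern concrete, here is a rough sketch of a toy classifier written with the standard library only. `ToyCentroidClassifier` is a hypothetical class invented for this illustration (it is not part of `data_utils`), and the data is made up.

```python
from math import dist


class ToyCentroidClassifier:
    """Hypothetical toy classifier: predicts the label of the closest class average."""

    def fit(self, features, labels):
        # Group the training rows by label.
        rows_by_label = {}
        for row, label in zip(features, labels):
            rows_by_label.setdefault(label, []).append(row)
        # Store one average point (centroid) per label.
        self.centroids_ = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in rows_by_label.items()
        }
        return self

    def predict(self, features):
        # Only features are needed here; labels were used during fit().
        return [
            min(self.centroids_, key=lambda label: dist(row, self.centroids_[label]))
            for row in features
        ]


train_features = [[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]]
train_labels = [0, 0, 1, 1]

model = ToyCentroidClassifier().fit(train_features, train_labels)
predictions = model.predict([[1.1, 1.0], [4.1, 3.9]])
```

The features and labels are passed separately to `fit()`, while `predict()` only ever receives features.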

#### Evaluation
We have a model we trained/fitted with the `train` dataset. Now we can measure how well it actually performs once it's used without the correct labels.

Here we usually call `predict()` with a dataset's features to get label or regression predictions.

We want to call `predict()` for both the `train` and `test` dataset, and then measure how close those predictions are to the actual labels and values that we have in our dataset.

Evaluating with the `train` dataset will tell us if the model is capable of learning anything about the data. Evaluating with the `test` dataset will tell us if the model is capable of learning patterns and trends beyond the data that is fed to it.

It's common for the model to perform better with the `train` data since it was trained using that data and labels, but the `test` dataset error is what's more important because it will tell us what kind of error to expect from data that the model hasn't seen.
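One common way to measure classification error is the fraction of predictions that are wrong. The sketch below assumes that definition; `error_rate` is a hypothetical helper and is not necessarily how `classification_error` from `data_utils` computes its value. The labels and predictions are made up.

```python
def error_rate(labels, predictions):
    """Fraction of predictions that differ from the true labels (hypothetical helper)."""
    labels = list(labels)
    predictions = list(predictions)
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    wrong = sum(1 for y, p in zip(labels, predictions) if y != p)
    return wrong / len(labels)


# Made-up labels and predictions, just to show the comparison.
train_error = error_rate([0, 0, 1, 1, 2, 2, 2, 1], [0, 0, 1, 1, 2, 2, 2, 1])
test_error = error_rate([0, 1, 2, 2], [0, 1, 1, 2])

print(train_error, test_error)  # 0.0 0.25
```

In this made-up case the model is perfect on `train` but gets one of four `test` rows wrong, which is the typical pattern described above.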

### Example

Classifying irises based on measurements.

Let's load a dataset and look.

In [None]:
iris = datasets.load_iris()
display(iris)

This dataset is an object with a bunch of keys, but if we look at the keys from the output above, we can see that the ones we want to look at are `data`, `target` and `feature_names`.

In [None]:
iris.keys()

It doesn't have separate train and test data. So, let's use this flow:

<img src="./imgs/datasplit-00.jpg" width="720px"/>

Let's create a dataframe from all the data:

In [None]:
iris_df = pd.DataFrame(iris["data"], columns=iris["feature_names"])
iris_df["label"] = iris["target"]
iris_df["flower type"] = iris_df["label"].apply(lambda x: iris["target_names"][x])

display(iris_df)

We have $150$ samples of flowers, each of which has $4$ measurements, plus a label which indicates the type of flower.

Although it might adversely affect our modeling a little bit, let's process this data before splitting it into $2$ datasets.

### Prepare Data

Let's scale the data using `StandardScaler` and run `PCA`. We need objects for both of those operations.

The dataset has very few features, but let's use `PCA` to reduce the number of features from $4$ to $2$.

In [None]:
irisScaler = StandardScaler()
irisPCA = PCA(n_components=2)

Let's process the data using our scaler and PCA objects.

We only do this to the features and not the labels.

In [None]:
iris_scaled_df = irisScaler.fit_transform(iris_df.drop(columns=["label", "flower type"]))
iris_pca_df = irisPCA.fit_transform(iris_scaled_df)

Let's look at these results, just to make sure they make sense.

After `StandardScaler` our features should have means of $0$ and standard deviations of $1$, and the dataset should have $150$ rows and $4$ columns. If the features are roughly normally distributed, almost all of their values should fall between $-3$ and $3$.
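One detail to keep in mind when checking the standard deviations: pandas' `.std()` computes the sample standard deviation by default (it divides by $n - 1$), while scikit-learn's `StandardScaler` scales with the population standard deviation (it divides by $n$). Assuming the `StandardScaler` from `data_utils` behaves like scikit-learn's, the displayed values may be slightly above $1$, about $\sqrt{150/149} \approx 1.003$ for $150$ rows. A small sketch with made-up numbers and the standard library:

```python
from math import sqrt
from statistics import fmean, pstdev, stdev

# Made-up measurements standing in for one feature column.
values = [4.9, 5.1, 5.8, 6.3, 6.7, 7.0]

mean = fmean(values)
population_std = pstdev(values)  # divides by n
scaled = [(v - mean) / population_std for v in values]

n = len(scaled)
print(pstdev(scaled))     # population std of the scaled column, very close to 1
print(stdev(scaled))      # sample std (divides by n - 1), slightly larger
print(sqrt(n / (n - 1)))  # the expected ratio between the two
```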

In [None]:
display(iris_scaled_df)
display(iris_scaled_df.min(), iris_scaled_df.max(), iris_scaled_df.mean(), iris_scaled_df.std())

After PCA our dataset should have $150$ rows and only $2$ columns.

The columns should still have means close to $0$, but their standard deviations won't be $1$ anymore. That reflects how `PCA` works: it gives more importance to the directions that carry more information about our data, so the first column should have a larger spread than the second.

In [None]:
display(iris_pca_df)

This looks ok.

Before we train our model we have to split our data into `train` and `test` datasets.

### Split the Data

The features are coming from the `iris_pca_df` `DataFrame`, while the labels are coming from the original `iris_df` `DataFrame`. Since we didn't do any kind of re-ordering of the data, the order of the rows is consistent.

In [None]:
train_features, test_features, train_labels, test_labels = train_test_split(iris_pca_df, iris_df["label"], test_size=0.2, random_state=1010)

This should've split our dataset into $4$ objects: $2$ feature `DataFrames` and $2$ sets of labels. The feature `DataFrames` should each have $2$ columns, and each set of labels is a single column of values.

Both of the test objects should have $30$ rows ($20\%$ of $150$), and both of the train objects should have $120$ rows.
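The arithmetic behind those sizes can be sketched as follows. This assumes the rounding rule that scikit-learn's `train_test_split` uses, to our understanding, for a fractional `test_size` when no `train_size` is given: the test count is rounded up and the rest goes to train. `split_sizes` is a hypothetical helper written for this illustration.

```python
from math import ceil


def split_sizes(n_samples, test_size):
    """Row counts for a fractional test_size (sketch of the rounding rule)."""
    n_test = ceil(test_size * n_samples)  # test count is rounded up
    n_train = n_samples - n_test          # the remaining rows go to train
    return n_train, n_test


print(split_sizes(150, 0.2))  # (120, 30)
```

The rounding only matters when the fraction doesn't divide the dataset evenly, for example `split_sizes(10, 0.25)` gives $7$ train rows and $3$ test rows.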

In [None]:
print(len(train_features), len(train_labels))
print(len(test_features), len(test_labels))
print(len(train_features.columns), len(test_features.columns))

### Model/Fit

We can train our model now. We're going to use a `RandomForestClassifier` and `fit()` it with the training data.

In [None]:
irisClassifier = RandomForestClassifier()

irisClassifier.fit(train_features, train_labels)

### Evaluate

We can now run predictions for both `train` and `test` data and measure error.

In [None]:
train_predictions = irisClassifier.predict(train_features)
test_predictions = irisClassifier.predict(test_features)

Before measuring the error we can check to see if these predictions have the right shapes and values.

In [None]:
print(len(train_predictions), len(test_predictions))

In [None]:
train_error = classification_error(train_labels, train_predictions)
test_error = classification_error(test_labels, test_predictions)

train_error, test_error

## With previously-separated data

Let's reload the dataset, put it into a `DataFrame` and split it before we do any of the pre-processing.

So, following a flow like this now:

<img src="./imgs/datasplit-01.jpg" width="720px"/>

First, we'll put the `Python` object into a `DataFrame` like before.


In [None]:
iris = datasets.load_iris()
iris_df = pd.DataFrame(iris["data"], columns=iris["feature_names"])
iris_df["label"] = iris["target"]
iris_df["flower type"] = iris_df["label"].apply(lambda x: iris["target_names"][x])
display(iris_df)

### Split the Data

Now, before preprocessing the data, let's split it into $2$ datasets and check their content.

In [None]:
train_df, test_df = train_test_split(iris_df, test_size=0.2, random_state=1010)
display(train_df, test_df, len(test_df))

### Pre-Processing
Let's do our pre-processing (scaling, PCA). It's very similar, but now we have $2$ `DataFrames` to prepare.

The `train` data is used to prepare/fit the scaling and PCA objects, and then the `test` data just has to be transformed by those objects.

In [None]:
# same objects
irisScaler = StandardScaler()
irisPCA = PCA(n_components=2)

# fit and transform the training data, like before
train_scaled_df = irisScaler.fit_transform(train_df.drop(columns=["label", "flower type"]))
train_pca_df = irisPCA.fit_transform(train_scaled_df)

We can check the shape of the resulting `DataFrames`.

In [None]:
display(train_scaled_df)
display(train_pca_df)

Now we process the `test` data. We have already prepared the scaler and PCA objects.

We just have to use them to transform the `test` data.

In [None]:
test_scaled_df = irisScaler.transform(test_df.drop(columns=["label", "flower type"]))
test_pca_df = irisPCA.transform(test_scaled_df)

In [None]:
display(test_scaled_df)
display(test_pca_df)

### Train/Model/Fit

We have our `train` data ready, so we can now use it to `fit()` a classifier model.

We only `fit()` the `train` data, not the `test` data.

We don't have separate label `DataFrames` like before, but we can easily grab them from the original `train` `DataFrame`, `train_df`.

We can check that the features and the labels have the same length.

In [None]:
len(train_pca_df), len(train_df["label"])

And now we train the classifier.

In [None]:
irisClassifier = RandomForestClassifier()

irisClassifier.fit(train_pca_df, train_df["label"])

### Evaluate

We can now get predictions and measure error.

In [None]:
train_predictions = irisClassifier.predict(train_pca_df)
test_predictions = irisClassifier.predict(test_pca_df)

Check if results have sensible shapes.

In [None]:
print(len(train_predictions), len(test_predictions))

Measure the error like before, except that now we get the labels from the original `DataFrames`.

In [None]:
train_error = classification_error(train_df["label"], train_predictions)
test_error = classification_error(test_df["label"], test_predictions)

train_error, test_error

In [None]:
display_confusion_matrix(train_df["label"], train_predictions, display_labels=iris["target_names"])
display_confusion_matrix(test_df["label"], test_predictions, display_labels=iris["target_names"])
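
A confusion matrix counts, for each true class, how many samples were predicted as each class. The sketch below shows that counting with the standard library. `confusion_counts` is a hypothetical helper, not the implementation behind `display_confusion_matrix`, and the labels and predictions are made up.

```python
from collections import Counter


def confusion_counts(labels, predictions, classes):
    """Rows are true classes, columns are predicted classes (hypothetical helper)."""
    pair_counts = Counter(zip(labels, predictions))
    return [[pair_counts[(true, pred)] for pred in classes] for true in classes]


# Made-up labels and predictions for three classes.
labels = [0, 0, 1, 1, 1, 2, 2, 2]
predictions = [0, 0, 1, 2, 1, 2, 2, 1]

for row in confusion_counts(labels, predictions, classes=[0, 1, 2]):
    print(row)
```

Values on the diagonal are correct predictions. Anything off the diagonal shows which classes the model mixes up.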