# Scikit-learn

Main features of `scikit-learn`:

- **Preprocessing**: scaling, encoding, missing value imputation, ...
- **Supervised learning**: regression, classification, ...
- **Unsupervised learning**: clustering, dimension reduction, ...
- **Model evaluation**: measuring model performance based on some metrics.
- **Model selection**: choosing the best model or set of hyperparameters based on performance on validation data.
- **Pipeline and compose** (will not be covered here)

The fundamental object in `scikit-learn` is `BaseEstimator`. Every estimator inherits from it, and mixin classes add the behavior for each role. The hierarchy of the objects is roughly as follows:
```text
BaseEstimator
├── TransformerMixin      ← Data Transformation (preprocessing, dimension reduction, ...)
├── ClassifierMixin       ← Estimator (classification)
├── RegressorMixin        ← Estimator (regression)
├── ClusterMixin, etc.    ← Estimator (clustering)
└── ...
```
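As a quick sanity check of this hierarchy, the sketch below inspects a few estimators with `isinstance`. The three classes used here are only illustrative choices, not part of the original notes.

```python
from sklearn.base import BaseEstimator, ClassifierMixin, ClusterMixin, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
clf = LogisticRegression()
km = KMeans(n_clusters=2, n_init=10)

# Every estimator derives from BaseEstimator ...
is_base = [isinstance(obj, BaseEstimator) for obj in (scaler, clf, km)]

# ... and a mixin marks the role it plays.
is_transformer = isinstance(scaler, TransformerMixin)
is_classifier = isinstance(clf, ClassifierMixin)
is_clusterer = isinstance(km, ClusterMixin)
print(is_base, is_transformer, is_classifier, is_clusterer)
```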

In what follows, I will go over various objects and features in `scikit-learn`, following the standard ML workflow.

## 0. Data Split
- In ML, it's important to evaluate how well a model performs on unseen (or new) data. 
- To do this, we split the dataset into a training set (for fitting the model) and a test set (for final evaluation). Later, we will also split the training set further (validation, cross-validation) for model selection.
- After this step, you should **not** use any information about the test set.
- Before splitting the data, we assume that any transformations that do not depend on the training data — such as **type conversion** or **dropping rows with missing values** — have already been applied.

In `scikit-learn`, the `train_test_split` function from the `sklearn.model_selection` module is used to randomly split the data into training and test sets based on the `test_size` parameter. Setting the `random_state` ensures that the split is reproducible.


```python
# Code template for data split
from sklearn.model_selection import train_test_split

X = DataFrame_of_predictors
y = DataFrame_of_responses # Ignore if unsupervised

TEST_PROPORTION = 0.2 # any float between 0 and 1
SEED = 100 # any integer

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_PROPORTION, random_state=SEED)

# For unsupervised learning,
X_train, X_test = train_test_split(X, test_size=TEST_PROPORTION, random_state=SEED)
```
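A concrete run of the template might look like the sketch below. The toy data, column names, and sizes are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 3 predictors, binary response (made up for illustration)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "x3"])
y = pd.Series(rng.integers(0, 2, size=100), name="y")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100
)
print(X_train.shape, X_test.shape)

# Re-running with the same seed reproduces the same split
X_train_again, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=100)
same_split = X_train.index.equals(X_train_again.index)
```

For classification with imbalanced classes, passing `stratify=y` keeps the class proportions similar in both sets.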


## 1. Preprocessing
- After splitting the data into training and test sets, we apply preprocessing steps such as:
  -  imputing missing values
  -  scaling numerical features
  -  encoding categorical variables.
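- It is important that all transformations that learn from the data (such as computing means or encoding mappings) are fit only on the training set. Fitting on the full data would leak information about the test set into training, which makes the final evaluation overly optimistic.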
- Once fit, these transformers should then be applied to both the training and test data.
- We use `TransformerMixin` objects (transformers) for preprocessing.

**Examples**

| Transformer           | Description                                                 | Typical Use Case                       |
| --------------------- | ----------------------------------------------------------- | -------------------------------------- |
| `SimpleImputer`       | Fills in missing values with mean, median, or most frequent | Handling missing data                  |
| `StandardScaler`      | Standardizes features (zero mean, unit variance)            | Scaling numeric features               |
| `MinMaxScaler`        | Scales features to a given range (default: \[0, 1])         | Normalizing data for bounded models    |
| `RobustScaler`        | Scales using median and IQR (robust to outliers)            | Scaling with outliers                  |
| `OneHotEncoder`       | Converts categorical variables into binary columns          | Encoding nominal categorical features  |
| `OrdinalEncoder`      | Encodes categories as ordered integers                      | Encoding ordinal categorical features  |


Suppose you want to transform your dataframe `df` with the specific `Transformer`.

```python
# Code template for TransformerMixin object
df = DATA
tf = Transformer(...)               # Instantiation
tf.fit(df)                          # Fitting
df_transformed = tf.transform(df)   # Transforming
```
---

Fitting and transforming can be done in one step with `fit_transform()`:
```python
# Code template for TransformerMixin object (fit + transform)
tf = Transformer(...)                   # Instantiation
df_transformed = tf.fit_transform(df)   # Fitting + transforming
```
The object `tf` automatically stores what it learned during fitting, so it can be reused to transform new data.
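To make the templates concrete, here is a minimal sketch with `StandardScaler` on made-up numbers. It checks that `fit` followed by `transform` matches `fit_transform`, and shows where the fitted statistics are stored.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

# Two-step version
tf = StandardScaler()
X_train_a = tf.fit(X_train).transform(X_train)

# One-step version
tf2 = StandardScaler()
X_train_b = tf2.fit_transform(X_train)

# The fitted statistics live on the object (trailing underscore = learned attribute)
learned_mean = tf.mean_[0]    # mean of the training column
learned_scale = tf.scale_[0]  # standard deviation of the training column

# The test set is transformed with the *training* mean and scale
X_test_scaled = tf.transform(X_test)
```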

---

```python
## Which one is wrong?
X = pd.DataFrame(...)
X_train, X_test = train_test_split(X, test_size = 0.2, random_state = 87)

### (1) ###
tf = Transformer(...)
tf.fit(X_train)
X_train_transformed = tf.transform(X_train)
X_test_transformed = tf.transform(X_test)

### (2) ###
tf = Transformer(...)
tf.fit(X)
X_train_transformed = tf.transform(X_train)
X_test_transformed = tf.transform(X_test)

### (3) ###
tf = Transformer(...)
tf.fit(X_train)
X_train_transformed = tf.transform(X_train)
X_test_transformed = tf.transform(X_test)
```

### Code Examples with Titanic

#### 1.1 Missing value imputation

```python
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean') # strategy: mean, most_frequent, median, constant, ...
```
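As a small illustration of how the imputer is fit and applied, the sketch below uses a made-up `Age` column standing in for the Titanic data. The values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up stand-in for a numeric Titanic column
X_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
X_test = pd.DataFrame({"Age": [np.nan, 50.0]})

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)                      # learns the training mean: (20 + 30 + 40) / 3
X_train_imp = imputer.transform(X_train)  # NumPy array by default
X_test_imp = imputer.transform(X_test)    # missing test value filled with the training mean
```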

#### 1.2 Scaling (Numerical variable)

When a model depends on distance calculations or involves gradient-based optimization, feature scaling becomes essential. Examples include:

| **Model**                        | **Reason**                                                             |
| -------------------------------- | ---------------------------------------------------------------------- |
| `LogisticRegression`             | Optimization assumes features are on a similar scale                   |
| `Ridge`, `Lasso`, `ElasticNet`   | Regularization penalties are sensitive to feature magnitude            |
| `SVC`, `SVR` (SVM)               | Distance-based; feature scales affect margin and kernel behavior       |
| `KNeighborsClassifier/Regressor` | Distance-based; dominant features distort results                      |
| `KMeans`                         | Uses Euclidean distance; scale mismatch skews clustering               |
| `PCA`                            | Based on variance; larger-scale features dominate principal components |


```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
```

#### 1.3 Dimensionality Reduction (Numerical variable)
- Even though dimensionality reduction techniques like PCA or t-SNE themselves are unsupervised learning models, they are often applied during preprocessing to reduce noise, remove redundancy, or simplify the feature space.
- For example, PCA can help improve model performance or training speed by projecting high-dimensional data onto a smaller set of orthogonal components.
- Each dimensionality reduction object is also a subclass of `TransformerMixin`, so they share the same code template.
- However, just like other data-dependent transformations, dimensionality reduction should be fit only on the training data and then applied to the test data.

Since we only have two numerical variables in the dataset, we do not pursue this direction in our preprocessing.

#### 1.4 Encoding (Categorical variable)
Encoding for categorical variables is necessary for any model that requires numerical input.

```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
```

#### 1.5 Save

After preprocessing, save:
- `X_train_processed`, `X_test_processed`, `y_train`, `y_test` in `data/processed` by using `to_csv()`.
- all the transformers you used (`imputer`, `scaler`, `encoder`) in `model` by using `pickle`.

It's better to save the processed datasets and transformers separately for each model, especially when different models require different preprocessing steps. That is, save
- `X_train_processed`, `X_test_processed`, `y_train`, `y_test` in `data/processed/ModelName` by using `to_csv()`
- all the transformers you used (`num_imputer`, `cat_imputer`, `scaler`, `encoder`) in `model/ModelName` by using `pickle`.
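The saving step could be sketched as follows. To keep the example self-contained it writes into a temporary directory; in a real project the root would be the project folder, and `ModelName` is a placeholder as in the text above.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.preprocessing import StandardScaler

X_train = pd.DataFrame({"Age": [20.0, 30.0, 40.0]})  # made-up data
scaler = StandardScaler().fit(X_train)
X_train_processed = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns)

# Temporary root directory, used here only so the sketch runs anywhere
root = Path(tempfile.mkdtemp())
data_dir = root / "data" / "processed" / "ModelName"
model_dir = root / "model" / "ModelName"
data_dir.mkdir(parents=True)
model_dir.mkdir(parents=True)

# Save the processed data and the fitted transformer
X_train_processed.to_csv(data_dir / "X_train_processed.csv", index=False)
with open(model_dir / "scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)

# Load them back to confirm the round trip
with open(model_dir / "scaler.pkl", "rb") as f:
    scaler_loaded = pickle.load(f)
X_reloaded = pd.read_csv(data_dir / "X_train_processed.csv")
```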

#### IMPORTANT NOTE
- In addition to scikit-learn’s transformers, many preprocessing steps are often performed manually using pandas or custom logic.
- These include creating new features, grouping categories, handling outliers, or applying domain-specific transformations.
- You can perform such steps at any point in your workflow, as long as you ensure they are applied consistently to both training and test sets.
- Furthermore, as you conduct exploratory data analysis (EDA), you may identify patterns, anomalies, or relationships that reveal the need for additional preprocessing steps.
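One way to keep manual steps consistent is to wrap them in a function and call it on both sets. The sketch below is only an illustration: the `FamilySize` feature and the helper `add_family_size` are hypothetical, and the `SibSp` and `Parch` columns are assumed to exist in the data.

```python
import pandas as pd


def add_family_size(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a FamilySize column (hypothetical feature)."""
    out = df.copy()
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    return out


# Made-up rows with the assumed columns
X_train = pd.DataFrame({"SibSp": [1, 0], "Parch": [0, 2]})
X_test = pd.DataFrame({"SibSp": [3], "Parch": [1]})

# The same function is applied to both sets, so the logic cannot drift apart
X_train_fe = add_family_size(X_train)
X_test_fe = add_family_size(X_test)
```

Because this step learns nothing from the data, it does not need a separate fit on the training set.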

## 2. Fitting and Prediction
- Fitting a model means finding the best parameters that minimize the error between the model's predictions and the true values on the training data.
- Once the model is fit on the training data, it can be used to make predictions on test data.

**Supervised learning**

Let's say you want to fit a supervised model on (`X_train`, `y_train`), and use it to predict outcomes for `X_test`. (Here, I assume the data are processed).

```python
# Code template for supervised model
model = ModelName(...)              # Instantiation. You should import `ModelName` first.
model.fit(X_train, y_train)         # Fitting
y_pred = model.predict(X_test)      # Prediction
```
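As one possible instance of this template, the sketch below fits `LogisticRegression` on synthetic data from `make_classification`. Both the model and the data are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100
)

model = LogisticRegression()            # Instantiation
model.fit(X_train, y_train)             # Fitting
y_pred = model.predict(X_test)          # Predicted class labels
y_proba = model.predict_proba(X_test)   # Predicted class probabilities (classifiers only)
```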


**Unsupervised learning**

Let's say you want to fit an unsupervised model on `X_train`, and then use it to transform `X_test` (dimension reduction) or assign cluster labels to it (clustering). (Here, I assume the data are processed.)

```python
# Code template for dimension reduction model
model = ModelName(...)                        # Instantiation. You should import `ModelName` first.
model.fit(X_train)                            # Fitting
X_train_reduced = model.transform(X_train)    # Dimension reduction for training set
X_test_reduced = model.transform(X_test)      # Dimension reduction for test set
```
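For example, with `PCA` the template could look like the sketch below. The random data and the choice of two components are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Random data with 5 features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=100)

model = PCA(n_components=2)                   # Keep the first two principal components
model.fit(X_train)                            # Components are learned from the training set only
X_train_reduced = model.transform(X_train)
X_test_reduced = model.transform(X_test)

# Share of the training variance captured by each component
explained = model.explained_variance_ratio_
```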

```python
# Code template for clustering model
model = ModelName(...)                    # Instantiation. You should import `ModelName` first.
model.fit(X_train)                        # Fitting
train_labels = model.labels_              # Cluster assignment for training set
test_labels = model.predict(X_test)       # Cluster assignment for test set (only for models that implement predict, e.g. KMeans)
```
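A small `KMeans` run illustrates the clustering template. The two well-separated blobs are made up so that the result is easy to read.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs around (0, 0) and (10, 10) (made-up data)
rng = np.random.default_rng(0)
X_train = np.vstack([
    rng.normal(0.0, 0.5, size=(40, 2)),
    rng.normal(10.0, 0.5, size=(40, 2)),
])
X_test = np.array([[0.0, 0.0], [10.0, 10.0]])

model = KMeans(n_clusters=2, n_init=10, random_state=100)
model.fit(X_train)
train_labels = model.labels_           # One label per training row
test_labels = model.predict(X_test)    # Nearest learned centroid for each test row
```

The label numbers themselves (0 or 1) are arbitrary; only the grouping is meaningful.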

## 3. Model Evaluation
- Model evaluation is the process of measuring how well a trained model performs on both the training data and test data.
- Comparing training and test performance helps identify issues such as overfitting or underfitting.
- Here, we only consider the evaluation for supervised learning.

**Evaluation metrics for regression (Examples)**

| **Metric**                            | **Description**                          | **scikit-learn Function**                        |
| ------------------------------------- | ---------------------------------------- | ------------------------------------------------ |
| Mean Absolute Error (MAE)             | Average of absolute errors               | `mean_absolute_error`                            |
| Mean Squared Error (MSE)              | Average of squared errors                | `mean_squared_error`                             |
| Root Mean Squared Error (RMSE)        | Square root of MSE (same unit as target) | `root_mean_squared_error` (scikit-learn ≥ 1.4)   |
| R² Score                              | Proportion of variance explained         | `r2_score`                               |

**Evaluation metrics for classification (Examples)**

| **Metric**               | **Description**                                      | **scikit-learn Function** |
| ------------------------ | ---------------------------------------------------- | ------------------------- |
| Accuracy                 | Proportion of correct predictions                    | `accuracy_score`          |
| Precision                | TP / (TP + FP) — correctness of positive predictions | `precision_score`         |
| Recall (Sensitivity)     | TP / (TP + FN) — coverage of actual positives        | `recall_score`            |
| F1 Score                 | Harmonic mean of precision and recall                | `f1_score`                |
| Confusion Matrix         | Table of TP, FP, FN, TN counts                       | `confusion_matrix`        |
| ROC AUC                  | Area under the ROC curve                             | `roc_auc_score`           |


```python
# Code template for model evaluation
train_result = EVAL_METRIC(y_train, y_train_pred)  # You should import `EVAL_METRIC` first.
test_result = EVAL_METRIC(y_test, y_test_pred)
```
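The sketch below evaluates a few of these metrics on tiny hand-made vectors so that each value can be checked by hand. RMSE is computed as the square root of MSE, which works regardless of the installed scikit-learn version.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    mean_absolute_error,
    mean_squared_error,
    precision_score,
    recall_score,
)

# Regression: the absolute errors are 1, 0, 2
y_true_reg = [3.0, 5.0, 2.0]
y_pred_reg = [2.0, 5.0, 4.0]
mae = mean_absolute_error(y_true_reg, y_pred_reg)   # (1 + 0 + 2) / 3
mse = mean_squared_error(y_true_reg, y_pred_reg)    # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)

# Classification: TP = 2, FP = 0, FN = 1, TN = 1
y_true_clf = [1, 0, 1, 1]
y_pred_clf = [1, 0, 0, 1]
acc = accuracy_score(y_true_clf, y_pred_clf)     # 3 / 4
prec = precision_score(y_true_clf, y_pred_clf)   # 2 / (2 + 0)
rec = recall_score(y_true_clf, y_pred_clf)       # 2 / (2 + 1)
```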

## 4. Model Selection
- Most machine learning models have hyperparameters, which are not learned from the data but set by the user before training (e.g., the number of neighbors in KNN, the regularization strength in logistic regression).
- The goal of model selection is to find the hyperparameter values that result in the best performance on unseen data.
- Since the test set must remain untouched until the final evaluation, we split the training set further into a training and validation subset, or use cross-validation to assess model performance more reliably.
- We focus on `GridSearchCV` in `scikit-learn`.

```python
# Code template for GridSearchCV
from sklearn.model_selection import GridSearchCV

model = ModelName(...)                       # Do not set the hyperparameters to be tuned here.
param_grid = {
    PARAMETER_1: CANDIDATE_1,                # Key: parameter name (string); value: list of candidates
    PARAMETER_2: CANDIDATE_2,
    # ...
}

grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    cv=5,                                    # 5-fold cross-validation
    scoring='scoring_method'                 # See examples below
)

grid_search.fit(X_train_processed, y_train)

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test_processed)

## Final evaluation on test set
test_score = EVAL_METRIC(y_test, y_pred)     # match this to your scoring above
```
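A filled-in version of the template might look like the sketch below, which tunes `n_neighbors` for `KNeighborsClassifier` on synthetic data. The model, grid, and data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (illustrative only)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100
)

param_grid = {"n_neighbors": [3, 5, 7]}   # Candidates for the hyperparameter
grid_search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid=param_grid,
    cv=5,                                 # 5-fold cross-validation on the training set
    scoring="accuracy",
)
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_    # Best combination found
cv_score = grid_search.best_score_        # Mean cross-validated accuracy of that combination

# Final evaluation on the untouched test set, using the same metric
y_pred = grid_search.best_estimator_.predict(X_test)
test_score = accuracy_score(y_test, y_pred)
```

With the default `refit=True`, `best_estimator_` is refit on the whole training set using the best parameters.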

**`scoring` examples for regression**

| **Metric**                     | **`scoring` String**                       |
| ------------------------------ | ------------------------------------------ |
| Mean Absolute Error (MAE)      | `'neg_mean_absolute_error'`                |
| Mean Squared Error (MSE)       | `'neg_mean_squared_error'`                 |
| Root Mean Squared Error (RMSE) | `'neg_root_mean_squared_error'`            |
| R² Score                       | `'r2'`                                     |





**`scoring` examples for classification**

| **Metric**           | **`scoring` String** |
| -------------------- | -------------------- |
| Accuracy             | `'accuracy'`         |
| Precision            | `'precision'`        |
| Recall               | `'recall'`           |
| F1 Score             | `'f1'`               |
| ROC AUC              | `'roc_auc'`          |
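
The `neg_` prefix on the error-based strings exists because scikit-learn always treats a higher score as better, so errors are reported with a flipped sign. The sketch below shows this with `cross_val_score`; the `Ridge` model and synthetic data are illustrative choices.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data (illustrative only)
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

# One score per fold; each is the negative of that fold's MSE
scores = cross_val_score(
    Ridge(alpha=1.0), X, y, cv=5, scoring="neg_mean_squared_error"
)

# Flip the sign to recover the usual (non-negative) MSE
mse_per_fold = -scores
```

The same convention applies to `best_score_` in `GridSearchCV` when one of these strings is used.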


## 5. Save
You've finished. Now it's time to save the model so that you can load it whenever you need it. We use `pickle`. If you plan to build multiple models, save each best model under its own model name.

```python
import pickle

with open('models/ModelName/model.pkl', 'wb') as f:
    pickle.dump(grid_search.best_estimator_, f)
```

## 6. Load
You can load the model you saved as follows:

```python
import pickle

with open('models/ModelName/model.pkl', 'rb') as f:
    model = pickle.load(f)
```
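To confirm that saving and loading preserve the model, a round trip can be sketched as below. It writes to a temporary directory so the example runs anywhere; the model and data are illustrative.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Fit a small model on synthetic data (illustrative only)
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

# Temporary location standing in for models/ModelName/
model_dir = Path(tempfile.mkdtemp()) / "models" / "ModelName"
model_dir.mkdir(parents=True)
model_path = model_dir / "model.pkl"

with open(model_path, "wb") as f:
    pickle.dump(model, f)

with open(model_path, "rb") as f:
    model_loaded = pickle.load(f)

# The loaded model should give exactly the same predictions
same_predictions = np.array_equal(model.predict(X), model_loaded.predict(X))
```

Pickle files are best loaded with the same scikit-learn version that created them, and only from sources you trust.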

## Next?
- Interpret the results.
- Visualize the results.
- Communicate findings.
- Try other models.