### UBC Extended Learning
#### Instructor: Socorro Dominguez
#### Module 05

**Overview:**
- [ ] Identify when to implement feature transformations.
- [ ] Apply `sklearn.pipeline.Pipeline`.
- [ ] Use `sklearn` for numerical feature transformations.
- [ ] Discuss the **golden rule** in feature transformations.
- [ ] Carry out hyperparameter optimization using `GridSearchCV` and `RandomizedSearchCV`.

## Identifying the Need for Feature Transformations

### Missing Data (Imputation)
- Detect missing values in the dataset.
- Implement imputation strategies (mean, median, etc.).
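
As a minimal sketch of the two steps above, the snippet below counts missing values per column and then fills them with the column median. The `toy` DataFrame and its values are made up for illustration; `strategy="median"` is one of several options `SimpleImputer` accepts.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "col1": [1.0, 2.0, np.nan, 4.0],
    "col2": [0.5, np.nan, 0.2, 0.9],
})

# Detect: count missing values in each column
missing_counts = toy.isna().sum()
print(missing_counts)

# Impute: replace each missing value with the median of its column
imputer = SimpleImputer(strategy="median")
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(toy_imputed)
```

The median is less sensitive to outliers than the mean, which is one reason to prefer it for skewed features.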

### Scaling
- Check whether the features are on very different scales.
- Apply scaling when they are (`StandardScaler`, `MinMaxScaler`); this is covered in more detail in the next module.
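
The sketch below illustrates the idea on a hypothetical two-column array (the values are invented and stand for something like age and income). It inspects the range of each column and then applies both scalers for comparison.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales
X_demo = np.array([
    [25.0, 40000.0],
    [35.0, 60000.0],
    [45.0, 80000.0],
])

# Check: compare the minimum and maximum of each column
print(X_demo.min(axis=0), X_demo.max(axis=0))

# StandardScaler: each column ends up with mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X_demo)

# MinMaxScaler: each column is rescaled to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X_demo)

print(X_std)
print(X_minmax)
```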

### Using `sklearn.pipeline.Pipeline`

- Sequentially applies a list of transformers followed by a final estimator.
- Simplifies the workflow, reduces errors, and improves reproducibility.
- Protects the **golden rule**: calling `.fit()` on the pipeline fits every transformer on the training data only.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, None, 6, 7, 8, 9, 10],
    'col2': [0.5, 0.7, 0.2, None, 0.1, 0.9, 0.3, 0.6, 0.8, 0.4],
    'target': [0, 1, 0, 1, 1, 0, 0, 1, 1, 0]
})

df

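|   | col1 | col2 | target |
|---|------|------|--------|
| 0 | 1.0 | 0.5 | 0 |
| 1 | 2.0 | 0.7 | 1 |
| 2 | 3.0 | 0.2 | 0 |
| 3 | 4.0 | NaN | 1 |
| 4 | NaN | 0.1 | 1 |
| 5 | 6.0 | 0.9 | 0 |
| 6 | 7.0 | 0.3 | 0 |
| 7 | 8.0 | 0.6 | 1 |
| 8 | 9.0 | 0.8 | 1 |
| 9 | 10.0 | 0.4 | 0 |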


In [18]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('target', axis=1), df['target'], test_size=0.2, random_state=42
)

In [19]:
X_train

|   | col1 | col2 |
|---|------|------|
| 5 | 6.0 | 0.9 |
| 0 | 1.0 | 0.5 |
| 7 | 8.0 | 0.6 |
| 2 | 3.0 | 0.2 |
| 9 | 10.0 | 0.4 |
| 4 | NaN | 0.1 |
| 3 | 4.0 | NaN |
| 6 | 7.0 | 0.3 |


In [20]:
y_train

5    0
0    0
7    1
2    0
9    0
4    1
3    1
6    0
Name: target, dtype: int64

In [21]:
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('classifier', DecisionTreeClassifier())
])

# Fit and predict using the pipeline
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)

In [22]:
predictions

array([1, 0])

In [23]:
y_test

8    1
1    1
Name: target, dtype: int64
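
To put a number on how the predictions compare with the true labels, one option is `accuracy_score`. The sketch below copies the two arrays shown above by hand. Because the tree was created without a `random_state`, a rerun of the notebook may produce different predictions.

```python
from sklearn.metrics import accuracy_score

# Values copied from the outputs above
y_true = [1, 1]  # y_test
y_pred = [1, 0]  # predictions

# Fraction of test rows where the prediction matches the true label
acc = accuracy_score(y_true, y_pred)
print(acc)  # 0.5
```

With only two test rows this estimate is very noisy; it is shown only to illustrate the call.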

#### The Golden Rule in Feature Transformations

##### Never Use Test Data in Training Transformations
- Fit transformations such as imputation and scaling on the training data only, i.e.:
   > **never** call `.fit()` or `.fit_transform()` on **test** data

- Apply the already-fitted transformations to the test data:
    > **do** call `.transform()` on **test** data

##### Avoid Information Leakage
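- Ensure the model, including every preprocessing step, learns from the training data only. Statistics computed on test data leak information and make performance estimates overly optimistic.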
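
A minimal sketch of the golden rule without a pipeline, using made-up arrays (`X_train_demo` and `X_test_demo` are hypothetical names): the imputer learns its fill value from the training rows and reuses that same value on the test rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature data with missing values
X_train_demo = np.array([[1.0], [3.0], [np.nan]])
X_test_demo = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="mean")

# fit_transform on TRAIN: learns the training mean, (1 + 3) / 2 = 2.0
X_train_filled = imputer.fit_transform(X_train_demo)

# transform on TEST: reuses the training mean, never looks at the test values
X_test_filled = imputer.transform(X_test_demo)

print(imputer.statistics_)  # mean learned from the training data
print(X_test_filled)
```

Calling `fit_transform` on the test array instead would fill the gap with 100.0, a value the model should not have been able to see.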


#### Hyperparameter Optimization with `GridSearchCV` and `RandomizedSearchCV`

**Grid Search** Strategy:

- Searches through all possible combinations of hyperparameter values within the specified search space.

- Follows a grid-like pattern as it evaluates all combinations in a systematic manner.

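- Can be computationally expensive when dealing with a large hyperparameter space.
- The number of fits is the product of the number of values per hyperparameter, multiplied by the number of cross-validation folds, so the cost is usually higher than for a randomized search with a fixed `n_iter`.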

- Well-suited for a small and well-defined hyperparameter space where you have a good idea of which values to explore.

In [30]:
from sklearn.model_selection import GridSearchCV

param_grid = {'classifier__max_depth': [1, 2, 3]}
grid_search = GridSearchCV(pipeline, param_grid, cv=3)
grid_search.fit(X_train, y_train)
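
After fitting, the search object exposes the results. The sketch below rebuilds the toy data and pipeline from this module so that it runs on its own, then reads `best_params_`, `best_score_`, and `cv_results_`. Fixing `random_state=0` on the tree is an added assumption, made only so that reruns give the same result.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Same toy data as earlier in this module
df = pd.DataFrame({
    'col1': [1, 2, 3, 4, None, 6, 7, 8, 9, 10],
    'col2': [0.5, 0.7, 0.2, None, 0.1, 0.9, 0.3, 0.6, 0.8, 0.4],
    'target': [0, 1, 0, 1, 1, 0, 0, 1, 1, 0]
})
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('target', axis=1), df['target'], test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    # random_state is an assumption added here for reproducibility
    ('classifier', DecisionTreeClassifier(random_state=0))
])

# The 'classifier__' prefix routes max_depth to the pipeline step named 'classifier'
param_grid = {'classifier__max_depth': [1, 2, 3]}
grid_search = GridSearchCV(pipeline, param_grid, cv=3)
grid_search.fit(X_train, y_train)

# Best hyperparameters and their mean cross-validated accuracy
print(grid_search.best_params_)
print(grid_search.best_score_)

# One row per candidate, with its mean validation score and rank
results = pd.DataFrame(grid_search.cv_results_)[
    ['param_classifier__max_depth', 'mean_test_score', 'rank_test_score']
]
print(results)

# With the default refit=True, the best pipeline is refitted on all of X_train
test_accuracy = grid_search.score(X_test, y_test)
print(test_accuracy)
```

Because the imputer sits inside the pipeline, it is refitted on the training portion of each fold, so the validation folds stay unseen during cross-validation.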

**Randomized Search** Strategy:

- Randomly samples a specified number of combinations from the hyperparameter space.
- Lets you define a probability distribution (or a list of values) for each hyperparameter.

- Typically less computationally expensive than grid search, because the number of sampled combinations is fixed by `n_iter`.

- Well-suited for a large hyperparameter space where an exhaustive search would be impractical.

- Useful when you want to perform a quick search to identify promising regions of the hyperparameter space.

In [32]:
from sklearn.model_selection import RandomizedSearchCV

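# With only three candidate values and n_iter=3, every candidate is tried,
# so this particular search is equivalent to the grid search above.
param_dist = {'classifier__max_depth': [1, 2, 3]}
random_search = RandomizedSearchCV(pipeline, param_dist, n_iter=3, cv=3, random_state=42)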
random_search.fit(X_train, y_train)
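
Randomized search becomes more useful when the hyperparameters are drawn from distributions rather than short lists. The sketch below is one possible setup, not part of the original notebook: it uses `scipy.stats.randint`, and the ranges, the extra `min_samples_split` parameter, `n_iter=5`, and the fixed `random_state` values are illustrative assumptions.

```python
import pandas as pd
from scipy.stats import randint
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Same toy data as earlier in this module
df = pd.DataFrame({
    'col1': [1, 2, 3, 4, None, 6, 7, 8, 9, 10],
    'col2': [0.5, 0.7, 0.2, None, 0.1, 0.9, 0.3, 0.6, 0.8, 0.4],
    'target': [0, 1, 0, 1, 1, 0, 0, 1, 1, 0]
})
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('target', axis=1), df['target'], test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('classifier', DecisionTreeClassifier(random_state=0))  # seed is an assumption
])

# randint(low, high) samples integers from low to high - 1
param_dist = {
    'classifier__max_depth': randint(1, 6),          # 1..5
    'classifier__min_samples_split': randint(2, 5),  # 2..4
}

# n_iter controls how many random combinations are evaluated
random_search = RandomizedSearchCV(
    pipeline, param_dist, n_iter=5, cv=3, random_state=42
)
random_search.fit(X_train, y_train)

print(random_search.best_params_)
print(random_search.best_score_)
```

Here the full grid would contain 5 x 3 = 15 combinations, while the search evaluates only 5 of them; the gap widens quickly as more hyperparameters are added.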