You don't have to write any code here; just read and follow along.

# Tips and Tricks in Machine Learning

## Instructions
* You may not reproduce this notebook or share it with anyone.

## Import
Import **numpy**, **pandas**, **sklearn.preprocessing**, and **matplotlib**.

In [None]:
import numpy as np
import pandas as pd
import sklearn.preprocessing
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (5.0, 5.0)
plt.rcParams['image.interpolation'] = 'nearest'

## Feature Scaling

Let's generate the data that we will scale. Note that both normalization and standardization are applied to each feature separately.

Let's generate random `x` and `y` features for our data.

In [None]:
np.random.seed(1)
x = np.arange(0, 6, 0.03) + np.random.randn(200) * 1.3 + 2
y = np.arange(0, 6, 0.03) + np.random.randn(200) * 1.3 + -4

Visualize the data in a 2D graph.

In [None]:
fig, ax = plt.subplots()
ax.plot(x, y, 'ko')

plt.xlim(-10, 10)
plt.ylim(-10, 10)

plt.xlabel('x')
plt.ylabel('y')
plt.title('Original Data')
plt.grid()

### Normalization

Normalization rescales each feature to the range $[0, 1]$ so that all features are on a similar scale. This is also called min-max scaling. Perform normalization on our data by following the formula below.

$$x_{normalized} = \frac{x - \min(x)}{\max(x) - \min(x)}$$


In [None]:
x_normalized_own = (x - np.min(x)) / (np.max(x) - np.min(x))
y_normalized_own = (y - np.min(y)) / (np.max(y) - np.min(y))
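
As a quick sanity check, here is a minimal sketch on a small hypothetical array (`values` is not part of the notebook's data). It shows that min-max scaling maps the smallest value to 0 and the largest to 1:

```python
import numpy as np

# Hypothetical toy feature, used only for illustration
values = np.array([2.0, 4.0, 6.0, 10.0])

# Min-max normalization: (x - min) / (max - min)
values_normalized = (values - values.min()) / (values.max() - values.min())

print(values_normalized)  # values: 0, 0.25, 0.5, 1
```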

Visualize the result in a 2D graph.

In [None]:
fig, ax = plt.subplots()
ax.plot(x_normalized_own, y_normalized_own, 'ko')

plt.xlim(-10, 10)
plt.ylim(-10, 10)

plt.xlabel('x')
plt.ylabel('y')
plt.title('Normalized, Zoomed-out')
plt.grid()

To visualize the data better, let's zoom in by dropping the fixed axis limits so that matplotlib fits the axes to the data.

In [None]:
fig, ax = plt.subplots()
ax.plot(x_normalized_own, y_normalized_own, 'ko')

plt.xlabel('x')
plt.ylabel('y')
plt.title('Normalized, Zoomed-in')
plt.grid()

### Normalization using `sklearn.preprocessing.MinMaxScaler`

Let's use `sklearn.preprocessing.MinMaxScaler` to normalize our data. The result should be similar to our own implementation.

Instantiate a `MinMaxScaler` object.

In [None]:
scaler = sklearn.preprocessing.MinMaxScaler()

Normalize `x` and `y` values by calling the `fit_transform()` function.

In [None]:
x_normalized_sklearn = scaler.fit_transform(x.reshape(-1,1))
y_normalized_sklearn = scaler.fit_transform(y.reshape(-1,1))
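
Note that calling `fit_transform()` a second time on the same scaler refits it, so afterwards the scaler only remembers the statistics of `y`. One alternative, sketched below under the assumption that the features are stacked as columns of one 2D array, is to scale all features in a single call. The names `a`, `b`, and `features` are hypothetical and used only for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy features, used only for illustration
rng = np.random.default_rng(0)
a = rng.normal(loc=2.0, scale=1.3, size=50)
b = rng.normal(loc=-4.0, scale=1.3, size=50)

# Stack the features as columns; MinMaxScaler scales each column independently
features = np.column_stack([a, b])
scaled = MinMaxScaler().fit_transform(features)

# Manual min-max scaling, applied per column
col_min = features.min(axis=0)
col_max = features.max(axis=0)
manual = (features - col_min) / (col_max - col_min)

print(np.allclose(scaled, manual))
```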

Visualize the result in a 2D graph.

In [None]:
fig, ax = plt.subplots()
ax.plot(x_normalized_sklearn, y_normalized_sklearn, 'ko')

plt.xlim(-10, 10)
plt.ylim(-10, 10)

plt.xlabel('x')
plt.ylabel('y')
plt.title('Normalized, Zoomed-out')
plt.grid()

To visualize the data better, let's zoom in by dropping the fixed axis limits so that matplotlib fits the axes to the data.

In [None]:
fig, ax = plt.subplots()
ax.plot(x_normalized_sklearn, y_normalized_sklearn, 'ko')

plt.xlabel('x')
plt.ylabel('y')
plt.title('Normalized, Zoomed-in')
plt.grid()

Display the graphs of our own normalization and of `sklearn.preprocessing.MinMaxScaler()` side by side.

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, nrows=1, figsize=(14, 5))

ax1.plot(x_normalized_own, y_normalized_own, 'ko')
ax1.set_title('Our Implementation')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.grid()

ax2.plot(x_normalized_sklearn, y_normalized_sklearn, 'ko')
ax2.set_title('Using sklearn')
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.grid()

### Standardization

Standardization transforms features by subtracting the mean from the data and dividing the result by the standard deviation. The result is often called the z-score. Perform standardization on our data by following the formula below.

$$x_{standardized}=\frac{x-mean(x)}{stddev(x)}$$

In [None]:
x_standardized_own = (x - np.mean(x)) / np.std(x)
y_standardized_own = (y - np.mean(y)) / np.std(y)
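
One detail worth knowing: `np.std()` computes the population standard deviation (`ddof=0`) by default. The sketch below uses a small hypothetical array to show the z-scores and how the sample standard deviation (`ddof=1`) differs.

```python
import numpy as np

# Hypothetical toy feature, used only for illustration
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# z-score with the population standard deviation (NumPy's default, ddof=0)
z_scores = (values - values.mean()) / values.std()

print(z_scores.mean(), z_scores.std())  # approximately 0 and 1

# The sample standard deviation (ddof=1) is slightly larger
print(values.std(), values.std(ddof=1))
```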

Visualize the result in a 2D graph.

In [None]:
fig, ax = plt.subplots()
ax.plot(x_standardized_own, y_standardized_own, 'ko')

plt.xlim(-10, 10)
plt.ylim(-10, 10)

plt.xlabel('x')
plt.ylabel('y')
plt.title('Standardized, Zoomed-out')
plt.grid()

To visualize the data better, let's zoom in by dropping the fixed axis limits so that matplotlib fits the axes to the data.

In [None]:
fig, ax = plt.subplots()
ax.plot(x_standardized_own, y_standardized_own, 'ko')

plt.xlabel('x')
plt.ylabel('y')
plt.title('Standardized, Zoomed-in')
plt.grid()

Check that $\mu = 0$ and $\sigma = 1$ for each standardized feature.

In [None]:
print('Feature x:')
print('Mean:', '{:.2f}'.format(np.mean(x_standardized_own)))
print('Standard deviation:', '{:.2f}\n'.format(np.std(x_standardized_own)))

print('Feature y:')
print('Mean:', '{:.2f}'.format(np.mean(y_standardized_own)))
print('Standard deviation:', '{:.2f}\n'.format(np.std(y_standardized_own)))

### Standardization using `sklearn.preprocessing.StandardScaler`

Let's use `sklearn.preprocessing.StandardScaler` to standardize our data. The result should be similar to our own implementation.

Instantiate a `StandardScaler` object.

In [None]:
x_scaler = sklearn.preprocessing.StandardScaler()
y_scaler = sklearn.preprocessing.StandardScaler()

Standardize `x` and `y` values by calling the `fit_transform()` function.

In [None]:
x_standardized_sklearn = x_scaler.fit_transform(x.reshape(-1, 1))
y_standardized_sklearn = y_scaler.fit_transform(y.reshape(-1, 1))
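
After fitting, a `StandardScaler` keeps the learned statistics in its `mean_` and `scale_` attributes, and `inverse_transform()` maps standardized values back to the original scale. The following is a minimal sketch on a hypothetical toy array (`values` and `demo_scaler` are illustrative names only).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature as a single column
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)

demo_scaler = StandardScaler()
standardized = demo_scaler.fit_transform(values)

print(demo_scaler.mean_)   # per-feature mean learned by fit
print(demo_scaler.scale_)  # per-feature population standard deviation

# Undo the scaling to recover the original values
restored = demo_scaler.inverse_transform(standardized)
print(np.allclose(restored, values))
```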

Visualize the result in a 2D graph.

In [None]:
fig, ax = plt.subplots()
ax.plot(x_standardized_sklearn, y_standardized_sklearn, 'ko')

plt.xlim(-10, 10)
plt.ylim(-10, 10)

plt.xlabel('x')
plt.ylabel('y')
plt.title('Standardized, Zoomed-out')
plt.grid()

To visualize the data better, let's zoom in by dropping the fixed axis limits so that matplotlib fits the axes to the data.

In [None]:
fig, ax = plt.subplots()
ax.plot(x_standardized_sklearn, y_standardized_sklearn, 'ko')

plt.xlabel('x')
plt.ylabel('y')
plt.title('Standardized, Zoomed-in')
plt.grid()

Check that $\mu = 0$ and $\sigma = 1$ for each standardized feature.

In [None]:
print('Feature x:')
print('Mean:', '{:.2f}'.format(np.mean(x_standardized_sklearn)))
print('Standard deviation:', '{:.2f}\n'.format(np.std(x_standardized_sklearn)))

print('Feature y:')
print('Mean:', '{:.2f}'.format(np.mean(y_standardized_sklearn)))
print('Standard deviation:', '{:.2f}\n'.format(np.std(y_standardized_sklearn)))

Display the graphs of our own standardization and of `sklearn.preprocessing.StandardScaler()` side by side.

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, nrows=1, figsize=(14, 5))

ax1.plot(x_standardized_own, y_standardized_own, 'ko')
ax1.set_title('Our Implementation')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.grid()

ax2.plot(x_standardized_sklearn, y_standardized_sklearn, 'ko')
ax2.set_title('Using sklearn')
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.grid()

## Feature Encoding

Let's create a synthetic dataset for this section. The dataset is composed of 3 features, namely `size`, `color`, and `type`.

In [None]:
clothes = pd.DataFrame(columns=['size', 'color', 'type'])

clothes['size'] = ['medium', 'large', 'small', 'medium', 'extra large', 
                   'large', 'medium', 'extra small', 'medium', 'large']

clothes['color']= ['red', 'green', 'blue', 'white', 'gray', 'black', 
                   'green', 'blue', 'grey', 'green']

clothes['type'] = ['rayon', 'polyester', 'cotton', 'cotton', 'cotton', 
                   'polyester', 'rayon', 'linen', 'cotton', 'polyester']

clothes

### Label Encoding via `sklearn.preprocessing.LabelEncoder`

Let's use `sklearn.preprocessing.LabelEncoder` to encode our labels with values from `0` to `num_classes - 1`, where `num_classes` is the number of classes in the dataset.

Instantiate a `LabelEncoder` object.

In [None]:
label_encoder = sklearn.preprocessing.LabelEncoder()

Fit the `type` feature by calling the `fit()` function of the object.

In [None]:
label_encoder.fit(clothes['type'])

Display the classes.

In [None]:
label_encoder.classes_

Thus, labels will be transformed from string values to their corresponding integer values:
- `cotton` - `0`
- `linen` - `1`
- `polyester` - `2`
- `rayon` - `3`

Transform the `type` feature by calling the `transform()` function of the object. 

In [None]:
clothes['type'] = label_encoder.transform(clothes['type'])
clothes

To reverse the encoding, call the `inverse_transform()` function of the object.

In [None]:
clothes['type'] = label_encoder.inverse_transform(clothes['type'])
clothes

We set it back to our original categorical data because label encoding is not a suitable preprocessing technique for the `type` column: the integers would suggest an order among the fabric types that does not exist.
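
To see the problem concretely, here is a minimal sketch on a hypothetical list of fabric names (`fabrics` is illustrative only). `LabelEncoder` sorts the classes, so the integers follow alphabetical order, and a model could read `rayon` (3) as "greater than" `cotton` (0).

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical fabric names, used only for illustration
fabrics = ['rayon', 'polyester', 'cotton', 'linen']

demo_encoder = LabelEncoder()
codes = demo_encoder.fit_transform(fabrics)

# Classes are sorted, so the codes reflect alphabetical order only
for fabric, code in zip(fabrics, codes):
    print(fabric, code)
```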

### One-Hot Encoding via `sklearn.preprocessing.OneHotEncoder`

Let's use `sklearn.preprocessing.OneHotEncoder` to encode our categorical features as a one-hot numeric array.


Instantiate a `OneHotEncoder` object.

In [None]:
one_hot_encoder = sklearn.preprocessing.OneHotEncoder(dtype='int8')

Fit the `type` feature by calling the `fit()` function of the object.

In [None]:
one_hot_encoder.fit(clothes['type'].values.reshape(-1, 1))

Display the categories.

In [None]:
one_hot_encoder.categories_

The encoding will then be an array with 4 columns, where the columns represent:
- column 1 - `cotton`
- column 2 - `linen`
- column 3 - `polyester`
- column 4 - `rayon`

If the instance has a value `linen` for the `type` feature, then the one-hot encoding for this instance is `[0, 1, 0, 0]`.

Transform the `type` feature by calling the `transform()` function of the object. 

In [None]:
encoding = one_hot_encoder.transform(clothes['type'].values.reshape(-1, 1)).toarray()
encoding

In [None]:
type_df = pd.DataFrame(encoding, columns=[x for x in one_hot_encoder.categories_] )

clothes = clothes.drop(['type'], axis=1) 
clothes = pd.concat([clothes, type_df], axis=1)
clothes

We'll rename the columns.

In [None]:
clothes.columns

In [None]:
clothes.columns=['size', 'color', 'cotton', 'linen', 'polyester', 'rayon']
clothes
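
As an alternative to renaming afterwards, the column names can be taken directly from the first (and only) entry of `categories_`. This is a minimal sketch on a hypothetical one-column `DataFrame` named `fabrics`, not the notebook's `clothes` data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data, used only for illustration
fabrics = pd.DataFrame({'type': ['rayon', 'cotton', 'linen', 'cotton']})

demo_encoder = OneHotEncoder(dtype='int8')
encoded = demo_encoder.fit_transform(fabrics[['type']]).toarray()

# categories_ holds one array per encoded feature; index 0 is the 'type' feature
type_columns = pd.DataFrame(encoded, columns=demo_encoder.categories_[0])
print(type_columns)
```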

### Ordinal Encoding

Ordinal encoding is a type of label encoding where there is an order associated with the data. In our example, the `size` feature is ordinal.

Let's create a dictionary that maps each string value in the `size` feature to its corresponding integer according to the natural order of the sizes. See the list below:
- `extra small` - `0`
- `small` - `1`
- `medium` - `2`
- `large` - `3`
- `extra large` - `4`

In [None]:
clothes_sizes_dict= {
    'extra small' : 0,
    'small' : 1,
    'medium' : 2,
    'large' : 3,
    'extra large' : 4
}

Use the `map()` function to transform the `size` feature to its corresponding ordinal value.

In [None]:
clothes['size'] = clothes['size'].map(clothes_sizes_dict)
clothes
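
One caveat with `map()`: any value that is missing from the dictionary becomes `NaN`. The sketch below, using a hypothetical `sizes` series and a shortened dictionary, shows how to detect such values after mapping.

```python
import pandas as pd

# Hypothetical mapping and data, used only for illustration
size_order = {'small': 0, 'medium': 1, 'large': 2}
sizes = pd.Series(['small', 'large', 'Medium', 'medium'])

encoded_sizes = sizes.map(size_order)

# 'Medium' (capitalized) is not a key in the dictionary, so it maps to NaN
print(encoded_sizes)
print('Unmapped values:', encoded_sizes.isna().sum())
```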

## Data Augmentation

Data augmentation is especially helpful when training a machine learning model on a limited amount of data. It exposes the model to more variations of the data, which can make it more robust.

Import necessary packages.

In [None]:
import skimage.io as io
from skimage.transform import rotate, AffineTransform, warp
from skimage.util import random_noise
from skimage.filters import gaussian

Read the image and display its shape.

In [None]:
image = io.imread('https://i.imgur.com/DHI2jdW.png')
print(image.shape)

Display the image.

In [None]:
io.imshow(image)

### Rotation

Rotate the image by some degree. In the example below, we rotated the image by 45 degrees.

In [None]:
rotated_image = rotate(image, angle=45, mode='wrap')

plt.imshow(rotated_image)
plt.title('Rotated Image')

### Translation

Translate the image by some number of pixels. In the example below, we move the image 50 pixels upward and 50 pixels to the left.

In [None]:
transform = AffineTransform(translation=(50, 50))
translated = warp(image, transform, mode='wrap')

plt.imshow(translated)
plt.title('Translated Image')

### Horizontal Flip

Flip the image with respect to the y-axis.

In [None]:
flipped_image = np.fliplr(image)

plt.imshow(flipped_image)
plt.title('Horizontally-flipped Image')

### Vertical Flip

Flip the image with respect to the x-axis.

In [None]:
flipped_image = np.flipud(image)

plt.imshow(flipped_image)
plt.title('Vertically-flipped Image')
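
The two flips are easier to tell apart on a tiny array than on a photo. In this minimal sketch, `grid` is a hypothetical 2×3 array standing in for an image: `np.fliplr()` reverses the order of the columns, while `np.flipud()` reverses the order of the rows.

```python
import numpy as np

# Hypothetical 2x3 "image", used only for illustration
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

left_right = np.fliplr(grid)  # columns reversed: [[3, 2, 1], [6, 5, 4]]
up_down = np.flipud(grid)     # rows reversed:    [[4, 5, 6], [1, 2, 3]]

print(left_right)
print(up_down)
```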

### Random Noise

Add some random noise to the image.

In [None]:
# Standard deviation for noise to be added in the image
sigma = 0.5

# Add random noise to the image
random_noise_image = random_noise(image, var=sigma ** 2)

plt.imshow(random_noise_image)
plt.title('Image with Random Noise')

### Gaussian Blur

Apply a Gaussian blur to the image.

In [None]:
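# channel_axis=-1 marks the last axis as color channels
# (on scikit-image versions older than 0.19, use multichannel=True instead)
blurred_image = gaussian(image, sigma=5, channel_axis=-1)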

plt.imshow(blurred_image)
plt.title('Blurred Image')

In the lecture, we explored more types of augmentations. Those augmentations came from the `imgaug` library, which you can install by typing the command below in your command prompt/terminal:

`conda install imgaug`

See the documentation [here](https://github.com/aleju/imgaug).

## Training Pipeline

### Dataset

We will use the Census Income dataset, which we will load into a `DataFrame` in the Data Loading step below.

### Pipeline

We will perform the following steps on our data.

- Pre-processing
    - Data Loading
    - Feature Encoding
    - Data Scaling
- Training a classifier (a decision tree, then random forest and AdaBoost)
- Hyperparameter Tuning: Randomized Search with Cross-Validation

### Data Loading
Let's load `census_income.csv`.

In [None]:
df = pd.read_csv('census_income.csv')
df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 
              'maritalstatus', 'occupation', 'relationship', 'race', 'sex',
              'capitalgain','capitalloss', 'hoursperweek', 'nativecountry', 
              'label']
df

You will normally use EDA and feature selection to choose the features for your project. For the purpose of this notebook, let's just select a few features to which we will apply feature encoding and scaling.

In [None]:
feature_columns = ['age', 'fnlwgt', 'race', 'sex', 'educationnum', 
                   'capitalgain', 'capitalloss', 'hoursperweek', 'label']
df = df[feature_columns]

### One-Hot Encoding
We will apply one-hot encoding to `race`.

Instantiate a `OneHotEncoder` object.

In [None]:
one_hot_encoder_race = sklearn.preprocessing.OneHotEncoder(dtype='int8')

Fit the `race` feature by calling the `fit()` function of the object.

In [None]:
one_hot_encoder_race.fit(df['race'].values.reshape(-1, 1))

Display the categories.

In [None]:
one_hot_encoder_race.categories_

The encoding will then be an array with 5 columns, where the columns represent:
- column 1 - ` Amer-Indian-Eskimo`
- column 2 - ` Asian-Pac-Islander`
- column 3 - ` Black`
- column 4 - ` Other`
- column 5 - ` White`

If the instance has a value ` Asian-Pac-Islander` for the `race` feature, then the one-hot encoding for this instance is `[0, 1, 0, 0, 0]`.

Transform the `race` feature by calling the `transform()` function of the object. 

In [None]:
encoding = one_hot_encoder_race.transform(df['race'].values.reshape(-1, 1)).toarray()
encoding

In [None]:
race_df = pd.DataFrame(encoding, columns=['race_' + x for x in one_hot_encoder_race.categories_])
race_df

Concatenate the encoding to the original `DataFrame`.

In [None]:
df = pd.concat([df, race_df], axis=1)
df

### Ordinal Encoding
Let's apply ordinal encoding for the `sex` feature.

Strictly speaking, `sex` is a nominal feature rather than an ordinal one, so one-hot encoding would be the usual choice. However, one-hot encoding is not needed for binary features (i.e., features with only two possible values), so a simple integer mapping is enough.

Let's create a dictionary that maps each string value in the `sex` feature to its corresponding integer. See the list below:
- ` Male` - `0`
- ` Female` - `1`

In [None]:
sex_mapping_dict = {
    ' Male' : 0,
    ' Female': 1
}

Use the `map()` function to transform the `sex` feature to its corresponding integer value.

In [None]:
df['sex'] = df['sex'].map(sex_mapping_dict)
df 
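
Note that the keys in the mapping include a leading space because the raw values in this dataset do. A minimal sketch of an alternative, assuming you would rather strip the whitespace first, is shown below; `sex_values` is a hypothetical series standing in for the `sex` column.

```python
import pandas as pd

# Hypothetical values with the same leading space as in the dataset
sex_values = pd.Series([' Male', ' Female', ' Male'])

# Keys without the leading space do not match, so everything becomes NaN
unmatched = sex_values.map({'Male': 0, 'Female': 1})
print(unmatched.isna().all())

# Stripping whitespace first lets the cleaner keys match
cleaned = sex_values.str.strip().map({'Male': 0, 'Female': 1})
print(cleaned.tolist())
```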

Rename columns.

In [None]:
df.columns=['age', 'fnlwgt', 'race', 'sex', 'educationnum', 
            'capitalgain', 'capitalloss', 'hoursperweek', 'label', 
            'race_american_indian_eskimo', 'race_api', 'race_black',
            'race_other', 'race_white']
df

Drop the `race` column.

In [None]:
df = df.drop(columns=['race'])

### Train/test split

Let's split the dataset into train, validation, and test sets. 

First, remove the `label` feature from `X` since this is our target feature. We will instead store it in `y`.

In [None]:
X = df.drop(columns=['label']).values
y = df['label'].values

print('X ', X.shape)
print('y ', y.shape)

Import `train_test_split()`.

In [None]:
from sklearn.model_selection import train_test_split

Divide the dataset into train and test sets, where 20% of the data will be placed in the test set.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

We will hold out 10% of the train set as a validation set.

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=1)

Let's display the shape of the train, validation, and test set.

In [None]:
print('X_train', X_train.shape)
print('y_train', y_train.shape)

print('X_val', X_val.shape)
print('y_val', y_val.shape)

print('X_test', X_test.shape)
print('y_test', y_test.shape)
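
Because the second split takes 10% of the remaining 80%, the final proportions are 72% train, 8% validation, and 20% test. The sketch below checks this on hypothetical data with 1000 rows (`X_demo` and `y_demo` are illustrative names only).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset with 1000 rows and 2 features
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.arange(1000) % 2

# First split: 80% train, 20% test
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# Second split: 10% of the remaining train set becomes the validation set
X_tr, X_va, y_tr, y_va = train_test_split(X_tr, y_tr, test_size=0.1, random_state=1)

print(len(X_tr), len(X_va), len(X_te))  # 720 80 200
```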

### Simple pipeline

In this section, we will create a simple training pipeline using `sklearn.pipeline.Pipeline` and a classifier `sklearn.tree.DecisionTreeClassifier`.

Import `sklearn.pipeline.Pipeline` and `sklearn.tree.DecisionTreeClassifier`.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

Instantiate a `Pipeline` object. We will need to pass a list of transforms and a final estimator. Each element in the list is a tuple `(name, transform)`. In the example below, we have 2 elements in the list, where the `name`s of the elements are `scaler` and `classifier`.

In [None]:
pipe = Pipeline([
    ('scaler', sklearn.preprocessing.StandardScaler()),
    ('classifier', DecisionTreeClassifier())
])

Execute the pipeline by calling the `fit()` function of the model.

In [None]:
pipe.fit(X_train, y_train)
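
One benefit of putting the scaler inside the pipeline is that its statistics are learned only from the data passed to `fit()`, and the same statistics are reused when scoring or predicting. Here is a minimal sketch on synthetic data; `make_classification` and the names `demo_pipe`, `X_fit`, and `X_new` are assumptions for illustration and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_fit, X_new = X_demo[:150], X_demo[150:]
y_fit, y_new = y_demo[:150], y_demo[150:]

demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', DecisionTreeClassifier(random_state=0)),
])
demo_pipe.fit(X_fit, y_fit)

# The fitted scaler only saw X_fit, so its mean matches the mean of X_fit
fitted_scaler = demo_pipe.named_steps['scaler']
print(np.allclose(fitted_scaler.mean_, X_fit.mean(axis=0)))

# score() applies the already-fitted scaler to X_new before classifying
demo_score = demo_pipe.score(X_new, y_new)
print(demo_score)
```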

### Predict on train, validation, and test set

Evaluate the model on the train set by calling the `score()` function to get the train accuracy.

In [None]:
pipe.score(X_train, y_train)

Evaluate the model on the validation set by calling the `score()` function to get the validation accuracy.

In [None]:
pipe.score(X_val, y_val)

Evaluate the model on the test set by calling the `score()` function to get the test accuracy.

In [None]:
pipe.score(X_test, y_test)

To list the hyperparameters that can be set in the pipeline, call `get_params()`:

In [None]:
pipe.get_params().keys()
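
The keys follow the pattern `<step name>__<parameter>`, and the same names can be passed to `set_params()`. This is a minimal sketch on a freshly built pipeline; `demo_pipe` is a hypothetical name used only here.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', DecisionTreeClassifier()),
])

# Nested parameters are addressed as '<step name>__<parameter>'
print('classifier__max_depth' in demo_pipe.get_params())

# Change a hyperparameter of the classifier step through the pipeline
demo_pipe.set_params(classifier__max_depth=3)
print(demo_pipe.named_steps['classifier'].max_depth)
```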

## Simple pipeline with random search of hyperparameters

In this section, we will integrate cross-validation into our pipeline to search for the best set of hyperparameters.

Import `sklearn.model_selection.RandomizedSearchCV`.

In [None]:
from sklearn.model_selection import RandomizedSearchCV

Instantiate a `Pipeline` object with an `sklearn.preprocessing.StandardScaler` and a `DecisionTreeClassifier`.

In [None]:
pipe = Pipeline([
    ('scaler', sklearn.preprocessing.StandardScaler()),
    ('classifier', DecisionTreeClassifier())
])

Create a list of dictionaries of the different hyperparameters that we want to try. In the example below, we indicate different values for the `DecisionTreeClassifier` hyperparameters `criterion`, `min_impurity_decrease`, and `max_depth`.

In [None]:
parameters = [
    {
        'classifier__criterion': ['gini', 'entropy'],
        'classifier__min_impurity_decrease': [0.001, 0.01, 0.05, 0.1, 0.3, 0.5],
        'classifier__max_depth': [5, 10, 20, 30]
    }
]
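
To get a sense of the size of this search space, the combinations can be counted with `sklearn.model_selection.ParameterGrid`. The sketch below repeats the same values under a hypothetical name, `demo_space`. Since `RandomizedSearchCV` tries `n_iter` combinations (10 by default), only a subset of them is evaluated.

```python
from sklearn.model_selection import ParameterGrid

# Same values as above, repeated so this sketch runs on its own
demo_space = [
    {
        'classifier__criterion': ['gini', 'entropy'],
        'classifier__min_impurity_decrease': [0.001, 0.01, 0.05, 0.1, 0.3, 0.5],
        'classifier__max_depth': [5, 10, 20, 30],
    }
]

# 2 criteria x 6 impurity thresholds x 4 depths
n_combinations = len(ParameterGrid(demo_space))
print(n_combinations)  # 48
```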

Instantiate a `RandomizedSearchCV` object. Pass the `Pipeline` object and the variable `parameters`. This performs a randomized search over the hyperparameter values that we listed in `parameters`.

In [None]:
rscv = RandomizedSearchCV(pipe, parameters, cv=5, verbose=1)

Execute the randomized search by calling the `fit()` function of the object.

In [None]:
rscv.fit(X_train, y_train)
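
The following is a minimal, self-contained sketch of the same workflow on synthetic data, with `n_iter` and `random_state` set explicitly so that the search is small and reproducible. The data, the smaller search space, and the names `demo_pipe`, `demo_space`, and `demo_search` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', DecisionTreeClassifier(random_state=0)),
])
demo_space = {
    'classifier__criterion': ['gini', 'entropy'],
    'classifier__max_depth': [2, 4, 8],
}

# Try 4 of the 6 possible combinations, each with 3-fold cross-validation
demo_search = RandomizedSearchCV(demo_pipe, demo_space, n_iter=4, cv=3, random_state=0)
demo_search.fit(X_demo, y_demo)

print(len(demo_search.cv_results_['params']))  # number of combinations tried
print(demo_search.best_params_)                # best combination found
print(demo_search.best_score_)                 # its mean cross-validated accuracy

# With refit=True (the default), the best pipeline is refit on all of X_demo
best_depth = demo_search.best_estimator_.named_steps['classifier'].max_depth
print(best_depth)
```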

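By default (`refit=True`), `RandomizedSearchCV` refits the pipeline on the whole training set using the best parameters found, so the fitted search object can be used directly as a model.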

Get the prediction of the model on the training set.

In [None]:
predictions = rscv.predict(X_train)
predictions

Compute the accuracy of the model on the train set.

In [None]:
np.mean(predictions == y_train)

Get the prediction of the model on the validation set.

In [None]:
predictions = rscv.predict(X_val)
predictions

Compute the accuracy of the model on the validation set.

In [None]:
np.mean(predictions == y_val)

Get the prediction of the model on the test set.

In [None]:
predictions = rscv.predict(X_test)
predictions

Compute the accuracy of the model on the test set.

In [None]:
np.mean(predictions == y_test)

Here are the best parameters found:

In [None]:
rscv.best_params_

## Pipeline with different classifiers + random search

`sklearn.pipeline.Pipeline` cannot handle multiple classifiers by default, but we can write a class that switches the classifier for us.

source: https://stackoverflow.com/questions/48507651/multiple-classification-models-in-a-scikit-pipeline-python

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.base import BaseEstimator

class ClassifierSwitcher(BaseEstimator):

    def __init__(self, estimator=RandomForestClassifier()):
        '''
        A Custom BaseEstimator that can switch between classifiers.
        :param estimator: sklearn object - The classifier
        '''

        self.estimator = estimator


    def fit(self, X, y=None, **kwargs):
        self.estimator.fit(X, y)
        return self


    def predict(self, X, y=None):
        return self.estimator.predict(X)


    def predict_proba(self, X):
        return self.estimator.predict_proba(X)


    def score(self, X, y):
        return self.estimator.score(X, y)
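
To illustrate how the switcher is used, the sketch below swaps the wrapped classifier with `set_params()`, which is the same mechanism the search relies on when it sets `classifier__estimator`. The class is repeated in condensed form so the example runs on its own; the synthetic data and the name `switcher` are assumptions for illustration.

```python
from sklearn.base import BaseEstimator
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

class ClassifierSwitcher(BaseEstimator):
    '''Condensed copy of the wrapper above, repeated for a standalone example.'''

    def __init__(self, estimator=RandomForestClassifier()):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        return self.estimator.predict(X)

    def score(self, X, y):
        return self.estimator.score(X, y)

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

switcher = ClassifierSwitcher()
print(type(switcher.estimator).__name__)  # RandomForestClassifier

# Replace the wrapped classifier and set one of its hyperparameters
switcher.set_params(estimator=AdaBoostClassifier(), estimator__n_estimators=20)
print(type(switcher.estimator).__name__)  # AdaBoostClassifier

switcher.fit(X_demo, y_demo)
demo_score = switcher.score(X_demo, y_demo)
print(demo_score)
```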

Import `sklearn.ensemble.AdaBoostClassifier`.

In [None]:
from sklearn.ensemble import AdaBoostClassifier

Here, we have a pipeline that experiments with a random forest classifier and an AdaBoost classifier. For each classifier, we also list the hyperparameters that we want the random search to vary.

In [None]:
pipeline = Pipeline([
    ('classifier', ClassifierSwitcher()),
])

parameters = [
    {
        'classifier__estimator': [RandomForestClassifier()],
        'classifier__estimator__criterion': ['gini', 'entropy'],
        'classifier__estimator__min_impurity_decrease': [0.001, 0.01, 0.05, 0.1, 0.3, 0.5],
        'classifier__estimator__max_depth': [5, 10, 20, 30]
    },
    {
        'classifier__estimator': [AdaBoostClassifier()],
        'classifier__estimator__n_estimators': [100, 150, 200, 250, 300],
        'classifier__estimator__learning_rate': [0.001, 0.01, 0.1, 1]
    },
]

Instantiate a `RandomizedSearchCV` object. Pass the `Pipeline` object and the variable `parameters`.

In [None]:
rscv = RandomizedSearchCV(pipeline, parameters, cv=5, n_jobs=12, verbose=3)

Execute the randomized search by calling the `fit()` function of the object.

In [None]:
rscv.fit(X_train, y_train)

Compute the accuracy of the model on the train set.

In [None]:
rscv.score(X_train, y_train)

Compute the accuracy of the model on the validation set.

In [None]:
rscv.score(X_val, y_val)

Compute the accuracy of the model on the test set.

In [None]:
rscv.score(X_test, y_test)

Here are the best parameters found:

In [None]:
rscv.best_params_

Hope this will help you with your project.

## <center>fin</center>