<a href="https://colab.research.google.com/github/yasminghd/2022_ML_Earth_Env_Sci/blob/main/Lab_Notebooks/S4_1_Dimensionality.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Chapter 8 – Dimensionality Reduction**

<img src='https://unils-my.sharepoint.com/:i:/g/personal/tom_beucler_unil_ch/EX7KlNGWYypLnH_53OnJR6oBjfgb_gCZ4gmnOeR68a6zMA?download=1'>
<center> Caption: <i>Denise diagnoses an overheated CPU at our data center in The Dalles, Oregon. <br> For more than a decade, we have built some of the world's most efficient servers.</i> <br> Photo from the <a href='https://www.google.com/about/datacenters/gallery/'>Google Data Center gallery</a> </center>

*Our world is increasingly filled with data from all sorts of sources, including environmental data. Can we project the data onto a smaller, meaningful space to save computation time and make the results easier to explain?*

This notebook will be used in the lab session for week 4 of the course. It covers Chapter 8 of Géron and builds on the [notebooks made available on _Github_](https://github.com/ageron/handson-ml2).

Need a reminder of last week's labs? Click [_here_](https://colab.research.google.com/github/tbeucler/2022_ML_Earth_Env_Sci/blob/main/Lab_Notebooks/Week_3_Decision_Trees_Random_Forests_SVMs.ipynb) to go to the notebook for week 3 of the course.

## Setup

First, let's import a few common modules, ensure Matplotlib plots figures inline, and prepare a function to save the figures. We also check that Python 3.5 or later is installed (although Python 2.x may work, it is deprecated, so we strongly recommend you use Python 3 instead), as well as Scikit-Learn ≥0.20.

In [1]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
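# Compare version numbers numerically; comparing the raw strings can misorder them
sklearn_version = tuple(int(part) for part in sklearn.__version__.split(".")[:2])
assert sklearn_version >= (0, 20)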

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
rnd_seed = 42
rnd_gen = np.random.default_rng(rnd_seed)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "dim_reduction"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

## Dimensionality Reduction using PCA

This week we'll be looking at how to reduce the dimensionality of a large dataset in order to improve our classification algorithm's performance! With that in mind, let's begin the exercise by loading the MNIST dataset.

### **Q1) Load the input features and truth variable into `X` and `y`, then split the data into a training and a test dataset using scikit-learn's `train_test_split` function. Use `test_size=0.15`, and remember to set the random state to `rnd_seed`!**

*Hint 1: The `'data'` and `'target'` keys for mnist will return X and y.*

*Hint 2: [Here's the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for train/test split.*

In [2]:
# Load the mnist dataset
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)

In [4]:
#mnist

In [6]:
# Load X and y
X = mnist.data
y = mnist.target

In [7]:
# Import the train/test split function from sklearn
from sklearn.model_selection import train_test_split

In [11]:
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=rnd_seed)
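
To see what `test_size` and `random_state` do without waiting on the full MNIST arrays, here is a minimal sketch on a toy dataset. The names `X_toy` and `y_toy` and the array sizes are illustrative assumptions, not part of the lab. It checks that roughly 15% of the samples end up in the test set and that the same seed reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for MNIST: 100 samples, 4 features, labels 0-9 (all illustrative)
X_toy = np.arange(400).reshape(100, 4)
y_toy = np.arange(100) % 10

# Two independent splits that share the same seed
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_toy, y_toy, test_size=0.15, random_state=42)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_toy, y_toy, test_size=0.15, random_state=42)

print("Train/test sizes:", len(Xa_train), len(Xa_test))
print("Same split both times:", np.array_equal(Xa_test, Xb_test))
```

Fixing the seed is what makes the timing and accuracy numbers later in the notebook comparable between runs.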

We now once again have a training and a testing dataset to work with. Let's try training a random forest classifier on it. You've had experience with them before, so let's have you import the `RandomForestClassifier` from sklearn and instantiate it.

### **Q2) Import the `RandomForestClassifier` model from sklearn. Then, instantiate it with 100 estimators and set the random state to `rnd_seed`!**

*Hint 1: [Here's the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) for `RandomForestClassifier`*

*Hint 2: If you're still confused about **instantiation**, there's a [blurb on Wikipedia](https://en.wikipedia.org/wiki/Instance_(computer_science)) describing it in the context of computer science.*

In [12]:
# Complete the code
from sklearn.ensemble import RandomForestClassifier

In [13]:
rnd_clf = RandomForestClassifier(n_estimators=100,       # Number of estimators
                                 random_state=rnd_seed)  # Random state

We're now going to measure how quickly the algorithm is fitted to the MNIST dataset! To do this, we'll have to import the `time` library. With it, we can take a timestamp immediately before and immediately after fitting the algorithm; the training time is the difference between the two.

### **Q3) Import the `time` library and calculate how long it takes to fit the `RandomForestClassifier` model.**
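
The before/after timestamp pattern can be sketched on any function call. In this minimal example, the helper name `time_call` and the toy workload are assumptions made for illustration; it wraps a call between two `time.time()` readings and returns the difference.

```python
import time

def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    t0 = time.time()                 # timestamp before the call
    result = func(*args, **kwargs)
    t1 = time.time()                 # timestamp after the call
    return result, t1 - t0

# Toy workload: sum the first million integers
total, elapsed = time_call(sum, range(1_000_000))
print(f"Summing took {elapsed:.4f}s")
```

The same two-timestamp idea applies to a model's `fit` call; wall-clock timings vary from run to run and machine to machine, so treat them as rough comparisons.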

*Hint 1: [Here's the documentation](https://docs.python.org/3/library/time.html#time.time) to the function used for getting timestamps*

*Hint 2: [Here's the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.fit) for the fitting method used in `RandomForestClassifier`.*

In [14]:
import time

In [15]:
t0 = time.time() # Load the timestamp before running
rnd_clf.fit(X_train, y_train) # Fit the model with the training data
t1 = time.time()  # Load the timestamp after running

In [16]:
train_t_rf = t1-t0

print(f"Training took {train_t_rf:.2f}s")

Training took 36.53s


We care about more than just how long it took to train the model, however! Let's get an accuracy score for our model.

### **Q4) Get an accuracy score for the predictions from the `RandomForestClassifier`.**
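
As a reminder of what the accuracy metric measures, here is a small sketch with made-up labels (the arrays `y_true_toy` and `y_pred_toy` are illustrative, not lab data). It shows that `accuracy_score` is simply the fraction of predictions that match the truth.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration (MNIST targets from fetch_openml are strings too)
y_true_toy = np.array(["3", "7", "7", "1"])
y_pred_toy = np.array(["3", "7", "1", "1"])

acc_sklearn = accuracy_score(y_true_toy, y_pred_toy)
acc_manual = np.mean(y_true_toy == y_pred_toy)  # fraction of matching labels
print(f"accuracy_score: {acc_sklearn:.2%}, manual: {acc_manual:.2%}")
```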

*Hint 1: [Here is the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) for the `accuracy_score` metric in sklearn.* 

*Hint 2: [Here is the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict) for the predict method in `RandomForestClassifier`*

In [17]:
from sklearn.metrics import accuracy_score # Import the accuracy score metric

In [19]:
# Get a set of predictions from the random forest classifier
y_pred = rnd_clf.predict(X_test)   # Get a set of predictions from the test set

In [20]:
rf_accuracy = accuracy_score(y_test, y_pred)  # Feed in the truth and predictions

In [21]:
print(f"RF Model Accuracy: {rf_accuracy:.2%}")

RF Model Accuracy: 96.71%


Let's try doing the same with a logistic regression algorithm to see how it compares.

### **Q5) Repeat Q2-4 with a logistic regression algorithm using sklearn's `LogisticRegression` class. Hyperparameters: `multi_class='multinomial'` and `solver='lbfgs'`**

*Hint 1: [Here is the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) for the `LogisticRegression` class.*

In [22]:
from sklearn.linear_model import LogisticRegression

In [None]:
log_clf = LogisticRegression(_____="multinomial",  # Multiclass
                             _____="lbfgs",        # Solver
                             _____=42)             # Random state

In [None]:
t0 = time.time() # Timestamp before training
log_clf.fit(_____, _____) # Fit the model with the training data
t1 = time.time() # Timestamp after training

In [None]:
train_t_log = t1-t0
print(f"Training took {train_t_log:.2f}s")

In [None]:
# Get a set of predictions from the logistic regression classifier
y_pred = _____._____(_____)   # Get a set of predictions from the test set
log_accuracy = accuracy_score(_____, _____)  # Feed in the truth and predictions

In [None]:
print(f"Log Model Accuracy: {log_accuracy:.2%}")

So far, everything we've done repeats techniques from previous labs. Now we'll try out an algorithm for reducing dimensionality: principal component analysis (PCA). We'll project the data onto just enough principal axes to explain at least 95% of the variability (variance) in the data.

### **Q6) Import scikit-learn's implementation of `PCA` and fit it to the training dataset so that 95% of the variability is explained.**
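
To make "enough axes to explain 95% of the variability" concrete, the sketch below counts the components by hand. It uses scikit-learn's small bundled digits dataset (8×8 images, 64 features) rather than MNIST, so it runs in seconds; the dataset choice and the names `X_digits`, `pca_full`, and `d` are assumptions for illustration. It fits a full PCA, accumulates the explained variance ratio, and finds the first component count that reaches the threshold.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Small bundled dataset (not MNIST): 1797 samples, 64 features
X_digits = load_digits().data

# Fit PCA with all components kept, then accumulate the explained variance
pca_full = PCA()
pca_full.fit(X_digits)
cumsum = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%
d = int(np.argmax(cumsum >= 0.95)) + 1
print(f"{d} of {X_digits.shape[1]} components explain {cumsum[d - 1]:.1%} of the variance")
```

Scikit-learn's `PCA` can also pick this number for you; the documentation linked in the hints for Q6 describes how.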

*Hint 1: [Here is the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) for scikit's `PCA` class.*

*Hint 2: [Here is the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA.fit_transform) for scikit's `.fit_transform()` method.*

In [None]:
from _____._____ import _____ # Importing PCA

In [None]:
pca = PCA(_____=_____) # Set number of components to explain 95% of variability

In [None]:
X_train_reduced = pca._____(____) # Fit-transform the training data

In [None]:
X_test_reduced = pca._____(____) # Transform the test data (!!No fitting!!)

### **Q7) Repeat Q3 & Q4 using the *reduced* `X_train` dataset instead of `X_train`.**

In [None]:
# Complete the code

t0 = _____._____() # Load the timestamp before running
rnd_clf.___(_____, _____) # Fit the model with the reduced training data
t1 = _____._____()  # Load the timestamp after running

In [None]:
# Use a separate name so the full-dataset timing is not overwritten
red_train_t_rf = t1-t0

print(f"Training took {red_train_t_rf:.2f}s")

In [None]:
# Get a set of predictions from the random forest classifier
y_pred = _____._____(_____)   # Get predictions from the reduced test set

In [None]:
red_rf_accuracy = accuracy_score(_____, _____)  # Feed in the truth and predictions

print(f"RF Model Accuracy on reduced dataset: {red_rf_accuracy:.2%}")

### **Q8) Repeat Q5 using the *reduced* `X_train` dataset instead of `X_train`.**

In [None]:
#Complete the code

t0 = time.time() # Timestamp before training
log_clf.fit(_____, _____) # Fit the model with the reduced training data
t1 = time.time() # Timestamp after training

In [None]:
# Use a separate name so the full-dataset timing is not overwritten
red_train_t_log = t1-t0
print(f"Training took {red_train_t_log:.2f}s")

In [None]:
# Get a set of predictions from the logistic regression classifier
y_pred = _____._____(_____)   # Get a set of predictions from the test set

In [None]:
red_log_accuracy = accuracy_score(_____, _____)  # Feed in the truth and predictions
print(f"Log Model Accuracy on reduced training data: {red_log_accuracy:.2%}")

You can now compare how well the random forest classifier and logistic regression classifier performed on both the full dataset and the reduced dataset. What were you able to observe? 
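
One way to line the results up side by side is a small text table. The sketch below is only a suggestion: the helper `format_results` is hypothetical, and the single row shown uses the random forest numbers printed earlier in this notebook (36.53 s, 96.71%). Add one entry per model and dataset with the values you measured yourself.

```python
def format_results(results):
    """Build a plain-text table from {label: (train_seconds, accuracy)}."""
    lines = [f"{'Model':<28}{'Train time (s)':>16}{'Accuracy':>10}"]
    for label, (seconds, accuracy) in results.items():
        lines.append(f"{label:<28}{seconds:>16.2f}{accuracy:>10.2%}")
    return "\n".join(lines)

# Only the values reported above are filled in; extend with your own measurements
results = {"Random forest (full data)": (36.53, 0.9671)}
table = format_results(results)
print(table)
```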

Write your comments on the performance of the algorithms in this box, if you'd like 😀
(Double click to activate editing mode)