<img src="https://drive.google.com/uc?id=1-cL5eOpEsbuIEkvwW2KnpXC12-PAbamr" style="Width:1000px">

#  🛰️ PCA to Detect Dunes in Satellite Images



<img src="img/Sand-Dunes.jpg" style="width:800px">

This challenge is based on a real-world data science problem: detecting dunes in satellite images. I worked on this problem to be able to detect dunes on Mars. My approach was to use computer vision (with deep learning), with a model trained on Earth dunes, and then deploy this algorithm on Martian satellite imagery.

📺 If you are interested in this research project, you can <a href="https://youtu.be/3u-cOAyKocQ">watch my 15-minute YouTube video</a> of a talk I gave at the AGU Meeting in 2021.

Here, I am giving you a much reduced version of my dataset, both in the number of images and in the size of the images. The satellite images come from the **Sentinel 2** dataset, and were reduced to greyscale. We will use conventional statistical machine learning to determine whether an image contains a dune (class '1') or not (class '0').

We will also use PCA to reduce the dimensionality of our dataset, and convince ourselves of the benefit of doing that.

This exercise is a bit different than the others, in the sense that you are not dealing purely with tabular data. So keep this in mind:
- each image is an observation (sample)
- each pixel's luminosity level is a feature

## Load dataset

To make life simpler for you, I created a data loader class that will unzip the zipped file in the current directory, load each image within the unzipped folder, and return properly split `train_images`, `test_images`, `y_train`, `y_test`. Simply run the code cells below to load the data in this notebook.

The code to load the data is not overly complicated, and I would encourage you to look at it (in `dunes_dataset.py`)

In [None]:
from nbta.utils import download_data
download_data(id='11_qI2fvug7ddAjw2Ik02le6OJuvf27O9')

In [None]:
from dunes_dataset import DunesDataLoader

train_images, test_images, y_train, y_test = DunesDataLoader().get_data()

# Exploring the dataset

First, look at the shape of the `train_images`:

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

You can see that you have **2576 images**, each of **64 x 64 pixels**. Let's see some example images. Write code to plot 25 images in an image matrix of 5 x 5 images, and write the image label as the title for each tile (**tip**: `plt.subplots` is your friend here).
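As a rough illustration of the `plt.subplots` pattern, here is a minimal sketch on synthetic data. The names `demo_images` and `demo_labels` are made up for this example and stand in for the real images and labels; the random pixels will of course look like noise.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins (assumption): 25 random greyscale images with random 0/1 labels
rng = np.random.default_rng(0)
demo_images = rng.integers(0, 256, size=(25, 64, 64))
demo_labels = rng.integers(0, 2, size=25)

# One figure with a 5 x 5 grid of axes; axes.flat iterates over all 25 tiles
fig, axes = plt.subplots(5, 5, figsize=(10, 10))
for ax, image, label in zip(axes.flat, demo_images, demo_labels):
    ax.imshow(image, cmap="gray")
    ax.set_title(f"label: {label}")
    ax.axis("off")  # hide the ticks, they carry no information here

plt.tight_layout()
plt.show()
```

The same loop should work on the real data once you slice out 25 images and their labels.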

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

# Preparing our `X_train` and `X_test`

You will see in `deep-learning` that it is possible to use `convolutional neural networks` to input features from images to a neural network. But in statistical machine learning, this is not possible: each feature needs to be a number. So what are our features here? Well, it is the intensity of the pixel values! 

This means that each image has 64 x 64 = **4096 features**. There are two issues that need to be sorted before we can use the data in `train_images` and `test_images`:
1. The pixels are arranged in a matrix (dimension of 64x64). We need all the pixels lined up as individual features. That can be sorted by using `np.reshape` and reshaping both the `train_images` and `test_images` to an appropriate shape.
2. We need to normalize the values: right now, the pixel intensity of the images ranges from 0 to 255. Once the arrays are reshaped, it is easy to simply divide their values by 255: this will result in a `MinMax` scaling (values between 0 and 1).

Create your `X_train` and `X_test` by following the instructions above on `train_images` and `test_images` respectively.
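To make the two steps concrete, here is a small sketch on a synthetic array. `demo_images` and `demo_X` are hypothetical names, and the array only mimics the assumed layout of `train_images` (number of images, 64, 64).

```python
import numpy as np

# Synthetic stand-in (assumption): 10 greyscale images of 64 x 64 pixels, values 0..255
rng = np.random.default_rng(0)
demo_images = rng.integers(0, 256, size=(10, 64, 64))

# Step 1: flatten each 64 x 64 image into a row of 4096 features
# (-1 lets NumPy work out the second dimension, here 64 * 64)
demo_X = demo_images.reshape(len(demo_images), -1)

# Step 2: scale the pixel intensities from [0, 255] to [0, 1]
demo_X = demo_X / 255

print(demo_X.shape, demo_X.min(), demo_X.max())
```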

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

### 🧪 Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('reshaped', shape=X_train.shape)
result.write()
print(result.check())

## Compress images with linear PCA

This dunes dataset comprises 64 × 64 pixel images (4096 dimensions). Given that you have only 2576 training images, you will have very high sparsity in your data representation (few pixels of the same value in different images): the **curse of dimensionality** predicts that your machine learning algorithm will not do very well. But don't take my word for it: you will test this for yourself.

What we will do is use PCA to reduce the dimensions of the data.

**Apply PCA to the dataset (both `X_train` and `X_test`)**, to reduce dimensions to 100, by setting `n_components=100`. Put your transformation into variables named `X_train_projected` and `X_test_projected` (we will use the original `X_train` and `X_test` below so don't replace them)
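Here is a minimal sketch of the fit/transform pattern with scikit-learn's `PCA`, on small synthetic data so that it runs quickly. All `demo_` names and the sizes (256 features reduced to 20) are assumptions for illustration only. Note that the PCA is fitted on the training data and then reused as-is on the test data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-ins (assumption): 300 training and 50 test samples with 256 features
rng = np.random.default_rng(0)
demo_X_train = rng.random((300, 256))
demo_X_test = rng.random((50, 256))

# Fit the PCA on the training data, then project both sets with the same components
demo_pca = PCA(n_components=20)
demo_train_projected = demo_pca.fit_transform(demo_X_train)
demo_test_projected = demo_pca.transform(demo_X_test)

print(demo_train_projected.shape, demo_test_projected.shape)
```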

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

## Thinking in terms of dimensionality reduction

It is important to think about what we have done above. We have reduced the images represented by the original 4096 pixels (or 4096 dimensions) into just 100 dimensions (the 100 first **principal components** of the images). Remember from your `Numerical Mathematics` and my own module that what we call components are directions of most variance of the dataset. 

Reducing the 4096 pixels that describe each image to just 100 values is a reduction in dimensionality by a factor of about 40!

**How does it work?**

- The PCA has found the 100 most representative directions of what distinguishes the images from one another, so every image can be described with just 100 values.

- They are the directions of most variance. 

- You can access them in `pca.components_`

👉  Look at the first component of this array of components, and its shape

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

As you can see, it's a vector of 4096 values. We now have 100 components of 4096 values each.

Each satellite image is now described as the mean image plus a linear combination (weighted sum) of those components.

Let's reconstruct one satellite image from its reduced representation to see how it works.

👉 Use `inverse_transform` on your `X_train_projected` to reconstruct a `data_reconstructed` dataset

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

👉 Plot the 1st picture of the reconstructed dataset, and compare it with the original one. 

<details>
    <summary>💡Hint</summary>

You'll have to reshape the flattened data into an "image" with the appropriate pixel dimensions (64x64)
</details>

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

### 🧪 Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('projection', shape=X_train_projected.shape)
result.write()
print(result.check())

## Investigate your Principal Components

👉 Image-plot the "mean" satellite image in the dataset

<details>
    <summary>💡Hint</summary>


You can use `pca.mean_`
</details>


In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

👉 Access your first PC. What's its shape? Print it as a `pd.Series` or a NumPy array. What does each value represent?

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

Each PC is a flattened "image" of 4096 pixels

- Your first PCs are the most important "directions" of your 4096-dimension dataset.

- They are the most important "linear combinations of your 4096 pixels".

- They are the ones which preserve the most "variance" when your dataset of pictures is projected onto them.

- The first few PCs highlight the regions of the 2D pixel grid that bear the most differences between your 2576 images

👉 Image-plot the **first 5** principal components, as well as the **last** one.
Do you see more intuitively what PCs are?

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

Every image can be represented by the "mean satellite image" plus a linear combination of the 100 "PC satellite images".
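This statement can be checked numerically. The sketch below uses small synthetic data and made-up `demo_` names, and assumes the default `whiten=False` of scikit-learn's `PCA`. It rebuilds the images by hand from the mean and the components and compares the result with `inverse_transform`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in (assumption): 200 flattened "images" with 64 pixels each
rng = np.random.default_rng(0)
demo_X = rng.random((200, 64))

demo_pca = PCA(n_components=10).fit(demo_X)
demo_projected = demo_pca.transform(demo_X)  # 10 weights per image

# Reconstruction done by scikit-learn
demo_reconstructed = demo_pca.inverse_transform(demo_projected)

# Same reconstruction by hand: mean image + weighted sum of the components
demo_manual = demo_pca.mean_ + demo_projected @ demo_pca.components_

print(np.allclose(demo_reconstructed, demo_manual))
```

Because only 10 of the 64 possible components are kept here, the reconstruction is an approximation of `demo_X`, not an exact copy.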

## How to Choose the Number of Components?

In practice, it is very important to find how many components are needed to describe the data without losing too much information. This can be determined visually by plotting the cumulative sum of `explained_variance_ratio_` as a function of the number of components.
 
👉 Plot it below for the first 100 components

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

This curve quantifies how much of the total variance is contained within the first components. For example:
- The first few components contain more than 80% of the variance,
- while only a few components are needed to describe 95% of the variance!

This means we have a great opportunity here to reduce the data further.

**❓ What is the minimal number of components you need to keep to get at least 95% of the variance? Assign the value to a variable called `minimal_pc_count`**
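One possible way to read such a threshold off the cumulative curve is sketched below on synthetic data. The `demo_` names and the data itself are assumptions, so the number it prints says nothing about the dunes dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in (assumption): 50 features whose spread decreases from left to right
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(500, 50)) * np.linspace(10, 0.1, 50)

# With no n_components given, all 50 components are kept
demo_pca = PCA().fit(demo_X)

# Cumulative share of variance explained by the first k components
demo_cumulative = np.cumsum(demo_pca.explained_variance_ratio_)

# argmax returns the index of the first True; +1 turns an index into a count
demo_count = int(np.argmax(demo_cumulative >= 0.95)) + 1
print(demo_count)
```

Plotting `demo_cumulative` against the component number gives the curve described above.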

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

### 🧪 Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('components', min_pc = minimal_pc_count)
result.write()
print(result.check())

## PCA as feature engineering

From the test you did above, you now know how many components are needed to capture 95% of the variance of your image dataset. This means you can use this value for `n_components` in a new `PCA` and then use the `principal components` as features for a classification task!

### Transform your training set to reduce the number of dimensions / features

👉 Fit a PCA __over the training data only__ and transform your `X_train` and `X_test` into the reduced dimension (the value you found above). Call your transformed components `X_train_red` and `X_test_red` (for 'reduced').

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

# Testing the impact of PCA on dune classification

Here we will do two things:

* 1️⃣ We will check how `PCA` impacts our training time
* 2️⃣ We will check how `PCA` impacts our predictions

We will limit ourselves to a simple linear classification algorithm (`LogisticRegression`).

First, using the `%%timeit` magic function, train a `LogisticRegression` model (call it `full_lr`) with the following parameters:
* `max_iter = 5000`
* Trained on `X_train` and `y_train` (the original features)

Take note of the training time!

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

Now, also using the `%%timeit` magic function, train a `LogisticRegression` model (call it `pca_lr`) but this time with the following parameters:
* `max_iter = 5000`
* Trained on `X_train_red` and `y_train` (the PCA features)

Take note of the training time again.

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

### Difference in training time

You should see an interesting difference in training time (training on the `PCA` features should be much faster). Below, save the **ratio** of the `full_lr` training time over the `pca_lr` training time in a variable called `time_ratio`: make sure to use the **same units** when you do this ratio. By how many orders of magnitude is PCA training faster?
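If you prefer to measure the times in code rather than copy them from the `%%timeit` output, a single-run measurement with `time.perf_counter` is one option. This is only a sketch on synthetic data: the helper `fit_seconds` and all `demo_` names are made up, and a single run is noisier than the average that `%%timeit` reports.

```python
import time

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins (assumption): 200 samples, 400 features, random 0/1 labels
rng = np.random.default_rng(0)
demo_X = rng.random((200, 400))
demo_y = rng.integers(0, 2, size=200)
demo_X_red = PCA(n_components=10).fit_transform(demo_X)


def fit_seconds(features, labels):
    """Return the wall-clock time, in seconds, of one LogisticRegression fit."""
    start = time.perf_counter()
    LogisticRegression(max_iter=5000).fit(features, labels)
    return time.perf_counter() - start


demo_full_seconds = fit_seconds(demo_X, demo_y)
demo_pca_seconds = fit_seconds(demo_X_red, demo_y)

# Both times are in seconds, so the ratio is unit-free
demo_ratio = demo_full_seconds / demo_pca_seconds
print(demo_ratio, np.log10(demo_ratio))  # log10 gives the orders of magnitude
```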

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

## Difference in performance

When you execute code in a `%%timeit` cell, you effectively execute it in a loop (similar to a cross-validation). This means that the two models we trained before are no longer in memory (or more precisely, they are not available in the *scope* of this notebook).

Retrain a `LogisticRegression` model with the full `X_train` and call it `full_model`, and retrain a model using the `X_train_red` and call it `pca_model`. This time, don't worry about the timing.

Then, produce an `accuracy_score` on the test set using both models (`X_test` for `full_model`, and `X_test_red` for `pca_model`). Save their respective `accuracy_score` into variables named `full_score` and `pca_score`. What can you conclude? Is the `PCA` score degraded? Why?
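The comparison pattern looks roughly like the sketch below, which runs on a synthetic classification problem from `make_classification`. All `demo_` names and sizes are assumptions, and the scores it prints say nothing about the dunes data. The point to notice is that each model is scored on test features prepared the same way as its training features.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 400 samples, 100 features, 2 classes
demo_X, demo_y = make_classification(
    n_samples=400, n_features=100, n_informative=10, random_state=0
)
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, random_state=0
)

# PCA fitted on the training data only, then applied to both sets
demo_pca = PCA(n_components=10).fit(demo_X_train)
demo_X_train_red = demo_pca.transform(demo_X_train)
demo_X_test_red = demo_pca.transform(demo_X_test)

demo_full_model = LogisticRegression(max_iter=5000).fit(demo_X_train, demo_y_train)
demo_pca_model = LogisticRegression(max_iter=5000).fit(demo_X_train_red, demo_y_train)

# Full model sees the original test features, PCA model the reduced ones
demo_full_score = accuracy_score(demo_y_test, demo_full_model.predict(demo_X_test))
demo_pca_score = accuracy_score(demo_y_test, demo_pca_model.predict(demo_X_test_red))
print(demo_full_score, demo_pca_score)
```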

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

<details>
    <summary>💡 Hint</summary>
    
- Your PCA model has the same number of training images, but far fewer features
- The features selected are the ones with the greatest variance, i.e. they explain most of the difference between images
- Thus, because of the curse of dimensionality, the performance of the full model should be lower than the performance of your PCA model
    
</details>

### 🧪 Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('classification', full_score = full_score,
                         pca_score=pca_score, ratio=time_ratio)
result.write()
print(result.check())

# Further improving our score

Up to now, we have used `LogisticRegression` to predict our labels, and a set value for our `n_components`. In this last part of the exercise, you will get a chance to further improve your classification score by doing the following:

* Create a pipeline containing a `PCA()` and a `RandomForestClassifier()`
* Tune your pipeline using `GridSearchCV`

I recommend tuning the following parameters with values you deem reasonable: for PCA the `n_components`, and for the RandomForestClassifier `max_depth` and `max_features`.

Save your best `accuracy_score` for the `X_test` in a variable named `best_score`.
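As a starting point, here is a sketch of how such a pipeline and grid can be wired together. It runs on a small synthetic problem so that the search finishes quickly. The step names `"pca"` and `"forest"`, the `demo_` names and the grid values are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in (assumption): 300 samples, 50 features, 2 classes
demo_X, demo_y = make_classification(
    n_samples=300, n_features=50, n_informative=8, random_state=0
)
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, random_state=0
)

demo_pipe = Pipeline([
    ("pca", PCA()),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter name>
demo_grid = {
    "pca__n_components": [5, 10],
    "forest__max_depth": [3, None],
    "forest__max_features": ["sqrt", 0.5],
}

demo_search = GridSearchCV(demo_pipe, demo_grid, cv=3, scoring="accuracy")
demo_search.fit(demo_X_train, demo_y_train)

# By default the best pipeline is refitted on the whole training set,
# so the search object can predict on the test set directly
demo_best_score = accuracy_score(demo_y_test, demo_search.predict(demo_X_test))
print(demo_search.best_params_, demo_best_score)
```

Putting the `PCA` inside the pipeline means it is refitted on each cross-validation training fold, so no information from the validation fold leaks into the components.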

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

### 🧪 Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('full_pipeline', best_accuracy=best_score)
result.write()
print(result.check())

# 🏁 Finished!

Well done! <span style="color:teal">**Push your exercise to GitHub**</span>, and move on to the next one.