# Problem Set 2 - Principal Component Analysis

The main objective of this problem set is for you to practice implementing principal component analysis in Scikit-Learn on a real geophysical dataset.    

### Scientific Premise:   
Ice slabs are multi-meter thick layers of refrozen ice that form in snow and firn on the Greenland Ice Sheet. They lead to more ice sheet mass loss and sea level rise by preventing snow from absorbing meltwater and enhancing meltwater runoff. Therefore, we are very interested in understanding where these ice slabs are located and how fast they are growing. Unfortunately, they are challenging to map since they are buried beneath the surface and we only have limited data about their extent from some airborne radar flight lines. Over the next two problem sets, we will work on predicting the location of ice slabs using remote sensing and climate model data. In this problem set, you will start by preparing the dataset for machine learning.   

### Data Set Variables:    
**X** - x coordinate of data point (EPSG 3413)       
**Y** - y coordinate of data point (EPSG 3413)         
**IceSlab** - binary variable indicating if an ice slab was observed at a given location, 0 = no ice slab, 1 = ice slab observed            
**HV** - horizontal-vertical polarized backscatter from the Sentinel-1 SAR satellite in digital number space        
**SAR_melt** - non-dimensional estimate of firn/snow water content, lower values indicate more stored melt           
**MAR** - decadal average melt to accumulation (snowfall) ratio from a regional climate model         
**RACMO_melt** - decadal average surface melt from the regional climate model RACMO in mm of water equivalent per year           
**RACMO_snow** - decadal average snowfall rate from the regional climate model RACMO in mm of water equivalent per year          
**MAT** - mean annual temperature in degrees Celsius from the regional climate model MAR           
**Xpol** - cross-polarized backscatter ratio from the Sentinel-1 SAR satellite in digital number space             
**Elevation** - surface elevation in meters           

In the code block below, import your packages.

**[1] (5 pts)** Load the dataset from the course GitHub at https://raw.githubusercontent.com/rtculberg/ml_in_eas/main/data/IceSlabs.csv. Display the first few rows of the dataframe.

**[2] (5 pts)** Check for and remove any rows that have NaN values. Remove the X and Y columns, since we are not going to use spatial position variables when we eventually train our machine learning model.
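
As a minimal sketch of the pandas calls involved, the snippet below runs the same kind of cleanup on a tiny made-up table. The dataframe `toy` and its values are hypothetical stand-ins, not the course data; only the column names are borrowed from the variable list above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: column names mirror the assignment, values are made up.
toy = pd.DataFrame({
    "X": [0.0, 1.0, 2.0, 3.0],
    "Y": [5.0, 6.0, 7.0, 8.0],
    "IceSlab": [0, 1, 0, 1],
    "HV": [0.2, np.nan, 0.4, 0.5],
    "Elevation": [1500.0, 1600.0, np.nan, 1800.0],
})

# Count missing values per column before deciding what to drop.
nan_counts = toy.isna().sum()
print(nan_counts)

# Drop every row that contains at least one NaN, then drop the coordinate columns.
clean = toy.dropna().drop(columns=["X", "Y"]).reset_index(drop=True)
print(clean)
```

Checking `isna().sum()` first shows how many rows will be lost before anything is removed.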

**[3] (5 pts)** Make a copy of the dataframe and drop the "IceSlab" column. Normalize the data using a z-score transform. Display the first few rows of your newly normalized dataframe and print out the standard deviation of each column.
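
For reference, a z-score transform subtracts each column's mean and divides by its standard deviation. The sketch below applies it to synthetic data; the dataframe `features` and its columns `a` and `b` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic features with very different means and spreads (illustrative only).
features = pd.DataFrame({
    "a": rng.normal(10.0, 2.0, size=200),
    "b": rng.normal(-3.0, 0.5, size=200),
})

# z-score: (value - column mean) / column standard deviation.
# ddof=0 uses the population standard deviation, as sklearn's StandardScaler does.
zscored = (features - features.mean()) / features.std(ddof=0)

print(zscored.mean().round(6))   # approximately 0 for every column
print(zscored.std(ddof=0))       # 1 for every column
```

Note that the pandas `std()` default is `ddof=1`, so printing `zscored.std()` without arguments gives values slightly above 1 when the data were scaled with `ddof=0`.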


**[4] (5 pts)** Compute the correlation matrix for the normalized data and make a colormap plot of these values using the pandas background_gradient function.     

Answer the following question:          
Would you expect any problems if we tried to apply our machine learning algorithm directly to this dataset before PCA? Why or why not?

**Hint**: See the styler documentation: https://pandas.pydata.org/docs/reference/api/pandas.io.formats.style.Styler.background_gradient.html.
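
To illustrate what a correlation matrix looks like when some variables are nearly redundant, here is a small sketch on synthetic data. The dataframe `demo` and its column names are hypothetical and not part of the ice slab dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(size=500)

# Synthetic columns (illustrative only).
demo = pd.DataFrame({
    "melt": base,
    "temperature": base + 0.1 * rng.normal(size=500),  # nearly a copy of "melt"
    "noise": rng.normal(size=500),                      # unrelated column
})

# DataFrame.corr() computes pairwise Pearson correlations by default.
corr = demo.corr()
print(corr.round(2))
```

In a notebook, `corr.style.background_gradient(cmap="coolwarm", vmin=-1, vmax=1)` should render this matrix as a colored table; the colormap and limits here are one reasonable choice, not a requirement.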

**[5] (10 pts)** Apply a PCA and determine the principal components (PCs) using the sklearn PCA fit_transform function. Leave the n_components argument blank. This will default to keeping all components.

**Hint**: See the `PCA` documentation: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
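
The following sketch shows the general shape of a `PCA` call on synthetic, already standardized data. The array `data` is made up; the point is only to show what `fit_transform` returns and where the component vectors are stored.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
latent = rng.normal(size=(300, 1))

# Three synthetic features; the first two are driven by the same latent signal.
data = np.hstack([
    latent + 0.1 * rng.normal(size=(300, 1)),
    2.0 * latent + 0.1 * rng.normal(size=(300, 1)),
    rng.normal(size=(300, 1)),
])
data = (data - data.mean(axis=0)) / data.std(axis=0)  # z-score each column

pca = PCA()                       # n_components omitted: all components are kept
scores = pca.fit_transform(data)  # principal component scores, one row per sample

print(scores.shape)               # (n_samples, n_components)
print(pca.components_.shape)      # (n_components, n_features)
```

Each row of `pca.components_` is one principal axis expressed in terms of the original features, and the columns of `scores` are mutually uncorrelated.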

**[6] (5 pts)** Make a scatter plot of PC1 vs. PC2. Use a different color for points where ice slabs were observed vs. points where ice slabs were not observed. Make sure to include a legend. Describe any trends.
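
One common way to color a scatter plot by class is to call `scatter` once per class with a boolean mask, so each class gets its own legend entry. The sketch below uses random stand-in arrays `scores` and `labels` (both hypothetical) rather than real PCA output.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)

# Hypothetical PC scores and binary labels standing in for PCA output and IceSlab.
scores = rng.normal(size=(200, 2))
labels = (scores[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

fig, ax = plt.subplots(figsize=(6, 5))
classes = [(0, "No ice slab", "tab:blue"), (1, "Ice slab", "tab:orange")]
for value, name, color in classes:
    mask = labels == value  # select the samples belonging to this class
    ax.scatter(scores[mask, 0], scores[mask, 1], s=10, color=color, label=name)

ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.legend()
```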

**[7] (5 pts)** Make a scree plot showing the percent variance explained by each principal component.       

Answer the following question:                
How many principal components do we need to retain to explain at least 90% of the variance? How does that compare to the original number of raw variables?
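
A scree plot shows the values in `explained_variance_ratio_`, and a cumulative sum of those values gives the number of components needed to reach a threshold. The sketch below works through that arithmetic on synthetic data; the array `data` and the 90% threshold variable are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)

# Synthetic correlated data: 5 features built from 2 latent signals plus noise.
latent = rng.normal(size=(400, 2))
mixing = rng.normal(size=(2, 5))
data = latent @ mixing + 0.2 * rng.normal(size=(400, 5))
data = (data - data.mean(axis=0)) / data.std(axis=0)  # z-score each column

pca = PCA().fit(data)

# Percent variance explained by each component (the heights in a scree plot).
percent = 100.0 * pca.explained_variance_ratio_
cumulative = np.cumsum(percent)

# Index of the first component at which the running total reaches 90%,
# converted from a 0-based index to a component count.
threshold = 90.0
n_keep = int(np.argmax(cumulative >= threshold)) + 1

print(percent.round(1))
print(cumulative.round(1))
print(n_keep)
```

Plotting `percent` against the component number (1, 2, ...) with `plt.plot` or `plt.bar` gives the scree plot itself.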

**[8] (5 pts)** Use the function provided below to make a biplot from the principal components and loadings.      

Answer the following question:            
Which of the original variables are most strongly associated with PC1?

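# This function plots a biplot given principal components, loadings, and variable labels.
import numpy as np
import matplotlib.pyplot as plt

def biplot(PCs, coef, labels=None):
    plt.figure(figsize=(10, 10))
    xs = PCs[:, 0]  # PC1 (change indices for different PCs)
    ys = PCs[:, 1]  # PC2
    coef = np.transpose(coef)  # after transposing: one row per original variable
    n = coef.shape[0]
    # Fall back to generic names so that labels=None does not raise a TypeError.
    if labels is None:
        labels = ["Var" + str(i + 1) for i in range(n)]
    # Rescale the scores to a unit range so they share axes with the loading arrows.
    scalex = 1.0 / (xs.max() - xs.min())
    scaley = 1.0 / (ys.max() - ys.min())
    plt.scatter(xs * scalex, ys * scaley,
                s=15,
                color='red')

    # Draw one arrow per original variable and label its tip.
    for i in range(n):
        plt.arrow(0, 0, coef[i, 0],
                  coef[i, 1], color='purple',
                  alpha=0.5)
        plt.text(coef[i, 0] * 1.15,
                 coef[i, 1] * 1.15,
                 labels[i],
                 color='darkblue',
                 ha='center',
                 va='center')

    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.title('Biplot')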

**[9] (5 pts)** Use `imshow` to display the loadings matrix. Be sure to correctly label the axes and use a diverging colorbar between -1 and 1.      

Answer the following question:      
Describe how the loading matrix might have looked different if we did not have such strong correlations between most of the raw data variables.
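
As a rough sketch of the plotting mechanics, the snippet below displays a loadings matrix with `imshow` and a diverging colormap fixed to the range -1 to 1. It assumes the loadings are taken directly from `pca.components_` (some texts additionally scale them by the square root of the explained variance), and the data and feature names `f1` to `f4` are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
names = ["f1", "f2", "f3", "f4"]  # hypothetical feature names

# Synthetic data with one correlated pair of features (illustrative only).
data = rng.normal(size=(300, 4))
data[:, 1] += data[:, 0]
data = (data - data.mean(axis=0)) / data.std(axis=0)  # z-score each column

pca = PCA().fit(data)
loadings = pca.components_.T  # rows: original variables, columns: PCs

fig, ax = plt.subplots(figsize=(5, 4))
# vmin/vmax pin the diverging colormap so that zero sits at its neutral midpoint.
image = ax.imshow(loadings, cmap="RdBu_r", vmin=-1, vmax=1)
ax.set_xticks(range(loadings.shape[1]))
ax.set_xticklabels(["PC" + str(i + 1) for i in range(loadings.shape[1])])
ax.set_yticks(range(len(names)))
ax.set_yticklabels(names)
ax.set_xlabel("Principal component")
ax.set_ylabel("Original variable")
fig.colorbar(image, ax=ax, label="Loading")
```

Because each principal axis returned by scikit-learn has unit length, every entry of `loadings` lies between -1 and 1, which is why fixed color limits of -1 and 1 work here.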