# Python for STEM - Week 2 (Advanced)  

## Day 3 - Part 1: Unsupervised learning - clustering

In this notebook, we focus on unsupervised machine learning. More specifically, we will do clustering with scikit-learn, one of the machine learning packages in Python. Before we start, we import all the packages that we need for this notebook.

All the machine learning functions we will use on Day 3 and Day 4 come from [scikit-learn](https://scikit-learn.org/stable/index.html). Its user guide gives detailed descriptions of the machine learning models included in the package, along with many examples. It is a good place to start when you want to adopt machine learning for your own research or work.




In [None]:
## In this cell, we import all the packages needed for this notebook
import numpy as np                  ## packages for data handling
import pandas as pd                 
from scipy.spatial.distance import cdist 
import matplotlib.pyplot as plt     ## packages for visualization 
import seaborn as sbn 


## Data ingest  
In this notebook, we will use a dataset from geography/remote sensing. The data includes 1875 data points (locations) in western North Carolina (the Asheville region). Each row describes one point: its latitude/longitude, its land cover type (forest, crop, urban, or water), and the surface reflectance in six channels from the OLI sensor onboard Landsat-8, the USGS/NASA land resources satellite. The reflectance data characterize the land surface as seen by the satellite sensor, which allows geographers to understand how the surface changes through time. To find more information about the data and the satellite, you can visit the USGS website about [Landsat-8 OLI data](https://www.usgs.gov/land-resources/nli/landsat/landsat-8?qt-science_support_page_related_con=0#qt-science_support_page_related_con).

In [None]:
## First we use pandas to read in the comma separated values (CSV) file
datafile = 'https://raw.githubusercontent.com/geo-yrao/STEM_Python_Course/geo-yrao-patch-1/02_Week2/Data/03_land_use_land_cover_asheville.csv'
AVLData = pd.read_csv(datafile, index_col=None)

## We check the first five rows of the data to get an initial understanding of it
print( AVLData.head() )

    Latitude  Longitude  Class   B1   B2   B3   B4    B5    B6
0  35.514769 -82.680451      0  127  150  550  226  3609  1441
1  35.753979 -82.520432      0   81  115  426  170  2913  1110
2  35.710635 -82.305661      0  156  220  538  477  2492  2077
3  35.512814 -82.413861      0  245  280  663  507  2732  1531
4  35.520636 -82.853181      0  148  181  534  265  3320  1457


In this dataset, **Class** refers to the land cover type, and **B1** ~ **B6** are the surface reflectances of the six OLI channels. The table below explains what each class code represents.

| Class No. | Land Cover Type |
|-:|-:|
|0|Forest| 
|1|Crop|
|2|Development/Urban|
|3|Water|

Additionally, the following table gives a quick explanation of the six OLI channels.  

| Channel No. | Channel Name | Wavelength |
|-:|-:|:-:|
|B1|Coastal/Aerosol|0.433 – 0.453 μm| 
|B2|Blue|0.450 – 0.515 μm|
|B3|Green|0.525 – 0.600 μm|
|B4|Red|0.630 – 0.680 μm|
|B5|Near Infrared|0.845 – 0.885 μm|
|B6|Short Wavelength Infrared|1.560 – 1.660 μm|  

Typically, a reflectance value lies between 0 and 1 and describes the fraction of light reflected by the surface. The reflectance values in our data are scaled to lie between 0 and 10000. You can convert them back to regular reflectance by multiplying by 0.0001.   
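As a minimal sketch of this conversion, the snippet below rebuilds the five rows printed above as a small data frame and rescales them. The names `sample`, `scale_factor`, and `reflectance` are illustrative and not part of the original notebook.

```python
import pandas as pd

# The five rows printed by AVLData.head() above, band columns only
sample = pd.DataFrame({
    'B1': [127, 81, 156, 245, 148],
    'B2': [150, 115, 220, 280, 181],
    'B3': [550, 426, 538, 663, 534],
    'B4': [226, 170, 477, 507, 265],
    'B5': [3609, 2913, 2492, 2732, 3320],
    'B6': [1441, 1110, 2077, 1531, 1457],
})

scale_factor = 0.0001  # scaled integer value -> reflectance between 0 and 1
reflectance = sample * scale_factor
print(reflectance.round(4))
```

The same multiplication should work on the full table of band columns, since pandas applies it element by element.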


In this notebook, we are doing clustering, so we set the **Class** information aside: the goal is to infer how many clusters the data contain from the reflectances alone. So let's assume that we do not have the land cover information.

In [None]:
## Now we have a simple matrix with six columns (attributes)
## X is a new pandas DataFrame with only the six channel reflectances
X = AVLData.iloc[:,3:]
print( X.head() )
## We now look at the pairwise scatter plots between these six channels,
## using the seaborn.pairplot function to show them in one big plot.
## We use the "alpha" keyword to make the dots semi-transparent, since
## many of the data points overlap.
sbn.pairplot(X, diag_kind = 'kde',
             plot_kws = {'alpha': 0.5, 's': 30, 'edgecolor': 'k'},
             height = 2)

## Feature transformation/extraction

As we can see from the pairwise scatter plots, several of the six channels are strongly correlated with one another. Can we find more useful features based on these six original channels? 

Feature transformation/extraction reduces the dimensionality of the data while retaining most of its variance. Principal Component Analysis (PCA) is one such technique.
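The sketch below shows one way such a transformation could look with scikit-learn's `PCA`. It runs on synthetic data rather than the Asheville table, and the names `X_demo`, `pca_demo`, and the column labels `PC1` to `PC3` are assumptions made for illustration. Keeping three components is also an assumption, chosen because the notebook later looks at the first three.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in for the six-band table X (NOT the Asheville data):
# six correlated columns built from two hidden factors plus a little noise
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
bands = factors @ mixing + 0.1 * rng.normal(size=(200, 6))
X_demo = pd.DataFrame(bands, columns=['B1', 'B2', 'B3', 'B4', 'B5', 'B6'])

# Fit PCA and project the data onto the first three principal components
pca = PCA(n_components=3)
scores = pca.fit_transform(X_demo)   # ndarray of shape (200, 3)

# Wrap the ndarray in a data frame with readable column names
pca_demo = pd.DataFrame(scores, columns=['PC1', 'PC2', 'PC3'])
print(pca_demo.head())
```

Unlike the original bands, the resulting components are uncorrelated with each other, which is what makes them convenient inputs for clustering.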

In [None]:
## Using PCA function from scikit-learn to perform feature transformation

## The outcome of PCA is an ndarray here

In [None]:

## We convert the ndarray to a pandas DataFrame with specified column names
## for PCA



In [None]:
## We now want to know how much variance of the data is explained by each
## of the principal components.
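A possible way to inspect this is the `explained_variance_ratio_` attribute of a fitted `PCA` object. The sketch below again uses synthetic data (the array `X_demo` is an assumption, not the notebook's data) and keeps all six components so the ratios sum to one.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the six-band data (NOT the Asheville data)
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X_demo = factors @ mixing + 0.1 * rng.normal(size=(200, 6))

# Keep all six components so the ratios add up to 1
pca = PCA(n_components=6).fit(X_demo)

ratio = pca.explained_variance_ratio_   # fraction of variance per component
cumulative = np.cumsum(ratio)           # running total
for i, (r, c) in enumerate(zip(ratio, cumulative), start=1):
    print(f'PC{i}: {r:.3f} of variance, {c:.3f} cumulative')
```

The cumulative column is a common guide for deciding how many components to keep: one keeps components until the running total is judged high enough for the task.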


In [None]:
## We can now visualize the first three components via the pairwise scatter 
## plot that we have done earlier.


## k-Means clustering



In [None]:
## Generate cluster results using KMeans in scikit-learn
from sklearn.cluster import KMeans  ## package for k-means clustering
from sklearn import metrics 
kmeanModel = KMeans(n_clusters=8, init='k-means++', n_init=10,
                    random_state=42) # random_state makes the result reproducible
kmeanModel.fit(PCA_df_X)
## Now we have a cluster model and the cluster label for each data point
clusterLabel = kmeanModel.labels_
## Print out the unique values of our cluster label
print ( np.unique(clusterLabel) )

How do we know which k value is best? Let's try different k values and use the "elbow rule" to find a reasonable k for our clustering task. Basically, we repeat the clustering for each candidate k and evaluate the result by how far the samples lie from their assigned cluster centers. The "elbow" is the k beyond which adding more clusters yields only a small further improvement.
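The loop described here could be sketched as follows. This is an illustration under stated assumptions, not the notebook's own solution: it uses synthetic blobs from `make_blobs` in place of the PCA-transformed data, the names `X_demo`, `k_values`, and `model` are hypothetical, and distortion is taken as the average squared distance to the nearest cluster center.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the PCA-transformed data: 4 blobs in 3 dimensions
X_demo, _ = make_blobs(n_samples=400, centers=4, n_features=3,
                       cluster_std=1.0, random_state=42)

k_values = range(1, 10)
distortions = []   # average squared distance to the nearest cluster center
inertias = []      # sum of squared distances to the nearest cluster center

for k in k_values:
    model = KMeans(n_clusters=k, init='k-means++', n_init=10,
                   random_state=42).fit(X_demo)
    # Distance from every sample to every center, keeping only the nearest
    nearest = cdist(X_demo, model.cluster_centers_, 'euclidean').min(axis=1)
    distortions.append((nearest ** 2).mean())
    inertias.append(model.inertia_)

for k, d, i in zip(k_values, distortions, inertias):
    print(f'k={k}: distortion={d:.3f}, inertia={i:.1f}')
```

With this definition the distortion is simply the inertia divided by the number of samples, so both curves bend at the same k. Plotting either list against `k_values` gives the elbow plot.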

In [None]:
## Average of the squared distances from the cluster centers of the respective clusters
distortions = [] 
## Sum of squared distances of samples to their closest cluster center
inertias = [] 
## Candidate number of clusters (k)
Kval = range(1,10) 
  
for k in Kval: 
    #Building and fitting the model 
   
    ## Calculate distortions using Euclidean distance with PCA-transformed data 
    
    ## Calculate inertias (part of the KMeans output)
    

In [None]:
## Create the Elbow plot between distortions and different K values


In [None]:
## Create the Elbow plot between inertias and different K values


In [None]:
## Fit final K-means clustering model with k = 4

## Create the final cluster labels for the data
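One way this final step might look is sketched below on synthetic data. The variables `X_demo`, `final_model`, and `final_labels` are hypothetical names, and the blobs stand in for the PCA-transformed reflectances.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the PCA-transformed data (NOT the Asheville data)
X_demo, _ = make_blobs(n_samples=400, centers=4, n_features=3,
                       cluster_std=1.0, random_state=42)

# Fit k-means with k = 4 and get one cluster label per sample
final_model = KMeans(n_clusters=4, init='k-means++', n_init=10,
                     random_state=42)
final_labels = final_model.fit_predict(X_demo)

# Count how many samples fall into each cluster
labels, counts = np.unique(final_labels, return_counts=True)
for label, count in zip(labels, counts):
    print(f'cluster {label}: {count} points')
```

Note that the cluster numbers are arbitrary: cluster 0 from k-means has no built-in relationship to any particular land cover class.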



In [None]:
# We are now visualizing our cluster result in our feature space
# Each cluster is represented using different colors
# Define a color for each cluster label


# Initialize a new matplotlib figure and its axes object.


# Creating the scatter plot with multiple clusters
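A rough sketch of such a plot is given below. It is only one possible layout: the data are synthetic, the color list and the names `X_demo`, `cluster_labels`, and `colors` are assumptions, and only the first two feature dimensions are drawn.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the PCA-transformed data (NOT the Asheville data)
X_demo, _ = make_blobs(n_samples=400, centers=4, n_features=3,
                       cluster_std=1.0, random_state=42)
cluster_labels = KMeans(n_clusters=4, n_init=10,
                        random_state=42).fit_predict(X_demo)

# One color per cluster label (an arbitrary choice)
colors = ['tab:green', 'tab:orange', 'tab:red', 'tab:blue']

# Initialize a new matplotlib figure and its axes object
fig, ax = plt.subplots(figsize=(6, 5))

# Draw each cluster separately so it gets its own color and legend entry
for cluster_id, color in enumerate(colors):
    members = cluster_labels == cluster_id
    ax.scatter(X_demo[members, 0], X_demo[members, 1], color=color,
               s=30, alpha=0.5, edgecolor='k', label=f'Cluster {cluster_id}')

ax.set_xlabel('First feature')
ax.set_ylabel('Second feature')
ax.legend()
fig.tight_layout()
```

Drawing one `scatter` call per cluster keeps the legend simple. A single call with a color array would also work but needs extra steps to build the legend.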
