<a href="https://colab.research.google.com/github/rg-smith/remote_sensing_course/blob/main/lectures/lecture9/remote_sensing_ML_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### In-class assignment: Machine Learning with Python

In this assignment, we will repeat some of the exercises from Lab 10, but in Python. With Python, you can inspect what the model is doing in more detail than the simplified version in ArcMap allows.

The study area is the same as in Lab 10: the flooded region surrounding Omaha, Nebraska following the Spring 2019 floods.



### Part 1: Run through the machine learning model with an existing shapefile

First, we will install the required packages, rasterio and geopandas, and clone the course repository, which contains the data files used below. Both steps are done outside of Python (the ! character runs a line from the command line).

In [None]:
!pip install rasterio geopandas
!git clone https://github.com/rg-smith/remote_sensing_course.git

With the required packages installed, we will now load them as well as other packages that are installed automatically with Google Colab.

In [None]:
import rasterio
import geopandas
import matplotlib.pyplot as plt
import numpy as np
from glob import glob
from rasterio.plot import show
from rasterio.mask import mask
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

def plot_raster_band(raster,band,ax=None):
  # Read the band once and stretch the colors between the 1st and 99th percentiles
  data=raster.read(band)
  vmin,vmax=np.percentile(data,1),np.percentile(data,99)
  if ax is None:
    plt.imshow(data,vmin=vmin,vmax=vmax);plt.colorbar()
    plt.xlabel('Easting, m')
    plt.ylabel('Northing, m')
    show(data,transform=raster.transform,vmin=vmin,vmax=vmax)
  else:
    # Draw on the axes that were passed in, so other layers can be overlaid
    show(data,transform=raster.transform,vmin=vmin,vmax=vmax,ax=ax)

Now, we will load in the raster. We'll use the python package 'rasterio'.

In [None]:
plt.rcParams['figure.figsize'] = [12, 8]
plt.rcParams['figure.dpi'] = 100

r=rasterio.open('remote_sensing_course/lectures/lecture9/20190316_compressed_100m.tif')

Next, we'll plot the raster. Plot different bands by replacing '<enter band number>' with a number, 1 through 8.

In [None]:
band = <enter band number> 
plot_raster_band(r,band)
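
The plotting helper sets its color limits from the 1st and 99th percentiles of the band rather than from the minimum and maximum. The sketch below illustrates why, using a synthetic array (the values and the name `band_values` are made up for illustration and are not from the course raster): a single extreme pixel would otherwise dominate the color scale.

```python
import numpy as np

# Synthetic "band": mostly moderate values plus one extreme outlier
rng = np.random.default_rng(0)
band_values = rng.normal(loc=100.0, scale=10.0, size=(50, 50))
band_values[0, 0] = 10000.0  # one very bright pixel

# Color limits taken from the 1st and 99th percentiles
vmin = np.percentile(band_values, 1)
vmax = np.percentile(band_values, 99)

# Clipping mimics what the color scale does to out-of-range values
stretched = np.clip(band_values, vmin, vmax)

print("raw max:", band_values.max())
print("stretched range:", stretched.min(), stretched.max())
```

With the raw maximum as the upper limit, almost every pixel would map to the same color; with the percentile limits, the bulk of the data spans the full color range.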

Now we will plot a histogram, which shows how many pixels fall into each range (bin) of digital number (DN) values.

In [None]:
rasterio.plot.show_hist(r,bins=20)
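
To make the idea of binning concrete, here is a small sketch that counts pixels per bin with NumPy. The 3x4 array `dn` is hypothetical and stands in for a single band; it is not taken from the course raster.

```python
import numpy as np

# Hypothetical digital numbers (DN) for a tiny 3x4 single-band image
dn = np.array([[1, 2, 2, 3],
               [3, 3, 4, 4],
               [4, 4, 9, 10]])

# Count pixels in 3 equal-width bins spanning the data range (1 to 10)
counts, bin_edges = np.histogram(dn, bins=3)

print(counts)     # pixels per bin
print(bin_edges)  # the boundaries between bins
```

Each bar in the histogram plot corresponds to one entry of `counts`, and the counts always sum to the number of pixels.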

Now we will load some training data. This is a shapefile similar to the one you created in Lab 10.

In [None]:
shapefile_name = 'remote_sensing_course/lectures/lecture9/training_data.shp'
shp=geopandas.read_file(shapefile_name)
print(shp)

Now plot this shapefile over a few different bands of the Landsat 8 image. As before, replace '<enter band number here>' with a band number.

In [None]:
fig,ax=plt.subplots()

band = <enter band number here>

plot_raster_band(r,band,ax)
shp.plot(ax=ax,facecolor='none',edgecolor='red')

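Now, we will prepare the training data by putting it in a format that is easy for Python to build a model with: a table with one row per pixel, one column per band, and a final column holding the class label. We then hold out about 25% of the rows for validation.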

In [None]:
classes = np.unique(shp['Classname'])
classvals = np.unique(shp['Classvalue'])
print(classes)

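for kk in range(len(classes)):
  class_ = classes[kk]
  shp_filt=shp['Classname']==class_ # select the polygons belonging to this class
  print(shp[shp_filt])
  # keep only the pixels inside those polygons; everything else is set to 0
  r_mask,gtr=mask(r,shp['geometry'][shp_filt],crop=True,nodata=0)
  if kk==0:
    dat_train=r_mask.reshape(8,-1) # flatten to (bands, pixels)
    filt=dat_train[0,:]>0 # pixels inside the polygons have a non-zero band 1
    dat_train=dat_train[:,filt].transpose() # one row per pixel, one column per band
    dat_train=np.hstack((dat_train,kk*np.ones((dat_train.shape[0],1)))) # append the class index as a label column
  else:
    # same steps for the remaining classes, stacked below the rows collected so far
    dat=r_mask.reshape(8,-1)
    filt=dat[0,:]>0
    dat=dat[:,filt].transpose()
    dat=np.hstack((dat,kk*np.ones((dat.shape[0],1))))
    dat_train=np.vstack((dat_train,dat))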

np.random.seed(0)
filt = np.random.uniform(size=(dat_train.shape[0]))>0.25 # make a filter that randomly selects 75% of the data

dat_validate = dat_train[filt==0,:]
dat_train = dat_train[filt,:]

dat_full = r.read().reshape(8,-1).transpose()
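
The reshaping steps in the cell above can be hard to picture, so here is a minimal sketch on a tiny made-up array. The names `r_mask_demo`, `labeled`, `train_demo` and `validate_demo` are hypothetical, and the demo uses 3 bands instead of 8 to keep the output readable.

```python
import numpy as np

# Hypothetical masked raster: 3 bands, 2 rows, 3 columns.
# Zeros mark pixels that fall outside the training polygon.
r_mask_demo = np.array([
    [[0, 5, 6], [0, 7, 8]],      # band 1
    [[0, 15, 16], [0, 17, 18]],  # band 2
    [[0, 25, 26], [0, 27, 28]],  # band 3
])
n_bands = r_mask_demo.shape[0]

# Flatten the two spatial dimensions: (bands, rows, cols) -> (bands, pixels)
flat = r_mask_demo.reshape(n_bands, -1)

# Keep pixels whose first band is non-zero, then transpose to (pixels, bands)
inside = flat[0, :] > 0
pixels = flat[:, inside].transpose()

# Append the class index as a final column (class 2, chosen arbitrarily)
class_index = 2
labeled = np.hstack((pixels, class_index * np.ones((pixels.shape[0], 1))))
print(labeled)

# Random split with a boolean filter: roughly 75% training, 25% validation
np.random.seed(0)
keep = np.random.uniform(size=labeled.shape[0]) > 0.25
train_demo = labeled[keep, :]
validate_demo = labeled[~keep, :]
```

Note that the split is only approximately 75/25, because each row is assigned independently at random. Also, filtering on a non-zero first band would drop any genuine pixel whose band 1 value happens to be 0.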

Now that the data is nicely formatted, we can run our model.

In [None]:
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(dat_train[:,0:8],dat_train[:,8])

landcover_prediction = clf.predict(dat_full)

landcover_prediction_rast = landcover_prediction.reshape(r.read().shape[1],r.read().shape[2])
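
As a rough illustration of the fit, predict and reshape pattern used above, the sketch below trains the same kind of classifier on synthetic pixels and also prints `feature_importances_`, a scikit-learn attribute that indicates which bands the forest relied on. All data and names here (`clf_demo`, `image_pixels`, `prediction_grid`) are invented for the example; the two classes differ only in the first band.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic training pixels with 3 "bands".
# Class 0 is dark in band 1, class 1 is bright in band 1; bands 2 and 3 carry no signal.
n = 200
X0 = np.column_stack((rng.normal(10, 1, n), rng.normal(50, 5, n), rng.normal(50, 5, n)))
X1 = np.column_stack((rng.normal(30, 1, n), rng.normal(50, 5, n), rng.normal(50, 5, n)))
X = np.vstack((X0, X1))
y = np.concatenate((np.zeros(n), np.ones(n)))

clf_demo = RandomForestClassifier(max_depth=2, random_state=0)
clf_demo.fit(X, y)

# A synthetic 4x5 "image" stored as (pixels, bands): top half dark, bottom half bright
rows, cols = 4, 5
image_pixels = np.column_stack((
    np.repeat([10.0, 30.0], rows * cols // 2),
    np.full(rows * cols, 50.0),
    np.full(rows * cols, 50.0),
))

# Predict one label per pixel, then restore the 2-D grid
prediction_grid = clf_demo.predict(image_pixels).reshape(rows, cols)

print(prediction_grid)
print(clf_demo.feature_importances_)
```

In this toy case the importance of the first band should dominate, since it is the only band that separates the classes. Running the same attribute on `clf` from the cell above would show which of the 8 bands matter most for the real data.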

Let's plot the data and see how they look:

In [None]:
fig,ax=plt.subplots()

from matplotlib.colors import from_levels_and_colors
cmap, norm = from_levels_and_colors([-0.5,0.5,1.5,2.5,3.5],['green','white','yellow','blue'])

cax = ax.imshow(landcover_prediction_rast,cmap=cmap,norm=norm);
cbar = fig.colorbar(cax,ticks=[0,1,2,3])
cbar.ax.set_yticklabels(classes)  # vertically oriented colorbar

Now let's see how well we predict the different land cover classes. We'll predict land cover on our validation dataset, which was held out of the training. Then, we'll make what is called a confusion matrix. A confusion matrix tabulates the actual land classification against the predicted land classification: rows are true labels, and columns are predicted labels. If the model is perfect, it will only have non-zero numbers on the diagonal, meaning all cropland is predicted as cropland, etc., like this:

```
          cropland  snow/ice  urban  water
cropland      1000         0      0      0
snow/ice         0       500      0      0
urban            0         0    300      0
water            0         0      0    200
```
Any non-zero number in a cell that is not on the diagonal means there has been a misclassification. We don't expect to have a perfect model, so some misclassification is OK, but if most pixels are correctly classified, the model is doing well.
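
To show how such a table is filled in, here is a minimal sketch that builds a confusion matrix by hand with NumPy for eight made-up validation pixels. The labels and the name `C_demo` are hypothetical; with the same inputs, scikit-learn's `confusion_matrix` follows the same row and column convention.

```python
import numpy as np

# Hypothetical labels for 8 validation pixels (0 = cropland, 1 = urban, 2 = water)
true_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2])
predicted_labels = np.array([0, 0, 1, 0, 0, 1, 2, 2])

n_classes = 3
C_demo = np.zeros((n_classes, n_classes), dtype=int)

# Add one to cell (true label, predicted label) for every pixel
np.add.at(C_demo, (true_labels, predicted_labels), 1)

print(C_demo)
print("correctly classified:", np.trace(C_demo), "of", C_demo.sum())
```

The diagonal holds the correctly classified pixels, so its sum divided by the total gives the overall accuracy.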

In [None]:
landcover_prediction_validate = clf.predict(dat_validate[:,0:8])
import pandas as pd

C = confusion_matrix(dat_validate[:,8],landcover_prediction_validate)
C_df = pd.DataFrame(C,columns = classes, index = classes)
print(C_df)

Let's look at the top row, showing predictions where the true land type is cropland. When this is the case, cropland is predicted 731 times, snow/ice 0 times, urban 14 times and water 0 times. So our model is pretty good at predicting cropland when the true label is cropland.

Now look at the row that says urban. When the true label is urban, cropland is predicted 175 times, snow/ice 0 times, urban 71 times and water 0 times. So our model is good at classifying cropland as cropland, but it often misclassifies urban as cropland too. This is probably because many urban areas have significant vegetation and thus look similar to cropland. Once you identify problems like this, you can create new training data over the 'problem' areas, so the model can learn from them.
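
The two rows discussed above can be turned into a single number per class: the fraction of true pixels that were labeled correctly (often called recall). The sketch below uses only the counts quoted in the text; the variable names are made up for illustration.

```python
import numpy as np

# Rows quoted in the text (columns: cropland, snow/ice, urban, water)
cropland_row = np.array([731, 0, 14, 0])
urban_row = np.array([175, 0, 71, 0])

# Correct predictions sit on the diagonal: column 0 for cropland, column 2 for urban
cropland_recall = cropland_row[0] / cropland_row.sum()
urban_recall = urban_row[2] / urban_row.sum()

print(round(cropland_recall, 3))  # 0.981
print(round(urban_recall, 3))     # 0.289
```

So about 98% of the true cropland pixels are classified correctly, but only about 29% of the true urban pixels are.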

### Part 2: Repeat this exercise, but with your own shapefile

Now we'll do the same thing with the shapefile that you created in Lab 10. Find the shapefile, which consists of multiple files with the same name but different extensions. Then click on the 'Files' tab on the left side of this window.

Drag all of these files into the 'Files' tab in Google Colab. Then set the shapefile_name variable below to the name of your shapefile, in quotes, including the .shp extension.

In [None]:
shapefile_name = 'replace this with your shapefile name'
shp=geopandas.read_file(shapefile_name)
print(shp)

fig,ax=plt.subplots()

band = <enter band number here>

plot_raster_band(r,band,ax)
shp.plot(ax=ax,facecolor='none',edgecolor='red')

In [None]:
classes = np.unique(shp['Classname'])
classvals = np.unique(shp['Classvalue'])
print(classes)

for kk in range(len(classes)):
  class_ = classes[kk]
  shp_filt=shp['Classname']==class_
  print(shp[shp_filt])
  r_mask,gtr=mask(r,shp['geometry'][shp_filt],crop=True,nodata=0)
  if kk==0:
    dat_train=r_mask.reshape(8,-1)
    filt=dat_train[0,:]>0
    dat_train=dat_train[:,filt].transpose()
    dat_train=np.hstack((dat_train,kk*np.ones((dat_train.shape[0],1))))
  else:
    dat=r_mask.reshape(8,-1)
    filt=dat[0,:]>0
    dat=dat[:,filt].transpose()
    dat=np.hstack((dat,kk*np.ones((dat.shape[0],1))))
    dat_train=np.vstack((dat_train,dat))

filt = np.random.uniform(size=(dat_train.shape[0]))>0.25 # make a filter that randomly selects 75% of the data

dat_validate = dat_train[filt==0,:]
dat_train = dat_train[filt,:]

dat_full = r.read().reshape(8,-1).transpose()

In [None]:
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(dat_train[:,0:8],dat_train[:,8])

landcover_prediction = clf.predict(dat_full)

landcover_prediction_rast = landcover_prediction.reshape(r.read().shape[1],r.read().shape[2])

In [None]:
fig,ax=plt.subplots()

from matplotlib.colors import from_levels_and_colors
vals = np.linspace(-0.5,len(classes)-0.5,len(classes)+1)
colors = [ "red", "blue", "green", "yellow", "purple", "orange", "white", "black" ]
cmap, norm = from_levels_and_colors(vals,colors[0:len(classes)])

cax = ax.imshow(landcover_prediction_rast,cmap=cmap,norm=norm);
cbar = fig.colorbar(cax,ticks=np.arange(len(classes))) # one tick per class, so the ticks match the labels set below
cbar.ax.set_yticklabels(classes)  

In [None]:
landcover_prediction_validate = clf.predict(dat_validate[:,0:8])

C = confusion_matrix(dat_validate[:,8],landcover_prediction_validate)
C_df = pd.DataFrame(C,columns = classes, index = classes)
print(C_df)