<a href="https://colab.research.google.com/github/williamlidberg/Analyses-of-Environmental-Data-2/blob/main/modules/module_7/Assignment_7_machine_learning_on_vector_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine learning
The core idea of machine learning is to program a system using data instead of code. There are several types of machine learning, suited to different kinds of data such as images, text or numbers, but there are two main subdomains: traditional machine learning and deep learning. Machine learning was introduced around 1940 and deep learning around 1963, but it was mainly deep learning that exploded in the 2000s. Deep learning is great, but it is not always the best solution, especially when working with tabular data (tables with rows and columns). Therefore, this module introduces you to traditional machine learning on geopandas dataframes.

## Random forest
Random forest is a very robust machine learning method which I both love and hate. I love it because it is so easy to use and always provides a good baseline. I hate it because it is a bit boring.

Random forest can be used for both classification and regression and works by building many decision trees on randomly selected parts of your dataframe. Many decision trees make a decision forest, hence the name random forest.
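To make the idea concrete, here is a minimal sketch of a hand-made forest. It assumes synthetic stand-in data from `make_classification` (not the wetland data), and all names ending in `_demo` are made up for illustration. It only shows the principle of random row samples plus voting; sklearn's own random forest combines the trees by averaging their predicted probabilities rather than by a hard vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the wetland data):
# 300 rows, 4 numeric features, 2 classes labelled 0 and 1.
X_demo, y_demo = make_classification(n_samples=300, n_features=4, n_informative=3,
                                     n_redundant=0, random_state=0)

rng = np.random.default_rng(0)
trees_demo = []
for i in range(25):
    # Each tree sees a bootstrap sample: rows drawn at random with replacement.
    rows = rng.integers(0, len(X_demo), size=len(X_demo))
    # max_features="sqrt" makes every split consider a random subset of the columns.
    tree_demo = DecisionTreeClassifier(max_depth=3, max_features="sqrt", random_state=i)
    trees_demo.append(tree_demo.fit(X_demo[rows], y_demo[rows]))

# Every tree votes and the forest returns the most common class per row.
votes_demo = np.stack([t.predict(X_demo) for t in trees_demo])  # shape (n_trees, n_rows)
forest_pred_demo = (votes_demo.mean(axis=0) > 0.5).astype(int)  # majority vote for classes 0/1
forest_accuracy_demo = (forest_pred_demo == y_demo).mean()
print("Accuracy of the hand-made forest on its own data:", forest_accuracy_demo)
```

An odd number of trees is used so that the vote between two classes can never end in a tie.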

In [None]:
!pip install geopandas

# Acquire some data

As always we need to get our hands on some data. We will start by reusing some data sources from previous modules. In our first example we will use Wetlands from Uppsala.


### Download wetlands
This will give us three shapefiles and a pdf describing the data. We are mainly interested in VMI_C_Objekt_KLAR_2022 and VMI_C_Vatten_TOT_2022. The first contains the wetland polygons and the second contains polygons of water bodies within each wetland. VMI_C_Objekt_KLAR_2022 also contains our target variable which is the nature value classification of each wetland. We want to build a model that can predict the nature value of wetlands based on other data.

In [None]:
from urllib.request import urlretrieve
url = ('https://geodata.naturvardsverket.se/nedladdning/VMI/C_Uppsala/5_GIS_skikt/VMI_C_GIS_2022.zip')
filename = '/content/VMI_C_GIS_2022.zip' # you need to adjust this path on your own computer if you are using anaconda.
urlretrieve(url, filename)
!unzip -o /content/VMI_C_GIS_2022.zip -d /content/wetlands

### Download ditches
These are the same ditches as before. They were detected using deep learning on airborne laser data.

In [None]:
url = ('https://geodata.naturvardsverket.se/nedladdning/Diken/Diken_Sverige/Diken_lansvis/Diken_C.zip')
filename = '/content/Diken_C.zip' # you need to adjust this path on your own computer if you are using anaconda.
urlretrieve(url, filename)
!unzip -o '/content/Diken_C.zip' -d /content/ditches

### Download valued forest areas
These are key habitats that were judged to be extra valuable during field visits.

In [None]:
url = ('https://geodata.naturvardsverket.se/nedladdning/skogliga_vardekarnor_2016.zip')
filename = '/content/skogliga_vardekarnor_2016.zip' # you need to adjust this path on your own computer if you are using anaconda.
urlretrieve(url, filename)
!unzip -o '/content/skogliga_vardekarnor_2016.zip' -d /content/valued_forest

### Download waterbodies as polygons
These are polygons of both lakes and rivers.

In [None]:
from urllib.request import urlretrieve
url = ('https://opendata-download.smhi.se/svar/Vattenytor_2016.zip')
filename = '/content/Vattenytor_2016.zip' # you need to adjust this path on your own computer if you are using anaconda.
urlretrieve(url, filename)
!unzip -o '/content/Vattenytor_2016.zip' -d /content/waterbodies



### Read all data into geodataframes
Due to the large amounts of data this can take a while (~10 min).

In [None]:
import geopandas as gpd

wetlands = gpd.read_file('/content/wetlands/VMI_C_Objekt_KLAR_2022.shp', crs='EPSG:3006') # These contain the target variable.
waterbodies = gpd.read_file('/content/wetlands/VMI_C_Vatten_TOT_2022.shp')
waterbodies = waterbodies.to_crs('EPSG:3006')
ditches = gpd.read_file('/content/ditches/Diken_C.gpkg', crs='EPSG:3006') # Note that this is a geopackage
ditches= ditches.to_crs('EPSG:3006')
forest = gpd.read_file('/content/valued_forest/Skogliga_vardekarnor.shp', crs='EPSG:3006')

lakes = gpd.read_file('/content/waterbodies/Vattenytor_2016.shp', crs='EPSG:3006')

You can inspect the size of a geodataframe with the info() method.

In [None]:
ditches.info()

## Combine the data
The next step is to combine the data into a single dataframe using a series of joins and spatial relates from geopandas. We want a final dataframe containing the following:


*   The length of ditches within each wetland
*   Wetland area
*   Area of wetland waterbodies
*   Distance to nearest lake or river
*   Distance to valuable forests



In [None]:
wetlands # It's the column Nv_klass_G that we want to predict.

Start by dropping the columns we don't need, to make the process easier to follow.

In [None]:
columns_to_keep = ['N1_LOID', 'Nv_klass_G', 'geometry']
wetlands = wetlands[columns_to_keep] # Drop columns not in columns_to_keep
wetlands

Intersect the ditch lines with the wetland polygons and calculate the length of ditch channels within each wetland.
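As a toy illustration of what the intersection measures, the sketch below uses shapely (the geometry library that geopandas builds on). The square "wetland" and the straight "ditch" are hypothetical geometries in a metric coordinate system, not taken from the real data.

```python
from shapely.geometry import LineString, box

# Hypothetical toy geometries; the units are metres, as in EPSG:3006.
wetland_demo = box(0, 0, 10, 10)               # a 10 m x 10 m square "wetland"
ditch_demo = LineString([(-5, 5), (15, 5)])    # a 20 m long ditch crossing the square

# Only the part of the line that falls inside the polygon is kept.
ditch_inside_demo = ditch_demo.intersection(wetland_demo)
print("Total ditch length:", ditch_demo.length)
print("Ditch length inside the wetland:", ditch_inside_demo.length)
```

Half of the 20 m ditch lies inside the square, so the clipped length is 10 m. The overlay in the next cell does the same thing for every ditch and wetland at once, and the groupby then sums the clipped lengths per wetland.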

In [None]:
intersect = gpd.overlay(ditches, wetlands, how='intersection')
intersect['ditch_length'] = intersect.geometry.length # Calculate length of intersecting ditch lines
wetland_ditch_length = intersect.groupby('N1_LOID')['ditch_length'].sum().reset_index() # Aggregate lengths by wetland polygon
wetlands_with_ditch_length = wetlands.merge(wetland_ditch_length, on='N1_LOID', how='left').fillna({'ditch_length': 0})
wetlands_with_ditch_length

Intersect the wetland waterbodies polygons with the wetland polygons and calculate the waterbody area within each wetland.

In [None]:
import geopandas as gpd

intersect = gpd.overlay(waterbodies, wetlands, how='intersection')
intersect['water_area'] = intersect.geometry.area # Calculate water area for intersecting waterbodies
area_waterbodies = intersect.groupby('N1_LOID')['water_area'].sum().reset_index()
wetlands_with_waterbodies = wetlands.merge(area_waterbodies, on='N1_LOID', how='left').fillna({'water_area': 0})
wetlands_with_waterbodies

For lakes and valuable forests we calculate the distance to the nearest feature instead of intersecting them with the wetlands. The code below measures this distance between centroids.
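A nearest-neighbour lookup of this kind can be sketched with a handful of made-up points. The coordinates below are hypothetical centroids in metres, and the `_demo` names are only for illustration. `cKDTree.query` accepts a whole array of points, so all wetlands can be looked up in one call.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical centroid coordinates (x, y) in metres; not taken from the real data.
lake_xy_demo = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 200.0]])
wetland_xy_demo = np.array([[3.0, 4.0], [100.0, 50.0]])

lake_index_demo = cKDTree(lake_xy_demo)   # spatial index over the lake points
# One query for all wetlands: distance to, and row number of, the closest lake point.
distances_demo, nearest_demo = lake_index_demo.query(wetland_xy_demo)
print(distances_demo)
print(nearest_demo)
```

Because only centroids are compared, a wetland on the shore of a large lake can still get a large distance. That is a simplification to keep in mind when interpreting this attribute.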

Start with lakes

In [None]:
import geopandas as gpd
from scipy.spatial import cKDTree

wetlands_with_lake_distance = wetlands.copy()
lake_spatial_index = cKDTree(lakes.centroid.apply(lambda x: (x.x, x.y)).tolist()) # Create a spatial index for the lake centroids

# Function to calculate the nearest distance
def nearest_distance(point):
    distance, idx = lake_spatial_index.query((point.x, point.y))
    return distance

wetlands_with_lake_distance['nearest_lake_distance'] = wetlands_with_lake_distance.centroid.apply(nearest_distance) # Apply the function to each wetland centroid to get the nearest distance to a lake
wetlands_with_lake_distance

Then do the same with valuable forests

In [None]:
import geopandas as gpd
from scipy.spatial import cKDTree

wetlands_with_forest_distance = wetlands.copy()
forest_spatial_index = cKDTree(forest.centroid.apply(lambda x: (x.x, x.y)).tolist()) # Create a spatial index for the forest centroids

# Function to calculate the nearest distance
def nearest_forest_distance(point):
    distance, idx = forest_spatial_index.query((point.x, point.y))
    return distance

wetlands_with_forest_distance['nearest_forest_distance'] = wetlands_with_forest_distance.centroid.apply(nearest_forest_distance) # Apply the function to each wetland centroid to get the distance to the nearest valuable forest
wetlands_with_forest_distance

Finally you need to join all geodataframes. The data can be joined based on the attribute 'N1_LOID'.

In [None]:
import pandas as pd

attribute_dataframes = [wetlands_with_ditch_length, wetlands_with_waterbodies, wetlands_with_lake_distance, wetlands_with_forest_distance]
merged_wetlands = wetlands.copy()

# Merge the dataframes one by one on the row index, which follows the same wetland order (N1_LOID) in every dataframe
for df in attribute_dataframes:
    cols_to_use = df.columns.difference(merged_wetlands.columns)
    merged_wetlands = pd.merge(merged_wetlands, df[cols_to_use], left_index=True, right_index=True, how='outer')
merged_wetlands

The final step is to drop the column for the ID and geometry so the machine learning model does not attempt to train on those attributes.

In [None]:
clean_data = merged_wetlands.drop(['geometry', 'N1_LOID'], axis=1)
clean_data = clean_data.reset_index(drop=True)
clean_data

# Machine learning
Now that we have a dataframe with wetlands and some attributes, we can use them to train a model that predicts the nature value of wetlands.

We will use [sklearn](https://scikit-learn.org/stable/) to build and test a basic random forest model. To evaluate whether the model is any good, we will split the data into 80% for training and 20% for testing.
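The split in the next cell also passes `stratify=y`. A minimal sketch of what that argument does, assuming made-up labels with 80 wetlands in class 3 and 20 in class 1 (the `_demo` names and the placeholder feature are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 wetlands in class 3 and 20 in class 1.
y_demo = np.array([3] * 80 + [1] * 20)
x_demo = np.arange(len(y_demo)).reshape(-1, 1)   # placeholder feature column

x_train_demo, x_test_demo, y_train_demo, y_test_demo = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

# With stratify both parts keep the class mix of the full data.
print("Share of class 1 in training data:", (y_train_demo == 1).mean())
print("Share of class 1 in test data:", (y_test_demo == 1).mean())
```

Without stratification a rare class can end up almost entirely in one of the two parts, which makes the test scores for that class unreliable.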

In [None]:
from sklearn.model_selection import train_test_split
y = clean_data.iloc[:,0] # This is Nv_klass_G
x = clean_data.iloc[:,1:] # These are all the other attributes (the predictor variables)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state=0, stratify = y) # splits the data into training and testing data. test_size=0.2 means that 20% of the wetlands will be set aside for testing.


## Decision tree
It is always good to start with a simple model, so we will build a decision tree using our training data and test it on our test data.

In [None]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=3) # max_depth limits how many levels of decisions the tree can contain
clf.fit(x_train, y_train)


We can also inspect the trained decision tree to see what's going on.

In [None]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=(20,10))
plot_tree(clf, feature_names=list(x.columns), class_names=[str(c) for c in clf.classes_], filled=True, fontsize=8) # names are passed as plain lists of strings
plt.show()


The first decision is on the length of ditch channels in a wetland.

The standard way to evaluate a machine learning model is to use it to predict the test data and then compare the prediction to the test labels.

In [None]:
from sklearn.metrics import accuracy_score
y_pred = clf.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

The accuracy is between 0 and 1, where 1 means 100% and 0.1 means 10% accurate. In other words, a model with an accuracy of 0.1 predicts the right nature value class 10% of the time. For more detailed information we can use a classification report to see how the model performs on the different classes. The classification report produces values for precision, recall, f1-score and support. Let's break those down a bit, using class 1 (Mkt högt naturvärde/very high nature value) as an example.


*   Precision = The share of wetlands predicted as class 1 that actually were in class 1.
*   Recall = The share of all wetlands in class 1 that the model predicted as class 1.
*   f1-score = A combination (the harmonic mean) of precision and recall. If your model has high precision but low recall, or vice versa, the f1-score will be low. A higher number is better.
*   support = How many samples were in that class in the test data.
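The four numbers can be worked out by hand for a small case. The sketch below assumes ten made-up wetlands with hypothetical true and predicted classes (not real model output) and computes the values for class 1 in plain Python.

```python
# Hypothetical true and predicted classes for ten wetlands (not real results).
y_true_demo = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
y_pred_demo = [1, 1, 2, 3, 1, 2, 2, 3, 3, 4]

target = 1
pairs = list(zip(y_true_demo, y_pred_demo))
tp = sum(t == target and p == target for t, p in pairs)  # class 1 predicted as class 1
fp = sum(t != target and p == target for t, p in pairs)  # other classes predicted as class 1
fn = sum(t == target and p != target for t, p in pairs)  # class 1 predicted as something else

precision_demo = tp / (tp + fp)
recall_demo = tp / (tp + fn)
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)
support_demo = y_true_demo.count(target)

print("precision:", round(precision_demo, 2))
print("recall:", round(recall_demo, 2))
print("f1-score:", round(f1_demo, 2))
print("support:", support_demo)
```

Here two of the three wetlands predicted as class 1 really are class 1 (precision 2/3), while only two of the four true class 1 wetlands were found (recall 0.5), which gives an f1-score of 4/7.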


This might be a lot to take in, so focus on the f1-score for now. We want as high an f1-score as possible. You can also run the prediction on the same data that the model was trained on and compare that with the prediction on the test data. If the difference is big, your model has overfitted to the training data.
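The gap between training and test accuracy is easiest to see on noisy data. The sketch below assumes synthetic stand-in data from `make_classification` where 20% of the labels are flipped at random (not the wetland data), and compares a shallow tree with a tree that is allowed to grow without limit. All `_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with label noise (flip_y); an assumption for illustration only.
X_demo, y_demo = make_classification(n_samples=500, n_features=4, n_informative=2,
                                     n_redundant=0, flip_y=0.2, random_state=0)
Xtr_demo, Xte_demo, ytr_demo, yte_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

scores_demo = {}
for depth in (3, None):   # None lets the tree grow until every leaf is pure
    model_demo = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model_demo.fit(Xtr_demo, ytr_demo)
    train_acc = model_demo.score(Xtr_demo, ytr_demo)
    test_acc = model_demo.score(Xte_demo, yte_demo)
    scores_demo[depth] = (train_acc, test_acc)
    print(f"max_depth={depth}: train={train_acc:.2f} test={test_acc:.2f}")
```

The unlimited tree memorises the training data, including the noisy labels, so its training accuracy is perfect while its test accuracy is clearly lower. That difference is the sign of overfitting described above.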


In [None]:
# test the model on its own training data
from sklearn.metrics import accuracy_score
y_pred = clf.predict(x_train)
accuracy = accuracy_score(y_train, y_pred)
print("Accuracy:", accuracy)

In [None]:
from sklearn.metrics import classification_report

y_pred_test = clf.predict(x_test)
print(classification_report(y_test, y_pred_test, zero_division=0))

Another way to dive deeper into the model output is to look at a confusion matrix, where each prediction is compared to the actual class of that wetland. Each square shows the number of wetlands with that combination of true and predicted class, so the correct predictions are on the diagonal.
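A small example makes the layout easier to read. It assumes ten made-up wetlands with hypothetical true and predicted classes (not real model output).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted classes for ten wetlands (not real results).
y_true_demo = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
y_pred_demo = [1, 1, 2, 3, 1, 2, 2, 3, 3, 4]

# Rows are the true classes and columns are the predicted classes, in the order of labels.
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=[1, 2, 3, 4])
print(cm_demo)

correct_demo = np.trace(cm_demo)                 # diagonal cells: prediction equals true class
accuracy_demo = correct_demo / cm_demo.sum()     # the accuracy is the diagonal divided by the total
print("Correct predictions:", correct_demo, "Accuracy:", accuracy_demo)
```

The first row reads: of the four wetlands that truly are class 1, two were predicted as class 1, one as class 2 and one as class 3. Everything off the diagonal is a misclassification.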


In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

y_pred = clf.predict(x_test)
cm = confusion_matrix(y_test, y_pred)

# Visualize the confusion matrix
# Get the class names
class_names = ['1. Mkt högt naturvärde', '2. Högt naturvärde ', '3. Visst naturvärde', '4. Lågt naturvärde']  # Replace with your class names

# Visualize the confusion matrix with class names
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=class_names, yticklabels=class_names)
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix')
plt.show()


### Task 1
Change the tree depth to see if that improves the accuracy of the model. Inspect the f1-score and confusion matrix.

## Decision forest
Now that you have seen a decision tree, it's time to build a decision forest. It is the same thing but with more trees working together. The parameter n_estimators in the code below determines the number of trees in the forest. In this case it will train an ensemble of 200 trees.

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators=200, max_depth=3) # create a random forest model
rf_clf.fit(x_train, y_train) # train the model

In [None]:
y_pred = rf_clf.predict(x_test)
accuracy_score(y_test, y_pred)

We can also inspect the decision trees in a random forest, but if the forest contains a lot of trees this can be a bit tedious.

In [None]:
from sklearn.tree import export_graphviz
import pydotplus
from IPython.display import Image

tree_index = 0  # Change this to select a different tree. Inspect the second tree in the forest by changing 0 to 1.
tree = rf_clf.estimators_[tree_index]

dot_data = export_graphviz(tree, out_file=None,
                           feature_names=list(x.columns),
                           class_names=[str(c) for c in rf_clf.classes_], # class labels from the forest itself, as strings
                           filled=True, rounded=True)

# Create a Graphviz object
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())

With random forest we can also get an idea of which attributes are important for the model using feature importance. Keep in mind that machine learning is a bit of a black box, so to see why a feature is important you would have to inspect all the decision trees in the random forest model.
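To get a feeling for what the scores mean, the sketch below assumes synthetic data where the class depends on the first of three random features only (not the wetland data; the `_demo` names are hypothetical). A forest trained on it should give that feature most of the importance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))            # three hypothetical features
y_demo = (X_demo[:, 0] > 0).astype(int)       # the class depends on the first feature only

rf_demo = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
rf_demo.fit(X_demo, y_demo)

importances_demo = rf_demo.feature_importances_
print("Importance per feature:", importances_demo)
print("Sum of importances:", importances_demo.sum())   # the scores are normalised to sum to 1
```

The scores are relative: they say how much each feature contributed to the splits in this particular forest, not whether the relationship is positive or negative.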

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Get feature importances from the trained model
feature_importance = rf_clf.feature_importances_
feature_names = x.columns
indices = feature_importance.argsort()[::-1]

# create a dataframe for nicer plot
importance_df = pd.DataFrame({'Feature': feature_names[indices],
                              'Importance Score': feature_importance[indices]})

# higher values are more important for the model
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance Score', y='Feature', data=importance_df, hue='Feature', palette='viridis', legend=False)
plt.title("Feature Importances")
plt.xlabel("Importance Score")
plt.ylabel("Features")
plt.show()

### Task 2
Train a random forest model with more trees and see if that improves the result. Can you think of a reason to not build as many trees as possible?


# Implement your model
You have now trained a model and evaluated the performance of the model. The next step is to combine the prediction with the original dataframe, in this case the geodataframe with all the combined data, "merged_wetlands". Note that we want to keep the geometry and ID so we can save the result as a geopackage or shapefile.

In [None]:
subset_df = merged_wetlands[x_train.columns] # Subset the DataFrame to include only the columns used for training
final_df = merged_wetlands.copy() # Copy the original geodataframe so the geometry and ID are kept
final_df['Prediction'] = rf_clf.predict(subset_df) # Run the prediction and store it as a new column
final_df.to_file('/content/predicted_wetlands.gpkg', driver='GPKG') # Save the result as a geopackage

### Visualize the result

In [None]:
import geopandas as gpd
import matplotlib.pyplot as plt

# 'final_df' is the GeoDataFrame; the columns 'Nv_klass_G' and 'Prediction' are used for coloring

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 7))

# Plot the original classes (colored by 'Nv_klass_G') on the first axis
final_df.plot(column='Nv_klass_G', cmap='viridis', ax=ax1, legend=True)
ax1.set_title('Original nature value')

# Plot the predictions (colored by 'Prediction') on the second axis
final_df.plot(column='Prediction', cmap='viridis', ax=ax2, legend=True)
ax2.set_title('Predicted nature value')



plt.tight_layout()
plt.show()


### Task 3
Start from the dataframe merged_wetlands and calculate wetland area and perimeter. Then divide area by perimeter to get a measure of wetland shape complexity, and finally train a new random forest model with the three new variables included.

## Neural networks

Neural networks are designed to mimic the behavior of the human brain. A neural network consists of interconnected nodes called neurons, organized into layers. Each neuron receives input, processes it, and passes the output to other neurons.

*    Input layer: This layer receives the initial data or features that are fed into the neural network. Each neuron in the input layer represents a feature of the input data.

*    Hidden layers: These layers sit between the input and output layers. They perform computations on the input data using weighted connections between neurons. Hidden layers enable the neural network to learn complex patterns and relationships within the data.

*    Output layer: This layer produces the final output of the neural network. The number of neurons in the output layer depends on the task the network is designed for. For example, in a classification task, each neuron in the output layer may represent a different class, in our case the wetland nature value classes.
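The three layers can be sketched as a few lines of numpy. This is a minimal illustration with random, untrained weights and one made-up wetland; the weights, the input values and the choice of ReLU and softmax are assumptions for the sketch, not something read from a trained model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical, untrained weights for a network with 4 inputs, 5 hidden neurons and 4 classes.
W1_demo, b1_demo = rng.normal(size=(4, 5)), np.zeros(5)
W2_demo, b2_demo = rng.normal(size=(5, 4)), np.zeros(4)

features_demo = np.array([0.2, 1.5, 0.0, 0.7])   # input layer: one made-up wetland, four attributes

# Hidden layer: a weighted sum of the inputs followed by a ReLU activation.
hidden_demo = np.maximum(0, features_demo @ W1_demo + b1_demo)

# Output layer: one score per class, turned into probabilities with softmax.
scores_out_demo = hidden_demo @ W2_demo + b2_demo
probabilities_demo = np.exp(scores_out_demo - scores_out_demo.max())
probabilities_demo /= probabilities_demo.sum()

print("Class probabilities:", probabilities_demo)
print("Predicted class index:", probabilities_demo.argmax())
```

Training a network means adjusting the weights and biases so that these probabilities agree with the known classes in the training data.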

We can build simple neural networks using sklearn. Let's start by building a neural network with one hidden layer with five neurons.

In [None]:
from sklearn.neural_network import MLPClassifier

# define and train the model. hidden_layer_sizes=(5,) means one hidden layer with 5 neurons (note the comma, which makes it a tuple).
nnet = MLPClassifier(solver='lbfgs', max_iter=1000, hidden_layer_sizes=(5,), random_state=1)
nnet.fit(x_train, y_train)

# evaluate the trained model on test data
y_pred = nnet.predict(x_test)
accuracy_score(y_test, y_pred)

To better understand what this looks like, we can install a small package for visualization.

In [None]:
!pip install nnv

Let's plot the neural network we just trained. It has four input nodes, one for each attribute, and four output nodes, one for each wetland class. It has only one hidden layer with five nodes. A network with several hidden layers is usually called a deep neural network, and that is where the term deep learning comes from.

In [None]:
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (5,5)
from nnv import NNV

layersList = [
    {"title":"input\n", "units": 4},
    {"title":"hidden 1\n", "units": 5}, # This is the hidden layer
    {"title":"output\n", "units": 4,},
]

NNV(layersList, max_num_nodes_visible=10,font_size=11).render()

Let's build a deep neural network using two hidden layers.

In [None]:
nnet = MLPClassifier(solver='lbfgs', max_iter=1000, hidden_layer_sizes=(5,5), random_state=1)
nnet.fit(x_train, y_train)

# evaluate the trained model on test data
y_pred = nnet.predict(x_test)
accuracy_score(y_test, y_pred)

This is what the new neural network looks like:

In [None]:
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (5,5)
from nnv import NNV

layersList = [
    {"title":"input\n", "units": 4},
    {"title":"hidden 1\n", "units": 5},
    {"title":"hidden 2\n", "units": 5},
    {"title":"output\n", "units": 4,},
]

NNV(layersList, max_num_nodes_visible=10,font_size=11).render()

### Task 4
Our dataset is not optimal for neural networks since it is quite limited and we have not rescaled any of our attributes. Play around with the number of neurons and hidden layers and see if you can get better accuracy than with the basic decision tree model.
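Rescaling can be added without changing the rest of the workflow. The sketch below is one possible approach, assuming synthetic stand-in data where one column is multiplied by 10 000 to mimic a distance in metres (not the wetland data; all `_demo` names are hypothetical). It chains sklearn's `StandardScaler` and `MLPClassifier` in a pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one column is blown up to mimic distances in metres.
X_demo, y_demo = make_classification(n_samples=400, n_features=4, n_informative=3,
                                     n_redundant=0, random_state=0)
X_demo[:, 0] *= 10_000
Xtr_demo, Xte_demo, ytr_demo, yte_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

# The scaler is fitted on the training data only and then reused on the test data.
scaled_nnet_demo = make_pipeline(
    StandardScaler(),
    MLPClassifier(solver="lbfgs", max_iter=1000, hidden_layer_sizes=(5,), random_state=1),
)
scaled_nnet_demo.fit(Xtr_demo, ytr_demo)
test_accuracy_demo = scaled_nnet_demo.score(Xte_demo, yte_demo)
print("Test accuracy with rescaled inputs:", test_accuracy_demo)
```

Putting the scaler inside the pipeline keeps information from the test data out of the training step, since the mean and standard deviation are learned from the training rows only.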

# Conclusion
Machine learning is a great tool to quickly build predictive models. With vector data we can calculate spatial relations such as distance and area using geopandas. Since machine learning can readily be applied to attribute tables, we can combine geopandas with machine learning for predictions.

One strength of decision trees and random forest is that they do not expect or require the data to be normally distributed. You can even mix classified data such as land use with continuous values such as distances.

We will work more with deep learning in module nine, but generally a random forest is much easier to work with and often produces similar results on tabular data.

Look into the sklearn documentation for more inspiration on other machine learning methods: https://scikit-learn.org/stable/user_guide.html
