## 1. Machine Learning

## 1.1 Preparing the notebook

Press *play* in the following cell to install some libraries needed to view the maps. After installation, restart the runtime (from the toolbar *Runtime* -> *Restart runtime*) and continue with the next cells.

In [None]:
! apt-get install libgeos-3.5.0
! apt-get install libgeos-dev
! pip install https://github.com/matplotlib/basemap/archive/master.zip

Press *play* in the following cell to import the datasets from the GitHub repository.

In [None]:
! git clone https://github.com/vitoreno/StelleDataset.git
! unzip /content/StelleDataset/data.zip

Press *play* in the following cell to import the libraries needed to run the notebook.

In [None]:
%load_ext google.colab.data_table
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import sys
from datetime import datetime
from sklearn import svm
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans

## 1.2 SVM

We want to predict soil moisture along the coast from the sea surface temperature at the nearest points, using support vector regression (SVR), the regression variant of SVM.
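As a minimal sketch of the idea, the snippet below fits an SVR to synthetic data rather than to the notebook's dataset. The linear relation between temperature and soil moisture, the noise level, and the names `sst_demo`, `sm_demo`, `svr_demo`, and `svr_pred` are assumptions made for illustration only. The `epsilon` value is also an assumption: it is set well below the spread of the synthetic target so that the model has to follow the trend.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic stand-in data (hypothetical, NOT taken from soil_moisture_2016.csv):
# soil moisture that decreases roughly linearly with sea surface temperature.
rng = np.random.default_rng(0)
sst_demo = np.linspace(10.0, 28.0, 60).reshape(-1, 1)  # shape (n_samples, 1)
sm_demo = 0.40 - 0.01 * sst_demo.ravel() + rng.normal(0.0, 0.005, size=60)

# RBF-kernel SVR; epsilon is the half-width of the tube inside which
# training errors are not penalised.
svr_demo = SVR(kernel="rbf", C=1.0, epsilon=0.01)
svr_demo.fit(sst_demo, sm_demo)

# Predict soil moisture for two temperatures inside the training range.
svr_pred = svr_demo.predict(np.array([[12.0], [26.0]]))
print(svr_pred)
```

The single feature is reshaped to a column with `reshape(-1, 1)` because scikit-learn estimators expect a 2-D array of shape `(n_samples, n_features)`.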

Select the regions to train on and the regions to evaluate on a given date. Valid dates range from 2016-01-01 to 2016-12-31. Then press *play* to run the cell.

In [None]:
#@markdown Training
train_Adriatic = True #@param {type:"boolean"}
train_Ionian = True #@param {type:"boolean"}
train_Tyrrhenian = False #@param {type:"boolean"}
train_Labrador = False #@param {type:"boolean"}
train_Red = False #@param {type:"boolean"}
#@markdown Test
date_str = '2016-01-01' #@param {type:"date"}
test_Adriatic = False #@param {type:"boolean"}
test_Ionian = False #@param {type:"boolean"}
test_Tyrrhenian = True #@param {type:"boolean"}
test_Labrador = False #@param {type:"boolean"}
test_Red = False #@param {type:"boolean"}

train_list = []
test_list = []
if train_Adriatic:
  train_list = train_list + ["Adriatic"]
if train_Ionian:
  train_list = train_list + ["Ionian"]
if train_Tyrrhenian:
  train_list = train_list + ["Tyrrhenian"]
if train_Labrador:
  train_list = train_list + ["Labrador"]
if train_Red:
  train_list = train_list + ["Red"]
if test_Adriatic:
  test_list = test_list + ["Adriatic"]
if test_Ionian:
  test_list = test_list + ["Ionian"]
if test_Tyrrhenian:
  test_list = test_list + ["Tyrrhenian"]
if test_Labrador:
  test_list = test_list + ["Labrador"]
if test_Red:
  test_list = test_list + ["Red"]

current_date = datetime.strptime(date_str + " 12:00:00", '%Y-%m-%d %H:%M:%S')

if (current_date < datetime.strptime("2016-01-01 12:00:00", '%Y-%m-%d %H:%M:%S')) or (current_date > datetime.strptime("2016-12-31 12:00:00", '%Y-%m-%d %H:%M:%S')):
  sys.exit("Invalid date. Enter a date between 2016-01-01 and 2016-12-31")

data = pd.read_csv("/content/soil_moisture_2016.csv")
data.time = pd.to_datetime(data.time)
train_data = data.loc[data['sea'].isin(train_list)]
test_data = data.loc[((data['sea'].isin(test_list)) & (data['time'] == current_date))]

train_sst = train_data.sst.to_numpy().reshape(-1, 1)
train_sm = train_data.sm.to_numpy()
test_sst = test_data.sst.to_numpy().reshape(-1, 1)
test_sm = test_data.sm.to_numpy()

regression = svm.SVR()

regression.fit(train_sst,train_sm)

prediction = regression.predict(test_sst)

for i in range(test_sm.shape[0]):
  plt.plot([test_sst[i],test_sst[i]], [test_sm[i],prediction[i]], '--b')
plt.scatter(test_sst, test_sm, color='black', label='Observation')
plt.scatter(test_sst, prediction, color='blue', label='Prediction')
plt.xlabel('Temperature')
plt.ylabel('Soil moisture')
plt.legend()
plt.show()

results = pd.DataFrame({"Observed soil moisture": test_sm, "Predicted soil moisture": prediction, "Error": np.abs(test_sm - prediction)})
print("Mean Squared Error: ", mean_squared_error(test_sm, prediction))
results
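
The cell reports two different error measures: the `Error` column holds the absolute error of each point, while `mean_squared_error` averages the squared errors over all points. A small worked example, using made-up values (the arrays `observed` and `predicted` are hypothetical), shows how the two relate:

```python
import numpy as np

# Hypothetical values, chosen only to make the arithmetic easy to follow.
observed = np.array([0.20, 0.25, 0.30])
predicted = np.array([0.22, 0.24, 0.33])

# Per-point absolute error, as stored in the "Error" column.
abs_error = np.abs(observed - predicted)

# Mean squared error: the average of the squared differences.
mse = np.mean((observed - predicted) ** 2)

# Root mean squared error, expressed in the same units as the target.
rmse = np.sqrt(mse)

print(abs_error, mse, rmse)
```

Because the MSE is in squared units, its square root is usually easier to compare against the per-point errors in the table.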

## 1.3 Random Forest

We want to predict soil moisture along the coast from the sea surface temperature at the nearest points, this time using a random forest regressor.
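The following sketch fits a `RandomForestRegressor` to synthetic data, not to the notebook's dataset. The data-generating relation, the hyperparameters `n_estimators=50` and `random_state=0`, and the names `sst_demo`, `sm_demo`, `forest`, and `forest_pred` are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (hypothetical, NOT taken from soil_moisture_2016.csv).
rng = np.random.default_rng(0)
sst_demo = rng.uniform(10.0, 28.0, size=200).reshape(-1, 1)
sm_demo = 0.40 - 0.01 * sst_demo.ravel() + rng.normal(0.0, 0.005, size=200)

# A fixed random_state makes the bootstrap samples, and so the fit, repeatable.
forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(sst_demo, sm_demo)

# Each tree predicts the mean target of a leaf and the forest averages the
# trees, so predictions stay within the range of the training targets.
forest_pred = forest.predict(np.array([[12.0], [26.0]]))
print(forest_pred)
```

One consequence of this averaging is that a random forest does not extrapolate: for temperatures outside the training range it returns values close to those at the nearest edge of that range.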

Select the regions to train on and the regions to evaluate on a given date. Valid dates range from 2016-01-01 to 2016-12-31. Then press *play* to run the cell.

In [None]:
#@markdown Training
train_Adriatic = True #@param {type:"boolean"}
train_Ionian = True #@param {type:"boolean"}
train_Tyrrhenian = False #@param {type:"boolean"}
train_Labrador = False #@param {type:"boolean"}
train_Red = False #@param {type:"boolean"}
#@markdown Test
date_str = '2016-01-01' #@param {type:"date"}
test_Adriatic = False #@param {type:"boolean"}
test_Ionian = False #@param {type:"boolean"}
test_Tyrrhenian = True #@param {type:"boolean"}
test_Labrador = False #@param {type:"boolean"}
test_Red = False #@param {type:"boolean"}

train_list = []
test_list = []
if train_Adriatic:
  train_list = train_list + ["Adriatic"]
if train_Ionian:
  train_list = train_list + ["Ionian"]
if train_Tyrrhenian:
  train_list = train_list + ["Tyrrhenian"]
if train_Labrador:
  train_list = train_list + ["Labrador"]
if train_Red:
  train_list = train_list + ["Red"]
if test_Adriatic:
  test_list = test_list + ["Adriatic"]
if test_Ionian:
  test_list = test_list + ["Ionian"]
if test_Tyrrhenian:
  test_list = test_list + ["Tyrrhenian"]
if test_Labrador:
  test_list = test_list + ["Labrador"]
if test_Red:
  test_list = test_list + ["Red"]

current_date = datetime.strptime(date_str + " 12:00:00", '%Y-%m-%d %H:%M:%S')

if (current_date < datetime.strptime("2016-01-01 12:00:00", '%Y-%m-%d %H:%M:%S')) or (current_date > datetime.strptime("2016-12-31 12:00:00", '%Y-%m-%d %H:%M:%S')):
  sys.exit("Invalid date. Enter a date between 2016-01-01 and 2016-12-31")

data = pd.read_csv("/content/soil_moisture_2016.csv")
data.time = pd.to_datetime(data.time)
train_data = data.loc[data['sea'].isin(train_list)]
test_data = data.loc[((data['sea'].isin(test_list)) & (data['time'] == current_date))]

train_sst = train_data.sst.to_numpy().reshape(-1, 1)
train_sm = train_data.sm.to_numpy()
test_sst = test_data.sst.to_numpy().reshape(-1, 1)
test_sm = test_data.sm.to_numpy()

regression = RandomForestRegressor()

regression.fit(train_sst,train_sm)

prediction = regression.predict(test_sst)

for i in range(test_sm.shape[0]):
  plt.plot([test_sst[i],test_sst[i]], [test_sm[i],prediction[i]], '--b')
plt.scatter(test_sst, test_sm, color='black', label='Observation')
plt.scatter(test_sst, prediction, color='blue', label='Prediction')
plt.xlabel('Temperature')
plt.ylabel('Soil moisture')
plt.legend()
plt.show()

results = pd.DataFrame({"Observed soil moisture": test_sm, "Predicted soil moisture": prediction, "Error": np.abs(test_sm - prediction)})
print("Mean Squared Error: ", mean_squared_error(test_sm, prediction))
results

## 1.4 Clustering

We want to group Mediterranean sea surface temperatures into classes without using labels (unsupervised classification), using agglomerative hierarchical clustering.
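As a small illustration, the sketch below applies `AgglomerativeClustering` to six made-up temperature values. The values and the names `sst_demo`, `agg`, and `agg_labels` are hypothetical and are not read from the dataset.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical temperatures (degrees Celsius) forming two well-separated groups.
sst_demo = np.array([14.1, 14.3, 14.0, 19.8, 20.1, 20.0]).reshape(-1, 1)

# Bottom-up clustering: every point starts as its own cluster and the closest
# clusters are merged until n_clusters remain (Ward linkage by default).
agg = AgglomerativeClustering(n_clusters=2)
agg_labels = agg.fit_predict(sst_demo)
print(agg_labels)
```

The integer labels only identify cluster membership. Which group receives label 0 is arbitrary, so a label does not by itself say whether a cluster is colder or warmer.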

Select a date between 2014-01-01 and 2016-12-31 and the desired number of clusters, then press *play*.

In [None]:
date_str = '2014-01-01' #@param {type:"date"}
n_cluster = 2 #@param {type:"slider", min:2, max:10, step:1}

current_date = datetime.strptime(date_str + " 12:00:00", '%Y-%m-%d %H:%M:%S')

if (current_date < datetime.strptime("2014-01-01 12:00:00", '%Y-%m-%d %H:%M:%S')) or (current_date > datetime.strptime("2016-12-31 12:00:00", '%Y-%m-%d %H:%M:%S')):
  sys.exit("Invalid date. Enter a date between 2014-01-01 and 2016-12-31")

data = pd.read_csv("/content/mediterranean_surface_temperature_2014_15_16.csv")
data.time = pd.to_datetime(data.time)

current_data = data.loc[data.time == current_date]
lat = current_data.lat.to_numpy()
lon = current_data.lon.to_numpy()
sst = current_data.sst.to_numpy().reshape(-1, 1)

clustering = AgglomerativeClustering(n_clusters = n_cluster).fit(sst)

fig = plt.figure(figsize=(10, 8))
m = Basemap(projection='lcc', resolution='c',
            width=1.5E6, height=1.5E6, 
            lat_0=42, lon_0=14)
m.shadedrelief(scale=0.5)
m.scatter(lon, lat, latlon=True, c=clustering.labels_,
          cmap='Reds', marker ='+', edgecolors='none', alpha=0.7)

## 1.5 K-Means

We want to perform the same unsupervised classification of Mediterranean sea surface temperatures, this time using K-Means.
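The sketch below runs `KMeans` on six made-up temperature values and then renumbers the clusters from coldest to warmest. The values, the explicit `n_init=10` and `random_state=0` settings, and the names `sst_demo`, `km`, `order`, `rank`, and `ordered_labels` are assumptions for illustration. The renumbering step is an optional addition that the notebook cell does not perform.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical temperatures (degrees Celsius) forming two well-separated groups.
sst_demo = np.array([14.1, 14.3, 14.0, 19.8, 20.1, 20.0]).reshape(-1, 1)

# K-Means alternates between assigning points to the nearest centroid and
# moving each centroid to the mean of its points. n_init restarts guard
# against a poor initialisation; random_state makes the run repeatable.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(sst_demo)

# Renumber clusters by centroid temperature: 0 = coldest, 1 = warmest.
order = np.argsort(km.cluster_centers_.ravel())  # cluster ids, cold to warm
rank = np.empty_like(order)
rank[order] = np.arange(order.size)              # cluster id -> temperature rank
ordered_labels = rank[km.labels_]
print(ordered_labels)
```

Ordering the labels this way can make a sequential colormap such as `Reds` easier to read on the map, since darker shades then correspond to warmer clusters.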

Select a date between 2014-01-01 and 2016-12-31 and the desired number of clusters, then press *play*.

In [None]:
date_str = '2014-01-01' #@param {type:"date"}
n_cluster = 2 #@param {type:"slider", min:2, max:10, step:1}

current_date = datetime.strptime(date_str + " 12:00:00", '%Y-%m-%d %H:%M:%S')

if (current_date < datetime.strptime("2014-01-01 12:00:00", '%Y-%m-%d %H:%M:%S')) or (current_date > datetime.strptime("2016-12-31 12:00:00", '%Y-%m-%d %H:%M:%S')):
  sys.exit("Invalid date. Enter a date between 2014-01-01 and 2016-12-31")

data = pd.read_csv("/content/mediterranean_surface_temperature_2014_15_16.csv")
data.time = pd.to_datetime(data.time)

current_data = data.loc[data.time == current_date]
lat = current_data.lat.to_numpy()
lon = current_data.lon.to_numpy()
sst = current_data.sst.to_numpy().reshape(-1, 1)

kmeans = KMeans(n_clusters = n_cluster).fit(sst)

fig = plt.figure(figsize=(10, 8))
m = Basemap(projection='lcc', resolution='c',
            width=1.5E6, height=1.5E6, 
            lat_0=42, lon_0=14)
m.shadedrelief(scale=0.5)
m.scatter(lon, lat, latlon=True, c=kmeans.labels_,
          cmap='Reds', marker ='+', edgecolors='none', alpha=0.7)