# Objectives


* To explore an existing dataset
> This week, we'll use a subset of the UK Met dataset. You can read more about the UK Met dataset here: https://rmets.onlinelibrary.wiley.com/doi/10.1002/gdj3.78. We will use the 60km-resolution data for 2010 to 2022.

* To apply the support vector machine (SVM) and logistic regression algorithms from the Week 3 lecture to predict the number of days of ground frost and snow from other weather variables.

# Section 1 - Explore the UK Met (60km, 2010-2022) dataset

See the dataset on the Week 3 page for the module, on Canvas (see 'Week 3 Lab Dataset' on the page). The file is named *curated_data_1month_2010-2022_nonans.csv*.
* What does each variable in the dataset represent?
* What is the distribution of the number of days of ground frost in the dataset? What about the number of days of snow?
* What does this tell you about the data?
* What else can you tell about the data?
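
One possible way to look at such a distribution is to count how many samples fall into each range of days and to compare the mean with the median. The sketch below is only an illustration: the helper `summarize_day_counts`, the toy values, and the bin edges are assumptions made for this example and do not come from the UK Met dataset.

```python
import numpy


def summarize_day_counts(day_counts, bin_edges):
    """Return (counts per bin, mean, median) for an array of day counts."""
    day_counts = numpy.asarray(day_counts, dtype=float)
    # numpy.histogram uses half-open bins [a, b), except the last bin, which is closed
    counts, _ = numpy.histogram(day_counts, bins=bin_edges)
    return counts, float(numpy.mean(day_counts)), float(numpy.median(day_counts))


# Toy values for illustration only; NOT taken from the UK Met dataset
toy_ground_frost_days = [0, 0, 1, 3, 5, 8, 12, 20, 25, 31]
# Hypothetical bin edges in days: [0, 5), [5, 10), [10, 20), [20, 31]
toy_bin_edges = [0, 5, 10, 20, 31]

counts, mean_days, median_days = summarize_day_counts(toy_ground_frost_days, toy_bin_edges)
print("Counts per bin: " + str(counts))
print("Mean: " + str(mean_days) + ", median: " + str(median_days))
```

Once the data is loaded in Section 2, the same helper could be called on `ground_frost_label` and `snow_label` instead of the toy list.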


# Section 2 - Load the dataset





1. Before you can get started, download the data from the Week 3 page for the module, on Canvas (see 'Week 3 Lab Dataset' on the page). The file you download will be named *curated_data_1month_2010-2022_nonans.csv*.

2. Then, use the file menu in Google Colab to upload the file to your Colab directory. Once the upload is complete, you should see the file in the listed contents of your Colab directory.

3. You can now run the code in the cell below to load the data.

In [None]:
import csv
import numpy


!ls  /content

data_file_full_path = "/content/curated_data_1month_2010-2022_nonans.csv"

data_as_list = []

# load the dataset
with open(data_file_full_path) as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')

    row_count = 0
    for row in csv_reader:

      if row_count > 0:
        data_as_list.append([float(val) for val in row])
      row_count += 1
data = numpy.array(data_as_list)

# check its shape
print("\n The dataset has shape: "+str(data.shape))


# get features and labels from the data
# based on the objectives (see the Objectives section)
feat_col = [5, 6, 7, 8, 9, 10, 11]
ground_frost_col = 4
snow_col = 12

feats = data[:, feat_col]
ground_frost_label = data[:, ground_frost_col]
snow_label = data[:, snow_col]


# take a peek
print("\n A peek at the dataset features: \n"+str(feats))
print("\n A peek at the ground frost labels: \n"+str(ground_frost_label))
print("\n A peek at the snow labels: \n"+str(snow_label))


# Section 3 - Split into training, validation, and test sets

In [None]:
from sklearn.model_selection import train_test_split

all_ids = numpy.arange(0, feats.shape[0])

random_seed = 1

# First randomly split the data into 70:30 to get the training set
train_set_ids, rem_set_ids = train_test_split(all_ids, test_size=0.3, train_size=0.7,
                                 random_state=random_seed, shuffle=True)


# Then further split the remaining data 50:50 into validation and test sets
val_set_ids, test_set_ids = train_test_split(rem_set_ids, test_size=0.5, train_size=0.5,
                                 random_state=random_seed, shuffle=True)


train_data = feats[train_set_ids, :]
train_ground_frost_labels = ground_frost_label[train_set_ids]
train_snow_labels = snow_label[train_set_ids]

val_data = feats[val_set_ids, :]
val_ground_frost_labels = ground_frost_label[val_set_ids]
val_snow_labels = snow_label[val_set_ids]

test_data = feats[test_set_ids, :]
test_ground_frost_labels = ground_frost_label[test_set_ids]
test_snow_labels = snow_label[test_set_ids]
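
As a quick sanity check on the two-stage split above, the sketch below repeats the same `train_test_split` calls on a toy array of 100 ids (an assumption for illustration, not the real dataset size) and confirms that the three subsets do not overlap and together cover every sample.

```python
import numpy
from sklearn.model_selection import train_test_split

# Toy ids for illustration only; the lab uses numpy.arange(0, feats.shape[0])
toy_ids = numpy.arange(0, 100)
toy_seed = 1

# Same two-stage split as in the cell above: 70:30, then 50:50 of the remainder
toy_train_ids, toy_rem_ids = train_test_split(toy_ids, test_size=0.3, train_size=0.7,
                                              random_state=toy_seed, shuffle=True)
toy_val_ids, toy_test_ids = train_test_split(toy_rem_ids, test_size=0.5, train_size=0.5,
                                             random_state=toy_seed, shuffle=True)

print("Subset sizes (train, val, test): "
      + str((len(toy_train_ids), len(toy_val_ids), len(toy_test_ids))))

# The three subsets should be disjoint and cover all ids
all_split_ids = numpy.concatenate([toy_train_ids, toy_val_ids, toy_test_ids])
print("No overlap and full coverage: "
      + str(sorted(all_split_ids.tolist()) == toy_ids.tolist()))
```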

# Section 4 - Train and evaluate an SVM regression model (with hyperparameter optimization)

In [None]:

from sklearn.svm import LinearSVR
from sklearn.metrics import mean_squared_error
import sys

#--- Use the validation set to optimize the box constraint hyperparameter ---
#--- based on grid search method ---

# set the range of hyperparameters to search from
c_options = [0.1, 1.0, 10.0]
# initialize the optimal box constraint value
best_c = 0.1
# initialize the performance of the optimal box constraint value
best_c_perf = sys.float_info.max

# for each box constraint in the set of values to search,
# train an SVM model and evaluate it on the validation set;
# if the validation MSE is lower than the current 'best_c_perf',
# set this box constraint as the current optimal
for c in c_options:
  #print("\n for c="+str(c)+"...")
  model_SVM = LinearSVR(C=c, random_state=random_seed, loss='squared_epsilon_insensitive')
  model_SVM.fit(train_data, train_ground_frost_labels)
  val_pred_SVM = model_SVM.predict(val_data)
  val_mse_SVM = mean_squared_error(val_ground_frost_labels, val_pred_SVM)

  if val_mse_SVM < best_c_perf:
    best_c = c
    best_c_perf = val_mse_SVM

print('\n The optimal c for this data is: '+str(best_c))


# use the optimized box constraint to train the final model
model_SVM = LinearSVR(C=best_c, random_state=random_seed, loss='squared_epsilon_insensitive')
model_SVM.fit(train_data, train_ground_frost_labels)

# evaluate the trained model using the test set
test_pred_SVM = model_SVM.predict(test_data)
mse_SVM = mean_squared_error(test_ground_frost_labels, test_pred_SVM)
print('\n The test mean squared error (MSE) is: '+str(mse_SVM))



# Section 5 - Train and evaluate with scaled features

* Read the StandardScaler documentation (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html). Following the documentation, compute standard-scaled features from *feats* in Section 2.

* Train and evaluate the SVM model with the scaled features.

* What differences do you notice in the feature distribution and the results?
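
A minimal sketch of standard scaling is shown below, using toy feature matrices that are made up for illustration and are not the UK Met data. It assumes one common choice, which is to fit the scaler on the training subset only and then apply the same transformation to the validation and test subsets; fitting on all of *feats* before splitting is the other obvious option, and comparing the two is a reasonable part of the exercise.

```python
import numpy
from sklearn.preprocessing import StandardScaler

# Toy feature matrices for illustration only (NOT the UK Met data);
# in the lab these would be train_data, val_data and test_data from Section 3
toy_train = numpy.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
toy_val = numpy.array([[2.5, 250.0]])
toy_test = numpy.array([[5.0, 500.0]])

scaler = StandardScaler()
# Assumption: learn the per-feature mean and standard deviation from the training set only
scaled_train = scaler.fit_transform(toy_train)
# Reuse the training statistics for the other subsets
scaled_val = scaler.transform(toy_val)
scaled_test = scaler.transform(toy_test)

print("Per-feature mean after scaling: " + str(scaled_train.mean(axis=0)))
print("Per-feature std after scaling: " + str(scaled_train.std(axis=0)))
print("Scaled validation sample: " + str(scaled_val))
```

After scaling, each training feature has mean 0 and standard deviation 1, so features measured on very different scales contribute comparably to the model.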


# Section 6 - Train and evaluate an LR classification model

* Use the information from Section 1 to split the range of ground frost label values into 4 classes.
* Apply this split to convert the ground frost labels from Section 2 into classification labels.
* Use the classification labels to train and evaluate a logistic regression model using the Scikit Learn library (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
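
The steps above could be sketched as follows. Everything in this example is an assumption made for illustration: the helper `to_classes`, the three thresholds, the toy day counts, and the single toy feature are all hypothetical, and the thresholds you choose should come from your own findings in Section 1.

```python
import numpy
from sklearn.linear_model import LogisticRegression


def to_classes(day_counts, thresholds):
    """Map each day count to a class index in 0..len(thresholds)."""
    # numpy.digitize returns i such that thresholds[i-1] <= x < thresholds[i]
    return numpy.digitize(day_counts, bins=thresholds)


# Hypothetical thresholds giving 4 classes: [0, 5), [5, 10), [10, 20), [20, ...)
toy_thresholds = [5, 10, 20]
# Toy day counts for illustration only; NOT taken from the UK Met dataset
toy_days = numpy.array([0, 4, 5, 9, 10, 19, 20, 31])
toy_classes = to_classes(toy_days, toy_thresholds)
print("Class labels: " + str(toy_classes))

# One made-up feature per sample, only to show the fit/predict calls
toy_feats = numpy.array([[9.0], [8.0], [6.5], [6.0], [4.0], [3.5], [1.0], [-1.0]])
model_LR = LogisticRegression(max_iter=1000)
model_LR.fit(toy_feats, toy_classes)

# Predicting on the training samples here only demonstrates the API;
# in the lab, predictions should be made on the held-out test set
toy_pred_LR = model_LR.predict(toy_feats)
print("Predicted classes: " + str(toy_pred_LR))
```

In the lab itself, the same conversion would be applied to the training and test ground frost labels; Section 7 expects the results under the names `test_ground_frost_labels_class` and `test_pred_LR`.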


# Section 7 - Evaluate using other classification metrics

In [None]:
from sklearn.metrics import f1_score, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

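# This cell assumes test_ground_frost_labels_class and test_pred_LR
# were created in Section 6
# The F1 score is similar to accuracy in that it ranges between 0 and 1
# We will look at this metric in Weeks 5-6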
avg_f1_score_LR = f1_score(test_ground_frost_labels_class, test_pred_LR, average='macro')
f1_scores_LR = f1_score(test_ground_frost_labels_class, test_pred_LR, average=None)
print('\n The F1 scores for each of the classes are: '+str(f1_scores_LR))
print('\n The average F1 score is: '+str(avg_f1_score_LR))
print()

# The confusion matrix shows how samples of each class were misclassified
# We will look at this metric in Weeks 5-6
confusion_matrix_LR = confusion_matrix(test_ground_frost_labels_class, test_pred_LR)
disp = ConfusionMatrixDisplay(confusion_matrix_LR)
disp.plot()
plt.show()
