# Phytosensing – Using Plants as Sensors
## Introduction
In this tutorial we will develop a toolchain to analyse and classify plant electrical signals in order to infer the external stimulus applied to the plant.
Because collecting the data would take too much time, we will use a real-world dataset that we collected earlier (in Lübeck).
I designed the practical so that you are guided step by step. I encourage those who are familiar with programming in Python (and the libraries used) to explore other possible approaches. At the end we will hold a small challenge where you can implement a classifier (and possibly a whole pipeline); the best-performing classifier wins.

## 1. Load the dataset (data acquisition)
Because we do not have to collect a dataset this step only consists of loading data, visualization, and cutting the data into the appropriate experiment intervals. Hence, we will do the following steps:

1. Load the dataset;
2. Visualize the dataset;
3. Cut the dataset into the appropriate length; and
4. Scale the data to real units.

<em>Hint: for the final challenge you are allowed to use all available data (not only the cut data)</em>

### Import necessary libraries
These libraries are necessary for the tutorial. You can add more libraries if you want to use them later.

In [None]:
# import the required libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import timedelta, datetime
from scipy.signal import welch
from scipy.signal import resample
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
from sklearn.preprocessing import MinMaxScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import tensorflow as tf
import keras
from keras.models import Sequential
from keras.layers import Dense, InputLayer
# from keras.optimizers import Adam
from keras.optimizers.legacy import Adam
from sklearn import preprocessing
from numpy import genfromtxt

### Loading the dataset
The data were collected using the Cybres MU measurement system. Hence, the files contain a lot of information that is not of interest to us. We will only use the following columns: `timestamp`, `differential_potential_CH1`, and `differential_potential_CH2`. As a first step, load the data into a pandas dataframe and print the first 5 rows.
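
As a rough sketch of this step, the snippet below reads a small in-memory CSV instead of the real file, so it runs without the dataset. The sample values and the extra `temperature` column are made up for illustration; only the three column names mentioned above come from the tutorial.

```python
import io

import pandas as pd

# Hypothetical excerpt standing in for a measurement file.
# The real export contains many more columns and rows.
raw_csv = io.StringIO(
    "timestamp,differential_potential_CH1,differential_potential_CH2,temperature\n"
    "2023-01-01 00:00:00,512100,511900,21.5\n"
    "2023-01-01 00:00:01,512120,511880,21.5\n"
    "2023-01-01 00:00:02,512090,511910,21.6\n"
    "2023-01-01 00:00:03,512110,511905,21.6\n"
    "2023-01-01 00:00:04,512130,511870,21.6\n"
    "2023-01-01 00:00:05,512105,511895,21.7\n"
)

# parse_dates converts the timestamp column to datetime while reading
data = pd.read_csv(raw_csv, parse_dates=["timestamp"])

# head() returns the first 5 rows by default
first_rows = data.head()
print(first_rows)
```

With the real data you would pass the path of the measurement file to `pd.read_csv` instead of the `io.StringIO` object.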

In [None]:
# load the data and print the first 5 rows


### Reduce the dataset to the necessary information
As you see, the raw data contain much information we do not need (in this tutorial). Hence, the next step is to drop all but the relevant columns.
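
One possible way to do this is to select the wanted columns by name rather than dropping the unwanted ones, which works regardless of what else the file contains. The sketch below uses a small made-up dataframe; the `temperature` column is a hypothetical stand-in for the extra information.

```python
import pandas as pd

relevant_columns = [
    "timestamp",
    "differential_potential_CH1",
    "differential_potential_CH2",
]

# Made-up raw data; rows 0 and 1 are discarded below so that the
# effect of resetting the index becomes visible.
raw = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01", periods=5, freq="s"),
    "differential_potential_CH1": [512100, 512050, 512075, 512200, 512150],
    "differential_potential_CH2": [511900, 511950, 511925, 511800, 511850],
    "temperature": [21.5, 21.5, 21.6, 21.6, 21.7],  # hypothetical extra column
})

# keep only the relevant columns
reduced = raw.iloc[2:][relevant_columns]

# reset the index so that it starts at 0 again
reduced = reduced.reset_index(drop=True)
print(reduced)
```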

In [None]:
# drop all but the relevant columns

# reset the index


### Visualize the dataset
After you successfully removed unnecessary information, you can have a first look at the data. Please plot the two electrical signals over time.
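
A minimal plotting sketch, assuming a dataframe with the three columns named above. The signals here are synthetic sine waves, not measurements.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

channels = ["differential_potential_CH1", "differential_potential_CH2"]

# Synthetic stand-in signals around the raw offset of the measurement unit
phase = np.linspace(0, 6, 200)
demo = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01", periods=200, freq="s"),
    channels[0]: 512000 + 100 * np.sin(phase),
    channels[1]: 512000 + 50 * np.cos(phase),
})

# one line per channel on a shared time axis
fig, ax = plt.subplots(figsize=(12, 4))
for channel in channels:
    ax.plot(demo["timestamp"], demo[channel], label=channel)
ax.set_xlabel("Time")
ax.set_ylabel("Raw value")
ax.legend()
```

In a notebook the figure is rendered at the end of the cell; in a script you would call `plt.show()` afterwards.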

In [None]:
# plot differential_potential_CH1 and differential_potential_CH2 over time


### Cut the dataset into the appropriate length
After plotting the data, we now want to cut the data according to the experiments. The required timestamps are given in `experiment_start.csv` and `experiment_end.csv`. Load these files and cut the data into the appropriate intervals. For future processing you may want to save the intermediate results so that you do not have to compute them again.
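
The core of the cutting step can be sketched with a boolean mask on the timestamp column. Everything in the example is assumed for illustration: the helper name `cut_experiment`, the synthetic recording, the start times, and the 60-minute duration. The layout of the real timestamp files may differ.

```python
from datetime import timedelta

import numpy as np
import pandas as pd


def cut_experiment(df, begin, duration_minutes):
    """Return the rows of df inside [begin, begin + duration) (hypothetical helper)."""
    end = begin + timedelta(minutes=duration_minutes)
    # the end is exclusive here; use <= if the last sample should be included
    mask = (df["timestamp"] >= begin) & (df["timestamp"] < end)
    return df.loc[mask].reset_index(drop=True)


# Synthetic recording: one sample per minute for five hours
recording = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01 00:00", periods=300, freq="min"),
    "differential_potential_CH1": np.arange(300),
})

# Made-up start times; the missing entry mimics a series with fewer experiments
starts = pd.to_datetime(pd.Series(["2023-01-01 01:00:00", None, "2023-01-01 03:00:00"]))

# pd.notna skips the missing start time (NaT)
cuts = [cut_experiment(recording, begin, 60) for begin in starts if pd.notna(begin)]
print([len(cut) for cut in cuts])
```

Each element of `cuts` could then be written to its own file with `DataFrame.to_csv`.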

In [None]:
stimulus = "wind"
# load the experiment start timestamps

running_exp_number = 0
for i in range(1, 5):
    # transform the timestamps to datetime

    # load data

    # iterate through data and cut
    for j in range(...):
        begin_experiment = ...  # TODO
        # because series have different number of experiments
        if pd.notna(begin_experiment):
            end_experiment = begin_experiment + timedelta(minutes=...)
            print(f"begin: {begin_experiment} -- end: {end_experiment}")
            # cut data

            # (optional) plot the cut data

            # save back to csv

            # increase experiment number
            running_exp_number += 1


### Cut data continued
Well done. By now you should have a folder of files, where every file contains one experiment. Now you have to repeat this for all stimuli and series. This can be done with a `for` loop. If you are short on time, you can skip this step and use the provided cut data (`data/cut_data/*`).

In [None]:
# cut all the data
# ATTENTION: if you want to cut the light data, you have to use a different approach,
# because the experimental protocol is slightly different:
# blue light: starting at 1 am; every 6 hours
# 1 am, 7 am, 1 pm, 7 pm
# red light: starting at 4 am; every 6 hours
# 4 am, 10 am, 4 pm, 10 pm
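
The light schedule described in the comments above can be generated instead of read from a file. This is only a sketch: the helper name `light_start_times` and the example date are assumptions, and the real recordings span their own dates.

```python
import pandas as pd


def light_start_times(day, first_hour, period_hours=6):
    """Start times on one day: first_hour, then every period_hours (hypothetical helper)."""
    midnight = pd.Timestamp(day).normalize()
    return [midnight + pd.Timedelta(hours=hour) for hour in range(first_hour, 24, period_hours)]


# blue light starts at 1 am, red light at 4 am, both every 6 hours
blue_starts = light_start_times("2023-01-01", first_hour=1)
red_starts = light_start_times("2023-01-01", first_hour=4)

print([t.hour for t in blue_starts])
print([t.hour for t in red_starts])
```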

### Scaling
While cutting the data you probably noticed that the values are quite large. This is because they have not yet been scaled: the data are still raw values from the measurement unit. You can convert them to millivolts using the following formula:
$$ x_{\text{mV}} = \frac{x_{\text{raw}} - 512000}{1000} $$
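
A small worked example of the formula, applied to both channels at once. The function name `scale_to_millivolt` and the sample values are made up for illustration.

```python
import pandas as pd

CHANNELS = ["differential_potential_CH1", "differential_potential_CH2"]


def scale_to_millivolt(df, channels=CHANNELS):
    """Apply x_mV = (x_raw - 512000) / 1000 to the given columns (hypothetical helper)."""
    scaled = df.copy()  # leave the raw dataframe untouched
    scaled[channels] = (scaled[channels] - 512000) / 1000
    return scaled


raw = pd.DataFrame({
    "differential_potential_CH1": [512000, 513000, 511500],
    "differential_potential_CH2": [512250, 512000, 510000],
})
scaled = scale_to_millivolt(raw)
print(scaled)
```

A raw value of 512000 therefore corresponds to 0 mV, and 513000 to 1 mV.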

In [None]:
# Scale the electric potential channels to mV
def scale_data(data):
    # TODO apply the above formula
    return data

# iterate through all experiments
for i in range(...):
    # load experiment

    # scaling
    df = scale_data(df)

    # (optional) plot

    # save back to csv

## 2. Feature extraction
Now that the data are in the correct format, we can start with feature extraction. We extract features from the stimulus phase and from the initial 10 minutes, which serve for background subtraction. We will calculate the mean, the standard deviation (std), and the average spectral power (asp), which is derived from the power spectral density. (Attention: some samples may contain NaN values or be incomplete!)
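
The sketch below shows one way to compute the three features for a single channel. It assumes that background subtraction means subtracting the feature of the background phase from the feature of the stimulus phase, and that NaN samples are simply dropped; both are assumptions, and other choices are possible. The helper names and the synthetic signals are made up.

```python
import numpy as np
from scipy.signal import welch


def average_spectral_power(signal):
    """Sum of the Welch PSD estimate over all frequencies."""
    _, psd = welch(signal, nperseg=min(256, len(signal)))
    return float(np.sum(psd))


def channel_features(stimulus, background):
    """Background-subtracted mean, std, and asp of one channel (hypothetical helper)."""
    stimulus = np.asarray(stimulus, dtype=float)
    background = np.asarray(background, dtype=float)
    # drop NaN samples so that the statistics can be calculated
    stimulus = stimulus[~np.isnan(stimulus)]
    background = background[~np.isnan(background)]
    if stimulus.size < 2 or background.size < 2:
        return None  # incomplete experiment, nothing to compute
    return {
        "mean": float(stimulus.mean() - background.mean()),
        "std": float(stimulus.std() - background.std()),
        "asp": average_spectral_power(stimulus) - average_spectral_power(background),
    }


# Synthetic phases: the stimulus phase is shifted by 2 mV and has one NaN sample
rng = np.random.default_rng(10)
stimulus_phase = 2.0 + 0.1 * rng.normal(size=600)
stimulus_phase[5] = np.nan
background_phase = 0.1 * rng.normal(size=600)

features = channel_features(stimulus_phase, background_phase)
print(features)
```

Returning `None` for incomplete experiments lets the calling loop skip them without a special case for every feature.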

In [None]:
def power_spectral_density(data):
    # Calculate PSD using welch estimate. PSD gives the spectra
    # depending on the frequencies. Sum over all spectra to
    # receive the total (average) spectral power
    _, psd = welch(data)
    return sum(psd)


stimulus = "wind"

nb_experiments = ... # number of experiments

for i in range(nb_experiments):
    # load experiment
    df = ...
    # deselect empty experiments
    if df.empty:
        print(f"experiment {i} is empty")
        continue

    # convert timestamp to datetime

    # select stimulus phase and background subtraction phase

    # calculate features per channels
    # make sure the values can be calculated (e.g. no NaN values)

    # calculate mean

    # calculate std

    # calculate asp

# load the features into a dataframe and save it to a csv file
df = pd.DataFrame({"mean_ch1": ..., "mean_ch2": ..., "std_ch1": ..., "std_ch2": ..., "asp_ch1": ..., "asp_ch2": ...})
df.to_csv(f"data/features/{stimulus}/mean_std_asp_features.csv", index=False)

### Feature extraction continued
I encourage you to find and implement other features.
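
As a starting point, here are a few generic time-series descriptors: skewness, kurtosis, peak-to-peak amplitude, and the number of mean crossings. They are suggestions only and are not taken from the original pipeline; whether they help for plant signals is something you would have to test.

```python
import numpy as np
from scipy.stats import kurtosis, skew


def extra_features(signal):
    """A few generic descriptors of a 1D signal (hypothetical helper)."""
    x = np.asarray(signal, dtype=float)
    x = x[~np.isnan(x)]  # ignore missing samples
    # count how often the signal changes sides relative to its mean
    above_mean = (x - x.mean()) >= 0
    mean_crossings = int(np.count_nonzero(above_mean[1:] != above_mean[:-1]))
    return {
        "skewness": float(skew(x)),
        "kurtosis": float(kurtosis(x)),
        "peak_to_peak": float(x.max() - x.min()),
        "mean_crossings": mean_crossings,
    }


demo_signal = np.array([1.0, -1.0, 1.0, -1.0])
demo_features = extra_features(demo_signal)
print(demo_features)
```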

In [None]:
# TODO implement other interesting features

## 3. Data augmentation (Slicing)
This part is optional. If you do not plan to use deep learning, you can skip it. If you only want to do deep learning, you can use the raw time series directly. However, as you may have noticed, the dataset is rather small and probably not sufficient for deep learning. Hence, we can use data augmentation strategies to enlarge the dataset. A starting point is the [tsaug library](https://tsaug.readthedocs.io/en/stable/), which provides many data augmentation strategies for time series data. A simple example is adding white noise.

In [None]:
stimulus = "wind"
n_experiments = len(os.listdir(f"data/scaled_data/{stimulus}"))

# start empty; the first experiment that is not skipped initialises the arrays
ch_1_all = None
ch_2_all = None

for i in range(n_experiments):
    # load experiment data
    df = pd.read_csv(f"data/scaled_data/{stimulus}/experiment_{i}.csv")
    df["timestamp"] = pd.to_datetime(df["timestamp"], format="%Y-%m-%d %H:%M:%S")

    # make sure that the dataframe is not empty
    if df.empty:
        print(f"skipping experiment {i} because of empty data")
        continue

    # get timings
    stimulus_application_begin = df["timestamp"].iloc[0] + timedelta(minutes=59)
    stimulus_application_end = df["timestamp"].iloc[0] + timedelta(minutes=60+11)  # 10 min stimulus

    # cut data
    df = df[df["timestamp"] >= stimulus_application_begin]
    df = df[df["timestamp"] <= stimulus_application_end]

    # make sure all experiments have the same length -> resample
    new_num_points = 418
    tmp_df = df.set_index('timestamp')
    if tmp_df.shape[0] < new_num_points-10:
        print(f"skipping experiment {i} because of too few data points")
        continue
    interpolated_df = resample(tmp_df, num=new_num_points)

    # reshape to fit the desired format
    ch_1 = np.reshape(interpolated_df[:, 0], (1, -1))
    ch_2 = np.reshape(interpolated_df[:, 1], (1, -1))

    # TODO here you can augment the data! (e.g. add white noise)

    # stack the data together
    # (checking for None instead of i == 0 also works if experiment 0 was skipped)
    if ch_1_all is None:
        ch_1_all = ch_1
        ch_2_all = ch_2
    else:
        ch_1_all = np.vstack((ch_1_all, ch_1))
        ch_2_all = np.vstack((ch_2_all, ch_2))

    # (optional) plot data
    # fig, ax = plt.subplots(1, 1, figsize=(20, 10))
    # ax.plot(df["timestamp"], df["differential_potential_CH1"], label="differential_potential_CH1")
    # ax.plot(df["timestamp"], df["differential_potential_CH2"], label="differential_potential_CH2")
    # fig, ax = plt.subplots(1, 1, figsize=(20, 10))
    # ax.plot(interpolated_df[:, 0], label="differential_potential_CH1")
    # ax.plot(interpolated_df[:, 1], label="differential_potential_CH2")
    # ax.set_xlabel("Time")
    # ax.set_ylabel("Voltage [mV]")
    # ax.legend()
    # plt.show()

# save the data
df = pd.DataFrame(ch_1_all)
df.to_csv(f"data/dataset/{stimulus}/ch_1.csv", index=False, header=False)
df = pd.DataFrame(ch_2_all)
df.to_csv(f"data/dataset/{stimulus}/ch_2.csv", index=False, header=False)
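
White-noise augmentation can be sketched with plain NumPy, without tsaug. The function name `add_white_noise`, the noise level, and the number of copies are illustrative assumptions; a sensible noise level depends on the amplitude of your signals. The input is assumed to have one row per experiment and 418 samples per row, as in the cell above.

```python
import numpy as np


def add_white_noise(series, noise_std, n_copies, seed=None):
    """Return the original rows followed by n_copies noisy versions (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    series = np.atleast_2d(series)
    noisy = [
        series + rng.normal(0.0, noise_std, size=series.shape)
        for _ in range(n_copies)
    ]
    return np.vstack([series, *noisy])


# Stand-in for three resampled experiments with 418 samples each
demo_series = np.zeros((3, 418))
augmented = add_white_noise(demo_series, noise_std=0.05, n_copies=2, seed=10)
print(augmented.shape)
```

If you augment, it is usually safer to do so only on the training split, so that noisy copies of a test sample do not leak into training.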



### Preparing the classification dataset
Now that we have extracted the features, we need to prepare the data for classification by adding class information and splitting the data into a training set and a test set.

#### Add class information
Here we structure the data into feature vectors and the corresponding class labels.
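
A possible structure for this step, using random numbers in place of the real feature files. The two class names, the number of experiments, and the 80/20 split are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(10)  # fixed seed to reproduce the shuffle

# Hypothetical feature matrices: one row per experiment, six features each
wind_features = rng.normal(0.0, 1.0, size=(20, 6))
blue_light_features = rng.normal(1.0, 1.0, size=(20, 6))

# stack the feature vectors and create one integer label per row
x = np.vstack((wind_features, blue_light_features))
y = np.concatenate((np.zeros(20, dtype=int), np.ones(20, dtype=int)))

# shuffle features and labels with the same permutation
order = rng.permutation(len(x))
x, y = x[order], y[order]

# split into train and test data
split_ratio = 0.8
n_train = int(split_ratio * len(x))
x_train, y_train = x[:n_train], y[:n_train]
x_test, y_test = x[n_train:], y[n_train:]

print(f"x_train: {x_train.shape}, y_train: {y_train.shape}")
print(f"x_test: {x_test.shape}, y_test: {y_test.shape}")
```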

In [None]:
# load features

# select which features to use


# shuffle features
np.random.seed(10)  # TODO set to reproduce results


# split train and test data
x_train = ...
y_train = ...

x_test = ...
y_test = ...

# stack the data together


print(f"x_train: {x_train.shape}, y_train: {y_train.shape}")
print(f"x_test: {x_test.shape}, y_test: {y_test.shape}")

## 4. Classification - classical machine learning
In this section we will use the previously generated dataset for classification.
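
The following sketch shows the general pattern with a support vector classifier on synthetic, well-separated data. The important detail is that the scaler is fitted on the training set only and then applied to both sets. The data and the choice of an RBF kernel are assumptions; `LinearDiscriminantAnalysis` could be used in the same way.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

rng = np.random.default_rng(10)

# Synthetic two-class data with six features per sample
x_train = np.vstack((rng.normal(0, 1, (40, 6)), rng.normal(4, 1, (40, 6))))
y_train = np.repeat([0, 1], 40)
x_test = np.vstack((rng.normal(0, 1, (10, 6)), rng.normal(4, 1, (10, 6))))
y_test = np.repeat([0, 1], 10)

# fit the scaler on the training set only, then apply it to both sets
scaler = MinMaxScaler().fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

# create and fit the classifier, then predict the test set
classifier = SVC(kernel="rbf").fit(x_train_scaled, y_train)
y_pred = classifier.predict(x_test_scaled)

print(f"accuracy: {accuracy_score(y_test, y_pred)}")
```

`MinMaxScaler` scales every feature column separately, so no explicit loop over the features is needed in this variant.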

In [None]:
# scaling the features based on the training set
min_max = MinMaxScaler()
for i in range(...):  # min max scale all the features
    pass
# create classifier
classifier = ...

# fit classifier
trained_classifier = ...

# predict the test set
y_pred = ...

# (optional) Have a look at SHAP analysis for explainability

print(f"accuracy: {accuracy_score(y_test, y_pred)}")
print(f"F1-score: {f1_score(y_test, y_pred, average='weighted')}")
print(f"confusion matrix: \n{confusion_matrix(y_test, y_pred)}")

## 5. Classification - deep learning
In this section we will use the previously generated dataset for classification with deep learning. I present a sample implementation of a dense neural network and an LSTM network and provide the general structure for building a model. Your task is to implement your own model using the given structure and to play with the parameters.

In [None]:
# select parameters
n_classes = 2  # number of classes
np.random.seed(10)  # TODO set to reproduce results
split_ratio = 0.8  # 80% train, 20% test
batch_size = 32  # batch size: how many samples are used to calculate the gradient
model_type = "LSTM"  # Dense or LSTM

# load time series data

# shuffle the data
x_train = ...
y_train = ...
x_test = ...
y_test = ...

# make the dimensions right for the LSTM: input data should have shape (samples, time steps, features)
# here, "features" corresponds to the number of channels (in our case 1 or 2)
if model_type == "LSTM":
    x_train = x_train.reshape((x_train.shape[0], x_train.shape[1], 1))
    x_test = x_test.reshape((x_test.shape[0], x_test.shape[1], 1))

# do one hot encoding to get the probability distribution
enc = preprocessing.OneHotEncoder(categories='auto')
enc.fit(np.concatenate((y_train, y_test), axis=0).reshape(-1, 1))
y_train = enc.transform(y_train.reshape(-1, 1)).toarray()
y_test = enc.transform(y_test.reshape(-1, 1)).toarray()
# You can blur the labels for better numerical stability
# y_train[y_train == 1] -= (0.01 * (n_classes-1))
# y_train[y_train == 0] += 0.01
#
# y_test[y_test == 1] -= (0.01 * (n_classes-1))
# y_test[y_test == 0] += 0.01

# Let's make our dataset performant using tf.data API
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test))

train_dataset = train_dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)

# build model (LSTM or Dense) or your own!
model = Sequential()
if model_type == "LSTM":
    model.add(keras.layers.LSTM(100, input_shape=(x_train.shape[1], x_train.shape[2]), return_sequences=True))
    model.add(keras.layers.LSTM(100))
    model.add(keras.layers.Flatten())
elif model_type == "Dense":
    model.add(InputLayer(input_shape=(x_train.shape[1],), batch_size=batch_size))
    model.add(Dense(500, activation='relu'))
    model.add(Dense(100, activation='relu'))
model.add(Dense(50, activation='sigmoid'))  # see: https://stackoverflow.com/questions/49016723/softmax-cross-entropy-loss-explodes
model.add(Dense(n_classes, activation='softmax'))

# categorical cross-entropy matches the softmax output and one-hot labels for any n_classes
model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])

# check if gpu is available -> depends on your machine and is probably not available
print(f"is gpu available: {tf.config.list_physical_devices('GPU')}")
# print model summary (structure, shape, and number of parameters)
model.summary()  # summary() prints itself and returns None

# save the best model for later use
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=f"neural_network/best_{model_type}_model",
    save_weights_only=False,
    monitor='val_accuracy',
    mode='auto',
    save_best_only=True)

# fit model (the datasets are already batched, so batch_size must not be passed again)
history = model.fit(train_dataset, validation_data=test_dataset, epochs=100,
                    callbacks=[model_checkpoint_callback], verbose=True)

# validate model -> use the best model
model = tf.keras.models.load_model(f"neural_network/best_{model_type}_model")
y_pred = model.predict(x_test, verbose=True)
keras.backend.clear_session()

# calculate metrics
y_pred = np.argmax(y_pred, axis=1)
y_test = np.argmax(y_test, axis=1)

# calculate: accuracy, f1-score, and confusion matrix

# plotting the loss and accuracy
train_loss = history.history['loss']
val_loss = history.history['val_loss']
train_acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
xc = range(len(history.history['loss']))

plt.figure()
plt.plot(xc, train_loss, label="train loss")
plt.plot(xc, val_loss, label="val loss")
plt.plot(xc, train_acc, label="train accuracy")
plt.plot(xc, val_acc, label="val accuracy")
plt.title(f"{model_type} loss and accuracy")
plt.legend()
# save before showing; after plt.show() the figure may be empty
plt.savefig(f"plots/{model_type}_loss_acc_.pdf")
plt.show()


## 6. Challenge
Now that you have seen the basic approaches, you can try to implement your own classifier. You are allowed to use all resources. Best of luck!

In [None]:
# TODO make the best classifier and win