This notebook summarizes the main steps I followed for this dataset. I mainly compared tree ensemble methods and neural networks.

Machine learning algorithms cannot be applied directly to the raw data, so we first have to extract features. Three functions are introduced below:
- ListFiles: takes a directory as argument and lists all the files it contains (here, one csv file per volcano segment). It returns an array with the paths of these files and an array with the corresponding segment ids, both sorted by id.
- FeatureExtraction: takes a DataFrame containing sensor data and extracts features that summarize the information it contains. For each sensor, the extracted features are the mean, standard deviation, maximum, minimum, nine quantiles, skewness and kurtosis, together with the positions of the maximum and minimum values and the number of missing values.
- ImportData: takes as arguments a directory containing the csv files corresponding to each volcano, and the file containing the response variable (train.csv). It extracts the features from each csv file, drops constant features and adds a few cross-sensor aggregates (for example the largest maximum over all sensors). It finally returns a DataFrame containing the extracted features for each volcano, and another DataFrame containing the ids and the remaining time to eruption (the response variable).
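
As a small illustration of the kind of per-sensor summaries described above, here is a sketch on toy data. The sensor names, the random data and the reduced set of statistics are assumptions made for the example; it is not the extraction code used in this notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew, kurtosis

# Toy segment: 1000 readings for 3 hypothetical sensors, with a few gaps.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(1000, 3)),
                   columns=["sensor_1", "sensor_2", "sensor_3"])
toy.iloc[:10, 0] = np.nan

n_missing = toy.isnull().sum()   # number of missing readings per sensor
filled = toy.fillna(0)           # gaps replaced by 0 before summarizing

# One row per sensor, one column per summary statistic.
summary = pd.DataFrame({
    "mean": filled.mean(),
    "std": filled.std(ddof=0),
    "max": filled.max(),
    "min": filled.min(),
    "Q50": filled.quantile(0.5),
    "skew": filled.apply(skew),
    "kurt": filled.apply(kurtosis),
    "argmax": pd.Series(filled.values.argmax(axis=0), index=filled.columns),
    "sumNA": n_missing,
})
print(summary.round(3))
```

Each row of `summary` would then be flattened into one long feature vector per segment.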

In [None]:
import pandas as pd
import numpy as np
import os

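def ListFiles(path):
    # Collect the path and the id (file name without ".csv") of every file under path
    listfiles = []
    ids = []
    for root, _, files in os.walk(path):
        for x in files:
            listfiles.append(os.path.join(root, x))
            ids.append(x.split(".csv")[0])
    # Sort by id (as strings) so that the rows match the ordering of Y in ImportData
    order = np.argsort(np.array(ids))
    files = np.array(listfiles)[order]
    ids = np.array(ids)[order]
    return files, ids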

from scipy.stats import skew, kurtosis
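def FeatureExtraction(X):
    # X: DataFrame with one column per sensor. Returns a flat vector ordered
    # statistic by statistic (all sensors for max, then all sensors for mean, ...)
    sumNA = np.apply_along_axis(np.sum,0,(X.isnull()).values)  # missing values per sensor
    X = (X.fillna(0)).values  # missing readings are replaced by 0
    xmax = np.apply_along_axis(np.max,0,X)
    xmean = np.apply_along_axis(np.mean,0,X)
    # Quantiles have shape (9, n_sensors); flattening row by row keeps the same
    # statistic-major order as the column names built in ImportData
    xq = np.apply_along_axis(lambda x: np.quantile(x,np.array([0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.95,0.99])),0,X).reshape(-1)
    xstd = np.apply_along_axis(np.std,0,X)
    xskew = np.apply_along_axis(skew,0,X)
    xkurt = np.apply_along_axis(kurtosis, 0, X)
    xargmax = np.apply_along_axis(np.argmax,0,X)
    xmin = np.apply_along_axis(np.min, 0, X)
    xargmin = np.apply_along_axis(np.argmin, 0, X)
    res = np.concatenate([xmax,xmean,xstd,xskew,xkurt,xmin,xargmax,xargmin,sumNA],axis=0)
    res = np.concatenate([res,xq],axis=0)
    return res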

def ImportData(path,Yfile):
    Y = pd.read_csv(Yfile,sep=",")
    Y["segment_id"] = [str(x) for x in Y["segment_id"].values]
    Y = Y.iloc[np.argsort(Y["segment_id"]),:]
    files, id = ListFiles(path)
    X = np.zeros((len(files), 9*10+9*10))
    for i in range(len(files)):
        if i % 500 == 0:
            print(str(round(i * 100 / len(files))) + "% ")
        XX = pd.read_csv(files[i], sep=",")
        X[i, :] = FeatureExtraction(XX)
    cols1 = ["max", "mean", "std", "skew", "kurt", "min", "argmax", "argmin", "sumNA"]
    cols2 = ["Q" + x for x in ["1", "5", "10", "25", "50", "75", "90", "95", "99"]]
    cols = cols1 + cols2
    cols = [[x]*10 for x in cols]
    cols = [x for l in cols for x in l]
    S = ["S" + str(i) + "_" for i in range(1,11)]*int(len(cols)/10)
    cols = [S[i] + cols[i] for i in range(len(cols))]
    X = pd.DataFrame(X, columns=cols, index=id)
    ind_const = np.where(X.std() == 0)[0]
    X = X.drop(columns=X.columns[ind_const])
    cols = [i for i in range(X.shape[1]) if "max" in X.columns[i] and not "argmax" in X.columns[i]]
    X["Maxmax"] = X[X.columns[cols]].apply("max",axis=1)
    cols = [i for i in range(X.shape[1]) if "std" in X.columns[i]]
    X["Maxstd"] = X[X.columns[cols]].apply("max", axis=1)
    cols = [i for i in range(X.shape[1]) if "skew" in X.columns[i]]
    X["Maxskew"] = X[X.columns[cols]].apply("max", axis=1)
    cols = [i for i in range(X.shape[1]) if "kurt" in X.columns[i]]
    X["Maxkurt"] = X[X.columns[cols]].apply("max", axis=1)
    cols = [i for i in range(X.shape[1]) if "mean" in X.columns[i]]
    X["Maxmean"] = X[X.columns[cols]].apply("max", axis=1)
    cols = [i for i in range(X.shape[1]) if "min" in X.columns[i] and not "argmin" in X.columns[i]]
    X["Maxmin"] = X[X.columns[cols]].apply("max", axis=1)
    X["Minmin"] = X[X.columns[cols]].apply("min", axis=1)
    return X, Y

These functions can be used to build the dataset using the code below.

In [None]:
#path = "../input/predict-volcanic-eruptions-ingv-oe/train"
#file = "../input/predict-volcanic-eruptions-ingv-oe/train.csv"
#X, Y = ImportData(path,file)
#X.to_csv("X.csv",sep=";",columns=X.columns)
#Y.to_csv("Y.csv",sep=";",columns=Y.columns)
X = pd.read_csv("../input/preprocessed-volcano-data/X.csv",sep=";",index_col=0)
Y = pd.read_csv("../input/preprocessed-volcano-data/Y.csv",sep=";",index_col=0)

We can now start evaluating algorithms. We first split the dataset into a training set and a test set; the test set contains 500 instances of the whole dataset. We then perform linear regression using only the max features, in order to obtain a baseline performance.

In [None]:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X,Y["time_to_eruption"],test_size=500,random_state=1234)

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

reg = LinearRegression()
cols = [col for col in X.columns if "max" in col and not "argmax" in col]
reg.fit(Xtrain[cols].values, Ytrain)
pred = reg.predict(Xtest[cols].values)
mae = mean_absolute_error(Ytest, pred)
print(mae)
# Squared Pearson correlation between observations and predictions
R2 = np.corrcoef(Ytest,pred)[0,1]**2
print(R2)
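
A side note on the second number printed above: the squared correlation between observations and predictions is not the same quantity as the coefficient of determination returned by scikit-learn's `r2_score`, because the former ignores any constant offset or rescaling of the predictions. The toy values below are made up purely to show the difference.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = y_true + 10.0  # perfectly correlated, but off by a constant

# Squared Pearson correlation: insensitive to the offset.
squared_corr = np.corrcoef(y_true, y_pred)[0, 1] ** 2
# Coefficient of determination: penalizes the offset heavily.
r2 = r2_score(y_true, y_pred)

print(squared_corr)
print(r2)
```

Here the squared correlation is 1 while `r2_score` is strongly negative, so the two should not be read interchangeably when predictions may be biased.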

The mean absolute error is approximately 10^7. According to the R-squared coefficient (computed here as the squared correlation between predictions and observations), the linear relationship between the max features and the response is weak. However, there might be a strong non-linear link, and the remaining features might also help. Let us try a RandomForest algorithm on all the features.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
gridsearch = GridSearchCV(estimator=RandomForestRegressor(),
                          param_grid={"max_features" : [50, 100, 150],
                                      "min_samples_leaf" : [1, 5],
                                      "n_estimators" : [1000]},
                          cv=5,n_jobs=5,scoring="neg_mean_absolute_error",verbose=10)
gridsearch.fit(Xtrain, Ytrain)
res_RF = pd.DataFrame({"max_features" : [x["max_features"] for x in gridsearch.cv_results_["params"]],
                       "min_samples_leaf" : [x["min_samples_leaf"] for x in gridsearch.cv_results_["params"]],
                       # scores are negated MAEs, so flip the sign to report the MAE itself
                       "mean_mae" : np.round(-gridsearch.cv_results_["mean_test_score"],4),
                       "sd_mae" : np.round(gridsearch.cv_results_["std_test_score"],4)})
print(res_RF)
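
The same kind of summary table can also be built by converting `cv_results_` to a DataFrame directly. The sketch below uses a small synthetic problem and a reduced grid (both are assumptions chosen so that it runs quickly); `param_max_features` and `param_min_samples_leaf` are the column names scikit-learn generates from the grid keys.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem standing in for the real features.
X_toy, y_toy = make_regression(n_samples=120, n_features=8, noise=5.0, random_state=0)

search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    param_grid={"max_features": [2, 4], "min_samples_leaf": [1, 5]},
    cv=3,
    scoring="neg_mean_absolute_error",
)
search.fit(X_toy, y_toy)

results = pd.DataFrame(search.cv_results_)
table = pd.DataFrame({
    "max_features": results["param_max_features"],
    "min_samples_leaf": results["param_min_samples_leaf"],
    "mean_mae": -results["mean_test_score"],  # scores are negated MAEs
    "sd_mae": results["std_test_score"],
}).sort_values("mean_mae")
print(table)
```

Sorting by `mean_mae` puts the best configuration in the first row.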

The cross-validation MAEs obtained with the RandomForest algorithm are clearly lower than the MAE obtained on the test set with linear regression. This suggests that the chosen features are useful for predicting the remaining time to eruption. I then tried the ExtraTrees (extremely randomized trees) algorithm.

In [None]:
from sklearn.ensemble import ExtraTreesRegressor
gridsearch = GridSearchCV(estimator=ExtraTreesRegressor(),
                          param_grid={"max_features" : [50, 100, 150],
                                      "min_samples_leaf" : [1, 5],
                                      "n_estimators" : [1000]},
                          cv=5,n_jobs=5,scoring="neg_mean_absolute_error",verbose=10)
gridsearch.fit(Xtrain, Ytrain)
res_ET = pd.DataFrame({"max_features" : [x["max_features"] for x in gridsearch.cv_results_["params"]],
                       "min_samples_leaf" : [x["min_samples_leaf"] for x in gridsearch.cv_results_["params"]],
                       # scores are negated MAEs, so flip the sign to report the MAE itself
                       "mean_mae" : np.round(-gridsearch.cv_results_["mean_test_score"],4),
                       "sd_mae" : np.round(gridsearch.cv_results_["std_test_score"],4)})
print(res_ET)

The results are better than those obtained with the RandomForest. After fine-tuning both algorithms, ExtraTrees still gave better results than RandomForest. The optimal hyperparameter values obtained for ExtraTrees were max_features = 164 and min_samples_leaf = 1. Let's evaluate this model on the test set.

In [None]:
from sklearn.ensemble import ExtraTreesRegressor
ET = ExtraTreesRegressor(max_features=164,min_samples_leaf=1,n_estimators=1000)  # same number of trees as in the grid search
ET.fit(Xtrain, Ytrain)
pred = ET.predict(Xtest)
mae = np.mean(np.abs(Ytest-pred))
print(mae)
R2 = np.corrcoef(Ytest,pred)[0,1]**2
print(R2)

These results start looking interesting. At this point, several boosting algorithms (CatBoost, LightGBM, XGBoost) were also compared. All of them had a lower performance than ExtraTrees. 
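
A comparison of this kind can be sketched with `cross_val_score`, using the same folds for every candidate. The example below is only an illustration: it uses synthetic data and scikit-learn's `GradientBoostingRegressor` as a stand-in for the boosting libraries named above, which are not assumed to be installed.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for the extracted features.
X_toy, y_toy = make_regression(n_samples=150, n_features=10, noise=10.0, random_state=0)

# Fixed folds so that every model is scored on the same splits.
folds = KFold(n_splits=3, shuffle=True, random_state=0)
candidates = {
    "ExtraTrees": ExtraTreesRegressor(n_estimators=50, random_state=0),
    "GradientBoosting": GradientBoostingRegressor(random_state=0),
}

cv_mae = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_toy, y_toy, cv=folds,
                             scoring="neg_mean_absolute_error")
    cv_mae[name] = -scores.mean()  # flip the sign to get the MAE

print(cv_mae)
```

Which model wins on this toy problem says nothing about the volcano data; the point is only the shape of the comparison.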

To further improve the performance, I tried some feature selection. It is very likely that some features are noisy or redundant, which might degrade the performance of the regression algorithms. The feature selection procedure is the following:
- Inputs: number of iterations n_iter (30 is enough here), matrix of predictors X and response vector Y
- For i = 1 to n_iter:
    - Create an ExtraTrees algorithm with max_features = number of features of X and min_samples_leaf = 1
    - Evaluate the algorithm using 5-fold cross-validation, get the mean, min and max CV scores
    - Fit the algorithm on the full set to get the feature importances
    - Remove from X the 5 least important features
- The result is a dictionary whose elements are:
    - remaining_cols: a list giving the names of the features remaining at each step
    - cvmeanMAE, cvminMAE, cvmaxMAE: the mean, min and max CV MAE obtained at each step.

This procedure is implemented in the function below. After running it, one can plot the CV scores against the number of feature selection steps. We will then select the remaining features at the step giving the best CV scores.

The value of max_features at each step was chosen to be the total number of available features since, in the previous results, the optimal performance was obtained with max_features very close to the total number of features. At each step, a batch of 5 features is removed because the number of features is quite large, so removing them one by one would take too much time. Moreover, if there truly are many useless features, removing them in batches of 5 should do no harm in the early steps.
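
For comparison, scikit-learn ships a related tool, `RFE` (recursive feature elimination), which also drops the least important features in batches through its `step` argument. It is not the procedure used in this notebook: `RFE` does not cross-validate at each step and stops at a fixed number of features. The sketch below runs it on synthetic data with made-up sizes.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_selection import RFE

# Synthetic data: 30 features, only 5 of which carry signal.
X_toy, y_toy = make_regression(n_samples=150, n_features=30, n_informative=5,
                               noise=5.0, random_state=0)

selector = RFE(
    estimator=ExtraTreesRegressor(n_estimators=50, random_state=0),
    n_features_to_select=10,  # stop once 10 features remain
    step=5,                   # remove 5 features per iteration
)
selector.fit(X_toy, y_toy)

# Indices of the features that survived the elimination.
kept = [i for i, keep in enumerate(selector.support_) if keep]
print(kept)
```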

In [None]:
from sklearn.model_selection import KFold, cross_validate
from sklearn.ensemble import ExtraTreesRegressor
def FeatureSelection(X,Y,nfolds=5,random_state=123,n_iter=30):
    XX = X.values
    cvmeanMAE = np.zeros(n_iter)
    cvminMAE = np.zeros(n_iter)
    cvmaxMAE = np.zeros(n_iter)
    cols = X.columns
    remaining_cols = []
    cvsplit = KFold(n_splits=nfolds, shuffle=True, random_state=random_state)
    for i in range(n_iter):
        print("Variable selection step " + str(i + 1) + " - " + str(XX.shape[1]) + " features remaining")
        # Record the features evaluated at this step, so that remaining_cols[i]
        # matches the CV scores stored at index i
        remaining_cols.append(cols)
        et = ExtraTreesRegressor(max_features=XX.shape[1], min_samples_leaf=1, n_estimators=1000)
        cv = cross_validate(estimator=et, X=XX, y=Y, cv=cvsplit, n_jobs=5,
                            scoring="neg_mean_absolute_error")
        # Scores are negated MAEs, hence the sign flip
        cvmeanMAE[i] = np.mean(-cv["test_score"])
        cvminMAE[i] = np.min(-cv["test_score"])
        cvmaxMAE[i] = np.max(-cv["test_score"])
        # Refit on the full set to get the feature importances
        et.fit(XX, Y)
        varimp = et.feature_importances_
        # Drop the 5 least important features
        ind = np.argsort(varimp)[5:]
        XX = XX[:, ind]
        cols = cols[ind]
    res = {"remaining_cols" : remaining_cols,
           "cvmeanMAE" : cvmeanMAE,
           "cvminMAE" : cvminMAE,
           "cvmaxMAE" : cvmaxMAE}
    return res

tmp = FeatureSelection(Xtrain, Ytrain)
from matplotlib import pyplot
pyplot.bar(np.arange(len(tmp["cvmeanMAE"])),height=tmp["cvmeanMAE"],
           yerr=np.concatenate([(tmp["cvmeanMAE"]-tmp["cvminMAE"]).reshape(-1,1),
                                (tmp["cvmaxMAE"]-tmp["cvmeanMAE"]).reshape(-1,1)],axis=1).T)

The MAE slowly decreases as the number of feature selection steps increases, until around step 25, after which it starts to increase again. The columns that remained when I first ran the procedure are given below.

In [None]:
sel_cols = ['S1_max', 'S10_Q75', 'S3_Q50', 'S1_sumNA', 'S3_Q1', 'S3_max', 'S9_std',
       'S6_Q5', 'S7_Q1', 'S10_Q95', 'S5_Q90', 'S2_Q5', 'S5_Q5', 'S6_Q10',
       'S7_Q50', 'S1_Q95', 'S3_Q75', 'S5_Q10', 'S8_Q99', 'S4_Q95', 'S3_sumNA',
       'S7_Q90', 'S8_Q50', 'S6_Q90', 'S10_Q90', 'S2_sumNA', 'S4_Q99', 'S2_Q75',
       'S9_Q90', 'S9_Q25', 'S1_Q99', 'S6_Q1', 'S1_Q5', 'S7_Q99', 'S3_Q5',
       'S3_Q95', 'S5_Q99', 'S4_Q1', 'S9_Q50', 'S2_std', 'S1_Q75', 'S2_Q50',
       'S7_Q5', 'S10_Q25', 'S5_sumNA', 'S8_Q5', 'S10_Q1']
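
Rather than reading the best step off the bar plot, it can also be picked programmatically. The sketch below uses a made-up result dictionary with the same keys as the one described earlier, and assumes that entry i of `remaining_cols` is the feature set whose CV score is stored at position i of `cvmeanMAE`.

```python
import numpy as np

# Hypothetical output with the same keys as the feature selection result.
toy_res = {
    "remaining_cols": [["a", "b", "c", "d"], ["a", "b", "c"], ["a", "c"], ["a"]],
    "cvmeanMAE": np.array([4.0, 3.5, 3.2, 3.9]),
}

# Step with the lowest mean CV MAE, and the features evaluated at that step.
best_step = int(np.argmin(toy_res["cvmeanMAE"]))
best_cols = list(toy_res["remaining_cols"][best_step])
print(best_step, best_cols)
```

Because the min and max CV scores overlap between neighbouring steps, the exact argmin can change from run to run; choosing any step in the flat region around it is a reasonable alternative.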

We can now select these features and evaluate the former algorithms again. As previously, the best results were obtained using ExtraTrees (here with max_features = 47 and min_samples_leaf = 1).

In [None]:
Xtrain = Xtrain[sel_cols]
Xtest = Xtest[sel_cols]
ET = ExtraTreesRegressor(max_features=47,min_samples_leaf=1,n_estimators=1000)
ET.fit(Xtrain, Ytrain)
pred = ET.predict(Xtest)
mae = np.mean(np.abs(Ytest-pred))
print(mae)
R2 = np.corrcoef(pred,Ytest)[0,1]**2
print(R2)

The feature selection procedure led to a further improvement of the results. Using only the selected features, the previous algorithms (RandomForest and boosting algorithms) were compared to ExtraTrees again, but ExtraTrees still gave the best results (this comparison might be biased at this point, since ExtraTrees was used to select the features).

I then switched to neural networks. I first tried them on the full set of variables, but could not get satisfactory performance. Trying again on the selected features gave much better results. A class implementing all the steps required for training the network and predicting the response is given below. These steps include proper scaling of the predictors and response, and two callbacks:
- ModelCheckpoint, which saves the model at a given epoch whenever its score on the validation set beats the best validation score obtained so far
- ReduceLROnPlateau, which automatically multiplies the current learning rate by 0.5 if no improvement in the validation score is observed for 20 successive epochs.
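
To make the effect of the second callback concrete, here is a simplified, pure-Python simulation of its rule with factor 0.5 and patience 20. It is a sketch, not the Keras implementation: cooldown and the minimum learning rate are left out, the function name is hypothetical, and the validation losses are invented.

```python
def simulate_reduce_on_plateau(val_losses, lr=1e-3, factor=0.5, patience=20, min_delta=1e-4):
    """Return the learning rate in effect after each epoch (simplified rule)."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best - min_delta:
            # Improvement: remember it and reset the patience counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs: shrink the learning rate.
                lr *= factor
                wait = 0
        history.append(lr)
    return history

# Two improving epochs followed by 40 epochs stuck at the same loss.
losses = [1.0, 0.8] + [0.8] * 40
rates = simulate_reduce_on_plateau(losses)
print(rates[20], rates[21], rates[-1])
```

With these losses the learning rate is halved after the 20th stagnant epoch and again after the 40th.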

The arguments of the Network class are the following:
- file: name of the file to which the ModelCheckpoint callback saves the network each time an improvement is observed on the validation set
- nval: size of the validation set
- nepochs: number of epochs
- batch_size: batch size used for training the network.

In [None]:
from tensorflow import keras
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
class Network:
    def __init__(self,file,nval=500,nepochs=300,batch_size=64):
        self.nval = nval
        self.nepochs = nepochs
        self.batch_size = batch_size
        self.file = file
    def __build_model(self,input_shape):
        net = keras.models.Sequential([
            keras.layers.Input(shape=(input_shape,)),  # shape is given as a tuple
            keras.layers.Dense(256, activation="relu"),
            keras.layers.Dense(256, activation="relu"),
            keras.layers.Dense(256, activation="relu"),
            keras.layers.Dense(256, activation="relu"),
            keras.layers.Dense(512, activation="relu"),
            keras.layers.Dropout(0.5),
            keras.layers.Dense(1, activation="linear")
        ])
        return net
    def fit(self,X,Y):
        Xtrain, Xval, Ytrain, Yval = train_test_split(X, Y, test_size=self.nval)
        scaler = StandardScaler()
        scaler.fit(Xtrain)
        Xtrain_sc = scaler.transform(Xtrain)
        Xval_sc = scaler.transform(Xval)
        my = np.mean(Ytrain)
        sy = np.std(Ytrain)
        Ytrain_sc = (Ytrain - my) / sy
        Yval_sc = (Yval - my) / sy
        net = self.__build_model(X.shape[1])
        checkpoint = keras.callbacks.ModelCheckpoint(self.file,monitor="val_loss",
                                                     save_best_only=True,mode="min",
                                                     verbose=1)
        lr_callback = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=20)
        net.compile(loss="mean_absolute_error", optimizer=keras.optimizers.Nadam(learning_rate=0.001))
        fitted = net.fit(x=Xtrain_sc, y=Ytrain_sc, batch_size=self.batch_size, 
                         epochs=self.nepochs, verbose=1,
                         callbacks=[lr_callback, checkpoint], 
                         validation_data=(Xval_sc, Yval_sc))
        net = keras.models.load_model(self.file)
        fitted = {"net" : net, "scaler" : scaler, "my" : my, "sy" : sy}
        self.fitted = fitted
    def predict(self,Xtest):
        fitted = self.fitted
        net = fitted["net"]
        scaler = fitted["scaler"]
        my = fitted["my"]
        sy = fitted["sy"]
        pred = net.predict(scaler.transform(Xtest))[:,0]*sy+my
        return pred
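
The class above standardizes the response before training and undoes the scaling in predict. One consequence is that the loss reported by Keras during training is an MAE in standardized units; multiplying it by the standard deviation of the training response gives the MAE in the original units. The sketch below checks this relation on invented numbers, with random noise standing in for the network output.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(0, 5e7, size=1000)  # toy response values

# Standardize the response, as done before training.
my, sy = y.mean(), y.std()
y_sc = (y - my) / sy

# Pretend network output on the standardized scale.
pred_sc = y_sc + rng.normal(scale=0.1, size=y.size)
# Back-transform to the original scale, as done at prediction time.
pred = pred_sc * sy + my

mae_scaled = np.mean(np.abs(y_sc - pred_sc))
mae_original = np.mean(np.abs(y - pred))
print(mae_original, mae_scaled * sy)
```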

The network can now be trained on the training set and evaluated on the test set.

In [None]:
net = Network(file="fitted_network.h5",nval=500,nepochs=300,batch_size=64)
net.fit(Xtrain,Ytrain)
pred = net.predict(Xtest)
mae = np.mean(np.abs(Ytest-pred))
print(mae)
R2 = np.corrcoef(Ytest,pred)[0,1]**2
print(R2)

So far, the best performance was obtained with this neural network. We can retrain it on the full dataset and predict the values for the test dataset.

In [None]:
X = pd.read_csv("../input/preprocessed-volcano-data/X.csv",sep=";",index_col=0)
Y = pd.read_csv("../input/preprocessed-volcano-data/Y.csv",sep=";",index_col=0)
Xtest = pd.read_csv("../input/preprocessed-volcano-data/Xtest.csv",sep=";",index_col=0)
segment_id = Xtest.index
X = X[sel_cols]
Xtest = Xtest[sel_cols]
net = Network(file="fitted_network.h5",nval=500,nepochs=300,batch_size=64)
net.fit(X,Y["time_to_eruption"])
pred = net.predict(Xtest)
res = pd.DataFrame({"segment_id" : segment_id, "time_to_eruption" : pred})
res.to_csv("prediction_test.csv",sep=",",columns=res.columns,index=False)