# What is Volcano Forecasting?
This is an introduction to understanding the data and the problem.

**Volcanoes are awesome!** both in the California way and the "instilling awe" way.

![Mount Etna CC-BY Kuhnmi Flickr](https://i.imgur.com/TZUE3ht.jpg)
Mount Etna on a calm day. (CC-BY Kuhnmi [Flickr](https://www.flickr.com/photos/31176607@N05/25720452925))

Volcano monitoring is important not only for the people living on and next to volcanoes but also globally, as seen when the [Eyjafjallajökull eruption](https://en.wikipedia.org/wiki/2010_eruptions_of_Eyjafjallaj%C3%B6kull) disrupted air travel in 2010. Geophysics is the field that largely deals with active volcanoes and their activity, measuring earthquakes, ground tilt, changes in gravity, etc. Specifically, seismologists record the rumble of the Earth when magma forces its way upward. Geochemists collect data on degassing at volcanoes. And geologists study how rocks form from lava.

In the spirit of the competition, we should not go looking for additional metadata. However, it's important to understand the kind of data we are dealing with. A Nature article ([Hall 2018](https://www.nature.com/articles/d41586-018-07420-y#ref-CR1)) describes the world's first automatic volcano forecast system on Mount Etna. The data for this is mostly acoustic, specifically infrasound ([Ripepe et al. 2018](https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2018JB015561)). Seeing that we have time series from 10 sensors, it's pretty safe to assume that we are dealing with a seismological problem. Seismological stations tend to be the most reliable, compared to e.g. gas analyzers. Gas analyzers have to sit right on top of a vent (and vents tend to rebuild over the lifetime of a volcano), whereas seismological stations just have to listen and not run out of battery.

So let's have a look at the actual data!

# Data
Files

`train.csv`: Metadata for the train files.

`segment_id`: ID code for the data segment. Matches the name of the associated data file.

`time_to_eruption`: The target value, the time until the next eruption.

`[train|test]/*.csv`: the data files. Each file contains ten minutes of logs from ten different sensors arrayed around a volcano. The readings have been normalized within each segment, in part to ensure that the readings fall within the range of int16 values. If you are using the Pandas library you may find that you still need to load the data as float32 due to the presence of some nulls.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plots
import seaborn as sns

from pathlib import Path
from tqdm import tqdm

random_state = 42

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
train = pd.read_csv("../input/predict-volcanic-eruptions-ingv-oe/train.csv")
train.describe()

In [None]:
sequence = pd.read_csv("../input/predict-volcanic-eruptions-ingv-oe/train/1000015382.csv", dtype="Int16")
sequence.describe()

In [None]:
sequence.tail()

Each sequence is 10 minutes long with 60001 samples. The data is `int16` but contains NaNs. Luckily, Pandas has a [nullable integer datatype](https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html); just make sure to spell it with a capital letter, `"Int16"`, so Pandas knows which one you mean.
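To make the dtype choice concrete, here is a minimal sketch on a tiny in-memory CSV. The two-column layout and the values are made up for illustration; only the `dtype="Int16"` and `dtype="float32"` arguments come from the text above.

```python
import io

import pandas as pd

# Tiny stand-in for a segment file: sensor_2 has one missing reading.
csv_text = "sensor_1,sensor_2\n12,-3\n-7,\n5,40\n"

# Nullable integer dtype: the gap is kept as <NA> and the column stays integer.
as_nullable = pd.read_csv(io.StringIO(csv_text), dtype="Int16")

# Float alternative: the gap becomes NaN and the column is stored as float32.
as_float = pd.read_csv(io.StringIO(csv_text), dtype="float32")

print(as_nullable.dtypes)
print(as_float.dtypes)
print("missing readings in sensor_2:", as_nullable["sensor_2"].isna().sum())
```

Either option works for the summary statistics later on; the nullable integer simply keeps the values in their original integer form.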

In [None]:
sequence.fillna(0).plot(subplots=True, figsize=(25, 10))
plt.tight_layout()
plt.show()

These stations clearly have similar data, but are shifted in time. Why?

Look at this digital elevation model of Etna from ([Bonaccorso 2011](https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2010GC003480)):

![Bonaccorso 2011 DEM of Etna](https://i.imgur.com/2b99LHc.jpg)

The different stations are located all around the crater. That means the further away a station is from the magma, the hypocentre of an earthquake, or other acoustic sources like streets or [humans](https://www.weforum.org/agenda/2020/07/seismic-anthropogenic-noise-lockdown-covid19/), the longer the acoustic wave has to travel, and the later it arrives at the station. So we can assume that, with regard to the one signal event above, stations 9 and 10 are almost equally far from it. Sensor 8 is a bit further away.

Additionally, you can see the different noise levels of the data. Sensor 5 is fantastically quiet, only responding to big events, while Sensor 6 has some extremely periodic noise on it. It'd be worth investigating whether this is always the case.
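One way to check for periodic noise like this is to look at the frequency spectrum of a sensor. The sketch below is only an illustration on a synthetic signal: the helper name `dominant_frequency`, the sampling rate `fs`, and the 5 Hz test tone are all assumptions, not values taken from the competition data.

```python
import numpy as np


def dominant_frequency(signal, fs):
    """Return the frequency (in Hz) of the strongest non-DC spectral peak."""
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()              # remove the constant offset
    spectrum = np.abs(np.fft.rfft(signal))       # one-sided amplitude spectrum
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return freqs[1:][np.argmax(spectrum[1:])]    # skip the zero-frequency bin


# Synthetic "noisy sensor": a 5 Hz hum buried in white noise (assumed values).
fs = 100.0
t = np.arange(6000) / fs
rng = np.random.default_rng(42)
noisy_sensor = np.sin(2 * np.pi * 5.0 * t) + 0.5 * rng.normal(size=t.size)

peak_hz = dominant_frequency(noisy_sensor, fs)
print(f"Strongest periodic component: {peak_hz:.2f} Hz")
```

On a real segment you would pass something like `sequence["sensor_6"].fillna(0)` and compare the peak across several files to see whether the same frequency keeps showing up.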

Overall, the sensors will have different lags, depending on where the volcanic activity is happening. So when we build a model that uses the time series, this is something to keep in mind.
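A common way to estimate such a lag between two stations is cross-correlation. The following is a minimal sketch on synthetic data, not part of the original analysis; the helper `estimate_lag` and the 25-sample delay are hypothetical and chosen only so the result can be checked.

```python
import numpy as np


def estimate_lag(reference, delayed):
    """Return the shift in samples at which `delayed` best matches `reference`.

    A positive result means `delayed` arrives later than `reference`.
    """
    reference = np.asarray(reference, dtype=float)
    delayed = np.asarray(delayed, dtype=float)
    reference = reference - reference.mean()
    delayed = delayed - delayed.mean()
    # Full cross-correlation covers every possible relative shift.
    xcorr = np.correlate(delayed, reference, mode="full")
    lags = np.arange(-(len(reference) - 1), len(delayed))
    return int(lags[np.argmax(xcorr)])


rng = np.random.default_rng(42)
station_a = rng.normal(size=2000)                 # synthetic "event" signal
true_delay = 25                                   # assumed delay in samples
station_b = np.roll(station_a, true_delay) + 0.1 * rng.normal(size=2000)

print("Estimated lag:", estimate_lag(station_a, station_b), "samples")
```

For the real sensors, missing values would need to be filled first, and the lag is likely to change from segment to segment depending on where the activity originates.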

# Play With the Data
Time series are hard. Can we get away with initially playing with some aggregate statistics?

In [None]:
def agg_stats(df, idx):
    df = df.agg(['sum', 'min', "mean", "std", "median", "skew", "kurtosis"])
    df_flat = df.stack()
    df_flat.index = df_flat.index.map('{0[1]}_{0[0]}'.format)
    df_out = df_flat.to_frame().T
    df_out["segment_id"] = int(idx)
    return df_out
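To see what the flattening step in `agg_stats` produces, here is a small sketch on a toy frame. The two sensors and their values are made up; the `agg`, `stack`, and index-renaming calls mirror the function above.

```python
import pandas as pd

# Toy segment with two sensors and three samples (illustrative values only).
toy = pd.DataFrame({"sensor_1": [1, 2, 3], "sensor_2": [10, 20, 60]})

stats = toy.agg(["sum", "mean"])                      # rows: statistics, columns: sensors
flat = stats.stack()                                  # MultiIndex of (statistic, sensor)
flat.index = flat.index.map("{0[1]}_{0[0]}".format)   # rename to "<sensor>_<statistic>"
row = flat.to_frame().T                               # a single row of features

print(sorted(row.columns))
print(row)
```

Each segment therefore collapses to a single row with one column per sensor and statistic, which is what a tabular model needs as input.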

In [None]:
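# Build one row of summary statistics per training segment, then concatenate once
# (DataFrame.append was removed in pandas 2.0 and is slow inside a loop)
train_files = sorted(Path("../input/predict-volcanic-eruptions-ingv-oe/train/").glob("**/*.csv"))
summary_stats = pd.concat(
    [agg_stats(pd.read_csv(csv), csv.stem) for csv in tqdm(train_files)],
    ignore_index=True,
)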

In [None]:
# Same summary statistics for the test segments; tqdm infers the total from the file list
test_files = sorted(Path("../input/predict-volcanic-eruptions-ingv-oe/test/").glob("**/*.csv"))
test_data = pd.concat(
    [agg_stats(pd.read_csv(csv), csv.stem) for csv in tqdm(test_files)],
    ignore_index=True,
)

In [None]:
features = list(summary_stats.drop(["segment_id"], axis=1).columns)
target_name = ["time_to_eruption"]
summary_stats = summary_stats.merge(train, on="segment_id")
summary_stats.head()

In [None]:
summary_stats.describe()

# Train a LightGBM Regressor

Use [Cross Validation](https://scikit-learn.org/stable/modules/cross_validation.html), because if you don't, [shakeup](https://www.kaggle.com/jtrotman/meta-kaggle-competition-shake-up) will not be your friend.

In [None]:
import lightgbm as lgbm
from sklearn.model_selection import KFold
import gc


n_fold = 7
folds = KFold(n_splits=n_fold, shuffle=True, random_state=random_state)

data = summary_stats

params = {
    "n_estimators": 2000,
    "boosting_type": "gbdt",
    "metric": "mae",
    "num_leaves": 66,
    "learning_rate": 0.005,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 3,
    "max_bins": 2048,
    "verbose": 0,
    "random_state": random_state,
    "nthread": -1,
    "device": "gpu",
}

sub_preds = np.zeros(test_data.shape[0])
feature_importance = pd.DataFrame(index=list(range(n_fold)), columns=features, dtype=float)  # float so mean() and the boxplot get numeric data

for fold_idx, (trn_idx, val_idx) in enumerate(folds.split(data)):  # fold_idx avoids overwriting n_fold
    trn_x, trn_y = data[features].iloc[trn_idx], data[target_name].iloc[trn_idx]
    val_x, val_y = data[features].iloc[val_idx], data[target_name].iloc[val_idx]
    
    model = lgbm.LGBMRegressor(**params)
    
    # Early stopping goes through a callback; the `early_stopping_rounds` and
    # `verbose` arguments of fit() were removed in LightGBM 4.0
    model.fit(trn_x, trn_y, 
            eval_set= [(trn_x, trn_y), (val_x, val_y)], 
            eval_metric="mae",
            callbacks=[lgbm.early_stopping(stopping_rounds=150, verbose=False)],
           )

    feature_importance.iloc[fold_idx, :] = model.feature_importances_
    
    # Average the test predictions over all folds
    sub_preds += model.predict(test_data[features], num_iteration=model.best_iteration_) / folds.n_splits
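The loop above averages the test predictions but never reports a validation score. A common companion is to collect out-of-fold predictions and score them once at the end. The sketch below shows the idea on synthetic data with a plain scikit-learn `LinearRegression` standing in for LightGBM; the data, the coefficients, and the number of folds are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic regression problem (made-up features and target).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

folds = KFold(n_splits=5, shuffle=True, random_state=42)
oof_preds = np.full(len(y), np.nan)   # one slot per training row

for trn_idx, val_idx in folds.split(X):
    model = LinearRegression().fit(X[trn_idx], y[trn_idx])
    # Each row is predicted exactly once, by a model that never saw it.
    oof_preds[val_idx] = model.predict(X[val_idx])

oof_mae = mean_absolute_error(y, oof_preds)
print(f"Out-of-fold MAE: {oof_mae:.3f}")
```

The same pattern should carry over to the LightGBM loop by filling an array of length `len(data)` with each fold's validation predictions.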


In [None]:
best = feature_importance.mean().sort_values(ascending=False)
best_idx = best[best > 5].index

plt.figure(figsize=(14,26))
sns.boxplot(data=feature_importance[best_idx], orient="h")
plt.title("Features Importance per Fold")
plt.tight_layout()

So that gives us a nice idea of which features actually matter for the next iteration.
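One possible next step is to keep only the strongest features and retrain. This is a sketch under assumptions: the helper `top_features` is hypothetical, and the toy importances below are invented to show the mechanics, not results from the model above.

```python
import pandas as pd

# Hypothetical per-fold importances (rows: folds, columns: features).
toy_importance = pd.DataFrame(
    {
        "sensor_1_std": [40, 38, 42],
        "sensor_2_sum": [3, 5, 4],
        "sensor_3_min": [20, 25, 21],
    },
    dtype=float,
)


def top_features(importance, k):
    """Names of the k features with the highest mean importance across folds."""
    return importance.mean().sort_values(ascending=False).head(k).index.tolist()


selected = top_features(toy_importance, k=2)
print(selected)
```

With the real `feature_importance` frame, the returned list could replace `features` in a second training run, so the effect on the validation score can be compared.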

# Submit Prediction
Let's build a csv for submission.

In [None]:
submission = pd.DataFrame()
submission['segment_id'] = test_data["segment_id"]
submission['time_to_eruption'] = sub_preds
submission.to_csv('submission.csv', header=True, index=False)

# How Far Can One Go?
Need Inspiration?

Of course someone has tried Transformers on earthquake time series to detect events and their different phases. I love multitasking like that!
![Nature Paper on Earthquake Transformers](https://i.imgur.com/KvRcjLh.png)
Seismogram tagging on full sequences with an Earthquake Transformer in Nature Communications ([Mousavi et al. 2020](https://www.nature.com/articles/s41467-020-17591-w)).