<h1>INGV - Volcanic Eruption Prediction

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from IPython.display import clear_output
import gc
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline
from scipy.stats import moment

In [None]:
train_files = os.listdir("../input/predict-volcanic-eruptions-ingv-oe/train")
test_files = os.listdir("../input/predict-volcanic-eruptions-ingv-oe/test")

In [None]:
len(train_files)

In [None]:
len(test_files)

<h3>Checking if the shape of data is uniform across all the train files

In [None]:
cols = []
rows = []
for i,fname in enumerate(train_files):
    train = pd.read_csv(os.path.join("../input/predict-volcanic-eruptions-ingv-oe/train",fname))
    rows.append(train.shape[0])  # shape[0] is the number of rows
    cols.append(train.shape[1])  # shape[1] is the number of columns
    print(f'{i+1} / {len(train_files)}')
    clear_output(wait=True)

In [None]:
print(f"Rows of all train files: {pd.Series(rows).unique()}\nColumns of all train files: {pd.Series(cols).unique()}")

<h3>Checking if the shape of data is uniform across all the test files

In [None]:
cols = []
rows = []
for i,fname in enumerate(test_files):
    test = pd.read_csv(os.path.join("../input/predict-volcanic-eruptions-ingv-oe/test",fname))
    rows.append(test.shape[0])  # shape[0] is the number of rows
    cols.append(test.shape[1])  # shape[1] is the number of columns
    print(f'{i+1} / {len(test_files)}')
    clear_output(wait=True)

In [None]:
print(f"Rows of all test files: {pd.Series(rows).unique()}\nColumns of all test files: {pd.Series(cols).unique()}")
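
The shape check above can also be written as a small helper that gathers the distinct `(rows, columns)` pairs in a set. This is only a sketch: `collect_shapes` is a hypothetical name, and the in-memory CSV text stands in for the real segment files.

```python
import io
import pandas as pd

def collect_shapes(csv_sources):
    """Return the set of (rows, columns) shapes found across CSV sources."""
    shapes = set()
    for src in csv_sources:
        shapes.add(pd.read_csv(src).shape)
    return shapes

# Hypothetical stand-ins for segment files: 3 rows x 2 sensors each,
# with one missing reading to show that NaNs do not change the shape.
fake_csv = "sensor_1,sensor_2\n1,2\n3,\n5,6\n"
sources = [io.StringIO(fake_csv) for _ in range(4)]

shapes = collect_shapes(sources)
print(shapes)  # a single element means every file has the same shape
```

With the real data, `csv_sources` would be the full paths built from `train_files` or `test_files`.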

<h3>All the train and test files have the same shape. Hence, we'll inspect the first few rows of one train file to get an idea of the data

In [None]:
train = pd.read_csv(os.path.join("../input/predict-volcanic-eruptions-ingv-oe/train",train_files[2]))
train.head()

<h3>Whoa!! NaNs

In [None]:
train.count()

<h3>Not only NaNs, there are also a few 0s

In [None]:
(train==0).sum()

<h3>A few sensors have no data at all in some files. Let us also examine whether there are sensors with intermittent missing values (NaN) across files

In [None]:
counts = []
for i,fname in enumerate(train_files):
    train = pd.read_csv(os.path.join("../input/predict-volcanic-eruptions-ingv-oe/train",fname))
    counts.append(train.count())  # non-null readings per sensor in this file
    print(f'{i+1} / {len(train_files)}')
    clear_output(wait=True)
# DataFrame.append was removed in pandas 2.0, so the table is built in one step
missing_tracker = pd.DataFrame(counts)

In [None]:
missing_tracker

<h3>There are sensors with intermittent missing values

In [None]:
missing_tracker.nunique()

In [None]:
for col in missing_tracker.columns:
    print(f"{col}\n{sorted(missing_tracker[col].unique())}\n\n")

<h3><ol>
<li>sensor_1: Has a moderate number of intermittent missing readings (a few NaNs) and some files with no readings (all NaNs)</li>
<li>sensor_2: Has lots of intermittent missing readings (a few NaNs) and some files with no readings (all NaNs)</li>
<li>sensor_3: Has a moderate number of intermittent missing readings (a few NaNs) and some files with no readings (all NaNs)</li>
<li>sensor_4: Looks quite stable. It has very few intermittent NaNs, and there are no files in which it recorded no readings.</li>
<li>sensor_5: Has lots of intermittent missing readings (a few NaNs) and some files with no readings (all NaNs)</li>
<li>sensor_6: Has a moderate number of intermittent missing readings (a few NaNs) and no files with no readings (all NaNs)</li>
<li>sensor_7: Has a small number of intermittent missing readings (a few NaNs) and some files with no readings (all NaNs)</li>
<li>sensor_8: Has a very small number of intermittent missing readings (a few NaNs) and some files with no readings (all NaNs)</li>
<li>sensor_9: Seems to be highly unstable, with a widely varying number of intermittent missing readings (NaNs)</li>
</ol></h3>
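
The per-sensor notes above can be condensed into a table giving the share of files that are complete, empty, or partially filled. This is a sketch on made-up counts: the `counts` frame is hypothetical and only mimics the layout of `missing_tracker` (one row per file, one column per sensor).

```python
import pandas as pd

N_ROWS = 60001  # readings per file, taken from the shape check

# Hypothetical non-null counts: one row per file, one column per sensor.
counts = pd.DataFrame({
    "sensor_1": [60001, 60001, 0, 59990],
    "sensor_4": [60001, 60001, 60001, 60000],
    "sensor_9": [0, 0, 60001, 0],
})

# The mean of a boolean column is the fraction of files matching the condition.
summary = pd.DataFrame({
    "complete_%": (counts == N_ROWS).mean() * 100,
    "empty_%": (counts == 0).mean() * 100,
})
summary["partial_%"] = 100 - summary["complete_%"] - summary["empty_%"]
print(summary)
```

Passing `missing_tracker` in place of `counts` should give the same breakdown for the real files.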

In [None]:
plt.figure(figsize=(15,20))
sns.heatmap(missing_tracker);

<h3>The heatmap shows sensors 2, 3, 5, 8, and 9 have no readings in many files. Sensors 4 and 6 have at least 1 reading in every file.

<h3>Let's see the distribution of sensors with complete readings in the files

In [None]:
mi = (missing_tracker==60001).mean()*100  # share of files in which each sensor has all 60001 readings
print(f'% of files with complete data per sensor\n\n{mi}')
mi.plot(kind="bar")
plt.axhline(y=100,color="red")
plt.title("Sensors with complete readings")
plt.ylabel("% of files with complete data");

<h3>The above bar chart shows sensors 1, 4, 6, and 7 have complete readings in almost all the files</h3>

<h3>Let's see the distribution of sensors with no readings at all across the files

In [None]:
mi = (missing_tracker==0).mean()*100  # share of files in which each sensor has no readings
print(f'% of files with no data at all per sensor\n\n{mi}')
mi.plot(kind="bar")
plt.title("Sensors with no readings at all")
plt.ylabel("% of files with no data at all");

<h3>As seen earlier in the heatmap, sensors 2, 3, 5, 8, and 9 have no readings in many files.

<h3>Let's look at the data in a single training file

In [None]:
train_1 = pd.read_csv(os.path.join("../input/predict-volcanic-eruptions-ingv-oe/train",train_files[0]))
train_1.head()

In [None]:
train_1.shape

<h3>Let's plot the readings of each sensor

In [None]:
fig,ax=plt.subplots(5,2,figsize=(12,17))
# ax.ravel() walks the 5x2 grid row by row, so no manual row/column counters are needed
for axis,col in zip(ax.ravel(),train_1.columns):
    axis.plot(train_1[col])
    axis.set_title(col)
plt.show();

<h3>Not all the sensors have readings in the same range

In [None]:
train_1.plot(figsize=(15,8))
plt.legend(loc="upper left", ncol=5);

<h3>In the above sample, sensors 1 and 2 have many peaks and troughs in the readings, especially sensor 2

<h3>Are the sensor readings correlated? How are the sensor readings distributed?</h3>

In [None]:
fig = plt.figure(figsize=(10,7))
sns.heatmap(train_1.corr(),annot=True,fmt=".2f",cbar=False);

<h3>There is no notable correlation between the sensor readings in this file

In [None]:
sns.pairplot(train_1,diag_kind="kde");

<h1>Reducing the dimensionality of train and test sets by computing important stats (using describe) on monthly data</h1>

<h3>There are 4,000+ train files/locations, each with 60,001 readings from 10 sensors. Each location/file has a single target variable, i.e. 60,001 rows of data share only 1 target/label. Hence, we need to reshape the data into a single row per location. Doing this for such a large dataset can be tedious. The code below treats the readings as if they were 10 minutes apart by attaching a synthetic timestamp index, which is only a device for splitting each file into month-sized chunks. I've applied the describe function to each month of data and reshaped the result into a single row.

In [None]:
def find_stats(x):
    '''
    Function to apply describe on monthly data
    '''
    return x.describe()
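
A minimal sketch of the chunk-and-summarise idea on synthetic data. It is a variation, not the notebook's exact code: it groups by calendar month with `index.to_period("M")` and uses a fixed list of statistics instead of `describe`, and the frame `seg` with its two sensors is made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical segment: 10,000 readings from 2 sensors (real files are larger).
seg = pd.DataFrame(rng.normal(size=(10_000, 2)), columns=["sensor_1", "sensor_2"])

# Synthetic timestamps at 10-minute spacing, used only to define the chunks.
seg.index = pd.date_range("2000-01-01", periods=len(seg), freq="10min")

# One row of statistics per calendar month, one column per (sensor, statistic).
stats = seg.groupby(seg.index.to_period("M")).agg(["mean", "std", "min", "max"])

# Flatten months x (sensors x statistics) into a single feature row.
row = pd.DataFrame(stats.to_numpy().reshape(1, -1))
print(stats.shape, row.shape)
```

Here 10,000 readings at 10-minute spacing span January to early March 2000, so there are 3 monthly chunks and 3 x 2 x 4 = 24 features in the flattened row.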

<h3>Selecting 70% of train files due to memory constraints

In [None]:
import random
random.seed(11)
rand_sample = random.sample(range(len(train_files)),int(np.floor(len(train_files)*0.7)))
print(len(rand_sample))
train_files = list(pd.Series(train_files)[rand_sample])
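
The same kind of reproducible subsampling can be sketched without going through a pandas Series. The helper name `sample_fraction` and the file names are hypothetical; a dedicated `random.Random` instance is used so the global random state is left untouched.

```python
import random

def sample_fraction(items, fraction, seed):
    """Return a reproducible random subset holding `fraction` of `items`."""
    rng = random.Random(seed)          # private generator, fixed seed
    k = int(len(items) * fraction)     # number of items to keep, rounded down
    return rng.sample(items, k)

# Hypothetical file list standing in for train_files.
files = [f"{i}.csv" for i in range(10)]
subset = sample_fraction(files, 0.7, seed=11)
print(len(subset), subset)
```

Calling the helper again with the same seed returns the same subset, which keeps the feature table reproducible between runs.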

In [None]:
frames = []
for n,i in enumerate(train_files):
    df = pd.read_csv(os.path.join("../input/predict-volcanic-eruptions-ingv-oe/train",i))
    df.index=pd.date_range("2000-01-01",periods=len(df),freq="10min") #adding date range with 10 minute interval
    df = df.resample(rule="M").apply(find_stats).values.reshape(1,-1) #resampling the time series data and calculating stats on monthly data
    df = pd.DataFrame(df)
    df["segment_id"] = int(i.replace(".csv",""))
    frames.append(df)
    print(f'{n}')
    gc.collect()
    clear_output(wait=True)
# DataFrame.append was removed in pandas 2.0, so the rows are concatenated once at the end
train = pd.concat(frames,ignore_index=True)
del frames

In [None]:
train_labels = pd.read_csv("../input/predict-volcanic-eruptions-ingv-oe/train.csv")

In [None]:
train = train.merge(train_labels,on="segment_id",how="left")

In [None]:
train=train.drop(columns="segment_id")

In [None]:
train.fillna(0,inplace=True)

In [None]:
X = train.iloc[:,:-1].copy()

In [None]:
y = train.iloc[:,-1].copy()

In [None]:
X.head()

In [None]:
y.isnull().sum()

In [None]:
#from sklearn.model_selection import GridSearchCV
#from sklearn.decomposition import PCA
from lightgbm import LGBMRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold,cross_val_score

<h3>Although PCA reduced the dimensionality while keeping a 99.99% explained variance ratio, the model performed better without PCA, since PCA ignores the target variable

In [None]:
#pca = PCA(n_components=10)
#X_pca = pca.fit_transform(X)
#np.cumsum(pca.explained_variance_ratio_)

<h3>Since the dataset is huge, RAM utilization reached its maximum. For better memory management, I've manually deleted variables that are no longer needed and run garbage collection frequently

In [None]:
del df
del train
gc.collect()

In [None]:
X_pca = X.copy()  # PCA is not applied; the X_pca name is kept because the cells below use it

<h3>Using Random Forest for feature importance

In [None]:
rf = RandomForestRegressor(random_state=11)
rf.fit(X_pca,y)
print(f"No of Features with Importance > 0: {sum(rf.feature_importances_>0)}")

<h3>Selecting a threshold for feature importance: keep the features whose importance is above the 75th percentile

In [None]:
feat_imp = pd.Series(rf.feature_importances_).reset_index(drop=True)
threshold = feat_imp.quantile(0.75)

In [None]:
final_feat = list(feat_imp[feat_imp>threshold].index)
len(final_feat)

In [None]:
X_pca = X_pca.iloc[:,final_feat].copy()
X_pca.shape
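
The percentile filter above can be traced on a small, made-up importance vector. The values are hypothetical; they sum to 1, as `feature_importances_` does in scikit-learn.

```python
import pandas as pd

# Hypothetical importances for 8 features.
importances = pd.Series([0.30, 0.02, 0.25, 0.03, 0.20, 0.05, 0.10, 0.05])

# 75th percentile with pandas' default linear interpolation.
threshold = importances.quantile(0.75)

# Strict ">" keeps at most the top quarter of the features.
selected = list(importances[importances > threshold].index)
print(threshold, selected)
```

The threshold falls between the third and second largest values (0.20 and 0.25), so only features 0 and 2 are kept. The selected positions can then be passed to `iloc`, as done with `final_feat`.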

In [None]:
del rf
gc.collect()

<h3>Manually tuning the hyperparameters, since GridSearchCV is taking too long

In [None]:
MAX_DEPTH = 10
N_ESTIMATORS = 1000
MIN_SAMPLES_LEAF = 300
L1 = 50000


gscv = LGBMRegressor(n_estimators=N_ESTIMATORS,
                     max_depth=MAX_DEPTH,
                     num_leaves=2**MAX_DEPTH,
                     min_data_in_leaf=MIN_SAMPLES_LEAF,
                     lambda_l1 = L1,
                     random_state=11,
                     n_jobs=-1)


In [None]:
cv = KFold(n_splits=3,random_state=11,shuffle=True)
cv_score = cross_val_score(gscv,X_pca,y,cv=cv,scoring="neg_mean_absolute_error",n_jobs=-1)

<h3>CV scores (MAE)

In [None]:
-1 * (cv_score.astype("int"))

In [None]:
np.mean(-1*(cv_score.astype("int")))

In [None]:
gscv.fit(X_pca,y)

In [None]:
del X
gc.collect()

<h3>MAE on train set

In [None]:
mean_absolute_error(y,gscv.predict(X_pca))  # scikit-learn expects (y_true, y_pred)

<h3>R<sup>2</sup> on Training set

In [None]:
r2_score(y,gscv.predict(X_pca))  # argument order matters for R^2: (y_true, y_pred)
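
Unlike MAE, R<sup>2</sup> is not symmetric in its arguments, because the total sum of squares is computed from the true values only. A small NumPy sketch with made-up numbers shows the difference; `r2` and `mae` are hand-written helpers for illustration, not the scikit-learn functions.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # spread of the TRUE values
    return 1.0 - ss_res / ss_tot

def mae(a, b):
    """Mean absolute error; swapping the arguments changes nothing."""
    return float(np.mean(np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))))

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 2.0, 3.0, 3.0]

print(r2(y_true, y_pred))   # 0.6
print(r2(y_pred, y_true))   # -1.0
print(mae(y_true, y_pred), mae(y_pred, y_true))  # 0.5 0.5
```

Swapping the arguments leaves the residuals unchanged but divides them by the spread of the predictions instead, which is why the two R<sup>2</sup> values differ.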

<h2>The model is overfit to the training set and needs further hyperparameter tuning.
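
One way to make the overfitting visible is to put the training MAE next to the cross-validated MAE. The sketch below uses synthetic data and an unconstrained `DecisionTreeRegressor` as a stand-in for the LightGBM model; the data, the model, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(11)

# Hypothetical data: the target depends on one feature plus heavy noise.
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo[:, 0] + rng.normal(scale=1.0, size=300)

model = DecisionTreeRegressor(random_state=11)  # no depth limit, so it can memorise
cv = KFold(n_splits=3, shuffle=True, random_state=11)

# cross_val_score returns negated MAE, so flip the sign before averaging.
cv_mae = -cross_val_score(model, X_demo, y_demo, cv=cv,
                          scoring="neg_mean_absolute_error").mean()
train_mae = mean_absolute_error(y_demo, model.fit(X_demo, y_demo).predict(X_demo))

print(f"train MAE: {train_mae:.3f}  CV MAE: {cv_mae:.3f}")
```

A training error far below the cross-validated error is the usual sign of overfitting. Stronger regularisation (for example a smaller depth or larger leaves) should narrow the gap.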

In [None]:
del cv, cv_score
gc.collect()

In [None]:
del X_pca
gc.collect()

In [None]:
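test_frames = []
for n,i in enumerate(test_files):
    df = pd.read_csv(os.path.join("../input/predict-volcanic-eruptions-ingv-oe/test",i))
    df.index=pd.date_range("2000-01-01",periods=len(df),freq="10min") #same synthetic 10 minute index as for train
    df = df.resample(rule="M").apply(find_stats).values.reshape(1,-1)
    df = pd.DataFrame(df)
    df = df.iloc[:,final_feat].copy() #keep only the features selected on the train set
    df["segment_id"] = int(i.replace(".csv",""))
    test_frames.append(df)
    print(f'{n}')
    gc.collect()
    clear_output(wait=True)
# DataFrame.append was removed in pandas 2.0, so the rows are concatenated once at the end
test = pd.concat(test_frames,ignore_index=True)
del test_frames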

In [None]:
test_segment_ids = test["segment_id"]
test = test.drop(columns="segment_id")

In [None]:
test.fillna(0,inplace=True)

In [None]:
#X_test_pca = pca.transform(test)

In [None]:
X_test_pca = test.copy()

In [None]:
del test
gc.collect()

In [None]:
#X_test_pca = X_test_pca.iloc[:,final_feat].copy()

In [None]:
pred = gscv.predict(X_test_pca)

In [None]:
pred

In [None]:
submission = pd.DataFrame({"segment_id":test_segment_ids,"time_to_eruption":pred})

In [None]:
submission.to_csv("ingv_submission.csv",index=False)