## Import and Explore Sample Training File

We'll import the `train.csv` file and, along the way, look at a single `segment_id` to see what one segment file contains.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy import signal
import seaborn as sns
import glob

In [None]:
train_labels = pd.read_csv("../input/predict-volcanic-eruptions-ingv-oe/train.csv")
train_labels.head()

In [None]:
print("Total segment files: {}".format(len(train_labels['segment_id'])))
train_labels.dtypes

In [None]:
df_example = pd.read_csv("../input/predict-volcanic-eruptions-ingv-oe/train/1003154738.csv")
df_example.head()

data_columns = list(df_example.columns)

print('Index Dataframe Shape: {}'.format(df_example.shape))
print('Column Headers:\n')
print(data_columns)
df_example.describe()

The example segment has 10 sensors with 60001 readings each. It also contains a significant number of `NaN` values. Let's plot the time series data.

In [None]:
fig, axs = plt.subplots(nrows=5, ncols=2)
fig.set_size_inches(20,10)
fig.subplots_adjust(hspace=0.5)

for col,ax in zip(data_columns, axs.flatten()):
    ax.plot(range(len(df_example[col])),df_example[col])
    ax.set_title(col)

We are going to build a feature set for each `segment_id` using the power spectral density (PSD), which describes how a signal's power is distributed over frequency. With 100 Hz as the sampling rate and the default segment length of 256 samples, the `signal.welch` function returns 129 values for each sensor, totaling 1290 features across the 10 sensors. This is far more manageable than the 60001 readings per sensor. The plot below shows the resulting PSD of each sensor for the example `segment_id`.

In [None]:
plt.figure(figsize=(10,6))
plt.title('PSD')
plt.xlabel('Frequency')
plt.ylabel('Power')
plt.tight_layout()

example_PSD = []

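for col in data_columns:
    # Fill missing readings with the column mean before computing the PSD
    col_mean = df_example[col].mean()
    df_example[col] = df_example[col].fillna(col_mean)
    # Sampling rate is 100 Hz, as stated above and used for the train/test features
    freq, psd = signal.welch(df_example[col],100)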
    plt.loglog(freq,psd)
    example_PSD.append(psd)
    
plt.legend(data_columns)

print('Length of PSD: {}'.format(len(psd)))

In [None]:
example_PSD = np.transpose(example_PSD)
print("Transposed PSD array shape: {}".format(example_PSD.shape))
df_PSD = pd.DataFrame(data=example_PSD, columns=data_columns)
df_PSD.head()
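The feature count quoted above can be checked without the competition files. The following is a minimal sketch on synthetic noise, assuming the 100 Hz sampling rate and the default `signal.welch` settings; the array `synthetic` and its random contents are made up for illustration only.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
n_readings, n_sensors, fs = 60001, 10, 100  # fs = 100 Hz, as stated in the text

# Synthetic stand-in for one segment file (assumption: real data is not used here)
synthetic = rng.normal(size=(n_readings, n_sensors))

# With the default nperseg=256, welch returns 256 // 2 + 1 = 129 frequency bins
freq, psd = signal.welch(synthetic[:, 0], fs)

# Concatenating the PSD of all 10 sensors gives the flat feature vector
features = np.concatenate(
    [signal.welch(synthetic[:, j], fs)[1] for j in range(n_sensors)]
)
print(len(freq), features.shape)
```

The highest frequency bin sits at half the sampling rate (50 Hz), which is why the choice of `fs` only rescales the frequency axis and the power values rather than changing the number of features.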

## Generate Training DataFrame

We'll compute the power spectral density for every segment id and flatten the result into one row per segment. We should end up with an array of 4431 samples by 1290 features.

In [None]:
train_input = []
i=0

for segment in train_labels['segment_id']:
    output_psd = []
    dataframe = pd.read_csv(f'../input/predict-volcanic-eruptions-ingv-oe/train/{segment}.csv')
    for col in data_columns:
        freq, psd = signal.welch(dataframe[col],100)
        output_psd = np.append(output_psd,psd)
    
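    # Appending to the Python list avoids re-copying the whole array on every iteration;
    # np.reshape below still converts the list of rows into a 2-D array
    train_input.append(output_psd)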
    
    i=i+1
    #print('Manipulating segment {}, {} out of {}'.format(segment,i,len(train_labels['segment_id'])))

num_features = len(output_psd)
train_input = np.reshape(train_input,(i,num_features))
print('Finalized input shape: {}'.format(train_input.shape))

In [None]:
print("Shape of training input: {}".format(train_input.shape))
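Since the same per-segment logic is repeated for the test files, it could be wrapped in a small helper. This is only a sketch: the function name `psd_features`, the column names, and the synthetic frames are hypothetical and stand in for the real segment files.

```python
import numpy as np
import pandas as pd
from scipy import signal

def psd_features(frame, columns, fs=100):
    """Concatenate the Welch PSD of every column into one flat vector."""
    return np.concatenate([signal.welch(frame[col], fs)[1] for col in columns])

rng = np.random.default_rng(1)
# Hypothetical column names and random readings, used only for this demo
columns = [f"sensor_{n}" for n in range(1, 11)]
frames = [
    pd.DataFrame(rng.normal(size=(60001, 10)), columns=columns) for _ in range(3)
]

# One row per segment, 129 PSD values per sensor
feature_matrix = np.vstack([psd_features(frame, columns) for frame in frames])
print(feature_matrix.shape)
```

Collecting the rows in a list and stacking once at the end keeps the cost linear in the number of segments.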

## Generate Testing DataFrame

Now we'll do exactly the same for all the test data. The files are read in the order listed in `sample_submission.csv`.

In [None]:
test_sample = pd.read_csv("../input/predict-volcanic-eruptions-ingv-oe/sample_submission.csv")
test_sample.head()

In [None]:
test_input = []
k=0

for segment in test_sample['segment_id']:
    output_psd = []
    dataframe = pd.read_csv(f'../input/predict-volcanic-eruptions-ingv-oe/test/{segment}.csv')
    for col in data_columns:
        freq, psd = signal.welch(dataframe[col],100)
        output_psd = np.append(output_psd,psd)
    
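    # Appending to the Python list avoids re-copying the whole array on every iteration;
    # np.reshape below still converts the list of rows into a 2-D array
    test_input.append(output_psd)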
    
    k=1+k
    #print('Manipulating segment {}, {} out of {}'.format(segment,k,len(test_sample['segment_id'])))


num_features = len(output_psd)
test_input = np.reshape(test_input,(-1,num_features))
print('Finalized input shape: {}'.format(test_input.shape))

In [None]:
print("Shape of test input: {}".format(test_input.shape))

## Training and Test Uniformity and Distribution

There seems to be a discrepancy between the distributions of the two sets. We'll run a two-sample Kolmogorov-Smirnov test (`ks_2samp`) to illustrate that, computing it for each of the 129 features of the 10 sensors and plotting the results per sensor.

In [None]:
from scipy import stats

ks_value = []
p_value = []
index = range(0,1290)

for i in index:
    train = train_input[:,i]
    test = test_input[:,i]
    statistic,pvalue = stats.ks_2samp(train,test)
    ks_value = np.append(ks_value,statistic)
    p_value = np.append(p_value,pvalue)
    
sensor_ks = np.reshape(ks_value,(-1,129))
sensor_p = np.reshape(p_value,(-1,129))

sensor_array = []
for i in range(len(data_columns)):
    sensor_array = np.append(sensor_array,np.full((129),i+1))
    
sensor_df = pd.DataFrame({'sensor_id':sensor_array,'ks_value':ks_value,'p_value':p_value})

fig, axes = plt.subplots(1, 2,figsize=(15, 5))
fig.suptitle("KS Statistic and P-values for Train and Test")
sns.stripplot(ax=axes[0],x='sensor_id',y='ks_value',data=sensor_df)
sns.stripplot(ax=axes[1],x='sensor_id',y='p_value',data=sensor_df)
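To make the two plotted quantities easier to read, here is a minimal sketch of how `ks_2samp` behaves on synthetic data. The three samples are made up: two drawn from the same normal distribution and one shifted by one standard deviation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic samples (assumption: normal data, not the competition features)
train_like = rng.normal(loc=0.0, scale=1.0, size=500)
same_dist = rng.normal(loc=0.0, scale=1.0, size=500)
shifted = rng.normal(loc=1.0, scale=1.0, size=500)

# The statistic is the largest gap between the two empirical CDFs (0 to 1)
stat_same, p_same = stats.ks_2samp(train_like, same_dist)
stat_shift, p_shift = stats.ks_2samp(train_like, shifted)

print("same distribution:    KS={:.3f}, p={:.3g}".format(stat_same, p_same))
print("shifted distribution: KS={:.3f}, p={:.3g}".format(stat_shift, p_shift))
```

A large KS statistic together with a small p-value suggests that a feature is distributed differently in the two sets.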

A few sensors show considerable variation and are distributed differently between the train and test data. Plotting the features of two test segments also shows that we have not yet addressed the `NaN` values within the tables.

In [None]:
fig, axs = plt.subplots(nrows=5, ncols=2)
fig.set_size_inches(20,10)
fig.subplots_adjust(hspace=0.5)

index_3 = test_input[3,:]
index_7 = test_input[7,:]

index_3 = np.reshape(index_3,(10,129))
index_3 = np.transpose(index_3)
index_3 = pd.DataFrame(data=index_3,columns=data_columns)

index_7 = np.reshape(index_7,(10,129))
index_7 = np.transpose(index_7)
index_7 = pd.DataFrame(data=index_7,columns=data_columns)

for col,ax in zip(data_columns, axs.flatten()):
    ax.loglog(freq,index_3[col])
    ax.loglog(freq,index_7[col])
    ax.set_title(col)
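The reshape and transpose used above rely on the flat feature row being laid out sensor by sensor. A small sketch with made-up numbers shows the round trip; the names `per_sensor`, `flat_row`, and `table` are illustrative only.

```python
import numpy as np

n_sensors, n_bins = 10, 129

# Made-up PSD values: row j holds the 129 bins of sensor j
per_sensor = np.arange(n_sensors * n_bins, dtype=float).reshape(n_sensors, n_bins)

# Flattening sensor by sensor mirrors how the feature loops concatenate the PSDs
flat_row = per_sensor.ravel()

# Reshape to (sensors, bins), then transpose so each column is one sensor
table = flat_row.reshape(n_sensors, n_bins).T
print(table.shape)
```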

## Standard Scaler

The best performance I found came from simply replacing the `NaN` values with zeroes. Below I conducted a Kernel PCA to reduce the features and better compare the test and train segments. I had tried a KNN imputer, but there seems to be too much variation between the test and train segments for it to help the model itself.

In [None]:
train_input = np.nan_to_num(train_input)
test_input = np.nan_to_num(test_input)
y_train = train_labels['time_to_eruption']

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(train_input)

X_train = scaler.transform(train_input)
X_test = scaler.transform(test_input)
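The key point in the cell above is that the scaler is fitted on the training data only and then applied to both sets. A minimal sketch on synthetic arrays, with made-up shapes and values, illustrates the effect:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the feature matrices (assumption: not the real data)
train_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 4))
test_demo = rng.normal(loc=6.0, scale=2.0, size=(50, 4))
train_demo[::10, 0] = np.nan  # simulate missing values

# Replace NaN with zero, as in the notebook
train_demo = np.nan_to_num(train_demo)
test_demo = np.nan_to_num(test_demo)

# Fit on train only, then reuse the same mean and variance for test
scaler_demo = StandardScaler().fit(train_demo)
train_scaled = scaler_demo.transform(train_demo)
test_scaled = scaler_demo.transform(test_demo)

print(train_scaled.mean(axis=0).round(3), test_scaled.mean(axis=0).round(3))
```

The scaled training columns have zero mean and unit variance by construction, while the test columns keep whatever offset they have relative to the training data, which is what makes a train/test comparison meaningful.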

## Kernel PCA

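With the high number of features, I used a Kernel PCA to reduce the number of features in order to better visualize the relationship between train and test segments. Note that the cell below uses `kernel="linear"`, which makes the projection equivalent to standard linear PCA; a nonlinear kernel such as `"rbf"` would be needed for a truly nonlinear reduction.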

In [None]:
from sklearn.decomposition import KernelPCA
transformer = KernelPCA(n_components=100,kernel="linear")
train_transformed = transformer.fit_transform(X_train)
test_transformed = transformer.transform(X_test)

combo_input = np.vstack((train_transformed,test_transformed))

train_seg_list = train_labels['segment_id'].to_list()
test_seg_list = test_sample['segment_id'].to_list()
combo_seg = np.array(train_seg_list)
combo_seg = np.append(combo_seg,test_seg_list)
combo_seg
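As a side check on the choice of kernel, the sketch below compares `KernelPCA` with a linear kernel against plain `PCA` on synthetic data. The matrix `X_demo` is made up, and the comparison ignores the sign of each component, which is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA

rng = np.random.default_rng(0)

# Synthetic data with clearly different variances per column (assumption)
X_demo = rng.normal(size=(60, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

linear_kpca = KernelPCA(n_components=2, kernel="linear").fit_transform(X_demo)
plain_pca = PCA(n_components=2).fit_transform(X_demo)

# Components may differ by sign, so compare absolute values
same_up_to_sign = np.allclose(np.abs(linear_kpca), np.abs(plain_pca), atol=1e-6)
print(same_up_to_sign)
```

With a linear kernel the two projections coincide, so any nonlinear structure would only be captured by switching to another kernel.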

In [None]:
from sklearn.covariance import EmpiricalCovariance, MinCovDet

robust_cov = MinCovDet().fit(combo_input[:,:])

m = robust_cov.mahalanobis(combo_input[:,:])
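The values in `m` are used later to color the scatter plot. As a rough illustration of what they measure, here is a sketch on synthetic two-dimensional points with a few planted outliers; the data and the `random_state` value are assumptions for reproducibility of the demo.

```python
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)

# 200 inliers around the origin and 10 planted outliers far away (synthetic)
inliers = rng.normal(size=(200, 2))
outliers = rng.normal(loc=10.0, size=(10, 2))
points = np.vstack((inliers, outliers))

# Robust location and covariance estimate, less influenced by the outliers
robust_cov_demo = MinCovDet(random_state=0).fit(points)

# mahalanobis() returns squared distances to the robust center
distances = robust_cov_demo.mahalanobis(points)
print(np.median(distances[:200]), distances[200:].min())
```

Points far from the bulk of the data receive much larger distances, so they stand out when the distance is mapped to color.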

In [None]:
plt.figure(figsize=(20,12))
plt.title('Train versus Test Scatter of Principal Components')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')

cm = plt.get_cmap('viridis')  # plt.cm.get_cmap is no longer available in recent matplotlib
plt.scatter(train_transformed[:,0],train_transformed[:,1], c=m[:train_input.shape[0]], cmap=cm,s=100)
plt.scatter(test_transformed[:,0],test_transformed[:,1], c=m[train_input.shape[0]:], cmap=cm,marker="P",s=100)
plt.colorbar()

test_x = test_transformed[:,0]
test_y = test_transformed[:,1]

for i,x,y in zip(range(0,len(test_sample['segment_id'])),test_x,test_y):
    if ((x > 50) | (y > 50)):
        label = test_sample['segment_id'][i]
        #plt.annotate(label,(x,y),ha="left")
        
train_x = train_transformed[:,0]
train_y = train_transformed[:,1]
        
for i,x,y in zip(range(0,len(train_labels['segment_id'])),train_x,train_y):
    if ((x > 50) | (y > 50)):
        label = train_labels['segment_id'][i]
        #plt.annotate(label,(x,y),ha="left")
        #outlier_list.append(label)
plt.legend(['train','test'])
plt.show()

In [None]:
outlier_list = []
plt.figure(figsize=(20,12))
plt.title('Train versus Test Scatter of Principal Components')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')

cm = plt.get_cmap('viridis')  # plt.cm.get_cmap is no longer available in recent matplotlib
plt.scatter(train_transformed[:,0],train_transformed[:,1],alpha=.6)
plt.scatter(test_transformed[:,0],test_transformed[:,1], alpha=.4)


test_x = test_transformed[:,0]
test_y = test_transformed[:,1]

for i,x,y in zip(range(0,len(test_sample['segment_id'])),test_x,test_y):
    if ((x > 100) | (y>100)):
        label = test_sample['segment_id'][i]
        plt.annotate(label,(x,y),ha="left")
        
train_x = train_transformed[:,0]
train_y = train_transformed[:,1]
        
for i,x,y in zip(range(0,len(train_labels['segment_id'])),train_x,train_y):
    if ((x > 100) | (y>100)):
        label = train_labels['segment_id'][i]
        plt.annotate(label,(x,y),ha="left")
        outlier_list.append(label)
plt.legend(['train','test'])
plt.show()

In [None]:
# List of unclustered segments.

outlier_list
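The same selection can be written with a boolean mask instead of a loop. This is a small sketch, not the notebook's method: the helper name `segments_beyond` and the demo values are hypothetical.

```python
import numpy as np

def segments_beyond(components, segment_ids, threshold):
    """Return ids whose first or second component exceeds the threshold."""
    components = np.asarray(components)
    mask = (components[:, 0] > threshold) | (components[:, 1] > threshold)
    return np.asarray(segment_ids)[mask].tolist()

# Made-up component values and segment ids for illustration
demo_components = np.array(
    [[5.0, 3.0], [150.0, 2.0], [1.0, 120.0], [-200.0, 0.0]]
)
demo_ids = [11, 22, 33, 44]

demo_outliers = segments_beyond(demo_components, demo_ids, 100)
print(demo_outliers)  # [22, 33]
```

Like the loops above, this only flags large positive values; a segment far out on the negative side (the last demo row) is not selected.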

## Random Forest Regression

We see a wide scatter between the test and training segments. Most pronounced is a high concentration of test segments along the y axis that have no similar training segments. This may be due to the high number of `NaN` values in the test set.

In [None]:
#from sklearn.model_selection import cross_val_score
#from sklearn.model_selection import RepeatedKFold
#from sklearn.ensemble import RandomForestRegressor

# Note: criterion='mae' was renamed to 'absolute_error' in newer scikit-learn releases
#model = RandomForestRegressor(max_features=700,criterion='absolute_error',random_state=42,
#                              max_samples=0.8,n_jobs=-1,min_samples_leaf=3)

#cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)

#n_scores = cross_val_score(model,X_train,y_train,scoring='neg_mean_absolute_error',
#                          cv=cv, n_jobs=-1, error_score='raise',
#                          verbose=10)

#print('MAE: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))
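For readers who want to see the cross-validation pattern run end to end, here is a scaled-down sketch on synthetic data. The data, the small `n_estimators`, and the default split criterion are assumptions chosen to keep the demo fast; they are not the settings of the saved model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic regression problem standing in for X_train / y_train
X_demo = rng.normal(size=(120, 6))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=120)

model_demo = RandomForestRegressor(
    n_estimators=20, min_samples_leaf=3, random_state=42
)

# 5 folds repeated 3 times gives 15 scores
cv_demo = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)
scores = cross_val_score(
    model_demo, X_demo, y_demo, scoring='neg_mean_absolute_error', cv=cv_demo
)

# The scorer returns negative MAE, so flip the sign for reporting
print('MAE: %.3f (%.3f)' % (-np.mean(scores), np.std(scores)))
```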

In [None]:
import joblib
from joblib import dump,load
model = joblib.load('../input/reg-model/original_reg_model.joblib')

In [None]:
model.fit(X_train,y_train)

In [None]:
predictions = model.predict(X_test).astype('int64')
df_submit = test_sample.copy()
df_submit['time_to_eruption'] = abs(predictions)
df_submit.head(10)

In [None]:
print("Minimum event time is: {}".format(df_submit['time_to_eruption'].min()))
print("Maximum event time is: {}".format(df_submit['time_to_eruption'].max()))
df_submit.to_csv('submission.csv',index=False)

The random forest appears to handle the missing sensors better. I've had scores with CV and LB differing by 5 million, but with this model I've at least been able to reduce the gap between CV and LB to less than 3 million. One option may be to explore other feature characteristics that align the train and test sets better.