# 2 Data wrangling

## 2.2 Introduction

This step focuses on collecting your data, organizing it, and making sure it's well defined.

### 2.2.1 Recap Of Data Science Problem

The purpose of this data science project is to predict the time to failure for each small segment of acoustic signal using a data-driven model.

## 2.3 Imports

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# from pandas_profiling import ProfileReport
import numpy as np
import glob, os
pd.set_option("display.precision", 8)
pd.options.display.max_rows = 99
from scipy import signal
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.stattools import acf
%matplotlib inline  

## 2.4 Objectives

There are some fundamental questions to resolve in this notebook before you move on.

* Do you think you may have the data you need to tackle the desired question?
    * Have you identified the required target value?
    * Do you have potentially useful features?
* Do you have any fundamental issues with the data?
* Do your column names correspond to what those columns store?
    * Check the data types of your columns. Are they sensible?
    * Calculate summary statistics for each of your columns, such as mean, median, mode, standard deviation, range, and number of unique values. What does this tell you about your data? What do you now need to investigate?

# 2.5 Load the training data

## 2.5.0 Check complete data every 100 records

In [None]:
%%time
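# NOTE: PATH (the directory containing train.csv) is not defined in this notebook; set it before running this cell
train_df = pd.read_csv(os.path.join(PATH,'train.csv'), dtype={'acoustic_data': np.int16, 'time_to_failure': np.float32})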

In [None]:
train_ad_sample_df = train_df['acoustic_data'].values[::100]
train_ttf_sample_df = train_df['time_to_failure'].values[::100]

def plot_acc_ttf_data(train_ad_sample_df, train_ttf_sample_df, title="Acoustic data and time to failure: 1% sampled data"):
    fig, ax1 = plt.subplots(figsize=(16, 12))
    plt.title(title)
    plt.plot(train_ad_sample_df, color='r')
    ax1.set_ylabel('acoustic data', color='r')
    plt.legend(['acoustic data'], loc=(0.01, 0.95))
    ax2 = ax1.twinx()
    plt.plot(train_ttf_sample_df, color='b')
    ax2.set_ylabel('time to failure', color='b')
    plt.legend(['time to failure'], loc=(0.01, 0.9))
    plt.grid(True)

plot_acc_ttf_data(train_ad_sample_df, train_ttf_sample_df)
del train_ad_sample_df
del train_ttf_sample_df

## 2.5.1 Check a small section of data

In [2]:
filename = '../input/LANL-Earthquake-Prediction/train.csv'


In [3]:
# df_tr_sec =  pd.read_csv(filename,nrows = 6e6)
records = 10_000_000
df_tr_sec =  pd.read_csv(filename,nrows = records)


In [4]:
df_tr_sec.info()

In [6]:
df_tr_sec.describe()

In [7]:
df_tr_sec.head()

In [None]:
fig, ax1 = plt.subplots(1,1,figsize=(16, 12), dpi=100 )
ax2 = ax1.twinx()
ax1.plot(df_tr_sec['acoustic_data'], 'g.')
ax2.plot(df_tr_sec['time_to_failure'], 'b.')


ax1.set_ylabel('Signal', color='g')
ax2.set_ylabel('Time to failure', color='b')
ax2.grid()
plt.title(f'The first {records /1000}k data points');

In [None]:
df_tr_sec.plot('time_to_failure','acoustic_data')
plt.gca().invert_xaxis()

In [None]:
# time2fail_np = df_tr_sec['time_to_failure'].values
# # dt_np = time2fail_np[0:-1] - time2fail_np[1:] 
dt = pd.Series(-df_tr_sec['time_to_failure'].diff(), name='dt')
print(dt.value_counts())
plt.hist(dt.sort_values()[1:])  # drop the smallest value: the jump between two segments, which is about -11
plt.title('Histogram of time sampling')
plt.ylabel('Count')
plt.yscale('log')
plt.xlabel('time sampling');

`acoustic_data` is unevenly distributed.

Judging from `time_to_failure`, the equipment appears to acquire the signal in bursts at a high sampling rate (time step of about 1.1e-9 s) and then pause (about 1e-3 s) before the next acquisition.

The sampling interval is therefore not constant: the distribution of time steps is bimodal, with one small and one large time step.

Try resampling with the large time step later.
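
The two time-step modes described above can be separated by thresholding the differences of `time_to_failure`. The sketch below is a minimal illustration on synthetic data; the function name `split_time_steps` and the `threshold` value are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def split_time_steps(ttf, threshold=1e-6):
    """Split positive time steps into within-burst and between-burst groups.

    `threshold` (seconds) is an assumed cut between the two modes.
    """
    dt = -pd.Series(ttf).diff().dropna()   # time_to_failure decreases, so negate
    dt = dt[dt > 0]                        # ignore the jump between earthquake events
    small = dt[dt < threshold]             # steps inside an acquisition burst
    large = dt[dt >= threshold]            # pauses between bursts
    return small, large

# Synthetic example: 3 bursts of 4 samples, 1e-9 s apart, separated by ~1e-3 s gaps
burst = np.arange(4) * 1e-9
ttf = np.concatenate([1.0 - k * 1e-3 - burst for k in range(3)])
small, large = split_time_steps(ttf)
print(len(small), len(large))
```

On real data the threshold would need to be chosen from the histogram of `dt` rather than fixed in advance.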



## 2.5.2 Truncate training data for each earthquake event

In [None]:
# using chunk to manipulate the raw train data
filenum = 0
j = 0
previous_df = pd.DataFrame()
total_missing = np.array([0,0])
total_records = 0
min_time = 100000

for chunk in pd.read_csv(filename, chunksize = 1e7):
    # count missing value in each column

    total_missing += chunk.isna().sum().to_numpy()
    total_records += len(chunk)
    min_time = np.array([chunk.iloc[-1,1], min_time]).min()
    
    j += 1
    dt_df = chunk[['time_to_failure']].shift(fill_value = chunk.iloc[0]['time_to_failure']) - chunk[['time_to_failure']]
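    # index of the first row of each new earthquake event (time_to_failure jumps back up)
    idx = dt_df[dt_df['time_to_failure'] < 0].index
    
    if len(idx) == 0:
        previous_df = pd.concat([previous_df,chunk])
    else:
        prev_i = chunk.index[0]
        for i in idx:
            # save each earthquake event in a pickle file;
            # .loc slicing is inclusive, so stop at i - 1 to keep row i out of the finished event
            pd.concat([previous_df, chunk.loc[prev_i:i - 1, :]]).to_pickle('D:/DataScience/traindata/train_sec_' + str(filenum) + '.pkl')
            filenum += 1
            prev_i = i
            # rows carried over from earlier chunks belong only to the event just saved
            previous_df = pd.DataFrame()
        previous_df = chunk.loc[i:, :]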

# The last earthquake file    
previous_df.to_pickle('D:/DataScience/traindata/train_sec_' + str(filenum) + '.pkl')

In [None]:
print("nan for two variables: ", total_missing)
print("Total records: ", total_records)
print("Last time of recording: ", min_time)
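
The splitting rule used in 2.5.2, cutting wherever `time_to_failure` jumps back up, can be illustrated on a toy array. This is a hedged sketch: `event_starts` is a hypothetical helper and the values are made up for illustration.

```python
import numpy as np

def event_starts(ttf):
    """Return positions where a new event starts (time_to_failure jumps up)."""
    ttf = np.asarray(ttf, dtype=float)
    # a positive difference means the countdown was reset
    return np.flatnonzero(np.diff(ttf) > 0) + 1

ttf = [0.3, 0.2, 0.1, 0.0, 0.5, 0.4, 0.0, 0.9]
starts = event_starts(ttf)
segments = np.split(np.asarray(ttf), starts)
print(starts, [len(s) for s in segments])
```

Each element of `segments` then holds one event, and the last one may be incomplete if the recording stops before failure.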

## 2.5.3 Check truncated data

In [None]:
def plot_train_segment(fileobject):
    df = pd.read_pickle(fileobject)
    print(df.columns)
    print('Records number:',len(df))
    df.plot(x = 'time_to_failure', y = 'acoustic_data')
    
    plt.gca().invert_xaxis()
    plt.title(fileobject[31:-4])
    plt.savefig("../images/" + fileobject[25:-4] + ".jpg")
    #     plt.close()

In [None]:
pathlist = ['D:/DataScience/traindata/train_sec_' + str(num) + '.pkl' for num in range (0,17)]


In [None]:
# plot the training segment and save figure
for path in pathlist[4:6]:
    plot_train_segment(path)

In [None]:
df_tr_end = pd.read_pickle(pathlist[-1])
df_tr_end.iloc[-5:]

`train_sec_16` is not complete, i.e., it does not record the failure event.
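
A simple way to flag such a segment is to test whether its final `time_to_failure` is close to zero. The helper `reaches_failure` and the tolerance `tol` below are assumptions for illustration, shown on toy frames rather than the real pickles.

```python
import pandas as pd

def reaches_failure(df, tol=1e-3):
    """True if the segment's final time_to_failure is within `tol` seconds of zero."""
    return bool(df["time_to_failure"].iloc[-1] <= tol)

complete = pd.DataFrame({"time_to_failure": [0.3, 0.2, 0.0005]})
cut_short = pd.DataFrame({"time_to_failure": [9.9, 9.8, 9.75]})
print(reaches_failure(complete), reaches_failure(cut_short))
```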

## 2.5.4 Pandas profiling for a training segment

In [None]:
df_tr_sec = pd.read_pickle(pathlist[7])

In [None]:
# requires `from pandas_profiling import ProfileReport`, which is commented out in 2.3 Imports
profile = ProfileReport(df_tr_sec, title="Pandas Profiling Report of the Training Data 6")

In [None]:
# profile.to_widgets()

In [None]:
profile.to_file("../report/training_data_report_6.html")

## 2.5.5 Resampling the training data with 1 ms sampling rate

In [None]:
# power spectral density (square of the amplitude spectrum)
def psd(input_signal):
    f, Pxx_den = signal.periodogram(input_signal, fs = 1/1.1e-9)
    f_dom = f[np.argmax(Pxx_den)]
    return f, Pxx_den, f_dom
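
To see what `signal.periodogram` returns, the sketch below applies it to a synthetic 50 Hz tone. The sampling rate and tone frequency are toy values chosen for illustration; they are not taken from the training data.

```python
import numpy as np
from scipy import signal

fs = 1000.0                       # assumed sampling rate in Hz (toy value)
t = np.arange(1000) / fs          # one second of samples
x = np.sin(2 * np.pi * 50.0 * t)  # 50 Hz test tone

# f runs from 0 up to the Nyquist frequency fs / 2
f, pxx = signal.periodogram(x, fs=fs)
f_dom = f[np.argmax(pxx)]         # frequency bin with the most power
print(f_dom, f.max())
```

The frequency axis always stops at `fs / 2`, so with `fs = 1/1.1e-9` as in `psd` it extends to roughly 4.5e8 Hz.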

In [None]:
def iqr(X):
    Q1 = X.quantile(0.25)
    Q3 = X.quantile(0.75)
    iqr_X = Q3 - Q1
    return iqr_X

In [None]:
def create_feature(X):
    mean_X = X.mean()
    median_X = X.median()
    std_X = X.std()
    iqr_X = iqr(X)
    range_X = np.max(X) - np.min(X)
#     min_X = np.min(X)
#     max_X = np.max(X)
#     f, Pxx_den, f_dom_X = psd(X)
    return [mean_X, median_X, std_X, iqr_X, range_X]

In [None]:
def create_lag_feature(X, periods= 0):
    feature_list = []
    for period in range(1,periods):
        # use the loop variable `period` so that each lag is actually different
        feature_list.extend(create_feature(X.diff(period))) 
    return feature_list

In [None]:
def create_feature_by_smplrt(df, precision):
    if precision == 4:
        col_name = 'ttf_ms'
    elif precision == 6:
        col_name = 'ttf_us'
        
    # Create feature using statistics
    df[col_name] = df['time_to_failure'].round(precision).astype('str')
    df_agg = df.groupby(col_name)['acoustic_data'].agg(create_feature)
    df_agg = pd.DataFrame(df_agg.tolist(), index= df_agg.index, columns = [ 'mean','median','std','iqr','range'])
    
    # prepare feature names of lagging data in time domain
    col_name_by_lag = [i + '_lag' for i in ['mean','median','std','iqr','range']]
    col_names = []
    for j in range(1,11):
        col_names.extend([name + '_'+ str(j) for name in col_name_by_lag])
        
    # Create feature using time lagging on acoustic data
    df_lag_agg = df.groupby(col_name)['acoustic_data'].agg(create_lag_feature, periods = 11)
    df_lag_agg = pd.DataFrame(df_lag_agg.tolist(), index= df_lag_agg.index, columns =col_names)
    
    # Merge features in time domains
    df_time = df_agg.merge(df_lag_agg, how = 'left',left_index = True, right_index = True)

    # Frequency domain statistics in sampling signal
    df_psd = df.groupby(col_name)['acoustic_data'].agg(psd)
    df_psd = pd.DataFrame(df_psd.tolist(), index= df_psd.index, columns = ['freq', 'psd','f_dom'])
    
    # prepare feature names in frequency domain
    fsecs = ['f_sec' + str(num) for num in range(1,11)] # variable names
    df_f = pd.DataFrame(columns = fsecs)
    for index, row in df_psd.iterrows():
        f_dict = dict()
        freq_sec = np.linspace(0,1e8,11)
        PSD = row['psd']
        freq = row['freq']
        area = PSD[freq < 1e8].sum()

        for idx, fsec in enumerate(fsecs):
            row_filter = (freq >= freq_sec[idx]) & (freq < freq_sec[idx+1])
            pzone = PSD[row_filter].sum()    
            f_dict[fsec] = pzone / area
        temp = pd.DataFrame(f_dict,index =[index] )
        # DataFrame.append was removed in pandas 2.0; use pd.concat
        df_f = pd.concat([df_f, temp])
    df_psd = df_psd.merge(df_f, how = 'left', left_index = True, right_index = True)
    
    df_new_smplrt = df_time.merge(df_psd, how = 'left', left_index = True, right_index = True).reset_index()
    df_new_smplrt[col_name] = df_new_smplrt[col_name].astype('float')
    return df_new_smplrt

### 2.5.5.1 Create feature using statistics on 1 ms groups

In [None]:
df_tr_sec['ttf_round'] = df_tr_sec['time_to_failure'].round(4).astype('str')
df_agg = df_tr_sec.groupby('ttf_round')['acoustic_data'].agg(create_feature)
df_agg = pd.DataFrame(df_agg.tolist(), index= df_agg.index, columns = [ 'mean','median','std','iqr','diff'])
# df_agg.reset_index(inplace = True)

df_agg.head()

### 2.5.5.2 Create feature using time lagging on acoustic data

In [None]:
# prepare feature names of lagging data in time domain
col_name_by_lag = [i + '_lag' for i in ['mean','median','std','iqr', 'diff']]

col_names = []
for j in range(1,11):
    col_names.extend([name + '_'+ str(j) for name in col_name_by_lag])
col_names[:10]

In [None]:
df_tr_sec['ttf_round'] = df_tr_sec['time_to_failure'].round(4).astype('str')
df_lag_agg = df_tr_sec.groupby('ttf_round')['acoustic_data'].agg(create_lag_feature, periods = 11)
df_lag_agg = pd.DataFrame(df_lag_agg.tolist(), index= df_lag_agg.index, columns =col_names)
df_lag_agg.head(5).T

### 2.5.5.3 Time domain features compilation

In [None]:
df_tr_agg = df_agg.merge(df_lag_agg, how = 'left',left_index = True, right_index = True)
df_tr_agg.head(5).T

### 2.5.5.4 Frequency domain in 1 ms sampling signal

Plot periodogram (square of amplitude spectrum)

In [None]:
row1 = df_tr_sec['ttf_round'] == df_tr_sec['ttf_round'].unique()[1]
f1, Pxx_den1, f_dom1 = psd(df_tr_sec.loc[row1,'acoustic_data'])

row2 = df_tr_sec['ttf_round'] == df_tr_sec['ttf_round'].unique()[2]
f2, Pxx_den2, f_dom2 = psd(df_tr_sec.loc[row2,'acoustic_data'])

fig, ax_list = plt.subplots(2,1,figsize=(16, 12), dpi=100 )
ax = ax_list[0]
ax.plot(f1,Pxx_den1)
ax.set_ylabel('Power spectral density')
ax.set_xlabel('Frequency (Hz)')
ax.set_title('time to failure: ' + df_tr_sec['ttf_round'].unique()[1] + ' sec' )

ax = ax_list[1]
ax.plot(f2,Pxx_den2)
ax.set_ylabel('Power spectral density')
ax.set_xlabel('Frequency (Hz)')
ax.set_title('time to failure: ' + df_tr_sec['ttf_round'].unique()[2] + ' sec' );

The frequency content lies mainly below 1e8 Hz. Divide this frequency interval into 10 buckets and use each bucket's share of the total power as a feature.

In [None]:
df_psd = df_tr_sec.groupby('ttf_round')['acoustic_data'].agg(psd)
df_psd = pd.DataFrame(df_psd.tolist(), index= df_psd.index, columns = ['freq', 'psd','f_dom'])
df_psd.head()

In [None]:
fsecs = ['f_sec' + str(num) for num in range(1,11)] # variable names
df_f = pd.DataFrame(columns = fsecs)
for index, row in df_psd.iterrows():
#     print(type(row))
    f_dict = dict()
    freq_sec = np.linspace(0,1e8,11)
    PSD = row['psd']
    freq = row['freq']
    area = PSD[freq < 1e8].sum()

    for idx, fsec in enumerate(fsecs):
        row_filter = (freq >= freq_sec[idx]) & (freq < freq_sec[idx+1])
        pzone = PSD[row_filter].sum()    
        f_dict[fsec] = pzone / area
    temp = pd.DataFrame(f_dict,index =[index] )
#     print(temp)
    # DataFrame.append was removed in pandas 2.0; use pd.concat
    df_f = pd.concat([df_f, temp])
df_psd = df_psd.merge(df_f, how = 'left', left_index = True, right_index = True)
df_psd.head().T
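
The bucketing loop above can also be written in vectorized form. The following is a sketch under the same assumptions (10 equal-width bands below 1e8 Hz); `band_power_fractions` is a hypothetical helper, and the input arrays are toy values.

```python
import numpy as np

def band_power_fractions(freq, psd, f_max=1e8, n_bands=10):
    """Share of the power below f_max that falls in each equal-width band."""
    freq = np.asarray(freq, dtype=float)
    psd = np.asarray(psd, dtype=float)
    keep = freq < f_max                           # same cut-off as the loop version
    edges = np.linspace(0.0, f_max, n_bands + 1)
    # a weighted histogram sums the PSD values that fall into each band
    band_power, _ = np.histogram(freq[keep], bins=edges, weights=psd[keep])
    return band_power / psd[keep].sum()

freq = np.array([0.0, 5e6, 1.5e7, 9.5e7, 2e8])
psd = np.array([1.0, 1.0, 2.0, 4.0, 100.0])
fractions = band_power_fractions(freq, psd)
print(fractions)
```

If a group has zero power below `f_max`, the division yields NaN, so such groups would need separate handling.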

### 2.5.5.5 Merge time- and frequency-domain features

Merge the time-domain features `df_tr_agg` with the frequency-domain features `df_psd`.

In [None]:
df_agg_all = df_tr_agg.merge(df_psd, how = 'left', left_index = True, right_index = True).reset_index()
df_agg_all['ttf_round'] = df_agg_all['ttf_round'].astype('float')
df_agg_all.head(5).T

In [None]:
corr_lag = df_agg_all.corr()
mask = np.triu(np.ones_like(corr_lag, dtype=bool))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
f, ax = plt.subplots(figsize=(16, 16))
ax = sns.heatmap(corr_lag, mask=mask, cmap=cmap, vmax=.3, center=0,
                 square=True, linewidths=.5, cbar_kws={"shrink": .5})
ax.set_title('1 ms sampling rate');

In [None]:
df_agg_all.hist(figsize=(16,16))
plt.subplots_adjust(hspace=0.5);

Q: Why do the median and IQR only have a precision of 1?

Because `acoustic_data` is an integer type.
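
A quick check of this explanation on toy integer series (the values are made up for illustration): the median of integers is itself an integer when the group has an odd number of samples, and at most a half-integer when the count is even.

```python
import pandas as pd

odd = pd.Series([3, 4, 6], dtype="int16")       # odd count: middle value
even = pd.Series([3, 4, 7, 9], dtype="int16")   # even count: mean of the two middle values

print(odd.median())
print(even.median())
```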

In [None]:
df_ms = create_feature_by_smplrt(df_tr_sec, 4)
df_ms.head(5).T

### 2.5.5.6 Compare resampling data with raw data in periodogram


In [None]:
df_ms.ttf_ms.diff().mean()

In [None]:
# psd()
# def psd(input_signal):
#     f, Pxx_den = signal.periodogram(input_signal, fs = 1/1.1e-9)
#     f_dom = f[np.argmax(Pxx_den)]
#     return f, Pxx_den, f_dom
f_ms, Pxx_den_ms_mean = signal.periodogram(df_ms['mean'], fs = 1/0.00106)
_, Pxx_den_ms_median = signal.periodogram(df_ms['median'], fs = 1/0.00106)


Plot periodogram of 1 ms sampling data

In [None]:
# plot
fig, ax_list = plt.subplots(2,1,figsize=(16, 12), dpi=100 )
ax = ax_list[0]
ax.plot(f_ms, Pxx_den_ms_mean)
ax.set_ylabel('Power spectral density')
ax.set_xlabel('Frequency (Hz)')
ax.set_title('1 ms sampling data by mean')

ax = ax_list[1]
ax.plot(f_ms, Pxx_den_ms_median)
ax.set_ylabel('Power spectral density')
ax.set_xlabel('Frequency (Hz)')
ax.set_title('1 ms sampling data by median' );

Plot periodogram for raw data

In [None]:
w = np.linspace(1, 500, 100_000)
pgram = signal.lombscargle(df_tr_sec['time_to_failure'], df_tr_sec['acoustic_data'], w)


In [None]:
fig, ax_w = plt.subplots()
ax_w.plot(w/2/np.pi, pgram)
ax_w.set_xlabel('Frequency (Hz)')
ax_w.set_ylabel('Amplitude')
ax_w.set_title('Raw data periodogram' );
# plt.show()

In [None]:
range_cols = [col for col in df_ms.columns if 'range' in col]
df_ms[range_cols].head()

In [None]:
corr_lag = df_ms.corr()
mask = np.triu(np.ones_like(corr_lag, dtype=bool))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
f, ax = plt.subplots(figsize=(16, 16))
ax = sns.heatmap(corr_lag, mask=mask, cmap=cmap, vmax=.3, center=0,
                 square=True, linewidths=.5, cbar_kws={"shrink": .5})
ax.set_title('1 ms sampling rate');

## 2.5.6 Resampling the training data with 1 us sampling rate

In [None]:
df_us = create_feature_by_smplrt(df_tr_sec, 6)
df_us.head(5).T

In [None]:
corr_lag = df_us.corr()
mask = np.triu(np.ones_like(corr_lag, dtype=bool))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
f, ax = plt.subplots(figsize=(16, 16))
ax = sns.heatmap(corr_lag, mask=mask, cmap=cmap, vmax=.3, center=0,
                 square=True, linewidths=.5, cbar_kws={"shrink": .5})
ax.set_title('1 us sampling rate');

Compare periodogram of 1 us sampling data and original data

## 2.5.7 Rolling window to get statistic features

In [5]:
windows = [10, 50, 100, 500]
for i in range(len(windows)):
    df_tr_sec["rolmean_" + str(windows[i])] = df_tr_sec['acoustic_data'].rolling(windows[i], center = True).mean()
    df_tr_sec["rolstd_" + str(windows[i])] = df_tr_sec['acoustic_data'].rolling(windows[i], center = True).std()
df_tr_sec.info()


In [9]:
# Choose a window size
fig, axs = plt.subplots(len(windows),1,figsize=(16, len(windows) * 5))
fig.subplots_adjust(hspace = .5)
axs = axs.ravel()
for i in range(len(windows)):
    roll_smpl = df_tr_sec.loc[1_100_000:1_110_000]
    axs[i].plot(roll_smpl.index.values, roll_smpl['rolmean_' + str(windows[i])].values, color = 'b',label = 'rolmean_' + str(windows[i]))
#         axs[i].plot(df_tr_sec['rolmean'] + 2 * df_tr_sec['rolstd'], 'r:',df_tr_sec['rolmean'] - 2 * df_tr_sec['rolstd'], 'r:',label = 'rolling_mean + std')
    axs[i].fill_between(roll_smpl.index.values,
                        roll_smpl['rolmean_' + str(windows[i])] - roll_smpl['rolstd_' + str(windows[i])], 
                        roll_smpl['rolmean_' + str(windows[i])] + roll_smpl['rolstd_' + str(windows[i])],
                        facecolor='lightgreen', alpha = 0.5, label='rolstd_' + str(windows[i]))
    axs[i].legend()
    axs[i].set_xlabel('index')
    axs[i].set_ylabel('Acoustic signal')
    axs[i].set_title("Rolling window = %.i" % windows[i] )
#     axs[i].set_ylim((-500, 500))

In [6]:
w =50
arfun = lambda x: x.autocorr()
df_tr_sec["rolmean_" + str(w)] = df_tr_sec['acoustic_data'].rolling(w, center = True).mean()
df_tr_sec["rolstd_" + str(w)] = df_tr_sec['acoustic_data'].rolling(w, center = True).std()
df_tr_sec["rolkurt_" + str(w)] = df_tr_sec['acoustic_data'].rolling(w, center = True).kurt()
df_tr_sec["rolskew_" + str(w)] = df_tr_sec['acoustic_data'].rolling(w, center = True).skew()
df_tr_sec["rolquantile_25_" + str(w)] = df_tr_sec['acoustic_data'].rolling(w, center = True).quantile(0.25)
df_tr_sec["rolquantile_50_" + str(w)] = df_tr_sec['acoustic_data'].rolling(w, center = True).median()
df_tr_sec["rolquantile_75_" + str(w)] = df_tr_sec['acoustic_data'].rolling(w, center = True).quantile(0.75)
df_tr_sec["rolIQR_" + str(w)] = df_tr_sec["rolquantile_75_" + str(w)] - df_tr_sec["rolquantile_25_" + str(w)]
df_tr_sec["rolmin_" + str(w)] = df_tr_sec['acoustic_data'].rolling(w, center = True).min()
df_tr_sec["rolmax_" + str(w)] = df_tr_sec['acoustic_data'].rolling(w, center = True).max()
df_tr_sec["rolsum_" + str(w)] = df_tr_sec['acoustic_data'].rolling(w, center = True).sum()
df_tr_sec["rolautocorr_" + str(w)] = df_tr_sec['acoustic_data'].rolling(w, center = True).apply(arfun)

# create auto correlation features with lag
# for col in df_tr_sec.columns[2:]:
#     for n in range(11):
#         df_tr_sec[col + "autocorr_lag_" + str(n)] = df_tr_sec[col].autocorr(lag = n)  
df_tr_sec.info()
# save the data
df_tr_sec.to_pickle('./training_section.pickle')


In [8]:
# save the data
df_tr_sec.to_pickle('./training_section.pickle')

In [None]:
df_tr_sec.columns

In [None]:
# df_tr_sec.iloc[25:-24]['acoustic_data']
plot_acf(df_tr_sec.iloc[25:-24]['acoustic_data'], lags = 10)
# plot_acf(df_tr_sec.iloc[25:-24]['acoustic_data'], lags=10)  # rolmean_50

In [None]:
plot_acf(df_tr_sec.iloc[25:-24]['rolmean_50'], lags = 10)
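
The `iloc[25:-24]` slice drops the rows where a centered rolling window of 50 has no complete window and therefore yields NaN. The sketch below counts those rows on a toy series; `valid_span` is a hypothetical helper introduced for illustration.

```python
import numpy as np
import pandas as pd

def valid_span(n, window):
    """Return (leading NaN count, trailing NaN count) for a centered rolling mean."""
    rolled = pd.Series(np.arange(n, dtype=float)).rolling(window, center=True).mean()
    valid = np.flatnonzero(rolled.notna().to_numpy())
    return int(valid[0]), int(n - 1 - valid[-1])

lead, trail = valid_span(200, 50)
print(lead, trail)
```

For an even window the NaN rows are not split evenly, which is why the slice is asymmetric.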


In [None]:
corr = df_tr_sec.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
f, ax = plt.subplots(figsize=(16, 16))
ax = sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
                 square=True, linewidths=.5, cbar_kws={"shrink": .5})
ax.set_title('rolling features');

# 2.6 Load the testing data

In [None]:
testfilepath = "../../../rawdata/2ndCapstone/test/"
os.chdir(testfilepath)
testfiles = glob.glob("*.csv")
print(testfiles[0:5])

In [None]:
def plot_test_file(fileobject):
    df = pd.read_csv(fileobject)
    print(df.describe())
    df.plot(y = 'acoustic_data')
    plt.title(fileobject[0:-4])

In [None]:
df_te = pd.read_csv(testfiles[0])
df_te.head()

In [None]:
for testfile in testfiles[0:3]:
    print(len(testfile))
    plot_test_file(testfile)
    

# 2.7 Save data

In [None]:
# save the data to a new pickle file
df_ms.to_pickle('../data/train_ms_section.pkl')

In [None]:
df_us.to_pickle('../data/train_us_section.pkl')

In [None]:
# save the last (incomplete) segment, as at the end of section 2.5.2
previous_df.to_pickle('D:/DataScience/traindata/train_sec_' + str(filenum) + '.pkl')

# 2.8 Summary

For the training data:
* The data has no missing values in either column.
* The data has only two columns: `acoustic_data` (integer) and `time_to_failure` (float). `time_to_failure` is our target.
* The data has 629,145,480 records, too many to load and process at once.
* `acoustic_data` is the acoustic emission signal (amplitude), consisting of many peaks and troughs.
* `time_to_failure` periodically decreases from some value to zero. The data should be split into events based on it.
* 17 segments were obtained by processing the raw data in chunks.
* The training segments have different durations.
* The first segment covers only about two seconds before the earthquake. The last one is incomplete because its final `time_to_failure` is 9.75 s, far from zero.
* In each training segment, the earthquake occurs soon after `acoustic_data` reaches an extremely large event, at which point `time_to_failure` is less than 0.5 s.

For the testing data:
* Each file has only one column, `acoustic_data`.
* Each segment contains 150k records, a small slice of a complete earthquake event. By comparison, a training segment holds roughly 200 times as many records as a test segment. This may guide how features are extracted from the training data.
* As in the training data, the average `acoustic_data` of a test segment is small, while some abnormal values exist. This determines what makes a useful feature.
* Most test segments may not contain an extremely large event with an absolute amplitude over 1000. They contain peaks on the order of several hundred in amplitude.
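
Since each test file holds 150k records, one way to align the training data with it is to cut a training segment into windows of the same length. The sketch below is only an illustration: `iter_windows` is a hypothetical helper, and taking the `time_to_failure` of the last row as the window's target is an assumption, not something stated in this notebook.

```python
import numpy as np
import pandas as pd

def iter_windows(df, rows=150_000):
    """Yield (acoustic window, time_to_failure at the window's last row)."""
    n_full = len(df) // rows            # the incomplete tail is dropped
    for k in range(n_full):
        window = df.iloc[k * rows:(k + 1) * rows]
        yield window["acoustic_data"], window["time_to_failure"].iloc[-1]

# Toy frame with 10 rows, cut into windows of 4 rows
toy = pd.DataFrame({
    "acoustic_data": np.arange(10),
    "time_to_failure": np.linspace(0.9, 0.0, 10),
})
pairs = list(iter_windows(toy, rows=4))
print(len(pairs), [round(target, 1) for _, target in pairs])
```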

7/11/2022
For the training data:

* Resampling uses `groupby` on `time_to_failure` rounded to a precision of 4 decimal places.
* Time lagging was tried on `acoustic_data` to create features. However, which time step is optimal remains an open question.
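
One possible heuristic for the lag question, offered here only as a sketch and not as a result from this notebook, is to compare the autocorrelation at several lags and inspect where it peaks. `autocorr_by_lag` is a hypothetical helper and the signal is a synthetic sine with a period of 8 samples.

```python
import numpy as np
import pandas as pd

def autocorr_by_lag(x, max_lag=10):
    """Autocorrelation of x at lags 1..max_lag, indexed by lag."""
    s = pd.Series(x, dtype=float)
    return pd.Series({lag: s.autocorr(lag=lag) for lag in range(1, max_lag + 1)})

t = np.arange(400)
x = np.sin(2 * np.pi * t / 8)   # toy signal with a period of 8 samples
ac = autocorr_by_lag(x, max_lag=10)
print(ac.idxmax())
```

On real acoustic data the autocorrelation is unlikely to be this clean, so any lag chosen this way would still need to be validated against model performance.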