# LANL: Review
***
In May/June 2019, I completed my first Kaggle competition - the Earthquake Prediction challenge hosted by Los Alamos National Laboratory (LANL). It was a great chance to take some rather freshly acquired Python and ML knowledge out for a spin. Admittedly, I did it in a bit of a rush as I joined the competition late. Also, I didn't really take the time to look back and document the lessons learnt (and make sense of all the mess I created as a newbie).

So 8 months later, here I am... (I have also yet to complete another competition and thought this would be a good way to get back into it!)

## What I did ... in a nutshell

For me, this competition was mostly about feature engineering - I knew almost nothing about seismology and acoustic data processing, but at least I had a kitchen sink full of statistical functions to throw at the data. I submitted a very simple ML model - an Elastic Net with minimal hyperparameter tuning. I did play around with a neural network, but couldn't really improve performance all that much and the deadline was looming...

I did get in the top 10% and nabbed a Bronze medal, jumping up 2875 spots to 443rd place when the leaderboard was finalised with all the test data. Maybe the fact that I joined late worked in my favour, as I would probably have ended up overfitting my model if I'd had more time.

Anyway, here's a review of what I did (quite a bit more readable than my actual submission here: https://www.kaggle.com/slashie/lanl-final), plus a little bonus XGBoost exploration at the end, because I've always been itching to know if XGBoost could've really *boosted* my score :)

(I also placed some of the more detailed code in this utility script here https://www.kaggle.com/slashie/lanl-udf to keep this notebook relatively tidy)

In [None]:
# Bread and butter
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
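# 'seaborn-darkgrid' was renamed to 'seaborn-v0_8-darkgrid' in matplotlib 3.6; use whichever is available
_style = 'seaborn-v0_8-darkgrid' if 'seaborn-v0_8-darkgrid' in plt.style.available else 'seaborn-darkgrid'
plt.style.use(_style)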
%matplotlib inline

# ML modelling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.metrics import mean_absolute_error
import xgboost as xgb

# Utility module: my user-defined functions | see: https://www.kaggle.com/slashie/lanl-udf
#import lanl_udf
from lanl_udf import *
#help(lanl_udf)

## Exploring the data
***
First, let's see what's living in the data directory (I've noticed the Kaggle directory setup has changed slightly since last time...):

In [None]:
data_dir = '/kaggle/input/LANL-Earthquake-Prediction'
print(os.listdir(data_dir)) # let's see what's in the data directory!

### Inspecting training data

The training dataset in `'train.csv'` is very large and should be extracted in batches/chunks.

*I still remember trying to import the whole training dataset in one go and running out of memory! Luckily `pandas.read_csv()` has pretty good documentation and so I eventually figured out how to limit the number of rows extracted! #noobmoment*
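As an aside, `pandas.read_csv()` can also hand back the file piece by piece via its `chunksize` argument. Here's a minimal sketch of that idea; the tiny in-memory CSV and its values are made up so the snippet runs anywhere, and it is not what I used in the competition.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for train.csv (hypothetical values, 10 rows)
csv_text = "acoustic_data,time_to_failure\n" + "\n".join(
    f"{i % 7},{1.0 - i * 0.01:.2f}" for i in range(10)
)

chunk_sizes = []
# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading everything into memory at once
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunk_sizes.append(len(chunk))

print(chunk_sizes)  # 10 rows read in chunks of 4 -> [4, 4, 2]
```

On the real file, the same loop with `chunksize=150000` would yield one segment-sized dataframe per iteration.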

In [None]:
n_obs = 150000 # number of obs to extract in one go (corresponds with how test data is set up - see test data section below)
n_skip = 4 * (10 ** 5) # number of rows to skip (i.e. to be able to look at different data sections, not just one!)
train_path = os.path.join(data_dir,'train.csv')
sample = pd.read_csv(train_path, 
                     nrows=n_obs, header=None, skiprows=n_skip) # header set to None, else values will be set as column names when skipping
sample.columns = ['acoustic_data','time_to_failure']
sample.head()

At first glance at the dataframe head, we see some variation in `acoustic_data`, but none in `time_to_failure`.

When we plot it out below, we can see that `acoustic_data` is quite volatile, whereas the `time_to_failure` decreases in a consistent step-wise linear manner. 

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, facecolor='white', figsize=(14,7))
sample.acoustic_data.plot(linewidth=0.5, ax=ax[0])
ax[0].set_ylabel('Acoustic Data',fontsize=12)
sample.time_to_failure.plot(linewidth=1.5, ax=ax[1])
ax[1].set_ylabel('Time to Failure (seconds)',fontsize=12)
plt.show()

### Inspecting test files
By looking at files in the `test` folder, it becomes clear that the training data should be split into chunks of 150000 rows. This way the training data is chunked into segments that correspond in size to the test data files. For each of these test files, we eventually have to generate a *single* prediction of `time_to_failure` using the patterns in the `acoustic_data`.

Given we have to generate only one prediction for each segment in the `test` folder, that means we don't need to retain all the `time_to_failure` values from each training data segment as a training y-variable. We just need the last `time_to_failure` value.

*When I was doing this the first time, it took me a while to realise this, but I eventually understood after looking through some of the public notebooks for tips (they are a great learning resource for beginners)... #noobmoment*
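To make the "one target per segment" idea concrete, here's a toy sketch. The segment length of 5 stands in for 150000, and the dataframe values are invented for illustration.

```python
import numpy as np
import pandas as pd

seg_len = 5  # stand-in for the real segment length of 150000

# Toy training data: 12 rows with time_to_failure counting down to 0
toy = pd.DataFrame({
    'acoustic_data': np.arange(12),
    'time_to_failure': np.linspace(1.1, 0.0, 12),
})

n_segments = len(toy) // seg_len  # leftover rows at the end are dropped
targets = []
for k in range(n_segments):
    seg = toy.iloc[k * seg_len:(k + 1) * seg_len]
    # The y-value for a segment is its LAST time_to_failure value
    targets.append(seg['time_to_failure'].iloc[-1])

print(n_segments, np.round(targets, 2))
```

The features for each segment would be computed from `seg['acoustic_data']`, giving one row of X per target.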

In [None]:
test_path = os.path.join(data_dir,'test')
test_files = os.listdir(test_path)
test_file_num = 28
test_sample = pd.read_csv(os.path.join(data_dir,'test',test_files[test_file_num]))
print("There are %d rows in each test file" %(test_sample.shape[0]))
display(test_sample.head())

## Training Data Setup
***
### Feature Generating Functions

My general approach to feature engineering was to summarise the `acoustic_data` values in a given 150000-row segment using statistical features such as *autocorrelation*, *volatility*, *concentration* etc. I also attempted to read up on frequency measures from the field of signal processing and found an approach to compute the sine-based frequency of a series using the `numpy.fft` module.
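My actual frequency measure lives in the utility script, so the snippet below is only a generic sketch of the idea: take the FFT of a series and report the frequency of the strongest sine component. The function name `dominant_frequency` and the test signal are assumptions made for illustration.

```python
import numpy as np

def dominant_frequency(x, sample_rate=1.0):
    """Frequency (cycles per unit time) of the strongest non-constant sine component."""
    x = np.asarray(x, dtype=float)
    # Subtract the mean so the zero-frequency (constant) bin does not dominate
    spectrum = np.abs(np.fft.rfft(x - x.mean()))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]

# Toy signal: a pure sine wave at 0.05 cycles per sample
t = np.arange(1000)
signal = np.sin(2 * np.pi * 0.05 * t)
f = dominant_frequency(signal)
print(f)
```

For a real acoustic segment the spectrum is much messier, so a single peak frequency is a fairly crude summary.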

My user-defined functions are in the `lanl_udf.py` module, which is imported at the beginning. Using my functions, as well as some `scipy` summary statistics functions, I generated the following 14 features in each 150000-long segment of `acoustic_data`:
1. *Mean*
2. *Standard deviation*
3. *Autocorrelation (first order)*
4. *Log of skewness squared*
5. *Log of kurtosis*
6. *Autocorrelation (first order) of first differences*
7. *Mean of absolute deviations*
8. *Geometric mean of absolute deviations*
9. *Harmonic mean of absolute deviations*
10. *Fraction of sum of absolute deviations in top 500 observations*
11. *Fraction of sum of absolute deviations in top 25000 observations*
12. *Fraction of absolute deviations above 750*
13. *Fraction of observations equal to the mode*
14. *Wave frequency measure*

With most of these features I tried to capitalise on the fact that the autocorrelation of `acoustic_data` tended to fall and volatility (as well as extreme values) would spike as `time_to_failure` approached 0 (as will be shown shortly). I experimented with a lot more than this, but many other features were often almost perfectly correlated with one of the 14 features above and thus redundant. 
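To give a flavour of two of these features, here's a rough sketch of a first-order autocorrelation and a "fraction of absolute deviations in the top k observations" measure. These are not the functions from `lanl_udf.py`; the names `autocorr1` and `frac_top_k`, and the choice to measure deviations from the segment mean, are assumptions.

```python
import numpy as np

def autocorr1(x):
    """First-order autocorrelation: correlation between x[t] and x[t+1]."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-1], x[1:])[0, 1]

def frac_top_k(x, k):
    """Share of the total absolute deviation (from the mean) held by the k largest deviations."""
    x = np.asarray(x, dtype=float)
    dev = np.abs(x - x.mean())
    return np.sort(dev)[-k:].sum() / dev.sum()

rng = np.random.default_rng(0)
smooth = np.cumsum(rng.normal(size=5000))  # random walk: strongly autocorrelated
noisy = rng.normal(size=5000)              # white noise: autocorrelation near 0
spiky = noisy.copy()
spiky[::1000] += 100                       # add a handful of extreme values

print(autocorr1(smooth), autocorr1(noisy))
print(frac_top_k(noisy, 5), frac_top_k(spiky, 5))
```

The spiky series concentrates far more of its total deviation in its top few observations, which is the kind of signal the concentration features were meant to pick up.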

### Training Data Extraction
To increase the number of observations in my training data, I extracted overlapping (rather than mutually exclusive) training segments. To implement this, I iterated through the training dataset multiple times (5 to be exact), each time starting at a different point via the `skiprows` argument of `pandas.read_csv()`, so that each pass yields a different set of segments. This is implemented below by the `gen_training_data()` function from the `lanl_udf.py` module.
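The overlapping-pass idea can be sketched with plain index arithmetic. The helper `segment_starts` below is hypothetical, and the evenly spaced offsets are an assumption; the real offsets are whatever `gen_training_data()` uses.

```python
def segment_starts(n_rows, seg_len, n_passes):
    """Start rows of all full-length segments over several offset passes."""
    step = seg_len // n_passes  # assumed: passes are evenly spaced within one segment length
    starts = []
    for p in range(n_passes):
        offset = p * step  # rows to skip at the start of this pass
        # Only keep segments that fit entirely inside the data
        starts.extend(range(offset, n_rows - seg_len + 1, seg_len))
    return sorted(starts)

# Toy example: 20 rows, segments of 10 rows, 5 passes
starts = segment_starts(n_rows=20, seg_len=10, n_passes=5)
print(starts)  # [0, 2, 4, 6, 8, 10]
```

A single pass over these 20 rows would give only 2 segments; the 5 offset passes give 6, at the cost of the segments sharing most of their rows.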

In [None]:
X_names = ['mean','stdev','AC(1)','log(skew^2)','log(kurt)','AC(1)_diff',
           'mean(abs_dev)','gmean(abs_dev)','hmean(abs_dev)', 'frac_top500', 
           'frac_top25000', 'frac_dev>750', 'frac_eq_mode', 'wave_freq']
y_name = 'time_to_failure'
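try:
    # Reuse previously generated features if available (local copy first, then the Kaggle dataset copy)
    try:
        df_train = pd.read_csv('df_train.csv')
    except FileNotFoundError:
        df_train = pd.read_csv('/kaggle/input/lanl-review/df_train.csv')
    X_train = df_train[X_names].values
    y_train = df_train[y_name].values
except (FileNotFoundError, KeyError):
    # Otherwise regenerate the features from the raw training data (slow)
    X_train, y_train = gen_training_data(train_path)
    df_train = pd.DataFrame(X_train, columns=X_names)
    df_train[y_name] = y_train
# Cache the features so the next run can skip the extraction step
df_train.to_csv('df_train.csv', index=False)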
df_train.info()

## Visual Analysis
***
### Correlation of each feature with the target

As I trialled various features, I looked at each of their individual correlations with the target in search of something that would provide some predictive power.

In [None]:
corr_Xy = [np.corrcoef(y_train,X_train[:,i])[0,1] for i in range(X_train.shape[1])]
df = pd.DataFrame({'feature':X_names, 'corr_w_target':corr_Xy}).set_index('feature')
fig = plt.figure(facecolor='white', figsize=(14,7))
df['corr_w_target'].plot(kind='bar', fontsize=12)
plt.xlabel(None)
plt.xticks(rotation=45)
plt.show()

### Feature Correlogram

I was also worried about multi-collinearity across my features, so I regularly checked the feature correlogram as I added new features. In other words, I wanted each new feature I experimented with to bring *new information* to the table i.e. it would not be too strongly correlated with other features (but still correlated somewhat with the y-variable).

To be honest, looking at some of the correlations in the figure below, I'm not sure if I succeeded. What I should have done is to extract as many features as possible and apply Principal Component Analysis to extract the key components of variation. Instead, I just went ahead and hoped some form of regularization would take care of redundant features later on...

In [None]:
X_df = df_train[X_names]
fig = plt.figure(facecolor='white', figsize=(12,10))
hm = sns.heatmap(X_df.corr(), cmap='viridis')
hm.tick_params(labelsize=12)
plt.xticks(rotation=45)
plt.show()
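For reference, the PCA route mentioned above could look roughly like the sketch below. I did not do this in the competition; the toy feature matrix and the 99% variance threshold are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 2))
# Toy stand-in for collinear features: 4 columns, but only 2 independent sources of variation
X_toy = np.column_stack([
    base[:, 0],
    2 * base[:, 0] + 0.01 * rng.normal(size=500),
    base[:, 1],
    -base[:, 1] + 0.01 * rng.normal(size=500),
])

# Standardise first, then keep as many components as needed to explain 99% of the variance
pca_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.99, svd_solver='full')),
])
X_reduced = pca_pipe.fit_transform(X_toy)
n_kept = pca_pipe.named_steps['pca'].n_components_
print(X_toy.shape, '->', X_reduced.shape)
```

The reduced components are uncorrelated by construction, which sidesteps the multi-collinearity worry, although the individual coefficients become harder to interpret.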

## The Model
***
### Evaluating model with train-test splits

Instead of setting up a cross-validation K-fold type environment, I went for a less elegant approach to test the model's predictive stability. I basically ran it multiple times with different `random_state` arguments in `train_test_split()`. I also did it manually back then (not even using the for-loop approach below), picking whichever number came to my head (not the best, I know). The things I was looking out for:
* The R squared
* The Mean Absolute Error - which was the competition's evaluation metric
* Parameter Estimates - must be my econometrics background, but I really cared about those! 

Anyway, I manually tried different linear models with different hyperparameters, dabbled unsuccessfully with a neural network, but in the end, my *super-scientific* methods led me to the model below!

The `ElasticNet` was quite good because it muted the large coefficient estimates on some of the features that I got when I ran a simple `LinearRegression`. This likely happened because of multi-collinearity i.e. very high correlation among some of my features, as can be seen in the correlogram above. Principal Components Analysis (PCA) could have helped, but I went ahead with the `ElasticNet` as a quicker solution given the time pressure. You can see the `0` coefficient values where the `ElasticNet` did its job and thereby improved predictive power by making the model more generalizable.
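The shrinkage effect is easy to reproduce on synthetic data. The sketch below is a toy illustration, not my competition data: two nearly identical features plus one irrelevant feature, with `alpha=0.1` picked arbitrarily.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression

rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)  # almost an exact copy of x1
x3 = rng.normal(size=n)              # unrelated to the target
X_toy = np.column_stack([x1, x2, x3])
y_toy = 3.0 * x1 + 0.5 * rng.normal(size=n)

ols = LinearRegression().fit(X_toy, y_toy)
enet = ElasticNet(alpha=0.1).fit(X_toy, y_toy)

# With near-duplicate features, OLS coefficients on x1 and x2 are poorly pinned down,
# while the ElasticNet penalty pulls them towards similar, moderate values
print("OLS coefficients:       ", np.round(ols.coef_, 3))
print("ElasticNet coefficients:", np.round(enet.coef_, 3))
```

Comparing the two printed rows shows how the penalised model splits the weight between the two near-duplicates instead of letting them drift apart.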

In [None]:
for state in [7, 42, 88, 101]:
    print("Random state is %d"%state,"\n","-"*30)
    X_fit, X_eval, y_fit, y_eval = train_test_split(X_train, y_train, test_size = 0.3, random_state=state)
    steps = [('scaler', StandardScaler()),
            ('reg', ElasticNet(alpha=0.01))]
    pipeline = Pipeline(steps)
    pipeline.fit(X_fit, y_fit)
    y_pred = pipeline.predict(X_eval)
    print("R^2: {}".format(pipeline.score(X_eval, y_eval)))
    MAE = mean_absolute_error(y_eval,y_pred)
    print("Mean Absolute Error: {}".format(MAE))
    print(pipeline.named_steps['reg'].coef_,'\n')
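For comparison, the more conventional K-fold setup I skipped would look something like this. It's a sketch on synthetic data (so it runs without the competition files), using the same scaler plus `ElasticNet(alpha=0.01)` pipeline; the 5 folds are an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (X_train, y_train)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.0]) + 0.1 * rng.normal(size=200)

model = Pipeline([('scaler', StandardScaler()),
                  ('reg', ElasticNet(alpha=0.01))])
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Each fold is held out once; the scaler is refit on the training folds only
scores = cross_validate(model, X_toy, y_toy, cv=cv,
                        scoring=('r2', 'neg_mean_absolute_error'))
r2_per_fold = scores['test_r2']
mae_per_fold = -scores['test_neg_mean_absolute_error']  # flip sign back to a positive MAE
print("R^2 per fold:", np.round(r2_per_fold, 3))
print("MAE per fold:", np.round(mae_per_fold, 3))
```

The spread of the per-fold scores plays the same role as eyeballing results across several `random_state` values, but every observation gets used for evaluation exactly once.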

### Training the full model

This was the best I could do, so I proceeded to fit the model on all the training data, taking a peek at the fitted parameters to make sure they didn't change in an unexpectedly crazy way. They did not.

In [None]:
pipeline.fit(X_train, y_train)
print(pipeline.named_steps['reg'].coef_)

## Making and Submitting Predictions
***
### Generating test data

Here, the `gen_test_data()` function from my utility module `lanl_udf` is applied to extract the same set of features as above from the `acoustic_data` in each of the test segments. This works very similarly to the `gen_training_data()` function, except that, being *unseen* test data, we don't have any `time_to_failure` y-data to extract. Instead, we need to keep track of the name of each test segment file using `seg_id`, so that we have an identifier for the Kaggle submission file.

In [None]:
seg_id, X_test = gen_test_data(test_path)
print("Generated features for %d test segments"%len(seg_id))

### Predictions

Now we can use the model to generate predictions of `time_to_failure` for each test segment and construct the Kaggle `submission.csv` file. I also adjusted any negative predictions to equal 0, as negative `time_to_failure` does not make sense.

In [None]:
elnet_pred = pipeline.predict(X_test)
elnet_pred[elnet_pred<0] = 0
elnet_submit_df = pd.DataFrame({'seg_id': seg_id, 'time_to_failure': elnet_pred})
#elnet_submit_df.to_csv('submission.csv', index=False)

## Trying out XGBoost

With this competition, I felt like more than 90% of my time went into the feature engineering part and I didn't get to play around with other ML models - particularly `XGBoost`, which is what all the cool kids use. I always wondered whether using `XGBoost` could have gotten me a much better score, holding my training data and feature set fixed. Well... let's find out!

***
### Quick parameter tune-up

I'm not going to spend a lot of time fine-tuning here since the competition is long done, but I still want to use something that's reasonable. The parameters I will *roughly* fine-tune are: `max_depth`, `eta` (learning rate) and `num_boost_round`, using the mean MAE on the held-out folds across 4-fold cross-validation.

In [None]:
xgb_train = xgb.DMatrix(data=X_train, label=y_train)
all_results = {'MAE-test':[], 'MAE-train':[], 'max_depth':[], 'eta':[], 'num_boost_round':[]}
for max_depth in [10, 12, 14]:
    for eta in [0.2, 0.25, 0.3]:
        for num_boost in [9, 11, 13]:
            params = {"objective":"reg:squarederror", "max_depth":max_depth, "eta":eta}
            cv_results = xgb.cv(dtrain=xgb_train, params=params, nfold=4, num_boost_round=num_boost, metrics="mae", seed=42)
            all_results['max_depth'].append(max_depth)
            all_results['eta'].append(eta)
            all_results['num_boost_round'].append(num_boost)
            all_results['MAE-test'].append(cv_results['test-mae-mean'].values[-1])
            all_results['MAE-train'].append(cv_results['train-mae-mean'].values[-1])
all_results_df = pd.DataFrame(all_results).sort_values(by='MAE-test')
all_results_df.head()

### Hyperparameter choice and prediction

I am not going to use the top 2 models from above, because I am worried they are overfitted given the significantly lower training MAE. To me, the third and fourth models seem most reasonable and they are quite similar. I am actually going to go with the fourth model, because it uses fewer boosting rounds and is thus potentially less prone to overfitting.
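I made this call by eye, but the same reasoning can be written down as a rule. In the sketch below the numbers in `toy_results` are hypothetical (made up purely to show the mechanics, not my actual CV results), and the 5% tolerance is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical CV results, invented for illustration only
toy_results = pd.DataFrame({
    'max_depth': [14, 12, 10, 10],
    'num_boost_round': [13, 13, 13, 11],
    'MAE-test': [1.00, 1.01, 1.02, 1.03],
    'MAE-train': [0.20, 0.35, 0.60, 0.65],
})

# A large gap between held-out and training MAE hints at overfitting
toy_results['gap'] = toy_results['MAE-test'] - toy_results['MAE-train']

# Shortlist models within 5% of the best held-out MAE, then prefer the smallest gap
best = toy_results['MAE-test'].min()
shortlist = toy_results[toy_results['MAE-test'] <= 1.05 * best]
choice = shortlist.sort_values('gap').iloc[0]
print(choice)
```

With these made-up numbers the rule lands on the shallower model with fewer boosting rounds, mirroring the kind of trade-off described above.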

In [None]:
params = {"objective":"reg:squarederror", "max_depth":10, "eta":0.25}
xg_reg = xgb.train(params=params, dtrain=xgb_train, num_boost_round=11)
xgb_test = xgb.DMatrix(data=X_test)  # no label is needed for prediction
xgb_pred = xg_reg.predict(xgb_test)
xgb_submit_df = pd.DataFrame({'seg_id': seg_id, 'time_to_failure': xgb_pred})
xgb_submit_df.to_csv('submission.csv', index=False)

### ElasticNet vs XGBoost

Out of curiosity, I wanted to see whether the predictions of the `XGBoost` model are that different from the `ElasticNet` model. At this point, I have not submitted the new XGBoost predictions yet, but, given the similarity of predictions, my initial instinct is that `XGBoost` would not have gotten me any prize money :) A lot more work would've had to be done! 

In [None]:
print("Correlation between XGBoost and ElasticNet predictions is %.3f"%np.corrcoef(xgb_pred, elnet_pred)[0,1])
fig = plt.figure(facecolor='white', figsize=(12,9))
plt.scatter(xgb_pred, elnet_pred)
plt.xlabel('XGBoost', fontsize=12)
plt.ylabel('ElasticNet', fontsize=12)
plt.show()