---

<h1><span style="color:red;"><strong>LANL Earthquake Prediction</strong></span></h1>

---

<h3><span style="color:Blue;"><strong><i>Can you predict upcoming laboratory earthquakes?</i></strong></span></h3>

---

> ## **_Objective_**
> In this competition, you will address when the earthquake will take place. Specifically, you’ll predict the time remaining before laboratory earthquakes occur from real-time seismic data.
> ## **_My approach_**
> _In this kernel, I train LightGBM with an initial set of parameters and K-fold validation, try XGBoost as a second ensemble model, and then stack the models' predictions._

---
> ## **_Outline_**
* [**1.Load library**](#1.Load-library)
* [**2.Read Data**](#2.Read-Data)
* [**3.Feature Engineering**](#3.Feature-Engineering)
* [**4.Data transformation**](#4.Data-transformation)
* [**5.Test Data**](#5.Test-Data)
* [**6.Model Training**](#6.Model-Training)
    * [**1. Lightgbm**](#1.-Lightgbm)
    * [**2. XGboost**](#2.-XGboost)
* [**7.Stacking**](#7.Stacking)
* [**8.Final Prediction**](#8.Final-Prediction)
---

## **1.Load library**

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm_notebook as tqdm
from sklearn.preprocessing import StandardScaler
from sklearn.svm import NuSVR
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_absolute_error, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import BayesianRidge
import multiprocessing as mp
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")

## **2.Read Data**

In [None]:
%%time
train = pd.read_csv('../input/train.csv', dtype={'acoustic_data': np.int16, 'time_to_failure': np.float64})
train.head()

In [None]:
def add_trend_feature(arr, abs_values=False):
    idx = np.array(range(len(arr)))
    if abs_values:
        arr = np.abs(arr)
    lr = LinearRegression()
    lr.fit(idx.reshape(-1, 1), arr)
    return lr.coef_[0]
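
As a quick sanity check of the trend feature, the sketch below fits the same kind of least-squares line to a synthetic ramp. The helper name `trend_slope` and the toy arrays are assumptions made for illustration; they are not part of the kernel.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def trend_slope(arr, abs_values=False):
    """Slope of a least-squares line fitted to arr against its sample index."""
    arr = np.asarray(arr, dtype=np.float64)
    if abs_values:
        arr = np.abs(arr)
    idx = np.arange(len(arr)).reshape(-1, 1)
    return LinearRegression().fit(idx, arr).coef_[0]

ramp = 2.0 * np.arange(100) + 5.0   # rises by 2 per sample
flipped = -ramp                     # falls by 2 per sample; its absolute value rises by 2

slope = trend_slope(ramp)
abs_slope = trend_slope(flipped, abs_values=True)
print(slope, abs_slope)
```

With `abs_values=True` the sign of the signal is discarded, so the feature tracks whether the amplitude is growing rather than whether the raw signal drifts up or down.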

### **GeneticAlgorithm**

In [None]:
class GeneticAlgorithm:

    def __init__(self,X,Y,Algorithm,Niter=100,keep_fraction=0.5,mutation_rate="auto",nfeatures="auto",test_size=0.3,njobs=1):

        '''
        A simple genetic algorithm designed to work with sklearn objects.
        The goal of the algorithm is to select the combination of columns of the input dataframe X that produces the
        best prediction of Y given the algorithm object provided.
        To do this, the algorithm starts by randomly selecting different
        combinations of columns. These become the first 'generation' of individuals.
        For each combination, the sklearn algorithm is trained on the associated columns and tested. The test score
        becomes the 'fitness' of that combination/individual.
        Individuals are then selected for 'breeding'. Those with the highest fitness scores have the highest chances
        of doing so. When two individuals 'breed', two children are produced by a simple crossover of their column IDs.
        The child generation is then combined with the 'best' of the parent generation via the 'keep_fraction' argument.
        The algorithm proceeds by testing the fitness of each generation. The result is an 'optimal' combination of
        features that corresponds to the best fitness score.
        This should work with any supervised sklearn object.
        Inputs:
        X - dataframe containing the predictors only
        Y - dataframe or series containing the target only
        Algorithm - sklearn classifier or regression object, such as a RandomForestClassifier
        Niter - number of iterations of the genetic algorithm
        keep_fraction - the proportion of fittest parents to keep in each new generation
        mutation_rate - the probability of mutation in each child
        nfeatures - the maximum number of features that an output model can have
        test_size - test_size in the train_test_split that occurs in model fitness evaluation
        njobs - number of worker processes used to evaluate fitness (must not exceed the CPU count)
        Main outputs:
        self.fitness_evolution - list of the best fitness value from each generation
        self.best_individual_evolution - list of arrays of the best individuals in each generation
        self.feature_selection - dataframe corresponding to the selected features
        self.best_fitness - fitness score corresponding to self.feature_selection
        self.best_individual - individual corresponding to self.feature_selection
        '''


        self.dataset = X
        self.response = Y
        self.algorithm = Algorithm #needs to be a sklearn object
        self.Niter = Niter #number of iterations 
        self.parent_keep = keep_fraction
        self.test_size = test_size
        self.nprocs = int(njobs)

        if self.nprocs > mp.cpu_count():

            raise ValueError("Entered number of processes > CPU count!")


        self.feature_columns = self.dataset.columns

        if nfeatures == 'auto':
            self.nfeatures = len(self.feature_columns)
        else:
            self.nfeatures = nfeatures

        self.P = 2*int(np.ceil(self.nfeatures*1.5/2)) #number of individuals in a given generation

        if mutation_rate == 'auto':

            self.mutation_rate = 1.0/(self.P*np.sqrt(self.nfeatures))
        else:
            self.mutation_rate = mutation_rate

        self.fitness_evolution = []
        self.best_individual_evolution = []

        #These three things are typically the most desired output
        self.feature_selection = None
        self.best_fitness = None
        self.best_individual = None


    def fitness(self,generation):

        '''
        Assess the fitness of a generation of individuals
        This is the part that takes a long time because it must train a supervised ML algorithm on all individuals in a generation
        '''

        def determine_fitness(subgeneration,output,pos):

            fitness_array = np.zeros(np.shape(subgeneration)[0])

            for i in range(np.shape(subgeneration)[0]):
            
                individual = subgeneration[i,:]
                
                #Subset the columns based on this individual
                X_individual = self.dataset[[self.dataset.columns[j] for j in range(len(individual)) if individual[j] == 1]]
                
                #Split into train-test datasets
                X_train, X_test, y_train, y_test = train_test_split(X_individual,self.response,test_size=self.test_size)
                
                #Fit the classifier
                self.algorithm.fit(X_train,y_train)
                
                #Report fitness score (score in the testing dataset)
                fitness = self.algorithm.score(X_test,y_test)
                
                #append to fitness array
                fitness_array[i] = fitness

            output.put((pos,fitness_array))


        process_output = mp.Queue()
        subarrays = np.array_split(generation,self.nprocs)
        processes = [mp.Process(target=determine_fitness,args=(subarrays[i],process_output,i)) for i in range(self.nprocs)]

        for p in processes:
            p.start()

        #Drain the queue before joining: joining a process that still has queued items can deadlock
        results = [process_output.get() for p in processes]

        for p in processes:
            p.join()
        results.sort()
        rlist = []
        for element in results:
            r = element[1]
            for j in range(len(r)):
                rlist.append(r[j])
        rlist = np.array(rlist)
   
        return rlist

    def make_new_generation(self,old_generation,old_fitness_array):
        
        '''
        Make a new generation of individuals
        '''
        
        generation_size = len(old_fitness_array)
            
        #Vector describing the probability of reproduction of each individual in a generation
        #Rank-based selection: rank 1 = least fit, rank generation_size = fittest
        ranks = np.argsort(np.argsort(old_fitness_array)) + 1
        prob_weights = 2*ranks/(generation_size*(generation_size+1))
        
        prob_reproduction = prob_weights/np.sum(prob_weights)
        
        #Make vector of indices to choose
        a = np.arange(generation_size)
        
        children = np.zeros([2*generation_size,np.shape(old_generation)[1]])
        
        for i in range(generation_size):
            parent_index_pair = np.random.choice(a,size=2,replace=False,p=prob_reproduction)
            
            parent1 = old_generation[parent_index_pair[0]]
            parent2 = old_generation[parent_index_pair[1]]
            
            #Do cross over and apply mutation to generate two children for each parent pair
            child1 = parent1.copy()
            child2 = parent2.copy()
            
            #Generate locations of genetic information to swap
            pos = np.random.choice(len(parent1),size=int(len(parent1)/2),replace=False)
            child1[pos] = parent2[pos]
            child2[pos] = parent1[pos]
            
            #Generate mutation vector
            mutate1 = np.random.binomial(1,self.mutation_rate,len(parent1))
            mutate2 = np.random.binomial(1,self.mutation_rate,len(parent1))
            
            #Generate children and fill child array
            child1 = (child1+mutate1 >= 1).astype(int)
            child2 = (child2+mutate2 >= 1).astype(int)
            
            children[i,:] = child1
            children[-(i+1),:] = child2
            
        #shuffle and return only the same number of children as there were parents 
        np.random.shuffle(children)
        
        new_generation = children[0:generation_size,:]
        
        #replace some fraction of the children with the fittest parents, if desired
        
        nparents_to_keep = int(self.parent_keep*generation_size)
        
        if nparents_to_keep > 0:
            parents_keep = np.argsort(old_fitness_array)[::-1][:nparents_to_keep]

            for i in range(len(parents_keep)):
                new_generation[i,:] = old_generation[parents_keep[i],:]

        np.random.shuffle(new_generation)
        
        
        return new_generation 

    def fit(self):

        '''
        Run the genetic algorithm to obtain the optimal features for this problem
        This part takes a long time and could be parallelized
        '''

        #Make the first generation 
        old_generation = np.zeros([self.P,self.nfeatures])
        for i in range(self.P):
            old_generation[i,:] = np.random.binomial(1,0.5,self.nfeatures)

        old_fitness_array = self.fitness(old_generation)

        self.best_fitness = np.max(old_fitness_array)
        self.best_individual = old_generation[np.argmax(old_fitness_array),:]

        self.fitness_evolution.append(self.best_fitness)
        self.best_individual_evolution.append(self.best_individual)

        for n in range(1,self.Niter):

            print("GeneticAlgorithm: Testing generation %i" %n)

            #Make new generation
            new_generation = self.make_new_generation(old_generation,old_fitness_array)
            #Get fitness of new generation
            new_fitness_array = self.fitness(new_generation)

            #Locate and extract the best individual and its score
            self.best_fitness = np.max(new_fitness_array)
            self.best_individual = new_generation[np.argmax(new_fitness_array),:]
            self.fitness_evolution.append(self.best_fitness)
            self.best_individual_evolution.append(self.best_individual)

            old_fitness_array = new_fitness_array
            old_generation = new_generation

        #Get the features associated with the 'winning' individual

        self.feature_selection = self.dataset[[self.dataset.columns[j] for j in range(len(self.best_individual)) if self.best_individual[j] == 1]]
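
The selection and crossover steps described in the class docstring can be sketched in isolation as follows. This is a minimal illustration under my own assumptions: the function names `rank_selection_probs` and `crossover` are hypothetical, and the rank-based weighting is one common way to make fitter individuals more likely to breed.

```python
import numpy as np

def rank_selection_probs(fitness):
    """Breeding probability proportional to fitness rank (1 = worst, n = best)."""
    fitness = np.asarray(fitness, dtype=np.float64)
    n = len(fitness)
    ranks = np.argsort(np.argsort(fitness)) + 1
    return 2.0 * ranks / (n * (n + 1))  # the ranks sum to n(n+1)/2, so this sums to 1

def crossover(parent1, parent2, rng):
    """Swap a random half of the gene positions between two binary individuals."""
    pos = rng.choice(len(parent1), size=len(parent1) // 2, replace=False)
    child1, child2 = parent1.copy(), parent2.copy()
    child1[pos] = parent2[pos]
    child2[pos] = parent1[pos]
    return child1, child2

probs = rank_selection_probs([0.2, 0.9, 0.5, 0.1])

rng = np.random.default_rng(0)
ones, zeros = np.ones(6, dtype=int), np.zeros(6, dtype=int)
child1, child2 = crossover(ones, zeros, rng)
print(probs, child1, child2)
```

Each gene position ends up in exactly one of the two children, so no column choice from either parent is lost at the crossover step; only mutation introduces new information.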

## **3.Feature Engineering**

In [None]:
rows = 150_000
segments = int(np.floor(train.shape[0] / rows))

X_train = pd.DataFrame(index=range(segments), dtype=np.float64,
                       columns=['ave', 'std', 'max', 'min','q95','q99', 'q05','q01',
                                'abs_max', 'abs_mean', 'abs_std', 'trend', 'abs_trend'])
y_train = pd.DataFrame(index=range(segments), dtype=np.float64,
                       columns=['time_to_failure'])

for segment in tqdm(range(segments)):
    seg = train.iloc[segment*rows:segment*rows+rows]
    x = seg['acoustic_data'].values
    y = seg['time_to_failure'].values[-1]
    
    y_train.loc[segment, 'time_to_failure'] = y
    
    X_train.loc[segment, 'ave'] = x.mean()
    X_train.loc[segment, 'std'] = x.std()
    X_train.loc[segment, 'max'] = x.max()
    X_train.loc[segment, 'min'] = x.min()
    X_train.loc[segment, 'q95'] = np.quantile(x,0.95)
    X_train.loc[segment, 'q99'] = np.quantile(x,0.99)
    X_train.loc[segment, 'q05'] = np.quantile(x,0.05)
    X_train.loc[segment, 'q01'] = np.quantile(x,0.01)
    
    X_train.loc[segment, 'abs_max'] = np.abs(x).max()
    X_train.loc[segment, 'abs_mean'] = np.abs(x).mean()
    X_train.loc[segment, 'abs_std'] = np.abs(x).std()
    X_train.loc[segment, 'trend'] = add_trend_feature(x)
    X_train.loc[segment, 'abs_trend'] = add_trend_feature(x, abs_values=True)
    
X_train.head()
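
The same thirteen statistics are computed again for the test segments, so one option is to collect them in a single helper. The sketch below is an assumption-laden illustration: `segment_features` is a hypothetical name, the signal is random toy data, and the slope is computed with `np.polyfit` in place of the `LinearRegression` fit used in `add_trend_feature` (both are least-squares fits of a straight line).

```python
import numpy as np
import pandas as pd

def segment_features(x):
    """Summary statistics of one acoustic segment, keyed by feature name."""
    x = np.asarray(x, dtype=np.float64)
    abs_x = np.abs(x)
    idx = np.arange(len(x))
    return {
        'ave': x.mean(), 'std': x.std(), 'max': x.max(), 'min': x.min(),
        'q95': np.quantile(x, 0.95), 'q99': np.quantile(x, 0.99),
        'q05': np.quantile(x, 0.05), 'q01': np.quantile(x, 0.01),
        'abs_max': abs_x.max(), 'abs_mean': abs_x.mean(), 'abs_std': abs_x.std(),
        'trend': np.polyfit(idx, x, 1)[0],          # slope of the raw signal
        'abs_trend': np.polyfit(idx, abs_x, 1)[0],  # slope of the absolute signal
    }

# Toy signal split into four segments of 250 samples each
rng = np.random.default_rng(0)
toy_signal = rng.integers(-10, 10, size=1000)
toy_features = pd.DataFrame(
    [segment_features(toy_signal[i:i + 250]) for i in range(0, 1000, 250)]
)
print(toy_features.head())
```

Building the rows as dictionaries and creating the dataframe once also avoids the repeated `.loc` assignments inside the loop.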

## **4.Data transformation**

In [None]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
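
The scaler is fitted on the training features only and then reused unchanged for the test features. A minimal sketch of that pattern on toy data (all `toy_*` names and the random data are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
toy_train = rng.normal(loc=5.0, scale=3.0, size=(200, 2))
toy_test = rng.normal(loc=5.0, scale=3.0, size=(50, 2))

# Learn mean and standard deviation from the training data only
toy_scaler = StandardScaler().fit(toy_train)
toy_train_scaled = toy_scaler.transform(toy_train)
# Apply the same training statistics to the test data
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_train_scaled.mean(axis=0), toy_train_scaled.std(axis=0))
```

The scaled training columns have zero mean and unit variance by construction; the test columns are only approximately standardized because they reuse the training statistics.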

## **5.Test Data**

In [None]:
submission = pd.read_csv('../input/sample_submission.csv', index_col='seg_id')
X_test = pd.DataFrame(columns=X_train.columns, dtype=np.float64, index=submission.index)
for seg_id in tqdm(X_test.index):
    seg = pd.read_csv('../input/test/' + seg_id + '.csv')
    
    x = seg['acoustic_data'].values
    
    X_test.loc[seg_id, 'ave'] = x.mean()
    X_test.loc[seg_id, 'std'] = x.std()
    X_test.loc[seg_id, 'max'] = x.max()
    X_test.loc[seg_id, 'min'] = x.min()
    X_test.loc[seg_id, 'q95'] = np.quantile(x,0.95)
    X_test.loc[seg_id, 'q99'] = np.quantile(x,0.99)
    X_test.loc[seg_id, 'q05'] = np.quantile(x,0.05)
    X_test.loc[seg_id, 'q01'] = np.quantile(x,0.01)
    
    X_test.loc[seg_id, 'abs_max'] = np.abs(x).max()
    X_test.loc[seg_id, 'abs_mean'] = np.abs(x).mean()
    X_test.loc[seg_id, 'abs_std'] = np.abs(x).std()
    X_test.loc[seg_id, 'trend'] = add_trend_feature(x)
    X_test.loc[seg_id, 'abs_trend'] = add_trend_feature(x, abs_values=True)

X_test_scaled = scaler.transform(X_test)

In [None]:
X_train_scaled = pd.DataFrame(X_train_scaled,columns=X_train.columns)
X_train_scaled.head()

In [None]:
X_test_scaled = pd.DataFrame(X_test_scaled,columns=X_test.columns)
X_test_scaled.head()

## **6.Model Training**
---
## **1. Lightgbm**

In [None]:
import time
import lightgbm as lgb
param = {'num_leaves': 31,
         'min_data_in_leaf': 32, 
#          'objective':'regression',
         'max_depth': -1,
         'learning_rate': 0.001,
         "min_child_samples": 20,
#          "boosting": "gbdt",
         "feature_fraction": 0.9,
         "bagging_freq": 1,
         "bagging_fraction": 0.9 ,
         "bagging_seed": 11,
         "metric": 'rmse',
         "lambda_l1": 0.1,
         "nthread": 4,
         "verbosity": -1}

In [None]:
features = X_train_scaled.columns

In [None]:
# dataset = lgb.Dataset(X_train_scaled.values, y_train.values)

In [None]:
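# Unpack the dict into keyword arguments; passing it positionally would set boosting_type to the dict
lgbm = lgb.LGBMRegressor(**param)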
GA = GeneticAlgorithm(X_train_scaled,y_train,lgbm,njobs=4)

In [None]:
GA.fit()
print(GA.best_fitness)
print(GA.best_individual)
X_subset = GA.feature_selection
print(X_subset.head())

In [None]:
folds = KFold(n_splits=5, shuffle=True, random_state=15)
oof = np.zeros(len(X_train_scaled))
predictions = np.zeros(len(X_test_scaled))
start = time.time()
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_scaled.values, y_train.values)):
    print("fold n°{}".format(fold_))
    trn_data = lgb.Dataset(X_train_scaled.iloc[trn_idx][features], label=y_train.iloc[trn_idx])
    val_data = lgb.Dataset(X_train_scaled.iloc[val_idx][features], label=y_train.iloc[val_idx])

    num_round = 10000
    clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=100, early_stopping_rounds = 200)
    oof[val_idx] = clf.predict(X_train_scaled.iloc[val_idx][features], num_iteration=clf.best_iteration)
    
    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions += clf.predict(X_test_scaled[features], num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(mean_squared_error(oof, y_train)**0.5))
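
The score printed above is an RMSE over the out-of-fold predictions. Since `mean_absolute_error` is already imported and the XGBoost model below is evaluated with `mae`, it may be useful to report both. The sketch below shows the out-of-fold pattern on toy data; the `toy_*` names, the random data and the use of `LinearRegression` as the model are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

toy_folds = KFold(n_splits=5, shuffle=True, random_state=15)
toy_oof = np.zeros(len(X_toy))

for trn_idx, val_idx in toy_folds.split(X_toy):
    # Each row is predicted by a model that never saw it during training
    model = LinearRegression().fit(X_toy[trn_idx], y_toy[trn_idx])
    toy_oof[val_idx] = model.predict(X_toy[val_idx])

toy_rmse = mean_squared_error(y_toy, toy_oof) ** 0.5
toy_mae = mean_absolute_error(y_toy, toy_oof)
print("RMSE: {:.5f}  MAE: {:.5f}".format(toy_rmse, toy_mae))
```

For the same predictions MAE is never larger than RMSE, so the two numbers should not be compared against each other directly.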

In [None]:
cols = (feature_importance_df[["feature", "importance"]]
        .groupby("feature")
        .mean()
        .sort_values(by="importance", ascending=False)[:1000].index)

best_features = feature_importance_df.loc[feature_importance_df.feature.isin(cols)]

plt.figure(figsize=(14,16))
sns.barplot(x="importance",
            y="feature",
            data=best_features.sort_values(by="importance",
                                           ascending=False))
plt.title('LightGBM Features (avg over folds)')
plt.tight_layout()
plt.savefig('lgbm_importances.png')

## **2. XGboost**

In [None]:
%%time
import xgboost as xgb

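# 'random_state' must be an integer seed, not the KFold object left over from the LightGBM cell
xgb_params = {'eta': 0.001, 'max_depth': 5, 'subsample': 0.8, 'colsample_bytree': 0.8, 'alpha':0.1,
          'objective': 'reg:linear', 'eval_metric': 'mae', 'silent': True, 'random_state': 4520}


# shuffle=True is needed for random_state to have any effect
folds = KFold(n_splits=5, shuffle=True, random_state=4520)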
oof_xgb = np.zeros(len(X_train_scaled))
predictions_xgb = np.zeros(len(X_test_scaled))

for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_scaled.values, y_train.values)):
    print("fold n°{}".format(fold_ + 1))
    trn_data = xgb.DMatrix(data=X_train_scaled.iloc[trn_idx][features], label=y_train.iloc[trn_idx])
    val_data = xgb.DMatrix(data=X_train_scaled.iloc[val_idx][features], label=y_train.iloc[val_idx])
    watchlist = [(trn_data, 'train'), (val_data, 'valid')]
    print("-" * 10 + "Xgboost " + str(fold_) + "-" * 10)
    num_round = 11000
    xgb_model = xgb.train(xgb_params, trn_data, num_round, watchlist, early_stopping_rounds=50, verbose_eval=1000)
    oof_xgb[val_idx] = xgb_model.predict(xgb.DMatrix(X_train_scaled.iloc[val_idx][features]), ntree_limit=xgb_model.best_ntree_limit+50)

    predictions_xgb += xgb_model.predict(xgb.DMatrix(X_test_scaled[features]), ntree_limit=xgb_model.best_ntree_limit+50) / folds.n_splits
    
np.save('oof_xgb', oof_xgb)
np.save('predictions_xgb', predictions_xgb)
print("CV score: {:<8.5f}".format(mean_squared_error(oof_xgb, y_train)**0.5))

In [None]:
# %%time
# from catboost import CatBoostRegressor
# folds = KFold(n_splits=5, random_state=4520)
# oof_cat = np.zeros(len(X_train_scaled))
# predictions_cat = np.zeros(len(X_test_scaled))

# for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_scaled.values, y_train.values)):
#     print("fold n°{}".format(fold_ + 1))
#     trn_data, trn_y = X_train_scaled.iloc[trn_idx][features], y_train.iloc[trn_idx]
#     val_data, val_y = X_train_scaled.iloc[val_idx][features], y_train.iloc[val_idx]
#     print("-" * 10 + "Catboost " + str(fold_) + "-" * 10)
#     cb_model = CatBoostRegressor(iterations=8000, learning_rate=0.01, depth=8, l2_leaf_reg=20, bootstrap_type='Bernoulli',  eval_metric='RMSE', metric_period=50, od_type='Iter', od_wait=45, random_seed=17, allow_writing_files=False)
#     cb_model.fit(trn_data, trn_y, eval_set=(val_data, val_y), use_best_model=True, verbose=True,)
    
#     oof_cat[val_idx] = cb_model.predict(val_data)
#     predictions_cat += cb_model.predict(X_test_scaled[features]) / folds.n_splits
    
# np.save('oof_cat', oof_cat)
# np.save('predictions_cat', predictions_cat)
# np.sqrt(mean_squared_error(y_train.values, oof_cat))

## **7.Stacking**

In [None]:
train_stack = np.vstack([oof, oof_xgb]).transpose()
test_stack = np.vstack([predictions,predictions_xgb]).transpose()

folds = KFold(n_splits=5, shuffle=True, random_state=15)
oof_stack = np.zeros(train_stack.shape[0])
predictions_stack = np.zeros(test_stack.shape[0])

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_stack, y_train)):
    print("fold n°{}".format(fold_))
    trn_data, trn_y = train_stack[trn_idx], y_train.iloc[trn_idx].values
    val_data, val_y = train_stack[val_idx], y_train.iloc[val_idx].values

    print("-" * 10 + "Bayesian Ridge " + str(fold_) + "-" * 10)
#     cb_model = CatBoostRegressor(iterations=3000, learning_rate=0.1, depth=8, l2_leaf_reg=20, bootstrap_type='Bernoulli',  eval_metric='RMSE', metric_period=50, od_type='Iter', od_wait=45, random_seed=17, allow_writing_files=False)
#     cb_model.fit(trn_data, trn_y, eval_set=(val_data, val_y), cat_features=[], use_best_model=True, verbose=True)
    clf = BayesianRidge()
    clf.fit(trn_data, trn_y)
    
    oof_stack[val_idx] = clf.predict(val_data)
    predictions_stack += clf.predict(test_stack) / folds.n_splits


# Score the stacked out-of-fold predictions, not the LightGBM ones
print("CV score: {:<8.5f}".format(mean_squared_error(oof_stack, y_train)**0.5))
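
To see why stacking can help, the sketch below stacks two noisy toy "base models" with `BayesianRidge`, following the same fold structure as the cell above. Everything here is synthetic and assumed for illustration: the targets, the noise levels and the `toy_*` names.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
toy_y = rng.uniform(0, 10, size=200)
# Two base models whose errors are independent of each other
toy_base_a = toy_y + rng.normal(scale=1.0, size=200)
toy_base_b = toy_y + rng.normal(scale=1.0, size=200)
toy_stack = np.column_stack([toy_base_a, toy_base_b])

toy_kf = KFold(n_splits=5, shuffle=True, random_state=15)
toy_oof_stack = np.zeros(len(toy_y))

for trn_idx, val_idx in toy_kf.split(toy_stack):
    meta = BayesianRidge().fit(toy_stack[trn_idx], toy_y[trn_idx])
    toy_oof_stack[val_idx] = meta.predict(toy_stack[val_idx])

rmse_a = mean_squared_error(toy_y, toy_base_a) ** 0.5
rmse_b = mean_squared_error(toy_y, toy_base_b) ** 0.5
rmse_stack = mean_squared_error(toy_y, toy_oof_stack) ** 0.5
print(rmse_a, rmse_b, rmse_stack)
```

The gain comes from the base errors being independent; LightGBM and XGBoost trained on the same features tend to make correlated errors, so the improvement in this kernel is likely to be smaller than in the toy case.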

## **8.Final Prediction**

In [None]:
sample_submission = pd.read_csv('../input/sample_submission.csv')
sample_submission['time_to_failure'] = predictions_stack
sample_submission.to_csv('Bayesian_Ridge_Stacking.csv', index=False)

In [None]:
sample_submission.shape