# Introduction
The challenge 'Collision detection AI competition using vibration data' is hosted by the Korea Atomic Energy Research Institute.  The objective is to predict collider parameters using time and acceleration data. 

A collider is a type of particle accelerator that brings two opposing particle beams together such that the particles collide [Wikipedia]. In this challenge, the colliders sit inside the coolant systems of a nuclear power plant. Detecting abnormal activity in the collider helps technicians prevent accidents. Details of the competition are available at https://dacon.io/competitions/official/235614/overview/.

In [None]:
%matplotlib inline
import os
import warnings
warnings.simplefilter(action='ignore')

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
import sklearn as sl
import scipy as sp

from tqdm import tqdm

In [None]:
train_data = pd.read_csv("/kaggle/input/collision-detection-ai-using-vibration-data/train_features.csv")
train_target = pd.read_csv("/kaggle/input/collision-detection-ai-using-vibration-data/train_target.csv")
test_data = pd.read_csv("/kaggle/input/collision-detection-ai-using-vibration-data/test_features.csv")

In [None]:
submission_file = pd.read_csv("/kaggle/input/collision-detection-ai-using-vibration-data/sample_submission.csv")

In [None]:
train_data.shape,train_target.shape

In [None]:
train_data.head()

In [None]:
train_target.head()

In [None]:
train_data.info()

In [None]:
train_data.id.nunique()

In [None]:
train_data[train_data.id == 0]

In [None]:
train_data[train_data.id == 1]

# Data

The training data contains an id, a time stamp, and four acceleration signals. The id and time attributes are self-explanatory. The acceleration parameters in the collider are labeled S1, S2, S3, and S4. Each id corresponds to one training instance. The timestamp difference between consecutive observations within an id is four seconds, so each instance can be treated as an equispaced time series. The training data has 1,050,000 rows covering 2,800 ids, which works out to 375 observations per id. For each id there is a corresponding entry in the training targets, so the target file contains 2,800 entries with the columns X, Y, M, and V. These collider parameters are the prediction targets.
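The structure described above can be checked with a couple of group-wise operations. The sketch below runs on a small made-up frame rather than the competition file; the column name `Time`, the toy sizes, and the variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training features: 3 ids with 5 rows each
# (the real data has 375 rows per id). Column names are assumed.
rows_per_id = 5
toy = pd.DataFrame({
    "id": np.repeat([0, 1, 2], rows_per_id),
    "Time": np.tile(np.arange(rows_per_id) * 4.0, 3),
    "S1": np.random.default_rng(0).normal(size=3 * rows_per_id),
})

# Every id should contribute the same number of observations
counts = toy.groupby("id").size()
same_length = counts.nunique() == 1

# Within each id, consecutive timestamps should differ by one constant step
steps = toy.groupby("id")["Time"].diff().dropna()
equispaced = np.allclose(steps, steps.iloc[0])

print(counts.iloc[0], same_length, equispaced)
```

Running the same two checks on `train_data` would confirm whether every id really has the same length and a constant time step.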


Unlike the traditional data-sets in machine learning exercises, we can't jump into modeling immediately. Each instance is a raw signal, so it first has to be converted into a fixed set of features. One widely adopted method is to apply a Fourier transform before using any modeling technique. Let's explore the data further to see why.

## EDA


In [None]:
def plot_data(acceleration_df : pd.DataFrame,features : list, title : str) -> None:
    """ Plot the acceleration data
        :params acceleration_df: acceleration data for one id
        :params features: list of column names to plot, up to four
        :params title: string
    """
    
    fig = plt.figure(figsize=(10,6))
    fig.suptitle(title)
    
    for idx,feature in enumerate(features):
        ax = fig.add_subplot(2,2,idx+1)
        acceleration_df[feature].plot(kind='line',
                                     title = title + " " + feature,
                                     ax=ax)
    
    # Lay out the axes after they exist so the subplot titles do not overlap
    fig.tight_layout()

In [None]:
feats_to_plot = ["S1","S2","S3", "S4"]
plot_data(train_data[train_data.id == 0],feats_to_plot,"Acceleration Params")

In [None]:
train_target[train_target.id == 0]

In [None]:
feats_to_plot = ["S1","S2","S3", "S4"]
plot_data(train_data[train_data.id == 100],feats_to_plot,"Acceleration Params")

In [None]:
feats_to_plot = ["S1","S2","S3", "S4"]
plot_data(train_data[train_data.id == 250],feats_to_plot,"Acceleration Params")

In [None]:
feats_to_plot = ["S1","S2","S3", "S4"]
plot_data(train_data[train_data.id == 300],feats_to_plot,"Acceleration Params")

In [None]:
feats_to_plot = ["S1","S2","S3", "S4"]
plot_data(train_data[train_data.id == 400],feats_to_plot,"Acceleration Params")

## Fourier Transform 

One of the prominent methods for working with signal data is to apply a Fourier transform. It re-expresses each time series as the strength of its frequency components, and those magnitudes can be used as features for training a model.
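As a quick illustration of what the transform gives us, the sketch below builds a pure 5 Hz sine wave and recovers that frequency from the magnitude spectrum. The sampling frequency and signal are toy values chosen for the example, not properties of the competition data.

```python
import numpy as np

fs = 100.0                             # assumed sampling frequency in Hz (toy value)
t = np.arange(200) / fs                # 2 seconds of samples
signal = np.sin(2 * np.pi * 5.0 * t)   # a pure 5 Hz tone

# rfft returns only the non-negative frequencies of a real-valued signal
magnitude = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)

# The largest magnitude sits at the frequency of the tone
peak_freq = freqs[np.argmax(magnitude)]
print(peak_freq)
```

The feature function in this notebook follows the same idea, except that it uses `np.fft.fft` and keeps only the magnitudes of the leading bins for each sensor.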

In [None]:
fs = 5 #sampling frequency
fmax = 25 #not used below
dt = 1/fs #sampling period, used to scale the FFT output
n = 75 #sets how many FFT bins are kept: int(n/2+1) = 38 per sensor

def fft_features(data_set : pd.DataFrame) -> np.ndarray:
    """ Convert the dataset to Fourier-transformed features
        :params data_set: original collider params data
        :returns ft_data: magnitudes of the first int(n/2+1) FFT bins of S1..S4 for each id
        #Reference - https://dacon.io/competitions/official/235614/codeshare/1174
    """
    ft_data = list()
    
    features = ["S1","S2","S3", "S4"]
    n_bins = int(n/2+1)
    
    # sort=False keeps the ids in order of first appearance, like data_set.id.unique()
    for _, group in tqdm(data_set.groupby('id', sort=False)):
        # One magnitude spectrum per sensor, concatenated into a single feature row
        ft_data.append(np.concatenate([np.abs(np.fft.fft(group[feature].values)*dt)[0:n_bins]
                                       for feature in features]))
    
    return np.array(ft_data)

In [None]:
train_fft = fft_features(train_data)

In [None]:
train_fft.shape[0] == len(train_data.id.unique())

In [None]:
test_fft = fft_features(test_data)

In [None]:
test_fft.shape[0] == len(test_data.id.unique())

# Model
Let's create a multi-output regression model, which fits one regressor per target column (X, Y, M, and V).

In [None]:
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
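# Notes on the settings below:
# - loss='quantile' with the default alpha=0.9 fits the 90th percentile, not the mean
# - criterion='mae' is only accepted by older scikit-learn releases;
#   newer releases removed it and use 'friedman_mse' by default
# - n_iter_no_change=2 stops boosting early after 2 rounds without improvement
base_model = GradientBoostingRegressor(loss='quantile',
                                      n_estimators=100,
                                      criterion='mae',
                                      random_state=2021,
                                      max_features='sqrt',
                                      n_iter_no_change=2)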

mult_regressor = MultiOutputRegressor(base_model,
                                      n_jobs=-1)

In [None]:
mult_regressor.fit(train_fft,
                  train_target.drop(['id'],axis=1))

In [None]:
predictions = mult_regressor.predict(test_fft)

In [None]:
predictions[0]

In [None]:
submission_file[['X','Y','M','V']] = predictions
submission_file.head()

In [None]:
submission_file.to_csv("submission_1_1.csv",
                  index=False)

# Result
The submission scored 13.48313, ranking 226th on the public leaderboard.

# Alternative Feature Engineering
An alternative approach to feature engineering is to aggregate the signals per id and compute key statistics such as the maximum, minimum, mean, standard deviation, median, and skew.
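For reference, the same set of statistics can also be requested in a single `agg` call. This is only a sketch on a made-up two-sensor frame, and it is not the function used in this notebook; the toy data and names are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy frame: 2 ids with 4 observations each and two sensor columns
toy = pd.DataFrame({
    "id": np.repeat([0, 1], 4),
    "S1": rng.normal(size=8),
    "S2": rng.normal(size=8),
})

stats = ["max", "min", "mean", "std", "median", "skew"]
agg = toy.groupby("id")[["S1", "S2"]].agg(stats)

# Flatten the (sensor, statistic) column MultiIndex into names such as "S1_mean"
agg.columns = [f"{sensor}_{stat}" for sensor, stat in agg.columns]
print(agg.shape)
```

The columns come out grouped by sensor rather than by statistic, so the order differs from the function below even though the values are the same kind of aggregates.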

In [None]:
def generate_agg_feats(data_set : pd.DataFrame) -> pd.DataFrame:
    """ Create aggregate features from the data
        :param data_set: Base data as DataFrame
        :returns agg_data: Aggregated DataFrame
    """
    
    # Group once by id; .iloc[:,1:] drops the first aggregated column (time)
    grouped = data_set.groupby(['id'])
    
    max_feats = grouped.max().add_suffix('_max').iloc[:,1:]
    min_feats = grouped.min().add_suffix('_min').iloc[:,1:]
    mean_feats = grouped.mean().add_suffix('_mean').iloc[:,1:]
    std_feats = grouped.std().add_suffix('_std').iloc[:,1:]
    median_feats = grouped.median().add_suffix('_median').iloc[:,1:]
    skew_feats = grouped.skew().add_suffix('_skew').iloc[:,1:]
    
    agg_data = pd.concat([max_feats,min_feats,
                          mean_feats,std_feats,median_feats,skew_feats],
                        axis=1)
    
    return agg_data

In [None]:
agg_train = generate_agg_feats(train_data)
agg_train.shape

In [None]:
agg_train.head()

In [None]:
agg_test = generate_agg_feats(test_data)
agg_test.shape

# Model Two with Aggregated Data

In [None]:
mult_regressor.fit(agg_train,
                  train_target.drop(['id'],axis=1))

In [None]:
agg_pred = mult_regressor.predict(agg_test)

In [None]:
agg_pred[0]

In [None]:
submission_file[['X','Y','M','V']] = agg_pred
submission_file.head()

In [None]:
submission_file.to_csv("submission_2.csv",
                  index=False)

#### Score
In this case the score is the same (13.48313), so changing the features alone did not help. We need to try an alternative modelling approach to improve it.
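Before spending a submission on each new model, it can help to compare candidates on a local hold-out split. The sketch below does this on synthetic data; the toy features, the two candidate models, and the use of mean absolute error are assumptions for illustration, not the competition metric or the settings used elsewhere in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.multioutput import MultiOutputRegressor, RegressorChain
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                            # toy features
W = rng.normal(size=(6, 4))
y = X @ W + 0.1 * rng.normal(size=(200, 4))              # 4 toy targets, like X, Y, M, V

# Hold out a quarter of the rows for validation
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

candidates = {
    "gbr": MultiOutputRegressor(GradientBoostingRegressor(random_state=0)),
    "svr_chain": RegressorChain(SVR(kernel="rbf", gamma="auto")),
}

# Fit each candidate on the training part and score it on the held-out part
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = mean_absolute_error(y_va, model.predict(X_va))

print(scores)
```

With the real data, `X` would be `train_fft` or `agg_train` and `y` would be the target columns without `id`.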

# Support Vector Regressor

In [None]:
from sklearn.svm import SVR
from sklearn.multioutput import RegressorChain

In [None]:
svr = SVR(kernel='rbf',
         gamma='auto',
         shrinking=True)
regressor_chain = RegressorChain(svr,
                                order='random',
                                random_state=1999)

In [None]:
regressor_chain.fit(agg_train,
                  train_target.drop(['id'],axis=1))

In [None]:
svr_p1 = regressor_chain.predict(agg_test)

In [None]:
submission_file[['X','Y','M','V']] = svr_p1
submission_file.head()

In [None]:
submission_file.to_csv("submission_3.csv",
                  index=False)

## Improved
Our score improved from 13.48313 to 3.25082 on the LB!

In [None]:
regressor_chain.fit(train_fft,
                  train_target.drop(['id'],axis=1))

In [None]:
fft_pred = regressor_chain.predict(test_fft)

In [None]:
submission_file[['X','Y','M','V']] = fft_pred
submission_file.head()

In [None]:
submission_file.to_csv("submission_4.csv",
                  index=False)

### Submission
The submissions with FFT features and with aggregated features resulted in the same LB score.
