# Rocket

**What is Rocket?**

Rocket makes use of convolutional kernels. Convolutional neural networks (CNNs) are most often used for image classification, but they can also be applied to time series, where the kernels slide along the series to detect patterns in the input.

Rocket transforms each time series using a large number of random convolutional kernels (random length, weights, bias, dilation, and padding) to detect patterns in the series. From each resulting feature map it computes two features: the maximum value and the proportion of positive values (PPV). These features are then used to train a classifier.

[arXiv:1910.13051](https://arxiv.org/abs/1910.13051)
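
To make the transform described above more concrete, here is a minimal NumPy sketch of the idea. It loosely follows the kernel sampling scheme in the paper and is not sktime's implementation; the function names (`random_kernel`, `apply_kernel`, `rocket_features`) and the toy sine series are made up for illustration.

```python
import numpy as np

def random_kernel(rng, input_length):
    """Draw one random kernel (length, weights, bias, dilation, padding)."""
    length = int(rng.choice([7, 9, 11]))
    weights = rng.normal(size=length)
    weights -= weights.mean()  # mean-centre the weights
    bias = rng.uniform(-1.0, 1.0)
    # Sample dilation on an exponential scale, capped so the kernel fits the series
    max_exponent = np.log2((input_length - 1) / (length - 1))
    dilation = int(2 ** rng.uniform(0, max_exponent))
    # Half of the kernels (on average) use zero padding, the rest use none
    padding = ((length - 1) * dilation) // 2 if rng.integers(2) == 1 else 0
    return weights, bias, dilation, padding

def apply_kernel(x, weights, bias, dilation, padding):
    """Convolve x with a dilated kernel and return the (max, ppv) pair."""
    x = np.pad(x, padding)  # zero padding on both ends
    span = (len(weights) - 1) * dilation
    out = np.array([
        np.dot(weights, x[i : i + span + 1 : dilation]) + bias
        for i in range(len(x) - span)
    ])
    return out.max(), np.mean(out > 0)

def rocket_features(x, num_kernels=100, seed=0):
    """Return 2 * num_kernels features: (max, ppv) for each random kernel."""
    rng = np.random.default_rng(seed)
    feats = []
    for _ in range(num_kernels):
        feats.extend(apply_kernel(x, *random_kernel(rng, len(x))))
    return np.array(feats)

# Toy univariate series (an assumption for illustration only)
series = np.sin(np.linspace(0, 6 * np.pi, 60))
features = rocket_features(series, num_kernels=100)
print(features.shape)  # two features per kernel
```

The kernels are never trained: all of the learning happens in the classifier that is fitted on these features.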

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.model_selection import GroupKFold
from lightgbm import LGBMClassifier
from sklearn.pipeline import make_pipeline

In [None]:
! pip install sktime==0.9.0 #Compatibility issues with some versions

In [None]:
from sktime.transformations.panel.rocket import MiniRocketMultivariate, Rocket
from sktime.datatypes._panel._convert import from_multi_index_to_nested

In [None]:
train_df = pd.read_csv("../input/tabular-playground-series-apr-2022/train.csv")
train_labels = pd.read_csv("../input/tabular-playground-series-apr-2022/train_labels.csv")

We transform the data into the nested format that sktime's Rocket expects. This takes about 15 minutes to run, so it is worth pickling the result for reuse (see the commented-out lines).

In [None]:
%%time
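# To skip the slow conversion, load a previously saved copy instead:
#X = pd.read_pickle("../input/tpsapr22nested-data/tpsapr22_nested_train.pkl")
#X.head()

# Index by (sequence, step), drop the subject column, then nest one row per sequence
train_df = train_df.set_index(["sequence"])
X = train_df.drop(columns=["subject"])
X = X.set_index("step", append=True)
X = from_multi_index_to_nested(multi_ind_dataframe=X, instance_index="sequence")
# Save the nested frame so the conversion does not need to be repeated:
#X.to_pickle(path="./tpsapr22_nested_train.pkl")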

display(X.head(2))
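
For readers unfamiliar with the nested format, the following pandas-only sketch shows roughly what it looks like: one row per sequence, one column per variable, and each cell holding a whole `pd.Series` over the time steps. The helper `to_nested` and the toy columns `sensor_00` and `sensor_01` are hypothetical; this approximates the result of the conversion and is not sktime's code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in a long layout: one row per (sequence, step)
long_df = pd.DataFrame({
    "sequence": [0, 0, 0, 1, 1, 1],
    "step": [0, 1, 2, 0, 1, 2],
    "sensor_00": [0.1, 0.2, 0.3, 1.0, 1.1, 1.2],
    "sensor_01": [5.0, 5.1, 5.2, 6.0, 6.1, 6.2],
})

def to_nested(df, instance_col, time_col):
    """Return one row per instance, with each cell holding a pd.Series over time."""
    value_cols = [c for c in df.columns if c not in (instance_col, time_col)]
    groups = list(df.sort_values([instance_col, time_col]).groupby(instance_col))
    instance_ids = [instance for instance, _ in groups]
    columns = {}
    for col in value_cols:
        # Fill an object array cell by cell so each element stays a whole Series
        cells = np.empty(len(groups), dtype=object)
        for k, (_, group) in enumerate(groups):
            cells[k] = pd.Series(group[col].to_numpy(), index=group[time_col].to_numpy())
        columns[col] = pd.Series(cells, index=instance_ids)
    return pd.DataFrame(columns)

nested = to_nested(long_df, instance_col="sequence", time_col="step")
print(nested.shape)                # (number of sequences, number of variables)
print(nested["sensor_00"].iloc[1]) # the full series for the second sequence
```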

In [None]:
y = train_labels["state"]
# One subject id per sequence (taken from step 0), used to keep subjects together in CV
groups = train_df.loc[train_df["step"] == 0, "subject"]
display(y.head(2))
display(groups.head(2))

Any classifier can be used on top of the Rocket features, but linear models are probably the better choice because they can combine a small amount of information from each of a large number of features. Regularisation is very important given how many features there are. The pipeline in this notebook nonetheless uses LightGBM.
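
As a rough illustration of the regularised linear option, the sketch below fits scikit-learn's `RidgeClassifierCV` on synthetic data with many more features than samples. The random features and labels are made up and merely stand in for Rocket output; the alpha grid `np.logspace(-3, 3, 10)` is the one suggested in the Rocket paper, but treat it as a starting point rather than a tuned choice.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for Rocket features: far more features than samples
rng = np.random.default_rng(0)
n_samples, n_features = 300, 2000
features = rng.normal(size=(n_samples, n_features))
true_weights = rng.normal(size=n_features)
labels = (features @ true_weights > 0).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(
    features, labels, test_size=0.3, random_state=0
)

# Cross-validate the ridge penalty over a log-spaced grid
alphas = np.logspace(-3, 3, 10)
clf = RidgeClassifierCV(alphas=alphas)
clf.fit(X_tr, y_tr)

print("chosen alpha:", clf.alpha_)
print("validation accuracy:", clf.score(X_va, y_va))
```

Note that `RidgeClassifierCV` has no `predict_proba`; scores for ROC AUC come from `decision_function` instead.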

In [None]:
# 5500 kernels x 2 features (max, ppv) = 11,000 features passed to the classifier
rocket_pipeline = make_pipeline(
    Rocket(num_kernels=5500, random_state=0, n_jobs=-1),
    LGBMClassifier(random_state=1, learning_rate=0.05, n_estimators=800, n_jobs=-1),
)
# The alpha found in previous RidgeClassifierCV runs only applies if the final step is a ridge classifier

In [None]:
%%time
group_kfold = GroupKFold(n_splits = 5)
for fold, (train_index, val_index) in enumerate(group_kfold.split(X, y, groups=groups)):
    print("==fold==", fold)
    X_train = X.loc[train_index]
    X_val = X.loc[val_index]
    
    y_train = y.loc[train_index]
    y_val = y.loc[val_index]
    
    rocket_pipeline.fit(X_train,y_train)
    
    # Only meaningful when the final pipeline step is RidgeClassifierCV:
    #print("Best alpha", rocket_pipeline[-1].alpha_)
    
    y_pred = rocket_pipeline.predict_proba(X_val)
    y_pred = y_pred[:,1]
    print('Acc', accuracy_score(y_val, y_pred.round()))
    print("ROC AUC", roc_auc_score(y_val,y_pred))
    
    
    # Alternative for classifiers without predict_proba (e.g. ridge):
    #y_pred = rocket_pipeline.decision_function(X_val)
    #print('Acc', accuracy_score(y_val, y_pred > 0))
    #print("ROC AUC", roc_auc_score(y_val,y_pred))
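
The reason for using `GroupKFold` with the subject as the group can be checked on a small example. The sketch below uses made-up subjects and dummy data (all names are hypothetical) and confirms that no subject appears in both the training and validation part of a fold.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical setup: 5 subjects with 4 sequences each
subjects = np.repeat(np.arange(5), 4)
X_dummy = np.zeros((len(subjects), 1))
y_dummy = np.tile([0, 1], len(subjects) // 2)

shared_subjects = []
splitter = GroupKFold(n_splits=5)
for train_idx, val_idx in splitter.split(X_dummy, y_dummy, groups=subjects):
    # train_idx and val_idx are row positions (0..n-1), not index labels
    overlap = set(subjects[train_idx]) & set(subjects[val_idx])
    shared_subjects.append(overlap)
    print("held-out subjects:", sorted(set(subjects[val_idx])),
          "| shared with train:", len(overlap))
```

Because the split returns positions rather than labels, `.iloc` is the safer accessor whenever the index of `X` or `y` is not simply 0 to n-1.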

# Test

In [None]:
# Refit on the full training set before predicting on the test set
rocket_pipeline.fit(X,y)

In [None]:
%%time
test_df = pd.read_csv("../input/tabular-playground-series-apr-2022/test.csv")

test_df = test_df.set_index(["sequence"])
X_test = test_df.drop(columns=["subject"])
X_test = X_test.set_index("step",append=True)
X_test = from_multi_index_to_nested(multi_ind_dataframe = X_test,  instance_index="sequence")
#X_test.to_pickle(path="./tpsapr22_nested_test.pkl")

In [None]:
preds = rocket_pipeline.predict_proba(X_test)
preds = preds[:,1]

#preds = rocket_pipeline.decision_function(X_test)

In [None]:
sample_sub = pd.read_csv("../input/tabular-playground-series-apr-2022/sample_submission.csv")
sample_sub["state"] = preds
sample_sub

In [None]:
sample_sub.to_csv('submission.csv', index = False)