# Non-NN models

We should study these notebooks:

https://www.kaggle.com/code/jeroenvdd/tpsapr22-best-non-dl-model-tsflex-powershap?scriptVersionId=94240450

https://www.kaggle.com/code/ambrosm/tpsapr22-best-model-without-nn

In [1]:
import sys
import os
sys.path.append(os.path.abspath('../'))

input_path = '../../input/tabular-playground-series-apr-2022'
output_path = '../../output'

In [2]:
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score

def load_raw_data(train_or_test='train'):
    file_name = f'{input_path}/{train_or_test}.csv'
    df = pd.read_csv(file_name)
    return df

def load_label(train_or_test='train'):
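    # Join with an explicit separator; plain concatenation would drop the '/'.
    label_file = 'train_labels.csv' if train_or_test == 'train' else 'sample_submission.csv'
    file_name = f'{input_path}/{label_file}'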
    df = pd.read_csv(file_name)
    return df['state'].values

def competition_metric(y_true, y_score):
    return roc_auc_score(y_true, y_score)

def evaluate(model, X, y):
    return competition_metric(y, model.predict_proba(X)[:, 1])

def submit(arr):
    df = pd.read_csv(f'{input_path}/sample_submission.csv')
    df['state'] = arr
    df.to_csv(f'{output_path}/submission.csv', index=False)

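def group_splitter(df, nfold=5, random_state=None):
    """Yield (row mask, label mask) pairs so that no subject is in both train and validation."""
    subject_nums = df['subject'].unique()
    rng = np.random.default_rng(random_state)
    # Each subject is assigned to a random fold, so fold sizes are only roughly equal.
    subject_to_setnum = rng.integers(0, nfold, subject_nums.shape[0])
    for i in range(nfold):
        val_subjects = subject_nums[subject_to_setnum == i]
        mask_df_val = df['subject'].isin(val_subjects)
        # Labels are per sequence, not per row: keep one flag per block of 60 rows.
        mask_y_val = mask_df_val.iloc[::60].to_numpy()
        yield mask_df_val, mask_y_val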
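
Because `group_splitter` draws a random fold for every subject, the validation folds can differ noticeably in size. The following is a minimal sketch of an alternative, assuming scikit-learn's `GroupKFold` is acceptable here and that each sequence occupies 60 contiguous rows (the same assumption the `iloc[::60]` stride makes). The name `group_kfold_masks` and the toy frame are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

SEQ_LEN = 60  # rows per sequence, matching the iloc[::60] stride used above


def group_kfold_masks(df, nfold=5):
    """Hypothetical drop-in for group_splitter: yields (row mask, label mask)."""
    seq_subjects = df['subject'].to_numpy()[::SEQ_LEN]  # one subject id per sequence
    gkf = GroupKFold(n_splits=nfold)
    for _, val_idx in gkf.split(seq_subjects, groups=seq_subjects):
        mask_y_val = np.zeros(len(seq_subjects), dtype=bool)
        mask_y_val[val_idx] = True
        # Expand the per-sequence mask back to one flag per row.
        mask_df_val = pd.Series(np.repeat(mask_y_val, SEQ_LEN), index=df.index)
        yield mask_df_val, mask_y_val


# Toy data: 20 subjects, 3 sequences each, 60 rows per sequence.
toy = pd.DataFrame({'subject': np.repeat(np.arange(20), 3 * SEQ_LEN)})
toy_folds = list(group_kfold_masks(toy, nfold=5))
for mask_df_val, mask_y_val in toy_folds:
    train_subjects = set(toy.loc[~mask_df_val, 'subject'])
    val_subjects = set(toy.loc[mask_df_val, 'subject'])
    print(len(val_subjects), int(mask_y_val.sum()),
          train_subjects.isdisjoint(val_subjects))
```

The yielded pairs have the same meaning as those from `group_splitter`, so the cross-validation loop would not need to change. Whether balanced folds are preferable to the random assignment is a judgment call for this competition.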

In [5]:
df = load_raw_data('train')
y = load_label('train')
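
The label mask produced by `group_splitter` is only correct if `df` holds exactly 60 contiguous rows per label and every sequence belongs to one subject. A small check along these lines could catch a misaligned load early. This is a sketch on toy data; `check_alignment` is a hypothetical helper, and the 60-row sequence length is taken from the `iloc[::60]` stride above.

```python
import numpy as np
import pandas as pd

SEQ_LEN = 60  # assumed rows per sequence, implied by the iloc[::60] stride


def check_alignment(df, y, seq_len=SEQ_LEN):
    """Hypothetical helper: True if df has seq_len contiguous rows per label
    and every sequence belongs to a single subject."""
    if len(df) != seq_len * len(y):
        return False
    subjects = df['subject'].to_numpy().reshape(len(y), seq_len)
    # Every row of a sequence must share the subject of its first row.
    return bool((subjects == subjects[:, :1]).all())


# Toy stand-in for the real data: 4 sequences from 2 subjects.
toy_df = pd.DataFrame({'subject': np.repeat([7, 7, 9, 9], SEQ_LEN)})
toy_y = np.array([0, 1, 1, 0])
print(check_alignment(toy_df, toy_y))            # True
print(check_alignment(toy_df.iloc[:-1], toy_y))  # False: one row is missing
```

On the real data the equivalent call would be `check_alignment(df, y)`.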

In [None]:
from ElementaryExtractor import ElementaryExtractor, TsfreshExtractor
from SWK.MBOP import MBOP
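# NOTE: CorrExtractor is used in the `extractors` list below, so it must be importable.
# The import is disabled here because 'CorrExtractor (2)' is not a valid Python module name.
# from JHLee.CorrExtractor (2) import CorrExtractor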

from lightgbm import LGBMClassifier
from sklearn.pipeline import make_union
from sklearn.metrics import classification_report
cv_scores = []

extractors = [CorrExtractor(), ElementaryExtractor(), TsfreshExtractor(), MBOP()]
extractor = make_union(*extractors)

for mask_df_val, mask_y_val in group_splitter(df, nfold=5, random_state=42):
    df_train, y_train = df[~mask_df_val], y[~mask_y_val]
    df_val, y_val = df[mask_df_val], y[mask_y_val]
    
    X_train = extractor.fit_transform(df_train)
    X_val = extractor.transform(df_val)
    print(X_train.shape, X_val.shape)
    
    clf = LGBMClassifier(num_leaves=31, max_depth=-1, n_estimators=100, random_state=42)
    clf.fit(X_train, y_train)
    print(evaluate(clf, X_train, y_train))
    val_score = evaluate(clf, X_val, y_val)
    print(val_score)
    # clf.predict already returns class labels, so no extra thresholding is needed.
    print(classification_report(y_val, clf.predict(X_val), digits=4))

    cv_scores.append(val_score)
print(f'5-fold CV score: {np.mean(cv_scores):.4f}')

Feature Extraction: 100%|██████████| 10/10 [02:09<00:00, 12.91s/it]
Feature Extraction: 100%|██████████| 10/10 [00:59<00:00,  5.98s/it]
Feature Extraction: 100%|██████████| 10/10 [01:35<00:00,  9.52s/it]
Feature Extraction: 100%|██████████| 10/10 [01:54<00:00, 11.43s/it]
Feature Extraction: 100%|██████████| 10/10 [04:34<00:00, 27.41s/it]
Feature Extraction: 100%|██████████| 10/10 [01:08<00:00,  6.87s/it]
Feature Extraction: 100%|██████████| 10/10 [00:58<00:00,  5.85s/it]
Feature Extraction: 100%|██████████| 10/10 [02:11<00:00, 13.17s/it]
Feature Extraction: 100%|██████████| 10/10 [00:11<00:00,  1.14s/it]
Feature Extraction: 100%|██████████| 10/10 [02:08<00:00, 12.87s/it]
Feature Extraction: 100%|██████████| 10/10 [04:28<00:00, 26.86s/it]
Feature Extraction: 100%|██████████| 10/10 [02:01<00:00, 12.16s/it]
Feature Extraction: 100%|██████████| 10/10 [02:39<00:00, 15.93s/it]


0-th machine fitted
1-th machine fitted
2-th machine fitted
3-th machine fitted
4-th machine fitted
5-th machine fitted
6-th machine fitted
7-th machine fitted
8-th machine fitted
9-th machine fitted
10-th machine fitted
11-th machine fitted
12-th machine fitted
reducing
fit_transform result has been saved as instance variable ft_X
all fitted


Feature Extraction: 100%|██████████| 10/10 [00:33<00:00,  3.31s/it]
Feature Extraction: 100%|██████████| 10/10 [00:14<00:00,  1.47s/it]
Feature Extraction: 100%|██████████| 10/10 [00:24<00:00,  2.40s/it]
Feature Extraction: 100%|██████████| 10/10 [00:30<00:00,  3.02s/it]
Feature Extraction: 100%|██████████| 10/10 [01:09<00:00,  6.92s/it]
Feature Extraction: 100%|██████████| 10/10 [00:18<00:00,  1.84s/it]
Feature Extraction: 100%|██████████| 10/10 [00:15<00:00,  1.57s/it]
Feature Extraction: 100%|██████████| 10/10 [00:34<00:00,  3.49s/it]
Feature Extraction: 100%|██████████| 10/10 [00:03<00:00,  3.12it/s]
Feature Extraction: 100%|██████████| 10/10 [00:34<00:00,  3.41s/it]
Feature Extraction: 100%|██████████| 10/10 [01:06<00:00,  6.66s/it]
Feature Extraction: 100%|██████████| 10/10 [00:31<00:00,  3.11s/it]
Feature Extraction: 100%|██████████| 10/10 [00:40<00:00,  4.08s/it]


0-th channel finished
number of pure features of 0 BOP=(74,)
1-th channel finished
number of pure features of 1 BOP=(76,)
2-th channel finished
number of pure features of 2 BOP=(130,)
3-th channel finished
number of pure features of 3 BOP=(75,)
4-th channel finished
number of pure features of 4 BOP=(972,)
5-th channel finished
number of pure features of 5 BOP=(310,)
6-th channel finished
number of pure features of 6 BOP=(74,)
7-th channel finished
number of pure features of 7 BOP=(74,)
8-th channel finished
number of pure features of 8 BOP=(79,)
9-th channel finished
number of pure features of 9 BOP=(77,)
10-th channel finished
number of pure features of 10 BOP=(1102,)
11-th channel finished
number of pure features of 11 BOP=(72,)
12-th channel finished
number of pure features of 12 BOP=(1287,)
shape=(5151, 406)
(20817, 909) (5151, 909)
0.997048055973935
0.9679046719638748
              precision    recall  f1-score   support

           0       0.93      0.88      0.91      2592
     

Feature Extraction: 100%|██████████| 10/10 [02:18<00:00, 13.87s/it]
Feature Extraction: 100%|██████████| 10/10 [01:02<00:00,  6.25s/it]
Feature Extraction: 100%|██████████| 10/10 [01:39<00:00,  9.98s/it]
Feature Extraction: 100%|██████████| 10/10 [02:04<00:00, 12.47s/it]
Feature Extraction: 100%|██████████| 10/10 [04:52<00:00, 29.23s/it]
Feature Extraction: 100%|██████████| 10/10 [01:15<00:00,  7.52s/it]
Feature Extraction: 100%|██████████| 10/10 [01:02<00:00,  6.25s/it]
Feature Extraction: 100%|██████████| 10/10 [02:20<00:00, 14.07s/it]
Feature Extraction: 100%|██████████| 10/10 [00:12<00:00,  1.20s/it]
Feature Extraction: 100%|██████████| 10/10 [02:18<00:00, 13.82s/it]
Feature Extraction: 100%|██████████| 10/10 [04:39<00:00, 27.95s/it]
Feature Extraction: 100%|██████████| 10/10 [02:10<00:00, 13.07s/it]
Feature Extraction: 100%|██████████| 10/10 [02:46<00:00, 16.69s/it]


0-th machine fitted
1-th machine fitted
2-th machine fitted
3-th machine fitted
4-th machine fitted
5-th machine fitted
6-th machine fitted
7-th machine fitted
8-th machine fitted
9-th machine fitted
10-th machine fitted
11-th machine fitted
12-th machine fitted
reducing
fit_transform result has been saved as instance variable ft_X
all fitted


Feature Extraction: 100%|██████████| 10/10 [00:28<00:00,  2.87s/it]
Feature Extraction: 100%|██████████| 10/10 [00:12<00:00,  1.28s/it]
Feature Extraction: 100%|██████████| 10/10 [00:21<00:00,  2.12s/it]
Feature Extraction: 100%|██████████| 10/10 [00:26<00:00,  2.65s/it]
Feature Extraction: 100%|██████████| 10/10 [01:00<00:00,  6.09s/it]
Feature Extraction: 100%|██████████| 10/10 [00:15<00:00,  1.52s/it]
Feature Extraction: 100%|██████████| 10/10 [00:12<00:00,  1.27s/it]
Feature Extraction: 100%|██████████| 10/10 [00:29<00:00,  2.90s/it]
Feature Extraction: 100%|██████████| 10/10 [00:02<00:00,  3.50it/s]
Feature Extraction: 100%|██████████| 10/10 [00:28<00:00,  2.84s/it]
Feature Extraction: 100%|██████████| 10/10 [00:57<00:00,  5.76s/it]
Feature Extraction: 100%|██████████| 10/10 [00:27<00:00,  2.73s/it]
Feature Extraction: 100%|██████████| 10/10 [00:35<00:00,  3.51s/it]


0-th channel finished
number of pure features of 0 BOP=(76,)
1-th channel finished
number of pure features of 1 BOP=(81,)
2-th channel finished
number of pure features of 2 BOP=(132,)
3-th channel finished
number of pure features of 3 BOP=(85,)
4-th channel finished
number of pure features of 4 BOP=(775,)
5-th channel finished
number of pure features of 5 BOP=(164,)
6-th channel finished
number of pure features of 6 BOP=(75,)
7-th channel finished
number of pure features of 7 BOP=(81,)
8-th channel finished
number of pure features of 8 BOP=(79,)
9-th channel finished
number of pure features of 9 BOP=(81,)
10-th channel finished
number of pure features of 10 BOP=(1065,)
11-th channel finished
number of pure features of 11 BOP=(73,)
12-th channel finished
number of pure features of 12 BOP=(1290,)
shape=(4599, 406)
(21369, 909) (4599, 909)
0.9967701620475397
0.9665739281037276
              precision    recall  f1-score   support

           0       0.94      0.88      0.91      2412
    

Feature Extraction: 100%|██████████| 10/10 [02:05<00:00, 12.53s/it]
Feature Extraction: 100%|██████████| 10/10 [00:54<00:00,  5.48s/it]
Feature Extraction: 100%|██████████| 10/10 [01:31<00:00,  9.11s/it]
Feature Extraction: 100%|██████████| 10/10 [01:51<00:00, 11.15s/it]
Feature Extraction: 100%|██████████| 10/10 [04:19<00:00, 25.96s/it]
Feature Extraction: 100%|██████████| 10/10 [01:07<00:00,  6.71s/it]
Feature Extraction: 100%|██████████| 10/10 [00:56<00:00,  5.70s/it]
Feature Extraction: 100%|██████████| 10/10 [02:06<00:00, 12.69s/it]
Feature Extraction: 100%|██████████| 10/10 [00:10<00:00,  1.09s/it]
Feature Extraction: 100%|██████████| 10/10 [02:02<00:00, 12.21s/it]
Feature Extraction: 100%|██████████| 10/10 [04:05<00:00, 24.56s/it]
Feature Extraction: 100%|██████████| 10/10 [02:02<00:00, 12.21s/it]
Feature Extraction: 100%|██████████| 10/10 [02:34<00:00, 15.46s/it]


0-th machine fitted
1-th machine fitted
2-th machine fitted
3-th machine fitted
4-th machine fitted
5-th machine fitted
6-th machine fitted
7-th machine fitted
8-th machine fitted
9-th machine fitted
10-th machine fitted
11-th machine fitted
12-th machine fitted
reducing
fit_transform result has been saved as instance variable ft_X
all fitted


Feature Extraction: 100%|██████████| 10/10 [00:37<00:00,  3.76s/it]
Feature Extraction: 100%|██████████| 10/10 [00:16<00:00,  1.64s/it]
Feature Extraction: 100%|██████████| 10/10 [00:27<00:00,  2.79s/it]
Feature Extraction: 100%|██████████| 10/10 [00:32<00:00,  3.30s/it]
Feature Extraction: 100%|██████████| 10/10 [01:17<00:00,  7.79s/it]
Feature Extraction: 100%|██████████| 10/10 [00:20<00:00,  2.03s/it]
Feature Extraction: 100%|██████████| 10/10 [00:17<00:00,  1.75s/it]
Feature Extraction: 100%|██████████| 10/10 [00:37<00:00,  3.77s/it]
Feature Extraction: 100%|██████████| 10/10 [00:03<00:00,  2.87it/s]
Feature Extraction: 100%|██████████| 10/10 [00:36<00:00,  3.65s/it]
Feature Extraction: 100%|██████████| 10/10 [01:14<00:00,  7.50s/it]
Feature Extraction: 100%|██████████| 10/10 [00:36<00:00,  3.66s/it]
Feature Extraction: 100%|██████████| 10/10 [00:46<00:00,  4.64s/it]


0-th channel finished
number of pure features of 0 BOP=(75,)
1-th channel finished
number of pure features of 1 BOP=(81,)
2-th channel finished
number of pure features of 2 BOP=(127,)
3-th channel finished
number of pure features of 3 BOP=(84,)
4-th channel finished
number of pure features of 4 BOP=(994,)
5-th channel finished
number of pure features of 5 BOP=(325,)
6-th channel finished
number of pure features of 6 BOP=(74,)
7-th channel finished
number of pure features of 7 BOP=(81,)
8-th channel finished
number of pure features of 8 BOP=(79,)
9-th channel finished
number of pure features of 9 BOP=(81,)
10-th channel finished
number of pure features of 10 BOP=(1115,)
11-th channel finished
number of pure features of 11 BOP=(71,)
12-th channel finished
number of pure features of 12 BOP=(1289,)
shape=(6004, 406)
(19964, 909) (6004, 909)
0.9977424039223814
0.9510285631120259
              precision    recall  f1-score   support

           0       0.89      0.87      0.88      2789
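
The three folds whose output is visible above report a training AUC near 0.997 and a validation AUC between 0.951 and 0.968. As a rough way to summarize those numbers, the sketch below computes the mean and spread of the validation AUC and the average train/validation gap. The values are copied from the log. The remaining two folds are not shown, so this is not the full 5-fold score.

```python
import numpy as np

# AUC values copied from the three folds whose output is visible above.
train_auc = np.array([0.997048055973935, 0.9967701620475397, 0.9977424039223814])
val_auc = np.array([0.9679046719638748, 0.9665739281037276, 0.9510285631120259])

val_mean = val_auc.mean()
val_std = val_auc.std(ddof=1)       # sample standard deviation across folds
gap = (train_auc - val_auc).mean()  # average train-minus-validation AUC

print(f'validation AUC: {val_mean:.4f} +/- {val_std:.4f}')
print(f'mean train/validation gap: {gap:.4f}')
```

A gap of this size suggests the classifier fits the training subjects more closely than it generalizes to unseen ones, which may be worth keeping in mind when choosing the final hyperparameters.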
    

In [None]:
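# Note: max_depth=4 and the missing random_state differ from the settings used in the
# cross-validation loop (max_depth=-1, random_state=42), so the CV score above does not
# describe exactly this model.
clf = LGBMClassifier(num_leaves=31, max_depth=4, n_estimators=100)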

df_train_final = df
y_train_final = y
X_train_final = extractor.fit_transform(df_train_final)
clf.fit(X_train_final, y_train_final)

df_test_final = load_raw_data('test')
X_test_final = extractor.transform(df_test_final)
y_pred = clf.predict_proba(X_test_final)[:, 1]
submit(y_pred)