# Overview

In this competition, you'll classify 60-second sequences of sensor data, indicating whether a subject was in either of two activity states for the duration of the sequence.

**Files and Field Descriptions**

- **train.csv**: the training set, comprising ~26,000 60-second recordings of thirteen biological sensors for almost one thousand experimental participants
    - *sequence* - a unique id for each sequence
    - *subject* - a unique id for the subject in the experiment
    - *step* - time step of the recording, in one second intervals
    - *sensor_00* to *sensor_12* - the value for each of the thirteen sensors at that time step
- **train_labels.csv**: the class label for each sequence.
    - *sequence* - the unique id for each sequence.
    - *state* - the state associated to each sequence. This is the target which you are trying to predict.
- **test.csv**: the test set. For each of the ~12,000 sequences, you should predict a value for that sequence's state.
- **sample_submission.csv**: a sample submission file in the correct format.

# Importing packages and loading dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import seaborn as sns
from itertools import chain
from sklearn import metrics
import scipy.stats
%matplotlib inline
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score



In [None]:
# Load the data

train = pd.read_csv('../input/tabular-playground-series-apr-2022/train.csv')
train_labels = pd.read_csv('../input/tabular-playground-series-apr-2022/train_labels.csv')

test = pd.read_csv('../input/tabular-playground-series-apr-2022/test.csv')

# Exploratory Data Analysis (EDA) with Pandas and NumPy

Thanks to [AMBROSM](https://www.kaggle.com/code/ambrosm/tpsapr22-eda-which-makes-sense) for his always useful and inspiring EDA notebook.

See my  [EDA + Viz notebook](https://www.kaggle.com/code/girolamanotarangelo/classification-states-eda-viz-tps-april-2022)

In [None]:
train

In [None]:
train_labels

In [None]:
train.info()

In [None]:
train_labels.info()

In [None]:
test.info()

In [None]:
print(f'Subject numbering in train: from {train.subject.min()} to {train.subject.max()}')
print(f'Subject numbering in test: from {test.subject.min()} to {test.subject.max()}')
print()

Comments:
- There are **25968 sequences** (labeled **from 0 to 25967**) in the **train** data, from 672 subjects.
- The train data has ***1558080*** rows, which makes sense since each sequence has **60 steps, one step per second** (25968*60=1558080).
- There are no missing values.
- Every sequence has **60 * 13 = 780 features.**
- The **test** data has **12218 sequences** (labeled **from 25968 to 38185**).
- The **train and test subjects are different**, so we cannot use the subject id as a feature.
- We need to predict the state of each sequence in the test data.
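
The checks behind these comments can be written as a small helper. The sketch below runs on a tiny made-up frame; `check_layout`, `toy_train` and `toy_test` are hypothetical names used only for illustration, and the toy sequences have 3 steps instead of 60.

```python
import pandas as pd

def check_layout(train_df, test_df, steps_per_sequence=60):
    """Return basic layout facts for long-format sensor data."""
    n_sequences = train_df['sequence'].nunique()
    return {
        'n_sequences': n_sequences,
        # every sequence should contribute exactly steps_per_sequence rows
        'rows_match': len(train_df) == n_sequences * steps_per_sequence,
        'has_missing': bool(train_df.isna().any().any()),
        # subjects that appear in both train and test
        'shared_subjects': set(train_df['subject']) & set(test_df['subject']),
    }

# Toy data (not the competition files): 2 train sequences, 1 test sequence, 3 steps each.
toy_train = pd.DataFrame({
    'sequence': [0, 0, 0, 1, 1, 1],
    'subject': [10, 10, 10, 11, 11, 11],
    'step': [0, 1, 2, 0, 1, 2],
    'sensor_00': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})
toy_test = pd.DataFrame({
    'sequence': [2, 2, 2],
    'subject': [20, 20, 20],
    'step': [0, 1, 2],
    'sensor_00': [0.7, 0.8, 0.9],
})

layout = check_layout(toy_train, toy_test, steps_per_sequence=3)
print(layout)
```

On the real data the same call would be `check_layout(train, test)` with the default of 60 steps.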

## Creating 'state' column

In [None]:
train = train.merge(train_labels, how='left')
train.head(123)
#train.loc[train['sequence'] == 21401]

## Removing sequences with stuck values ##

Thanks to [WALDEMAR](https://www.kaggle.com/code/waldemar/63-or-44-outliers-to-remove/) for the idea of looking for sensors stuck on a single value.

In [None]:
train_unique_1 = train.drop(['subject', 'step', 'state', 'sensor_02'], axis=1).groupby(['sequence']).agg(lambda x: x.nunique() == 1).sum(axis=1).sort_values(ascending=False)

In [None]:
train_unique_1

In [None]:
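# NOTE: the filter keeps sequences with more than one stuck sensor (> 1),
# while the variable name and message say "at least 8"; use >= 8 for that threshold.
at_least_8 = train_unique_1[train_unique_1>1]
print('Sequences with at least 8 sensors stuck: ', len(at_least_8))
stuck = list(at_least_8.index)
stuck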

In [None]:
len(stuck)

In [None]:
train.loc[(train['sequence'].isin(stuck)) & (train['state'] == 1)]

In [None]:
train = train.drop(train.loc[train['sequence'].isin(stuck)].index, axis = 0)
train_labels = train_labels.drop(train_labels.loc[train_labels['sequence'].isin(stuck)].index, axis = 0)

In [None]:
train.loc[train['sequence'].isin(stuck)]

All the sequences with at least 8 sensors stuck have state 0.
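
As a minimal sketch of the idea, the snippet below counts, for each sequence, how many sensors hold a single value across all steps. The frame `toy`, the helper `count_stuck_sensors` and the threshold `min_stuck` are hypothetical and chosen only for illustration.

```python
import pandas as pd

def count_stuck_sensors(df, sensor_cols):
    """Per sequence, count the sensors that show only one distinct value."""
    is_constant = df.groupby('sequence')[sensor_cols].nunique() == 1
    return is_constant.sum(axis=1)

# Toy data: sequence 0 has both sensors stuck, sequence 1 has only sensor_01 stuck.
toy = pd.DataFrame({
    'sequence': [0, 0, 0, 1, 1, 1],
    'sensor_00': [1.0, 1.0, 1.0, 0.1, 0.2, 0.3],
    'sensor_01': [5.0, 5.0, 5.0, 2.0, 2.0, 2.0],
})

stuck_counts = count_stuck_sensors(toy, ['sensor_00', 'sensor_01'])
min_stuck = 2  # hypothetical threshold for this toy example
toy_stuck = list(stuck_counts[stuck_counts >= min_stuck].index)
print(toy_stuck)
```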

In [None]:
#train.describe()

## Looking at averages ##

Thanks to [JIRI PRUDKY](https://www.kaggle.com/code/jiprud/tps-apr22-rookie-eda-submission/notebook) for his inspiring simple notebook.

Looking for differences between two states in terms of averages.

In [None]:
means = train.groupby('state').mean()
display(means)
display(means.diff()) # difference between state 0 and 1

In [None]:
medians = train.groupby('state').median()
display(medians)
display(medians.diff()) # difference between state 0 and 1

It looks like there are differences between the two states in terms of averages.
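
To make the comparison concrete, here is the same group-and-diff pattern on a tiny made-up frame (`toy` is hypothetical data, not taken from the competition files). With two states, the last row of `diff()` holds the state 1 average minus the state 0 average.

```python
import pandas as pd

# Toy data: one sensor, two rows per state.
toy = pd.DataFrame({
    'state': [0, 0, 1, 1],
    'sensor_00': [1.0, 3.0, 5.0, 7.0],
})

toy_means = toy.groupby('state').mean()
# diff() subtracts the previous row, so the last row is state 1 minus state 0.
toy_gap = toy_means.diff().iloc[-1]
print(toy_means)
print(toy_gap)
```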

## Counting sequences per subject ##

Now let's see how many sequences there are per subject.

In [None]:
# counting how many sequences per subject
count_sub = pd.DataFrame(train.subject.value_counts().sort_values().reset_index() )
count_sub

In [None]:
count_sub['number of sequences'] = (count_sub['subject']/60).astype(int) #dividing by 60 seconds to obtain the right count
count_sub.drop(['subject'], axis = 1, inplace = True)

In [None]:
count_sub['subject'] = count_sub['index']
count_sub.drop(['index'], axis = 1, inplace = True)
count_sub

In this way, by using the train labels, we know the state of each sequence.
It looks like grouping by sequence is a useful way to gather information for classification.
See my [EDA + Viz notebook](https://www.kaggle.com/code/girolamanotarangelo/classification-states-eda-viz-tps-april-2022) for a graph showing that subjects with more sequences tend to be in state 1.
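
The per-subject count can also be obtained without dividing row counts by 60. The sketch below is an alternative, not the code used in this notebook: it counts distinct sequence ids per subject, which avoids relying on the column names produced by `value_counts().reset_index()` (these differ between pandas 1.x and 2.x). The helper name `sequences_per_subject` and the frame `toy` are hypothetical.

```python
import pandas as pd

def sequences_per_subject(df):
    """Return a frame indexed by subject with the number of distinct sequences."""
    counts = df.groupby('subject')['sequence'].nunique()
    return counts.rename('number of sequences').to_frame()

# Toy data: subject 1 has two sequences, subject 2 has one (2 steps each).
toy = pd.DataFrame({
    'subject': [1, 1, 1, 1, 2, 2],
    'sequence': [0, 0, 1, 1, 2, 2],
    'step': [0, 1, 0, 1, 0, 1],
})

toy_counts = sequences_per_subject(toy)
print(toy_counts)
```

The result is already indexed by `subject`, so it can be joined onto a frame that carries a `subject` index level.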

## Feature correlation ##

See my  [EDA + Viz notebook](https://www.kaggle.com/code/girolamanotarangelo/classification-states-eda-viz-tps-april-2022)


## Visualisations ##

See my  [EDA + Viz notebook](https://www.kaggle.com/code/girolamanotarangelo/classification-states-eda-viz-tps-april-2022)


# Feature engineering

Thanks to **AMBROSM** for his useful [advice](https://www.kaggle.com/competitions/tabular-playground-series-apr-2022/discussion/318527) and model [notebook](https://www.kaggle.com/code/ambrosm/tpsapr22-best-model-without-nn).

Thanks to **JIRI PRUDKY** for his simple but effective [model](https://www.kaggle.com/code/jiprud/tps-apr22-rookie-eda-submission/notebook).

In [None]:
def features(df):
    out_df = df.groupby('sequence').agg(['mean','max', 'min', 'std', scipy.stats.variation, scipy.stats.iqr,'median', 'skew', pd.DataFrame.kurt])
    #out_df2 = df.groupby('sequence').apply(pd.DataFrame.kurt)
    out_df.columns = ['_'.join(col).strip() for col in out_df.columns]

    return out_df

In [None]:
sensors = [col for col in train.columns if 'sensor_' in col]

def engineer(df):
    """Compute per-sequence summary features from a frame pivoted to one row per sequence."""
    new_df = pd.DataFrame([], index=df.index)
    for sensor in sensors:
        new_df[sensor + '_mean'] = df[sensor].mean(axis=1)
        #new_df[sensor + '_max'] = df[sensor].max(axis=1)
        #new_df[sensor + '_min'] = df[sensor].min(axis=1)
        #new_df[sensor + '_var'] = df[sensor].var(axis=1)
        #new_df[sensor + '_mad'] = df[sensor].mad(axis=1)
        #new_df[sensor + '_sum'] = df[sensor].sum(axis=1)
        new_df[sensor + '_std'] = df[sensor].std(axis=1)
        # Coefficient of variation: standard deviation divided by the absolute mean.
        new_df[sensor + '_sm'] = np.nan_to_num(new_df[sensor + '_std'] /
                                               new_df[sensor + '_mean'].abs()).clip(-1e30, 1e30)
        new_df[sensor + '_iqr'] = scipy.stats.iqr(df[sensor], axis=1)
        new_df[sensor + '_median'] = df[sensor].median(axis=1)
        #new_df[sensor + '_skew'] = df[sensor].skew(axis=1)
        new_df[sensor + '_kurtosis'] = scipy.stats.kurtosis(df[sensor], axis=1)
    # sensor_02 step-to-step changes. They do not depend on the loop variable,
    # so they are computed once after the loop.
    diff_02 = df.sensor_02.diff(axis=1)
    new_df['sensor_02_up'] = (diff_02 > 0).sum(axis=1)
    new_df['sensor_02_down'] = (diff_02 < 0).sum(axis=1)
    new_df['sensor_02_upsum'] = diff_02.clip(0, None).sum(axis=1)
    new_df['sensor_02_downsum'] = diff_02.clip(None, 0).sum(axis=1)
    new_df['sensor_02_upmax'] = diff_02.max(axis=1)
    new_df['sensor_02_downmax'] = diff_02.min(axis=1)
    new_df['sensor_02_upmean'] = np.nan_to_num(new_df['sensor_02_upsum'] / new_df['sensor_02_up'], posinf=40)
    new_df['sensor_02_downmean'] = np.nan_to_num(new_df['sensor_02_downsum'] / new_df['sensor_02_down'], neginf=-40)
    return new_df

In [None]:
train_pivoted = train.pivot(index=['sequence','subject','state'], columns='step', values=[col for col in train.columns if 'sensor_' in col])

train_pivoted

In [None]:
train_pivoted_feat = engineer(train_pivoted)
train_pivoted_feat
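
To show what the pivot and the row-wise statistics do, here is a minimal sketch on made-up data. It covers a single sensor and only a few of the features; `toy`, `wide` and `toy_feat` are hypothetical names.

```python
import pandas as pd

# Toy long-format data: two sequences of three steps for one sensor.
toy = pd.DataFrame({
    'sequence': [0, 0, 0, 1, 1, 1],
    'step': [0, 1, 2, 0, 1, 2],
    'sensor_02': [1.0, 3.0, 2.0, 4.0, 4.0, 4.0],
})

# One row per sequence, one column per (sensor, step) pair.
wide = toy.pivot(index='sequence', columns='step', values=['sensor_02'])

steps = wide['sensor_02']      # the step columns of a single sensor
diffs = steps.diff(axis=1)     # change from one step to the next

toy_feat = pd.DataFrame({
    'sensor_02_mean': steps.mean(axis=1),
    'sensor_02_up': (diffs > 0).sum(axis=1),             # number of increases
    'sensor_02_down': (diffs < 0).sum(axis=1),           # number of decreases
    'sensor_02_upsum': diffs.clip(lower=0).sum(axis=1),  # total upward movement
})
print(toy_feat)
```

Each row of `toy_feat` summarises one sequence, which is the layout the classifiers below expect.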

In [None]:
count_sub

In [None]:
count_sub.set_index('subject', inplace=True)
count_sub

In [None]:
# Adding the 'number of sequences' column

train_pivoted_feat = train_pivoted_feat.join(count_sub, how = 'inner') # join the two dataframes on subject to add the count column

In [None]:
train_pivoted_feat

In [None]:
train_pivoted_feat1 = train_pivoted_feat.droplevel(1)
train_pivoted_feat1
train_pivoted_feat2 = train_pivoted_feat1.droplevel(1)
train_pivoted_feat2

In [None]:
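X = train_pivoted_feat2

# Align the labels with the rows of X by sequence id rather than by position
y = train_labels.set_index('sequence').loc[X.index, 'state']

X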

In [None]:
y

In [None]:
# Features to drop, taken from AMBROSM's EDA.
# Entries ending in '_kurt' do not match the '_kurtosis' columns created by engineer(),
# so they have no effect on the filter below.

dropped_features = ['sensor_05_kurt', 'sensor_08_mean',
                    'sensor_05_std', 'sensor_06_kurt',
                    'sensor_06_std', 'sensor_03_std',
                    'sensor_02_kurt', 'sensor_03_kurt',
                    'sensor_09_kurt', 'sensor_03_mean',
                    'sensor_00_mean', 'sensor_02_iqr',
                    'sensor_05_mean', 'sensor_06_mean',
                    'sensor_07_std', 'sensor_10_iqr',
                    'sensor_11_iqr', 'sensor_12_iqr',
                    'sensor_09_mean',
                    'sensor_05_iqr',
                    'sensor_09_iqr',
                    'sensor_07_iqr', 'sensor_10_mean']

In [None]:
selected_columns = X.columns
selected_columns = [f for f in selected_columns if f not in dropped_features]
len(selected_columns)

In [None]:
X = X[selected_columns]
X

In [None]:
index = [i for i in X.index if i not in stuck]

In [None]:
len(index)

In [None]:
X = X.loc[X.index.isin(index)]
X

# Sequential Feature selection ([SFS](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html)) ##

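We select the "best" 45 features from the engineered training features via Sequential Forward Selection (SFS), using the `mlxtend` implementation rather than the scikit-learn one linked in the heading. Here, we set forward=True and floating=False. By choosing cv=0, we don't perform any cross-validation, so the performance (here: 'accuracy') is computed entirely on the training set. The selection cell is commented out below, and the 45 features kept are listed explicitly further down.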
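
For comparison, here is a minimal sketch of forward selection with the scikit-learn `SequentialFeatureSelector` linked in the heading. It is not the `mlxtend` run used for this notebook: the data is synthetic, `LogisticRegression` is only a fast stand-in estimator, and the scikit-learn selector scores each candidate feature by cross-validation (`cv=3` here).

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the engineered features (assumption for illustration).
X_toy, y_toy = make_classification(n_samples=200, n_features=6, n_informative=3,
                                   n_redundant=0, random_state=0)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,   # analogous to k_features in mlxtend
    direction='forward',      # add one feature at a time
    scoring='accuracy',
    cv=3,
)
selector.fit(X_toy, y_toy)

selected_mask = selector.get_support()   # boolean mask over the 6 columns
X_selected = selector.transform(X_toy)   # keeps only the selected columns
print(selected_mask)
```

With a DataFrame as input, `X.columns[selector.get_support()]` would give the selected column names.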

In [None]:
from sklearn.ensemble import HistGradientBoostingRegressor, HistGradientBoostingClassifier

In [None]:
estimator = HistGradientBoostingClassifier()

In [None]:
# from xgboost  import XGBClassifier
# xgb = XGBClassifier(use_label_encoder=False, random_state = 2, eval_metric = 'logloss')

In [None]:
# from mlxtend.feature_selection import SequentialFeatureSelector as SFS

# sfs1 = SFS(estimator, 
#            k_features=45, 
#            forward=True, 
#            floating=False, 
#            verbose=2,
#            scoring='accuracy',
#            cv=0)

# sfs1 = sfs1.fit(X, y)

In [None]:
#sfs1.subsets_

In [None]:
X = X[['sensor_00_std',
   'sensor_00_sm',
   'sensor_00_median',
   'sensor_00_kurtosis',
   'sensor_02_upsum',
   'sensor_02_downsum',
   'sensor_02_upmax',
   'sensor_02_downmax',
   'sensor_02_upmean',
   'sensor_01_std',
   'sensor_01_iqr',
   'sensor_02_mean',
   'sensor_02_std',
   'sensor_02_sm',
   'sensor_02_kurtosis',
   'sensor_03_sm',
   'sensor_03_iqr',
   'sensor_03_median',
   'sensor_03_kurtosis',
   'sensor_04_mean',
   'sensor_04_std',
   'sensor_04_sm',
   'sensor_04_iqr',
   'sensor_04_median',
   'sensor_04_kurtosis',
   'sensor_05_sm',
   'sensor_05_median',
   'sensor_06_sm',
   'sensor_06_iqr',
   'sensor_07_mean',
   'sensor_07_median',
   'sensor_08_iqr',
   'sensor_08_kurtosis',
   'sensor_09_std',
   'sensor_09_median',
   'sensor_09_kurtosis',
   'sensor_10_std',
   'sensor_10_sm',
   'sensor_10_kurtosis',
   'sensor_11_sm',
   'sensor_11_kurtosis',
   'sensor_12_std',
   'sensor_12_sm',
   'sensor_12_kurtosis',
   'number of sequences']]

In [None]:
X

In [None]:
# X = X[['sensor_00_std',
#    'sensor_00_iqr',
#    'sensor_01_mean',
#    'sensor_01_max',
#    'sensor_01_skew',
#    'sensor_02_mean',
#    'sensor_02_std',
#    'sensor_04_mean',
#    'sensor_04_std',
#    'sensor_04_skew',
#    'sensor_04_kurt',
#    'sensor_05_max',
#    'sensor_05_min',
#    'sensor_06_min',
#    'sensor_07_mean',
#    'sensor_07_skew',
#    'sensor_08_std',
#    'sensor_08_skew',
#    'sensor_08_kurt',
#    'sensor_09_min',
#    'sensor_09_std',
#    'sensor_10_min',
#    'sensor_10_kurt',
#    'sensor_11_mean',
#    'sensor_11_max',
#    'sensor_11_median',
#    'sensor_12_max',
#    'sensor_12_std',
#    'sensor_12_skew',
#    'sensor_12_kurt']]
# X

# Train and split ##

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)#, random_state=42, stratify = y) 
#stratify parameter will preserve the proportion of target as in original dataset, in the train and test datasets as well.
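
A minimal sketch of what `stratify` does, on made-up labels with a 75/25 class balance. The arrays `X_toy` and `y_toy` are hypothetical, and `random_state` is fixed only to make the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 15 of class 0 and 5 of class 1.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 15 + [1] * 5)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

# Both parts keep the 25% share of class 1.
print(y_tr.mean(), y_va.mean())
```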

# Gradient boosting ##

In [None]:
#from sklearn.ensemble import GradientBoostingClassifier

In [None]:
#from sklearn.model_selection import KFold

In [None]:
# gradient_booster = GradientBoostingClassifier(learning_rate=0.1, n_estimators = 100)
# gradient_booster.get_params()

In [None]:
#gradient_booster.fit(X_train,y_train)

In [None]:
# y_pred = gradient_booster.predict(X_test)
# y_pred

In [None]:
# y_pred_proba = gradient_booster.predict_proba(X_test)
# y_pred_proba

In [None]:
# print(classification_report(y_test,y_pred))
# accuracy = accuracy_score(y_test, y_pred)
# print("Accuracy: %.2f%%" % (accuracy * 100.0))

In [None]:
# score = GradientBoostingClassifier.score(gradient_booster, X_test,y_test)
# print('Test Accuracy Score',score)

In [None]:
# from sklearn.model_selection import cross_val_score
# accuracy = cross_val_score(GradientBoostingClassifier(learning_rate=0.1), X_train, y_train,cv=3)
# accuracy

In [None]:
# #get the mean of each fold 
# print("Accuracy of Model with Cross Validation is:",accuracy.mean() * 100)

In [None]:
#fpr, tpr, _ = metrics.roc_curve(y_test, y_pred)

In [None]:
#fpr

In [None]:
# def plot_roc_curve(y_va, y_va_pred):
#     plt.figure(figsize=(8, 8))
#     fpr, tpr, _ = metrics.roc_curve(y_va, y_va_pred)
#     plt.plot(fpr, tpr, color='r', lw=2)
#     plt.plot([0, 1], [0, 1], color="navy", lw=1, linestyle="--")
#     plt.gca().set_aspect('equal')
#     plt.xlim([0.0, 1.0])
#     plt.ylim([0.0, 1.0])
#     plt.xlabel("False Positive Rate")
#     plt.ylabel("True Positive Rate")
#     plt.title("Receiver operating characteristic")
#     plt.show()



In [None]:
# plot_roc_curve(y_test, y_pred_proba[:,1])
# print(metrics.auc(fpr, tpr))

# XGBoost ( eXtreme Gradient Boosting ) ## 

In [None]:
params = {'n_estimators': 1200,
          'max_depth': 7,
          'learning_rate': 0.15,
          'subsample': 0.95,
          'colsample_bytree': 0.60,
          'reg_lambda': 1.50,
          'reg_alpha': 6.10,
          'gamma': 1.40,
          'random_state': 42,
          'eval_metric' : 'logloss',
          #'tree_method': 'gpu_hist',
         }

In [None]:
# params = {'n_estimators': 8192,
#           'max_depth': 7,
#           'learning_rate': 0.1,
#           'subsample': 0.96,
#           'colsample_bytree': 0.80,
#           'reg_lambda': 1.50,
#           'reg_alpha': 6.10,
#           'gamma': 1.40,
#           'random_state': 16,
#           'eval_metric' : 'logloss',
#           #'tree_method': 'gpu_hist',
#          }


In [None]:
from xgboost  import XGBClassifier
#xgb = XGBClassifier(random_state = 2)
xgb = XGBClassifier(n_estimators=500, n_jobs=-1,
                          eval_metric=['logloss'],
                          #max_depth=10,
                          colsample_bytree=0.8,
                          #gamma=1.4,
                          reg_alpha=6, reg_lambda=1.5,
                          tree_method='hist',
                          learning_rate=0.03,
                          verbosity=1,
                          use_label_encoder=False, random_state=3)
#xgb = XGBClassifier(**params, use_label_encoder=False)
# xgb = XGBClassifier(n_estimators=800, n_jobs=-1,
#                           eval_metric=['logloss'],
#                           max_depth=10,
#                           colsample_bytree=0.8,
#                           gamma=1.4,
#                           reg_alpha=6, reg_lambda=1.5,
#                           tree_method='hist',
#                           learning_rate=0.03,
#                           verbosity=1,
#                           use_label_encoder=False, random_state=3)

In [None]:
xgb.fit(X_train, y_train)

In [None]:
# make predictions for test data
y_pred_XGB = xgb.predict(X_test)
y_pred_XGB

In [None]:
y_pred_XGB_proba = xgb.predict_proba(X_test)
y_pred_XGB_proba

In [None]:
print(classification_report(y_test,y_pred_XGB))
accuracy = accuracy_score(y_test, y_pred_XGB)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

In [None]:
from sklearn.model_selection import cross_val_score
accuracy = cross_val_score(xgb, X_train, y_train,cv=3)
accuracy

In [None]:
#get the mean of each fold 
print("Accuracy of Model with Cross Validation is:",accuracy.mean() * 100)

In [None]:
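# Use the predicted probability of class 1, not the hard labels, so the AUC printed below matches the plotted curve
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_XGB_proba[:, 1])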

In [None]:
fpr

In [None]:
def plot_roc_curve(y_va, y_va_pred):
    plt.figure(figsize=(8, 8))
    fpr, tpr, _ = metrics.roc_curve(y_va, y_va_pred)
    plt.plot(fpr, tpr, color='r', lw=2)
    plt.plot([0, 1], [0, 1], color="navy", lw=1, linestyle="--")
    plt.gca().set_aspect('equal')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("Receiver operating characteristic")
    plt.show()

In [None]:
plot_roc_curve(y_test, y_pred_XGB_proba[:,1])
print(metrics.auc(fpr, tpr))
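
The area under the ROC curve is a ranking metric, so it is normally computed from predicted probabilities rather than from hard class labels. The sketch below, on made-up scores, shows that integrating the curve with `metrics.auc` and calling `metrics.roc_auc_score` directly give the same value; `y_true` and `y_score` are hypothetical.

```python
import numpy as np
from sklearn import metrics

# Toy labels and predicted probabilities for class 1.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr_toy, tpr_toy, _ = metrics.roc_curve(y_true, y_score)
auc_from_curve = metrics.auc(fpr_toy, tpr_toy)       # area under the curve points
auc_direct = metrics.roc_auc_score(y_true, y_score)  # same quantity in one call
print(auc_from_curve, auc_direct)
```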

# Hist Gradient Boosting Classifier ##

In [None]:
#from sklearn.ensemble import HistGradientBoostingRegressor, HistGradientBoostingClassifier

In [None]:
# HGBC = HistGradientBoostingClassifier(learning_rate=0.05, max_leaf_nodes=25,
#                                            max_iter=1000, min_samples_leaf=500,
#                                            validation_fraction=0.05,
#                                            l2_regularization=1,
#                                            max_bins=63,
#                                            random_state=100, verbose=0)

In [None]:
#HGBC.fit(X_train,y_train)

In [None]:
# # make predictions for test data
# y_pred_HGBC = HGBC.predict(X_test)
# y_pred_HGBC

In [None]:
# y_pred_HGBC_proba = HGBC.predict_proba(X_test)
# y_pred_HGBC_proba

In [None]:
# print(classification_report(y_test,y_pred_HGBC))
# accuracy = accuracy_score(y_test, y_pred_HGBC)
# print("Accuracy: %.2f%%" % (accuracy * 100.0))

In [None]:
# from sklearn.model_selection import cross_val_score
# accuracy = cross_val_score(HGBC, X_train, y_train,cv=3)
# accuracy

In [None]:
# #get the mean of each fold 
# print("Accuracy of Model with Cross Validation is:",accuracy.mean() * 100)

In [None]:
#fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_HGBC)

In [None]:
#fpr

In [None]:
# def plot_roc_curve(y_va, y_va_pred):
#     plt.figure(figsize=(8, 8))
#     fpr, tpr, _ = metrics.roc_curve(y_va, y_va_pred)
#     plt.plot(fpr, tpr, color='r', lw=2)
#     plt.plot([0, 1], [0, 1], color="navy", lw=1, linestyle="--")
#     plt.gca().set_aspect('equal')
#     plt.xlim([0.0, 1.0])
#     plt.ylim([0.0, 1.0])
#     plt.xlabel("False Positive Rate")
#     plt.ylabel("True Positive Rate")
#     plt.title("Receiver operating characteristic")
#     plt.show()

In [None]:
# plot_roc_curve(y_test, y_pred_HGBC_proba[:,1])
# print(metrics.auc(fpr, tpr))

# Predictions and submission #

## Feature engineering on test data ##

In [None]:
test


In [None]:
# counting how many sequences per subject
count_sub = pd.DataFrame(test.subject.value_counts().sort_values().reset_index() )
count_sub

In [None]:
count_sub['number of sequences'] = (count_sub['subject']/60).astype(int) #dividing by 60 seconds to obtain the right count
count_sub.drop(['subject'], axis = 1, inplace = True)

In [None]:
count_sub['subject'] = count_sub['index']
count_sub.drop(['index'], axis = 1, inplace = True)
count_sub

In [None]:
count_sub.set_index('subject', inplace=True)
count_sub

In [None]:
test_pivoted = test.pivot(index=[ 'sequence','subject'], columns='step', values=[col for col in test.columns if 'sensor_' in col])

test_pivoted

In [None]:
test_pivoted_feat = engineer(test_pivoted)
test_pivoted_feat

In [None]:
test_pivoted_feat = test_pivoted_feat.join(count_sub, how = 'inner') # join the two dataframes on subject to add the count column

In [None]:
test_pivoted_feat

In [None]:
test_pivoted_feat1 = test_pivoted_feat.droplevel(1)
test_pivoted_feat1


In [None]:
selected_columns = test_pivoted_feat1.columns
selected_columns = [f for f in selected_columns if f not in dropped_features]
len(selected_columns)

## Predictions ##

In [None]:
# test = test[selected_columns]
# test = test[['sensor_00_std',
#    'sensor_00_iqr',
#    'sensor_01_mean',
#    'sensor_01_max',
#    'sensor_01_skew',
#    'sensor_02_mean',
#    'sensor_02_std',
#    'sensor_04_mean',
#    'sensor_04_std',
#    'sensor_04_skew',
#    'sensor_04_kurt',
#    'sensor_05_max',
#    'sensor_05_min',
#    'sensor_06_min',
#    'sensor_07_mean',
#    'sensor_07_skew',
#    'sensor_08_std',
#    'sensor_08_skew',
#    'sensor_08_kurt',
#    'sensor_09_min',
#    'sensor_09_std',
#    'sensor_10_min',
#    'sensor_10_kurt',
#    'sensor_11_mean',
#    'sensor_11_max',
#    'sensor_11_median',
#    'sensor_12_max',
#    'sensor_12_std',
#    'sensor_12_skew',
#    'sensor_12_kurt']]
# test

In [None]:
test_pivoted_feat1 = test_pivoted_feat1[['sensor_00_std',
   'sensor_00_sm',
   'sensor_00_median',
   'sensor_00_kurtosis',
   'sensor_02_upsum',
   'sensor_02_downsum',
   'sensor_02_upmax',
   'sensor_02_downmax',
   'sensor_02_upmean',
   'sensor_01_std',
   'sensor_01_iqr',
   'sensor_02_mean',
   'sensor_02_std',
   'sensor_02_sm',
   'sensor_02_kurtosis',
   'sensor_03_sm',
   'sensor_03_iqr',
   'sensor_03_median',
   'sensor_03_kurtosis',
   'sensor_04_mean',
   'sensor_04_std',
   'sensor_04_sm',
   'sensor_04_iqr',
   'sensor_04_median',
   'sensor_04_kurtosis',
   'sensor_05_sm',
   'sensor_05_median',
   'sensor_06_sm',
   'sensor_06_iqr',
   'sensor_07_mean',
   'sensor_07_median',
   'sensor_08_iqr',
   'sensor_08_kurtosis',
   'sensor_09_std',
   'sensor_09_median',
   'sensor_09_kurtosis',
   'sensor_10_std',
   'sensor_10_sm',
   'sensor_10_kurtosis',
   'sensor_11_sm',
   'sensor_11_kurtosis',
   'sensor_12_std',
   'sensor_12_sm',
   'sensor_12_kurtosis',
   'number of sequences']]

In [None]:
test_pivoted_feat1

In [None]:
#featuring the test file

# test = test.drop(['subject', 'step'], axis=1)
# test = features(test)

# display(X,test,y)

In [None]:
#retraining
xgb.fit(X, y)

In [None]:
# make predictions for test data
#sub_pred = xgb.predict(test)
sub_pred = xgb.predict(test_pivoted_feat1)
#sub_pred = xgb.predict(test_pivoted_feat1[selected_columns])
len(sub_pred)

In [None]:
#sub_pred_proba = xgb.predict_proba(test)
#sub_pred_proba = xgb.predict_proba(test_pivoted_feat1[selected_columns])
sub_pred_proba = xgb.predict_proba(test_pivoted_feat1)
len(sub_pred_proba)

In [None]:
submission = pd.read_csv('../input/tabular-playground-series-apr-2022/sample_submission.csv')
submission

For each sequence in the test set, you must predict a **probability for the state** variable. 

In [None]:
#submission['state'] = sub_pred
submission['state'] = sub_pred_proba[:,1]
#submission['state'] = y_pred_XGB_proba[:,1]
submission.to_csv('submission.csv', index = False)
submission
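
As a small sketch of the submission step, the helper below takes the two-column output of `predict_proba` and keeps the probability of state 1. `make_submission`, the toy probabilities and the sequence ids are hypothetical. The column names `sequence` and `state` are assumed from the file descriptions, and the predictions are assumed to be in the same row order as the ids.

```python
import numpy as np
import pandas as pd

def make_submission(sequence_ids, proba):
    """Build a submission frame from predict_proba output of shape (n, 2)."""
    proba = np.asarray(proba)
    if proba.ndim != 2 or proba.shape[1] != 2:
        raise ValueError('expected an array of shape (n_sequences, 2)')
    if len(sequence_ids) != len(proba):
        raise ValueError('one prediction per sequence id is required')
    # Column 1 holds the predicted probability of state 1.
    return pd.DataFrame({'sequence': sequence_ids, 'state': proba[:, 1]})

# Toy predictions for two sequences.
toy_proba = np.array([[0.9, 0.1], [0.2, 0.8]])
toy_submission = make_submission([25968, 25969], toy_proba)
print(toy_submission)
```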