This notebook builds a simple ensemble for the competition submission.

Here we combine the three best public solutions listed below and choose the blending weights by hand to get the best final score:

https://www.kaggle.com/code/dmitryuarov/sensors-deep-analysis-0-98
https://www.kaggle.com/code/tyrionlannisterlzy/xgboost-dnn-ensemble-lb-0-980
https://www.kaggle.com/code/hasanbasriakcay/tpsapr22-fe-pseudo-labels-bi-lstm

<h1>Importing Libraries</h1>

In [None]:
import numpy as np
import pandas as pd
import plotly.graph_objs as go
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
from plotly.offline import init_notebook_mode
init_notebook_mode(connected = True)
import warnings
warnings.filterwarnings('ignore')

<h2>Loading Data</h2>

In [None]:
train = pd.read_csv('../input/tabular-playground-series-apr-2022/train.csv')
test = pd.read_csv('../input/tabular-playground-series-apr-2022/test.csv')
submission = pd.read_csv("../input/tabular-playground-series-apr-2022/sample_submission.csv")
labels = pd.read_csv("../input/tabular-playground-series-apr-2022/train_labels.csv")

train

In [None]:
# get a quick insight into the train data:
# count, mean, std, min, 25%, 50%, 75% and max of each column
train.describe()

In [None]:
# check whether any values are missing (count of nulls per column)
train.isnull().sum(axis=0)

<h3>Adding Labels to the Train Data</h3>

In [None]:
labels.head()

In [None]:
train = train.merge(labels, how='left', on=["sequence"])
train.head()

In [None]:
# feature columns: every sensor column (the id-like columns are excluded)
features  = [col for col in test.columns if col not in ("sequence","step","subject")]
# set the size of the figure
plt.figure(figsize = (15,7))

hm = sns.heatmap(train[features].corr(),    # data
                cmap = 'coolwarm',# style
                annot = True,     # True to show the specific values
                fmt = '.2f',      # set the precision
                linewidths = 0.05)
plt.title('Correlation Heatmap for Train dataset', 
              fontsize=14, 
              fontweight='bold')

From the heatmap, sensors **00, 01, 03, 06, 07, 09, 10, 11** show correlations worth digging into.

Sensor 04 also stands out, but we set it aside for now.

So we will focus on these sensors.

In [None]:
col_t=["sensor_00","sensor_01","sensor_03","sensor_04","sensor_06","sensor_07","sensor_09","sensor_10","sensor_11"]

# set the size of the map
plt.figure(figsize = (9,5))

hm = sns.heatmap(train[col_t].corr(),    # data
                cmap = 'coolwarm',      
                annot = True,     
                fmt = '.2f', 
                linewidths = 0.05)
plt.title('Correlation Heatmap for Selected columns from Train dataset', 
              fontsize=14, 
              fontweight='bold')

Let's take a quick look at the sensor data.

In [None]:
sequences = [0, 1, 2, 3, 4, 5]
figure, axes = plt.subplots(13, len(sequences), sharex=True, figsize=(16, 16))
for i, sequence in enumerate(sequences):
    for sensor in range(13):
        sensor_name = f"sensor_{sensor:02d}"
        plt.subplot(13, len(sequences), sensor * len(sequences) + i + 1)
        plt.plot(range(60), train[train.sequence == sequence][sensor_name],
                color=plt.rcParams['axes.prop_cycle'].by_key()['color'][i % 10])
        if sensor == 0: plt.title(f"Sequence {sequence}")
        if sequence == sequences[0]: plt.ylabel(sensor_name)
figure.tight_layout(w_pad=0.1)
plt.suptitle('Selected Time Series', y=1.02)
plt.show()

<h1>Feature Seeking between Target and Sensors Data</h1>

Here we first introduce the concept of Mutual Information (MI).

Mutual information describes relationships in terms of uncertainty. The MI between two quantities is a measure of the extent to which knowledge of one quantity reduces uncertainty about the other. 

In [None]:
# from sklearn.feature_selection import mutual_info_regression

# def make_mi_scores(X, y, discrete_features):
#     mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features)
#     mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
#     mi_scores = mi_scores.sort_values(ascending=False)
#     return mi_scores

In [None]:
# X_mi = train.copy()
# y_mi = X_mi.pop("state")

# # Label encoding for categoricals
# for colname in X_mi.select_dtypes("object"):
#     X_mi[colname], _ = X_mi[colname].factorize()

# # All discrete features should now have integer dtypes (double-check this before using MI!)
# discrete_features = X_mi.dtypes == int

The least possible mutual information between quantities is 0.0. When MI is zero, the quantities are independent: neither can tell you anything about the other. Conversely, in theory there's no upper bound to what MI can be. In practice, though, values above 2.0 or so are uncommon. (Mutual information is a **logarithmic quantity**, so it increases very slowly.)
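
To make the two extremes concrete, here is a minimal sketch that computes MI for small discrete samples with a plug-in (count-based) estimate. The helper name `mutual_information` and the toy data are made up for illustration. scikit-learn's estimators for continuous features work differently (they are nearest-neighbour based), so this is not what the commented-out cells would compute.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of MI (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of each (x, y) pair
    px = Counter(xs)               # marginal counts of x
    py = Counter(ys)               # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Toy data, made up for illustration
x = [0, 0, 1, 1]
y_independent = [0, 1, 0, 1]   # knowing x says nothing about y
y_identical = [0, 0, 1, 1]     # knowing x determines y completely

mi_independent = mutual_information(x, y_independent)
mi_identical = mutual_information(x, y_identical)
print(mi_independent, mi_identical)
```

The independent pair gives 0, while the fully informative pair gives ln 2 (about 0.693 nats), the entropy of a fair coin.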


In [None]:
# %%time
# mi_scores = make_mi_scores(X_mi, y_mi, discrete_features)
# mi_scores[::3]  # show a few features with their MI scores

In [None]:
# mi_scores


Below I compute the 'mean', 'max', 'min', 'var', 'mad', 'sum' and 'median' of each sensor, hoping to dig up something valuable.

> The code below is inspired by C4rl05/V's notebook: [https://www.kaggle.com/code/cv13j0/tps-apr-2022-xgboost-model](https://www.kaggle.com/code/cv13j0/tps-apr-2022-xgboost-model)
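
As a small sketch of the mechanics on made-up data: aggregating with a list of statistics produces two-level column names, which are then flattened with `'_'.join`. The `toy` frame and the hand-written `mad` helper are assumptions for illustration only. The helper is one possible stand-in for the `'mad'` string on pandas 2.0 and later, where that built-in aggregation no longer exists.

```python
import pandas as pd

def mad(s):
    """Mean absolute deviation around the mean (hand-written helper)."""
    return (s - s.mean()).abs().mean()

# Toy data, made up for illustration: two sequences of three steps each
toy = pd.DataFrame({
    "sequence": [0, 0, 0, 1, 1, 1],
    "sensor_00": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

group = toy.groupby("sequence").agg({"sensor_00": ["mean", "max", mad]})
# agg with a list of functions yields two-level columns such as ('sensor_00', 'mean')
group.columns = ["_".join(col) for col in group.columns]
group = group.reset_index()
print(group)
```

A Python-level callable is slower than the built-in string aggregations, so on the full dataset it may be worth timing before adopting it.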

In [None]:
def aggregated_features(df, aggregation_cols = ['sequence'], prefix = ''):
    # The same seven statistics are computed for each of the 13 sensors.
    # Note: the 'mad' (mean absolute deviation) aggregation was removed in pandas 2.0,
    # so on newer pandas versions it has to be replaced by a callable.
    stats = ['mean', 'max', 'min', 'var', 'mad', 'sum', 'median']
    agg_strategy = {f'sensor_{i:02d}': stats for i in range(13)}
    group = df.groupby(aggregation_cols).aggregate(agg_strategy)
    group.columns = ['_'.join(col).strip() for col in group.columns]
    group.columns = [str(prefix) + str(col) for col in group.columns]
    group.reset_index(inplace = True)
    
    temp = (df.groupby(aggregation_cols).size().reset_index(name = str(prefix) + 'size'))
    group = pd.merge(temp, group, how = 'left', on = aggregation_cols,)
    return group

In [None]:
train_merge_data = aggregated_features(train, aggregation_cols = ['sequence', 'subject'])
test_merge_data = aggregated_features(test, aggregation_cols = ['sequence', 'subject'])

In [None]:
train_subjects_merge_data = aggregated_features(train, aggregation_cols = ['subject'], prefix = 'subject_')
test_subjects_merge_data = aggregated_features(test, aggregation_cols = ['subject'], prefix = 'subject_')

At this point we have summary statistics of the sensor values, per sequence and per subject.

In [None]:
train_subjects_merge_data.head()

<h3>Experimenting with Lags</h3>

> Lagging is a common technique for time series datasets.

Lagging a time series means to shift its values forward one or more time steps, or equivalently, to shift the times in its index backward one or more steps. In either case, the effect is that the observations in the lagged series will appear to have happened later in time.
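
A small sketch on made-up data (the `toy` frame and the column names `lag_global` and `lag_in_sequence` are illustrative, not part of the competition data). A plain `shift` runs over the whole column, so the first step of a sequence inherits the last value of the previous sequence. Shifting inside `groupby('sequence')` keeps each sequence separate.

```python
import pandas as pd

# Toy data, made up for illustration: two sequences of three steps each
toy = pd.DataFrame({
    "sequence": [0, 0, 0, 1, 1, 1],
    "step": [0, 1, 2, 0, 1, 2],
    "sensor_00": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Lag over the whole column: row 3 (start of sequence 1) receives 3.0 from sequence 0
toy["lag_global"] = toy["sensor_00"].shift(1)

# Lag within each sequence: the first step of every sequence is NaN
toy["lag_in_sequence"] = toy.groupby("sequence")["sensor_00"].shift(1)
print(toy)
```

The leading NaN values then need to be handled in some way, for example by filling them with 0.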

In [None]:
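# shift within each sequence so that a lag never crosses a sequence boundary
train['sensor_00_lag_01'] = train.groupby('sequence')['sensor_00'].shift(1)
train['sensor_00_lag_10'] = train.groupby('sequence')['sensor_00'].shift(10)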
train.head(15)

<h3>Merging the Datasets before Training</h3>

In [None]:
train_merge_data = train_merge_data.merge(labels, how = 'left', on = 'sequence')

In [None]:
train_merge_data = train_merge_data.merge(train_subjects_merge_data, how = 'left', on = 'subject')
test_merge_data = test_merge_data.merge(test_subjects_merge_data, how = 'left', on = 'subject')
train_merge_data.head()

In [None]:
test_merge_data.head()

<h3>Post Processing the Information for the Model</h3>

In [None]:
ignore = ['sequence', 'state', 'subject']
features = [feat for feat in train_merge_data.columns if feat not in ignore]
target_feature = 'state'

<h3>Train - Test Split </h3>

You could also use cross-validation here instead of a single split.

In [None]:
%%time
from sklearn.model_selection import train_test_split
test_size_pct = 0.20
X_train, X_valid, y_train, y_valid = train_test_split(
                                train_merge_data[features], 
                                train_merge_data[target_feature], 
                                test_size = test_size_pct, 
                                random_state = 16)

<h3>Building a XGBoost Model</h3>

In [None]:
from xgboost  import XGBClassifier

params = {'n_estimators': 8192,
          'max_depth': 7,
          'learning_rate': 0.1,
          'subsample': 0.96,
          'colsample_bytree': 0.80,
          'reg_lambda': 1.50,
          'reg_alpha': 6.10,
          'gamma': 1.40,
          'random_state': 16,
          'objective': 'binary:logistic',
          #'tree_method': 'gpu_hist',
         }

xgb = XGBClassifier(**params)
xgb.fit(X_train, y_train, 
        eval_set = [(X_valid, y_valid)], 
        eval_metric = ['auc','logloss'], 
        early_stopping_rounds = 64, 
        verbose = 32)

In [None]:
from sklearn.metrics import roc_auc_score

preds = xgb.predict_proba(X_valid)[:, 1]
score = roc_auc_score(y_valid, preds)
print(score)

<h3>Check the Feature Importance through plots</h3>

In [None]:
def plot_feature_importance(importance, names, model_type, max_features = 10):
    #Create arrays from feature importance and feature names
    feature_importance = np.array(importance)
    feature_names = np.array(names)

    #Create a DataFrame using a Dictionary
    data={'feature_names':feature_names,'feature_importance':feature_importance}
    fi_df = pd.DataFrame(data)

    #Sort the DataFrame in order decreasing feature importance
    fi_df.sort_values(by=['feature_importance'], ascending=False,inplace=True)
    fi_df = fi_df.head(max_features)

    #Define size of bar plot
    plt.figure(figsize=(8,6))
    
    #Plot Seaborn bar chart
    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])
    #Add chart labels
    plt.title(model_type + 'FEATURE IMPORTANCE')
    plt.xlabel('IMPORTANCE')
    plt.ylabel('FEATURE NAMES')

In [None]:
# seaborn and matplotlib were already imported at the top of the notebook
plot_feature_importance(xgb.feature_importances_,X_train.columns,'XGBOOST ', max_features = 15)

<h3>Make Submission File</h3>

In [None]:
# predicted probability of state == 1 for each test sequence
xgb_preds = xgb.predict_proba(test_merge_data[features])[:, 1]
xgb_preds

Ensemble with the DNN model.
This part will be updated later.

Rough version for now.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

from sklearn.metrics import roc_auc_score

from sklearn.model_selection import KFold, GroupKFold

import tensorflow as tf
from tensorflow.keras.utils import plot_model
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.layers import GlobalMaxPooling1D
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.layers import Concatenate, LSTM, GRU
from tensorflow.keras.layers import Bidirectional, Multiply
np.random.seed(2022)
tf.random.set_seed(2022)
train = pd.read_csv('../input/tabular-playground-series-apr-2022/train.csv')
t_lbls = pd.read_csv('../input/tabular-playground-series-apr-2022/train_labels.csv')
test = pd.read_csv('../input/tabular-playground-series-apr-2022/test.csv')
s6 = pd.read_csv('../input/tabular-playground-series-apr-2022/sample_submission.csv')
s7=pd.read_csv('../input/blend-sub/blend_sub12.csv')

In [None]:
features = train.columns.tolist()[3:]
def prep(df):
    for feature in features:
        df[feature + '_lag1'] = df.groupby('sequence')[feature].shift(1)
        df.fillna(0, inplace=True)
        df[feature + '_diff1'] = df[feature] - df[feature + '_lag1']    

prep(train)
prep(test)

features = train.columns.tolist()[3:]
sc = StandardScaler()
train[features] = sc.fit_transform(train[features])
test[features] = sc.transform(test[features])

groups = train["sequence"]
labels = t_lbls["state"]

train = train.drop(["sequence", "subject", "step"], axis=1).values
train = train.reshape(-1, 60, train.shape[-1])

test = test.drop(["sequence", "subject", "step"], axis=1).values
test = test.reshape(-1, 60, test.shape[-1])

In [None]:
# try:
#     tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
#     tf.config.experimental_connect_to_cluster(tpu)
#     tf.tpu.experimental.initialize_tpu_system(tpu)
#     tpu_strategy = tf.distribute.experimental.TPUStrategy(tpu)
#     BATCH_SIZE = tpu_strategy.num_replicas_in_sync * 64
#     print("Running on TPU:", tpu.master())
#     print(f"Batch Size: {BATCH_SIZE}")
    
# except ValueError:
#     strategy = tf.distribute.get_strategy()
#     BATCH_SIZE = 256
#     print(f"Running on {strategy.num_replicas_in_sync} replicas")
#     print(f"Batch Size: {BATCH_SIZE}")

In [None]:
def dnn_model():
    
    x_input = Input(shape=(train.shape[-2:]))
    
    x1 = Bidirectional(LSTM(units=512, return_sequences=True))(x_input)
    x2 = Bidirectional(LSTM(units=256, return_sequences=True))(x1)
    z1 = Bidirectional(GRU(units=256, return_sequences=True))(x1)
    
    c = Concatenate(axis=2)([x2, z1])
    
    x3 = Bidirectional(LSTM(units=128, return_sequences=True))(c)
    
    x4 = GlobalMaxPooling1D()(x3)
    x5 = Dense(units=128, activation='selu')(x4)
    x_output = Dense(1, activation='sigmoid')(x5)

    model = Model(inputs=x_input, outputs=x_output, name='lstm_model')
    
    return model

model = dnn_model()

In [None]:
# with tpu_strategy.scope():
VERBOSE = True
BATCH_SIZE = 256
predictions, scores = [], []
k = GroupKFold(n_splits = 10)

for fold, (train_idx, val_idx) in enumerate(k.split(train, labels, groups.unique())):
    print('-'*15, '>', f'Fold {fold+1}', '<', '-'*15)

    X_train, X_val = train[train_idx], train[val_idx]
    y_train, y_val = labels.iloc[train_idx].values, labels.iloc[val_idx].values

    model = dnn_model()
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics='AUC')

    # AUC should be maximized, so set the mode explicitly instead of relying on 'auto'
    lr = ReduceLROnPlateau(monitor="val_auc", factor=0.6, 
                           patience=4, verbose=VERBOSE, mode="max")

    es = EarlyStopping(monitor="val_auc", patience=7, 
                       verbose=VERBOSE, mode="max", 
                       restore_best_weights=True)

    save_locally = tf.saved_model.SaveOptions(experimental_io_device='/job:localhost')
    chk_point = ModelCheckpoint(f'./TPS_model_2022_{fold+1}C.h5', options=save_locally, 
                                monitor='val_auc', verbose=VERBOSE, 
                                save_best_only=True, mode='max')

    model.fit(X_train, y_train, 
              validation_data=(X_val, y_val), 
              epochs=16,
              verbose=VERBOSE,
              batch_size=BATCH_SIZE, 
              callbacks=[lr, chk_point, es])

    load_locally = tf.saved_model.LoadOptions(experimental_io_device='/job:localhost')
    model = load_model(f'./TPS_model_2022_{fold+1}C.h5', options=load_locally)

    y_pred = model.predict(X_val, batch_size=BATCH_SIZE).squeeze()
    score = roc_auc_score(y_val, y_pred)
    scores.append(score)
    predictions.append(model.predict(test, batch_size=BATCH_SIZE).squeeze())
    print(f"Fold-{fold+1} | OOF Score: {score}")

print(f'Mean AUC over {k.n_splits} folds - {np.mean(scores)}')

In [None]:
s6["state"] = sum(predictions)/k.n_splits
s6["state"]

In [None]:
blendsub=pd.read_csv("../input/blens-sub31/blend_sub31_exp.csv")
preds=(s7.state+s6.state)*0.25+xgb_preds*0.24+blendsub.state*0.5
preds
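
The blend above is a weighted sum of prediction vectors whose weights add up to 1.24 rather than 1. A minimal sketch with made-up numbers (the `model_preds` values are hypothetical, only the four weights are taken from the cell above) shows that dividing the weights by their sum keeps the blended values inside [0, 1] without changing their ordering, so a rank-based metric such as AUC is unaffected by the rescaling.

```python
import numpy as np

# Hypothetical predictions from four sources for four test rows
model_preds = np.array([
    [0.90, 0.20, 0.60, 0.10],
    [0.80, 0.30, 0.70, 0.20],
    [0.70, 0.10, 0.50, 0.30],
    [0.95, 0.15, 0.55, 0.05],
])

raw_weights = np.array([0.25, 0.25, 0.24, 0.50])   # sums to 1.24
weights = raw_weights / raw_weights.sum()          # normalised to sum to 1

raw_blend = raw_weights @ model_preds   # may exceed 1
blend = weights @ model_preds           # stays within [0, 1]

print(raw_blend)
print(blend)
```

Normalising is optional when only the ranking matters, but it keeps the submitted values interpretable as probabilities.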

Just replace the `state` column with your predictions.

In [None]:

submission['state'] = preds
submission.to_csv('my_submission_ty.csv', index = False)




**Still working on finding more features...
This notebook will be updated soon!**