<a href="https://www.kaggle.com/code/jiprud/tps-apr22-rookie-eda?scriptVersionId=91988380" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (20,10) # make plots a bit bigger

# Load Data

In [None]:
train_df = pd.read_csv('../input/tabular-playground-series-apr-2022/train.csv')
test_df = pd.read_csv('../input/tabular-playground-series-apr-2022/test.csv')
train_labels_df = pd.read_csv('../input/tabular-playground-series-apr-2022/train_labels.csv')

In [None]:
# add target information (from train_labels) to every row of the training dataframe
train_df = train_df.merge(train_labels_df, on='sequence', how='left')
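
The merge joins on `sequence`, the only column the two frames share, so every time step of a sequence receives that sequence's `state`. A minimal sketch on made-up data (the toy frames and their values are assumptions, not the competition data) makes the join key explicit and uses `validate` to check that each sequence has exactly one label:

```python
import pandas as pd

# Toy stand-ins for train.csv and train_labels.csv (values are invented)
toy_train = pd.DataFrame({
    'sequence': [0, 0, 1, 1],
    'step': [0, 1, 0, 1],
    'sensor_00': [0.1, 0.2, -0.3, 0.4],
})
toy_labels = pd.DataFrame({'sequence': [0, 1], 'state': [0, 1]})

# Explicit key; validate raises MergeError if a sequence has several labels
toy_merged = toy_train.merge(toy_labels, on='sequence', how='left',
                             validate='many_to_one')
print(toy_merged)
```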

# Explore

In [None]:
display(train_df.head())
display(test_df.head())

In [None]:
train_df.describe()

* sequence - a unique id for each sequence
* subject - a unique id for the subject in the experiment
* step - time step of the recording, in one-second intervals

Let's see how many unique values each of these columns has.

In [None]:
train_df[['sequence', 'subject','step']].nunique()

In [None]:
test_df[['sequence', 'subject','step']].nunique()

Do we have the same subjects in train and test?

In [None]:
len(np.intersect1d(train_df['subject'],test_df['subject']))

No, the test dataframe contains only new subjects.
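
The overlap check can be sketched on invented subject ids (the arrays below are assumptions for illustration): an intersection of length zero means that no subject appears in both sets.

```python
import numpy as np

# Invented subject ids; the real ones come from train_df['subject'] and test_df['subject']
toy_train_subjects = np.array([1, 1, 2, 3, 3])
toy_test_subjects = np.array([4, 5, 5, 6])

# intersect1d returns the sorted unique values present in both arrays
shared = np.intersect1d(toy_train_subjects, toy_test_subjects)
print(len(shared))
```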

Let's plot the sensor values for one (arbitrarily chosen) sequence.

In [None]:
sequence = 42
seq = train_df.query('sequence == @sequence').copy()

seq.drop(['sequence','subject','step','state'], axis=1).plot();

## Mean Values

Let's look more closely at the mean sensor values over all time steps. Do the means differ between the two states?

In [None]:
means = train_df.groupby('state').mean()
display(means)
display(means.diff()) # row for state 1 = mean(state 1) - mean(state 0); row for state 0 is NaN

Yes, there is some difference in mean values for the two states.
Are the differences significant? Can we use this information for a model? Let's try...
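
One hedged way to gauge whether such gaps matter is to scale each mean difference by the spread of the sensor. The sketch below uses invented data and a standardized difference (difference of the state means divided by the overall standard deviation). This measure is an illustrative assumption, not something computed in the original analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Invented data: sensor_a is shifted between states, sensor_b is not
n = 1000
state = np.repeat([0, 1], n)
toy = pd.DataFrame({
    'state': state,
    'sensor_a': rng.normal(loc=state * 1.0, scale=1.0),
    'sensor_b': rng.normal(loc=0.0, scale=1.0, size=2 * n),
})

means = toy.groupby('state').mean()
# Mean of state 1 minus mean of state 0, scaled by the overall standard deviation
std_diff = (means.loc[1] - means.loc[0]) / toy.drop(columns='state').std()
# Largest absolute standardized difference first
print(std_diff.abs().sort_values(ascending=False))
```

Sensors with a standardized difference close to zero are unlikely to help through their mean alone, although other statistics of the same sensor may still carry signal.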

# Engineer Features

In [None]:
def create_features(df):
    df_copy = df.copy()
    # inspired by: https://www.kaggle.com/code/hasanbasriakcay/tpsapr22-fe-pseudo-labels-baseline
    # binary flag: 1 where sensor_02 is above -15, else 0
    df_copy['sensor_02_num'] = (df_copy['sensor_02'] > -15).astype(int)
    df_copy['sensor_sum1'] = (df_copy['sensor_00'] + df_copy['sensor_09'] + df_copy['sensor_06'] + df_copy['sensor_01'])
    df_copy['sensor_sum2'] = (df_copy['sensor_01'] + df_copy['sensor_11'] + df_copy['sensor_09'] + df_copy['sensor_06'] + df_copy['sensor_00'])
    df_copy['sensor_sum3'] = (df_copy['sensor_03'] + df_copy['sensor_11'] + df_copy['sensor_07'])
    df_copy['sensor_sum4'] = (df_copy['sensor_04'] + df_copy['sensor_10'])
    
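    # 'mad' was deprecated in pandas 1.5 and removed in 2.0; this named helper
    # computes the same mean absolute deviation and keeps the '_mad' column suffix
    def mad(x):
        return (x - x.mean()).abs().mean()

    out_df = df_copy.groupby('sequence').agg(['mean', 'max', 'min', 'var', mad, 'sum', 'median', 'skew'])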
#     out_df = df_copy.groupby('sequence').agg(['mean','max'])
    out_df.columns = ['_'.join(col).strip() for col in out_df.columns]

    return out_df
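
To see what the aggregation step produces, here is a minimal sketch on invented data with only two statistics: `groupby('sequence').agg([...])` returns two-level column labels such as `('sensor_00', 'mean')`, and the join flattens them to `sensor_00_mean`, giving one row per sequence.

```python
import pandas as pd

# Invented data: two sequences with two time steps each
toy = pd.DataFrame({
    'sequence': [0, 0, 1, 1],
    'sensor_00': [1.0, 3.0, 2.0, 6.0],
    'sensor_01': [0.0, 2.0, -1.0, 1.0],
})

agg = toy.groupby('sequence').agg(['mean', 'max'])
# Flatten ('sensor_00', 'mean') -> 'sensor_00_mean'
agg.columns = ['_'.join(col) for col in agg.columns]
print(agg)
```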

# Model



## Prepare training and testing dataframes

In [None]:
%%time
train = train_df.drop(['subject', 'step', 'state'], axis=1)
X_train = create_features(train)
test = test_df.drop(['subject', 'step'], axis=1)
X_test = create_features(test)
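# align labels with the sequence order of the aggregated features
y_train = train_labels_df.set_index('sequence').loc[X_train.index, 'state']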

submission = pd.DataFrame(index = X_test.index)

display(X_train,X_test,y_train)
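
The model pairs row i of `X_train` with element i of `y_train`, so the labels must follow the same `sequence` order as the aggregated features. A minimal sketch, on invented data with deliberately shuffled labels, of aligning the labels to the feature index explicitly:

```python
import pandas as pd

# Invented features indexed by sequence, and labels in a shuffled order
toy_X = pd.DataFrame({'sensor_00_mean': [0.5, -0.2, 1.1]},
                     index=pd.Index([0, 1, 2], name='sequence'))
toy_labels = pd.DataFrame({'sequence': [2, 0, 1], 'state': [1, 0, 1]})

# Reorder labels so that row i of toy_y belongs to row i of toy_X
toy_y = toy_labels.set_index('sequence').loc[toy_X.index, 'state']
print(toy_y)
```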

## Gradient Boosting

In [None]:
%%time
from xgboost import XGBClassifier

model_xgb = XGBClassifier(random_state = 2)

model_xgb.fit(X_train,y_train);

In [None]:
y_xgb = model_xgb.predict(X_test)

#display(y_xgb)
#submission['xgb'] = y_xgb

y_xgb_proba = model_xgb.predict_proba(X_test)
display(y_xgb_proba)
#submission['xgb_proba_0'] = y_xgb_proba[:,0]
submission['xgb_proba_1'] = y_xgb_proba[:,1]

In [None]:
# show feature importance
from xgboost import plot_importance
plot_importance(model_xgb, max_num_features = 50)
plt.show()


In [None]:
feature_importance = model_xgb.get_booster().get_score()

feature_importance

## Create Submission file

In [None]:
submission.to_csv('submission.csv',columns=['xgb_proba_1'], header=['state'],index=True)
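
A quick sanity check of the resulting layout, sketched on invented predictions (the sequence ids and probabilities below are made up): the index, named `sequence` by the groupby, becomes the first column, and `xgb_proba_1` is written under the header `state`.

```python
import io
import pandas as pd

# Invented probabilities indexed by invented sequence ids
toy_submission = pd.DataFrame(
    {'xgb_proba_1': [0.25, 0.75]},
    index=pd.Index([25968, 25969], name='sequence'),
)

buffer = io.StringIO()  # in-memory stand-in for submission.csv
toy_submission.to_csv(buffer, columns=['xgb_proba_1'], header=['state'], index=True)
csv_text = buffer.getvalue()
print(csv_text)
```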