# Summary
To establish a baseline, we share the predictions of a machine learning model built with PyCaret. Features are created in the following three steps, and the values obtained in Steps 2 and 3 are used as features.<br>

**Step1**. Compute the change from the previous step for each of the 13 sensors.<br>
**Step2**. Compute the maximum, minimum, mean, and standard deviation of each sensor for each experiment.<br>
**Step3**. Compute the maximum, minimum, mean, and standard deviation of each sensor's step-to-step difference for each experiment.
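
The three steps can be sketched on a toy frame as follows. This is a minimal illustration rather than the notebook's exact implementation: the data values are made up, only one sensor is shown, and setting the first difference of each sequence to 0 is an assumption of this sketch.

```python
import pandas as pd

# Toy data: two sequences, three steps each, one sensor (values are made up).
toy = pd.DataFrame({
    "sequence": [0, 0, 0, 1, 1, 1],
    "step": [0, 1, 2, 0, 1, 2],
    "sensor_00": [1.0, 3.0, 2.0, 5.0, 5.0, 8.0],
})

# Step 1: change from the previous step within each sequence.
# The first step has no predecessor; here it is set to 0 (assumption of this sketch).
toy["sensor_00_diff1"] = toy.groupby("sequence")["sensor_00"].diff().fillna(0)

# Steps 2 and 3: per-sequence statistics of the raw values and of the differences.
stats = ["max", "min", "mean", "std"]
features = toy.groupby("sequence")[["sensor_00", "sensor_00_diff1"]].agg(stats)

# Flatten the (column, statistic) MultiIndex into names such as "sensor_00_diff1_max".
features.columns = [f"{col}_{stat}" for col, stat in features.columns]
print(features)
```

Each sequence ends up as a single row of summary features, which is the shape a tabular model expects.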



# Table of Contents
1. [Extract](#Extract)<br>
2. [Assess](#Assess)<br>
3. [EDA](#EDA)<br>
4. [Feature_Engineering](#Featute_Engineering)<br>
5. [Preprocess](#Preprocess)<br>
6. [Modeling](#Modeling)<br>
7. [Submission](#Submission)<br>

In [None]:
%%capture
!pip install pycaret[full]

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt 

from pycaret.classification import *
import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split
from xgboost  import XGBClassifier
from sklearn.metrics import roc_auc_score
from xgboost import cv

# Extract

In [None]:
sample_sub = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2022/sample_submission.csv')
train_label_df = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2022/train_labels.csv')
train_df = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2022/train.csv')
test_df = pd.read_csv('/kaggle/input/tabular-playground-series-apr-2022/test.csv')

# Assess 

## Train 

In [None]:
train_df.info()

* Null values do not appear to exist.

In [None]:
train_df.head()

In [None]:
train_df.duplicated().sum()

In [None]:
train_df.describe()

## Cardinality

In [None]:
for col in train_df.loc[:,['sequence','subject','step']].columns.values:
    print(f'{col}:{len(train_df[col].unique())}')

**Note**<br>
The training data contains sensor logs for 672 subjects and 25,968 sequences, with 60 steps recorded per sequence.
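
A quick way to confirm this structure is to count the distinct steps in every sequence. The helper below is a hypothetical sketch (the function name `summarize_structure` and the toy data are not part of the notebook); on the real data one would pass `train_df`.

```python
import pandas as pd

def summarize_structure(df: pd.DataFrame) -> dict:
    """Count subjects and sequences, and list the distinct numbers of steps per sequence."""
    steps_per_sequence = df.groupby("sequence")["step"].nunique()
    return {
        "n_subjects": df["subject"].nunique(),
        "n_sequences": df["sequence"].nunique(),
        "steps_per_sequence": sorted(steps_per_sequence.unique().tolist()),
    }

# Toy stand-in for train_df: three sequences of two steps, two subjects.
toy = pd.DataFrame({
    "sequence": [0, 0, 1, 1, 2, 2],
    "subject":  [7, 7, 7, 7, 9, 9],
    "step":     [0, 1, 0, 1, 0, 1],
})
summary = summarize_structure(toy)
print(summary)
```

If `steps_per_sequence` contains a single value, every sequence has the same length and per-sequence aggregates are directly comparable.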

## Test

In [None]:
test_df.info()

In [None]:
test_df.head()

In [None]:
test_df.describe()

**Note**
* Judging from the median, sensor_02 appears to behave differently from the other sensors.

# EDA

## Sensor Data Structure

First, we get an overview of how the sensor data is structured, starting with the number of sequences recorded for each subject.

In [None]:
summary_df = pd.DataFrame(train_df[['sequence','subject','step']].groupby(['subject']).count()/60)
summary_df = summary_df.reset_index()
summary_df = summary_df.sort_values(by='sequence')
summary_df

In [None]:
summary_df.describe()

In [None]:
fig=plt.figure(figsize=[32,4])
sns.barplot(data =summary_df ,x='subject',y='sequence',color='gray',order=summary_df['subject']);
plt.xticks(rotation=90);
plt.axhline(y=10,color='red');  # xmin/xmax are axes fractions (0 to 1); the defaults span the full width

* The median number of sequences per subject is 33, the 1st quartile is 25, and the 3rd quartile is 47, but some subjects have fewer than 10 sequences. The red line marks 10 sequences per subject, and the tables below list the subjects with the fewest and the most sequences.
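
As a hedged sketch of how such figures can be derived, the snippet below counts distinct sequences per subject and takes quartiles of that count. The toy data and the threshold variable `min_sequences` are illustrative assumptions; the notebook itself uses a threshold of 10 on the real data.

```python
import pandas as pd

# Toy stand-in for train_df: subject 1 has 2 sequences, subject 2 has 1, subject 3 has 3.
toy = pd.DataFrame({
    "subject":  [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3],
    "sequence": [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
})

# Number of distinct sequences recorded for each subject.
seq_per_subject = toy.groupby("subject")["sequence"].nunique()

# Quartiles of that count across subjects.
quartiles = seq_per_subject.quantile([0.25, 0.5, 0.75])

# Subjects below a chosen threshold (hypothetical value for the toy data).
min_sequences = 2
sparse_subjects = seq_per_subject[seq_per_subject < min_sequences]

print(quartiles)
print(sparse_subjects)
```

Counting with `nunique` avoids relying on every sequence having exactly 60 rows.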

In [None]:
# 15 subjects with the fewest sequences
summary_df.head(15)

In [None]:
# 15 subjects with the most sequences
summary_df.tail(15)

## Plot for Each Sensor

### Plot for Subject Statistics

In [None]:
# Todo

### Plot for Specific Subject

In [None]:
# Todo

### Correlation - Original

In [None]:
feat_old_list=['sequence', 'subject', 'step',
'sensor_00', 'sensor_01', 'sensor_02', 'sensor_03', 
'sensor_04', 'sensor_05', 'sensor_06', 'sensor_07',
'sensor_08', 'sensor_09', 'sensor_10', 'sensor_11', 'sensor_12']

In [None]:
plt.figure(figsize=[16,6])
sns.heatmap(train_df.loc[:,feat_old_list[3:]].corr(),annot=True);

* Observations
    * Sensors for which correlation is observed
        * sensor_00_diff1: sensor_01_diff1, sensor_06_diff1
        * sensor_03_diff1: sensor_11_diff1, sensor_09_diff1
        * sensor_04_diff1: sensor_10_diff1, sensor_12_diff1
    * Sensors with no correlation observed
        * sensor_02_diff1, sensor_05_diff1, sensor_07_diff1, sensor_08_diff1
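
Reading pairs off a heatmap by eye is error-prone, so a small helper can list them instead. The following is a sketch under stated assumptions: the function name `correlated_pairs`, the threshold of 0.5, and the toy values are all hypothetical and not taken from the notebook.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
    """Return column pairs whose absolute Pearson correlation is at least `threshold`."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once and the diagonal is excluded.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().dropna()
    return pairs[pairs >= threshold].sort_values(ascending=False)

# Toy data: sensor_01 closely follows sensor_00, sensor_02 is unrelated to both.
toy = pd.DataFrame({
    "sensor_00": [1.0, 2.0, 3.0, 4.0, 5.0],
    "sensor_01": [2.0, 4.0, 6.0, 8.0, 11.0],
    "sensor_02": [1.0, -1.0, 1.0, -1.0, 1.0],
})
strong = correlated_pairs(toy, threshold=0.5)
print(strong)
```

On the real data, the same helper could be applied to the sensor columns of `train_df` to check the groupings listed above.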

### Difference

In [None]:
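def prep(df,features):
    """Add `<feature>_lag1` (previous step within a sequence) and `<feature>_diff1` columns.

    The first step of each sequence has no previous value; its lag is filled
    with 0, so its diff equals the raw reading.
    """
    for feature in features:
        df[feature + '_lag1'] = df.groupby('sequence')[feature].shift(1).fillna(0)
        df[feature + '_diff1'] = df[feature] - df[feature + '_lag1']
    return df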

In [None]:
feature_col = train_df.columns.values[3:]
train_df = prep(train_df,feature_col)
test_df = prep(test_df,feature_col)

### Correlation between sensor difference

In [None]:
feat_lst=['sequence', 'subject', 'step',
'sensor_00_diff1', 'sensor_01_diff1', 'sensor_02_diff1', 'sensor_03_diff1', 
'sensor_04_diff1', 'sensor_05_diff1', 'sensor_06_diff1', 'sensor_07_diff1',
'sensor_08_diff1', 'sensor_09_diff1', 'sensor_10_diff1', 'sensor_11_diff1', 'sensor_12_diff1']

In [None]:
fig, ax =plt.subplots(1,4,sharey=True,figsize=[16,4]);
sns.heatmap(train_df.loc[:,feat_lst[3:]].corr(),ax=ax[0], cbar=False);
sns.heatmap(train_df[train_df['subject']==333].loc[:,feat_lst[3:]].corr(),ax=ax[1], cbar=False);
sns.heatmap(train_df[train_df['subject']==1].loc[:,feat_lst[3:]].corr(),ax=ax[2], cbar=False);
sns.heatmap(train_df[train_df['subject']==472].loc[:,feat_lst[3:]].corr(),ax=ax[3]);

**Note**
* Observations
    * Sensors for which correlation is observed
        * sensor_00_diff1: sensor_01_diff1, sensor_06_diff1
        * sensor_03_diff1: sensor_11_diff1, sensor_09_diff1
        * sensor_04_diff1: sensor_10_diff1, sensor_12_diff1
    * Sensors with no correlation observed
        * sensor_05_diff1, sensor_07_diff1, sensor_08_diff1, sensor_02_diff1
* Consideration
    * Summarize sensors that are too strongly correlated, for example by keeping one representative per group:
        * sensor_00_diff1, sensor_03_diff1, sensor_04_diff1, sensor_05_diff1, sensor_07_diff1, sensor_08_diff1, sensor_02_diff1
        * Note that which sensors correlate most strongly depends on the amount of data (compare the per-subject heatmaps).
* Question
    * Could the correlation between sensors itself serve as a feature?
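
One way to explore the question above is to compute the correlation between two sensors separately for each sequence and treat that value as a per-sequence feature. This is only a sketch: the helper name `pairwise_corr_feature` and the toy data are hypothetical, and whether such a feature helps the model is untested here.

```python
import pandas as pd

def pairwise_corr_feature(df, col_a, col_b, key="sequence"):
    """Pearson correlation between two columns, computed separately for each group."""
    out = df.groupby(key)[[col_a, col_b]].apply(lambda g: g[col_a].corr(g[col_b]))
    return out.rename(f"{col_a}_{col_b}_corr")

# Toy data: the two sensors move together in sequence 0 and in opposite directions in sequence 1.
toy = pd.DataFrame({
    "sequence":  [0, 0, 0, 1, 1, 1],
    "sensor_00": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "sensor_01": [2.0, 4.0, 6.0, 3.0, 2.0, 1.0],
})
corr_feature = pairwise_corr_feature(toy, "sensor_00", "sensor_01")
print(corr_feature)
```

A sequence in which either sensor is constant yields NaN, so the result would need a fill strategy before modeling.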

# Featute_Engineering

In [None]:
sensor = ['00','01','02','03','04','05','06','07','08','09','10','11','12']

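# Raw columns that could be dropped after aggregation.
# This list is rebuilt after feature engineering, before it is actually used.
drop_columns = [f"sensor_{i}" for i in sensor]
drop_columns.append("step")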

[Reference:SHOMA TATENO](https://www.kaggle.com/code/shoooono/lightgbm-baseline-with-simple-feature)

In [None]:
def feature_engineer(df, i, suffix=''):
    """Append per-(sequence, subject) summary statistics of one sensor column.

    `i` is the two-digit sensor id (e.g. '00'); `suffix` selects a derived
    column such as '_diff1'. Every row of a group receives the same values.
    """
    col = f"sensor_{i}{suffix}"
    grouped = df.groupby(['sequence', 'subject'])[col]

    # Statistic name -> per-group Series, in the order the columns are appended.
    stats = {
        'mean': grouped.mean(),
        'sum':  grouped.sum(),
        'skew': grouped.skew(),
        'std':  grouped.std(),
        'var':  grouped.var(),
        'max':  grouped.max(),
        'min':  grouped.min(),
        'q001': grouped.quantile(0.01),
        'q005': grouped.quantile(0.05),
        'q095': grouped.quantile(0.95),
        'q099': grouped.quantile(0.99),
    }

    df_copy = df.copy()
    for name, values in stats.items():
        values = values.rename(f"sensor_{i}_{name}{suffix}")
        df_copy = df_copy.merge(values, left_on=['sequence', 'subject'], right_index=True)

    return df_copy

In [None]:
for val in sensor:
    train_df = feature_engineer(train_df,val)

In [None]:
for val in sensor:
    train_df = feature_engineer(train_df,val,'_diff1')

In [None]:
for val in sensor:
    test_df =  feature_engineer(test_df,val)

In [None]:
for val in sensor:
    test_df =  feature_engineer(test_df,val,'_diff1')

In [None]:
sensor = ['00','01','02','03','04','05','06','07','08','09','10','11','12']

drop_columns = []
for i in sensor:
    drop_columns.append(f"sensor_{i}")
    drop_columns.append(f"sensor_{i}_lag1")
    drop_columns.append(f"sensor_{i}_diff1")

In [None]:
train_df = train_df.drop(drop_columns, axis=1)
train_df = train_df.drop_duplicates()

In [None]:
test_df = test_df.drop(drop_columns, axis=1)
test_df = test_df.drop_duplicates()

# Preprocess

In [None]:
train_df = train_df.merge(train_label_df, on=['sequence'])

In [None]:
# reduce memory
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (
                start_mem - end_mem) / start_mem))
    return df

In [None]:
train_df = reduce_mem_usage(train_df)
test_df = reduce_mem_usage(test_df)

**Note**<br>
Because the leaderboard (LB) score at submission time was considerably worse than the AUC measured during training, I suspect overfitting. This time, GroupKFold was performed using "**step**".
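
For reference, the snippet below sketches how scikit-learn's `GroupKFold` keeps all rows of one group on the same side of a split. Note the assumption: it groups by a toy sequence id rather than by "step" as in the note above, because rows that share a sequence carry identical aggregated features; the array sizes are made up.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy setup: 6 sequences with 4 rows each (stand-ins for the 60 steps per sequence).
groups = np.repeat(np.arange(6), 4)
X = np.zeros((len(groups), 2))
y = np.tile([0, 1], len(groups) // 2)

gkf = GroupKFold(n_splits=3)
folds = list(gkf.split(X, y, groups=groups))

for fold, (train_idx, val_idx) in enumerate(folds):
    train_groups = set(groups[train_idx].tolist())
    val_groups = set(groups[val_idx].tolist())
    # No sequence appears on both sides of the split.
    assert train_groups.isdisjoint(val_groups)
    print(f"fold {fold + 1}: validation sequences = {sorted(val_groups)}")
```

When a group never straddles the split, the validation AUC is less likely to be inflated by near-duplicate rows, which is one common cause of a gap between validation and leaderboard scores.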

In [None]:
feature_list=train_df.columns.values[2:-1]

X_train = train_df[feature_list]
y_train = train_df['state']

X_test= test_df[feature_list]

# Modeling

#### Useful Notebook
[Reference:TYRIONLANNISTERLZY](https://www.kaggle.com/code/tyrionlannisterlzy/eda-feature-seeking-xgboost-beginners-friendly)<br>
[XGB-kfold:PRASHANT BANERJEE](https://www.kaggle.com/code/prashant111/xgboost-k-fold-cv-feature-importance/notebook)<br>
[XGB-kfold:JIE DENG](https://www.kaggle.com/code/xinyangkabuda/baseline-lgboost)

In [None]:
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold,KFold,train_test_split,GroupKFold

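def kfold_lgb(train, x_test, target, seed=2021):
    """Train LightGBM with 5-fold stratified CV.

    Returns the fold-averaged test predictions, the out-of-fold predictions,
    and the per-fold feature importances.
    """
    kf = StratifiedKFold(n_splits=5)
    # If you use a GPU, add 'device': 'gpu' to the parameters below.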
    paras = {
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'max_depth': 5,
        'random_state': seed,
        "learning_rate":0.05, 
        "n_estimators":1000,
        'num_leaves': 52,
        'bagging_fraction': 0.8,
        'bagging_freq': 4,
        'feature_fraction': 0.6,
        'min_data_in_leaf': 7,
        'min_sum_hessian_in_leaf': 1,
        'max_bin': 100,
        'metric': 'auc'
    }
    
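    # Prepared for GroupKFold; the StratifiedKFold split below does not use it.
    group = train['step']
    y_val_hat = 0
    # Out-of-fold predictions, initialised as floats to hold probabilities.
    y_train_hat = pd.Series(0.0, index=target.index)
    y_sub = 0
    feature_importance_df = pd.DataFrame()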

    for fold, (train_index, val_index) in enumerate(kf.split(train, target)):
        X_train, X_val = train.iloc[train_index], train.iloc[val_index]
        y_train, y_val = target.iloc[train_index], target.iloc[val_index]
        data_train = lgb.Dataset(X_train, y_train)
        data_val = lgb.Dataset(X_val, y_val)
        watchlist = [data_train,data_val]
        n_round = 2500
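        # Newer LightGBM releases (4.0+) no longer accept `early_stopping_rounds`
        # in lgb.train; the early_stopping callback also works in older releases.
        model = lgb.train(dict(paras),
                          data_train,
                          num_boost_round = n_round,
                          valid_sets = watchlist,
                          callbacks=[lgb.early_stopping(stopping_rounds=128)])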
        
        # predict and valid
        data_test = x_test
        y_sub = y_sub + model.predict(data_test) / kf.n_splits 
        
        data_val_test = X_val
        y_val_hat = model.predict(data_val_test) 
        fold_roc_auc_score = roc_auc_score(y_val, y_val_hat)  
        print(f'{fold + 1}fold roc_auc_score: {fold_roc_auc_score}')
        y_train_hat.loc[y_val.index] = y_val_hat
        

        # feature importance 
        fold_importance_df = pd.DataFrame()
        fold_importance_df["feature"] = model.feature_name()
        fold_importance_df["importance"] = model.feature_importance()  # split,
#         fold_importance_df["importance"] = model.feature_importance(importance_type = 'gain')  # gain

        fold_importance_df["fold"] = fold + 1
        feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
                            
    return y_sub, y_train_hat,feature_importance_df

In [None]:
seed = 2022
y_sub_lgb, y_train_hat_lgb,feature_importance_lgb = kfold_lgb(X_train, X_test, y_train, seed)

#f1 = f1_score(y, np.where(y_train_hat_lgb >= 0.5, 1, 0))
#auc = roc_auc_score(y, y_train_hat_lgb)
#print(f'all seed f1: {f1}, all seed auc: {auc}')

In [None]:
# LGB features importance

N = 10 # TOP N
cols = (feature_importance_lgb[["feature", "importance"]] .groupby("feature").mean().sort_values(by="importance", ascending=False)[:N].index)
best_features_lgb = feature_importance_lgb.loc[feature_importance_lgb.feature.isin(cols)].sort_values(by='importance',ascending=False)
best_features_lgb

# Submission

In [None]:
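test_df['state'] = y_sub_lgb
# Average the per-row predictions so that every sequence gets a single score.
test_df['pred_LGB_mean'] = test_df.groupby(['sequence'])['state'].transform('mean')
test_df['state'] = test_df.pred_LGB_mean
# Rows of one sequence still differ in `step`, so deduplicate on `sequence` only.
test_df = test_df.drop_duplicates(subset='sequence')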

In [None]:
sub2 = sample_sub
sub2['state'] = test_df['state'].values 
sub2.to_csv('submission.csv', index=False)

In [None]:
sub2.head()