# Tabular Playground Series - Apr 2022

This is a notebook for the Kaggle Tabular Playground Series competition of April 2022. We are given readings from 13 sensors (`sensor_00` to `sensor_12`) and, based on those values, we need to perform a binary classification. Let's start!

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Import libraries: 

In [None]:
import scipy as sp
import pandas as pd
import numpy as np

# EDA (Exploratory Data Analysis)

### Check variables and observations in test and train data

#### Train data

In [None]:
df_train = pd.read_csv('../input/tabular-playground-series-apr-2022/train.csv')

In [None]:
df_train.head()

In [None]:
df_train.shape

#### Test data

In [None]:
df_test = pd.read_csv('../input/tabular-playground-series-apr-2022/test.csv')

In [None]:
df_test.head()

In [None]:
df_test.shape

#### Train_labels

In [None]:
df_train_labels = pd.read_csv('../input/tabular-playground-series-apr-2022/train_labels.csv')
df_train_labels.head()

In *train_labels*, the variable *state* gives the target label for each sequence in the train data.

### Target distribution

A roughly normal target often helps in a regression model, and in a classification problem it helps to have all classes about equally represented. With a heavily imbalanced target, a model can reach high accuracy just by favouring the majority class, so accuracy alone becomes misleading.

When a regression target is strongly right-skewed, a log transformation is a common way to bring the skewness closer to zero. Our target is binary, so here we only need to check the class balance.
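To illustrate the effect of a log transformation on skewness, here is a minimal sketch on synthetic data. The values are randomly generated for illustration and have nothing to do with the competition data; `log1p` is just one common choice of transformation.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values (hypothetical data, not from the competition)
rng = np.random.default_rng(0)
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))

# log1p computes log(1 + x), which stays defined when x is zero
transformed = np.log1p(skewed)

skew_before = skewed.skew()
skew_after = transformed.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

The skewness after the transformation should be noticeably closer to zero than before.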

In [None]:
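# value_counts() sorts by frequency, so look the counts up by label instead of by position
state_counts = df_train_labels.state.value_counts()
positive_state, negative_state = state_counts[1], state_counts[0]
print('There are {} positive and {} negative states'.format(positive_state, negative_state))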

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.pie([positive_state, negative_state], labels=['True', 'False'], autopct='%1.1f%%', shadow=True, startangle=45, textprops=dict(color="white", fontsize=15, weight="bold"), colors=['orangered', 'steelblue'])
ax.set_title('Target distribution')
ax.legend(title='State')

As we can see, the two target classes are almost equally represented.

### Input data

We need to check the following:
* The type of variables
* Are the values within the range (not applicable here)
* Explore missing values and choose a suitable technique for handling them

In [None]:
# Data types
df_train.dtypes

#### Categorical variables

Although we can already see that there are no categorical variables in this dataset, we can confirm it formally: a column with the `object` dtype would contain text.

In [None]:
# Boolean Series: True for every column with the object (text) dtype
s = (df_train.dtypes == 'object')

# Get only the names of those columns as a list
object_cols = list(s[s].index)

print("Number of categorical variables: {}".format(len(object_cols)))

#### Check for NULL values

In [None]:
# Check for NULL values
df_train.isnull().sum().sum()

As we can see, there are no NULL values in our DataFrame

### Descriptive statistics 

In [None]:
# Describe values from that data frame
df_train.describe().T

I want to know whether every sequence has the same length (steps 0 to 59, i.e. 60 seconds).

To count the steps, I need to group by sequence and subject.

In [None]:
for observation in df_train.groupby(['sequence', 'subject']).size():
    if observation != 60:
        print('There is a sequence which is not 60 seconds long')

We can see that each sequence is 60 seconds long (which is great)

In [None]:
# I want to know how many participants are there?
df_train.reset_index().subject.nunique()

There are **672** participants in this data frame

I want to know how many sequences were recorded for each subject.

In [None]:
duration_of_measurement = 60 # In seconds
df_train['subject'].value_counts().sort_index() / duration_of_measurement

## The sensors

First, we plot 13 boxplots, one per sensor, to get a first look at the outliers.

In [None]:
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(20,10))
for index, sensor in enumerate(df_train.columns):
    if sensor.startswith('sensor'):
        plt.subplot(4,4, index - 2)
        plt.boxplot(df_train[sensor])

As we can see, most sensors have a large number of outliers. Boxplots are not very practical here because we have about 1.5 million observations, so we use histograms instead.

In [None]:
fig = plt.figure(figsize=(20,10))
for index, sensor in enumerate(df_train.columns):
    if sensor.startswith('sensor'):
        plt.subplot(4,4, index - 2)
        plt.hist(df_train[sensor], bins = 30)

Let's compute the upper and lower bounds for every sensor, and count its outliers, using the IQR method.

In [None]:
# Create an empty data frame
index_labels = [col for col in df_train.columns if col.startswith('sensor')]
df_sensor_iqr = pd.DataFrame(columns=['upper_bound', 'lower_bound', 'mean', 'NumberOfOutliers'], index = index_labels)

for index, sensor in enumerate(df_train.columns):
    if sensor.startswith('sensor'):
        q1 = df_train[sensor].quantile(q = 0.25)
        q3 = df_train[sensor].quantile(q = 0.75)
        mean = df_train[sensor].mean()
        
        # IQR region
        IQR = q3 - q1
        
        # finding upper and lower whiskers
        upper_bound = q3 + (1.5 * IQR)
        lower_bound = q1 - (1.5 * IQR)
        
        # Number of outliers
        count_outliers = df_train[(df_train[sensor] <= lower_bound) | (df_train[sensor] >= upper_bound)]
        
        df_sensor_iqr.loc[index_labels[index - 3]] = [upper_bound, lower_bound, mean, count_outliers.shape[0]]
        
df_sensor_iqr

As we can see, we cannot just 'delete' outliers. Some sensors (like sensor_12) have almost 30% of their values outside the lower and upper bounds.
The IQR (interquartile range) is a measure of statistical dispersion. By definition, 50% of all values lie within the IQR. For normally distributed data, about 99.3% of the values lie between the lower and upper bounds.
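The IQR rule can also be written in a vectorized way, for all columns at once. The following is only a sketch: the function name `iqr_outlier_summary` and the toy columns are made up for illustration, and the 1.5 multiplier is the usual convention.

```python
import pandas as pd

def iqr_outlier_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Lower/upper IQR bounds and outlier counts for every column of df."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    # A value counts as an outlier when it lies on or beyond a bound
    outside = df.le(lower, axis="columns") | df.ge(upper, axis="columns")
    return pd.DataFrame({
        "lower_bound": lower,
        "upper_bound": upper,
        "n_outliers": outside.sum(),
        "share_outliers": outside.mean(),
    })

# Toy data (hypothetical): sensor_a has one extreme value, sensor_b has none
toy = pd.DataFrame({
    "sensor_a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "sensor_b": [1.0, 2.0, 3.0, 4.0, 5.0],
})
summary = iqr_outlier_summary(toy)
print(summary)
```

The `share_outliers` column gives the fraction of values outside the bounds, which is easier to compare across sensors than the raw count.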

Let's plot the number of outliers (use barplot):

In [None]:
import seaborn as sns

fig, ax = plt.subplots(figsize=(14,5))

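# Sort the counts once and take both the labels and the bar heights from the sorted Series,
# so that every bar keeps its own sensor name
outliers_sorted = df_sensor_iqr.NumberOfOutliers.astype(int).sort_values(ascending=False)
sns.barplot(x=outliers_sorted.index, y=outliers_sorted.values, ax=ax)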

In [None]:
from IPython.display import Image

Image(url='https://upload.wikimedia.org/wikipedia/commons/1/1a/Boxplot_vs_PDF.svg')

This notebook: https://www.kaggle.com/code/ambrosm/tpsapr22-eda-which-makes-sense suggests using some kind of non-linear transformation to bring the sensor values closer to a normal distribution.

#### *NOTE* : Mahalanobis distance

A common approach to detecting outliers in multivariate statistics is the Mahalanobis distance. I tried two approaches to dealing with the outliers.

The first one was to implement the Mahalanobis distance myself, add it as a new variable for each observation, and use it to detect outliers. This approach didn't work because the implementation was quite memory-consuming and ran out of memory.

The second approach was to use the existing `mahalanobis` function in the **R** language. However, the version of the *rpy2* package, which serves as an interface to **R**, is deprecated in Anaconda and cannot be used.
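For reference, here is one way the Mahalanobis distance could be computed with plain NumPy while only keeping an n-by-p intermediate in memory (n rows, p sensors). This is a sketch under assumptions, not the implementation I tried above: the function name and the synthetic data are made up, and the distance is measured from the column means using the sample covariance.

```python
import numpy as np

def mahalanobis_distances(X: np.ndarray) -> np.ndarray:
    """Mahalanobis distance of every row of X from the column means."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # Pseudo-inverse keeps this usable when the covariance is (nearly) singular
    inv_cov = np.linalg.pinv(cov)
    # Row-wise quadratic form (x - mu)^T S^-1 (x - mu), without an n-by-n matrix
    squared = np.sum((centered @ inv_cov) * centered, axis=1)
    return np.sqrt(np.maximum(squared, 0.0))

# Synthetic example (hypothetical data): 1000 ordinary points plus one extreme point
rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 3))
points = np.vstack([points, [10.0, 10.0, 10.0]])

distances = mahalanobis_distances(points)
print("index of the most distant point:", int(np.argmax(distances)))
```

Rows with an unusually large distance are candidates for multivariate outliers. Where to put the cut-off is a separate decision that this sketch does not make.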

### Correlation between sensors

In [None]:
fig, ax = plt.subplots(figsize=(14, 8))

sensors = [column for column in df_train.columns if column.startswith('sensor')]

# define the mask to set the values in the upper triangle to True
mask = np.triu(np.ones_like(df_train[sensors].corr(), dtype=np.bool_))

sns.heatmap(df_train[sensors].corr(), mask=mask, vmin=-1, vmax=1, annot=True, cmap='seismic')

As we can see, the strongest positive correlations are between (*sensor_00*, *sensor_06*), (*sensor_00*, *sensor_09*), (*sensor_03*, *sensor_07*) and (*sensor_03*, *sensor_11*).

### Feature engineering

As we have more than 1.5 million observations, it is more memory-efficient to remove one variable from each highly correlated pair, since a variable that largely duplicates another one adds little information to the model.
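Instead of reading the pairs off the heatmap, they can also be listed programmatically. This is a small sketch on toy data: the helper name `highly_correlated_pairs`, the column names and the 0.8 threshold are assumptions for illustration, not values used in this notebook.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Column pairs whose absolute Pearson correlation reaches the threshold."""
    corr = df.corr()
    # Keep only the strict upper triangle so every pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    # NaN entries (masked cells) never pass the comparison, so they drop out here
    return pairs[pairs.abs() >= threshold].sort_values(ascending=False)

# Toy data (hypothetical): b is almost a multiple of a, c is only weakly related
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 7.8, 10.1],
    "c": [2.0, -1.0, 0.0, 3.0, -2.0],
})
pairs = highly_correlated_pairs(toy, threshold=0.8)
print(pairs)
```

From each listed pair, one of the two columns can then be dropped.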

In [None]:
# Drop values of specific columns
df_train = df_train.drop(['sensor_06', 'sensor_07', 'sensor_09', 'sensor_11'], axis = 1)
df_test = df_test.drop(['sensor_06', 'sensor_07', 'sensor_09', 'sensor_11'], axis = 1)

In [None]:
fig, ax = plt.subplots(figsize=(14, 8))

sensors = [column for column in df_train.columns if column.startswith('sensor')]

# define the mask to set the values in the upper triangle to True
mask = np.triu(np.ones_like(df_train[sensors].corr(), dtype=np.bool_))

sns.heatmap(df_train[sensors].corr(), mask=mask, vmin=-1, vmax=1, annot=True, cmap='seismic')

# Feature extraction

*NOTE*: This part is based on https://www.kaggle.com/code/reymaster/apr-2022-tps-simple-time-series-analysis-xgboost notebook

We are using the *tsfresh* package for feature extraction. *tsfresh* is very convenient in this situation because it summarizes each variable's values per sequence number, i.e. it computes summary statistics for every group of rows that share the same `sequence`. As a result, the extracted training set has one row per sequence, the same number of rows as the training labels, which is what we need for training the model.
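To make the idea of "summary statistics per sequence" concrete, here is a rough pandas equivalent on a toy frame. It is not how *tsfresh* is implemented, and the set of statistics and their exact definitions may differ from what *tsfresh* returns; the toy values are made up.

```python
import pandas as pd

# Toy long-format data (hypothetical): two sequences with three steps each
toy = pd.DataFrame({
    "sequence": [0, 0, 0, 1, 1, 1],
    "step": [0, 1, 2, 0, 1, 2],
    "sensor_00": [1.0, 2.0, 3.0, 10.0, 10.0, 13.0],
})

# One row per sequence, one column per summary statistic
summary = (
    toy.sort_values(["sequence", "step"])
       .groupby("sequence")["sensor_00"]
       .agg(["mean", "median", "std", "min", "max"])
)
print(summary)
```

The six input rows collapse into two rows, one per sequence, which is the shape we need to line the features up with the labels.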

In [None]:
from tsfresh.feature_extraction import extract_features, MinimalFCParameters

In [None]:
# Return the train_X which contains summary statistics for each variable group by sequence variable
train_X = extract_features(df_train, default_fc_parameters=MinimalFCParameters(), column_id="sequence", column_sort="step")

# Same here
test_X = extract_features(df_test, default_fc_parameters=MinimalFCParameters(), column_id="sequence", column_sort="step")

In [None]:
train_X;

In [None]:
test_X;

### Imputation methods

`extract_features` can also produce NaN values, created by feature calculators that cannot be applied to the given data, e.g. because there are too few data points.

Although the `.info()` method shows that there are no NaN values in our data, we will perform imputation for educational purposes.

In [None]:
#train_X.info();

A popular way to perform imputation in *R* is to use the MICE package. In Python, comparably sophisticated imputation methods are less established (or they are still in experimental status). That's why we are using `SimpleImputer` to replace missing values with the median value of each column.
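For comparison, scikit-learn does ship an experimental imputer inspired by MICE, `IterativeImputer`, which has to be enabled explicitly. The sketch below only shows how it is called on a tiny made-up array; it is not used in the rest of this notebook, and `max_iter=10` and `random_state=0` are arbitrary choices.

```python
import numpy as np
# This import is required to unlock the experimental IterativeImputer class
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data (hypothetical): the second column is twice the first, with one gap
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [5.0, 10.0],
])

imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Unlike the median, the imputed value here is estimated from the other column, so it can follow relationships between variables.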

In [None]:
from sklearn.impute import SimpleImputer

# Use median for imputation
simple_imputer = SimpleImputer(strategy = 'median')

# Save column names as imputation is removing them
column_names = train_X.columns

train_X = pd.DataFrame(simple_imputer.fit_transform(train_X))
test_X = pd.DataFrame(simple_imputer.transform(test_X))

# Imputation removed column names; put them back
train_X.columns = column_names
test_X.columns = column_names

### Select relevant features

In [None]:
from tsfresh import select_features

# Get the state values of train_labels (using for selecting the features)
y = pd.Series(df_train_labels.state)

In [None]:
# Select features which might be interesting (select appropriate columns)
X_train_selected = select_features(train_X, y)
X_train_selected

In [None]:
# Select the same columns for test data (on which we will make predictions)
X_test_selected = test_X[X_train_selected.columns]

We have two DataFrames - *X_train_selected* and *X_test_selected*. Both frames have the same variables, which are the relevant features selected from the basic statistics provided by the *tsfresh* package.

# XGBoost

Gradient boosting is an ensemble method that combines the predictions of several models. XGBoost (*extreme* gradient boosting) is a gradient boosting implementation that provides additional features focused on speed and performance.

In [None]:
from xgboost import XGBClassifier

### Model creation

***NOTE***: in this scenario, we don't use cross-validation because our dataset is classified as *large* (1.5 million observations). We have sufficient data to fit the model on a training split and evaluate it on a held-out validation split. Here, cross-validation would be useful only for educational purposes.
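For educational purposes, this is roughly what stratified k-fold cross-validation looks like with scikit-learn. The sketch runs on synthetic data and uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost; the number of folds, the number of trees and the data itself are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (hypothetical, not the competition data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Stand-in model; in this notebook the real model is XGBClassifier
demo_model = GradientBoostingClassifier(n_estimators=50, random_state=0)

# Stratified folds keep the class balance similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(demo_model, X_demo, y_demo, cv=cv, scoring="f1")

print("F1 per fold:", scores.round(3))
print("mean F1:", scores.mean().round(3))
```

Each fold serves once as validation data, so we get five scores instead of a single one and can see how much they vary.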

In [None]:
from sklearn.model_selection import train_test_split

# Split matrices into random train and test subsets
X_train, X_test, y_train, y_test = train_test_split(X_train_selected, y)

#### Parameter tuning

In the `XGBClassifier` class, we set a few parameters:
- `n_estimators` - the maximum number of boosting rounds (trees) in the ensemble
- `early_stopping_rounds` - stop training once the validation score has not improved for this many rounds
- `learning_rate` - the step size applied to each tree's contribution
- `n_jobs` - the number of parallel threads, typically set to the number of cores on the machine

In [None]:
from xgboost import XGBClassifier

# Create the model
model = XGBClassifier(n_estimators = 1000, 
                      early_stopping_rounds = 5, 
                      learning_rate = 0.05, 
                      n_jobs = 2)

In [None]:
%%time

# Fit the model with X_train and y_train data 
model.fit(X_train, 
          y_train,
          eval_set=[(X_test, y_test)], 
          verbose=False)

### Model evaluation

We use a confusion matrix and the F1 score to evaluate the model.
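As a reminder of how the two are connected, the F1 score can be computed directly from the cells of the confusion matrix: precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean. The labels below are made up for illustration.

```python
import numpy as np

# Toy labels and predictions (hypothetical)
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1, 1, 0])

# Cells of the confusion matrix
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

With 3 true positives, 1 false positive and 1 false negative, precision and recall are both 0.75, and so is F1.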

In [None]:
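from sklearn.metrics import f1_score, ConfusionMatrixDisplay

# Plot the confusion matrix on the hold-out split
# (plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it)
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)

# Make predictions on the hold-out split X_test
xgb_prediction = model.predict(X_test)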

In [None]:
# Calculating the F1 score of classifier
print(f"F1 Score of the classifier is: {f1_score(y_test, xgb_prediction)}")

### Prediction

In [None]:
prediction_values = model.predict(X_test_selected)

# Submission

In [None]:
sample_sub = pd.read_csv("../input/tabular-playground-series-apr-2022/sample_submission.csv")

submission = pd.DataFrame({
    "sequence" : sample_sub.sequence,
    "state" : prediction_values
})

submission.to_csv("submission.csv", index=False)