# LGBM with Fourier transform

In this notebook I will show you how to easily generate features from time series using the Fourier transform

The Fourier transform is widely used in real-world tasks such as audio noise reduction and image compression. It is also used in medical applications, so it's a perfect fit for this task!

You can read more about the Fourier transform:
- [Real Python guide to scipy.fft](https://realpython.com/python-scipy-fft/)
- [thefouriertransform.com](https://thefouriertransform.com/)

And watch this video to get a visual intuition for it:
- [video](https://www.youtube.com/watch?v=spUNpyF58BY)

## Load data

Our train and test data consist of 13 sensor readings at every step. Rows are indexed by the subject id and the sequence id

In [None]:
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

X_train = pd.read_csv('../input/tabular-playground-series-apr-2022/train.csv')
X_train.head()

In [None]:
y_train = pd.read_csv('../input/tabular-playground-series-apr-2022/train_labels.csv')
y_train.head()

In [None]:
X_test = pd.read_csv('../input/tabular-playground-series-apr-2022/test.csv')
X_test.head()

## EDA

Now let's take a closer look at the data. Finding a data leak that other competitors or the organizers did not expect is always a good way to improve your final score. My first idea was to check whether the same subjects appear in both train and test

But we see no intersection here

In [None]:
set(X_train['subject']).intersection(set(X_test['subject']))

The second idea was to analyze how many times each subject appears in train and test

In [None]:
%matplotlib inline

In [None]:
from matplotlib import pyplot as plt

_, axs = plt.subplots(1, 2, figsize=(15, 5))
X_train.groupby('subject')['sequence'].nunique().value_counts().plot(kind='hist', ax=axs[0], title='Train')
X_test.groupby('subject')['sequence'].nunique().value_counts().plot(kind='hist', ax=axs[1], title='Test')

As we see, most subjects appear up to 5 times in train and test, but some subjects appear many more times. And if a subject appears very often in the data, it also has a higher target rate. This could definitely be used as a feature

In [None]:
from IPython.display import display

# Target rate per subject against the number of sequences recorded for that subject
target_rate = X_train[['subject', 'sequence']].drop_duplicates()\
    .merge(y_train, on='sequence')
target_rate.groupby('subject').agg({'state': 'mean', 'sequence': 'count'})\
    .sort_values('sequence').plot(x='sequence', y='state', ylabel='target rate', xlabel='appearances',
                                  title='Target rate vs. subject appearances')

Every sequence has the same length of 60 steps

In [None]:
X_train.groupby(['subject', 'sequence'])['step'].nunique().value_counts()

## Make frequency features with Fourier transform

Now let's get to the Fourier transform and its SciPy implementation. `scipy.fft` has a bunch of functions:
- fft
- ifft
- rfft
- irfft
- ...

Here "ft" stands for Fourier transform, the first "f" stands for "fast" (the Fast Fourier Transform is the algorithm used to compute it), "i" stands for the inverse transform (`ifft(fft(x))` recovers `x` up to floating-point error), and "r" stands for real, meaning our input data doesn't contain complex numbers

So we first group the data by sequence and, for each sensor column in a sequence, use the real fast Fourier transform to get its representation in "frequency space". A sequence of 60 steps gives us len(sequence) // 2 + 1 = 31 frequencies. And since rfft returns complex values, we take the absolute value (the magnitude) of each one as a measure of how strong that frequency is
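To make the shapes concrete, here is a minimal sketch on a synthetic signal. The signal itself is an assumption for illustration (two sine waves, not competition data); only `rfft`, `irfft` and `np.abs` mirror what this notebook relies on.

```python
import numpy as np
from scipy.fft import rfft, irfft

# Toy "sensor" signal: 60 steps, a slow wave (3 cycles) plus a weaker fast one (10 cycles).
n_steps = 60
t = np.arange(n_steps)
signal = np.sin(2 * np.pi * 3 * t / n_steps) + 0.5 * np.sin(2 * np.pi * 10 * t / n_steps)

spectrum = rfft(signal)        # complex values, one per frequency bin
magnitudes = np.abs(spectrum)  # real, non-negative strengths

print(spectrum.shape)                # (31,) because 60 // 2 + 1 == 31
print(np.iscomplexobj(spectrum))     # True
print(np.argsort(magnitudes)[-2:])   # the two strongest bins: [10  3]

# The inverse transform recovers the original signal up to floating-point error.
restored = irfft(spectrum, n=n_steps)
print(np.allclose(restored, signal)) # True
```

The two strongest bins are exactly the two frequencies that were mixed into the signal, which is why the magnitudes can work as features.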

It's also a good idea to keep only the "low" frequencies from rfft, as "high" frequencies often represent noise, but for now we keep all of them

In [None]:
from scipy.fft import rfft
import numpy as np

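def make_fft_features(group):
    """Return the rfft magnitudes of every sensor column of one sequence as a flat Series."""
    return pd.concat(
        [pd.Series(np.abs(rfft(group[col].values)),
                   index=[f'{col}_freq_{i}' for i in range(31)])  # 60 steps -> 31 frequency bins
         for col in group.columns if col not in ['sequence', 'subject', 'step']
        ])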

train_df = X_train.sort_values(['subject', 'sequence', 'step'])\
    .groupby(['sequence', 'subject']).apply(make_fft_features)
train_df.head()
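If you later want to try the "low frequencies only" idea, one possible approach is to filter columns by the `_freq_{i}` suffix used above. This is a sketch, not part of the pipeline in this notebook: `keep_low_freqs`, the cutoff of 10 and the toy column names are all hypothetical.

```python
import re
import pandas as pd

def keep_low_freqs(features: pd.DataFrame, max_freq: int) -> pd.DataFrame:
    """Drop every '<col>_freq_<i>' column whose bin index i exceeds max_freq (hypothetical helper)."""
    pattern = re.compile(r'_freq_(\d+)$')
    keep = []
    for col in features.columns:
        match = pattern.search(col)
        # Columns without the suffix (for example ids) are kept untouched.
        if match is None or int(match.group(1)) <= max_freq:
            keep.append(col)
    return features[keep]

# Toy frame with made-up column names that follow the naming scheme above.
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0]],
    columns=['sensor_00_freq_0', 'sensor_00_freq_1', 'sensor_00_freq_29', 'sensor_00_freq_30'],
)
low = keep_low_freqs(toy, max_freq=10)
print(list(low.columns))  # ['sensor_00_freq_0', 'sensor_00_freq_1']
```

Whether a cutoff helps, and where to put it, is something to check with validation scores rather than assume.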

Also don't forget to use the number of subject appearances as a feature. To make it comparable between train and test, convert it to a percentile rank computed within each dataset separately

In [None]:
n_sequence = X_train.groupby('subject')['sequence'].nunique()
perc_sequence = n_sequence.rank(method='max').apply(lambda x: 100.0*(x-1)/len(n_sequence))
perc_sequence.name = 'n_sequence_percentile'
perc_sequence.head()
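To see what this formula produces, here is a small worked example on made-up counts (the four subjects and their counts are hypothetical). Ties get the highest rank of their group, then the rank is shifted to start at zero and scaled by the number of subjects.

```python
import pandas as pd

# Toy counts: number of sequences recorded for four hypothetical subjects.
toy_counts = pd.Series([1, 1, 2, 5], index=['a', 'b', 'c', 'd'])

# Same formula as in the cell above.
toy_perc = toy_counts.rank(method='max').apply(lambda x: 100.0 * (x - 1) / len(toy_counts))
print(toy_perc.tolist())  # [25.0, 25.0, 50.0, 75.0]
```

Note that the result depends only on the ordering of subjects inside one dataset, not on the raw counts, which is what makes it usable on both train and test.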

Build the final dataset

In [None]:
train_df = train_df.reset_index()\
    .merge(perc_sequence, on='subject')\
    .merge(y_train, on='sequence')
train_df.head()

## Modeling

The current version uses a simple LightGBM model with no parameter tuning. But I decided to show how to use group k-fold validation. It is very useful when model predictions are sensitive to some hidden variable, like the subject id in our case, because every subject ends up entirely in either the training part or the validation part of a fold. It also lets us fit several models on different data and use them for blending later

In [None]:
from sklearn.model_selection import GroupKFold
import lightgbm
from sklearn.metrics import roc_auc_score

model_list = []
# Build the feature matrix and the target once instead of on every fold
features = train_df.drop(['sequence', 'subject', 'state'], axis=1)
target = train_df['state']

group_kfold = GroupKFold(n_splits=5)
# Grouping by subject keeps each subject entirely in train or in validation
for train_index, test_index in group_kfold.split(features, target, train_df['subject']):
    X_train_group, X_test_group = features.iloc[train_index], features.iloc[test_index]
    y_train_group, y_test_group = target.iloc[train_index], target.iloc[test_index]

    lgbm = lightgbm.LGBMClassifier()
    model_list.append(lgbm.fit(X_train_group, y_train_group))
    fold_score = roc_auc_score(y_test_group, lgbm.predict_proba(X_test_group)[:, 1])
    print(f'Fold score: {fold_score}')
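To check what the group split guarantees, here is a toy sketch with made-up data (8 rows, 4 hypothetical subject ids). It only verifies that no subject is shared between the training and validation parts of a fold. Which subjects land in which fold is up to scikit-learn.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 8 rows belonging to 4 hypothetical subjects.
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1])
groups_toy = np.array([10, 10, 20, 20, 30, 30, 40, 40])

overlaps = []
for train_idx, valid_idx in GroupKFold(n_splits=2).split(X_toy, y_toy, groups_toy):
    train_subjects = set(int(g) for g in groups_toy[train_idx])
    valid_subjects = set(int(g) for g in groups_toy[valid_idx])
    overlaps.append(train_subjects & valid_subjects)
    print('validation subjects:', sorted(valid_subjects))

# No subject appears on both sides of any fold.
print(all(len(o) == 0 for o in overlaps))  # True
```

With a plain `KFold` the rows of one subject could be split across both sides, which would let the model score well just by recognizing subjects it has already seen.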

## Predict and final submission

To get the final predictions of our model, apply the same transformations to the test data as we did to the train data

In [None]:
test_df = X_test.sort_values(['subject', 'sequence', 'step'])\
    .groupby(['sequence', 'subject']).apply(make_fft_features)
n_sequence_test = X_test.groupby('subject')['sequence'].nunique()
perc_sequence_test = n_sequence_test.rank(method='max').apply(lambda x: 100.0*(x-1)/len(n_sequence_test))
perc_sequence_test.name = 'n_sequence_percentile'
test_df = test_df.reset_index()\
    .merge(perc_sequence_test, on='subject')
test_df.head()

Get predictions of all models and average them to get final model predictions

In [None]:
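# Copy so that adding the 'state' column does not trigger SettingWithCopyWarning
submission = test_df[['sequence']].copy()
predictions = pd.DataFrame(
    [m.predict_proba(test_df.drop(['sequence', 'subject'], axis=1))[:, 1] for m in model_list]).T
# Average the per-fold probabilities row by row
submission['state'] = predictions.mean(axis=1)
submission.to_csv('lgbm_fourier.csv', index=False)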
submission.head()
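The averaging step above can be pictured on a tiny example. The probabilities below are invented for illustration, standing in for the outputs of two fold models on three test sequences.

```python
import numpy as np

# Hypothetical probabilities from two fold models for the same three test rows.
fold_probs = [
    np.array([0.2, 0.9, 0.5]),
    np.array([0.4, 0.7, 0.5]),
]

# Simple blend: the mean over models, computed separately for each row.
blended = np.mean(fold_probs, axis=0)
print(blended)
```

Each blended value lies between the individual model outputs, so a single over-confident fold model has less influence on the final prediction.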