# You should know about "Data Shape"

Reading the shape of the data is very important, and so is domain knowledge about it. Domain knowledge here means the validated knowledge of a specialized field, whether that field involves human activity or autonomous computer activity.

Form hypotheses while looking at the data, then test them. That seems to be what a data analyst should do.

In [None]:
import pandas as pd 

X_train = pd.read_csv('../input/tabular-playground-series-apr-2022/train.csv')
X_train.head()

In [None]:
y_train = pd.read_csv('../input/tabular-playground-series-apr-2022/train_labels.csv')
y_train.head()

In [None]:
X_test = pd.read_csv('../input/tabular-playground-series-apr-2022/test.csv')
X_test.head()

The following insights come from [this reference notebook](https://www.kaggle.com/code/matanivanov/lgbm-with-fourier-transform):

* There is no overlap between train data and test data
* Some subjects appear often.
* The number of sequences per subject is the same.
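
Observations like these can be checked directly with pandas. The sketch below is a minimal illustration on a small synthetic frame: the column names `sequence`, `subject`, and `step` are assumed to match the competition files, while `make_toy` and every value in the toy frames are hypothetical and say nothing about the real data.

```python
import pandas as pd

def make_toy(sequences_per_subject, start_seq=0, n_steps=3):
    """Hypothetical helper: build a long-format frame, one row per (sequence, step)."""
    rows = []
    seq = start_seq
    for subject, n_seq in sequences_per_subject.items():
        for _ in range(n_seq):
            rows.extend({"sequence": seq, "subject": subject, "step": s}
                        for s in range(n_steps))
            seq += 1
    return pd.DataFrame(rows)

# Invented toy data, not the competition data.
toy_train = make_toy({1: 2, 2: 5})
toy_test = make_toy({3: 1}, start_seq=100)

# Do any subjects appear in both train and test?
shared_subjects = set(toy_train["subject"]) & set(toy_test["subject"])

# How many distinct sequences does each subject have?
seqs_per_subject = toy_train.groupby("subject")["sequence"].nunique()

# How many steps does each sequence have?
steps_per_sequence = toy_train.groupby("sequence")["step"].count()

print(shared_subjects)
print(seqs_per_subject.to_dict())
print(steps_per_sequence.unique())
```

Running the same three checks on `X_train` and `X_test` shows whether each observation holds for the real files.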

The data has the following characteristics.

Each sequence belongs to one subject. The unit of analysis is the sequence, but the sequence number itself carries no meaning; it is simply an index. The meaningful content is the sensor data, which consists of 60 time steps per sequence.

There are therefore two options: reduce the 60 steps to a smaller set of features, or keep them as an additional axis.

To reduce the 60 dimensions, you can use representative values (mean, median, etc.) or, because the signals are periodic, a Fourier transform.

To keep all 60 dimensions, arrange the data as a 3D array (sequences, steps, sensors). In that case, deep learning methods are used rather than classical machine learning. See [this notebook](https://www.kaggle.com/code/azzamradman/tps04-best-single-model-0-989) for an example.

**Before that**

The Fourier transform is a very important concept not only in signal processing, speech, and communications but also in image processing, and it has many applications. Converting an image into its frequency components enables many kinds of analysis and processing, and arbitrary filtering operations can be implemented efficiently with the fast Fourier transform (FFT). Because a fundamental theory like the Fourier transform is not tied to any specific application, knowing it is helpful whatever field you work in.

**The Fourier transform decomposes an arbitrary input signal into a sum of periodic functions with various frequencies.**
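
As a minimal sketch of this decomposition, the example below builds a synthetic 60-sample signal from two sine waves and recovers their frequencies with NumPy's real FFT. The signal, its amplitudes, and the chosen frequencies (3 and 10 cycles per window) are assumptions made for illustration; they are not taken from the competition data.

```python
import numpy as np

n_steps = 60                       # same length as one sequence in this dataset
t = np.arange(n_steps)

# Synthetic signal: 3 cycles per window (amplitude 2.0) plus 10 cycles (amplitude 0.5).
signal = (2.0 * np.sin(2 * np.pi * 3 * t / n_steps)
          + 0.5 * np.sin(2 * np.pi * 10 * t / n_steps))

# Real-input FFT: returns n_steps // 2 + 1 complex frequency bins.
spectrum = np.fft.rfft(signal)

# Rescale magnitudes so that a pure sine of amplitude A shows up as A.
amplitudes = np.abs(spectrum) * 2 / n_steps

# The two largest bins are the frequencies used to build the signal.
top_bins = sorted(int(i) for i in np.argsort(amplitudes)[-2:])

print(len(spectrum))
print(top_bins)
print(np.round(amplitudes[top_bins], 3))
```

A real signal of 60 samples yields 31 frequency bins, which is why the FFT feature function in this notebook names its columns with `range(31)`.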

In [None]:
from IPython.display import display

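# 1. Use representative values

# Average each sensor over the steps of a sequence. 'step' is dropped first
# because its mean carries no information. Call reset_index() on the result
# if 'sequence' and 'subject' are needed as regular columns.
display(X_train.drop(columns='step').groupby(['sequence','subject']).mean())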

# 2. fourier transform

from scipy.fft import rfft
import numpy as np

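def make_fft_features(group):
    # One sequence in, one row out: the magnitude of every real-FFT bin, per sensor.
    # A 60-step sequence gives 60 // 2 + 1 = 31 bins, hence range(31).
    return pd.concat(
        [pd.Series(np.abs(rfft(group[col].values)),
                   index=[f'{col}_freq_{i}' for i in range(31)])
         for col in group.columns if col not in ['sequence', 'subject', 'step']
        ])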

train_df = X_train.sort_values(['subject', 'sequence', 'step'])\
    .groupby(['sequence', 'subject']).apply(make_fft_features)

display(train_df)

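# 3. deep learning

# Build a 3D array (sequences, steps, sensors) for models such as Conv1d and LSTM.
# The id columns are excluded so that only sensor readings become features.
# Rows are sorted so that each block of 60 consecutive rows is one sequence.
sensor_cols = [c for c in X_train.columns if c not in ['sequence', 'subject', 'step']]
X_train_array = (X_train.sort_values(['sequence', 'step'])[sensor_cols]
                 .to_numpy().reshape(-1, 60, len(sensor_cols)))
print(X_train_array.shape)
display(X_train_array)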


Machine learning is not inferior to deep learning, nor is deep learning inferior to machine learning. Whichever model performs best after you have tried several approaches is the best model.
