In [a previously published notebook](https://www.kaggle.com/code/tmrtj9999/number-of-investment-id-as-a-feature), I pointed out that the number of investment_ids present at each time_id works as a feature.

To summarize briefly: the number of investment_ids present at a given time_id is thought to reflect market conditions at that time. Adding this count as a feature lets the model learn about overall market conditions, which is expected to improve its accuracy.


When I posted the code, someone commented that they had noticed the same thing and were creating features using this investment_id.

So, following their lead, I would like to do some feature engineering based on this number of investment_ids.

First, the data is read into a DataFrame.

In [None]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

import lightgbm as lgb

In [None]:
df = pd.read_parquet('../input/num-investment2/num_investment2.parquet')

In [None]:
df.head(3)

In this case, I am reading a dataset that was feature-engineered in advance.
'num_investment' is the number of investment_ids present at each time_id.

The following eight features were created for this notebook.

'shift1'~'shift5' : num_investment at the 1st through 5th previous time_id

'MA3'~'MA5' : moving average of num_investment over 3 to 5 time_ids
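The dataset loaded above already contains these columns, so the code that built them is not shown here. As a rough sketch only, features like these could be derived with pandas as follows. The function name `add_num_investment_features` is hypothetical, and the moving-average window (current time_id plus the preceding ones) is an assumption, not necessarily how the dataset was actually built.

```python
import pandas as pd

def add_num_investment_features(df: pd.DataFrame) -> pd.DataFrame:
    """Attach num_investment, shift1..shift5 and MA3..MA5 to every row (sketch)."""
    # One row per time_id: how many distinct investment_ids appear at that time.
    per_time = (
        df.groupby("time_id")["investment_id"]
        .nunique()
        .sort_index()
        .rename("num_investment")
        .to_frame()
    )
    # Lagged counts: the value from the k-th previous time_id.
    for k in range(1, 6):
        per_time[f"shift{k}"] = per_time["num_investment"].shift(k)
    # Moving averages; the window includes the current time_id (assumption).
    for w in range(3, 6):
        per_time[f"MA{w}"] = per_time["num_investment"].rolling(w).mean()
    # Broadcast the per-time_id features back onto each original row.
    return df.join(per_time, on="time_id")
```

The earliest time_ids have no history, so their shift and moving-average values are NaN. LightGBM handles missing values natively, so they can be left as they are.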

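Next, the DataFrame is split into train and validation data without shuffling, so the validation set is the last 5% of the rows.
Also, for memory reasons, only the last 1,500,000 rows are kept, which reduces the data volume to about half.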

In [None]:
from sklearn.model_selection import KFold, train_test_split

df = df.tail(1500000)


# f_0..f_299 plus the num_investment-based features
features = ([f'f_{i}' for i in range(300)]
            + ['num_investment']
            + [f'shift{i}' for i in range(1, 6)]
            + [f'MA{i}' for i in range(3, 6)])
target = 'target'
 

df_features = df[features]


X_train, X_val, Y_train, Y_val = train_test_split(df_features, df[target], train_size=0.95, shuffle=False)

# Drop the references to the large DataFrames to free memory
del df, df_features

Train LightGBM.

In [None]:
import warnings
import numpy as np
import lightgbm as lgb
from scipy.stats import pearsonr

warnings.simplefilter('ignore')

lgb_train = lgb.Dataset(X_train, Y_train)
lgb_eval = lgb.Dataset(X_val, Y_val, reference=lgb_train)

params = {'seed': 1,
          'verbose' : -1,
           'objective': "regression",
           'learning_rate': 0.05,
           'bagging_fraction': 0.1,
           'bagging_freq': 1,
           'feature_fraction': 0.1,
           'max_depth': 6,
           'min_child_samples': 50,
           'num_leaves': 64}
        
        
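# Early stopping and silent logging are passed as callbacks (LightGBM >= 3.3);
# the `early_stopping_rounds` and `verbose_eval` arguments were removed in LightGBM 4.0.
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=100,
                valid_sets=[lgb_eval],
                callbacks=[lgb.early_stopping(stopping_rounds=3, verbose=False),
                           lgb.log_evaluation(period=0)],
                )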
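The cell above imports `pearsonr` but never uses it. If the intent was to check how well the predictions correlate with the target on the validation split, a minimal sketch could look like the following. The helper name `pearson_corr` is hypothetical, and it uses NumPy rather than SciPy only to keep the example self-contained.

```python
import numpy as np

def pearson_corr(y_true, y_pred) -> float:
    """Pearson correlation coefficient between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # np.corrcoef returns the 2x2 correlation matrix; take the off-diagonal entry.
    return float(np.corrcoef(y_true, y_pred)[0, 1])

# Possible usage with the objects defined in this notebook (not run here):
# val_pred = gbm.predict(X_val, num_iteration=gbm.best_iteration)
# print(pearson_corr(Y_val, val_pred))
```

Comparing this score with and without the num_investment-based columns in `features` would give a more direct check of their effect than feature importance alone.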

Display the gain-based feature importance of the trained LightGBM model.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


# Gain-based importance, one value per feature in X_train's column order
feature = gbm.feature_importance(importance_type='gain')

# Feature names
label = X_train.columns

# Feature indices sorted by importance, highest first
indices = np.argsort(feature)[::-1]

# Print rank, feature name, and importance
for rank, idx in enumerate(indices, start=1):
    print(f"{rank}   {label[idx]}   {feature[idx]}")

As the feature importance shows, the features created from the number of investment_ids are indeed effective.
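To put a single number on this, one option is to compute the share of total gain that comes from the num_investment-based features. The sketch below is only illustrative, and the helper name `engineered_gain_share` is hypothetical.

```python
import pandas as pd

def engineered_gain_share(names, gains, engineered) -> float:
    """Fraction of total gain importance attributed to the `engineered` features."""
    imp = pd.Series(list(gains), index=list(names), dtype=float)
    total = imp.sum()
    if total == 0:
        # No splits were made, so no feature has any gain.
        return 0.0
    return float(imp[imp.index.isin(list(engineered))].sum() / total)

# Possible usage with the trained model from this notebook (not run here):
# engineered = ['num_investment'] + [f'shift{i}' for i in range(1, 6)] \
#              + [f'MA{i}' for i in range(3, 6)]
# share = engineered_gain_share(gbm.feature_name(),
#                               gbm.feature_importance(importance_type='gain'),
#                               engineered)
```

Since these nine columns sit alongside 300 `f_*` features, a share well above 9/309 would suggest they carry more than their proportional weight.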