While doing EDA, I noticed that the number of investment_ids differs from one time_id to the next.

How is the set of investment_ids at each time_id determined?

I can only speculate, but I think the set of investment_ids is determined by fixed rules that mirror actual operations (i.e., the host does not deliberately narrow or widen the set of stocks provided).

There are many possible reasons why the number of investment_ids varies with time_id: trading is suspended or resumed, a stock has an IPO, a stock's liquidity increases enough for it to be added to the investment universe, and so on.

In this article, I will focus on these differences in the number of investment_ids per time_id.

Specifically, I will analyze the newly added investment_ids to see if they have any characteristics.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_parquet('../input/ubiquant-parquet/train_low_mem.parquet')

In [None]:
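def reduce_mem_usage(df):
    """Downcast each numeric column to the smallest dtype that holds its range.

    Note: float16 keeps only about 3 significant decimal digits, so values are
    rounded and aggregates computed on downcast columns are approximate.
    """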
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
     
    return df

df = reduce_mem_usage(df)

Flag each newly appearing investment_id: df['new'] is set to 1 on the row where an investment_id is first seen after the initial rows (row index greater than 2272), and to 0 otherwise.

In [None]:
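# Rows with index <= 2272 are treated as the initial universe and are never flagged
# (threshold kept from the original loop; it relies on the default 0..n-1 row index).
INITIAL_ROWS_END = 2272

# duplicated() marks every occurrence after the first, so its negation marks
# the first row on which each investment_id appears. This avoids the slow
# row-by-row loop with a list membership test.
first_seen = ~df['investment_id'].duplicated()

df['new'] = (first_seen & (df.index > INITIAL_ROWS_END)).astype('int8')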
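
The row-index threshold above depends on the row order of the file. As an alternative, here is a minimal sketch that flags first appearances by time_id instead, assuming that "new" should mean "first seen after the earliest time_id" and that rows are sorted by time_id. The toy frame and the helper name `flag_new_ids` are hypothetical and only for illustration.

```python
import pandas as pd

def flag_new_ids(frame: pd.DataFrame) -> pd.Series:
    """Return 1 on the first row of each investment_id that debuts after the earliest time_id."""
    # Earliest time_id at which each investment_id is observed, broadcast back to the rows.
    first_time = frame.groupby('investment_id')['time_id'].transform('min')
    # First row per investment_id (assumes rows are sorted by time_id).
    is_first_row = ~frame['investment_id'].duplicated()
    # IDs present at the earliest time_id form the initial universe and are not flagged.
    debuts_later = first_time > frame['time_id'].min()
    return (is_first_row & debuts_later).astype('int8')

# Hypothetical toy data: ids 1 and 2 exist from the start, 3 and 4 appear later.
toy = pd.DataFrame({
    'time_id':       [0, 0, 1, 1, 1, 2, 2],
    'investment_id': [1, 2, 1, 2, 3, 3, 4],
})
toy['new'] = flag_new_ids(toy)
print(toy)
```

On the toy frame, only the first rows of investment_id 3 and 4 are flagged. Whether this matches the index-based rule on the real data depends on the first time_id covering exactly the initial rows, which is not verified here.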

Create a DataFrame containing only the newly appearing investment_ids and compute the mean of target.

In [None]:
df_new = df[df['new'] == 1]
df_new['target'].mean()

This is the mean target of the newly appearing investment_ids. The mean target over all investment_ids is -0.021, so the two differ noticeably.
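
To judge whether a gap in means like this is more than noise, one option is to express it in units of its standard error. The sketch below does this on synthetic data. The sample sizes, the distributions, and the helper name `mean_gap_in_se` are assumptions for illustration, not results from the competition data, and the calculation treats rows as independent, which is a simplification.

```python
import numpy as np

def mean_gap_in_se(a, b) -> float:
    """Difference of means divided by its standard error (Welch's t statistic)."""
    # Cast to float64 so low-precision inputs (e.g. float16) do not distort the sums.
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return float((a.mean() - b.mean()) / se)

rng = np.random.default_rng(0)
other_targets = rng.normal(loc=0.0, scale=1.0, size=100_000)  # synthetic stand-in
new_targets = rng.normal(loc=0.3, scale=1.0, size=1_000)      # synthetic stand-in

t_stat = mean_gap_in_se(new_targets, other_targets)
print(f"gap in standard errors: {t_stat:.2f}")
```

With the real data, the two inputs would be something like `df_new['target']` and `df.loc[df['new'] == 0, 'target']`; using the remaining rows rather than all of `df` keeps the two samples disjoint.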

Next, I plot histograms of target for the newly appearing investment_ids and for all investment_ids.

In [None]:
df_new['target'].hist(bins = 100, figsize = (20,10))

In [None]:
df['target'].hist(bins = 100, figsize = (20,10))

Comparing the two histograms, we can see that they are very different.
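
Because the two groups differ greatly in size, raw-count histograms are hard to compare by eye. Here is a minimal sketch of one way to put them on the same scale, using shared bin edges and density normalization. The synthetic arrays stand in for `df_new['target']` and `df['target']`; they are assumptions, not the real data.

```python
import numpy as np

rng = np.random.default_rng(0)
all_targets = rng.normal(0.0, 1.0, size=50_000)  # synthetic stand-in
new_targets = rng.normal(0.3, 1.2, size=500)     # synthetic stand-in

# Shared bin edges so both histograms are evaluated on the same grid.
edges = np.histogram_bin_edges(np.concatenate([all_targets, new_targets]), bins=100)

# density=True rescales the counts so that each histogram integrates to 1.
dens_all, _ = np.histogram(all_targets, bins=edges, density=True)
dens_new, _ = np.histogram(new_targets, bins=edges, density=True)

# Total variation distance between the binned distributions:
# 0 means identical histograms, 1 means no overlap at all.
widths = np.diff(edges)
tv_distance = 0.5 * np.sum(np.abs(dens_all - dens_new) * widths)
print(f"total variation distance: {tv_distance:.3f}")
```

For a visual comparison, passing the same `bins=edges` and `density=True` (plus `alpha=0.5`) to two `plt.hist` calls overlays the groups on one axis. Note that a small sample spread over 100 bins gives a nonzero distance from sampling noise alone, so the number is only a rough guide.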

Next, I count the new investment_ids at each time_id and plot when they appear.

In [None]:
# Number of newly appearing investment_ids per time_id
num = df_new.groupby('time_id')['investment_id'].count()

plt.plot(num.index, num.values)
plt.xlabel('time_id')
plt.ylabel('number of new investment_ids')
plt.show()

For comparison, the total number of investment_ids at each time_id is plotted below.

In [None]:
# Number of all investment_ids per time_id
num = df.groupby('time_id')['investment_id'].count()

plt.plot(num.index, num.values)
plt.xlabel('time_id')
plt.ylabel('number of investment_ids')
plt.show()

The figures do not reveal any clear pattern, although new appearances are concentrated at certain time_ids.

That is all.
I hope this helps in some way in improving your score.