This is a quick stab at understanding the dataset and might be useful for folks who are starting out with this competition, are new to time-series (like me) or want a quick look at the fundamentals of the data.

## Imports

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Load Dataset

Loading the dataset from parquet is faster and uses less memory than loading the CSV. Thanks to Rob for this; check out his kernel here: https://www.kaggle.com/robikscube/fast-data-loading-and-low-mem-with-parquet-files

In [None]:
df = pd.read_parquet('../input/ubiquant-parquet/train.parquet')
df

> Loading the `train.csv` file directly is slow, and the kernel usually crashes in the process.
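As a rough illustration of why column dtypes matter for the memory footprint, here is a minimal sketch on a toy frame. The toy data and the choice to downcast to `float32` are assumptions made for illustration; they are not taken from the linked kernel.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few feature columns (hypothetical data, not the real dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(1000, 5)), columns=[f"f_{i}" for i in range(5)])

# Memory used by the default float64 columns, in bytes
mem64 = toy.memory_usage(index=False).sum()

# Downcasting to float32 halves the storage for these columns
toy32 = toy.astype(np.float32)
mem32 = toy32.memory_usage(index=False).sum()

print(f"float64: {mem64} bytes, float32: {mem32} bytes")
```

Whether the precision loss from `float32` is acceptable depends on the model, so treat this as something to check rather than a default.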

# EDA

In [None]:
print('Number of rows in the training set: ', len(df))

### `time_id`

In [None]:
df.time_id.unique()

In [None]:
len(df.time_id.unique())

> `time_id`: The ID code for the time the data was gathered. The time IDs are in order, but the real time between the time IDs is not constant and will likely be shorter for the final private test set than in the training set. 

> Yes, the IDs are in order from 0 to 1219, with 8 missing (?) `time_id`s.

> For example, one `time_id` might correspond to 1st Jan 2:00 IST, the next to 4th Jan 12:00 IST, the one after that to 5th Jan 16:00 IST, and so on.

In [None]:
df.time_id.value_counts().sort_index()

> Clearly the number of data points (rows) in each `time_id` is not constant. 

In [None]:
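# Build the set of observed time_ids once instead of recomputing it on every iteration
present_time_ids = set(df.time_id.unique())
# time_ids run from 0 to 1219
missing_time_ids = [t for t in range(1220) if t not in present_time_ids]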
        
print('Missing time_ids: ', missing_time_ids)

> The `time_id`s listed above are not present. I don't think this should be an issue, since the gap between consecutive `time_id`s isn't constant anyway.

### `row_id`

In [None]:
len(df.row_id.unique()) == len(df)

> `row_id`: A unique identifier for each row, in the format `x_y`, where `x` is the `time_id` and `y` is the `investment_id`.
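The `x_y` format can be checked by rebuilding `row_id` from the other two columns. The sketch below uses a small hand-made frame (hypothetical values); the same comparison should work on `df`.

```python
import pandas as pd

# Toy frame mimicking the three ID columns (values are made up)
toy = pd.DataFrame({
    "row_id": ["0_1", "0_2", "1_1", "3_7"],
    "time_id": [0, 0, 1, 3],
    "investment_id": [1, 2, 1, 7],
})

# Rebuild row_id as "<time_id>_<investment_id>" and compare with the stored column
rebuilt = toy["time_id"].astype(str) + "_" + toy["investment_id"].astype(str)
matches = bool((rebuilt == toy["row_id"]).all())

print("row_id matches time_id_investment_id:", matches)
```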

### `investment_id`

In [None]:
unique_investments = sorted(df.investment_id.unique())
print('Number of investment ids: ', len(unique_investments))

In [None]:
df.investment_id.value_counts().sort_index()

> There are 3579 unique investments, while the largest `investment_id` is 3773, so some `investment_id`s must be missing.

> I don't think this is an issue either.

In [None]:
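# Build the set of observed investment_ids once instead of recomputing it on every iteration
present_investment_ids = set(df.investment_id.unique())
# investment_ids run from 0 to 3773
missing_investment_ids = [iid for iid in range(3774) if iid not in present_investment_ids]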
        
print('Missing investment_ids: ', missing_investment_ids)

> The `investment_id`s listed above are not present.

In [None]:
df.groupby('time_id')['investment_id'].unique()

> We can see that not all investments have data in all `time_id`s.
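One way to quantify this is to count, for each `investment_id`, how many distinct `time_id`s it appears in. This is a minimal sketch on a toy frame with made-up values; the variable names `coverage` and `incomplete` are only illustrative.

```python
import pandas as pd

# Toy frame: investment 1 appears in every time_id, 2 and 3 do not (made-up values)
toy = pd.DataFrame({
    "time_id": [0, 0, 0, 1, 1, 2],
    "investment_id": [1, 2, 3, 1, 3, 1],
})

n_time_ids = toy["time_id"].nunique()

# Number of distinct time_ids in which each investment_id has a row
coverage = toy.groupby("investment_id")["time_id"].nunique()

# Investments that are absent from at least one time_id
incomplete = coverage[coverage < n_time_ids]

print(incomplete)
```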

## Let's look at all the `investment_id`s in a single `time_id`. 

Note that a few `investment_id`s may be missing in a given `time_id`.

In [None]:
sample_time_id = 0
assert sample_time_id not in missing_time_ids

sample_df = df[df.time_id == sample_time_id]
sample_df

In [None]:
sample_df.investment_id.value_counts()

> Each `investment_id` appears at most once within a given `time_id`.
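The observation above is for a single `time_id`. A sketch of how it could be checked across all of them is to look for duplicated `(time_id, investment_id)` pairs, shown here on a toy frame with made-up values.

```python
import pandas as pd

# Toy frame with one row per (time_id, investment_id) pair (made-up values)
toy = pd.DataFrame({
    "time_id": [0, 0, 1, 1],
    "investment_id": [1, 2, 1, 2],
})

# True if any (time_id, investment_id) pair occurs more than once
has_dupes = bool(toy.duplicated(subset=["time_id", "investment_id"]).any())

# Largest number of rows sharing the same pair; 1 means every pair is unique
max_per_pair = int(toy.groupby(["time_id", "investment_id"]).size().max())

print("duplicates:", has_dupes, "| max rows per pair:", max_per_pair)
```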

## Let's look at a single `investment_id` across `time_id`s. 

In [None]:
sample_investment_id = 30
assert sample_investment_id not in missing_investment_ids

sample_df = df[df.investment_id == sample_investment_id]
sample_df

In [None]:
plt.figure(figsize=(12,6));
sample_df.set_index('time_id').target.plot();

> Clearly there is a time series trend when we look at an `investment_id` across time. 

> There are missing `time_id`s, which need to be handled.

> Clearly the `target` values are not scaled but we will be using LightGBM so scaling the data is not crucial. 
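For the missing `time_id`s noted above, one possible way to make the gaps explicit is to reindex a single investment's series onto the full `time_id` range. The forward fill at the end is only one option and is an assumption here, not something this notebook settles on; the toy values are made up.

```python
import pandas as pd

# Toy target series for one investment, with time_ids 2 and 3 missing (made-up values)
series = pd.Series(
    [0.1, -0.2, 0.3],
    index=pd.Index([0, 1, 4], name="time_id"),
    name="target",
)

# Reindex onto every time_id between the first and last observation; gaps become NaN
full_index = pd.RangeIndex(series.index.min(), series.index.max() + 1, name="time_id")
with_gaps = series.reindex(full_index)

# One possible treatment: carry the last observed value forward
filled = with_gaps.ffill()

print(with_gaps)
print(filled)
```

Since the real time between `time_id`s is not constant, filling may not be meaningful; leaving the gaps as they are is also a reasonable choice.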

# Using Time Series API

In [None]:
import ubiquant
env = ubiquant.make_env()   # initialize the environment
iter_test = env.iter_test()    # an iterator which loops over the test set and sample submission
for (test_df, sample_prediction_df) in iter_test:
    print(test_df)
    sample_prediction_df['target'] = 0  # make your predictions here
    env.predict(sample_prediction_df)   # register your predictions

> We get dataframes of shape `(n rows x 302 columns)`, one row per `row_id`. All rows in a batch belong to the same `time_id`, so each iteration gives us data for a different set of `investment_id`s at one `time_id`. We therefore need to predict the target for every `investment_id` in the given `time_id`.

# Conclusion

> `f_0` to `f_299` are the model features, given for each `investment_id` at each `time_id`.

> `investment_id` can be a feature, a feature with extra weight, handled separately by individual models (but then there will be a lot of models), or part of the feature vector for the same model.
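As a sketch of the simplest of these options, `investment_id` can be stored as a pandas categorical column so that it is treated as a label rather than an ordered number. The toy frame below is made up, and whether this helps the model is an open question.

```python
import pandas as pd

# Toy frame with an ID column and one feature (made-up values)
toy = pd.DataFrame({
    "investment_id": [30, 7, 30, 1200],
    "f_0": [0.1, 0.2, 0.3, 0.4],
})

# Convert the ID to a categorical dtype; categories are the sorted unique IDs
toy["investment_id"] = toy["investment_id"].astype("category")

# Integer codes assigned to each category (position in the sorted category list)
codes = toy["investment_id"].cat.codes

print(toy.dtypes)
print(codes.tolist())
```

IDs that never appear in training have no category, so the test-time handling of unseen `investment_id`s would need separate thought.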