- This notebook contains an initial EDA of the train dataset for the [Ubiquant Market Prediction competition](https://www.kaggle.com/c/ubiquant-market-prediction). It builds on https://www.kaggle.com/ilialar/ubiquant-eda-and-baseline and https://www.kaggle.com/gpreda/santander-eda-and-prediction
- The main objective is to find features of interest and to group features by similarity

#### Load packages 

In [None]:
import gc
import os
import logging
import datetime
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook

warnings.filterwarnings('ignore')
# Use the full option names: the short forms can match several options in recent pandas versions
pd.set_option('display.max_columns', 310)
pd.set_option('display.max_rows', 200)

#### Load data
- Let's see which files Ubiquant provides

In [None]:
PATH="../input/ubiquant-market-prediction/"
os.listdir(PATH)

- `example_sample_submission.csv` sample submission file (Not for EDA)
- `ubiquant` API wheel for loading test data (Not for EDA)
- `example_test.csv` sample test file (Not for EDA)
- `train.csv` This is our little treasure - TRAINING DATA

Let's load the train file.

In [None]:
data_types_dict = {
    'time_id': 'int32',
    'investment_id': 'int16',
    "target": 'float16'}

features = [f'f_{i}' for i in range(300)]

for f in features:
    data_types_dict[f] = 'float16'
    
target = 'target'

In [None]:
%%time
# index_col=0 refers to the first selected column (time_id), which becomes the index
train_df = pd.read_csv(os.path.join(PATH, 'train.csv'),
                       usecols=list(data_types_dict.keys()),
                       dtype=data_types_dict,
                       index_col=0)

### Data exploration
Let's check the shape of the train set.

In [None]:
train_df.shape

Let's have a glimpse of the train dataset.

In [None]:
train_df.head()

Our dataset contains 300 anonymous features without any description, an `investment_id`, and a `target` that is also an anonymous float value.

**Train contains:**
- time_id (int)
- investment_id (int)
- target (float)
- 300 numerical features (float), from f_0 to f_299
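
The `float16` dtypes chosen when loading trade precision for memory. A minimal sketch of that trade-off, using a small synthetic frame (the sizes and the name `demo` are assumptions for illustration, not the competition data):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the feature block (hypothetical sizes, not the real data)
rng = np.random.default_rng(0)
n_rows, n_feats = 10_000, 300
demo = pd.DataFrame(rng.standard_normal((n_rows, n_feats)),
                    columns=[f'f_{i}' for i in range(n_feats)])

# Memory footprint in bytes for the default float64 and for float16
mem64 = demo.memory_usage(deep=True).sum()
mem16 = demo.astype('float16').memory_usage(deep=True).sum()
print(f'float64: {mem64 / 1e6:.1f} MB, float16: {mem16 / 1e6:.1f} MB')

# Price of downcasting: the largest absolute rounding error it introduces
roundtrip = demo.astype('float16').astype('float64')
max_err = (demo - roundtrip).abs().to_numpy().max()
print(f'max rounding error: {max_err:.5f}')
```

The memory saving is roughly fourfold, but `float16` cannot represent values above about 65504, which matters for any statistic that sums over millions of rows.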

Let's check if we have any missing data.

In [None]:
def missing_data(data):
    """Return the count, percentage, and dtype of missing values per column."""
    total = data.isnull().sum()
    percent = total / len(data) * 100
    tt = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    tt['Types'] = data.dtypes.astype(str)
    # Transpose so that each column of `data` stays a column in the summary
    return tt.T

missing_data(train_df)

There is no missing data in the train set. Let's check the distribution of the numerical values.

In [None]:
train_df.describe()

In [None]:
# Collect the features whose standard deviation is reported as non-zero
feats = [f'f_{i}' for i in range(300) if train_df[f'f_{i}'].std() != 0]

print('Features with Non Zero Standard Deviation: {}'.format(feats))

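It's interesting to see here that:
- The majority of features are reported with a standard deviation of 0, which at first glance suggests they were de-identified by some transformation.
- Only `f_124` is reported with a non-zero standard deviation (std = 0.04837).
- A zero standard deviation is inconsistent with the density plots later in this notebook, where the features clearly vary. The zeros are therefore likely a numerical artifact of computing statistics in `float16` over millions of rows rather than a property of the data.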
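
To judge how far `float16` statistics can be trusted, here is a minimal sketch on synthetic data (the array `x16` and its size are assumptions for illustration). It shows that neither a row count in the millions nor a sum of squares over that many rows fits in `float16`, while the same data upcast to `float64` gives the expected standard deviation:

```python
import numpy as np

# Synthetic standard-normal column stored as float16 (not the competition data)
rng = np.random.default_rng(0)
x16 = rng.standard_normal(1_000_000).astype(np.float16)

# Largest finite value float16 can hold
f16_max = float(np.finfo(np.float16).max)
print('float16 max:', f16_max)

with np.errstate(over='ignore'):
    # A row count of one million overflows when cast to float16
    n16 = np.float16(x16.size)
    # So does a sum of squares that is accumulated and returned as float16
    sumsq16 = np.square(x16).sum(dtype=np.float16)
print('count as float16:', n16, '| sum of squares as float16:', sumsq16)

# Upcasting before reducing avoids the overflow
std64 = x16.astype(np.float64).std()
print('std computed in float64:', round(float(std64), 3))
```

Applied to this notebook, the same check could be done one column at a time to limit memory use, for example `train_df['f_0'].astype('float64').std()`.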

Let's check the **target** distribution in train dataset.

In [None]:
train_df['target'].hist(bins = 100, figsize = (10,6))

Target values look approximately normally distributed, with no obvious outliers or long tails. Let's also plot the target distributions for a few randomly chosen investments.

In [None]:
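# Overlay the target histograms of 10 distinct, randomly chosen investments
for inv_id in np.random.choice(train_df['investment_id'].unique(), 10, replace=False):
    train_df.loc[train_df['investment_id'] == inv_id, 'target'].hist(bins=100, alpha=0.2, figsize=(10, 6))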

At a high level, the target distribution for each sampled `investment_id` also looks reasonable.
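
The visual check above covers only ten investments. A minimal sketch of a numeric complement is to summarise the target per `investment_id` with a group-by. The frame `demo` below is synthetic and its sizes are assumptions; on the real data the same chain would start from `train_df`:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 50 hypothetical investments with standard-normal targets
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'investment_id': rng.integers(0, 50, size=5_000),
    'target': rng.standard_normal(5_000),
})

# Upcast first, since the notebook stores target as float16
per_inv = (demo.assign(target=demo['target'].astype('float64'))
               .groupby('investment_id')['target']
               .agg(['count', 'mean', 'std', 'skew']))

# Spread of the per-investment statistics across all investments
print(per_inv.describe())
```

Investments whose mean, standard deviation, or skewness sits far from the rest would be candidates for a closer look.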

### Feature plots with Target
Let's see the density plots of the features in the train dataset, split by the sign of the target.

In [None]:
def plot_feature_distribution(df1, df2, label1, label2, features):
    """Overlay the density curves of df1 and df2 for up to 100 features on a 10x10 grid."""
    sns.set_style('whitegrid')
    fig, axes = plt.subplots(10, 10, figsize=(20, 24))

    for ax, feature in zip(axes.flatten(), features):
        # sns.distplot is deprecated; kdeplot draws the same density curve.
        # Upcast from float16 so the density estimate runs in a supported precision.
        sns.kdeplot(df1[feature].astype('float32'), label=label1, ax=ax)
        sns.kdeplot(df2[feature].astype('float32'), label=label2, ax=ax)
        ax.set_xlabel(feature, fontsize=9)
        ax.tick_params(axis='x', which='major', labelsize=6, pad=-6)
        ax.tick_params(axis='y', which='major', labelsize=6)
    plt.show()

In [None]:
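# t0: rows with a positive target, t1: rows with a negative target
t0 = train_df.loc[train_df['target'] > 0]
t1 = train_df.loc[train_df['target'] < 0]
# Columns 0 and 1 are investment_id and target, so this slice selects f_0 to f_99
features = train_df.columns.values[2:102]
plot_feature_distribution(t0, t1, '>0', '<0', features)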

In [None]:
features = train_df.columns.values[102:202]
plot_feature_distribution(t0, t1, '>0', '<0', features)

In [None]:
features = train_df.columns.values[202:]
plot_feature_distribution(t0, t1, '>0', '<0', features)

#### Some observations based on the density plots of the first 100 features:
- Features with significant spikes: `['f_23','f_39','f_42','f_45','f_55','f_64','f_68','f_71','f_77','f_79','f_80','f_92']`
- Features with High Skewness: `['f_63','f_87','f_59','f_3','f_67','f_34','f_7','f_57','f_98','f_74','f_56','f_26','f_33','f_30','f_8','f_58']`

We can use this information to group features and to engineer new ones for our prediction model.
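
The lists above come from reading the plots by eye. As a hedged sketch of how such a grouping could be made reproducible, the example below ranks features by sample skewness and pairs up features by absolute correlation. The frame `demo`, its column names, and both thresholds (1 for skewness, 0.9 for correlation) are assumptions chosen for illustration, not values derived from the competition data:

```python
import numpy as np
import pandas as pd

# Synthetic features with known shapes (hypothetical names, not the real data)
rng = np.random.default_rng(0)
base = rng.standard_normal(2_000)
demo = pd.DataFrame({
    'f_a': base,                                     # symmetric
    'f_b': base + 0.1 * rng.standard_normal(2_000),  # near copy of f_a
    'f_c': np.exp(rng.standard_normal(2_000)),       # right-skewed (lognormal)
    'f_d': rng.standard_normal(2_000),               # independent of the others
})

# Rank features by absolute skewness and keep those above the assumed threshold
skewness = demo.skew().sort_values(key=np.abs, ascending=False)
high_skew = skewness[skewness.abs() > 1].index.tolist()

# Pair up features whose absolute correlation exceeds the assumed threshold,
# looking only at the upper triangle so each pair appears once
corr = demo.corr().abs()
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()
similar_pairs = pairs[pairs > 0.9].index.tolist()

print('High skewness:', high_skew)
print('Highly correlated pairs:', similar_pairs)
```

On the real frame, the features would need to be upcast from `float16` first, and the thresholds tuned after inspecting the resulting distributions.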

### More to be Added later...