# Ubiquant Market Prediction EDA  
Please note that English is not my first language, so the writing may be rough.  
This notebook reads only 500,000 rows, but I loaded the full data set in my local environment, so my comments are based on the local results.  
- Competition URL: https://www.kaggle.com/c/ubiquant-market-prediction/  

## Competition Overview  
- input data: sequential data  
- output data: scalar (an investment's return rate)  

### Evaluation function  
**Pearson correlation coefficient**  
The score is the mean of the Pearson correlation coefficient computed for each time ID.

$$ \rho = \frac{\sum^{n}_{i=1}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum^{n}_{i=1}(x_i - \bar{x})^2}\sqrt{\sum^{n}_{i=1}(y_i - \bar{y})^2}} $$  

I assume X and Y are the target data and the prediction data for each time ID, and 'i' is the investment ID (the roles of investment ID and time ID could also be the opposite).  

For example:
- $X_1$: Target data with time id 1.  
- $Y_1$: Prediction data with time id 1.  
- $x_1$: A value with investment id 1 in target data with time id 1.  
- $y_1$: A value with investment id 1 in prediction data with time id 1.  
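
The reading of the metric described above can be sketched in a few lines. This is only a minimal sketch under that reading, not the official scoring code: the helper name `mean_pearson_by_time_id`, the `pred` column and the toy data frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def mean_pearson_by_time_id(df, target_col="target", pred_col="pred"):
    """Average the per-time_id Pearson correlation between target and prediction."""
    scores = [
        np.corrcoef(group[target_col], group[pred_col])[0, 1]
        for _, group in df.groupby("time_id")
    ]
    return float(np.mean(scores))


# Toy data (hypothetical): two time IDs with three investments each.
toy = pd.DataFrame({
    "time_id": [0, 0, 0, 1, 1, 1],
    "investment_id": [0, 1, 2, 0, 1, 2],
    "target": [0.1, 0.2, 0.3, 0.3, 0.1, 0.2],
    "pred": [1.0, 2.0, 3.0, 1.0, 3.0, 2.0],
})

# time_id 0 is ranked perfectly, time_id 1 is ranked in reverse,
# so the two per-time correlations cancel out in the mean.
score = mean_pearson_by_time_id(toy)
print(f"mean Pearson over time IDs: {score:.3f}")
```

Averaging over time IDs means that every time ID has the same weight, regardless of how many investments it contains.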

## Import packages  

In [None]:
import os
import gc
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import ubiquant  # Competition-specific library
sns.set()
warnings.filterwarnings("ignore")
%matplotlib inline

## Load Dataset  

In [None]:
DIR = "../input/ubiquant-market-prediction"
train_df = pd.read_csv(os.path.join(DIR, "train.csv"), nrows=500000)  # read 500,000 rows (the full data set has about 3,000,000)

print(f"train data shape: {train_df.shape}")
display(train_df.head())

In [None]:
train_df['time_id'].describe()

In [None]:
train_df['investment_id'].describe()

In [None]:
# Count nulls column by column to keep memory usage low
total_null = 0
for column in train_df.columns:
    total_null += train_df[column].isnull().sum()
    gc.collect()

print(f"Number of null: {total_null}")

### Information of train data set  
- **No null values**  
- Number of train samples: 3,141,410  
- Number of original features: 300 (no names)  
- Number of target: 1  
- time_id: 0 to 1219  
- investment_id: 0 to 3773  
- row_id: {time_id}\_{investment_id}  

## Data types  
- target: float  
- features: float(all features)  

In short, this data set doesn't include any categorical features.  
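
Because every feature is a float, one possible way to fit more rows in memory is to downcast the columns from float64 to float32. The following is a minimal sketch on a synthetic frame; the `toy` data and the choice of float32 are assumptions for illustration, and the lower precision may or may not be acceptable for a given model.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the feature columns (hypothetical data).
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(1000, 5)), columns=[f"f_{i}" for i in range(5)])

# Downcast every float64 column to float32.
float_cols = toy.select_dtypes(include="float64").columns
small = toy.astype({col: "float32" for col in float_cols})

before = toy.memory_usage(index=False).sum()
after = small.memory_usage(index=False).sum()
print(f"memory: {before} bytes -> {after} bytes")
```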

In [None]:
# Collect the distinct dtypes of the target and feature columns
set(train_df.iloc[:, 3:].dtypes.values.tolist())

## Target data distribution  
- The target distribution looks close to Gaussian, so the target can probably be handled without a transform.  
- However, there is a concentrated part in the distribution.  

### Plot with seaborn  

In [None]:
f = plt.figure(figsize=(16, 10))
sns.histplot(train_df['target'].values, kde=True)

plt.show()

### Target summary statistics  

In [None]:
train_df['target'].describe()

## Features distribution  
- Plot only a few feature distributions, because each feature has about 3,000,000 samples.  
- The feature distributions differ from each other, so I should transform the features when training a model.  

In [None]:
f = plt.figure(figsize=(16, 10))
sns.histplot(train_df['f_0'].values, kde=True)

plt.show()

In [None]:
f = plt.figure(figsize=(16, 10))
sns.histplot(train_df['f_58'].values, kde=True)

plt.show()

In [None]:
f = plt.figure(figsize=(16, 10))
sns.histplot(train_df['f_249'].values, kde=True)

plt.show()

### Plot sequential data  
- Sample some investment IDs and plot the target in chronological order.  
- I don't know how long the time interval is, but each graph seems to show periodicity.  
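
One rough way to check an impression of periodicity is to look at the autocorrelation of a series at several lags. The sketch below uses a synthetic sine wave, not competition data; the helper `autocorr` and the period of 20 are assumptions for illustration. For a real investment ID the time IDs may have gaps, so a lag in rows is not necessarily a lag in time.

```python
import numpy as np


def autocorr(series, lag):
    """Pearson correlation between a series and itself shifted by `lag` steps."""
    return float(np.corrcoef(series[:-lag], series[lag:])[0, 1])


# Synthetic series with a known period of 20 steps (hypothetical data).
t = np.arange(200)
toy_target = np.sin(2 * np.pi * t / 20)

at_period = autocorr(toy_target, 20)  # one full period: the series lines up with itself
at_half = autocorr(toy_target, 10)    # half a period: the series is mirrored
print(f"lag 20: {at_period:.3f}, lag 10: {at_half:.3f}")
```

A clear peak in the autocorrelation at some lag would support the periodicity impression, while values near zero at all lags would argue against it.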

In [None]:
ids = [3, 28, 169]
for investment_id in ids:
    tmp_df = train_df.query("investment_id == @investment_id").sort_values("time_id")

    plt.figure(figsize=(16, 8))
    plt.plot(tmp_df['time_id'].values, tmp_df['target'].values)
    plt.title(f"target data(investment_id={investment_id})")
    plt.show()

## Analysis of example test data  

In [None]:
test_df = pd.read_csv(os.path.join(DIR, "example_test.csv"))

print(f"example test data shape: {test_df.shape}")
display(test_df.head())

In [None]:
test_df['time_id'].describe()

In [None]:
test_df['investment_id'].describe()

### Information of example test data set  
- **No null values** (maybe in all test data)  
- time_id: 1220-  
- investment_id: probably the same as in the train data  

I couldn't observe the full test data.  

In [None]:
total_null = 0
for column in test_df.columns:
    total_null += test_df[column].isnull().sum()
    gc.collect()

print(f"Number of null: {total_null}")

## Calculate correlation coefficient  
- The train data is very large, so I narrow it down before calculating the correlation coefficients.  
- No feature has a large correlation with the target.  
- There seem to be large correlations between some of the features (for example, the correlation between f_4 and f_228 is 0.929).  
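
A correlation matrix can also be reduced to a ranked list of feature pairs, which is easier to read than a 300 x 300 heatmap. The following is a minimal sketch on synthetic data; the column names `f_a`, `f_b`, `f_c` and the 0.9 threshold are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic features (hypothetical): f_b is nearly a copy of f_a, f_c is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=1000)
toy = pd.DataFrame({
    "f_a": base,
    "f_b": base + rng.normal(scale=0.1, size=1000),
    "f_c": rng.normal(size=1000),
})

corr = toy.corr()

# Keep only the strict upper triangle so each pair appears once
# and the diagonal (always 1.0) is excluded.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

top_pairs = pairs[pairs > 0.9]
print(top_pairs)
```

Masking the upper triangle avoids having to zero the diagonal by hand and does not count each pair twice.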

In [None]:
# This takes a long time to run
train_df = train_df.iloc[:, 3:]  # except id
corr = train_df.corr()

f = plt.figure(figsize=(50, 50))
sns.heatmap(corr, square=True, cmap=sns.color_palette("coolwarm", 200))

# plt.savefig("correlation_coefficient.png")
plt.show()

In [None]:
for column in corr.columns:
    corr.loc[column, column] = 0  # set the diagonal values to 0
corr.max().describe()

In [None]:
corr.max()[corr.max() > 0.9]

## Thoughts on this evaluation function  
Most competitions with a regression task use MAE, MSE or similar as the evaluation function. What can we learn from the fact that this one uses the Pearson correlation coefficient instead?  
First, the Pearson correlation coefficient is defined as follows.  

$$ \rho = \frac{\sum^{n}_{i=1}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum^{n}_{i=1}(x_i - \bar{x})^2}\sqrt{\sum^{n}_{i=1}(y_i - \bar{y})^2}} $$  

As I mentioned at the beginning, I assume X and Y are the target data and the prediction data for each time ID, and 'i' is the investment ID.  

In [None]:
# Straightforward implementation of the Pearson correlation coefficient
def eval_func(x: np.ndarray, y: np.ndarray):
    x_mean = x.mean()
    y_mean = y.mean()

    cov = ((x - x_mean) * (y - y_mean)).sum()
    x_std = np.sqrt(((x - x_mean) ** 2).sum())
    y_std = np.sqrt(((y - y_mean) ** 2).sum())

    return cov / (x_std * y_std)

In [None]:
X = np.arange(10)  # Temporary target variable
Y = X.copy()  # Suppose prediction and target are equal

print(f"Pearson correlation coefficient(X = Y): {eval_func(X, Y)}")

In this way, the Pearson correlation coefficient is 1.0 when the prediction is perfect.  
Next, increase the values of the prediction data (Y) by 1, one element at a time.  

In [None]:
for i in range(Y.shape[0]):
    Y[i] += 1
    print(f"X std: {X.std()}, Y std: {Y.std()}")
    print(f"Pearson correlation coefficient(change value {i+1}): {eval_func(X, Y)}\n")

Next, increase the values of the prediction data (Y) by 2, one element at a time.  

In [None]:
for i in range(Y.shape[0]):
    Y[i] += 2
    print(f"X std: {X.std()}, Y std: {Y.std()}")
    print(f"Pearson correlation coefficient(change value {i+1}): {eval_func(X, Y)}\n")

The Pearson correlation coefficient is 1 even when the prediction and the target are not equal, as long as the prediction differs from the target by the same amount at every point. The competition description says that we will build a model that forecasts an investment's return rate, but I think we should focus on the relative differences rather than the absolute level, since the Pearson correlation coefficient is still 1 when the means differ, provided the offset is the same at every point.  
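
The same point can be checked on noisy predictions with `np.corrcoef`. Beyond the constant shift discussed above, the Pearson correlation coefficient is also unchanged by multiplying the prediction by a positive constant, and it flips sign when the prediction is negated. This is a minimal sketch on synthetic data; the arrays `target` and `pred` are assumptions for illustration.

```python
import numpy as np

# Synthetic target and an imperfect prediction (hypothetical data).
rng = np.random.default_rng(0)
target = rng.normal(size=100)
pred = target + rng.normal(scale=0.5, size=100)

base = np.corrcoef(target, pred)[0, 1]
shifted = np.corrcoef(target, pred + 10.0)[0, 1]  # constant offset
scaled = np.corrcoef(target, 3.0 * pred)[0, 1]    # positive rescaling
flipped = np.corrcoef(target, -pred)[0, 1]        # sign flip

print(f"base={base:.4f}, shifted={shifted:.4f}, scaled={scaled:.4f}, flipped={flipped:.4f}")
```

So under this metric only the relative ordering and spacing of the predictions within a time ID matter, not their mean or their scale.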

Please treat this as a reference only, and share your opinions and knowledge, because I'm just a student, neither a statistician nor a machine learning expert.  
In addition, I would be grateful if you could tell me what else should be added to this EDA. Thank you for reading!  