# Hi, Welcome.

**Today, we will examine the Ubiquant Market Prediction dataset.**

**I will explain feature values and make visualizations. In this notebook, I will use the highly efficient parquet data format.**

# Import the relevant libraries

These are the libraries used throughout the notebook.

In [None]:
import numpy as np
import pandas as pd
import gc
import plotly.express as px
import random
from sklearn.linear_model import LinearRegression
from scipy.stats import pearsonr
from sklearn.model_selection import train_test_split

**I load the dataset in parquet format; see this [notebook](https://www.kaggle.com/code/robikscube/fast-data-loading-and-low-mem-with-parquet-files) for details. The main purpose of this format is to reduce file size and memory usage, and to prevent crashes caused by the massive file size.**
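
As a rough illustration of why a low-memory file helps, the sketch below downcasts a small synthetic frame from 64-bit dtypes to narrower ones and compares memory usage. The frame, its values, and the chosen dtypes are assumptions made for this example; the linked notebook remains the reference for how the actual file was produced.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 10_000

# Synthetic stand-in for the real data (assumption: not the competition file).
demo = pd.DataFrame({
    "investment_id": rng.integers(0, 3000, size=n_rows).astype("int64"),
    "f_0": rng.normal(size=n_rows),  # float64 by default
    "f_1": rng.normal(size=n_rows),
})

# Downcast: 16-bit unsigned ints for small IDs, 32-bit floats for features.
small = demo.astype({"investment_id": "uint16", "f_0": "float32", "f_1": "float32"})

bytes_before = demo.memory_usage(index=False).sum()
bytes_after = small.memory_usage(index=False).sum()
print(f"before: {bytes_before} bytes, after: {bytes_after} bytes")
```

The saving comes purely from the narrower dtypes, at the cost of some floating-point precision.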

In [None]:
%%time
train_df = pd.read_parquet('../input/ubiquant-parquet/train_low_mem.parquet')
train_df.memory_usage().sum()

In [None]:
train_df.head()

**Let's check our features**

In [None]:
def ordinal(n):
    # English ordinal suffix: 1st, 2nd, 3rd, 4th, ..., 11th, 12th, 13th, 21st, ...
    if n % 100 in (11, 12, 13):
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

n_cols = len(train_df.columns)
for idx, col in enumerate(train_df.columns, start=1):
    # print the first 5 columns, the last 5 columns, and every 20th column for less clutter
    if idx <= 5 or idx > n_cols - 5 or idx % 20 == 0:
        print(f'{ordinal(idx)} column is {col}')

**Apparently, we have 304 different columns.**

**The first four columns, followed by the feature columns, are:**

**row_id** = A unique identifier for every single row.

**time_id** = The ID code for the time the data was gathered. The time IDs are in order, but the real time between the time IDs is not constant and will likely be shorter for the final private test set than in the training set.

**investment_id** = The ID code for an investment. Not all investments have data in all time IDs.

**target** = The target value.

**f_0 - f_299** = Anonymized features generated from market data. (My guess is that they are specific financial indicators computed to capture correlations with the target.)

# Check how many observations we have to get an intuitive sense of the dataset's size

In [None]:
total_obs = train_df.shape[0]
print(f"WOW! There are {total_obs} observations.")

In [None]:
fig = px.histogram(train_df, x="target", nbins=50)
fig.show()

**Our target values look roughly normally distributed, lying mostly between -2.5 and 3. The most populated interval is -0.5 to 0.**

**Now, let's inspect time_id and investment_id.**

In [None]:
tmp = train_df.groupby("time_id").investment_id.nunique()
# the y-axis shows how many distinct investments appear at each time_id
fig = px.scatter(x=tmp.index, y=tmp.values, labels={'x': 'time_id', 'y': 'number of unique investment_ids'})
fig.show()

**There is a break roughly around time_id 400; apart from that, there are no remarkable outliers.**


# SUBMISSION API?

****
**Now, I will create a very simple model to show how to submit.**

**Submitting via an API is not a common concept, and it took me a while to understand.**

Here is what the host says:

> Submissions are evaluated on the mean of the Pearson correlation coefficient for each time ID.

**What is the Pearson correlation coefficient?**

The Pearson correlation coefficient is simply a measure of the linear correlation between two sets of data.

**Formula:**

![](https://www.gstatic.com/education/formulas2/397133473/en/correlation_coefficient_formula.svg)

* r = correlation coefficient

* xᵢ = values of the x-variable in a sample

* x̄ = mean of the values of the x variable

* yᵢ = values of the y-variable in a sample

* ȳ = mean of the values of the y variable

> definition from [statisticshowto.com](https://www.statisticshowto.com/probability-and-statistics/correlation-coefficient-formula/)

For a detailed definition, see [Wikipedia](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient).
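
To make the formula concrete, here is a minimal sketch that computes r term by term in plain Python. The function name `pearson_r` and the sample values are made up for illustration; in practice `scipy.stats.pearsonr` does the same job.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences, following the formula above."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the summed squared deviations.
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    return cov / math.sqrt(ss_x * ss_y)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
r_up = pearson_r(x, [2.0, 4.0, 6.0, 8.0, 10.0])   # perfectly linear, increasing
r_down = pearson_r(x, [5.0, 4.0, 3.0, 2.0, 1.0])  # perfectly linear, decreasing
print(r_up, r_down)
```

A perfectly increasing linear relationship gives r = 1 and a perfectly decreasing one gives r = -1. Note that the function divides by zero if either input is constant.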

**To submit our predictions, we must follow these steps:**

![](https://cdn.discordapp.com/attachments/928151364524183565/957917677836460032/unknown.png)

# Let's create our model now!

**I will be using Linear Regression for the sake of simplicity.** For more information, see the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).

**First of all, let's hold out a test split, then reduce the number of rows and columns used for training.**

In [None]:
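# Use all 300 anonymized features and hold out a random test split
features = [f'f_{i}' for i in range(300)]

X = train_df[features]
y = train_df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)  # fixed seed for reproducibility
X_test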

In [None]:
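num = 100     # number of feature columns to keep
con = 500000  # number of training rows to keep
features = [f'f_{i}' for i in range(num)]

# Take the reduced sample from the training split, not from train_df,
# so that rows in X_test are never used to fit the model
reduced_X_train = X_train[features][:con]
reduced_y_train = y_train[:con]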

In [None]:
# linear regression model
reg = LinearRegression().fit(reduced_X_train, reduced_y_train)
# our predict function
def predict(df):
    # keep only the columns the model was trained on, in the same order
    model_input = df[reduced_X_train.columns]
    prediction = reg.predict(model_input)
    return prediction

In [None]:
res = predict(X_test)
print("Result:", pearsonr(y_test, res)[0]) # pearson correlation result
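
The score above is a single Pearson correlation over all held-out rows, whereas the host's metric is the mean of the per-time_id correlations. A minimal sketch of that averaging is shown below; the helper name `mean_pearson_by_time`, the `pred` column name, and the toy values are assumptions made for this example.

```python
import numpy as np
import pandas as pd

def mean_pearson_by_time(df, target_col="target", pred_col="pred", time_col="time_id"):
    """Mean over time IDs of the Pearson correlation between target and prediction."""
    scores = [
        group[target_col].corr(group[pred_col])  # Series.corr defaults to Pearson
        for _, group in df.groupby(time_col)
    ]
    return float(np.mean(scores))

# Toy data (made up): perfect agreement at time 0, perfect reversal at time 1.
toy = pd.DataFrame({
    "time_id": [0, 0, 0, 1, 1, 1],
    "target":  [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    "pred":    [1.0, 2.0, 3.0, 30.0, 20.0, 10.0],
})

score = mean_pearson_by_time(toy)
print(score)
```

In this notebook, the time IDs of the held-out rows could presumably be recovered with `train_df.loc[X_test.index, 'time_id']`, since `train_test_split` preserves the pandas index. A time ID whose targets or predictions are constant yields NaN and would need separate handling.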

In [None]:
import ubiquant # host's lib
env = ubiquant.make_env()   # initialize the environment
iter_test = env.iter_test()    # an iterator which loops over the test set and sample submission
for (test_df, sample_prediction_df) in iter_test:
    sample_prediction_df['target'] = predict(test_df)  # make your predictions here
    env.predict(sample_prediction_df)
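
The `ubiquant` module is provided by the host inside the competition environment, so the loop above may not run elsewhere. For a local dry run, a stand-in like the one below can mimic the same loop. `MockEnv`, `dummy_predict`, and the toy frame are hypothetical names invented for this sketch; they are not the host's code, and the exact columns the real iterator yields are not reproduced here.

```python
import numpy as np
import pandas as pd

class MockEnv:
    """Hypothetical local stand-in for the competition environment."""

    def __init__(self, test_df, time_col="time_id"):
        self._test_df = test_df
        self._time_col = time_col
        self.submitted = []  # collects every frame passed to predict()

    def iter_test(self):
        # Yield one batch per time ID, with a blank prediction frame to fill in.
        for _, batch in self._test_df.groupby(self._time_col):
            sample = pd.DataFrame({"row_id": batch["row_id"].values, "target": 0.0})
            yield batch, sample

    def predict(self, prediction_df):
        self.submitted.append(prediction_df.copy())

# Toy test data (made up for illustration).
demo_test = pd.DataFrame({
    "row_id": ["0_1", "0_2", "1_1"],
    "time_id": [0, 0, 1],
    "f_0": [0.1, -0.2, 0.3],
})

def dummy_predict(df):
    # Placeholder model: predicts zero for every row.
    return np.zeros(len(df))

mock_env = MockEnv(demo_test)
for test_batch, sample_prediction_df in mock_env.iter_test():
    sample_prediction_df["target"] = dummy_predict(test_batch)
    mock_env.predict(sample_prediction_df)

print(len(mock_env.submitted))
```

Swapping `dummy_predict` for the notebook's `predict` function would exercise the real model, provided the toy frame contains the feature columns it expects.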

# Conclusion
**Thanks for reading my notebook.**

**Feel free to ask questions in the comments section.**

**All upvotes are appreciated.**