Hi, and thank you for stopping by and looking at my work! Please feel free to upvote or comment if you find this notebook helpful, or if you have suggestions for improvement. 

Please note that this notebook is a work in progress and does not yet reflect a complete solution or approach.

# Loading Dataset
Read the CSV file with int16 and float16 column types to save memory.

In [None]:
import time
import numpy as np 
import pandas as pd 

data_types_dict = {
    'time_id': 'int16',
    'investment_id': 'int16',
    "target": 'float16',
}

features = [f'f_{i}' for i in range(300)]

for f in features:
    data_types_dict[f] = 'float16'

In [None]:
start_time = time.time()

train_data = pd.read_csv('../input/ubiquant-market-prediction/train.csv',
                        usecols = data_types_dict.keys(),
                        dtype = data_types_dict,
                        skiprows=range(1,100000), #Keeps the header this way
                        nrows=10000)

end_time = time.time()

read_time = end_time - start_time 

print("Reading dataset took: ", read_time, "sec")
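
To get a feel for how much the narrower dtypes save, here is a minimal sketch on synthetic data. The `demo` frame and its size are made up for illustration and are not the competition data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 1,000 rows x 300 feature columns.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.standard_normal((1_000, 300)),
    columns=[f"f_{i}" for i in range(300)],
)

# Total bytes used by the column data (index excluded) for each dtype.
bytes_float64 = demo.memory_usage(index=False).sum()
bytes_float16 = demo.astype("float16").memory_usage(index=False).sum()

print(f"float64: {bytes_float64 / 1e6:.2f} MB")
print(f"float16: {bytes_float16 / 1e6:.2f} MB")
print(f"ratio:   {bytes_float64 / bytes_float16:.1f}x")
```

The trade-off is precision and range: float16 keeps only about three significant decimal digits and overflows above 65504, which matters for sums and means over many rows.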

In [None]:
#Checking for missing values
missing_values_count = train_data.isnull().sum()
print(missing_values_count)

# Columns' Interpretation

* row_id - A unique identifier for the row.
* time_id - The ID code for the time the data was gathered. The time IDs are in order, but the real time between the time IDs is not constant and will likely be shorter for the final private test set than in the training set.
* investment_id - The ID code for an investment. Not all investments have data in all time IDs.
* target - The target.
* [f_0:f_299] - Anonymized features generated from market data.

In [None]:
train_data.head()

In [None]:
train_data.info()

In [None]:
#Sorting by investment_id and time_id to use for constructing lag features later
test_frame = train_data.sort_values(by=['investment_id', 'time_id'])
test_frame.head()
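
As a rough idea of what the lag features could look like once the rows are sorted, here is a small sketch on a toy frame. The column name `f_0_lag1` and the toy values are hypothetical; this is one possible construction, not a settled approach for this notebook.

```python
import pandas as pd

# Toy data: investment 2 has no row for time_id 1.
toy = pd.DataFrame({
    "investment_id": [1, 1, 1, 2, 2],
    "time_id":       [0, 1, 2, 0, 2],
    "f_0":           [0.1, 0.2, 0.3, 1.0, 1.2],
})

# Sort so that each investment's rows are in time order.
toy = toy.sort_values(["investment_id", "time_id"])

# Previous available value of f_0 within the same investment.
# The first row of each investment has no predecessor and becomes NaN.
toy["f_0_lag1"] = toy.groupby("investment_id")["f_0"].shift(1)

print(toy)
```

Note that `shift(1)` lags by row, not by `time_id`: for investment 2 the lag at `time_id` 2 comes from `time_id` 0, because `time_id` 1 is missing. Since not all investments have data in all time IDs, this may or may not be the behavior we want.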

# What's next?
* Data cleaning (data appears to have no missing values)
* Baseline Model
* EDA 
* Feature Engineering (try constructing lag features)

[this notebook might be useful for reference](https://www.kaggle.com/ilialar/ubiquant-eda-and-baseline/notebook)

# Trying out PCA

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
from sklearn.feature_selection import mutual_info_regression


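# "seaborn-whitegrid" was renamed to "seaborn-v0_8-whitegrid" in matplotlib 3.6;
# use whichever name this matplotlib version provides
style_name = "seaborn-v0_8-whitegrid"
if style_name not in plt.style.available:
    style_name = "seaborn-whitegrid"
plt.style.use(style_name)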
plt.rc("figure", autolayout=True)
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=14,
    titlepad=10,
)

def plot_variance(pca, width=8, dpi=100):
    # Create figure
    fig, axs = plt.subplots(1, 2)
    n = pca.n_components_
    grid = np.arange(1, n + 1)
    # Explained variance
    evr = pca.explained_variance_ratio_
    axs[0].bar(grid, evr)
    axs[0].set(
        xlabel="Component", title="% Explained Variance", ylim=(0.0, 1.0)
    )
    # Cumulative Variance
    cv = np.cumsum(evr)
    axs[1].plot(np.r_[0, grid], np.r_[0, cv], "o-")
    axs[1].set(
        xlabel="Component", title="% Cumulative Variance", ylim=(0.0, 1.0)
    )
    # Set up figure using the width and dpi arguments
    fig.set(figwidth=width, dpi=dpi)
    return axs

def make_mi_scores(X, y, discrete_features):
    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

In [None]:
X = test_frame.copy()
y = X.pop('target')
X = X.loc[:, features]

#Standardize - NOTE TO SELF: check if normalization is needed for this dataset
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

In [None]:
#np.any(np.isnan(X_scaled))
#X_scaled[features].isnull().sum()
#Print every feature (f_0 through f_299) that contains NaN after scaling, with its NaN count
for f in features:
    n_missing = X_scaled[f].isnull().sum()
    if n_missing != 0:
        print(f, " ", n_missing)

In [None]:
X.f_170

It turns out that f_124, f_170, and f_182 have std = 0, so we need to handle them manually to make PCA work.
A better solution, however, would be to use the float64 data type, so that those inf and NaN values do not appear in the first place.

In [None]:
X_scaled.f_124 = X_scaled.f_124.fillna(0.0)
X_scaled.f_170 = X_scaled.f_170.fillna(0.0)
X_scaled.f_182 = X_scaled.f_182.fillna(1.0)
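
An alternative to patching columns one by one is to standardize in float64 and guard against zero standard deviation. The sketch below is only an illustration on a toy frame: the helper name `standardize` is hypothetical, and mapping constant columns to all zeros is an assumption about what we want, not something this notebook has settled on.

```python
import numpy as np
import pandas as pd

def standardize(frame: pd.DataFrame) -> pd.DataFrame:
    """Z-score each column in float64; constant columns become all zeros."""
    wide = frame.astype("float64")
    centered = wide - wide.mean(axis=0)
    std = wide.std(axis=0)
    # Replace a zero std with 1 so constant columns divide to 0 instead of NaN.
    return centered / std.replace(0.0, 1.0)

# Toy float16 frame with one varying and one constant column.
demo = pd.DataFrame({
    "varying": np.array([1.0, 2.0, 3.0, 4.0], dtype="float16"),
    "constant": np.array([5.0, 5.0, 5.0, 5.0], dtype="float16"),
})

scaled = standardize(demo)
print(scaled)
print("Any NaN left:", scaled.isna().any().any())
```

Upcasting before taking the mean and std avoids doing the reductions in float16, and the guard means a column that really is constant in the sample no longer produces NaN.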

In [None]:
from sklearn.decomposition import PCA

# Create principal components
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Convert to dataframe
component_names = [f"PC{i+1}" for i in range(X_pca.shape[1])]
X_pca = pd.DataFrame(X_pca, columns=component_names)

X_pca.head()

After fitting, the PCA instance contains the loadings in its components_ attribute. (Terminology for PCA is inconsistent, unfortunately. We're following the convention that calls the transformed columns in X_pca the components, which otherwise don't have a name.) We'll wrap the loadings up in a dataframe.

In [None]:
loadings = pd.DataFrame(
    pca.components_.T,  # transpose the matrix of loadings
    columns=component_names,  # so the columns are the principal components
    index=X.columns,  # and the rows are the original features
)
loadings.head()
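
To make the relationship between loadings and components concrete, here is a small check on synthetic data (the `demo_` names are made up for this example). It projects the centered data onto the loadings by hand and compares the result with what `fit_transform` returns.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 rows, 5 features.
rng = np.random.default_rng(0)
demo_X = rng.standard_normal((200, 5))

demo_pca = PCA()
demo_scores = demo_pca.fit_transform(demo_X)

# Center with the means PCA learned, then project onto the loadings.
# components_ has shape (n_components, n_features), hence the transpose.
manual_scores = (demo_X - demo_pca.mean_) @ demo_pca.components_.T

print(np.allclose(demo_scores, manual_scores))
```

In other words, each column of the loadings table holds the weights that turn the (centered) original features into one principal component.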


In [None]:
# Look at explained variance
plot_variance(pca)
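
The cumulative variance curve can also be used to pick how many components to keep. Below is a minimal sketch on synthetic data; the 0.95 threshold is an arbitrary choice for illustration, and the `demo_` names are not part of this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 10 columns driven by 3 latent directions plus small noise.
rng = np.random.default_rng(0)
latent = rng.standard_normal((500, 3))
mixing = rng.standard_normal((3, 10))
demo_X = latent @ mixing + 0.05 * rng.standard_normal((500, 10))

demo_pca = PCA().fit(demo_X)
cumulative = np.cumsum(demo_pca.explained_variance_ratio_)

threshold = 0.95  # arbitrary cutoff, chosen only for this example
# Index of the first component where the cumulative ratio reaches the
# threshold, plus one to turn the index into a count.
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print("Components needed:", n_keep)
```

scikit-learn can do the same selection directly: passing a float between 0 and 1 as `n_components` (for example `PCA(n_components=0.95)`) keeps just enough components to explain that fraction of the variance.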

In [None]:
mi_scores = make_mi_scores(X_pca, y, discrete_features=False)

In [None]:
#Print the MI score of every component in order (PC1 through the last one)
for name in component_names:
    print(mi_scores[name])