# Data preprocessing

## Baseline model

For the baseline I have chosen `Logistic Regression`, `Random Forest` with `100` estimators and a simple `Neural Network` with two hidden layers and early stopping, all from [scikit-learn](https://scikit-learn.org/stable/).

Each uses 5-fold cross-validation. This baseline function will act as a yardstick for the feature engineering that follows.
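The `baselines` helper lives in `utilities` and its source is not shown here. As a rough idea of what such a function might do, here is a minimal sketch. The name `baselines_sketch`, the hidden layer sizes and `max_iter` are my assumptions, and the CSV caching of the real helper is left out.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier


def baselines_sketch(X, y, cv=5, random_state=0):
    """Hypothetical stand-in for utilities.baseline.baselines (no CSV caching)."""
    models = {
        "LR": LogisticRegression(max_iter=1000, random_state=random_state),
        "RF": RandomForestClassifier(n_estimators=100, random_state=random_state),
        # Layer sizes are an assumption; the text only says "two hidden layers".
        "NN": MLPClassifier(
            hidden_layer_sizes=(64, 32),
            early_stopping=True,
            random_state=random_state,
        ),
    }
    # One column per model, one row per fold (accuracy by default).
    return pd.DataFrame(
        {name: cross_val_score(model, X, y, cv=cv) for name, model in models.items()}
    )


# Small synthetic data just to show the call; not the competition data.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)
demo_scores = baselines_sketch(X_demo, y_demo)
print(demo_scores)
```

The returned frame has the same layout as the result tables in this notebook: one row per fold and one column per classifier.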

In [1]:
import pathlib

# Where data will be stored if not present
DATA_PATH = "../input"
! ./utilities/download_data.sh "$DATA_PATH"

# Results from analysis should already be obtained
ANALYSIS_PATH = pathlib.Path("../analysis")
# Where results of preprocessing are/will be stored
PREPROCESSING_PATH = pathlib.Path("../preprocessing")

PREPROCESSING_PATH.mkdir(parents=True, exist_ok=True)

Downloading data to: ../input
test_data.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
train_data.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
train_labels.csv: Skipping, found more recently modified local copy (use --force to force download)
sample_submission.csv: Skipping, found more recently modified local copy (use --force to force download)
/home/vyz/projects/Kaggle1NN2019/src
Script ran successfully

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
import numpy as np

from utilities.general import train_data, test_data, random_seed

# Seed used by all transformations
SEED = random_seed()

X, y = train_data(pathlib.Path(DATA_PATH))

## Data scaling

Some scaling methods are presented below. Each is tested with the three algorithms and a fixed seed, so the experiments are at least roughly reproducible.

No substantial improvement was reached with any widely applicable scheme. Accuracy usually dropped across the three classifiers (especially in the case of the neural network).

Only sample-wise normalization yields approximately the same results as the original data, so this approach may well be a dead end.
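One caveat: the cells in this section fit each scaler on the whole training set before cross-validation, so the validation folds influence the scaling statistics. A minimal sketch of a variant that refits the scaler inside each fold, assuming a plain scikit-learn `Pipeline` on synthetic data (this is not what `utilities.baseline` does, as far as this notebook shows):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

# The scaler is fitted on the training part of every fold only.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
fold_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)
print(fold_scores)
```

For feature-wise scalers on a dataset of this size the difference is probably small, but the pipeline form rules the leak out.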

In [3]:
from utilities.baseline import baselines

baselines(
    PREPROCESSING_PATH / pathlib.Path("initial_baseline.csv"),
    X,
    y,
    cv=5,
    random_state=SEED,
)

Results stored, retrieving...


Unnamed: 0,LR,RF,NN
0,0.769889,0.802702,0.842481
1,0.765525,0.79624,0.835347
2,0.776015,0.80673,0.838704
3,0.772659,0.802199,0.842313
4,0.770309,0.803122,0.840718


In [4]:
from sklearn.preprocessing import scale

X_standardized = scale(X)

baselines(
    PREPROCESSING_PATH / pathlib.Path("standardized_baseline.csv"),
    X_standardized,
    y,
    cv=5,
    random_state=SEED,
)

Results stored, retrieving...


Unnamed: 0,LR,RF,NN
0,0.769889,0.802702,0.811934
1,0.764938,0.79624,0.80757
2,0.774924,0.806898,0.803541
3,0.771819,0.802283,0.814535
4,0.76905,0.803206,0.809584


In [5]:
from sklearn.preprocessing import MinMaxScaler

X_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

baselines(
    PREPROCESSING_PATH / pathlib.Path("scaled_baseline.csv"),
    X_scaled,
    y,
    cv=5,
    random_state=SEED,
)

Results stored, retrieving...


Unnamed: 0,LR,RF,NN
0,0.767791,0.802702,0.824773
1,0.763511,0.796324,0.822088
2,0.772659,0.80673,0.831067
3,0.770057,0.802199,0.827878
4,0.767036,0.803122,0.825277


In [6]:
from sklearn.preprocessing import normalize

X_normalized = normalize(X)

baselines(
    PREPROCESSING_PATH / pathlib.Path("normalized_baseline.csv"),
    X_normalized,
    y,
    cv=5,
    random_state=SEED,
)

Results stored, retrieving...


Unnamed: 0,LR,RF,NN
0,0.759147,0.800688,0.841474
1,0.751091,0.793303,0.831403
2,0.759231,0.805891,0.844579
3,0.762504,0.800352,0.843404
4,0.759147,0.799094,0.836774


Interestingly, in `2/5` folds the neural network's predictions are a little better with this normalization scheme, while the others are not far apart.

It might be beneficial to combine those two data representations (original and sample-wise normalized) when ensembling multiple methods in order to increase diversity between predictions, though it might be computationally wasteful to check.
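A minimal sketch of what combining the two representations could look like, assuming simple soft voting (averaged class probabilities) over two logistic regressions on synthetic data. The models and the split are illustrative assumptions, not the ensemble used later.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize

X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# One model per representation: raw and sample-wise (row) normalized.
raw_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
norm_model = LogisticRegression(max_iter=1000).fit(normalize(X_train), y_train)

# Soft voting: average the class probabilities of both models.
avg_proba = (
    raw_model.predict_proba(X_val) + norm_model.predict_proba(normalize(X_val))
) / 2
ensemble_pred = raw_model.classes_[avg_proba.argmax(axis=1)]
ensemble_accuracy = (ensemble_pred == y_val).mean()
print(ensemble_accuracy)
```

Sample-wise normalization is stateless, so applying it separately to the training and validation parts does not leak information.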

Other schemes (including a power transform) and their results are presented below.
It seems no normalization really helps in this case (though I can't say anything definite, as I'm not checking it on a model-by-model basis).
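The `axis` and `norm` arguments of `normalize` used in the following cells are easy to mix up, so here is a toy illustration of the four variants (the array is made up):

```python
import numpy as np
from sklearn.preprocessing import normalize

X_demo = np.array([[3.0, 4.0], [6.0, 8.0], [1.0, 0.0]])

row_l2 = normalize(X_demo)                      # each sample (row) has unit L2 norm
col_l2 = normalize(X_demo, axis=0)              # each feature (column) has unit L2 norm
row_l1 = normalize(X_demo, norm="l1")           # absolute values in each row sum to 1
col_l1 = normalize(X_demo, norm="l1", axis=0)   # absolute values in each column sum to 1

print(row_l2)
```

Note that the first two rows become identical after sample-wise normalization: it keeps the direction of a sample and discards its magnitude.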

In [7]:
from sklearn.preprocessing import normalize

X_normalized = normalize(X, axis=0)

baselines(
    PREPROCESSING_PATH / pathlib.Path("normalized_features_baseline.csv"),
    X_normalized,
    y,
    cv=5,
    random_state=SEED,
)

Results stored, retrieving...


Unnamed: 0,LR,RF,NN
0,0.711732,0.802534,0.811682
1,0.708291,0.79624,0.809332
2,0.719285,0.808409,0.812437
3,0.722558,0.799261,0.814283
4,0.720879,0.801863,0.813948


In [8]:
from sklearn.preprocessing import normalize

X_normalized = normalize(X, norm="l1")

baselines(
    PREPROCESSING_PATH / pathlib.Path("l1_normalized_baseline.csv"),
    X_normalized,
    y,
    cv=5,
    random_state=SEED,
)

Results stored, retrieving...


Unnamed: 0,LR,RF,NN
0,0.704179,0.798926,0.82234
1,0.69864,0.793387,0.82662
2,0.706361,0.804884,0.831319
3,0.705774,0.801779,0.829893
4,0.711564,0.798758,0.827627


In [9]:
from sklearn.preprocessing import normalize

X_normalized = normalize(X, norm="l1", axis=0)

baselines(
    PREPROCESSING_PATH / pathlib.Path("l1_normalized_features_baseline.csv"),
    X_normalized,
    y,
    cv=5,
    random_state=SEED,
)

Results stored, retrieving...


Unnamed: 0,LR,RF,NN
0,0.09374,0.796912,0.09374
1,0.095838,0.790702,0.099698
2,0.097348,0.798003,0.101292
3,0.096845,0.793387,0.096845
4,0.096005,0.793807,0.197214
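The collapse of `LR` and `NN` above is striking, while `RF` is barely affected. One plausible explanation (my assumption, not verified on the models themselves) is scale: feature-wise `l1` normalization forces the absolute values of every column to sum to 1, so with this many rows each entry becomes tiny. Gradient-based models with default settings may struggle with inputs that small, whereas tree splits do not depend on a per-feature rescaling. A sketch on random data with the same number of rows:

```python
import numpy as np
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
# Row count taken from the shape printed later in this notebook; 5 columns is arbitrary.
n_samples, n_features = 59580, 5
X_demo = rng.normal(size=(n_samples, n_features))

X_col_l1 = normalize(X_demo, norm="l1", axis=0)
mean_abs = np.abs(X_col_l1).mean(axis=0)
print(mean_abs)  # every entry equals 1 / n_samples
```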


In [10]:
from sklearn.preprocessing import PowerTransformer

X_unskewed = PowerTransformer().fit_transform(X)

In [11]:
baselines(
    PREPROCESSING_PATH / pathlib.Path("unskewed_baseline.csv"),
    X_unskewed,
    y,
    cv=5,
    random_state=SEED,
)

Results stored, retrieving...


Unnamed: 0,LR,RF,NN
0,0.769134,0.802786,0.808241
1,0.763595,0.796324,0.808577
2,0.774253,0.806898,0.802618
3,0.7714,0.802031,0.813612
4,0.770309,0.803206,0.813108


## Remove low variance

Removing features with low variance might remove noise. It might be especially helpful in the case of artificially generated data (where some columns harm the model's performance and clutter the feature space). Below, features with variance lower than `0.01` are removed:
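As a quick illustration of what `VarianceThreshold` keeps and drops, here is a toy example with made-up columns. The threshold is an absolute variance, so the outcome depends on the scale of each feature.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X_demo = np.column_stack(
    [
        rng.normal(scale=1.0, size=200),   # variance about 1: kept
        rng.normal(scale=0.01, size=200),  # variance about 0.0001: removed
        np.full(200, 3.0),                 # constant column: removed
    ]
)

selector = VarianceThreshold(threshold=0.01)
X_kept = selector.fit_transform(X_demo)
kept_mask = selector.get_support()  # boolean mask over the original columns
print(kept_mask, X_kept.shape)
```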

In [12]:
from sklearn.feature_selection import VarianceThreshold

threshold = VarianceThreshold(0.01)
X_thresholded = threshold.fit_transform(X)

print(f"Feature count after removing low variance {X_thresholded.shape[1]}")

baselines(
    PREPROCESSING_PATH / pathlib.Path("no_low_variance_baseline.csv"),
    X_thresholded,
    y,
    cv=5,
    random_state=SEED,
)

Feature count after removing low variance 288


Results stored, retrieving...


Unnamed: 0,LR,RF,NN
0,0.770141,0.804632,0.843991
1,0.764434,0.797163,0.838285
2,0.777274,0.808577,0.837613
3,0.774253,0.805052,0.84097
4,0.772323,0.80673,0.840886


This scheme reduces the workload and appears to make predictions vary less across folds, so it's probably worth keeping.

Different thresholds and their effects are shown below:

In [13]:
threshold = VarianceThreshold(0.1)
X_thresholded = threshold.fit_transform(X)

print(f"Feature count after removing low variance {X_thresholded.shape[1]}")

baselines(
    PREPROCESSING_PATH / pathlib.Path("no_medium_variance_baseline.csv"),
    X_thresholded,
    y,
    cv=5,
    random_state=SEED,
)

Feature count after removing low variance 56


Results stored, retrieving...


Unnamed: 0,LR,RF,NN
0,0.756378,0.817976,0.847684
1,0.753776,0.809164,0.841306
2,0.758644,0.819319,0.844327
3,0.762504,0.816046,0.848523
4,0.756294,0.81378,0.841641


In [14]:
threshold = VarianceThreshold(0.2)
X_thresholded = threshold.fit_transform(X)

print(f"Feature count after removing low variance {X_thresholded.shape[1]}")

baselines(
    PREPROCESSING_PATH / pathlib.Path("no_high_variance_baseline.csv"),
    X_thresholded,
    y,
    cv=5,
    random_state=SEED,
)

Feature count after removing low variance 30


Results stored, retrieving...


Unnamed: 0,LR,RF,NN
0,0.739846,0.809248,0.841222
1,0.734642,0.80673,0.832494
2,0.74144,0.813025,0.838453
3,0.746224,0.809248,0.835935
4,0.743286,0.809248,0.836606


In [15]:
threshold = VarianceThreshold(0.15)
X_thresholded = threshold.fit_transform(X)

print(f"Feature count after removing low variance {X_thresholded.shape[1]}")

baselines(
    PREPROCESSING_PATH / pathlib.Path("no_next_to_high_variance_baseline.csv"),
    X_thresholded,
    y,
    cv=5,
    random_state=SEED,
)

Feature count after removing low variance 39


Results stored, retrieving...


Unnamed: 0,LR,RF,NN
0,0.746056,0.814955,0.840467
1,0.746643,0.805975,0.837362
2,0.750839,0.820326,0.841054
3,0.752937,0.815123,0.84567
4,0.750587,0.813864,0.840886


In [16]:
threshold = VarianceThreshold(0.05)
X_thresholded = threshold.fit_transform(X)

print(f"Feature count after removing low variance {X_thresholded.shape[1]}")

baselines(
    PREPROCESSING_PATH / pathlib.Path("no_next_to_medium_variance_baseline.csv"),
    X_thresholded,
    y,
    cv=5,
    random_state=SEED,
)

Feature count after removing low variance 95


Results stored, retrieving...


Unnamed: 0,LR,RF,NN
0,0.765945,0.814451,0.849614
1,0.761078,0.810926,0.842649
2,0.769386,0.820913,0.84332
3,0.771484,0.815206,0.847768
4,0.767372,0.814955,0.848019


A threshold around `0.05` seems to bring the best results.
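The threshold cells above repeat the same steps with a different value each time. A minimal sketch of the same sweep as a loop, assuming synthetic data and a single random forest instead of the three baselines (the thresholds here are illustrative, not the ones used above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300
signal = rng.normal(size=(n, 10))            # unit-variance columns
noise = rng.normal(scale=0.1, size=(n, 10))  # low-variance columns (about 0.01)
X_demo = np.hstack([signal, noise])
y_demo = (signal[:, 0] + signal[:, 1] > 0).astype(int)

results = {}
for threshold in (0.0, 0.05, 0.5):
    X_kept = VarianceThreshold(threshold).fit_transform(X_demo)
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    mean_accuracy = cross_val_score(model, X_kept, y_demo, cv=5).mean()
    # Store the number of surviving features and the mean fold accuracy.
    results[threshold] = (X_kept.shape[1], mean_accuracy)

for threshold, (n_kept, accuracy) in results.items():
    print(threshold, n_kept, round(accuracy, 3))
```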

# Keep only the most important features based on analysis

First, let's load the results of the ANOVA statistical test, mutual information, and feature importance according to a random forest classifier.

This data should be available after running the previous `analysis` notebook.
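The `analysis` notebook is not shown here. For reference, ANOVA F-scores and mutual information can be computed with scikit-learn roughly as below; whether `utilities.analysis` uses exactly these functions is an assumption on my part.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif, mutual_info_classif

# Synthetic stand-in for the real data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)

# ANOVA F-test: one F-score and p-value per feature.
f_scores, p_values = f_classif(X_demo, y_demo)
# Mutual information between each feature and the label (non-negative).
mi_scores = mutual_info_classif(X_demo, y_demo, random_state=0)

print(f_scores.round(2))
print(mi_scores.round(3))
```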

In [17]:
from utilities.analysis import rf_feature_importance
from utilities.analysis import feature_importance

feature_importance = feature_importance(
    ANALYSIS_PATH / pathlib.Path("feature_importance.csv"), X, y
)

rf_feature_importance = rf_feature_importance(
    ANALYSIS_PATH / pathlib.Path("rf_feature_importance.csv"), X, y, random_state=SEED
)

Results stored, retrieving...
Results stored, retrieving...


Unnamed: 0,importance
1,0.067963
0,0.047087
2,0.035552
7,0.033523
4,0.028876


Each feature's positions in the individual rankings are summed and the result is sorted inside the internal function (e.g. the best feature has a value of `0`, the second best `1`, and so on).

Based on those values we will see whether removing some "unimportant" features might help:

In [18]:
from utilities.preprocessing import get_best_features

best_features = get_best_features(feature_importance, rf_feature_importance)
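The source of `get_best_features` is not shown, so here is a minimal sketch of the rank-summing idea described above. The function name `rank_sum_order`, the tie handling and the toy scores are all hypothetical.

```python
import pandas as pd


def rank_sum_order(*importances: pd.Series) -> pd.Index:
    """Hypothetical sketch: order features by summed rank positions, best first."""
    # Position 0 is the most important feature within a single scoring.
    ranks = [s.rank(ascending=False, method="first") - 1 for s in importances]
    total = sum(ranks)
    return total.sort_values(kind="stable").index


# Made-up scores for three features.
anova = pd.Series({"a": 10.0, "b": 5.0, "c": 1.0})
mutual_info = pd.Series({"a": 0.2, "b": 0.3, "c": 0.1})
forest = pd.Series({"a": 0.5, "b": 0.4, "c": 0.1})

order = rank_sum_order(anova, mutual_info, forest)
print(list(order))
```

Here `a` collects positions 0, 1 and 0, `b` collects 1, 0 and 1, and `c` is last in every scoring, so the summed positions are 1, 2 and 6.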

Below are a few choices of how many features to keep and the results each yields:

In [19]:
from utilities.preprocessing import slice_best_features

to_keep = 60

baselines(
    PREPROCESSING_PATH / pathlib.Path(f"important_{to_keep}.csv"),
    slice_best_features(X, best_features, to_keep),
    y,
    cv=5,
    random_state=SEED,
)

Results stored, retrieving...


Unnamed: 0,LR,RF,NN
0,0.755035,0.812101,0.8476
1,0.75193,0.809919,0.838285
2,0.759147,0.815542,0.843991
3,0.761413,0.818228,0.845586
4,0.757553,0.813192,0.842649


In [20]:
from utilities.preprocessing import slice_best_features

to_keep = 96

baselines(
    PREPROCESSING_PATH / pathlib.Path(f"important_{to_keep}.csv"),
    slice_best_features(X, best_features, to_keep),
    y,
    cv=5,
    random_state=SEED,
)

Results stored, retrieving...


Unnamed: 0,LR,RF,NN
0,0.763595,0.811094,0.845166
1,0.755791,0.810003,0.842145
2,0.764267,0.818312,0.848523
3,0.768295,0.816885,0.849866
4,0.763427,0.812353,0.847264


As it's hard to decide between the two, I will check the other sensible possibilities using the function `find_threshold`:

In [21]:
from utilities.preprocessing import slice_best_features

to_keep = 78

baselines(
    PREPROCESSING_PATH / pathlib.Path(f"important_{to_keep}.csv"),
    slice_best_features(X, best_features, to_keep),
    y,
    cv=5,
    random_state=SEED,
)

Results stored, retrieving...


Unnamed: 0,LR,RF,NN
0,0.760322,0.814451,0.847432
1,0.756294,0.809836,0.839711
2,0.761245,0.818479,0.841641
3,0.765777,0.817724,0.847516
4,0.759651,0.816717,0.845502


In [22]:
from utilities.preprocessing import slice_best_features

to_keep = 87

baselines(
    PREPROCESSING_PATH / pathlib.Path(f"important_{to_keep}.csv"),
    slice_best_features(X, best_features, to_keep),
    y,
    cv=5,
    random_state=SEED,
)

Results stored, retrieving...


Unnamed: 0,LR,RF,NN
0,0.761413,0.814871,0.84995
1,0.755791,0.807066,0.844243
2,0.763008,0.81999,0.849782
3,0.767372,0.819151,0.846845
4,0.760154,0.813025,0.845082


In [23]:
from utilities.preprocessing import slice_best_features

to_keep = 50

baselines(
    PREPROCESSING_PATH / pathlib.Path(f"important_{to_keep}.csv"),
    slice_best_features(X, best_features, to_keep),
    y,
    cv=5,
    random_state=SEED,
)

Results stored, retrieving...


Unnamed: 0,LR,RF,NN
0,0.752685,0.814619,0.847852
1,0.750671,0.808996,0.842145
2,0.753357,0.81571,0.845418
3,0.75898,0.815542,0.844075
4,0.753441,0.81101,0.840131


Feature counts from (approximately) `50` to `96` seem to give varying, yet high-quality results.

I will ensemble datasets with up to `96` features in the `predict` notebook.

## Higher order features

One can improve classification power using polynomial features as well (though such a naive scheme is unlikely to help in the case of ANNs).

To get a sense of how well this works, I will build second-order features from the `30` best features found:

In [24]:
from utilities.preprocessing import slice_best_features

to_keep = 30
sliced_X = slice_best_features(X, best_features, to_keep)

baselines(
    PREPROCESSING_PATH / pathlib.Path(f"important_{to_keep}.csv"),
    sliced_X,
    y,
    cv=5,
    random_state=SEED,
)

Results stored, retrieving...


Unnamed: 0,LR,RF,NN
0,0.730195,0.809584,0.835347
1,0.725327,0.802534,0.829221
2,0.735146,0.813025,0.831235
3,0.739426,0.807654,0.833921
4,0.73523,0.804716,0.827375


In [25]:
from sklearn.preprocessing import PolynomialFeatures

feature_creator = PolynomialFeatures(degree=2, include_bias=False)
polynomial_data = feature_creator.fit_transform(sliced_X)

polynomial_data.shape

(59580, 495)
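The `495` columns follow from simple counting: with `n` inputs and no bias term, degree 2 gives `n` linear terms plus `n(n+1)/2` squares and pairwise products. A quick check on random data (only the column count matters here):

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 30
# n linear terms + C(n + 1, 2) squares and pairwise products, no bias column.
expected = n_features + comb(n_features + 1, 2)

X_demo = np.random.default_rng(0).normal(size=(4, n_features))
transformer = PolynomialFeatures(degree=2, include_bias=False)
n_generated = transformer.fit_transform(X_demo).shape[1]

print(expected, n_generated)
```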

In [26]:
baselines(
    PREPROCESSING_PATH / pathlib.Path("polynomial.csv"),
    polynomial_data,
    y,
    cv=5,
    random_state=SEED,
)

Results stored, retrieving...


Unnamed: 0,LR,RF,NN
0,0.822172,0.810339,0.830648
1,0.822256,0.794982,0.821249
2,0.81999,0.811094,0.833333
3,0.820577,0.808493,0.826452
4,0.823095,0.804045,0.825109


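Second-order features substantially improved logistic regression (from roughly `0.73` to roughly `0.82`), but slightly harmed the neural network's performance in most folds (probably due to the added noise).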

# Save data after preprocessing

Two approaches yielded improvement, namely:
- Best feature selection
- Removing low variance features

Those two will be saved: the `96` most important features from the ranking, and the features with variance above `0.05` (`95` of them, according to the count printed earlier).

Based on these experiments, feature counts from `50` up to `96` will be used for training (similar to the feature subsampling in `RandomForest`), while a validation set of `20%` of the data will be used for the bagging part.

Both datasets are saved below:

In [33]:
best_features_algorithmic = slice_best_features(X, best_features, 96)

threshold = VarianceThreshold(0.05)
best_features_variance = pd.DataFrame(threshold.fit_transform(X))

best_features_algorithmic.to_csv(PREPROCESSING_PATH / pathlib.Path("algorithmic_data.csv"))
best_features_variance.to_csv(PREPROCESSING_PATH / pathlib.Path("variance_data.csv"))
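One detail to keep in mind when loading these files later: `to_csv` writes the index by default, and reading such a file back without `index_col` produces an extra `Unnamed: 0` column (which is probably where that column in the result tables of this notebook comes from). A small sketch with a made-up frame:

```python
import io

import pandas as pd

frame = pd.DataFrame({"LR": [0.77, 0.76], "RF": [0.80, 0.79]})

# Index written, then read back as an ordinary column.
with_index = pd.read_csv(io.StringIO(frame.to_csv()))
# Index not written at all.
without_index = pd.read_csv(io.StringIO(frame.to_csv(index=False)))
# Index written and restored as the index on load.
restored = pd.read_csv(io.StringIO(frame.to_csv()), index_col=0)

print(list(with_index.columns))
```

Either `index=False` on save or `index_col=0` on load avoids feeding the row index to a model as if it were a feature.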

### Removing outliers

Outlier removal might help the model generalize. There are presumably no outliers in this data, as it's artificially generated from Gaussian distributions. I didn't check this thoroughly, but the function is ready and should work fine.


In [28]:
# Not used in this notebook; imported in case outlier removal is needed
from utilities.preprocessing import find_outliers
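The source of `find_outliers` is not shown. Purely as an illustration, a simple z-score based check could look like the sketch below; the function name, the threshold of 4 standard deviations and the approach itself are my assumptions, not the actual implementation.

```python
import numpy as np


def find_outliers_sketch(X, z_threshold=4.0):
    """Hypothetical: mark rows where any feature is beyond z_threshold standard deviations."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns never flag anything
    z_scores = np.abs((X - X.mean(axis=0)) / std)
    return (z_scores > z_threshold).any(axis=1)


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
X_demo[0, 1] = 25.0  # plant one obvious outlier

mask = find_outliers_sketch(X_demo)
print(mask.sum(), mask[0])
```

For Gaussian data such a check should flag very few rows, which would be consistent with the expectation that there is little to remove here.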