In [None]:
import numpy as np
import pandas as pd
from maxvolpy.maxvol import rect_maxvol, maxvol
import random

## Does it drop redundant features?

There are $ 2 $ classes. To generate a sample, we randomly choose a class and then draw features for it; the two classes have different feature distributions. We generate $ n $ samples this way, each with $ k $ features. We then replicate each column $ r $ times, so that we have $ n $ samples with $ rk $ columns, and add noise to every column.

We then invoke maxvol to select $ k $ columns and count how many of them are duplicates of one another. The hypothesis we are checking is that it selects very few duplicates.
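
To see why the hypothesis is plausible: maxvol-type algorithms look for submatrices of large volume, which for a square submatrix is the absolute value of its determinant. The sketch below uses a small hand-made matrix (an assumption for illustration, not the notebook's data) and shows that replacing one column with a slightly perturbed copy of another makes the volume collapse.

```python
import numpy as np

# a hand-made 4 x 4 matrix with clearly distinct columns (illustrative only)
distinct = np.diag([1.0, 2.0, 3.0, 4.0]) + 0.5

# replace the last column with a slightly perturbed copy of the first one
with_duplicate = distinct.copy()
with_duplicate[:, -1] = distinct[:, 0] + 0.01 * np.array([1.0, -1.0, 1.0, -1.0])

# volume of a square matrix = |det|
vol_distinct = abs(np.linalg.det(distinct))
vol_duplicate = abs(np.linalg.det(with_duplicate))

print(vol_distinct, vol_duplicate)
```

A selection that contains two near-copies of the same feature therefore has a much smaller volume than one made of distinct features, so a volume-maximizing procedure should tend to avoid it.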

In [None]:
np.random.seed(82138123)
random.seed(82138123)

k = 4

# let all features of class 1 have mean -1
# and let all features of class 2 have mean 1

class_1_features_means = -1 * np.ones(k)
class_2_features_means = np.ones(k)

# now let's choose standard deviations for features
# for each feature its standard deviation will be the same no matter class 1 or class 2
min_feature_std = 0.5
max_feature_std = 1.5
features_stds = np.linspace(min_feature_std, max_feature_std, num=k)

n = 1000
class_1_prob = 0.75
class_1_num_samples = np.random.binomial(n, class_1_prob)
class_2_num_samples = n - class_1_num_samples

In [None]:
# now let's actually generate the data
class_1_dataset = features_stds * np.random.randn(class_1_num_samples, k) + class_1_features_means
class_2_dataset = features_stds * np.random.randn(class_2_num_samples, k) + class_2_features_means

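# r is the number of copies of each feature
r = 5
# build the whole dataset: np.tile repeats the k columns r times,
# so column j is a copy of original feature j % k
dataset = np.tile(
    np.concatenate((class_1_dataset, class_2_dataset), axis=0),
    r
)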
# add noise
noise_level = 0.1
noise_std = noise_level * (min_feature_std + max_feature_std) / 2
dataset += noise_std * np.random.randn(*dataset.shape)

We see that for each $ i \in \{ 0, \dots, k - 1 \} $ and each $ j \in \{ 1, \dots, r - 1 \} $ the correlation between column $ i $ and column $ i + jk $ is almost one.
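
The column layout behind this indexing comes from how `np.tile` repeats a 2-D array along its last axis. A tiny sketch on a toy array (the values are made up for illustration):

```python
import numpy as np

base = np.array([[1, 2],
                 [3, 4]])   # 2 samples, k = 2 features
tiled = np.tile(base, 3)    # r = 3 copies, laid out block by block

print(tiled)
# column j of the tiled array equals column j % k of the base array
k_toy = base.shape[1]
matches = [bool((tiled[:, j] == base[:, j % k_toy]).all()) for j in range(tiled.shape[1])]
print(matches)
```

This is why taking a column index modulo $ k $ recovers the original feature it was copied from.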

In [None]:
pd.DataFrame(dataset).corr()
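
The full correlation table is large, so it can help to summarize it with two numbers. The following is a self-contained sketch that regenerates data with the same parameters as above (using its own random generator and seed, so the values will not match the notebook's `dataset` exactly) and compares correlations between copies of the same feature against correlations between different features.

```python
import numpy as np

rng = np.random.default_rng(0)  # separate generator; an assumption of this sketch
k, r, n = 4, 5, 1000

# class 1 (mean -1) with probability 0.75, class 2 (mean +1) otherwise
is_class_1 = rng.random(n) < 0.75
means = np.where(is_class_1[:, None], -1.0, 1.0)
stds = np.linspace(0.5, 1.5, num=k)
base = stds * rng.standard_normal((n, k)) + means

# replicate the columns r times and add noise with standard deviation 0.1
data = np.tile(base, r) + 0.1 * rng.standard_normal((n, r * k))

corr = np.corrcoef(data, rowvar=False)  # shape (r * k, r * k)

cols = np.arange(r * k)
same_feature = (cols[:, None] % k) == (cols[None, :] % k)
off_diagonal = ~np.eye(r * k, dtype=bool)

min_duplicate_corr = corr[same_feature & off_diagonal].min()
max_other_corr = corr[~same_feature].max()
print(min_duplicate_corr, max_other_corr)
```

Note that different features are also noticeably correlated here, because they share the class-dependent mean; copies of the same feature are still correlated much more strongly.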

Now we want to choose the best 4-column submatrix using `rect_maxvol`.

First let's try simply choosing 4 random rows.

Then we'll try first choosing good rows with `rect_maxvol`, and then choosing good columns with `rect_maxvol` on those rows.

Then maybe try SVD truncation.

In [None]:
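def choose_features(dataset, n, k, samples_choice):
    # pick k samples (rows), then let rect_maxvol pick k features (columns) among them
    if samples_choice == "random":
        samples_subset_indices = np.random.choice(n, size=k, replace=False)
    elif samples_choice == "rect_maxvol":
        raise NotImplementedError("samples_choice='rect_maxvol' is not implemented yet")
    else:
        raise ValueError("Incorrect samples_choice parameter")
    # transpose so that features become the rows rect_maxvol chooses from
    submatrix = dataset[samples_subset_indices, :].T
    features_subset_indices = rect_maxvol(submatrix, minK=k, maxK=k, tol=0.05)[0]
    return features_subset_indices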

In [None]:
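def calculate_percentage_uniq_features(features_subset_indices):
    # column j is a noisy copy of original feature j % k (k is the global defined above);
    # the result is a fraction, and 1.0 means no duplicates were selected
    return len(np.unique(features_subset_indices % k)) / len(features_subset_indices)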
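
As a quick worked example of this metric, here is a sketch with hand-picked index sets. The helper `fraction_unique_features` is a hypothetical variant that takes `k` as an explicit argument instead of reading the global; it is not part of the notebook.

```python
import numpy as np

def fraction_unique_features(indices, k):
    """Fraction of selected columns that map to distinct original features."""
    indices = np.asarray(indices)
    return len(np.unique(indices % k)) / len(indices)

k_example = 4
# columns 0, 5, 10, 15 are copies of original features 0, 1, 2, 3
no_duplicates = fraction_unique_features([0, 5, 10, 15], k_example)
# columns 0, 4, 1, 7 are copies of original features 0, 0, 1, 3
with_duplicates = fraction_unique_features([0, 4, 1, 7], k_example)

print(no_duplicates, with_duplicates)  # 1.0 0.75
```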

In [None]:
chosen_features_indices = choose_features(dataset, n=n, k=k, samples_choice="random")
print(chosen_features_indices)
print(chosen_features_indices % k)
print(calculate_percentage_uniq_features(chosen_features_indices))