## Creating toy data with related attributes

The goal of this notebook is to generate data with many columns that are related in *some* way to a target value, so that the data can later be decomposed into fewer dimensions.

In [51]:
import numpy as np
from sklearn.datasets import make_classification
from numpy import savetxt

I could generate some nice data with numpy with specified covariances and very specific distributions... or just use `sklearn` to generate some toy data!

Using `sklearn` lets me control properties of the data set that matter when we reduce the data later on, such as how many features actually carry information about the target.
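As a rough illustration of what "informative" means here, the sketch below generates a comparable data set with `shuffle=False`. According to the `make_classification` documentation, without shuffling the informative features occupy the first `n_informative` columns and the remaining columns are filled with random noise. The names `X_demo`, `y_demo` and `n_informative_demo` are made up for this example and are not used elsewhere in the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification

n_informative_demo = 10  # hypothetical name, mirrors n_informative used later

# shuffle=False keeps the documented column order:
# informative features first, noise features last.
X_demo, y_demo = make_classification(n_samples=1000,
                                     n_features=100,
                                     n_informative=n_informative_demo,
                                     n_redundant=0,
                                     n_classes=2,
                                     flip_y=0.0,
                                     shuffle=False,
                                     random_state=123)

informative = X_demo[:, :n_informative_demo]
noise = X_demo[:, n_informative_demo:]

print(informative.shape, noise.shape)
# Informative columns are built from transformed, shifted clusters,
# so their average variance is typically larger than that of the noise columns.
print(informative.var(axis=0).mean(), noise.var(axis=0).mean())
```

With `shuffle=True`, as used for the actual data below, the same columns exist but their positions (and the row order) are randomly permuted.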

In [86]:
# Parameter set for generation
n = 1000
n_feat = 100
n_informative = 10 
n_redundant = 0  
n_classes = 2 # binary problem
flip_percent = 0.0 # fraction of labels randomly reassigned; larger values add label noise
class_balance = [0.4, 0.6] # proportion of samples per class, shape (n_classes,)

In [87]:
X, y = make_classification(n_samples=n,
                            n_features=n_feat,
                            n_informative=n_informative,
                            n_redundant=n_redundant,
                            n_classes=n_classes,
                            flip_y=flip_percent,
                            weights=class_balance,
                            shuffle=True,
                            random_state=123
                        )

# Is there the same number of rows?
X.shape[0] == y.shape[0]

True
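Beyond the row count, it can be worth confirming that the class proportions came out as requested. The following is a minimal, self-contained sketch that regenerates the data with the same settings (passing the weights as a flat list) and counts the labels; the names `X_chk`, `y_chk` and `proportions` are introduced only for this check.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same settings as the generation cell; weights passed as a flat list.
X_chk, y_chk = make_classification(n_samples=1000,
                                   n_features=100,
                                   n_informative=10,
                                   n_redundant=0,
                                   n_classes=2,
                                   flip_y=0.0,
                                   weights=[0.4, 0.6],
                                   shuffle=True,
                                   random_state=123)

# Count how often each label occurs and convert to proportions.
labels, counts = np.unique(y_chk, return_counts=True)
proportions = counts / counts.sum()
print(dict(zip(labels.tolist(), proportions.round(3).tolist())))
```

With `flip_y` set to zero the proportions should sit close to the requested 40/60 split; a non-zero `flip_y` would let them drift.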

In [88]:
# Combine to one data structure
all_data = np.c_[X, y]

In [89]:
# Sneak peek at the target class (last column)
all_data[:,-1][0:10]

array([1., 1., 1., 1., 1., 1., 0., 1., 0., 0.])

In [90]:
#savetxt("../data/high_dimensions.csv", all_data, delimiter=",")
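If the save is enabled, the file can be read back with `np.loadtxt` using the same delimiter. The sketch below shows the round trip on a small stand-in array written to a temporary directory, so it does not touch `../data/`; `demo_data` and the temporary file name are placeholders for this example only.

```python
import os
import tempfile

import numpy as np

# Small stand-in for all_data: three feature columns plus a 0/1 label column.
rng = np.random.default_rng(0)
demo_data = np.c_[rng.normal(size=(5, 3)), rng.integers(0, 2, size=5)]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.csv")  # placeholder file name
    np.savetxt(path, demo_data, delimiter=",")
    reloaded = np.loadtxt(path, delimiter=",")

# The reloaded array has the same shape and (up to float formatting) the same values.
print(reloaded.shape, np.allclose(demo_data, reloaded))
```

Note that `savetxt` writes no header row, so the label is identified only by position (the last column).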