# Custom Dataset

This notebook demonstrates how to construct a custom dataset from previously loaded data.

---

Add the repository root to your Python path and import the required packages and components from the Model Agnostic Toolkit.

In [12]:
import sys
sys.path.insert(0, '..')

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from model_agnostic_toolkit import Model, ImportanceAnalyzer
from model_agnostic_toolkit.datasets import PreloadedDataset
from model_agnostic_toolkit.tools import VIPImportance
from model_agnostic_toolkit.types import DataType

Load your custom data the way you normally would. Make sure you have access to all feature variable data (X) and target data (y).

> **Note**: Here we just generate some random data (5 features and 100 samples) using NumPy.

In [13]:
X = np.random.random_sample(size=(5, 100))
y = X[0] + 2 * X[1] + X[2] * X[3]

print('X:', X.shape)
print('y:', y.shape)

X: (5, 100)
y: (100,)
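
The data above is drawn without a fixed seed, so every run produces different values. A minimal sketch of a reproducible variant follows, assuming NumPy's `Generator` API suits your setup. The seed value and the names `X_demo` and `y_demo` are arbitrary choices for illustration and do not appear in the notebook.

```python
import numpy as np

# Seeded generator so the same random data is produced on every run.
rng = np.random.default_rng(42)  # the seed 42 is an arbitrary choice

# Same layout as in the cell above: one row per feature, one column per sample.
X_demo = rng.random(size=(5, 100))

# Same target as above: linear terms for features 0 and 1 plus an
# interaction of features 2 and 3. Feature 4 does not enter the target.
y_demo = X_demo[0] + 2 * X_demo[1] + X_demo[2] * X_demo[3]

print('X_demo:', X_demo.shape)
print('y_demo:', y_demo.shape)
```

Fixing the seed makes later results, such as the computed feature importances, comparable between runs.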


Create a pandas data frame from your raw data matrix if it isn't already loaded as a data frame. Transpose the matrix if necessary so that rows correspond to samples and columns to features. Make sure to add sensible labels for your variables.

Similarly, create a pandas series from your raw target vector if it isn't already loaded as a series.

> **Note**: Some of the following steps may not be necessary in your case, depending on how your loaded data is formatted.

In [14]:
X = pd.DataFrame(X.T, columns=[f'base_{i}' for i in range(5)])

X

| | base_0 | base_1 | base_2 | base_3 | base_4 |
|---|---|---|---|---|---|
| 0 | 0.694067 | 0.070527 | 0.793122 | 0.069495 | 0.132887 |
| 1 | 0.923867 | 0.420598 | 0.969987 | 0.449187 | 0.183908 |
| 2 | 0.215082 | 0.540979 | 0.270988 | 0.075068 | 0.978858 |
| 3 | 0.908609 | 0.237584 | 0.809504 | 0.169581 | 0.010859 |
| 4 | 0.909004 | 0.213110 | 0.666893 | 0.394594 | 0.163360 |
| ... | ... | ... | ... | ... | ... |
| 95 | 0.587455 | 0.966329 | 0.836645 | 0.184127 | 0.655776 |
| 96 | 0.446756 | 0.625371 | 0.203727 | 0.685625 | 0.283182 |
| 97 | 0.015987 | 0.449222 | 0.858094 | 0.854437 | 0.153337 |
| 98 | 0.846246 | 0.688231 | 0.895720 | 0.997211 | 0.231015 |


In [15]:
y = pd.Series(y, name='target')

y

0     0.890239
1     2.200769
2     1.317383
3     1.521053
4     1.598376
        ...   
95    2.674161
96    1.837178
97    1.647619
98    3.115931
99    2.752832
Name: target, Length: 100, dtype: float64
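
The conversion steps above can be bundled into one function that also checks the orientation of the data matrix. This is only a sketch: `to_frame_and_series` is a hypothetical helper, not part of pandas or the Model Agnostic Toolkit, and it assumes the target vector has exactly one entry per sample.

```python
import numpy as np
import pandas as pd

def to_frame_and_series(X_raw, y_raw, feature_names, target_name='target'):
    """Hypothetical helper: wrap raw arrays as (DataFrame, Series), samples as rows."""
    X_raw = np.asarray(X_raw)
    y_raw = np.asarray(y_raw)
    if X_raw.ndim != 2 or y_raw.ndim != 1:
        raise ValueError('expected a 2-D data matrix and a 1-D target vector')
    n_samples = y_raw.shape[0]

    # Transpose if the matrix is laid out as (features, samples).
    # A square matrix is ambiguous and is left unchanged.
    if X_raw.shape[0] != n_samples and X_raw.shape[1] == n_samples:
        X_raw = X_raw.T

    expected = (n_samples, len(feature_names))
    if X_raw.shape != expected:
        raise ValueError(f'data matrix has shape {X_raw.shape}, expected {expected}')

    X_frame = pd.DataFrame(X_raw, columns=list(feature_names))
    y_series = pd.Series(y_raw, name=target_name)
    return X_frame, y_series

# Usage with a (features, samples) matrix like the one generated earlier.
names = [f'base_{i}' for i in range(5)]
X_frame, y_series = to_frame_and_series(np.zeros((5, 100)), np.zeros(100), names)
print(X_frame.shape, y_series.shape)
```

Raising an error on a shape mismatch surfaces a wrongly oriented matrix or a missing label early, rather than during the analysis.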

Randomly split both your data frame (X) and your target series (y) into training and testing sets. Choose a test set size that suits your use case; here, 20% of the samples are held out for testing.

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print('X_train:', X_train.shape)
print('y_train:', y_train.shape)
print('X_test:', X_test.shape)
print('y_test:', y_test.shape)

X_train: (80, 5)
y_train: (80,)
X_test: (20, 5)
y_test: (20,)
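
The split above changes on every run because no seed is given. The sketch below shows a repeatable variant that uses the `random_state` parameter of scikit-learn's `train_test_split`. The stand-in data, the seed values, and the variable names are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape and labels as in this notebook.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.random(size=(100, 5)),
                      columns=[f'base_{i}' for i in range(5)])
y_demo = (X_demo['base_0'] + 2 * X_demo['base_1']).rename('target')

# A fixed random_state yields the same partition on every call.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Repeating the call with the same seed reproduces the same rows.
X_tr_again, _, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

print('X_tr:', X_tr.shape, 'X_te:', X_te.shape)
print('same split:', X_tr.index.equals(X_tr_again.index))
```

Because `X_demo` and `y_demo` are split together in one call, each training row keeps its matching target value.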


Create your final dataset by instantiating `PreloadedDataset` with the training and testing parts of X and y. Additionally, specify the data type, depending on whether the dataset will be used for a classification or a regression task.

In [17]:
data = PreloadedDataset(X_train, y_train, X_test, y_test, data_type=DataType.REGRESSION)

Creating dataset from preloaded data ... done.
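
Before handing the four parts to `PreloadedDataset`, it can help to confirm that they fit together. The following sketch uses pandas only; `check_split` is a hypothetical helper, not part of the Model Agnostic Toolkit, and the toolkit may already perform checks of its own.

```python
import pandas as pd

def check_split(X_train, y_train, X_test, y_test):
    """Hypothetical sanity check: return a list of problems (empty if none found)."""
    problems = []
    if not X_train.index.equals(y_train.index):
        problems.append('X_train and y_train have different indices')
    if not X_test.index.equals(y_test.index):
        problems.append('X_test and y_test have different indices')
    if list(X_train.columns) != list(X_test.columns):
        problems.append('X_train and X_test have different columns')
    if len(X_train.index.intersection(X_test.index)) > 0:
        problems.append('training and testing sets share row labels')
    return problems

# Tiny stand-in split: rows 0-2 for training, rows 3-4 for testing.
frame = pd.DataFrame({'base_0': [0.1, 0.2, 0.3, 0.4, 0.5]})
target = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name='target')

ok = check_split(frame.iloc[:3], target.iloc[:3], frame.iloc[3:], target.iloc[3:])
bad = check_split(frame.iloc[:3], target.iloc[:3], frame.iloc[2:], target.iloc[3:])
print('consistent split:', ok)
print('inconsistent split:', bad)
```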


The dataset can now be used for feature importance analysis in the usual way.

In [18]:
model = Model(data_type=data.data_type)
tools = [VIPImportance()]
ana = ImportanceAnalyzer(model=model, dataset=data, tools=tools)

Created XGBRegressor.


In [19]:
ana.run_analysis(features=data.features)
ana.plot_results()

You can still access the training, testing, or complete data via the dataset's `get_train()`, `get_test()` and `get_all()` methods.

In [20]:
X_train, y_train = data.get_train()

X_train

| | base_0 | base_1 | base_2 | base_3 | base_4 |
|---|---|---|---|---|---|
| 0 | 0.694067 | 0.070527 | 0.793122 | 0.069495 | 0.132887 |
| 16 | 0.108013 | 0.486905 | 0.063222 | 0.453542 | 0.200168 |
| 64 | 0.066589 | 0.968346 | 0.302472 | 0.130364 | 0.059088 |
| 10 | 0.440706 | 0.449126 | 0.661917 | 0.237818 | 0.401107 |
| 91 | 0.699248 | 0.617938 | 0.708576 | 0.359316 | 0.588720 |
| ... | ... | ... | ... | ... | ... |
| 63 | 0.275062 | 0.713175 | 0.613112 | 0.673775 | 0.045426 |
| 50 | 0.416080 | 0.167800 | 0.453828 | 0.045818 | 0.907052 |
| 41 | 0.504906 | 0.496255 | 0.533552 | 0.082222 | 0.551781 |
| 7 | 0.218742 | 0.268188 | 0.960620 | 0.778194 | 0.627592 |
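
The row labels in this output (0, 16, 64, ...) are the original labels from before the random split, not consecutive positions. The sketch below illustrates the difference between label-based and position-based access in pandas, using the first three `base_0` values shown above. The variable names are arbitrary.

```python
import pandas as pd

# Stand-in frame holding the first three base_0 values from the output above,
# with the shuffled row labels that the random split left behind.
frame = pd.DataFrame({'base_0': [0.694067, 0.108013, 0.066589]},
                     index=[0, 16, 64])

by_label = frame.loc[16, 'base_0']      # row whose original label is 16
by_position = frame.iloc[1]['base_0']   # second row in the current order

# Optional: discard the original labels and renumber the rows from 0.
renumbered = frame.reset_index(drop=True)

print(by_label, by_position)
print(list(renumbered.index))
```

Keeping the original labels makes it possible to trace a training row back to the full data frame. Renumbering is only worthwhile if later code relies on consecutive positions.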
