# Datasets

Within CK, data for machine learning and other analyses are represented using a `Dataset`.

Logically, a dataset is a table of data where each column represents a random variable and each row represents an _instance_ (also known as a _sample_, _data point_, or _joint state_).

The _length_ of a dataset is the number of instances in the dataset. In CK, the length of a dataset is determined at construction time, and cannot be modified. Random variables can be added to and removed from a dataset.

Each instance in a dataset has an instance weight. An instance weight represents a _weight of evidence_. The value of an instance weight is notionally 1, but can be set to other values to represent multiple or fractional evidence. This is explained in detail below.

A CK dataset is either a `HardDataset` or a `SoftDataset`. These are explained in the following sections, with reference to the following simple PGM.

In [1]:
from ck.pgm import PGM, RandomVariable

pgm: PGM = PGM()
difficult: RandomVariable = pgm.new_rv('difficult', ('Yes', 'No'))
intelligent: RandomVariable = pgm.new_rv('intelligent', ('Yes', 'No'))
grade: RandomVariable = pgm.new_rv('grade', ('1', '2', '3'))
sat: RandomVariable = pgm.new_rv('sat', ('High', 'Low'))
letter: RandomVariable = pgm.new_rv('letter', ('Yes', 'No'))


## Hard Datasets

A hard dataset represents instances of hard evidence for a collection of random variables. That is, for each instance, each random variable is in a specific state.

There are multiple ways to initialise a dataset. Here we create a dataset of length 5, with no random variables.

In [2]:
from ck.dataset import HardDataset

# Create a dataset with 5 instances and no random variables.
dataset = HardDataset(length=5)

len(dataset)


5

This dataset has no random variables.

In [3]:
dataset.rvs

()

## Updating Dataset Random Variables

Each random variable of a hard dataset is associated with a series, which is a 1D numpy array.

The length of the series is the same as the length of the dataset, and the *i*th element holds the state index of the random variable for the *i*th instance.

One way to add a random variable to a dataset is to have the dataset allocate memory for the series, and initialise it to zero.

In [4]:
letter_series = dataset.add_rv(letter)

print(*dataset.rvs)

letter


In [5]:
letter_series

array([0, 0, 0, 0, 0], dtype=uint8)

A dataset series is mutable.

In [6]:
letter_series[2] = 1
letter_series[3] = 1

The series for a random variable can be accessed using the `states` method.

In [7]:
dataset.states(letter)

array([0, 0, 1, 1, 0], dtype=uint8)

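A random variable can be added together with its state values, given as state indices. If a numpy array is given, the dataset keeps a reference to the data; otherwise the data is copied. The following cells show that a later change to `difficult_data` is visible through the dataset.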

In [8]:
import numpy as np

difficult_data = np.array([1, 1, 0, 1, 0])
dataset.add_rv_from_state_idxs(difficult, difficult_data)

print(*dataset.rvs)

letter difficult


In [9]:
dataset.states(difficult)

array([1, 1, 0, 1, 0])

In [10]:
difficult_data[0] = 0

dataset.states(difficult)

array([0, 1, 0, 1, 0])

A random variable can also be added using the states of a random variable, rather than state indices.

In [11]:
sat_series = dataset.add_rv_from_states(sat, ['High', 'Low', 'High', 'Low', 'High'])

sat_series

array([0, 1, 0, 1, 0], dtype=uint8)

In [12]:
print(*dataset.rvs)

letter difficult sat
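
The conversion from states to state indices performed by `add_rv_from_states` can be pictured with plain Python. This is only an illustrative sketch, not CK code: `sat_states` and `state_to_idx` are hypothetical names, and the sketch assumes that a state index is the position of the state in the tuple passed to `new_rv`.

```python
# Hypothetical stand-in for the ordered states of the `sat` random variable.
sat_states = ('High', 'Low')

# Assumption: a state index is the position of the state in the states tuple.
state_to_idx = {state: idx for idx, state in enumerate(sat_states)}

observed = ['High', 'Low', 'High', 'Low', 'High']
state_idxs = [state_to_idx[state] for state in observed]

print(state_idxs)  # [0, 1, 0, 1, 0]
```

The printed indices match the series returned by `add_rv_from_states` above.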


A random variable can be removed from a dataset.

In [13]:
dataset.remove_rv(difficult)

print(*dataset.rvs)

letter sat


## Instance Weights

If two or more instances in a dataset are identical, then mathematically the dataset may be represented equally well by combining the identical instances and recording how many times each instance occurred.

For example, this dataset

| difficult | intelligent | grade | high_sat | letter |
|-----------|-------------|-------|----------|--------|
| 0         | 1           | 2     | 0        | 1      |
| 1         | 1           | 2     | 0        | 1      |
| 1         | 1           | 2     | 0        | 1      |
| 0         | 1           | 1     | 1        | 0      |
| 1         | 1           | 0     | 0        | 0      |
| 1         | 1           | 0     | 0        | 0      |

may be represented using weighted instances as

| difficult | intelligent | grade | high_sat | letter | _weight_ |
|-----------|-------------|-------|----------|--------|----------|
| 0         | 1           | 2     | 0        | 1      | 1        |
| 1         | 1           | 2     | 0        | 1      | 2        |
| 0         | 1           | 1     | 1        | 0      | 1        |
| 1         | 1           | 0     | 0        | 0      | 2        |

Instances of weight zero have no effect on the meaning of a dataset and can be removed. So this example dataset is also equivalent to the above.

| difficult | intelligent | grade | high_sat | letter | _weight_ |
|-----------|-------------|-------|----------|--------|----------|
| 0         | 1           | 2     | 0        | 1      | 1        |
| 1         | 1           | 2     | 0        | 1      | 2        |
| 0         | 1           | 1     | 1        | 0      | 1        |
| 1         | 0           | 0     | 0        | 1      | 0        |
| 1         | 1           | 0     | 0        | 0      | 2        |
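
The equivalences shown in the tables above can be checked with a short sketch in plain Python. It does not use the CK API; instances are represented as tuples of state indices and `collections.Counter` plays the role of the weight column.

```python
from collections import Counter

# Each tuple is one instance: (difficult, intelligent, grade, high_sat, letter).
instances = [
    (0, 1, 2, 0, 1),
    (1, 1, 2, 0, 1),
    (1, 1, 2, 0, 1),
    (0, 1, 1, 1, 0),
    (1, 1, 0, 0, 0),
    (1, 1, 0, 0, 0),
]

# Combine identical instances; the count becomes the instance weight.
weighted = dict(Counter(instances))

# Adding an instance of weight zero, then dropping zero-weight instances,
# gives back the same weighted dataset.
with_zero = dict(weighted)
with_zero[(1, 0, 0, 0, 1)] = 0
without_zero = {inst: w for inst, w in with_zero.items() if w != 0}

for instance, weight in weighted.items():
    print(instance, weight)
```

The total weight of the combined form equals the number of rows in the original table.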


The instance weights of a dataset are accessed using the `weights` property, which returns a 1D numpy array with the same length as the dataset.

In [14]:
dataset.weights

array([1., 1., 1., 1., 1.])

The instance weights of a dataset are directly modifiable.

In [15]:
dataset.weights[3] = 2

dataset.weights

array([1., 1., 1., 2., 1.])

The notion of an instance weight may be generalised further by introducing the concept of _weight of evidence_. A dataset represents a _weight of evidence_ that is the count of observations that led to the data.

Multiplying the instance weights of a dataset by a constant value does not change the empirical joint probability distribution; it only changes the weight of evidence. We use this principle to define fractional weights.

For a dataset with fractional weighted instances, there is some constant such that when the instance weights of a dataset are multiplied by the constant, then all instance weights are integer. The dataset with fractional weighted instances is equivalent to the dataset with integer weighted instances, but with a weight of evidence proportionally reduced by the constant.
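
As a minimal sketch of this principle (plain Python with exact fractions, not CK code; the weights and the helper `empirical_distribution` are made up for illustration), scaling fractional weights to integers leaves the empirical distribution unchanged and scales only the total weight:

```python
from fractions import Fraction


def empirical_distribution(weighted_instances):
    """Divide each instance weight by the total weight."""
    total = sum(weighted_instances.values())
    return {inst: Fraction(w) / total for inst, w in weighted_instances.items()}


# Hypothetical fractionally weighted instances over (letter, sat) state indices.
fractional = {
    (0, 0): Fraction(1, 2),
    (0, 1): Fraction(1, 4),
    (1, 1): Fraction(3, 4),
}

# Multiplying every weight by the constant 4 makes all weights integers.
constant = 4
integer = {inst: w * constant for inst, w in fractional.items()}

same_distribution = empirical_distribution(fractional) == empirical_distribution(integer)
print(same_distribution)  # True
```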

There is a Bayesian interpretation for re-weighting evidence based on conditioning events. Intuitively, it may be understood as discounting evidence represented by a dataset due to some relative misgivings about the collection process.

Note that for CK, negative weights have no interpretation.


The total instance weight can be calculated. Naturally, if the instance weights are all 1, then the total weight is the same as the length of the dataset.

In [16]:
dataset.total_weight()

6.0

## Hard Dataset Dump

A dump of a dataset can be printed, showing state indexes and weights. This is intended for debugging, ad hoc testing and pedagogical purposes.

In [17]:
dataset.dump()

rvs: [letter, sat]
instances (5, with total weight 6.0):
(0, 0) * 1.0
(0, 1) * 1.0
(1, 0) * 1.0
(1, 1) * 2.0
(0, 0) * 1.0


A hard dataset dump can show the states directly, using the `as_states` keyword argument.


In [18]:
dataset.dump(as_states=True)

rvs: [letter, sat]
instances (5, with total weight 6.0):
('Yes', 'High') * 1.0
('Yes', 'Low') * 1.0
('No', 'High') * 1.0
('No', 'Low') * 2.0
('Yes', 'High') * 1.0


## Soft Datasets

A soft dataset represents instances of soft evidence for a set of random variables.
For each instance, there is a distribution over the states for each random variable.

Traditionally, each instance in a dataset records a single state for each random variable. In CK, a value of a random variable may be a distribution over the possible states of a random variable. When a random variable has a single state value for an instance, then the distribution has probability = 1 for one state, and all other states have zero probability.
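
For illustration only, the distribution corresponding to a single observed state can be written as a small helper. The function `one_hot` is a hypothetical name and is not part of CK.

```python
def one_hot(state_idx, number_of_states):
    """Return the distribution with probability 1 on one state and 0 elsewhere."""
    return [1.0 if i == state_idx else 0.0 for i in range(number_of_states)]


# `grade` has three states; state index 2 is observed with certainty.
print(one_hot(2, 3))  # [0.0, 0.0, 1.0]
```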

As with a hard dataset, the "length" of a dataset is the number of instances in the dataset. The length is determined at construction time, and cannot be modified.
Similarly, random variables can be added to and removed from a soft dataset.

There are multiple ways to initialise a soft dataset. Here we create a dataset with two random variables included at construction time.

In [19]:
from ck.dataset import SoftDataset

letter_data = np.array([
    [0.6, 0.4],
    [1.0, 0.0],
    [0.0, 1.0],
    [0.3, 0.7],
    [0.9, 0.1],
])

grade_data = np.array([
    [0.6, 0.3, 0.1],
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [0.3, 0.4, 0.3],
    [0.8, 0.1, 0.1],
])


dataset = SoftDataset([
    (letter, letter_data),
    (grade, grade_data),
])

len(dataset)

5

As with a hard dataset, a soft dataset has weighted instances.

In [20]:
dataset.weights

array([1., 1., 1., 1., 1.])

The instance weights can be updated.

In [21]:
dataset.weights[1] = 2
dataset.weights[3] = 0
dataset.weights[4] = 2

dataset.weights

array([1., 2., 1., 0., 2.])

And the total instance weight can be calculated.

In [22]:
dataset.total_weight()

6.0

When dumping a soft dataset, the state weights are printed.

In [23]:
dataset.dump()

rvs: [letter, grade]
instances (5, with total weight 6.0):
([0.6 0.4], [0.6 0.3 0.1]) * 1.0
([1. 0.], [0. 0. 1.]) * 2.0
([0. 1.], [0. 1. 0.]) * 1.0
([0.3 0.7], [0.3 0.4 0.3]) * 0.0
([0.9 0.1], [0.8 0.1 0.1]) * 2.0


The state weights for a random variable and instance represent _soft evidence_. The intuitive interpretation of soft evidence is that it represents uncertainty in the true value of the random variable for that instance.

Formally, a soft evidence value is equivalent to replicating the instance, once for each possible value of the attribute, then multiplying the weight of the instance by the probability of the value.

For example,

| difficult | intelligent | grade | high_sat | letter | _weight_ |
|-----------|-------------|-------|----------|--------|----------|
| 0         | (0.2, 0.8)  | 2     | 0        | 1      | _w_      |

is equivalent to

| difficult | intelligent | grade | high_sat | letter | _weight_  |
|-----------|-------------|-------|----------|--------|-----------|
| 0         | 0           | 2     | 0        | 1      | _w_ × 0.2 |
| 0         | 1           | 2     | 0        | 1      | _w_ × 0.8 |

This interpretation is consistent for any non-negative weighting of values, not just probability distributions.
However, it is not recommended to use soft evidence except as distributions over attribute values as it may make the weight of evidence difficult to see, even though the semantics is well-defined.

For example,

| difficult | intelligent | grade | high_sat | letter | _weight_ |
|-----------|-------------|-------|----------|--------|----------|
| 0         | (2.4, 0.1)  | 2     | 0        | 1      | _w_      |

is equivalent to

| difficult | intelligent | grade | high_sat | letter | _weight_  |
|-----------|-------------|-------|----------|--------|-----------|
| 0         | 0           | 2     | 0        | 1      | _w_ × 2.4 |
| 0         | 1           | 2     | 0        | 1      | _w_ × 0.1 |


The formulation of soft evidence remains consistent when there are multiple attributes with soft evidence.

For example,

| difficult | intelligent | grade  | high_sat   | letter  | _weight_ |
|-----------|-------------|--------|------------|---------|----------|
| 0         | (0.6, 0.4)  | 2      | (0.1, 0.9) | 1       | _w_      |

is equivalent to

| difficult | intelligent | grade  | high_sat | letter  | _weight_        |
|-----------|-------------|--------|----------|---------|-----------------|
| 0         | 0           | 2      | 0        | 1       | _w_ × 0.6 × 0.1 |
| 0         | 0           | 2      | 1        | 1       | _w_ × 0.6 × 0.9 |
| 0         | 1           | 2      | 0        | 1       | _w_ × 0.4 × 0.1 |
| 0         | 1           | 2      | 1        | 1       | _w_ × 0.4 × 0.9 |
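
The expansion in the last example can be sketched in plain Python. This is an illustration of the stated semantics rather than CK code: `expand_soft_instance` is a hypothetical helper, hard values are written as one-hot state weights, and _w_ is taken to be 1.

```python
from itertools import product
from math import prod


def expand_soft_instance(state_weights, weight):
    """Expand one soft instance into weighted hard instances.

    `state_weights` holds one sequence of state weights per random variable.
    """
    hard = {}
    index_ranges = [range(len(weights)) for weights in state_weights]
    for state_idxs in product(*index_ranges):
        w = weight * prod(sw[i] for sw, i in zip(state_weights, state_idxs))
        if w != 0:
            hard[state_idxs] = w
    return hard


# (difficult, intelligent, grade, high_sat, letter), matching the table above.
soft = [(1, 0), (0.6, 0.4), (0, 0, 1), (0.1, 0.9), (0, 1)]
expanded = expand_soft_instance(soft, 1.0)

for state_idxs, w in expanded.items():
    print(state_idxs, round(w, 6))
```

Because every state weight here is a distribution, the weights of the expanded instances sum to the original instance weight.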


Here is how we can create a dataset with soft evidence (with a single instance weight of 10).

In [24]:
dataset = SoftDataset(weights=[10])

dataset.add_rv(difficult)[0,:] = (1, 0)         # difficult = 'Yes'
dataset.add_rv(intelligent)[0,:] = (2.4, 0.1)   # intelligent = (2.4, 0.1)
dataset.add_rv(grade)[0,:] = (0, 0, 1)          # grade = '3'
dataset.add_rv(sat)[0,:] = (0.1, 0.9)           # sat = (0.1, 0.9)
dataset.add_rv(letter)[0,:] = (0, 1)            # letter = 'No'

dataset.dump()

rvs: [difficult, intelligent, grade, sat, letter]
instances (1, with total weight 10):
([1. 0.], [2.4 0.1], [0. 0. 1.], [0.1 0.9], [0. 1.]) * 10


## Normalisation

A soft dataset is considered _normalised_ when, for each instance, either:
1. the instance weight is zero and the state weights for every random variable are zero, or
2. the instance weight is positive and the state weights for each random variable sum to one.

For example,

| difficult | intelligent | grade | high_sat   | letter | _weight_ |
|-----------|-------------|-------|------------|--------|----------|
| 0         | (2.4, 0.1)  | 2     | (0.1, 0.9) | 1      | 10       |

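is equivalent to the normalised dataset below, where the state weights of `intelligent` are divided by their sum (2.4 + 0.1 = 2.5) and the instance weight is multiplied by the same sum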

| difficult | intelligent  | grade  | high_sat   | letter  | _weight_ |
|-----------|--------------|--------|------------|---------|----------|
| 0         | (0.96, 0.04) | 2      | (0.1, 0.9) | 1       | 10 × 2.5 |
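
The arithmetic of this example can be sketched in plain Python. The helper `normalise_instance` is hypothetical and only mirrors the definition above; it is an assumption about the calculation, not the implementation of `SoftDataset.normalise`.

```python
def normalise_instance(state_weights, weight):
    """Rescale state weights to sum to one, moving the scale into the instance weight."""
    totals = [sum(weights) for weights in state_weights]
    new_weight = weight
    for total in totals:
        new_weight *= total
    if new_weight == 0:
        # Case 1 of the definition: zero instance weight, all state weights zero.
        return [[0.0] * len(weights) for weights in state_weights], 0.0
    normalised = [
        [w / total for w in weights]
        for weights, total in zip(state_weights, totals)
    ]
    return normalised, new_weight


# (difficult, intelligent, grade, high_sat, letter) with instance weight 10.
soft = [(1, 0), (2.4, 0.1), (0, 0, 1), (0.1, 0.9), (0, 1)]
normalised, new_weight = normalise_instance(soft, 10)

print([round(w, 6) for w in normalised[1]], round(new_weight, 6))
```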


In [25]:
dataset.dump()
print()

dataset.normalise()

dataset.dump()

rvs: [difficult, intelligent, grade, sat, letter]
instances (1, with total weight 10):
([1. 0.], [2.4 0.1], [0. 0. 1.], [0.1 0.9], [0. 1.]) * 10

rvs: [difficult, intelligent, grade, sat, letter]
instances (1, with total weight 25):
([1. 0.], [0.96 0.04], [0. 0. 1.], [0.1 0.9], [0. 1.]) * 25


## Sampled Datasets

A hard dataset can be created from a sampler.

In this example, a sampler is created using the "Student" example PGM.

In [26]:
from ck.example import Student
from ck.pgm_compiler import DEFAULT_PGM_COMPILER
from ck.pgm_circuit.wmc_program import WMCProgram

pgm = Student()
wmc = WMCProgram(DEFAULT_PGM_COMPILER(pgm))
sampler = wmc.sample_direct()

print(*sampler.rvs)

difficult intelligent grade sat letter


In [27]:
from ck.dataset.sampled_dataset import dataset_from_sampler

dataset = dataset_from_sampler(sampler, 20)

dataset.dump()

rvs: [difficult, intelligent, grade, sat, letter]
instances (20, with total weight 20.0):
(0, 1, 2, 1, 0) * 1.0
(0, 1, 2, 0, 0) * 1.0
(0, 0, 1, 0, 0) * 1.0
(0, 0, 1, 0, 0) * 1.0
(1, 1, 1, 1, 1) * 1.0
(0, 0, 1, 0, 1) * 1.0
(0, 1, 2, 0, 0) * 1.0
(0, 0, 1, 1, 0) * 1.0
(0, 0, 2, 0, 0) * 1.0
(0, 0, 2, 0, 0) * 1.0
(1, 0, 0, 0, 1) * 1.0
(0, 0, 0, 0, 1) * 1.0
(1, 1, 0, 1, 1) * 1.0
(0, 0, 0, 0, 1) * 1.0
(0, 0, 0, 0, 1) * 1.0
(1, 0, 0, 0, 1) * 1.0
(1, 0, 0, 0, 1) * 1.0
(1, 1, 2, 1, 0) * 1.0
(0, 0, 2, 0, 0) * 1.0
(1, 0, 0, 1, 1) * 1.0


## Cross Tables

A cross-table records the total weight for possible combinations of states for some random variables. A cross-table is a dictionary mapping from state indices of the cross-table random variables (as a tuple) to a weight (as a float).

Here is an example of a manually created cross-table.

In [28]:
from ck.dataset.cross_table import CrossTable, cross_table_from_dataset

crosstab = CrossTable(rvs=[difficult, intelligent, grade, sat, letter])

crosstab[(0, 0, 0, 0, 0)] = 3.4
crosstab[(0, 0, 0, 0, 1)] = 2.1
crosstab[(0, 1, 1, 0, 0)] = 9.3

for instance, weight in crosstab.items():
    print(instance, weight)

(0, 0, 0, 0, 0) 3.4
(0, 0, 0, 0, 1) 2.1
(0, 1, 1, 0, 0) 9.3


A cross-table can be constructed from a dataset.

This example uses the dataset created above using a sampler.

In [29]:
crosstab = cross_table_from_dataset(dataset)

print(*crosstab.rvs)
for instance, weight in crosstab.items():
    print(instance, weight)

difficult intelligent grade sat letter
(0, 1, 2, 1, 0) 1.0
(0, 1, 2, 0, 0) 2.0
(0, 0, 1, 0, 0) 2.0
(1, 1, 1, 1, 1) 1.0
(0, 0, 1, 0, 1) 1.0
(0, 0, 1, 1, 0) 1.0
(0, 0, 2, 0, 0) 3.0
(1, 0, 0, 0, 1) 3.0
(0, 0, 0, 0, 1) 3.0
(1, 1, 0, 1, 1) 1.0
(1, 1, 2, 1, 0) 1.0
(1, 0, 0, 1, 1) 1.0
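
Conceptually, building a cross-table from a hard dataset amounts to summing the instance weights for each distinct tuple of state indices. The following plain Python sketch illustrates that idea with made-up data; `aggregate_weights` is a hypothetical helper, not the implementation of `cross_table_from_dataset`.

```python
from collections import defaultdict


def aggregate_weights(instances, weights):
    """Sum instance weights for each distinct tuple of state indices."""
    table = defaultdict(float)
    for instance, weight in zip(instances, weights):
        if weight != 0:
            table[instance] += weight
    return dict(table)


# Made-up instances over (difficult, intelligent, grade, sat, letter).
instances = [(0, 1, 2, 0, 0), (0, 0, 1, 0, 0), (0, 1, 2, 0, 0), (1, 0, 0, 0, 1)]
weights = [1.0, 1.0, 1.0, 2.5]

table = aggregate_weights(instances, weights)

for instance, weight in table.items():
    print(instance, weight)
```

The total weight is preserved by the aggregation, which is one way to see why a cross-table carries the same information as the dataset it was built from.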


Observe that a cross-table is logically equivalent to a dataset. A hard dataset can be created directly from a cross-table using `dataset_from_cross_table`, as follows.

In [30]:
from ck.dataset.dataset_from_crosstable import dataset_from_cross_table

dataset_2 = dataset_from_cross_table(crosstab)

dataset_2.dump()

rvs: [difficult, intelligent, grade, sat, letter]
instances (12, with total weight 20.0):
(0, 1, 2, 1, 0) * 1.0
(0, 1, 2, 0, 0) * 2.0
(0, 0, 1, 0, 0) * 2.0
(1, 1, 1, 1, 1) * 1.0
(0, 0, 1, 0, 1) * 1.0
(0, 0, 1, 1, 0) * 1.0
(0, 0, 2, 0, 0) * 3.0
(1, 0, 0, 0, 1) * 3.0
(0, 0, 0, 0, 1) * 3.0
(1, 1, 0, 1, 1) * 1.0
(1, 1, 2, 1, 0) * 1.0
(1, 0, 0, 1, 1) * 1.0


A cross-table can also be sampled directly using a `CrossTableSampler`. Instances will be drawn from the sampler according to their weight in the given cross-table.

Note that if the given cross-table is modified after constructing the sampler, the sampler will not be affected.

In [31]:
from ck.dataset.sampled_dataset import CrossTableSampler

crosstab_sampler = CrossTableSampler(crosstab)

for inst in crosstab_sampler.take(8):
    print(inst)


(0, 0, 2, 0, 0)
(0, 0, 0, 0, 1)
(0, 0, 1, 0, 0)
(0, 0, 0, 0, 1)
(1, 1, 0, 1, 1)
(0, 0, 0, 0, 1)
(0, 0, 2, 0, 0)
(1, 1, 1, 1, 1)
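
Drawing instances in proportion to their cross-table weight can be sketched with the standard library. This is an illustration of weighted sampling in general, under the assumption that a plain dictionary stands in for the cross-table; it is not how `CrossTableSampler` is implemented.

```python
import random

# A plain dictionary standing in for a cross-table (same values as the manual example).
crosstab = {
    (0, 0, 0, 0, 0): 3.4,
    (0, 0, 0, 0, 1): 2.1,
    (0, 1, 1, 0, 0): 9.3,
}

# Snapshot the instances and weights, so that later changes to `crosstab`
# do not affect sampling.
instances = list(crosstab.keys())
weights = list(crosstab.values())

rng = random.Random(0)  # fixed seed, used here only for repeatability
samples = rng.choices(instances, weights=weights, k=8)

for inst in samples:
    print(inst)
```

Taking a snapshot of the keys and weights up front is one simple way to get the behaviour noted above, where modifying the cross-table afterwards leaves the sampler unaffected.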
