# Datasets

Within CK, data for machine learning and other analyses are represented using a `Dataset`.

Logically a dataset is a table of data where columns represent random variables and each row represents an _instance_ (also known as a _sample_, _record_, _row_, _datapoint_, or _joint states_).

The _length_ of a dataset is the number of instances in the dataset. In CK, the length of a dataset is determined at construction time, and cannot be modified. Random variables can be added to and removed from a dataset.

Each instance in a dataset has an instance weight. An instance weight represents a _weight of evidence_.
The value of an instance weight is notionally 1, but can be set to other values to represent multiple or fractional evidence.
This is explained in detail below.

The values and weights of a dataset are mutable and can be updated.

A CK dataset is either a `HardDataset` or `SoftDataset`. These are explained in the following sections, with reference to the following simple PGM.

In [1]:
from ck.pgm import PGM, RandomVariable

pgm: PGM = PGM()
difficult: RandomVariable = pgm.new_rv('difficult', ('Yes', 'No'))
intelligent: RandomVariable = pgm.new_rv('intelligent', ('Yes', 'No'))
grade: RandomVariable = pgm.new_rv('grade', ('1', '2', '3'))
sat: RandomVariable = pgm.new_rv('sat', ('High', 'Low'))
letter: RandomVariable = pgm.new_rv('letter', ('Yes', 'No'))


## Hard Datasets

A hard dataset represents instances of hard evidence for a collection of random variables. That is, for each instance, each random variable is in a specific state.

There are multiple ways to initialise a dataset. Here we create a dataset of length 5, with no random variables.

In [2]:
from ck.dataset import HardDataset

# Create a dataset of length 5 with no random variables.
dataset: HardDataset = HardDataset(length=5)

len(dataset)


5

This dataset has no random variables.

In [3]:
dataset.rvs

()

## Updating Datasets
Each random variable of a hard dataset is associated with a series, which is a 1D numpy array.

The length of the series is the same as the length of the dataset, and the *i*th element holds the state index of the random variable for the *i*th instance.

One way to add a random variable to a dataset is to have the dataset allocate memory for the series, and initialise it to zero.

In [4]:
letter_series = dataset.add_rv(letter)

print(*dataset.rvs)

letter


In [5]:
letter_series

array([0, 0, 0, 0, 0], dtype=uint8)

A dataset series is mutable.

In [6]:
letter_series[2] = 1
letter_series[3] = 1

The series for a random variable can be accessed using the `state_idxs` method.

In [7]:
dataset.state_idxs(letter)

array([0, 0, 1, 1, 0], dtype=uint8)

A random variable can also be added together with its data, given as state indexes. If a numpy array is given, then the dataset will keep a reference to the data; otherwise the data will be copied into a new numpy array.

In [8]:
import numpy as np

difficult_data = np.array([1, 1, 0, 1, 0])
dataset.add_rv_from_state_idxs(difficult, difficult_data)

print(*dataset.rvs)

letter difficult


In [9]:
dataset.state_idxs(difficult)

array([1, 1, 0, 1, 0])

In [10]:
difficult_data[0] = 0

dataset.state_idxs(difficult)

array([0, 1, 0, 1, 0])

A random variable can also be added using the states of a random variable, rather than state indices.

In [11]:
sat_series = dataset.add_rv_from_states(sat, ['High', 'Low', 'High', 'Low', 'High'])

sat_series

array([0, 1, 0, 1, 0], dtype=uint8)
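Conceptually, adding a random variable from states amounts to looking up the position of each state within the random variable's states. The following is a minimal plain-Python sketch of that lookup. It does not use CK, and `states_to_indices` is a hypothetical helper introduced only for illustration.

```python
from typing import Hashable, Sequence


def states_to_indices(rv_states: Sequence[Hashable], values: Sequence[Hashable]) -> list[int]:
    """Map each state value to its index in the random variable's states (hypothetical helper)."""
    index_of = {state: idx for idx, state in enumerate(rv_states)}
    return [index_of[value] for value in values]


sat_states = ('High', 'Low')
sat_idxs = states_to_indices(sat_states, ['High', 'Low', 'High', 'Low', 'High'])
print(sat_idxs)  # [0, 1, 0, 1, 0]
```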

In [12]:
print(*dataset.rvs)

letter difficult sat


A random variable can be removed from a dataset.

In [13]:
dataset.remove_rv(difficult)

print(*dataset.rvs)

letter sat


## Instance Weights

If two or more instances in a dataset are identical, then mathematically the dataset may be represented equally well by combining identical instances and noting how many times each instance was present.

For example, this dataset

| difficult | intelligent | grade | sat | letter |
|-----------|-------------|-------|-----|--------|
| 0         | 1           | 2     | 0   | 1      |
| 1         | 1           | 2     | 0   | 1      |
| 1         | 1           | 2     | 0   | 1      |
| 0         | 1           | 1     | 1   | 0      |
| 1         | 1           | 0     | 0   | 0      |
| 1         | 1           | 0     | 0   | 0      |

may be represented using weighted instances as

| difficult | intelligent | grade | sat | letter | _weight_ |
|-----------|-------------|-------|-----|--------|----------|
| 0         | 1           | 2     | 0   | 1      | 1        |
| 1         | 1           | 2     | 0   | 1      | 2        |
| 0         | 1           | 1     | 1   | 0      | 1        |
| 1         | 1           | 0     | 0   | 0      | 2        |

Instances of weight zero have no effect on the meaning of a dataset and can be removed. So this example dataset is also equivalent to the above.

| difficult | intelligent | grade | sat | letter | _weight_ |
|-----------|-------------|-------|-----|--------|----------|
| 0         | 1           | 2     | 0   | 1      | 1        |
| 1         | 1           | 2     | 0   | 1      | 2        |
| 0         | 1           | 1     | 1   | 0      | 1        |
| 1         | 0           | 0     | 0   | 1      | 0        |
| 1         | 1           | 0     | 0   | 0      | 2        |


The instance weights of a dataset are accessed using the `weights` property, which returns a 1D numpy array with the same length as the dataset.

In [14]:
dataset.weights

array([1., 1., 1., 1., 1.])

The instance weights of a dataset are directly modifiable.

In [15]:
dataset.weights[3] = 2

dataset.weights

array([1., 1., 1., 2., 1.])
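The equivalence between repeated instances and weighted instances, described at the start of this section, can be sketched in plain Python by counting identical rows. This sketch does not use CK; the data is the six-row example table from this section.

```python
from collections import Counter

# The six unweighted instances from the example table,
# as (difficult, intelligent, grade, sat, letter) state indices.
instances = [
    (0, 1, 2, 0, 1),
    (1, 1, 2, 0, 1),
    (1, 1, 2, 0, 1),
    (0, 1, 1, 1, 0),
    (1, 1, 0, 0, 0),
    (1, 1, 0, 0, 0),
]

# Counting identical instances gives one weighted instance per distinct row.
weighted = Counter(instances)

for instance, weight in weighted.items():
    print(instance, '*', weight)
```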

The notion of an instance weight may be generalised further by introducing the concept of _weight of evidence_. A dataset represents a _weight of evidence_ that is the count of observations which lead to the data.

Multiplying the instance weights of a dataset by a constant value does not change the empirical joint probability distribution, it only changes the weight of evidence. We use this principle to define fractional weights.

For a dataset with fractional weighted instances, there is some constant such that when the instance weights of a dataset are multiplied by the constant, then all instance weights are integer. The dataset with fractional weighted instances is equivalent to the dataset with integer weighted instances, but with a weight of evidence proportionally reduced by the constant.
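As a small illustration of this principle, the sketch below scales a set of fractional weights by a constant and checks that the empirical distribution is unchanged. It is plain Python rather than CK code; the weights and the helper `empirical_distribution` are made up for the example, and `Fraction` is used so that the comparison is exact.

```python
from fractions import Fraction


def empirical_distribution(weighted):
    """Normalise instance weights so that they sum to one (hypothetical helper)."""
    total = sum(weighted.values())
    return {instance: weight / total for instance, weight in weighted.items()}


# Made-up fractional weights for three distinct instances.
fractional = {
    (0, 1): Fraction(1, 2),
    (1, 1): Fraction(3, 2),
    (1, 0): Fraction(1, 4),
}

# Multiplying by 4 makes every weight an integer (2, 6 and 1).
constant = 4
integer = {instance: weight * constant for instance, weight in fractional.items()}

# Same empirical distribution, different weight of evidence.
print(empirical_distribution(fractional) == empirical_distribution(integer))
print(sum(fractional.values()), sum(integer.values()))
```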

There is a Bayesian interpretation for re-weighting evidence based on conditioning events. Intuitively, it may be understood as discounting evidence represented by a dataset due to some relative misgivings about the collection process.

Note that for CK, negative weights have no interpretation.


The total instance weight can be calculated. Naturally, if the instance weights are all 1, then the total weight is the same as the length of the dataset.

In [16]:
dataset.total_weight()

6.0

## Dataset Dump

A dump of a dataset can be printed, showing state indexes and weights. This is intended for debugging, ad hoc testing and pedagogical purposes.

In [17]:
dataset.dump()

rvs: [letter, sat]
instances (5, with total weight 6.0):
(0, 0) * 1.0
(0, 1) * 1.0
(1, 0) * 1.0
(1, 1) * 2.0
(0, 0) * 1.0


A hard dataset dump can show the states directly, using the `as_states` keyword argument.


In [18]:
dataset.dump(as_states=True)

rvs: [letter, sat]
instances (5, with total weight 6.0):
('Yes', 'High') * 1.0
('Yes', 'Low') * 1.0
('No', 'High') * 1.0
('No', 'Low') * 2.0
('Yes', 'High') * 1.0


## Soft Datasets

A soft dataset represents instances of soft evidence for a set of random variables.
For each instance, there is a distribution over the states for each random variable.

Traditionally, each instance in a dataset records a single state for each random variable. In CK, a value of a random variable may be a distribution over the possible states of a random variable. When a random variable has a single state value for an instance, then the distribution has probability = 1 for one state, and all other states have zero probability.
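A hard value can therefore be viewed as a "one-hot" distribution. The following plain-Python sketch shows the idea; `one_hot` is a hypothetical helper, not part of CK.

```python
def one_hot(state_idx: int, number_of_states: int) -> list[float]:
    """Distribution with probability 1 on `state_idx` and 0 on every other state."""
    return [1.0 if i == state_idx else 0.0 for i in range(number_of_states)]


# grade has three states; state index 2 becomes the distribution below.
print(one_hot(2, 3))  # [0.0, 0.0, 1.0]
```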

As with a hard dataset, the "length" of a dataset is the number of instances in the dataset. The length is determined at construction time, and cannot be modified.
Random variables can be added to and removed from a soft dataset.

There are multiple ways to initialise a soft dataset. Here we create a dataset with two random variables included at construction time.

In [19]:
from ck.dataset import SoftDataset

letter_data = [
    [0.6, 0.4],
    [1.0, 0.0],
    [0.0, 1.0],
    [0.3, 0.7],
    [0.9, 0.1],
]

grade_data = [
    [0.6, 0.3, 0.1],
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [0.3, 0.4, 0.3],
    [0.8, 0.1, 0.1],
]


dataset: SoftDataset = SoftDataset([
    (letter, letter_data),
    (grade, grade_data),
])

len(dataset)

5

As with a hard dataset, a soft dataset has weighted instances.

In [20]:
dataset.weights

array([1., 1., 1., 1., 1.])

The instance weights can be updated.

In [21]:
dataset.weights[1] = 2
dataset.weights[3] = 0
dataset.weights[4] = 2

dataset.weights

array([1., 2., 1., 0., 2.])

The total instance weight can also be calculated.

In [22]:
dataset.total_weight()

6.0

When dumping a soft dataset, the state weights are printed.

In [23]:
dataset.dump()

rvs: [letter, grade]
instances (5, with total weight 6.0):
([0.6 0.4], [0.6 0.3 0.1]) * 1.0
([1. 0.], [0. 0. 1.]) * 2.0
([0. 1.], [0. 1. 0.]) * 1.0
([0.3 0.7], [0.3 0.4 0.3]) * 0.0
([0.9 0.1], [0.8 0.1 0.1]) * 2.0


The state weights for a random variable and instance represent _soft evidence_. The intuitive interpretation of soft evidence is that it represents uncertainty in the true value of the random variable for that instance.

Formally, a soft evidence value is equivalent to replicating the instance, once for each possible value of the attribute, then multiplying the weight of the instance by the probability of the value.

For example,

| difficult | intelligent | grade | sat | letter | _weight_ |
|-----------|-------------|-------|-----|--------|----------|
| 0         | (0.2, 0.8)  | 2     | 0   | 1      | _w_      |

is equivalent to

| difficult | intelligent | grade | sat | letter | _weight_  |
|-----------|-------------|-------|-----|--------|-----------|
| 0         | 0           | 2     | 0   | 1      | _w_ × 0.2 |
| 0         | 1           | 2     | 0   | 1      | _w_ × 0.8 |

This interpretation is consistent for any non-negative weighting of values, not just probability distributions.
However, it is not recommended to use soft evidence except as distributions over attribute values as it may make the weight of evidence difficult to see, even though the semantics is well-defined.

For example,

| difficult | intelligent | grade | sat | letter | _weight_ |
|-----------|-------------|-------|-----|--------|----------|
| 0         | (2.4, 0.1)  | 2     | 0   | 1      | _w_      |

is equivalent to

| difficult | intelligent | grade | sat | letter | _weight_  |
|-----------|-------------|-------|-----|--------|-----------|
| 0         | 0           | 2     | 0   | 1      | _w_ × 2.4 |
| 0         | 1           | 2     | 0   | 1      | _w_ × 0.1 |


The formulation of soft evidence remains consistent when there are multiple attributes with soft evidence.

For example,

| difficult | intelligent | grade  | sat        | letter  | _weight_ |
|-----------|-------------|--------|------------|---------|----------|
| 0         | (0.6, 0.4)  | 2      | (0.1, 0.9) | 1       | _w_      |

is equivalent to

| difficult | intelligent | grade  | sat | letter  | _weight_        |
|-----------|-------------|--------|-----|---------|-----------------|
| 0         | 0           | 2      | 0   | 1       | _w_ × 0.6 × 0.1 |
| 0         | 0           | 2      | 1   | 1       | _w_ × 0.6 × 0.9 |
| 0         | 1           | 2      | 0   | 1       | _w_ × 0.4 × 0.1 |
| 0         | 1           | 2      | 1   | 1       | _w_ × 0.4 × 0.9 |


Here is how we can create a dataset with soft evidence (with a single instance weight of 10)

In [24]:
dataset: SoftDataset = SoftDataset(weights=[10])

dataset.add_rv(difficult)[0,:] = (1, 0)         # difficult = 'Yes'
dataset.add_rv(intelligent)[0,:] = (2.4, 0.1)   # intelligent = (2.4, 0.1)
dataset.add_rv(grade)[0,:] = (0, 0, 1)          # grade = '3'
dataset.add_rv(sat)[0,:] = (0.1, 0.9)           # sat = (0.1, 0.9)
dataset.add_rv(letter)[0,:] = (0, 1)            # letter = 'No'

dataset.dump()

rvs: [difficult, intelligent, grade, sat, letter]
instances (1, with total weight 10.0):
([1. 0.], [2.4 0.1], [0. 0. 1.], [0.1 0.9], [0. 1.]) * 10.0
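The equivalence between a soft instance and a set of weighted hard instances, described earlier in this section, can be sketched in plain Python by cross-multiplying the state weights. This is an illustration of the semantics only, not CK's implementation, and `expand_instance` is a hypothetical helper. It is applied to the single instance created above.

```python
from itertools import product
from math import prod


def expand_instance(values, weight):
    """Yield (hard_instance, weight) pairs equivalent to one soft instance.

    Each value is either a state index (hard evidence) or a sequence of
    state weights (soft evidence).
    """
    # Represent every value as a list of (state_idx, state_weight) options.
    options = [
        [(value, 1.0)] if isinstance(value, int) else list(enumerate(value))
        for value in values
    ]
    for combination in product(*options):
        states = tuple(state for state, _ in combination)
        yield states, weight * prod(w for _, w in combination)


rows = list(expand_instance([0, (2.4, 0.1), 2, (0.1, 0.9), 1], weight=10.0))
for states, w in rows:
    print(states, '*', round(w, 2))
```

The weights of the expanded rows sum to approximately 25, which is the total weight of evidence carried by the original instance once the unnormalised state weights are taken into account.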


## Normalisation
A soft dataset is considered _normalised_ when, for each instance, either:
1. the instance weight is zero and the state weights for every random variable are zero, or
2. the instance weight is positive and the state weights for each random variable sum to one.

For example,

| difficult | intelligent | grade | sat        | letter | _weight_ |
|-----------|-------------|-------|------------|--------|----------|
| 0         | (2.4, 0.1)  | 2     | (0.1, 0.9) | 1      | 10       |

is equivalent to the normalised dataset

| difficult | intelligent  | grade  | sat        | letter  | _weight_ |
|-----------|--------------|--------|------------|---------|----------|
| 0         | (0.96, 0.04) | 2      | (0.1, 0.9) | 1       | 10 × 2.5 |


In [25]:
dataset.dump()
print()

dataset.normalise()

dataset.dump()

rvs: [difficult, intelligent, grade, sat, letter]
instances (1, with total weight 10.0):
([1. 0.], [2.4 0.1], [0. 0. 1.], [0.1 0.9], [0. 1.]) * 10.0

rvs: [difficult, intelligent, grade, sat, letter]
instances (1, with total weight 25.0):
([1. 0.], [0.96 0.04], [0. 0. 1.], [0.1 0.9], [0. 1.]) * 25.0
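The output above is consistent with a simple rule: rescale the state weights of each random variable to sum to one, and multiply the instance weight by each original sum. The following plain-Python sketch applies that rule to one instance. It assumes this rule; it is not taken from CK's source, and `normalise_instance` is a hypothetical helper.

```python
from math import prod


def normalise_instance(soft_values, weight):
    """Return (normalised_values, new_weight) for one soft instance (assumed rule)."""
    totals = [sum(state_weights) for state_weights in soft_values]
    new_weight = weight * prod(totals)
    if new_weight == 0:
        # Zero weight of evidence: all state weights become zero.
        return [[0.0] * len(state_weights) for state_weights in soft_values], 0.0
    normalised = [
        [w / total for w in state_weights]
        for state_weights, total in zip(soft_values, totals)
    ]
    return normalised, new_weight


values, weight = normalise_instance(
    [[1, 0], [2.4, 0.1], [0, 0, 1], [0.1, 0.9], [0, 1]],
    10.0,
)
print([round(w, 2) for w in values[1]], round(weight, 2))
```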


## Dataset Builder

CK datasets are optimised for efficiency, which can make constructing ad hoc datasets a little cumbersome.

A `DatasetBuilder` solves this issue by providing a flexible API for defining a dataset, which can then be converted to a `HardDataset` or `SoftDataset` as needed.

A dataset builder holds all values in memory and can be a mixture of hard, soft or missing evidence.

The following example uses a `DatasetBuilder` to define a dataset for two random variables.

In [26]:
from ck.dataset.dataset_builder import DatasetBuilder, Record
from ck.pgm import PGM

pgm = PGM()
x = pgm.new_rv('x', (True, False))
y = pgm.new_rv('y', ('yes', 'no', 'maybe'))

builder = DatasetBuilder([x, y])

This dataset builder currently has no records.

In [27]:
len(builder)

0

Just like a `Dataset`, a `DatasetBuilder` knows the random variables it is for.

In [28]:
print(*builder.rvs)

x y


In [29]:
record: Record = builder.append()

len(record)

2

A record can be appended to the builder. The record is a sequence of values, co-indexed with the builder's random variables. Each value in a new record is initially missing, represented by `None`.

In [30]:
record[0], record[1]

(None, None)

A record also has a weight (notionally 1).

In [31]:
record.weight

1

A value can be set to hard evidence (by state index).

In [32]:
record[0] = 1

record[0], record[1]


(1, None)

A value can also be set to soft evidence.

In [33]:
record[1] = (0.2, 0.3, 0.5)

record[0], record[1]

(1, (0.2, 0.3, 0.5))

A record weight can also be updated.

In [34]:
record.weight = 3

print(record.weight)

3


A record's string representation is similar to that of a `Dataset` dump, allowing for a mixture of hard and soft evidence.

In [35]:
print(record)

(1, (0.2, 0.3, 0.5)) * 3


When appending records, values may be provided. The following adds three more records.

In [36]:
builder.append(1, 2)
builder.append(None, [0.7, 0.1, 0.2])
builder.append(0, 1).weight = 2


Here is a dump of the data so far...

In [37]:
builder.dump()

rvs: [x, y]
instances (4, with total weight 7):
(1, (0.2, 0.3, 0.5)) * 3
(1, 2) * 1
(None, [0.7, 0.1, 0.2]) * 1
(0, 1) * 2


Records can also be inserted. The insert `index` has the same semantics as `list.insert`.

In [38]:
builder.insert(0)  # insert at the start, all values missing
builder.insert(2, [0, 0]) # insert before index position 2, using the given values

builder.dump()

rvs: [x, y]
instances (6, with total weight 9):
(None, None) * 1
(1, (0.2, 0.3, 0.5)) * 3
(0, 0) * 1
(1, 2) * 1
(None, [0.7, 0.1, 0.2]) * 1
(0, 1) * 2


Record values can also be set using random variable states.

In [39]:
builder[0].set_states(True, 'maybe')
builder.append().set_states(False, 'no')

builder.dump(as_states=True, missing='?')

rvs: [x, y]
instances (7, with total weight 10):
(True, 'maybe') * 1
(False, (0.2, 0.3, 0.5)) * 3
(True, 'yes') * 1
(False, 'maybe') * 1
(?, [0.7, 0.1, 0.2]) * 1
(True, 'no') * 2
(False, 'no') * 1


## Conversion Between Datasets



### Using a Dataset Builder

A dataset builder can be converted to a hard or soft dataset.

When converting to a hard dataset, the state of a random variable with soft evidence will be the state with the highest weight. Ties are broken arbitrarily.

By default, missing values are set to the number of states of the random variable, which is an invalid state index for that random variable. Alternative values can be specified. In this example the value 99 is used to make the effect clearly visible.
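The per-value conversion rules just described can be sketched in plain Python as follows. This is an illustration under the stated rules, not CK's implementation; `to_hard_state` is a hypothetical helper, and it happens to break ties by taking the first maximal state, which is one arbitrary choice among many.

```python
def to_hard_state(value, number_of_states, missing=None):
    """Convert one builder value to a state index (hypothetical helper).

    `value` is None (missing), an int (hard evidence) or a sequence of
    state weights (soft evidence).
    """
    if value is None:
        # Default for missing values: an invalid index equal to the number of states.
        return number_of_states if missing is None else missing
    if isinstance(value, int):
        return value
    # Soft evidence: index of the highest weight (first one wins on ties).
    return max(range(len(value)), key=lambda i: value[i])


print(to_hard_state((0.2, 0.3, 0.5), 3))   # 2
print(to_hard_state(None, 2))              # 2
print(to_hard_state(None, 2, missing=99))  # 99
```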

The following examples use the dataset builder from above.

In [40]:
from ck.dataset.dataset_builder import hard_dataset_from_builder

dataset: HardDataset = hard_dataset_from_builder(builder, missing=99)
dataset.dump()

rvs: [x, y]
instances (7, with total weight 10.0):
(0, 2) * 1.0
(1, 2) * 3.0
(0, 0) * 1.0
(1, 2) * 1.0
(99, 0) * 1.0
(0, 1) * 2.0
(1, 1) * 1.0


When converting to a soft dataset, by default missing values have state weights set to NaN.

In [41]:
from ck.dataset.dataset_builder import soft_dataset_from_builder

dataset: SoftDataset = soft_dataset_from_builder(builder)
dataset.dump()

rvs: [x, y]
instances (7, with total weight 10.0):
([1. 0.], [0. 0. 1.]) * 1.0
([0. 1.], [0.2 0.3 0.5]) * 3.0
([1. 0.], [1. 0. 0.]) * 1.0
([0. 1.], [0. 0. 1.]) * 1.0
([nan nan], [0.7 0.1 0.2]) * 1.0
([1. 0.], [0. 1. 0.]) * 2.0
([0. 1.], [0. 1. 0.]) * 1.0


To add records to a dataset builder from a hard or soft dataset, use `DatasetBuilder.append_dataset`.

In [42]:
hard_dataset: HardDataset = HardDataset([
    (letter, [0, 0, 1, 1]),
    (grade, [0, 1, 1, 0]),
])

soft_dataset: SoftDataset = SoftDataset(
    [
        (letter, [[0.6, 0.4], [1.0, 0.0], [0.0, 1.0], [0.3, 0.7]]),
        (grade, [[0.6, 0.2, 0.2], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.3, 0.4, 0.3]]),
    ],
    weights=[2.2, 1.0, 5.8, 1.0],
)

builder: DatasetBuilder = DatasetBuilder([grade, letter])

builder.append_dataset(hard_dataset)
builder.append_dataset(soft_dataset)

builder.dump()

rvs: [grade, letter]
instances (8, with total weight 14.0):
(0, 0) * 1.0
(1, 0) * 1.0
(1, 1) * 1.0
(0, 1) * 1.0
([0.6 0.2 0.2], [0.6 0.4]) * 2.2
([0. 0. 1.], [1. 0.]) * 1.0
([0. 1. 0.], [0. 1.]) * 5.8
([0.3 0.4 0.3], [0.3 0.7]) * 1.0


### Direct Dataset Conversion

While a `DatasetBuilder` can be used to convert between datasets, it is possible to directly convert between `HardDataset` and `SoftDataset` objects.

When converting to a hard dataset from a soft dataset, the state of a random variable will be the state with the highest weight. Ties are broken arbitrarily.

In [43]:
soft_dataset: SoftDataset = SoftDataset.from_hard_dataset(hard_dataset)
soft_dataset.dump()

rvs: [letter, grade]
instances (4, with total weight 4.0):
([1. 0.], [1. 0. 0.]) * 1.0
([1. 0.], [0. 1. 0.]) * 1.0
([0. 1.], [0. 1. 0.]) * 1.0
([0. 1.], [1. 0. 0.]) * 1.0


In [44]:
hard_dataset: HardDataset = HardDataset.from_soft_dataset(soft_dataset)
hard_dataset.dump()

rvs: [letter, grade]
instances (4, with total weight 4.0):
(0, 0) * 1.0
(0, 1) * 1.0
(1, 1) * 1.0
(1, 0) * 1.0


### Expanding a Soft Dataset

A hard dataset can be constructed with the same data semantics as a given soft dataset
by expanding and cross-multiplying soft evidence (discussed above).

Any state weights in the soft dataset that represent uncertainty over states
of a random variable will be converted to an equivalent set of weighted hard
instances. This means that the returned dataset may have a number of instances
different to that of the given soft dataset.

This implementation works by constructing a cross-table from the given soft dataset.
(Cross-tables are explained below.) This means that the resulting hard dataset
will have no duplicated instances and no instances with weight zero.

The ordering of instances in the returned dataset is not guaranteed.

In [45]:
from ck.dataset.dataset_from_crosstable import expand_soft_dataset

weights = [2.0, 0.0, 3.0, 5.0]
x_data = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.1, 0.8],
    [0.2, 0.3],
]
y_data = [
    [1.0, 0.0, 0.0],
    [0.3, 0.4, 0.7],
    [0.0, 0.0, 1.0],
    [0.4, 0.0, 0.7],
]
soft_dataset = SoftDataset(
    [
        (x, x_data),
        (y, y_data)
    ],
    weights=weights,
)


hard_dataset = expand_soft_dataset(soft_dataset)
hard_dataset.dump()

rvs: [x, y]
instances (4, with total weight 7.449999999999999):
(0, 0) * 2.4
(0, 2) * 1.0
(1, 2) * 3.45
(1, 0) * 0.6


## Dataset from CSV

A module is provided to create a dataset from CSV lines.

Each CSV line should contain comma-separated state indices for a nominated sequence of random variables.
(Other separators apart from commas can be used.)

Whitespace around values is ignored. Blank lines and comments are ignored. By default, comments start with a hash (#).

Optionally, a column of the input can be read as instance weights.

Here is an example.

In [46]:
from ck.dataset.dataset_from_csv import hard_dataset_from_csv
from ck.pgm import PGM

pgm: PGM = PGM()
x = pgm.new_rv('x', 2)
y = pgm.new_rv('y', 3)
z = pgm.new_rv('z', 4)

lines  = [
    '1, 2, 3  # this is a comment',
    '', # a blank line
    '0, 1, 2',
    '# this is also a comment',
    '0, 0, 1',
]
dataset: HardDataset = hard_dataset_from_csv(pgm.rvs, lines)
dataset.dump()

rvs: [x, y, z]
instances (3, with total weight 3.0):
(1, 2, 3) * 1.0
(0, 1, 2) * 1.0
(0, 0, 1) * 1.0
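The line-handling rules described earlier (comments, blank lines and surrounding whitespace) can be sketched in a few lines of plain Python. This is not how `hard_dataset_from_csv` is implemented; `parse_csv_line` is a hypothetical helper that ignores header lines and weight columns.

```python
def parse_csv_line(line, sep=',', comment='#'):
    """Return state indices for a data line, or None for a blank or comment-only line."""
    # Drop anything after the comment marker, then surrounding whitespace.
    content = line.split(comment, 1)[0].strip()
    if not content:
        return None
    return [int(field.strip()) for field in content.split(sep)]


lines = [
    '1, 2, 3  # this is a comment',
    '',
    '0, 1, 2',
    '# this is also a comment',
    '0, 0, 1',
]
rows = [row for row in map(parse_csv_line, lines) if row is not None]
print(rows)  # [[1, 2, 3], [0, 1, 2], [0, 0, 1]]
```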


If the CSV data is a single string, simply use `splitlines`.

In [47]:
csv  = """
    1, 1, 0  # this is a comment
    1, 1, 2
    0, 2, 1
    """
dataset: HardDataset = hard_dataset_from_csv(pgm.rvs, csv.splitlines())
dataset.dump()

rvs: [x, y, z]
instances (3, with total weight 3.0):
(1, 1, 0) * 1.0
(1, 1, 2) * 1.0
(0, 2, 1) * 1.0


If a header line appears in the data, then only columns matching the random variable names will be loaded.

In [48]:
csv  = """
    z, Q, y, x
    3, 7, 2, 1
    2, 6, 1, 0
    """
dataset: HardDataset = hard_dataset_from_csv(pgm.rvs, csv.splitlines())
dataset.dump()

rvs: [x, y, z]
instances (2, with total weight 2.0):
(1, 2, 3) * 1.0
(0, 1, 2) * 1.0


Here is an example with a weight column, nominated using a column index. (Positive and negative column indices are permitted.)

In [49]:
csv  = """
    0, 2, 3, 1.3
    1, 1, 2, 4.5
    """
dataset: HardDataset = hard_dataset_from_csv(pgm.rvs, csv.splitlines(), weights=-1)
dataset.dump()


rvs: [x, y, z]
instances (2, with total weight 5.8):
(0, 2, 3) * 1.3
(1, 1, 2) * 4.5


A weight column can also be nominated by name, if the CSV data has a header line.

In [50]:
csv  = """
    Weight, z, Q, y, x
    2.3,    3, 7, 2, 1
    5.5,    2, 6, 1, 0
    """
dataset: HardDataset = hard_dataset_from_csv(pgm.rvs, csv.splitlines(), weights='Weight')
dataset.dump()


rvs: [x, y, z]
instances (2, with total weight 7.8):
(1, 2, 3) * 2.3
(0, 1, 2) * 5.5


Because a text file is an iterable of lines, one can be used directly as the data source for `hard_dataset_from_csv`. Typical usage may be something like:

```
with open('my_data.csv', 'r') as file:
    dataset = hard_dataset_from_csv(rvs, file)
```

Here we create an in-memory file using `StringIO`.

In [51]:
from io import StringIO

file = StringIO('''
    1, 2, 3
    0, 1, 2
''')
dataset: HardDataset = hard_dataset_from_csv(pgm.rvs, file)
dataset.dump()


rvs: [x, y, z]
instances (2, with total weight 2.0):
(1, 2, 3) * 1.0
(0, 1, 2) * 1.0


## Sampled Datasets

A hard dataset can be created from a sampler.

In this example, a sampled dataset is created using the "Student" PGM.

In [52]:
from ck.example import Student
from ck.pgm_compiler import DEFAULT_PGM_COMPILER
from ck.pgm_circuit.wmc_program import WMCProgram

pgm = Student()
wmc = WMCProgram(DEFAULT_PGM_COMPILER(pgm))
sampler = wmc.sample_direct()

print(*sampler.rvs)

difficult intelligent grade sat letter


In [53]:
from ck.dataset.sampled_dataset import dataset_from_sampler

dataset = dataset_from_sampler(sampler, 10)

dataset.dump()

rvs: [difficult, intelligent, grade, sat, letter]
instances (10, with total weight 10.0):
(0, 1, 2, 1, 0) * 1.0
(1, 0, 0, 0, 1) * 1.0
(1, 0, 0, 0, 0) * 1.0
(1, 1, 0, 1, 1) * 1.0
(1, 0, 0, 0, 1) * 1.0
(0, 0, 2, 0, 0) * 1.0
(0, 0, 0, 0, 1) * 1.0
(0, 0, 0, 0, 0) * 1.0
(0, 1, 2, 1, 0) * 1.0
(0, 0, 1, 0, 1) * 1.0


## Cross-tables

A cross-table records the total weight for possible combinations of states for some random variables. Its primary purpose is to represent empirical distributions over joint states of the random variables.

Practically, a cross-table is a mutable mapping from states of the cross-table random variables (as a tuple of state indices) to a weight (as a float). Instances with weight zero are not explicitly represented in a cross-table.

Here is an example of a manually created cross-table.

In [54]:
from ck.dataset.cross_table import CrossTable, cross_table_from_dataset

crosstab = CrossTable(rvs=[difficult, intelligent, grade, sat, letter])

crosstab[(0, 0, 0, 0, 0)] = 3.4
crosstab[(0, 0, 0, 0, 1)] = 2.1
crosstab[(0, 1, 1, 0, 0)] = 9.3

for instance, weight in crosstab.items():
    print(instance, weight)

(0, 0, 0, 0, 0) 3.4
(0, 0, 0, 0, 1) 2.1
(0, 1, 1, 0, 0) 9.3


A cross-table can be constructed from a dataset.

This example uses the dataset created above using a sampler.

In [55]:
crosstab = cross_table_from_dataset(dataset)

print(*crosstab.rvs)
for instance, weight in crosstab.items():
    print(instance, weight)

print('total weight:', crosstab.total_weight())

difficult intelligent grade sat letter
(0, 1, 2, 1, 0) 2.0
(1, 0, 0, 0, 1) 2.0
(1, 0, 0, 0, 0) 1.0
(1, 1, 0, 1, 1) 1.0
(0, 0, 2, 0, 0) 1.0
(0, 0, 0, 0, 1) 1.0
(0, 0, 0, 0, 0) 1.0
(0, 0, 1, 0, 1) 1.0
total weight: 10.0


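Even though zero-weighted instances are not explicitly represented, they can still be queried. Querying uses the same indexing as for any other instance, shown here for an instance that is present in the cross-table.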

In [56]:
crosstab[(1, 0, 0, 0, 0)]

np.float64(1.0)
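As a mental model, a cross-table behaves like a dictionary that sums weights per distinct instance and reports zero for anything it does not store. The plain-Python sketch below illustrates this with a few of the sampled instances. It is not CK's `CrossTable`; the helper names are hypothetical, and returning 0.0 for an absent instance is an assumption based on the description above.

```python
from collections import defaultdict


def build_cross_table(instances, weights):
    """Sum the weights of identical instances, skipping zero weights (hypothetical helper)."""
    table = defaultdict(float)
    for instance, weight in zip(instances, weights):
        if weight != 0:
            table[tuple(instance)] += weight
    return dict(table)


def query(table, instance):
    """Weight of an instance, assumed to be 0.0 when it is not stored."""
    return table.get(tuple(instance), 0.0)


instances = [
    (0, 1, 2, 1, 0),
    (1, 0, 0, 0, 1),
    (1, 0, 0, 0, 0),
    (1, 0, 0, 0, 1),
    (0, 1, 2, 1, 0),
]
table = build_cross_table(instances, [1.0] * len(instances))

print(query(table, (1, 0, 0, 0, 1)))  # 2.0
print(query(table, (1, 1, 1, 1, 1)))  # 0.0
```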

Observe that a cross-table is logically equivalent to a dataset. A hard dataset can be created directly from a cross-table using `dataset_from_cross_table`, like the following.

In [57]:
from ck.dataset.dataset_from_crosstable import dataset_from_cross_table

dataset_2 = dataset_from_cross_table(crosstab)

dataset_2.dump()

rvs: [difficult, intelligent, grade, sat, letter]
instances (8, with total weight 10.0):
(0, 1, 2, 1, 0) * 2.0
(1, 0, 0, 0, 1) * 2.0
(1, 0, 0, 0, 0) * 1.0
(1, 1, 0, 1, 1) * 1.0
(0, 0, 2, 0, 0) * 1.0
(0, 0, 0, 0, 1) * 1.0
(0, 0, 0, 0, 0) * 1.0
(0, 0, 1, 0, 1) * 1.0


A cross-table can also be sampled, using a `CrossTableSampler`. Instances will be drawn from the sampler according to their weight in the given cross-table.

Note that if the given cross-table is modified after constructing the sampler, the sampler will not be affected.

In [58]:
from ck.dataset.sampled_dataset import CrossTableSampler

crosstab_sampler = CrossTableSampler(crosstab)

for inst in crosstab_sampler.take(8):
    print(inst)


(0, 0, 0, 0, 1)
(0, 0, 0, 0, 0)
(1, 0, 0, 0, 1)
(0, 0, 2, 0, 0)
(0, 1, 2, 1, 0)
(1, 0, 0, 0, 0)
(1, 0, 0, 0, 1)
(1, 0, 0, 0, 1)
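The two behaviours described above (drawing in proportion to weight, and being unaffected by later changes to the cross-table) can be sketched with the standard library. `SnapshotSampler` is a hypothetical class for illustration, not CK's `CrossTableSampler`, and the table contents are made up.

```python
import random


class SnapshotSampler:
    """Draw instances in proportion to their weight (hypothetical illustration)."""

    def __init__(self, table, seed=None):
        # Copy the instances and weights so later changes to `table` have no effect.
        self._instances = list(table.keys())
        self._weights = list(table.values())
        self._rng = random.Random(seed)

    def take(self, n):
        """Return n instances drawn with replacement."""
        return self._rng.choices(self._instances, weights=self._weights, k=n)


table = {(0, 0): 3.0, (0, 1): 1.0}
sampler = SnapshotSampler(table, seed=0)

# Modifying the table after construction does not change what is sampled.
table[(1, 1)] = 100.0

samples = sampler.take(50)
print(samples.count((0, 0)), samples.count((0, 1)), samples.count((1, 1)))
```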
