# Datasets

Topics:
* How numerical data are returned (and expected) in **DynamicalSystems.jl**.
* Basic `Dataset` handling.
* Neighborhoods.

---

Much of the functionality of **DynamicalSystems.jl** uses numerical data. We required a `struct` that would unify behavior across all functions that either return or require numerical data.

To this end, we use a struct called simply `Dataset`.

In [75]:
using DynamicalSystems

In [76]:
x = rand(1000)
y = rand(1000)
dataset = Dataset(x,y)

2-dimensional Dataset{Float64} with 1000 points:
 0.676144    0.359398 
 0.955477    0.254457 
 0.18383     0.212144 
 0.85152     0.886629 
 0.709406    0.603019 
 0.934935    0.843218 
 0.493945    0.299921 
 0.95915     0.182324 
 0.818529    0.83918  
 0.00764545  0.445326 
 0.289665    0.976146 
 0.682668    0.930761 
 0.0574555   0.917293 
 ⋮                    
 0.858771    0.436908 
 0.580176    0.092008 
 0.0994581   0.500926 
 0.481314    0.0281003
 0.935286    0.47825  
 0.96947     0.729731 
 0.304957    0.804241 
 0.117145    0.909034 
 0.365606    0.0523087
 0.534662    0.17669  
 0.824977    0.765574 
 0.289769    0.878223 


A `Dataset` is a subtype of `AbstractDataset`:

In [77]:
typeof(dataset) <: AbstractDataset

All subtypes of `AbstractDataset` contain their data in the field `data`:

In [78]:
dataset.data

1000-element Array{StaticArrays.SArray{Tuple{2},Float64,1,2},1}:
 [0.676144, 0.359398]  
 [0.955477, 0.254457]  
 [0.18383, 0.212144]   
 [0.85152, 0.886629]   
 [0.709406, 0.603019]  
 [0.934935, 0.843218]  
 [0.493945, 0.299921]  
 [0.95915, 0.182324]   
 [0.818529, 0.83918]   
 [0.00764545, 0.445326]
 [0.289665, 0.976146]  
 [0.682668, 0.930761]  
 [0.0574555, 0.917293] 
 ⋮                     
 [0.858771, 0.436908]  
 [0.580176, 0.092008]  
 [0.0994581, 0.500926] 
 [0.481314, 0.0281003] 
 [0.935286, 0.47825]   
 [0.96947, 0.729731]   
 [0.304957, 0.804241]  
 [0.117145, 0.909034]  
 [0.365606, 0.0523087] 
 [0.534662, 0.17669]   
 [0.824977, 0.765574]  
 [0.289769, 0.878223]  

We chose a vector of `SVector`s as the internal representation of the data because it yields large performance gains in many functions of the library.

# Creating a Dataset

Most functions of **DynamicalSystems.jl** that return numerical data return it as a `Dataset`.

If you already have the data, though, there are several ways to create a dataset:

In [79]:
# From matrix:
m = rand(1000, 5)
data = Dataset(m)

# Using `reinterpret` if each *row* is one variable:
m = transpose(m)
data2 = reinterpret(Dataset, m)

data2 == data

In [80]:
# From columns
x, y = rand(1000), rand(1000)
data = Dataset(x, y)

# All points of a dataset must be equally sized:
z = rand(100)
Dataset(x, z)

LoadError: BoundsError: attempt to access 100-element Array{Float64,1} at index [101]

# Handling a Dataset

Special attention has been given so that subtypes of `AbstractDataset` behave exactly like a `Matrix` where each *row* is a *point* of the dataset and each *column* is *one dynamic variable*. We chose this approach because this is how scientific data are traditionally recorded, exported and passed around.

In [81]:
# Create some random dataset:
dataset = Dataset(rand(10000, 3))

dataset[:, 2] # this is the second variable timeseries
dataset[1] == dataset[1, :] # this is the first datapoint (D-dimensional)
dataset[5, 3] # value of the third variable, at the 5th timepoint

In [82]:
# iteration:
f = 0.0
for point in dataset
    f += mean(point)
end

# get columns
x, y = columns(dataset)
mean(x) ≈ mean(y)

# minima, maxima, etc...
mini = minima(dataset)
maxi = maxima(dataset)
mini, maxi = minmaxima(dataset)

([4.42448e-5, 6.93729e-5, 8.14024e-6], [0.999911, 0.999826, 0.999993])

In [83]:
# Calling minmaxima once is faster than calling minima and maxima separately
using BenchmarkTools
@btime maxima($dataset)
@btime minima($dataset)
@btime minmaxima($dataset)

  29.439 μs (1 allocation: 112 bytes)
  32.000 μs (1 allocation: 112 bytes)
  51.199 μs (2 allocations: 224 bytes)


([4.42448e-5, 6.93729e-5, 8.14024e-6], [0.999911, 0.999826, 0.999993])
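
The timings above are consistent with both extremes being found in a single traversal of the data, although the implementation of `minmaxima` is not shown here, so this is only an assumption. A minimal sketch of that idea in plain Julia, using an ordinary vector of vectors and a hypothetical helper name (`minmax_per_variable` is not part of the library):

```julia
# Hypothetical single-pass helper, for illustration only.
# It returns the per-variable minima and maxima of a vector of points.
function minmax_per_variable(points)
    mini = copy(points[1])
    maxi = copy(points[1])
    for p in points
        for j in 1:length(p)
            if p[j] < mini[j]
                mini[j] = p[j]
            end
            if p[j] > maxi[j]
                maxi[j] = p[j]
            end
        end
    end
    return mini, maxi
end

points = [[0.3, 0.9], [0.1, 0.5], [0.7, 0.2]]
mini, maxi = minmax_per_variable(points)
println(mini)  # smallest value of each variable
println(maxi)  # largest value of each variable
```

Each point is visited once, whereas two separate calls would visit every point twice.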

# I/O
Input/output for an `AbstractDataset` needs no extra functionality: it is done with base Julia, specifically `writedlm` and `readdlm`.

The thing to note is that all data of an `AbstractDataset` is contained within its field `data`.

To write and read a dataset, simply do:

In [84]:
# I will write and read using delimiter ','
writedlm("data.txt", data.data, ',')

# Don't forget to convert the matrix to a Dataset when reading
data = Dataset(readdlm("data.txt", ',', Float64))

2-dimensional Dataset{Float64} with 1000 points:
 0.358779   0.569593 
 0.523091   0.627772 
 0.215193   0.994672 
 0.0320648  0.791795 
 0.901462   0.109973 
 0.0314922  0.887737 
 0.33808    0.419241 
 0.11554    0.991734 
 0.961374   0.456671 
 0.810466   0.711083 
 0.182282   0.831986 
 0.194045   0.306351 
 0.815428   0.659271 
 ⋮                   
 0.067466   0.362375 
 0.604193   0.357249 
 0.960955   0.0402399
 0.943888   0.487839 
 0.754618   0.0852721
 0.216454   0.744111 
 0.942508   0.344652 
 0.211623   0.066528 
 0.313829   0.935558 
 0.12399    0.358765 
 0.125221   0.696234 
 0.932107   0.850727 


In [85]:
# delete the dummy file we created:
rm("data.txt")
isfile("data.txt")

# Neighborhoods 

A "neighborhood" is a collection of points that are near a given point. `Dataset`s interface with the amazing module [`NearestNeighbors`](https://github.com/KristofferC/NearestNeighbors.jl) by Kristoffer Carlsson in order to find this neighborhood.

We use the function `neighborhood`. The call signature is:
```julia
neighborhood(point, tree, ntype)
```


`point` is simply the query point. `tree` is the structure required by [`NearestNeighbors`](https://github.com/KristofferC/NearestNeighbors.jl), and is obtained simply by:

In [86]:
tree = KDTree(dataset)

NearestNeighbors.KDTree{StaticArrays.SArray{Tuple{3},Float64,1,3},Distances.Euclidean,Float64}
  Number of points: 10000
  Dimensions: 3
  Metric: Distances.Euclidean(0.0)
  Reordered: true

The third argument to `neighborhood` is the *type* of the neighborhood. There are two types of neighborhoods!

The first one is defined as the `k` nearest points to a given point. It is represented in code by:

In [87]:
mybuddies = FixedMassNeighborhood(3) # for experts: does a knn search

DynamicalSystemsBase.FixedMassNeighborhood(3)

In [88]:
point = ones(3)
n = neighborhood(point, tree, mybuddies)

3-element Array{Int64,1}:
 7975
 2977
 2304

Notice that the `neighborhood` function does not return the points themselves, but rather the indices of the points in the original data:

In [104]:
for i in n
    println(dataset[i])
end

[0.981401, 0.909007, 0.993898]
[0.95096, 0.969996, 0.982008]
[0.944646, 0.961959, 0.954716]


The second type of neighborhood is all the points that are within some given distance `ε` from a point.

In code, we represent this as:

In [90]:
where_u_at = FixedSizeNeighborhood(0.001) # for experts: does an inrange search

DynamicalSystemsBase.FixedSizeNeighborhood(0.001)

In [91]:
n2 = neighborhood(point, tree, where_u_at)

0-element Array{Int64,1}

In [92]:
plz_come_closer = FixedSizeNeighborhood(0.1)
n2 = neighborhood(point, tree, plz_come_closer)

3-element Array{Int64,1}:
 2977
 2304
 7975

In [93]:
for i in n2
    println(dataset[i])
end

[0.95096, 0.969996, 0.982008]
[0.944646, 0.961959, 0.954716]
[0.981401, 0.909007, 0.993898]


Okay, so points that have distance < ε are accepted as a neighborhood.
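
To make the two definitions concrete, here is a brute-force sketch in plain Julia. It is not how `neighborhood` works internally (that goes through a tree from `NearestNeighbors`), and the helper names `euclidean`, `k_nearest_indices` and `within_radius_indices` are hypothetical, introduced only for this illustration:

```julia
# Brute-force stand-ins for the two neighborhood types (illustration only;
# these helpers are hypothetical and not part of DynamicalSystems.jl).

# Euclidean distance between two equally sized points
euclidean(a, b) = sqrt(sum(abs2, a .- b))

# Indices of the k points nearest to `query` (the "fixed mass" idea)
function k_nearest_indices(points, query, k)
    dists = [euclidean(p, query) for p in points]
    return sortperm(dists)[1:k]
end

# Indices of all points with distance < ε from `query` (the "fixed size" idea)
function within_radius_indices(points, query, ε)
    return [i for i in 1:length(points) if euclidean(points[i], query) < ε]
end

points = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.0], [0.2, 0.1], [1.0, 0.8]]
query = [1.0, 1.0]

println(k_nearest_indices(points, query, 2))         # [2, 3]
println(within_radius_indices(points, query, 0.25))  # [2, 3, 5]
```

Like `neighborhood`, both helpers return indices into the original data rather than the points themselves.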

What is the "distance" though? When defining a `tree`, you can optionally pass a distance metric. By default this is the Euclidean distance, but others also work:

In [94]:
using Distances # to get distance functions
funky_tree = KDTree(dataset, Chebyshev())

NearestNeighbors.KDTree{StaticArrays.SArray{Tuple{3},Float64,1,3},Distances.Chebyshev,Float64}
  Number of points: 10000
  Dimensions: 3
  Metric: Distances.Chebyshev()
  Reordered: true

In [95]:
n3 = neighborhood(point, funky_tree, plz_come_closer)

8-element Array{Int64,1}:
 2977
 2304
 9854
 7975
 6586
 6683
 9061
 5634
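
With the same `ε = 0.1`, the Chebyshev tree returns more indices than the Euclidean one did. This is expected: the Chebyshev distance is the largest coordinate-wise difference, which never exceeds the Euclidean distance, so any point inside the Euclidean ball is also inside the Chebyshev one. A small sketch with hand-written distance functions (written here for illustration; they are not the implementations from `Distances`):

```julia
# Hand-written metrics for illustration (the Distances package provides the real ones)
euclidean(a, b) = sqrt(sum(abs2, a .- b))
chebyshev(a, b) = maximum(abs, a .- b)

query = [1.0, 1.0, 1.0]
candidate = [0.93, 0.93, 0.93]   # each coordinate differs by about 0.07
ε = 0.1

println(euclidean(candidate, query) < ε)  # false: 0.07 * sqrt(3) ≈ 0.121
println(chebyshev(candidate, query) < ε)  # true: largest coordinate difference ≈ 0.07
```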

Before moving on, let's see one last thing:

In [96]:
# the point whose neighborhood I want is now part of my dataset:
point = dataset[end]

3-element StaticArrays.SArray{Tuple{3},Float64,1,3}:
 0.537344
 0.73174 
 0.789504

In [97]:
# Reminder of where the tree comes from:
tree = KDTree(dataset)

# Find suuuuuuuper close neighbors:
ε = 0.000001
where_u_at = FixedSizeNeighborhood(ε)
n2 = neighborhood(point, tree, where_u_at)

# Find the nearest neighbor:
my_best_friend = FixedMassNeighborhood(1)
n3 = neighborhood(point, tree, my_best_friend)

println(n2)
println(n3)

[10000]
[10000]


In [98]:
length(dataset) == n2[1] == n3[1]

*Apparently my best friend is myself. Who would have thought...*

What is happening here is that `neighborhood` also counted the point itself, since it is also part of the dataset. This behavior almost always needs to be avoided. For this reason, there is a second method for `neighborhood`:

```julia
neighborhood(point, tree, ntype, idx::Int, w::Int = 1)
```

In this case, `idx` is the index of the point in the original data. `w` stands for the Theiler window (positive integer).

Only points that have index `abs(i - idx) ≥ w` are returned as a neighborhood, to exclude close temporal neighbors.
The default `w=1` is the case of excluding the `point` itself.
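
The effect of the window can be sketched with a brute-force search in plain Julia. This is an illustration of the rule `abs(i - idx) ≥ w` only, not the library's implementation, and `nearest_outside_window` is a hypothetical helper name:

```julia
# Hypothetical helper: brute-force nearest neighbor of points[idx],
# skipping every index i with abs(i - idx) < w (the Theiler window).
function nearest_outside_window(points, idx, w = 1)
    best_i, best_d = 0, Inf
    for i in 1:length(points)
        abs(i - idx) < w && continue   # inside the window: skip
        d = sqrt(sum(abs2, points[i] .- points[idx]))
        if d < best_d
            best_i, best_d = i, d
        end
    end
    return best_i
end

# One-dimensional toy data; the query is points[3] = [2.0]
points = [[0.0], [1.0], [2.0], [2.5], [9.0], [3.5]]

println(nearest_outside_window(points, 3))     # 4: w = 1 excludes only the point itself
println(nearest_outside_window(points, 3, 2))  # 6: w = 2 also excludes indices 2 and 4
```

With `w = 1` the answer is the temporally adjacent point at index 4; widening the window to `w = 2` forces the search to skip it and pick index 6 instead.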

Let's revisit the last example (using the default value of `w = 1`):

In [99]:
point = dataset[end]
idx = length(dataset)

n2 = neighborhood(point, tree, where_u_at, idx)
n3 = neighborhood(point, tree, my_best_friend, idx)

println(n2)
println(n3)

Int64[]
[443]


As you can see, there isn't *any* neighbor with distance `< 0.000001` in this dataset, but there is always a nearest neighbor:

In [100]:
dataset[n3[1]]

3-element StaticArrays.SArray{Tuple{3},Float64,1,3}:
 0.552423
 0.730101
 0.800918

# Documentation Strings

In [101]:
?Dataset

search: Dataset AbstractDataset DynamicalSystemsBase @dateformat_str



```
Dataset{D, T} <: AbstractDataset{D,T}
```

A dedicated interface for datasets, i.e. vectors of vectors. It contains *equally-sized datapoints* of length `D`, represented by `SVector{D, T}`.

It can be used exactly like a matrix that has each of the columns be the timeseries of each of the dynamic variables. [`trajectory`](@ref) always returns a `Dataset`. For example,

```julia
ds = Systems.towel()
data = trajectory(ds, 1000) #this returns a dataset
data[:, 2] # this is the second variable timeseries
data[1] == data[1, :] # this is the first datapoint (D-dimensional)
data[5, 3] # value of the third variable, at the 5th timepoint
```

Use `Matrix(dataset)` or `reinterpret(Matrix, dataset)` and `Dataset(matrix)` or `reinterpret(Dataset, matrix)` to convert. The `reinterpret` methods are cheaper but assume that each variable/timeseries is a *row* and not column of the `matrix`.

If you have various timeseries vectors `x, y, z, ...` pass them like `Dataset(x, y, z, ...)`. You can use `columns(dataset)` to obtain the reverse, i.e. all columns of the dataset in a tuple.


In [102]:
?neighborhood

search: neighborhood AbstractNeighborhood FixedSizeNeighborhood



```
neighborhood(point, tree, ntype)
neighborhood(point, tree, ntype, n::Int, w::Int = 1)
```

Return a vector of indices which are the neighborhood of `point` in some `data`, where the `tree` was created using `tree = KDTree(data [, metric])`. The `ntype` is the type of neighborhood and can be any subtype of [`AbstractNeighborhood`](@ref).

Use the second method when the `point` belongs to the data, i.e. `point = data[n]`. Then `w` stands for the Theiler window (positive integer). Only points that have index `abs(i - n) ≥ w` are returned as a neighborhood, to exclude close temporal neighbors. The default `w=1` is the case of excluding the `point` itself.

## References

`neighborhood` simply interfaces the functions `knn` and `inrange` from [NearestNeighbors.jl](https://github.com/KristofferC/NearestNeighbors.jl) by using the argument `ntype`.


In [103]:
?AbstractNeighborhood

search: AbstractNeighborhood



```
AbstractNeighborhood
```

Supertype of methods for deciding the neighborhood of points for a given point.

Concrete subtypes:

  * `FixedMassNeighborhood(K::Int)` : The neighborhood of a point consists of the `K` nearest neighbors of the point.
  * `FixedSizeNeighborhood(ε::Real)` : The neighborhood of a point consists of all neighbors that have distance < `ε` from the point.

See [`neighborhood`](@ref) for more.
