# Datasets

Topics:
* How numerical data are returned (and expected) in **DynamicalSystems.jl**.
* Basic `Dataset` handling.
* Neighborhoods.

---

Much of the functionality of **DynamicalSystems.jl** uses numerical data. We required a `struct` that would unify behavior across all functions that either return or require numerical data.

To this end, we use a struct called simply `Dataset`.

In [97]:
# Pkg.add("DynamicalSystems")
using DynamicalSystems

In [98]:
x = rand(1000)
y = rand(1000)
dataset = Dataset(x,y)

2-dimensional Dataset{Float64} with 1000 points:
 0.472028   0.868781  
 0.719971   0.299689  
 0.471108   0.773293  
 0.69701    0.00752744
 0.868774   0.556061  
 0.739722   0.257121  
 0.605932   0.6215    
 0.673209   0.84794   
 0.615773   0.818738  
 0.0563429  0.0792874 
 0.107899   0.578316  
 0.742416   0.0695425 
 0.770237   0.459329  
 ⋮                    
 0.584646   0.152565  
 0.0456508  0.696457  
 0.995857   0.0571723 
 0.229369   0.517942  
 0.43142    0.891898  
 0.868771   0.0867077 
 0.273804   0.600074  
 0.129711   0.72327   
 0.867674   0.298835  
 0.32328    0.711076  
 0.633262   0.824527  
 0.228927   0.715258  


A `Dataset` is a subtype of `AbstractDataset`:

In [99]:
typeof(dataset) <: AbstractDataset

All subtypes of `AbstractDataset` contain their data in the field `data`:

In [100]:
dataset.data

1000-element Array{StaticArrays.SArray{Tuple{2},Float64,1,2},1}:
 [0.472028, 0.868781]  
 [0.719971, 0.299689]  
 [0.471108, 0.773293]  
 [0.69701, 0.00752744] 
 [0.868774, 0.556061]  
 [0.739722, 0.257121]  
 [0.605932, 0.6215]    
 [0.673209, 0.84794]   
 [0.615773, 0.818738]  
 [0.0563429, 0.0792874]
 [0.107899, 0.578316]  
 [0.742416, 0.0695425] 
 [0.770237, 0.459329]  
 ⋮                     
 [0.584646, 0.152565]  
 [0.0456508, 0.696457] 
 [0.995857, 0.0571723] 
 [0.229369, 0.517942]  
 [0.43142, 0.891898]   
 [0.868771, 0.0867077] 
 [0.273804, 0.600074]  
 [0.129711, 0.72327]   
 [0.867674, 0.298835]  
 [0.32328, 0.711076]   
 [0.633262, 0.824527]  
 [0.228927, 0.715258]  

* `Array{StaticArrays.SArray{Tuple{2},Float64,1,2},1}` means that the data are contained in a `Vector` of `SVector`s (which are statically sized vectors).

We chose a vector of `SVector`s as the internal representation of the data because it gives large performance gains in many functions of the library.
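To picture this layout without installing anything, here is a minimal sketch in which plain tuples stand in for `SVector`s. That substitution is an assumption made only for illustration; the library itself uses StaticArrays, and the variable names below are made up.

```julia
# Stand-in sketch: plain tuples play the role of `SVector`s here
# (an assumption, so that the example needs no packages).
x = [0.1, 0.2, 0.3]
y = [1.0, 2.0, 3.0]

# One fixed-size point per index, stored in an ordinary Vector:
points = [(x[i], y[i]) for i in eachindex(x)]

# A whole point is a single element of that vector...
first_point = points[1]

# ...while one variable (a "column") is gathered from every point:
second_variable = [p[2] for p in points]
```

Accessing a point is a single lookup, whereas extracting a variable has to visit every point. This is the trade-off of storing data point by point.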

# Creating a Dataset

Most functions of **DynamicalSystems.jl** that return numerical data return it in the form of a `Dataset`.

If you already have the data, there are several ways to create a dataset:

In [101]:
# From matrix where each column is one variable:
m = rand(1000, 5)
data = Dataset(m)

5-dimensional Dataset{Float64} with 1000 points:
 0.694059   0.526635   0.618079   0.243967  0.0273586 
 0.142516   0.276359   0.372204   0.250356  0.0153108 
 0.0282799  0.754551   0.888349   0.760556  0.84195   
 0.641885   0.482615   0.485954   0.935955  0.110467  
 0.80491    0.789272   0.186082   0.946868  0.282679  
 0.763561   0.506229   0.842992   0.97191   0.606674  
 0.0287769  0.569042   0.0258469  0.289738  0.538121  
 0.322388   0.753073   0.896772   0.331758  0.00492456
 0.582966   0.993267   0.778473   0.896547  0.121311  
 0.959226   0.0258572  0.72479    0.88013   0.944469  
 0.987886   0.753014   0.778941   0.331303  0.373205  
 0.192928   0.100636   0.796195   0.958008  0.781769  
 0.862178   0.285546   0.49393    0.697706  0.46161   
 ⋮                                                    
 0.539431   0.274683   0.441079   0.152299  0.219667  
 0.174208   0.1242     0.684654   0.193309  0.00254341
 0.0075836  0.716203   0.695002   0.619174  0.793613  
 0.898181   0.99

In [102]:
# Use `reinterpret` if each *row* is one variable:
m = transpose(m)
data2 = reinterpret(Dataset, m)

5-dimensional Dataset{Float64} with 1000 points:
 0.694059   0.526635   0.618079   0.243967  0.0273586 
 0.142516   0.276359   0.372204   0.250356  0.0153108 
 0.0282799  0.754551   0.888349   0.760556  0.84195   
 0.641885   0.482615   0.485954   0.935955  0.110467  
 0.80491    0.789272   0.186082   0.946868  0.282679  
 0.763561   0.506229   0.842992   0.97191   0.606674  
 0.0287769  0.569042   0.0258469  0.289738  0.538121  
 0.322388   0.753073   0.896772   0.331758  0.00492456
 0.582966   0.993267   0.778473   0.896547  0.121311  
 0.959226   0.0258572  0.72479    0.88013   0.944469  
 0.987886   0.753014   0.778941   0.331303  0.373205  
 0.192928   0.100636   0.796195   0.958008  0.781769  
 0.862178   0.285546   0.49393    0.697706  0.46161   
 ⋮                                                    
 0.539431   0.274683   0.441079   0.152299  0.219667  
 0.174208   0.1242     0.684654   0.193309  0.00254341
 0.0075836  0.716203   0.695002   0.619174  0.793613  
 0.898181   0.99

In [103]:
data2 == data

In [104]:
# From individual columns
x, y = rand(1000), rand(1000)
data = Dataset(x, y)

# All points of a dataset must be equally sized:
z = rand(100)
Dataset(x, z)

LoadError: BoundsError: attempt to access 100-element Array{Float64,1} at index [101]
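The `BoundsError` above does not say which column is too short. If you want a clearer message, you could check the lengths yourself before building the dataset. The helper `check_equal_lengths` below is hypothetical (it is not part of DynamicalSystems.jl) and is only a sketch of such a check.

```julia
# Hypothetical helper (not part of DynamicalSystems.jl): verify that all
# columns have the same length before building a dataset from them.
function check_equal_lengths(cols...)
    n = length(cols[1])
    for (j, c) in enumerate(cols)
        length(c) == n || throw(ArgumentError(
            "column $j has length $(length(c)), expected $n"))
    end
    return n
end

x, z = rand(1000), rand(100)
n = check_equal_lengths(x, x)   # passes and returns the common length
# check_equal_lengths(x, z)     # would throw an ArgumentError naming column 2
```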

# Handling a Dataset

**Subtypes of `AbstractDataset` behave like a `Matrix` where each *row* is a *point* of the dataset, while each *column* is *one dynamic variable*.**

* We chose this approach because this is traditionally how scientific data are recorded, exported and passed around.

In [107]:
# Create some random dataset:
dataset = Dataset(rand(10000, 3));
typeof(dataset)

DynamicalSystemsBase.Dataset{3,Float64}

In [109]:
dataset[:, 2] # this is the second variable series

10000-element Array{Float64,1}:
 0.292268 
 0.109471 
 0.621963 
 0.532432 
 0.518103 
 0.0712626
 0.578914 
 0.351177 
 0.540381 
 0.973785 
 0.387589 
 0.0684011
 0.100386 
 ⋮        
 0.0216077
 0.90342  
 0.687168 
 0.0915575
 0.0452472
 0.205436 
 0.868943 
 0.42547  
 0.512864 
 0.127515 
 0.581307 
 0.249815 

In [110]:
dataset[1, :] # this is the first datapoint (D-dimensional)

3-element StaticArrays.SArray{Tuple{3},Float64,1,3}:
 0.915279
 0.292268
 0.192741

In [112]:
dataset[1] # accessing with a single index is like accessing a vector of points

3-element StaticArrays.SArray{Tuple{3},Float64,1,3}:
 0.915279
 0.292268
 0.192741

In [113]:
dataset[5, 2] # value of the second variable, at the 5th timepoint

In [115]:
# get columns
x, y = columns(dataset)
mean(x), mean(y)

(0.4983321405744996, 0.5022744857188011)

In [116]:
# iteration:
f = 0.0
for point ∈ dataset  # same as `point in dataset`; type \in<tab> for ∈
    f += mean(point)
end
f

In [117]:
# minima, maxima
mini = minima(dataset)
maxi = maxima(dataset)
mini, maxi = minmaxima(dataset)

([0.000150045, 0.000185872, 2.74229e-5], [0.99997, 0.999916, 0.999866])

In [118]:
# minmaxima is faster than calling minima and maxima separately
using BenchmarkTools
@btime maxima($dataset)
@btime minima($dataset)
@btime minmaxima($dataset)

  25.864 μs (1 allocation: 112 bytes)
  25.864 μs (1 allocation: 112 bytes)
  43.107 μs (2 allocations: 224 bytes)


([0.000150045, 0.000185872, 2.74229e-5], [0.99997, 0.999916, 0.999866])
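One plausible reason for the timing difference is that both extrema can be collected while visiting each point only once. The sketch below illustrates that idea in plain Julia. The function `minmax_columns` is hypothetical and is not the library's implementation.

```julia
# Hypothetical single-pass sketch (not the library's code): track the
# per-variable minimum and maximum while visiting each point only once.
function minmax_columns(points)
    mi = collect(Float64, points[1])
    ma = collect(Float64, points[1])
    for p in points
        for j in eachindex(mi)
            mi[j] = min(mi[j], p[j])
            ma[j] = max(ma[j], p[j])
        end
    end
    return mi, ma
end

points = [[0.3, 0.9], [0.1, 0.5], [0.7, 0.2]]
mi, ma = minmax_columns(points)
```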

# I/O
Input/output functionality for an `AbstractDataset` is already provided by base Julia:
1. `writedlm` for writing.
2. `readdlm` for reading.

To write and read a dataset, simply use the `data` field:

In [119]:
dataset = Dataset(rand(1000,3))

# I will write and read using delimiter ','
writedlm("data.txt", dataset.data, ',')

# Load the data from text:
data = Dataset(readdlm("data.txt", ',', Float64))
# Don't forget to convert the matrix to a Dataset when reading!

3-dimensional Dataset{Float64} with 1000 points:
 0.386571    0.264453   0.866901 
 0.585694    0.929369   0.289556 
 0.00989398  0.11149    0.434989 
 0.800302    0.875426   0.763153 
 0.847926    0.0655121  0.715585 
 0.556292    0.698036   0.29814  
 0.693338    0.668298   0.175626 
 0.344344    0.681541   0.247855 
 0.1215      0.366678   0.394113 
 0.163083    0.0673518  0.770159 
 0.737952    0.589141   0.176348 
 0.108682    0.298084   0.434295 
 0.980144    0.910287   0.350069 
 ⋮                               
 0.64557     0.603952   0.803946 
 0.81762     0.414984   0.587891 
 0.50988     0.493074   0.67168  
 0.71252     0.0171345  0.519987 
 0.019764    0.189032   0.816701 
 0.799808    0.924068   0.761454 
 0.670693    0.200921   0.259871 
 0.648783    0.288457   0.497269 
 0.196873    0.279591   0.435158 
 0.5129      0.538277   0.0276421
 0.975748    0.895334   0.323878 
 0.809626    0.155533   0.911393 


In [120]:
# delete the dummy file we created:
rm("data.txt")
isfile("data.txt")
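The cells above rely on `writedlm` and `readdlm` being available without an import, as in the Julia version this notebook was run on. From Julia 0.7 onward these functions live in the `DelimitedFiles` standard library, so a round trip on a newer Julia might look like the following sketch. It uses a plain matrix and a temporary file, both chosen only for illustration, so that it runs without any packages.

```julia
using DelimitedFiles  # standard library on Julia 0.7 and later

m = rand(5, 3)                      # rows are points, columns are variables
path = tempname()                   # temporary file path, chosen for illustration

writedlm(path, m, ',')              # write with ',' as the delimiter
m2 = readdlm(path, ',', Float64)    # read back as a Matrix{Float64}
rm(path)                            # clean up the temporary file

roundtrip_ok = size(m2) == size(m) && m2 ≈ m
```

With DynamicalSystems.jl loaded, the matrix returned by `readdlm` would then be wrapped with `Dataset(m2)`, as in the cell above.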

# Neighborhoods 

A "neighborhood" is a collection of points that are near a given point. `Dataset`s interface with the package [`NearestNeighbors`](https://github.com/KristofferC/NearestNeighbors.jl) in order to find this neighborhood.

We use the function `neighborhood`. The call signature is:
```julia
neighborhood(point, tree, ntype)
```


`point` is simply the query point. `tree` is the structure required by [`NearestNeighbors`](https://github.com/KristofferC/NearestNeighbors.jl), and is obtained simply by:

In [121]:
tree = KDTree(dataset)

NearestNeighbors.KDTree{StaticArrays.SArray{Tuple{3},Float64,1,3},Distances.Euclidean,Float64}
  Number of points: 1000
  Dimensions: 3
  Metric: Distances.Euclidean(0.0)
  Reordered: true

The third argument to `neighborhood` is the *type* of the neighborhood. 

* There are two types of neighborhoods!

The first one is defined as the `k` nearest points to a given point. It is represented in code by:

In [122]:
mybuddies = FixedMassNeighborhood(3) # for experts: does a knn search

DynamicalSystemsBase.FixedMassNeighborhood(3)

In [123]:
point = ones(3)
n = neighborhood(point, tree, mybuddies)

3-element Array{Int64,1}:
 250
 196
 526

Notice that the `neighborhood` function does not return the points themselves, but rather the indices of the points in the original data:

In [126]:
println("neighborhood of $(point) is:")

for i in n
    println(dataset[i])
end

neighborhood of [1.0, 1.0, 1.0] is:
[0.794312, 0.950849, 0.902336]
[0.968463, 0.961057, 0.890061]
[0.849936, 0.858092, 0.973547]
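To make "the `k` nearest points, returned as indices" concrete, here is a brute-force sketch in plain Julia. The function `brute_knn` and the helper `euclidean_dist` are hypothetical names. The library uses a `KDTree` rather than this exhaustive search, and the order of the indices it returns may differ from the distance-sorted order produced here.

```julia
# Hypothetical brute-force sketch (the library uses a KDTree instead):
# return the indices of the k points closest to `query`.
euclidean_dist(a, b) = sqrt(sum(abs2, a .- b))

function brute_knn(points, query, k)
    dists = [euclidean_dist(p, query) for p in points]
    return sortperm(dists)[1:k]     # indices into `points`, nearest first
end

points = [[0.0, 0.0], [0.9, 1.0], [0.5, 0.5], [1.0, 0.8]]
idxs = brute_knn(points, [1.0, 1.0], 2)

# As with `neighborhood`, the points themselves are recovered by indexing:
nearest_points = [points[i] for i in idxs]
```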


---

The second type of neighborhood is all the points that are within some given distance `ε` from a point.

In code, we represent this as:

In [127]:
where_u_at = FixedSizeNeighborhood(0.001) # for experts: does an inrange search

DynamicalSystemsBase.FixedSizeNeighborhood(0.001)

In [128]:
n2 = neighborhood(point, tree, where_u_at)

0-element Array{Int64,1}

In [132]:
plz_come_closer = FixedSizeNeighborhood(0.2)
n2 = neighborhood(point, tree, plz_come_closer)

1-element Array{Int64,1}:
 196

In [134]:
println("neighborhood of $(point) is:")

for i in n2
    println(dataset[i])
end

neighborhood of [1.0, 1.0, 1.0] is:
[0.968463, 0.961057, 0.890061]


So all points with distance < ε from the query point make up the neighborhood.

What is the "distance" though? When defining a `tree`, you can optionally give a distance function. By default this is the Euclidean distance, but others also work:

In [135]:
using Distances # to get distance functions
funky_tree = KDTree(dataset, Chebyshev())

NearestNeighbors.KDTree{StaticArrays.SArray{Tuple{3},Float64,1,3},Distances.Chebyshev,Float64}
  Number of points: 1000
  Dimensions: 3
  Metric: Distances.Chebyshev()
  Reordered: true

In [136]:
n3 = neighborhood(point, funky_tree, plz_come_closer)

4-element Array{Int64,1}:
 526
 642
 954
 196
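The change of metric explains why index 526 appears here but not in the Euclidean search with the same `ε = 0.2`. The sketch below recomputes both distances by hand for that point, using the coordinates printed earlier. The two one-line functions are plain-Julia stand-ins written for this example, not the Distances.jl implementations.

```julia
# Plain-Julia stand-ins for the two metrics (written for this example only):
euclidean_dist(a, b) = sqrt(sum(abs2, a .- b))   # root of summed squared differences
chebyshev_dist(a, b) = maximum(abs, a .- b)      # largest absolute difference

point = ones(3)
p526 = [0.849936, 0.858092, 0.973547]   # dataset[526], copied from the output above

d_euclid = euclidean_dist(point, p526)
d_cheb = chebyshev_dist(point, p526)

inside_euclid = d_euclid < 0.2
inside_cheb = d_cheb < 0.2
```

The Euclidean distance is slightly above 0.2 while the Chebyshev distance is about 0.15, so the point lies inside the neighborhood only under the Chebyshev metric.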

# Excluding temporal neighbors

Before moving on, let's see one last thing:

In [137]:
# the point whose neighborhood I want is now part of my dataset:
point = dataset[end]

3-element StaticArrays.SArray{Tuple{3},Float64,1,3}:
 0.809626
 0.155533
 0.911393

In [138]:
# Let's calculate the two neighborhoods again

tree = KDTree(dataset)

# Find suuuuuuuper close neighbors:
ε = 0.000001
where_u_at = FixedSizeNeighborhood(ε)
n2 = neighborhood(point, tree, where_u_at)

# Find the nearest neighbor:
my_best_friend = FixedMassNeighborhood(1)
n3 = neighborhood(point, tree, my_best_friend)

println(n2)
println(n3)

[1000]
[1000]


In [139]:
length(dataset) == n2[1] == n3[1]

*Apparently my best friend is myself. Who would have thought...*

**What is happening here is that the `neighborhood` also counted the point itself, since it is also part of the dataset.**

* Almost always this behavior needs to be avoided. For this reason, there is a second method for `neighborhood`:

```julia
neighborhood(point, tree, ntype, idx::Int, w::Int = 1)
```

In this case, `idx` is the index of the point in the original data. `w` stands for the Theiler window (positive integer).

Only points with index `i` satisfying `abs(i - idx) ≥ w` are returned as a neighborhood, to exclude close temporal neighbors.

* The default `w=1` corresponds to excluding the `point` itself.
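The index condition can be sketched on its own, independently of any tree search. The helper `theiler_filter` below is hypothetical and only applies the rule `abs(i - idx) ≥ w` to a list of candidate indices. How the library combines this rule with the actual search is not shown here.

```julia
# Hypothetical helper: keep only candidate indices that are at least
# `w` steps away (in time) from the query index `idx`.
theiler_filter(candidates, idx, w = 1) = filter(i -> abs(i - idx) >= w, candidates)

candidates = [998, 1000, 437, 999]
kept_default = theiler_filter(candidates, 1000)      # drops only index 1000 itself
kept_w3 = theiler_filter(candidates, 1000, 3)        # also drops 998 and 999
```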

Let's revisit the last example (using the default value of `w = 1`):

In [140]:
point = dataset[end]
idx = length(dataset)

n2 = neighborhood(point, tree, where_u_at, idx)
n3 = neighborhood(point, tree, my_best_friend, idx)

println(n2)
println(n3)

Int64[]
[437]


As you can see, there isn't *any* neighbor of `point` with distance `< 0.000001` in this dataset, but there is always a nearest neighbor:

In [143]:
println(dataset[n3[1]], " is the nearest neighbor of ", point)

[0.81069, 0.135168, 0.914128] is the nearest neighbor of [0.809626, 0.155533, 0.911393]


# Documentation Strings

In [144]:
?Dataset

search: Dataset AbstractDataset DynamicalSystemsBase @dateformat_str



```
Dataset{D, T} <: AbstractDataset{D,T}
```

A dedicated interface for datasets, i.e. vectors of vectors. It contains *equally-sized datapoints* of length `D`, represented by `SVector{D, T}`.

It can be used exactly like a matrix that has each of the columns be the timeseries of each of the dynamic variables. [`trajectory`](@ref) always returns a `Dataset`. For example,

```julia
ds = Systems.towel()
data = trajectory(ds, 1000) #this returns a dataset
data[:, 2] # this is the second variable timeseries
data[1] == data[1, :] # this is the first datapoint (D-dimensional)
data[5, 3] # value of the third variable, at the 5th timepoint
```

Use `Matrix(dataset)` or `reinterpret(Matrix, dataset)` and `Dataset(matrix)` or `reinterpret(Dataset, matrix)` to convert. The `reinterpret` methods are cheaper but assume that each variable/timeseries is a *row* and not column of the `matrix`.

If you have various timeseries vectors `x, y, z, ...` pass them like `Dataset(x, y, z, ...)`. You can use `columns(dataset)` to obtain the reverse, i.e. all columns of the dataset in a tuple.


In [145]:
?neighborhood

search: neighborhood AbstractNeighborhood FixedSizeNeighborhood



```
neighborhood(point, tree, ntype)
neighborhood(point, tree, ntype, n::Int, w::Int = 1)
```

Return a vector of indices which are the neighborhood of `point` in some `data`, where the `tree` was created using `tree = KDTree(data [, metric])`. The `ntype` is the type of neighborhood and can be any subtype of [`AbstractNeighborhood`](@ref).

Use the second method when the `point` belongs in the data, i.e. `point = data[n]`. Then `w` stands for the Theiler window (positive integer). Only points that have index `abs(i - n) ≥ w` are returned as a neighborhood, to exclude close temporal neighbors. The default `w=1` is the case of excluding the `point` itself.

## References

`neighborhood` simply interfaces the functions `knn` and `inrange` from [NearestNeighbors.jl](https://github.com/KristofferC/NearestNeighbors.jl) by using the argument `ntype`.


In [146]:
?AbstractNeighborhood

search: AbstractNeighborhood



```
AbstractNeighborhood
```

Supertype of methods for deciding the neighborhood of points for a given point.

Concrete subtypes:

  * `FixedMassNeighborhood(K::Int)` : The neighborhood of a point consists of the `K` nearest neighbors of the point.
  * `FixedSizeNeighborhood(ε::Real)` : The neighborhood of a point consists of all neighbors that have distance < `ε` from the point.

See [`neighborhood`](@ref) for more.
