Skip to content

tpapp/RaggedData.jl

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

RaggedData

Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public. Build Status Coverage Status codecov.io

Introduction

This package is for ingesting and working with ragged (non-rectangular) data. To illustrate, let 1, 2, … denote observations for individuals with a given index. A dataset may look like this:

file1:

id,observation
1,obs11
1,obs12
1,obs13
2,obs21
3,obs31

file2:

id,observation
1,obs14
2,obs22
2,obs23
3,obs32

which has 4 observations for individual 1, 3 for individual 2, and 2 for individual 3.

Ingestion

The type RaggedCounter(T,S) will help count keys of type T with integers ::S (eg Int64, though Int32 should do for most datasets, and may save memory if you are processing a lot of data). In the first pass, you iterate through the data (possibly parsing and saving it, see LargeColumns.jl), using (::RaggedCounter)(id) and you end up with the following counts:

id count
1 4
2 3
3 2

Then for the second pass, you collate the records so that that observations for the same individual are adjacent. Use collate_index_keys to return a collate::RaggedCollate and an index::RaggedIndex object, and possibly the keys, in the order they were encountered.

The RaggedCollate object can keep track of indices for the second pass: next_index! will return the next index for a given key.

Ragged indices

A RaggedIndex object can be used to address ragged data packed flat into a vector. For example, if the vector contains observations

111122233

Then the first ragged index would be 1:4, the second 5:8, the third 9:10.

Ragged columns

RaggedColumns and RaggedColumn are thin wrapper types to address a tuple of vectors using ragged indices.

Acknowledgments

Work on this library was supported by the Austrian National Bank Jubiläumsfonds grant #17378.

About

Library for collating and indexing ragged data.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages