GitHub - scicloj/tablecloth at 18d6e904a28220cae6c6f3bc9420b447e35a3853

Versions

tech.ml.dataset 5.x (master branch)

tech.ml.dataset 4.x (4.0 branch)

[scicloj/tablecloth "4.04"]

Introduction

tech.ml.dataset is a great and fast library that brings columnar datasets to Clojure. Chris Nuernberger has been working on this library for the last year as part of the bigger tech.ml stack.

I’ve started to test the library and help fix the bugs uncovered along the way. My main goal was to compare its functionality with the established standards on other platforms. I focused on R solutions: dplyr, tidyr and data.table.

While converting the examples, I came up with a way to reorganize existing tech.ml.dataset functions into a simple-to-use API. The main goals were:

Focus on dataset manipulation functionality, leaving out other parts of tech.ml like pipelines, datatypes, readers, ML, etc.
Single entry point for common operations - one function dispatching on the given arguments.
group-by results in a special kind of dataset - a dataset that contains the subsets created by grouping as a column.
Most operations recognize both regular and grouped datasets and process data accordingly.
Every function takes the dataset as its first argument to enable thread-first on a dataset.
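
The goals above can be illustrated with a small sketch. It assumes tablecloth is on the classpath; the toy data and the column names :group and :value are made up for illustration and do not come from the library.

```clojure
(require '[tablecloth.api :as api])

;; Toy data; the column names :group and :value are made up for this sketch.
(def ds (api/dataset {:group [:a :a :b]
                      :value [1 2 3]}))

;; Single entry point: select-rows dispatches on its argument,
;; here a sequence of row indexes and a row predicate.
(def by-index     (api/select-rows ds [0 2]))
(def by-predicate (api/select-rows ds (fn [row] (> (:value row) 1))))

;; group-by returns a grouped dataset: a dataset whose rows describe
;; the groups and which keeps each group's subset in a column.
(def grouped (api/group-by ds :group))

;; The same aggregate call handles a regular and a grouped dataset,
;; and the dataset-first argument order keeps everything usable with ->.
(def overall (api/aggregate ds {:total #(reduce + (% :value))}))
(def per-group
  (-> ds
      (api/group-by :group)
      (api/aggregate {:total #(reduce + (% :value))})))
```

Printing grouped and per-group at the REPL is probably the quickest way to see how a grouped dataset differs from a regular one.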

Important! This library is neither a replacement for tech.ml.dataset nor a separate library. It should be considered an addition on top of tech.ml.dataset.

If you want to know more about tech.ml.dataset and dtype-next, please refer to their documentation.

Join the discussion on Zulip

Documentation

Please refer to the detailed documentation with examples.

Usage example

(require '[tablecloth.api :as api]
         '[tech.v3.datatype.datetime :as dtype-dt]
         '[tech.v3.datatype.functional :as dfn])

;; Load the stocks CSV, average :price per symbol and year,
;; then show the first 10 rows ordered by symbol and year.
(-> "https://raw.githubusercontent.com/techascent/tech.ml.dataset/master/test/data/stocks.csv"
    (api/dataset {:key-fn keyword})
    (api/group-by (fn [row]
                    {:symbol (:symbol row)
                     :year (dtype-dt/long-temporal-field :years (:date row))}))
    (api/aggregate #(dfn/mean (% :price)))
    (api/order-by [:symbol :year])
    (api/head 10))

_unnamed [10 3]:

:summary	:year	:symbol
21.74833333	2000	AAPL
10.17583333	2001	AAPL
9.40833333	2002	AAPL
9.34750000	2003	AAPL
18.72333333	2004	AAPL
48.17166667	2005	AAPL
72.04333333	2006	AAPL
133.35333333	2007	AAPL
138.48083333	2008	AAPL
150.39333333	2009	AAPL

Contributing

Tablecloth is open for contributions. The best way to start is a discussion on Zulip.

Development tools for documentation

Documentation is written in RMarkdown, which means that you need R to create html/md/pdf files. The documentation contains around 600 code snippets, which are run during the build. There are two files:

README.Rmd
docs/index.Rmd

Prepare the following software and run these steps:

Install R
Install rep, an nREPL client
Install pandoc
Run nREPL
Run R and install R packages: install.packages(c("rmarkdown","knitr"), dependencies=T)
Load rmarkdown: library(rmarkdown)
Render readme: render("README.Rmd","md_document")
Render documentation: render("docs/index.Rmd","all")

Guideline

Before committing changes, please run the tests. I usually do: lein do clean, check, test and build the documentation as described above (which also tests the whole library).
Keep the API as simple as possible:
- the first argument should be a dataset
- if the parametrization is complex, the last argument should accept a map of optional function arguments
- avoid variadic associative destructuring for function arguments
- functions should usually work on grouped datasets as well; in that case accept a parallel? argument (if applicable).
Follow the potemkin pattern and import functions into the API namespace using the tech.v3.datatype.export-symbols/export-symbols function
Functions which are composed of API functions to cover specific case(s) should go into the tablecloth.utils namespace.
Always update README.Rmd, CHANGELOG.md and docs/index.Rmd; tests and function docs are highly welcome
Always discuss changes and PRs first
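
As an illustration of the API rules above, here is a minimal sketch of a function shaped this way. The name first-by and its :n and :order options are hypothetical and not part of tablecloth; only api/dataset, api/order-by and api/head are taken from the library, and tablecloth is assumed to be on the classpath.

```clojure
(require '[tablecloth.api :as api])

;; Hypothetical helper, not part of tablecloth.
;; - the dataset is the first argument, so the function works with ->
;; - optional settings arrive in one trailing map instead of
;;   variadic `& {:keys [...]}` destructuring
(defn first-by
  "Order `ds` by `column` and keep the first `n` rows.
  Options (all optional): `:n` (default 5), `:order` (:asc or :desc, default :asc)."
  ([ds column] (first-by ds column nil))
  ([ds column {:keys [n order] :or {n 5 order :asc}}]
   (-> ds
       (api/order-by column order)
       (api/head n))))

;; Usage with made-up data: the option map can be omitted entirely.
(def toy (api/dataset {:x [3 1 2]}))
(def smallest-two (first-by toy :x {:n 2}))
(def largest-two  (-> toy (first-by :x {:n 2 :order :desc})))
```

Passing options as a single map makes it straightforward to forward them unchanged from one function to another, which is one likely reason to prefer it over variadic keyword arguments.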

TODO

tests
tutorials

Licence

The MIT Licence
