2019-05-23-clojure-scientists.md

{:title "Are we scientists yet?" :layout :post :toc true :author "Alan Marazzi" :tags ["data science"]}

Previous version of this post

The Clojure community has been very active on the data science front lately, but we felt we needed more organization and more open discussion around these topics. This is the Clojureverse thread that started it all. Here we try to collect and record the current state of things, and I want to stress that this document is owned by the community!

Each section below is structured as follows:

  • Name of the problem - data science is a stack of problems and one must have solutions to all of them to really be productive
  • Notable examples - what's considered standard nowadays in other languages
  • Status - the current status of the matter
  • Next - the next best actions

Multidimensional arrays, Linear-algebra

Generic computation libraries. Here we should strive for the best: both GPU and CPU capability, multidimensional arrays, broadcasting, etc.

Notable examples

Status

Many libraries are popping up at various levels of maturity. Some of them are:

Next

We probably don't need more libraries in this realm. What would be great next is:

  • Extended docs - Something like https://docs.scipy.org/doc/numpy/index.html
  • Tutorials - Common use cases, advanced stuff, etc
  • Bridges & extensions - Libraries and packages connecting these frameworks to other libraries or extending their functionality

Plotting

Plotting is important for both analysis and presentation of results. Thanks to ClojureScript we may well have an edge over other languages here.

Notable examples

Status

There are many libraries here as well. Some of them are:

Next

There's a lot of active development in this realm. What would be helpful:

  • Tutorials - Common use cases, advanced stuff, etc

Geospatial library

Deal with coordinates on a map.

Notable examples

Status

There are a few libraries in this realm, though most are dated:

Next

This is another area where Clojure could shine thanks to its concurrency model. The fact that it is easy to work with Spark or Onyx is certainly a plus if you have big data, while for smaller workloads parallel Clojure might be enough to speed up pipelines considerably.

  • Libraries - This is an area where we are still lacking
  • Pluggability - It would be very interesting to see libraries built on top of Spark or similar
  • New things - Tooling in this space is very primitive even for more mainstream languages. Most of the time people end up with bash scripts doing most of the work and gluing things together. This realm is a possible win for Clojure if we're able to come up with better solutions than other languages
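As a rough illustration of the "parallel Clojure" point above, here is a minimal sketch that computes great-circle distances from one origin to many points in parallel. It assumes points are plain `[lat lon]` vectors in degrees and a spherical Earth with a 6371 km radius. `haversine-km` and `distances-from` are hypothetical names made up for this example, not functions from any existing library; only `pmap` and the other `clojure.core` functions are real.

```clojure
(defn haversine-km
  "Great-circle distance in km between two [lat lon] points given in degrees.
  Assumes a spherical Earth with a radius of 6371 km."
  [[lat1 lon1] [lat2 lon2]]
  (let [r      6371.0
        to-rad #(Math/toRadians %)
        dlat   (to-rad (- lat2 lat1))
        dlon   (to-rad (- lon2 lon1))
        a      (+ (Math/pow (Math/sin (/ dlat 2)) 2)
                  (* (Math/cos (to-rad lat1))
                     (Math/cos (to-rad lat2))
                     (Math/pow (Math/sin (/ dlon 2)) 2)))]
    (* 2 r (Math/asin (Math/sqrt a)))))

(defn distances-from
  "Distances in km from origin to every point, computed in parallel.
  Points are processed in chunks so that each pmap task does enough work
  to outweigh the coordination overhead."
  [origin points]
  (->> (partition-all 1024 points)
       (pmap (fn [chunk] (mapv #(haversine-km origin %) chunk)))
       (into [] cat)))
```

The chunk size of 1024 is an arbitrary choice for the sketch. Whether parallelism pays off at all depends on how expensive the per-point work is, so it is worth measuring against a plain `mapv` first.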

Dataframe or similar

Today's data scientists are used to working with tabular data, so we need a good way to deal with it.

Notable examples

Status

The picture has improved lately, but there still isn't consensus.

Next

  • Extended docs - Something like https://pandas.pydata.org/pandas-docs/stable/
  • Tutorials - Common use cases, advanced stuff, etc
  • Bridges & extensions - Libraries and packages connecting these frameworks to other libraries or extending their functionality

Graphs

Graphs can solve many problems smartly and efficiently. Most of the time, a well-designed graph can replace much more complex solutions.

Notable examples

Status

The state of things is pretty good, which makes sense considering the native Clojure data structures and the nature of graphs.
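To illustrate why native data structures are a natural fit, here is a minimal sketch that represents a directed graph as a plain map from each node to the set of its neighbours and walks it breadth-first. The `graph` data and the `reachable` function are made up for this example and do not come from any of the libraries; they assume nodes are non-nil values such as keywords.

```clojure
;; A directed graph as a map: node -> set of neighbours.
(def graph
  {:a #{:b :c}
   :b #{:d}
   :c #{:d}
   :d #{}})

(defn reachable
  "Set of nodes reachable from start (including start itself),
  found with a breadth-first traversal."
  [g start]
  (loop [visited #{start}
         queue   (conj clojure.lang.PersistentQueue/EMPTY start)]
    (if-let [node (peek queue)]
      ;; Enqueue only neighbours we have not seen yet.
      (let [fresh (remove visited (get g node #{}))]
        (recur (into visited fresh)
               (into (pop queue) fresh)))
      visited)))
```

With the data above, `(reachable graph :b)` evaluates to `#{:b :d}`.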

Next

Graphs are mostly a solved problem, but they have only recently started to be used extensively, and there is still a lot of room for improvement in distributed graphs.

  • Extend - Considering that GraphEngine runs on the CLR it should be possible (and very interesting) to get a Clojure API
  • Tutorials - Let's show people the power of wielding graphs and Clojure together!

Statistics & probprog

Very important as the base for ML systems, simulations and data analytics.

Notable examples

Status

There are already many examples:

Next

The main building blocks are all here. What we are missing:

  • Bridges - At least some of these libraries should be able to communicate seamlessly with each other, with dataset abstractions and with arrays. Ideally, one of them would be able to run on the GPU, either directly (like bayadera) or through MXNet
  • Extensions - Better and easier abstractions. For instance a function to easily calculate ROC-AUC
  • Extended docs - Something like https://pandas.pydata.org/pandas-docs/stable/
  • Tutorials - Common use cases, advanced stuff, etc
  • Bayesian extensions - There's still nothing with gradient-based algorithms such as Hamiltonian Monte Carlo. It would be really cool to get something like TensorFlow Probability in the MXNet realm
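As an example of the kind of small helper the "Extensions" point asks for, here is a minimal sketch of ROC-AUC in plain Clojure. It uses the pairwise formulation: the AUC is the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. `roc-auc` is a hypothetical name, not an existing library function, and the sketch assumes labels are the integers 0 and 1. It is quadratic in the number of examples, so it is meant for clarity rather than speed.

```clojure
(defn roc-auc
  "Area under the ROC curve for binary labels (0/1) and numeric scores.
  Computed as the share of (positive, negative) pairs ranked correctly,
  counting ties as half. Returns nil if either class is missing."
  [labels scores]
  (let [pairs (map vector labels scores)
        pos   (for [[l s] pairs :when (= 1 l)] s)
        neg   (for [[l s] pairs :when (= 0 l)] s)]
    (when (and (seq pos) (seq neg))
      (/ (reduce + (for [p pos, n neg]
                     (cond (> p n)  1.0
                           (== p n) 0.5
                           :else    0.0)))
         (* (count pos) (count neg))))))
```

For labels `[0 0 1 1]` and scores `[0.1 0.4 0.35 0.8]`, three of the four positive/negative pairs are ordered correctly, so `roc-auc` returns `0.75`.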

Machine learning

General modeling. The aim should be to have something simple, usable and reliable, with a consistent interface.

Notable examples

Status

There has been some movement in this area lately:

Next

We can still decide whether we want to pursue the R model (many small libraries) or the scikit-learn way (one big framework with batteries included). The important thing is to have a common interface to algorithms and utilities.

Such an interface would be the opposite of what happens in the R world, where developers and researchers have more freedom to deliver their ideas (R is usually the first language to get implementations of new algorithms), but the cognitive overhead for users is pretty high.

  • Bridges - At least some of these libraries should be able to communicate seamlessly with each other, with dataset abstractions and with arrays
  • Extensions - More models, faster or more memory efficient training and so on
  • Extended docs - Something like https://scikit-learn.org/stable/index.html
  • Tutorials - Common use cases, advanced stuff, etc
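To make the idea of a common interface a bit more concrete, here is one possible shape it could take, sketched as a Clojure protocol with a scikit-learn-like `fit`/`predict` pair. Everything here is hypothetical: the `Model` protocol, its method names and the toy `MeanRegressor` are assumptions made for illustration, not the API of any existing library.

```clojure
(defprotocol Model
  (fit [model xs ys] "Returns a new model trained on rows xs and targets ys.")
  (predict [model xs] "Returns one prediction per row of xs."))

;; A deliberately trivial model: always predicts the mean of the training targets.
(defrecord MeanRegressor [mean]
  Model
  (fit [model xs ys]
    (assoc model :mean (/ (reduce + ys) (double (count ys)))))
  (predict [model xs]
    (mapv (constantly mean) xs)))
```

Any algorithm implementing the same protocol could then be swapped in without changing the calling code, for example `(predict (fit (->MeanRegressor nil) train-xs train-ys) test-xs)`, where `train-xs`, `train-ys` and `test-xs` stand for the caller's own data. Because `fit` returns a new value instead of mutating the model, trained models stay ordinary immutable data.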

NLP

Natural Language Processing is at the bleeding edge right now, but Clojure is lagging behind.

Notable examples

Status

There are mainly two libraries dealing with NLP at the moment, and one of them is currently looking for maintainers:

Next

It might very well be that all we need is a couple of very thorough and dedicated libraries, but we're not there yet.

  • Maintain - clojurenlp is currently looking for maintainers. Get in touch with them if you're interested
  • Extend - increase the functionality and the performance

Image processing

Before doing anything with CNNs you have to read, process and transform images. The state of things here is much better than for many of the other sections!

Notable examples

Status

We're basically ready to do anything we want with images!

Deep learning

Important for computer vision, NLP and other problems.

Notable examples

Status

We're pretty much covered, especially thanks to Carin Meier's work. What can really be improved are docs, examples and tutorials.

Next

  • Gluon - Having access to the Gluon API in MXNet would be very useful

Disclaimer: None of the lists should be considered complete; they are just some examples. Everything can be amended by the community: if you think something is missing, wrong, misplaced or anything else, just let the community know!