What do we want to build? #1

LukeMathWalker opened this issue Apr 28, 2019 · 73 comments

@LukeMathWalker (Member) commented Apr 28, 2019

Welcome!

I created this repository as a discussion hub for the ML ecosystem in Rust, following a talk I gave at the Rust meetup in London (slides).

I do believe that Rust has great potential in this area, but to fully realize it we need to provide the building blocks: we need to tackle the shared challenges that, once solved, will enable more and more people to just come to Rust and build what they want to build.

The three building blocks I see as fundamental for an ML ecosystem are:

  • n-dimensional arrays;
  • dataframes;
  • an ML model interface.

I have spent the last year, when it comes to open-source contributions, enhancing n-dimensional arrays: direct contributions to ndarray, statistical routines on top of it (ndarray-stats) and tutorials to help people get into the Rust scientific ecosystem from Python, Julia or R. I do believe that ndarray is in more than good shape when it comes to fulfilling NumPy's role in the Rust ecosystem.

There is now movement as well when it comes to dataframes - a discussion is taking place at rust-dataframe/discussion#1 to explore use cases and potential designs. (The idea of opening this repository comes directly from this experiment of community-led design for dataframes).

Given that one of the two data structures that are usually consumed by ML models is ready (n-dimensional arrays) and the other one is baking (dataframes), I think it's time to start thinking about what to do with the ML-specific piece.

I don't want to steer the debate too much with the opening post (I'll chip in once the discussion starts), but the questions I'd like to see tackled are:

  • what use-cases could make Rust shine in the ML ecosystem?
  • what are the basic capabilities that have to be built to enable the usage of Rust for ML workloads?
  • how should we structure such a project? A core library with a few traits and a set of separate crates tackling different aspects? A large batteries-included scikit-learn equivalent?
  • why do you want to use Rust for ML?
@Kibouo commented May 1, 2019

I want to note that, while it works great, https://github.com/twistedfall/opencv-rust is not particularly user-friendly or 'clean' in Rust terms.

Maybe we could have a look at it?

@flo-dhalluin commented May 1, 2019

I think the use case that could make Rust shine is deployment. Currently the de-facto "mainstream" stack is Python-based (scikit-learn, NumPy, pandas + your DL framework of choice: TensorFlow, Torch, ...). It shines for fast prototyping, because Python, but it sucks for industrialization (and deployment), because... Python. I really think Rust would do great in that area. I kinda like TensorFlow Serving, but it forces you to have a separate service (that you call with their protobuf/RPC).
So:

  • nice conventions for training/inference;
  • standard ways of serializing and loading models, and exposing them to more "enterprisey" stacks, either with some kind of FFI (e.g. JVM <-> JNI) or RPC;
  • all the goodies required for industrial setups (monitoring, robustness, ease of deployment, ...).
@jbowles (Member) commented May 1, 2019

I'm currently building a large project with Rust (I mention it here: https://users.rust-lang.org/t/interest-for-nlp-in-rust/15331/9), where I am doing the data engineering in Rust (lots of string metrics). [tl;dr: I found lots of disparate projects with 50% of what I needed for string metrics, but instead rolled my own, trying to incorporate previous work and give credit.] I want to feed the feature vectors to Julia to experiment with what I want to use for classification and modelling, and then I'll want to be able to use Rust for inference/classification etc. I had to pause development for business reasons, but I'm starting again: one of my biggest issues was not ML-related but finding a nice pattern for parallel file download (seems like it should be simple, but maybe I'm spoiled by Go's simplicity lol).

From this real-world project point of view, as well as from my time spent thinking in the abstract and surveying the ML ecosystem in Rust (about a year), I would think that a focus on data engineering in general and serving models is the way to go (this also seems to be a widely shared sentiment). In a practical sense I would like to see Rust jobs for data engineers and machine learning engineers... that is, the bookends of a typical data science project: serving the data and serving the model.

That is, targeting software developers, infrastructure, computational math, and data people. Trying to convince research scientists to use Rust would be wasted effort; for most of these people software is a secondary skill, so they need something easy to learn, dynamically typed, with a REPL... I've watched this play out in the Python/R/Matlab versus Julia world... and while IMO Julia has a lot to offer current Python/R/Matlab devs and is similar enough to those languages, getting that group of people to use Julia is not easy; I can't imagine what it'd be like proposing Rust.

Here are some challenges I see:

  • Dataframes: figuring out what to do with missing data is a challenge (I watched the Julia community struggle with that this last year).
  • Linear algebra: ndarray and nalgebra are both active projects... is there duplicated effort? (There are others as well.)
  • Rust types more friendly for math: I've seen the power in Julia of being able to specify AbstractArray as a type, or have Real as a type, which lets you build generic functions that accept a vector of float32 or float64 (a small Rust sketch of this follows after this list).
  • Swift: Google and numerous well-known people (Chris Lattner [LLVM, Swift], Jeremy Howard [fast.ai]) have put their support behind Swift for TensorFlow. IMO Swift has a really long way to go, but for Rust, tackling areas the swift-for-tf project is not focusing on would be good.
  • Support for Julia: integration with Python is a necessity, but if there is a competitor for the research scientist in the Python world it is Julia, and I'd imagine keeping an eye on playing well with Julia could be a benefit. Competition here is hard to forecast, and Julia and Rust are on really different ends of the spectrum; while Julia pushes solving the "two language" problem, I see no problem using Rust and Julia in a project. I doubt competition is an issue, not like with Swift.
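
As a small illustration of the generic-math point (a hedged sketch assuming the num-traits crate; the mean function here is hypothetical):

use num_traits::Float;

// One generic function accepts slices of f32 and f64 alike,
// much like a Julia function taking a Real.
fn mean<T: Float>(xs: &[T]) -> T {
    let sum = xs.iter().fold(T::zero(), |acc, &x| acc + x);
    sum / T::from(xs.len()).unwrap()
}

fn main() {
    let a: Vec<f32> = vec![1.0, 2.0, 3.0];
    let b: Vec<f64> = vec![1.5, 2.5];
    println!("{} {}", mean(&a), mean(&b));
}
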
@jbowles (Member) commented May 1, 2019

I do believe that ndarray is in more than good shape when it comes to fulfilling NumPy's role in the Rust ecosystem.

Really looking forward to digging into ndarray. Though I've had a slight delay, I'm writing up ndarray examples for the Grokking Deep Learning book, where Andrew Trask introduces deep learning with only numpy. He's expressed interest and welcomed the examples... :)

@soaxelbrooke commented May 1, 2019

A standardized tokenization implementation!

Tokenization fills the role of turning text into the fixed vectors you'd feed into standard models. As an NLP practitioner and Rust user, I see tokenization as an incredibly important step in the pipeline, a big barrier to new people trying to apply NLP, and a place where lots of small bugs creep in from non-standard implementations and take forever to find. Having a standard implementation of the simpler tokenization methods (like regex matching) would make NLP problems much more approachable in Rust.
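
For illustration, a minimal sketch of a regex-based tokenizer (assuming the regex crate; the tokenize function and its pattern are hypothetical choices, not a proposed standard):

use regex::Regex;

// Split text into word tokens with one compiled pattern; a standard
// crate would ship well-tested, documented patterns instead.
fn tokenize<'a>(re: &Regex, text: &'a str) -> Vec<&'a str> {
    re.find_iter(text).map(|m| m.as_str()).collect()
}

fn main() {
    let re = Regex::new(r"\w+(?:'\w+)?").unwrap();
    assert_eq!(tokenize(&re, "don't panic!"), vec!["don't", "panic"]);
}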

@DhruvDh commented May 1, 2019

One part of machine learning where Rust could shine right now is simulation for reinforcement learning.

For instance, if I'm training an agent to play blackjack, the biggest bottleneck is the agent "playing" blackjack over and over to collect enough data for training.

Rayon and Actix could be used to create fast, performant game "environments" now, without the need for an established ML ecosystem.
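
A minimal sketch of that idea with Rayon (the episode itself is a made-up stand-in, not a real blackjack environment):

use rayon::prelude::*;

// Stand-in for one self-play episode; a real environment would deal
// cards and query the agent's policy. Returns the episode reward.
fn play_episode(seed: u64) -> f64 {
    if seed % 3 == 0 { 1.0 } else { -1.0 }
}

fn main() {
    // Collect experience from a million episodes across all cores.
    let episodes = 1_000_000u64;
    let total_reward: f64 = (0..episodes)
        .into_par_iter()
        .map(play_episode)
        .sum();
    println!("mean reward: {}", total_reward / episodes as f64);
}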

@yngtodd commented May 1, 2019

I agree with @DhruvDh, using Rust to simulate environments for RL agents would be great.

Having something akin to OpenAI's gym interface would be really nice. Many RL researchers are still going to want to use Python and all the associated deep learning libraries, so I would love to see RL environments written in Rust that agents in both Python and Rust could interface with.

Edit: I imagine that algorithms like Monte Carlo Tree Search would be really useful if they were written in Rust. I would not want to wait on Python to handle that bit.
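
As a rough sketch of what a gym-style interface might look like in Rust (all names here are hypothetical, not an existing crate):

/// A gym-like environment: an agent observes, acts, and receives a reward.
pub trait Environment {
    type Action;
    type Observation;

    /// Reset to an initial state and return the first observation.
    fn reset(&mut self) -> Self::Observation;

    /// Apply an action; return (next observation, reward, episode done).
    fn step(&mut self, action: Self::Action) -> (Self::Observation, f64, bool);
}

A Rust implementation of such a trait could then be exposed to Python (e.g. via PyO3) so that Python-based agents can drive the same fast environments.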

@masonk commented May 1, 2019

if I'm training an agent to play blackjack, the biggest bottleneck is the agent "playing" blackjack over and over to collect enough data for training

Along these lines, I am working on a hobby project (link), which does this. It isn't quite ready for even an alpha release yet, but I am in the final stages of cleaning up the API with the intent to publish it.

@masonk commented May 1, 2019

Things Rust definitely needs:

  • const generics;
  • 16-bit floats;
  • GATs (for efficient, non-copying iterators).

Things that we might want, but I'm not sure:

  • standard Inference + Train traits (a rough sketch follows below);
  • a standard dataframes trait.
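
To make the first of those concrete, a minimal sketch of what such standard traits might look like (all names here are hypothetical, just to anchor the discussion):

/// Run a trained model on new data.
pub trait Infer {
    type Input;
    type Output;

    fn predict(&self, input: &Self::Input) -> Self::Output;
}

/// Consume a training configuration plus data and produce a fitted model.
pub trait Train {
    type Input;
    type Target;
    type Model: Infer;

    fn train(self, inputs: &Self::Input, targets: &Self::Target) -> Self::Model;
}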

@kazimuth commented May 1, 2019

I've been thinking of building a rust deep learning / GPU compute library on top of the TVM framework for a while now. I think it could address a lot of the things @flo-dhalluin is talking about. TVM's an amazing project that's currently flying a bit under the radar. It's an open source deep learning compiler - it compiles deep neural nets / large array operations to run on the GPU (or on OpenCL, or FPGA, or TPU, or WebGL...). You define an AST of computations via its API, and it spits out a small (<5mb) shared library containing just the operations you wanted, on whatever acceleration framework and target platform you want.

It currently has a working Rust runtime library, which lets you call a compiled model from Rust. It integrates with ndarray, and will let you e.g. take in an ndarray::Array, move it to a GPU, run whatever numerical operations you want on it, and get the result back as an ndarray::Array again.

That's pretty neat, and I don't think it would be too hard to build some really cool tools on top of it. My dream is something like:

lib.rs:

// a crate based on tvm
// `cargo build` will (by default) download + checksum a prebuilt TVM library
// that this links to, so that you don't have to wait for a whole compiler to compile.
// The download will only be ~50mb -- way smaller and easier than lots of other deep
// learning frameworks. It will also support running code on things besides cuda!
// The output binary won't need to link the compiler (by default) and will therefore be
// only a few megabytes.
extern crate tvmrs;

// a procedural macro that converts Rust code to Relay IR.
// Relay IR is TVM's high-level IR for defining neural networks / computation chains,
// sorta like a tensorflow Graph. It's also not too dissimilar to Rust.
// The macro will compile the IR with TVM at build-time, and link the resulting artifacts
// to this rust library.
tvmrs::accelerate! {

  // stateless operation
  fn relu_downsample(x: Tensor[c, n, h, w]) -> Tensor[c, n, h/2, w/2] {
     relu(downsample(x))
  }

  // stateful operation
  struct Block<oc> {
    conv: Conv2d<3,3,oc>,
    elu: Elu
  }
  impl Op for Block<oc> {
    fn run(self, input: Tensor[c, n, h, w]) -> Tensor[oc, n, h, w] {
       self.elu(self.conv(input))
    }
  }

  fn swap_channels(x: Tensor[2, n, h, w]) -> Tensor[2, n, h, w] {
    // a low-level tensor operation defined as a TVM Tensor expression.
    let out = compute!(x.shape, |cc, nn, hh, ww| x[((cc + 1) % 2, nn, hh, ww)]);
    out
  }
 
  // a sequential network container.
  sequential! Network {
     #[opencl] Block<3,3,5>, // run on opencl
     #[opencl] relu_downsample,
     #[opencl] Conv2d::new(1,1,2),
     #[rust] debug,    // call a normal rust function
     #[cpu] swap_channels // run this part on CPU to maximize throughput
  }

  // Compute a derivative of the network.
  // Relay IR is designed to be differentiable.
  derivative! NetworkDerivative (Network);
}

// a normal rust function
fn debug(x: Tensor) {
  ...
}

train.rs:

fn main() {
  tvmrs::training_loop! {
    net: Network,
    dnet: NetworkDerivative,
    epochs: 37,
    training_data: dataset! {...},
    valid_data: dataset! {...},
    ...
  }
}

run.rs:

fn main() {
   let input = tvmrs::ndarray_from_stdin();
   let output = Network::load_params("params.bin").run(input);
   println!("{:?}", output);
}

(Further reading: Introduction to Relay, TVM Tensor expressions)

All of this is of course pending mountains of bikeshedding; I have no idea what the final API will look like.

One of the nifty things here is that this isn't limited to deep learning models. TVM can handle pretty much any algorithm made of large array operations. So if you wanted to run your SVM on GPU, you could do that pretty easily!

Steps to take here:

  • Talk to the TVM people and see what they think of all this. We could do this work under their umbrella or in a fresh project.
  • Write Rust bindings to the TVM compiler (instead of just the runtime). TVM is written in C++ but is designed to be easy to bind; a lot of the work has already been done here.
  • Design an API like my sketch above that wraps the bindings in some way that makes them easy to use for training + deployment.
  • Build up cargo tooling to allow e.g. prebuilt binary downloads, TVM's auto-tuner support, etc.
  • Beef up TVM's autodifferentiation support. TVM can differentiate Relay IR, but a lot of derivatives aren't actually implemented yet. We could also roll our own autodifferentiation system and just use TVM for compilation; I'd prefer to avoid duplicating work though.
  • Start writing non-deep-learning algorithms with this system as well, to kick the tires.

If people are interested in this implementation path we could throw a repo together and start work.

I mainly want this because I don't want to be stuck using Python and CUDA all the time for my deep learning research :)))

@koute commented May 1, 2019

A few months ago I started a crate of my own for deep learning. My goal is to have a library which:

  • Supports both inference and training.
  • Supports the most common deep neural network architectures.
  • Is GPU accelerated.
  • Doesn't use CUDA.
  • Supports every mainstream platform (Linux, MacOS, iOS, Android, Windows, WebAssembly) and hardware (AMD, NVIDIA, Intel GPUs) with a single codebase, and uses the same kernels for consistent results.
  • Is written in pure Rust so that it's trivial to cross-compile.
  • Has a simple to use Keras-like API.
  • Is small and simple enough that it can be reasonably understood and tested end-to-end. (Otherwise you risk situations like the one with TensorFlow, where for two whole versions their dropout layer was completely broken.)

It's currently totally useless. Right now I'm in the process of adding a Vulkan backend (I have a few thousand lines of work-in-progress code on my disk which I've not pushed yet); once I finish that in a few weeks I plan to build it up further so that I can train CIFAR-10 up to at least ~90% accuracy, add some model import/export functionality (probably from/to the ONNX format), and only then will it be actually usable for something practical.

Some people would call this a waste of time and effort, and, well, I do agree that it would probably be more productive not to do this completely from scratch as I'm doing (e.g. by using TVM as kazimuth said), but I don't really care - I'm just trying to scratch my own itch.

@DhruvDh commented May 1, 2019

@kazimuth while I love the snippets you've shown here, a lot of my love for Rust exists because of all the compile-time checks the compiler does and the wonderfully easy-to-comprehend error messages. I feel that if one is using Rust just as a way to compose and run functionality defined in other languages, then there isn't much to gain here. Might as well just use Python.

And TVM looks more like a tool for deploying neural nets rather than training them, which is very useful, but I would prefer to do both in Rust.

There's also tch-rs - bindings to PyTorch's libtorch.

Something else that is also interesting is dual_num which, as best I understand it, is some fancy math that might eventually get us to automatic differentiation.

@DhruvDh commented May 1, 2019

@koute the long-term road-map is amazing, but I don't get why you'd bother putting effort into the TensorFlow backend. Admittedly I don't have enough know-how to imagine what a native backend would look like and the kind of work it would need.

@koute commented May 1, 2019

@DhruvDh The TensorFlow backend will most likely be removed in the future. Currently it is there for a few reasons:

  • I wanted to quickly get something working to experiment with, and to be able to first work on the general interface of the library (e.g. defining the neural network graph, getting data in and out, etc.)
  • I can use it to write a comprehensive test suite and then cross-check that against my own backend. ML algorithms are very hard to write correctly, so I want the extra insurance that my algorithms not only match what I have on paper, but also match another widely used framework. (Although from the amount of bugs I've encountered when dealing with TensorFlow, it probably would have been better to pick a different framework...)
@jbowles (Member) commented May 1, 2019

Some cool stuff coming to light. Is anyone familiar with work presented at c4ML? https://www.c4ml.org/
I don't think any of the presentations were using Rust... but this is certainly a space Rust could be competitive in. With that in mind, are any of the Rust compiler team interested in ML?

Here are some references to work being done in Swift and Julia (note: Rust, Swift and Julia were all top of the list for Google's TensorFlow project that eventually became swift-for-tf): automatic differentiation and differentiable programming (https://github.com/tensorflow/swift/blob/master/docs/AutomaticDifferentiation.md, https://juliacomputing.com/blog/2019/02/19/growing-a-compiler.html), Swift MLIR (https://drive.google.com/file/d/1hUeAJXcAXwz82RXA5VtO5ZoH8cVQhrOK/view) and Julia Zygote (https://www.julialang.org/blog/2018/12/ml-language-compiler).

I don't know of any projects in Rust along these lines ^^ ... of course, they are also all funded (Google, and Julia Computing).

@DhruvDh commented May 1, 2019

@koute yeah makes sense.

@jbowles There was this internals thread about Automatic Differentiation here.

@jbowles (Member) commented May 1, 2019

@ehsanmok may be interested in this discussion ^^

thanks @DhruvDh

@kazimuth commented May 1, 2019

@DhruvDh that's a fair criticism, but really that's a problem whenever you want to use a hardware accelerator. You're always going to be calling into a language with different semantics from the host. Using Rust for glue gives you type-safety, performance, and lovely tooling; e.g. it's dead-simple to write a parallel image preprocessing pipeline in Rust, whereas with Python you need a load of hacks (FFI, multiprocessing) to get acceptable performance. Also, you're free to define new low-level operations in Rust; users shouldn't ever need to use another language :)
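
For instance, a minimal sketch of such a pipeline with Rayon (the decoding step is a placeholder; a real pipeline would use e.g. the image crate there):

use rayon::prelude::*;

// Placeholder: decode + resize + normalize one image into a tensor.
fn load_and_preprocess(_path: &str) -> Vec<f32> {
    vec![0.0; 224 * 224 * 3]
}

fn main() {
    let paths: Vec<String> = (0..10_000).map(|i| format!("img_{}.jpg", i)).collect();
    // Swapping iter() for par_iter() parallelizes the whole pipeline.
    let batches: Vec<Vec<f32>> = paths.par_iter()
        .map(|p| load_and_preprocess(p))
        .collect();
    println!("preprocessed {} images", batches.len());
}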

And yeah, currently TVM's publicity is oriented around deployment, because that's where there's a gap in the python ecosystem. There's no reason their compiler wouldn't work for training too, though.

@jbowles I've worked with some of those projects; see my comment, I think we can borrow some of that work.

also CC @nhynes

@kazimuth commented May 1, 2019

Other thought: I wonder what interactive scientific programming would look like in Rust? There's a Jupyter kernel, but I'm not sure how usable it is.

It might be that Rust should just be used for high-performance kernels and such, and be easy to call from other languages, like you lay out in your presentation @LukeMathWalker.

@LukeMathWalker (Member, Author) commented May 1, 2019

Wow, there really is a lurking interest 😛 This is just great.

The discussion has explored several different directions, so I'd like to give more detail on what I envision (and where that need comes from).

I strongly align with @flo-dhalluin: I think Rust can really shine in delivering an end-to-end production workflow.
Rust has incredible potential when it comes to the beginning (data pipelines, preprocessing) and the end (performant web servers speaking multiple protocols) of the ML workflow.
Establishing early on a way to carry out the whole workflow in Rust is going to be a key prerequisite for adoption - filling a painful gap in the ML ecosystem at large and delivering a top-notch experience with great tooling.

Tackling this challenge requires the building blocks I mentioned (n-dimensional arrays, dataframes) and some others that have been brought up (e.g. running code on different types of hardware, easy interop, reading/writing to a lot of different formats).

Certain capabilities can be borrowed from other languages; others we should probably port and develop natively in Rust (a sufficiently large zoo of preprocessing techniques and standard models).

While I do understand the interest in the deep learning area, I don't think it's realistic to kickstart an effort to make Rust a primary language for NN development: we should definitely be able to deploy and run NN models (the TVM project is an excellent example here), but I don't think we would be adding a lot of value by chasing huge projects like TensorFlow or PyTorch.
There are a lot of things in the TensorFlow ecosystem, instead, that are extremely interesting (e.g. TensorFlow Serving), but they end up locking you into TensorFlow itself: if we could replicate those conveniences in a framework-agnostic fashion, we could definitely capture a need in that space.

Summing it up, the minimum working prototype that I have in mind to show off what Rust can do goes along these lines:

  • Huge datasets as input;
  • Heavy-weight, massively parallel data preprocessing pipeline (e.g. NLP or images would be good candidates);
  • Very simple model to be trained on top of the pipeline output;
  • Configuration-based deployment of the serialized model using Rocket: you just define very basic things in a YAML file (e.g. HTTP vs gRPC, monitoring, logging, etc.) and you get a fully working web server that serves your model. This will have to rely on a sufficiently general Model trait.

If we could manage to get the experience right, I am quite sure that interest in Rust for this kind of use case would skyrocket.

@koute commented May 1, 2019

While I do understand the interest in the deep learning area, I don't think it's realistic to kickstart an effort to make Rust a primary language for NN development: we should definitely be able to deploy and run NN models (the TVM project is an excellent example here), but I don't think we would be adding a lot of value by chasing huge projects like TensorFlow or PyTorch.

I agree; however, you're looking at it from the perspective of a data scientist who wants to fill in the gaps of their existing workflow and augment their ML pipeline with Rust. I'm looking at it from the perspective of a Rust developer who just wants to augment their existing application with a little ML without jumping through the hoops of exporting their data, processing it through a mainstream ML framework, and serializing it back so that it can be used by the application again.

In other words - my personal interest lies not in filling a gap in the existing ML ecosystem (although that's also most certainly worthwhile!), but in filling a gap in the Rust ecosystem by creating value for existing Rust users (and perhaps the users of other languages) so that they can take advantage of ML in a plug-and-play fashion with a minimal amount of fuss. (Which is why things like wide hardware and platform support, simplicity, lack of non-Rust dependencies so it's easy to build and cross-compile, etc. are important.)

@jbowles (Member) commented May 1, 2019

I can volunteer work to rust-ml for tokenizers, string distance metrics, and/or a one-hot encoding package. I've already been working on the first two, as I have real-world projects that need these, so I can double up. As for a one-hot package, I'm interested in learning more about how efficient one-hot encoding is done under the hood, and I have a use for the package as well.

  • string distance metrics (jaro, jaro-winkler, ngram, qgram, ratcliff-obershelp)

  • tokenizers: for one, Rust is awesome for writing tokenizers. But IME it's kinda hard to write general tokenizers, since their use is often highly dependent on per-project needs (for example, I wrote this [https://github.com/jbowles/nlpt-tkz] and used it for a project, and it's not found much use since). Or, if there were consensus on using something like NLTK's tokenizers as a guide, I don't mind working on those either. If there is a need for things like the examples below, I can cherry-pick these out of my current project (a hotel and product matching thing) for a rust-ml package... these were written specifically for string comparison and not the typical tokenization found in NLP pipelines, but it would not be too hard to adapt them to accept and return a specific data type...

#[cfg(test)]
mod tests {
    use super::*;
    #[test]
    fn on_word_splitter() {
        fn word_split(c: char) -> bool {
            match c {
                '\n' | '|' | '-' => true,
                _ => false,
            }
        }
        let res = TokenizerNaive::word_splitter("HelLo|tHere", &word_split);
        assert_eq!(res, vec!["HelLo", "tHere"])
    }
    #[test]
    fn on_tokens_lower_filter() {
        fn tokens_filter(c: char) -> bool {
            match c {
                '-' | '|' | '*' | ')' | '(' | '&' => true,
                _ => false,
            }
        }
        let res = TokenizerNaive::tokens_lower_with_filter("|HelLo tHere", &tokens_filter);
        assert_eq!(res, " hello there");

        let res1 = TokenizerNaive::tokens_lower_with_filter("HelLo|tHere", &tokens_filter);
        assert_eq!(res1, "hello there");

        let res2 = TokenizerNaive::tokens_lower_with_filter("HelLo tHere", &tokens_filter);
        assert_eq!(res2, "hello there");

        let res6 =
            TokenizerNaive::tokens_lower_with_filter("****HelLo *() $& )(tH*ere", &tokens_filter);
        assert_eq!(res6, "    hello     $    th ere");
    }

    #[test]
    fn on_pre_process() {
        let res = TokenizerNaive::pre_process("Hotel & Ristorante Bellora");
        assert_eq!(res, "hotel ristorante bellora");

        let res1 = TokenizerNaive::pre_process("Auténtico Hotel");
        assert_eq!(res1, "auténtico hotel");

        let res2 = TokenizerNaive::pre_process("Residence Chalet de l'Adonis");
        assert_eq!(res2, "residence chalet de l adonis");

        let res6 = TokenizerNaive::pre_process("HOTEL EXCELSIOR");
        assert_eq!(res6, "hotel excelsior");

        let res6 = TokenizerNaive::pre_process("Kotedzai Trys pusys,Pylimo ");
        assert_eq!(res6, "kotedzai trys pusys pylimo");

        let res6 = TokenizerNaive::pre_process("Inbursa Cancún Las Américas");
        assert_eq!(res6, "inbursa cancún las américas");
    }

    #[test]
    fn on_tokens_alphanumeric() {
        let res3 = TokenizerNaive::tokens_alphanumeric("|HelLo tHere");
        assert_eq!(res3, " HelLo tHere");

        let res4 = TokenizerNaive::tokens_alphanumeric("HelLo|tHere");
        assert_eq!(res4, "HelLo tHere");

        let res5 = TokenizerNaive::tokens_alphanumeric("HelLo * & )(tHere");
        assert_eq!(res5, "HelLo       tHere");
    }

    #[test]
    fn on_tokens_lower() {
        let res = TokenizerNaive::tokens_lower_str("HelLo tHerE");
        assert_eq!(res, "hello there")
    }

    #[test]
    fn on_tokens_simple() {
        assert_eq!(
            TokenizerNaive::chars("hello there"),
            ["h", "e", "l", "l", "o", " ", "t", "h", "e", "r", "e"]
        );
        assert_eq!(
            TokenizerNaive::chars("hello there").concat(),
            String::from("hello there")
        )
    }

    #[test]
    fn on_similarity_identity() {
        assert_eq!(TokenCmp::new_from_str("hello", "hello").similarity(), 100);
    }

    #[test]
    fn on_similarity_high() {
        assert_eq!(TokenCmp::new_from_str("hello b", "hello").similarity(), 83);
        assert_eq!(
            TokenCmp::new_from_str("this is a test", "this is a test!").similarity(),
            97
        );
        assert_eq!(
            TokenCmp::new_from_str("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear").similarity(),
            91
        );
    }
    #[test]
    fn on_token_sequencer() {
        let an = AlphaNumericTokenizer;
        let one = an.sequencer("Marriot &Beaches Resort|").join(" ");
        let two = an.sequencer("Marriot& Beaches^ Resort").join(" ");
        assert_eq!(one, two);
    }
    #[test]
    fn on_token_sort() {
        let s1 = "Marriot Beaches Resort foo";
        let s2 = "Beaches Resort Marriot bar";
        assert_eq!(TokenCmp::new_from_str(s1, s2).similarity(), 62);
        let sim = token_sort(s1, s2, &TokenCmp::new_sort, &TokenCmp::similarity);
        assert_eq!(sim, 87);
    }
    #[test]
    fn on_token_sort_again() {
        let s1 = "great is scala";
        let s2 = "java is great";
        assert_eq!(TokenCmp::new_from_str(s1, s2).similarity(), 37);
        let sim = token_sort(s1, s2, &TokenCmp::new_sort_join, &TokenCmp::similarity);
        assert_eq!(sim, 81);
    }
    #[test]
    fn on_amstel_match_for_nate() {
        let sabre = "INTERCONTINENTAL AMSTEL AMS";
        let ean = "InterContinental Amstel Amsterdam";
        assert_eq!(TokenCmp::new_from_str(sabre, ean).similarity(), 20);
        assert_eq!(TokenCmp::new_from_str(sabre, ean).partial_similarity(), 14);
        assert_eq!(
            token_sort(sabre, ean, &TokenCmp::new_sort, &TokenCmp::similarity),
            79
        );

        assert_eq!(
            token_sort(
                sabre,
                ean,
                &TokenCmp::new_sort,
                &TokenCmp::partial_similarity
            ),
            78
        );
    }

    #[test]
    fn on_partial_similarity_identity() {
        let t = TokenCmp::new_from_str("hello", "hello");
        assert_eq!(t.partial_similarity(), 100);
    }

    #[test]
    fn on_partial_similarity_high() {
        let t = TokenCmp::new_from_str("hello b", "hello");
        assert_eq!(t.partial_similarity(), 100);
    }

    #[test]
    fn on_similarity_and_whitespace_difference() {
        let t1 = TokenCmp::new_from_str("hello bar", "hello");
        let t2 = TokenCmp::new_from_str("hellobar", "hello");
        let sim1 = t1.similarity();
        let sim2 = t2.similarity();
        assert_ne!(sim1, sim2);
        assert!(sim1 < sim2);
        assert_eq!(sim1, 71);
        assert_eq!(sim2, 77);
    }
}
@kazimuth commented May 2, 2019

Summing it up, the minimum working prototype that I have in mind to show off what Rust can do goes along these lines:

This is a very cool idea :)

Question: what would a general Model trait look like? I think the challenge is striking a balance between generality and specificity; you don't want to tie people down too much, but you need some sort of understanding of what you're doing to be able to use it in a general context.

We might want to brainstorm a list of goals / requirements for the design, before we start writing code. Maybe in another issue?

@jbowles

But IME it's kinda hard to write general tokenizers, since their use is often highly dependent on per-project needs

Do you think it would be possible to do something with a trait-based approach here? Like the Rust pattern of building up a stack of combinators: you get Parallel<Lower<UnicodeSplitter<...>>> and it ends up with near-handwritten performance? I don't know much about NLP, so forgive me if I'm missing stuff here.
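
To make the combinator idea concrete, a minimal sketch (hypothetical names; a real design would work on token streams rather than whole strings):

/// One stage of a text-processing stack.
pub trait TextStage {
    fn apply(&self, input: String) -> String;
}

/// Leaf stage that passes text through unchanged.
pub struct Identity;
impl TextStage for Identity {
    fn apply(&self, input: String) -> String { input }
}

/// Lowercase whatever the inner stage produces.
pub struct Lower<S>(pub S);
impl<S: TextStage> TextStage for Lower<S> {
    fn apply(&self, input: String) -> String {
        self.0.apply(input).to_lowercase()
    }
}

/// Strip ASCII punctuation from the inner stage's output.
pub struct StripPunct<S>(pub S);
impl<S: TextStage> TextStage for StripPunct<S> {
    fn apply(&self, input: String) -> String {
        self.0.apply(input)
            .chars()
            .filter(|c| !c.is_ascii_punctuation())
            .collect()
    }
}

fn main() {
    // The whole stack monomorphizes to one concrete type, so the
    // compiler can inline it down to near-handwritten code.
    let pipeline = Lower(StripPunct(Identity));
    assert_eq!(pipeline.apply("HelLo|tHere".into()), "hellothere");
}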

@jbowles (Member) commented May 2, 2019

@kazimuth yes, I think that would be the way: allow the user to compose a tokenizer.

The TokenizerNaive I showed above is naive specifically because it is not trait-based; it does some text normalization for the user, allowing the user to build and pass in a function for char matching/filtering.

I do have a trait-based approach (ideas I got from this Text-Analysis-in-Rust-Tokenization) in my current project, but those are in service of tokenizing for comparing token similarity.

With full-blown tokenization, an API should allow a user to compose the various things they need (e.g., a char filter, normalizing text, etc.), like your example. The hard part I'm really referring to is the output of the tokenization. For example,

I have a function sequencer that returns a Vec of tokens:

Vec<std::borrow::Cow<'a, str>>;

First, I'm new enough to Rust to still not totally understand all the consequences of using Cow :) ... and also, instead of a Vec<> it likely needs to return a different kind of vector that plays well with one-hot encoding or word embeddings, etc. If you are familiar with Python's scikit-learn, think of the "Vectorizers" it has for turning arrays of strings into arrays of numbers [IMO this is always the hardest part of NLP]:

texts = ["foo bar", "bar foo zaz", "did bar", "zaz bar jazz", "good jazz zaxx"]

tfidf = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))
features = tfidf.fit_transform(texts)
pd.DataFrame(features.todense(), columns=tfidf.get_feature_names())

d_vtz = CountVectorizer()
print(d_vtz.fit_transform(texts))

h_vtz = HashingVectorizer()
print(h_vtz.fit_transform(texts)

It seems what one would want in Rust is a tokenizer that returns vectors of tokens that can just be "plugged in" to lots of different ways of turning text into numbers.
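
Along those lines, a minimal sketch of a count vectorizer that accepts pre-tokenized documents from any tokenizer (hypothetical; a real crate would return an ndarray or sparse matrix rather than nested Vecs):

use std::collections::HashMap;

// Build a vocabulary and a dense term-count matrix from tokenized docs.
fn count_vectorize(docs: &[Vec<&str>]) -> (Vec<String>, Vec<Vec<u32>>) {
    // Assign each unique token a column index, in order of appearance.
    let mut vocab: HashMap<&str, usize> = HashMap::new();
    for doc in docs {
        for tok in doc {
            let next = vocab.len();
            vocab.entry(*tok).or_insert(next);
        }
    }
    // Count term occurrences per document.
    let mut counts = vec![vec![0u32; vocab.len()]; docs.len()];
    for (row, doc) in docs.iter().enumerate() {
        for tok in doc {
            counts[row][vocab[*tok]] += 1;
        }
    }
    // Recover the column names in index order.
    let mut names = vec![String::new(); vocab.len()];
    for (tok, idx) in vocab {
        names[idx] = tok.to_string();
    }
    (names, counts)
}

fn main() {
    let docs: Vec<Vec<&str>> = ["foo bar", "bar foo zaz", "did bar"]
        .iter()
        .map(|d| d.split_whitespace().collect())
        .collect();
    let (names, counts) = count_vectorize(&docs);
    println!("{:?}", names);  // e.g. ["foo", "bar", "zaz", "did"]
    println!("{:?}", counts); // one row of counts per document
}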

@davechallis commented May 2, 2019

I'd like to see more individual specialised components that form part of an ML pipeline, rather than anything monolithic attempting to implement too much at once.

This gives Rust a chance to build up its ML strengths, slowly replacing individual parts of a mature ML pipeline. Having e.g. Python bindings to those components would also allow them to start getting used and proving their benefit, without needing a 100% switch to Rust.

Modules/components I'd love to see:

  • text vectorisation (e.g. fast/parallel versions of count/tfidf vectorisers)
  • dimensionality reduction (e.g. PCA, tSNE)
  • scaling/normalisation
  • hyperparameter optimisation
  • data structure interop (e.g. to/from pandas/arrow/parquet etc.)
@LukeMathWalker (Member, Author) commented May 2, 2019

Question: what would a general Model trait look like? I think the challenge is striking a balance between generality and specificity; you don't want to tie people down too much, but you need some sort of understanding of what you're doing to be able to use it in a general context.

We might want to brainstorm a list of goals / requirements for the design, before we start writing code. Maybe in another issue?

An article I found very interesting, from 2 years ago, is this one: http://athemathmo.github.io/2016/09/07/typesystem-machine-learning.html
It's from the author of rusty-machine, if I am not mistaken. We should definitely brainstorm a list of goals and requirements here before starting to write code. It would also be worthwhile to see what features in the lang team pipeline could be useful for us.

I agree; however, you're looking at it from the perspective of a data scientist who wants to fill in the gaps of their existing workflow and augment their ML pipeline with Rust. I'm looking at it from the perspective of a Rust developer who just wants to augment their existing application with a little ML without jumping through the hoops of exporting their data, processing it through a mainstream ML framework, and serializing it back so that it can be used by the application again.

In other words - my personal interest lies not in filling a gap in the existing ML ecosystem (although that's also most certainly worthwhile!), but in filling a gap in the Rust ecosystem by creating value for existing Rust users (and perhaps the users of other languages) so that they can take advantage of ML in a plug-and-play fashion with a minimal amount of fuss. (Which is why things like wide hardware and platform support, simplicity, lack of non-Rust dependencies so it's easy to build and cross-compile, etc. are important.)

My loyalty is divided, to say the least: I'd love to be able to host 100% of my workflow in Rust, because I strongly believe in the language's potential and in the potential of the tooling around it.
I wouldn't say, though, that our goals are at odds @koute: it's just a matter of deciding in which order we should tackle these challenges.
A good set of crates for preprocessing and deployment is going to be just as necessary for a purely Rust-based workflow as it is for a mixed-language workflow.
Once they are established, we can then shift focus to porting more and more models and algorithms to Rust.
I wholeheartedly agree with @davechallis:

I'd like to see more individual specialised components that form part of an ML pipeline, rather than anything monolithic attempting to implement too much at once.
This gives Rust a chance to build up its ML strengths, slowly replacing individual parts of a mature ML pipeline. Having e.g. Python bindings to those components would also allow them to start getting used and proving their benefit, without needing a 100% switch to Rust.

Thanks to the strong packaging and distribution story provided by Rust, the effort of fleshing out algorithms and preprocessing tools can be extremely distributed: once there is a set of agreed-upon traits as interfaces, we can leverage the influx of people who are fascinated and allow them to be productive and develop new crates without having to worry about the fundamentals.
That's why I think it's strategic to have a pure Rust implementation of dataframes and n-dimensional arrays, for instance.
We don't need a huge monolith like SciPy or scikit-learn.

@swfsql commented May 2, 2019

@kazimuth that Jupyter kernel is usable; I'm starting to learn AI with it here:
https://github.com/swfsql/deep-learning-coursera (by oxidizing Python code).
(Currently, only the first assignment is in Rust.)

@jbowles (Member) commented May 2, 2019

This gives Rust a chance to build up its ML strengths, slowly replacing individual parts of a mature ML pipeline. Having e.g. Python bindings to those components would also allow them to start getting used and proving their benefit, without needing a 100% switch to Rust.
💯

Seems to me one of the more difficult problems in doing this in Rust is getting common types and traits defined for the different packages to interface with. If I'm not mistaken, @LukeMathWalker, you seem to point towards using ndarray as basically numpy. I'm all on board with that.

What if there were something like a core package that defined some of the core traits, structs and types? I can see lots of pros/cons to doing that.

@kazimuth commented May 2, 2019

@jbowles RE: tokenizer API
Hm, I see the challenge there. Well, for one thing you should probably use Iterators between operations instead of Vecs, or design a trait similar to Iterator; that should reduce the problem of having to keep big buffers between each transformation. Then I think the path would be to pick-and-choose input requirements for each operation, and let operations output whatever they want. E.g. a HashVectorizer takes impl Iterator<Item = impl Deref<Target = str>>, and then users can pass in an Iterator of &str, String, Cow, whatever.

This gets at a broader problem with a simple function-y Model(Input) -> Output trait; it works for in-memory datasets, but once your dataset is large enough that you want to start streaming / distributing work over multiple machines, the abstraction sorta breaks down. We could instead do something graphy, where you just have nodes that ingest and spit out streams of data... but then we'll have to work with something graphy, with nodes that ingest and spit out streams of data :P

It might make sense to just start implementing without a core crate of traits, and once we've smacked into enough walls in the design space, we can figure out what the interfaces to our systems tend to look like, and retrofit a core design around that.

@nhynes commented May 2, 2019

Although I'm not sure that Rust is going to usurp Python and C++ as the de-facto ML programming model, it's definitely a worthy goal. Along those lines, I think that flashlight (and the underlying arrayfire library) has an interface that we might want to emulate.

In any case, the real key feature of PyTorch and JAX is the expressivity of Python backed by a high-performance JIT tensor compiler. I'm pretty sure it's possible to do something similar in Rust by writing a compiler plugin that tracks the types+ops of ndarrays and provides the data to a JIT compiler.

Maybe something like

#[jit]
fn mlp(
    data: &Array<2, f32>,
    weights: Vec<&Array<2, f32>>,
    labels: &Array<1, u8>
) -> f32 {
    let fc1 = data.dot(weights[0]); // fn dot -> Array<D, T, Op=gemm>
    Array::pointwise_max(0, fc1) // Array<D, T, Op=Max<0, fc1>> 
}

This is just a sketch and depends on how const generics actually pan out, but the idea is that a compiler plugin can find the #[jit] functions and either pre-compile them or add them to a runtime cache, replacing the original definition with a call into the cache. This is not too dissimilar to TVM's hybrid mode. We probably don't want to write a tensor compiler, so we could offload that to TVM and link in the static library.

@LukeMathWalker (Member, Author) commented May 7, 2019

Personally, I'd be happy to devote the bulk of my time to the Classical ML work stream, extending to General Preprocessing, Deployment and DataFrames if needed/when the bulk of the work is done.

How would online / streaming / non-in-memory machine learning fit into this breakdown, or would it be an additional focus area?

I think of it as a cross-cutting concern, in the sense that each area has to do its homework to enable streaming ML.

@rth commented May 7, 2019

Thanks for this summary @LukeMathWalker! For each of these topics, I think a first step could be to list existing crates (https://github.com/anowell/are-we-learning-yet does some of that already), get some input from their maintainers, and also review what solutions exist in the C++/Python/etc. space.

For NLP, would you mind creating, say, a rust-ml/nlp-discussion repo in this org? I agree that some of the problems solved by these different groups are related and it might be preferable not to isolate them too much, at least in the beginning.

@LukeMathWalker (Member, Author) commented May 7, 2019

For NLP, would you mind creating, say, a rust-ml/nlp-discussion repo in this org? I agree that some of the problems solved by these different groups are related and it might be preferable not to isolate them too much, at least in the beginning.

For discussion purposes, I'd say that it would be enough just to create a new issue here (in the spirit of keeping things close together and visible). What do you think?

@rth commented May 7, 2019

For discussion purposes, I'd say that it would be enough just to create a new issue here (in the spirit of keeping things close together and visible). What do you think?

Well, for NLP I was hoping to create one issue, say, for basic text processing tools, one for POS tagging, one for lemmatization, etc., each with a summary of existing tools. Something you can come back to a few months later and update as things progress. Certainly more detailed technical discussion can happen in specific crates, but it might still be good to have some general place to set goals and track progress. One doesn't really need to see all the other issues in this repo while working on NLP (or receive notifications for all other topics).

What do you think @danieldk @jbowles @sebpuetz ?

@danieldk commented May 7, 2019

@rth sounds good to me!

@sebpuetz commented May 7, 2019

While this is not yet branching out into different issues, I think @twuebi could be interested in contributing towards a lemmatizer since he has written a rusty neural seq2seq lemmatizer before.

@jbowles (Member) commented May 7, 2019

I'm good either way (wrt general/specific discussion areas); I like to see general discussions that touch on various areas, but I also see the benefit of being able to focus on specific discussions in rust-ml/nlp-discussion... I guess I don't see the harm in NLP isolation, since it would still be under the rust-ml org.

One example is how the Julia community has organized... many focused GitHub orgs, for example all the packages under JuliaML and JuliaText... though I would think for here it would be nice to have text packages under rust-ml. Notice each org keeps a Roadmap.jl... and they have discussions around the roadmap that also link to specific packages.

Perhaps we could have high-level discussions around a rust-ml roadmap and allow more focused discussions to orbit around subsections of the roadmap or specific crates?

@milesgranger commented May 7, 2019

At the risk of complete embarrassment: I've been working on a general-purpose dataframe lib, black-jack, a bit akin to Python's pandas. This was something I felt was really missing in the foundational steps of general ML workflows for Rust. It still has much more work to go, but at least while working in Rust it's always a pleasure. :)

I'm not even proposing this lib should be considered, but a general dataframe manipulation lib is a must for me, and I would be willing to contribute to any other more focused or mature project in this area.

@jblondin commented May 7, 2019

@milesgranger There is an existing dataframe discussion going on here that would love your input!

Speaking of which, should we consider moving the dataframe discussion into a repo / issue in this organization? Depending on how we decide to structure it.

FWIW, I think I'm in favor of having separate repos within this organization for each of the 'working groups' (so their discussions can be more easily organized), along with general discussion and library boundary / coupling issues happening in this repo. Having a single issue for each area makes sense for now, but could get unwieldy quickly -- personally I find long threads a bit intimidating and hard to properly process.

Just my thoughts; I'm really ok with however it's structured.

@LukeMathWalker (Member, Author) commented May 8, 2019

I did not consider the notification issue 😛 I have created the nlp-discussion repo - https://github.com/rust-ml/nlp-discussion

@jbowles @sebpuetz @rth @danieldk: you should all have received an invite to the rust-ml organization, so that you can manage the nlp-discussion repository independently.

I plan to find some time tonight to open a new repo for the classical ML/general preprocessing stream of work as well, with an initial survey of what is available/what is missing in the current ecosystem.

I like the idea of having links to the roadmap of each group @jbowles - for now I have added the work stream breakdown in the README.md of this repository. Once the different streams start working and have a roadmap/plan of action, I will add links to each one of them.

I'd be happy to move the dataframe discussion under this organization, if we want to group them together @jblondin.

@danieldk commented May 8, 2019

@jbowles @sebpuetz @rth @danieldk : I have made a bunch of 'Existing work' issues for common NLP tasks, where we could make an inventory of what relevant projects exist for Rust and discuss their states. I may have missed some tasks, so feel free to add them. Also, I was not too sure whether it makes sense to add tasks such as machine translation or question answering now.

@LukeMathWalker (Member, Author) commented May 9, 2019

I have opened a repository to discuss general pre-processing and classical ML - given that so far only @jblondin and I have expressed interest in this area, I think it makes sense to keep them together for the time being.

The repo is here: https://github.com/rust-ml/classical-ml-discussion.

I have started an issue to map the existing ecosystem and the functionality we want to get in terms of preprocessing. I will start populating it as soon as I have some spare time.

@davidB (Contributor) commented May 12, 2019

Hi, I'm a contributor to evcxr_jupyter (the Rust kernel for Jupyter); feel free to report any issue or request you have to make it a useful tool for ML and Rust.
I'm also working on a crate to provide ways to display data, structs, etc. for evcxr_jupyter. But currently I've only prototyped for nalgebra's Matrix; without user requests, I didn't know which dataframe or array lib to target first. I've also started to investigate how to provide some kind of data-viz backed by vega/vega-lite, like Altair.

@LukeMathWalker (Member, Author) commented May 12, 2019

Welcome to the discussion @davidB!
Something similar to Altair would be extremely useful - there is very little in that space right now.
What usage patterns have you seen so far for the Rust kernel for Jupyter?

@davidB (Contributor) commented May 12, 2019

When I joined the Rust kernel for Jupyter team, it was to provide tools that could help my team's members choose Rust (as an alternative to Python) for their ML experiments (and the software around them), by showing them that they can use Rust in Jupyter (or VS Code) and manipulate equivalents of numpy, pandas, and matplotlib/Altair, converting some "classical" notebooks (MNIST, ...). But currently I'm not aware of any usage of the kernel :-(, except the demos we provided.

@davidB (Contributor) commented May 12, 2019

To provide something similar to Altair, I need to select a dataframe format as data input. I read discussion #1 on rust-dataframe/discussion to find an answer, but I need crates that work on the stable Rust toolchain (else I can't promote them to my team and our customers/sponsors).

Suggestions are welcome. I'll continue to lurk, and I will try to provide some viz extension for the kernel around ndarray.

@jblondin commented May 12, 2019

I worked on a server-based data visualization solution (backend here, frontend example here), but I backburnered it to work on a dataframe library instead.

If you have specific needs from a dataframe, please let us know in the dataframe discussion - there are several works in progress, but I don't believe anything is close to having a stable API. The purpose of that discussion is to try to get to that point! 😄

I, too, would like to target stable -- the current issue is that we'd also like to support Apache Arrow (for interoperability reasons), the Rust implementation of which is built on nightly (mostly for the specialization feature). Given that the dataframe project is in a fairly nascent state itself, specialization could (hopefully) end up stabilizing before we're that far along with the dataframe API anyway, so we could target stable at that point. But we're not there yet.

Regardless, I think a Jupyter kernel and plotting is a fantastic use case and definitely something we should keep in mind.

@tspooner commented May 15, 2019

Bit of a shameless plug here, but I'm a PhD student researching reinforcement learning and I've been working on an RL framework (https://github.com/tspooner/rsrl) for a good while now. Rust has been an absolute joy to work with, especially for RL. As some of you mentioned (@DhruvDh), Rust has a lot of scope for parallelization in this area, which I'm currently working on in my spare time.

There's a lot left to do, and progress has been slow recently with work, but any feedback would be welcome!

@aeroaks commented May 16, 2019

Hi @davidB, I would also be interested in that. Is there a repo already?

I also start to investigate how to provide some kind of data-viz backed by vega/vega-lite like Altair.

@davidB (Contributor) commented May 16, 2019

@aeroaks I've only experimented on my desktop; I'm not happy with the way vega-lite structs are created. But I'll push what I did (this week on the train: basic display of ndarray's Array2, vectors, a vegalite struct) to the repo https://github.com/davidB/evcxr_displayers later today. I hope to be able to share some progress/screenshots this weekend.

@LukeMathWalker Can we open a new issue in this repo to discuss data-viz? (Until then, feel free to open an issue/discussion on evcxr_displayers.)

@stefan-k commented May 17, 2019

I'm a newbie in ML, but I think that in order to make Rust shine in ML, we should think about including more basic optimization algorithms.

We're trying to build a large collection of optimization algorithms with argmin. If there is interest in using argmin for ML, I'd be very happy to help.

@LukeMathWalker (Member, Author) commented May 19, 2019

@LukeMathWalker Can we open a new issue in this repo to discuss data-viz? (Until then, feel free to open an issue/discussion on evcxr_displayers.)

Absolutely, please go ahead and do it!

@ThomAub commented May 22, 2019

Hello everyone,
Very interesting conversation. I would love to see more ML crates. I don't have much experience with Rust atm, so I'm currently building a toy project using ndarray to build a simple NN from scratch. Is there some specific way that I can help this project? I'm interested in the deep learning and interoperability side, but I would be happy to help on any other track.

@danieldk commented May 22, 2019

Very interesting conversation. I would love to see more ML crates. I don't have much experience with Rust atm, so I'm currently building a toy project using ndarray to build a simple NN from scratch. Is there some specific way that I can help this project?

That sounds like a fun project! Since you are new to Rust, I think you could provide very valuable feedback. Those of us who have used Rust for a few years don't really notice the sharp edges anymore, but with a 'beginner's mind' you will have a much better idea of the problems in the current ecosystem. So if you encounter any paper cuts using ndarray, I am sure the feedback would be useful to bluss (ndarray's creator/maintainer). It would also be very useful if you could summarize your findings (good and bad) here.

@LukeMathWalker (Member, Author) commented May 22, 2019

Welcome @ThomAub!

I definitely agree with @danieldk: the first valuable contribution, especially to ndarray, would be to report any issues you run into. It doesn't have to be a bug: even if you later figured out how to get around it, it's useful feedback for us to improve either the API or the docs.

If you want to get started immediately with some coding, you can have a look at the classical ML or NLP discussion repositories in this organisation.
You can also have a look at the open issues in ndarray, ndarray-linalg, etc. - we would be happy to provide support and get you onboarded.

I plan to publish a structured list of issues we need help with in the classical ML repo as soon as we have a rudimentary design ready.

@aeroaks commented May 24, 2019

@LukeMathWalker Can we open a new issue in this repo to discuss data-viz? (Until then, feel free to open an issue/discussion on evcxr_displayers.)

Absolutely, please go ahead and do it!

@davidB should I create an issue here to discuss data-viz?
