Beta Machine Learning Toolkit

Machine Learning made simple :-)

   

The Beta Machine Learning Toolkit is a repository of several Machine Learning algorithms, started by implementing in the Julia language the concepts taught in the MITx 6.86x - Machine Learning with Python: from Linear Models to Deep Learning course (note that we bear no affiliation with that course).


Theoretical notes describing most of these algorithms are at the companion repository https://github.com/sylvaticus/MITx_6.86x.

The focus of the library is skewed toward user-friendliness rather than computational efficiency: the code is (relatively) easy to read and the algorithm implementations use mostly the Julia core without external libraries (a little bit like numpy-ml for Python/NumPy), but they are not heavily optimised (and GPU is not supported). For excellent and mature machine learning algorithms in Julia that support huge datasets, or to organise complex and partially automated pipelines of algorithms, please consider the packages listed in the "Alternative packages" section below.

As the focus is on simplicity, functions have longer but more explicit names than usual: for example the Dense layer is DenseLayer, the RBF kernel is radialKernel, etc. As we didn't aim for heavy optimisation, we were able to keep the API (Application Programming Interface) both beginner-friendly and flexible. Contrary to more established packages, most methods provide reasonable defaults that can be overridden when needed (like the neural network optimiser, the verbosity level, or the loss function). For example, one can implement one's own layer as a subtype of the abstract type Layer, one's own optimisation algorithm as a subtype of OptimisationAlgorithm, or even specify a custom distance metric in the clustering Kmedoids algorithm.
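
As a quick illustration of this flexibility, the sketch below passes a custom distance function to the K-medoids clustering. It is only a sketch: the kmedoids function name and the dist keyword used here are assumptions, to be checked against the Clustering module reference.

using BetaML, LinearAlgebra
X               = rand(100,3)                     # Toy data: 100 records with 3 dimensions
manhattanD(a,b) = norm(a-b,1)                     # Our own distance metric (L1 norm instead of the default Euclidean)
out             = kmedoids(X, 3, dist=manhattanD) # Cluster the records in 3 classes using the custom metric (see the docstring for the returned values)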

That said, Julia is a relatively fast language and most of the heavy work is done in multithreaded functions or through matrix operations whose underlying libraries may themselves be multithreaded, so the toolkit is reasonably fast for small exploratory tasks and mid-size analyses (basically everything that fits in your PC's memory).

Documentation

Please refer to the package documentation (stable | dev) or use the Julia inline help system (just press the question mark ? and then, on the special help prompt help?>, type the module or function name). The package documentation is made of two distinct parts: the first one is an extensively commented tutorial that covers most of the library, the second one is the reference manual covering the library's API.
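
For instance, to read the documentation of the DenseLayer constructor from the REPL:

julia> using BetaML
help?> DenseLayer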

We have currently implemented the following modules in BetaML: Perceptron (linear and kernel-based classifiers), Trees (decision trees and random forests), Nn (neural networks), Clustering (Kmeans, Kmedoids, Expectation-Maximisation, missing-value imputation, ...) and Utils.

If you are looking for an introductory book on Julia, have a look at "Julia Quick Syntax Reference" (Apress, 2019).

The package can be easily used from R or Python employing JuliaCall or PyJulia respectively; see the "Getting started" section of the documentation tutorial.

Examples

We show how to use three different algorithms to learn the relation between floral sepal and petal measures (the first 4 columns) and the species name (the 5th column) in the famous iris flower dataset.

The first two algorithms are examples of supervised learning, the third one of unsupervised learning.

  • Using Random Forests for classification
using DelimitedFiles, BetaML

iris  = readdlm(joinpath(dirname(Base.find_package("BetaML")),"..","test","data","iris.csv"),',',skipstart=1) # Load the iris dataset shipped with the package
x = iris[:,1:4]  # Features: sepal and petal measures
y = iris[:,5]    # Labels: the species names
((xtrain,xtest),(ytrain,ytest)) = partition([x,y],[0.7,0.3]) # Split the data in training/testing sets (70%/30%)
(ytrain,ytest) = dropdims.([ytrain,ytest],dims=2)            # Convert the single-column label matrices to vectors
myForest       = buildForest(xtrain,ytrain,30)               # Build a random forest of 30 decision trees
ŷtrain         = predict(myForest, xtrain)                   # Predict on the training set
ŷtest          = predict(myForest, xtest)                    # Predict on the testing set
trainAccuracy  = accuracy(ŷtrain,ytrain) # 1.00
testAccuracy   = accuracy(ŷtest,ytest)   # 0.956
  • Using an Artificial Neural Network for multinomial categorisation
# Load Modules
using BetaML, DelimitedFiles, Random, StatsPlots # Load the main module and auxiliary modules
Random.seed!(123); # Fix the random seed (to obtain reproducible results)

# Load the data
iris     = readdlm(joinpath(dirname(Base.find_package("BetaML")),"..","test","data","iris.csv"),',',skipstart=1)
iris     = iris[shuffle(axes(iris, 1)), :] # Shuffle the records, as they aren't by default
x        = convert(Array{Float64,2}, iris[:,1:4])
y        = map(x->Dict("setosa" => 1, "versicolor" => 2, "virginica" =>3)[x],iris[:, 5]) # Convert the target column to numbers
y_oh     = oneHotEncoder(y) # Convert to One-hot representation (e.g. 2 => [0 1 0], 3 => [0 0 1])

# Split the data in training/testing sets
((xtrain,xtest),(ytrain,ytest),(ytrain_oh,ytest_oh)) = partition([x,y,y_oh],[0.8,0.2],shuffle=false)
(ntrain, ntest) = size.([xtrain,xtest],1)

# Define the Artificial Neural Network model
l1   = DenseLayer(4,10,f=relu) # Activation function is ReLU
l2   = DenseLayer(10,3)        # Activation function is identity by default
l3   = VectorFunctionLayer(3,3,f=softmax) # Add a (parameterless) layer whose activation function (softMax in this case) is defined to all its nodes at once
mynn = buildNetwork([l1,l2,l3],squaredCost,name="Multinomial logistic regression Model Sepal") # Build the NN and use the squared cost (aka MSE) as error function (crossEntropy could also be used)

# Training it (default to ADAM)
res = train!(mynn,scale(xtrain),ytrain_oh,epochs=100,batchSize=6) # Use optAlg=SGD() to use Stochastic Gradient Descent instead

# Test it
ŷtrain        = predict(mynn,scale(xtrain))   # Note the scaling function
ŷtest         = predict(mynn,scale(xtest))
trainAccuracy = accuracy(ŷtrain,ytrain,tol=1) # 0.983
testAccuracy  = accuracy(ŷtest,ytest,tol=1)   # 1.0

# Visualise results
testSize    = size(ŷtest,1)
ŷtestChosen =  [argmax(ŷtest[i,:]) for i in 1:testSize]
groupedbar([ytest ŷtestChosen], label=["ytest" "ŷtest (est)"], title="True vs estimated categories") # All records correctly labelled!
plot(0:res.epochs,res.ϵ_epochs, xlabel="epoch", ylabel="avg. error", legend=nothing, title="Avg. error per epoch on the Sepal dataset")

  • Using the Expectation-Maximisation algorithm for clustering
using BetaML, DelimitedFiles, Random, StatsPlots # Load the main module and auxiliary modules
Random.seed!(123); # Fix the random seed (to obtain reproducible results)

# Load the data
iris     = readdlm(joinpath(dirname(Base.find_package("BetaML")),"..","test","data","iris.csv"),',',skipstart=1)
iris     = iris[shuffle(axes(iris, 1)), :] # Shuffle the records, as they aren't by default
x        = convert(Array{Float64,2}, iris[:,1:4])
x        = scale(x) # normalise all dimensions to (μ=0, σ=1)
y        = map(x->Dict("setosa" => 1, "versicolor" => 2, "virginica" =>3)[x],iris[:, 5]) # Convert the target column to numbers

# Get some ranges of minVariance and minCovariance to test
minVarRange   = collect(0.04:0.05:1.5)
minCovarRange = collect(0:0.05:1.45)

# Run the gmm(em) algorithm for the various cases...
sphOut   = [gmm(x,3,mixtures=[SphericalGaussian() for i in 1:3],minVariance=v, minCovariance=cv, verbosity=NONE) for v in minVarRange, cv in minCovarRange[1:1]]
diagOut  = [gmm(x,3,mixtures=[DiagonalGaussian() for i in 1:3],minVariance=v, minCovariance=cv, verbosity=NONE)  for v in minVarRange, cv in minCovarRange[1:1]]
fullOut  = [gmm(x,3,mixtures=[FullGaussian() for i in 1:3],minVariance=v, minCovariance=cv, verbosity=NONE)  for v in minVarRange, cv in minCovarRange]

# Get the Bayesian information criterion (AIC is also available)
sphBIC  = [sphOut[v,cv].BIC for v in 1:length(minVarRange), cv in 1:1]
diagBIC = [diagOut[v,cv].BIC for v in 1:length(minVarRange), cv in 1:1]
fullBIC = [fullOut[v,cv].BIC for v in 1:length(minVarRange), cv in 1:length(minCovarRange)]

# Compare the accuracy with true categories
sphAcc  = [accuracy(sphOut[v,cv].pₙₖ,y,ignoreLabels=true) for v in 1:length(minVarRange), cv in 1:1]
diagAcc = [accuracy(diagOut[v,cv].pₙₖ,y,ignoreLabels=true) for v in 1:length(minVarRange), cv in 1:1]
fullAcc = [accuracy(fullOut[v,cv].pₙₖ,y,ignoreLabels=true) for v in 1:length(minVarRange), cv in 1:length(minCovarRange)]

plot(minVarRange,[sphBIC diagBIC fullBIC[:,1] fullBIC[:,15] fullBIC[:,30]], markershape=:circle, label=["sph" "diag" "full (cov=0)" "full (cov=0.7)" "full (cov=1.45)"], title="BIC", xlabel="minVariance")
plot(minVarRange,[sphAcc diagAcc fullAcc[:,1] fullAcc[:,15] fullAcc[:,30]], markershape=:circle, label=["sph" "diag" "full (cov=0)" "full (cov=0.7)" "full (cov=1.45)"], title="Accuracies", xlabel="minVariance")
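
Besides plotting, the BIC matrix computed above can be used directly to select the constraints with the lowest BIC (a small sketch using only Base Julia on the variables already defined in this example):

bestIdx      = argmin(fullBIC)          # CartesianIndex of the lowest BIC over the (minVariance, minCovariance) grid
bestMinVar   = minVarRange[bestIdx[1]]  # Best minVariance value
bestMinCovar = minCovarRange[bestIdx[2]]# Best minCovariance value
bestAccuracy = fullAcc[bestIdx]         # Accuracy obtained with that choice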

  • Other examples

Further examples, using more advanced techniques to improve predictions, are provided in the documentation tutorial. Conversely, very "micro" examples of usage of the various functions can be found in the unit tests available in the test folder.

Alternative packages

| Category | Packages |
| --- | --- |
| ML toolkits/pipelines | ScikitLearn.jl, AutoMLPipeline.jl, MLJ.jl |
| Neural Networks | Flux.jl, Knet |
| Decision Trees | DecisionTree.jl |
| Clustering | Clustering.jl, GaussianMixtures.jl |
| Missing imputation | Impute.jl |

TODO

Short term

  • Implement utility functions to do hyper-parameter tuning using cross-validation as back-end
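
In the meantime, a basic k-fold cross-validation loop can already be written by hand on top of the existing functions, as in the rough sketch below (here tuning the number of trees of a random forest, with x and y being the feature matrix and the label vector; the fold splitting uses only Base Julia):

using BetaML, Random
function kFoldAccuracy(x, y, nTrees; k=5)
    n     = size(x,1)
    folds = collect(Iterators.partition(shuffle(1:n), ceil(Int, n/k))) # Split the (shuffled) record indices in k folds
    accs  = Float64[]
    for testIds in folds
        trainIds = setdiff(1:n, testIds)                               # Train on everything except the current fold
        forest   = buildForest(x[trainIds,:], y[trainIds], nTrees)
        push!(accs, accuracy(predict(forest, x[testIds,:]), y[testIds]))
    end
    return sum(accs)/length(accs)                                      # Average accuracy over the k folds
end
candidateNTrees = (10, 30, 50)
cvAccuracies    = [kFoldAccuracy(x, y, n) for n in candidateNTrees]
bestNTrees      = candidateNTrees[argmax(cvAccuracies)]                # Keep the best performing value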

Mid/Long term

  • Add convolutional layers and RNN support
  • Reinforcement learning (Markov decision processes)

Contribute

Contributions to the library are welcome. We are particularly interested in the areas covered in the "TODO" list above, but we are open to other areas as well. Please however consider that the focus is mostly didactic/research, so clear, easy to read (and well documented) code and a simple API with reasonable defaults are more important than highly optimised algorithms. For the same reason, it is fine to use verbose names. Please open an issue to discuss your ideas, or directly make a well-documented pull request to the repository. While not required by any means, if you are customising BetaML and writing, for example, your own neural network layer type (by subclassing AbstractLayer), your own sampler (by subclassing AbstractDataSampler) or your own mixture component (by subclassing AbstractMixture), please consider giving it back to the community by opening a pull request to integrate it in BetaML.

Citations

If you use BetaML please cite as:

  • Lobianco, A., (2021). BetaML: The Beta Machine Learning Toolkit, a self-contained repository of Machine Learning algorithms in Julia. Journal of Open Source Software, 6(60), 2849, https://doi.org/10.21105/joss.02849
@article{Lobianco2021,
  doi       = {10.21105/joss.02849},
  url       = {https://doi.org/10.21105/joss.02849},
  year      = {2021},
  publisher = {The Open Journal},
  volume    = {6},
  number    = {60},
  pages     = {2849},
  author    = {Antonello Lobianco},
  title     = {BetaML: The Beta Machine Learning Toolkit, a self-contained repository of Machine Learning algorithms in Julia},
  journal   = {Journal of Open Source Software}
}

Acknowledgements

The development of this package at the Bureau d'Economie Théorique et Appliquée (BETA, Nancy) was supported by the French National Research Agency through the Laboratory of Excellence ARBRE, a part of the “Investissements d'Avenir” Program (ANR 11 – LABX-0002-01).
