Examples of WhizzML scripts and libraries
Each script or library lives in its own directory in this folder. For each one you will find a readme explaining its purpose and usage, the actual WhizzML code in a .whizzml file, and the JSON metadata needed to create the BigML resources.
By convention, the metadata file is called metadata.json.
covariate-shift: Determine if there is a shift in data distribution between two datasets.
model-or-ensemble: Decide whether to use models or ensembles for predictions, given an input source.
deduplicate: Removes contiguous duplicate rows of a dataset, where "duplicate" means a given field's value is the same for a set of contiguous rows.
remove-anomalies: Using Flatline and an anomaly detector, remove from an input dataset its anomalous rows.
smacdown-branin: Simple example of SMACdown, using the Branin function as evaluator.
smacdown-ensemble: Use SMACdown to discover the best possible ensemble to model a given dataset id.
find-neighbors: Using cluster distances as a metric, find instances in a dataset close to a given row.
stacked-generalization: Simple stacking using decision trees, ensembles and logistic regression.
best-first: Feature selection using a greedy algorithm.
gradient-boosting: Boosting algorithm using gradient descent.
model-per-cluster: Scripts and library to model data after clustering and make predictions using the resulting per-cluster models.
model-per-category: Scripts to model and predict from an input dataset with a hand-picked root node field.
best-k: Scripts and library implementing the Pham-Dimov-Nguyen algorithm for choosing the best k in k-means clustering.
seeded-best-k: Scripts and library implementing the Pham-Dimov-Nguyen algorithm for choosing the best k in k-means clustering, with user-provided seeds.
anomaly-shift: Calculate the average anomaly between two given datasets.
cross-validation: Scripts for performing k-fold cross-validation.
clean-dataset: Scripts and library for cleaning up a dataset.
boruta: Script for feature selection using the Boruta algorithm.
cluster-classification: Script that determines which input fields are most important for differentiating between clusters.
anomaly-benchmarking: Script that takes any dataset (classification or regression) and turns it into a binary classification problem with the two classes "normal" and "anomalous".
sliding-window: Script that extends a dataset with new fields containing row-shifted values from numeric fields, for casting time series forecasting as a supervised learning problem.
unify-optype: Script that matches the field optypes to those of a given dataset.
stratified-sampling: Script that implements the stratified sampling technique.
low-coverage: Script that removes all the sparse fields whose coverage is below a given threshold.
stacked-predictions: Script that builds several predictors and returns a copy of the original dataset with an extra final field containing the most popular prediction.
calendar-adjustment: Given a dataset containing one or more monthly time series and a datestamp, scales the time series values by the number of days in the corresponding months, returning a dataset extended with new fields containing the scaled values.
stepwise-regression: Finds the best features for building a logistic regression using a greedy algorithm.
ordinal-encoder: Given a dataset, encodes categorical fields using ordinal encoding, which uses a single column of integers to represent field classes (levels).
batch-explanations: A simple way to perform many predictions with explanations.
best-first: Code to find the list of fields in your dataset that produce the best models. Allows iteration and uses cross-validation.
multi-label: Classification for datasets with a multi-label (items) objective field.
recursive-feature-elimination: Script to select the n best features for modeling a given dataset, using a recursive algorithm.
name-clusters: Script to give names to clusters using important field names and their values.
dimensionality-reduction: Script for dimensionality reduction using PCA and topic modelling.
fuzzy-norms: Computing fuzzy-logic T-norms and T-conorms as new dataset features.
automl: Automated Machine Learning within BigML.
correlations-matrix: Generates a CSV containing the matrix of correlations between the numeric and categorical fields in a dataset.
batch-association-sets: Adds new features to a dataset by creating new fields based on the combinations that appear in the association rules extracted from it.
supervised-model-quality: Creates the evaluation associated with the user-given supervised model (fusions excluded). The evaluation is created by splitting the dataset used in the model into a train/test split.
bulk-move: Moves selected resources in bulk to a user-provided project. The resources to be moved are selected by applying the user-provided filters.
statistical-tests: Performs statistical tests on a given dataset and creates a report with the results.
time-aware cross-validation: Cross-validation that respects temporal order.
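The scripts above are written in WhizzML, but some of the underlying ideas are language-neutral. As one illustration, here is a minimal Python sketch of the lag-feature idea behind sliding-window: it is not the actual script (which builds the shifted fields remotely with Flatline), just a local model of what "row-shifted values" means when casting forecasting as supervised learning.

```python
# Sketch only: turn a univariate series into supervised-learning rows by
# adding row-shifted (lagged) copies of the value column. The last entry
# of each row is the target; the preceding entries are its lag features.

def sliding_window(values, lags):
    """Return rows [x[t-lags], ..., x[t-1], x[t]] for each t >= lags."""
    rows = []
    for t in range(lags, len(values)):
        rows.append(values[t - lags:t + 1])
    return rows

series = [10, 12, 13, 15, 18, 21]
for row in sliding_window(series, 2):
    print(row)  # e.g. [10, 12, 13]: two lag features plus the target 13
```

Each printed row is one training instance: with two lags, the series of six values yields four rows, and a supervised model can then learn to predict the last column from the others.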
How to install
There are three kinds of installable WhizzML artifacts in this repo, identified by the field "kind" in their metadata: libraries, scripts and packages. The latter are bundles of libraries and scripts, possibly with interdependencies, meant to be installed together.
Libraries and scripts are easily installed from the BigML dashboard. To install a script, navigate to 'Scripts' and hover over the installation dropdown. Choose 'Import script from GitHub' and paste in the URL of the example's folder. To install a library, first navigate to 'Libraries'; the rest of the process is the same.
Packages can be installed in either of the following ways:
If you have bigmler installed in your system, just check out the repository 'whizzml/examples' and, at its top level, issue the command:

make compile PKG=example-name

replacing example-name with the actual example name (for instance, make compile PKG=clean-dataset). That will create all of the example's scripts and libraries for you.
Using the web UI
Install each of the libraries separately, using the urls to each of their folders. (For example, https://github.com/whizzml/examples/tree/master/clean-dataset/clean-data)
Install each of the scripts separately, using the urls to each of their folders.
If a script requires a library, you will get the error message 'Library ../my-library not loaded.' Load the library by clicking on the textbox above the error message and typing the first few letters of the library's name. Select the library, then create the script as usual.
Compiling packages and running tests
Run make at the repository's top level for a list of possibilities, including:

tests: runs all available test scripts (which live in the test subdirectory of some packages), which typically use bigmler.
compile: uses bigmler to register in BigML the resources associated with one or more packages in the repository.
clean: deletes resources and outputs (both remote and local) created by the targets above.
distcheck: combines most of the above to check that all the scripts in the repository are working; this target should build cleanly before merging.
The verbosity of the tests output can be controlled with the variable VERBOSITY, which runs from 0 (the default, mostly silent) to 2. For example:

make tests VERBOSITY=1
If you write your own test scripts, include test-utils.sh for shared utilities.