# Get started with getML

In this article you will discover the basic concepts of getML. You will tackle a simple problem using the Python API in order to gain a technical understanding of the benefits of getML. More concretely, you will learn how to

1. [Start a new project](#Starting-a-new-project)
2. [Define a data model](#Defining-the-data-model)
3. [Train an ML model](#Training-a-model)

You have not installed getML on your machine yet? Head over to the [installation instructions](https://docs.get.ml/latest/tutorial/installation.html) before you get started.

### Introduction

Automated machine learning (AutoML) has attracted a great deal of attention in recent years. It aims to simplify the application of traditional machine learning methods to real-world business problems by automating key steps of a data science project, such as feature extraction, model selection, and hyperparameter optimization. With AutoML, data scientists are able to develop and compare dozens of models, gain insights, generate predictions, and solve more business problems in less time.

While it is often claimed that AutoML covers the complete workflow of a data science project - from the raw data set to the deployable machine learning model - current solutions have one major drawback: they cannot handle real-world business data. This data typically comes as relational data. The relevant information is scattered over a multitude of tables that are related via so-called join keys. In order to start an AutoML pipeline, a flat feature table has to be created from the relational raw data by hand. This step is called feature engineering. It is a tedious and error-prone process that accounts for up to 90% of the time in a data science project.

![](img/getml_scheme.png)

Scope of automation of established AutoML tools and getML in a data science project

getML adds automated feature engineering on relational data and time series to AutoML. The getML algorithms, Multirel and Relboost, find the right aggregations and subconditions needed to construct meaningful features from the raw relational data. This is done by performing a sophisticated, gradient-based search. In doing so, getML brings the vision of end-to-end automation of machine learning within reach for the first time. Note that getML also includes automated model deployment via an HTTP endpoint or database connectors. This topic is covered in other material.

All functionality of getML is implemented in the so-called _getML engine_. The engine is implemented in C++ to achieve the highest performance and efficiency possible. The getML Python API acts as a bridge to call the C++ engine. In addition, the _getML monitor_ provides a Go-based graphical user interface to ease working with getML and significantly accelerate your workflow.

In this article you will learn the basic steps and commands to tackle your data science projects using the Python API. For illustration purposes, we will also touch on how an example data set like the one used here would be tackled with classical data science tools. In contrast, we will show how the most tedious part of a data science project - merging and aggregating a relational data set - is automated using getML. At the end of this tutorial you will be ready to tackle your own use cases with getML or dive deeper into our software using a variety of follow-up material.

## Starting a new project

After you've successfully [installed](https://docs.get.ml/latest/tutorial/installation.html) getML, you can launch it by executing the `run` script in the getML folder or double-clicking the application icon (depending on your operating system). This launches the getML engine. The engine is written in C++ for maximum performance and is responsible for all the heavy lifting. It is controlled via the Python API.

Before diving into the actual project, you need to log into the engine. This happens in the getML monitor, the frontend to the engine. If you open the browser of your choice and visit http://localhost:1709/, you'll see a login screen. Click 'create new account' and follow the indicated steps. After you've activated your account by clicking the link in the activation email, you're ready to go. From now on, the entire analysis is run from Python. We will cover the getML monitor in a later tutorial, but feel free to check what is going on there while following this guide.

In [1]:
import getml
print("getML version: {}".format(getml.__version__))

getML version: 0.9.1


First, we create a new project. All data sets and models belonging to a project will be stored in ``~/.getML/projects``. If you switch to a different project, the corresponding files will be loaded into memory (data sets, however, have to be loaded explicitly in order not to clutter up your machine's RAM).

In [2]:
from getml import engine
engine.set_project('getting_started')

### Data set

The data set used in this tutorial consists of 2 tables. The so-called population table represents the entities we want to make a prediction about in the analysis. The peripheral table contains additional information and is related to the population table via a join key. Such a data set could appear for example in a customer churn analysis where each row in the population table represents a customer and each row in the peripheral table represents a transaction. It could also be part of a predictive maintenance campaign where each row in the population table corresponds to a particular machine in a production line and each row in the peripheral table to a measurement from a certain sensor.

In this guide, however, we do not assume a particular use case. After all, getML is applicable to a wide range of problems from different domains. Use cases from specific fields are covered in other articles.

In [3]:
# Generate dataset

population_table, peripheral_table = getml.datasets.make_numerical(
    n_rows_population=500,
    n_rows_peripheral=100000,
    random_state=1709
)

This is the resulting population table:

In [None]:
display(population_table.get())

The population table contains 4 columns. The rightmost column, `column_01`, contains a random numerical value. The next column, `targets`, is the one we want to predict in the analysis. To this end, we also need to use the information from the peripheral table.

The relationship between the population and the peripheral table is established using the `join_key` and `time_stamp` columns: Join keys are used to connect one or more rows from one table with one or more rows from the other table. Time stamps are used to limit these joins by enforcing causality and thus ensuring that no data from the future is used during the training.
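The role of the two columns can be illustrated without getML. The following is a minimal sketch in plain Python; the rows are made-up toy values and `visible_rows` is a hypothetical helper, not part of the getML API. It shows which peripheral rows a population row is allowed to see: those with the same join key whose time stamp is not later than its own.

```python
# Toy rows: the column names follow the tutorial, the values are made up.
population = [
    {"join_key": "a", "time_stamp": 0.5},
    {"join_key": "b", "time_stamp": 0.9},
]
peripheral = [
    {"join_key": "a", "time_stamp": 0.2, "column_01": 1.0},
    {"join_key": "a", "time_stamp": 0.7, "column_01": 2.0},  # lies in the future of "a"
    {"join_key": "b", "time_stamp": 0.4, "column_01": 3.0},
    {"join_key": "b", "time_stamp": 0.9, "column_01": 4.0},
]


def visible_rows(pop_row, peripheral_rows):
    """Return the peripheral rows with the same join key that are not from the future."""
    return [
        row
        for row in peripheral_rows
        if row["join_key"] == pop_row["join_key"]
        and row["time_stamp"] <= pop_row["time_stamp"]
    ]


# One list of matching peripheral rows per population row.
matches = {p["join_key"]: visible_rows(p, peripheral) for p in population}
for key, rows in matches.items():
    print(key, [r["column_01"] for r in rows])
```

The row with `column_01 = 2.0` is excluded for entity `"a"` because it was recorded after the population row's time stamp.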

The peripheral table looks like this:

In [None]:
display(peripheral_table.get())

In the peripheral table, `column_01` also contains a random numerical value. The population table and the peripheral table have a one-to-many relationship via `join_key`. This means that one row in the population table is associated with many rows in the peripheral table. In order to use the information from the peripheral table, we need to merge the many rows corresponding to one entry in the population table into one so-called feature. This is done using a certain aggregation.

![](img/getting_started_pic1.png)

Flat feature tables are created by merging and aggregating relational data

Such an aggregation could, for example, be to sum all values in `column_01`. We could also apply a subcondition, like taking only values into account that fall into a certain time range with respect to the entry in the population table. In SQL, a feature that counts the rows inside such a time window would look like this:

```sql
SELECT COUNT( * )
FROM POPULATION t1
LEFT JOIN PERIPHERAL t2
ON t1.join_key = t2.join_key
WHERE (
   ( t1.time_stamp - t2.time_stamp <= TIME_WINDOW )
) AND t2.time_stamp <= t1.time_stamp
GROUP BY t1.join_key,
     t1.time_stamp;
```
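For readers more at home in Python than in SQL, the same feature can be sketched with nested loops. This is an illustration on made-up toy rows, not getML code; `count_in_window` is a hypothetical helper and the value `0.5` for `TIME_WINDOW` is an assumption, since the query leaves it open.

```python
from collections import Counter

TIME_WINDOW = 0.5  # assumed value for illustration

# Made-up toy rows using the column names from the tutorial.
population = [
    {"join_key": 1, "time_stamp": 1.0},
    {"join_key": 2, "time_stamp": 2.0},
    {"join_key": 3, "time_stamp": 0.5},
]
peripheral = [
    {"join_key": 1, "time_stamp": 0.9},
    {"join_key": 1, "time_stamp": 0.6},
    {"join_key": 1, "time_stamp": 0.2},  # too old: 1.0 - 0.2 > TIME_WINDOW
    {"join_key": 1, "time_stamp": 1.3},  # in the future of the population row
    {"join_key": 2, "time_stamp": 1.8},
    {"join_key": 3, "time_stamp": 0.9},  # in the future of the population row
]


def count_in_window(population, peripheral, window):
    """Count matching peripheral rows per (join_key, time_stamp) group."""
    counts = Counter()
    for t1 in population:
        for t2 in peripheral:
            if (
                t1["join_key"] == t2["join_key"]                  # ON clause
                and t1["time_stamp"] - t2["time_stamp"] <= window  # time window
                and t2["time_stamp"] <= t1["time_stamp"]           # no future data
            ):
                counts[(t1["join_key"], t1["time_stamp"])] += 1    # GROUP BY + COUNT
    return dict(counts)


feature = count_in_window(population, peripheral, TIME_WINDOW)
print(feature)
```

As in the SQL query, where the `WHERE` clause filters on columns of `t2`, population rows without any matching peripheral row (join key `3` here) do not appear in the result.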

Unfortunately, neither the right aggregation nor the right subconditions are clear a priori. The feature that allows us to predict the target column best could very well be the average of all values in `column_01` that fall below a certain threshold, or something completely different. If you were to tackle this problem with classical machine learning tools, you would have to write many SQL features by hand and find the best ones by trial and error. At best, you could apply some domain knowledge that guides you in the right direction. This approach, however, has two major disadvantages that can keep you from finding the best-performing features.

1. You might not have sufficient domain knowledge.
2. The process is time-consuming, tedious, and error-prone.
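To make the manual approach concrete, here is a minimal sketch of such a trial-and-error search in plain Python. Everything in it is an assumption made for illustration: the data is synthetic, the target is constructed to be the row count within a time window of 0.5, and the candidate grid and the correlation-based ranking are arbitrary choices. It does not reflect how getML searches for features.

```python
import random
from math import sqrt

random.seed(0)

N_ENTITIES = 50
# Synthetic toy data, generated only for this sketch.
population = [{"join_key": k, "time_stamp": random.random()} for k in range(N_ENTITIES)]
peripheral = [
    {
        "join_key": random.randrange(N_ENTITIES),
        "time_stamp": random.random(),
        "column_01": random.uniform(-1.0, 1.0),
    }
    for _ in range(2000)
]


def aggregate(pop_row, how, window):
    """Aggregate the peripheral rows of one entity that lie at most `window` in the past."""
    values = [
        row["column_01"]
        for row in peripheral
        if row["join_key"] == pop_row["join_key"]
        and 0.0 <= pop_row["time_stamp"] - row["time_stamp"] <= window
    ]
    return float(len(values)) if how == "count" else sum(values)


# By construction, the target is the row count inside a window of 0.5.
targets = [aggregate(p, "count", 0.5) for p in population]


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return abs(cov) / sqrt(var_x * var_y)


# Try every hand-written candidate and rank it by correlation with the target.
scores = {}
for how in ("count", "sum"):
    for window in (0.1, 0.25, 0.5, 1.0):
        feature = [aggregate(p, how, window) for p in population]
        scores[(how, window)] = abs_correlation(feature, targets)

best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

The search only succeeds here because the right candidate happens to be in the grid. With more tables, more columns, and unknown thresholds, the grid grows quickly, which is exactly the second disadvantage listed above.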

This is where getML comes in. It finds the correct features for you - automatically. You do not need to manually merge and aggregate tables in order to get started with a data science project. On top of that, getML uses the derived features in a classical AutoML setting to easily make predictions with established and well-performing algorithms. This means getML provides an end-to-end solution, starting from the relational data and ending with a trained ML model. How this is done via the getML Python API is demonstrated in the following.

## Defining the data model

The next step is finding features in the data that allow an accurate prediction of the target variable in the population table. This is achieved using a `MultirelModel`. This model is responsible for the entire process from feature engineering to predicting the target variable based on the generated features. The `MultirelModel` requires a predefined data model in order to efficiently represent the data in memory. This is achieved via `Placeholder`s. `Placeholder`s are lightweight, abstract representations of `DataFrame`s and the relations between them.

In [None]:
population_placeholder = population_table.to_placeholder()

peripheral_placeholder = peripheral_table.to_placeholder()

population_placeholder.join(peripheral_placeholder,
                            join_key="join_key",
                            time_stamp="time_stamp")
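To give an intuition for what "light and abstract" means, the following sketch models the idea with a small data class. `SketchPlaceholder` is a hypothetical name invented for this illustration. It is not getML's `Placeholder` class and says nothing about how getML implements it; it only shows that a data model can be described by table names, column names, and join relations without holding any rows.

```python
from dataclasses import dataclass, field


@dataclass
class SketchPlaceholder:
    """Schema-only stand-in for a table: a name, column names, and joins. No data."""

    name: str
    columns: list
    joins: list = field(default_factory=list)

    def join(self, other, join_key, time_stamp):
        """Record a relation to another placeholder after checking the columns exist."""
        for column in (join_key, time_stamp):
            if column not in self.columns or column not in other.columns:
                raise KeyError("column {!r} must exist in both tables".format(column))
        self.joins.append((other.name, join_key, time_stamp))


population = SketchPlaceholder(
    "POPULATION", ["column_01", "join_key", "time_stamp", "targets"]
)
peripheral = SketchPlaceholder("PERIPHERAL", ["column_01", "join_key", "time_stamp"])

population.join(peripheral, join_key="join_key", time_stamp="time_stamp")
print(population.joins)
```

Because such an object carries only the schema, the same data model can later be combined with different data sets that share this structure.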


Now we can define the model. On top of the `Placeholder`s representing the `DataFrame`s you also have to provide a predictor. Additionally, you can alter some hyperparameters like the number of features you want to train or the list of aggregations to select from when building features.

In [None]:
model = getml.models.MultirelModel(
    name='getting_started_model',
    aggregation=[
        getml.aggregations.Count,
        getml.aggregations.Sum
    ],
    population=population_placeholder,
    peripheral=[peripheral_placeholder],
    loss_function=getml.loss_functions.SquareLoss(),
    predictor=getml.predictors.LinearRegression(),
    num_features=10,
).send()

We have chosen a narrow search field in aggregation space by only letting the model choose between ``Count`` and ``Sum``. We also use a simple `LinearRegression` since we want to predict a numerical variable. For the sake of demonstration, we chose to construct only 10 different features. In real-world projects you would construct at least ten times this number to get significantly better results.

## Training a model

When fitting the model, we pass it the actual data contained in the `DataFrame`s:

In [None]:
model = model.fit(
    population_table=population_table,
    peripheral_tables=[peripheral_table]
)

That's it. The Multirel feature engineering routines as well as the `LinearRegression` contained in the `MultirelModel` are now trained on our example data set.

### Scoring the model

Let's generate another population table as a validation data set in order to see how well the trained model performs on new data. For numerical predictions this results in three different scores: mean absolute error (MAE), root mean squared error (RMSE), and the squared correlation coefficient (rsquared).

In [None]:
population_table_score, peripheral_table_score = getml.datasets.make_numerical(
    n_rows_population=200,
    n_rows_peripheral=8000,
    random_state=1710
)

scores = model.score(
    population_table=population_table_score,
    peripheral_tables=[peripheral_table_score]
)
print(scores)

The mean absolute error is roughly 7%. Our model is able to predict the target variable in the newly generated data set very accurately.
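The three scores can also be computed by hand, which helps when interpreting them. The sketch below uses plain Python and made-up numbers; the function names are chosen for this illustration, and `rsquared` follows the description above as the squared Pearson correlation between targets and predictions.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def rsquared(y_true, y_pred):
    """Squared Pearson correlation between targets and predictions."""
    mean_t = sum(y_true) / len(y_true)
    mean_p = sum(y_pred) / len(y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov ** 2 / (var_t * var_p)


# Made-up targets and predictions, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

result = {
    "mae": mae(y_true, y_pred),
    "rmse": rmse(y_true, y_pred),
    "rsquared": rsquared(y_true, y_pred),
}
print(result)
```

MAE and RMSE are expressed in the units of the target, so they should be read relative to the spread of the target values, while rsquared always lies between 0 and 1.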

### Making predictions

You can also make predictions using the model you have just trained:

In [None]:
population_table_predict, peripheral_table_predict = getml.datasets.make_numerical(
    n_rows_population=200,
    n_rows_peripheral=8000,
    random_state=1711
)


yhat = model.predict(
    population_table=population_table_predict,
    peripheral_tables=[peripheral_table_predict]
)

### Extracting features

Of course you can also transform a specific data set into the corresponding features in order to insert them into another machine learning algorithm.

In [None]:
features = model.transform(
    population_table=population_table_predict,
    peripheral_tables=[peripheral_table_predict]
)
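What "another machine learning algorithm" could look like is sketched below with an ordinary least-squares line fitted to a single feature column. The numbers are made up and the feature values are assumed to be available as a plain Python list; the source does not show the type returned by `transform`, so converting it into such a list is left open here. `fit_line` is a helper written for this illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input variable: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up values standing in for one generated feature and the targets.
feature_column = [0.0, 1.0, 2.0, 3.0]
targets = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(feature_column, targets)
predictions = [slope * x + intercept for x in feature_column]
print(slope, intercept)
```

In practice you would pass the full feature matrix to the learning library of your choice; the point is only that the generated features are ordinary numerical columns.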

If you want to see the SQL code for each feature you can do so by clicking on the feature in the monitor or calling the `to_sql` method on the `MultirelModel`.

In [None]:
print(model.to_sql())

The definition of `feature_2` is:

```sql
CREATE TABLE FEATURE_2 AS
SELECT COUNT( * ) AS feature_2,
       t1.join_key,
       t1.time_stamp
FROM (
     SELECT *,
            ROW_NUMBER() OVER ( ORDER BY join_key, time_stamp ASC ) AS rownum
     FROM POPULATION
) t1
LEFT JOIN PERIPHERAL t2
ON t1.join_key = t2.join_key
WHERE (
   ( t1.time_stamp - t2.time_stamp <= 0.499323 )
) AND t2.time_stamp <= t1.time_stamp
GROUP BY t1.rownum,
         t1.join_key,
         t1.time_stamp;

```

This very much resembles the ad hoc definition we tried in the beginning. The correct aggregation to use on this data set is `Count` with the subcondition that only entries within a time window of 0.5 are considered. getML extracted this definition completely autonomously.

## Next steps

This guide has shown you the very basics of getML. Starting from a simple data set, you have completed a full project including feature engineering and linear regression using an automated end-to-end pipeline. The most tedious part of this process - finding the right aggregations and subconditions to construct a feature table from the relational data model - was also included in this pipeline.


But there's more! Related articles show applications of getML to real-world data sets and demonstrate the excellent results getML can achieve in competitions. Furthermore, there is an entire article about model deployment with getML.

Also, don't hesitate to [contact us](https://get.ml/contact/lets-talk) with your feedback.