# Get Started with getML

You need to complete the [installation instructions](https://docs.get.ml/latest/tutorial/installation.html) before you can get started.

In this guide you will discover the basic concepts of the getML Python API.

1. [Starting a new project](#Starting-a-new-project)
2. [Defining a data model]()
3. [Training an ML model]()

The main result is a technical understanding of the functionality and benefits of getML.

### Introduction

Automated machine learning (AutoML) has attracted a great deal of attention in recent years. It strives to simplify the application of traditional machine learning methods to real-world business problems by automating key steps of a data science project, such as data pre-processing, feature extraction, model selection and hyperparameter optimization. With AutoML, data scientists can develop and compare dozens of models, find insights and predictions, and solve more business problems faster.

While it is often claimed that AutoML covers the complete pipeline from the raw dataset to deployable machine learning models, current solutions lack one major step in the data science process. Real-world business data typically appears in the form of relational data. The relevant information is scattered over a multitude of tables that are related via so-called join keys. In order to start an AutoML pipeline, a flat feature table has to be created from the relational raw data by hand. This step is called feature engineering and is a tedious and error-prone process that accounts for up to 90% of the time in a data science project.

![scheme](img/getml_scheme.png)

getML adds automated feature engineering on relational data and time series to AutoML. The getML algorithms, Multirel and Relboost, find the right aggregations and subconditions needed to construct meaningful features by performing a sophisticated, gradient-based search. In doing so, getML brings the vision of end-to-end automation of machine learning within reach for the first time.

In this guide you will learn the basic steps and commands to tackle your data science projects using the Python API. For illustration purposes, we will also touch on how an example dataset like the one used here would have been tackled with classical data science tools. In contrast, we will show how the most tedious part of a data science project, merging and aggregating a relational dataset, is automated using getML's MultirelModel. At the end of this tutorial you will be ready to tackle your own use cases with getML or dive deeper into our software using a variety of follow-up material.

### Dataset

The dataset used in this tutorial consists of two tables: a population table with 500 rows and one peripheral table with 100,000 rows. Such a dataset could appear, for example, in a customer churn analysis where each row in the population table represents a customer and each row in the peripheral table represents a transaction. It could also be part of a predictive maintenance campaign where each row in the population table corresponds to a machine in a production line and each row in the peripheral table is a measurement from a certain sensor.

In this guide we do not assume a particular use case. After all, getML is applicable to a wide range of problems from different domains. Use cases from specific fields will be covered in other tutorials.

The population table used in this analysis looks like this:

| time_stamp                  | join_key | targets | column_01   | 
|:----------------------------|:---------|:--------|:------------|
| 1970-01-01T11:18:00.114261Z | 0        | 101     | -0.629518   | 
| 1970-01-01T21:35:41.185307Z | 1        | 88      | -0.962169   | 
| 1970-01-01T02:03:27.430873Z | 2        | 17      | 0.732649    | 
| 1970-01-01T08:45:55.322776Z | 3        | 74      | -0.462678   | 
| 1970-01-01T10:37:51.538818Z | 4        | 96      | -0.837399   | 
| 1970-01-01T01:36:50.690797Z | 5        | 12      | 0.322344    | 
| ...                         | ...      | ...     | ...         | 

It contains 4 columns. The rightmost column, `column_01`, contains a random numerical value. The column to its left, `targets`, is the one we want to predict in the following analysis. To this end, we also need to use the information from the peripheral table.

The relationship between the population and the peripheral table is established using the `join_key` and `time_stamp` columns. Join keys are used to connect one or more rows from one table with one or more rows from the other table. Time stamps are used to limit this join: only peripheral rows whose time stamp is not later than that of the population row are considered. This adheres to the “golden rule” of predictive analytics, i.e. not to use any data from the future.
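
To make the role of the two columns concrete, here is a minimal sketch in plain Python. The `Row` class, the `matching_rows` helper and the toy values are hypothetical and only serve as illustration; time stamps are simplified to fractions of a day rather than the ISO strings shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    join_key: int
    time_stamp: float  # fraction of a day (simplification of the ISO stamps)
    column_01: float


def matching_rows(population_row, peripheral_rows):
    """Return peripheral rows that share the join key and are not from the future."""
    return [
        p for p in peripheral_rows
        if p.join_key == population_row.join_key
        and p.time_stamp <= population_row.time_stamp
    ]


# Toy data, invented for illustration only.
customer = Row(join_key=0, time_stamp=0.47, column_01=-0.63)
transactions = [
    Row(join_key=0, time_stamp=0.10, column_01=0.25),   # same key, in the past: kept
    Row(join_key=0, time_stamp=0.90, column_01=-0.40),  # same key, in the future: dropped
    Row(join_key=1, time_stamp=0.20, column_01=0.70),   # different key: dropped
]

visible = matching_rows(customer, transactions)
print(visible)
```

Only the first transaction survives: it belongs to the same `join_key` and happened before the population row's `time_stamp`.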

The peripheral table looks like this:

| time_stamp                  | join_key | column_01  | 
|:----------------------------|:---------|:-----------|
| 1970-01-01T07:49:02.153328Z | 26       | -0.296267  | 
| 1970-01-01T18:13:22.004830Z | 12       | 0.592168   | 
| 1970-01-01T08:40:46.879346Z | 42       | -0.985272  | 
| 1970-01-01T04:33:29.892692Z | 295      | 0.226407   | 
| 1970-01-01T13:06:06.752214Z | 321      | -0.443054  | 
| ...                         | ...      | ...        | 

Again, `column_01` contains a random numerical value. The population table and the peripheral table have a one-to-many relationship via `join_key`, i.e. one row in the population table is associated with many rows in the peripheral table. In order to use the information from the peripheral table, we need to merge the many rows corresponding to one entry in the population table into one so-called feature. This is done using a certain aggregation.

![](img/getting_started_pic1.png)

Such an aggregation could, for example, be the sum of all values in `column_01`, or simply the number of matching rows. We could also apply a subcondition, such as only taking into account values that fall into a certain time range relative to the entry in the population table. In SQL, a feature that counts the peripheral rows within such a time window would look like this:

```sql
SELECT COUNT( * )
FROM POPULATION t1
LEFT JOIN PERIPHERAL t2
ON t1.join_key = t2.join_key
WHERE (
   ( t1.time_stamp - t2.time_stamp <= TIME_WINDOW )
) AND t2.time_stamp <= t1.time_stamp
GROUP BY t1.join_key,
     t1.time_stamp;
```
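
To see what this query computes, the following sketch runs it on a tiny in-memory SQLite database using Python's standard `sqlite3` module. The table contents and the value of `TIME_WINDOW` are assumptions made up for this example; the query additionally selects `t1.join_key` so that each count can be attributed to its population row.

```python
import sqlite3

TIME_WINDOW = 0.5  # hypothetical window length, in the same unit as time_stamp

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE POPULATION (join_key INTEGER, time_stamp REAL);
    CREATE TABLE PERIPHERAL (join_key INTEGER, time_stamp REAL, column_01 REAL);
""")
conn.executemany("INSERT INTO POPULATION VALUES (?, ?)", [(0, 0.8), (1, 0.6)])
conn.executemany(
    "INSERT INTO PERIPHERAL VALUES (?, ?, ?)",
    [
        (0, 0.1, 0.3),   # too old: 0.8 - 0.1 exceeds the window
        (0, 0.5, -0.2),  # inside the window
        (0, 0.7, 0.9),   # inside the window
        (0, 0.9, 0.4),   # in the future, must not be used
        (1, 0.2, 0.1),   # inside the window
    ],
)

query = """
    SELECT t1.join_key, COUNT( * )
    FROM POPULATION t1
    LEFT JOIN PERIPHERAL t2
    ON t1.join_key = t2.join_key
    WHERE ( t1.time_stamp - t2.time_stamp <= ? )
    AND t2.time_stamp <= t1.time_stamp
    GROUP BY t1.join_key, t1.time_stamp
    ORDER BY t1.join_key;
"""
feature = conn.execute(query, (TIME_WINDOW,)).fetchall()
print(feature)  # [(0, 2), (1, 1)]
```

Note that because the `WHERE` clause filters on columns of `t2`, population rows without any matching peripheral row drop out of the result, even though the join is written as a `LEFT JOIN`.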

Unfortunately, neither the right aggregation nor the right subconditions are clear a priori. The feature that allows us to predict the target column could very well be the average of all values in `column_01` that fall below a certain threshold, or something completely different. If you were to tackle this problem with classical machine learning tools, you would write many SQL features by hand and find the best features in a trial-and-error fashion. At best, you could apply some domain knowledge that guides you in the right direction. This approach, however, has two disadvantages:

1. You might not have the domain knowledge needed to know where to start.
2. The process is long and error-prone. You are very likely to miss the most meaningful features.
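
The manual trial-and-error process can be sketched in plain Python as follows. Everything here is an assumption for the sake of illustration: the toy tables are randomly generated stand-ins, the hidden rule behind the target is set to "count within a window of 0.5", and candidates are ranked by their absolute Pearson correlation with the target. It is not how getML searches for features.

```python
import math
import random

random.seed(1709)

# Toy stand-ins for the two tables (hypothetical data, not the getML dataset).
# population rows: (join_key, time_stamp)
population = [(key, random.random()) for key in range(40)]
# peripheral rows: (join_key, time_stamp, column_01)
peripheral = [
    (random.randrange(40), random.random(), random.uniform(-1, 1))
    for _ in range(2000)
]

AGGREGATIONS = {
    "count": lambda values: float(len(values)),
    "sum": lambda values: sum(values),
    "avg": lambda values: sum(values) / len(values) if values else 0.0,
}


def build_feature(aggregation, window):
    """Aggregate column_01 over past peripheral rows within `window` of each population row."""
    feature = []
    for key, t_pop in population:
        values = [
            value for p_key, t_per, value in peripheral
            if p_key == key and t_per <= t_pop and t_pop - t_per <= window
        ]
        feature.append(AGGREGATIONS[aggregation](values))
    return feature


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom > 0 else 0.0


# For this sketch we pretend the hidden rule behind the target is COUNT within 0.5.
targets = build_feature("count", 0.5)

# Trial and error: evaluate every hand-written candidate and keep the best one.
scores = {
    (aggregation, window): abs(pearson(build_feature(aggregation, window), targets))
    for aggregation in AGGREGATIONS
    for window in (0.25, 0.5, 1.0)
}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Even in this tiny setup, the candidate grid has to contain the right aggregation and the right window for the search to succeed, which illustrates both disadvantages listed above.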

This is where getML comes in. It finds the correct features for you. You do not need to manually merge and aggregate tables in order to get started with a data science project. On top of that, getML uses the derived features to train a predictor. This means getML provides an end-to-end solution, from the relational data to a trained ML model. How this is done concretely is demonstrated in the following.

## Starting a new project

After you've successfully [installed](https://docs.get.ml/latest/tutorial/installation.html) getML, you can launch it by executing the `run` script in the getML folder or double-clicking the application icon (depending on your operating system). This launches the getML engine. The engine is written in C++ for maximum performance and is responsible for all the heavy lifting. You control it via the Python API, as demonstrated in the following.

Before diving into the actual project, you need to log into the engine. This happens in the getML Monitor, the frontend to the engine. If you open the browser of your choice and visit http://localhost:1709/, you'll see a login screen. Click 'create new account' and follow the indicated steps. After you've activated your account by clicking the link in the activation email, you're ready to go. From now on, the entire analysis is run from Python. We will cover the getML Monitor in a later tutorial, but feel free to check what's going on while following this guide.


First, we create a new project. All datasets and models belonging to a project will be stored in ``~/.getML/projects``. If you switch to a different project, the corresponding files will be loaded into memory.

In [None]:
import getml
getml.engine.set_project('getting_started')

# Generate dataset
population_table, peripheral_table = getml.datasets.make_numerical(
    n_rows_population=500,
    n_rows_peripheral=100000,
    random_state=1709
)

## Building a model

The next step is to find features in the data that allow an accurate prediction of the target variable in the population table. This is achieved using a MultirelModel. This model is responsible for the entire process, from feature engineering to predicting the target variable based on the generated features. The MultirelModel requires a predefined data model with all joins defined. This is achieved via Placeholders:

In [None]:
population_placeholder = population_table.to_placeholder()

peripheral_placeholder = peripheral_table.to_placeholder()

population_placeholder.join(peripheral_placeholder,
                            "join_key",
                            "time_stamp")


Now we can define the model. In addition to the placeholders for the DataFrames, it requires a list of aggregations to select from when building features. You also have to define a loss function, a predictor and some hyperparameters, such as the number of features you want to train.

In [None]:
model = getml.models.MultirelModel(
    name='getting_started_model',
    aggregation=[
        getml.aggregations.Count,
        getml.aggregations.Sum
    ],
    population=population_placeholder,
    peripheral=[peripheral_placeholder],
    loss_function=getml.loss_functions.SquareLoss(),
    predictor=getml.predictors.LinearRegression(),
    num_features=10,
).send()

We have narrowed the search space by letting the model choose only between ``Count`` and ``Sum``. We used square loss as the loss function and a linear regression as the predictor, since we want to predict a numerical variable. For the sake of demonstration, we chose to construct only 10 features.

## Fitting a model

When fitting the model, we pass it the actual DataFrames:

In [None]:
model = model.fit(
    population_table=population_table,
    peripheral_tables=[peripheral_table]
)

That's it. The Multirel model is now trained on our example dataset.

### Scoring the model

Let's generate another population table in order to see how well the trained model performs on new data. The available scores are the mean absolute error, the root mean squared error and the squared correlation coefficient.

In [None]:
population_table_score, peripheral_table_score = getml.datasets.make_numerical(
    n_rows_population=200,
    n_rows_peripheral=8000,
    random_state=1710
)

scores = model.score(
    population_table=population_table_score,
    peripheral_tables=[peripheral_table_score]
)
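
For reference, the three scores can be written down in a few lines of plain Python. This is a sketch of the textbook definitions applied to made-up numbers; it makes no claim about how getML computes these values internally or how they are named in the returned `scores` object.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute deviation between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared deviation."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def squared_correlation(y_true, y_pred):
    """Square of the Pearson correlation between targets and predictions."""
    mean_t = sum(y_true) / len(y_true)
    mean_p = sum(y_pred) / len(y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov ** 2 / (var_t * var_p)


# Made-up targets and predictions, for illustration only.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]

mae = mean_absolute_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
rsquared = squared_correlation(y_true, y_pred)
print(mae, rmse, rsquared)
```

The mean absolute error and the root mean squared error are expressed in the units of the target, whereas the squared correlation is a unitless number between 0 and 1.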

The mean absolute error is roughly 7%. Our model is able to predict the target variable in the newly generated dataset very accurately.

### Making predictions

You can also make predictions using the model you have just trained:

In [None]:
population_table_predict, peripheral_table_predict = getml.datasets.make_numerical(
    n_rows_population=200,
    n_rows_peripheral=8000,
    random_state=1711
)


yhat = model.predict(
    population_table=population_table_predict,
    peripheral_tables=[peripheral_table_predict]
)

### Extracting features

Of course, you can also extract the features for a specific dataset in order to feed them into another machine learning algorithm.

In [None]:
features = model.transform(
    population_table=population_table_predict,
    peripheral_tables=[peripheral_table_predict]
)
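
As a sketch of what feeding the features into another algorithm could look like, the following example fits an ordinary least squares line to a single feature column in plain Python. The `feature_matrix` and `target_values` below are hypothetical stand-ins with invented numbers; the actual return type and shape of `model.transform` are not shown in this guide, so treat the row-by-column layout as an assumption.

```python
# Stand-in for extracted features: one row per population entry, one column
# per feature (hypothetical values; the real return type may differ).
feature_matrix = [
    [1.0, 0.2],
    [2.0, -0.1],
    [3.0, 0.4],
    [4.0, 0.0],
]
target_values = [3.0, 5.0, 7.0, 9.0]


def fit_simple_regression(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance = sum((a - mean_x) ** 2 for a in x)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


# Use the first feature column as the single explanatory variable.
first_feature = [row[0] for row in feature_matrix]
slope, intercept = fit_simple_regression(first_feature, target_values)


def predict(value):
    """Predict the target for a new value of the first feature."""
    return slope * value + intercept


print(slope, intercept, predict(5.0))
```

In practice you would pass the full feature matrix to the library of your choice; the point is only that the extracted features form an ordinary flat table.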

If you want to see the SQL code for each feature you can do so by clicking on the feature in the monitor or calling the `to_sql` method on the MultirelModel.

In [None]:
print(model.to_sql())

The definition of `feature_2` is:

```sql
CREATE TABLE FEATURE_2 AS
SELECT COUNT( * ) AS feature_2,
       t1.join_key,
       t1.time_stamp
FROM (
     SELECT *,
            ROW_NUMBER() OVER ( ORDER BY join_key, time_stamp ASC ) AS rownum
     FROM POPULATION
) t1
LEFT JOIN PERIPHERAL t2
ON t1.join_key = t2.join_key
WHERE (
   ( t1.time_stamp - t2.time_stamp <= 0.499323 )
) AND t2.time_stamp <= t1.time_stamp
GROUP BY t1.rownum,
         t1.join_key,
         t1.time_stamp;

```

This closely resembles the ad hoc definition we tried in the beginning. The correct aggregation to use on this dataset is a count with the subcondition that only entries within a time window of 0.5 are considered. getML extracted this definition completely autonomously.

## Next steps

This guide has shown you the very basics of getML. But there's more:

* If you want to find out more about getML in general, head over to the [webpage](https://get.ml).
* If you're interested in more advanced projects on real-world datasets, you can find examples in the [projects section]() of the documentation.
* If you're curious about other features of getML, go to our [user guide](https://docs.get.ml).

Also, don't hesitate to [contact us](https://get.ml/contact/lets-talk) with your feedback.