# Get Started with getML

getML adds automated feature engineering on relational data and time series to AutoML. In this guide you will learn the basic steps and commands to tackle your data science projects using the Python API. Please refer to our [installation instructions](https://docs.get.ml/latest/tutorial/installation.html) for installing getML.

## Starting a new project

After you've successfully [installed](https://docs.get.ml/latest/tutorial/installation.html) getML, launch it by executing the `run` script in the getML folder or by double-clicking the application icon (depending on your operating system). This launches the getML engine. The engine is written in C++ for maximum performance and is responsible for all the heavy lifting. You control it via the Python API.

Before diving into the actual project, you need to log into the engine. This happens in the getML Monitor, the frontend to the engine. If you open the browser of your choice and visit http://localhost:1709/, you'll see the following login screen.

![monitor-login](img/monitor-login.png)

Click 'create new account' and follow the steps indicated by the monitor. After you've activated your account by clicking the link in the activation email, you're ready to go. In principle, you can run the entire analysis from Python alone. We will, however, visit the monitor from time to time to see what is going on in the engine and to visualize the results.
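If the login page does not load, it can help to check whether anything is accepting connections on the monitor's port. The following is a minimal sketch using only the Python standard library. The helper `is_listening` is hypothetical and not part of getML, and a successful connection only shows that some process is listening on that port, not that it is the getML engine.

```python
import socket


def is_listening(host="localhost", port=1709, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper, not part of the getML API.
    """
    try:
        # create_connection raises OSError (refused, timed out,
        # name not resolved) if the connection cannot be opened.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Port 1709 is the one used in the monitor URL above.
print(is_listening())
```

If this prints `False`, the engine is most likely not running yet, so launch it before calling the Python API.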

In [None]:
import getml.engine as engine
engine.set_project('getting_started')

## Staging the data

For the sake of simplicity, we will use an artificial dataset in this tutorial that consists of only two tables: the population table and one peripheral table.

In [None]:
from getml.datasets import make_numerical
population_table, peripheral_table = make_numerical(
    n_rows_population=500,
    n_rows_peripheral=100000,
    random_state=1709
)

display(population_table, peripheral_table)

The population table contains one numerical variable in `column_01`, a unique integer `join_key`, a `time_stamp`, and the `targets` variable that we want to predict in the following. The peripheral table also contains one numerical variable in `column_01`, an integer `join_key` that attributes each row to a row in the population table, and a `time_stamp`.

The first step of every analysis is to load the data into the getML engine. In order to optimize the performance of the automated feature engineering algorithm, the role of each column has to be defined beforehand. The role of a column tells the algorithm whether the column is to be treated as numerical, categorical or discrete, or whether it has a special meaning: it can, for example, represent a join key, a time stamp or the target variable.
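Before sending data to the engine, it can be useful to write the role assignment down explicitly and check it. The sketch below is plain Python and does not call getML. The helper `check_roles` is hypothetical, and it assumes that each column should carry exactly one role. The role names mirror the keyword arguments used in the next cell.

```python
def check_roles(columns, roles):
    """Return the columns without a role; raise if an assignment is invalid.

    Hypothetical helper, not part of the getML API. Assumes that a
    column may carry at most one role.
    """
    seen = {}
    for role, names in roles.items():
        for name in names:
            if name not in columns:
                raise ValueError(f"unknown column {name!r} in role {role!r}")
            if name in seen:
                raise ValueError(
                    f"column {name!r} has two roles: {seen[name]!r} and {role!r}"
                )
            seen[name] = role
    # Columns that were never mentioned in any role.
    return [name for name in columns if name not in seen]


population_columns = ["column_01", "join_key", "time_stamp", "targets"]
population_roles = {
    "join_keys": ["join_key"],
    "numerical": ["column_01"],
    "time_stamps": ["time_stamp"],
    "targets": ["targets"],
}

unassigned = check_roles(population_columns, population_roles)
print(unassigned)  # [] because every column has exactly one role
```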

In [None]:
peripheral_on_engine = engine.DataFrame(
    name="PERIPHERAL",
    join_keys=["join_key"],
    numerical=["column_01"],
    time_stamps=["time_stamp"]
)

peripheral_on_engine.send(peripheral_table)

population_on_engine = engine.DataFrame(
    name="POPULATION",
    join_keys=["join_key"],
    numerical=["column_01"],
    time_stamps=["time_stamp"],
    targets=["targets"]
)

population_on_engine.send(population_table)

If you go back to the monitor (in your browser) and click on the 'Data Frames' tab in the navigation menu, you should see the following screen:

![monitor-dataframe](img/monitor-dataframe.png)

Two Data Frames, the population table with 500 rows and the peripheral table with 100000 rows, have been uploaded to the engine. You can click on each table to see the data and the roles you have defined for each column.

## Building a model

The next step is finding features in the data that allow us to predict the target variable in the population table. This is achieved using a MultirelModel. This model is responsible for the entire process, from feature engineering to predicting the target variable based on the generated features. The MultirelModel requires a predefined data model with all joins defined:

In [None]:
from getml import models

population_placeholder = models.Placeholder(
    name="POPULATION",
    numerical=["column_01"],
    join_keys=["join_key"],
    time_stamps=["time_stamp"],
    targets=["targets"]
)

peripheral_placeholder = models.Placeholder(
    name="PERIPHERAL",
    numerical=["column_01"],
    join_keys=["join_key"],
    time_stamps=["time_stamp"]
)

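# Join the peripheral table to the population table on `join_key`,
# using `time_stamp` as the time stamp column.
population_placeholder.join(peripheral_placeholder,
                            "join_key",
                            "time_stamp")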


Now we can define the model. On top of the placeholders for the Data Frames, it requires a list of aggregations to select from when building features. You also have to define a loss function, a predictor and some hyperparameters, such as the number of features you want to train.

In [None]:
from getml import predictors
from getml import aggregations
from getml import loss_functions

model = models.MultirelModel(
    name='getting_started_model',
    aggregation=[
        aggregations.Count,
        aggregations.Sum
    ],
    population=population_placeholder,
    peripheral=[peripheral_placeholder],
    loss_function=loss_functions.SquareLoss(),
    predictor=predictors.LinearRegression(),
    num_features=10,
).send()

## Fitting a model

When fitting the model, you have to pass it the actual data tables:

In [None]:
model = model.fit(
    population_table=population_on_engine,
    peripheral_tables=[peripheral_on_engine]
)

After this step the model is available in the monitor in the 'Models' tab. Select the 'getting_started_model' in order to see the following screen.

![monitor-model](img/monitor-model.png)


You see the 10 features trained by the MultirelModel, sorted by their importance. The most significant feature is `feature_2` with an importance of `0.6476`. Below the features, you also see the data model that was defined for this model. We will come back to the graphs that are still empty in the next section.

## Scoring the model

Let's generate a new population table in order to see how well the trained model performs. The available scores are the mean absolute error, the root mean squared error and the squared correlation coefficient.
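For reference, these three scores can be computed by hand from targets and predictions. The sketch below uses the textbook definitions and only the standard library. The function name and the dictionary keys are chosen for illustration and are not taken from the getML API, and the numbers are made up for the example.

```python
import math


def regression_scores(y_true, y_pred):
    """Compute MAE, RMSE and squared Pearson correlation by hand.

    Illustrative helper, not part of the getML API. Assumes both
    sequences have the same, non-zero length and are not constant.
    """
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    # Squared correlation: covariance squared over the product of variances.
    rsquared = cov * cov / (var_t * var_p)

    return {"mae": mae, "rmse": rmse, "rsquared": rsquared}


example_scores = regression_scores(
    [1.0, 2.0, 3.0, 4.0],  # made-up targets
    [1.5, 2.0, 2.5, 4.0],  # made-up predictions
)
print(example_scores)
```

Lower values are better for the two error measures, while a squared correlation close to 1 indicates that the predictions track the targets closely.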

In [None]:
population_table_score, peripheral_table_score = make_numerical(
    n_rows_population=200,
    n_rows_peripheral=8000,
    random_state=1710
)

scores = model.score(
    population_table=population_table_score,
    peripheral_tables=[peripheral_table_score]
)

print(scores)

Our model is able to predict the target variable in the newly generated dataset very accurately. The scores will automatically appear in the getML monitor as well as the correlation of each feature with the target.

![monitor-score](img/monitor-score.png)

## Making predictions

You can also make predictions using the model you have just trained:

In [None]:
population_table_predict, peripheral_table_predict = make_numerical(
    n_rows_population=200,
    n_rows_peripheral=8000,
    random_state=1711
)


yhat = model.predict(
    population_table=population_table_predict,
    peripheral_tables=[peripheral_table_predict]
)

## Extracting features

Of course, you can also extract the features for a specific dataset in order to feed them into another machine learning algorithm.

In [None]:
features = model.transform(
    population_table=population_table_predict,
    peripheral_tables=[peripheral_table_predict]
)

If you want to see the SQL code for each feature, you can do so by clicking on the feature in the monitor or by calling the `to_sql` method on the MultirelModel.

In [None]:
print(model.to_sql())

The definition of `feature_2` is:

```
CREATE TABLE FEATURE_2 AS
SELECT COUNT( * ) AS feature_2,
       t1.join_key,
       t1.time_stamp
FROM (
     SELECT *,
            ROW_NUMBER() OVER ( ORDER BY join_key, time_stamp ASC ) AS rownum
     FROM POPULATION
) t1
LEFT JOIN PERIPHERAL t2
ON t1.join_key = t2.join_key
WHERE (
   ( t1.time_stamp - t2.time_stamp <= 0.499323 )
) AND t2.time_stamp <= t1.time_stamp
GROUP BY t1.rownum,
         t1.join_key,
         t1.time_stamp;

```

This is almost exactly the definition of the target variable in the artificial dataset we generated at the beginning. getML extracted this definition completely autonomously.
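To make the SQL definition of `feature_2` more concrete, here is a minimal plain-Python sketch of the same aggregation. For every population row, it counts the peripheral rows that share the join key and whose time stamp lies at most 0.499323 before the population row's time stamp. The function name and the toy rows are made up for illustration and are not produced by getML.

```python
def count_recent(population, peripheral, window=0.499323):
    """Count matching peripheral rows per population row.

    Illustrative re-implementation of the SQL for feature_2, not part
    of the getML API. Rows are (join_key, time_stamp) tuples. Unlike
    the SQL, which drops population rows without any match, this
    sketch reports a count of 0 for them.
    """
    counts = []
    for key, t in population:
        n = sum(
            1
            for p_key, p_t in peripheral
            # Same join key, not in the future, and inside the window.
            if p_key == key and p_t <= t and t - p_t <= window
        )
        counts.append(n)
    return counts


# Made-up toy rows: (join_key, time_stamp)
toy_population = [(0, 1.0), (1, 0.6)]
toy_peripheral = [
    (0, 0.9), (0, 0.7),  # inside the window of the first population row
    (0, 0.4),            # too old
    (0, 1.1),            # lies in the future
    (1, 0.5),            # inside the window of the second population row
    (1, 0.05),           # too old
]

print(count_recent(toy_population, toy_peripheral))  # [2, 1]
```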

## Next steps

This guide has shown you the very basics of getML. But there's more:

* If you want to find out more about getML in general, head over to the [webpage](https://get.ml).
* If you're interested in more advanced projects on real-world datasets, you can find examples in the [projects section]() of the documentation.
* If you're curious about other features of getML, go to our [user guide](https://docs.get.ml).

Also, don't hesitate to [contact us](https://get.ml/contact/lets-talk) with your feedback.