# Why relational learning matters

The main purpose of this notebook is to demonstrate how powerful **relational learning** can be.

Relational learning is one of the most **underappreciated fields of machine learning**. Even though relational learning is **relevant to many real-world data science projects**, many data scientists don't even know what it is.

There are many subdomains of relational learning, but the most important one is **extracting features from relational data**: Most business data is **relational**, meaning that it is spread out over **several relational tables**. However, most machine learning algorithms require that the data be presented in the form of a single flat table. So we need to extract features from our relational data. Some people also call this **data wrangling**.

Most data scientists we know extract features from relational data manually or by using crude, brute-force approaches (randomly generate thousands of features and then do a feature selection). This is very time-consuming and does not produce good features.

This example demonstrates how powerful a real relational learning algorithm can be. Based on a public-domain dataset on consumer behavior, we use a **relational boosting algorithm** to predict whether purchases were **made as a gift**. We show that with relational learning, we can get an **AUC of over 90%**. The generated features would have been **impossible to build by hand** or by using brute-force approaches.

This notebook is **self-contained**. You can just run it to reproduce our results. You will need **getML** in order to do so.
You can **download it for free**:

https://getml.com/product

If you want to learn more about getML, check out the **official documentation**:

https://docs.getml.com/latest/index.html

## The challenge

The **Consumer Expenditure Data Set** is a public-domain data set provided by the U.S. Bureau of Labor Statistics (https://www.bls.gov/cex/pumd.htm). It includes **diary entries**, for which American consumers are asked to keep diaries of the products they have purchased each month.

These consumer goods are categorized using a **six-digit classification system**, the **UCC**. This system is hierarchical, meaning that **every additional digit represents an increasingly granular category**.

For instance, all UCC codes beginning with **‘200’ represent beverages**. UCC codes beginning with **‘20011’ represent beer**, **‘200111’ represents ‘beer and ale’**, and ‘200112’ represents ‘nonalcoholic beer’ (https://www.bls.gov/cex/pumd/ce_pumd_interview_diary_dictionary.xlsx).
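
To make the hierarchy concrete, here is a minimal sketch in plain Python (not part of the getML API). The helper names `ucc_prefixes` and `describe` and the lookup table `KNOWN_PREFIXES` are hypothetical; the table contains only the four examples quoted above, not the full BLS dictionary.

```python
def ucc_prefixes(ucc: str) -> list:
    """Return all prefixes of a six-digit UCC, from coarsest to finest."""
    if len(ucc) != 6 or not ucc.isdigit():
        raise ValueError(f"expected a six-digit code, got {ucc!r}")
    return [ucc[:n] for n in range(1, 7)]


# Hypothetical lookup table holding only the examples quoted in the text.
KNOWN_PREFIXES = {
    "200": "beverages",
    "20011": "beer",
    "200111": "beer and ale",
    "200112": "nonalcoholic beer",
}


def describe(ucc: str) -> list:
    """List the known category labels along the path from coarse to fine."""
    return [KNOWN_PREFIXES[p] for p in ucc_prefixes(ucc) if p in KNOWN_PREFIXES]


print(ucc_prefixes("200111"))
print(describe("200112"))
```

Because every prefix is itself a category, two codes that share a long prefix describe closely related products.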

The diaries also contain a flag that indicates whether the product was purchased as a gift. The challenge is to predict that flag using other information in the diary entries.

This can be done based on the following considerations:

1. Some items are **less likely to be purchased as gifts** than others (for instance, it is unlikely that toilet paper is ever purchased as a gift).

2. Items that diverge from the **usual consumption patterns** are more likely to be gifts.

In total, there are three tables which we find interesting:

1. EXPD, which contains information on the **consumer expenditures**, including the target variable GIFT.

2. FMLD, which contains socio-demographic information on the **households**.

3. MEMD, which contains socio-demographic information on each **member of the households**.


In [1]:
import datetime
import os
from urllib import request
import time
import zipfile

from getml import data 
from getml import engine 
from getml.models import loss_functions
from getml import models 
from getml.data import placeholder
from getml.data import roles 
from getml import predictors

## Getting the data

First, we need to **download and unzip** the data. For your convenience, we have scripted this.

If you want to download them manually, you can use this link: https://www.bls.gov/cex/pumd.htm.

We will use the diary CSV files for the year 2015. There are more years you can use, but we will limit ourselves to this one year for now.

In [20]:
fname = "diary15.zip"

if not os.path.exists(fname):
    fname, res = request.urlretrieve(
        "https://www.bls.gov/cex/pumd/data/comma/diary15.zip", 
        "diary15.zip"
    )

RAW_DATA_FOLDER = "diary15/"

if not os.path.exists(RAW_DATA_FOLDER):
    with zipfile.ZipFile(fname, 'r') as dzip: 
        dzip.extractall() 

We now set the project.

In getML, every data frame and model is **tied to a project**. If you change the project, then the memory is flushed and all unsaved changes are lost (but don't worry, models are saved automatically).

In [4]:
engine.set_project("consumer-expenditure-notebook")

Creating new project 'consumer-expenditure-notebook'


## Loading the data

We load the data directly into getML data frames. There are other ways to do this, such as using pandas or loading the data into a database first, but we will use this approach.

In [5]:
# -----------------------------------------------------------------------------
# Load EXPD

expd_fnames = [
    RAW_DATA_FOLDER + "expd151.csv",
    RAW_DATA_FOLDER + "expd152.csv",
    RAW_DATA_FOLDER + "expd153.csv",
    RAW_DATA_FOLDER + "expd154.csv"
]

# The sniffer will interpret NEWID and UCC
# as numeric columns. But we want them
# to be treated as strings.
expd_roles = {"unused_string": ["UCC", "NEWID"]}

df_expd = data.DataFrame.from_csv(
    fnames=expd_fnames,
    name="EXPD",
    roles=expd_roles
)

# -----------------------------------------------------------------------------
# Load FMLD

fmld_fnames = [
    RAW_DATA_FOLDER + "fmld151.csv",
    RAW_DATA_FOLDER + "fmld152.csv",
    RAW_DATA_FOLDER + "fmld153.csv",
    RAW_DATA_FOLDER + "fmld154.csv"
]

# The sniffer will interpret NEWID
# as a numeric column. But we want it
# to be treated as a string.
fmld_roles = {"unused_string": ["NEWID"]}

df_fmld = data.DataFrame.from_csv(
    fnames=fmld_fnames,
    name="FMLD",
    roles=fmld_roles
)

# -----------------------------------------------------------------------------
# Load MEMD

memd_fnames = [
    RAW_DATA_FOLDER + "memd151.csv",
    RAW_DATA_FOLDER + "memd152.csv",
    RAW_DATA_FOLDER + "memd153.csv",
    RAW_DATA_FOLDER + "memd154.csv"
]

# The sniffer will interpret NEWID
# as a numeric column. But we want it
# to be treated as a string.
memd_roles = {"unused_string": ["NEWID"]}

df_memd = data.DataFrame.from_csv(
    fnames=memd_fnames,
    name="MEMD",
    roles=memd_roles
)

When we check out the data frames view in the **getML monitor**, we should now already be able to see the data frames we have just loaded.

![alt text](data-frames-view.png "data frames view")

## Data exploration

The first thing we want to do is to define the target.

Strangely enough, the "GIFT" column we want to predict is encoded as 1 for gift and 2 for no gift. We want to turn that into a binary column.

In [6]:
target = (df_expd["GIFT"] == 1)

df_expd.add(target, "TARGET", roles.target)

We will also turn the "EXPNMO" column into a numerical value (the CSV sniffer has interpreted it as a string, because of missing values).

In [7]:
df_expd.set_role("EXPNMO", roles.numerical)

Now it's time for some data exploration. In the getML monitor, we click on "EXPD" in the "Data Frames" table. We then click on the "UCC" header or the little magnifying glass below. We have now reached the column view for UCC.

In the "Settings" card on the bottom right, we can choose to plot the UCC against the TARGET.

![alt text](ucc-vs-target.png "UCC vs TARGET")

These two plots tell us two things:

The "Frequency" plot tells us that the UCC codes are **not evenly distributed**. Some categories of items are purchased far more frequently than others.

The "UCC vs. TARGET" plot tells us that some categories of items are **far more likely to be purchased as gifts**. The share of gifts ranges from 60% for some items to 0% for others. In other words, knowing the item's UCC will already get us pretty far towards predicting whether it has been purchased as a gift.

Let's look at something else, which is sort of fun:

![alt text](expnmo-vs-target.png "EXPNMO vs TARGET")

EXPNMO is the month in which the expenditure was made. These two plots tell us two things:

The "Frequency" plot tells us that the number of expenditures is **pretty evenly distributed** over the year.

The "EXPNMO vs. TARGET" plot tells us that **purchases made in December are far more likely to be gifts** (almost 5% for December, roughly 2% for all other months).
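
The quantity behind such a plot is simply the share of gifts within each month. The following plain-Python sketch shows the calculation; the `records` list is toy data invented for illustration and `gift_rate_by_month` is a hypothetical helper, not a getML function.

```python
from collections import defaultdict

# Toy (month, is_gift) records, invented purely for illustration.
records = [(1, False), (1, False), (12, True), (12, False), (6, False), (12, True)]


def gift_rate_by_month(rows):
    """Return {month: share of rows flagged as gifts}, sorted by month."""
    counts = defaultdict(lambda: [0, 0])  # month -> [number of gifts, number of rows]
    for month, is_gift in rows:
        counts[month][0] += int(is_gift)
        counts[month][1] += 1
    return {month: gifts / total for month, (gifts, total) in sorted(counts.items())}


rates = gift_rate_by_month(records)
print(rates)
```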

## Annotating the columns

We now want to annotate the data in EXPD. We have already done so for "EXPNMO" and our target variable, but we want to do the same for other columns as well.

Specifically, we will assign **roles** and **units** to the columns. To learn more about roles and units, check out the **documentation**:

https://docs.getml.com/latest/user_guide/annotating_data/annotating_data.html

### EXPD

In [8]:
# -----------------------------------------------------------------------------
# Make EXPNYR and COST numerical columns

df_expd.set_role(["EXPNYR", "COST"], roles.numerical)

df_expd.set_unit(["EXPNMO"], "month")
df_expd.set_unit(["COST"], "cost")

# -----------------------------------------------------------------------------
# Make newid a join key.

df_expd.set_role("NEWID", roles.join_key)

# -----------------------------------------------------------------------------
# Remove all entries for which EXPNYR or EXPNMO are nan.

expnyr = df_expd["EXPNYR"]
expnmo = df_expd["EXPNMO"]

not_nan = (expnyr.is_nan() | expnmo.is_nan()).is_false()

df_expd = df_expd.where("EXPD", not_nan)

# -----------------------------------------------------------------------------
# Generate time stamps.

expnyr = df_expd["EXPNYR"]
expnmo = df_expd["EXPNMO"]

ts = (expnyr.as_str() + "/" + expnmo.as_str()).as_ts(["%Y/%n"])

df_expd.add(ts, "TIME_STAMP", roles.time_stamp)
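
For readers who want to see the same two steps outside of getML, here is a minimal standard-library sketch: drop rows with a missing year or month, then combine the two into a time stamp. The helper `to_time_stamp` and the toy `rows` are hypothetical. Note that Python's `strptime` uses `%m` for the month; we assume the `%n` in the cell above follows getML's own format syntax.

```python
import math
from datetime import datetime


def to_time_stamp(year, month):
    """Combine a numeric year and month into a datetime, or None if either is NaN."""
    if math.isnan(year) or math.isnan(month):
        return None
    # strptime accepts months with or without a leading zero for %m.
    return datetime.strptime(f"{int(year)}/{int(month)}", "%Y/%m")


# Toy (EXPNYR, EXPNMO) pairs, invented for illustration.
rows = [(2015.0, 3.0), (2015.0, float("nan")), (2015.0, 12.0)]

stamps = [to_time_stamp(year, month) for year, month in rows]
valid_stamps = [ts for ts in stamps if ts is not None]
print(valid_stamps)
```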


The next part requires some more explanation: As we mentioned earlier, the UCC is based on a **hierarchical system of classification**.

We make use of that fact and create a bunch of substrings. **UCC1** contains only the **first digit**, **UCC2** the **first two digits** and so on. This will also create units with the same name as the column name:

In [9]:
ucc = df_expd["UCC"]

for i in range(5):
    substr = ucc.substr(0, i+1)
    df_expd.add(
            substr, 
            name="UCC" + str(i+1),
            role=roles.categorical,
            unit="UCC" + str(i+1))

df_expd.set_role("UCC", roles.categorical)
df_expd.set_unit("UCC", "UCC")

We're done with EXPD. We can now save our work:

In [18]:
df_expd = df_expd.save()

### MEMD

Next, we annotate the columns in MEMD. MEMD contains information on each member of the household. We just pick a couple of columns we find interesting and assign them the role categorical or numerical.

Also, we need to tell getML that NEWID is our join key.

In [21]:
df_memd.set_role([
    "MARITAL",
    "SEX",
    "EMPLTYPE",
    "OCCULIST",
    "WHYNOWRK",
    "EDUCA",
    "MEDICARE",
    "PAYPERD",
    "RC_WHITE",
    "RC_BLACK",
    "RC_ASIAN",
    "RC_OTHER",
    "WKSTATUS"], roles.categorical)

df_memd.set_role(["AGE", "WAGEX"], roles.numerical)

df_memd.set_role("NEWID", roles.join_key)

df_memd = df_memd.save()


### POPULATION

Our next step is to create the POPULATION table. The POPULATION table defines the statistical population (hence the name) and contains our target variable.

We want to predict whether an expenditure was purchased as a gift, so EXPD is a good starting point. However, there is also the FMLD table. FMLD contains demographic information on the household as a whole. Because each household appears only once in FMLD, EXPD and FMLD are in a **many-to-one relationship**. We can therefore **directly join FMLD onto EXPD** and do not have to extract any features from FMLD.

To learn more about joining, check out the **official documentation**:

https://docs.getml.com/latest/api/getml.data.DataFrame.html#getml.data.DataFrame.join
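
The idea of a many-to-one join can be illustrated without getML. In the sketch below, each expenditure row picks up the columns of the single household record that shares its NEWID. All values are invented, and `left_join` is a hypothetical helper that only mimics the concept, not the behavior of `DataFrame.join` in detail.

```python
# Toy data; every value below is invented for illustration.
expenditures = [
    {"NEWID": "A", "UCC": "200111"},
    {"NEWID": "A", "UCC": "010210"},
    {"NEWID": "B", "UCC": "200112"},
    {"NEWID": "C", "UCC": "320110"},
]

# One record per household, keyed by NEWID (the "one" side of the relationship).
households = {
    "A": {"INC_RANK": 0.8},
    "B": {"INC_RANK": 0.3},
}


def left_join(rows, lookup, key, columns):
    """Copy the requested columns from the single matching record onto each row."""
    joined = []
    for row in rows:
        match = lookup.get(row[key], {})
        extra = {col: match.get(col) for col in columns}  # None if there is no match
        joined.append({**row, **extra})
    return joined


joined = left_join(expenditures, households, key="NEWID", columns=["INC_RANK"])
for row in joined:
    print(row)
```

Because there is at most one household record per NEWID, the join never multiplies rows, which is why no aggregation (and hence no feature extraction) is needed for FMLD.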

In [19]:
# -----------------------------------------------------------------------------
# Split EXPD into training, validation and testing sets

random = df_expd.random()

df_population_training = df_expd.where("POPULATION_TRAINING", random <= 0.7)

df_population_validation = df_expd.where("POPULATION_VALIDATION", (random <= 0.85) & (random > 0.7))

df_population_testing = df_expd.where("POPULATION_TESTING", random > 0.85)

# -----------------------------------------------------------------------------------------------
# NEWID in FMLD is unique - therefore, we can just LEFT JOIN it onto the POPULATION tables. 

income_ranks = [
    "INC_RANK",
    "INC_RNK1",
    "INC_RNK2",
    "INC_RNK3",
    "INC_RNK4",
    "INC_RNK5",
    "INC_RNKM"
]

df_fmld.set_role(income_ranks, roles.numerical)

for inc in income_ranks:
    df_fmld.set_unit(inc, inc)

df_fmld.set_role("NEWID", roles.join_key)

df_population_training = df_population_training.join(
        name="POPULATION_TRAINING", 
        other=df_fmld, 
        join_key="NEWID",
        other_cols=[
            df_fmld["INC_RANK"],
            df_fmld["INC_RNK1"],
            df_fmld["INC_RNK2"],
            df_fmld["INC_RNK3"],
            df_fmld["INC_RNK4"],
            df_fmld["INC_RNK5"],
            df_fmld["INC_RNKM"]
        ]
)

df_population_validation = df_population_validation.join(
        name="POPULATION_VALIDATION", 
        other=df_fmld, 
        join_key="NEWID",
        other_cols=[
            df_fmld["INC_RANK"],
            df_fmld["INC_RNK1"],
            df_fmld["INC_RNK2"],
            df_fmld["INC_RNK3"],
            df_fmld["INC_RNK4"],
            df_fmld["INC_RNK5"],
            df_fmld["INC_RNKM"]
        ]
)

df_population_testing = df_population_testing.join(
        name="POPULATION_TESTING", 
        other=df_fmld, 
        join_key="NEWID",
        other_cols=[
            df_fmld["INC_RANK"],
            df_fmld["INC_RNK1"],
            df_fmld["INC_RNK2"],
            df_fmld["INC_RNK3"],
            df_fmld["INC_RNK4"],
            df_fmld["INC_RNK5"],
            df_fmld["INC_RNKM"]
        ]
)

# -----------------------------------------------------------------------------------------------

df_population_training = df_population_training.save()

df_population_validation = df_population_validation.save()

df_population_testing = df_population_testing.save()
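
The three-way split used in this cell relies on a single random draw per row, which is then compared against two thresholds. This guarantees that every row lands in exactly one set. A minimal plain-Python sketch of that logic, with a hypothetical helper `assign_split` and a fixed seed chosen only for reproducibility:

```python
import random


def assign_split(u):
    """Map one uniform draw in [0, 1) to a split, using the 0.7 / 0.85 thresholds."""
    if u <= 0.7:
        return "training"
    if u <= 0.85:
        return "validation"
    return "testing"


rng = random.Random(0)  # seed is arbitrary
draws = [rng.random() for _ in range(10_000)]
splits = [assign_split(u) for u in draws]

names = ("training", "validation", "testing")
shares = {name: splits.count(name) / len(splits) for name in names}
print(shares)
```

With enough rows, the shares come out at roughly 70%, 15% and 15%.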


## Extracting the features

Enough with the data preparation. Let's get to the fun part: Extracting the features using **relational boosting**.

### Defining the data model

First, we define the data model.

What we want to do is the following: 

1. We want to compare every expenditure made to all **previous expenditures by the same household** (EXPD).

2. We want to aggregate all available information on the **individual members of the household** (MEMD).

![alt text](data-model.png "data model")
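
To illustrate what the first of these two joins means, here is a plain-Python sketch of matching an expenditure to the expenditures of the same household that are not later in time. The toy rows and the helper `previous_expenditures` are hypothetical. Whether the comparison is strict and whether a row matches itself are assumptions made for this sketch; the getML documentation is the authority on the exact condition.

```python
from datetime import datetime

# Toy expenditure rows: (NEWID, TIME_STAMP, UCC). All values are invented.
expd = [
    ("A", datetime(2015, 1, 1), "010210"),
    ("A", datetime(2015, 2, 1), "200111"),
    ("A", datetime(2015, 3, 1), "320110"),
    ("B", datetime(2015, 1, 1), "200112"),
]


def previous_expenditures(target, rows):
    """Rows of the same household whose time stamp is not later than the target's."""
    newid, time_stamp, _ = target
    return [row for row in rows if row[0] == newid and row[1] <= time_stamp]


# Household A's February purchase sees its January and February rows,
# but neither the March row nor anything from household B.
matches = previous_expenditures(expd[1], expd)
print(matches)
```

The time stamp condition is what keeps information from the future out of the features.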


In [13]:
population_placeholder = placeholder.Placeholder("POPULATION")

expd_placeholder = placeholder.Placeholder("EXPD")

memd_placeholder = placeholder.Placeholder("MEMD")

population_placeholder.join(
    expd_placeholder,
    join_key="NEWID",
    time_stamp="TIME_STAMP"
)

population_placeholder.join(
    memd_placeholder,
    join_key="NEWID"
)

### Setting the hyperparameters

We use **xgboost** as our predictor and **relboost** (short for relational boosting) to generate our features. You are free to play with the hyperparameters.

To learn more about the hyperparameters, check out the **official documentation**:

https://docs.getml.com/latest/api/getml.models.RelboostModel.html#getml.models.RelboostModel

https://docs.getml.com/latest/api/getml.predictors.XGBoostClassifier.html#getml.predictors.XGBoostClassifier

In [14]:
predictor = predictors.XGBoostClassifier(
    booster="gbtree",
    n_estimators=100,
    n_jobs=6,
    max_depth=7,
    reg_lambda=0.0
)

model = models.RelboostModel(
    population=population_placeholder,
    peripheral=[expd_placeholder, memd_placeholder],
    loss_function=loss_functions.CrossEntropyLoss(),
    shrinkage=0.1,
    min_num_samples=200,
    num_features=20,
    predictor=predictor,
    include_categorical=True
)

### Fitting the model

OK, let's go:

In [15]:
model = model.fit(
    population_table=df_population_training,
    peripheral_tables=[df_expd, df_memd]
)

Loaded data. Features are now being trained...
Trained model.
Time taken: 0h:7m:52.608328



### Scoring the model

We want to know how well we did. We will do an in-sample and an out-of-sample evaluation:

In [16]:
scores = model.score(
    population_table=df_population_training,
    peripheral_tables=[df_expd, df_memd]
)

print("In-sample:")
print(scores)
print()

scores = model.score(
    population_table=df_population_validation,
    peripheral_tables=[df_expd, df_memd]
)

print("Out-of-sample:")
print(scores)
print()

In-sample:
{'accuracy': [0.9853041903498234], 'auc': [0.9614690327986333], 'cross_entropy': [0.053243525274554004]}

Out-of-sample:
{'accuracy': [0.9810605427974948], 'auc': [0.9093526416770293], 'cross_entropy': [0.06733930534674173]}



There you go. We just got an **out-of-sample AUC of almost 91%**.
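
For readers unfamiliar with the metric: the AUC can be read as the probability that a randomly chosen gift receives a higher predicted score than a randomly chosen non-gift. Since only a small share of purchases are gifts, this is more informative here than accuracy. The sketch below computes it by brute force on toy labels and scores invented for illustration; it is not how getML computes its scores.

```python
def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 1 = gift, 0 = no gift.
labels = [0, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8]

print(auc(labels, scores))
```

The pairwise loop is quadratic in the number of rows, so it is only suitable for small examples like this one.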

## Studying the features

It is very important that we get an idea about the features that the relational boosting algorithm has produced.

We can look at them in the getML monitor by going to the "Models" view, clicking on the name of the model in the "Models" table and then clicking on the name of the feature in the "Features" table.

We cannot look at all of the features, so we will just check out the most important one.

![alt text](feature_2.png "feature_2")

FEATURE_2 has a correlation of just over 20.5% with the targets and accounts for about 16% of the overall predictive power.

Here is the SQL code for this feature:

```sql
CREATE TABLE FEATURE_2 AS
SELECT AVG( 
    CASE
        WHEN ( t1.UCC3 IN ( '010', '060', '090', '110', '120', '130', '140', '160', '170', '180', '190', '290', '330', '550', '620', '640', '650', '020', '030', '080', '100', '150', '280', '470', '630', '260', '270', '040', '220', '560', '070', '050', '240', '310', '350', '400', '500', '200', '380', '250', '520', '490', '540', '690', '580', '480', '670', '002', '680', '590', '210', '009', '530', '230', '440', '570', '999' ) ) AND ( t1.UCC IN ( '010210', '060110', '060210', '090110', '090210', '110410', '110510', '120110', '120310', '120410', '130212', '130310', '130320', '140110', '160211', '170110', '170210', '180310', '180320', '180410', '190111', '190211', '550210', '620912', '650210', '020110', '020210', '020310', '020610', '020620', '020710', '020810', '030110', '030710', '080110', '100410', '100510', '110110', '110210', '170520', '170532', '180220', '180420', '180510', '180710', '280120', '330210', '190112', '190113', '330110', '470111', '630110', '190313', '260110', '270000', '040310', '120210', '190114', '190212', '190321', '270310', '560110', '070240', '140210', '190311', '190322', '260210', '040110', '070230', '100110', '140220', '330310', '030610', '050210', '160110', '160310', '170310', '170531', '470211', '020510', '140230', '350110', '500110', '030810', '040210', '040410', '040510', '050110', '050310', '060310', '160320', '170410', '180110', '170533', '190324', '250210', '270210', '310352', '520531', '550310', '150212', '180611', '180612', '280140', '490312', '690120', '030410', '070110', '150211', '170510', '400110', '190312', '200512', '200532', '670310', '670903', '002100', '590230', '180210', '380340', '480212', '110310', '270410', '240210', '620310', '130211', '140340', '200522', '470220', '250900', '580000', '009000', '190314', '210210', '130121', '190214', '490300', '010310', '530311', '560210', '240310', '630210', '310316', '200511', '220120', '230110', '440210', '140330', '690110', '130122', '670902', '030210', '330610', '380210', '380420', 
'440120', '550320', '240110', '520110', '180620', '620410', '620221', '550330', '620810', '030310', '690114', '680220', '050410', '280110', '680903', '590210', '240120', '490313', '560330', '230900', '380320', '190323', '620420', '530110', '550110', '190213', '310332', '620213', '220210', '999900', '290410', '560400', '290440', '620320', '310232', '570000', '630220', '470112' ) ) AND ( t2.UCC IN ( '110410', '110510', '120310', '130212', '140110', '170110', '180320', '190111', '190211', '330510', '360210', '550210', '620912', '640110', '010120', '020110', '020310', '020810', '020820', '030710', '100510', '110110', '140420', '170520', '180220', '180420', '180510', '280120', '320370', '190112', '190113', '630110', '190313', '260110', '270000', '120210', '190114', '190212', '190321', '270310', '370213', '610310', '070240', '140210', '190311', '260210', '040110', '070230', '140220', '330310', '010320', '030610', '050210', '160110', '170531', '470211', '240320', '400210', '500110', '640310', '660000', '030810', '040210', '040410', '040510', '050110', '050310', '060310', '170410', '180110', '370125', '600210', '640210', '170533', '190324', '270210', '310352', '320904', '340510', '370211', '550310', '340110', '030410', '070110', '130110', '320420', '400110', '480213', '190312', '200512', '200532', '340410', '390310', '390321', '390322', '420115', '640410', '670310', '590230', '210110', '180210', '480212', '610320', '640220', '110310', '030510', '320330', '620310', '130211', '140340', '200410', '320130', '380333', '400310', '200522', '470220', '360311', '360312', '360330', '250900', '190314', '620121', '190214', '370314', '490300', '010310', '240310', '630210', '620111', '200511', '220120', '230110', '340520', '440210', '590110', '140330', '130122', '390223', '670902', '030210', '330610', '380420', '380901', '340530', '440120', '004190', '550320', '620926', '340210', '610110', '380410', '310220', '320110', '180620', '620410', '620221', '320221', '550330', '370901', 
'030310', '050410', '280110', '680903', '590210', '640120', '380320', '600420', '620420', '530110', '390230', '550110', '190213', '620213', '999900', '560400', '320521', '310232', '570000', '620510', '320150', '630220', '340310' ) ) THEN -5.128687
        WHEN ( t1.UCC3 IN ( '010', '060', '090', '110', '120', '130', '140', '160', '170', '180', '190', '290', '330', '550', '620', '640', '650', '020', '030', '080', '100', '150', '280', '470', '630', '260', '270', '040', '220', '560', '070', '050', '240', '310', '350', '400', '500', '200', '380', '250', '520', '490', '540', '690', '580', '480', '670', '002', '680', '590', '210', '009', '530', '230', '440', '570', '999' ) ) AND ( t1.UCC IN ( '010210', '060110', '060210', '090110', '090210', '110410', '110510', '120110', '120310', '120410', '130212', '130310', '130320', '140110', '160211', '170110', '170210', '180310', '180320', '180410', '190111', '190211', '550210', '620912', '650210', '020110', '020210', '020310', '020610', '020620', '020710', '020810', '030110', '030710', '080110', '100410', '100510', '110110', '110210', '170520', '170532', '180220', '180420', '180510', '180710', '280120', '330210', '190112', '190113', '330110', '470111', '630110', '190313', '260110', '270000', '040310', '120210', '190114', '190212', '190321', '270310', '560110', '070240', '140210', '190311', '190322', '260210', '040110', '070230', '100110', '140220', '330310', '030610', '050210', '160110', '160310', '170310', '170531', '470211', '020510', '140230', '350110', '500110', '030810', '040210', '040410', '040510', '050110', '050310', '060310', '160320', '170410', '180110', '170533', '190324', '250210', '270210', '310352', '520531', '550310', '150212', '180611', '180612', '280140', '490312', '690120', '030410', '070110', '150211', '170510', '400110', '190312', '200512', '200532', '670310', '670903', '002100', '590230', '180210', '380340', '480212', '110310', '270410', '240210', '620310', '130211', '140340', '200522', '470220', '250900', '580000', '009000', '190314', '210210', '130121', '190214', '490300', '010310', '530311', '560210', '240310', '630210', '310316', '200511', '220120', '230110', '440210', '140330', '690110', '130122', '670902', '030210', '330610', '380210', '380420', 
'440120', '550320', '240110', '520110', '180620', '620410', '620221', '550330', '620810', '030310', '690114', '680220', '050410', '280110', '680903', '590210', '240120', '490313', '560330', '230900', '380320', '190323', '620420', '530110', '550110', '190213', '310332', '620213', '220210', '999900', '290410', '560400', '290440', '620320', '310232', '570000', '630220', '470112' ) ) AND ( t2.UCC NOT IN ( '110410', '110510', '120310', '130212', '140110', '170110', '180320', '190111', '190211', '330510', '360210', '550210', '620912', '640110', '010120', '020110', '020310', '020810', '020820', '030710', '100510', '110110', '140420', '170520', '180220', '180420', '180510', '280120', '320370', '190112', '190113', '630110', '190313', '260110', '270000', '120210', '190114', '190212', '190321', '270310', '370213', '610310', '070240', '140210', '190311', '260210', '040110', '070230', '140220', '330310', '010320', '030610', '050210', '160110', '170531', '470211', '240320', '400210', '500110', '640310', '660000', '030810', '040210', '040410', '040510', '050110', '050310', '060310', '170410', '180110', '370125', '600210', '640210', '170533', '190324', '270210', '310352', '320904', '340510', '370211', '550310', '340110', '030410', '070110', '130110', '320420', '400110', '480213', '190312', '200512', '200532', '340410', '390310', '390321', '390322', '420115', '640410', '670310', '590230', '210110', '180210', '480212', '610320', '640220', '110310', '030510', '320330', '620310', '130211', '140340', '200410', '320130', '380333', '400310', '200522', '470220', '360311', '360312', '360330', '250900', '190314', '620121', '190214', '370314', '490300', '010310', '240310', '630210', '620111', '200511', '220120', '230110', '340520', '440210', '590110', '140330', '130122', '390223', '670902', '030210', '330610', '380420', '380901', '340530', '440120', '004190', '550320', '620926', '340210', '610110', '380410', '310220', '320110', '180620', '620410', '620221', '320221', '550330', '370901', 
'030310', '050410', '280110', '680903', '590210', '640120', '380320', '600420', '620420', '530110', '390230', '550110', '190213', '620213', '999900', '560400', '320521', '310232', '570000', '620510', '320150', '630220', '340310' ) ) THEN -2.507103
        WHEN ( t1.UCC3 IN ( '010', '060', '090', '110', '120', '130', '140', '160', '170', '180', '190', '290', '330', '550', '620', '640', '650', '020', '030', '080', '100', '150', '280', '470', '630', '260', '270', '040', '220', '560', '070', '050', '240', '310', '350', '400', '500', '200', '380', '250', '520', '490', '540', '690', '580', '480', '670', '002', '680', '590', '210', '009', '530', '230', '440', '570', '999' ) ) AND ( t1.UCC NOT IN ( '010210', '060110', '060210', '090110', '090210', '110410', '110510', '120110', '120310', '120410', '130212', '130310', '130320', '140110', '160211', '170110', '170210', '180310', '180320', '180410', '190111', '190211', '550210', '620912', '650210', '020110', '020210', '020310', '020610', '020620', '020710', '020810', '030110', '030710', '080110', '100410', '100510', '110110', '110210', '170520', '170532', '180220', '180420', '180510', '180710', '280120', '330210', '190112', '190113', '330110', '470111', '630110', '190313', '260110', '270000', '040310', '120210', '190114', '190212', '190321', '270310', '560110', '070240', '140210', '190311', '190322', '260210', '040110', '070230', '100110', '140220', '330310', '030610', '050210', '160110', '160310', '170310', '170531', '470211', '020510', '140230', '350110', '500110', '030810', '040210', '040410', '040510', '050110', '050310', '060310', '160320', '170410', '180110', '170533', '190324', '250210', '270210', '310352', '520531', '550310', '150212', '180611', '180612', '280140', '490312', '690120', '030410', '070110', '150211', '170510', '400110', '190312', '200512', '200532', '670310', '670903', '002100', '590230', '180210', '380340', '480212', '110310', '270410', '240210', '620310', '130211', '140340', '200522', '470220', '250900', '580000', '009000', '190314', '210210', '130121', '190214', '490300', '010310', '530311', '560210', '240310', '630210', '310316', '200511', '220120', '230110', '440210', '140330', '690110', '130122', '670902', '030210', '330610', '380210', 
'380420', '440120', '550320', '240110', '520110', '180620', '620410', '620221', '550330', '620810', '030310', '690114', '680220', '050410', '280110', '680903', '590210', '240120', '490313', '560330', '230900', '380320', '190323', '620420', '530110', '550110', '190213', '310332', '620213', '220210', '999900', '290410', '560400', '290440', '620320', '310232', '570000', '630220', '470112' ) ) AND ( t2.UCC IN ( '010210', '060110', '060210', '090110', '090210', '110510', '120110', '120310', '130212', '140110', '170110', '170210', '180310', '180410', '330510', '620912', '650210', '010120', '020110', '020310', '020610', '020620', '020710', '020810', '020820', '030110', '030710', '080110', '100210', '100410', '110110', '110210', '140420', '170520', '180220', '180420', '180510', '180710', '280120', '330210', '190113', '330110', '470111', '630110', '190313', '260110', '270000', '040310', '120210', '190114', '190212', '190321', '270310', '610310', '070240', '140210', '040110', '070230', '100110', '140220', '330310', '010320', '020410', '050210', '160310', '170310', '020510', '140230', '240320', '320905', '500110', '640310', '660000', '040210', '040410', '050110', '050310', '060310', '160320', '170410', '180110', '200111', '170533', '190324', '270210', '520531', '340110', '160212', '490312', '540000', '320903', '150211', '170510', '480213', '490000', '420115', '210110', '610320', '640220', '010110', '110310', '130211', '140340', '400310', '200522', '360513', '580000', '009000', '190314', '620121', '010310', '530311', '530210', '620111', '390210', '330610', '380420', '240110', '620926', '310220', '180620' ) ) THEN -6.507490
        WHEN ( t1.UCC3 IN ( '010', '060', '090', '110', '120', '130', '140', '160', '170', '180', '190', '290', '330', '550', '620', '640', '650', '020', '030', '080', '100', '150', '280', '470', '630', '260', '270', '040', '220', '560', '070', '050', '240', '310', '350', '400', '500', '200', '380', '250', '520', '490', '540', '690', '580', '480', '670', '002', '680', '590', '210', '009', '530', '230', '440', '570', '999' ) ) AND ( t1.UCC NOT IN ( '010210', '060110', '060210', '090110', '090210', '110410', '110510', '120110', '120310', '120410', '130212', '130310', '130320', '140110', '160211', '170110', '170210', '180310', '180320', '180410', '190111', '190211', '550210', '620912', '650210', '020110', '020210', '020310', '020610', '020620', '020710', '020810', '030110', '030710', '080110', '100410', '100510', '110110', '110210', '170520', '170532', '180220', '180420', '180510', '180710', '280120', '330210', '190112', '190113', '330110', '470111', '630110', '190313', '260110', '270000', '040310', '120210', '190114', '190212', '190321', '270310', '560110', '070240', '140210', '190311', '190322', '260210', '040110', '070230', '100110', '140220', '330310', '030610', '050210', '160110', '160310', '170310', '170531', '470211', '020510', '140230', '350110', '500110', '030810', '040210', '040410', '040510', '050110', '050310', '060310', '160320', '170410', '180110', '170533', '190324', '250210', '270210', '310352', '520531', '550310', '150212', '180611', '180612', '280140', '490312', '690120', '030410', '070110', '150211', '170510', '400110', '190312', '200512', '200532', '670310', '670903', '002100', '590230', '180210', '380340', '480212', '110310', '270410', '240210', '620310', '130211', '140340', '200522', '470220', '250900', '580000', '009000', '190314', '210210', '130121', '190214', '490300', '010310', '530311', '560210', '240310', '630210', '310316', '200511', '220120', '230110', '440210', '140330', '690110', '130122', '670902', '030210', '330610', '380210', 
'380420', '440120', '550320', '240110', '520110', '180620', '620410', '620221', '550330', '620810', '030310', '690114', '680220', '050410', '280110', '680903', '590210', '240120', '490313', '560330', '230900', '380320', '190323', '620420', '530110', '550110', '190213', '310332', '620213', '220210', '999900', '290410', '560400', '290440', '620320', '310232', '570000', '630220', '470112' ) ) AND ( t2.UCC NOT IN ( '010210', '060110', '060210', '090110', '090210', '110510', '120110', '120310', '130212', '140110', '170110', '170210', '180310', '180410', '330510', '620912', '650210', '010120', '020110', '020310', '020610', '020620', '020710', '020810', '020820', '030110', '030710', '080110', '100210', '100410', '110110', '110210', '140420', '170520', '180220', '180420', '180510', '180710', '280120', '330210', '190113', '330110', '470111', '630110', '190313', '260110', '270000', '040310', '120210', '190114', '190212', '190321', '270310', '610310', '070240', '140210', '040110', '070230', '100110', '140220', '330310', '010320', '020410', '050210', '160310', '170310', '020510', '140230', '240320', '320905', '500110', '640310', '660000', '040210', '040410', '050110', '050310', '060310', '160320', '170410', '180110', '200111', '170533', '190324', '270210', '520531', '340110', '160212', '490312', '540000', '320903', '150211', '170510', '480213', '490000', '420115', '210110', '610320', '640220', '010110', '110310', '130211', '140340', '400310', '200522', '360513', '580000', '009000', '190314', '620121', '010310', '530311', '530210', '620111', '390210', '330610', '380420', '240110', '620926', '310220', '180620' ) ) THEN 4.707977
        WHEN ( t1.UCC3 NOT IN ( '010', '060', '090', '110', '120', '130', '140', '160', '170', '180', '190', '290', '330', '550', '620', '640', '650', '020', '030', '080', '100', '150', '280', '470', '630', '260', '270', '040', '220', '560', '070', '050', '240', '310', '350', '400', '500', '200', '380', '250', '520', '490', '540', '690', '580', '480', '670', '002', '680', '590', '210', '009', '530', '230', '440', '570', '999' ) ) AND ( t1.UCC5 IN ( '32014', '32041', '32023', '32037', '61031', '32090', '66000', '60021', '32061', '34011', '32042', '32034', '00400', '34041', '34090', '42011', '61032', '36031', '36033', '36042', '36051', '60031', '00410', '34052', '41012', '34053', '32043', '32011', '32022', '39023', '32052', '34031' ) ) AND ( t2.UCC IN ( '060210', '090110', '120110', '170110', '170210', '190211', '330510', '550210', '640110', '020310', '030110', '080110', '150110', '180220', '180710', '190112', '470111', '630110', '270000', '610310', '190322', '330310', '020510', '320905', '640310', '180110', '640210', '610320' ) ) THEN -14.247980
        WHEN ( t1.UCC3 NOT IN ( '010', '060', '090', '110', '120', '130', '140', '160', '170', '180', '190', '290', '330', '550', '620', '640', '650', '020', '030', '080', '100', '150', '280', '470', '630', '260', '270', '040', '220', '560', '070', '050', '240', '310', '350', '400', '500', '200', '380', '250', '520', '490', '540', '690', '580', '480', '670', '002', '680', '590', '210', '009', '530', '230', '440', '570', '999' ) ) AND ( t1.UCC5 IN ( '32014', '32041', '32023', '32037', '61031', '32090', '66000', '60021', '32061', '34011', '32042', '32034', '00400', '34041', '34090', '42011', '61032', '36031', '36033', '36042', '36051', '60031', '00410', '34052', '41012', '34053', '32043', '32011', '32022', '39023', '32052', '34031' ) ) AND ( t2.UCC NOT IN ( '060210', '090110', '120110', '170110', '170210', '190211', '330510', '550210', '640110', '020310', '030110', '080110', '150110', '180220', '180710', '190112', '470111', '630110', '270000', '610310', '190322', '330310', '020510', '320905', '640310', '180110', '640210', '610320' ) ) THEN 2.172690
        WHEN ( t1.UCC3 NOT IN ( '010', '060', '090', '110', '120', '130', '140', '160', '170', '180', '190', '290', '330', '550', '620', '640', '650', '020', '030', '080', '100', '150', '280', '470', '630', '260', '270', '040', '220', '560', '070', '050', '240', '310', '350', '400', '500', '200', '380', '250', '520', '490', '540', '690', '580', '480', '670', '002', '680', '590', '210', '009', '530', '230', '440', '570', '999' ) ) AND ( t1.UCC5 NOT IN ( '32014', '32041', '32023', '32037', '61031', '32090', '66000', '60021', '32061', '34011', '32042', '32034', '00400', '34041', '34090', '42011', '61032', '36031', '36033', '36042', '36051', '60031', '00410', '34052', '41012', '34053', '32043', '32011', '32022', '39023', '32052', '34031' ) ) AND ( t1.INC_RNK4 > 0.057071 ) THEN 3.502799
        WHEN ( t1.UCC3 NOT IN ( '010', '060', '090', '110', '120', '130', '140', '160', '170', '180', '190', '290', '330', '550', '620', '640', '650', '020', '030', '080', '100', '150', '280', '470', '630', '260', '270', '040', '220', '560', '070', '050', '240', '310', '350', '400', '500', '200', '380', '250', '520', '490', '540', '690', '580', '480', '670', '002', '680', '590', '210', '009', '530', '230', '440', '570', '999' ) ) AND ( t1.UCC5 NOT IN ( '32014', '32041', '32023', '32037', '61031', '32090', '66000', '60021', '32061', '34011', '32042', '32034', '00400', '34041', '34090', '42011', '61032', '36031', '36033', '36042', '36051', '60031', '00410', '34052', '41012', '34053', '32043', '32011', '32022', '39023', '32052', '34031' ) ) AND ( t1.INC_RNK4 <= 0.057071 OR t1.INC_RNK4 IS NULL ) THEN 20.811932
        ELSE NULL
    END
) AS feature_2,
     t1.NEWID,
     t1.TIME_STAMP
FROM (
    SELECT *,
        ROW_NUMBER() OVER ( ORDER BY NEWID, TIME_STAMP ASC ) AS rownum
    FROM POPULATION
) t1
LEFT JOIN EXPD t2
ON t1.NEWID = t2.NEWID
WHERE t2.TIME_STAMP <= t1.TIME_STAMP
GROUP BY t1.rownum,
         t1.NEWID,
         t1.TIME_STAMP;
```


There are two things we can learn from this:

1. This feature is based mainly on the UCC codes. It looks not only at the UCC code of the product in question (marked t1.UCC), but also compares it to the UCC codes of other products that the household has purchased (marked t2.UCC). This means that both the **product itself** and the **household's usual consumption patterns** predict whether this item was purchased as a gift.

2. It should also be fairly obvious that you could never have written a feature like this, with its long lists of UCC codes and learned weights, manually or by using brute-force approaches. You need **relational learning algorithms** to produce features like this. If there is one thing you take away from this, let it be this: **Relational learning matters.**
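
To make the structure of such a feature more tangible, here is a minimal sketch in plain Python. It is not the getML algorithm and it does not reproduce `feature_2`: the toy data, the code sets `GIFT_LIKE` and `GROCERY`, the weights and the use of a sum as the aggregation are all assumptions made for illustration. What it does mirror is the shape of the SQL above: each row to be predicted (`t1`) is joined to the same household's purchases (`t2`) up to its time stamp, every joined pair receives a weight that depends on conditions on both `t1` and `t2`, and the weights are aggregated per row.

```python
from collections import defaultdict

# Hypothetical toy data; the real POPULATION and EXPD tables are much larger.
population = [
    {"NEWID": 1, "TIME_STAMP": 3, "UCC": "320140"},
    {"NEWID": 2, "TIME_STAMP": 2, "UCC": "010210"},
]
expd = [
    {"NEWID": 1, "TIME_STAMP": 1, "UCC": "010210"},
    {"NEWID": 1, "TIME_STAMP": 2, "UCC": "610310"},
    {"NEWID": 1, "TIME_STAMP": 5, "UCC": "610310"},  # after the target row, so ignored
    {"NEWID": 2, "TIME_STAMP": 1, "UCC": "010210"},
]

# Made-up code sets and weights. In the real feature, these are learned.
GIFT_LIKE = {"320140"}
GROCERY = {"010210"}


def weight(t1, t2):
    """Plays the role of the CASE WHEN block: conditions on t1 and t2 pick a weight."""
    if t1["UCC"] in GIFT_LIKE and t2["UCC"] in GROCERY:
        return 2.0
    if t1["UCC"] in GIFT_LIKE:
        return 1.0
    return -0.5


def toy_feature(population, expd):
    """Join on NEWID, keep t2.TIME_STAMP <= t1.TIME_STAMP, aggregate per t1 row."""
    by_household = defaultdict(list)
    for row in expd:
        by_household[row["NEWID"]].append(row)

    result = []
    for t1 in population:
        matches = [
            t2
            for t2 in by_household[t1["NEWID"]]
            if t2["TIME_STAMP"] <= t1["TIME_STAMP"]
        ]
        # Simplification: rows without any match get None instead of a value.
        result.append(sum(weight(t1, t2) for t2 in matches) if matches else None)
    return result


features = toy_feature(population, expd)
print(features)
```

Even in this toy version, the value of the feature depends on the interaction between the product itself and what else the household has bought. The hard part, which the relational learning algorithm takes care of, is finding the code sets, thresholds and weights that make the feature predictive.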

# Conclusion

In this notebook, we have shown how you can use relational learning to predict whether items were purchased as a gift. We did this to highlight the **importance of relational learning**. Relational learning can be used in many real-world data science applications, but unfortunately most data scientists don't even know what relational learning is.

If you want to learn more about getML specifically, check out the **official documentation**:

https://docs.getml.com/latest/index.html

You can also **download it for free**:

https://getml.com/product
