# KumoRFM Quickstart

[[**Blog**](https://kumo.ai/company/news/kumo-relational-foundation-model/) | [**Paper**](https://kumo.ai/research/kumo_relational_foundation_model.pdf)]

**KumoRFM (Kumo Relational Foundation Model)** is a foundation model for machine learning on enterprise data. With just your data and a few lines of code, you can generate accurate predictions in real time, with no model training or pipelines required.

## Introduction

KumoRFM is grounded in three key world views:

> **1. Enterprise data is a graph.**

Enterprise data is a graph where tables are connected by keys.   
Below is an example database in which the `ITEMS` and `ORDERS` tables are linked by `item_id`.

<div align="center">
  <img src="https://kumo-sdk-public.s3.us-west-2.amazonaws.com/rfm-colabs/relational-database.png" width="500" />
</div>

Once we structure enterprise data as a graph, we can apply pre-trained [*Relational Graph Transformers*](https://kumo.ai/research/relational-graph-transformers/) to extract insights and patterns.
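As a rough illustration of "tables connected by keys", the sketch below builds two tiny tables in pandas and follows the `item_id` key from each order to its item. It does not use KumoRFM at all, and every value in it is made up for the example.

```python
import pandas as pd

# Toy stand-ins for the ITEMS and ORDERS tables (values are made up).
items = pd.DataFrame({
    "item_id": [1, 2, 3],
    "category": ["shoes", "hats", "bags"],
})
orders = pd.DataFrame({
    "order_id": [100, 101, 102, 103],
    "item_id": [1, 1, 3, 2],
    "price": [59.0, 59.0, 120.0, 15.0],
})

# Each order row points at exactly one item row through `item_id`.
# These (order, item) pairs are the edges of the graph.
edges = orders.merge(items, on="item_id", how="left", validate="many_to_one")

# Number of order nodes attached to each item node.
orders_per_item = edges.groupby("item_id")["order_id"].count()
print(orders_per_item.to_dict())
```

The `validate="many_to_one"` argument makes pandas raise an error if `item_id` is not unique in `items`, which is the property a key needs for this linking to be unambiguous.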

> **2. With timestamps, we place events on a timeline.**

By placing events on a timeline, we unlock the ability to model how things evolve over time.
This makes it possible to select any point in time and predict what is likely to happen next, based on the sequence and patterns in historical data.

<div align="center">
  <img src="https://kumo-sdk-public.s3.us-west-2.amazonaws.com/rfm-colabs/timeline.png" width="300" />
</div>
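
To make "select any point in time" concrete, here is a minimal pandas sketch that splits a toy event log at a chosen anchor. The data is invented, and treating events exactly at the anchor as history is a boundary choice made for this example only.

```python
import pandas as pd

# Toy event log (values are made up).
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "date": pd.to_datetime([
        "2024-08-01", "2024-08-20", "2024-09-05", "2024-09-25", "2024-10-10",
    ]),
})

# Pick a point on the timeline.
anchor = pd.Timestamp("2024-09-10")

# Events up to the anchor are history that may inform a prediction
# (assumption: events exactly at the anchor count as history).
history = orders[orders["date"] <= anchor]

# Events after the anchor are the future that the prediction is about.
future = orders[orders["date"] > anchor]

print(len(history), len(future))
```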

> **3. Machine learning tasks can be described via predictive queries.**

All major machine learning tasks—regression, classification, recommendation—can be defined using a *Predictive Query language (PQL)*.

<div align="center">
  <img src="https://kumo-sdk-public.s3.us-west-2.amazonaws.com/rfm-colabs/predictive-query-multiple.png" width="600" />
</div>

If you know SQL, PQL will feel familiar right away.
Learn more in the [PQL documentation](https://kumo.ai/docs/pquery-structure).



**Let's get started!**



## Step 1. Install the Kumo Python SDK

KumoRFM provides an [SDK](https://kumo-ai.github.io/kumo-sdk/docs/get_started/rfm/index.html) in Python.
The Kumo SDK is available for Python 3.9 to Python 3.13.

In [None]:
!pip install kumoai

In [None]:
import kumoai.experimental.rfm as rfm

**Note:** The API of `kumoai.experimental.rfm` may change in the near future.

## Step 2. Get an API key

You will need an API key to make calls to KumoRFM.
Use the widget below to generate one for free by clicking "Generate API Key".
If you don't have a KumoRFM account, the widget will prompt you to sign up.

You will see the following when your key has been created successfully:

<div align="left">
  <img src="https://kumo-sdk-public.s3.us-west-2.amazonaws.com/rfm-colabs/api-key-created.png" width="300" />
</div>

In [None]:
import os

if not os.environ.get("KUMO_API_KEY"):
    rfm.authenticate()

## Step 3. Initialize a client

If you completed Step 2 via the widget, you don't need to change anything: `KUMO_API_KEY` is already set as an environment variable.

If you obtained your API key from the website instead, set `KUMO_API_KEY` manually below:


In [None]:
# Initialize a Kumo client with your API key:
KUMO_API_KEY = os.environ.get("KUMO_API_KEY")
# print(KUMO_API_KEY)

In [None]:
rfm.init(api_key=KUMO_API_KEY)

In [None]:
rfm
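
If you would rather fail early with a clear message when no key is available, a small helper along these lines may be useful. The function `resolve_api_key` is hypothetical and not part of the Kumo SDK; it only wraps the environment lookup shown above.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(explicit_key: Optional[str] = None,
                    env: Mapping[str, str] = os.environ) -> str:
    """Return the explicitly passed key, else the one in KUMO_API_KEY.

    Hypothetical helper, not part of the Kumo SDK.
    """
    key = explicit_key or env.get("KUMO_API_KEY")
    if not key:
        raise RuntimeError(
            "No API key found: pass one explicitly or set KUMO_API_KEY."
        )
    return key
```

It could then be used as `rfm.init(api_key=resolve_api_key())`, so that a missing key surfaces before any request is made.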

## Step 4. Import your data

KumoRFM interacts with a set of `pd.DataFrame` objects:

In [None]:
import pandas as pd

root = 's3://kumo-sdk-public/rfm-datasets/online-shopping'
users_df = pd.read_parquet(f'{root}/users.parquet')
items_df = pd.read_parquet(f'{root}/items.parquet')
orders_df = pd.read_parquet(f'{root}/orders.parquet')
# NOTE You can use `pd.read_csv(...)` to read CSV files instead.
# You don't need to use s3 to import your data.
# For Colab users, you can upload your data to the file folder (the folder icon on the left panel), and directly import from there.
# You can also download this notebook, and import your data locally.
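
For CSV input, the sketch below shows one way to have pandas parse the timestamp column while reading. An in-memory string stands in for a file on disk, and its contents are made up; the column names simply mirror the `orders` table used in this notebook.

```python
import io

import pandas as pd

# Stand-in for a local CSV file (contents are made up).
csv_text = """order_id,user_id,item_id,date,price
1,42,7,2024-09-01,19.99
2,42,9,2024-09-03,5.00
3,123,7,2024-09-04,19.99
"""

# `parse_dates` converts the `date` column to datetime64 while reading,
# so it does not arrive as plain strings.
orders_csv_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(orders_csv_df.dtypes)
```

With a real file, replace `io.StringIO(csv_text)` with the path to your CSV.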

We can inspect the data and its types in-memory:

In [None]:
# Inspect a `pandas.DataFrame`:
users_df.head(4)

In [None]:
# Inspect the data types of a `pandas.DataFrame`:
users_df.dtypes

In [None]:
# Optional: Change the data type of columns (if necessary):
users_df['user_id'] = users_df['user_id'].astype(int)

## Step 5. Create KumoRFM tables

An `rfm.LocalTable` is a lightweight wrapper around a `pandas.DataFrame` that attaches the metadata KumoRFM needs.

A `LocalTable` defines three critical things about the table:

1. **Semantic types** define how column data will be encoded. For instance, an integer column can be encoded either as `"numerical"` or `"categorical"`, depending on the actual meaning of the data (see reference at the end of this section for more details).

2. A **primary key** uniquely identifies each row in a table (*e.g.*, `user_id` is the primary key in the `users` table). It serves two purposes: (1) when creating a graph, it is the reference point that other tables link to, and (2) when making predictions, it identifies the entity to generate predictions for. For instance, if you want to predict user outcomes, you'll need a table with `user_id` as the primary key.

3. A **time column** defines when the event happened (*i.e.* a timestamp column). Time columns help establish the sequence of events. KumoRFM automatically draws from these time slices to inform future predictions.
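
Before wrapping a DataFrame, it can help to look at these three properties directly in pandas. The sketch below is only a manual sanity check on made-up data; it is not how KumoRFM infers metadata, and the `plan_code` column is a hypothetical example of an integer that is really a label.

```python
import pandas as pd

# Toy `users` and `orders` tables (values are made up).
users_df = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "age": [31, 45, 31, 28],
    "plan_code": [0, 1, 1, 0],  # hypothetical integer-coded label
})
orders_df = pd.DataFrame({
    "order_id": [10, 11, 12],
    "user_id": [1, 1, 3],
    "date": pd.to_datetime(["2024-09-01", "2024-09-03", "2024-09-04"]),
})

# 1. Semantic type: very few distinct integer values hints at a label
#    rather than a quantity, but the column's meaning decides in the end.
distinct_counts = users_df[["age", "plan_code"]].nunique().to_dict()

# 2. Primary key: every row needs its own, non-missing value.
pk_ok = bool(
    users_df["user_id"].is_unique and users_df["user_id"].notna().all()
)

# 3. Time column: the span of events the table covers.
time_span = (orders_df["date"].min(), orders_df["date"].max())

print(distinct_counts, pk_ok, time_span)
```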

Don't worry if it doesn't all click right away—it'll become clearer once you reach the prediction section. For now, go ahead and follow along with the notebook.

KumoRFM infers most metadata correctly on its own.
It is still worth inspecting the inferred metadata to catch mistakes before they affect predictions:

In [None]:
users = rfm.LocalTable(users_df, name="users").infer_metadata()
orders = rfm.LocalTable(orders_df, name="orders").infer_metadata()
items = rfm.LocalTable(items_df, name="items").infer_metadata()

If you prefer more explicit control, you can manually assign metadata during table creation instead of relying on automatic inference:

In [None]:
orders = rfm.LocalTable(
    orders_df,
    name="orders",
    primary_key="order_id",
    time_column="date"
)

You can inspect the metadata of the table ...

In [None]:
users.print_metadata()

... and apply any required changes manually:

In [None]:
# Update the semantic type (stype) of columns:
users['user_id'].stype = "ID"
users['age'].stype = "numerical"

# Set primary key:
users.primary_key = "user_id"

# Set time column:
orders.time_column = "date"

**Quick Reference:**

1. **`stype` (semantic type)**:
   - An `stype` determines how the column will be encoded downstream.
   - Correctly setting each column's stype is critical for model performance. For instance, if you want to perform missing value imputation, the semantic type will determine whether it is treated as a regression task (`stype="numerical"`) or a classification task (`stype="categorical"`).

| Type | Explanation | Example |
|------|-------------|---------|
| `"numerical"` | Numerical values (*e.g.*, `price`, `age`) | `25`, `3.14`, `-10` |
| `"categorical"` | Discrete categories with limited cardinality | Color: `"red"`, `"blue"`, `"green"` (one cell may only have one category) |
| `"multicategorical"` | Multiple categories in a single cell | `"Action\|Drama\|Comedy"`, `"Action\|Thriller"` |
| `"ID"` | An identifier, *e.g.*, primary keys or foreign keys | `user_id: 123`, `product_id: PRD-8729453` |
| `"text"` | Natural language text | Descriptions |
| `"timestamp"` | Specific point in time | `"2025-07-11"`, `"2023-02-12 09:47:58"` |
| `"sequence"` | Custom embeddings or sequential data  | `[0.25, -0.75, 0.50, ...]` |

2. **`primary_key`**:
   - Indicates the column that will be used as primary key to link other tables to it. For instance, `user_id` should be the primary key for table `users`.
   - If there are duplicated primary keys, the system will only keep the first one.
   - `primary_key` can only be assigned to columns holding integers, floating point values or strings.
   - Each table can have at most one `primary_key` column.

3. **`time_column`**:
   - Indicates the timestamp column that records when the event occurred.
   - Time column data must be parseable by `pandas.to_datetime`.
   - Each table can have at most one `time_column` column.
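
The constraints in this reference can be checked, and where necessary enforced, in pandas before a table is created. The sketch below uses made-up data and is one possible pre-processing pass, not something the SDK requires: it drops duplicate keys explicitly (keeping the first row, as described above), coerces unparseable timestamps to `NaT`, and lists the categories inside a `|`-delimited column.

```python
import pandas as pd

# Toy table with a duplicated key and an unparseable date (values are made up).
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "date": ["2024-09-01", "2024-09-02", "2024-09-05", "not a date"],
    "tags": ["Action|Drama", "Comedy", "Comedy", "Action|Thriller"],
})

# Primary key: keep the first row per key, and record which rows are dropped.
dropped = raw[raw.duplicated(subset="order_id", keep="first")]
clean = raw.drop_duplicates(subset="order_id", keep="first").copy()

# Time column: values that `pandas.to_datetime` cannot parse become NaT.
clean["date"] = pd.to_datetime(clean["date"], errors="coerce")
unparseable = int(clean["date"].isna().sum())

# Multicategorical: inspect the individual categories in a delimited column.
categories = sorted(clean["tags"].str.split("|").explode().unique())

print(len(dropped), unparseable, categories)
```

Doing the deduplication yourself makes it visible which rows are discarded, instead of leaving that choice to row order.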

## Step 6. Create a graph in two simple steps

We are now ready to connect our tables to form a `LocalGraph`.
But how do you get started building a graph, and which tables should you include?

A good guiding principle is to start simple: begin with just the minimal set of tables needed to support the prediction task you care about. Focus on the core entities and relationships essential to prediction.

For example, suppose your goal is to predict a user's future orders. At a minimum, your graph only needs two tables:

- `users`: representing each user
- `orders`: representing the orders placed by those users

This minimal setup forms a usable graph for prediction. From there, you can gradually add complexity. For instance, you might later introduce an `items` table so that KumoRFM can take item information into account.

**1. Select the tables:**

In [None]:
graph = rfm.LocalGraph(tables=[
    users,
    orders,
    items,
])

**2. Link the tables:**

In the `orders` table (`src_table`), there exists a column named `user_id` (`fkey`), which we can use as a foreign key to link to the primary key in the `users` table (`dst_table`).
You don't need to specify the primary key here since it's already known as part of the metadata of the `users` table.

In [None]:
graph.link(src_table="orders", fkey="user_id", dst_table="users");

Also link from the foreign key `item_id` in the `orders` table to the `items` table.

In [None]:
graph.link(src_table="orders", fkey="item_id", dst_table="items");
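
A link is only as useful as the key values behind it, so it can be worth checking that foreign-key values actually appear in the destination table. The following pandas sketch, on made-up data, finds `orders` rows whose `user_id` has no match in `users`. How KumoRFM treats such rows is not covered here.

```python
import pandas as pd

# Toy tables (values are made up); order 12 references a user that does not exist.
users_df = pd.DataFrame({"user_id": [1, 2, 3]})
orders_df = pd.DataFrame({
    "order_id": [10, 11, 12],
    "user_id": [1, 3, 99],
})

# Foreign-key values in `orders` without a matching primary key in `users`.
is_orphan = ~orders_df["user_id"].isin(users_df["user_id"])
orphans = orders_df[is_orphan]

print(orphans["order_id"].tolist())
```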

In [None]:
!pip install graphviz

You can verify that graph connectivity is set up by visualizing the graph ...

In [None]:
# Requires graphviz to be installed

graph.visualize();

... or by printing all necessary information:

In [None]:
graph.print_metadata()
graph.print_links()

You can update and modify links as needed:

In [None]:
# Remove link:
graph.unlink(src_table="orders", fkey="user_id", dst_table="users")

# Re-add link:
graph.link(src_table="orders", fkey="user_id", dst_table="users");

In addition, there is a handy shortcut that lets you create a `LocalGraph` directly from a set of `pandas.DataFrame` objects, bypassing manual `LocalTable` creation:

In [None]:
graph = rfm.LocalGraph.from_data({
    'users': users_df,
    'orders': orders_df,
    'items': items_df,
}, infer_metadata=True)

## Step 7. Write a predictive query

You are now ready to plug your graph into `KumoRFM` to make predictions!

The great thing about the graph is that it's a one-time setup—once it's in place, you can generate a variety of predictions from it and power many business use cases.

In [None]:
model = rfm.KumoRFM(graph)

**Note:** The data is synthetic, and the query and results are intended for demo purposes. We encourage you to benchmark the model using your own data.

### Example 1A: Forecast 30-day product demand

Predict the revenue (sum of order prices) the item with `item_id=42` will generate in the next 30 days.

In [None]:
query = "PREDICT SUM(orders.price, 0, 30, days) FOR items.item_id=42"

df = model.predict(query)
display(df)

How to interpret the result:
1. `ENTITY`: The item with `item_id=42`
1. `ANCHOR_TIMESTAMP`: The point in time the prediction is made from. With an anchor timestamp of `2024-09-19`, the prediction covers the 30-day window `(2024-09-19, 2024-10-19]`. By default, `anchor_time` is the maximum timestamp in the temporal graph.
1. `TARGET_PRED`: How much revenue `item_id=42` generates in the next 30 days.
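
To see what quantity the query is asking about, the sketch below computes the same aggregate on historical data with pandas. It assumes the window is the half-open interval `(anchor, anchor + 30 days]`, and all values are made up.

```python
import pandas as pd

# Toy orders (values are made up).
orders_df = pd.DataFrame({
    "item_id": [42, 42, 42, 42, 7],
    "date": pd.to_datetime([
        "2024-09-19", "2024-09-20", "2024-10-19", "2024-10-20", "2024-09-25",
    ]),
    "price": [10.0, 20.0, 30.0, 40.0, 99.0],
})

anchor = pd.Timestamp("2024-09-19")
window_end = anchor + pd.Timedelta(days=30)

# Orders for item 42 inside (anchor, anchor + 30 days]:
# the order at the anchor itself is excluded, the one at the end is included.
in_window = (
    (orders_df["item_id"] == 42)
    & (orders_df["date"] > anchor)
    & (orders_df["date"] <= window_end)
)
target = orders_df.loc[in_window, "price"].sum()

print(window_end.date(), target)
```

When the anchor lies in the past, this value is the ground truth that a prediction for the same anchor can be compared against.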

**You can use the result for sales forecasting:**

<div align="left">
  <img src="https://kumo-sdk-public.s3.us-west-2.amazonaws.com/rfm-colabs/sales-forecasting.png" width="500" />
</div>


### Example 1B: Forecast 30-day product demand, with an `anchor_time`

By default, predictions are based on the maximum timestamp in your temporal graph. However, you can explicitly set a historical `anchor_time` to simulate what a prediction would have looked like at that point in time.

For instance, if `anchor_time` is `"2024-09-20"`, the model will predict—assuming today is `"2024-09-20"`—the product demand in the next 30 days.
KumoRFM will only use information before the `anchor_time` to avoid data leakage.

This feature can be useful when you want to evaluate model performance yourself based on time-based splits.

In [None]:
df = model.predict(query, anchor_time=pd.Timestamp("2024-09-20"))
display(df)
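
A time-based evaluation along these lines could be sketched as follows. The code does not call KumoRFM: a naive "repeat the previous 30 days" baseline stands in for the model's prediction, the data is made up, and the window is assumed to be `(anchor, anchor + 30 days]`.

```python
import pandas as pd

# Toy orders for a single item (values are made up).
orders_df = pd.DataFrame({
    "date": pd.to_datetime([
        "2024-07-10", "2024-07-25", "2024-08-05", "2024-08-20", "2024-09-10",
    ]),
    "price": [10.0, 20.0, 30.0, 40.0, 50.0],
})
horizon = pd.Timedelta(days=30)


def window_sum(df, start, end):
    """Sum of `price` for rows with start < date <= end."""
    mask = (df["date"] > start) & (df["date"] <= end)
    return float(df.loc[mask, "price"].sum())


rows = []
for anchor in pd.to_datetime(["2024-08-01", "2024-08-15"]):
    rows.append({
        "anchor_time": anchor,
        # What actually happened in the 30 days after the anchor.
        "actual": window_sum(orders_df, anchor, anchor + horizon),
        # Stand-in prediction: repeat the previous 30 days. Replace this with
        # the value returned by `model.predict(query, anchor_time=anchor)`.
        "predicted": window_sum(orders_df, anchor - horizon, anchor),
    })

backtest = pd.DataFrame(rows)
mae = (backtest["actual"] - backtest["predicted"]).abs().mean()
print(backtest)
print(mae)
```

Choose anchors early enough that the full 30-day window after each one is present in your data; otherwise the "actual" values are undercounted.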

### Example 2. Predict customer churn

Predict the likelihood that the users with `user_id=42` and `user_id=123` will place zero orders in the next 90 days.

In [None]:
query = "PREDICT COUNT(orders.*, 0, 90, days)=0 FOR users.user_id IN (42, 123)"

df = model.predict(query)
display(df)

How to interpret the result:
1. `ENTITY`: The user with `user_id=42` or `user_id=123`
1. `ANCHOR_TIMESTAMP`: Assuming we are predicting at this moment in time, what's happening in the next 90 days?
1. `TARGET_PRED`: Whether the event (`COUNT(orders.*, 0, 90, days)=0`) will happen (`True`: Event will happen; `False`: Event will not happen)
1. `False_PROB`: The probability that the event will not happen
1. `True_PROB`: The probability that the event will happen
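
The event being predicted can also be computed on historical data, for example to build labels for your own evaluation. The pandas sketch below uses made-up data and assumes the window `(anchor, anchor + 90 days]`. The step that is easy to miss is that users with no orders in the window must be counted as zero rather than left out.

```python
import pandas as pd

# Toy tables (values are made up). User 3 has never placed an order.
users_df = pd.DataFrame({"user_id": [1, 2, 3]})
orders_df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "date": pd.to_datetime(["2024-06-01", "2024-08-15", "2024-05-01"]),
})

anchor = pd.Timestamp("2024-07-01")
window_end = anchor + pd.Timedelta(days=90)

# Count each user's orders in (anchor, anchor + 90 days].
in_window = orders_df[
    (orders_df["date"] > anchor) & (orders_df["date"] <= window_end)
]
counts = in_window.groupby("user_id").size()

# Users without any order in the window are missing from `counts`, so
# reindex over all users and fill with 0 before applying the `= 0` condition.
counts = counts.reindex(users_df["user_id"], fill_value=0)
churned = counts.eq(0)

print(churned.to_dict())
```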

**You can use the result to prevent customer churn (*e.g.*, sending a personalized coupon):**

<div align="left">
  <img src="https://kumo-sdk-public.s3.us-west-2.amazonaws.com/rfm-colabs/churn.png" width="500" />
</div>

### Example 3. Product recommendation

Predict the top-10 items that the user with `user_id=123` is likely to buy in the next 30 days.

In [None]:
query = "PREDICT LIST_DISTINCT(orders.item_id, 0, 30, days) RANK TOP 10 FOR users.user_id=123"

df = model.predict(query)
display(df)

How to interpret the result:
1. `ENTITY`: The user with `user_id=123`
1. `ANCHOR_TIMESTAMP`: Assuming we are predicting at this moment in time, what's happening in the next 30 days?
1. `CLASS`: The items (`item_id`)
1. `SCORE`: Higher score indicates higher likelihood
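
One way to judge a ranked list like this is precision at k: the share of the top-k items that the user actually went on to buy. The sketch below uses a toy frame with the `ENTITY`, `CLASS`, and `SCORE` columns described above; the scores and the purchase set are made up, and the exact layout of the real output should be checked against what `model.predict` returns.

```python
import pandas as pd

# Toy frame shaped like the result described above (values are made up).
recs = pd.DataFrame({
    "ENTITY": [123, 123, 123, 123],
    "CLASS": [7, 9, 4, 2],
    "SCORE": [0.9, 0.4, 0.7, 0.1],
})

# Items the user bought in the following 30 days (also made up).
purchased = {4, 2, 5}

k = 3
# Highest scores first, then keep the k best item IDs.
top_k = recs.sort_values("SCORE", ascending=False).head(k)["CLASS"].tolist()
hits = [item for item in top_k if item in purchased]
precision_at_k = len(hits) / k

print(top_k, precision_at_k)
```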

**You can use the result to power product recommendation:**

<div align="left">
  <img src="https://kumo-sdk-public.s3.us-west-2.amazonaws.com/rfm-colabs/product-recommendation.png" width="500" />
</div>


### Example 4. Infer entity attributes

Predict the age of the user with `user_id=8` (the original `age` field is `N/A` for this user).

In [None]:
query = "PREDICT users.age FOR users.user_id=8"

df = model.predict(query)
display(df)

How to interpret the result:
1. `ENTITY`: The user with `user_id=8`
1. `ANCHOR_TIMESTAMP`: Assuming we are predicting at this moment in time
1. `TARGET_PRED`: The predicted `age` of the user
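
To find which users need this kind of imputation, you can look for missing values in pandas and assemble the query string from the matching IDs. The sketch below uses made-up data and assumes that the `IN (...)` entity filter from Example 2 can be combined with the attribute target of this example.

```python
import pandas as pd

# Toy users table (values are made up); users 8 and 15 have no age recorded.
users_df = pd.DataFrame({
    "user_id": [3, 8, 11, 15],
    "age": [34.0, None, 52.0, None],
})

# IDs of the rows whose `age` is missing.
missing_ids = users_df.loc[users_df["age"].isna(), "user_id"].tolist()

# Build one query for all of them (assumes `IN (...)` is accepted here).
id_list = ", ".join(str(i) for i in missing_ids)
query = f"PREDICT users.age FOR users.user_id IN ({id_list})"

print(query)
```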

**You can use the result for customer segmentation:**

<div align="left">
  <img src="https://kumo-sdk-public.s3.us-west-2.amazonaws.com/rfm-colabs/customer-segmantation.png" width="500" />
</div>


## We'd love to hear from you!

1. **Found a bug or have a feature request?**  

   Submit issues directly on [GitHub](https://github.com/kumo-ai/kumo-rfm). Your feedback helps us improve RFM for everyone.

1. **Built something cool with RFM? We'd love to see it!**  

   Share your project on LinkedIn and tag @kumo.  
   We regularly spotlight community projects on our official channels, and yours could be next!

<div align="left">
  <img src="https://kumo-sdk-public.s3.us-west-2.amazonaws.com/rfm-colabs/kumo_ai_logo.jpeg" width="30" />
</div>

