# Senzing Entity Resolution Quickstart with Spark/Databricks

This tutorial is an introduction to using [Senzing](https://senzing.com/) with Spark dataframes. We'll load three example datasets from Spark dataframes into an instance of Senzing and find all the entities present in the data. This reveals duplicates within the data and lets us merge the dataframes based on the results from Senzing.

We'll use the Senzing "Truthsets" demo data from https://github.com/Senzing/truth-sets/. The datasets are as follows:
- `customers`, a messy dataset of customer names and incomplete PII data. It includes addresses, dates of birth, emails, etc.
- `reference`, containing customer and organization information, with incomplete contact data
- `watchlist`, a list of fraudulent entities

We'll use all three datasets to figure out which rows in the `customers` dataset refer to the same entity (person).

We'll use the [Senzing V4](https://www.senzing.com/docs/4_beta/python/index.html) syntax and a sandbox Senzing gRPC server hosted within a Docker container.

### Steps in this tutorial

1. Set up the Senzing gRPC server, import the required modules, download the demo data
2. Load the data into Spark dataframes
3. Configure the Senzing engine so it's ready to receive data
4. Add our data to the Senzing repository and resolve entities
5. Run a cleanup process to ensure the entities are as accurate as possible
6. Extract the resolved entities from Senzing
7. Add a new column to our Spark dataframe containing resolved entity details

## Set up requirements

In this tutorial, we'll use the [`senzing`](https://garage.senzing.com/sz-sdk-python/index.html) and [`senzing_grpc`](https://garage.senzing.com/sz-sdk-python-grpc/) packages, in addition to PySpark. You can install both of these using the `requirements.txt` file in the repo folder containing this tutorial.

In [26]:
import grpc
from senzing import szengineflags, szerror
from senzing_grpc import SzAbstractFactoryGrpc
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, collect_list, array, array_except
import json
import os
import requests
import shutil

We'll start our [Senzing gRPC server](https://github.com/senzing-garage/serve-grpc/tree/main) using Docker.

Run the following command `docker run -it --publish 8261:8261 --rm senzing/serve-grpc` in a terminal window.
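Before moving on, it can help to confirm that the container is actually listening. The sketch below is a generic TCP check using only the standard library; the helper name `port_is_open` is our own and is not part of Senzing or gRPC. It only shows that something accepts connections on the port, not that the Senzing server is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures
        return False
```

Calling `port_is_open("localhost", 8261)` should return `True` once the Docker container has started, since 8261 is the port published by the command above.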


Then, we'll download the example data:

In [2]:
data_path = "./data/"
data_url_prefix = "https://raw.githubusercontent.com/Senzing/truth-sets/refs/heads/main/truthsets/demo/"
data_filenames = [
    "customers.json",
    "reference.json",
    "watchlist.json",
]

In [3]:
os.makedirs(data_path, exist_ok=True)

for filename in data_filenames:
    url = data_url_prefix + filename
    filepath = data_path + filename
    # Skip files that have already been downloaded
    if not os.path.exists(filepath):
        response = requests.get(url, stream=True, timeout=10)
        # Fail loudly rather than saving an HTTP error page as data
        response.raise_for_status()
        response.raw.decode_content = True
        with open(filepath, "wb") as file:
            shutil.copyfileobj(response.raw, file)
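As a quick sanity check on the downloads, we can count the records in each file. This is a minimal sketch that assumes each file holds one JSON object per line, which is the layout `spark.read.json` reads by default; the helper name `count_json_lines` is our own.

```python
import json
from pathlib import Path


def count_json_lines(filepath):
    """Count the non-empty lines that parse as JSON objects; raise on a bad line."""
    count = 0
    with Path(filepath).open(encoding="utf-8") as file:
        for line_number, line in enumerate(file, start=1):
            line = line.strip()
            if not line:
                continue
            # json.loads raises json.JSONDecodeError (a ValueError) on malformed input
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError(f"{filepath}:{line_number} is not a JSON object")
            count += 1
    return count
```

For example, `count_json_lines("./data/customers.json")` should return the number of customer records, which we can later compare with `customers.count()` in Spark.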

## Load data into Spark DataFrames


First, we'll start a new Spark session.

In [3]:
spark = (
    SparkSession.builder.appName("Senzing Quickstart").master("local[*]").config("spark.driver.bindAddress", "127.0.0.1").getOrCreate()
)

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/07/21 16:35:12 WARN Utils: Your hostname, Catherines-MacBook-Air-2.local, resolves to a loopback address: 127.0.0.1; using 192.168.0.4 instead (on interface en0)
25/07/21 16:35:12 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/07/21 16:35:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Next, we'll load the three datasets from JSON files into Spark dataframes.

In [4]:
customers = spark.read.json("data/customers.json")


In [5]:
customers.show(10)

25/07/21 16:35:18 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


+-------------+------------+---------+--------------------+----------------+----------+---------+------+--------+-----------+-------+-------------+----------------------+---------------------+------------------+------+-------------------+------------------+----------------+----------------+---------------+------------+----------+------------------+-----------------+-----------------+-------------------+----------------+---------+-----------+------------------+-----------+----------+
|    ADDR_CITY|ADDR_COUNTRY|ADDR_FULL|          ADDR_LINE1|ADDR_POSTAL_CODE|ADDR_STATE|ADDR_TYPE|AMOUNT|CATEGORY|DATA_SOURCE|   DATE|DATE_OF_BIRTH|DRIVERS_LICENSE_NUMBER|DRIVERS_LICENSE_STATE|     EMAIL_ADDRESS|GENDER|NATIONAL_ID_COUNTRY|NATIONAL_ID_NUMBER|NATIVE_NAME_FULL|PASSPORT_COUNTRY|PASSPORT_NUMBER|PHONE_NUMBER|PHONE_TYPE|PRIMARY_NAME_FIRST|PRIMARY_NAME_FULL|PRIMARY_NAME_LAST|PRIMARY_NAME_MIDDLE|PRIMARY_NAME_ORG|RECORD_ID|RECORD_TYPE|SECONDARY_NAME_ORG| SSN_NUMBER|    STATUS|
+-------------+---------

We can see that Robert Smith is in the dataset as Robert, Bob, B, and Robbie, with variations in mailing address and date of birth.

Next, we'll load the `reference` and `watchlist` datasets.

In [6]:
reference = spark.read.json("data/reference.json")

In [7]:
reference.show(10)

+---------------+------------+--------------------+--------------------+----------------+----------+----------+--------------+-----------+----+-------------+-------------+--------------------+--------------------+----------------+----------+------------------+-----------------+-----------------+--------------------+---------+------------+--------------+---------------+----------------+------------------+--------+
|      ADDR_CITY|ADDR_COUNTRY|           ADDR_FULL|          ADDR_LINE1|ADDR_POSTAL_CODE|ADDR_STATE| ADDR_TYPE|      CATEGORY|DATA_SOURCE|DATE|DATE_OF_BIRTH|EMAIL_ADDRESS|       EMPLOYER_NAME|    NATIVE_NAME_FULL|    PHONE_NUMBER|PHONE_TYPE|PRIMARY_NAME_FIRST|PRIMARY_NAME_FULL|PRIMARY_NAME_LAST|    PRIMARY_NAME_ORG|RECORD_ID| RECORD_TYPE|REL_ANCHOR_KEY|REL_POINTER_KEY|REL_POINTER_ROLE|SECONDARY_NAME_ORG|  STATUS|
+---------------+------------+--------------------+--------------------+----------------+----------+----------+--------------+-----------+----+-------------+---------

In [8]:
watchlist = spark.read.json("data/watchlist.json")

In [9]:
watchlist.show(10)

+-------------+------------+--------------------+----------------+----------+---------+--------+-----------+------+-------------+----------------------+---------------------+--------------------+-------------+------+----------------+------------+------------------+-----------------+-----------------+-------------------+---------+-----------+-----------+--------+
|    ADDR_CITY|ADDR_COUNTRY|          ADDR_LINE1|ADDR_POSTAL_CODE|ADDR_STATE|ADDR_TYPE|CATEGORY|DATA_SOURCE|  DATE|DATE_OF_BIRTH|DRIVERS_LICENSE_NUMBER|DRIVERS_LICENSE_STATE|       EMAIL_ADDRESS|EMPLOYER_NAME|GENDER|NATIVE_NAME_FULL|PHONE_NUMBER|PRIMARY_NAME_FIRST|PRIMARY_NAME_FULL|PRIMARY_NAME_LAST|PRIMARY_NAME_MIDDLE|RECORD_ID|RECORD_TYPE| SSN_NUMBER|  STATUS|
+-------------+------------+--------------------+----------------+----------+---------+--------+-----------+------+-------------+----------------------+---------------------+--------------------+-------------+------+----------------+------------+------------------+-----

All three datasets have already been mapped to the [Senzing Entity Specification](https://www.senzing.com/docs/entity_specification/index.html). If you want to use your own data with Senzing, you'll need to map your data to this format.
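To give a feel for what that mapping involves, here is a minimal sketch that renames the columns of a made-up source row to attribute names that appear in the example datasets above. The source field names (`id`, `full_name`, `dob`, `email`, `phone`) and the `MY_CRM` data source are hypothetical; check the Entity Specification for the attributes that suit your own data.

```python
# Hypothetical source column -> attribute name used in the example datasets
FIELD_MAP = {
    "full_name": "PRIMARY_NAME_FULL",
    "dob": "DATE_OF_BIRTH",
    "email": "EMAIL_ADDRESS",
    "phone": "PHONE_NUMBER",
}


def map_row(row, data_source):
    """Build a record dictionary from a source row, skipping empty values."""
    record = {"DATA_SOURCE": data_source, "RECORD_ID": str(row["id"])}
    for source_field, target_attribute in FIELD_MAP.items():
        value = row.get(source_field)
        if value not in (None, ""):
            record[target_attribute] = value
    return record


row = {"id": 7, "full_name": "Robert Smith", "dob": None, "email": ""}
print(map_row(row, "MY_CRM"))
# {'DATA_SOURCE': 'MY_CRM', 'RECORD_ID': '7', 'PRIMARY_NAME_FULL': 'Robert Smith'}
```

A new data source name such as `MY_CRM` would also need to be registered in the Senzing configuration, in the same way as the three example sources later in this tutorial.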

## Configure Senzing

Next, we need to set everything up so that we can call the Senzing engine on the gRPC server. These steps assume that you're running Senzing locally for development purposes.

Senzing uses an [Abstract Factory](https://garage.senzing.com/sz-sdk-python/senzing.html#module-senzing.szabstractfactory) to create everything that's required to perform entity resolution. We'll create a new abstract factory like so:

In [10]:
grpc_channel = grpc.insecure_channel("localhost:8261")
sz_abstract_factory = SzAbstractFactoryGrpc(grpc_channel)

We'll get the Senzing version details to check connectivity and confirm everything is working.

In [11]:
sz_product = sz_abstract_factory.create_product()
print(json.dumps(json.loads(sz_product.get_version()), indent=2))

{
  "PRODUCT_NAME": "Senzing SDK",
  "VERSION": "4.0.0",
  "BUILD_VERSION": "4.0.0.25184",
  "BUILD_DATE": "2025-07-03",
  "BUILD_NUMBER": "2025_07_03__16_38",
  "COMPATIBILITY_VERSION": {
    "CONFIG_VERSION": "11"
  },
  "SCHEMA_VERSION": {
    "ENGINE_SCHEMA_VERSION": "4.0",
    "MINIMUM_REQUIRED_SCHEMA_VERSION": "4.0",
    "MAXIMUM_REQUIRED_SCHEMA_VERSION": "4.99"
  }
}
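Because this tutorial uses the V4 syntax, it can be worth guarding against an older server. The sketch below parses a version string like the one above and extracts the major version; the helper name `major_version` is our own, and the sample JSON is a trimmed copy of the output shown above.

```python
import json


def major_version(version_json):
    """Return the major version number from a Senzing version JSON string."""
    version = json.loads(version_json)["VERSION"]
    return int(version.split(".")[0])


sample = '{"PRODUCT_NAME": "Senzing SDK", "VERSION": "4.0.0"}'
print(major_version(sample))  # 4
```

In the notebook we could call `major_version(sz_product.get_version())` and stop early if the result is lower than 4.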


Next, we create the Senzing objects that we need: the configuration manager, the diagnostic object for troubleshooting, and the engine that will perform entity resolution when we load our data.

In [12]:
sz_configmanager = sz_abstract_factory.create_configmanager()
sz_diagnostic = sz_abstract_factory.create_diagnostic()
sz_engine = sz_abstract_factory.create_engine()

We create a config object from the current default config and register the names of our data sources.

In [13]:
config_id = sz_configmanager.get_default_config_id()
sz_config = sz_configmanager.create_config_from_config_id(config_id)

In [14]:
for data_source in ["CUSTOMERS", "REFERENCE", "WATCHLIST"]:
    sz_config.register_data_source(data_source)

And we replace the default config with our updated config:

In [15]:
new_json_config = sz_config.export()
new_config_id = sz_configmanager.register_config(new_json_config, "Add example data")
sz_configmanager.replace_default_config_id(config_id, new_config_id)

Because we've changed the Senzing configuration, Senzing objects need to be updated. We do this by reinitializing with the new config.

In [16]:
sz_abstract_factory.reinitialize(new_config_id)

## Add records to Senzing

Now that the Senzing engine is set up, we can add our data. We do this using [`sz_engine.add_record()`](https://garage.senzing.com/sz-sdk-python/senzing.html#senzing.szengine.SzEngine.add_record), which adds a single record into the Senzing repository.

This method has three required arguments:
- `data_source_code`, the identifier that we assigned to each dataset when we added the datasets to the Senzing config. In this tutorial, it's one of `CUSTOMERS`, `REFERENCE`, or `WATCHLIST`.
- `record_id`, a unique identifier for each record. This is the `RECORD_ID` column in our example datasets.
- `record_definition`, which is the row (record) we're adding.
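This tutorial passes each record to `add_record()` as a Python dictionary. If the SDK version you're using expects `record_definition` as a JSON string instead (an assumption to check against the reference linked above), a small helper like this sketch can do the conversion. The name `row_to_record_json` is our own.

```python
import json


def row_to_record_json(row_dict):
    """Drop null fields from a row dictionary and serialize the rest as JSON."""
    record = {key: value for key, value in row_dict.items() if value is not None}
    return json.dumps(record, ensure_ascii=False)


row = {"DATA_SOURCE": "CUSTOMERS", "RECORD_ID": "1001", "GENDER": None}
print(row_to_record_json(row))
# {"DATA_SOURCE": "CUSTOMERS", "RECORD_ID": "1001"}
```

Dropping the null fields keeps the record small, since Spark fills every column that a row doesn't use with `None`.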

So to simply add a record, we would use the following code:

```
sz_engine.add_record(
    record["DATA_SOURCE"],
    record["RECORD_ID"],
    record,
)
```

We'll add an [optional flag](https://senzing.com/docs/4_beta/flags/flags_add/index.html) so that the Senzing engine returns the IDs of the entities affected when we add each row. Then we'll add these entity IDs to a Python set (so that no duplicates are possible). We'll use this set later on when we want to extract the details of related entities from Senzing.

This is particularly useful when you're adding new data to a large existing dataset, and you only want to see what entities have been affected by the new records.

We'll use Spark's local iterator to iterate through each row in each dataframe, convert each row into a dictionary, drop the null fields, and add it to the Senzing repository. We'll extract the entity IDs from the info string that Senzing returns:

In [17]:
def get_affected_entities(info_string):
    # helper function to extract the entity id
    info = json.loads(info_string)
    return [entity["ENTITY_ID"] for entity in info["AFFECTED_ENTITIES"]]

In [19]:
affected_entities = set()

for data_source in [customers, reference, watchlist]:
    for row in data_source.rdd.toLocalIterator():
        record = {k: v for k, v in row.asDict().items() if v is not None}

        info = sz_engine.add_record(
            record["DATA_SOURCE"],
            record["RECORD_ID"],
            record,
            szengineflags.SzEngineFlags.SZ_WITH_INFO,
        )

        affected_entities.update(get_affected_entities(info))
        print(info)

{"DATA_SOURCE":"CUSTOMERS","RECORD_ID":"1001","AFFECTED_ENTITIES":[{"ENTITY_ID":1}]}
{"DATA_SOURCE":"CUSTOMERS","RECORD_ID":"1002","AFFECTED_ENTITIES":[{"ENTITY_ID":1}]}
{"DATA_SOURCE":"CUSTOMERS","RECORD_ID":"1003","AFFECTED_ENTITIES":[{"ENTITY_ID":1}]}
{"DATA_SOURCE":"CUSTOMERS","RECORD_ID":"1004","AFFECTED_ENTITIES":[{"ENTITY_ID":1}]}
{"DATA_SOURCE":"CUSTOMERS","RECORD_ID":"1005","AFFECTED_ENTITIES":[{"ENTITY_ID":1}]}
{"DATA_SOURCE":"CUSTOMERS","RECORD_ID":"1009","AFFECTED_ENTITIES":[{"ENTITY_ID":6}]}
{"DATA_SOURCE":"CUSTOMERS","RECORD_ID":"1010","AFFECTED_ENTITIES":[{"ENTITY_ID":6}]}
{"DATA_SOURCE":"CUSTOMERS","RECORD_ID":"1011","AFFECTED_ENTITIES":[{"ENTITY_ID":8}]}
{"DATA_SOURCE":"CUSTOMERS","RECORD_ID":"1015","AFFECTED_ENTITIES":[{"ENTITY_ID":9}]}
{"DATA_SOURCE":"CUSTOMERS","RECORD_ID":"1016","AFFECTED_ENTITIES":[{"ENTITY_ID":9}]}
{"DATA_SOURCE":"CUSTOMERS","RECORD_ID":"1017","AFFECTED_ENTITIES":[{"ENTITY_ID":9}]}
{"DATA_SOURCE":"CUSTOMERS","RECORD_ID":"1018","AFFECTED_ENTITIES"

When each row is added, we'll see the details printed out. It should look like this for the first row:

`{"DATA_SOURCE":"CUSTOMERS","RECORD_ID":"1001","AFFECTED_ENTITIES":[{"ENTITY_ID":1}]}`
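To see how these info strings turn into a set of entity IDs, here is a self-contained sketch that runs the helper on two strings copied from the output in this tutorial. It uses a slightly more tolerant variant of the helper (`.get` with a default), which is our own choice and not something the engine requires.

```python
import json


def get_affected_entities(info_string):
    """Extract the entity IDs from an info string returned with SZ_WITH_INFO."""
    info = json.loads(info_string)
    # Tolerate a missing AFFECTED_ENTITIES key by treating it as an empty list
    return [entity["ENTITY_ID"] for entity in info.get("AFFECTED_ENTITIES", [])]


single = '{"DATA_SOURCE":"CUSTOMERS","RECORD_ID":"1001","AFFECTED_ENTITIES":[{"ENTITY_ID":1}]}'
multiple = '{"DATA_SOURCE":"CUSTOMERS","RECORD_ID":"2171","AFFECTED_ENTITIES":[{"ENTITY_ID":100},{"ENTITY_ID":101}]}'

affected = set()
for info_string in (single, single, multiple):
    affected.update(get_affected_entities(info_string))

print(sorted(affected))  # [1, 100, 101]
```

Adding the same info string twice leaves the set unchanged, which is why a set is a convenient container here.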

In [20]:
print(affected_entities)

{1, 5, 6, 8, 9, 13, 14, 15, 17, 18, 19, 20, 22, 24, 27, 28, 29, 30, 31, 33, 36, 39, 40, 43, 44, 45, 47, 49, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 76, 77, 79, 81, 83, 85, 87, 89, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 120, 125, 126, 129, 130, 131, 133, 135, 137, 140, 141, 144, 145, 150, 151, 152, 154, 158}


## Process REDO records

The [redo process](https://senzing.zendesk.com/hc/en-us/articles/360007475133-Processing-REDO) in Senzing is a periodic cleanup of the entities in the Senzing repository. You'll need to run the redo process every so often when you're using Senzing with your own data.

The most common use for this is when Senzing discovers a value is overused across entities. You might add 50 records with different names but the same phone number. At first, the shared phone number suggests that these entities are related, but at a certain point the system will spot that the phone number is no longer a good identifier. It will then create redo records in a separate table, and you can run the redo process to clean up the entities.

We'll run the redo process using `sz_engine.get_redo_record()`, which gets each record in the redo table, and `sz_engine.process_redo_record()`, which cleans up the entities based on the redo record. Carrying out this process can generate more records in the redo table, so we'll run it in a `while` loop until there are no more redo records.

We'll also use the `SZ_WITH_INFO` flag again to output the affected entities, and we'll update our set of affected entities with these entity IDs.

In [21]:
while True:
    redo_record = sz_engine.get_redo_record()
    if not redo_record:
        break
    info = sz_engine.process_redo_record(redo_record, flags=szengineflags.SzEngineFlags.SZ_WITH_INFO)
    affected_entities.update(get_affected_entities(info))
    print(info)

{"DATA_SOURCE":"CUSTOMERS","RECORD_ID":"2181","AFFECTED_ENTITIES":[{"ENTITY_ID":102}]}
{"DATA_SOURCE":"CUSTOMERS","RECORD_ID":"2191","AFFECTED_ENTITIES":[{"ENTITY_ID":104}]}
{"DATA_SOURCE":"CUSTOMERS","RECORD_ID":"2192","AFFECTED_ENTITIES":[{"ENTITY_ID":105}]}
{"DATA_SOURCE":"CUSTOMERS","RECORD_ID":"2182","AFFECTED_ENTITIES":[{"ENTITY_ID":103}]}
{"DATA_SOURCE":"CUSTOMERS","RECORD_ID":"2171","AFFECTED_ENTITIES":[{"ENTITY_ID":100},{"ENTITY_ID":101}]}
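The loop above can also be wrapped in a reusable function. The sketch below does that and tests it against a stand-in engine, so it runs without a Senzing server. `StubEngine` is entirely hypothetical: it echoes each queued string back as the info string, whereas a real engine returns its own redo records and re-resolves entities when they are processed.

```python
import json
from collections import deque


class StubEngine:
    """Stand-in for sz_engine: serves queued redo records, then an empty string."""

    def __init__(self, redo_records):
        self._queue = deque(redo_records)

    def get_redo_record(self):
        return self._queue.popleft() if self._queue else ""

    def process_redo_record(self, redo_record, flags=0):
        # Echo the record back as the info string (illustration only)
        return redo_record


def drain_redo(engine, flags=0):
    """Process redo records until none remain; return (count, affected entity IDs)."""
    affected = set()
    processed = 0
    while True:
        redo_record = engine.get_redo_record()
        if not redo_record:
            break
        info = json.loads(engine.process_redo_record(redo_record, flags=flags))
        affected.update(e["ENTITY_ID"] for e in info.get("AFFECTED_ENTITIES", []))
        processed += 1
    return processed, affected


stub = StubEngine([
    '{"AFFECTED_ENTITIES":[{"ENTITY_ID":102}]}',
    '{"AFFECTED_ENTITIES":[{"ENTITY_ID":100},{"ENTITY_ID":101}]}',
])
print(drain_redo(stub))  # (2, {100, 101, 102})
```

With the real engine, the call would be `drain_redo(sz_engine, szengineflags.SzEngineFlags.SZ_WITH_INFO)`, and the returned set could be merged into `affected_entities`.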


### Take a look at some results

Let's take a quick look at the entities found by Senzing. We know that there's someone named Robert Smith in the dataset, and we think their date of birth is 11/12/1978. We can look them up in the Senzing repository using `sz_engine.search_by_attributes`.

In [22]:
search_query = {
    "name_full": "robert smith",
    "date_of_birth": "11/12/1978",
}
search_result = sz_engine.search_by_attributes(json.dumps(search_query))
print(json.dumps(json.loads(search_result), indent=2))

{
  "RESOLVED_ENTITIES": [
    {
      "MATCH_INFO": {
        "MATCH_LEVEL_CODE": "RESOLVED",
        "MATCH_KEY": "+NAME+DOB",
        "ERRULE_CODE": "SNAME_SSTAB",
        "CANDIDATE_KEYS": {
          "DOB": [
            {
              "FEAT_ID": 21,
              "FEAT_DESC": "11/12/1978"
            }
          ],
          "NAMEDATE_KEY": [
            {
              "FEAT_ID": 14,
              "FEAT_DESC": "RPRT|SM0|DOB.MMDD_HASH=1211"
            },
            {
              "FEAT_ID": 15,
              "FEAT_DESC": "RPRT|SM0|DOB=71211"
            },
            {
              "FEAT_ID": 30,
              "FEAT_DESC": "RPRT|SM0|DOB.MMYY_HASH=1178"
            }
          ],
          "NAME_KEY": [
            {
              "FEAT_ID": 6,
              "FEAT_DESC": "RPRT|SM0"
            }
          ]
        },
        "FEATURE_SCORES": {
          "DOB": [
            {
              "INBOUND_FEAT_ID": 21,
              "INBOUND_FEAT_DESC": "11/12/1978",
            

We can see all the information about the Robert Smith entity that is currently in the Senzing repository. This person is also in our datasets with the names "B Smith", "Bob J Smith" and "Bob Smith".
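The full search result is long, so a short summary is often easier to read. This sketch pulls out only the `MATCH_LEVEL_CODE` and `MATCH_KEY` fields visible in the output above; the helper name `summarize_matches` is our own, and the sample is a trimmed copy of that output.

```python
import json


def summarize_matches(search_result):
    """Return (match level, match key) for each entity in a search result string."""
    entities = json.loads(search_result).get("RESOLVED_ENTITIES", [])
    return [
        (entity["MATCH_INFO"]["MATCH_LEVEL_CODE"], entity["MATCH_INFO"]["MATCH_KEY"])
        for entity in entities
    ]


sample = json.dumps({
    "RESOLVED_ENTITIES": [
        {
            "MATCH_INFO": {
                "MATCH_LEVEL_CODE": "RESOLVED",
                "MATCH_KEY": "+NAME+DOB",
                "ERRULE_CODE": "SNAME_SSTAB",
            }
        }
    ]
})
print(summarize_matches(sample))  # [('RESOLVED', '+NAME+DOB')]
```

Here the match key `+NAME+DOB` tells us that the search matched on both the name and the date of birth.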

## Add resolved entities to the Spark dataframe

Our final step in this tutorial is to add a new column to the `customers` dataframe containing a list of all the resolved entities found by Senzing.

So if records 1001, 1002 and 1003 are linked to the same entity, we'll add the list [1002, 1003] to the row for record 1001.

You could then use this information to merge rows in the original dataframe, but we won't cover this in the tutorial.

### Get entity mappings from Senzing


We'll use our set of affected entities to extract all the resolved entities, then we'll build a map from the entity ID to all the record IDs for that entity.

In [23]:
def get_records_for_entity(entity_id):
    # Fetch the entity and return the RECORD_ID of every record resolved to it
    entity = json.loads(sz_engine.get_entity_by_entity_id(entity_id))
    records = entity["RESOLVED_ENTITY"]["RECORDS"]
    return [record["RECORD_ID"] for record in records]

In [27]:
# build entity to record map
entity_to_record = {}
for entity in affected_entities:
    try:
        entity_to_record[entity] = get_records_for_entity(entity)
    except szerror.SzError:
        # An entity ID recorded earlier may no longer exist, for example after a later merge
        entity_to_record[entity] = []

The first entries in this dictionary should look like this:

```
{1: ['1002', '1001', '1003', '1004'],
 5: ['1005', '1006'],
 6: ['1009', '1010', '1011', '1012', '1014'],
 ...
```
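An entity can contain records from more than one data source, and a record ID on its own may not say which dataset it came from. The sketch below keeps the data source next to each record ID so that we could filter to `CUSTOMERS` records before joining. It assumes that each entry under `RECORDS` carries a `DATA_SOURCE` field alongside `RECORD_ID`; the sample JSON and its record IDs are made up for illustration.

```python
import json


def records_with_source(entity_json):
    """Return (DATA_SOURCE, RECORD_ID) pairs for every record in an entity."""
    records = json.loads(entity_json)["RESOLVED_ENTITY"]["RECORDS"]
    return [(record["DATA_SOURCE"], record["RECORD_ID"]) for record in records]


def record_ids_for_source(entity_json, data_source):
    """Return only the record IDs that belong to the given data source."""
    return [
        record_id
        for source, record_id in records_with_source(entity_json)
        if source == data_source
    ]


# Made-up entity with records from two data sources
sample = json.dumps({
    "RESOLVED_ENTITY": {
        "ENTITY_ID": 1,
        "RECORDS": [
            {"DATA_SOURCE": "CUSTOMERS", "RECORD_ID": "A1"},
            {"DATA_SOURCE": "CUSTOMERS", "RECORD_ID": "A2"},
            {"DATA_SOURCE": "REFERENCE", "RECORD_ID": "B1"},
        ],
    }
})
print(record_ids_for_source(sample, "CUSTOMERS"))  # ['A1', 'A2']
```

With the real engine, `entity_json` would be the string returned by `sz_engine.get_entity_by_entity_id(entity_id)`.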

In [29]:
entity_to_record

{1: ['1002', '1001', '1003', '1004'],
 5: ['1005', '1006'],
 6: ['1009', '1010', '1011', '1012', '1014'],
 8: [],
 9: ['1015', '1016', '1017', '1018'],
 13: ['1019'],
 14: ['1020', '1021'],
 15: ['1022', '1023', '1024'],
 17: ['1025'],
 18: ['1026'],
 19: ['1028'],
 20: ['1030', '1031'],
 22: ['1032', '1033'],
 24: ['1034', '1035', '1036', '1038'],
 27: ['1039'],
 28: ['1040'],
 29: ['1043'],
 30: ['1044', '1046'],
 31: ['1045'],
 33: ['1047', '1048', '1049'],
 36: ['1050', '1051', '1052'],
 39: ['1053', '1055'],
 40: ['1054', '1056'],
 43: ['1057'],
 44: ['1058'],
 45: ['1059', '1060'],
 47: ['1061', '1062'],
 49: ['1063', '1064', '1065', '1066', '1067', '1068'],
 55: ['1069', '1070', '2013'],
 57: ['1071', '1072', '2014'],
 59: ['1073', '1074'],
 61: ['1075', '1076'],
 63: ['1077', '1078'],
 65: ['1079', '1080'],
 67: ['1081', '1082'],
 69: ['1083', '1084'],
 71: ['1085', '1086'],
 73: ['1087', '1088'],
 75: ['1089'],
 76: ['1090'],
 77: ['1091', '1092'],
 79: ['1093', '1094'],
 81: 

## Join entity records to Spark DataFrame

Now we'll create the new column with details of all the rows that Senzing has resolved to the same entity.

To build this column, we'll first flatten the entity to record map and convert it to a new Spark dataframe:

In [30]:
entity_record_data = [
    (entity_id, record_id)
    for entity_id, records in entity_to_record.items()
    for record_id in records
]
entity_record_df = spark.createDataFrame(entity_record_data, ["ENTITY_ID", "RECORD_ID"])

Then, we'll group this new dataframe by the entity ID.

In [31]:
entity_grouped = entity_record_df.groupBy("ENTITY_ID").agg(
    collect_list("RECORD_ID").alias("ALL_RECORD_IDS")
)

We'll first join the flat mapping to the original `customers` dataframe, which gives each row its entity ID. Then we'll join the grouped dataframe to attach the full list of record IDs for that entity:

In [32]:
customers = customers.join(entity_record_df, "RECORD_ID", "left")
customers = customers.join(entity_grouped, "ENTITY_ID", "left")

And we'll make a new column that lists the other record IDs for the same entity, excluding the row's own `RECORD_ID`:

In [33]:
customers = customers.withColumn(
    "RELATED_RECORD_IDS",
    array_except(col("ALL_RECORD_IDS"), array(col("RECORD_ID").cast("string"))),
)

In [34]:
customers.show(10, truncate=False)


+---------+---------+-------------+------------+---------+-----------------------------------+----------------+----------+---------+------+--------+-----------+-------+-------------+----------------------+---------------------+------------------+------+-------------------+------------------+----------------+----------------+---------------+------------+----------+------------------+-----------------+-----------------+-------------------+----------------+-----------+------------------+-----------+----------+------------------------------+------------------------+
|ENTITY_ID|RECORD_ID|ADDR_CITY    |ADDR_COUNTRY|ADDR_FULL|ADDR_LINE1                         |ADDR_POSTAL_CODE|ADDR_STATE|ADDR_TYPE|AMOUNT|CATEGORY|DATA_SOURCE|DATE   |DATE_OF_BIRTH|DRIVERS_LICENSE_NUMBER|DRIVERS_LICENSE_STATE|EMAIL_ADDRESS     |GENDER|NATIONAL_ID_COUNTRY|NATIONAL_ID_NUMBER|NATIVE_NAME_FULL|PASSPORT_COUNTRY|PASSPORT_NUMBER|PHONE_NUMBER|PHONE_TYPE|PRIMARY_NAME_FIRST|PRIMARY_NAME_FULL|PRIMARY_NAME_LAST|PRIMARY_NA

There's now a column in our `customers` dataframe that contains all the other records that have been resolved to the same entity! It should look like this:

| RECORD_ID | ENTITY_ID | RELATED_RECORD_IDS |
|-----------|-----------|--------------------|
| 1004 | 1 | [1002, 1001, 1003] |
| 1010 | 6 | [1009, 1011, 1012, 1014] |
| 1016 | 9 | [1015, 1017, 1018] |
| ... | ... | ... |

In [35]:
customers.select("RECORD_ID", "ENTITY_ID", "RELATED_RECORD_IDS").show(10, truncate=False)

+---------+---------+------------------------+
|RECORD_ID|ENTITY_ID|RELATED_RECORD_IDS      |
+---------+---------+------------------------+
|1004     |1        |[1002, 1001, 1003]      |
|1010     |6        |[1009, 1011, 1012, 1014]|
|1016     |9        |[1015, 1017, 1018]      |
|1017     |9        |[1015, 1016, 1018]      |
|1005     |5        |[1006]                  |
|1011     |6        |[1009, 1010, 1012, 1014]|
|1015     |9        |[1016, 1017, 1018]      |
|1009     |6        |[1010, 1011, 1012, 1014]|
|1003     |1        |[1002, 1001, 1004]      |
|1002     |1        |[1001, 1003, 1004]      |
+---------+---------+------------------------+
only showing top 10 rows
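We can cross-check the Spark result in plain Python. This sketch computes the same related-record lists directly from an entity-to-record dictionary, using the first two entries of the `entity_to_record` output shown earlier; the helper name `related_records` is our own.

```python
def related_records(entity_to_record):
    """Map each record ID to the other record IDs resolved to the same entity."""
    related = {}
    for records in entity_to_record.values():
        for record_id in records:
            related[record_id] = [other for other in records if other != record_id]
    return related


sample = {1: ["1002", "1001", "1003", "1004"], 5: ["1005", "1006"]}
result = related_records(sample)
print(result["1004"])  # ['1002', '1001', '1003']
print(result["1005"])  # ['1006']
```

These lists match the rows for records 1004 and 1005 in the Spark output above. Spark's `collect_list` doesn't guarantee element order, so when comparing against a dataframe it's safer to compare the lists as sets.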


In [36]:
spark.stop()

## Next steps

Link to new tutorial

TODO where else should people go if they want to learn more?