# Demo Notebook For Training and Running Zingg Entity Resolution Workflow on Databricks
This notebook runs the Zingg Febrl Example on Databricks. 
Please refer to the 
1. [Zingg Python API](https://readthedocs.org/projects/zingg/) 
2. [Zingg Official Documentation](https://www.docs.zingg.ai) 
for details.

Please ensure your cluster has the following installed
1. Zingg from pypi
2. Zingg jar from the [repo](https://github.com/zinggAI/zingg/releases)
3. tabulate from pypi

Please execute each cell one by one as per the instructions provided.

If you face any issue, please [log an issue](https://github.com/zinggAI/zingg/issues)

You can also join [Zingg's Slack community](https://join.slack.com/t/zinggai/shared_invite/zt-w7zlcnol-vEuqU9m~Q56kLLUVxRgpOA)

In [0]:
pip install zingg

Processing ./zingg-0.5.0-py2.py3-none-any.whl
Collecting py4j==0.10.9 (from zingg==0.5.0)
  Downloading py4j-0.10.9-py2.py3-none-any.whl.metadata (1.3 kB)
Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/198.6 kB[0m [31m?[0m eta [36m-:--:--[0m
[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m [32m194.6/198.6 kB[0m [31m6.0 MB/s[0m eta [36m0:00:01[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m198.6/198.6 kB[0m [31m4.5 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: py4j, zingg
Successfully installed py4j-0.10.9 zingg-0.5.0
[43mNote: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.[0m


In [0]:
dbutils.library.restartPython()

In [0]:
pip show zingg

Name: zingg
Version: 0.5.0
Summary: Zingg Entity Resolution, Data Mastering and Deduplication
Home-page: https://github.com/zinggAI/zingg
Author: Zingg.AI
Author-email: sonalgoyal4@gmail.com
License: https://github.com/zinggAI/zingg/blob/main/LICENSE
Location: /local_disk0/.ephemeral_nfs/envs/pythonEnv-dc05a68c-c7cb-4059-b739-b8a68c8a94fa/lib/python3.12/site-packages
Requires: py4j
Required-by: 


In [0]:
pip install tabulate

Collecting tabulate
  Obtaining dependency information for tabulate from https://files.pythonhosted.org/packages/40/44/4a5f08c96eb108af5cb50b41f76142f0afa346dfa99d5296fe7202a11854/tabulate-0.9.0-py3-none-any.whl.metadata
  Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB)
Downloading tabulate-0.9.0-py3-none-any.whl (35 kB)
Installing collected packages: tabulate
Successfully installed tabulate-0.9.0
[43mNote: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.[0m


In [0]:
dbutils.library.restartPython()

In [0]:
pip show tabulate

Name: tabulate
Version: 0.9.0
Summary: Pretty-print tabular data
Home-page: 
Author: 
Author-email: Sergey Astanin <s.astanin@gmail.com>
License: MIT
Location: /local_disk0/.ephemeral_nfs/envs/pythonEnv-08a4885f-0955-41a7-bf85-4ced38eff560/lib/python3.11/site-packages
Requires: 
Required-by: 


# Define locations for the model
The Zingg models and training data are persisted in dbfs. 

**Please edit the model id in the cell below to reflect your model.**

In [0]:
##you can change these to the locations of your choice
##these are the only two settings that need to change
zinggDir = "/models"
modelId = "model25March_1"

In [0]:
from pyspark.sql import SparkSession
SparkSession.builder.getOrCreate()

# Setup common functions for use in Zingg. 
These functions setup the internal folders used by Zingg, and help with labeling and training Zingg. 

**No change is needed in the cell below.**

In [0]:
##please leave the following unchanged
MARKED_DIR = zinggDir + "/" + modelId + "/trainingData/marked/"
UNMARKED_DIR = zinggDir + "/" + modelId + "/trainingData/unmarked/"

# MARKED_DIR_DBFS = "/dbfs" + MARKED_DIR
# UNMARKED_DIR_DBFS = "/dbfs" + UNMARKED_DIR  


import pandas as pd
import numpy as np
 
import time
import uuid
 
from tabulate import tabulate
from ipywidgets import widgets, interact, GridspecLayout
import base64

import pyspark.sql.functions as fn

##this code sets up the Zingg Python interface
from zingg.client import *
from zingg.pipes import *

def cleanModel():
    dbutils.fs.rm(MARKED_DIR, recurse=True)
    # drop unmarked data
    dbutils.fs.rm(UNMARKED_DIR, recurse=True)
    return

# assign label to candidate pair
def assign_label(candidate_pairs_pd, z_cluster, label):
  '''
  The purpose of this function is to assign a label to a candidate pair
  identified by its z_cluster value.  Valid labels include:
     0 - not matched
     1 - matched
     2 - uncertain
  '''
  
  # assign label
  candidate_pairs_pd.loc[ candidate_pairs_pd['z_cluster']==z_cluster, 'z_isMatch'] = label
  
  return
 
def count_labeled_pairs(marked_pd):
  '''
  The purpose of this function is to count the labeled pairs in the marked folder.
  '''

  n_total = len(np.unique(marked_pd['z_cluster']))
  n_positive = len(np.unique(marked_pd[marked_pd['z_isMatch']==1]['z_cluster']))
  n_negative = len(np.unique(marked_pd[marked_pd['z_isMatch']==0]['z_cluster']))
  
  return n_positive, n_negative, n_total

# setup widget 
available_labels = {
    'No Match':0,
    'Match':1,
    'Uncertain':2
    }
#dbutils.widgets.dropdown('label', 'Uncertain', available_labels.keys(), 'Is this pair a match?')




HERE
jvm
<py4j.java_gateway.JVMView object at 0x7f88cc3cd2b0>




# Start building the Zingg program 
The following cell sets up the initial arguments for Zingg. 

**No change is needed in the cell below.**

In [0]:

#build the arguments for zingg
args = Arguments()
# Set the modelid and the zingg dir. You can use this as is
args.setModelId(modelId)
args.setZinggDir(zinggDir)


# Define the input
Our data is in csv so we provide a schema. You can choose other formats like parquet by using Pipe with parquet as the format.
You can also pass in a dataframe by using a Pipe with the in memory format. 
Please refer to [Pipes](https://zingg.readthedocs.io/en/latest/zingg.html#zingg.pipes.Pipe) for details on different formats

**Please modify this for your data.**

In [0]:
schema = "recId string, fname string, lname string, stNo string, add1 string, add2 string, state string, areacode string, dob string, ssn string"
inputPipe = CsvPipe("zingg_input", "/FileStore/tables/data.csv", schema)

args.setData(inputPipe)

set schema 


# Configure the output
Here we configure the putput to be a csv, but similar to the input above, the output can be a file format like parquet or delta or a data store like MySQL

**Please modify this for your data.**

In [0]:
#setting outputpipe in 'args'
outputPipe = CsvPipe("resultFebrl", "/tmp/febrlOutput25March_1")
args.setOutput(outputPipe)

# Define the match fields and their types

The cell below is used to configure Zingg with the fields for use in matching and the match types.
Details on the field definitions can be found at [Zingg official docs](https://www.docs.zingg.ai)

**Please modify this for your data.**

In [0]:
#set field definitions 
#please change these 
recId = FieldDefinition("recId", "string", MatchTypes.FUZZY)
fname = FieldDefinition("fname", "string", MatchTypes.FUZZY)
lname = FieldDefinition("lname", "string", MatchTypes.FUZZY)
stNo = FieldDefinition("stNo", "string", MatchTypes.FUZZY)
add1 = FieldDefinition("add1","string", MatchTypes.FUZZY)
add2 = FieldDefinition("add2", "string", MatchTypes.FUZZY)
areacode = FieldDefinition("areacode", "string", MatchTypes.FUZZY)
state = FieldDefinition("state", "string", MatchTypes.FUZZY)
dob = FieldDefinition("dob", "string", MatchTypes.FUZZY)
ssn = FieldDefinition("ssn", "string", MatchTypes.FUZZY)

fieldDefs = [recId, fname, lname, stNo, add1, add2, areacode, state, dob, ssn]
args.setFieldDefinition(fieldDefs)

# Performance settings

The numPartitions define how data is split across the cluster. Please change this as per your data and cluster size by referring to the performance section of the Zingg docs.
The labelDataSampleSize is used for sampling in findTrainingData. It lets Zingg select pairs for labeling in a reasonable amount of time. 
If the findTrainingData phase is taking to much time, please reduce this by atleast 1/10th of its previous value and try again.

**Please modify this for your data.**

In [0]:

# The numPartitions define how data is split across the cluster. 
# Please change the fllowing as per your data and cluster size by referring to the docs.

args.setNumPartitions(4)
args.setLabelDataSampleSize(0.5)



# Get Zingg to find pairs for user labeling
**No change is needed in the cell below.**

In [0]:
options = ClientOptions([ClientOptions.PHASE,"generateDocs"])

#Zingg execution for the given phase
zingg = ZinggWithSpark(args, options)
zingg.initAndExecute()

['--phase', 'generateDocs']
arguments for client options are  ['--phase', 'generateDocs', '--license', 'zinggLic.txt', '--email', 'zingg@zingg.ai', '--conf', 'dummyConf.json']
jvm
<py4j.java_gateway.JVMView object at 0x7f88cc3cd2b0>


# Prepare for user labeling
**No change is needed in the cell below.**

In [0]:
options = ClientOptions([ClientOptions.PHASE,"label"])

#Zingg execution for the given phase
zingg = ZinggWithSpark(args, options)
zingg.init()

['--phase', 'label']
arguments for client options are  ['--phase', 'label', '--license', 'zinggLic.txt', '--email', 'zingg@zingg.ai', '--conf', 'dummyConf.json']
jvm
<py4j.java_gateway.JVMView object at 0x7ff1802ba270>


# See if we have records for labeling

**No change is needed to the cell below.**

In [0]:
# get candidate pairs
candidate_pairs_pd = getPandasDfFromDs(zingg.getUnmarkedRecords())
 
# if no candidate pairs, run job and wait
if candidate_pairs_pd.shape[0] == 0:
  print('No unlabeled candidate pairs found.  Run findTraining job ...')

else:
    # get list of pairs (as identified by z_cluster) to label 
    z_clusters = list(np.unique(candidate_pairs_pd['z_cluster'])) 

    # identify last reviewed cluster
    last_z_cluster = '' # none yet

    # print candidate pair stats
    print('{0} candidate pairs found for labeling'.format(len(z_clusters)))
  



21 candidate pairs found for labeling


# Start labeling to make Zingg learn how we want to match our data

**No change is needed to the cell below.**

In [0]:
# Label Training Set

# define variable to avoid duplicate saves
ready_for_save = False
print(candidate_pairs_pd)

# user-friendly labels and corresponding zingg numerical value
# (the order in the dictionary affects how displayed below)
LABELS = {
  'Uncertain':2,
  'Match':1,
  'No Match':0  
  }

# GET CANDIDATE PAIRS
# ========================================================
#candidate_pairs_pd = get_candidate_pairs()
n_pairs = int(candidate_pairs_pd.shape[0]/2)
# ========================================================

# DEFINE IPYWIDGET DISPLAY
# ========================================================
display_pd = candidate_pairs_pd.drop(
  labels=[
    'z_zid', 'z_prediction', 'z_score', 'z_isMatch', 'z_zsource'
    ], 
  axis=1)

# define header to be used with each displayed pair
html_prefix = "<p><span style='font-family:Courier New,Courier,monospace'>"
html_suffix = "</p></span>"
header = widgets.HTML(value=f"{html_prefix}<b>" + "<br />".join([str(i)+"&nbsp;&nbsp;" for i in display_pd.columns.to_list()]) + f"</b>{html_suffix}")

# initialize display
vContainers = []
vContainers.append(widgets.HTML(value=f'<h2>Indicate if each of the {n_pairs} record pairs is a match or not</h2></p>'))

# for each set of pairs
for n in range(n_pairs):

  # get candidate records
  candidate_left = display_pd.loc[2*n].to_list()
  print(candidate_left)
  candidate_right = display_pd.loc[(2*n)+1].to_list()
  print(candidate_right)

  # define grid to hold values
  html = ''

  for i in range(display_pd.shape[1]):

    # get column name
    column_name = display_pd.columns[i]

    # if field is image
    if column_name == 'image_path':

      # define row header
      html += '<tr>'
      html += '<td><b>image</b></td>'

      # read left image to encoded string
      l_endcode = ''
      if candidate_left[i] != '':
        with open(candidate_left[i], "rb") as l_file:
          l_encode = base64.b64encode( l_file.read() ).decode()

      # read right image to encoded string
      r_encode = ''
      if candidate_right[i] != '':
        with open(candidate_right[i], "rb") as r_file:
          r_encode = base64.b64encode( r_file.read() ).decode()      

      # present images
      html += f'<td><img src="data:image/png;base64,{l_encode}"></td>'
      html += f'<td><img src="data:image/png;base64,{r_encode}"></td>'
      html += '</tr>'

    elif column_name != 'image_path':  # display text values

      if column_name == 'z_cluster': z_cluster = candidate_left[i]

      html += '<tr>'
      html += f'<td style="width:10%"><b>{column_name}</b></td>'
      html += f'<td style="width:45%">{str(candidate_left[i])}</td>'
      html += f'<td style="width:45%">{str(candidate_right[i])}</td>'
      html += '</tr>'

  # insert data table
  table = widgets.HTML(value=f'<table data-title="{z_cluster}" style="width:100%;border-collapse:collapse" border="1">'+html+'</table>')
  z_cluster = None

  # assign label options to pair
  label = widgets.ToggleButtons(
    options=LABELS.keys(), 
    button_style='info'
    )

  # define blank line between displayed pair and next
  blankLine=widgets.HTML(value='<br>')

  # append pair, label and blank line to widget structure
  vContainers.append(widgets.VBox(children=[table, label, blankLine]))

# present widget
display(widgets.VBox(children=vContainers))
# ========================================================

# mark flag to allow save 
ready_for_save = True


    z_zid         z_cluster  z_prediction  ...        dob       ssn    z_zsource
0      58   1742921053392:0          -1.0  ...   19930203   4562381  zingg_input
1      26   1742921053392:0          -1.0  ...   19840802   3624304  zingg_input
2      58  1742921053392:12          -1.0  ...   19930203   4562381  zingg_input
3      44  1742921053392:12          -1.0  ...   19180205   1909717  zingg_input
4      46  1742921053392:16          -1.0  ...   19461101   4783085  zingg_input
5      44  1742921053392:16          -1.0  ...   19180205   1909717  zingg_input
6      58  1742921053392:20          -1.0  ...   19930203   4562381  zingg_input
7      11  1742921053392:20          -1.0  ...   19390410   9201057  zingg_input
8      46  1742921053392:24          -1.0  ...   19461101   4783085  zingg_input
9      11  1742921053392:24          -1.0  ...   19390410   9201057  zingg_input
10     24  1742921053392:28          -1.0  ...   19590807   2863290  zingg_input
11     11  1742921053392:28 

VBox(children=(HTML(value='<h2>Indicate if each of the 21 record pairs is a match or not</h2></p>'), VBox(chil…

# Save all the labels provided by the user 
No change is needed to the cell below.

In [0]:
if not ready_for_save:
  print('No labels have been assigned. Run the previous cell to create candidate pairs and assign labels to them before re-running this cell.')

else:

  # ASSIGN LABEL VALUE TO CANDIDATE PAIRS IN DATAFRAME
  # ========================================================
  # for each pair in displayed widget
  for pair in vContainers[1:]:

    # get pair and assigned label
    html_content = pair.children[1].get_interact_value() # the displayed pair as html
    user_assigned_label = pair.children[1].get_interact_value() # the assigned label

    # extract candidate pair id from html pair content
    start = pair.children[0].value.find('data-title="')
    if start > 0: 
      start += len('data-title="') 
      end = pair.children[0].value.find('"', start+2)
    pair_id = pair.children[0].value[start:end]



    # assign label to candidate pair entry in dataframe
    candidate_pairs_pd.loc[candidate_pairs_pd['z_cluster']==pair_id, 'z_isMatch'] = LABELS.get(user_assigned_label)
  # ========================================================

  # SAVE LABELED DATA TO ZINGG FOLDER
  # ========================================================
  # make target directory if needed
  dbutils.fs.mkdirs(MARKED_DIR)
  
  # save label assignments
  # save labels
  zingg.writeLabelledOutputFromPandas(candidate_pairs_pd,args)

  # count labels accumulated
  marked_pd_df = getPandasDfFromDs(zingg.getMarkedRecords())
  n_pos, n_neg, n_tot = count_labeled_pairs(marked_pd_df)
  print(f'You have accumulated {n_pos} pairs labeled as positive matches.')
  print(f'You have accumulated {n_neg} pairs labeled as not matches.')
  print("If you need more pairs to label, re-run the cell for 'findTrainingData'")
  # ========================================================  

  # save completed
  ready_for_save = False



You have accumulated 7 pairs labeled as positive matches.
You have accumulated 23 pairs labeled as not matches.
If you need more pairs to label, re-run the cell for 'findTrainingData'


StructType([StructField('z_zid', LongType(), True), StructField('z_cluster', StringType(), True), StructField('z_prediction', DoubleType(), True), StructField('z_score', DoubleType(), True), StructField('z_isMatch', IntegerType(), True), StructField('fname', StringType(), True), StructField('lname', StringType(), True), StructField('stNo', StringType(), True), StructField('add1', StringType(), True), StructField('add2', StringType(), True), StructField('city', StringType(), True), StructField('areacode', StringType(), True), StructField('state', StringType(), True), StructField('dob', StringType(), True), StructField('ssn', StringType(), True), StructField('z_zsource', StringType(), True)])

# Update Label

In [0]:
options = ClientOptions([ClientOptions.PHASE,"updateLabel"])
markedRecords = getPandasDfFromDs(zingg.getMarkedRecords())

#Zingg execution for the given phase
zingg = ZinggWithSpark(args, options)
zingg.init()
print(markedRecords)

['--phase', 'updateLabel']
arguments for client options are  ['--phase', 'updateLabel', '--license', 'zinggLic.txt', '--email', 'zingg@zingg.ai', '--conf', 'dummyConf.json']
jvm
<py4j.java_gateway.JVMView object at 0x7ff1802ba270>
    z_zid         z_cluster  z_prediction  ...        dob       ssn    z_zsource
0      24  1742921053392:28          -1.0  ...   19590807   2863290  zingg_input
1      11  1742921053392:28          -1.0  ...   19390410   9201057  zingg_input
2      43   1742921053392:3          -1.0  ...   19241223   7522263  zingg_input
3      13   1742921053392:3          -1.0  ...   19241223   7522263  zingg_input
4      33  1742921053392:32          -1.0  ...   19830807   2932837  zingg_input
5      11  1742921053392:32          -1.0  ...   19390410   9201057  zingg_input
6      26  1742921053392:36          -1.0  ...   19840802   3624304  zingg_input
7      11  1742921053392:36          -1.0  ...   19390410   9201057  zingg_input
8      46   1742921053392:4          -1.



# Persist the Zingg models by training on the labels and run them to predict matches
No change is needed to the cell below.

In [0]:
options = ClientOptions([ClientOptions.PHASE,"trainMatch"])

#Zingg execution for the given phase
zingg = ZinggWithSpark(args, options)
zingg.initAndExecute()

['--phase', 'trainMatch']
arguments for client options are  ['--phase', 'trainMatch', '--license', 'zinggLic.txt', '--email', 'zingg@zingg.ai', '--conf', 'dummyConf.json']
jvm
<py4j.java_gateway.JVMView object at 0x7ff1802ba270>


# Voila! We are done. Lets check the results. 

**No change is needed to the cell below.**

In [0]:
dbutils.fs.ls("/tmp")

[FileInfo(path='dbfs:/tmp/checkpoint/', name='checkpoint/', size=0, modificationTime=1742921489851),
 FileInfo(path='dbfs:/tmp/febrlOutput/', name='febrlOutput/', size=0, modificationTime=1742921489851),
 FileInfo(path='dbfs:/tmp/febrlOutput25Nov/', name='febrlOutput25Nov/', size=0, modificationTime=1742921489852),
 FileInfo(path='dbfs:/tmp/febrlOutput25Nov_1/', name='febrlOutput25Nov_1/', size=0, modificationTime=1742921489852),
 FileInfo(path='dbfs:/tmp/febrlOutput26Nov_1/', name='febrlOutput26Nov_1/', size=0, modificationTime=1742921489852),
 FileInfo(path='dbfs:/tmp/febrlOutputNov2124/', name='febrlOutputNov2124/', size=0, modificationTime=1742921489852),
 FileInfo(path='dbfs:/tmp/febrlOutputNov2124_0/', name='febrlOutputNov2124_0/', size=0, modificationTime=1742921489852),
 FileInfo(path='dbfs:/tmp/output26Nov_1/', name='output26Nov_1/', size=0, modificationTime=1742921489852),
 FileInfo(path='dbfs:/tmp/output27Nov_1/', name='output27Nov_1/', size=0, modificationTime=1742921489852

In [0]:
outputDF = spark.read.csv('/tmp/febrlOutput25Nov_1')

In [0]:

colNames = ["z_minScore", "z_maxScore", "z_cluster", "recId", "fname", "lname", "stNo", "add1", "add2", "state", "areacode", "dob", "ssn"]
outputDF.toDF(*colNames).show(10)

+------------------+------------------+---------+--------------+--------+-------------+----+------------------+-------------+-----+--------+--------+-------+
|        z_minScore|        z_maxScore|z_cluster|         recId|   fname|        lname|stNo|              add1|         add2|state|areacode|     dob|    ssn|
+------------------+------------------+---------+--------------+--------+-------------+----+------------------+-------------+-----+--------+--------+-------+
|0.9999999999999993|0.9999999999999993|        4|rec-1022-dup-1| jackson|     eglinton| 840|     fowles street|   moun tjiew|   sa|    2830|19830807|2932837|
|0.9999999999999998|0.9999999999999998|       29|  rec-1034-org| jasmine|        chang| 210|    magnolia drive|sunset valley|  vic|    3021|19930203|4562381|
|0.9999999999999998|0.9999999999999998|       18|rec-1029-dup-2|annalise|   stephenson|  81|rose scott circuit|cordoba manor|  vic|    4226|19461101|4783085|
|0.9999999999999999|0.9999999999999999|       25|  r

In [0]:
options = ClientOptions([ClientOptions.PHASE,"generateDocs"])

#Zingg execution for the given phase
zingg = ZinggWithSpark(args, options)
zingg.initAndExecute()

['--phase', 'generateDocs']
arguments for client options are  ['--phase', 'generateDocs', '--license', 'zinggLic.txt', '--email', 'zingg@zingg.ai', '--conf', 'dummyConf.json']
jvm
<py4j.java_gateway.JVMView object at 0x7ff1802ba270>


In [0]:
DOCS_DIR = zinggDir + "/" + modelId + "/docs/"

In [0]:
# check that docs succesfully generated
dbutils.fs.ls('file:'+DOCS_DIR)

[0;31m---------------------------------------------------------------------------[0m
[0;31mNameError[0m                                 Traceback (most recent call last)
File [0;32m<command-3995452315908000>, line 2[0m
[1;32m      1[0m [38;5;66;03m# check that docs succesfully generated[39;00m
[0;32m----> 2[0m dbutils[38;5;241m.[39mfs[38;5;241m.[39mls([38;5;124m'[39m[38;5;124mfile:[39m[38;5;124m'[39m[38;5;241m+[39mDOCS_DIR)

[0;31mNameError[0m: name 'DOCS_DIR' is not defined

In [0]:
displayHTML(open(DOCS_DIR+"model.html", 'r').read())

"Unmarked 0/30, Marked 30/30 (7 Matches, 23 Non-Matches, 0 Unsure)"

CLUSTER,Z_SCORE,Z_ISMATCH,RECID,FNAME,LNAME,STNO,ADD1,ADD2,AREACODE,STATE,DOB,SSN,Z_ZSOURCE
1742920958097:0,0.0,1.0,rec-1022-dup-1,jackson,eglinton,840,fowles street,moun tjiew,sa,2830,19830807,2932837,zingg_input
1742920958097:0,,,rec-1022-dup-1,jackson,eglinton,840,fowles street,moun tjiew,sa,2830,19830807,2932837,zingg_input
1742920958097:1,0.0,0.0,rec-1029-dup-2,annalise,stephenson,81,rose scott circuit,cordoba manor,vic,4226,19461101,4783085,zingg_input
1742920958097:1,,,rec-1022-dup-2,jackson,eglinton,840,fowles street,mou nview,sa,2830,19830807,2932837,zingg_input
1742920958097:10,0.0,0.0,rec-1022-dup-3,jackson,christo,840,fowles street,mountview,sa,2830,19830807,2932837,zingg_input
1742920958097:10,,,rec-1029-dup-3,kylee,turale,81,cordoba manor,ashfield,vic,4226,19461101,4783085,zingg_input
1742920958097:14,0.0,1.0,rec-1029-dup-3,kylee,turale,81,cordoba manor,ashfield,vic,4226,19461101,4783085,zingg_input
1742920958097:14,,,rec-1029-dup-3,kylee,turale,81,cordoba manor,ashfield,vic,4226,19461101,4783085,zingg_input
1742920958097:18,0.0,0.0,rec-1029-dup-3,kylee,turale,81,cordoba manor,ashfield,vic,4226,19461101,4783085,zingg_input
1742920958097:18,,,rec-1022-dup-3,jackson,christo,840,fowles street,mountview,sa,2830,19830807,2932837,zingg_input


In [0]:
displayHTML(open(DOCS_DIR+"data.html", 'r').read())

Field Name,Field Type,Nullable
cnpj,StringType,True
razao_social,StringType,True
nome_fantasia,StringType,True
endereco,StringType,True
cidade,StringType,True
is_matriz,StringType,True
setor,StringType,True


# Improve accuracy - ask Zingg to recommend stopwords for a particular column. 
These are common occuring words like pvt, ltd etc which occur across a lot of rows and do not add any meaning to the matching. You can read more about hese at [the Zingg docs](https://docs.zingg.ai/zingg-0.4.0/improving-accuracy/stopwordsremoval)

**In the following cell, we are asking for recommendations for the city column. Please edit as per your needs**

In [0]:
options = ClientOptions([ClientOptions.PHASE,"recommend", "--column", "city"])
args.setStopWordsCutoff(0.5)
#Zingg execution for the given phase
zingg = ZinggWithSpark(args, options)
zingg.initAndExecute()

['--phase', 'recommend', '--column', 'city']
arguments for client options are  ['--phase', 'recommend', '--column', 'city', '--license', 'zinggLic.txt', '--email', 'zingg@zingg.ai', '--conf', 'dummyConf.json']


# View recommendations and configure them in the Zingg arguments.

Please check the FieldDefinition's setStopWords() method to set the stop words for each field

**We are using the recommendations as is, but you can massage the values, add or remove words that matter to the dataset to get the best results**

In [0]:
#dbutils.fs.ls(zinggDir+"/"+modelId+"/stopWords")
stopwordsForCity = spark.read.csv(zinggDir+"/"+modelId+"/stopWords/city")
stopwordsForCity.show()

+------+-------+
|   _c0|    _c1|
+------+-------+
|z_word|z_count|
+------+-------+

