##### **** These pip installs need to be adapted to the appropriate release level. Alternatively, the venv running JupyterLab could be pre-configured with a requirements file that includes the right release. Example for transform developers working from a git clone:
```
make venv
source venv/bin/activate && pip install jupyterlab
```

In [1]:
%%capture
## This is here as a reference only
# Users and application developers must use the right tag for the latest release from PyPI
#!pip install data-prep-toolkit
#!pip install data-prep-toolkit-transforms
#!pip install data-prep-connector

##### ***** Import required classes and modules

In [2]:
import ast
import os
import sys

from data_processing.utils import ParamsUtils
from fdedup_transform_python import parse_args, ServiceOrchestrator

##### ***** Setup runtime parameters for this transform
Only the parameters used in this example are described here. For a complete list of parameters, please refer to the README.md for this transform:
|parameter:type | value | description |
|-|-|-|
| input_folder:str | \${PWD}/python/test-data/input/ | folder that contains the input parquet files for the fuzzy dedup algorithm |
| output_folder:str | \${PWD}/python/output/ | folder that contains all the intermediate results and the output parquet files for the fuzzy dedup algorithm |
| contents_column:str | contents | name of the column that stores document text |
| document_id_column:str | int_id_column | name of the column that stores document ID |
| num_permutations:int | 112 | number of permutations to use for minhash calculation |
| num_bands:int | 14 | number of bands to use for band hash calculation |
| num_minhashes_per_band:int | 8 | number of minhashes to use in each band |
| operation_mode:{filter_duplicates,filter_non_duplicates,annotate} | filter_duplicates | operation mode for data cleanup: filter out duplicates/non-duplicates, or annotate duplicate documents |
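The values of `num_permutations`, `num_bands` and `num_minhashes_per_band` are related: 14 bands of 8 minhashes each use all 112 permutations. The sketch below illustrates how the banding parameters shape duplicate detection, using the textbook LSH banding formula `1 - (1 - s^r)^b`. The helper `candidate_probability` is a hypothetical name introduced for illustration and is not part of the transform; whether the transform enforces the product constraint is an assumption here.

```python
def candidate_probability(similarity, num_bands, rows_per_band):
    """Probability that two documents with the given Jaccard similarity
    agree on every minhash in at least one band (textbook LSH banding)."""
    return 1.0 - (1.0 - similarity ** rows_per_band) ** num_bands


# Values taken from the parameter table above
num_permutations = 112
num_bands = 14
num_minhashes_per_band = 8

# Assumption: the bands should not need more minhashes than were computed
assert num_bands * num_minhashes_per_band <= num_permutations

# Print the candidate probability for a few similarity levels
for s in (0.5, 0.75, 0.9):
    p = candidate_probability(s, num_bands, num_minhashes_per_band)
    print(f"similarity={s:.2f} -> candidate probability={p:.3f}")
```

More bands with fewer minhashes per band make the detection more permissive; fewer bands with more minhashes per band make it stricter.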

In [3]:
# create parameters
input_folder = os.path.join(os.path.abspath(""), "python", "test-data", "input")
output_folder = os.path.join(os.path.abspath(""), "python", "output")
params = {
    # transform configuration parameters
    "input_folder": input_folder,
    "output_folder": output_folder,
    "contents_column": "contents",
    "document_id_column": "int_id_column",
    "num_permutations": 112,
    "num_bands": 14,
    "num_minhashes_per_band": 8,
    "operation_mode": "filter_duplicates",
}

##### ***** Use python runtime to invoke each transform in the fuzzy dedup pipeline

In [4]:

sys.argv = ParamsUtils.dict_to_req(d=params)
args = parse_args()
# Initialize the orchestrator
orchestrator = ServiceOrchestrator(global_params=args)
# Launch python fuzzy dedup execution
orchestrator.orchestrate()

13:30:29 INFO - Starting SignatureCalculation step
13:30:29 INFO - Got parameters for SignatureCalculation
13:30:29 INFO - minhash parameters are : {'document_id_column': 'int_id_column', 'contents_column': 'contents', 'seed': 42, 'num_permutations': 112, 'jaccard_similarity_threshold': 0.75, 'word_shingle_size': 5, 'num_bands': 14, 'num_minhashes_per_band': 8, 'num_segments': 1, 'shingle_option': 'word'}
13:30:29 INFO - data factory scdata_ is using local configuration without input/output path
13:30:29 INFO - data factory scdata_ max_files -1, n_sample -1
13:30:29 INFO - data factory scdata_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
13:30:29 INFO - pipeline id pipeline_id
13:30:29 INFO - code location None
13:30:29 INFO - data factory data_ is using local data access: input_folder - /Users/touma/data-prep-kit/transforms/universal/fdedup/python/test-data/input output_folder - /Users/touma/data
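
The minhash parameters logged above include `'shingle_option': 'word'`, `'word_shingle_size': 5` and `'jaccard_similarity_threshold': 0.75`. As a rough illustration of what these settings mean, the sketch below builds 5-word shingles and computes the exact Jaccard similarity between two shingle sets. The function names `word_shingles` and `jaccard` and the sample sentences are hypothetical; the transform itself estimates this similarity with minhash signatures rather than comparing sets directly, and its tokenization may differ.

```python
def word_shingles(text, size=5):
    """Return the set of overlapping word n-grams (shingles) of the given size."""
    words = text.split()
    if not words:
        return set()
    if len(words) < size:
        # Short text: treat the whole text as a single shingle
        return {" ".join(words)}
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical sample documents, not taken from the test data
doc_a = "the store opens a new location in the mall this fall"
doc_b = "the store opens a new location in the mall next spring"

similarity = jaccard(word_shingles(doc_a), word_shingles(doc_b))
print(f"Jaccard similarity: {similarity:.3f}")
```

Document pairs whose similarity is at or above the threshold are the ones a fuzzy dedup step would aim to treat as duplicates.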

##### **** The output folder will contain the transformed parquet files.

In [5]:
import glob
glob.glob("python/output/cleaned/*")

['python/output/cleaned/metadata.json',
 'python/output/cleaned/data_1',
 'python/output/cleaned/data_2']
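
The listing above shows only the top level of the `cleaned` folder. To see the parquet files inside `data_1` and `data_2`, a recursive search can be used. This is a minimal sketch with the standard library; `list_parquet_files` is a hypothetical helper, not part of the toolkit.

```python
from pathlib import Path


def list_parquet_files(root):
    """Return the sorted relative paths of all .parquet files under root."""
    root = Path(root)
    if not root.is_dir():
        # Nothing to list if the folder does not exist yet
        return []
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.parquet"))


# Folder name taken from the glob call above
for name in list_parquet_files("python/output/cleaned"):
    print(name)
```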

##### ***** Print the input data

In [6]:
import polars as pl
input_df_1 = pl.read_parquet(os.path.join(os.path.abspath(""), "python", "test-data", "input", "data_1", "df1.parquet"))
input_df_2 = pl.read_parquet(os.path.join(os.path.abspath(""), "python", "test-data", "input", "data_2", "df2.parquet"))
input_df = input_df_1.vstack(input_df_2)

with pl.Config(fmt_str_lengths=10000000, tbl_rows=-1):
    print(input_df)

shape: (12, 2)
┌───────────────┬──────────────────────────────────────────────────────────────────────────────────┐
│ int_id_column ┆ contents                                                                         │
│ ---           ┆ ---                                                                              │
│ i64           ┆ str                                                                              │
╞═══════════════╪══════════════════════════════════════════════════════════════════════════════════╡
│ 1             ┆ Von Maur Department Store Opens Third Location in Michigan                       │
│               ┆ PR Newswire October 12, 2019                                                     │
│               ┆ 145-year-old Retailer Anchors Woodland Mall Just Outside Grand Rapids;           │
│               ┆ New Location Continues Strategic National Expansion Plans                        │
│               ┆ DAVENPORT, Iowa, Oct. 12, 2019 /PRNewswire/ -- Von Maur De

##### ***** Print the output result

In [7]:
import polars as pl
output_df_1 = pl.read_parquet(os.path.join(os.path.abspath(""), "python", "output", "cleaned", "data_1", "df1.parquet"))
output_df_2 = pl.read_parquet(os.path.join(os.path.abspath(""), "python", "output", "cleaned", "data_2", "df2.parquet"))
output_df = output_df_1.vstack(output_df_2)
with pl.Config(fmt_str_lengths=10000000, tbl_rows=-1):
    print(output_df)

shape: (4, 2)
┌───────────────┬──────────────────────────────────────────────────────────────────────────────────┐
│ int_id_column ┆ contents                                                                         │
│ ---           ┆ ---                                                                              │
│ i64           ┆ str                                                                              │
╞═══════════════╪══════════════════════════════════════════════════════════════════════════════════╡
│ 1             ┆ Von Maur Department Store Opens Third Location in Michigan                       │
│               ┆ PR Newswire October 12, 2019                                                     │
│               ┆ 145-year-old Retailer Anchors Woodland Mall Just Outside Grand Rapids;           │
│               ┆ New Location Continues Strategic National Expansion Plans                        │
│               ┆ DAVENPORT, Iowa, Oct. 12, 2019 /PRNewswire/ -- Von Maur Dep