Merge branch '0.3.4' of https://github.com/zinggAI/zingg into 0.3.4
sonalgoyal committed Aug 5, 2022
2 parents 2ad0fb5 + 16d67ff commit 0f37318
Showing 14 changed files with 36 additions and 39 deletions.
5 changes: 3 additions & 2 deletions docs/SUMMARY.md
Original file line number Diff line number Diff line change
@@ -39,20 +39,21 @@
* [Data Sources and Sinks](dataSourcesAndSinks/connectors.md)
* [Zingg Pipes](dataSourcesAndSinks/pipes.md)
* [Snowflake](dataSourcesAndSinks/snowflake.md)
-* [Jdbc](dataSourcesAndSinks/jdbc.md)
+* [JDBC](dataSourcesAndSinks/jdbc.md)
* [Postgres](connectors/jdbc/postgres.md)
* [MySQL](connectors/jdbc/mysql.md)
* [Cassandra](dataSourcesAndSinks/cassandra.md)
* [MongoDB](dataSourcesAndSinks/mongodb.md)
* [Neo4j](dataSourcesAndSinks/neo4j.md)
* [Parquet](dataSourcesAndSinks/parquet.md)
* [BigQuery](dataSourcesAndSinks/bigquery.md)
* [Working With Python](working-with-python.md)
* [Running Zingg on Cloud](running/running.md)
* [Running on AWS](running/aws.md)
* [Running on Azure](running/azure.md)
* [Running on Databricks](running/databricks.md)
* [Zingg Models](zModels.md)
-* [Pretrained models](pretrainedModels.md)
+* [Pre-trained models](pretrainedModels.md)
* [Improving Accuracy](improving-accuracy/README.md)
* [Ignoring Commonly Occuring Words While Matching](accuracy/stopWordsRemoval.md)
* [Defining Domain Specific Blocking And Similarity Functions](accuracy/definingOwn.md)
1 change: 0 additions & 1 deletion docs/dataSourcesAndSinks/connectors.md
@@ -2,7 +2,6 @@
title: Data Sources and Sinks
nav_order: 3
has_children: true
description: Data sources and file formats supported by Zingg
---

# Data Sources and Sinks
6 changes: 1 addition & 5 deletions docs/setup/link.md
@@ -1,10 +1,6 @@
---
description: To match two datasets against each other
---

# Linking across datasets

-In many cases like reference data mastering, enrichment, etc, two individual datasets are free of duplicates but they need to be matched against each other. The link phase is used for such scenarios.
+In many cases like reference data mastering, enrichment, etc, two individual datasets are duplicates free but they need to be matched against each other. The link phase is used for such scenarios.

`./zingg.sh --phase link --conf config.json`

13 changes: 6 additions & 7 deletions docs/setup/match.md
@@ -1,18 +1,17 @@
---
layout: default
title: Find the matches
parent: Step By Step Guide
nav_order: 8
description: Identifying matching records
---

# Finding the matches

-Finds the records which match with each other.
+### match
+Finds the records which match with each other.

`./zingg.sh --phase match --conf config.json`

-As can be seen in the image below, matching records are given the same z\_cluster id. Each record also gets a z\_minScore and z\_maxScore which shows the least/greatest it matched with other records in the same cluster.
+As can be seen in the image below, matching records are given the same z_cluster id. Each record also gets a z_minScore and z_maxScore which shows the least/greatest it matched with other records in the same cluster.

-![Match results](../../assets/match.gif)
+![Match results](/assets/match.gif)

-If records across multiple sources have to be matched, the [link phase](link.md) should be used.
+If records across multiple sources have to be matched, the [link phase](./link.md) should be used.
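The z_cluster / z_minScore / z_maxScore semantics described in this file lend themselves to quick inspection once the match output lands somewhere tabular. A minimal sketch with pandas over synthetic rows — the column names come from the docs above, while the data, field name `fname`, and the 0.7 review threshold are purely illustrative:

```python
import pandas as pd

# Synthetic rows shaped like Zingg match output: matching records share a
# z_cluster id; z_minScore/z_maxScore give each record's least/greatest
# similarity with the other records in its cluster.
df = pd.DataFrame({
    "fname":      ["thomas", "tomas", "erin", "eryn"],
    "z_cluster":  [0, 0, 1, 1],
    "z_minScore": [0.81, 0.81, 0.64, 0.64],
    "z_maxScore": [0.94, 0.94, 0.77, 0.77],
})

# Group records by cluster id to view each set of matches together.
clusters = {int(cid): g["fname"].tolist() for cid, g in df.groupby("z_cluster")}
print(clusters)  # {0: ['thomas', 'tomas'], 1: ['erin', 'eryn']}

# Flag clusters whose weakest link falls below a review threshold.
needs_review = df[df["z_minScore"] < 0.7]["z_cluster"].unique().tolist()
print(needs_review)  # [1]
```

Filtering on z_minScore rather than z_maxScore is the conservative choice here: it surfaces any cluster containing at least one weakly matched record.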
10 changes: 3 additions & 7 deletions docs/setup/train.md
@@ -1,14 +1,10 @@
---
layout: default
title: Build and save the model
parent: Step By Step Guide
nav_order: 7
description: Guide to build and save model
---

# Building and saving the model

### train - training and saving the models
Builds up the Zingg models using the training data from the above phases and writes them to the folder zinggDir/modelId as specified in the config.

-```
-./zingg.sh --phase train --conf config.json
-```
+./zingg.sh --phase train --conf config.json
1 change: 0 additions & 1 deletion docs/setup/training/addOwnTrainingData.md
@@ -3,7 +3,6 @@ parent: Creating training data
nav_order: 3
title: Using preexisting training data
grand_parent: Step By Step Guide
description: Instructions on using existing training data with Zingg
---

# Using pre-existing training data
3 changes: 1 addition & 2 deletions docs/setup/training/createTrainingData.md
@@ -3,9 +3,8 @@ parent: Step By Step Guide
nav_order: 6
title: Creating training data
has_children: true
description: Guide to working with training data
---

-# Working With Training Data
+# Training data

Zingg builds models to predict similarity. Training data is needed to build these models. The next sections describe how you can use the Zingg Interactive Labeler to create the training data.
3 changes: 1 addition & 2 deletions docs/setup/training/exportLabeledData.md
@@ -3,11 +3,10 @@ parent: Creating training data
title: Exporting labeled data as csv
grand_parent: Step By Step Guide
nav_order: 4
description: Writing labeled data to CSV for exporting
---

# Exporting Labeled Data

-If we need to send our labeled data for a subject matter expert to review or if we want to build another model in a new location and [reuse training efforts](addOwnTrainingData.md) from earlier, we can write our labeled data to a CSV.
+If we need to send our labeled data for a subject matter expert to review or if we want to build another model in a new location and [reuse training effort](addOwnTrainingData.md) from earlier, we can write our labeled data to a csv

`./scripts/zingg.sh --phase exportModel --conf <path to conf> --location <folder to save the csv>`
3 changes: 1 addition & 2 deletions docs/setup/training/findAndLabel.md
@@ -3,7 +3,6 @@ parent: Creating training data
title: Find training data and labelling
grand_parent: Step By Step Guide
nav_order: 2
description: Phase which creates training data
---

# Find And Label
@@ -12,4 +11,4 @@ This phase is composed of two phases namely [findTrainingData](findTrainingData.

`./zingg.sh --phase findAndLabel --conf config.json`

-As this phase runs findTrainingData and label together, it should be run only for small datasets where findTrainingData takes a short time to run, else the user will have to wait long for the console for labeling.&#x20;
+As this is phase runs findTrainingData and label together, it should be run only for small datasets where findTrainingData takes a short time to run, else the the user will have to wait long for the console for labeling.&#x20;
2 changes: 1 addition & 1 deletion docs/setup/training/findTrainingData.md
@@ -2,7 +2,7 @@
parent: Creating training data
nav_order: 1
grand_parent: Step By Step Guide
-description: Pairs of records that could be similar to train Zingg
+description: pairs of records that could be similar to train Zingg
---

# Finding Records For Training Set Creation
2 changes: 1 addition & 1 deletion docs/setup/training/label.md
@@ -15,4 +15,4 @@ The label phase opens an interactive learner where the user can mark the pairs f

Proceed running findTrainingData followed by label phases till you have at least 30-40 positives, or when you see the predictions by Zingg converging with the output you want. At each stage, the user will get different variations of attributes across the records. Zingg performs pretty well with even a small number of training, as the samples to be labeled are chosen by the algorithm itself.

-The showConcise flag when passed to the Zingg command line only shows fields that are NOT DONT\_USE.
+The showConcise flag when passed to the Zingg command line only shows fields which are NOT DONT\_USE
@@ -1,7 +1,3 @@
---
description: Requirements to optimize the performance
---

# Tuning Label, Match And Link Jobs

#### numPartitions
4 changes: 0 additions & 4 deletions docs/updatingLabels.md
@@ -1,7 +1,3 @@
---
description: To update the existing labeled pairs as the data modifies
---

# Updating Labeled Pairs

**Please note: This is an experimental feature. Please keep a backup copy of your model folder in a separate place before running this**
18 changes: 18 additions & 0 deletions docs/working-with-python.md
@@ -0,0 +1,18 @@
---
description: A whole new way to work with Zingg!
---

# Working With Python

Instead of configuring Zingg using the JSON, we can now use Python to build and run Zingg entity and identity resolution programs. This is handy when you want to run Zingg on an existing Spark cluster. To run on local machine, please do the installation of the release before running Zingg python programs.

The Zingg Python package can be installed by invoking

`python -m pip install zingg`

Detailed documentation of the python api is available at [https://readthedocs.org/projects/zingg/](https://readthedocs.org/projects/zingg/)

Example programs for python exist under examples/febrl

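The new page stops short of a code sample. A rough sketch of what a Zingg Python program can look like, modeled on the examples/febrl program the page points to, follows — the class names, method names, schema string, and file paths here are assumptions drawn from that example and may differ across versions, so treat the readthedocs API reference as authoritative:

```python
# Sketch only: names follow the examples/febrl sample shipped with Zingg.
# Running this requires the Zingg release and a Spark runtime.
from zingg.client import Arguments, ClientOptions, FieldDefinition, MatchType, Zingg
from zingg.pipes import CsvPipe

args = Arguments()

# Field definitions tell Zingg how to compare each attribute.
fname = FieldDefinition("fname", "string", MatchType.FUZZY)
city = FieldDefinition("city", "string", MatchType.FUZZY)
args.setFieldDefinition([fname, city])

# Where the model lives, and how much data to sample for labelling.
args.setModelId("100")
args.setZinggDir("models")
args.setNumPartitions(4)
args.setLabelDataSampleSize(0.5)

# CSV in, CSV out (paths are placeholders).
schema = "fname string, city string"
args.setData(CsvPipe("input", "examples/febrl/test.csv", schema))
args.setOutput(CsvPipe("matched", "/tmp/zinggOutput"))

# Pick a phase -- findTrainingData, label, train, match, link -- and run it.
options = ClientOptions([ClientOptions.PHASE, "match"])
Zingg(args, options).initAndExecute()
```

The same Arguments object drives every phase; only the ClientOptions phase value changes between runs, mirroring the `--phase` flag of `zingg.sh`.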
