Merge branch '0.3.4' of https://github.com/zinggAI/zingg into 0.3.4
sonalgoyal committed Aug 5, 2022
2 parents 2ad0fb5 + 16d67ff commit 0f37318
Showing 14 changed files with 36 additions and 39 deletions.
5 changes: 3 additions & 2 deletions docs/SUMMARY.md
Original file line number Diff line number Diff line change
@@ -39,20 +39,21 @@
* [Data Sources and Sinks](dataSourcesAndSinks/connectors.md)
* [Zingg Pipes](dataSourcesAndSinks/pipes.md)
* [Snowflake](dataSourcesAndSinks/snowflake.md)
-* [Jdbc](dataSourcesAndSinks/jdbc.md)
+* [JDBC](dataSourcesAndSinks/jdbc.md)
* [Postgres](connectors/jdbc/postgres.md)
* [MySQL](connectors/jdbc/mysql.md)
* [Cassandra](dataSourcesAndSinks/cassandra.md)
* [MongoDB](dataSourcesAndSinks/mongodb.md)
* [Neo4j](dataSourcesAndSinks/neo4j.md)
* [Parquet](dataSourcesAndSinks/parquet.md)
* [BigQuery](dataSourcesAndSinks/bigquery.md)
* [Working With Python](working-with-python.md)
* [Running Zingg on Cloud](running/running.md)
* [Running on AWS](running/aws.md)
* [Running on Azure](running/azure.md)
* [Running on Databricks](running/databricks.md)
* [Zingg Models](zModels.md)
-* [Pretrained models](pretrainedModels.md)
+* [Pre-trained models](pretrainedModels.md)
* [Improving Accuracy](improving-accuracy/README.md)
* [Ignoring Commonly Occuring Words While Matching](accuracy/stopWordsRemoval.md)
* [Defining Domain Specific Blocking And Similarity Functions](accuracy/definingOwn.md)
1 change: 0 additions & 1 deletion docs/dataSourcesAndSinks/connectors.md
@@ -2,7 +2,6 @@
title: Data Sources and Sinks
nav_order: 3
has_children: true
description: Data sources and file formats supported by Zingg
---

# Data Sources and Sinks
6 changes: 1 addition & 5 deletions docs/setup/link.md
@@ -1,10 +1,6 @@
---
description: To match two datasets against each other
---

# Linking across datasets

-In many cases like reference data mastering, enrichment, etc, two individual datasets are free of duplicates but they need to be matched against each other. The link phase is used for such scenarios.
+In many cases like reference data mastering, enrichment, etc, two individual datasets are duplicates free but they need to be matched against each other. The link phase is used for such scenarios.

`./zingg.sh --phase link --conf config.json`

13 changes: 6 additions & 7 deletions docs/setup/match.md
@@ -1,18 +1,17 @@
---
layout: default
title: Find the matches
parent: Step By Step Guide
nav_order: 8
description: Identifying matching records
---

# Finding the matches

-Finds the records which match with each other.
+### match
+Finds the records which match with each other.

`./zingg.sh --phase match --conf config.json`

-As can be seen in the image below, matching records are given the same z\_cluster id. Each record also gets a z\_minScore and z\_maxScore which shows the least/greatest it matched with other records in the same cluster.
+As can be seen in the image below, matching records are given the same z_cluster id. Each record also gets a z_minScore and z_maxScore which shows the least/greatest it matched with other records in the same cluster.

-![Match results](../../assets/match.gif)
+![Match results](/assets/match.gif)

-If records across multiple sources have to be matched, the [link phase](link.md) should be used.
+If records across multiple sources have to be matched, the [link phase](./link.md) should be used.
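The z_cluster / z_minScore / z_maxScore semantics described in this file lend themselves to quick inspection once the match output lands somewhere tabular. A minimal sketch with pandas over synthetic rows — the column names come from the docs above, while the data, field name `fname`, and the 0.7 review threshold are purely illustrative:

```python
import pandas as pd

# Synthetic rows shaped like Zingg match output: matching records share a
# z_cluster id; z_minScore/z_maxScore give each record's least/greatest
# similarity with the other records in its cluster.
df = pd.DataFrame({
    "fname":      ["thomas", "tomas", "erin", "eryn"],
    "z_cluster":  [0, 0, 1, 1],
    "z_minScore": [0.81, 0.81, 0.64, 0.64],
    "z_maxScore": [0.94, 0.94, 0.77, 0.77],
})

# Group records by cluster id to view each set of matches together.
clusters = {int(cid): g["fname"].tolist() for cid, g in df.groupby("z_cluster")}
print(clusters)  # {0: ['thomas', 'tomas'], 1: ['erin', 'eryn']}

# Flag clusters whose weakest link falls below a review threshold.
needs_review = df[df["z_minScore"] < 0.7]["z_cluster"].unique().tolist()
print(needs_review)  # [1]
```

Filtering on z_minScore rather than z_maxScore is the conservative choice here: it surfaces any cluster containing at least one weakly matched record.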
10 changes: 3 additions & 7 deletions docs/setup/train.md
@@ -1,14 +1,10 @@
---
layout: default
title: Build and save the model
parent: Step By Step Guide
nav_order: 7
description: Guide to build and save model
---

# Building and saving the model

### train - training and saving the models
Builds up the Zingg models using the training data from the above phases and writes them to the folder zinggDir/modelId as specified in the config.

-```
-./zingg.sh --phase train --conf config.json
-```
+./zingg.sh --phase train --conf config.json
1 change: 0 additions & 1 deletion docs/setup/training/addOwnTrainingData.md
@@ -3,7 +3,6 @@ parent: Creating training data
nav_order: 3
title: Using preexisting training data
grand_parent: Step By Step Guide
description: Instructions on using existing training data with Zingg
---

# Using pre-existing training data
3 changes: 1 addition & 2 deletions docs/setup/training/createTrainingData.md
@@ -3,9 +3,8 @@ parent: Step By Step Guide
nav_order: 6
title: Creating training data
has_children: true
description: Guide to working with training data
---

-# Working With Training Data
+# Training data

Zingg builds models to predict similarity. Training data is needed to build these models. The next sections describe how you can use the Zingg Interactive Labeler to create the training data.
3 changes: 1 addition & 2 deletions docs/setup/training/exportLabeledData.md
@@ -3,11 +3,10 @@ parent: Creating training data
title: Exporting labeled data as csv
grand_parent: Step By Step Guide
nav_order: 4
description: Writing labeled data to CSV for exporting
---

# Exporting Labeled Data

-If we need to send our labeled data for a subject matter expert to review or if we want to build another model in a new location and [reuse training efforts](addOwnTrainingData.md) from earlier, we can write our labeled data to a CSV.
+If we need to send our labeled data for a subject matter expert to review or if we want to build another model in a new location and [reuse training effort](addOwnTrainingData.md) from earlier, we can write our labeled data to a csv

`./scripts/zingg.sh --phase exportModel --conf <path to conf> --location <folder to save the csv>`
3 changes: 1 addition & 2 deletions docs/setup/training/findAndLabel.md
@@ -3,7 +3,6 @@ parent: Creating training data
title: Find training data and labelling
grand_parent: Step By Step Guide
nav_order: 2
description: Phase which creates training data
---

# Find And Label
@@ -12,4 +11,4 @@ This phase is composed of two phases namely [findTrainingData](findTrainingData.

`./zingg.sh --phase findAndLabel --conf config.json`

-As this phase runs findTrainingData and label together, it should be run only for small datasets where findTrainingData takes a short time to run, else the user will have to wait long for the console for labeling.&#x20;
+As this is phase runs findTrainingData and label together, it should be run only for small datasets where findTrainingData takes a short time to run, else the the user will have to wait long for the console for labeling.&#x20;
2 changes: 1 addition & 1 deletion docs/setup/training/findTrainingData.md
@@ -2,7 +2,7 @@
parent: Creating training data
nav_order: 1
grand_parent: Step By Step Guide
-description: Pairs of records that could be similar to train Zingg
+description: pairs of records that could be similar to train Zingg
---

# Finding Records For Training Set Creation
2 changes: 1 addition & 1 deletion docs/setup/training/label.md
@@ -15,4 +15,4 @@ The label phase opens an interactive learner where the user can mark the pairs f

Proceed running findTrainingData followed by label phases till you have at least 30-40 positives, or when you see the predictions by Zingg converging with the output you want. At each stage, the user will get different variations of attributes across the records. Zingg performs pretty well with even a small number of training, as the samples to be labeled are chosen by the algorithm itself.

-The showConcise flag when passed to the Zingg command line only shows fields that are NOT DONT\_USE.
+The showConcise flag when passed to the Zingg command line only shows fields which are NOT DONT\_USE
@@ -1,7 +1,3 @@
---
description: Requirements to optimize the performance
---

# Tuning Label, Match And Link Jobs

#### numPartitions
4 changes: 0 additions & 4 deletions docs/updatingLabels.md
@@ -1,7 +1,3 @@
---
description: To update the existing labeled pairs as the data modifies
---

# Updating Labeled Pairs

**Please note: This is an experimental feature. Please keep a backup copy of your model folder in a separate place before running this**
18 changes: 18 additions & 0 deletions docs/working-with-python.md
@@ -0,0 +1,18 @@
---
description: A whole new way to work with Zingg!
---

# Working With Python

Instead of configuring Zingg using the JSON, we can now use Python to build and run Zingg entity and identity resolution programs. This is handy when you want to run Zingg on an existing Spark cluster. To run on local machine, please do the installation of the release before running Zingg python programs.

The Zingg Python package can be installed by invoking

`python -m pip install zingg`

Detailed documentation of the python api is available at [https://readthedocs.org/projects/zingg/](https://readthedocs.org/projects/zingg/)

Example programs for python exist under examples/febrl

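The new page stops short of a code sample. A rough sketch of what a Zingg Python program can look like, modeled on the examples/febrl program the page points to, follows — the class names, method names, schema string, and file paths here are assumptions drawn from that example and may differ across versions, so treat the readthedocs API reference as authoritative:

```python
# Sketch only: names follow the examples/febrl sample shipped with Zingg.
# Running this requires the Zingg release and a Spark runtime.
from zingg.client import Arguments, ClientOptions, FieldDefinition, MatchType, Zingg
from zingg.pipes import CsvPipe

args = Arguments()

# Field definitions tell Zingg how to compare each attribute.
fname = FieldDefinition("fname", "string", MatchType.FUZZY)
city = FieldDefinition("city", "string", MatchType.FUZZY)
args.setFieldDefinition([fname, city])

# Where the model lives, and how much data to sample for labelling.
args.setModelId("100")
args.setZinggDir("models")
args.setNumPartitions(4)
args.setLabelDataSampleSize(0.5)

# CSV in, CSV out (paths are placeholders).
schema = "fname string, city string"
args.setData(CsvPipe("input", "examples/febrl/test.csv", schema))
args.setOutput(CsvPipe("matched", "/tmp/zinggOutput"))

# Pick a phase -- findTrainingData, label, train, match, link -- and run it.
options = ClientOptions([ClientOptions.PHASE, "match"])
Zingg(args, options).initAndExecute()
```

The same Arguments object drives every phase; only the ClientOptions phase value changes between runs, mirroring the `--phase` flag of `zingg.sh`.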
