
Running using Databricks Connect #582 #583

Merged: 25 commits merged into zinggAI:0.3.5 on Jun 2, 2023

Conversation

vikasgupta78
Collaborator

Changes to be able to run using Databricks Connect, i.e., still invoke the Python script from the user's machine, but the actual job is run/submitted to Databricks.
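A rough sketch of what this looks like from the user's side, assuming a databricks-connect configured environment (the session setup here is illustrative, not the exact code in this PR):

from pyspark.sql import SparkSession

# With databricks-connect configured, getOrCreate() returns a session
# that talks to the remote Databricks cluster; the Python script itself
# still runs on the user's machine.
spark = SparkSession.builder.getOrCreate()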

Copy link
@sonalgoyal (Member) left a comment


Let's discuss.


public String getMsg2(double prediction, double score);

public void displayRecords(ZFrame<D, R, C> records, String preMessage, String postMessage);
Member

What happens if we have two interfaces here: TrainingDataModel and LabelDataViewHelper? The data model has methods for reading and writing training pairs, getting scores, etc. The view has the messages.

TrainingDataModel should extend ZinggBase and automatically get pipeutil and the other context stuff. ZinggBase already has the methods to get stats etc., and other methods can be moved there. You can use TDM in the labeller and labelupdater just like we use the trainer and matcher in trainmatcher.

TDM and LabelDataViewHelper are returned from Client methods and used in Python.
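A hedged sketch of how that might surface in Python, assuming the Client grows accessors for the two objects (method names are illustrative, not a final API):

client = Zingg(args, options)
tdm = client.getTrainingDataModel()      # pairs, stats, labelled output
view = client.getLabelDataViewHelper()   # messages and display helpers

unmarked = tdm.getUnmarkedRecords()
view.displayRecords(unmarked, preMessage, postMessage)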

Collaborator Author

First draft available in commit 48b2134, please review.


public void updateLabellerStat(int selected_option, int increment);

public void printMarkedRecordsStat();
Member

Won't this go in the view?

Collaborator Author

Kept update in the model and print in the view, commit 48b2134, please review.

options = ClientOptions([ClientOptions.PHASE,inpPhase])

#Zingg execution for the given phase
zingg = Zingg(args, options)
Member

The labeler should get kicked off automatically in execute based on the phase. The user should not have to program anything here.
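In other words, the user script should reduce to something like this (a sketch of the proposal; the dispatch to the labeler inside the entry point is what is being suggested, not current behavior):

options = ClientOptions([ClientOptions.PHASE, inpPhase])
zingg = Zingg(args, options)
zingg.initAndExecute()   # labeler kicks off internally when the phase is 'label'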

Member

Zingg usage should be zingg.sh --run pyprog.

Collaborator Author

handled in commit 26f1135

@@ -0,0 +1,64 @@
from zingg.client import *
Member

Can we create one single file defining the data schema etc. and use that for both Databricks and local? Only the locations of the zinggDir etc. will change in the Databricks-specific file.
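For example (a sketch with hypothetical file names; the Arguments setters shown exist in the Python client):

# common_config.py -- schema and model definition shared by both environments
from zingg.client import Arguments

def build_args(zingg_dir, model_id):
    args = Arguments()
    args.setZinggDir(zingg_dir)
    args.setModelId(model_id)
    # field definitions, input/output pipes etc. defined once here
    return args

# databricks_runner.py -- only the locations change
from common_config import build_args
args = build_args('/dbfs/zingg/models', '100')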


6. Now run Zingg using the shell script with the --run option. The Spark session will be created remotely on Databricks and the job will run in your Databricks environment.
https://docs.zingg.ai/zingg/stepbystep/zingg-command-line

# Running on Databricks

The cloud environment does not have the system console needed for the labeler to work. Zingg is run as a Spark Submit job along with a Python notebook-based labeler specially created to run within the Databricks cloud.
Member

Aren't we giving a labeler to the user on the client machine?

Collaborator Author

done in commit a3ddd46 with shell script changes

@@ -33,12 +35,12 @@ public void execute() throws ZinggClientException {
}
}

public void processRecordsCli(ZFrame<D,R,C> lines) throws ZinggClientException {
public ZFrame<D,R,C> processRecordsCli(ZFrame<D,R,C> lines) throws ZinggClientException {
Member

Why do we need to return a ZFrame here?

Collaborator Author

This is done so that writing of the labelled output happens in a separate method. This is needed for the Python API to work.
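Roughly, the Python side can then do the two steps separately (a sketch mirroring the Java method names discussed here; the variable names are illustrative):

# Label in one step, persist in another -- processRecordsCli now
# returns the updated frame instead of writing it internally.
updated = labeller.processRecordsCli(unmarked_records)
training_helper.writeLabelledOutput(updated, args)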

processRecordsCli(unmarkedRecords);
ZFrame<D,R,C> updatedLabelledRecords = processRecordsCli(unmarkedRecords);
if (updatedLabelledRecords != null) {
getTrainingHelper().writeLabelledOutput(updatedLabelledRecords,args);
Member

move null check to the method writeLabelledOutput

Collaborator Author

fixed in commit 36878b7

notSurePairsCount = getUnsureMarkedRecordsStat(markedRecords);
totalCount = markedRecords.count() / 2;
}
}

public ZFrame<D,R,C> getUnmarkedRecords() {
Member

Aren't these methods already defined in zinggbase/trainingdatahelper?

Collaborator Author

removed duplication in commit 48b2134

updateLabellerStat(selected_option, 1);
printMarkedRecordsStat();
getTrainingHelper().updateLabellerStat(selected_option, 1);
getTrainingHelper().printMarkedRecordsStat();
if (selected_option == 9) {
Member

Make 9 a constant in the view.

Collaborator Author

done in commit 552d091

//String msgHeader = msg1 + msg2;

selected_option = displayRecordsAndGetUserInput(getDSUtil().select(currentPair, displayCols), msg1, msg2);
updateLabellerStat(selected_option, 1);
printMarkedRecordsStat();
getTrainingHelper().updateLabellerStat(selected_option, 1);
Member

Not sure what's 1 here. Please check.

Collaborator Author

constant INCREMENT = 1 defined in commit ea7e8f4

global _spark_ctxt
global _sqlContext
global _spark
jar_path = os.getenv('ZINGG_HOME')+'/zingg-0.3.5-SNAPSHOT.jar'
Member

Move the name of the jar to a global constant up in the code.
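Something like (a sketch):

# Module-level constant near the top of the file
ZINGG_JAR = 'zingg-0.3.5-SNAPSHOT.jar'

jar_path = os.getenv('ZINGG_HOME') + '/' + ZINGG_JAR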

Collaborator Author

Done in commit 47b493a.

_sqlContext = SQLContext(_spark_ctxt)
return 1

def initClient():
Member

We can edit the zingg script to have a new option --run-databricks so that the user doesn't have to set the env. It is more explicit and gives the user the ability to run locally or remotely within the same env.
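A sketch of how the client could branch, assuming the shell script's --run-databricks option exports a flag before invoking Python (the env var and helper names are illustrative):

import os

def initClient():
    # The zingg script would set this when invoked with --run-databricks
    if os.getenv('RUN_DATABRICKS') == 'true':
        return initDataBricksConnectClient()
    return initSparkClient()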

Collaborator Author

done in commit a3ddd46

_spark_ctxt = SparkContext.getOrCreate()
_sqlContext = SQLContext(_spark_ctxt)
_spark = SparkSession.builder.getOrCreate()
return 1
Member

Why return 1?

Collaborator Author

To signal that everything completed without error, in case the calling code wants to check.

@sonalgoyal merged commit 28dd2a9 into zinggAI:0.3.5 on Jun 2, 2023