
Running using Databricks Connect #582 #583

Merged: 25 commits merged into zinggAI:0.3.5 on Jun 2, 2023

Conversation

vikasgupta78
Collaborator

Changes to be able to run using Databricks Connect, i.e., still invoke the Python script from the user's machine, but the actual job is run/submitted to Databricks.
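A rough sketch of what this looks like from the user's side, assuming a databricks-connect configured environment (the session setup here is illustrative, not the exact code in this PR):

from pyspark.sql import SparkSession

# With databricks-connect configured, getOrCreate() returns a session
# that talks to the remote Databricks cluster; the Python script itself
# still runs on the user's machine.
spark = SparkSession.builder.getOrCreate()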

Copy link
@sonalgoyal (Member) left a comment


Let's discuss.


public String getMsg2(double prediction, double score);

public void displayRecords(ZFrame<D, R, C> records, String preMessage, String postMessage);
Member

What happens if we have two interfaces here: TrainingDataModel and LabelDataViewHelper? The data model has methods for reading and writing training pairs, getting scores, etc. The view has the messages.

TrainingDataModel should extend ZinggBase and automatically get pipeutil and the other context stuff. ZinggBase already has the methods to get stats etc., and other methods can be moved there. You can use TDM in the labeller and labelupdater just like we use the trainer and matcher in trainmatcher.

TDM and LabelDataViewHelper are returned from Client methods and used in Python.
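A hedged sketch of how that might surface in Python, assuming the Client grows accessors for the two objects (method names are illustrative, not a final API):

client = Zingg(args, options)
tdm = client.getTrainingDataModel()      # pairs, stats, labelled output
view = client.getLabelDataViewHelper()   # messages and display helpers

unmarked = tdm.getUnmarkedRecords()
view.displayRecords(unmarked, preMessage, postMessage)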

Collaborator Author

First draft available in commit 48b2134, please review.


public void updateLabellerStat(int selected_option, int increment);

public void printMarkedRecordsStat();
Member

Won't this go in the view?

Collaborator Author

Kept update in the model and print in the view, commit 48b2134, please review.

options = ClientOptions([ClientOptions.PHASE,inpPhase])

#Zingg execution for the given phase
zingg = Zingg(args, options)
Member

The labeler should get kicked off automatically in execute based on the phase. The user should not have to program anything here.
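In other words, the user script should reduce to something like this (a sketch of the proposal; the dispatch to the labeler inside the entry point is what is being suggested, not current behavior):

options = ClientOptions([ClientOptions.PHASE, inpPhase])
zingg = Zingg(args, options)
zingg.initAndExecute()   # labeler kicks off internally when the phase is 'label'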

Member

Zingg usage should be zingg.sh --run pyprog.

Collaborator Author

handled in commit 26f1135

@@ -0,0 +1,64 @@
from zingg.client import *
Member

Can we create one single file defining the data schema etc. and use that for both Databricks and local? Only the locations of the zinggDir etc. will change in the Databricks-specific file.
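For example (a sketch with hypothetical file names; the Arguments setters shown exist in the Python client):

# common_config.py -- schema and model definition shared by both environments
from zingg.client import Arguments

def build_args(zingg_dir, model_id):
    args = Arguments()
    args.setZinggDir(zingg_dir)
    args.setModelId(model_id)
    # field definitions, input/output pipes etc. defined once here
    return args

# databricks_runner.py -- only the locations change
from common_config import build_args
args = build_args('/dbfs/zingg/models', '100')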


6. Now run Zingg using the shell script with the --run option. The Spark session will be created remotely on Databricks and the job will run in your Databricks environment.
https://docs.zingg.ai/zingg/stepbystep/zingg-command-line

# Running on Databricks

The cloud environment does not have the system console needed for the labeler to work. Zingg is run as a Spark Submit job along with a Python notebook-based labeler specially created to run within the Databricks cloud.
Member

Aren't we giving a labeler to the user on the client machine?

Collaborator Author

done in commit a3ddd46 with shell script changes

@@ -33,12 +35,12 @@ public void execute() throws ZinggClientException {
}
}

public void processRecordsCli(ZFrame<D,R,C> lines) throws ZinggClientException {
public ZFrame<D,R,C> processRecordsCli(ZFrame<D,R,C> lines) throws ZinggClientException {
Member

Why do we need to return a ZFrame here?

Collaborator Author

This is done so that writing of the labelled output happens in a separate method. This is needed for the Python API to work.
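Roughly, the Python side can then do the two steps separately (a sketch mirroring the Java method names discussed here; the variable names are illustrative):

# Label in one step, persist in another -- processRecordsCli now
# returns the updated frame instead of writing it internally.
updated = labeller.processRecordsCli(unmarked_records)
training_helper.writeLabelledOutput(updated, args)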

processRecordsCli(unmarkedRecords);
ZFrame<D,R,C> updatedLabelledRecords = processRecordsCli(unmarkedRecords);
if (updatedLabelledRecords != null) {
getTrainingHelper().writeLabelledOutput(updatedLabelledRecords,args);
Member

move null check to the method writeLabelledOutput

Collaborator Author

fixed in commit 36878b7

notSurePairsCount = getUnsureMarkedRecordsStat(markedRecords);
totalCount = markedRecords.count() / 2;
}
}

public ZFrame<D,R,C> getUnmarkedRecords() {
Member

Aren't these methods already defined in zinggbase/trainingdatahelper?

Collaborator Author

removed duplication in commit 48b2134

updateLabellerStat(selected_option, 1);
printMarkedRecordsStat();
getTrainingHelper().updateLabellerStat(selected_option, 1);
getTrainingHelper().printMarkedRecordsStat();
if (selected_option == 9) {
Member

Make 9 a constant in the view.

Collaborator Author

done in commit 552d091

//String msgHeader = msg1 + msg2;

selected_option = displayRecordsAndGetUserInput(getDSUtil().select(currentPair, displayCols), msg1, msg2);
updateLabellerStat(selected_option, 1);
printMarkedRecordsStat();
getTrainingHelper().updateLabellerStat(selected_option, 1);
Member

Not sure what's 1 here. Please check.

Collaborator Author

constant INCREMENT = 1 defined in commit ea7e8f4

global _spark_ctxt
global _sqlContext
global _spark
jar_path = os.getenv('ZINGG_HOME')+'/zingg-0.3.5-SNAPSHOT.jar'
Member

Move the name of the jar to a global constant up in the code.
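Something like (a sketch):

# Module-level constant near the top of the file
ZINGG_JAR = 'zingg-0.3.5-SNAPSHOT.jar'

jar_path = os.getenv('ZINGG_HOME') + '/' + ZINGG_JAR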

Collaborator Author

Done in commit 47b493a.

_sqlContext = SQLContext(_spark_ctxt)
return 1

def initClient():
Member

We can edit the zingg script to have a new option --run-databricks so that the user doesn't have to set the env. It is more explicit and gives the user the ability to run locally or remotely within the same env.
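A sketch of how the client could branch, assuming the shell script's --run-databricks option exports a flag before invoking Python (the env var and helper names are illustrative):

import os

def initClient():
    # The zingg script would set this when invoked with --run-databricks
    if os.getenv('RUN_DATABRICKS') == 'true':
        return initDataBricksConnectClient()
    return initSparkClient()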

Collaborator Author

done in commit a3ddd46

_spark_ctxt = SparkContext.getOrCreate()
_sqlContext = SQLContext(_spark_ctxt)
_spark = SparkSession.builder.getOrCreate()
return 1
Member

Why return 1?

Collaborator Author

To signal that everything completed without error, in case the calling code wants to check.

@sonalgoyal merged commit 28dd2a9 into zinggAI:0.3.5 on Jun 2, 2023