Running using Databricks Connect #582 #583
Conversation
Let's discuss.
```java
public String getMsg2(double prediction, double score);

public void displayRecords(ZFrame<D, R, C> records, String preMessage, String postMessage);
```
What happens if we have two interfaces here: TrainingDataModel and LabelDataViewHelper? The data model has methods for reading and writing training pairs, getting scores, etc. The view has the messages.
TrainingDataModel should extend ZinggBase and automatically get pipeutil and the other context stuff. ZinggBase already has the methods to get stats etc., and the other methods can be moved there. You can use TDM in the labeller and label updater just as we use the trainer and matcher in TrainMatcher.
TDM and LabelDataViewHelper are returned from Client methods and used in Python.
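The model/view split described above could be sketched from the Python side roughly as follows. All class and method names here are assumptions based on the comment, not the actual Zingg API:

```python
# Hypothetical sketch of the two-interface split proposed in the review.
# TrainingDataModel owns state (training pairs, labeller stats);
# LabelDataViewHelper owns only the messages shown to the user.

class TrainingDataModel:
    """Model side: training pairs and labeller statistics (assumed shape)."""

    def __init__(self):
        self.positive_pairs = 0
        self.negative_pairs = 0

    def update_labeller_stat(self, selected_option, increment):
        # Assumed encoding: 1 = match, anything else = no match.
        if selected_option == 1:
            self.positive_pairs += increment
        else:
            self.negative_pairs += increment


class LabelDataViewHelper:
    """View side: message formatting only, no state."""

    def get_msg2(self, prediction, score):
        return "Zingg predicts %s with score %.2f" % (prediction, score)


# Both objects would be handed to Python by Client methods, per the comment.
tdm = TrainingDataModel()
tdm.update_labeller_stat(1, 1)
view = LabelDataViewHelper()
msg = view.get_msg2("MATCH", 0.92)
```

The point of the split is that the labeller loop can mutate the model while the notebook/CLI front end swaps in its own view.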
First draft available in commit 48b2134; please review.
```java
public void updateLabellerStat(int selected_option, int increment);

public void printMarkedRecordsStat();
```
Won't this go in the view?
Kept the update in the model and the print in the view; commit 48b2134, please review.
```python
options = ClientOptions([ClientOptions.PHASE, inpPhase])

# Zingg execution for the given phase
zingg = Zingg(args, options)
```
The labeler should get kicked off automatically in execute based on the phase. The user should not have to program anything here.
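What "kicked off automatically based on the phase" could look like, as a minimal dispatch sketch. The handler table and phase names are hypothetical; Zingg's actual execute logic differs:

```python
def run_phase(phase, handlers):
    # Dispatch to whatever is registered for the given phase;
    # the user only sets the phase, never invokes the labeler directly.
    if phase not in handlers:
        raise ValueError("unknown phase: " + phase)
    return handlers[phase]()

# Hypothetical phase handlers for illustration only.
handlers = {
    "label": lambda: "labeler started",
    "trainMatch": lambda: "train+match started",
}

result = run_phase("label", handlers)
```

With this shape, the snippet under review would reduce to constructing `Zingg(args, options)` and calling execute, with no labeler-specific code in the user's script.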
Zingg usage should be `zingg.sh --run pyprog`.
handled in commit 26f1135
```diff
@@ -0,0 +1,64 @@
+from zingg.client import *
```
Can we create one single file defining the data schema etc. and use it for both Databricks and local? Only the locations of the zinggDir etc. will change in the Databricks-specific file.
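One way to realize the single shared file this comment asks for. The module layout, field definitions, and paths below are all illustrative assumptions:

```python
# common_config.py -- hypothetical shared schema module, imported by both
# the local and the Databricks driver scripts.

FIELD_DEFS = [
    {"name": "fname", "type": "string", "match": "FUZZY"},
    {"name": "lname", "type": "string", "match": "FUZZY"},
]

def build_locations(zingg_dir, model_id):
    # Only these paths differ between environments, e.g. "/tmp/zingg"
    # locally vs "/dbfs/zingg" on Databricks (assumed locations).
    return {
        "zinggDir": zingg_dir,
        "modelDir": zingg_dir + "/" + model_id,
    }

local = build_locations("/tmp/zingg", "100")
dbx = build_locations("/dbfs/zingg", "100")
```

The schema lives in one place; each environment-specific file only supplies its own directory roots.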
6. Now run Zingg using the shell script with the `--run` option; the Spark session will be made remotely to Databricks and the job will run in your Databricks environment.
   https://docs.zingg.ai/zingg/stepbystep/zingg-command-line

# Running on Databricks

The cloud environment does not have the system console needed for the labeler to work. Zingg is run as a Spark Submit job along with a Python notebook-based labeler specially created to run within the Databricks cloud.
Aren't we giving the user a labeler on the client machine?
done in commit a3ddd46 with shell script changes
```diff
@@ -33,12 +35,12 @@ public void execute() throws ZinggClientException {
     }
 }

-public void processRecordsCli(ZFrame<D,R,C> lines) throws ZinggClientException {
+public ZFrame<D,R,C> processRecordsCli(ZFrame<D,R,C> lines) throws ZinggClientException {
```
Why do we need to return a ZFrame here?
This is done so that writing of the labelled output happens in a separate method. This is needed for the Python API to work.
```diff
-processRecordsCli(unmarkedRecords);
+ZFrame<D,R,C> updatedLabelledRecords = processRecordsCli(unmarkedRecords);
+if (updatedLabelledRecords != null) {
+    getTrainingHelper().writeLabelledOutput(updatedLabelledRecords, args);
```
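The return-then-write pattern being discussed can be sketched in plain Python. The function names mirror the Java code, but the bodies here are stand-ins, not the real implementation:

```python
def process_records_cli(lines):
    # Collect the user's labels and RETURN them instead of writing here,
    # so a Python caller can intercept the labelled records and write
    # them itself (the reason given in the reply above).
    return [dict(rec, label=1) for rec in lines]

def write_labelled_output(records, sink):
    # Empty/None check lives with the write, per the review suggestion.
    if not records:
        return 0
    sink.extend(records)
    return len(records)

sink = []
labelled = process_records_cli([{"id": 1}, {"id": 2}])
written = write_labelled_output(labelled, sink)
```

Splitting "produce labels" from "persist labels" is what lets a notebook labeler reuse the first half without the console write.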
Move the null check into the writeLabelledOutput method.
fixed in commit 36878b7
```java
        notSurePairsCount = getUnsureMarkedRecordsStat(markedRecords);
        totalCount = markedRecords.count() / 2;
    }
}

public ZFrame<D,R,C> getUnmarkedRecords() {
```
Aren't these methods already defined in ZinggBase/TrainingDataHelper?
removed duplication in commit 48b2134
```diff
-updateLabellerStat(selected_option, 1);
-printMarkedRecordsStat();
+getTrainingHelper().updateLabellerStat(selected_option, 1);
+getTrainingHelper().printMarkedRecordsStat();
 if (selected_option == 9) {
```
Make 9 a constant in the view.
done in commit 552d091
```diff
 //String msgHeader = msg1 + msg2;

 selected_option = displayRecordsAndGetUserInput(getDSUtil().select(currentPair, displayCols), msg1, msg2);
-updateLabellerStat(selected_option, 1);
-printMarkedRecordsStat();
+getTrainingHelper().updateLabellerStat(selected_option, 1);
```
Not sure what the 1 is here; please check.
constant INCREMENT = 1 defined in commit ea7e8f4
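The two constants discussed in this thread could look roughly like this in view-side code. Names mirror the review comments; the handler and its return convention are illustrative assumptions:

```python
# Hypothetical view-side constants, replacing the magic numbers above.
INCREMENT = 1      # each labelling decision counts once in the stats
QUIT_OPTION = 9    # the "9" that previously appeared inline

def handle_option(selected_option, stats):
    # Update the running count and report whether the user chose to quit.
    stats["count"] = stats.get("count", 0) + INCREMENT
    return selected_option == QUIT_OPTION

stats = {}
done = handle_option(QUIT_OPTION, stats)
```

Naming both values makes the `if (selected_option == 9)` branch and the `updateLabellerStat(selected_option, 1)` call self-documenting.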
python/zingg/client.py
Outdated
```python
global _spark_ctxt
global _sqlContext
global _spark
jar_path = os.getenv('ZINGG_HOME')+'/zingg-0.3.5-SNAPSHOT.jar'
```
Move the name of the jar to a global constant at the top of the code.
commit 47b493a
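The refactor suggested above might look like the following (constant name and helper are assumptions, not the committed code):

```python
# Hypothetical refactor: the jar file name pulled out into a module-level
# constant instead of being hard-coded inside the function body.
ZINGG_JAR = 'zingg-0.3.5-SNAPSHOT.jar'

def build_jar_path(zingg_home):
    # Mirrors jar_path = os.getenv('ZINGG_HOME') + '/' + <jar name>.
    return zingg_home + '/' + ZINGG_JAR

path = build_jar_path('/opt/zingg')
```

A single constant also means only one line changes when the version bumps past 0.3.5-SNAPSHOT.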
```python
_sqlContext = SQLContext(_spark_ctxt)
return 1

def initClient():
```
We can edit the zingg script to add a new option `--run-databricks` so that the user doesn't have to set the env. It is more explicit and gives the user the ability to run locally or remotely within the same env.
done in commit a3ddd46
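The explicit-flag idea can be sketched as option handling (shown here in Python for illustration; the real `zingg.sh` is a shell script, and the flag semantics are an assumption):

```python
def choose_mode(argv):
    # "--run-databricks" makes remote execution explicit instead of
    # relying on an environment variable being set beforehand.
    if "--run-databricks" in argv:
        return "databricks"
    if "--run" in argv:
        return "local"
    raise ValueError("no run mode given")

mode = choose_mode(["zingg.sh", "--run-databricks", "pyprog.py"])
```

Both modes then coexist in one environment, selected per invocation.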
```python
_spark_ctxt = SparkContext.getOrCreate()
_sqlContext = SQLContext(_spark_ctxt)
_spark = SparkSession.builder.getOrCreate()
return 1
```
why return 1?
To signal that everything completed without error, in case the calling code wants to check.
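That convention, reduced to a minimal sketch (the setup body is a stand-in, not the real Spark initialization):

```python
def init_client():
    # Hypothetical sketch of the status convention described above:
    # return 1 on success, 0 on failure, so callers may optionally check.
    try:
        ctx = {"initialized": True}  # stand-in for SparkContext setup
        assert ctx["initialized"]
        return 1
    except Exception:
        return 0

status = init_client()
```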
0.3.5 sync
Changes to be able to run using Databricks Connect, i.e. still invoke the Python script from the user's machine, but the actual job would be run/submitted to Databricks.