Removed EMAIL, LICENSE, SPARK_MEM and elastic references from zingg.sh #253

Merged (7 commits) on May 26, 2022
4 changes: 2 additions & 2 deletions client/src/main/java/zingg/client/Client.java
@@ -160,7 +160,7 @@ else if (options.get(ClientOptions.CONF).value.endsWith("env")) {
LOG.warn("Zingg processing has completed");
}
catch(ZinggClientException e) {
if (options != null) {
if (options != null && options.get(ClientOptions.EMAIL) != null) {
Email.email(options.get(ClientOptions.EMAIL).value, new EmailBody("Error running Zingg job",
"Zingg Error ",
e.getMessage()));
@@ -186,7 +186,7 @@ else if (options.get(ClientOptions.CONF).value.endsWith("env")) {
}
}
catch(ZinggClientException e) {
if (options != null) {
if (options != null && options.get(ClientOptions.EMAIL) != null) {
Email.email(options.get(ClientOptions.EMAIL).value, new EmailBody("Error running Zingg job",
"Zingg Error ",
e.getMessage()));
26 changes: 26 additions & 0 deletions config/zingg.conf
@@ -0,0 +1,26 @@
# file config/zingg.conf
# This file defines default Spark properties. These properties are passed to 'spark-submit' as Spark configurations (--conf).
# It is useful for setting default environment settings.
# Entries in this file can be:
# A. Blank lines
# B. Comment lines (starting with #)
# C. Properties in key=value format
#
# Leading or trailing spaces are fine.
# Please note that any key or value containing spaces or double quotes must be enclosed in single quotes ('').


### General properties
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.default.parallelism=8
spark.debug.maxToStringFields=200
spark.driver.memory=8g
spark.executor.memory=8g

# Additional Jars could be passed to spark through below configuration. Jars list should be comma(,) separated.
#spark.jars=
#spark.executor.extraClassPath=
#spark.driver.extraClassPath=

### Below property must be set for BigQuery
#spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
6 changes: 5 additions & 1 deletion docs/dataSourcesAndSinks/bigquery.md
@@ -6,8 +6,12 @@ The two driver jars namely **spark-bigquery-with-dependencies_2.12-0.24.2.jar**

```bash
export ZINGG_EXTRA_JARS=./spark-bigquery-with-dependencies_2.12-0.24.2.jar,./gcs-connector-hadoop2-latest.jar
export ZINGG_EXTRA_SPARK_CONF="--conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
```
Set the following property in Zingg's configuration file, i.e. **config/zingg.conf**:
```bash
spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
```
Similarly, instead of setting the env variable **ZINGG_EXTRA_JARS** as above, the equivalent property **spark.jars** can be set in the zingg.conf file.

If Zingg is run from outside Google Cloud, it requires authentication. Please set the following env variable to the location of the file containing the service account key. A service account key can be created and downloaded in JSON format from the [Google Cloud console](https://cloud.google.com/docs/authentication/getting-started).

Expand Down
2 changes: 2 additions & 0 deletions docs/dataSourcesAndSinks/jdbc.md
@@ -47,3 +47,5 @@ $ export ZINGG_EXTRA_JARS=path to postgresql-xx.jar
```
$ export ZINGG_EXTRA_JARS=path to mysql-connector-java-xx.jar
```

Please note that instead of setting the env variable **ZINGG_EXTRA_JARS** as above, the equivalent property **spark.jars** can be set in Zingg's configuration file (config/zingg.conf).
1 change: 1 addition & 0 deletions docs/dataSourcesAndSinks/snowflake.md
@@ -30,3 +30,4 @@ One must include Snowflake JDBC driver and Spark dependency on the spark classpath
```
export ZINGG_EXTRA_JARS=snowflake-jdbc-3.13.18.jar,spark-snowflake_2.12-2.10.0-spark_3.1.jar
```
Optionally, instead of setting the env variable **ZINGG_EXTRA_JARS** as above, the equivalent property **spark.jars** can be set in Zingg's configuration file (config/zingg.conf).
12 changes: 12 additions & 0 deletions scripts/load-zingg-env.sh
@@ -0,0 +1,12 @@
#!/usr/bin/env bash

ZINGG_ENV_SH="zingg-env.sh"
export ZINGG_CONF_DIR="$(dirname "$0")"/../config

ZINGG_ENV_SH="${ZINGG_CONF_DIR}/${ZINGG_ENV_SH}"
if [[ -f "${ZINGG_ENV_SH}" ]]; then
# Promote all variable declarations to environment (exported) variables
set -a
Member:
do we need to set them as env variables?

Contributor Author:
What is the alternative? As they are going to be supplied in another command (spark-submit), they must be set. Whether they have to be exported or not could be thought over!

Member:
If we source them, we can read them here, no? Why export?

Contributor Author:
I'll check whether it's needed or not. Either all are exported or none.

. ${ZINGG_ENV_SH}
set +a
fi
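The `set -a` / `set +a` pattern discussed in this thread can be checked with a minimal standalone sketch (the temp file and `SPARK_MEM` value here are illustrative, not part of the PR):

```shell
#!/usr/bin/env bash
# Sketch of the allexport pattern used in load-zingg-env.sh: assignments
# sourced while 'set -a' is active become exported environment variables,
# visible to child processes such as spark-submit.

env_file=$(mktemp)
echo 'SPARK_MEM=10g' > "$env_file"

set -a            # every subsequent assignment is exported
. "$env_file"     # source plain key=value assignments
set +a            # stop auto-exporting

# A child process can now see the variable because it was exported:
bash -c 'echo "child sees SPARK_MEM=${SPARK_MEM}"'
rm -f "$env_file"
```

Without `set -a`, the sourced assignment would be visible only in the current shell, which is why the script exports everything or nothing.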
31 changes: 23 additions & 8 deletions scripts/zingg.sh
@@ -3,19 +3,34 @@
ZINGG_JARS=$ZINGG_HOME/zingg-0.3.3-SNAPSHOT.jar
EMAIL=xxx@yyy.com
LICENSE="test"
Member:
move these too to the defaults?

Contributor Author:
moved to environment file.

##for local
export SPARK_MEM=10g
Member:
We can't remove SPARK_MEM; we should document setting it outside. Docker should also have a way to set this.

Contributor Author:
Added in zingg-env.sh

Contributor Author:

Spark has multiple conf files (six), among them:

  1. Environment variables sourced by a shell script (spark-env.sh), needed before starting a Spark program
  2. A property file (spark-defaults.conf), which is handled by a Java/Scala class
  3. log4j.properties
  4. workers.template

We should keep things simple unless they are a must. We are replicating files 1) and 2), and only in scripts, through env vars and "--conf" params, respectively.

"--jars" has a corresponding conf param as well, e.g.
"spark.driver.extraClassPath=/path/myjarfile1.jar:/path/myjarfile2.jar"

Therefore, I think it's not required to do processing based on "pattern" matching.
If you think otherwise, please let me know.

  • zingg-env.sh - have an option to specify any property that starts with spark. (can be spark.executor.memory, spark.driver.memory, spark.hadoop.fs.impl, etc.), with sensible defaults where applicable and commented out for the rest (e.g. the BigQuery Hadoop fs). Any property that starts with spark. is put into conf: --conf spark.executor.memory=22g
    zingg.sh should not have any knowledge of which property; it just knows that stuff has to be passed in as jars and conf.


if [[ -z "${ZINGG_EXTRA_JARS}" ]]; then
OPTION_JARS=""
else
OPTION_JARS="--jars ${ZINGG_EXTRA_JARS}"
fi

if [[ -z "${ZINGG_EXTRA_SPARK_CONF}" ]]; then
OPTION_SPARK_CONF=""
else
OPTION_SPARK_CONF="${ZINGG_EXTRA_SPARK_CONF}"
fi
function read_zingg_conf() {
local CONF_PROPS=""

ZINGG_CONF_DIR="$(cd "`dirname "$0"`"/../config; pwd)"

file="${ZINGG_CONF_DIR}/zingg.conf"
# Strip leading blanks; drop comment lines and blank lines
PROPERTIES=$(sed 's/^[[:blank:]]*//;s/#.*//;/^[[:space:]]*$/d' $file)

while IFS='=' read -r key value; do
# Trim leading and trailing spaces
key=$(echo $key | sed 's/^[[:blank:]]*//;s/[[:blank:]]*$//;')
value=$(echo $value | sed 's/^[[:blank:]]*//;s/[[:blank:]]*$//;')
# Append to conf variable
CONF_PROPS+=" --conf ${key}=${value}"
done <<< "$(echo -e "$PROPERTIES")"

echo $CONF_PROPS
}

$SPARK_HOME/bin/spark-submit --master $SPARK_MASTER $OPTION_JARS $OPTION_SPARK_CONF --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.es.nodes="127.0.0.1" --conf spark.es.port="9200" --conf spark.es.resource="cluster/cluster1" --conf spark.default.parallelism="8" --conf spark.executor.extraJavaOptions="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -Xloggc:/tmp/memLog.txt -XX:+UseCompressedOops" --conf spark.executor.memory=10g --conf spark.debug.maxToStringFields=200 --driver-class-path $ZINGG_JARS --class zingg.client.Client $ZINGG_JARS $@ --email $EMAIL --license $LICENSE
OPTION_SPARK_CONF+=$(read_zingg_conf)
# All the additional options must be added here
ALL_OPTIONS=" ${OPTION_JARS} ${OPTION_SPARK_CONF} "
$SPARK_HOME/bin/spark-submit --master $SPARK_MASTER ${ALL_OPTIONS} --conf spark.executor.extraJavaOptions="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -Xloggc:/tmp/memLog.txt -XX:+UseCompressedOops" --driver-class-path $ZINGG_JARS --class zingg.client.Client $ZINGG_JARS $@ --email $EMAIL --license $LICENSE
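The property-to-`--conf` expansion performed by `read_zingg_conf` can be exercised standalone; this sketch reuses the same sed filter and trim logic, but the temp file and the sample properties are illustrative only:

```shell
#!/usr/bin/env bash
# Standalone sketch of the key=value to "--conf key=value" expansion
# from scripts/zingg.sh; sample file and properties are made up.

conf_file=$(mktemp)
cat > "$conf_file" <<'EOF'
# a comment line

spark.driver.memory=8g
   spark.default.parallelism = 8
EOF

CONF_PROPS=""
# Strip leading blanks, comments, and blank lines (same sed as zingg.sh)
PROPERTIES=$(sed 's/^[[:blank:]]*//;s/#.*//;/^[[:space:]]*$/d' "$conf_file")

while IFS='=' read -r key value; do
  # Trim leading and trailing spaces around key and value
  key=$(echo "$key" | sed 's/^[[:blank:]]*//;s/[[:blank:]]*$//')
  value=$(echo "$value" | sed 's/^[[:blank:]]*//;s/[[:blank:]]*$//')
  CONF_PROPS+=" --conf ${key}=${value}"
done <<< "$PROPERTIES"

echo "$CONF_PROPS"
rm -f "$conf_file"
```

For this input the loop yields `--conf spark.driver.memory=8g --conf spark.default.parallelism=8`, which is then appended to the options passed to spark-submit.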