Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

writing to bigquery from dataproc #81

Closed
mikerlt opened this issue Sep 17, 2019 · 1 comment
Closed

writing to bigquery from dataproc #81

mikerlt opened this issue Sep 17, 2019 · 1 comment

Comments

@mikerlt
Copy link

mikerlt commented Sep 17, 2019

when trying to write to bigquery like so from pyspark in dataproc:
`bigquery = spark._sc._jvm.com.samelamin.spark.bigquery

Prepare the bigquery context

bq = bigquery.BigQuerySQLContext(spark._wrapped._jsqlContext)
bq.setBigQueryProjectId(BQ_PROJECT_ID)
bq.setGSProjectId(BQ_PROJECT_ID)
bq.setBigQueryGcsBucket(STAGING_BUCKET)
bq.setBigQueryDatasetLocation(DATASET_LOCATION)

Extract and Transform a dataframe

df = session.read.csv(...)

Load into a table or table partition

bqDF = bigquery.BigQueryDataFrame(df_master._jdf)
bqDF.saveAsBigQueryTable(
"{0}:{1}.{2}".format(BQ_PROJECT_ID, DATASET_ID, TABLE_NAME),
False, # Day paritioned when created
0, # Partition expired when created
bigquery.getattr("package$WriteDisposition$").getattr("MODULE$").WRITE_EMPTY(),
bigquery.getattr("package$CreateDisposition$").getattr("MODULE$").CREATE_IF_NEEDED(),
)`
I am getting the following exception:

`Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.github.samelamin#spark-bigquery_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-c234c14e-60d0-482e-a149-6a46d2a4c43f;1.0
confs: [default]
found com.github.samelamin#spark-bigquery_2.11;0.2.6 in central
found com.databricks#spark-avro_2.11;4.0.0 in central
found org.slf4j#slf4j-api;1.7.5 in central
found org.apache.avro#avro;1.7.6 in central
found org.codehaus.jackson#jackson-core-asl;1.9.13 in central
found org.codehaus.jackson#jackson-mapper-asl;1.9.13 in central
found com.thoughtworks.paranamer#paranamer;2.3 in central
found org.xerial.snappy#snappy-java;1.0.5 in central
found org.apache.commons#commons-compress;1.4.1 in central
found org.tukaani#xz;1.0 in central
found com.google.cloud.bigdataoss#bigquery-connector;0.13.4-hadoop2 in central
found com.google.api-client#google-api-client-java6;1.24.1 in central
found com.google.api-client#google-api-client;1.24.1 in central
found com.google.oauth-client#google-oauth-client;1.24.1 in central
found com.google.http-client#google-http-client;1.24.1 in central
found com.google.code.findbugs#jsr305;3.0.2 in central
found org.apache.httpcomponents#httpclient;4.0.1 in central
found org.apache.httpcomponents#httpcore;4.0.1 in central
found commons-logging#commons-logging;1.1.1 in central
found commons-codec#commons-codec;1.6 in central
found com.google.http-client#google-http-client-jackson2;1.24.1 in central
found com.fasterxml.jackson.core#jackson-core;2.9.2 in central
found com.google.guava#guava;26.0-jre in central
found org.checkerframework#checker-qual;2.5.2 in central
found com.google.errorprone#error_prone_annotations;2.1.3 in central
found com.google.j2objc#j2objc-annotations;1.1 in central
found org.codehaus.mojo#animal-sniffer-annotations;1.14 in central
found com.google.oauth-client#google-oauth-client-java6;1.24.1 in central
found com.google.api-client#google-api-client-jackson2;1.24.1 in central
found com.google.apis#google-api-services-storage;v1-rev135-1.24.1 in central
found com.google.apis#google-api-services-bigquery;v2-rev398-1.24.1 in central
found com.google.cloud.bigdataoss#util;1.9.4 in central
found com.google.auto.value#auto-value-annotations;1.6.2 in central
found com.google.cloud.bigdataoss#util-hadoop;1.9.4-hadoop2 in central
found com.google.cloud.bigdataoss#gcs-connector;1.9.4-hadoop2 in central
found com.google.cloud.bigdataoss#gcsio;1.9.4 in central
found joda-time#joda-time;2.9.3 in central
:: resolution report :: resolve 1060ms :: artifacts dl 27ms
:: modules in use:
com.databricks#spark-avro_2.11;4.0.0 from central in [default]
com.fasterxml.jackson.core#jackson-core;2.9.2 from central in [default]
com.github.samelamin#spark-bigquery_2.11;0.2.6 from central in [default]
com.google.api-client#google-api-client;1.24.1 from central in [default]
com.google.api-client#google-api-client-jackson2;1.24.1 from central in [default]
com.google.api-client#google-api-client-java6;1.24.1 from central in [default]
com.google.apis#google-api-services-bigquery;v2-rev398-1.24.1 from central in [default]
com.google.apis#google-api-services-storage;v1-rev135-1.24.1 from central in [default]
com.google.auto.value#auto-value-annotations;1.6.2 from central in [default]
com.google.cloud.bigdataoss#bigquery-connector;0.13.4-hadoop2 from central in [default]
com.google.cloud.bigdataoss#gcs-connector;1.9.4-hadoop2 from central in [default]
com.google.cloud.bigdataoss#gcsio;1.9.4 from central in [default]
com.google.cloud.bigdataoss#util;1.9.4 from central in [default]
com.google.cloud.bigdataoss#util-hadoop;1.9.4-hadoop2 from central in [default]
com.google.code.findbugs#jsr305;3.0.2 from central in [default]
com.google.errorprone#error_prone_annotations;2.1.3 from central in [default]
com.google.guava#guava;26.0-jre from central in [default]
com.google.http-client#google-http-client;1.24.1 from central in [default]
com.google.http-client#google-http-client-jackson2;1.24.1 from central in [default]
com.google.j2objc#j2objc-annotations;1.1 from central in [default]
com.google.oauth-client#google-oauth-client;1.24.1 from central in [default]
com.google.oauth-client#google-oauth-client-java6;1.24.1 from central in [default]
com.thoughtworks.paranamer#paranamer;2.3 from central in [default]
commons-codec#commons-codec;1.6 from central in [default]
commons-logging#commons-logging;1.1.1 from central in [default]
joda-time#joda-time;2.9.3 from central in [default]
org.apache.avro#avro;1.7.6 from central in [default]
org.apache.commons#commons-compress;1.4.1 from central in [default]
org.apache.httpcomponents#httpclient;4.0.1 from central in [default]
org.apache.httpcomponents#httpcore;4.0.1 from central in [default]
org.checkerframework#checker-qual;2.5.2 from central in [default]
org.codehaus.jackson#jackson-core-asl;1.9.13 from central in [default]
org.codehaus.jackson#jackson-mapper-asl;1.9.13 from central in [default]
org.codehaus.mojo#animal-sniffer-annotations;1.14 from central in [default]
org.slf4j#slf4j-api;1.7.5 from central in [default]
org.tukaani#xz;1.0 from central in [default]
org.xerial.snappy#snappy-java;1.0.5 from central in [default]
:: evicted modules:
org.slf4j#slf4j-api;1.6.4 by [org.slf4j#slf4j-api;1.7.5] in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 38 | 0 | 0 | 1 || 37 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-c234c14e-60d0-482e-a149-6a46d2a4c43f
confs: [default]
0 artifacts copied, 37 already retrieved (0kB/19ms)
19/09/17 16:14:30 INFO org.spark_project.jetty.util.log: Logging initialized @4647ms
19/09/17 16:14:30 INFO org.spark_project.jetty.server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
19/09/17 16:14:31 INFO org.spark_project.jetty.server.Server: Started @4753ms
19/09/17 16:14:31 WARN org.apache.spark.util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
19/09/17 16:14:31 INFO org.spark_project.jetty.server.AbstractConnector: Started ServerConnector@2242b679{HTTP/1.1,[http/1.1]}{0.0.0.0:4041}
19/09/17 16:14:31 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in fairscheduler.xml or set spark.scheduler.allocation.file to a file that contains the configuration.
19/09/17 16:14:31 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at sidm-cluster-02-m/10.128.0.2:8032
19/09/17 16:14:31 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at sidm-cluster-02-m/10.128.0.2:10200
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.github.samelamin_spark-bigquery_2.11-0.2.6.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.databricks_spark-avro_2.11-4.0.0.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.cloud.bigdataoss_bigquery-connector-0.13.4-hadoop2.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/joda-time_joda-time-2.9.3.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.slf4j_slf4j-api-1.7.5.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.apache.avro_avro-1.7.6.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.codehaus.jackson_jackson-core-asl-1.9.13.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.thoughtworks.paranamer_paranamer-2.3.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.xerial.snappy_snappy-java-1.0.5.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.apache.commons_commons-compress-1.4.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.tukaani_xz-1.0.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.api-client_google-api-client-java6-1.24.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.api-client_google-api-client-jackson2-1.24.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.apis_google-api-services-storage-v1-rev135-1.24.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.apis_google-api-services-bigquery-v2-rev398-1.24.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.code.findbugs_jsr305-3.0.2.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.guava_guava-26.0-jre.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.oauth-client_google-oauth-client-1.24.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.oauth-client_google-oauth-client-java6-1.24.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.cloud.bigdataoss_util-1.9.4.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.cloud.bigdataoss_util-hadoop-1.9.4-hadoop2.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.cloud.bigdataoss_gcs-connector-1.9.4-hadoop2.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.api-client_google-api-client-1.24.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.http-client_google-http-client-jackson2-1.24.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.http-client_google-http-client-1.24.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.apache.httpcomponents_httpclient-4.0.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.apache.httpcomponents_httpcore-4.0.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/commons-logging_commons-logging-1.1.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/commons-codec_commons-codec-1.6.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.fasterxml.jackson.core_jackson-core-2.9.2.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.checkerframework_checker-qual-2.5.2.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.errorprone_error_prone_annotations-2.1.3.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.j2objc_j2objc-annotations-1.1.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/org.codehaus.mojo_animal-sniffer-annotations-1.14.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.auto.value_auto-value-annotations-1.6.2.jar added multiple times to distributed cache.
19/09/17 16:14:33 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:///root/.ivy2/jars/com.google.cloud.bigdataoss_gcsio-1.9.4.jar added multiple times to distributed cache.
19/09/17 16:14:35 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1568730559207_0014
19/09/17 16:16:10 WARN org.apache.spark.util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
Traceback (most recent call last):
File "/tmp/5573a94d3db144558111c14b4dba7f38/ingest_bq.py", line 49, in
bigquery.getattr("package$CreateDisposition$").getattr("MODULE$").CREATE_IF_NEEDED(),
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in call
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o68.saveAsBigQueryTable.
: java.lang.NoSuchMethodError: com.google.cloud.hadoop.io.bigquery.BigQueryStrings.parseTableReference(Ljava/lang/String;)Lcom/google/api/services/bigquery/model/TableReference;
at com.samelamin.spark.bigquery.BigQueryDataFrame.saveAsBigQueryTable(BigQueryDataFrame.scala:40)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

19/09/17 16:16:11 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark@2242b679{HTTP/1.1,[http/1.1]}{0.0.0.0:4041}
Job output is complete`

@samelamin
Copy link
Owner

Sounds like you are missing some libraries, try ensuring the bigdata-interop libraries are added to the class path

Using scala libs from python is always a pain im afraid

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants