
Issues in running Chapter 6 code #15

Closed
svishnu88 opened this issue Jan 31, 2015 · 4 comments
I have packaged the chapter 6 and included the jar using spark-shell.

When I try to execute the code below (initially without the @transient annotation):

@transient val conf = new Configuration()
conf.set(XmlInputFormat.START_TAG_KEY, "<page>")
conf.set(XmlInputFormat.END_TAG_KEY, "</page>")
val kvs = sc.newAPIHadoopFile(path, classOf[XmlInputFormat], classOf[LongWritable],
  classOf[Text], conf)
val rawXmls = kvs.map(p => p._2.toString)

I get Caused by: java.io.NotSerializableException: org.apache.hadoop.conf.Configuration.
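For context, this exception arises because Hadoop's Configuration does not implement java.io.Serializable, and in spark-shell a top-level val becomes a field of a generated wrapper object, so it can get dragged into serialized closures; marking the field @transient excludes it from Java serialization. A minimal, Spark-free sketch of that behavior (FakeConfiguration is a hypothetical stand-in for org.apache.hadoop.conf.Configuration):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for org.apache.hadoop.conf.Configuration:
// like the real class, it does NOT implement Serializable.
class FakeConfiguration {
  var startTag: String = "<page>"
}

// A serializable holder whose plain field drags the non-serializable
// object into serialization, triggering NotSerializableException.
class PlainHolder extends Serializable {
  val conf = new FakeConfiguration
}

// The same holder with @transient: the field is skipped during
// serialization, so writing the object succeeds.
class TransientHolder extends Serializable {
  @transient val conf = new FakeConfiguration
}

// Attempt Java serialization and report whether it succeeded.
def serializes(obj: AnyRef): Boolean =
  try {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    out.writeObject(obj)
    out.close()
    true
  } catch {
    case _: NotSerializableException => false
  }

println(serializes(new PlainHolder))
println(serializes(new TransientHolder))
```

The same mechanism explains why the shell session above only works once the Configuration field is marked @transient.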

With @transient in place I can proceed further, but after the transformation below:

val plainText = rawXmls.flatMap(wikiXmlToPlainText)

I ran plainText.count, and it gives the following error:

java.lang.NoClassDefFoundError: com/google/common/base/Charsets
at com.cloudera.datascience.common.XmlInputFormat$XmlRecordReader.<init>(XmlInputFormat.java:79)
at com.cloudera.datascience.common.XmlInputFormat.createRecordReader(XmlInputFormat.java:55)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)

Am I missing something here?
I am using Spark 1.2 and Hadoop 2.5.2.

srowen (Collaborator) commented Jan 31, 2015

@sryza I am going to remove use of Guava since it's not worth bundling/distributing it as an assembly.

Are you declaring the Configuration as a field of your own class, or just creating it in the shell? It shouldn't be caught in the closure, but it may be.

svishnu88 (Author) commented

I am creating the Configuration in the shell. After adding @transient I no longer get that error.

How can I solve the java.lang.NoClassDefFoundError: com/google/common/base/Charsets error?

srowen (Collaborator) commented Jan 31, 2015

@svishnu88 No, the Guava issue should be solvable by just not using it directly in the code. Let's see if that's possible. The @transient may be needed to work around how the closure cleaner works. @jwills, have you seen this?
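For reference, the only thing Guava's Charsets class supplies in this kind of code is a charset constant, and the JDK's own java.nio.charset.StandardCharsets (available since Java 7) provides the same constants, so the dependency can be dropped entirely. A small sketch of the substitution (the "before" line shows the Guava usage only as a comment, since Guava is the dependency being removed):

```scala
import java.nio.charset.StandardCharsets

// Before (requires Guava on the runtime classpath):
//   val bytes = text.getBytes(com.google.common.base.Charsets.UTF_8)

// After (JDK only, no extra dependency):
val text  = "<page>wiki</page>"
val bytes = text.getBytes(StandardCharsets.UTF_8)

// Round-trips cleanly through UTF-8.
println(new String(bytes, StandardCharsets.UTF_8) == text) // prints true
```

With this substitution, the executor no longer needs Guava on its classpath, and the NoClassDefFoundError above cannot occur from this code path.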

srowen (Collaborator) commented Feb 3, 2015

OK, the Guava thing shouldn't be an issue. I'm still not sure what to make of the Configuration issue. I wonder if we need to include @transient in the text as a precaution? It still feels like the wrong way to address this.

@srowen srowen added the bug label Feb 17, 2015
@srowen srowen modified the milestone: 1.0.0 Feb 17, 2015
@sryza sryza closed this as completed Feb 23, 2015