
Issues in running Chapter 6 code #15

Closed
svishnu88 opened this issue Jan 31, 2015 · 4 comments
I have packaged the chapter 6 and included the jar using spark-shell.

When I try to execute the code below (initially without the @transient annotation):

@transient val conf = new Configuration()
conf.set(XmlInputFormat.START_TAG_KEY, "<page>")
conf.set(XmlInputFormat.END_TAG_KEY, "</page>")
val kvs = sc.newAPIHadoopFile(path, classOf[XmlInputFormat], classOf[LongWritable],
  classOf[Text], conf)
val rawXmls = kvs.map(p => p._2.toString)

I get Caused by: java.io.NotSerializableException: org.apache.hadoop.conf.Configuration.
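For context, this exception arises because Hadoop's Configuration does not implement java.io.Serializable, and in spark-shell a top-level val becomes a field of a generated wrapper object, so it can get dragged into serialized closures; marking the field @transient excludes it from Java serialization. A minimal, Spark-free sketch of that behavior (FakeConfiguration is a hypothetical stand-in for org.apache.hadoop.conf.Configuration):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for org.apache.hadoop.conf.Configuration:
// like the real class, it does NOT implement Serializable.
class FakeConfiguration {
  var startTag: String = "<page>"
}

// A serializable holder whose plain field drags the non-serializable
// object into serialization, triggering NotSerializableException.
class PlainHolder extends Serializable {
  val conf = new FakeConfiguration
}

// The same holder with @transient: the field is skipped during
// serialization, so writing the object succeeds.
class TransientHolder extends Serializable {
  @transient val conf = new FakeConfiguration
}

// Attempt Java serialization and report whether it succeeded.
def serializes(obj: AnyRef): Boolean =
  try {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    out.writeObject(obj)
    out.close()
    true
  } catch {
    case _: NotSerializableException => false
  }

println(serializes(new PlainHolder))
println(serializes(new TransientHolder))
```

The same mechanism explains why the shell session above only works once the Configuration field is marked @transient.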

With @transient in place I can proceed further, but after the transformation below:

val plainText = rawXmls.flatMap(wikiXmlToPlainText)

I ran plainText.count, and it gives the following error:

java.lang.NoClassDefFoundError: com/google/common/base/Charsets
at com.cloudera.datascience.common.XmlInputFormat$XmlRecordReader.<init>(XmlInputFormat.java:79)
at com.cloudera.datascience.common.XmlInputFormat.createRecordReader(XmlInputFormat.java:55)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)

Am I missing something here?
I am using Spark 1.2 and Hadoop 2.5.2.

srowen (Collaborator) commented Jan 31, 2015

@sryza I am going to remove use of Guava since it's not worth bundling/distributing it as an assembly.

Are you declaring the Configuration as a field of your own class, or just creating it in the shell? It shouldn't be caught in the closure, but it may be.

svishnu88 (Author) commented

I am creating the Configuration in the shell. After adding @transient I no longer get that error.

How can I solve the java.lang.NoClassDefFoundError: com/google/common/base/Charsets error?

srowen (Collaborator) commented Jan 31, 2015

@svishnu88 No, the Guava issue should be solvable by just not using it directly in the code. Let's see if that's possible. The @transient may be needed to work around how the closure cleaner works. @jwills, have you seen this?
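For reference, the only thing Guava's Charsets class supplies in this kind of code is a charset constant, and the JDK's own java.nio.charset.StandardCharsets (available since Java 7) provides the same constants, so the dependency can be dropped entirely. A small sketch of the substitution (the "before" line shows the Guava usage only as a comment, since Guava is the dependency being removed):

```scala
import java.nio.charset.StandardCharsets

// Before (requires Guava on the runtime classpath):
//   val bytes = text.getBytes(com.google.common.base.Charsets.UTF_8)

// After (JDK only, no extra dependency):
val text  = "<page>wiki</page>"
val bytes = text.getBytes(StandardCharsets.UTF_8)

// Round-trips cleanly through UTF-8.
println(new String(bytes, StandardCharsets.UTF_8) == text) // prints true
```

With this substitution, the executor no longer needs Guava on its classpath, and the NoClassDefFoundError above cannot occur from this code path.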

srowen (Collaborator) commented Feb 3, 2015

OK, the Guava thing shouldn't be an issue. I'm still not sure what to make of the Configuration issue. I wonder if we need to include @transient in the text as a precaution? It still feels like the wrong way to address this.

@srowen srowen added the bug label Feb 17, 2015
@srowen srowen modified the milestone: 1.0.0 Feb 17, 2015
@sryza sryza closed this as completed Feb 23, 2015