The Spark example from the README for reading xz files returns output where:
the number of lines doesn't match the uncompressed input, and
the output comes out looking almost like a raw byte stream:
....
(4377758,�����b�H�n��8NĂ��6�z.RS��6�q>����@�⧚2u�oX�+�׃�,�=E�(�X�1͜���v郕����ch�U{0PT�Hz�1`uX荲�͉�2q�N�l{�c6��Z�\�� M��&��]s^���P��$��+u|��=���Xh�<|�*)
(4377930,��KJ�0�Q0d������ִ��RVY(�o�����V�<I�8��M�6��cԖ�>,k)
...
Is this the expected format? Why are there fewer lines in the output than in the uncompressed input?
Here's the Spark/Scala code:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
// newAPIHadoopFile requires the new-API input format (mapreduce, not mapred)
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

def readXzfile() {
  val conf = new SparkConf(true)
    .setAppName("XzUncompressExample")
    .set("spark.shuffle.manager", "SORT")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.akka.frameSize", "50")
    .set("spark.storage.memoryFraction", "0.8")
    .set("spark.cassandra.output.batch.size.rows", "6000")
    .set("spark.executor.extraJavaOptions", "-XX:MaxJavaStackTraceDepth=-1")
    .set("io.compression.codecs", "io.sensesecure.hadoop.xz.XZCodec")
  val sc = new SparkContext(conf)
  val hadoopConfiguration = new Configuration()
  //val file = sc.textFile(fileName.getFileName)
  //val rddOfXz = sc.newAPIHadoopFile("file:///Users/bparman/Perforce/testOldAnalyticsCommons10/gn-perseng/eg-analytics/analytics-commons/src/test/resources/*.xz", classOf[org.apache.hadoop.mapred.TextInputFormat], classOf[org.apache.hadoop.io.LongWritable], classOf[org.apache.hadoop.io.Text], conf)
  val rddOfXz = sc.newAPIHadoopFile("/user/ubuntu/raw/*.xz", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], hadoopConfiguration)
  rddOfXz.foreach(println)
  println("Total number of lines is " + rddOfXz.count())
  rddOfXz.saveAsTextFile("/user/ubuntu/uncompressed")
}
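One detail worth checking (a hedged sketch, not a confirmed fix): `io.compression.codecs` is a Hadoop property, and setting it on the `SparkConf` does not propagate to the fresh `new Configuration()` that is passed to `newAPIHadoopFile`. If Hadoop never sees the codec registration, it reads the `*.xz` files as plain text and emits raw compressed bytes, which would match the garbled output above. Registering the codec on the Hadoop `Configuration` directly (reusing `sc` from the snippet above) might look like:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
// New-API input format, as required by newAPIHadoopFile
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Register the XZ codec on the Configuration actually handed to Hadoop;
// otherwise *.xz files are split and decoded as if they were plain text.
val hadoopConfiguration = new Configuration()
hadoopConfiguration.set("io.compression.codecs",
  "io.sensesecure.hadoop.xz.XZCodec")

val rddOfXz = sc.newAPIHadoopFile(
  "/user/ubuntu/raw/*.xz",
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  hadoopConfiguration)
```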
I tested on Spark 1.3.1 + Hadoop 2.6.0. If you use Spark 1.2.0, which is typically bundled with Hadoop 2.4, you may also run into other issues, since hadoop-xz depends on Hadoop 2.6. ---yongtang
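Independent of Spark, a quick sanity check with the xz CLI can confirm whether the files themselves decompress cleanly and what line count to expect. A self-contained round trip (on a sample file; substitute a local copy of one of your .xz files for a real check):

```shell
# Round-trip sanity check: compress a known file, decompress to stdout,
# and confirm the line count is unchanged.
printf 'one\ntwo\nthree\n' > sample.txt
xz -kc sample.txt > sample.txt.xz
xz -dc sample.txt.xz | wc -l   # should print 3
```

If `xz -dc yourfile.xz | wc -l` disagrees with Spark's `count()`, the problem is on the Spark/Hadoop side rather than in the files.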
Here's my build file:
name := "detailed-commons"
organization := "com.mycompany.commons"
version := "1.0.2"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"
libraryDependencies += "org.apache.spark" % "spark-sql_2.10" % "1.2.0" % "provided"
libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.2.0" % "provided"
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.1"
libraryDependencies ++= Seq(
("io.sensesecure" % "hadoop-xz" % "1.4").
exclude("commons-beanutils", "commons-beanutils-core").
exclude("commons-collections", "commons-collections")
)
publishTo := Some(Resolver.file("detailed-commons-assembly-1.0.2.jar", new File( Path.userHome.absolutePath+"/.ivy2/cache" )) )
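Given yongtang's note that hadoop-xz depends on Hadoop 2.6 while Spark 1.2.0 typically bundles Hadoop 2.4, one option worth trying (an assumption based on that comment, not a confirmed fix) is to build against a matching hadoop-client:

```scala
// Assumed versions, per the maintainer's comment: hadoop-xz 1.4 targets
// Hadoop 2.6, while the Spark 1.2.0 dependency above bundles Hadoop 2.4.
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0" % "provided"
```

Alternatively, upgrading the spark-core dependency to a build matching the maintainer's tested combination (Spark 1.3.1 + Hadoop 2.6.0) would avoid the mismatch entirely.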