Enhancement for MS Excel: Reading, Combining, Writing excel files #12
Comments
I think your use case is very valid. However, I do not understand the issue with the current solution. Another alternative might be the templating feature in the future (cf. #3). Using this feature you can load an Excel file and modify selected cells. The idea behind this feature was more that you use Excel to create nice drawings, formats etc. and then use HadoopOffice to fill this "template" with data/formulas. |
Thanks for validating the use case. Yes, I've tried it: a DF created by reading any xlsx file cannot be written out to an Excel file again (directly). Example:
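Presumably a direct load-and-store round trip, along these lines (reconstructed from the code used later in this thread):
scala> val df2 = spark.read.format("org.zuinnote.spark.office.excel").load("/user/spark/output/")
scala> df2.write.format("org.zuinnote.spark.office.excel").mode(SaveMode.Overwrite).save("/user/spark/output2/")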
results in an error similar to:
Probably because it returns a df with WrappedArrays when it reads the xlsx. |
I see. I will look at the error later, might be an issue. However, why do you need to write the file when you directly load it afterwards? For example, it is possible to create one DF where you put all Excel cells inside (they can be part of different sheets, does not matter) and then write it to one file. No need for different DFs or writing/reloading. Example:
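For instance (a sketch only; it assumes the five-element cell layout - formattedValue, comment, formula, address, sheetName - that appears in the working code later in this thread):

import org.apache.spark.sql.SaveMode
import spark.implicits._

// cells for two different sheets, all in one DF, written to one file
val allCells = spark.sparkContext.parallelize(Seq(
  Seq("", "", "1", "A1", "Sheet2"),
  Seq("", "", "2", "A2", "Sheet2"),
  Seq("", "", "AVERAGE('Sheet2'!A1:A2)", "A1", "Sheet1"))).repartition(1)
allCells.toDF().write.format("org.zuinnote.spark.office.excel").mode(SaveMode.Overwrite).save("/user/spark/output/")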
Additionally, you could do a simple select after loading: I would have to test it, but let me know if you have the chance to test it before me. |
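Presumably a select along the lines of the explode that shows up later in this thread, e.g.:
scala> df2.select(explode(df2("rows")).alias("rows"))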
There are two reasons for this:
(I'm not saying it should work like this, by the way! All I want is an easy/fast way to create a single xlsx based on multiple dfs that represent rows and columns of raw values, saved as separate sheets in the xlsx, with the front page containing formulas that link to the sheets of raw data. I thought this hacky way would already make it possible.) So a way could be: an easy way to store a list of rows of data (df) across multiple sheets in one xlsx, instead of only being able to specify the one "default" sheet name (for example by using the first/last column in the df as the sheet name).
Update: I think your explode method does what I need :D (= easily split it up into cells). I'll test it tomorrow! |
I understand. Storing them and loading them again is fine then. Please let me know if explode works! I also have to correct my earlier statement - as you already found out, just loading and directly storing does not work, but explode (or a similar statement) should work. |
Nope, didn't work. I added the line to the code above:

scala> df2.select(explode(df2("rows")).alias("rows")).write.format("org.zuinnote.spark.office.excel").mode(SaveMode.Overwrite).save("/user/spark/output2/")
17/05/22 07:05:50 WARN AzureFileSystemThreadPoolExecutor: Disabling threads for Delete operation as thread count 0 is <= 1
17/05/22 07:05:51 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 13, 10.0.0.22, executor 1): org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: scala.MatchError: [1,,1,A1,Sheet2] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
at org.zuinnote.spark.office.excel.ExcelOutputWriter$$anonfun$write$1.apply$mcVI$sp(ExcelOutputWriter.scala:77)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at org.zuinnote.spark.office.excel.ExcelOutputWriter.write(ExcelOutputWriter.scala:70)
at org.apache.spark.sql.execution.datasources.OutputWriter.writeInternal(OutputWriter.scala:93)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:245)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)

There's a small difference between the two dataframes:

scala> df2.select(explode(df2("rows")).alias("rows")).show()
+--------------------+
|                rows|
+--------------------+
|    [1,,1,A1,Sheet2]|
|[2,This is a comm...|
|    [3,,3,A3,Sheet2]|
|              [,,,,]|
|[2.5,,AVERAGE('Sh...|
+--------------------+

scala> df.show()
+--------------------+
|               value|
+--------------------+
| [, , 1, A1, Sheet2]|
|[, This is a comm...|
| [, , 3, A3, Sheet2]|
|[, , AVERAGE('She...|
+--------------------+

One shows spaces and the other doesn't. I've seen it before though, no clue where. Think we're close though! |
Ok, thx for testing. Another approach would be to use flatMap to create several items of type SpreadSheetCellDAO out of the array - I hope I have some time later in the evening to run this on the Spark shell.
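A rough sketch of what that flatMap could look like (untested; it assumes the cell struct fields seen elsewhere in this thread and writes them as plain string arrays, the way the working snippet further down does):

import org.apache.spark.sql.{Row, SaveMode}
import spark.implicits._

// flatten each array-of-cell-structs row into one Array[String] per cell
val flatCells = df2.flatMap { row =>
  row.getSeq[Row](row.fieldIndex("rows")).map { cell =>
    Array(cell.getString(0), cell.getString(1), cell.getString(2), cell.getString(3), cell.getString(4))
  }
}
flatCells.write.format("org.zuinnote.spark.office.excel").mode(SaveMode.Overwrite).save("/user/spark/output2/")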
|
Maybe this helps? I did some printSchemas: df2 before explode, df2 after explode, and the compatible dataframe ("df") from the example.
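Roughly (a reconstruction from the field names used below, not the literal printSchema output):

df2 before explode:   rows: array<struct<formattedValue:string, comment:string, formula:string, address:string, sheetName:string>>
df2 after explode:    rows: struct<formattedValue:string, comment:string, formula:string, address:string, sheetName:string>
compatible df ("df"): value: array<string>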
So it seems one is an array and the other a struct, which would explain the different show outputs, and it would also fit the MatchError above (the writer receives a GenericRowWithSchema where it seems to expect an array of strings). You probably knew this already though. |
Okay, so I got it working, but it is in no way pretty code.

// imports (spark-shell provides some of these automatically)
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{concat, explode, lit}
import spark.implicits._

val sRdd = spark.sparkContext.parallelize(Seq(
  Seq("", "", "1", "A1", "Sheet2"),
  Seq("", "This is a comment", "2", "A2", "Sheet2"),
  Seq("", "", "3", "A3", "Sheet2"),
  Seq("", "", "AVERAGE('Sheet2'!A2:A3)", "B1", "Sheet1"))).repartition(1)
val df = sRdd.toDF()
df.write.format("org.zuinnote.spark.office.excel").mode(SaveMode.Overwrite).save("/user/spark/output/")
// read
val df2 = spark.read.format("org.zuinnote.spark.office.excel").load("/user/spark/output/")
// explode the array of cell structs into one struct per row
val df2fixed = df2.select(explode(df2("rows")).alias("rows"))
// get values from struct to root
val df2fixed2 = df2fixed.select(df2fixed.col("rows.formattedValue"), df2fixed.col("rows.comment"), df2fixed.col("rows.formula"), df2fixed.col("rows.address"), df2fixed.col("rows.sheetName"))
// stick em together into a single comma-separated string column
val df3 = df2fixed2.withColumn("con", concat(df2fixed2("formattedValue"), lit(","), df2fixed2("comment"), lit(","), df2fixed2("formula"), lit(","), df2fixed2("address"), lit(","), df2fixed2("sheetName"))).select("con")
// filter out any empty values, and split them back into an array
val df5 = df3.filter(x => x.getAs[String](0) != ",,,,").map(x => x.getAs[String](0).split(','))
// output
df5.write.format("org.zuinnote.spark.office.excel").mode(SaveMode.Overwrite).save("/user/spark/output2/")

Maybe this gives you some direction to go with :) |
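A possibly tidier variant of the same idea (untested): build the array<string> column directly with array() instead of the concat-and-split round trip:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{array, col, explode}

val cells = df2.select(explode(col("rows")).alias("r"))
  .filter(col("r.address") =!= "")   // drop the empty [,,,,] rows
  .select(array(col("r.formattedValue"), col("r.comment"), col("r.formula"),
    col("r.address"), col("r.sheetName")).alias("cell"))
cells.write.format("org.zuinnote.spark.office.excel").mode(SaveMode.Overwrite).save("/user/spark/output2/")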
Hi,
Thanks for the quick and proactive feedback!
I will investigate the flatMap option later as a fix for now.
Maybe in the future the library can also accept loaded files in a DF directly.
Best regards
|
No prob! Thanks for looking into it! (and for your quick responses too!) |
Hello,
Currently your library supports reading and writing Excel files, which is excellent!
What I would really like to do is:
Create one front page with calculations referring to other sheets (250+) that contain just raw data (480 cells per sheet)
What I have:
250+ dfs of raw data (integers, strings), 1 df for each sheet to be output
According to the documentation data can be written in two ways:
How I envisioned this could work with your current library, to avoid creating 480 cells each for the 250+ source dfs:
Unfortunately, this did not work, because reading an xlsx returns a df that cannot be directly written back as an xlsx with this library.
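For what it's worth, a sketch of how the multi-sheet case could be scripted on top of the same five-element cell layout used elsewhere in this thread (the toCells helper and the rawDf1 name are hypothetical and untested; addressing only covers columns A..Z):

import org.apache.spark.sql.{DataFrame, Dataset, SaveMode}
import spark.implicits._

// hypothetical helper: one (formattedValue, comment, formula, address, sheetName)
// cell per raw value, addressed A1-style
def toCells(df: DataFrame, sheet: String): Dataset[Array[String]] =
  df.rdd.zipWithIndex.flatMap { case (row, r) =>
    (0 until row.length).map { c =>
      Array("", "", String.valueOf(row.get(c)), s"${('A' + c).toChar}${r + 1}", sheet)
    }
  }.toDS()

// union one cell dataset per raw-data df with the front-page formulas, then write once
val front = Seq(Array("", "", "AVERAGE('Sheet2'!A1:A2)", "A1", "Sheet1")).toDS()
toCells(rawDf1, "Sheet2").union(front).repartition(1)
  .write.format("org.zuinnote.spark.office.excel").mode(SaveMode.Overwrite).save("/user/spark/output/")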
My question:
Is there an easy way to do this currently? (I'm not that familiar with Spark yet, so there could be something I'm missing!) If not, do you think it should be able to do this?
To me it makes sense to be able to import data, make some adjustments, and export it again.
What do you think?
Keep up the good work!
Kind regards,
Joris