
Writing Dates and Timestamps #188

Open
nevi-me opened this issue Nov 8, 2018 · 3 comments

nevi-me commented Nov 8, 2018

I'm continuing my adventures in writing CSV to Parquet, but I'm stuck on how to write times/dates to Parquet.
Specifically, how do I declare the schema (assuming I'm using the text format, message schema {})?

I read up on the logical types and their mapping to/from data types, so I tried using i64 for my schema, but I think I'm missing something, because I don't know how to map the type to a TIMESTAMP.

I also tried Google to look for the format of the schema, but had no luck (for timestamps). Is there somewhere that documents this?

sadikovi commented Nov 9, 2018

I would use TIMESTAMP_MILLIS for now; it is just INT64 with the corresponding logical type, and probably the easiest to write.
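
For reference, a minimal sketch of what writing such a column can look like with this crate's low-level writer API. The file name, column name, and sample value are made up, and import paths and pointer types have shifted between releases, so treat this as a sketch rather than the definitive API:

use std::{fs::File, rc::Rc};

use parquet::{
    column::writer::ColumnWriter,
    file::{
        properties::WriterProperties,
        writer::{FileWriter, RowGroupWriter, SerializedFileWriter},
    },
    schema::parser::parse_message_type,
};

fn main() {
    // TIMESTAMP_MILLIS is physically an INT64 holding milliseconds
    // since the Unix epoch, interpreted as UTC.
    let message_type = "
        message schema {
            REQUIRED INT64 Timestamp (TIMESTAMP_MILLIS);
        }
    ";
    let schema = Rc::new(parse_message_type(message_type).unwrap());
    let props = Rc::new(WriterProperties::builder().build());
    let file = File::create("timestamps.parquet").unwrap();

    let mut writer = SerializedFileWriter::new(file, schema, props).unwrap();
    let mut row_group_writer = writer.next_row_group().unwrap();
    while let Some(mut col_writer) = row_group_writer.next_column().unwrap() {
        if let ColumnWriter::Int64ColumnWriter(ref mut typed) = col_writer {
            // 2018-11-09T00:00:00Z expressed as milliseconds since the epoch
            typed.write_batch(&[1_541_721_600_000], None, None).unwrap();
        }
        row_group_writer.close_column(col_writer).unwrap();
    }
    writer.close_row_group(row_group_writer).unwrap();
    writer.close().unwrap();
}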


nevi-me commented Nov 9, 2018

Thanks @sadikovi; I was confused by the UTC handling on the timestamp logical type.

Writing a timestamp now works with message schema {REQUIRED INT64 MyField (TIMESTAMP_MILLIS)}, but I'm unable to read the Parquet file back in Pandas or PySpark.
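
For context, given the schema PySpark prints below, the full table might be declared along these lines in the text format (the OPTIONAL repetition and the UTF8 annotation for the string columns are assumptions here):

message schema {
    OPTIONAL BYTE_ARRAY Id (UTF8);
    OPTIONAL BYTE_ARRAY Name (UTF8);
    OPTIONAL BOOLEAN Indicator;
    OPTIONAL INT64 Timestamp (TIMESTAMP_MILLIS);
}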

PySpark:

spark.read.parquet("file1.parquet").printSchema()
# printing the schema works and correctly shows the following,
# but .show() throws an error
root
 |-- Id: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Indicator: boolean (nullable = true)
 |-- Timestamp: timestamp (nullable = true)

# trying to show the records

Py4JJavaError: An error occurred while calling o62.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 1 times, most recent failure: Lost task 0.0 in stage 6.0 (TID 16, localhost, executor driver): org.apache.parquet.io.ParquetDecodingException: Dictionary encoding not supported for type: BOOLEAN 

Pandas:

pd.read_parquet("file1.parquet")

ArrowIOError: Not yet implemented: Dictionary encoding is not implemented for boolean values.
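
Both errors point at the same thing: the Indicator (boolean) column appears to have been written with dictionary encoding, which Spark's Parquet reader and Arrow do not support for BOOLEAN. Assuming the writer-properties builder in this crate, one possible workaround is to disable dictionary encoding when writing, roughly:

use std::rc::Rc;

use parquet::file::properties::WriterProperties;

fn main() {
    // Sketch: turn dictionary encoding off so the BOOLEAN column is written
    // with plain encoding, which Spark and Arrow can read back. The builder
    // also has per-column settings if only one column should change.
    let props = Rc::new(
        WriterProperties::builder()
            .set_dictionary_enabled(false)
            .build(),
    );
    // ... pass `props` to SerializedFileWriter::new as in the example above
    let _ = props;
}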
