Added support for Hdfs as deep storage #48

Merged on Nov 17, 2023 (1 commit)


vivek-balakrishnan-rovio
Collaborator

@vivek-balakrishnan-rovio vivek-balakrishnan-rovio commented Nov 16, 2023

PR for HDFS deep storage support.

@fabricebaranski

fabricebaranski commented Nov 16, 2023

It would also be great to update the Spark and Druid versions.
I currently use Spark 3.4.1 and Druid 27.0.0.
With these versions, you need to change the IndexIO and getBitmapSerdeFactory usage in TaskDataWriter.java.
Also, the time column can be defined as a long, so it would be good to adapt the Scala code in DruidDatasetExtensions
like this:

      // If we're partitioning by day and the time column is of DateType, we can simplify.
      if (granularityString == GranularityType.DAY.name() && dataset.schema(timeColumn).dataType == DateType) {
        df = df.withColumn("__PARTITION_TIME__", col(timeColumn).cast(TimestampType))
      } else {
        if (dataset.schema(timeColumn).dataType == LongType) {
          // Time column holds epoch milliseconds: divide by 1000 before casting to a timestamp.
          df = df
            .withColumn("__PARTITION_TIME__",
              normalize_udf(column(timeColumn))
                .divide(1000)
                .cast(DataTypes.TimestampType))
            .withColumn(timeColumn,
              (col(timeColumn) / 1000).cast(DataTypes.TimestampType))
        } else {
          df = df.withColumn("__PARTITION_TIME__",
            normalize_udf(unix_timestamp(column(timeColumn))
              .multiply(1000)
              .cast(DataTypes.LongType))
              .divide(1000)
              .cast(DataTypes.TimestampType))
        }
      }
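
To make the arithmetic in the LongType branch easier to follow, here is a minimal sketch in plain Scala with no Spark dependency. It assumes the long column holds epoch milliseconds and that normalize_udf truncates a millisecond value to the start of its granularity bucket. Neither assumption is confirmed in this thread. The names EpochMillisSketch, truncateToDayMillis and millisToSeconds are hypothetical and exist only for this illustration.

```scala
import java.time.Instant
import java.time.temporal.ChronoUnit

object EpochMillisSketch {
  // Hypothetical stand-in for normalize_udf with DAY granularity (an assumption):
  // truncates an epoch-millisecond value to the start of its UTC day.
  def truncateToDayMillis(epochMillis: Long): Long =
    Instant.ofEpochMilli(epochMillis).truncatedTo(ChronoUnit.DAYS).toEpochMilli

  // Spark reads a numeric value cast to TimestampType as seconds since the epoch,
  // which is why the snippet above divides milliseconds by 1000 before the cast.
  def millisToSeconds(epochMillis: Long): Long =
    Math.floorDiv(epochMillis, 1000L)

  def main(args: Array[String]): Unit = {
    val eventMillis = 1700148960000L // 2023-11-16T15:36:00Z as epoch milliseconds
    val partitionMillis = truncateToDayMillis(eventMillis)
    println(Instant.ofEpochSecond(millisToSeconds(partitionMillis))) // partition time
    println(Instant.ofEpochSecond(millisToSeconds(eventMillis)))     // event time
  }
}
```

The sketch only mirrors the order of operations (normalize in milliseconds, then convert to seconds); the real conversion happens in the Spark column expressions shown above.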

@vivek-balakrishnan-rovio
Collaborator Author

@fabricebaranski Thanks, we plan to migrate to the latest Druid version soon, and I will include the changes you suggested.

@vivek-balakrishnan-rovio vivek-balakrishnan-rovio merged commit 48dc6e4 into main Nov 17, 2023
1 check passed
@vivek-balakrishnan-rovio vivek-balakrishnan-rovio deleted the hdfs_storage_support branch November 17, 2023 15:36
2 participants