Spark-curves

Spark-curves provides SQL functions and ML Classes on Apache Spark about curve calculations such as curve similarity, clustering, etc. For example, it can apply to the analysis of traffic flow or human flow.

Setup

SBT Settings

Apache Spark SQL & MLlib 3.5.0 (on Scala 2.12 or 2.13) are required to use Spark-curves.

libraryDependencies ++= Seq(
  "io.github.toyotainfotech" %% "spark-curves" % "0.4.0",
  "org.apache.spark" %% "spark-sql" % "3.5.0" % Provided,
  "org.apache.spark" %% "spark-mllib" % "3.5.0" % Provided
)

Registration of SQL Functions

import io.github.toyotainfotech.spark.curves.sql.functions.registerCurveFunctions

val spark: SparkSession = ??? // use a SparkSession instance in your Spark environment

registerCurveFunctions(spark.sessionState.functionRegistry)

Examples

Curve Similarity

> SELECT discrete_frechet_distance(array(array(0d, 0d), array(2d, 0d), array(6d, 0d)), array(array(0d, 1d), array(2d, 1d), array(3d, 1d), array(6d, 1d)))
1.414

> SELECT continuous_dynamic_time_warping(array(array(-1d, 1d), array(0d, 1d), array(0d, 2d), array(1d, 2d), array(2d, 2d), array(3d, 2d), array(3d, 1d), array(4d, 1d)), array(array(-1d, 0d), array(0d, 0d), array(3d, 0d), array(4d, 0d)))
2.722

Curve Clustering

val klClustering = new KLClustering()
  .setK(2)
  .setL(16)
  .setIter(2)
  .setCurveSimilarityFunc("continuous_frechet_distance")
  .setCurveSimplifyFunc("polyline_sample_equidistant")
  .setPointsCenterFunc("smallest_circle_center")
  .setInputCurveCol("curve")
  .setInputCurveAxisFields(Array("x", "y"))
  .setOutputClusterIndexCol("clusterIndex")
  .setOutputSimilarityCol("similarity")
  .setOutputMatchingCurveCol("matching_curve")

val klClusterModel = klClustering.fit(trajectories)
val result = klClusterModel.transform(trajectories)

Documentation

See API.

Performance

Curve Clustering

The following experiment is clustering of up to 3,578,281 curves on Amazon EMR.

See DEIM 2023 paper (Japanese) for details.

License

Copyright 2023 TOYOTA MOTOR CORPORATION

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
.github/workflows		.github/workflows
docs		docs
project		project
src		src
.gitignore		.gitignore
.scalafmt.conf		.scalafmt.conf
LICENSE.txt		LICENSE.txt
README.md		README.md
build.sbt		build.sbt
version.sbt		version.sbt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Spark-curves

Setup

SBT Settings

Registration of SQL Functions

Examples

Curve Similarity

Curve Clustering

Documentation

Performance

Curve Clustering

License

About

Releases 1

Languages

License

ToyotaInfoTech/spark-curves

Folders and files

Latest commit

History

Repository files navigation

Spark-curves

Setup

SBT Settings

Registration of SQL Functions

Examples

Curve Similarity

Curve Clustering

Documentation

Performance

Curve Clustering

License

About

Resources

License

Stars

Watchers

Forks

Releases 1

Languages