Introduction to Yosegi

What does this project do?

Yosegi is a schema-less columnar storage format. It provides a flexible representation like JSON and efficient reads similar to other columnar storage formats.

Why is this project useful?

In the era of big data, raw data is often too large to compress and store as it is. To meet the demand for better compression ratios and read performance, several columnar data formats (for example, Apache ORC and Apache Parquet) were proposed. They achieve high compression ratios because similar values within a column are stored together, and high read performance because data is grouped by column, so only the columns that are used need to be read.

However, these formats require the structure of a row (or record) to be defined before the data is saved. This means deciding how the data will be used at storage time, and it is often difficult to decide in advance which data will be needed.

This project provides a new columnar format that does not require a schema at storage time, with compression and read performance equal to (or, in some cases, better than) other formats.
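To make the idea of "columnar without a schema" concrete, here is a minimal sketch in plain Java. It is not the Yosegi API and says nothing about Yosegi's actual on-disk layout; the class name `SchemalessColumnSketch` and the method `pivot` are hypothetical names chosen for illustration. It only shows that JSON-like records can be turned into per-field columns by discovering field names while reading, with no schema declared up front.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: this is NOT the Yosegi API or its storage layout. */
public class SchemalessColumnSketch {

  /**
   * Pivots row-oriented records into one value list per field name.
   * Field names are discovered while reading; nothing is declared in advance.
   * A record that lacks a field contributes null at its row index.
   */
  public static Map<String, List<Object>> pivot(List<Map<String, Object>> rows) {
    Map<String, List<Object>> columns = new LinkedHashMap<>();
    for (int rowIndex = 0; rowIndex < rows.size(); rowIndex++) {
      for (Map.Entry<String, Object> field : rows.get(rowIndex).entrySet()) {
        List<Object> column = columns.get(field.getKey());
        if (column == null) {
          // First appearance of this field: back-fill nulls for earlier rows.
          column = new ArrayList<>(Collections.<Object>nCopies(rowIndex, null));
          columns.put(field.getKey(), column);
        }
        column.add(field.getValue());
      }
      // Pad every column that this row did not touch.
      for (List<Object> column : columns.values()) {
        if (column.size() <= rowIndex) {
          column.add(null);
        }
      }
    }
    return columns;
  }

  public static void main(String[] args) {
    Map<String, Object> first = new LinkedHashMap<>();
    first.put("id", 1);
    first.put("name", "a");

    Map<String, Object> second = new LinkedHashMap<>();
    second.put("id", 2);
    second.put("tags", Arrays.asList("x", "y")); // a field the first record lacks

    // Prints: {id=[1, 2], name=[a, null], tags=[null, [x, y]]}
    System.out.println(pivot(Arrays.asList(first, second)));
  }
}
```

Once values are grouped like this, each column holds similar data, which is the property columnar formats rely on for compression and for reading only the needed fields.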

Use cases

Data Analysis

Analyzing big data requires storing data compactly and reading it efficiently. As a columnar format, Yosegi is well suited to this need.

Data Lake

A data lake is a data pool that does not require the structure of a row (a schema) to be defined at storage time. The stored data can then be used by defining its schema at analysis time. See DataLake.
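The "schema at analysis time" idea can be sketched as follows. This is again plain Java, not the Yosegi API; `SchemaOnReadSketch`, `readAs`, and the sample field names are hypothetical. Records are stored without any declared structure, and the reader supplies the field name and expected type only when it queries the data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: this is NOT the Yosegi API. */
public class SchemaOnReadSketch {

  /**
   * Reads one field from schema-less records, applying the expected type
   * at read time. Missing or mismatched values are returned as null.
   */
  public static <T> List<T> readAs(
      List<Map<String, Object>> stored, String field, Class<T> type) {
    List<T> values = new ArrayList<>();
    for (Map<String, Object> record : stored) {
      Object raw = record.get(field);
      values.add(type.isInstance(raw) ? type.cast(raw) : null);
    }
    return values;
  }

  public static void main(String[] args) {
    // Stored as-is: no structure was declared when these were saved.
    Map<String, Object> click = new LinkedHashMap<>();
    click.put("user", "alice");
    click.put("latencyMs", 120);

    Map<String, Object> view = new LinkedHashMap<>();
    view.put("user", "bob");
    view.put("latencyMs", "n/a"); // same field, different type

    List<Map<String, Object>> stored = Arrays.asList(click, view);

    // The schema (field name plus type) is chosen only now, by the analysis.
    System.out.println(readAs(stored, "user", String.class));       // [alice, bob]
    System.out.println(readAs(stored, "latencyMs", Integer.class)); // [120, null]
  }
}
```

Returning null for mismatched values is one possible policy for this sketch; a real reader might instead convert the value or report an error.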

License

This project is released under the Apache License. Please use it in accordance with that license.

How do I get started?

Java

For basic usage, please see the quick start.

CLI

Please see the repository of yosegi-tools for details.

To see which functions it provides, look at the command list.

Apache Hadoop

Yosegi supports Apache Hadoop. Please see the repository of yosegi-hadoop for details.

For basic usage, please see the quick start.

Apache Hive

Yosegi supports Apache Hive. Please see the repository of yosegi-hive for details.

For basic usage, please see the quick start.

Apache Spark

Yosegi supports Apache Spark. Please see the repository of yosegi-spark for details.

For basic usage, please see the quick start.

Where can I get more help, if I need it?

We plan to host support and discussion of Yosegi on a mailing list. Until the mailing list is opened, please contact us via GitHub.

How to contribute

We welcome anyone who wants to join this project.

For information on how to start contributing to the project, please refer to the Yosegi contribution guide.

Building

System requirements

The following environment is required.

  • Mac OS X or Linux
  • Java 8 Update 92 or higher (8u92+), 64-bit
  • Maven 3.3.9 or later (for building)

Maven

Yosegi can be obtained from the Maven repository.

Compile sources

Compile the sources with the following command.

$ mvn clean install