xlang-foundation/xlang.datalake

xlang.datalake

A datalake based on xlang and native file system to support namespace, blob, replication, version, SQL

Background

An open-source project is a programmer's novel.

In the past I accumulated a lot of experience with Microsoft's document-based database Cosmos DB (Azure Cosmos DB) and its scripting language SCOPE (known externally as U-SQL). To compare against Cosmos DB, I also did some technical research on MongoDB. That is one source of ideas; the others are:

  • I have long experience with relational databases. Between 2014 and 2016 I tried to integrate the graph database Neo4j into my product, and during that time I paid close attention to NoSQL databases in general.
  • In deep learning research, product development, and big data work, the dataset is a very important concept. What is a dataset? A collection of data with an arbitrary, free-form hierarchical structure.
  • What about Python pickle serialization? PyTorch uses it directly to store model weights, so the design should account for it as well. An interesting idea, or a strange one?
  • When we build a website, for example a searchable documentation site like https://numpy.org , can we use the existing files on the native file system directly as its searchable document database?
  • What if the website serves images or large files such as videos (.mp4, .avi)? Most of the time we store these files directly on the file system rather than putting them into a relational database as BLOBs.
  • For any database (SQL or NoSQL) the schema is very important: should it be predefined, or derived from metadata?
  • When a very large number of files is stored on a local disk, how do we search them quickly? Treat the disk like a data lake with automatic indexing that can also be replicated to other computers, as with a cloud-based data lake.
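The last idea, indexing the files on a local disk so they can be searched quickly, can be sketched roughly as follows. This is only an illustration in Python using the standard `sqlite3` module, not the project's xlang implementation; the `files` table, its columns, and the function names `build_index` and `find_by_extension` are hypothetical.

```python
import os
import sqlite3


def build_index(root, conn):
    """Walk `root` and record basic file metadata in a hypothetical `files` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, name TEXT, ext TEXT, size INTEGER, mtime REAL)"
    )
    rows = []
    for folder, _dirs, names in os.walk(root):
        for name in names:
            full = os.path.join(folder, name)
            info = os.stat(full)  # metadata comes from the file system only
            ext = os.path.splitext(name)[1].lower()
            rows.append((full, name, ext, info.st_size, info.st_mtime))
    with conn:  # commit all rows in one transaction
        conn.executemany(
            "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?, ?)", rows
        )
    return len(rows)


def find_by_extension(conn, ext):
    """Return indexed paths with the given lowercase extension, e.g. '.mp4'."""
    cur = conn.execute(
        "SELECT path FROM files WHERE ext = ? ORDER BY path", (ext,)
    )
    return [row[0] for row in cur]
```

Once the index exists, a search becomes an ordinary SQL query against the `files` table instead of a fresh scan of the disk. Automatic re-indexing and replication are left out of this sketch.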

Important Concepts and terms

  • SQL-based query and update, with extended grammar
  • ACID is a must-have
  • Document database covering JSON, YAML, HTML, Excel files, Word files, PDF, image files, video files, etc.
  • Document metadata: retrieved from the file system and file headers
  • Document parsers, such as a PDF parser or an MS Word parser
  • Structured data
  • Table: still in the relational database domain; consider adding column-based tables like Parquet
  • Join operator across all kinds of data
  • A container is just like a folder, and a folder path looks like a namespace.
    A file can also act as a container; for example, a SQLite file is a container.
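The container and namespace idea, together with reading metadata from file headers, could look roughly like the sketch below. It is written in Python for illustration and is not the project's design: the dotted namespace separator and the names `namespace_of` and `is_container` are assumptions. The 16-byte header string is the one every SQLite database file starts with.

```python
import os
from pathlib import PurePosixPath

# Every SQLite database file begins with these 16 bytes.
SQLITE_MAGIC = b"SQLite format 3\x00"


def namespace_of(relative_path):
    """Map the folder part of a '/'-separated path to a dotted namespace.

    The '.' separator is an assumption made for this sketch.
    """
    parts = PurePosixPath(relative_path).parts
    return ".".join(parts[:-1])  # drop the file name, keep the folders


def is_container(path):
    """A folder is a container; so is a file whose header marks it as SQLite."""
    if os.path.isdir(path):
        return True
    with open(path, "rb") as f:
        return f.read(len(SQLITE_MAGIC)) == SQLITE_MAGIC
```

For example, `namespace_of("sales/2024/orders.json")` gives `"sales.2024"`. Folder names that themselves contain dots would make this mapping ambiguous, so a real design would need an escaping rule or a different separator.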

Implementation

  • Using xlang as the primary coding language
  • Using C++ to write libraries imported into xlang, to integrate with parsers (Word, Excel, PDF...)

The project is in the planning stage, and contributors are welcome to join.
