Arrow file formats (#290)
* initial commit of testfiles.

* Adding in jarrow java files so we can change them and to respect the LGPL license on them.

* A better long term jarrow solution.

* Working through more test cases.  Found potential issue with lz4 across python and java.

* Testing that python can load arrow files written out by tmd.

* Cleaning up unit tests.

* Finalizing branch.

* Small fix

* Removing unused file.
cnuernber committed Feb 23, 2022
1 parent 4039c72 commit 3b470e1
Showing 11 changed files with 833 additions and 384 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -1,4 +1,8 @@
# Changelog
# 6.066
* Major rework of arrow support to include support for all known arrow file formats
and tested files in various formats across latest (7.0.0) pyarrow.

# 6.065
* Fixing [issue 287](https://github.com/techascent/tech.ml.dataset/issues/287) - dataset corrupt after
nippy serialization. This had of course nothing to do with nippy but was caused by a bug in
5 changes: 3 additions & 2 deletions project.clj
@@ -1,10 +1,10 @@
-(defproject techascent/tech.ml.dataset "6.065"
+(defproject techascent/tech.ml.dataset "6.066-SNAPSHOT"
:description "Clojure high performance data processing system"
:url "http://github.com/techascent/tech.ml.dataset"
:license {:name "Eclipse Public License"
:url "http://www.eclipse.org/legal/epl-v10.html"}
:dependencies [[org.clojure/clojure "1.10.3" :scope "provided"]
-[cnuernber/dtype-next "9.012"]
+[cnuernber/dtype-next "9.013"]
[techascent/tech.io "4.09"
:exclusions [org.apache.commons/commons-compress]]
[com.univocity/univocity-parsers "2.9.0"]
@@ -43,6 +43,7 @@
org.slf4j/slf4j-api]
:scope "provided"]
[org.lz4/lz4-java "1.8.0" :scope "provided"]
+[com.cnuernber/jarrow "1.000"]

[uncomplicate/neanderthal "0.43.3" :scope "provided"]
;;Geni dependencies
22 changes: 22 additions & 0 deletions scripts/arrow-dtypes.py
@@ -0,0 +1,22 @@
import pandas as pd
import pyarrow as pa
import pyarrow.feather as feather

# with pa.ipc.open_stream("test/data/alldtypes.arrow-ipc") as reader:
#     df = reader.read_pandas()

# print(df)

# feather.write_feather(df, "test/data/alldtypes.arrow-feather")
# feather.write_feather(df, "test/data/alldtypes.arrow-feather-compressed", compression='zstd')

# df = df.drop(columns=["local_times"])

# feather.write_feather(df, "test/data/alldtypes.arrow-feather-v1", version=1)


with pa.ipc.open_file("test/data/alldtypes.arrow-file-zstd") as reader:
    df = reader.read_pandas()


print(df)
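
The script above switches by hand between `open_stream`, `open_file`, and Feather, depending on which container a test file uses. As a rough illustration of how those containers differ on disk, here is a minimal standard-library sketch that guesses the format from the leading bytes. It is not part of this commit. The function names are hypothetical, and it assumes the magic values from the Arrow format documentation: `ARROW1` for the IPC file format (which Feather v2 shares), `FEA1` for legacy Feather v1, and a `0xFFFFFFFF` continuation marker at the start of streams written by Arrow 0.15 or later.

```python
import struct

# Assumed magic values, taken from the Arrow format documentation.
ARROW_FILE_MAGIC = b"ARROW1"       # IPC file format; Feather v2 uses the same container
FEATHER_V1_MAGIC = b"FEA1"         # legacy Feather v1
STREAM_CONTINUATION = 0xFFFFFFFF   # leading marker of an IPC stream message (Arrow >= 0.15)


def sniff_arrow_format(head: bytes) -> str:
    """Guess the Arrow container format from the first bytes of a file (hypothetical helper)."""
    if head.startswith(ARROW_FILE_MAGIC):
        return "ipc-file"
    if head.startswith(FEATHER_V1_MAGIC):
        return "feather-v1"
    if len(head) >= 4 and struct.unpack("<I", head[:4])[0] == STREAM_CONTINUATION:
        return "ipc-stream"
    return "unknown"


def sniff_arrow_path(path: str) -> str:
    """Read the first 8 bytes of `path` and classify them."""
    with open(path, "rb") as f:
        return sniff_arrow_format(f.read(8))
```

A call such as `sniff_arrow_path("test/data/alldtypes.arrow-file-zstd")` would be expected to report `ipc-file` if that file uses the IPC file container. Compression (lz4 or zstd) is applied to the buffers inside record batches, so it should not change the leading bytes. Streams written before Arrow 0.15 lack the continuation marker and would come back as `unknown` here.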
