Fixing tests and merging master
cnuernber committed Apr 12, 2020
2 parents cd13aca + 1985868 commit f4e580a
Showing 9 changed files with 318 additions and 245 deletions.
24 changes: 14 additions & 10 deletions docs/walkthrough.md
@@ -69,7 +69,7 @@ _unnamed [2 3]:
| 2 | -32768 | 3 |
```

#### CSV/TSV Parsing Options
#### CSV/TSV/MAPSEQ/XLS/XLSX Parsing Options
It is important to note that there are several options for parsing files.
A few important ones are column whitelist/blacklists, num records,
and ways to specify exactly how to parse the string data:
@@ -81,11 +81,11 @@ user> (doc ds/->dataset)
-------------------------
tech.ml.dataset/->dataset
([dataset {:keys [table-name], :as options}] [dataset])
Create a dataset from either csv/tsv or a sequence of maps.
Create a dataset from either csv/tsv/xls/xlsx or a sequence of maps.
* A `String` or `InputStream` will be interpreted as a file (or gzipped file if it
ends with .gz) of tsv or csv data. The system will attempt to autodetect if this
is csv or tsv and has some engineering put into autodetecting column types all of
which can be overridden.
is csv or tsv and then to detect the column datatypes, all of which can
be overridden.
* A sequence of maps may be passed in in which case the first N maps are scanned in
order to derive the column datatypes before the actual columns are created.
Options:
@@ -108,13 +108,17 @@ tech.ml.dataset/->dataset
- Return value must implement tech.ml.dataset.parser.PColumnParser, in
which case that is used, or can be nil, in which case the default
column parser is used.
- tuple - pair of [datatype parse-fn] in which case container of type [datatype] will be created.
parse-fn can be one of:
:relaxed? - data will be parsed such that parse failures of the standard parse functions do not stop
the parsing process. :unparsed-values and :unparsed-indexes are available in the metadata of the
column and give the values that failed to parse and their respective indexes.
fn? - function from str -> one of #{:missing :parse-failure value}. Exceptions here always kill the parse
process.
string? - for datetime types, this will be turned into a DateTimeFormatter via DateTimeFormatter/ofPattern.
DateTimeFormatter - used with the appropriate temporal parse static function to parse the value.
- map - the header-name-or-idx is used to lookup value. If not nil, then
can be either of the two above. Else the default column parser is used.
- tuple - pair of [datatype parse-fn] in which case container of type [datatype] will be created
and parse-fn will be called for every non-empty entry and is passed a string. The return value
is inserted in the container. For datetime types, the parse-fn can in addition be a string in
which case (DateTimeFormatter/ofPattern parse-fn) will be called or parse-fn can be a
DateTimeFormatter.
value can be any of the above options. Else the default column parser is used.
:parser-scan-len - Length of initial column data used for parser-fn's datatype
detection routine. Defaults to 100.
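
The tuple form of `:parser-fn` accepts several kinds of `parse-fn` for datetime types. As a rough, self-contained illustration of that idea (not the library's implementation), the sketch below normalizes a pattern string, a `DateTimeFormatter`, or a plain function into one string-to-`LocalDate` function. The name `->date-parse-fn` is hypothetical, the example assumes a `:local-date` style column, and the `:relaxed?` keyword is left out.

```clojure
(import '[java.time LocalDate]
        '[java.time.format DateTimeFormatter])

(defn ->date-parse-fn
  "Hypothetical helper: accept a parse-fn given as a pattern string, a
  DateTimeFormatter, or a function, and return a function from string
  to LocalDate."
  [parse-fn]
  (cond
    ;; A string is treated as a pattern, as the docstring describes.
    (string? parse-fn)
    (->date-parse-fn (DateTimeFormatter/ofPattern parse-fn))

    ;; A formatter is used with the temporal type's static parse method.
    (instance? DateTimeFormatter parse-fn)
    (fn [^String s] (LocalDate/parse s ^DateTimeFormatter parse-fn))

    ;; A function is used as-is.
    (fn? parse-fn)
    parse-fn

    :else
    (throw (ex-info "Unsupported parse-fn" {:parse-fn parse-fn}))))

;; Usage: all three forms produce a callable parser.
((->date-parse-fn "MM/dd/yyyy") "04/12/2020")
```

Dispatching once up front keeps the per-value parse loop free of type checks, which is one plausible reason to accept all three forms at the option level.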

23 changes: 5 additions & 18 deletions src/tech/libs/fastexcel.clj
@@ -100,24 +100,11 @@


(defn workbook->datasets
"Returns a sequence of datasets named after the sheets.
This supports a subset of the arguments for
tech.ml.dataset.parse/csv->columns. Specifically:
header-row? - Defaults to true, indicates the first row is a header.
parser-fn -
- keyword - all columns parsed to this datatype
- ifn? - called with two arguments: (parser-fn column-name-or-idx column-data)
- Return value must implement PColumnParser, in which case that is used,
or can be nil, in which case the default column parser is used.
- map - the header-name-or-idx is used to lookup value. If not nil, then
can be either of the two above. Else the default column parser is used.
- tuple - pair of [datatype parse-fn] in which case container of type [datatype] will be created
and parse-fn will be called for every non-empty entry and is passed a string. The return value
is inserted in the container. For datetime types, the parse-fn can in addition be a string in
which case (DateTimeFormatter/ofPattern parse-fn) will be called or parse-fn can be a
DateTimeFormatter.
parser-scan-len - Length of initial column data used for parser-fn. Defaults to 100."
"Returns a sequence of datasets named after the sheets. This supports a subset of the arguments
for tech.ml.dataset/->dataset. Specifically:
:header-row?
:parser-fn
:parser-scan-len"
([input options]
(let [workbook (input->workbook input options)]
(try
19 changes: 4 additions & 15 deletions src/tech/libs/poi.clj
@@ -123,21 +123,10 @@

(defn workbook->datasets
"Returns a sequence of datasets named after the sheets. This supports a subset of the arguments
for tech.ml.dataset.parse/csv->columns. Specifically:
header-row? - Defaults to true, indicates the first row is a header.
parser-fn -
- keyword - all columns parsed to this datatype
- ifn? - called with two arguments: (parser-fn column-name-or-idx column-data)
- Return value must implement PColumnParser, in which case that is used,
or can be nil, in which case the default column parser is used.
- map - the header-name-or-idx is used to lookup value. If not nil, then
can be either of the two above. Else the default column parser is used.
- tuple - pair of [datatype parse-fn] in which case container of type [datatype] will be created
and parse-fn will be called for every non-empty entry and is passed a string. The return value
is inserted in the container. For datetime types, the parse-fn can in addition be a string in
which case (DateTimeFormatter/ofPattern parse-fn) will be called or parse-fn can be a
DateTimeFormatter.
parser-scan-len - Length of initial column data used for parser-fn. Defaults to 100."
for tech.ml.dataset/->dataset. Specifically:
:header-row?
:parser-fn
:parser-scan-len"
([input options]
(let [workbook (input->workbook input options)]
(try
16 changes: 10 additions & 6 deletions src/tech/ml/dataset/base.clj
@@ -615,13 +615,17 @@ the correct type."
- Return value must implement tech.ml.dataset.parser.PColumnParser, in
which case that is used, or can be nil, in which case the default
column parser is used.
- tuple - pair of [datatype parse-fn] in which case container of type [datatype] will be created.
parse-fn can be one of:
:relaxed? - data will be parsed such that parse failures of the standard parse functions do not stop
the parsing process. :unparsed-values and :unparsed-indexes are available in the metadata of the
column and give the values that failed to parse and their respective indexes.
fn? - function from str -> one of #{:missing :parse-failure value}. Exceptions here always kill the parse
process.
string? - for datetime types, this will be turned into a DateTimeFormatter via DateTimeFormatter/ofPattern.
DateTimeFormatter - used with the appropriate temporal parse static function to parse the value.
- map - the header-name-or-idx is used to lookup value. If not nil, then
can be either of the two above. Else the default column parser is used.
- tuple - pair of [datatype parse-fn] in which case container of type [datatype] will be created
and parse-fn will be called for every non-empty entry and is passed a string. The return value
is inserted in the container. For datetime types, the parse-fn can in addition be a string in
which case (DateTimeFormatter/ofPattern parse-fn) will be called or parse-fn can be a
DateTimeFormatter.
value can be any of the above options. Else the default column parser is used.
:parser-scan-len - Length of initial column data used for parser-fn's datatype
detection routine. Defaults to 100.
37 changes: 37 additions & 0 deletions src/tech/ml/dataset/column.clj
@@ -2,9 +2,12 @@
(:require [tech.ml.protocols.column :as col-proto]
[tech.ml.dataset.impl.column :as col-impl]
[tech.ml.dataset.string-table :as str-table]
[tech.ml.dataset.parse :as ds-parse]
[tech.parallel.for :as parallel-for]
[tech.v2.datatype :as dtype]
[tech.v2.datatype.protocols :as dtype-proto]
[tech.v2.datatype.casting :as casting]
[tech.v2.datatype.typecast :as typecast]
[tech.v2.datatype.readers.concat :as concat-rdr]
[tech.v2.datatype.readers.const :as const-rdr])
(:import [it.unimi.dsi.fastutil.longs LongArrayList]
@@ -13,6 +16,9 @@
[org.roaringbitmap RoaringBitmap]))


(declare new-column)


(defn is-column?
"Return true if this item is a column."
[item]
@@ -134,6 +140,37 @@ Implementations should check their metadata before doing calculations."
nil))))


(defn parse-column
"Parse a text or a str column, returning a new column with the same name but with
a different datatype. This method is single-threaded.
parser-fn-or-kwd is nil by default and can be the keyword :relaxed? or a function that
must return one of: a parsed value; :tech.ml.dataset.parse/missing, in which case a
missing value will be added; or :tech.ml.dataset.parse/parse-failure, in which case
a missing index will be added and the string value will be recorded in the metadata's
:unparsed-data, :unparsed-indexes entries."
([datatype options col]
(let [colname (column-name col)
parse-fn (:parse-fn options datatype)
parser-scan-len (:parser-scan-len options 100)
col-reader (typecast/datatype->reader
:object
(-> (dtype/->reader col)
(ds-parse/convert-reader-to-strings)))
col-parser (ds-parse/make-parser parse-fn (column-name col)
(take parser-scan-len col-reader))
^RoaringBitmap missing (dtype-proto/as-roaring-bitmap (missing col))
n-elems (dtype/ecount col-reader)]
(dotimes [iter n-elems]
(if (.contains missing iter)
(ds-parse/missing! col-parser)
(ds-parse/parse! col-parser (.read col-reader iter))))
(let [{:keys [data missing metadata]} (ds-parse/column-data col-parser)]
(new-column colname data metadata missing))))
([datatype col]
(parse-column datatype {} col)))


(def object-primitive-array-types
{(Class/forName "[Ljava.lang.Boolean;") :boolean
(Class/forName "[Ljava.lang.Byte;") :int8
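
The contract described in the `parse-column` docstring (a parse function returns a value, a missing marker, or a parse-failure marker) can be illustrated without the library. The following is a minimal sketch under stated assumptions, not the code from this commit: `parse-strings` and `relaxed-long` are hypothetical names, plain `:missing` and `:parse-failure` keywords stand in for the namespaced `:tech.ml.dataset.parse/...` keywords, a Clojure set stands in for the `RoaringBitmap`, and the result keys `:unparsed-values` and `:unparsed-indexes` follow the `->dataset` docstring.

```clojure
(defn parse-strings
  "Hypothetical, simplified stand-in for a single-threaded parse loop.
  parse-fn maps a string to a value, :missing, or :parse-failure."
  [parse-fn missing-value strings]
  (reduce-kv
   (fn [acc idx s]
     (let [v (parse-fn s)]
       (cond
         ;; Missing entries get a placeholder value and a missing index.
         (= v :missing)
         (-> acc
             (update :data conj missing-value)
             (update :missing conj idx))

         ;; Failures are also marked missing, and the raw string is kept.
         (= v :parse-failure)
         (-> acc
             (update :data conj missing-value)
             (update :missing conj idx)
             (update :unparsed-indexes conj idx)
             (update :unparsed-values conj s))

         :else
         (update acc :data conj v))))
   {:data [] :missing #{} :unparsed-indexes [] :unparsed-values []}
   (vec strings)))

(defn relaxed-long
  "Hypothetical parse-fn: empty strings are missing, bad input is a failure."
  [^String s]
  (if (empty? s)
    :missing
    (try
      (Long/parseLong s)
      (catch NumberFormatException _ :parse-failure))))

;; Usage: index 1 is missing, index 2 fails to parse.
(parse-strings relaxed-long 0 ["1" "" "x" "4"])
```

Returning marker keywords instead of throwing lets one bad cell be recorded and skipped, while a genuine exception from the parse function still stops the whole parse.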
