parse boolean string default to string, and update ->dataset docs (#167)
* parse boolean string default to string, and update ->dataset docs

* add test data

* :parser-fn default to :bool, only "true" and "false" is parsed
kimim committed Nov 13, 2020
1 parent 19100d7 commit d3a9dc8
Showing 7 changed files with 105 additions and 33 deletions.
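The parsing change described in the commit message can be illustrated with a small standalone sketch that does not use the library. `strict-bool` follows the `:bool` coercer shown in the diff for string input: only the exact strings "true" and "false" parse. `lenient-boolean` is an assumption reconstructed from the test data in this commit, because the full `:boolean` coercer is truncated in the diff. The names `strict-bool`, `lenient-boolean`, `truthy`, `falsy`, and the local `parse-failure` sentinel are all hypothetical.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in for the library's parse-failure sentinel.
(def parse-failure ::parse-failure)

(defn strict-bool
  "Sketch of the new default: only the exact strings \"true\" and
  \"false\" are parsed; anything else is a parse failure."
  [s]
  (cond
    (= "true" s)  true
    (= "false" s) false
    :else         parse-failure))

;; Assumed token sets, inferred from the commit's test CSV and expected
;; results. The real :boolean coercer may accept more (or fewer) tokens.
(def truthy #{"t" "y" "true" "positive"})
(def falsy  #{"f" "n" "false" "negative"})

(defn lenient-boolean
  "Sketch of the opt-in :boolean behavior: case-insensitive matching
  against the assumed token sets above."
  [s]
  (let [v (str/lower-case s)]
    (cond
      (contains? truthy v) true
      (contains? falsy v)  false
      :else                parse-failure)))

;; Usage: the same input is handled differently by the two parsers.
(strict-bool "False")      ; parse failure under the strict rule
(lenient-boolean "False")  ; parses under the lenient rule
```

Under these assumptions, a column such as `boolstr` in the test CSV, which contains one `False` value, fails the strict parse and is promoted to `:string` by default, which matches what the new test expects. It is only parsed as boolean when `:boolean` is requested for it through `:parser-fn`.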
24 changes: 16 additions & 8 deletions docs/tech.v3.dataset.html
@@ -25,14 +25,22 @@
<li><code>:max-chars-per-column</code> - Defaults to 4096. Columns with more characters that this will result in an exception.</li>
<li><code>:max-num-columns</code> - Defaults to 8192. CSV,TSV files with more columns than this will fail to parse. For more information on this option, please visit: <a href="https://github.com/uniVocity/univocity-parsers/issues/301">https://github.com/uniVocity/univocity-parsers/issues/301</a></li>
<li><code>:n-initial-skip-rows</code> - Skip N rows initially. This currently may include the header row. Works across both csv and spreadsheet datasets.</li>
<li><code>:parser-fn</code> -</li>
<li><code>keyword?</code> - all columns parsed to this datatype</li>
<li>tuple - pair of [datatype <code>parse-data</code>] in which case container of type [datatype] will be created. <code>parse-data</code> can be one of:
<li><code>:parser-fn</code> -
<ul>
<li><code>:relaxed?</code> - data will be parsed such that parse failures of the standard parse functions do not stop the parsing process. :unparsed-values and :unparsed-indexes are available in the metadata of the column that tell you the values that failed to parse and their respective indexes.</li>
<li><code>fn?</code> - function from str-&gt; one of <code>:tech.ml.dataset.parser/missing</code>, <code>:tech.ml.dataset.parser/parse-failure</code>, or the parsed value. Exceptions here always kill the parse process. :missing will get marked in the missing indexes, and :parse-failure will result in the index being added to missing, the unparsed the column’s :unparsed-values and :unparsed-indexes will be updated.</li>
<li><code>string?</code> - for datetime types, this will turned into a DateTimeFormatter via DateTimeFormatter/ofPattern. For encoded-text, this has to be a valid argument to Charset/forName.</li>
<li><code>DateTimeFormatter</code> - use with the appropriate temporal parse static function to parse the value.</li>
<li><code>keyword?</code> - all columns parsed to this datatype. For example: <code>{:parser-fn :string}</code></li>
<li><code>map?</code> - <code>{column-name parse-method}</code> parse each column with specified <code>parse-method</code>. The <code>parse-method</code> can be:
<ul>
<li><code>keyword?</code> - parse the specified column to this datatype. For example: <code>{:parser-fn {:answer :boolean :id :int32}}</code></li>
<li>tuple - pair of <code>[datatype parse-data]</code> in which case container of type <code>[datatype]</code> will be created. <code>parse-data</code> can be one of:
<ul>
<li><code>:relaxed?</code> - data will be parsed such that parse failures of the standard parse functions do not stop the parsing process. :unparsed-values and :unparsed-indexes are available in the metadata of the column that tell you the values that failed to parse and their respective indexes.</li>
<li><code>fn?</code> - function from str-&gt; one of <code>:tech.ml.dataset.parser/missing</code>, <code>:tech.ml.dataset.parser/parse-failure</code>, or the parsed value. Exceptions here always kill the parse process. :missing will get marked in the missing indexes, and :parse-failure will result in the index being added to missing, the unparsed the column’s :unparsed-values and :unparsed-indexes will be updated.</li>
<li><code>string?</code> - for datetime types, this will turned into a DateTimeFormatter via DateTimeFormatter/ofPattern. For encoded-text, this has to be a valid argument to Charset/forName.</li>
<li><code>DateTimeFormatter</code> - use with the appropriate temporal parse static function to parse the value.</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li><code>map?</code> - the header-name-or-idx is used to lookup value. If not nil, then value can be any of the above options. Else the default column parser is used.</li>
@@ -139,4 +147,4 @@
<li><code>:max-chars-per-column</code> - csv,tsv specific, defaults to 65536 - values longer than this will cause an exception during serialization.</li>
<li><code>:max-num-columns</code> - csv,tsv specific, defaults to 8192 - If the dataset has more than this number of columns an exception will be thrown during serialization.</li>
<li><code>:quoted-columns</code> - csv specific - sequence of columns names that you would like to always have quoted.</li>
</ul></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/io.clj#L205">view source</a></div></div></div></body></html>
</ul></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/io.clj#L209">view source</a></div></div></div></body></html>
40 changes: 22 additions & 18 deletions src/tech/v3/dataset/io.clj
@@ -129,24 +129,28 @@
- `:n-initial-skip-rows` - Skip N rows initially. This currently may include the header
row. Works across both csv and spreadsheet datasets.
- `:parser-fn` -
- `keyword?` - all columns parsed to this datatype
- tuple - pair of [datatype `parse-data`] in which case container of type
[datatype] will be created. `parse-data` can be one of:
- `:relaxed?` - data will be parsed such that parse failures of the standard
parse functions do not stop the parsing process. :unparsed-values and
:unparsed-indexes are available in the metadata of the column that tell
you the values that failed to parse and their respective indexes.
- `fn?` - function from str-> one of `:tech.ml.dataset.parser/missing`,
`:tech.ml.dataset.parser/parse-failure`, or the parsed value.
Exceptions here always kill the parse process. :missing will get marked
in the missing indexes, and :parse-failure will result in the index being
added to missing, the unparsed the column's :unparsed-values and
:unparsed-indexes will be updated.
- `string?` - for datetime types, this will turned into a DateTimeFormatter via
DateTimeFormatter/ofPattern. For encoded-text, this has to be a valid
argument to Charset/forName.
- `DateTimeFormatter` - use with the appropriate temporal parse static function
to parse the value.
- `keyword?` - all columns parsed to this datatype. For example: `{:parser-fn :string}`
- `map?` - `{column-name parse-method}` parse each column with specified `parse-method`.
The `parse-method` can be:
- `keyword?` - parse the specified column to this datatype. For example:
`{:parser-fn {:answer :boolean :id :int32}}`
- tuple - pair of `[datatype parse-data]` in which case container of type
`[datatype]` will be created. `parse-data` can be one of:
- `:relaxed?` - data will be parsed such that parse failures of the standard
parse functions do not stop the parsing process. :unparsed-values and
:unparsed-indexes are available in the metadata of the column that tell
you the values that failed to parse and their respective indexes.
- `fn?` - function from str-> one of `:tech.ml.dataset.parser/missing`,
`:tech.ml.dataset.parser/parse-failure`, or the parsed value.
Exceptions here always kill the parse process. :missing will get marked
in the missing indexes, and :parse-failure will result in the index being
added to missing, the unparsed the column's :unparsed-values and
:unparsed-indexes will be updated.
- `string?` - for datetime types, this will turned into a DateTimeFormatter via
DateTimeFormatter/ofPattern. For encoded-text, this has to be a valid
argument to Charset/forName.
- `DateTimeFormatter` - use with the appropriate temporal parse static function
to parse the value.
- `map?` - the header-name-or-idx is used to lookup value. If not nil, then
value can be any of the above options. Else the default column parser
is used.
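The `:parser-fn` forms listed in the docstring above can be sketched as plain option maps. This is a minimal illustration, not taken from the library's own examples: the `"id"` and `"boolstr"` column names come from the test CSV in this commit, while the `"when"` column and its date pattern are hypothetical, and the commented-out call assumes `tech.v3.dataset` is required as `ds`.

```clojure
;; Plain data only; nothing in this block calls the library.

;; keyword? form: every column is parsed to this datatype.
(def all-strings
  {:parser-fn :string})

;; map? form with keyword parse-methods, one entry per column name.
(def per-column
  {:parser-fn {"boolstr" :boolean
               "id"      :int32}})

;; map? form with tuple parse-methods: [datatype parse-data].
;; "when" is a hypothetical column; the pattern string would be turned
;; into a DateTimeFormatter according to the docstring.
(def with-tuples
  {:parser-fn {"id"   [:int32 :relaxed?]
               "when" [:packed-local-date "yyyy-MM-dd"]}})

;; Assumed usage (requires the library and the test file on disk):
;; (ds/->dataset "test/data/datatype_parser.csv" per-column)
```

Columns that are not named in the map fall back to the default column parser, as the last bullet of the docstring states.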
20 changes: 15 additions & 5 deletions src/tech/v3/dataset/io/column_parsers.clj
@@ -56,7 +56,14 @@

(def default-coercers
(merge
{:boolean #(if (string? %)
{:bool #(if (string? %)
(let [^String data %]
(cond
(.equals "true" data) true
(.equals "false" data) false
:else parse-failure))
(boolean %))
:boolean #(if (string? %)
(let [^String data %]
(cond
(or (.equalsIgnoreCase "t" data)
@@ -141,6 +148,8 @@
(loop [n-elems n-elems]
(when (< n-elems idx)
(case simple-dtype
:bool
(.addBoolean container false)
:boolean
(.addBoolean container false)
:int64
@@ -282,9 +291,9 @@


(def default-parser-datatype-sequence
[:boolean :int16 :int32 :int64 :float64 :uuid
[:bool :int16 :int32 :int64 :float64 :uuid
:packed-duration :packed-local-date
:zoned-date-time :string :text])
:zoned-date-time :string :text :boolean])


(defn- promote-container
@@ -363,7 +372,8 @@
(cond
(== 0 n-valid)
parser-data
(and (or (= :boolean container-dtype)
(and (or (= :bool container-dtype)
(= :boolean container-dtype)
(casting/numeric-type? container-dtype))
(casting/numeric-type?
(packing/unpack-datatype parser-datatype)))
@@ -399,7 +409,7 @@
(defn promotional-string-parser
([column-name parser-datatype-sequence]
(let [first-dtype (first parser-datatype-sequence)]
(PromotionalStringParser. (column-base/make-container first-dtype)
(PromotionalStringParser. (column-base/make-container (if (= :bool first-dtype) :boolean first-dtype))
first-dtype
false
(default-coercers first-dtype)
11 changes: 11 additions & 0 deletions test/data/datatype_parser.csv
@@ -0,0 +1,11 @@
id, char, word, bool, boolstr, boolean
1, t, true, true, true, t
2, f, False, true, true, y
3, y, YES, false, false, n
4, n, NO, false, false, f
5, T, positive, true, true, true
6, F, negative, false, false, false
7, Y, yep, true, true, positive
8, N, not, false, false, negative
9, A, pos, false, False, negative
10, z, neg, false, false, negative
3 changes: 2 additions & 1 deletion test/tech/v3/dataset/ames_test.clj
@@ -21,7 +21,8 @@
(dtype/->vector new-col)))))


(def src-ds (ds/->dataset "test/data/ames-house-prices/train.csv"))
(def src-ds (ds/->dataset "test/data/ames-house-prices/train.csv"
{:parser-fn {"CentralAir" :boolean}}))


(defn missing-pipeline
2 changes: 1 addition & 1 deletion test/tech/v3/dataset/parse_test.clj
@@ -54,7 +54,7 @@
["BsmtHalfBath" :int16]
["BsmtQual" :string]
["BsmtUnfSF" :int16]
["CentralAir" :boolean]
["CentralAir" :string]
["Condition1" :string]
["Condition2" :string]
["Electrical" :string]
38 changes: 38 additions & 0 deletions test/tech/v3/dataset_test.clj
@@ -17,6 +17,44 @@
(:import [java.util List HashSet UUID]
[java.io File]))

(deftest datatype-parser
(let [ds (ds/->dataset "test/data/datatype_parser.csv")]
(is (= :int16 (dtype/get-datatype (ds/column ds "id"))))
(is (= [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] (ds/column ds "id")))
(is (= :string (dtype/get-datatype (ds/column ds "char"))))
(is (= ["t", "f", "y", "n", "T", "F", "Y", "N", "A", "z"]
(ds/column ds "char")))
(is (= :string (dtype/get-datatype (ds/column ds "word"))))
(is (= ["true", "False", "YES", "NO", "positive", "negative", "yep", "not", "pos", "neg"]
(ds/column ds "word")))
(is (= :boolean (dtype/get-datatype (ds/column ds "bool"))))
(is (= [true, true, false, false, true, false, true, false, false, false]
(ds/column ds "bool")))
(is (= :string (dtype/get-datatype (ds/column ds "boolstr"))))
(is (= ["true", "true", "false", "false", "true", "false", "true", "false", "False", "false"]
(ds/column ds "boolstr")))
(is (= :string (dtype/get-datatype (ds/column ds "boolean"))))
(is (= ["t", "y", "n", "f", "true", "false", "positive", "negative", "negative", "negative"]
(ds/column ds "boolean"))))
(let [ds (ds/->dataset "test/data/datatype_parser.csv" {:parser-fn {"boolean" :boolean
"boolstr" :boolean}})]
(is (= :int16 (dtype/get-datatype (ds/column ds "id"))))
(is (= [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] (ds/column ds "id")))
(is (= :string (dtype/get-datatype (ds/column ds "char"))))
(is (= ["t", "f", "y", "n", "T", "F", "Y", "N", "A", "z"]
(ds/column ds "char")))
(is (= :string (dtype/get-datatype (ds/column ds "word"))))
(is (= ["true", "False", "YES", "NO", "positive", "negative", "yep", "not", "pos", "neg"]
(ds/column ds "word")))
(is (= :boolean (dtype/get-datatype (ds/column ds "bool"))))
(is (= [true, true, false, false, true, false, true, false, false, false]
(ds/column ds "boolean")))
(is (= :boolean (dtype/get-datatype (ds/column ds "boolstr"))))
(is (= [true, true, false, false, true, false, true, false, false, false]
(ds/column ds "boolstr")))
(is (= :boolean (dtype/get-datatype (ds/column ds "boolean"))))
(is (= [true, true, false, false, true, false, true, false, false, false]
(ds/column ds "boolean")))))

(deftest iterable
(let [ds (ds/->dataset (test-utils/mapseq-fruit-dataset))]