parse boolean string default to string, and update ->dataset docs (#167)

* parse boolean string default to string, and update ->dataset docs
* add test data
* :parser-fn default to :bool, only "true" and "false" are parsed
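
The stricter default described in the last bullet can be sketched roughly as follows. This is a minimal illustration and not the library's code: the function name `strict-bool`, the namespace `example.strict-bool`, and the `::parse-failure` keyword are hypothetical stand-ins for the `:bool` coercer and the parser's failure marker.

```clojure
(ns example.strict-bool)

;; Hypothetical stand-in for the :bool coercer added in this commit:
;; only the exact strings "true" and "false" become booleans.
(defn strict-bool
  [value]
  (if (string? value)
    (case value
      "true"  true
      "false" false
      ::parse-failure)   ; anything else ("t", "False", "yes") is rejected
    (boolean value)))    ; non-string input is coerced with clojure.core/boolean

(strict-bool "true")   ;; => true
(strict-bool "False")  ;; => :example.strict-bool/parse-failure
```

A column holding values such as `t` or `False` therefore appears to fall through to `:string` under the default parser sequence, which is what the new test expects for the `boolstr` column. The permissive parsing remains available by requesting `:boolean` explicitly through `:parser-fn`.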
techascent · Nov 13, 2020 · d3a9dc8
1 parent 19100d7
commit d3a9dc8
Showing 7 changed files with 105 additions and 33 deletions.
diff --git a/docs/tech.v3.dataset.html b/docs/tech.v3.dataset.html
@@ -25,14 +25,22 @@
   <li><code>:max-chars-per-column</code> - Defaults to 4096. Columns with more characters than this will result in an exception.</li>
   <li><code>:max-num-columns</code> - Defaults to 8192. CSV,TSV files with more columns than this  will fail to parse. For more information on this option, please visit:  <a href="https://github.com/uniVocity/univocity-parsers/issues/301">https://github.com/uniVocity/univocity-parsers/issues/301</a></li>
   <li><code>:n-initial-skip-rows</code> - Skip N rows initially. This currently may include the header  row. Works across both csv and spreadsheet datasets.</li>
-  <li><code>:parser-fn</code> -</li>
-  <li><code>keyword?</code> - all columns parsed to this datatype</li>
-  <li>tuple - pair of [datatype <code>parse-data</code>] in which case container of type [datatype] will be created. <code>parse-data</code> can be one of:
+  <li><code>:parser-fn</code> -
     <ul>
-      <li><code>:relaxed?</code> - data will be parsed such that parse failures of the standard  parse functions do not stop the parsing process. :unparsed-values and  :unparsed-indexes are available in the metadata of the column that tell  you the values that failed to parse and their respective indexes.</li>
-      <li><code>fn?</code> - function from str-&gt; one of <code>:tech.ml.dataset.parser/missing</code>,  <code>:tech.ml.dataset.parser/parse-failure</code>, or the parsed value.  Exceptions here always kill the parse process. :missing will get marked  in the missing indexes, and :parse-failure will result in the index being  added to missing, the unparsed the column’s :unparsed-values and  :unparsed-indexes will be updated.</li>
-      <li><code>string?</code> - for datetime types, this will turned into a DateTimeFormatter via  DateTimeFormatter/ofPattern. For encoded-text, this has to be a valid  argument to Charset/forName.</li>
-      <li><code>DateTimeFormatter</code> - use with the appropriate temporal parse static function  to parse the value.</li>
+      <li><code>keyword?</code> - all columns parsed to this datatype. For example: <code>{:parser-fn :string}</code></li>
+      <li><code>map?</code> - <code>{column-name parse-method}</code> parse each column with specified <code>parse-method</code>.  The <code>parse-method</code> can be:
+        <ul>
+          <li><code>keyword?</code> - parse the specified column to this datatype. For example:  <code>{:parser-fn {:answer :boolean :id :int32}}</code></li>
+          <li>tuple - pair of <code>[datatype parse-data]</code> in which case container of type  <code>[datatype]</code> will be created. <code>parse-data</code> can be one of:
+            <ul>
+              <li><code>:relaxed?</code> - data will be parsed such that parse failures of the standard  parse functions do not stop the parsing process. :unparsed-values and  :unparsed-indexes are available in the metadata of the column that tell  you the values that failed to parse and their respective indexes.</li>
+              <li><code>fn?</code> - function from str-&gt; one of <code>:tech.ml.dataset.parser/missing</code>,  <code>:tech.ml.dataset.parser/parse-failure</code>, or the parsed value.  Exceptions here always kill the parse process. :missing will get marked  in the missing indexes, and :parse-failure will result in the index being  added to missing, and the column’s :unparsed-values and  :unparsed-indexes will be updated.</li>
+              <li><code>string?</code> - for datetime types, this will be turned into a DateTimeFormatter via  DateTimeFormatter/ofPattern. For encoded-text, this has to be a valid  argument to Charset/forName.</li>
+              <li><code>DateTimeFormatter</code> - use with the appropriate temporal parse static function  to parse the value.</li>
+            </ul>
+          </li>
+        </ul>
+      </li>
     </ul>
   </li>
   <li><code>map?</code> - the header-name-or-idx is used to lookup value. If not nil, then  value can be any of the above options. Else the default column parser  is used.</li>
@@ -139,4 +147,4 @@
   <li><code>:max-chars-per-column</code> - csv,tsv specific, defaults to 65536 - values longer than this will  cause an exception during serialization.</li>
   <li><code>:max-num-columns</code> - csv,tsv specific, defaults to 8192 - If the dataset has more than this number of  columns an exception will be thrown during serialization.</li>
   <li><code>:quoted-columns</code> - csv specific - sequence of column names that you would like to always have quoted.</li>
-</ul></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/io.clj#L205">view source</a></div></div></div></body></html>
+</ul></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/io.clj#L209">view source</a></div></div></div></body></html>
diff --git a/src/tech/v3/dataset/io.clj b/src/tech/v3/dataset/io.clj
@@ -129,24 +129,28 @@
   - `:n-initial-skip-rows` - Skip N rows initially.  This currently may include the header
      row.  Works across both csv and spreadsheet datasets.
   - `:parser-fn` -
-    - `keyword?` - all columns parsed to this datatype
-    - tuple - pair of [datatype `parse-data`] in which case container of type
-      [datatype] will be created. `parse-data` can be one of:
-        - `:relaxed?` - data will be parsed such that parse failures of the standard
-           parse functions do not stop the parsing process.  :unparsed-values and
-           :unparsed-indexes are available in the metadata of the column that tell
-           you the values that failed to parse and their respective indexes.
-        - `fn?` - function from str-> one of `:tech.ml.dataset.parser/missing`,
-           `:tech.ml.dataset.parser/parse-failure`, or the parsed value.
-           Exceptions here always kill the parse process.  :missing will get marked
-           in the missing indexes, and :parse-failure will result in the index being
-           added to missing, the unparsed the column's :unparsed-values and
-           :unparsed-indexes will be updated.
-        - `string?` - for datetime types, this will turned into a DateTimeFormatter via
-           DateTimeFormatter/ofPattern.  For encoded-text, this has to be a valid
-           argument to Charset/forName.
-        - `DateTimeFormatter` - use with the appropriate temporal parse static function
-           to parse the value.
+      - `keyword?` - all columns parsed to this datatype. For example: `{:parser-fn :string}`
+      - `map?` - `{column-name parse-method}` parse each column with specified `parse-method`.
+        The `parse-method` can be:
+          - `keyword?` - parse the specified column to this datatype. For example:
+            `{:parser-fn {:answer :boolean :id :int32}}`
+          - tuple - pair of `[datatype parse-data]` in which case container of type
+            `[datatype]` will be created. `parse-data` can be one of:
+              - `:relaxed?` - data will be parsed such that parse failures of the standard
+                 parse functions do not stop the parsing process.  :unparsed-values and
+                 :unparsed-indexes are available in the metadata of the column that tell
+                 you the values that failed to parse and their respective indexes.
+              - `fn?` - function from str-> one of `:tech.ml.dataset.parser/missing`,
+                 `:tech.ml.dataset.parser/parse-failure`, or the parsed value.
+                 Exceptions here always kill the parse process.  :missing will get marked
+                 in the missing indexes, and :parse-failure will result in the index being
+                 added to missing, and the column's :unparsed-values and
+                 :unparsed-indexes will be updated.
+              - `string?` - for datetime types, this will be turned into a DateTimeFormatter via
+                 DateTimeFormatter/ofPattern.  For encoded-text, this has to be a valid
+                 argument to Charset/forName.
+              - `DateTimeFormatter` - use with the appropriate temporal parse static function
+                 to parse the value.
    - `map?` - the header-name-or-idx is used to lookup value.  If not nil, then
            value can be any of the above options.  Else the default column parser
            is used.

diff --git a/src/tech/v3/dataset/io/column_parsers.clj b/src/tech/v3/dataset/io/column_parsers.clj
@@ -56,7 +56,14 @@
 
 (def default-coercers
   (merge
-   {:boolean #(if (string? %)
+   {:bool #(if (string? %)
+             (let [^String data %]
+               (cond
+                 (.equals "true" data) true
+                 (.equals "false" data) false
+                 :else parse-failure))
+             (boolean %))
+    :boolean #(if (string? %)
                 (let [^String data %]
                   (cond
                     (or (.equalsIgnoreCase "t" data)
@@ -141,6 +148,8 @@
     (loop [n-elems n-elems]
       (when (< n-elems idx)
         (case simple-dtype
+          :bool
+          (.addBoolean container false)
           :boolean
           (.addBoolean container false)
           :int64
@@ -282,9 +291,9 @@
 
 
 (def default-parser-datatype-sequence
-  [:boolean :int16 :int32 :int64 :float64 :uuid
+  [:bool :int16 :int32 :int64 :float64 :uuid
    :packed-duration :packed-local-date
-   :zoned-date-time :string :text])
+   :zoned-date-time :string :text :boolean])
 
 
 (defn- promote-container
@@ -363,7 +372,8 @@
                           (cond
                             (== 0 n-valid)
                             parser-data
-                            (and (or (= :boolean container-dtype)
+                            (and (or (= :bool container-dtype)
+                                     (= :boolean container-dtype)
                                      (casting/numeric-type? container-dtype))
                                  (casting/numeric-type?
                                   (packing/unpack-datatype parser-datatype)))
@@ -399,7 +409,7 @@
 (defn promotional-string-parser
   ([column-name parser-datatype-sequence]
    (let [first-dtype (first parser-datatype-sequence)]
-     (PromotionalStringParser. (column-base/make-container first-dtype)
+     (PromotionalStringParser. (column-base/make-container (if (= :bool first-dtype) :boolean first-dtype))
                                first-dtype
                                false
                                (default-coercers first-dtype)

diff --git a/test/data/datatype_parser.csv b/test/data/datatype_parser.csv
@@ -0,0 +1,11 @@
+id, char,   word,     bool,   boolstr, boolean
+1,  t,      true,     true,   true,    t
+2,  f,      False,    true,   true,    y
+3,  y,      YES,      false,  false,   n
+4,  n,      NO,       false,  false,   f
+5,  T,      positive, true,   true,    true
+6,  F,      negative, false,  false,   false
+7,  Y,      yep,      true,   true,    positive
+8,  N,      not,      false,  false,   negative
+9,  A,      pos,      false,  False,   negative
+10, z,      neg,      false,  false,   negative
diff --git a/test/tech/v3/dataset/ames_test.clj b/test/tech/v3/dataset/ames_test.clj
@@ -21,7 +21,8 @@
            (dtype/->vector new-col)))))
 
 
-(def src-ds (ds/->dataset "test/data/ames-house-prices/train.csv"))
+(def src-ds (ds/->dataset "test/data/ames-house-prices/train.csv"
+                          {:parser-fn {"CentralAir" :boolean}}))
 
 
 (defn missing-pipeline

diff --git a/test/tech/v3/dataset/parse_test.clj b/test/tech/v3/dataset/parse_test.clj
@@ -54,7 +54,7 @@
    ["BsmtHalfBath" :int16]
    ["BsmtQual" :string]
    ["BsmtUnfSF" :int16]
-   ["CentralAir" :boolean]
+   ["CentralAir" :string]
    ["Condition1" :string]
    ["Condition2" :string]
    ["Electrical" :string]

diff --git a/test/tech/v3/dataset_test.clj b/test/tech/v3/dataset_test.clj
@@ -17,6 +17,44 @@
   (:import [java.util List HashSet UUID]
            [java.io File]))
 
+(deftest datatype-parser
+  (let [ds (ds/->dataset "test/data/datatype_parser.csv")]
+    (is (= :int16 (dtype/get-datatype (ds/column ds "id"))))
+    (is (= [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] (ds/column ds "id")))
+    (is (= :string (dtype/get-datatype (ds/column ds "char"))))
+    (is (= ["t", "f", "y", "n", "T", "F", "Y", "N", "A", "z"]
+           (ds/column ds "char")))
+    (is (= :string (dtype/get-datatype (ds/column ds "word"))))
+    (is (= ["true", "False", "YES", "NO", "positive", "negative", "yep", "not", "pos", "neg"]
+           (ds/column ds "word")))
+    (is (= :boolean (dtype/get-datatype (ds/column ds "bool"))))
+    (is (= [true, true, false, false, true, false, true, false, false, false]
+           (ds/column ds "bool")))
+    (is (= :string (dtype/get-datatype (ds/column ds "boolstr"))))
+    (is (= ["true", "true", "false", "false", "true", "false", "true", "false", "False", "false"]
+           (ds/column ds "boolstr")))
+    (is (= :string (dtype/get-datatype (ds/column ds "boolean"))))
+    (is (= ["t", "y", "n", "f", "true", "false", "positive", "negative", "negative", "negative"]
+           (ds/column ds "boolean"))))
+  (let [ds (ds/->dataset "test/data/datatype_parser.csv" {:parser-fn {"boolean" :boolean
+                                                                      "boolstr" :boolean}})]
+    (is (= :int16 (dtype/get-datatype (ds/column ds "id"))))
+    (is (= [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] (ds/column ds "id")))
+    (is (= :string (dtype/get-datatype (ds/column ds "char"))))
+    (is (= ["t", "f", "y", "n", "T", "F", "Y", "N", "A", "z"]
+           (ds/column ds "char")))
+    (is (= :string (dtype/get-datatype (ds/column ds "word"))))
+    (is (= ["true", "False", "YES", "NO", "positive", "negative", "yep", "not", "pos", "neg"]
+           (ds/column ds "word")))
+    (is (= :boolean (dtype/get-datatype (ds/column ds "bool"))))
+    (is (= [true, true, false, false, true, false, true, false, false, false]
+           (ds/column ds "bool")))
+    (is (= :boolean (dtype/get-datatype (ds/column ds "boolstr"))))
+    (is (= [true, true, false, false, true, false, true, false, false, false]
+           (ds/column ds "boolstr")))
+    (is (= :boolean (dtype/get-datatype (ds/column ds "boolean"))))
+    (is (= [true, true, false, false, true, false, true, false, false, false]
+           (ds/column ds "boolean")))))
 
 (deftest iterable
   (let [ds (ds/->dataset (test-utils/mapseq-fruit-dataset))]