Documentation pass. Fixes #243
cnuernber committed May 13, 2021
1 parent af25092 commit 2f68e5a
Showing 4 changed files with 43 additions and 18 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -67,6 +67,8 @@ test/data/stocks.csv [5 3]:
| MSFT | 2000-04-01 | 28.37 |
| MSFT | 2000-05-01 | 25.45 |

;; tech.v3.libs.poi registers xls, tech.v3.libs.fastexcel registers xlsx. If you want
;; to use poi for everything use workbook->datasets in the tech.v3.libs.poi namespace.
user> (require '[tech.v3.libs.poi])
nil
user> (def xls-data (ds/->dataset "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/file_example_XLS_1000.xls"))
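The README comment above says each namespace registers one file type. A minimal sketch of what that means in practice, assuming both optional Excel dependencies are on the classpath; the helper names `load-first-sheet` and `load-all-sheets-with-poi` are hypothetical and not part of the library:

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.poi :as poi]   ; loading this registers the xls handler
         '[tech.v3.libs.fastexcel])    ; loading this registers the xlsx handler

(defn load-first-sheet
  "With both namespaces loaded, ->dataset picks the reader from the file
  extension: poi for .xls, fastexcel for .xlsx. Only the first sheet is
  returned."
  [path]
  (ds/->dataset path))

(defn load-all-sheets-with-poi
  "Bypass the registered handlers and read either format through poi.
  Returns one dataset per sheet."
  [path]
  (poi/workbook->datasets path))
```

Calling `(load-first-sheet "some-file.xlsx")` without first requiring `tech.v3.libs.fastexcel` would presumably fail to find a handler, which is the situation the new README comment is meant to prevent.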
4 changes: 2 additions & 2 deletions src/tech/v3/dataset/modelling.clj
@@ -143,7 +143,7 @@
Options:
* `:randomize-dataset? - When true, shuffle the dataset. In that case 'seed' may be
* `:randomize-dataset?` - When true, shuffle the dataset. In that case 'seed' may be
provided. Defaults to true.
* `:seed` - when `:randomize-dataset?` is true then this can either be an
implementation of java.util.Random or an integer seed which will be used to
@@ -167,7 +167,7 @@
Options:
* `:randomize-dataset? - When true, shuffle the dataset. In that case 'seed' may be
* `:randomize-dataset?` - When true, shuffle the dataset. In that case 'seed' may be
provided. Defaults to true.
* `:seed` - when `:randomize-dataset?` is true then this can either be an
implementation of java.util.Random or an integer seed which will be used to
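The two docstrings touched here describe the `:randomize-dataset?` and `:seed` options. The diff does not show which functions they belong to; the sketch below assumes one of them is `train-test-split` in `tech.v3.dataset.modelling` and that it returns a map with `:train-ds` and `:test-ds` keys. The dataset `example-ds` is made up for illustration.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.modelling :as ds-mod])

;; Small in-memory dataset, invented for this example.
(def example-ds
  (ds/->dataset {:x (range 10)
                 :y (map #(* 2 %) (range 10))}))

;; An integer :seed makes the shuffle repeatable across calls.
(def split-a (ds-mod/train-test-split example-ds {:seed 42}))
(def split-b (ds-mod/train-test-split example-ds {:seed 42}))

;; Turning the shuffle off keeps the original row order; :seed is
;; only consulted when :randomize-dataset? is true.
(def ordered-split
  (ds-mod/train-test-split example-ds {:randomize-dataset? false}))
```

Passing an instance of `java.util.Random` as `:seed` is the other form the docstring allows, which is useful when several operations should share one random stream.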
25 changes: 18 additions & 7 deletions src/tech/v3/libs/fastexcel.clj
@@ -1,5 +1,15 @@
(ns tech.v3.libs.fastexcel
"Fast xlsx parsing."
"Parse a dataset in xlsx format. This namespace auto-registers a handler for
the 'xlsx' file type so that when using ->dataset, `xlsx` will automatically map to
`(first (workbook->datasets))`.
Note that this namespace does **not** auto-register a handler for the `xls` file type.
`xls` is handled by the poi namespace.
If you have an xlsx file that contains multiple sheets and you want a dataset
out of each sheet you have to use `workbook->datasets` as opposed to the higher level
`->dataset` operator."
(:require [tech.v3.io :as io]
[tech.v3.datatype.protocols :as dtype-proto]
[tech.v3.datatype :as dtype]
@@ -120,14 +130,15 @@


(defn workbook->datasets
"Returns a sequence of dataset named after the sheets. This supports a subset of
the arguments for tech.v3.dataset/->dataset. Specifically:
"Given a workbook, a string filename, or an input stream, return a sequence of
datasets named after the sheets.
* `:header-row?`
* `:parser-fn`
* `:parser-scan-len`
Options are a subset of the arguments for tech.v3.dataset/->dataset:
Returns a non-lazy sequence of datasets."
* `:header-row?`
* `:num-rows`
* `:n-initial-skip-rows`
* `:parser-fn`"
([input options]
(let [workbook (input->workbook input options)]
(try
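The new namespace docstring points out that `->dataset` only returns the first sheet, so multi-sheet workbooks need `workbook->datasets`. A minimal sketch of that call, using options listed in the docstring; the helper name `sheets-by-name` and the choice of 100 rows are illustrative assumptions:

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.fastexcel :as xlsx])

(defn sheets-by-name
  "Load every sheet of an xlsx workbook and return a map of
  sheet name -> dataset. Reads at most 100 rows per sheet."
  [path]
  (let [sheets (xlsx/workbook->datasets path {:header-row? true
                                             :num-rows 100})]
    ;; Each dataset is named after its sheet, so the name works as a key.
    (into {} (map (juxt ds/dataset-name identity)) sheets)))
```

Indexing by name is only one option; since the result is an ordinary sequence, `first`, `nth`, or `filter` work just as well when sheet order is known.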
30 changes: 21 additions & 9 deletions src/tech/v3/libs/poi.clj
@@ -1,5 +1,14 @@
(ns tech.v3.libs.poi
"xls, xlsx formats."
"Parse a dataset in xls or xlsx format. This namespace auto-registers a handler for
the `xls` file type so that when using ->dataset, `xls` will automatically map to
`(first (workbook->datasets))`.
Note that this namespace does **not** auto-register a handler for the `xlsx` file
type. `xlsx` is handled by the fastexcel namespace.
If you have an `xlsx` or `xls` file that contains multiple sheets and you want a
dataset out of each sheet you have to use `workbook->datasets` as opposed to the
higher level `->dataset` operator."
(:require [tech.v3.io :as io]
[tech.v3.datatype :as dtype]
[tech.v3.datatype.protocols :as dtype-proto]
@@ -135,23 +144,26 @@


(defn workbook->datasets
"Returns a sequence of dataset named after the sheets. This supports a subset of
the arguments for tech.v3.dataset/->dataset. Specifically:
"Given a workbook, a string filename, or an input stream, return a sequence of
datasets named after the sheets.
* `:header-row?`
* `:parser-fn`
* `:parser-scan-len`
Options are a subset of the arguments for tech.v3.dataset/->dataset:
Returns a non-lazy sequence of datasets."
* `:file-type` - either `:xls` or `:xlsx` - inferred from the filename when input is
a string. If input is a stream defaults to `:xlsx`.
* `:header-row?`
* `:num-rows`
* `:n-initial-skip-rows`
* `:parser-fn`"
([input options]
(let [workbook (input->workbook input options)]
(try
(mapv #(parse-spreadsheet/sheet->dataset % options) workbook)
(finally
(when-not (identical? input workbook)
(.close workbook))))))
([workbook]
(workbook->datasets workbook {})))
([input]
(workbook->datasets input {})))


(defmethod ds-io/data->dataset :xls
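The `:file-type` option documented above matters mostly for stream input, where there is no filename to infer the format from and the default is `:xlsx`. A minimal sketch of reading an xls file through a stream; the helper name `xls-stream->datasets` is hypothetical:

```clojure
(require '[clojure.java.io :as jio]
         '[tech.v3.libs.poi :as poi])

(defn xls-stream->datasets
  "Read every sheet of an xls workbook from an input stream.
  :file-type must be given explicitly because a stream carries no
  extension and poi would otherwise assume :xlsx."
  [path]
  (with-open [in (jio/input-stream path)]
    ;; workbook->datasets builds its result with mapv, so the datasets
    ;; are fully realized before with-open closes the stream.
    (poi/workbook->datasets in {:file-type :xls})))
```

When the input is a string filename the option can be left out, since the docstring says the type is then inferred from the name.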