Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add xlsx support #26

Closed
wants to merge 4 commits into from
Closed

add xlsx support #26

wants to merge 4 commits into from

Conversation

gtsalles
Copy link

@gtsalles gtsalles commented Jan 6, 2017

I've added xlsx support based on this lib: https://github.com/tealeg/xlsx

Copy link
Member

@mish15 mish15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems goroutines are unnecessary in the ConvertXLSX func. The dependency also has requirements under BSD, @dhowden thoughts?

Copy link
Member

@dhowden dhowden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pull request!

I've made a few comments. Feel free to ask if you are unsure about anything.

Will comment directly on the pull request with regards to the library license.


// Meta data
mc := make(chan map[string]string, 1)
go func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to use a go routine here.


// Document body
bc := make(chan string, 1)
go func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need a go routine here either.

// Document body
bc := make(chan string, 1)
go func() {
var body string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially a lot of string concat operations here, you should use a bytes.Buffer instead.

"io"
)

// Convert MS Excel Spreadsheet
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For exported functions the comment should begin with the name of the function:

// ConvertXLSX ...

This looks like it converts XLSX to something similar to CSV, but isn't (i.e. there doesn't seem to be any escaping): if a cell includes , then the output will include an extra column. Definitely clarify this in the comment.

You probably want use https://golang.org/pkg/encoding/csv/#Writer from the standard library if you need CSV.

for _, sheet := range xlsFile.Sheets {
for rowIndex, row := range sheet.Rows {
for cellIndex, cell := range row.Cells {
text, _ := cell.String()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't ignore the error here. From the docs for the XLSX lib you're using, it will return the raw value of the cell, and hence incorrectly formatted. I would assume that the formatting string in the spreadsheet is correct, and the XLSX library is failing to parse it and so the output won't properly represent the value as shown in the spreadsheet: you don't want to fail silently.


import (
"fmt"
"github.com/tealeg/xlsx"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put this import underneath the others (it's non-std lib).

go func() {
var body string
for _, sheet := range xlsFile.Sheets {
for rowIndex, row := range sheet.Rows {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a convention to use smaller names in Go, so rowIndex -> i here instead?

var body string
for _, sheet := range xlsFile.Sheets {
for rowIndex, row := range sheet.Rows {
for cellIndex, cell := range row.Cells {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cellIndex -> j

@dhowden
Copy link
Member

dhowden commented Jan 8, 2017

Further to what @mish15 said:

We use this library internally to convert documents into (unstructured) plain text (i.e. paragraphs of text).

Looking at XLSX -> CSV is very different: there is definitely structure, and it's being mostly preserved in the conversion. I can't see any situation where we would want to read CSV and assume that it is equivalent to a paragraph of text - at least not without more pre-processing first.

The license terms on the library are a real problem. Including copyright information in everything that uses docconv is a hassle and not something that we want to start doing unless it's really essential to the library.

@mish15 mish15 closed this Nov 21, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants