Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add xlsx support #26

Closed
wants to merge 4 commits into from
Closed
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
5 changes: 5 additions & 0 deletions docconv.go
Expand Up @@ -25,6 +25,8 @@ func MimeTypeByExtension(filename string) string {
return "application/msword"
case ".docx":
return "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
case ".xlsx":
return "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
case ".odt":
return "application/vnd.oasis.opendocument.text"
case ".pages":
Expand Down Expand Up @@ -69,6 +71,9 @@ func Convert(r io.Reader, mimeType string, readability bool) (*Response, error)
case "application/vnd.oasis.opendocument.text":
body, meta, err = ConvertODT(r)

case "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet":
body, meta, err = ConvertXLSX(r)

case "application/vnd.apple.pages", "application/x-iwork-pages-sffpages":
body, meta, err = ConvertPages(r)

Expand Down
59 changes: 59 additions & 0 deletions xlsx.go
@@ -0,0 +1,59 @@
package docconv

import (
"fmt"
"github.com/tealeg/xlsx"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put this import underneath the others (it's non-std lib).

"io"
)

// Convert MS Excel Spreadsheet
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For exported functions the comment should begin with the name of the function:

// ConvertXLSX ...

This looks like it converts XLSX to something similar to CSV, but isn't (i.e. there doesn't seem to be any escaping): if a cell includes , then the output will include an extra column. Definitely clarify this in the comment.

You probably want use https://golang.org/pkg/encoding/csv/#Writer from the standard library if you need CSV.

func ConvertXLSX(r io.Reader) (string, map[string]string, error) {
f, err := NewLocalFile(r, "/tmp", "sajari-convert-")
if err != nil {
return "", nil, fmt.Errorf("error creating local file: %v", err)
}
defer f.Done()

fileStat, err := f.Stat()
if err != nil {
return "", nil, fmt.Errorf("error on getting file stats: %v", err)
}
xlsFile, err := xlsx.OpenReaderAt(f, fileStat.Size())
if err != nil {
return "", nil, fmt.Errorf("error on xlsx parsing: %v", err)
}

// Meta data
mc := make(chan map[string]string, 1)
go func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to use a go routine here.

meta := make(map[string]string)
meta["ModifiedDate"] = fmt.Sprintf("%d", fileStat.ModTime().Unix())
mc <- meta
}()

// Document body
bc := make(chan string, 1)
go func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need a go routine here either.

var body string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially a lot of string concat operations here, you should use a bytes.Buffer instead.

for _, sheet := range xlsFile.Sheets {
for rowIndex, row := range sheet.Rows {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a convention to use smaller names in Go, so rowIndex -> i here instead?

for cellIndex, cell := range row.Cells {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cellIndex -> j

text, _ := cell.String()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't ignore the error here. From the docs for the XLSX lib you're using, it will return the raw value of the cell, and hence incorrectly formatted. I would assume that the formatting string in the spreadsheet is correct, and the XLSX library is failing to parse it and so the output won't properly represent the value as shown in the spreadsheet: you don't want to fail silently.

body += text
if cellIndex < len(row.Cells)-1 {
body += ","
}
}
if rowIndex < len(sheet.Rows)-1 {
body += "\n"
}
}
}
bc <- body
}()

body := <-bc
meta := <-mc

return body, meta, nil
}