How to output the doc WordDocument sector? #10

Closed
xlcoder opened this issue Dec 18, 2019 · 3 comments

Comments

@xlcoder

xlcoder commented Dec 18, 2019

Following your code:

if entry.Name == "WordDocument" {
	fmt.Println(buf[:i])
	fmt.Println(string(buf[:i]))
}

This outputs unreadable characters.

I expected human-readable strings or text.

@richardlehane
Owner

answered via email

@infodusha

OK, I have the same issue now.

@richardlehane
Owner

To get the bytes out of that stream you could do something like this:

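package main

import (
	"fmt"
	"io"
	"log"
	"os"

	"github.com/richardlehane/mscfb"
)

func main() {
	file, err := os.Open("test/test.doc")
	if err != nil {
		log.Fatal(err)
	}
	// defer the close only after the open has succeeded
	defer file.Close()
	doc, err := mscfb.New(file)
	if err != nil {
		log.Fatal(err)
	}
	// walk the entries in the compound file until Next returns an error
	for entry, err := doc.Next(); err == nil; entry, err = doc.Next() {
		if entry.Name == "WordDocument" {
			// read the whole stream (io.ReadAll needs Go 1.16+; use ioutil.ReadAll on older versions)
			buf, err := io.ReadAll(entry)
			if err != nil {
				log.Fatal(err)
			}
			// prints the raw stream bytes as a string; this is binary data, not plain text
			fmt.Println(string(buf))
		}
	}
}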

BUT

"... [this] package only implements the MS-CFB spec (https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-cfb/53989ce4-7b05-4f8d-829b-d08d6148375b), which is a common container format used by a lot of different Windows software. It doesn't implement the MS Word spec (MS-DOC), so it can't help you identify byte ranges of the runs of text in a Word doc. To do something like that, you'd need to look at the MS-DOC spec (https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-doc/d7fae142-670d-4cd5-869a-708366984a71). You'd probably need to work out how to interpret the File Information Block structure (FIB) at the start of the WordDocument stream to get offsets for where the text entries are in the stream. That's probably quite a bit of work. The other option might be just to iterate over the byte slice and delete any bytes not in the ASCII range (this won't work if the doc stream has UTF-16 or some other encoding), e.g.

buf2 := make([]byte, 0, len(buf))
for _, c := range buf {
	// keep only bytes in the ASCII range, skipping the lowest control characters
	if c > 6 && c < 128 {
		buf2 = append(buf2, c)
	}
}"
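
For anyone who wants to try the byte-filtering idea on its own, here is a minimal, self-contained sketch. The helper name filterASCII and the sample bytes are made up for illustration and are not part of mscfb. It uses the same `c > 6 && c < 128` test as the loop above, so it has the same limitation: it only helps when the text is stored as single-byte ASCII, and it will still let through stray bytes from the binary parts of the stream that happen to fall in that range.

```go
package main

import "fmt"

// filterASCII is a hypothetical helper (not part of mscfb). It returns a copy
// of buf containing only the bytes in the range 7..127, mirroring the
// `c > 6 && c < 128` test from the loop above.
func filterASCII(buf []byte) []byte {
	out := make([]byte, 0, len(buf))
	for _, c := range buf {
		if c > 6 && c < 128 {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	// Made-up stand-in for bytes read from the WordDocument stream:
	// a few non-text bytes mixed with the ASCII text "Hi!".
	raw := []byte{0xEC, 0xA5, 0x00, 'H', 'i', 0x01, '!', 0xFF}
	fmt.Printf("%q\n", filterASCII(raw)) // prints "Hi!"
}
```

In the program above you would pass the buf returned by the read call to filterASCII instead of the sample slice. Expect noisy output on real documents, since this is a heuristic and not a parser for the MS-DOC format.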
