warc provides primitives for reading and writing WARC files in Go. This version is based on edsu's warc library, but many changes were made:
This package works with WARC files in plain text, GZip compression and BZip2 compression out of the box.
The record content is exposed via io.Reader
interfaces. Types and functions were renamed
to follow Go's naming conventions.
All external dependencies were removed. A Writer was added.
The following example reads a WARC file from stdin
and prints
the header values of each record to stdout
.
reader, err := warc.NewReader(os.Stdin)
if err != nil {
panic(err)
}
defer reader.Close()
for {
record, err := reader.ReadRecord()
if err != nil {
break
}
fmt.Println("Record:")
for key, value := range record.Header {
fmt.Printf("%v = %v\n", key, value)
}
}
The next example writes a WARC record to stdout
.
writer := warc.NewWriter(os.Stdout)
record := warc.NewRecord()
record.Header.Set("warc-type", "resource")
record.Header.Set("content-type", "plain/text")
record.Content = strings.NewReader("Hello, World!")
if _, err := writer.WriteRecord(record); err != nil {
panic(err)
}
Parsing WARC files is as fast as it can get. The real overhead stems from the underlying compression algorithms. So if you are about to parse the same file several times for whatever reason, consider decompressing it first.
warc is released under CC0 license. You can find a copy of the CC0 License in the LICENSE file.