#
# (c) Copyright 2005, Hewlett-Packard Development Company, LP
#
# See the file named COPYING for license details
#
- sanitize regression data
- Figure out how to document the minor improvements made to some of the
  converters; they are not important enough to go into NEWS, but matter to
  people who might use them.
- Rebuild the complex.ds-bigend file without lzo compression to enable checks
with just gz, lzf, bz2 support. Might want to move to just lzf support.
- Replace the code in GeneralField.C that deals with print_format, print_offset,
  print_multiplier, print_divisor, and print_style with a single concept,
  print_transforms="t1,t2,t3,t4,t5", where a general value goes in on the left,
  runs through the transforms, and then gets printed out; see the sketch below.
  This would allow natural extensions to transforms that convert Clock::TFrac
  into textual times in (s,ms,us,ns), int32 values into dotted-quad IP
  addresses, int64 into MAC addresses, etc. The more general version of this is
  full expression parsing for the transforms, which is the more SQL-like way to
  do all of this.
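
  A minimal sketch of the print_transforms idea, assuming a string-in/
  string-out transform interface; the class names (PrintTransform,
  DottedQuadTransform, TransformChain) are illustrative, not existing
  DataSeries types, and std::shared_ptr is used only so the sketch stands
  alone:

      #include <stdint.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <memory>
      #include <string>
      #include <vector>

      class PrintTransform {
      public:
          virtual ~PrintTransform() {}
          // rewrite the printed representation of a value
          virtual std::string apply(const std::string &in) const = 0;
      };

      // one of the transforms mentioned above: render an int32 as a
      // dotted-quad IPv4 address
      class DottedQuadTransform : public PrintTransform {
      public:
          std::string apply(const std::string &in) const {
              uint32_t v = (uint32_t)strtoul(in.c_str(), NULL, 10);
              char buf[16];
              snprintf(buf, sizeof(buf), "%u.%u.%u.%u",
                       (unsigned)(v >> 24) & 0xff, (unsigned)(v >> 16) & 0xff,
                       (unsigned)(v >> 8) & 0xff, (unsigned)v & 0xff);
              return std::string(buf);
          }
      };

      class TransformChain {
      public:
          void append(std::shared_ptr<PrintTransform> t) { transforms.push_back(t); }
          // run the raw printed value through every transform, left to right
          std::string print(const std::string &raw) const {
              std::string v = raw;
              for (size_t i = 0; i < transforms.size(); ++i)
                  v = transforms[i]->apply(v);
              return v;
          }
      private:
          std::vector<std::shared_ptr<PrintTransform> > transforms;
      };

  A chain would be built by splitting the print_transforms attribute value on
  commas and appending one transform object per name.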
- Think about what should happen if you try to access a nullable column with
a field that doesn't support nulls. Right now it seems to accept this, which
is not clearly the right thing to be doing.
- consider making all the Extent * pointers boost::shared_ptr<>; this would
let us safely share them. Would need a way to mark an extent as read only
for this to be safe.
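
  Illustrative only: one way shared ownership plus an explicit read-only
  marker could fit together. std::shared_ptr stands in for the
  boost::shared_ptr named above purely so the sketch compiles without boost,
  and SharedExtent/markReadOnly are invented names, not the current Extent
  API:

      #include <assert.h>
      #include <memory>
      #include <vector>

      class SharedExtent {
      public:
          SharedExtent() : read_only(false) {}

          // called before handing the extent to a second owner
          void markReadOnly() { read_only = true; }

          // every mutating operation checks the extent is still writable
          void append(const std::vector<char> &bytes) {
              assert(!read_only && "extent already shared read-only");
              data.insert(data.end(), bytes.begin(), bytes.end());
          }
      private:
          bool read_only;
          std::vector<char> data;
      };

      typedef std::shared_ptr<SharedExtent> SharedExtentPtr;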
- think about whether we should change preadExtent to take an
  ExtentType::int64 argument rather than an off64_t. The values come out of
  DS as int64, but get used as off64_t; while these are in theory the same,
  x86_64 seems to generate warnings/errors on this.
- decide whether the implementation of preadExtent which auto-updates the
  offset is good; in use it seems to require copying the offset too many
  times. Perhaps an alternate implementation where &offset is passed, and in
  that case the value is updated; see the sketch below.
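
  A sketch of the alternate convention, with simplified stand-in names
  (SourceSketch is not the real class and the real preadExtent signature may
  differ):

      #include <stdint.h>

      class Extent;  // placeholder for the real Extent type

      class SourceSketch {
      public:
          // hypothetical variant: the offset is advanced in place past the
          // extent that was just read, so callers iterating through a file
          // never copy it back and forth
          Extent *preadExtent(int64_t &offset);
      };

      // typical use, reading extents back to back:
      //   while ((e = source.preadExtent(cur_offset)) != NULL) { ... }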
- implement ExtentSeries::typeFieldMatch; this would be defined as the fields
  having the same xml definition, but ignoring the pack_* attributes, any
  {note,comment} attributes, and possibly any attribute with an nt_*
  (non-type) prefix, which would always be ignored; see the sketch below.
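
  A sketch of the proposed matching rule, modelling a field's attributes as a
  plain map rather than the real XML representation; all of the helper names
  are hypothetical:

      #include <map>
      #include <string>

      typedef std::map<std::string, std::string> FieldAttrs;

      static bool ignoredAttribute(const std::string &name) {
          return name.compare(0, 5, "pack_") == 0
              || name.compare(0, 3, "nt_") == 0
              || name == "note" || name == "comment";
      }

      static FieldAttrs significantAttributes(const FieldAttrs &in) {
          FieldAttrs out;
          for (FieldAttrs::const_iterator i = in.begin(); i != in.end(); ++i) {
              if (!ignoredAttribute(i->first)) out.insert(*i);
          }
          return out;
      }

      // two field definitions match if they agree once the ignored
      // attributes are stripped
      bool fieldsMatch(const FieldAttrs &a, const FieldAttrs &b) {
          return significantAttributes(a) == significantAttributes(b);
      }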
- implement a GroupBy module that takes a list of fields for use as the
key, and a factory class that can define new modules which can handle
each of the individual groups.
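
  A rough interface sketch of the GroupBy idea; every name is hypothetical,
  including the simplified Module base class standing in for the real
  DataSeries module interface:

      #include <map>
      #include <memory>
      #include <string>
      #include <vector>

      class Extent;  // placeholder

      class Module {
      public:
          virtual ~Module() {}
          virtual void processExtent(Extent &e) = 0;
      };

      class GroupHandlerFactory {
      public:
          virtual ~GroupHandlerFactory() {}
          // called the first time a new key value is seen
          virtual std::shared_ptr<Module> makeHandler(const std::string &key) = 0;
      };

      class GroupByModule : public Module {
      public:
          GroupByModule(const std::vector<std::string> &key_fields,
                        GroupHandlerFactory &factory)
              : key_fields(key_fields), factory(factory) {}

          void processExtent(Extent &e) {
              // a real implementation would split the rows of e by key value
              // and route each slice to the handler stored in (or created
              // into) handlers via the factory
              (void)e;
          }
      private:
          std::vector<std::string> key_fields;
          GroupHandlerFactory &factory;
          std::map<std::string, std::shared_ptr<Module> > handlers;
      };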
- implement the DSv2 file format -- switch from using an adler32 digest to an
  SHA-1 digest on the compressed data, and switch to having the partially
  unpacked bjhash include the hashing of things which are reversibly packed,
  e.g. bool, char, int{32,64}, and variable32, to make sure that the unpack
  worked correctly; see the digest sketch below.
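
  For the digest part only, a minimal sketch assuming OpenSSL's one-shot
  SHA1() is acceptable for illustration; the real format change involves far
  more than computing the hash:

      #include <openssl/sha.h>
      #include <vector>

      // digest the compressed bytes of an extent with SHA-1 instead of adler32
      std::vector<unsigned char> digestCompressedExtent(
              const std::vector<unsigned char> &compressed) {
          std::vector<unsigned char> digest(SHA_DIGEST_LENGTH);
          SHA1(compressed.data(), compressed.size(), digest.data());
          return digest;
      }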
- allow ignoring of either of the hash checks as an option during reading
- implement an option for measuring the decompression time while compressing
  and selecting the compression algorithm that gives the lowest
  decompression-time * compressed-size, or perhaps (F1 * decompression-time) +
  (F2 * compressed-size); see the sketch below.
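
  A sketch of that selection rule; the weights f1/f2 and the CandidateResult
  structure are illustrative:

      #include <stddef.h>
      #include <string>
      #include <vector>

      struct CandidateResult {
          std::string algorithm;      // e.g. "gz", "lzf", "bz2"
          double decompress_seconds;  // measured by test-decompressing the output
          size_t compressed_bytes;
      };

      std::string pickAlgorithm(const std::vector<CandidateResult> &candidates,
                                double f1, double f2) {
          std::string best;
          double best_cost = 0;
          for (size_t i = 0; i < candidates.size(); ++i) {
              double cost = f1 * candidates[i].decompress_seconds
                          + f2 * candidates[i].compressed_bytes;
              if (best.empty() || cost < best_cost) {
                  best = candidates[i].algorithm;
                  best_cost = cost;
              }
          }
          return best;
      }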
- think about how to add in a recursive structured variable type,
  e.g. a keyed union in the way they are done in Pascal. This would
  be useful for providing an alternate way of handling network traces,
  which tend to have a recursive structure of (type-key, value-options).
  The other way to implement this is the multiple-table approach used by
  the nfs analysis -- it is unclear which of these is the "best" way to
  do this.
- consider adding in the unsigned types, and some sort of
(size,alignment,byteswap-rules) type
- extend the generic indexing to have more modes, for example a multi-range
min-max index to handle the case where there are multiple dense ranges of
some values, for example a key that could be derived from either a
host-id+process-id, a global counter, or a time value. Another example of
a useful index is a unique value index, useful if there are a small set of
expected values such as user-ids or user-names.
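
  A sketch of what one entry of such a multi-range min-max index might record
  for a single extent; types and names are illustrative only:

      #include <stddef.h>
      #include <stdint.h>
      #include <vector>

      struct ValueRange {
          int64_t min_val;
          int64_t max_val;
      };

      struct ExtentIndexEntry {
          // one (min, max) pair per dense cluster of key values in the extent
          std::vector<ValueRange> ranges;

          // true if the extent could contain v and therefore must be read
          bool mayContain(int64_t v) const {
              for (size_t i = 0; i < ranges.size(); ++i) {
                  if (v >= ranges[i].min_val && v <= ranges[i].max_val)
                      return true;
              }
              return false;
          }
      };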
- extend dsselect so that it can support computations in the selection
criteria, the "as" construct to rename columns and extent-types, and
support for where clauses
- implement dscat that can both concatenate multiple dataseries files, and
can re-compress the files. The best implementation would operate directly
on the dataseries file avoiding most of the unpack effort by just
decompressing and recompressing the raw extents and updating the various
pointers.
- implement support for altname in the type specification to allow for gradual
renaming of columns
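
  For illustration, a field declaration might carry the proposed attribute
  roughly like this (altname is exactly the attribute this item proposes, so
  it does not exist yet; the field names are made up):

      <field type="int64" name="bytes_out" altname="bytes" />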
- work out a way to avoid reading all of the extent types across all of the
files before we get started with doing a read of a collection of files.
I believe the original reason that we read all the types was consistency
checking across the types and to simplify the logic. I think we can work
around this now since we shouldn't require that all the extent type
definitions be identical, just that we are using the correct one for
each file. Not sure about that claim though.
- think about whether you should be forced to set the field value for
fields which are not nullable. Right now (via testing with
textindex) it seems that the code will just set the value to the
default. Not clear this is what we would want.