<html>
<body>
<center>
<pre>
/* Copyright 2009, UCAR/Unidata and OPeNDAP, Inc.
See the COPYRIGHT file for more information. */
</pre>
<h1>NCGEN Internals Documentation</h1>
<h3>Draft: 03/07/2009<br>
Last Revised: 07/15/2009</h3>
</center>
<h1><u>Introduction</u></h1>
This document is an ongoing effort to
describe the internal operation of the ncgen
cdl compiler; ncgen is a part of the netcdf
system.
<p>
The document has two primary parts.
<ol>
<li><a href="#LANG">Language Support</a>
-- describes how to add a new output language to ncgen.
<p>
<li><a href="#GIT">General Internals Information</a>
-- describes additional information about the internals;
parsing, for example.
</ol>
<h1><u><a name="LANG">Modifying NCGEN to Output a New Language</a></u></h1>
This part outlines the general method for adding
a new output language to ncgen. Currently, ncgen supports
binary, C, and (experimentally) NcML and Java.
Before reading this part, the reader should also
review the <a href="#GIT">General Internals Information</a> part.
<p>
Also, the reader should note that the code is a bit crufty
and needs refactoring. This is primarily because
it was originally defined to support only C and
each new language stresses the code.
<p>
In order to get ncgen to generate output for a new
language, the following steps are required.
<ol>
<li> <a href="#Misc">Modify various files to invoke the new language output.</a>
<li> <a href="#Create">Create a new set of generate functions.</a>
</ol>
<h2><a name="Misc">Modify various files to invoke the new language output.</a></h2>
The following steps are required to provide the necessary code
to invoke a new language output.
For the purposes of this discussion, let us call the language Java.
<h4>ncgen.h</h4>
<ol>
<li> Locate the code enabler #defines
(e.g. <code>#define ENABLE_C</code>)
and insert a new one of the form
<pre>
#define ENABLE_JAVA
</pre>
</ol>
<h4>main.c</h4>
<ol>
<li> Locate the global declaration (<code>int fortran_flag;</code>)
and insert a new declaration.
<pre>int java_flag;</pre>
<li> Locate the initialization (<code>fortran_flag = 0;</code>)
in the body of the main() procedure and add a new initialization.
<pre>java_flag = 0;</pre>
<li>Locate the options processing switch case for -l (<code>case 'l':</code>).
Duplicate one of the instances there and add to the conditionals.
It should look like this.
<pre>
} else if(strcmp(lang_name, "java") == 0
|| strcmp(lang_name, "Java") == 0) {java_flag = 1;}
</pre>
<li> Just after the options processing switch code,
there are a number of #ifndef conditionals
(e.g. <code>#ifndef ENABLE_C</code>).
Add a new one for Java.
It should look like this.
<pre>
#ifndef ENABLE_JAVA
    if(java_flag) {
        fprintf(stderr,"Java not currently supported\n");
        exit(1);
    }
#endif
</pre>
</ol>
<h2><a name="Create">Create a new set of generate functions.</a></h2>
The hard part is creating the actual code generation files.
To do this, it is easiest to take one of the existing
generators and modify it, viz:
<ul>
<li> copy genc.c genj.c
<li> copy cdata.c jdata.c
</ul>
The genj.c file will do most of the code generation. The jdata.c file
will generate lists of data constants that come from the CDL data: section.
There is nothing magical about using two files: they can be refactored
as desired.
<p>
In order to facilitate code generation, it is useful to look
at the translations produced by other languages.
The idea is to take these translations and decide what the
corresponding Java (for example) code would look like.
Then the idea is to modify the genc code (in genj.c)
to reflect that translation.
<p>
In most of the rest of this discussion, the genc.c and cdata.c
code will be used to explain the operation.
Appropriate procedure renaming should be done for new languages
(e.g., for Java, <i>genc_XXX</i> is changed to <i>genj_XXX</i>
consistently).
<h3>Useful Output Procedures</h3>
The following output procedures are defined in genc.c to create C output.
The idea is that output is accumulated in a <a href="#Bytebuffer">Bytebuffer</a>
called ccode. Periodically, ccode
contents are flushed to stdout.
The relevant procedures from the C code are as follows.
<ol>
<li> <code>void cprint(Bytebuffer* buf)</code>
-- dump the contents of buf to output (ccode actually).
<li> <code>void cpartial(char* line)</code>
-- dump the specified string to output.
<li> <code>void cline(char* line)</code>
-- dump the specified string to output and add a newline.
<li> <code>void clined(int n, char* line)</code>
-- dump the specified string to output preceded by
<i>n</i> instances of indentation.
<li> <code>void cflush(void)</code>
-- dump the contents of ccode to standard output
and reset the ccode buffer.
</ol>
There is, of course, nothing sacred about these procedures:
feel free to modify as needed. In fact, there are two
important reasons to modify the code.
First, the indentation rules may differ from language to language
(FORTRAN 77 for example). Second, the rules for folding lines
that are too long differ across languages.
It is usually easiest to handle both of these issues
in the output procedures.
<p>
The <a href="#Bytebuffer">Bytebuffer</a> type is an important data structure.
It allows for dynamically creating strings of characters
(actually arbitrary 8 bit values).
Most of the operations should be obvious: examine bytebuffer.h.
It is used widely in this code especially to capture sub-pieces
of the generated code that must be saved for out-of-order output.
<h3>Code Generation</h3>
The code generation method used for C is a pretty good
general paradigm, so this discussion will use it as a model.
The gen_ncc procedure is responsible for
creating and dumping the generated C code.
<p>
It has at its disposal several global lists of Symbols.
Note that the lists cross all groups.
<ul>
<li>dimdefs - the set of symbols defining dimensions.
<li>vardefs - the set of symbols defining variables.
<li>attdefs - the set of symbols defining non-global attributes.
<li>gattdefs - the set of symbols defining global attributes.
<li>grpdefs - the set of symbols defining groups.
<li>typdefs - the set of symbols defining types; note that this list
has been topologically sorted so that a given type depends only
on types with lower indices in the list.
</ul>
<p>
The superficial operation of gen_ncc is as follows; the details
are provided later where the operation is complex.
<ol>
<li>Generate header code (e.g. <code>#include &lt;stdio.h&gt;</code>).
<li>Generate C type definitions corresponding to the
CDL types.
<li>Generate VLEN constants.
<li>Generate chunking constants.
<li>Generate initial part of the main() procedure.
<li>Generate C variable definitions to hold the ncids
for all created groups.
<li>Generate C variable definitions to hold the typeids
of all created types.
<li>Generate C variables and constants that correspond to
the CDL dimensions.
<li>Generate C variable definitions to hold the dimids
of all created dimensions.
<li>Generate C variable definitions to hold the varids
of all created variables.
<li>Generate C code to create the netCDF binary file.
<li>Generate C code to create all groups in the proper
hierarchy.
<li>Generate C code to create the type definitions.
<li>Generate C code to create the dimension definitions.
<li>Generate C code to create the variable definitions.
<li>Generate C code to create the global attributes.
<li>Generate C code to create the non-global attributes.
<li>Generate C code to leave define mode.
<li>Generate C code to assign variable datalists.
</ol>
<p>
The following code generates C code for defining the groups.
It is fairly canonical and can be seen repeated in variant form
when defining dimensions, types, variables, and attributes.
<p>
This code is redundant, but for consistency the root group
ncid is stored like all other group ncids.
Note that nprintf is a macro wrapper around snprintf.
<pre>
nprintf(stmt,sizeof(stmt),"%s%s = ncid;",indented(1),groupncid(rootgroup));
cline(stmt);
</pre>
<p>
The loop walks all group symbols in preorder
and generates a C call to nc_def_grp
using parameters taken from the group Symbol instance (gsym).
The call to nc_def_grp is followed by a call to the
check_err procedure to verify the operation's result code.
<pre>
for(igrp=0;igrp&lt;listlength(grpdefs);igrp++) {
    Symbol* gsym = (Symbol*)listget(grpdefs,igrp);
    if(gsym == rootgroup) continue; // ignore root
    if(gsym->container == NULL) PANIC("null container");
    nprintf(stmt,sizeof(stmt),
            "%sstat = nc_def_grp(%s, \"%s\", &amp;%s);",
            indented(1),
            groupncid(gsym->container),
            gsym->name, groupncid(gsym));
    cline(stmt); // print the def_grp call
    clined(1,"check_err(stat,__LINE__,__FILE__);");
}
flushcode();
</pre>
Note the call to indented(). It returns a string of blanks that
indents to the level given by its argument n; a level of n does not
necessarily produce exactly n blank characters.
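<p>
As an illustration only, a helper in the spirit of indented() might look like the
following sketch. The name indent_for, the width of four blanks per level, and the
cap of eight levels are all assumptions made for this example; the actual procedure
in genc.c may differ.

```c
#include <assert.h>
#include <string.h>

#define INDENT_WIDTH 4   /* assumed: blanks per indentation level */
#define INDENT_MAX   8   /* assumed: deepest level supported */

/* Return a string of blanks for the given indentation level.
   The result points into a static table, so the caller must not free it. */
static const char* indent_for(int level)
{
    static char blanks[INDENT_WIDTH * INDENT_MAX + 1];
    static int initialized = 0;
    assert(level >= 0);
    if(!initialized) {
        memset(blanks, ' ', sizeof(blanks) - 1);
        blanks[sizeof(blanks) - 1] = '\0';
        initialized = 1;
    }
    if(level > INDENT_MAX) level = INDENT_MAX; /* clamp very deep nesting */
    /* Point far enough into the table that level*INDENT_WIDTH blanks remain. */
    return blanks + (INDENT_MAX - level) * INDENT_WIDTH;
}
```

Because the level is mapped to a width inside the helper, a new output language
can change the width (or use tabs) without touching any caller.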
<p>
Note also that one must be careful when dumping names
(e.g. gsym->name above) if the name is expected to contain
UTF-8 characters. For C, UTF-8 works fine in strings, but with
a language like Java, which uses UTF-16 characters,
some special encoding is required to convert the non-ASCII
characters to the \uxxxx form.
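<p>
One possible shape for such an encoder is sketched below. The function name
utf8_to_java_escapes is hypothetical and is not part of ncgen. The sketch copies
ASCII bytes unchanged and turns every other code point into \uxxxx escapes, using a
surrogate pair above U+FFFF. It does not escape quotes or backslashes and does not
reject overlong encodings, both of which a real generator would need to handle.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Escape a UTF-8 string for use inside a Java-style string literal.
   Returns the number of bytes written (excluding the NUL), or -1 on
   malformed input or insufficient space. Hypothetical helper. */
static int utf8_to_java_escapes(const char* in, char* out, size_t outsize)
{
    const unsigned char* p = (const unsigned char*)in;
    size_t used = 0;
    assert(in != NULL && out != NULL && outsize > 0);
    while(*p != '\0') {
        unsigned long cp;
        int extra, i;
        /* Decode the lead byte and find the number of continuation bytes. */
        if(*p < 0x80)                { cp = *p;        extra = 0; }
        else if((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return -1; /* stray continuation or invalid lead byte */
        p++;
        for(i = 0; i < extra; i++, p++) {
            if((*p & 0xC0) != 0x80) return -1; /* also catches a premature NUL */
            cp = (cp << 6) | (unsigned long)(*p & 0x3F);
        }
        if(outsize - used < 13) return -1; /* worst case: 12 chars plus NUL */
        if(cp < 0x80) {
            out[used++] = (char)cp;
        } else if(cp < 0x10000UL) {
            used += (size_t)sprintf(out + used, "\\u%04lx", cp);
        } else {
            /* Split into a UTF-16 surrogate pair. */
            cp -= 0x10000UL;
            used += (size_t)sprintf(out + used, "\\u%04lx\\u%04lx",
                                    0xD800UL + (cp >> 10),
                                    0xDC00UL + (cp & 0x3FFUL));
        }
    }
    out[used] = '\0';
    return (int)used;
}
```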
<p>
The code to generate dimensions, types, attributes, and variables
is similar, although often more complex.
<p>
The code to generate C equivalents of CDL types is
in the procedure definectype().
Note that this code is not the code that invokes e.g. nc_def_vlen.
The generated C types are used when generating datalists
so that the standard C constant assignment mechanism will produce
the correct memory values.
<p>
For non-C languages, the interaction between this code and the
nc_def_TYPE code may be rather more complex than with C.
<p>
The genc_deftype procedure is the one that actually
generates C code to define the netcdf types.
The generated C code is designed to store the resulting
typeid into the C variable defined earlier
for holding that typeid.
<p>
Note that for compound types, the NC_COMPOUND_OFFSET
macro is normally used to match netcdf offsets to
the corresponding struct type generated in definectype.
However, there is a flag, TESTALIGNMENT,
that can be set to use a computed value for the offset.
And for non-C languages, handling offsets is tricky and is
addressed in more detail below.
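<p>
To make the two offset strategies concrete, the sketch below compares
compiler-reported offsets with offsets computed by hand. It assumes that
NC_COMPOUND_OFFSET behaves like the standard offsetof macro, and it uses the
C11 _Alignof operator for the computed variant. The struct obs_t and the helper
names are invented for this example, and the hand computation is only expected to
agree with the compiler on common ABIs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical struct of the kind definectype() might emit for a CDL
   compound type with the fields: char tag; double value; int count. */
typedef struct obs_t {
    char   tag;
    double value;
    int    count;
} obs_t;

/* Round offset up to the next multiple of alignment (a power of two). */
static size_t align_up(size_t offset, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

/* Compute the three field offsets by hand, as a generator that cannot ask
   the C compiler for the layout would have to do. */
static void computed_offsets(size_t offsets[3])
{
    size_t next = 0;
    next = align_up(next, _Alignof(char));
    offsets[0] = next; next += sizeof(char);
    next = align_up(next, _Alignof(double));
    offsets[1] = next; next += sizeof(double);
    next = align_up(next, _Alignof(int));
    offsets[2] = next; next += sizeof(int);
}
```

Comparing the result of computed_offsets() against offsetof(obs_t, tag),
offsetof(obs_t, value), and offsetof(obs_t, count) is a quick way to check
whether a computed layout matches the one the C compiler actually uses.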
<h3>Data Generation Methods</h3>
There are basically three known approaches for generating
the data constants that are passed to, for example, <i>nc_put_vara</i>.
<ol>
<li> For C (and C++) it is possible to generate C language constants
directly into the code using the C initializer syntax.
This is because CDL was originally defined with C in mind.
This method can also be used for FORTRAN when doing classic model only.
<p>
<li> Generate the binary data
and convert it to a large single string constant using
appropriate escaping mechanisms; this was done in the original
ncgen.
This method has the advantage that it can be used for most
languages, but it has (at least) two disadvantages:
(1) it is not generally portable because the machine architecture
influences the memory encoding; (2) it loses all information
about the structure of the memory and hence makes debugging more
difficult.
<p>
<li>Extend the netCDF interface with a set
of operations to build up the memory structure piece by piece.
This is the approach taken in the Java generation code.
<p>
The idea is that one has a set of procedures in C with a simple
interface that can be invoked by the output language.
These procedures do the following.
<ol>
<li>Create a dynamically extendible memory buffer (much like Bytebuffer).
<li>Append an array of instances
of some primitive type to a specified buffer.
<li>Invoke nc_put_vara with a specified buffer.
<li>Reclaim a buffer.
</ol>
Appropriate calls to these procedures can construct any required memory
in a portable fashion.
<p>
This method is appropriate for most non-C languages, including interpreted
languages (e.g., Ruby and Perl), and is probably even the best way to
get FORTRAN to handle the full netcdf-4 data model.
</ol>
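<p>
The buffer-building interface of the third method could be sketched as follows.
The MemBuf type and the membuf_* names are hypothetical and are not the interface
used by the Java generation code. The step that hands the buffer to nc_put_vara is
left out so that the sketch builds without the netCDF library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical extendible memory buffer; the names are illustrative only. */
typedef struct MemBuf {
    unsigned char* bytes;
    size_t length; /* bytes in use */
    size_t alloc;  /* bytes allocated */
} MemBuf;

/* Create an empty, extendible buffer. Returns NULL on failure. */
static MemBuf* membuf_new(void)
{
    return (MemBuf*)calloc(1, sizeof(MemBuf));
}

/* Append count instances of a primitive type whose size is given in bytes.
   Returns 1 on success and 0 if memory could not be obtained. */
static int membuf_append(MemBuf* mb, const void* values, size_t size, size_t count)
{
    size_t nbytes = size * count;
    assert(mb != NULL && values != NULL);
    if(nbytes == 0) return 1;
    if(mb->length + nbytes > mb->alloc) {
        size_t newalloc = mb->alloc ? mb->alloc : 256;
        unsigned char* newbytes;
        while(newalloc < mb->length + nbytes) newalloc *= 2;
        newbytes = (unsigned char*)realloc(mb->bytes, newalloc);
        if(newbytes == NULL) return 0;
        mb->bytes = newbytes;
        mb->alloc = newalloc;
    }
    memcpy(mb->bytes + mb->length, values, nbytes);
    mb->length += nbytes;
    return 1;
}

/* Reclaim a buffer and its contents. */
static void membuf_free(MemBuf* mb)
{
    if(mb == NULL) return;
    free(mb->bytes);
    free(mb);
}
```

A generated program in the output language would call the equivalent of
membuf_append() once per run of primitive values, in datalist order, and would
then pass mb->bytes to the put operation before calling membuf_free().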
<h3>Data Generation: Overview</h3>
The way to think about data generation is to consider
the following tree.
<ul>
<li>The root is a convenience and represents the whole
set of variables specified in the CDL "data:" section.
<li>The nodes in the tree just below the root represent
the set of variables to which values are assigned in the
data section.
<li>The subtrees below each variable are the basetypes
of each variable. Thus if a variable x has a basetype
that is a compound type, then the node below x will
represent the whole compound type and the nodes below
that compound type node will be the fields of the compound
type, and so on.
<li>The leaves of this tree are all of primitive type
(e.g. NC_CHAR, NC_INT, NC_STRING).
</ul>
<p>
The data generation code is divided into two
primary groups. One group handles all non-primitive variables
and types. The other group handles all primitive variables
and types (especially fields). The reason for this is that
almost all languages can handle simple lists of primitive values.
However, for non-primitive types, one of the methods from the previous
section needs to be used.
<p>
Secondarily, the primitive handling code is divided into
two groups. One group handles the character type
and the other group handles all other primitive types.
The code for the first group is in chardata.c and is generally
usable across all languages.
<p>
This split exists for historical reasons.
It turns out that it is tricky to properly handle variables
(or Compound type fields) of type NC_CHAR.
Here the term "proper" means to mimic the output of
the original ncgen program. To this end, a set of generically useful routines
is defined in the chardata.c file. These routines take a datasource
and walk it to build a single string of characters, with appropriate fill,
to correspond to an NC_CHAR typed variable or field.
Unless your language has special
requirements, it is probably best to always use these routines to process
datalists for variables of type NC_CHAR.
<h3>Data Generation: Part I</h3>
Data generation occurs in several places, but is roughly
divided into two parts. First, the genc.c code will set up
appropriate declarations to hold the data. Second, the code
in cdata.c will generate the actual memory contents that must be
passed to nc_put_vara.
<p>
As a rule, the genc.c code calls a limited set of
entry points into cdata.c. Again as a rule,
cdata.c does not call genc.c code except for the closure
mechanism described below.
<p>
The critical pieces of code for part I are the procedures
genc_defineattr() and genc_definevardata() in genc.c.
<h4>genc_definevardata</h4>
This procedure is responsible for generating C constants corresponding
to the data to be assigned to a variable as defined in the "data:" section
of a CDL file. It is also responsible for
generating the appropriate nc_put_vara_XXX code to actually assign
the data to the variable.
<h4>genc_defineattr</h4>
This procedure is responsible for generating C constants corresponding
to the data to be assigned to an attribute as defined in
a CDL file. It is also responsible for
generating the appropriate nc_put_att_XXX code to actually define
the attribute.
<p>
As with variables, defining attributes of type NC_CHAR requires use
of the gen_charXXX procedures.
<h3>Data Generation: Part II</h3>
The procedures in cdata.c walk a datalist
and generate a sequence of space-separated constants,
possibly with nested paired braces ("{...}") as needed.
The result is placed into a specified Bytebuffer.
<p>
As an aside, commas are added when needed to the list of constants
using the <i>commify</i> procedure.
<p>
There are three primary procedures that are called from
the genc.c code.
<ul>
<li>genc_attrdata --
store (in its Bytebuffer argument) the sequence of constants
corresponding to a given attribute datalist.
<li>genc_scalardata --
store the single constant (which may be of a user-defined type)
corresponding to its variable's datalist.
<li>genc_arraydata --
store the vector of constants corresponding to its variable's datalist.
This is by far the most complicated of the three procedures.
</ul>
<p>
Internally, each of these three procedures invokes
the <i>genc_data</i> procedure to process part of a datalist.
<h3>Closures and VLEN</h3>
Closures and VLEN handling are two rather specialized mechanisms.
<h4>Closures</h4>
The data generation code uses a concept of closure or callback
to allow the datalist processing to periodically
call external code to do the actual C code generation.
The reason for this is that it significantly improves
performance if the generated datalist is periodically
dumped to the netcdf .nc file using <i>nc_put_vara</i>.
Note that the closure mechanism is only used for generating
variable data; attributes cannot use this mechanism
since they are defined all at once.
<p>
Basically, each call to the callback will generate
C code for some C constants and calls to nc_put_vara().
The closure data structure (struct Putvar) is defined as follows.
<pre>
typedef struct Putvar {
    int (*putvar)(struct Putvar*, Odometer*, Bytebuffer*);
    int rank;
    Bytebuffer* code;
    size_t startset[NC_MAX_VAR_DIMS];
    struct CDF {
        int grpid;
        int varid;
    } cdf;
    struct C {
        Symbol* var;
    } c;
} Putvar;
</pre>
An instance of the closure is created for
each variable that is the target of nc_put_vara().
It is initialized with the variable's symbol, rank, group id and variable
id. It is also provided with a Bytebuffer into which it is supposed
to store the generated C code.
The startset is the cached previous set of dimension indices used
for generating the nc_put_vara (see below).
<p>
The callback procedure (field "putvar")
for generating C code is assigned to the procedure called cputvara()
(defined in genc.c).
This procedure takes as arguments the closure object,
an <a href="#odometer">odometer</a> describing the current set of dimension indices,
and a Bytebuffer containing the generated C constants
to be assigned to this slice of the variable.
<p>
Every time the closure procedure is called, it generates a C variable
to hold the generated C constant. It also generates
C constants to hold the start and count vectors required
by <i>nc_put_vara</i>. It then generates an <i>nc_put_vara()</i> call.
The start vector argument for the nc_put_vara is defined by the startset
field of the closure. The count vector argument to nc_put_vara
is computed from the current cached
start vector and from the indices in the odometer.
After the nc_put_vara() is generated, the odometer vector
is assigned to the startset field in the closure for use on the next call.
<p>
There are some important assumptions about the state of the odometer
when it is called.
<ol>
<li>The zeroth index controls the count set.
<li>All other indices are assumed to be at their max values.
</ol>
<p>
In particular, this means that the start vector is zero
for all positions except position zero. For every position
except zero, the count vector holds the index in the odometer,
which is assumed to be the max.
<p>
For start position zero, the position is taken from the last
saved startset. The count position zero is the difference between
that last start position and the current odometer zeroth index.
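<p>
The start and count rules above can be sketched as a small C function. This is an
interpretation of the description, not the code of cputvara(): the name next_slab
and the use of plain arrays in place of the Putvar and Odometer structures are
assumptions, and "max" is taken to mean that the odometer index at positions other
than zero equals the full extent of that dimension.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RANK 8  /* stand-in for NC_MAX_VAR_DIMS */

/* Compute the start and count vectors for the next nc_put_vara() call.
   startset holds the indices saved by the previous call and odomindex holds
   the current odometer indices. On return, startset has been updated for
   use by the next call. */
static void next_slab(int rank, size_t startset[], const size_t odomindex[],
                      size_t start[], size_t count[])
{
    int i;
    assert(rank > 0 && rank <= MAX_RANK);
    assert(odomindex[0] >= startset[0]);
    /* Position zero: resume where the previous call stopped. */
    start[0] = startset[0];
    count[0] = odomindex[0] - startset[0];
    /* All other positions: start at zero and cover the odometer index. */
    for(i = 1; i < rank; i++) {
        start[i] = 0;
        count[i] = odomindex[i];
    }
    /* Cache the odometer indices for the next call. */
    for(i = 0; i < rank; i++)
        startset[i] = odomindex[i];
}
```

For a variable of shape [6][3] written in two pieces, a first call with odometer
indices {2,3} yields start {0,0} and count {2,3}; a second call with {6,3} yields
start {2,0} and count {4,3}.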
<h4>VLEN Constants</h4>
VLEN constants need to be constructed
as separate C data constants because
the C compiler will never convert nested
groups ({...}) to separate memory chunks.
Thus, ncgen must in several places
generate the VLEN constants as separate variables
and then insert pointers to them in the appropriate
places in the later datalist C constants.
Note that this process can be very tricky
for non-C languages (see genj.c and jdata.c for one approach).
<p>
As an optimization, ncgen tracks which datatypes
will require use of vlen constants.
This is any type whose definition is a vlen or whose
basetype contains a vlen type.
<p>
The vlen generation process is two-fold.
First, in the procedure processdatalist1() in semantics.c,
the location of the struct Datalist objects
that correspond to vlen constants is stored in a list called vlenconstants.
When detected, each such Datalist object is tagged with
a unique identifier and the vlen length (count).
These will be used later to generate references to the vlen constant.
These counts are only accurate for non-char typed variables;
special handling is in place to handle character vlen constants.
<p>
The second vlen constant processing action is in the
procedure genc_vlenconstant() in cdata.c. First, it walks the
vlenconstants list and generates C code for C variables to
define the vlen constant and C code to assign the vlen
constant's data to that C variable.
<p>
When, later, the genc_datalist procedure encounters
a Datalist tagged as representing a vlen constant, it can generate
an nc_vlen_t constant as {&lt;count&gt;,&lt;vlenconstantname&gt;}
and use it directly in the generated C datalist constant.
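<p>
The shape of the generated C code might resemble the sketch below. The type
vlen_t is a local mirror of netCDF's nc_vlen_t (a length plus a pointer) so that
the example builds without netcdf.h, and the names vlen_1, vlen_2, and v_data are
invented; the identifiers that ncgen actually emits may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Local mirror of nc_vlen_t, defined here to avoid depending on netcdf.h. */
typedef struct vlen_t {
    size_t len; /* number of elements */
    void*  p;   /* pointer to the elements */
} vlen_t;

/* Each vlen constant is emitted first, as its own C variable. */
static int vlen_1[] = {1, 2, 3};
static int vlen_2[] = {4, 5};

/* The datalist constant then refers to them as {count, name} pairs. */
static vlen_t v_data[2] = {
    {3, vlen_1},
    {2, vlen_2}
};

/* Sum every element reachable from a vector of int-valued vlen instances;
   this only demonstrates that the pointers lead to the separate chunks. */
static long vlen_sum(const vlen_t* items, size_t nitems)
{
    long total = 0;
    size_t i, j;
    assert(items != NULL);
    for(i = 0; i < nitems; i++) {
        const int* elems = (const int*)items[i].p;
        for(j = 0; j < items[i].len; j++)
            total += elems[j];
    }
    return total;
}
```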
<h2>Utility Data Structures</h2>
<h3>Pool Memory Allocation</h3>
As an approximation to garbage collection,
this code uses a pool allocation mechanism.
The goal is to allow dynamic construction of strings
that have very short life-times; typically they are used
to construct strings to send to the output file.
<p>
The pool mechanism wraps malloc and records the malloc'd
memory in a circular buffer. When the buffer reaches its maximum
size, previously allocated pool buffers are free'd.
This is good in that the user does not have to litter
code with free() statements. It is bad in that the pool
allocated memory can be free'd too early if the memory
does not have a short enough life.
If you suspect the latter, then bump the size of the circular buffer
and see if the problem goes away. If so, then your code
is probably holding on to a pool buffer too long and should use
regular malloc/free.
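<p>
A minimal sketch of such a pool is shown below. It is not the ncgen
implementation: the names pool_alloc and POOL_MAX are assumptions, and the pool is
kept deliberately tiny so that the recycling behavior is easy to observe.

```c
#include <assert.h>
#include <stdlib.h>

#define POOL_MAX 4  /* assumed size of the circular buffer */

static void*  pool[POOL_MAX];  /* circular record of live allocations */
static size_t pool_next = 0;   /* slot that the next allocation will occupy */

/* Allocate zeroed memory and record it in the pool. The allocation made
   POOL_MAX calls earlier is freed to make room, so callers never call
   free() themselves. Returns NULL if memory could not be obtained. */
static void* pool_alloc(size_t nbytes)
{
    void* mem;
    assert(nbytes > 0);
    mem = calloc(1, nbytes);
    if(mem == NULL) return NULL;
    free(pool[pool_next]);  /* free(NULL) is a no-op on the first lap */
    pool[pool_next] = mem;
    pool_next = (pool_next + 1) % POOL_MAX;
    return mem;
}
```

With POOL_MAX set to 4, a pointer returned by pool_alloc() becomes invalid after
four further calls, which is exactly the "freed too early" hazard described above.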
<p>
In the end, I am not sure if this is a good idea, but
it does make the code simpler.
<h3><a name="List">List</a> and <a name="Bytebuffer">Bytebuffer</a></h3>
The two datatypes List and Bytebuffer are used throughout the
code. They correspond closely in semantics to the Java ArrayList
and StringBuffer types, respectively. They are used to help
encapsulate dynamically growing lists of objects or bytes
to reduce certain kinds of memory allocation errors.
<p>
The canonical code for non-destructive walking of a List&lt;T&gt;
is as follows.
<pre>
for(i=0;i&lt;listlength(list);i++) {
    T* element = (T*)listget(list,i);
    ...
}
</pre>
<p>
Bytebuffer provides two ways to access its internal buffer of characters.
One is "bbContents()", which returns a direct pointer to the buffer,
and the other is "bbDup()", which returns a malloc'd string containing
the contents and is guaranteed to be null terminated.
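<p>
The difference between the two access styles can be illustrated with a miniature
buffer. The MiniBuf type and the minibuf_* names are hypothetical; the real
operations are declared in bytebuffer.h and may behave differently in detail.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of a byte buffer. */
typedef struct MiniBuf {
    char*  data;   /* not necessarily NUL-terminated */
    size_t length; /* bytes in use */
} MiniBuf;

/* Append n bytes, growing the storage to fit. Returns 0 on failure. */
static int minibuf_append(MiniBuf* mb, const char* s, size_t n)
{
    char* grown;
    assert(mb != NULL && s != NULL);
    if(n == 0) return 1;
    grown = (char*)realloc(mb->data, mb->length + n);
    if(grown == NULL) return 0;
    memcpy(grown + mb->length, s, n);
    mb->data = grown;
    mb->length += n;
    return 1;
}

/* Direct access: cheap, but the pointer may dangle after the next append
   and the bytes are not NUL-terminated. */
static const char* minibuf_contents(const MiniBuf* mb)
{
    return mb->data;
}

/* Copy: the caller owns the result, which is always NUL-terminated. */
static char* minibuf_dup(const MiniBuf* mb)
{
    char* copy = (char*)malloc(mb->length + 1);
    if(copy == NULL) return NULL;
    if(mb->length > 0) memcpy(copy, mb->data, mb->length);
    copy[mb->length] = '\0';
    return copy;
}
```

The direct pointer suits an immediate write of a known length, while the
duplicate suits any string that must outlive further changes to the buffer.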
<h3><a name="odometer">Odometer: Multi-Dimensional Array Handling</a></h3>
The odometer data type is used to convert
multiple dimensions into a single integer.
The rule for converting a multi-dimensional
array to a single dimension is as follows.
<p>
Suppose we have the declaration <code>int F[2][5][3];</code>.
There are obviously a total of 2 x 5 x 3 = 30 integers in F.
Thus, these three dimensions will be reduced to a single dimension of size 30.
<p>
A particular point in the three dimensions, say [x][y][z], is reduced to
a number in the range 0..29 by computing <code>((x*5)+y)*3+z</code>.
The corresponding general C code is as follows.
<pre>
size_t
dimmap(int rank, size_t* indices, size_t* sizes)
{
    int i;
    size_t count = 0;
    for(i=0;i&lt;rank;i++) {
        if(i > 0) count *= sizes[i];
        count += indices[i];
    }
    return count;
}
</pre>
In this code, the indices variable corresponds to x, y, and z.
The sizes variable corresponds to 2, 5, and 3.
<p>
The Odometer type stores a set of dimensions
and supports operations to iterate over all possible
dimension combinations.
Odometer is defined by the types Odometer and Dimdata.
<pre>
typedef struct Dimdata {
    unsigned long datasize; // actual size of the datalist item
    unsigned long index;    // 0 &lt;= index &lt; datasize
    unsigned long declsize;
} Dimdata;
typedef struct Odometer {
    int rank;
    Dimdata dims[NC_MAX_VAR_DIMS];
} Odometer;
</pre>
The following primary operations are defined.
<ul>
<li>Odometer* newodometer(Dimset*) - create an odometer from a set of Dimsets.
<li>void freeodometer(Odometer*) - release the memory of an odometer.
<li>int odometermore(Odometer* odom) - return 1 if there are more combinations
of dimension values.
<li>int odometerincr(Odometer* odo,int) - move to the next combination
of dimension values.
<li>unsigned long odometercount(Odometer* odo) -
apply the above algorithm to convert the current odometer combination
into a single integer.
</ul>
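<p>
The iteration pattern behind these operations can be sketched with a simplified
odometer. MiniOdom and the miniodom_* functions are stand-ins written for this
example; they keep one size and one index per dimension and ignore the
datasize/declsize distinction made by the real Dimdata.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RANK 8  /* stand-in for NC_MAX_VAR_DIMS */

/* Simplified odometer: one size and one current index per dimension. */
typedef struct MiniOdom {
    int    rank;
    size_t size[MAX_RANK];
    size_t index[MAX_RANK];
} MiniOdom;

/* Nonzero while the odometer still denotes a valid combination. */
static int miniodom_more(const MiniOdom* odom)
{
    return odom->index[0] < odom->size[0];
}

/* Advance to the next combination: bump the last index and carry leftward.
   Index zero is allowed to run off the end; that terminates the walk. */
static void miniodom_incr(MiniOdom* odom)
{
    int i;
    assert(odom->rank > 0 && odom->rank <= MAX_RANK);
    for(i = odom->rank - 1; i >= 0; i--) {
        odom->index[i]++;
        if(odom->index[i] < odom->size[i]) break; /* no carry needed */
        if(i == 0) break;                          /* leave the overflow */
        odom->index[i] = 0;                        /* wrap and carry */
    }
}

/* Linear offset of the current combination (same rule as dimmap above). */
static size_t miniodom_count(const MiniOdom* odom)
{
    size_t count = 0;
    int i;
    for(i = 0; i < odom->rank; i++) {
        if(i > 0) count *= odom->size[i];
        count += odom->index[i];
    }
    return count;
}
```

Walking an odometer of sizes 2, 5, and 3 with a while(miniodom_more(...)) loop
visits the linear offsets 0 through 29 in order.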
<h2>Misc. Notes</h2>
<ul>
<li> The flag "usingclassic" should be consulted when appropriate to determine
if this CDL file should be treated as using only the netCDF classic model.
</ul>
<h2><u>Change Log</u></h2>
<ul>
<li>07/04/2009 - First draft.
</ul>
</body>
</html>
<p>
<i>genc_scalardata</i> or <i>genc_arraydata</i>.
It stores in its Bytebuffer argument the sequence of constants
corresponding to a given datalist. Handling commas is a tricky issue,
so you will find that many of the non-top-level routines in cdata.c
take a pointer to a global state element, commap, that determines the
current state of adding commas. The idea is that at the beginning of
any (sub-)Datalist, we want to turn off the comma in front of the
first generated constant and then add commas until we reach the end
of that (sub-)Datalist.
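<p>
The comma handling described above can be sketched with a small state flag. The
CommaState type and the emit_* functions are hypothetical and only illustrate the
idea; they do not reproduce commap or the commify procedure.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical comma state: nonzero means "emit a comma before the next
   constant". */
typedef struct CommaState {
    int needcomma;
} CommaState;

/* Append one constant to buf, preceded by ", " unless it is the first
   constant of the current (sub-)list. */
static void emit_constant(char* buf, size_t bufsize, CommaState* state,
                          const char* text)
{
    assert(strlen(buf) + strlen(text) + 3 <= bufsize);
    if(state->needcomma) strcat(buf, ", ");
    strcat(buf, text);
    state->needcomma = 1;
}

/* Emit a braced sub-list. The comma is turned off on entry so that none
   follows the opening brace, and turned on at exit so that the constant
   after the closing brace is preceded by one. */
static void emit_sublist(char* buf, size_t bufsize, CommaState* state,
                         const char** items, size_t nitems)
{
    size_t i;
    assert(strlen(buf) + 4 <= bufsize);
    if(state->needcomma) strcat(buf, ", ");
    strcat(buf, "{");
    state->needcomma = 0;
    for(i = 0; i < nitems; i++)
        emit_constant(buf, bufsize, state, items[i]);
    assert(strlen(buf) + 2 <= bufsize);
    strcat(buf, "}");
    state->needcomma = 1;
}
```

Emitting the constants 1 and 2, then the sub-list containing 3 and 4, then the
constant 5 produces the text "1, 2, {3, 4}, 5".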
<h1><u><a name="GIT">General Internals Information</a></u></h1>
<h2><u>Primary NCGEN Data Structures</u></h2>
There are two primary structures used in ncgen:
<a href="#Symbol">struct Symbol</a> and
<a href="#Datalist">struct Datalist</a>.
<h3><a name="Symbol">struct Symbol</a></h3>
Symbol objects are linked into hierarchical structures
to represent netcdf dimensions, types, groups, and variables.
The struct has the following fields.
<table>
<tr><th colspan=3>struct Symbol Fields
<tr valign=top><td>struct Symbol* next<td>-<td>
The Symbol objects are all kept on a single linked list.
No symbol is ever deleted until the end of the program.
<tr valign=top><td>nc_class objectclass<td>-<td>
This defines the general class of symbol, one of: NC_GRP, NC_DIM, NC_VAR, NC_ATT, or NC_TYPE.
<tr valign=top><td>nc_class subclass<td>-<td>
This defines the subclass of symbol, one of:
NC_PRIM, NC_OPAQUE, NC_ENUM,
NC_FIELD, NC_VLEN, NC_COMPOUND,
NC_ECONST, NC_ARRAY, or NC_FILLVALUE.
<tr valign=top><td>char* name<td>-<td>
The symbol's name.
<tr valign=top><td>struct Symbol* container<td>-<td>
The symbol that is the container for this symbol.
Typically, this is the group symbol that contains
this symbol.
<tr valign=top><td>struct Symbol* location<td>-<td>
The current group that was open when this symbol was created.
<tr valign=top><td>List* subnodes<td>-<td>
The list of child symbols of this symbol.
For example, a group symbol will have its dimensions,
types, vars, and subgroups in this list.
<tr valign=top><td>int is_prefixed<td>-<td>
True if the name of this symbol contains a complete
prefix path (e.g. /x/y/z).
<tr valign=top><td>List* prefix<td>-<td>
A list of the prefix names for this node.
Note that if is_prefixed is false, then this
list was constructed from the set of enclosing groups.
<tr valign=top><td>struct Datalist* data<td>-<td>
Stores the constants from attribute or datalist
constructs.
<tr valign=top><td>Typeinfo typ<td>-<td>
Type information about this symbol
as defined by the Typeinfo structure.
<tr valign=top><td>Varinfo var<td>-<td>
Variable information about a variable symbol
as defined by the Varinfo structure.
<tr valign=top><td>Attrinfo att<td>-<td>
Attribute information about an attribute symbol
as defined by the Attrinfo structure.
<tr valign=top><td>Diminfo dim<td>-<td>
Dimension information about a dimension symbol
as defined by the Diminfo structure.
<tr valign=top><td>Groupinfo grp<td>-<td>
Group information about a group symbol
as defined by the Groupinfo structure.
<tr valign=top><td>int lineno<td>-<td>
The source line in which this symbol was created.
<tr valign=top><td>int touched<td>-<td>
Used in transitive closure operations
to prevent revisiting symbols.
<tr valign=top><td>char* lname<td>-<td>
Cached C or FORTRAN name (not used?).
<tr valign=top><td>int ncid<td>-<td>
The ncid, varid, dimid, etc. assigned when
defining netCDF objects.
</table>
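<p>
The role of the touched field can be illustrated with a small sketch. The code below is not taken from ncgen: the reduced Symbol structure, the plain array standing in for the List* subnodes field, and the function count_reachable are all assumptions made for this illustration. It shows only the general idea of marking a symbol on first visit so that a traversal never processes it twice.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced, hypothetical stand-in for Symbol: only the fields this sketch
   needs. A plain array replaces the List* used by ncgen. */
typedef struct Symbol {
    const char* name;
    int touched;               /* nonzero once the symbol has been visited */
    struct Symbol** subnodes;  /* child symbols */
    size_t nsubnodes;
} Symbol;

/* Count the symbols reachable from sym, visiting each at most once.
   A symbol that is already marked contributes nothing. */
static size_t count_reachable(Symbol* sym)
{
    size_t i;
    size_t n = 1;
    if (sym == NULL || sym->touched) return 0;
    sym->touched = 1;          /* mark before descending */
    for (i = 0; i < sym->nsubnodes; i++)
        n += count_reachable(sym->subnodes[i]);
    return n;
}
```

A caller would presumably need to clear the touched marks before starting another traversal; how ncgen resets them is not covered by this sketch.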
<h4>struct Groupinfo</h4>
Group symbols primarily keep the group
containment structure in the subnodes field of the Symbol.
<p>
<table>
<tr><th colspan=3>struct Groupinfo Fields
<tr valign=top><td>int is_root<td>-<td>
Is this the root group?
</table>
<h4>struct Diminfo</h4>
The only important information about a dimension,
aside from name, is the dimension size.
Additionally, type definitions may have anonymous
(unnamed) dimensions.
<p>
<table>
<tr><th colspan=3>struct Diminfo Fields
<tr valign=top><td>int isconstant<td>-<td>
Is this an anonymous dimension?
<tr valign=top><td>unsigned int size<td>-<td>
The size of the dimension.
</table>
<h4>struct Varinfo</h4>
Variables require two primary pieces of information:
the set of attributes (including special attributes)
and dimension information. The dimension information
is kept in the Typeinfo structure because things
other than variables have dimensions (e.g. user defined types).
<p>
<table>
<tr><th colspan=3>struct Varinfo Fields
<tr valign=top><td>int nattributes<td>-<td>
The number of attributes; this is redundant but useful.
<tr valign=top><td>List* attributes<td>-<td>
The list of all attribute symbols associated with this
variable.
<tr valign=top><td>Specialdata special<td>-<td>
Special attribute values.
</table>
<h4>struct Typeinfo</h4>
The type information is probably the second most
used structure in all of the code (second to Symbol itself).
<p>
<table>
<tr><th colspan=3>struct Typeinfo Fields
<tr valign=top><td>struct Symbol* basetype<td>-<td>
Provide a reference to the base type of this symbol.
This applies to other types, variables, and attributes.
<tr valign=top><td>int hasvlen<td>-<td>
Does the type have a vlen definition anywhere within it?
This is used as an optimization to avoid searching datalists
for vlen constants.
<tr valign=top><td>nc_type typecode<td>-<td>
The typecode of the basetype. This is most useful
when the basetype is a primitive type.
<tr valign=top><td>unsigned long size<td>-<td>
The size of this object.
<tr valign=top><td>unsigned long offset<td>-<td>
The field offset for fields in compound types.
<tr valign=top><td>unsigned long alignment<td>-<td>
The memory alignment (i.e. 1, 2, 4, or 8).
<tr valign=top><td>Constant econst<td>-<td>
For enumeration constants, the actual value of the constant.
<tr valign=top><td>Dimset dimset<td>-<td>
The dimension information for the type or variable.
The dimset stores the number of dimensions and a list
of pointers to the corresponding dimension symbols.
</table>
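<p>
To make the relationship between the size, offset, and alignment fields concrete, the following sketch lays out the fields of a compound type in order, rounding each offset up to the field's alignment. The names FieldLayout, round_up, and layout_fields are hypothetical and do not appear in ncgen; this is the conventional C-style layout rule, and the computation ncgen actually performs may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the size/offset/alignment fields of Typeinfo. */
typedef struct FieldLayout {
    unsigned long size;      /* size of the field in bytes */
    unsigned long alignment; /* required alignment: 1, 2, 4, or 8 */
    unsigned long offset;    /* computed offset within the compound */
} FieldLayout;

/* Round n up to the next multiple of align (align must be > 0). */
static unsigned long round_up(unsigned long n, unsigned long align)
{
    assert(align > 0);
    return ((n + align - 1) / align) * align;
}

/* Assign offsets to nfields fields in declaration order. Returns the total
   size of the compound, padded to the largest field alignment. */
static unsigned long layout_fields(FieldLayout* fields, size_t nfields)
{
    unsigned long offset = 0;
    unsigned long maxalign = 1;
    size_t i;
    for (i = 0; i < nfields; i++) {
        offset = round_up(offset, fields[i].alignment);
        fields[i].offset = offset;
        offset += fields[i].size;
        if (fields[i].alignment > maxalign)
            maxalign = fields[i].alignment;
    }
    return round_up(offset, maxalign);
}
```

Under this rule, a one-byte field followed by a four-byte field and an eight-byte field (each aligned to its own size) would be placed at offsets 0, 4, and 8.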
<h4>struct Attrinfo</h4>
Note that the actual attribute data is stored
in the data field of the containing Symbol.
<p>
<table>
<tr><th colspan=3>struct Attrinfo Fields
<tr valign=top><td>struct Symbol* var<td>-<td>
The variable with which this attribute is associated;
it is NULL for global attributes.
<tr valign=top><td>unsigned long count<td>-<td>
The number of instances associated with the attribute value.
</table>
<h3><a name="Datalist">Datalists and Datasrcs</a></h3>
Whenever a datalist is encountered during parsing, it is converted
to an instance of struct Datalist.
Each datalist instance contains a vector of instances of
struct Constant that contains the actual data.
<p>
Each datalist instance contains the following information.
<table>
<tr><th colspan=3>struct Datalist Fields
<tr valign=top><td>struct Datalist* next<td>-<td>
All datalists are chained for reclamation.
<tr valign=top><td>int readonly<td>-<td>
Can this datalist be modified?
<tr valign=top><td>unsigned int length<td>-<td>
The number of Constant instances in the data field.
<tr valign=top><td>unsigned int alloc<td>-<td>
The memory space allocated to the data field.
<tr valign=top><td>Constant* data<td>-<td>
The vector in sequential memory of the constants comprising this datalist.
<tr valign=top><td>struct Symbol* schema<td>-<td>
The symbol (type, variable, or attribute) defining the structure of this datalist,
if known.
<tr valign=top><td>struct Vlen {<td>-<td>
Information about the vlen instances contained in this datalist.
<tr><td>unsigned int count;
<tr><td>unsigned int uid;
<tr><td>} vlen
<tr valign=top><td>Odometer* dimdata<td>-<td>
A tracker to count through dimensions associated with this datalist via the schema.
</table>
<p>
In turn, a Constant instance is defined as follows.
<pre>
typedef struct Constant {
nc_type nctype;
int lineno;
Constvalue value;
} Constant;
</pre>
It indicates the type of the value and the source line number (if known)
in which this constant was created.
<p>
The Constvalue type is a union
of all possible values that can occur
in a datalist.
<pre>
typedef union Constvalue {
struct Datalist* compoundv; // NC_COMPOUND
char charv; // NC_CHAR
signed char int8v; // NC_BYTE
unsigned char uint8v; // NC_UBYTE
short int16v; // NC_SHORT
unsigned short uint16v; // NC_USHORT
int int32v; // NC_INT
unsigned int uint32v; // NC_UINT
long long int64v; // NC_INT64
unsigned long long uint64v; // NC_UINT64
float floatv; // NC_FLOAT
double doublev; // NC_DOUBLE
struct Stringv { // NC_STRING
int len;
char* stringv;
} stringv;
struct Opaquev { // NC_OPAQUE
int len; // length as originally written (rounded to even number)
char* stringv; //as constant was written
// (padded to even # chars >= 16)
// without leading 0x
} opaquev;
struct Symbol* enumv; // NC_ECONST
} Constvalue;
</pre>
<p>
Several fields are of particular interest:
<table>
<tr><th colspan=3>Selected Constvalue Fields
<tr valign=top><td>struct Datalist* compoundv<td>-<td>
This stores nested datalists - typically
of the form "{...{...}...}".
<tr valign=top><td>struct Stringv {int len; char* stringv;} stringv<td>-<td>
Store string constants.
<tr valign=top><td>struct Opaquev {int len; char* stringv;} opaquev<td>-<td>
Store opaque constants as written (i.e. abc...),
without the leading 0x, and
padded to an even number of characters and to
at least 16 characters in length.
<tr valign=top><td>struct Symbol* enumv<td>-<td>
Pointer to an enumeration constant definition.
</table>
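<p>
Because Constant pairs a type tag (nctype) with a union, code that consumes a constant has to dispatch on the tag before reading a union member. The sketch below shows that pattern on a reduced copy of the structures. The three type-code macros are local stand-ins so that the example compiles without netcdf.h, the union keeps only three of the members listed above, and constant_format is a hypothetical helper that is not part of ncgen.

```c
#include <assert.h>
#include <stdio.h>

/* Local stand-ins so the sketch compiles without netcdf.h. */
typedef int nc_type;
#define NC_INT    4
#define NC_DOUBLE 6
#define NC_STRING 12

/* Reduced version of Constvalue: only three of the members shown above. */
typedef union Constvalue {
    int int32v;          /* NC_INT */
    double doublev;      /* NC_DOUBLE */
    struct Stringv {     /* NC_STRING */
        int len;
        char* stringv;
    } stringv;
} Constvalue;

typedef struct Constant {
    nc_type nctype;
    int lineno;
    Constvalue value;
} Constant;

/* Hypothetical helper: render a constant as text by dispatching on nctype.
   Returns the snprintf result, or -1 for a type this sketch does not handle. */
static int constant_format(const Constant* con, char* buf, size_t buflen)
{
    assert(con != NULL && buf != NULL);
    switch (con->nctype) {
    case NC_INT:
        return snprintf(buf, buflen, "%d", con->value.int32v);
    case NC_DOUBLE:
        return snprintf(buf, buflen, "%g", con->value.doublev);
    case NC_STRING:
        return snprintf(buf, buflen, "\"%.*s\"",
                        con->value.stringv.len, con->value.stringv.stringv);
    default:
        return -1;
    }
}
```

Reading a member that does not correspond to nctype would give a meaningless value, which is why the switch on the tag comes first.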
<h4>struct Datasrc</h4>
When it comes time to generate datalists for output,
it is necessary to "walk" the datalist (including any nested
datalists). The Datasrc structure is used to do this.
Its definition is as follows.
<pre>
typedef struct Datasrc {
unsigned int index; // 0..length-1
unsigned int length;
int autopop; // pop when at end
Constant* data; // duplicate pointer; so do not free.
struct Datasrc* stack;
} Datasrc;
</pre>
The Datasrc tracks the "current" location in the sequence
of Constants (taken from a Datalist). The index field indicates
the current location.
In effect, the Datasrc acts as a lexer, and the code
that walks it is parsing the data sequence.
The following operations are supported (see data.[ch]).
<ul>
<li>datalist2src - takes a Datalist and constructs a Datasrc.
<li>srcpush - assumes the current constant is a nested Datalist
and pushes into that Datalist.
<li>srcpushlist - pushes into the passed Datalist argument.
<li>srcpop - pops the current list and resumes the next list in the
stack.
<li>srcnext - return the value at the index
and then advance the Datasrc index.
If at the end of the current datalist, then return NULL;
srcincr is an alias for srcnext.
<li>srcmore - return 1 if not at the end of the current Datasrc.
Pushed datalists are not considered.
<li>srcline - return a usable line number associated with the current
position of the Datasrc (that is why Constant instances have a line
number).
<li>srcpeek - return the value at the index but do not advance.
If at the end of the current datalist, then return NULL; srcget is an alias
for srcpeek.
</ul>
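<p>
The way these operations combine can be sketched with a simplified model. Everything below is written for this illustration: the demo_ functions are reduced re-implementations rather than the code in data.[ch], the Constant type is cut down to an integer or a nested list, and the autopop field is omitted. The sketch walks a datalist, pushes into each nested datalist it meets, and pops back to the enclosing list when the nested one is exhausted.

```c
#include <assert.h>
#include <stdlib.h>

struct Datalist;

/* Reduced Constant: either an int or a nested datalist. */
typedef struct Constant {
    int iscompound;              /* stand-in for nctype == NC_COMPOUND */
    int int32v;
    struct Datalist* compoundv;
} Constant;

typedef struct Datalist {
    unsigned int length;
    Constant* data;
} Datalist;

typedef struct Datasrc {
    unsigned int index;          /* 0..length-1 */
    unsigned int length;
    Constant* data;              /* borrowed from the Datalist; not freed */
    struct Datasrc* stack;       /* saved position in the enclosing list */
} Datasrc;

static Datasrc* demo_datalist2src(Datalist* dl)
{
    Datasrc* src = (Datasrc*)calloc(1, sizeof(Datasrc));
    assert(src != NULL && dl != NULL);
    src->length = dl->length;
    src->data = dl->data;
    return src;
}

static int demo_srcmore(const Datasrc* src)
{
    return src->index < src->length;
}

/* Return the current constant and advance, or NULL at the end of the list. */
static Constant* demo_srcnext(Datasrc* src)
{
    if (!demo_srcmore(src)) return NULL;
    return &src->data[src->index++];
}

/* Save the current position on the stack and descend into dl. */
static void demo_srcpushlist(Datasrc* src, Datalist* dl)
{
    Datasrc* saved = (Datasrc*)malloc(sizeof(Datasrc));
    assert(saved != NULL);
    *saved = *src;
    src->stack = saved;
    src->index = 0;
    src->length = dl->length;
    src->data = dl->data;
}

/* Restore the position saved by the most recent push. */
static void demo_srcpop(Datasrc* src)
{
    Datasrc* saved = src->stack;
    assert(saved != NULL);
    *src = *saved;
    free(saved);
}

/* Sum every integer in the datalist, descending into nested lists. */
static long demo_sum(Datalist* dl)
{
    Datasrc* src = demo_datalist2src(dl);
    long total = 0;
    for (;;) {
        Constant* con = demo_srcnext(src);
        if (con == NULL) {
            if (src->stack == NULL) break;  /* end of the outermost list */
            demo_srcpop(src);               /* resume the enclosing list */
        } else if (con->iscompound) {
            demo_srcpushlist(src, con->compoundv);
        } else {
            total += con->int32v;
        }
    }
    free(src);
    return total;
}
```

In this model the nested constant is consumed before the push, so after the matching pop the walk resumes with the element that follows the nested list.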
<h2><u>The CDL Parser</u></h2>
The CDL parser and associated lexer
(primarily files "ncgen.y" and "ncgen.l")
parse CDL files into various data structures
for use by the remaining ncgen code.
The data structures described above
(<a href="#Symbol">Symbol</a> and
<a href="#Datalist">Datalist</a>)
are primarily generated by the parser.
<h3>Parse Cliches</h3>
<h4>Node Stacking</h4>
One of the issues that must be addressed by any bottom-up
parser is handling the accumulation of sets of items (nodes,
etc.). The YACC/Bison parse stack cannot be used
because the set of accumulated nodes is unbounded
and the YACC stack mechanism is bounded (i.e. each rule
has a bounded right hand side length).
<p>
The node stacking set of cliches is ubiquitous in the
parser, so it must be understood in order to understand how the
parser works. The cliche here is shown in the handling of,
for example, the varlist rule, which is defined as follows.
<pre>
varlist: varspec
{$$=listlength(stack); listpush(stack,(elem_t)$1);}
| varlist ',' varspec
{$$=$1; listpush(stack,(elem_t)$3);}
;
</pre>
The varlist rule collects variable name declarations (via the varspec rule).
The idea is to use a separate stack named "stack", and to track
the index into the stack of the start of the collection of objects.
The varlist value (in the YACC sense) is defined as an integer
representing the size of the stack at the start of a list of variables.
That is what this code does: <code>$$=listlength(stack)</code>.
<p>
At the point where the set of varspecs should be processed, the following code cliche
is used.
<pre>