\input texinfo
@c %**start of header
@setfilename R-ints.info
@settitle R Internals
@setchapternewpage on
@c %**end of header
@c @documentencoding ISO-8859-1
@syncodeindex fn vr
@dircategory Programming
@direntry
* R Internals: (R-ints). R Internals.
@end direntry
@finalout
@include R-defs.texi
@include version.texi
@ifinfo
This is a guide to R's internal structures.
@Rcopyright{1999}
@ignore
Permission is granted to process this file through TeX and print the
results, provided the printed document carries a copying permission
notice identical to this one except for the removal of this paragraph
(this paragraph not being relevant to the printed manual).
@end ignore
@permission{}
@c ---------- ^- read that
@end ifinfo
@titlepage
@title R Internals
@subtitle Version @value{VERSION}
@author R Development Core Team
@page
@vskip 0pt plus 1filll
@permission{}
@Rcopyright{1999}
@value{ISBN-ints}
@end titlepage
@c @ifnothtml
@contents
@c @end ifnothtml
@ifnottex
@node Top, R Internal Structures, (dir), (dir)
@top R Internals
This is a guide to the internal structures of @R{} and coding standards for
the core team working on @R{} itself.
The current version of this document is @value{VERSION}.
@value{ISBN-ints}
@end ifnottex
@menu
* R Internal Structures::
* .Internal vs .Primitive::
* Internationalization in the R sources::
* Graphics Devices::
* Tools::
* R coding standards::
* Testing R code::
* Function and variable index::
* Concept index::
@end menu
@node R Internal Structures, .Internal vs .Primitive, Top, Top
@chapter R Internal Structures
This chapter is the beginnings of documentation about @R{} internal
structures. It is written for the R core team and others studying the
code in the @file{src/main} directory.
It is a work-in-progress, first begun for @R{} 2.4.0, and should be
checked against the current version of the source code.
@menu
* SEXPs::
* Environments and variable lookup::
* Attributes::
* Contexts::
* Argument evaluation::
* Autoprinting::
* The write barrier::
* Serialization Formats::
* Encodings for CHARSXPs::
* The CHARSXP cache::
* Warnings and errors::
* S4 objects::
* Memory allocators::
* Internal use of global and base environments::
* Modules::
@end menu
@node SEXPs, Environments and variable lookup, R Internal Structures, R Internal Structures
@section SEXPs
@cindex SEXP
@cindex SEXPREC
What @R{} users think of as @emph{variables} or @emph{objects} are
symbols which are bound to a value. The value can be thought of as
either a @code{SEXP} (a pointer), or the structure it points to, a
@code{SEXPREC} (and there are alternative forms used for vectors, namely
@code{VECSXP} pointing to @code{VECTOR_SEXPREC} structures).
So the basic building blocks of @R{} objects are often called
@emph{nodes}, meaning @code{SEXPREC}s or @code{VECTOR_SEXPREC}s.
Note that the internal structure of the @code{SEXPREC} is not made
available to R Extensions: rather @code{SEXP} is an opaque pointer, and
the internals can only be accessed by the functions provided.
@cindex node
Both types of node structure begin with a 32-bit
@code{sxpinfo} header followed by three pointers (to the attributes and the
previous and next node in a doubly-linked list), and then some further
fields. On a 32-bit platform a node@footnote{strictly, a @code{SEXPREC}
node; @code{VECTOR_SEXPREC} nodes are slightly smaller but followed by
data in the node.} occupies 28 bytes: on a 64-bit platform typically 56
bytes (depending on alignment constraints).
The first five bits of the @code{sxpinfo} header specify one of up to 32
@code{SEXPTYPE}s.
@menu
* SEXPTYPEs::
* Rest of header::
* The 'data'::
* Allocation classes::
@end menu
@node SEXPTYPEs, Rest of header, SEXPs, SEXPs
@subsection SEXPTYPEs
@cindex SEXPTYPE
Currently @code{SEXPTYPE}s 0:10 and 13:25 are in use. Values 11 and 12 were
used for internal factors and ordered factors and have since been
withdrawn. Note that the @code{SEXPTYPE}s are stored in @code{save}d
objects and that the ordering of the types is used, so the gap cannot
easily be reused.
@cindex SEXPTYPE table
@quotation
@multitable {no} {SPECIALSXPXXX} {S4 classes not of simple type}
@headitem no @tab SEXPTYPE@tab Description
@item @code{0} @tab @code{NILSXP} @tab @code{NULL}
@item @code{1} @tab @code{SYMSXP} @tab symbols
@item @code{2} @tab @code{LISTSXP} @tab pairlists
@item @code{3} @tab @code{CLOSXP} @tab closures
@item @code{4} @tab @code{ENVSXP} @tab environments
@item @code{5} @tab @code{PROMSXP} @tab promises
@item @code{6} @tab @code{LANGSXP} @tab language objects
@item @code{7} @tab @code{SPECIALSXP} @tab special functions
@item @code{8} @tab @code{BUILTINSXP} @tab builtin functions
@item @code{9} @tab @code{CHARSXP} @tab internal character strings
@item @code{10} @tab @code{LGLSXP} @tab logical vectors
@item @code{13} @tab @code{INTSXP} @tab integer vectors
@item @code{14} @tab @code{REALSXP} @tab numeric vectors
@item @code{15} @tab @code{CPLXSXP} @tab complex vectors
@item @code{16} @tab @code{STRSXP} @tab character vectors
@item @code{17} @tab @code{DOTSXP} @tab dot-dot-dot object
@item @code{18} @tab @code{ANYSXP} @tab make ``any'' args work
@item @code{19} @tab @code{VECSXP} @tab list (generic vector)
@item @code{20} @tab @code{EXPRSXP} @tab expression vector
@item @code{21} @tab @code{BCODESXP} @tab byte code
@item @code{22} @tab @code{EXTPTRSXP} @tab external pointer
@item @code{23} @tab @code{WEAKREFSXP} @tab weak reference
@item @code{24} @tab @code{RAWSXP} @tab raw vector
@item @code{25} @tab @code{S4SXP} @tab S4 classes not of simple type
@end multitable
@end quotation
@cindex atomic vector type
Many of these will be familiar from @R{} level: the atomic vector types
are @code{LGLSXP}, @code{INTSXP}, @code{REALSXP}, @code{CPLXSXP},
@code{STRSXP} and @code{RAWSXP}. Lists are @code{VECSXP} and names
(also known as symbols) are @code{SYMSXP}. Pairlists (@code{LISTSXP},
the name going back to the origins of @R{} as a Scheme-like language)
are rarely seen at @R{} level, but are for example used for argument
lists. Character vectors are effectively lists all of whose elements
are @code{CHARSXP}, a type that is rarely visible at @R{} level.
@cindex language object
@cindex argument list
Language objects (@code{LANGSXP}) are calls (including formulae and so
on). Internally they are pairlists with first element a
reference@footnote{a pointer to a function or a symbol to look up the
function by name, or a language object to be evaluated to give a
function.} to the function to be called with remaining elements the
actual arguments for the call (and with the tags if present giving the
specified argument names). Although this is not enforced, many places
in the code assume that the pairlist is of length one or more, often
without checking.
@cindex expression
Expressions are of type @code{EXPRSXP}: they are a vector of (usually
language) objects most often seen as the result of @code{parse()}.
@cindex function
The functions are of types @code{CLOSXP}, @code{SPECIALSXP} and
@code{BUILTINSXP}: where @code{SEXPTYPE}s are stored in an integer
these are sometimes lumped into a pseudo-type @code{FUNSXP} with code
99. Functions defined via @code{function} are of type @code{CLOSXP} and
have formals, body and environment.
@cindex S4 type
The @code{SEXPTYPE} @code{S4SXP} was introduced in @R{} 2.4.0 for S4
classes which were previously represented as empty lists, that is
objects which do not consist solely of a simple type such as an atomic
vector or function.
@node Rest of header, The 'data', SEXPTYPEs, SEXPs
@subsection Rest of header
The @code{sxpinfo} header is defined as a 32-bit C structure by
@example
struct sxpinfo_struct @{
    SEXPTYPE type      :  5;  /* @r{discussed above} */
    unsigned int obj   :  1;  /* @r{is this an object with a class attribute?} */
    unsigned int named :  2;  /* @r{used to control copying} */
    unsigned int gp    : 16;  /* @r{general purpose, see below} */
    unsigned int mark  :  1;  /* @r{mark object as `in use' in GC} */
    unsigned int debug :  1;
    unsigned int trace :  1;
    unsigned int spare :  1;  /* @r{unused} */
    unsigned int gcgen :  1;  /* @r{generation for GC} */
    unsigned int gccls :  3;  /* @r{class of node for GC} */
@};  /* Tot: 32 */
@end example
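The bit widths above can be checked with a small stand-alone sketch. The structure name @code{sxpinfo_sketch} and the helper functions are illustrative assumptions, not part of the @R{} sources, and @code{SEXPTYPE} is replaced by @code{unsigned int} so that the fragment compiles on its own. How bit-fields are packed is implementation-defined in C, so the 32-bit total is a property of the declared widths rather than a guarantee about every compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative copy of the header layout described in the text.
   Not R's own declaration: SEXPTYPE is replaced by unsigned int. */
struct sxpinfo_sketch {
    unsigned int type  :  5;  /* one of up to 32 SEXPTYPEs */
    unsigned int obj   :  1;
    unsigned int named :  2;
    unsigned int gp    : 16;
    unsigned int mark  :  1;
    unsigned int debug :  1;
    unsigned int trace :  1;
    unsigned int spare :  1;
    unsigned int gcgen :  1;
    unsigned int gccls :  3;
};

/* Sum of the declared bit widths, in declaration order. */
int sxpinfo_total_bits(void)
{
    return 5 + 1 + 2 + 16 + 1 + 1 + 1 + 1 + 1 + 3;
}

/* Largest value the 5-bit type field can hold: storing an all-ones
   value into an unsigned bit-field keeps only the low 5 bits. */
unsigned int sxpinfo_max_type(void)
{
    struct sxpinfo_sketch h = {0};
    unsigned int all_ones = ~0u;
    h.type = all_ones;
    return h.type;
}
```

With the usual compilers the whole structure occupies a single 32-bit word, which is why the header is described as 32-bit throughout this chapter.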
@findex debug bit
The @code{debug} bit is used for closures and environments. For
closures it is set by @code{debug()} and unset by @code{undebug()}, and
indicates that evaluations of the function should be run under the
browser. For environments it indicates whether the browsing is in
single-step mode.
@findex trace bit
The @code{trace} bit is used for functions for @code{trace()} and for
other objects when tracing duplications (see @code{tracemem}).
@findex named bit
@findex NAMED
@findex SET_NAMED
@cindex copying semantics
The @code{named} field is set and accessed by the @code{SET_NAMED} and
@code{NAMED} macros, and takes values @code{0}, @code{1} and @code{2}.
@R{} has a `call by value' illusion, so an assignment like
@example
b <- a
@end example
@noindent
appears to make a copy of @code{a} and refer to it as @code{b}.
However, if neither @code{a} nor @code{b} is subsequently altered there
is no need to copy. What really happens is that a new symbol @code{b}
is bound to the same value as @code{a} and the @code{named} field on the
value object is set (in this case to @code{2}). When an object is about
to be altered, the @code{named} field is consulted. A value of @code{2}
means that the object must be duplicated before being changed. (Note
that this does not say that it is necessary to duplicate, only that it
should be duplicated whether necessary or not.) A value of @code{0}
means that it is known that no other @code{SEXP} shares data with this
object, and so it may safely be altered. A value of @code{1} is used
for situations like
@example
dim(a) <- c(7, 2)
@end example
@noindent
where in principle two copies of @code{a} exist for the duration of the
computation as (in principle)
@example
a <- `dim<-`(a, c(7, 2))
@end example
@noindent
but for no longer, and so some primitive functions can be optimized to
avoid a copy in this case.
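The rule for consulting @code{named} before a modification can be sketched with a toy model. Everything here (@code{toy_value} and the @code{toy_} functions) is hypothetical and stands in for the real node structures. As a simplification, the sketch treats a value of @code{1} like @code{0}, whereas in @R{} only some primitives take advantage of that case.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy value object: a `named' field plus a small integer payload.
   A hypothetical stand-in, not R's SEXPREC. */
typedef struct toy_value {
    int named;      /* 0, 1 or 2, as described in the text */
    size_t length;
    int *data;
} toy_value;

toy_value *toy_alloc(size_t length)
{
    toy_value *v = malloc(sizeof *v);
    assert(v != NULL);
    v->named = 0;
    v->length = length;
    v->data = calloc(length ? length : 1, sizeof *v->data);
    assert(v->data != NULL);
    return v;
}

void toy_free(toy_value *v)
{
    free(v->data);
    free(v);
}

/* Model of `b <- a': no copy is made, the shared value is marked. */
toy_value *toy_bind_second_symbol(toy_value *v)
{
    v->named = 2;
    return v;
}

/* Return the object that may be altered: the original when named < 2,
   otherwise a fresh duplicate with named reset to 0. */
toy_value *toy_prepare_for_modification(toy_value *v)
{
    if (v->named < 2)
        return v;
    toy_value *copy = toy_alloc(v->length);
    memcpy(copy->data, v->data, v->length * sizeof *v->data);
    return copy;
}
```

After @code{toy_bind_second_symbol} the next call to @code{toy_prepare_for_modification} returns a duplicate, so the value seen through the other symbol is left unchanged.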
The @code{gp} bits are by definition `general purpose'. We label these
from 0 to 15. As of version 2.4.0 of R, bit 4 is turned on to mark S4
objects. Bits 0-3 and bits 14-15 have been used previously as described
below (from detective work on the sources).
@findex gp bits
@findex LEVELS
@findex SETLEVELS
The bits can be accessed and set by the @code{LEVELS} and
@code{SETLEVELS} macros, which names appear to date back to the internal
factor and ordered types and are now used in only a few places in the
code. The @code{gp} field is serialized/unserialized for the
@code{SEXPTYPE}s other than @code{NILSXP}, @code{SYMSXP} and
@code{ENVSXP}.
Bits 14 and 15 of @code{gp} are used for `fancy bindings'. Bit 14 is
used to lock a binding or an environment, and bit 15 is used to indicate
an active binding. (For the definition of an `active binding' see the
header comments in file @file{src/main/envir.c}.) Bit 15 is used for an
environment to indicate if it participates in the global cache.
Almost all other uses seem to be only of bits 0 and 1, although one
reserves the first four bits.
@findex ARGUSED
@findex SET_ARGUSED
The macros @code{ARGUSED} and @code{SET_ARGUSED} are used when matching
actual and formal function arguments, and take the values 0, 1 and 2.
@findex MISSING
@findex SET_MISSING
The macros @code{MISSING} and @code{SET_MISSING} are used for pairlists
of arguments. Four bits are reserved, but only two are used (and
exactly what for is not explained). It seems that bit 0 is used by
@code{matchArgs} to mark missingness on the returned argument list, and
bit 1 is used to mark the use of a default value for an argument copied
to the evaluation frame of a closure.
@findex DDVAL
@findex SET_DDVAL
@cindex ... argument
Bit 0 is used by macros @code{DDVAL} and @code{SET_DDVAL}. This
indicates that a @code{SYMSXP} is one of the symbols @code{..n} which
are implicitly created when @code{...} is processed, and so indicates
that it may need to be looked up in a @code{DOTSXP}.
@findex PRSEEN
@cindex promise
Bit 0 is used for @code{PRSEEN}, a flag to indicate if a promise has
already been seen during the evaluation of the promise (and so to avoid
recursive loops).
Bit 0 is used for @code{HASHASH}, on the @code{PRINTNAME} of the
@code{TAG} of the frame of an environment.
Bits 0 and 1 are used for weak references (to indicate `ready to
finalize', `finalize on exit').
Bit 0 is used by the condition handling system (on a @code{VECSXP}) to
indicate a calling handler.
As from @R{} 2.5.0, bits 2 and 3 for a @code{CHARSXP} are used to note
that it is known to be in Latin-1 and UTF-8 respectively. (These are not
usually set if it is also known to be in ASCII, since code does not need
to know the charset to handle ASCII strings. From @R{} 2.8.0
it is guaranteed that they will not be set for @code{CHARSXP}s created by @R{}
itself.) As from @R{} 2.8.0 bit 5 is used to indicate that a @code{CHARSXP}
is hashed by its address, that is @code{NA_STRING} or in the @code{CHARSXP} cache.
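The bit assignments described in this subsection can be summarized as masks on a 16-bit field. The @code{GP_} names and helper functions below are illustrative assumptions chosen for this sketch and should be checked against the sources. Note that the meaning of a bit depends on the kind of object carrying it: bit 15, for instance, marks an active binding on a binding but global-cache participation on an environment.

```c
#include <assert.h>

/* Hypothetical names for gp bits mentioned in the text. */
enum {
    GP_LATIN1         = 1u << 2,   /* CHARSXP known to be Latin-1 */
    GP_UTF8           = 1u << 3,   /* CHARSXP known to be UTF-8 */
    GP_S4             = 1u << 4,   /* S4 object */
    GP_CACHED         = 1u << 5,   /* CHARSXP hashed by its address */
    GP_BINDING_LOCK   = 1u << 14,  /* locked binding or environment */
    GP_ACTIVE_BINDING = 1u << 15   /* active binding */
};

/* Set the bits in mask, keeping the result within 16 bits. */
unsigned int gp_set(unsigned int gp, unsigned int mask)
{
    return (gp | mask) & 0xFFFFu;
}

/* Clear the bits in mask. */
unsigned int gp_clear(unsigned int gp, unsigned int mask)
{
    return gp & ~mask & 0xFFFFu;
}

/* Non-zero if any bit in mask is set. */
int gp_test(unsigned int gp, unsigned int mask)
{
    return (gp & mask) != 0;
}
```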
@c Finally, @code{SETLEVELS} and @code{LEVELS} are used by that name for
@c the internal code for @code{terms.formula} to compute the @code{order}
@c attribute of the result. This is computed on an internal pairlist, and
@c marks the order of the interaction. This is in principle unlimited
@c (although no test is done) and could in principle exceed 15. (This
@c usage could easily be replaced by one not making use of @code{gp}.)
@node The 'data', Allocation classes, Rest of header, SEXPs
@subsection The `data'
A @code{SEXPREC} is a C structure containing the 32-bit header as
described above, three pointers (to the attributes, previous and next
node) and the node data, a union
@example
union @{
struct primsxp_struct primsxp;
struct symsxp_struct symsxp;
struct listsxp_struct listsxp;
struct envsxp_struct envsxp;
struct closxp_struct closxp;
struct promsxp_struct promsxp;
@} u;
@end example
@noindent
All of these alternatives apart from the first (an @code{int}) are three
pointers, so the union occupies three words.
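The node sizes quoted earlier (28 bytes on a 32-bit platform, typically 56 bytes on a 64-bit one) follow from this layout: a header, three pointers and a three-word union make seven pointer-sized words once the header is padded to pointer alignment. The sketch below uses made-up field names and a plain @code{unsigned int} in place of the bit-field header. It assumes a platform where @code{unsigned int} is 32 bits and pointers are 4 or 8 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative node layout; the field names are assumptions. */
struct node_sketch {
    unsigned int header;              /* stands in for the sxpinfo header */
    struct node_sketch *attrib;       /* attributes */
    struct node_sketch *next;         /* next node in the doubly-linked list */
    struct node_sketch *prev;         /* previous node */
    union {
        int primitive_offset;         /* the one alternative that is an int */
        struct {
            struct node_sketch *a, *b, *c;
        } three_pointers;             /* every other alternative */
    } u;
};

/* Size of the node measured in pointer-sized words. */
size_t node_sketch_words(void)
{
    return sizeof(struct node_sketch) / sizeof(void *);
}
```

On a typical 64-bit build the 4-byte header is followed by 4 bytes of padding, which is where the `depending on alignment constraints' caveat comes from.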
@cindex vector type
The vector types are @code{RAWSXP}, @code{CHARSXP}, @code{LGLSXP},
@code{INTSXP}, @code{REALSXP}, @code{CPLXSXP}, @code{STRSXP},
@code{VECSXP}, @code{EXPRSXP} and @code{WEAKREFSXP}. Remember that such
types are a @code{VECTOR_SEXPREC}, which again consists of the header
and the same three pointers, but followed by two integers giving the
length and `true length'@footnote{This is almost unused. The only
current use is for hash tables of environments (@code{VECSXP}s), where
@code{length} is the size of the table and @code{truelength} is the
number of primary slots in use, and for the reference hash tables in
serialization (@code{VECSXP}s), where @code{truelength} is the number of
slots in use.} of the vector, and then followed by the data (aligned as
required: on most 32-bit systems with a 24-byte @code{VECTOR_SEXPREC}
node the data can follow immediately after the node). The data are a
block of memory of the appropriate length to store `true length'
elements (rounded up to a multiple of 8 bytes, with the 8-byte blocks
being the `Vcells' referred to in the documentation for @code{gc()}).
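The rounding to 8-byte blocks is simple arithmetic. The function name @code{vcells_needed} is hypothetical, and the sketch ignores overflow of the byte count and the extra terminator byte a @code{CHARSXP} needs.

```c
#include <assert.h>
#include <stddef.h>

/* Number of 8-byte blocks needed to hold n_elements items of
   element_size bytes each, rounding up to a whole block. */
size_t vcells_needed(size_t n_elements, size_t element_size)
{
    size_t bytes = n_elements * element_size;
    return (bytes + 7) / 8;
}
```

For example, three 4-byte integers need 12 bytes and so occupy two blocks, while ten 8-byte doubles occupy exactly ten.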
The `data' for the various types are given in the table below. A lot of
this is interpretation, i.e. the types are not checked.
@table @code
@item NILSXP
There is only one object of type @code{NILSXP}, @code{R_NilValue}, with
no data.
@item SYMSXP
Pointers to three nodes, the name, value and internal, accessed by
@code{PRINTNAME} (a @code{CHARSXP}), @code{SYMVALUE} and
@code{INTERNAL}. (If the symbol's value is a @code{.Internal} function,
the last is a pointer to the appropriate @code{SEXPREC}.) Many symbols
have @code{SYMVALUE} @code{R_UnboundValue}.
@item LISTSXP
Pointers to the CAR, CDR (usually a @code{LISTSXP} or @code{NULL}) and TAG
(usually a @code{SYMSXP}).
@item CLOSXP
Pointers to the formals (a pairlist), the body and the environment.
@item ENVSXP
Pointers to the frame, enclosing environment and hash table (@code{NULL} or a
@code{VECSXP}). A frame is a tagged pairlist with tag the symbol and
CAR the bound value.
@item PROMSXP
Pointers to the value, expression and environment (in which to evaluate
the expression). Once a promise has been evaluated, the environment is
set to @code{NULL}.
@item LANGSXP
A special type of @code{LISTSXP} used for function calls. (The CAR
references the function (perhaps via a symbol or language object), and
the CDR the argument list with tags for named arguments.) @R{}-level
documentation references to `expressions' / `language objects' are
mainly @code{LANGSXP}s, but can be symbols (@code{SYMSXP}s) or
expression vectors (@code{EXPRSXP}s).
@item SPECIALSXP
@itemx BUILTINSXP
An integer giving the offset into the table of
primitives/@code{.Internal}s.
@item CHARSXP
@code{length}, @code{truelength} followed by a block of bytes (allowing
for the @code{nul} terminator).
@item LGLSXP
@itemx INTSXP
@code{length}, @code{truelength} followed by a block of C @code{int}s
(which are 32 bits on all @R{} platforms).
@item REALSXP
@code{length}, @code{truelength} followed by a block of C @code{double}s.
@item CPLXSXP
@code{length}, @code{truelength} followed by a block of C99
@code{double complex}s, or equivalent structures.
@item STRSXP
@code{length}, @code{truelength} followed by a block of pointers
(@code{SEXP}s pointing to @code{CHARSXP}s).
@item DOTSXP
A special type of @code{LISTSXP} for the value bound to a @code{...}
symbol: a pairlist of promises.
@item ANYSXP
This is used as a place holder for any type: there are no actual objects
of this type.
@item VECSXP
@itemx EXPRSXP
@code{length}, @code{truelength} followed by a block of pointers. These
are internally identical (and identical to @code{STRSXP}) but differ in
the interpretations placed on the elements.
@item BCODESXP
For the future byte-code compiler.
@item EXTPTRSXP
Has three pointers, to the pointer, the protection value (an @R{} object
which if alive protects this object) and a tag (a @code{SYMSXP}?).
@item WEAKREFSXP
A @code{WEAKREFSXP} is a special @code{VECSXP} of length 4, with
elements @samp{key}, @samp{value}, @samp{finalizer} and @samp{next}.
The @samp{key} is @code{NULL}, an environment or an external pointer,
and the @samp{finalizer} is a function or @code{NULL}.
@item RAWSXP
@code{length}, @code{truelength} followed by a block of bytes.
@item S4SXP
Two unused pointers and a tag.
@end table
@node Allocation classes, , The 'data', SEXPs
@subsection Allocation classes
@cindex allocation classes
As we have seen, the field @code{gccls} in the header is three bits to
label up to 8 classes of nodes. Non-vector nodes are of class 0, and
`small' vector nodes are of classes 1 to 6, with `large' vector nodes
being of class 7. The `small' vector nodes are able to store vector
data of up to 8, 16, 32, 48, 64 and 128 bytes: larger vectors are
@code{malloc}-ed individually whereas the `small' nodes are allocated
from pages of about 2000 bytes.
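The mapping from data size to node class can be sketched as a table lookup. The function name is hypothetical, the size limits are taken from the paragraph above, and the treatment of zero-length data (class 1 here) is an assumption of this sketch rather than a statement about the allocator.

```c
#include <assert.h>
#include <stddef.h>

/* Return the class (1 to 6) of the smallest `small' vector node able
   to hold the given number of data bytes, or 7 for a `large' vector. */
int node_class_for_bytes(size_t bytes)
{
    static const size_t limits[] = { 8, 16, 32, 48, 64, 128 };
    for (int i = 0; i < 6; i++)
        if (bytes <= limits[i])
            return i + 1;
    return 7;
}
```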
@node Environments and variable lookup, Attributes, SEXPs, R Internal Structures
@section Environments and variable lookup
@cindex environment
@cindex variable lookup
What users think of as `variables' are symbols which are bound to
objects in `environments'. The word `environment' is used ambiguously
in @R{} to mean @emph{either} the frame of an @code{ENVSXP} (a pairlist
of symbol-value pairs) @emph{or} an @code{ENVSXP}, a frame plus an
enclosure.
@cindex user databases
There are additional places that `variables' can be looked up, called
`user databases' in comments in the code. These seem undocumented in
the @R{} sources, but apparently refer to the @pkg{RObjectTable} package
at @uref{http://www.omegahat.org/RObjectTables/}.
@cindex base environment
@cindex environment, base
The base environment is special. There is an @code{ENVSXP} environment
with enclosure the empty environment @code{R_EmptyEnv}, but the frame of
that environment is not used. Rather its bindings are part of the
global symbol table, being those symbols in the global symbol table
whose values are not @code{R_UnboundValue}. When @R{} is started the
internal functions are installed (by C code) in the symbol table, with
primitive functions having values and @code{.Internal} functions having
what would be their values in the field accessed by the @code{INTERNAL}
macro. Then @code{.Platform} and @code{.Machine} are computed and the
base package is loaded into the base environment followed by the system
profile.
The frames of environments (and the symbol table) are normally hashed
for faster access (including insertion and deletion).
By default @R{} maintains a (hashed) global cache of `variables' (that
is symbols and their bindings) which have been found, and this refers
only to environments which have been marked to participate, which
consists of the global environment (aka the user workspace), the base
environment plus environments@footnote{Remember that attaching a list or
a saved image actually creates and populates an environment and attaches
that.} which have been @code{attach}ed. When an environment is either
@code{attach}ed or @code{detach}ed, the names of its symbols are flushed
from the cache. The cache is used whenever searching for variables from
the global environment (possibly as part of a recursive search).
@menu
* Search paths::
* Name spaces::
@end menu
@node Search paths, Name spaces, Environments and variable lookup, Environments and variable lookup
@subsection Search paths
@cindex search path
@Sl{} has the notion of a `search path': the lookup for a `variable'
leads (possibly through a series of frames) to the `session frame', the
`working directory' and then along the search path. The search path is
a series of databases (as returned by @code{search()}) which contain the
system functions (but not necessarily at the end of the path, as by
default the equivalent of packages are added at the end).
@R{} has a variant on the @Sl{} model. There is a search path (also
returned by @code{search()}) which consists of the global environment
(aka user workspace) followed by environments which have been attached
and finally the base environment. Note that unlike @Sl{} it is not
possible to attach environments before the workspace nor after the base
environment.
However, the notion of variable lookup is more general in @R{}, hence
the plural in the title of this subsection. Since environments have
enclosures, from any environment there is a search path found by looking
in the frame, then the frame of its enclosure and so on. Since loops
are not allowed, this process will eventually terminate: until @R{}
2.2.0 it always terminated at the base environment, but nowadays it can
terminate at either the base environment or the empty environment. (It
can be conceptually simpler to think of the search always terminating at
the empty environment, but with an optimization to stop at the base
environment.) So the `search path' describes the chain of environments
which is taken once the search reaches the global environment.
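Lookup along a chain of enclosures can be sketched with a toy environment type. The types and the function @code{env_lookup} are hypothetical: frames are plain linked lists rather than hashed, values are integers, and a @code{NULL} enclosure plays the role of the empty environment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One symbol-value pair in a frame (toy version of a tagged pairlist). */
typedef struct binding {
    const char *name;
    int value;
    struct binding *next;
} binding;

/* A frame plus an enclosure; enclosure == NULL models the empty
   environment, where every search terminates. */
typedef struct env {
    binding *frame;
    struct env *enclosure;
} env;

/* Search the frame of e, then the frame of its enclosure, and so on.
   Returns a pointer to the first value found, or NULL if unbound. */
const int *env_lookup(const env *e, const char *name)
{
    for (; e != NULL; e = e->enclosure)
        for (const binding *b = e->frame; b != NULL; b = b->next)
            if (strcmp(b->name, name) == 0)
                return &b->value;
    return NULL;
}
```

Because the first match wins, a binding in an inner frame masks one of the same name further along the chain.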
@node Name spaces, , Search paths, Environments and variable lookup
@subsection Name spaces
@cindex name space
Name spaces are environments associated with packages (and once again
the base package is special and will be considered separately). A
package @code{@var{pkg}} with a name space defines two environments
@code{namespace:@var{pkg}} and @code{package:@var{pkg}}: it is
@code{package:@var{pkg}} that can be @code{attach}ed and form part of
the search path.
The objects defined by the @R{} code in the package are symbols with
bindings in the @code{namespace:@var{pkg}} environment. The
@code{package:@var{pkg}} environment is populated by selected symbols
from the @code{namespace:@var{pkg}} environment (the exports). The
enclosure of this environment is an environment populated with the
explicit imports from other name spaces, and the enclosure of
@emph{that} environment is the base name space. (So the illusion of the
imports being in the name space environment is created via the
environment tree.) The enclosure of the base name space is the global
environment, so the search from a package name space goes via the
(explicit and implicit) imports to the standard `search path'.
@cindex base name space
@cindex name space, base
@findex R_BaseNamespace
The base name space environment @code{R_BaseNamespace} is another
@code{ENVSXP} that is special-cased. It is effectively the same thing
as the base environment @code{R_BaseEnv} @emph{except} that its
enclosure is the global environment rather than the empty environment:
the internal code diverts lookups in its frame to the global symbol
table.
@node Attributes, Contexts, Environments and variable lookup, R Internal Structures
@section Attributes
@cindex attributes
@findex ATTRIB
@findex SET_ATTRIB
@findex DUPLICATE_ATTRIB
As we have seen, every @code{SEXPREC} has a pointer to the attributes of
the node (default @code{R_NilValue}). The attributes can be accessed/set
by the macros/functions @code{ATTRIB} and @code{SET_ATTRIB}, but such
direct access is normally@footnote{An exception is the internal code for
@code{terms.formula} which directly manipulates the attributes.} only
used to check if the attributes are @code{NULL} or to reset them.
Otherwise access goes through the functions @code{getAttrib} and
@code{setAttrib} which impose restrictions on the attributes. One thing
to watch is that if you copy attributes from one object to another you
may (un)set the @code{"class"} attribute and so need to copy the object
and S4 bits as well. There is a macro/function
@code{DUPLICATE_ATTRIB} to automate this.
The code assumes that the attributes of a node are either
@code{R_NilValue} or a pairlist of non-zero length (and this is checked
by @code{SET_ATTRIB}). The attributes are named (via tags on the
pairlist). The replacement function @code{attributes<-} ensures that
@code{"dim"} precedes @code{"dimnames"} in the pairlist. Attribute
@code{"dim"} is one of several that are treated specially: the values are
checked, and any @code{"names"} and @code{"dimnames"} attributes are
removed. Similarly, you cannot set @code{"dimnames"} without having set
@code{"dim"}, and the value assigned must be a list of the correct
length and with elements of the correct lengths (and all zero-length
elements are replaced by @code{NULL}).
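The length checks on @code{"dimnames"} can be sketched as follows. The function name and its interface are hypothetical. The sketch reads `correct lengths' as meaning that each element either has zero length (and so would become @code{NULL}) or has exactly as many entries as the corresponding extent in @code{"dim"}; that reading is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* dim holds the ndim extents; name_lengths holds the length of each
   element of the proposed dimnames list (0 for an element that would
   be replaced by NULL).  Returns 1 if the lengths are acceptable. */
int dimnames_lengths_ok(const int *dim, size_t ndim,
                        const size_t *name_lengths, size_t n_names)
{
    if (n_names != ndim)
        return 0;                     /* list has the wrong length */
    for (size_t i = 0; i < ndim; i++)
        if (name_lengths[i] != 0 && name_lengths[i] != (size_t) dim[i])
            return 0;                 /* element has the wrong length */
    return 1;
}
```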
The other attributes which are given special treatment are
@code{"names"}, @code{"class"}, @code{"tsp"}, @code{"comment"} and
@code{"row.names"}. For pairlist-like objects the names are not stored
as an attribute but (as symbols) as the tags: however the @R{} interface
makes them look like conventional attributes, and for one-dimensional
arrays they are stored as the first element of the @code{"dimnames"}
attribute. The C code ensures that the @code{"tsp"} attribute is an
a @code{REALSXP}, the frequency is positive and the implied length agrees
with the number of rows of the object being assigned to. Classes and
comments are restricted to character vectors, and assigning a
zero-length comment or class removes the attribute. Setting or removing
a @code{"class"} attribute sets the object bit appropriately. Integer
row names are converted to and from the internal compact representation.
@cindex copying semantics
Care needs to be taken when adding attributes to objects of the types
with non-standard copying semantics. There is only one object of type
@code{NILSXP}, @code{R_NilValue}, and that should never have attributes
(and this is enforced in @code{installAttrib}). For environments,
external pointers and weak references, the attributes should be relevant
to all uses of the object: it is for example reasonable to have a name
for an environment, and also a @code{"path"} attribute for those
environments populated from @R{} code in a package.
@cindex attributes, preserving
@cindex preserving attributes
When should attributes be preserved under operations on an object?
Becker, Chambers & Wilks (1988, pp. 144--6) give some guidance. Scalar
functions (those which operate element-by-element on a vector and whose
output is similar to the input) should preserve attributes (except
perhaps class, and if they do preserve class they need to preserve the
@code{OBJECT} and S4 bits). Binary operations normally call
@findex copyMostAttributes
@code{copyMostAttributes} to copy most attributes from the longer
argument (and if they are of the same length from both, preferring the
values on the first). Here `most' means all except the @code{names},
@code{dim} and @code{dimnames} which are set appropriately by the code
for the operator.
Subsetting (other than by an empty index) generally drops all attributes
except @code{names}, @code{dim} and @code{dimnames} which are reset as
appropriate. On the other hand, subassignment generally preserves such
attributes even if the length is changed. Coercion drops all
attributes. For example:
@example
> x <- structure(1:8, names=letters[1:8], comm="a comment")
> x[]
a b c d e f g h
1 2 3 4 5 6 7 8
attr(,"comm")
[1] "a comment"
> x[1:3]
a b c
1 2 3
> x[3] <- 3
> x
a b c d e f g h
1 2 3 4 5 6 7 8
attr(,"comm")
[1] "a comment"
> x[9] <- 9
> x
a b c d e f g h
1 2 3 4 5 6 7 8 9
attr(,"comm")
[1] "a comment"
@end example
@node Contexts, Argument evaluation, Attributes, R Internal Structures
@section Contexts
@cindex context
@emph{Contexts} are the internal mechanism used to keep track of where a
computation has got to (and from where), so that control-flow constructs
can work and reasonable information can be produced on error conditions
(such as @emph{via} traceback) and otherwise (the @code{sys.@var{xxx}}
functions).
Execution contexts are a stack of C @code{structs}:
@example
typedef struct RCNTXT @{
    struct RCNTXT *nextcontext; /* @r{The next context up the chain} */
    int callflag;               /* @r{The context `type'} */
    JMP_BUF cjmpbuf;            /* @r{C stack and register information} */
    int cstacktop;              /* @r{Top of the pointer protection stack} */
    int evaldepth;              /* @r{Evaluation depth at inception} */
    SEXP promargs;              /* @r{Promises supplied to closure} */
    SEXP callfun;               /* @r{The closure called} */
    SEXP sysparent;             /* @r{Environment the closure was called from} */
    SEXP call;                  /* @r{The call that effected this context} */
    SEXP cloenv;                /* @r{The environment} */
    SEXP conexit;               /* @r{Interpreted @code{on.exit} code} */
    void (*cend)(void *);       /* @r{C @code{on.exit} thunk} */
    void *cenddata;             /* @r{Data for C @code{on.exit} thunk} */
    char *vmax;                 /* @r{Top of the @code{R_alloc} stack} */
    int intsusp;                /* @r{Interrupts are suspended} */
    SEXP handlerstack;          /* @r{Condition handler stack} */
    SEXP restartstack;          /* @r{Stack of available restarts} */
    struct RPRSTACK *prstack;   /* @r{Stack of pending promises} */
@} RCNTXT, *context;
@end example
@noindent
plus additional fields for the future byte-code compiler. The `types'
are from
@example
enum @{
    CTXT_TOPLEVEL = 0,  /* @r{toplevel context} */
    CTXT_NEXT     = 1,  /* @r{target for @code{next}} */
    CTXT_BREAK    = 2,  /* @r{target for @code{break}} */
    CTXT_LOOP     = 3,  /* @r{@code{break} or @code{next} target} */
    CTXT_FUNCTION = 4,  /* @r{function closure} */
    CTXT_CCODE    = 8,  /* @r{other functions that need error cleanup} */
    CTXT_RETURN   = 12, /* @r{@code{return()} from a closure} */
    CTXT_BROWSER  = 16, /* @r{return target on exit from browser} */
    CTXT_GENERIC  = 20, /* @r{rather, running an S3 method} */
    CTXT_RESTART  = 32, /* @r{a call to @code{restart} was made from a closure} */
    CTXT_BUILTIN  = 64  /* @r{builtin internal function} */
@};
@end example
@noindent
where the @code{CTXT_FUNCTION} bit is on wherever function closures are
involved.
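Because the values are bit patterns, a context type can be tested with a
mask. The following is a minimal sketch of such a test, not code taken
from the @R{} sources: the enumeration values are copied from the table
above, and the helper @code{involves_closure} is a hypothetical name
introduced only for illustration.

```c
#include <stdbool.h>

/* Values copied from the enumeration shown above. */
enum {
    CTXT_TOPLEVEL = 0,
    CTXT_NEXT     = 1,
    CTXT_BREAK    = 2,
    CTXT_LOOP     = 3,
    CTXT_FUNCTION = 4,
    CTXT_CCODE    = 8,
    CTXT_RETURN   = 12,
    CTXT_BROWSER  = 16,
    CTXT_GENERIC  = 20,
    CTXT_RESTART  = 32,
    CTXT_BUILTIN  = 64
};

/* Hypothetical helper: true if the CTXT_FUNCTION bit (value 4) is set. */
static bool involves_closure(int callflag)
{
    return (callflag & CTXT_FUNCTION) != 0;
}
```

With these values, @code{CTXT_RETURN} (12 = 8 + 4) and
@code{CTXT_GENERIC} (20 = 16 + 4) both pass the test, whereas
@code{CTXT_CCODE}, @code{CTXT_LOOP} and @code{CTXT_BUILTIN} do not.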
Contexts are created by a call to @code{begincontext} and ended by a
call to @code{endcontext}: code can search up the stack for a
particular type of context via @code{findcontext} (and jump there) or
jump to a specific context via @code{R_JumpToContext}.
@code{R_ToplevelContext} is the `idle' state (normally the command
prompt), and @code{R_GlobalContext} is the top of the stack.
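The stack discipline can be illustrated with a much reduced model. The
sketch below is an assumption-laden toy, not the implementation in the
@R{} sources: the type @code{ToyContext} keeps only three of the fields
listed above, all @code{toy_} names are hypothetical, and a failed
search simply returns instead of reporting an error.

```c
#include <setjmp.h>
#include <stddef.h>

enum { TOY_TOPLEVEL = 0, TOY_NEXT = 1, TOY_BREAK = 2, TOY_LOOP = 3 };

/* A cut-down context: link, type and jump buffer only. */
typedef struct ToyContext {
    struct ToyContext *nextcontext;
    int callflag;
    jmp_buf cjmpbuf;
} ToyContext;

/* Zero-initialised: no parent, type TOY_TOPLEVEL. */
static ToyContext toy_toplevel;
/* Top of the context stack. */
static ToyContext *toy_global = &toy_toplevel;

/* Push a context of the given type onto the stack. */
static void toy_begincontext(ToyContext *c, int flags)
{
    c->callflag = flags;
    c->nextcontext = toy_global;
    toy_global = c;
}

/* Pop the context, restoring its parent as the top of the stack. */
static void toy_endcontext(ToyContext *c)
{
    toy_global = c->nextcontext;
}

/* Search up the stack for a context matching mask and jump to it. */
static void toy_findcontext(int mask)
{
    for (ToyContext *c = toy_global;
         c != NULL && c != &toy_toplevel;
         c = c->nextcontext) {
        if (c->callflag & mask) {
            toy_global = c;
            longjmp(c->cjmpbuf, 1);
        }
    }
    /* No match: this toy just returns to the caller. */
}

/* Code running "inside" the loop that requests a break. */
static void toy_inner(void)
{
    toy_findcontext(TOY_BREAK);
}

/* Run a loop that is left via a non-local jump on the third pass. */
static int toy_run_loop(void)
{
    ToyContext loop;
    volatile int iterations = 0;  /* modified between setjmp and longjmp */

    toy_begincontext(&loop, TOY_LOOP);
    if (setjmp(loop.cjmpbuf) == 0) {
        for (;;) {
            iterations++;
            if (iterations == 3)
                toy_inner();      /* does not return: jumps to setjmp */
        }
    }
    toy_endcontext(&loop);
    return iterations;
}
```

Note that @code{setjmp} is called in the function that owns the context,
not inside @code{toy_begincontext}, since a jump buffer is only valid
while the frame that filled it is still active.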
Note that whilst all calls to closures set a context, those to special
internal functions never do, and those to builtin internal functions
have done so only recently (and prior to that only when profiling).
@findex UseMethod
@cindex method dispatch
Dispatching from an S3 generic (via @code{UseMethod} or its internal
equivalent) or calling @code{NextMethod} sets the context type to
@code{CTXT_GENERIC}. This is used to set the @code{sysparent} of the
method call to that of the generic, so the method appears to have
been called in place of the generic rather than from the generic.
The @R{} functions @code{sys.frame} and @code{sys.call} work by counting calls to
closures (type @code{CTXT_FUNCTION}) from either end of the context
stack.
Note that the @code{sysparent} element of the structure is not the same
thing as @code{sys.parent()}. Element @code{sysparent} is primarily
used in managing changes of the function being evaluated, i.e. by
@code{Recall} and method dispatch.
@code{CTXT_CCODE} contexts are currently used in @code{cat()},
@code{load()}, @code{scan()} and @code{write.table()} (to close the
connection on error), by @code{PROTECT}, serialization (to recover from
errors, e.g.@: free buffers) and within the error handling code (to
raise the C stack limit and reset some variables).
@node Argument evaluation, Autoprinting, Contexts, R Internal Structures
@section Argument evaluation
@cindex argument evaluation
As we have seen, functions in @R{} come in three types, closures
(@code{SEXPTYPE} @code{CLOSXP}), specials (@code{SPECIALSXP}) and
builtins (@code{BUILTINSXP}). In this section we consider when (and if)
the actual arguments of function calls are evaluated. The rules are
different for the internal (special/builtin) and R-level functions
(closures).
For a call to a closure, the actual and formal arguments are matched and
a matched call (another @code{LANGSXP}) is constructed. This process
first replaces the actual argument list by a list of promises to the
values supplied. It then constructs a new environment which contains
the names of the formal parameters matched to actual or default values:
all the matched values are promises, the defaults as promises to be
evaluated in the environment just created. That environment is then
used for the evaluation of the body of the function, and promises will
be forced (and hence actual or default arguments evaluated) when they
are encountered.
@findex NAMED
(Evaluating a promise sets @code{NAMED = 2} on its value, so if the
argument was a symbol its binding is regarded as having multiple
references during the evaluation of the closure call.)
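The effect of forcing promises on demand can be modelled in a few lines.
This is only a sketch under simplifying assumptions (values are plain
@code{int}s, and the expression is a C function pointer); the type
@code{ToyPromise} and every @code{toy_} function are hypothetical names,
not part of the @R{} sources.

```c
#include <stdbool.h>
#include <stddef.h>

/* A promise: code to run, where to run it, and a cached result. */
typedef struct ToyPromise {
    int (*code)(void *env);  /* the unevaluated argument expression */
    void *env;               /* environment in which to evaluate it */
    bool forced;             /* has the value been computed yet? */
    int value;               /* cached value once forced */
} ToyPromise;

/* Evaluate the promise at most once and return its value. */
static int toy_force(ToyPromise *p)
{
    if (!p->forced) {
        p->value = p->code(p->env);
        p->forced = true;
    }
    return p->value;
}

/* Counts how often the argument expression is actually evaluated. */
static int toy_evaluations = 0;

static int toy_expensive_arg(void *env)
{
    (void) env;
    toy_evaluations++;
    return 42;
}

/* A "closure body" that touches its argument only if use_x is true. */
static int toy_closure(ToyPromise *x, bool use_x)
{
    return use_x ? toy_force(x) + toy_force(x) : 0;
}

/* Number of evaluations when the argument is never used. */
static int toy_evals_when_unused(void)
{
    ToyPromise p = { toy_expensive_arg, NULL, false, 0 };
    toy_evaluations = 0;
    (void) toy_closure(&p, false);
    return toy_evaluations;
}

/* Number of evaluations when the argument is used twice. */
static int toy_evals_when_used_twice(void)
{
    ToyPromise p = { toy_expensive_arg, NULL, false, 0 };
    toy_evaluations = 0;
    (void) toy_closure(&p, true);
    return toy_evaluations;
}
```

An argument that is never used is never evaluated, and one that is used
twice is evaluated once, because the second use finds the cached value.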
If the closure is an S3 generic (that is, contains a call to
@code{UseMethod}) the evaluation process is the same until the
@code{UseMethod} call is encountered. At that point the argument on
which to do dispatch (normally the first) will be evaluated if it has
not been already. If a method has been found which is a closure, a new
evaluation environment is created for it containing the matched
arguments of the method plus any new variables defined so far during the
evaluation of the body of the generic. (Note that this means changes to
the values of the formal arguments in the body of the generic are
discarded when calling the method, but @emph{actual} argument promises
which have been forced retain the values found when they were forced.
On the other hand, missing arguments have values which are promises to
use the default supplied by the method and not the generic.) If the
method found is a special or builtin it is called with the matched
argument list of promises (possibly already forced) used for the generic.
@cindex builtin function
@cindex special function
@cindex primitive function
@cindex .Internal function
The essential difference@footnote{There is currently one other
difference: when profiling, builtin functions are counted as function
calls but specials are not.} between special and builtin functions is
that the arguments of specials are not evaluated before the C code is
called, and those of builtins are. In each case positional matching of
arguments is used. Note that being a special/builtin is separate from
being primitive or @code{.Internal}: @code{function} is a special
primitive, @code{+} is a builtin primitive, @code{switch} is a special
@code{.Internal} and @code{grep} is a builtin @code{.Internal}.
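The distinction can be sketched with a toy evaluator. Everything here is
an illustrative assumption: arguments are modelled as C function
pointers of the hypothetical type @code{ToyExpr}, the @code{toy_}
functions are not part of the @R{} sources, and at most
@code{TOY_MAXARGS} arguments are handled.

```c
#include <stddef.h>

/* An unevaluated argument: calling it evaluates the expression. */
typedef int (*ToyExpr)(void);

#define TOY_MAXARGS 8

/* Counts how many argument expressions have been evaluated. */
static int toy_evals = 0;

static int toy_arg_one(void) { toy_evals++; return 1; }
static int toy_arg_two(void) { toy_evals++; return 2; }

/* Builtin style: evaluate every argument, then call the C code. */
static int toy_call_builtin(int (*cfun)(const int *, size_t),
                            const ToyExpr *args, size_t n)
{
    int vals[TOY_MAXARGS];
    if (n > TOY_MAXARGS)
        n = TOY_MAXARGS;
    for (size_t i = 0; i < n; i++)
        vals[i] = args[i]();
    return cfun(vals, n);
}

/* Special style: pass the unevaluated arguments straight through. */
static int toy_call_special(int (*cfun)(const ToyExpr *, size_t),
                            const ToyExpr *args, size_t n)
{
    return cfun(args, n);
}

/* C code of a builtin: receives values. */
static int toy_do_sum(const int *vals, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += vals[i];
    return sum;
}

/* C code of a special: chooses to evaluate only its first argument. */
static int toy_do_first(const ToyExpr *args, size_t n)
{
    return n > 0 ? args[0]() : 0;
}

/* Evaluations performed by a builtin call with two arguments. */
static int toy_evals_for_builtin(void)
{
    const ToyExpr args[] = { toy_arg_one, toy_arg_two };
    toy_evals = 0;
    (void) toy_call_builtin(toy_do_sum, args, 2);
    return toy_evals;
}

/* Evaluations performed by a special call with two arguments. */
static int toy_evals_for_special(void)
{
    const ToyExpr args[] = { toy_arg_one, toy_arg_two };
    toy_evals = 0;
    (void) toy_call_special(toy_do_first, args, 2);
    return toy_evals;
}
```

In this model the builtin call evaluates both arguments before its C
code runs, while the special evaluates only the one it asks for.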
@cindex generic, internal
@findex DispatchOrEval
Many of the internal functions are internal generics, which for specials
means that they do not evaluate their arguments on call, but the C code
starts with a call to @code{DispatchOrEval}. The latter evaluates the
first argument, and looks for a method based on its class. (If S4
dispatch is on, S4 methods are looked for first, even for S3 classes.)
If it finds a method, it dispatches to that method with a call based on
promises to evaluate the remaining arguments. If no method is found,
the remaining arguments are evaluated before return to the internal
generic.
@cindex generic, group
@findex DispatchGeneric
The other way that internal functions can be generic is to be group
generic. All such functions are builtins (so immediately evaluate all
their arguments), and contain a call to the C function
@code{DispatchGeneric}. There are some peculiarities over the number of
arguments for the @code{"Math"} group generic: some members allow only
one argument, some take two (with a default for the second), and
@code{trunc} allows one or more, although its default method accepts
only one.
@menu
* Missingness::
* Dot-dot-dot arguments::
@end menu
@node Missingness, Dot-dot-dot arguments, Argument evaluation, Argument evaluation
@subsection Missingness
@cindex missingness
Actual arguments to (non-internal) @R{} functions can be fewer than are
required to match the formal arguments of the function. Having
unmatched formal arguments will not matter if the argument is never used
(by lazy evaluation), but when the argument is evaluated, either its
default value is evaluated (within the evaluation environment of the
function) or an error is thrown with a message along the lines of
@example
argument "foobar" is missing, with no default
@end example
@findex MISSING
@findex R_MissingArg
Internally missingness is handled by two mechanisms. The object
@code{R_MissingArg} is used to indicate that a formal argument has no
(default) value. When matching the actual arguments to the formal
arguments, a new argument list is constructed from the formals all of
whose values are @code{R_MissingArg} with the first @code{MISSING} bit
set. Then whenever a formal argument is matched to an actual argument,
the corresponding member of the new argument list has its value set to
that of the matched actual argument, and if that is not
@code{R_MissingArg} the missing bit is unset.
This new argument list is used to form the evaluation frame for the
function, and if named arguments are subsequently given a new value
(before they are evaluated) the missing bit is cleared.
Missingness of arguments can be interrogated via the @code{missing()}
function. An argument is clearly missing if its missing bit is set or
if the value is @code{R_MissingArg}. However, missingness can be passed
on from function to function, for using a formal argument as an actual
argument in a function call does not count as evaluation. So
@code{missing()} has to examine the value (a promise) of a
not-yet-evaluated formal argument to see if it might be missing, which
might involve investigating a promise and so on @dots{}.
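The chain of checks can be sketched as follows. This is a simplified
model under stated assumptions, not the code used by @code{missing()}:
the type @code{ToyArg} and the @code{toy_} functions are hypothetical,
and a promise is reduced to a pointer to the formal argument it names.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyArg ToyArg;

struct ToyArg {
    bool missing_bit;          /* the MISSING bit of the binding */
    bool is_missing_arg;       /* the value is the R_MissingArg marker */
    const ToyArg *promise_to;  /* unforced promise naming another formal,
                                  or NULL if there is none */
};

/* Follow promises from formal to formal until a verdict is reached. */
static bool toy_missing(const ToyArg *a)
{
    while (a != NULL) {
        if (a->missing_bit || a->is_missing_arg)
            return true;
        a = a->promise_to;
    }
    return false;
}

/* Models f <- function(x) g(x); g <- function(y) missing(y); f() */
static bool toy_demo_passed_on(void)
{
    ToyArg x = { true, true, NULL };   /* x was not supplied to f */
    ToyArg y = { false, false, &x };   /* y is a promise to x */
    return toy_missing(&y);
}

/* Models the same functions called as f(1). */
static bool toy_demo_supplied(void)
{
    ToyArg x = { false, false, NULL }; /* x was supplied to f */
    ToyArg y = { false, false, &x };   /* y is a promise to x */
    return toy_missing(&y);
}
```

In the first case @code{y} is reported as missing even though an actual
argument was written in the call to @code{g}, because that argument is
itself a missing formal of the caller.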
@node Dot-dot-dot arguments, , Missingness, Argument evaluation
@subsection Dot-dot-dot arguments
@cindex ... argument
Dot-dot-dot arguments are convenient when writing functions, but
complicate the internal code for argument evaluation.
The formals of a function with a @code{...} argument represent that as a
single argument like any other argument, with tag the symbol
@code{R_DotsSymbol}. When the actual arguments are matched to the
formals, the value of the @code{...} argument is of @code{SEXPTYPE}
@code{DOTSXP}, a pairlist of promises (as used for matched arguments)
but distinguished by the @code{SEXPTYPE}.
Recall that the evaluation frame for a function initially contains the
@code{@var{name}=@var{value}} pairs from the matched call, and hence
this will be true for @code{...} as well. The value of @code{...} is a
(special) pairlist whose elements are referred to by the special symbols
@code{..1}, @code{..2}, @dots{} which have the @code{DDVAL} bit set:
when one of these is encountered it is looked up (via @code{ddfindVar})
in the value of the @code{...} symbol in the evaluation frame.
Values of arguments matched to a @code{...} argument can be missing.
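The lookup of @code{..1}, @code{..2}, @dots{} amounts to walking the
pairlist held as the value of @code{...}. The following is a minimal
sketch of that walk, assuming a toy list of @code{int} values: the type
@code{ToyNode} and the @code{toy_} functions are hypothetical, and the
sketch returns @code{NULL} for an out-of-range index where @R{} would
be expected to signal an error.

```c
#include <stddef.h>

/* One cell of a toy pairlist standing in for the value of `...`. */
typedef struct ToyNode {
    int value;
    struct ToyNode *next;
} ToyNode;

/* Return the n-th (1-based) element of dots, or NULL if there is none. */
static const ToyNode *toy_ddfind(const ToyNode *dots, int n)
{
    if (n < 1)
        return NULL;
    for (; dots != NULL; dots = dots->next)
        if (--n == 0)
            return dots;
    return NULL;
}

/* Look up ..n in a three-element list; -1 stands for "not found". */
static int toy_demo_dd(int n)
{
    ToyNode third  = { 30, NULL };
    ToyNode second = { 20, &third };
    ToyNode first  = { 10, &second };
    const ToyNode *hit = toy_ddfind(&first, n);
    return hit != NULL ? hit->value : -1;
}
```

The cost of the lookup grows with the index, since the list has to be
traversed from its head each time.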
@node Autoprinting, The write barrier, Argument evaluation, R Internal Structures
@section Autoprinting
@cindex autoprinting
@findex R_Visible
Whether the returned value of a top-level @R{} expression is printed is
controlled by the global boolean variable @code{R_Visible}. This is set
(to true or false) on entry to all primitive and internal functions
based on the @code{eval} column of the table in @file{names.c}: the
appropriate setting can be extracted by the macro @code{PRIMPRINT}.
@findex PRIMPRINT
@findex invisible
The @R{} primitive function @code{invisible} makes use of this
mechanism: it just sets @code{R_Visible = FALSE} before entry and