-
Notifications
You must be signed in to change notification settings - Fork 0
/
exp_meth.tex
executable file
·399 lines (327 loc) · 15.1 KB
/
exp_meth.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
\chapter{Simulation Methodology}\label{app:sim_meth}
We now describe our process for conducting experiments, expanding the
outlines given in Chapters~\ref{chap:numdep},~\ref{chap:consistency},
and~\ref{chap:tempresult} demarcated into sections on evolving
relations, the consistency problem, and simulations on our temporal
logic, respectively.
\medskip
We precede this with a brief discussion of implementation issues.
\section{Notes on Implementation}\label{sec:imp_notes}
\subsection{Random Number Generation}
A number of algorithms in this thesis use randomised techniques. To
circumvent any potential problems with non-random behaviour we used
a linear congruent procedure taken
from the algorithm provided by \cite{pm88} which
avoids cycles by incorporating multiplier and modulus
having 534 million full period generators.
\smallskip
The code was implemented in GNU C++ version 2.7.2 running
on Sun Solaris 2.5.1. C++ with
the embedded CORAL deductive
database interface \cite{rss92} was also used for evolving relations
in Chapter~\ref{chap:numdep}. For efficiency reasons the code for procedures
described in Chapters~\ref{chap:consistency} and~\ref{chap:tempresult}
was implemented in C++ alone.
\subsection{Table Indexing}
\subsection{Code Design}
\section{Documentation of Simulations: Evolving
Relations}\label{sec:sim_er}
\subsection{
\section{Documentation of Simulations: The Consistency
Problem}\label{sec:sim_conprob}
\section{Documentation of Simulations: Numerical Dependency Temporal
Logic}\label{sec:sim_ndltl}
We now provide additional results for the consistency problem. In
Table~\ref{tbl:fd_set_used} we detail the FD sets referred to in the
following figures. All of the mean values referred to for the average
number of worlds required were obtained within batchs, each of 500
runs. The results in this appendix reinforce those presented in
Chapter~\ref{chap:consistency}. All relations used, with respect to the
FD sets in Table~\ref{tbl:fd_set_used}, contain exactly those
attributes within the respective FD set as the schema and no more.
{
\line
\begin{table}[ht]
\footnotesize{
\begin{center}
\begin{tabular}{|c||c||c||c||c||c||c||c|}
\hline
{\bf Set 1} & {\bf Set 2} & {\bf Set 4} & {\bf Set 6} & {\bf Set 7} & { \bf Set 11} & {\bf Set 15} & {\bf Set 17} \\ \hline \hline
$A \to B$ & $A \to BC$ & $C \to AB$ & $D \to ABC$ & $AB \to D$ & $A \to B$ &$A \to BCD$ & $A \to B$ \\
$D \to C$ & $D \to C$ & $B \to AC$ & $AB \to D$ & $D\to ABC$ & $D \to C$ & & $B \to C$ \\
& & & $A \to B$ & & $BC \to A$ & & $C \to D$ \\
& & & $B \to A$ & & & & \\\hline
\end{tabular}
\end{center}
}
\caption{\label{tbl:fd_set_used} FD sets used in
Figures~\ref{graph:4.3w} to~\ref{graph:histo2}}
\end{table}
}
For an FD set X $\to$ Y, where $\mid$ Y $\mid > 1$, we split this into
a set of FDs such that for all A $\in$ Y we have X $\to$ A for
expression as NDs in simulations. This is justified given that from X
$\to^k Y$ we can infer, for all A $\in Y$, X $\to^k$ A.
\section{Average Number of Worlds Required}
We present examples showing the average number of worlds
required in batches by our chase and hill-climbing algorithm in
figures~\ref{graph:4.3w} to~\ref{graph:4.16w}. We can see immediately
that the average number of worlds required is very small. We
hypothesise that within our random relations it is relatively easy to
generate a definite world using algorithms~\ref{alg:gen}
and~\ref{alg:chase-gen}. The average number of worlds is reduced for
larger relations with respect to a fixed domain size. Investigation
has shown this to be due to ND sets being satisfied closer to the
domain size, whilst the presence of additional tuples, with respect to
a fixed domain size, increases the number of redundant values within
indefinite cells which the chase procedure can remove. This is
particularly true of relations with larger arity indefinite cells, see
figures~\ref{graph:4.2w}, and~\ref{graph:4.13w}. Figures~\ref
{graph:4.4w} and~\ref{graph:4.12w} show this is less likely when
relations have smaller arity indefinite cells.
\begin{figure}
\begin{minipage}{7cm}
\centerline{\scalebox{0.5}{\includegraphics{../../CPP/Chase/all_results/w15_d3_a2_c.eps}}}
\caption{\label{graph:4.3w} {Average Number of Worlds Required by
the chase and hill-climbing approach for FD set 15, domain sizes 3 -
9, maximum indefinite cell arity 2 }}
\end{minipage}
\hfill
\begin{minipage}{7cm}
\centerline{\scalebox{0.5}{\includegraphics{../../CPP/Chase/all_results/w15_d3_a4_c.eps}}}
\caption{\label{graph:4.2w} {Average Number of Worlds
Required for FD set 15, domain sizes 5 - 9, max indefinite arity 4 - 6}}
\end{minipage}
\end{figure}
We also note that the small number of average worlds is in sharp
contrast to the number of worlds required by the naive {\em generate and
test} algorithms to obtain similar results. We are vague as to an
exact relationship due to the varying nature of both the indefinite
relations and FD sets.
\begin{figure}
\begin{minipage}{7cm}
\centerline{\scalebox{0.5}{\includegraphics{../../CPP/Chase/all_results/w17_d3_a2_c.eps}}}
\caption{\label{graph:4.4w} {Average Number of Worlds
Required for FD set 17, domain sizes 3 - 9, max indefinite arity 2}}
\end{minipage}
\hfill
\begin{minipage}{7cm}
\centerline{\scalebox{0.5}{\includegraphics{../../CPP/Chase/all_results/w17_d3_a4_c.eps}}}
\caption{\label{graph:4.5w} {Average Number of Worlds Required for FD
set 17, domain sizes 5 - 9, indefinite cell arity 4 - 6}}
\end{minipage}
\end{figure}
\begin{figure}
\begin{minipage}{7cm}
\centerline{\scalebox{0.5}{\includegraphics{../../CPP/Chase/all_results/w5_d3_a4_c.eps}}}
\caption{\label{graph:4.10w} {Average Number of Worlds
Required for FD set 5, domain size 5 - 9, indefinite cell arity 4 - 6}}
\end{minipage}
\hfill
\begin{minipage}{7cm}
\centerline{\scalebox{0.5}{\includegraphics{../../CPP/Chase/all_results/w7_d3_a4_c.eps}}}
\caption{\label{graph:4.16w} {Average Number of Worlds
Required for FD set 7, domain size 5 - 9, indefinite cell arity 4 - 6}}
\end{minipage}
\end{figure}
\begin{figure}
\begin{minipage}{7cm}
\centerline{\scalebox{0.5}{\includegraphics{../../CPP/Chase/all_results/w6_d3_a2_c.eps}}}
\caption{\label{graph:4.12w} {Average Number of Worlds
Required for FD set 6, domain size 3 - 9, indefinite arity 2}}
\end{minipage}
\hfill
\begin{minipage}{7cm}
\centerline{\scalebox{0.5}{\includegraphics{../../CPP/Chase/all_results/w6_d3_a4_c.eps}}}
\caption{\label{graph:4.13w} {Average Number of Worlds required
for FD set 6, domain size 5 - 9, indefinite arity 4 - 6}}
\end{minipage}
\end{figure}
\section{Average Proximity to FD sets}
We now discuss the proximity of our results to that of an FD set. The
increasing proximity to an FD set as relation size increases is due to
the domain size remaining fixed. Normalisation of our measure for ND
sets, a prospect for future work, would remedy this.
\medskip
We draw the following conclusions concerning proximity:
\begin{itemize}
\item On average the naive and chase procedures produce very similar
results. We emphasise that our relations were created in a uniformly
random manner. As such in the real world it may be highly likely that
a relation with indefinite information may perform better with respect
to utilising the chase and hill-climbing approach. This speculation is
enforced by the encouraging discovery that the chase procedure produce
slightly better results when a relation is sparse in indefinite cells
in arity in either attributes on the left or right hand side of
dependencies, as detailed in figures~\ref{graph:5.6}
to~\ref{graph:5.5}.
\item Proximity to functional satisfaction increases, on average, with
an increase in the number of attributes being determined. We can see
these differences in figures~\ref{graph:6.1} and~\ref{graph:5.15}. The
difference in this case is slight but this was enforced by all
results. This is due to additional attributes being examined within
the hill-climbing process. The random nature of the relations
generated did not provide us with any pathological data
which might contradict this.
\item Whether the relation is in BCNF or non-BCNF did not, in the
data assessed, affect the results
\end{itemize}
\begin{figure}
\begin{minipage}{7cm}
\centerline{\scalebox{0.5}{\includegraphics{../../CPP/Chase/dot_S_8_12/f15_average.eps}}}
\caption{\label{graph:5.6} {Average Proximity to FD set 15, standard
and reduced
right hand side indefinite cell weighting}}
\end{minipage}
\hfill
\begin{minipage}{7cm}
\centerline{\scalebox{0.5}{\includegraphics{../../CPP/Chase/dot_S_8_12/f15_average2.eps}}}
\caption{\label{graph:5.7} {Average Proximity to FD set 15, standard
and reduced
left hand side indefinite cell weighting}}
\end{minipage}
\end{figure}
\section{Closest Proximity to FD sets}
In contrast with the average results we find that, generally, the
best result within a batch is obtained by our chase and hill-climbing
procedure. Figures~\ref{graph:5.4} and~\ref{graph:5.5}, as well
as~\ref{graph:6.1} and~\ref{graph:5.15}, serve to emphasise that
occasionally the naive procedure, at a cost of efficiency, can
outperform the chase and hill-climbing procedure.
\begin{figure}
\begin{minipage}{7cm}
\centerline{\scalebox{0.5}{\includegraphics{../../CPP/Chase/dot_S_8_12/closest.eps}}}
\caption{\label{graph:5.4} {Closest Proximity to FD set 15 for
standard and reduced right hand side weighting of indefinite cells}}
\end{minipage}
\hfill
\begin{minipage}{7cm}
\centerline{\scalebox{0.5}{\includegraphics{../../CPP/Chase/dot_S_8_12/closest2.eps}}}
\caption{\label{graph:5.5} {Closest Proximity to FD set 15 for
standard and reduced left hand side weighting of indefinite cells }}
\end{minipage}
\end{figure}
\begin{figure}
\begin{minipage}{7cm}
\centerline{\scalebox{0.5}{\includegraphics{../../CPP/Chase/dot_S_8_12/max2.eps}}}
\caption{\label{graph:5.13} {Average Proximity to FD set 7, domain
size 7, max indefinite cell arity 6}}
\end{minipage}
\hfill
\begin{minipage}{7cm}
\centerline{\scalebox{0.5}{\includegraphics{../../CPP/Chase/dot_S_8_12/av6_d5_a2.eps}}}
\caption{\label{graph:5.14} {Average Proximity to FD set 6, domain
5,7, indefinite arity 2}}
\end{minipage}
\end{figure}
\newpage
\begin{figure}
\begin{minipage}{7cm}
\centerline{\scalebox{0.5}{\includegraphics{../../CPP/Chase/dot_S_8_12/m15_d5_a4.eps}}}
\caption{\label{graph:6.1} {Closest Proximity to FD set 15, varying
domain sizes 5 - 9, chase and naive approaches, indefinite arity 4}}
\end{minipage}
\hfill
\begin{minipage}{7cm}
\centerline{\scalebox{0.5}{\includegraphics{../../CPP/Chase/dot_S_8_12/max6_d5_a2.eps}}}
\caption{\label{graph:5.15} {Closest Proximity to FD set 6, domain
size 5 - 9, indefinite arity 4}}
\end{minipage}
\end{figure}
\section{Jacknife and Bootstrap Comparisons}
Figures~\ref{graph:7.1}, ~\ref{graph:7.2} and~\ref{graph:7.3} present
examples of jackknife and bootstrap resampling used within our dynamic
algorithm~\ref{alg:blimit}. To ensure a fair comparison we conducted
these tests so that at each iteration each sample of possible worlds,
and therefore each sample of ND sets satisfied, was equivalent before
either bootstrap or jackknife resampling was performed. This accounts
for much of the similarity in each figure.
We draw the following conclusions:
\begin{itemize}
\item Jackknife and Bootstrap resampling will reach approximate
fixpoints, on average, at a similar number of possible
worlds. Figures~\ref{graph:7.1}, ~\ref{graph:7.2} and~\ref{graph:7.3}
are indicative of this.
\item The jackknife resampling technique is more computationally
intensive in such a dynamic setting. The size of the bootstrap
replication is fixed, say at 50 or 100, found to be useful in this
context. \cite{et86,et93} note that sizes about 200 produce no
additional information, in general. However, the jackknife procedure
creates $n$ replicates, each of size $n-1$, when the sample size is
$n$. Therefore we have to examine 299 jackknife resamples at sample
size 300 as opposed to 100 for the bootstrap. The results show this to
be sufficient to infer a suitable standard deviation, variance, or
mean.
\item Jackknife resamples are slightly smoother due to the fact that
we are merely omitting one point in each resample. Though this may
imply it is likely to reach a fixpoint at an earlier stage, results do
not suggest this, validating, in some sense, our approach.
\end{itemize}
\begin{figure}
\begin{minipage}{7cm}
\centerline{\scalebox{0.5}{\includegraphics{../../CPP/Chase/graphs_18_2_98/jack_fd11_10_50_3.eps}}}
\end{minipage}
\hfill
\begin{minipage}{7cm}
\centerline{\scalebox{0.5}{\includegraphics{../../CPP/Chase/graphs_18_2_98/boot_fd11_10_50_3.eps}}}
\end{minipage}
\caption{\label{graph:7.1} {A comparison of Jackknife and
Bootstrap mean ND set values iterated to an approximate fixpoint of
the mean using equivalent samples, for FD set 11, with a domain of 10,
50 tuples and a maximum indefinite cell arity of 3}}
\end{figure}
\begin{figure}
\begin{minipage}{7cm}
\centerline{\scalebox{0.5}{\includegraphics{../../CPP/Chase/graphs_18_2_98/jack_fd11_10_50_5.eps}}}
\end{minipage}
\hfill
\begin{minipage}{7cm}
\centerline{\scalebox{0.5}{\includegraphics{../../CPP/Chase/graphs_18_2_98/boot_fd11_10_50_5.eps}}}
\end{minipage}
\caption{\label{graph:7.2} {A comparison of Jackknife and
Bootstrap mean ND set values iterated to an approximate fixpoint of
the mean using equivalent samples, for FD set 11, with a domain of 10,
50 tuples and a maximum indefinite cell arity of 5}}
\end{figure}
\begin{figure}
\begin{minipage}{7cm}
\centerline{\scalebox{0.5}{\includegraphics{../../CPP/Chase/graphs_18_2_98/jack_fd11_10_25_3.eps}}}
\end{minipage}
\hfill
\begin{minipage}{7cm}
\centerline{\scalebox{0.5}{\includegraphics{../../CPP/Chase/graphs_18_2_98/boot_fd11_10_25_3.eps}}}
\end{minipage}
\caption{\label{graph:7.3} {A comparison of Jackknife and
Bootstrap mean ND set values iterated to an approximate fixpoint of
the mean using equivalent samples, for FD set 11, with a domain of 10,
25 tuples and a maximum indefinite cell arity of 3}}
\end{figure}
\begin{figure}
\begin{minipage}{7cm}
\centerline{\scalebox{0.5}{\includegraphics{../../CPP/Chase/graphs_18_2_98/boot_var_fd11_10_50_5.eps}}}
\end{minipage}
\hfill
\begin{minipage}{7cm}
\centerline{\scalebox{0.5}{\includegraphics{../../CPP/Chase/graphs_18_2_98/boot_sd_fd11_10_50_5.eps}}}
\end{minipage}
\caption{\label{graph:7.4} {Bootstrap variance and standard deviation
convergence, for FD set 11, with a domain of 10, 25 tuples and a maximum indefinite cell arity of 3}}
\end{figure}
Figure~\ref{graph:7.4} highlights convergence of the standard
deviation and variance as the sample size is increased. We are dealing
with approximate fixpoint and this implies equality within confidence
limits shown in Figure~\ref{graph:conlim}.
\subsection{Bootstrap Variance Results}
\begin{figure}
\begin{minipage}{7cm}
\centerline{\scalebox{0.5}{\includegraphics{../../CPP/Chase/all_results/var2500.eps}}}
\end{minipage}
\hfill
\begin{minipage}{7cm}
\centerline{\scalebox{0.5}{\includegraphics{../../CPP/Chase/all_results/var210000.eps}}}
\end{minipage}
\caption{\label{graph:histo2} {Histograms displaying variance of 500 and 10000 bootstrap replications}}
\end{figure}
Figure~\ref{graph:histo2} display the overall similarity in variances
achieved for 500 and 10000 BRS, respectively, complementing
Figure~\ref{graph:cp_hist1}.