© 2024, Tim Menzies
Stats on the results.
Run experiments on your SMO tool. Vary the sample sizes and compare the results to random selection.
Report baselines and ceiling.
- Baseline = value without any treatments (the centroid of the original data set)
- Ceiling = If we abandoned all the principles of this subject and evaluated everything, how good can we get?
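As a rough illustration of these two reference points, here is a minimal sketch. It assumes that d2h ("distance to heaven") is the normalized Euclidean distance from a row's goal values to the ideal point, that a trailing `-` on a column name means minimize and `+` means maximize, and that the centroid is the column-wise mean. The toy rows and the helper names (`column`, `d2h`, `baseline`, `ceiling`) are hypothetical, not the course code.

```python
import math

# Hypothetical toy data: only the goal columns (Lbs-, Acc+, Mpg+).
NAMES = ["Lbs-", "Acc+", "Mpg+"]
ROWS = [
    [1985, 21.5, 40],
    [2130, 24.6, 40],
    [2970, 15.5, 20],
    [4100, 12.0, 10],
]

def column(rows, i):
    """All values of column i."""
    return [row[i] for row in rows]

def d2h(row, rows, names=NAMES):
    """Distance to heaven: 0 is ideal, 1 is as bad as it gets (assumed definition)."""
    total = 0.0
    for i, name in enumerate(names):
        lo, hi = min(column(rows, i)), max(column(rows, i))
        norm = (row[i] - lo) / (hi - lo + 1e-30)      # normalize to 0..1
        heaven = 0.0 if name.endswith("-") else 1.0   # minimize "-", maximize "+"
        total += (heaven - norm) ** 2
    return math.sqrt(total / len(names))

def baseline(rows):
    """d2h of the centroid (column means) of the untreated data."""
    mid = [sum(column(rows, i)) / len(rows) for i in range(len(rows[0]))]
    return d2h(mid, rows)

def ceiling(rows):
    """Best d2h found when every row is evaluated."""
    return min(d2h(row, rows) for row in rows)

print(f"baseline {baseline(ROWS):.2f}  ceiling {ceiling(ROWS):.2f}")
```

Any treatment worth reporting should land somewhere between these two numbers.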
Report current date, random number seed, and some summary stats on the input data.
For each treatment, repeat the run 20 times. During the run, print some progress statement so observers know you have not crashed.
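One way to organize the repeats is sketched below, assuming each treatment is a function that takes a random number generator and returns one score. The names `repeat_runs` and `toy_treatment` are hypothetical; the seed is the one shown in the sample output.

```python
import random
import sys

def repeat_runs(treatment, repeats=20, seed=31210, out=sys.stderr):
    """Run treatment(rng) `repeats` times, printing a dot after each run."""
    rng = random.Random(seed)          # one seeded generator makes the runs repeatable
    results = []
    for _ in range(repeats):
        results.append(treatment(rng))
        print(".", end="", file=out, flush=True)   # progress: we have not crashed
    print(file=out)
    return results

def toy_treatment(rng):
    """Hypothetical stand-in for an SMO run; returns a fake score."""
    return round(rng.random(), 2)

scores = repeat_runs(toy_treatment)
print(len(scores), "runs collected")
```

Passing the generator into the treatment, rather than using the global `random` state, keeps every treatment reproducible from the one reported seed.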
Use stats to group the results into statistically distinguishable groups.
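To show the shape of such a grouping, here is a deliberately simplified sketch. It is not a statistical test: it sorts treatments by median and starts a new rank whenever a median sits more than a threshold `eps` away from the first median of the current group. A real analysis would replace that threshold rule with a proper significance and effect-size test. The function name `rank_groups` and the sample scores are hypothetical.

```python
from statistics import median

def rank_groups(results, eps):
    """results: dict of name -> list of scores (lower is better).
    Returns dict of name -> rank, where 0 is the best group."""
    ordered = sorted(results, key=lambda name: median(results[name]))
    ranks, rank, anchor = {}, 0, None
    for name in ordered:
        mid = median(results[name])
        if anchor is None:
            anchor = mid                 # first treatment opens rank 0
        elif mid - anchor > eps:
            rank += 1                    # too far from the group's first median
            anchor = mid
        ranks[name] = rank
    return ranks

demo = {
    "a": [0.17, 0.17, 0.17],
    "b": [0.26, 0.25, 0.27],
    "c": [0.30, 0.31, 0.29],
    "d": [0.55, 0.50, 0.60],
}
print(rank_groups(demo, eps=0.06))
```

Treatments that share a rank are then reported together, as in the summary report later on this page.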
Reproduce the following output.
Before summarizing the results of many runs, first show details (very useful for debugging).
In the following we show the baseline centroid (mid) and the variability around that centroid (div). We then run SMO 20 times (smo9) with a budget of 9 (peek at 4 to find the initial best and rest, then look at 5 more).
Then we compare to "just grab any 50 at random" (any50).
Finally, we abandon all the principles of this subject and evaluate everything (100%).
date : 08/02/2024 07:42:53
file : ../data/auto93.csv
repeats : 20
seed : 31210
rows : 398
cols : 8
names ['Clndrs' 'Volume' 'HpX' 'Model' 'origin' 'Lbs-' 'Acc+' 'Mpg+'] D2h-
mid [5.45 193.43 104.47 76.01 1 2970.42 15.57 23.84] 0.54
div [1.7 104.27 38.49 3.7 1.33 846.84 2.76 8.34] 0.16
#
smo9 [4 90 48 78 2 1985 21.5 40] 0.19
smo9 [4 90 48 78 2 1985 21.5 40] 0.19
smo9 [4 90 48 78 2 1985 21.5 40] 0.19
smo9 [4 90 48 80 2 2085 21.7 40] 0.2
smo9 [4 85 65 81 3 1975 19.4 40] 0.24
smo9 [4 85 65 81 3 1975 19.4 40] 0.24
smo9 [4 85 65 81 3 1975 19.4 40] 0.24
smo9 [4 85 65 80 3 2110 19.2 40] 0.25
smo9 [4 79 58 77 2 1825 18.6 40] 0.26
smo9 [4 98 70 82 1 2125 17.3 40] 0.31
smo9 [4 85 52 76 1 2035 22.2 30] 0.31
smo9 [4 98 65 81 1 2380 20.7 30] 0.34
smo9 [4 98 '?' 71 1 2046 19 30] 0.36
smo9 [4 98 68 77 3 2045 18.5 30] 0.37
smo9 [4 112 88 82 1 2605 19.6 30] 0.38
smo9 [4 98 80 79 1 1915 14.4 40] 0.39
smo9 [4 97 92 72 3 2288 17 30] 0.41
smo9 [4 97 92 72 3 2288 17 30] 0.41
smo9 [4 135 84 82 1 2525 16 30] 0.44
smo9 [4 97 88 73 3 2279 19 20] 0.49
#
any50 [4 97 52 82 2 2130 24.6 40] 0.17
any50 [4 90 48 80 2 2335 23.7 40] 0.19
any50 [4 90 48 80 2 2335 23.7 40] 0.19
any50 [4 90 48 78 2 1985 21.5 40] 0.19
any50 [4 90 48 78 2 1985 21.5 40] 0.19
any50 [4 90 48 78 2 1985 21.5 40] 0.19
any50 [4 90 48 80 2 2085 21.7 40] 0.2
any50 [4 90 48 80 2 2085 21.7 40] 0.2
any50 [4 90 48 80 2 2085 21.7 40] 0.2
any50 [4 86 65 80 3 2110 17.9 50] 0.25
any50 [4 89 60 80 3 1968 18.8 40] 0.26
any50 [4 85 70 78 3 2070 18.6 40] 0.27
any50 [4 85 70 78 3 2070 18.6 40] 0.27
any50 [4 72 69 71 3 1613 18 40] 0.27
any50 [4 91 68 82 3 2025 18.2 40] 0.28
any50 [4 98 70 82 1 2125 17.3 40] 0.31
any50 [4 85 52 76 1 2035 22.2 30] 0.31
any50 [5 121 67 80 2 2950 19.9 40] 0.31
any50 [4 97 46 73 2 1950 21 30] 0.32
any50 [4 68 49 73 2 1867 19.5 30] 0.34
#
100% [4 97 52 82 2 2130 24.6 40] 0.17
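The any50 and 100% rows above can be read as the same procedure with different sample sizes. A minimal sketch, assuming each row's d2h has already been computed into a list (the name `any_n` and the fake d2h values are hypothetical):

```python
import random

def any_n(d2h_values, n, rng):
    """Evaluate n rows picked at random; return the best (lowest) d2h seen."""
    n = min(n, len(d2h_values))
    return min(rng.sample(d2h_values, n))

rng = random.Random(31210)
fake_d2h = [rng.random() for _ in range(398)]    # hypothetical scores, one per row

print("any50", round(any_n(fake_d2h, 50, rng), 2))
print("100% ", round(any_n(fake_d2h, len(fake_d2h), rng), 2))
```

Sampling everything always returns the ceiling, which is why the single 100% row needs no repeats.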
Write a more succinct report that summarizes 20 runs of anything that uses stochastic choice.
Here, we only report distance to heaven (d2h).
The following report can get slow for larger data sets so the line starting with #base prints each word as we loop through that part.
In all the following, we make 4 initial guesses to initialize best:rest, then we run on for BUDGET-4 more guesses.
- e.g. bonr20 means 4 initial guesses then 16 subsequent ones.
bonr means using the acquire function you've been using all along ((b+r)/(b-r))
randN means: 20 times, pull N rows at random (e.g. rand358 pulls 90% of the 398 rows), sort by d2h, then report the top one.
base shows the d2h distribution within the untreated data set.
best reports the best d2h in the data (this is our ceiling)
tiny shows .35 * standard deviation. Any difference smaller than this is getting a little pedantic.
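The tiny threshold can be sketched as follows, assuming the standard deviation in question is the sample standard deviation of the untreated d2h values (the source does not say which estimator is used). The names `tiny` and `differs` and the sample values are hypothetical.

```python
from statistics import stdev

def tiny(base_scores, factor=0.35):
    """A small-effect threshold: factor * standard deviation of the base scores."""
    return factor * stdev(base_scores)

def differs(a, b, threshold):
    """True if two scores are further apart than the small-effect threshold."""
    return abs(a - b) > threshold

base = [0.2, 0.4, 0.5, 0.6, 0.8]      # hypothetical untreated d2h values
t = tiny(base)
print(round(t, 2), differs(0.26, 0.30, t), differs(0.17, 0.55, t))
```

Two treatments whose scores differ by less than this threshold can reasonably be treated as a tie.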
date : February/02/2024 08:19:54,
file : ../data/auto93.csv,
repeats : 20,
seed : 31210,
rows : 398,
cols : 8,
best : 0.17,
tiny : 0.06
#base #bonr9 #rand9 #bonr15 #rand15 #bonr20 #rand20 #rand358
#report8
#
0, #rand358, 0.17, 0.00, * | , 0.17, 0.93
#
1, #rand20, 0.26, 0.07, *--- | , 0.17, 0.93
1, #bonr20, 0.30, 0.14, -----*- | , 0.17, 0.93
#
2, #bonr15, 0.32, 0.16, -------* | , 0.17, 0.93
2, #bonr9, 0.34, 0.11, --*-- | , 0.17, 0.93
2, #rand15, 0.34, 0.06, ---* | , 0.17, 0.93
2, #rand9, 0.36, 0.12, ----*-- | , 0.17, 0.93
#
3, base, 0.55, 0.20, -------*--- , 0.17, 0.93
In the above, the left-hand-side number shows the statistical ranking. All statistically similar treatments have the same rank.
In the above we note that:
- All treatments significantly improve on the baseline
  - Hooray
- Evaluating a lot of things (rand358) does better than evaluating just a few things (e.g. bonr9).
  - No surprises there.
- Evaluating just a few things is surprisingly effective (e.g. the bonr9 results are very similar to bonr20)
  - 15 is no better than 9, so not recommended
  - 20 recommended.
- Random is as good as smo
  - For X in (bonr,rand) and N in (9,15,20):
    - #rand(N) == #bonr(N)
- Is this result repeatable in many data sets (your class project)?
How do we score these things?
- by predictive prowess?
- by predictive certainty (the variance)?
- by simplicity of explanation (not done above):
  - what is the least number of attribute value settings...
  - ... that most influence the outcomes?
  - welcome to explanation (homeworks 7,8,9)