-
Notifications
You must be signed in to change notification settings - Fork 1
/
technical.txt
856 lines (676 loc) · 37.4 KB
/
technical.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
____ _ _ ____ _
| __ ) _ __ ___ __ _| | _(_)_ __ __ _ | __ ) __ _ _ _ __| |
| _ \| '__/ _ \/ _` | |/ / | '_ \ / _` | | _ \ / _` | | | |/ _` |
| |_) | | | __/ (_| | <| | | | | (_| | | |_) | (_| | |_| | (_| |
|____/|_| \___|\__,_|_|\_\_|_| |_|\__, | |____/ \__,_|\__,_|\__,_|
|___/
Intro
-----
Breaking Baud has been about 3 years in the making. I have been interested in
tape encoding systems for a long time - when I was a kid, after disassembling
a Spectrum loader and getting confused, and then finding the Spectrum ROM
disassembly book in the library, I then ported the Spectrum saving code to the
CPC. Sadly, I've long since lost this code, but the interest in tapes remained!
Background - understanding a normal loader
------------------------------------------
If you're new to tape loading, I'd strongly recommend studying the Spectrum
ROM loading code. It's not the same technique as I use, but it's well worth
understanding. Take a look here:
http://www.wearmouth.demon.co.uk/zx82.htm#L0556
Essentially, I'll briefly describe it now, which might help understand the
disassembly.
LD-BREAK (056B)
LD-EDGE-1 is called, this times out when B reaches #FF, but that is
unimportant for now. If it timed out, it keeps looping unless space is
pressed.
LD-WAIT (0574)
If a pulse was detected, it pauses for a second and checks again.
LD-LEADER (0580)
LD-EDGE-2 is called to detect a pair of pulses, starting with a timing
constant of #9C. This function increases B each time round the loop and
when this increases past #FF to #00, it times out. If the time after
two pulses is between #C6 and #FF, it is considered to be "long enough"
for a pilot tone, below #C6 it is too short, and above #FF as we saw
above, it will time out.
These pairs of pulses are counted with H, this starts at 0 from after
the LD-WAIT loop, and when H wraps back round to 0, we have therefore
seen 512 long pulses, so we consider the pilot tone to be valid.
LD-SYNC (058F)
LD-EDGE-1 is called with a timing constant about half that of 2 pulses,
and this is explained as we're just looking at single pulses this time.
Between #D4 and #FF, it is still long enough for a pilot pulse, so we
keep looping. However, we're really looking for a short pulse with a
time of between #C9 and #D4 and if we find this, we call LD-EDGE-1
again to check for another short pulse within the remaining time.
LD-MARKER (05C8)
L is initialised to #01, which is used to shift in 8 bits of data.
Every time a new bit is shifted into L, a 0 falls out until the 8th
bit when a 1 will fall out.
LD-8-BITS (05CA)
LD-EDGE-2 is called (the timing constant is either set at 05A5 for the
initial byte or later down in this block). We compare against the half
pulse length, which sets carry to 0 for a short pulse and 1 for a long
pulse. This is then shifted into L and we loop until all 8 bits are
read.
A very simple XOR parity is calculated, and if DE==0, we check to see
if the XOR parity is 0 and then exit the function. If DE<>0 we jump to
LD-LOOP (05A9) which checks to see if this is the first byte, and if
so compares it against the "sync byte", otherwise it stores the byte
and continues.
We can therefore describe the tape format as:
* A pilot tone of at least 1 second PLUS 512 long cycles, then 2 short cycles
* A single byte of sync, consists of 8 bits of either 2 long or 2 short cycles
* DE bytes of data, each consist of 8 bits of either 2 long or 2 short cycles
* A single byte of CRC, again 8 bits of either 2 long or 2 short cycles
So, we can see it's a pretty simple format. The regular Amstrad format is a
bit more complicated, so I won't describe it in detail here, but you can take
a look here:
http://www.cpcwiki.eu/forum/programming/callbca1-firmware-compatible-tape-loader/
The main features of the Amstrad loader are that the timing constants are all
derived from the pilot tones so that any speed (within reason) can be used
which improves performance with stretched tapes, the CRC is more complicated
and calculated on every block of 256 bytes so an error is detected earlier,
but essentially it's a very similar endoing system.
Background - other tape formats
-------------------------------
This system of pairs of pulses seems wasteful on the face of it - after all,
why use 2 pulses instead of 1? The reason is for tolerance to volume levels.
The tape input on most systems uses an op-amp with feedback to provide
hysteresis - essentially, the voltage on the input pin during a pulse looks
like this on the left rather than what you might expect on the right:
_______ _________
/ \ | |
/ \ | |
____/ \_____ _________| |________
The key point is that the transition isn't instant, and if the pin was
directly connected to a digital circuit, during the transition it would
read either 0 or 1 randomly. The opamp forces the signal to be more like
the signal on the right.
If the voltage is "just right", then a continually pulsing signal will have
lows and highs of the same length, but if the volume is too loud, the high
will be longer than the low and if the volume is too quiet, the low will be
longer than the high. By counting 2 pulses, we can ignore the effects of the
volume on the pulse lengths, and instead count the length of the full cycle
(2 pulses).
One thing you will have noticed from the above, is that a 0 bit will transfer
in half the time of a 1 bit. Another older format (FM) would have 2 long pulses
or 4 short pulses, so that each bit takes a constant time, but on fast systems
like the CPC and Spectrum, we're mostly waiting for something to happen, so
there is no point in doing this.
There are other possibilities too, for example the ZX80 used a system whereby
a 0 bit was represented by a short then a long pulse and a 1 bit was
represented by a long and a short pulse. Again, a feature of this system is
that each bit is a constant length.
Background - some terminology
-----------------------------
There are a variety of terms, that are unfortunately often interchangeable
despite having quite different meanings.
bps & baud
These seem simple and the same. However, they're subtly different. bps is bits
per second, baud is actually technically symbols per second. In the case above
a symbol is a pair of pulses and so the numbers are the same, however this
is not always true.
symbol & character
So, armed with the above information, you might think that a symbol represents
a single transition. And that's often true, and in the analogue world a symbol
represents a distinct state, e.g. 64QAM has 64 possible states, and so each
symbol transfers 6 bits of data (so here 1 baud = 6 bps). However, there are
a variety of other encoding systems where multiple symbols are used together
to represent a set of values, and these values are sometimes also called
symbols, other times characters and other times words. Unfortunately, in my
code I tend to call both transitions and groups of transitions as symbols, but
it's easier to consider the term character.
line coding
This is a catch all phrase that describes the mappings from groups of symbols
to characters and many such systems exist.
Background - disk encoding
--------------------------
There has been much more progress made in recent years on disk encoding, and
this is the primary reason why we can get so much more data on modern hard
disks than ever before (and also, why they're more susceptible to corruption).
The other thing that should be highlighted early when discussing disk formats
is that all encoding systems are designed around transitions. We were already
thinking like this for tapes, but when we consider a tape signal it's natural
to think of the stream of data coming from the tape as the bits read from the
tape port, but on a disk the transitions are the important thing, so e.g.
tape 00001111001100001111
disk 10001000101010001000
The original disk formats used FM (frequency modulation), which is almost
exactly what was described above in the tape formats section. For disk drives,
the constant bit rate is important, so FM uses 2 time periods per bit and a 0
bit is represented by 1 transition in that timeslice, a 1 bit by 2 transitions.
As you might remember from above, a transition between states is usually not
a clean transition but rather the signal is slow to change state. Consequently,
there's a finite limit on the number of transitions we can make in a given
time, which in turn limits the maximum transfer rate.
I haven't explicitly stated it, but one of the drivers of the systems above is
so that the clock signal is embedded into the data so that the reader can
unambiguously detect a transition. However, for FM encoding that means that
a clock bit has to be transmitted for each bit, so on average there are 1.5
transitions per bit (possible combinations are 10, 11).
With MFM encoding (modified frequency modulation), the clock bits are dependant
on the data bits, if either of the bits surrounding the clock bit are 1, the
clock bit is 0, if both data bits are 0, then the clock bit is 1. Consequently,
a 0 is encoded as 10 or 00 depending on the previous bit and a 1 is always
encoded as a 01. There are 0.75 transitions per bit on average. This actually
means that for the same quality media, we can double the transfer rate and
maintain the same number of transitions on the media.
This type of encoding is the simplest form of RLL (run length limited), which
is a formal way of specifying how frequent the transitions need to be.
FM is known as (0,1) RLL, that is there are 0-1 bits gap between transitions
MFM is known as (1,3) RLL, that is there are 1-3 bits gap between transitions
It should be noted that it's possible to get bitstreams that meet the criteria
for the RLL specification, but that cannot be produced by normal encoding. One
example for MFM is 100010010o01001 - the bit marked "o" should be a 1 by the
normal encoding rules, so this kind of system is often used to synchronise
at the start of a block (sector).
GCR encoding is used on older Apple disk systems and achieves a better
storage capacity than MFM, although slower. With GCR, 4 bits are encoded to 5,
a so called 4b5b system, such that no more than two consecutive bits in a code
are zero, but also no code starts or ends with two zero bits, so any two codes
together also meets this rule. This is therefore another (0,1) RLL, except
it's also not possible to have a code with more than eight consecutive 1s.
Variations of these encodings are also used for other purposes. One common
one is 8b10b which is used for encoding HDMI data. In this, 8 bits of data
is encoded as 10 bits such that there can't be runs of more than six 0s or six
1s, and that transitions are minimised (there are 2 possible encodings for
each byte, chosen based on the previous bit transmitted).
Background - tape encoding continued
------------------------------------
So, back to tape... I'd been considering how to achieve something like GCR
encoding on a tape. Essentially, we're limited to a relatively slow transition
rate, so the obvious best fix is to try to achieve better results than 1 bit
per 1.5 transitions on average.
With tape, however, there are more things to be concerned about than just
transitions. The low and high signals decribed originally aren't really zero
and non-zero voltages, they're actually positive and negative voltages that
directly reflect the magnetic charge on the tape. Over time, the average charge
needs to be zero. If this rule isn't adhered to, the head will become charged
and so the head will be less responsive one way and more the other. So, after
a long run of +ve charge, the head will be slightly +vely charged and so a
full +ve charge will register as a smaller +ve voltage and a full -ve charge
will register as a larger -ve voltage. This phenomena is known as the DC bias,
but fortunately is easy to deal with - we just ensure that for a given encoding
the length of time the charge is +ve matches the time the charge is -ve.
My initial investigations focused on just this rule, however, the volume level
problem I described above also means that it's simpler just to stick to pairs
of pulses which necessarily means that the DC bias is zero.
My approach
-----------
My next investigations revolved around looking at RLL rules. Rather than
just having long and short pulses, I experimented with short, medium and long
pulses. With just long and short, on average you get 1 bit of data for every
1.5 time periods, with short, medium and long, you get 1 trit of data for every
2 time periods. After an average of 6 time periods, you get 3*3*3 (27) possible
combinations compared to 2*2*2*2 (16) possible combinations with the normal
encoding. Initial tests with a MP3-tape adapter proved very positive, however,
when I tried recording onto a real tape, I discovered that the time for the
pulse edges made this too unreliable - occasionally medium length pulses would
be detected as long pulses or short pulses.
This inspired me to my next idea - why not INCREASE the difference between a
short and long pulse? By increasing the ratio of the timings, we can actually
speed up the signal, but keep the same average transition rate. However, a 0
bit is now shorter but can still be reliably detected. Obviously, a 1 bit is
now longer, so on average we haven't gained anything...
...at least at first glance. :)
But think back to the RLL rules we were talking about before. They're all about
encoding the data to avoid long runs of certain bits (in this case 1s) and to
prefer others (in this case 0s).
By considering short pulses and long pulses that are 3 times longer, we can
think about encoding into a fixed time period, e.g. 8 time periods could encode
9 possible combinations (3 ones are represented by 1..):
0 0000000
1 00001..
2 0001..0
3 001..00
4 01..000
5 01..1..
6 1..0000
7 1..01..
8 1..1..0
If you look closely, you'll see that we had flexibility on placing the first
1.. sequence, but after the first has been placed, it makes it harder to place
others. If you add an extra time period, the range grows more:
0 00000000
1 000001..
2 00001..0
3 0001..00
4 001..000
5 001..1..
6 01..0000
7 01..01..
8 01..1..0
9 1..00000
10 1..001..
11 1..01..0
12 1..1..00
It turns out that in 16 time periods there are 277 possible codes that can
be encoded. After experimenting, I found this system was very reliable and
so it became the basis for the new loader. In actual practice, the division
between short and long pulses can still be midway between them, however unlike
a standard loader, the data on tape has the pulse length ratio 1:3 instead of
1:2.
Decoding the data
-----------------
Once we know the time between a pair of pulses, we can obtain a 0 or 1 bit.
With the standard loader, this is simply shifted into the byte until we have
8 bits.
With my encoding system, ironically the constant time therefore decodes to a
variable decoded bit length. Fortunately, with a 0 or 1 bit produced after
each pulse, we can easily use a binary tree to decode the data. Also, it's
very easy to create this binary tree at runtime, which in fact we do.
If you also look at the data, in some cases (e.g. 2,3,8,11,12 above) we have
one or two trailing 0 bits that can't possibly be any other value. The binary
tree can stop early such that these bits aren't emitted, so actually the code
lengths are 14-16 time periods, for an average of 15, a saving of 6%.
Of the output codes, 0-255 are used as literal bytes, 256-276 are "spare", so
these are actually used to represent control codes that introduce compressed
sequences.
Compressed sequences
--------------------
I haven't been too original on this front, simply because some excellent work
was done in the 70s on compression, and this still forms the basis of most
compression systems still in use today. The best known of these is LZW in GIF
files and a variant of LZ77 called DEFLATE which forms the basis of gzip,
however all of these are ultimately built upon a simpler variant called LZ77.
The basis of this encoding system is essentially a choice between a literal
byte or an instruction to copy some preceeding data (D,L) where D is the
distance from where the data can be found from the current position and L is
the length of the data to copy.
Now, it's quite wasteful to just encode D and L as 16-bit values, but equally
if D was limited to 8-bits we would severely limit the efficiency of
compression as there'd be less data to consider for copying. Many systems
will use variable length encoding, e.g. 9 bits starting with 0 could be a
literal, a 10 prefix might indicate a 3 bit length and 8 bit offset, and 1100,
1101, 1110 and 1111 prefixes for less likely combinations. All of these though
require shifting bits through a register to parse these sequences.
Remember back to the previous section, where output codes 256-276 are spare.
These are used as simple prefixes, so for instance a 5 byte copy can be
encoded as a single symbol that represents "5 bytes and 8-bit offset" and
another symbol for the offset.
The list of these special symbols are:
ofs8 ofs16 note
256 257 implied repeat of 3
258 259 implied repeat of 4
260 261 implied repeat of 5
262 263 implied repeat of 6
264 265 implied repeat of 7
266 267 implied repeat of 8
268 269 implied repeat of 9
270 271 implied repeat of 10
272 274 specified 8-bit repeat count
273 implied offset of -1, specified 8-bit repeat count
275 276 skip bytes
So, for instance the symbols: 'A',273,2,'B',258,-4 produces "AAABAAAB"
This indicates a special trick - if the offset is nearer than the repeat count,
then part of the copy is repeated itself.
These special codes are actually not encoded with the values 256-276, rather
they are rearranged so that shifting bits from the right determine the type.
Bit 0 = 0 indicates a replied repeat count, in this case bit 1 determines
whether the offset is 8-bit or 16-bit and the other bits when shifted down
hold the literal repeat count. So, e.g. code 268 (implied repeat of 9, 8-bit
offset) is encoded at 256+(9<<2)+(0<<1)+(0<<0) = #124
Bit 0 = 1 indicates one of the last 5 codes,
272 -> #101 (rpt8,ofs8)
274 -> #105 (rpt8,ofs16)
273 -> #103 (rpt8) RLE implied offset of -1
275 -> #107 (skip8) SKIP skip8 bytes
276 -> #109 (skip16) SKIP skip16 bytes
These symbols are translated into these codes when building the tree, so the
these codes rather than the logical symbol number is used when calculating the
CRC. They are therefore also translated in cdt.py when compressing data.
Implementation
--------------
From the start, this tape loader was designed to be integrated with other code
so that interesting things can happen during loading. By necessity, the tape
sampling needs to happen very frequently, so if other things are being done as
well the code needs to be very efficient.
The most important thing in this loader is the main loop. At the minimum, this
code needs to read the tape, check if a pulse has occurred, count the time
since the last pulse and process the pulse when detected. After a lot of
experimentation, I got it down to this:
mainloop: ; L0+ 0
ex af,af' ;1 L0+ 1
mainloop_1:
exx ;1 L0+ 2
in a,(c) ;4 L0+ 6
xor c ;1 L0+ 7 ; check for edge
call m,new_found_edge ;3/5 L0+10
ld l,(hl) ;2 L0+12 ; update counter
exx ;1 L0+13
mainloop_13:
ex af,af' ;1 L0+14
defs line-(16+3)
jp mainloop ;3 L0+17
This code does a lot, so it's worth explaining in detail what's going on.
The EXX and EX AF,AF' instructions switch to the alternate register set, so
that the current registers are preserved. Realistically, you will always want
the EXX, but most times it doesn't matter if AF gets corrupted.
The IN / XOR / CALL combination reads the current tape bit into bit 7, checks
it against the current bit, and if it is changed (bit 7 is now 1, so the sign
is negative) we call into new_found_edge to process the pulse.
The LD is a quick way to increment a count and also to handle timeouts etc.
(HL) points to the next number in the sequence, and the same value if we've
reached a time-out.
In general then, we want to regularly execute the code between mainloop_1 and
mainloop_13 and each time we enter new_found_edge, we have a pulse and L
contains a count since the last pulse. We will examine this code in a moment.
It's also apparent that the alternate set registers have special usage:
B=#F5 input port for tape
C high bit is last tape bit read
HL lookup table
We will now consider the pulse function, which is entered at L0+12 as the
call c takes 5 cycles when taken, rather than the 3 when not taken.
new_found_edge:
; L0+12
ld a,c ;1 L0+13
xor #87 ;2 L0+15 swap between #44/#4a (#22,#a5)
ld c,a ;1 L0+16
add a,a ;1 L0+17 ; colours: pre-pilot 2a/ad, pilot 2a/ac, data 22/a5
ld b,#7f ;2 L0+19
out (c),a ;4 L0+23 ; change border colour (~L0+19)
ld b,#f5 ;2 L0+25
inc l ;1 L0+26 ; move to transition table
ld a,(hl) ;2 L0+28 ; peek at next value
sub b ;1 L0+29 ; symbols start at F5 :)
jr c,not_a_symbol ;2/3 L0+31
defs line+0-31 ; L1+ 0
ld pc,iy ; jp (iy) ;2 L1+ 2 ; jump to symbol handler
not_a_symbol:
; L0+32
defs line+1-32 ; L1+ 1
in a,(c) ;4 L1+ 5
xor c ;1 L1+ 6 ; check for edge
ret p ;2/4 L1+ 8 ; back at caller at 10
nop ;1 L1+ 9
jr new_found_edge ;3 L1+12
The first part with the XOR inverts the top bit of C (used to hold the last
bit) and also some lower bits that are used for the border colour. The next
block updates the border colour based on A (rotated so the high bit becomes
the low bit). For normal data, this results in oscillating between values #44
and #4a which represent the familiar blue/yellow of spectrum loaders.
The next block again is some trickery relating to the lookup table. Whilst I
said before that L was the counter, this isn't quite true. In reality, most
entries hold L+3, so that (L%3) is the pulse count and (L/3) is the pulse
length. The INC increases the pulse count but leaves the length alone.
The LD / CP fetches the current value and checks if it's above #F5 then it's
considered to be a symbol, otherwise we jump to not_a_symbol, where we read
the next tape bit and return if it's unchanged and repeat if we've found
another edge. This code looks a bit weird because it's trying to execute the
same operations at the same cycle as before, and also so that RET P will
return on the same cycle as the mainloop_10.
If the byte in (HL) is above #F5, we process the symbol by simply jumping to
the symbol handler in IY, and A then contains a symbol from #00 to #0A. This
range of symbols was a hangover from when I had variable length symbols but
it's still used to handle the timeout case (symbol #F5) against the short and
long cases (#F6 and #F7).
The final tricky part of this code is the "defs line+0-31" and "defs line+1-32"
which is used to pad all these routines to take "line" cycles in total. This
snippet of code above dictates the shortest loop cycle, when line=31 and this
is used for the demo. However, during testing I used line=32 (which can easily
meet the timing requirements to read line=31 data) and line=64 in order to spot
places where I'd miscalculated cycle counts. By using 32 or 64, the border
colour transition would happen at a fixed point on the video line, resulting
in a visible vertical line that would shift if the counts were incorrectly
calculated.
The register usage is now:
B=#F5 input port for tape
C colour and high bit of last tape bit read
HL lookup table
IY symbol handler
I won't go into detail on the next couple of symbol handlers, but essentially
the first one, find_sync_handler, counts long sync pulses in DE and when a
short pulse is detected after a certain number of long pulses, it changes the
handler to head_byte_handler. From then on, DE is used to hold the current
pointer to the symbol decode binary tree (for some reason called new_table)
and the jump at head_finished is patched to check_sync_byte1 so that code is
executed on the first decoded symbol. This code stores the byte and uses the
function continue_header_18 to repatch head_finished to process more header
bytes. The last header byte is processed in check_crc_byte2 which verifies the
header and sets IY to the main symbol handler data_byte_handler.
data_byte_handler is fairly straightforward - for a literal byte, it is stored
in (IX) and otherwise the decompression instruction is decoded. Apart from
skip (which I'll explain later), these compression codes are also stored at
(IX+0) to (IX+2) - (IX+0) is the copy count, (IX+1) and (IX+2) is the offset
of the source data. The current IX is stored into a list for the decompressor
to process and IX is the increased by the length so that we can continue to
read literals in-place.
The skip codes were originally intended only to skip the 48 bytes at the end
of each screen pixel line, but they turned out to be useful for compressing
deltas too as they could be used without destroying the first 3 bytes.
The register usage is now:
B=#F5 input port for tape
C colour and high bit of last tape bit read
DE binary tree table (or pilot tone count)
HL lookup table
IX next byte to transfer (like the ZX code)
IY symbol handler
Timing
------
Having settled on a line of 31 cycles, we now need to consider the target
baud rate.
As the tape is only sampled every 31 cycles, there is the possibility for the
tape to be inaccurate by ±1 samples. Consider these edge cases with a
difference in detected lengths of 2 lines:
| | | | | | | |
| x| | | |x | | |
| |x | | x| | | |
Therefore, a normal time period, i.e. the length of a short pulse, is defined
to be nominally 3 "lines", or 3*31 cycles=93 cycles. A pair of pulses is
therefore 186 cycles.
In reality, though, we need to consider that pulses of length 1-5 "lines" as
short and, 7-11 "lines" as long pulses when we consider this possible variances
of 2 "lines" between each sampled pulse.
A symbol is 14-16 time periods long, so 2604, 2790 or 2976 cycles long, on
average we can expect 2790 cycles.
The average character rate is therefore 1MHz/2790 = 358 cps, 336 cps minimum.
Multiplying by 8 bits per byte gives 2867 bps avg, 2688 bps minmum.
Ovbiously, compression will increase this rate further, but even for totally
uncompressed data, this significantly outperforms most turbo loaders.
Binary tree
-----------
The binary tree is stored as a series of 3 byte entries. The first byte
consists of 2 high bits, the next 2 bytes are 2 low bytes. Put together, they
form a 12-bit value for each of the left and right nodes.
A value of 0..255 represents a literal byte as a terminal node of the tree
A value of 256..277 represents a decompression code as a terminal node
A value of 512..4095 represents an address of the next entry (another branch)
The addresses are padded such that the 3 byte record never crosses a 256-byte
page boundary, so the next byte can always be obtained by INC E.
An obvious consequence is that the binary tree table must be between #0200 and
#0fff. For a 16 time period table, this occupies #0200 to #0541.
CRC
---
The data is also verified with a CRC, rather than using a simple XOR like the
ZX ROM or a complicated LFSR like the Amstrad ROM, I instead use Fletcher's 16
checksum as used in ZIP files and described here:
http://en.wikipedia.org/wiki/Fletcher%27s_checksum
This mod-255 variant is very easy to calculate on the CPC (assuming D=1):
current_crc equ $+1
ld hl,#ffff ;3 L1+28
add a,l ;1 L2+29
adc a,d ;1 L2+30
defs line-1-30 ; L2- 1
ld l,a ;1 L2+ 0
add a,h ;1 L2+ 1
adc a,d ;1 L2+ 2
ld h,a ;1 L2+ 3 ; HL = updated CRC
ld (current_crc),hl ;5 L2+ 8
Additionally, a valid CRC results in HL=#FFFF, so it's easy to check for:
ld a,l ;1 L2+ 9
and h ;1 L2+10 ; A=0xff if correct sync
inc a ;1 L2+11 ; A=0 if correct sync
Decompression and mainloop
--------------------------
As described above, we maintain a list of blocks to decompress, which are
always just straight copies. By doing the literals in-place and processing
copies in order, we never need to worry about the data being available to copy.
The mainloop alluded to earlier looks likeL
mainloop_1:
exx ;1 L0+ 2
in a,(c) ;4 L0+ 6
xor c ;1 L0+ 7 ; check for edge
call m,new_found_edge ;3/5 L0+10 ; note DE preserved as can't transition here
ld l,(hl) ;2 L0+12 ; update counter
exx ;1 L0+13
current_read_ptr equ $+1
ld de, free_buffer ;3 L0+16
ld hl, (patchup_write_ptr) ;5 L0+21
sbc hl,de ;4 L0+25 ; carry was clear
jp z, copy_finished_28 ;3 L0+28
ex de,hl ;1 L0+29
ld e,(hl) ;2 L0+31
inc l ;1 L0+32
defs line+1-32 ; L1+ 1
exx ;1 L0+ 2
in a,(c) ;4 L0+ 6
xor c ;1 L0+ 7 ; check for edge
call m,new_found_edge ;3/5 L0+10 ; note DE preserved as can't transition here
ld l,(hl) ;2 L0+12 ; update counter
exx ;1 L0+13
ld d,(hl) ;2 L0+15
inc l ;1 L0+16 ; DE=address of data to copy
ld (current_read_ptr),hl ;5 L0+21 ; update the read pointer
...
mainloop_patch equ $+1
jp mainloop ; L1+ 0
So this mainloop executes whenever the tape code is idle.
The text drawing code is later patched into to this mainloop_patch so that it
too is executed whenever the system is idle. In actual practice, there is a
lot of free time. I always planned to have more demo effects here, but I never
got chance. This is the place to do it!
Interrupts
----------
Interrupts are basically a pain with this system as they would disturb the
tight timing. However, they're also useful to sychronise with the video
frame so that we can change the palette or play audio once per frame.
We rely on the fact that once the interrupt is raised, even if interrupts are
disabled at the time, as long as we handle the interrupt within 32 video lines
everything will behave normally. An interrupt can therefore be processed like
this:
exx ;1 L0+ 2
in a,(c) ;4 L0+ 6
xor c ;1 L0+ 7 ; check for edge
call m,new_found_edge ;3/5 L0+10 ; note DE preserved as can't transition here
nop ;1 L1+11
nop ;1 L1+12
ei ;1 L1+13
ld l,(hl) ;2 L1+15 ; update counter
di ;1 L1+16
exx ;1 L1+17
If an interrupt has been raised, it will execute after the LD L,(HL):
intvec_palette:
;10 L0+23 ; interrupt takes 10us
Returning back looks like this:
in a,(c) ;4 L0+ 6
xor c ;1 L0+ 7 ; check for edge
call m,new_found_edge ;3/5 L0+10
ld l,(hl) ;2 L1+12 ; update counter
ret ;3 L1+15 ; back at caller at 15
The LD here represents advancing the next line's count, and the RET will return
at L1+15, hence the 2 NOPs inserted before the EI so ensure the timing is
correct.
The default routine, intvec_palette, updates the video mode and palette once
per frame, but it's also patched to jump to the audio player once that has
been loaded.
Blocks and error detection
--------------------------
As described above, the CRC is quite robust for this loader so errors can be
easily detected.
Each block has a small header immediately after the pilot tones. The block
contains:
2 bytes for sync (aka block ID)
2 bytes for address
2 bytes for length
2 bytes of header CRC
length bytes of data
2 bytes of data CRC
The system initially looks for block ID #01.00.
If a block is found with a block ID greater than the block we're looking for,
then we assume an error has occurred, so we change the border to red and
ignore the block.
If a block is found with a block ID lower than the block we're looking for,
then we assume the user rewound the tape, we change the border to green and
ignore the block.
If the block is found with the correct block ID, we read the block. If the
length is non-zero, we process it as data and increment the block sub ID.
If the length is zero, we treat this as the end of the block. The block sub ID
is reset to 0 and the block main ID is increased. Additionally, if the address
is non-zero, we jump to that address, allowing arbitrary code execution.
CDT generation
--------------
The whole demo is data-driven, based on the tape data starting from block
#01.00 and proceeding from there. The main executable can therefore be used
for any loading system that's desired.
The CDT file is generated by code in cdt.py, using the class mainfile. The
most useful functions are:
loader() which adds an AMSDOS file
load_data() to generate all the blocks to load data to a specific address
exec_code() to load a file to a specific address and execute it
gap() to create a delay (and a long pilot tone for the next block)
end_multi_block() to end a main block and generate a pause
datablock() to load specific data to an address
palette() to configure the mode and palette
blocks() outputs a sequence of blocks
The compression code also lives in cdt.py. It's not very optimal in cases where
compression is very high - there are many possible compression options to be
evaluated and so it's quite slow. This would be better rewritten! Essentially,
this takes a block of data, an address, and optionally a block of data that's
already in memory, and generates a compressed version of the data. This is
actually a generator that returns (start,end,symbols) tuples that can be passed
to cdt.blocks().
screen.py contains the code that converts a python image that has been loaded
from a PNG or GIF into a mode 0 or mode 1 image (depending on number of colours
in the palette). The screen_writer class is repsonsible for managing the
pickle cache to hold comrpession results so that the slow compression process
only takes place when the images have been changed.
demo.py creates the demo CDT and uses intro.py for the intro and outro screens.
Audio playback
--------------
Surprisingly, the music playback was one of the hardest part of this demo.
If you look through the source, you'll notice evidence that there used to be
an ay_tones buffer - a 6KB circular buffer that held about 5 seconds of audio
data, and originally the plan was to stream audio with the same compression
as the picture data. It was quite impressive to hear audio starting almost as
soon as the pilot tone was over.
However, it was soon apparent that the compression was nowhere near as good
as the binary version of the arkos file. Additionally, synchronising the reads
into the buffer and playback was complicated due the the potential for read
errors to mess up the timing.
The source in arkos_player.asm is very closely based on the original Arkos
Tracker source, except obviously the player needs to be interleaved with the
tape playback code. This also means that, similar to the fixed cycle count
version of the arkos player code, the cycle accuracy is very important and
so frequently blocks are rearranged and delays inserted so that each block
is a multiple of 31 cycles including the 12 cycles of tape code.
However, the most significant change stems from the fact that the tape code
needs the alternate registers and both IX and IY to be intact. As the tape
code is called very frequently, it is obviously impractical to save and restore
these registers on entry to the tape code, so instead the playback code has
been rewritten to only use AF,BC,DE,HL. For the pattern playback, this was
fairly simple, however the instrument playback code made heavy use of all
registers.
Instead, the playback code has been largely rearchitected. HL points to a
control block that describes the channel, including current pointers to
track, instrument data and state. Finally, by including the addresses of the
output registers in this table, we can do away with the register reordering
that is present in the original.
Another benefit of this version of the code is that a single call handles
all the pattern maintenance for that channel, and a single call for all the
instrumnet maintenance. This is in stark contrast to the large block of copy
pasted code from the original player, but ultimately reduces the size of the
code.
One difference to be aware of if you plan on re-using this code is that the
channels are processed in the order 1,2,3 but 3,2,1 in the original. This
means that when multiple channels use hardware envelopes the highest number
takes precedence, in the original the lowest number takes precedence. The
reason for this is simply that one of the tracks in the demo has the melody
in channel 3 but long running bass in channel 2.
Contact
-------
If you have any questions about this, feel free to email doz:
baud@ranulf.net
Sources available at:
https://github.com/ralferoo/breaking-baud
More CRTC releases at:
http://crtc.tv/