# publicsteveWang/Notes


## CS 150: Digital Design (August 23, 2012)

30 lab stations. Initially individual, but later you will partner up. Can admit up to 60 people, which is the limit; waitlist plus enrolled is a little over that. Lab lecture that many people miss: Friday 2-3. There are specific lab sections (that you're in). Go to your assigned section at least for the first several weeks.

This is a lab-intensive course. You will get to know 125 Cory very well. Food and drink on round tables only. Very reasonable policy.

As mentioned: the first few labs are individual; after that you'll pair up for the projects. The right strategy is to work really hard on the first few labs so you look good and get a good partner.

The book is Harris & Harris, Digital Design and Computer Architecture.

Reading: H&H (skim chapter 1, read section 5.6.2 on FPGAs); start to look at Ch. 5 of the Virtex User's Guide.

H&H Ch. 2 is on combinational circuits. Assuming you took 61C, we're not doing proofs of equivalence, etc.

Ch. 3 is sequential logic. Combinational logic is history-agnostic; sequential logic allows us to store state (a dynamical, time-variant system).

With memory and a NAND gate, you can make everything.
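That universality claim is easy to check for the combinational half: here is a short Python sketch (illustrative, not course material) building NOT, AND, and OR out of NAND alone.

```python
def nand(a, b):
    return 1 - (a & b)

def not_(a):
    return nand(a, a)            # NAND with both inputs tied together

def and_(a, b):
    return not_(nand(a, b))      # invert NAND to recover AND

def or_(a, b):
    return nand(not_(a), not_(b))  # de Morgan: a | b = NAND(~a, ~b)

# check against Python's own operators for every input pair
for a in (0, 1):
    for b in (0, 1):
        assert not_(a) == 1 - a
        assert and_(a, b) == (a & b)
        assert or_(a, b) == (a | b)
```

Any boolean function can then be built from these; memory (feedback) supplies the sequential half.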

Chapter 4 is HDLs. Probably good to flip through for now. We're going to use Verilog this semester. The book gives comparisons between Verilog and VHDL.

In the first lab next week, you will be writing simple Verilog code to implement simple boolean functions. Ch. 5 is building blocks like ALUs. Ch. 6 is architecture. Ch. 7 is microarchitecture: why it works, and how you make pipelined processors. You may find there's code in there that's actually useful for the final project. Chapter 8 is on memory.

Would suggest that you read the book sooner rather than later. You can sit down in the first couple of weeks and read the entire thing through.

Lecture notes: will be using the whiteboard. If you want lecture notes, go to the web; there are tons of resources out there. If there's something particular about what Kris says, use Piazza. You've probably used it several times by now, so it's not an issue.

Cheating vs. collaboration: there's a link on the website that points to Kris's version of a cheating policy.

Grading! There will be homeworks, and there will probably be homework quizzes (a handful), so probably 10% + 5%. There will be at least one midterm (possibly two), so that's like 15%. Labs and the project are like 10% and 30%, and the final is 30%.

A couple of things to note: lab is very important, because this is a lab course. If you take the final, the midterm, and the quizzes into account, that's 50% of your grade.

Lab lecture is in this room (306 Soda, F 2-3p). Will probably have five weeks of lab lecture. Section is 3-4, starting tomorrow.

Office hours to be posted as soon as the website is up -- hopefully by tomorrow morning.

King Silicon

FinFETs (from Berkeley, in use by Intel). What can you do with 22nm tech? Logic? You get something more than $10^6$ digital gates per $mm^2$. SRAM, you get something like $10 Mb/mm^2$; flash and DRAM, you get something like $10 MB/mm^2$. You want to put your MIPS processor on there, or a 32-bit ARM Cortex? A small but efficient machine? On the order of $10^5$ gates, so about $0.1 mm^2$. You don't need a whole lot of RAM and flash for your program: maybe a megabit of RAM and a megabyte of flash, and that adds up to $0.3 mm^2$. Even taking into account the cost of packaging and testing, you're making a chip for a few pennies.
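As a sanity check on the arithmetic above, here is a quick Python sketch using the order-of-magnitude densities just quoted (the numbers are the lecture's rough figures, not datasheet values):

```python
gates_per_mm2 = 1e6          # ~10^6 logic gates per mm^2 at 22 nm
sram_bits_per_mm2 = 10e6     # ~10 Mb/mm^2 of SRAM
flash_bytes_per_mm2 = 10e6   # ~10 MB/mm^2 of flash

cpu_area = 1e5 / gates_per_mm2          # ~10^5-gate processor core
sram_area = 1e6 / sram_bits_per_mm2     # one megabit of RAM
flash_area = 1e6 / flash_bytes_per_mm2  # one megabyte of flash

total = cpu_area + sram_area + flash_area
print(f"{total:.1f} mm^2")   # prints 0.3 mm^2
```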

Think of the cell phone: a processor surrounded by a whole bunch of peripherals: I/O devices (speakers, screens, LEDs, buzzer; microphone, keypad, buttons, N-megapixel camera, 3-axis accelerometer, 3-axis gyroscope, 3-axis magnetometer, touchscreen, etc.) and networking devices (cell, WiFi, Bluetooth, etc.). The cool thing here is that you can get all of these sensors in a little chip package. One thing, though: a microprocessor in general will not want a direct interface. A whole cloud of "glue logic" frees the processor from having to deal with the idiosyncrasies of these things. There are lots of different interfaces that you have to deal with. Another way of looking at this: the microprocessor is at the core, with glue logic around the outside of that, which talks to the analog circuitry, which talks to the actual transducers (something has to do conversions between different energy domains).

Another way of looking at this is that you have this narrow waist of microprocessors, which connects all of this stuff. The real reason we do this is to get up to software. One goal of this class is to make you understand the tradeoffs between HW and SW. HW is faster, lower power, cheaper, and more accurate (along several axes, e.g. timing); SW is more flexible. If we knew everything people wanted to do, we'd put it in HW. Everything is better in HW, except that once you put it in HW, it's fixed. In general, you've got a bunch more people working in SW than in HW. This class is nice in that it connects these two worlds.

If you can cross that bridge and understand how to recognize a software problem at the hardware design stage, or solve a HW bug through software, you can be the magician.

What we're going to do this semester in the project is similar to previous projects in that we'll have a MIPS processor. It looks like a 3-stage pipeline design, and we'll have you do a bunch of hardware interfaces (going across the HW/SW boundary, not into analog, obviously). We have, for example, video; we might do audio out. We'll do a keyboard for sure, and a timer. All of this will end up being memory-mapped I/O so that software on the processor can get to it, and it'll also have interrupts (exceptions). There are not that many people who understand interrupts and can design that interface well so that SW people are happy with it.

You will not be wiring up chips on breadboards (or protoboards), the way we used to in this class. You'll be writing in Verilog; you'll basically be using a text editor to write Verilog, which is an HDL. There are a couple of forms: one is structural, where you actually specify the nodes. In the first lab you'll do that, but afterwards, you'll be working at a behavioral level. You'll let the synthesis engine figure out how to take that high-level description and turn it into the right stuff in the target technology. It has to do a mapping function, and eventually a place & route.

That's a lot of logic. There's a whole bunch of underlying technologies you might map to, and in the end, you might go to an IC where there's some cell library that the particular foundry gives you. That's very different from mapping to an FPGA, which is what you'll be doing this semester. It's the job of the synthesis tool to turn text into the right set of circuits, which go into simulation engine(s); that lets you go around one of these loops in minutes for small designs, hours for larger designs, and iterate on your design and make sure it's right.

Some of you have used LTspice and are used to drawing a schematic, and that's another way to get into this kind of system as well, but that's structural. A big part of this course, unfortunately, is learning and understanding how these tools work: how to go through the simulation and synthesis process.

The better you get at navigating the software, the better a digital designer you'll be. Painful truth of it all. The reality is that this is exactly the way it works in industry; it's the nature of the IC CAD world, something like a \$10B/yr industry. A whole lot better than plugging into a board.

FPGA board: fast and cheap upfront, but expensive per part. The other end of the spectrum is to go with an IC or application-specific integrated circuit (ASIC), which is slow and costly upfront, but cheap per unit. Something in between: use an FPGA plus commercial off-the-shelf chips and a custom PCB. Still expensive per part (less so), but it's pretty fast.

FPGA? Field-programmable gate array. The core of an FPGA is the configurable logic block (CLB). The whole idea behind a CLB is that you have the data plane, where you have a bunch of digital inputs to the box and some number of outputs, and there's a separate control plane (configuration) that's often loaded from a ROM or flash chip externally when the chip boots up. Depending on what you put in, it can look different. Fast, since it's implemented at the HW level.

If you take a bunch of CLBs and put them in an array, and you put a bunch of configurable wiring through that array, that is configurable wiring. If we made this chip and went through the process of turning it into a single chip, everything we put into it would be less than a square millimeter of 22nm silicon.

We'll talk about how FPGAs make it easy to connect external devices to a microprocessor.

Course material: we can start from the systems perspective: systems are composed of datapath and control (FSM). Or we can start from the very bottom: transistors compose gates, which get turned into registers and combinational logic, which gets turned into state and next-state logic, which make up the control. Also, storage and math/ALU (from registers and combinational logic) make up the datapath.
## CS 150 Lab Lecture 0 (August 24, 2012)

Note: please make sure to finish labs 2 and 4, since those will be going into your final project. Labs will run the first 6 weeks, after which we will be starting the final project. Large design checkpoint; group sizes < 3.

## CS 150: Digital Design & Computer Architecture (August 28, 2012)

Admin. Lab: cardkey, working on that. Not a problem yet. Labs: T 5:30-8:30, W 5-8, θ 5-8. Discussion section: looking for a room. Office hours online: θ 11-12, F 10:30-11:30, in 512 Cory.

Reading: Ch. 4 (HDL). This week: through 4.3 (the section that talks about structural Verilog). For next week: the rest. It is Verilog only; we're not going to need VHDL. If you get an interview question in VHDL, answer it in Verilog.

Taxonomy: you've got HDLs, and there are really two flavors: VHDL and Verilog. Inside of Verilog you've got structural (from gates) and behavioral (say output is equal to a & b).

### Abstraction

Real signals are continuous in time and magnitude. Before you do anything with signals these days, you'll discretize in time and magnitude -- generally close enough to continuous for practical purposes. If it's a serial line, there's some dividing line in the middle, and the HW has to make a decision. A regular time interval is called CLK; two values of magnitude is called binary.

### Hierarchy

Compose bigger blocks from smaller blocks. The principle of reuse -- modularity based on abstraction is how we get things done (Liskov). Reuse tested modules. A very important design habit to get into. Both partners work on and define the interface specification. Layering. Expose inputs, outputs, and behavior. Define the spec, then divide labor. One partner implements the module, one partner implements the test harness.

Regularity: because transistors are cheap and design time is expensive, sometimes you build smaller (simpler) blocks out of tested bigger blocks. These are the key pieces of what we want to do with our digital abstraction.

Abstraction is not reality. Simulation: the Intel FDIV bug in the original Pentium.
Voltage sag because of relatively high wire resistance.

Lab 0: our abstraction is structural Verilog. There are tons of online tutorials on Verilog; Ch. 4.3 in H&H is a good reference on that; your TAs are a good reference. Pister's not a good reference on the syntax. You're allowed to drop a small number of different components on your circuit board and wire them up. If you want to make some circuit, you can. Powers of two!

FDIV bug: people in the EE dept at UCLA were outraged that Intel had not done exhaustive testing. Note: $1 \text{yr} \approx \pi \cdot 10^7 s$. With this approximation, $\pi$ and $2$ are about the same. A combinatorial problem.

Combinational logic vs. sequential logic. Combinational logic: outputs are a function of current inputs, potentially after some delay (memoryless), versus sequential, where output can be a function of previous inputs. Combinational circuits have no loops (no feedback), whereas circuits with memory have feedback. Classic: the SR latch (two NOR gates hooked to each other).

So let's look at the high-level top-down big picture that we drew before: system design comes from a combination of datapath and control (FSM). On the midterm (on every midterm Pister's given for this course), there's going to be a problem about SRAM, and you're going to have to design a simple system with that SRAM. E.g.: given a 64k x 16 SRAM, design a HW solution to find the min and max two's-complement numbers in that SRAM.

Things you need to know about transistors for this class: you already know them. Wired OR (could be a wired AND, depending on how you look at it). Open drain or open collector: this sort of thing.

Zero static power: the CMOS inverter. No longer true; power per gate is going down, but the number of gates goes up. Leakage current ends up being on the order of an amp. Also, increasingly, gates leak. Switching current: charging and discharging capacitors, $\alpha C V^2 f$. Crowbar current $I_{CC}$: while the voltage is swinging from min to max or vice versa, this current exists.
All of these things come together to limit the performance of the microprocessor.

A minterm is a product containing every input variable or its complement, and a maxterm is a sum containing every input variable or its complement.

## CS 150: Digital Design & Computer Architecture (August 30, 2012)

Introduction: finite state machines, namely in Verilog. If we have time, canonical forms and more going from transistor to inverter to flip-flop. So. The idea with lab 1 is that you're going to be making a digital lock. The real idea is that you're going to be learning behavioral Verilog.

### Finite State Machines

Finite state machines are sequential circuits (as opposed to combinational circuits), so they may depend on previous inputs. What we're interested in are synchronous (clocked) sequential circuits. In a synchronous circuit, the additional restriction is that you only care about in/out values on the (almost always) positive-going edge of the clock.

A drawing with a caret on it refers to a circuit sensitive to a positive clock edge. A bubble corresponds to the negative edge.

If we have a clock, some input D, and output Q, we have our standard positive edge-triggered D flip-flop. The way we draw an unknown value, we draw both values.

A register is one or more D flip-flops with a shared clock.

Blocking vs. non-blocking assignments.

So. We have three parts to a Moore machine: state, output logic, and next-state logic. A Mealy machine is not very different.

### Canonical forms

Minterms and maxterms. The truth table is the most obvious way of writing down a canonical form. And then there's minterm expansion and maxterm expansion. Both are popular and useful for different reasons. A minterm is a product term containing every input variable, while a maxterm is a sum term containing every input variable. Consider a minterm as a way of specifying the ones in the truth table. The construction looks like disjunctive normal form.

Maxterms are just the opposite: you're trying to knock out rows of the truth table. If you've got some function that's mostly ones, you have to write a bunch of minterms to get it, as opposed to a handful of maxterms. The construction looks like conjunctive normal form. Both the maxterm and minterm expansions are unique.

de Morgan's law: "bubble-pushing".

## CS 150: Digital Design & Computer Architecture (September 4, 2012)

FSM design: problem statement, block diagram (inputs and outputs), state transition diagram (bubbles and arcs), state assignment, state transition function, output function.

Classic example: string recognizer. Output a 1 every time you see a one followed by two zeroes on the input.

When talking about systems, there's typically a datapath and the FSM controller, and you've got stuff going between the two (and the outside world interacts with the control). Just go through the steps.

### Low-level stuff

Transistor turns into inverter, which turns into inverter with enable, which turns into D flip-flop. Last time: the standard CMOS inverter. If you want to put an enable on it, there are several ways to do that: stick it into an NMOS transistor, e.g. When enable is low, the output is Z (high impedance) -- it's not trying to drag the output anywhere.

It turns out (beyond the scope of this class) that NMOS transistors are good at pulling things down, but not so much at pulling things up. It turns out you really want to add a PMOS transistor to pull up. We want this transistor to be on when enable is 1, but it turns on when the gate is low, so we stick an inverter on enable. This is common; it's called a pass gate (butterfly gate). Pass gates are useful, but they're not actually driving anything; they just allow current to flow through. If you put too many in series, though, things slow down.

Pass gates as controlled inverters can be used to create a mux.

SR (set/reset) latch. Requires a NOR gate. The useful thing about NOR and NAND is that with the right constant input, they can make inverters. That is why they are useful in making latches (if we cross-couple two of them).
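The minterm/maxterm machinery described earlier is mechanical enough to sketch in a few lines of Python (a model for illustration; the primed-literal string notation is just one common convention):

```python
from itertools import product

def canonical_forms(f, names):
    """Return (minterm expansion, maxterm expansion) of a boolean function."""
    minterms, maxterms = [], []
    for bits in product((0, 1), repeat=len(names)):
        lits_and = [v if b else v + "'" for v, b in zip(names, bits)]
        lits_or = [v + "'" if b else v for v, b in zip(names, bits)]
        if f(*bits):
            minterms.append("".join(lits_and))              # picks out the 1-rows
        else:
            maxterms.append("(" + "+".join(lits_or) + ")")  # knocks out the 0-rows
    return " + ".join(minterms), "".join(maxterms)

# XOR has two 1-rows (minterms) and two 0-rows (maxterms)
sop, pos = canonical_forms(lambda a, b: a ^ b, ["a", "b"])
print(sop)  # a'b + ab'
print(pos)  # (a+b)(a'+b')
```

Note how a mostly-ones function yields many minterms but few maxterms, exactly as the lecture says.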
If S = R = 0, then the NOR gates turn into inverters, and this thing effectively turns into a bistable storage element. If I feed in a 1, it'll force the output to be 0, which forces the original gate's input to be a 1.

Clocked systems. Suppose we take our SR latch and put an AND gate in front of S and R with an enable line on it; we can now turn off this functionality, and when enable is low, S and R can do whatever they want; they're not going to affect the outputs of this thing. You can design synchronous digital systems using simple level-sensitive latches.

Contrast with the ring oscillator (3-stage is the simplest). That is unstable -- if I put an odd number of inverters in series, there is no stable configuration. Very useful for generating a clock. The standard crystal oscillator is the Pierce configuration. An odd number of stages is unstable, two stages are stable, and with more stages you have to worry about other things.

Latches can be clocked, but you have to be careful. For example: if I wanted to design a 1-bit counter with a clocked system, we can consider a level-sensitive D latch. This is what happens when you get a latch in Verilog: otherwise, the synthesis tool will have it keep its previous value. If you do that, it turns out that probably gives you enough delay that when the clock is high, the output is 1; it'll probably oscillate. So that's bad; maybe we'll make the enable line (the clock line) really narrow. Not surprisingly, that's called narrow clocking. For simple systems, you can get away with that: make the delay similar to a single gate's or a few gates' delay. However, it's ugly; don't do that. Back in the day, people did this, and systems were simple enough that they could get away with it.

What I really want is my state register and its output going through some combinational logic block with some input to set my next state, and only once per clock period does this thing happen. The problem here is that in a single clock period, I get a couple of iterations through the loop.
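A quick way to convince yourself of the bistable behavior described above is to iterate the cross-coupled NOR pair until it settles (a behavioral Python model, not a circuit simulation; real latches settle in continuous time):

```python
def nor(a, b):
    return 1 - (a | b)

def sr_latch(s, r, q, qbar, steps=4):
    """Iterate the cross-coupled NOR pair a few times until it settles."""
    for _ in range(steps):
        q, qbar = nor(r, qbar), nor(s, q)
    return q, qbar

# set: S=1 forces Q high
q, qbar = sr_latch(1, 0, 0, 1)
assert (q, qbar) == (1, 0)

# hold with S = R = 0: the NORs act as inverters, state is retained
q, qbar = sr_latch(0, 0, q, qbar)
assert (q, qbar) == (1, 0)

# reset: R=1 forces Q low
q, qbar = sr_latch(0, 1, q, qbar)
assert (q, qbar) == (0, 1)
```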
So how do I take my level-sensitive latch (I've turned it into a D latch with enable, and that's my clock)? When the clock is low, there's no problem; I don't worry that my input's going to cruise through this thing. And when it's high, I want my input (the D input) to remain constant. As long as the clock is high, I don't care; it'll maintain its state, since I'm not looking at those inputs. There are a whole bunch of ways you can do it (all of which get used somewhere), but the safest (and probably most common) is to stick another latch (another clocked level-sensitive latch) in front of it, clocked through an inverter. That's now my input.

So when the clock is low, the first one is enabled, and it's transparent (it's a buffer). This is called an edge-triggered master/slave D flip-flop.

The modern way of implementing the basic D latch is by using feedback for the storage element, and an input (both the feedback and the input are driven by out-of-phase enables). My front end (the master) is driving the signal line when the clock is low; conversely, when the clock is high, the feedback inverter will be driving the line: a bistable storage element maintaining its state, with the input disconnected.

Now, with the slave, same picture, but sensitive when the clock is high, as opposed to the master, which is sensitive when the clock is low. The idea is that the slave prevents anything from getting into the storage element until it stabilizes. At the end of the day, the rising edge of the clock latched D to Q. Variation that happened after doesn't propagate to the master; variation that happened before, the slave wasn't listening to. So now we have flip-flops and can make FSMs.

## CS 150: Digital Design & Computer Architecture (September 6, 2012)

### Verilog synthesis vs. testbenches (H&H 4.8)

There's the subset of the language that's actually synthesizable, and then there's more stuff just for the purpose of simulation and testing. It's way easier to debug via simulation.

Constructs that don't synthesize:

* `#t`: used for adding a simulation delay.
* `===`, `!==`: 4-state comparisons (0, 1, X, Z).
* System tasks (e.g. `$display`, which prints to the console in a C printf-style format that's pretty easy to figure out, and `$monitor`, which prints to the console whenever its arguments change).

In industry, it's not at all uncommon to write the spec, write the testbench, then implement the module. Once the testbench is written, it becomes the real spec. "You no longer have bugs in your code; you only have bugs in your specification."

How do we build a clock in Verilog?

```verilog
parameter halfT = 5;
reg CLK;
initial CLK = 0;
always begin
    #(halfT) CLK = ~CLK;
end
```

H&H example 4.39 shows you how to read from a file:

```verilog
silly_function dut(.a(a), .b(b), .c(c), .s(s));

reg [3:0] testvect [10000:0];
initial $readmemb("test.tv", testvect);

// done when input is 4 bits of X (don't care)
always @(posedge CLK)
    #1 {a, b, c, out_exp} = testvect[num];

always @(negedge CLK) begin
    if (s !== out_exp) $display("error ...");
    num <= num + 1;
end
```

How big can you make shift registers? At some point, IBM decreed that every register on every IBM chip would be part of one gigantic shift register. So you've got your register file feeding your ALU; it's a 32 x 32 register file. There's a test signal; when it's high, the entire thing becomes one shift register. Why? Testing. This became the basis of JTAG. Another thing: dynamic fault imaging. Take a chip and run it inside a scanning electron microscope, which detects backscatter from electrons. It turns out that a metal absorbs depending on what voltage it's at, and oxides absorb depending on the voltage of the metal beneath them. So you get a different intensity depending on the voltage.

We can also take these pass gates and make variable interconnects. So if I've got two wires that don't touch, I can put a pass gate on there and call that the connect input.

Last time we talked about MUXes. I can make a configurable MUX -- we did a two-to-one mux, and if I've got some input over here, I select according to what I have as my select input.
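The configurable-MUX idea above is essentially what a LUT is: store a truth table in configuration bits, and let the logic inputs act as the mux select lines. A minimal Python model (illustrative only; real FPGA slices add carry chains and other hardware):

```python
class LUT:
    """An n-input lookup table: the truth table is the configuration,
    and the inputs are just the mux select lines."""
    def __init__(self, n, truth_bits):
        assert len(truth_bits) == 2 ** n
        self.bits = truth_bits          # contents of the config shift register

    def __call__(self, *inputs):
        index = 0
        for b in inputs:                # inputs select one stored bit
            index = (index << 1) | b
        return self.bits[index]

# configure a 2-LUT as XOR: rows 00, 01, 10, 11 map to 0, 1, 1, 0
xor = LUT(2, [0, 1, 1, 0])
assert [xor(a, b) for a in (0, 1) for b in (0, 1)] == [0, 1, 1, 0]
```

Loading a different bit pattern gives a different function of the same inputs, with no rewiring, which is the whole trick behind configurable logic.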
Next time: more MIPS, memory.

## CS 150: Digital Design & Computer Architecture (September 11, 2012)

Changed office hours this week. CLBs, SAR, K-maps.

Last time: we went from transistors to inverters with enable, to D flip-flops, to a shift register with some inputs and outputs, and from there to the idea that once you have that shift register, you can hook it up with an n-input mux and make an arbitrary function of n variables.

This gives me configurable logic and configurable interconnects, and naturally I take the shift out of one and into another, and I've got an FPGA.

The LUT is the basic building block: I get four of those per slice, and some other stuff: fast ripple-carry logic for adders, and the ability to take two 6-LUTs and put them together to form a 7-LUT. So: pretty flexible, a fair amount of logic, and that's a slice. One CLB is equal to two slices. And I've got, what is that, 8640 CLBs per chip. Also: DSP48 blocks -- 64 of these, and each one is a little 48-bit configurable ALU. So that gives you something like 70000 (6-LUTs + D-ffs). So that's what you've got, what we work with.

Now, let's talk about a successive approximation register (SAR) analog-to-digital converter. A very popular device. There's a link to a chip that implemented the digital piece of this thirty years ago. Why are we looking at this? It's nice; it's an example of a "mixed-signal" system (i.e. analog mixed with digital in the same block, where you have a good amount of both).

It turns out that analog designers need to be good digital designers these days. I was doing some consulting for a company recently. They had brilliant analog designers, but they had to do some digital blocks on their chips.

"Real world" interfaces. It has some number of output bits that go into the DAC; the DAC's output is simply "linear". You trust the analog designer to give you this piece, and the digital comparator and sample-and-hold circuit, with the sample input, and here's your analog input voltage.
So real quick we'll look at what's in those blocks (even though this isn't 150 material). S/H: the simplest example is just a transistor. Maybe it's a butterfly gate; typically, there's some storage capacitor on the outside, so if you've got your input voltage, when the gate goes low, that voltage is held on there. Maybe there's some buffer amplifier (a little CMOS opamp so it can drive nice loads); it's a capacitive input, so the signal will stay there for a long time. Not 150 material.

The DAC: a simple way of making this is to generate a reference voltage (diode-connected PMOS with voltage division, say), which you mirror, tied together with a switch, where all of these share the same gate. The comparator is a little more subtle; maybe when we talk about SRAMs and DRAMs.

Anyway. So. Now we have the ability to generate an analog voltage under digital control. We sample that input and are going to use that signal. This tells us whether the output of the DAC is too big. All of that together is called a SAR.

So what does that thing do? There's a very simple (dumb) SAR: a counter. From reset, its digital output increases linearly in time; at some point it crosses the analog $V_{in}$, and at that point, you stop. But that's not such a great thing to do: it takes between 1 and 1024 cycles to get the result. The better way is to do a binary search. Fun to do with dictionaries and kids. It also works here. FSM: go bit-by-bit, starting with the most significant bit.

A better solution (instead of using an oversized tree -- better in the sense of less logic required): use a shift register (and compute all bits sequentially). Or a counter going into a decoder, sixteen outputs of which I only need 10.

Next piece: another common challenge, and where a lot of mistakes get made: analog stuff does not simulate as well. While you're developing and debugging, you have to come up with some way of simulating it. Good news: you can often go in and fix things like that.
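The bit-by-bit binary search the SAR FSM performs can be sketched in Python (an idealized model: a perfectly linear DAC and an ideal comparator are assumed, with `vref` as the assumed full-scale reference):

```python
def sar_adc(vin, vref=1.0, nbits=10):
    """Successive approximation: settle one bit per cycle, MSB first."""
    code = 0
    for bit in range(nbits - 1, -1, -1):
        trial = code | (1 << bit)            # tentatively set this bit
        vdac = trial * vref / (1 << nbits)   # ideal DAC output for trial code
        if vdac <= vin:                      # comparator: DAC not too big?
            code = trial                     # keep the bit
    return code

# 10 cycles, versus up to 1024 steps for the dumb counter SAR
assert sar_adc(0.5) == 512
assert sar_adc(0.25) == 256
assert sar_adc(0.0) == 0
```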
Sort of an aside (although it sometimes shows up in my exams): once you put these transistors down, you've got all these layers of metal on top. It turns out that you can actually put this thing in a scanning electron microscope, use undedicated logic, and go in with a FIB (focused ion beam) and fix problems. "Metal spin".

Back to chapter two: basic gates again. de Morgan's law: $\overline{AB} = \bar{A} + \bar{B}$; $\overline{\prod{A_i}} = \sum \bar{A_i}$. Similarly, $\overline{\sum{A_i}} = \prod \bar{A_i}$. Suppose you have a two-level NAND/NAND gate: that becomes a sum of products (SoP). Similarly, NOR/NOR is equivalent to a product of sums (PoS).

Now, if I do NOR/NOR/INV, this is a sum of products, but with the inputs inverted. This is an important one. This particular one is useful because of the way you can design logic. The way we used to design logic a few decades ago (and the way we might go back to in the future) was with big long strings of NOR gates.

So if I go back to our picture of a common-source amplifier (erm, inverter), and we stick a bunch of other transistors in parallel, then we have a NOR gate. Remember: MOS devices have parasitic capacitance.

Consider another configuration. Suppose we invert our initial input and connect both of these to a circled wire, which can be any of the following: fuse / anti-fuse; mask-programmable (when you make the mask, a decision to add a contact); transistor with flip-flop (part of a shift register, e.g.); an extra gate (double-gate transistors).

So now if I chain a bunch more of these together (all NOR'd together), then I can program many functions. In particular, it could just be an inverter. I can put a bunch of these together, and I can combine the function outputs with another set of NORs, invert it all at the end, and I end up with NOR/NOR/INV.

These are called PLAs (programmable logic arrays), and you can still buy them, and they're still useful. It's not uncommon to have a couple of flip-flops on them.
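A toy behavioral model of the programmable two-level logic just described, viewed as the equivalent AND/OR planes of the NOR/NOR/INV array (the row/mask encoding here is invented purely for illustration):

```python
def pla_eval(inputs, product_rows, or_mask):
    """Tiny PLA model. Each product row lists, per input: 1 (true literal),
    0 (complemented literal), or None (literal not used in this term).
    or_mask picks which product terms feed the OR plane."""
    terms = [all(x == want for x, want in zip(inputs, row) if want is not None)
             for row in product_rows]
    return int(any(t for t, use in zip(terms, or_mask) if use))

# program XOR as a sum of products: rows a'b and ab'
rows = [(0, 1), (1, 0)]
table = [pla_eval((a, b), rows, (1, 1)) for a in (0, 1) for b in (0, 1)]
assert table == [0, 1, 1, 0]
```

Reprogramming `rows` and `or_mask` changes the function, just as blowing fuses or loading the configuration shift register would in a real PLA.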
Will have a homework assignment where you use a 30-cent PLA and design something. Quick and dirty way of getting something for cheap. Not done anymore because it's slow (huge capacitances), but it may come back because of carbon nanotubes. Javey managed to make a nanotube transistor with a source, drain, and gate, and he got transport, the highest current density per square micron of cross-section ever, and showed the physics all worked, and this thing is 1 nm around. What Prof. Ali Javey's doing now is working with nanowires and showing that you can grow these things on a roller and roll them onto a surface (like a plastic surface), putting down layers of nanowires in alternating directions.

You can imagine (we're a ways away from this) that you get a transistor at each of these locations, and you've got some CMOS on this side generating signals, and CMOS on the output side taking the output (made with big fat gigantic 14nm transistors), and you can put $10^5$ transistors per square micron (not pushing it, since density can get up to $10^6$). The end of the road for CMOS doesn't mean you ditch CMOS. Imagine making this into a jungle gym; then you're talking about $10^8$ carbon nanotubes per cubic micron, etc. The fact that we can make these long thin transistors on their lengths means that this might come back into fashion.

CS 150: Digital Design & Computer Architecture

September 13, 2012

Questions of the form: given a 16x16 SRAM, design a circuit that will find the smallest positive integer, or the biggest even number, or count the number of times 17 appears in memory, etc. Kris loves these questions where you figure out the design (remember: separate datapath and control, and come up with it on your own) -- they will probably show up on both the midterm and the final. Office hours moved.

So... last time, we were talking about PLAs (programmable logic arrays) and such (NOR/NOR equivalent to AND/OR). You'll hear people talking about the AND plane and the OR plane, even though they're both NORs.
If you look at Fig 2.2.3, they'll show the same regular and inverted signals, and they just draw this as a line with an AND gate at the end. Pretty common way to draw this; likewise lines with OR gates.

A variant of the PLA called a PAL: subsets of "product" terms going to "OR" gates. The beginning of complex programmable logic devices (CPLDs, FPGAs). You can still buy these registered PALs. Why would you use this over a microprocessor? Faster. Niche. The "oh crap" moment when you finish your board and you find that you left something out.

I want to say a little about memory, because you'll be using block RAM in your lab next week. There's a ton of different variations of memory, but they all have a couple of things in common: a decoder (address decoder) where you take $n$ input bits and turn them into $2^n$ word lines in a memory that has $2^n$ words. You also have a cell array. Going through the cell array you have some number of bit lines; we'll call this either $k$ or $2k$, depending on the memory. That goes into some amps / drivers, and then out the other side you have $k$ inputs and/or outputs. Sometimes shared (depends whether or not there's output-enable). Write-enable, output-enable, sometimes a clock, sometimes d-in as well as d-out, sometimes multiple d-outs (multiple address-data pairs); a whole bunch of variation in how this happens. Conceptually, though, it all comes down to something that looks like this.

So what's that decoder look like? Decoders are very popular circuits: they generate all minterms of their input (gigantic products). Note that if you invert all of the outputs, we get the maxterms (sums).

That was DRAM. Now, SRAM: still a word line going across; now I have a bit line and a negated bit line. Inside, I have two cross-coupled inverters (a bistable storage element). Four transistors in there already (vs. 1 for DRAM), and I still have to access. An access transistor goes to each side, hooked up to the word line. When I read this thing, I put in an n-bit address, and the transistors pull the bit lines.
We want these as small as possible for the bit density. 6T; a sense amp is needed. You can imagine that what you usually do is pre-charge $BL$, $\bar{BL}$. As soon as you raise the word line for this particular row, what you find is that one of them starts discharging, and the other is constant. Analog sensing is present so you can make a decision much, much faster.

That's how reads work; writes are interesting. Suppose I have some $D_{in}$; what do I do? I could put in an output-enable so that when writing, the cells don't send anything to the output, but that would increase size significantly. So what do I do? I just make big burly inverters and drive the lines. The big transistors down there overcome the small transistors up there, and they flip the bit. PMOS is also generally weaker than NMOS, etc. Just overpower it. One of the rare times that you have PMOS pulling up and NMOS pulling down. (Notion of "bigger": $W/L$.)

Transistors leak. They can leak a substantial amount. By lowering voltage, I reduce power. It turns out there's a nonlinear relationship here, and so the transistors leak a lot less.

So that's SRAM. The other question: what about a register? What's the difference between this and a register file? Comes back to what's in the cell array. We said that a register is a bunch of flipflops with a shared clock and maybe a shared enable. Think of a register as having the common word line, and you've got a D flipflop in there. There's some clock shared across the entire array, and there's an enable on it and possibly an output, depending on what kind of system you've got set up. We've got D-in, D-out, and if I'm selecting this thing, presumably I want output-enable; if I'm writing, I need to assert write-enable.

So. You clearly have the ability to make registers on chips, so you can clearly do this on the FPGA. Turns out there's some SRAMs on there, too. There's an external SRAM that we may end up using for the class project, and there's a whole bunch of DDR DRAM on there as well.
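The precharge-and-sense read described above can be sketched behaviorally (the 0.1 V differential swing and 1.0 V precharge are made-up numbers for illustration, not real device values):

```python
def sram_read(cell, v_pre=1.0, dv=0.1):
    """Differential SRAM read sketch: both bit lines are precharged,
    the selected cell pulls one of them down slightly, and the sense
    amp decides from the small difference long before a full swing."""
    bl, bl_bar = v_pre, v_pre          # precharged bit lines
    if cell:                           # stored 1: cell pulls BL_bar down
        bl_bar -= dv
    else:                              # stored 0: cell pulls BL down
        bl -= dv
    return 1 if bl > bl_bar else 0     # sense amp: sign of the difference

print(sram_read(1), sram_read(0))  # 1 0
```

The point of the sketch: the sense amp only needs the sign of a small analog difference, which is why reads can be fast even with tiny cells.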
Canonical forms

Truth tables, minterm / maxterm expansions. These we've seen. If you have a function equal to the sum of minterms 1, 3, 5, 6, 7, we could implement it with fewer gates by using the maxterm expansion. "Minimum sum of products," "minimum product of sums."

Karnaugh Maps

An easy way to reduce to a minimum sum of products or minimum product of sums (Section 2.7). Based on the combining theorem, which says that $XA + X\bar{A} = X$. Ideally, every row should have just a single value changing. So, I use Gray codes (e.g. 00, 01, 11, 10). Graphical representation!

CS 150: Digital Design & Computer Architecture

September 18, 2012

In lab this week you are learning about Chipscope. Chipscope is kinda like what it sounds: it allows you to monitor things happening in the FPGA. One of the interesting things about Chipscope is that since it's an FSM monitoring stuff in your FPGA, it also gets compiled down, and it changes the location of everything that goes into your chip. It can actually make your bug go away (e.g. timing bugs).

So. Counters. How do counters work? If I've got a 4-bit counter and I'm counting from 0, what's going on here? A D-ff with an inverter and an enable line? This is a T-ff (toggle flipflop). That'll get me my first bit, but my second bit is slower: $Q_1$ wants to toggle only when $Q_0$ is 1. Subsequent bits want to toggle when all lower bits are 1.

Counter with en: enable is tied to the toggle of the first bit. Counter with ld: four input bits, four output bits. Clock. Load. Then we're going to want to do a counter with ld, en, rst. Put in logic, etc.

Quite common: ripple carry out (RCO), where we AND $Q[3:0]$ and feed this into the enable of $T_4$.

Ring counter (shift register with one-hot out): if reset is low, I just shift this thing around and make a circular shift register. If high, I clear the out bit.

Mobius counter: just a ring counter with a feedback inverter in it. It's just going to take whatever state is in there, and after n clock ticks, it inverts itself.
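The ring and Mobius counters can be sketched as a circular shift register with an optional feedback inverter (a behavioral model of my own, not gate-level):

```python
def shift_step(state, mobius=False):
    """One clock tick of a circular shift register: the last bit feeds
    back to the front, inverted for the Mobius (Johnson) variant."""
    fb = state[-1] ^ 1 if mobius else state[-1]
    return [fb] + state[:-1]

# Ring counter: a one-hot pattern circulates, back home after n ticks.
s = [1, 0, 0, 0]
for _ in range(4):
    s = shift_step(s)
print(s)  # [1, 0, 0, 0]

# Mobius counter: after n ticks the state has inverted itself.
m = [0, 0, 0, 0]
for _ in range(4):
    m = shift_step(m, mobius=True)
print(m)  # [1, 1, 1, 1]
```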
So you have $n$ flipflops, and you get $2n$ states.

And then you've got LFSRs (linear feedback shift registers). Given $N$ flipflops, we know that a straight up- or down-counter will give us $2^N$ states. Turns out that an LFSR gives you almost that (everything but 0). So why do that instead of an up-counter? This can give you a PRNG. Fun times with Galois fields. Various uses, seeds, high enough periods (Mersenne twisters are higher).

RAM

Remember: decoder, cell array, $2^n$ rows, $2^n$ word lines, some number of bit lines coming out of that cell array for I/O, with output-enable and write-enable.

When output-enable is low, D goes to high-Z. At some point, some external device starts driving some Din (not from memory). Then I can apply a write pulse (write strobe), which causes our data to be written into the memory at this address location. Whatever was driving it releases, so it goes back to high-impedance, and if we turn output-enable on again, we'll see "Din" from the cell array.

During the write pulse, we need Din stable and the address stable. We have a pulse because we don't want to break things. Bad things happen.

Notice: no clock anywhere. Your FPGA (in particular, the block RAM on the ML505) is a little different in that it has registered inputs (addr & data). First off, it's very configurable. All sorts of ways you can set this up, etc. Addr in particular goes into a register, comes out of there, and then goes into a decoder before it goes into the cell array. What comes out of that cell array is a little bit different also: there's a data-in line that goes into a register, and some data-out as well that's separate and can be configured in a whole bunch of different ways so that you can do a bunch of different things.

The important thing is that you can apply your address to those inputs, and it doesn't show up until the rising edge of the clock. There's the option of having either registered or non-registered output (non-registered for this lab).

So now we've got an ALU and RAM.
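With an ALU (as comparator) and a RAM, the classic exam exercise -- find the largest unsigned int in memory -- reduces to an address counter, one max register, and a LOAD/HOLD decision each cycle. A behavioral sketch (a Python list stands in for the sync SRAM; the `LOAD`/`HOLD` naming follows the FSM convention used in lecture):

```python
def find_max(sram, width=16):
    """Datapath + FSM sketch: each cycle, read one word from the
    sync SRAM, compare against a running max register, and load
    the register when the new word is bigger."""
    max_reg = 0                        # datapath register
    for addr in range(len(sram)):      # address counter walks the SRAM
        word = sram[addr] & ((1 << width) - 1)
        load = word > max_reg          # comparator output drives LOAD/HOLD
        if load:
            max_reg = word             # LOAD state: capture the new max
        # else: HOLD state, register keeps its value
    return max_reg

print(find_max([7, 42, 3, 1000, 555]))  # 1000
```

In hardware, each loop iteration is one clock cycle (plus a cycle of read latency for the registered-input block RAM), and `load` is the one control signal the FSM produces.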
And so we can build some simple datapaths. For sure you're going to see on the final (and most likely the midterm) problems like "given a 16-bit ALU and a 1024x16 sync SRAM, design a system to find the largest unsigned int in the SRAM." Demonstration of clock cycles, etc. So what's our FSM look like? Either LOAD or HOLD. On the homework, it did not say sync SRAM. Will probably change.

CS 150: Digital Design & Computer Architecture

September 20, 2012

Non-overlapping clocks. n-phase means that you've got n different outputs, with at most one high at any time. Guaranteed dead time between when one goes low and the next goes high.

K-maps

Finding minimal sum-of-products and product-of-sums expressions for functions. On-set: all the ones of a function. Implicant: one or more circled ones in the on-set; a minterm is the smallest implicant you can have, and implicants go up by powers of two in the number of ones they contain. A prime implicant can't be combined with another (by circling); an essential prime implicant is a prime implicant that contains at least one 1 not in any other prime implicant. A cover is any collection of implicants that contains all of the ones in the on-set, and a minimal cover is one made up of essential prime implicants and the minimum number of other implicants.

Hazards vs. glitches. Glitches are when timing issues result in dips (or spikes) in the output; hazards are when they might happen. Completely irrelevant in synchronous logic.

Project

3-stage pipeline MIPS150 processor. Serial port, graphics accelerator. If we look at the datapath elements, the storage elements, you've got your program counter, your instruction memory, register file, and data memory. Figure 7.1 from the book. If you mix that in with Figure 8.28, which talks about MMIO: that data memory, there's an address and data bus that it is hooked up to, and if you want to talk to a serial port on a MIPS processor (or an ARM processor, or something like that), you don't address a particular port (not like x86).
Most ports are memory-mapped. You've actually got an MMIO module that is also hooked up to the address and data bus. For some range of addresses, it's the one that handles reads and writes. You've got a handful of different modules down here, such as a UART receive module and a UART transmit module. In your project, you'll have your personal computer, which has a serial port on it, and that will be hooked up to your project, which contains the MIPS150 processor. Somehow, you've got to be able to handle characters transmitted in each direction.

UART

Common ground, TX on one side connected to the RX port on the other side, and vice versa. A whole bunch more in different connectors. The basic protocol is called RS232, common (people often refer to it by connector name: DB9, rarely DB25); fortunately, we've moved away from this world and use USB. We'll talk about these other protocols later, some sync, some async. Workhorse for a long time, still all over the place.

You're going to build the UART receiver/transmitter and the MMIO module that interfaces them. See when something's coming in from software / hardware. Going to start out with polling; we will implement interrupts later on in the project (for timing and serial IO on the MIPS processor). That's really the hardcore place where software and hardware meet. People who understand how each interface works and how to use them optimally together are valuable and rare people.

In Lab 4, there are really two concepts: (1) how does serial / UART work, and (2) the ready / valid handshake.

On the MIPS side, you've got some addresses. Anything that starts with FFFF is part of the memory-mapped region. In particular, the first four are mapped to the UART: they are RX control, RX data, TX control, and TX data. When you want to send something out the UART, you write the byte -- there's just one bit for the control and one byte for data. Data goes into some FSM system, and you've got an RX shift register and a TX shift register.
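Behaviorally, the TX and RX shift registers' job looks like this (8 data bits with one start and one stop bit is assumed here for illustration; no parity):

```python
def uart_frame(byte):
    """Serialize one byte as a UART frame: start bit (0), eight
    data bits LSB-first, stop bit (1) -- the TX shift register
    sends these out one bit per baud tick."""
    bits = [0]                                   # start bit
    bits += [(byte >> i) & 1 for i in range(8)]  # data, LSB first
    bits += [1]                                  # stop bit
    return bits

def uart_receive(bits):
    """The RX side: check framing, then reassemble the byte."""
    assert bits[0] == 0 and bits[9] == 1, "framing error"
    return sum(b << i for i, b in enumerate(bits[1:9]))

frame = uart_frame(0x41)          # 'A'
print(frame)                      # [0, 1, 0, 0, 0, 0, 0, 1, 0, 1]
print(hex(uart_receive(frame)))   # 0x41
```

The idle line sits high, which is why a 0 start bit is what tells the receiver a frame has begun.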
There's one other piece of this, which is that inside of here, the thing interfacing to this IO-mapped module uses this ready bit. If you have two modules, a source and a sink (diagram from the document), the source has some data that it is sending out and tells the sink when the data is valid, and the sink tells the source when it is ready. And there's a shared "clock" (baud rate), and this is a synchronous interface.

• source presents data
• source raises valid
• when ready & valid on posedge clock, both sides know the transaction was successful

Whatever order this happens in, the source is responsible for making sure the data is valid.

HDLC? Takes bytes and puts them into packets, ACKs, etc.

Talk about quartz crystals, resonators. $\pi \cdot 10^7$.

So, before I let you go: parallel load, n bits in, serial out, etc.

UART, MIPS and Timing

September 25, 2012

Timing: motivation for the next lecture (pipelining). Lots of online resources (resources, period) on MIPS. You should have lived and breathed this thing during 61C. For sure, you've got your 61C lecture notes and CS 150 lecture notes (both from last semester). Also the green card (reference), and there's obviously the book. You should have tons of material on the MIPS processor out there.

So, from last time: we talked about the universal asynchronous receiver/transmitter. On your homework, I want you to draw a couple of boxes (control and datapath; they exchange signals). The datapath is mostly shift registers. You may be transmitting and receiving at the same time; one may be idle; any mix. Some serial IO lines go to some other system not synchronized with you. Talked about the clock and how much clock accuracy you need. For eight-bit data, you need a couple percent matching. In years past, we've used N64 game controllers as input for the project. All they had was an RC relaxation oscillator. Same format: start bit, two data bits, and stop bit. Data was sent Manchester-coded (0 -> 01; 1 -> 10).
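The Manchester coding just described (0 -> 01, 1 -> 10) in code form -- a sketch of my own, with each bit cell split into two half-bit "chips":

```python
def manchester_encode(bits):
    """0 -> 01, 1 -> 10: every bit cell has a mid-cell transition,
    which is what lets a sloppy RC clock stay locked to the data."""
    out = []
    for b in bits:
        out += [1, 0] if b else [0, 1]
    return out

def manchester_decode(chips):
    """Recover bits from chip pairs; 00 or 11 would be a line error."""
    bits = []
    for i in range(0, len(chips), 2):
        pair = chips[i], chips[i + 1]
        bits.append({(0, 1): 0, (1, 0): 1}[pair])  # KeyError = error
    return bits

data = [0, 1, 1, 0]
enc = manchester_encode(data)
print(enc)  # [0, 1, 1, 0, 1, 0, 0, 1]
print(manchester_decode(enc) == data)  # True
```

The guaranteed transition in the middle of every bit cell is the whole point: the receiver re-synchronizes on each bit, so even a crude oscillator stays close enough.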
In principle, I can have a 33% error, which is something I can do with an RC oscillator.

Also part of the datapath: 8-bit data going in and out. Whatever; this is going to be the MIPS interface. A set of memory-mapped addresses on the MIPS, so you can read/write the serial port. Also some ready/valid stuff up here. Parallel data to/from the MIPS datapath.

MIPS: invented by our own Dave Patterson and John Hennessy from Stanford. They started a company; Kris saw the business plan. It was confidential, but it's now probably safe to talk about. They started off and said they're going to end up getting venture capital, and the VCs are going to take equity, which is going to dilute their equity. Simple solution: don't take venture money. These guys had seen enough of this. By the time they're all done, it would be awesome if they each had 4% of the company. So they set things up so that they started at 4%. They were going to allocate 20% for all of the employees; series A was going to take half; series B, they'll give up a third; and C, 15%. An interesting bit about MIPS that you didn't learn in 61C.

One of the resources, the green sheet: once you've got this thing, you know a whole bunch about the processor. You know you've got a program counter over here, and you've got a register file in here, and how big it is. Obviously you've got an ALU and some data memory over here, and you know the instruction format. You don't explicitly know that you've got a separate instruction memory (that's a choice you get to make as an implementor); you don't know how many cycles it'll take (or whether it's pipelined, etc). People tend to have separate data and instruction memory for embedded systems, and locally, it looks like separate memories (even on more powerful systems).

We haven't talked yet about what a register file looks like inside. There's no absolute requirement about the register file, but it would be nice if your register file had two read addresses and one write address. We go from a D-ff, and we know that sticking an enable line on there lets us turn this into a D-ff with enable.
Then if I string 32 of these in parallel, I now have a register (clocked), with a write-enable on it.

Not going to talk about the ALU today: probably after the midterm.

So now, I've got a set of 32 registers. Considerations of cost. Costs on the order of a hundredth of a cent.

Now I've made my register file. How big is that logic? NAND gates to implement a 5-to-32 decoder. Asynchronous reads. At the rising edge of the clock, synchronous write.

So, now we get back to MIPS review. For the MIPS instructions, you've got R/I/J-type instructions. All start with the opcode (same length: 6 bits). A tiny fraction of all 32-bit instructions. More constraints as we get more stuff. If we then want to constrain this to be a single-cycle processor, then you end up with a pretty clear picture of what you want. The PC doesn't need 32 bits (the two LSBs are always 0); you can implement the PC with a counter.

The PC goes into instruction memory, and out comes my instruction. If, for example, we want to execute lw $s0, 12($s3), then we look at the green card, and it tells us the RTL.
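The register file just described -- two asynchronous read ports, one synchronous write port -- as a behavioral model (the hardwired-zero $0 register is standard MIPS behavior; the port names are my own):

```python
class RegFile:
    """32 x 32-bit register file: two asynchronous read ports,
    one write port that commits on the (simulated) rising edge."""
    def __init__(self):
        self.regs = [0] * 32

    def read(self, ra1, ra2):
        # asynchronous reads: outputs follow the read addresses directly
        return self.regs[ra1], self.regs[ra2]

    def clock_edge(self, we, wa, wd):
        # synchronous write, gated by write-enable; $0 stays 0 in MIPS
        if we and wa != 0:
            self.regs[wa] = wd & 0xFFFFFFFF

rf = RegFile()
rf.clock_edge(we=True, wa=16, wd=0xDEADBEEF)   # write $s0
rf.clock_edge(we=True, wa=0, wd=123)           # ignored: $0 is hardwired
print([hex(v) for v in rf.read(16, 0)])        # ['0xdeadbeef', '0x0']
```

In hardware, the two read ports are just two output muxes (or tri-state buses) on the same array of registers, and the 5-to-32 decoder gates which register's write-enable fires.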

Pipelining

September 27, 2012

Last time, I just mentioned in passing that we will always be reading 32-bit instruction words in this class, but ARM has both 32- and 16-bit instruction sets. MicroMIPS does the same thing.

Optimized for size rather than speed; it will run at 100 MHz (not very good compared to desktop microprocessors made in the same process, which run in the gigahertz range), but it burns 3 mW. $0.06 \text{mm}^2$. Questions about the power monitor -- you've got a chip that's somehow hanging off of the power plug and manages one way or the other to get a voltage and current signal. You know the voltage is going to look like a 155 V amplitude.

Serial! Your serial line: the thing I want you to play around with is the receiver. We give this to you in the lab, but the thing is, I want you to design the basic architecture.

Start, stop, some bits in between. You've got a counter on here that's running at 1024 ticks per bit of input. Eye diagrams.

Notion of factoring state machines. Or you can draw 10000 states if you want.

Something about Kris + scanners, it always ends badly. Will be putting lectures on the course website (and announcing on Piazza). High-level look at pipelines.

MIPS pipeline

For sure, you should be reading 7.5, if you haven't already. H&H do a great job. It's a slightly different way of looking at pipelines, which is probably inferior, but it's different.

First off, suppose I've got something like my Golden Bear power monitor, and $f = (A+B)C + D$. It's going to give me an ALU that does addition, an ALU that does multiplication, and then an ALU that does addition again, and that will end up in my output register.

There is a critical path (how fast can I clock this thing?). For now, assume "perfect" fast registers. This, however, is a bad assumption.

So let's talk about propagation delay in registers.

Timing & Delay (H&H 3.5; Fig 3.35,36)

Suppose I have a simple edge-triggered D flipflop. These things come with some specs on the input and output, and in particular, there is a setup time ($t_{\mathrm{setup}}$) and a hold time ($t_{\mathrm{hold}}$).

On the FPGA, these are each like 0.4 ns, whereas in 22nm, these are more like 10 ps.

And then the output is not going to change immediately (it's going to remain constant for some period of time before it changes): $t_{ccq}$ is the minimum time from clock to contamination (change) in Q. And then there's a maximum called $t_{pcq}$, the maximum (worst-case) time from clock to stable Q. These are just parameters that you can't control (aside from choosing a different flipflop).

So what do we want to do? We want to combine these flipflops through some combinational logic with some propagation delay ($t_{pd}$) and see what our constraints are going to be on the timing.

Once the output is stable ($t_{pcq}$), it has to go through my combinational logic ($t_{pd}$), and then, counting backwards, I've got $t_{setup}$, and that overall has to be less than my cycle time. This tells you how complex your logic can be, and how many pipeline stages you need. Part of the story of selling microprocessors was clock speed. Some of the people who got bachelors in EE cared, but people only really bought the higher clock speeds. So there'd be like 4 NAND-gate delays per stage, and that was it. One of the reasons why Intel machines have such incredibly deep pipelines: everything was cut into pieces so they could have these clock speeds.

So. $t_{pd}$ on your Xilinx FPGA for block RAM, which you care about, is something like 2 ns from clock to data. 32-bit adders are also on the order of 2 ns. What you're likely to end up with is a 50 MHz part. I also have to worry about fast combinational logic -- what happens if, as the rising edge goes high, my new input contaminates and messes up this register before the hold time has passed? Therefore we need $t_{ccq} + t_{pd} > t_{hold}$; in particular, we need $t_{ccq} > t_{hold}$ for a good flipflop (consider shift registers, where we have basically no propagation delay).

Therefore $t_{pcq} + t_{setup} + t_{pd} < t_{cycle}$.
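Both constraints can be turned into a quick calculator (the numeric values below are the lecture's rough FPGA figures, used illustratively):

```python
def max_clock_mhz(t_pcq, t_setup, t_pd):
    """Setup constraint: t_pcq + t_pd + t_setup < t_cycle.
    All times in ns; returns the fastest legal clock in MHz."""
    t_cycle = t_pcq + t_pd + t_setup
    return 1000.0 / t_cycle

def hold_ok(t_ccq, t_pd_min, t_hold):
    """Hold constraint: the earliest the next data can arrive
    (t_ccq plus the shortest logic path) must exceed t_hold."""
    return t_ccq + t_pd_min > t_hold

# 2 ns block-RAM clock-to-data, 2 ns 32-bit adder, 0.4 ns setup.
# (Routing and extra logic levels push a real design nearer 50 MHz.)
print(round(max_clock_mhz(t_pcq=2.0, t_setup=0.4, t_pd=2.0)))  # 227

# Shift register: essentially zero logic between stages, so the
# flipflop itself must satisfy t_ccq > t_hold.
print(hold_ok(t_ccq=0.3, t_pd_min=0.0, t_hold=0.2))  # True
```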

What does this have to do with the flipflop we know about? If we look at the flipflop that we've done in the past (with inverters, controlled buffers, etc.), what is $t_{setup}$? We have several delays; for $t_{setup}$, D should ideally propagate to X and Y. How long is the hold afterwards? You'd like $D$ to be constant for an inverter delay (so that it can stop having an effect). That's pretty stable. $t_{hold}$ is something like the delay of an inverter (if you want to be really safe, you'd say twice that number). For $t_{pcq}$, assuming we had a valid setup, the D value will be sitting on Y, and we've got two inverter delays; $t_{ccq}$ is also 2 inverter delays.

Good midterm-like question for you: if I have a flipflop with some characteristic setup and hold time, and I put a delay of 1 ps on the input and call this a new flipflop, how does that change any of these things? It can make $t_{hold}$ negative. How do I add more delay? Just add more inverters in the front. Hold time can in fact go negative. Lots of 141-style stuff in here that you can play with.

Given that, you have to deal with the fact that you've got this propagation time and the setup time. Cost of pipeline registers.

Critical path time, various calculations.
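As a closing sketch, the $f = (A+B)C + D$ example cut into three pipeline stages (a behavioral model of my own; simultaneous register updates are modeled by consuming old register values before overwriting them):

```python
def pipeline_f(inputs):
    """f = (A+B)*C + D cut into three pipeline stages (add, multiply,
    add) with registers r1, r2 between them: one result per cycle once
    the pipeline is full, each with three cycles of latency."""
    r1 = r2 = None                        # pipeline registers (None = bubble)
    results = []
    for item in inputs + [None, None]:    # extra cycles to drain the pipe
        # model a clock edge: read old register values, then overwrite
        if r2 is not None:
            results.append(r2[0] + r2[1])                        # stage 3
        r2 = (r1[0] * r1[1], r1[2]) if r1 is not None else None  # stage 2
        if item is not None:
            a, b, c, d = item
            r1 = (a + b, c, d)            # stage 1: A+B, pass C and D along
        else:
            r1 = None
    return results

print(pipeline_f([(1, 2, 3, 4), (2, 2, 2, 2)]))  # [13, 10]
```

The critical path is now the slowest single stage (here, the multiply) plus $t_{pcq} + t_{setup}$, rather than the whole add-multiply-add chain -- that is the trade the pipeline registers buy you.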
