

CS 150 - Digital Design

August 23, 2012

30 lab stations. Initially, individual, but later will partner up. Can admit up to 60 people -- limit. Waitlist + enrolled is a little over that. Lab lecture that many people miss: Friday 2-3. Specific lab sections (that you're in). Go to the assigned section at least for the first several weeks.

This is a lab-intensive course. You will get to know 125 Cory very well. Food and drink on round tables only. Very reasonable policy.

As mentioned: first few are individual labs, after that you'll pair up for the projects. The right strategy is to work really hard on the first few labs so you look good and get a good partner.

The book is Harris & Harris, Digital Design and Computer Architecture.

Reading: (skim chapter 1, read section 5.6.2 on FPGAs) -- H&H, start to look at Ch. 5 of the Virtex User's Guide.

H&H Ch. 2 is on combinational circuits. Assuming you took 61C, not doing proofs of equivalence, etc.

Ch. 3 is sequential logic. Combinational is history-agnostic; sequential allows us to store state (dynamical time-variant system).

With memory and a NAND gate, you can make everything.

Chapter 4 is HDLs. Probably good to flip through for now. We're going to use Verilog this semester. Book gives comparisons between Verilog and VHDL.

First lab next week, you will be writing simple Verilog code to implement simple boolean functions. 5 is building blocks like ALUs. 6 is architecture. 7, microarchitecture: why does it work, and how do you make pipelined processors. May find there's actually code useful to final project. Chapter 8 is on memory.

Would suggest that you read the book sooner rather than later. Can sit down in first couple of weeks and read entire thing through.

Lecture notes: will be using the whiteboard. If you want lecture notes, go to the web. Tons of resources out there. If there's something particular about the thing Kris says, use Piazza. Probably used several times by now, so not an issue.

Cheating vs. collaboration: link on website that points to Kris's version of a cheating policy.

Grading! There will be homeworks, and there will probably be homework quizzes (a handful, so probably 10 + 5%). There will be a midterm at least (possibly two), so that's like 15%. Labs and project are like 10 and 30%, and the final is 30%.

Couple things to note: lab is very important, because this is a lab course. If you take the final and the midterm and the quizzes into account, that's 50% of your grade.

Lab lecture in this room (306 Soda, F 2-3p). Will probably have five weeks of lab lecture. Section's 3-4. Starting tomorrow.

Office hours to be posted -- as soon as website is up. Hopefully by tomorrow morning.

King Silicon

FinFETs (from Berkeley, in use by Intel). What can you do with 22nm tech? Logic? You get something more than $10^6$ digital gates per $mm^2$. SRAM, you get something like $10 Mb/mm^2$; Flash and DRAM, you get something like $10 MB/mm^2$. You want to put your MIPS processor on there, or a 32-bit ARM Cortex? A small but efficient machine? On the order of $10^5$ gates, so about $0.1 mm^2$. You don't need a whole lot of RAM and flash for your program. Maybe a megabit of RAM, a megabyte of flash, and that adds up to $0.3 mm^2$. Even taking into account the cost of packaging and testing, you're making a chip for a few pennies.
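The area arithmetic above checks out; here is a quick back-of-the-envelope version in Python (variable names are made up; the density figures are the lecture's illustrative numbers, not a datasheet):

```python
# Back-of-the-envelope area budget using the lecture's 22nm figures.
GATES_PER_MM2 = 1e6        # ~10^6 digital gates per mm^2
SRAM_MB_PER_MM2 = 10       # ~10 Mb of SRAM per mm^2
FLASH_MBYTE_PER_MM2 = 10   # ~10 MB of flash per mm^2

core_mm2 = 1e5 / GATES_PER_MM2        # ~10^5-gate processor core
sram_mm2 = 1 / SRAM_MB_PER_MM2        # 1 Mb of RAM
flash_mm2 = 1 / FLASH_MBYTE_PER_MM2   # 1 MB of flash

total = core_mm2 + sram_mm2 + flash_mm2
print(f"{total:.1f} mm^2")  # about 0.3 mm^2, as in the notes
```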

Think of the cell phone: a processor surrounded by a whole bunch of peripherals. I/O devices (speakers, screens, LEDs, buzzer; microphone, keypad, buttons, N-megapixel camera, 3-axis accelerometer, 3-axis gyroscope, 3-axis magnetometer, touchscreen, etc.), networking devices (cell, Wi-Fi, Bluetooth, etc.). The cool thing here is that you can get all of these sensors in a little chip package. One catch: the microprocessor in general will not want a direct interface. A whole cloud of "glue logic" frees the processor from having to deal with the idiosyncrasies of these things. Lots of different interfaces that you have to deal with. Another way of looking at this: microprocessor at the core, glue logic around the outside of that, which talks to the analog circuitry, which talks to the actual transducers (something has to do conversions between different energy domains).

Another way of looking at this is that you have this narrow waist of microprocessors, which connects all of this stuff. The real reason we do this is to get up to software. One goal of this class is to make you understand the tradeoffs between HW and SW. HW is faster, lower power, cheaper, more accurate (along several axes: timing, etc.); SW is more flexible. If we knew everything people wanted to do, we'd put it in HW. Everything is better in HW, except that once you put it in HW, it's fixed. In general, you've got a bunch more people working in SW than in HW. This class is nice in that it connects these two worlds.

If you can cross that bridge and understand how to solve a software problem in the hardware design stages, or solve a HW bug through software, you can be the magician.

What we're going to do this semester in the project is similar to previous projects in that we'll have a MIPS processor. It looks like a 3-stage pipeline design, and we'll have you do a bunch of hardware interfaces (going across the HW/SW boundary, not into analog, obviously). We have, for example, video; we might do audio out. We'll do a keyboard for sure, and a timer. All of this will end up being memory-mapped I/O so that software on the processor can get to it, and it'll also have interrupts (exceptions). There are not that many people who understand interrupts and can design that interface well so that SW people are happy with it.

You will not be wiring up chips on breadboards (or protoboards), the way we used to in this class. You'll be writing in Verilog. You'll basically be using a text editor to write Verilog, which is an HDL. There are a couple of forms: one is structural, where you actually specify the nodes. In the first lab you'll do that, but afterwards, you'll be working at a behavioral level. You'll let the synthesis engine figure out how to take that high-level description and turn it into the right stuff in the target technology. It has to do a mapping function, and eventually a place & route.

Lot of logic. Whole bunch of underlying techs you might map to, and in the end, you might go to an IC where there's some cell library that the particular foundry gives you. Very different than if you're mapping to a FPGA, which is what you'll be doing this semester. Job of the synthesis tool to turn text into right set of circuits, and that goes into simulation engine(s), and it lets you go around one of these loops in minutes for small designs, hours for larger designs, and iterate on your design and make sure it's right.

Some of you have used LTSpice and are used to using drawing to get a schematic, and that's another way to get into this kind of system as well, but that's structural. A big part of this course, unfortunately, is learning and understanding how these tools work: how I go through the simulation and synthesis process.

The better you get at navigating the software, the better a digital designer you'll be. Painful truth of it all. The reality is that this is exactly the way it works in industry. Nature of the IC CAD world. Something like a $10B/yr industry. A whole lot better than plugging into a board.

FPGA board: fast and cheap upfront, but expensive per part. The other end of the spectrum is to go with an IC or application-specific integrated circuit (ASIC), which is slow and costly upfront, but cheap per unit. Something in between: use an FPGA + commercial off-the-shelf chips, and a custom PCB. Still expensive per part (less so), but it's pretty fast.

FPGA? Field-programmable gate array. The core of an FPGA is the configurable logic block (CLB). The whole idea behind a CLB is that you have the data plane, where you have a bunch of digital inputs to the box and some number of outputs, and there's a separate control plane (configuration) that's often loaded from a ROM or flash chip externally when the chip boots up. Depending on what you put in, it can look different. Fast, since it's implemented at the HW level. If you take a bunch of CLBs and put them in an array, and you put a bunch of wiring through that array, that is configurable wiring. If we made this chip and went through the process of turning it into a single chip, everything we put into this would be less than a square millimeter of 22nm silicon. Talk about how FPGAs make it easy to connect external devices to a microprocessor.

Course material: we can start from the systems perspective: systems are composed of datapath and control (FSM). Or we can start from the very bottom: transistors compose gates, which get turned into registers and combinational logic, which get turned into state and next-state logic, which make up the control. Also, storage and math/ALU (from registers and combinational logic) make up the datapath.
CS 150 Lab Lecture 0

August 24, 2012

Note: please make sure to finish labs 2 and 4, since those will be going into your final project. Labs will run the first 6 weeks, after which we will be starting the final project. Large design checkpoint, group sizes < 3.

CS 150: Digital Design & Computer Architecture

August 28, 2012

Admin

Lab: cardkey, working on that. Not a problem yet. Labs: T 5:30-8:30, W 5-8, θ 5-8. Discussion section: looking for a room. Office hours online. θ 11-12, F 10:30-11:30. In 512 Cory.

Reading: Ch. 4 (HDL). This week: through 4.3 (the section that talks about structural Verilog). For next week: the rest. It is Verilog only; we're not going to need VHDL. If you get an interview question in VHDL, answer it in Verilog.

Taxonomy

You've got HDLs, and there are really two flavors: VHDL and Verilog. Inside of Verilog you've got structural (from gates) and behavioral (say output is equal to a & b).

Abstraction

Real signals are continuous in time and magnitude. Before you do anything with signals, these days, you'll discretize in time and magnitude -- generally close enough to continuous for practical purposes. If it's a serial line, there's some dividing line in the middle, and the HW has to make a decision. The regular time interval is called CLK; two values of magnitude is called binary.

Hierarchy

Compose bigger blocks from smaller blocks. Principle of reuse -- modularity based on abstraction is how we get things done (Liskov). Reuse tested modules. Very important design habit to get into. Both partners work on and define the interface specification. Layering. Expose inputs and outputs and behavior. Define the spec, then divide labor. One partner implements the module, one partner implements the test harness. Regularity: because transistors are cheap and design time is expensive, sometimes you build smaller (simpler) blocks out of tested bigger blocks. Key pieces of what we want to do with our digital abstraction. Abstraction is not reality.
Simulation: Intel FDIV bug in the original Pentium. Voltage sag because of relatively high wire resistance.

Lab 0: our abstraction is structural Verilog. There are tons of online tutorials on Verilog; Ch. 4.3 in H&H is a good reference on that; your TAs are a good reference. Pister's not a good reference on the syntax. You're allowed to drop a small number of different components on your circuit board and wire them up. If you want to make some circuit, you can. Powers of two!

FDIV bug: folks from the EE dept at UCLA were outraged that they had not done exhaustive testing. Note: $1 yr \approx \pi \cdot 10^7 s$. With this approximation, $\pi$ and $2$ are about the same. Combinatorial problem.

Combinational logic vs. sequential logic. Combinational logic: outputs are a function of current inputs, potentially after some delay (memoryless), versus sequential, where the output can be a function of previous inputs. Combinational circuits have no loops (no feedback), whereas circuits with memory have feedback. Classic: SR latch (two NOR gates cross-coupled to each other).

So let's look at the high-level top-down big picture that we drew before: system design comes from a combination of datapath and control (FSM). On the midterm (on every midterm Pister's given for this course), there's going to be a problem about SRAM, and you're going to have to design a simple system with that SRAM. E.g.: given a 64k x 16 SRAM, design a HW solution to find the min and max two's complement numbers in that SRAM.

Things you need to know about transistors for this class: you already know them. Wired OR (could be a wired AND, depending on how you look at it). Open drain or open collector: this sort of thing. Zero static power: CMOS inverter. No longer true; power per gate is going down, but the number of gates goes up. Leakage current ends up being on the order of an amp. Also, increasingly, gates leak.
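That SRAM exam problem (find the min and max two's complement numbers in a 64k x 16 SRAM) comes down to an address counter, a comparator, and two registers. A Python sketch of what the datapath computes (helper names are made up; this models behavior, it is not Verilog):

```python
def to_signed16(word):
    """Interpret a 16-bit word as a two's-complement integer."""
    return word - 0x10000 if word & 0x8000 else word

def min_max_scan(sram):
    """Sequentially scan memory, keeping running min/max registers --
    the address-counter + comparator + two-register datapath you'd
    sketch for the exam problem."""
    lo = hi = to_signed16(sram[0])
    for word in sram[1:]:
        v = to_signed16(word)
        lo = min(lo, v)
        hi = max(hi, v)
    return lo, hi

print(min_max_scan([0x0005, 0xFFFF, 0x8000, 0x7FFF]))  # (-32768, 32767)
```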
Switching current: charging and discharging capacitors: $\alpha C V^2 f$. Crowbar current: $I_{CC}$; while the voltage is swinging from min to max or vice versa, this current exists. All of these things come together to limit the performance of a microprocessor.

A minterm is a product containing every input variable or its complement, and a maxterm is a sum containing every input variable or its complement.

CS 150: Digital Design & Computer Architecture

August 30, 2012

Introduction

Finite state machines, namely in Verilog. If we have time, canonical forms, and more going from transistor to inverter to flip-flop. So. The idea with lab 1 is that you're going to be making a digital lock. The real idea is that you're going to be learning behavioral Verilog.

Finite State Machines

Finite state machines are sequential circuits (as opposed to combinational circuits), so they may depend on previous inputs. What we're interested in are synchronous (clocked) sequential circuits. In a synchronous circuit, the additional restriction is that you only care about in/out values on the (almost always) positive-going edge of the clock. A drawing with a caret on it refers to a circuit sensitive to a positive clock edge. A bubble corresponds to the negative edge. If we have a clock, some input D, and output Q, we have our standard positive edge-triggered D flip-flop. The way we draw an unknown value, we draw both values. A register is one or more D flip-flops with a shared clock. Blocking vs. non-blocking assignments.

So. We have three parts to a Moore machine: state, output logic, and next-state logic. A Mealy machine is not very different.

Canonical forms

Minterms and maxterms. The truth table is the most obvious way of writing down a canonical form. And then there's minterm expansion and maxterm expansion. Both are popular and useful for different reasons. A minterm is a product term containing every input variable, while a maxterm is a sum term containing every input variable.
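The minterm expansion can be made concrete with a short Python sketch that reads the ones out of a truth table (function and variable names are made up for illustration):

```python
from itertools import product

def minterms(f, names):
    """List the rows where f is 1, each written as a product of every
    input variable or its complement (complement marked with ')."""
    terms = []
    for bits in product((0, 1), repeat=len(names)):
        if f(*bits):
            terms.append("".join(n if b else n + "'" for n, b in zip(names, bits)))
    return terms

# f(a, b) = a OR b is 1 on rows 01, 10, 11 -> minterms a'b, ab', ab.
print(minterms(lambda a, b: a | b, "ab"))  # ["a'b", "ab'", "ab"]
```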
Consider a minterm as a way of specifying the ones in the truth table. The construction looks like disjunctive normal form. Maxterms are just the opposite: you're trying to knock out rows of the truth table. If you've got some function that's mostly ones, you have to write a bunch of minterms to get it, as opposed to a handful of maxterms. That construction looks like conjunctive normal form. Both maxterm and minterm expansions are unique. De Morgan's law: "bubble-pushing".

CS 150: Digital Design & Computer Architecture

September 4, 2012

FSM design: problem statement, block diagram (inputs and outputs), state transition diagram (bubbles and arcs), state assignment, state transition function, output function. Classic example: string recognizer. Output a 1 every time you see a one followed by two zeroes on the input. When talking about systems, there's typically the datapath and the FSM controller, and you've got stuff going between the two (and the outside world interacts with the control). Just go through the steps.

Low-level stuff

Transistor turns into inverter, which turns into inverter with enable, which turns into D flip-flop. Last time: standard CMOS inverter. If you want to put an enable on it, there are several ways to do that: stick it into an NMOS transistor, e.g. When enable is low, the output is Z (high impedance) -- it's not trying to drag the output anywhere. It turns out (beyond the scope of this class) that NMOS is good at pulling things down, but not so much at pulling things up. Turns out you really want to add a PMOS transistor to pull up. We want this transistor to be on when enable is 1, but it turns on when the gate is low. So we stick an inverter on enable. Common; called a pass-gate (butterfly gate). Pass gates are useful, but they're not actually driving anything; they just allow current to flow through. If you put too many in series, though, things slow down. Pass-gates as controlled inverters can be used to create a mux. SR (set/reset) latch. Requires a NOR gate.
The useful thing about NOR and NAND is that with the right constant on one input, they can act as inverters. That is why they are useful in making latches (if we cross-couple two of them). If S = R = 0, then the NOR gates turn into inverters, and this thing effectively turns into a bistable storage element. If I feed in a 1, it'll force the output to be 0, which forces the original gate's input to be a 1.

Clocked systems. Suppose we take our SR latch and put an AND gate in front of S and R with an enable line on it; we can now turn off this functionality, and when enable is low, S and R can do whatever they want; they're not going to affect the outputs of this thing. You can design synchronous digital systems using simple level-sensitive latches.

Contrast with the ring oscillator (3-stage; simplest). That is unstable -- if I put an odd number of inverters in series, there is no stable configuration. Very useful for generating a clock. Standard crystal oscillator: Pierce configuration. An odd number of stages is unstable, two stages are stable; with more stages you have to worry about other things. It can be clocked, but you have to be careful.

For example: if I wanted to design a 1-bit counter with a clocked system, we could consider a level-sensitive D latch. This is what happens when you get a latch in Verilog: otherwise, the synthesis tool will have it keep its previous value. If you do that, it turns out that probably gives you enough delay that when the clock is high, the output is 1; it'll probably oscillate. So that's bad; maybe we'll make the enable line (the clock line) really narrow. And not surprisingly, that's called narrow clocking. For simple systems, you can get away with that. Make the delay similar to a single gate's or a few gates' delay. However, it's ugly; don't do that. Back in the day, people did this, and systems were simple enough that they could get away with it.
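The cross-coupled NOR behavior can be sanity-checked with a tiny Python model (a logic-level sketch only; real latches settle in continuous time, and the invalid S = R = 1 case is not modeled):

```python
def nor(a, b):
    return 0 if (a or b) else 1

def sr_latch(s, r, q=0, qbar=1, steps=4):
    """Iterate the two cross-coupled NOR gates until they settle.
    With S = R = 0 each NOR acts as an inverter on the other's
    output, so the pair holds its state (bistable)."""
    for _ in range(steps):
        q, qbar = nor(r, qbar), nor(s, q)
    return q, qbar

q, qb = sr_latch(s=1, r=0)                 # set
q, qb = sr_latch(s=0, r=0, q=q, qbar=qb)   # hold
print(q, qb)  # 1 0
```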
What I really want is my state register and its output going through some combinational logic block, with some input, to set my next state, and only once per clock period does this thing happen. The problem here is that in a single clock period, I get a couple of iterations through the loop. So how do I take my level-sensitive latch (I've turned it into a D latch with enable, and that's my clock)? When the clock is low, there's no problem; I don't worry that my input's going to cruise through this thing. And when it's high, I want my input (the D input) to remain constant. As long as the clock is high, I don't care; it'll maintain its state, since I'm not looking at those inputs.

There are a whole bunch of ways you can do it (all of which get used somewhere), but the safest (and probably most common) is to stick another latch (another clocked level-sensitive latch) in front of it, with an inverter on its clock. That's now my input. So when the clock is low, the first one is enabled, and it's transparent (it's a buffer). This is called an edge-triggered master/slave D flip-flop.

The modern way of implementing the basic D latch is by using feedback for the storage element, and an input (both the feedback and the input are driven by out-of-phase enables). My front end (the master) is driving the signal line when the clock is low, and, conversely, when the clock is high, the feedback inverter will be driving the line: a bistable storage element maintaining its state, with the input disconnected. Now, with the slave, same picture, except it is sensitive when the clock is high, as opposed to the master, which is sensitive when the clock is low. The idea is that the slave prevents anything from getting into the storage element until it stabilizes. At the end of the day, the rising edge of the clock latched D to Q. Variation that happened after doesn't propagate to the master; variation that happened before, the slave wasn't listening. So now we have flip-flops and can make FSMs.
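As a concrete FSM, here is a Python model of the classic string recognizer from this lecture (output a 1 every time the input shows a one followed by two zeroes); the state names are made up:

```python
# States encode how much of the pattern "1 0 0" we've seen so far.
NEXT = {
    ("IDLE", 0): "IDLE",   ("IDLE", 1): "SAW1",
    ("SAW1", 0): "SAW10",  ("SAW1", 1): "SAW1",
    ("SAW10", 0): "SAW100", ("SAW10", 1): "SAW1",
    ("SAW100", 0): "IDLE", ("SAW100", 1): "SAW1",
}

def recognize(bits):
    """Moore machine: the output is 1 exactly in the SAW100 state."""
    state, out = "IDLE", []
    for b in bits:
        state = NEXT[(state, b)]
        out.append(1 if state == "SAW100" else 0)
    return out

print(recognize([1, 0, 0, 1, 0, 0, 0]))  # [0, 0, 1, 0, 0, 1, 0]
```

Since the output depends only on the state, this is a Moore machine; a Mealy version would compute the output from (state, input) and could flag the match one cycle earlier.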
CS 150: Digital Design & Computer Architecture

September 6, 2012

Verilog synthesis vs. testbenches (H&H 4.8)

There's the subset of the language that's actually synthesizable, and then there's more stuff just for the purpose of simulation and testing. It's way easier to debug via simulation. Constructs that don't synthesize:

* #t: used for adding a simulation delay.
* ===, !==: 4-state comparisons (0, 1, X, Z).
* System tasks (e.g. \$display, which prints to the console in a C printf-style format that's pretty easy to figure out; \$monitor, which prints to the console whenever its arguments change).

In industry, it's not at all uncommon to write the spec, write the testbench, then implement the module. Once the testbench is written, it becomes the real spec. "You no longer have bugs in your code; you only have bugs in your specification."

How do we build a clock in Verilog?

parameter halfT = 5;
reg CLK;
initial CLK = 0;
always begin
  #(halfT) CLK = ~CLK;
end

H&H example 4.39 shows you how to read test vectors from a file:

silly_function(.a(a), .b(b), .c(c), .s(s));

reg [3:0] testvect [10000:0];
initial $readmemb("test.tv", testvect);
// done when input is 4 bits of X (don't care)

always @(posedge CLK) begin
  #1 {a, b, c, out_exp} = testvect[num];
end

always @(negedge CLK) begin
  if (s !== out_exp) $display("error ... ");
  num <= num + 1;
end

How big can you make shift registers? At some point, IBM decreed that every register on every IBM chip would be part of one gigantic shift register. So you've got your register file feeding your ALU; it's a 32 x 32 register file. There's a test signal; when it's high, the entire thing becomes one shift register. Why? Testing. This became the basis of JTAG.

Another thing: dynamic fault imaging. Take a chip and run it inside a scanning electron microscope, which detects backscatter from electrons. It turns out that a metal absorbs depending on what voltage it's at, and oxides absorb depending on the voltage of the metal beneath them. So you get a different intensity depending on the voltage.

We can also take these passgates and make variable interconnects. So if I've got two wires that don't touch, I can put a passgate on there and call that the connect input. Last time we talked about MUXes. I can make a configurable MUX -- we did a two-to-one mux, and if I've got some input over here, I select according to what I have as my select input. Next time: more MIPS, memory.

CS 150: Digital Design & Computer Architecture

September 11, 2012

Changed office hours this week. CLBs, SAR, K-maps.

Last time: we went from transistors to inverters with enable to D flip-flops to a shift register with some inputs and outputs, and, from there, to the idea that once you have that shift register, you can hook it up with an n-input mux and make an arbitrary function of n variables. This gives me configurable logic and configurable interconnects, and naturally I take the shift out of one and into another, and I've got an FPGA. The LUT is the basic building block: I get four of those per slice, plus some other stuff: fast ripple-carry logic for adders, and the ability to take two 6-LUTs and put them together to form a 7-LUT. So: pretty flexible, a fair amount of logic, and that's a slice. One CLB is equal to two slices.
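The LUT idea -- a shift register's bits feeding an n-input mux, realizing any n-input function -- is easy to model in Python (the class name is made up; this is a behavioral sketch, not how a slice is actually wired):

```python
class LUT:
    """A k-input lookup table: the FPGA's basic combinational element.
    The 2**k configuration bits ARE the truth table."""
    def __init__(self, k, config_bits):
        assert len(config_bits) == 2 ** k
        self.table = list(config_bits)

    def __call__(self, *inputs):
        # Treat the input bits as an address into the truth table,
        # exactly like the select lines of a 2**k-to-1 mux.
        addr = 0
        for bit in inputs:
            addr = (addr << 1) | bit
        return self.table[addr]

# Configure a 2-LUT as XOR: truth table rows (a, b) = 00, 01, 10, 11.
xor = LUT(2, [0, 1, 1, 0])
print([xor(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 1, 1, 0]
```

Reloading the configuration bits turns the same hardware into AND, OR, or any other function of its inputs, which is the whole point of the CLB.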
And I've got, what is that, 8640 CLBs per chip. Also: DSP48 blocks. 64 of these, and each one is a little 48-bit configurable ALU. So that gives you something like 70000 (6-LUTs + D-ffs). So that's what you've got; that's what we work with.

Now, let's talk about a successive approximation register (SAR) analog-to-digital converter. A very popular device. There's a link to a chip that implemented the digital piece of this thirty years ago. Why are we looking at this? It's nice; it's an example of a "mixed-signal" system (i.e. analog mixed with digital in the same block, where you have a good amount of both). It turns out that analog designers need to be good digital designers these days. I was doing some consulting for a company recently: they had brilliant analog designers, but they had to do some digital blocks on their chips. "Real world" interfaces.

It has some number of output bits that go into the DAC; the DAC's output is simply "linear". You trust the analog designer to give you this piece, plus the comparator and the sample-and-hold circuit, with the sample input, and here's your analog input voltage. So real quick we'll look at what's in those blocks (even though this isn't 150 material).

S/H: the simplest example is just a transistor. Maybe it's a butterfly gate; typically, there's some storage capacitor on the output so that it holds your input voltage; when the sample line goes low, that voltage is held on there. Maybe there's some buffer amplifier (a little CMOS opamp so it can drive nice loads); capacitive input, so the signal will stay there for a long time. Not 150 material.

The DAC: a simple way of making this is to generate a reference voltage (diode-connected PMOS with voltage division, say), which you mirror, tied together with switches, where all of these share the same gate. The comparator's a little more subtle; maybe we'll cover it when we talk about SRAMs and DRAMs. Anyway. So. Now we have the ability to generate an analog voltage under digital control. We sample that input and are going to use that signal.
This tells us whether the output of the DAC is too big. That together is called a SAR. So what does that thing do? There's a very simple (dumb) SAR: a counter. From reset, its digital output increases linearly in time; at some point it crosses the analog $V_{in}$, and at that point, you stop. But that's not such a great thing to do: it takes between 1 and 1024 cycles to get the result. The better way is to do a binary search. Fun to do with dictionaries and kids. Also works here. FSM: go bit-by-bit, starting with the most significant bit. A better solution (instead of using an oversized tree -- better in the sense of less logic required): use a shift register (and compute all bits sequentially). Or a counter going into a decoder: sixteen outputs, of which I only need 10.

Next piece: another common challenge, and where a lot of mistakes get made: analog stuff does not simulate as well. While you're developing and debugging, you have to come up with some way of simulating it. Good news: you can often go in and fix things like that. Sort of an aside (although it sometimes shows up in my exams): once you put these transistors down, you've got all these layers of metal on top. It turns out that you can actually put this thing in a scanning electron microscope, use undedicated logic, and go in with a FIB (focused ion beam) and fix problems. "Metal spin".

Back to chapter two: basic gates again. De Morgan's law: $\bar{AB} = \bar{A} + \bar{B}$; $\bar{\Pi A_i} = \sum \bar{A_i}$. Similarly, $\bar{\Sigma A_i} = \prod \bar{A_i}$. Suppose you have a two-level NAND/NAND circuit: that becomes a sum of products (SoP). Similarly, NOR/NOR is equivalent to a product of sums (PoS). Now, if I do NOR/NOR/INV, this is a sum of products, but with the inputs inverted. This is an important one. This particular one is useful because of the way you can design logic. The way we used to design logic a few decades ago (and the way we might go back in the future) was with big long strings of NOR gates.
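The binary-search SAR loop above fits in a few lines of Python (the function name and the idealized linear DAC are assumptions, matching the trust-the-analog-designer model; 10 cycles instead of up to 1024):

```python
def sar_convert(vin, vref=1.0, bits=10):
    """Successive approximation: test one bit per cycle, MSB first.
    dac(code) stands in for the trusted analog piece; here it's
    just an ideal linear DAC."""
    dac = lambda code: vref * code / (1 << bits)
    code = 0
    for i in reversed(range(bits)):
        trial = code | (1 << i)     # tentatively set this bit
        if dac(trial) <= vin:       # comparator: is the DAC not too big?
            code = trial            # keep the bit
    return code

print(sar_convert(0.5))  # 512, i.e. half scale of a 10-bit converter
```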
So if I go back to our picture of a common source amplifier (erm, inverter), and we stick a bunch of other transistors in parallel, then we have a NOR gate. Remember: MOS devices have parasitic capacitance. Consider another configuration. Suppose we invert our initial input and connect both of these to a circled wire, which can be any of the following: fuse / anti-fuse, mask-programmable (when you make the mask, the decision to add a contact), transistor with flipflop (part of a shift register, e.g.), or an extra gate (double-gate transistors). So now if I chain a bunch more of these together (all NOR'd together), then I can program many functions. In particular, it could just be an inverter. I can put a bunch of these together, and I can combine the function outputs with another set of NORs, invert it all at the end, and I end up with NOR/NOR/INV. These guys are called PLAs (programmable logic arrays), and you can still buy them, and they're still useful. It's not uncommon to have a couple of flipflops on them. We will have a homework assignment where you use a 30-cent PLA and design something. A quick and dirty way of getting something for cheap. Not done anymore because it's slow (huge capacitances), but it may come back because of carbon nanotubes. Javey managed to make a nanotube transistor with a source, drain, and gate, and he got transport, the highest current density per square micron of cross-section ever, and showed the physics all worked, and this thing is 1nm around. What Prof. Ali Javey's doing now is working with nanowires and showing that you can grow these things on a roller and roll them onto a surface (like a plastic surface), putting down layers of nanowires in alternating directions.
You can imagine (we're a ways away from this) getting a transistor at each of these locations, with some CMOS on this side generating signals and CMOS on the output side taking the output (made with big fat gigantic 14nm transistors), and you can put $10^5$ transistors per square micron (not pushing it, since density can get up to $10^6$). The end of the road for CMOS doesn't mean you ditch CMOS. Imagine making this into a jungle gym; then you're talking about $10^8$ carbon nanotubes per cubic micron, etc. The fact that we can make these long thin transistors on their lengths means that this might come back into fashion.

CS 150: Digital Design & Computer Architecture

September 13, 2012

Questions of the form: given a 16x16 SRAM, design a circuit that will find the smallest positive integer, or the biggest even number, or count the number of times 17 appears in memory, etc. Kris loves these questions where you figure out the design (remember: separate datapath and control, come up with it on your own) -- they will probably show up on both the midterm and the final. Office hours moved.

So... last time, we were talking about PLAs (programmable logic arrays) and stuff (NOR/NOR equivalent to AND/OR). You'll hear people talking about the AND plane and the OR plane, even though they're both NORs. If you look at Fig 2.2.3, they'll show the same regular and inverted signals, and they just draw this as a line with an AND gate at the end. A pretty common way to draw this; likewise lines with OR gates. A variant of the PLA is called a PAL -- subsets of "product" terms going to "OR" gates. The beginning of complex programmable logic devices (CPLDs, FPGAs). You can still buy these registered PALs. Why would you use this over a microprocessor? Faster. Niche. The "oh crap" moment when you finish your board and you find that you left something out.

I want to say a little about memory, because you'll be using block RAM in your lab next week.
There's a ton of different variations of memory, but they all have a couple of things in common: a decoder (address decoder) where you take $n$ input bits and turn them into $2^n$ word lines in a memory that has $2^n$ words. You also have the cell array. Going through the cell array you have some number of bit lines; we'll call this either $k$ or $2k$, depending on the memory. That goes into some amps / drivers, and then out the other side you have $k$ inputs and/or outputs. Sometimes shared (depends on whether or not there's an output-enable). Write-enable, output-enable, sometimes a clock, sometimes d-in as well as d-out, sometimes multiple d-outs (multiple address/data pairs); there's a whole bunch of variation in how this happens. Conceptually, though, it all comes down to something that looks like this.

So what does that decoder look like? Decoders are very popular circuits: they generate all minterms of their input (gigantic products). Note that if you invert all of the outputs, we get the maxterms (sums). That was DRAM.

Now, SRAM: still have the word line going across; now I have a bit line and a negated bit line. Inside, I have two cross-coupled inverters (a bistable storage element). Four transistors in there already (vs. 1 for DRAM), and I still have to access it: an access transistor going to each side, hooked up to the word line. When I read this thing, I put in an n-bit address, and the transistors pull the bit lines. We want these as small as possible for bit density. 6T; a sense amp is needed. You can imagine that what you usually do is pre-charge $BL$, $\bar{BL}$. As soon as you raise the word line for this particular row, what you find is that one of them starts discharging, and the other is constant. Analog sensing is there so you can make a decision much, much faster. That's how reads work; writes are interesting. Suppose I have some $D_{in}$; what do I do? I could put an output-enable on there so that when writing, the cells don't send anything to the output, but that would increase size significantly. So what do I do?
I just make big burly inverters and drive the lines. Big transistors down there overcome the small transistors up there, and they flip the bit. PMOS is also generally weaker than NMOS, etc. Just overpower it. One of the rare times that you have PMOS pulling up and NMOS pulling down. (Notion of "bigger": $W/L$.)

Transistors leak. They can leak a substantial amount. By lowering voltage, I reduce power. It turns out there's a nonlinear relationship here, and so the transistors leak a lot less. So that's SRAM.

The other question: what about a register? What's the difference between this and a register file? Comes back to what's in the cell array. We said that a register is a bunch of flipflops with a shared clock and maybe a shared enable. Think of a register as having the common word line, and you've got a D flipflop in there. There's some clock shared across the entire array, and there's an enable on it and possibly an output, depending on what kind of system you've got set up. We've got D-in, D-out, and if I'm selecting this thing, presumably I want output-enable; if I'm writing, I need to assert write-enable.

So. You clearly have the ability to make registers on chips, so you can clearly do this on the FPGA. Turns out there are some SRAMs on there, too. There's an external SRAM that we may end up using for the class project, and there's a whole bunch of DDR DRAM on there as well.

Canonical forms

Truth tables, minterm / maxterm expansions. These we've seen. If you have a function equal to the sum of minterms 1,3,5,6,7, we could implement this with fewer gates by using the maxterm expansion. "Minimum sum of products", "minimum product of sums".

Karnaugh Maps

Easy way to reduce to minimum sum of products or minimum product of sums. (Section 2.7). Based on the combining theorem, which says that $XA + X\bar{A} = X$. Ideally, adjacent rows and columns should differ in just a single variable, so we label them with Gray codes (e.g. 00, 01, 11, 10). Graphical representation!
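The Gray-code labeling is easy to generate with the reflect-and-prefix construction. A small sketch of mine, showing why adjacent K-map labels differ in exactly one bit:

```python
# Generate the n-bit Gray-code ordering used to label K-map rows/columns,
# so consecutive entries differ in exactly one bit (which is what makes
# the combining theorem XA + XA' = X apply to adjacent cells).
def gray(n):
    if n == 0:
        return [""]
    prev = gray(n - 1)
    # reflect and prefix: 0 + list, then 1 + reversed(list)
    return ["0" + c for c in prev] + ["1" + c for c in reversed(prev)]

print(gray(2))  # ['00', '01', '11', '10']
```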
CS 150: Digital Design & Computer Architecture

September 18, 2012

Lab this week you are learning about Chipscope. Chipscope is kinda like what it sounds: it allows you to monitor things happening in the FPGA. One of the interesting things about Chipscope is that since it's a FSM monitoring stuff in your FPGA, it also gets compiled down, and it changes the location of everything that goes into your chip. It can actually make your bug go away (e.g. timing bugs).

So. Counters. How do counters work? If I've got a 4-bit counter and I'm counting from 0, what's going on here? A D-ff with an inverter and an enable line? This is a T-ff (toggle flipflop). That'll get me my first bit, but my second bit is slower. $Q_1$ wants to toggle only when $Q_0$ is 1. With subsequent bits, they want to toggle when all lower bits are 1.

Counter with en: enable is tied to the toggle of the first bit. Counter with ld: four input bits, four output bits. Clock. Load. Then we're going to want to do a counter with ld, en, rst. Put in logic, etc. Quite common: ripple carry out (RCO), where we AND $Q[3:0]$ and feed this into the enable of $T_4$.

Ring counter (shift register with one-hot out): if reset is low I just shift this thing around and make a circular shift register. If high, I clear the out bit. Mobius counter: just a ring counter with a feedback inverter in it. Just going to take whatever state is in there, and after n clock ticks, it inverts itself. So you have $n$ flipflops, and you get $2n$ states.

And then you've got LFSRs (linear feedback shift registers). Given N flipflops, we know that a straight up or down counter will give us $2^N$ states. Turns out that an LFSR gives you almost that ($2^N - 1$: everything but 0). So why do that instead of an up-counter? This can give you a PRNG. Fun times with Galois fields. Various uses, seeds, high enough periods (Mersenne twisters are higher).
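The $2^N - 1$ period is easy to see in simulation. A behavioral sketch of mine: a 4-bit Fibonacci-style LFSR with taps at bits 3 and 2 (the polynomial $x^4 + x^3 + 1$, which is maximal-length), cycling through all 15 nonzero states.

```python
# 4-bit LFSR, shifting left; feedback bit is the XOR of the tapped bits.
# With a maximal polynomial, any nonzero seed walks through all
# 2^4 - 1 = 15 nonzero states before repeating (0 is a fixed point).
def lfsr_period(seed=0b0001, width=4, taps=(3, 2)):
    state, steps = seed, 0
    while True:
        fb = 0
        for t in taps:
            fb ^= (state >> t) & 1          # XOR the tap bits
        state = ((state << 1) | fb) & ((1 << width) - 1)
        steps += 1
        if state == seed:                   # back to where we started
            return steps

print(lfsr_period())  # 15
```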
RAM

Remember: decoder, cell array, $2^n$ rows, $2^n$ word lines, some number of bit lines coming out of that cell array for I/O, with output-enable and write-enable. When output-enable is low, D goes to high-Z. At some point, some external device starts driving some Din (not from memory). Then I can apply a write pulse (write strobe), which causes our data to be written into the memory at this address location. Whatever was driving it releases, so it goes back to high-impedance, and if we turn output-enable on again, we'll see "Din" from the cell array. During the write pulse, we need Din stable and the address stable. We have a pulse because we don't want to break things. Bad things happen. Notice: no clock anywhere.

Your FPGA (in particular, the block RAM on the ML505) is a little different in that it has registered inputs (addr & data). First off, it's very configurable. All sorts of ways you can set this up, etc. Addr in particular goes into a register and comes out of there, then goes into a decoder before it goes into the cell array. What comes out of that cell array is a little bit different also in that there's a data-in line that goes into a register, and some data-out as well that's separate and can be configured in a whole bunch of different ways so that you can do a bunch of different things. The important thing is that you can apply your address to those inputs, and it doesn't show up until the rising edge of the clock. There's the option of having either registered or non-registered output (non-registered for this lab).

So now we've got an ALU and RAM. And so we can build some simple datapaths. For sure you're going to see on the final (and most likely the midterm) problems like "given a 16-bit ALU and a 1024x16 sync SRAM, design a system to find the largest unsigned int in the SRAM." Demonstration of clock cycles, etc. So what's our FSM look like? Either LOAD or HOLD.

On homework, did not say sync SRAM. Will probably change.
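The find-the-largest design can be sketched behaviorally. This is my own sketch, not the official solution: an address counter walks the SRAM, a comparator decides between LOAD and HOLD for a max register, one word per cycle.

```python
# Cycle-by-cycle model of the "largest unsigned int" datapath:
# address counter -> sync SRAM read -> comparator -> max register.
def find_max(sram):
    max_reg = 0                      # max register, cleared at reset
    for addr in range(len(sram)):    # address counter, one step per cycle
        data = sram[addr]            # sync read: data valid the next cycle
        if data > max_reg:           # comparator says new value is bigger
            max_reg = data           # FSM state LOAD: capture it
        # else: FSM state HOLD, register keeps its value
    return max_reg

print(find_max([3, 17, 5, 65535, 42]))  # 65535
```

In hardware the comparison and the register update pipeline against the registered SRAM output, which is why the FSM matters; the loop hides that timing.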
CS 150: Digital Design & Computer Architecture

September 20, 2012

Non-overlapping clocks. n-phase means that you've got n different outputs, and at most one is high at any time. Guaranteed dead time between when one goes low and the next goes high.

K-maps

Finding minimal sum-of-products and product-of-sums expressions for functions. On-set: all the ones of a function. Implicant: one or more circled ones in the on-set. A minterm is the smallest implicant you can have, and implicants go up by powers of two in the number of cells they cover. A prime implicant can't be combined with another (by circling); an essential prime implicant is a prime implicant that contains at least one 1 not in any other prime implicant. A cover is any collection of implicants that contains all of the ones in the on-set, and a minimal cover is one made up of essential prime implicants and the minimum number of other prime implicants.

Hazards vs. glitches: glitches are when timing issues result in dips (or spikes) in the output; hazards are when they might happen. Completely irrelevant in synchronous logic.

Project

3-stage pipeline MIPS150 processor. Serial port, graphics accelerator. If we look at the datapath elements, the storage elements, you've got your program counter, your instruction memory, register file, and data memory. Figure 7.1 from the book. If you mix that in with figure 8.28, which talks about MMIO: that data memory, there's an address and data bus that it is hooked up to, and if you want to talk to a serial port on a MIPS processor (or an ARM processor, or something like that), you don't address a particular port (not like x86). Most ports are memory-mapped. You've actually got a MMIO module that is also hooked up to the address and data bus. For some range of addresses, it's the one that handles reads and writes. You've got a handful of different modules down here, such as a UART receive module and a UART transmit module.
In your project, you'll have your personal computer that has a serial port on it, and that will be hooked up to your project, which contains the MIPS150 processor. Somehow, you've got to be able to handle characters transmitted in each direction.

UART

Common ground, TX on one side connected to the RX port on the other side, and vice versa. Whole bunch more in different connectors. The basic protocol is called RS232, and it's common (people often refer to it by connector name: DB9, rarely DB25); fortunately, we've moved away from this world and use USB. We'll talk about these other protocols later, some sync, some async. Workhorse for a long time, still all over the place.

You're going to build the UART receiver/transmitter and the MMIO module that interfaces them. See when something's coming in from software / hardware. Going to start out with polling; we will implement interrupts later on in the project (for timing and serial IO on the MIPS processor). That's really the hardcore place where software and hardware meet. People who understand how each interface works and how to use those optimally together are valuable and rare people.

What you're doing in Lab 4, there are really two concepts: (1) how does serial / UART work, and (2) the ready / valid handshake.

On the MIPS side, you've got some addresses. Anything that starts with FFFF is part of the memory-mapped region. In particular, the first four are mapped to the UART: they are RX control, RX data, TX control, and TX data. When you want to send something out the UART, you write the byte -- there's just one bit for the control and one byte for data. Data goes into some FSM system, and you've got an RX shift register and a TX shift register. There's one other piece of this, which is that inside of here, the thing interfacing to this IO-mapped module uses this ready bit.
If you have two modules, a source and a sink (diagram from the document), the source has some data that it is sending out, tells the sink when the data is valid, and the sink tells the source when it is ready. And there's a shared "clock" (baud rate), and this is a synchronous interface.

• source presents data
• source raises valid
• when ready & valid on posedge clock, both sides know the transaction was successful.

Whatever order this happens in, the source is responsible for making sure the data is valid. HDLC? Takes bytes and puts them into packets, ACKs, etc.

Talk about quartz crystals, resonators. $\pi \cdot 10^7$.

So: before I let you go, parallel load, n bits in, serial out, etc.

UART, MIPS and Timing

September 25, 2012

Timing: motivation for the next lecture (pipelining). Lots of online resources (resources, period) on MIPS. Should have lived + breathed this thing during 61C. For sure, you've got your 61C lecture notes and CS150 lecture notes (both from last semester). Also the green card (reference), and there's obviously the book. Should be tons of material on the MIPS processor out there.

So, from last time: we talked about a universal asynchronous receiver transmitter. On your homework, I want you to draw a couple of boxes (control and datapath; they exchange signals). The datapath is mostly shift registers. May be transmitting and receiving at the same time; one may be idle; any mix. Some serial IO lines going to some other system not synchronized with you. Talked about clock and how much clock accuracy you need. For eight-bit, you need a couple percent matching parity.

In years past, we've used N64 game controllers as input for the project. All they had was an RC relaxation oscillator. Had the same format: start bit, two data bits, and stop bit. Data was sent Manchester-coded (0 -> 01; 1 -> 10). In principle, I can have a 33% error, which is something I can do with an RC oscillator.

Also part of the datapath: 8-bit data going in and out. Whatever, going to be the MIPS interface.
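The three handshake bullets above can be modeled in a few lines. A toy sketch of mine: a transfer happens only on a clock edge where both ready and valid are high, and the source holds its data stable while valid is up.

```python
# Per-cycle model of the ready/valid handshake. Each list gives one
# signal's value on successive posedges of the shared clock.
def run(source_valid, sink_ready, data):
    received = []
    for v, r, d in zip(source_valid, sink_ready, data):
        if v and r:              # both sides see ready & valid this edge
            received.append(d)   # transaction completes
        # otherwise: source keeps presenting the same data, sink waits
    return received

# valid up for 3 cycles, but the sink is only ready on the last one;
# exactly one transfer happens.
print(run([1, 1, 1], [0, 0, 1], [7, 7, 7]))  # [7]
```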
Set of memory-mapped addresses on the MIPS, so you can read/write on the serial port. Also some ready/valid stuff up here. Parallel data to/from the MIPS datapath.

MIPS: invented by our own Dave Patterson and John Hennessy from Stanford. Started a company; Kris saw the business plan. Was confidential, now probably safe to talk about. Started off and said they're going to end up getting venture capital, and the VCs are going to take equity, which is going to dilute their equity. Simple solution: don't take venture money. These guys have seen enough of this. By the time they're all done, it would be awesome if they each had 4% of the company. They set things up so that they started at 4%. Were going to allocate 20% for all of the employees, series A going to take half, series B, they'll give up a third, and C, 15%. Interesting bit about MIPS that you didn't learn in 61C.

One of the resources, the green sheet: once you've got this thing, you know a whole bunch about the processor. You know you've got a program counter over here, and you've got a register file in here, and how big it is. Obviously you've got an ALU and some data memory over here, and you know the instruction format. You don't explicitly know that you've got a separate instruction memory (that's a choice you get to make as an implementor); you don't know how many cycles it'll take (or whether it's pipelined, etc). People tend to have separate data and instruction memory for embedded systems, and locally, it looks like separate memories (even on more powerful systems).

We haven't talked yet about what a register file looks like inside. Not an absolute requirement, but it would be nice if your register file had two read addresses and one write address. We go from a D-ff, and we know that sticking an enable line on there lets us turn this into a D-ff with enable. Then if I string 32 of these in parallel, I now have a register (clocked), with a write-enable on it.

Not going to talk about the ALU today: probably after the midterm.
So now, I've got a set of 32 registers. Considerations of cost. Costs on the order of a hundredth of a cent. Now I've made my register file. How big is that logic? NAND gates to implement a 5->32 bit decoder. Asynchronous reads; synchronous write at the rising edge of the clock.

So, now we get back to MIPS review. The MIPS instructions: you've got R/I/J-type instructions. All start with the opcode (same length: 6 bits). Tiny fraction of all 32-bit instructions. More constraints as we get more stuff. If we then want to constrain that this is a single-cycle processor, then you end up with a pretty clear picture of what you want. PC doesn't need 32 bits (the two LSBs are always 0); can implement PC with a counter. PC goes into instruction memory, and out comes my instruction. If, for example, we want to execute lw $s0, 12($s3), then we look at the green card, and it tells us the RTL.
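The register file just described (asynchronous reads, synchronous write) can be modeled behaviorally. A sketch of mine, with $0 hardwired to zero as on MIPS:

```python
# 32-entry register file: two asynchronous read ports, one write port
# that takes effect on the clock edge, register 0 always reads as 0.
class RegFile:
    def __init__(self):
        self.regs = [0] * 32

    def read(self, ra1, ra2):
        # asynchronous: outputs follow the read addresses combinationally
        return self.regs[ra1], self.regs[ra2]

    def clock_edge(self, we, wa, wd):
        # synchronous write: only on the rising edge, only when we=1
        if we and wa != 0:          # writes to $0 are ignored
            self.regs[wa] = wd

rf = RegFile()
rf.clock_edge(we=1, wa=16, wd=0xABCD)   # write $s0
print(rf.read(16, 0))  # (43981, 0)
```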

Pipelining

September 27, 2012

Last time, I just mentioned in passing that we will always be reading 32-bit instruction words in this class, but ARM has both 32- and 16-bit instruction sets. MicroMIPS does the same thing.

Optimized for size rather than speed; will run at 100 MHz (not very good compared to desktop microprocessors made in the same process, which run in the gigahertz range), but it burns 3 mW. $0.06 \text{mm}^2$. Questions about the power monitor -- you've got a chip that's somehow hanging off of the power plug and manages one way or the other to get a voltage and current signal. You know the voltage is going to look like a sine with roughly 155 V amplitude (110 V RMS times $\sqrt{2}$).

Serial! Your serial line, the thing I want you to play around with is the receiver. We give this to you in the lab, but the thing is I want you to design the basic architecture.

Start, stop, some bits between. You've got a counter on here that's running at 1024 ticks per bit of input. Eye diagrams.
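The 1024-ticks-per-bit counter is the usual oversampling trick: restart the counter on the falling edge of the start bit, then sample each data bit at mid-count so you land in the middle of the eye. A sketch of mine (the 1024 figure is from lecture; the rest is the standard technique, not the lab's exact design):

```python
# Oversampled UART receive: with TICKS_PER_BIT counter ticks per bit,
# compute the tick indices (counted from the start-bit edge) at which
# to sample each data bit -- skip the start bit, then hit each bit's center.
TICKS_PER_BIT = 1024

def sample_points(n_bits):
    return [TICKS_PER_BIT * (i + 1) + TICKS_PER_BIT // 2
            for i in range(n_bits)]

print(sample_points(2))  # [1536, 2560]
```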

Notion of factoring state machines. Or you can draw 10000 states if you want.

Something about Kris + scanners, it always ends badly. Will be putting lectures on the course website (and announce on Piazza). High-level, look at pipelines.

MIPS pipeline

For sure, you should be reading 7.5, if you haven't already. H&H do a great job. Slightly different way of looking at pipelines, which is probably inferior, but it's different.

First off, suppose I've got something like my Golden Bear power monitor, and $f = (A+B)C + D$. It's going to give me this ALU that does addition, ALU that does multiplication, and then an ALU that does addition again, and that will end up in my output register.

There is a critical path (how fast can I clock this thing?). For now, assume "perfect" fast registers. This, however, is a bad assumption.

So let's talk about propagation delay in registers.

Timing & Delay (H&H 3.5; Fig 3.35,36)

Suppose I have a simple edge-triggered D flipflop, and these things come with some specs on the input and output, and in particular, there is a setup time ($t_{\mathrm{setup}}$) and a hold time ($t_{\mathrm{hold}}$).

On the FPGA, these are each like 0.4 ns, whereas in 22nm, these are more like 10 ps.

And then the output is not going to change immediately (going to remain constant for some period of time before it changes), $t_{ccq}$ is the minimum time for clock to contamination (change) in Q. And then there's a maximum called $t_{pcq}$, the maximum (worst-case) for clock to stable Q. Just parameters that you can't control (aside from choosing a different flipflop).

So what do we want to do? We want to combine these flipflops through some combinational logic with some propagation delay ($t_{pd}$) and see what our constraints are going to be on the timing.

Once the output is stable ($t_{pcq}$), it has to go through my combinational logic ($t_{pd}$), and then counting backwards, I've got $t_{setup}$, and that overall has to be less than my cycle. Tells you how complex logic can be, and how many stages of pipelines you need. Part of the story of selling microprocessors was clock speed. Some of the people who got bachelors in EE cared, but people only really bought the higher clock speeds. So there'd be like 4 NAND gate delays, and that was it. One of the reasons why Intel machines have such incredibly deep pipelines: everything was cut into pieces so they could have these clock speeds.

So. $t_{pd}$ on your Xilinx FPGA for block RAM, which you care about, is something like 2 ns from clock to data. 32-bit adders are also on the order of 2 ns. What you're likely to end up with is a 50 MHz part. I also have to worry about fast combinational logic -- what happens as the rising edge goes high, my new input contaminates, and it messes up this register before the setup time? Therefore $t_{ccq} + t_{pd} > t_{hold}$, necessarily, so we need $t_{ccq} > t_{hold}$ for a good flipflop (consider shift registers, where we have basically no propagation delay).

Therefore $t_{pcq} + t_{setup} + t_{pd} < t_{cycle}$.
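Plugging in the rough delay numbers from above makes the constraint concrete. A small numeric sketch of mine (the delays are the ballpark FPGA values quoted in lecture, not datasheet figures):

```python
# Setup constraint: t_pcq + t_pd + t_setup < t_cycle, so the fastest
# legal clock for this one stage is f_max = 1 / (t_pcq + t_pd + t_setup).
t_pcq   = 2.0e-9   # clock-to-Q, ~2 ns for block RAM output
t_pd    = 2.0e-9   # combinational path, e.g. a 32-bit adder
t_setup = 0.4e-9   # flipflop setup time on the FPGA

t_cycle_min = t_pcq + t_pd + t_setup
f_max = 1.0 / t_cycle_min
print(round(f_max / 1e6, 1))  # 227.3 (MHz)
```

A single RAM-to-adder-to-register stage clears 200 MHz on these numbers; real designs end up nearer 50 MHz because the worst path through routing and muxes is much longer than one adder.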

What does this have to do with the flipflop we know about? If we look at the flipflop that we've done in the past (with inverters, controlled buffers, etc), what is $t_{setup}$? We have several delays; $t_{setup}$ should ideally have D propagate to X and Y. How long is the hold afterwards? You'd like $D$ to be constant for an inverter delay (so that it can stop having an effect). That's pretty stable. $t_{hold}$ is something like the delay of an inverter (if you want to be really safe, you'd say twice that number). $t_{pcq}$, assuming we have valid setup, the D value will be sitting on Y, and we've got two inverter delays, and $t_{ccq}$ is also 2 inverter delays.

Good midterm-like question for you: if I have a flipflop with some characteristic setup and hold time, and I put a delay of 1 ps on the input, and I called this a new flipflop, how does that change any of these things? Can make $t_{hold}$ negative. How do I add more delay? Just add more inverters in the front. Hold time can in fact go negative. Lot of 141-style stuff in here that you can play with.

Given that, you have to deal with the fact that you've got this propagation time and the setup time. Cost of pipelined registers.

Critical path time, various calculations.

Hazards, Stalls, Delay slots, Three-stage pipeline

October 2, 2012

:)

Let's look at some hazards on the five-stage and then talk about what they would look like in the three-stage. In the book, 7.51, this is where they go through and look at what happens with the load word.

Must stall or use delay slot.

MMIO

October 4, 2012

Section 8.5. Not exactly perfect; we'll talk a bit about that this lecture, but it gives you a good idea how that works. Talk a little bit about 3-stage pipeline and look at what happens if you put the regfile next to the ALU instead of next to the instruction memory.

Last time, we had IMEM, Regfile, ALU all by itself, and then data memory all over there. Let's now see what happens if we stick the regfile and ALU together. Not at all clear to me on the FPGA you've got whether there's going to be a substantial benefit to one or the other. Don't think it's going to affect speed or complexity tremendously.

What you should be doing for your project is draw the basic single-cycle MIPS (figure 7.11?), then add pipeline registers, and label every single wire in there; everything should be lined up; make sure that you put ALUOut (there's going to be more than one of those: it's going to cross from the execution phase to the memory phase, and so you're going to have at least two of these).

Good midterm question: you decide you're going to forward. If we choose to do that, what is $T_{c,min}$ in this case?

Note: the memory map in the book is not the same as what we're using in the project, but the concepts are all the same. So we've got a text segment where the program actually goes; your global variables have some place in here, and you initialize $gp to that, and the top chunk is the I/O (called "reserved" in their diagram). You've got the heap that grows up, and the stack that grows down. It's this region where you've got your MMIO, and I want to make clear what's going on.

From the book, figure 8.28: you've got your ALU, regfile, muxes, and there are two things that come out of here that are important when you're doing a read or a write. You've got the address, and you've got a DataIn if you're doing a write, and there's your memory. There's also a DataOut from memory (32 bits). So far, we've been saying that this is just Dmem. In reality, Dmem is just one of the things that lives in here. We've got a block that we've been calling Dmem. There are other things: in particular, there's your UART controller, and your UART controller has a bunch of lines that go to and from the actual UART that you build, which has a single SIn and SOut. This is what you connect your terminal to on this side, and there's a bunch of things that go across this interface. Two sets of three lines to represent the ready/valid interface. Control line that tells the memory when to write; address and data going into Dmem (it's only 12 bits that go in; you have to figure that out). This guy up here, this is your decoder. You also have to have your instruction memory live inside of here, and it for sure needs to get that Din and the address input as well. It also presumably has only 12 bits of address input for your project, and it also has a write on it. And this controller needs to be able to see some bits of the address.

Stack, Procedure Calls, Exceptions

October 9, 2012

The homework this week is pretty much just things you're working on: how did you implement j and jal?
We are going to talk about the stack and procedure calls, and also exceptions (interrupts!) -- 6.7.2, 7.7. Also, look at your green sheet.

From the book, like we drew last time, we've got our memory allocation system that starts at 0, and somewhere way at the top, we've got FFFFFFFC, which gets chopped up into pieces. In particular, in the normal memory map, all of this is reserved, which ends up being memory-mapped IO devices, and you start off with your stack pointer pointing right here at 7FFFFFFC, and then you've got another reserved section on the bottom for text, static, and room for your stack to grow down and your heap to grow up, and your stack pointer and global pointer. Your program counter ends up being initialized at the bottom of the text section. Some differences.

How do we do procedure calls? In our book, that's section 6.4.6. We'll look at it from the simplest case (no args, no return value), then we'll see how to do args and return values, then local and global variables. Your code is main, which just calls a function simple, and simple, which is a void, just returns. 61C material. jr $ra.

Turns out that doesn't work for a MIPS architecture. What is the address that goes into the return address when this is called? 0x8: the book does not have delay slots. In particular, you're going to have to put NOPs in to get this to work. Actually, at memory location 1004, you just happened to have an instruction that was all 0, except it had a 1000 at the very end. That's jr, so we end up with nastiness: infinite looping, potentially.

So suppose I have args and stuff. We end up with \$s0 = y. Suppose we're at 1000; how does this thing get called? \$a registers, \$v registers. Utilization of delay slots. What needs to be saved on the stack, and when? Arrays on the stack, etc. When you complete a procedure, the stack pointer should be back where it was before the call. So, you put it all together, and you get the stack frame shown on your green card. You may have any args above 4, you do your jal, a0-3 may contain any args, ra has the return address we talked about. The standard order is a0-3, ra, s0-7, and local variables and sp, during the procedure. Then you're going to do a jr ra at the end, and v0-1 contains stuff.

void input() {
    char s[20];
    gets(s);
}

So what's the stack going to look like? Need to save ra, among other things. The input that some friendly person puts in: some string of characters here, 7fff0028, and then some string of 20 more characters. Suppose that's my input. What's going to end up happening? I'm going to say jump and link to gets, and when I'm all done, I'm going to do a jr to my return address, which I'm going to have to load back into ra. I'll fix up my stack pointer, and then I can finally return. Buffer overflows (7fff0028)! Notion of stack traces.

Finally, you've got global or static or extern variables, which go into that part of memory called static. Pretty straightforward. Something you may not have run into is volatile. When might I declare something to be volatile? Modification by interrupt, UART control registers, etc. Tells the compiler to check the memory-mapped value always, not just check the register. Lots of compilers do the wrong things with volatile.

Exceptions, Interrupts, Traps

Why do we need exceptions? You've got IO devices and other peripherals which do things asynchronously with the software.
When they're doing this: keyboard, mouse, touch screen, buttons, radio(s), ethernet, SPI/I2C; the ALU can give you an overflow condition or divide-by-zero; the controller can say bogus instruction. A lot of things that don't fit into the natural flow of software. Popular one: overflow. What do you do if the ALU tells you that something is bigger than you can store? What we'll do next is learn how to write interrupt handlers.

Correction, Interrupts

October 11, 2012

Midterm Thursday. In-class, open book, open notes, no silicon (exception: watches) (OB/ON/NS). The content of the midterm: all the homeworks (up to and including the one due this Friday) and all the labs (up to and including the one due Friday/Monday). No homework next week, and the next checkpoint is in two weeks.

Last time: we had a jal to a function, followed by a nop, and then we moved the result of this function into s0. The machine is designed to execute the branch delay instruction immediately after making the jump.

Exceptions

Problem: inputs arriving synchronously with clock edges but asynchronously with software execution. The keypress is the classic example of that. Options: you can do blocking, polling, or interrupts. Blocking is fine for question-answer games (like you learn when you're first doing programming); polling is maybe useful if you're already in a tight loop. Either way, it's extremely difficult to enforce that somebody goes back and checks quickly enough before the data gets overwritten.

Issues: hardware calls the procedure. Can happen anywhere or anytime. No args, no return values. How to figure out what happened? How to call and return from the procedure? When you get an exception or interrupt:

• finish the current instruction.
• save a copy of the next instruction's address.
• load the address of the interrupt handler into the PC.
• set some bits somewhere to tell that procedure what happened.
• when done with the handler, restore the saved PC and resume execution.

Be careful of forwarding.
There are a whole bunch of different ways that processors deal with exception handling. Pretty much every machine has a different way of dealing with it. A common thing to do is push the program counter, some kind of PSW (processor status word), and some set of registers onto the stack, and to use the type of exception to determine which handler gets called (i.e. which address to load into the PC): vectored interrupts. Also common to allow higher-priority interrupts to interrupt an interrupt handler. PSWs vary by processor, but they may say which interrupts are enabled and what just happened. Might have a section of bits inside that tells you what caused the interrupt. Lots of things that can be inside of that thing.

It turns out that MIPS does none of this. MIPS does arguably a simpler thing, so it's easier to implement. MIPS has coprocessors, which are basically completely separate register sets that you can reference. There's one for floating point, and there's one for processor control (CoP0). In principle, there are 32 registers for each coprocessor. For instance, CoP0 r12 is the status register; CoP0 r13 is the cause register; CoP0 r14 is the EPC (exception program counter). There's only one exception handler address that ever gets invoked. For the nominal machine, it's way up in restricted memory at 0x80000180.

In CoP0 r12 (status), bit 0 is IE (interrupt enable). If that is 0, your processor will not take interrupts. In fact, it is cleared on interrupt by the hardware and set by the interrupt handler on exit. In general, in MIPS, you have one interrupt handler; it runs when you get the interrupt, and it does not let any more interrupts happen. This is important because we have no interrupt stack. In general, you look very carefully at the code that goes into interrupts, because it's really bad if that stuff breaks. It's at the core of everything that happens in the machine. Bits 10-15 of CoP0 r12 correspond to IM (interrupt mask).
When we look at r13 (cause), the corresponding bits are the IP (interrupt pending) area. There's also a code in here that we won't use: if it is equal to 0, then that means you have an interrupt. In particular, for us, cause<15> is going to be a timer interrupt, cause<14> is going to be a RTC (real-time clock) interrupt, and cause<10> is going to be a UART interrupt (set by hardware when TXready goes high or RXvalid goes high).

What about timer and RTC? Turns out you have a couple other registers in here. r9 ("count") is the cycle count, and r11 ("compare") is the compare register. Between them, they let you measure the execution time of something in cycles and set a timer. The timer bit is set when r11 == r9. RTC goes high every time the cycle count overflows. Fairly low-priority interrupt. Allowed to execute roughly 4 billion instructions between ticks. Finally, r14 is the exception PC.

These are the registers you'll have to add to your architecture. If you choose to make an exact copy of your 32-register file, that'll be inefficient. Or you could just have 5 registers and know what their addresses are so you can mux them into the datapath. You'll want to generate an interrupt when IE & (IM & IP) is true. The very first thing that happens when you start servicing an interrupt is that you clear the interrupt enable bit. If another interrupt happens, then when the handler re-enables the enable bit, it'll go back and process it again.

So we end up with two or three new instructions. If I remember correctly, the assembler doesn't recognize one of them. For sure the ones it does recognize are move to and from coprocessor 0, i.e. mfc0 and mtc0. The opcode is 16 for both of these, but the rs value is 0 for mfc0 and 4 for mtc0. rt is always the MIPS register, and rd is always the CoP0 register. For instance: mfc0 \$k0, Cause; mtc0 \$k1, Status. When you're moving to CoP0, you're going to have your two outputs. Need to get wd for the RegFile to also consider coming from CoP0.

On the MIPS, what do we do?
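The IE & IM & IP gating can be written out directly. A sketch of mine, using the bit positions given above (Status<0> = IE, Status<15:10> = IM, Cause<15:10> = IP):

```python
# Interrupt-request logic: take the pending bits from Cause, mask them
# with IM from Status, and gate the whole thing with the IE bit.
def take_interrupt(status, cause):
    ie = status & 1                   # Status<0>: global interrupt enable
    im = (status >> 10) & 0x3F        # Status<15:10>: interrupt mask
    ip = (cause >> 10) & 0x3F         # Cause<15:10>: interrupt pending
    return bool(ie) and (im & ip) != 0

# UART pending (bit 10), unmasked, IE set -> take the interrupt
print(take_interrupt(status=(1 << 10) | 1, cause=(1 << 10)))  # True
# same pending bit with IE cleared -> no interrupt
print(take_interrupt(status=(1 << 10), cause=(1 << 10)))      # False
```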
We finish the current instruction (up to you to determine what that means); the EPC register gets the next PC (whatever that means to you); the PC gets 0x80000180 (on a real MIPS; we'll probably have a different address for you); and status<0> gets 0. 0x80000180 is the interrupt vector. The code at the interrupt handler:

• figure out what happened
• deal with it
• reset the appropriate cause<15:10> bit(s)
• set status<0> back to 1
• PC gets EPC (in some ISAs, that becomes eret (exception return))

Example interrupt handler code (masks match cause<15> = timer, cause<14> = RTC):

    0x80000180: mfc0 $k0, Cause
                mfc0 $k1, Status
                and  $k0, $k0, $k1     # masked, pending interrupts
                andi $k1, $k0, 0x8000  # 1 << 15 (timer)
                bne  $k1, $0, Timer_ISR
                andi $k1, $k0, 0x4000  # 1 << 14 (RTC)
                bne  $k1, $0, RTC_ISR

Note: all registers other than $k0 and $k1 must be saved on the stack (and restored before the interrupt handler returns) if used, since an interrupt may happen at any point in time. Further: you may use the stack, provided that you restore the stack pointer when you exit from the interrupt handler.

Secondary note: always reserve stack space before using it, especially if interrupts are in play.

At this point, I can do several clever things: I could use IP to determine where to hop to, or I could go through bit by bit.

At the end of the ISR, it's either eret, or:

    mfc0 $k1, Status
    ori  $k1, $k1, 1
    mtc0 $k1, Status
    mfc0 $k1, EPC
    jr   $k1

Midterm Review, Cache

October 16, 2012

Midterm Thursday, OBONNS. Review session tonight. Project checkpoint 1 is behind us (or most of us). Checkpoint 2 and 3, we've rearranged things a little bit to allow TAs to get the interrupt one done in time. Last time I said exceptions and interrupts would be next, but that'll be CP3; the next will be integrating cache.

MT review, Cache. Material will be based on homework and labs/project. Not going to throw anything at you that you haven't seen before, no need to invent anything new, not going to make it tricky. Basically, set of things I will choose from is K-maps, MIPS datapath/pipeline/assembly, ready/valid interface, UART, Verilog, RAM + ALU datapath, FF and CLB internals.

All this is stuff that's either been on a homework assignment or is part of your lab. So not on exam: exceptions, cache.

K-maps, broadly: given a description either as minterms or maxterms or f(A,B,C,D) or a truth table, or maybe even gates, I want you to be able to go from that to drawing the K-map (four variables max), find the minimal sum of products or product of sums, and then implement it with NOR/NAND/whatever. Nowhere up there does it say accurate definitions of terms. It's open-book and open-note.

MIPS, for sure the single-cycle datapath: add some new piece of hardware or add an instruction; how would you change the datapath if you had to add something to it (a new register, a floating point unit)? The pipeline is obviously what we've covered most: sequencing of instructions (how they work their way through the pipeline as other instructions go through), predominantly for the 3-stage pipeline (I don't think I'll give you a five-stage pipeline, but I might, since that's been on your homework); hazards and forwarding. Calculating the minimum clock period given some particular arrangement of pipeline registers and propagation delays of the various elements; flip-flop stuff, logic stuff, topology of the implementation, etc. What else can I ask about pipelines? That's probably it.

Ready/valid: draw traces (suppose I give you a clock and what the ready line does; what is the valid line and what shows up on the datapath?). For you, that's related to the UART and certainly understanding the UART interface there. Revamp to lab next time around; still not ideal. At this point, now that you've done your checkpoint 1, you ought to know what's happening cold. What happens at the bit-level? Bits going over wire; how does it work?

Verilog for simple FSM. "Design a FSM that outputs a 1 when XYZ". Your job is to take that and turn it into state encoding, state register; next-state logic; output logic. Do it all cleanly and clearly so we can say that you know how Verilog works.

RAM + ALU + registers question: "find the min/max/XYZ of values in SRAM". Draw datapath, draw state transition diagram. A big part is figuring out these are separate and make sure you aren't putting control into your datapath and being clever by throwing gates over the control.

FF internals: down to FETs. If I tell you to give me a D flip-flop with synchronous load and asynchronous reset, you ought to be able to draw the gates that do that, and show how the gates are implemented with FETs. Add some functionality to the flip-flop (as on the homework). Trace output.

Similarly, CLB internals: probably not going to ask you to take that down to transistors. Will be provided with any datasheet info necessary. Related to this, I may have something on the registered PAL or something like that.

Caches

Why is it we have registers in the MIPS CPU? Immediate access. Faster, lower energy per operation. Multiport (we can have several read ports active at the same time as a write port) -- that's a big win. Shorter addresses, so we can actually encode these within single instructions.

First two: sort of a first version of cache. So if we look at the hierarchy from registers to cache (which is SRAM) to what I'll call main memory (DRAM), and then I guess all the way down here, we've got disk (and disk cache), it's all about binding. At the end of the day, an add refers to variables, which had to be bound to something. Whether that's a memory address or a register address, there's some kind of binding.

The binding to registers is done by the compiler (or whoever's writing the assembly, in general). The cache is managed by the cache controller. Main memory is handled (mostly) by the operating system.

In terms of size of transfers, registers are 1B to 8B, depending on the machine you have (for our MIPS, that transfer is 4B). Cache is on the order of 8B to 128B (we'll be doing 16B chunks), and then transfers between main memory and disk are block sizes from 512B to 4KB (and we don't have a disk).

And speed: registers are on the order of 0.1 - 1ns today (you can get regfiles that operate faster than 10GHz, but the problem is they dissipate too much power per unit area, so we've stopped turning up the clock and figured out how to take advantage of the scaling). Cache is somewhat slower; on the order of 1ns, and main memory is around 10-100ns, depending on what's going on.

There are lots of analyses you can find and lots of figures that relate the cache size to the cache miss rate, and let me draw your project architecture before we talk more about that: you've done your MIPS core at this point, which has an instruction and data memory interface. The next step is we're going to put in our DDR2 SDRAM (256MB), and then a memory controller called a memory arbiter. There's actually another block in here called the Xilinx memory interface generator (XMIG) that lives between the arbiter and the SDRAM. At the end of the day, that interface just has one address going in and data going back and forth. And each one of these is going to be talking to that memory arbiter, and our block RAMs become the instruction cache and the data cache.

When all things go well, this looks no different. But if the address you're looking for isn't in your instruction cache or data cache (both of which are going to be 8KB), you have a cache miss. The miss rate on a log scale (for 1 kB to 1 MB) looks different depending on the cache type, the code being executed, the architecture of your processor, and just about everything else you can think of. Generally it's fairly linear on a log-log plot up to some point, and then it drops drastically. A register access is on the order of a hundred times faster than a memory access.

Big caches: take more chip area, and in an era where you can put literally billions of transistors on a chip, that's not the problem: big caches are slower and more energy-costly.

One of the things you'll be doing is talk to a video chip, which will generate the video output that drives your monitor, and it'll use memory-mapped IO, so it'll read out a megapixel every thirtieth of a second, which you'll have to write out to the arbiter. Haven't yet talked about why it's slower and costs more power; we'll get to that at the end.

If you look at the Intel i7 architecture, it's 1-8 cores on a chip, and each core has an instruction cache and data cache (L1); those are 32KB each. We then have an L2 (unified instruction and data) cache, and then finally an L3 cache that's shared by all processors (6MB), which then goes off to DRAM.

L1 cache is about 0.5 ns, L2 is about 2.5 ns, and L3 is 10 ns. Off-chip is on the order of 40-100 ns, depending on what you put down there.

Three layers of cache is not the end of the story. It turns out that in fact, we can keep playing this game that very often, your PC goes into your L1 instruction cache, and you get an instruction word out of this thing that might be 128 bits wide (in some sense, L0), and then you've got 2 bits of control that decides which word you're going to grab, which is also controlled by the PC, broadly.

On the other end, you've got your register file and your ALU, and this might be like ours on the MIPS, 32 by 32; Bill Dally wrote a paper showing that in 28nm CMOS, for a fused multiply-accumulate operation (where you read out three 32-bit operands and write back AB + C), the register reads/writes cost roughly two times the energy of the FMAC (floating point multiply-accumulate) itself. So he put 4 32-bit registers inside the ALU so that he didn't have to do that, and that cut the overall power consumption by a factor of two (basically, he added a register cache, analogous to an L0).

So how about von Neumann (Princeton) and Harvard?

Harvard Architecture

You have your program counter address going to some instruction memory, and you get some instruction out, and you have some kind of registers and ALU giving you your address and exchanging data with some data memory.

Princeton Architecture

You have your PC, RF, ALU, and peripherals, all of whom talk to an address bus and a data bus. Shared memory, any of them can use it.

Interrupt Service Routines

October 23, 2012

What you choose to put into HW versus ISRs is a super-important part of the design of digital systems.

What's going on? What generates an interrupt? An interrupt is essentially a normal procedure call triggered by some hardware event. The ones we use are overflow of COUNT, COUNT == COMPARE (timer), and the UART's RXdataValid and TXReady.

Those three boxes generate interrupt requests (these are all IRQs coming out of particular pieces of hardware on the chip), and those interrupt requests go into the Cause register, and in particular, the timer firing goes into bit 15, and the overflow goes into bit 14, and down here at bit 10 is where the UART lives. That whole range is the standard range for hardware interrupts on a MIPS machine.

Setting one of those bits might not do anything by itself. On the next rising edge after some event happens, the corresponding bit should be set; that does not necessarily generate an interrupt (for one, the hardware needs to check InterruptEnable).

Note: these are simply boolean flags (not queues), so the ISR has time limits on how long it can take to execute before events are lost.

The interrupt mask bit must be high for the interrupt that is pending. There's a bunch of AND gates lined up down here that take those bits, and if any one of their outputs is true, then we have a pending interrupt that is not masked.
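That AND-then-OR structure can be sketched in C. This is just a behavioral model (the function name is mine), assuming the field layout from lecture: Status bit 0 is IE, Status bits 15:10 are IM, and Cause bits 15:10 are IP.

```c
#include <stdint.h>

/* Sketch of the pending-interrupt logic: per-line AND of mask and
 * pending bits, OR-reduced, gated by the global interrupt enable.
 * Field positions (IE = Status<0>, IM = Status<15:10>,
 * IP = Cause<15:10>) are as described in lecture. */
int interrupt_pending(uint32_t status, uint32_t cause) {
    uint32_t ie = status & 0x1;           /* global interrupt enable */
    uint32_t im = (status >> 10) & 0x3F;  /* interrupt mask          */
    uint32_t ip = (cause >> 10) & 0x3F;   /* interrupts pending      */
    return ie && (im & ip);               /* AND per line, then OR   */
}
```

For example, with IE set and the timer bit (15) set in both mask and cause, this returns 1; clearing IE or the mask bit makes it 0.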

Make sure software enables interrupts. What do you have to do? mfc0 $t1, Status; ori $t1, $t1, 1; mtc0 $t1, Status.

I want you to continuously count to 100M in various ways and print the number of cycles: register variable, addi; register variable, function call; DRAM variable, addi; DRAM variable, function call. Finally, I want you to be able to disable and enable the timer ISR.

Compilers don't do a very good job with function calls and volatile.

Caution is indicated for wraparound of FIFOs. Circular buffers. Fill FIFO, then prompt UART.

Interrupts (CP3, OS), ALU

October 25, 2012

Threads, tasks, processes, etc. Multi-tasking on a single core. Used to be that computers only came with a single core. UNIX time-sharing system. Talk about scheduling, renicing of processes. IO access? Via OS, generally. You say you want to use some IO device, and you actually have to make a syscall (a trap). Does the same thing.

Simplest version of a 32-bit adder is just a 32-bit ripple-carry adder (with a bunch of 1-bit full adders stacked on top of each other). Whatever the time is from carry-in to carry-out on a single stage will be multiplied by 32.

Motivation for speeding up adders. Slow; subtraction is just 2s complement addition.

Pister's favorite: instead of making a 32-bit adder, make two 16-bit adders with carry-in and carry-out. While the low half is calculating its sum, calculate the upper half's sum twice: once with a carry-in of 1, once with a carry-in of 0. Then use the carry-out from the low half to select which upper sum we keep. This actually takes just over half the time (need to account for the mux, which is fast). However: 1.5 times the area and power, and 2x the input capacitance (worst case, on the top 16 bits).

You can do this again for the 16-bit adders, and so we can get a propagation delay of $8t_{ci/co} + 2t_{mux}$. But now I've got 2.25 times area and power, and 4 times input capacitance.
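A behavioral sketch of the one-level carry-select scheme in C (function name mine). Hardware computes all three 16-bit sums in parallel; the C just shows the dataflow and the final mux.

```c
#include <stdint.h>

/* Behavioral model of a 32-bit carry-select adder: the low 16 bits
 * compute normally, the high 16 bits are computed twice (carry-in 0
 * and carry-in 1), and the low half's carry-out selects between them. */
uint32_t carry_select_add32(uint32_t a, uint32_t b) {
    uint32_t lo = (a & 0xFFFF) + (b & 0xFFFF);           /* 17-bit result */
    uint32_t carry = lo >> 16;                           /* carry out of bit 15 */
    uint32_t hi0 = ((a >> 16) + (b >> 16)) & 0xFFFF;     /* assume cin = 0 */
    uint32_t hi1 = ((a >> 16) + (b >> 16) + 1) & 0xFFFF; /* assume cin = 1 */
    uint32_t hi = carry ? hi1 : hi0;                     /* the selecting mux */
    return (hi << 16) | (lo & 0xFFFF);
}
```

The result must match an ordinary 32-bit add for any inputs; only the latency (and area/power) differ in hardware.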

That's carry-select. Nice and simple. There's an even simpler one that's a fun combination, called carry-bypass. The propagate signal for bit i is $a_i \oplus b_i$: each bit position generates its propagate signal to see if exactly one of its inputs is equal to 1.

All adders are used, depending on what you're optimizing for.

October 30, 2012

So back to carry-lookahead adders: propagate is XOR, generate is AND, and kill is NOR.

Kogge-Stone: most common adder when optimizing for speed. False paths. Humans still beat synthesis tools by thinking about how things work.

Next couple of minutes: what other outputs do we want out of our adder? Your adder doesn't have as many as most: usually a set of four: NCVZ -- Negative, Carry-out, Overflow, and Zero. We're using two's complement, so negative is just the top bit of the sum, carry is the top carry bit, overflow is a bit trickier -- XOR top two carry bits. Zero is trivial.
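Those four flag definitions can be written down directly. A sketch in C (function name and the {N,C,V,Z} packing order are mine); V is computed from operand and result signs, which is equivalent to XORing the top two carry bits.

```c
#include <stdint.h>

/* The NCVZ flags for a 32-bit two's-complement add, packed as
 * {N,C,V,Z} from bit 3 down to bit 0. N is the sign bit of the sum,
 * C the carry out of bit 31, V the signed-overflow test (operands
 * agree in sign but the sum does not), Z the zero test. */
unsigned add32_flags(uint32_t a, uint32_t b) {
    uint64_t wide = (uint64_t)a + b;             /* keep the 33rd bit */
    uint32_t sum = (uint32_t)wide;
    unsigned n = sum >> 31;                      /* negative          */
    unsigned c = (unsigned)(wide >> 32);         /* carry out         */
    unsigned v = (~(a ^ b) & (a ^ sum)) >> 31;   /* signed overflow   */
    unsigned z = (sum == 0);                     /* zero              */
    return (n << 3) | (c << 2) | (v << 1) | z;
}
```

For instance, 0x7FFFFFFF + 1 sets N and V (signed overflow into a negative sum), while 0xFFFFFFFF + 1 sets C and Z.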

Used to give exams with different number wheel (bias), but eh.

How to make a shifter? A 0/1 shifter shifts by either 0 or 1. Use a bunch of 2-to-1 muxes, taking input from either the same position or one over (sort of like shift registers). Need to be a bit careful at the top, depending on whether it's an arithmetic shift or not.

If it's a 0/2 shift, we can play the same game except take things from two over. This, coupled with the 0/1 shift, will give us a 0-3 shift.

If ROT, muxes at the top have to be able to grab things from end of array.

Finally, what if I've got left-shifts? Left-shifts are just rotates where I've decided to shift in some 0s.
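Cascading those 0/1, 0/2, 0/4, ... stages gives a logarithmic (barrel) shifter. A C sketch of the stage structure (function name mine), shown for a 32-bit logical right shift: each stage is a row of 2:1 muxes controlled by one bit of the shift amount.

```c
#include <stdint.h>

/* Barrel shifter as a cascade of mux stages: stage k either passes
 * its input through or shifts it by 2^k, selected by bit k of the
 * shift amount. Five stages cover shifts of 0-31. */
uint32_t barrel_shr(uint32_t x, unsigned amt) {
    for (unsigned k = 0; k < 5; k++) {  /* stages shift by 1,2,4,8,16 */
        unsigned s = 1u << k;
        if ((amt >> k) & 1)
            x = x >> s;                 /* mux takes input from 's' over */
        /* else: mux passes input through unchanged */
    }
    return x;
}
```

An arithmetic version would fill the vacated top bits with the sign bit instead of 0, and a rotator would wrap them around from the other end.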

Graphics acceleration and fun stuff

November 6, 2012

Project, Video, DRAM. Homework assignment not due; don't guarantee I'll do that for the next few weeks. Like to only test students on problems that look like homework problems they've been given. Sense of questions on final exam.

Project: got MIPS core processor with I-cache and D-cache, and you're going to do a graphics processor that contains (among other things) a fill block and a line block and some other block(s). The MIPS is going to talk to that with a simple MMIO; all of these things are going to talk to the DRAM-request-controller/arbiter; that talks to the Xilinx memory interface block, and that talks to your 256MB of DRAM.

The last piece is there's a pixel feeder down here that talks to the Chrontel DVI driver, which talks ultimately to your screen and draws pretty things on the screen. The parts you have to do primarily so far is the graphics processor. We'll give you the DVI driver, and we've already done most of the rest.

Inside this DRAM, obviously you've got your instruction and data memory; you'll have two frame buffers. This pixel feeder is going to grab frame 0 or 1 (depending on which frame it is) and put it on the screen. The interface you'll use is an 800x600 screen at 75Hz. Graphics card gets DMA; it fetches instructions and writes data to memory.

We've got 800x600 @ 75Hz. For graphics, this is (0,0) to (799,599). We'll have 24-bit color and store (for convenience's sake) in DRAM with X,Y having 10-bit addresses, 1 pixel per 32-bit word.

We'll be using 8MB total for our two frames. Wasteful, but we have plenty of RAM.

Up here, we've got 800x600 (480k pixels, even though we're allocating space for a megapixel). If we've got 480k pixels per frame, running at 75 Hz, we get 36 Mpixel/s (144 MB/s) that has to go over the interface right here. However: the pixel clock on the DVI driver for 800x600 @ 75Hz is 49.5 MHz. Why? Philo T. Farnsworth. It turns out that just like Farnsworth had back in the thirties, each line has a sync pulse and front/back porches. You've got sync, back porch, video signal (visible region), front porch, sync, and so on. Sync is 80 pixels, the front porch is 16 pixels, and the back porch is 160 pixels. When you add it all up, you end up with $1056 t_{pixel}$ for the time for a given line. And then the number of lines per frame is 600 visible, but you play the same game with lines as with pixels on the line (1 fp, 3 sync, 21 bp), so 625 lines per frame, of which only 600 are visible.

So $1056 \cdot 625 \cdot 75$ yields 49.5 megapixels per second (so $t_{pixel}$ is about 20.2 ns, and a whole line takes about 21.3 us). The visible part of the line is only about 16.2 us.
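The timing arithmetic above, spelled out in C (function names mine): horizontal total is 800 visible + front porch + sync + back porch = 1056 pixel times; vertical total is 600 + 1 + 3 + 21 = 625 lines; at 75 frames/s this fixes the pixel clock, and 4 bytes per pixel fixes the frame-buffer bandwidth.

```c
/* Pixel clock for 800x600 @ 75 Hz from the blanking intervals. */
long pixel_clock_hz(void) {
    long h_total = 800 + 16 + 80 + 160;  /* 1056 pixel times per line */
    long v_total = 600 + 1 + 3 + 21;     /* 625 lines per frame       */
    return h_total * v_total * 75;       /* -> 49.5 MHz               */
}

/* Visible-pixel bandwidth: 480k pixels/frame, 75 Hz, 4 B/pixel. */
long frame_bytes_per_sec(void) {
    return 800L * 600 * 75 * 4;          /* -> 144 MB/s               */
}
```

Note the gap between the two numbers: the DVI link runs at 49.5 Mpixel/s, but only 36 Mpixel/s of that is visible video.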

If you have to draw every frame from scratch, that means you need to erase, then redraw.

Hardware acceleration for graphics. This is the fill engine. Line engine, you give it a color and two endpoints. A whole bunch of simple ways to turn endpoints into pixels. One elegant one by Bresenham, who came up with this in the very early days of computer science: his algorithm takes (x0,y0), (x1,y1), and if you assume it's a shallow downward slope between 0 and 45, x0 < x1, y0 < y1, then you just calculate:

    int dx = x1 - x0,
        dy = y1 - y0,
        y  = y0,
        error = dx / 2;
    for (int x = x0; x <= x1; x++) {
        plot(x, y);
        error = error - dy;
        if (error < 0) {
            y++;
            error += dx;
        }
    }

If you do this right, you end up with a single cycle per x pixel.

Graphics processor is going to have a GP code register and GP frame register. These two are going to live in the MMIO space of the MIPS. That's how the MIPS talks to the processor. Software will first write graphics instructions to DRAM, then set GP frame to point to the appropriate one of these, and finally write GP code register with the address of the graphics instructions. Writing that register is the trigger that'll tell your graphics processor that it's time to execute these. Execute until it hits done, then stop.

GP instructions are going to be 32 bits each; the first byte is going to be type of instruction, and then there will be 0 or more arguments and 0 or more additional words. In particular, if type is 0, this is stop; if type is 1, this is fill (with R/G/B); if 2, this is line with R/G/B, next two words in memory are (y0,x0) and (y1,x1) in 10-bit values. Those are the three you need to implement, and you've got to come up with one more instruction to do something interesting.

Generalized memcpy; you've got an array of bitmaps, copy a character. If you really want to be fancy, the shaded triangle.
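To make the instruction format concrete, here's a sketch in C of how software might pack those words. The exact bit positions (type in the top byte, R/G/B in the low three bytes, 10-bit y above 10-bit x in the coordinate words) are assumptions for illustration, and the helper names are made up; your checkpoint spec is authoritative.

```c
#include <stdint.h>

/* Hypothetical GP command word: type in bits 31:24, 24-bit RGB below.
 * type 0 = stop, 1 = fill, 2 = line (per the lecture's numbering). */
uint32_t gp_cmd(uint8_t type, uint32_t rgb) {
    return ((uint32_t)type << 24) | (rgb & 0xFFFFFF);
}

/* Hypothetical endpoint word for the line command: {y, x} packed as
 * two 10-bit fields (screen coordinates run (0,0) to (799,599)). */
uint32_t gp_point(uint16_t x, uint16_t y) {
    return ((uint32_t)(y & 0x3FF) << 10) | (x & 0x3FF);
}
```

A line command would then be three consecutive words in DRAM: gp_cmd(2, color), gp_point(x0, y0), gp_point(x1, y1).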

Move on to talk about what's going on inside this DRAM box.

DRAM

What was the first company to sell MOS DRAMs? Intel, 1970, i1101 and very soon thereafter the i1103. 1Kb DRAM, 500 ns cycle time (asynchronous), sold for 10 dollars, which was huge: less than 1 cent per bit for the first time in the industry.

Architecture on this thing was a capacitor for storage, a word line going across, and a bit line going down (for writes). Funky voltages, by the way (asymmetric). Did not have good enough electronics to read out, so they actually did a 3T memory on that and put the access transistor with a separate bit line for reads. That's where we started, and pretty quickly, that got onto Moore's law curves. There was a time when this was the single most popular chip for a couple of years, and they started making these microprocessors.

So we have these asynchronous circuits -- no clock on them -- they got big pretty quickly. They didn't want all the pins going in. Parallel 2-bit, enough room in the 20-pin package. They hit a megabit on chip pretty quickly, and they didn't want that many address lines, so they had shared address lines between row and column, which meant that they needed latch signals to lock those into the row and column, so there was the row address strobe (they called them strobes because when it flashes, that's when you grab the value) and the column address strobe.

This is basically an edge-triggered signal, and to this day people use that word. This is RAS and CAS. So you've got a big cell array which has a bunch of word lines coming out of it, so we've got a row register with a clock input on it, inverted to RAS, and a column register inverted going to CAS. The inputs are n bits of address, D and Q, and that goes into a big DMUX, where you've got $2^n$ word lines, which goes into your cell array, and coming out of your cell array you have $2^m$ bit lines, where $m \le n$.

The first thing these need to hit are the sense amps and column drivers. What's inside today is just a single transistor and capacitor; we've got a word line and a bit line; the storage capacitor is tiny, with maybe only a thousand electrons on it, compared to the bit-line capacitance (hence the sense amps); this then goes into the row register, which has a bunch of control lines on it and a clock. If we're reading, at least conceptually, we've got a big MUX here, and we're taking those $2^m$ bits and turning them into a single output bit from the array.

And the logic on that (see handout) involves RAS, CAS, WE, and OE. This destroys contents of the row that you're reading.

DRAM, Asynchronous; DRAM, Synchronous

November 8, 2012

Chipscope feature: "trigger on data". Super-valuable; if you do it this way, you don't spend that much time on synthesis.

$t_{rcd}$: row to column delay. Used to be RAS to CAS, but that's the same thing.

$t_{CL}$: column / CAS latency.

$t_{rp}$: row precharge.

Total delay from rising edge to $t_{rc}$, which is either the row cycle or the (random) read cycle (depending on who you talk to). Row cycle is a better term.

How about writing? If you think of your row register and sense amps up here, what does a slice through this thing look like? I've got my bit line coming down, and it goes into a sense amp, which then goes into a MUX -- I want to get this into my register -- but I also have an input line coming up that I'd like to be able to write to; I need some strobe on the register, and I need to be able to write back into the array, and I need to be able to write out to the outputs.

In these days, these lines get called DQ lines for obvious reasons.

The standard refresh interval is 64 ms; for automotive temperature grade (higher), there's more leakage, so this has to be lower (typically 32 ms).

In the good old days, that used to be outside the chip; now that's built into the chip.

So I've got this row register, now, and it seems like a waste to only have one output from this thing. It turns out that $t_{CL}$ is often comparable to $t_{rcd}$. The time to decode (select) one of these lines is pretty large. The time it took a signal to travel has to do with how long your lines are and how much capacitance it has. In order to minimize that, you've got the whole row there; why not clump it in larger groups? I'll have a group of $k$ bits that come down to a MUX, and I'll get $k$ bits out of it, and pipeline that process.

My column address therefore gets broken up into two pieces.

If $t_{cl}, t_{rcd}, t_{rp}, t_{rc} \gg T_{clk}$, share pins among 2,4,8 identical banks. All of these have the same address lines, and they have separate RAS and CAS lines. Become more efficient with pins, but that makes things messier.

Getting messy, so throw more logic at the chip.

Synchronous DRAM (SDRAM) is pretty much what everyone uses these days. Yes, you're just adding a clock, but you're changing the entire interface. You still have CS, WE, RAS, CAS, but now these plus some bank address BA0/1 all become control inputs to a FSM. There's a (mostly) separate FSM for each chip.

As an example, RAS=1,CAS=0,WE=0 yields "activate", 010 is "read", 011 is "write", 101 is "precharge", which also writes back. Tried best to make it look like old days.

So. The state machine, you start in IDLE; obviously there's some bootup sequence. You go to the active state when you're given that active command input, and that takes you $t_{rcd}$, which for our chip is 15ns. From there you go to a write state, a read state, a write and precharge state, read and precharge state, or precharge with writeback state.

200MHz bus clock (5ns), most things take 3 clock cycles. The minimum time if you want to go around the loop and back to idle again, is 11. 256 MB, which is 32M $\times$ 64b = 32M $\times$ 8B. 32M $\times$ 16b per chip, 8M $\times$ 16b per bank.

How does a bank work? We have 64b-wide columns. That comes down into our row register, and then coming off of that, we come down to another register that goes into a multiplexer to give us our 16-bit output. The column address (which is sitting in the column address register) gets split up.

Double-data rate: pump on rising as well as falling edge. Every 2.5ns (with 200MHz clock), you get an edge.

Reads can happen in parallel (since we have 4 banks). DDR2 does burst reads. You're going to read 16 bits out of a given chip on every edge for four edges (and there are four chips). So 256 bits per burst read / write.

They all showed up in the space of two clock cycles. 32B in 10 ns is 3.2 GB/s. Latency is $6 T_{clk} = 30$ ns; at the end of this, I get a precharge command, after which I can finally give another activate command and a new row address. So if I'm just doing truly random transfers (writes then reads), I get one transfer per 55 ns, which is something like 18 Mtransfers/sec for completely random access.
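The burst arithmetic above, spelled out in C (function names mine): four chips each deliver 16 bits per edge for four edges, giving 256 bits = 32 B per burst in two 5 ns cycles of the 200 MHz bus clock, while truly random access is limited by the ~55 ns trip around the state machine.

```c
/* Peak burst bandwidth: chips * bits/edge * edges, over two bus clocks. */
double burst_bandwidth_GBps(void) {
    double bytes = 4 * 16 * 4 / 8.0;   /* 256 bits -> 32 B per burst */
    double t = 2 * 5e-9;               /* two 5 ns bus clocks        */
    return bytes / t / 1e9;            /* -> 3.2 GB/s                */
}

/* Random-access rate: one transfer per full ~55 ns row cycle. */
double random_transfers_per_sec(void) {
    return 1.0 / 55e-9;                /* -> about 18 million/s      */
}
```

The two-orders-of-magnitude gap between burst and random throughput is exactly why the cache fetches 16 B lines instead of single words.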

talk about graphics processor, should be able to do ff and pf much faster than software because we can read/write more than 32 bits at a given time.

Vertical blanking / sync; horizontal blanking / sync. If your pixel feeder gives you a DONE interrupt, the ISR should change the pixel feeder source address to reflect the current buffer. Various other things should change here.

Project - cost model

November 13, 2012

Technology node, defined in terms of gate length of a MOS transistor; ancient technology that's been around for a long long time is 0.18 micron. The mask cost for 0.18, let's call it 200 grand (depends on which foundry you go to). Then there's how much SRAM you get in bits per square millimeter, how much DRAM or flash or ROM you have.

Call it 100 kb/mm$^2$ of SRAM and 1Mb/mm$^2$ for DRAM, and 50k useful logic gates per square millimeter.

Technologies scale roughly with the square root of 2 with every generation. We could replicate this table, and a reasonable approximation is that all of these things scale with the technology node. Let's just approximate as linear. In particular, the mask cost is going up at least linearly. It'll probably be fine. The theoretical numbers get better as the square of the scaling (quoted), but the practical numbers rarely do.

The kerf of a diamond saw is somewhere around 25 microns. But if I blow up that die, I've got a kerf on each side, and I've also got pads on all sides in a pad ring. This is where you make a wire bond -- you put this chip down on some sort of package with a lead frame with some metal that comes out and wraps around, and you glue it down. You've got to have a machine that comes down and makes a wire bond from the package to the chip, etc. So you've got to have these pads to get your signals in and out; the pad spacing is on the order of 100 microns, but the area you have to devote is more like 200 microns. How would this chip get made? What are the inputs?

Don't need to talk about floor planning, but that is something that would be in a real discussion.

HBM: Human body model: 10k resistor, however many pF capacitor, and some voltage across the capacitor that's voltage of the human. ESD -- electrostatic discharge.

Probe card -- forest of tiny pins that drop down onto the surface of the chip; controlled, so all of these guys can be grounded when they touch down; driven by appropriate voltage. You can sprinkle test pads all over inside.

Another thing: it's very common that this die has very different technology.

Less interested that you use the right scaling so much as you knowing that there is scaling.

So let's talk about SDRAM again.

You've got a DMA controller that you're interfacing with that interfaces with your SDRAM chip. Interfaces with your devices and a bunch of FIFOs.

Note regarding masks: ones correspond to what you don't write.

For writes, this is a ready-valid interface on both of these. The ready is the nor of the two fifo-fulls (af and wdf), and valid is just the af_wr_en signal, and you have to double-pump the data. Give same address twice (not sure, but would not mess with this). Hold these values for two cycles when ready and valid.

When you write a single pixel to memory, it's going to go right through and find the right address and put the data in there. There will only be 4 zeros in the two sets of 16. For reads, same thing -- your ready is ~af_fifo_full; I don't think that the read fifo needs to tell you that its queue is full. There's something on the other side that inverts these things when talking to the SDRAM, but you don't have to worry about these things.

secondary notes: we put x after y because we draw row by row. 31 bits that look something like {5'b0, 6'bfb, 10'by, 7'bx, 3'b0}.

On your chips, each bank is 16kb wide.

Row-column delay: time between when I present the address, it goes through, and gets latched at the row register. At that time, I can give a column address, take those things from the row, and stick that into what we're calling a burst register.

Come down now on your chips to four columns of 16 bits each.

Either send the data out, or I've got it coming back in. Those are wired up to the D-Q pins (either inputs or outputs; point is that data can flow in either direction depending on whether I'm doing a read or a write out of this thing). $t_{CAS}$ or $CL$; $t_{CAS}$ is actually $CL * t_{clk}$.

Delays: $CL*t_{clk} - t_{RCD} - t_{RD} - t_{RAS}$.

If I'm in the active state in the right row, your SDRAM is 3-3-3-8, but your Verilog implementation is 5-3-3-8. For whatever reason, what's actually implemented is a CAS latency of 5.

Time to go from idle to read to idle.

Overview of the rest of the semester

November 15, 2012

Graphics Processors

November 20, 2012

Guest lecture by John Lazzaro.

What we're going to be talking about here is how the GPUs in your computer and iPad work. Giving you the 80-minute version. Help you figure out whether you want to go and work for NVidia.

Most of this block diagram lives on one chip. On this one integrated die, for the desktop, we have 4 cores, a GPU that's fast enough to handle your day-to-day use cases of simple games. What used to be on a separate chip, the north bridge, is up here (what talks to DRAM, PCI). In addition to the GPU, it has a lot of logic to drive the DVI port (aside from the high voltage parts).

The GPU that comes with the chip doesn't have its own dedicated graphics memory; it shares memory with the CPU.

There are people for whom this won't be enough; that's what the PCIe bus is for. Discrete GPU. They put the very best GPU you could make as opposed to this simpler one.

Why do we make this specialized hardware at all? You can answer this for yourself on the back of the envelope. Double buffering to prevent artifacts. Simplest thing you could build. How much work can you do on each pixel? On the order of tens of instructions.

What are we accelerating? Now, 3-D games. Back in the day, 2-D acceleration -- fast windowing systems, games like pacman.

Why should we use a special processor for graphics? Programmers generally use a certain style.

Triangle: simplest closed shape that may be defined by straight edges. With enough triangles, you can make anything.

3-D model: includes faces we can't see in the current view. Arbitrary smoothing based on granularity of triangles. Canonical example: teapot; wireframe outlines the triangles.

That's idea #1: whatever our processor is going to do, it's going to do on triangles.

Affine transformations to scale, rotate, translate, etc. Can apply these to the ensemble of triangles to apply the transformation to the (aggregate) object as a whole.
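A quick sketch of that idea (not from lecture; the function name and the 2-D simplification are mine): applying one affine transform to every vertex transforms the whole triangle, and hence the whole mesh, at once.

```python
import math

def affine(vertices, scale=1.0, angle=0.0, translate=(0.0, 0.0)):
    """Apply scale, then 2-D rotation, then translation to each vertex."""
    c, s = math.cos(angle), math.sin(angle)
    tx, ty = translate
    out = []
    for x, y in vertices:
        x, y = x * scale, y * scale          # scale about the origin
        x, y = c * x - s * y, s * x + c * y  # rotate about the origin
        out.append((x + tx, y + ty))         # translate
    return out

# A unit right triangle, rotated 90 degrees and shifted right by 1:
tri = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
moved = affine(tri, angle=math.pi / 2, translate=(1.0, 0.0))
print(moved)
```

The same matrix applies to every vertex, which is exactly what makes the per-vertex work so easy to parallelize.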

Wireframes are boring; vertex shading -- smooth gradient, we want to use interpolation here. Real graphics actually takes this a step further and considers light sources; what you specify on the edges is how these things absorb light as functions of angle of incidence, intensity, etc; how the light hits the vertices.
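Vertex-color interpolation along one triangle edge can be sketched like this (a toy version; real hardware interpolates in screen space with perspective correction):

```python
def interp_color(c0, c1, t):
    """Linearly interpolate between two RGB vertex colors; t in [0, 1]."""
    return tuple(a + (b - a) * t for a, b in zip(c0, c1))

red, blue = (255, 0, 0), (0, 0, 255)
mid = interp_color(red, blue, 0.5)  # smooth gradient halfway along the edge
print(mid)
```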

We see a 2-D window into the 3-D world (orthogonal projection; 4-D?). This story will only really be true for the next decade or so. You have 3-D TVs that don't need the glasses. These days, they can do something like 50-60 views. Once that happens, we won't be seeing a 2-D window so much as an actual 3-D world. That isn't here yet; this talk is about the hardware we need today.

To go from this three-dimensional world to this two-dimensional space, we must project each triangle that might face the eye onto the image plane. Then, create fragments on the boundary of the image plane triangle (rasterization).
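The projection step, in miniature (eye at the origin looking down +z, image plane at z = f; the focal distance f and the function name are my own, not from lecture):

```python
def project(vertex, f=1.0):
    """Perspective-project a 3-D point onto the z = f image plane."""
    x, y, z = vertex
    return (f * x / z, f * y / z)

# The same offset twice as far away lands half as far from the center:
near = project((2.0, 0.0, 1.0))
far = project((2.0, 0.0, 2.0))
print(near, far)
```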

Went from three-dimensional world to two-dimensional world on the screen. Called pixel fragments because a screen pixel color might depend on many triangles (e.g. a glass teapot). Need a system general enough to process many triangles.

Need to figure out color -- shading. There's thousands of SIGGRAPH papers saying how to do this, but there are two basic approaches: (a) physics of what copper looks like when you shine light on it (purely simulated physics -- you know the material and its properties, lots of math, but you get nice pictures when it all works -- only recently has this become feasible), or (b) you hire an artist to paint a surface of the teapot; we map this texture onto each fragment during shading.

The final step is reassembly / output merge.

Luxo Jr.: first Pixar movie.

Graphics acceleration (back to hardware for the rest of the talk -- couldn't talk about hardware until we understood the motivation).

Algorithms were generally hardwired, programmable CPUs for vertex and pixel shaders. Specialized programming languages for specialized CPUs.

First difference is a set of input registers (read-only). Outside world puts this in. Only one vertex at a time is placed in; you do that; the program that gets put in is run once. Short code (e.g. 128 instructions), the same code runs on every vertex. Then the result gets put in the output registers, which are write-only. This goes out to the rasterization hardware. And there's some constant registers the general-purpose CPU could change; also read-only for the shader CPU. Very specialized thing; optimized in many ways. Working on 4 floating point numbers.

What really makes this whole thing work is that it's trivial to parallelize: processing of each vertex is independent. The only thing you have to worry about is ordering: the order of incoming vertices should be the same as outgoing vertices.

Pixel shader works similarly: takes one fragment at a time; outputs one fragment at a time. Texture lookups are handled by math built into the memory system.

Basic idea: replace specialized logic with many copies of one unified CPU design. Consequence: you no longer see the graphics pipeline when you look at the architecture block diagram. Only way to make graphics work well with DirectX 10.

New pipeline features: geometry shader lets a shader program create new triangles: one vertex in, many vertices out. Things you no longer need the CPU to do.

The other thing you could do was use recurring algorithms that lived entirely on the chip. Most of the specialized instructions have more or less gone. Shader CPUs are more like RISC machines.

More complicated things possible without CPU.

Delay and Power

November 27, 2012

SDRAM recap. Gave you a handout about a week ago. In our small-outline dual-inline memory module, we've got four chips. Each chip is 512 Mb; overall, the whole thing has an I/O D-Q width of 64 and a bunch of control signals: address, bank address, foolishness with RAS, CAS, WE, ..., with names tied back to signals that actually existed way back when.

All of that stuff, address pins, bank address, and everything else, cut across those four chips: they're shared. Also write mask that goes on here as well.

So what happens inside this thing? Inside each chip, we know that there are address input, bank input, command input, 16-bit D-Q, and those are all shared by four separate banks. Essentially that's all parallel, and the only place these things come together is in these D-Q I/Os.

Separate row register, burst register: all that stuff drawn on that sheet is replicated four times. You can give independent commands to the banks; the only difference is that you can't do it simultaneously; commands must be issued sequentially. Also, there's a 200 MHz clock; all of these commands are synchronous with that clock.

So that's the picture; on the figure I drew what's inside one bank of this thing. You've got a row address register; your address line comes in and 13 bits go into the row address register, which goes into a decoder, which spits out 8k word lines that go across the array. Inside, a word line activates a row of transistors, each of which looks at a storage cell; there's a bit line that goes up, and sitting on top of that there's a PMOS transistor for precharge. The core of this thing is very much like the old async DRAM chips, but now there's a bunch of digital stuff wrapped around it.
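Sanity check on the decoder width (a sketch; `decode` is a hypothetical model of the one-hot row decoder): 13 address bits give $2^{13} = 8192$ word lines, the "8k" above.

```python
ROW_ADDR_BITS = 13
WORD_LINES = 2 ** ROW_ADDR_BITS  # 8192, the "8k" word lines

def decode(row_addr):
    """One-hot row decoder: exactly one of the 8k word lines goes high."""
    assert 0 <= row_addr < WORD_LINES
    return 1 << row_addr  # bit i set means word line i is active

print(WORD_LINES, bin(decode(3)))
```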

The storage array can hold its value for some time, but you have to refresh it every now and then (charge leakage).

That comes down to a bunch of sense amplifiers and column drivers. Amplify that signal to get something digital, which you can latch into the row register, or you can feed it back in. We talk about precharging on the SDRAM, but it's really taken care of by a state machine on-chip. A lot of terminology is now misleading.

So what's going on in here? This row register now has 256 copies of 64 bits each on one chip, which also means in one bank; on the whole SO-DIMM, we've got 256 by 256 bits. So you can think of this whole thing as having a bunch of 64-bit outputs coming out of it, and that goes through a MUX which selects those to go out a burst register.

If I've got 256 bits and I'm going to select one of them, we need eight select bits.
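That's just $\log_2 256 = 8$; a one-line check:

```python
import math

row_words = 256  # 64-bit words in one bank's row register
select_bits = int(math.log2(row_words))
print(select_bits)  # 8 select bits to pick one word out of 256
```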

You know that the chip has only 16-bit input and output; this guy has 16 kb ($256 \times 64$ bits) in his row register, so you're trying to select down. A four-to-one MUX is a lot faster than an eight-to-one MUX. This is the one that runs hyper-fast: you get an output every rising edge of the clock.

When we put that together, the burst register is 4x16, or I can think of this as 4x64 when I put these all together.

I latch in a row address and wait; that is propagation. etc. So again. This is one bank inside of there, and in principle, you could have all four of them operating in phases. Not quite as broad of an interface as you'd like, because these pins are shared.

So if I draw my 200MHz clock as a bunch of rising edges, and I've got my command and address inputs and my D/Q I/Os, the command interface is synchronous with the clock (not DDR), 5 ns, if I give the command "activate" (in the old days, this would have been the row-address strobe), this would start the process. The decoder takes a while. Need to make sure the row is stable.

The logic doesn't let you do that.

So I apply a row address; there's a state machine on here. There are certain big states like IDLE, and in between them, there are several smaller states.

In both cases, it takes some number of ticks for the state machine to do the thing it's supposed to do.

notes: row-column delay is time between when row and column are specified; CAS latency is time between when column is specified and when data shows up.

Some number of states before you make it back to IDLE. Once you're back in IDLE, you can give another activate command. That delay there is the row precharge time (3 cycles). It turns out there's a minimum time from one activate to a precharge command; that is the row address strobe delay (8 cycles). Finally, the row cycle latency is just the sum of the row precharge and the row address strobe delays (11 cycles). So you see the chips given in CAS latency, row-column delay, row precharge, and RAS (3-3-3-8 possible; 5-3-3-8 implemented on chip).
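Putting numbers on those parameters for a 200 MHz part (my arithmetic, using the 5-3-3-8 figures quoted for the lab's Verilog model):

```python
t_clk_ns = 5.0  # 200 MHz clock period

# CAS latency, row-column delay, row precharge, row address strobe
# (all in clock cycles): the "5-3-3-8" implemented on the lab's chip.
CL, t_RCD, t_RP, t_RAS = 5, 3, 3, 8

t_RC_ns = (t_RP + t_RAS) * t_clk_ns       # row cycle: 11 cycles = 55 ns
t_CAS_ns = CL * t_clk_ns                  # column address to data
activate_to_data_ns = (t_RCD + CL) * t_clk_ns  # from activate, if idle before
print(t_RC_ns, t_CAS_ns, activate_to_data_ns)
```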

Maybe that helps explain why things are set up the way they are. If you're interfacing with this thing, and you want to do a read on this from DDR, one way to do that is to have registers for the four data values, which are going to come out every 2.5 ns and have all of them share those 64-bit lines but have their clocks be controlled by some separate thing. I can then put those values together and stick them into a MUX that now goes into a read-data FIFO.

If on the exam I give you specs for some real chip, I'll expect you to know what's faster.

Last concept(s): Delay and Power. Going to go all the way back to lecture 1, when we were talking about CMOS inverters.

Time between input goes high and output goes low is $t_{pHL}$ (pull from high to low), and the time between input goes low and output goes high is $t_{pLH}$ (pull from low to high).

Capacitance actually now dominated not by transistors but by that of the wire.

Truth is that transistors don't look like resistors nowadays; since we're at velocity saturation, they look more like current sources. Reality is that it's not actually either; it's somewhere in between. With the definitions we've got, our times are $\tau \ln 2$; we're not going to worry about that. So what is $RC$? $R$ can vary by orders of magnitude. But typically they're designed so that very roughly, you get something like 1 k$\Omega$ (on the order of this; maybe tens, but not much more). $C$ is on the order of $1\,\mathrm{fF}/\mu\mathrm{m} \times (W_n + W_p)$.

So if I've got a 22nm process, and the n-channel device is ten times that wide, and p-channel device is five times, this number might be something like a micron, not much more; so the ballpark is that this is roughly a femtofarad. RC therefore is about a picosecond. Light travels 300 microns in this time. So what if my inverter is 100 micron away? Another very rough number is about 200 fF per millimeter, which is about .2 fF per micron.
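Those ballpark numbers, worked through (assumptions as in lecture: effective resistance around 1 k$\Omega$, 1 fF per micron of gate width, $W_n + W_p$ of roughly a micron):

```python
R_ohm = 1e3              # very rough effective switch resistance
c_gate_f_per_um = 1e-15  # ~1 fF per micron of gate width
W_total_um = 1.0         # W_n + W_p, roughly a micron in this example

C_f = c_gate_f_per_um * W_total_um
tau_s = R_ohm * C_f
print(tau_s)  # ~1e-12 s: about a picosecond per stage
```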

So that covers timing. How about power and energy?

We've got two components: resistors (dissipate, no storage) and capacitors (storage, no dissipation). The energy drawn from the supply per transition is $\Delta E = QV = CV^2$. The energy stored by the capacitor is half of this, though; the other half is dissipated in the form of heat by the resistor. If you make the resistance small, this looks more like an inductor, and we get ringing.
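A quick energy-bookkeeping check of that factor of two (numbers arbitrary; the split is independent of R):

```python
C, V = 1e-15, 1.0              # 1 fF switched through 1 V

E_supply = C * V**2            # energy drawn from supply per 0->1 transition
E_cap = 0.5 * C * V**2         # energy left stored on the capacitor
E_resistor = E_supply - E_cap  # the rest is dissipated as heat in R
print(E_resistor == E_cap)     # exactly half each
```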

What frequency do we care about? The rate of 0-to-1 transitions, which is generally much less than the clock frequency. That ratio, the 0-to-1 transition frequency divided by the clock frequency, is often called $\alpha$, the activity factor. Power therefore is $\alpha C V^2 f_{clk}$.

Power is then $N C V^2 f_{clk}$ (with $N$ being the number of gates). If the total capacitance on the chip comes out to $1\,\mu\mathrm{F}$ and it only switches through 1 V, each transition dissipates $CV^2 = 1\,\mu\mathrm{J}$; at a GHz that's about a kW, and depending on the size of your chip, your heat flux may approach that of the surface of the sun. $\alpha$ should be much less than 1, therefore.
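The arithmetic behind that kilowatt (my numbers: 1 $\mu$F total, 1 V swing, 1 GHz clock, and a worst case $\alpha = 1$):

```python
C_total = 1e-6   # 1 uF of total switched capacitance on the chip
V = 1.0          # 1 V supply
f_clk = 1e9      # 1 GHz clock

P_watts = C_total * V**2 * f_clk  # alpha = 1: every node switches every cycle
print(P_watts)   # about a kilowatt, which is why alpha must be << 1
```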

Delay and Power

November 29, 2012

Office hours next week: TuTh 10-11, F 3-4; Finals week not 100% sure, but I think it'll be TuTh afternoon. Won't be around on Friday until the exam; will post on Piazza what they end up being.

Projects still due tomorrow; 3 PM: Verilog. Hopefully should not be much software to write at this point.

Final is still Friday of finals week. Will be comprehensive. Will post a couple of examples of questions on SDRAM and maybe delay + power. Really good things to ask the TAs about; maybe will come up with solutions as well.

Delay and Power

Last time: the two key equations were $\tau = RC \approx$ delay per stage or gate. Lots of ways to look at this, but gets you into the right ballpark. Nitty gritty details: EE141. $C = C_{wire} + C_{gate}$: roughly $0.2\,\mathrm{fF}/\mu\mathrm{m}$ of wire length and $1\,\mathrm{fF}/\mu\mathrm{m} \times W$ of gate width. If building dense stuff, gate capacitance dominates, but anything over a "wide area" is dominated by wire capacitance.
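With those per-micron numbers, the crossover where wire capacitance overtakes gate capacitance is easy to estimate (my arithmetic, same figures as above):

```python
c_wire_ff_per_um = 0.2  # wire capacitance per micron of length
c_gate_ff_per_um = 1.0  # gate capacitance per micron of gate width

W_um = 1.0  # width of the driven gate
crossover_um = c_gate_ff_per_um * W_um / c_wire_ff_per_um
print(crossover_um)  # beyond ~5 um of wire, the wire dominates C
```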

And two, the power that gets dissipated is $P = NCV^2 f_{0 \to 1}$. If all transistors switch every nanosecond, you dissipate easily a kW of power, and the chip melts. Fortunately, most gates aren't switching.

If we draw our FF, we have a storage cell with an enable line on it being fed by something with an enable line on it. If you recall, we've got CLK and !CLK; you've got your input D on there. As soon as CLK goes high, you start feeding that value back to itself, and you turn off the input.

Register file: 32x32 in your regfile; you do have a handful of address lines going into this thing, but at most a tenth are going to be accessed. One to be written, two to be read out.

Could figure out power consumption from this formula. What's going on inside the array? On average sixteen bits out of the 1024 in this array; that's not where the power is. But you've got the clock going between four different gates, and if you go back to your notes, that's two PMOSes and two NMOSes; clock to clock-bar, and there's your A and A-bar.

This clock line coming in has to drive a bunch of inverter-equivalent gates; it's the clock tree that burns the power more so than the reading and writing of these feedback networks.

SRAM cell is also cross-coupled inverters, plus some access lines: you've got a word line running across the top, and vertically you've got bit-line and bit-line-bar coming down through the array. In general only one word line is going to be active, but you'll have to switch the capacitance of the bit-lines either way: it doesn't matter what the bit is; you'll be charging one or the other of the two capacitances. Turns out that's not generally the problem with the SRAM; the biggest problem is leakage. Will talk about that in a minute.

So that's where our power is coming from. Where our delay is coming from -- a common thing people will do to test process is put a ring oscillator test structure: you put some odd prime (need prime to avoid harmonics) number of inverters; they'll make that structure of ring oscillators, put it in a box with a power supply, measure current going through, and put a counter on the outside; and that might be for example a 12-bit counter, and they have a ripple-carry-out coming out of the test structure; they then see how fast this ring oscillator will go.

If you go on the web to MOSIS, you can take your design from CS150 and send it off to them, and for \$2000, they'll send you back some chips. In principle, they'll do what they did on your FPGA. Collect designs from lots of groups; share costs of mask process. Tons of process information.

One example: 32nm CMOS, 0.9V, 11-stage ring oscillator, $f_{osc} = 9.3\,\mathrm{GHz}$, $P = 110\,\mu\mathrm{W}$. What are $\tau$, $C$, $R$? Every gate switches at $f_{osc}$. If we think about the delay for a gate in an $N$-stage ring oscillator, how many gate delays fit into the period? $T_{osc} = 2 N t_d$, so $t_d \approx \tau = T_{osc}/2N$. MOSIS might do fanout of 4; that's a term you'll hear very often: you might see something like a ring oscillator where each inverter drives 3 dummy inverters as well as the one actually in use. From the power: $110\,\mu\mathrm{W} = 11\, C_{gate} \cdot (0.9\,\mathrm{V})^2 \cdot f_{osc}$, so $C_{gate} \approx 10\,\mu\mathrm{W} / (10\,\mathrm{V^2\,GHz}) \approx 1\,\mathrm{fF}$, and $R = \tau / C \approx 5\,\mathrm{k\Omega}$. (Note: usually $f_{0\to 1} \ll f_{clk}$, but in a ring oscillator every gate switches every cycle.) Just from looking at the power consumption and oscillation frequency, you can figure out $\tau$, $R$, $C$. Dynamic power: $CV^2 f$.

Leakage power is often the real problem. Gate oxide leakage: you'd like a thin oxide layer, but thinner oxide increases the likelihood of tunneling (probability cloud). EOT: equivalent oxide thickness. A higher-k dielectric looks like silicon dioxide with less thickness (to get the electric field needed), but you don't get as much tunneling. There's still a leakage path through this thing; still current that flows through.

Plot drain current vs gate-source voltage on a semilog plot; you want to maximize $I_{on} / I_{off}$ (roughly switching speed vs leakage). Turns out there are fundamental limits on the subthreshold slope: you can't get it below 60 mV per decade at room temperature. And you need a lot of orders of magnitude, so you'd really like $I_{on}/I_{off}$ to be a big number. One way to do that is to just use a bigger supply voltage: increase $V_{DD}$. But power goes as $V_{DD}^2$. Further, materials tend to break down at high voltages (per unit distance).
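The ring-oscillator extraction above, worked through in full (same figures; the lecture rounds $C$ to about 1 fF and $R$ to about 5 k$\Omega$):

```python
N = 11          # ring oscillator stages
f_osc = 9.3e9   # oscillation frequency, Hz
P = 110e-6      # measured power, W
V = 0.9         # supply voltage, V

tau = 1 / (2 * N * f_osc)      # gate delay, from T_osc = 2 * N * t_d
C = (P / N) / (V**2 * f_osc)   # per-gate C, from P = N * C * V^2 * f_osc
R = tau / C
print(tau, C, R)  # ~4.9 ps, ~1.3 fF, ~3.7 kOhm before rounding
```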

MEMS transistors. You can make a relay which in cross-section has a subsection which is insulated and an electrode, with a polysilicon beam that goes over this thing. By applying a voltage, you use electrostatic attraction to make / break connections. Now you do have something that looks like a switch; you don't have any gate leakage that way. You still have surface leakage, though.

Return to lecture 1, where we talked about what digital system design is these days. At the core is microprocessor and what runs on that; then you've got inputs. Outputs like audio, video, LEDs, communication and networking (wifi, USB, ethernet, cellular bands, ...), memory and other storage (flash / ROM, SRAM, DRAM); a bunch of accelerators (graphics, crypto, floating point and math, DSP broadly).

What you have done, writing Verilog and synthesizing it to an FPGA, is most certainly what you do in industry. Finally, you tape out: make masks (which cost an enormous amount of money), make wafers and chips and boards, which go through testing. 2 to 24 months; 6-8 might be more typical. Many millions of dollars today.