# Spring 2019 | CS 6400

Author: Travis Jefferies
Last Updated: 04252019

## Efficiency

Your database could be the best-designed thing in the world, but if it isn't fast enough, nobody will use it! To see where the time goes, we begin with some computer hardware 101.

### Computer Architecture 101

![](p1.svg)

Main memory - RAM - volatile, fast, expensive \$\$\$<br>
Secondary Memory - DISK - permanent, slow, big and cheap $<br>
* Applications run by the CPU can only query and update data in main memory
* Data must be written back to secondary memory after being updated
* Only a tiny fraction of a real database fits into main memory

### Who Cares?

We should care about this because of the difference in access time between main memory and disk:
* Main Memory
    * 30 ns or $3\times10^{-8}$ sec
* Disk
    * 10 ms or $1\times10^{-2}$ sec

In any fixed window of time (say 60 seconds), main memory can serve $\frac{60 / (3\times10^{-8})}{60 / 0.01} = \frac{0.01}{3\times10^{-8}} \approx 333{,}333$ times as many accesses as disk, so it's roughly 333,333$\times$ faster to use main memory.
* In fact, RAM access time is ignored entirely when estimating the cost of a database operation
    * Only disk access time is counted!
    * CPU cost is ignored in these calculations too
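
As a quick sanity check on that ratio, here is a minimal sketch that takes the two access times quoted above (30 ns for main memory, 10 ms for disk) and divides them. The constant and function names are made up for illustration.

```python
# Access times quoted in the notes, expressed in seconds.
RAM_ACCESS_SEC = 30e-9   # 30 ns
DISK_ACCESS_SEC = 10e-3  # 10 ms


def speedup(slow_sec: float, fast_sec: float) -> float:
    """Return how many times faster the fast device is than the slow one."""
    return slow_sec / fast_sec


ratio = speedup(DISK_ACCESS_SEC, RAM_ACCESS_SEC)
print(f"Main memory is about {ratio:,.0f}x faster than disk")
```

The fixed time window (60 seconds or anything else) cancels out, so only the two access times matter.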
    
### Disk

See picture below for an in-depth overview of the disk:

![](p2.png)

A disk has a number of platters.<br>
For each platter, there is one read-write head that accesses the top surface and one that accesses the bottom surface.<br>
All of the read-write heads are connected together and are operated by a single actuator.<br>
On each surface, what passes under the read-write head at a given position is called a track.<br>
The tracks at the same head position across all surfaces form a cylinder.<br>
Each surface is split up into a number of sectors.<br>
A sector is the smallest physical unit that can be transferred from disk to main memory.
* Usually consists of 512 bytes

A block is made up of several sectors.<br>
* Typically blocks are 4 KB, or 8 sectors
    * Could be 8 KB or 16 KB depending on the data we are storing
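
The sector-to-block relationship is simple division. The sketch below assumes 512-byte sectors and power-of-two block sizes (4 KB taken as 4,096 bytes); `sectors_per_block` is a hypothetical helper, not part of any library.

```python
SECTOR_BYTES = 512  # typical sector size from the notes


def sectors_per_block(block_bytes: int, sector_bytes: int = SECTOR_BYTES) -> int:
    """Return how many whole sectors make up one block."""
    if block_bytes % sector_bytes != 0:
        raise ValueError("block size must be a multiple of the sector size")
    return block_bytes // sector_bytes


for kib in (4, 8, 16):
    print(f"{kib} KB block = {sectors_per_block(kib * 1024)} sectors")
```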
    
### Records, Blocks, Files

The storage that database applications access can mainly be thought of in terms of records, blocks, and files.

#### Records

Records are stored in blocks on disk.<br>

![](p3.png)

RegularUser(
    Email varchar(50),
    Sex char(1),
    BirthDate datetime,
    CurrentCity varchar(50),
    Hometown varchar(50)
)

datetime = 8 bytes<br>
record size = 50 + 1 + 8 + 50 + 50 = 159 bytes<br>
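
The record size can be recomputed directly from the schema. This sketch assumes every `varchar` is stored at its full declared width and that `datetime` takes 8 bytes, as the notes do; real systems often store variable-length fields more compactly. The `FIELD_BYTES` dictionary is just an illustrative encoding of the `RegularUser` schema.

```python
# Assumed storage width in bytes for each RegularUser field.
FIELD_BYTES = {
    "Email": 50,        # varchar(50), assumed stored at full width
    "Sex": 1,           # char(1)
    "BirthDate": 8,     # datetime
    "CurrentCity": 50,  # varchar(50)
    "Hometown": 50,     # varchar(50)
}

record_bytes = sum(FIELD_BYTES.values())
print(f"record size = {record_bytes} bytes")
```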

#### Blocks

Now if we assume the following specs:

block size: 4 KB (+metadata)<br>
filled: ~80%<br>

We can expect the following usable space per block:

4,000 $\times$ 0.8 = 3,200 bytes<br>
3,200 / 159 $\approx$ 20.126 records/block<br>

So 20 whole records fit, with room for only 0.126 of a 21st. What do we do with the remaining 0.874 of that record?

![](p4.png)

If we decide to go with the option on the left, where the first 0.126 of a record is stored at the end of one block and the remaining 0.874 is stored in the next block, we have what is called a *spanned representation*. If instead every record is kept entirely within one block and the leftover space is simply left unused, we have an *unspanned representation*.<br>

The default in most database systems is the unspanned representation, simply to avoid the processing that's needed to split records across blocks.
* Obviously, if your records are larger than a block, then you don't have a choice but to use the spanned representation.
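
To make the unspanned case concrete, the sketch below works out how many whole records fit in one block and how many bytes are left over. It uses the figures from the notes (4,000-byte block, 80% fill, 159-byte records); `records_per_block` is a hypothetical helper.

```python
def records_per_block(block_bytes: int, fill_factor: float, record_bytes: int):
    """Return (usable bytes, whole records that fit, leftover bytes) for one block."""
    usable = round(block_bytes * fill_factor)
    whole = usable // record_bytes            # unspanned: only whole records count
    leftover = usable - whole * record_bytes  # space too small for another record
    return usable, whole, leftover


usable, whole, leftover = records_per_block(4_000, 0.8, 159)
print(f"usable bytes per block: {usable}")
print(f"whole records per block: {whole}")
print(f"leftover bytes: {leftover} ({leftover / 159:.3f} of a record)")
```

In an unspanned representation the leftover bytes stay empty; in a spanned representation they would hold the start of the next record.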

Now that we have the concept of Blocks, we can create Files.

#### Files

Files are Blocks linked together by pointers.<br>

![](p5.png)

Assuming the following specs:

block pointer size: 4 bytes (true for 32-bit architecture)<br>
\# Records: 4,000,000<br>
block size: 4 KB

Assuming we can fit ~20 records/block, 4,000,000 records will require $\frac{4{,}000{,}000 \text{ records}}{20 \text{ records/block}} = 200{,}000$ blocks.

So we can expect a file of roughly $200{,}000 \times 4\text{ KB} \approx 800\text{ MB}$ on disk.
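
The same estimate as a short sketch. It follows the notes in treating 4 KB as 4,000 bytes and 1 MB as 1,000,000 bytes, and it ignores block metadata and pointers; `file_blocks` is a hypothetical helper.

```python
import math


def file_blocks(n_records: int, records_per_block: int) -> int:
    """Return the number of blocks needed when records are not spanned."""
    return math.ceil(n_records / records_per_block)


BLOCK_BYTES = 4_000  # 4 KB, rounded as in the notes

blocks = file_blocks(4_000_000, 20)
file_bytes = blocks * BLOCK_BYTES
print(f"{blocks:,} blocks, about {file_bytes / 1_000_000:,.0f} MB on disk")
```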

### Compute Transport Time from Disk to Main Memory

Assumptions:

* Seek time: 3-8 ms
* Rotational Delay: 2-3 ms
* Transfer time: 0.5-1.0 ms
* Total: 5-12 ms

We will use 10 ms (0.01 sec) per page fault as a working figure.<br>

What if instead of picking up 1 block at a time, we chose to pick up 250 blocks per seek-rotation?<br>
These are called ***extent transfers***.

Normally such an operation would cost $250 \text{ blocks} \times 10\frac{\text{ ms}}{\text{block}} \approx 2.5 \text{ secs}$.<br>

Using *extent transfers* we can do the same operation in $\approx 0.25 \text{ secs}$.
* Extent transfers only incur seek time and rotational delay when finding the first block
    * From that point on it is straight transfer time; the seek and rotational delays are eliminated
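
A rough sketch of the two costs follows. The split of the 10 ms page fault into 9 ms of seek plus rotational delay and 1 ms of transfer is an assumption chosen to sit inside the ranges listed above; it is not a figure from the notes. Times are kept in whole milliseconds.

```python
SEEK_PLUS_ROTATION_MS = 9  # assumed: combined seek time and rotational delay
TRANSFER_MS = 1            # assumed: transfer time for one block


def block_at_a_time_ms(n_blocks: int) -> int:
    """Every block pays seek, rotational delay, and transfer."""
    return n_blocks * (SEEK_PLUS_ROTATION_MS + TRANSFER_MS)


def extent_transfer_ms(n_blocks: int) -> int:
    """Seek and rotational delay are paid once, then each block only pays transfer."""
    return SEEK_PLUS_ROTATION_MS + n_blocks * TRANSFER_MS


single = block_at_a_time_ms(250)
extent = extent_transfer_ms(250)
print(f"one block at a time: {single / 1000:.2f} s")
print(f"extent transfer:     {extent / 1000:.2f} s")
```

Under these assumed numbers the extent transfer comes out slightly above 0.25 s because the single seek and rotation still have to be paid once.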
    
The downside to using *extent transfers* is that we will probably need more buffer space.<br>
Whenever we move data from the disk into main memory or from main memory back to disk, we need good buffer management strategies.
* One of the most commonly used strategies is Least Recently Used (LRU)
    * If we run out of buffer space and need to free up space in main memory for the data being transferred from disk, we find what was least recently used and overwrite that.
    * The philosophy being: *"We haven't used it in a while, so we're probably not gonna use it next."*
    
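LRU is ideal for merge joins.

* LRU really struggles with nested loop joins
    * A nested loop join rescans the inner relation over and over; when that relation does not fit in the buffer, LRU evicts each block just before it is needed again
    * MRU, or Most Recently Used, is ideal for nested loop joins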
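
The difference can be seen in a toy simulation. The sketch below is not how any particular database implements its buffer manager; it is a minimal model that counts buffer misses when the same 5 blocks (standing in for the inner relation of a nested loop join) are scanned 4 times with room for only 3 blocks in the buffer. `count_misses` and the block numbers are made up for illustration.

```python
from collections import OrderedDict


def count_misses(accesses, capacity, policy):
    """Count buffer misses for a sequence of block accesses under 'LRU' or 'MRU'."""
    buffer = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for block in accesses:
        if block in buffer:
            buffer.move_to_end(block)  # hit: mark as most recently used
            continue
        misses += 1
        if len(buffer) >= capacity:
            # last=False evicts the least recently used entry,
            # last=True evicts the most recently used entry.
            buffer.popitem(last=(policy == "MRU"))
        buffer[block] = None
    return misses


inner_blocks = list(range(5))  # inner relation: 5 blocks
accesses = inner_blocks * 4    # scanned 4 times, as a nested loop join would
lru_misses = count_misses(accesses, 3, "LRU")
mru_misses = count_misses(accesses, 3, "MRU")
print(f"LRU misses: {lru_misses} of {len(accesses)} accesses")
print(f"MRU misses: {mru_misses} of {len(accesses)} accesses")
```

With a repeated sequential scan larger than the buffer, LRU misses on every access, while MRU keeps part of the relation resident between scans.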

For the 250-block example, the extent transfer is 2.5 / 0.25 = 10$\times$ faster than fetching the blocks one at a time.