# Files

Files are an abstraction maintained by the operating system (OS). The OS presents a façade of names and hierarchy above a nest of raw bytes. Files also provide a layer of security: they have attached permissions that the OS enforces.

## File Formats

Storage media like hard disk drives work faster when reading or writing large blocks of data sequentially. File formats are standards for working with data as a single, ordered sequence of bytes.

File formats manage trade-offs between performance, human readability, and portability. 

## Accessing Files

`std::fs::File` is the primary type for interacting with the filesystem. Two methods return a `File`: `open()` opens an existing file in read-only mode, and `create()` creates a new file for writing, truncating it if it already exists. When you require more control, `std::fs::OpenOptions` is available.
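As a minimal sketch of the three entry points, the helper below writes a file with `create()`, appends to it through `OpenOptions`, and reads it back with `open()`. The function name `write_then_read` and the scratch file name are hypothetical, chosen only for illustration.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{self, Read, Write};
use std::path::Path;

// Hypothetical helper: writes, appends to, then reads back one scratch file.
fn write_then_read(path: &Path) -> io::Result<String> {
    // create() makes the file, truncating it if it already exists.
    File::create(path)?.write_all(b"hello")?;

    // OpenOptions gives finer control; here it opens the file for appending.
    OpenOptions::new()
        .append(true)
        .open(path)?
        .write_all(b", world")?;

    // open() gives read-only access to an existing file.
    let mut contents = String::new();
    File::open(path)?.read_to_string(&mut contents)?;

    fs::remove_file(path)?; // clean up the scratch file
    Ok(contents)
}

fn main() -> io::Result<()> {
    // Assumed scratch location inside the system temp directory.
    let path = std::env::temp_dir().join("files_notes_demo.txt");
    println!("{}", write_then_read(&path)?);
    Ok(())
}
```

Each `File` here is a temporary that closes at the end of its statement, so no handle is left open when the file is removed.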

## Interacting With The Filesystem In A Type-safe Manner

Rust provides type-safe variants of `str` and `String` in its standard library: `std::path::Path` and `std::path::PathBuf`. You can use these variants to work unambiguously with path separators in a cross-platform way. `Path` can address files, directories, and related abstractions, such as symbolic links.

```rust
use std::path::PathBuf;

let hello = PathBuf::from("/tmp/hello.txt");
println!("{:?}", hello.extension());
```

```text
Some("txt")
```


## What Is EOF?
The end of file (EOF) is a convention that operating systems provide to applications. There is no special marker or delimiter at the end of a file within the file itself. EOF is not a byte stored in the file; it is signaled by a read that returns zero bytes.

When reading from a file, the OS tells the application how many bytes were successfully read from storage. If no bytes were successfully read from disk, yet no error condition was detected, then the OS and, therefore, the application assume that EOF has been reached.
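The following sketch shows this convention from the application's side. `count_bytes` is a hypothetical helper that keeps calling `read()` until it returns `Ok(0)`; a `Cursor` over an in-memory buffer stands in for a file here, since `std::fs::File` implements the same `Read` trait.

```rust
use std::io::{self, Cursor, Read};

// Hypothetical helper: counts the bytes in a reader by reading until EOF.
fn count_bytes<R: Read>(mut reader: R) -> io::Result<usize> {
    let mut buf = [0u8; 4]; // deliberately tiny buffer to force several reads
    let mut total = 0;
    loop {
        // read() reports how many bytes were placed into the buffer.
        let n = reader.read(&mut buf)?;
        if n == 0 {
            // Zero bytes read into a non-empty buffer, with no error: EOF.
            return Ok(total);
        }
        total += n;
    }
}

fn main() -> io::Result<()> {
    let data = Cursor::new(b"0123456789".to_vec());
    println!("{} bytes before EOF", count_bytes(data)?);

    // Bytes with the value 0 are ordinary data and do not end the stream.
    let zeros = Cursor::new(vec![0u8; 7]);
    println!("{} zero-valued bytes before EOF", count_bytes(zeros)?);
    Ok(())
}
```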

This works because the OS has the responsibility for interacting with physical devices. An application that wants to read a file asks the OS to access the disk on its behalf.

## Bitcask Storage Format

Bitcask is a storage backend that was developed for the original implementation of Riak, a NoSQL database. Bitcask lays out every record in a prescribed manner. Each record comprises three parts:

1. Fixed width header:
    - checksum: `u32`
    - key_len: `u32`
    - value_len: `u32`
2. Variable width key: `[u8; <key_len>]`    
3. Variable width value: `[u8; <value_len>]`


To parse a record, the header information is read, then that information is used to read the body. Lastly, the body's contents are verified with the checksum in the header.
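The parsing steps can be sketched with the standard library alone. Everything named here is an assumption made for illustration: `toy_checksum` is a placeholder byte sum rather than the checksum Bitcask actually uses, the header fields are assumed to be little endian, and `encode_record`, `read_u32`, and `parse_record` are hypothetical helpers.

```rust
use std::io::{self, Cursor, Read};

// ASSUMPTION: placeholder checksum (wrapping byte sum), not Bitcask's real one.
fn toy_checksum(bytes: &[u8]) -> u32 {
    bytes.iter().fold(0u32, |acc, &b| acc.wrapping_add(b as u32))
}

// ASSUMPTION: header fields are stored as little-endian u32 values.
fn read_u32<R: Read>(r: &mut R) -> io::Result<u32> {
    let mut buf = [0u8; 4];
    r.read_exact(&mut buf)?;
    Ok(u32::from_le_bytes(buf))
}

// Serializes one record: header (checksum, key_len, value_len), key, value.
fn encode_record(key: &[u8], value: &[u8]) -> Vec<u8> {
    let body = [key, value].concat();
    let mut out = Vec::with_capacity(12 + body.len());
    out.extend_from_slice(&toy_checksum(&body).to_le_bytes());
    out.extend_from_slice(&(key.len() as u32).to_le_bytes());
    out.extend_from_slice(&(value.len() as u32).to_le_bytes());
    out.extend_from_slice(&body);
    out
}

// Parses one record: read the header, use it to read the body, then verify.
fn parse_record<R: Read>(r: &mut R) -> io::Result<(Vec<u8>, Vec<u8>)> {
    let saved = read_u32(r)?;
    let key_len = read_u32(r)? as usize;
    let value_len = read_u32(r)? as usize;

    let mut body = vec![0u8; key_len + value_len];
    r.read_exact(&mut body)?;

    if toy_checksum(&body) != saved {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "checksum mismatch"));
    }
    let value = body.split_off(key_len); // body now holds only the key
    Ok((body, value))
}

fn main() -> io::Result<()> {
    let bytes = encode_record(b"greeting", b"hello");
    let (key, value) = parse_record(&mut Cursor::new(bytes))?;
    println!("{} => {}", String::from_utf8_lossy(&key), String::from_utf8_lossy(&value));
    Ok(())
}
```

Because the header has a fixed width of 12 bytes, the parser always knows how many bytes to request next, which is what lets records of different sizes sit back to back in one file.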

## Writing Multi-Byte Binary Data To Disk

Computing platforms differ in the order in which they store the bytes of a number. Some store the 4 bytes of an `i32` from most significant to least significant (big endian); others store them in the reverse order (little endian). That can be a problem if the data is written by one computer and loaded by another.

The Rust ecosystem provides some options. The `byteorder` crate can extend types that implement the standard library’s `std::io::Read` and `std::io::Write` traits. The extensions can guarantee how multi-byte sequences are interpreted, either as little endian or big endian.
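The effect of byte order can be seen without any external crate. The sketch below uses the standard library's `to_le_bytes()`, `to_be_bytes()`, and `from_le_bytes()` methods instead of `byteorder`; `encode_i32` and `decode_i32` are hypothetical helpers that assume a little-endian on-disk layout.

```rust
// Hypothetical helpers that fix the on-disk layout to little endian.
fn encode_i32(n: i32) -> [u8; 4] {
    n.to_le_bytes() // least significant byte first, on every platform
}

fn decode_i32(bytes: [u8; 4]) -> i32 {
    i32::from_le_bytes(bytes)
}

fn main() {
    let n: i32 = 0x1234_5678;
    let on_disk = encode_i32(n);

    println!("little endian: {:02x?}", on_disk);
    println!("big endian:    {:02x?}", n.to_be_bytes());

    // Decoding with the same convention recovers the value.
    assert_eq!(decode_i32(on_disk), n);
    // Decoding with the other convention silently yields a different number.
    assert_eq!(i32::from_be_bytes(on_disk), 0x7856_3412);
}
```

Choosing one convention explicitly at both ends, rather than relying on the native order, is what makes the bytes portable between machines.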

### Byteorder Crate

`byteorder::LittleEndian` and its peers `BigEndian` and `NativeEndian` are types that declare how multi-byte data is written to and read from disk. Check out the demo project `byte_order_explorer` to see how to use the `byteorder` crate.

## Validating I/O Errors With Checksums

Checksums come in handy for validating what the program has read from disk against what it wrote to disk.

1. Saving to disk: Before data is written to disk, a checking function is applied to those bytes. The result of the checking function (the checksum) is written alongside the original data. No checksum is calculated for the bytes of the checksum. If something breaks while writing the checksum’s own bytes to disk, this will be noticed later as an error.
2. Reading from disk: Read the data and the saved checksum, then apply the checking function to the data. Compare the freshly computed checksum with the saved one. If the two do not match, an error has occurred, and the data should be considered corrupted.

### Checksum Function

An ideal checksum function would:

- Return the same result for the same input
- Always return a different result for different inputs
- Be fast
- Be easy to implement

We'll use parity bit checking in `ActionKV`.

### Parity Bit Checking

Parity checks count the number of 1s within a bitstream. They store a bit that indicates whether the count was even or odd.

Parity bits are traditionally used for error detection within noisy communication systems, for example when transmitting data over analog channels such as radio waves.

Below is an implementation of the `parity_bit()` function. It takes an arbitrary stream of bytes and returns a `u8` that indicates whether the count of 1 bits in the input was even (`1`) or odd (`0`).

```rust
fn parity_bit(bytes: &[u8]) -> u8 {
    // Running total of 1 bits across the whole input.
    let mut n_ones: u32 = 0;

    for byte in bytes {
        let ones = byte.count_ones();
        n_ones += ones;
        println!("{} (0b{:08b}) has {} one bits", byte, byte, ones);
    }

    // 1 when the total is even, 0 when it is odd.
    (n_ones % 2 == 0) as u8
}
```

```rust
let abc = b"abc";
println!("input: {:?}", abc);
println!("output: {:08x}", parity_bit(abc));
```

```text
input: [97, 98, 99]
97 (0b01100001) has 3 one bits
98 (0b01100010) has 3 one bits
99 (0b01100011) has 4 one bits
output: 00000001
```


```rust
let abcd = b"abcd";
println!("input: {:?}", abcd);
println!("result: {:08x}", parity_bit(abcd));
```

```text
input: [97, 98, 99, 100]
97 (0b01100001) has 3 one bits
98 (0b01100010) has 3 one bits
99 (0b01100011) has 4 one bits
100 (0b01100100) has 3 one bits
result: 00000000
```
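A single parity bit falls short of the "different result for different inputs" ideal, and a short experiment makes the limit concrete. In the sketch below, `parity` applies the same counting rule as `parity_bit()` without the logging, and `is_corrupted` is a hypothetical helper that compares stored and recomputed parity.

```rust
// Same counting rule as parity_bit(), without the per-byte logging.
fn parity(bytes: &[u8]) -> u8 {
    let n_ones: u32 = bytes.iter().map(|b| b.count_ones()).sum();
    (n_ones % 2 == 0) as u8
}

// Hypothetical helper: true if the stored parity no longer matches the data.
fn is_corrupted(data: &[u8], saved: u8) -> bool {
    parity(data) != saved
}

fn main() {
    let original = *b"abc";
    let saved = parity(&original);

    // Flipping one bit changes the count of 1s by one, so parity changes.
    let mut one_flip = original;
    one_flip[0] ^= 0b0000_0001;
    println!("one flipped bit detected: {}", is_corrupted(&one_flip, saved));

    // Flipping two bits changes the count by -2, 0, or +2, so parity is unchanged.
    let mut two_flips = original;
    two_flips[0] ^= 0b0000_0011;
    println!("two flipped bits detected: {}", is_corrupted(&two_flips, saved));
}
```

Any odd number of flipped bits is caught, while any even number slips through, which is the trade-off for a check that is this fast and simple.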