John Cook edited this page Jan 1, 2021 · 6 revisions

My data is currently in disarray. Several laptops have died and needed their data recovered; each time I used a different computer until the laptop was fixed or replaced, and then did not merge everything when switching (back) to the repaired or replacement device.

When my most recent laptop motherboard failure occurred and I had to search through 88 terabytes of data looking for the 1 megabyte file containing the decryption key for the drives in the dead laptop, I knew something had to be done.

There are three things I ultimately need:

  1. A single place where any given piece of data should live, and a way to check it for corruption.
  2. A second copy of that data (a backup) that is there if I need it.
  3. A clearly defined way of archiving data that I'm no longer using to both offload it from main storage and know where it is when I want it again.

Data Categorisation

To determine the best way to store, back up, and archive data, the data first needs to be categorised.

To start with, I am merging all of my Documents git repositories into a single repository. This is a slow process, but once it is done I am going to digitise all of my remaining paperwork and commit it to the repository, freeing up desk space so that I can start diagnosing the fault on my laptop.

With my documents in a single version-controlled repository, I can then start deciding how best to reorganise the data (such as by splitting pictures out into a git submodule), how best to back it up, and how best to archive it.

Some data can already be categorised: multimedia.

Multimedia Categorisation

My multimedia files are in various formats, but in general they are in one of the following categories:

  1. A large individual file.
  2. A folder containing several related files.
  3. A folder containing many related files.

Here is a typical example of each category from my own media:

  1. A Windows 10 DVD ISO, /mnt/Media1/ISOs/Windows 10 English x64/Win10_1703_EnglishInternational_x64.iso.
  2. An unzipped downloaded music album, /mnt/Media1/Audio-Downloaded-FLAC-HD/Boston Pops/A Boston Pops Christmas - Live from Symphony Hall.
  3. A folder containing some live streams, /mnt/g/1b - Recorded Streams/Subnautica 2020.

At present, the first type would have both a .md5 file alongside it for integrity checking and a .torrent file so that, were a "piece" of the file corrupt, it would in theory be possible to grab that piece from the second copy.
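The two checks described above can be sketched in Python. This is a minimal illustration, not a description of any existing tooling: it assumes the sidecar is named `<file>.md5` and follows the `md5sum` layout (hex digest first), and it does not parse a real .torrent file. Instead it recomputes per-piece SHA-1 hashes, the scheme BitTorrent v1 uses, with an assumed piece size. All function names are hypothetical.

```python
import hashlib
from pathlib import Path

# Assumed piece length; a real .torrent records its own value.
PIECE_SIZE = 4 * 1024 * 1024


def md5_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_sidecar(path: Path) -> bool:
    """Compare a file with the digest stored in '<name>.md5' (assumed naming)."""
    sidecar = path.with_name(path.name + ".md5")
    expected = sidecar.read_text().split()[0].lower()
    return md5_of_file(path) == expected


def piece_hashes(path: Path, piece_size: int = PIECE_SIZE) -> list[bytes]:
    """SHA-1 digest of each fixed-size piece of the file, in order."""
    hashes = []
    with path.open("rb") as handle:
        for piece in iter(lambda: handle.read(piece_size), b""):
            hashes.append(hashlib.sha1(piece).digest())
    return hashes


def corrupt_pieces(path: Path, expected: list[bytes],
                   piece_size: int = PIECE_SIZE) -> list[int]:
    """Indices of pieces that are missing or differ from the recorded hashes."""
    actual = piece_hashes(path, piece_size)
    return [i for i, want in enumerate(expected)
            if i >= len(actual) or actual[i] != want]


def repair_piece(bad: Path, good: Path, index: int,
                 piece_size: int = PIECE_SIZE) -> None:
    """Overwrite one piece of 'bad' with the same byte range from 'good'."""
    offset = index * piece_size
    with good.open("rb") as src, bad.open("r+b") as dst:
        src.seek(offset)
        data = src.read(piece_size)
        dst.seek(offset)
        dst.write(data)
```

The whole-file MD5 only says that something is wrong; the piece hashes say where, so only the damaged range needs to be fetched from the second copy.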

The second type has only the files' internal checksums (if any). The FLAC file format, for example, includes such checksums, and .flac files can be integrity tested (e.g. with flac -t).
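As a rough sketch of how an album folder could be checked this way, the following Python walks a directory and runs `flac -t` on each .flac file. It assumes the `flac` command-line tool is installed and on the PATH, and that a non-zero exit status indicates a failed test; the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def find_flac_files(album_dir: Path) -> list[Path]:
    """Return every .flac file under album_dir, sorted for stable output."""
    return sorted(
        p for p in album_dir.rglob("*")
        if p.is_file() and p.suffix.lower() == ".flac"
    )


def flac_is_intact(path: Path) -> bool:
    """Run 'flac -t' on one file; exit status 0 is taken to mean it passed."""
    result = subprocess.run(
        ["flac", "-t", "--silent", str(path)],
        capture_output=True,
    )
    return result.returncode == 0


def failed_tracks(album_dir: Path) -> list[Path]:
    """List the tracks in an album folder that fail the FLAC self-test."""
    return [p for p in find_flac_files(album_dir) if not flac_is_intact(p)]
```

Calling `failed_tracks(Path("/mnt/Media1/Audio-Downloaded-FLAC-HD/Boston Pops/A Boston Pops Christmas - Live from Symphony Hall"))` would return the tracks that need replacing, or an empty list if the album is intact.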

The third type would also have a .torrent file created for the directory so that the individual files can be integrity checked and, if need be, a good copy of a file can be copied over a bad one.
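The check-and-replace idea for a directory can be sketched as follows. This is a stand-in for the .torrent approach rather than an implementation of it: it does not read .torrent files, and it records one SHA-1 digest per file in a plain dictionary, whereas a real torrent hashes fixed-size pieces. The function names and the manifest format are assumptions for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def sha1_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path relative to root to its SHA-1 digest."""
    return {
        p.relative_to(root).as_posix(): sha1_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def damaged_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """Relative paths that are missing or no longer match the manifest."""
    bad = []
    for rel, want in manifest.items():
        target = root / rel
        if not target.is_file() or sha1_of_file(target) != want:
            bad.append(rel)
    return bad


def restore_from_copy(root: Path, second_copy: Path,
                      manifest: dict[str, str]) -> list[str]:
    """Copy good files from second_copy over damaged ones in root."""
    restored = []
    for rel in damaged_files(root, manifest):
        source = second_copy / rel
        # Only trust the second copy if it still matches the recorded hash.
        if source.is_file() and sha1_of_file(source) == manifest[rel]:
            (root / rel).parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source, root / rel)
            restored.append(rel)
    return restored
```

The manifest would be built once while the directory is known to be good, then reused for later checks. Verifying the second copy against the manifest before copying avoids replacing one damaged file with another.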

Multimedia Backup and Archiving

Here's the thing: multimedia files invariably need format-shifting (or transcoding), so when it comes to backups and archives I need to adjust my terminology.

An archive copy should be the copy. It is the source file, it is the video before it is uploaded to YouTube, it is the video before it is converted to DVD-Video.

A backup copy should be a second copy of something that is not archived. In data lingo, a backup copy is a warm copy of something that is hot, whereas an archive copy is cold. In general, the warm copy should be in the same format as the hot copy.

At the moment I don't do data archiving. For multimedia files, I have a hot copy (e.g. on Media1) and a warm copy (e.g. on Media1External). I do plan to start archiving things, however, and have bought an LTO-6 tape drive, a SAS HBA, and some media for that purpose.