# File Formats

When we work with larger programs, we often want to store data in a file rather than have it live only in memory and be lost when the program terminates. To do this, we have to define a file format.

A file format can be binary or text, but we'll just work with text. That is, unlike the previous "File Format" assignment, the entire file is technically just ASCII, and it's our Python program that interprets its significance.

There are various ways to read files, but we'll look at two: recursive structures and state machines.

Both approaches can use regular expressions (regex) to parse individual lines.
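As a small illustration of regex parsing for a single line, here is a sketch that recognizes a date line such as `2021-05-25`. The names `DATE_LINE` and `parse_date_line` are made up for this example, and the pattern only checks the shape of the line, not whether the date is valid.

```python
import re

# Hypothetical pattern for a line holding only a date, e.g. "2021-05-25".
DATE_LINE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def parse_date_line(line):
    """Return (year, month, day) as ints, or None if the line is not a date."""
    match = DATE_LINE.fullmatch(line.strip())
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())
```

A parser can call a helper like this on each line and use the `None` result to decide that the line is something other than a date.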

## Recursive structures

Some file formats, such as XML and HTML, use a recursive structure in which tags enclose smaller tags. For example:

```
<html>
  <head>
    <title>My webpage</title>
  </head>
  <body>
    <h1>Heading</h1>
    <p>Some body text</p>
  </body>
</html>
```

You can use a recursive parser on these. The result might be a dict in which each key is a tag name and each value is that tag's content, which is either text or a nested dict of the same kind.

```
parsed = {
  'html': {
    'head': {
      'title': 'My webpage'
    },
    'body': {
      'h1': 'Heading',
      'p': 'Some body text'
    }
  }
}
```

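Note that before Python 3.7, dictionaries did not guarantee key order, so older code would use an `OrderedDict` here. A dict also cannot hold two children with the same tag name (a second `<p>` would overwrite the first), so an actual implementation might instead simulate a dictionary by having a list where the first item is the tag type.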
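To make the recursion concrete, here is a minimal sketch that uses the list representation (first item is the tag name, remaining items are text or nested lists) rather than a dict. The names `tokenize` and `parse_element` are invented for this example. It assumes well-formed input and does not handle attributes or self-closing tags such as `<img />`.

```python
import re

# An opening/closing tag like <p> or </p>, or a run of text between tags.
TOKEN_RE = re.compile(r"<(/?)(\w+)>|([^<]+)")


def tokenize(text):
    """Split markup into ("open", name), ("close", name) and ("text", s)."""
    tokens = []
    for closing, name, chars in TOKEN_RE.findall(text):
        if name:
            tokens.append(("close" if closing else "open", name))
        elif chars.strip():  # skip whitespace between tags
            tokens.append(("text", chars.strip()))
    return tokens


def parse_element(tokens, pos):
    """Parse the element that opens at tokens[pos]; return (node, next_pos)."""
    kind, name = tokens[pos]
    if kind != "open":
        raise ValueError(f"expected an opening tag, got {tokens[pos]!r}")
    node = [name]
    pos += 1
    while True:
        if pos >= len(tokens):
            raise ValueError(f"missing </{name}>")
        kind, value = tokens[pos]
        if kind == "close":
            if value != name:
                raise ValueError(f"expected </{name}>, got </{value}>")
            return node, pos + 1
        if kind == "text":
            node.append(value)
            pos += 1
        else:
            # An opening tag: recurse to parse the enclosed element.
            child, pos = parse_element(tokens, pos)
            node.append(child)


tree, _ = parse_element(
    tokenize("<body><h1>Heading</h1><p>Some body text</p></body>"), 0
)
# tree == ['body', ['h1', 'Heading'], ['p', 'Some body text']]
```

The recursive call is the key step: each time the parser meets an opening tag inside an element, it hands control to a fresh call that consumes tokens up to the matching closing tag.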

## State machines

A "state machine" is a way of modelling behaviour that switches between states. Each state expects particular input and defines behaviour for only certain inputs, and there is a defined way to enter and exit the flow. In other words, when parsing a file, you expect data to occur in a certain order, linearly from top to bottom (though inner structures can repeat).

For example, here's a simple format for a hockey game.

```
LEAFS VS. CANADIENS
2021-05-25

LEAFS Tavares
CANADIENS Caufield
LEAFS Nylander
LEAFS Simmons
CANADIENS Price
LEAFS Matthews
```

A parser could have five states:

0. Enter/Looking for team name line
1. Looking for date line
2. Looking for blank line
3. Looking for goal line
4. Exit

Here's an annotated version of the format:

```
LEAFS VS. CANADIENS        state 0 - parse & store team names - set to state 1
2021-05-25                 state 1 - parse & store date - set to state 2
                           state 2 - blank - set to state 3
LEAFS Tavares              state 3 - non-blank - parse & store team & scorer
CANADIENS Caufield         state 3 - non-blank - parse & store team & scorer
LEAFS Nylander             state 3 - non-blank - parse & store team & scorer
LEAFS Simmons              state 3 - non-blank - parse & store team & scorer
CANADIENS Price            state 3 - non-blank - parse & store team & scorer
LEAFS Matthews             state 3 - non-blank - parse & store team & scorer
                           state 3 - blank* - set to state 4

* Assumes newline char at the end of Matthews, even if the next line has no content
```

State machines tend to be laid out in `switch/case` structures or equivalent. Each case, based on the current state, reads a line (or a character, or whatever the unit is), reacts, and updates the state.
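Python has no `switch` statement; an `if`/`elif` chain is the usual equivalent (and `match`/`case` is available from Python 3.10). The sketch below applies that layout to the hockey format, with states numbered as in the list above. The function name `parse_game` and the layout of the returned dict are assumptions made for this example.

```python
import re

# States, numbered as in the list above.
TEAMS, DATE, BLANK, GOALS, DONE = range(5)


def parse_game(text):
    """Parse the hockey format into a dict (hypothetical result layout)."""
    game = {"teams": None, "date": None, "goals": []}
    state = TEAMS
    # The appended "" stands in for the blank line that ends the goal list.
    for line in text.splitlines() + [""]:
        line = line.strip()
        if state == TEAMS:
            match = re.fullmatch(r"(\w+) VS\. (\w+)", line)
            if match is None:
                raise ValueError(f"expected team names, got {line!r}")
            game["teams"] = match.groups()
            state = DATE
        elif state == DATE:
            if re.fullmatch(r"\d{4}-\d{2}-\d{2}", line) is None:
                raise ValueError(f"expected a date, got {line!r}")
            game["date"] = line
            state = BLANK
        elif state == BLANK:
            if line:
                raise ValueError(f"expected a blank line, got {line!r}")
            state = GOALS
        elif state == GOALS:
            if line:
                team, scorer = line.split(maxsplit=1)
                game["goals"].append((team, scorer))
            else:
                state = DONE  # blank line: no more goals
        elif state == DONE:
            break
    if state != DONE:
        raise ValueError("file ended before the goal list")
    return game


sample = """LEAFS VS. CANADIENS
2021-05-25

LEAFS Tavares
CANADIENS Caufield
"""
game = parse_game(sample)
# game["goals"] == [('LEAFS', 'Tavares'), ('CANADIENS', 'Caufield')]
```

Only state 3 stays in place after reading a line; every other state moves forward after one valid line or raises an error.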

## Serialization

Serialization is the process of converting an object's in-memory state directly into a form that can be stored. The difference between this and simply saving data is that plain data must be interpreted by program logic to rebuild the program state, whereas serialization stores the actual state of the object.

Imagine a pseudo-RNG class. Perhaps it's written to take a seed int, from which it can generate a repeating sequence of 65,536 random numbers. It keeps an `i` value to track where it is in that sequence. To define a particular sequence, you could save the seed; but if you serialize the object, the stored data would contain both the seed and the `i` (as well as any other properties it might contain).

This has the advantage of not needing to think about the format and having a built-in encoder/decoder. That's a big advantage, only partly negated by three disadvantages: you might store things you don't need or want to store (e.g. time-based attributes might be out of date); you expose the structure of your constructed object; and it implicitly assumes an object-oriented programming design if you want to elegantly capture a program.

The standard Python module for serialization is `pickle`. It's easy to use, but its output is a binary format that is hard to read or edit by hand, which makes it difficult to verify the stored data.
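As a rough illustration, the sketch below pickles Python's built-in `random.Random` generator, which stands in here for the pseudo-RNG class imagined above. The point is that the pickled bytes capture the generator's current position in its sequence, not just the seed.

```python
import pickle
import random

rng = random.Random(2021)      # a seeded generator
for _ in range(3):
    rng.random()               # advance its internal position

blob = pickle.dumps(rng)       # bytes; not meant to be human-readable
restored = pickle.loads(blob)  # only unpickle data you trust

# Both objects continue from the same point in the sequence.
same_next_value = restored.random() == rng.random()
```

To work with a file instead of a bytes object, `pickle.dump(obj, f)` and `pickle.load(f)` do the same job on a file opened in binary mode (`"wb"` or `"rb"`).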

An alternative is JSON, which is a human-readable format and extremely common for sharing data between programs and even languages. Especially on the web, but also in other applications, JSON is an industry standard for serializing an object.
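JSON can only represent dicts, lists, strings, numbers, booleans and null, so an object has to be converted to and from those types. Here is a minimal sketch using the standard `json` module and a toy class, `SimpleRNG`, invented for this example to mirror the seed-plus-`i` generator described above (its arithmetic is a placeholder, not a good random number generator).

```python
import json


class SimpleRNG:
    """Toy generator (hypothetical): a seed plus a position i."""

    MODULUS = 65536

    def __init__(self, seed, i=0):
        self.seed = seed
        self.i = i

    def next(self):
        # Placeholder arithmetic; the value depends only on seed and i.
        value = (self.seed + self.i * 13849) % self.MODULUS
        self.i = (self.i + 1) % self.MODULUS
        return value


rng = SimpleRNG(seed=7)
rng.next()
rng.next()

# vars() gives the instance attributes as a dict, which JSON can store.
text = json.dumps(vars(rng))  # '{"seed": 7, "i": 2}'

# Rebuild the object by passing the stored attributes to the constructor.
restored = SimpleRNG(**json.loads(text))
```

Unlike the `pickle` output, the stored text can be read and edited by hand, at the cost of writing the conversion yourself.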

## Tasks

1. Write a simple HTML parser based on the above example. The tags do not need any attributes (such as `href`), just the tag name. You should be able to parse `html`, `head`, `body`, `title`, `h1`, `p`, and `img` (the `img` tag will have no content or closing tag; it is NOT an enclosing tag and will appear as `<img />`). Return a dict.

2. Write a state machine that can handle Assignment files from the File Format assignment. Assume that one file can contain any number of Assignments, separated by one or more blank lines. The assignment description may also be on multiple lines (without semantic significance). Also, the lines are not labelled. For example:

```
9
Civics
2021-05-27
Write to your MP.
Make sure to format it as a proper letter with salutations.

10
English
2020-09-10
Make a media poster.

...
```

Note that we will just use plaintext, not the assignment's binary encoding. Your parser should create a list of dicts that have four keys.

3. Serialize.
   1. Add a few lines to either #1 or #2 that `pickle` the parse result to a supplied filename, and that can unpickle it from that file, returning the original object.
   2. Do the same with JSON.
