# Data Structures on the system

The system uses a number of data structures.  We will explore two of them:
- the file system
- a file

### The file system

The file system is where all data is stored and organized.  Its structure is that of a hierarchy or tree, with branches emanating from a "root" and from "nodes" ( aka branch points ) and terminating in "leaves".


### A file

A file is just a series of bytes of a finite length.  
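
To make "a series of bytes" concrete, here is a minimal sketch that writes a tiny file and then measures its length.  The scratch directory and the file name `sample.txt` are made up for illustration; they are not part of the system shown in this notebook.

```shell
# Work in a throwaway directory so nothing on the real system is touched.
workdir=$(mktemp -d)
sample="$workdir/sample.txt"     # hypothetical file name

# printf writes exactly the bytes we ask for: 'h', 'i', and a newline.
printf 'hi\n' > "$sample"

# wc -c reports the length of the file in bytes.
byte_count=$(wc -c < "$sample" | tr -d ' ')
echo "sample.txt is $byte_count bytes long"

# Clean up the scratch directory.
rm -r "$workdir"
```

The two visible letters plus the newline at the end account for every byte in the file.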

## The file system

One command that we can use to display the file system hierarchy is `tree`.

For example:

In [10]:
tree -F /home/

/home/
└── jovyan/
    ├── apt.txt
    ├── environment.yml
    ├── hello.jl
    ├── hello.R
    ├── Initial.Data.Analysis/
    │   ├── data-structures.files.folders.ipynb
    │   ├── env.rc
    │   └── README.md
    ├── LICENSE
    ├── LinuxTerminal/
    │   └── Linux.Terminal.Setup.ipynb
    ├── Manifest.toml
    ├── postBuild*
    ├── Project.toml
    ├── README.md
    ├── runtime.txt
    └── start*

3 directories, 15 files


( We will explain the details of using `tree` and other commands later.  For now, you can just run the commands to see what output they generate. )

Here we see a portion of the file system hierarchy.  The tree is made up of only two elements: directories ( aka folders ) and files.  Directories can contain other directories ( aka subdirectories ) and files.  In this example, directories are denoted by ending with a '/' symbol.  Here we have four directories: home, jovyan, Initial.Data.Analysis, and LinuxTerminal.  The home folder has a single child, jovyan, which itself has a number of children, including two subdirectories, Initial.Data.Analysis and LinuxTerminal.  The LinuxTerminal directory contains no subdirectories and has only a single file, Linux.Terminal.Setup.ipynb.

Notice that the home directory is prefixed with a slash '/'.  This is the symbol for the root of the tree, or the "root" directory.  The slash is also the symbol used to separate a directory from its subdirectories.  For example, we can list the files and folders in a specific directory by using the `ls` command:

In [7]:
ls -F /home/jovyan/

apt.txt
environment.yml
hello.jl
hello.R
Initial.Data.Analysis/
LICENSE
LinuxTerminal/
Manifest.toml
postBuild*
Project.toml
README.md
runtime.txt
start*


A few items to note:
1. all subdirectories end with a slash '/'. 
1. the contents of those subfolders are not listed
1. entries are listed in alphabetical order ( here, without regard to upper or lower case )
1. file names can have a mix of upper and lower case characters

Chaining together directories and subdirectories is called creating a path.  In the previous example '/home/jovyan/' is the path. Since LinuxTerminal is a subdirectory, we can append it to the path.  For example:

In [9]:
ls -F /home/jovyan/LinuxTerminal/

Linux.Terminal.Setup.ipynb


In this case, there is only one file in the LinuxTerminal subdirectory and there are no further subdirectories.  We have reached the end of the path.
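
If you would like to experiment with paths without touching the directories above, the following sketch builds a small tree of its own and walks down it one level at a time.  The names `projects`, `notes`, and `todo.txt` are invented for this example.

```shell
# Build a small tree inside a throwaway directory.
workdir=$(mktemp -d)

# mkdir -p creates every directory along the path in one step.
mkdir -p "$workdir/projects/notes"
touch "$workdir/projects/notes/todo.txt"

# Each '/' appended to the path moves one level deeper into the tree.
ls -F "$workdir/"
ls -F "$workdir/projects/"
listing=$(ls -F "$workdir/projects/notes/")
echo "$listing"

# Clean up the scratch directory.
rm -r "$workdir"
```

The first two listings each show a single subdirectory marked with a trailing '/', and the last shows only the file, which is the end of that path.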

## Files

### File contents

Just as `ls` displays the contents of a directory, `cat` displays the contents of a file ... with some interpretation.  For example:

In [24]:
cat /etc/debian_version

buster/sid


`cat` displays the characters that are in the file. ( Notice the absence of a period at the end. )  However ...

What is really on the storage device is actually just a series of 1's and 0's.  And we can display that using the `xxd` command.  For example,

In [26]:
xxd -b -g0 /etc/debian_version | cut -d' ' -f2 | tr -d '\n' ; echo 

0110001001110101011100110111010001100101011100100010111101110011011010010110010000001010


We can modify the previous command to display the interpretation:

In [29]:
xxd -b -g0 /etc/debian_version | cut -d' ' -f2-

011000100111010101110011011101000110010101110010  buster
0010111101110011011010010110010000001010          /sid.


We can modify it further to format the 1's and 0's into groups of eight, each called a byte, so that they are easier to view, and to prefix each line with the offset of its first byte ( notice that the offset is zero-based and is written in hexadecimal ).

In [39]:
xxd -b -g1 /etc/debian_version

00000000: 01100010 01110101 01110011 01110100 01100101 01110010  buster
00000006: 00101111 01110011 01101001 01100100 00001010           /sid.


On the far right of the output, we see that the 1's and 0's get interpreted as the letters 'b', 'u', 's', etc.  Also notice that the last byte, '00001010', was displayed as a '.'.  That byte is actually a non-printable character, one of many, and `xxd` shows a '.' in place of any byte it cannot print.  This one happens to be the end-of-line ( aka newline or '\n' ) character, which we will encounter more of later on.  Other non-printable characters frequently encountered include tab ( '\t' ), carriage return ( '\r' ), and null ( '\0' ).
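
As a small illustration of non-printable characters, the sketch below sends four bytes through `od`, a different tool from the `xxd` used above, chosen here only because its `-c` option names each escape instead of hiding it.  The input string is made up for the example.

```shell
# Send four bytes through od: 'a', a tab, 'b', and a newline.
# -An hides the offset column; -c shows each byte as a character or an escape.
printf 'a\tb\n' | od -An -c

# Strip od's padding so the escapes are easier to read side by side.
visible=$(printf 'a\tb\n' | od -An -c | tr -d ' \n')
printf '%s\n' "$visible"
```

The tab and the newline appear as the two-character escapes `\t` and `\n`, even though each one occupies a single byte in the stream.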

### Lines

One way to interpret a file stream is as a collection of "lines".  That is, a program can read the file one character at a time until it gets to an "end-of-line" character, then it can operate on that line, then read the next line.  `cat` does this with every line in a file. For example, we can have `cat` prefix each line with the line number:

In [42]:
cat -n /etc/debian_version

     1	buster/sid


That's not very exciting with a file that has only one line.  So, here's the same command run on a file with multiple lines:

In [53]:
cat -n /etc/os-release

     1	NAME="Ubuntu"
     2	VERSION="18.04.6 LTS (Bionic Beaver)"
     3	ID=ubuntu
     4	ID_LIKE=debian
     5	PRETTY_NAME="Ubuntu 18.04.6 LTS"
     6	VERSION_ID="18.04"
     7	HOME_URL="https://www.ubuntu.com/"
     8	SUPPORT_URL="https://help.ubuntu.com/"
     9	BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
    10	PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
    11	VERSION_CODENAME=bionic
    12	UBUNTU_CODENAME=bionic


In this case, `cat` reads each character until it gets to the end of the line, prints the line number followed by the line, then repeats the process until it gets to the end of the file.
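
The read-a-line, operate, repeat pattern can be sketched directly in the shell.  This is only an approximation of what `cat -n` does, not how `cat` is actually implemented; the function name `number_lines` and the sample file are hypothetical.

```shell
# Create a three-line sample file in a throwaway directory.
workdir=$(mktemp -d)
sample="$workdir/lines.txt"      # hypothetical file name
printf 'first\nsecond\nthird\n' > "$sample"

# number_lines FILE: print each line of FILE prefixed with its line number.
number_lines() {
    n=0
    # IFS= preserves leading spaces; -r keeps backslashes literal.
    while IFS= read -r line; do
        n=$((n + 1))
        # Operate on the line: print the counter, a tab, then the text.
        printf '%6d\t%s\n' "$n" "$line"
    done < "$1"
}

numbered=$(number_lines "$sample")
printf '%s\n' "$numbered"

# Clean up the scratch directory.
rm -r "$workdir"
```

Each pass through the loop reads up to the next newline, so the loop body runs once per line.  One caveat of this simple loop is that a final line with no trailing newline would be skipped.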

Many other commands use this pattern of reading a line, operating on it, then repeating.  Let's look at a few:
- head
- tail
- cut
- wc
- file

In [4]:
# head displays the first 10 lines of a file if not given any options
## you can specify more or fewer lines by giving it the option -n X, where X is a whole number
## here we get the first 4 words from a dictionary file
head -n 4 /usr/share/dict/words

A
A'asia
A's
AATech


In [5]:
# tail displays the last 10 lines of a file if not given any options
## you can specify more or fewer lines by giving it the option -n X, where X is a whole number
## here we get the last 4 words from a dictionary file
tail -n 4 /usr/share/dict/words

évolué
évolués
événement
événements
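
Because `head` and `tail` both work in whole lines, they can be combined to pull lines out of the middle of a file.  The sketch below uses a made-up five-line file rather than the dictionary, so the result is easy to check by eye.

```shell
# Create a five-line sample file in a throwaway directory.
workdir=$(mktemp -d)
sample="$workdir/letters.txt"    # hypothetical file name
printf 'alpha\nbravo\ncharlie\ndelta\necho\n' > "$sample"

# Keep the first 3 lines, then keep the last 2 of those: lines 2 and 3.
middle=$(head -n 3 "$sample" | tail -n 2)
printf '%s\n' "$middle"

# Clean up the scratch directory.
rm -r "$workdir"
```

The `|` symbol passes the output of `head` to `tail` as its input, the same way the `xxd` examples earlier passed their output to `cut`.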


In [11]:
# cut displays the character range specified by the -c option
## range is specified using 1-based counting
## here we get the first 6 characters from the /etc/debian_version file.
cat /etc/debian_version
cut -c 1-6 /etc/debian_version

buster/sid
buster


In [12]:
## here we get characters 7-10 from the /etc/debian_version file.
cut -c 7-10 /etc/debian_version

/sid
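
Besides character ranges, `cut` can split each line into fields, which is what the `-d' ' -f2` options in the earlier `xxd` examples were doing.  As a hedged sketch, here it is applied to two `KEY=value` lines copied from the `/etc/os-release` output above, with '=' as the delimiter.

```shell
# Two KEY=value lines in the style of /etc/os-release.
settings=$(printf 'ID=ubuntu\nVERSION_CODENAME=bionic\n')

# -d sets the delimiter character; -f picks which field to keep (1-based).
keys=$(printf '%s\n' "$settings" | cut -d= -f1)
values=$(printf '%s\n' "$settings" | cut -d= -f2)

printf '%s\n' "$keys"
printf '%s\n' "$values"
```

As with `-c`, the selection is applied to every line independently.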


In [14]:
# wc gives a summary of how many lines, words, and bytes there are in a file (printed in that order)
wc /usr/share/dict/words

 654749  654749 6876726 /usr/share/dict/words
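
The three numbers that `wc` prints are, in order, lines, words, and bytes.  Each can also be requested on its own with the `-l`, `-w`, and `-c` options, as in this sketch on a small made-up file.

```shell
# Create a two-line sample file in a throwaway directory.
workdir=$(mktemp -d)
sample="$workdir/poem.txt"       # hypothetical file name
printf 'one two\nthree\n' > "$sample"

# Ask for one count at a time; tr removes any padding around the number.
lines=$(wc -l < "$sample" | tr -d ' ')
words=$(wc -w < "$sample" | tr -d ' ')
bytes=$(wc -c < "$sample" | tr -d ' ')
echo "lines=$lines words=$words bytes=$bytes"

# Clean up the scratch directory.
rm -r "$workdir"
```

The byte count includes the two newline characters, which is why it is larger than the number of visible letters and spaces.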


In [18]:
# file gives you a reasonable guess as to what type of file it is.
file /usr/share/dict/words
file /etc/dictionaries-common/words
file /usr/share/dict/american-english-insane
file /etc/debian_version

/usr/share/dict/words: symbolic link to /etc/dictionaries-common/words
/etc/dictionaries-common/words: symbolic link to /usr/share/dict/american-english-insane
/usr/share/dict/american-english-insane: UTF-8 Unicode text
/etc/debian_version: ASCII text
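
The first two results show a chain of symbolic links, where each link is a small entry that points at another path.  The sketch below recreates a similar two-link chain in a scratch directory; the names `real.txt`, `alias.txt`, and `alias2.txt` are invented, and it assumes the `readlink` command is available, as it is on typical Linux systems.

```shell
# Build a real file and a two-link chain pointing at it.
workdir=$(mktemp -d)
printf 'hello\n' > "$workdir/real.txt"
ln -s "$workdir/real.txt"  "$workdir/alias.txt"
ln -s "$workdir/alias.txt" "$workdir/alias2.txt"

# readlink reports where a link points, one hop at a time.
first_hop=$(readlink "$workdir/alias2.txt")
echo "alias2.txt points to $first_hop"

# Commands such as cat follow the whole chain to the real file.
content=$(cat "$workdir/alias2.txt")
echo "$content"

# Clean up the scratch directory.
rm -r "$workdir"
```

This is why `cat`, `head`, and `wc` all worked on `/usr/share/dict/words` earlier even though that path is itself only a link.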
