# Data Structures on the system

### Prerequisites



Run these commands in the terminal pane of your VS Code session in Codespaces.
They are not needed if you are running this notebook in Binder.

In [1]:
apt-get update
apt-get install -y wamerican-insane file

Hit:1 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:2 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:3 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:4 http://security.ubuntu.com/ubuntu jammy-security InRelease
Reading package lists...

In [2]:
commands="tree find ls cat head tail wc fgrep cut seq shuf sed awk"
# count the commands in the list
tr ' ' '\n' <<< "${commands}" | wc -l
# count how many of them are installed
which ${commands} | wc -l

13
13


The system has a number of data structures that it uses.  The two that we will be exploring are these:
- the file system
- a file

And like any other data structure, we can perform the four CRUD operations on them: Create, Read, Update, Delete.
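
As a rough sketch of what the four CRUD operations can look like for a file, the commands below create, read, update, and delete a scratch file. The file name `greeting.txt` and the temporary directory are made up for illustration, and the in-place edit assumes GNU `sed`.

```shell
# Work in a throwaway directory so nothing else is touched
workdir=$(mktemp -d)
f="$workdir/greeting.txt"      # hypothetical file name

# Create: redirect output into a new file
echo 'Hello, world!' > "$f"

# Read: display the contents
created=$(cat "$f")
echo "$created"

# Update: edit the file in place (assumes GNU sed)
sed -i 's/world/file/' "$f"
updated=$(cat "$f")
echo "$updated"

# Delete: remove the file, then the now-empty directory
rm "$f"
rmdir "$workdir"
```

Each of these commands is covered in more detail further on; the point here is only that every CRUD operation has a command-line counterpart.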

### The file system

The file system is where all data is stored and organized.  The structure is that of a hierarchy or tree, with branches emanating from a "root" and from "nodes" ( aka branch points ) and terminating in "leaves".


### A file

A file is just a series of bytes of a finite length stored on the file system.

## The file system

One command that we can use to display the file system hierarchy is `tree`.

For example:

In [3]:
tree -F /etc/apt/

/etc/apt//
├── apt.conf.d/
│   ├── 01-vendor-ubuntu
│   ├── 01autoremove
│   ├── 70debconf
│   ├── docker-autoremove-suggests
│   ├── docker-clean
│   ├── docker-disable-periodic-update
│   ├── docker-gzip-indexes
│   └── docker-no-languages
├── auth.conf.d/
├── keyrings/
├── preferences.d/
├── sources.list
├── sources.list.d/
└── trusted.gpg.d/
    ├── ubuntu-keyring-2012-cdimage.gpg
    └── ubuntu-keyring-2018-archive.gpg

6 directories, 11 files


In [4]:
find /etc/apt


/etc/apt
/etc/apt/sources.list
/etc/apt/sources.list.d
/etc/apt/keyrings
/etc/apt/apt.conf.d
/etc/apt/apt.conf.d/docker-disable-periodic-update
/etc/apt/apt.conf.d/docker-autoremove-suggests
/etc/apt/apt.conf.d/70debconf
/etc/apt/apt.conf.d/docker-gzip-indexes
/etc/apt/apt.conf.d/01autoremove
/etc/apt/apt.conf.d/01-vendor-ubuntu
/etc/apt/apt.conf.d/docker-no-languages
/etc/apt/apt.conf.d/docker-clean
/etc/apt/preferences.d
/etc/apt/trusted.gpg.d
/etc/apt/trusted.gpg.d/ubuntu-keyring-2012-cdimage.gpg
/etc/apt/trusted.gpg.d/ubuntu-keyring-2018-archive.gpg
/etc/apt/auth.conf.d


( We will explain the details of using `tree` and other commands later.  For now, you can just run the commands to see what output they generate. )

Here we see a portion of the file system hierarchy.  The tree is made up of only two elements: directories ( aka folders ) and files.  Directories can contain other directories ( aka subdirectories ) and files.  In the above `tree` example, directories are denoted by ending with a '/' symbol.
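
To make the two element types concrete, here is a small sketch that builds a miniature tree in a temporary directory and lists its directories and files separately. The directory and file names only mimic the listing above; `mkdir -p` and `touch` are standard commands not otherwise covered here.

```shell
root=$(mktemp -d)                       # stand-in for the "root" of a tiny tree
mkdir -p "$root/etc/apt/apt.conf.d"     # nested directories (branches)
touch "$root/etc/apt/sources.list"      # an empty file (leaf)

# List directories and files separately, relative to the temporary root
dirs=$(cd "$root" && find . -type d | sort)
files=$(cd "$root" && find . -type f | sort)
echo "$dirs"
echo "$files"

rm -r "$root"                           # clean up
```

The `-type d` and `-type f` options tell `find` to report only directories or only files.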

Notice that the starting directory is prefixed with a slash '/'.  This is the symbol for the root of the tree or the "root" directory. The slash is also the symbol used to delimit directories from subdirectories.  For example, we can list the files and folders in a specific directory by using the `ls` command:

In [5]:
ls -1F /etc/apt/

apt.conf.d/
auth.conf.d/
keyrings/
preferences.d/
sources.list
sources.list.d/
trusted.gpg.d/


A few items to note:
1. all subdirectories end with a slash '/'. 
1. the contents of those subfolders are not listed
1. files are listed in alphabetical order
1. file names can have a mix of upper and lower case characters

Chaining together directories and subdirectories is called creating a path.  In the previous example '/etc/apt/' is the path. Since trusted.gpg.d is a subdirectory, we can append it to the path.  For example:

In [6]:
ls -1F /etc/apt/trusted.gpg.d

ubuntu-keyring-2012-cdimage.gpg
ubuntu-keyring-2018-archive.gpg


Note there are two files in the trusted.gpg.d subdirectory and there are no further subdirectories.  We have reached the end of the path.
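
A path can also be taken apart again. The sketch below splits one of the paths from the listing into its parent directory and its last component using `dirname` and `basename`, and counts the components by turning each slash into a newline. It works purely on the text of the path, so the file does not need to exist.

```shell
full_path=/etc/apt/trusted.gpg.d/ubuntu-keyring-2018-archive.gpg

parent=$(dirname "$full_path")    # everything before the last slash
leaf=$(basename "$full_path")     # the last component of the path
echo "$parent"
echo "$leaf"

# Drop the leading root slash, then count the remaining components
depth=$(echo "${full_path#/}" | tr '/' '\n' | wc -l)
echo "$depth"
```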

## Files

### File contents

Just like `ls` displays the contents of a subdirectory, `cat` displays the contents of a file ... with some interpretation.  For example:

In [7]:
cat /etc/debian_version

bookworm/sid


`cat` displays the characters that are in the file. ( Notice the absence of a period at the end. )  However ...

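What is really on the storage device is just a series of 1's and 0's, which we can display using the `xxd` command.  ( If `xxd` is not installed, the command fails with `command not found`, as in the recorded output below; on Ubuntu it is provided by the `xxd` package. )  For example: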

In [8]:
xxd -b -g0 /etc/debian_version | cut -d' ' -f2 | tr -d '\n' ; echo 

bash: xxd: command not found



We can modify the previous command to also display how those bits are interpreted as characters:

In [9]:
xxd -b -g0 /etc/debian_version | cut -d' ' -f2-

bash: xxd: command not found


We can modify it further to format the 1's and 0's into groups of eight bits, each called a byte, so that they are easier to view.

In [10]:
xxd -b -g1 /etc/debian_version | cut -d' ' -f2-

bash: xxd: command not found


When the command succeeds, the far right of the output shows that the 1's and 0's that make up each byte get interpreted as the letters 'b', 'o', 'k', etc.  Also notice that the last byte '00001010' is displayed as a '.'.  That byte is actually a non-printable character, one of many.  This one happens to be the end-of-line ( aka newline or '\n' ) character, which we will encounter more of later on.  Other non-printable characters frequently encountered include tab ( '\t' ), carriage return ( '\r' ), and null ( '\0' ).  Lastly, the 1's and 0's can be represented in a more compact form known as hexadecimal.


In [11]:
hexdump -Cc /etc/debian_version | cut -c9-60 | sed -e '1s/  / /g ; 2s/^ // ; 2s/   /  /g'

bash: hexdump: command not found


That is, binary '01100010' can be represented as '62' in hexadecimal.  Both of those map to the letter 'b'.  See the [ASCII man page](https://man7.org/linux/man-pages/man7/ascii.7.html) for a complete map.
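
The same mapping can be checked without `xxd` or `hexdump`. The sketch below uses `od`, which is part of coreutils, to get the hexadecimal value of the letter 'b', and a small helper function to expand a byte value into its eight bits. The helper name `to_bits` is made up for this illustration.

```shell
# Hex value of the byte that encodes the letter 'b'
hex=$(printf 'b' | od -An -tx1 | tr -d ' \n')
echo "$hex"

# Expand one byte value (0-255) into its eight bits, most significant bit first
to_bits() {
  n=$1
  i=7
  bits=""
  while [ "$i" -ge 0 ]; do
    bits="$bits$(( (n >> i) & 1 ))"
    i=$((i - 1))
  done
  echo "$bits"
}

b_bits=$(to_bits "$((0x$hex))")   # convert the hex digits to a number first
echo "$b_bits"

newline_bits=$(to_bits 10)        # 10 is the byte value of the newline character
echo "$newline_bits"
```

The last call uses the byte value 10, which is the end-of-line character discussed earlier.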

So, what is a file?  It is a stream of 1's and 0's ( bits ) that get grouped into chunks of 8 bits to form bytes, which are then interpreted as characters.  It's a tad more complicated than that, but pretty close for now.

### Lines

One way to interpret a file stream is as a collection of "lines" with each line being a collection of "text" characters.  Many files use this approach, e.g. CSV, YAML, HTML.  For these "text" files, a program can read the file one character at a time until it gets to an "end-of-line" character, then it can operate on that line, then read the next line.  `cat` does this with every line in a file. For example, we can have `cat` prefix each line with the line number:

In [12]:
cat -n /etc/debian_version

     1	bookworm/sid


That's not very exciting with a file that has only one line.  So, here's the same command run on a file with multiple lines, first without numbering the lines.

In [13]:
cat /etc/os-release

PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy


... and again with numbering the lines.

In [14]:
cat -n /etc/os-release

     1	PRETTY_NAME="Ubuntu 22.04.3 LTS"
     2	NAME="Ubuntu"
     3	VERSION_ID="22.04"
     4	VERSION="22.04.3 LTS (Jammy Jellyfish)"
     5	VERSION_CODENAME=jammy
     6	ID=ubuntu
     7	ID_LIKE=debian
     8	HOME_URL="https://www.ubuntu.com/"
     9	SUPPORT_URL="https://help.ubuntu.com/"
    10	BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
    11	PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
    12	UBUNTU_CODENAME=jammy


In this case, `cat` reads each character until it gets to the end of the line, prints the line number followed by the line, then repeats the process until it gets to the end of the file.
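
The read-a-line, operate, repeat pattern can be written out by hand. The following is only a sketch of the idea, not how `cat` is actually implemented: the hypothetical function `number_lines` reads a file one line at a time and prints each line with a number in front, using the same layout as `cat -n`.

```shell
# Read a file line by line, printing a right-aligned line number,
# a tab, and then the line itself
number_lines() {
  n=0
  while IFS= read -r line; do
    n=$((n + 1))
    printf '%6d\t%s\n' "$n" "$line"
  done < "$1"
}

# A three-line sample standing in for a text file
sample=$(mktemp)
printf 'alpha\nbeta\ngamma\n' > "$sample"

mine=$(number_lines "$sample")
theirs=$(cat -n "$sample")
echo "$mine"

rm "$sample"
```

`IFS=` and `-r` keep `read` from trimming whitespace or interpreting backslashes, so each line is passed through unchanged.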

Many other commands use this pattern of reading a line, operating on it, then repeating.  Let's look at a few:
- head
- tail
- cut
- wc
- file

BTW, for all these commands, many more details on options and how they work can be found using a Google search for "unix man " followed by the command.  For example, "[unix man head](https://www.google.com/search?q=unix+man+head)".

In [15]:
# head displays the first 10 lines of a file if not given any options
## you can specify more or fewer lines by giving it the option -n X, where X is a whole number
## here we get the first 4 words from a dictionary file
head -n 4 /usr/share/dict/words

A
AA
AAA
AAAA


In [16]:
# tail displays the last 10 lines of a file if not given any options
## you can specify more or fewer lines by giving it the option -n X, where X is a whole number
## here we get the last 4 words from a dictionary file
tail -n 4 /usr/share/dict/words

zyzzyva
zyzzyva's
zyzzyvas
zzz
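
`head` and `tail` can also be combined to pull lines out of the middle of a file. As a small sketch, the numbers 1 to 10 generated by `seq` stand in for a ten-line file here:

```shell
# Lines 4 through 6 of a ten-line stream:
# keep the first 6 lines, then keep the last 3 of those
middle=$(seq 1 10 | head -n 6 | tail -n 3)
echo "$middle"
```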


In [17]:
# cut displays the character range specified by the -c option or a field range specified by the -f option
## range is specified using 1-based counting
## here we get the first 6 characters from the /etc/debian_version file.
cat /etc/debian_version
cut -c 1-6 /etc/debian_version

bookworm/sid
bookwo


In [18]:
## here we get characters 7-10 from the /etc/debian_version file.
cut -c 7-10 /etc/debian_version

rm/s


In [19]:
# wc gives a summary of how many lines, words, and bytes there are in a file
wc /usr/share/dict/words

 663473  663473 6922426 /usr/share/dict/words
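
Each of the three counts can also be requested on its own with the `-l`, `-w`, and `-c` options. The sketch below uses a small scratch file, created just for this example, so the numbers are easy to verify by hand.

```shell
sample=$(mktemp)
printf 'one two\nthree\n' > "$sample"   # 2 lines, 3 words, 14 bytes

lines=$(wc -l < "$sample")   # count of newline characters
words=$(wc -w < "$sample")   # count of whitespace-separated words
bytes=$(wc -c < "$sample")   # count of bytes, newlines included
echo "$lines $words $bytes"

rm "$sample"
```

Reading the file through `<` keeps `wc` from printing the file name after the count.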


In [20]:
# file gives you a reasonable guess as to what type of file it is.
file /usr/share/dict/words
file /etc/dictionaries-common/words
file /usr/share/dict/american-english-insane
file /etc/debian_version
file /bin/grep

/usr/share/dict/words: symbolic link to /etc/dictionaries-common/words
/etc/dictionaries-common/words: symbolic link to /usr/share/dict/american-english-insane
/usr/share/dict/american-english-insane: ASCII text
/etc/debian_version: ASCII text
/bin/grep: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=8824f80502cbcaf10a4b421c2e336710f06d1562, for GNU/Linux 3.2.0, stripped


The caveat is that the line-oriented commands above ( `head`, `tail`, `cut`, `wc` ) give meaningful results only if the file is organized as a "text" file.  That is, the bytes are interpreted as printable characters separated by line endings.  When some other convention is used, the file is termed a "binary" file.  This can lead to some confusion, since all files are ultimately stored as binary data.  The difference lies in how the bytes are organized in the file and interpreted by a program.
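
One rough way to tell the two apart is to look for null bytes, which text files normally do not contain. This is only a heuristic chosen for illustration, not how the `file` command makes its guess, and the function name `has_nul` is made up.

```shell
# Succeeds if the file contains at least one null byte
has_nul() {
  total=$(wc -c < "$1")                     # all bytes
  without=$(tr -d '\000' < "$1" | wc -c)    # bytes left after removing nulls
  [ "$total" -ne "$without" ]
}

text_file=$(mktemp)
bin_file=$(mktemp)
printf 'hello\n' > "$text_file"     # plain text
printf 'a\000b' > "$bin_file"       # contains a null byte

if has_nul "$text_file"; then text_kind=binary; else text_kind=text; fi
if has_nul "$bin_file"; then bin_kind=binary; else bin_kind=text; fi
echo "$text_kind $bin_kind"

rm "$text_file" "$bin_file"
```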

## Creating a file

The previous examples used pre-existing files.  Now we will use some commands that will create data and then put them into a file. We will explore the following commands:
- date
- echo
- seq
- curl

In [21]:
# date prints the date
date

Thu Feb  1 13:20:50 UTC 2024


We can tell a command to put the data into a file by redirecting its output.  That is done using the '>' symbol.  For example, to save the output of `date` to a file called `date.txt`:

In [22]:
# show that date.txt does not exist
ls -F

IDA.ipynb                          code.kata.data-munging.python.ipynb
README.md                          data-structures.files.folders.ipynb
bash.intro.ipynb                   env.rc
bash.setup.sh                      notes.md
code.kata.data-munging.bash.ipynb


In [23]:
# generate a date and redirect the output to the date.txt file
date > date.txt


In [24]:
# show that date.txt now does exist
ls -F

IDA.ipynb         code.kata.data-munging.bash.ipynb    env.rc
README.md         code.kata.data-munging.python.ipynb  notes.md
bash.intro.ipynb  data-structures.files.folders.ipynb
bash.setup.sh     date.txt


In [25]:
# display the contents of the date.txt file with a line number
cat -n date.txt

     1	Thu Feb  1 13:20:50 UTC 2024
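
One thing to be aware of: '>' replaces whatever was already in the file. The shell also offers '>>', not otherwise covered here, which appends to the end instead. A minimal sketch using a scratch file:

```shell
log=$(mktemp)                 # scratch file for this example

echo 'first' > "$log"
echo 'second' > "$log"        # '>' truncates the file: 'first' is gone
after_overwrite=$(cat "$log")
echo "$after_overwrite"

echo 'third' >> "$log"        # '>>' appends: the file now has two lines
after_append=$(wc -l < "$log")
echo "$after_append"

rm "$log"
```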


In [26]:
# echo displays the provided text
echo 'Hello, world!'

Hello, world!


In [27]:
# to save output to a file
echo 'Hello, world!' > hw.txt

In [28]:
# display the contents
cat -n hw.txt

     1	Hello, world!


In [29]:
# seq generates a range of numbers
seq 1 10 > seq.txt
cat -n seq.txt

     1	1
     2	2
     3	3
     4	4
     5	5
     6	6
     7	7
     8	8
     9	9
    10	10


In [30]:
# curl GETs a webpage
## here it downloads a file containing air quality data from the city of Albuquerque
curl -s http://data.cabq.gov/airquality/aqindex/history/042222.0017 > abq.air-quality.dat
head abq.air-quality.dat


BEGIN_FILE
FORMAT_VERSION,2
AGENCY,0017
FILENAME,042222.0017
DATA_VERSION,201904222215
TZONE,MST,7
BEGIN_GROUP
VARIABLE,CO
DATA_TYPE,POINT
MEASUREMENT_TYPE,SAMPLE


## Command pipeline

Much like one can do method chaining in Python, Ruby, JavaScript, and other languages, commands can be piped together using a vertical bar '|'.  In this way, the output of one command can be piped as input into the next command.  For example:

In [31]:
# here the first ten lines of a file are numbered
head abq.air-quality.dat | cat -n

     1	BEGIN_FILE
     2	FORMAT_VERSION,2
     3	AGENCY,0017
     4	FILENAME,042222.0017
     5	DATA_VERSION,201904222215
     6	TZONE,MST,7
     7	BEGIN_GROUP
     8	VARIABLE,CO
     9	DATA_TYPE,POINT
    10	MEASUREMENT_TYPE,SAMPLE


In [32]:
# here only the first field is displayed from the first ten lines and then numbered
head abq.air-quality.dat | cut -d, -f 1 | cat -n

     1	BEGIN_FILE
     2	FORMAT_VERSION
     3	AGENCY
     4	FILENAME
     5	DATA_VERSION
     6	TZONE
     7	BEGIN_GROUP
     8	VARIABLE
     9	DATA_TYPE
    10	MEASUREMENT_TYPE
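
One way to think about a pipeline is as a shortcut for saving each step's output to a file and feeding that file to the next command. The sketch below runs the same three steps both ways; the file names `step1` and `step2` are made up for the comparison.

```shell
# With pipes: generate 1..20, keep the last 5 lines, keep the 2nd character
piped=$(seq 1 20 | tail -n 5 | cut -c 2)

# Without pipes: the same steps using intermediate files
tmp=$(mktemp -d)
seq 1 20 > "$tmp/step1"
tail -n 5 "$tmp/step1" > "$tmp/step2"
staged=$(cut -c 2 "$tmp/step2")
rm -r "$tmp"

echo "$piped"
echo "$staged"
```

Both versions produce the same result; the pipeline simply avoids the intermediate files.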


In [33]:
# shuffle the numbered lines and show only 10, i.e. randomly pick 10 lines
cat -n abq.air-quality.dat | shuf -n 10

   109	Del Norte HS 2      ,350010023,3.2,3.2,3.8,3.3,3.3,3.5,4.5,4.4,5,3.5,2.7,2.6,3,2.9,2.9,3.1,2.9,3.6,4.7,3.9,2.7,1.2
   225	STATIONS,1
   201	END_DTG,201904222159
    37	AVG_TIME,60
   207	STATIONS,1
   198	MEASUREMENT_TYPE,SAMPLE
   214	VARIABLE,WS
   217	CHARACTERISTIC,OBSERVED
   208	BEGIN_DATA
    21	Del Norte HS 1      ,350010023,0.138,0.171,0.196,0.132,0.174,0.272,-999,-999,0.243,0.184,0.12,0.12,0.118,0.125,0.12,0.116,0.118,0.123,0.139,0.123,0.118,0.108


In [34]:
# randomly pick 100 words, keep only the first 10 characters of each, show the first 10 lines, and number them
shuf -n 100 /usr/share/dict/words | cut -c1-10 | head | cat -n


     1	tobira
     2	pycnostyle
     3	Billy's
     4	tattooer's
     5	leching
     6	recaned
     7	radiology
     8	deciduousl
     9	semiaperio
    10	lochometri
