# linux crash course

[linux](https://en.wikipedia.org/wiki/Linux) is one of the most popular operating systems in the world and perhaps the most powerful for data analytics. familiarity with the ins and outs of the linux operating system is an absolute must, in the modern cloud computing environment in particular.

at the same time, "linux" as a topic is enough to fill several courses (let alone slides). much of the material I leave out here *will* be relevant to you in your future.

this is all to say: I've tried to pare down the linux world to the things I think are *most useful* and *most frequently seen* in the wild. at the very least, I hope that when you encounter these commands and utilities in the wild you will think "I think we talked about that that one time..." -- those mental guideposts can be a lifesaver.

<div align="center">**start up and log in to your ec2 instance and follow along**</div>

## file system, paths, and organization

most users are familiar with the windows/dos "drive" concept of file system organization -- you have lettered drives (e.g. the `C:\\`) and some top-level folders in that drive:

+ `C:\\`
    + Program Files
        + Google Chrome
    + Program Files (x86)
        + Minesweeper
    + Users
        + myname
            + AppData
            + My Documents
            + My Pictures
    + Windows
+ `D:\\`
    + no one knows. floppy disk? insanity.
    
over time and use, you've perhaps gotten used to knowing "where" things are, so when you need to tweak something, install something, or save something, you have some instinct built up.

what are those instincts in the linux world?

there is a pretty tried-and-true filesystem hierarchy in the linux world and although it is not as human-readable as the windows analogs, knowing the organization can often prove helpful

### paths

You've probably heard this phrase before, but for completeness' sake: a "path" is a sequence of "directories" ("folders" in the windows and mac worlds) and possibly a file name with an extension. Directories have paths (hence "possibly" a file name).

these sequences are separated by the forward slash character (`/`) in the linux world.

paths can be written in two ways:

1. absolute
    1. this is a path that contains every single directory (folder) relative to some shared root point (called the root folder, discussed below)
2. relative
    1. this is a path that is resolved relative to the place your terminal "is" (the current working directory)
    2. by default, your session will start in your home directory (discussed below), and paths can be relative to this point in the file system
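
to make the two kinds of path concrete, here is a small sketch. the directory and file names (`project`, `data`, `sales.csv`) are made up for illustration, and everything happens in a throwaway directory created by `mktemp`, so nothing on your machine is touched:

```bash
# build a tiny throwaway directory tree to play in (all names are hypothetical)
workdir=$(mktemp -d)
mkdir -p "$workdir/project/data"
touch "$workdir/project/data/sales.csv"

# move our terminal "into" the project directory
cd "$workdir/project"

# relative path: resolved starting from the directory we are currently in
ls data/sales.csv

# absolute path: starts at / and works no matter where our terminal "is"
ls "$workdir/project/data/sales.csv"
```

both `ls` calls point at the same file; only the starting point of the path differs.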

there is a special concept often utilized in linux path discussions: "[globbing](http://tldp.org/LDP/abs/html/globbingref.html)". a "glob" is a string that looks like a path, but contains wild card characters (`*`, `?`, and a handful of others). You can use glob expressions to *match* several files. For example, if you want to find every `csv` file in the current directory, you could look for the glob

```bash
*.csv
```

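the `*` character will match any sequence of characters (including none at all).

advanced: globs are not equivalent to regular expressions. they look similar, but the syntax is simpler and the capabilities are more limited (for example, `*` on its own means "anything" in a glob, while in a regular expression it modifies whatever comes before it)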
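
a quick way to get a feel for globs is to try them on a handful of files you control. the sketch below assumes nothing beyond a throwaway directory from `mktemp`; the file names are invented for the example:

```bash
# create a scratch directory with a few (hypothetical) files in it
workdir=$(mktemp -d)
cd "$workdir"
touch a.csv b.csv ab.csv notes.txt

# * matches any run of characters, so this lists all three csv files
ls *.csv

# ? matches exactly one character, so this lists a.csv and b.csv but not ab.csv
ls ?.csv
```

note that it is the shell, not `ls`, that expands the glob: by the time `ls` runs, it has already been handed the list of matching file names.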

### root folder

the "root" folder is (as the name implies) the base of all file paths on your computer. This is similar to the `C:\\` in windows-world except there is no analog to "other" root folders: everything, including other drives, lives somewhere under the one root.

The root folder is symbolized by just one single "`/`" character.

Open a terminal and try the following:

``` bash
ls -lh /
```

*we'll get into the details later but this command "lists" (`ls`) the contents of the directory "`/`", and makes the list "long" (`-l`) and human-readable (`-h`)*

In [None]:
%%bash
ls -lh /

The contents of the [root folder](http://www.tldp.org/LDP/intro-linux/html/sect_03_01.html) are pretty established in the linux community. As your machines are all the same (Ubuntu 16.04) you should be seeing identical collections of directories and paths. From linux machine to machine there may be some changes, but you can usually count on the same structure.

*note: the linux documentation project (or "tldp") which I linked above is a great beginner's resource. It has become outdated over time, but is often the best and fullest resource for explaining linux concepts -- keep an eye out!*

There are a couple of directories in `/` that deserve special mention.

#### `/bin`

the `/bin` directory (short for "binaries") contains common executable stuff (that is, stuff you can run which will do something).

the word "binary" is a bit of an anachronism now, as non-binary things commonly live in the `/bin` directory

check out the results of

```bash
ls -alh /bin/
```

notice anything?

In [None]:
%%bash
ls -alh /bin/

#### `/etc`

Although the name comes from "*et cetera*" (it originally held all sorts of things), in modern times this folder is home to pretty much one type of file: system-wide configuration files. of all of these directories, this is the one that will most benefit you to know about.

linux has a multi-level understanding of configuration:

1. this command's configuration as I invoke it right now
    1. example: command line flags (`myapp --env dev --db mysql`)
2. "nearby" configuration
    1. example: a configuration file (`myapp.conf`) in this directory
3. user level configuration
    1. example: a configuration file (`~/myapp/myapp.conf`) in a user's home directory
4. system level configuration
    1. example: a configuration file (`/etc/myapp/myapp.conf`)

#### `/etc` (cont.)

the order above represents the order of importance -- the higher up in the list and closer you are to the user typing the command that is about to run, the more important that configuration is.
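
as a rough sketch of that ordering, a program might look for its configuration file in the most specific location first and stop at the first one it finds. the `myapp` name and paths below are the made-up ones from the list above, and real programs vary: many *merge* settings from several levels instead of picking a single file, and command line flags (level 1) are handled separately.

```bash
# candidate config files, ordered from most important to least important
# (all three paths are hypothetical)
chosen=""
for candidate in ./myapp.conf "$HOME/myapp/myapp.conf" /etc/myapp/myapp.conf; do
    if [ -f "$candidate" ]; then
        chosen="$candidate"
        break   # stop at the first (most important) file that exists
    fi
done

if [ -n "$chosen" ]; then
    echo "using configuration from $chosen"
else
    echo "no configuration file found; falling back to built-in defaults"
fi
```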

**by default**, almost all behavior is controlled by the `/etc/` configuration. these files are created as you install packages ("packages" is a term loosely meaning software you can use)

let's check out `/etc`

```bash
ls -alh /etc
```

In [None]:
%%bash
ls -alh /etc

#### `/etc` (cont.)

generally speaking, you should not mess around with files in `/etc`.

generally speaking.

that being said, eventually they will come up and you'll have to do something about it. You'll want to know how to allow a made-up user to access a postgres database, or how to allow `http` access to a neo4j database, or how to restrict access to your cloud computer to *only* ssh key authentication (a really good idea!).

#### `/home`

the linux os developed out of an environment where multiple users utilized the same resources (an im-personal computer, so to speak), so keeping different users' items separate was important.

the `/home` directory (which goes by other names on some systems, e.g. `/Users` on macOS) contains separate, isolated directories for each user of the computer.

it is considered to be best practice to physically or "virtually physically" isolate these directories from everything else. this is so that you can copy your profile and configuration files from one installation / distribution / computer to the next and keep “your” stuff.

#### `/home` (cont.)

as I said, every user has a directory in `/home`. see if your user (`ubuntu`) has one:

```bash
ls -alh /home
```

In [None]:
%%bash
ls -alh /home

#### `/mnt`

Short for "mount", this directory is where external or network drives are "mounted" (*i.e.* networked and attached to the file system so that they can be accessed). If you were to plug in a USB drive to a linux laptop or desktop, for example, those files would be "mounted" to the file system and visible from this directory.

You probably don't have anything mounted to your cloud computer, but take a peek:

In [None]:
%%bash
ls -alh /mnt

#### `/opt`

historically, this directory contained "optional" add-on software packages. This is still often the place on the filesystem you or your sysad might install "special" things (e.g. Citrix for making remote desktop connections, or special text editors). These things are sometimes also installed in another directory (`/usr/local`), so this isn't a hard-and-fast rule.

*Note: not a lot of people know that this exists or that this is what it should be used for!
If there is a thing you want to install and you think "where should I put this..." and you think maybe other users might also use it, a great answer is `/opt`*

what's in your `/opt`?

```bash
ls -alh /opt
```

In [None]:
%%bash
ls -alh /opt

#### `/tmp`

the name "tmp" is (as you guessed) short for "temporary". this directory provides an ephemeral (*i.e.* it gets wiped with some frequency, usually on restart) bucket for any temporary thing you may want to save.

This includes cache files (spotify or chrome downloads), backups of files (autosaved by some editors), *etc*. You should feel absolutely comfortable using it in your work with the obvious caveat that you shouldn't put anything **important** in a place called "temporary".

what's already in your temporary folder?

```bash
ls -alh /tmp
```

In [None]:
%%bash
ls -alh /tmp

#### `/var`

here "var" is short for "variable". this is not in the sense of a programmatic or mathematical variable, but rather an item of unknown duration, number, or volume. these are files which are separated from all other files because they *must* be writable and changeable for a computer to work.

TLDP has [a good write-up](http://www.tldp.org/LDP/Linux-Filesystem-Hierarchy/html/var.html) of the sorts of items kept in `/var`: backups, cache files, database instances, and logs.

Pay attention to `/var/log` in particular -- in linux world, when things go wrong this is often where you need to go to figure out what happened

```bash
ls -alh /var
```

In [None]:
%%bash
ls -alh /var

### special folders

you may have noticed above that every listing from `ls -alh` starts with two special lines that end in `.` and `..` (the `-a` flag is what makes them show up). Those are special path tokens that refer to

+ `.`: the current directory (i.e. the results of command `pwd`)
+ `..`: the directory one level above this one
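
a short sketch of both tokens in action, again in a throwaway directory (the `outer` and `inner` names are invented for the example):

```bash
# make a nested pair of scratch directories and step into the inner one
workdir=$(mktemp -d)
mkdir -p "$workdir/outer/inner"
cd "$workdir/outer/inner"

pwd                # the path printed here ends in outer/inner

cd ..              # ".." is one level up, so we land in outer
pwd                # the path printed here ends in outer

ls -d ./inner      # "./inner" reads as "inner, starting from right here"
```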

### home folders

as I mentioned above, one of the directories within the root directory is `/home`, and this contains an isolated directory for each user. this directory is so important that there is a permanent shortcut for getting there: the `~` character, which bash expands to the path stored in the `HOME` environment variable (more on those below). to show this, let's print it out -- `echo` is the most common "print" command in linux, and will print to the screen whatever is written after it (after resolving variables and shortcuts like `~`):

```bash
echo ~
```

In [None]:
%%bash
echo ~

your terminal should return `/home/ubuntu`, since your user name is `ubuntu`.

Let's go a step further and check out what is in that directory:

```bash
ls -alh ~
```

In [None]:
%%bash
ls -alh ~

note that almost all of the items in the `/` directory had the phrase "`root root`" in the lines describing the files, and the files in this directory instead have `ubuntu ubuntu`.

what do you think is going on there?

those two words are the `owner` and `group` of the file. They don't have to be the same and they can change. we'll cover permissions and permissions structure in just a second.

among those files, there are a few worth pointing out right away -- the ones beginning with the phrase `.bash`. let's list them in our terminal, utilizing a glob (note the `*` character):

```bash
ls -alh ~/.bash*
```

this should match every file in our home directory `~` that starts with a `.bash` and ends with any number of characters of any kind

In [None]:
%%bash
ls -alh ~/.bash*

these files all configure how [*bash*](https://en.wikipedia.org/wiki/Bash_%28Unix_shell%29) (Bourne-again shell) operates, and are responsible for a lot of the ways you interact with the terminal.

within the main file (`.bashrc`) we can set our own environment variables, change program configurations, change the color and wording in our command prompts, and much more.

*note: the "`xxxxxx rc `" naming convention is fairly ubiquitous in the linux world. it stands for "run control" for archaic reasons, but generally speaking it means "the configuration for '`xxxxxx`'"*

speaking of environment variables...

## `bash` and environment variables

the commands you are executing in your terminal are a disparate set of programs, many of them decades old: mostly compiled `C` programs, with some shell and `python` scripts mixed in. The process which exposes them to you as commands you can type is a *shell*, and there are different implementations of shells. The most common -- and the default in ubuntu (and therefore the one on your ec2 instances) -- is one called the "Bourne-again shell", or `bash`. `bash` is a more powerful, largely backwards-compatible successor to an older shell called the Bourne shell, or `sh` for short.

Aside from being a way of interactively executing commands, `bash` is also a scripting language -- you can save a sequence of commands you enter in your terminal for repeat use -- this is basically how batch processing works in the linux world.

again, entire text books have been written about just `bash` and `bash` scripting; we don't have the time to cover it all in this course. For now, just the basics.

### environment variables

Just like `python` or `R`, shell sessions have a concept of variables. They are strings following a dollar sign:

```bash
$VARIABLE_NAME
```

and usually come in that all-caps snake case.
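
here is a minimal sketch of setting and reading a variable of your own. `COURSE_NAME` is a made-up name used only for this example, and `sh -c '...'` is just a convenient way to start a child process:

```bash
# set a variable (no spaces around the = sign, and no $ when assigning)
COURSE_NAME="linux crash course"

# use the $ prefix to read it back
echo "$COURSE_NAME"

# a plain shell variable is private to this shell: a child process does not see it
sh -c 'echo "child sees: [$COURSE_NAME]"'

# export promotes it to an environment variable, which child processes inherit
export COURSE_NAME
sh -c 'echo "child sees: [$COURSE_NAME]"'
```

in an ordinary session the first child should print empty brackets and the second should print the value. that difference is why configuration lines usually use `export`.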

a lot of environment variables are already set -- look at the following, for example:

```bash
echo $USER
echo $HOME
echo $PWD
```

In [None]:
%%bash
echo $USER
echo $HOME
echo $PWD

those are your user name, your home directory, and the directory you are currently "in" (the related `pwd` command is short for "print working directory").

many of these variables were set on a system level (in the aforementioned `/etc` files), and each user's `.bashrc` gives them the ability to make updates as desired.

Let's check out what variables already exist without us making changes via our `.bashrc` files. we can do this with the `env` (environment) command:

```bash
env
```

In [None]:
%%bash
env

That's a lot!

note that `USER`, `HOME`, and `PWD` all exist in that list, along with many more.

At the moment, I'll only point out one other important variable, because it comes up all the time: the `PATH` variable.

```bash
echo $PATH
```

In [None]:
%%bash
echo $PATH

the `PATH` variable is a `:`-separated list of directories in which the shell process should look for programs when you type a command.

suppose you just installed an awesome program called `l33tmode` and you want to run it from the command line. when you type in `l33tmode`, the bash process will check those directories one at a time, in order, to see if `l33tmode` is within any of them. as soon as it finds `l33tmode`, it will execute it (and stop looking). If it makes it through every directory and finds nothing, it will return an error:

In [None]:
%%bash
command_that_doesnt_exist

**why bring this up?**

whether or not a given thing is found in a directory in your path is often a primary cause of unexpected problems. it's also something that you often find yourself updating based on trouble-shooting or installation instructions, so it's good to know what it's there for.

suppose you installed a program you want to use somewhere in `/opt`. that may have been a good idea, but without adding the directory that contains it to your path, you will need to be explicit when calling that command: `/opt/zachs_super_programs/l33tmode`, for example.
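
the sketch below acts this out with a fake `l33tmode`. to keep it harmless, a scratch directory from `mktemp` stands in for `/opt/zachs_super_programs`, and the "program" is a two-line script invented for the example:

```bash
# stand-in for /opt/zachs_super_programs: a scratch directory holding a fake l33tmode
workdir=$(mktemp -d)
mkdir -p "$workdir/zachs_super_programs"
printf '#!/bin/sh\necho "l33tmode engaged"\n' > "$workdir/zachs_super_programs/l33tmode"
chmod +x "$workdir/zachs_super_programs/l33tmode"

# the directory is not on the PATH yet, so the shell cannot find the program by name
command -v l33tmode || echo "l33tmode is not on the PATH"

# spelling out the full path always works
"$workdir/zachs_super_programs/l33tmode"

# append the directory to PATH (for this session only) and call it by name
export PATH="$PATH:$workdir/zachs_super_programs"
l33tmode
```

the `export PATH=...` line only lasts as long as the current session; installation instructions that ask you to "add something to your path" usually mean putting a line like it in your `~/.bashrc`.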

### `.bashrc`

as I mentioned above, `bash` has a primary configuration file for each user located at `~/.bashrc`. 

let's look at the contents of that file using the `cat` (concatenate) command 

```bash
cat ~/.bashrc
```

In [None]:
%%bash
cat ~/.bashrc

**diversion**: 

why did we `cat` this time and `echo` before?

what's different about the two programs?

+ `echo` will print exactly what is passed (after resolving variables) back to the screen
    + we used it above to *expand* the shortcut `~` and then print it so we could read the result
    + if we attempted to `echo` that file, it would literally echo us -- it would take that file name and print it to screen. try it
+ `cat` is expecting a file, and it will open that file and print the contents

```bash
echo ~/.bashrc
```

In [None]:
%%bash
echo ~/.bashrc

Many things are done in `.bashrc` files, but three are most common:

1. run a command every time we start a command line session
    1. example: `source activate my_conda_environment`
2. update or set our own variables
    1. example: `export ENV=DEV`
3. create aliases
    1. example: `alias l33t='my_complicated --expression -t hat -i am --tired oftyping'`
    2. an alias is a shortcut for a larger command or sequence of commands
    3. technically, this is just a type of the first element

let's focus on piece 3 -- creating an alias. Note that we've typed the same thing many times above:

```bash
ls -alh /some/path
```

it would be convenient if we didn't have to type that all the time -- maybe we should make an alias? Try

```bash
ll ~
```

In [None]:
%%bash
ll ~/

on your ec2 system, that should have worked -- someone already put that alias in your `.bashrc` (that someone is the AMI creator).

This command -- and others you will get used to, over time -- is not pre-configured on every machine. Eventually you'll get to a machine where you don't have the `ll` command, and you'll have to remember how to create the alias and where to save it.
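
for that day, here is a minimal sketch of creating the alias yourself. the flags are the `-alh` we have been typing all along; the version already in your `.bashrc` may use slightly different ones.

```bash
# define the alias for the current session (quote the whole command)
alias ll='ls -alh'

# ask the shell what ll currently stands for
alias ll

# to make it permanent, the same alias line goes in ~/.bashrc, for example:
#   echo "alias ll='ls -alh'" >> ~/.bashrc
# new sessions pick it up automatically (or run: source ~/.bashrc)
```

aliases are meant for interactive use: by default bash does not expand them inside scripts, so scripts should spell out the full command.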

here's [a list of awesome aliases](https://www.cyberciti.biz/tips/bash-aliases-mac-centos-linux-unix.html). don't add them yet -- learning the commands the "long" way is useful itself, and gives you the appreciation of the reasons for these shorthands.

### keeping things separate

perhaps at this point you have an appreciation for an off-hand comment back in the "file system, paths, and organization" segment: the files in `/home` are separated from the rest for a specific reason. As time goes on, you will tweak and update your `.bashrc` file to get your bash session *juuuuust* the way you want it. You'll do this for many other files too, and those tweaks and updates will live in your `~` directory.

when the sysad wants to do something reckless (like update the OS without a UAT phase, for example), it will be good to have all the files that define *your* experience living in a place that isn't disrupted by this.

it's also easier, when migrating, to just zip up the contents of your home directory and move them to your new computer. Anyone who has ever bought a new windows desktop knows the dance of keeping the old computer around long enough to be sure you haven't missed any super important files while copying over -- well, in the linux world that is a problem that is solved via discipline and the `/home` directory

## permissions

Let's dig in to the printout information that we see every time we run the `ll -h` command:

```
drwxr-xr-x 4 ubuntu ubuntu 4.0K Aug 23 02:58 ./
drwxr-xr-x 3 root   root   4.0K Aug 23 02:55 ../
-rw-r--r-- 1 ubuntu ubuntu  220 Aug 31  2015 .bash_logout
-rw-r--r-- 1 ubuntu ubuntu 3.7K Aug 31  2015 .bashrc
drwx------ 2 ubuntu ubuntu 4.0K Aug 23 02:58 .cache/
-rw-r--r-- 1 ubuntu ubuntu  655 May 16 12:49 .profile
drwx------ 2 ubuntu ubuntu 4.0K Aug 23 02:55 .ssh/
```

each of the above lines comes with 7 elements:

```
-rw-r--r-- 1 ubuntu ubuntu 3.7K Aug 31  2015 .bashrc
```

1. the permission bits
    1. a 10-character string with characters d, r, w, x, and -
2. the number of "links"
    1. for directories, the number of sub-directories including `.` and `..`
    2. for files, usually 1 (the number of names, or "hard links", that point to the file)
3. the owner name
    1. a user name (`ubuntu` or `root`, here)
4. the owner group
    1. a group name (`ubuntu` or `root`, here)
5. the file size
    1. the flag `-h` writes them as **h**uman readable
6. a timestamp
    1. when the file was last touched
7. the file or directory name

we're going to focus on the permission items -- 1, 3, and 4
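
if you ever want just those three items without the rest of the line, one option is to let `awk` (which shows up again in the piping section) pick out the fields by position. this is only a convenience sketch, and it uses the root directory as its example:

```bash
# print only the permission bits, owner, and group (fields 1, 3, and 4) of the root directory
ls -ld / | awk '{print $1, $3, $4}'
```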

### permissions: user and group

the first thing to know is that the linux world considers three levels of permissions for every file on the system

1. user
2. group
3. global

every file has permissions that apply to the user that owns it, the group that owns it, and then everyone who does not fall into those first two buckets.

right now, your user is `ubuntu`. You could find this with either of the two following commands:

```bash
whoami
```

or

```bash
echo $USER
```

In [None]:
%%bash
whoami

In [None]:
%%bash
echo $USER

each user is also a member of some number of groups (a "group" is a named collection of users). By default, every user in the ubuntu OS is put into a group with the same name as the user name, but you may be in many more.

to check the groups your user is in:

```bash
groups
```

In [None]:
%%bash
groups

so, returning for a moment to our home directory:

```
drwxr-xr-x 4 ubuntu ubuntu 4.0K Aug 23 02:58 ./
drwxr-xr-x 3 root   root   4.0K Aug 23 02:55 ../
-rw-r--r-- 1 ubuntu ubuntu  220 Aug 31  2015 .bash_logout
-rw-r--r-- 1 ubuntu ubuntu 3.7K Aug 31  2015 .bashrc
drwx------ 2 ubuntu ubuntu 4.0K Aug 23 02:58 .cache/
-rw-r--r-- 1 ubuntu ubuntu  655 May 16 12:49 .profile
drwx------ 2 ubuntu ubuntu 4.0K Aug 23 02:55 .ssh/
```

we see that the parent directory (`..`) is owned by user `root` and group `root`, but every other item is owned by us (user `ubuntu` and group `ubuntu`)

how does that compare to the root directory? 

to `/var/log`?

*what is the command for displaying these directories, and who are the users and groups which own those files?*

#### changing user or group ownership

**if** you are the user which owns the file, you can change its group to any other group you are a member of.

changing the *user* which owns a file is more restricted: on linux this generally requires super user privileges (more on the super user below).


the command to change the owner is `chown`

the command to change the group is `chgrp`
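
a small sketch of both commands, under the assumption that you are working on a scratch file you own (`report.txt` is a made-up name). the `chown` line is left commented out because it needs super user rights:

```bash
# create a scratch file; we own it because we made it
workdir=$(mktemp -d)
touch "$workdir/report.txt"

# show the owning user and group (fields 3 and 4 of the long listing)
ls -l "$workdir/report.txt" | awk '{print $3, $4}'

# change the group; here we simply re-apply our own primary group (id -gn),
# which is always allowed for a file we own
chgrp "$(id -gn)" "$workdir/report.txt"

# changing the owning user normally needs super user rights, so it is only shown here:
#   sudo chown root "$workdir/report.txt"
# chown also accepts a user:group pair to set both at once:
#   sudo chown root:root "$workdir/report.txt"

ls -l "$workdir/report.txt" | awk '{print $3, $4}'
```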

### permissions: mode bit string

the real secret sauce of linux permissioning is these leading ten characters, and learning to read them and modify them is a huge win.

the very first character is either a `d` (for "directory"), or a `-` (not a directory).

the remaining 9 characters are actually 3 groups of 3 characters.

the first group of 3 characters is for the **user** which owns the file

the second group of 3 characters is for the **group** which owns the file

the final group of 3 characters is for **everyone else**

each group of 3 characters lays out the privilege level of the user, group, or everyone else:

1. an `r` (for "read") or a `-`
2. a `w` (for "write") or a `-`
3. an `x` (for "execute") or a `-`

"execute" means the file is something which can be run (if the system knows how), or is a directory users can enter.

again, returning to the `ll -h` results:

```
drwxr-xr-x 4 ubuntu ubuntu 4.0K Aug 23 02:58 ./
drwxr-xr-x 3 root   root   4.0K Aug 23 02:55 ../
-rw-r--r-- 1 ubuntu ubuntu  220 Aug 31  2015 .bash_logout
-rw-r--r-- 1 ubuntu ubuntu 3.7K Aug 31  2015 .bashrc
drwx------ 2 ubuntu ubuntu 4.0K Aug 23 02:58 .cache/
-rw-r--r-- 1 ubuntu ubuntu  655 May 16 12:49 .profile
drwx------ 2 ubuntu ubuntu 4.0K Aug 23 02:55 .ssh/
```

what are the permissions on `.bashrc`?

what are the permissions on `.ssh`?

why might the permissions between the two be different?

#### bit representation

there is a common shorthand for discussing those 9 permission characters as a set of 3 digits. the rule is as follows:

+ start with a value of 0
+ if the `x` flag is set, add 1 ($2^0$)
+ if the `w` flag is set, add 2 ($2^1$)
+ if the `r` flag is set, add 4 ($2^2$)

this guarantees a unique permission number for every one of the 8 possible permission combinations:

| permission string | value |
|-------------------|-------|
| `---` | 0 |
| `--x` | 1 |
| `-w-` | 2 |
| `-wx` | 3 |
| `r--` | 4 |
| `r-x` | 5 |
| `rw-` | 6 |
| `rwx` | 7 |

Because of this, a sequence of 9 characters is often discussed as 3 numbers:

+ `rwxrwxrwx` is 777
+ `rw-r-----` is 640
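
the addition rule is easy to mechanize. the helper below is a sketch written for these notes (`perm_to_digits` is not a standard command); it walks the 9 characters three at a time and adds 4, 2, and 1 for `r`, `w`, and `x`:

```bash
# hypothetical helper: turn a 9-character permission string into its 3 digits
perm_to_digits() {
    perms="$1"
    digits=""
    for start in 1 4 7; do
        # cut out the 3 characters for user, then group, then everyone else
        triplet=$(printf '%s' "$perms" | cut -c"${start}-$((start + 2))")
        value=0
        case "$triplet" in r??) value=$((value + 4)) ;; esac   # read
        case "$triplet" in ?w?) value=$((value + 2)) ;; esac   # write
        case "$triplet" in ??x) value=$((value + 1)) ;; esac   # execute
        digits="${digits}${value}"
    done
    printf '%s\n' "$digits"
}

perm_to_digits "rwxrwxrwx"   # prints 777
perm_to_digits "rw-r-----"   # prints 640
```

in day-to-day work you would not write this yourself; it is only here to show that the three digits carry exactly the same information as the nine characters.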

#### changing permission mode string

the command to change the permission mode string is `chmod`, and it works in both the character representation and the bit representation

to update the permission mode string using the character representation, you ask the following:

+ am I changing permissions for
    + `u`: the user which owns the file
    + `g`: the group which owns the file
    + `o`: all others
    + `a`: all of the above at once
+ am I
    + `+`: adding permission
    + `-`: revoking permission
+ is the permission
    + `r`: read
    + `w`: write
    + `x`: execute
    
given the above answers, you concatenate them. To give (`+`) all users (`a`) read permission (`r`) you would run

```bash
chmod a+r /my/file
```

to update the permission mode string using the numeric representation, you simply calculate the exact permission string you wish to have as a number and assign it.

for example, to give the owning user and group complete control over a file, but completely restrict the outside world, you would apply

```bash
chmod 770 /my/file
```

for your `ssh` key, you probably had to make the permissions more restrictive -- this is to make sure that you and only you can read it. To do this, we make the file read-only for us and remove all permissions from group owners and everyone else:

```bash
chmod 400 ~/.ssh/my_aws_private_key.pem
```
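
to watch the mode string change, you can practice on a scratch file. in this sketch `demo.txt` and the `show_mode` helper are made up for the example; the helper just trims `ls -l` down to the ten permission characters:

```bash
workdir=$(mktemp -d)
target="$workdir/demo.txt"
touch "$target"

# hypothetical helper: print just the 10 permission characters of a file
show_mode() {
    ls -l "$1" | cut -c1-10
}

chmod 640 "$target"    # numeric: rw- for the user, r-- for the group, --- for everyone else
show_mode "$target"    # -rw-r-----

chmod a+r "$target"    # symbolic: give (+) all users (a) read permission (r)
show_mode "$target"    # -rw-r--r--

chmod g+w "$target"    # symbolic: give (+) the group (g) write permission (w)
show_mode "$target"    # -rw-rw-r--

chmod 400 "$target"    # numeric: read-only for the owner, nothing for anyone else
show_mode "$target"    # -r--------
```

the numeric form sets all nine characters at once, while the symbolic form changes only the pieces you name and leaves the rest alone.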

changing file permissions is a bit of a black art at first. You'll open the same webpages for the same tutorials every time. You'll learn that there are special characters beyond `rwx` and they do wild and mysterious things.

Eventually, you'll get the hang of it!

at the very least, you should know how to read these permission strings. They will explain so many of the errors you encounter.

## super user, *aka* [`sudo`](https://xkcd.com/149/)

there is one major exception to the rules listed above: the super user.

linux has a concept called "super user" which is, as the name implies, pretty super. this user can be thought of as the administrator account on the linux machine, and as administrator it owns many of the most important configuration files and runs most of the essential processes.

Anything that the super user does is protected by the permission structure -- only the super user may change the files or alter the processes run by the super user.

the super user is often colloquially and sometimes literally called `root` (literally on ubuntu -- you will notice that a user named `root` owns and runs most things).
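
whatever the account is called, the super user is the one with numeric user id 0, and that gives a simple way to check who a script is running as. a small sketch (the `role` variable is just a name chosen for the example):

```bash
# the super user always has numeric user id 0
if [ "$(id -u)" -eq 0 ]; then
    role="super user"
else
    role="regular user"
fi

echo "running as a $role (uid $(id -u))"
```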

this all raises the question: when I want to break everything, how do I do it? if this super user has been put in place to properly configure everything on my machine and keep things running smoothly, how am I expected to make a mess of it?

linux developers thought of this, too, and have constructed a system whereby non-`root` users can be given root permissions in some or all cases.

on your ec2 instance, user `ubuntu` has already been granted most sudo privileges. in the real world, the sysad will make this decision. If a sysad *doesn't* give you sudo privileges, know that while that is annoying that is *actually a very good sign* -- that sysad has standards and control policies in place and there is likely a centralized way of doing things.

*weeds note: granting sudo permission is done by adding users to a "sudoers" file, where you can configure the level and types of access granted to each user on the machine. it can be configured down to the command level*

you *use* your sudo privileges by "becoming" the super user. you can do this in two ways:

1. `sudo su`
    1. this will have you "log in" and create a new shell in which you are acting in all ways as the super user.
    2. every command you type after this until you `exit` will be performed as the root user
    3. **BE CAREFUL!**
    4. probably don't
2. `sudo [type my command here]`
    1. this will momentarily log you in as the root user and execute the provided command, then log you out
    2. **STILL BE CAREFUL!**
    3. still probably don't
    
note: `sudo su` is technically just a special instance of `sudo [type my command here]` where the command is `su`.

## philosophy

there is a much ballyhooed [list of linux and unix philosophies](https://en.wikipedia.org/wiki/Unix_philosophy#Mike_Gancarz:_The_UNIX_Philosophy):

1. small is beautiful
2. do one thing and do it well (DOTADIW)
3. build a prototype as soon as possible
4. choose portability over efficiency
5. store data in flat text files
6. use software leverage to your advantage
7. use shell scripts to increase leverage and portability
8. avoid captive user interfaces
9. make every program a filter

most of these are guidelines on how to *develop* linux, but there are important lessons in here about how you are intended to *use* linux.

##### do one thing and do it well (DOTADIW)

this is the most commonly cited linux philosophy, and is pretty central to the identity. I'd note that this is the opposite of many windows or mac world solutions, which attempt to act as swiss army knives across multiple domains.

as such, you should be wary of commands which purport the ability to do many things. though they may fail to live up to their promise, that is not the reason: the real reason to avoid such tools is that the community at large is likely to reject them, and support for them may not extend past the first developer. if that developer becomes distracted or disinterested, they may become *abandonware*

this is not a problem all that often, but something worth knowing ahead of time.

##### store data in flat text files

many of the linux and bash tools you will use are specifically optimized for reading and writing files. this is contrary to some data science and data engineering instincts, which favor databases or data lakes

personally, I recommend the following approach: flat text files are the default; advanced storage methods are the exception. 

##### avoid captive user interfaces

this is related to a basic usability question: should running a program require active participation from a user?

beginning programmers often rely on interactive prompts or hard-coded pieces of programs to set the variables and constants which configure how your program will be processed. *both* of these approaches cause a user to "interact" with the codebase in order to run a specific instance of a program.

for example, when developing a data engineering pipeline, you may be inspired to have the program prompt for the number of folds in your cross-validation. you probably know that it's going to be annoying to have to type that number in every time you run.

why is hard-coding the number better?

why is it worse?

what is another option?

put these possibly changeable parameters and constants into *configuration* files, and build a program which knows how to *read* configuration files.

this has several benefits:

1. the user doesn't change the code *in any way* to run a particular version of the process
2. the configuration files can be version controlled separately
3. the configuration files can stand in as a structured representation of the particular modelling process

on the configuration points above: when you start to have several feature selection options, several modelling approaches, several evaluation criteria, several test/train split methods.... you see how it might be nice to have a compact representation of those options. for example:

```python
# file: my_fav_params.conf
myparams = {
    'featureselectors': ['boruta', 'lasso'],
    'bootstrap': True,
    'models': ['neuralnet', 'lasso', 'adaboost'],
}

clientparams = {
    'featureselectors': ['identity'],
    'bootstrap': False,
    'models': ['linear_regression'],
}
```
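the same idea works at the shell level. the sketch below assumes a very simple configuration format (one `KEY=value` pair per line) and made-up parameter names; the script sets defaults, then reads the file to override them, so nobody has to edit the script or answer a prompt:

```bash
workdir=$(mktemp -d)

# a hypothetical configuration file: one KEY=value pair per line
printf 'N_FOLDS=5\nMODEL=lasso\n' > "$workdir/my_fav_params.conf"

# built-in defaults, used for anything the configuration file does not set
N_FOLDS=3
MODEL=linear_regression
BOOTSTRAP=false

# read the configuration; the leading dot runs the file in the current shell
. "$workdir/my_fav_params.conf"

echo "model=$MODEL folds=$N_FOLDS bootstrap=$BOOTSTRAP"   # model=lasso folds=5 bootstrap=false
```

one caveat: reading a file this way *executes* it as shell code, so it is only appropriate for configuration files you trust.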

##### make every program a filter

most linux programs are designed to take in lines of text, perform some calculation, and spit out lines of text. in modern data engineering parlance, linux programs are ETL processes where the extract and load steps are both normalizing results to sequences of text lines. these sequences of text lines are pulled from or put into buffer objects called `stdin` and `stdout` (respectively). if errors occur, they are put into a separate buffer called `stderr`

the output of any command (`stdout`) is printed to the terminal by default. You may wish to save it to file instead -- you can change this output by routing it to a file with the `>` or `>>` characters. There are actually several options, and [this stack overflow answer](https://askubuntu.com/a/731237) lays them out well.
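
a short sketch of the most common redirections, using scratch files so nothing important gets overwritten (the file names are invented for the example):

```bash
workdir=$(mktemp -d)
log="$workdir/out.txt"

echo "first line" > "$log"      # > creates the file (or overwrites it)
echo "second line" >> "$log"    # >> appends to whatever is already there
cat "$log"                      # two lines

echo "replacement" > "$log"     # > again: the old contents are gone
cat "$log"                      # one line

# stderr is a separate stream; 2> redirects it on its own
ls /path/that/does/not/exist 2> "$workdir/errors.txt" || true
cat "$workdir/errors.txt"
```

the `|| true` is only there so the deliberately failing `ls` does not stop a script that is set to exit on errors.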

one of the advantages of this ETL approach is the results of one program can be passed directly to another -- this is called "piping" in linux world, and the character which does the passing around of lines is "`|`" (shift + backslash on most keyboards). This character is often called "pipe" because of this.

take, for example, the following command

```bash
who | awk '{print $1}' | sort | uniq
```

In [None]:
%%bash
who | awk '{print $1}' | sort | uniq

what just happened?

In [None]:
%%bash
whatis who
whatis awk
whatis sort
whatis uniq

In [None]:
%%bash
who

In [None]:
%%bash
who | awk '{print $1}'

In [None]:
%%bash
who | awk '{print $1}' | sort

In [None]:
%%bash
who | awk '{print $1}' | sort | uniq

## common commands and tools

finally -- let's get into the real crash course: the commands!

### help and information

there are a couple of built-in help and information facilities for every linux command. they mostly require knowing the name of a command first (an unavoidable problem, I think). I'll use `ls` as my test command, but the following should work for most linux commands.

##### `man`

the `man` command (short for "manual") is one of the least used but most useful linux commands. I expect that this is because of the way that the manual entry is opened -- let's talk about that in a second.

`man` will open up a somewhat-standard-formatted manual document with the following information about any command

1. name
2. synopsis
    1. a standard-format description of how to invoke the command
    2. example: `ls [OPTION]... [FILE]...`
    3. anything in `[]` characters is *optional*
3. description
    1. begins with a short paragraph explaining what the program does
4. parameterization
    1. lists all the flags (ways of passing in parameters)
    2. may list configuration files
    3. may list meaningful environment variables
5. trailing information
    + a collection of items such as author, copyright, licensing, and further information

let's look at an example:

```bash
man python
```

*note: exit this viewer by pressing the `q` key*

In [None]:
%%bash
man python

### `-h` and `--help`

#### an aside: command flags

As I mentioned above, one of the primary ways of parameterizing linux programs is through a concept called a "flag". a flag is a string starting with a `-`. Generally, there are two types of flags:

1. one dash followed by one character (`-h`)
2. two dashes followed by a spaceless word (`--help`)

usually they come in pairs where one is the full word and the other is the leading character.

sometimes just the existence of a flag signals that some action should be taken. other times it is expected that a value will be provided after the flag as well, either following an `=` sign (`-f=myfval`) or separated by a space (`-f myfval`) -- the implementation here depends on the particular program.

in practice, there are plenty of little nuances and tricks.

For example, often times the single-letter flags can be put together. Supposing `-a`, `-b`, and `-c` were all valid flags, some programs will allow you to save time by writing `-abc` instead of `-a -b -c`. 

Convenient, but confusing.

also, java makes everything different, because it's java. Java command line parameters are passed in as *single* dashes with *full word* strings.

*generally speaking*, if you pass every flag as a separate string you won't go wrong. I prefer the full words for clarity's sake. Treat everything beyond that as an optimization -- pick it up when you need it, not now.
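
to make the conventions above concrete, here is a rough sketch of how a shell script can accept single-letter flags using the shell builtin `getopts`. the function name `parse_flags` and the flags `-v` and `-n` are invented for this example, and `getopts` only understands the one-dash, one-character style:

```bash
# hypothetical parser: -v is an on/off flag, -n expects a value after it
parse_flags() {
    OPTIND=1            # reset so the function can be called more than once
    verbose=0
    name=""
    while getopts "vn:" opt; do
        case "$opt" in
            v) verbose=1 ;;
            n) name="$OPTARG" ;;
            *) return 1 ;;
        esac
    done
    echo "verbose=$verbose name=$name"
}

parse_flags -v -n zach   # every flag passed separately
parse_flags -vn zach     # the same flags, bundled together
```

both calls should print the same line, which is the "save time by bundling" behavior described above.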

### `-h` and `--help`

back to the information commands: many commands implement a special flag: `-h` and `--help` which will print out a short description of the process and all the allowable command line flags:

```bash
python -h
```

In [None]:
%%bash
python -h

##### `which`

the `which` command will tell you the full path of the program which will be executed by a single command, or will return that no such command exists. This becomes particularly useful when you have more than one installation of a program and confusion about which is being used (*cf.* python distributions).

```bash
which python
```

In [None]:
%%bash
which python

##### `whatis`

the `whatis` command provides a very short description of any command.

```bash
whatis python
```

In [None]:
%%bash
whatis python

### navigation and file movement

for most users, not having access to a point-and-click file explorer is the first major hurdle when getting used to linux. there are just a few commands needed to move around the file system and manipulate files, and learning them is critical

##### `pwd`

the `pwd` command (print working directory) will print to the screen the "working" directory -- the directory your session is "in". All of the commands you may wish to execute, and all relative paths that you might write, will be judged *relative* to this path

In [None]:
%%bash
pwd

##### `ls`

we've already used this command many times, so you likely already know: to list the files in your current working directory you use the `ls` command.

the default behavior of `ls` is to print only filenames in a tabular format:

```bash
ls ~
```

In [None]:
%%bash
ls ~

##### `ls` (cont.)

the default behavior is useful in some instances (especially in scripting), but usually you want more information than just file names. Because of this, I usually invoke `ls` with at least three flags:

1. `-a`: print all files, including hidden files
2. `-l`: print the files in a detailed list format, not just names in a table
3. `-h`: print file sizes in a human-readable format

it is fairly typical for people to create the alias `ll='ls -alh'`; in fact, ubuntu's default `.bashrc` already defines a very similar `ll` alias (`ls -alF`).

```bash
ll ~
```

In [None]:
%%bash
ls -alh ~

##### `cd`

`cd` (short for *c*hange *d*irectory) will do just that -- change the directory you are in. You can pass in relative or absolute paths.

passing no argument will change your working directory to your home directory.

In [None]:
%%bash
cd /tmp
echo "after 'cd /tmp'"
pwd

cd
echo "after 'cd'"
pwd

##### `touch`

this command will "touch" a file, which does two things:

1. creates it if it doesn't exist
2. updates the "last updated" timestamp of the file to be right now

this command is mostly useful for creating empty files just to have a filename (when starting a git repo with an empty `README` file, or when you're writing a large tutorial of linux commands, for example).

```bash
touch ~/testfile
ll ~/testfile
```

In [None]:
%%bash
touch ~/testfile
ls -alh ~/testfile

##### `cp`

copy a file from one path to another using `cp`

```bash
cp [file that exists] [file I want to exist]
```

In [None]:
%%bash
cp ~/testfile ~/testfile.bak
ls -alh ~/testfile*

##### `rm`

remove a file with `rm`

```bash
rm ~/testfile.bak
```

In [None]:
%%bash
ls -alh ~/testfile*
rm ~/testfile.bak
echo
ls -alh ~/testfile*

##### `mv`

move (or rename) a file with `mv`. on a single filesystem this is just a rename; across filesystems it behaves like a copy followed by a remove.

```bash
mv [current file name] [new file name]
```

In [None]:
%%bash
ls -alh ~/testfile*
mv ~/testfile ~/testfile.newname
echo
ls -alh ~/testfile*

##### `ln`

create a "link" (a shortcut) to a file or directory. Generally, you want to create "symbolic" links (flag `-s`).

```bash
ln -s [thing the shortcut points at] [shortcut name]
```

In [None]:
%%bash
ln -s /tmp ~/tmpshortcut
ls -alh ~/tm*

##### `mkdir`

make a new directory with `mkdir`. note: the `-p` (parents) flag will create all of the pieces of the path if they haven't been created before (good for creating folders several levels deep), and also will not throw an error if the directories already exist.

```bash
mkdir -p ~/code
```

In [None]:
%%bash
mkdir -p ~/code
ls -alh ~

### editing and viewing files

no matter what your preference is for editor on your laptop, there will be times where you *must* use terminal editors -- and you should! `vim` and `emacs` are two of the most full-featured and best supported text editors ever made!

There are about twelveteen million articles on which text editor is best. In particular, there is a [nerd culture war](https://en.wikipedia.org/wiki/Editor_war) between `vim` and `emacs` as best editor. I was raised in an `emacs` home. There are, presumably, people who like `nano`. there are pros and cons to all of them.

[Just pick one and learn it!](https://xkcd.com/378/)

##### `vi` and `vim`

this is the grand-mommy of 'em all. `vi` (for "visual") or `vim` (vi improved) is one of the first ever terminal-based editors.

The first thing to know about `vi` is that it is highly optimized for *performing editing actions* and not necessarily for *text entry*. The idea is that users will wish to do things like cut, paste, delete characters or words, or move around documents just as often as they want to actually type out characters. Normally, the action of moving up several paragraphs, copying and pasting a word, and moving back to the bottom may take considerably longer than typing. This time debt is optimized away with an extensive list of single-character shortcuts.

as a result, there are two *modes* within the `vi` editor, and you need to toggle between them to do different things:

1. "normal" mode, where keystrokes are commands that *do* things, and
2. "insert" mode, where keystrokes are literally printed to the document

when you enter `vi` you are in the *normal* mode, so you can't just start typing without accidentally executing a million strange commands.

to move from normal mode to insert mode, you need to type the `i` key. To move from the insert mode to the normal mode, you press the `ESC` key.

most importantly, when you've landed in vi and just want to leave, [ask stack overflow how to exit `vi`](https://stackoverflow.blog/2017/05/23/stack-overflow-helping-one-million-developers-exit-vim/) and they will tell you:

1. be in normal mode (so hammer `ESC` for a bit)
2. press `:`
3. then if you want to
    1. save changes (write to file) and exit: `wq`
    2. just quit: `q`
    3. quit and discard changes: `q!`

<div align="center">**open and exit vim**</div>

##### `emacs`

`emacs` (short for *E*diting *MAC*ro*S*) is the primary competitor to `vim`. the best way to describe `emacs` is via a popular backhanded compliment:

> `emacs` is a great operating system, lacking only a decent editor

`emacs` itself is actually a shell for a particular programming language (`lisp`), and as such it can do arbitrarily complicated things. fortunately, there are armies of dedicated hobbyists to make these awful monstrosities:

1. [open your ipython notebooks in emacs](https://github.com/millejoh/emacs-ipython-notebook)
2. [use emacs to browse the web](https://www.emacswiki.org/emacs/eww)
3. [play chess](https://github.com/jwiegley/emacs-chess)
4. [put on a holiday fireplace](https://github.com/johanvts/emacs-fireplace/)
5. [nyan cat mode](https://github.com/wasamasa/zone-nyan)
6. [make sounds like a typewriter](https://github.com/rbanffy/selectric-mode)

who even are these people? bless them.

Of course, for every crazy, silly `emacs` package there are hundreds of useful ones. 

the process of editing becomes *much* more reliant on chains of keyboard shortcuts and modifier keys. are you aware that in addition to `alt` and `ctrl` there is a `hyper` key? just because no keyboard has it doesn't mean it doesn't exist -- you just have to *BELIEVE*!

you will also (silently) activate context-dependent modes for different file types, which opens up different sets of commands that are specific to that context.

generally speaking, though, you probably only *really* need to know a few things:

1. open a file by pressing `ctrl + x` and then `ctrl + f`
2. save changes to file by pressing `ctrl + x` and then `ctrl + s` (`ctrl + x` and then `ctrl + w` is "save as")
3. exit by pressing `ctrl + x` and then `ctrl + c`

<div align="center">**open and exit emacs**</div>

that's right -- it's usually not installed by default, because it's so much larger than `vim` or `nano`

##### `nano`

`nano` is a great starter option for editors. It uses a small handful of `ctrl`- and `alt`-modified key sequences (like emacs), but is much more like a standard press-arrow-keys-and-type editor. Plus, the simple commands are listed at the bottom of the window at all times.

+ the `^` character stands for the `ctrl` modifier, which is the `ctrl` key
+ the `M` character stands for the `meta` modifier, which is the `alt` key or `esc` key (and often both)

<div align="center">**open and exit nano**</div>

##### `less`

the `less` command (a play on an older command called `more`, which displayed "more" of a file) is a file *viewer*, not an editor. there are [a couple of useful navigation commands](https://en.wikipedia.org/wiki/Less_(Unix)#Frequently_used_commands) you can use within the `less` program (mostly stolen from `vi`). The ones I end up using every time:

+ quit: `q`
+ search for text: `/`
    + once matches have been found, cycle *forward* with: `n`
    + cycle *backward* with: `N`
+ scroll up half a page: `u`
+ scroll down half a page: `d`

```bash
less ~/.bashrc
```

note: this doesn't require you to load the whole file, which may make it better for viewing large files than many other options (see `head` below, too)

##### `cat`

`cat` (for concatenate) *prints* files to `stdout` (and thus the terminal). Because it prints the *entire* file, this will be problematic for long files. Personally, I never use `cat` -- I always use `less`. that's a matter of preference, though. `cat` often is useful in shell scripting

```bash
cat ~/.bashrc
```

In [None]:
%%bash
cat ~/.bashrc

##### `echo`

`echo` will simply take whatever follows it on the line and print it to `stdout` (and therefore the terminal). This is *mostly* useful for resolving environment variables (as discussed above).

```bash
echo "user = $USER"
```

In [None]:
%%bash
echo "user = $USER"

##### `head`

the `head` command will display the first few lines (by default, 10) of a file (similar to head in the `R` or `python` dataframe contexts -- I wonder where they got the name...)

the flag `-n` modifies the number of records printed.

```bash
head -n 20 ~/.bashrc
```

In [None]:
%%bash
head -n 20 ~/.bashrc

##### `tail`

as the name implies, `tail` is the other half of `head` -- it prints the last few lines (default, 10) of a file.

again, the flag `-n` modifies the number of records printed. 

`tail` also has the flag `-f` for *following* a file. this means that the last `N` rows are printed, but we stay in the viewer process and update live as new lines are written. this can be perfect for watching logs. press `ctrl + c` to quit the "following" process.

```bash
tail -n 20 ~/.bashrc
```

In [None]:
%%bash
tail -n 20 ~/.bashrc
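
`head` and `tail` also combine nicely through a pipe. as a small sketch (the helper name `line_range` is made up), this pulls an arbitrary range of lines out of the middle of a file:

```bash
# hypothetical helper: print lines START through END of FILE
line_range() {
    # $1 = file, $2 = first line wanted, $3 = last line wanted
    head -n "$3" "$1" | tail -n "$(( $3 - $2 + 1 ))"
}

# build a 100-line sample file (one number per line) to try it on
sample="$(mktemp)"
seq 1 100 > "$sample"

line_range "$sample" 16 20   # prints the five lines 16 through 20
```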

### filesystem info

you may have picked up on this by now, but the linux world is a bit more focused on files and the filesystem. Mac and Windows OS's abstract away a bit of this complexity from users, but linux does not.

There are some commonly occurring commands which deal directly with the file system that you should know about.

##### `file`

this command can be used to figure out roughly what *type* of file a given file is. Under the hood, it is performing hundreds of different checks (mostly looking for known "magic" byte patterns near the start of the file) to see if there are any common sequences, and making a best guess -- so this is by no means the final story.

```bash
file ~/.bashrc
```

In [None]:
%%bash
file ~/.bashrc

##### `df`

the `df` command (for "disk free") lists the free (and used) space on all available mounted file systems (separate partitions, separate drives, system-use drives). my most common use case is just to see how much free space is available anywhere on the machine:

```bash
df -h
```

but sometimes you only care about the filesystem you are currently working in (usually `/`, but not always):

```bash
df -h .
```

In [None]:
%%bash
df -h

##### `du`

the `du` command (short for "Disk Usage") lists out the disk space used under a provided directory: one total per sub-directory by default, and one line per file as well with the `-a` flag. By default, it will look at the current working directory.

let's use this command as a means of exploring something we talked about above -- help flags.

start by doing the simplest thing: execute `du` from your home directory.

```bash
du -ah ~
```

*note: the `a` and `h` flags are doing the same things here as they did for the `ls` command: including hidden files and printing file sizes with human readable units.*

In [None]:
%%bash
#du -ah ~

this likely produced a small number of files and the sizes of each. note that it is ordered in a nested way such that every sub-directory is immediately followed by the directory it is in, and the last record is the top level (and has a size that is the sum of all its children).

let's try that same command on a much bigger directory and see what happens:

```bash
du -ah /etc
```

what do you *think* will happen?

In [None]:
%%bash
du -ah /etc

with so many items printed out, it would be nice if we could limit them -- especially the items that are several levels deep in the tree. let's see if that's possible:

```bash
du --help
```

In [None]:
%%bash
du --help

Let's try out `--summarize`

```bash
du -h --summarize /etc
```

In [None]:
%%bash
du -h --summarize /etc

that's pretty nice. What if I still wanted to know at least the size of the items in the directory, and the size of each sub-directory, but not further? there we could use the `-d` / `--max-depth` flag to specify we are interested in a maximum depth of 1:

```bash
du -h --max-depth 1 /etc
```

In [None]:
%%bash
du -h --max-depth 1 /etc

so why spend time talking about this command? Well, if you've never written a program which accidentally produced a dataset that was too large for your file system, you should really give it a try. And when you do, this set of commands will be fairly invaluable in determining which files are the real offenders and getting rid of them asap.
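
as a rough sketch of that "find the offenders" workflow, `du` can be piped into `sort` and `head` to rank items by size. the helper name `biggest_items` and the scratch files are made up for the demo:

```bash
# hypothetical helper: list the N largest items under DIR, biggest first
biggest_items() {
    # $1 = directory, $2 = how many rows to show
    # -a includes files, -k reports sizes in kilobytes so they sort numerically
    du -ak "$1" | sort -rn | head -n "$2"
}

# make a scratch directory with one big file and one small file
scratch="$(mktemp -d)"
head -c 2097152 /dev/urandom > "$scratch/big.bin"   # about 2 MB of random bytes
echo "tiny" > "$scratch/small.txt"

biggest_items "$scratch" 3
```

on a real machine you would point this at the directory that is filling up (for example your home directory) rather than a scratch folder.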

##### `tree`

one last file command, and this one is the best of the bunch -- you simply cannot live without it. It's called `tree` and it prints out the directory contents (*a la* `ls`), but in a graphical "tree"-like way such that the relationship between directories, sub-directories and files is visually obvious. It's a life-saver.

it will expand to all depths by default (like `du` above), so let's limit it to only 1 level deep first using the `-L` flag. we'll also want to see all files, so `-a` should be included as well.

```bash
tree -a -L 1 ~
```

In [None]:
%%bash
tree -a -L 1 ~

wat

```
The program 'tree' is currently not installed. You can install it by typing:
sudo apt install tree
```

### packages

We simply *have* to have `tree` installed, but before we go fire off that command in the error message, let's talk for a second about packages.

packages are the linux world standard for installable software. they are basically just compressed (think "zipped", but technically don't think that) directories of all of the files needed to run an application, plus some other files needed to create or install that software and get it to "play nice" with the rest of your operating system.

unlike MSIs or DMGs in the windows and mac world, these are not programs which install software, but rather sets of files that *one* unified program can use to install software.

for you as a user, you will generally

1. want to install some software
2. know the name of that software
3. ...?

this is where *package managers* come in. a package manager is a program which will

1. find packages of applications (by name, typically)
2. resolve the *dependencies* of that package (look up all of the *other* software that you might need to download in order to have that software work)
3. download the package files and the dependency package files
4. perform any of the installation instructions or configurations in those package files
5. make sure that all parts of the system that "need to know" about new packages are informed

there are a handful of different package managers in linux world, but they are usually one-to-one with distributions:

1. `apt` (for "Advanced Package Tool")
    1. the primary package manager in modern debian (including Ubuntu) distributions
2. `apt-get`
    1. is part of the same project as `apt`; it is the older, lower-level interface (still common in scripts), while `apt` is the newer, friendlier front end for interactive use
3. `yum` (for "Yellowdog Updater Modified")
    1. wrapper around `rpm` (below)
    2. long-time primary package manager for Red Hat (RHEL), CentOS, and Fedora (newer releases use its successor, `dnf`)
4. `rpm` (for "Red Hat Package Manager", now officially "RPM Package Manager")
    1. basic package manager for Red Hat
5. downloading files and installing them yourself (usually via `make` and `make install`)
    1. this is possible and obviously a bit more advanced, but sometimes it is useful to be able to install what *you* want instead of what the package maintainer will allow you to install (which can lag behind development by years at times)

enough yakking, let's install `tree` already:

```bash
sudo apt install tree
```

one way we can check that tree is installed is just to run the command again:

```bash
tree -a -L 1 ~
```
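
another way to check, which is handy inside scripts, is to ask the shell whether it can find the command at all. a minimal sketch using the shell builtin `command -v` (the helper name `is_installed` is made up):

```bash
# hypothetical helper: succeed (exit status 0) if COMMAND is on the PATH
is_installed() {
    command -v "$1" > /dev/null 2>&1
}

for cmd in ls tree; do
    if is_installed "$cmd"; then
        echo "$cmd is installed at $(command -v "$cmd")"
    else
        echo "$cmd is not installed"
    fi
done
```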

### process information

you may have experience with the windows "task manager" or the mac "activity monitor", and if so you know how helpful they can be. there are gui versions of those same utilities in linux world, but they are less standard and often require a bit more configuration or user knowledge.

the tools that *are* standard tend to be a little more low-level and also a bit more single-use / niche (in following with the DOTADIW philosophy).

knowing these commands is often essential to doing any sort of debugging of system performance. that being said, you will use these much less as a linux *user* than as an admin.

##### `top` and `htop`

both `top` (Table of Processes) and `htop` (Hisham (author's name) TOP) are programs used to list out all the currently running processes (think of the "processes" tab on task or activity monitor).

both open in an interactive terminal window and can be exited (like `less`) with the `q` key.

try both!

```bash
top
```

note: you may well have to install htop with `sudo apt install htop`

```bash
htop
```

as far as I can tell, the one and only reason to use `top` instead of `htop` is that `htop` isn't installed and you can't install it

##### `ps`

short for "Process Status", this command prints out some summary information about all running processes. unlike `top` and `htop`, this is a snapshot program, so you do not see updates.

the default behavior is to print out only the processes started by the current user (*e.g.* `ubuntu`) in the current terminal, along with the following values:

1. PID: the process id, an integer which uniquely identifies that process among all running processes on the system
2. TTY: a value identifying the terminal in which that command is running (may be none for graphical or background processes)
3. TIME: the cumulative CPU time the process has used so far (not the wall-clock time at which it started)
4. CMD: the actual executed command

```bash
ps
```

In [None]:
%%bash
ps

it is fairly common to modify this command with the `aux` options (BSD-style options, which are written without a leading dash), which will

1. `a`: list commands run by all users
2. `u`: add a column listing the user
3. `x`: include processes that weren't started in a terminal

```bash
ps aux
```

In [None]:
%%bash
ps aux

##### `kill` and `pkill`

sometimes you have a long-running process that you realize was a mistake, or that has become unresponsive (loosely called a "zombie" here; strictly speaking, a zombie is a process that has already finished but has not yet been cleaned up by its parent). it is nice to be able to kill these processes, but often difficult -- especially when there is no gui with an X button.

the command to kill a process is, appropriately, `kill`, and it takes that unique process identifier we just saw via `ps`.

```bash
kill [the PID goes here]
```

the common workflow is to run `ps`, look up the PID (this is easier done via `ps` than `[h]top` because `ps` is a static snapshot), and run kill.

under the hood, the `kill` command is sending a *signal* to the running process. there are several different signals that all effectively mean "stop this process", each identified by a name and a number (the number is just an identifier, not a priority). the only two you really need to know are

1. `SIGTERM`, signal 15
    1. default
    2. this requests the process be "terminated"
    3. graceful: will try and do useful cleanup before quitting (if the process supports such a thing)
    4. not guaranteed to work: the process can catch or ignore it
2. `SIGKILL`, signal 9
    1. the "just do it" option
    2. not graceful: kills the process immediately and without cleanup

an alternative to `kill` is to run `pkill`, which takes a *name* instead of a PID and will kill any process running where that name is in the CMD of that process.

this can be dangerous: for example, it is very possible that you might have several `python` scripts running, and only one becomes a zombie. In that case, you will not want to `pkill python`, and will *have* to look up the correct process id and use plain `kill` to stop it.

note: you can also kill processes from within `htop` by selecting the process with the keyboard, pressing `k`, and selecting the signal to send (basically: `15 SIGTERM` or `9 SIGKILL`)
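
here is a small sketch of the whole start / look up PID / kill cycle, using a harmless `sleep` as a stand-in for a runaway process:

```bash
sleep 300 &              # a stand-in for a long-running process
pid=$!                   # '$!' holds the PID of the most recent background job

kill -TERM "$pid"        # same as plain 'kill'; 'kill -KILL' (or 'kill -9') is the last resort

status=0
wait "$pid" || status=$? # collect the exit status of the terminated process
echo "process $pid exited with status $status"
```

a process ended by a signal reports an exit status of 128 plus the signal number, so the `SIGTERM` here should show up as 143.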

##### `shutdown` and `reboot`

you know what these commands will do by the name: they will `shutdown` or `reboot` the computer.

technically, `reboot` is a special instance of `shutdown`: `shutdown -r`.

you can test these if you want. I won't ;)

### utilities

the following commands are a grab bag of utilities I use regularly for various purposes. your mileage may vary

##### `history`

this prints out all of the commands you have recently executed.

```bash
history
```

In [None]:
%%bash
history 20

note: if you press `ctrl + r` you will be able to type and search backward through your history (a "reverse search") for previously entered commands that contain what you typed.

try it out -- with an empty terminal line, enter `ctrl + r` and then type `apt ` and see what happens.

##### `date`

this command is among the most useful of all linux commands. in addition to the default behavior (just printing the date and time to the terminal), it is possible to print out a large number of strings representing different formats and arrangements of time values as desired. 

Let's try out two examples:

```bash
# print out the Y, M, D, and then H, M, and S as a timestamp with 
# periods separating the date from the time characters
date +%Y%m%d.%H%M%S
```

In [None]:
%%bash
date +%Y%m%d.%H%M%S

and now the same timestamp, but last Friday and then 20 days ago

```bash
date --date="last Friday" +%Y%m%d.%H%M%S
date --date="20 days ago" +%Y%m%d.%H%M%S
```

In [None]:
%%bash
date --date="last Friday" +%Y%m%d.%H%M%S
date --date="20 days ago" +%Y%m%d.%H%M%S
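
where this really earns its keep is in naming output files. a minimal sketch (the helper name `stamped_name` and the `results` / `csv` arguments are invented for the example):

```bash
# hypothetical helper: build NAME.<timestamp>.EXT so runs made at
# different times don't overwrite each other's output
stamped_name() {
    # $1 = base name, $2 = extension
    echo "$1.$(date +%Y%m%d.%H%M%S).$2"
}

stamped_name results csv
```

because the timestamp is ordered year, month, day, then time, a plain alphabetical `ls` of such files is also a chronological one.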

##### `wc`

`wc` is short for "word count", and it does just what you expect -- counts words in strings or files. It has the ability to count bytes, characters, lines, and the maximum line length as well.

in all honesty, while it is *nice* to know the number of words in a file, I use this more often to count the number of *files* in a directory. If you print the files as a list (`ls -l`) and use `wc -l` to count the lines, you will have the number of files (plus one: the `total` line that `ls -l` prints first).

```bash
ls -l /etc | wc -l
```

In [None]:
%%bash
ls -l /etc | wc -l
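
a quick way to check exactly what is being counted is to try it on a scratch directory where you know the answer (the three file names here are made up):

```bash
scratch="$(mktemp -d)"
touch "$scratch/a.txt" "$scratch/b.txt" "$scratch/c.txt"

long_lines="$(ls -l "$scratch" | wc -l)"    # 3 files + the 'total' header line
plain_lines="$(ls "$scratch" | wc -l)"      # one name per line when piped: 3

echo "ls -l | wc -l  -> $long_lines"
echo "ls | wc -l     -> $plain_lines"
```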

##### `dirname` and `basename`

every path in linux can be described as a full directory name (the list of all directories between the file and root) and the filename. If you allow directories themselves to count as names (generalizing "filename" to "basename"), every path can be split in two: a "directory name" and a "base name". The commands `dirname` and `basename` will extract those two components from any path

```bash
dirname /this/is/a/test/path/to/file.txt
basename /this/is/a/test/path/to/file.txt
```

*note: this is parsing strings based on path name rules, not looking at actual paths on the actual file system*

In [None]:
%%bash
dirname /this/is/a/test/path/to/file.txt
basename /this/is/a/test/path/to/file.txt
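
in scripts these two usually show up together, often with the optional second argument to `basename` that strips a suffix. a small sketch using the same made-up path:

```bash
path="/this/is/a/test/path/to/file.txt"

dir="$(dirname "$path")"            # /this/is/a/test/path/to
file="$(basename "$path")"          # file.txt
stem="$(basename "$path" .txt)"     # file  (a second argument strips that suffix)

# a common scripting use: name a derived file that sits next to the original
echo "$dir/$stem.bak"
```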

##### `grep`

as I've said about a couple different linux features and utilities now, you could take an entire class on using `grep` (short for "Globally search a Regular Expression and Print" -- catchy). 

the purpose of `grep` is to provide a fast and flexible way of performing generalized text searches (*i.e.* regular expression searches). Often we may want to find all of the files in which we referenced a certain variable (*e.g.*, we want to change the variable `LogisticRegression` to `NeuralNet`, because we're feeling *spicy*), or find all instances of a known typo in a single file. this is the primary use case for `grep`.

As an example, let's suppose we want to see all of the aliases we created in our `bash` profile. We could perform a case-insensitive search (`-i`) and print off all the line numbers (`-n`) for all of the files under (`-r`, aka recursive) our home directory:

```bash
grep -nir alias ~/
```

In [None]:
%%bash
grep -nir alias ~/
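
since the output above depends on what happens to be in your home directory, here is the same idea on a tiny made-up file where the matches are predictable:

```bash
# build a small sample file so the search results are predictable
sample="$(mktemp)"
printf 'export EDITOR=emacs\nalias ll="ls -alh"\n# my favorite ALIAS\necho done\n' > "$sample"

grep -n  alias "$sample"    # case-sensitive: only line 2 matches
grep -ni alias "$sample"    # case-insensitive: lines 2 and 3 match
grep -c -i alias "$sample"  # -c prints just the number of matching lines
```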

##### `diff`

this utility compares two files to find the differences (hence, `diff`). this is mostly pointless for unrelated files; it is used mostly to compare different iterations / versions of the same file. `git` (the preeminent version control system) shows the changes between previous and current versions of code files in a very similar format.

just to get an example of what diff output looks and feels like, let's do the following:

```bash
echo "hello world my name is zach" > ~/test.txt
echo "hello world my awesome name is zach" > ~/test.2.txt
diff ~/test.txt ~/test.2.txt
```

In [None]:
%%bash
echo "hello world my name is zach" > ~/test.txt
echo "hello world my awesome name is zach" > ~/test.2.txt
diff ~/test.txt ~/test.2.txt

the way that `diff` displays differences is to find lines which disagree or are additions / subtractions and to print out the two different versions. The lines as they appear in the "first" file (to the left in the command line statement) get a `<` character in front of them. they are separated from the "second" file (to the right in the command line statement) by a row of `-` characters, and the second / right file lines are led by a `>` character

##### `tar` and `unzip`

you may not often think about Tape ARchives (`tar`) as a thing that happens, but in a lot of fields it is actually a legal compliance obligation (e.g. pretty much the entire financial sector, much of the government). that being said, you don't actually have to create tape archives -- but you will *very* often use `tar` to bundle a collection of files into a single archive (usually compressed with `gzip`) before sending it to other folks, or to unpack archives sent to you.

in the windows and mac world, you are probably used to finding `.zip` files, and you are familiar with decompressing ("extracting") these archive files into several files and folders. you will soon also become familiar with `tar` and `tar.gz` file extensions -- these are archives in the linux world, and the command you use to compress or decompress files is `tar`.

because of the ubiquity of `zip` files, the `unzip` command is also available on most linux distributions for decompressing `zip` archives.
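
a minimal round trip with `tar`, run entirely in scratch directories (the file and archive names are made up for the demo):

```bash
src="$(mktemp -d)"      # directory to archive
dest="$(mktemp -d)"     # directory to extract into
echo "hello" > "$src/notes.txt"

# c = create, z = compress with gzip, f = archive file name
# -C changes into the directory first so the archive holds relative paths
tar -czf "$dest/backup.tar.gz" -C "$src" notes.txt

tar -tzf "$dest/backup.tar.gz"               # t = list the contents without extracting
tar -xzf "$dest/backup.tar.gz" -C "$dest"    # x = extract
cat "$dest/notes.txt"
```

listing with `-t` before extracting is a good habit: it tells you whether the archive will unpack into one tidy folder or spray files into your current directory.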

##### `locate` and `find`

both `locate` and `find` can be used to find files by file names, but `locate` is much faster and simpler for general use because it searches a prebuilt index of file names rather than walking the disk (so very new files may not show up until that index is refreshed). `find` is more useful for collecting names which follow certain patterns, and that in turn is useful for scripting purposes.

until you find yourself reading about `find` on stack overflow posts, you should default to always using `locate`.

```bash
locate .bashrc
```

In [None]:
%%bash
locate .bashrc
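
for when you do end up needing `find`, this is roughly what the pattern-matching use looks like. the directory layout and file names below are invented for the demo:

```bash
scratch="$(mktemp -d)"
mkdir -p "$scratch/raw/2017" "$scratch/clean"
touch "$scratch/raw/2017/a.csv" "$scratch/clean/b.csv" "$scratch/clean/readme.txt"

# every regular file under $scratch whose name ends in .csv, at any depth
find "$scratch" -type f -name '*.csv' | sort
```

the quotes around `'*.csv'` matter: they stop the shell from expanding the pattern itself before `find` ever sees it.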

##### `sort` and `uniq`

these two commands will take in lists of strings and sort them (`sort`, obviously) and remove duplicates / reduce them to a list of unique values (`uniq`). note that `uniq` only collapses *adjacent* duplicate lines, which is why it is almost always run on the output of `sort`. As with `find`, these are usually more useful for scripting purposes, though `sort` finds some general application.
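
a short sketch of the usual pairing, on a made-up list of user names. the first command shows why the `sort` step matters:

```bash
# hypothetical list of logged-in user names, with repeats
users='zach
ubuntu
zach
root
ubuntu
zach'

echo "$users" | uniq | wc -l                   # 6: no two *adjacent* lines are equal
echo "$users" | sort | uniq                    # 3 unique names, alphabetical
echo "$users" | sort | uniq -c | sort -rn      # counts per name, most frequent first
```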

##### `xargs`

this command is a bit advanced, but let's quickly discuss *what* it does. As we mentioned in the philosophy section, each linux command is meant to act as a filter on an input stream. for historic reasons this was not always implemented in the same way: sometimes commands were built to take in *only* items from the command line (instead of `stdin`, so pipes don't work), or have a capped number of allowed arguments.

`xargs` was created to solve many of these problems, and to help make sure that even "broken" or "old" processes will follow the linux philosophy.

basically, `xargs` will take a list of items from standard input and build as many executions of a single command as are needed to cover all the items in that list, passing the items to that command as command line arguments.

as a side effect, the ability to chunk up list items into smaller groups allows for multi-threading and parallelization via `xargs`. this is a dirty hack, but [a pretty common one](https://github.com/RZachLamberty/zshell/blob/master/hydra-curl-nofrills.sh)
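here is a minimal sketch of how `xargs` turns a list into command executions, using `echo` as a stand-in for a real command and four made-up items. the `-P` flag is available in the GNU and BSD versions of `xargs`; I am assuming one of those is what you have installed.

```bash
# four made-up items, one per line
items=$(printf 'a\nb\nc\nd\n')

# default: xargs packs as many items as it can into one execution
# this runs: echo a b c d
echo "$items" | xargs echo

# -n 2: at most two items per execution
# this runs: echo a b, then echo c d
echo "$items" | xargs -n 2 echo

# -I {}: one execution per item, with the item substituted wherever {} appears
echo "$items" | xargs -I {} echo "item: {}"

# -P 2: run up to two executions at the same time (output order is not guaranteed)
echo "$items" | xargs -n 1 -P 2 echo
```

the last form is the "dirty hack" for parallelization: swap `echo` for a slow command and `xargs` will keep two copies of it running until the list is exhausted.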

##### `screen`

one thing that is probably not obvious yet: what if the command you want to run takes a *long* time and you just kicked it off? 

what does your terminal look like now?

Can you do anything?

what happens if you close the terminal window?

as it so happens, if you start a process in a terminal window, and then close that window, the process will generally be killed. this is obviously less than desirable. you could leave terminals open until your processes finish, but what about *permanent* processes?

`screen` is a way of creating "pseudo-terminals" that you can access as you wish, but that you do not need to remain logged into.

I pretty much exclusively use `screen` to run my long-running processes (like web apps or persistent API calls / web scrapers). We may get to use of screen in the future, but for now I will just mention it so you know about this super useful package

##### `cron` and `crontab`

`cron` is named after the greek word for time, $\chi\rho\omicron\nu\omicron\varsigma$ (*chronos*), and is the *de facto* method for scheduling executions of processes and jobs. many routine operating system tasks are scheduled using this utility, so you absolutely should use it for your scheduling unless it is impossible.

the basic frequency in `cron` is 1 minute -- in my opinion, one of the only reasons to move beyond `cron` as a scheduler is to get sub-minute frequency (some people like things like web apps and dashboards, but I'm pretty happy with log files!).

basically, a `cron` entry is an execution time pattern and a set of `bash` commands that you want to execute. To avoid writing it for the 18 billionth time, I'll refer to [the corresponding section on wikipedia](https://en.wikipedia.org/wiki/Cron#Overview).

to add your own entries to `cron` you will use the `crontab` utility with the `-e` (edit) flag. this will open an editor and all you will need to do is add a line with the time pattern and the command.
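to make the shape of an entry concrete, here is a small sketch that assembles one and prints it. the schedule, the command, and the log file name are all hypothetical; the five time fields are minute, hour, day of month, month, and day of week, and the `*/15` step syntax is the form accepted by the common linux `cron` implementations.

```bash
# a crontab entry is five time fields followed by the command to run:
#   minute  hour  day-of-month  month  day-of-week  command
# this hypothetical schedule means "every 15 minutes"
schedule='*/15 * * * *'

# hypothetical command: append the current date to a log file in your home directory
# single quotes keep $HOME unexpanded here so cron's shell expands it at run time
job='date >> $HOME/cron_demo.log 2>&1'

# the full line you would paste into the editor opened by `crontab -e`
entry="$schedule $job"
echo "$entry"
```

after saving, `crontab -l` will list your current entries so you can confirm the line made it in.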

### Networking

given that we are working on remote machines, we will now often be interested in networked communication between computers. there are a couple of commands which feature fairly prominently in these interactions.

although this is explicitly a linux tutorial, many of these actions can be done with dedicated analogous programs in windows, and I will discuss those as well

##### `ssh`

you've already used this -- `ssh` is the basic command for all Secure SHell connections between computers.

1. linux, mac: the command `ssh` often comes pre-installed, but if not it can be installed as part of the `openssh` package
2. windows: the industry standard is [PuTTY](https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html)

##### `scp`

`scp` is short for Secure CoPy. basically, this is just an implementation of the copy command using the ssh connection protocol. It relies on all the same technology as `ssh`, including configuration files and private / public key exchange.

1. linux, mac: the `scp` command is a part of the `openssh` package, so if you have `ssh` installed you will have `scp` available.
2. windows: the industry standard is [WinSCP](https://winscp.net/eng/download.php)

##### `ftp`, `lftp`, and `sftp`

I mentioned it briefly in a previous lecture, but there are a few protocols (rules for constructing messages and sending them to remote services) that are explicitly dedicated to file transfer. They fall into two camps:

1. ftp: the first iteration, stands for File Transfer Protocol
2. sftp: the secure alternative, stands for SSH (or Secure) File Transfer Protocol; despite the name it is a separate protocol that runs over an `ssh` connection

in the linux and mac world, each of these protocols has a command of the same name that implements the command line interface (cli) for that protocol. `lftp` is a general purpose command that provides many useful features in addition to the basic `ftp` and `sftp` commands.

in terms of usage, these commands will effectively

1. create a connection using the corresponding protocol
2. create a new interactive session for executing ftp or sftp commands
    1. example of commands: `get`, `put`, `rename`
3. logging and error messages are all handled and displayed as needed

the growing use of S3 as a file storage and sharing utility means that *our* file storage will be done in an entirely different way. that being said, ftp and sftp are still ubiquitous. I have used an FTP or SFTP server on every project I have worked on.

let's do a quick demo of using one of these protocols (the simpler: ftp).

first, in your browser, open [the NOAA CLASS (comprehensive large array-data stewardship system) ftp site for satellite data distribution](ftp://ftp-npp.class.ngdc.noaa.gov/20170827/)

then, in your ec2 terminal try the following:

```bash
ftp ftp-npp.class.ngdc.noaa.gov
# just press enter for the user name and password
# or enter anything you want

# then enter help to see the type of commands available
# some should be familiar (ls, cd, pwd)
ftp> help
```

when you log in to an ftp server, you and all users are (by default) dropped off into a single root directory.

let's try to get our bearings by listing out the contents of this root directory in which we find ourselves. for silly reasons, you [need to turn on "passive" mode](https://serverfault.com/a/450655) before you can do anything useful.

```bash
ftp> passive
ftp> ls
ftp> cd 20170830
ftp> ls
```

1. linux, mac: `ftp`, `lftp`, and `sftp`
2. windows: [winscp](https://winscp.net/eng/index.php) (this is my recommendation for both ftp and sftp), also [filezilla](https://filezilla-project.org/)

##### `curl` and `wget`

these two tools are the primary command line tools for downloading materials over the HTTP and HTTPS protocols. They each have many, many features, and I have found myself using both for various purposes. I recommend installing and being open to using both (as opposed to learning one well, as I would recommend for editors).

let's just try the simplest thing we can -- download a single, simple test webpage.

```bash
# curl: download and print to the screen
curl https://www.york.ac.uk/teaching/cws/wws/webpage1.html

# curl: download and write to file
curl -o curl.html https://www.york.ac.uk/teaching/cws/wws/webpage1.html

# wget: default behavior is to write to file "webpage1.html" (basename of url)
#       let's write to a different file name
wget -O wget.html https://www.york.ac.uk/teaching/cws/wws/webpage1.html
```

1. linux, mac: `curl`, `wget`
2. windows: [curl](https://curl.haxx.se/download.html#Win64), [wget](http://gnuwin32.sourceforge.net/packages/wget.htm)

##### `ping`

often we simply want to know if a server exists and is responsive. the act of sending a single "packet" (a single piece of information) over the internet to ask if "anyone is there" is called "pinging", and `ping` is the command which does it.

let's check on google:

```bash
ping -c 5 www.google.com
```

in addition to the fact that we hear back on all 5 of our "pings", we have some additional info:

1. the ip address we received for "www.google.com"
2. the round trip time (about 8 ms)

1. linux, mac: `ping`
2. windows: `ping` is a built-in `cmd` and `powershell` command

##### `mtr`

sometimes we would like some more information about how our packets are travelling from our server to others (cf. complaining to comcast about internet latency). `ping` is nice for demonstrating that we can reach a server, but `mtr` is the standard command for debugging *how* we reached a server (called a "traceroute").

```bash
mtr www.google.com
```

in all honesty, I've never performed a similar operation in windows. there is a built-in command, though:

1. linux, mac: `mtr`
2. windows: `tracert` is a built-in `cmd` and `powershell` command

##### `hostname`

this one is pretty simple: print out the name of your system's host (this is the human-readable text that corresponds to your ip address).

```bash
hostname
```

##### `ifconfig`

this command is generally used to simply get your IP address. 

```bash
ifconfig
```

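the address you are looking for will appear in the `eth0` block (or the block for a similarly named network interface) after "`inet addr`" (newer versions of `ifconfig` print just "`inet`").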

on a windows machine, I would probably just google search "what is my ip" and let google figure it out for me.

<div align="center">have we become... ***TOO*** powerful?</div>
<img src="https://techviral.net/wp-content/uploads/2015/04/Why-Hackers-Use-Linux.jpg"></img>

# END OF LECTURE

next lecture: [environment management and python](004_python.ipynb)