---
title: Intro to Bash Shell
date: April 2021
author: Tim Dennis
---

UCLA Data Science Center - Intro to Shell


Before class: Instructor

set up shell:

  • exec bash - switches to bash from zsh
    • no longer needed on mac as zsh is default, so won't be as disorienting to learners
  • enlarge text size - via preferences
  • export PS1='$ ' - changes command prompt to $
  • export PROMPT_COMMAND="history 1 >> ~/Dropbox/UnixHistory.txt"
  • Turn off the text coloring in terminal (Terminal -> Preferences -> ANSI)
    • alternatively, use a different user account on your Mac so you don't pick up themed or souped-up CLIs
  • have students check software installation: Unix shell, Git for Windows; ask when students come in
  • Etherpad link: https://pad.carpentries.org/2021-ucla-spring-unix
  • Get data for workshop: http://swcarpentry.github.io/shell-novice/data/data-shell.zip
  • Unzip and put on desktop

Setup/Motivation/Why Use?

  • Most tasks done in the shell can also be done with a mouse on the desktop. Why do anything differently?
  • A way to combine powerful tools together using minimal keystrokes
  • Lets us automate repetitive tasks: moving & processing files/data, running our research analysis, building applications
  • Install software & other third party tools, configure software & tools
  • Often required to use with remote machines: high performance computing (HPC), cloud computing, web servers
  • Might be good for brief exercise in etherpad: jargon around command line, bash, etc.?

Introducing the Shell

Objectives: orient to shell and how it relates to the computer, understand the benefit of CLI

What computers do:

  1. run programs
  2. store data
  3. communicate with each other
  4. communicate with us → today you'll learn a new way of doing this

terms:

  • graphical user interface: GUI
  • command line interface: CLI

how it works -- the Read Evaluate Print Loop (REPL):

  1. you type something and the shell reads it - READ
  2. the shell evaluates/executes the command - EVAL
  3. it prints the output - PRINT
  4. it returns to the prompt and waits for your next command - LOOP
  • We use a command shell to make this happen: it is the interface between the user and the computer
  • bash: Bourne Again SHell, most commonly used, default on most modern implementations
  • zsh - a variant of bash, now the default on macOS; for us today bash and zsh are interchangeable
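  • one trip around the loop, as an illustration (any simple command works; date is assumed to be available, as it is on macOS, Linux, and Git Bash):
date
  • when you press Enter, the shell reads the line, finds and runs the date program, prints the current date and time, and then displays a new prompt - ready for the next command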

Scenario set up for lesson (skip this)

NOTE: I often skip this

  • Our friend Nelle has six months worth of survey data collected from the North Pacific
  • 300 samples of goo
  • Her pipeline:
  1. Determine the abundance of 300 proteins
  2. Each sample has one output file with one line for each protein
  3. calculate statistics for each protein separately using program goostat
  4. compare statistics for proteins using program called goodiff
  5. write up results and submit by end of month
  • If she enters all commands by hand, she will need to type them 45,150 times (300 goostat runs plus 300 × 299 / 2 = 44,850 pairwise goodiff comparisons)
  • What can she do instead? Use the command line

Benefits of the CLI:

  • Automate repetitive tasks (30 minutes vs. 2 weeks)
  • Prevent user error, manual error
  • Processing pipelines are re-usable and sharable
  • REPL (Read-Evaluate-Print-Loop) lets you interactively work things out

Files and Directories

Objectives: paths, learn basic commands for working with files and directories, learn syntax of commands, tab-completion

  • prompt: $ indicates computer is ready to accept commands
whoami

This command:

  1. finds program
  2. runs program
  3. displays program's output
  4. displays new prompt
  • let's see where we are in our file system
pwd
  • pwd stands for print working directory, in this case it is also the home directory
  • note that the home directory will look different on different OS's
  • To understand "home directory" let's look at an image

Directory structure

  • root directory: holds everything else, begins with slash /
  • directories below the root are organized in a tree-like structure
  • the slash / also acts as a separator between names in a path
  • ls listing, prints names of files and directories in current directory and prints in alphabetical order
ls
  • make the output more comprehensible by using the flag -F
ls -F
  • -F adds trailing / to names of directories (note: on Windows git bash there's syntax highlighting for directories )
  • spaces and capitalization in commands are important!
  • -F is an option, argument, or flag
  • ls has lots of other options. Let's find out what they are by:
ls --help
  • many bash commands and programs support a --help flag to display more information
  • for even more detailed information on how to use ls, type man ls (caveat for WINDOWS users)
  • man is for manual and prints the description of a command and options
  • Git for Windows doesn't come with the man files, instead do a web search for unix man page COMMAND
  • to navigate man files use the up and down arrows, or space bar and b for paging, to quit q
  • We can also ls to see contents of another directory:
ls -F Desktop
  • we ls -F the Desktop from our home directory and we see the data-shell/ folder we unzipped there earlier
  • let's look inside data-shell
ls -F Desktop/data-shell
  • Let's change directories into that folder
  • What do you think the command is for changing directories?
    • Yes, cd
cd Desktop
cd data-shell
cd data
  • see where we are:
pwd
ls -F
  • We can go down the directory structure, how do we go up?
  • We might try:
cd data-shell
  • data-shell is a level above our current location, but we can't go up a level this way - cd can only see directories below the one we're in
  • there are different ways to navigate to directories above your current pwd
cd ..
  • .. goes up one level in file hierarchy
  • .. is special directory name meaning "the dir containing this one" (parent)
  • let's confirm it worked:
pwd
  • .. won't show up using ls by itself
  • but we can do this to see hidden files:
ls -F -a
  • -a shows hidden files, including . and ..
  • -a stands for show all
  • . is for current directory, this can be useful if you want to reference your current location in the file system
  • What happens if we type cd by itself? go ahead and do this and type in the chat what it does
cd
  • by itself will return you to your home directory
  • pwd
  • how do we get back to our data folder?
cd Desktop/data-shell/data
  • we can string together a list of directories at once
  • so far we have been using relative paths - paths starting from our current directory
  • we can also use absolute paths in ls and cd
cd /Users/nelle/Desktop/data-shell
pwd
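  • the objectives also mention tab completion - a quick demo from the home directory (press the Tab key where <TAB> appears; the completions shown assume data-shell is on your Desktop):
cd Des<TAB>            # the shell completes this to cd Desktop/
cd Desktop/da<TAB>     # completes to cd Desktop/data-shell/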

Working With Files and Directories

Objectives: create directory hierarchy that matches given diagram, create files, look in folders, delete folders

  • go back to the data-shell directory (how?) - glad you asked!
pwd
ls -F
  • Let's create a directory called thesis
mkdir thesis
  • mkdir MAKES directories

good names for directories:

  • don't use whitespace - unquoted whitespace splits arguments on the CLI, so avoid it; use - or _ (or a combination) instead
  • don't start a name with -
    • commands treat names starting with - as options
  • stay with letters, numbers, - and _
  • if you need to refer to names of files or directories that have whitespaces, quote them
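  • a quick illustration of why quoting matters (hypothetical directory names, just for demonstration):
mkdir north pacific gyre        # oops: creates THREE directories named north, pacific, and gyre
mkdir 'north pacific gyre'      # creates ONE directory with spaces in its name
rmdir north pacific gyre 'north pacific gyre'   # clean up the demo directories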
cd thesis
ls -F
  • nothing inside our new dir yet
  • let's change directory to inside the thesis and create a file called draft.txt
nano draft.txt

  • Creates file, opens text editor
  • Editors are like cars -- everyone wants to customize them, so there are hundreds if not thousands of different models
  • Write some text
  • use Control+O to save the file (shorthand is ^O)
  • Control+X to exit
  • I don't like this draft, let's remove it:
ls
rm draft.txt
ls
  • where does file go? Can i get it back?
  • Gone Pecan!
  • Deleting is forever!
  • Let's recreate the file then move up one directory
nano draft.txt
cd ..
  • now let's try and remove a directory
  • removing a directory:
rm thesis # error
rmdir thesis # still get error
rm thesis/draft.txt
rmdir thesis
  • unix won't let us delete a directory with something inside of it as a precaution
  • we could have also used rm -r thesis, but that can be dangerous!
  • Let's recreate thesis and draft.
mkdir thesis
nano thesis/draft.txt
ls thesis
  • but draft.txt isn't very informative, let's rename it using the mv command
mv thesis/draft.txt thesis/quotes.txt
ls thesis
  • the first argument to mv is what you want to move, the second is where to move it (including the new name)
  • note: mv works on directories as well
  • let's move quotes into the current directory. What does . mean again?
mv thesis/quotes.txt .
ls thesis
  • ls <filename> will only list that file, let's see that our file is there
ls quotes.txt
  • if we want to keep the old version, we can use copy
cp quotes.txt thesis/quotations.txt
ls quotes.txt thesis/quotations.txt
  • now that we have a copy, let's remove the original
rm quotes.txt
ls quotes.txt thesis/quotations.txt

Pipes and Filters

Objectives: redirect command output to file, construct pipelines

data: http://swcarpentry.github.io/shell-novice/data/data-shell.zip

  • now we can move around and create things, let's see how we can combine existing programs in new ways
  • Let's go into the molecules directory
pwd
ls molecules
cd molecules
  • the .pdb format indicates these are Protein Data Bank files
  • Let's run an example command, wc, on all the .pdb files:
wc *.pdb
  • we haven't covered the * yet. It is a wild card operator
  • the * matches zero or more characters, so the shell turns *.pdb into a list of all .pdb files
  • word count: lines, words, characters
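  • a couple of quick wildcard demos (run from inside molecules/; the matched names assume the standard data-shell files):
echo *.pdb      # the shell expands the pattern before the command ever sees it
ls p*.pdb       # matches only names starting with p, e.g. pentane.pdb and propane.pdb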

wc and flags

  • let's only look at the number of lines
  • what flag do you think will produce this?
wc -l *.pdb
  • only report number of lines
  • what if we run this? what do you think will happen?
wc -l *.pdb > lengths.txt
  • this will send output (redirect it) to new file named lengths.txt
  • but let's confirm that it worked by using a new command, cat - it lets us look inside the file
  • cat stands for concatenate
cat lengths.txt
  • can't remember how wc reports? use man wc (q to exit), wc -h, or wc --help (this should work for most unix commands); also try a web search for unix man wc

Sorting

  • now let's use the sort command to sort the contents of our file
  • let's look at some options for sort
man sort
sort --help
  • we will use the -n flag to tell sort to sort by numerical rather than alpha
sort -n lengths.txt
  • compare to:
sort lengths.txt
  • sort by first column, using numerical order
  • does not change file, just prints output to screen
  • if we want to save results, what can we use?
  • yes, we use our redirection operator > to save to a file
sort -n lengths.txt > sorted-lengths.txt
  • arrow up to recall last few commands
  • let's use head to see the biggest:
head -1 sorted-lengths.txt

Piping

  • Saving intermediate files like sorted-lengths.txt can get messy and confusing, hard to track
  • We can make it easier to understand by combining these commands together
sort -n lengths.txt | head -n 1
  • vertical bar is called the pipe in unix
  • it sends output of command on left as input to command on right
  • head prints specified number of lines from top of file
  • we can chain multiple commands together
  • for example send the output of wc directly to sort, and then the resulting output to head
wc -l *.pdb | sort -n
  • adding head the full pipeline becomes:
wc -l *.pdb | sort -n | head -n 1
  • this pipe and filter programming model is important conceptually
  • let's review what we covered via this image:
  • note: you only enter the original files once!

Nelle's pipeline (skip for time):

  • start in her home directory (/Users/nelle)
cd north-pacific-gyre/2012-07-03
  • all files should contain same amount of data
  • any files contain too little data?
wc -l *.txt | sort -n | head -5
  • any files contain too much data?
wc -l *.txt | sort -n | tail -5
  • file marked with Z? outside naming convention, may contain missing data
ls *Z.txt
  • her notes record that no depth was recorded for these samples
  • she may not want to remove them, but will later select all the other files using *[AB].txt (see the example after this list)
  • Socrative questions 5 and 6
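  • about the *[AB].txt pattern mentioned above: the square brackets match any single character from the set, so this lists every .txt file whose name ends in A or B (assuming we are still in 2012-07-03):
ls *[AB].txt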

Loops

Objectives: write loops that apply commands to series of files, trace values in loops, explain variables vs values, why spaces and punctuation shouldn't be used in file names, history, executing commands again

  • what if you wanted to perform the same commands over and over again on multiple files?
  • suppose we have several hundred genome data files named basilisk.dat, unicorn.dat, etc.
  • go to creatures directory data-shell/creatures
  • may try:
cp *.dat original-*.dat
  • but doesn't work. Why?
  • really you are saying this:
cp basilisk.dat unicorn.dat original-*.dat
  • the problem is that when cp receives more than two arguments, it expects the last one to be a directory into which it can copy the files
  • you can perform these operations using a loop
  • first, let's look at the first three lines of each file
for filename in basilisk.dat unicorn.dat
  do
    head -3 $filename
  done
  • what does this look like when you arrow back up?
  • explain syntax: filename is variable, what does it stand for? how is it represented later?
  • shell prompt changes, if you get stuck, use control+c to get out
  • can specify whatever variable name you want
  • why might it be problematic to have filenames with spaces?
  • you can include multiple commands in a loop:
for filename in *.dat
  do
    echo $filename
    head -100 $filename | tail -20
  done
  • use of wildcard. what does echo do? why is this useful for loops?
echo Hey you
  • strategy: echo command before running to make sure the loop is functioning the way you expect
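  • for example, the copy loop coming up next can be dry-run first by echoing the command instead of running it (a sketch, run from inside creatures/):
for filename in *.dat
  do
    echo cp $filename original-$filename
  done
  • this prints the cp commands that would run, without actually copying anything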

Solving the original problem

  • write a for loop to resolve the original problem of creating a backup (copy of original data)
  • going back to the original file copying problem, we can solve this with the following loop:
for filename in *.dat
do
  cp $filename original-$filename
done

Nelle's example (skip for time constraints)

  • nelle's example:
cd north-pacific-gyre/2012-07-03
  • check:
for datafile in *[AB].txt; do echo $datafile; done
  • add command:
for datafile in *[AB].txt; do echo $datafile stats-$datafile; done
  • add command:
for datafile in *[AB].txt; do goostats $datafile stats-$datafile; done
  • (kill job using ^C)
  • add echo:
for datafile in *[AB].txt; do echo $datafile; goostats $datafile stats-$datafile; done
  • tab completion; line editing: move to start of line using ^A and to end of line using ^E (Option with arrows to move by one word)
  • history: see old commands, find line number (repeat using !number)
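  • for example (the event number comes from your own history output; 458 below is just a placeholder):
history | tail -5
!458   # re-runs the command stored on line 458 of your history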

Shell Scripts

Objectives: write shell script to run command or series of commands for fixed set of files, run shell script from command line, write shell script to operate on set of files defined on command line, create pipelines including user-written shell scripts

  • go back to molecules in nelle's directory
  • create file called middle.sh and add this command: head -15 octane.pdb | tail -5
  • bash middle.sh
  • .sh means it's a shell script
  • very important to make these in a text editor, rather than in Word!
  • edit middle.sh and replace file name with "$1"
  • the quotation marks accommodate spaces in filenames
  • bash middle.sh octane.pdb, should get same output
  • try another file: bash middle.sh pentane.pdb
  • edit middle.sh with head "$2" "$1" | tail "$3"
  • bash middle.sh pentane.pdb -20 -5
  • to remember what you've done, and allow for other people to use: add comments to top of file
# select lines from middle of a file
# usage: middle.sh filename -end_line -num_lines
  • explain comments
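  • putting the pieces together, middle.sh at this point looks something like this (a sketch using the flag-style arguments above):
# select lines from middle of a file
# usage: bash middle.sh filename -end_line -num_lines
head "$2" "$1" | tail "$3"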

  • how would we use the for loop?
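  • one possible answer: wrap the script in a loop (a sketch, assuming we are still in molecules/):
for pdbfile in *.pdb
  do
    echo $pdbfile
    bash middle.sh $pdbfile -20 -5
  done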

  • what if we wanted to operate on many files? create new file: sorted.sh

  • wc -l "$@" | sort -n

    • $@ is a Unix special parameter meaning "all of the command-line arguments"
  • bash sorted.sh *.pdb ../creatures/*.dat

  • add comment!
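  • with a comment added, sorted.sh might look like this (a sketch):
# sort files by their number of lines
# usage: bash sorted.sh one_or_more_filenames
wc -l "$@" | sort -n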

  • save last few lines of history to file to remember how to do work again later:

  • history | tail -4 > redo-figure.sh

  • history | tail -5 | colrm 1 7 (removes columns 1-7, the history line numbers)

    • nelle problem
  • run goostats on all data files

  • do-stats.sh:

# calculate reduced stats for data files at J = 100 C/bp
for datafile in "$@"
  do
    echo $datafile
    bash goostats -J 100 -r $datafile stats-$datafile
  done
  • bash do-stats.sh *[AB].txt

  • or just report bash do-stats.sh *[AB].txt | wc -l

    • Socrative question 9

Finding Things

Objectives: grep to select lines in text which match patterns, find to find files whose names match patterns, nesting commands, text vs binary files

  • move to writing subdirectory
  • cat haiku.txt
  • grep not haiku.txt : find lines that contain "not"
  • grep day haiku.txt : find lines that contain "day"
  • grep -w day haiku.txt : searches only for whole words
  • grep -n it haiku.txt : includes the numbers on lines that match
  • grep -n -w the haiku.txt : combine flags
  • grep -n -w -i the haiku.txt : make case insensitive
  • grep -n -w -v the haiku.txt : invert selection, only lines that do NOT contain the
  • the real strength of grep, and the origin of its name, is "regular expressions": a way of programmatically describing text patterns; we don't have time to cover them today (many good lessons and tutorials online)
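  • a quick taste of a regular expression (this pattern matches lines whose second character is an o):
grep -E '^.o' haiku.txt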

difference between grep and find

  • find . -type d : look for things that are directories in given path

  • find . -type f : look for files instead

  • find is automatically recursive (keeps drilling down into file hierarchy)

    • can specify depth: find . -maxdepth 1 -type f
    • or -mindepth
  • can match by name: find . -name *.txt (may only give one filename! the shell expands the name before find runs)

  • correct way: find . -name '*.txt'

  • find similar to list, but has more refined parameter searching

  • can combine together: count all lines in a group of files

  • wc -l $(find . -name '*.txt')

  • nesting or subshell

  • equivalent command: wc -l ./data/one.txt ./data/two.txt ./haiku.txt

  • can also combine find and grep: find .pdb files that contain Iron

    • grep FE $(find .. -name '*.pdb')
  • today we've only talked about text files, what about images, databases, etc? those are binary (machine readable)

  • Socrative question 10

END CLASS

  • stop shell script output