# IPython shell commands

The 3 Laws of Automation:

1. Any task that is talked about being automated will be automated
2. If it isn't, it's broken
3. If a human is doing it, a machine will eventually do it better

Learning Objectives:
- IPython shell commands
- Shell commands with subprocess
    - e.g. capturing the output of shell commands and sending as input to processes
- Walking the file system
    - e.g. find files matching a pattern or look for a specific file type
- Command-line functions
    - e.g. automate tasks using a library, run scripts in cron
    
## Using IPython with shell commands
    
To run a shell command from IPython, precede it with `!`.

In [1]:
# prints the command's output; when assigned to a variable, the result is an SList
!df -h

Filesystem                Size   Used  Avail Capacity iused               ifree %iused  Mounted on
/dev/disk1s1             233Gi  138Gi   86Gi    62% 1179390 9223372036853596417    0%   /
devfs                    408Ki  408Ki    0Bi   100%    1412                   0  100%   /dev
/dev/disk1s4             233Gi  8.0Gi   86Gi     9%       8 9223372036854775799    0%   /private/var/vm
map -hosts                 0Bi    0Bi    0Bi   100%       0                   0  100%   /net
map auto_home              0Bi    0Bi    0Bi   100%       0                   0  100%   /home
EgnyteDriveFS@egnytefs0  233Gi  147Gi   86Gi    64% 1179390 9223372036853596417    0%   /Volumes/jellyfish
/dev/disk1s3             233Gi  487Mi   86Gi     1%      34 9223372036854775773    0%   /Volumes/Recovery


In [2]:
# we can assign the output to a python variable
ls = !ls

In [4]:
# the type is SList
type(ls)

IPython.utils.text.SList

> `!` only works in IPython and Jupyter; in plain Python it raises a `SyntaxError`.
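
Outside IPython, a similar result can be had with the standard library. The following is a minimal sketch, assuming Python 3.7+ (for `capture_output`); `shell_lines` is a hypothetical helper name, not part of IPython, and the Python interpreter is used as a portable stand-in for a real shell command.

```python
import subprocess
import sys


def shell_lines(args):
    """Run a command (a list of strings) and return its stdout as a list of lines."""
    result = subprocess.run(args, capture_output=True, text=True)
    return result.stdout.splitlines()


# Roughly what `ls = !ls` gives in IPython, but as a plain Python list
lines = shell_lines([sys.executable, "-c", "print('a'); print('b')"])
print(lines)
```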

## Passing python programs to the interpreter

There are two ways:
1. Passing a script to the Python interpreter

In [7]:
# Create a simple script 
!echo "print('hello world!')" > hello_world.py

In [8]:
!python hello_world.py

hello world!


2. Passing a program to the Python interpreter via `-c`

In [9]:
!python -c "import datetime; print(datetime.datetime.now())"

2020-02-12 14:17:25.603871


## Using python and shell together

We can assign the output of a shell command to a Python variable.

In [20]:
# How many csvs exist in the previous course?
csvs = !ls -h ../../6_data_processing_in_shell/notes/*.csv
len(csvs)

1

In [21]:
# How many txts exist in the previous course?
txts = !ls -h ../../6_data_processing_in_shell/notes/*.txt
len(txts)

100
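
The same count can be obtained without leaving Python. This is a sketch using `pathlib`, run against a throwaway directory so it does not depend on the course folder layout; `count_matching` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def count_matching(directory, pattern):
    """Count the entries in `directory` whose names match a glob pattern such as '*.csv'."""
    return sum(1 for _ in Path(directory).glob(pattern))


# Demo files are created only for this example
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.csv", "b.txt", "c.txt"):
        (Path(tmp) / name).touch()
    n_csv = count_matching(tmp, "*.csv")
    n_txt = count_matching(tmp, "*.txt")

print(n_csv, n_txt)
```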

## Capture IPython Shell output

One of the most important principles of UNIX is that the OS should provide simple tools which can be combined to create sophisticated solutions.

1. Grab the output with `!`

In [57]:
# Sum the 5th column (file size) over all entries and print the total at the end
total_size = !ls -l | awk '{ SUM+=$5} END {print SUM}'

In [66]:
total_size

['7531']

2. Grab the output with `%%bash --out output`

In [67]:
%%bash --out output
ls -l | awk '{ SUM+=$5} END {print SUM}'

In [68]:
type(output)

str

In [69]:
output

'8340\n'

### Comparison

The two approaches are similar, but the first returns an `SList`, which comes with useful helper methods, while the second returns a plain `str`.

In [70]:
type(total_size)

IPython.utils.text.SList
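
In practice the difference shows up when the captured value is used. A short sketch, with stand-in values copied from the cells above (a plain list stands in for the `SList`):

```python
# An SList behaves like a list of lines; `%%bash --out` yields a single string
# that still carries the trailing newline
total_size = ["7531"]  # stand-in for the SList captured with `!`
output = "8340\n"      # stand-in for the string captured with `%%bash --out`

size_from_slist = int(total_size[0])

# Split the string into lines first to get the same shape as the SList
output_lines = output.splitlines()
size_from_str = int(output_lines[0])

print(size_from_slist, size_from_str)
```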

### Capturing the STDERR

We may want to capture the standard error stream so that failures can be debugged later.

`--out output` saves standard output in the variable `output`, and `--err error` saves standard error in the variable `error`.

In [78]:
%%bash --out output --err error
ls -l | awk '{ SUM+=$5} END {print SUM}'
echo "no error so far" >&2

We have now captured the output and the error in separate variables.

In [79]:
error

'no error so far\n'

In [80]:
output

'10086\n'
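
Outside a notebook, the two streams can be separated with `subprocess`. A minimal sketch, assuming Python 3.7+; the child program is a made-up one-liner that writes to both streams, run with the Python interpreter so the example does not depend on a particular shell.

```python
import subprocess
import sys

# Hypothetical child program: one line to stdout, one line to stderr
child = "import sys; print('10086'); print('no error so far', file=sys.stderr)"

result = subprocess.run([sys.executable, "-c", child], capture_output=True, text=True)

output = result.stdout  # plays the role of `--out output`
error = result.stderr   # plays the role of `--err error`
print(repr(output), repr(error))
```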

## Automate with SList

`SList` exists to bridge IPython shell commands and Python. On top of normal list behaviour, an `SList` object provides three methods:
- `fields`
- `grep`
- `sort`

### `fields`

`fields` mimics `awk`-style column selection: it splits each line on whitespace and returns the requested columns.

In [81]:
ls = !ls -l /usr/bin

In [84]:
# Confirming it's an SList
type(ls)

IPython.utils.text.SList

In [95]:
# Collect whitespace-separated fields 1 and 5 (zero-indexed) for a few ls entries:
# here the link count and the day of the modification date
ls.fields(1,5)[1:4]

['4 23', '1 22', '1 29']
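
To make the behaviour concrete, here is a plain-Python approximation of column selection. It is a sketch of the idea, not IPython's actual implementation, and the function name `fields` is reused only for readability. The sample lines are taken from the `ls -l /usr/bin` output shown in these notes.

```python
def fields(lines, *indices):
    """Split each line on whitespace and keep the requested columns, joined by a space."""
    if not indices:
        return [line.split() for line in lines]
    picked = []
    for line in lines:
        parts = line.split()
        # Ignore indices that do not exist on this line instead of raising
        cols = [parts[i] for i in indices if -len(parts) <= i < len(parts)]
        picked.append(" ".join(cols))
    return picked


sample = [
    "-rwxr-xr-x   1 root   wheel       1621 23 Feb  2019 kill.d",
    "-rwxr-xr-x   1 root   wheel      23984 29 Jul  2019 killall",
]
print(fields(sample, 1, 5))
```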

### `grep`

`grep`-like operations on the output of a shell command.

In [96]:
ls = !ls -l /usr/bin

In [98]:
# Find utilities that will kill UNIX processes
ls.grep("kill")

['-rwxr-xr-x   1 root   wheel       1621 23 Feb  2019 kill.d',
 '-rwxr-xr-x   1 root   wheel      23984 29 Jul  2019 killall',
 '-rwxr-xr-x   1 root   wheel      30512 29 Jul  2019 pkill']
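
A rough plain-Python equivalent is sketched below. It assumes the pattern is a regular expression matched case-insensitively and that an optional `field` restricts the search to one whitespace-separated column; treat it as an approximation rather than IPython's code. The middle sample line is hypothetical, added only so that something gets filtered out.

```python
import re


def grep(lines, pattern, field=None):
    """Keep the lines in which `pattern` (a regex) is found, optionally in one column only."""
    kept = []
    for line in lines:
        target = line
        if field is not None:
            parts = line.split()
            target = parts[field] if field < len(parts) else ""
        if re.search(pattern, target, re.IGNORECASE):
            kept.append(line)
    return kept


sample = [
    "-rwxr-xr-x   1 root   wheel       1621 23 Feb  2019 kill.d",
    "-rwxr-xr-x   1 root   wheel          0  1 Jan  2020 example",  # hypothetical entry
    "-rwxr-xr-x   1 root   wheel      30512 29 Jul  2019 pkill",
]
print(grep(sample, "kill"))
```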

### `sort`

Performs sorting on the output of a shell command.
- first argument is which column to sort on
- second argument is whether to sort by alphabetical or numerical values

In [99]:
disk_usage = !df -h

In [100]:
disk_usage.sort(5, nums=True)

['Filesystem                Size   Used  Avail Capacity iused               ifree %iused  Mounted on',
 'map -hosts                 0Bi    0Bi    0Bi   100%       0                   0  100%   /net',
 'map auto_home              0Bi    0Bi    0Bi   100%       0                   0  100%   /home',
 '/dev/disk1s4             233Gi  8.0Gi   86Gi     9%       8 9223372036854775799    0%   /private/var/vm',
 '/dev/disk1s3             233Gi  487Mi   86Gi     1%      34 9223372036854775773    0%   /Volumes/Recovery',
 'devfs                    408Ki  408Ki    0Bi   100%    1410                   0  100%   /dev',
 '/dev/disk1s1             233Gi  138Gi   86Gi    62% 1179826 9223372036853595981    0%   /',
 'EgnyteDriveFS@egnytefs0  233Gi  147Gi   86Gi    64% 1179826 9223372036853595981    0%   /Volumes/jellyfish']
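
The effect of `nums` is easiest to see side by side. The sketch below is a plain-Python approximation (the real method may treat edge cases differently); `sort_by_field` is a hypothetical name and the sample lines come from the `df -h` output above.

```python
def sort_by_field(lines, field, nums=False):
    """Sort lines by one whitespace-separated column, as text or as a number."""
    def key(line):
        parts = line.split()
        value = parts[field] if field < len(parts) else ""
        if nums:
            # Keep only the digits so the column compares as an integer
            digits = "".join(ch for ch in value if ch.isdigit())
            value = int(digits) if digits else 0
        return (value, line)
    return sorted(lines, key=key)


sample = [
    "/dev/disk1s1             233Gi  138Gi   86Gi    62% 1179826 9223372036853595981    0%   /",
    "devfs                    408Ki  408Ki    0Bi   100%    1410                   0  100%   /dev",
    "/dev/disk1s4             233Gi  8.0Gi   86Gi     9%       8 9223372036854775799    0%   /private/var/vm",
]

numeric = sort_by_field(sample, 5, nums=True)  # 8 < 1410 < 1179826
textual = sort_by_field(sample, 5)             # '1179826' < '1410' < '8'
print([line.split()[5] for line in numeric])
print([line.split()[5] for line in textual])
```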

### Python lists and SLists

Some list methods, such as `pop`, also work on SLists. It is also easy to convert an SList to a plain Python list with `list()`.

In [103]:
list(disk_usage)

['Filesystem                Size   Used  Avail Capacity iused               ifree %iused  Mounted on',
 '/dev/disk1s1             233Gi  138Gi   86Gi    62% 1179826 9223372036853595981    0%   /',
 'devfs                    408Ki  408Ki    0Bi   100%    1410                   0  100%   /dev',
 '/dev/disk1s4             233Gi  8.0Gi   86Gi     9%       8 9223372036854775799    0%   /private/var/vm',
 'map -hosts                 0Bi    0Bi    0Bi   100%       0                   0  100%   /net',
 'map auto_home              0Bi    0Bi    0Bi   100%       0                   0  100%   /home',
 'EgnyteDriveFS@egnytefs0  233Gi  147Gi   86Gi    64% 1179826 9223372036853595981    0%   /Volumes/jellyfish',
 '/dev/disk1s3             233Gi  487Mi   86Gi     1%      34 9223372036854775773    0%   /Volumes/Recovery']

## Find our jupyter notebooks with `grep`


In [107]:
files = !ls ~/dev/stuff/sandbox/miguel

In [108]:
files.grep(".ipynb")

['1_introduction_to_data_engineering-1_introduction_to_data_engineering.ipynb',
 'Dataflow.ipynb',
 'Untitled.ipynb',
 'Untitled1.ipynb',
 'Untitled10.ipynb',
 'Untitled11.ipynb',
 'Untitled12.ipynb',
 'Untitled13.ipynb',
 'Untitled14.ipynb',
 'Untitled15.ipynb',
 'Untitled16.ipynb',
 'Untitled17.ipynb',
 'Untitled18.ipynb',
 'Untitled19.ipynb',
 'Untitled2.ipynb',
 'Untitled20-Copy1.ipynb',
 'Untitled20.ipynb',
 'Untitled21.ipynb',
 'Untitled22.ipynb',
 'Untitled23.ipynb',
 'Untitled24.ipynb',
 'Untitled25.ipynb',
 'Untitled26.ipynb',
 'Untitled27.ipynb',
 'Untitled28.ipynb',
 'Untitled29.ipynb',
 'Untitled3.ipynb',
 'Untitled30.ipynb',
 'Untitled31.ipynb',
 'Untitled32.ipynb',
 'Untitled33.ipynb',
 'Untitled35.ipynb',
 'Untitled4.ipynb',
 'Untitled5.ipynb',
 'Untitled6.ipynb',
 'Untitled7.ipynb',
 'Untitled8.ipynb',
 'Untitled9.ipynb',
 'adjust-merge_data-Copy1.ipynb',
 'adjust-merge_data.ipynb',
 'airflow-appsflyer-clean_mmp.ipynb',
 'airflow-appsflyer-clean_reports.ipynb',
 'airflo
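
One caveat: `grep` interprets a string pattern as a regular expression, so the `.` in `".ipynb"` matches any character. The sketch below shows the difference with `re` directly; the file names are illustrative and the last one is hypothetical, chosen to trigger the loose match.

```python
import re

names = ["Dataflow.ipynb", "Untitled.ipynb", "hello_world.py", "my_ipynb_notes.txt"]

# Unescaped dot: also matches "_ipynb" in the middle of a name
loose = [n for n in names if re.search(".ipynb", n)]

# Escaped dot plus end-of-string anchor: only real .ipynb extensions
strict = [n for n in names if re.search(r"\.ipynb$", n)]

print(loose)
print(strict)
```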

# Shell commands with subprocess

One of Python's strengths is the ability to glue itself to other languages and systems. There's a Python API for almost everything, including one to interact with the UNIX shell. We can for example:
- send data to UNIX processes
- listen to output
- kill processes

## subprocess.run

`subprocess.run` is the simplest way to run shell commands in Python 3.5+. It takes the command as a list of strings and, by default, runs it without capturing the output.

In [111]:
import subprocess

subprocess.run(["ls", "-l"])

CompletedProcess(args=['ls', '-l'], returncode=0)

Handling text in Python 3 is more powerful but also more complex. Output captured from a subprocess arrives as `bytes` and needs to be decoded (typically from UTF-8) into a `str` before further text processing. Given a bytes object `res`, this is accomplished with:

`regular_string = res.decode("utf-8")`
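
A small end-to-end sketch of that step follows. The command is a stand-in (the Python interpreter printing a fixed string), so that `res` has a concrete value.

```python
import subprocess
import sys

# Without text=True, captured output arrives as bytes
res = subprocess.run(
    [sys.executable, "-c", "print('hello world!')"],
    stdout=subprocess.PIPE,
).stdout

regular_string = res.decode("utf-8")
print(type(res).__name__, type(regular_string).__name__)
```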

## Status codes

UNIX commands return a status code which represents the status of their completion. 
- `0` means successful
- non-zero means unsuccessful

In [112]:
# Printing the status code of the last run command
# Notice the 0 at the end: was successful
!ls -l; echo $?

total 56
-rw-r--r--  1 miguel.carvalho  staff  22745 12 Feb 15:43 7_command_line_automation_in_python.ipynb
-rw-r--r--  1 miguel.carvalho  staff     22 12 Feb 14:16 hello_world.py
0


In [121]:
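# A failing example, yet the status printed is 0: with a pipe, `echo $?` does not wait
# for `ls`, so it reports the previously completed command. Use `;` instead of `|`.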
!ls --bogus | echo $?

0
ls: illegal option -- -
usage: ls [-ABCFGHLOPRSTUWabcdefghiklmnopqrstuwx1] [file ...]


### Capturing the status code with the subprocess

In [125]:
## Notice how returncode is part of the CompletedProcess object
subprocess.run(["ls", "-l"])

CompletedProcess(args=['ls', '-l'], returncode=0)

In [126]:
## Successful example
subprocess.run(["ls", "-l"]).returncode

0

In [127]:
## Non-successful example
subprocess.run(["ls", "--lame"]).returncode

1

### Control flow for status codes

We can check for status codes in a control flow structure to account for possible errors.

In [146]:
bad_user_input = "--lame"
out = subprocess.run(["ls", bad_user_input])

In [147]:
out

CompletedProcess(args=['ls', '--lame'], returncode=1)

In [148]:
if out.returncode == 0:
    print("Success")
else:
    print("Unsuccessful")

Unsuccessful
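
An alternative to inspecting `returncode` by hand is to pass `check=True`, which makes `subprocess.run` raise `CalledProcessError` on a non-zero status. A minimal sketch, using a made-up child process that exits with status 3:

```python
import subprocess
import sys

# The interpreter is a portable stand-in for a command that fails
failing = [sys.executable, "-c", "import sys; sys.exit(3)"]

try:
    subprocess.run(failing, check=True)
    status = 0
    print("Success")
except subprocess.CalledProcessError as exc:
    status = exc.returncode
    print(f"Unsuccessful, exit status {status}")
```

This keeps the happy path free of `if` checks and makes it harder to ignore a failure by accident.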


In [151]:
# Running two subprocesses from Python
import subprocess

# Execute Unix command `head` safely as items in a list
with subprocess.Popen(["head", "test.txt"], stdout=subprocess.PIPE) as head:

    # Print each line of the list returned by `stdout.readlines()`
    for line in head.stdout.readlines():
        print(line)

# Execute Unix command `wc -w` safely as items in a list
with subprocess.Popen(["wc", "-w", "test.txt"], stdout=subprocess.PIPE) as word_count:

    # Print the raw (bytes) standard output of `wc -w`
    print(word_count.stdout.read())

b'This is a test\n'
b'This is a test\n'
b'This is a test\n'
b'This is a test\n'
b'This is a test\n'
b'This is a test\n'
b'This is a test\n'
b'This is a test\n'
b'This is a test'
b'      36 test.txt\n'
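
The two commands above run independently. To send the output of one process as input to another, one option is the `input` argument of `subprocess.run`. The sketch below assumes Python 3.7+ and a Unix-like system with `head` and `wc` on the `PATH`; `count_words_in_head` is a hypothetical helper, and the text is passed in directly rather than read from `test.txt`.

```python
import subprocess


def count_words_in_head(text, n_lines=3):
    """Feed `text` to `head -n`, then feed head's output to `wc -w`."""
    head = subprocess.run(
        ["head", "-n", str(n_lines)],
        input=text, capture_output=True, text=True, check=True,
    )
    wc = subprocess.run(
        ["wc", "-w"],
        input=head.stdout, capture_output=True, text=True, check=True,
    )
    # `wc` may pad its output with spaces, so split before converting
    return int(wc.stdout.split()[0])


words = count_words_in_head("This is a test\n" * 9, n_lines=3)
print(words)
```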


In [152]:
import subprocess

# Use subprocess to run the `ps aux` command that lists running processes
with subprocess.Popen(["ps", "aux"], stdout=subprocess.PIPE) as proc:
    process_output = proc.stdout.readlines()
    
# Look through each line in the output and skip it if it contains "python"
for line in process_output:
    if b"python" in line:
        continue
    print(line)

b'USER               PID  %CPU %MEM      VSZ    RSS   TT  STAT STARTED      TIME COMMAND\n'
b'miguel.carvalho  27559  44.5  2.1  5305376 348504   ??  R    Mon09am  33:46.78 /Applications/Google Chrome.app/Contents/Frameworks/Google Chrome Framework.framework/Versions/79.0.3945.130/Helpers/Google Chrome Helper (Renderer).app/Contents/MacOS/Google Chrome Helper (Renderer) --type=renderer --field-trial-handle=1718379636,13713992639962531526,9614059912419605989,131072 --lang=en-GB --metrics-client-id=456e333c-a0e4-4a67-aeb8-dd377343299f --enable-auto-reload --num-raster-threads=4 --enable-zero-copy --enable-gpu-memory-buffer-compositor-resources --enable-main-frame-before-activation --service-request-channel-token=317072197224008168 --renderer-client-id=7410 --no-v8-untrusted-code-mitigations --seatbelt-client=140\n'
b'_windowserver      173  14.7  0.9  8985756 153036   ??  Ss   30Jan20 519:13.67 /System/Library/PrivateFrameworks/SkyLight.framework/Resources/WindowServer -daemon\n'
b'migue

In [153]:
!ps aux

USER               PID  %CPU %MEM      VSZ    RSS   TT  STAT STARTED      TIME COMMAND
miguel.carvalho  83865  10.9  0.9  6153080 148308   ??  R    Tue10am  30:39.71 /Applications/Si
miguel.carvalho  27559   5.5  2.1  5343480 347624   ??  R    Mon09am  33:50.64 /Applications/Go
miguel.carvalho   1495   4.2  3.1  6179320 519324   ??  S    30Jan20 379:38.44 /Applications/Go
root               438   3.4  0.1  4476828  12540   ??  Ss   30Jan20  17:24.64 /usr/libexec/Tou
_windowserver      173   3.1  0.9  8981064 152224   ??  Ss   30Jan20 519:16.18 /System/Library/
miguel.carvalho   2055   1.6  1.1  4999204 181628   ??  S    30Jan20  20:44.70 /System/Library/
miguel.carvalho  83863   1.5  0.1  4711488  14536   ??  R    Tue10am   5:32.60 /Applications/Si
miguel.carvalho  37822   0.6  2.4  5436332 401800   ??  S     8:50pm  19:23.28 /Applications/Go
jellyadmin        1148   0.5  0.1  5286896  14444   ??  S    30Jan20  46:46.27 /Library/Applica
miguel.carvalho   1577   0.4  2.9  6065

miguel.carvalho   1335   0.0  0.2  4511204  37012   ??  S    30Jan20   1:24.78 /System/Library/
_spotlight        1270   0.0  0.0  4380592   4544   ??  S    30Jan20   0:04.64 /usr/libexec/tru
jellyadmin        1209   0.0  0.1  6001248  21160   ??  S    30Jan20   5:16.46 /System/Library/
jellyadmin        1208   0.0  0.0  4387488   2460   ??  S    30Jan20   0:07.41 /usr/libexec/key
jellyadmin        1207   0.0  0.0  4439704   7276   ??  S    30Jan20   0:06.52 /System/Library/
jellyadmin        1194   0.0  0.0  4380576    496   ??  Ss   30Jan20   0:00.88 /System/Library/
jellyadmin        1193   0.0  0.0  4404156   4956   ??  S    30Jan20   0:13.85 /System/Library/
jellyadmin        1192   0.0  0.0  4392340   6000   ??  S    30Jan20   0:09.00 /System/Library/
jellyadmin        1191   0.0  0.0  4377564    876   ??  S    30Jan20   0:00.83 /System/Library/
root              1189   0.0  0.1  4389308  11716   ??  Ss   30Jan20   0:01.22 /System/Library/
_softwareupdate   1188   0.0  