 <div style="clear: both; display: table;">
  <div style="border: none; float: left; width: 60%; padding: 5px">
  <h1 id="subtitle">Chapter 2: Python Standard Library</h1>
  <h2 id="subtitle">Guillermo Avendaño Franco <br> Aldo Humberto Romero</h2>
  <br>
  <img src="fig/1-line logotype124-295.png" alt="Scientific Computing with Python" style="width:50%" align="left">
  </div>
  <div style="border: none; float: left; width: 30%; padding: 5px">
  <img src="fig/SCPython.png" alt="Scientific Computing with Python" style="width:100%">
  </div>
</div>

Adapted by **Guillermo Avendaño** (WVU), **Jose Rogan** (Universidad de Chile) and **Aldo Humberto Romero** (WVU) from the [Tutorials for Stanford cs228 and cs231n](https://github.com/kuleshov/cs228-material). A large part of the material was also written from scratch. In turn, the Stanford material was adapted by [Volodymyr Kuleshov](http://web.stanford.edu/~kuleshov/) and [Isaac Caswell](https://symsys.stanford.edu/viewing/symsysaffiliate/21335) from the `CS231n` Python tutorial by Justin Johnson (http://cs231n.github.io/python-numpy-tutorial/).

Changes to the original tutorial include strict Python 3 syntax and a split of the material to fit a series of lessons on Python programming for WVU's faculty and graduate students.

The support of the National Science Foundation and the US Department of Energy under projects DMREF-NSF 1434897, NSF OAC-1740111 and DOE DE-SC0016176 is gratefully acknowledged.

<div style="clear: both; display: table;">
<div style="border: none; float: left; width: 40%; padding: 10px">
<img src="fig/NSF.jpg" alt="National Science Foundation" style="width:50%" align="left">
    </div>
    <div style="border: none; float: right; width: 40%; padding: 10px">
<img src="fig/DOE.jpg" alt="US Department of Energy" style="width:50%" align="right">
</div>
</div>

## Table of Contents

Python is a great general-purpose programming language on its own. 
This notebook is focused on the Python Standard Library (PSL). 
Some of these modules are explicitly designed to encourage and enhance the portability of Python programs by abstracting away platform-specifics into platform-neutral APIs.
The lesson is particularly oriented to Scientific Computing. The episodes in the series include:

  * Python Syntax 
  * **The Python Standard Library \[This notebook\]**
  * Numpy
  * Matplotlib
  * Scipy 
  * Pandas
  * Cython

After completing the whole series you will see that Python has become a powerful environment for scientific computing at several levels, from interactive computing to scripting to the development of large projects.

## Setup

In [1]:
%load_ext watermark

In [2]:
%watermark

2019-09-16T08:43:48-04:00

CPython 3.7.4
IPython 7.6.1

compiler   : Clang 10.0.1 (clang-1001.0.46.4)
system     : Darwin
release    : 18.7.0
machine    : x86_64
processor  : i386
CPU cores  : 8
interpreter: 64bit


# Introduction

In this tutorial, we will cover the following modules in the Python Standard Library:

* sys
* math and cmath
* os and os.path
* shutil
* itertools
* json
* subprocess
* multiprocessing


The Python Standard Library (PSL) is a set of modules distributed with Python and included in most Python implementations. With a few very specific exceptions, you can take for granted that every machine capable of running Python code will have those modules available too.

Python's standard library is very extensive. It contains built-in modules (written in C) that provide access to system functionality such as file I/O that would otherwise be inaccessible to Python programmers, as well as modules written in Python that provide standardized solutions for many problems that occur in everyday programming. Relying on these modules keeps programs simple and makes them portable between different systems. In essence, the library offers platform-neutral application programming interfaces (APIs).

Here we have selected a few modules that are commonly used in scientific computing. The selection is rather subjective, but in our experience most people who use Python for research, especially for numerically oriented calculations, will use several of these modules at some point.

The complete documentation for these modules can be found [here](https://docs.python.org/3/library/index.html).

# sys

This module provides access to some variables used or maintained by the interpreter and to functions that interact strongly with the interpreter. It is always available. More info can be found in [sys](https://docs.python.org/3/library/sys.html) 

In [3]:
import sys

There are a few reasons to include this module in the selection. For example, it reports the version of Python in use:

In [4]:
sys.version

'3.7.4 (default, Jul 11 2019, 01:08:00) \n[Clang 10.0.1 (clang-1001.0.46.4)]'

In [5]:
sys.version_info

sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)
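Because `sys.version_info` compares like a tuple, it can be used to guard code that needs a minimum Python version. The following is a minimal sketch; the name `MINIMUM_VERSION` and the helper `python_is_recent_enough` are hypothetical and exist only for this illustration.

```python
import sys

# Hypothetical minimum version chosen only for this illustration.
MINIMUM_VERSION = (3, 5)

def python_is_recent_enough(minimum=MINIMUM_VERSION):
    """Return True if the running interpreter is at least `minimum`."""
    # sys.version_info compares element by element, like a tuple.
    return sys.version_info >= minimum

if python_is_recent_enough():
    print("Python version is supported")
else:
    print("Python %d.%d or newer is required" % MINIMUM_VERSION)
```

Comparing against a tuple is more robust than parsing the string in `sys.version`.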

`sys.float_info` holds low-level information about the limits, precision and internal representation of the float type. The values correspond to the various floating-point constants defined in the standard header file float.h for the C programming language; see section 5.2.4.2.2 of the 1999 ISO/IEC C standard [C99], 'Characteristics of floating types', for details.

In [6]:
sys.float_info

sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)

Each value can be retrieved individually, for example:

In [7]:
sys.float_info.max

1.7976931348623157e+308

Similarly for integers:

In [8]:
sys.int_info

sys.int_info(bits_per_digit=30, sizeof_digit=4)

To get the size of any object in bytes:

In [9]:
a=list(range(1000))
sys.getsizeof(a)

9120

In [10]:
b=range(1000)
sys.getsizeof(b)

48

By itself, `sys.getsizeof()` reports only the size of the container object, not the size of its contents. It can be combined with a recipe like [this](https://code.activestate.com/recipes/577504/) to recursively add up the contents of a container.

To list the paths that are searched for modules:
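As an illustration of the recursive idea, here is a simplified sketch. It is not the linked recipe: the function name `total_size` is an assumption for this example, and it only descends into the basic built-in containers.

```python
import sys

def total_size(obj, seen=None):
    """Approximate size in bytes of obj plus the objects it contains."""
    if seen is None:
        seen = set()
    if id(obj) in seen:          # count shared objects only once
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        # Add the sizes of both keys and values
        size += sum(total_size(k, seen) + total_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    return size

a = list(range(1000))
print(sys.getsizeof(a), total_size(a))
```

The second number is larger because it also accounts for the 1000 integer objects referenced by the list.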

In [11]:
sys.path

['/Users/aldoromero/Class/PythonIntroduction/intro_python',
 '/Users/aldoromero/Class/PythonIntroduction/intro_python',
 '/Users/aldoromero/Python/TSASE/tsase',
 '/Users/aldoromero/Python/ase/3.11.0/ase',
 '/Users/aldoromero/MiguelMarques/ase',
 '/Applications/vtkpython/bin/vtk/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib',
 '/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python37.zip',
 '/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7',
 '/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/lib-dynload',
 '',
 '/Users/aldoromero/Library/Python/3.7/lib/python/site-packages',
 '/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages',
 '/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pymatgen-2019.1.24-py3.7-macosx-10.14-x86_64.egg',
 '/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/sympy-1.3-py3

The prefix path where the current version of Python is installed:

In [12]:
sys.prefix

'/opt/local/Library/Frameworks/Python.framework/Versions/3.7'

To collect arguments such as

myscript.py arg1 arg2 arg3
    
from the command line `sys.argv` can be used, in particular for scripts.

In [13]:
sys.argv

['/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ipykernel_launcher.py',
 '-f',
 '/Users/aldoromero/Library/Jupyter/runtime/kernel-faaf64e3-f8ea-47df-8db2-666418579375.json']


However, `sys.argv` is too primitive for most practical purposes. The `argparse` module is the recommended way to parse command-line arguments.
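A minimal sketch of what an `argparse` version could look like. The option names (`inputs`, `--steps`, `--verbose`) are hypothetical and chosen only for this example.

```python
import argparse

def build_parser():
    """Create a parser for a hypothetical script with a few options."""
    parser = argparse.ArgumentParser(description="Hypothetical example script")
    parser.add_argument("inputs", nargs="+", help="one or more input names")
    parser.add_argument("-n", "--steps", type=int, default=10,
                        help="number of steps (default: 10)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print extra information")
    return parser

# An explicit list keeps the example usable inside a notebook;
# in a script, call parse_args() with no arguments to read sys.argv.
args = build_parser().parse_args(["arg1", "arg2", "--steps", "5"])
print(args.inputs, args.steps, args.verbose)
```

Besides type conversion and default values, `argparse` also generates the `-h/--help` message automatically.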

# math and cmath

The `math` module provides access to the mathematical functions defined by the C standard. The analogous module for complex numbers is `cmath`.

In [14]:
import math
import cmath

The arguments for the functions in `math` and `cmath` must be single numbers. As we will see in the lesson on `numpy`, when a function has to operate on many numbers the functions in `numpy` are a far more efficient alternative and avoid expensive loops over lists or other low-performance containers.

A few functions are shown as examples:

## math

In [15]:
math.ceil(2.5)

3

In [16]:
math.fabs(-3.7)

3.7

`fabs` only works for real numbers and always returns a float, even if the argument is an integer.
For complex numbers, the built-in `abs()` returns the magnitude:

In [17]:
abs(-1.7+4.5j)

4.810405388322278

`math.gcd(a, b)` returns the greatest common divisor (GCD) of the integers `a` and `b`.

In [18]:
math.gcd(91, 133)

7

In [19]:
math.sqrt(256)

16.0

In [20]:
math.cos(math.pi/3)

0.5000000000000001
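The result above is not exactly 0.5 because `math.pi` and the cosine are computed in finite precision. A minimal sketch of how such values might be compared, using `math.isclose` (available since Python 3.5); the tolerances below are illustrative choices, not recommendations.

```python
import math

value = math.cos(math.pi / 3)

# Compare with a relative tolerance instead of testing exact equality
print(math.isclose(value, 0.5))

# When the expected value is zero a relative tolerance is useless,
# so an absolute tolerance has to be given explicitly
print(math.isclose(math.sin(math.pi), 0.0, abs_tol=1e-12))
```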

## cmath

In [21]:
cmath.sqrt(-256)

16j

In [22]:
cmath.cos(1j*math.pi/3)

(1.600286857702386-0j)

# os and os.path

Sooner or later you will interact with files and folders. The `os` module not only provides basic operations on the filesystem but also gives us information about the operating system that is executing Python.

## os

In [23]:
import os

The `os` module provides operating-system-dependent functionality. Some functions are not available on every operating system, and failed system calls are reported by raising `OSError`.

In [24]:
os.name

'posix'

In [25]:
os.environ

environ{'MANPATH': '/opt/local/share/man:/usr/local/opt/gnu-sed/libexec/gnuman:',
        'TERM_PROGRAM': 'Apple_Terminal',
        'TERM': 'xterm-color',
        'SHELL': '/bin/bash',
        'BINDIR': '/opt/local/bin',
        'TMPDIR': '/var/folders/sd/7rq2vwcx7fd8stfzsvfnz9980000gq/T/',
        'INCLUDEDIR': '/opt/local/include/p4vasp',
        'Apple_PubSub_Socket_Render': '/private/tmp/com.apple.launchd.Yv9OTEQLBw/Render',
        'DFTB_COMMAND': '/Users/aldoromero/DFTB+/dftb+_1.2.2/dftb+_1.2.2_src/prg_dftb/_obj_macos/dftb+',
        'TERM_PROGRAM_VERSION': '421.2',
        'MKDIR_P': '/bin/mkdir -p',
        'OLDPWD': '/Users/aldoromero/Class/PythonIntroduction',
        'TERM_SESSION_ID': '795FDCCE-41B0-4041-87AD-86EEE2D96F27',
        'LC_ALL': 'en_US.UTF-8',
        'P4VASP_HOME': '/opt/local/lib/p4vasp',
        'USER': 'aldoromero',
        'LD_LIBRARY_PATH': ':/Applications/vtkpython/lib:/usr/local/lib',
        'ASE_TAGS': 'https://svn.fysik.dtu.dk/projects/ase/tags/',
  

Individual environment variables can be retrieved:

In [26]:
os.getenv('USER')

'aldoromero'

A few functions reproduce the effect of common Unix/Linux commands such as `pwd`, `mkdir`, `cd`, `ls` and `rmdir`:

In [27]:
# Equivalent to pwd
os.getcwd()

'/Users/aldoromero/Class/PythonIntroduction/intro_python'

In [28]:
# Equivalent to mkdir
if not os.path.exists('test_folder'):
    os.mkdir('test_folder')

In [29]:
# Equivalent to cd
os.chdir('test_folder')
os.chdir('..')

In [30]:
# Equivalent to ls
os.listdir("test_folder")

[]

In [31]:
# Equivalent to rmdir
os.rmdir('test_folder')

The following function is useful in HPC to determine the number of cores on a machine:

In [32]:
os.cpu_count()

8

The `os` module is particularly extensive, and the functions above are just a tiny fraction of what is available. It is always better to use functions like `os.mkdir()` than external calls to system commands.
A bad programming habit is, for example:

In [33]:
os.system("mkdir test_folder")

0

This command not only makes the code less portable (shell commands differ between operating systems such as Windows and Unix) but, on Unix systems, also creates a subshell for a task that can be done directly with `os.mkdir()`.
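As a sketch of the portable alternative, `os.makedirs` with `exist_ok=True` creates nested folders without an external command and without failing when the folder already exists. The folder names are hypothetical, and a temporary directory is used so that the example leaves nothing behind.

```python
import os
import tempfile

# Work inside a temporary directory that is removed automatically
with tempfile.TemporaryDirectory() as workdir:
    nested = os.path.join(workdir, "results", "run_01")  # hypothetical names
    os.makedirs(nested, exist_ok=True)   # also creates intermediate folders
    os.makedirs(nested, exist_ok=True)   # repeating the call is not an error
    created = os.path.isdir(nested)
    print(created)
```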

## os.path

This module implements some useful functions on pathnames, such as checking the existence of a file or folder or splitting the filename from the full path.

In [34]:
import os.path

To know if a file or folder exists:

In [35]:
if not os.path.exists('test_folder'):
    os.mkdir('test_folder')

In [36]:
os.path.isfile('test_folder')

False

In [37]:
os.path.isdir('test_folder')

True

In [38]:
fullpath=os.path.abspath('test_folder')
print(fullpath)

/Users/aldoromero/Class/PythonIntroduction/intro_python/test_folder


In [39]:
os.path.split(fullpath)

('/Users/aldoromero/Class/PythonIntroduction/intro_python', 'test_folder')

This function splits a path into two components (head, tail), where tail is the last pathname component and head is everything leading up to it. The tail part will never contain a slash; if the path ends in a slash, tail will be empty.

It is useful to separate the filename from the path to that file.
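A short sketch of related `os.path` functions that are often used together with `split`. The path below is hypothetical and does not need to exist, since these functions only manipulate strings.

```python
import os.path

# Build a path with the separator of the current operating system
path = os.path.join("data", "run_01", "output.dat")  # hypothetical path

folder, filename = os.path.split(path)        # head and tail
stem, extension = os.path.splitext(filename)  # name and extension

print(folder)
print(filename, stem, extension)
```

Using `os.path.join` instead of concatenating strings with `/` keeps the code portable across operating systems.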

# shutil

The `shutil` module provides high-level operations on one or more files. Most of its functions support copying and removing multiple files with a single call. These functions are more efficient than writing loops that operate on the files one by one.

In [40]:
import shutil

In [41]:
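# Create an empty file
with open('newfile1', 'w'):
    pass
# Make sure the destination folder exists
if not os.path.exists('test_folder'):
    os.mkdir('test_folder')
# Copy the file, preserving metadata such as timestamps
shutil.copy2('newfile1', 'test_folder')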

'test_folder/newfile1'

In [42]:
shutil.rmtree('test_folder')
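To illustrate operating on several files with a single call, here is a small sketch using `shutil.copytree` and `shutil.rmtree`. The folder and file names are hypothetical, and everything happens inside a temporary directory.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "source")
    backup = os.path.join(workdir, "backup")
    os.mkdir(source)
    # Create a few small files to copy
    for name in ("a.txt", "b.txt", "c.txt"):
        with open(os.path.join(source, name), "w") as handle:
            handle.write("data\n")

    shutil.copytree(source, backup)   # copies the folder and all its files
    copied = sorted(os.listdir(backup))

    shutil.rmtree(source)             # removes the folder and its contents
    source_exists = os.path.exists(source)

print(copied, source_exists)
```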

# itertools

Combinations and permutations are often found in scientific problems. The `itertools` module offers efficient functions that create iterators for those operations. Unlike lists, iterators can describe infinite sequences, producing new elements only as needed, and therefore use less memory than actual lists.

For example, these iterators create new elements without limit:

In [43]:
import itertools

In [44]:
index=0
for i in itertools.count(13):
    print(i)
    index=index+1
    if index>9:
        break

13
14
15
16
17
18
19
20
21
22


In [45]:
index=0
for i in itertools.cycle('aeiou'):
    print(i)
    index=index+1
    if index>9:
        break

a
e
i
o
u
a
e
i
o
u
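The manual counter used in the two loops above can be avoided with `itertools.islice`, which takes a bounded number of elements from any iterator, including infinite ones. A minimal sketch:

```python
import itertools

# Take the first 10 elements of an infinite counter starting at 13
first_ten = list(itertools.islice(itertools.count(13), 10))

# Take the first 7 elements of an endless cycle over the vowels
vowels = list(itertools.islice(itertools.cycle('aeiou'), 7))

print(first_ten)
print(vowels)
```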


In [46]:
for i in itertools.repeat('one',5):
    print(i)

one
one
one
one
one


For large numbers of iterations, this is more memory-efficient than the equivalent:

In [47]:
for i in 5*['one']:
    print(i)

one
one
one
one
one


Iterators for Combinations and Permutations can be created as follows:

In [48]:
for i in itertools.permutations('ABCD',3):
    print(i)

('A', 'B', 'C')
('A', 'B', 'D')
('A', 'C', 'B')
('A', 'C', 'D')
('A', 'D', 'B')
('A', 'D', 'C')
('B', 'A', 'C')
('B', 'A', 'D')
('B', 'C', 'A')
('B', 'C', 'D')
('B', 'D', 'A')
('B', 'D', 'C')
('C', 'A', 'B')
('C', 'A', 'D')
('C', 'B', 'A')
('C', 'B', 'D')
('C', 'D', 'A')
('C', 'D', 'B')
('D', 'A', 'B')
('D', 'A', 'C')
('D', 'B', 'A')
('D', 'B', 'C')
('D', 'C', 'A')
('D', 'C', 'B')


In [49]:
for i in itertools.combinations('ABCD',3):
    print(i)

('A', 'B', 'C')
('A', 'B', 'D')
('A', 'C', 'D')
('B', 'C', 'D')


In [50]:
for i in itertools.product('ABCD',repeat=2):
    print(i)

('A', 'A')
('A', 'B')
('A', 'C')
('A', 'D')
('B', 'A')
('B', 'B')
('B', 'C')
('B', 'D')
('C', 'A')
('C', 'B')
('C', 'C')
('C', 'D')
('D', 'A')
('D', 'B')
('D', 'C')
('D', 'D')


# json

JSON is a lightweight data interchange format inspired by JavaScript object literal syntax. It is an effective and standard way of storing structured data. JSON is a data serialization format similar to XML, but more compact and easier for humans to read.

In [51]:
import json

Consider serializing this dictionary:

In [52]:
polygons={'triangle': 3, 'square': 4, 'pentagon': 5, 'hexagon': 6}

In [53]:
js=json.dumps(polygons)
js

'{"triangle": 3, "square": 4, "pentagon": 5, "hexagon": 6}'

This is a string that can be easily read by humans and also easily converted into a python dictionary.

In [54]:
poly=json.loads(js)
poly

{'triangle': 3, 'square': 4, 'pentagon': 5, 'hexagon': 6}

There are extra arguments to beautify the string, for example:

In [55]:
print(json.dumps(polygons, sort_keys=True, indent=4))

{
    "hexagon": 6,
    "pentagon": 5,
    "square": 4,
    "triangle": 3
}


Similar to `json.dumps` and `json.loads`, there are functions to write and read JSON content directly to and from files. The functions `json.dump(obj, fp, ...)` and `json.load(fp, ...)` work on file-like objects, which have to support `write()` and `read()` respectively, like normal text file objects.
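A minimal sketch of a round trip through a file using `json.dump` and `json.load`. The file name `polygons.json` is hypothetical, and a temporary directory is used so that nothing is left on disk.

```python
import json
import os
import tempfile

polygons = {'triangle': 3, 'square': 4, 'pentagon': 5, 'hexagon': 6}

with tempfile.TemporaryDirectory() as workdir:
    filename = os.path.join(workdir, "polygons.json")  # hypothetical name

    # Write the dictionary to a text file
    with open(filename, "w") as wf:
        json.dump(polygons, wf, sort_keys=True, indent=4)

    # Read it back into a new dictionary
    with open(filename) as rf:
        recovered = json.load(rf)

print(recovered == polygons)
```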

# subprocess

The subprocess module allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes. This module intends to replace several older modules and functions like `os.system`. 

The underlying process creation and management in this module is handled by the Popen class. It offers a lot of flexibility so that developers are able to handle the less common cases not covered by the convenience functions.

In [56]:
import subprocess

In [57]:
sp= subprocess.Popen(["ls","-lha","/"], stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)


In [58]:
sp.wait()

0

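The `Popen.communicate()` call sends input to the process (if any) and reads its output until it finishes. `stdout` is the process output; `stderr` contains the error messages, if there are any. If you only want to wait for the program to finish you can call `Popen.wait()`, but when pipes are used `communicate()` is the safer choice, because `wait()` can deadlock if the child process fills a pipe buffer.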

In [59]:
stout, sterr = sp.communicate()

In [60]:
print(stout)

total 182
drwxr-xr-x   61 root        wheel   1.9K Sep 11 05:48 .
drwxr-xr-x   61 root        wheel   1.9K Sep 11 05:48 ..
-rw-rw-r--    1 root        admin    10K Sep 13 18:44 .DS_Store
d--x--x--x    9 root        wheel   288B Sep 11 05:50 .DocumentRevisions-V100
dr-xr-xr-t@   2 root        wheel    64B Jul 15  2017 .HFS+ Private Directory Data
-rw-r--r--    1 root        wheel   992B Sep 11 05:48 .OSInstallerMessages
drwxr-xr-x    2 root        admin    64B Oct  9  2017 .PKInstallSandboxManager
drwx------    2 root        admin    64B Sep 13 08:51 .PKInstallSandboxManager-SystemSoftware
drwx------    5 root        wheel   160B Jul 15  2017 .Spotlight-V100
d-wx-wx-wt    2 root        wheel    64B Oct  9  2017 .Trashes
-rw-r--r--@   1 aldoromero  staff   312B Oct 28  2012 .apdisk
-rw-r--r--@   1 aldoromero  staff     0B Oct 22  2013 .com.apple.timemachine.donotpresent
srwxrwxrwx    1 root        wheel     0B Jul 28 08:51 .dbfseventsd
----------    1 root        admin     0B Aug 17  201

The `subprocess` module has received several important changes in recent versions of Python 3. Prior to version 3.5, the high-level functions were `subprocess.call()`, `subprocess.check_call()` and `subprocess.check_output()`. Since version 3.5, `subprocess.run()` covers all of that functionality and is the recommended interface.
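A minimal sketch of `subprocess.run()`. To keep it portable, the child process here is the Python interpreter itself (`sys.executable`) rather than a Unix command; the command being run is only an illustration.

```python
import subprocess
import sys

# Run a child Python process and capture its output as text
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a child process')"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    universal_newlines=True,   # decode bytes to str (works on 3.5+)
    check=True,                # raise CalledProcessError on failure
)

print(result.returncode)
print(result.stdout)
```

`run()` waits for the command to finish and returns a `CompletedProcess` object that carries the return code and the captured output.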

# multiprocessing

Up to now we have been dealing with serial processes, but most computers today have several cores that allow us to do multiprocessing. Multiprocessing refers to the ability of a system to use more than one processor at the same time. Applications in a multiprocessing system are broken into smaller routines that run independently and, in most cases, communicate with each other very infrequently. A simple way to picture this is to imagine 4 different drivers who try to go from point A to point B. Each driver can take their own path, but in the end they all meet at point B. Python offers different methods for this, in which the operating system assigns the processes or threads to the processors, improving the performance of the system.

`multiprocessing` is a package that supports spawning processes using an API similar to the `threading` module. The `multiprocessing` package effectively side-steps the Global Interpreter Lock by using subprocesses instead of threads. Because of this, it allows the programmer to fully leverage multiple processors on a given machine.

For intensive numerical calculations, `multiprocessing` should be preferred over `threading`, a similar module that spawns threads instead of processes.

The frequently used class `Pool` offers a simple way to spawn multiple workers that apply the same function over an iterable, dividing the workload among them. The prototypical example looks like this:

In [61]:
import multiprocessing

In [62]:
import math

def f(x):
    return math.cos(math.pi/x)

if __name__ == '__main__':
    with multiprocessing.Pool(multiprocessing.cpu_count()) as p:
        print(p.map(f, range(1,100)))

[-1.0, 6.123233995736766e-17, 0.5000000000000001, 0.7071067811865476, 0.8090169943749475, 0.8660254037844387, 0.9009688679024191, 0.9238795325112867, 0.9396926207859084, 0.9510565162951535, 0.9594929736144974, 0.9659258262890683, 0.970941817426052, 0.9749279121818236, 0.9781476007338057, 0.9807852804032304, 0.9829730996839018, 0.984807753012208, 0.9863613034027223, 0.9876883405951378, 0.9888308262251285, 0.9898214418809327, 0.9906859460363308, 0.9914448613738104, 0.9921147013144779, 0.992708874098054, 0.993238357741943, 0.9937122098932426, 0.9941379571543596, 0.9945218953682733, 0.9948693233918952, 0.9951847266721969, 0.9954719225730846, 0.9957341762950345, 0.9959742939952391, 0.9961946980917455, 0.9963974885425265, 0.9965844930066698, 0.99675730813421, 0.996917333733128, 0.9970658011837404, 0.9972037971811801, 0.9973322836635516, 0.9974521146102535, 0.9975640502598242, 0.9976687691905392, 0.9977668786231532, 0.9978589232386035, 0.9979453927503363, 0.9980267284282716, 0.998103328737044

In [63]:
multiprocessing.cpu_count()

8

This function returns the number of cores on the system, which can be different from the number of cores available to the Python process. The recommended way to get the latter is `len(os.sched_getaffinity(0))`, but that function is absent on some platforms, in particular macOS, Windows and some old Linux distributions.
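One possible way to handle both situations is sketched below. The helper name `available_cores` is an assumption for this example: it uses `os.sched_getaffinity` where it exists and falls back to `os.cpu_count()` elsewhere.

```python
import os

def available_cores():
    """Number of cores usable by this process, with a portable fallback."""
    if hasattr(os, "sched_getaffinity"):
        # Set of CPUs the current process is allowed to run on
        return len(os.sched_getaffinity(0))
    # os.cpu_count() may return None if the count cannot be determined
    return os.cpu_count() or 1

print(available_cores())
```

On a shared cluster node the two numbers can differ, because a job scheduler may restrict a process to a subset of the cores.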

# Final Remarks

The Python Standard Library is extensive, and its API is more prone to change than the language itself. In real projects it is better to decide which is the oldest version of Python that will be supported and to keep compatibility with it until that baseline is moved to a more recent version. Most Linux distributions today include Python 3.5 or newer.