# Day 3: overview of useful Python packages

* `git` and [github.com](https://www.github.com)
* `pre-commit`
* `ruff` and/or `black`
* `mypy`
* unit tests
* `numba`
* `seaborn` and `plotly`

## `git`

# Working with version control systems (git)

## What is a version control system?

![Version Control](https://github.com/martwo/teaching/raw/master/WS_2022_23/advanced_python/figures/version_control.png)

* Keeps track of changes (versions) of files.
* Tracks the history of a file.
* Manages a **repository**, i.e. a collection of files, usually a directory tree.
* Allows several people to collaborate on the same files.
    * Conflict resolution
* Allows branching off from the main development path, i.e. **branches**.
    * Branches can be merged

## What is `git`?

* A version control system developed by Linus Torvalds, the creator of the Linux kernel.
* A decentralized system
    * repositories are stored locally on the user's computer
    * a central (remote) repository might exist, e.g. on github.com

## Cloning a repository

To clone a repository use the `git clone <repo_url>.git <local_dir>` command:

```bash
git clone https://github.com/clagunas/example_project.git example_project
```

This copies (clones) the repository, including its history and branches, from the remote location to the local disk and checks out the default branch (here **master**).

## Working with branches

![Git Branches](https://github.com/martwo/teaching/raw/master/WS_2022_23/advanced_python/figures/git_branches.png)

* A branch represents an independent line of development.
* Always use a new branch when making large code changes, e.g. new feature development or bug fix! 
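
One way to picture this: in git, a branch is essentially a name that points at a commit, and each commit points at its parent. The following is a toy Python model of that idea, not git's actual implementation; `Commit` and `ToyRepo` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Commit:
    """A snapshot with a message and a link to its parent commit."""
    message: str
    parent: Optional["Commit"] = None


class ToyRepo:
    """Hypothetical toy model: a branch is just a name pointing at a commit."""

    def __init__(self) -> None:
        self.branches = {"master": Commit("initial commit")}
        self.current = "master"

    def checkout_new_branch(self, name: str) -> None:
        # The new branch starts at the commit the current branch points to.
        self.branches[name] = self.branches[self.current]
        self.current = name

    def commit(self, message: str) -> None:
        # Only the current branch pointer moves; other branches are untouched.
        self.branches[self.current] = Commit(message, self.branches[self.current])

    def log(self, name: str) -> List[str]:
        # Walk the parent links from the branch tip back to the first commit.
        messages, node = [], self.branches[name]
        while node is not None:
            messages.append(node.message)
            node = node.parent
        return messages


repo = ToyRepo()
repo.checkout_new_branch("my_new_feature")
repo.commit("add feature")
print(repo.log("my_new_feature"))  # ['add feature', 'initial commit']
print(repo.log("master"))          # ['initial commit']
```

The commit made on `my_new_feature` does not appear in the history of `master`, which is what makes a branch an independent line of development.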

### List available branches
To list all local and remote-tracking branches, type:

```bash
git branch -a
```

### Creating a new branch

To work on a new feature or bug fix, one should create a new branch from the *master* branch.

* Clone the repository.
* Check out a new branch via the `git checkout -b <branch>` command.

```bash
git clone https://github.com/clagunas/example_project.git example_project.my_new_feature
cd example_project.my_new_feature/
git checkout -b my_new_feature
```

After creating the new branch it lives in your local repository. In order to copy it to the remote repository one uses the `git push --set-upstream origin <branch>` command:

```bash
git push --set-upstream origin my_new_feature
```

To see which branch you are currently working on, use

```bash
git branch
```
which lists the local branches and marks the current one with an asterisk.

## Committing code changes

After making code changes, record them on the current branch as a commit.

To see the changed files:

```bash
git status
```

To see the changes of a file:

```bash
git diff <file>
```

To stage a changed file for a commit, use the `git add` command:

```bash
git add <file>
```

To commit all staged files:

```bash
git commit -m "My commit description"
```


## Merging two branches

After a bug was fixed or a new feature was implemented the development branch needs to be merged with the *master* branch.

![Merging two branches](https://github.com/martwo/teaching/raw/master/WS_2022_23/advanced_python/figures/git_branch_merge.png)

* Make sure to be in the correct receiving branch, e.g. master:

```bash
git checkout master
```

* Fetch and pull the latest updates of the receiving branch:

```bash
git fetch
git pull --rebase
```
Note: always use `git pull --rebase`, which replays your local commits on top of the fetched ones and avoids unnecessary merge commits!

* Merge the feature branch into the master branch:

```bash
git merge my_new_feature
```

* Delete the feature branch:

```bash
git branch -d my_new_feature
```

* Undo the last commit while keeping its changes in the working directory:

```bash
git reset HEAD^
```

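* Stash uncommitted work before pulling to store it temporarily (`git stash push -m` replaces the deprecated `git stash save`):

```bash
git stash push -m 'some message'
```

* To restore the stashed work (the stash entry is kept; use `git stash pop` to also drop it):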

```bash
git stash apply
```

## Resolving merge conflicts

* Merge conflicts occur when the **same part of the same file** was **changed by both branches**.
* `git merge` will stop before creating the merge commit.
* `git status` will tell where the merge conflicts are.

```
here is some content not affected by the conflict
<<<<<<< HEAD
this is conflicted text from master
=======
this is conflicted text from my_new_feature branch
>>>>>>> my_new_feature
```

* Edit the conflicting files to resolve the conflict.
* Stage the changed files with `git add`.
* Perform a final merge commit with `git commit`.
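
To make the marker format concrete, here is a minimal Python sketch that separates the two sides of a conflict block. It assumes a single, simple conflict like the one shown above; `split_conflict` is a hypothetical helper, not part of git, and real conflicts are normally resolved by hand or with a merge tool.

```python
def split_conflict(text):
    """Return (ours, theirs) line lists from the first conflict block in text."""
    ours, theirs = [], []
    side = None  # None: outside a conflict, otherwise "ours" or "theirs"
    for line in text.splitlines():
        if line.startswith("<<<<<<< "):
            side = "ours"       # lines from the receiving branch follow
        elif line.startswith("=======") and side == "ours":
            side = "theirs"     # lines from the branch being merged follow
        elif line.startswith(">>>>>>> "):
            break               # end of the first conflict block
        elif side == "ours":
            ours.append(line)
        elif side == "theirs":
            theirs.append(line)
    return ours, theirs


conflicted = """here is some content not affected by the conflict
<<<<<<< HEAD
this is conflicted text from master
=======
this is conflicted text from my_new_feature branch
>>>>>>> my_new_feature
"""

ours, theirs = split_conflict(conflicted)
print(ours)
print(theirs)
```

Resolving the conflict means replacing the whole marked block, markers included, with the text that should survive.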

## Pull requests on github.com

* *github.com* provides the functionality to create **pull requests**.
* A pull request **asks for a branch to be pulled** into the master branch via a merge operation.
* Pull requests can be **code-reviewed** before merging.

# `pre-commit`

Git hook scripts are useful for identifying, and often automatically fixing, issues before code changes are committed. `pre-commit` is a framework for managing such hooks.

You can install `pre-commit` via `pip install pre-commit`.

Add a pre-commit configuration file named `.pre-commit-config.yaml` to the root of your repository. Example configuration file:

```yaml
repos:
-   repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
    -   id: check-yaml
    -   id: end-of-file-fixer
    -   id: trailing-whitespace

-   repo: https://github.com/astral-sh/ruff-pre-commit
    # Ruff version.
    rev: v0.7.2
    hooks:
    # Run the linter.
    -   id: ruff
    # Run the formatter.
    -   id: ruff-format

# Optional: black is an alternative to ruff-format; one formatter is usually enough.
-   repo: https://github.com/psf/black
    rev: 22.10.0
    hooks:
    -   id: black
```

Run `pre-commit install` locally to set up the git hook scripts. Now `pre-commit` will run automatically on every git commit!
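
To give a feel for what "automatically fixing issues" means, the following Python sketch mimics the kind of clean-up that hooks such as `trailing-whitespace` and `end-of-file-fixer` perform. It is an illustration only, not the hooks' real implementation; `fix_whitespace` is a hypothetical name.

```python
def fix_whitespace(text):
    """Strip trailing whitespace per line and end the text with one newline."""
    lines = [line.rstrip() for line in text.splitlines()]
    # Drop empty lines at the end so that exactly one newline remains.
    while lines and lines[-1] == "":
        lines.pop()
    return "\n".join(lines) + "\n" if lines else ""


before = "x = 1   \ny = 2\t\n\n\n"
after = fix_whitespace(before)
print(repr(after))  # 'x = 1\ny = 2\n'
```

Applying the function a second time changes nothing, which is the property that lets a fixing hook pass once the file has been cleaned and re-staged.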

# What is numba?

- Numba provides decorators to compile specific Python functions

- Functions are compiled automatically "Just-In-Time" (JIT) into optimised machine code

- Functions are written entirely in Python

# `@jit`: just-in-time compilation

- A Python function that is decorated with `@jit` is compiled “just-in-time”, i.e. on the first call of the function

- Subsequent calls use the already-compiled implementation of the function

- Simple Python/NumPy operations will not benefit significantly from `@jit`

- However, complex NumPy operations and loops can benefit greatly!
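
The "compiled on the first call" behaviour can be observed by timing two consecutive calls. The sketch below is a minimal illustration under a few assumptions: `identity_sum` and `timed` are hypothetical helpers, and the `try`/`except` shim is only there so the code still runs (uncompiled) where Numba is not installed.

```python
import math
import time

try:
    from numba import njit
except ImportError:
    # Assumption: fall back to a no-op decorator so the sketch still runs
    # (without compilation) when Numba is not installed.
    def njit(func):
        return func


@njit
def identity_sum(n):
    # cos^2 + sin^2 equals 1 for every i, so the result is close to n.
    acc = 0.0
    for i in range(n):
        acc += math.cos(i) ** 2 + math.sin(i) ** 2
    return acc


def timed(func, *args):
    """Call func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


n = 100_000
first_result, first_time = timed(identity_sum, n)    # includes compilation
second_result, second_time = timed(identity_sum, n)  # reuses compiled code
print(f"first call:  {first_time:.4f} s")
print(f"second call: {second_time:.4f} s")
```

With Numba installed, the first call is expected to be noticeably slower than the second because it includes the compilation step; the actual timings depend on the machine.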

```python
from numba import jit
import numpy as np

x = np.arange(1.e7)
```

```python
def ident_np(x):
    return np.cos(x) ** 2 + np.sin(x) ** 2

%timeit ident_np(x)
```

```
130 ms ± 1.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

```python
@jit
def ident_np(x):
    return np.cos(x) ** 2 + np.sin(x) ** 2

%timeit ident_np(x)
```

```
76.1 ms ± 185 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

```python
def ident_loops(x):
    r = np.empty_like(x)
    for i in range(len(x)):
        r[i] = np.cos(x[i]) ** 2 + np.sin(x[i]) ** 2
    return r

%timeit ident_loops(x)
```

```
9.39 s ± 39.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

```python
@jit
def ident_loops(x):
    r = np.empty_like(x)
    for i in range(len(x)):
        r[i] = np.cos(x[i]) ** 2 + np.sin(x[i]) ** 2
    return r

%timeit ident_loops(x)
```

```
76.2 ms ± 51.9 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```


# `@jit(nopython=True)`

- `@jit` tries to compile in “no-Python mode”, i.e. it generates code that does not access the Python C API

- In older Numba versions, if the decorated function used types or functions unknown to Numba, `@jit` fell back onto “object mode”. This fall-back was deprecated, and as of Numba 0.59 `@jit` compiles in “no-Python mode” by default

- In “object mode” the optimisations that Numba can provide are limited

- One should therefore always aim for “no-Python mode” by passing `nopython=True` (or using the shorthand `@njit`). Numba then raises an error if the function cannot be compiled in “no-Python mode”

```python
from numba import njit

@njit
def ident_loops(x):
    r = np.empty_like(x)
    for i in range(len(x)):
        r[i] = np.cos(x[i]) ** 2 + np.sin(x[i]) ** 2
    return r

%timeit ident_loops(x)
```

```
77.2 ms ± 989 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```


# `@jit(fastmath=True)`

- `@jit(fastmath=True)` can provide additional performance if you relax the numerical rigour. This relies on fast-math flags of the LLVM compiler, which allow:

    - The execution of potentially unsafe floating-point operations
    - Approximations for arithmetic and mathematical functions
    - Floating-point reassociation

- With reassociation permitted, the accumulation in the for-loop below can be vectorised
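
Why does reassociation count as relaxed rigour? Floating-point addition is not associative, so a compiler that is allowed to regroup a sum may return a slightly different result. The plain-Python sketch below (no Numba involved) shows the effect on three numbers.

```python
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # the order written in the source code
reassociated = a + (b + c)    # an order a compiler may pick under fastmath

print(left_to_right)                  # 0.6000000000000001
print(reassociated)                   # 0.6
print(left_to_right == reassociated)  # False
```

The difference is tiny here, but it is the reason a strict compiler must add the terms of `acc += ...` one after another, while fastmath lets it split the sum into independent partial sums.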

```python
@njit(fastmath=False)
def do_sum_slow(A):
    acc = 0.
    for x in A:
        acc += np.sqrt(x)
    return acc

@njit(fastmath=True)
def do_sum(A):
    acc = 0.
    for x in A:
        acc += np.sqrt(x)
    return acc
```

```python
%timeit do_sum_slow(x)
```

```
9.54 ms ± 19.9 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

```python
%timeit do_sum(x)
```

```
6.31 ms ± 41.7 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```


# `@jit(nopython=True, parallel=True)`

- `@jit(parallel=True)` can parallelize loops and certain supported operations

- Use `numba.prange` in place of `range` to mark a for-loop as parallel. Without `parallel=True`, `prange` behaves like the ordinary `range`
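
Besides loops that write to independent array elements, `prange` loops may also accumulate into a scalar, which Numba treats as a reduction. The sketch below is a minimal example under stated assumptions: `sqrt_sum` is a hypothetical function, and the `try`/`except` shim provides serial stand-ins so the code still runs where Numba is not installed.

```python
import math

try:
    from numba import njit, prange
except ImportError:
    # Assumption: serial stand-ins so the sketch still runs without Numba.
    prange = range

    def njit(**options):
        return lambda func: func


@njit(parallel=True)
def sqrt_sum(n):
    acc = 0.0
    # Each thread sums its share of the iterations; Numba combines the
    # partial sums because `acc += ...` is recognised as a reduction.
    for i in prange(n):
        acc += math.sqrt(i)
    return acc


n = 1_000_000
expected = math.fsum(math.sqrt(i) for i in range(n))
result = sqrt_sum(n)
print(result, expected)
```

Because the partial sums are added in a different order than in a serial loop, the parallel result may differ from the serial one in the last few digits.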

```python
@jit
def ident_loops(x):
    r = np.empty_like(x)
    for i in range(len(x)):
        r[i] = np.cos(x[i]) ** 2 + np.sin(x[i]) ** 2
    return r

%timeit ident_loops(x)
```

```
76.9 ms ± 403 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

```python
from numba import prange

@jit(nopython=True)
def ident_loops(x):
    r = np.empty_like(x)
    for i in prange(len(x)):
        r[i] = np.cos(x[i]) ** 2 + np.sin(x[i]) ** 2
    return r

%timeit ident_loops(x)
```

```
77.1 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

```python
@jit(nopython=True, parallel=True)
def ident_loops(x):
    r = np.empty_like(x)
    for i in prange(len(x)):
        r[i] = np.cos(x[i]) ** 2 + np.sin(x[i]) ** 2
    return r

%timeit ident_loops(x)
```

```
19 ms ± 98.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```


# `@vectorize`: Creating NumPy ufuncs

- Universal functions (ufuncs) are the building blocks of NumPy (and SciPy)
- A ufunc defines how the scalar elements of an array are operated on (e.g. `np.sin`, `np.add`, ...)
- Creating a NumPy ufunc normally requires writing C code, which can be difficult
- The Numba `@vectorize` decorator provides a straightforward way to create a new ufunc from pure Python, producing optimised compiled code similar to `@jit`

```python
from numba import vectorize
from math import sin, cos

@vectorize
def ident_ufunc(x, y):
    return cos(x) ** 2 + sin(y) ** 2

print(ident_ufunc(3, 5))
print(ident_ufunc(3, np.arange(5)))
print(ident_ufunc(np.arange(3), np.arange(5)[:, None]))
```

```
1.899620907863409
[0.98008514 1.68815856 1.80690695 1.         1.55283516]
[[1.         0.29192658 0.17317819]
 [1.70807342 1.         0.88125161]
 [1.82682181 1.11874839 1.        ]
 [1.01991486 0.31184144 0.19309305]
 [1.57275002 0.8646766  0.74592821]]
```
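
The shapes of the three results above follow NumPy's broadcasting rule, which ufuncs created with `@vectorize` obey as well. The following pure-Python sketch of that rule is for illustration only; `broadcast_shape` is a hypothetical helper (NumPy ships its own `np.broadcast_shapes`).

```python
from itertools import zip_longest


def broadcast_shape(shape_a, shape_b):
    """Return the shape that results from broadcasting two array shapes."""
    result = []
    # Compare dimensions from the right; a missing dimension counts as 1.
    for a, b in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if a != b and 1 not in (a, b):
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
        # A dimension of size 1 is stretched to match the other one.
        result.append(b if a == 1 else a)
    return tuple(reversed(result))


print(broadcast_shape((), ()))        # ()
print(broadcast_shape((), (5,)))      # (5,)
print(broadcast_shape((3,), (5, 1)))  # (5, 3)
```

The three calls correspond to the scalar/scalar, scalar/array and row/column inputs of `ident_ufunc`, which is why the last result is a 5 by 3 array.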


# Summary

Numba Pros:

- Easily-produced efficient code
- Functions are entirely defined in Python
- Compilation is automatic and only when needed
- Allows Python loops without sacrificing performance

Numba Cons:

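- Only a supported subset of Python and NumPy can be compiled; unsupported library code (e.g. SciPy functions) cannot be used inside jitted functions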
- Steep learning curve

When to use Numba:

- When the same function is called many times and applies complex operations to large arrays
- Possibly as an alternative to wrapping C/C++ code

How to use Numba:

- Use `@njit` to force no-Python mode
- Feel free to write Python loops
- Replace SciPy-provided mathematical formulae with your own “jitted” versions
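
As a minimal sketch of the last point, the standard normal CDF (which SciPy provides as `scipy.stats.norm.cdf`) can be written with `math.erf`, which Numba supports in no-Python mode. The names `norm_cdf` and `interval_probability` are hypothetical, and the `try`/`except` shim is an assumption that keeps the example runnable without Numba.

```python
import math

try:
    from numba import njit
except ImportError:
    # Assumption: no-op decorator so the sketch still runs without Numba.
    def njit(func):
        return func


@njit
def norm_cdf(x):
    """Standard normal CDF, written with math.erf instead of SciPy."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


@njit
def interval_probability(lo, hi):
    # A jitted function can call another jitted function directly.
    return norm_cdf(hi) - norm_cdf(lo)


print(norm_cdf(0.0))                      # 0.5
print(interval_probability(-1.96, 1.96))  # close to 0.95
```

Keeping such helpers small and scalar makes them easy to reuse inside larger jitted loops.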