# Python: the Language and Ecosystem


Shane Steinert-Threlkeld

# Roadmap

* **Background**
* Getting Started
* Language Basics
* Best Practices / Ecosystem / Further Resources

# Background

* Python is _interpreted_ in most implementations

```python
:~$ python
>>> print('hello world!')
hello world!
```

* `echo "print('hello world')" > hello.py && python hello.py`

* Can be compiled into bytecode (\*.pyc) and can interface with C
    * See [Cython](https://cython.org/)
    * So: can be a very efficient language (more later on this)
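To make the bytecode step a little more concrete, here is a minimal sketch using the standard-library `dis` module. It assumes the CPython implementation; `greet` is a made-up function for illustration, and the exact instruction names differ between Python versions.

```python
import dis


def greet(name):  # hypothetical example function
    return 'hello ' + name


# The compiled bytecode lives on the function's code object as raw bytes.
raw_bytecode = greet.__code__.co_code

# dis.get_instructions gives a human-readable view of the same bytecode.
opnames = [instruction.opname for instruction in dis.get_instructions(greet)]
print(opnames)
```

Running `dis.dis(greet)` prints the same information as a formatted listing.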

# Background

* Python has a very active user community, a useful package index ([PyPI](http://pypi.org)), and a package manager (`pip`)
    * **N.B.** best used in concert with _virtual environments_ (again: more later)
* Many scientific computing packages:
    * **`numpy`**, `scipy`, **`pandas`**
    * `nltk`
    * `scikit-learn`
    * Lingua franca of deep learning: `tensorflow`, `pytorch`
    * And NLP, e.g. `spacy`, `huggingface`

# Roadmap

* Background
* **Getting Started**
* Language Basics
* Best Practices / Ecosystem / Further Resources

# Installing Python

Global / system-wide installation (more on this later):

* **macOS**: MacPorts / homebrew
    * `port install python37`
    * `brew install python`
* **Linux**
    * `apt-get install python3`
    * `yum install python3`
* **Windows**
    * [http://python.org/downloads/windows](http://python.org/downloads/windows)

# Installing via Anaconda

Alternatively, use [Anaconda](http://anaconda.org) or [miniconda](https://docs.conda.io/projects/miniconda/en/latest/). They come with:
* lots of scientific computing packages
* great command-line tools (`conda`) for managing virtual environments
    * highly encouraged!
    * great for custom/local python installs on `patas`
    * now used by **all 57x courses**
* use `wget` if on a headless machine

# Editing Python

* **PyCharm**
    * Integrated Development Environment (IDE)
    * Professional version free for students
    * [https://www.jetbrains.com/pycharm/](https://www.jetbrains.com/pycharm/)

* `vim`: my old faithful, with packages
    * worth learning `vim` or `emacs` for powerful text editing via keybindings among other things
 
* [VSCode](https://code.visualstudio.com) (with vim keybindings of course :))
    * great plugins / community
    * built-in git support
    * highly extensible

# Editing Python

* Jupyter Notebooks
    * "Literate programming" paradigm
    * Create distributable "notebooks" mixing markdown (incl. LaTeX) inline with code
    * E.g.: _these slides_!
    
Caveat: can encourage bad practices. See [Joel Grus' slides](https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/edit#slide=id.g362da58057_0_1).  In general: I **discourage notebooks** for anything beyond very early and rapid prototyping.

For a notebook that tries to overcome some of these bad practices, see [nbdev](https://nbdev.fast.ai/).

More useful editing tools later in the slides!

# Roadmap

* Background
* Getting Started
* **Language Basics**
* Best Practices / Ecosystem / Resources

# This Tutorial:

[https://github.com/shanest/python-tutorial-clms](https://github.com/shanest/python-tutorial-clms)

NB: will skip over some language stuff, but you can download and run the notebook yourself!

Following the steps in the README on the GH repo will be good practice in working with `conda` and getting an environment set up.

# Basics: Built-In Types

In [1]:
# basics (this is a single-line comment)
an_int = 1
a_float = 1.2
a_bool = True

# Basics: Built-In Types

In [2]:
# strings
string1 = 'CLMS rules!'  # this is a comment
string2 = "Some people prefer double-quotes."
string3 = '''If you use three quotes, 
the string can include
line breaks.'''

"""
This is a 
multi-line comment.
"""

print(string3)

If you use three quotes, 
the string can include
line breaks.


# Gotcha 1: Duck-typing

In [32]:
# We'll return to this later

an_int = 1
a_float = 1.2
a_bool = True
a_string = ""

an_int and a_bool and a_string

''
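Why does the cell above evaluate to `''` rather than `False`? A small sketch of the behaviour: `and` returns the first falsy operand (or the last operand if all are truthy), and it does not convert the result to a `bool`. The variable names mirror the cell above; `result` and `all_truthy` are added here for illustration.

```python
an_int = 1
a_float = 1.2
a_bool = True
a_string = ""

# `and` short-circuits and returns one of its operands unchanged.
result = an_int and a_bool and a_string
print(repr(result))   # the empty string, which is falsy
print(bool(result))   # explicit conversion to a real boolean

# With only truthy operands, the last operand is returned.
all_truthy = an_int and a_bool and "non-empty"
print(repr(all_truthy))
```

So an `if` test "works" on many types, but the value of the expression keeps the type of whichever operand was returned.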

# Basics: Built-in Types

In [4]:
# sequences 
a_list = [1, 2, 3, 1,]  # mutable
a_tuple = (1, 2, 3, 1,)  # immutable
a_set = {1, 2, 3, 1,}  # no duplicates, no order

a_tuple + (3, 4,)
a_tuple

a_list.extend([3,4])
a_list

[1, 2, 3, 1, 3, 4]

In [5]:
a_list[1:3]

[2, 3]
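A short sketch of what the `mutable` / `immutable` / `no duplicates` comments mean in practice. The names `new_tuple` and `tuple_is_immutable` are introduced here for illustration only.

```python
a_list = [1, 2, 3, 1]
a_tuple = (1, 2, 3, 1)
a_set = {1, 2, 3, 1}

# Lists can be changed in place.
a_list[0] = 99

# Tuples cannot: item assignment raises a TypeError.
try:
    a_tuple[0] = 99
    tuple_is_immutable = False
except TypeError:
    tuple_is_immutable = True

# `+` builds a *new* tuple and leaves the original untouched.
new_tuple = a_tuple + (3, 4)

# Sets silently drop duplicates.
print(a_list, a_tuple, new_tuple, a_set)
```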

# Basics: Built-in Types

In [6]:
# dictionaries = hash-tables
a_dict = {'key1': 'value1',
          'key2': 'value2',
          3: 4.4}

a_dict[3]
a_dict['key2']

'value2'
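A few more common dictionary operations, as a minimal sketch. The key `'missing'` and the variables `has_key`, `fallback`, `lookup_failed`, and `pairs` are made up for this example.

```python
a_dict = {'key1': 'value1',
          'key2': 'value2',
          3: 4.4}

# Membership tests look at keys, not values.
has_key = 'key1' in a_dict

# .get returns a default instead of raising when the key is absent.
fallback = a_dict.get('missing', 'default')

# Plain indexing raises a KeyError for an absent key.
try:
    a_dict['missing']
    lookup_failed = False
except KeyError:
    lookup_failed = True

# .items() iterates over (key, value) pairs in insertion order (Python 3.7+).
pairs = list(a_dict.items())
print(pairs)
```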

# Basics: Time Complexity

* Think about data structures and what operations you will be performing often
    * [Time complexity of Python data structures](https://wiki.python.org/moin/TimeComplexity)
* Generally:
    * list and list-likes are good for indexing and insertion at the end; bad for membership tests / search by value (O(n))
    * set + dict are good for look-up and membership (O(1) on average)
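A rough way to see the difference yourself, sketched with `timeit`. The collection size and repetition count are arbitrary choices, and the actual timings depend on your machine, so no numbers are shown here.

```python
import timeit

n = 10_000
as_list = list(range(n))
as_set = set(as_list)
target = n - 1  # worst case for the list: the last element

# Membership in a list scans elements one by one.
list_time = timeit.timeit(lambda: target in as_list, number=200)

# Membership in a set is a hash look-up.
set_time = timeit.timeit(lambda: target in as_set, number=200)

print(f"list: {list_time:.6f}s, set: {set_time:.6f}s")
```

Both tests give the same answer; only the cost differs, and the gap grows with the size of the collection.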

# Basics: Functions

In [7]:
def hello(string):
    output = 'Hello ' + string
    print(output)

In [8]:
hello('world')

Hello world


# Gotcha 2: White Space

In [9]:
def hello(string):
    # whitespace is meaningful!
    output = 'Hello ' + string
    return output
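To see that indentation is part of the syntax, here is a small sketch that compiles two versions of the same function from source strings. The strings and variable names are invented for this example; `compile` is the built-in function.

```python
good_source = "def hello(string):\n    return 'Hello ' + string\n"
bad_source = "def hello(string):\nreturn 'Hello ' + string\n"  # body not indented

# The properly indented version compiles without complaint.
compile(good_source, "<good>", "exec")

# The unindented version is rejected before any code runs.
try:
    compile(bad_source, "<bad>", "exec")
    indentation_ok = True
except IndentationError as err:
    indentation_ok = False
    print(f"IndentationError: {err.msg}")
```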

# Basics: Classes

In [10]:
class Student:
    # class variable
    program = 'CLMS'
    
    def __init__(self, name):
        self.name = name
        
    def set_name(self, new_name):
        self.name = new_name
    
    @classmethod
    def class_method(cls, blah):
        cls.blah = blah
        
    @staticmethod
    def check_name(name):
        return type(name) is str

In [11]:
shane = Student('Shane')
shane.name
shane.set_name('Shania')
shane.name

'Shania'
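A brief sketch of how the class variable, class method, and static method behave, using a trimmed copy of the `Student` class. The second instance `other` and the value `'shared'` are made up for illustration.

```python
class Student:
    # class variable: shared by all instances
    program = 'CLMS'

    def __init__(self, name):
        self.name = name

    @classmethod
    def class_method(cls, blah):
        # receives the class itself, so this sets an attribute on the class
        cls.blah = blah

    @staticmethod
    def check_name(name):
        # receives neither the instance nor the class
        return type(name) is str


shane = Student('Shane')
other = Student('Other')

# Setting an attribute on the class makes it visible from every instance.
Student.class_method('shared')
print(shane.blah, other.blah, shane.program)

# Static methods can be called without creating an instance.
print(Student.check_name('Shane'), Student.check_name(42))
```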

# Basics: Control Flow

In [12]:
a = 1
if a is None:
    a = 1
    print('None no more')
elif a == 3:
    a = 1
else:
    a = None
    
a = 3 if 2 + 2 == 5 else 4
a

4

# Basics: Control Flow

In [13]:
a_list = [1, 2, 3, 1,]  # mutable
a_tuple = (1, 2, 3, 1,)  # immutable
a_set = {1, 2, 3, 1,}  # no duplicates, no order

total = 0
for num in a_set:
    total += num
    
print(total)
print(sum(a_list))

for num in range(5):
    print(num)
    
# comprehensions
added = [num + 1 for num in a_list]
print(added)

{n: n+1 for n in a_list}

6
7
0
1
2
3
4
[2, 3, 4, 2]


{1: 2, 2: 3, 3: 4}
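Comprehensions can also filter, and there is a set version too. A minimal sketch, with variable names chosen for this example, showing a loop next to the equivalent comprehension:

```python
a_list = [1, 2, 3, 1]

# Explicit loop with a condition...
evens_loop = []
for num in a_list:
    if num % 2 == 0:
        evens_loop.append(num)

# ...and the equivalent list comprehension.
evens = [num for num in a_list if num % 2 == 0]

# Set comprehension: duplicates collapse, just as keys do in the dict above.
unique_squares = {num ** 2 for num in a_list}

print(evens, unique_squares)
```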

# Basics: Control Flow

In [14]:
num = 5
while num > 0:
    num -= 1
    print(num)

4
3
2
1
0


# Basics: Files

In [15]:
with open('hello.txt', 'r') as f:  # always open files in a `with`!
    for line in f:
        print(line)

hello

world
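The blank lines in the output above appear because each `line` still ends with its own `'\n'` and `print` adds another. One way to handle that, sketched here with a temporary file so the example is self-contained (the file contents are assumed to match `hello.txt` above):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'hello.txt')

    # Write a small file to read back.
    with open(path, 'w') as f:
        f.write('hello\nworld\n')

    # Strip the trailing newline from each line while reading.
    with open(path, 'r') as f:
        lines = [line.rstrip('\n') for line in f]

print(lines)
```

The `with` block closes the file automatically, even if an exception is raised inside it.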



# Regular Expressions

* Useful for searching / matching patterns in text (e.g. corpora)
* In Python: `re` module
    * collections of methods, class definitions, etc.
    * every file roughly defines a module (though more complicated structures, e.g. packages, also exist)

# Regular Expressions

In [16]:
import re

word = 'raced'
re.search('ed$', word)
re.split('ed$', word)
re.sub('ed$', 'er', word)

'racer'
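Only the last expression in a notebook cell is displayed, so here is a sketch that keeps each result from the cell above. Raw strings (`r'...'`) are used for the patterns, which is the usual convention; the variable names and the word `'racing'` are additions for illustration.

```python
import re

word = 'raced'

match = re.search(r'ed$', word)        # a Match object, or None if nothing matches
parts = re.split(r'ed$', word)         # split the string around the match
replaced = re.sub(r'ed$', 'er', word)  # substitute the match
no_match = re.search(r'ed$', 'racing')

print(match.span(), parts, replaced, no_match)
```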

In [17]:
pattern = re.compile('ed$')
if pattern.search(word):
    print('maybe past')

maybe past


# Regular Expressions

In [18]:
# find digits
string = 'LING 571'
pattern = re.compile('[0-9]')
pattern.search(string)

<re.Match object; span=(5, 6), match='5'>

In [19]:
# find float-like
pattern2 = re.compile('[0-9]\.[0-9]')  # what's wrong with this?
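One possible answer to the question above, as a sketch (the intended answer may differ): the pattern only allows a single digit on each side of the dot, so it captures just part of a longer number. The test string `'12.50'` and the name `better` are made up here. Separately, writing `'\.'` in a non-raw string triggers an invalid-escape warning in recent Python versions, so a raw string is the safer spelling.

```python
import re

text = '12.50'

# Single digits on each side: matches only a fragment of the number.
single = re.compile(r'[0-9]\.[0-9]')
fragment = single.search(text).group()

# One or more digits on each side: matches the whole number.
better = re.compile(r'[0-9]+\.[0-9]+')
whole = better.search(text).group()

print(fragment, whole)
```

This version still ignores signs, exponents, and forms like `.5` or `5.`, so treat it as a starting point.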

# Text Processing

In [20]:
string = 'quick brown fox'
string.split(' ')

['quick', 'brown', 'fox']

In [21]:
string.replace(' ', ', ')

'quick, brown, fox'

In [22]:
'quick' in string

True
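A couple of related string gotchas, as a short sketch. The string `messy` is invented for this example.

```python
messy = 'quick  brown\tfox'  # two spaces, then a tab

# Splitting on a single space keeps empty strings and ignores the tab.
by_space = messy.split(' ')

# Splitting with no argument splits on any run of whitespace.
by_whitespace = messy.split()

# join is the inverse of split.
rejoined = ', '.join(by_whitespace)

# `in` on strings tests for substrings, not whole words.
substring_hit = 'quick' in 'quickly brown fox'

print(by_space, by_whitespace, rejoined, substring_hit)
```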

# Roadmap

* Background
* Getting Started
* Language Basics
* **Best Practices / Ecosystem / Further Resources**

# Type Hinting

In Python 3.5+, you can add type annotations ([https://docs.python.org/3/library/typing.html](https://docs.python.org/3/library/typing.html)):

In [23]:
def hello(string: str) -> str:
    a = string + 2  # type error (str + int): a checker such as mypy flags this before runtime
    return 'Hello ' + string

an_int: int = 2
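Worth knowing: the interpreter stores annotations but does not enforce them, which is why a separate type checker is needed. A minimal sketch (the deliberately wrong assignment and the `raised` flag are for illustration only):

```python
def hello(string: str) -> str:
    return 'Hello ' + string


# Annotations are kept as metadata on the function.
print(hello.__annotations__)

# This runs without error, even though a type checker would complain.
an_int: int = 'two'

# The failure below comes from str + int at runtime, not from the annotation.
try:
    hello(2)
    raised = False
except TypeError:
    raised = True
```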

# Type Hinting

You should _always_ (pretty much) add type hints. Why?
* Readability! Code is for people, not just machines.
* Static analysis:
    * `mypy` or `pyright`: catch errors before runtime
    * Good linter! 
    * Always use one of these in your editor!
* Editor tools:
    * code completion, etc, can use type hints in very helpful ways

# Code Formatting

Writing clean, consistent code will be extremely valuable for you, your peers, colleagues, future self, etc.

But: it can be a PITA.

**Use a code formatter!**

* [black](https://black.readthedocs.io/en/stable/)
* [ruff](https://docs.astral.sh/ruff/)
* [yapf](https://github.com/google/yapf)

# Comments and Docstrings

Write detailed comments and docstrings!

I try to follow [Google's Python Style Guide](https://google.github.io/styleguide/pyguide.html) for this.

In [24]:
def find_token(sentence, token, sep=" "):
    for idx, element in enumerate(sentence.split(sep)):
        if element == token:
            return idx
    raise KeyError(f"Token {token} not found in sentence.")

print(find_token("Hello world my name is Shane", "name"))

3


In [25]:
def find_token(sentence: str, token: str, sep: str = " ") -> int:
    """Checks whether a specified token is found in a provided sentence.

    If so, returns the index in the sentence of the first occurrence of the token.
    If not, raises an error.

    Args:
        sentence: the sentence to search
        token: the token to search for
        sep: a separator by which to split the sentence into tokens

    Returns:
        the index of the first occurrence of `token` in `sentence`, if it exists

    Raises:
        KeyError, if `token` is not found in `sentence`, when split by `sep`
    """
    # split the sentence by the separator, and enumerate through the tokens by index
    for idx, element in enumerate(sentence.split(sep)):
        # return the index if token is found
        if element == token:
            return idx
    # end of sentence reached, token not found, so raise an error
    raise KeyError(f"Token {token} not found in sentence.")

# Unit Tests

Write unit tests!

Python built-in `unittest` module.

My recommendation (widely used): [pytest](https://pytest.org)
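As a rough sketch of what such tests can look like, here are two pytest-style test functions for `find_token` from the previous slides. The test names are made up; the functions use only plain `assert`, so they also run without pytest installed. With pytest, the exception check would more commonly be written with `pytest.raises(KeyError)`.

```python
def find_token(sentence: str, token: str, sep: str = " ") -> int:
    """Returns the index of the first occurrence of `token`, or raises KeyError."""
    for idx, element in enumerate(sentence.split(sep)):
        if element == token:
            return idx
    raise KeyError(f"Token {token} not found in sentence.")


def test_find_token_found():
    assert find_token("Hello world my name is Shane", "name") == 3


def test_find_token_missing():
    try:
        find_token("Hello world", "name")
    except KeyError:
        return  # expected
    raise AssertionError("expected a KeyError")


# pytest would discover and run these automatically; here we call them by hand.
test_find_token_found()
test_find_token_missing()
```

Saved in a file named like `test_find_token.py`, these would be picked up by running `pytest` in the same directory.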

# Virtual Environments

* **Always** use virtual environments, one per project!
    * I recommend `conda`, like this tutorial
    * Also manages non-Python packages / dependencies
    * Compatible with / can use `pip`
* Reproducibility
    * Include `environment.yml` or `requirements.txt` (for `pip`)
    * This repo, as an example
* If you need even more isolation: _containerization_ ([Docker](https://docker.com))
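A quick sanity check for which environment a script is actually running in, sketched with the standard library. It assumes that `conda activate` sets the `CONDA_DEFAULT_ENV` environment variable and that `venv`-style environments make `sys.prefix` differ from `sys.base_prefix`.

```python
import os
import sys

# Path of the interpreter that is running this code.
print(sys.executable)

# For venv-style environments, the two prefixes differ.
in_venv = sys.prefix != sys.base_prefix

# Assumption: conda sets this variable when an environment is activated.
conda_env = os.environ.get('CONDA_DEFAULT_ENV')

print(f"venv active: {in_venv}, conda env: {conda_env}")
```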

# Useful Packages

* Natural Language ToolKit [http://nltk.org](http://nltk.org)
    * Large collection of NLP tools, corpora, algorithms:
        * tokenizers, stemmers
        * parsers
        * semantic analysis
        * corpus fragments
    * Pedagogically oriented: online book (better than docs), examples
    * Heavily used in 571, useful elsewhere
* **[numpy](https://numpy.org)!!**
    * wrapper around very fast C code for numerical computation
    * learn to _vectorize_ numerical code as much as possible
* PyTorch / TensorFlow 

# Writing High-performance Python

* Use new versions when possible! (3.11 has great speed boosts)
* Vectorization with numpy (and TF, PyTorch, ...)
* JIT compilation with [numba](https://numba.org)
* ...

In [26]:
import numpy as np
import timeit

a: np.ndarray = np.arange(100)
b: np.ndarray = np.arange(100)

def add_loop(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    c = []
    for idx in range(a.shape[0]):
        c.append(a[idx] + b[idx])
    return np.array(c)

print(timeit.timeit(lambda: add_loop(a, b), number=1000))

0.019323292013723403


In [27]:
print(timeit.timeit(lambda: a+b, number=1000))

0.00033645896473899484


In [28]:
def sum_loop(arr: np.ndarray) -> float:
    total = 0
    for elt in arr:
        total += elt
    return total
print(timeit.timeit(lambda: sum_loop(a), number=1000))
print(timeit.timeit(lambda: sum(a), number=1000))
print(timeit.timeit(lambda: a.sum(), number=1000))

0.025751749984920025
0.01085312501527369
0.0005847919965162873


# A few notes on numpy

Become friends with `np.ndarray` :)

In [29]:
# shape: [10, 10]
a = a.reshape(10, 10)
# shape: [10]
a[4] # select a "row"
# shape: [10]
a[:, 4] # select a "column"
# shape: [3]
a[4, 2:5] # "slice"

# broadcasting
# shape: [10, 10]
a + 10
a + np.arange(10)
a + np.arange(10)[:, np.newaxis] # column instead of row

# operations along an axis
a.sum()
a.sum(axis=0) # sum over axis 0 (down the rows): one total per column, shape: [10]
a.sum(axis=1) # sum over axis 1 (across the columns): one total per row, shape: [10]

array([ 45, 145, 245, 345, 445, 545, 645, 745, 845, 945])
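Since only the last expression of the cell above is displayed, here is a sketch that keeps the intermediate results so the shapes and axis conventions can be checked. The variable names are introduced for this example.

```python
import numpy as np

a = np.arange(100).reshape(10, 10)  # shape: [10, 10]

row = a[4]          # shape: [10]
column = a[:, 4]    # shape: [10]
piece = a[4, 2:5]   # shape: [3]

# Broadcasting: a [10] vector is added to every row...
plus_row = a + np.arange(10)                 # shape: [10, 10]
# ...while a [10, 1] vector is added to every column.
plus_col = a + np.arange(10)[:, np.newaxis]  # shape: [10, 10]

# axis=0 collapses the rows (one total per column);
# axis=1 collapses the columns (one total per row).
col_totals = a.sum(axis=0)  # shape: [10]
row_totals = a.sum(axis=1)  # shape: [10]

print(col_totals)
print(row_totals)
```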

# Writing High-performance Python: Vectorization

* For numerical computation, use `numpy` as much as possible
    * PyTorch, TensorFlow, Jax: add automatic differentiation
    * Similarities with numpy API (Tensor = ndarray)
* Good rule of thumb: _never_ iterate over an array/tensor
* Vectorize! ([Array programming](https://en.wikipedia.org/wiki/Array_programming))
* Get comfortable with fancy indexing, slicing, reshaping as well
* Annotate arrays/tensors with _shape_ information in comments
* Tutorials:
    * https://numpy.org/doc/stable/user/absolute_beginners.html
    * https://numpy.org/doc/stable/user/basics.html
    * https://www.askpython.com/python-modules/numpy/vectorization-numpy

# Writing High-performance Python: Numba

[numba](https://numba.org): just-in-time (JIT) compilation! For highly parallelizable code, it can help a lot

In [30]:
from numba import jit
import numpy as np
import timeit

# from the numba docs

x = np.arange(100).reshape(10, 10)

def go_fast_slow(a): # Function is uncompiled
    trace = 0.0
    for i in range(a.shape[0]):
        trace += np.tanh(a[i, i])
    return a + trace

go_fast_slow(x)

@jit(nopython=True)
def go_fast(a): # compiled, machine code
    trace = 0.0
    for i in range(a.shape[0]):
        trace += np.tanh(a[i, i])
    return a + trace

go_fast(x) # compiles now, once


array([[  9.,  10.,  11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.],
       [ 19.,  20.,  21.,  22.,  23.,  24.,  25.,  26.,  27.,  28.],
       [ 29.,  30.,  31.,  32.,  33.,  34.,  35.,  36.,  37.,  38.],
       [ 39.,  40.,  41.,  42.,  43.,  44.,  45.,  46.,  47.,  48.],
       [ 49.,  50.,  51.,  52.,  53.,  54.,  55.,  56.,  57.,  58.],
       [ 59.,  60.,  61.,  62.,  63.,  64.,  65.,  66.,  67.,  68.],
       [ 69.,  70.,  71.,  72.,  73.,  74.,  75.,  76.,  77.,  78.],
       [ 79.,  80.,  81.,  82.,  83.,  84.,  85.,  86.,  87.,  88.],
       [ 89.,  90.,  91.,  92.,  93.,  94.,  95.,  96.,  97.,  98.],
       [ 99., 100., 101., 102., 103., 104., 105., 106., 107., 108.]])

In [31]:
print(timeit.timeit(lambda: go_fast_slow(x), number=1000))
print(timeit.timeit(lambda: go_fast(x), number=1000))

0.01677587500307709
0.00042454199865460396


# Writing High-performance Python: New Languages

A couple of new projects that are worth keeping an eye on, but not yet widely adopted:

* [Mojo](https://www.modular.com/mojo): "The expressiveness of Python, with the performance of C", i.e. the holy grail
    * Powerful dev team, allows progressive typing / performance upgrades
    * Still not as ergonomic as they claim it to be
* [Codon](https://docs.exaloop.io/codon): "a high-performance Python compiler that compiles Python code to native machine code without any runtime overhead"
    * Actually a new language that's almost, but not quite, identical to Python
    * Could be more ergonomic than Mojo, but is also at a slightly earlier stage
    * Has a JIT decorator, similar to `numba`, for existing code-bases

# Python Resources

* Books:
    * Lutz and Ascher, _Learning Python_, O'Reilly
    * Martelli, _Python in a Nutshell_, O'Reilly
    * Beazley, _Python Essential Reference_, Developers Library
* Online
    * Mark Pilgrim, _Dive Into Python 3_ [http://www.diveintopython3.net/](http://www.diveintopython3.net/)
        * for experienced programmers
    * [http://python.org](http://python.org)
    * [NLTK book](http://www.nltk.org/book)