<a href="https://colab.research.google.com/github/worldbank/dec-python-course/blob/session3/1-foundations/3-numpy-and-pandas/foundations-s3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python Libraries
In Python, a package is a collection of modules, and a library is a collection of packages. In practice, "Python library" and "Python package" are used interchangeably to refer to a reusable chunk of code. Using libraries lets us "stand on the shoulders of giants".

## Examples of Python libraries
- [NumPy](https://numpy.org/) stands for Numerical Python. It is the fundamental Python package for scientific computing.
- [pandas](https://pandas.pydata.org/) is a Python package for fast and efficient processing of tabular data, time series, matrix data, etc.
- [Matplotlib](https://matplotlib.org/)  is a comprehensive library for creating data visualizations in Python. 

## How do I get some?

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

print(np.pi)

3.141592653589793


`as` is optional; it is usually used to alias the library name to a shorthand or for disambiguation. The above are the conventional aliases for these libraries. If you `import numpy` without aliasing, just be sure to use `numpy` instead of `np` when calling the library's functions later.
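
As a small sketch of what the alias does (not part of the original notebook), the import styles below all reach the same object. The `from ... import ...` form is a further standard Python option that is not discussed above.

```python
import numpy            # no alias: use the full name
import numpy as np      # conventional alias
from numpy import pi    # import a single name directly

# The alias is just another name for the same module object
same_module = numpy is np
print(same_module)

# All three spellings refer to the same constant
print(numpy.pi == np.pi == pi)
```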

On Google Colab, common libraries come pre-installed, so you can `import` them just as you would a built-in Python module (e.g. `import math`). To see which libraries are pre-installed and their versions:

In [None]:
# Note: in Google Colab, a leading ! runs the rest of the line as a shell command
# `| more` pages the output so that only the first screenful is shown
!pip freeze | more

absl-py==1.1.0
alabaster==0.7.12
albumentations==0.1.12
altair==4.2.0
appdirs==1.4.4
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
arviz==0.12.1
astor==0.8.1
astropy==4.3.1
astunparse==1.6.3
atari-py==0.2.9
atomicwrites==1.4.0
attrs==21.4.0
audioread==2.1.9
autograd==1.4
Babel==2.10.2
backcall==0.2.0
beautifulsoup4==4.6.3
bleach==5.0.0
blis==0.7.7
bokeh==2.3.3
branca==0.5.0

To check if a library you want to use is already installed:

In [None]:
!pip freeze | grep pandas

pandas==1.3.5
pandas-datareader==0.9.0
pandas-gbq==0.13.3
pandas-profiling==1.4.1
sklearn-pandas==1.8.0
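
The same check can also be done from Python rather than from the shell. The sketch below assumes Python 3.8 or newer, where the standard library provides `importlib.metadata`; `installed_version` is a hypothetical helper name introduced here for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None

print(installed_version("numpy"))
# A made-up name, used only to show the "not installed" case
print(installed_version("surely-not-a-real-package-name-xyz"))
```

Unlike `pip freeze | grep`, this looks up one exact package name, so it will not also list packages whose names merely contain the search term.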


[pip](https://pip.pypa.io/) is the de facto Python package manager. You can use it to view the currently installed packages, to install a new library, or to upgrade an existing library:

In [None]:
# install/upgrade to the latest stable version of a package
!pip install pandas --upgrade 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# install a specific version of a package
!pip install pandas==1.3.5

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Exercises
The following exercises are meant to be completed step by step. Upon successful completion of all steps, the last cell should execute without any `AssertionError`.
1. Run a command to find the currently installed version of `matplotlib`:

In [None]:
# hint: use ! and bash command `pip freeze` and pipe the results to bash command `grep matplotlib`

2. Install version 3.5.2 of `matplotlib` for this notebook. Upon successful installation, click on "RESTART RUNTIME" in the code cell output.

In [None]:
# hint: use ! and bash command `pip install`

3. Import `matplotlib`'s `pyplot` module and alias it as `plt`

In [None]:
# hint: use `import ... as ...`

# === Do not modify code below ===
fig = plt.figure(figsize=(10, 80))
assert hasattr(fig, "subfigures"),\
 "If correct version of matplotlib were installed you should not see this message"

print("Well done!")

AssertionError: ignored

<Figure size 720x5760 with 0 Axes>

If you have successfully completed the above exercises and are feeling adventurous, head over to `foundations-s3-bonus.ipynb` – it contains some bonus content and corresponding exercises on this topic.

# NumPy

NumPy (**Numerical Python**) is an open source Python library that’s used in almost every field of science and engineering. It’s the **universal standard** for working with numerical data in Python, and it’s at the core of the scientific Python and PyData ecosystems. It serves as the foundation for popular data science and scientific Python packages such as pandas, SciPy, Matplotlib and scikit-learn.



## ndarray
The NumPy library provides **powerful functionality** for numerical operations and does so **efficiently**. Its basic building block is `ndarray`, a homogeneous n-dimensional array object, with methods to efficiently operate on it. `ndarray` is a simple and flexible data structure that can represent vectors (1-D arrays), matrices (2-D arrays), and tensors (3-D or higher dimensional arrays).




### Wait, what's an array?

In computer science, an array is a data structure consisting of a collection of elements, each identified by an index. 

### How does a NumPy `ndarray` compare to a Python `list` or `array`?
In Python, arrays are most often represented using `list`, which allows different data types within a single list. The Python standard library also comes with an `array` module, whose arrays can hold elements of only one data type. Python's `array` is more compact (takes less memory) than `list`, but it provides sequence operations rather than elementwise arithmetic. NumPy's `ndarray` is like Python's `array` on steroids, functionality-wise.

In [2]:
from array import array
import numpy as np

py_list = [1, 2, '3']
py_array = array('i', [1, 2, 3]) # i indicates integer
np_array = np.array([1, 2, 3])
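
To make the comparison concrete, here is a minimal sketch (an addition, reusing the variable names from the cell above) of how the three types respond to the same operation and to mixed input.

```python
from array import array
import numpy as np

py_list = [1, 2, 3]
py_array = array('i', [1, 2, 3])
np_array = np.array([1, 2, 3])

# `*` repeats a list or an array.array, but multiplies an ndarray elementwise
print(py_list * 2)    # [1, 2, 3, 1, 2, 3]
print(py_array * 2)   # array('i', [1, 2, 3, 1, 2, 3])
print(np_array * 2)   # [2 4 6]

# array.array enforces its type code and rejects a string
try:
    array('i', [1, 2, '3'])
except TypeError as err:
    print("TypeError:", err)

# np.array instead converts mixed input to one common dtype
mixed = np.array([1, 2, '3'])
print(mixed.dtype.kind)   # 'U' means a unicode string dtype
```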

### How to create a NumPy `ndarray`?

In [20]:
# give me 3 zeros
np.zeros(3)

array([0., 0., 0.])

In [21]:
# give me 4 ones
np.ones(4)

array([1., 1., 1., 1.])

In [64]:
# give me 5 numbers randomly sampled from [0.0, 1.0)
np.random.random(5)

array([0.65698592, 0.66042937, 0.9694192 , 0.35748349, 0.30430441])

In [28]:
# give me 4 consecutive numbers
# starts at 0 by default, stop is excluded
one_d_array = np.arange(4)
one_d_array

array([0, 1, 2, 3])

In [23]:
# start, stop, step
np.arange(1, 9, 2)

array([1, 3, 5, 7])

In [4]:
# 2-D array / matrix
two_d_array = np.array([[1, 2], [3, 4]])
two_d_array

array([[1, 2],
       [3, 4]])

In [16]:
# 3-D array
three_d_array = np.array([[[1.0, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 10], [11, 12]]])
three_d_array

array([[[ 1.,  2.],
        [ 3.,  4.]],

       [[ 5.,  6.],
        [ 7.,  8.]],

       [[ 9., 10.],
        [11., 12.]]])
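
A few more creation helpers are worth knowing. The sketch below is an addition to the notebook and uses only standard NumPy calls; the seed value `42` is arbitrary.

```python
import numpy as np

# 5 evenly spaced values from 0 to 1, both endpoints included
evenly_spaced = np.linspace(0, 1, 5)
print(evenly_spaced)            # [0.   0.25 0.5  0.75 1.  ]

# request integer zeros instead of the default float64
int_zeros = np.zeros(3, dtype=int)
print(int_zeros)                # [0 0 0]

# a 2x3 array filled with a constant
sevens = np.full((2, 3), 7)
print(sevens.shape)             # (2, 3)

# reproducible random numbers: two generators with the same seed agree
sample_a = np.random.default_rng(42).random(5)
sample_b = np.random.default_rng(42).random(5)
print(np.array_equal(sample_a, sample_b))   # True
```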

### How to find properties of a `ndarray`?

In [32]:
# dimension
two_d_array.ndim

2

In [33]:
three_d_array.ndim

3

In [12]:
# shape
two_d_array.shape

(2, 2)

In [6]:
three_d_array.shape

(3, 2, 2)

In [11]:
# how many elements total?
three_d_array.size

12

In [17]:
# what's the data type of the elements?
three_d_array.dtype

dtype('float64')

In [34]:
print(f'''Array dimension: {one_d_array.ndim},
      shape: {one_d_array.shape}, 
      size: {one_d_array.size},
      data type: {one_d_array.dtype}''')

Array dimension: 1,
      shape: (4,), 
      size: 4,
      data type: int64
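
These properties are related to one another. As a quick sketch (an addition, assuming Python 3.8+ for `math.prod`), everything can be derived from `shape` and `dtype`:

```python
import math
import numpy as np

three_d = np.zeros((3, 2, 2))

# ndim is the length of the shape tuple
print(three_d.ndim == len(three_d.shape))                   # True
# size is the product of the entries of the shape tuple
print(three_d.size == math.prod(three_d.shape))             # True
# total bytes = number of elements * bytes per element
print(three_d.nbytes == three_d.size * three_d.itemsize)    # True
```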


## Manipulating and operating on `ndarray`
With a basic understanding of `ndarray`, let's go ahead and do some maths with it.

### How to do maths and (even) linear algebra with ndarrays?

In [35]:
one_d_array.min()

0

In [36]:
one_d_array.max()

3

In [37]:
one_d_array.sum()

6

In [38]:
# broadcasting
one_d_array * 2

array([0, 2, 4, 6])

In [39]:
# arithmetic operations
one_d_array + one_d_array * 2

array([0, 3, 6, 9])

In [40]:
one_d_array - two_d_array

ValueError: ignored
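
The subtraction fails because the shapes `(4,)` and `(2, 2)` are not compatible under NumPy's broadcasting rules: shapes are compared from the last dimension backwards, and each pair of dimensions must either be equal or contain a 1. A minimal sketch of a failing and two working cases (the `row` and `column` arrays are made up for illustration):

```python
import numpy as np

one_d_array = np.arange(4)                # shape (4,)
two_d_array = np.array([[1, 2], [3, 4]])  # shape (2, 2)

# (4,) vs (2, 2): the last dimensions 4 and 2 differ and neither is 1
try:
    one_d_array - two_d_array
except ValueError as err:
    print("ValueError:", err)

# (2,) vs (2, 2): the last dimensions match, so the row is reused for every row
row = np.array([10, 20])
print(two_d_array + row)
# [[11 22]
#  [13 24]]

# (2, 1) vs (2, 2): a dimension of size 1 is stretched across the columns
column = np.array([[10], [20]])
print(two_d_array + column)
# [[11 12]
#  [23 24]]
```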

### How to reshape a `ndarray`?
Use the `reshape` method with the new shape as arguments. Note that the new shape must hold the same number of elements as the original array.

<img src="https://numpy.org/doc/stable/_images/np_reshape.png" />

Image credit: https://numpy.org/doc/stable/user/absolute_beginners.html#transposing-and-reshaping-a-matrix

In [58]:
one_d_array.reshape(2, 2) - two_d_array

array([[-1, -1],
       [-1, -1]])

In [42]:
one_d_array.reshape(1, 4) - two_d_array.reshape(1, 4)

array([[-1, -1, -1, -1]])
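
Two details of `reshape` are easy to miss, sketched here as an addition: one dimension may be given as `-1` so that NumPy infers it, and a shape with the wrong number of elements raises a `ValueError`.

```python
import numpy as np

six = np.arange(6)

# -1 asks NumPy to infer that dimension from the array size
inferred = six.reshape(2, -1)
print(inferred.shape)    # (2, 3)

# 6 elements cannot fill a 4 x 2 grid, so this raises ValueError
try:
    six.reshape(4, 2)
except ValueError as err:
    print("ValueError:", err)
```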

To transpose:

In [47]:
one_d_array.transpose()

array([0, 1, 2, 3])

In [45]:
# or use short T attribute on the array
one_d_array.T

array([0, 1, 2, 3])

In [46]:
one_d_array.T - two_d_array.reshape(1, 4)

array([[-1, -1, -1, -1]])
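
As the outputs above show, transposing a 1-D array returns it unchanged, because there is only one axis to permute. The following sketch (an addition) contrasts this with the 2-D case and shows one way to obtain a column vector.

```python
import numpy as np

one_d_array = np.arange(4)
two_d_array = np.array([[1, 2], [3, 4]])

# one axis only: the transpose has the same shape
print(one_d_array.T.shape)      # (4,)

# two axes: rows and columns are swapped
print(two_d_array.T)
# [[1 3]
#  [2 4]]

# to get a column vector from a 1-D array, reshape it explicitly
column = one_d_array.reshape(-1, 1)
print(column.shape)             # (4, 1)
```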

To flatten any multidimensional array to a 1-d array:

In [48]:
two_d_array.flatten()

array([1, 2, 3, 4])

In [50]:
one_d_array - two_d_array.flatten()

array([-1, -1, -1, -1])
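
`flatten` always returns a copy. NumPy also offers `ravel`, which returns a view of the original data when it can. A short sketch of the difference, added for illustration:

```python
import numpy as np

two_d_array = np.array([[1, 2], [3, 4]])

flat_copy = two_d_array.flatten()   # always a new array
flat_copy[0] = 99
print(two_d_array[0, 0])            # 1, the original is untouched

raveled = two_d_array.ravel()       # a view here, since the data is contiguous
print(np.shares_memory(raveled, two_d_array))   # True
```

Prefer `flatten` when you intend to modify the result independently, and `ravel` when you only need to read it and want to avoid copying.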

### How to stack and split ndarrays?

Multiple arrays can be stacked together along different axes:

In [130]:
another_two_d_array = two_d_array * 2
print(two_d_array)
print(another_two_d_array)

[[1 2]
 [3 4]]
[[2 4]
 [6 8]]


In [None]:
# stack row wise
vstacked = np.vstack((two_d_array, another_two_d_array))
vstacked

array([[1, 2],
       [3, 4],
       [2, 4],
       [6, 8]])

In [None]:
# stack column wise
hstacked = np.hstack((two_d_array, another_two_d_array))
hstacked

array([[1, 2, 2, 4],
       [3, 4, 6, 8]])

An array can be split into multiple arrays, along the specified axes:

In [None]:
vstacked

array([[1, 2],
       [3, 4],
       [2, 4],
       [6, 8]])

In [None]:
# split to 2 arrays row wise
vsplit1, vsplit2 = np.vsplit(vstacked, 2)
print(vsplit1)
print(vsplit2)

[[1 2]
 [3 4]]
[[2 4]
 [6 8]]


In [None]:
# split to 2 arrays column wise
hsplit1, hsplit2 = np.hsplit(vstacked, 2)
print(hsplit1)
print(hsplit2)

[[1]
 [3]
 [2]
 [6]]
[[2]
 [4]
 [4]
 [8]]
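
For 2-D arrays, the `v`/`h` helpers are shorthand for the more general `np.concatenate` and `np.split`, which take an explicit `axis` argument. A minimal sketch, added here and reusing the arrays defined above:

```python
import numpy as np

two_d_array = np.array([[1, 2], [3, 4]])
another_two_d_array = two_d_array * 2
pair = (two_d_array, another_two_d_array)

# axis=0 joins along rows (like vstack), axis=1 along columns (like hstack)
rows_joined = np.concatenate(pair, axis=0)
cols_joined = np.concatenate(pair, axis=1)
print(np.array_equal(rows_joined, np.vstack(pair)))   # True
print(np.array_equal(cols_joined, np.hstack(pair)))   # True

# np.split with axis=0 behaves like vsplit, axis=1 like hsplit
top, bottom = np.split(rows_joined, 2, axis=0)
print(np.array_equal(top, two_d_array))               # True
```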


## Exercises

Task: Calculate the dot product of vector `(1, 2, 3)` with itself.

The following exercises are meant to be completed step by step. Upon successful completion of all steps, the last cell should execute without any `AssertionError`.

1. Represent vector `(1, 2, 3)` as a 1-d NumPy array of shape `(3, )` and store it in variable `a`:

In [61]:
# hint: use np.array or np.arange
# Your code here

# === Do not modify code below ===
assert a.shape == (3, )
assert a[0] == 1
assert a[1] == 2

2. Implement the dot product calculation and store the result in variable `result`:

In [57]:
# hint: use transpose or reshape, then sum or np.sum
# Your code here

# === Do not modify code below ===
assert result == 14


## Universal functions

In NumPy, universal functions (`ufunc`) are functions that operate elementwise on an array, producing an array as output.
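
The term is used loosely in this section. In NumPy's strict sense, a ufunc is an instance of `np.ufunc`, which can be checked directly. The sketch below is an addition and uses only standard NumPy names.

```python
import numpy as np

# np.add and np.greater are ufunc objects
print(isinstance(np.add, np.ufunc))       # True
print(isinstance(np.greater, np.ufunc))   # True

# elementwise: one output element per pair of input elements
print(np.add([1, 2, 3], [10, 20, 30]))    # [11 22 33]

# reductions and linear-algebra helpers are ordinary functions, not ufuncs
print(isinstance(np.sum, np.ufunc))       # False
print(isinstance(np.dot, np.ufunc))       # False
```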

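The dot product from the exercise above can also be computed with NumPy's `dot` function. Strictly speaking, `dot` is a general array function rather than a ufunc, as are most of the commonly used functions listed below; `ceil`, `floor`, `invert`, `maximum` and `minimum` are ufuncs in the strict sense. Here are some commonly used functions: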

[`all`](https://numpy.org/doc/stable/reference/generated/numpy.all.html#numpy.all),
[`any`](https://numpy.org/doc/stable/reference/generated/numpy.any.html#numpy.any),
[`apply_along_axis`](https://numpy.org/doc/stable/reference/generated/numpy.apply_along_axis.html#numpy.apply_along_axis),
[`argmax`](https://numpy.org/doc/stable/reference/generated/numpy.argmax.html#numpy.argmax),
[`argmin`](https://numpy.org/doc/stable/reference/generated/numpy.argmin.html#numpy.argmin),
[`argsort`](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html#numpy.argsort),
[`average`](https://numpy.org/doc/stable/reference/generated/numpy.average.html#numpy.average),
[`bincount`](https://numpy.org/doc/stable/reference/generated/numpy.bincount.html#numpy.bincount),
[`ceil`](https://numpy.org/doc/stable/reference/generated/numpy.ceil.html#numpy.ceil),
[`clip`](https://numpy.org/doc/stable/reference/generated/numpy.clip.html#numpy.clip),
[`conj`](https://numpy.org/doc/stable/reference/generated/numpy.conj.html#numpy.conj),
[`corrcoef`](https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html#numpy.corrcoef),
[`cov`](https://numpy.org/doc/stable/reference/generated/numpy.cov.html#numpy.cov),
[`cross`](https://numpy.org/doc/stable/reference/generated/numpy.cross.html#numpy.cross),
[`cumprod`](https://numpy.org/doc/stable/reference/generated/numpy.cumprod.html#numpy.cumprod),
[`cumsum`](https://numpy.org/doc/stable/reference/generated/numpy.cumsum.html#numpy.cumsum),
[`diff`](https://numpy.org/doc/stable/reference/generated/numpy.diff.html#numpy.diff),
[`floor`](https://numpy.org/doc/stable/reference/generated/numpy.floor.html#numpy.floor),
[`inner`](https://numpy.org/doc/stable/reference/generated/numpy.inner.html#numpy.inner),
[`invert`](https://numpy.org/doc/stable/reference/generated/numpy.invert.html#numpy.invert),
[`lexsort`](https://numpy.org/doc/stable/reference/generated/numpy.lexsort.html#numpy.lexsort),
[`max`](https://docs.python.org/3/library/functions.html#max),
[`maximum`](https://numpy.org/doc/stable/reference/generated/numpy.maximum.html#numpy.maximum),
[`mean`](https://numpy.org/doc/stable/reference/generated/numpy.mean.html#numpy.mean),
[`median`](https://numpy.org/doc/stable/reference/generated/numpy.median.html#numpy.median),
[`min`](https://docs.python.org/3/library/functions.html#min),
[`minimum`](https://numpy.org/doc/stable/reference/generated/numpy.minimum.html#numpy.minimum),
[`nonzero`](https://numpy.org/doc/stable/reference/generated/numpy.nonzero.html#numpy.nonzero),
[`outer`](https://numpy.org/doc/stable/reference/generated/numpy.outer.html#numpy.outer),
[`prod`](https://numpy.org/doc/stable/reference/generated/numpy.prod.html#numpy.prod),
[`round`](https://docs.python.org/3/library/functions.html#round),
[`sort`](https://numpy.org/doc/stable/reference/generated/numpy.sort.html#numpy.sort),
[`std`](https://numpy.org/doc/stable/reference/generated/numpy.std.html#numpy.std),
[`sum`](https://numpy.org/doc/stable/reference/generated/numpy.sum.html#numpy.sum),
[`trace`](https://numpy.org/doc/stable/reference/generated/numpy.trace.html#numpy.trace),
[`transpose`](https://numpy.org/doc/stable/reference/generated/numpy.transpose.html#numpy.transpose),
[`var`](https://numpy.org/doc/stable/reference/generated/numpy.var.html#numpy.var),
[`vdot`](https://numpy.org/doc/stable/reference/generated/numpy.vdot.html#numpy.vdot),
[`vectorize`](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html#numpy.vectorize),
[`where`](https://numpy.org/doc/stable/reference/generated/numpy.where.html#numpy.where)

Most universal functions can operate not only on NumPy arrays (ndarray), but also on Python lists, arrays or any object convertible to a NumPy array:

In [111]:
# cross product of two vectors
np.cross([3, 0, 2], [-1, 4, 2])

array([-8, -8, 12])

More usage examples of universal functions:

In [65]:
# reverse an array
np.flip(one_d_array)

array([3, 2, 1, 0])

In [67]:
# sort an array, default ascending
np.sort(np.random.random(5))

array([0.15732179, 0.32612004, 0.43485142, 0.4977131 , 0.54538816])

In [75]:
# get the index of the max element
one_to_nine = np.arange(1, 10)
np.argmax(one_to_nine)

8

In [77]:
# look up the element by index
one_to_nine[np.argmax(one_to_nine)]

9

In [110]:
# check if any of the elements evaluates to true
np.any([False, 0])

False

In [109]:
# check if all of the elements evaluate to true
np.all([True, 3, np.nan])

True
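
The result above may be surprising: `np.nan` counts as true because it is not zero. To detect missing values, test for them explicitly. A minimal sketch (the `values` array is made up for illustration):

```python
import numpy as np

# NaN is not zero, so it counts as true
print(bool(np.nan))                       # True

values = np.array([1.0, np.nan, 3.0])
# np.isnan flags missing values explicitly
print(np.isnan(values))                   # [False  True False]
print(np.any(np.isnan(values)))           # True
```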

In [80]:
# pair-wise element comparison
np.greater([1, 2, 3], [1, 2, 1])

array([False, False,  True])

In [81]:
# Do any of the pair-wise comparisons evaluate to true?
np.any(np.greater([1, 2, 3], [1, 2, 1]))

True

In [93]:
# select the indexes of elements that evaluate to true
a = np.array([1, 2, 3, 4])
np.where(a % 2) # find odd numbers

(array([0, 2]),)

In [95]:
# keep the elements that evaluate to true, replace the rest with NaN
np.where(a % 2, a, np.nan)

array([ 1., nan,  3., nan])
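
To keep only the matching elements, rather than their indexes or a same-shaped array with placeholders, a boolean mask can be used as an index. This sketch is an addition and reuses the array `a` from above.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

is_odd = a % 2 == 1        # boolean mask, one entry per element
print(is_odd)              # [ True False  True False]
print(a[is_odd])           # [1 3]

# np.where with three arguments keeps the shape and substitutes a fallback
print(np.where(is_odd, a, 0))   # [1 0 3 0]
```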

## Exercises

Task: Calculate the dot product of vector `(1, 2, 3)` with itself **using one of NumPy's universal functions**. Store the result in variable `dot_result`:

In [None]:
# Your code here

# === Do not modify code below ===
assert dot_result == 14