# Introduction to Python
## Overview
* Lesson 1 - Oct. 24th (today): Setting up a python environment with Anaconda and using Jupyter notebook
* **Zoom lecture by Prof. Müller** - Oct. 26th (**Thursday**)
* Tutorial 1 - Oct. 27th (Friday): Solution to the first exercise sheet
* Lesson 2 - Oct. 31st (next Tuesday): preprocessing, evaluation, visualisation and random number generators
* Lesson 3 - Nov. 2nd (next Thursday): training neural networks using PyTorch

## What is the purpose of these lessons?
* We expect students to be able to comfortably use Python already - for those who need to catch up, we offer this additional intro course.
* There will be programming exercises, but they are awarded fewer points than the manual exercises.
* The notebooks will be uploaded to moodle. 

## Repetition: What is Python?
* object-oriented, high-level programming language
* Interpreted language --> source code is translated to byte code when it is run, not ahead of time $^0$
    * you can change your code as you go (-> patching)
    * many errors that a compiler would report ahead of time will only be detected once the code is executed
* This extra work at run-time makes Python programs comparatively slow
    * when Python programs do complicated calculations fast, they typically use `C` code internally (-> numpy)

---
$^0$ pre-compiled Python code (byte code) exists as well (`.pyc`)  
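
Both points can be tried out in a few lines. The following is a minimal sketch (the function name `broken` and the name `undefined_name` are made up for illustration): the function definition is accepted without complaint, the error only shows up when the function is called, and the standard-library `dis` module lists the byte code that Python generated for the function body.

```python
import dis


def broken():
    # `undefined_name` does not exist, but Python only notices when this line runs
    return undefined_name + 1


# defining the function worked; the error appears on the call
try:
    broken()
except NameError as err:
    error_message = str(err)

# the function body was compiled to byte code when the definition was executed
opnames = [instr.opname for instr in dis.get_instructions(broken)]

print(error_message)
print(opnames)
```

The exact instruction names printed by `dis` differ between Python versions, so treat them as illustrative only.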

# Python for Data Analytics

## Virtual environments and package managing
Python packages often depend on many other python packages in turn. These relationships can quickly become too complex to maintain by hand. There are many package managers that take this burden off the user. We recommend using the Anaconda tool in the open-source [individual edition](https://www.anaconda.com/products/individual). You can browse available packages on [Anaconda.org](https://anaconda.org/).

### Why do virtual environments matter?
Developing different projects on the same machine often requires installing and keeping different versions of the same package. Virtual environments allow you to isolate those different versions and execute Python programs in an environment that contains the required packages with all their dependencies.

### Creating a virtual environment
To create a new empty environment called *bda_empty*, start up an Anaconda command prompt and enter the following:
```
(base) > conda create --name bda_empty
```

While it is possible to install packages into an existing environment, it is recommended to specify all required packages at creation time $^1$. So let's try this again: In this case, we will install some python packages commonly used for scientific programming: `scikit-learn`, `pandas` and `plotly`, as well as `nbformat` which plotly requires for the 3D scatter plot at the bottom. `ipykernel` is needed to have a Python kernel to use with Jupyter notebooks, more on that later. We also direct Anaconda to look for these packages in the channel `conda-forge`:

```
(base) > conda create -n bda scikit-learn pandas plotly nbformat ipykernel -c conda-forge
``` 

Missing something? The popular scientific computing packages `numpy` and `scipy` will be installed implicitly as dependencies of  `scikit-learn`, so unless we want to specify the version of `numpy` and `scipy`, we don't have to list them in the command.
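
To check which versions actually ended up in the environment, one option is the standard-library `importlib.metadata` module. This is a small sketch; the helper name `installed_version` is our own and not part of any library.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# numpy and scipy should show up even though we never listed them explicitly
for name in ("numpy", "scipy", "scikit-learn"):
    print(f"{name}: {installed_version(name)}")
```

Run this inside the `bda` environment; a value of `None` means the package is missing there.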

---
$^1$ _A common problem is that Anaconda will choose a python version that is too new for some of the packages that one may later attempt to install. When specifying all required packages at environment creation time, Anaconda can determine the appropriate version of python or other implicitly installed packages._

Note how the command prompt changes after creating the new environment:
```
(bda) >
```
This means that the currently active Anaconda environment is `bda`. 

If you open a new Anaconda prompt, it will default to using the `base` environment. To switch to the environment where we have all the packages installed for our experiments, type the following:
```
(base) > conda activate bda
(bda)  > 
```
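
If you are unsure which environment a Python process is running in, you can also ask Python itself. This sketch assumes that the environment was activated with `conda activate`, which sets the environment variable `CONDA_DEFAULT_ENV`; the placeholder text for the missing case is our own.

```python
import os
import sys

# conda sets CONDA_DEFAULT_ENV when an environment is activated
active_env = os.environ.get("CONDA_DEFAULT_ENV", "<no conda environment active>")

print(f"conda environment:  {active_env}")
print(f"interpreter prefix: {sys.prefix}")  # folder of the running Python installation
```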

## Anaconda, but fast?
A final remark on the package manager: If you try to create complex environments with many sub-dependencies, Anaconda may run for a long time (think, an hour) only to find that there are incompatible package requests in your query. For such cases, `mamba` was created: a reimplementation of the conda package manager in C++. It uses `libsolv` for much faster dependency solving.
* there is `mamba` as a drop-in-replacement of `conda`
* and `micromamba` as a drop-in-replacement of `miniconda`
* both have additional features on top of `conda` that may be interesting for advanced use cases.

See [Mamba's documentation](https://mamba.readthedocs.io/en/latest/index.html) and [github](https://github.com/mamba-org/mamba) pages.

## Jupyter Notebook
> The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.
 
(source: [jupyter.org](https://jupyter.org))

The Jupyter Notebook supports many programming languages, including Python. It allows you to run a Python script in segments, contained in code cells, instead of running the whole script at once. Because cells can contain not only program code but also `markdown` (formatted text), Jupyter Notebooks can also be used to document results and explain the code. This document is a Jupyter notebook as well. Its file ending is `.ipynb`.

### Running a Jupyter Notebook
In order to interact with your Jupyter notebook, you need to connect to a jupyter server. To start up your own server, type:
```
(base) > jupyter notebook
```

This will start a web server on your local machine and open a new tab in your browser showing the Jupyter GUI. Navigate to the notebook file you want to open or create a new file. Note: _you may want to navigate to the root folder of your notebooks in the command line before starting the jupyter server._

Another popular choice is the Visual Studio Code IDE, which now comes with built-in support for jupyter notebooks. Follow the steps below to install the `ipykernel` package in your environment if you want to use this tool.

### Handling multiple environments
It is considered best practice to install the `notebook` package into only one environment. This package provides the functionality to run a Jupyter notebook server that lets you display notebooks and run their code.

Most practitioners choose to install the notebook server into their `base` conda environment. The Anaconda base environment usually even comes with the Jupyter server pre-installed $^2$.

Initially, users can only use packages (that includes Python libraries) installed into the same environment as the `notebook` package. In order to be able to use different environments for your notebooks, we'll install the package `nb_conda_kernels` into the same environment:

```
(base) > conda install nb_conda_kernels
```

---
$^2$ _to verify, run the command_ `pip show notebook` _in the environment. This should either return information about the installed version or a warning stating that the package has not been found._

`nb_conda_kernels` automatically detects every environment that has the `ipykernel` package installed and makes it available as a so-called `kernel`. Each kernel has the same name as its environment.


Fortunately, we have already installed `ipykernel` into our development environment `bda`. So, after starting up a jupyter server in the `base` environment (starts a local web server), we should be able to choose the kernel `bda` through the web interface by clicking in the top-right corner of the page (also in VS Code):

```
(base) > jupyter notebook
```

### Executing code
Now that we have a jupyter notebook with a running kernel, let's execute some code! Press `shift`+`enter` keys after focusing on a code cell: 

In [1]:
print("Hello Python!")

Hello Python!


Note the little number in brackets next to the code cell indicating the sequence number. Remember that the execution sequence does still matter, even when the position of the cell inside the notebook does not. Also take care when defining global variables in a notebook!

In [2]:
a = 1

def add_one():
    return a + 1

In [5]:
add_one()

4

Let's change the value of the global variable `a` and execute the above line of code once more:

In [4]:
a = 3

add_one()

4
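
Why did cell `In [5]` return 4 although `a = 1` is written right above it? The sequence numbers tell us that the cell with `a = 3` was executed before it. The function looks up the global `a` at the moment it is called, not when it is defined. A plain-Python sketch of the same effect, run top to bottom:

```python
a = 1


def add_one():
    # `a` is looked up each time the function is called
    return a + 1


first = add_one()   # a is 1 at this point
a = 3
second = add_one()  # a is 3 at this point

print(first, second)  # prints "2 4"
```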

The cells' output is persistent in Jupyter, i.e. if we save the notebook now and open it back up later, all the results will still be visible, even if no code was executed and the context no longer exists.

## Data manipulation in Python 

In [6]:
import numpy as np
import timeit  # not required for the `%%timeit` cell magic, which is built into IPython. Use `%%timeit` at the beginning of a cell to measure its runtime.

def analyze_arr(arr: np.ndarray):
    assert isinstance(arr, np.ndarray), f"Expected a numpy array as input, got {type(arr)} instead."
    print(f"This numpy array has shape {arr.shape} and data type {arr.dtype}.")

Next, we are getting familiar with numpy arrays by looking at different ways to create a one-dimensional array of size 100 with the constant value $-1$. 

We are going to use the `%%timeit` cell magic to get some *qualitative* results on how these methods compare in terms of runtime. Note that as with all good magic, a lot more is happening behind the curtains than the eye can see. For the purpose of this comparison: the `%%timeit` command has to be in the very first line of the cell to be recognized as *cell magic*. It executes the cell's code many times and prints the mean and standard deviation of the run time. While the code in the cell can use packages imported elsewhere in the notebook, any variables assigned in the cell only exist in the context of this cell and cannot be retrieved after its execution has finished. Click [here](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-timeit) to learn more.

In [12]:
%%timeit
# create an uninitialized numpy array with 100 elements and assign the value -1 to each of them. Measure the run time.

_empty = np.empty((100,))
_empty[:] = -1

1.02 µs ± 91 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


In [15]:
analyze_arr(_empty)

In [16]:
%%timeit
# create a numpy array initialized with 100 elements of the value -1. Measure the run time.

_full = np.full((100,), -1)

2.37 µs ± 332 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [17]:
%%timeit
# create a numpy array initialized with 100 ones. Then assign the value -1 to each element. Measure the run time.

_ones = np.ones((100,))
_ones[:] = -1

3.06 µs ± 466 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [18]:
%%timeit
# create a numpy array initialized with 100 ones. Then assign the value by multiplying each element with -1. Measure the run time.

_mult = np.ones((100,))
_mult = _mult * -1

5.22 µs ± 683 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [19]:
%%timeit
# create a numpy array initialized with 100 zeros. Then assign the value -1 to each element. Measure the run time.

_zeros = np.zeros((100,))
_zeros[:] = -1

1.09 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


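Why are two of these approaches much faster than the rest? Because `np.empty` and `np.zeros` are [implemented in C](https://github.com/numpy/numpy/blob/v1.26.0/numpy/core/src/multiarray/multiarraymodule.c)! `np.ones` and `np.full`, in contrast, are Python functions that first create an empty array and then fill it, and the multiplication additionally allocates a second array for its result. The [documentation](https://numpy.org/doc/stable/reference/generated/numpy.ones.html) also includes links to the source code for the respective function.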
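
The `%%timeit` magic only exists in IPython and Jupyter. In a plain Python script, a similar comparison can be sketched with the standard-library `timeit` module. The helper names (`make_empty`, `make_full`, ...) are our own, and the absolute numbers will differ from machine to machine.

```python
import timeit

import numpy as np


def make_empty():
    arr = np.empty((100,))
    arr[:] = -1
    return arr


def make_full():
    return np.full((100,), -1)


def make_ones():
    arr = np.ones((100,))
    arr[:] = -1
    return arr


def make_mult():
    return np.ones((100,)) * -1


def make_zeros():
    arr = np.zeros((100,))
    arr[:] = -1
    return arr


candidates = [make_empty, make_full, make_ones, make_mult, make_zeros]
n_calls = 10_000

for func in candidates:
    # timeit.timeit returns the total time for `number` calls in seconds
    seconds = timeit.timeit(func, number=n_calls)
    result = func()
    print(f"{func.__name__:>10}: {seconds / n_calls * 1e6:.2f} µs per call, dtype {result.dtype}")
```

One detail that the timings hide: `np.full((100,), -1)` infers an integer data type from the fill value, while the other four variants produce `float64` arrays. Pass `dtype=float` if you need identical results.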

### Pandas DataFrames

Now, we are going to look at manipulating a large pandas `DataFrame` element by element. Our dataframe has 100,000 rows.

In [1]:
import numpy as np
import pandas as pd

# initialize a new random number generator instance
rng = np.random.default_rng()

df = pd.DataFrame({
    'rand_a':rng.integers(low=1, high=100, size=100000),
    'rand_b':rng.integers(low=100, high=1000, size=100000),
    'rand_c':rng.integers(low=1000, high=10000, size=100000)
})
df

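| | rand_a | rand_b | rand_c |
|---|---|---|---|
| 0 | 35 | 815 | 4468 |
| 1 | 29 | 521 | 1028 |
| 2 | 11 | 824 | 9358 |
| 3 | 36 | 959 | 6187 |
| 4 | 75 | 456 | 4016 |
| ... | ... | ... | ... |
| 99995 | 79 | 761 | 8817 |
| 99996 | 30 | 552 | 1773 |
| 99997 | 10 | 813 | 6158 |
| 99998 | 29 | 266 | 2864 |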


The `apply` function of a Series loops over every element and applies a scalar function to it. In this case, we define a simple function in place, using the lambda idiom. Our function takes an input `x`, squares it and returns the result. We can record how much time elapses using the `%%timeit` cell magic.

In [4]:
%%timeit

df['rand_a'].apply(lambda x: x**2)


This took quite a while. Can we improve on this? Instead of sequentially operating on individual elements, we can make use of a so-called _vectorized_ operation that is optimized to work on a whole column at a time:

In [22]:
%%timeit

df['rand_a'] ** 2

189 µs ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


We are going to leave this topic with a slightly more complex example. This time, we want to find the difference between the largest and the smallest value in each row. If we call the `apply` function on the whole DataFrame instead of a Series, the parameter `axis=1` makes it loop over the rows, so the argument provided to our lambda function represents one row this time (a Series holding that row's value from every column).

In [23]:
%%timeit 
df.apply(lambda x: x.max() - x.min(), axis=1)

4.97 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


To optimize this, we can use vectorized operations to compute the max and min value of every row, then use another vectorized operation that subtracts the row minima from the row maxima:

In [24]:
%%timeit
# refactor above code to use vectorized operations instead.

df.max(axis=1) - df.min(axis=1)

29.9 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
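
Before trusting a faster variant, it is worth checking that it returns the same values as the slow one. The following sketch does this on a small toy DataFrame (1,000 rows, two columns, seeded generator; all names are chosen for illustration) for both examples from this section.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # seeded so the toy data is reproducible
toy_df = pd.DataFrame({
    "rand_a": rng.integers(low=1, high=100, size=1000),
    "rand_b": rng.integers(low=100, high=1000, size=1000),
})

# element-wise square: apply with a lambda vs. vectorized power operator
squared_apply = toy_df["rand_a"].apply(lambda x: x**2)
squared_vec = toy_df["rand_a"] ** 2

# row-wise range: apply over rows vs. vectorized max/min along axis=1
range_apply = toy_df.apply(lambda row: row.max() - row.min(), axis=1)
range_vec = toy_df.max(axis=1) - toy_df.min(axis=1)

same_squares = bool((squared_apply == squared_vec).all())
same_ranges = bool((range_apply == range_vec).all())
print(same_squares, same_ranges)
```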


### Data loading
Finally, let's get our hands on a dataset to do some experiments on. We're going to need pandas and our dataset, which is stored as a `.csv` file at `data/penguins.csv` relative to this notebook.

In [25]:
penguins_df = pd.read_csv('data/penguins.csv')

We can inspect the first five samples in our dataset:

In [26]:
penguins_df.head(5)

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | FEMALE |
| 3 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | FEMALE |
| 4 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | MALE |


### Slicing

In [27]:
penguins_df[2:5]

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | FEMALE |
| 3 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | FEMALE |
| 4 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | MALE |


Apart from the Python-style indexing, pandas offers its own label-based accessor `loc`, which also lets us specify the column names that we want to retrieve. Note how we get one more row than the Python-style slice returned. The reason is that `loc` includes the endpoint of a slice. In this case, the index is numeric, but it could also contain datetimes or arbitrary objects. `loc` does not consider index positions, but labels, and the slice returns both specified labels and everything in between.

penguins_df.loc[2:5,['sex', 'flipper_length_mm']]
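
The difference between position-based and label-based slicing is easy to reproduce on a toy DataFrame. This is a sketch with made-up data; `iloc` is the explicitly position-based counterpart of `loc`.

```python
import pandas as pd

toy_df = pd.DataFrame({"value": range(10, 20)})  # default integer index 0..9

by_position = toy_df[2:5]    # rows at positions 2, 3 and 4
by_iloc = toy_df.iloc[2:5]   # the same rows, explicitly position-based
by_label = toy_df.loc[2:5]   # labels 2, 3, 4 and 5: the endpoint is included

# with a non-numeric index, label slices still include both endpoints
letters = pd.DataFrame({"value": [1, 2, 3, 4]}, index=list("abcd"))
by_letter = letters.loc["b":"c"]

print(len(by_position), len(by_iloc), len(by_label), len(by_letter))  # prints "3 3 4 2"
```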

A pandas DataFrame is a collection of columns with a shared index. Every column is represented by a Series, which is basically a numpy array with an index. If we access a single column, pandas thus returns a Series object. If we select multiple columns, it returns a DataFrame.
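
What pandas returns depends on how the column is requested, not on how many columns come back. A short sketch with made-up data: a single label yields a Series, while a list of labels yields a DataFrame, even if the list has only one entry.

```python
import pandas as pd

toy_df = pd.DataFrame({"island": ["Biscoe", "Dream"], "body_mass_g": [5000, 3700]})

single = toy_df.loc[:, "island"]       # one column label -> Series
as_frame = toy_df.loc[:, ["island"]]   # list with one label -> DataFrame

print(type(single).__name__, type(as_frame).__name__)  # prints "Series DataFrame"
print(single.index.equals(toy_df.index))               # the column shares the frame's index
```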

In [31]:
penguins_df.loc[:, 'island']

0      Torgersen
1      Torgersen
2      Torgersen
3      Torgersen
4      Torgersen
         ...    
337       Biscoe
338       Biscoe
339       Biscoe
340       Biscoe
341       Biscoe
Name: island, Length: 342, dtype: object

On a Series object you can also access all elements in reverse or just every $n^{th}$ object, as you would be able to do on a numpy array:

In [32]:
penguins_df.loc[:, 'island'][::-1]

341       Biscoe
340       Biscoe
339       Biscoe
338       Biscoe
337       Biscoe
         ...    
4      Torgersen
3      Torgersen
2      Torgersen
1      Torgersen
0      Torgersen
Name: island, Length: 342, dtype: object

In fact, let's access the underlying numpy array and try to retrieve every $5^{th}$ object:

In [33]:
penguins_df.loc[:,'island'].to_numpy()[::5]

array(['Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Biscoe',
       'Biscoe', 'Dream', 'Dream', 'Dream', 'Dream', 'Biscoe', 'Biscoe',
       'Biscoe', 'Biscoe', 'Torgersen', 'Torgersen', 'Torgersen', 'Dream',
       'Dream', 'Dream', 'Biscoe', 'Biscoe', 'Biscoe', 'Torgersen',
       'Torgersen', 'Torgersen', 'Torgersen', 'Dream', 'Dream', 'Dream',
       'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream',
       'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream', 'Dream',
       'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe',
       'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe',
       'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe',
       'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe', 'Biscoe',
       'Biscoe'], dtype=object)

Note how the index has disappeared!
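
The same behaviour on a toy Series (made-up data): slicing a Series keeps each value attached to its index label, whereas the array returned by `to_numpy()` only knows positions.

```python
import pandas as pd

toy_series = pd.Series(["a", "b", "c", "d", "e", "f"], name="letters")

reversed_series = toy_series[::-1]         # labels travel with the values
every_second = toy_series.to_numpy()[::2]  # plain numpy array, no labels left

print(list(reversed_series.index))  # prints "[5, 4, 3, 2, 1, 0]"
print(list(every_second))           # prints "['a', 'c', 'e']"
```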

---
Now for a filtering example: let's find all entries where the recorded bill length exceeds 40 mm.

In [34]:
penguins_df.loc[:, 'bill_length_mm'] > 40.0

0      False
1      False
2       True
3      False
4      False
       ...  
337     True
338     True
339     True
340     True
341     True
Name: bill_length_mm, Length: 342, dtype: bool

Great, that worked! But it is a little cumbersome to check by hand for every entry whether the comparison holds true. Instead, we can index a DataFrame or Series object with a boolean array. Maybe for our analysis, we also only care to know the `species`, `island`, as well as the `bill_length_mm` for every match. So let's retrieve a slice of columns.
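
On a toy DataFrame with made-up values, the two steps look like this: the comparison produces a boolean Series (the mask), and `loc` keeps the rows where the mask is `True` together with a label slice of columns.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "island": ["Torgersen", "Torgersen", "Biscoe"],
    "bill_length_mm": [39.1, 40.3, 47.2],
    "sex": ["MALE", "FEMALE", "FEMALE"],
})

mask = toy_df["bill_length_mm"] > 40.0  # boolean Series, one entry per row

# rows where the mask is True; the column slice includes both endpoints
matches = toy_df.loc[mask, "species":"bill_length_mm"]
print(matches)
```

The column slice `'species':'bill_length_mm'` relies on the order of the columns in the DataFrame. Passing a list of column names instead is more robust if that order may change.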

In [35]:
# index the dataframe with the boolean array and restrict the columns to species, island and bill_length_mm

penguins_df.loc[penguins_df.loc[:, 'bill_length_mm'] > 40.0, 'species':'bill_length_mm']

| | species | island | bill_length_mm |
|---|---|---|---|
| 2 | Adelie | Torgersen | 40.3 |
| 8 | Adelie | Torgersen | 42.0 |
| 11 | Adelie | Torgersen | 41.1 |
| 16 | Adelie | Torgersen | 42.5 |
| 18 | Adelie | Torgersen | 46.0 |
| ... | ... | ... | ... |
| 337 | Gentoo | Biscoe | 47.2 |
| 338 | Gentoo | Biscoe | 46.8 |
| 339 | Gentoo | Biscoe | 50.4 |
| 340 | Gentoo | Biscoe | 45.2 |


### Feature analysis

We want to create a plot at the end of this chapter. We are going to use the `plotly` library. As you can see below, we import the `plotly.express` module. It gives enough flexibility for most simple use cases (see the [plotly.express documentation](https://plotly.com/python/plotly-express/)). The plotly library produces interactive (HTML) plots that can be displayed directly in Jupyter notebooks or opened in a new browser tab. For more advanced control over your plots, you'll need to use the lower-level plotly functionalities. [Plotly Fundamentals](https://plotly.com/python/plotly-fundamentals/) is a good place to start diving deeper.

In [36]:
import plotly.express as px

from sklearn import decomposition

Let's separate the features $X$ from the target data $y$. For this example, we will just be using the numerical feature dimensions.

In [37]:
X = penguins_df.loc[:, ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
y = penguins_df.loc[:,'species']

Note that while the subset of (feature) columns taken from the original pandas DataFrame still forms a DataFrame, the single column that holds the target values ($y$) is now of type Series.

In [38]:
X

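| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| 0 | 39.1 | 18.7 | 181 | 3750 |
| 1 | 39.5 | 17.4 | 186 | 3800 |
| 2 | 40.3 | 18.0 | 195 | 3250 |
| 3 | 36.7 | 19.3 | 193 | 3450 |
| 4 | 39.3 | 20.6 | 190 | 3650 |
| ... | ... | ... | ... | ... |
| 337 | 47.2 | 13.7 | 214 | 4925 |
| 338 | 46.8 | 14.3 | 215 | 4850 |
| 339 | 50.4 | 15.7 | 222 | 5750 |
| 340 | 45.2 | 14.8 | 212 | 5200 |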


We are going to use sklearn's implementation of principal component analysis, see the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).

In [39]:
# project the input data to a lower dimensional representation, keeping 3 principal components.
pca = decomposition.PCA(n_components=3)
pca.fit(X)
principal_comps = pca.transform(X)

In [40]:
principal_comps

array([[-4.52023209e+02,  1.33366364e+01,  1.14798019e+00],
       [-4.01949980e+02,  9.15269401e+00, -9.03734153e-02],
       [-9.51740904e+02, -8.26147557e+00, -2.35184450e+00],
       ...,
       [ 1.54840123e+03,  2.39957316e+00,  9.98962243e-01],
       [ 9.98297511e+02,  4.68961215e+00, -1.56098005e+00],
       [ 1.19830521e+03,  5.57434574e+00,  2.93963322e+00]])

Note that `principal_comps` is no longer a pandas DataFrame, but a two-dimensional numpy array with one row per sample and one column per principal component:

In [42]:
analyze_arr(principal_comps)

This numpy array has shape (342, 3) and data type float64.
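
To get a feeling for what the projection does, here is a minimal numpy-only sketch of PCA: center the data, compute a singular value decomposition and project onto the leading directions. It is not scikit-learn's implementation (which, among other things, may pick different solvers and sign conventions), and the function name `pca_project` and the toy data are our own.

```python
import numpy as np


def pca_project(X, n_components):
    """Project the rows of X onto the first `n_components` principal directions."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA works on mean-centered data
    # the rows of vt are the principal directions, sorted by decreasing singular value
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:n_components].T


rng = np.random.default_rng(seed=0)
# four toy features on very different scales
toy_X = rng.normal(size=(50, 4)) * np.array([100.0, 10.0, 1.0, 0.1])

toy_proj = pca_project(toy_X, n_components=3)
print(toy_proj.shape)        # (50, 3): one row per sample, one column per component
print(toy_proj.var(axis=0))  # variance decreases from the first to the last component
```

In the toy data, the feature with the largest scale dominates the first component. The values in the first column of `principal_comps` above are far larger than those in the other two columns, which would be consistent with `body_mass_g` dominating in the same way, since we did not scale the features before fitting.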


Make a scatter plot for the transformed data. The color of each point reflects its target class. 

In [43]:
# plot the lower dimensional representation as a 3d-scatter plot using the plotly library.

fig = px.scatter_3d(principal_comps, x=0, y=1, z=2, color=y, opacity=0.7)

In [45]:
fig.show()

Now, to wrap up this session: **Create a Python script** `py_intro_1.py` **that produces the above plot** when called from the command line:
```
(bda) > python py_intro_1.py
```