# Python for (open) Neuroscience

_Lecture 1.2_ - More on `pandas`

Luigi Petrucco

Jean-Charles Mariani

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vigji/python-cimec/blob/main/lectures/Lecture1.2_More-pandas.ipynb)

## Advanced pandas

### `.rolling()`
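
As a minimal sketch of what `.rolling()` does, the example below computes a moving average over a sliding window of three samples. The series values and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy signal
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Average over a sliding window of 3 consecutive samples.
# The first two entries are NaN because their window is incomplete.
rolling_mean = s.rolling(window=3).mean()
print(rolling_mean)
```

The `window` argument sets how many consecutive samples enter each average.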

### `.groupby()`

`.register()`

- or, the magic of semantic indexing and aggregation

 - not that magical for R people, but hopefully it will give you a good replacement.
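
As a minimal sketch of what "semantic" aggregation looks like, the example below groups a small table by a label column and averages each group. The column names (`subject`, `response`) and the values are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data: two measurements per subject
df = pd.DataFrame(
    dict(
        subject=["a", "a", "b", "b"],
        response=[1.0, 3.0, 10.0, 20.0],
    )
)

# Group rows by the value in the "subject" column, then average each group
mean_response = df.groupby("subject")["response"].mean()
print(mean_response)
```

The result is indexed by the group labels, so `mean_response["a"]` retrieves the average for subject `a` without any bookkeeping of row positions.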

### A problem for arrays

In arrays, we cannot work with "semantic" axes (i.e., we will always have to remember what our axes were).

Also, we always need to work with rectangular arrays (the same number of values along every axis).

This can be a pain for real-world (_i.e._, inhomogeneous) data!

## `pandas` 🐼 can help us here

In [1]:
import pandas as pd

### `pd.DataFrame` and `pd.Series`

`pd.DataFrame` and `pd.Series` are to `pandas` what `np.ndarray` is to `numpy`!

### `pd.DataFrame`

2D data structure with labelled **columns** and indexed **rows**

Dataframes are a great way of storing several variables for the same set of elements!

Index dataframe over columns:

Index data over rows:

Select over both rows and columns using `.loc`:
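
The three indexing modes listed above can be sketched as follows. The table content and its labels (`mouse_a`, `age`, `weight`, ...) are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical toy table: rows are indexed by animal name
df = pd.DataFrame(
    dict(age=[3, 5, 2], weight=[20.5, 31.0, 18.2]),
    index=["mouse_a", "mouse_b", "mouse_c"],
)

ages = df["age"]                         # index over columns (returns a pd.Series)
row_b = df.loc["mouse_b"]                # index over rows by label
first_row = df.iloc[0]                   # index over rows by position
weight_c = df.loc["mouse_c", "weight"]   # rows and columns together with .loc

print(ages, row_b, first_row, weight_c, sep="\n\n")
```

Note that `.loc` works with labels, while `.iloc` works with integer positions.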

### `pd.Series`

A `pd.Series` is a 1-dimensional data structure; each column of a `pd.DataFrame` is a `pd.Series`.
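
A short sketch of the relationship between the two structures, using made-up values: selecting one column of a dataframe returns a `pd.Series`, and a `pd.Series` can also be created on its own with custom index labels.

```python
import pandas as pd

df = pd.DataFrame(dict(idx=[0, 1], name=["foo", "bar"]))

# Selecting a single column gives back a pd.Series
name_column = df["name"]
print(type(name_column))

# A standalone series with custom (hypothetical) index labels
standalone = pd.Series([0.1, 0.2, 0.3], index=["a", "b", "c"])
print(standalone["b"])
```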

### Create `pd.DataFrame`s

We can create a dataframe from a dictionary of arrays (lists), or from a list of dictionaries:

In [2]:
dict_list = []

for i, name in enumerate(["foo", "bar"]):
    dict_list.append(dict(idx=i, name=name))


pd.DataFrame(dict_list)

|   | idx | name |
|---|-----|------|
| 0 | 0   | foo  |
| 1 | 1   | bar  |


In [3]:
array_dict = dict(idx=[0, 1], name=["foo", "bar"])

pd.DataFrame(array_dict)

|   | idx | name |
|---|-----|------|
| 0 | 0   | foo  |
| 1 | 1   | bar  |


### `np.mean()` / `np.nanmean()`

Calculate the average of an array, either global or along some axis:

In [14]:
import numpy as np
arr = np.random.randint(0, 255, (5, 6))
print(f"{arr};\nmean: {np.mean(arr)}")

[[ 92  78 188  64  82 135]
 [156 185  38 113 192 200]
 [122 200 142  40  88  47]
 [  4 180 144 166 174 151]
 [212 135  22  53 202 152]];
mean: 125.23333333333333


If we want to take the average along a specific dimension, we can pass the axis as a parameter:

In [16]:
import numpy as np
arr = np.random.randint(0, 255, (5, 6))
arr_mean = np.mean(arr, axis=0)  # we specify one axis
print(f"{arr};\nmean: {arr_mean}")

[[ 34 169 118  23 227  61]
 [229  60 127  68 205   5]
 [148  63 102 164 160 144]
 [112 152  66 155 144 119]
 [107 187 209 224  66  86]];
mean: [126.  126.2 124.4 126.8 160.4  83. ]


If there are nan values around, we have to use `np.nanmean()`:

In [20]:
import numpy as np

# we need a float dtype to use nan values:
arr = np.random.randint(0, 255, (5, 6)).astype(float) 

arr[0, 0] = np.nan
arr_mean = np.mean(arr)  # regular mean
arr_nan_mean = np.nanmean(arr)  # use nanmean
print(f"{arr};\nregular mean: {arr_mean}\nnanmean: {arr_nan_mean}")

[[ nan  41.  30. 156. 110.  27.]
 [142.  32.  42.  40.  54. 139.]
 [ 75. 102. 154. 167. 207. 104.]
 [129. 223. 143.  57. 163.  40.]
 [219.   7.  81. 205. 122. 200.]];
regular mean: nan
nanmean: 110.72413793103448


In [22]:
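arr = np.random.randint(0, 10, (3,4))

# Arrays have no .median() method, so this raises an AttributeError.
# Use the function np.median(arr) instead.
arr.median()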

AttributeError: 'numpy.ndarray' object has no attribute 'median'

Many of the functions we're about to see handle nan values in the same way as `np.mean()`: a single nan makes the result nan. Assume they have a nan-aware equivalent!

 - `np.std()` / `np.nanstd()`
 - `np.percentile()` / `np.nanpercentile()`
 - `np.max()` / `np.nanmax()`
 - ...
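
As a small illustration of the pairs listed above, the sketch below compares each regular function with its nan-aware counterpart on a made-up array containing one nan.

```python
import numpy as np

arr = np.array([1.0, 2.0, np.nan, 4.0])

# The regular versions propagate the nan to the result
print(np.max(arr), np.std(arr), np.percentile(arr, 50))

# The nan-aware versions ignore nan entries
nan_max = np.nanmax(arr)
nan_std = np.nanstd(arr)
nan_median = np.nanpercentile(arr, 50)
print(nan_max, nan_std, nan_median)
```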

### `np.std()` / `np.nanstd()`

Calculate the standard deviation of an array, either global or along some axis:

In [23]:
arr = np.random.normal(0, 3, 1000)
np.std(arr)

2.9911940005367863
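
The example above computes a global standard deviation. As a sketch of the "along some axis" case, the snippet below uses a small made-up matrix and checks the shape of each result.

```python
import numpy as np

arr = np.array([[1.0, 2.0, 3.0],
                [2.0, 4.0, 6.0]])

std_global = np.std(arr)        # one number for the whole array
std_cols = np.std(arr, axis=0)  # collapse rows: one value per column
std_rows = np.std(arr, axis=1)  # collapse columns: one value per row

print(std_global, std_cols.shape, std_rows.shape)
```

The axis passed to the function is the one that disappears from the result.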

### `np.median()` / `np.nanmedian()`

Calculate the median of an array, either global or along some axis:

In [27]:
arr = np.random.randint(0, 100, (1000, 10))
np.median(arr, axis=0)

array([51., 49., 49., 49., 51., 50., 49., 47., 49., 53.])

### `np.max()` / `np.min()`

Calculate the minimum or maximum of an array, either global or along some axis:

In [28]:
arr = np.random.randint(0, 100, 1000)

np.min(arr), np.max(arr)  # print min and max together

(0, 99)

### `np.percentile()`

Calculate a given percentile of an array, either global or along some axis:

In [29]:
arr = np.random.randint(0, 1000, 10000)

np.percentile(arr, 75)  # value below which 75% of the data fall

756.0
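
`np.percentile()` also accepts a sequence of percentiles. A minimal sketch on a made-up array where the expected values are easy to verify by eye:

```python
import numpy as np

arr = np.arange(101)  # the integers 0, 1, ..., 100

# Ask for several percentiles in one call
quartiles = np.percentile(arr, [25, 50, 75])
print(quartiles)

# The 50th percentile is the median
print(np.percentile(arr, 50) == np.median(arr))
```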

### `np.unique()`

Return the unique values of an array and, if asked, their counts:

In [33]:
np.unique(np.array([1, 2, 2, 3, 3, 3]))

array([1, 2, 3])

In [35]:
# If we ask we can get counts as well
unique_values, counts = np.unique([1,2,2,3], return_counts=True) 

print("unique:", unique_values)

print("counts:", counts)

unique: [1 2 3]
counts: [1 2 1]


In [36]:
arr = np.random.normal(0, 1, 1000000)

### `np.diff()` / `np.cumsum()`  

We can compute cumulative sums (integrals) of an array with `np.cumsum()`:

In [39]:
my_arr = np.array([[1,2,3,4],[1,2,4,5]])
np.cumsum(my_arr, axis=0)

array([[1, 2, 3, 4],
       [2, 4, 7, 9]])

We can compute differences between consecutive elements of an array using `np.diff()`:

In [40]:
my_arr = np.array([1,2,3,4])
np.diff(my_arr)

array([1, 1, 1])
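
The two functions are close to being inverses of each other, which the following sketch illustrates with made-up values. The `prepend` argument of `np.diff()` is used here to recover the first element as well.

```python
import numpy as np

steps = np.array([1, 2, 3, 4])

positions = np.cumsum(steps)   # running total: [1, 3, 6, 10]
recovered = np.diff(positions)  # differences: [2, 3, 4], first step is lost

# Prepending a 0 before differencing recovers the full original array
recovered_all = np.diff(positions, prepend=0)

print(positions, recovered, recovered_all)
```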

(Practicals 1.1.0)

### Write code the `numpy` way

When operating on matrices, you should always aim to write <span style="color:indianred">vectorized</span> code.

Vectorized code: code where `for` loops are replaced by operations over matrix dimensions.

A very simple example: element-wise multiplication of two vectors

In [None]:
vector_1 = np.random.normal(0, 1, (10000000,))
vector_2 = np.random.normal(0, 1, (10000000,))

In [None]:
%%timeit
product = np.zeros(vector_1.shape)  # initialize empty result vector

# Compute the multiplication in a loop:
for i in range(vector_1.shape[0]):
    product[i] = vector_1[i] * vector_2[i]

In [None]:
%%timeit
product = vector_1 * vector_2

Another example: Z-score the rows of a matrix:

In [None]:
data_matrix = np.random.randint(0, 255, (100000, 100))

In [None]:
%%timeit
normalized_matrix = np.zeros(data_matrix.shape)  # start an empty matrix of matching shape 

# Loop over rows (first dimension), take mean and std, subtract and divide:
for i in range(data_matrix.shape[0]):
    row_mean = np.mean(data_matrix[i, :])
    row_std = np.std(data_matrix[i, :])
    
    normalized_matrix[i, :] = (data_matrix[i, :] - row_mean) / row_std

In [None]:
%%timeit
rows_mean = np.mean(data_matrix, axis=1)  # vectorized mean
rows_std = np.std(data_matrix, axis=1)  # vectorized std

# Write the normalization as a vector operation.
# Note how we use broadcasting to make sure the right dimensions are propagated!

# The parentheses matter: subtract the mean first, then divide by the std.
normalized_matrix = (data_matrix - rows_mean[:, np.newaxis]) / rows_std[:, np.newaxis]
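
When replacing a loop with a vectorized expression, it is worth checking on a small input that both give the same result. The sketch below does this for the row-wise Z-score, using a small made-up matrix. It uses `keepdims=True` as an alternative to `np.newaxis` to keep the reduced axis for broadcasting.

```python
import numpy as np

# Small hypothetical matrix, easy to inspect by eye
small_matrix = np.array([[1, 2, 3, 4],
                         [10, 20, 30, 50],
                         [5, 5, 6, 9]])

# Loop version: normalize one row at a time
looped = np.zeros(small_matrix.shape)
for i in range(small_matrix.shape[0]):
    row = small_matrix[i, :]
    looped[i, :] = (row - np.mean(row)) / np.std(row)

# Vectorized version: keepdims=True keeps shape (3, 1) so it broadcasts over columns
row_means = np.mean(small_matrix, axis=1, keepdims=True)
row_stds = np.std(small_matrix, axis=1, keepdims=True)
vectorized = (small_matrix - row_means) / row_stds

print(np.allclose(looped, vectorized))
```

After Z-scoring, every row should have mean 0 and standard deviation 1, which is another quick sanity check.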

## Search indexes

Some functions find indexes of the elements of an array that match some criterion.

### `np.argmin()` / `np.argmax()` 

Find the position of the maximum or the minimum of an array:

In [None]:
arr = np.array([5, 0, 2, 9, 6,])

np.argmin(arr)  # give index of smallest element

In [None]:
np.argmax(arr)  # give index of biggest element

For a multi-dimensional array, the default result is an index into the flattened array:

In [3]:
arr = np.array([[5, 1, 2], [3, 0, 4]])
print(arr)
np.argmin(arr)

[[5 1 2]
 [3 0 4]]


4
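
Like the other functions seen so far, `np.argmin()` and `np.argmax()` accept an `axis` argument. A minimal sketch with the same array as above:

```python
import numpy as np

arr = np.array([[5, 1, 2], [3, 0, 4]])

flat_idx = np.argmin(arr)                # index into the flattened array
row_of_min_per_col = np.argmin(arr, axis=0)  # for each column, row of the minimum
col_of_min_per_row = np.argmin(arr, axis=1)  # for each row, column of the minimum

print(flat_idx, row_of_min_per_col, col_of_min_per_row)
```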

### Index raveling / unraveling

We can use `np.unravel_index()` to convert a flat index (like the one returned by `np.argmin()`) into a tuple of indexes, one per axis:

In [6]:
arr = np.array([[5, 1, 2], [3, 0, 4]])
idx = np.argmin(arr)
np.unravel_index(idx, arr.shape)

(1, 1)

The converse operation, called ravel, can be done with `np.ravel_multi_index()`:

In [7]:
np.ravel_multi_index((1, 1), (2,3))

4
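
The two functions undo each other. The sketch below checks the round trip for every position of a `(2, 3)` array, and also spells out the arithmetic behind the flat index for the default row-major order.

```python
import numpy as np

shape = (2, 3)
n_rows, n_cols = shape

round_trip_ok = True
for flat_idx in range(n_rows * n_cols):
    row, col = np.unravel_index(flat_idx, shape)
    # In row-major order, the flat index is row * n_cols + col
    same_as_formula = flat_idx == row * n_cols + col
    same_as_ravel = np.ravel_multi_index((row, col), shape) == flat_idx
    round_trip_ok = round_trip_ok and same_as_formula and same_as_ravel

print(round_trip_ok)
```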

### `np.nonzero()` / `np.argwhere()`

`np.nonzero()` finds the True (non-zero) elements and gives us a tuple of index arrays, one for each dimension:

In [10]:
arr = np.array([1,2,3,4,5])
print(arr)

arr[np.nonzero(arr > 2)]

[1 2 3 4 5]


array([3, 4, 5])
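
For a 2D array, the tuple returned by `np.nonzero()` contains one array of row indexes and one array of column indexes. A minimal sketch with a made-up matrix:

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

# One index array per dimension: rows first, then columns
rows, cols = np.nonzero(arr % 2 == 0)
print(rows, cols)

# The tuple can be used directly to index the array
selected = arr[rows, cols]
print(selected)
```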

In [None]:
arr = np.array([[1,2,3], [4, 5, 6]])
print(arr)
np.argwhere(arr % 2 == 0)

`np.argwhere()` finds the True (non-zero) elements and gives us a `(n_pts, n_matrix_dims)` array of indexes:

In [12]:
arr = np.array([1,2,3,4,5])

arr[np.argwhere(arr > 2)]

array([[3],
       [4],
       [5]])

In [13]:
arr = np.array([[1,2,3], 
                [4, 5, 6]])
print(arr)
np.argwhere(arr % 2 == 0)

[[1 2 3]
 [4 5 6]]


array([[0, 1],
       [1, 0],
       [1, 2]])
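
`np.argwhere()` and `np.nonzero()` carry the same information in two layouts. The sketch below, on the same matrix as above, shows how to go from one to the other.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
mask = arr % 2 == 0

coords = np.argwhere(mask)      # shape (n_pts, n_dims): one row per match
index_tuple = np.nonzero(mask)  # tuple of n_dims arrays, each of length n_pts

# Transposing the nonzero output gives the argwhere layout
same = np.array_equal(coords, np.transpose(index_tuple))

# To index with argwhere coordinates, convert them back to a tuple of columns
selected = arr[tuple(coords.T)]

print(same, selected)
```

In practice, `np.nonzero()` is the more convenient form for indexing, while `np.argwhere()` is easier to read as a list of coordinates.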

(Practicals 1.1.1)