# Python for (open) Neuroscience

_Lecture 1.3_ - More on `pandas`

Luigi Petrucco

Jean-Charles Mariani

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vigji/python-cimec/blob/main/lectures/Lecture1.3_More-pandas.ipynb)

## Announcements

- Next week we'll be setting up local Python installations, tutorial soon!
- There will be a second assignment, but not a third one - start thinking about a project, though!
- Related: still looking for datasets!
- Questionnaire soon

### More `pandas`

In [None]:
import pandas as pd
import numpy as np

## Organize data in a dataframe

In [None]:
# Imagine we have 4 experimental subjects; to each one we show a stimulus 3 times; over each repetition 
# we measure 2 variables.
n_subjects = 4
n_repetitions = 3

# We could represent the data for each repetition as a dictionary,
# and the data for each subject as a list of dictionaries:
subject_data = [dict(var_1=np.random.rand(), var_2=np.random.rand()) for _ in range(n_repetitions)]
subject_data

In [None]:
# And the data for all subjects as a dictionary of lists of dictionaries:
all_subjects_data = dict()

for i in range(n_subjects):
    all_subjects_data[f"subj_{i}"] = \
        [dict(var_1=np.random.rand(), var_2=np.random.rand()) for _ in range(n_repetitions)]
all_subjects_data

This is now organized, but very nested! It is not easy to compute statistics on it.

In [None]:
# Imagine we want to average var_1 across all subjects and repetitions:
values = []
for subject_results in all_subjects_data.values():
    for result in subject_results:
        values.append(result["var_1"])
np.mean(values)

Instead, we can represent the data in a dataframe, **keeping it as flat as possible**!

Remember!


    🪷 The Zen of Python 🪷
        
        Flat is better than nested

In [None]:
# We can turn the data into a dataframe (does not matter how we do it here! this is just an ugly example)
trials_df = pd.DataFrame([dict(subject=i, repetition=j, **all_subjects_data[i][j])
                             for i in all_subjects_data.keys()
                             for j in range(n_repetitions)])

trials_df

We can now easily perform statistics on the data:

In [None]:
var1_mean = trials_df["var_1"].mean()

You do not always need pandas dataframes!

They are not efficient with very many columns.

Often your raw data (ephys, imaging...) can live in numpy arrays, and you put only the derived quantities in pandas.
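As a minimal sketch of this split, the raw traces below stay in a numpy array and only per-trial summaries go into a dataframe. The array shape and the column names `mean_signal` and `peak_signal` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one row per trial, one column per time point
rng = np.random.default_rng(0)
n_trials, n_timepoints = 6, 1000
raw_traces = rng.normal(size=(n_trials, n_timepoints))

# Only derived, per-trial quantities go in the dataframe
trial_summary_df = pd.DataFrame({
    "trial": np.arange(n_trials),
    "mean_signal": raw_traces.mean(axis=1),
    "peak_signal": raw_traces.max(axis=1),
})
trial_summary_df
```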

### Principles for organizing `pandas` dataframes

Keep all the data of the same type that you have across groups (such as subjects) in the same dataframe.

If you load a list of dataframes, concatenate them before working on them!
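A minimal sketch of the concatenation step, assuming each subject's trials were loaded as a separate dataframe. The `per_subject_dfs` list below is built by hand as a stand-in for loaded files.

```python
import pandas as pd

# Stand-in for a list of dataframes loaded one per subject
per_subject_dfs = [
    pd.DataFrame({
        "subject": f"subj_{i}",
        "repetition": range(3),
        "var_1": [0.1 * i, 0.2 * i, 0.3 * i],
    })
    for i in range(2)
]

# One flat dataframe with all subjects; ignore_index gives a fresh 0..n-1 index
all_trials_df = pd.concat(per_subject_dfs, ignore_index=True)
all_trials_df
```

Keeping a `subject` column in each piece is what lets us tell the rows apart after concatenation.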

Consider having multiple dataframes to describe different aspects of your experiment. For example:
- a `subject` dataset with the info on your subjects
- a `trials` dataset with the trial responses across subjects

And keep consistent ids / nomenclature to easily work over both!

Example:

In [None]:
# Let's build a subjects dataframe for the experiment above:
np.random.seed(42)
subjects_df = pd.DataFrame(dict(sex=np.random.choice(["F", "M"], size=n_subjects),
                                handedness=np.random.choice(["left", "right"], size=n_subjects),
                                age=np.random.randint(20, 40, size=n_subjects)),
                          index=[f"subj_{i}" for i in range(n_subjects)])
subjects_df

We can now easily filter the subjects we want to work on based on categories:

In [None]:
selected_subjects_df = subjects_df[(subjects_df["sex"] == "F") & (subjects_df["age"] >= 30)]
selected_subjects_df

In [None]:
selected_subjects_df.index

And restrict our analysis of `trials_df` to these subjects:

In [None]:
# Here, we'll use another handy pandas method: `isin()`:
selection = trials_df["subject"].isin(selected_subjects_df.index)
selection

In [None]:
trials_df.loc[selection, "var_1"].mean()
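Consistent ids also make it possible to attach the subject info directly to each trial. The sketch below uses `DataFrame.join` with the `on` argument as one possible way to do it, on small hand-made dataframes (the values are arbitrary).

```python
import pandas as pd

# Subjects dataframe indexed by subject id
subjects = pd.DataFrame(
    {"sex": ["F", "M"], "age": [31, 25]},
    index=["subj_0", "subj_1"],
)

# Trials dataframe with a matching "subject" column
trials = pd.DataFrame({
    "subject": ["subj_0", "subj_0", "subj_1", "subj_1"],
    "var_1": [0.1, 0.3, 0.5, 0.7],
})

# Match the "subject" column of trials against the index of subjects
merged = trials.join(subjects, on="subject")

# Now we can filter trials on subject properties directly
mean_f = merged.loc[merged["sex"] == "F", "var_1"].mean()
```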

(Practicals 1.3.0)

## Aggregate statistics

It can be useful to aggregate statistics based on the values of a column.

Imagine we want to quickly compute the mean of the values across trials for each subject.



### `.groupby()`

We have a handy syntax to average within each category with `.groupby()`.

The syntax is:
```python
df.groupby("name_of_the_category_column").operation()
```

Now, we want to compute the average for every subject:

In [None]:
trials_df.head(3)

In [None]:
# In this case, the operation is `mean()`.
# Note how the result will have the variable we group by as index:

subj_means_df = trials_df.groupby("subject").mean()
subj_means_df

By the way, this is a reason why methods are better than functions in this case: they can be chained with a clearer syntax!
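A small sketch of such a chain, on hand-made data (the values are arbitrary): each step reads top to bottom instead of being nested inside function calls.

```python
import pandas as pd

trials = pd.DataFrame({
    "subject": ["subj_0", "subj_0", "subj_1", "subj_1"],
    "var_1": [1.0, 2.0, 5.0, 6.0],
})

# Group, average, sort and round in a single readable chain
summary = (
    trials
    .groupby("subject")
    .mean()
    .sort_values("var_1", ascending=False)
    .round(1)
)
summary
```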

## Index broadcasting in `pandas`

Let's subtract from each trial the mean of its subject, for each variable.

In [None]:
trials_df.head(3)

In [None]:
subj_means_df.head(3)

The shapes obviously don't match:

In [None]:
print(trials_df.shape)
print(subj_means_df.shape)

In [None]:
trials_df - subj_means_df  # the row indices don't match, so this does not subtract the subject means

But pandas will broadcast values using indices if we make them consistent!

In [None]:
trials_df = trials_df.set_index("subject")
trials_df.head()

So now we can write:

In [None]:
normalized = trials_df - subj_means_df
normalized.head()

This broadcasting is super powerful! It gives us a very expressive and concise syntax to work with aggregated data without using loops.
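To see the alignment with numbers that are easy to check by hand, here is a minimal sketch on a tiny made-up dataframe, where the subject id is already the index:

```python
import pandas as pd

# Two subjects, two trials each; subject id is the (repeated) index
trials = pd.DataFrame(
    {"var_1": [1.0, 3.0, 10.0, 20.0]},
    index=pd.Index(["subj_0", "subj_0", "subj_1", "subj_1"], name="subject"),
)

# One row per subject: subj_0 -> 2.0, subj_1 -> 15.0
subj_means = trials.groupby(level="subject").mean()

# Rows are matched by index label, so each trial gets its own subject's mean subtracted
centered = trials - subj_means
centered
```

Each subject's trials now average to zero, without writing any loop over subjects.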

(Practicals 1.3.1)

## Rolling functions with `.rolling()`

Imagine we have a time series of data, and we want to compute the mean over a window of time (e.g., for smoothing).

In [None]:
# Let's create a time series:
time_series = pd.Series(np.random.rand(100))

In [None]:
# This will compute the mean in a rolling window - i.e., smoothing it!
rolling_wnd_size = 10
smoothed = time_series.rolling(rolling_wnd_size, center=True).mean()

In [None]:
time_series.plot()
smoothed.plot()

When the function is the mean, the result is a plain moving average, the same as you would get from other smoothing tools.
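As a quick check that a rolling mean matches a plain moving average, the sketch below compares it with a boxcar convolution from numpy. It uses an uncentered window to keep the alignment simple; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
signal = pd.Series(rng.random(100))
window = 10

# Rolling mean; the first window - 1 values are NaN, so we drop them
rolled = signal.rolling(window).mean().dropna().to_numpy()

# Moving average as a convolution with a flat (boxcar) kernel
boxcar = np.convolve(signal.to_numpy(), np.ones(window) / window, mode="valid")

same = np.allclose(rolled, boxcar)
```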

But now we can use arbitrary functions! (standard deviation, significance tests, etc.)
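For instance, a rolling standard deviation is built in, and `.apply()` accepts any function that reduces a window to a single number. The peak-to-peak range below is just an arbitrary example of such a function.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
time_series = pd.Series(rng.random(100))
window = 10

# Built-in rolling statistic
rolling_std = time_series.rolling(window, center=True).std()

# Arbitrary function: with raw=True each window is passed as a numpy array
rolling_range = time_series.rolling(window, center=True).apply(
    lambda w: w.max() - w.min(), raw=True
)
```

As with the rolling mean, windows that do not fit entirely in the data give NaN at the edges.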