# Python for (open) Neuroscience

_Lecture 1.3_ - More on `pandas`

Luigi Petrucco

Jean-Charles Mariani

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vigji/python-cimec/blob/main/lectures/Lecture1.3_More-pandas.ipynb)

## Aggregate over columns

It can be useful to aggregate statistics based on the values of a column.

In [None]:
import numpy as np
import pandas as pd

# In this dataset, the "labels" column represents a category, either "a" or "b"
df = pd.DataFrame(dict(labels=["a", "a", "b", "b"], values=[1, 2, 3, 4]))
df

### `.groupby()`

We have a handy syntax to average within each category with `.groupby()`.

The syntax is:
```python
df.groupby("name_of_the_category_column").operation()
```

In [None]:
# In this case, the operation is mean()

df.groupby("labels").mean()
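
The same pattern works with other aggregation operations, not only `mean()`. Below is a minimal sketch on the same toy data; the names `df_example`, `group_sums` and `group_stats` are introduced here purely for illustration.

```python
import pandas as pd

# Same toy dataset as above: "labels" holds the category, "values" the measurements
df_example = pd.DataFrame(dict(labels=["a", "a", "b", "b"], values=[1, 2, 3, 4]))

# Sum of the values within each category
group_sums = df_example.groupby("labels").sum()

# Several statistics at once for one column, using .agg()
group_stats = df_example.groupby("labels")["values"].agg(["mean", "max", "count"])

print(group_sums)
print(group_stats)
```

Selecting the column before aggregating (`["values"]`) keeps the result restricted to the measurements we care about.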

## Organize data in a dataframe

In [None]:
# Imagine we have 4 experimental subjects; to each one we show a stimulus 3 times; on each repetition we measure 2 variables.

# We could represent the data for each stimulus presentation as a dictionary:
stimulus = dict(variable_1=np.random.rand(), variable_2=np.random.rand())

In [None]:
# And the data for each subject as a list of dictionaries:
subject_data = [dict(variable_1=np.random.rand(), variable_2=np.random.rand()) for _ in range(3)]

In [None]:
# And the data for all subjects as a dictionary of lists of dictionaries:
all_subjects = dict()

for i in range(4):
    all_subjects[f"subject_{i}"] = [dict(variable_1=np.random.rand(), variable_2=np.random.rand()) for _ in range(3)]
all_subjects

This looks clean and tidy (?), but because it is so nested, computing statistics on it is cumbersome.

In [None]:
# Imagine we want to average variable_1 across all subjects and repetitions:
values = []  # collect every single measurement of variable_1
for subject_results in all_subjects.values():
    for result in subject_results:
        values.append(result["variable_1"])
np.mean(values)

Instead, we can represent the data in a dataframe, **keeping it as flat as possible**!

Remember!


    🪷 The Zen of Python 🪷
        
        Flat is better than nested

In [None]:
# We can turn the data into a dataframe (does not matter how we do it here!!! - this is just an ugly example)
df = pd.DataFrame([dict(subject=i, repetition=j, **all_subjects[i][j].copy())
                             for i in all_subjects.keys()
                             for j in range(len(all_subjects[i]))])

df
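
For readers who find the comprehension above hard to follow, here is one possible, more explicit way to flatten such nested data. It is only a sketch: the names `nested`, `records` and `flat_df` are hypothetical, and the random generator is seeded just to make the example reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only for reproducibility of this sketch

# Nested data with the same layout as all_subjects above:
# a dictionary of lists (one per subject) of dictionaries (one per repetition)
nested = {
    f"subject_{i}": [dict(variable_1=rng.random(), variable_2=rng.random()) for _ in range(3)]
    for i in range(4)
}

# Build one flat record (one row) per subject and repetition
records = []
for subject, repetitions in nested.items():
    for repetition, measurements in enumerate(repetitions):
        records.append(dict(subject=subject, repetition=repetition, **measurements))

flat_df = pd.DataFrame(records)
print(flat_df.head())
```

Each row now carries its own `subject` and `repetition` labels, so no information is lost by removing the nesting.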

In [None]:
# We can now easily perform statistics on the data, aggregating over the repetitions or subjects using groupby:

In [None]:
# Average over repetitions within each subject
group_means = df.groupby("subject").mean()
group_means.drop(columns=["repetition"])  # drop the averaged repetition index, which is not a measurement
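
Once the data is flat, grouping by a different column is just as easy. The sketch below rebuilds a small table with the same columns as `df` (the names `flat_df`, `repetition_means` and `grand_mean` are introduced here for illustration) and averages across subjects for each repetition; it also shows that the grand average computed earlier with a nested loop becomes a one-liner.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only for reproducibility of this sketch

# A flat table with the same columns as df above: 4 subjects x 3 repetitions
flat_df = pd.DataFrame(
    [
        dict(subject=f"subject_{i}", repetition=j,
             variable_1=rng.random(), variable_2=rng.random())
        for i in range(4)
        for j in range(3)
    ]
)

# Average across subjects for each repetition, keeping only the measured variables
repetition_means = flat_df.groupby("repetition")[["variable_1", "variable_2"]].mean()

# Grand average of variable_1 over all subjects and repetitions
grand_mean = flat_df["variable_1"].mean()

print(repetition_means)
print(grand_mean)
```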

## Advanced pandas

### `.rolling()`

### `.groupby()`