# Python for (open) Neuroscience

_Lecture 1.1_ - Intro to `pandas`

Luigi Petrucco

Jean-Charles Mariani

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vigji/python-cimec/blob/main/lectures/Lecture1.2_Intro-pandas.ipynb)

## `numpy` 🧮

A final note on `numpy` boolean indexing & arrays

## Boolean operations with arrays

In [None]:
import numpy as np

In [None]:
an_array = np.array([1, 2, 3, 4, 5])

In [None]:
condition_0 = an_array > 2
condition_1 = an_array < 5

print(condition_0)
print(condition_1)

To compute the element-wise `and` of two conditions we use `&`:

In [None]:
condition_0 & condition_1

To compute the element-wise `or` of two conditions we use `|`:

In [None]:
condition_0 | condition_1

To compute the element-wise `not` of a condition (over a single array) we use `~`:

In [None]:
~condition_0

Mind the execution order!

In [None]:
an_array > 0 & an_array < 5


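We get an error because `&` has higher precedence than the comparison operators: Python first evaluates `0 & an_array`, and the remaining chained comparison `an_array > (0 & an_array) < 5` then needs the truth value of a whole array, which is ambiguous.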

In [None]:
# Correct:
(an_array > 0) & (an_array < 5)
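
The combined condition can then be used as a mask to select elements, which is what boolean indexing refers to. A minimal sketch reusing the array from above (the names `mask` and `selected` are only illustrative):

```python
import numpy as np

an_array = np.array([1, 2, 3, 4, 5])

# Build a boolean mask; parentheses are required around each comparison
mask = (an_array > 2) & (an_array < 5)

# Indexing with the mask keeps only the elements where the mask is True
selected = an_array[mask]
print(selected)  # [3 4]
```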

## `pandas`!

Or, the magic of semantic indexing and data aggregation

- not that magical for R people, but hopefully you'll feel at home

- (also, under a geek definition of "magic"...)

### A problem for arrays

With numpy arrays, we cannot work with "semantic" axes (i.e., we always have to remember what each axis represents).

Also, we always need to work with rectangular arrays (every row must have the same number of values).

This can be a pain for real-world (_i.e._, heterogeneous) data!

Also, numpy does not offer handy ways to aggregate data by label or group.

## `pandas` 🐼 can help us here

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Download a sample weather dataset (hourly data) from the Open-Meteo API:
def get_meteo_dataset():
    URL = "https://api.open-meteo.com/v1/forecast?latitude=52.52&longitude=13.41&hourly=temperature_2m,relativehumidity_2m,precipitation,windspeed_10m,winddirection_10m&start_date=2023-02-01&end_date=2023-05-28&format=csv"
    return pd.read_csv(URL, skiprows=3)
df = get_meteo_dataset()
df

### `pd.DataFrame` and `pd.Series`

`pd.DataFrame` and `pd.Series` are the `pandas` data collection types!
- `pd.DataFrame` is a 2D data structure
- `pd.Series` is a 1D data structure

## `pd.DataFrame`

2D data structure with labelled **columns** and indexed **rows**

In [None]:
df = pd.DataFrame(
    data=[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
    columns=["col0", "col1", "col2"],
    index=["row1", "row2", "row3", "row4"],
)

Dataframes are a great way of storing multiple variables (columns) for the same set of observations (rows)!

## `pd.DataFrame` indexing

Index dataframe over columns:

In [None]:
df["col0"]

Once we have selected a column, what we get is a `pd.Series`

### `pd.Series`

`pd.Series` are 1-dimensional data structures - basically the columns of `pd.DataFrame`s

`pd.Series` have indexed rows, and a name (the name of the column they come from):

In [None]:
a_series_from_df = df["col0"]
type(a_series_from_df)

In [None]:
a_series_from_df.index

In [None]:
a_series_from_df.name
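
Besides its index and name, a series also gives access to its values, and it can be indexed by row label. A small sketch, assuming the same example dataframe as above (the variable names are only illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
    columns=["col0", "col1", "col2"],
    index=["row1", "row2", "row3", "row4"],
)
a_series = df["col0"]

# The underlying data can be retrieved as a numpy array
values = a_series.to_numpy()
print(values)  # [ 1  4  7 10]

# A series can be indexed using its row labels
print(a_series["row2"])  # 4
```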

### Back to indexing dataframes...

We can select multiple columns (with a list of columns):

In [None]:
df[["col0", "col2"]]

Index data over rows:

We can select rows (with a list / range of rows):

In [None]:
df.loc["row1":"row3"]
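
One detail worth keeping in mind: unlike regular Python slicing, a label range in `.loc` includes both of its ends. A short sketch on the same example dataframe (the names `rows` and `picked` are only illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
    columns=["col0", "col1", "col2"],
    index=["row1", "row2", "row3", "row4"],
)

# Label ranges include the last label ("row3" is part of the result)
rows = df.loc["row1":"row3"]
print(list(rows.index))  # ['row1', 'row2', 'row3']

# A list of labels selects exactly those rows, in the given order
picked = df.loc[["row1", "row4"]]
print(list(picked.index))  # ['row1', 'row4']
```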

### `.loc`

Index over both rows and columns using `.loc` (not a method! mind the square brackets):

In [None]:
df.loc["row1", "col0"]

We can select multiple rows and columns (with a list / range of rows and columns):

In [None]:
df.loc["row1":"row3", "col0":"col2"]

Often, we use boolean indexing to select rows:

In [None]:
df.loc[df["col2"] > 5]
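
A boolean condition can also be stored in a variable, negated with `~`, and combined with a column selection. A minimal sketch on the same example dataframe (the variable names are only illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
    columns=["col0", "col1", "col2"],
    index=["row1", "row2", "row3", "row4"],
)

# The condition is itself a boolean pd.Series, one value per row
mask = df["col2"] > 5

# Rows selected by the mask, columns selected by label
subset = df.loc[mask, ["col0", "col1"]]
print(list(subset.index))  # ['row2', 'row3', 'row4']

# ~ inverts the mask: rows where the condition is False
rest = df.loc[~mask]
print(list(rest.index))  # ['row1']
```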

It is very common to use multiple criteria to select rows:

In [None]:
df.loc[(df["col2"] > 5) & (df["col1"] < 10)]

### `.iloc`

If we feel like using numpy-like indexing, we can use `.iloc`:

In [None]:
df.iloc[0, 0]
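
Since `.iloc` works with integer positions, slices behave as in numpy: the end is excluded and negative positions count from the end. A small sketch comparing it with the equivalent `.loc` call on the same example dataframe (variable names are only illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
    columns=["col0", "col1", "col2"],
    index=["row1", "row2", "row3", "row4"],
)

# Positional slicing: end positions are excluded, as in numpy
first_two = df.iloc[0:2, 0:2]

# Label slicing: end labels are included
same_by_label = df.loc["row1":"row2", "col0":"col1"]
print(first_two.equals(same_by_label))  # True

# Negative positions count from the end: this is the last row
last_row = df.iloc[-1]
print(last_row.tolist())  # [10, 11, 12]
```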

In [None]:
pd.Series([1, 2, 3], name="a")

### Create `pd.DataFrame`s

Typically, we create a dataframe from a dictionary of arrays (lists):

In [None]:
dict_array = dict(int_col=[1, 2, 3], float_col=[4., 5., .6], str_col=["a", "b", "c"])

pd.DataFrame(dict_array)

Or from a list of dictionaries:

In [None]:
pd.DataFrame([dict(int_col=1, float_col=4., str_col="a"),
              dict(int_col=2, float_col=5., str_col="b"),
              dict(int_col=3, float_col=.6, str_col="c")])

Or from a 2D numpy array (similarly to the list of lists we saw above):

In [None]:
pd.DataFrame(np.random.rand(3, 3), columns=["a", "b", "c"], index=["row1", "row2", "row3"])
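
The dictionary-of-lists and list-of-dictionaries forms describe the same table, once by column and once by row. A quick sketch to check this, also showing how to inspect the resulting shape and column types (the names `from_columns` and `from_rows` are only illustrative):

```python
import pandas as pd

# Column-wise description: one list per column
from_columns = pd.DataFrame(
    dict(int_col=[1, 2, 3], float_col=[4.0, 5.0, 0.6], str_col=["a", "b", "c"])
)

# Row-wise description: one dictionary per row
from_rows = pd.DataFrame(
    [
        dict(int_col=1, float_col=4.0, str_col="a"),
        dict(int_col=2, float_col=5.0, str_col="b"),
        dict(int_col=3, float_col=0.6, str_col="c"),
    ]
)

# Check whether both dataframes hold the same data
print(from_columns.equals(from_rows))

print(from_columns.shape)  # (3, 3)
print(from_columns.dtypes)  # data type inferred for each column
```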

### `pd.DataFrame`'s methods

`pd.DataFrame`s and `pd.Series` have many, many methods!

(Why are those methods and not functions? There's a reason, we'll get to that...)

There are actually way too many to cover in a single lecture! It is more important to know that they exist, and to know how to find them (Google, Stack Overflow, the pandas documentation, ChatGPT...).

Methods to drop rows/columns:

In [None]:
dict_array = dict(int_col=[3, 2, 1, 1], float_col=[4., 5., .6, 7.], str_col=["a", "d", "c", "a"])
df = pd.DataFrame(dict_array)
df

In [None]:
df.drop(columns=["int_col"])  # drop columns

In [None]:
df.drop(index=[0, 2])  # drop rows
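
Note that `drop` returns a new dataframe and leaves the original untouched, so the result has to be assigned to a variable to be kept. A minimal sketch with the same example data (the name `dropped` is only illustrative):

```python
import pandas as pd

dict_array = dict(int_col=[3, 2, 1, 1], float_col=[4.0, 5.0, 0.6, 7.0], str_col=["a", "d", "c", "a"])
df = pd.DataFrame(dict_array)

# drop returns a new dataframe without the column...
dropped = df.drop(columns=["int_col"])
print(list(dropped.columns))  # ['float_col', 'str_col']

# ...while the original dataframe still has all its columns
print(list(df.columns))  # ['int_col', 'float_col', 'str_col']
```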

Methods to sort rows/columns:

In [None]:
df.sort_values(by="int_col") # sort by a column

In [None]:
df.sort_values(by=["int_col", "float_col"])  # sort by multiple columns
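
The sorting direction can be controlled with the `ascending` argument, either globally or per column. A short sketch with the same example data (variable names are only illustrative):

```python
import pandas as pd

dict_array = dict(int_col=[3, 2, 1, 1], float_col=[4.0, 5.0, 0.6, 7.0], str_col=["a", "d", "c", "a"])
df = pd.DataFrame(dict_array)

# Sort from the largest to the smallest value
sorted_desc = df.sort_values(by="int_col", ascending=False)
print(sorted_desc["int_col"].tolist())  # [3, 2, 1, 1]

# One direction per column: int_col ascending, ties broken by float_col descending
mixed = df.sort_values(by=["int_col", "float_col"], ascending=[True, False])
print(mixed["float_col"].tolist())  # [7.0, 0.6, 5.0, 4.0]
```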

### Methods for statistics

In [None]:
df = get_meteo_dataset()

df.head()  # show first 5 rows

In [None]:
df["temperature_2m (°C)"].mean()

In [None]:
df["temperature_2m (°C)"].median()

In [None]:
df["temperature_2m (°C)"].std()

In [None]:
df[["temperature_2m (°C)", "precipitation (mm)", "windspeed_10m (km/h)"]].describe()
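
The same statistics methods also work on a whole dataframe, returning one value per column (or per row with `axis=1`). A minimal sketch on a small made-up dataframe, so that it runs without downloading anything; the column names `sensor_a` and `sensor_b` are hypothetical and not part of the meteo dataset:

```python
import pandas as pd

# Hypothetical temperature readings from two sensors
df = pd.DataFrame(dict(sensor_a=[10.0, 12.0, 14.0], sensor_b=[11.0, 13.0, 15.0]))

# Default: one statistic per column
col_means = df.mean()
print(col_means["sensor_a"])  # 12.0

# axis=1: one statistic per row (here, the average across sensors)
row_means = df.mean(axis=1)
print(row_means.tolist())  # [10.5, 12.5, 14.5]
```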

### `pd.DataFrame`'s plotting methods

`pd.DataFrame`s and `pd.Series` have many, many plotting methods!

In [None]:
df.plot()

In [None]:
df["temperature_2m (°C)"].plot(kind="box")

In [None]:
df["temperature_2m (°C)"].plot(kind="hist")

In [None]:
df.plot(kind="scatter", x="temperature_2m (°C)", y="precipitation (mm)")

## Clean data using `pd.DataFrame`s

### Missing data

As in numpy, we represent missing data by `NaN` (not a number).

In [None]:
df = pd.DataFrame(dict(a=[1, 2, np.nan, 4], b=[0, np.nan, 4, 5]))

To deal with missing data, we can use the `interpolate` method of `pd.DataFrame`. By default, it uses linear interpolation:

In [None]:
df.interpolate()
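
Interpolation is only one possible strategy. A small sketch of a few common alternatives on the same example dataframe: counting the missing values, dropping incomplete rows, or filling with a constant (which one is appropriate depends on the data; variable names are only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(a=[1, 2, np.nan, 4], b=[0, np.nan, 4, 5]))

# Count missing values in each column
n_missing = df.isna().sum()
print(n_missing.tolist())  # [1, 1]

# Keep only the rows without any missing value
complete_rows = df.dropna()
print(list(complete_rows.index))  # [0, 3]

# Replace missing values with a constant
filled = df.fillna(0)

# Linear interpolation, for comparison
interpolated = df.interpolate()
print(interpolated["a"].tolist())  # [1.0, 2.0, 3.0, 4.0]
```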

## Aggregate over columns

It can be useful to aggregate statistics based on the values of a column.
Imagine we have a dataframe with labels on a column and values on another:

In [None]:
df = pd.DataFrame(dict(labels=["a", "a", "b", "b"], values=[1, 2, 3, 4]))
df = df.set_index("labels")
df

In [None]:
df.groupby("labels").mean()
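
Several statistics can be computed for each group in one go with `agg`. A minimal sketch on the same toy data, here keeping `labels` as a regular column (the name `summary` is only illustrative):

```python
import pandas as pd

df = pd.DataFrame(dict(labels=["a", "a", "b", "b"], values=[1, 2, 3, 4]))

# One row per group, one column per requested statistic
summary = df.groupby("labels")["values"].agg(["mean", "min", "max", "count"])
print(summary)

print(summary.loc["a", "mean"])  # 1.5
```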

## Organize data in a dataframe

At the beginning, the lack of hierarchy of a dataframe can be surprising!

But! As long as we assign label columns to our groups, we can compute statistics very easily.

In [None]:
df

In [None]:
group_means = df.groupby("labels").mean()
df - group_means  # subtract the mean for each group

In [None]:
group_means
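
The subtraction above relies on the labels being the index of both dataframes. An alternative sketch, assuming we prefer to keep `labels` as a regular column, uses `transform`, which returns one value per original row (the column name `centered` is only illustrative):

```python
import pandas as pd

df = pd.DataFrame(dict(labels=["a", "a", "b", "b"], values=[1, 2, 3, 4]))

# transform broadcasts each group's mean back to the rows of that group
group_means = df.groupby("labels")["values"].transform("mean")
print(group_means.tolist())  # [1.5, 1.5, 3.5, 3.5]

# Subtract the mean of its own group from each value
df["centered"] = df["values"] - group_means
print(df["centered"].tolist())  # [-0.5, 0.5, -0.5, 0.5]
```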