# Pandas Exercises

Tamás Gál (tamas.gal@fau.de)

The latest version of this notebook is available at [https://github.com/escape2020/school2021](https://github.com/escape2020/school2021)

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib as ml
import matplotlib.pyplot as plt
import sys
ml.rcParams['figure.figsize'] = (10.0, 5.0)

print(f"Python version: {sys.version}\n"
      f"Pandas version: {pd.__version__}\n"
      f"NumPy version: {np.__version__}\n"
      f"Matplotlib version: {ml.__version__}\n"
      f"seaborn version: {sns.__version__}")

In [None]:
from IPython.core.magic import register_line_magic

@register_line_magic
def shorterr(line):
    """Show only the exception message if one is raised."""
    try:
        output = eval(line)
    except Exception as e:
        print("\x1b[31m\x1b[1m{e.__class__.__name__}: {e}\x1b[0m".format(e=e))
    else:
        return output
    
del shorterr

In [None]:
import warnings
warnings.filterwarnings('ignore')  # annoying UserWarnings from Jupyter/seaborn which are not fixed yet

## Exercise 1

Use the `pd.read_csv()` function to create a `DataFrame` from the dataset `data/neutrinos.csv`.

In [None]:
%shorterr pd.read_csv('data/neutrinos.csv')

### Problems encountered

- the first few lines represent a plain header and need to be skipped
- comments are indicated with `$` at the beginning of the line
- the column separator is `:`
- the decimal delimiter is `,`
- the index column is the first one
- there is a footer to be excluded
- footer exclusion only works with the Python-engine
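
To see how these options interact without touching the real file, here is a minimal sketch on a made-up, in-memory "file". The content of `raw` and its column names are assumptions for illustration only and do not reproduce `data/neutrinos.csv`.

```python
import io
import pandas as pd

# Hypothetical miniature file that mimics the quirks listed above
raw = """\
Plain header line 1
Plain header line 2
$ a comment line
id:energy:zenith
0:1,5:0,25
1:20,0:1,75
End of file
"""

df = pd.read_csv(
    io.StringIO(raw),
    skiprows=2,       # skip the plain header lines
    comment="$",      # ignore lines starting with $
    sep=":",          # column separator
    decimal=",",      # decimal delimiter
    index_col=0,      # use the first column as index
    skipfooter=1,     # drop the footer line
    engine="python",  # skipfooter is only supported by the Python engine
)
print(df)
print(df.dtypes)
```

Both data columns should come out as `float64`; if one of the options is left out, the parse typically either fails or yields text columns.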

### Solution to exercise 1

In [None]:
!head -n 15 data/neutrinos.csv

In [None]:
neutrinos = pd.read_csv('data/neutrinos.csv',
                        skiprows=5,
                        comment='$',
                        sep=':',
                        decimal=',',
                        index_col=0,
                        skipfooter=1,
                        engine='python')

In [None]:
neutrinos.head(3)

### Check the dtypes to make sure everything is parsed correctly (and is not an `object`-array)

In [None]:
neutrinos.dtypes  # everything's ok now ;)
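
Why the dtype check matters can be shown with a small sketch on invented data (the two-row `raw` string below is hypothetical): without `decimal=','` the numbers stay text, and the column ends up with a non-numeric dtype (`object`, or a string dtype in newer pandas versions).

```python
import io
import pandas as pd

# Hypothetical two-row dataset using "," as decimal delimiter
raw = "id:energy\n0:1,5\n1:20,0\n"

wrong = pd.read_csv(io.StringIO(raw), sep=":", index_col=0)
right = pd.read_csv(io.StringIO(raw), sep=":", index_col=0, decimal=",")

print(wrong.dtypes)  # energy is kept as text
print(right.dtypes)  # energy is parsed as float64
```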

## Exercise 2

Create a histogram of the neutrino energies.

### Solution to exercise 2

In [None]:
neutrinos.energy.hist(bins=100)
plt.xlabel('Neutrino energy [GeV]');
plt.ylabel('Count');
plt.show()

# alternative:

neutrinos.hist('energy', bins=100)

## Exercise 3

Use the `pd.read_csv()` function to create a `DataFrame` from the dataset `data/reco.csv`.

### Problems encountered

- need to define index column

### Solution to exercise 3

In [None]:
reco = pd.read_csv('data/reco.csv', index_col=0)
reco.head()

## Exercise 4

Combine the `neutrinos` and `reco` `DataFrames` into a single `DataFrame` using `pd.concat()`.

### Problems encountered

- need to define the right axis
- identical column names should be avoided
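
Both points can be tried out on two tiny, made-up frames first. The names `true_toy` and `reco_toy` and their values are assumptions for illustration only.

```python
import pandas as pd

# Two hypothetical frames sharing the same index and column names
true_toy = pd.DataFrame({"zenith": [0.1, 1.2], "energy": [3.0, 40.0]})
reco_toy = pd.DataFrame({"zenith": [0.2, 1.1], "energy": [2.5, 45.0]})

# Default axis (rows): the frames are stacked on top of each other
stacked = pd.concat([true_toy, reco_toy])
print(stacked.shape)  # (4, 2)

# axis="columns": the frames are joined side by side on the index;
# add_prefix() keeps the column names unique
side_by_side = pd.concat([true_toy, reco_toy.add_prefix("reco_")], axis="columns")
print(list(side_by_side.columns))
# ['zenith', 'energy', 'reco_zenith', 'reco_energy']
```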

### Solution to exercise 4

In [None]:
data = pd.concat([neutrinos, reco.add_prefix('reco_')], axis="columns")

In [None]:
data.head(3)

In [None]:
data.columns

## Exercise 5

Make a scatter plot to visualise the zenith reconstruction quality.

`data = pd.concat([neutrinos, reco.add_prefix('reco_')], axis="columns")`

### Problems encountered

- `DataFrame.plot()` is not suited to do scatter plots in earlier Pandas versions (inverts axis, sets weird limits etc.)

### Solution to exercise 5, using `DataFrame.plot()`

In [None]:
data.plot(x='zenith', y='reco_zenith', style='.');

### Solution to exercise 5, using `plt.scatter()`

Sometimes it's better not to fight against `DataFrame.plot()`, just switch to Matplotlib ;)

In [None]:
fig, ax = plt.subplots()
# s sets the marker size
# alpha sets the transparency
ax.scatter(data['zenith'], data['reco_zenith'], s=1, alpha=0.05);
ax.set_xlabel('True zenith');
ax.set_ylabel('Reconstructed zenith');

### Solution to exercise 5, using `plt.hist2d()`

In [None]:
fig, ax = plt.subplots()
counts, xedges, yedges, im = ax.hist2d(data['zenith'], data['reco_zenith'], bins=50);
ax.set_xlabel('True zenith');
ax.set_ylabel('Reconstructed zenith');
fig.colorbar(im)

## Exercise 6

Create a histogram of the cascade probabilities (__`neutrinos`__ dataset: `proba_cscd` column) for the energy ranges 1-5 GeV, 5-10 GeV, 10-20 GeV and 20-100 GeV.

### Naive solution to exercise 6

In [None]:
mask = (neutrinos.energy >= 1) & (neutrinos.energy < 5)
neutrinos[mask].proba_cscd.hist(histtype='step', label='[1-5) GeV')

mask = (neutrinos.energy >= 5) & (neutrinos.energy < 10)
neutrinos[mask].proba_cscd.hist(histtype='step', label='[5-10) GeV')

mask = (neutrinos.energy >= 10) & (neutrinos.energy < 20)
neutrinos[mask].proba_cscd.hist(histtype='step', label='[10-20) GeV')

mask = (neutrinos.energy >= 20) & (neutrinos.energy < 100)
neutrinos[mask].proba_cscd.hist(histtype='step', label='[20-100) GeV')

plt.legend()
plt.xlabel('proba cscd')

### More elegant solution

With many bins, the naive solution becomes repetitive and error-prone.

In [None]:
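ebins = [1, 5, 10, 20, 100]
# directly find the bin index of each event using pd.cut;
# right=False gives left-closed bins [a, b), matching the masks above
bin_index = pd.cut(neutrinos.energy, ebins, labels=False, right=False).values
bin_index

In [None]:
# loop over the bin numbers; events outside all bins (NaN index) are skipped
for i in range(len(ebins) - 1):
    plt.hist(neutrinos.proba_cscd[bin_index == i], label=f'[{ebins[i]}-{ebins[i+1]}) GeV', bins=30, histtype='step')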
plt.legend()
plt.xlabel('proba cscd')
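
A detail worth checking when using `pd.cut` is which bin edge is included. The sketch below uses a handful of invented energies (the `energies` values are assumptions, not taken from the dataset) to compare the default right-closed bins with left-closed ones.

```python
import pandas as pd

# Hypothetical energies sitting on and outside the bin edges
energies = pd.Series([1.0, 5.0, 7.5, 100.0, 150.0])
edges = [1, 5, 10, 20, 100]

# Default: right-closed bins (1, 5], (5, 10], (10, 20], (20, 100]
right_closed = pd.cut(energies, edges, labels=False)
# right=False: left-closed bins [1, 5), [5, 10), [10, 20), [20, 100)
left_closed = pd.cut(energies, edges, labels=False, right=False)

print(pd.DataFrame({"energy": energies,
                    "right_closed": right_closed,
                    "left_closed": left_closed}))
```

Values that fall outside all bins get `NaN` as bin index, which also turns the result into a float array.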

#### Or even quicker:

In [None]:
ebins = [1, 5, 10, 20, 100]
neutrinos['ebin'] = pd.cut(neutrinos.energy, ebins, labels=False)
neutrinos.hist('proba_cscd', by='ebin', bins=50);

## Exercise 7

Create a 2D histogram showing the distribution of the `x` and `y` values of the starting positions (`pos_x` and `pos_y`) of the neutrinos. This is basically a 2D plane of the starting positions.

### Solution to exercise 7

In [None]:
fig, ax = plt.subplots()
counts, xedges, yedges, im = ax.hist2d(data.pos_x, data.pos_y, bins=100, cmap='viridis')
ax.set_xlabel('x [m]')
ax.set_ylabel('y [m]')
ax.set_title('2D Plane')
ax.axis('equal')
fig.colorbar(im);

## Exercise 8

Check out `seaborn` (`import seaborn as sns`) and recreate the 2D histogram from Exercise 7.

### Solution to exercise 8

In [None]:
sns.displot(data, x="pos_x", y="pos_y", cbar=True);

In [None]:
sns.jointplot(data=data, x="pos_x", y="pos_y", s=2, alpha=0.2)

## Exercise 9

Create two histograms of the `azimuth` and `zenith` distribution side by side, in one plot (two subplots).

Try both the built-in plotting wrapper of `pandas` and plain Matplotlib.

In [None]:
data.head(2)

### Solution to exercise 9

In [None]:
data.hist(['azimuth', 'zenith'], bins=100, figsize=(10, 3));

#### Solution using matplotlib

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 3))

for idx, column in enumerate(['azimuth', 'zenith']):
    data[column].hist(bins=100, ax=axes[idx])  # zenith=0 is coming from above
    axes[idx].set_xlabel(column + ' [rad]')
    axes[idx].set_ylabel('count')

## Exercise 10

Split the data into two groups: `upgoing` and `downgoing`, based on the `zenith` value (`zenith == 0` is coming from above).

Try out `sns.stripplot` to verify your "cut" on the data!

### Solution to exercise 10

Here, we add a new column to our dataset which contains True/False for each entry, depending on its zenith direction.

In [None]:
data['upgoing'] = data.zenith > np.pi/2  # zenith == 0 is coming from above, i.e. downgoing

In [None]:
data_by_upgoing = data.groupby('upgoing')
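
What a grouped object like this contains can be inspected on a tiny, made-up frame. The sketch below is an illustration only: `toy` and the column name `from_above` are hypothetical and not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical zenith angles in radians
toy = pd.DataFrame({"zenith": [0.1, 1.0, 2.0, 3.0]})
toy["from_above"] = toy.zenith < np.pi / 2  # boolean column used as group key

# Iterating over a groupby object yields (key, sub-DataFrame) pairs
for flag, group in toy.groupby("from_above"):
    print(flag, group.zenith.tolist())

# size() counts the entries per group
counts = toy.groupby("from_above").size()
print(counts)
```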

Seaborn does the grouping itself when given the column name, so we pass the ungrouped `DataFrame`:

In [None]:
sns.stripplot(x="upgoing", y="zenith", data=data);

## Exercise 11

Create a combined histogram (two histograms overlaid in the same plot) for both `upgoing` and `downgoing` datasets, showing the `zenith` angle.

### Solution to exercise 11

In [None]:
fig, ax = plt.subplots()

for upgoing, sub_data in data_by_upgoing:
    sub_data.hist('zenith', ax=ax, bins=100,
                  label='upgoing' if upgoing else 'downgoing',
                  histtype='step', linewidth=2)
ax.legend();