# Pandas Exercises

Original author: Tamás Gál (tamas.gal@fau.de)

Adapted from [https://github.com/escape2020/school2021](https://github.com/escape2020/school2021) under MIT license.


In [None]:
## If working on Google Colab, uncomment these lines and run the cell

# from urllib import request
# from pathlib import Path
# Path('data').mkdir(exist_ok=True)

# request.urlretrieve('https://raw.githubusercontent.com/vuillaut/info801/main/pandas/data/neutrinos.csv',
#                    'data/neutrinos.csv',
#                    )
# request.urlretrieve('https://raw.githubusercontent.com/vuillaut/info801/main/pandas/data/reco.csv',
#                    'data/reco.csv',
#                    )


In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
# import seaborn as sns
import matplotlib as ml
import matplotlib.pyplot as plt  # explicit import; `ml.pyplot` is only available once the submodule is loaded
import sys
ml.rcParams['figure.figsize'] = (10.0, 5.0)

print(f"Python version: {sys.version}\n"
      f"Pandas version: {pd.__version__}\n"
      f"NumPy version: {np.__version__}\n"
      f"Matplotlib version: {ml.__version__}\n"
      # f"seaborn version: {sns.__version__}",
     )

## Exercise 1

Use the `pd.read_csv()` function to create a `DataFrame` from the dataset `data/neutrinos.csv`.

### Problems encountered

- the first few lines represent a plain header and need to be skipped
- comments are indicated with `$` at the beginning of the line
- the column separator is `:`
- the decimal delimiter is `,`
- the index column is the first one
- there is a footer to be excluded
- footer exclusion (`skipfooter`) only works with the Python engine (`engine='python'`)
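
To make the options listed above concrete, here is a minimal sketch on a made-up, in-memory file. The file contents and the column names `event_id`, `energy` and `zenith` are assumptions chosen for illustration; the real `data/neutrinos.csv` has a longer header and different columns.

```python
import io

import pandas as pd

# Hypothetical file contents mimicking the quirks listed above
raw = (
    "Toy neutrino file\n"            # plain header line (skipped)
    "generated for illustration\n"   # plain header line (skipped)
    "event_id:energy:zenith\n"       # column names, ':' as separator
    "$ this line is a comment\n"     # comment line starting with '$'
    "0:1,5:0,25\n"                   # ',' as decimal delimiter
    "1:12,0:2,75\n"
    "END OF FILE\n"                  # footer (excluded)
)

toy_df = pd.read_csv(
    io.StringIO(raw),
    engine="python",  # needed for skipfooter
    skiprows=2,       # two plain header lines in this toy file
    sep=":",
    decimal=",",
    comment="$",
    index_col=0,
    skipfooter=1,
)
print(toy_df)
print(toy_df.dtypes)
```

Checking `dtypes` afterwards is a quick way to confirm that the decimal delimiter was picked up: a column that stays `object` usually means the numbers were read as strings.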

In [None]:
neutrinos = pd.read_csv('data/neutrinos.csv', 
            engine='python', 
            on_bad_lines='skip',
            skiprows=5,
            sep=':',
            decimal=',',
            comment='$',
            index_col=0,
            skipfooter=1
           )
neutrinos.dtypes

## Exercise 2

Create a histogram of the neutrino energies.

In [None]:
neutrinos.hist(column='energy', bins=100)
# neutrinos['energy'].hist()

## Exercise 3

Use the `pd.read_csv()` function to create a `DataFrame` from the dataset `data/reco.csv`.

In [None]:
reco = pd.read_csv('data/reco.csv', index_col=0)

reco.describe()

## Exercise 4

Combine the `neutrinos` and `reco` `DataFrames` into a single `DataFrame` using `pd.concat()`.

In [None]:
df = pd.concat([neutrinos, reco.add_prefix('reco_')], axis='columns')
df

## Exercise 5

Make a scatter plot to visualise the zenith reconstruction quality.



In [None]:
plt.scatter(df['zenith'], df['reco_zenith'], s=1, alpha=0.05)


In [None]:
df.plot(x='zenith', y='reco_zenith', style='.')

In [None]:
plt.hist2d(df['zenith'], df['reco_zenith'], bins=100);

## Exercise 6

Create a histogram of the cascade probabilities (__`neutrinos`__ dataset: `proba_cscd` column) for the energy ranges 1-5 GeV, 5-10 GeV, 10-20 GeV and 20-100 GeV.

In [None]:
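# One step histogram per energy range (in GeV), all drawn on the same axes
mask = (neutrinos.energy >= 1) & (neutrinos.energy < 5)
neutrinos[mask].proba_cscd.hist(histtype='step')

mask = (neutrinos.energy >= 5) & (neutrinos.energy < 10)
neutrinos[mask].proba_cscd.hist(histtype='step')

mask = (neutrinos.energy >= 10) & (neutrinos.energy < 20)
neutrinos[mask].proba_cscd.hist(histtype='step')

mask = (neutrinos.energy >= 20) & (neutrinos.energy < 100)
neutrinos[mask].proba_cscd.hist(histtype='step')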


In [None]:
ebins = [0, 5, 10, 20, 100]
bin_index = pd.cut(neutrinos.energy, ebins, labels=False).values
bin_index

In [None]:
bin_index==0

In [None]:
for i in set(bin_index):
    print(i)
    plt.hist(neutrinos.proba_cscd[bin_index==i], label=i, bins=30, histtype='step')
plt.legend()

## Exercise 7

Create a 2D histogram showing the distribution of the `x` and `y` values of the starting positions (`pos_x` and `pos_y`) of the neutrinos. The result is effectively a map of the starting positions in the x-y plane.
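
One possible approach, sketched on synthetic data so that it runs without the CSV files: the `toy` frame below is a hypothetical stand-in for `neutrinos`, and the Gaussian spread of the positions is an arbitrary assumption.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for `neutrinos` (assumption: numeric `pos_x`, `pos_y` columns)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "pos_x": rng.normal(0, 50, size=1000),
    "pos_y": rng.normal(0, 50, size=1000),
})

fig, ax = plt.subplots()
# hist2d returns the bin counts, both sets of bin edges and the drawn image
counts, xedges, yedges, image = ax.hist2d(toy["pos_x"], toy["pos_y"], bins=50)
fig.colorbar(image, ax=ax, label="events per bin")
ax.set_xlabel("pos_x")
ax.set_ylabel("pos_y")
ax.set_aspect("equal")  # same scale on both axes, so the plane is not distorted
```

With the real data, replacing `toy` by `neutrinos` should be the only change needed.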

## Exercise 8

Check out `seaborn` (`import seaborn as sns`) and recreate the 2D histogram from Exercise 7.

## Exercise 9

Create two histograms of the `azimuth` and `zenith` distribution side by side, in one plot (two subplots).

Try the built-in `pandas` matplotlib wrapper as well as plain matplotlib.
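
Both variants could look roughly like this. The synthetic `toy` frame is a hypothetical stand-in for `neutrinos`, and the angle ranges (radians, azimuth in [0, 2π), zenith in [0, π]) are assumptions made for the example.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for `neutrinos` (assumption: angles given in radians)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "azimuth": rng.uniform(0, 2 * np.pi, size=1000),
    "zenith": rng.uniform(0, np.pi, size=1000),
})

# Variant 1: plain matplotlib, one Axes per column
fig, (ax_az, ax_zen) = plt.subplots(ncols=2, figsize=(10, 4))
ax_az.hist(toy["azimuth"], bins=30)
ax_az.set_xlabel("azimuth")
ax_zen.hist(toy["zenith"], bins=30)
ax_zen.set_xlabel("zenith")

# Variant 2: the pandas wrapper creates its own figure and subplots
axes = toy.hist(column=["azimuth", "zenith"], bins=30,
                layout=(1, 2), figsize=(10, 4))
```

The matplotlib variant gives direct control over each `Axes`, while the pandas call is shorter when the defaults are good enough.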

## Exercise 10

Split the data into two groups: `upgoing` and `downgoing`, based on the `zenith` value (`zenith == 0` is coming from above).

Try out `sns.stripplot` to verify your "cut" on the data!
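
A sketch of one way to do the split. It assumes that `zenith` is given in radians, so that values above π/2 correspond to particles arriving from below (upgoing); if the real column is in degrees, the threshold would be 90 instead. The `toy` frame and the `direction` column are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for `neutrinos` (assumption: zenith in radians, 0 = from above)
rng = np.random.default_rng(42)
toy = pd.DataFrame({"zenith": rng.uniform(0, np.pi, size=500)})

# Boolean mask: True for events arriving from below the horizon
is_up = toy["zenith"] > np.pi / 2
upgoing = toy[is_up]
downgoing = toy[~is_up]

# Label every event so that stripplot can draw one strip per group
labelled = toy.assign(direction=np.where(is_up, "upgoing", "downgoing"))
ax = sns.stripplot(data=labelled, x="direction", y="zenith", size=2)
```

If the cut is correct, the two strips do not overlap along the `zenith` axis and meet at the threshold.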

## Exercise 11

Create a combined histogram (two histograms overlaid in the same plot) for both `upgoing` and `downgoing` datasets, showing the `zenith` angle.
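
A possible sketch, again on synthetic data and under the assumption that `zenith` is in radians with π/2 as the boundary between the two groups. Using the same bin edges for both histograms keeps them directly comparable.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for `neutrinos` (assumption: zenith in radians, 0 = from above)
rng = np.random.default_rng(42)
toy = pd.DataFrame({"zenith": rng.uniform(0, np.pi, size=500)})
upgoing = toy[toy["zenith"] > np.pi / 2]
downgoing = toy[toy["zenith"] <= np.pi / 2]

# Shared bin edges so that both histograms line up
bins = np.linspace(0, np.pi, 41)

fig, ax = plt.subplots()
ax.hist(downgoing["zenith"], bins=bins, alpha=0.5, label="downgoing")
ax.hist(upgoing["zenith"], bins=bins, alpha=0.5, label="upgoing")
ax.set_xlabel("zenith")
ax.legend()
```

The `alpha` transparency only matters where the histograms overlap; with a clean cut they sit side by side.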