# Introduction to pandas

* **pandas** is a Python package providing convenient data structures to work with labelled data.

* **pandas** is well suited to observational / statistical data sets and has many similarities with Excel spreadsheets.

* Key features:

    - easy handling of **missing data**
    - **size mutability**: columns can be inserted into and deleted from a DataFrame
    - automatic and explicit **data alignment**: objects can be explicitly aligned to a set of labels, or you can ignore the labels and let Series, DataFrame, etc. align the data automatically in computations
    - powerful, flexible **group by** functionality to perform split-apply-combine operations on data sets
    - **easy conversion** of ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
    - intelligent label-based **slicing**, **fancy indexing**, and **subsetting** of large data sets
    - intuitive **merging** and **joining** of data sets
    - flexible **reshaping** and pivoting of data sets
    - **hierarchical** labeling of axes (possible to have multiple labels per tick)
    - robust IO tools for loading data from **flat files** (CSV and delimited), Excel files, and databases, and for saving / loading data in the fast HDF5 format
    - **time series**-specific functionality
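
To make the alignment and missing-data features above concrete, here is a minimal sketch. The two `Series` and their values are made up for illustration and are not part of the tutorial data.

```python
import pandas as pd

# Two Series with partially overlapping labels (made-up values)
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["y", "z", "w"])

# Addition aligns on labels, not on position;
# labels present in only one Series give NaN
total = a + b
print(total)

# Missing values can be dropped, or avoided by supplying a fill value
print(total.dropna())
print(a.add(b, fill_value=0))
```

Only the labels `y` and `z` exist in both inputs, so only those two entries of `total` are numbers.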

### Primary data structures of pandas
* **Series** (1-dimensional)
* **DataFrame** (2-dimensional)

pandas is built on top of **NumPy** and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.

## Loading data

First, we import the pandas module. We use the conventional alias `pd` to keep the code short.

In [None]:
import pandas as pd

We also import the `os` module, which is useful for building paths to files (among many other things), along with `numpy` and `matplotlib`, which we use later on.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import os

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
fname = '../data/ship_ctd_short.csv'

Let's read the data using the `pandas.read_csv()` function.

In [None]:
ctd_data = pd.read_csv(fname)

## Data structures: `DataFrame` and `Series`

Let's interrogate the `DataFrame` object!

In [None]:
type(ctd_data)

In [None]:
# Internal nature of the object
print(ctd_data.shape)
print()
print(ctd_data.dtypes)

In [None]:
# View just the first rows of data
ctd_data.head(5)

In [None]:
# View the last rows of data
ctd_data.tail(n=2)  # Note the optional argument (available for head() too)

Get descriptors for the **vertical** axis (axis=0):

In [None]:
ctd_data.index

Get descriptors for the horizontal axis (axis=1):

In [None]:
ctd_data.columns

A lot of information at once including memory usage:

In [None]:
ctd_data.info()

### Series, pandas' 1D data container

A series can be constructed with the `pd.Series` constructor (passing an array of values) or from a `DataFrame`, by extracting one of its columns.

In [None]:
temp = ctd_data['Temperature']
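
The constructor route mentioned above can be sketched as follows. The values, the depth labels and the name are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings, labelled by depth in metres
values = np.array([12.5, 11.8, 10.2])
s = pd.Series(values, index=[0, 10, 20], name="Temperature")

print(s)
print(s.index)  # the labels attached to the values
print(s.name)
```

If `index` is omitted, pandas labels the values with the integers 0, 1, 2, and so on.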

Some of its attributes:

In [None]:
print(type(temp))
print(temp.dtype)
print(temp.shape)
print(temp.nbytes)

Show me what you got!

In [None]:
# uncomment to see the values
# temp

### NumPy as pandas' backend

It is always possible to fall back to a good old NumPy array to pass on to scientific libraries that need one: SciPy, scikit-learn, etc.

In [None]:
ctd_data['Temperature'].values

In [None]:
type(ctd_data['Temperature'].values)
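
Since pandas 0.24, `Series.to_numpy()` is also available, and the pandas documentation recommends it over `.values` when a NumPy array is what you need. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.5, 2.5, 3.5], name="Temperature")  # made-up values

arr = s.to_numpy()  # plain NumPy array; the labels are dropped
print(type(arr), arr.dtype)

# NumPy functions accept the array directly
print(np.mean(arr))
```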

## Cleaning data

<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">The truth about data science: cleaning your data is 90% of the work. Fitting the model is easy. Interpreting the results is the other 90%.</p>&mdash; Jake VanderPlas (@jakevdp) <a href="https://twitter.com/jakevdp/status/742406386525446144">June 13, 2016</a></blockquote>

In [None]:
list(ctd_data)

### Renaming columns

If we know the units of the variables, we can rename the columns to include these units.

In [None]:
ctd_data.columns = ['Depth_(m)', 'Temperature_(C)', 'Oxygen_(ml/l)', 'Irradiance', 'Salinity_(psu)']
ctd_data.columns

Note the use of underscores `_` in the new names. Putting spaces in column names can cause problems down the line.
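
Assigning to `.columns` relies on the order of the columns. An alternative that renames by name is `DataFrame.rename()`; the sketch below uses a small stand-in table with hypothetical column names and values.

```python
import pandas as pd

# Small stand-in table; the column names and values are hypothetical
df = pd.DataFrame({"Depth": [0, 10], "Temperature": [12.5, 11.8]})

# Map old names to new ones; columns not in the mapping are left alone
renamed = df.rename(columns={"Depth": "Depth_(m)",
                             "Temperature": "Temperature_(C)"})
print(list(renamed.columns))
```

`rename()` returns a new `DataFrame` and leaves the original untouched.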

### Deleting columns

Let us concentrate on temperature, salinity and oxygen, and delete irradiance from the data frame.

In [None]:
# Passing the axis positionally is no longer accepted by recent pandas
ctd_data = ctd_data.drop(columns='Irradiance')

In [None]:
ctd_data.head()

It makes more sense to use depth as the index, because the other variables are expected to vary with depth. Make it so!

In [None]:
ctd_data.set_index('Depth_(m)', inplace=True)
ctd_data
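
With depth as the index, rows can be selected by depth label through `.loc`. A minimal sketch on a stand-in table (the values are made up):

```python
import pandas as pd

# Stand-in for the CTD table (made-up values)
df = pd.DataFrame({"Depth_(m)": [0, 10, 20, 30],
                   "Temperature_(C)": [12.5, 11.8, 10.2, 9.1]})
df = df.set_index("Depth_(m)")

# Label-based selection: the row at 10 m
print(df.loc[10])

# Label-based slicing includes both end points
print(df.loc[10:20])
```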

## Basic visualisation

### Exercise

Try calling the `plot()` method of the `ctd_data` object:

In [None]:
ctd_data.plot()

What happens if you pass `subplots=True` as an argument to the `plot()` method?

In [None]:
# ctd_data.plot( ... )

It is easy to create other useful plots using `DataFrame`:

In [None]:
fig, (ax0, ax1) = plt.subplots(ncols=2,figsize=(8,4))
ctd_data.boxplot(ax=ax0, column=['Salinity_(psu)'])
ctd_data.boxplot(ax=ax1, column=['Oxygen_(ml/l)']);

As well as just a simple line plot:
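
A minimal sketch of such a line plot, assuming a depth-indexed table like `ctd_data`. The stand-in values below are made up, and drawing depth on an inverted y-axis is one common convention for profiles rather than something this tutorial prescribes.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the CTD table (made-up values), indexed by depth
profile = pd.DataFrame({"Temperature_(C)": [12.5, 11.8, 10.2, 9.1]},
                       index=pd.Index([0, 10, 20, 30], name="Depth_(m)"))

fig, ax = plt.subplots(figsize=(4, 4))
# Put the variable on the x-axis and depth on the y-axis
ax.plot(profile["Temperature_(C)"].to_numpy(), profile.index.to_numpy())
ax.invert_yaxis()  # depth increases downwards
ax.set_xlabel("Temperature_(C)")
ax.set_ylabel(profile.index.name)
```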

## Some statistics

In [None]:
ctd_data.describe()

Much easier than calling them individually with NumPy!

## Computing correlations

Both `Series` and `DataFrame` objects have a **`corr()`** method to compute the correlation coefficient.

If series are already grouped into a `DataFrame`, computing all correlation coefficients is trivial:

In [None]:
ctd_data.corr()
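
For two individual `Series`, `corr()` takes the other series as its argument. A minimal sketch with made-up columns, where `y` rises exactly with `x` and `z` falls exactly with it:

```python
import pandas as pd

# Made-up values: y and z are exact linear functions of x
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0],
                   "y": [2.0, 4.0, 6.0, 8.0],
                   "z": [4.0, 3.0, 2.0, 1.0]})

# Pearson correlation between two individual Series
r_xy = df["x"].corr(df["y"])
r_xz = df["x"].corr(df["z"])
print(r_xy, r_xz)

# The same numbers appear in the full correlation matrix
print(df.corr().loc["x", "y"])
```

Both calls use the Pearson coefficient by default; other methods can be requested through the `method` argument.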

If you want to visualise this correlation matrix, uncomment the following code cell.

In [None]:
#fig, ax = plt.subplots()
#p = ax.imshow(ctd_data.corr(), interpolation="nearest", cmap='RdBu_r', vmin=-1, vmax=1)
#ax.set_xticks(np.arange(len(ctd_data.corr().columns)))
#ax.set_yticks(np.arange(len(ctd_data.corr().index)))
#ax.set_xticklabels(ctd_data.corr().columns)
#ax.set_yticklabels(ctd_data.corr().index)
#fig.colorbar(p)

## Creating DataFrames

* A `DataFrame` can also be created manually, by grouping several `Series` together.
* Now just for fun we switch to **another dataset**
    - create 2 `Series` objects from 2 CSV files
    - create a `DataFrame` by combining the two `Series`

* Data are monthly values of
    - Southern Oscillation Index (SOI)
    - Outgoing Longwave Radiation (OLR), which is a proxy for convective precipitation in the western equatorial Pacific
* Data were downloaded from NOAA's website: https://www.ncdc.noaa.gov/teleconnections/

In [None]:
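soi_df = pd.read_csv('../data/soi.csv', skiprows=1, index_col=0, na_values=-999.9)
# Dates are stored as YYYYMM; convert them after loading (pd.datetime is not available in recent pandas)
soi_df.index = pd.to_datetime(soi_df.index.astype(str), format='%Y%m')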

In [None]:
soi_df.head()

In [None]:
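olr_df = pd.read_csv('../data/olr.csv', skiprows=1, index_col=0, na_values=-999.9)
# Dates are stored as YYYYMM; convert them after loading (pd.datetime is not available in recent pandas)
olr_df.index = pd.to_datetime(olr_df.index.astype(str), format='%Y%m')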

In [None]:
olr_df.head()
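
To see how the `YYYYMM` dates and the `-999.9` missing-value marker are handled, here is a minimal sketch on an in-memory stand-in for such a file. The file layout (a title line, then a `Date,Value` header) and the values are assumptions made for illustration.

```python
import io
import pandas as pd

# In-memory stand-in for the CSV file; layout and values are assumptions
text = """Title line to skip
Date,Value
195101,1.5
195102,-999.9
195103,-0.2
"""

demo = pd.read_csv(io.StringIO(text), skiprows=1, index_col=0,
                   na_values=-999.9)
# Convert integer labels such as 195101 into month-start timestamps
demo.index = pd.to_datetime(demo.index.astype(str), format="%Y%m")
print(demo)
```

The second row is read as `NaN`, and the index becomes a `DatetimeIndex`, which makes date-based selection possible.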

In [None]:
df = pd.DataFrame({'OLR': olr_df.Value,
                   'SOI': soi_df.Value})
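
When a `DataFrame` is built from a dictionary of `Series`, the columns are aligned on their index labels. A minimal sketch with two made-up monthly series that cover different periods:

```python
import pandas as pd

# Two monthly Series covering different periods (made-up values)
s1 = pd.Series([1.0, 2.0, 3.0],
               index=pd.date_range("2000-01-01", periods=3, freq="MS"))
s2 = pd.Series([10.0, 20.0, 30.0],
               index=pd.date_range("2000-02-01", periods=3, freq="MS"))

# Columns are aligned on the union of the two indexes;
# months missing from one Series are filled with NaN
combined = pd.DataFrame({"A": s1, "B": s2})
print(combined)
```

So if the two series do not cover exactly the same months, the resulting table contains `NaN` wherever one of them has no value.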

In [None]:
# df.describe()

## Ordinary Least Squares (OLS) regressions

### Primitive way: using NumPy's polynomial fitting

In [None]:
from numpy.polynomial import polynomial as P

In [None]:
x = df['OLR'].values
y = df['SOI'].values

In [None]:
idx = np.isfinite(x) & np.isfinite(y)

In [None]:
coefs, stats = P.polyfit(x[idx], y[idx], 1, full=True)

In [None]:
y2 = P.polyval(x, coefs)

In [None]:
plt.plot(x, y, linestyle='', marker='o')
plt.plot(x, y2)
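
One detail worth checking is the order of the fitted coefficients: `numpy.polynomial.polynomial.polyfit` returns them lowest degree first, unlike the older `np.polyfit`. A minimal sketch on synthetic points that lie on a known line:

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Synthetic data on an exact line: y = 2 + 3 * x
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = 2.0 + 3.0 * x_demo

# Coefficients come back lowest degree first: [intercept, slope]
intercept, slope = P.polyfit(x_demo, y_demo, 1)
print(intercept, slope)

# polyval expects the same ordering
y_at_4 = P.polyval(4.0, [intercept, slope])
print(y_at_4)
```

In the regression above, `coefs[0]` is therefore the intercept and `coefs[1]` the slope of SOI against OLR.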

### Recommended (and more convenient) ways (require additional packages)

#### Statsmodels

In [None]:
# import statsmodels.formula.api as sm

In [None]:
# sm_model = sm.ols(formula="SOI ~ OLR", data=df).fit()

In [None]:
# df['SOI'].plot()
# df['OLR'].plot()
# ax = sm_model.fittedvalues.plot(label="model prediction")
# ax.legend(loc="lower center", ncol=3)

More examples: https://jakevdp.github.io/PythonDataScienceHandbook/05.06-linear-regression.html

## Exercise: rolling functions

**1. Subset data**

* Start by subsetting the SOI `DataFrame`
* Use either numerical indices, or, even better, datetime indices

In [None]:
# your code here

**2. Plot the subset data**

* You can create figure and axis using `matplotlib.pyplot`
* Or just use the `plot()` method of pandas `DataFrame`

In [None]:
# your code here

**3. Explore what the `rolling()` method does**

* What does this method return?

In [None]:
# df.rolling?

In [None]:
# your code here

**4. Plot the original series and the smoothed series**

In [None]:
# your code here

## References
* https://github.com/jonathanrocher/pandas_tutorial
* https://github.com/koldunovn/python_for_geosciences
* http://pandas.pydata.org/pandas-docs/stable/index.html#module-pandas
* http://pandas.pydata.org/pandas-docs/stable/10min.html