# Introduction to pandas

**pandas** is a Python package providing convenient data structures to work with labelled data.

pandas is well suited to tabular observational and statistical data sets, such as those typically kept in Excel spreadsheets.

### Key features:

* easy handling of **missing data**
* **size mutability**: columns can be inserted into and deleted from a DataFrame
* automatic and explicit **data alignment**: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
* powerful, flexible **group by** functionality to perform split-apply-combine operations on data sets
* **easy conversion** of ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
* intelligent label-based **slicing**, **fancy indexing**, and **subsetting** of large data sets
* intuitive **merging** and **joining** of data sets
* flexible **reshaping** and pivoting of data sets
* **hierarchical** labeling of axes (possible to have multiple labels per tick)
* robust IO tools for loading data from **flat files** (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
* **time series**-specific functionality

### Primary data structures of pandas
* **Series** (1-dimensional)
* **DataFrame** (2-dimensional)

pandas is built on top of **NumPy** and is intended to integrate well within a scientific computing environment with many other third-party libraries.
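To make the two data structures and the label-based alignment mentioned above more concrete, here is a minimal sketch. The labels and numbers are made up for illustration and have nothing to do with the air quality data used later.

```python
import numpy as np
import pandas as pd

# A Series is a 1-dimensional labelled array (values here are made up)
s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

# Arithmetic aligns on labels; labels present in only one operand give NaN
total = s1 + s2

# A DataFrame is a 2-dimensional table whose columns are Series
frame = pd.DataFrame({'s1': s1, 's2': s2})

# The underlying data can be obtained as a NumPy array
values = frame.to_numpy()

print(total)
print(frame)
print(type(values), values.shape)
```

Note that nothing had to be reindexed by hand: both the addition and the DataFrame constructor work on the union of the labels.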

## Loading data

Exercise: let us now use pandas to read a CSV file that contains observational air quality data from one of the monitoring sites in London. The data are hourly measurements of ozone (O3), nitrogen oxides (NOx), carbon monoxide (CO) and PM10 particulate matter.

In [None]:
import pandas as pd

In [None]:
import matplotlib.pyplot as plt
import os
%matplotlib inline

In [None]:
data_path = os.path.join(os.path.pardir, 'data')
fname = os.path.join(data_path, 'air_quality_hourly_london_marylebone.csv')

In [None]:
# Read data
df = pd.read_csv(fname, sep=',', header=4, skipfooter=4, na_values='No data', parse_dates=[0], engine='python')

**Q**: What happens if you remove the `header` argument? What about `skipfooter` or `engine`?
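If the real file is not at hand, the same `read_csv` arguments can be tried on a small in-memory text. The sketch below is only an illustration: the file contents are invented, and the numbers of preamble and footer lines are assumptions that differ from the real data file.

```python
import io
import pandas as pd

# Made-up file contents: 2 preamble lines, a header row, 3 data rows, 1 footer line
raw = "\n".join([
    "Synthetic example file",
    "Not real measurements",
    "Date,Ozone",
    "2015-01-01,10.5",
    "2015-01-02,No data",
    "2015-01-03,12.0",
    "End of file",
])

demo = pd.read_csv(
    io.StringIO(raw),
    header=2,             # row 2 (0-based) holds the column names
    skipfooter=1,         # drop the trailing non-data line
    na_values='No data',  # treat this marker as a missing value
    parse_dates=[0],      # parse the first column as dates
    engine='python',      # skipfooter is not supported by the default C engine
)

print(demo)
print(demo.dtypes)
```

Changing one argument at a time on such a small example makes it easier to see what each of them does.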

## Data structures: `DataFrame` and `Series`

In [None]:
# View data
df.head()

In [None]:
df.tail()

### NumPy as the pandas backend

Try to use tab completion with column names:

In [None]:
df.Ozone.values

## Creating DataFrames
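
As a minimal sketch, a DataFrame can be built from a dictionary of columns or from a 2-D NumPy array. All names and values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a dictionary of columns (values are made up for illustration)
from_dict = pd.DataFrame({
    'site': ['A', 'B', 'C'],
    'O3': [41.0, 38.5, np.nan],
})

# From a 2-D NumPy array with explicit column names and a date index
from_array = pd.DataFrame(
    np.arange(6.0).reshape(3, 2),
    columns=['NOx', 'CO'],
    index=pd.date_range('2015-01-01', periods=3, freq='D'),
)

print(from_dict)
print(from_array)
```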

## Cleaning data

### Renaming columns

### Deleting columns

### Setting missing values

### Choosing index

It seems that autocompletion does not work when a column name contains spaces, so let us rename the columns for later convenience:

In [None]:
# General pattern (NO is a hypothetical DataFrame, not defined in this notebook):
# NO.columns = ['date', 'time', 'no', 'status']

In [None]:
# Old column names
df.columns

In [None]:
# New column names
df.columns = ['date', 'hour', 'O3', 'O3_status', 'NOx', 'NOx_status', 'CO', 'CO_status', 'PM10', 'PM10_status', 'Co', 'Co_status']
df.columns
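
Assigning a full list to `df.columns` requires listing every column in the right order. As an alternative sketch, `rename` accepts a mapping from old to new names, so only the columns of interest need to be mentioned. The column names below are hypothetical and are not taken from the data file.

```python
import pandas as pd

# Hypothetical column names containing spaces
demo = pd.DataFrame({'End Date': ['2015-01-01'], 'Nitrogen oxides': [120.0]})

# Names with spaces are only reachable with bracket notation
print(demo['Nitrogen oxides'])

# rename() takes an old-name -> new-name mapping and returns a new DataFrame
renamed = demo.rename(columns={'End Date': 'date', 'Nitrogen oxides': 'NOx'})

# The new names are valid Python identifiers, so attribute access works
print(renamed.NOx)
```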

As you can see, some ozone concentrations are negative, which is not physically meaningful. So, let us replace those negative values with NaN:

In [None]:
# Replace negative ozone values with NaN (only in the O3 column, not in whole rows)
df.loc[df.O3 < 0, 'O3'] = float('nan')
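
The difference between masking whole rows and masking a single column is easy to check on a tiny table. The following sketch uses invented values and shows the `.loc[row_mask, column]` pattern, which leaves the other columns untouched.

```python
import numpy as np
import pandas as pd

# Made-up values: two of the four O3 readings are negative
demo = pd.DataFrame({
    'O3': [12.0, -3.0, 25.0, -1.0],
    'NOx': [80.0, 95.0, 60.0, 70.0],
})

# The boolean mask selects the rows; the column label restricts the assignment to 'O3'
demo.loc[demo['O3'] < 0, 'O3'] = np.nan

print(demo)
```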

Let us focus on the first four chemical species and remove the cobalt data from our DataFrame:

In [None]:
# The positional axis argument of drop() was removed in pandas 2.0; name the columns explicitly
df = df.drop(columns=['Co', 'Co_status'])

In [None]:
df.head()

In [None]:
# Useful trick for plotting titles
df.NOx.name

## Basic visualisation

## Saving data
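
A minimal sketch of saving a DataFrame to CSV and reading it back. To keep the example self-contained it writes to an in-memory buffer instead of a file on disk, and the data are made up; with a real file, a path would be passed to `to_csv` and `read_csv` instead of the buffer.

```python
import io
import pandas as pd

# Made-up data with a named date index and one missing value
demo = pd.DataFrame(
    {'O3': [12.0, None, 25.0]},
    index=pd.date_range('2015-01-01', periods=3, freq='D', name='date'),
)

# Write to an in-memory text buffer; na_rep sets how missing values are written
buffer = io.StringIO()
demo.to_csv(buffer, na_rep='NaN')

# Rewind the buffer and read the data back, restoring the date index
buffer.seek(0)
restored = pd.read_csv(buffer, index_col='date', parse_dates=['date'])

print(restored)
```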

## Some statistics

In [None]:
# df.O3.describe()

### Rolling functions

In [None]:
# The ozone column is called 'O3' after the renaming above
# with plt.style.context('ggplot'):
#     df.O3.plot()
#     rolled_series = df.O3.rolling(window=100, center=False)
#     # print(rolled_series)
#     rolled_series.mean().plot(lw=2)
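
To see what `rolling(...).mean()` computes without plotting, here is a small sketch on a made-up series with a window of 3 instead of 100.

```python
import pandas as pd

# Made-up values; each window covers the current element and the two before it
series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# rolling() returns a window object; nothing is computed until an aggregation is called
rolled = series.rolling(window=3, center=False)
smoothed = rolled.mean()

# The first two entries are NaN because their windows are incomplete
print(smoothed)
```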

## Correlations and regressions

### Correlations
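
As a minimal sketch, `DataFrame.corr()` returns the matrix of pairwise Pearson correlation coefficients between numeric columns. The values below are invented so that the result is easy to predict; they are not real measurements.

```python
import pandas as pd

# Made-up values: CO rises with NOx, O3 falls as NOx rises
demo = pd.DataFrame({
    'NOx': [1.0, 2.0, 3.0, 4.0, 5.0],
    'CO': [2.0, 4.0, 6.0, 8.0, 10.0],
    'O3': [5.0, 4.0, 3.0, 2.0, 1.0],
})

# Pairwise Pearson correlation between all columns
corr = demo.corr()

print(corr)
```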

### Ordinary Least Squares (OLS) regressions

The recommended way to build ordinary least squares regressions is by using `statsmodels`.

In [None]:
# smf is the conventional alias for the formula API (sm usually refers to statsmodels.api)
import statsmodels.formula.api as smf
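
A minimal sketch of an OLS fit with the `statsmodels` formula interface, assuming `statsmodels` is installed. The data are synthetic (a straight line with a small perturbation) and the column names `x` and `y` are placeholders, not columns of the air quality data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y = 2 * x + 1 plus a small alternating perturbation
x = np.arange(10.0)
noise = np.tile([0.1, -0.1], 5)
demo = pd.DataFrame({'x': x, 'y': 2.0 * x + 1.0 + noise})

# The formula 'y ~ x' regresses y on x; an intercept is included by default
results = smf.ols('y ~ x', data=demo).fit()

# Fitted coefficients are stored in a Series indexed by term name
slope = results.params['x']
intercept = results.params['Intercept']

print(slope, intercept, results.rsquared)
```

Calling `results.summary()` prints a fuller report with standard errors and confidence intervals.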

## References
* https://github.com/jonathanrocher/pandas_tutorial
* http://pandas.pydata.org/pandas-docs/stable/index.html#module-pandas
* http://pandas.pydata.org/pandas-docs/stable/10min.html

* Data source: https://uk-air.defra.gov.uk/data/
* Site description: https://uk-air.defra.gov.uk/networks/site-info?uka_id=UKA00315