# Python Numeric Data Analysis - Pandas

* For when you have structured data: CSV files, spreadsheets, R dataframes, SQL tables.

* Like NumPy arrays, but with more sophisticated labelling of rows and columns.  Good at dealing with missing and messy data, heterogeneous data types and time series data.

* Clean up and explore data, prepare it for analysis.

* Analyse it, or pass it on to other systems (scikit-learn, TensorFlow, etc.).
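
As a rough illustration of the points above, here is a minimal sketch using a small made-up table (the column names and values are invented for this example). It mixes data types, contains a missing entry, and ends by handing a cleaned NumPy array on to whatever comes next.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: mixed types, with one value missing
readings = pd.DataFrame({
    "station": ["north", "south", "east", "west"],  # strings
    "temp": [11.5, np.nan, 9.8, 12.1],               # floats, one missing
    "visits": [3, 5, 2, 4],                          # integers
})

print(readings.dtypes)        # each column keeps its own type
print(readings.isna().sum())  # number of missing values per column

# Drop incomplete rows, then extract the numeric columns as a plain array
clean = readings.dropna()
features = clean[["temp", "visits"]].to_numpy()
print(features.shape)
```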

In [None]:
# The usual suspects
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
%matplotlib inline
plt.rcParams['figure.figsize'] = [14, 8]

# And pandas
import pandas as pd


## Understanding the Pandas classes

### Series

Like a NumPy array but with index labels:

In [None]:
s = pd.Series([1.1, 2.2, 3.3, 4.4])
s

In [None]:
s[1:3]

In [None]:
s = pd.Series([1.1, 2.2, 3.3, 4.4], index=["alice", "bob", "charles", "diana"])
s

In [None]:
s['charles']

In [None]:
s['alice':'charles':2]
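
One detail worth knowing about slices like the one above: a slice by *label* includes its end point, whereas a slice by *position* stops before it, as in plain Python. A small sketch using the same Series:

```python
import pandas as pd

s = pd.Series([1.1, 2.2, 3.3, 4.4], index=["alice", "bob", "charles", "diana"])

by_label = s["alice":"charles"]  # label slice: 'charles' is included
by_position = s.iloc[0:2]        # positional slice: stops before position 2

print(list(by_label.index))
print(list(by_position.index))
```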

In [None]:
s.index

In [None]:
s.values

In [None]:
populations = pd.Series({
    "London":8173941,
    "Birmingham":1085810,
    "Glasgow": 590507,
    "Liverpool": 552267, 
    "Bristol": 535907
})
populations

A lot like a Python dict...

In [None]:
for k in populations.keys():
    print(k)

but with ordering, and the power of NumPy arrays:

In [None]:
populations / 1000000

In [None]:
populations.std()

In [None]:
populations.idxmax()

In [None]:
populations[populations > 1000000]

cf.
`SELECT * FROM populations WHERE value > 1000000;`
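
It may help to see the filter in two steps. The comparison produces a boolean Series with the same index, and indexing with that Series keeps only the `True` entries. The second filter below, with its thresholds, is just an illustration of combining conditions:

```python
import pandas as pd

populations = pd.Series({
    "London": 8173941,
    "Birmingham": 1085810,
    "Glasgow": 590507,
    "Liverpool": 552267,
    "Bristol": 535907,
})

mask = populations > 1000000  # boolean Series, one entry per city
print(mask)

big = populations[mask]       # keeps the rows where the mask is True

# Conditions combine with & and |; each one needs its own parentheses
mid = populations[(populations > 540000) & (populations < 1000000)]
print(mid)
```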

Percentage of men in the population:

In [None]:
male_percent = pd.Series({
    "London": 49.12,
    "Birmingham": 49.42,
    "Leeds": 49.43,
    "Glasgow": 47.73,
    "Bristol": 49.59,
})
male_percent

Percentage of women:

In [None]:
100 - male_percent

In [None]:
female_pops = populations * (100 - male_percent) / 100.0
female_pops

In [None]:
female_pops[ female_pops.notnull() ].astype(int)
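
The missing values arise because arithmetic between two Series matches entries by *label*, not by position. A label that appears in only one of them gets `NaN` in the result. A cut-down sketch with two cities in each Series:

```python
import pandas as pd

populations = pd.Series({"London": 8173941, "Liverpool": 552267})
male_percent = pd.Series({"London": 49.12, "Leeds": 49.43})

# Only 'London' appears in both, so only 'London' gets a real value
female_pops = populations * (100 - male_percent) / 100.0
print(female_pops)

missing = female_pops[female_pops.isna()]  # labels with no match
clean = female_pops.dropna().astype(int)   # same effect as filtering on notnull()
print(clean)
```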

All pretty clever, but that's just the Series.  There's a more powerful class...

---

### DataFrames

![Dataframe](dataframe.png)

## A quick look at a dataframe

In [None]:
df = pd.read_excel("landmarks.xls", sheet_name="landmarks")
df

In [None]:
df.head()

In [None]:
df.columns

In [None]:
df.index

### Columns come first

It's *important* to appreciate that the first axis in a dataframe selects *columns*, not *rows*.

In [None]:
df['confidence']

In [None]:
type(df['confidence'])

In [None]:
df['confidence'].values.mean()

In [None]:
df['confidence'].mean()

In [None]:
df['confidence'][23:26]

Note that this is still a Series, so you need to respect its indexing:

In [None]:
df['confidence'][23:26][23]
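
To make that concrete, here is a sketch with a tiny made-up frame standing in for the landmarks data (the values are invented). The slice keeps the original labels, so the first element of the slice is found under its original label rather than under `0`:

```python
import pandas as pd

# Hypothetical stand-in for the landmarks data
frame = pd.DataFrame({"confidence": [0.91, 0.85, 0.97, 0.88, 0.93]})

subset = frame["confidence"][2:4]  # the rows at positions 2 and 3
print(subset.index.tolist())       # the labels are still 2 and 3

print(subset[2])       # lookup by label
print(subset.iloc[0])  # lookup by position: the same element

try:
    subset[0]
except KeyError:
    print("there is no label 0 in the subset")
```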

## A potential source of confusion

So the dataframe is indexed first by the columns:

    df['confidence']

In [None]:
for i in df.keys():
    print(i)

If you're used to NumPy arrays, note that you can't normally do:

    df[0]

unless there happens to be a *column* labelled `0`.

If you do want to get a *row* by number, there's an attribute for that:

In [None]:
df.iloc[2]  # returns a Series for the row

In [None]:
df.iloc[2]['y_0']

`iloc` can take other selectors, such as a slice, which will return a dataframe:

In [None]:
df.iloc[-20::2]

**IMPORTANT ALERT!**

Now, because slicing a few rows is a sufficiently common operation, as a shortcut you *can* index into a dataframe **using a slice** and it will return rows.

    df[10:20]

is basically equivalent to

    df.iloc[10:20]

Why am I emphasising this?  Because it's different from NumPy arrays and it confused me initially:

* You can use `df[:10]` to refer to the first 10 rows, but you can't use `df[0]` to refer to the first row.
* However, `df[0]` will sometimes work, and it will give you a column!
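
A minimal sketch of both cases, using small made-up frames (the names `named` and `numbered` are just for this example):

```python
import pandas as pd

# Hypothetical frame with named columns
named = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

first_rows = named[:2]  # a slice selects rows

try:
    named[0]            # a single key is looked up among the columns
    found = True
except KeyError:
    found = False       # there is no column labelled 0
print(found)

# With integer column labels (the default when no names are given),
# the same expression succeeds and returns a column
numbered = pd.DataFrame([[1, 4], [2, 5], [3, 6]])
first_column = numbered[0]
print(first_column)
```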

In [None]:
for p in df[:5]:
    print(p)

OK, are we happy?

## Back to the columns

If the column name is suitable, you can refer to it as an attribute:

In [None]:
df.confidence.max()

In [None]:
df[ ['frame', 'timestamp', 'confidence'] ].head()

In [None]:
df.describe()

# Some sample data - the Lab Weather Station

The lab has a weather station at https://www.cl.cam.ac.uk/research/dtg/weather/.

![dials](https://www.cl.cam.ac.uk/research/dtg/weather/images/current-dials.png?)

The data is collated into various downloadable files:

In [None]:
df = pd.read_csv('https://www.cl.cam.ac.uk/research/dtg/weather/weather-raw.csv', header=None)
df

Note that if the columns don't have names as headers, they'll be given numbers:

In [None]:
df[0]

Let's give the columns names:

In [None]:
df = pd.read_csv(
    'https://www.cl.cam.ac.uk/research/dtg/weather/weather-raw.csv', 
    names=[
        'timestamp', 'temp_dc','humidity', 'dewpoint_dc', 'pressure_mbar', 
        'mean_wind_speed_dk', 'ave_wind_bearing', 'sunshine_ch', 'rainfall_um', 'max_wind_dk'
    ],
    parse_dates=['timestamp'])

df.head(10)
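
If the weather server isn't reachable, the same two calls can be tried on an in-memory string. This is only a sketch: the CSV text below is invented and has just three of the columns.

```python
import io
import pandas as pd

# Made-up CSV text with no header row
csv_text = "2018-11-08 21:00:00,52,80\n2018-11-08 21:30:00,49,82\n"

# Without names, the columns are numbered
raw = pd.read_csv(io.StringIO(csv_text), header=None)
print(list(raw.columns))

# With names, and with the timestamp column parsed as dates
named = pd.read_csv(
    io.StringIO(csv_text),
    names=["timestamp", "temp_dc", "humidity"],
    parse_dates=["timestamp"],
)
print(named.dtypes)
```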

In [None]:
df.dtypes

In [None]:
df.timestamp.head()

In [None]:
df.temp_dc[:10000].plot();
# Or you can do plt.plot(df.temp_dc[:10000])

You can easily create new columns:

In [None]:
df['temp']            = df['temp_dc'] / 10.0
df['dewpoint']        = df['dewpoint_dc'] / 10.0
df['mean_wind_kts']   = df['mean_wind_speed_dk'] / 10.0
df['max_wind_kts']    = df['max_wind_dk'] / 10.0
df['sunshine_hours']  = df['sunshine_ch'] / 100.0
df['rainfall_mm']     = df['rainfall_um'] / 1000.0
df.tail()

For us, the time is more useful as an index:

In [None]:
df.set_index('timestamp', inplace=True)
df

In [None]:
df.index

Note `inplace=True`. Despite names like `set_index`, many methods return a new dataframe and leave the original unchanged by default.
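
A small sketch of the default behaviour, using a made-up two-row frame:

```python
import pandas as pd

# Hypothetical frame with a timestamp column
frame = pd.DataFrame({
    "timestamp": pd.to_datetime(["2018-11-08 21:00", "2018-11-08 21:30"]),
    "humidity": [80, 82],
})

indexed = frame.set_index("timestamp")  # returns a new frame

print(frame.columns.tolist())    # the original still has a 'timestamp' column
print(indexed.columns.tolist())  # in the new frame it has become the index
print(indexed.index.name)
```

Assigning the result back to the same name, as in `df = df.set_index('timestamp')`, has the same effect as `inplace=True`.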

In [None]:
df['humidity']['2018-11-08 21:30:00']

In [None]:
df['humidity'][datetime(year=2018, month=11, day=8, hour=21, minute=30, second=0)]
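
A time index also accepts partial date strings and ranges. The sketch below uses invented half-hourly data rather than the weather file, and uses `.loc` to make it explicit that the lookup is by label:

```python
import numpy as np
import pandas as pd

# Hypothetical half-hourly readings covering three days
index = pd.date_range("2018-11-07", periods=144, freq="30min")
humidity = pd.Series(np.arange(144), index=index)

one = humidity.loc["2018-11-08 21:30:00"]  # a single reading
day = humidity.loc["2018-11-08"]           # every reading on that day
evening = humidity.loc["2018-11-08 18:00":"2018-11-08 20:00"]  # both ends included

print(one, len(day), len(evening))
```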

In [None]:
df.temp.plot();

In [None]:
df[["temp", "rainfall_mm"]].plot();

Henceforth, let's ignore the data after Sept 2015.

In [None]:
df = df.loc[:'2015-09-30']  # a date string as the end point includes the whole of that day
df.temp.plot();

In [None]:
print(df.temp.max(), df.temp.idxmax())
print(df.temp.min(), df.temp.idxmin())

In [None]:
df['temp'].quantile(0.99)

In [None]:
df.temp.idxmin() - df.temp.idxmax()

In [None]:
df.info()

In [None]:
df.plot(kind='scatter', x='mean_wind_kts', y='temp');

In [None]:
df['temp'][df.index.hour == 0][5000:7000].plot()
df['temp'][df.index.hour == 14][5000:7000].plot();

# GroupBy

In [None]:
df.index.month

In [None]:
monthgrouper = df.groupby(df.index.month)
monthgrouper

Not very helpful - what's happening behind the scenes?

In [None]:
monthgrouper.groups

In [None]:
for month, groupframe in monthgrouper:
    groupframe['temp'].plot(style='.', label=month)
plt.legend();

That's a lot of points; you may prefer to plot only the first 10000 rows.

In [None]:
monthgrouper.mean()
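
To see the mechanics on something small enough to check by hand, here is a sketch with 90 days of invented daily data. The groupby splits the rows by month, and an aggregation such as `mean()` then collapses each group to one row:

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperatures: 0.0, 1.0, 2.0, ... from 1 Jan to 31 Mar
index = pd.date_range("2015-01-01", periods=90, freq="D")
frame = pd.DataFrame({"temp": np.arange(90, dtype=float)}, index=index)

grouper = frame.groupby(frame.index.month)
print(sorted(grouper.groups))  # the group keys: one per month

# Iterating gives (key, sub-frame) pairs
for month, group in grouper:
    print(month, len(group))

monthly_mean = grouper.mean()                        # one row per month
summary = grouper["temp"].agg(["min", "max"])        # several statistics at once
print(monthly_mean)
print(summary)
```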

In [None]:
for month, groupframe in monthgrouper:
    print("month:",month)
    groupframe.plot(x='temp', y='mean_wind_kts', kind='scatter', xlim=[-10,35], ylim=[0, 50])
    plt.show()

# That's all for now!

For more, I suggest searching for '[A gallery of interesting Jupyter Notebooks](https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks)'.