# Pandas Survey

- [Pandas Documentation](http://pandas.pydata.org/pandas-docs/stable/)
- [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) - original book about pandas (Wes McKinney - creator)
- [Data analysis in Python with pandas](https://www.youtube.com/watch?v=w26x-z-BdWQ) - YouTube training video from PyCon (Wes McKinney)
- [Pandas Training Video](http://shop.oreilly.com/product/0636920047537.do) - training video from O'Reilly

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# plot all graphs inline
%matplotlib inline    

print("Pandas version: {}".format(pd.__version__))

Pandas version: 0.24.2


## Data Structures

## Series

A `Series` is a 1D labeled data structure whose elements all share the same data type.  A minimal `Series` constructor looks like:
```
s = pd.Series(data, index, name)
```
Here `data` can be a `np.ndarray`, a Python `list`, a Python `dict` or a scalar value.  The `index` is not required, but if provided it must be the same length as the provided data (a scalar is instead repeated to match the length of the index).  If an `index` is not provided, a zero-based integer index is created based on the size of the data.  If the input is a Python `dict`, its keys are used as the index labels.

The `name` parameter is optional, but it can be used to label the entire `Series`.  If a named `Series` is later converted to a `DataFrame` (see below), the name is used as the column label.
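As a rough illustration of the constructor inputs described above, the sketch below builds a `Series` from an array, a `dict` and a scalar. The variable names and values are made up for the example and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# From a NumPy array with explicit labels and a name
from_array = pd.Series(np.array([1.5, 2.5, 3.5]), index=['x', 'y', 'z'], name='values')

# From a dict: the keys become the index labels
from_dict = pd.Series({'x': 10, 'y': 20, 'z': 30})

# From a scalar: the value is repeated once per index label
from_scalar = pd.Series(7, index=['x', 'y', 'z'])

# Without an index, a zero-based integer index is created
default_index = pd.Series(np.array([1.5, 2.5, 3.5]))

print(from_array)
print(from_dict.index.tolist())
print(from_scalar.tolist())
print(default_index.index.tolist())
```

Note that the scalar case only makes sense together with an `index`, since the index determines how many times the value is repeated.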

In [3]:
N = 5
# Draw N samples so the data length matches the five index labels
aser = pd.Series(np.random.randn(N), index=list('abcde'), name='A')
aser

a   -1.000970
b    0.597744
c   -0.035878
d    0.178229
e   -0.474080
Name: A, dtype: float64

In [5]:
aser.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

A `Series` acts much like a standard NumPy array, with a few additional features:

In [6]:
print('The length of the Series is: {}'.format(len(aser)))

The length of the Series is: 5


In [7]:
# Series acts like an array or list for selection
aser[1:4]

b    0.597744
c   -0.035878
d    0.178229
Name: A, dtype: float64

In [8]:
# Can also select using labels, like a dict; NOTE: a label slice includes the end label
aser['b':'d']

b    0.597744
c   -0.035878
d    0.178229
Name: A, dtype: float64

In [9]:
# Can operate on a series just like a numpy array
aser*2 + 3

a    0.998061
b    4.195488
c    2.928244
d    3.356458
e    2.051841
Name: A, dtype: float64

## Data Alignment

In [10]:
# Create another Series with a slightly different index
data = [np.random.randint(-200, 200) for i in range(5)]
bser = pd.Series(data, index=list('bcdef'), dtype=np.dtype('int16'), name='B')
bser

b     67
c     88
d   -161
e     81
f   -160
Name: B, dtype: int16

Now add the two series together:

In [11]:
cser = aser + bser
cser

a           NaN
b     67.597744
c     87.964122
d   -160.821771
e     80.525920
f           NaN
dtype: float64

Data is **intrinsically** aligned on the index.  `aser` has no value at label `f` and `bser` has no value at label `a`, so both labels get `NaN` in the sum.  The result is also upcast to `float64`, since `NaN` cannot be stored in an integer array.
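If the `NaN` values produced by alignment are not wanted, one option is the `add` method with its `fill_value` argument, which substitutes a value for labels missing from one side. The following is a minimal sketch using two small, made-up series rather than `aser` and `bser`:

```python
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
right = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

# Plain addition aligns on the index: 'a' and 'd' exist on one side only -> NaN
aligned_sum = left + right

# fill_value=0 treats a label missing from one side as 0 before adding
filled_sum = left.add(right, fill_value=0)

print(aligned_sum)
print(filled_sum)
```

Whether filling with zero is appropriate depends on what a missing label means in the data, so this is a choice to make deliberately rather than by default.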

## Handling Missing Data

Missing data is marked with the sentinel value `np.nan`, even for non-numeric data (datetime data uses the analogous `NaT`).

In [12]:
cser.notna()  # can also use cser.notnull()

a    False
b     True
c     True
d     True
e     True
f    False
dtype: bool

In [13]:
print("Default ignore nan values: {}".format(cser.mean()))
print("You can always include them: {}".format(cser.mean(skipna=False)))

Default ignore nan values: 18.816503766189328
You can always include them: nan
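Besides detecting missing values, they can be dropped or replaced. The sketch below uses a small, made-up series (not `cser`) to show `isna`, `dropna` and `fillna`; which strategy is appropriate depends on the analysis.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan], index=list('abcd'))

# Count the missing entries
n_missing = int(s.isna().sum())

# Drop the labels that hold NaN
dropped = s.dropna()

# Replace NaN with a constant instead
filled = s.fillna(0.0)

print(n_missing)
print(dropped)
print(filled)
```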


## DataFrame

A `DataFrame` is a 2D data structure that contains one or more 1D data arrays or `Series` as columns, each with its own data type.  A minimal `DataFrame` constructor looks like:
```
df = pd.DataFrame(data, columns, index)
```
where `data` can be

- a `dict` of 1D `np.ndarray`s, `list`, `dict` or `Series`: the `dict` keys are used as the column names
- a 2D `np.ndarray`
- a structured or record `np.ndarray`
- a `pd.Series`
- another `DataFrame`

The `index` is not required, but if provided it must be the same length as the provided data.  If an `index` is not provided, a zero-based integer index is created based on the size of the data.  The column names are taken from the `columns` input list (if provided) or from the keys if a `dict` is used as the data.
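As a small sketch of two of the input types listed above, the example below builds one `DataFrame` from a `dict` of lists and another from a 2D array with explicit `columns` and `index`. The labels (`x`, `y`, `r0`, ...) are invented for the illustration.

```python
import numpy as np
import pandas as pd

# dict of lists: keys become column names, default zero-based index
from_dict = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})

# 2D array: column and row labels supplied explicitly
from_array = pd.DataFrame(
    np.arange(6).reshape(3, 2),
    columns=['x', 'y'],
    index=['r0', 'r1', 'r2'],
)

print(from_dict.dtypes)  # each column keeps its own dtype
print(from_array)
```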

In [14]:
# Create from a dict of mixed data
df = pd.DataFrame({'A': aser, 'B': bser, 'C': [3]*6})
df

|   | A | B | C |
|---|---|---|---|
| a | -1.00097 | NaN | 3 |
| b | 0.597744 | 67.0 | 3 |
| c | -0.035878 | 88.0 | 3 |
| d | 0.178229 | -161.0 | 3 |
| e | -0.47408 | 81.0 | 3 |
| f | NaN | -160.0 | 3 |


### Selection

In [15]:
# Select a single column, result is a Series
df['B']

a      NaN
b     67.0
c     88.0
d   -161.0
e     81.0
f   -160.0
Name: B, dtype: float64

In [16]:
# Select multiple columns, result is a DataFrame, NOTE: pass a list of column names
df[['C', 'B']]

|   | C | B |
|---|---|---|
| a | 3 | NaN |
| b | 3 | 67.0 |
| c | 3 | 88.0 |
| d | 3 | -161.0 |
| e | 3 | 81.0 |
| f | 3 | -160.0 |


In [17]:
# Select a row by label, result is a Series
df.loc['a']

A   -1.00097
B        NaN
C    3.00000
Name: a, dtype: float64

In [18]:
# Select a subset of rows by position, result is a DataFrame
df[3:6]

|   | A | B | C |
|---|---|---|---|
| d | 0.178229 | -161.0 | 3 |
| e | -0.47408 | 81.0 | 3 |
| f | NaN | -160.0 | 3 |


In [19]:
# Select based on a boolean condition
df[df['A'] < 0.5]

|   | A | B | C |
|---|---|---|---|
| a | -1.00097 | NaN | 3 |
| c | -0.035878 | 88.0 | 3 |
| d | 0.178229 | -161.0 | 3 |
| e | -0.47408 | 81.0 | 3 |
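The selections above mix label-based and position-based access through plain `[]`. The `.loc` (labels) and `.iloc` (positions) accessors make the intent explicit and allow rows and columns to be selected together. The sketch below uses a small, made-up frame rather than `df`:

```python
import pandas as pd

frame = pd.DataFrame(
    {'A': [0.1, 0.7, -0.3], 'B': [10, 20, 30]},
    index=['a', 'b', 'c'],
)

# Label-based: the slice 'a':'b' includes the end label
by_label = frame.loc['a':'b', ['B']]

# Position-based: the slice 0:2 excludes the end position, as in plain Python
by_position = frame.iloc[0:2, 1:2]

# Boolean condition on rows combined with a column label
b_where_a_small = frame.loc[frame['A'] < 0.5, 'B']

print(by_label.equals(by_position))
print(b_where_a_small)
```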


### Assignment/Deletion 

In [20]:
# Assign to an existing column; a scalar is broadcast to every row
df['C'] = 5
df

|   | A | B | C |
|---|---|---|---|
| a | -1.00097 | NaN | 5 |
| b | 0.597744 | 67.0 | 5 |
| c | -0.035878 | 88.0 | 5 |
| d | 0.178229 | -161.0 | 5 |
| e | -0.47408 | 81.0 | 5 |
| f | NaN | -160.0 | 5 |


In [21]:
# Assign to an unknown column creates a new column (at end by default)
df['D'] = df['A']*100
df

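|   | A | B | C | D |
|---|---|---|---|---|
| a | -1.00097 | NaN | 5 | -100.096954 |
| b | 0.597744 | 67.0 | 5 | 59.774399 |
| c | -0.035878 | 88.0 | 5 | -3.587812 |
| d | 0.178229 | -161.0 | 5 | 17.822881 |
| e | -0.47408 | 81.0 | 5 | -47.407961 |
| f | NaN | -160.0 | 5 | NaN |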


In [22]:
# Delete a column
del df['D']
df

|   | A | B | C |
|---|---|---|---|
| a | -1.00097 | NaN | 5 |
| b | 0.597744 | 67.0 | 5 |
| c | -0.035878 | 88.0 | 5 |
| d | 0.178229 | -161.0 | 5 |
| e | -0.47408 | 81.0 | 5 |
| f | NaN | -160.0 | 5 |


In [None]:
df.values

## Date/Time Functionality

In [23]:
dt = pd.to_datetime('2016-06-10 15:00:00', utc=True)
dt

Timestamp('2016-06-10 15:00:00+0000', tz='UTC')

In [24]:
# Timestamp as integer nanoseconds since the Unix epoch (1970-01-01 UTC)
dt.value

1465570800000000000

In [25]:
pd.to_datetime(1465570900000000, unit='us', utc=True)

Timestamp('2016-06-10 15:01:40+0000', tz='UTC')

## DatetimeIndex

In [26]:
dt_index = pd.date_range('2016-06-10 15:00:00', periods=100, freq='10S')
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'), index=dt_index)
df.head()

|   | A | B | C | D |
|---|---|---|---|---|
| 2016-06-10 15:00:00 | -0.290231 | 0.440647 | -1.775568 | 1.034476 |
| 2016-06-10 15:00:10 | 0.237819 | 0.734726 | -0.586192 | 0.536026 |
| 2016-06-10 15:00:20 | -0.920117 | 1.433064 | -0.012759 | -0.340736 |
| 2016-06-10 15:00:30 | -0.250639 | 0.167048 | -0.808868 | 2.01997 |
| 2016-06-10 15:00:40 | 0.600805 | -0.512624 | 0.065505 | -0.891154 |


In [27]:
# Select rows by timestamp range, NOTE: unlike positional slices, the end timestamp is included
df['2016-06-10 15:01:00':'2016-06-10 15:02:00']

|   | A | B | C | D |
|---|---|---|---|---|
| 2016-06-10 15:01:00 | 0.18905 | 1.345574 | -1.076729 | 0.430136 |
| 2016-06-10 15:01:10 | -0.501678 | -0.149934 | 1.605561 | 0.911529 |
| 2016-06-10 15:01:20 | -0.253523 | -1.387403 | 1.213068 | 0.382227 |
| 2016-06-10 15:01:30 | 0.665015 | 0.498216 | 0.279032 | -0.544346 |
| 2016-06-10 15:01:40 | -0.018148 | 1.035476 | -0.691665 | -1.749729 |
| 2016-06-10 15:01:50 | 0.075567 | 0.625191 | 0.683316 | 1.013481 |
| 2016-06-10 15:02:00 | -0.317078 | 0.710981 | -0.780414 | 2.487746 |
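The inclusive end point of a timestamp slice can be checked on a small, deterministic frame. The sketch below rebuilds the same 10-second index but with an invented `value` column, and passes the frequency as a `pd.Timedelta` (an assumption made here so that the example does not depend on the frequency alias spelling of a particular pandas version).

```python
import numpy as np
import pandas as pd

# 100 timestamps spaced 10 seconds apart, starting at 15:00:00
idx = pd.date_range('2016-06-10 15:00:00', periods=100, freq=pd.Timedelta(seconds=10))
frame = pd.DataFrame({'value': np.arange(100)}, index=idx)

# One minute of data, selected by label; both end points are included
window = frame.loc['2016-06-10 15:01:00':'2016-06-10 15:02:00']

print(len(window))       # 15:01:00 through 15:02:00 in 10 s steps
print(window.index[0])
print(window.index[-1])
```

Because both ends are included, a one-minute window at 10-second spacing holds seven rows rather than six.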


In [None]:
df.loc['2016-06-10 15:01:00':'2016-06-10 15:02:00', ['C', 'B']]

In [None]:
df.sort_index()