# Pandas

> library for data manipulation and analysis

- Built on top of the `NumPy` library
- `DataFrame` and `Series` objects
- Reading and writing data
- Time-series functionality

In [4]:
import pandas as pd

## Series

In [5]:
x = pd.Series([1, 2, 3, 4, 5])
x

0    1
1    2
2    3
3    4
4    5
dtype: int64

The `values` attribute is a `NumPy` array

In [6]:
x.values

array([1, 2, 3, 4, 5])

The `index` attribute is a `pd.Index` object

In [7]:
x.index

RangeIndex(start=0, stop=5, step=1)

Indexing can be done using integers, like in `NumPy`

In [9]:
x[1]

2

In [10]:
x[2:3]

2    3
dtype: int64

The index does not have to consist of integers; other label types, such as strings, work too

In [12]:
y = pd.Series([0.5, 1, 1.5, 2], index=['a', 'b', 'c', 'd'])
y

a    0.5
b    1.0
c    1.5
d    2.0
dtype: float64

In [13]:
y['b']

1.0
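A `Series` with a label index behaves somewhat like an ordered dictionary. The following is a minimal sketch that redefines `y` so it runs on its own; the names `has_b` and `label_slice` are only illustrative. Note that, unlike integer slices, label slices include the end label.

```python
import pandas as pd

y = pd.Series([0.5, 1, 1.5, 2], index=['a', 'b', 'c', 'd'])

# Membership tests look at the index labels, as with dictionary keys
has_b = 'b' in y

# Label-based slices include both endpoints
label_slice = y['a':'c']

print(has_b)
print(label_slice)
```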

## DataFrame

In [16]:
df = pd.DataFrame(y, columns=['values'])
df

|   | values |
|---|--------|
| a | 0.5 |
| b | 1.0 |
| c | 1.5 |
| d | 2.0 |


In [17]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

|   | a | b | c |
|---|-----|---|-----|
| 0 | 1.0 | 2 | NaN |
| 1 | NaN | 3 | 4.0 |
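When the dictionaries do not share the same keys, the missing entries are filled with `NaN`. A small sketch of how this could be inspected, assuming the same input as above (the name `missing` is only illustrative):

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# Count the missing entries in each column
missing = df.isna().sum()

# Columns containing NaN are upcast to float64;
# 'b' has no gaps and keeps an integer dtype
print(missing)
print(df.dtypes)
```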


In [27]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
states_df = pd.DataFrame({'area':area, 'pop':pop})
states_df

|            | area   | pop      |
|------------|--------|----------|
| California | 423967 | 38332521 |
| Texas      | 695662 | 26448193 |
| New York   | 141297 | 19651127 |
| Florida    | 170312 | 19552860 |
| Illinois   | 149995 | 12882135 |


## Indexing

With an integer index, plain indexing (`data[1]`) uses the explicit index, that is, the labels. Slicing (`data[1:3]`) uses the implicit, position-based index. This can cause confusion, as in the following examples:

In [18]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

In [19]:
# explicit index when indexing
data[1]

'a'

In [20]:
# implicit index when slicing
data[1:3]

3    b
5    c
dtype: object

Therefore it is safer to use one of the following `Pandas` attributes:

- `loc` always references the explicit index (the labels)
- `iloc` always references the implicit index (the integer positions)

In [21]:
data.loc[1]

'a'

In [22]:
data.loc[1:3]

1    a
3    b
dtype: object

In [23]:
data.iloc[1]

'b'

In [24]:
data.iloc[1:3]

3    b
5    c
dtype: object
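The two accessors also treat slice endpoints differently, which the outputs above already hint at. A minimal sketch, reusing the same `data` series (the names `by_label` and `by_position` are only illustrative):

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = data.loc[1:3]      # labels 1 through 3, end label included
by_position = data.iloc[1:3]  # positions 1 and 2, end position excluded

print(by_label.tolist())     # ['a', 'b']
print(by_position.tolist())  # ['b', 'c']
```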

### Transpose

In [28]:
states_df.T

|      | California | Texas    | New York | Florida  | Illinois |
|------|------------|----------|----------|----------|----------|
| area | 423967     | 695662   | 141297   | 170312   | 149995   |
| pop  | 38332521   | 26448193 | 19651127 | 19552860 | 12882135 |

`iloc` and `loc` also work on a `DataFrame`, where they take a row selector followed by a column selector:


In [30]:
states_df.iloc[:3, :2]

|            | area   | pop      |
|------------|--------|----------|
| California | 423967 | 38332521 |
| Texas      | 695662 | 26448193 |
| New York   | 141297 | 19651127 |


In [31]:
states_df.loc[:'New York', :'pop']

|            | area   | pop      |
|------------|--------|----------|
| California | 423967 | 38332521 |
| Texas      | 695662 | 26448193 |
| New York   | 141297 | 19651127 |


## Index alignment in DataFrame

In [33]:
import numpy as np
rand = np.random.RandomState(1)
A = pd.DataFrame(rand.randint(0, 20, (2, 2)),
                 columns=list('AB'))
A

|   | A  | B  |
|---|----|----|
| 0 | 5  | 11 |
| 1 | 12 | 8  |


In [35]:
B = pd.DataFrame(rand.randint(0, 10, (3, 3)),
                 columns=list('BAC'))
B

|   | B | A | C |
|---|---|---|---|
| 0 | 9 | 5 | 0 |
| 1 | 0 | 1 | 7 |
| 2 | 6 | 9 | 2 |


In [36]:
A + B

|   | A    | B    | C   |
|---|------|------|-----|
| 0 | 10.0 | 20.0 | NaN |
| 1 | 13.0 | 8.0  | NaN |
| 2 | NaN  | NaN  | NaN |
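The result above is aligned on both the row and the column labels, regardless of their order in the operands. The sketch below reproduces it with the values written out explicitly instead of drawn from the random generator, so it stays deterministic; `total` is only an illustrative name.

```python
import pandas as pd

A = pd.DataFrame([[5, 11], [12, 8]], columns=list('AB'))
B = pd.DataFrame([[9, 5, 0], [0, 1, 7], [6, 9, 2]], columns=list('BAC'))

total = A + B

# The result is indexed by the union of the row and column labels
print(list(total.index))    # [0, 1, 2]
print(list(total.columns))  # ['A', 'B', 'C']

# Cells that exist in only one operand become NaN
print(int(total.isna().sum().sum()))  # 5
```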


In [37]:
A.stack()

0  A     5
   B    11
1  A    12
   B     8
dtype: int64

In [38]:
A.stack().mean()

9.0

In [39]:
A.add(B, fill_value=A.stack().mean())

|   | A    | B    | C    |
|---|------|------|------|
| 0 | 10.0 | 20.0 | 9.0  |
| 1 | 13.0 | 8.0  | 16.0 |
| 2 | 18.0 | 15.0 | 11.0 |
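With `fill_value`, an entry that is missing from one operand is replaced by that value before the addition, so the result no longer contains `NaN` there. A minimal sketch on two small `Series`, where `s1`, `s2`, `plain` and `filled` are hypothetical names and `0` is an arbitrary fill value:

```python
import pandas as pd

s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series([10, 20], index=['b', 'c'])

plain = s1 + s2                    # 'a' and 'c' exist on one side only: NaN
filled = s1.add(s2, fill_value=0)  # the missing side is treated as 0

print(plain)
print(filled)
```

An entry that is missing from both operands would still be `NaN` in the result.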
