# Operating on Data in Pandas

NumPy is well known for fast element-wise operations, both basic and complex.

Pandas builds on this and *inherits* a lot of this functionality.

Universal functions (ufuncs) are key to this.

So, what does Pandas add?

- preservation of index and column labels

- alignment of indices

Both are tricky to get right with raw NumPy arrays but come built in with Pandas.

## Ufuncs: Index Preservation

Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas `Series` and `DataFrame` objects.

Example:

In [62]:
import pandas as pd
import numpy as np

In [63]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))          # 4 random integers drawn from [0, 10)
ser

0    6
1    3
2    7
3    4
dtype: int32

In [64]:
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df

   A  B  C  D
0  6  9  2  6
1  7  4  3  7
2  7  2  5  4


If we apply a NumPy ufunc on either of these objects, the result will be another Pandas object with the indices preserved:

In [65]:
np.exp(ser)    

0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64

For something slightly more complex:

In [66]:
np.sin(df * np.pi / 4)

          A             B         C             D
0 -1.000000  7.071068e-01  1.000000 -1.000000e+00
1 -0.707107  1.224647e-16  0.707107 -7.071068e-01
2 -0.707107  1.000000e+00 -0.707107  1.224647e-16
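
Label preservation is easier to see with a non-default index. The following is a minimal sketch using made-up data (the names `ser_labeled` and `df_labeled` are illustrative only): the ufunc changes the values, while the row and column labels pass through untouched.

```python
import numpy as np
import pandas as pd

# Illustrative data with explicit string labels (not from the cells above).
ser_labeled = pd.Series([4.0, 9.0, 16.0], index=['a', 'b', 'c'])

# np.sqrt is an ordinary NumPy ufunc; the result is still a Series
# and carries the same index as its input.
roots = np.sqrt(ser_labeled)
print(roots)

# The same holds for a DataFrame: row and column labels are kept.
df_labeled = pd.DataFrame({'x': [0.0, 1.0], 'y': [2.0, 3.0]},
                          index=['r1', 'r2'])
squared = np.square(df_labeled)
print(squared)
```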


## Ufuncs: Index Alignment

For binary operations on two `Series` or `DataFrame` objects, Pandas will align the indices in the process of performing the operation.

This is very convenient when working with incomplete data.

### Example using `Series` objects

Suppose we are combining two different data sources, one giving area and the other giving population:

In [67]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

What happens if we divide these to compute the population density?

In [68]:
x = population / area
print(x)

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64


The resulting `Series` contains the union of the indices of the two inputs, which can be computed directly with the `union` method of the `Index` objects:

In [69]:
area.index.union(population.index)

Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')

In [70]:
x.index

Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')

Any item for which one or the other does not have an entry is marked with `NaN`, or "Not a Number," which is how Pandas marks missing data.

Index matching works the same way for any of Python's built-in arithmetic operators; entries missing from either operand are filled with `NaN` by default:
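
If only the labels present in both sources are of interest, one option is to drop the `NaN` entries after the division. The sketch below reuses the `area` and `population` data from above; the names `density`, `common`, and `complete` are introduced here for illustration. It also shows that the surviving labels are the intersection of the two indices.

```python
import pandas as pd

area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

# Division aligns on labels; unmatched labels become NaN.
density = population / area

# Labels present in both inputs.
common = area.index.intersection(population.index)

# Dropping the NaN entries leaves exactly those common labels.
complete = density.dropna()
print(complete)
```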

In [71]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])       # NaN forces the integer result to be upcast to float
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

If using NaN values is not the desired behavior, the fill value can be modified using appropriate object methods in place of the operators.

In [72]:
A.add(B, fill_value=0)

0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64
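
The other arithmetic operators have method forms as well (`sub`, `mul`, `div`, and so on), each accepting `fill_value`. As a brief sketch on the same `A` and `B`, note that the sensible fill value depends on the operation: 0 is neutral for subtraction, 1 for multiplication. The names `diff` and `prod` are illustrative.

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# Like A - B, but a label missing from one side is treated as 0.
diff = A.sub(B, fill_value=0)

# Like A * B, but a label missing from one side is treated as 1.
prod = A.mul(B, fill_value=1)

print(diff)
print(prod)
```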

### Index alignment in a `DataFrame`

A similar type of alignment takes place for both columns and indices when performing operations on `DataFrame` objects:

In [73]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                 columns=list('AB'))
A

   A   B
0  1  11
1  5   1


In [86]:
A = pd.DataFrame(rng.randint(0, 20, (2, 4)),
                 columns=list('ABCD'))
A

    A  B   C   D
0  14  6  11   7
1  14  2  13  16


In [88]:
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                 columns=list('BAC'))
B

   B  A  C
0  5  1  9
1  1  9  3
2  7  6  8


In [89]:
A + B

      A     B     C   D
0  15.0  11.0  20.0 NaN
1  23.0   3.0  16.0 NaN
2   NaN   NaN   NaN NaN


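Note that indices are aligned correctly irrespective of their order in the two objects, and that the indices in the result are sorted.

`fill_value` can also be used here if need be. Below, missing entries are filled with the mean of all values in `A`, computed by first stacking the rows of `A`: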

In [None]:
A.stack()

In [93]:
A.stack().mean()

10.375

In [92]:
B

   B  A  C
0  5  1  9
1  1  9  3
2  7  6  8


In [94]:
fill = A.stack().mean()
A.add(B, fill_value=fill)

        A       B       C       D
0  15.000  11.000  20.000  17.375
1  23.000   3.000  16.000  26.375
2  16.375  17.375  18.375     NaN
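
The remaining `NaN` in row 2, column `D` is expected: `fill_value` stands in for a value that is missing from one operand only, so a cell absent from both inputs stays missing. A small sketch with made-up frames (`left`, `right`, and `total` are illustrative names) shows the same effect in isolation:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'A': [1, 2], 'D': [3, 4]})    # rows 0-1, columns A and D
right = pd.DataFrame({'A': [10, 20, 30]})          # rows 0-2, column A only

total = left.add(right, fill_value=0)
print(total)

# Row 2 / column D exists in neither frame, so it remains NaN,
# while cells missing from just one frame were filled with 0.
print(np.isnan(total.loc[2, 'D']))
```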


## Ufuncs: Operations Between DataFrame and Series

When performing operations between a `DataFrame` and a `Series`, the index and column alignment is similarly maintained. 

This is similar to operations between a two-dimensional and a one-dimensional NumPy array:

In [95]:
A = rng.randint(10, size=(3, 4))
A

array([[8, 6, 1, 3],
       [8, 1, 9, 8],
       [9, 4, 1, 3]])

In [80]:
A - A[0]

array([[ 0,  0,  0,  0],
       [ 0, -5,  8,  5],
       [ 1, -2,  0,  0]])

Pandas offers similar functionality, and likewise applies the operation row-wise by default:

In [81]:
df = pd.DataFrame(A, columns=list('QRST'))
df - df.iloc[0]

   Q  R  S  T
0  0  0  0  0
1  0 -5  8  5
2  1 -2  0  0


If you would instead like to operate column-wise, you can use the object methods mentioned earlier, while specifying the `axis` keyword:

In [82]:
df.subtract(df['R'], axis=0)

   Q  R  S  T
0  2  0 -5 -3
1  7  0  8  7
2  5  0 -3 -1
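
The `axis` keyword decides which set of labels the `Series` index is matched against. As a rough sketch with made-up data (`scores`, `col_centered`, and `row_centered` are illustrative names), centering by column means uses the default matching on column labels, while centering by row means needs `axis='index'`:

```python
import pandas as pd

scores = pd.DataFrame({'Q': [1, 2, 3], 'R': [10, 20, 30]})

# scores.mean() is indexed by column label (Q, R), so the default
# alignment subtracts each column's mean from that column.
col_centered = scores - scores.mean()

# scores.mean(axis=1) is indexed by row label (0, 1, 2), so the
# subtraction must align on the row index instead.
row_centered = scores.sub(scores.mean(axis=1), axis='index')

print(col_centered)
print(row_centered)
```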


A further example of automatic alignment: subtracting a `Series` that holds only some of the columns of the `DataFrame`.

In [83]:
df

   Q  R  S  T
0  8  6  1  3
1  8  1  9  8
2  9  4  1  3


In [84]:
halfrow = df.iloc[0, ::2]
halfrow

Q    8
S    1
Name: 0, dtype: int32

In [85]:
df - halfrow

     Q   R    S   T
0  0.0 NaN  0.0 NaN
1  0.0 NaN  8.0 NaN
2  1.0 NaN  0.0 NaN
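
Columns `R` and `T` come out as `NaN` because `halfrow` has no entry for them. If those columns should instead pass through unchanged, one possible approach (an assumption about the desired behavior, not something the cells above do) is to reindex the `Series` to all columns with a fill of 0 before subtracting. The name `halfrow_full` is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame([[8, 6, 1, 3],
                   [8, 1, 9, 8],
                   [9, 4, 1, 3]], columns=list('QRST'))
halfrow = df.iloc[0, ::2]          # holds labels Q and S only

# Expand the Series to every column, treating the missing ones as 0.
halfrow_full = halfrow.reindex(df.columns, fill_value=0)

result = df - halfrow_full
print(result)
```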


This preservation and alignment of indices and columns means that operations on data in Pandas always maintain the data context.

This prevents the kinds of silly errors that can come up when working with heterogeneous and/or misaligned data in raw NumPy arrays.
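
To make the contrast concrete, here is a small sketch with made-up values in which two sources list the same labels in different orders. Raw NumPy pairs values by position and silently produces a wrong answer; Pandas pairs them by label. All names in the block are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements from two sources, listed in different orders.
labels_a = ['x', 'y', 'z']
labels_b = ['z', 'x', 'y']
values_a = np.array([1, 2, 3])
values_b = np.array([30, 10, 20])

# Raw NumPy adds by position, so 'x' from the first source is
# paired with 'z' from the second without any warning.
by_position = values_a + values_b

# Pandas adds by label, so each value meets its true counterpart.
by_label = (pd.Series(values_a, index=labels_a)
            + pd.Series(values_b, index=labels_b))

print(by_position)
print(by_label)
```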

# Summary

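Despite the added complexity of its index system, Pandas keeps computations robust and consistent: labels are aligned automatically, and entries missing from either operand are marked with `NaN`.

This removes a whole class of misalignment errors that are easy to make with raw arrays.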