# Operating on Data in Pandas

## Ufuncs: Index preservation

Because Pandas is designed to work with NumPy, any NumPy ufunc (universal function) will work on Pandas Series and DataFrame objects.

In [3]:
import pandas as pd
import numpy as np

In [4]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
# low=0 (inclusive), high=10 (exclusive), size=4 values
ser

0    6
1    3
2    7
3    4
dtype: int32

I ran np.random.RandomState? to view the documentation, and pressed Shift+Tab inside the parentheses to see the method's parameters.

In [17]:
# (3, 4) is the shape of the array: 3 rows, 4 columns
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df

   A  B  C  D
0  3  1  7  3
1  1  5  5  9
2  3  5  1  9


If we apply a NumPy ufunc to either of these objects, the result will be another Pandas object with the indices preserved:

In [23]:
np.exp(ser)
# e raised to the power of each element

0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64

Or, for a slightly more complex calculation:

In [26]:
np.sin(df*np.pi/4)

          A         B         C         D
0  0.707107  0.707107 -0.707107  0.707107
1  0.707107 -0.707107 -0.707107  0.707107
2  0.707107 -0.707107  0.707107  0.707107
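Index preservation is easier to see with a non-default index. The following is a minimal sketch; the labels and values are made up purely for illustration and do not come from the examples above:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with string labels (illustrative values only)
values = pd.Series([0.0, 1.0, 2.0], index=['a', 'b', 'c'])

# Apply a NumPy ufunc directly to the Series
result = np.exp(values)

# The result is still a Series and carries the same labels as the input
print(type(result).__name__)
print(result.index.equals(values.index))
```

The ufunc only changes the values; the labels pass through untouched.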


## Ufuncs: Index alignment

For binary operations on two Series or DataFrame objects, Pandas will align indices in the process of performing the operation. This is very convenient when working with incomplete data.

### Index alignment in Series

Suppose we are combining two different data sources, and find only the top three US states by area and the top three states by population:

In [37]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

Notice that some states appear in population but not in area, and vice versa. Let's see what happens when we divide these to compute the population density:

In [38]:
population / area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

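The resulting Series contains the **union** of the indices of the two input Series, which can be determined with the union method of the index objects (older Pandas versions also accepted the set-style expression area.index | population.index, but in recent releases that operator no longer performs a set union on an Index):

In [39]:
area.index.union(population.index)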

Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')
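The other set operations on the indices can help predict which labels will end up with real values and which will not. This is a small sketch using the `union`, `intersection`, and `difference` methods of `pd.Index`; the variable names are chosen here for illustration, and the results are wrapped in `sorted()` so the printed order does not depend on the Pandas version:

```python
import pandas as pd

# Index objects with the same labels as the area and population Series
area_index = pd.Index(['Alaska', 'Texas', 'California'])
population_index = pd.Index(['California', 'Texas', 'New York'])

# Labels that appear in either index: these form the index of the result
all_labels = sorted(area_index.union(population_index))

# Labels that appear in both: only these can have a non-missing result
shared_labels = sorted(area_index.intersection(population_index))

# Labels that appear in one index only: these produce missing values
only_area = sorted(area_index.difference(population_index))
only_population = sorted(population_index.difference(area_index))

print(all_labels)
print(shared_labels)
print(only_area, only_population)
```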

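Any item for which one or the other does not have an entry is marked with NaN, or "Not a Number", which is how Pandas marks missing data. The same index matching applies to any of Python's built-in arithmetic expressions, so missing values are filled with NaN by default: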

In [41]:
A = pd.Series([2, 4, 6], index = [0, 1, 2])
B = pd.Series([1, 3, 5], index = [1, 2, 3])
A+B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64
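To check the alignment rule on this small example, one option is to compare the labels of the result with the union and intersection of the input indices. This is only a sketch of that check, not part of the original walkthrough:

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

total = A + B

# The result is indexed by the union of both indices
result_labels = list(total.index)

# Only labels present in both inputs carry a real value
shared = A.index.intersection(B.index)
valid_labels = list(total.dropna().index)

print(result_labels)
print(valid_labels == list(shared))
```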

If NaN values are not the desired behavior, the fill value can be changed by using the appropriate object methods in place of the operators. For example, calling A.add(B) is equivalent to calling A + B, but it accepts an optional fill_value that is used for any elements in A or B that might be missing:

In [42]:
A.add(B, fill_value = 0)

0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64
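One detail worth knowing is that `fill_value` also replaces NaN values already present in the data, and that a position missing on both sides stays missing. The sketch below uses made-up Series `a` and `b` (not from the examples above) to illustrate this:

```python
import numpy as np
import pandas as pd

# Illustrative data: 'c' exists only in a, 'e' only in b,
# and some entries are already NaN
a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])

filled = a.add(b, fill_value=0)

# 'b', 'c' and 'd' have a value on one side only, so the other side is
# treated as 0; 'e' is missing on both sides and remains NaN
print(filled)
```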

### Index alignment in DataFrame

A similar type of alignment takes place for both columns and indices when performing operations on DataFrames:

In [48]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)), 
                columns = list('AB'))
A

    A  B
0   8  0
1  11  7


In [51]:
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                columns = list('BAC'))
B

   B  A  C
0  6  9  8
1  6  8  7
2  1  0  6


In [60]:
A+B
# where only one of A or B has a value, the result is NaN

      A     B   C
0  17.0   6.0 NaN
1  19.0  13.0 NaN
2   NaN   NaN NaN


Notice that indices are aligned correctly irrespective of their order in the two objects, and the indices in the result are sorted. As was the case with Series, we can use the associated object's arithmetic method and pass any desired fill_value to be used in place of missing entries. Here we'll fill with the mean of all values in A (computed by first stacking the rows of A):

In [57]:
fill = A.stack().mean()
fill

6.5

In [58]:
A.add(B, fill_value = fill)

      A     B     C
0  17.0   6.0  14.5
1  19.0  13.0  13.5
2   6.5   7.5  12.5


The following table lists Python operators and their equivalent Pandas object methods:

| Python Operator | Pandas Method(s) |
| :--- | :--- |
| `+` | `add()` |
| `-` | `sub()`, `subtract()` |
| `*` | `mul()`, `multiply()` |
| `/` | `truediv()`, `div()`, `divide()` |
| `//` | `floordiv()` |
| `%` | `mod()` |
| `**` | `pow()` |
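As a quick sanity check of the table, the sketch below compares each operator with one of its listed methods on two small Series. `Series.equals` treats NaN values in the same position as equal, which makes it convenient here; the `checks` dictionary is just a name chosen for this illustration:

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# Compare each operator with its method form
checks = {
    '+':  (A + B).equals(A.add(B)),
    '-':  (A - B).equals(A.sub(B)),
    '*':  (A * B).equals(A.mul(B)),
    '/':  (A / B).equals(A.truediv(B)),
    '//': (A // B).equals(A.floordiv(B)),
    '%':  (A % B).equals(A.mod(B)),
    '**': (A ** B).equals(A.pow(B)),
}
print(checks)

# The method form additionally accepts fill_value; 1 is the neutral
# element for multiplication, so unmatched entries pass through unchanged
product = A.mul(B, fill_value=1)
print(product.tolist())  # [2.0, 4.0, 18.0, 5.0]
```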

## Ufuncs: Operations Between DataFrame and Series

When performing operations between a DataFrame and a Series, the index and column alignment is similarly maintained. Operations between a DataFrame and a Series are similar to operations between a two-dimensional and a one-dimensional NumPy array. Consider one common operation, where we find the difference of a two-dimensional array and one of its rows:

In [5]:
A = rng.randint(10, size=(3, 4))
A

array([[6, 9, 2, 6],
       [7, 4, 3, 7],
       [7, 2, 5, 4]])

In [6]:
A - A[0]

array([[ 0,  0,  0,  0],
       [ 1, -5,  1,  1],
       [ 1, -7,  3, -2]])

According to NumPy's broadcasting rules (see Computation on Arrays: Broadcasting), subtraction between a two-dimensional array and one of its rows is applied row-wise.

In Pandas, the convention similarly operates **row-wise** by default:

In [7]:
df = pd.DataFrame(A, columns=list('QRST'))
df - df.iloc[0]

   Q  R  S  T
0  0  0  0  0
1  1 -5  1  1
2  1 -7  3 -2


If you would instead like to operate column-wise, you can use the object methods mentioned earlier, while specifying the axis keyword:

In [9]:
df.subtract(df['R'], axis=0)

   Q  R  S  T
0 -3  0 -7 -3
1  3  0 -1  3
2  5  0  3  2
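A typical use of the axis keyword is centering each row around its own mean. The sketch below rebuilds the same array by hand so that it runs on its own; `row_means` and `centered` are names introduced only for this illustration:

```python
import pandas as pd

# Same values as the array A above, entered explicitly
df = pd.DataFrame([[6, 9, 2, 6],
                   [7, 4, 3, 7],
                   [7, 2, 5, 4]], columns=list('QRST'))

# One mean per row; the resulting Series is indexed by the row labels 0, 1, 2
row_means = df.mean(axis=1)

# axis=0 matches the Series against the row index, so each row
# has its own mean subtracted
centered = df.sub(row_means, axis=0)

print(row_means.tolist())  # [5.75, 5.25, 4.5]
print(centered)
```

After centering, every row sums to zero (up to floating-point rounding).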


Note that these DataFrame/Series operations, like the operations discussed above, will automatically align indices between the two elements:

In [10]:
halfrow = df.iloc[0, ::2]
halfrow

Q    6
S    2
Name: 0, dtype: int32

In [11]:
df - halfrow

     Q   R    S   T
0  0.0 NaN  0.0 NaN
1  1.0 NaN  1.0 NaN
2  1.0 NaN  3.0 NaN
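If the NaN columns are not wanted, one possible approach (assuming it is acceptable to treat the absent entries as zero) is to reindex the Series to the DataFrame's columns before subtracting. This is a sketch of that idea rather than the only way to handle it; `fullrow` is a name introduced here:

```python
import pandas as pd

# Same values as df above, entered explicitly
df = pd.DataFrame([[6, 9, 2, 6],
                   [7, 4, 3, 7],
                   [7, 2, 5, 4]], columns=list('QRST'))
halfrow = df.iloc[0, ::2]  # Series with labels Q and S only

# Give the Series an entry for every column; absent labels become 0
fullrow = halfrow.reindex(df.columns, fill_value=0)

# Every column now has a matching label, so no NaN values appear
result = df - fullrow
print(result)
```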


This preservation and alignment of indices and columns means that operations on data in Pandas will always maintain the data context, which prevents the types of silly errors that might come up when working with heterogeneous and/or misaligned data in raw NumPy arrays.