# Operating on Data in Pandas

- NumPy's universal functions (ufuncs) provide fast element-wise operations, both for basic arithmetic and for more sophisticated operations (trigonometric functions, exponential and logarithmic functions, etc.).

- Pandas inherits this functionality and adds useful twists:

    - unary operations like negation and trigonometric functions *preserve index and column labels* in the output.
    - binary operations such as addition and multiplication automatically *align indices* when passing the objects to the ufunc.

### Ufuncs: Index Preservation

- Any NumPy ufunc will work on Pandas ``Series`` and ``DataFrame`` objects.

In [1]:
import pandas as pd
import numpy as np

In [2]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser

0    6
1    3
2    7
3    4
dtype: int64

In [3]:
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 6 | 9 | 2 | 6 |
| 1 | 7 | 4 | 3 | 7 |
| 2 | 7 | 2 | 5 | 4 |


- If we apply a NumPy ufunc to either of these objects, the result will be another Pandas object *with the indices preserved*:

In [4]:
np.exp(ser)

0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64

In [5]:
np.sin(df * np.pi / 4)

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | -1.0 | 0.7071068 | 1.0 | -1.0 |
| 1 | -0.707107 | 1.224647e-16 | 0.707107 | -0.7071068 |
| 2 | -0.707107 | 1.0 | -0.707107 | 1.224647e-16 |
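
- Index preservation is not limited to the default integer index. A minimal sketch, using a hypothetical ``temps`` Series with string labels (not part of the examples above):

```python
import numpy as np
import pandas as pd

# Hypothetical data: a Series labeled with strings instead of 0..n-1
temps = pd.Series([0.0, 1.0, 2.0], index=['a', 'b', 'c'])

# The ufunc returns a Series rather than a bare ndarray
result = np.exp(temps)

print(type(result).__name__)   # Series
print(list(result.index))      # ['a', 'b', 'c']
```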


### Ufuncs: Index Alignment

- For binary operations on two ``Series`` or ``DataFrame`` objects, Pandas will align indices while performing the operation. This is very convenient when working with incomplete data.

### Index alignment in Series

- Suppose we are combining two different data sources, and find only the top three US states by *area* and the top three US states by *population*:

In [6]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

- What happens when we divide these to compute the population density?

In [7]:
population / area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

- The result contains the *union* of indices of the two input arrays.

In [8]:
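# Index.union() computes the set union; in recent pandas versions the
# | operator on Index objects no longer does
area.index.union(population.index)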

Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')

- Any item for which one or the other does not have an entry is marked with ``NaN``, or "Not a Number," which is how Pandas marks missing data. This index matching applies to any of Python's built-in arithmetic expressions:

In [9]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64
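
- If only the labels present in both inputs are of interest, the ``NaN`` entries can be dropped afterwards, or the shared labels can be computed up front. A short sketch, assuming the same two Series as in the cell above:

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# Option 1: align on the union, then discard the labels that produced NaN
complete = (A + B).dropna()

# Option 2: compute the labels shared by both inputs
common = A.index.intersection(B.index)

print(list(complete.index))   # [1, 2]
print(list(common))           # [1, 2]
```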

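- If filling with ``NaN`` is not the desired behavior, use the corresponding method instead of the operator and pass a ``fill_value``. For example, ``A.add(B)`` is equivalent to ``A + B``, but it lets us substitute a value for any element missing from ``A`` or ``B``.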

In [10]:
A.add(B, fill_value=0)

0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64
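
- One way to read ``fill_value`` is that each operand is first extended to the union of the labels, with the fill value standing in for its missing entries. The sketch below spells this out by hand with ``reindex``. It is an illustration of the idea, not a description of how Pandas implements it internally:

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# Extend both Series to the union of their labels, using 0 for missing entries
labels = A.index.union(B.index)
manual = A.reindex(labels, fill_value=0) + B.reindex(labels, fill_value=0)

via_method = A.add(B, fill_value=0)

# Compare values label by label
print((manual == via_method).all())   # True
```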

### Index alignment in DataFrame

- A similar type of alignment takes place for *both* columns and indices when performing operations on ``DataFrame``s:

In [13]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)), columns=list('AB'))
B = pd.DataFrame(rng.randint(0, 10, (3, 3)), columns=list('BAC'))
A

|   | A | B |
|---|---|---|
| 0 | 8 | 6 |
| 1 | 17 | 3 |


In [14]:
B

|   | B | A | C |
|---|---|---|---|
| 0 | 8 | 1 | 9 |
| 1 | 8 | 9 | 4 |
| 2 | 1 | 3 | 6 |


In [15]:
A + B

|   | A | B | C |
|---|---|---|---|
| 0 | 9.0 | 14.0 | NaN |
| 1 | 26.0 | 11.0 | NaN |
| 2 | NaN | NaN | NaN |


- Indices are aligned correctly irrespective of their order in the two objects, and indices in the result are sorted.
- As with ``Series``, we can call the corresponding arithmetic method and pass any desired ``fill_value`` to be used in place of missing entries.
- Here we fill with the mean of all values in ``A``, computed by first stacking ``A`` into a single ``Series``.

In [16]:
fill = A.stack().mean()
A.add(B, fill_value=fill)

|   | A | B | C |
|---|---|---|---|
| 0 | 9.0 | 14.0 | 17.5 |
| 1 | 26.0 | 11.0 | 12.5 |
| 2 | 11.5 | 9.5 | 14.5 |
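
- In the result above every cell is filled, because ``B`` covers every row and column of the union. A cell that is missing from *both* frames still comes out as ``NaN``, since there is nothing for ``fill_value`` to combine with. A small sketch with two hypothetical frames, ``left`` and ``right``, chosen so that one such cell exists:

```python
import numpy as np
import pandas as pd

# Hypothetical frames: 'left' has no column B, 'right' has no row 1,
# so the cell (row 1, column B) is absent from both
left = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
right = pd.DataFrame({'B': [10]}, index=[0])

total = left.add(right, fill_value=0)

print(total.loc[0, 'B'])            # 10.0 (present in 'right' only)
print(np.isnan(total.loc[1, 'B']))  # True (absent from both)
```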



- The following table lists the Python operators and their equivalent Pandas object methods:

| Python Operator | Pandas Method(s)                       |
|-----------------|----------------------------------------|
| ``+``           | ``add()``                              |
| ``-``           | ``sub()``, ``subtract()``              |
| ``*``           | ``mul()``, ``multiply()``              |
| ``/``           | ``truediv()``, ``div()``, ``divide()`` |
| ``//``          | ``floordiv()``                         |
| ``%``           | ``mod()``                              |
| ``**``          | ``pow()``                              |
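
- As a quick sanity check of the table, the sketch below compares each operator with its method on a small hypothetical ``Series`` named ``s`` (not one of the objects defined above). It uses scalar operands, so no index alignment is involved:

```python
import pandas as pd

# Hypothetical data used only for this comparison
s = pd.Series([7, 8, 9], index=['x', 'y', 'z'])

checks = {
    '+':  (s + 2).equals(s.add(2)),
    '-':  (s - 2).equals(s.sub(2)),
    '*':  (s * 2).equals(s.mul(2)),
    '/':  (s / 2).equals(s.truediv(2)),
    '//': (s // 2).equals(s.floordiv(2)),
    '%':  (s % 2).equals(s.mod(2)),
    '**': (s ** 2).equals(s.pow(2)),
}

print(all(checks.values()))   # True
```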


### Ufuncs: Operations Between DataFrame and Series

- Operations between a ``DataFrame`` and a ``Series`` are similar to operations between a two-dimensional and one-dimensional NumPy array.

In [17]:
A = rng.randint(10, size=(3, 4))
A

array([[7, 2, 0, 3],
       [1, 7, 3, 1],
       [5, 5, 9, 3]])

In [18]:
A - A[0]

array([[ 0,  0,  0,  0],
       [-6,  5,  3, -2],
       [-2,  3,  9,  0]])

- According to NumPy's broadcasting rules, subtraction between a two-dimensional array and one of its rows is applied row-wise. Pandas has the same default convention.

In [19]:
df = pd.DataFrame(A, columns=list('QRST'))
df - df.iloc[0]

|   | Q | R | S | T |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 1 | -6 | 5 | 3 | -2 |
| 2 | -2 | 3 | 9 | 0 |


- To operate column-wise instead, use the object methods listed earlier and specify the ``axis`` keyword.

In [20]:
df.subtract(df['R'], axis=0)

|   | Q | R | S | T |
|---|---|---|---|---|
| 0 | 5 | 0 | -2 | 1 |
| 1 | -6 | 0 | -4 | -6 |
| 2 | 0 | 0 | 4 | -2 |
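
- The ``axis`` argument also accepts names. The sketch below rebuilds ``df`` from the array values printed earlier, so that it runs on its own, and checks that ``axis='index'`` matches ``axis=0`` and that ``axis='columns'`` reproduces the default row-wise behavior:

```python
import pandas as pd

# Same values as the array A shown earlier in this section
df = pd.DataFrame([[7, 2, 0, 3],
                   [1, 7, 3, 1],
                   [5, 5, 9, 3]], columns=list('QRST'))

# Column-wise: match the Series against the row index
by_position = df.subtract(df['R'], axis=0)
by_name = df.subtract(df['R'], axis='index')

# Row-wise: match the Series against the column labels (the default)
row_wise = df.subtract(df.iloc[0], axis='columns')

print(by_position.equals(by_name))          # True
print(row_wise.equals(df - df.iloc[0]))     # True
```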


- Like the operations discussed above, these ``DataFrame``/``Series`` operations automatically align indices between the two operands.

In [21]:
halfrow = df.iloc[0, ::2]
halfrow

Q    7
S    0
Name: 0, dtype: int64

In [22]:
df - halfrow

|   | Q | R | S | T |
|---|---|---|---|---|
| 0 | 0.0 | NaN | 0.0 | NaN |
| 1 | -6.0 | NaN | 3.0 | NaN |
| 2 | -2.0 | NaN | 9.0 | NaN |
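
- If the columns missing from the ``Series`` should be left unchanged rather than turned into ``NaN``, one option is to reindex the ``Series`` against the frame's columns before subtracting. This is a sketch of one possible approach, with ``df`` rebuilt from the values shown above so that the block runs on its own:

```python
import pandas as pd

df = pd.DataFrame([[7, 2, 0, 3],
                   [1, 7, 3, 1],
                   [5, 5, 9, 3]], columns=list('QRST'))
halfrow = df.iloc[0, ::2]          # only columns Q and S

# Give the Series an explicit 0 for every column it lacks
padded = halfrow.reindex(df.columns, fill_value=0)

result = df - padded

print(result['R'].tolist())   # [2, 7, 5] (unchanged, since 0 was subtracted)
```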
