# 8. Automatic Index Alignment

## Introduction
This notebook discusses **automatic index alignment**, a surprising, useful, and frustrating feature built into Pandas. Automatic alignment of the index happens whenever you operate on two Pandas objects at the same time. Whether you are operating on two Series, two DataFrames, or one of each, Pandas aligns the indexes first and then completes the operation.

# Adding two Series - Not as simple as it sounds
Adding two Series together should be simple, and most of the time it is, but you can be in for quite a surprise if the indexes do not align. Let's create two identical Series. The **`copy`** method allows us to do this.

In [3]:
import numpy as np
import pandas as pd

In [4]:
s1 = pd.Series(index=['a', 'b', 'c', 'd'], data=[0, 1, 2, 3])
s2 = s1.copy()

In [5]:
s1

a    0
b    1
c    2
d    3
dtype: int64

In [6]:
s2

a    0
b    1
c    2
d    3
dtype: int64

Note that these are two distinct objects. If we had written **`s2 = s1`**, we would not have created a new object, just two variable names that refer to the same object.

In [8]:
s1 is s2

False
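
The difference between a second name and a copy can be checked directly. The following is a minimal sketch; the names `alias` and `clone` are illustrative and do not appear elsewhere in this notebook.

```python
import pandas as pd

s1 = pd.Series(index=['a', 'b', 'c', 'd'], data=[0, 1, 2, 3])

alias = s1         # a second name for the same object
clone = s1.copy()  # a new, independent object

# A change made through the alias is visible through s1,
# because both names refer to the same Series.
alias.loc['a'] = 99

print(alias is s1)     # True
print(s1.loc['a'])     # 99
print(clone.loc['a'])  # 0
```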

### Add the Series together
The Series have the same index and the same values.

In [9]:
s1 + s2

a    0
b    2
c    4
d    6
dtype: int64

## Create new Series with index values in a different order
We create a new Series **`s3`** below with the same index labels as **`s1`** but in a different order.

In [12]:
s3 = pd.Series(index=['d', 'c', 'b', 'a'], data=[0, 1, 2, 3])
s3

d    0
c    1
b    2
a    3
dtype: int64

### Add `s1` to `s3`

In [13]:
s1 + s3

a    3
b    3
c    3
d    3
dtype: int64

### What happened?
Pandas first aligns the data by index label and then completes the operation. Label 'a' appears in both Series: in **`s1`** it labels the value 0 and in **`s3`** it labels the value 3. Added together they sum to 3. All the other labels align in the same manner, and each pair also sums to 3.
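
One way to picture the alignment step is to reorder one Series by the other's labels and then add the raw values by position. This is only a sketch of the idea, not a description of how Pandas implements it internally; `s3_aligned` and `manual` are illustrative names.

```python
import pandas as pd

s1 = pd.Series(index=['a', 'b', 'c', 'd'], data=[0, 1, 2, 3])
s3 = pd.Series(index=['d', 'c', 'b', 'a'], data=[0, 1, 2, 3])

# Step 1: put s3's values in the same label order as s1.
s3_aligned = s3.reindex(s1.index)
print(s3_aligned.tolist())  # [3, 2, 1, 0]

# Step 2: add the underlying values position by position.
manual = pd.Series(s1.to_numpy() + s3_aligned.to_numpy(), index=s1.index)
print(manual.tolist())      # [3, 3, 3, 3]

# The manual result matches what the + operator returns.
print((s1 + s3).tolist())   # [3, 3, 3, 3]
```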

## Adding a NumPy array to a Series
NumPy arrays have no index, just values and integer locations that refer to those values. NumPy arrays align by their integer position (which is what you would expect).

Let's create a simple array with integers 0 to 3 and add it to our Series from above. The index of the Series plays no role in the following operations.

In [17]:
a = np.arange(4)
a

array([0, 1, 2, 3])

In [18]:
s1 + a

a    0
b    2
c    4
d    6
dtype: int64

In [19]:
s3 + a

d    0
c    2
b    4
a    6
dtype: int64

Adding the array to itself also aligns by integer location.

In [21]:
a + a

array([0, 2, 4, 6])
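
Because arrays carry no labels, converting one Series to an array is a simple way to ask for positional addition between two Series. The sketch below contrasts the two behaviors using `s1` and `s3` from above; `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

s1 = pd.Series(index=['a', 'b', 'c', 'd'], data=[0, 1, 2, 3])
s3 = pd.Series(index=['d', 'c', 'b', 'a'], data=[0, 1, 2, 3])

# Series + Series: values are matched by index label.
by_label = s1 + s3
print(by_label.tolist())     # [3, 3, 3, 3]

# Series + array: values are matched by integer position,
# and the result keeps the index of the Series operand.
by_position = s1 + s3.to_numpy()
print(by_position.tolist())  # [0, 2, 4, 6]
```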

## Adding arrays to Series - Must have same number of elements
To add an array to a Series, both must have the same number of elements; otherwise an error is raised.

In [22]:
a = np.arange(5)
a

array([0, 1, 2, 3, 4])

In [25]:
try:
    s1 + a
except Exception as e:
    print(type(e), e)

<class 'ValueError'> operands could not be broadcast together with shapes (4,) (5,) 


## Adding Series that don't have the same index labels
Adding Series that do not have the same index labels is possible. In fact, adding two Series together will always complete (unless their values are incompatible - such as adding a number to a string).

In the following example, we have two Series of different lengths. **`s1`** has one index label, **`d`**, that **`s2`** does not have. When we add them together, the indexes align again, except for **`d`**, which has no matching label in **`s2`**. Pandas keeps this label in the returned Series but with a missing value.

Any label that has no match in the other Series is always kept, and its associated value is always missing. Because `NaN` is a floating-point value, the result is converted from `int64` to `float64`.

In [30]:
s1 = pd.Series(index=['a', 'b', 'c', 'd'], data=[0, 1, 2, 3])
s2 = pd.Series(index=['a', 'b', 'c'], data=[0, 1, 2])

In [31]:
s1 + s2

a    0.0
b    2.0
c    4.0
d    NaN
dtype: float64

### Missing index labels in each Series
If each Series has index labels that do not appear in the other, all of those labels are kept in the result with missing values.

In [35]:
s1 = pd.Series(index=['a', 'b', 'c', 'd'], data=[0, 1, 2, 3])
s2 = pd.Series(index=['a', 'b', 'c', 'e'], data=[0, 1, 2, 3])

In [36]:
s1 + s2

a    0.0
b    2.0
c    4.0
d    NaN
e    NaN
dtype: float64
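
If a missing value is not what you want for unmatched labels, the **`add`** method accepts a `fill_value` that stands in for the absent side before the addition. This is a sketch of one option, not something the examples above rely on; `filled` is an illustrative name.

```python
import pandas as pd

s1 = pd.Series(index=['a', 'b', 'c', 'd'], data=[0, 1, 2, 3])
s2 = pd.Series(index=['a', 'b', 'c', 'e'], data=[0, 1, 2, 3])

# The + operator leaves unmatched labels as NaN.
print((s1 + s2).isna().sum())  # 2

# add() with fill_value=0 treats the absent value as 0 instead.
filled = s1.add(s2, fill_value=0)
print(list(filled.index))  # ['a', 'b', 'c', 'd', 'e']
print(filled.tolist())     # [0.0, 2.0, 4.0, 3.0, 3.0]
```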

## Adding Series with duplicate values in the index
A big surprise awaits when you add two Series that share duplicated index labels. Take a look at both Series below. **`s1`** and **`s2`** each have 3 'a' index labels. **`s1`** also has 3 'b', 4 'c', and 1 'd' label, while **`s2`** has 2 'b', 1 'c', and 1 'e' label.

Let's add them together to see what happens.

In [47]:
s1 = pd.Series(index=['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c', 'd'], data=np.arange(11))
s2 = pd.Series(index=['a', 'a', 'a', 'b', 'b', 'c', 'e'], data=np.arange(7))

In [48]:
s1

a     0
a     1
a     2
b     3
b     4
b     5
c     6
c     7
c     8
c     9
d    10
dtype: int64

In [49]:
s2

a    0
a    1
a    2
b    3
b    4
c    5
e    6
dtype: int64

In [50]:
s1 + s2

a     0.0
a     1.0
a     2.0
a     1.0
a     2.0
a     3.0
a     2.0
a     3.0
a     4.0
b     6.0
b     7.0
b     7.0
b     8.0
b     8.0
b     9.0
c    11.0
c    12.0
c    13.0
c    14.0
d     NaN
e     NaN
dtype: float64

In [51]:
len(s1 + s2)

21

### 21 elements in resulting Series?

### A Cartesian product has taken place
Each index label 'a' from Series **`s1`** aligns with each index label 'a' from **`s2`**. There are 3 'a' labels in each which creates a total of 9 in the result. This is what is meant by a **Cartesian product**. All possible combinations of same index labels in each Series will have a result.

Similarly, Series **`s1`** has 3 'b' labels and **`s2`** has 2 'b' for a total of 6 in the result. Simply multiply the count of the labels in each Series together to get the total labels in the result. 

Label 'c' is found 4 times in **`s1`** and 1 time in **`s2`** for a total of 4 in the result. Labels 'd' and 'e' are unique to each Series so only occur once in the result with a missing value.
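
The counting rule above can be checked with a few lines of code. This sketch uses `collections.Counter` to tally the labels; the variable names are illustrative.

```python
from collections import Counter

import numpy as np
import pandas as pd

s1 = pd.Series(index=list('aaabbbccccd'), data=np.arange(11))
s2 = pd.Series(index=list('aaabbce'), data=np.arange(7))

counts1 = Counter(s1.index)
counts2 = Counter(s2.index)

expected_length = 0
for label in sorted(set(counts1) | set(counts2)):
    n1, n2 = counts1[label], counts2[label]
    # A shared label contributes n1 * n2 rows. A label found in only
    # one Series keeps its own rows, each with a missing value.
    rows = n1 * n2 if n1 and n2 else n1 + n2
    print(label, n1, n2, rows)
    expected_length += rows

print(expected_length)  # 21
print(len(s1 + s2))     # 21
```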

## An exception to Cartesian Product rule
If both Series have identical indexes, meaning the same labels in the same order, then no Cartesian product occurs and the values are combined position by position.

In [60]:
s1 = pd.Series(index=['a', 'a', 'a', 'b', 'b'], data=np.arange(5))
s2 = pd.Series(index=['a', 'a', 'a', 'b', 'b'], data=np.arange(5))

In [61]:
s1 + s2

a    0
a    2
a    4
b    6
b    8
dtype: int64

But if even one index label is different, then a Cartesian product will happen:

In [63]:
s1 = pd.Series(index=['a', 'a', 'a', 'b', 'b'], data=np.arange(5))
s2 = pd.Series(index=['a', 'a', 'a', 'b', 'b', 'c'], data=np.arange(6))
s1 + s2

a    0.0
a    1.0
a    2.0
a    1.0
a    2.0
a    3.0
a    2.0
a    3.0
a    4.0
b    6.0
b    7.0
b    7.0
b    8.0
c    NaN
dtype: float64
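
Before combining two Series, it can be worth checking whether their indexes are identical and whether they contain duplicates. The sketch below uses `Index.equals` and `Index.is_unique` for this; `s4` is an illustrative name for the Series with the extra 'c' label.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(index=['a', 'a', 'a', 'b', 'b'], data=np.arange(5))
s2 = pd.Series(index=['a', 'a', 'a', 'b', 'b'], data=np.arange(5))
s4 = pd.Series(index=['a', 'a', 'a', 'b', 'b', 'c'], data=np.arange(6))

# Identical indexes: no Cartesian product.
print(s1.index.equals(s2.index))  # True
print(len(s1 + s2))               # 5

# One extra label: the indexes differ, so duplicates multiply.
print(s1.index.equals(s4.index))  # False
print(len(s1 + s4))               # 14

# Duplicated labels are what make the product possible.
print(s1.index.is_unique)         # False
```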

## Cartesian product still happens if order is not the same
Even if the index labels share the same number of occurrences in the Series, a Cartesian Product will still happen if the order is different. Below, **`s1`** and **`s2`** have the same number of 'a' and 'b' labels but have a different order for the 3rd and 4th labels.

In [65]:
s1 = pd.Series(index=['a', 'a', 'b', 'a', 'b'], data=np.arange(5))
s2 = pd.Series(index=['a', 'a', 'a', 'b', 'b'], data=np.arange(5))
s1 + s2

a    0
a    1
a    2
a    1
a    2
a    3
a    3
a    4
a    5
b    5
b    6
b    7
b    8
dtype: int64
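
If the two indexes differ only in order, sorting them first makes them identical, and the exception described earlier then applies. This is a sketch under the assumption that pairing values by position within each label is actually what you want; `s1_sorted` is an illustrative name, and a stable sort is requested so that rows with the same label keep their relative order.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(index=['a', 'a', 'b', 'a', 'b'], data=np.arange(5))
s2 = pd.Series(index=['a', 'a', 'a', 'b', 'b'], data=np.arange(5))

# Different label order: Cartesian product, 9 'a' rows + 4 'b' rows.
print(len(s1 + s2))  # 13

# After a stable sort, the index of s1 matches the index of s2.
s1_sorted = s1.sort_index(kind='mergesort')
print(s1_sorted.index.equals(s2.index))  # True

# Identical indexes: one row per original row.
print(len(s1_sorted + s2))  # 5
```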

# DataFrames align on both their index and columns

In [74]:
df1 = pd.DataFrame(data={'first': np.arange(4), 'second': np.arange(4)}, index=['a', 'b', 'c', 'd'])
df2 = df1.copy()

Operations happen as expected whenever index and columns match exactly.

In [82]:
df1 + df2

| | first | second |
|---|---|---|
| a | 0 | 0 |
| b | 2 | 2 |
| c | 4 | 4 |
| d | 6 | 6 |


### DataFrame Index alignment
The label needs to be present in both DataFrames for a value to be computed or else it will be missing.

In [76]:
df1 = pd.DataFrame(data={'first': np.arange(4), 'second': np.arange(4)}, index=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(data={'first': np.arange(4), 'second': np.arange(4)}, index=['a', 'b', 'c', 'e'])
df1

| | first | second |
|---|---|---|
| a | 0 | 0 |
| b | 1 | 1 |
| c | 2 | 2 |
| d | 3 | 3 |


In [77]:
df2

| | first | second |
|---|---|---|
| a | 0 | 0 |
| b | 1 | 1 |
| c | 2 | 2 |
| e | 3 | 3 |


In [78]:
df1 + df2

| | first | second |
|---|---|---|
| a | 0.0 | 0.0 |
| b | 2.0 | 2.0 |
| c | 4.0 | 4.0 |
| d | NaN | NaN |
| e | NaN | NaN |
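
As with Series, the **`add`** method of a DataFrame accepts a `fill_value` that is used when a label exists in only one of the two DataFrames. The following is a sketch of that option; `filled` is an illustrative name.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(data={'first': np.arange(4), 'second': np.arange(4)},
                   index=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(data={'first': np.arange(4), 'second': np.arange(4)},
                   index=['a', 'b', 'c', 'e'])

# Rows 'd' and 'e' each exist in only one DataFrame, so the absent
# side is treated as 0 instead of producing a missing value.
filled = df1.add(df2, fill_value=0)

print(list(filled.index))       # ['a', 'b', 'c', 'd', 'e']
print(filled['first'].tolist()) # [0.0, 2.0, 4.0, 3.0, 3.0]
print(filled.isna().sum().sum())  # 0
```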


## When Columns do not align

In [83]:
df1 = pd.DataFrame(data={'first': np.arange(4), 'second': np.arange(4)}, index=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(data={'first': np.arange(4), 'third': np.arange(4)}, index=['a', 'b', 'c', 'e'])
df1

| | first | second |
|---|---|---|
| a | 0 | 0 |
| b | 1 | 1 |
| c | 2 | 2 |
| d | 3 | 3 |


In [84]:
df2

| | first | third |
|---|---|---|
| a | 0 | 0 |
| b | 1 | 1 |
| c | 2 | 2 |
| e | 3 | 3 |


In [85]:
df1 + df2

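| | first | second | third |
|---|---|---|---|
| a | 0.0 | NaN | NaN |
| b | 2.0 | NaN | NaN |
| c | 4.0 | NaN | NaN |
| d | NaN | NaN | NaN |
| e | NaN | NaN | NaN |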
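
To see what each DataFrame looks like after alignment but before the addition, the **`align`** method returns both operands reindexed to the union of their row and column labels. This is a sketch for inspection purposes; `left` and `right` are illustrative names.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(data={'first': np.arange(4), 'second': np.arange(4)},
                   index=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(data={'first': np.arange(4), 'third': np.arange(4)},
                   index=['a', 'b', 'c', 'e'])

# Both results share the same index and the same columns.
left, right = df1.align(df2)

print(list(left.index))    # ['a', 'b', 'c', 'd', 'e']
print(list(left.columns))  # ['first', 'second', 'third']

# df1 had no 'third' column and df2 had no 'second' column,
# so those columns are entirely missing after alignment.
print(left['third'].isna().all())    # True
print(right['second'].isna().all())  # True
```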


# Cartesian Product over index and columns

In [99]:
df1 = pd.DataFrame(data=np.random.rand(7, 5), 
                   index=['a', 'a', 'a', 'b', 'b', 'c', 'f'], 
                   columns=['first', 'first', 'second', 'second', 'third'])
df2 = pd.DataFrame(data=np.random.rand(8, 5), 
                   index=['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                   columns=['first', 'first', 'first', 'second', 'second'])
(df1 + df2).shape

(15, 11)
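
The shape follows from the same counting rule used for Series, applied once to the rows and once to the columns. The helper below, `aligned_size`, is a hypothetical function written for this sketch; it assumes the two label sequences are not identical, since identical labels skip the Cartesian product.

```python
import pandas as pd


def aligned_size(labels1, labels2):
    """Number of labels after aligning two non-identical label sequences."""
    counts1 = pd.Series(labels1).value_counts()
    counts2 = pd.Series(labels2).value_counts()
    # Shared labels contribute count1 * count2. A label found on only
    # one side is multiplied by the fill value 1, so it keeps its count.
    return int(counts1.mul(counts2, fill_value=1).sum())


rows = aligned_size(['a', 'a', 'a', 'b', 'b', 'c', 'f'],
                    ['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'])
cols = aligned_size(['first', 'first', 'second', 'second', 'third'],
                    ['first', 'first', 'first', 'second', 'second'])

print(rows, cols)  # 15 11
```

For the rows, 'a' gives 3 x 2 = 6, 'b' gives 2 x 2 = 4, 'c' gives 1 x 2 = 2, and the unmatched 'd' and 'f' add 2 and 1, for 15 in total. For the columns, 'first' gives 2 x 3 = 6, 'second' gives 2 x 2 = 4, and 'third' adds 1, for 11.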

In [96]:
df2

| | first | first | first | second | second |
|---|---|---|---|---|---|
| a | 0.073697 | 0.142855 | 0.266156 | 0.812797 | 0.619567 |
| a | 0.005055 | 0.553022 | 0.819766 | 0.223139 | 0.264186 |
| b | 0.37417 | 0.423487 | 0.048706 | 0.695765 | 0.794687 |
| b | 0.704631 | 0.601275 | 0.396332 | 0.876774 | 0.352477 |
| c | 0.730383 | 0.371497 | 0.186939 | 0.123131 | 0.769738 |
| c | 0.546997 | 0.129504 | 0.260247 | 0.10891 | 0.924331 |
| d | 0.031439 | 0.929202 | 0.501951 | 0.234963 | 0.740949 |
| d | 0.015015 | 0.789543 | 0.910344 | 0.39767 | 0.435275 |


In [97]:
df1 + df2

| | first | first | first | first | first | first | second | second | second | second | third |
|---|---|---|---|---|---|---|---|---|---|---|---|
| a | 0.264907 | 0.334065 | 0.457365 | 0.161372 | 0.23053 | 0.35383 | 1.292675 | 1.099446 | 0.909314 | 0.716085 | NaN |
| a | 0.196265 | 0.744232 | 1.010975 | 0.09273 | 0.640697 | 0.907441 | 0.703018 | 0.744065 | 0.319656 | 0.360703 | NaN |
| a | 0.224361 | 0.293518 | 0.416819 | 1.060915 | 1.130072 | 1.253373 | 1.52319 | 1.329961 | 1.303956 | 1.110727 | NaN |
| a | 0.155719 | 0.703685 | 0.970429 | 0.992273 | 1.540239 | 1.806983 | 0.933533 | 0.97458 | 0.714299 | 0.755346 | NaN |
| a | 1.009048 | 1.078206 | 1.201507 | 0.708086 | 0.777243 | 0.900544 | 0.913501 | 0.720272 | 1.729732 | 1.536503 | NaN |
| a | 0.940406 | 1.488373 | 1.755117 | 0.639444 | 1.18741 | 1.454154 | 0.323843 | 0.36489 | 1.140075 | 1.181121 | NaN |
| b | 0.849271 | 0.898588 | 0.523807 | 0.569953 | 0.619269 | 0.244488 | 1.09489 | 1.193811 | 1.386439 | 1.485361 | NaN |
| b | 1.179731 | 1.076375 | 0.871432 | 0.900413 | 0.797057 | 0.592114 | 1.275899 | 0.751602 | 1.567448 | 1.043151 | NaN |
| b | 0.390061 | 0.439377 | 0.064596 | 0.467622 | 0.516939 | 0.142158 | 1.219307 | 1.318229 | 1.401733 | 1.500655 | NaN |
| b | 0.720521 | 0.617165 | 0.412222 | 0.798082 | 0.694727 | 0.489783 | 1.400316 | 0.876019 | 1.582742 | 1.058445 | NaN |
