# Automatic Index Alignment

This chapter discusses **automatic index alignment**, a surprising, occasionally useful, and sometimes frustrating feature built into pandas. Automatic alignment of the index happens when operating with two pandas objects simultaneously. Whether operating with two Series, two DataFrames, or one of each, automatic alignment of the index takes place first, and then the operation completes.

## Adding two Series - Not as simple as it sounds

Adding two Series together sounds simple, and most of the time it is, but you can be in for quite a surprise if the indexes do not align. Let's create two identical Series, one with the constructor, and the other with the `copy` method.

In [None]:
import pandas as pd
s1 = pd.Series(index=['Andy', 'Bridget', 'Cali', 'Dino'], data=[0, 1, 2, 3])
s2 = s1.copy()

Below, we output the contents of each Series.

In [None]:
s1

In [None]:
s2

These Series are two distinct objects in memory. If we had created `s2` with the assignment statement `s2 = s1`, we would not have created a new object, and would instead have two variable names that refer to the same underlying object in memory. We verify that the two Series are distinct objects in memory with the `is` operator.

In [None]:
s1 is s2

### Add the Series together

Adding these two Series together yields an unsurprising result. The index values have stayed the same, while the values have doubled.

In [None]:
s1 + s2

### Create a new Series with index values in a different order

Let's create a new Series, `s3`, with the same index values, but with a different order.

In [None]:
s3 = pd.Series(index=['Dino', 'Cali', 'Bridget', 'Andy'], data=[0, 1, 2, 3])
s3

### Add `s1` to `s3`

Adding `s1` to `s3` yields a result that might surprise some. The index is ordered alphabetically and all the values are the same.

In [None]:
s1 + s3

### Explanation

Whenever two Series are added together, pandas aligns them by the index first and then completes the operation. In the above example, index value `'Andy'` from `s1` aligns with index value `'Andy'` from `s3` and then the integer values are added. In `s1`, index `'Andy'` labels value 0 and in `s3`, it labels value 3. Added together they sum to 3. All other index values align in this manner and all sum to 3.
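
The alignment step can be made visible with the `align` method, which returns both Series reindexed to a shared index without adding them. The following is a minimal sketch that rebuilds `s1` and `s3` from above; the names `left`, `right`, and `result` are introduced here only for illustration.

```python
import pandas as pd

s1 = pd.Series(index=['Andy', 'Bridget', 'Cali', 'Dino'], data=[0, 1, 2, 3])
s3 = pd.Series(index=['Dino', 'Cali', 'Bridget', 'Andy'], data=[0, 1, 2, 3])

# align returns both Series reindexed to the union of their indexes
left, right = s1.align(s3)
print(left)
print(right)

# Once aligned, the values are added label by label
result = s1 + s3
print(result)
```

After alignment, `left` and `right` share the same index, so the addition pairs each value with the one carrying the same label.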

### Reversing the order of the operation (commutative property)

As in formal mathematics, the order of the operation does not change the results for those operations that adhere to the commutative property (`1 + 2 == 2 + 1`). Here, we add the same two Series together but do so with `s3` first.

In [None]:
s3 + s1

### Automatic index sorting - when it doesn't happen

When operating with two Series, the result will automatically be sorted by the index whenever the index values are not in the same order, as we've seen above. It will, however, preserve the order of the index if each Series has the same index. Here, `s4` is created with the same index values, in the same order as `s3`.

In [None]:
s4 = pd.Series(index=['Dino', 'Cali', 'Bridget', 'Andy'], data=[10, -4, 33, 99])
s4

When added to `s3`, the index values align, but remain in the same order.

In [None]:
s3 + s4
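
To compare the two cases side by side, the sketch below collects the index of each result as a list. It rebuilds the Series from above; `mixed_order` and `same_order` are names introduced only for this illustration.

```python
import pandas as pd

s1 = pd.Series(index=['Andy', 'Bridget', 'Cali', 'Dino'], data=[0, 1, 2, 3])
s3 = pd.Series(index=['Dino', 'Cali', 'Bridget', 'Andy'], data=[0, 1, 2, 3])
s4 = pd.Series(index=['Dino', 'Cali', 'Bridget', 'Andy'], data=[10, -4, 33, 99])

# The indexes hold the same labels in a different order,
# so the result index comes back sorted
mixed_order = (s3 + s1).index.tolist()

# The indexes are identical, so the original order is preserved
same_order = (s3 + s4).index.tolist()

print(mixed_order)
print(same_order)
```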

## Adding together numpy arrays

numpy arrays have no index, just values and integer locations that refer to those values. These arrays align by their integer location (which is what you would expect). Let's create two simple arrays with integers 0 to 3 and add them together.

In [None]:
import numpy as np
a = np.arange(4)
b = a.copy()
a

In [None]:
b

Adding the two arrays yields the following unsurprising result.

In [None]:
a + b

### Adding numpy arrays to pandas Series

Since numpy arrays do not have an index, pandas will add the values together by integer location (ignoring the index labels of the Series) and return a Series with its index in the same order as the original.

In [None]:
a + s3
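
As a small check of the positional behavior, the sketch below rebuilds `a` and `s3` and inspects the result. The name `result` is introduced only for illustration.

```python
import numpy as np
import pandas as pd

a = np.arange(4)
s3 = pd.Series(index=['Dino', 'Cali', 'Bridget', 'Andy'], data=[0, 1, 2, 3])

# The array has no labels, so values pair up by integer location
result = a + s3

# The result is a Series that keeps the index of s3 in its original order
print(type(result).__name__)
print(result)
```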

### Arrays and Series must have same number of elements

For arithmetic between an array and a Series to succeed, both must have the same number of elements; otherwise an error is raised. Here, we create a new array with one more element than `s3`.

In [None]:
a = np.arange(5)
a

The shapes are misaligned and the operation fails.

In [None]:
a + s3
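
If you would rather handle the failure than let it stop the notebook, the operation can be wrapped in a `try` block. This sketch assumes the error surfaces as a `ValueError`, which is what recent pandas versions raise; the exact message may differ between versions. The `failed` flag is a hypothetical name used only here.

```python
import numpy as np
import pandas as pd

s3 = pd.Series(index=['Dino', 'Cali', 'Bridget', 'Andy'], data=[0, 1, 2, 3])
a = np.arange(5)  # one more element than s3

try:
    a + s3
    failed = False
except ValueError as err:
    # Assumption: a length mismatch is reported as a ValueError
    failed = True
    print(f'ValueError: {err}')
```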

## Operating on two Series with different index values

Performing arithmetic operations on two Series that do not have the same index labels is possible. In fact, adding two Series together will always complete (unless their values are incompatible - such as adding a number to a string).

In the following example, we have two Series of *different* lengths. `s1` has one more index label, `'Dino'`, that `s2` does not have. When we add them together, the indexes align again, except for `'Dino'`. It has no matching index in `s2`. This label is kept in the returned Series, but with a missing value. Any index label unique to one Series will be kept in the result and have its value as missing.

In [None]:
s1 = pd.Series(index=['Andy', 'Bridget', 'Cali', 'Dino'], data=[0, 1, 2, 3])
s2 = pd.Series(index=['Andy', 'Bridget', 'Cali'], data=[0, 1, 2])
s1 + s2

### Missing index labels in each Series

If each Series has index labels that do not appear in the other, then all of those labels will be kept in the result with missing values. Here, index labels `'Andy'`, `'Bridget'`, and `'Cali'` align, but `'Dino'` and `'Elias'` are each unique to one Series.

In [None]:
s1 = pd.Series(index=['Andy', 'Bridget', 'Cali', 'Dino'], data=[0, 1, 2, 3])
s2 = pd.Series(index=['Andy', 'Bridget', 'Cali', 'Elias'], data=[0, 1, 2, 3])
s1 + s2
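
The sketch below isolates the labels that end up missing and shows a side effect of the missing values: integers cannot hold `NaN`, so the result is converted to float. The name `result` is introduced only for illustration.

```python
import pandas as pd

s1 = pd.Series(index=['Andy', 'Bridget', 'Cali', 'Dino'], data=[0, 1, 2, 3])
s2 = pd.Series(index=['Andy', 'Bridget', 'Cali', 'Elias'], data=[0, 1, 2, 3])

result = s1 + s2

# Labels found in only one Series come back as missing values
print(result[result.isna()].index.tolist())

# The integer inputs produce a float result because of the missing values
print(s1.dtype, result.dtype)
```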

## Adding Series with duplicate values in the index

A big surprise awaits when you add two Series that each share duplicated index labels. Take a look at both Series below. `s1` and `s2` each have 3 `'Andy'` index labels. `s1` has 3 `'Bridget'`, 4 `'Cali'`, and 1 `'Dino'` index label while `s2` has 2 `'Bridget'`, 1 `'Cali'`, and 1 `'Elias'` label. Let's add them together to see what happens.

In [None]:
s1 = pd.Series(index=['Andy', 'Andy', 'Andy', 'Bridget', 'Bridget', 'Bridget', 
                      'Cali', 'Cali', 'Cali', 'Cali', 'Dino'], 
               data=np.arange(11))
s2 = pd.Series(index=['Andy', 'Andy', 'Andy', 'Bridget', 'Bridget', 
                      'Cali', 'Elias'], 
               data=np.arange(7))
s1 + s2

### A Cartesian product has taken place

Each index label `'Andy'` from Series `s1` aligns with each index label `'Andy'` from `s2`. There are 3 `'Andy'` labels in each which creates a total of 9 in the result. This is what is meant by a **Cartesian Product**. All possible combinations of the same index labels in each Series will have a result.

Similarly, Series `s1` has 3 `'Bridget'` labels and `s2` has 2 `'Bridget'` for a total of 6 in the result. Multiply the count of the labels in each Series together to get the total labels in the result. 

Label `'Cali'` is found 4 times in `s1` and 1 time in `s2` for a total of 4 values in the result. Labels `'Dino'` and `'Elias'` are unique to each Series so only occur once in the result with a missing value.
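
The counting rule described above can be checked directly. This is a sketch, not part of the original example: it counts each label with `value_counts`, multiplies the counts label by label, and compares the outcome with the index of the actual sum. The names `counts1`, `counts2`, `expected`, and `actual` are introduced only for illustration.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(index=['Andy'] * 3 + ['Bridget'] * 3 + ['Cali'] * 4 + ['Dino'],
               data=np.arange(11))
s2 = pd.Series(index=['Andy'] * 3 + ['Bridget'] * 2 + ['Cali', 'Elias'],
               data=np.arange(7))

# Count how often each label occurs in each index
counts1 = s1.index.value_counts()
counts2 = s2.index.value_counts()

# Multiply the counts label by label. A label absent from one Series is
# treated as a count of 1, since each of its rows appears once in the result.
expected = counts1.mul(counts2, fill_value=1).astype('int64').sort_index()

# Count the labels in the actual result
actual = (s1 + s2).index.value_counts().sort_index()

print(expected)
print(actual)
```

Both counts agree, and their total is the length of the resulting Series.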

### An exception to Cartesian Product rule

If both Series share the exact same index labels, in the same order, then no Cartesian product will occur. Here, each Series has 3 `'Andy'` labels and 2 `'Bridget'` labels, but since they are in the same order, no Cartesian product will occur.

In [None]:
s1 = pd.Series(index=['Andy', 'Andy', 'Andy', 'Bridget', 'Bridget'], 
               data=np.arange(5))
s2 = pd.Series(index=['Andy', 'Andy', 'Andy', 'Bridget', 'Bridget'], 
               data=np.arange(5))
s1 + s2

A Cartesian product will happen even if a single label is different.

In [None]:
s1 = pd.Series(index=['Andy', 'Andy', 'Andy', 'Bridget', 'Bridget'], data=np.arange(5))
s2 = pd.Series(index=['Andy', 'Andy', 'Andy', 'Bridget', 'Bridget', 'Cali'], data=np.arange(6))
s1 + s2

### Cartesian product still happens if order is not the same

If the index labels share the same number of occurrences in the Series, a Cartesian Product will still happen if the order is different. Below, `s1` and `s2` have the same number of `'Andy'` and `'Bridget'` labels, but have a different order for the 3rd and 4th labels.

In [None]:
s1 = pd.Series(index=['Andy', 'Andy', 'Bridget', 'Andy', 'Bridget'], 
               data=np.arange(5))
s2 = pd.Series(index=['Andy', 'Andy', 'Andy', 'Bridget', 'Bridget'], 
               data=np.arange(5))
s1 + s2
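
The three cases in this section are easiest to compare by the length of each result. The sketch below rebuilds them with hypothetical helper names (`same`, `extra`, `shuffled`, `base`) that do not appear in the cells above.

```python
import numpy as np
import pandas as pd

same = ['Andy', 'Andy', 'Andy', 'Bridget', 'Bridget']
extra = same + ['Cali']
shuffled = ['Andy', 'Andy', 'Bridget', 'Andy', 'Bridget']

base = pd.Series(index=same, data=np.arange(5))

# Identical index: values pair up one to one
len_same = len(base + pd.Series(index=same, data=np.arange(5)))

# One extra label: 3 * 3 'Andy' + 2 * 2 'Bridget' + 1 'Cali'
len_extra = len(base + pd.Series(index=extra, data=np.arange(6)))

# Same labels in a different order: 3 * 3 'Andy' + 2 * 2 'Bridget'
len_shuffled = len(pd.Series(index=shuffled, data=np.arange(5)) + base)

print(len_same, len_extra, len_shuffled)
```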

## Arithmetic operations with two DataFrames

We'll now cover what happens when two DataFrames are used in an arithmetic operation. Automatic alignment of both the index and columns happens. We begin by creating two identical DataFrames.

In [None]:
df1 = pd.DataFrame(data={'first': np.arange(4), 'second': np.arange(4)}, 
                   index=['Andy', 'Bridget', 'Cali', 'Dino'])
df2 = df1.copy()
df1

In [None]:
df2

Adding these two together yields no surprises. Each value corresponds exactly to one other value.

In [None]:
df1 + df2

### DataFrame index alignment

Let's now see what happens when the index labels are not the same, but the column labels are.

In [None]:
df1 = pd.DataFrame(data={'first': np.arange(4), 'second': np.arange(4)}, 
                   index=['Andy', 'Bridget', 'Cali', 'Dino'])
df2 = pd.DataFrame(data={'first': np.arange(4), 'second': np.arange(4)}, 
                   index=['Andy', 'Bridget', 'Cali', 'Elias'])
df1

In [None]:
df2

Adding together these two DataFrames returns two rows of missing values, one for each of the unique labels in each DataFrame.

In [None]:
df1 + df2

### DataFrame column alignment

Similarly, the columns of each DataFrame also align with one another and create entire columns of missing values when unique to one DataFrame.

In [None]:
df1 = pd.DataFrame(data={'first': np.arange(4), 'second': np.arange(4)}, 
                   index=['Andy', 'Bridget', 'Cali', 'Dino'])
df2 = pd.DataFrame(data={'first': np.arange(4), 'third': np.arange(4)}, 
                   index=['Andy', 'Bridget', 'Cali', 'Elias'])
df1

In [None]:
df2

Here, each DataFrame has one unique column and one unique index label, resulting in two rows and two columns of missing values.

In [None]:
df1 + df2
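
To confirm where the missing values fall, the sketch below checks the shape of the result and counts the non-missing values per column. The name `result` is introduced only for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(data={'first': np.arange(4), 'second': np.arange(4)},
                   index=['Andy', 'Bridget', 'Cali', 'Dino'])
df2 = pd.DataFrame(data={'first': np.arange(4), 'third': np.arange(4)},
                   index=['Andy', 'Bridget', 'Cali', 'Elias'])

result = df1 + df2

# Union of the row labels (5) by union of the column labels (3)
print(result.shape)

# Only 'first' for Andy, Bridget, and Cali has a value in both DataFrames
print(result.notna().sum())
```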

## Appending new columns to a DataFrame from a Series

In this section, we'll cover how index alignment affects the creation of new columns in a DataFrame. Previously, we've appended new columns to the end of our DataFrame by using existing columns. Let's review this now by reading in the sample dataset.

In [None]:
df = pd.read_csv('../data/sample_data.csv', index_col='name')
df

Creating new columns from operations between two columns that already exist in the DataFrame is fairly straightforward. Here, we create two columns, `age_score` and `older_than_30`, with data types float and boolean.

In [None]:
df['age_score'] = df['age'] * df['score']
df['older_than_30'] = df['age'] > 30
df

Let's now create a new Series the same length as the DataFrame, but with no index labels in common.

In [None]:
index = ['Andy', 'Bridget', 'Cali', 'Dino', 'Elias', 'Foti', 'Giannis']
s1 = pd.Series(index=index, data=np.arange(len(index)))
s1

We can create a new column in this manner, as normal, with the assignment statement. Automatic alignment of the index always takes place whenever two Series/DataFrames are operated on together. Since no index labels are in common, none of the values from `s1` are used in the new column.

In [None]:
df['new_values'] = s1
df

It's possible to use any Series of any length to assign a new column. If there are any labels in common, those values will exist in the new column. Here, a Series with more values than rows in the DataFrame is created that has two matching index values (Cornelia and Jane).

In [None]:
index = ['Andy', 'Cornelia', 'Cali', 'Dino', 'Elias', 'Jane', 
         'Giannis', 'Holden', 'Issac', 'Johnny']
s2 = pd.Series(index=index, data=np.arange(len(index)))
s2

When using this Series to create a new column, the two index labels that matched have their values in the new column.

In [None]:
df['new_values2'] = s2
df

As you probably noticed above, the order of the index labels is irrelevant. The automatic index alignment ensures that the value associated with that index gets placed in that row. Let's see this in action by creating a Series with an index that contains the same values as the DataFrame, but in a different order. We do this by selecting the `food` column and then sorting it by the index.

In [None]:
s3 = df['food'].sort_index()
s3

Assigning a new column using this Series simply duplicates the original column as the index alignment will return the values to their original order.

In [None]:
df['food2'] = s3
df
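
The same rules can be reproduced without the sample file. The sketch below uses a small, made-up DataFrame as a stand-in for the sample dataset; its values and the names `s` and `new_values` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the sample dataset
df = pd.DataFrame({'age': [30, 2, 12]}, index=['Jane', 'Niko', 'Aaron'])

# Longer than the DataFrame, in a different order, with two unknown labels
s = pd.Series(index=['Zed', 'Aaron', 'Jane', 'Extra'], data=[1, 2, 3, 4])

# Assignment aligns on the index: matching labels receive their value,
# unmatched rows become missing, and unknown labels are dropped
df['new_values'] = s
print(df)
```

The DataFrame keeps its three rows. The values for `'Jane'` and `'Aaron'` land in the right rows despite the different order, and `'Niko'` is missing.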

## Arithmetic operations with one DataFrame and one Series

We've covered arithmetic operations when the two pandas objects are the same type. In this section, we'll cover what happens when they're different, i.e. one DataFrame and one Series. We begin by reading in the City of Houston employee dataset.

In [None]:
emp = pd.read_csv('../data/employee.csv')
emp.head(3)

To help motivate this operation, we create a pivot table with the total salary for each unique combination of sex and department.

In [None]:
df_total = emp.pivot_table(index='sex', columns='dept', 
                         values='salary', aggfunc='sum').round(-3).astype('int64')
df_total

Let's say we are interested in finding the percentage of total fire department salary that is paid to females. We first need to find the total salary of each department. To do this, we call the `sum` method, which totals each column.

In [None]:
total_by_dept = df_total.sum()
total_by_dept

Dividing the values in each department column by this total will answer the above question. But, we can extend this inquiry to finding the percentage for each department. We can accomplish this by dividing the DataFrame `df_total` by the Series `total_by_dept`. You might guess that if we write the operation, `df_total / total_by_dept`, pandas would align the objects by their index since a Series has no columns. But, this isn't what happens. Surprisingly, **pandas automatically aligns the DataFrame columns with the Series index**.

In [None]:
(df_total / total_by_dept).round(3) * 100

This informs us that of the grand total of fire department salaries, 5.6% was paid to females with varying percentages for the other departments.
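
The column-to-index alignment is easier to follow on a tiny table. The sketch below uses made-up totals in place of the real pivot table, so the numbers are assumptions and not figures from the employee data.

```python
import pandas as pd

# Hypothetical totals standing in for the pivot table above
df_total = pd.DataFrame({'Fire': [1, 3], 'Police': [2, 2]},
                        index=['Female', 'Male'])

# Series indexed by the column labels 'Fire' and 'Police'
total_by_dept = df_total.sum()

# Each column is divided by the Series value carrying the same label
pct = df_total / total_by_dept * 100
print(pct)
```

Each column of `pct` sums to 100 because every department column was divided by its own total.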

## DataFrame arithmetic and comparison methods

All DataFrame arithmetic and comparison operators have a corresponding method that produces the same result, but with more options. For instance, the addition operator may be duplicated with the `add` method. Below we add 5 to each value of `df1`.

In [None]:
df1 + 5

The DataFrame `add` method may be used instead to reproduce the same result.

In [None]:
df1.add(5)

Use the `gt` method in place of the greater than operator. Here, we test whether each value is greater than one.

In [None]:
df1.gt(1)

The following table summarizes these arithmetic and comparison operators which can also be found in [the official documentation][0].


| Arithmetic <br> Operator |          Method        |     | Comparison <br> Operator | Method |
|:-------------------:|:------------------------:|---|:-------------------:|:------:|
|         `+`         |           `add`          |    |         `>`        |   `gt`   |
|          `-`          |     `sub`/`subtract`     |   |        `<`         |  `lt`    |
|          `*`          |     `mul`/`multiply`     |    |       `>=`          |  `ge`    |
|          `/`          | `div`/`divide`/`truediv` |   |        `<=`          |  `le`    |
|          `//`         |        `floordiv`        |  |         `==`          |   `eq`   |
|          `**`       |         `pow`              |   |        `!=`          |   `ne`   |
|          `%`       |         `mod`              |   |                  |      |

### Why use these methods instead of the operators

Personally, I never use these methods as the operators are universal and clearer in their meaning. However, since they are methods, they do provide additional functionality not available with the operators, and there are special cases when using the method is necessary.
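
One example of that additional functionality is the `fill_value` parameter of `add`, which substitutes a value for anything missing on one side before the operation runs. The sketch below uses two small, made-up DataFrames (`df_a` and `df_b` are hypothetical names) to contrast the operator with the method.

```python
import pandas as pd

df_a = pd.DataFrame({'first': [1, 2]}, index=['Andy', 'Bridget'])
df_b = pd.DataFrame({'first': [10, 20]}, index=['Andy', 'Cali'])

# The operator leaves missing values for labels found in only one DataFrame
with_operator = df_a + df_b

# fill_value substitutes 0 for a value missing from one side before adding
with_method = df_a.add(df_b, fill_value=0)

print(with_operator)
print(with_method)
```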

[0]: https://pandas.pydata.org/docs/reference/frame.html#binary-operator-functions

### Changing the direction of the operation

To showcase the necessity of these methods, let's turn back to our employee data and find the distribution of salary by department within each sex. We would be able to answer questions like "Out of the total of all female salaries, what percentage was paid to those in the fire department?". For this question, we need to aggregate salaries by sex, which we do below.

In [None]:
total_by_sex = df_total.sum(axis=1)
total_by_sex

As we just saw, arithmetic operations between a DataFrame and a Series align on the columns and the index, respectively. This isn't what we want, but let's execute the operation regardless to see what happens.

In [None]:
df_total / total_by_sex

Because the DataFrame columns and Series index have no values in common, all returned values are missing. The new columns of the DataFrame are a union of the original DataFrame columns and Series index.

In order to successfully complete this operation, we need to tell pandas to align each object by its index. We must use one of the methods from above, which allows us to change the direction of the operation with the `axis` parameter. In this instance, we use the `div` method, setting the `axis` parameter to `0` (or `'index'`). By default, it is set to `'columns'` or `1`, which is the opposite of nearly all other DataFrame methods. Now, the alignment happens on the index first and then the operation completes.

In [None]:
df_total.div(total_by_sex, axis=0).round(3) * 100

From this DataFrame, 3.8% of all female salaries are from the fire department.
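
The difference between the operator and the `div` method with `axis=0` can be seen on a small table. The totals below are made up for illustration and are not taken from the employee data; `misaligned` and `pct` are hypothetical names.

```python
import pandas as pd

# Hypothetical totals standing in for the pivot table above
df_total = pd.DataFrame({'Fire': [1, 6], 'Police': [3, 2]},
                        index=['Female', 'Male'])

# Series indexed by 'Female' and 'Male'
total_by_sex = df_total.sum(axis=1)

# Operator: the columns ('Fire', 'Police') never match the Series index
# ('Female', 'Male'), so every value is missing
misaligned = df_total / total_by_sex
print(misaligned)

# Method with axis=0: the DataFrame index aligns with the Series index
pct = df_total.div(total_by_sex, axis=0) * 100
print(pct)
```

With `axis=0`, each row of `pct` sums to 100 because every row was divided by its own total.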

## Exercises

### Exercise 1

<span style="color:green; font-size:16px">Create two Series of integers with no missing values. Make one with four values and the other with five. When added together, the result should be a Series with 10 values, 2 of which are missing.</span>

### Exercise 2

<span style="color:green; font-size:16px">Create two Series of integers, each with three values, but with a non-identical index. When added together, the result should be a Series with three values.</span>

### Exercise 3

<span style="color:green; font-size:16px">Add two Series together containing integers resulting in a new Series with all missing values.</span>

### Exercise 4

<span style="color:green; font-size:16px">You add two Series together, one with four values, and the other with five. Each index label is the same. For instance, every label in both Series could be `'a'`. How many total values would be in the resulting Series? Answer the question without pandas, and then check your work with it.</span>

### Exercise 5

<span style="color:green; font-size:16px">Can you determine the shape of the resulting addition between the following two DataFrames before completing the operation?</span>

In [None]:
df1 = pd.DataFrame(data=np.random.randint(1, 6, (4, 8)), 
                   index=['a', 'b', 'b', 'd'], 
                   columns=['a', 'b', 'c', 'd', 
                            'e', 'f', 'g', 'h'])
df2 = pd.DataFrame(data=np.random.randint(1, 6, (6, 5)),
                   index=['a', 'b', 'b', 'b', 'b', 'c'],
                   columns=['a', 'b', 'c', 'd', 'e'])

### Exercise 6

<span style="color:green; font-size:16px">Read in the sample dataset (`sample_data.csv`), placing the name column in the index. Create a new Series with two values and use it to create a new column in the DataFrame. Make it such that only one of the values appears in the resulting DataFrame.</span>

### Exercise 7

<span style="color:green; font-size:16px">Read in the x, y, and z columns from the diamonds dataset. Find the mean of each row and subtract that value from each value in the row.</span>