# 2.3 – Operating on Data in Pandas

One of the essential pieces of NumPy is the ability to perform quick element-wise operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and with more sophisticated operations (trigonometric functions, exponential and logarithmic functions, etc.).
Pandas inherits much of this functionality from NumPy, and the ufuncs that we introduced in [Computation on NumPy Arrays: Universal Functions](L13_Computation_on_Arrays_UFuncs.ipynb) are key to this.

Pandas includes a couple of useful twists, however: for unary operations like negation and trigonometric functions, these ufuncs will *preserve index and column labels* in the output, and for binary operations such as addition and multiplication, Pandas will automatically *align indices* when passing the objects to the ufunc.
This means that keeping the context of data and combining data from different sources, both potentially error-prone tasks with raw NumPy arrays, become far less error-prone with Pandas.
We will additionally see that there are well-defined operations between one-dimensional ``Series`` structures and two-dimensional ``DataFrame`` structures.

## UFuncs: Index Preservation

Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas ``Series`` and ``DataFrame`` objects.
Let's start by defining a simple ``Series`` and ``DataFrame`` on which to demonstrate this:

In [1]:
import pandas as pd
import numpy as np

In [2]:
rng = np.random.RandomState(42)  # a random number generator
ser = pd.Series(rng.randint(0, 10, 4))   
ser

0    6
1    3
2    7
3    4
dtype: int64

In [3]:
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 6 | 9 | 2 | 6 |
| 1 | 7 | 4 | 3 | 7 |
| 2 | 7 | 2 | 5 | 4 |


If we apply a NumPy ufunc on either of these objects, the result will be another Pandas object *with the indices preserved:*

In [4]:
np.exp(ser)

0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64

Or, for a slightly more complex calculation:

In [5]:
np.sin(df * np.pi / 4)

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | -1.0 | 0.7071068 | 1.0 | -1.0 |
| 1 | -0.707107 | 1.224647e-16 | 0.707107 | -0.7071068 |
| 2 | -0.707107 | 1.0 | -0.707107 | 1.224647e-16 |


Any of the ufuncs discussed in [Computation on NumPy Arrays: Universal Functions](L13_Computation_on_Arrays_UFuncs.ipynb) can be used in a similar manner.
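
As a quick check of this label-preserving behaviour, the sketch below applies ``np.sqrt`` to a ``Series`` with string labels and compares the result with what happens on the bare NumPy array. The names ``readings``, ``roots`` and ``raw_roots`` are illustrative only and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a Series with non-default string labels
readings = pd.Series([0.0, 1.0, 4.0, 9.0], index=['w', 'x', 'y', 'z'])

# Applying a ufunc to the Series returns a Series with the same index
roots = np.sqrt(readings)
print(roots)

# Applying the same ufunc to the underlying array returns a plain ndarray,
# so the labels are no longer attached to the values
raw_roots = np.sqrt(readings.to_numpy())
print(type(raw_roots).__name__)
```

The second call still computes the same numbers, but it is then up to the user to keep track of which value belongs to which label.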

## UFuncs: Index Alignment

For binary operations on two ``Series`` or ``DataFrame`` objects, Pandas will align indices in the process of performing the operation.
This is very convenient when working with incomplete data, as we will see in some of the examples that follow.

### Index alignment in Series

As an example, suppose we are combining two different data sources, and find only the top three US states by *area* and the top three US states by *population*:

In [6]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

In [7]:
print(population, "\n")
print(area)

California    38332521
Texas         26448193
New York      19651127
Name: population, dtype: int64 

Alaska        1723337
Texas          695662
California     423967
Name: area, dtype: int64


Let's see what happens when we divide these to compute the population density:

In [8]:
population / area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

The resulting ``Series`` contains the *union* of the indices of the two input ``Series``, which can be determined using the set-like methods of the ``Index`` objects:

In [9]:
area.index.union(population.index)

Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')
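
The other set-like ``Index`` methods can be used in the same way to see which labels the two inputs share and which appear in only one of them. The following is a minimal sketch that rebuilds the two ``Series`` from above; the variable names ``both``, ``only_area`` and ``only_population`` are assumptions made for illustration.

```python
import pandas as pd

area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

# Labels present in both Series: only these can produce a numeric density
both = area.index.intersection(population.index)

# Labels present in exactly one of the two Series
only_area = area.index.difference(population.index)
only_population = population.index.difference(area.index)

print(sorted(both))
print(list(only_area))
print(list(only_population))
```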

Any item for which one or the other does not have an entry is marked with ``NaN``, or "Not a Number," which is how Pandas marks missing data (see further discussion of missing data in [Handling Missing Data](L24_Missing_Values.ipynb)).
This index matching applies to all of Python's built-in arithmetic operators; any missing values are filled in with NaN by default:

In [10]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64
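
One way to picture what happens here is to align the two ``Series`` explicitly before adding them. The sketch below uses ``Series.align()``, which by default reindexes both objects to the union of their indices. It is meant as an illustration of the idea rather than a description of the exact internal code path; ``A_aligned``, ``B_aligned`` and ``manual`` are illustrative names.

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# align() returns both Series reindexed to a common index (outer join by default);
# labels that were absent from one side are filled with NaN
A_aligned, B_aligned = A.align(B)
print(A_aligned)
print(B_aligned)

# Adding the aligned values position by position reproduces A + B
manual = pd.Series(A_aligned.to_numpy() + B_aligned.to_numpy(),
                   index=A_aligned.index)
print(manual.equals(A + B))
```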

If using NaN values is not the desired behavior, the fill value can be modified using appropriate object methods in place of the operators.
For example, calling ``A.add(B)`` is equivalent to calling ``A + B``, but allows optional explicit specification of the fill value for any elements in ``A`` or ``B`` that might be missing:

In [11]:
A.add(B, fill_value=0)

0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64
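
A detail worth keeping in mind is that ``fill_value`` substitutes for a value that is missing on *one* side only. When a label is missing (NaN) in both operands, the result at that label stays NaN. The sketch below illustrates this with two made-up ``Series``, ``left`` and ``right``, which are not part of the original example.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: label 'c' is NaN in both Series
left = pd.Series([1.0, np.nan, np.nan], index=['a', 'b', 'c'])
right = pd.Series([10.0, np.nan, 30.0], index=['b', 'c', 'd'])

result = left.add(right, fill_value=0)
print(result)

# 'a': present only in left            -> 1.0 + 0
# 'b': NaN in left, 10.0 in right      -> 0 + 10.0
# 'c': NaN in both                     -> stays NaN
# 'd': present only in right           -> 0 + 30.0
```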

### Index alignment in DataFrame

A similar type of alignment takes place for *both* columns and indices when performing operations on ``DataFrame``s:

In [12]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                 columns=list('AB'))
A

|   | A | B |
|---|---|---|
| 0 | 1 | 11 |
| 1 | 5 | 1 |


In [13]:
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                 columns=list('BAC'))
B

|   | B | A | C |
|---|---|---|---|
| 0 | 4 | 0 | 9 |
| 1 | 5 | 8 | 0 |
| 2 | 9 | 2 | 6 |


In [14]:
A + B

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 15.0 | NaN |
| 1 | 13.0 | 6.0 | NaN |
| 2 | NaN | NaN | NaN |


Notice that indices are aligned correctly irrespective of their order in the two objects, and indices in the result are sorted.
As was the case with ``Series``, we can use the associated object's arithmetic method and pass any desired ``fill_value`` to be used in place of missing entries.
Here we'll fill with the mean of all values in ``A`` (computed by first stacking the rows of ``A``):

In [15]:
A.stack().mean()

4.5

In [16]:
fill = A.stack().mean()
print(fill)
AB = A.add(B, fill_value=fill) # missing entries are filled with "fill"
AB

4.5


|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 15.0 | 13.5 |
| 1 | 13.0 | 6.0 | 4.5 |
| 2 | 6.5 | 13.5 | 10.5 |
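
The claim that alignment depends on labels rather than on column position can be checked directly. The following sketch rebuilds two small frames with the same values as ``A`` and ``B`` above (hard-coded here so the example is self-contained, under the illustrative names ``left`` and ``right``) and confirms that reordering the columns of one operand does not change the sum.

```python
import pandas as pd

# Same values as A and B above, written out explicitly
left = pd.DataFrame({'A': [1, 5], 'B': [11, 1]})
right = pd.DataFrame({'B': [4, 5, 9], 'A': [0, 8, 2], 'C': [9, 0, 6]})

total = left + right

# Shuffle the column order of one operand and add again
reordered_total = left + right[['C', 'A', 'B']]

print(total)
print(total.equals(reordered_total))
```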


The following table lists Python operators and their equivalent Pandas object methods:

| Python Operator | Pandas Method(s)                      |
|-----------------|---------------------------------------|
| ``+``           | ``add()``                             |
| ``-``           | ``sub()``, ``subtract()``             |
| ``*``           | ``mul()``, ``multiply()``             |
| ``/``           | ``truediv()``, ``div()``, ``divide()``|
| ``//``          | ``floordiv()``                        |
| ``%``           | ``mod()``                             |
| ``**``          | ``pow()``                             |
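
As a small sanity check of the table, the sketch below compares each operator with its method counterpart on two made-up ``Series``, ``x`` and ``y``. With no ``fill_value`` or ``axis`` given, each pair should produce identical results.

```python
import pandas as pd

# Hypothetical inputs sharing the same index
x = pd.Series([7, 8, 9], index=['a', 'b', 'c'])
y = pd.Series([2, 3, 4], index=['a', 'b', 'c'])

# Each entry pairs the operator result with the method result
pairs = {
    '+': (x + y, x.add(y)),
    '-': (x - y, x.sub(y)),
    '*': (x * y, x.mul(y)),
    '/': (x / y, x.div(y)),
    '//': (x // y, x.floordiv(y)),
    '%': (x % y, x.mod(y)),
    '**': (x ** y, x.pow(y)),
}

for symbol, (via_operator, via_method) in pairs.items():
    print(symbol, via_operator.equals(via_method))
```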


**Your turn.** Create a DataFrame ``D`` with a single column ``D`` with values from 3 to 5 indexed from 2 to 4. Then:
- Add the DataFrame ``D`` to the DataFrame ``AB`` using the ``+`` operator. Inspect the result and explain it.
- Repeat the same procedure but using the ``add()`` method. Inspect the result and explain it.

In [17]:
# write your code here



In [18]:
# write your code here



## UFuncs: Operations Between DataFrame and Series

When performing operations between a ``DataFrame`` and a ``Series``, the index and column alignment is similarly maintained.
Operations between a ``DataFrame`` and a ``Series`` are similar to operations between a two-dimensional and one-dimensional NumPy array.
Consider one common operation, where we find the difference of a two-dimensional array and one of its rows:

In [19]:
A = rng.randint(10, size=(3, 4))
A

array([[3, 8, 2, 4],
       [2, 6, 4, 8],
       [6, 1, 3, 8]])

In [20]:
A - A[0]

array([[ 0,  0,  0,  0],
       [-1, -2,  2,  4],
       [ 3, -7,  1,  4]])

According to NumPy's broadcasting rules (see [Computation on Arrays: Broadcasting](L15_Computation_on_Arrays_Broadcasting.ipynb)), subtraction between a two-dimensional array and one of its rows is applied row-wise.

In Pandas, the convention similarly operates row-wise by default:

In [21]:
df = pd.DataFrame(A, columns=list('QRST'))
print(df)
df - df.iloc[0]

   Q  R  S  T
0  3  8  2  4
1  2  6  4  8
2  6  1  3  8


|   | Q | R | S | T |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 1 | -1 | -2 | 2 | 4 |
| 2 | 3 | -7 | 1 | 4 |


If you would instead like to operate column-wise, you can use the object methods mentioned earlier, while specifying the ``axis`` keyword:

In [22]:
df.subtract(df['R'], axis=0)

|   | Q | R | S | T |
|---|---|---|---|---|
| 0 | -5 | 0 | -6 | -4 |
| 1 | -4 | 0 | -2 | 2 |
| 2 | 5 | 0 | 2 | 7 |
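
A common use of the ``axis`` keyword is centering data. The sketch below, which uses a made-up frame called ``scores`` with the same values as ``df``, subtracts the column means with the plain operator and the row means with ``sub(..., axis=0)``. The variable names are illustrative assumptions.

```python
import pandas as pd

scores = pd.DataFrame({'Q': [3, 2, 6], 'R': [8, 6, 1],
                       'S': [2, 4, 3], 'T': [4, 8, 8]})

# scores.mean() is a Series indexed by column label, so the default
# alignment on columns subtracts each column's mean from that column
col_centered = scores - scores.mean()

# scores.mean(axis=1) is a Series indexed by row label, so axis=0 is
# needed to match it against the row index instead of the columns
row_centered = scores.sub(scores.mean(axis=1), axis=0)

print(col_centered.mean())        # approximately zero for every column
print(row_centered.mean(axis=1))  # approximately zero for every row
```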


Note that these ``DataFrame``/``Series`` operations, like the operations discussed above, will automatically align indices between the two operands:

In [23]:
halfrow = df.iloc[0, ::2]  # first row, every second column
halfrow

Q    3
S    2
Name: 0, dtype: int64

In [24]:
df - halfrow

|   | Q | R | S | T |
|---|---|---|---|---|
| 0 | 0.0 | NaN | 0.0 | NaN |
| 1 | -1.0 | NaN | 2.0 | NaN |
| 2 | 3.0 | NaN | 1.0 | NaN |
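
For contrast, here is a sketch of the same subtraction attempted on raw NumPy arrays. A length-2 array cannot be broadcast against four columns, so NumPy raises an error, whereas Pandas matches on the column labels. The names ``values``, ``frame`` and ``numpy_error`` are illustrative and the array is hard-coded to the values of ``df`` above.

```python
import numpy as np
import pandas as pd

values = np.array([[3, 8, 2, 4],
                   [2, 6, 4, 8],
                   [6, 1, 3, 8]])

# NumPy: shapes (3, 4) and (2,) are incompatible, so this raises ValueError
try:
    values - values[0, ::2]
    numpy_error = None
except ValueError as err:
    numpy_error = str(err)
print(numpy_error)

# Pandas: the half row keeps its labels 'Q' and 'S', so subtraction is
# aligned on those columns and the unmatched columns become NaN
frame = pd.DataFrame(values, columns=list('QRST'))
halfrow = frame.iloc[0, ::2]
result = frame - halfrow
print(result)
```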


This preservation and alignment of indices and columns means that operations on data in Pandas maintain the data context, which helps prevent the kinds of errors that can come up when working with heterogeneous and/or misaligned data in raw NumPy arrays.

---

## Exercises

**Exercise 2.3.1** Create the following Pandas objects:

- A series `A` of 10 random integers from 0 to 9 and indexed from 0 to 9

In [25]:
# write your solution here



- A series `B` of 10 random integers from 0 to 9 and indexed from 5 to 14

In [26]:
# write your solution here



- Add series `A` and `B` and explain the result.

In [27]:
# write your solution here



- Add series ``A`` and ``B`` using the ``add()`` method with ``fill_value=0``, and explain the result.

In [28]:
# write your solution here



- Series `C` and `D` both consisting of 10 random integers from 10 to 19 and indexed from "a" to "j"

In [29]:
# write your solution here



- Combine series `C` and `D` into a two-column DataFrame ``CD`` with columns labelled "C" and "D"

In [30]:
# write your solution here



- Add a new column labelled "CD" with values being a product of columns "C" and "D"

In [31]:
# write your solution here



- Subtract 1 from all the even values in the ``CD`` DataFrame, so that ``CD`` would contain odd values only

In [32]:
# write your solution here



- Create a new DataFrame ``CD1`` consisting of rows of ``CD`` that have no repeating values. 

In [33]:
# write your solution here



- Create a new DataFrame ``CD2`` consisting of rows of ``CD`` that have repeating values. Then use the method ``drop(columns="D")`` to drop the "D" column.

In [34]:
# write your solution here



---

**Exercise 2.3.2** Create the following Pandas objects:

- Construct a DataFrame ``A`` out of a 10x10 NumPy array of random integers with columns labelled from "A" to "J" and rows labelled from "a" to "j"

In [35]:
# write your solution here



- Create a DataFrame ``B`` consisting of the rows of ``A`` whose means are greater than the overall mean of ``A``

In [36]:
# write your solution here



- Create a DataFrame ``C`` consisting of the columns of ``A`` whose medians are greater than the overall median of ``A``

In [37]:
# write your solution here



- Add ``B`` and ``C`` using the ``add()`` method with ``fill_value`` set to the overall median of ``C``, and explain the outcome.

In [38]:
# write your solution here



---

**Exercise 2.3.3** Create the following Pandas objects:

- Construct a DataFrame ``A`` out of a 10x10 NumPy array of random integers with rows and columns labelled from 1 to 10

In [39]:
# write your solution here



- Construct a DataFrame ``B`` consisting of the even-labelled rows of ``A``

In [40]:
# write your solution here



- Construct a DataFrame ``C`` consisting of the odd-labelled columns of ``A``

In [41]:
# write your solution here



- Subtract from ``A`` its first row. Then subtract from the resulting DataFrame its first column

In [42]:
# write your solution here



---

**Exercise 2.3.4** Consider the Wikipedia article [List of cities in the United Kingdom](https://en.wikipedia.org/wiki/List_of_cities_in_the_United_Kingdom).

- Choose 5 (or more) cities from the four nations of the UK (Scotland, Northern Ireland, Wales or England) and create the following Pandas Series from Python dictionaries:

 - ``Year`` consisting of cities and the year their city status was granted or confirmed
 
 - ``Nation`` consisting of cities and their nation
 
 - ``Population`` consisting of cities and their populations

In [43]:
# here's something to begin with

Year = pd.Series({"Durham":995,"Ely":1109,"Edinburgh":1329,"Leeds":1893,"Lisburn":2002})
                 
Nation = pd.Series({"Durham":"England","Ely":"England","Edinburgh":"Scotland","Leeds":"England",
                    "Lisburn":"Northern Ireland"})

Population = pd.Series({"Durham":94375,"Ely":20256,"Edinburgh":468720,"Leeds":751485,"Lisburn":120165})

- Combine the Pandas Series you have created into a single DataFrame ``UK``

In [44]:
# write your solution here



- Find the total population of each nation in the UK and the overall population of the UK

 Hint: use ``UK["Population"].groupby(UK['Nation']).sum()``. The ``groupby()`` method is discussed in [Section 2.8](L28_Aggregation_and_Grouping.ipynb) of these notebooks.

In [45]:
# write your solution here



- Assuming that cities had 0 inhabitants when they were founded (use the year stored in ``Year``), find the average annual increase in population for each city. Save the result to a new column ``Annual Growth``

In [46]:
# write your solution here



- Select the following slices of the UK DataFrame:

 - Cities in Scotland only
 
 - Cities in England that were founded after 1500
 
 - Cities in England that are more populous than the overall average population of cities in the other three nations of the UK

In [47]:
# write your solution here



In [48]:
# write your solution here



In [49]:
# write your solution here



---



*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; also available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*