<h1>Essential Functionality</h1>

This notebook will walk you through the fundamental mechanics of interacting with the data contained in a Series or DataFrame.

<h2>Reindexing</h2>

An important method on pandas objects is `reindex`, which creates a new object with the values rearranged to align with the new index. Consider an example:

In [112]:
import numpy as np 
import pandas as pd

In [113]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

Calling `reindex` on this Series rearranges the data according to the new index, introducing missing values if any index values were not already present:

In [114]:
obj2 = obj.reindex(["a", "b", "c", "d", "e"])

obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

For ordered data like time series, you may want to interpolate or fill values when reindexing. The `method` option allows us to do this, using a method such as `ffill`, which forward-fills the values:

In [115]:
obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

obj3

0      blue
2    purple
4    yellow
dtype: object

In [116]:
obj4 = obj3.reindex(np.arange(6), method="ffill")

obj4

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
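
As a rough sketch of how the filling options compare, the cell below reindexes the same Series three ways. It assumes the `method="bfill"` and `fill_value` arguments of `reindex`, which the text above does not cover; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Same sparse Series as in the example above
obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
new_index = np.arange(6)

# Forward fill: each new label takes the value of the closest earlier label
forward = obj3.reindex(new_index, method="ffill")

# Backward fill: each new label takes the value of the closest later label;
# label 5 has no later label, so it stays missing
backward = obj3.reindex(new_index, method="bfill")

# fill_value: every newly introduced label gets the same constant
constant = obj3.reindex(new_index, fill_value="unknown")

print(pd.DataFrame({"ffill": forward, "bfill": backward, "fill_value": constant}))
```

Note that `method` requires the original index to be sorted, which is why it is mostly useful for ordered data.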

With DataFrame, `reindex` can alter the (row) index, the columns, or both. When passed only a sequence, it reindexes the rows in the result:

In [117]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                    index=["a", "c", "d"],
                    columns=["Ohio", "Texas", "California"])

frame

   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8


In [118]:
frame2 = frame.reindex(index=["a", "b", "c", "d"])

frame2

   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0


In [119]:
states = ["Texas", "Utah", "California"]

frame.reindex(columns=states)

   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8


Because `"Ohio"` was not in `states`, the data for that column is dropped from the result.
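
When every requested label already exists, `loc` can do the same reordering. The following is a minimal sketch of that alternative, not a replacement for `reindex`: `loc` raises a `KeyError` for a missing label, where `reindex` would insert missing data.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])

# Reorder the rows and keep two existing columns in one step
subset = frame.loc[["a", "d", "c"], ["California", "Texas"]]
print(subset)

# "b" is not a row label: loc refuses, reindex fills the row with NaN
try:
    frame.loc[["a", "b"]]
except KeyError as exc:
    print("KeyError:", exc)

with_missing = frame.reindex(["a", "b"])
print(with_missing)
```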

<h2>Dropping Entries from an Axis</h2>

Dropping one or more entries from an axis is simple if you already have an index array or list without those entries, since you can use the `reindex` method or `.loc`-based indexing. As that can require a bit of munging and set logic, the `drop` method will return a new object with the indicated value or values deleted from an axis:

In [120]:
obj = pd.Series(np.arange(5.), index=["a", "b", "c", "d", "e"])

obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [121]:
new_obj = obj.drop(["c"])

new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

With DataFrame, index values can be deleted from either axis. To illustrate this, we first create an example DataFrame:

In [122]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
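
Dropping from the row axis is not shown for this DataFrame, so here is a small sketch of it. It relies on `drop` treating a bare sequence of labels as row labels (axis 0) and on the `index` keyword, which is the row counterpart of `columns`.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# A bare sequence of labels is interpreted as row labels (axis 0)
dropped_default = data.drop(["Colorado", "Ohio"])

# The index keyword says the same thing explicitly
dropped_keyword = data.drop(index=["Colorado", "Ohio"])

print(dropped_default)

# drop returns a new object, so data still has all four rows
print(data.shape)
```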


To drop labels from the columns rather than from the rows (the default), use the `columns` keyword:

In [123]:
data.drop(columns=["two"])

          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15


You can also drop values from the columns by passing `axis=1` (which is like NumPy) or `axis="columns"`:

In [124]:
data.drop("two", axis=1)

data.drop(["two", "four"], axis="columns")

          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14


<h2>Indexing, Selection, and Filtering</h2>

Series indexing (`obj[...]`) works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers. Here are some examples of this:

In [125]:
obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])

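# obj["b"]               # single label
# obj[2]                 # integer position (this fallback is deprecated in pandas 2.1+; prefer obj.iloc[2])
# obj[2:4]               # positional slice
# obj[["b", "a", "d"]]   # list of labels
# obj[[2, 3]]            # list of positions (same deprecation as obj[2])
# obj[obj < 2]           # Boolean mask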

While you can select data by label this way, the preferred way to select index values is with the special `loc` operator:

In [126]:
obj.loc[["b", "a", "d"]]

b    1.0
a    0.0
d    3.0
dtype: float64
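
The selections listed earlier can also be written with explicit indexers, which removes any doubt about whether a key is a label or a position. This is a minimal sketch using only `loc` and `iloc`; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])

one_label = obj.loc["b"]                    # single label
several_labels = obj.loc[["b", "a", "d"]]   # list of labels, in the order given
one_position = obj.iloc[2]                  # single integer position
position_slice = obj.iloc[2:4]              # positions 2 and 3 (end excluded)
masked = obj.loc[obj < 2]                   # Boolean mask

print(one_label, one_position)
print(several_labels)
print(position_slice)
print(masked)
```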

The reason to prefer `loc` is the different treatment of integers when indexing with `[]`. Regular `[]`-based indexing will treat integers as labels if the index contains integers, so the behavior differs depending on the data type of the index. For example:

In [127]:
obj1 = pd.Series([1, 2, 3], index=[2, 0, 1])

obj1[[0, 1, 2]]

0    2
1    3
2    1
dtype: int64

In [128]:
obj2 = pd.Series([1, 2, 3], index=["a", "b", "c"])

obj2[[0, 1, 2]]

a    1
b    2
c    3
dtype: int64

In [129]:
obj2.loc[['a', 'c']]

a    1
c    3
dtype: int64

<p><b>Note:</b> When using loc, the expression obj2.loc[[0, 1, 2]] will fail with a KeyError because the index of obj2 does not contain integers.</p>

Since the `loc` operator indexes exclusively with labels, there is also an `iloc` operator that indexes exclusively with integer positions, so it works consistently whether or not the index contains integers:

In [130]:
obj1.iloc[[0, 1, 2]]

# obj2.iloc[[0, 1, 2]]

2    1
0    2
1    3
dtype: int64

In [131]:
obj1.loc[[0, 1, 2]]

0    2
1    3
2    1
dtype: int64

<b>Note!</b> You can also slice with labels, but it works differently from normal Python slicing in that the endpoint is inclusive:

In [132]:
obj2.loc["b":"c"]

b    2
c    3
dtype: int64

Assigning values using these methods modifies the corresponding section of the Series:

In [133]:
obj2.loc["b":"c"] = 5

obj2

a    1
b    5
c    5
dtype: int64
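
To make the inclusive endpoint concrete, the sketch below compares a label slice with the positional slice that starts at the same element. It rebuilds a fresh copy of the small Series so that it does not depend on the assignment above.

```python
import pandas as pd

obj2 = pd.Series([1, 2, 3], index=["a", "b", "c"])

# Label slice: both endpoints are included
label_slice = obj2.loc["b":"c"]

# Positional slice: the end position is excluded, as in plain Python
position_slice = obj2.iloc[1:2]

print(label_slice.index.tolist())
print(position_slice.index.tolist())
```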

Indexing into a DataFrame retrieves one or more columns either with a single value or sequence:

In [134]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"])

data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [135]:
data["two"]

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

In [136]:
data[["three", "one"]]

          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12



Indexing like this has a few special cases. The first is slicing or selecting data with a Boolean array:

In [137]:
data[data["three"] > 5]

          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


Another use case is indexing with a Boolean DataFrame, such as one produced by a scalar comparison. The comparison below yields a DataFrame of Boolean values, and the cell after it uses that result to assign 0 to every location where the value is `True`:

In [138]:
data < 5

            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False


In [139]:
data[data < 5] = 0

data

          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


<h3>Selection on DataFrame with loc and iloc</h3>

Like Series, DataFrame has special attributes `loc` and `iloc` for label-based and integer-based indexing, respectively. Since DataFrame is two-dimensional, you can select a subset of the rows and columns with NumPy-like notation using either axis labels (`loc`) or integers (`iloc`).

In [140]:
data.loc[["Colorado", "New York"]]

          one  two  three  four
Colorado    0    5      6     7
New York   12   13     14    15


You can combine both row and column selection in `loc` by separating the selections with a comma:

In [141]:
data.loc[["Colorado"], ["two", "three"]]

          two  three
Colorado    5      6


We can perform similar selections with integers using `iloc`:

In [142]:
# data.iloc[[2, 1]]

print(type(data.iloc[2, [3, 0, 1]]))

data.iloc[2, [3, 0, 1]]

<class 'pandas.core.series.Series'>


four    11
one      8
two      9
Name: Utah, dtype: int32

In [143]:
print(type(data.iloc[[1, 2], [3, 0, 1]]))

data.iloc[[1, 2], [3, 0, 1]]

<class 'pandas.core.frame.DataFrame'>


          four  one  two
Colorado     7    0    5
Utah        11    8    9


Both indexers work with slices in addition to single labels or lists of labels:

In [144]:
data.loc[:"Utah", "two"]

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int32

In [145]:
data.iloc[:, :3][data.three > 5]

          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14


Boolean Series can be used with `loc` but not directly with `iloc`, which accepts only a plain Boolean array (for example, the result of `.to_numpy()`):

In [146]:
data.loc[data.three >= 2]

          one  two  three  four
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
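
The difference between the two indexers on Boolean input can be sketched as follows. This assumes a freshly built copy of the example DataFrame; the exact exception type raised by `iloc` for a Boolean Series is treated as an implementation detail, so the sketch catches both likely types.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

mask = data["three"] > 5          # Boolean Series aligned on the row labels

by_label = data.loc[mask]                 # loc accepts the Series directly
by_position = data.iloc[mask.to_numpy()]  # iloc needs a plain Boolean array

# Passing the labeled Series itself to iloc is rejected
try:
    data.iloc[mask]
except (ValueError, NotImplementedError) as exc:
    print(type(exc).__name__, exc)

print(by_label.equals(by_position))
```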


<h3>Integer indexing pitfalls</h3>

Working with pandas objects indexed by integers can be a stumbling block for new users since they work differently from built-in Python data structures like lists and tuples. For example, you might not expect the following code to generate an error:

In [147]:
ser = pd.Series(np.arange(3.))

# Error
# ser[-1]

On the other hand, with a noninteger index, there is no such ambiguity:

In [148]:
ser2 = pd.Series(np.arange(3.), index=["a", "b", "c"])

ser2[-1]

2.0

If you have an axis index containing integers, data selection will always be label oriented. As noted above, if you use `loc` (for labels) or `iloc` (for integer positions), you will get exactly what you want:

In [149]:
ser.iloc[-1]

2.0
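
A shuffled integer index makes the distinction easy to see. The following sketch uses a hypothetical Series whose labels are deliberately out of order, so label `0` and position `0` refer to different elements.

```python
import pandas as pd

# Labels 2, 0, 1 sit at positions 0, 1, 2
ser = pd.Series([10.0, 20.0, 30.0], index=[2, 0, 1])

by_label = ser.loc[0]       # the element whose label is 0
by_position = ser.iloc[0]   # the first element
last = ser.iloc[-1]         # negative positions count from the end

# There is no label -1, so label-based lookup fails
try:
    ser.loc[-1]
except KeyError as exc:
    print("KeyError:", exc)

print(by_label, by_position, last)
```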

<h3>Pitfalls with chained indexing</h3>

Previously we looked at how you can do flexible selections on a DataFrame using `loc` and `iloc`. These indexing attributes can also be used to modify DataFrame objects in place, but doing so requires some care.

In [150]:
data.loc[:, "one"] = 1

data

          one  two  three  four
Ohio        1    0      0     0
Colorado    1    5      6     7
Utah        1    9     10    11
New York    1   13     14    15


In [151]:
data.iloc[2] = 5

data

          one  two  three  four
Ohio        1    0      0     0
Colorado    1    5      6     7
Utah        5    5      5     5
New York    1   13     14    15


In [152]:
data.loc[data["four"] > 5] = 3

data

          one  two  three  four
Ohio        1    0      0     0
Colorado    3    3      3     3
Utah        5    5      5     5
New York    3    3      3     3


A common gotcha for new pandas users is to chain selections when assigning, like this:

In [153]:
data.loc[data.three == 5]["four","one"] = 6

ipython-input-11-0ed1cf2155d5:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data.loc[data.three == 5]["four","one"] = 6

In [154]:
data

          one  two  three  four
Ohio        1    0      0     0
Colorado    3    3      3     3
Utah        5    5      5     5
New York    3    3      3     3


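As the unchanged output shows, the chained assignment wrote to a temporary copy rather than to `data`. In these scenarios, the fix is to rewrite the chained assignment to use a single `loc` operation: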

In [155]:
data.loc[data.three == 5, ["one","three"]] = 6

data

          one  two  three  four
Ohio        1    0      0     0
Colorado    3    3      3     3
Utah        6    5      6     5
New York    3    3      3     3
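
A related pattern, sketched here under the assumption that you sometimes do want an independent subset, is to take an explicit `.copy()`. The two-column DataFrame below is a hypothetical stand-in for the example data.

```python
import pandas as pd

data = pd.DataFrame({"one": [1, 3, 5, 3], "three": [0, 3, 5, 3]},
                    index=["Ohio", "Colorado", "Utah", "New York"])
mask = data["three"] == 5

# One indexing operation on data itself: the original is modified
data.loc[mask, "one"] = 6

# An explicit copy is an independent object: changes stay in the copy
subset = data.loc[mask].copy()
subset["three"] = 0

print(data)
print(subset)
```

Writing the intent out this way (either one `loc` call on the original, or an explicit copy) avoids relying on whether an intermediate selection happens to be a view.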


<h2>Arithmetic and Data Alignment</h2>

pandas can make it much simpler to work with objects that have different indexes. For example, when you add objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. Let’s look at an example. In the case of DataFrame, alignment is performed on both rows and columns:

In [41]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list("bcd"),
                        index=["Ohio", "Texas", "Colorado"])

df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"),
                        index=["Utah", "Ohio", "Texas", "Oregon"])

df1 + df2

            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN


Since the `"c"` and `"e"` columns are not found in both DataFrame objects, they appear as missing in the result. The same holds for the rows with labels that are not common to both objects.
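
The same union-of-labels rule applies to Series. Here is a minimal sketch with two hypothetical Series that share only the labels `"b"` and `"c"`.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# The result index is the union of both indexes; labels present in
# only one operand produce missing values
total = s1 + s2
print(total)
```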

<h3>Arithmetic methods with fill values</h3>

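In arithmetic operations between differently indexed objects, you might want to fill with a special value, like 0, when an axis label is found in one object but not the other. Here is an example where we introduce NA (null) values by assigning `np.nan`. Note that `df2` has no row labeled `1`, so this `loc` assignment does not overwrite an existing cell; it appends a new row labeled `1` whose values are all NA: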

In [42]:
df2.loc[1, "b"] = np.nan

df2

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
1       NaN   NaN   NaN


Adding `df1` and `df2` results in missing values in the locations that don’t overlap:

In [43]:
df1 + df2

            b   c     d   e
1         NaN NaN   NaN NaN
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN


Using the `add` method on `df1`, I pass `df2` and an argument to `fill_value`, which substitutes the passed value for any missing values in the operation (locations that are missing from both objects, such as row `1`, remain missing):

In [44]:
df1.add(df2, fill_value=0)

            b    c     d     e
1         NaN  NaN   NaN   NaN
Colorado  6.0  7.0   8.0   NaN
Ohio      3.0  1.0   6.0   5.0
Oregon    9.0  NaN  10.0  11.0
Texas     9.0  4.0  12.0   8.0
Utah      0.0  NaN   1.0   2.0
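
The rule behind this output is that `fill_value` is used only when exactly one side is missing. The sketch below isolates that rule with two small hypothetical DataFrames.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"x": [1.0, 2.0]}, index=["r1", "r2"])
right = pd.DataFrame({"x": [10.0, np.nan], "y": [5.0, 6.0]},
                     index=["r2", "r3"])

# Plain addition: any location missing on either side becomes NaN
plain = left + right

# With fill_value=0, a location missing on one side is treated as 0;
# a location missing on both sides (x at r3, y at r1) stays NaN
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```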


Series and DataFrame have methods for arithmetic, such as `add`, `sub`, `mul`, and `div`. Each has a counterpart, starting with the letter `r`, that has its arguments reversed. So these two statements are equivalent:

In [45]:
# 1 / df1

df1.rdiv(1)

                 b         c      d
Ohio           inf  1.000000  0.500
Texas     0.333333  0.250000  0.200
Colorado  0.166667  0.142857  0.125
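
One way to check the equivalence is to compare the operator form with the reversed method directly. This sketch uses a small hypothetical DataFrame without zeros and also shows `rsub`, the reversed counterpart of `sub`.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(1., 7.).reshape((2, 3)),
                   columns=list("bcd"),
                   index=["Ohio", "Texas"])

# scalar / frame  is the same as  frame.rdiv(scalar)
reciprocal_operator = 1 / df1
reciprocal_method = df1.rdiv(1)

# scalar - frame  is the same as  frame.rsub(scalar)
ten_minus_operator = 10 - df1
ten_minus_method = df1.rsub(10)

print(reciprocal_operator.equals(reciprocal_method))
print(ten_minus_operator.equals(ten_minus_method))
```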


<h3>Operations between DataFrame and Series</h3>

As with NumPy arrays of different dimensions, arithmetic between DataFrame and Series is also defined. First, as a motivating example, consider the difference between a two-dimensional array and one of its rows:

In [46]:
arr = np.arange(12.).reshape((3, 4))

arr

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

In [47]:
arr - arr[0]

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])
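
Subtracting `arr[0]` from `arr` performs the subtraction once for each row, which NumPy calls broadcasting. The pandas counterpart is not shown in this notebook, so the following is a sketch of it with a hypothetical DataFrame: by default the Series index is matched against the DataFrame's columns and broadcast down the rows, and `sub` with `axis="index"` matches on the row labels instead.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

first_row = frame.iloc[0]   # Series indexed by the column labels b, d, e

# Default: align the Series on the columns, repeat for every row
row_diff = frame - first_row

# axis="index": align the Series on the row labels, repeat for every column
col_diff = frame.sub(frame["d"], axis="index")

print(row_diff)
print(col_diff)
```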

<h2>Function Application and Mapping</h2>

NumPy ufuncs (element-wise array methods) also work with pandas objects:

In [156]:
frame = pd.DataFrame(np.random.standard_normal((4, 3)),
                        columns=list("bde"),
                        index=["Utah", "Ohio", "Texas", "Oregon"])

frame

               b         d         e
Utah    0.724912  0.199858 -1.460481
Ohio   -0.477962  0.890924  0.242234
Texas   1.697963  0.306747  0.580094
Oregon  0.331560  1.960998 -1.242044


In [157]:
# frame
np.abs(frame)

               b         d         e
Utah    0.724912  0.199858  1.460481
Ohio    0.477962  0.890924  0.242234
Texas   1.697963  0.306747  0.580094
Oregon  0.331560  1.960998  1.242044


Another frequent operation is applying a function on one-dimensional arrays to each column or row. DataFrame’s `apply` method does exactly this:

In [160]:
def f1(x):
    return x.max() - x.min()

frame.apply(f1)

b    2.175925
d    1.761140
e    2.040574
dtype: float64

Here the function `f1`, which computes the difference between the maximum and minimum of a Series, is invoked once on each column in `frame`. The result is a Series having the columns of `frame` as its index.

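If you pass `axis="columns"` to `apply`, the function will be invoked once per row instead. A helpful way to think about this is as "apply across the columns". (The outputs of the next three cells come from a different random draw of `frame` than the one displayed above, so their values do not match that table.)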

In [50]:
frame.apply(f1, axis="columns")

Utah      1.719964
Ohio      0.822967
Texas     1.067682
Oregon    3.211249
dtype: float64
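
The function passed to `apply` does not have to return a scalar. As a sketch, with a hypothetical helper named `min_max` and a small fixed DataFrame so the numbers are reproducible, returning a Series yields a DataFrame with one row per returned value.

```python
import pandas as pd

frame = pd.DataFrame([[1.0, -2.0, 3.0],
                      [4.0, 0.0, -1.0]],
                     columns=list("bde"),
                     index=["Utah", "Ohio"])

def min_max(x):
    # x is one column (or one row, with axis="columns") as a Series
    return pd.Series([x.min(), x.max()], index=["min", "max"])

# One result column per original column, one row per returned label
summary = frame.apply(min_max)

# Scalar-returning function applied across the columns: one value per row
row_range = frame.apply(lambda x: x.max() - x.min(), axis="columns")

print(summary)
print(row_range)
```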

Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating-point value in `frame`. You can do this with `applymap` (deprecated since pandas 2.1 in favor of the equivalent `DataFrame.map`):

In [51]:
def my_format(x):
    return f"{x:.2f}"

frame.applymap(my_format)

            b      d      e
Utah     1.30  -0.32  -0.42
Ohio    -0.49  -0.67   0.15
Texas   -0.29  -1.28  -0.22
Oregon  -2.68   0.53   0.42


The reason for the name `applymap` is that Series has a `map` method for applying an element-wise function:

In [52]:
frame["e"].map(my_format)

Utah      -0.42
Ohio       0.15
Texas     -0.22
Oregon     0.42
Name: e, dtype: object
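
Because the element-wise DataFrame method was renamed in pandas 2.1, code that must run on several pandas versions can pick whichever name is available. This is one possible sketch of that, using `hasattr` as the version check and a small fixed DataFrame; it is not the only way to handle the rename.

```python
import pandas as pd

frame = pd.DataFrame([[1.2345, -0.5],
                      [2.0, 3.14159]],
                     columns=["b", "d"],
                     index=["Utah", "Ohio"])

def my_format(x):
    return f"{x:.2f}"

# DataFrame.map exists in pandas 2.1+; older versions only have applymap
elementwise = frame.map if hasattr(frame, "map") else frame.applymap
formatted = elementwise(my_format)

# Series.map has the same name in all versions
formatted_column = frame["d"].map(my_format)

print(formatted)
print(formatted_column)
```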

<h2>Sorting and Ranking</h2>

Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column label, use the `sort_index` method, which returns a new, sorted object:

In [53]:
obj = pd.Series(np.arange(4), index=["d", "a", "b", "c"])

obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int32

With a DataFrame, you can sort by index on either axis:

In [54]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                    index=["three", "one"],
                    columns=["d", "a", "b", "c"])

# frame.sort_index()

frame.sort_index(axis="columns")

       a  b  c  d
three  1  2  3  0
one    5  6  7  4


The data is sorted in ascending order by default but can be sorted in descending order, too:

In [55]:
frame.sort_index(axis="columns", ascending=False)

       d  c  b  a
three  0  3  2  1
one    4  7  6  5


To sort a Series by its values, use its `sort_values` method:

In [56]:
obj = pd.Series([4, 7, -3, 2])

obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

<p><b>Note:</b> Any missing values are sorted to the end of the Series by default; passing na_position="first" places them at the start instead.</p>

In [57]:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

# obj.sort_values()

obj.sort_values(na_position="first")

1    NaN
3    NaN
4   -3.0
5    2.0
0    4.0
2    7.0
dtype: float64

When sorting a DataFrame, you can use the data in one or more columns as the sort keys. To do so, pass one or more column names to `sort_values`:

In [58]:
frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# frame.sort_values("b")

frame.sort_values(["a", "b"])

   b  a
2 -3  0
0  4  0
3  2  1
1  7  1
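
When sorting by several columns, the direction can be set per key. The sketch below assumes the list form of the `ascending` argument, which is not used elsewhere in this notebook, and sorts the same DataFrame by `a` ascending and then `b` descending.

```python
import pandas as pd

frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# One flag per sort key: a ascending, then b descending within each a
mixed = frame.sort_values(["a", "b"], ascending=[True, False])
print(mixed)
```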


<h2>Axis Indexes with Duplicate Labels</h2>

Up until now almost all of the examples we have looked at have unique axis labels (index values). While many pandas functions (like `reindex`) require that the labels be unique, it’s not mandatory. Let’s consider a small Series with duplicate indices:

In [59]:
obj = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])

obj.index.is_unique

False

In [60]:
obj

a    0
a    1
b    2
b    3
c    4
dtype: int32

Data selection is one of the main things that behaves differently with duplicates. Indexing a label with multiple entries returns a Series, while a label with a single entry (here, `"c"`) returns a scalar value:

In [61]:
obj["a"]

a    0
a    1
dtype: int32

In [62]:
obj["b"]

b    2
b    3
dtype: int32
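
The scalar case, and the analogous behavior for DataFrame rows, can be sketched as follows. The small DataFrame is a hypothetical example with a repeated row label.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])

repeated = obj["a"]   # label appears twice: a Series with two entries
single = obj["c"]     # label appears once: a scalar

df = pd.DataFrame(np.arange(6).reshape((3, 2)),
                  index=["a", "a", "b"],
                  columns=["x", "y"])

rows_a = df.loc["a"]  # repeated row label: a DataFrame with two rows
row_b = df.loc["b"]   # unique row label: a Series (one row)

print(type(repeated).__name__, single)
print(type(rows_a).__name__, type(row_b).__name__)
```

Since the return type depends on the data, code that indexes by label may need to check `index.is_unique` first.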