# 5.2 Essential Functionality

1. [Reindexing](#reindexing)
2. [Dropping Entries From an Axis](#dropping)
3. [Indexing, Selection, and Filtering](#indexing)
4. [Arithmetic and Data Alignment](#arithmetic)
5. [Function Application and Mapping](#function)
6. [Sorting and Ranking](#sorting)
7. [Axis Indexes with Duplicate Labels](#duplicates)

<a name="reindexing"></a>
# Reindexing

`reindex` is a **method** of both `Series` and `DataFrame`.

Create a new object with the values rearranged to align with the new index.  

If the new index has values not in the original object, they'll be added as missing values. 

If the new index omits values that are in the original object, they'll be removed.


In [2]:
import numpy as np
import pandas as pd

In [3]:
### Create a Series
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [4]:
### Re-index
obj2 = obj.reindex(["a", "b", "c", "d", "e"])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [10]:
### Omit an index
obj2.reindex(["a", "d", "e"])

a   -5.3
d    4.5
e    NaN
dtype: float64

When re-indexing a Series/DataFrame into a larger one, you might want to interpolate (fill-in) values. One such method is `ffill` or forward-fill. 

In the example below, the Series starts with indices 0, 2, and 4. When I re-index, I give it the indices 0 through 5. Without forward fill, 1, 3, and 5 would be NAs. With forward fill, the value at 0 is copied into 1, the value at 2 into 3, etc.

In [5]:
# Series with non-continuous indices
obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
obj3

0      blue
2    purple
4    yellow
dtype: object

In [6]:
# Re-index and fill
obj3.reindex(np.arange(6), method="ffill")

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
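
Forward fill is not the only option. As a minimal sketch (the variable names `colors` and `back_filled` are mine, not from the book), `method="bfill"` fills each new label from the *next* existing label instead. A new label that falls after the last existing one has nothing to copy from, so it stays missing.

```python
import numpy as np
import pandas as pd

# Same non-continuous Series as above
colors = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# Backward fill: label 1 takes the value at 2, label 3 takes the value at 4.
# Label 5 has no later label to copy from, so it remains missing.
back_filled = colors.reindex(np.arange(6), method="bfill")
print(back_filled)
```

Note that the fill methods assume the original index is sorted (monotonic), which is the case here.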

DataFrames can be reindexed on the row index, columns or both.  

Default is index only (but can specify with the `index` argument for clarity). Need the `columns` argument for columns. Provide both to do both.  


In [7]:
# Build a DataFrame
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [8]:
# Re-index the rows (and add an empty)
frame2 = frame.reindex(index=["a", "b", "c", "d"])
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [9]:
# Omitting `index` gives same result:
frame.reindex(["a", "b", "c", "d"])

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [12]:
# Re-index the columns - method 1
# Note how Ohio is omitted and Utah is added
frame.reindex(columns=["Texas", "Utah", "California"])

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


In [13]:
# Re-index columns - method 2
frame.reindex(["Texas", "Utah", "California"], axis="columns")

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


In [14]:
# Either of the above can be done with a previously-assigned variable
states=["Texas", "Utah", "California"]
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


The `loc` operator can also be used to reindex (more detail in the indexing section below). Seems like this makes it function more like R's data.frame, where it's `df[rows, columns]`.

`loc` can only select labels that already exist! It can't add NA rows/columns like `reindex` does.

In [None]:
frame.loc[["a", "d", "c"], ["California", "Texas"]]
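
To make the difference concrete, here is a small sketch (my own example, reusing the `frame` layout from above) that asks both `reindex` and `loc` for a row label, `"b"`, that doesn't exist. `reindex` adds an all-NA row, while `loc` raises a `KeyError`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])

# reindex tolerates the unknown label "b" and fills the new row with NaN
reindexed = frame.reindex(["a", "b", "c"])

# loc with a list containing an unknown label raises KeyError
try:
    frame.loc[["a", "b", "c"]]
    loc_failed = False
except KeyError:
    loc_failed = True

print(reindexed)
print("loc raised KeyError:", loc_failed)
```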

<img src="./myImages/table5.3_reindexArgs.png" width = 600>

<a name="dropping"></a>
# Dropping Entries from an Axis

As shown above, `reindex` and `loc` can be used to drop values from a Series/DataFrame, but you can also use the `drop` method.  

In DataFrames, rows can be dropped with the `index` argument and columns with the `columns` argument.  

In [15]:
# Make a Series
obj = pd.Series(np.arange(5.), index=["a", "b", "c", "d", "e"])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [16]:
# Drop one index
new_obj = obj.drop("c")
new_obj


a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [17]:
# Drop 2 indices with a list
obj.drop(["d", "c"])

a    0.0
b    1.0
e    4.0
dtype: float64

In [18]:
# Make a DataFrame
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [None]:
# Drop rows
data.drop(index=["Colorado", "Ohio"])

In [None]:
# Drop columns
data.drop(columns=["two"])

In [None]:
# Drop using a NumPy-style integer axis (1 = columns)
data.drop("two", axis=1)

In [None]:
# Drop using the named axis ("columns")
data.drop(["two", "four"], axis="columns")
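
The cells above drop rows and columns separately. As a sketch (the name `trimmed` is just for illustration), `index` and `columns` can also be passed in the same call. `drop` returns a new object, so the original `data` is left untouched.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# Drop two rows and one column in a single call
trimmed = data.drop(index=["Colorado", "Ohio"], columns=["two"])

print(trimmed)
print(data.shape)  # the original DataFrame still has all rows and columns
```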

<a name="indexing"></a>
# Indexing, Selection, and Filtering

You can index (select) a Series the same way you would a NumPy array `obj[...]` with the index values themselves (or the integer placements).  

**But it's better to use the `loc` operator instead.**  

If you provide integers to `obj[...]`, they are treated as labels when the index is of integer type. When the index is not integer-typed, they have historically fallen back to positions (0, 1, 2, etc.). See below:

In fact, recent pandas versions emit a warning for that positional fallback and recommend `iloc` instead!

In [None]:
# Make a series
obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])
obj

In [None]:
# Index variously using the NumPy array method
print(f"The index label 'b' returns the associated value: obj['b'] = {obj['b']}")
print("\n")
print(f"The value 1 (referring to the second index) also returns this value: obj[1] = {obj[1]}")
print("\n")
print(f"Slicing with integers (remember it's [inclusive:exclusive]): obj[2:4] = \n{obj[2:4]}")
print("\n")
print(f"Selecting with a list of labels: obj[['b', 'a', 'd']] = \n{obj[['b', 'a', 'd']]}")
print("\n")
print(f"Providing a list of integers: obj[[1,3]] = \n{obj[[1,3]]}")
print("\n")
print(f"Can even use expressions: obj[obj < 2] = \n{obj[obj < 2]}")
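
A sketch of the same selections written with explicit `loc` (labels) and `iloc` (positions), which avoids relying on how `obj[...]` interprets integers. The variable names are mine.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])

by_label = obj.loc["b"]            # label lookup
by_position = obj.iloc[1]          # positional lookup (second element)
position_slice = obj.iloc[2:4]     # positional slice, end-exclusive
by_positions = obj.iloc[[1, 3]]    # list of positions
by_mask = obj.loc[obj < 2]         # boolean mask

print(by_label, by_position)
print(position_slice)
print(by_positions)
print(by_mask)
```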

In [29]:
# New series
obj1 = pd.Series([1, 2, 3], index=[2, 0, 1])
obj2 = pd.Series([1, 2, 3], index=["a", "b", "c"])
print(obj1)
print("\n")
print(obj2)

2    1
0    2
1    3
dtype: int64


a    1
b    2
c    3
dtype: int64


In [34]:
# Object 1 has integer indices:
obj1[[0, 1, 2]]

0    2
1    3
2    1
dtype: int64

In [None]:
# Object 2 does not
obj2[[0, 1, 2]]

The `loc` method indexes exclusively with labels:

In [None]:
# Object 2 with the loc method will fail if given integers b/c they don't exist as labels
obj2.loc[[0, 1]]

In [33]:
obj2.loc[["c", "b"]]

c    3
b    2
dtype: int64

The `iloc` method indexes exclusively with integer positions, so it works consistently whether or not the index labels are numbers.  

In the above example, `obj[[0, 1, 2]]` reordered `obj1` using its labels and selected from `obj2` using integer positions.  

Below, using the `iloc` method, `obj.iloc[[0, 1, 2]]` selects from both `obj1` and `obj2` by integer position:

In [35]:
obj1.iloc[[0, 1, 2]]

2    1
0    2
1    3
dtype: int64

In [36]:
obj2.iloc[[0, 1, 2]]

a    1
b    2
c    3
dtype: int64

You can slice using the labels as well, but the endpoint is inclusive (which kinda makes sense to me... it's harder to remember what the preceding label is than to know that the preceding numeric index is just 1 less).

In [37]:
obj2.loc["a":"c"]

a    1
b    2
c    3
dtype: int64

Final note for Series indexing with `loc` and `iloc` is that you can assign in place with them:

In [39]:
obj2.loc["b":"c"] = 5
obj2

a    1
b    5
c    5
dtype: int64

Indexing on a DataFrame without `loc`/`iloc` (i.e. with `[]`)

Can provide a column label as a single value or a sequence of multiple labels.  

**You can only use the labels! You can't use numeric indices the way you might for the rows**

There are a few special cases:

1. If you slice, you subset rows instead of columns
1. If you provide a boolean array, you get rows instead of columns

In [51]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [41]:
# Grab a single column with column label
data["two"]

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [42]:
# Grab multiple columns
data[["three", "one"]]

Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


In [45]:
# Grab rows using slicing
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [46]:
# Grab rows using a comparison
data[data["three"] > 5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [52]:
# Can use a scalar comparison to assign values
data[data < 5] = 0
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


## Selection on DataFrame with loc and iloc

Essentially the same as with Series. `loc` is axis labels and `iloc` is axis integers.  

Can do either rows or columns. Separate them by a comma (same as in R essentially)

### Loc

In [None]:
data

In [53]:
# Single row returns a series with column names as the index labels
data.loc["Colorado"]

one      0
two      5
three    6
four     7
Name: Colorado, dtype: int64

In [54]:
# Multiple rows returns another DataFrame
data.loc[["Colorado", "New York"]]

Unnamed: 0,one,two,three,four
Colorado,0,5,6,7
New York,12,13,14,15


In [57]:
# Single column - converts it to a Series
# you have to grab all of the rows
# data.loc[,"two"] this would throw an error
data.loc[:,"two"]

Ohio         0
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [58]:
# Select rows and columns (single row turns it into a Series)
data.loc["Colorado", ["two", "three"]]

two      5
three    6
Name: Colorado, dtype: int64

In [59]:
# Multiple rows maintains DataFrame
data.loc[["Colorado", "New York"], ["two", "three"]]

Unnamed: 0,two,three
Colorado,5,6
New York,13,14


### iloc

In [61]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [None]:
# Again, single value returns a Series and column labels -> indices
data.iloc[2]

In [None]:
# >= 2 values returns DataFrame
data.iloc[[2, 1]]

In [None]:
# Single row returns a Series, providing columns subsets it
data.iloc[2, [3, 0, 1]]

In [60]:
# Separate with a comma to get both rows and columns
data.iloc[[1,2], [3, 0, 1]]

Unnamed: 0,four,one,two
Colorado,7,0,5
Utah,11,8,9


In [62]:
# Label slicing with loc (remember the endpoint is inclusive)
data.loc[:"Utah", :"two"]

Unnamed: 0,one,two
Ohio,0,0
Colorado,0,5
Utah,8,9


In [65]:
# Can chain them together as well
print(data.iloc[:, :3])
data.iloc[:, :3][data.three > 5]

          one  two  three
Ohio        0    0      0
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14


Unnamed: 0,one,two,three
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14


In [66]:
# Boolean arrays can be used with loc
data.loc[data.three >= 2]

Unnamed: 0,one,two,three,four
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


<img src="./myImages/table5.4_DFIndexing.png" width=600>

## Integer Indexing Pitfalls

It's best to:

1. Use `loc` and `iloc` for indexing
1. Avoid integer indexes (if possible) in favor of text

Below are a few examples where pandas objects behave differently than base python objects and may create confusion.

In [67]:
# Create a series
ser = pd.Series(np.arange(3.))
ser

0    0.0
1    1.0
2    2.0
dtype: float64

Trying to grab one index from the end (`[-1]`) will throw an error. 

Since the indices are integers, pandas doesn't want to guess if this should be a label or a numeric position. If it's a label, then it doesn't exist in the Series. If treated as a position, it would be 1 from the end, but how can you know what the user intended?

```python
ser[-1]

KeyError                                  Traceback (most recent call last)
Cell In[68], line 1
----> 1 ser[-1]

File ~/miniconda3/envs/pydata-book/lib/python3.10/site-packages/pandas/core/series.py:1121, in Series.__getitem__(self, key)
   1118     return self._values[key]
   1120 elif key_is_scalar:
-> 1121     return self._get_value(key)
   1123 # Convert generator to list before going through hashable part
   1124 # (We will iterate through the generator there to check for slices)
   1125 if is_iterator(key):

File ~/miniconda3/envs/pydata-book/lib/python3.10/site-packages/pandas/core/series.py:1237, in Series._get_value(self, label, takeable)
   1234     return self._values[label]
   1236 # Similar to Index.get_value, but we do not fall back to positional
-> 1237 loc = self.index.get_loc(label)
   1239 if is_integer(loc):
   1240     return self._values[loc]

File ~/miniconda3/envs/pydata-book/lib/python3.10/site-packages/pandas/core/indexes/range.py:415, in RangeIndex.get_loc(self, key)
    413         return self._range.index(new_key)
    414     except ValueError as err:
--> 415         raise KeyError(key) from err
    416 if isinstance(key, Hashable):
    417     raise KeyError(key)

KeyError: -1
```

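If the indices are non-integer, then there is no error, since `-1` can't be mistaken for a label (though recent pandas versions emit a deprecation warning for this positional fallback):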

In [72]:
ser2 = pd.Series(np.arange(3.), index=["a", "b", "c"])
ser2

a    0.0
b    1.0
c    2.0
dtype: float64

In [73]:
ser2[-1]

2.0

Slicing with integers is always position (i.e. non-label) oriented. Below, the slice `[:2]` gets the first and second positions (0 and 1).

In [None]:
ser[:2]

Best to just use `iloc` if you want to integer index. The above -1 error can be avoided:

In [74]:
ser.iloc[-1]

2.0

## Pitfalls with Chained Indexing

As mentioned previously, `loc` and `iloc` are powerful, flexible methods for selecting elements of a DataFrame.

You can modify DataFrame objects in place with these, but must be careful.

Below: assign to a column/row by label or integer position.

In [75]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [76]:
# Assign the value 1 to all rows in the 'one' column
data.loc[:,"one"] = 1
data

Unnamed: 0,one,two,three,four
Ohio,1,0,0,0
Colorado,1,5,6,7
Utah,1,9,10,11
New York,1,13,14,15


In [77]:
# Assign the value 5 to row 3 for all columns
data.iloc[2] = 5
data

Unnamed: 0,one,two,three,four
Ohio,1,0,0,0
Colorado,1,5,6,7
Utah,5,5,5,5
New York,1,13,14,15


In [78]:
# Assign the value 3 to all columns of the rows where column "four"s values are greater than 5
data.loc[data["four"] > 5] = 3
data

Unnamed: 0,one,two,three,four
Ohio,1,0,0,0
Colorado,3,3,3,3
Utah,5,5,5,5
New York,3,3,3,3


Attempt to assign the value 6 to all rows of column "three" where column "three" is currently 5.

Might try to do this by chaining together selections.


In [79]:
# Select the row(s) where column "three" == 5
data.loc[data.three == 5]

Unnamed: 0,one,two,three,four
Utah,5,5,5,5


In [80]:
# Select column "three" from that
data.loc[data.three == 5]["three"]

Utah    5
Name: three, dtype: int64

In [82]:
# Combine both of those and try to assign a new value:
data.loc[data.three == 5]["three"] = 6
data

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data.loc[data.three == 5]["three"] = 6


Unnamed: 0,one,two,three,four
Ohio,1,0,0,0
Colorado,3,3,3,3
Utah,5,5,5,5
New York,3,3,3,3


The DataFrame is unmodified and we get a warning indicating "A value is trying to be set on a copy of a slice from a DataFrame". This is because our chaining has created an intermediate object (a copy of the full Utah row), and we're then assigning a value to a subset of that copy instead of to the actual DataFrame.

Chained indexing should be avoided when doing assignments. The better way would be to specify the desired rows and columns in the same expression with `loc` (Which I think is what I would naturally do coming from R anyway...)

In [83]:
# use the .loc[row,column] subset method:
data.loc[data.three == 5, "three"] = 6
data

Unnamed: 0,one,two,three,four
Ohio,1,0,0,0
Colorado,3,3,3,3
Utah,5,5,6,5
New York,3,3,3,3
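
A compact sketch of the two safe patterns, using a small made-up DataFrame (the names `mask` and `utah_rows` are mine): assign through a single `loc[rows, columns]` call when you want to change the original, and take an explicit `.copy()` when you want an independent subset to modify.

```python
import pandas as pd

data = pd.DataFrame({"three": [0, 3, 5, 3], "four": [0, 3, 5, 3]},
                    index=["Ohio", "Colorado", "Utah", "New York"])

mask = data["three"] == 5

# Pattern 1: rows and column in ONE loc call modifies the original DataFrame
data.loc[mask, "three"] = 6

# Pattern 2: an explicit copy is an independent object,
# so changing it leaves the original alone
utah_rows = data.loc[mask].copy()
utah_rows["four"] = -1

print(data)
print(utah_rows)
```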


<a name="arithmetic"></a>
# Arithmetic and Data Alignment

pandas will attempt to handle non-matching indexes between objects during arithmetic operations.  

If two pandas objects are combined (added, in this example), any non-matching indices will be coerced to NA.

This kind of makes sense if you imagine that the object that's missing the index gets padded with that index as a missing value, then adding a value to a missing value will just return the missing value.  

So only matching indices will return values. Non-matching values will become NA. If an entire row or column is unique to one or the other object, it will be added to the result as a row/column of entirely NAs.

There are fill functions that can determine how to handle these instead of just making them be NA.

In [84]:
# Two series with different indices
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
               index=["a", "c", "e", "f", "g"])
print(s1)
print("\n")
print(s2)

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64


a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64


When we add these together, the value associated with index `d` will become NA and so will those associated with `f` and `g`.  

As mentioned above, imagine that prior to the addition, `s1` is padded with two NAs associated with indices `f` and `g`, while `s2` is padded with an NA associated with index `d`.

In [85]:
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
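
As a preview of the fill options mentioned earlier, here is a sketch of the same addition using the `add` method with `fill_value=0` (the name `filled` is mine). A label missing from one Series is treated as 0 there, so `d`, `f`, and `g` keep the value from the Series that does have them.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
               index=["a", "c", "e", "f", "g"])

# Labels present in only one Series are padded with 0 instead of NA
filled = s1.add(s2, fill_value=0)
print(filled)
```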

The same occurs for DataFrames, except for both rows and columns.



In [86]:
# Two DataFrames with different rows and columns.
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list("bcd"),
                   index=["Ohio", "Texas", "Colorado"])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"),
                   index=["Utah", "Ohio", "Texas", "Oregon"])
print(df1)
print("\n")
print(df2)

            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0


          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0


When these are added together:

1. Column `c` will be entirely NAs because it's only in `df1`
1. Column `e` will likewise be entirely NAs
1. Rows `Colorado`, `Utah`, and `Oregon` will also be all NAs, because each appears in only one of the two DataFrames
1. The values for columns `b` and `d` in rows `Ohio` and `Texas` will be the only ones left

In [87]:
df1 + df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


If nothing overlaps, then you'll get 100% NAs!

(Even though the row indices are the same, `A` is only in `df1` and `B` is only in `df2`)

In [88]:
df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"B": [3, 4]})
print(df1)
print("\n")
print(df2)
print("\n")
print(df1+df2)


   A
0  1
1  2


   B
0  3
1  4


    A   B
0 NaN NaN
1 NaN NaN


## Arithmetic methods with fill values

Depending on what we're doing, we might not want to just turn everything to NAs.  

In order to supply a fill value, we need to use the arithmetic methods (e.g. `DF.add` instead of `+`) so that we can submit arguments to them. (`fill_value` in this case)

Also shown here is how to add singular NA values with `np.nan`.

In [90]:
# Make two DataFrames
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns=list("abcde"))

# Add an NA to df2
df2.loc[1, "b"] = np.nan

print(df1)
print("\n")
print(df2)
print("\n")
print(df1 + df2)

     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0


      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   NaN   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0


      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0   NaN  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN


The above example behaves exactly as we expect. But what if we'd rather treat non-overlapping values as 0 instead of missing?

Again have to think about the missing values getting padded in the smaller object before the operation. For example, `df1` is padded with column `e`, but the `fill_value` argument will pad with `0` this time instead of `NA`.

In [None]:
df1.add(df2, fill_value=0)
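
Since the cell above has no output, here is a sketch that rebuilds the two DataFrames and checks a few cells of the result (the name `filled` is mine). One detail worth noticing: `fill_value` also applies to the NA that was placed in `df2` at row 1, column `b`, because `df1` has a real value there. A cell only stays NA if it is missing in *both* objects, which doesn't happen in this example.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns=list("abcde"))
df2.loc[1, "b"] = np.nan

filled = df1.add(df2, fill_value=0)

print(filled)
print(filled.loc[1, "b"])  # df1's 5.0 plus a filled-in 0
print(filled.loc[3, "e"])  # only df2 has this cell, so its value passes through
```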

Another use of `fill_value` is when reindexing. If you add a blank column to a DataFrame, you can specify a value to fill the rows with, instead of just NA:

In [93]:
df1.reindex(columns=df2.columns, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,0
1,4.0,5.0,6.0,7.0,0
2,8.0,9.0,10.0,11.0,0


Here are some examples of arithmetic methods that can be used on Series and DataFrames:  

<img src="./myImages/table5.5_arithmeticMethods.png" width = 600>  

Notice how they all have "reverse" operations. This is a little confusing but is necessary in order to use methods instead of the actual operators.

For example, how would you do `1 / df1` using the `div` method? You'd have to be able to apply the `div` method to `1`, which doesn't make sense.  Doing `df1.div(1)` would just divide `df1` by 1, which isn't what we want.

The reverse operations are the solution:

In [91]:
1 / df1

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


In [92]:
df1.rdiv(1)

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


## Operations between DataFrame and Series

Below is a 2-D array `arr`. When one of its rows `arr[0]` is subtracted from it, the subtraction is performed once for each row (`broadcasting`).

Subtraction of a Series from a DataFrame behaves similarly. The one bit that might be less intuitive is that the Series' index is matched against the DataFrame's columns and the arithmetic is broadcast down the rows. (I would think the opposite.)

In [94]:
# Create an array and subtract a row from it
arr = np.arange(12.).reshape((3, 4))
print(arr)
print("\n")
print(arr[0])
arr - arr[0]

[[ 0.  1.  2.  3.]
 [ 4.  5.  6.  7.]
 [ 8.  9. 10. 11.]]


[0. 1. 2. 3.]


array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

In [95]:
# Create a DataFrame and Series and do the same
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])
series = frame.iloc[0]
print(frame)
print("\n")
print(series)

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0


b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64


In [96]:
frame - series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


The same padding as shown above will also occur with mismatched indices.  

Below, we add a new Series that shares `b` and `e` with the DataFrame.

If we add the two together:

1. 0 will be added to every row in column `b`
1. Every row in column `d` will become NA
1. 1 will be added to every row in column `e`
1. New column `f` will be made, with all NA values.

In [97]:
series2 = pd.Series(np.arange(3), index=["b", "e", "f"])
series2

b    0
e    1
f    2
dtype: int64

In [98]:
frame + series2

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


Broadcasting the opposite way (i.e. matching up the rows and propagating the column) is not the default.  

In order to do this, you have to use the *arithmetic methods* (i.e. `.sub` instead of `-`) and provide the `axis` argument, specifying "index" for rows.

Obviously the Series indices have to match the DataFrame's row indices. If you tried to subtract `series2` from `frame` on the row index, you'd get all NAs:

```python
frame.sub(series2, axis="index")
         b   d   e
Ohio   NaN NaN NaN
Oregon NaN NaN NaN
Texas  NaN NaN NaN
Utah   NaN NaN NaN
b      NaN NaN NaN
e      NaN NaN NaN
f      NaN NaN NaN
```

In [101]:
# Make a new Series whose index values match the DataFrame
series3 = frame["d"]
print(frame)
print("\n")
print(series3)

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0


Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64


In [102]:
# Subtract on the rows
frame.sub(series3, axis="index")

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0
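
One place this row-wise broadcasting comes up is centering each row around its own mean. The sketch below is my own example (the names `row_means` and `demeaned` are illustrative): it computes one mean per row and subtracts it with `axis="index"`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

# One mean per row; the result is a Series indexed by the row labels
row_means = frame.mean(axis="columns")

# Match the Series against the row index and broadcast across the columns
demeaned = frame.sub(row_means, axis="index")
print(demeaned)
```

Using plain `frame - row_means` here would try to match the row labels against the column labels and produce all NAs.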


<a name="function"></a>
# Function Application and Mapping

Remember the NumPy `ufuncs` (element-wise array methods) from the previous chapter. They will also work on pandas objects.

In [104]:
frame = pd.DataFrame(np.random.standard_normal((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])
frame

Unnamed: 0,b,d,e
Utah,-0.165069,0.71294,-0.892112
Ohio,-1.890304,1.11002,0.768769
Texas,-0.543075,0.359814,0.400817
Oregon,0.900408,0.429803,0.246686


In [105]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,0.165069,0.71294,0.892112
Ohio,1.890304,1.11002,0.768769
Texas,0.543075,0.359814,0.400817
Oregon,0.900408,0.429803,0.246686


You might also want to apply a function that works on 1-d arrays to each column or row.

The DataFrame `apply` method accomplishes this and is essentially the same as R's `apply()` in terms of format.

The main difference is that by default the function is applied once per column (i.e. down the rows), and you have to specify `axis="columns"` to apply it once per row.  

Again with the "opposite" nomenclature to what I would expect... He says to think of `axis="columns"` as "apply across the columns", so I guess "for each row, apply this function across its columns".

In [106]:
# Define a function
def f1(x):
    return x.max() - x.min()

# Apply "across the rows" (i.e. for each column, apply this across all its rows)
frame.apply(f1)

b    2.790713
d    0.750205
e    1.660881
dtype: float64

In [107]:
# Apply "across the columns" (i.e. for each row, apply this across all its columns)
frame.apply(f1, axis="columns")

Utah      1.605052
Ohio      3.000324
Texas     0.943892
Oregon    0.653722
dtype: float64

Before you use `apply` check if there is already a method for it. Most of the common array statistics (e.g. `sum` and `mean`) are already methods so you don't need to wrap it in apply.  
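
For instance, the max-minus-min function from above can be written with the built-in `max` and `min` methods alone. This is a sketch with its own random data (seeded via `np.random.default_rng`, which is my choice and not from the book), showing that both routes give the same numbers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

def value_range(x):
    return x.max() - x.min()

# Per column: apply vs. built-in methods
via_apply = frame.apply(value_range)
via_methods = frame.max() - frame.min()

# Per row: same idea with axis="columns"
rows_via_apply = frame.apply(value_range, axis="columns")
rows_via_methods = frame.max(axis="columns") - frame.min(axis="columns")

print(via_methods)
print(rows_via_methods)
```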

You can return a Series with multiple values out of apply functions - it doesn't have to be a scalar. Just define the output in your function.

In [108]:
# Define a function that returns a series
def f2(x):
    return pd.Series([x.min(), x.max()], index=["min", "max"])

# Apply said function "across the rows" (i.e. for each column, apply this across all its rows)
frame.apply(f2)

Unnamed: 0,b,d,e
min,-1.890304,0.359814,-0.892112
max,0.900408,1.11002,0.768769


`applymap` is another useful method. It applies a Python function element-wise, i.e. to every individual value in the DataFrame rather than to whole rows or columns.  

NEVERMIND! `applymap` IS DEPRECATED IN RECENT PANDAS VERSIONS (it still runs below, but with a warning) AND `DataFrame.map` IS THE STANDARD USAGE NOW.

In [109]:
def my_format(x):
    return f"{x:.2f}"

frame.applymap(my_format)

Unnamed: 0,b,d,e
Utah,-0.17,0.71,-0.89
Ohio,-1.89,1.11,0.77
Texas,-0.54,0.36,0.4
Oregon,0.9,0.43,0.25


In [110]:
frame.map(my_format)

Unnamed: 0,b,d,e
Utah,-0.17,0.71,-0.89
Ohio,-1.89,1.11,0.77
Texas,-0.54,0.36,0.4
Oregon,0.9,0.43,0.25
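
If the same notebook has to run on both older and newer pandas installs, one possible workaround (a sketch, not something from the book; the name `elementwise` is hypothetical) is to pick whichever method exists at runtime.

```python
import pandas as pd

def my_format(x):
    return f"{x:.2f}"

frame = pd.DataFrame([[1.234, -0.5], [2.0, 3.14159]], columns=["b", "d"])

# Prefer DataFrame.map where it exists; fall back to applymap otherwise
if hasattr(pd.DataFrame, "map"):
    elementwise = frame.map
else:
    elementwise = frame.applymap

formatted = elementwise(my_format)
print(formatted)
```

Every element of `formatted` is now a string with two decimal places.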


<a name="sorting"></a>
# Sorting and Ranking

Sorting and Ranking can be done on Series and on either row or column labels for DataFrames.

Sort
1. Default behavior:
    - Ascending order
    - Missing values placed at the end (`na_position="first"` to put them in front)
1. Methods
    - `sort_index` to sort by row/column indices
    - `sort_values` to sort by row/column values
    - For DataFrame, specify which axis to sort by

Default rank:
1. Rank of 1 is assigned to the lowest value and increases from there
1. Ties are solved by taking the average (e.g. if 4th and 5th values are the same, they're both ranked 4.5)
1. Arguments
    - `method="first"` to deal with ties by their order of occurrence instead of averaging
    - `ascending=False` to rank in descending order

Methods for breaking rank ties:  
<img src="./myImages/table5.6_rankTieBreakers.png" width = 600>

## Sort Examples

### Series

In [None]:
# Make a series
obj = pd.Series(np.arange(4), index=["d", "a", "b", "c"])
obj

In [112]:
# Sort by index
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [113]:
# Sort by index descending
obj.sort_index(ascending=False)

d    0
c    3
b    2
a    1
dtype: int64

In [114]:
# Sort by value 
obj.sort_values()

d    0
a    1
b    2
c    3
dtype: int64

In [115]:
# Make a series with missing values
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj

0    4.0
1    NaN
2    7.0
3    NaN
4   -3.0
5    2.0
dtype: float64

In [116]:
# Default sort
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

In [117]:
# NAs up front
obj.sort_values(na_position="first")

1    NaN
3    NaN
4   -3.0
5    2.0
0    4.0
2    7.0
dtype: float64

### DataFrame

In [118]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=["three", "one"],
                     columns=["d", "a", "b", "c"])
frame

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [119]:
# Default sort is by row index
frame.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [120]:
# Add axis argument to sort by columns
frame.sort_index(axis="columns")

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [121]:
# Same ascending/descending behavior
frame.sort_index(axis="columns", ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


Sorting by values is a little different. 

This is sorting by the values that are in the cells of the DataFrame.  

One or more columns can be provided - the rows are then sorted in ascending order by each column in turn.

In [126]:
frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1], "c": [4, 3, 2, 1]})
frame

Unnamed: 0,b,a,c
0,4,0,4
1,7,1,3
2,-3,0,2
3,2,1,1


In [127]:
# Sort in ascending order by b
frame.sort_values("b")

Unnamed: 0,b,a,c
2,-3,0,2
3,2,1,1
0,4,0,4
1,7,1,3


In [128]:
# Sort by a and then b
frame.sort_values(["a", "b"])

Unnamed: 0,b,a,c
2,-3,0,2
0,4,0,4
3,2,1,1
1,7,1,3


In [129]:
# Sort by b and then a
# No different than sorting by b alone b/c there are no duplicate values in b.
frame.sort_values(["b", "a"])

Unnamed: 0,b,a,c
2,-3,0,2
3,2,1,1
0,4,0,4
1,7,1,3
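
Sort direction can also differ per column. As a sketch on the same data (the name `mixed` is mine), passing a list to `ascending` sorts `a` in ascending order and breaks ties with `b` in descending order.

```python
import pandas as pd

frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1], "c": [4, 3, 2, 1]})

# One True/False per sort column: a ascending, then b descending
mixed = frame.sort_values(["a", "b"], ascending=[True, False])
print(mixed)
```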


## Rank Examples

### Series

In [131]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj

0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

In [132]:
# Default is to average ties
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [133]:
# method = first
obj.rank(method="first")

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [134]:
# Descending
obj.rank(method = "first", ascending=False)

0    1.0
1    7.0
2    2.0
3    3.0
4    5.0
5    6.0
6    4.0
dtype: float64
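
The other tie-breaking methods work the same way. A short sketch on the same Series comparing `"min"`, `"max"`, and `"dense"` (the variable names are mine): `"min"` and `"max"` give every member of a tied group the lowest or highest rank in that group, while `"dense"` is like `"min"` but ranks always increase by 1 between groups.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

rank_min = obj.rank(method="min")      # tied values share the lowest rank
rank_max = obj.rank(method="max")      # tied values share the highest rank
rank_dense = obj.rank(method="dense")  # no gaps between consecutive ranks

print(pd.DataFrame({"value": obj, "min": rank_min,
                    "max": rank_max, "dense": rank_dense}))
```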

### DataFrame

In [135]:
frame = pd.DataFrame({"b": [4.3, 7, -3, 2], "a": [0, 1, 0, 1],
                      "c": [-2, 5, 8, -2.5]})
frame

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,-2.5


In [136]:
# Default is the same as axis="rows", which means the rows in each column are ranked
frame.rank()

Unnamed: 0,b,a,c
0,3.0,1.5,2.0
1,4.0,3.5,3.0
2,1.0,1.5,4.0
3,2.0,3.5,1.0


In [137]:
frame.rank(axis="rows")

Unnamed: 0,b,a,c
0,3.0,1.5,2.0
1,4.0,3.5,3.0
2,1.0,1.5,4.0
3,2.0,3.5,1.0


In [139]:
# axis="columns" means the columns in each row are ranked
frame.rank(axis="columns")

Unnamed: 0,b,a,c
0,3.0,2.0,1.0
1,3.0,1.0,2.0
2,1.0,2.0,3.0
3,3.0,2.0,1.0


<a name="duplicates"></a>
# Axis Indexes with Duplicate Labels

It's best practice to have your axis labels be unique (many pandas functions, such as `reindex`, even require it), but it's not mandatory.

It can make code complicated, however, because selections on unique labels and non-unique labels often return different object types.

`is_unique` is an attribute of the index of both Series and DataFrames (e.g. `obj.index.is_unique`) that indicates whether its labels are unique.

If a unique label is selected, a scalar value is returned. If a non-unique label is selected, a series of all values associated with that label is returned.

## Series

In [140]:
obj = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [141]:
# Unique index returns scalar
obj["c"]

4

In [142]:
# Non-unique returns series
obj["a"]

a    0
a    1
dtype: int64
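
A sketch of two ways to cope with this (variable names are mine): check `is_unique` up front, or wrap the label in a list so that `loc` returns a Series no matter how many times the label occurs.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])

# Check for duplicate labels before relying on scalar lookups
has_unique_labels = obj.index.is_unique
print(has_unique_labels)

# A list of labels always returns a Series, even for a unique label
single = obj.loc[["c"]]
repeated = obj.loc[["a"]]
print(type(single).__name__, len(single))
print(type(repeated).__name__, len(repeated))
```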

## DataFrame

Similar behavior - unique values return a Series, while non-unique return DataFrames

In [145]:
df = pd.DataFrame(np.random.standard_normal((5, 4)),
                  index=["a", "a", "b", "b", "c"],
                  columns=["One", "Two", "Three", "Three"])
df

Unnamed: 0,One,Two,Three,Three
a,0.015654,-1.152113,-1.08026,-0.646942
a,-0.579382,0.486395,0.01537,0.720662
b,1.387634,-0.139371,0.139984,1.767489
b,0.666812,-0.110744,-0.379758,0.651668
c,-0.958652,2.246756,1.005083,1.238565


In [146]:
# Unique value
df.loc["c"]

One     -0.958652
Two      2.246756
Three    1.005083
Three    1.238565
Name: c, dtype: float64

In [147]:
type(df.loc["c"])

pandas.core.series.Series

In [148]:
# Non-unique
df.loc["b"]

Unnamed: 0,One,Two,Three,Three
b,1.387634,-0.139371,0.139984,1.767489
b,0.666812,-0.110744,-0.379758,0.651668


In [149]:
# Column unique
df.loc[:,"One"]

a    0.015654
a   -0.579382
b    1.387634
b    0.666812
c   -0.958652
Name: One, dtype: float64

In [150]:
# Column non-unique
df.loc[:,"Three"]

Unnamed: 0,Three,Three
a,-1.08026,-0.646942
a,0.01537,0.720662
b,0.139984,1.767489
b,-0.379758,0.651668
c,1.005083,1.238565
