When Series or DataFrames are combined, each dimension of the data automatically aligns
on each axis first before any computation happens. This silent and automatic alignment of
axes can confuse the uninitiated, but it gives flexibility to the power user. This chapter explores
the Index object in-depth before showcasing a variety of recipes that take advantage of its
automatic alignment.

In [2]:
import numpy as np
import pandas as pd

In [3]:
pd.set_option('max_columns', 4, 'max_rows', 10)

In [3]:
college = pd.read_csv("C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/college.csv",)

In [4]:
columns = college.columns

In [5]:
columns

Index(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL',
       'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS', 'UGDS_WHITE',
       'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN', 'UGDS_NHPI',
       'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF', 'CURROPER', 'PCTPELL',
       'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10', 'GRAD_DEBT_MDN_SUPP'],
      dtype='object')

In [6]:
# Use the .values attribute to access the underlying
# NumPy array:
columns.values

array(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY',
       'RELAFFIL', 'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS',
       'UGDS_WHITE', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN',
       'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF',
       'CURROPER', 'PCTPELL', 'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10',
       'GRAD_DEBT_MDN_SUPP'], dtype=object)

In [7]:
# Select items from the index by position with a
# scalar, list, or slice:
columns[5]

'WOMENONLY'

In [8]:

columns[[1, 8, 10]]

Index(['CITY', 'SATMTMID', 'UGDS'], dtype='object')

In [9]:
columns[-7: -4]

Index(['PPTUG_EF', 'CURROPER', 'PCTPELL'], dtype='object')

In [10]:
# Indexes share many of the same methods as Series
# and DataFrames:
columns.min(), columns.max(), columns.isnull().sum()

('CITY', 'WOMENONLY', 0)

In [11]:
# You can use basic arithmetic and comparison 
# operators on Index objects:
columns + "_A"

Index(['INSTNM_A', 'CITY_A', 'STABBR_A', 'HBCU_A', 'MENONLY_A', 'WOMENONLY_A',
       'RELAFFIL_A', 'SATVRMID_A', 'SATMTMID_A', 'DISTANCEONLY_A', 'UGDS_A',
       'UGDS_WHITE_A', 'UGDS_BLACK_A', 'UGDS_HISP_A', 'UGDS_ASIAN_A',
       'UGDS_AIAN_A', 'UGDS_NHPI_A', 'UGDS_2MOR_A', 'UGDS_NRA_A',
       'UGDS_UNKN_A', 'PPTUG_EF_A', 'CURROPER_A', 'PCTPELL_A', 'PCTFLOAN_A',
       'UG25ABV_A', 'MD_EARN_WNE_P10_A', 'GRAD_DEBT_MDN_SUPP_A'],
      dtype='object')

In [12]:
columns > "G"

array([ True, False,  True,  True,  True,  True,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True])

In [13]:
# Trying to change an Index value after its creation 
# fails. Indexes are immutable objects:
columns[1] = "city"

TypeError: Index does not support mutable operations

In [14]:
# Indexes support the set operations—union, 
# intersection, difference, and symmetric difference:
c1 = columns[:4]

In [15]:
c1

Index(['INSTNM', 'CITY', 'STABBR', 'HBCU'], dtype='object')

In [16]:
c2 = columns[2:6]

In [17]:
c2

Index(['STABBR', 'HBCU', 'MENONLY', 'WOMENONLY'], dtype='object')

In [18]:
c1.union(c2)  # or 'c1 | c2'

Index(['CITY', 'HBCU', 'INSTNM', 'MENONLY', 'STABBR', 'WOMENONLY'], dtype='object')

In [19]:
c1.symmetric_difference(c2)  # or 'c1 ^ c2'

Index(['CITY', 'INSTNM', 'MENONLY', 'WOMENONLY'], dtype='object')

## Producing Cartesian products

In [20]:
# Construct two Series that have indexes that are different but contain some of the
# same values:
s1 = pd.Series(index=list("aaab"), data=np.arange(4))

In [21]:
s1

a    0
a    1
a    2
b    3
dtype: int32

In [22]:
s2 = pd.Series(index=list('cababb'), data=np.arange(6))

In [23]:
s2

c    0
a    1
b    2
a    3
b    4
b    5
dtype: int32

In [24]:
# Add the two Series together to produce a Cartesian
# product. For each a index value
# in s1, we add every a in s2:
s1 + s2

a    1.0
a    3.0
a    2.0
a    4.0
a    3.0
a    5.0
b    5.0
b    7.0
b    8.0
c    NaN
dtype: float64

The Cartesian product is not created when the indexes are unique or contain both the same
exact elements and elements in the same order. When the index values are unique or they
are the same and have the same order, a Cartesian product is not created, and the indexes
instead align by their position. Notice here that each element aligned exactly by position and
that the data type remained an integer:

In [25]:
s1 = pd.Series(index=list('aaabb'), data=np.arange(5))
s2 = pd.Series(index=list('bbaaa'), data=np.arange(5))

In [26]:
s1 + s2

a    2
a    3
a    4
a    3
a    4
    ..
a    6
b    3
b    4
b    4
b    5
Length: 13, dtype: int32

Be aware of this as pandas has two drastically different outcomes for this same operation.
Another instance where this can happen is during a groupby operation. If you do a groupby
with multiple columns and one is of the type categorical, you will get a Cartesian product
where each outer index will have every inner index value.
Finally, we will add two Series that have index values in a different order but do not have
duplicate values. When we add these, we do not get a Cartesian product:

In [27]:
s3 = pd.Series(index=list('ab'), data=np.arange(2))
s4= pd.Series(index=list('ba'), data=np.arange(2))

In [28]:
s3 + s4

a    1
b    1
dtype: int32

## Exploding indexes

In [29]:
# Read in the employee data and set the index
# to the RACE column:
employee = pd.read_csv('C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/employee.csv', index_col='RACE')

In [30]:
employee.head()

Unnamed: 0_level_0,UNIQUE_ID,POSITION_TITLE,...,HIRE_DATE,JOB_DATE
RACE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Hispanic/Latino,0,ASSISTANT DIRECTOR (EX LVL),...,2006-06-12,2012-10-13
Hispanic/Latino,1,LIBRARY ASSISTANT,...,2000-07-19,2010-09-18
White,2,POLICE OFFICER,...,2015-02-03,2015-02-03
White,3,ENGINEER/OPERATOR,...,1982-02-08,1991-05-25
White,4,ELECTRICIAN,...,1989-06-19,1994-10-22


In [31]:
# Select the BASE_SALARY column as two different Series. Check to see whether 
# this operation created two new objects:
salary1 = employee["BASE_SALARY"]
salary2 = employee["BASE_SALARY"]

salary1 is salary2

True

The salary1 and salary2 variables are referring to the same object. This means
that any change to one will change the other. To ensure that you receive a brand new
copy of the data, use the .copy method

In [34]:
salary2 = employee['BASE_SALARY'].copy()

In [35]:
salary1 is salary2 

False

In [36]:
# Let's change the order of the index for one of the
# Series by sorting it:
salary1 = salary1.sort_index()
salary1.head()

RACE
American Indian or Alaskan Native    78355.0
American Indian or Alaskan Native    26125.0
American Indian or Alaskan Native    98536.0
American Indian or Alaskan Native        NaN
American Indian or Alaskan Native    55461.0
Name: BASE_SALARY, dtype: float64

In [37]:
salary2.head()

RACE
Hispanic/Latino    121862.0
Hispanic/Latino     26125.0
White               45279.0
White               63166.0
White               56347.0
Name: BASE_SALARY, dtype: float64

In [38]:
# let's add these salary Series together
salary_add = salary1 + salary2
salary_add.head()

RACE
American Indian or Alaskan Native    138702.0
American Indian or Alaskan Native    156710.0
American Indian or Alaskan Native    176891.0
American Indian or Alaskan Native    159594.0
American Indian or Alaskan Native    127734.0
Name: BASE_SALARY, dtype: float64

The operation completed successfully. Let's create one more Series of salary1
added to itself and then output the lengths of each Series. We just exploded the
index from 2,000 values to more than one million:

In [39]:
salary_add1 = salary1 + salary1

In [40]:
len(salary1), len(salary2), len(salary_add), len(salary_add1)

(2000, 2000, 1175424, 2000)

We can verify the number of values of salary_add by doing a little mathematics. As a
Cartesian product takes place between all of the same index values, we can sum the square
of their counts. Even missing values in the index produce Cartesian products with themselves

In [41]:
index_vc = salary1.index.value_counts(dropna=False)

In [42]:
index_vc

Black or African American            700
White                                665
Hispanic/Latino                      480
Asian/Pacific Islander               107
NaN                                   35
American Indian or Alaskan Native     11
Others                                 2
Name: RACE, dtype: int64

In [43]:
index_vc.pow(2).sum()

1175424

## Filling values with unequal indexes

When two Series are added together using the plus operator and one of the index labels does
not appear in the other, the resulting value is always missing. pandas has the .add method,
which provides an option to fill the missing value. Note that these Series do not include
duplicate entries, hence there is no need to worry about a Cartesian product exploding the
number of entries.
In this recipe, we add together multiple Series from the baseball dataset with unequal (but
unique) indexes using the .add method with the fill_value parameter to ensure that
there are no missing values in the result.

In [44]:
# Read in the three baseball datasets and set playerID
# as the index:
baseball_14 = pd.read_csv('C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/baseball14.csv', index_col='playerID')

In [45]:
baseball_15 = pd.read_csv('C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/baseball15.csv', index_col='playerID')

In [46]:
baseball_16 = pd.read_csv('C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/baseball16.csv', index_col='playerID')

In [47]:
baseball_14.head()

Unnamed: 0_level_0,yearID,stint,...,SF,GIDP
playerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
altuvjo01,2014,1,...,5.0,20.0
cartech02,2014,1,...,4.0,12.0
castrja01,2014,1,...,3.0,11.0
corpoca01,2014,1,...,2.0,3.0
dominma01,2014,1,...,7.0,23.0


In [48]:
# Use the .difference method on the index to discover
# which index labels are in baseball_14 and not 
# in baseball_15, and vice versa:
baseball_14.index.difference(baseball_15.index)

Index(['corpoca01', 'dominma01', 'fowlede01', 'grossro01', 'guzmaje01',
       'hoeslj01', 'krausma01', 'preslal01', 'singljo02'],
      dtype='object', name='playerID')

In [49]:
# There are quite a few players unique to each index.
# Let's find out how many hits each player has in 
# total over the three-year period. The H column 
# contains the number of hits:
hits_14 = baseball_14['H']
hits_15 = baseball_15['H']
hits_16 = baseball_16['H']


In [50]:
hits_14.head()

playerID
altuvjo01    225
cartech02    115
castrja01    103
corpoca01     40
dominma01    121
Name: H, dtype: int64

In [51]:
# Let's first add together two Series using the
# plus operator
(hits_14 + hits_15)

playerID
altuvjo01    425.0
cartech02    193.0
castrja01    174.0
congeha01      NaN
corpoca01      NaN
             ...  
singljo02      NaN
springe01    175.0
tuckepr01      NaN
valbulu01      NaN
villajo01     88.0
Name: H, Length: 24, dtype: float64

Even though players congeha01 and corpoca01 have values for 2015, their result
is missing. Let's use the .add method with the fill_value parameter to avoid
missing values

In [52]:
hits_14.add(hits_15, fill_value=0).head()



playerID
altuvjo01    425.0
cartech02    193.0
castrja01    174.0
congeha01     46.0
corpoca01     40.0
Name: H, dtype: float64

In [53]:
# We add hits from 2016 by chaining the add method once
# more:
hits_total = hits_14.add(hits_15, fill_value=0).add(hits_16, fill_value=0)

In [54]:
hits_total.head()

playerID
altuvjo01    641.0
bregmal01     53.0
cartech02    193.0
castrja01    243.0
congeha01     46.0
Name: H, dtype: float64

In [55]:
# Check for missing values in the result:
hits_total.hasnans

False

There will be occasions when each Series contains index labels that correspond to missing
values. In this specific instance, when the two Series are added, the index label will still
correspond to a missing value regardless of whether the fill_value parameter is used.
To clarify this, take a look at the following example where the index label a corresponds to a
missing value in each Series:

In [56]:
s = pd.Series(index=['a', 'b', 'c', 'd'], data=[np.nan, 3, np.nan, 1])

In [57]:
s

a    NaN
b    3.0
c    NaN
d    1.0
dtype: float64

In [58]:
s1 = pd.Series(index=['a', 'b', 'c'], data=[np.nan, 6, 10])

In [59]:
s1

a     NaN
b     6.0
c    10.0
dtype: float64

In [60]:
s1.add(s, fill_value=5)

a     NaN
b     9.0
c    15.0
d     6.0
dtype: float64

This recipe shows how to add Series with only a single index together. It is also possible to
add DataFrames together. Adding two DataFrames together will align both the index and
columns before computation and insert missing values for non-matching indexes. Let's start
by selecting a few of the columns from the 2014 baseball dataset:

In [61]:
df_14 = baseball_14[['G', 'AB', 'R', 'H']]
df_14.head()

Unnamed: 0_level_0,G,AB,R,H
playerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
altuvjo01,158,660,85,225
cartech02,145,507,68,115
castrja01,126,465,43,103
corpoca01,55,170,22,40
dominma01,157,564,51,121


In [62]:
# Let's also select a few of the same and a few 
# different columns from the 2015 baseball dataset:

df_15 = baseball_15[['AB', 'R', 'H', 'HR']]
df_15.head()

Unnamed: 0_level_0,AB,R,H,HR
playerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
altuvjo01,638,86,200,15
cartech02,391,50,78,24
castrja01,337,38,71,11
congeha01,201,25,46,11
correca01,387,52,108,22


Adding the two DataFrames together creates missing values wherever rows or column labels
cannot align. You can use the .style attribute and call the .highlight_null method to
see where the missing values are:

In [63]:
(df_14 + df_15).head(10).style.highlight_null('yellow')

Unnamed: 0_level_0,AB,G,H,HR,R
playerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
altuvjo01,1298.0,,425.0,,171.0
cartech02,898.0,,193.0,,118.0
castrja01,802.0,,174.0,,81.0
congeha01,,,,,
corpoca01,,,,,
correca01,,,,,
dominma01,,,,,
fowlede01,,,,,
gattiev01,,,,,
gomezca01,,,,,


Only the rows where playerID appears in both DataFrames will be available. Similarly, the
columns AB, H, and R are the only ones that appear in both DataFrames. Even if we use the
.add method with the fill_value parameter specified, we still might have missing values.
This is because some combinations of rows and columns never existed in our input data; for
example, the intersection of playerID congeha01 and column G. That player only appeared in
the 2015 dataset that did not have the G column. Therefore, that value was missing:

In [64]:
(
    df_14.add(df_15, fill_value=0)
    .head(10)
    .style.highlight_null('yellow')
)

Unnamed: 0_level_0,AB,G,H,HR,R
playerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
altuvjo01,1298.0,158.0,425.0,15.0,171.0
cartech02,898.0,145.0,193.0,24.0,118.0
castrja01,802.0,126.0,174.0,11.0,81.0
congeha01,201.0,,46.0,11.0,25.0
corpoca01,170.0,55.0,40.0,,22.0
correca01,387.0,,108.0,22.0,52.0
dominma01,564.0,157.0,121.0,,51.0
fowlede01,434.0,116.0,120.0,,61.0
gattiev01,566.0,,139.0,27.0,66.0
gomezca01,149.0,,36.0,4.0,19.0


### Adding columns from different DataFrames

All DataFrames can add new columns to themselves. However, as usual, whenever
a DataFrame is adding a new column from another DataFrame or Series, the indexes align
first, and then the new column is created.

In [65]:
# Import the employee data and select the DEPARTMENT
# and BASE_SALARY columns in a new DataFrame
employee = pd.read_csv('C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/employee.csv')

In [66]:
dept_sal = employee[['DEPARTMENT', 'BASE_SALARY']]

In [67]:
# Sort this smaller DataFrame by salary within each
# department:
dept_sal.sort_values(['DEPARTMENT', 'BASE_SALARY'], ascending=[True, False])

Unnamed: 0,DEPARTMENT,BASE_SALARY
1494,Admn. & Regulatory Affairs,140416.0
237,Admn. & Regulatory Affairs,130416.0
1679,Admn. & Regulatory Affairs,103776.0
988,Admn. & Regulatory Affairs,72741.0
693,Admn. & Regulatory Affairs,66825.0
...,...,...
1140,Solid Waste Management,30410.0
1243,Solid Waste Management,30410.0
387,Solid Waste Management,28829.0
57,Solid Waste Management,27622.0


In [68]:
# Use the .drop_duplicates method to keep the first
# row of each DEPARTMENT:b
max_dept_sal = dept_sal.drop_duplicates(subset='DEPARTMENT')

In [69]:
max_dept_sal.head()

Unnamed: 0,DEPARTMENT,BASE_SALARY
0,Municipal Courts Department,121862.0
1,Library,26125.0
2,Houston Police Department-HPD,45279.0
3,Houston Fire Department (HFD),63166.0
4,General Services Department,56347.0


In [73]:
max_dept_sal.head()

Unnamed: 0_level_0,BASE_SALARY
DEPARTMENT,Unnamed: 1_level_1
Municipal Courts Department,121862.0
Library,26125.0
Houston Police Department-HPD,45279.0
Houston Fire Department (HFD),63166.0
General Services Department,56347.0


In [70]:
# Put the DEPARTMENT column into the index for
# each DataFrame
max_dept_sal = max_dept_sal.set_index("DEPARTMENT")


In [71]:
employee = employee.set_index("DEPARTMENT")

In [74]:
# Now that the indexes contain matching values, we 
# can add a new column to the employee DataFrame
employee = employee.assign(
    MAX_DEPT_SALARY=max_dept_sal['BASE_SALARY']
)

In [75]:
employee.head()

Unnamed: 0_level_0,UNIQUE_ID,POSITION_TITLE,...,JOB_DATE,MAX_DEPT_SALARY
DEPARTMENT,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Municipal Courts Department,0,ASSISTANT DIRECTOR (EX LVL),...,2012-10-13,121862.0
Library,1,LIBRARY ASSISTANT,...,2010-09-18,26125.0
Houston Police Department-HPD,2,POLICE OFFICER,...,2015-02-03,45279.0
Houston Fire Department (HFD),3,ENGINEER/OPERATOR,...,1991-05-25,63166.0
General Services Department,4,ELECTRICIAN,...,1994-10-22,56347.0


In [77]:
# We can validate our results with the query method to
# check whether there exist any rows where BASE_SALARY 
# is greater than MAX_DEPT_SALARY:
employee.query("BASE_SALARY > MAX_DEPT_SALARY")

Unnamed: 0_level_0,UNIQUE_ID,POSITION_TITLE,...,JOB_DATE,MAX_DEPT_SALARY
DEPARTMENT,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Houston Police Department-HPD,5,SENIOR POLICE OFFICER,...,2005-03-26,45279.0
Public Works & Engineering-PWE,8,DEPUTY ASSISTANT DIRECTOR (EXECUTIVE LEV,...,2013-01-05,71680.0
Houston Airport System (HAS),9,AIRPORT OPERATIONS COORDINATOR,...,2016-03-14,42390.0
Houston Police Department-HPD,14,POLICE SERGEANT,...,2015-05-25,45279.0
Houston Police Department-HPD,17,POLICE SERGEANT,...,2007-03-03,45279.0
...,...,...,...,...,...
Parks & Recreation,1990,BUILDING MAINTENANCE SUPERVISOR,...,2010-03-20,26125.0
Houston Police Department-HPD,1993,POLICE SERGEANT,...,2011-09-03,45279.0
Houston Police Department-HPD,1994,POLICE CAPTAIN,...,2004-07-08,45279.0
Houston Fire Department (HFD),1996,COMMUNICATIONS CAPTAIN,...,2013-10-06,63166.0


In [78]:
# refractor our code into a chain 
employee = pd.read_csv('C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/employee.csv')

In [79]:
max_dept_sal = (
    employee[['DEPARTMENT', 'BASE_SALARY']]
    .sort_values(
        ['DEPARTMENT', 'BASE_SALARY'], ascending=[True, False]
    )
    .drop_duplicates(subset='DEPARTMENT')
    .sort_index('DEPARTMENT')
)

  employee[['DEPARTMENT', 'BASE_SALARY']]


ValueError: No axis named DEPARTMENT for object type DataFrame

In [84]:
(
    employee
    .set_index('DEPARTMENT')
    .assign(
        MAX_DEPT_SALARY=max_dept_sal['BASE_SALARY']
    )
)

Unnamed: 0_level_0,UNIQUE_ID,POSITION_TITLE,...,JOB_DATE,MAX_DEPT_SALARY
DEPARTMENT,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Municipal Courts Department,0,ASSISTANT DIRECTOR (EX LVL),...,2012-10-13,121862.0
Library,1,LIBRARY ASSISTANT,...,2010-09-18,26125.0
Houston Police Department-HPD,2,POLICE OFFICER,...,2015-02-03,45279.0
Houston Fire Department (HFD),3,ENGINEER/OPERATOR,...,1991-05-25,63166.0
General Services Department,4,ELECTRICIAN,...,1994-10-22,56347.0
...,...,...,...,...,...
Houston Police Department-HPD,1995,POLICE OFFICER,...,2015-06-09,45279.0
Houston Fire Department (HFD),1996,COMMUNICATIONS CAPTAIN,...,2013-10-06,63166.0
Houston Police Department-HPD,1997,POLICE OFFICER,...,2015-10-13,45279.0
Houston Police Department-HPD,1998,POLICE OFFICER,...,2011-07-02,45279.0


In [80]:
random_salary = dept_sal.sample(n=10, random_state=42).set_index('DEPARTMENT')

In [81]:
random_salary

Unnamed: 0_level_0,BASE_SALARY
DEPARTMENT,Unnamed: 1_level_1
Houston Police Department-HPD,
Houston Police Department-HPD,66614.0
Housing and Community Devp.,
Houston Police Department-HPD,
Public Works & Engineering-PWE,32635.0
Public Works & Engineering-PWE,51584.0
Houston Fire Department (HFD),43528.0
Houston Fire Department (HFD),28024.0
Houston Airport System (HAS),43826.0
Houston Police Department-HPD,66614.0


Notice how there are several repeated departments in the index. When we attempt to create
a new column, an error is raised alerting us that there are duplicates. At least one index label
in the employee DataFrame is joining with two or more index labels from random_salary

In [85]:
employee['RANDOM_SALARY'] = random_salary["BASE_SALARY"]

ValueError: cannot reindex from a duplicate axis

During alignment, if there is nothing for the DataFrame index to align to, the resulting value
will be missing. Let's create an example where this happens. We will use only the first three
rows of the max_dept_sal Series to create a new column:

In [87]:
(
    employee
    .set_index('DEPARTMENT')
    .assign(
        MAX_SALARY2=max_dept_sal['BASE_SALARY'].head(3)
    )
    .MAX_SALARY2
    .value_counts(dropna=False)
)

NaN         1298
45279.0      638
26125.0       36
121862.0      28
Name: MAX_SALARY2, dtype: int64

My preference is to use the following code rather than the code in step 7. This code uses the
.groupby method combined with the .transform method, which is discussed in a later
chapter. This code reads much cleaner to me. It is shorter and does not mess with reassigning
the index:

In [88]:
max_sal = (
    employee
    .groupby('DEPARTMENT')
    .BASE_SALARY
    .transform("max")
)

In [89]:
max_sal

0       121862.0
1       107763.0
2       199596.0
3       210588.0
4        89194.0
          ...   
1995    199596.0
1996    210588.0
1997    199596.0
1998    199596.0
1999    210588.0
Name: BASE_SALARY, Length: 2000, dtype: float64

In [90]:
(employee.assign(MAX_DEPT_SALARY=max_sal))

Unnamed: 0,UNIQUE_ID,POSITION_TITLE,...,JOB_DATE,MAX_DEPT_SALARY
0,0,ASSISTANT DIRECTOR (EX LVL),...,2012-10-13,121862.0
1,1,LIBRARY ASSISTANT,...,2010-09-18,107763.0
2,2,POLICE OFFICER,...,2015-02-03,199596.0
3,3,ENGINEER/OPERATOR,...,1991-05-25,210588.0
4,4,ELECTRICIAN,...,1994-10-22,89194.0
...,...,...,...,...,...
1995,1995,POLICE OFFICER,...,2015-06-09,199596.0
1996,1996,COMMUNICATIONS CAPTAIN,...,2013-10-06,210588.0
1997,1997,POLICE OFFICER,...,2015-10-13,199596.0
1998,1998,POLICE OFFICER,...,2011-07-02,199596.0


This works because .transform preserves the original index. If you did a .groupby that
creates a new index, you can use the .merge method to combine the data. We just need to
tell it to merge on DEPARTMENT for the left side and the index for the right side:

In [92]:
max_sal = (
    employee
    .groupby('DEPARTMENT')
    .BASE_SALARY
    .max()
)

In [93]:
(
    employee.merge(
        max_sal.rename("MAX_DEPT_SALARY"), 
        how='left', 
        left_on="DEPARTMENT",
        right_index=True
    )
)

Unnamed: 0,UNIQUE_ID,POSITION_TITLE,...,JOB_DATE,MAX_DEPT_SALARY
0,0,ASSISTANT DIRECTOR (EX LVL),...,2012-10-13,121862.0
1,1,LIBRARY ASSISTANT,...,2010-09-18,107763.0
2,2,POLICE OFFICER,...,2015-02-03,199596.0
3,3,ENGINEER/OPERATOR,...,1991-05-25,210588.0
4,4,ELECTRICIAN,...,1994-10-22,89194.0
...,...,...,...,...,...
1995,1995,POLICE OFFICER,...,2015-06-09,199596.0
1996,1996,COMMUNICATIONS CAPTAIN,...,2013-10-06,210588.0
1997,1997,POLICE OFFICER,...,2015-10-13,199596.0
1998,1998,POLICE OFFICER,...,2011-07-02,199596.0


### Highlighting the maximum value from each column

In [94]:
college = pd.read_csv('C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/college.csv', index_col='INSTNM')

In [95]:
college.dtypes

CITY                   object
STABBR                 object
HBCU                  float64
MENONLY               float64
WOMENONLY             float64
                       ...   
PCTPELL               float64
PCTFLOAN              float64
UG25ABV               float64
MD_EARN_WNE_P10        object
GRAD_DEBT_MDN_SUPP     object
Length: 26, dtype: object

All the other columns besides CITY and STABBR appear to be numeric. Examining
the data types from the preceding step reveals unexpectedly that the MD_EARN_WNE_
P10 and GRAD_DEBT_MDN_SUPP columns are of the object type and not numeric.
To help get a better idea of what kinds of values are in these columns, let's examine
a sample from them:

In [96]:
college.MD_EARN_WNE_P10.sample(10, random_state=42)

INSTNM
Career Point College                                            20700
Ner Israel Rabbinical College                       PrivacySuppressed
Reflections Academy of Beauty                                     NaN
Capital Area Technical College                                  26400
West Virginia University Institute of Technology                43400
Mid-State Technical College                                     32000
Strayer University-Huntsville Campus                            49200
National Aviation Academy of Tampa Bay                          45000
University of California-Santa Cruz                             43000
Lexington Theological Seminary                                    NaN
Name: MD_EARN_WNE_P10, dtype: object

In [97]:
college.GRAD_DEBT_MDN_SUPP.sample(10, random_state=42)

INSTNM
Career Point College                                            14977
Ner Israel Rabbinical College                       PrivacySuppressed
Reflections Academy of Beauty                       PrivacySuppressed
Capital Area Technical College                      PrivacySuppressed
West Virginia University Institute of Technology                23969
Mid-State Technical College                                      8025
Strayer University-Huntsville Campus                          36173.5
National Aviation Academy of Tampa Bay                          22778
University of California-Santa Cruz                             19884
Lexington Theological Seminary                      PrivacySuppressed
Name: GRAD_DEBT_MDN_SUPP, dtype: object

These values are strings, but we would like them to be numeric. I like to use the
.value_counts method in this case to see whether it reveals any characters that
forced the column to be non-numeric:

In [101]:
college.MD_EARN_WNE_P10.value_counts()

PrivacySuppressed    822
38800                151
21500                 97
49200                 78
27400                 46
                    ... 
84000                  1
66900                  1
52800                  1
67800                  1
186500                 1
Name: MD_EARN_WNE_P10, Length: 598, dtype: int64

In [102]:
set(college.MD_EARN_WNE_P10.apply(type))

{float, str}

The culprit appears to be that some schools have privacy concerns about these two
columns of data. To force these columns to be numeric, use the pandas function to_
numeric. If we use the errors='coerce' parameter, it will convert those values
to NaN:

In [103]:
cols = ['MD_EARN_WNE_P10', 'GRAD_DEBT_MDN_SUPP']

In [104]:
for col in cols:
    college[col] = pd.to_numeric(
        college[col], errors='coerce'
    )

In [106]:
college.dtypes.loc[cols]

MD_EARN_WNE_P10       float64
GRAD_DEBT_MDN_SUPP    float64
dtype: object

Use the .select_dtypes method to filter for only numeric columns. This will
exclude STABBR and CITY columns, where a maximum value doesn't make sense
with this problem:

In [108]:
college_n = college.select_dtypes(np.number)

In [109]:
college_n.head()

Unnamed: 0_level_0,HBCU,MENONLY,...,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Alabama A & M University,1.0,0.0,...,30300.0,33888.0
University of Alabama at Birmingham,0.0,0.0,...,39700.0,21941.5
Amridge University,0.0,0.0,...,40100.0,23370.0
University of Alabama in Huntsville,0.0,0.0,...,45500.0,24097.0
Alabama State University,1.0,0.0,...,26600.0,33118.5


Several columns have binary only (0 or 1) values that will not provide useful
information for maximum values. To find these columns, we can create a Boolean
Series and find all the columns that have two unique values with the .nunique
method:

In [110]:
binary_only = college_n.nunique() == 2

In [111]:
binary_only.head()

HBCU          True
MENONLY       True
WOMENONLY     True
RELAFFIL      True
SATVRMID     False
dtype: bool

In [114]:
# Use the Boolean array to create 
# a list of binary columns
binary_cols = binary_only[binary_only].index

In [115]:
binary_cols

Index(['HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL', 'DISTANCEONLY', 'CURROPER'], dtype='object')

In [116]:
# Since we are looking for the maximum values, we can
# drop the binary columns using the .drop method:
college_n2 = college_n.drop(columns=binary_cols)

In [117]:
college_n2.head()

Unnamed: 0_level_0,SATVRMID,SATMTMID,...,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Alabama A & M University,424.0,420.0,...,30300.0,33888.0
University of Alabama at Birmingham,570.0,565.0,...,39700.0,21941.5
Amridge University,,,...,40100.0,23370.0
University of Alabama in Huntsville,595.0,590.0,...,45500.0,24097.0
Alabama State University,425.0,430.0,...,26600.0,33118.5


In [118]:
# Now we can use the .idxmax method to find the index
# label of the maximum value for each column:
max_cols = college_n2.idxmax()

In [119]:
max_cols

SATVRMID                      California Institute of Technology
SATMTMID                      California Institute of Technology
UGDS                               University of Phoenix-Arizona
UGDS_WHITE                Mr Leon's School of Hair Design-Moscow
UGDS_BLACK                    Velvatex College of Beauty Culture
                                         ...                    
PCTPELL                                 MTI Business College Inc
PCTFLOAN                                  ABC Beauty College Inc
UG25ABV                           Dongguk University-Los Angeles
MD_EARN_WNE_P10                     Medical College of Wisconsin
GRAD_DEBT_MDN_SUPP    Southwest University of Visual Arts-Tucson
Length: 18, dtype: object

In [120]:
# Call the .unique method on the max_cols Series. 
# This returns an ndarray of the index values in
# college_n2 that has the maximum values:
unique_max_cols = max_cols.unique()

In [121]:
unique_max_cols

array(['California Institute of Technology',
       'University of Phoenix-Arizona',
       "Mr Leon's School of Hair Design-Moscow",
       'Velvatex College of Beauty Culture',
       'Thunderbird School of Global Management',
       'Cosmopolitan Beauty and Tech School',
       'Haskell Indian Nations University', 'Palau Community College',
       'LIU Brentwood',
       'California University of Management and Sciences',
       'Le Cordon Bleu College of Culinary Arts-San Francisco',
       'MTI Business College Inc', 'ABC Beauty College Inc',
       'Dongguk University-Los Angeles', 'Medical College of Wisconsin',
       'Southwest University of Visual Arts-Tucson'], dtype=object)

Use the values of max_cols to select only those rows that have schools with a
maximum value and then use the .style attribute to highlight these values:

In [123]:
college_n2.loc[unique_max_cols].style.highlight_max()

Unnamed: 0_level_0,SATVRMID,SATMTMID,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
California Institute of Technology,765.0,785.0,983.0,0.2787,0.0153,0.1221,0.4385,0.001,0.0,0.057,0.0875,0.0,0.0,0.1126,0.2303,0.0082,77800.0,11812.5
University of Phoenix-Arizona,,,151558.0,0.3098,0.1555,0.076,0.0082,0.0042,0.005,0.1131,0.0131,0.3152,0.0,0.6009,0.592,,,33000.0
Mr Leon's School of Hair Design-Moscow,,,16.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.625,0.625,0.2,,15710.0
Velvatex College of Beauty Culture,,,25.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.7692,0.0,0.52,,
Thunderbird School of Global Management,,,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,118900.0,
Cosmopolitan Beauty and Tech School,,,110.0,0.0091,0.0,0.0182,0.9727,0.0,0.0,0.0,0.0,0.0,0.3182,0.7761,0.1244,0.9545,,
Haskell Indian Nations University,430.0,440.0,805.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0224,0.8396,0.0,0.2089,22800.0,
Palau Community College,,,602.0,0.0,0.0017,0.0,0.0,0.0,0.9983,0.0,0.0,0.0,0.3887,0.856,0.0,0.2616,24700.0,
LIU Brentwood,,,15.0,0.0,0.1333,0.2667,0.0,0.0,0.0,0.5333,0.0,0.0667,0.4,0.5652,0.7826,0.7826,44600.0,25499.0
California University of Management and Sciences,,,98.0,0.0102,0.0204,0.0,0.0408,0.0,0.0,0.0,0.9286,0.0,0.0,0.0926,0.0556,0.6852,,


In [124]:
# Refractor the code to make it easier to read:
def remove_binary_cols(df):
    binary_only = df.nunique() == 2
    unique = max_cols.unique()
    return df.drop(columns=cols)

def select_row_with_max_cols(df):
    max_cols = df.idxmax()
    unique = max_cols.unique()
    return df.loc[unique]



In [132]:
(
    college
    .assign(
        MD_EARN_WNE_P10=pd.to_numeric(
            college.MD_EARN_WNE_P10, errors='coerce'
        ), 
        GRAD_DEBT_MDN_SUPP=pd.to_numeric(
            college.GRAD_DEBT_MDN_SUPP, errors='coerce'
        ),
    )
        .select_dtypes(np.number)
        .pipe(remove_binary_cols)
        .pipe(select_row_with_max_cols)
    
)

Unnamed: 0_level_0,HBCU,MENONLY,...,PCTFLOAN,UG25ABV
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Alabama A & M University,1.0,0.0,...,0.8284,0.1049
Yeshiva Ohr Elchonon Chabad West Coast Talmudical Seminary,0.0,1.0,...,0.0000,0.0000
Judson College,0.0,0.0,...,0.7205,0.2622
Amridge University,0.0,0.0,...,0.7795,0.8540
California Institute of Technology,0.0,0.0,...,0.2303,0.0082
...,...,...,...,...,...
California University of Management and Sciences,0.0,0.0,...,0.0556,0.6852
Le Cordon Bleu College of Culinary Arts-San Francisco,0.0,0.0,...,0.5358,0.5440
MTI Business College Inc,0.0,0.0,...,1.0000,0.3986
ABC Beauty College Inc,0.0,0.0,...,1.0000,0.4688


By default, the .highlight_max method highlights the maximum value of each column.
We can use the axis parameter to highlight the maximum value of each row instead. Here,
we select just the race percentage columns of the college dataset and highlight the race with
the highest percentage for each school:

In [133]:
college_ugds = college.filter(like='UGDS_').head()

In [135]:
college_ugds.style.highlight_max(axis='columns')

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
University of Alabama in Huntsville,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
Alabama State University,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


## Replicating idxmax with method chaining

A good exercise is to attempt an implementation of a built-in DataFrame method on your own.
This type of replication can give you a deeper understanding of other pandas methods that
you normally wouldn't have come across. .idxmax is a challenging method to replicate using
only the methods covered thus far in the book.
This recipe slowly chains together basic methods to eventually find all the row index values
that contain a maximum column value.

In [136]:
# Load in the college dataset and execute the same
# operations as the previous recipe to get only the
# numeric columns that are of interest

def remove_binary_cols(df):
    binary_only = df.nunique() == 2
    cols = binary_only[binary_only].index.to_list()
    return df.drop(columns=cols)

In [137]:
college_n = (
    college
    .assign(
        MD_EARN_WNE_P10=pd.to_numeric(
            college.MD_EARN_WNE_P10, errors='coerce'
        ),
        GRAD_DEBT_MDN_SUPP=pd.to_numeric(
            college.GRAD_DEBT_MDN_SUPP, errors='coerce'
        )
    )
    .select_dtypes(np.number)
    .pipe(remove_binary_cols)
)

In [138]:
# find the maximum of each column with the .max method
college_n.max().head()

SATVRMID         765.0
SATMTMID         785.0
UGDS          151558.0
UGDS_WHITE         1.0
UGDS_BLACK         1.0
dtype: float64

Use the .eq DataFrame method to test each value against the column .max method.
By default, the .eq method aligns the columns of the column DataFrame with the
labels of the passed Series index:

In [139]:
college_n.eq(college_n.max()).head()

Unnamed: 0_level_0,SATVRMID,SATMTMID,...,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Alabama A & M University,False,False,...,False,False
University of Alabama at Birmingham,False,False,...,False,False
Amridge University,False,False,...,False,False
University of Alabama in Huntsville,False,False,...,False,False
Alabama State University,False,False,...,False,False


All the rows in this DataFrame that have at least one True value must contain
a column maximum. Let's use the .any method to find all such rows that have
at least one True value:

In [140]:
has_row_max = (
    college_n
    .eq(college_n.max())
    .any(axis='columns')
)

In [141]:
has_row_max

INSTNM
Alabama A & M University                                  False
University of Alabama at Birmingham                       False
Amridge University                                        False
University of Alabama in Huntsville                       False
Alabama State University                                  False
                                                          ...  
SAE Institute of Technology  San Francisco                False
Rasmussen College - Overland Park                         False
National Personal Training Institute of Cleveland         False
Bay Area Medical Academy - San Jose Satellite Location    False
Excel Learning Center-San Antonio South                   False
Length: 7535, dtype: bool

In [142]:
# There are only 18 columns, which means that there 
# should only be at most 18 True values in has_row_max.
# Let's find out how many there are:
college_n.shape

(7535, 18)

In [143]:
has_row_max.sum()

401

This was a bit unexpected, but it turns out that there are columns with many rows
that equal the maximum value. This is common with many of the percentage columns
that have a maximum of 1. .idxmax returns the first occurrence of the maximum
value. Let's back up a bit, remove the .any method, and look at the output from
step 3. Let's run the .cumsum method instead to accumulate all the True values:

In [144]:
college_n.eq(college_n.max()).cumsum()

Unnamed: 0_level_0,SATVRMID,SATMTMID,...,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Alabama A & M University,0,0,...,0,0
University of Alabama at Birmingham,0,0,...,0,0
Amridge University,0,0,...,0,0
University of Alabama in Huntsville,0,0,...,0,0
Alabama State University,0,0,...,0,0
...,...,...,...,...,...
SAE Institute of Technology San Francisco,1,1,...,1,2
Rasmussen College - Overland Park,1,1,...,1,2
National Personal Training Institute of Cleveland,1,1,...,1,2
Bay Area Medical Academy - San Jose Satellite Location,1,1,...,1,2


Some columns have one unique maximum, like SATVRMID and SATMTMID, while
others like UGDS_WHITE have many. 109 schools have 100% of their undergraduates
as White. If we chain the .cumsum method one more time, the value 1 would only
appear once in each column and it would be the first occurrence of the maximum:

In [145]:
college_n.eq(college_n.max()).cumsum().cumsum()

Unnamed: 0_level_0,SATVRMID,SATMTMID,...,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Alabama A & M University,0,0,...,0,0
University of Alabama at Birmingham,0,0,...,0,0
Amridge University,0,0,...,0,0
University of Alabama in Huntsville,0,0,...,0,0
Alabama State University,0,0,...,0,0
...,...,...,...,...,...
SAE Institute of Technology San Francisco,7305,7305,...,3445,10266
Rasmussen College - Overland Park,7306,7306,...,3446,10268
National Personal Training Institute of Cleveland,7307,7307,...,3447,10270
Bay Area Medical Academy - San Jose Satellite Location,7308,7308,...,3448,10272


In [146]:
# We can now test the equality of each value against 1 
# with the .eq method and then use the .any method to
# find rows that have at least one True value:

has_row_max2 = (
    college_n.eq(college_n.max())
    .cumsum()
    .cumsum()
    .eq(1)
    .any(axis='columns')
)

In [148]:
has_row_max2.head()

INSTNM
Alabama A & M University               False
University of Alabama at Birmingham    False
Amridge University                     False
University of Alabama in Huntsville    False
Alabama State University               False
dtype: bool

In [149]:
# Check that has_row_max2 has no more True values
# than the number of columns:
has_row_max2.sum()

16

In [150]:
# We need all the institutions where has_row_max2 is 
# True. We can use Boolean indexing on the Series itself
idxmax_cols = has_row_max2[has_row_max2].index

In [151]:
idxmax_cols

Index(['Thunderbird School of Global Management',
       'Southwest University of Visual Arts-Tucson', 'ABC Beauty College Inc',
       'Velvatex College of Beauty Culture',
       'California Institute of Technology',
       'Le Cordon Bleu College of Culinary Arts-San Francisco',
       'MTI Business College Inc', 'Dongguk University-Los Angeles',
       'Mr Leon's School of Hair Design-Moscow',
       'Haskell Indian Nations University', 'LIU Brentwood',
       'Medical College of Wisconsin', 'Palau Community College',
       'California University of Management and Sciences',
       'Cosmopolitan Beauty and Tech School', 'University of Phoenix-Arizona'],
      dtype='object', name='INSTNM')

All 16 of these institutions are the index of the first maximum occurrence for at least
one of the columns. We can check whether they are the same as the ones found with
the .idxmax method:

In [152]:
set(college_n.idxmax().unique()) == set(idxmax_cols)

True

## Finding the most common maximum of columns

In [4]:
# Read in the college dataset and select just those
# columns with undergraduate race 
# percentage information:
college = pd.read_csv('C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/college.csv', index_col='INSTNM')

In [5]:
college_ugds = college.filter(like="UGDS_")

In [6]:
college_ugds.head()

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,...,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Alabama A & M University,0.0333,0.9353,...,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,...,0.0179,0.01
Amridge University,0.299,0.4192,...,0.0,0.2715
University of Alabama in Huntsville,0.6988,0.1255,...,0.0332,0.035
Alabama State University,0.0158,0.9208,...,0.0243,0.0137


In [7]:
# Use the .idxmax method applied against the column
# axis to get the college name with the highest race
# percentage for each row:
highest_percentage_race = college_ugds.idxmax(
    axis='columns'
)

In [8]:
highest_percentage_race.head()

INSTNM
Alabama A & M University               UGDS_BLACK
University of Alabama at Birmingham    UGDS_WHITE
Amridge University                     UGDS_BLACK
University of Alabama in Huntsville    UGDS_WHITE
Alabama State University               UGDS_BLACK
dtype: object

In [9]:
# Use the .value_counts method to return the
# distribution of maximum occurrences.
# Add the normalize=True parameter so that it sums 
# to 1:
highest_percentage_race.value_counts(normalize=True)

UGDS_WHITE    0.670352
UGDS_BLACK    0.151586
UGDS_HISP     0.129473
UGDS_UNKN     0.023422
UGDS_ASIAN    0.012074
UGDS_AIAN     0.006110
UGDS_NRA      0.004073
UGDS_NHPI     0.001746
UGDS_2MOR     0.001164
dtype: float64

In [10]:
(
    college_ugds
    [highest_percentage_race == 'UGDS_BLACK']
    .drop(columns="UGDS_BLACK")
    .idxmax(axis='columns')
    .value_counts(normalize=True)
)

UGDS_WHITE    0.661228
UGDS_HISP     0.230326
UGDS_UNKN     0.071977
UGDS_NRA      0.018234
UGDS_ASIAN    0.009597
UGDS_2MOR     0.006718
UGDS_AIAN     0.000960
UGDS_NHPI     0.000960
dtype: float64