# Chapter 6 : Index Alignment

## Recipes
* [6.1 Examining the Index object](#6.1-Examining-the-Index-object)
* [6.2 Producing Cartesian product](#6.2-Producing-Cartesian-product)
* [6.3 Exploding Indexes](#6.3-Exploding-Indexes)
* [6.4 Filling values with unequal indexes](#6.4-Filling-values-with-unequal-indexes)
* [6.5 Appending columns from different DataFrames](#6.5-Appending-columns-from-different-DataFrames)
* [6.6 Highlighting maximum value from each column](#6.6-Highlighting-maximum-value-from-each-column)
* [6.7 Replicating idxmax with method chaining](#6.7-Replicating-idxmax-with-method-chaining)
* [6.8 Finding the most common maximum](#6.8-Finding-the-most-common-maximum)

In [1]:
import pandas as pd
import numpy as np

# 6.1 Examining the Index object

In [2]:
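### [Tech] The values of an Index object are stored in a NumPy ndarray, so an Index supports many of the same operations:
###     indexing, slicing, fancy indexing, aggregation methods, and arithmetic and comparison operations.
###     However, an Index is immutable, so its values cannot be changed in place.
### [Goal] Explore the properties of the Index object through the column Index of college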

## >> How it works...

In [3]:
# 6.1.1 Load college from college.csv and assign its column index to the variable columns.
college = pd.read_csv('data/college.csv')
columns = college.columns
columns  # the columns object is an Index

Index(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL',
       'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS', 'UGDS_WHITE',
       'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN', 'UGDS_NHPI',
       'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF', 'CURROPER', 'PCTPELL',
       'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10', 'GRAD_DEBT_MDN_SUPP'],
      dtype='object')

In [4]:
# 6.1.2 The values attribute of the column Index holds the column names as an ndarray.
columns.values

array(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY',
       'RELAFFIL', 'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS',
       'UGDS_WHITE', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN',
       'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF',
       'CURROPER', 'PCTPELL', 'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10',
       'GRAD_DEBT_MDN_SUPP'], dtype=object)

In [5]:
# 6.1.3 Scalar indexing, fancy indexing, and slicing all work on an Index object.
columns[5]

'WOMENONLY'

In [6]:
columns[[1,5, 8,10]]

Index(['CITY', 'WOMENONLY', 'SATMTMID', 'UGDS'], dtype='object')

In [7]:
columns[-7:-4]

Index(['PPTUG_EF', 'CURROPER', 'PCTPELL'], dtype='object')

In [8]:
# 6.1.4 Methods for the minimum, maximum, and missing-value checks also work on an Index object.
columns.min(), columns.max(), columns.isnull().sum()

('CITY', 'WOMENONLY', 0)

In [9]:
# 6.1.5 Arithmetic and comparison operators also work on an Index object (for string labels, + concatenates).
columns + '_A'

Index(['INSTNM_A', 'CITY_A', 'STABBR_A', 'HBCU_A', 'MENONLY_A', 'WOMENONLY_A',
       'RELAFFIL_A', 'SATVRMID_A', 'SATMTMID_A', 'DISTANCEONLY_A', 'UGDS_A',
       'UGDS_WHITE_A', 'UGDS_BLACK_A', 'UGDS_HISP_A', 'UGDS_ASIAN_A',
       'UGDS_AIAN_A', 'UGDS_NHPI_A', 'UGDS_2MOR_A', 'UGDS_NRA_A',
       'UGDS_UNKN_A', 'PPTUG_EF_A', 'CURROPER_A', 'PCTPELL_A', 'PCTFLOAN_A',
       'UG25ABV_A', 'MD_EARN_WNE_P10_A', 'GRAD_DEBT_MDN_SUPP_A'],
      dtype='object')

In [10]:
columns > 'G'

array([ True, False,  True,  True,  True,  True,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True])

In [11]:
# 6.1.6 However, an Index object is immutable, so its values cannot be changed.
columns[1] = 'city'

TypeError: Index does not support mutable operations
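
Because an Index cannot be modified in place, the usual workaround is to build a new one. The following is a minimal sketch on a toy Index (the three labels are borrowed from the college columns only for illustration); it is one possible approach, not the only one.

```python
import pandas as pd

# Toy Index standing in for college.columns (illustrative subset of labels).
idx = pd.Index(['INSTNM', 'CITY', 'STABBR'])

# In-place assignment is rejected because an Index is immutable.
try:
    idx[1] = 'city'
    mutated = True
except TypeError:
    mutated = False

# Option 1: round-trip through a list and build a new Index.
labels = idx.tolist()
labels[1] = 'city'
new_idx = pd.Index(labels)

# Option 2: on a DataFrame, rename returns a new object with new labels.
df = pd.DataFrame([[1, 2, 3]], columns=idx)
renamed = df.rename(columns={'CITY': 'city'})
```

In both options the original `idx` is left untouched; only a new object carries the changed label.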

## >> There's more 6.1

In [12]:
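# Index lookups are backed by a hash table, which makes row and column selection fast,
# particularly when the index has no duplicate values.
# As with dictionary keys, hashing requires immutable labels (strings, integers, tuples).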

In [13]:
# When an Index consists of unique values, it supports set operations.
c1 = columns[:4]
c1

Index(['INSTNM', 'CITY', 'STABBR', 'HBCU'], dtype='object')

In [14]:
c2 = columns[2:5]
c2

Index(['STABBR', 'HBCU', 'MENONLY'], dtype='object')

In [15]:
c1.union(c2)   # older pandas versions also accepted c1 | c2 for this

Index(['CITY', 'HBCU', 'INSTNM', 'MENONLY', 'STABBR'], dtype='object')

In [16]:
c1.symmetric_difference(c2)

Index(['CITY', 'INSTNM', 'MENONLY'], dtype='object')
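
The remaining set operations follow the same pattern. A short sketch, rebuilding `c1` and `c2` by hand so that it runs without the CSV file; the named methods are used rather than operators because, as far as I know, the meaning of operators such as `|` on an Index has changed between pandas versions.

```python
import pandas as pd

# Same labels as c1 and c2 above, typed in directly.
c1 = pd.Index(['INSTNM', 'CITY', 'STABBR', 'HBCU'])
c2 = pd.Index(['STABBR', 'HBCU', 'MENONLY'])

common = c1.intersection(c2)   # labels present in both
only_c1 = c1.difference(c2)    # labels in c1 but not in c2
only_c2 = c2.difference(c1)    # labels in c2 but not in c1
```

The union of `only_c1` and `only_c2` is exactly what `symmetric_difference` returns.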

# 6.2  Producing Cartesian product

In [17]:
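### [Tech] When the index labels of two objects differ in count or order, arithmetic between DataFrame and DataFrame,
###       Series and DataFrame, or Series and Series produces a Cartesian product within each shared index label.
### [Goal] See how arithmetic behaves between Series whose indexes only partly match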

## >> How it works...

In [18]:
# 6.2.1 (the values differ from the book to make the result easier to follow)
s1 = pd.Series(index=list('aaabcdde'), data=np.arange(10,90,10))
s1

a    10
a    20
a    30
b    40
c    50
d    60
d    70
e    80
dtype: int32

In [19]:
s2 = pd.Series(index=list('aabbcdd'), data=np.arange(1,8))
s2

a    1
a    2
b    3
b    4
c    5
d    6
d    7
dtype: int32

In [20]:
# 6.2.2 In an operation between two objects with different indexes,
#        a Cartesian product is formed separately for each shared label.
# a : 3 in s1, 2 in s2 -> 3x2 = 6 results
# b : 1 in s1, 2 in s2 -> 1x2 = 2 results
# c : 1 in s1, 1 in s2 -> 1x1 = 1 result
# d : 2 in s1, 2 in s2 -> 2x2 = 4 results
# e : 1 in s1, 0 in s2 -> no partner to operate on, so the result is NaN
s1+s2

a    11.0
a    12.0
a    21.0
a    22.0
a    31.0
a    32.0
b    43.0
b    44.0
c    55.0
d    66.0
d    67.0
d    76.0
d    77.0
e     NaN
dtype: float64
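
The row count of the result can be predicted from the label counts alone. The sketch below is my own illustration of the counting rule described in the comments (the helper names `c1`, `c2`, and `rows` are not from the book): labels on both sides contribute the product of their counts, and labels on one side only contribute their own count as NaN rows.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(index=list('aaabcdde'), data=np.arange(10, 90, 10))
s2 = pd.Series(index=list('aabbcdd'), data=np.arange(1, 8))

# Count how often each label occurs on each side (0 if absent).
labels = s1.index.union(s2.index).unique()
c1 = s1.index.value_counts().reindex(labels, fill_value=0)
c2 = s2.index.value_counts().reindex(labels, fill_value=0)

# Shared labels: c1 * c2 rows. One-sided labels: c1 + c2 (one of them is 0).
rows = (c1 * c2).where((c1 > 0) & (c2 > 0), c1 + c2)
predicted = int(rows.sum())
actual = len(s1 + s2)
```

Here `predicted` and `actual` agree at 14 rows: 6 + 2 + 1 + 4 for `a` through `d`, plus one NaN row for `e`.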

## >> There's more 6.2

In [21]:
# When both objects have an identical index (same labels in the same order), values are matched one-to-one.
s1 = pd.Series(index=list('aaabb'), data=np.arange(10,60,10))
s2 = pd.Series(index=list('aaabb'), data=np.arange(5))
s1 + s2

a    10
a    21
a    32
b    43
b    54
dtype: int32

In [22]:
# Even if each label appears the same number of times, a different order triggers a Cartesian product per label.
s1 = pd.Series(index=list('aaabb'), data=np.arange(10,60,10))
s2 = pd.Series(index=list('bbaaa'), data=np.arange(5))
s1 + s2

a    12
a    13
a    14
a    22
a    23
a    24
a    32
a    33
a    34
b    40
b    41
b    50
b    51
dtype: int32

# 6.3 Exploding Indexes

In [23]:
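### [Tech] In an operation between objects whose indexes contain duplicate (non-unique) values,
###        the Cartesian product can make the number of index entries explode.
### [Goal] Use RACE (which has duplicate values) as the index of employee and examine
### the problems that can arise in arithmetic on it.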

## >> How it works...

In [24]:
# 6.3.1 Read employee from employee.csv with RACE as the index.
employee = pd.read_csv('data/employee.csv', index_col='RACE')
employee.head()

RACE,UNIQUE_ID,POSITION_TITLE,DEPARTMENT,BASE_SALARY,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE
Hispanic/Latino,0,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Full Time,Female,Active,2006-06-12,2012-10-13
Hispanic/Latino,1,LIBRARY ASSISTANT,Library,26125.0,Full Time,Female,Active,2000-07-19,2010-09-18
White,2,POLICE OFFICER,Houston Police Department-HPD,45279.0,Full Time,Male,Active,2015-02-03,2015-02-03
White,3,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,Full Time,Male,Active,1982-02-08,1991-05-25
White,4,ELECTRICIAN,General Services Department,56347.0,Full Time,Male,Active,1989-06-19,1994-10-22


In [25]:
# 6.3.2 Create two Series of base salaries. Note that simply assigning employee['BASE_SALARY']
#       makes both names refer to the same underlying Series (in the pandas version used here).
salary1 = employee['BASE_SALARY']
salary2 = employee['BASE_SALARY']
salary1 is salary2     # the two variables are the same object

True

In [26]:
# 6.3.3 Use the .copy() method to create the two salary Series as independent copies.
salary1 = employee['BASE_SALARY'].copy()
salary2 = employee['BASE_SALARY'].copy()
salary1 is salary2    # now they are different objects

False

In [27]:
# 6.3.4 Sort only salary1 so that its Index order differs from that of salary2.
salary1 = salary1.sort_index()
salary1.head()

RACE
American Indian or Alaskan Native    78355.0
American Indian or Alaskan Native    26125.0
American Indian or Alaskan Native    98536.0
American Indian or Alaskan Native        NaN
American Indian or Alaskan Native    55461.0
Name: BASE_SALARY, dtype: float64

In [28]:
salary2.head()

RACE
Hispanic/Latino    121862.0
Hispanic/Latino     26125.0
White               45279.0
White               63166.0
White               56347.0
Name: BASE_SALARY, dtype: float64

In [29]:
# 6.3.5 salary1 + salary2 operates on two indexes that contain duplicates and differ in order;
#      the result is stored in salary_add1.
salary_add1 = salary1 + salary2
salary_add1.head()

RACE
American Indian or Alaskan Native    138702.0
American Indian or Alaskan Native    156710.0
American Indian or Alaskan Native    176891.0
American Indian or Alaskan Native    159594.0
American Indian or Alaskan Native    127734.0
Name: BASE_SALARY, dtype: float64

In [30]:
# 6.3.6 For comparison with an operation on identical indexes,
#         salary_add2 is created by adding salary1 to itself.
salary_add2 = salary1 + salary1
# Get the lengths of salary1, salary2, salary_add1, and salary_add2.
# salary_add1 has turned into a Series with more than one million index entries.
len(salary1), len(salary2), len(salary_add1), len(salary_add2)

(2000, 2000, 1175424, 2000)

## >> There's more 6.3

In [31]:
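# Work out the number of index entries in salary_add1.
# Each index label forms a Cartesian product with its own occurrences on the other side,
# so the total is the sum of the squared duplicate counts of each label.

# In the example below, the labels of salary1 occur 700, 665, 480, ... times by race,
# so 700*700 + 665*665 + ... + 2*2 = 1,175,424.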

index_vc = salary1.index.value_counts(dropna = False)
index_vc

Black or African American            700
White                                665
Hispanic/Latino                      480
Asian/Pacific Islander               107
NaN                                   35
American Indian or Alaskan Native     11
Others                                 2
Name: RACE, dtype: int64

In [32]:
index_vc.pow(2).sum()

1175424
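
Before adding two Series, it can be worth checking whether the result is at risk of exploding. The sketch below uses a three-row toy Series (made-up values, not the employee data) and flags the risky case: duplicate labels combined with indexes that are not identical.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'x', 'y'])
b = a.sort_index(ascending=False)   # same labels, different order

# Duplicates alone are harmless; duplicates plus a mismatched index are not.
risky = (not a.index.is_unique) and (not a.index.equals(b.index))

same_index = a + a      # identical indexes: one-to-one, 3 rows
exploded = a + b        # mismatched indexes: Cartesian product per label

# Sum of squared label counts, as in the recipe: 2*2 + 1*1 = 5.
expected_len = int(a.index.value_counts().pow(2).sum())
```

`Index.equals` compares labels position by position, which is why sorting one side is enough to change the outcome.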

# 6.4 Filling values with unequal indexes

In [33]:
### [Tech] In an operation between DataFrames (or Series) with different indexes, missing values are handled with fill_value.
### [Goal] Combine the baseball player files (DataFrames) from different years

## >> How it works...

In [34]:
# 6.4.1 Combine the per-player batting records for 2014, 2015, and 2016 for analysis.
# The player ID is used as the index.
baseball_14 = pd.read_csv('data/baseball14.csv', index_col='playerID')
baseball_15 = pd.read_csv('data/baseball15.csv', index_col='playerID')
baseball_16 = pd.read_csv('data/baseball16.csv', index_col='playerID')
baseball_14.head()

playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
altuvjo01,2014,1,HOU,AL,158,660,85,225,47,3,...,59.0,56.0,9.0,36,53.0,7.0,5.0,1.0,5.0,20.0
cartech02,2014,1,HOU,AL,145,507,68,115,21,1,...,88.0,5.0,2.0,56,182.0,6.0,5.0,0.0,4.0,12.0
castrja01,2014,1,HOU,AL,126,465,43,103,21,2,...,56.0,1.0,0.0,34,151.0,1.0,9.0,1.0,3.0,11.0
corpoca01,2014,1,HOU,AL,55,170,22,40,6,0,...,19.0,0.0,0.0,14,37.0,0.0,3.0,1.0,2.0,3.0
dominma01,2014,1,HOU,AL,157,564,51,121,17,0,...,57.0,0.0,1.0,29,125.0,2.0,5.0,2.0,7.0,23.0


In [35]:
# Check the shape of the three files.
# The number of rows differs between years; each file has 21 columns.
baseball_14.shape, baseball_15.shape , baseball_16.shape

((16, 21), (15, 21), (16, 21))

In [36]:
# Check the columns: baseball_14 has unique column names (21 of them).
baseball_14.columns.is_unique

True

In [37]:
# The three tables baseball_14, _15, and _16 have the same columns.
set(baseball_14.columns)== set(baseball_15.columns) == set(baseball_16.columns)

True

In [38]:
# Apart from the team and league identifiers, the columns are numeric batting statistics.
baseball_14.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16 entries, altuvjo01 to villajo01
Data columns (total 21 columns):
yearID    16 non-null int64
stint     16 non-null int64
teamID    16 non-null object
lgID      16 non-null object
G         16 non-null int64
AB        16 non-null int64
R         16 non-null int64
H         16 non-null int64
2B        16 non-null int64
3B        16 non-null int64
HR        16 non-null int64
RBI       16 non-null float64
SB        16 non-null float64
CS        16 non-null float64
BB        16 non-null int64
SO        16 non-null float64
IBB       16 non-null float64
HBP       16 non-null float64
SH        16 non-null float64
SF        16 non-null float64
GIDP      16 non-null float64
dtypes: float64(9), int64(10), object(2)
memory usage: 2.8+ KB


In [39]:
# None of the columns has missing values.
baseball_14.isnull().any().any(),\
baseball_15.isnull().any().any(),\
baseball_16.isnull().any().any()

(False, False, False)

In [40]:
# 6.4.2 The row indexes contain different players from year to year.
baseball_14.index.difference(baseball_15.index)

Index(['corpoca01', 'dominma01', 'fowlede01', 'grossro01', 'guzmaje01',
       'hoeslj01', 'krausma01', 'preslal01', 'singljo02'],
      dtype='object', name='playerID')

In [41]:
baseball_14.index.difference(baseball_16.index)

Index(['cartech02', 'corpoca01', 'dominma01', 'fowlede01', 'grossro01',
       'guzmaje01', 'hoeslj01', 'krausma01', 'preslal01', 'singljo02',
       'villajo01'],
      dtype='object', name='playerID')

In [42]:
# A total of 29 players appear across the three years.
print (baseball_14.index.union(baseball_15.index).union(baseball_16.index))
len(baseball_14.index.union(baseball_15.index).union(baseball_16.index))

Index(['altuvjo01', 'bregmal01', 'cartech02', 'castrja01', 'congeha01',
       'corpoca01', 'correca01', 'dominma01', 'fowlede01', 'gattiev01',
       'gomezca01', 'gonzama01', 'gourryu01', 'grossro01', 'guzmaje01',
       'hoeslj01', 'kempto01', 'krausma01', 'lowrije01', 'marisja01',
       'preslal01', 'rasmuco01', 'reedaj01', 'singljo02', 'springe01',
       'tuckepr01', 'valbulu01', 'villajo01', 'whitety01'],
      dtype='object', name='playerID')


29

In [43]:
# 6.4.3 Create a Series of hits (H) for each year.
hits_14 = baseball_14['H']
hits_15 = baseball_15['H']
hits_16 = baseball_16['H']
hits_14.head()

playerID
altuvjo01    225
cartech02    115
castrja01    103
corpoca01     40
dominma01    121
Name: H, dtype: int64

In [44]:
# 6.4.4 When adding 2014 and 2015, a player present in only one of the two Series gets NaN as the sum.
(hits_14 + hits_15).head()

playerID
altuvjo01    425.0
cartech02    193.0
castrja01    174.0
congeha01      NaN
corpoca01      NaN
Name: H, dtype: float64

In [45]:
# 6.4.5 Using the .add() method with the parameter fill_value=0 solves this problem.
hits_14.add(hits_15, fill_value=0).head()

playerID
altuvjo01    425.0
cartech02    193.0
castrja01    174.0
congeha01     46.0
corpoca01     40.0
Name: H, dtype: float64

In [46]:
# 6.4.6 In the same way, chain another .add() to include the 2016 hits.
hits_total = hits_14.add(hits_15, fill_value=0)\
                    .add(hits_16, fill_value=0)
hits_total.head()

playerID
altuvjo01    641.0
bregmal01     53.0
cartech02    193.0
castrja01    243.0
congeha01     46.0
Name: H, dtype: float64

In [47]:
# 6.4.7 Confirm that hits_total, the summed hits, contains no NaN values.
hits_total.hasnans

False
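
An alternative to chaining `.add()` is to line the Series up side by side and sum across the row. This is a different technique from the recipe, shown here only for comparison; the three toy Series use a handful of values chosen to be consistent with the outputs above, and the `keys` labels are my own.

```python
import pandas as pd

# Toy subsets of the yearly hit totals (unique player IDs on each side).
h14 = pd.Series({'altuvjo01': 225, 'corpoca01': 40}, name='H')
h15 = pd.Series({'altuvjo01': 200, 'congeha01': 46}, name='H')
h16 = pd.Series({'altuvjo01': 216, 'bregmal01': 53}, name='H')

# Recipe approach: chained .add() with fill_value=0.
chained = h14.add(h15, fill_value=0).add(h16, fill_value=0)

# Alternative: outer-join the Series as columns, then sum each row.
# sum() skips NaN by default, which plays the role of fill_value=0.
wide = pd.concat([h14, h15, h16], axis=1, keys=[2014, 2015, 2016])
summed = wide.sum(axis=1)
```

The side-by-side form relies on each index being unique, which holds for the player IDs in these files.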

## >> There's more 6.4

In [48]:
# If a label exists on both sides and both values are NaN,
# the result is still NaN even when fill_value is given.
s = pd.Series(index=['a', 'b', 'c', 'd'], data=[np.nan, 3, np.nan, 1])
s

a    NaN
b    3.0
c    NaN
d    1.0
dtype: float64

In [49]:
s1 = pd.Series(index=['a', 'b', 'c'], data=[np.nan, 6, 10])
s1

a     NaN
b     6.0
c    10.0
dtype: float64

In [50]:
s.add(s1, fill_value=5)

a     NaN
b     9.0
c    15.0
d     6.0
dtype: float64
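
If the remaining NaN values should also be replaced, a `.fillna()` after the addition handles them. A small sketch reusing the two Series above; replacing with 0 is an assumption about what the analysis wants, not a rule.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 3, np.nan, 1], index=['a', 'b', 'c', 'd'])
s1 = pd.Series([np.nan, 6, 10], index=['a', 'b', 'c'])

# fill_value substitutes for a value missing on ONE side only.
total = s.add(s1, fill_value=5)

# 'a' is NaN on both sides, so it survives the addition; fill it afterwards.
cleaned = total.fillna(0)
```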

In [51]:
# Now look at a case where both the column index and the row index differ on purpose.
# Only df_14 has the 'G' column, and only df_15 has the 'HR' column.
df_14 = baseball_14[['G','AB', 'R', 'H']]
df_15 = baseball_15[['AB', 'R', 'H', 'HR']]
display (df_14.head())
display (df_15.head())

playerID,G,AB,R,H
altuvjo01,158,660,85,225
cartech02,145,507,68,115
castrja01,126,465,43,103
corpoca01,55,170,22,40
dominma01,157,564,51,121


playerID,AB,R,H,HR
altuvjo01,638,86,200,15
cartech02,391,50,78,24
castrja01,337,38,71,11
congeha01,201,25,46,11
correca01,387,52,108,22


In [52]:
# Rows (players) and columns (statistics) that appear on only one side all become NaN.
(df_14 + df_15).head(10).style.highlight_null('yellow')

playerID,AB,G,H,HR,R
altuvjo01,1298.0,,425.0,,171.0
cartech02,898.0,,193.0,,118.0
castrja01,802.0,,174.0,,81.0
congeha01,,,,,
corpoca01,,,,,
correca01,,,,,
dominma01,,,,,
fowlede01,,,,,
gattiev01,,,,,
gomezca01,,,,,


In [53]:
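# With the .add method and the fill_value parameter, a value that appears on only one side
#  is computed from that side alone.
# However, cells that cannot be computed at all are still NaN.
#      - For example, a player who appears only in 2014 has no 'HR' value, because 'HR' exists only in the 2015 frame.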
df_14.add(df_15, fill_value=0).head(10).style.highlight_null('yellow')

playerID,AB,G,H,HR,R
altuvjo01,1298,158.0,425,15.0,171
cartech02,898,145.0,193,24.0,118
castrja01,802,126.0,174,11.0,81
congeha01,201,,46,11.0,25
corpoca01,170,55.0,40,,22
correca01,387,,108,22.0,52
dominma01,564,157.0,121,,51
fowlede01,434,116.0,120,,61
gattiev01,566,,139,27.0,66
gomezca01,149,,36,4.0,19


# 6.5 Appending columns from different DataFrames

In [54]:
### [Tech] Implement a lookup keyed on the index.
#         The table being looked up must have unique index values.
### [Goal] Add a column to the employee table holding the maximum salary of each employee's department

## >> How it works...

In [55]:
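# 6.5.1 Read employee from employee.csv.
employee = pd.read_csv('data/employee.csv')

#        To compute the maximum salary per department (the lookup table),
#        create dept_sal, a DataFrame with only the department name and base salary.
#        The technique from recipe 3.4 is used (sort by department and salary, then drop duplicates).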

dept_sal = employee[['DEPARTMENT', 'BASE_SALARY']]


In [56]:
# 6.5.2 Sort in order to find the maximum salary per department:
#        department ascending, base salary descending.
dept_sal = dept_sal.sort_values(['DEPARTMENT', 'BASE_SALARY'], ascending=[True, False])
dept_sal.head(3)

,DEPARTMENT,BASE_SALARY
1494,Admn. & Regulatory Affairs,140416.0
237,Admn. & Regulatory Affairs,130416.0
1679,Admn. & Regulatory Affairs,103776.0


In [57]:
# 6.5.3 Keep only the first row of each department (its maximum base salary) and drop the remaining duplicate rows.
max_dept_sal = dept_sal.drop_duplicates(subset='DEPARTMENT')
max_dept_sal.head()

,DEPARTMENT,BASE_SALARY
1494,Admn. & Regulatory Affairs,140416.0
149,City Controller's Office,64251.0
236,City Council,100000.0
647,Convention and Entertainment,38397.0
1500,Dept of Neighborhoods (DON),89221.0


In [58]:
# 6.5.4 Setting the department as the index of max_dept_sal makes the lookup table ready.
#     The employee table, which receives the new column, also gets DEPARTMENT as its index.
#     They hold 24 departments and 2,000 individual records, respectively.
max_dept_sal = max_dept_sal.set_index('DEPARTMENT')
employee = employee.set_index('DEPARTMENT')
max_dept_sal.shape, employee.shape

((24, 1), (2000, 9))

In [59]:
# 6.5.5 Add max_dept_sal['BASE_SALARY'] to employee as a new column; values are matched on the index.
#     Each department forms an n x 1 Cartesian product, and as long as the lookup table is unique
#     (one value per label), the number of rows in the receiving table does not change.
employee['MAX_DEPT_SALARY'] = max_dept_sal['BASE_SALARY']
employee.shape, employee.head()  # still 2,000 rows; one column was added

((2000, 10),
                                UNIQUE_ID               POSITION_TITLE  \
 DEPARTMENT                                                              
 Municipal Courts Department            0  ASSISTANT DIRECTOR (EX LVL)   
 Library                                1            LIBRARY ASSISTANT   
 Houston Police Department-HPD          2               POLICE OFFICER   
 Houston Fire Department (HFD)          3            ENGINEER/OPERATOR   
 General Services Department            4                  ELECTRICIAN   
 
                                BASE_SALARY             RACE EMPLOYMENT_TYPE  \
 DEPARTMENT                                                                    
 Municipal Courts Department       121862.0  Hispanic/Latino       Full Time   
 Library                            26125.0  Hispanic/Latino       Full Time   
 Houston Police Department-HPD      45279.0            White       Full Time   
 Houston Fire Department (HFD)      63166.0            White       

In [60]:
# 6.5.6 Because MAX_DEPT_SALARY is each department's maximum, BASE_SALARY can never
#      exceed it. The query confirms that no such row exists.
employee.query('BASE_SALARY > MAX_DEPT_SALARY')

DEPARTMENT,UNIQUE_ID,POSITION_TITLE,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE,MAX_DEPT_SALARY
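
The same lookup can be written without moving DEPARTMENT into the index. The sketch below is an alternative to the recipe, not a restatement of it, and uses a made-up three-row table: `Series.map` with a lookup Series, and `groupby(...).transform('max')`, both produce the per-department maximum aligned to every row.

```python
import pandas as pd

# Toy employee table (made-up values, for illustration only).
emp = pd.DataFrame({
    'DEPARTMENT': ['Library', 'Library', 'Parks'],
    'BASE_SALARY': [30000.0, 45000.0, 52000.0],
})

# Lookup Series with a unique index: one maximum per department.
lookup = emp.groupby('DEPARTMENT')['BASE_SALARY'].max()

# Option 1: map each row's department label through the lookup.
emp['MAX_DEPT_SALARY'] = emp['DEPARTMENT'].map(lookup)

# Option 2: transform broadcasts the group maximum back to each row.
emp['MAX_DEPT_SALARY_2'] = emp.groupby('DEPARTMENT')['BASE_SALARY'].transform('max')
```

Either way the row count of `emp` is unchanged, for the same reason as in the recipe: the lookup has exactly one value per department.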


## >> There's more 6.5

In [61]:
# If the lookup table has duplicate index labels (that is, it is not unique),
# assignment by index alignment is not possible.
# Deliberately build random_salary, a lookup table with duplicates, and try to add a new column from it.
np.random.seed(1234)
random_salary = dept_sal.sample(n=10).set_index('DEPARTMENT')
random_salary

DEPARTMENT,BASE_SALARY
Public Works & Engineering-PWE,50586.0
Houston Police Department-HPD,66614.0
Houston Police Department-HPD,66614.0
Housing and Community Devp.,78853.0
Houston Police Department-HPD,66614.0
Parks & Recreation,
Public Works & Engineering-PWE,37211.0
Public Works & Engineering-PWE,54683.0
Human Resources Dept.,58474.0
Health & Human Services,47050.0


In [62]:
# An error is raised saying that an axis with duplicate values cannot be reindexed.
employee['RANDOM_SALARY'] = random_salary['BASE_SALARY']

ValueError: cannot reindex from a duplicate axis
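
One way around this error is to collapse the lookup to one value per label before assigning. A hedged sketch on toy data (labels and numbers are made up); taking the maximum per label is an assumption about which duplicate should win, and the exact error message may differ between pandas versions, so only the exception type is checked.

```python
import pandas as pd

emp = pd.DataFrame(
    {'BASE_SALARY': [10.0, 20.0, 30.0]},
    index=pd.Index(['A', 'A', 'B'], name='DEPARTMENT'),
)

# Lookup with a duplicated label ('A' appears twice).
dup_lookup = pd.Series(
    [5.0, 7.0, 9.0],
    index=pd.Index(['B', 'A', 'A'], name='DEPARTMENT'),
)

# Direct assignment fails because the lookup cannot be reindexed.
try:
    emp['LOOKUP'] = dup_lookup
    failed = False
except ValueError:
    failed = True

# Collapse duplicates to a single value per label, then assign.
unique_lookup = dup_lookup.groupby(level=0).max()
emp['LOOKUP'] = unique_lookup
```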

In [63]:
# If a department exists in the table receiving the new column
#  but has no matching entry in the lookup table, its rows get NaN.

# Deliberately use a lookup table holding only the first 3 rows of max_dept_sal.

employee['MAX_SALARY2'] = max_dept_sal['BASE_SALARY'].head(3)
employee.MAX_SALARY2.value_counts()

140416.0    29
100000.0    11
64251.0      5
Name: MAX_SALARY2, dtype: int64

In [64]:
# Most rows (97.75%) had no matching department and therefore hold NaN.
employee.MAX_SALARY2.isnull().mean()

0.9775

# 6.6 Highlighting maximum value from each column

In [65]:
### [Tech] Use .idxmax() to find the index label of the maximum of each column, then highlight those values.
### [Goal] Highlight the maximum of each column of the college DataFrame.

## >> How it works...

In [66]:
# 6.6.1 Load college from college.csv with INSTNM as the index, and check the dtypes.
college = pd.read_csv('data/college.csv', index_col='INSTNM')
college.dtypes

CITY                   object
STABBR                 object
HBCU                  float64
MENONLY               float64
WOMENONLY             float64
RELAFFIL                int64
SATVRMID              float64
SATMTMID              float64
DISTANCEONLY          float64
UGDS                  float64
UGDS_WHITE            float64
UGDS_BLACK            float64
UGDS_HISP             float64
UGDS_ASIAN            float64
UGDS_AIAN             float64
UGDS_NHPI             float64
UGDS_2MOR             float64
UGDS_NRA              float64
UGDS_UNKN             float64
PPTUG_EF              float64
CURROPER                int64
PCTPELL               float64
PCTFLOAN              float64
UG25ABV               float64
MD_EARN_WNE_P10        object
GRAD_DEBT_MDN_SUPP     object
dtype: object

In [67]:
# 6.6.2 Two object-typed columns, MD_EARN_WNE_P10 and GRAD_DEBT_MDN_SUPP, actually hold numbers.
college.MD_EARN_WNE_P10.iloc[0]

'30300'

In [68]:
college.GRAD_DEBT_MDN_SUPP.iloc[0]

'33888'

In [69]:
# 6.6.3 Some rows hold the string PrivacySuppressed as their value.
college.MD_EARN_WNE_P10.sort_values(ascending = False).head()

INSTNM
Sharon Regional Health System School of Nursing    PrivacySuppressed
Northcoast Medical Training Academy                PrivacySuppressed
Success Schools                                    PrivacySuppressed
Louisiana Culinary Institute                       PrivacySuppressed
Bais Medrash Toras Chesed                          PrivacySuppressed
Name: MD_EARN_WNE_P10, dtype: object

In [70]:
# 6.6.4 Apply pd.to_numeric(..., errors='coerce') to force the conversion; values that cannot be parsed become NaN.
cols = ['MD_EARN_WNE_P10','GRAD_DEBT_MDN_SUPP']
for col in cols : 
    college[col] = pd.to_numeric(college[col], errors='coerce')
college.dtypes.loc[cols]

MD_EARN_WNE_P10       float64
GRAD_DEBT_MDN_SUPP    float64
dtype: object
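
The effect of `errors='coerce'` is easy to see on a three-element toy Series (the values are illustrative). Comparing the missing-value masks before and after also shows how many entries were lost to coercion.

```python
import pandas as pd

raw = pd.Series(['30300', 'PrivacySuppressed', '39700'])

# Unparseable strings become NaN instead of raising an error.
numeric = pd.to_numeric(raw, errors='coerce')

# Entries that were present as text but turned into NaN during conversion.
n_coerced = int((numeric.isna() & raw.notna()).sum())
```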

In [71]:
# 6.6.5 Selecting only the numeric columns reveals many flag (binary) columns holding 1 and 0.
#     These will be identified and removed.
college_n = college.select_dtypes(include=[np.number])
college_n.head() # only numeric columns

INSTNM,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,0.0333,0.9353,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300.0,33888.0
University of Alabama at Birmingham,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,0.5922,0.26,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700.0,21941.5
Amridge University,0.0,0.0,0.0,1,,,1.0,291.0,0.299,0.4192,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100.0,23370.0
University of Alabama in Huntsville,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,0.6988,0.1255,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500.0,24097.0
Alabama State University,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,0.0158,0.9208,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600.0,33118.5


In [72]:
# pd.DataFrame.nunique?

In [73]:
# 6.6.6 Build criteria, a Boolean Series that flags whether each column is binary.
# The nunique() method returns the number of distinct values per column (default axis=0).
criteria = college_n.nunique() == 2
criteria.head()

HBCU          True
MENONLY       True
WOMENONLY     True
RELAFFIL      True
SATVRMID     False
dtype: bool

In [74]:
# 6.6.7 Use criteria to build binary_cols, the list of columns to remove.
binary_cols = college_n.columns[criteria].tolist()
binary_cols

['HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL', 'DISTANCEONLY', 'CURROPER']
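
Note that `nunique() == 2` flags any column with two distinct values, whether or not they are 0 and 1. The sketch below, on a toy frame with made-up column names, shows the boolean-indexing form of the removal and a stricter check that I am adding as an illustration; it is not part of the recipe.

```python
import pandas as pd

df = pd.DataFrame({
    'FLAG': [0, 1, 1, 0],                    # a genuine 0/1 flag
    'TWO_VALS': [5.0, 7.0, 5.0, 7.0],        # two distinct values, not a flag
    'SCORE': [424.0, 570.0, 595.0, 425.0],
})

# Recipe criterion: exactly two distinct values.
criteria = df.nunique() == 2

# Boolean indexing on the columns keeps everything not flagged.
kept = df.loc[:, ~criteria]

# Stricter variant: additionally require every value to be 0, 1, or missing.
strict = criteria & (df.isin([0, 1]) | df.isna()).all()
```

On the college data the two criteria may well agree; the stricter form only matters when a non-flag column happens to have two distinct values.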

In [75]:
# 6.6.8 Drop the columns marked for removal.
# (Boolean indexing with criteria from 6.6.6 would work as well.)
college_n2 = college_n.drop(labels=binary_cols, axis='columns')
college_n2.head()

INSTNM,SATVRMID,SATMTMID,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,424.0,420.0,4206.0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138,0.0656,0.7356,0.8284,0.1049,30300.0,33888.0
University of Alabama at Birmingham,570.0,565.0,11383.0,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01,0.2607,0.346,0.5214,0.2422,39700.0,21941.5
Amridge University,,,291.0,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715,0.4536,0.6801,0.7795,0.854,40100.0,23370.0
University of Alabama in Huntsville,595.0,590.0,5451.0,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035,0.2146,0.3072,0.4596,0.264,45500.0,24097.0
Alabama State University,425.0,430.0,4811.0,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137,0.0892,0.7347,0.7554,0.127,26600.0,33118.5


In [76]:
# 6.6.9 Use the .idxmax() method to build max_cols, a Series holding
#        the index label of the maximum value of each column.
max_cols = college_n2.idxmax()
max_cols

SATVRMID                             California Institute of Technology
SATMTMID                             California Institute of Technology
UGDS                                      University of Phoenix-Arizona
UGDS_WHITE                       Mr Leon's School of Hair Design-Moscow
UGDS_BLACK                           Velvatex College of Beauty Culture
UGDS_HISP                       Thunderbird School of Global Management
UGDS_ASIAN                          Cosmopolitan Beauty and Tech School
UGDS_AIAN                             Haskell Indian Nations University
UGDS_NHPI                                       Palau Community College
UGDS_2MOR                                                 LIU Brentwood
UGDS_NRA               California University of Management and Sciences
UGDS_UNKN             Le Cordon Bleu College of Culinary Arts-San Fr...
PPTUG_EF                        Thunderbird School of Global Management
PCTPELL                                        MTI Business Coll

In [77]:
# 6.6.10 Build unique_max_cols by removing duplicate labels.
unique_max_cols = max_cols.unique()
unique_max_cols[:5]

array(['California Institute of Technology',
       'University of Phoenix-Arizona',
       "Mr Leon's School of Hair Design-Moscow",
       'Velvatex College of Beauty Culture',
       'Thunderbird School of Global Management'], dtype=object)

In [78]:
# 6.6.11 Pass unique_max_cols as the row indexer of .loc so that fancy indexing selects those rows.
#        Chain .style.highlight_max() to highlight the maximum of each column.
college_n2.loc[unique_max_cols].style.highlight_max()

INSTNM,SATVRMID,SATMTMID,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
California Institute of Technology,765.0,785.0,983.0,0.2787,0.0153,0.1221,0.4385,0.001,0.0,0.057,0.0875,0.0,0.0,0.1126,0.2303,0.0082,77800.0,11812.5
University of Phoenix-Arizona,,,151558.0,0.3098,0.1555,0.076,0.0082,0.0042,0.005,0.1131,0.0131,0.3152,0.0,0.6009,0.592,,,33000.0
Mr Leon's School of Hair Design-Moscow,,,16.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.625,0.625,0.2,,15710.0
Velvatex College of Beauty Culture,,,25.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.7692,0.0,0.52,,
Thunderbird School of Global Management,,,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,118900.0,
Cosmopolitan Beauty and Tech School,,,110.0,0.0091,0.0,0.0182,0.9727,0.0,0.0,0.0,0.0,0.0,0.3182,0.7761,0.1244,0.9545,,
Haskell Indian Nations University,430.0,440.0,805.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0224,0.8396,0.0,0.2089,22800.0,
Palau Community College,,,602.0,0.0,0.0017,0.0,0.0,0.0,0.9983,0.0,0.0,0.0,0.3887,0.856,0.0,0.2616,24700.0,
LIU Brentwood,,,15.0,0.0,0.1333,0.2667,0.0,0.0,0.0,0.5333,0.0,0.0667,0.4,0.5652,0.7826,0.7826,44600.0,25499.0
California University of Management and Sciences,,,98.0,0.0102,0.0204,0.0,0.0408,0.0,0.0,0.0,0.9286,0.0,0.0,0.0926,0.0556,0.6852,,


# 6.7 Replicating idxmax with method chaining

In [79]:
### [Tech] Implement .idxmax() by hand.
###        1) Find the maximum of each column.
###        2) Build a Boolean mask marking where each value equals its column maximum.
###        3) If several rows share the same maximum, keep only the first one.
###           To do this, apply a cumulative sum twice to the mask from 2).
###           Applying .cumsum() to [F,F,T,F,T] gives [0,0,1,1,2],
###           and applying .cumsum().cumsum() gives [0,0,1,2,4].
###           The position where the double cumulative sum equals 1 is the index of the first maximum.
###        4) Any row containing at least one 1 is the first row holding the maximum of some column.
### [Goal] Highlight the maximum of each column of the college DataFrame,
###       but without using .idxmax(); that is, look at how .idxmax can be implemented.
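
The double cumulative sum from step 3 can be checked on the same five-element mask. A minimal sketch; the variable names are mine.

```python
import pandas as pd

# True marks a row that equals the column maximum (two tied rows here).
is_max = pd.Series([False, False, True, False, True])

once = is_max.cumsum()      # running count of maxima seen so far
twice = once.cumsum()       # only the first maximum ever lands on exactly 1

# Exactly one True remains, at the position of the first maximum.
first_max = twice.eq(1)
```

After the first True, `once` is at least 1 on every later row, so `twice` jumps past 1 immediately and never returns to it.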

## >> How it works...

In [80]:
# 6.7.1 Load college from college.csv with INSTNM as the index,
#      then handle the object-typed and binary columns as before.
college = pd.read_csv('data/college.csv', index_col='INSTNM')
cols = ['MD_EARN_WNE_P10','GRAD_DEBT_MDN_SUPP']
for col in cols : 
    college[col] = pd.to_numeric(college[col], errors='coerce')
college_n = college.select_dtypes(include=[np.number])
criteria = college_n.nunique() == 2
binary_cols = college_n.columns[criteria].tolist()
college_n = college_n.drop(labels=binary_cols, axis='columns')

In [81]:
# 6.7.2 Find the maximum of each column.
college_n.max().head()

SATVRMID         765.0
SATMTMID         785.0
UGDS          151558.0
UGDS_WHITE         1.0
UGDS_BLACK         1.0
dtype: float64

In [82]:
# pd.DataFrame.eq?  # default axis=1 (columns): each row is compared against college_n.max()

In [83]:
# 6.7.3 Test whether each value equals the maximum of its column.
college_n.eq(college_n.max()).head()

INSTNM,SATVRMID,SATMTMID,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
University of Alabama at Birmingham,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
Amridge University,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
University of Alabama in Huntsville,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
Alabama State University,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
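
The first five rows are all False, so the alignment is easier to see on a small frame. In this sketch (toy data, illustrative names), the index of the Series returned by `.max()` is matched against the column labels of the DataFrame, which is what the default `axis='columns'` does.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 5, 5], "b": [9, 2, 9]}, index=["x", "y", "z"])
col_max = df.max()  # Series indexed by column label: a=5, b=9

# Default axis='columns': the Series index is aligned with df's columns,
# so every row is compared element-wise against the per-column maxima.
is_col_max = df.eq(col_max)

print(is_col_max["a"].tolist())  # [False, True, True]
print(is_col_max["b"].tolist())  # [True, False, True]
```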


In [84]:
# 6.7.4 If a row contains at least one True, that row holds the maximum of some column.
#       has_row_max is a Boolean Series that flags the rows holding at least one column maximum.
has_row_max = college_n.eq(college_n.max()).any(axis='columns')
has_row_max.head()

INSTNM
Alabama A & M University               False
University of Alabama at Birmingham    False
Amridge University                     False
University of Alabama in Huntsville    False
Alabama State University               False
dtype: bool

In [85]:
# 6.7.5 college_n has only 18 columns, yet 401 rows hold a column maximum.
#       Many rows are probably tied at the maximum value 1 in the proportion columns such as UGDS_.
college_n.shape

(7535, 18)

In [86]:
has_row_max.sum()

401
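
One way to see where the ties come from is to count the maxima per column before collapsing across rows. The sketch below uses a made-up two-column frame; `ties_per_column` and `rows_with_any_max` are illustrative names.

```python
import pandas as pd

# Toy data: 'share' is a proportion column where several rows reach 1.0.
toy = pd.DataFrame(
    {"size": [10, 30, 20, 5], "share": [1.0, 0.4, 1.0, 1.0]},
    index=["s1", "s2", "s3", "s4"],
)

is_col_max = toy.eq(toy.max())

# Number of rows tied at the maximum, column by column.
ties_per_column = is_col_max.sum()

# Number of rows holding the maximum of at least one column.
rows_with_any_max = is_col_max.any(axis="columns").sum()

print(ties_per_column.to_dict())  # {'size': 1, 'share': 3}
print(rows_with_any_max)          # 4
```

With only two columns, four rows are flagged because three of them tie in `share`.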

In [87]:
# 6.7.6 To pick the row where each column's maximum first appears, apply a cumulative sum (.cumsum) twice.
#      First, apply .cumsum() once.
#      After the first True, the following False rows keep the same value of 1.
college_n.eq(college_n.max()).cumsum()

INSTNM,SATVRMID,SATMTMID,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
University of Alabama at Birmingham,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Amridge University,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
University of Alabama in Huntsville,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Alabama State University,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
SAE Institute of Technology San Francisco,1,1,1,109,28,136,1,2,1,1,1,1,44,66,55,12,1,2
Rasmussen College - Overland Park,1,1,1,109,28,136,1,2,1,1,1,1,44,66,55,12,1,2
National Personal Training Institute of Cleveland,1,1,1,109,28,136,1,2,1,1,1,1,44,66,55,12,1,2
Bay Area Medical Academy - San Jose Satellite Location,1,1,1,109,28,136,1,2,1,1,1,1,44,66,55,12,1,2


In [88]:
# 6.7.7 Applying .cumsum().cumsum() leaves the value 1 exactly once in each column.
college_n.eq(college_n.max()).cumsum().cumsum()

INSTNM,SATVRMID,SATMTMID,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
University of Alabama at Birmingham,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Amridge University,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
University of Alabama in Huntsville,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Alabama State University,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
SAE Institute of Technology San Francisco,7305,7305,415,379750,73107,341103,983,11382,3316,5056,1076,7276,113649,172944,170273,36183,3445,10266
Rasmussen College - Overland Park,7306,7306,416,379859,73135,341239,984,11384,3317,5057,1077,7277,113693,173010,170328,36195,3446,10268
National Personal Training Institute of Cleveland,7307,7307,417,379968,73163,341375,985,11386,3318,5058,1078,7278,113737,173076,170383,36207,3447,10270
Bay Area Medical Academy - San Jose Satellite Location,7308,7308,418,380077,73191,341511,986,11388,3319,5059,1079,7279,113781,173142,170438,36219,3448,10272


In [89]:
# 6.7.8 From the double cumulative sum, select the rows that contain at least one 1.
#      These are the first rows that hold a column maximum.
has_row_max2 = college_n.eq(college_n.max()).cumsum().cumsum()\
        .eq(1).any(axis = 'columns')
has_row_max2.head()

INSTNM
Alabama A & M University               False
University of Alabama at Birmingham    False
Amridge University                     False
University of Alabama in Huntsville    False
Alabama State University               False
dtype: bool

In [90]:
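# 6.7.9 The number of True values in has_row_max2 is smaller than 18, the number of columns,
#      because a single row can hold the first maximum of more than one column.
has_row_max2.sum()  # count the True values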

16
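
A count below the number of columns is expected whenever one row holds the first maximum of several columns. The following sketch reproduces that situation on a made-up three-column frame; the names are illustrative.

```python
import pandas as pd

# Toy data: row r2 holds the first maximum of both 'a' and 'b'.
toy = pd.DataFrame(
    {"a": [1, 9, 3], "b": [2, 8, 8], "c": [7, 0, 7]},
    index=["r1", "r2", "r3"],
)

first_max = toy.eq(toy.max()).cumsum().cumsum().eq(1)
rows = first_max.any(axis="columns")

print(rows.sum())               # 2, although there are 3 columns
print(toy.idxmax().tolist())    # ['r2', 'r2', 'r1']
```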

In [91]:
# 6.7.10 Keep only the True rows by passing the Series to itself as a Boolean index,
#        then take the index of the filtered Series.
idxmax_cols = has_row_max2[has_row_max2].index
idxmax_cols

Index(['Thunderbird School of Global Management',
       'Southwest University of Visual Arts-Tucson', 'ABC Beauty College Inc',
       'Velvatex College of Beauty Culture',
       'California Institute of Technology',
       'Le Cordon Bleu College of Culinary Arts-San Francisco',
       'MTI Business College Inc', 'Dongguk University-Los Angeles',
       'Mr Leon's School of Hair Design-Moscow',
       'Haskell Indian Nations University', 'LIU Brentwood',
       'Medical College of Wisconsin', 'Palau Community College',
       'California University of Management and Sciences',
       'Cosmopolitan Beauty and Tech School', 'University of Phoenix-Arizona'],
      dtype='object', name='INSTNM')
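
Filtering a Boolean Series by itself can be written in a few equivalent ways. The sketch below compares three of them on a toy Series; the callable form is the one that fits inside a method chain, since it needs no intermediate variable.

```python
import pandas as pd

flags = pd.Series([False, True, False, True], index=["w", "x", "y", "z"])

by_self = flags[flags].index                 # Series used as its own mask
by_callable = flags[lambda s: s].index       # same filter, chain-friendly
by_index = flags.index[flags.to_numpy()]     # mask applied to the index directly

print(by_self.tolist())      # ['x', 'z']
print(by_callable.tolist())  # ['x', 'z']
print(by_index.tolist())     # ['x', 'z']
```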

In [92]:
# 6.7.11 Confirm that the result matches the row labels found with idxmax.
set(college_n.idxmax().unique()) == set(idxmax_cols)

True

## >> There's more 6.7

In [93]:
# .idxmax() and the hand-built double cumulative sum differ in speed
#  by more than 5x (roughly 10x in the timings below).
#   Even so, .cumsum() and similar methods are often useful for locating the values you are after.

%timeit college_n.idxmax().values

4.54 ms ± 823 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [94]:
%timeit college_n.eq(college_n.max())\
                .cumsum()\
                .cumsum()\
                .eq(1)\
                .any(axis = 'columns')\
                [lambda x: x].index

46.4 ms ± 9.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
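
Outside IPython, where `%timeit` is unavailable, a comparable measurement can be sketched with the standard `timeit` module. The frame below is random data whose shape matches college_n (7535 x 18); it is a stand-in, not the real dataset, so the absolute numbers will differ from those above and will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for college_n; the values are random, not the real data.
df = pd.DataFrame(rng.random((7535, 18)))


def with_idxmax(frame):
    """Row labels of each column's maximum, using the built-in method."""
    return frame.idxmax().to_numpy()


def with_cumsum(frame):
    """Row labels of each column's first maximum, using the double cumsum."""
    mask = frame.eq(frame.max()).cumsum().cumsum().eq(1).any(axis="columns")
    return mask[mask].index


# Both approaches should identify the same set of rows.
same_rows = set(with_idxmax(df)) == set(with_cumsum(df))

runs = 20
t_idxmax = timeit.timeit(lambda: with_idxmax(df), number=runs) / runs
t_cumsum = timeit.timeit(lambda: with_cumsum(df), number=runs) / runs
print(f"same rows: {same_rows}")
print(f"idxmax: {t_idxmax * 1e3:.2f} ms, double cumsum: {t_cumsum * 1e3:.2f} ms")
```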


# 6.8 Finding the most common maximum

In [95]:
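### [Tech] When the columns share the same scale, comparing across columns with idxmax() is meaningful.
###         Set the axis option accordingly.
### [Goal] Find the percentage of schools where white students are the largest group. The columns starting with UGDS_
###       all measure racial proportions on the same scale, so comparing across columns is meaningful.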

## >> How it works...

In [96]:
# 6.8.1 Load college from college.csv with INSTNM as the index, then build college_ugds
#      from only the columns starting with UGDS_, which hold the racial proportions.
college = pd.read_csv('data/college.csv', index_col = 'INSTNM')
college_ugds = college.filter(like = 'UGDS_')
college_ugds.head()

INSTNM,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
University of Alabama in Huntsville,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
Alabama State University,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


In [97]:
# 6.8.2 Use the idxmax method to find, for each row, the column that holds the maximum.
#      Here axis='columns' switches the direction along which the method is applied.
#       The result is stored as the Series highest_percentage_race.
highest_percentage_race = college_ugds.idxmax(axis='columns')
highest_percentage_race

INSTNM
Alabama A & M University                                  UGDS_BLACK
University of Alabama at Birmingham                       UGDS_WHITE
Amridge University                                        UGDS_BLACK
University of Alabama in Huntsville                       UGDS_WHITE
Alabama State University                                  UGDS_BLACK
                                                             ...    
SAE Institute of Technology  San Francisco                       NaN
Rasmussen College - Overland Park                                NaN
National Personal Training Institute of Cleveland                NaN
Bay Area Medical Academy - San Jose Satellite Location           NaN
Excel Learning Center-San Antonio South                          NaN
Length: 7535, dtype: object
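
The NaN entries above belong to rows where every UGDS_ column is missing. How idxmax treats such rows may differ between pandas versions (an assumption worth checking against the installed release), so one cautious option is to filter them out first and restore them afterwards. The sketch below uses made-up values and illustrative names.

```python
import numpy as np
import pandas as pd

# Toy data: row s2 has no values at all.
toy = pd.DataFrame(
    {"UGDS_WHITE": [0.6, np.nan, 0.1], "UGDS_BLACK": [0.3, np.nan, 0.8]},
    index=["s1", "s2", "s3"],
)

# Keep only rows with at least one non-missing value before calling idxmax.
has_data = toy.notna().any(axis="columns")
top = toy[has_data].idxmax(axis="columns")

# Reindex to bring the dropped rows back as NaN, mirroring the output above.
top_full = top.reindex(toy.index)

print(top.to_dict())  # {'s1': 'UGDS_WHITE', 's3': 'UGDS_BLACK'}
print(top_full.isna().tolist())  # [False, True, False]
```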

In [98]:
# 6.8.3 Apply value_counts(normalize=True) to highest_percentage_race
#  to get the distribution of the largest racial group across schools.
highest_percentage_race.value_counts(normalize = True)

UGDS_WHITE    0.670352
UGDS_BLACK    0.151586
UGDS_HISP     0.129473
UGDS_UNKN     0.023422
UGDS_ASIAN    0.012074
UGDS_AIAN     0.006110
UGDS_NRA      0.004073
UGDS_NHPI     0.001746
UGDS_2MOR     0.001164
dtype: float64

In [104]:
# 6.8.3 (cont.) Without normalize, value_counts() on highest_percentage_race
#  returns the raw number of schools for each largest racial group.
highest_percentage_race.value_counts()

UGDS_WHITE    4608
UGDS_BLACK    1042
UGDS_HISP      890
UGDS_UNKN      161
UGDS_ASIAN      83
UGDS_AIAN       42
UGDS_NRA        28
UGDS_NHPI       12
UGDS_2MOR        8
dtype: int64
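
The counts above sum to 6874 rather than 7535, which is consistent with value_counts leaving out missing values by default; the normalized shares are therefore fractions of the non-missing rows (for example, 4608 / 6874 is about 0.670352). A minimal sketch on a toy Series with made-up labels shows the relationship.

```python
import numpy as np
import pandas as pd

# Toy Series of "largest group" labels with one missing entry.
top = pd.Series(["W", "B", "W", np.nan, "H", "W"])

counts = top.value_counts()                # NaN is excluded by default
shares = top.value_counts(normalize=True)  # counts divided by the non-NaN total

print(counts["W"], counts.sum())  # 3 5
print(shares["W"])                # 0.6
```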

## >> There's more 6.8

In [99]:
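# As an extension of the example above, look at which group is the second largest
# in schools where Black students are the largest group.

# First, keep only the schools where Black students are the largest group.
college_black = college_ugds[highest_percentage_race == 'UGDS_BLACK']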
college_black

INSTNM,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0000,0.0059,0.0138
Amridge University,0.2990,0.4192,0.0069,0.0034,0.0000,0.0000,0.0000,0.0000,0.2715
Alabama State University,0.0158,0.9208,0.0121,0.0019,0.0010,0.0006,0.0098,0.0243,0.0137
Concordia College Alabama,0.0280,0.8758,0.0373,0.0093,0.0000,0.0000,0.0031,0.0466,0.0000
South University-Montgomery,0.3046,0.6054,0.0153,0.0153,0.0153,0.0096,0.0000,0.0019,0.0326
...,...,...,...,...,...,...,...,...,...
New Horizons Medical Institute-Winder,0.4175,0.5000,0.0388,0.0194,0.0000,0.0000,0.0049,0.0000,0.0194
SAE Institute of Technology-Chicago,0.2797,0.4576,0.1864,0.0085,0.0254,0.0000,0.0000,0.0085,0.0339
Fayette Beauty Academy-Ritz Beauty Academy,0.0000,1.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000
National American University-Houston,0.1858,0.6443,0.0672,0.0079,0.0079,0.0000,0.0751,0.0000,0.0119


In [100]:
# Drop the UGDS_BLACK column from this DataFrame (removing the top-ranked column).
college_black = college_black.drop('UGDS_BLACK', axis='columns')
college_black

INSTNM,UGDS_WHITE,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
Alabama A & M University,0.0333,0.0055,0.0019,0.0024,0.0019,0.0000,0.0059,0.0138
Amridge University,0.2990,0.0069,0.0034,0.0000,0.0000,0.0000,0.0000,0.2715
Alabama State University,0.0158,0.0121,0.0019,0.0010,0.0006,0.0098,0.0243,0.0137
Concordia College Alabama,0.0280,0.0373,0.0093,0.0000,0.0000,0.0031,0.0466,0.0000
South University-Montgomery,0.3046,0.0153,0.0153,0.0153,0.0096,0.0000,0.0019,0.0326
...,...,...,...,...,...,...,...,...
New Horizons Medical Institute-Winder,0.4175,0.0388,0.0194,0.0000,0.0000,0.0049,0.0000,0.0194
SAE Institute of Technology-Chicago,0.2797,0.1864,0.0085,0.0254,0.0000,0.0000,0.0085,0.0339
Fayette Beauty Academy-Ritz Beauty Academy,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000
National American University-Houston,0.1858,0.0672,0.0079,0.0079,0.0000,0.0751,0.0000,0.0119


In [101]:
# Use idxmax(axis='columns') again to find which remaining column holds the maximum in each row.
college_black.idxmax(axis = 'columns')

INSTNM
Alabama A & M University                                 UGDS_WHITE
Amridge University                                       UGDS_WHITE
Alabama State University                                   UGDS_NRA
Concordia College Alabama                                  UGDS_NRA
South University-Montgomery                              UGDS_WHITE
                                                            ...    
New Horizons Medical Institute-Winder                    UGDS_WHITE
SAE Institute of Technology-Chicago                      UGDS_WHITE
Fayette Beauty Academy-Ritz Beauty Academy               UGDS_WHITE
National American University-Houston                     UGDS_WHITE
Hollywood Institute of Beauty Careers-West Palm Beach     UGDS_HISP
Length: 1042, dtype: object

In [102]:
# Apply .value_counts(normalize=True) to see the distribution of the second-largest group.
college_black.idxmax(axis = 'columns').value_counts(normalize = True)

UGDS_WHITE    0.661228
UGDS_HISP     0.230326
UGDS_UNKN     0.071977
UGDS_NRA      0.018234
UGDS_ASIAN    0.009597
UGDS_2MOR     0.006718
UGDS_AIAN     0.000960
UGDS_NHPI     0.000960
dtype: float64
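
The whole runner-up procedure can be traced on a small frame. The sketch below uses made-up proportions and illustrative names. Row s3 is 100% Black, so its remaining columns are all zero and idxmax falls back to the first column label; the same effect explains why Fayette Beauty Academy-Ritz Beauty Academy is reported as UGDS_WHITE above.

```python
import pandas as pd

# Toy proportions; s3 has every student in a single group.
toy = pd.DataFrame(
    {
        "UGDS_WHITE": [0.30, 0.70, 0.00, 0.10],
        "UGDS_BLACK": [0.60, 0.20, 1.00, 0.50],
        "UGDS_HISP": [0.10, 0.10, 0.00, 0.40],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Largest group per school, then keep the schools where it is UGDS_BLACK.
top = toy.idxmax(axis="columns")
black_first = toy[top == "UGDS_BLACK"].drop(columns="UGDS_BLACK")

# Largest of the remaining columns is the runner-up.
runner_up = black_first.idxmax(axis="columns")

print(runner_up.to_dict())
# {'s1': 'UGDS_WHITE', 's3': 'UGDS_WHITE', 's4': 'UGDS_HISP'}
print(runner_up.value_counts(normalize=True).round(3).to_dict())
# {'UGDS_WHITE': 0.667, 'UGDS_HISP': 0.333}
```

If all-zero rows should not count toward any group, they could be filtered out before the second idxmax call.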