# Chapter 2: pandas Foundations: part 2

## Recipes
* [2.1 Selecting multiple DataFrame columns](#2.1-Selecting-multiple-DataFrame-columns)
* [2.2 Selecting columns with methods](#2.2-Selecting-columns-with-methods)
* [2.3 Ordering column names sensibly](#2.3-Ordering-column-names-sensibly)
* [2.4 Operating on the entire DataFrame](#2.4-Operating-on-the-entire-DataFrame)
* [2.5 Chaining DataFrame methods together](#2.5-Chaining-DataFrame-methods-together)
* [2.6 Working with operators on a DataFrame](#2.6-Working-with-operators-on-a-DataFrame)
* [2.7 Comparing missing values](#2.7-Comparing-missing-values)
* [2.8 Transposing the direction of a DataFrame operation](#2.8-Transposing-the-direction-of-a-DataFrame-operation)
* [2.9 Determining college campus diversity](#2.9-Determining-college-campus-diversity)

In [1]:
import pandas as pd
import numpy as np
pd.options.display.max_columns = 40

# 2.1 Selecting multiple DataFrame columns

In [2]:
### [Tech] Selecting multiple DataFrame columns with [[ ]]: pass a list of column names inside the indexing operator
### [Goal] Select only the actor and director name columns

## >> How it works...

In [3]:
#2.1.1 Select the actor and director name columns
movie = pd.read_csv('data/movie.csv')
movie_actor_director = movie[['actor_1_name', 'actor_2_name', 
                              'actor_3_name', 'director_name']]
movie_actor_director.head()

Unnamed: 0,actor_1_name,actor_2_name,actor_3_name,director_name
0,CCH Pounder,Joel David Moore,Wes Studi,James Cameron
1,Johnny Depp,Orlando Bloom,Jack Davenport,Gore Verbinski
2,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Sam Mendes
3,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Christopher Nolan
4,Doug Walker,Rob Walker,,Doug Walker


In [4]:
#2.1.2 To get a DataFrame rather than a Series when selecting a single column, pass a one-element list
movie[['director_name']].head()

Unnamed: 0,director_name
0,James Cameron
1,Gore Verbinski
2,Sam Mendes
3,Christopher Nolan
4,Doug Walker


## >> There's more... 2.1

In [5]:
# Storing the column names in a list variable keeps the selection tidy
cols =['actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name']
movie_actor_director = movie[cols]
movie_actor_director.tail()

Unnamed: 0,actor_1_name,actor_2_name,actor_3_name,director_name
4911,Eric Mabius,Daphne Zuniga,Crystal Lowe,Scott Smith
4912,Natalie Zea,Valorie Curry,Sam Underwood,
4913,Eva Boehnke,Maxwell Moody,David Chandler,Benjamin Roberds
4914,Alan Ruck,Daniel Henney,Eliza Coupe,Daniel Hsia
4915,John August,Brian Herzlinger,Jon Gunn,Jon Gunn


In [6]:
# A common mistake: leaving out the inner list brackets
movie['actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name']

KeyError: ('actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name')

In [None]:
# Comma-separated values without square brackets form a tuple, not a list
tuple1 = 1,2,3,'a','b'
tuple2 = (1,2,3,'a','b')
list2 = [1,2,3,'a','b']
tuple1 == tuple2   , tuple1 == list2
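
The same failure can be reproduced on a tiny frame. The sketch below uses a made-up DataFrame (the column names `a`, `b`, `c` are hypothetical, not from the movie data) to show that comma-separated labels inside a single pair of brackets are packed into a tuple, which pandas then looks up as one key.

```python
import pandas as pd

# Hypothetical toy frame, not the movie data
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# A list inside the indexing operator selects several columns
subset = df[['a', 'b']]

# Without the inner brackets Python builds the tuple ('a', 'b'),
# and pandas searches for a single column carrying that label
try:
    df['a', 'b']
    raised = False
except KeyError:
    raised = True
```

Here `subset` holds the two requested columns, while the second lookup ends in a `KeyError` because no column is labelled `('a', 'b')`.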

# 2.2 Selecting columns with methods

In [7]:
### [Tech] Column selection methods: .select_dtypes(), .filter()
### [Goal] From the movie table, select only the numeric columns, only the Facebook-related columns,
#        and only the columns whose names contain a digit

## >> How it works...

In [8]:
#2.2.1 Count the columns by data type
movie = pd.read_csv('data/movie.csv', index_col='movie_title')
movie.dtypes.value_counts()

float64    13
object     11
int64       3
dtype: int64

In [None]:
#2.2.2 Select only the columns of a specific data type
movie.select_dtypes(include=['int64']).head()

In [None]:
#2.2.3 Select the columns that hold numeric data
movie.select_dtypes(include=['number']).head()

In [None]:
#2.2.4 Select the columns whose names contain 'facebook'
movie.filter(like='facebook').head()

In [None]:
#2.2.5 Select the columns whose names contain a digit (regex = regular expression)
movie.filter(regex=r'\d').head()

In [None]:
movie.filter(regex=r'\d').columns.size
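
The three selection styles above can be compared side by side on a small frame. This is a minimal sketch on made-up data (all column names are hypothetical); it only illustrates which columns each call keeps.

```python
import pandas as pd

# Hypothetical toy frame; the column names are invented for illustration
df = pd.DataFrame({
    'name': ['x', 'y'],
    'actor_1_likes': [10, 20],
    'actor_2_likes': [1.5, 2.5],
    'score': [7.1, 8.2],
})

numeric = df.select_dtypes(include=['number'])   # integer and float columns
by_substring = df.filter(like='likes')           # names containing 'likes'
by_regex = df.filter(regex=r'\d')                # names containing a digit
```

Note that `select_dtypes` looks at the data type of each column, whereas `filter` looks only at the column labels and never at the values.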

## >> There's more... 2.2

In [10]:
# The items argument of filter(): no KeyError is raised even when a listed column does not exist
movie.filter(items=['actor_1_name', 'asdf']).head()

Unnamed: 0_level_0,actor_1_name
movie_title,Unnamed: 1_level_1
Avatar,CCH Pounder
Pirates of the Caribbean: At World's End,Johnny Depp
Spectre,Christoph Waltz
The Dark Knight Rises,Tom Hardy
Star Wars: Episode VII - The Force Awakens,Doug Walker


In [9]:
movie[['actor_1_name', 'asdf']]

KeyError: "['asdf'] not in index"
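
Because `filter(items=...)` drops unknown labels without complaint, a typo in a column name can go unnoticed. One possible safeguard, sketched here on a made-up frame (the label `missing_col` is deliberately absent), is to list the requested labels that do not exist.

```python
import pandas as pd

# Hypothetical toy frame
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
wanted = ['a', 'missing_col']   # 'missing_col' is not a column of df

# filter(items=...) keeps only the labels that exist
kept = df.filter(items=wanted)

# An explicit check makes the silently dropped labels visible
absent = [col for col in wanted if col not in df.columns]
```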

# 2.3 Ordering column names sensibly

In [16]:
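### [Tech] Rebuild the table from a list of column names in the desired order
#       * Separate discrete from continuous, group columns by common domain, most important first
### [Goal] Reorder the columns of the movie table (28 columns) so they are easier to understand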

## >> How it works...

In [11]:
#2.3.1 Load the movie table
movie = pd.read_csv('data/movie.csv')

In [12]:
movie.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4916 entries, 0 to 4915
Data columns (total 28 columns):
color                        4897 non-null object
director_name                4814 non-null object
num_critic_for_reviews       4867 non-null float64
duration                     4901 non-null float64
director_facebook_likes      4814 non-null float64
actor_3_facebook_likes       4893 non-null float64
actor_2_name                 4903 non-null object
actor_1_facebook_likes       4909 non-null float64
gross                        4054 non-null float64
genres                       4916 non-null object
actor_1_name                 4909 non-null object
movie_title                  4916 non-null object
num_voted_users              4916 non-null int64
cast_total_facebook_likes    4916 non-null int64
actor_3_name                 4893 non-null object
facenumber_in_poster         4903 non-null float64
plot_keywords                4764 non-null object
movie_imdb_link              4916 non-

In [18]:
movie.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,0.0,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,0.0,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,11700,Stephanie Sigman,1.0,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,Tom Hardy,The Dark Knight Rises,1144337,106759,Joseph Gordon-Levitt,0.0,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,Doug Walker,Star Wars: Episode VII - The Force Awakens,8,143,,0.0,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,,,,,,,12.0,7.1,,0


In [19]:
#2.3.2 Get the column names
movie.columns , movie.shape

(Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
        'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
        'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
        'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
        'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
        'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
        'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
        'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
       dtype='object'), (4916, 28))

In [20]:
#2.3.3 Group the columns by type
disc_core = ['movie_title','title_year', 'content_rating','genres']
disc_people = ['director_name','actor_1_name', 'actor_2_name','actor_3_name']
disc_other = ['color','country','language','plot_keywords','movie_imdb_link']
cont_fb = ['director_facebook_likes','actor_1_facebook_likes','actor_2_facebook_likes',
           'actor_3_facebook_likes', 'cast_total_facebook_likes', 'movie_facebook_likes']
cont_finance = ['budget','gross']
cont_num_reviews = ['num_voted_users','num_user_for_reviews', 'num_critic_for_reviews']
cont_other = ['imdb_score','duration', 'aspect_ratio', 'facenumber_in_poster']

In [21]:
#2.3.4 Concatenate the groups in the desired order and check that no column was lost
new_col_order = disc_core + disc_people + disc_other + \
                    cont_fb + cont_finance + cont_num_reviews + cont_other
set(movie.columns) == set(new_col_order)

True
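
The set comparison confirms that no column was forgotten or misspelled, but sets ignore duplicates, so a label listed twice would still pass. A stricter check is sketched below on a made-up frame; the helper name `is_reordering` is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame
df = pd.DataFrame({'x': [1], 'y': [2], 'z': [3]})

good_order = ['z', 'x', 'y']
bad_order = ['z', 'x', 'y', 'x']   # 'x' listed twice by mistake

def is_reordering(columns, new_order):
    """Return True only if new_order holds every column exactly once."""
    return sorted(columns) == sorted(new_order)

# The set-based check does not notice the duplicate
sets_match_bad = set(df.columns) == set(bad_order)
```

Comparing sorted lists instead of sets catches the repeated label while still ignoring the order itself.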

In [22]:
#2.3.5 Create movie2 with the new column order
movie2 = movie[new_col_order]
movie2.head()

Unnamed: 0,movie_title,title_year,content_rating,genres,director_name,actor_1_name,actor_2_name,actor_3_name,color,country,language,plot_keywords,movie_imdb_link,director_facebook_likes,actor_1_facebook_likes,actor_2_facebook_likes,actor_3_facebook_likes,cast_total_facebook_likes,movie_facebook_likes,budget,gross,num_voted_users,num_user_for_reviews,num_critic_for_reviews,imdb_score,duration,aspect_ratio,facenumber_in_poster
0,Avatar,2009.0,PG-13,Action|Adventure|Fantasy|Sci-Fi,James Cameron,CCH Pounder,Joel David Moore,Wes Studi,Color,USA,English,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,0.0,1000.0,936.0,855.0,4834,33000,237000000.0,760505847.0,886204,3054.0,723.0,7.9,178.0,1.78,0.0
1,Pirates of the Caribbean: At World's End,2007.0,PG-13,Action|Adventure|Fantasy,Gore Verbinski,Johnny Depp,Orlando Bloom,Jack Davenport,Color,USA,English,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,563.0,40000.0,5000.0,1000.0,48350,0,300000000.0,309404152.0,471220,1238.0,302.0,7.1,169.0,2.35,0.0
2,Spectre,2015.0,PG-13,Action|Adventure|Thriller,Sam Mendes,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Color,UK,English,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,0.0,11000.0,393.0,161.0,11700,85000,245000000.0,200074175.0,275868,994.0,602.0,6.8,148.0,2.35,1.0
3,The Dark Knight Rises,2012.0,PG-13,Action|Thriller,Christopher Nolan,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Color,USA,English,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,22000.0,27000.0,23000.0,23000.0,106759,164000,250000000.0,448130642.0,1144337,2701.0,813.0,8.5,164.0,2.35,0.0
4,Star Wars: Episode VII - The Force Awakens,,,Documentary,Doug Walker,Doug Walker,Rob Walker,,,,,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,131.0,131.0,12.0,,143,0,,,8,,,7.1,,,0.0


# 2.4 Operating on the entire DataFrame

In [23]:
### [Tech] DataFrame summary information - .shape, .size, .ndim, len(), .count(),
#                               .min(), .max(), .mean(), .median(), .std(),
#                               .describe()
### [Goal] Check the descriptive statistics of the movie table

## >> How it works...

In [13]:
#2.4.1 Check the dimensions of the movie data
movie = pd.read_csv('data/movie.csv')
movie.shape

(4916, 28)

In [25]:
movie.head(3)

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,0.0,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,0.0,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,11700,Stephanie Sigman,1.0,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000


In [26]:
movie.size

137648

In [129]:
movie.shape[0] * movie.shape[1]

137648

In [27]:
movie.ndim

2

In [28]:
len(movie)

4916

In [29]:
#2.4.2 Number of non-missing values in each column
movie.count()

color                        4897
director_name                4814
num_critic_for_reviews       4867
duration                     4901
director_facebook_likes      4814
actor_3_facebook_likes       4893
actor_2_name                 4903
actor_1_facebook_likes       4909
gross                        4054
genres                       4916
actor_1_name                 4909
movie_title                  4916
num_voted_users              4916
cast_total_facebook_likes    4916
actor_3_name                 4893
facenumber_in_poster         4903
plot_keywords                4764
movie_imdb_link              4916
num_user_for_reviews         4895
language                     4904
country                      4911
content_rating               4616
budget                       4432
title_year                   4810
actor_2_facebook_likes       4903
imdb_score                   4916
aspect_ratio                 4590
movie_facebook_likes         4916
dtype: int64

In [30]:
#2.4.3 A summary statistic on a DataFrame returns a Series with one value per column
movie.min()

num_critic_for_reviews          1.00
duration                        7.00
director_facebook_likes         0.00
actor_3_facebook_likes          0.00
actor_1_facebook_likes          0.00
gross                         162.00
num_voted_users                 5.00
cast_total_facebook_likes       0.00
facenumber_in_poster            0.00
num_user_for_reviews            1.00
budget                        218.00
title_year                   1916.00
actor_2_facebook_likes          0.00
imdb_score                      1.60
aspect_ratio                    1.18
movie_facebook_likes            0.00
dtype: float64

In [31]:
movie.min().size

16

In [32]:
#2.4.4 Summary statistics for the numeric columns
movie.describe()

Unnamed: 0,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_1_facebook_likes,gross,num_voted_users,cast_total_facebook_likes,facenumber_in_poster,num_user_for_reviews,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
count,4867.0,4901.0,4814.0,4893.0,4909.0,4054.0,4916.0,4916.0,4903.0,4895.0,4432.0,4810.0,4903.0,4916.0,4590.0,4916.0
mean,137.988905,107.090798,691.014541,631.276313,6494.488491,47644510.0,82644.92,9579.815907,1.37732,267.668846,36547490.0,2002.447609,1621.923516,6.437429,2.222349,7348.294142
std,120.239379,25.286015,2832.954125,1625.874802,15106.986884,67372550.0,138322.2,18164.31699,2.023826,372.934839,100242700.0,12.453977,4011.299523,1.127802,1.40294,19206.016458
min,1.0,7.0,0.0,0.0,0.0,162.0,5.0,0.0,0.0,1.0,218.0,1916.0,0.0,1.6,1.18,0.0
25%,49.0,93.0,7.0,132.0,607.0,5019656.0,8361.75,1394.75,0.0,64.0,6000000.0,1999.0,277.0,5.8,1.85,0.0
50%,108.0,103.0,48.0,366.0,982.0,25043960.0,33132.5,3049.0,1.0,153.0,19850000.0,2005.0,593.0,6.6,2.35,159.0
75%,191.0,118.0,189.75,633.0,11000.0,61108410.0,93772.75,13616.75,2.0,320.5,43000000.0,2011.0,912.0,7.2,2.35,2000.0
max,813.0,511.0,23000.0,23000.0,640000.0,760505800.0,1689764.0,656730.0,43.0,5060.0,4200000000.0,2016.0,137000.0,9.5,16.0,349000.0


In [33]:

movie.describe().T.count()

count    16
mean     16
std      16
min      16
25%      16
50%      16
75%      16
max      16
dtype: int64

In [34]:
#2.4.5 The percentiles argument takes the quantile positions to report
movie.describe(percentiles=[.01, .3, .99])

Unnamed: 0,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_1_facebook_likes,gross,num_voted_users,cast_total_facebook_likes,facenumber_in_poster,num_user_for_reviews,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
count,4867.0,4901.0,4814.0,4893.0,4909.0,4054.0,4916.0,4916.0,4903.0,4895.0,4432.0,4810.0,4903.0,4916.0,4590.0,4916.0
mean,137.988905,107.090798,691.014541,631.276313,6494.488491,47644510.0,82644.92,9579.815907,1.37732,267.668846,36547490.0,2002.447609,1621.923516,6.437429,2.222349,7348.294142
std,120.239379,25.286015,2832.954125,1625.874802,15106.986884,67372550.0,138322.2,18164.31699,2.023826,372.934839,100242700.0,12.453977,4011.299523,1.127802,1.40294,19206.016458
min,1.0,7.0,0.0,0.0,0.0,162.0,5.0,0.0,0.0,1.0,218.0,1916.0,0.0,1.6,1.18,0.0
1%,2.0,43.0,0.0,0.0,6.08,8474.8,53.0,6.0,0.0,1.94,60000.0,1951.0,0.0,3.1,1.33,0.0
30%,60.0,95.0,11.0,176.0,694.0,7914069.0,11864.5,1684.5,0.0,80.0,8000000.0,2000.0,345.0,6.0,1.85,0.0
50%,108.0,103.0,48.0,366.0,982.0,25043960.0,33132.5,3049.0,1.0,153.0,19850000.0,2005.0,593.0,6.6,2.35,159.0
99%,546.68,189.0,16000.0,11000.0,44920.0,326412800.0,681584.6,62413.9,8.0,1999.24,200000000.0,2016.0,17000.0,8.5,4.0,93850.0
max,813.0,511.0,23000.0,23000.0,640000.0,760505800.0,1689764.0,656730.0,43.0,5060.0,4200000000.0,2016.0,137000.0,9.5,16.0,349000.0


In [35]:
movie.quantile([.01, .3, .99])

Unnamed: 0,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_1_facebook_likes,gross,num_voted_users,cast_total_facebook_likes,facenumber_in_poster,num_user_for_reviews,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0.01,2.0,43.0,0.0,0.0,6.08,8474.8,53.0,6.0,0.0,1.94,60000.0,1951.0,0.0,3.1,1.33,0.0
0.3,60.0,95.0,11.0,176.0,694.0,7914069.0,11864.5,1684.5,0.0,80.0,8000000.0,2000.0,345.0,6.0,1.85,0.0
0.99,546.68,189.0,16000.0,11000.0,44920.0,326412800.0,681584.6,62413.9,8.0,1999.24,200000000.0,2016.0,17000.0,8.5,4.0,93850.0
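
The percentile rows of `describe` and the rows returned by `quantile` carry the same numbers under different labels. The sketch below checks this on a made-up, purely numeric frame. On a frame that also holds string columns, recent pandas versions may require `numeric_only=True` for `quantile`; that is an assumption about the installed version, not something shown in the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with numeric columns only
df = pd.DataFrame({'a': range(1, 101), 'b': range(101, 201)})

summary = df.describe(percentiles=[.1, .9])
quantiles = df.quantile([.1, .9])

# describe labels the rows '10%' and '90%'; quantile labels them 0.1 and 0.9
same_10 = np.allclose(summary.loc['10%'], quantiles.loc[0.1])
same_90 = np.allclose(summary.loc['90%'], quantiles.loc[0.9])
```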


## >> There's more... 2.4

In [15]:
movie.min(skipna=True)

num_critic_for_reviews          1.00
duration                        7.00
director_facebook_likes         0.00
actor_3_facebook_likes          0.00
actor_1_facebook_likes          0.00
gross                         162.00
num_voted_users                 5.00
cast_total_facebook_likes       0.00
facenumber_in_poster            0.00
num_user_for_reviews            1.00
budget                        218.00
title_year                   1916.00
actor_2_facebook_likes          0.00
imdb_score                      1.60
aspect_ratio                    1.18
movie_facebook_likes            0.00
dtype: float64

In [14]:
movie.select_dtypes('object').min(skipna = True)

Series([], dtype: float64)

In [139]:
movie.select_dtypes('object').fillna('').min(skipna = False)

color                                                               
director_name                                                       
actor_2_name                                                        
genres                                                        Action
actor_1_name                                                        
movie_title                                                  #Horror
actor_3_name                                                        
plot_keywords                                                       
movie_imdb_link    http://www.imdb.com/title/tt0006864/?ref_=fn_t...
language                                                            
country                                                             
content_rating                                                      
dtype: object
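
The effect of `skipna` is easiest to see on numeric data. A minimal sketch on a made-up frame (the column names `full` and `gappy` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one complete column, one with a missing value
df = pd.DataFrame({'full': [3.0, 1.0, 2.0], 'gappy': [3.0, np.nan, 2.0]})

skipped = df.min()               # skipna=True is the default
strict = df.min(skipna=False)    # a NaN in a column makes its result NaN
```

With the default, missing values are ignored; with `skipna=False`, any column containing a missing value reports `NaN`.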

# 2.5 Chaining DataFrame methods together

In [38]:
### [Tech] Chaining methods together (object-oriented style; always keep the type of each return value in mind)
### [Goal] Count the missing values in the movie table

## >> How it works...

In [39]:
#2.5.1 Flag whether each value is missing
movie = pd.read_csv('data/movie.csv')
movie.isnull().head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,True,False,True,True,False,True,False,False,True,False,False,False,False,False,True,False,True,False,True,True,True,True,True,True,False,False,True,False


In [40]:
#2.5.2 Booleans are treated as numbers when summed: True -> 1, False -> 0
movie.isnull().sum().head()

color                       19
director_name              102
num_critic_for_reviews      49
duration                    15
director_facebook_likes    102
dtype: int64

In [41]:
#2.5.3 Total number of missing values
movie.isnull().sum().sum()

2654

In [42]:
#2.5.4 Check whether any missing value exists at all
movie.isnull().any().any()

True

In [43]:
movie.isnull().all()

color                        False
director_name                False
num_critic_for_reviews       False
duration                     False
director_facebook_likes      False
actor_3_facebook_likes       False
actor_2_name                 False
actor_1_facebook_likes       False
gross                        False
genres                       False
actor_1_name                 False
movie_title                  False
num_voted_users              False
cast_total_facebook_likes    False
actor_3_name                 False
facenumber_in_poster         False
plot_keywords                False
movie_imdb_link              False
num_user_for_reviews         False
language                     False
country                      False
content_rating               False
budget                       False
title_year                   False
actor_2_facebook_likes       False
imdb_score                   False
aspect_ratio                 False
movie_facebook_likes         False
dtype: bool

In [44]:
movie.isnull().dtypes.value_counts()

bool    28
dtype: int64
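
Each link of the chain changes the type of the object being handled: DataFrame, then Series, then scalar. The sketch below spells out the intermediate steps on a made-up frame so the return types are visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with three missing values in total
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': ['x', None, None]})

mask = df.isnull()                # DataFrame of booleans
per_column = mask.sum()           # Series: missing count per column
total = per_column.sum()          # scalar: missing count overall
has_missing = mask.any().any()    # scalar boolean
```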

## >> There's more... 2.5

In [45]:
# With missing values present, aggregations on string (object) columns do not work
movie[['color', 'movie_title', 'color']].max()

Series([], dtype: float64)

In [46]:
# Replace the missing values with fillna('')
movie[['color', 'movie_title', 'color']].fillna('').max()

color             Color
movie_title    Æon Flux
color             Color
dtype: object

In [47]:
# Maximum of the string columns
movie.select_dtypes(['object']).fillna('').max()

color                                                          Color
director_name                                          Étienne Faure
actor_2_name                                           Zubaida Sahar
genres                                                       Western
actor_1_name                                           Óscar Jaenada
movie_title                                                 Æon Flux
actor_3_name                                           Óscar Jaenada
plot_keywords                                    zombie|zombie spoof
movie_imdb_link    http://www.imdb.com/title/tt5574490/?ref_=fn_t...
language                                                        Zulu
country                                                 West Germany
content_rating                                                     X
dtype: object

In [48]:
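# Minimum of the string columns (missing values filled with 'zz' so the fill value is not picked as the minimum)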
movie.select_dtypes(['object']).fillna('zz').min()

color                                                Black and White
director_name                                          A. Raven Cruz
actor_2_name                                                 50 Cent
genres                                                        Action
actor_1_name                                                 50 Cent
movie_title                                                  #Horror
actor_3_name                                                 50 Cent
plot_keywords               10 year old|dog|florida|girl|supermarket
movie_imdb_link    http://www.imdb.com/title/tt0006864/?ref_=fn_t...
language                                                  Aboriginal
country                                                  Afghanistan
content_rating                                              Approved
dtype: object

# 2.6 Working with operators on a DataFrame

In [49]:
### [Tech] Applying operators to a DataFrame: mind the data types, numeric vs. object
### [Goal] Round the race columns of the college table using operators only (re-implementing .round())

## >> Getting ready...

In [16]:
college = pd.read_csv('data/college.csv')

In [51]:
college.head(3)

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0


In [52]:
# DataFrame + scalar
college + 3

TypeError: can only concatenate str (not "int") to str

In [53]:
# DataFrame + scalar (numeric columns only)
college.select_dtypes('number') +3

Unnamed: 0,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV
0,4.0,3.0,3.0,3,427.0,423.0,3.0,4209.0,3.0333,3.9353,3.0055,3.0019,3.0024,3.0019,3.0000,3.0059,3.0138,3.0656,4,3.7356,3.8284,3.1049
1,3.0,3.0,3.0,3,573.0,568.0,3.0,11386.0,3.5922,3.2600,3.0283,3.0518,3.0022,3.0007,3.0368,3.0179,3.0100,3.2607,4,3.3460,3.5214,3.2422
2,3.0,3.0,3.0,4,,,4.0,294.0,3.2990,3.4192,3.0069,3.0034,3.0000,3.0000,3.0000,3.0000,3.2715,3.4536,4,3.6801,3.7795,3.8540
3,3.0,3.0,3.0,3,598.0,593.0,3.0,5454.0,3.6988,3.1255,3.0382,3.0376,3.0143,3.0002,3.0172,3.0332,3.0350,3.2146,4,3.3072,3.4596,3.2640
4,4.0,3.0,3.0,3,428.0,433.0,3.0,4814.0,3.0158,3.9208,3.0121,3.0019,3.0010,3.0006,3.0098,3.0243,3.0137,3.0892,4,3.7347,3.7554,3.1270
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7530,,,,4,,,,,,,,,,,,,,,4,,,
7531,,,,4,,,,,,,,,,,,,,,4,,,
7532,,,,4,,,,,,,,,,,,,,,4,,,
7533,,,,4,,,,,,,,,,,,,,,4,,,


In [54]:
# DataFrame + scalar (string columns only)
college.select_dtypes('object') +'3'

Unnamed: 0,INSTNM,CITY,STABBR,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
0,Alabama A & M University3,Normal3,AL3,303003,338883
1,University of Alabama at Birmingham3,Birmingham3,AL3,397003,21941.53
2,Amridge University3,Montgomery3,AL3,401003,233703
3,University of Alabama in Huntsville3,Huntsville3,AL3,455003,240973
4,Alabama State University3,Montgomery3,AL3,266003,33118.53
...,...,...,...,...,...
7530,SAE Institute of Technology San Francisco3,Emeryville3,CA3,,95003
7531,Rasmussen College - Overland Park3,Overland Park3,KS3,,211633
7532,National Personal Training Institute of Clevel...,Highland Heights3,OH3,,63333
7533,Bay Area Medical Academy - San Jose Satellite ...,San Jose3,CA3,,PrivacySuppressed3


## >> How to do it...

In [20]:
# Select only the columns of interest before applying the operators
college = pd.read_csv('data/college.csv', index_col='INSTNM')
college_ugds_ = college.filter(like='UGDS_')
college_ugds_.head(3)

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715


In [21]:
#2.6.1 Implement rounding to two decimal places by hand
# (add .005, written as .00501 here, then floor-divide by .01 and divide by 100)
college_ugds_.head() + .00501

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.03831,0.94031,0.01051,0.00691,0.00741,0.00691,0.00501,0.01091,0.01881
University of Alabama at Birmingham,0.59721,0.26501,0.03331,0.05681,0.00721,0.00571,0.04181,0.02291,0.01501
Amridge University,0.30401,0.42421,0.01191,0.00841,0.00501,0.00501,0.00501,0.00501,0.27651
University of Alabama in Huntsville,0.70381,0.13051,0.04321,0.04261,0.01931,0.00521,0.02221,0.03821,0.04001
Alabama State University,0.02081,0.92581,0.01711,0.00691,0.00601,0.00561,0.01481,0.02931,0.01871


In [22]:
#2.6.2 Floor-divide by .01 to get the whole-number percentage
(college_ugds_.head() + .00501) // .01 

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,3.0,94.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0
University of Alabama at Birmingham,59.0,26.0,3.0,5.0,0.0,0.0,4.0,2.0,1.0
Amridge University,30.0,42.0,1.0,0.0,0.0,0.0,0.0,0.0,27.0
University of Alabama in Huntsville,70.0,13.0,4.0,4.0,1.0,0.0,2.0,3.0,4.0
Alabama State University,2.0,92.0,1.0,0.0,0.0,0.0,1.0,2.0,1.0


In [23]:
#2.6.3 Divide by 100 to return to a proportion, applied to the whole table
college_ugds_op_round = (college_ugds_ + .00501) // .01 / 100
college_ugds_op_round.head()

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.03,0.94,0.01,0.0,0.0,0.0,0.0,0.01,0.01
University of Alabama at Birmingham,0.59,0.26,0.03,0.05,0.0,0.0,0.04,0.02,0.01
Amridge University,0.3,0.42,0.01,0.0,0.0,0.0,0.0,0.0,0.27
University of Alabama in Huntsville,0.7,0.13,0.04,0.04,0.01,0.0,0.02,0.03,0.04
Alabama State University,0.02,0.92,0.01,0.0,0.0,0.0,0.01,0.02,0.01


In [24]:
#2.6.4 Round with the round method
college_ugds_round = (college_ugds_ + .00001).round(2)
college_ugds_round.head()

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.03,0.94,0.01,0.0,0.0,0.0,0.0,0.01,0.01
University of Alabama at Birmingham,0.59,0.26,0.03,0.05,0.0,0.0,0.04,0.02,0.01
Amridge University,0.3,0.42,0.01,0.0,0.0,0.0,0.0,0.0,0.27
University of Alabama in Huntsville,0.7,0.13,0.04,0.04,0.01,0.0,0.02,0.03,0.04
Alabama State University,0.02,0.92,0.01,0.0,0.0,0.0,0.01,0.02,0.01


In [25]:
college_ugds_round2 = (college_ugds_).round(2)
college_ugds_round2.head()


Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.03,0.94,0.01,0.0,0.0,0.0,0.0,0.01,0.01
University of Alabama at Birmingham,0.59,0.26,0.03,0.05,0.0,0.0,0.04,0.02,0.01
Amridge University,0.3,0.42,0.01,0.0,0.0,0.0,0.0,0.0,0.27
University of Alabama in Huntsville,0.7,0.13,0.04,0.04,0.01,0.0,0.02,0.03,0.04
Alabama State University,0.02,0.92,0.01,0.0,0.0,0.0,0.01,0.02,0.01


In [26]:
#2.6.5 Check whether the two results are equal
college_ugds_op_round.equals(college_ugds_round2)

False

In [61]:
# How floating-point arithmetic handles very small values

In [17]:
(.045 + .005)

0.049999999999999996

In [18]:
(.045 + .005+ .00001)

0.05001

In [64]:
round (.045 + .005, 2)

0.05
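The cells above show why exact equality on floats is fragile. A minimal sketch of a tolerance-based comparison follows; it uses only the standard-library `math.isclose`, and the variable name `total` is chosen here for illustration.

```python
import math

total = .045 + .005   # displayed as 0.049999999999999996 in the cell above

# Exact equality fails because of the representation error
print(total == .05)

# A tolerance-based check treats the two values as equal
print(math.isclose(total, .05))
```

The same idea is why `equals` returned `False` in 2.6.5: the two rounding routes can differ in the last bits of a float.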

## >> There's more... 2.6

In [65]:
college_ugds_op_round_methods = college_ugds_.add(.00501).floordiv(.01).div(100)
college_ugds_op_round_methods.equals(college_ugds_op_round)

True
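The operator form and the method form are the same computation, so their results match exactly. The sketch below repeats the comparison on a tiny hypothetical table (`props`, with made-up school labels) so that it runs without `data/college.csv`.

```python
import pandas as pd

# Hypothetical proportions standing in for college_ugds_
props = pd.DataFrame({'UGDS_WHITE': [0.0333, 0.5922],
                      'UGDS_BLACK': [0.9353, 0.2600]},
                     index=['School A', 'School B'])

# Operator version: shift, floor-divide by .01, scale back down
op_round = (props + .00501) // .01 / 100

# Method version: each operator has an equivalent method
method_round = props.add(.00501).floordiv(.01).div(100)

print(op_round)
print(op_round.equals(method_round))
```

The method form is convenient in a chain, and each method also accepts `axis` and `fill_value` arguments that the operators cannot take.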

# 2.7 Comparing missing values

In [66]:
### [Tech] Understanding missing values: np.nan, None; np.nan == np.nan --> False
### [Goal] Count the missing values in the college table

In [67]:
np.nan == np.nan

False

In [68]:
None == None

True

In [69]:
5 > np.nan

False

In [70]:
np.nan > 5

False

In [71]:
np.nan == 5

False

In [72]:
# With np.nan, only the "not equal" comparison against a number or string returns True
5 != np.nan

True
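Since `==` can never detect `np.nan`, a dedicated test is needed for single values as well. A short sketch, assuming only NumPy and pandas; the name `value` is illustrative.

```python
import numpy as np
import pandas as pd

value = np.nan

print(value == np.nan)   # False: NaN never equals anything, itself included
print(np.isnan(value))   # True
print(pd.isna(value))    # True
print(pd.isna(None))     # True: pandas treats None as missing too
print(value != value)    # True: a value that differs from itself is NaN
```

`pd.isna` is the more general of the two, because `np.isnan` raises a `TypeError` when given `None` or a string.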

In [73]:
college.shape

(7535, 26)

In [74]:
college.dropna().shape

(1171, 26)

In [75]:
college.dropna(axis =0 , how = 'any').shape

(1171, 26)

In [76]:
college.dropna(axis =1 , how = 'any').shape

(7535, 4)

In [77]:
college.dropna(axis =0 , how = 'all').shape

(7535, 26)
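The `axis` and `how` arguments of `dropna` are easier to follow on a table small enough to check by eye. This is a sketch on a hypothetical three-row frame, not the college data; `thresh` is shown as one more option that the cells above do not use.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 is complete, row 2 has a single value
df = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                   'b': [2.0, 5.0, np.nan],
                   'c': [3.0, 6.0, 9.0]})

print(df.dropna().shape)             # (1, 3): drop rows with any NaN
print(df.dropna(how='all').shape)    # (3, 3): no row is entirely NaN
print(df.dropna(axis=1).shape)       # (3, 1): only column 'c' has no NaN
print(df.dropna(thresh=2).shape)     # (2, 3): keep rows with >= 2 non-missing values
```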

## >> Getting ready...

In [78]:
college = pd.read_csv('data/college.csv', index_col='INSTNM')
college_ugds_ = college.filter(like='UGDS_')

In [79]:
college_ugds_.head()

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
University of Alabama in Huntsville,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
Alabama State University,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


## >> How to do it ...

In [80]:
#2.7.1 Compare with a scalar using ==
college_ugds_.head() == .0019

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,False,False,False,True,False,True,False,False,False
University of Alabama at Birmingham,False,False,False,False,False,False,False,False,False
Amridge University,False,False,False,False,False,False,False,False,False
University of Alabama in Huntsville,False,False,False,False,False,False,False,False,False
Alabama State University,False,False,False,True,False,False,False,False,False


In [81]:
#2.7.2 self compare
college_self_compare = college_ugds_ == college_ugds_
college_self_compare.head()

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,True,True,True,True,True,True,True,True,True
University of Alabama at Birmingham,True,True,True,True,True,True,True,True,True
Amridge University,True,True,True,True,True,True,True,True,True
University of Alabama in Huntsville,True,True,True,True,True,True,True,True,True
Alabama State University,True,True,True,True,True,True,True,True,True


In [82]:
#2.7.3 Effect of missing values: no column is entirely True
college_self_compare.all()

UGDS_WHITE    False
UGDS_BLACK    False
UGDS_HISP     False
UGDS_ASIAN    False
UGDS_AIAN     False
UGDS_NHPI     False
UGDS_2MOR     False
UGDS_NRA      False
UGDS_UNKN     False
dtype: bool

In [83]:
#2.7.4 np.nan == np.nan returns False, so == cannot be used to count missing values
(college_ugds_ == np.nan).sum()

UGDS_WHITE    0
UGDS_BLACK    0
UGDS_HISP     0
UGDS_ASIAN    0
UGDS_AIAN     0
UGDS_NHPI     0
UGDS_2MOR     0
UGDS_NRA      0
UGDS_UNKN     0
dtype: int64

In [84]:
#2.7.5 Use .isnull() to test for missing values
college_ugds_.isnull().sum()

UGDS_WHITE    661
UGDS_BLACK    661
UGDS_HISP     661
UGDS_ASIAN    661
UGDS_AIAN     661
UGDS_NHPI     661
UGDS_2MOR     661
UGDS_NRA      661
UGDS_UNKN     661
dtype: int64

In [85]:
#2.7.6 Use .equals() to compare two DataFrames
college_ugds_.equals(college_ugds_)

True

In [86]:
(college_ugds_==college_ugds_).all().all()

False
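The two results above differ only because of missing values. The sketch below reproduces this on a hypothetical two-row frame and shows one way to build an element-wise comparison that counts a pair of NaNs as a match.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan], 'y': [3.0, 4.0]})
other = df.copy()

# Element-wise ==: the NaN cell compares as False
elementwise = df == other
print(elementwise.all().all())

# equals: NaNs in the same location count as equal
print(df.equals(other))

# NaN-aware element-wise check: equal, or missing on both sides
nan_aware = elementwise | (df.isnull() & other.isnull())
print(nan_aware.all().all())
```

The NaN-aware mask is useful when the goal is to locate which cells differ, which `equals` alone cannot report.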

## >> There's more... 2.7

In [87]:
# .eq() and .equals() are different: .eq() compares element by element
college_ugds_.eq(.0019).head()

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,False,False,False,True,False,True,False,False,False
University of Alabama at Birmingham,False,False,False,False,False,False,False,False,False
Amridge University,False,False,False,False,False,False,False,False,False
University of Alabama in Huntsville,False,False,False,False,False,False,False,False,False
Alabama State University,False,False,False,True,False,False,False,False,False


In [88]:
# .equals vs. assert_frame_equal (which raises an AssertionError on a mismatch)
from pandas.testing import assert_frame_equal
assert_frame_equal(college_ugds_, college_ugds_)

In [89]:
assert_frame_equal(college_ugds_, college)

AssertionError: DataFrame are different

DataFrame shape mismatch
[left]:  (7535, 9)
[right]: (7535, 26)
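Because `assert_frame_equal` raises instead of returning a value, it can be wrapped when a plain True/False is wanted. In the sketch below, `frames_match` is a hypothetical helper written for this note, not part of pandas; `check_dtype` is a real keyword of `assert_frame_equal`.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def frames_match(left, right, **kwargs):
    """Hypothetical helper: True if assert_frame_equal passes, else False."""
    try:
        assert_frame_equal(left, right, **kwargs)
    except AssertionError:
        return False
    return True


ints = pd.DataFrame({'x': [1, 2]})
floats = pd.DataFrame({'x': [1.0, 2.0]})

print(ints.equals(floats))                            # dtypes differ
print(frames_match(ints, floats))                     # strict check also fails
print(frames_match(ints, floats, check_dtype=False))  # values alone match
```

In unit tests the unwrapped call is usually preferable, since the error message states exactly what differs, as in the shape mismatch shown above.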

In [90]:
s = pd.Series ([0,1,None , 10, 20])
s

0     0.0
1     1.0
2     NaN
3    10.0
4    20.0
dtype: float64

In [91]:
s.fillna(method = 'ffill')

0     0.0
1     1.0
2     1.0
3    10.0
4    20.0
dtype: float64

In [92]:
s.fillna(method = 'bfill')

0     0.0
1     1.0
2    10.0
3    10.0
4    20.0
dtype: float64
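The same fills can be written with the dedicated `ffill` and `bfill` methods, which avoid the `method=` keyword of `fillna` (recent pandas releases deprecate that keyword). The sketch below also shows `interpolate` as a third option that the cells above do not cover.

```python
import pandas as pd

s = pd.Series([0, 1, None, 10, 20])

forward = s.ffill()        # carry the previous value forward
backward = s.bfill()       # pull the next value backward
linear = s.interpolate()   # default linear interpolation between neighbours

print(pd.DataFrame({'ffill': forward, 'bfill': backward, 'interpolate': linear}))
```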

# 2.8 Transposing the direction of a DataFrame operation

In [93]:
### [Tech] Handling the axis parameter: axis = 0 ('index', the default) or 1 ('columns')
### [Goal] Compute on the college race proportions (UGDS_): per school and per race

## >> How it works...

In [27]:
#2.8.1 Build the table of UGDS_ race-proportion columns
college = pd.read_csv('data/college.csv', index_col='INSTNM')
college_ugds_ = college.filter(like='UGDS_')
college_ugds_.head()

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
University of Alabama in Huntsville,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
Alabama State University,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


In [95]:
college_ugds_.shape

(7535, 9)

In [96]:
#2.8.2 The default axis=0 ('index') aggregates down the rows, giving one result per column
college_ugds_.count()

UGDS_WHITE    6874
UGDS_BLACK    6874
UGDS_HISP     6874
UGDS_ASIAN    6874
UGDS_AIAN     6874
UGDS_NHPI     6874
UGDS_2MOR     6874
UGDS_NRA      6874
UGDS_UNKN     6874
dtype: int64

In [97]:
college_ugds_.count(axis=0)

UGDS_WHITE    6874
UGDS_BLACK    6874
UGDS_HISP     6874
UGDS_ASIAN    6874
UGDS_AIAN     6874
UGDS_NHPI     6874
UGDS_2MOR     6874
UGDS_NRA      6874
UGDS_UNKN     6874
dtype: int64

In [98]:
college_ugds_.count(axis='index')

UGDS_WHITE    6874
UGDS_BLACK    6874
UGDS_HISP     6874
UGDS_ASIAN    6874
UGDS_AIAN     6874
UGDS_NHPI     6874
UGDS_2MOR     6874
UGDS_NRA      6874
UGDS_UNKN     6874
dtype: int64

In [99]:
#2.8.3 axis=1 ('columns') aggregates across the columns, giving one result per row
college_ugds_.count(axis='columns').head()

INSTNM
Alabama A & M University               9
University of Alabama at Birmingham    9
Amridge University                     9
University of Alabama in Huntsville    9
Alabama State University               9
dtype: int64

In [28]:
#2.8.4 Each row (school) sums to about 1, so the values are proportions within a school
college_ugds_.sum(axis='columns').head()

INSTNM
Alabama A & M University               1.0000
University of Alabama at Birmingham    0.9999
Amridge University                     1.0000
University of Alabama in Huntsville    1.0000
Alabama State University               1.0000
dtype: float64

In [29]:
# With the default axis=0, sum instead adds each column over all schools
college_ugds_.sum().head()

UGDS_WHITE    3507.1643
UGDS_BLACK    1306.0369
UGDS_HISP     1111.0782
UGDS_ASIAN     230.5831
UGDS_AIAN       94.9476
dtype: float64

In [101]:
#2.8.5 To get the median proportion of each race, take the median over the rows that make up each column
college_ugds_.median(axis='index')

UGDS_WHITE    0.55570
UGDS_BLACK    0.10005
UGDS_HISP     0.07140
UGDS_ASIAN    0.01290
UGDS_AIAN     0.00260
UGDS_NHPI     0.00000
UGDS_2MOR     0.01750
UGDS_NRA      0.00000
UGDS_UNKN     0.01430
dtype: float64
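The direction of `axis` is easy to mix up, so a tiny table helps. This sketch uses a hypothetical frame `shares` whose values are exact binary fractions, so the sums can be checked without rounding noise.

```python
import pandas as pd

# Hypothetical shares: each row (school) adds up to 1
shares = pd.DataFrame({'A': [0.25, 0.5, 0.125],
                       'B': [0.75, 0.5, 0.875]},
                      index=['s1', 's2', 's3'])

per_column = shares.sum(axis='index')      # collapses the rows: one value per column
per_row = shares.sum(axis='columns')       # collapses the columns: one value per row
median_per_column = shares.median(axis=0)  # 0 is the same as 'index'

print(per_column)
print(per_row)
print(median_per_column)
```

One way to remember it: the axis names the dimension that disappears, so `axis='columns'` leaves a result indexed by the rows.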

## >> There's more... 2.8

In [103]:
# Show each school's proportions cumulatively with .cumsum(axis=1)
college_ugds_cumsum = college_ugds_.cumsum(axis=1)
college_ugds_cumsum.head()

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.0333,0.9686,0.9741,0.976,0.9784,0.9803,0.9803,0.9862,1.0
University of Alabama at Birmingham,0.5922,0.8522,0.8805,0.9323,0.9345,0.9352,0.972,0.9899,0.9999
Amridge University,0.299,0.7182,0.7251,0.7285,0.7285,0.7285,0.7285,0.7285,1.0
University of Alabama in Huntsville,0.6988,0.8243,0.8625,0.9001,0.9144,0.9146,0.9318,0.965,1.0
Alabama State University,0.0158,0.9366,0.9487,0.9506,0.9516,0.9522,0.962,0.9863,1.0


In [104]:
college_ugds_.cumsum(axis=0)

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0000,0.0059,0.0138
University of Alabama at Birmingham,0.6255,1.1953,0.0338,0.0537,0.0046,0.0026,0.0368,0.0238,0.0238
Amridge University,0.9245,1.6145,0.0407,0.0571,0.0046,0.0026,0.0368,0.0238,0.2953
University of Alabama in Huntsville,1.6233,1.7400,0.0789,0.0947,0.0189,0.0028,0.0540,0.0570,0.3303
Alabama State University,1.6391,2.6608,0.0910,0.0966,0.0199,0.0034,0.0638,0.0813,0.3440
...,...,...,...,...,...,...,...,...,...
SAE Institute of Technology San Francisco,,,,,,,,,
Rasmussen College - Overland Park,,,,,,,,,
National Personal Training Institute of Cleveland,,,,,,,,,
Bay Area Medical Academy - San Jose Satellite Location,,,,,,,,,
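The NaN rows at the end of the output above follow from how `cumsum` treats missing values: a NaN cell stays NaN, while the running total continues past it. A sketch on a hypothetical frame with one partial and one fully missing row:

```python
import numpy as np
import pandas as pd

shares = pd.DataFrame({'A': [0.25, 0.5, np.nan],
                       'B': [0.5, np.nan, np.nan],
                       'C': [0.25, 0.5, np.nan]},
                      index=['s1', 's2', 's3'])

# Accumulate left to right within each row
running = shares.cumsum(axis=1)
print(running)
```

For a complete row such as `s1`, the last column of the cumulative table equals the row total.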


# 2.9 Determining college campus diversity

In [105]:
### [Tech] Review: chaining, missing values, sorting, logical comparison, return type of operations
### [Goal] Build a diversity metric from the race proportions and apply it

In [106]:
# Top 10 schools by diversity index
pd.read_csv('data/college_diversity.csv', index_col='School')


Unnamed: 0_level_0,Diversity Index
School,Unnamed: 1_level_1
"Rutgers University--Newark Newark, NJ",0.76
"Andrews University Berrien Springs, MI",0.74
"Stanford University Stanford, CA",0.74
"University of Houston Houston, TX",0.74
"University of Nevada--Las Vegas Las Vegas, NV",0.74
"University of San Francisco San Francisco, CA",0.74
"San Francisco State University San Francisco, CA",0.73
"University of Illinois--Chicago Chicago, IL",0.73
"New Jersey Institute of Technology Newark, NJ",0.72
"Texas Woman's University Denton, TX",0.72


## >> How it works...

In [107]:
#2.9.1 Undergraduate race table
college = pd.read_csv('data/college.csv', index_col='INSTNM')
college_ugds_ = college.filter(like='UGDS_')
college_ugds_.head(3)

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715


In [108]:
college_ugds_.shape

(7535, 9)

In [109]:
college_ugds_.columns

Index(['UGDS_WHITE', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN',
       'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN'],
      dtype='object')

In [110]:
coleageDD = pd.read_csv('data/college_data_dictionary.csv' , 
                        index_col = 'column_name')
coleageDD.loc['UGDS':'UGDS_UNKN']

Unnamed: 0_level_0,description
column_name,Unnamed: 1_level_1
UGDS,Undergraduate Enrollment
UGDS_WHITE,Percent Undergrad White
UGDS_BLACK,Percent Undergrad Black
UGDS_HISP,Percent Undergrad Hispanic
UGDS_ASIAN,Percent Undergrad Asian
UGDS_AIAN,Percent Undergrad American Indian/Alaskan Native
UGDS_NHPI,Percent Undergrad Native Hawaiian/Pacific Isla...
UGDS_2MOR,Percent Undergrad 2 or more races
UGDS_NRA,Percent Undergrad non-resident aliens
UGDS_UNKN,Percent Undergrad race unknown


In [111]:
college_ugds_.isnull().sum(axis=1)

INSTNM
Alabama A & M University                                  0
University of Alabama at Birmingham                       0
Amridge University                                        0
University of Alabama in Huntsville                       0
Alabama State University                                  0
                                                         ..
SAE Institute of Technology  San Francisco                9
Rasmussen College - Overland Park                         9
National Personal Training Institute of Cleveland         9
Bay Area Medical Academy - San Jose Satellite Location    9
Excel Learning Center-San Antonio South                   9
Length: 7535, dtype: int64

In [112]:
#2.9.2 Number of missing values per school
college_ugds_.isnull().sum(axis=1).sort_values(ascending=False).head()

INSTNM
Excel Learning Center-San Antonio South         9
Philadelphia College of Osteopathic Medicine    9
Assemblies of God Theological Seminary          9
Episcopal Divinity School                       9
Phillips Graduate Institute                     9
dtype: int64

In [113]:
#2.9.3 Drop rows where every column is missing, then check for remaining missing values
college_ugds_ = college_ugds_.dropna(how='all')

In [114]:
college_ugds_.isnull().sum()

UGDS_WHITE    0
UGDS_BLACK    0
UGDS_HISP     0
UGDS_ASIAN    0
UGDS_AIAN     0
UGDS_NHPI     0
UGDS_2MOR     0
UGDS_NRA      0
UGDS_UNKN     0
dtype: int64

In [115]:
college_ugds_.isnull().sum().sum()

0

In [116]:
#2.9.4 Convert to True/False with a comparison to flag values of 15% or more
college_ugds_.ge(.15).head()

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,False,True,False,False,False,False,False,False,False
University of Alabama at Birmingham,True,True,False,False,False,False,False,False,False
Amridge University,True,True,False,False,False,False,False,False,True
University of Alabama in Huntsville,True,False,False,False,False,False,False,False,False
Alabama State University,False,True,False,False,False,False,False,False,False


In [117]:
#2.9.5 Number of races at 15% or more for each school
diversity_metric = college_ugds_.ge(.15).sum(axis='columns')
diversity_metric.head()

INSTNM
Alabama A & M University               1
University of Alabama at Birmingham    2
Amridge University                     3
University of Alabama in Huntsville    1
Alabama State University               1
dtype: int64
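The metric relies on `True` counting as 1 when summed. The sketch below rebuilds it for three schools whose rows are copied from the outputs in this recipe, so it runs without the CSV file; the names `rows`, `sample`, and `metric` are chosen here for illustration.

```python
import pandas as pd

cols = ['UGDS_WHITE', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN',
        'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN']

# Rows copied from the tables shown in this recipe
rows = {
    'Alabama A & M University':
        [0.0333, 0.9353, 0.0055, 0.0019, 0.0024, 0.0019, 0.0, 0.0059, 0.0138],
    'Regency Beauty Institute-Austin':
        [0.1867, 0.2133, 0.16, 0.0, 0.0, 0.0, 0.1733, 0.0, 0.2667],
    'Central Texas Beauty College-Temple':
        [0.1616, 0.2323, 0.2626, 0.0202, 0.0, 0.0, 0.1717, 0.0, 0.1515],
}
sample = pd.DataFrame.from_dict(rows, orient='index', columns=cols)

# Boolean frame -> count of True values per row
metric = sample.ge(.15).sum(axis='columns')
print(metric)
```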

In [118]:
#2.9.6 Distribution of the number of races at 15% or more
diversity_metric.value_counts()

1    3042
2    2884
3     876
4      63
0       7
5       2
dtype: int64

In [119]:
#2.9.7 Schools with the highest counts
diversity_metric.sort_values(ascending=False).head()

INSTNM
Regency Beauty Institute-Austin          5
Central Texas Beauty College-Temple      5
Sullivan and Cogliano Training Center    4
Ambria College of Nursing                4
Berkeley College-New York                4
dtype: int64

In [120]:
#2.9.8 Inspect the top two schools: the _2MOR and _UNKN shares are high
college_ugds_.loc[['Regency Beauty Institute-Austin', 
                          'Central Texas Beauty College-Temple']]

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Regency Beauty Institute-Austin,0.1867,0.2133,0.16,0.0,0.0,0.0,0.1733,0.0,0.2667
Central Texas Beauty College-Temple,0.1616,0.2323,0.2626,0.0202,0.0,0.0,0.1717,0.0,0.1515


In [140]:
# The seven schools with a metric of 0 report 0.0 in every column
zeros = diversity_metric.sort_values(ascending=True).head(7).index.to_list()
college_ugds_.loc[zeros]

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Taft University System,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
American Conservatory Theater,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Prince Institute-Rocky Mountains,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Spanish-American Institute,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Professional Business College,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Lyme Academy College of Fine Arts,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Education and Technology Institute,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [123]:
#2.9.9 Diversity metric for the top schools in the U.S. News ranking
us_news_top = ['Rutgers University-Newark', 
               'Andrews University', 
               'Stanford University', 
               'University of Houston',
               'University of Nevada-Las Vegas']

diversity_metric.loc[us_news_top]

INSTNM
Rutgers University-Newark         4
Andrews University                3
Stanford University               3
University of Houston             3
University of Nevada-Las Vegas    3
dtype: int64
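The school names had to be retyped because the U.S. News labels (for example `Stanford University Stanford, CA`) do not match the `INSTNM` index. Selecting a missing label with `.loc` raises a `KeyError`; `reindex` is a more forgiving alternative that returns NaN instead. A sketch on a hypothetical two-school Series:

```python
import pandas as pd

# Values taken from the output above; the Series itself is a stand-in
metric = pd.Series({'Stanford University': 3, 'University of Houston': 3})
wanted = ['Stanford University', 'Stanford University Stanford, CA']

# .loc with a list containing an unknown label raises KeyError
try:
    metric.loc[wanted]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)

# reindex keeps the requested order and marks unknown labels as NaN
safe = metric.reindex(wanted)
print(safe)
```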

## >> There's more... 2.9

In [124]:
# Conversely, to find the least diverse schools, sort by the largest single race proportion in descending order
college_ugds_.max(axis=1).sort_values(ascending=False).head(10)

INSTNM
Dewey University-Manati                               1.0
Yeshiva and Kollel Harbotzas Torah                    1.0
Mr Leon's School of Hair Design-Lewiston              1.0
Dewey University-Bayamon                              1.0
Shepherds Theological Seminary                        1.0
Yeshiva Gedolah Kesser Torah                          1.0
Monteclaro Escuela de Hoteleria y Artes Culinarias    1.0
Yeshiva Shaar Hatorah                                 1.0
Bais Medrash Elyon                                    1.0
Yeshiva of Nitra Rabbinical College                   1.0
dtype: float64

In [125]:
college_ugds_.max(axis=1).sort_values(ascending = False)

INSTNM
Dewey University-Manati                     1.0
Yeshiva and Kollel Harbotzas Torah          1.0
Mr Leon's School of Hair Design-Lewiston    1.0
Dewey University-Bayamon                    1.0
Shepherds Theological Seminary              1.0
                                           ... 
Taft University System                      0.0
American Conservatory Theater               0.0
Education and Technology Institute          0.0
Lyme Academy College of Fine Arts           0.0
Spanish-American Institute                  0.0
Length: 6874, dtype: float64
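The row maximum gives the size of the dominant group but not which group it is. As a possible extension, `idxmax(axis=1)` returns the column label of that maximum. The frame below is hypothetical.

```python
import pandas as pd

sample = pd.DataFrame({'UGDS_WHITE': [1.0, 0.25],
                       'UGDS_BLACK': [0.0, 0.5],
                       'UGDS_HISP': [0.0, 0.25]},
                      index=['School A', 'School B'])

largest_share = sample.max(axis=1)      # size of the dominant group
largest_group = sample.idxmax(axis=1)   # label of the dominant group

print(pd.DataFrame({'share': largest_share, 'group': largest_group}))
```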

In [126]:
# Is there any school where every race category exceeds 1%?
(college_ugds_ > .01).all(axis=1).any()

True

In [142]:
(college_ugds_ > .01).all(axis=1).sum()

14
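The chain above reduces a boolean frame in two steps: `all(axis=1)` per row, then `any()` or `sum()` over the resulting Series. A sketch on a hypothetical three-school frame makes each step visible, and ends with the same boolean indexing used to list the matching schools.

```python
import pandas as pd

sample = pd.DataFrame({'g1': [0.5, 0.98, 0.4],
                       'g2': [0.3, 0.02, 0.6],
                       'g3': [0.2, 0.0, 0.0]},
                      index=['s1', 's2', 's3'])

above = sample > .01                 # element-wise boolean frame
every_group = above.all(axis=1)      # True only if all columns in the row pass

print(every_group.any())             # does at least one school qualify?
print(every_group.sum())             # how many schools qualify?
print(sample[every_group])           # which schools qualify?
```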

In [145]:
college_ugds_[(college_ugds_ > .01).all(axis=1)]

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
John F. Kennedy University,0.3665,0.1554,0.259,0.0837,0.0159,0.012,0.0359,0.0319,0.0398
National Holistic Institute,0.4122,0.0933,0.1889,0.0689,0.0111,0.0278,0.1522,0.0267,0.0189
Santa Fe University of Art and Design,0.4529,0.0703,0.2765,0.0191,0.025,0.0179,0.0751,0.0501,0.0131
Eastern Oregon University,0.7611,0.0294,0.0624,0.0207,0.0254,0.0113,0.0133,0.015,0.0614
New Hope Christian College-Eugene,0.6111,0.037,0.0864,0.037,0.0123,0.0432,0.0926,0.0247,0.0556
Salt Lake Community College,0.6888,0.0248,0.1612,0.0405,0.0104,0.0128,0.0256,0.0132,0.0227
Northwest University,0.6827,0.0447,0.0853,0.0475,0.0124,0.0165,0.0447,0.02,0.0461
South Puget Sound Community College,0.6351,0.0322,0.085,0.0509,0.0125,0.0102,0.0926,0.0112,0.0706
Fashion Institute of Design & Merchandising-San Diego,0.3732,0.0352,0.3732,0.0775,0.0141,0.0282,0.0141,0.0352,0.0493
Northwest College of Art & Design,0.5851,0.0213,0.0957,0.0213,0.0426,0.0213,0.0851,0.0106,0.117
