# Learning Pandas Properly in One Go
> by 이수안 컴퓨터 연구소 




---



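## Pandas Features
- Easy handling of missing data (NaN) in floating-point as well as non-floating-point data
- Size mutability: columns can be inserted into and deleted from DataFrame and higher-dimensional objects
- Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can ignore the labels and let Series, DataFrame, etc. align the data automatically in computations
- Powerful, flexible group-by functionality for split, apply, and combine operations on data sets, for both aggregating and transforming data
- Easy conversion of ragged, differently indexed data in other Python and NumPy data structures into DataFrame objects
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
- Intuitive merging and joining of data sets
- Flexible reshaping and pivoting of data sets
- Hierarchical labeling of axes (multiple labels per tick are possible)
- Robust IO tools for loading data from flat files (CSV and delimited), Excel files, and databases, and for saving/loading data in the ultrafast HDF5 format
- Time-series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging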
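
Two of the features listed above, NaN-aware missing-data handling and split-apply-combine with group-by, can be sketched as follows. This is a minimal illustration only: the column names `city`, `year`, and `sales` and all values are made up and do not come from the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical data; names and numbers are for illustration only.
df = pd.DataFrame({
    'city': ['A', 'A', 'B', 'B'],
    'year': [2010, 2020, 2010, 2020],
    'sales': [10.0, np.nan, 30.0, 50.0],
})

# Missing data: count the NaN entries, then fill them with a chosen value.
n_missing = int(df['sales'].isna().sum())
filled = df['sales'].fillna(0.0)

# Split-apply-combine: split rows by city, apply mean, combine into a Series.
# mean() skips NaN by default.
city_mean = df.groupby('city')['sales'].mean()

print(n_missing)
print(filled.tolist())
print(city_mean)
```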

In [1]:
import numpy as np
import pandas as pd
pd.__version__

'1.2.4'

## Pandas Objects


### Series Objects

In [6]:
s = pd.Series((0, 0.25, 0.5, 0.75, 1.0))
s

0    0.00
1    0.25
2    0.50
3    0.75
4    1.00
dtype: float64

In [7]:
s.values

array([0.  , 0.25, 0.5 , 0.75, 1.  ])

In [8]:
s.index

RangeIndex(start=0, stop=5, step=1)

In [9]:
s[1]

0.25

In [10]:
s[1:4]

1    0.25
2    0.50
3    0.75
dtype: float64

In [11]:
s = pd.Series([0, 0.25, 0.5, 0.75, 1], index=['a', 'b', 'c', 'd', 'e'])
s

a    0.00
b    0.25
c    0.50
d    0.75
e    1.00
dtype: float64

In [13]:
s['c']

0.5

In [17]:
'b' in s

True

In [18]:
s.unique()

array([0.  , 0.25, 0.5 , 0.75, 1.  ])

In [21]:
s.value_counts()

0.00    1
1.00    1
0.50    1
0.75    1
0.25    1
dtype: int64

In [20]:
s.isin([0.25, 0.75])

a    False
b     True
c    False
d     True
e    False
dtype: bool

In [25]:
pop_tuple = {'서울특별시': 9720846, 
            '부산광역시': 3404423, 
            '인천광역시': 2947217,
            '대구광역시': 2427954, 
            '대전광역시': 1471040}
population =pd.Series(pop_tuple)
population

서울특별시    9720846
부산광역시    3404423
인천광역시    2947217
대구광역시    2427954
대전광역시    1471040
dtype: int64

In [26]:
population['서울특별시']

9720846

In [27]:
population['서울특별시':'인천광역시']

서울특별시    9720846
부산광역시    3404423
인천광역시    2947217
dtype: int64

### DataFrame Objects

In [28]:
pd.DataFrame([{'A': 2, 'B': 4, 'D': 3}, {'A': 4, 'B': 5, 'C': 7}])


Unnamed: 0,A,B,D,C
0,2,4,3.0,
1,4,5,,7.0


In [30]:
pd.DataFrame(np.random.rand(5, 5), 
            columns = ['A', 'B' ,'C', 'D', 'E'],
            index = [1, 2, 3, 4, 5])

Unnamed: 0,A,B,C,D,E
1,0.480411,0.390632,0.261853,0.937141,0.06484
2,0.409991,0.945218,0.501853,0.66812,0.957351
3,0.352943,0.501391,0.166705,0.025515,0.957486
4,0.799258,0.720353,0.980737,0.921693,0.463714
5,0.772527,0.956855,0.984953,0.748891,0.380766


In [34]:
male_tuple = {'서울특별시': 4732275, 
            '부산광역시': 1668618, 
            '인천광역시': 1476813,
            '대구광역시': 1198815, 
            '대전광역시': 734441}
male =pd.Series(male_tuple)
print(male)

female_tuple = {'서울특별시': 4988571, 
            '부산광역시': 1735805, 
            '인천광역시': 1470404,
            '대구광역시': 1229139, 
            '대전광역시': 736599}
female =pd.Series(female_tuple)
print(female)

서울특별시    4732275
부산광역시    1668618
인천광역시    1476813
대구광역시    1198815
대전광역시     734441
dtype: int64
서울특별시    4988571
부산광역시    1735805
인천광역시    1470404
대구광역시    1229139
대전광역시     736599
dtype: int64


In [37]:
korea_df = pd.DataFrame({'인구수': population, 
                        '남자인구수': male, 
                        '여자인구수': female})
korea_df

Unnamed: 0,인구수,남자인구수,여자인구수
서울특별시,9720846,4732275,4988571
부산광역시,3404423,1668618,1735805
인천광역시,2947217,1476813,1470404
대구광역시,2427954,1198815,1229139
대전광역시,1471040,734441,736599


In [38]:
korea_df.index

Index(['서울특별시', '부산광역시', '인천광역시', '대구광역시', '대전광역시'], dtype='object')

In [39]:
korea_df.columns

Index(['인구수', '남자인구수', '여자인구수'], dtype='object')

In [40]:
korea_df['여자인구수']

서울특별시    4988571
부산광역시    1735805
인천광역시    1470404
대구광역시    1229139
대전광역시     736599
Name: 여자인구수, dtype: int64

In [41]:
korea_df['서울특별시':'인천광역시']

Unnamed: 0,인구수,남자인구수,여자인구수
서울특별시,9720846,4732275,4988571
부산광역시,3404423,1668618,1735805
인천광역시,2947217,1476813,1470404


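### Index Objects


|Class|Description|
|--|--|
|`Index`|General-purpose Index object; represents axis labels as a NumPy array|
|`Int64Index`|Index for integer values|
|`MultiIndex`|Hierarchical Index object representing multiple levels of indexing on a single axis (similar to an array of tuples)|
|`DatetimeIndex`|Stores timestamps using NumPy's `datetime64` type|
|`PeriodIndex`|Index for period (time span) data|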
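
As a rough sketch of how some of the index classes in the table can be constructed, the block below builds one of each with standard pandas constructors. The labels and dates are arbitrary illustration values, not data from this notebook.

```python
import pandas as pd

# Plain Index holding arbitrary labels.
labels = pd.Index(['a', 'b', 'c'])

# DatetimeIndex: timestamps stored as datetime64 values.
days = pd.date_range('2020-01-01', periods=3, freq='D')

# PeriodIndex: spans of time (here, whole months) rather than instants.
months = pd.period_range('2020-01', periods=3, freq='M')

# MultiIndex: several index levels on a single axis.
multi = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

print(type(days).__name__, type(months).__name__, type(multi).__name__)
print(multi.nlevels, len(multi))
```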

In [42]:
idx = pd.Index([2, 4, 6, 8, 10])
idx

Int64Index([2, 4, 6, 8, 10], dtype='int64')

In [43]:
idx[1]

4

In [44]:
print(idx)
print(idx.size)
print(idx.shape)
print(idx.ndim)
print(idx.dtype)

Int64Index([2, 4, 6, 8, 10], dtype='int64')
5
(5,)
1
int64


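#### Index Operations

|Operator|Method|Description|
|--|--|--|
| |`append`|Returns a new index with the given Index objects appended|
|`-`|`difference`|Returns the set difference of the indexes|
|`&`|`intersection`|Returns the set intersection of the indexes|
|`\|`|`union`|Returns the set union of the indexes|
| |`isin`|Returns a boolean array indicating whether each label is among the given values|
| |`delete`|Returns a new index with the element at the given position removed|
| |`drop`|Returns a new index with the given values removed|
| |`insert`|Returns a new index with an element inserted|
| |`is_monotonic`|True if the index is monotonic|
| |`is_unique`|True if the index has no duplicate labels|
| |`unique`|Returns the index with duplicates removed, keeping only unique values|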
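
A minimal sketch of the method forms from the table, using two small integer indexes. It deliberately uses the named methods, because the behavior of the `-`, `&`, and `|` operators on an Index has changed between pandas releases. `is_monotonic_increasing` is used here as the explicit spelling of the monotonicity check.

```python
import pandas as pd

idx1 = pd.Index([1, 2, 4, 6, 8])
idx2 = pd.Index([2, 4, 5, 6, 7])

# Set-style methods return a new Index; Index objects are immutable.
diff = idx1.difference(idx2)      # labels in idx1 but not in idx2
both = idx1.intersection(idx2)    # labels in both
either = idx1.union(idx2)         # labels in either one

# insert(loc, item) places item before position loc.
inserted = idx1.insert(1, 3)

# Uniqueness and monotonicity checks.
combined = idx1.append(idx2)      # append keeps duplicates
print(diff.tolist(), both.tolist(), either.tolist())
print(inserted.tolist())
print(combined.is_unique, combined.unique().tolist())
print(idx1.is_monotonic_increasing)
```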
  

In [62]:
idx1 = pd.Index([1, 2, 4, 6, 8])
idx2 = pd.Index([2, 4, 5, 6, 7])

print(idx1.append(idx2))
print(idx1.difference(idx2))
print(idx1.intersection(idx2))
print(idx1.union(idx2))
print(idx1.isin([1, 2]))
print(idx1.delete(0))
print(idx1.drop(4))

Int64Index([1, 2, 4, 6, 8, 2, 4, 5, 6, 7], dtype='int64')
Int64Index([1, 8], dtype='int64')
Int64Index([2, 4, 6], dtype='int64')
Int64Index([1, 2, 4, 5, 6, 7, 8], dtype='int64')
[ True  True False False False]
Int64Index([2, 4, 6, 8], dtype='int64')
Int64Index([1, 2, 6, 8], dtype='int64')




---



## Indexing

In [63]:
s = pd.Series([0, 0.25, 0.5, 0.75, 1.0],
             index = ['a', 'b', 'c', 'd', 'e'])
s

a    0.00
b    0.25
c    0.50
d    0.75
e    1.00
dtype: float64

In [64]:
s['b']

0.25

In [65]:
'b' in s

True

In [74]:
s.keys()

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [68]:
list(s.items())

[('a', 0.0), ('b', 0.25), ('c', 0.5), ('d', 0.75), ('e', 1.0)]

In [75]:
s['f'] = 1.25
s

a    0.00
b    0.25
c    0.50
d    0.75
e    1.00
f    1.25
dtype: float64

In [76]:
s['a':'d']

a    0.00
b    0.25
c    0.50
d    0.75
dtype: float64

In [77]:
s[0:4]

a    0.00
b    0.25
c    0.50
d    0.75
dtype: float64

In [82]:
s[(s > 0.4) & (s < 0.8)]

c    0.50
d    0.75
dtype: float64

In [84]:
s[['a', 'c']]

a    0.0
c    0.5
dtype: float64

### Series Indexing

In [85]:
s = pd.Series(['a', 'b', 'c', 'd', 'e'], 
             index = [1, 3, 5, 7, 9])
s

1    a
3    b
5    c
7    d
9    e
dtype: object

In [86]:
s[1]

'a'

In [87]:
s[2:4]

5    c
7    d
dtype: object

In [88]:
s.iloc[1] # position-based lookup (integer location)

'b'

In [89]:
s.iloc[2:4]

5    c
7    d
dtype: object

In [91]:
s.reindex(range(10))

0    NaN
1      a
2    NaN
3      b
4    NaN
5      c
6    NaN
7      d
8    NaN
9      e
dtype: object

In [94]:
s.reindex(range(10), method = 'bfill') # back fill

0    a
1    a
2    b
3    b
4    c
5    c
6    d
7    d
8    e
9    e
dtype: object
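
The `method` argument of `reindex` decides which neighbouring label supplies a value for each new label. As a small sketch on a made-up Series (not the one above), `'ffill'` carries the previous existing label forward, `'bfill'` pulls the next existing label backward, and labels with no neighbour on that side stay NaN.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
new_index = range(7)

# ffill: each new label takes the value of the closest earlier label.
forward = s.reindex(new_index, method='ffill')

# bfill: each new label takes the value of the closest later label.
backward = s.reindex(new_index, method='bfill')

print(forward)
print(backward)
```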

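### DataFrame Indexing


|Usage|Description|
|:--|--|
|`df[val]`|Select a single column or multiple columns|
|`df.loc[val]`|Select a subset of rows by label|
|`df.loc[:, val]`|Select a subset of columns by label|
|`df.loc[val1, val2]`|Select a subset of rows and columns by label|
|`df.iloc[where]`|Select a subset of rows by integer position|
|`df.iloc[:, where]`|Select a subset of columns by integer position|
|`df.iloc[where_i, where_j]`|Select a subset of rows and columns by integer position|
|`df.at[label_i, label_j]`|Select a single value by row and column label|
|`df.iat[i, j]`|Select a single value by row and column integer position|
|`reindex`|Conform one or more axes to a new index|
|`get_value`, `set_value`|Select or set a single value by row and column name (removed in pandas 1.0; use `at` / `iat` instead)|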
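
The accessors in the table can be compared side by side on a tiny frame. This is an illustrative sketch: the frame, its labels `r1` to `r3`, and the columns `x` and `y` are invented for the example.

```python
import pandas as pd

# Hypothetical frame; labels and values are for illustration only.
df = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]},
                  index=['r1', 'r2', 'r3'])

col = df['x']                          # one column as a Series
rows = df.loc['r1':'r2']               # label slice, end label included
block = df.loc[['r1', 'r3'], ['y']]    # rows and columns by label
first_two = df.iloc[:2]                # position slice, end excluded
by_pos = df.iloc[:, 1]                 # second column by position
one_label = df.at['r2', 'y']           # single value by labels
one_pos = df.iat[2, 0]                 # single value by positions

print(rows)
print(block)
print(one_label, one_pos)
```

Note that a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position.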

In [95]:
korea_df

Unnamed: 0,인구수,남자인구수,여자인구수
서울특별시,9720846,4732275,4988571
부산광역시,3404423,1668618,1735805
인천광역시,2947217,1476813,1470404
대구광역시,2427954,1198815,1229139
대전광역시,1471040,734441,736599


In [96]:
korea_df['남자인구수']

서울특별시    4732275
부산광역시    1668618
인천광역시    1476813
대구광역시    1198815
대전광역시     734441
Name: 남자인구수, dtype: int64

In [97]:
korea_df.남자인구수

서울특별시    4732275
부산광역시    1668618
인천광역시    1476813
대구광역시    1198815
대전광역시     734441
Name: 남자인구수, dtype: int64

In [98]:
korea_df['남여비율'] = (korea_df['남자인구수'] * 100 / korea_df['여자인구수'])

In [101]:
korea_df

Unnamed: 0,인구수,남자인구수,여자인구수,남여비율
서울특별시,9720846,4732275,4988571,94.862336
부산광역시,3404423,1668618,1735805,96.129346
인천광역시,2947217,1476813,1470404,100.435867
대구광역시,2427954,1198815,1229139,97.532907
대전광역시,1471040,734441,736599,99.707032


In [102]:
korea_df.values

array([[9.72084600e+06, 4.73227500e+06, 4.98857100e+06, 9.48623363e+01],
       [3.40442300e+06, 1.66861800e+06, 1.73580500e+06, 9.61293463e+01],
       [2.94721700e+06, 1.47681300e+06, 1.47040400e+06, 1.00435867e+02],
       [2.42795400e+06, 1.19881500e+06, 1.22913900e+06, 9.75329072e+01],
       [1.47104000e+06, 7.34441000e+05, 7.36599000e+05, 9.97070319e+01]])

In [103]:
korea_df.T

Unnamed: 0,서울특별시,부산광역시,인천광역시,대구광역시,대전광역시
인구수,9720846.0,3404423.0,2947217.0,2427954.0,1471040.0
남자인구수,4732275.0,1668618.0,1476813.0,1198815.0,734441.0
여자인구수,4988571.0,1735805.0,1470404.0,1229139.0,736599.0
남여비율,94.86234,96.12935,100.4359,97.53291,99.70703


In [104]:
korea_df.values[0]

array([9.72084600e+06, 4.73227500e+06, 4.98857100e+06, 9.48623363e+01])

In [106]:
korea_df['인구수']

서울특별시    9720846
부산광역시    3404423
인천광역시    2947217
대구광역시    2427954
대전광역시    1471040
Name: 인구수, dtype: int64

In [107]:
korea_df.loc[:'인천광역시', :'남자인구수']

Unnamed: 0,인구수,남자인구수
서울특별시,9720846,4732275
부산광역시,3404423,1668618
인천광역시,2947217,1476813


In [108]:
korea_df.loc[(korea_df.여자인구수 > 1000000)]

Unnamed: 0,인구수,남자인구수,여자인구수,남여비율
서울특별시,9720846,4732275,4988571,94.862336
부산광역시,3404423,1668618,1735805,96.129346
인천광역시,2947217,1476813,1470404,100.435867
대구광역시,2427954,1198815,1229139,97.532907


In [109]:
korea_df.loc[(korea_df.인구수 < 2000000)]

Unnamed: 0,인구수,남자인구수,여자인구수,남여비율
대전광역시,1471040,734441,736599,99.707032


In [110]:
korea_df.loc[(korea_df.인구수 > 2500000)]

Unnamed: 0,인구수,남자인구수,여자인구수,남여비율
서울특별시,9720846,4732275,4988571,94.862336
부산광역시,3404423,1668618,1735805,96.129346
인천광역시,2947217,1476813,1470404,100.435867


In [111]:
korea_df.loc[korea_df.남여비율 > 100]

Unnamed: 0,인구수,남자인구수,여자인구수,남여비율
인천광역시,2947217,1476813,1470404,100.435867


In [116]:
korea_df

Unnamed: 0,인구수,남자인구수,여자인구수,남여비율
서울특별시,9720846,4732275,4988571,94.862336
부산광역시,3404423,1668618,1735805,96.129346
인천광역시,2947217,1476813,1470404,100.435867
대구광역시,2427954,1198815,1229139,97.532907
대전광역시,1471040,734441,736599,99.707032


In [118]:
korea_df.iloc[:3, :2]

Unnamed: 0,인구수,남자인구수
서울특별시,9720846,4732275
부산광역시,3404423,1668618
인천광역시,2947217,1476813


### Multi-Indexing

* Handles higher-dimensional data (3-D, 4-D, and beyond), going past the 1-D Series and 2-D DataFrame objects
* Multi-indexing holds several index levels within a single index
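
As a small sketch of the idea, a Series keyed by two index levels behaves like a 2-D table stored in one dimension. The level names `site` and `year` and the values below are hypothetical.

```python
import pandas as pd

# Hypothetical measurements keyed by (site, year); values are made up.
idx = pd.MultiIndex.from_product([['s1', 's2'], [2010, 2020]],
                                 names=['site', 'year'])
s = pd.Series([1, 2, 3, 4], index=idx)

# Select with a full key, or with a partial key on the outer level.
single = s.loc[('s1', 2020)]
partial = s.loc['s2']

# unstack() pivots the inner level into columns, giving a 2-D view.
wide = s.unstack()

print(single)
print(partial)
print(wide)
```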

#### MultiIndex Series

In [120]:
korea_df

Unnamed: 0,인구수,남자인구수,여자인구수,남여비율
서울특별시,9720846,4732275,4988571,94.862336
부산광역시,3404423,1668618,1735805,96.129346
인천광역시,2947217,1476813,1470404,100.435867
대구광역시,2427954,1198815,1229139,97.532907
대전광역시,1471040,734441,736599,99.707032


In [124]:
idx_tuples = [('서울특별시', 2010), ('서울특별시', 2020),
              ('부산광역시', 2010), ('부산광역시', 2020), 
              ('인천광역시', 2010), ('인천광역시', 2020), 
              ('대구광역시', 2010), ('대구광역시', 2020), 
              ('대전광역시', 2010), ('대전광역시', 2020)] 
idx_tuples

[('서울특별시', 2010),
 ('서울특별시', 2020),
 ('부산광역시', 2010),
 ('부산광역시', 2020),
 ('인천광역시', 2010),
 ('인천광역시', 2020),
 ('대구광역시', 2010),
 ('대구광역시', 2020),
 ('대전광역시', 2010),
 ('대전광역시', 2020)]

In [137]:
pop_tuples = [10312545, 9720846, 3567910, 3404423, 2758297, 2947217, 
             2511676, 2427954, 1503664, 1471040]
population = pd.Series(pop_tuples, index = idx_tuples)
population

(서울특별시, 2010)    10312545
(서울특별시, 2020)     9720846
(부산광역시, 2010)     3567910
(부산광역시, 2020)     3404423
(인천광역시, 2010)     2758297
(인천광역시, 2020)     2947217
(대구광역시, 2010)     2511676
(대구광역시, 2020)     2427954
(대전광역시, 2010)     1503664
(대전광역시, 2020)     1471040
dtype: int64

In [138]:
midx = pd.MultiIndex.from_tuples(idx_tuples)
midx

MultiIndex([('서울특별시', 2010),
            ('서울특별시', 2020),
            ('부산광역시', 2010),
            ('부산광역시', 2020),
            ('인천광역시', 2010),
            ('인천광역시', 2020),
            ('대구광역시', 2010),
            ('대구광역시', 2020),
            ('대전광역시', 2010),
            ('대전광역시', 2020)],
           )

In [139]:
population = population.reindex(midx)
population

서울특별시  2010    10312545
       2020     9720846
부산광역시  2010     3567910
       2020     3404423
인천광역시  2010     2758297
       2020     2947217
대구광역시  2010     2511676
       2020     2427954
대전광역시  2010     1503664
       2020     1471040
dtype: int64

In [141]:
population[:, 2010]

서울특별시    10312545
부산광역시     3567910
인천광역시     2758297
대구광역시     2511676
대전광역시     1503664
dtype: int64

In [146]:
population['대전광역시',:]

2010    1503664
2020    1471040
dtype: int64

In [147]:
korea_mdf = population.unstack()
korea_mdf

Unnamed: 0,2010,2020
대구광역시,2511676,2427954
대전광역시,1503664,1471040
부산광역시,3567910,3404423
서울특별시,10312545,9720846
인천광역시,2758297,2947217


In [150]:
korea_mdf.stack()

대구광역시  2010     2511676
       2020     2427954
대전광역시  2010     1503664
       2020     1471040
부산광역시  2010     3567910
       2020     3404423
서울특별시  2010    10312545
       2020     9720846
인천광역시  2010     2758297
       2020     2947217
dtype: int64

In [155]:
male_tuples = [5111259, 4732275, 1773170, 1668618, 1390356, 1476813, 
             1255245, 1198815, 753648, 734441]
male_tuples

[5111259,
 4732275,
 1773170,
 1668618,
 1390356,
 1476813,
 1255245,
 1198815,
 753648,
 734441]

In [158]:
population

서울특별시  2010    10312545
       2020     9720846
부산광역시  2010     3567910
       2020     3404423
인천광역시  2010     2758297
       2020     2947217
대구광역시  2010     2511676
       2020     2427954
대전광역시  2010     1503664
       2020     1471040
dtype: int64

In [223]:
korea_mdf = pd.DataFrame({'총인구수':population, 
                         '남자인구수': male_tuples})
korea_mdf

Unnamed: 0_level_0,Unnamed: 1_level_0,총인구수,남자인구수
행정구역,년도,Unnamed: 2_level_1,Unnamed: 3_level_1
서울특별시,2010,10312545,5111259
서울특별시,2020,9720846,4732275
부산광역시,2010,3567910,1773170
부산광역시,2020,3404423,1668618
인천광역시,2010,2758297,1390356
인천광역시,2020,2947217,1476813
대구광역시,2010,2511676,1255245
대구광역시,2020,2427954,1198815
대전광역시,2010,1503664,753648
대전광역시,2020,1471040,734441


In [224]:
female_tuples = [5201286, 4988571, 1794740, 1735805, 1367940, 1470404,
                1256431, 1229139, 750016, 736599]
female_tuples

[5201286,
 4988571,
 1794740,
 1735805,
 1367940,
 1470404,
 1256431,
 1229139,
 750016,
 736599]

In [225]:
korea_mdf = pd.DataFrame({'총인구수':population, 
                         '남자인구수': male_tuples, 
                         '여자인구수': female_tuples})
korea_mdf

Unnamed: 0_level_0,Unnamed: 1_level_0,총인구수,남자인구수,여자인구수
행정구역,년도,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
서울특별시,2010,10312545,5111259,5201286
서울특별시,2020,9720846,4732275,4988571
부산광역시,2010,3567910,1773170,1794740
부산광역시,2020,3404423,1668618,1735805
인천광역시,2010,2758297,1390356,1367940
인천광역시,2020,2947217,1476813,1470404
대구광역시,2010,2511676,1255245,1256431
대구광역시,2020,2427954,1198815,1229139
대전광역시,2010,1503664,753648,750016
대전광역시,2020,1471040,734441,736599


In [226]:
ratio = korea_mdf['남자인구수'] * 100 / korea_mdf['여자인구수']
ratio

행정구역   년도  
서울특별시  2010     98.269140
       2020     94.862336
부산광역시  2010     98.798155
       2020     96.129346
인천광역시  2010    101.638668
       2020    100.435867
대구광역시  2010     99.905606
       2020     97.532907
대전광역시  2010    100.484256
       2020     99.707032
dtype: float64

In [227]:
ratio.unstack()

년도,2010,2020
행정구역,Unnamed: 1_level_1,Unnamed: 2_level_1
대구광역시,99.905606,97.532907
대전광역시,100.484256,99.707032
부산광역시,98.798155,96.129346
서울특별시,98.26914,94.862336
인천광역시,101.638668,100.435867


In [228]:
korea_mdf = pd.DataFrame({'총인구수':population, 
                         '남자인구수': male_tuples, 
                         '여자인구수': female_tuples, 
                         '남여비율': ratio})
korea_mdf

Unnamed: 0_level_0,Unnamed: 1_level_0,총인구수,남자인구수,여자인구수,남여비율
행정구역,년도,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
서울특별시,2010,10312545,5111259,5201286,98.26914
서울특별시,2020,9720846,4732275,4988571,94.862336
부산광역시,2010,3567910,1773170,1794740,98.798155
부산광역시,2020,3404423,1668618,1735805,96.129346
인천광역시,2010,2758297,1390356,1367940,101.638668
인천광역시,2020,2947217,1476813,1470404,100.435867
대구광역시,2010,2511676,1255245,1256431,99.905606
대구광역시,2020,2427954,1198815,1229139,97.532907
대전광역시,2010,1503664,753648,750016,100.484256
대전광역시,2020,1471040,734441,736599,99.707032


#### Creating a MultiIndex

In [165]:
df = pd.DataFrame(np.random.rand(6, 3), 
                 index = [['a', 'a', 'b', 'b', 'c', 'c'], [1, 2, 1, 2, 1, 2]], 
                 columns = ['c1', 'c2', 'c3'])
df

Unnamed: 0,Unnamed: 1,c1,c2,c3
a,1,0.472749,0.968537,0.4186
a,2,0.203213,0.001083,0.828604
b,1,0.066785,0.456441,0.349855
b,2,0.029368,0.86941,0.757202
c,1,0.595198,0.629096,0.028985
c,2,0.41282,0.33571,0.547151


In [166]:
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b', 'c', 'c'], [1, 2, 1, 2, 1, 2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2),
            ('c', 1),
            ('c', 2)],
           )

In [169]:
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2), ('c', 1), ('c', 2)])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2),
            ('c', 1),
            ('c', 2)],
           )

In [170]:
pd.MultiIndex.from_product([['a', 'b', 'c'], [1, 2]]) # Cartesian product of the levels

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2),
            ('c', 1),
            ('c', 2)],
           )

In [171]:
pd.MultiIndex(levels = [['a', 'b', 'c'], [1, 2]],
             codes = [[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2),
            ('c', 1),
            ('c', 2)],
           )

In [172]:
population

서울특별시  2010    10312545
       2020     9720846
부산광역시  2010     3567910
       2020     3404423
인천광역시  2010     2758297
       2020     2947217
대구광역시  2010     2511676
       2020     2427954
대전광역시  2010     1503664
       2020     1471040
dtype: int64

In [222]:
population.index.names = ['행정구역', '년도']
population

행정구역   년도  
서울특별시  2010    10312545
       2020     9720846
부산광역시  2010     3567910
       2020     3404423
인천광역시  2010     2758297
       2020     2947217
대구광역시  2010     2511676
       2020     2427954
대전광역시  2010     1503664
       2020     1471040
dtype: int64

In [177]:
idx = pd.MultiIndex.from_product([['a', 'b', 'c'], [1, 2]], 
                                names = ['name1', 'name2'])
cols = pd.MultiIndex.from_product([['c1', 'c2', 'c3'], [1, 2]], 
                                 names = ['col_name1', 'col_name2'])

data = np.round(np.random.randn(6, 6), 2)
mdf = pd.DataFrame(data, index = idx, columns = cols)
mdf

Unnamed: 0_level_0,col_name1,c1,c1,c2,c2,c3,c3
Unnamed: 0_level_1,col_name2,1,2,1,2,1,2
name1,name2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
a,1,-0.25,-0.88,-0.16,0.07,-0.11,-0.88
a,2,0.35,3.33,0.19,-0.32,1.48,0.38
b,1,-0.32,-1.52,0.35,0.37,-1.22,0.03
b,2,-0.06,0.33,-0.51,-0.11,-1.25,0.7
c,1,0.3,-0.55,1.04,0.84,1.48,0.79
c,2,-0.36,-0.42,2.63,0.86,0.45,-1.31


In [179]:
mdf['c2']

Unnamed: 0_level_0,col_name2,1,2
name1,name2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,-0.16,0.07
a,2,0.19,-0.32
b,1,0.35,0.37
b,2,-0.51,-0.11
c,1,1.04,0.84
c,2,2.63,0.86


#### Indexing and Slicing

In [180]:
population

행정구역   년도  
서울특별시  2010    10312545
       2020     9720846
부산광역시  2010     3567910
       2020     3404423
인천광역시  2010     2758297
       2020     2947217
대구광역시  2010     2511676
       2020     2427954
대전광역시  2010     1503664
       2020     1471040
dtype: int64

In [181]:
population['인천광역시', 2010]

2758297

In [183]:
population[:, 2010]

행정구역
서울특별시    10312545
부산광역시     3567910
인천광역시     2758297
대구광역시     2511676
대전광역시     1503664
dtype: int64

In [185]:
population[population > 3000000]

행정구역   년도  
서울특별시  2010    10312545
       2020     9720846
부산광역시  2010     3567910
       2020     3404423
dtype: int64

In [186]:
population[['대구광역시', '대전광역시']]

행정구역   년도  
대구광역시  2010    2511676
       2020    2427954
대전광역시  2010    1503664
       2020    1471040
dtype: int64

In [187]:
mdf

Unnamed: 0_level_0,col_name1,c1,c1,c2,c2,c3,c3
Unnamed: 0_level_1,col_name2,1,2,1,2,1,2
name1,name2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
a,1,-0.25,-0.88,-0.16,0.07,-0.11,-0.88
a,2,0.35,3.33,0.19,-0.32,1.48,0.38
b,1,-0.32,-1.52,0.35,0.37,-1.22,0.03
b,2,-0.06,0.33,-0.51,-0.11,-1.25,0.7
c,1,0.3,-0.55,1.04,0.84,1.48,0.79
c,2,-0.36,-0.42,2.63,0.86,0.45,-1.31


In [195]:
mdf['c2', 1]

name1  name2
a      1       -0.16
       2        0.19
b      1        0.35
       2       -0.51
c      1        1.04
       2        2.63
Name: (c2, 1), dtype: float64

In [189]:
mdf.iloc[:3, :4]

Unnamed: 0_level_0,col_name1,c1,c1,c2,c2
Unnamed: 0_level_1,col_name2,1,2,1,2
name1,name2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
a,1,-0.25,-0.88,-0.16,0.07
a,2,0.35,3.33,0.19,-0.32
b,1,-0.32,-1.52,0.35,0.37


In [204]:
mdf.loc['a', ('c2', 1)]

name2
1   -0.16
2    0.19
Name: (c2, 1), dtype: float64

In [205]:
idx_slice = pd.IndexSlice # helper for slicing individual MultiIndex levels
mdf.loc[idx_slice[:, 2], idx_slice[:, 2]]

Unnamed: 0_level_0,col_name1,c1,c2,c3
Unnamed: 0_level_1,col_name2,2,2,2
name1,name2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,2,3.33,-0.32,0.38
b,2,0.33,-0.11,0.7
c,2,-0.42,0.86,-1.31
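
To make the `pd.IndexSlice` selection easier to follow, here is a sketch on a frame filled with consecutive integers instead of random numbers, so the selected cells can be checked by eye. The level names `k1`, `k2`, `m1`, and `m2` are invented for this example.

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([['a', 'b'], [1, 2]], names=['k1', 'k2'])
cols = pd.MultiIndex.from_product([['c1', 'c2'], [1, 2]], names=['m1', 'm2'])
df = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

ix = pd.IndexSlice
# On both axes: every outer label, inner label equal to 2.
sub = df.loc[ix[:, 2], ix[:, 2]]

print(df)
print(sub)
```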


#### Rearranging a MultiIndex

In [206]:
idx

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2),
            ('c', 1),
            ('c', 2)],
           names=['name1', 'name2'])

In [229]:
korea_mdf

Unnamed: 0_level_0,Unnamed: 1_level_0,총인구수,남자인구수,여자인구수,남여비율
행정구역,년도,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
서울특별시,2010,10312545,5111259,5201286,98.26914
서울특별시,2020,9720846,4732275,4988571,94.862336
부산광역시,2010,3567910,1773170,1794740,98.798155
부산광역시,2020,3404423,1668618,1735805,96.129346
인천광역시,2010,2758297,1390356,1367940,101.638668
인천광역시,2020,2947217,1476813,1470404,100.435867
대구광역시,2010,2511676,1255245,1256431,99.905606
대구광역시,2020,2427954,1198815,1229139,97.532907
대전광역시,2010,1503664,753648,750016,100.484256
대전광역시,2020,1471040,734441,736599,99.707032


In [230]:
korea_mdf['서울특별시':'인천광역시'] # --> raise UnsortedIndexError

UnsortedIndexError: 'Key length (1) was greater than MultiIndex lexsort depth (0)'
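
The error arises because label slicing on a MultiIndex relies on the index being lexically sorted. A minimal sketch, on a made-up frame, of checking for sortedness and sorting before slicing:

```python
import pandas as pd

# Hypothetical frame whose outer level is deliberately out of order.
idx = pd.MultiIndex.from_tuples([('b', 1), ('b', 2), ('a', 1), ('a', 2)])
df = pd.DataFrame({'v': [1, 2, 3, 4]}, index=idx)

# False here: the labels are not in sorted order.
was_sorted = df.index.is_monotonic_increasing

# sort_index() returns a new frame with a lexically sorted index.
df_sorted = df.sort_index()
now_sorted = df_sorted.index.is_monotonic_increasing

# Label slicing on the outer level is safe once the index is sorted.
only_a = df_sorted.loc['a':'a']

print(was_sorted, now_sorted)
print(only_a)
```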

In [232]:
korea_mdf = korea_mdf.sort_index()
korea_mdf

Unnamed: 0_level_0,Unnamed: 1_level_0,총인구수,남자인구수,여자인구수,남여비율
행정구역,년도,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
대구광역시,2010,2511676,1255245,1256431,99.905606
대구광역시,2020,2427954,1198815,1229139,97.532907
대전광역시,2010,1503664,753648,750016,100.484256
대전광역시,2020,1471040,734441,736599,99.707032
부산광역시,2010,3567910,1773170,1794740,98.798155
부산광역시,2020,3404423,1668618,1735805,96.129346
서울특별시,2010,10312545,5111259,5201286,98.26914
서울특별시,2020,9720846,4732275,4988571,94.862336
인천광역시,2010,2758297,1390356,1367940,101.638668
인천광역시,2020,2947217,1476813,1470404,100.435867


In [233]:
korea_mdf['서울특별시':'인천광역시']

Unnamed: 0_level_0,Unnamed: 1_level_0,총인구수,남자인구수,여자인구수,남여비율
행정구역,년도,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
서울특별시,2010,10312545,5111259,5201286,98.26914
서울특별시,2020,9720846,4732275,4988571,94.862336
인천광역시,2010,2758297,1390356,1367940,101.638668
인천광역시,2020,2947217,1476813,1470404,100.435867


In [234]:
korea_mdf.unstack(level = 0)

Unnamed: 0_level_0,총인구수,총인구수,총인구수,총인구수,총인구수,남자인구수,남자인구수,남자인구수,남자인구수,남자인구수,여자인구수,여자인구수,여자인구수,여자인구수,여자인구수,남여비율,남여비율,남여비율,남여비율,남여비율
행정구역,대구광역시,대전광역시,부산광역시,서울특별시,인천광역시,대구광역시,대전광역시,부산광역시,서울특별시,인천광역시,대구광역시,대전광역시,부산광역시,서울특별시,인천광역시,대구광역시,대전광역시,부산광역시,서울특별시,인천광역시
년도,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2
2010,2511676,1503664,3567910,10312545,2758297,1255245,753648,1773170,5111259,1390356,1256431,750016,1794740,5201286,1367940,99.905606,100.484256,98.798155,98.26914,101.638668
2020,2427954,1471040,3404423,9720846,2947217,1198815,734441,1668618,4732275,1476813,1229139,736599,1735805,4988571,1470404,97.532907,99.707032,96.129346,94.862336,100.435867


In [235]:
korea_mdf.unstack(level = 1)

Unnamed: 0_level_0,총인구수,총인구수,남자인구수,남자인구수,여자인구수,여자인구수,남여비율,남여비율
년도,2010,2020,2010,2020,2010,2020,2010,2020
행정구역,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
대구광역시,2511676,2427954,1255245,1198815,1256431,1229139,99.905606,97.532907
대전광역시,1503664,1471040,753648,734441,750016,736599,100.484256,99.707032
부산광역시,3567910,3404423,1773170,1668618,1794740,1735805,98.798155,96.129346
서울특별시,10312545,9720846,5111259,4732275,5201286,4988571,98.26914,94.862336
인천광역시,2758297,2947217,1390356,1476813,1367940,1470404,101.638668,100.435867


In [236]:
korea_mdf.stack()

행정구역   년도         
대구광역시  2010  총인구수     2.511676e+06
             남자인구수    1.255245e+06
             여자인구수    1.256431e+06
             남여비율     9.990561e+01
       2020  총인구수     2.427954e+06
             남자인구수    1.198815e+06
             여자인구수    1.229139e+06
             남여비율     9.753291e+01
대전광역시  2010  총인구수     1.503664e+06
             남자인구수    7.536480e+05
             여자인구수    7.500160e+05
             남여비율     1.004843e+02
       2020  총인구수     1.471040e+06
             남자인구수    7.344410e+05
             여자인구수    7.365990e+05
             남여비율     9.970703e+01
부산광역시  2010  총인구수     3.567910e+06
             남자인구수    1.773170e+06
             여자인구수    1.794740e+06
             남여비율     9.879815e+01
       2020  총인구수     3.404423e+06
             남자인구수    1.668618e+06
             여자인구수    1.735805e+06
             남여비율     9.612935e+01
서울특별시  2010  총인구수     1.031254e+07
             남자인구수    5.111259e+06
             여자인구수    5.201286e+06
             남여비율     9.826914e+01
 

In [237]:
korea_mdf

Unnamed: 0_level_0,Unnamed: 1_level_0,총인구수,남자인구수,여자인구수,남여비율
행정구역,년도,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
대구광역시,2010,2511676,1255245,1256431,99.905606
대구광역시,2020,2427954,1198815,1229139,97.532907
대전광역시,2010,1503664,753648,750016,100.484256
대전광역시,2020,1471040,734441,736599,99.707032
부산광역시,2010,3567910,1773170,1794740,98.798155
부산광역시,2020,3404423,1668618,1735805,96.129346
서울특별시,2010,10312545,5111259,5201286,98.26914
서울특별시,2020,9720846,4732275,4988571,94.862336
인천광역시,2010,2758297,1390356,1367940,101.638668
인천광역시,2020,2947217,1476813,1470404,100.435867


In [238]:
idx_flat = korea_mdf.reset_index(level = 0)
idx_flat

Unnamed: 0_level_0,행정구역,총인구수,남자인구수,여자인구수,남여비율
년도,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2010,대구광역시,2511676,1255245,1256431,99.905606
2020,대구광역시,2427954,1198815,1229139,97.532907
2010,대전광역시,1503664,753648,750016,100.484256
2020,대전광역시,1471040,734441,736599,99.707032
2010,부산광역시,3567910,1773170,1794740,98.798155
2020,부산광역시,3404423,1668618,1735805,96.129346
2010,서울특별시,10312545,5111259,5201286,98.26914
2020,서울특별시,9720846,4732275,4988571,94.862336
2010,인천광역시,2758297,1390356,1367940,101.638668
2020,인천광역시,2947217,1476813,1470404,100.435867


In [239]:
idx_flat = korea_mdf.reset_index(level = (0, 1))
idx_flat

Unnamed: 0,행정구역,년도,총인구수,남자인구수,여자인구수,남여비율
0,대구광역시,2010,2511676,1255245,1256431,99.905606
1,대구광역시,2020,2427954,1198815,1229139,97.532907
2,대전광역시,2010,1503664,753648,750016,100.484256
3,대전광역시,2020,1471040,734441,736599,99.707032
4,부산광역시,2010,3567910,1773170,1794740,98.798155
5,부산광역시,2020,3404423,1668618,1735805,96.129346
6,서울특별시,2010,10312545,5111259,5201286,98.26914
7,서울특별시,2020,9720846,4732275,4988571,94.862336
8,인천광역시,2010,2758297,1390356,1367940,101.638668
9,인천광역시,2020,2947217,1476813,1470404,100.435867


In [240]:
idx_flat.set_index(['행정구역', '년도'])

Unnamed: 0_level_0,Unnamed: 1_level_0,총인구수,남자인구수,여자인구수,남여비율
행정구역,년도,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
대구광역시,2010,2511676,1255245,1256431,99.905606
대구광역시,2020,2427954,1198815,1229139,97.532907
대전광역시,2010,1503664,753648,750016,100.484256
대전광역시,2020,1471040,734441,736599,99.707032
부산광역시,2010,3567910,1773170,1794740,98.798155
부산광역시,2020,3404423,1668618,1735805,96.129346
서울특별시,2010,10312545,5111259,5201286,98.26914
서울특별시,2020,9720846,4732275,4988571,94.862336
인천광역시,2010,2758297,1390356,1367940,101.638668
인천광역시,2020,2947217,1476813,1470404,100.435867


## Data Operations

In [2]:
s = pd.Series(np.random.randint(0, 10, 5))
s

0    8
1    6
2    0
3    6
4    7
dtype: int32

In [3]:
df = pd.DataFrame(np.random.randint(0, 10, (3, 3)),
                 columns = ['A', 'B', 'C'])
df

Unnamed: 0,A,B,C
0,4,6,1
1,1,6,5
2,0,2,4


In [4]:
np.exp(s)

0    2980.957987
1     403.428793
2       1.000000
3     403.428793
4    1096.633158
dtype: float64

In [5]:
np.cos(df * np.pi / 4)

Unnamed: 0,A,B,C
0,-1.0,-1.83697e-16,0.707107
1,0.707107,-1.83697e-16,-0.707107
2,1.0,6.123234000000001e-17,-1.0


In [6]:
s1 = pd.Series([1, 3, 5, 7, 9], index = [0, 1, 2, 3, 4])
s2 = pd.Series([2, 4, 6, 8, 10], index = [1, 2, 3, 4, 5])
s1 + s2 # aligned on matching index labels

0     NaN
1     5.0
2     9.0
3    13.0
4    17.0
5     NaN
dtype: float64

In [8]:
s1.add(s2, fill_value = 0)

0     1.0
1     5.0
2     9.0
3    13.0
4    17.0
5    10.0
dtype: float64

In [18]:
df1 = pd.DataFrame(np.random.randint(0, 10, (3, 3)), 
                  columns = list('ACD'))
df1

Unnamed: 0,A,C,D
0,9,1,5
1,8,8,3
2,6,2,4


In [14]:
df2 = pd.DataFrame(np.random.randint(0, 10, (5, 5)), 
                  columns = list('BAECD'))
df2

Unnamed: 0,B,A,E,C,D
0,1,0,1,9,2
1,9,9,3,8,1
2,1,5,0,1,8
3,9,0,5,5,0
4,3,5,4,6,4


In [15]:
df1 + df2 # aligned on both row and column labels

Unnamed: 0,A,B,C,D,E
0,9.0,,10.0,7.0,
1,17.0,,16.0,4.0,
2,11.0,,3.0,12.0,
3,,,,,
4,,,,,


In [22]:
fvalue = df1.stack().mean()
df1.add(df2, fill_value = fvalue)

Unnamed: 0,A,B,C,D,E
0,9.0,6.111111,10.0,7.0,6.111111
1,17.0,14.111111,16.0,4.0,8.111111
2,11.0,6.111111,3.0,12.0,5.111111
3,5.111111,14.111111,10.111111,5.111111,10.111111
4,10.111111,8.111111,11.111111,9.111111,9.111111


### Universal Functions for Operators


|Python operator|Pandas method|
|:--|:--|
|`+`|`add`, `radd`|
|`-`|`sub`, `rsub`, `subtract`|
|`*`|`mul`, `rmul`, `multiply`|
|`/`|`truediv`, `div`, `rdiv`, `divide`|
|`//`|`floordiv`, `rfloordiv`|
|`%`|`mod`|
|`**`|`pow`, `rpow`|
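
The method forms matter mainly because they accept extra arguments that the bare operators cannot take. The sketch below, on small made-up data, shows the operator/method equivalence, the `axis` argument, a reflected method, and `fill_value`.

```python
import pandas as pd

df = pd.DataFrame({'A': [4, 6, 7], 'B': [9, 7, 7]})
row = df.iloc[0]      # Series labelled by column names
col = df['A']         # Series labelled by the row index

# Operator and method forms give the same result.
same = (df - row).equals(df.sub(row))

# axis=0 aligns the Series with the row index instead of the columns.
by_rows = df.sub(col, axis=0)

# Reflected methods swap the operands: df.rsub(10) computes 10 - df.
reflected = df.rsub(10)

# fill_value stands in for a label that is missing on one side.
s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series([10, 20], index=['b', 'c'])
total = s1.add(s2, fill_value=0)

print(same)
print(by_rows)
print(reflected)
print(total)
```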

#### add()

In [23]:
a = np.random.randint(1, 10, size = (3, 3))
a

array([[4, 9, 6],
       [6, 7, 5],
       [7, 7, 4]])

In [24]:
a + a[0]

array([[ 8, 18, 12],
       [10, 16, 11],
       [11, 16, 10]])

In [26]:
df = pd.DataFrame(a, columns = list('ABC'))
df

Unnamed: 0,A,B,C
0,4,9,6
1,6,7,5
2,7,7,4


In [27]:
df + df.iloc[0] # broadcast across rows, as in NumPy

Unnamed: 0,A,B,C
0,8,18,12
1,10,16,11
2,11,16,10


In [28]:
df.add(df.iloc[0])

Unnamed: 0,A,B,C
0,8,18,12
1,10,16,11
2,11,16,10


#### sub() / subtract()

In [29]:
a

array([[4, 9, 6],
       [6, 7, 5],
       [7, 7, 4]])

In [30]:
a - a[0]

array([[ 0,  0,  0],
       [ 2, -2, -1],
       [ 3, -2, -2]])

In [31]:
df

Unnamed: 0,A,B,C
0,4,9,6
1,6,7,5
2,7,7,4


In [32]:
df - df.iloc[0]


Unnamed: 0,A,B,C
0,0,0,0
1,2,-2,-1
2,3,-2,-2


In [33]:
df.sub(df.iloc[0])

Unnamed: 0,A,B,C
0,0,0,0
1,2,-2,-1
2,3,-2,-2


In [35]:
df.subtract(df['B'], axis = 0)

Unnamed: 0,A,B,C
0,-5,0,-3
1,-1,0,-2
2,0,0,-3


#### mul() / multiply()




In [36]:
a

array([[4, 9, 6],
       [6, 7, 5],
       [7, 7, 4]])

In [38]:
a*a[0]

array([[16, 81, 36],
       [24, 63, 30],
       [28, 63, 24]])

In [39]:
df

Unnamed: 0,A,B,C
0,4,9,6
1,6,7,5
2,7,7,4


In [40]:
df * df.iloc[1]

Unnamed: 0,A,B,C
0,24,63,30
1,36,49,25
2,42,49,20


In [41]:
df.mul(df.iloc[1])

Unnamed: 0,A,B,C
0,24,63,30
1,36,49,25
2,42,49,20


In [42]:
df.multiply(df.iloc[2])

Unnamed: 0,A,B,C
0,28,63,24
1,42,49,20
2,49,49,16


#### truediv() / div() / divide() / floordiv()

In [43]:
a

array([[4, 9, 6],
       [6, 7, 5],
       [7, 7, 4]])

In [44]:
a/a[0]

array([[1.        , 1.        , 1.        ],
       [1.5       , 0.77777778, 0.83333333],
       [1.75      , 0.77777778, 0.66666667]])

In [45]:
df / df.iloc[0]

Unnamed: 0,A,B,C
0,1.0,1.0,1.0
1,1.5,0.777778,0.833333
2,1.75,0.777778,0.666667


In [46]:
df.truediv(df.iloc[0])

Unnamed: 0,A,B,C
0,1.0,1.0,1.0
1,1.5,0.777778,0.833333
2,1.75,0.777778,0.666667


In [47]:
a // a[0]

array([[1, 1, 1],
       [1, 0, 0],
       [1, 0, 0]], dtype=int32)

In [49]:
df.floordiv(df.iloc[0])

Unnamed: 0,A,B,C
0,1,1,1
1,1,0,0
2,1,0,0


#### mod()

In [50]:
a

array([[4, 9, 6],
       [6, 7, 5],
       [7, 7, 4]])

In [51]:
a % a[0]

array([[0, 0, 0],
       [2, 7, 5],
       [3, 7, 4]], dtype=int32)

In [52]:
df

Unnamed: 0,A,B,C
0,4,9,6
1,6,7,5
2,7,7,4


In [53]:
df.mod(df.iloc[0])

Unnamed: 0,A,B,C
0,0,0,0
1,2,7,5
2,3,7,4


#### pow()

In [54]:
a ** a[0]

array([[      256, 387420489,     46656],
       [     1296,  40353607,     15625],
       [     2401,  40353607,      4096]], dtype=int32)

In [55]:
df

Unnamed: 0,A,B,C
0,4,9,6
1,6,7,5
2,7,7,4


In [56]:
df.pow(df.iloc[0])

Unnamed: 0,A,B,C
0,256,387420489,46656
1,1296,40353607,15625
2,2401,40353607,4096


In [57]:
row = df.iloc[0, ::2]
row

A    4
C    6
Name: 0, dtype: int32

In [58]:
df - row

Unnamed: 0,A,B,C
0,0.0,,0.0
1,2.0,,-1.0
2,3.0,,-2.0
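
The `NaN` column in the result above comes from label alignment: `row` has no `B` label, so nothing can be subtracted there. One possible way to treat the missing label as zero, sketched here as an assumption rather than as the notebook's own approach, is to reindex the Series to all columns first:

```python
import pandas as pd

df = pd.DataFrame([[4, 9, 6], [6, 7, 5], [7, 7, 4]], columns=list('ABC'))
row = df.iloc[0, ::2]          # Series with labels A and C only

# Arithmetic aligns on column labels, so the unmatched column B becomes NaN.
aligned = df - row

# Reindexing the Series to every column (missing labels filled with 0)
# leaves column B unchanged instead of turning it into NaN.
filled = df - row.reindex(df.columns, fill_value=0)
print(filled)
```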


### Sorting

In [60]:
s = pd.Series(range(5), index = ['A', 'D', 'B', 'C', 'E'])
s

A    0
D    1
B    2
C    3
E    4
dtype: int64

In [61]:
s.sort_index()

A    0
B    2
C    3
D    1
E    4
dtype: int64

In [62]:
s.sort_values()

A    0
D    1
B    2
C    3
E    4
dtype: int64

In [63]:
df = pd.DataFrame(np.random.randint(0, 10, (4, 4)), 
                 index = [2, 4, 1, 3], 
                 columns = list('BDAC'))
df

Unnamed: 0,B,D,A,C
2,0,6,7,4
4,8,9,1,0
1,4,7,8,3
3,1,2,1,4


In [65]:
df.sort_index()

Unnamed: 0,B,D,A,C
1,4,7,8,3
2,0,6,7,4
3,1,2,1,4
4,8,9,1,0


In [68]:
df.sort_index(axis = 1)

Unnamed: 0,A,B,C,D
2,7,0,4,6
4,1,8,0,9
1,8,4,3,7
3,1,1,4,2


In [66]:
df.sort_values(by = 'A') # sort rows by the values in column A

Unnamed: 0,B,D,A,C
4,8,9,1,0
3,1,2,1,4
2,0,6,7,4
1,4,7,8,3


In [67]:
df.sort_values(by = ['A', 'C']) # sort by A first, then by C to break ties

Unnamed: 0,B,D,A,C
4,8,9,1,0
3,1,2,1,4
2,0,6,7,4
1,4,7,8,3
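
Each sort key can also have its own direction. The sketch below rebuilds the frame shown above with fixed values and passes a list to `ascending`, so that ties in `A` are broken by `C` in descending order:

```python
import pandas as pd

# Same values as the random frame shown above, hard-coded for reproducibility.
df = pd.DataFrame([[0, 6, 7, 4],
                   [8, 9, 1, 0],
                   [4, 7, 8, 3],
                   [1, 2, 1, 4]],
                  index=[2, 4, 1, 3],
                  columns=list('BDAC'))

# A ascending; within equal A values, C descending.
mixed = df.sort_values(by=['A', 'C'], ascending=[True, False])
print(mixed)
```

Compared with the previous output, only the two rows with `A == 1` swap places.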


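### Ranking


|`method`|Description|
|:--|:--|
|`average`|Default. Tied values get the average of their ranks|
|`min`|Tied values all get the lowest rank in the group|
|`max`|Tied values all get the highest rank in the group|
|`first`|Ties are ranked in the order they appear in the data|
|`dense`|Like `min`, but the rank increases by exactly 1 between groups|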

In [69]:
s = pd.Series([-2, 4, 7, -4, 1, 5, 2, 5])
s

0   -2
1    4
2    7
3   -4
4    1
5    5
6    2
7    5
dtype: int64

In [70]:
s.rank() # rank of each element; tied values share the average rank (6.5)

0    2.0
1    5.0
2    8.0
3    1.0
4    3.0
5    6.5
6    4.0
7    6.5
dtype: float64

In [71]:
s.rank(method = 'first') # ties are ranked by order of appearance (the earlier value gets rank 6)

0    2.0
1    5.0
2    8.0
3    1.0
4    3.0
5    6.0
6    4.0
7    7.0
dtype: float64

In [73]:
s.rank(method = 'max') # ties get the highest rank in the group (6.5 -> 7)

0    2.0
1    5.0
2    8.0
3    1.0
4    3.0
5    7.0
6    4.0
7    7.0
dtype: float64
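
The cells above cover `average`, `first`, and `max`. As a small sketch, all five `method` values can be compared side by side on the same Series, which also shows `min` and `dense`:

```python
import pandas as pd

s = pd.Series([-2, 4, 7, -4, 1, 5, 2, 5])

# One column of ranks per tie-breaking method.
methods = ['average', 'min', 'max', 'first', 'dense']
ranks = pd.DataFrame({m: s.rank(method=m) for m in methods})
print(ranks)
```

The two tied 5s get rank 6 under both `min` and `dense`; the difference shows up in the next value, 7, which is ranked 8 by `min` but 7 by `dense`.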

### High-Performance Operations: pd.eval(), DataFrame.query()

In [76]:
nrows, ncols = 10000, 100
df1, df2, df3, df4 = (pd.DataFrame(np.random.rand(nrows, ncols)) for i in range(4))

In [78]:
%timeit df1 + df2 + df3 + df4

8.64 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [79]:
%timeit pd.eval('df1 + df2 + df3 + df4')

6.36 ms ± 73.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [80]:
%timeit df1 * -df2 / (-df3 * df4)

15.6 ms ± 436 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [83]:
%timeit pd.eval('df1 * -df2 / (-df3 * df4)')

6.63 ms ± 99.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [85]:
df = pd.DataFrame(np.random.rand(1000000, 5), columns = list('ABCDE'))
df.head()

Unnamed: 0,A,B,C,D,E
0,0.641301,0.31012,0.101351,0.748783,0.039265
1,0.424981,0.038268,0.549519,0.726498,0.728869
2,0.003499,0.424629,0.107606,0.429814,0.295551
3,0.415023,0.790073,0.214719,0.960948,0.46701
4,0.004172,0.485605,0.020616,0.02197,0.024636


In [86]:
%timeit df['A'] + df['B'] / df['C'] - df['D'] * df['E']

16.3 ms ± 926 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [88]:
%timeit pd.eval('df.A + df.B / df.C - df.D * df.E')

5.57 ms ± 192 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [89]:
%timeit df.eval('A + B / C - D * E')

11.5 ms ± 26.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [92]:
df.eval('R = A + B / C - D * E', inplace = True) # store the result as a new column R
df.head()

Unnamed: 0,A,B,C,D,E,R
0,0.641301,0.31012,0.101351,0.748783,0.039265,3.671774
1,0.424981,0.038268,0.549519,0.726498,0.728869,-0.034902
2,0.003499,0.424629,0.107606,0.429814,0.295551,3.822611
3,0.415023,0.790073,0.214719,0.960948,0.46701,3.645822
4,0.004172,0.485605,0.020616,0.02197,0.024636,23.558891


In [96]:
col_mean = df.mean(axis = 1) # axis = 1: mean across the columns of each row
df['A'] + col_mean

0         1.560066
1         0.830519
2         0.850785
3         1.497289
4         4.023487
            ...   
999995    1.514057
999996    1.843659
999997    0.628248
999998    0.922213
999999    1.105709
Length: 1000000, dtype: float64

In [97]:
df.eval('A + @col_mean')

0         1.560066
1         0.830519
2         0.850785
3         1.497289
4         4.023487
            ...   
999995    1.514057
999996    1.843659
999997    0.628248
999998    0.922213
999999    1.105709
Length: 1000000, dtype: float64

In [98]:
df[(df.A < 0.5) & (df.B < 0.5) & (df.C > 0.5)]

Unnamed: 0,A,B,C,D,E,R
1,0.424981,0.038268,0.549519,0.726498,0.728869,-0.034902
15,0.345195,0.071002,0.845290,0.053175,0.572531,0.398748
36,0.122183,0.174952,0.787148,0.653953,0.054990,0.308483
45,0.329193,0.069272,0.568244,0.017256,0.947376,0.434751
47,0.136199,0.057219,0.504028,0.969802,0.588140,-0.320657
...,...,...,...,...,...,...
999973,0.227572,0.421288,0.955111,0.228909,0.233336,0.615247
999976,0.042678,0.125053,0.777993,0.189511,0.301331,0.146309
999978,0.424474,0.080124,0.624827,0.084814,0.274380,0.529437
999990,0.421117,0.123769,0.759835,0.184489,0.838719,0.429272


In [99]:
pd.eval('df[(df.A < 0.5) & (df.B < 0.5) & (df.C > 0.5)]')

Unnamed: 0,A,B,C,D,E,R
1,0.424981,0.038268,0.549519,0.726498,0.728869,-0.034902
15,0.345195,0.071002,0.845290,0.053175,0.572531,0.398748
36,0.122183,0.174952,0.787148,0.653953,0.054990,0.308483
45,0.329193,0.069272,0.568244,0.017256,0.947376,0.434751
47,0.136199,0.057219,0.504028,0.969802,0.588140,-0.320657
...,...,...,...,...,...,...
999973,0.227572,0.421288,0.955111,0.228909,0.233336,0.615247
999976,0.042678,0.125053,0.777993,0.189511,0.301331,0.146309
999978,0.424474,0.080124,0.624827,0.084814,0.274380,0.529437
999990,0.421117,0.123769,0.759835,0.184489,0.838719,0.429272


In [100]:
df.query('(A < 0.5) and (B < 0.5) and (C > 0.5)')

Unnamed: 0,A,B,C,D,E,R
1,0.424981,0.038268,0.549519,0.726498,0.728869,-0.034902
15,0.345195,0.071002,0.845290,0.053175,0.572531,0.398748
36,0.122183,0.174952,0.787148,0.653953,0.054990,0.308483
45,0.329193,0.069272,0.568244,0.017256,0.947376,0.434751
47,0.136199,0.057219,0.504028,0.969802,0.588140,-0.320657
...,...,...,...,...,...,...
999973,0.227572,0.421288,0.955111,0.228909,0.233336,0.615247
999976,0.042678,0.125053,0.777993,0.189511,0.301331,0.146309
999978,0.424474,0.080124,0.624827,0.084814,0.274380,0.529437
999990,0.421117,0.123769,0.759835,0.184489,0.838719,0.429272
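
`DataFrame.query()` accepts the same `@` prefix for local variables that `DataFrame.eval()` used above. A minimal sketch, where `threshold` is a made-up variable name and the frame is smaller than the one in the cells above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 3)), columns=list('ABC'))

threshold = 0.5  # hypothetical local variable referenced with @ below

# Boolean-mask form and query form of the same filter.
mask_result = df[(df.A < threshold) & (df.C > threshold)]
query_result = df.query('A < @threshold and C > @threshold')

print(len(query_result))
```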


## Combining Data

### concat() / append()

In [101]:
s1 = pd.Series(['a', 'b'], index = [1, 2])
s2 = pd.Series(['c', 'd'], index = [3, 4])
pd.concat([s1, s2])

1    a
2    b
3    c
4    d
dtype: object

In [102]:
def create_df(cols, idx):
    data = {c: [str(c.lower()) + str(i) for i in idx] for c in cols} 
    return pd.DataFrame(data, idx)

In [103]:
df1 = create_df('AB', [1, 2])
df1

Unnamed: 0,A,B
1,a1,b1
2,a2,b2


In [105]:
df2 = create_df('AB', [3, 4])
df2

Unnamed: 0,A,B
3,a3,b3
4,a4,b4


In [106]:
pd.concat([df1, df2])

Unnamed: 0,A,B
1,a1,b1
2,a2,b2
3,a3,b3
4,a4,b4


In [117]:
df3 = create_df('AB', [0, 1])
df3

Unnamed: 0,A,B
0,a0,b0
1,a1,b1


In [108]:
df4 = create_df('CD', [0, 1])
df4

Unnamed: 0,C,D
0,c0,d0
1,c1,d1


In [110]:
pd.concat([df3, df4])

Unnamed: 0,A,B,C,D
0,a0,b0,,
1,a1,b1,,
0,,,c0,d0
1,,,c1,d1


In [125]:
pd.concat([df3, df4], axis = 1)

Unnamed: 0,A,B,C,D
0,a0,b0,c0,d0
1,a1,b1,c1,d1


In [126]:
pd.concat([df1, df3])

Unnamed: 0,A,B
1,a1,b1
2,a2,b2
0,a0,b0
1,a1,b1


In [118]:
pd.concat([df1, df3], verify_integrity = True) # raises an error because index label 1 appears in both frames

ValueError: Indexes have overlapping values: Int64Index([1], dtype='int64')

In [119]:
pd.concat([df1, df3], ignore_index = True) # discard the original index labels and renumber from 0

Unnamed: 0,A,B
0,a1,b1
1,a2,b2
2,a0,b0
3,a1,b1


In [120]:
pd.concat([df1, df3], keys = ['X', 'Y']) 

Unnamed: 0,Unnamed: 1,A,B
X,1,a1,b1
X,2,a2,b2
Y,0,a0,b0
Y,1,a1,b1


In [122]:
df5 = create_df('ABC', [1, 2])
df6 = create_df('BCD', [3, 4])
pd.concat([df5, df6])

Unnamed: 0,A,B,C,D
1,a1,b1,c1,
2,a2,b2,c2,
3,,b3,c3,d3
4,,b4,c4,d4


In [123]:
pd.concat([df5, df6], join = 'inner') # inner join: keep only the columns present in both frames

Unnamed: 0,B,C
1,b1,c1
2,b2,c2
3,b3,c3
4,b4,c4


In [124]:
df5.append(df6)

Unnamed: 0,A,B,C,D
1,a1,b1,c1,
2,a2,b2,c2,
3,,b3,c3,d3
4,,b4,c4,d4
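
`DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0, so the cell above only runs on older releases such as the 1.2.4 used in this notebook. On newer versions the same result can be sketched with `pd.concat`:

```python
import pandas as pd

def create_df(cols, idx):
    # Same helper as above: cell values are the lowercase column name plus the row label.
    data = {c: [c.lower() + str(i) for i in idx] for c in cols}
    return pd.DataFrame(data, index=idx)

df5 = create_df('ABC', [1, 2])
df6 = create_df('BCD', [3, 4])

# Stacks the rows and fills columns missing from either frame with NaN.
combined = pd.concat([df5, df6])
print(combined)
```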


### Merge and Join

In [127]:
df1 = pd.DataFrame({'학생': ['홍길동', '이순신', '임꺽정', '김유신'],
                   '학과': ['경영학과', '교육학과', '컴퓨터학과', '통계학과']})
df1

Unnamed: 0,학생,학과
0,홍길동,경영학과
1,이순신,교육학과
2,임꺽정,컴퓨터학과
3,김유신,통계학과


In [128]:
df2 = pd.DataFrame({'학생': ['홍길동', '이순신', '임꺽정', '김유신'],
                   '입학년도': [2012, 2016, 2019, 2020]})
df2

Unnamed: 0,학생,입학년도
0,홍길동,2012
1,이순신,2016
2,임꺽정,2019
3,김유신,2020


In [129]:
df3 = pd.merge(df1, df2)
df3

Unnamed: 0,학생,학과,입학년도
0,홍길동,경영학과,2012
1,이순신,교육학과,2016
2,임꺽정,컴퓨터학과,2019
3,김유신,통계학과,2020


In [130]:
df4 = pd.DataFrame({'학과': ['경영학과', '교육학과', '컴퓨터학과', '통계학과'],
                   '학과장': ['황희', '장영실', '안창호', '정약용']})
df4

Unnamed: 0,학과,학과장
0,경영학과,황희
1,교육학과,장영실
2,컴퓨터학과,안창호
3,통계학과,정약용


In [131]:
pd.merge(df3, df4)

Unnamed: 0,학생,학과,입학년도,학과장
0,홍길동,경영학과,2012,황희
1,이순신,교육학과,2016,장영실
2,임꺽정,컴퓨터학과,2019,안창호
3,김유신,통계학과,2020,정약용


In [132]:
df5 = pd.DataFrame({'학과': ['경영학과', '교육학과', '교육학과', '컴퓨터학과', '컴퓨터학과', '통계학과'],
                   '과목': ['경영개론', '기초수학', '물리학', '프로그래밍', '운영체제', '확률론']})
df5


Unnamed: 0,학과,과목
0,경영학과,경영개론
1,교육학과,기초수학
2,교육학과,물리학
3,컴퓨터학과,프로그래밍
4,컴퓨터학과,운영체제
5,통계학과,확률론


In [134]:
pd.merge(df1, df5)

Unnamed: 0,학생,학과,과목
0,홍길동,경영학과,경영개론
1,이순신,교육학과,기초수학
2,이순신,교육학과,물리학
3,임꺽정,컴퓨터학과,프로그래밍
4,임꺽정,컴퓨터학과,운영체제
5,김유신,통계학과,확률론


In [137]:
pd.merge(df1, df2, on = '학생')

Unnamed: 0,학생,학과,입학년도
0,홍길동,경영학과,2012
1,이순신,교육학과,2016
2,임꺽정,컴퓨터학과,2019
3,김유신,통계학과,2020


In [141]:
df6 = pd.DataFrame({'이름': ['홍길동', '이순신', '임꺽정', '김유신'],
                   '성적': ['A', 'A+', 'B', 'A+']})
df6

Unnamed: 0,이름,성적
0,홍길동,A
1,이순신,A+
2,임꺽정,B
3,김유신,A+


In [162]:
pd.merge(df1, df6, left_on = '학생', right_on = '이름')

Unnamed: 0,학생,학과,이름,성적
0,홍길동,경영학과,홍길동,A
1,이순신,교육학과,이순신,A+
2,임꺽정,컴퓨터학과,임꺽정,B
3,김유신,통계학과,김유신,A+


In [145]:
pd.merge(df1, df6, left_on = '학생', right_on = '이름').drop('이름', axis = 1)

Unnamed: 0,학생,학과,성적
0,홍길동,경영학과,A
1,이순신,교육학과,A+
2,임꺽정,컴퓨터학과,B
3,김유신,통계학과,A+


In [146]:
mdf1 = df1.set_index('학생')
mdf2 = df2.set_index('학생')

In [147]:
mdf1

Unnamed: 0_level_0,학과
학생,Unnamed: 1_level_1
홍길동,경영학과
이순신,교육학과
임꺽정,컴퓨터학과
김유신,통계학과


In [148]:
mdf2

Unnamed: 0_level_0,입학년도
학생,Unnamed: 1_level_1
홍길동,2012
이순신,2016
임꺽정,2019
김유신,2020


In [149]:
pd.merge(mdf1, mdf2, left_index = True, right_index = True)

Unnamed: 0_level_0,학과,입학년도
학생,Unnamed: 1_level_1,Unnamed: 2_level_1
홍길동,경영학과,2012
이순신,교육학과,2016
임꺽정,컴퓨터학과,2019
김유신,통계학과,2020


In [150]:
mdf1.join(mdf2)

Unnamed: 0_level_0,학과,입학년도
학생,Unnamed: 1_level_1,Unnamed: 2_level_1
홍길동,경영학과,2012
이순신,교육학과,2016
임꺽정,컴퓨터학과,2019
김유신,통계학과,2020


In [154]:
pd.merge(mdf1, df6, left_index = True, right_on = '이름')

Unnamed: 0,학과,이름,성적
0,경영학과,홍길동,A
1,교육학과,이순신,A+
2,컴퓨터학과,임꺽정,B
3,통계학과,김유신,A+


In [152]:
df7 = pd.DataFrame({'이름': ['홍길동', '이순신', '임꺽정'], 
                   '주문음식': ['햄버거', '피자', '짜장면']})
df7

Unnamed: 0,이름,주문음식
0,홍길동,햄버거
1,이순신,피자
2,임꺽정,짜장면


In [168]:
df8 = pd.DataFrame({'이름': ['홍길동', '이순신', '김유신'], 
                   '주문음료': ['콜라', '사이다', '커피']})
df8

Unnamed: 0,이름,주문음료
0,홍길동,콜라
1,이순신,사이다
2,김유신,커피


In [169]:
pd.merge(df7, df8)  # inner

Unnamed: 0,이름,주문음식,주문음료
0,홍길동,햄버거,콜라
1,이순신,피자,사이다


In [171]:
pd.merge(df7, df8, how = 'outer' )

Unnamed: 0,이름,주문음식,주문음료
0,홍길동,햄버거,콜라
1,이순신,피자,사이다
2,임꺽정,짜장면,
3,김유신,,커피


In [173]:
pd.merge(df7, df8, how = 'left')

Unnamed: 0,이름,주문음식,주문음료
0,홍길동,햄버거,콜라
1,이순신,피자,사이다
2,임꺽정,짜장면,


In [172]:
pd.merge(df7, df8, how = 'right')

Unnamed: 0,이름,주문음식,주문음료
0,홍길동,햄버거,콜라
1,이순신,피자,사이다
2,김유신,,커피
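
When comparing the `how` options it can help to see which side each row came from. As a sketch, `pd.merge` has an `indicator` parameter that adds a `_merge` column for this purpose. The frames repeat `df7` and `df8` from above:

```python
import pandas as pd

df7 = pd.DataFrame({'이름': ['홍길동', '이순신', '임꺽정'],
                    '주문음식': ['햄버거', '피자', '짜장면']})
df8 = pd.DataFrame({'이름': ['홍길동', '이순신', '김유신'],
                    '주문음료': ['콜라', '사이다', '커피']})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'.
merged = pd.merge(df7, df8, how='outer', indicator=True)
print(merged)

# Look the origin up by name, since row order can differ between pandas versions.
origin = merged.set_index('이름')['_merge'].astype(str)
```

Filtering on `_merge == 'both'` reproduces the inner join, and `'left_only'` or `'right_only'` picks out the rows that only an outer, left, or right join keeps.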


In [178]:
df9 = pd.DataFrame({'이름' : ['홍길동', '이순신', '임꺽정', '김유신'], 
                   '순위' : [3, 2, 4, 1]})
df9

Unnamed: 0,이름,순위
0,홍길동,3
1,이순신,2
2,임꺽정,4
3,김유신,1


In [180]:
df10 = pd.DataFrame({'이름': ['홍길동', '이순신', '임꺽정', '김유신'], 
                   '순위' : [3, 1, 3, 2]})
df10

Unnamed: 0,이름,순위
0,홍길동,3
1,이순신,1
2,임꺽정,3
3,김유신,2


In [181]:
pd.merge(df9, df10, on = '이름')

Unnamed: 0,이름,순위_x,순위_y
0,홍길동,3,3
1,이순신,2,1
2,임꺽정,4,3
3,김유신,1,2


In [183]:
pd.merge(df9, df10, on = '이름', suffixes = ["_인기", "_성적"])

Unnamed: 0,이름,순위_인기,순위_성적
0,홍길동,3,3
1,이순신,2,1
2,임꺽정,4,3
3,김유신,1,2


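## Data Aggregation and Group Operations

### Aggregation


|Aggregation|Description|
|:--|:--|
|`count`|Number of non-NA items|
|`head`, `tail`|First few items, last few items|
|`describe`|Summary statistics for a Series or for each DataFrame column|
|`min`, `max`|Minimum, maximum|
|`cummin`, `cummax`|Cumulative minimum, cumulative maximum|
|`argmin`, `argmax`|Integer position of the minimum and maximum|
|`idxmin`, `idxmax`|Index label of the minimum and maximum|
|`mean`, `median`|Mean, median|
|`std`, `var`|Standard deviation, variance|
|`skew`|Skewness|
|`kurt`|Kurtosis|
|`mad`|Mean absolute deviation|
|`sum`, `cumsum`|Sum of all items, cumulative sum|
|`prod`, `cumprod`|Product of all items, cumulative product|
|`quantile`|Quantile, for a fraction between 0 and 1|
|`diff`|First-order difference|
|`pct_change`|Percent change|
|`corr`, `cov`|Correlation, covariance|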


In [187]:
df = pd.DataFrame([[1, 1.2, np.nan],
                 [2.4, 5.5, 4.2], 
                 [np.nan, np.nan, np.nan], 
                 [0.44, -3.1, -4.1]], 
                 index = [1, 2, 3, 4], 
                 columns = ['A', 'B', 'C'])
df

Unnamed: 0,A,B,C
1,1.0,1.2,
2,2.4,5.5,4.2
3,,,
4,0.44,-3.1,-4.1


In [189]:
df.head(2)

Unnamed: 0,A,B,C
1,1.0,1.2,
2,2.4,5.5,4.2


In [190]:
df.tail(2)

Unnamed: 0,A,B,C
3,,,
4,0.44,-3.1,-4.1


In [191]:
df.describe()

Unnamed: 0,A,B,C
count,3.0,3.0,2.0
mean,1.28,1.2,0.05
std,1.009554,4.3,5.868986
min,0.44,-3.1,-4.1
25%,0.72,-0.95,-2.025
50%,1.0,1.2,0.05
75%,1.7,3.35,2.125
max,2.4,5.5,4.2


In [194]:
print(df)
print(np.argmin(df), np.argmax(df)) # the data contains NaN, so both return the flattened position of the first NaN (2)

      A    B    C
1  1.00  1.2  NaN
2  2.40  5.5  4.2
3   NaN  NaN  NaN
4  0.44 -3.1 -4.1
2 2


In [198]:
print(df)
print(df.idxmin())
print(df.idxmax())

      A    B    C
1  1.00  1.2  NaN
2  2.40  5.5  4.2
3   NaN  NaN  NaN
4  0.44 -3.1 -4.1
A    4
B    4
C    4
dtype: int64
A    2
B    2
C    2
dtype: int64
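
The difference between an integer position (`argmin`) and an index label (`idxmin`) is easier to see on a single Series. A small sketch using column `A` of the frame above:

```python
import numpy as np
import pandas as pd

# Column A of the frame above, with index labels 1..4.
s = pd.Series([1.0, 2.4, np.nan, 0.44], index=[1, 2, 3, 4])

pos = s.argmin()     # integer position of the minimum, NaN skipped
label = s.idxmin()   # index label of the minimum, NaN skipped

print(pos, label)
```

Here the minimum `0.44` sits at position 3 but carries the label 4, so `s.iloc[pos]` and `s.loc[label]` refer to the same element.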


In [199]:
print(df)
print(df.std())
print(df.var())

      A    B    C
1  1.00  1.2  NaN
2  2.40  5.5  4.2
3   NaN  NaN  NaN
4  0.44 -3.1 -4.1
A    1.009554
B    4.300000
C    5.868986
dtype: float64
A     1.0192
B    18.4900
C    34.4450
dtype: float64


In [200]:
print(df)
print(df.skew())
print(df.kurt())

      A    B    C
1  1.00  1.2  NaN
2  2.40  5.5  4.2
3   NaN  NaN  NaN
4  0.44 -3.1 -4.1
A    1.15207
B    0.00000
C        NaN
dtype: float64
A   NaN
B   NaN
C   NaN
dtype: float64


In [201]:
print(df)
print(df.sum())
print(df.cumsum())

      A    B    C
1  1.00  1.2  NaN
2  2.40  5.5  4.2
3   NaN  NaN  NaN
4  0.44 -3.1 -4.1
A    3.84
B    3.60
C    0.10
dtype: float64
      A    B    C
1  1.00  1.2  NaN
2  3.40  6.7  4.2
3   NaN  NaN  NaN
4  3.84  3.6  0.1


In [203]:
print(df)
print(df.prod())
print(df.cumprod())

      A    B    C
1  1.00  1.2  NaN
2  2.40  5.5  4.2
3   NaN  NaN  NaN
4  0.44 -3.1 -4.1
A     1.056
B   -20.460
C   -17.220
dtype: float64
       A      B      C
1  1.000   1.20    NaN
2  2.400   6.60   4.20
3    NaN    NaN    NaN
4  1.056 -20.46 -17.22


In [204]:
df.diff()

Unnamed: 0,A,B,C
1,,,
2,1.4,4.3,
3,,,
4,,,


In [206]:
print(df)
print(df.quantile())

      A    B    C
1  1.00  1.2  NaN
2  2.40  5.5  4.2
3   NaN  NaN  NaN
4  0.44 -3.1 -4.1
A    1.00
B    1.20
C    0.05
Name: 0.5, dtype: float64


In [207]:
df.pct_change()

Unnamed: 0,A,B,C
1,,,
2,1.4,3.583333,
3,0.0,0.0,0.0
4,-0.816667,-1.563636,-1.97619


In [210]:
print(df)
print(df.corr())

      A    B    C
1  1.00  1.2  NaN
2  2.40  5.5  4.2
3   NaN  NaN  NaN
4  0.44 -3.1 -4.1
          A         B    C
A  1.000000  0.970725  1.0
B  0.970725  1.000000  1.0
C  1.000000  1.000000  1.0


In [209]:
df.corrwith(df.B)

A    0.970725
B    1.000000
C    1.000000
dtype: float64

In [211]:
df['B'].unique()

array([ 1.2,  5.5,  nan, -3.1])

In [212]:
df['A'].value_counts()

2.40    1
0.44    1
1.00    1
Name: A, dtype: int64

### GroupBy Operations

In [213]:
df = pd.DataFrame({'c1':['a', 'a', 'b', 'b', 'c', 'd', 'e'], 
                  'c2': ['A', 'A', 'B', 'B', 'C', 'D', 'E'], 
                  'c3': np.random.randint(7), # a single random integer, broadcast to every row (pass size = 7 for one value per row)
                  'c4': np.random.random(7)})
df

Unnamed: 0,c1,c2,c3,c4
0,a,A,2,0.217283
1,a,A,2,0.824383
2,b,B,2,0.054355
3,b,B,2,0.885497
4,c,C,2,0.423253
5,d,D,2,0.518867
6,e,E,2,0.799007


In [217]:
df.dtypes

c1     object
c2     object
c3      int64
c4    float64
dtype: object

In [219]:
df['c3'].groupby(df['c1']).mean()

c1
a    2
b    2
c    2
d    2
e    2
Name: c3, dtype: int64

In [220]:
df['c4'].groupby(df['c2']).std()

c2
A    0.429285
B    0.587706
C         NaN
D         NaN
E         NaN
Name: c4, dtype: float64

In [222]:
df['c4'].groupby([df['c1'], df['c2']]).sum()

c1  c2
a   A     1.041667
b   B     0.939853
c   C     0.423253
d   D     0.518867
e   E     0.799007
Name: c4, dtype: float64

In [223]:
df['c4'].groupby([df['c1'], df['c2']]).sum().unstack()

c2,A,B,C,D,E
c1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
a,1.041667,,,,
b,,0.939853,,,
c,,,0.423253,,
d,,,,0.518867,
e,,,,,0.799007


In [224]:
df.groupby('c1').mean()

Unnamed: 0_level_0,c3,c4
c1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,2,0.520833
b,2,0.469926
c,2,0.423253
d,2,0.518867
e,2,0.799007


In [226]:
df.groupby(['c1', 'c2']).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,c3,c4
c1,c2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,A,2,0.520833
b,B,2,0.469926
c,C,2,0.423253
d,D,2,0.518867
e,E,2,0.799007


In [228]:
print(df)
print(df.groupby(['c1', 'c2']).size())

  c1 c2  c3        c4
0  a  A   2  0.217283
1  a  A   2  0.824383
2  b  B   2  0.054355
3  b  B   2  0.885497
4  c  C   2  0.423253
5  d  D   2  0.518867
6  e  E   2  0.799007
c1  c2
a   A     2
b   B     2
c   C     1
d   D     1
e   E     1
dtype: int64


In [229]:
for c1, group in df.groupby('c1'):
    print(c1)
    print(group)

a
  c1 c2  c3        c4
0  a  A   2  0.217283
1  a  A   2  0.824383
b
  c1 c2  c3        c4
2  b  B   2  0.054355
3  b  B   2  0.885497
c
  c1 c2  c3        c4
4  c  C   2  0.423253
d
  c1 c2  c3        c4
5  d  D   2  0.518867
e
  c1 c2  c3        c4
6  e  E   2  0.799007


In [231]:
for (c1,c2), group in df.groupby(['c1', 'c2']):
    print((c1, c2))
    print(group)

('a', 'A')
  c1 c2  c3        c4
0  a  A   2  0.217283
1  a  A   2  0.824383
('b', 'B')
  c1 c2  c3        c4
2  b  B   2  0.054355
3  b  B   2  0.885497
('c', 'C')
  c1 c2  c3        c4
4  c  C   2  0.423253
('d', 'D')
  c1 c2  c3        c4
5  d  D   2  0.518867
('e', 'E')
  c1 c2  c3        c4
6  e  E   2  0.799007


In [232]:
df.groupby(['c1', 'c2'])[['c4']].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,c4
c1,c2,Unnamed: 2_level_1
a,A,0.520833
b,B,0.469926
c,C,0.423253
d,D,0.518867
e,E,0.799007


In [236]:
print(df)
print(df.groupby('c1').sum())
print(df.groupby('c1')['c3'].sum())

  c1 c2  c3        c4
0  a  A   2  0.217283
1  a  A   2  0.824383
2  b  B   2  0.054355
3  b  B   2  0.885497
4  c  C   2  0.423253
5  d  D   2  0.518867
6  e  E   2  0.799007
    c3        c4
c1              
a    4  1.041667
b    4  0.939853
c    2  0.423253
d    2  0.518867
e    2  0.799007
c1
a    4
b    4
c    2
d    2
e    2
Name: c3, dtype: int64


In [239]:
df.groupby('c1')['c4'].agg(['mean', 'sum', 'max', 'min']) # agg() applies several aggregations at once

Unnamed: 0_level_0,mean,sum,max,min
c1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
a,0.520833,1.041667,0.824383,0.217283
b,0.469926,0.939853,0.885497,0.054355
c,0.423253,0.423253,0.423253,0.423253
d,0.518867,0.518867,0.518867,0.518867
e,0.799007,0.799007,0.799007,0.799007
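
`agg()` also supports named aggregation, where each keyword becomes an output column and is given as a `(column, function)` pair. The sketch below uses fixed illustrative values rather than the random ones above, and the output names `c4_mean` and `c4_range` are made up:

```python
import pandas as pd

# Illustrative data with fixed values.
df = pd.DataFrame({'c1': ['a', 'a', 'b', 'b', 'c'],
                   'c4': [0.2, 0.8, 0.1, 0.9, 0.4]})

# keyword = (source column, aggregation); a custom function works too.
summary = df.groupby('c1').agg(
    c4_mean=('c4', 'mean'),
    c4_range=('c4', lambda x: x.max() - x.min()),
)
print(summary)
```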


In [240]:
def top(df, n = 3, column = 'c1'):
    return df.sort_values(by = column)[-n:]
top(df, 5)

Unnamed: 0,c1,c2,c3,c4
2,b,B,2,0.054355
3,b,B,2,0.885497
4,c,C,2,0.423253
5,d,D,2,0.518867
6,e,E,2,0.799007


In [244]:
df.groupby('c1').apply(top)

Unnamed: 0_level_0,Unnamed: 1_level_0,c1,c2,c3,c4
c1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
a,0,a,A,2,0.217283
a,1,a,A,2,0.824383
b,2,b,B,2,0.054355
b,3,b,B,2,0.885497
c,4,c,C,2,0.423253
d,5,d,D,2,0.518867
e,6,e,E,2,0.799007


In [245]:
df

Unnamed: 0,c1,c2,c3,c4
0,a,A,2,0.217283
1,a,A,2,0.824383
2,b,B,2,0.054355
3,b,B,2,0.885497
4,c,C,2,0.423253
5,d,D,2,0.518867
6,e,E,2,0.799007


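### Pivot Tables


|Parameter|Description|
|:--|:--|
|`values`|Column name or list of names to aggregate; all numeric columns are aggregated by default|
|`index`|Column names or group keys used to group the rows of the pivot table|
|`columns`|Column names or group keys used to group the columns of the pivot table|
|`aggfunc`|Aggregation function or list of functions; defaults to `mean`|
|`fill_value`|Value used to replace missing values in the result table|
|`dropna`|If True, columns whose entries are all NA are excluded|
|`margins`|Whether to add a row/column holding subtotals and the grand total; defaults to False|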

In [246]:
df

Unnamed: 0,c1,c2,c3,c4
0,a,A,2,0.217283
1,a,A,2,0.824383
2,b,B,2,0.054355
3,b,B,2,0.885497
4,c,C,2,0.423253
5,d,D,2,0.518867
6,e,E,2,0.799007


In [249]:
df.pivot_table(['c3', 'c4'], 
              index = ['c1'], 
              columns = ['c2'])

Unnamed: 0_level_0,c3,c3,c3,c3,c3,c4,c4,c4,c4,c4
c2,A,B,C,D,E,A,B,C,D,E
c1,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
a,2.0,,,,,0.520833,,,,
b,,2.0,,,,,0.469926,,,
c,,,2.0,,,,,0.423253,,
d,,,,2.0,,,,,0.518867,
e,,,,,2.0,,,,,0.799007
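
A pivot table with the default `aggfunc` can be read as a group-by on the index and column keys, followed by `mean()` and `unstack()`. A minimal sketch of that equivalence, using fixed illustrative values instead of the random frame above:

```python
import pandas as pd

# Illustrative data with fixed values.
df = pd.DataFrame({'c1': ['a', 'a', 'b', 'b', 'c'],
                   'c2': ['A', 'A', 'B', 'B', 'C'],
                   'c4': [0.2, 0.8, 0.1, 0.9, 0.4]})

# Pivot table: rows grouped by c1, columns by c2, mean of c4 in each cell.
pivoted = df.pivot_table('c4', index='c1', columns='c2')

# The same table built step by step.
grouped = df.groupby(['c1', 'c2'])['c4'].mean().unstack()

print(pivoted)
```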


In [253]:
df.pivot_table(['c3', 'c4'], 
              index = ['c1'], 
              columns = ['c2'], 
              margins = True, 
              aggfunc = 'sum'
              )

Unnamed: 0_level_0,c3,c3,c3,c3,c3,c3,c4,c4,c4,c4,c4,c4
c2,A,B,C,D,E,All,A,B,C,D,E,All
c1,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2
a,4.0,,,,,4,1.041667,,,,,1.041667
b,,4.0,,,,4,,0.939853,,,,0.939853
c,,,2.0,,,2,,,0.423253,,,0.423253
d,,,,2.0,,2,,,,0.518867,,0.518867
e,,,,,2.0,2,,,,,0.799007,0.799007
All,4.0,4.0,2.0,2.0,2.0,14,1.041667,0.939853,0.423253,0.518867,0.799007,3.722647


In [254]:
df.pivot_table(['c3', 'c4'], 
              index = ['c1'], 
              columns = ['c2'], 
              margins = True, 
              aggfunc = 'sum',
              fill_value = 0)

Unnamed: 0_level_0,c3,c3,c3,c3,c3,c3,c4,c4,c4,c4,c4,c4
c2,A,B,C,D,E,All,A,B,C,D,E,All
c1,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2
a,4,0,0,0,0,4,1.041667,0.0,0.0,0.0,0.0,1.041667
b,0,4,0,0,0,4,0.0,0.939853,0.0,0.0,0.0,0.939853
c,0,0,2,0,0,2,0.0,0.0,0.423253,0.0,0.0,0.423253
d,0,0,0,2,0,2,0.0,0.0,0.0,0.518867,0.0,0.518867
e,0,0,0,0,2,2,0.0,0.0,0.0,0.0,0.799007,0.799007
All,4,4,2,2,2,14,1.041667,0.939853,0.423253,0.518867,0.799007,3.722647


In [255]:
pd.crosstab(df.c1, df.c2)

c2,A,B,C,D,E
c1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
a,2,0,0,0,0
b,0,2,0,0,0
c,0,0,1,0,0
d,0,0,0,1,0
e,0,0,0,0,1


In [256]:
pd.crosstab(df.c1, df.c2, values = df.c3, aggfunc = 'sum', margins = True)

c2,A,B,C,D,E,All
c1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
a,4.0,,,,,4
b,,4.0,,,,4
c,,,2.0,,,2
d,,,,2.0,,2
e,,,,,2.0,2
All,4.0,4.0,2.0,2.0,2.0,14


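### Categorical Data


|Method|Description|
|:--|:--|
|`add_categories`|Add new categories to the existing ones|
|`as_ordered`|Make the categories ordered|
|`as_unordered`|Make the categories unordered|
|`remove_categories`|Remove categories|
|`remove_unused_categories`|Remove categories that are not used|
|`rename_categories`|Rename categories|
|`reorder_categories`|Reorder the existing categories|
|`set_categories`|Replace the categories with a new set|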

In [257]:
s = pd.Series(['c1', 'c2', 'c1', 'c2','c1'] * 2)
s

0    c1
1    c2
2    c1
3    c2
4    c1
5    c1
6    c2
7    c1
8    c2
9    c1
dtype: object

In [259]:
pd.unique(s)

array(['c1', 'c2'], dtype=object)

In [260]:
pd.value_counts(s)

c1    6
c2    4
dtype: int64

In [261]:
code = pd.Series([0, 1, 0, 1, 0]*2)
code

0    0
1    1
2    0
3    1
4    0
5    0
6    1
7    0
8    1
9    0
dtype: int64

In [262]:
d = pd.Series(['c1', 'c2'])
d

0    c1
1    c2
dtype: object

In [263]:
d.take(code)

0    c1
1    c2
0    c1
1    c2
0    c1
0    c1
1    c2
0    c1
1    c2
0    c1
dtype: object

In [264]:
df = pd.DataFrame({'id' : np.arange(len(s)), 
                  'c' : s,
                  'v' : np.random.randint(1000, 5000, size = len(s))})
df

Unnamed: 0,id,c,v
0,0,c1,2962
1,1,c2,1260
2,2,c1,1819
3,3,c2,4356
4,4,c1,1535
5,5,c1,3016
6,6,c2,2003
7,7,c1,2561
8,8,c2,2088
9,9,c1,2713


In [268]:
c = df['c'].astype('category')
c

0    c1
1    c2
2    c1
3    c2
4    c1
5    c1
6    c2
7    c1
8    c2
9    c1
Name: c, dtype: category
Categories (2, object): ['c1', 'c2']

In [269]:
c.values

['c1', 'c2', 'c1', 'c2', 'c1', 'c1', 'c2', 'c1', 'c2', 'c1']
Categories (2, object): ['c1', 'c2']

In [270]:
c.values.categories

Index(['c1', 'c2'], dtype='object')

In [271]:
c.values.codes

array([0, 1, 0, 1, 0, 0, 1, 0, 1, 0], dtype=int8)

In [272]:
df['c'] = c
df.c

0    c1
1    c2
2    c1
3    c2
4    c1
5    c1
6    c2
7    c1
8    c2
9    c1
Name: c, dtype: category
Categories (2, object): ['c1', 'c2']

In [287]:
df

Unnamed: 0,id,c,v
0,0,c1,2962
1,1,c2,1260
2,2,c1,1819
3,3,c2,4356
4,4,c1,1535
5,5,c1,3016
6,6,c2,2003
7,7,c1,2561
8,8,c2,2088
9,9,c1,2713


In [274]:
# 2) create a Categorical directly with pd.Categorical
c = pd.Categorical(['c1', 'c2', 'c3', 'c1', 'c2'])
c

['c1', 'c2', 'c3', 'c1', 'c2']
Categories (3, object): ['c1', 'c2', 'c3']

In [276]:
# 3) create a Categorical from codes and categories
categories = ['c1', 'c2', 'c3']
codes = [0, 1, 2, 0, 1]
c = pd.Categorical.from_codes(codes, categories)
c

['c1', 'c2', 'c3', 'c1', 'c2']
Categories (3, object): ['c1', 'c2', 'c3']

In [277]:
pd.Categorical.from_codes(codes, categories, ordered = True)

['c1', 'c2', 'c3', 'c1', 'c2']
Categories (3, object): ['c1' < 'c2' < 'c3']

In [278]:
c.as_ordered()

['c1', 'c2', 'c3', 'c1', 'c2']
Categories (3, object): ['c1' < 'c2' < 'c3']

In [279]:
c.codes

array([0, 1, 2, 0, 1], dtype=int8)

In [280]:
c.categories

Index(['c1', 'c2', 'c3'], dtype='object')

In [281]:
# replace the categories with a new set
c = c.set_categories(['c1', 'c2', 'c3', 'c4', 'c5'])
c.categories

Index(['c1', 'c2', 'c3', 'c4', 'c5'], dtype='object')

In [282]:
c.value_counts()

c1    2
c2    2
c3    1
c4    0
c5    0
dtype: int64

In [284]:
c[c.isin(['c1', 'c3'])]

['c1', 'c3', 'c1']
Categories (5, object): ['c1', 'c2', 'c3', 'c4', 'c5']

In [285]:
c = c.remove_unused_categories()

In [286]:
c.categories

Index(['c1', 'c2', 'c3'], dtype='object')
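
On a Series with `category` dtype, the methods from the table are reached through the `.cat` accessor. The sketch below uses made-up labels to show `reorder_categories`, `add_categories`, and `rename_categories`, which the cells above do not cover:

```python
import pandas as pd

# Hypothetical labels; the categories are inferred and sorted: high, low, mid.
s = pd.Series(['low', 'high', 'mid', 'low'], dtype='category')

# Give the categories a meaningful order so comparisons work.
ordered = s.cat.reorder_categories(['low', 'mid', 'high'], ordered=True)
above_low = ordered > 'low'

# Add a category that no value uses yet, and rename the existing ones.
extended = ordered.cat.add_categories(['extreme'])
renamed = ordered.cat.rename_categories({'low': 'L', 'mid': 'M', 'high': 'H'})

print(ordered.cat.categories)
print(renamed)
```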

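## String Operations

#### String Methods

- Mirrors almost all of Python's built-in string methods
- Called through the `.str` accessor of a Series, e.g. `Series.str.lower()`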

In [290]:
name_tuple = ['Wonsup Kim', 'Steven Jobs', 'Larry Page', 'Elon Must', None, 'Bill Gates', 'Mark Zuckerberg']
names = pd.Series(name_tuple)
names

0         Wonsup Kim
1        Steven Jobs
2         Larry Page
3          Elon Must
4               None
5         Bill Gates
6    Mark Zuckerberg
dtype: object

In [291]:
names.str.lower()

0         wonsup kim
1        steven jobs
2         larry page
3          elon must
4               None
5         bill gates
6    mark zuckerberg
dtype: object

In [292]:
names.str.split()

0         [Wonsup, Kim]
1        [Steven, Jobs]
2         [Larry, Page]
3          [Elon, Must]
4                  None
5         [Bill, Gates]
6    [Mark, Zuckerberg]
dtype: object

In [295]:
names.str.find('a')

0   -1.0
1   -1.0
2    1.0
3   -1.0
4    NaN
5    6.0
6    1.0
dtype: float64

In [296]:
names.str.len()

0    10.0
1    11.0
2    10.0
3     9.0
4     NaN
5    10.0
6    15.0
dtype: float64

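#### Other Methods


|Method|Description|
|:--|:--|
|`get()`|Index into each element|
|`slice()`|Slice each element|
|`slice_replace()`|Replace a slice of each element with a given value|
|`cat()`|Concatenate strings|
|`repeat()`|Repeat values|
|`normalize()`|Return the Unicode normal form of each string|
|`pad()`|Add whitespace to the left, right, or both sides of strings|
|`wrap()`|Split long strings into lines shorter than a given width|
|`join()`|Join the strings in each element of the Series with the passed separator|
|`get_dummies()`|Extract dummy variables as a DataFrame|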

In [297]:
names.str[0:4]

0    Wons
1    Stev
2    Larr
3    Elon
4    None
5    Bill
6    Mark
dtype: object

In [298]:
names.str.split().str.get(-1)

0           Kim
1          Jobs
2          Page
3          Must
4          None
5         Gates
6    Zuckerberg
dtype: object

In [299]:
names.str.cat()

'Wonsup KimSteven JobsLarry PageElon MustBill GatesMark Zuckerberg'

In [300]:
names.str.repeat(2)

0              Wonsup KimWonsup Kim
1            Steven JobsSteven Jobs
2              Larry PageLarry Page
3                Elon MustElon Must
4                              None
5              Bill GatesBill Gates
6    Mark ZuckerbergMark Zuckerberg
dtype: object

In [301]:
names.str.join('*')

0              W*o*n*s*u*p* *K*i*m
1            S*t*e*v*e*n* *J*o*b*s
2              L*a*r*r*y* *P*a*g*e
3                E*l*o*n* *M*u*s*t
4                             None
5              B*i*l*l* *G*a*t*e*s
6    M*a*r*k* *Z*u*c*k*e*r*b*e*r*g
dtype: object
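
Three methods from the table that the cells above do not exercise are `pad()`, `get()` on split results, and `get_dummies()`. A short sketch with made-up data, assuming `|` as the tag separator:

```python
import pandas as pd

# Hypothetical example data.
codes = pd.Series(['7', '42', None])
tags = pd.Series(['a|b', 'b', None, 'a|c'])

# Left-pad to width 4 with zeros; missing values stay missing.
padded = codes.str.pad(4, side='left', fillchar='0')

# Split on the separator, then take the first piece of each list.
first_tag = tags.str.split('|').str.get(0)

# One indicator column per distinct tag; a missing value gives a row of zeros.
dummies = tags.str.get_dummies(sep='|')

print(padded)
print(dummies)
```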

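#### Regular Expressions


|Method|Description|
|:--|:--|
|`match()`|Call `re.match()` on each element; returns a Boolean|
|`extract()`|Call `re.search()` on each element; returns the matched groups as strings|
|`findall()`|Call `re.findall()` on each element|
|`replace()`|Replace occurrences of the pattern with another string|
|`contains()`|Call `re.search()` on each element; returns a Boolean|
|`count()`|Count occurrences of the pattern|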


In [302]:
names.str.match('([A-Za-z]+)')

0    True
1    True
2    True
3    True
4    None
5    True
6    True
dtype: object

In [303]:
names.str.findall('([A-Za-z]+)')

0         [Wonsup, Kim]
1        [Steven, Jobs]
2         [Larry, Page]
3          [Elon, Must]
4                  None
5         [Bill, Gates]
6    [Mark, Zuckerberg]
dtype: object
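
The remaining methods in the table follow the same pattern. A sketch on a shortened copy of `names`, with patterns chosen only for illustration:

```python
import pandas as pd

names = pd.Series(['Wonsup Kim', 'Larry Page', None, 'Bill Gates'])

# First capture group of the first match, returned as a Series.
first = names.str.extract(r'^([A-Za-z]+)', expand=False)

# Boolean test; na=False maps missing values to False instead of NaN.
has_capital_g = names.str.contains(r'G', na=False)

# Number of lowercase vowels in each name.
vowels = names.str.count(r'[aeiou]')

# Swap "First Last" into "Last, First" using backreferences.
swapped = names.str.replace(r'(\w+) (\w+)', r'\2, \1', regex=True)

print(swapped)
```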

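## Time Series

#### Time Series Data Structures


### Time Series Basics

### Frequencies and Offsets


### Shift

### Time Zone Handling

* Time zones are handled as offsets from Coordinated Universal Time (UTC)
* pandas uses `pytz`, a library built on the Olson database of worldwide time zone information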
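
As a minimal sketch of the two points above (dates and values are made up), a naive series can be localized to UTC and then converted to another zone by name:

```python
import pandas as pd

# Hypothetical daily series with naive (zone-less) timestamps.
ts = pd.Series([1, 2, 3],
               index=pd.date_range('2021-01-01 09:00', periods=3, freq='D'))

# Attach UTC to the naive timestamps, then express them in Seoul time.
utc = ts.tz_localize('UTC')
seoul = utc.tz_convert('Asia/Seoul')

print(seoul)
```

`tz_convert` changes only how each instant is displayed: 09:00 UTC becomes 18:00 at the +09:00 offset, and the values are untouched.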

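### Periods and Period Arithmetic

### Resampling

* Resampling: converting a time series from one frequency to another
* Downsampling: aggregating higher-frequency data to a lower frequency
* Upsampling: converting lower-frequency data to a higher frequency (the new time points are filled or interpolated, not aggregated)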
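
Both directions can be sketched with `resample()`. The example assumes two weeks of made-up daily values starting on Monday 2021-01-04, and uses forward fill as one possible way to populate the upsampled points:

```python
import pandas as pd

# Two weeks of hypothetical daily values, starting on a Monday.
daily = pd.Series(range(14),
                  index=pd.date_range('2021-01-04', periods=14, freq='D'))

# Downsampling: aggregate daily values into weekly sums (weeks end on Sunday).
weekly = daily.resample('W').sum()

# Upsampling: go back to daily frequency, carrying each weekly value forward.
back_to_daily = weekly.resample('D').ffill()

print(weekly)
print(back_to_daily)
```

Downsampling needs an aggregation such as `sum()` or `mean()`, while upsampling needs a fill rule such as `ffill()` or `asfreq()`, because the new time points have no data of their own.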

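### Moving Window

## Reading and Writing Data


### Reading and Writing Text Files

### Reading and Writing Binary Data Files

## Data Cleaning

### Handling Missing Values

* Most real-world data is not clean and contains missing values
* Different data sources represent missing values in different ways
* Missing data is written as `null`, `NaN`, or `NA`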
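
A short sketch of how these representations behave in practice, using made-up values. In a numeric Series, `None` is converted to `NaN`, and the same detection and repair methods apply to both:

```python
import pandas as pd

# None in a numeric Series is stored as NaN, which forces a float dtype.
s_num = pd.Series([1, None, 3])

# A missing entry among strings is detected the same way.
s_str = pd.Series(['a', None, 'c'])

missing = s_num.isna()      # Boolean mask of missing entries
filled = s_num.fillna(0)    # replace missing entries with a value
dropped = s_num.dropna()    # drop missing entries, keeping the original labels

print(s_num)
print(filled)
```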

#### None: Python's Missing Data

#### NaN: Missing Numerical Data

#### Handling Null Values


### Removing Duplicates

### Replacing Values

## References

* Pandas website: https://pandas.pydata.org/
* Jake VanderPlas, "Python Data Science Handbook", O'Reilly
* Wes McKinney, "Python for Data Analysis", O'Reilly