# 2. Wrangling Data in Python with pandas
**Let's work with DataFrames using pandas.**

1. Getting started with pandas
    - Prerequisite: Table
    - Importing pandas

2. Handling 1-D data with pandas: Series
    - Creating a Series
    - Series vs ndarray
    - Series vs dict
    - Naming a Series
3. Handling 2-D data with pandas: DataFrame
    - Creating a DataFrame
    - From CSV to DataFrame
    - Accessing DataFrame data

[COVID data used in this lesson](https://www.kaggle.com/imdevskp/corona-virus-report)

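## I. Getting started with pandas

### Prerequisite: Table
- A data structure (container) that stores and manages data in rows and columns
- Rows usually represent entities; columns represent attributes


### Importing pandas
Start with <code>import pandas</code>; the alias <code>pd</code> is the usual convention.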

In [1]:
import pandas as pd

## II. Handling 1-D data with pandas: Series

### What is a Series?
- 1-D labeled **array**
- An index can be specified for the values

In [4]:
s = pd.Series([1, 4, 9, 16, 25])

s

0     1
1     4
2     9
3    16
4    25
dtype: int64

In [6]:
t = pd.Series({'one':1, 'two':2, 'three':3, 'four':4, 'five': 5})

t

one      1
two      2
three    3
four     4
five     5
dtype: int64
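
The two cells above rely on the default integer index and on dict keys. As a minimal sketch of specifying the index directly, the `index` argument can be passed when the Series is created (the labels `"a"` to `"c"` are made up for illustration):

```python
import pandas as pd

# Hypothetical data: the labels "a".."c" exist only for this example.
u = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(u.index.tolist())  # the labels supplied through the index argument
print(u["b"])            # look up a value by its label
```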

### Series + NumPy

- A Series behaves much like an ndarray!

In [8]:
s[1]

4

In [9]:
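t.iloc[1]  # second element by position (plain t[1] on a string index is deprecated in recent pandas versions)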

2

In [10]:
t[1:3]

two      2
three    3
dtype: int64

In [11]:
s[s > s.median()]  # keep only the values greater than the Series' own median

3    16
4    25
dtype: int64

In [12]:
s[[3, 1, 4]]

3    16
1     4
4    25
dtype: int64

In [13]:
import numpy as np

np.exp(s)

0    2.718282e+00
1    5.459815e+01
2    8.103084e+03
3    8.886111e+06
4    7.200490e+10
dtype: float64

In [23]:
s.dtype

dtype('int64')
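
One difference from a plain ndarray is worth keeping in mind: arithmetic between two Series matches values by index label rather than by position. A minimal sketch with made-up labels:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "x"])  # same labels, different order

total = a + b                               # values are matched by label
positional = a.to_numpy() + b.to_numpy()    # plain ndarrays add position by position

print(total)
print(positional)
```

Here `total["x"]` is 1 + 30, because both operands are looked up under the label `"x"`, while the ndarray sum simply pairs the first elements.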

### Series + dict
- A Series also behaves much like a **dict**

In [24]:
t

one      1
two      2
three    3
four     4
five     5
dtype: int64

In [25]:
t['one']

1

In [26]:
# Add a value to the Series

t['six'] = 6

t

one      1
two      2
three    3
four     4
five     5
six      6
dtype: int64

In [27]:
'six' in t

True

In [28]:
'seven' in t

False

In [29]:
# t['seven']  # left commented out: a missing label raises KeyError

In [30]:
t.get('seven')

In [31]:
t.get('seven',0)

0
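
The difference between bracket access and <code>.get()</code> for a missing label can be sketched as follows. This is a small stand-alone example with its own two-item Series, not the `t` from the lesson:

```python
import pandas as pd

t_small = pd.Series({"one": 1, "two": 2})

try:
    t_small["seven"]              # a missing label raises KeyError, as with a dict
except KeyError:
    print("'seven' is not in the index")

missing = t_small.get("seven")    # no default given: returns None
value = t_small.get("seven", 0)   # returns the default instead of raising
print(missing, value)
```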

### Naming a Series
- A Series has a <code>name</code> attribute.
- The name can be given when the Series is first created, or assigned later.

In [32]:
s = pd.Series(np.random.randn(5), name = "random_nums")

s

0   -0.352523
1   -0.995875
2   -0.143107
3    0.532284
4   -0.414448
Name: random_nums, dtype: float64

In [38]:
s.name = "임의의 난수"  # Korean for "arbitrary random numbers"

s

0   -0.352523
1   -0.995875
2   -0.143107
3    0.532284
4   -0.414448
Name: 임의의 난수, dtype: float64
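
One place where the name becomes visible is when a Series is turned into a DataFrame. A minimal sketch, using fixed values instead of random numbers so the result is reproducible:

```python
import pandas as pd

named = pd.Series([1.5, 2.5, 3.5], name="random_nums")
frame = named.to_frame()        # one-column DataFrame

print(frame.columns.tolist())   # the Series name is reused as the column label
```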

## III. Handling 2-D data with pandas: DataFrame

### What is a DataFrame?

- 2-D labeled **table**
- An index can also be specified


In [39]:
d = {"height": [1, 2, 3, 4], "weight": [30, 40, 50, 60]}

df = pd.DataFrame(d)

df

,height,weight
0,1,30
1,2,40
2,3,50
3,4,60


In [41]:
## Check the dtypes
# A numpy array has a single dtype; a DataFrame exposes dtypes (plural) because each column can have a different dtype!

df.dtypes

height    int64
weight    int64
dtype: object
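
In the example above both columns happen to be integers. A minimal sketch with one column per dtype, plus a custom index, shows why the attribute is plural (the column values and row labels are hypothetical):

```python
import pandas as pd

# Hypothetical columns chosen so that each has a different dtype.
people = pd.DataFrame(
    {
        "height": [170, 180, 165],       # integers
        "weight": [65.5, 80.0, 58.2],    # floats
        "member": [True, False, True],   # booleans
    },
    index=["kim", "lee", "park"],        # custom row labels
)

print(people.dtypes)           # one dtype per column
print(people.index.tolist())
```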

### From CSV to DataFrame

- A CSV (comma-separated values) file can be loaded into a DataFrame.
- Use <code>pd.read_csv()</code>

In [43]:
# If country_wise_latest.csv is in the same directory:

covid = pd.read_csv("./country_wise_latest.csv")

covid

,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,Deaths / 100 Cases,Recovered / 100 Cases,Deaths / 100 Recovered,Confirmed last week,1 week change,1 week % increase,WHO Region
0,Afghanistan,36263,1269,25198,9796,106,10,18,3.50,69.49,5.04,35526,737,2.07,Eastern Mediterranean
1,Albania,4880,144,2745,1991,117,6,63,2.95,56.25,5.25,4171,709,17.00,Europe
2,Algeria,27973,1163,18837,7973,616,8,749,4.16,67.34,6.17,23691,4282,18.07,Africa
3,Andorra,907,52,803,52,10,0,0,5.73,88.53,6.48,884,23,2.60,Europe
4,Angola,950,41,242,667,18,1,0,4.32,25.47,16.94,749,201,26.84,Africa
5,Antigua and Barbuda,86,3,65,18,4,0,5,3.49,75.58,4.62,76,10,13.16,Americas
6,Argentina,167416,3059,72575,91782,4890,120,2057,1.83,43.35,4.21,130774,36642,28.02,Americas
7,Armenia,37390,711,26665,10014,73,6,187,1.90,71.32,2.67,34981,2409,6.89,Europe
8,Australia,15303,167,9311,5825,368,6,137,1.09,60.84,1.79,12428,2875,23.13,Western Pacific
9,Austria,20558,713,18246,1599,86,1,37,3.47,88.75,3.91,19743,815,4.13,Europe
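
To try <code>read_csv</code> without downloading the file, the same call also accepts a file-like object. The sketch below feeds it a three-column excerpt (two rows copied from the output above) through `io.StringIO`; the variable names are only for illustration:

```python
import io
import pandas as pd

# A tiny excerpt of the lesson data, kept in memory instead of on disk.
csv_text = """Country/Region,Confirmed,WHO Region
Afghanistan,36263,Eastern Mediterranean
Albania,4880,Europe
"""

sample = pd.read_csv(io.StringIO(csv_text))  # the first line becomes the header
print(sample.shape)
print(sample.columns.tolist())
```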


### Using pandas 1. Inspecting part of the data

<code>head(n)</code>: returns the first n rows

In [44]:
# Look at the first 5 rows

covid.head(5)

,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,Deaths / 100 Cases,Recovered / 100 Cases,Deaths / 100 Recovered,Confirmed last week,1 week change,1 week % increase,WHO Region
0,Afghanistan,36263,1269,25198,9796,106,10,18,3.5,69.49,5.04,35526,737,2.07,Eastern Mediterranean
1,Albania,4880,144,2745,1991,117,6,63,2.95,56.25,5.25,4171,709,17.0,Europe
2,Algeria,27973,1163,18837,7973,616,8,749,4.16,67.34,6.17,23691,4282,18.07,Africa
3,Andorra,907,52,803,52,10,0,0,5.73,88.53,6.48,884,23,2.6,Europe
4,Angola,950,41,242,667,18,1,0,4.32,25.47,16.94,749,201,26.84,Africa


<code>tail(n)</code>: returns the last n rows

In [45]:
# Look at the last 5 rows

covid.tail(5)

,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,Deaths / 100 Cases,Recovered / 100 Cases,Deaths / 100 Recovered,Confirmed last week,1 week change,1 week % increase,WHO Region
182,West Bank and Gaza,10621,78,3752,6791,152,2,0,0.73,35.33,2.08,8916,1705,19.12,Eastern Mediterranean
183,Western Sahara,10,1,8,1,0,0,0,10.0,80.0,12.5,10,0,0.0,Africa
184,Yemen,1691,483,833,375,10,4,36,28.56,49.26,57.98,1619,72,4.45,Eastern Mediterranean
185,Zambia,4552,140,2815,1597,71,1,465,3.08,61.84,4.97,3326,1226,36.86,Africa
186,Zimbabwe,2704,36,542,2126,192,2,24,1.33,20.04,6.64,1713,991,57.85,Africa
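
A quick way to check how <code>head</code> and <code>tail</code> behave is a small frame whose values equal their positions. This is a stand-alone sketch, not part of the covid data:

```python
import pandas as pd

numbers = pd.DataFrame({"n": range(10)})

first = numbers.head(3)     # first three rows
last = numbers.tail(3)      # last three rows
default = numbers.head()    # with no argument, n defaults to 5

print(first["n"].tolist(), last["n"].tolist(), len(default))
```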


### Using pandas 2. Accessing data

- <code>df['column_name']</code> or <code>df.column_name</code>
- <code>df.column_name</code> cannot be used when the column name contains a space

In [48]:
covid['Confirmed']

0        36263
1         4880
2        27973
3          907
4          950
5           86
6       167416
7        37390
8        15303
9        20558
10       30446
11         382
12       39482
13      226225
14         110
15       67251
16       66428
17          48
18        1770
19          99
20       71181
21       10498
22         739
23     2442375
24         141
25       10621
26        1100
27         350
28         378
29        2328
        ...   
157     272421
158       2805
159      11424
160       1483
161      79395
162      34477
163        674
164        462
165       7235
166        509
167       3297
168         24
169        874
170        148
171       1455
172     227019
173    4290259
174       1128
175      67096
176      59177
177     301708
178       1202
179      21209
180      15988
181        431
182      10621
183         10
184       1691
185       4552
186       2704
Name: Confirmed, Length: 187, dtype: int64

In [47]:
covid.Active

0         9796
1         1991
2         7973
3           52
4          667
5           18
6        91782
7        10014
8         5825
9         1599
10        6781
11         280
12        3231
13       97577
14           9
15        6221
16       39154
17          20
18         699
19          13
20       47056
21        5274
22         674
23      508116
24           0
25        4689
26         121
27          52
28          76
29         756
        ...   
157      93613
158        673
159       4765
160        534
161      73695
162       1599
163        634
164         15
165       1147
166        305
167        128
168         24
169        249
170         12
171        248
172      10920
173    2816444
174        140
175      28258
176       6322
177     254427
178        216
179       9414
180       5883
181         66
182       6791
183          1
184        375
185       1597
186       2126
Name: Active, Length: 187, dtype: int64
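
The restriction on attribute access follows from Python syntax: `covid.New cases` is not a valid expression, so a column such as `New cases` has to go through brackets. A minimal sketch with a two-row stand-in frame (values taken from the first two rows of the lesson data):

```python
import pandas as pd

cases = pd.DataFrame({"Active": [9796, 1991], "New cases": [106, 117]})

active = cases.Active            # works: "Active" is a valid Python name
new_cases = cases["New cases"]   # brackets are required: the name has a space

print(active.tolist())
print(new_cases.tolist())
```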

### Tip: each column of a DataFrame is a Series!

In [50]:
type(covid['Confirmed'])

pandas.core.series.Series

In [51]:
covid['Confirmed'][0]

36263

In [52]:
covid['Confirmed'][1:5]

1     4880
2    27973
3      907
4      950
Name: Confirmed, dtype: int64

### Using pandas 3. Accessing data with conditions

In [57]:
# Find the countries with more than 100 new cases!

covid[covid['New cases'] > 100].head(5)

,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,Deaths / 100 Cases,Recovered / 100 Cases,Deaths / 100 Recovered,Confirmed last week,1 week change,1 week % increase,WHO Region
0,Afghanistan,36263,1269,25198,9796,106,10,18,3.5,69.49,5.04,35526,737,2.07,Eastern Mediterranean
1,Albania,4880,144,2745,1991,117,6,63,2.95,56.25,5.25,4171,709,17.0,Europe
2,Algeria,27973,1163,18837,7973,616,8,749,4.16,67.34,6.17,23691,4282,18.07,Africa
6,Argentina,167416,3059,72575,91782,4890,120,2057,1.83,43.35,4.21,130774,36642,28.02,Americas
8,Australia,15303,167,9311,5825,368,6,137,1.09,60.84,1.79,12428,2875,23.13,Western Pacific


In [59]:
# Find the countries whose WHO region ('WHO Region' column) is South-East Asia

covid['WHO Region'].unique()  # lists the distinct categories of a categorical column

array(['Eastern Mediterranean', 'Europe', 'Africa', 'Americas',
       'Western Pacific', 'South-East Asia'], dtype=object)

In [60]:
covid[covid['WHO Region'] == 'South-East Asia']

,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,Deaths / 100 Cases,Recovered / 100 Cases,Deaths / 100 Recovered,Confirmed last week,1 week change,1 week % increase,WHO Region
13,Bangladesh,226225,2965,125683,97577,2772,37,1801,1.31,55.56,2.36,207453,18772,9.05,South-East Asia
19,Bhutan,99,0,86,13,4,0,1,0.0,86.87,0.0,90,9,10.0,South-East Asia
27,Burma,350,6,292,52,0,0,2,1.71,83.43,2.05,341,9,2.64,South-East Asia
79,India,1480073,33408,951166,495499,44457,637,33598,2.26,64.26,3.51,1155338,324735,28.11,South-East Asia
80,Indonesia,100303,4838,58173,37292,1525,57,1518,4.82,58.0,8.32,88214,12089,13.7,South-East Asia
106,Maldives,3369,15,2547,807,67,0,19,0.45,75.6,0.59,2999,370,12.34,South-East Asia
119,Nepal,18752,48,13754,4950,139,3,626,0.26,73.35,0.35,17844,908,5.09,South-East Asia
158,Sri Lanka,2805,11,2121,673,23,0,15,0.39,75.61,0.52,2730,75,2.75,South-East Asia
167,Thailand,3297,58,3111,128,6,0,2,1.76,94.36,1.86,3250,47,1.45,South-East Asia
168,Timor-Leste,24,0,0,24,0,0,0,0.0,0.0,0.0,24,0,0.0,South-East Asia


### Using pandas 4. Accessing data by row

In [63]:
# Example data: library catalog (the index labels are Korean book titles)

books_dict = {"Available": [True, True, False], "Location": [102, 215, 323], "Genre": ["Programming", "Physics", "Math"]}

books_df = pd.DataFrame(books_dict, index = ['버그란 무엇인가', '두근두근 물리학', '미분해줘 홈즈'])

books_df

,Available,Location,Genre
버그란 무엇인가,True,102,Programming
두근두근 물리학,True,215,Physics
미분해줘 홈즈,False,323,Math


### Access by index label: <code>.loc[row, col]</code>

In [64]:
books_df.loc["버그란 무엇인가"]

Available           True
Location             102
Genre        Programming
Name: 버그란 무엇인가, dtype: object

In [65]:
type(books_df.loc["버그란 무엇인가"])

pandas.core.series.Series

In [66]:
# Is the book "미분해줘 홈즈" available to borrow?

books_df.loc["미분해줘 홈즈",'Available']

False

### Access by integer position: <code>.iloc[rowidx, colidx]</code>

In [67]:
# Row at position 0, column at position 1

books_df.iloc[0, 1]

102

In [68]:
# Row at position 1, columns at positions 0 to 1

books_df.iloc[1, 0:2]

Available    True
Location      215
Name: 두근두근 물리학, dtype: object
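
One detail that often surprises newcomers is slicing: a label slice with <code>.loc</code> includes both endpoints, while a position slice with <code>.iloc</code> excludes the end, like ordinary Python slicing. A minimal sketch with hypothetical labels `book_a` to `book_c`:

```python
import pandas as pd

shelf = pd.DataFrame(
    {"Location": [102, 215, 323]},
    index=["book_a", "book_b", "book_c"],  # hypothetical labels
)

by_label = shelf.loc["book_a":"book_b"]    # label slice: both endpoints included
by_position = shelf.iloc[0:1]              # position slice: the end is excluded

print(by_label.index.tolist())
print(by_position.index.tolist())
```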

### Using pandas 5. groupby

- Split: divide the DataFrame into groups according to a given key
- Apply: reduce each group with a statistical function such as sum(), mean(), or median()
- Combine: build a new Series from the applied results (group_key: applied_value)
<br/>
<br/>

<code>.groupby()</code>

In [69]:
covid.head(5)

,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,Deaths / 100 Cases,Recovered / 100 Cases,Deaths / 100 Recovered,Confirmed last week,1 week change,1 week % increase,WHO Region
0,Afghanistan,36263,1269,25198,9796,106,10,18,3.5,69.49,5.04,35526,737,2.07,Eastern Mediterranean
1,Albania,4880,144,2745,1991,117,6,63,2.95,56.25,5.25,4171,709,17.0,Europe
2,Algeria,27973,1163,18837,7973,616,8,749,4.16,67.34,6.17,23691,4282,18.07,Africa
3,Andorra,907,52,803,52,10,0,0,5.73,88.53,6.48,884,23,2.6,Europe
4,Angola,950,41,242,667,18,1,0,4.32,25.47,16.94,749,201,26.84,Africa


In [70]:
# Confirmed cases per WHO Region

# 1. Extract only the confirmed-cases column from covid
# 2. Group it by covid's WHO Region column

covid_by_region = covid['Confirmed'].groupby(by=covid["WHO Region"])

# Only the split step has happened so far
covid_by_region

<pandas.core.groupby.generic.SeriesGroupBy object at 0x0000016CD80715C0>

In [71]:
covid_by_region.sum()

WHO Region
Africa                    723207
Americas                 8839286
Eastern Mediterranean    1490744
Europe                   3299523
South-East Asia          1835297
Western Pacific           292428
Name: Confirmed, dtype: int64

In [72]:
# Average confirmed cases per country in each region

covid_by_region.mean()  # sum() / number of countries in the region

WHO Region
Africa                    15066.812500
Americas                 252551.028571
Eastern Mediterranean     67761.090909
Europe                    58920.053571
South-East Asia          183529.700000
Western Pacific           18276.750000
Name: Confirmed, dtype: float64
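
The split, apply, and combine steps are easier to verify on numbers small enough to add by hand. The sketch below uses a made-up frame and the equivalent spelling `df.groupby(column)[target]`; `agg` is shown as one way to compute several statistics in a single call:

```python
import pandas as pd

# Hypothetical numbers, small enough to check by hand.
toy = pd.DataFrame(
    {
        "region": ["A", "A", "B", "B", "B"],
        "confirmed": [10, 30, 1, 2, 3],
    }
)

groups = toy.groupby("region")["confirmed"]    # split
sums = groups.sum()                            # apply + combine: A=40, B=6
means = groups.mean()                          # A=20.0, B=2.0
summary = groups.agg(["sum", "mean", "max"])   # several statistics at once

print(sums)
print(means)
print(summary)
```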

## Mission:
### 1. In the covid data, which country has the highest deaths per 100 cases (`Deaths / 100 Cases`)?

In [91]:
covid[covid["Deaths / 100 Cases"] == covid["Deaths / 100 Cases"].max()]["Country/Region"]

184    Yemen
Name: Country/Region, dtype: object
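
An alternative worth knowing is `idxmax()`, which returns the index label of the largest value so the row can be fetched with <code>.loc</code>. This is a sketch on a three-row excerpt of the lesson data, not a replacement for the answer above. Note that `idxmax()` reports only the first row when several rows tie for the maximum.

```python
import pandas as pd

rates = pd.DataFrame(
    {
        "Country/Region": ["Yemen", "Zambia", "Zimbabwe"],
        "Deaths / 100 Cases": [28.56, 3.08, 1.33],
    }
)

top_row = rates["Deaths / 100 Cases"].idxmax()      # index label of the largest value
top_country = rates.loc[top_row, "Country/Region"]
print(top_country)
```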

### 2. In the covid data, list every country with no new cases whose WHO Region is 'Europe'.  
Hint: applying two conditions at once on a single line may trigger a warning.

In [101]:
covid_1 = covid[covid["New cases"] == 0]
covid_2 = covid_1[covid_1["WHO Region"] == 'Europe']

covid_2

,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,Deaths / 100 Cases,Recovered / 100 Cases,Deaths / 100 Recovered,Confirmed last week,1 week change,1 week % increase,WHO Region
56,Estonia,2034,69,1923,42,0,0,1,3.39,94.54,3.59,2021,13,0.64,Europe
75,Holy See,12,0,12,0,0,0,0,0.0,100.0,0.0,12,0,0.0,Europe
95,Latvia,1219,31,1045,143,0,0,0,2.54,85.73,2.97,1192,27,2.27,Europe
100,Liechtenstein,86,1,81,4,0,0,0,1.16,94.19,1.23,86,0,0.0,Europe
113,Monaco,116,4,104,8,0,0,0,3.45,89.66,3.85,109,7,6.42,Europe
143,San Marino,699,42,657,0,0,0,0,6.01,93.99,6.39,699,0,0.0,Europe
157,Spain,272421,28432,150376,93613,0,0,0,10.44,55.2,18.91,264836,7585,2.86,Europe


### 3. Using the following [data](https://www.kaggle.com/neuromusic/avocado-prices), print the highest average price (AveragePrice) of avocados for each region.

In [108]:
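# Assumes the Kaggle file was saved as ./avocado.csv (the file name is an assumption)
df = pd.read_csv("./avocado.csv")
df_region = df['AveragePrice'].groupby(by=df['region']).max()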

df_region