# 2. Wrangling Data in Python with pandas
**Let's work with DataFrames using pandas.**

1. Getting started with pandas
    - Prerequisite: Table
    - Importing pandas

2. Handling 1-D data with pandas - Series
    - Declaring a Series
    - Series vs ndarray
    - Series vs dict
    - Naming a Series
3. Handling 2-D data with pandas - DataFrame
    - Declaring a DataFrame
    - From CSV to DataFrame
    - Accessing DataFrame data

[COVID data used in this lesson](https://www.kaggle.com/imdevskp/corona-virus-report)

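## I. Getting started with pandas

### Prerequisite: Table

- A data structure (container) that stores and manages data in rows and columns
- Rows usually represent entities, and columns represent their attributes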
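
As a rough illustration of the row/column idea, a table can be sketched in plain Python as a list of records, one dict per row. The names `people`, `name`, `height`, and `weight` and all values are made up for this example.

```python
# A table sketched as a list of rows: each row is one entity,
# each key is one attribute (column). The data is made up.
people = [
    {"name": "Kim", "height": 170, "weight": 65},
    {"name": "Lee", "height": 160, "weight": 55},
]

# Read one "column" by collecting the same attribute from every row.
heights = [row["height"] for row in people]
print(heights)  # [170, 160]
```

pandas provides the same structure with labeled rows and columns built in, so this manual bookkeeping is not needed.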

In [2]:
import pandas as pd

## II. Handling 1-D data with pandas - Series

### Series?

- A 1-D labeled array
- You can specify the index (the labels) yourself
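
A minimal sketch of specifying the index, assuming made-up labels (`kim`, `lee`, `park`) and values. The `index` argument of `pd.Series` sets the labels; when it is omitted, pandas uses 0, 1, 2, and so on.

```python
import pandas as pd

# Explicit labels via the `index` argument (labels are made up).
scores = pd.Series([90, 85, 70], index=["kim", "lee", "park"])

print(scores["lee"])        # 85
print(list(scores.index))   # ['kim', 'lee', 'park']

# Without `index`, pandas falls back to 0, 1, 2, ...
default = pd.Series([90, 85, 70])
print(list(default.index))  # [0, 1, 2]
```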

In [3]:
s = pd.Series([1, 4, 9, 16, 25])
t = pd.Series({'one': 1, 'two': 2, 'three': 3})

s
t

one      1
two      2
three    3
dtype: int64

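### Series + NumPy
- A Series behaves much like a NumPy ndarray: it supports positional indexing, slicing, boolean masks, and element-wise functions.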

In [4]:
s[1]

4

In [5]:
# Positional access; plain t[1] on a string-labeled Series is deprecated in recent pandas
t.iloc[1]

2

In [6]:
t[1:3]

two      2
three    3
dtype: int64

In [7]:
s[s > s.median()]

3    16
4    25
dtype: int64

In [8]:
s[[3, 1, 4]]

3    16
1     4
4    25
dtype: int64

In [9]:
import numpy as np

np.exp(s)

0    2.718282e+00
1    5.459815e+01
2    8.103084e+03
3    8.886111e+06
4    7.200490e+10
dtype: float64

In [10]:
s.dtype

dtype('int64')
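
The similarity to an ndarray also covers arithmetic. The following is a small sketch, reusing the same values as `s` above, of element-wise operations and of converting a Series to a plain ndarray with `to_numpy()`.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 4, 9, 16, 25])

# Arithmetic is applied element by element, as with an ndarray.
doubled = s * 2
roots = np.sqrt(s)

print(doubled.tolist())  # [2, 8, 18, 32, 50]
print(roots.tolist())    # [1.0, 2.0, 3.0, 4.0, 5.0]

# Unlike an ndarray, the result keeps the index labels.
print(list(roots.index))  # [0, 1, 2, 3, 4]

# .to_numpy() gives the underlying values as a plain ndarray.
arr = s.to_numpy()
print(type(arr).__name__)  # ndarray
```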

### Series + dict
- A Series also behaves much like a dict: values are looked up, added, and tested by label.

In [11]:
t

one      1
two      2
three    3
dtype: int64

In [12]:
t['one']

1

In [13]:
t['four'] = 4

t

one      1
two      2
three    3
four     4
dtype: int64

In [14]:
'four' in t

True

In [15]:
'five' in t

False

In [16]:
# t['six'] would raise a KeyError; .get() returns None instead, so nothing is displayed
t.get('six')

In [17]:
t.get('six', 0)

0
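
The difference between bracket access and `.get()` can be sketched as follows, using the same labels as `t` above. This is only an illustration of the dict-like behavior, not an exhaustive list of it.

```python
import pandas as pd

t = pd.Series({"one": 1, "two": 2, "three": 3})

# Bracket access with a missing label raises KeyError, like a dict.
try:
    t["six"]
    found = True
except KeyError:
    found = False
print(found)  # False

# .get() returns None (or the given default) instead of raising.
print(t.get("six"))     # None
print(t.get("six", 0))  # 0

# A Series converts back to a plain dict of label -> value.
as_dict = t.to_dict()
print(as_dict)  # {'one': 1, 'two': 2, 'three': 3}
```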

### Naming a Series

- A Series has a `name` attribute.
- The name can be set when the Series is first created, and changed later by assigning to `name`.

In [18]:
s = pd.Series(np.random.randn(5), name='random_nums')

s

0   -2.415851
1    1.563125
2    0.539950
3   -0.460772
4    0.701739
Name: random_nums, dtype: float64

In [19]:
s.name = '임의의 난수'

s

0   -2.415851
1    1.563125
2    0.539950
3   -0.460772
4    0.701739
Name: 임의의 난수, dtype: float64
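
One place where the name matters is when a Series is turned into a DataFrame. The sketch below uses made-up values; `to_frame()` and `rename()` are standard Series methods.

```python
import pandas as pd

s = pd.Series([1.5, 2.5, 3.5], name="random_nums")

# The Series name becomes the column label of the resulting DataFrame.
frame = s.to_frame()
print(list(frame.columns))  # ['random_nums']

# rename() with a scalar returns a new Series with a different name
# and leaves the original untouched.
renamed = s.rename("values")
print(renamed.name, s.name)  # values random_nums
```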

## III. Handling 2-D data with pandas - DataFrame

### DataFrame?
- A 2-D labeled **table**
- You can also specify the index (row labels) yourself
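
A minimal sketch of a DataFrame with explicit row labels. The labels `a` to `d` are hypothetical and chosen only for illustration.

```python
import pandas as pd

d = {"height": [1, 2, 3, 4], "weight": [30, 40, 50, 60]}

# Row labels are hypothetical, chosen only for illustration.
df = pd.DataFrame(d, index=["a", "b", "c", "d"])

print(df.shape)          # (4, 2)
print(list(df.index))    # ['a', 'b', 'c', 'd']
print(list(df.columns))  # ['height', 'weight']
```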

In [20]:
d = {'height': [1, 2, 3, 4], 'weight': [30, 40, 50, 60]}

In [21]:
df = pd.DataFrame(d)

df

   height  weight
0       1      30
1       2      40
2       3      50
3       4      60


In [22]:
# Check the dtype of each column

df.dtypes

height    int64
weight    int64
dtype: object

### From CSV to DataFrame

- A CSV (Comma-Separated Values) file can be loaded directly into a DataFrame.
- Use `pd.read_csv()`.
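
To try `read_csv` without a file on disk, a small in-memory CSV can stand in for one. The column names and rows below are made up; `io.StringIO` simply wraps the text as a file-like object.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for a file on disk (made-up rows).
csv_text = """country,confirmed,deaths
A,100,3
B,250,7
"""

# read_csv accepts a path or any file-like object.
small = pd.read_csv(io.StringIO(csv_text))

print(small.shape)               # (2, 3)
print(list(small.columns))       # ['country', 'confirmed', 'deaths']
print(small["confirmed"].sum())  # 350
```

The first line of the CSV is used as the column header, and the row index defaults to 0, 1, 2, and so on.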

In [23]:
covid = pd.read_csv("./archive/country_wise_latest.csv")

covid

Unnamed: 0,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,Deaths / 100 Cases,Recovered / 100 Cases,Deaths / 100 Recovered,Confirmed last week,1 week change,1 week % increase,WHO Region
0,Afghanistan,36263,1269,25198,9796,106,10,18,3.50,69.49,5.04,35526,737,2.07,Eastern Mediterranean
1,Albania,4880,144,2745,1991,117,6,63,2.95,56.25,5.25,4171,709,17.00,Europe
2,Algeria,27973,1163,18837,7973,616,8,749,4.16,67.34,6.17,23691,4282,18.07,Africa
3,Andorra,907,52,803,52,10,0,0,5.73,88.53,6.48,884,23,2.60,Europe
4,Angola,950,41,242,667,18,1,0,4.32,25.47,16.94,749,201,26.84,Africa
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
182,West Bank and Gaza,10621,78,3752,6791,152,2,0,0.73,35.33,2.08,8916,1705,19.12,Eastern Mediterranean
183,Western Sahara,10,1,8,1,0,0,0,10.00,80.00,12.50,10,0,0.00,Africa
184,Yemen,1691,483,833,375,10,4,36,28.56,49.26,57.98,1619,72,4.45,Eastern Mediterranean
185,Zambia,4552,140,2815,1597,71,1,465,3.08,61.84,4.97,3326,1226,36.86,Africa


### Using pandas 1. Viewing part of the data

`head(n)`: returns the first n rows

In [24]:
# View the first 5 rows

covid.head(5)

Unnamed: 0,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,Deaths / 100 Cases,Recovered / 100 Cases,Deaths / 100 Recovered,Confirmed last week,1 week change,1 week % increase,WHO Region
0,Afghanistan,36263,1269,25198,9796,106,10,18,3.5,69.49,5.04,35526,737,2.07,Eastern Mediterranean
1,Albania,4880,144,2745,1991,117,6,63,2.95,56.25,5.25,4171,709,17.0,Europe
2,Algeria,27973,1163,18837,7973,616,8,749,4.16,67.34,6.17,23691,4282,18.07,Africa
3,Andorra,907,52,803,52,10,0,0,5.73,88.53,6.48,884,23,2.6,Europe
4,Angola,950,41,242,667,18,1,0,4.32,25.47,16.94,749,201,26.84,Africa


`tail(n)`: returns the last n rows

In [25]:
# View the last 5 rows

covid.tail(5)

Unnamed: 0,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,Deaths / 100 Cases,Recovered / 100 Cases,Deaths / 100 Recovered,Confirmed last week,1 week change,1 week % increase,WHO Region
182,West Bank and Gaza,10621,78,3752,6791,152,2,0,0.73,35.33,2.08,8916,1705,19.12,Eastern Mediterranean
183,Western Sahara,10,1,8,1,0,0,0,10.0,80.0,12.5,10,0,0.0,Africa
184,Yemen,1691,483,833,375,10,4,36,28.56,49.26,57.98,1619,72,4.45,Eastern Mediterranean
185,Zambia,4552,140,2815,1597,71,1,465,3.08,61.84,4.97,3326,1226,36.86,Africa
186,Zimbabwe,2704,36,542,2126,192,2,24,1.33,20.04,6.64,1713,991,57.85,Africa
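
A short sketch of how `head` and `tail` behave on a small made-up frame, including the default of 5 rows when `n` is omitted.

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})

# n defaults to 5 when omitted.
print(len(df.head()))            # 5
print(df.head(3)["x"].tolist())  # [0, 1, 2]
print(df.tail(3)["x"].tolist())  # [7, 8, 9]

# Asking for more rows than exist simply returns everything.
print(len(df.head(100)))         # 10
```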


### Using pandas 2. Accessing data

- `df['column_name']` or `df.column_name`

In [26]:
covid["Confirmed"]

0      36263
1       4880
2      27973
3        907
4        950
       ...  
182    10621
183       10
184     1691
185     4552
186     2704
Name: Confirmed, Length: 187, dtype: int64

In [27]:
covid.Active

0      9796
1      1991
2      7973
3        52
4       667
       ... 
182    6791
183       1
184     375
185    1597
186    2126
Name: Active, Length: 187, dtype: int64
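
The two access forms are not fully interchangeable. The sketch below, on a made-up two-column frame, shows that attribute access needs a column name that is a valid Python identifier, while bracket access works for any name.

```python
import pandas as pd

df = pd.DataFrame({"Active": [5, 7], "New cases": [1, 0]})

# Attribute access works when the column name is a valid Python identifier.
print(df.Active.tolist())        # [5, 7]

# A name containing a space can only be reached with brackets.
print(df["New cases"].tolist())  # [1, 0]

# Both forms return the same data for the same column.
same = df.Active.equals(df["Active"])
print(same)  # True
```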

### Pro tip! Each column of a DataFrame is a Series!

In [28]:
covid['Confirmed'][0]

covid['Confirmed'][1:5]

1     4880
2    27973
3      907
4      950
Name: Confirmed, dtype: int64
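
This can be checked directly. The sketch below uses the first five `Confirmed` values from the table above in a stand-in frame, and applies a few Series operations to the column.

```python
import pandas as pd

df = pd.DataFrame({"Confirmed": [36263, 4880, 27973, 907, 950]})
col = df["Confirmed"]

print(isinstance(col, pd.Series))  # True
print(col.name)                    # Confirmed

# Series operations such as slicing, filtering and statistics all apply.
print(col.iloc[1:3].tolist())      # [4880, 27973]
print(col[col > 1000].tolist())    # [36263, 4880, 27973]
print(col.max())                   # 36263
```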

### Using pandas 3. Accessing data with a condition

In [29]:
# Find the countries with more than 100 new cases

covid[covid["New cases"] > 100].head(5)

Unnamed: 0,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,Deaths / 100 Cases,Recovered / 100 Cases,Deaths / 100 Recovered,Confirmed last week,1 week change,1 week % increase,WHO Region
0,Afghanistan,36263,1269,25198,9796,106,10,18,3.5,69.49,5.04,35526,737,2.07,Eastern Mediterranean
1,Albania,4880,144,2745,1991,117,6,63,2.95,56.25,5.25,4171,709,17.0,Europe
2,Algeria,27973,1163,18837,7973,616,8,749,4.16,67.34,6.17,23691,4282,18.07,Africa
6,Argentina,167416,3059,72575,91782,4890,120,2057,1.83,43.35,4.21,130774,36642,28.02,Americas
8,Australia,15303,167,9311,5825,368,6,137,1.09,60.84,1.79,12428,2875,23.13,Western Pacific


In [30]:
# Find the countries whose WHO Region is South-East Asia

kinds = covid['WHO Region'].unique()
print(kinds)

covid[covid['WHO Region']  == 'South-East Asia'].head(3)

['Eastern Mediterranean' 'Europe' 'Africa' 'Americas' 'Western Pacific'
 'South-East Asia']


Unnamed: 0,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,Deaths / 100 Cases,Recovered / 100 Cases,Deaths / 100 Recovered,Confirmed last week,1 week change,1 week % increase,WHO Region
13,Bangladesh,226225,2965,125683,97577,2772,37,1801,1.31,55.56,2.36,207453,18772,9.05,South-East Asia
19,Bhutan,99,0,86,13,4,0,1,0.0,86.87,0.0,90,9,10.0,South-East Asia
27,Burma,350,6,292,52,0,0,2,1.71,83.43,2.05,341,9,2.64,South-East Asia
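
What happens inside the brackets can be sketched in two steps on a made-up frame: the comparison first produces a boolean Series (a mask), and indexing with that mask keeps only the rows marked `True`.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "new_cases": [106, 10, 616, 0],
})

# The comparison yields one True/False per row.
mask = df["new_cases"] > 100
print(mask.tolist())  # [True, False, True, False]

# True counts as 1, so sum() gives the number of matching rows.
print(mask.sum())     # 2

# Indexing with the mask keeps only the True rows; original labels are kept.
selected = df[mask]
print(selected["country"].tolist())  # ['A', 'C']
print(list(selected.index))          # [0, 2]
```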


### Using pandas 4. Accessing data by row

In [31]:
# Example data - library information

books_dict = {'Available':[True, True, False], 'Location':[102, 215, 323], 'Genre':['Programming', 'Physics', 'Math']}

books_df = pd.DataFrame(books_dict, index=['a', 'b', 'c'])

books_df

   Available  Location        Genre
a       True       102  Programming
b       True       215      Physics
c      False       323         Math


### Selecting by index label: `.loc[row, col]`

In [32]:
# type: Series
books_df.loc['b']

books_df.loc['a', 'Available']
books_df.loc['a']['Available']

True

### Selecting by integer position: `.iloc[row_idx, col_idx]`

In [33]:
books_df.iloc[0, 1]

books_df.iloc[0:2, 1:3]

   Location        Genre
a       102  Programming
b       215      Physics
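
The difference between `.loc` and `.iloc` is easiest to see when integer labels do not match positions. The first frame below is a made-up example with deliberately shuffled labels; the second reuses the `Location` column from the books example to compare how the two accessors slice.

```python
import pandas as pd

# Integer labels that deliberately differ from the positions 0, 1, 2.
df = pd.DataFrame({"value": [10, 20, 30]}, index=[2, 0, 1])

by_label = df.loc[0, "value"]  # row labelled 0
by_position = df.iloc[0, 0]    # first row, first column
print(by_label, by_position)   # 20 10

# With string labels, a .loc slice includes its end point,
# while an .iloc slice excludes it, like ordinary Python slicing.
books = pd.DataFrame({"Location": [102, 215, 323]}, index=["a", "b", "c"])
loc_slice = books.loc["a":"b", "Location"].tolist()
iloc_slice = books.iloc[0:1, 0].tolist()
print(loc_slice)   # [102, 215]
print(iloc_slice)  # [102]
```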


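### Using pandas 5. groupby

- Split: divide the DataFrame into groups by a given key
- Apply: apply an aggregate function such as `sum()`, `mean()`, or `median()` to reduce each group to a single value
- Combine: build a new Series from the applied results (group_key: applied_value)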
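
The three steps can be sketched by hand and compared with `groupby`. This is only an illustration on made-up data of what the steps mean, not how pandas implements them internally.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["Europe", "Africa", "Europe", "Africa"],
    "confirmed": [10, 1, 30, 3],
})

# Split: one sub-frame per distinct key.
# Apply: reduce each sub-frame to a single number.
manual = {
    key: df[df["region"] == key]["confirmed"].sum()
    for key in df["region"].unique()
}
# Combine: collect the per-group results into one Series, sorted by key.
manual = pd.Series(manual).sort_index()

# groupby performs the same three steps in one call.
grouped = df.groupby("region")["confirmed"].sum()

print(grouped.to_dict())                    # {'Africa': 4, 'Europe': 40}
print(manual.tolist() == grouped.tolist())  # True
```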

In [34]:
# Confirmed cases by WHO Region

covid_by_region = covid['Confirmed'].groupby(by=covid["WHO Region"])
covid_by_region

<pandas.core.groupby.generic.SeriesGroupBy object at 0x070A4D30>

In [35]:
covid_by_region.sum()

WHO Region
Africa                    723207
Americas                 8839286
Eastern Mediterranean    1490744
Europe                   3299523
South-East Asia          1835297
Western Pacific           292428
Name: Confirmed, dtype: int64

In [36]:
# Mean number of confirmed cases per country, within each WHO Region
covid_by_region.mean()

WHO Region
Africa                    15066.812500
Americas                 252551.028571
Eastern Mediterranean     67761.090909
Europe                    58920.053571
South-East Asia          183529.700000
Western Pacific           18276.750000
Name: Confirmed, dtype: float64

## Mission:
### 1. In the covid data, which country has the highest death rate per 100 cases (`Deaths / 100 Cases`)?

In [37]:
covid.columns

Index(['Country/Region', 'Confirmed', 'Deaths', 'Recovered', 'Active',
       'New cases', 'New deaths', 'New recovered', 'Deaths / 100 Cases',
       'Recovered / 100 Cases', 'Deaths / 100 Recovered',
       'Confirmed last week', '1 week change', '1 week % increase',
       'WHO Region'],
      dtype='object')

In [38]:
max_val = covid['Deaths / 100 Cases'].max()

covid[covid['Deaths / 100 Cases'] == max_val]

Unnamed: 0,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,Deaths / 100 Cases,Recovered / 100 Cases,Deaths / 100 Recovered,Confirmed last week,1 week change,1 week % increase,WHO Region
184,Yemen,1691,483,833,375,10,4,36,28.56,49.26,57.98,1619,72,4.45,Eastern Mediterranean


In [42]:
# idxmax() returns an index label, so .loc is the matching accessor
covid.loc[covid['Deaths / 100 Cases'].idxmax()]

Country/Region                            Yemen
Confirmed                                  1691
Deaths                                      483
Recovered                                   833
Active                                      375
New cases                                    10
New deaths                                    4
New recovered                                36
Deaths / 100 Cases                        28.56
Recovered / 100 Cases                     49.26
Deaths / 100 Recovered                    57.98
Confirmed last week                        1619
1 week change                                72
1 week % increase                          4.45
WHO Region                Eastern Mediterranean
Name: 184, dtype: object
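
`idxmax()` returns an index label rather than a position. With the default 0, 1, 2, ... index the two coincide, but the sketch below shows the label-based behavior on a stand-in frame indexed by country name. It uses three values from the tables above; the column name `deaths_per_100` is hypothetical.

```python
import pandas as pd

# Labels are country names here, not 0..n-1.
rates = pd.DataFrame(
    {"deaths_per_100": [3.5, 28.56, 1.33]},
    index=["Afghanistan", "Yemen", "Zimbabwe"],
)

label = rates["deaths_per_100"].idxmax()
print(label)  # Yemen

# idxmax() returns a label, so .loc is the matching accessor.
top = rates.loc[label, "deaths_per_100"]
print(top)  # 28.56
```

Because `.iloc` expects integer positions, pairing it with `idxmax()` only works when the labels happen to equal the positions.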

### 2. In the covid data, list every country with no new cases whose WHO Region is 'Europe'.  
Hint: applying two conditions at once on a single line may trigger a warning.

In [46]:
zeros = covid[covid['New cases'] == 0]

zeros[zeros['WHO Region'] == 'Europe']

Unnamed: 0,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,Deaths / 100 Cases,Recovered / 100 Cases,Deaths / 100 Recovered,Confirmed last week,1 week change,1 week % increase,WHO Region
56,Estonia,2034,69,1923,42,0,0,1,3.39,94.54,3.59,2021,13,0.64,Europe
75,Holy See,12,0,12,0,0,0,0,0.0,100.0,0.0,12,0,0.0,Europe
95,Latvia,1219,31,1045,143,0,0,0,2.54,85.73,2.97,1192,27,2.27,Europe
100,Liechtenstein,86,1,81,4,0,0,0,1.16,94.19,1.23,86,0,0.0,Europe
113,Monaco,116,4,104,8,0,0,0,3.45,89.66,3.85,109,7,6.42,Europe
143,San Marino,699,42,657,0,0,0,0,6.01,93.99,6.39,699,0,0.0,Europe
157,Spain,272421,28432,150376,93613,0,0,0,10.44,55.2,18.91,264836,7585,2.86,Europe


In [45]:
covid[(covid['New cases'] == 0) & (covid['WHO Region'] == 'Europe')]

Unnamed: 0,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,Deaths / 100 Cases,Recovered / 100 Cases,Deaths / 100 Recovered,Confirmed last week,1 week change,1 week % increase,WHO Region
56,Estonia,2034,69,1923,42,0,0,1,3.39,94.54,3.59,2021,13,0.64,Europe
75,Holy See,12,0,12,0,0,0,0,0.0,100.0,0.0,12,0,0.0,Europe
95,Latvia,1219,31,1045,143,0,0,0,2.54,85.73,2.97,1192,27,2.27,Europe
100,Liechtenstein,86,1,81,4,0,0,0,1.16,94.19,1.23,86,0,0.0,Europe
113,Monaco,116,4,104,8,0,0,0,3.45,89.66,3.85,109,7,6.42,Europe
143,San Marino,699,42,657,0,0,0,0,6.01,93.99,6.39,699,0,0.0,Europe
157,Spain,272421,28432,150376,93613,0,0,0,10.44,55.2,18.91,264836,7585,2.86,Europe
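
When combining two conditions in one expression, the element-wise operator `&` is needed rather than Python's `and`. The sketch below, on made-up data, shows the failure mode and the working form.

```python
import pandas as pd

df = pd.DataFrame({
    "new_cases": [0, 0, 5],
    "region": ["Europe", "Africa", "Europe"],
})

# Python's `and` needs a single True/False, which a Series cannot provide.
try:
    df[(df["new_cases"] == 0) and (df["region"] == "Europe")]
    failed = False
except ValueError:
    failed = True
print(failed)  # True

# `&` combines the masks row by row; the parentheses are required
# because `&` binds more tightly than `==`.
both = df[(df["new_cases"] == 0) & (df["region"] == "Europe")]
print(list(both.index))  # [0]
```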


### 3. Using this [data](https://www.kaggle.com/neuromusic/avocado-prices), what is the highest average price (`AveragePrice`) of avocados in each region?

In [62]:
avo = pd.read_csv("./archive/avocado.csv")

avo[avo['region'] == 'Albany'].max()

Unnamed: 0              52
Date            2018-03-25
AveragePrice          2.13
Total Volume        216738
4046                 34913
4225                195725
4770               5883.16
Total Bags         36806.8
Small Bags         30126.3
Large Bags           27206
XLarge Bags           2900
type               organic
year                  2018
region              Albany
dtype: object
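
One point worth noting about the output above: `DataFrame.max()` takes the maximum of each column independently, so the values shown do not come from a single row. The sketch below uses a made-up frame with the same column names to show the difference between column-wise maxima and the one row with the highest price.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["Albany", "Albany", "Boston"],
    "AveragePrice": [2.13, 1.10, 2.19],
    "Total Volume": [100, 900, 500],
})

albany = df[df["region"] == "Albany"]

# max() is taken column by column: 2.13 and 900 come from different rows.
col_max = albany[["AveragePrice", "Total Volume"]].max()
print(col_max["AveragePrice"], col_max["Total Volume"])  # 2.13 900.0

# To get the single row with the highest price, locate it by label instead.
top_row = albany.loc[albany["AveragePrice"].idxmax()]
print(top_row["Total Volume"])  # 100
```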

In [58]:
avo['AveragePrice'].groupby(by=avo['region']).max()

region
Albany                 2.13
Atlanta                2.75
BaltimoreWashington    2.28
Boise                  2.79
Boston                 2.19
BuffaloRochester       2.57
California             2.58
Charlotte              2.83
Chicago                2.30
CincinnatiDayton       2.20
Columbus               2.22
DallasFtWorth          1.90
Denver                 2.16
Detroit                2.08
GrandRapids            2.73
GreatLakes             1.98
HarrisburgScranton     2.27
HartfordSpringfield    2.68
Houston                1.92
Indianapolis           2.10
Jacksonville           2.99
LasVegas               3.03
LosAngeles             2.44
Louisville             2.29
MiamiFtLauderdale      3.05
Midsouth               2.17
Nashville              2.24
NewOrleansMobile       2.32
NewYork                2.65
Northeast              2.31
NorthernNewEngland     1.96
Orlando                2.87
Philadelphia           2.45
PhoenixTucson          2.62
Pittsburgh             1.83
Plains       