# 2. Wrangling Data in Python with pandas
**Let's work with DataFrames using pandas.**

1. Getting started with pandas
    - Prerequisite: Table
    - Importing pandas

2. Handling 1-D data with pandas - Series
    - Declaring a Series
    - Series vs ndarray
    - Series vs dict
    - Naming a Series
3. Handling 2-D data with pandas - DataFrame
    - Declaring a DataFrame
    - From a CSV file to a DataFrame: `read_csv()`
    - Accessing DataFrame data
    - Organizing a DataFrame: `.groupby()`
    - Combining DataFrames: `.merge()`

[covid data used in class](https://www.kaggle.com/imdevskp/corona-virus-report)

## I. Getting Started with pandas

#### Prerequisite: Table

In [7]:
import pandas as pd
import numpy as np

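## II. Handling 1-D Data with pandas - Series

- **1-D labeled array**: like an `np.array`, but with labels
- An index can be specified, so a Series can also be created from a dictionary
- Data can be selected with `series[condition]`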

In [8]:
# Basic usage
d = {chr(i + 65): i for i in range(5)}

t = pd.Series(d)
s = pd.Series(i ** 2 for i in range(1, 8))

s[1:3]
s[s > s.median()]
s[[3, 1, 4]] # a list of labels works; a tuple does not
s.dtype
np.exp(s)

# Similarity to dict
t['A']
t['F'] = 5
t

'F' in t
'G' in t
t.get('G', -1)

-1
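The cell above only displays its last result, so here is a minimal sketch that rebuilds the same two Series and stores what each kind of access returns. The names `sliced`, `above_median`, `picked`, and `missing` are introduced here only for illustration, and the Series is built from a list rather than a generator.

```python
import pandas as pd

# Same data as in the cell above
t = pd.Series({chr(i + 65): i for i in range(5)})   # labels 'A'..'E'
s = pd.Series([i ** 2 for i in range(1, 8)])        # default labels 0..6

sliced = s[1:3]                   # integer slice: by position, end excluded
above_median = s[s > s.median()]  # boolean mask: keeps entries where True
picked = s[[3, 1, 4]]             # list of labels: returned in the order given

t["F"] = 5                        # assigning to a new label adds it, as in a dict
missing = t.get("G", -1)          # .get() returns the default instead of raising KeyError
```

With the default integer index, labels and positions coincide, which is why the slice and the label list appear to behave alike here; with a custom index they can differ.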

### Using the Series `name` attribute

In [9]:
r = pd.Series(np.random.randn(5), name="random_nums")
r.name = 'arbitrary random numbers'
r.name

'arbitrary random numbers'
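One place where the name becomes visible is when a Series is turned into a DataFrame. The following is a small sketch of that behavior; the name `"noise"` and the variables `frame` and `renamed` are made up for illustration.

```python
import numpy as np
import pandas as pd

r = pd.Series(np.random.randn(5), name="random_nums")

# The Series name becomes the column label of the resulting DataFrame
frame = r.to_frame()

# rename() with a scalar returns a new Series carrying a different name;
# the original Series keeps its own name
renamed = r.rename("noise")
```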

## III. Handling 2-D Data with pandas - DataFrame

- **2-D labeled table**
- An index can be specified here as well
- Because the data is two-dimensional, building it from a `dict` is more convenient

#### Inspecting part of the data
- df.head(N), df.tail(N)

#### Accessing data
1. **Column access**
    - df['column_name'], df.column_name
    - each column of df is a Series
    - df[condition] keeps the rows where the boolean condition is True
2. **Row access**
    - Access by index label: df.loc[row, col] or df.loc[row][col]
    - Access by position: df.iloc[row_idx, col_idx] or df.iloc[row_idx][col_idx]
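The difference between label-based and position-based access can be sketched on a tiny table. The DataFrame `scores`, its labels, and its values are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical toy table; the labels and values are made up
scores = pd.DataFrame(
    {"math": [90, 75, 60], "art": [70, 85, 95]},
    index=["kim", "lee", "park"],
)

by_label = scores.loc["lee", "art"]     # row label, column label
by_position = scores.iloc[1, 1]         # row position, column position
label_slice = scores.loc["kim":"lee"]   # label slices include the end label
position_slice = scores.iloc[0:1]       # position slices exclude the end
```

Both the chained form (`df.loc[row][col]`) and the single-call form read the same value, but the single call is generally the safer choice when assigning, since chained indexing may operate on a temporary copy.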
   
#### Groupby
- Split: divide the df by a key
- Apply: reduce each piece with a statistical function
- Combine: build a new Series from the reduced results (group_key: applied_value)
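The three steps above can be sketched by hand and compared with what `groupby` does in one expression. The toy DataFrame and the names `pieces`, `applied`, `manual`, and `built_in` are assumptions made for this illustration, not part of the covid data.

```python
import pandas as pd

# Hypothetical toy data, not taken from the covid dataset
df = pd.DataFrame({
    "region": ["A", "B", "A", "B", "A"],
    "cases": [10, 20, 30, 40, 50],
})

# Split: one sub-Series of cases per region
pieces = {key: group["cases"] for key, group in df.groupby("region")}

# Apply: reduce each piece to a single number
applied = {key: piece.sum() for key, piece in pieces.items()}

# Combine: collect the results into a Series indexed by group key
manual = pd.Series(applied, name="cases")

# groupby performs the same three steps in one expression
built_in = df.groupby("region")["cases"].sum()
```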

In [10]:
# Basic usage
d = {"height": range(181, 185), "weight": range(30, 70, 10)}

df = pd.DataFrame(d)
df.dtypes

# Read a CSV file from the current directory
covid = pd.read_csv("country_wise_latest.csv") # the leading ./ can be omitted
covid.head()

| | Country/Region | Confirmed | Deaths | Recovered | Active | New cases | New deaths | New recovered | Deaths / 100 Cases | Recovered / 100 Cases | Deaths / 100 Recovered | Confirmed last week | 1 week change | 1 week % increase | WHO Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 36263 | 1269 | 25198 | 9796 | 106 | 10 | 18 | 3.5 | 69.49 | 5.04 | 35526 | 737 | 2.07 | Eastern Mediterranean |
| 1 | Albania | 4880 | 144 | 2745 | 1991 | 117 | 6 | 63 | 2.95 | 56.25 | 5.25 | 4171 | 709 | 17.0 | Europe |
| 2 | Algeria | 27973 | 1163 | 18837 | 7973 | 616 | 8 | 749 | 4.16 | 67.34 | 6.17 | 23691 | 4282 | 18.07 | Africa |
| 3 | Andorra | 907 | 52 | 803 | 52 | 10 | 0 | 0 | 5.73 | 88.53 | 6.48 | 884 | 23 | 2.6 | Europe |
| 4 | Angola | 950 | 41 | 242 | 667 | 18 | 1 | 0 | 4.32 | 25.47 | 16.94 | 749 | 201 | 26.84 | Africa |


In [13]:
# Column Access
covid['Active']
covid.Active
covid['WHO Region']
type(covid['WHO Region']) # pandas.core.series.Series

# Access by condition
condition = covid['New cases'] > 100
covid[condition]

covid['WHO Region'].unique()
# c = covid['WHO Region'].str.contains('Asia')
is_southeast_asia = covid['WHO Region'] == 'South-East Asia'
type(is_southeast_asia) # a pandas.core.series.Series of True/False values, with name 'WHO Region'
df_southeast_asia = covid[is_southeast_asia]
# df_southeast_asia

In [37]:
# Row Access

books_dict = {"Available": [True, True, False], "Location": [102, 215, 323], 
              "Genre": ["Programming", "Physics", "Mathematics"]}
# Setting custom Index 
books_df = pd.DataFrame(books_dict, index=['버그란 무엇인가', '두근두근 물리학', '미분해줘 홈즈'])

# Access by Index
books_df.loc['버그란 무엇인가']
books_df.loc['미분해줘 홈즈', 'Available'] # books_df.loc['미분해줘 홈즈']['Available']

books_df.iloc[0, 1]
books_df.iloc[1, 0:2]

Available    True
Location      215
Name: 두근두근 물리학, dtype: object

In [23]:
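# Groupby
# Confirmed cases by WHO Region

# At this point only the Split step has been done
covid_confirmed_by_region = covid['Confirmed'].groupby(by=covid["WHO Region"])

# Applying an aggregate also performs the Combine step automatically
covid_confirmed_by_region.sum()

# Mean number of confirmed cases per country within each region
covid_confirmed_by_region.mean()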

WHO Region
Africa                    15066.812500
Americas                 252551.028571
Eastern Mediterranean     67761.090909
Europe                    58920.053571
South-East Asia          183529.700000
Western Pacific           18276.750000
Name: Confirmed, dtype: float64

## Mission:
### 1. In the covid data, which country has the highest death rate per 100 cases (`Deaths / 100 Cases`)?

In [48]:
max_deaths_by_100_cases = covid['Deaths / 100 Cases'].max()
is_max = covid['Deaths / 100 Cases'] == max_deaths_by_100_cases
covid[is_max]['Country/Region'].values[0]

'Yemen'
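The same kind of lookup can also be sketched with `idxmax()`, which returns the index label of the first maximum value. The frame `toy` below is a hypothetical stand-in that only borrows two column names from the covid table; its countries and numbers are made up.

```python
import pandas as pd

# Hypothetical stand-in for the covid table; countries and numbers are made up
toy = pd.DataFrame({
    "Country/Region": ["Aland", "Borduria", "Syldavia"],
    "Deaths / 100 Cases": [1.5, 7.2, 3.3],
})

# idxmax() gives the index label of the row holding the largest value
row_label = toy["Deaths / 100 Cases"].idxmax()

# Use that label to read the country name from the same row
worst = toy.loc[row_label, "Country/Region"]
```

If several rows tie for the maximum, `idxmax()` picks the first one, which matches taking `.values[0]` from the masked result.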

### 2. In the covid data, which countries have no new cases and a WHO Region of 'Europe'? Print them all.  
Hint: applying two conditions at once on a single line may trigger a warning.

In [61]:
zero_new_cases_in_Europe = (covid['New cases'] == 0) & (covid['WHO Region'] == 'Europe') 
                            # using `and` instead of `&` raises an error
covid[zero_new_cases_in_Europe]

| | Country/Region | Confirmed | Deaths | Recovered | Active | New cases | New deaths | New recovered | Deaths / 100 Cases | Recovered / 100 Cases | Deaths / 100 Recovered | Confirmed last week | 1 week change | 1 week % increase | WHO Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 56 | Estonia | 2034 | 69 | 1923 | 42 | 0 | 0 | 1 | 3.39 | 94.54 | 3.59 | 2021 | 13 | 0.64 | Europe |
| 75 | Holy See | 12 | 0 | 12 | 0 | 0 | 0 | 0 | 0.0 | 100.0 | 0.0 | 12 | 0 | 0.0 | Europe |
| 95 | Latvia | 1219 | 31 | 1045 | 143 | 0 | 0 | 0 | 2.54 | 85.73 | 2.97 | 1192 | 27 | 2.27 | Europe |
| 100 | Liechtenstein | 86 | 1 | 81 | 4 | 0 | 0 | 0 | 1.16 | 94.19 | 1.23 | 86 | 0 | 0.0 | Europe |
| 113 | Monaco | 116 | 4 | 104 | 8 | 0 | 0 | 0 | 3.45 | 89.66 | 3.85 | 109 | 7 | 6.42 | Europe |
| 143 | San Marino | 699 | 42 | 657 | 0 | 0 | 0 | 0 | 6.01 | 93.99 | 6.39 | 699 | 0 | 0.0 | Europe |
| 157 | Spain | 272421 | 28432 | 150376 | 93613 | 0 | 0 | 0 | 10.44 | 55.2 | 18.91 | 264836 | 7585 | 2.86 | Europe |
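Why `&` works where `and` fails can be sketched on a few made-up rows. The frame `toy` and the names `no_new`, `in_europe`, `both`, and `and_failed` are assumptions for this illustration only.

```python
import pandas as pd

# Hypothetical stand-in rows; the values are made up
toy = pd.DataFrame({
    "New cases": [0, 12, 0, 0],
    "WHO Region": ["Europe", "Europe", "Africa", "Europe"],
})

no_new = toy["New cases"] == 0
in_europe = toy["WHO Region"] == "Europe"

# & combines the two boolean Series element by element
both = toy[no_new & in_europe]

# `and` asks for a single truth value of a whole Series, which pandas refuses
try:
    no_new and in_europe
    and_failed = False
except ValueError:
    and_failed = True
```

When the comparisons are written inline, each one needs its own parentheses, because `&` binds more tightly than `==` in Python.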


### 3. Using the following [data](https://www.kaggle.com/neuromusic/avocado-prices), print the highest average avocado price (AveragePrice) for each region.

In [77]:
avocado_url = 'https://storage.googleapis.com/kaggle-data-sets/30292/38613/compressed/avocado.csv.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20210506%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20210506T130543Z&X-Goog-Expires=259199&X-Goog-SignedHeaders=host&X-Goog-Signature=7d2f1233aeca384b76638c3fd6e78a8d78af2f051e98dfac9f8dff5c7ae6613af0e6a01b030494e23ccf1c90ed64bcd1ac99b5d79291aba759ae057be250e4984b1d82a25b51998b098c70a2337565fa5e527ab09a71c521d9c1fb66815299e970a4d0bd1ccefba881abd3decd0a4376b8b35bf20233e5e733e0d3ac6abd9ef366e81512a1702f9e872140948ec84339a255bd74aca84df659c6f427acc4eb634244d54534d8262cd5ead77d88c8b43db0f8158dbd9a24b9d8d178c9415b53ad9eb6559e04356f590fb4f0ff0fad76ad9321f50c468f5520e117a18b6f83bf49a4867cf954fe77956d783df7274d8832f79e40d3f422449d3f107e1c5828cf55'
avocado_df = pd.read_csv(avocado_url, compression='zip')
avocado_df.head(20)

average_price_by_region = avocado_df['AveragePrice'].groupby(by=avocado_df['region'])
average_price_by_region.max()

region
Albany                 2.13
Atlanta                2.75
BaltimoreWashington    2.28
Boise                  2.79
Boston                 2.19
BuffaloRochester       2.57
California             2.58
Charlotte              2.83
Chicago                2.30
CincinnatiDayton       2.20
Columbus               2.22
DallasFtWorth          1.90
Denver                 2.16
Detroit                2.08
GrandRapids            2.73
GreatLakes             1.98
HarrisburgScranton     2.27
HartfordSpringfield    2.68
Houston                1.92
Indianapolis           2.10
Jacksonville           2.99
LasVegas               3.03
LosAngeles             2.44
Louisville             2.29
MiamiFtLauderdale      3.05
Midsouth               2.17
Nashville              2.24
NewOrleansMobile       2.32
NewYork                2.65
Northeast              2.31
NorthernNewEngland     1.96
Orlando                2.87
Philadelphia           2.45
PhoenixTucson          2.62
Pittsburgh             1.83
Plains       