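# 7.6 Changing the unit of analysis with groupby
* Sometimes data has to be aggregated in order to change a DataFrame's unit of analysis.  
> e.g. 1: monthly spending per household > annual spending per household  
> e.g. 2: grades per course > grade point average (GPA)  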

* To do this, (1) first select only the non-duplicated rows if needed, then (2) use groupby to compute across the rows within each group, which adjusts the unit of analysis.
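
The two steps above can be sketched on a tiny made-up table before touching the real files. The `spending` DataFrame and its column names are hypothetical and serve only to illustrate the household example; this is a minimal sketch, not part of the original notebook.

```python
import pandas as pd

# Hypothetical monthly spending data (not from the book's files).
# Household 'A' has an exact duplicate of its January row.
spending = pd.DataFrame({
    'household': ['A', 'A', 'A', 'B', 'B'],
    'month': ['2020-01', '2020-01', '2020-02', '2020-01', '2020-02'],
    'amount': [100, 100, 150, 200, 250],
})

# (1) Keep only the non-duplicated rows.
deduped = spending.drop_duplicates()

# (2) Aggregate across rows within each household:
#     the unit of analysis changes from household-month to household.
annual = deduped.groupby('household', as_index=False)['amount'].sum()
print(annual)
```

Skipping step (1) here would count household A's January spending twice, which is why deduplication comes before the aggregation.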

### 1. Import libraries and load data

In [2]:
import pandas as pd
import numpy as np

# covid19 data
coviddaily = pd.read_csv('data/coviddaily720.csv', parse_dates=['casedate'])
# land temperature data
ltbrazil = pd.read_csv('data/ltbrazil.csv')

### 2. [COVID-19 data] Daily data per country > daily data for all countries combined

* Intermediate step🔍

In [12]:
# Current data = daily data per country
coviddaily.head(5)

Unnamed: 0,iso_code,casedate,location,continent,new_cases,new_deaths,population,pop_density,median_age,gdp_per_capita,hosp_beds,region
0,AFG,2019-12-31,Afghanistan,Asia,0.0,0.0,38928341.0,54.422,18.6,1803.987,0.5,South Asia
1,AFG,2020-01-01,Afghanistan,Asia,0.0,0.0,38928341.0,54.422,18.6,1803.987,0.5,South Asia
2,AFG,2020-01-02,Afghanistan,Asia,0.0,0.0,38928341.0,54.422,18.6,1803.987,0.5,South Asia
3,AFG,2020-01-03,Afghanistan,Asia,0.0,0.0,38928341.0,54.422,18.6,1803.987,0.5,South Asia
4,AFG,2020-01-04,Afghanistan,Asia,0.0,0.0,38928341.0,54.422,18.6,1803.987,0.5,South Asia


In [7]:
# Date range of the current data = 2019-12-31 to 2020-07-12
coviddaily.casedate.describe()

count                   29213
unique                    195
top       2020-05-23 00:00:00
freq                      209
first     2019-12-31 00:00:00
last      2020-07-12 00:00:00
Name: casedate, dtype: object

In [13]:
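# Restrict the casedate column to a date range with between (the bounds are parsed as dates, so the string format, e.g. '2020/02/01' vs. '2020-07-12', does not seem to matter as long as the date itself is valid)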
coviddaily.loc[coviddaily.casedate.between('2020/02/01', '2020-07-12')].groupby(['casedate'], as_index=False)[['new_cases', 'new_deaths']].sum()

Unnamed: 0,casedate,new_cases,new_deaths
0,2020-02-01,2120.0,46.0
1,2020-02-02,2608.0,46.0
2,2020-02-03,2818.0,57.0
3,2020-02-04,3243.0,65.0
4,2020-02-05,3897.0,66.0
...,...,...,...
158,2020-07-08,207024.0,6091.0
159,2020-07-09,215473.0,5375.0
160,2020-07-10,228608.0,5441.0
161,2020-07-11,229759.0,5276.0
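
The note about date formats in the `between` call can be checked on a small made-up frame. The `df` below is hypothetical (it is not the `coviddaily` file); the sketch assumes `casedate` is a datetime column, as it is after `parse_dates=['casedate']`.

```python
import pandas as pd

# Hypothetical daily data with two rows on 2020-02-01.
df = pd.DataFrame({
    'casedate': pd.to_datetime(
        ['2020-01-31', '2020-02-01', '2020-02-01', '2020-02-02']),
    'new_cases': [5.0, 1.0, 2.0, 4.0],
})

# The same bounds written in two different string formats.
mask_slash = df.casedate.between('2020/02/01', '2020/02/02')
mask_dash = df.casedate.between('2020-02-01', '2020-02-02')

# Both masks select the same rows; by default both bounds are inclusive.
same_rows = mask_slash.equals(mask_dash)

# Sum within each date: one row per casedate remains.
totals = df.loc[mask_dash].groupby('casedate', as_index=False)['new_cases'].sum()
print(same_rows)
print(totals)
```

The row for 2020-01-31 falls outside the range and is dropped before grouping, so it does not contribute to any total.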


* **Main code**🐱‍💻

In [14]:
coviddailytotals = (
    coviddaily.loc[coviddaily.casedate.between('2020/02/01', '2020-07-12')]
    .groupby(['casedate'], as_index=False)[['new_cases', 'new_deaths']]
    .sum()
)

In [15]:
coviddailytotals.head(10)

Unnamed: 0,casedate,new_cases,new_deaths
0,2020-02-01,2120.0,46.0
1,2020-02-02,2608.0,46.0
2,2020-02-03,2818.0,57.0
3,2020-02-04,3243.0,65.0
4,2020-02-05,3897.0,66.0
5,2020-02-06,3741.0,72.0
6,2020-02-07,3177.0,73.0
7,2020-02-08,3439.0,86.0
8,2020-02-09,2619.0,89.0
9,2020-02-10,2982.0,97.0


### 3. [Land temperature data] Monthly temperatures per weather station > average temperature per station

* Intermediate step

In [20]:
ltbrazil.loc[ltbrazil.station=="ALTAMIRA"]

Unnamed: 0,locationid,year,month,temperature,latitude,longitude,elevation,station,countryid,country,latabs
648,BR000352000,2019,8,28.55,-3.2,-52.2,112.0,ALTAMIRA,BR,Brazil,3.2
740,BR000352000,2019,9,28.85,-3.2,-52.2,112.0,ALTAMIRA,BR,Brazil,3.2
832,BR000352000,2019,10,28.65,-3.2,-52.2,112.0,ALTAMIRA,BR,Brazil,3.2
924,BR000352000,2019,11,28.0,-3.2,-52.2,112.0,ALTAMIRA,BR,Brazil,3.2
1016,BR000352000,2019,12,27.5,-3.2,-52.2,112.0,ALTAMIRA,BR,Brazil,3.2


* **Main code**🐱‍💻

In [21]:
# Drop rows with missing temperature values
ltbrazil = ltbrazil.dropna(subset=['temperature'])

# Monthly data > annual average per station (latabs and elevation are taken from each station's first row)
ltbrazilavgs = ltbrazil.groupby(['station'], as_index=False).agg({'latabs':'first', 'elevation':'first', 'temperature':'mean'})

# Display the data
ltbrazilavgs.head(10)

Unnamed: 0,station,latabs,elevation,temperature
0,ALTAMIRA,3.2,112.0,28.31
1,ALTA_FLORESTA_AERO,9.867,289.0,29.374167
2,ARAXA,19.567,1004.0,21.6125
3,BACABAL,4.21,25.1,29.75
4,BAGE,31.333,242.0,19.295833
5,BARBALHA,7.317,409.0,27.2
6,BARCELOS,0.981,34.1,28.270833
7,BARRA_DO_CORDA,5.5,153.0,28.766667
8,BARREIRAS,12.15,439.0,26.795833
9,BARTOLOMEU_LISANDRO,21.698,17.4,25.843333
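
Taking `'first'` for `latabs` and `elevation` is only a faithful summary if those columns are constant within each station. A minimal sketch of how that assumption could be checked, using a hypothetical `temps` frame rather than `ltbrazil`, and pandas named aggregation as an equivalent way to write the same `agg` call:

```python
import pandas as pd

# Hypothetical station data (not the ltbrazil file).
temps = pd.DataFrame({
    'station': ['X', 'X', 'Y', 'Y'],
    'latabs': [3.2, 3.2, 9.9, 9.9],
    'elevation': [112.0, 112.0, 289.0, 289.0],
    'temperature': [28.0, 29.0, 21.0, 23.0],
})

# Check the assumption: each station has exactly one distinct
# latabs value and one distinct elevation value.
per_station = temps.groupby('station')[['latabs', 'elevation']].nunique()
constant = bool((per_station == 1).all().all())

# Same aggregation as the dict form, written with named aggregation.
avgs = temps.groupby('station', as_index=False).agg(
    latabs=('latabs', 'first'),
    elevation=('elevation', 'first'),
    temperature=('temperature', 'mean'),
)
print(constant)
print(avgs)
```

If `constant` came out `False` on real data, `'first'` would silently pick one of several values, and a different summary (or a cleanup step) would be needed.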
