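## Working with dates
- Workflow: parse date values -> check the date range -> impute missing dates -> compute time intervals
- Once pandas has parsed the date values into datetime values, you are halfway there!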
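
The four steps in the workflow above can be sketched on a toy Series before applying them to real data. The values here are hypothetical, and filling the gap with the earliest date is only an illustrative choice, not a recommendation.

```python
import pandas as pd

# Hypothetical toy data: one date string is missing.
raw = pd.Series(['2020-01-05', '2020-01-12', None, '2020-02-02'])

# 1. Parse: strings become datetime64 values, the missing entry becomes NaT.
dates = pd.to_datetime(raw, format='%Y-%m-%d')

# 2. Check the range of the parsed dates.
earliest, latest = dates.min(), dates.max()

# 3. Impute the missing date (illustrative choice: the earliest date).
filled = dates.fillna(earliest)

# 4. Compute the interval from the earliest date, in whole days.
days_elapsed = (filled - earliest).dt.days
print(days_elapsed.tolist())  # [0, 7, 0, 28]
```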

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime

covidcases=pd.read_csv('C:/data-cleansing-main/Chapter06/data/covidcases720.csv')
nls97=pd.read_csv('C:/data-cleansing-main/Chapter06/data/nls97c.csv')
nls97.set_index('personid',inplace=True)

In [3]:
nls97[['birthmonth','birthyear']].isnull().sum()        # birthmonth has one missing value

birthmonth    1
birthyear     0
dtype: int64

In [4]:
nls97.birthmonth.value_counts().sort_index()

1.0     815
2.0     693
3.0     760
4.0     659
5.0     689
6.0     720
7.0     762
8.0     782
9.0     839
10.0    765
11.0    763
12.0    736
Name: birthmonth, dtype: int64

In [5]:
nls97.birthyear.value_counts().sort_index()

1980    1691
1981    1874
1982    1841
1983    1807
1984    1771
Name: birthyear, dtype: int64

In [7]:
nls97['birthmonth']=nls97.birthmonth.fillna(int(nls97.birthmonth.mean()))      # Fill with the mean of birthmonth truncated to an integer: int() drops the fraction of the mean (about 6.56), giving 6 (June)
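
Note that `int()` truncates rather than rounds, so the fill value depends on which one is used. A minimal sketch with hypothetical month values:

```python
import numpy as np
import pandas as pd

# Hypothetical months with one missing value; the mean is about 6.67.
months = pd.Series([6.0, 7.0, 7.0, np.nan])
mean_month = months.mean()

truncated = int(mean_month)   # drops the fraction
rounded = round(mean_month)   # rounds to the nearest integer

# Assign the result back instead of relying on inplace=True.
filled = months.fillna(truncated)
print(truncated, rounded)  # 6 7
```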

### Using pandas to_datetime
- `to_datetime` accepts a dictionary; it needs `year`, `month`, and `day` keys

In [13]:
nls97['birthdate']=pd.to_datetime(dict(year=nls97.birthyear, month=nls97.birthmonth, day=15))       # pass a dictionary of date components
nls97[['birthmonth','birthyear','birthdate']].isna().sum()

birthmonth    0
birthyear     0
birthdate     0
dtype: int64
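
The key requirement can be checked on a small example. The sketch below uses a hypothetical two-row frame; it shows that a scalar `day` is broadcast to every row and that omitting a required key raises a `ValueError`.

```python
import pandas as pd

# Hypothetical date components.
parts = pd.DataFrame({'year': [1980, 1983], 'month': [5, 9]})

# The scalar day=15 is applied to every row.
birthdate = pd.to_datetime(dict(year=parts.year, month=parts.month, day=15))

# Without the 'day' key, the components cannot be assembled.
try:
    pd.to_datetime(dict(year=parts.year, month=parts.month))
    missing_key_error = None
except ValueError as exc:
    missing_key_error = str(exc)

print(birthdate.dt.strftime('%Y-%m-%d').tolist())  # ['1980-05-15', '1983-09-15']
```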

In [17]:
nls97['birthdate']

personid
100061   1980-05-15
100139   1983-09-15
100284   1984-11-15
100292   1982-04-15
100583   1980-06-15
            ...    
999291   1981-04-15
999406   1982-07-15
999543   1984-08-15
999698   1983-05-15
999963   1982-09-15
Name: birthdate, Length: 8984, dtype: datetime64[ns]

### Calculating age from a datetime column

In [22]:
# Define a function that takes a start date and an end date and returns the age in whole years
def calcage(startdate,enddate):
    age=enddate.year-startdate.year
    if (enddate.month<startdate.month or (enddate.month==startdate.month and enddate.day<startdate.day)):
        age=age-1
    return age

rundate=pd.to_datetime('2020-07-20')
nls97['age']=nls97.apply(lambda x:calcage(x.birthdate,rundate),axis=1)
nls97.loc[100061:100583,['age','birthdate']]

          age  birthdate
personid                
100061     40 1980-05-15
100139     36 1983-09-15
100284     35 1984-11-15
100292     38 1982-04-15
100583     40 1980-06-15
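
Row-wise `apply` can be slow on large frames. As an alternative sketch (not the approach used above), the same logic can be written with the `.dt` accessor so it runs on whole columns. The three birth dates are copied from the output above; the variable names are illustrative.

```python
import pandas as pd

birthdate = pd.Series(pd.to_datetime(['1980-05-15', '1983-09-15', '1984-11-15']))
rundate = pd.Timestamp('2020-07-20')

# True where the birthday has not yet occurred in the rundate year.
before_birthday = (
    (rundate.month < birthdate.dt.month)
    | ((rundate.month == birthdate.dt.month) & (rundate.day < birthdate.dt.day))
)

# Subtract one year for people whose birthday is still ahead.
age = rundate.year - birthdate.dt.year - before_birthday.astype(int)
print(age.tolist())  # [40, 36, 35]
```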


### Converting a string column to a datetime type

In [27]:
covidcases.dtypes       # the casedate column has dtype object

iso_code                            object
continent                           object
location                            object
casedate                            object
total_cases                        float64
new_cases                          float64
total_deaths                       float64
new_deaths                         float64
total_cases_per_million            float64
new_cases_per_million              float64
total_deaths_per_million           float64
new_deaths_per_million             float64
total_tests                        float64
new_tests                          float64
total_tests_per_thousand           float64
new_tests_per_thousand             float64
new_tests_smoothed                 float64
new_tests_smoothed_per_thousand    float64
tests_units                         object
stringency_index                   float64
population                         float64
population_density                 float64
median_age                         float64
aged_65_old

In [29]:
covidcases['casedate']=pd.to_datetime(covidcases.casedate,format='%Y-%m-%d')
covidcases.dtypes

iso_code                                   object
continent                                  object
location                                   object
casedate                           datetime64[ns]
total_cases                               float64
new_cases                                 float64
total_deaths                              float64
new_deaths                                float64
total_cases_per_million                   float64
new_cases_per_million                     float64
total_deaths_per_million                  float64
new_deaths_per_million                    float64
total_tests                               float64
new_tests                                 float64
total_tests_per_thousand                  float64
new_tests_per_thousand                    float64
new_tests_smoothed                        float64
new_tests_smoothed_per_thousand           float64
tests_units                                object
stringency_index                          float64
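
When a string column may contain malformed dates, `errors='coerce'` turns them into `NaT` instead of raising an error, which makes them easy to count afterwards. A minimal sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical raw values: one valid date, one impossible month, one non-date string.
raw = pd.Series(['2020-07-12', '2020-13-01', 'unknown'])

# Values that do not match the format become NaT.
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

n_bad = int(parsed.isna().sum())
print(n_bad)  # 2
```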


In [30]:
covidcases.casedate.describe()

count                   29529
unique                    195
top       2020-05-17 00:00:00
freq                      209
first     2019-12-31 00:00:00
last      2020-07-12 00:00:00
Name: casedate, dtype: object
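
The rows that `describe()` reports for a datetime column vary between pandas versions (newer releases may show `min`/`max` and quantiles instead of `first`/`last`/`top`). The range check can also be done explicitly, as in this sketch on a hypothetical four-row Series with assumed range bounds:

```python
import pandas as pd

# Hypothetical dates standing in for covidcases.casedate.
casedate = pd.Series(pd.to_datetime(['2019-12-31', '2020-05-17', '2020-05-17', '2020-07-12']))

first, last = casedate.min(), casedate.max()   # earliest and latest dates
n_unique = casedate.nunique()                  # number of distinct dates
top = casedate.mode().iloc[0]                  # most frequent date

# Assumed plausible bounds for this data; adjust to your own expectations.
in_range = casedate.between(pd.Timestamp('2019-12-01'), pd.Timestamp('2020-12-31')).all()
```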

### Creating a timedelta column to capture date intervals

In [31]:
firstcase = covidcases.loc[covidcases.new_cases>0,['location','casedate']].sort_values(['location','casedate']).\
  drop_duplicates(['location'], keep='first').rename(columns={'casedate':'firstcasedate'})
covidcases = pd.merge(covidcases, firstcase, left_on=['location'], right_on=['location'], how="left")
covidcases['dayssincefirstcase'] = covidcases.casedate - covidcases.firstcasedate
covidcases.dayssincefirstcase.describe()

count                         29529
mean     56 days 00:15:12.892410850
std      47 days 00:35:41.813685246
min              -62 days +00:00:00
25%                21 days 00:00:00
50%                57 days 00:00:00
75%                92 days 00:00:00
max               194 days 00:00:00
Name: dayssincefirstcase, dtype: object

> The minimum of -62 days shows that at least one country began reporting 62 days before its first confirmed case.
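
To work with the interval as a number, and to see which locations have negative values, the timedelta can be converted with `.dt.days`. The sketch below uses a hypothetical five-row frame, and `groupby` with `map` instead of the merge above; the column names follow the ones used in this section.

```python
import pandas as pd

# Hypothetical data: location A reports two zero-case days before its first case.
df = pd.DataFrame({
    'location': ['A', 'A', 'A', 'B', 'B'],
    'casedate': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03',
                                '2020-03-01', '2020-03-02']),
    'new_cases': [0, 0, 4, 2, 5],
})

# First date with new_cases > 0 for each location.
firstcasedate = df.loc[df.new_cases > 0].groupby('location').casedate.min()
df['firstcasedate'] = df.location.map(firstcasedate)

# Convert the timedelta to an integer number of days.
df['dayssincefirstcase'] = (df.casedate - df.firstcasedate).dt.days

# Locations that reported before their first case.
early = df.loc[df.dayssincefirstcase < 0, 'location'].unique().tolist()
print(early)  # ['A']
```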