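### Data Visualization
- Present the results of data analysis visually so that they are easy to understand
- Essential for making results readable in every case: exploratory data analysis, data processing, and prediction.
- Reference site: https://app.flourish.studio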

#### 1. Inspecting the data

In [1]:
import pandas as pd
file_path = 'ex_datafile/COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports/'

csv = pd.read_csv(file_path+'04-01-2020.csv',encoding='utf-8')
csv.head()

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key
0,45001.0,Abbeville,South Carolina,US,2020-04-01 21:58:49,34.223334,-82.461707,4,0,0,4,"Abbeville, South Carolina, US"
1,22001.0,Acadia,Louisiana,US,2020-04-01 21:58:49,30.295065,-92.414197,47,1,0,46,"Acadia, Louisiana, US"
2,51001.0,Accomack,Virginia,US,2020-04-01 21:58:49,37.767072,-75.632346,7,0,0,7,"Accomack, Virginia, US"
3,16001.0,Ada,Idaho,US,2020-04-01 21:58:49,43.452658,-116.241552,195,3,0,192,"Ada, Idaho, US"
4,19001.0,Adair,Iowa,US,2020-04-01 21:58:49,41.330756,-94.471059,1,0,0,1,"Adair, Iowa, US"


In [2]:
import pandas as pd
file_path = 'ex_datafile/COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports/'

csv = pd.read_csv(file_path+'03-01-2020.csv',encoding='utf-8')
csv.head()

Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered,Latitude,Longitude
0,Hubei,Mainland China,2020-03-01T10:13:19,66907,2761,31536,30.9756,112.2707
1,,South Korea,2020-03-01T23:43:03,3736,17,30,36.0,128.0
2,,Italy,2020-03-01T23:23:02,1694,34,83,43.0,12.0
3,Guangdong,Mainland China,2020-03-01T14:13:18,1349,7,1016,23.3417,113.4244
4,Henan,Mainland China,2020-03-01T14:13:18,1272,22,1198,33.882,113.614


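Comparing the two daily reports above shows that the column names changed between them (for example, `Province/State` became `Province_State`).<br>The data therefore needs to be normalized before use.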
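
One possible way to absorb the renamed columns is to rename any old-style names up front. The following is a minimal sketch on two tiny in-memory frames; `COLUMN_ALIASES` and `normalize_columns` are hypothetical names introduced here for illustration, not part of the dataset or of pandas.

```python
import pandas as pd

# Hypothetical mapping from the old column names to the new ones.
COLUMN_ALIASES = {
    'Province/State': 'Province_State',
    'Country/Region': 'Country_Region',
}

def normalize_columns(df):
    """Return a copy of df with old-style column names renamed."""
    # rename() ignores keys that are absent, so new-style frames pass through unchanged.
    return df.rename(columns=COLUMN_ALIASES)

old_style = pd.DataFrame({
    'Province/State': ['Hubei'],
    'Country/Region': ['Mainland China'],
    'Confirmed': [66907],
})
new_style = pd.DataFrame({
    'Province_State': ['Idaho'],
    'Country_Region': ['US'],
    'Confirmed': [195],
})

print(normalize_columns(old_style).columns.tolist())
print(normalize_columns(new_style).columns.tolist())
```

Both calls should give the same three column names, so later steps would not need to know which file layout they came from.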


#### 2. Data processing

In [3]:
csv = pd.read_csv(file_path+'01-22-2020.csv',encoding='utf-8')
try : 
    csv = csv[['Province_State', 'Country_Region', 'Confirmed']] # select the columns by their new-style names

except KeyError:
    csv = csv[['Province/State', 'Country/Region', 'Confirmed']] # fall back to the old-style column names
    csv.columns = ['Province_State', 'Country_Region', 'Confirmed'] # rename the selected columns to the new style
    
csv.head()

Unnamed: 0,Province_State,Country_Region,Confirmed
0,Anhui,Mainland China,1.0
1,Beijing,Mainland China,14.0
2,Chongqing,Mainland China,6.0
3,Fujian,Mainland China,1.0
4,Gansu,Mainland China,


In [4]:
csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38 entries, 0 to 37
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Province_State  35 non-null     object 
 1   Country_Region  38 non-null     object 
 2   Confirmed       29 non-null     float64
dtypes: float64(1), object(2)
memory usage: 1.0+ KB


#### 3. Transforming DataFrame data
    - Build a DataFrame from selected columns only
    - Drop rows that have no data (NaN) in a given column
    - Change the data type of a given column, e.g. str -> int

In [5]:
csv = pd.read_csv(file_path+'01-22-2020.csv',encoding='utf-8')
try : 
    csv = csv[['Province_State', 'Country_Region', 'Confirmed']] # select the columns by their new-style names

except KeyError:
    csv = csv[['Province/State', 'Country/Region', 'Confirmed']] # fall back to the old-style column names
    csv.columns = ['Province_State', 'Country_Region', 'Confirmed'] # rename the selected columns to the new style

csv = csv.dropna(subset=['Confirmed'])
csv = csv.astype({'Confirmed' : 'int64'})
csv.head()

Unnamed: 0,Province_State,Country_Region,Confirmed
0,Anhui,Mainland China,1
1,Beijing,Mainland China,14
2,Chongqing,Mainland China,6
3,Fujian,Mainland China,1
5,Guangdong,Mainland China,26


#### 4. Loading country codes

In [6]:
country_info = pd.read_csv('ex_datafile/COVID-19-master/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv', encoding='utf-8')
country_info.head()

Unnamed: 0,UID,iso2,iso3,code3,FIPS,Admin2,Province_State,Country_Region,Lat,Long_,Combined_Key,Population
0,4,AF,AFG,4.0,,,,Afghanistan,33.93911,67.709953,Afghanistan,38928341.0
1,8,AL,ALB,8.0,,,,Albania,41.1533,20.1683,Albania,2877800.0
2,10,AQ,ATA,10.0,,,,Antarctica,-71.9499,23.347,Antarctica,
3,12,DZ,DZA,12.0,,,,Algeria,28.0339,1.6596,Algeria,43851043.0
4,20,AD,AND,20.0,,,,Andorra,42.5063,1.5218,Andorra,77265.0


#### 5. Merging the data

In [7]:
test_df = pd.merge(csv, country_info, how='left', on='Country_Region')
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3483 entries, 0 to 3482
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Province_State_x  3431 non-null   object 
 1   Country_Region    3483 non-null   object 
 2   Confirmed         3483 non-null   int64  
 3   UID               3457 non-null   float64
 4   iso2              3457 non-null   object 
 5   iso3              3457 non-null   object 
 6   code3             3457 non-null   float64
 7   FIPS              3384 non-null   float64
 8   Admin2            3343 non-null   object 
 9   Province_State_y  3454 non-null   object 
 10  Lat               3336 non-null   float64
 11  Long_             3336 non-null   float64
 12  Combined_Key      3457 non-null   object 
 13  Population        3336 non-null   float64
dtypes: float64(6), int64(1), object(7)
memory usage: 381.1+ KB


In [8]:
test_df.isnull().sum()

Province_State_x     52
Country_Region        0
Confirmed             0
UID                  26
iso2                 26
iso3                 26
code3                26
FIPS                 99
Admin2              140
Province_State_y     29
Lat                 147
Long_               147
Combined_Key         26
Population          147
dtype: int64

In [9]:
show = test_df[test_df['iso2'].isnull()]
show.head()

Unnamed: 0,Province_State_x,Country_Region,Confirmed,UID,iso2,iso3,code3,FIPS,Admin2,Province_State_y,Lat,Long_,Combined_Key,Population
0,Anhui,Mainland China,1,,,,,,,,,,,
1,Beijing,Mainland China,14,,,,,,,,,,,
2,Chongqing,Mainland China,6,,,,,,,,,,,
3,Fujian,Mainland China,1,,,,,,,,,,,
4,Guangdong,Mainland China,26,,,,,,,,,,,
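
The rows with a null `iso2` are those whose `Country_Region` had no match in the lookup table. As a hedged alternative to filtering on nulls, `pd.merge` accepts `indicator=True`, which adds a `_merge` column describing where each row came from. The sketch below uses tiny stand-in frames with illustrative values rather than the real files.

```python
import pandas as pd

# Tiny stand-ins for the daily report and the lookup table (illustrative values only).
cases = pd.DataFrame({
    'Country_Region': ['Mainland China', 'Japan'],
    'Confirmed': [14, 2],
})
lookup = pd.DataFrame({
    'Country_Region': ['China', 'Japan'],
    'iso2': ['CN', 'JP'],
})

# indicator=True adds a '_merge' column with 'left_only', 'right_only' or 'both'.
merged = pd.merge(cases, lookup, how='left', on='Country_Region', indicator=True)

# Country names that exist in the report but not in the lookup table.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Country_Region'].unique().tolist()
print(unmatched)
```

The resulting list is a natural starting point for deciding which names need an entry in a conversion mapping.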


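Changing column values
- The same country is often written in several different ways in Country_Region
- Keep the name mapping in a separate JSON file and use it to rename them consistently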

In [10]:
import json

with open('ex_datafile/COVID-19-master/csse_covid_19_data/country_convert.json', 'r', encoding='utf-8') as json_file:
    json_data = json.load(json_file)
    print(json_data)

{'Mainland China': 'China', 'Macau': 'China', 'South Korea': 'Korea, South', 'Aruba': 'Netherlands', ' Azerbaijan': 'Azerbaijan', 'Bahamas, The': 'Bahamas', 'Cape Verde': 'Cabo Verde', 'Cayman Islands': 'United Kingdom', 'Channel Islands': 'United Kingdom', 'Curacao': 'Netherlands', 'Czech Republic': 'Czechia', 'East Timor': 'Timor-Leste', 'Faroe Islands': 'Denmark', 'French Guiana': 'France', 'Gambia, The': 'Gambia', 'Gibraltar': 'United Kingdom', 'Greenland': 'Denmark', 'Guadeloupe': 'France', 'Guam': 'US', 'Guernsey': 'US', 'Hong Kong': 'China', 'Hong Kong SAR': 'China', 'Iran (Islamic Republic of)': 'Iran', 'Ivory Coast': "Cote d'Ivoire", 'Jersey': 'US', 'Macao SAR': 'China', 'Martinique': 'France', 'Mayotte': 'France', 'North Ireland': 'United Kingdom', 'Palestine': 'West Bank and Gaza', 'Puerto Rico': 'US', 'Republic of Ireland': 'Ireland', 'Republic of Korea': 'Korea, South', 'Republic of Moldova': 'Moldova', 'Republic of the Congo': 'Congo (Brazzaville)', 'Reunion': 'France', '

### How to use apply()
- apply() can be used to change the values of a specific column

In [11]:
df = pd.DataFrame({
    '영어' : [60, 70],
    '수학' : [100, 50]
    }, index = ['Dave', 'David']
)
df

Unnamed: 0,영어,수학
Dave,60,100
David,70,50


In [12]:
def func(df_data):
    print(type(df_data)) # check the type
    print(df_data.index) # check the index
    print(df_data.values) # check the values
    return df_data  

In [13]:
df_func = df.apply(func, axis=0) # func receives each column as a Series

<class 'pandas.core.series.Series'>
Index(['Dave', 'David'], dtype='object')
[60 70]
<class 'pandas.core.series.Series'>
Index(['Dave', 'David'], dtype='object')
[100  50]


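↑<br> With `axis=0`, `func` is called once per column, so the two columns produce the two sets of output above. In pandas versions before 1.1, apply() called the function twice on the first column (or row), which would have printed three sets instead of two.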

In [14]:
df = pd.DataFrame({
    '영어' : [60, 70],
    '수학' : [100, 50]
    }, index = ['Dave', 'David']
)
df

Unnamed: 0,영어,수학
Dave,60,100
David,70,50


In [15]:
def func(df_data):
    df_data['영어'] = 80
    return df_data

In [16]:
a = df.apply(func, axis=1)

In [17]:
a

Unnamed: 0,영어,수학
Dave,80,100
David,80,50


### Using apply() to change the country column values

In [18]:
# Data setup
import pandas as pd

csv = pd.read_csv(file_path+'01-22-2020.csv',encoding='utf-8')
try : 
    csv = csv[['Province_State', 'Country_Region', 'Confirmed']] # select the columns by their new-style names

except KeyError:
    csv = csv[['Province/State', 'Country/Region', 'Confirmed']] # fall back to the old-style column names
    csv.columns = ['Province_State', 'Country_Region', 'Confirmed'] # rename the selected columns to the new style

csv = csv.dropna(subset=['Confirmed'])
csv = csv.astype({'Confirmed' : 'int64'})
csv.head()

Unnamed: 0,Province_State,Country_Region,Confirmed
0,Anhui,Mainland China,1
1,Beijing,Mainland China,14
2,Chongqing,Mainland China,6
3,Fujian,Mainland China,1
5,Guangdong,Mainland China,26


- Read the JSON file that holds the country names to convert

In [19]:
import json

with open('ex_datafile/COVID-19-master/csse_covid_19_data/country_convert.json', 'r', encoding='utf-8') as json_file:
    json_data = json.load(json_file)
    print(json_data)

{'Mainland China': 'China', 'Macau': 'China', 'South Korea': 'Korea, South', 'Aruba': 'Netherlands', ' Azerbaijan': 'Azerbaijan', 'Bahamas, The': 'Bahamas', 'Cape Verde': 'Cabo Verde', 'Cayman Islands': 'United Kingdom', 'Channel Islands': 'United Kingdom', 'Curacao': 'Netherlands', 'Czech Republic': 'Czechia', 'East Timor': 'Timor-Leste', 'Faroe Islands': 'Denmark', 'French Guiana': 'France', 'Gambia, The': 'Gambia', 'Gibraltar': 'United Kingdom', 'Greenland': 'Denmark', 'Guadeloupe': 'France', 'Guam': 'US', 'Guernsey': 'US', 'Hong Kong': 'China', 'Hong Kong SAR': 'China', 'Iran (Islamic Republic of)': 'Iran', 'Ivory Coast': "Cote d'Ivoire", 'Jersey': 'US', 'Macao SAR': 'China', 'Martinique': 'France', 'Mayotte': 'France', 'North Ireland': 'United Kingdom', 'Palestine': 'West Bank and Gaza', 'Puerto Rico': 'US', 'Republic of Ireland': 'Ireland', 'Republic of Korea': 'Korea, South', 'Republic of Moldova': 'Moldova', 'Republic of the Congo': 'Congo (Brazzaville)', 'Reunion': 'France', '

- Check the Country_Region value and replace it with the designated country name only when it is written differently

In [20]:
def func(row):
    if row['Country_Region'] in json_data: # the name is a key in the mapping dict
        row['Country_Region'] = json_data[row['Country_Region']]
    return row

In [22]:
csv = csv.apply(func, axis=1)
csv.head()

Unnamed: 0,Province_State,Country_Region,Confirmed
0,Anhui,China,1
1,Beijing,China,14
2,Chongqing,China,6
3,Fujian,China,1
5,Guangdong,China,26
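
Because the mapping is a plain dict, the same renaming can probably also be done without a row-wise apply(), using `Series.replace`. This is a sketch under that assumption; `name_map` is a small hand-copied excerpt of the mapping printed above and `sample` is illustrative data.

```python
import pandas as pd

# A small excerpt of the mapping shown above.
name_map = {'Mainland China': 'China', 'South Korea': 'Korea, South'}

sample = pd.DataFrame({
    'Country_Region': ['Mainland China', 'South Korea', 'Japan'],
    'Confirmed': [14, 1, 2],
})

# replace() swaps only values that appear as keys in the dict; other values are left untouched.
sample['Country_Region'] = sample['Country_Region'].replace(name_map)
print(sample['Country_Region'].tolist())
```

This operates on the whole column at once, which tends to be faster than calling a Python function per row on large files.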


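### Work done so far
- Across the files, the country column is named either Country_Region or Country/Region.
- So when a CSV file is read, a Country/Region column is renamed to Country_Region.
- Some Country_Region values name the same country differently (e.g. Mainland China), so the variants are collected in a JSON file as {key: value} pairs.
- A function takes each row of the CSV data and, if its Country_Region value is found among the JSON keys, replaces it with the mapped value.
- apply() then runs that function over every row.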

In [33]:
data = '01-01-2022.csv'
data= data.replace('-', '/')
data
date = data.split('.')[0].lstrip('0')
date

'1/01/2022'
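
The string manipulation above assumes the file name is always well formed. A slightly stricter sketch, assuming the `MM-DD-YYYY.csv` naming seen in this dataset, parses the name with `datetime.strptime` first; `filename_to_label` is a hypothetical helper name.

```python
from datetime import datetime

def filename_to_label(filename):
    """Turn 'MM-DD-YYYY.csv' into 'M/DD/YYYY' (hypothetical helper)."""
    stem = filename.rsplit('.', 1)[0]
    # strptime raises ValueError if the name does not match the expected pattern.
    parsed = datetime.strptime(stem, '%m-%d-%Y')
    # Month without leading zero, day with two digits, matching the label format used above.
    return f"{parsed.month}/{parsed.day:02d}/{parsed.year}"

print(filename_to_label('01-01-2022.csv'))
print(filename_to_label('12-31-2020.csv'))
```

For valid names it produces the same labels as the `replace`/`lstrip` approach, but a stray file with an unexpected name fails loudly instead of yielding an odd column label.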

In [34]:
csv.columns

Index(['Province_State', 'Country_Region', 'Confirmed'], dtype='object')

In [38]:
csv.columns = ['Province_State', 'Country_Region', date]
csv.columns

Index(['Province_State', 'Country_Region', '1/01/2022'], dtype='object')

In [39]:
csv.head()

Unnamed: 0,Province_State,Country_Region,1/01/2022
0,Anhui,China,1
1,Beijing,China,14
2,Chongqing,China,6
3,Fujian,China,1
5,Guangdong,China,26


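#### 5. Combining duplicate rows
- groupby(): aggregates data by group
    - Rows that share the same column value are grouped, so sums, means, and other statistics can be computed per group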

In [61]:
df = pd.DataFrame({
    'sex' : ['Man', 'Man', 'Man'],
    'name' : ['lee', 'kim', 'kim'],
    'math' : [100, 50, 80],
    'english' : [80, 70, 50]
})
df

Unnamed: 0,sex,name,math,english
0,Man,lee,100,80
1,Man,kim,50,70
2,Man,kim,80,50


In [58]:
df.groupby('name').mean()

TypeError: Could not convert ManMan to numeric

In [50]:
df[['name','math','english']].groupby('name').mean()

Unnamed: 0_level_0,math,english
name,Unnamed: 1_level_1,Unnamed: 2_level_1
kim,65.0,60.0
lee,100.0,80.0
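
The TypeError above comes from trying to average the string column `sex`. Besides selecting the numeric columns by hand, `mean()` on a groupby accepts `numeric_only=True`, which skips non-numeric columns. A minimal sketch with the same example frame:

```python
import pandas as pd

df = pd.DataFrame({
    'sex': ['Man', 'Man', 'Man'],
    'name': ['lee', 'kim', 'kim'],
    'math': [100, 50, 80],
    'english': [80, 70, 50],
})

# numeric_only=True leaves the string column 'sex' out of the aggregation.
means = df.groupby('name').mean(numeric_only=True)
print(means)
```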


In [56]:
df.groupby('name').sum()

Unnamed: 0_level_0,sex,math,english
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
kim,ManMan,130,120
lee,Man,100,80


In [62]:
# Data setup
import pandas as pd

csv = pd.read_csv(file_path+'01-22-2020.csv',encoding='utf-8')
try : 
    csv = csv[['Province_State', 'Country_Region', 'Confirmed']] # select the columns by their new-style names

except KeyError:
    csv = csv[['Province/State', 'Country/Region', 'Confirmed']] # fall back to the old-style column names
    csv.columns = ['Province_State', 'Country_Region', 'Confirmed'] # rename the selected columns to the new style

csv = csv.dropna(subset=['Confirmed'])
csv = csv.astype({'Confirmed' : 'int64'})
csv.head()

Unnamed: 0,Province_State,Country_Region,Confirmed
0,Anhui,Mainland China,1
1,Beijing,Mainland China,14
2,Chongqing,Mainland China,6
3,Fujian,Mainland China,1
5,Guangdong,Mainland China,26


In [66]:
csv[['Country_Region', 'Confirmed']].groupby('Country_Region').sum()

Unnamed: 0_level_0,Confirmed
Country_Region,Unnamed: 1_level_1
Japan,2
Macau,1
Mainland China,547
South Korea,1
Taiwan,1
Thailand,2
US,1


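#### 6. Preprocessing the data
- Wrap the steps so far in a function
- 1. Read the CSV file
- 2. Select the 'Country_Region' and 'Confirmed' columns
- 3. Drop rows where 'Confirmed' is NaN
- 4. Rename the country names in 'Country_Region' consistently across files
- 5. Change the 'Confirmed' data type to int64 (integer)
- 6. Sum 'Confirmed' per country with groupby()
- 7. Convert the file name to a date string and use it as the new name of the 'Confirmed' column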

In [112]:
# Data setup
import pandas as pd
import json

# Load the JSON file
with open('ex_datafile/COVID-19-master/csse_covid_19_data/country_convert.json', 'r', encoding='utf-8') as json_file:
    json_data = json.load(json_file)
    
# Convert a row's country name using the mapping
def country_name_convert(row):
    if row['Country_Region'] in json_data:
        return json_data[row['Country_Region']]
    return row['Country_Region']


# Read a CSV file and preprocess it
def creat_dateframe(filename):
    file_path = 'ex_datafile/COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports/'
    csv = pd.read_csv(file_path + filename,encoding='utf-8')                   # 1.
    try : 
        csv = csv[['Province_State', 'Country_Region', 'Confirmed']]           # 2.

    except KeyError:
        csv = csv[['Province/State', 'Country/Region', 'Confirmed']]           # 2.
        csv.columns = ['Province_State', 'Country_Region', 'Confirmed'] 

    csv = csv.dropna(subset=['Confirmed'])                                     # 3.
    csv['Country_Region'] = csv.apply(country_name_convert, axis=1)            # 4.
    csv = csv.astype({'Confirmed' : 'int64'})                                  # 5.
    csv = csv[['Country_Region', 'Confirmed']].groupby('Country_Region').sum() # 6.
    
    date_column = filename.split('.')[0].lstrip('0').replace('-','/')          # 7.
    csv.columns = [date_column]
    return csv


### test

In [113]:
one = creat_dateframe('01-22-2020.csv')
two = creat_dateframe('04-01-2020.csv')

In [114]:
two.head()

Unnamed: 0_level_0,4/01/2020
Country_Region,Unnamed: 1_level_1
Afghanistan,192
Albania,259
Algeria,847
Andorra,390
Angola,8


#### Merging the DataFrames

In [118]:
df = pd.merge(one, two, how='outer', left_index=True, right_index=True)
df.head()

Unnamed: 0_level_0,1/22/2020,4/01/2020
Country_Region,Unnamed: 1_level_1,Unnamed: 2_level_1
Afghanistan,,192
Albania,,259
Algeria,,847
Andorra,,390
Angola,,8


#### Replacing NaN with 0

In [117]:
df = df.fillna(0)
df.head()

Unnamed: 0_level_0,1/22/2020,4/01/2020
Country_Region,Unnamed: 1_level_1,Unnamed: 2_level_1
Afghanistan,0.0,192
Albania,0.0,259
Algeria,0.0,847
Andorra,0.0,390
Angola,0.0,8
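
Any column that picked up NaN from the outer merge has been converted to float64, which is why values such as `0.0` appear after `fillna(0)`. The sketch below, using small illustrative frames rather than the real reports, shows the dtype change and one way to return to integers.

```python
import pandas as pd

# Illustrative per-country totals for two dates (not the real figures).
one = pd.DataFrame({'1/22/2020': [547, 2]}, index=['China', 'Japan'])
two = pd.DataFrame({'4/01/2020': [192, 2178]}, index=['Afghanistan', 'Japan'])

# An outer merge on the index keeps every country and fills the gaps with NaN.
merged = pd.merge(one, two, how='outer', left_index=True, right_index=True)
print(merged.dtypes)  # columns containing NaN become float64

# After the gaps are filled, the columns can be cast back to integers.
filled = merged.fillna(0).astype('int64')
print(filled)
```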


#### Listing the files
- e.g. 01-11-2022.csv: collect only the files with a .csv extension into a list
- Then sort the list in ascending order with sort()

In [121]:
import os

file_path = 'ex_datafile/COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports/'

file_list = os.listdir(file_path)
csv_list = list()

for file in file_list:
    if file.split(".")[-1] == 'csv':
        csv_list.append(file)

print (csv_list)

['01-01-2021.csv', '01-01-2022.csv', '01-01-2023.csv', '01-02-2021.csv', '01-02-2022.csv', '01-02-2023.csv', '01-03-2021.csv', '01-03-2022.csv', '01-03-2023.csv', '01-04-2021.csv', '01-04-2022.csv', '01-04-2023.csv', '01-05-2021.csv', '01-05-2022.csv', '01-05-2023.csv', '01-06-2021.csv', '01-06-2022.csv', '01-06-2023.csv', '01-07-2021.csv', '01-07-2022.csv', '01-07-2023.csv', '01-08-2021.csv', '01-08-2022.csv', '01-08-2023.csv', '01-09-2021.csv', '01-09-2022.csv', '01-09-2023.csv', '01-10-2021.csv', '01-10-2022.csv', '01-10-2023.csv', '01-11-2021.csv', '01-11-2022.csv', '01-11-2023.csv', '01-12-2021.csv', '01-12-2022.csv', '01-12-2023.csv', '01-13-2021.csv', '01-13-2022.csv', '01-13-2023.csv', '01-14-2021.csv', '01-14-2022.csv', '01-14-2023.csv', '01-15-2021.csv', '01-15-2022.csv', '01-15-2023.csv', '01-16-2021.csv', '01-16-2022.csv', '01-16-2023.csv', '01-17-2021.csv', '01-17-2022.csv', '01-17-2023.csv', '01-18-2021.csv', '01-18-2022.csv', '01-18-2023.csv', '01-19-2021.csv', '01-19-20

#### Sorting a list
- sort(): ascending order (default)
- sort(reverse=True): descending order

In [122]:
csv_list.sort()
csv_list

['01-01-2021.csv',
 '01-01-2022.csv',
 '01-01-2023.csv',
 '01-02-2021.csv',
 '01-02-2022.csv',
 '01-02-2023.csv',
 '01-03-2021.csv',
 '01-03-2022.csv',
 '01-03-2023.csv',
 '01-04-2021.csv',
 '01-04-2022.csv',
 '01-04-2023.csv',
 '01-05-2021.csv',
 '01-05-2022.csv',
 '01-05-2023.csv',
 '01-06-2021.csv',
 '01-06-2022.csv',
 '01-06-2023.csv',
 '01-07-2021.csv',
 '01-07-2022.csv',
 '01-07-2023.csv',
 '01-08-2021.csv',
 '01-08-2022.csv',
 '01-08-2023.csv',
 '01-09-2021.csv',
 '01-09-2022.csv',
 '01-09-2023.csv',
 '01-10-2021.csv',
 '01-10-2022.csv',
 '01-10-2023.csv',
 '01-11-2021.csv',
 '01-11-2022.csv',
 '01-11-2023.csv',
 '01-12-2021.csv',
 '01-12-2022.csv',
 '01-12-2023.csv',
 '01-13-2021.csv',
 '01-13-2022.csv',
 '01-13-2023.csv',
 '01-14-2021.csv',
 '01-14-2022.csv',
 '01-14-2023.csv',
 '01-15-2021.csv',
 '01-15-2022.csv',
 '01-15-2023.csv',
 '01-16-2021.csv',
 '01-16-2022.csv',
 '01-16-2023.csv',
 '01-17-2021.csv',
 '01-17-2022.csv',
 '01-17-2023.csv',
 '01-18-2021.csv',
 '01-18-2022
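
Note that sorting these names as plain strings orders them by month and day first, so files from different years end up interleaved, as the output above shows. If chronological order is wanted, one option is to sort with a date key. This is a sketch assuming the `MM-DD-YYYY.csv` naming; `report_date` is a hypothetical helper and `names` is a short illustrative list.

```python
from datetime import datetime

names = ['01-01-2021.csv', '01-01-2022.csv', '12-31-2020.csv', '01-02-2021.csv']

def report_date(filename):
    """Parse the date encoded in a 'MM-DD-YYYY.csv' file name (hypothetical helper)."""
    stem = filename[:-len('.csv')]
    return datetime.strptime(stem, '%m-%d-%Y')

# sorted() compares the parsed dates instead of the raw strings.
chronological = sorted(names, key=report_date)
print(chronological)
```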

## Full code

In [6]:
# Data setup
import pandas as pd
import json
import os


# Load the JSON file
with open('ex_datafile/COVID-19-master/csse_covid_19_data/country_convert.json', 'r', encoding='utf-8') as json_file:
    json_data = json.load(json_file)
    
# Convert a row's country name using the mapping
def country_name_convert(row):
    if row['Country_Region'] in json_data:
        return json_data[row['Country_Region']]
    return row['Country_Region']


# Read a CSV file and preprocess it
def create_dateframe(filename):
    file_path = 'ex_datafile/COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports/'
    csv = pd.read_csv(file_path + filename,encoding='utf-8')                   # 1.
    try : 
        csv = csv[['Province_State', 'Country_Region', 'Confirmed']]           # 2.

    except KeyError:
        csv = csv[['Province/State', 'Country/Region', 'Confirmed']]           # 2.
        csv.columns = ['Province_State', 'Country_Region', 'Confirmed'] 

    csv = csv.dropna(subset=['Confirmed'])                                     # 3.
    csv['Country_Region'] = csv.apply(country_name_convert, axis=1)            # 4.
    csv = csv.astype({'Confirmed' : 'int64'})                                  # 5.
    csv = csv[['Country_Region', 'Confirmed']].groupby('Country_Region').sum() # 6.
    
    date_column = filename.split('.')[0].lstrip('0').replace('-','/')          # 7.
    csv.columns = [date_column]
    return csv

def generate_dateframe_by_path(file_path):

    file_list, csv_list = os.listdir(file_path), list()
    first_doc = True
    for file in file_list:
        if file.split(".")[-1] == 'csv':
            csv_list.append(file)
    csv_list.sort()
    
    for file in csv_list:
        doc = create_dateframe(file)
        if first_doc:
            final_doc, first_doc = doc, False
        else:
            final_doc = pd.merge(final_doc, doc, how='outer', left_index=True, right_index=True)

    final_doc = final_doc.fillna(0)
    return final_doc


In [7]:
file_path = 'ex_datafile/COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports/'
doc = generate_dateframe_by_path(file_path)
doc

Unnamed: 0_level_0,1/01/2021,1/01/2022,1/01/2023,1/02/2021,1/02/2022,1/02/2023,1/03/2021,1/03/2022,1/03/2023,1/04/2021,...,12/28/2022,12/29/2020,12/29/2021,12/29/2022,12/30/2020,12/30/2021,12/30/2022,12/31/2020,12/31/2021,12/31/2022
Country_Region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Afghanistan,52513.0,158107.0,207616.0,52586.0,158189.0,207627.0,52709.0,158183.0,207654.0,52909.0,...,207493.0,52147.0,158037.0,207511.0,52330.0,158056.0,207550.0,52330.0,158084.0,207559.0
Albania,58316.0,210224.0,333811.0,58991.0,210885.0,333812.0,59438.0,210885.0,333812.0,59623.0,...,333776.0,57146.0,208899.0,333776.0,57727.0,208899.0,333806.0,58316.0,210224.0,333806.0
Algeria,99897.0,218818.0,271229.0,100159.0,219159.0,271229.0,100408.0,219532.0,271230.0,100645.0,...,271208.0,98988.0,217647.0,271217.0,99311.0,218037.0,271223.0,99610.0,218432.0,271228.0
Andorra,8117.0,23740.0,47751.0,8166.0,23740.0,47751.0,8192.0,24502.0,47751.0,8249.0,...,47751.0,7919.0,22823.0,47751.0,7983.0,23122.0,47751.0,8049.0,23740.0,47751.0
Angola,17568.0,82398.0,105095.0,17608.0,82920.0,105095.0,17642.0,83764.0,105095.0,17684.0,...,105095.0,17371.0,78475.0,105095.0,17433.0,79871.0,105095.0,17553.0,81593.0,105095.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
West Bank and Gaza,139223.0,469748.0,703228.0,140287.0,469748.0,703228.0,141219.0,469748.0,703228.0,142228.0,...,703228.0,135459.0,469748.0,703228.0,136736.0,469748.0,703228.0,138004.0,469748.0,703228.0
Winter Olympics 2022,0.0,0.0,535.0,0.0,0.0,535.0,0.0,0.0,535.0,0.0,...,535.0,0.0,0.0,535.0,0.0,0.0,535.0,0.0,0.0,535.0
Yemen,2101.0,10127.0,11945.0,2101.0,10130.0,11945.0,2101.0,10138.0,11945.0,2101.0,...,11945.0,2096.0,10125.0,11945.0,2097.0,10126.0,11945.0,2099.0,10126.0,11945.0
Zambia,20997.0,257948.0,334629.0,21230.0,259677.0,334661.0,21582.0,261221.0,334695.0,21993.0,...,334196.0,20177.0,243638.0,334294.0,20462.0,249193.0,334425.0,20725.0,254274.0,334425.0
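
Merging one file at a time works, but each merge copies the growing table. A possible alternative, sketched here with small illustrative frames, is to collect the per-file frames in a list and combine them once with `pd.concat(..., axis=1)`, which performs an outer join on the index by default.

```python
import pandas as pd

# Stand-ins for the per-file results; each has one date column indexed by country.
frames = [
    pd.DataFrame({'1/22/2020': [1]}, index=['US']),
    pd.DataFrame({'1/23/2020': [1, 2]}, index=['US', 'Japan']),
    pd.DataFrame({'1/24/2020': [2, 2]}, index=['US', 'Japan']),
]

# axis=1 places the frames side by side; countries missing from a file become NaN.
combined = pd.concat(frames, axis=1).fillna(0).astype('int64')
print(combined)
```

Casting with `astype('int64')` right after `fillna(0)` also avoids the float columns seen in the output above.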


#### Changing the data type

In [8]:
doc = doc.astype('int64')
doc

Unnamed: 0_level_0,1/01/2021,1/01/2022,1/01/2023,1/02/2021,1/02/2022,1/02/2023,1/03/2021,1/03/2022,1/03/2023,1/04/2021,...,12/28/2022,12/29/2020,12/29/2021,12/29/2022,12/30/2020,12/30/2021,12/30/2022,12/31/2020,12/31/2021,12/31/2022
Country_Region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Afghanistan,52513,158107,207616,52586,158189,207627,52709,158183,207654,52909,...,207493,52147,158037,207511,52330,158056,207550,52330,158084,207559
Albania,58316,210224,333811,58991,210885,333812,59438,210885,333812,59623,...,333776,57146,208899,333776,57727,208899,333806,58316,210224,333806
Algeria,99897,218818,271229,100159,219159,271229,100408,219532,271230,100645,...,271208,98988,217647,271217,99311,218037,271223,99610,218432,271228
Andorra,8117,23740,47751,8166,23740,47751,8192,24502,47751,8249,...,47751,7919,22823,47751,7983,23122,47751,8049,23740,47751
Angola,17568,82398,105095,17608,82920,105095,17642,83764,105095,17684,...,105095,17371,78475,105095,17433,79871,105095,17553,81593,105095
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
West Bank and Gaza,139223,469748,703228,140287,469748,703228,141219,469748,703228,142228,...,703228,135459,469748,703228,136736,469748,703228,138004,469748,703228
Winter Olympics 2022,0,0,535,0,0,535,0,0,535,0,...,535,0,0,535,0,0,535,0,0,535
Yemen,2101,10127,11945,2101,10130,11945,2101,10138,11945,2101,...,11945,2096,10125,11945,2097,10126,11945,2099,10126,11945
Zambia,20997,257948,334629,21230,259677,334661,21582,261221,334695,21993,...,334196,20177,243638,334294,20462,249193,334425,20725,254274,334425


#### Creating a CSV file with pandas
- Use to_csv() to save a pandas DataFrame to a CSV file<br>
    doc.to_csv("00_data/students_default.csv")<br><br>
- An encoding option is available<br>
    doc.to_csv("00_data/students_default.csv", encoding='utf-8-sig')

In [10]:
doc.to_csv("ex_datafile/COVID-19-master/final_df.csv")
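
To check that a saved file can be read back as expected, a round trip such as the following may help. It is a sketch with an illustrative frame and a temporary directory rather than the real output path; `utf-8-sig` writes a byte order mark, which some spreadsheet programs use to detect UTF-8.

```python
import os
import tempfile

import pandas as pd

# Illustrative table shaped like the final result: countries as the index, dates as columns.
table = pd.DataFrame(
    {'1/22/2020': [547, 2]},
    index=pd.Index(['China', 'Japan'], name='Country_Region'),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'final_df.csv')
    table.to_csv(out_path, encoding='utf-8-sig')
    # index_col restores the country names as the index instead of a regular column.
    restored = pd.read_csv(out_path, index_col='Country_Region', encoding='utf-8-sig')

print(restored)
```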