<a href="https://colab.research.google.com/github/yeahginny/Data-Analysis-School/blob/main/0831_NetworkX_COVID19.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. List of Data Tables
### 1) Case Data
- **Case**: Data of COVID-19 infection cases in South Korea

### 2) Patient Data
- **PatientInfo**: Epidemiological data of COVID-19 patients in South Korea
- **PatientRoute**: Route data of COVID-19 patients in South Korea (currently unavailable)

### 3) Time Series Data
- **Time**: Time series data of COVID-19 status in South Korea
- **TimeAge**: Time series data of COVID-19 status in terms of the age in South Korea
- **TimeGender**: Time series data of COVID-19 status in terms of gender in South Korea
- **TimeProvince**: Time series data of COVID-19 status in terms of the Province in South Korea

### 4) Additional Data
- **Region**: Location and statistical data of the regions in South Korea
- **Weather**: Data of the weather in the regions of South Korea
- **SearchTrend**: Trend data of the keywords searched on NAVER, one of the largest portals in South Korea
- **SeoulFloating**: Data of floating population in Seoul, South Korea (from SK Telecom Big Data Hub)
- **Policy**: Data of the government policy for COVID-19 in South Korea

# 2. The Structure of our Dataset
- Tables shown in the same color have similar properties.
- A line connecting two columns means that the columns partially share values.
- A dotted line means weak relevance.
![db_0701](https://user-images.githubusercontent.com/50820635/86225695-8dca0580-bbc5-11ea-9e9b-b0ca33414d8a.PNG)

# 3. The Detailed Description of each Data Table

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
path = '/content/'

case = pd.read_csv(path+'Case.csv')
patinfo = pd.read_csv(path+'PatientInfo.csv')
#p_route = pd.read_csv(path+'PatientRoute.csv')
time = pd.read_csv(path+'Time.csv')
t_age = pd.read_csv(path+'TimeAge.csv')
t_gender = pd.read_csv(path+'TimeGender.csv')
t_provin = pd.read_csv(path+'TimeProvince.csv')
region = pd.read_csv(path+'Region.csv')
weather = pd.read_csv(path+'Weather.csv')
search = pd.read_csv(path+'SearchTrend.csv')
floating = pd.read_csv(path+'SeoulFloating.csv')
policy = pd.read_csv(path+'Policy.csv')

##### Before We Start
- We built a structured dataset based on the report materials of KCDC and local governments.
- Korean administrative divisions use the suffixes '-do', '-si', '-gun' and '-gu'.
- Their meanings are explained below.

***


### Levels of administrative divisions in South Korea
#### Upper Level (Provincial-level divisions)
- **Special City**:
*Seoul*
- **Metropolitan City**:
*Busan / Daegu / Daejeon / Gwangju / Incheon / Ulsan*
- **Province(-do)**:
*Gyeonggi-do / Gangwon-do / Chungcheongbuk-do / Chungcheongnam-do / Jeollabuk-do / Jeollanam-do / Gyeongsangbuk-do / Gyeongsangnam-do*

#### Lower Level (Municipal-level divisions)
- **City(-si)**
[List of cities in South Korea](https://en.wikipedia.org/wiki/List_of_cities_in_South_Korea)
- **County(-gun)**
[List of counties of South Korea](https://en.wikipedia.org/wiki/List_of_counties_of_South_Korea)
- **District(-gu)**
[List of districts in South Korea](https://en.wikipedia.org/wiki/List_of_districts_in_South_Korea)

***

<img src="https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2815958%2F1c50702025f44b0c1ce92460bd2ea3f9%2Fus_hi_30-1.jpg?generation=1582819435038273&amp;alt=media">

***

Sources
- http://nationalatlas.ngii.go.kr/pages/page_1266.php
- https://en.wikipedia.org/wiki/Administrative_divisions_of_South_Korea

### 1) Case
#### Data of COVID-19 infection cases in South Korea
- case_id: the ID of the infection case
  > - case_id(7) = region_code(5) + case_number(2)  
  > - You can check the region_code in 'Region.csv'
- province: Special City / Metropolitan City / Province(-do)
- city: City(-si) / County (-gun) / District (-gu)
  > - The value 'from other city' means that the group infection started in another city.
- group: TRUE: group infection / FALSE: not group
  > - If the value is 'TRUE' in this column, the value of 'infection_case' is the name of the group.  
  > - The values 'contact with patient', 'overseas inflow' and 'etc' are not group infections.
- infection_case: the infection case (the name of the group or other cases)
  > - The value 'overseas inflow' means that the infection came from another country.  
  > - The value 'etc' includes individual cases, cases where relevance classification is ongoing after investigation, and cases under investigation.
- confirmed: the accumulated number of confirmed cases
- latitude: the latitude of the group (WGS84)
- longitude: the longitude of the group (WGS84)
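The `case_id` layout described above can be unpacked with integer arithmetic. This is a minimal sketch, assuming the ID is stored as a 7-digit integer; the helper name `split_case_id` and the sample value are illustrative and not part of the dataset.

```python
def split_case_id(case_id: int) -> tuple[int, int]:
    """Split a 7-digit case_id into (region_code, case_number)."""
    # The last 2 digits are the case number; the leading 5 are the region code.
    region_code, case_number = divmod(case_id, 100)
    return region_code, case_number


# Illustrative value only: region code 10000 followed by case number 01.
print(split_case_id(1000001))  # (10000, 1)
```

The returned region code could then be looked up in the `code` column of 'Region.csv'.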


In [None]:
case.head()

### 2) PatientInfo
#### Epidemiological data of COVID-19 patients in South Korea
- patient_id: the ID of the patient
  > - patient_id(10) = region_code(5) + patient_number(5)
  > - You can check the region_code in 'Region.csv'
  > - There are two types of the patient_number  
      1) local_num: The number given by the local government.  
      2) global_num: The number given by the KCDC  
- sex: the sex of the patient
- age: the age of the patient
  > - 0s: 0 ~ 9  
  > - 10s: 10 ~ 19  
  ...  
  > - 90s: 90 ~ 99  
  > - 100s: 100 ~ 109
- country: the country of the patient
- province: the province of the patient
- city: the city of the patient
- infection_case: the case of infection
- infected_by: the ID of who infected the patient
  > - This column refers to the  'patient_id' column.
- contact_number: the number of contacts with people
- symptom_onset_date: the date of symptom onset
- confirmed_date: the date of being confirmed
- released_date: the date of being released
- deceased_date: the date of being deceased
- state: isolated / released / deceased
  > - isolated: being isolated in the hospital
  > - released: being released from the hospital
  > - deceased: being deceased
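The age bands listed above ('0s', '10s', ..., '100s') can be turned into numeric ranges when a numeric axis is needed. This is a small sketch, assuming every band has the form `<decade>s`; the helper name `age_band_to_range` is hypothetical.

```python
def age_band_to_range(band: str) -> tuple[int, int]:
    """Convert an age band such as '20s' into its inclusive (low, high) range."""
    low = int(band.rstrip('s'))  # '20s' -> 20
    return low, low + 9


print(age_band_to_range('0s'))    # (0, 9)
print(age_band_to_range('100s'))  # (100, 109)
```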

In [None]:
patinfo.head()

### 3) PatientRoute
#### Route data of COVID-19 patients in South Korea
- patient_id: the ID of the patient
- date: YYYY-MM-DD
- province: Special City / Metropolitan City / Province(-do)
- city: City(-si) / County (-gun) / District (-gu)
- latitude: the latitude of the visit (WGS84)
- longitude: the longitude of the visit (WGS84)

In [None]:
#p_route.head()

### 4) Time
#### Time series data of COVID-19 status in South Korea
- date: YYYY-MM-DD
- time: Time (0 = AM 12:00 / 16 = PM 04:00)
  > - Since March 2nd, KCDC has released the information at AM 12:00 instead of PM 04:00.
- test: the accumulated number of tests
  > - A test is a diagnosis of an infection.
- negative: the accumulated number of negative results
- confirmed: the accumulated number of positive results
- released: the accumulated number of releases
- deceased: the accumulated number of deaths
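Because every count in this table is accumulated, daily figures have to be derived by differencing consecutive rows. The sketch below shows the idea on a plain list; the helper name `daily_increments` and the sample totals are hypothetical. With pandas, `time['confirmed'].diff()` would give a comparable result, with `NaN` for the first row.

```python
def daily_increments(cumulative: list[int]) -> list[int]:
    """Turn accumulated totals into per-day increases."""
    if not cumulative:
        return []
    # The first day has no predecessor, so its increment is its own value.
    increments = [cumulative[0]]
    for prev, curr in zip(cumulative, cumulative[1:]):
        increments.append(curr - prev)
    return increments


confirmed_total = [1, 1, 2, 4, 11]  # hypothetical accumulated counts
print(daily_increments(confirmed_total))  # [1, 0, 1, 2, 7]
```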

In [None]:
time.head()

### 5) TimeAge
#### Time series data of COVID-19 status in terms of the age in South Korea
- date: YYYY-MM-DD
  > - The status in terms of the age has been presented since March 2nd.
- time: Time
- age: the age of patients
- confirmed: the accumulated number of the confirmed
- deceased: the accumulated number of the deceased

In [None]:
t_age.head()

### 6) TimeGender
#### Time series data of COVID-19 status in terms of the gender in South Korea
- date: YYYY-MM-DD
  > - The status in terms of the gender has been presented since March 2nd.
- time: Time
- sex: the gender of patients
- confirmed: the accumulated number of the confirmed
- deceased: the accumulated number of the deceased

In [None]:
t_gender.head()

### 7) TimeProvince
#### Time series data of COVID-19 status in terms of the Province in South Korea
- date: YYYY-MM-DD
- time: Time
- province: the province of South Korea
- confirmed: the accumulated number of the confirmed in the province
  > - The confirmed status by province has been reported since February 21st.
  > - Values before February 21st may differ.
- released: the accumulated number of the released in the province
  > - The released status by province has been reported since March 5th.
  > - Values before March 5th may differ.
- deceased: the accumulated number of the deceased in the province
  > - The deceased status by province has been reported since March 5th.
  > - Values before March 5th may differ.

In [None]:
t_provin.head()

### 8) Region
#### Location and statistical data of the regions in South Korea
- code: the code of the region
- province: Special City / Metropolitan City / Province(-do)
- city: City(-si) / County (-gun) / District (-gu)
- latitude: the latitude of the region (WGS84)
- longitude: the longitude of the region (WGS84)
- elementary_school_count: the number of elementary schools
- kindergarten_count: the number of kindergartens
- university_count: the number of universities
- academy_ratio: the ratio of academies
- elderly_population_ratio: the ratio of the elderly population
- elderly_alone_ratio: the ratio of elderly households living alone
- nursing_home_count: the number of nursing homes

Source of the statistic: [KOSTAT (Statistics Korea)](http://kosis.kr/)

In [None]:
region.head()

### 9) Weather
#### Data of the weather in the regions of South Korea
- code: the code of the region
- province: Special City / Metropolitan City / Province(-do)
- date: YYYY-MM-DD
- avg_temp: the average temperature
- min_temp: the lowest temperature
- max_temp: the highest temperature
- precipitation: the daily precipitation
- max_wind_speed: the maximum wind speed
- most_wind_direction: the most frequent wind direction
- avg_relative_humidity: the average relative humidity

Source of the weather data: [KMA (Korea Meteorological Administration)](http://data.kma.go.kr)

In [None]:
weather.head()

### 10) SearchTrend
#### Trend data of the keywords searched on NAVER, one of the largest portals in South Korea
- date: YYYY-MM-DD
- cold: the search volume of 'cold' in Korean language
  > - The unit means relative value by setting the highest search volume in the period to 100.
- flu: the search volume of 'flu' in Korean language
  > - Same as above.
- pneumonia: the search volume of 'pneumonia' in Korean language
  > - Same as above.
- coronavirus: the search volume of 'coronavirus' in Korean language
  > - Same as above.


Source of the data: [NAVER DataLab](https://datalab.naver.com/)
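The relative unit described above (the highest search volume in the period is set to 100) corresponds to a simple peak normalization. This sketch only illustrates the arithmetic; the raw counts are hypothetical, since the dataset ships the already-normalized values, and `to_relative_volume` is not a NAVER DataLab function.

```python
def to_relative_volume(raw_counts: list[float]) -> list[float]:
    """Scale raw counts so that the largest value in the period becomes 100."""
    peak = max(raw_counts)
    if peak == 0:
        # No searches at all in the period: everything stays at zero.
        return [0.0 for _ in raw_counts]
    return [100 * count / peak for count in raw_counts]


print(to_relative_volume([50, 200, 100]))  # [25.0, 100.0, 50.0]
```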

In [None]:
search.head()

### 11) SeoulFloating
#### Data of floating population in Seoul, South Korea (from SK Telecom Big Data Hub)

- date: YYYY-MM-DD
- hour: Hour
- birth_year: the birth year of the floating population
- sex: the sex of the floating population
- province: Special City / Metropolitan City / Province(-do)
- city: City(-si) / County (-gun) / District (-gu)
- fp_num: the number of floating population

Source of the data: [SKT Big Data Hub](https://www.bigdatahub.co.kr)

In [None]:
floating.head()

### 12) Policy
#### Data of the government policy for COVID-19 in South Korea

- policy_id: the ID of the policy
- country: the country that implemented the policy
- type: the type of the policy
- gov_policy: the policy of the government
- detail: the detail of the policy
- start_date: the start date of the policy
- end_date: the end date of the policy

In [None]:
policy.head()


In [None]:
pip install pyvis

Collecting pyvis
  Downloading pyvis-0.3.2-py3-none-any.whl (756 kB)
Collecting jedi>=0.16 (from ipython>=5.3.0->pyvis)
  Downloading jedi-0.19.0-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi, pyvis
Successfully installed jedi-0.19.0 pyvis-0.3.2


In [None]:
pip install networkx



In [None]:
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium
import json
from folium.features import CustomIcon

In [None]:
import networkx as nx
from pyvis.network import Network

# Final Revision

In [None]:
# 1. Load the data
path = "/content/"
case_df = pd.read_csv(path + 'Case.csv')
patinfo_df = pd.read_csv(path + 'PatientInfo.csv')

# 2. Extract the group-infection cases (everything except the non-group labels)
infect_cases = case_df.loc[~case_df['infection_case'].isin(['etc', 'contact with patient', 'overseas inflow']), 'infection_case'].unique()

# 3. Handle missing values (assign back instead of using inplace on a column selection)
patinfo_df['infected_by'] = patinfo_df['infected_by'].fillna('nan')

# 4. Build lists of infection information
pat_ids = patinfo_df['patient_id'].tolist()
pat_cases = patinfo_df['infection_case'].tolist()
whos_infected = patinfo_df['infected_by'].tolist()

# 5. Keep patients whose infector is known and is itself a listed patient
who_indices = [i for i, infector in enumerate(whos_infected) if infector != 'nan' and infector in pat_ids]
edges = [(int(whos_infected[i]), int(pat_ids[i])) for i in who_indices]

# 6. Build the lists of patient IDs and infection cases
patients_from_cases = [pat_ids[i] for i, case in enumerate(pat_cases) if case in infect_cases]
unique_cases = list(infect_cases)

patients_from_cases[:5], unique_cases[:5]  # Displaying some sample values for verification

([1000000015, 1000000020, 1000000022, 1000000023, 1000000025],
 ['Itaewon Clubs',
  'Richway',
  'Guro-gu Call Center',
  'Yangcheon Table Tennis Club',
  'Day Care Center'])

In [None]:
def get_date_optimized(pati_id):
    # Look up the confirmed date for the given patient ID
    confirmed_date = patinfo_df.loc[patinfo_df['patient_id'] == pati_id, 'confirmed_date'].values[0]

    # Convert the date string (YYYY-MM-DD) to an integer (YYYYMMDD)
    date = int(confirmed_date.replace('-', ''))

    return date

# Test the function
sample_patient_id = pat_ids[0]
get_date_optimized(sample_patient_id)

20200123
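`get_date_optimized` filters the whole table on every call, which adds up when it is called once per patient. One possible alternative, sketched here with the standard library only, is to build the lookup once as a dictionary. The rows, `date_to_int`, and `date_by_patient` are hypothetical stand-ins, not names from the notebook.

```python
# Hypothetical (patient_id, confirmed_date) pairs standing in for PatientInfo rows.
rows = [
    (1000000001, '2020-01-23'),
    (1000000002, '2020-01-30'),
]


def date_to_int(date_str: str) -> int:
    """Convert 'YYYY-MM-DD' to the integer YYYYMMDD."""
    return int(date_str.replace('-', ''))


# Build the mapping once; each later lookup is a constant-time dictionary access.
date_by_patient = {patient_id: date_to_int(date) for patient_id, date in rows}

print(date_by_patient[1000000001])  # 20200123
```

With the real table, the same mapping could be built from the `patient_id` and `confirmed_date` columns, provided missing dates are handled first.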

# Final Composition

In [None]:
# Replacing new_edges with edges
new_edges = edges

# Creating num_case (patient and its infection case)
num_case = [(case, pat_id) for case, pat_id in zip(pat_cases, pat_ids) if case in infect_cases]

# Optimizing the code

# Extracting new nodes from new_edges
new_nodes_1 = {i[0] for i in new_edges if i[0] not in patients_from_cases}
new_nodes_2 = {i[1] for i in new_edges if i[1] not in patients_from_cases}

# Updating the patient_list
patient_list = list(set(patients_from_cases).union(new_nodes_1, new_nodes_2))

# Combining all edges
edge_list = num_case + new_edges

# Graph creation
G = nx.Graph()
G.add_nodes_from(patient_list, bipartite=1)
G.add_nodes_from(unique_cases, bipartite=0)
G.add_edges_from(edge_list)

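# Sorting connected components by size (largest first)
components_covid = sorted(nx.connected_components(G), key=len, reverse=True)
# Index of the first component with at most 5 nodes; it and everything after it is discarded.
# Using <= avoids a StopIteration when no component has exactly 5 nodes.
first = next(i for i, comp in enumerate(components_covid) if len(comp) <= 5)
trash = set().union(*components_covid[first:])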

# Removing trash nodes
patient_list = list(set(patient_list) - trash)
case_list = list(set(unique_cases) - trash)
edge_list = [edge for edge in edge_list if edge[0] not in trash and edge[1] not in trash]

# Extracting patient indices
list_ind = [pat_ids.index(i) for i in patient_list]

# Displaying some sample values for verification
list_ind[:5], patient_list[:5], case_list[:5], edge_list[:5]

([4205, 2108, 2109, 2110, 2112],
 [6001000448, 1600000001, 1600000002, 1600000003, 1600000005],
 ['Geochang Church',
  'Seongdong-gu APT',
  'Wangsung Church',
  'Yangcheon Table Tennis Club',
  'gym facility in Cheonan'],
 [('Seongdong-gu APT', 1000000015),
  ('Seongdong-gu APT', 1000000020),
  ("Eunpyeong St. Mary's Hospital", 1000000022),
  ('Shincheonji Church', 1000000023),
  ("Eunpyeong St. Mary's Hospital", 1000000025)])
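The pruning step above, which drops every small connected component before plotting, can be illustrated without NetworkX. This is a minimal sketch using a breadth-first search; the function names, the `max_small_size` parameter, and the toy edges are all hypothetical.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Return the connected components of an undirected graph as sets."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, components = set(), []
    for start in list(adjacency):
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node] - component:
                component.add(neighbour)
                queue.append(neighbour)
        seen |= component
        components.append(component)
    return components


def drop_small_components(edges, max_small_size):
    """Keep only edges whose component has more than `max_small_size` nodes."""
    trash = set()
    for component in connected_components(edges):
        if len(component) <= max_small_size:
            trash |= component
    return [(a, b) for a, b in edges if a not in trash and b not in trash]


# Hypothetical toy graph: one 4-node cluster and one isolated pair.
toy_edges = [('Case A', 1), ('Case A', 2), (2, 3), ('Case B', 9)]
print(drop_small_components(toy_edges, max_small_size=2))
# [('Case A', 1), ('Case A', 2), (2, 3)]
```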

# Optimization

In [None]:
# Constructing a dictionary for each case to its connected patients
dict_51 = {i: list(set(G[i]) - set(case_list)) for i in case_list}

# Updating the edges based on the earliest infected patient for each case
for case, patients in dict_51.items():
    # Getting the date of infection for each patient linked to the case
    dates = [get_date_optimized(patient) for patient in patients]
    # Identifying the earliest infected patients
    earliest_date = min(dates)
    first_infected = [patients[i] for i, date in enumerate(dates) if date == earliest_date]

    # Updating the edge list
    for infected in first_infected:
        edge_list = [edge for edge in edge_list if edge != (case, infected)]
        edge_list.append((infected, case))

# Setting up the graph visualization
g = Network(height='800px', width='1600px', directed=True, notebook=True)
g.set_options("""
var options = {
  "nodes": {
    "font": {
      "size": 100,
      "strokeColor": "rgba(165,215,255,1)"}}}
""")

# Adding nodes for each case
for case in case_list:
    g.add_node(case, title=case, color='gray', label=case, shape='star')

# Displaying some sample values for verification
edge_list[:5], list(dict_51.items())[:2]



([('Seongdong-gu APT', 1000000020),
  ('Shincheonji Church', 1000000023),
  ("Eunpyeong St. Mary's Hospital", 1000000025),
  ("Eunpyeong St. Mary's Hospital", 1000000028),
  ("Eunpyeong St. Mary's Hospital", 1000000029)],
 [('Geochang Church',
   [6100000047, 6100000052, 6100000054, 6100000057, 6100000058, 6100000059]),
  ('Seongdong-gu APT',
   [1000000102,
    1000000070,
    1000000078,
    1000000015,
    1000000079,
    1000000020,
    1000000087,
    1000000088,
    1000000089,
    1000000090,
    1000000092,
    1000000094,
    1000000095])])
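The edge-orientation rule used in this cell can be shown on a toy example: the earliest confirmed patients of a case point at the case, and the case points at everyone else. This is a sketch under the assumption that every case has at least one linked patient; `orient_first_cases` and the sample data are hypothetical.

```python
def orient_first_cases(case_to_patients, confirmed_date):
    """Return directed edges: earliest patients -> case, case -> later patients."""
    edges = []
    for case, patients in case_to_patients.items():
        earliest = min(confirmed_date[p] for p in patients)
        for patient in patients:
            if confirmed_date[patient] == earliest:
                edges.append((patient, case))   # likely source of the cluster
            else:
                edges.append((case, patient))   # infected within the cluster
    return edges


# Hypothetical data: patients 2 and 3 share the earliest confirmed date.
case_to_patients = {'Case A': [1, 2, 3]}
confirmed_date = {1: 20200301, 2: 20200225, 3: 20200225}
print(orient_first_cases(case_to_patients, confirmed_date))
# [('Case A', 1), (2, 'Case A'), (3, 'Case A')]
```

Ties are kept on purpose, so several patients confirmed on the same earliest date all point at the case.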

In [None]:
# Defining mappings and functions for optimization

# Color mapping based on age
age_colors = {
    '0s': 'purple',
    '10s': 'indigo',
    '20s': 'blue',
    '30s': 'skyblue',
    '40s': 'green',
    '50s': 'lawngreen',
    '60s': 'yellow',
    '70s': 'orange',
    '80s': 'red',
    '90s': 'brown'
}

# Shape determination based on sex
def get_shape(sex):
    if sex == 'male':
        return 'square'
    elif sex == 'female':
        return 'dot'
    else:
        return 'triangle'

# Adding nodes to the graph based on the determined attributes
for i in list_ind:
    patient_row = patinfo_df.iloc[i]
    id_pat = int(patient_row['patient_id'])
    age = patient_row['age']
    sex = patient_row['sex']

    color = age_colors.get(age, 'black')
    shape = get_shape(sex)

    g.add_node(id_pat, label=' ', shape=shape, color=color, size=12, title=str(id_pat))

# Adding edges to the graph
for edge in edge_list:
    g.add_edge(source=edge[0], to=edge[1])


In [None]:
g.show("network_final.html")

network_final.html


# Who Had the Most Influence?

In [None]:
# 1. Removing unnecessary infection cases
filtered_cases = set(case_df['infection_case']) - {'etc', 'contact with patient', 'overseas inflow'}

# 2. Replacing missing values
patinfo_df.fillna('nan', inplace=True)

# 3. Extracting patient information
pat_ids = patinfo_df['patient_id'].tolist()
pat_cases = patinfo_df['infection_case'].tolist()
infected_by = patinfo_df['infected_by'].tolist()

# Adjusting for cases where a patient is infected by multiple infectors

# 4. Generating edges based on who infected whom
infected_edges = []
for infector_str, pat_id in zip(infected_by, pat_ids):
    if infector_str != 'nan':
        for single_infector in infector_str.split(','):
            single_infector = single_infector.strip()
            if int(single_infector) in pat_ids:
                infected_edges.append((int(single_infector), int(pat_id)))

# Displaying some sample values for verification
infected_edges[:5]
# 5. Generating edges based on infection cases
case_edges = [(case, pat_id) for case, pat_id in zip(pat_cases, pat_ids) if case in filtered_cases]

# 6. Consolidating all nodes and edges
all_patient_nodes = list(set(pat_ids))
all_case_nodes = list(filtered_cases)
all_edges = infected_edges + case_edges

# Displaying some sample values for verification
all_patient_nodes[:5], all_case_nodes[:5], all_edges[:5]

([6001000448, 6001000449, 6001000450, 6001000451, 6001000452],
 ['Fatima Hospital',
  'Yongin Brothers',
  'Door-to-door sales in Daejeon',
  'Daejeon door-to-door sales',
  'gym facility in Cheonan'],
 [(1000000002, 1000000005),
  (1000000003, 1000000006),
  (1000000003, 1000000007),
  (1000000003, 1000000010),
  (1000000017, 1000000013)])

In [None]:
# Creating a directed graph and adding nodes and edges directly
G = nx.DiGraph()
G.add_nodes_from(all_patient_nodes, bipartite=1)
G.add_nodes_from(all_case_nodes, bipartite=0)
G.add_edges_from(all_edges)

# Checking the number of nodes and edges for verification
G.number_of_nodes(), G.number_of_edges()

(5242, 2433)

In [None]:
# Filtering and sorting in one step
top_5 = sorted([(node, deg) for node, deg in G.out_degree() if node in all_patient_nodes], key=lambda x: x[1], reverse=True)[:5]

# Extracting node IDs and their counts
top_5_nodes = [entry[0] for entry in top_5]
top_5_count = [entry[1] for entry in top_5]

top_5_nodes, top_5_count

([2000000205, 4100000008, 2000000167, 1400000209, 4100000006],
 [51, 27, 24, 24, 21])
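The ranking above counts, for each patient, the number of outgoing edges in the directed graph. The same count can be sketched with `collections.Counter` on a plain edge list; the edges and IDs below are hypothetical toy values.

```python
from collections import Counter

# Hypothetical directed edges: (source, target). Case nodes are strings.
edges = [(1, 2), (1, 3), (1, 4), (2, 5), ('Case A', 1), ('Case A', 2)]
patient_ids = {1, 2, 3, 4, 5}

# Out-degree = number of edges that start at a node; only patients are counted.
out_degree = Counter(src for src, _ in edges if src in patient_ids)

print(out_degree.most_common(2))  # [(1, 3), (2, 1)]
```

Keeping the patient IDs in a set makes each membership check constant time, which matters once the graph has thousands of nodes.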