## Disease Network Practice using Python
* Example: building a diagnosis score table

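### Requirement
* As in a diagnosis progression network, build a $D_i \times D_j$ matrix of prior diagnoses (rows) against later diagnoses (columns).
* Each score in the matrix is a count converted to a proportion (ratio).
* The score is defined as follows.
    * $\frac{B}{A}$, where $A$ is the total count of the prior diagnosis $D_i$ and $B$ is the total count of the later diagnosis $D_j$ that follows the prior diagnosis $D_i$.
* **Because each patient has repeated measurements, the count must handle the point where the records move on to the next patient: a diagnosis pair must never span two patients.**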
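As a small worked illustration of the definition above, the following sketch computes $\frac{B}{A}$ on a handful of made-up visits. The `visits` list and all variable names are hypothetical; $A$ is counted over all records of a diagnosis, which is how the notebook counts it later on.

```python
from collections import Counter

# Hypothetical toy visits, already sorted by patient and visit date: (patient_id, diagnosis)
visits = [
    (1, "I10"), (1, "I10"), (1, "I11"),
    (2, "I10"), (2, "I11"),
]

# A: total count of each diagnosis D_i over all records
count_a = Counter(diag for _, diag in visits)

# B: count of consecutive (D_i, D_j) pairs, kept only when both visits belong to the same patient
count_b = Counter(
    (prev_diag, next_diag)
    for (prev_id, prev_diag), (next_id, next_diag) in zip(visits, visits[1:])
    if prev_id == next_id  # skips the pair that would cross from patient 1 to patient 2
)

# score(D_i, D_j) = B / A
score = {pair: b / count_a[pair[0]] for pair, b in count_b.items()}
print(score)
```

Without the `prev_id == next_id` check, the last visit of patient 1 and the first visit of patient 2 would be counted as an `("I11", "I10")` transition that no patient actually had.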

Import pandas and NumPy

In [1]:
import pandas as pd
import pandasgui as gui
import numpy as np

Load the data set (.sas7bdat)

In [2]:
pd_sas = pd.read_sas('nsc2_edu_m20.sas7bdat', format='sas7bdat')
pd_sas

Unnamed: 0,RN_INDI,RN_KEY,RN_INST,MDCARE_STRT_DT,FORM_CD,SICK_SYM1,SICK_SYM2,OPRTN_YN,MDCARE_DD_CNT,VSHSP_DD_CNT,TOT_PRSC_DD_CNT,MCARE_RSLT_TYPE,ED_RC_TOT_AMT,EDC_SBA,EDC_INSUR_BRDN_AMT,SPCF_SYM_TYPE
0,3050415.0,b'20060000000008',203739.0,b'20060113',b'03',b'C20',,b'0',1.0,1.0,0.0,b'1',10410.0,0.0,10410.0,b'V027'
1,6771465.0,b'20060000000011',876467.0,b'20060105',b'03',b'S335',,b'0',1.0,1.0,0.0,b'1',14060.0,1500.0,12560.0,
2,9238266.0,b'20060000000013',396247.0,b'20060118',b'03',b'R104',,b'0',1.0,1.0,0.0,b'1',17780.0,5330.0,12450.0,
3,2518883.0,b'20060000000015',192322.0,b'20060107',b'03',b'I10',b'I70',b'0',6.0,6.0,33.0,b'1',55690.0,9000.0,46690.0,
4,2251892.0,b'20060000000020',503221.0,b'20060102',b'02',b'J189',b'J459',b'0',21.0,12.0,0.0,b'1',2650240.0,602890.0,2047350.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
137862,9817912.0,b'20120000075216',256194.0,b'20121224',b'03',b'J029',b'J303',b'0',1.0,1.0,3.0,b'1',9220.0,2700.0,6520.0,
137863,9799437.0,b'20120000075217',135151.0,b'20121208',b'03',b'J060',b'J304',b'0',1.0,1.0,3.0,b'1',11230.0,3300.0,7930.0,
137864,9804004.0,b'20120000075218',252557.0,b'20121214',b'03',b'J00',b'J209',b'0',1.0,1.0,3.0,b'1',9210.0,2700.0,6510.0,
137865,9799437.0,b'20120000075219',135151.0,b'20121226',b'03',b'J0141',b'J303',b'0',1.0,1.0,3.0,b'1',9210.0,2700.0,6510.0,


Use only the required variables
* Required variables: patient ID, visit date, primary symptom 1
* Sort by patient ID and hospital visit date
* Convert the variable types

In [3]:
df_patients = pd_sas[['RN_INDI', 'MDCARE_STRT_DT', 'SICK_SYM1']]\
                    .sort_values(by=['RN_INDI', 'MDCARE_STRT_DT'])\
                    .astype({'RN_INDI':'int','MDCARE_STRT_DT':'str', 'SICK_SYM1':'str'})

df_patients.info()

<class 'pandas.core.frame.DataFrame'>
Index: 137867 entries, 585 to 137855
Data columns (total 3 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   RN_INDI         137867 non-null  int32 
 1   MDCARE_STRT_DT  137867 non-null  object
 2   SICK_SYM1       137867 non-null  object
dtypes: int32(1), object(2)
memory usage: 3.7+ MB


Print the disease data

In [4]:
df_patients

Unnamed: 0,RN_INDI,MDCARE_STRT_DT,SICK_SYM1
585,1001978,20060102,H650
2201,1001978,20060222,J039
4685,1001978,20060302,K52
3830,1001978,20060303,H660
5834,1001978,20060401,J459
...,...,...,...
127868,9825078,20120908,J209
127866,9825078,20120919,J209
131250,9825078,20121017,J209
134682,9825078,20121121,J209


Filter to diagnoses that contain "I"

In [5]:
df_patients = df_patients[df_patients['SICK_SYM1'].str.contains("I")]

df_patients

Unnamed: 0,RN_INDI,MDCARE_STRT_DT,SICK_SYM1
1837,1070894,20060104,I10
2096,1070894,20060203,I10
4528,1070894,20060303,I10
13356,1070894,20060731,I11
13794,1070894,20060807,I11
...,...,...,...
117477,9761994,20120615,I109
123269,9761994,20120816,I109
128790,9761994,20121015,I109
136943,9761994,20121214,I109
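Note that `str.contains("I")` matches the letter anywhere in the code, not only at the start. The sketch below, on made-up codes, contrasts it with `str.startswith("I")`; the code `"HI1"` is invented purely for illustration. For codes in which a letter appears only in the first position, both filters should select the same rows.

```python
import pandas as pd

# Hypothetical codes; "HI1" is a made-up value with "I" in the middle
codes = pd.Series(["I10", "J459", "I109", "K52", "HI1"])

contains_mask = codes.str.contains("I")    # matches "I" anywhere in the string
prefix_mask = codes.str.startswith("I")    # matches only codes that begin with "I"

print(codes[contains_mask].tolist())
print(codes[prefix_mask].tolist())
```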


Print the count of each diagnosis

In [6]:
df_patients['SICK_SYM1'].value_counts(sort=False)

SICK_SYM1
I10      6627
I11        32
I119      340
I109     2270
I100      330
         ... 
I279        5
I059        4
I058        2
I2149       8
I2510       2
Name: count, Length: 190, dtype: int64

Print the unique diagnoses found across the patients and build the diagnosis array
* 190 in total

In [7]:
diagnosis = df_patients['SICK_SYM1'].unique()
print(diagnosis)
print(diagnosis.shape)

['I10' 'I11' 'I119' 'I109' 'I100' 'I8410' 'I849' 'I839' 'I8418' 'I209'
 'I841' 'I840' 'I110' 'I200' 'I842' 'I739' 'I219' 'I730' 'I499' 'I850'
 'I500' 'I479' 'I638' 'I639' 'I659' 'I693' 'I69318' 'I69319' 'I660' 'I64'
 'I12' 'I509' 'I5000' 'I20' 'I830' 'I051' 'I158' 'I890' 'I691' 'I698'
 'I69819' 'I060' 'I633' 'I6358' 'I6339' 'I8488' 'I211' 'I480' 'I7300'
 'I738' 'I259' 'I491' 'I845' 'I678' 'I48' 'I249' 'I252' 'I846' 'I269'
 'I129' 'I139' 'I959' 'I694' 'I159' 'I472' 'I672' 'I49' 'I498' 'I748'
 'I671' 'I84' 'I781' 'I13' 'I669' 'I889' 'I70920' 'I250' 'I70' 'I61'
 'I709' 'I679' 'I843' 'I517' 'I251' 'I6330' 'I652' 'I2088' 'I8411' 'I675'
 'I091' 'I061' 'I340' 'I802' 'I_' 'I848' 'I518' 'I471' 'I702' 'I63' 'I844'
 'I5001' 'I501' 'I469' 'I270' 'I38' 'I493' 'I70880' 'I212' 'I214' 'I248'
 'I712' 'I208' 'I950' 'I635' 'I8480' 'I6350' 'I631' 'I120' 'I240' 'I69119'
 'I98' 'I988' 'I7029' 'I130' 'I426' 'I831' 'I101' 'I83' 'I629' 'I351'
 'I731' 'I519' 'I258' 'I7000' 'I50' 'I749' 'I829' 'I6319' 'I951' 'I8

Extract only the ID and the diagnosis

In [8]:
df_patients = df_patients[['RN_INDI', 'SICK_SYM1']]

Split the data, then concatenate the two parts side by side

In [9]:
prev = df_patients[:-1].reset_index(drop=True) # previous records (all rows except the last)

nxt = df_patients[1:].reset_index(drop=True) # next records (all rows except the first); named nxt to avoid shadowing the built-in next()

concat = pd.concat([prev[['RN_INDI', 'SICK_SYM1']], 
                    nxt[['RN_INDI', 'SICK_SYM1']]], axis=1)
concat

Unnamed: 0,RN_INDI,SICK_SYM1,RN_INDI.1,SICK_SYM1.1
0,1070894,I10,1070894,I10
1,1070894,I10,1070894,I10
2,1070894,I10,1070894,I11
3,1070894,I11,1070894,I11
4,1070894,I11,1070894,I11
...,...,...,...,...
12721,9761994,I109,9761994,I109
12722,9761994,I109,9761994,I109
12723,9761994,I109,9761994,I109
12724,9761994,I109,9761994,I109


Rename the columns, then drop every row where the prev ID and the next ID differ
* This keeps each pair within a single patient

In [10]:
concat.columns = ['RN_INDI_PREV', 'SICK_SYM1_PREV',
                  'RN_INDI_NEXT', 'SICK_SYM1_NEXT']
concat.drop(concat[(concat['RN_INDI_PREV']) != (concat['RN_INDI_NEXT'])].index, inplace=True)
concat

Unnamed: 0,RN_INDI_PREV,SICK_SYM1_PREV,RN_INDI_NEXT,SICK_SYM1_NEXT
0,1070894,I10,1070894,I10
1,1070894,I10,1070894,I10
2,1070894,I10,1070894,I11
3,1070894,I11,1070894,I11
4,1070894,I11,1070894,I11
...,...,...,...,...
12720,9761994,I109,9761994,I109
12721,9761994,I109,9761994,I109
12722,9761994,I109,9761994,I109
12723,9761994,I109,9761994,I109
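The split, concatenate, and drop steps above can also be expressed with a per-patient shift. The following is a minimal sketch on made-up records, not the notebook's method; `toy` and `toy_pairs` are hypothetical names, and the data is assumed to be sorted by patient ID and visit date already.

```python
import pandas as pd

# Hypothetical toy records, already sorted by patient ID and visit date
toy = pd.DataFrame({
    "RN_INDI": [1, 1, 1, 2, 2],
    "SICK_SYM1": ["I10", "I10", "I11", "I10", "I11"],
})

# Within each patient, shift(-1) pulls the following diagnosis onto the current row
toy_pairs = toy.rename(columns={"SICK_SYM1": "SICK_SYM1_PREV"})
toy_pairs["SICK_SYM1_NEXT"] = toy.groupby("RN_INDI")["SICK_SYM1"].shift(-1)

# The last visit of every patient has no successor, so that row is dropped
toy_pairs = toy_pairs.dropna(subset=["SICK_SYM1_NEXT"]).reset_index(drop=True)
print(toy_pairs)
```

Because the shift happens inside each `RN_INDI` group, no pair can cross a patient boundary, so no separate ID comparison is needed.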


Convert to NumPy

In [11]:
np_patients_concat = concat.to_numpy()
np_patients_concat

array([[1070894, 'I10', 1070894, 'I10'],
       [1070894, 'I10', 1070894, 'I10'],
       [1070894, 'I10', 1070894, 'I11'],
       ...,
       [9761994, 'I109', 9761994, 'I109'],
       [9761994, 'I109', 9761994, 'I109'],
       [9761994, 'I109', 9761994, 'I109']], dtype=object)

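Calculation method
* **Rather than iterating over every row and column of the score matrix, compute only the entries that actually occur as diagnosis pairs and write them into the matrix.**


* Required variables
    * A dictionary of **diagnosis code: diagnosis number** for each diagnosis -> the diagnosis number maps to the row and column index of the score matrix
    * Diagnosis-pair array -> [ [prior diagnosis, later diagnosis], ...]
    * Array of prior-diagnosis counts -> [count of $D_1$, count of $D_2$, ..., count of $D_I$]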
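For orientation, the dictionary and the count array listed above can be sketched on made-up data with `pd.factorize` and `np.bincount`. This is one possible way to build them, not necessarily how the notebook does it; all `toy_*` names are hypothetical.

```python
import numpy as np
import pandas as pd

toy_codes = pd.Series(["I10", "I10", "I11", "I10", "I11", "I109"])  # hypothetical diagnoses

# factorize numbers the codes in order of first appearance
numbers, uniques = pd.factorize(toy_codes)
toy_dict = {code: i for i, code in enumerate(uniques)}

# bincount returns how often each diagnosis number occurs
toy_counts = np.bincount(numbers, minlength=len(uniques))

print(toy_dict)
print(toy_counts)
```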
     

In [12]:
# Number each diagnosis in the diagnosis array (used to index the score matrix)
dig_dict = {diagnosis[i] : i for i in range(len(diagnosis))} # e.g. {'I10' (key): 0 (value), 'I11' (key): 1 (value), ...}
print("dig_dict: \n", dig_dict)

# Build the diagnosis-pair array: one [prior, later] pair of diagnosis numbers per row of the side-by-side patient DataFrame
pairs_number = np.array([[dig_dict[D[1]], dig_dict[D[3]]] for D in np_patients_concat]) # D[1] = prior diagnosis code || D[3] = later diagnosis code
print("pairs: \n", pairs_number)

# Build the array of counts per unique diagnosis from the original patient DataFrame
np_patients = df_patients.to_numpy() # convert to NumPy
diagnosis_number = np.array([dig_dict[diag] for diag in np_patients[:,1]]) # look up the diagnosis number (value) in dig_dict
countsOfD_i = np.unique(diagnosis_number, return_counts=True)[1] # count per unique diagnosis number (used as the count of each prior diagnosis)

print("countsOfD_i: \n", countsOfD_i)

dig_dict: 
 {'I10': 0, 'I11': 1, 'I119': 2, 'I109': 3, 'I100': 4, 'I8410': 5, 'I849': 6, 'I839': 7, 'I8418': 8, 'I209': 9, 'I841': 10, 'I840': 11, 'I110': 12, 'I200': 13, 'I842': 14, 'I739': 15, 'I219': 16, 'I730': 17, 'I499': 18, 'I850': 19, 'I500': 20, 'I479': 21, 'I638': 22, 'I639': 23, 'I659': 24, 'I693': 25, 'I69318': 26, 'I69319': 27, 'I660': 28, 'I64': 29, 'I12': 30, 'I509': 31, 'I5000': 32, 'I20': 33, 'I830': 34, 'I051': 35, 'I158': 36, 'I890': 37, 'I691': 38, 'I698': 39, 'I69819': 40, 'I060': 41, 'I633': 42, 'I6358': 43, 'I6339': 44, 'I8488': 45, 'I211': 46, 'I480': 47, 'I7300': 48, 'I738': 49, 'I259': 50, 'I491': 51, 'I845': 52, 'I678': 53, 'I48': 54, 'I249': 55, 'I252': 56, 'I846': 57, 'I269': 58, 'I129': 59, 'I139': 60, 'I959': 61, 'I694': 62, 'I159': 63, 'I472': 64, 'I672': 65, 'I49': 66, 'I498': 67, 'I748': 68, 'I671': 69, 'I84': 70, 'I781': 71, 'I13': 72, 'I669': 73, 'I889': 74, 'I70920': 75, 'I250': 76, 'I70': 77, 'I61': 78, 'I709': 79, 'I679': 80, 'I843': 81, 'I517': 8

Compute the score matrix

In [13]:
from tqdm.notebook import tqdm

# Define the score function
def computeScore(pairs, matrix, countsOfD_i) -> np.ndarray:
    for D_i, D_j in tqdm(pairs): # each element of pairs (2-D array): [D_i (prior diagnosis number), D_j (later diagnosis number)]
        if matrix[D_i, D_j] != 0: # this diagnosis pair is already computed, so skip to the next iteration
            continue
        countOfD_i = countsOfD_i[D_i]
        countOfD_j = np.sum(np.all(pairs == [D_i, D_j], axis=1))
        matrix[D_i, D_j] = countOfD_j / countOfD_i  # row index i and column index j are the diagnosis numbers
    return matrix

matrix = np.zeros((len(diagnosis), len(diagnosis)), dtype=np.float16) # 190 x 190 zero matrix

scoreMatrix = computeScore(pairs_number, matrix, countsOfD_i)
scoreMatrix = pd.DataFrame(scoreMatrix, index=diagnosis, columns=diagnosis)

scoreMatrix

  0%|          | 0/11951 [00:00<?, ?it/s]

Unnamed: 0,I10,I11,I119,I109,I100,I8410,I849,I839,I8418,I209,...,I7023,I792,I620,I69,I744,I279,I059,I058,I2149,I2510
I10,0.909180,0.000453,0.004074,0.030334,0.005585,0.000000,0.000906,0.001358,0.000000,0.002716,...,0.0,0.0,0.000151,0.0,0.0,0.000000,0.000302,0.000000,0.000000,0.000
I11,0.187500,0.718750,0.031250,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000
I119,0.047058,0.000000,0.829590,0.023529,0.000000,0.000000,0.002941,0.000000,0.000000,0.011765,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000
I109,0.000441,0.000000,0.000441,0.848633,0.004845,0.000881,0.000441,0.000000,0.000881,0.001762,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000881,0.000881,0.000000,0.000
I100,0.000000,0.000000,0.000000,0.045441,0.793945,0.000000,0.003031,0.000000,0.000000,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.003031,0.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
I279,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.799805,0.000000,0.000000,0.000000,0.000
I059,0.250000,0.000000,0.000000,0.750000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000
I058,0.000000,0.000000,0.000000,1.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000
I2149,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.750000,0.125
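The loop above rescans the whole pair array once for every distinct pair. As a hedged alternative, the sketch below accumulates all pair counts in a single pass with `np.add.at` and then divides each row by the count of its prior diagnosis. The toy arrays are made up and the `toy_*` names are hypothetical; it also uses the default `float64` rather than `float16`, so rounding may differ slightly from the table above.

```python
import numpy as np

# Hypothetical toy input: 3 diagnoses, pairs given as [prior number, later number]
toy_pairs = np.array([[0, 0], [0, 1], [0, 1], [1, 2]])
toy_counts_of_d_i = np.array([4, 2, 1])  # total records per diagnosis (A in the formula)

# Accumulate pair counts (B) in one pass; np.add.at handles repeated index pairs correctly
toy_count_matrix = np.zeros((3, 3), dtype=np.int64)
np.add.at(toy_count_matrix, (toy_pairs[:, 0], toy_pairs[:, 1]), 1)

# Divide row i by the count of D_i; the [:, None] reshape broadcasts the division row-wise
toy_score_matrix = toy_count_matrix / toy_counts_of_d_i[:, None]
print(toy_score_matrix)
```

Plain fancy-index assignment such as `m[rows, cols] += 1` would count a repeated pair only once, which is why `np.add.at` is used here.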


Save the score matrix to an Excel file

In [14]:
# scoreMatrix.to_excel('Output/Score_Matrix2.xlsx')

Build the diagnosis-pair information table
* Load the ICD-10 classification table

In [15]:
df_KCD = pd.read_excel('ICD-10.xlsx')
df_KCD

Unnamed: 0,Diagnosis(ICD-10),Diagnosis(full term)
0,A00,Cholera
1,A000,"Cholera due to Vibrio cholerae 01, biovar chol..."
2,A001,"Cholera due to Vibrio cholerae 01, biovar eltor"
3,A009,"Cholera, unspecified"
4,A01,Typhoid and paratyphoid fevers
...,...,...
41003,M9989/6,"Myelodysplastic syndrome, NOS, malignant, meta..."
41004,M9991/3,"Refractory neutropenia, malignant, primary site"
41005,M9991/6,"Refractory neutropenia, malignant, metastatic ..."
41006,M9992/3,"Refractory thrombocytopenia, malignant, primar..."


Compute the count matrix

In [16]:
def computeCount(pairs, matrix) -> np.ndarray:
    for D_i, D_j in tqdm(pairs): # each element of pairs (2-D array): [D_i (prior diagnosis number), D_j (later diagnosis number)]
        if matrix[D_i, D_j] != 0: # this diagnosis pair is already counted, so skip to the next iteration
            continue
        countOfD_j = np.sum(np.all(pairs == [D_i, D_j], axis=1))
        # row index i and column index j are the diagnosis numbers
        matrix[D_i, D_j] = countOfD_j
    return matrix

matrix = np.zeros((len(diagnosis), len(diagnosis)), dtype=np.uint16) # 190 x 190 zero matrix

countMatrix = computeCount(pairs_number, matrix)
countMatrix = pd.DataFrame(countMatrix, index=diagnosis, columns=diagnosis)

countMatrix

  0%|          | 0/11951 [00:00<?, ?it/s]

Unnamed: 0,I10,I11,I119,I109,I100,I8410,I849,I839,I8418,I209,...,I7023,I792,I620,I69,I744,I279,I059,I058,I2149,I2510
I10,6025,3,27,201,37,0,6,9,0,18,...,0,0,1,0,0,0,2,0,0,0
I11,6,23,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
I119,16,0,282,8,0,0,1,0,0,4,...,0,0,0,0,0,0,0,0,0,0
I109,1,0,1,1926,11,2,1,0,2,4,...,0,0,0,0,0,0,2,2,0,0
I100,0,0,0,15,262,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
I279,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,4,0,0,0,0
I059,1,0,0,3,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
I058,0,0,0,2,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
I2149,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,6,1
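A count matrix of this shape can also be obtained directly from the pair table with `pd.crosstab`. The sketch below runs on a made-up pair table that borrows the column names of `concat`; `toy_concat` and `toy_count_matrix` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy pair table using the same column names as `concat`
toy_concat = pd.DataFrame({
    "SICK_SYM1_PREV": ["I10", "I10", "I10", "I11"],
    "SICK_SYM1_NEXT": ["I10", "I11", "I11", "I10"],
})

# crosstab counts how often each (prior, later) combination occurs
toy_count_matrix = pd.crosstab(toy_concat["SICK_SYM1_PREV"], toy_concat["SICK_SYM1_NEXT"])
print(toy_count_matrix)
```

`crosstab` sorts its labels and includes only codes that occur in the pairs, so matching the row and column order of `diagnosis` would likely need an extra `reindex(index=diagnosis, columns=diagnosis, fill_value=0)` step.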


Build the information table
* Prior diagnosis code || Prior diagnosis name || Later diagnosis code || Later diagnosis name || Count || Count ratio

In [17]:
pairs_name = np.array([[D[1], D[3]] for D in np_patients_concat]) # build the pairs of diagnosis codes

# Build the diagnosis-pair DataFrame (a flat list of column names avoids creating MultiIndex columns)
df_pair = pd.DataFrame(pairs_name, columns=['Prior diagnosis(ICD-10 code)', 'Later diagnosis(ICD-10 code)'])
df_pair = df_pair.drop_duplicates() # remove duplicate diagnosis pairs
df_pair = df_pair.reset_index(drop=True)

# Create the full-term columns for the diagnosis codes
# by matching each diagnosis code against the ICD-10 table
df_pair_prev = df_pair[['Prior diagnosis(ICD-10 code)']].copy()
df_pair_next = df_pair[['Later diagnosis(ICD-10 code)']].copy()

df_pair_prev.columns = ['Diagnosis(ICD-10)']
df_pair_next.columns = ['Diagnosis(ICD-10)']

df_pair_prev_merge = pd.merge(df_pair_prev, df_KCD, how='left', on='Diagnosis(ICD-10)')
df_pair_next_merge = pd.merge(df_pair_next, df_KCD, how='left', on='Diagnosis(ICD-10)')

df_pairInfo = pd.concat([df_pair_prev_merge, df_pair_next_merge], axis=1)

df_pairInfo.columns = ['Prior diagnosis(ICD-10)', 'Prior diagnosis(full term)', 
                       'Later diagnosis(ICD-10)', 'Later diagnosis(full term)']

# Fill in the Count and Count ratio values
df_pairInfo['Count'] = 0
df_pairInfo['Count ratio'] = 0.0
for i in range(len(df_pairInfo)):
    df_pairInfo.iloc[i, 4] = countMatrix.iloc[dig_dict[df_pairInfo.iloc[i, 0]], dig_dict[df_pairInfo.iloc[i, 2]]] # matrix[row index, column index]
    df_pairInfo.iloc[i, 5] = scoreMatrix.iloc[dig_dict[df_pairInfo.iloc[i, 0]], dig_dict[df_pairInfo.iloc[i, 2]]]

df_pairInfo

Unnamed: 0,Prior diagnosis(ICD-10),Prior diagnosis(full term),Later diagnosis(ICD-10),Later diagnosis(full term),Count,Count ratio
0,I10,Essential (primary) hypertension,I10,Essential (primary) hypertension,6025,0.909180
1,I10,Essential (primary) hypertension,I11,Hypertensive heart disease,3,0.000453
2,I11,Hypertensive heart disease,I11,Hypertensive heart disease,23,0.718750
3,I11,Hypertensive heart disease,I10,Essential (primary) hypertension,6,0.187500
4,I10,Essential (primary) hypertension,I119,Hypertensive heart disease without (congestive...,27,0.004074
...,...,...,...,...,...,...
679,I100,,I2149,,1,0.003031
680,I2149,,I2510,,1,0.125000
681,I2510,,I2510,,1,0.500000
682,I2510,,I2149,,1,0.500000
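The pair counts and the full-term lookup can also be sketched without a row-by-row loop, using `groupby(...).size()` and `Series.map`. The inputs below are made-up stand-ins for `concat` and `df_KCD`, and every `toy_*` name is hypothetical; the lookup assumes each code appears only once in the classification table.

```python
import pandas as pd

# Hypothetical toy inputs standing in for `concat` and `df_KCD`
toy_concat = pd.DataFrame({
    "SICK_SYM1_PREV": ["I10", "I10", "I10", "I11", "I100"],
    "SICK_SYM1_NEXT": ["I10", "I11", "I11", "I10", "I10"],
})
toy_kcd = pd.DataFrame({
    "Diagnosis(ICD-10)": ["I10", "I11"],
    "Diagnosis(full term)": ["Essential (primary) hypertension", "Hypertensive heart disease"],
})

# One row per unique (prior, later) pair together with its count
toy_info = (toy_concat.groupby(["SICK_SYM1_PREV", "SICK_SYM1_NEXT"])
            .size().reset_index(name="Count"))

# Look up full terms by code; codes missing from the table become NaN
term_by_code = toy_kcd.set_index("Diagnosis(ICD-10)")["Diagnosis(full term)"]
toy_info["Prior diagnosis(full term)"] = toy_info["SICK_SYM1_PREV"].map(term_by_code)
toy_info["Later diagnosis(full term)"] = toy_info["SICK_SYM1_NEXT"].map(term_by_code)
print(toy_info)
```

As in the table above, a code that has no entry in the classification table (here the toy `I100`) ends up with an empty full term rather than raising an error.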


In [18]:
# df_pairInfo.to_excel('Output/diagnosis pairs information.xlsx', index=False)