# Merge Data
This notebook merges the output of SNP_Data_Preprocessing with the output of MRI-Dataset, attaches the labels, and builds the dataset that is fed to the model.

In [1]:
from pandas_plink import read_plink1_bin
import os
import pandas as pd
import numpy as np

Load the NL_MCI, NL_AD, and NL_MCI_AD files and the ROI data

In [2]:
file_path = './SNP_DATA/'

NL_MCI = pd.read_csv(file_path+"NL_MCI.csv",index_col=0)
NL_AD = pd.read_csv(file_path+"NL_AD.csv",index_col=0)
NL_MCI_AD = pd.read_csv(file_path+'preprocessing_snp.csv')

X_roi = pd.read_csv('./MRI_DATA/CDMF_ROI_Final.csv', index_col=0)

Remove the P-Value row from NL_MCI_AD

In [3]:
# The last row of preprocessing_snp.csv holds P-values, not a subject; drop it
NL_MCI_AD = NL_MCI_AD.iloc[:-1]

Load Label

In [4]:
subject_path = file_path+'ADNI_ScreeningList_8_22_12.csv'
subject = pd.read_csv(subject_path)

imsi_label = subject.loc[:,["PTID","Screen.Diagnosis"]]
imsi_label.rename(columns = {"PTID":"Subject","Screen.Diagnosis":"Label"},inplace=True)
# Drop duplicated subjects, keeping the first occurrence
imsi_label = imsi_label.drop_duplicates(['Subject'], keep='first')

One-Hot-Encoding => NL_MCI_AD Label
- NL: [1,0,0]
- MCI: [0,1,0]
- AD: [0,0,1]

In [5]:
def one_hot(x):
    if x == "NL":
        return np.array([1.,0.,0.])
    elif x == "MCI":
        return np.array([0.,1.,0.])
    else:
        # Any value other than "NL" or "MCI" is encoded as AD
        return np.array([0.,0.,1.])

imsi_label["Label"] = imsi_label["Label"].apply(one_hot)
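
Because the `else` branch of `one_hot` encodes every value other than "NL" or "MCI" as AD, a missing or unexpected diagnosis would silently become an AD label. The following is a minimal sketch of a stricter variant; the names `LABEL_TO_ONE_HOT` and `one_hot_strict` are hypothetical and not part of the original notebook.

```python
import numpy as np

# Hypothetical lookup table mirroring the NL / MCI / AD encoding listed above.
LABEL_TO_ONE_HOT = {
    "NL": np.array([1., 0., 0.]),
    "MCI": np.array([0., 1., 0.]),
    "AD": np.array([0., 0., 1.]),
}

def one_hot_strict(label):
    """Return the one-hot vector for a label, raising on unknown values."""
    try:
        # Copy so callers cannot mutate the shared lookup table.
        return LABEL_TO_ONE_HOT[label].copy()
    except KeyError:
        raise ValueError(f"Unexpected diagnosis label: {label!r}") from None
```

It could be used as `imsi_label["Label"].apply(one_hot_strict)` if failing loudly on unexpected labels is preferred.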

Merge SNPs Data and ROI Data

In [6]:
# NL_MCI_AD
X_NL_MCI_AD = pd.merge(X_roi,NL_MCI_AD,on="Subject",how="right")
X_NL_MCI_AD = pd.merge(X_NL_MCI_AD,imsi_label,on='Subject',how='left')

# NL_MCI
X_NL_MCI = pd.merge(X_roi,NL_MCI,on='Subject',how='right')

# NL_AD
X_NL_AD = pd.merge(X_roi,NL_AD,on='Subject',how='right')
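
A right merge keeps every subject of the SNP table, so a subject without ROI data ends up with NaN ROI columns. As a sketch of how such rows could be spotted at merge time, the toy example below uses the `indicator` and `validate` options of `pd.merge`; the frames and the column names `roi_feature` and `rs_example` are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for X_roi and a SNP table; column names are hypothetical.
roi = pd.DataFrame({"Subject": ["s1", "s2"], "roi_feature": [0.5, 0.7]})
snp = pd.DataFrame({"Subject": ["s2", "s3"], "rs_example": [0, 2]})

# indicator=True adds a "_merge" column; validate raises if a key repeats.
merged = pd.merge(roi, snp, on="Subject", how="right",
                  indicator=True, validate="one_to_one")

# Rows tagged "right_only" have SNP data but no matching ROI row.
missing_roi = merged.loc[merged["_merge"] == "right_only", "Subject"].tolist()
print(missing_roi)  # ['s3']
```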

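Delete Specific Subjects

Low-quality subjects were removed in each preprocessing pipeline (Plink or Freesurfer), so the two sources do not contain exactly the same subjects. This step additionally removes specific subjects so that the subject sets are consistent.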

In [7]:
# Remove specific subjects
X_NL_MCI = X_NL_MCI[X_NL_MCI["Subject"]!="018_S_0043"]
X_NL_MCI = X_NL_MCI[X_NL_MCI["Subject"]!="005_S_1224"]

X_NL_AD = X_NL_AD[X_NL_AD["Subject"]!="009_S_1334"]
X_NL_AD = X_NL_AD[X_NL_AD["Subject"]!="100_S_0747"]
X_NL_AD = X_NL_AD[X_NL_AD["Subject"]!="018_S_0043"]
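
The repeated one-subject-per-line filters above can also be written as a single `isin` filter. This is only a sketch: the helper `drop_subjects` is hypothetical, and the ID `000_S_0000` in the toy frame is a made-up placeholder.

```python
import pandas as pd

def drop_subjects(df, subject_ids, key="Subject"):
    """Return a copy of df without the rows whose key is in subject_ids."""
    return df[~df[key].isin(subject_ids)].copy()

# Toy frame: two IDs taken from the cell above plus one made-up placeholder ID.
toy = pd.DataFrame({
    "Subject": ["018_S_0043", "005_S_1224", "000_S_0000"],
    "value": [1, 2, 3],
})

kept = drop_subjects(toy, ["018_S_0043", "005_S_1224"])
print(kept["Subject"].tolist())  # ['000_S_0000']
```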

Data Quality Check

In [8]:
# NL_MCI_AD
# Check for null values
print("Total num of Null: ",X_NL_MCI_AD.isnull().sum().sum())

# Check for duplicated subjects
print("Total num of duplicated: ",X_NL_MCI_AD.duplicated("Subject").sum())

# NL_MCI
# Check for null values
print("Total num of Null: ",X_NL_MCI.isnull().sum().sum())

# Check for duplicated subjects
print("Total num of duplicated: ",X_NL_MCI.duplicated("Subject").sum())

# NL_AD
# Check for null values
print("Total num of Null: ",X_NL_AD.isnull().sum().sum())

# Check for duplicated subjects
print("Total num of duplicated: ",X_NL_AD.duplicated("Subject").sum())

Total num of Null:  0
Total num of duplicated:  0
Total num of Null:  0
Total num of duplicated:  0
Total num of Null:  0
Total num of duplicated:  0
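
Since the same two checks are repeated for each of the three merged tables, they could be wrapped in a small helper. The function `quality_report` below is a hypothetical sketch run on a toy frame, not part of the original notebook.

```python
import pandas as pd

def quality_report(df, key="Subject"):
    """Count null cells and duplicated keys in a DataFrame."""
    return {
        "nulls": int(df.isnull().sum().sum()),
        "duplicates": int(df.duplicated(key).sum()),
    }

# Toy frame with one missing value and one repeated subject.
toy = pd.DataFrame({"Subject": ["a", "a", "b"], "x": [1.0, None, 3.0]})
report = quality_report(toy)
print(report)  # {'nulls': 1, 'duplicates': 1}
```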


The SNP data used for NL_MCI and NL_AD is one-hot encoded.

For NL_MCI_AD, the label is already one-hot encoded, and one-hot encoding the SNP data as well leads to severe overfitting. The NL_MCI_AD SNP data is therefore not one-hot encoded.

In [9]:
def one_hot_snp(x):
    a = list()
    for i in x:
        if i == 0:
            a.append(np.array([1.,0.,0.]))
        elif i == 1:
            a.append(np.array([0.,1.,0.]))
        else:
            a.append(np.array([0.,0.,1.]))
    return a
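
As an alternative sketch, the same encoding can be produced for a whole genotype matrix at once by indexing an identity matrix. The function name `one_hot_snp_matrix` is hypothetical; like the `else` branch above, it sends any value other than 0 or 1 to the third class.

```python
import numpy as np

def one_hot_snp_matrix(codes):
    """Map an (n_subjects, n_snps) array of SNP codes to shape (n_subjects, n_snps, 3)."""
    codes = np.asarray(codes)
    # Mirror one_hot_snp: 0 -> class 0, 1 -> class 1, anything else -> class 2.
    idx = np.where(codes == 0, 0, np.where(codes == 1, 1, 2))
    # Row k of the 3x3 identity matrix is the one-hot vector for class k.
    return np.eye(3)[idx]

encoded = one_hot_snp_matrix([[0, 1], [2, 1]])
print(encoded.shape)  # (2, 2, 3)
```

This produces the dense 3-D array directly, which would make the element-by-element copy loop in the later cells unnecessary.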

NL_MCI Final Preprocessing

In [10]:
file_path = './data/NL_MCI_Data/'

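# Shift the stored labels down by 1 (presumably from 1/2 coding to 0/1)
Y_Label = X_NL_MCI.pop("Label")-1

# One-hot encode every SNP column; each cell becomes a length-3 vector
X_snp = X_NL_MCI.loc[:,"rs10864271":"rs429358"]
X_snp = X_snp.apply(one_hot_snp)

# Min-max normalize each ROI feature to [0, 1]
X_roi = X_NL_MCI.loc[:,"rh_bankssts_area":"CerebralWhiteMatterVol_y"]
X_roi = (X_roi-X_roi.min())/(X_roi.max()-X_roi.min())

# Copy the per-cell vectors into a dense (subjects, SNPs, 3) array
X_snp_ = np.zeros((X_snp.shape[0],X_snp.shape[1],3))

for i in range(len(X_snp)):
    for j in range(len(X_snp.iloc[i])):
        X_snp_[i,j]=X_snp.iloc[i,j]

X_snp = X_snp_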

X_roi.to_csv(file_path+'X_roi.csv')
np.save(file_path+'X_snp.npy',X_snp)
Y_Label.to_csv(file_path+'Label.csv')

# Data Check
print("X_roi's Number of Nan: ",X_roi.isna().sum().sum())
print("X_snp's Number of Nan: ",np.isnan(X_snp).sum())
print("Y_Label's Number of Nan: ",Y_Label.isna().sum())

# Data Shape
print("X_roi's shape: ",X_roi.shape)
print("X_snp's shape: ",X_snp.shape)
print("Y_Label's shape: ",Y_Label.shape)

X_roi's Number of Nan:  0
X_snp's Number of Nan:  0
Y_Label's Number of Nan:  0
X_roi's shape:  (578, 333)
X_snp's shape:  (578, 200, 3)
Y_Label's shape:  (578,)
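
The ROI features above are min-max normalized column by column. That formula divides by zero for a column whose values are all identical, which would produce NaN; the NaN counts printed above are 0, so this did not happen here. The sketch below shows one way to guard against it; the helper `min_max_scale` is hypothetical.

```python
import pandas as pd

def min_max_scale(df):
    """Scale each column to [0, 1]; constant columns become all zeros."""
    col_min = df.min()
    col_range = df.max() - col_min
    # A constant column has range 0; use 1 instead so it maps to 0, not NaN.
    col_range = col_range.replace(0, 1)
    return (df - col_min) / col_range

toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [5.0, 5.0, 5.0]})
scaled = min_max_scale(toy)
print(scaled["a"].tolist())  # [0.0, 0.5, 1.0]
print(scaled["b"].tolist())  # [0.0, 0.0, 0.0]
```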


NL_AD Final Preprocessing

In [11]:
file_path = './data/NL_AD_Data/'

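# Shift the stored labels down by 1 (presumably from 1/2 coding to 0/1)
Y_Label = X_NL_AD.pop("Label")-1

# One-hot encode every SNP column; each cell becomes a length-3 vector
X_snp = X_NL_AD.loc[:,"rs7544111":"rs429358"]
X_snp = X_snp.apply(one_hot_snp)

# Min-max normalize each ROI feature to [0, 1]
X_roi = X_NL_AD.loc[:,"rh_bankssts_area":"CerebralWhiteMatterVol_y"]
X_roi = (X_roi-X_roi.min())/(X_roi.max()-X_roi.min())

# Copy the per-cell vectors into a dense (subjects, SNPs, 3) array
X_snp_ = np.zeros((X_snp.shape[0],X_snp.shape[1],3))

for i in range(len(X_snp)):
    for j in range(len(X_snp.iloc[i])):
        X_snp_[i,j]=X_snp.iloc[i,j]

X_snp = X_snp_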

X_roi.to_csv(file_path+'X_roi.csv')
np.save(file_path+'X_snp.npy',X_snp)
Y_Label.to_csv(file_path+'Label.csv')

# Data Check
print("X_roi's Number of Nan: ",X_roi.isna().sum().sum())
print("X_snp's Number of Nan: ",np.isnan(X_snp).sum())
print("Y_Label's Number of Nan: ",Y_Label.isna().sum())

# Data Shape
print("X_roi's shape: ",X_roi.shape)
print("X_snp's shape: ",X_snp.shape)
print("Y_Label's shape: ",Y_Label.shape)

X_roi's Number of Nan:  0
X_snp's Number of Nan:  0
Y_Label's Number of Nan:  0
X_roi's shape:  (385, 333)
X_snp's shape:  (385, 201, 3)
Y_Label's shape:  (385,)


NL_MCI_AD Final Preprocessing

In [12]:
file_path = './data/NL_MCI_AD_Data/'

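Subject = X_NL_MCI_AD.pop("Subject")
Y_Label = X_NL_MCI_AD.pop("Label")
X_roi = X_NL_MCI_AD.loc[:,"rh_bankssts_area":"CerebralWhiteMatterVol_y"]
# Min-max scale each ROI feature, then multiply by 2 so values lie in [0, 2]
X_roi = 2*((X_roi-X_roi.min())/(X_roi.max()-X_roi.min()))
# SNP values are kept as-is (no one-hot encoding)
X_snp = X_NL_MCI_AD.loc[:,"rs2075650":"rs10241890"]

# Stack the per-subject one-hot label vectors into an (n_subjects, 3) array
Y = np.zeros((len(Y_Label),3))
for i,y in enumerate(Y_Label):
    Y[i] = y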
    
X_roi.to_csv(file_path+'X_roi.csv')
X_snp.to_csv(file_path+'X_snp.csv')
np.save(file_path+'Label.npy', Y)

# Data Check
print("X_roi's Number of Nan: ",X_roi.isna().sum().sum())
print("X_snp's Number of Nan: ",X_snp.isna().sum().sum())
print("Y_Label's Number of Nan: ",np.isnan(Y).sum())

# Data Shape
print("X_roi's shape: ",X_roi.shape)
print("X_snp's shape: ",X_snp.shape)
print("Y_Label's shape: ",Y.shape)

X_roi's Number of Nan:  0
X_snp's Number of Nan:  0
Y_Label's Number of Nan:  0
X_roi's shape:  (750, 333)
X_snp's shape:  (750, 200)
Y_Label's shape:  (750, 3)
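
The loop that fills `Y` row by row turns a column of per-subject label vectors into a 2-D array. A minimal sketch of the same conversion with `np.stack`, on a toy Series standing in for `Y_Label`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for Y_Label: each element is a one-hot vector.
labels = pd.Series([np.array([1., 0., 0.]), np.array([0., 0., 1.])])

# Stack the vectors along a new first axis to get shape (n_subjects, 3).
Y = np.stack(labels.tolist())
print(Y.shape)  # (2, 3)
```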
