# 4_2_PCA.ipynb
Project the data onto a lower-dimensional space with PCA.
- Keep principal components until the cumulative contribution ratio (explained variance ratio) reaches 0.99.
- This reduces the number of features from 19322 to 381.
### input
- 4_Feature_extraction_PCA/output/Feature_extraction.npz : Feature matrix after feature selection. Features are selected from 'Calc_Edit_Distance.csv' such that (1 - Levenshtein ratio) between paths from different KEGG pathways is greater than 0.3 for at least one of them
- 9_Integration_SE_TI_Target_datafile/Y_binary_TI.npz : A file linking each path to its TI labels
### output
- 5_X_train_test_datafile/train/X_train_PCA_TI.npz : Training data for explanatory variables 
- 5_X_train_test_datafile/train/Y_train_TI.npz : Training data for response variables 
- 5_X_train_test_datafile/test/X_test_PCA_TI.npz : Test data for explanatory variables 
- 5_X_train_test_datafile/test/Y_test_TI.npz : Test data for response variables 
- 15_PCA_model/X_PCA_model.pkl : PCA model
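
The 0.99 cut-off above can be illustrated with a minimal NumPy-only sketch. It is not the notebook's code: the helper name `components_for_ratio` and the toy data are assumptions made for illustration. It computes the contribution ratio of each component from the singular values of the centered data and returns the smallest number of components whose cumulative ratio exceeds the threshold.

```python
import numpy as np

def components_for_ratio(X, threshold=0.99):
    """Return (k, cumulative) where k is the smallest number of principal
    components whose cumulative contribution ratio exceeds `threshold`."""
    X_centered = X - X.mean(axis=0)               # PCA operates on centered data
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variances = singular_values ** 2              # proportional to each component's variance
    ratio = variances / variances.sum()           # contribution ratio per component
    cumulative = np.cumsum(ratio)                 # cumulative contribution ratio
    k = int(np.searchsorted(cumulative, threshold, side="right")) + 1
    return k, cumulative

# Toy data (assumption): 200 samples, 10 features, 3 dominant directions plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

k, cumulative = components_for_ratio(X_demo)
print(k, cumulative[:k])
```

On the real data the same rule is what turns 19322 features into 381 components.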

In [1]:
from scipy.sparse import csr_matrix
from scipy.sparse import save_npz, load_npz
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
import pickle

In [2]:
X = load_npz('output/Feature_extraction.npz').toarray()

In [3]:
print(f'Size of the matrix after feature extraction: {X.shape}')

Size of the matrix after feature extraction: (59885, 19322)


In [4]:
y_ti = load_npz('../9_Integration_SE_TI_Target_datafile/Y_binary_TI.npz').toarray()

In [5]:
X_train_t, X_test_t, y_train_t, y_test_t = train_test_split(X, y_ti, test_size=0.1, random_state = 0)

In [6]:
# Run principal component analysis (fit on the training split only)
pca = PCA(n_components = 0.99, random_state = 0)
pca.fit(X_train_t)

PCA(n_components=0.99, random_state=0)
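
After fitting with a fractional `n_components`, the number of components actually kept and the variance they explain can be read from the fitted estimator. The sketch below uses small synthetic data, not the notebook's matrix; the names `X_toy` and `pca_toy` are hypothetical, and `svd_solver="full"` is set explicitly here as an assumption to make the solver choice unambiguous.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumption): 300 samples, 40 features, 5 dominant directions plus noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 40))
X_toy += 0.05 * rng.normal(size=X_toy.shape)

# A float in (0, 1) asks PCA to keep enough components to explain that share of variance.
pca_toy = PCA(n_components=0.99, svd_solver="full", random_state=0)
pca_toy.fit(X_toy)

n_kept = pca_toy.n_components_                        # number of components retained
explained = pca_toy.explained_variance_ratio_.sum()   # cumulative contribution ratio
print(f"components kept: {n_kept}")
print(f"cumulative contribution ratio: {explained:.4f}")
```

Running the same two attribute lookups on the notebook's `pca` object would be one way to confirm the 381 components reported in the header.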

In [7]:
save_npz('../5_X_train_test_datafile/train/X_train_PCA_TI.npz', csr_matrix(pca.transform(X_train_t)))
save_npz('../5_X_train_test_datafile/train/Y_train_TI.npz', csr_matrix(y_train_t))

In [8]:
save_npz('../5_X_train_test_datafile/test/X_test_PCA_TI.npz', csr_matrix(pca.transform(X_test_t)))
save_npz('../5_X_train_test_datafile/test/Y_test_TI.npz', csr_matrix(y_test_t))

In [9]:
# Save the fitted PCA model
with open('../6_PCA_model/X_PCA_model.pkl', 'wb') as f:
    pickle.dump(pca, f)
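
A saved model is only useful if it can be reloaded and applied to new data. The following is a minimal round-trip sketch under stated assumptions: it writes to a temporary directory rather than the notebook's path, and `X_fit`, `X_new`, `pca_saved`, and `pca_loaded` are hypothetical names with random data.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.decomposition import PCA

# Random stand-in data (assumption): 8 features, reduced to 3 components.
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(100, 8))
X_new = rng.normal(size=(5, 8))

pca_saved = PCA(n_components=3, random_state=0).fit(X_fit)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "X_PCA_model.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(pca_saved, f)          # same call as in the cell above
    with open(model_path, "rb") as f:
        pca_loaded = pickle.load(f)        # restore the fitted estimator

# The reloaded model should project new rows exactly like the original.
Z_original = pca_saved.transform(X_new)
Z_reloaded = pca_loaded.transform(X_new)
roundtrip_ok = np.allclose(Z_original, Z_reloaded)
print(roundtrip_ok)
```

Pickled estimators are generally best loaded with the same scikit-learn version that created them, and only from trusted sources.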

In [10]:
print(f'Size of the matrix of train data after PCA: {csr_matrix(pca.transform(X_train_t)).shape}')
print(f'Size of the matrix of test data after PCA: {csr_matrix(pca.transform(X_test_t)).shape}')

Size of the matrix of train data after PCA: (53896, 381)
Size of the matrix of test data after PCA: (5989, 381)


In [None]:
y_se = load_npz('../9_Integration_SE_TI_Target_datafile/Y_binary_SE.npz').toarray()

In [None]:
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X, y_se, test_size=0.1, random_state = 0)
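
The PCA model was fitted on `X_train_t`, yet it is applied to `X_train_s` and `X_test_s` below. That stays consistent only if both splits select the same rows, which is what reusing `random_state = 0` on the same `X` is expected to give. A small sketch with hypothetical toy arrays illustrates this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumption): one feature matrix, two different target arrays of equal length.
X_toy = np.arange(40).reshape(20, 2)
y_a = np.arange(20)
y_b = np.arange(20) * 10

# Same X, same test_size, same random_state: the shuffled row order is identical.
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_toy, y_a, test_size=0.1, random_state=0)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_toy, y_b, test_size=0.1, random_state=0)

same_rows = np.array_equal(Xa_train, Xb_train) and np.array_equal(Xa_test, Xb_test)
aligned = np.array_equal(ya_train * 10, yb_train)   # targets follow the same row order
print(same_rows, aligned)
```

If the split parameters for TI and SE ever diverge, the PCA would need to be refitted on the SE training rows to avoid leaking test rows into the fit.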

In [None]:
save_npz('../5_X_train_test_datafile/train/X_train_PCA_SE.npz', csr_matrix(pca.transform(X_train_s)))
save_npz('../5_X_train_test_datafile/train/Y_train_SE.npz', csr_matrix(y_train_s))

In [None]:
save_npz('../5_X_train_test_datafile/test/X_test_PCA_SE.npz', csr_matrix(pca.transform(X_test_s)))
save_npz('../5_X_train_test_datafile/test/Y_test_SE.npz', csr_matrix(y_test_s))
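
Downstream notebooks need to read these `.npz` files back. The sketch below shows the round trip through `save_npz` and `load_npz` on a small random matrix written to a temporary directory; the file name `X_demo_PCA.npz` and the array `Z_demo` are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix, load_npz, save_npz

# Stand-in for PCA scores (assumption): a small dense matrix of floats.
rng = np.random.default_rng(0)
Z_demo = rng.normal(size=(6, 4))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "X_demo_PCA.npz")
    save_npz(path, csr_matrix(Z_demo))     # same storage pattern as the cells above
    Z_back = load_npz(path).toarray()      # back to a dense NumPy array

roundtrip_equal = np.array_equal(Z_demo, Z_back)
print(roundtrip_equal, Z_back.shape)
```

PCA scores are typically dense, so the sparse container here mainly keeps the file format uniform with the sparse label matrices rather than saving space.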