# 4_0_train_test_count_TI.ipynb
Create training and test data for the response variable, and flag the TIs that appear 6 or more times in both the training data and the test data.

### input
- 9_Integration_SE_TI_Target_datafile/Y_binary_TI.npz : A binary matrix file linking each Path ID to its TIs.

### output
- 5_X_train_test_datafile/Y/Y_train_TI.npz : Training data for response variable in TI.
- 5_X_train_test_datafile/Y/Y_test_TI.npz : Test data for response variable in TI.
- 4_Feature_extraction/output/Train_Test_count_TI.csv : A file containing each TI ID, its counts in the training and test data, and a use_ID flag marking whether the TI meets the count threshold.
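
The selection rule described above ("6 or more" in both splits) can be sketched on toy data. This is a minimal illustration, not the notebook's code: the matrices and the name `MIN_COUNT` are assumptions made up for the example.

```python
import numpy as np

# Toy binary label matrices (rows = samples, columns = labels).
# The values are illustrative only and are not taken from the real data.
Y_train_toy = np.array([[1, 0, 1]] * 7 + [[0, 1, 0]] * 3)
Y_test_toy = np.array([[1, 0, 0]] * 6 + [[0, 0, 1]] * 2)

MIN_COUNT = 6  # hypothetical name for the "6 or more" threshold

train_counts = Y_train_toy.sum(axis=0)  # positives per label in the training split
test_counts = Y_test_toy.sum(axis=0)    # positives per label in the test split

# A label is kept only if it reaches the threshold in BOTH splits.
keep = (train_counts >= MIN_COUNT) & (test_counts >= MIN_COUNT)
print(keep)
```

Only the first toy label passes: the second is too rare in both splits, and the third is frequent in training but too rare in the test split.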

In [1]:
import pandas as pd
from scipy.sparse import csr_matrix
from scipy.sparse import save_npz, load_npz
from sklearn.model_selection import train_test_split

In [2]:
y = pd.DataFrame(load_npz('../9_Integration_SE_TI_Target_datafile/Y_binary_TI.npz').toarray())

In [3]:
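# Split the row positions of y (67481 rows) 90/10; random_state fixes the shuffle so the split is reproducible.
train_id, test_id = train_test_split(pd.DataFrame(range(len(y))), test_size=0.1, random_state=0)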
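
Because the split is defined only by row positions and `random_state`, it can be checked for reproducibility and for overlap between the two parts. The sketch below uses a small illustrative row count (`n_rows = 100` is an assumption, not the notebook's size).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

n_rows = 100  # illustrative size only
ids = pd.DataFrame(range(n_rows))

# Two calls with the same random_state should yield the same partition.
train_id, test_id = train_test_split(ids, test_size=0.1, random_state=0)
train_again, test_again = train_test_split(ids, test_size=0.1, random_state=0)

same_split = test_id[0].tolist() == test_again[0].tolist()
overlap = set(train_id[0]) & set(test_id[0])   # should be empty
covers_all = len(train_id) + len(test_id) == n_rows

print(same_split, len(overlap), covers_all)
```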

In [4]:
# Select the rows of y belonging to each split, ordered by original row position.
Y_train = y.loc[sorted(train_id[0])].rename_axis('index')
Y_test = y.loc[sorted(test_id[0])].rename_axis('index')

In [5]:
save_npz('../5_X_train_test_datafile/Y/Y_train_TI.npz', csr_matrix(Y_train))
save_npz('../5_X_train_test_datafile/Y/Y_test_TI.npz', csr_matrix(Y_test))
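
One point worth keeping in mind: converting a DataFrame to `csr_matrix` stores only the values, so the row index (the original row positions) is not written to the `.npz` file. The values themselves survive a save/load round trip, as this small sketch with a temporary file suggests (the file name `Y_example.npz` is hypothetical).

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix, load_npz, save_npz

# Small illustrative binary matrix, not the real response data.
dense = np.array([[0, 1, 0], [1, 0, 0]], dtype=np.float64)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Y_example.npz")  # hypothetical file name
    save_npz(path, csr_matrix(dense))          # write the sparse matrix
    restored = load_npz(path).toarray()        # read it back as a dense array

round_trip_ok = np.array_equal(dense, restored)
print(round_trip_ok)
```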

In [6]:
# Convert the DataFrames to dense NumPy arrays for column-wise counting.
Y_train = Y_train.to_numpy()
Y_test = Y_test.to_numpy()

In [7]:
# Count the positives per TI (column) in the training and test matrices.
df_a = pd.DataFrame({
    'ID': range(Y_train.shape[1]),
    'train': Y_train.sum(axis=0),
    'test': Y_test.sum(axis=0),
})

In [8]:
df_a

Unnamed: 0,ID,train,test
0,0,109.0,11.0
1,1,18.0,0.0
2,2,18.0,0.0
3,3,42.0,5.0
4,4,167.0,18.0
...,...,...,...
1679,1679,59.0,8.0
1680,1680,10.0,1.0
1681,1681,41.0,5.0
1682,1682,47.0,4.0


In [9]:
# TIs with at least 6 positives in both splits; .copy() avoids SettingWithCopyWarning.
df_ok = df_a[(df_a['test'] > 5) & (df_a['train'] > 5)].copy()
df_ok['use_ID'] = 1


In [10]:
df_ok

Unnamed: 0,ID,train,test,use_ID
0,0,109.0,11.0,1
4,4,167.0,18.0,1
5,5,103.0,11.0,1
6,6,86.0,9.0,1
7,7,94.0,10.0,1
...,...,...,...,...
1672,1672,375.0,35.0,1
1675,1675,307.0,27.0,1
1676,1676,626.0,60.0,1
1678,1678,358.0,43.0,1


In [11]:
# All remaining TIs (fewer than 6 positives in either split); .copy() avoids SettingWithCopyWarning.
df_out = df_a[~((df_a['test'] > 5) & (df_a['train'] > 5))].copy()
df_out['use_ID'] = 0


In [12]:
df_out

Unnamed: 0,ID,train,test,use_ID
1,1,18.0,0.0,0
2,2,18.0,0.0,0
3,3,42.0,5.0,0
13,13,3.0,0.0,0
14,14,8.0,1.0,0
...,...,...,...,...
1677,1677,32.0,3.0,0
1680,1680,10.0,1.0,0
1681,1681,41.0,5.0,0
1682,1682,47.0,4.0,0


In [13]:
df = pd.concat([df_ok, df_out]).sort_index()
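
Splitting into `df_ok` and `df_out` and concatenating them again should give the same table as flagging the rows in a single pass. A minimal sketch of that alternative, using four rows copied from the displayed counts (the names `df_counts` and `df_flagged` are assumptions for illustration):

```python
import pandas as pd

# Rows for IDs 0, 1, 3 and 4, copied from the count table shown earlier.
df_counts = pd.DataFrame({
    'ID': [0, 1, 3, 4],
    'train': [109.0, 18.0, 42.0, 167.0],
    'test': [11.0, 0.0, 5.0, 18.0],
})

# True where the TI has more than 5 positives in both splits.
mask = (df_counts['test'] > 5) & (df_counts['train'] > 5)

# Add the flag as 1/0 without splitting and re-concatenating the frame.
df_flagged = df_counts.assign(use_ID=mask.astype(int))
print(df_flagged)
```

ID 3 has exactly 5 positives in the test split, so it falls just below the threshold and is flagged 0.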

In [14]:
df

Unnamed: 0,ID,train,test,use_ID
0,0,109.0,11.0,1
1,1,18.0,0.0,0
2,2,18.0,0.0,0
3,3,42.0,5.0,0
4,4,167.0,18.0,1
...,...,...,...,...
1679,1679,59.0,8.0,1
1680,1680,10.0,1.0,0
1681,1681,41.0,5.0,0
1682,1682,47.0,4.0,0


In [15]:
df.to_csv('output/Train_Test_count_TI.csv',encoding = 'utf-8')
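
A later step could use the saved flags to restrict a response matrix to the usable TIs. The sketch below is an assumed downstream use, not part of this notebook: it parses a few rows in the same CSV layout from a string and selects the matching columns of a toy matrix (`Y_toy` is hypothetical).

```python
import io

import numpy as np
import pandas as pd

# First rows of the table in the layout written by to_csv (index in the first, unnamed column).
csv_text = """,ID,train,test,use_ID
0,0,109.0,11.0,1
1,1,18.0,0.0,0
2,2,18.0,0.0,0
3,3,42.0,5.0,0
4,4,167.0,18.0,1
"""

counts = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Column positions of the TIs flagged as usable.
use_ids = counts.loc[counts['use_ID'] == 1, 'ID'].to_numpy()

# Toy matrix with one column per TI; keep only the flagged columns.
Y_toy = np.arange(10).reshape(2, 5)
Y_selected = Y_toy[:, use_ids]
print(use_ids, Y_selected.shape)
```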