# 10_1_train_test_count_TI.ipynb
Identify the TIs that appear 6 or more times in both the training data and the test data.

### input
- 5_X_train_test_datafile/train/X_train_PCA_TI.npz : Training data for explanatory variables in TI
- 5_X_train_test_datafile/train/Y_train_TI.npz : Training data for response variables in TI
- 5_X_train_test_datafile/test/X_test_PCA_TI.npz : Test data for explanatory variables in TI
- 5_X_train_test_datafile/test/Y_test_TI.npz : Test data for response variables in TI

### output
- 10_build_LGBM_code/output/Train_Test_count_TI.csv : A file listing, for each TI ID, its train and test counts and a `use_ID` flag (1 = a model will be built for this TI, 0 = skipped)

In [1]:
import pandas as pd
from scipy.sparse import load_npz

In [2]:
X_train = load_npz('../5_X_train_test_datafile/train/X_train_PCA_TI.npz')
Y_train = load_npz('../5_X_train_test_datafile/train/Y_train_TI.npz')
X_test = load_npz('../5_X_train_test_datafile/test/X_test_PCA_TI.npz')
Y_test = load_npz('../5_X_train_test_datafile/test/Y_test_TI.npz')

In [3]:
# Collect one [ID, train count, test count] row per TI column,
# then build the DataFrame once instead of concatenating inside the loop.
rows = []

for i in range(Y_train.shape[1]):
    y_tr = Y_train[:, i]
    y_te = Y_test[:, i]

    rows.append([i, y_tr.sum(), y_te.sum()])
df_a = pd.DataFrame(rows, columns = ['ID', 'train', 'test'])
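
The per-column loop above can also be written as a single column-wise sum. The following is a minimal sketch, assuming the label matrices are SciPy sparse matrices with one column per TI. The `count_positives` helper and the small `Y_train_demo` / `Y_test_demo` matrices are hypothetical stand-ins, not the project's code or data.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix


def count_positives(Y_train, Y_test):
    """Return one row per label column with its train and test sums."""
    # sum(axis=0) yields one total per column; flatten it to a 1-D array.
    train_counts = np.asarray(Y_train.sum(axis=0)).ravel()
    test_counts = np.asarray(Y_test.sum(axis=0)).ravel()
    return pd.DataFrame({
        'ID': np.arange(Y_train.shape[1]),
        'train': train_counts,
        'test': test_counts,
    })


# Hypothetical toy data: 4 training samples and 2 test samples, 3 labels each.
Y_train_demo = csr_matrix(np.array([[1, 0, 1],
                                    [1, 0, 0],
                                    [1, 0, 1],
                                    [0, 0, 1]], dtype=float))
Y_test_demo = csr_matrix(np.array([[1, 0, 0],
                                   [1, 0, 1]], dtype=float))

df_counts = count_positives(Y_train_demo, Y_test_demo)
print(df_counts)
```

For a matrix with about two thousand label columns, this avoids slicing each column separately, which may be noticeably faster on row-oriented sparse formats.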

In [4]:
df_a

Unnamed: 0,ID,train,test
0,0,109.0,11.0
1,1,16.0,2.0
2,2,16.0,2.0
3,3,0.0,0.0
4,4,40.0,2.0
...,...,...,...
2017,2017,10.0,1.0
2018,2018,39.0,2.0
2019,2019,46.0,5.0
2020,2020,7.0,0.0


In [5]:
# Copy the filtered rows so that adding a column does not raise SettingWithCopyWarning.
df_ok = df_a[(df_a['test'] > 5) & (df_a['train'] > 5)].copy()
df_ok['use_ID'] = 1


In [6]:
df_ok

Unnamed: 0,ID,train,test,use_ID
0,0,109.0,11.0,1
5,5,167.0,18.0,1
6,6,106.0,8.0,1
7,7,89.0,6.0,1
8,8,94.0,10.0,1
...,...,...,...,...
2010,2010,299.0,35.0,1
2011,2011,611.0,72.0,1
2012,2012,29.0,6.0,1
2013,2013,354.0,42.0,1


In [7]:
# Copy the complementary rows for the same reason before flagging them as unused.
df_out = df_a[~((df_a['test'] > 5) & (df_a['train'] > 5))].copy()
df_out['use_ID'] = 0


In [8]:
df_out

Unnamed: 0,ID,train,test,use_ID
1,1,16.0,2.0,0
2,2,16.0,2.0,0
3,3,0.0,0.0,0
4,4,40.0,2.0,0
9,9,0.0,0.0,0
...,...,...,...,...
2017,2017,10.0,1.0,0
2018,2018,39.0,2.0,0
2019,2019,46.0,5.0,0
2020,2020,7.0,0.0,0


In [9]:
df = pd.concat([df_ok, df_out]).sort_index()
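
Splitting the table into `df_ok` and `df_out` and concatenating them again is equivalent to adding one flag column derived from a boolean mask. The following is a sketch of that shortcut, assuming a count table with `train` and `test` columns like `df_a`. The `flag_usable` helper, its `min_count` parameter, and the toy rows are hypothetical.

```python
import pandas as pd


def flag_usable(counts, min_count=6):
    """Add use_ID = 1 where both train and test counts reach min_count."""
    flagged = counts.copy()  # work on a copy so the input table is left untouched
    enough = (flagged['train'] >= min_count) & (flagged['test'] >= min_count)
    flagged['use_ID'] = enough.astype(int)
    return flagged


# Hypothetical rows with counts of the kind shown in the outputs above.
demo = pd.DataFrame({
    'ID': [0, 1, 2, 3],
    'train': [109.0, 16.0, 89.0, 46.0],
    'test': [11.0, 2.0, 6.0, 5.0],
})

df_flagged = flag_usable(demo)
print(df_flagged)
```

Because the counts are whole numbers, `>= 6` selects the same rows as the `> 5` comparison used in the cells above, and the row order is preserved without a call to `sort_index()`.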

In [10]:
df

Unnamed: 0,ID,train,test,use_ID
0,0,109.0,11.0,1
1,1,16.0,2.0,0
2,2,16.0,2.0,0
3,3,0.0,0.0,0
4,4,40.0,2.0,0
...,...,...,...,...
2017,2017,10.0,1.0,0
2018,2018,39.0,2.0,0
2019,2019,46.0,5.0,0
2020,2020,7.0,0.0,0


In [11]:
df.to_csv('output/Train_Test_count_TI.csv',encoding = 'utf-8')
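
Because `to_csv` also writes the DataFrame index as an unnamed first column, a later step that reads this file would probably want `index_col=0`. The following is a sketch of such a round trip, assuming the consumer only needs the IDs flagged with `use_ID == 1`. It uses an in-memory buffer and hypothetical rows in place of the real file.

```python
import io

import pandas as pd

# Hypothetical rows with the same columns as the saved table.
demo = pd.DataFrame({
    'ID': [0, 1],
    'train': [109.0, 16.0],
    'test': [11.0, 2.0],
    'use_ID': [1, 0],
})

# Write to an in-memory buffer instead of 'output/Train_Test_count_TI.csv'.
buffer = io.StringIO()
demo.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the saved index instead of creating an 'Unnamed: 0' column.
restored = pd.read_csv(buffer, index_col=0)
selected_ids = restored.loc[restored['use_ID'] == 1, 'ID'].tolist()
print(selected_ids)
```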