# 10_1_build_LGBM_SE.ipynb
Train LightGBM (LGBM) to learn the relationship between the path similarity (Levenshtein ratio) pattern and the SE pattern.

- Use LGBM.
- Perform downsampling + bagging.
- After downsampling, cap the number of rows in the matrix at 1000.
- Split the downsampled data into training data and validation data.
- Train while tuning the parameters with optuna.integration.lightgbm.
- Repeat bagging 10 times.
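
The downsampling and row-cap steps above can be sketched with the standard library alone. This is only an illustration under assumptions: `undersample_indices` is a hypothetical helper, label 1 is assumed to be the minority class, and majority rows are drawn without replacement, whereas the notebook uses `RandomUnderSampler` with `replacement=True`.

```python
import random


def undersample_indices(labels, ratio=0.5, cap=1000, seed=0):
    """Return row indices after random undersampling and a row cap.

    Hypothetical helper: assumes binary labels with 1 as the minority class.
    `ratio` is the target minority/majority ratio after resampling.
    """
    rng = random.Random(seed)
    minority = [j for j, y in enumerate(labels) if y == 1]
    majority = [j for j, y in enumerate(labels) if y == 0]
    # Keep enough majority rows to reach the target ratio (never more than exist).
    n_major = min(len(majority), int(len(minority) / ratio))
    picked = minority + rng.sample(majority, n_major)
    # Shuffle so the cap does not preferentially keep minority rows.
    rng.shuffle(picked)
    return picked[:cap]


labels = [1] * 10 + [0] * 90
idx = undersample_indices(labels, seed=0)
print(len(idx), sum(labels[j] for j in idx))
```

Changing `seed` on each call yields a different subset of majority rows, which is what makes the ten bagged models differ from one another.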

### input
- 5_X_train_test_datafile_nonPCA/train/X_train_Clustering_SE_*.npz : Explanatory-variable training data for SE, after the second round of sampling.
- 5_X_train_test_datafile/Y/Y_train_SE.npz : Response-variable training data for SE.

### output
- 11_LGBM_SE_nonPCA/model_se_*.pkl : LightGBM models trained on the training data (one file per SE index and bagging seed).

In [1]:
from imblearn.under_sampling import RandomUnderSampler
from scipy.sparse import load_npz
from sklearn.model_selection import train_test_split
import optuna.integration.lightgbm as lgb
import pickle
import warnings
warnings.filterwarnings('ignore')

In [2]:
Y_train = load_npz('../5_X_train_test_datafile/Y/Y_train_SE.npz')

In [3]:
Y_train.shape

(60732, 177)

In [4]:
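def bagging(seed):
    # Relies on the globals X_train, y_train, params and i set in the loop below.
    # Downsample the majority class until minority / majority = 0.5.
    sampler = RandomUnderSampler(random_state=seed, replacement=True, sampling_strategy=0.5)
    # ravel() turns the (n, 1) label column into the 1-D vector the sampler expects.
    X_resampled, y_resampled = sampler.fit_resample(X_train.toarray(), y_train.toarray().ravel())
    # Cap the downsampled data at 1000 rows.
    if X_resampled.shape[0] > 1000:
        X_resampled, _, y_resampled, _ = train_test_split(
            X_resampled, y_resampled, train_size=1000, random_state=0, shuffle=True)

    # Hold out 20% of the rows as validation data.
    X_train2, X_valid, y_train2, y_valid = train_test_split(
        X_resampled, y_resampled, test_size=0.2, random_state=0, shuffle=True)

    lgb_train = lgb.Dataset(X_train2, y_train2)
    lgb_eval = lgb.Dataset(X_valid, y_valid, reference=lgb_train)

    # lgb is optuna.integration.lightgbm, so train() also tunes the hyperparameters.
    model = lgb.train(params,
                lgb_train,
                num_boost_round=200,
                valid_sets=[lgb_train, lgb_eval],
                early_stopping_rounds=50, verbose_eval=False,
                optuna_seed=0,
                verbosity = -1
               )

    # Save one model per SE index (i) and bagging seed.
    with open('../11_LGBM_SE_nonPCA/model_se_'+str(i)+'_'+str(seed)+'.pkl', 'wb') as f:
        pickle.dump(model, f)

    return model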

In [None]:
for i in range(177):
    X_train = load_npz('../5_X_train_test_datafile_nonPCA/train/X_train_Clustering_SE_'+ str(i) +'.npz')
    y_train = Y_train[:, i]
    
    params = {
        'objective': 'binary',
        'metric': 'binary_logloss',
        'verbosity': -1,
        'deterministic': True,
        'force_row_wise': True
    }
      
    models = []

    try:
        for k in range(10):
            models.append(bagging(k))
    except Exception as e:
        # Report which SE index failed and why, then continue with the next one.
        print(i, e)
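
The loop above writes one pickle per SE index and seed (`model_se_<i>_<seed>.pkl`). A minimal sketch of how such files could be read back and combined is shown below. Both helpers are hypothetical, and averaging the per-model predictions is an assumed way of using the bagged models, not something this notebook specifies.

```python
import pickle
from pathlib import Path


def load_bagged_models(model_dir, i):
    """Load every pickled model for SE index i, ordered by bagging seed."""
    paths = sorted(
        Path(model_dir).glob(f"model_se_{i}_*.pkl"),
        key=lambda p: int(p.stem.rsplit("_", 1)[1]),  # seed is the last field
    )
    models = []
    for p in paths:
        with open(p, "rb") as f:
            models.append(pickle.load(f))
    return models


def average_predictions(models, X):
    """Average predict(X) over all models (assumed aggregation rule)."""
    per_model = [list(m.predict(X)) for m in models]
    return [sum(col) / len(col) for col in zip(*per_model)]
```

For example, `average_predictions(load_bagged_models('../11_LGBM_SE_nonPCA', 0), X)` would return one averaged score per row of `X`, provided `X` has the same feature layout as the training matrix for that SE index.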