# First Model

In this notebook, we build a simple model with LightGBM. The model uses the following features:
- all numeric (float or int, non-categorical) variables, as is:
    - `NoEmp`, `CreateJob`, `RetainedJob`, `ApprovalFY`, `DisbursementGross`, `GrAppv`, `SBA_Appv`
- some categorical variables, as is:
    - `NewExist`, `RevLineCr`, `LowDoc`, `UrbanRural`
- some dates converted to day stamps:
    - `DisbursementDate_daystamp`, `ApprovalDate_daystamp`
- some categorical variables with coarse labeling:
    - `FranchiseCode` (0, 1, or other)
- some categorical variables with holdout target encoding:
    - `Sector`, `State`, `BankState`
- `longitude`, `latitude`: used as is and also holdout target encoded with `HistGradientBoostingClassifier`

Note that `City` is not used in this model.
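The coarse labeling and day-stamp features listed above are produced inside `clean_data`, which is not shown in this notebook. The following is only a minimal sketch of how such columns could be derived, assuming `FranchiseCode` holds integer codes and the date columns are already `datetime64`. The helper names `add_franchise_flags` and `add_daystamp`, and the choice of 1970-01-01 as the reference date, are hypothetical.

```python
import pandas as pd

def add_franchise_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: collapse FranchiseCode into 0 / 1 / other."""
    out = df.copy()
    code = out["FranchiseCode"].astype(int)
    out["FranchiseCode0"] = (code == 0).astype(int)
    out["FranchiseCode1"] = (code == 1).astype(int)
    # Rows where both flags are 0 form the implicit "other" group.
    return out

def add_daystamp(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Hypothetical helper: days elapsed since 1970-01-01 for a datetime column."""
    out = df.copy()
    out[col + "_daystamp"] = (out[col] - pd.Timestamp("1970-01-01")).dt.days
    return out

# Tiny made-up frame, only to show the resulting columns
demo = pd.DataFrame({
    "FranchiseCode": [0, 1, 52000],
    "ApprovalDate": pd.to_datetime(["1970-01-02", "2000-01-01", "1970-01-01"]),
})
demo = add_daystamp(add_franchise_flags(demo), "ApprovalDate")
print(demo)
```

Two indicator columns are enough for three groups, which matches the `FranchiseCode0` and `FranchiseCode1` names used later in `franchise_cols`.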

In [1]:
import pandas as pd
import numpy as np
from cat_encodings import target_encode_test, target_encode_smooth_test
from clean_data import clean_data
import os
pd.options.mode.chained_assignment = None

## Data preparation

In [2]:
DATA_DIR = "../data"
EDITED_DATA_DIR = "edited_data"

In [3]:
# load data
train = pd.read_csv(os.path.join(DATA_DIR, "train.csv"), index_col = 0)
test = pd.read_csv(os.path.join(DATA_DIR, "test.csv"), index_col = 0)
geo_train = pd.read_csv(os.path.join(EDITED_DATA_DIR, "train_geohash.csv"), index_col = 0)
geo_test = pd.read_csv(os.path.join(EDITED_DATA_DIR, "test_geohash.csv"), index_col = 0)

In [4]:
# clean data
train = clean_data(train)
test = clean_data(test)

In [5]:
# columns that are in geo_train but not train
print(set(geo_train) - set(train))

{'location', 'origin', 'geohash', 'latitude', 'longitude'}


In [6]:
# we use latitude and longitude col
for col in ['latitude', 'longitude']:
    train[col] = geo_train[col]
    test[col] = geo_test[col]

In [7]:
train.dtypes

Term                                  int64
NoEmp                                 int64
NewExist                           category
CreateJob                             int64
RetainedJob                           int64
FranchiseCode                      category
RevLineCr                          category
LowDoc                             category
DisbursementDate             datetime64[ns]
MIS_Status                            int64
Sector                             category
ApprovalDate                 datetime64[ns]
ApprovalFY                            int64
City                                 object
State                                object
BankState                            object
DisbursementGross                   float64
GrAppv                              float64
SBA_Appv                            float64
UrbanRural                         category
DisbursementDate_year               float64
DisbursementDate_month              float64
DisbursementDate_day            

In [8]:
# columns used for training -> all_cols
num_cols = ['NoEmp', 'CreateJob', 'RetainedJob', 'ApprovalFY', 'DisbursementGross', 'GrAppv', 'SBA_Appv']
retained_cat_cols = ['NewExist', 'RevLineCr', 'LowDoc', 'UrbanRural']
timestamp_cols = ['DisbursementDate_daystamp', 'ApprovalDate_daystamp']
franchise_cols = ['FranchiseCode1', 'FranchiseCode0']
target_encode_cols = ['Sector', 'State', 'BankState']
target_encoded_cols = [item + "_target" for item in target_encode_cols]
target_encode_smooth_cols = ["longitude", "latitude"]
target_encoded_smooth_cols = [item + "_target" for item in target_encode_smooth_cols]
location_cols = ['latitude', 'longitude']
all_cols = num_cols + retained_cat_cols + timestamp_cols + franchise_cols + target_encoded_cols + location_cols + target_encoded_smooth_cols

## Train/validation split

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
X_train, X_val, y_train, y_val = train_test_split(
    train, train["MIS_Status"], test_size=0.2, random_state=42,
    stratify=train["MIS_Status"],  # stratifying on the target is probably the better choice
)

### target encoding for valid data

We target-encode the validation data using values computed from the training data.
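The helper `target_encode_test` comes from the local `cat_encodings` module, whose source is not shown here. As a rough sketch of the idea only, holdout target encoding can be written as: compute the mean target per category on one set of rows, then map those means onto another set. The function name `target_encode_holdout` and the fallback to the global mean for unseen categories are assumptions, not the module's actual behavior.

```python
import pandas as pd

def target_encode_holdout(X_fit, y_fit, X_apply, col):
    """Hypothetical stand-in for target_encode_test (not the real implementation)."""
    global_mean = y_fit.mean()
    # Mean target per category, computed only on the fitting rows
    means = y_fit.groupby(X_fit[col].astype(object)).mean()
    out = X_apply.copy()
    encoded = out[col].astype(object).map(means)
    # Assumption: categories unseen in X_fit fall back to the global mean
    out[col + "_target"] = encoded.fillna(global_mean).astype(float)
    return out

# Made-up data: "TX" never appears in the fitting rows
X_fit = pd.DataFrame({"State": ["CA", "CA", "NY", "NY"]})
y_fit = pd.Series([1, 0, 1, 1])
X_apply = pd.DataFrame({"State": ["CA", "NY", "TX"]})

encoded_demo = target_encode_holdout(X_fit, y_fit, X_apply, "State")
print(encoded_demo)
```

Because the statistics come only from the fitting rows, the labels of the rows being encoded never leak into their own feature values.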

In [11]:
from sklearn.model_selection import StratifiedKFold

In [12]:
target_encode_cols = ['Sector', 'State', 'BankState']

# The training split alone is enough to encode the validation split
for col in target_encode_cols:
    X_val = target_encode_test(X_train, y_train, X_val, col)

In [13]:
# for the smoothly encoded (continuous) columns:
X_val = target_encode_smooth_test(X_train, y_train, X_val, "longitude")
X_val = target_encode_smooth_test(X_train, y_train, X_val, "latitude")

### Target encoding for train data

We apply target encoding so that it matches the CV splits.

In [14]:
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1000)
# for train
kf_iter_train = kf.split(X_train, y_train)

# create list of indices for training / test data to use for holdout target encoding
folds_train = []
for train_idx, test_idx in kf_iter_train:
    folds_train.append((train_idx, test_idx))

In [15]:
X_trs = []; X_cvalids = []
for fold, (train_indices, cvalid_indices) in enumerate(folds_train):
    X_tr, X_cvalid = X_train.iloc[train_indices], X_train.iloc[cvalid_indices]
    y_tr, y_cvalid = y_train.iloc[train_indices], y_train.iloc[cvalid_indices]
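    # Holdout encoding: each part is encoded with statistics from the other part,
    # so no row is encoded using its own label.
    for col in target_encode_cols: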
        X_tr = target_encode_test(X_cvalid, y_cvalid, X_tr, col)
        X_cvalid = target_encode_test(X_tr, y_tr, X_cvalid, col)
    for col in target_encode_smooth_cols:
        X_tr = target_encode_smooth_test(X_cvalid, y_cvalid, X_tr, col)
        X_cvalid = target_encode_smooth_test(X_tr, y_tr, X_cvalid, col)
    X_trs.append(X_tr[all_cols])
    X_cvalids.append(X_cvalid[all_cols])

## LightGBM training

### Macro F1 Score (with threshold included in the metric)

In [16]:
from sklearn.metrics import f1_score

def Macrof1(preds, eval_dataset):
    # Custom LightGBM metric: best macro F1 over a grid of thresholds
    y_true = eval_dataset.get_label()
    max_score = 0
    for th in np.linspace(0.2, 0.9, 100):
        y_pred = (preds > th).astype(int)
        score = f1_score(y_true, y_pred, average='macro')
        if score > max_score:
            max_score = score
    # (metric name, value, is_higher_better)
    return 'Macrof1', max_score, True

### CV Training!

In [17]:
valid_scores = []
import warnings
import lightgbm as lgb
warnings.filterwarnings("ignore", category=DeprecationWarning) 
boosters = []
for fold, (train_indices, cvalid_indices) in enumerate(folds_train):
    X_tr = X_trs[fold]
    X_cvalid = X_cvalids[fold]
    y_tr = y_train.iloc[train_indices]
    y_cvalid = y_train.iloc[cvalid_indices]
    
    lgb_train = lgb.Dataset(X_tr[all_cols], y_tr)
    lgb_eval = lgb.Dataset(X_cvalid[all_cols], y_cvalid)

    print("fold No. = ", fold)
    params = {
        'objective': 'binary',
        'metric': 'None',  # disable built-in metrics; only the custom feval (Macrof1) is evaluated
        'verbose': -1,
        'learning_rate': 0.1,
        'early_stopping_rounds': 100,  # equals num_boost_round, so all 100 rounds are run
        'scale_pos_weight': 1.0,
    }
    
    model = lgb.train(
        params,
        lgb_train,
        valid_sets = lgb_eval,
        num_boost_round = 100,
        feval = Macrof1,
        callbacks=[lgb.log_evaluation(10)],
    )
    
    print("now caluculating MacroF1 values .....")
    name, score, _ = Macrof1(model.predict(X_cvalid), lgb_eval)
    print(f'fold {fold} MacroF1: {score}')
    valid_scores.append(score)

    boosters.append(model)

fold No. =  0
[10]	valid_0's Macrof1: 0.67221
[20]	valid_0's Macrof1: 0.671485
[30]	valid_0's Macrof1: 0.670574
[40]	valid_0's Macrof1: 0.672791
[50]	valid_0's Macrof1: 0.676263
[60]	valid_0's Macrof1: 0.676787
[70]	valid_0's Macrof1: 0.677848
[80]	valid_0's Macrof1: 0.679993
[90]	valid_0's Macrof1: 0.678207
[100]	valid_0's Macrof1: 0.680928
now caluculating MacroF1 values .....
fold 0 MacroF1: 0.6815196565893237
fold No. =  1
[10]	valid_0's Macrof1: 0.676689
[20]	valid_0's Macrof1: 0.676277
[30]	valid_0's Macrof1: 0.681054
[40]	valid_0's Macrof1: 0.682928
[50]	valid_0's Macrof1: 0.681497
[60]	valid_0's Macrof1: 0.680838
[70]	valid_0's Macrof1: 0.680375
[80]	valid_0's Macrof1: 0.678758
[90]	valid_0's Macrof1: 0.678758
[100]	valid_0's Macrof1: 0.681378
now caluculating MacroF1 values .....
fold 1 MacroF1: 0.6842781069276336
fold No. =  2
[10]	valid_0's Macrof1: 0.67464
[20]	valid_0's Macrof1: 0.676558
[30]	valid_0's Macrof1: 0.678082
[40]	valid_0's Macrof1: 0.678405
[50]	valid_0's Macro

In [18]:
valid_scores

[0.6837238209415563,
 0.68682229501981,
 0.6807903189433329,
 0.6778777267457856,
 0.6851297909188753]

### Check scores with valid data

In [19]:
pred_average = np.mean([item.predict(X_val[all_cols]) for item in boosters], axis=0)
test_prediction = (pred_average > np.quantile(pred_average, 0.1)).astype(int) # there should be a better way to choose this threshold

In [20]:
f1_score(y_val, test_prediction, average='macro')

0.6741911970338366
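The cut at the 10% quantile above is a fixed guess. One alternative, sketched here as an assumption rather than as part of this notebook's pipeline, is to reuse the threshold grid from `Macrof1` but return the threshold itself, tune it on out-of-fold predictions, and then apply it to the validation or test probabilities. The function name `best_macro_f1_threshold` and the toy arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_macro_f1_threshold(y_true, proba, grid=None):
    """Return (threshold, score) maximizing macro F1 over a grid of thresholds."""
    if grid is None:
        grid = np.linspace(0.2, 0.9, 100)  # same grid as Macrof1
    scores = [
        f1_score(y_true, (proba > th).astype(int), average="macro")
        for th in grid
    ]
    best = int(np.argmax(scores))  # first maximum if several thresholds tie
    return float(grid[best]), float(scores[best])

# Toy example: any threshold in [0.3, 0.5) separates the two classes perfectly
y_demo = np.array([0, 0, 1, 1, 1, 1])
proba_demo = np.array([0.1, 0.3, 0.5, 0.7, 0.8, 0.95])
th_demo, score_demo = best_macro_f1_threshold(y_demo, proba_demo)
print(th_demo, score_demo)
```

The threshold should be chosen on data that is not used for the final score (for example the CV folds); otherwise the reported macro F1 is optimistic.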

## LGB training with all data

### target encoding for CV

In [21]:
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1000)
# for train
kf_iter_train = kf.split(train, train['MIS_Status'])

# create list of indices for training / test data to use for holdout target encoding
folds_train = []
for train_idx, test_idx in kf_iter_train:
    folds_train.append((train_idx, test_idx))

In [22]:
X_trs = []; X_cvalids = []
for fold, (train_indices, cvalid_indices) in enumerate(folds_train):
    X_tr, X_cvalid = train.iloc[train_indices], train.iloc[cvalid_indices]
    y_tr, y_cvalid = train['MIS_Status'].iloc[train_indices], train['MIS_Status'].iloc[cvalid_indices]
    for col in target_encode_cols:
        X_tr = target_encode_test(X_cvalid, y_cvalid, X_tr, col)
        X_cvalid = target_encode_test(X_tr, y_tr, X_cvalid, col)
    for col in target_encode_smooth_cols:
        X_tr = target_encode_smooth_test(X_cvalid, y_cvalid, X_tr, col)
        X_cvalid = target_encode_smooth_test(X_tr, y_tr, X_cvalid, col)
    X_trs.append(X_tr[all_cols])
    X_cvalids.append(X_cvalid[all_cols])

In [23]:
valid_scores = []
import warnings
import lightgbm as lgb
warnings.filterwarnings("ignore", category=DeprecationWarning) 
boosters = []
for fold, (train_indices, cvalid_indices) in enumerate(folds_train):
    X_tr = X_trs[fold]
    X_cvalid = X_cvalids[fold]
    y_tr = train['MIS_Status'].iloc[train_indices]
    y_cvalid = train['MIS_Status'].iloc[cvalid_indices]
    
    lgb_train = lgb.Dataset(X_tr[all_cols], y_tr)
    lgb_eval = lgb.Dataset(X_cvalid[all_cols], y_cvalid)

    print("fold No. = ", fold)
    params = {
        'objective': 'binary',
        'metric': 'None',  # disable built-in metrics; only the custom feval (Macrof1) is evaluated
        'verbose': -1,
        'learning_rate': 0.1,
        'early_stopping_rounds': 100,  # equals num_boost_round, so all 100 rounds are run
        'scale_pos_weight': 1.0,
    }
    
    model = lgb.train(
        params,
        lgb_train,
        valid_sets = lgb_eval,
        num_boost_round = 100,
        feval = Macrof1,
        callbacks=[lgb.log_evaluation(10)],
    )
    
    print("now caluculating MacroF1 values .....")
    name, score, _ = Macrof1(model.predict(X_cvalid), lgb_eval)
    print(f'fold {fold} MacroF1: {score}')
    valid_scores.append(score)

    boosters.append(model)

fold No. =  0
[10]	valid_0's Macrof1: 0.666318
[20]	valid_0's Macrof1: 0.671807
[30]	valid_0's Macrof1: 0.671669
[40]	valid_0's Macrof1: 0.670473
[50]	valid_0's Macrof1: 0.67403
[60]	valid_0's Macrof1: 0.672121
[70]	valid_0's Macrof1: 0.670829
[80]	valid_0's Macrof1: 0.670472
[90]	valid_0's Macrof1: 0.67274
[100]	valid_0's Macrof1: 0.672559
now caluculating MacroF1 values .....
fold 0 MacroF1: 0.674030193706318
fold No. =  1
[10]	valid_0's Macrof1: 0.681439
[20]	valid_0's Macrof1: 0.687257
[30]	valid_0's Macrof1: 0.688046
[40]	valid_0's Macrof1: 0.688245
[50]	valid_0's Macrof1: 0.68713
[60]	valid_0's Macrof1: 0.685986
[70]	valid_0's Macrof1: 0.686127
[80]	valid_0's Macrof1: 0.686882
[90]	valid_0's Macrof1: 0.686113
[100]	valid_0's Macrof1: 0.685705
now caluculating MacroF1 values .....
fold 1 MacroF1: 0.6900889332082562
fold No. =  2
[10]	valid_0's Macrof1: 0.674238
[20]	valid_0's Macrof1: 0.677406
[30]	valid_0's Macrof1: 0.680719
[40]	valid_0's Macrof1: 0.681098
[50]	valid_0's Macrof1

In [24]:
valid_scores

[0.674030193706318,
 0.6900889332082562,
 0.6825250507932391,
 0.679484572257977,
 0.6751655935736369]

### prepare submission
#### target encoding for test data

In [25]:
target_encode_cols = ['Sector', 'State', 'BankState']

# We can simply use the full training data to encode the test data
for col in target_encode_cols:
    test = target_encode_test(train, train['MIS_Status'], test, col)
# for the smoothly encoded (continuous) columns:
test = target_encode_smooth_test(train, train['MIS_Status'], test, "longitude")
test = target_encode_smooth_test(train, train['MIS_Status'], test, "latitude")

In [26]:
# prepare submission
pred_average = np.mean([item.predict(test[all_cols]) for item in boosters], axis=0)
test_prediction = (pred_average > np.quantile(pred_average, 0.1)).astype(int) # there should be a better way to choose this threshold

In [27]:
test['prediction'] = test_prediction
test['prediction'].to_csv('submission_ex.csv', header=None)