
# First Model

In this notebook, we build a simple model with LightGBM. The model uses the following features:
- All numeric (float or non-categorical int) variables, as they are:
    - `NoEmp`, `CreateJob`, `RetainedJob`, `ApprovalFY`, `DisbursementGross`, `GrAppv`, `SBA_Appv`
- Some categorical variables, as they are:
    - `NewExist`, `RevLineCr`, `LowDoc`, `UrbanRural`
- Some dates converted to daystamps:
    - `DisbursementDate_daystamp`, `ApprovalDate_daystamp`
- Some categorical variables with coarse labeling:
    - `FranchiseCode` (0, 1, or other)
- Some categorical variables with holdout target encoding:
    - `Sector`, `State`, `BankState`, `FranchiseCode`
- `longitude`, `latitude`: used as they are, plus a joint holdout target encoding (`location_target`) with `HistGradientBoostingClassifier`

Note that `City` is not used in this model.
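
The daystamp and coarse `FranchiseCode` features are prepared outside this notebook, so their exact definitions are not shown here. The following is a minimal sketch under two assumptions: that a daystamp is the number of whole days since 1970-01-01, and that `FranchiseCode0` / `FranchiseCode1` flag the codes 0 and 1. The toy values are made up.

```python
import pandas as pd

# Toy frame; column names mirror the notebook, values are invented.
df = pd.DataFrame({
    "ApprovalDate": ["1970-01-02", "2000-01-01"],
    "FranchiseCode": [0, 12345],
})

# Assumed definition of a "daystamp": whole days since 1970-01-01.
epoch = pd.Timestamp("1970-01-01")
df["ApprovalDate_daystamp"] = (pd.to_datetime(df["ApprovalDate"]) - epoch).dt.days

# Assumed coarse labeling: one flag per special code; any other code sets neither flag.
df["FranchiseCode0"] = df["FranchiseCode"] == 0
df["FranchiseCode1"] = df["FranchiseCode"] == 1
```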

In [15]:
import pandas as pd
import numpy as np
from cat_encodings import target_encode_test, target_encode_smooth_test, target_encode_cols_smooth_test
from clean_data import clean_data
from tqdm.notebook import tqdm
import os
import numbers
pd.options.mode.chained_assignment = None

## Data preparation

In [16]:
train = pd.read_csv("edited_data/train_cleaned_geo.csv", index_col=0)
test = pd.read_csv("edited_data/test_cleaned_geo.csv", index_col=0)

In [17]:
# columns used for training -> all_cols
num_cols = ['NoEmp', 'CreateJob', 'RetainedJob', 'ApprovalFY', 'DisbursementGross', 'GrAppv', 'SBA_Appv']
retained_cat_cols = ['NewExist', 'RevLineCr', 'LowDoc', 'UrbanRural']
timestamp_cols = ['DisbursementDate_daystamp', 'ApprovalDate_daystamp']
franchise_cols = ['FranchiseCode1', 'FranchiseCode0']
target_encode_cols = ['Sector', 'State', 'BankState', 'FranchiseCode']
target_encoded_cols = [item + "_target" for item in target_encode_cols]
target_encode_smooth_cols = ["longitude", "latitude"]
target_encoded_smooth_cols = ['location_target']
location_cols = ['latitude', 'longitude']
all_cols = num_cols + retained_cat_cols + timestamp_cols + franchise_cols + target_encoded_cols + location_cols + target_encoded_smooth_cols

In [18]:
# use pandas category for categorical vars
for column in retained_cat_cols:
    train[column] = train[column].astype("category")
    test[column] = test[column].astype("category")

## Train/validation split

In [19]:
from sklearn.model_selection import train_test_split

In [20]:
X_train, X_val, y_train, y_val = train_test_split(train, train["MIS_Status"], test_size=0.2, random_state=42, stratify=train['MIS_Status'])  # stratifying is probably the better choice

### Target encoding for validation data

We target-encode the validation data using statistics computed from the training data.
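
The helper `target_encode_test` comes from the local `cat_encodings` module, whose source is not shown in this notebook. As a rough illustration of the idea only, the sketch below uses a hypothetical stand-in, `encode_with_train_means`, that maps each category to the mean target observed for it in the fitting data and falls back to the global mean for unseen categories. The fallback rule is an assumption.

```python
import pandas as pd

def encode_with_train_means(X_fit, y_fit, X_apply, col):
    """Hypothetical stand-in for target_encode_test: add `<col>_target` to
    X_apply, holding the mean target per category seen in the fitting data."""
    means = y_fit.groupby(X_fit[col]).mean()
    fallback = y_fit.mean()  # assumed handling of categories unseen in X_fit
    out = X_apply.copy()
    out[col + "_target"] = out[col].map(means).fillna(fallback)
    return out

# Toy data: CA has a mean target of 0.5, NY of 1.0, TX is unseen.
X_fit = pd.DataFrame({"State": ["CA", "CA", "NY", "NY"]})
y_fit = pd.Series([1, 0, 1, 1])
X_apply = pd.DataFrame({"State": ["CA", "NY", "TX"]})

encoded = encode_with_train_means(X_fit, y_fit, X_apply, "State")
```

Because only the fitting data's labels are used, the validation labels never leak into the validation features.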

In [21]:
from sklearn.model_selection import StratifiedKFold

In [22]:
target_encode_cols = ['Sector', 'State', 'BankState', 'FranchiseCode']
    
# We can simply use training data to encode the test data
for col in target_encode_cols:
    X_val = target_encode_test(X_train, y_train, X_val, col)

In [23]:
# for smooth data:
X_val = target_encode_cols_smooth_test(X_train, y_train, X_val, ["longitude", "latitude"], "location")
#X_val =target_encode_smooth_test(X_train, y_train, X_val, "longitude")
#X_val =target_encode_smooth_test(X_train, y_train, X_val, "latitude")

### Target encoding for train data

We apply target encoding so that it is consistent with the CV splits.
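
For orientation, here is a sketch of a common out-of-fold variant of this idea: each row is encoded using only the labels of the other folds. It is not necessarily what the `cat_encodings` helpers do, and the function name `out_of_fold_target_encode` and the global-mean fallback are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def out_of_fold_target_encode(X, y, col, n_splits=2, seed=0):
    """Encode X[col] so that each row only sees labels from the other folds."""
    kf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    encoded = pd.Series(np.nan, index=X.index, dtype=float)
    for fit_idx, hold_idx in kf.split(X, y):
        # category means computed on the fitting part of the split only
        means = y.iloc[fit_idx].groupby(X[col].iloc[fit_idx]).mean()
        values = X[col].iloc[hold_idx].map(means).fillna(y.iloc[fit_idx].mean())
        encoded.iloc[hold_idx] = values.to_numpy()
    return encoded

X_toy = pd.DataFrame({"Sector": ["a", "a", "b", "b", "a", "b", "a", "b"]})
y_toy = pd.Series([1, 0, 1, 1, 0, 0, 1, 1])
sector_target = out_of_fold_target_encode(X_toy, y_toy, "Sector")
```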

In [24]:
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1000)
# for train
kf_iter_train = kf.split(X_train, y_train)

# create list of indices for training / test data to use for holdout target encoding
folds_train = []
for train_idx, test_idx in kf_iter_train:
    folds_train.append((train_idx, test_idx))

In [25]:
X_trs = []; X_cvalids = []
for fold, (train_indices, cvalid_indices) in enumerate(folds_train):
    X_tr, X_cvalid = X_train.iloc[train_indices], X_train.iloc[cvalid_indices]
    y_tr, y_cvalid = y_train.iloc[train_indices], y_train.iloc[cvalid_indices]
    for col in target_encode_cols:
        X_tr = target_encode_test(X_cvalid, y_cvalid, X_tr, col)
        X_cvalid = target_encode_test(X_tr, y_tr, X_cvalid, col)
    X_tr = target_encode_cols_smooth_test(X_cvalid, y_cvalid, X_tr, ["longitude", "latitude"], "location")
    X_cvalid = target_encode_cols_smooth_test(X_tr, y_tr, X_cvalid, ["longitude", "latitude"], "location")
    # for col in target_encode_smooth_cols:
    #     X_tr = target_encode_smooth_test(X_cvalid, y_cvalid, X_tr, col)
    #     X_cvalid = target_encode_smooth_test(X_tr, y_tr, X_cvalid, col)
    X_trs.append(X_tr[all_cols])
    X_cvalids.append(X_cvalid[all_cols])

In [26]:
train[train['DisbursementDate'].isnull()]['MIS_Status'].mean()

0.6266666666666667

In [27]:
# used when skipping SMOTEENN
y_trs = [None for i in range(5)]
for fold, (train_indices, cvalid_indices) in enumerate(folds_train):
    y_trs[fold] = y_train.iloc[train_indices]

## LightGBM training

### Macro F1 Score (with threshold included in the metric)

In [28]:
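from sklearn.metrics import f1_score

def Macrof1(preds, eval_dataset):
    """Custom LightGBM metric: best macro F1 over a grid of thresholds."""
    y_true = eval_dataset.get_label()
    max_score = 0
    # scan 100 thresholds between 0.1 and 0.9 and keep the best score
    for th in np.linspace(0.1, 0.9, 100):
        y_pred = (preds > th).astype(int)
        score = f1_score(y_true, y_pred, average='macro')
        if score > max_score:
            max_score = score
    # (metric name, value, is_higher_better)
    return 'Macrof1', max_score, True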
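
The metric above reports only the best score, not the threshold that achieved it. The sketch below is a small variant that returns both, which can be handy when a cutoff is needed later. The name `best_macro_f1_threshold` and the toy arrays are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_macro_f1_threshold(y_true, probs, thresholds=np.linspace(0.1, 0.9, 100)):
    """Return (threshold, macro F1) for the best-scoring threshold on the grid."""
    scores = [
        f1_score(y_true, (probs > th).astype(int), average="macro")
        for th in thresholds
    ]
    best = int(np.argmax(scores))  # first maximum wins on ties
    return float(thresholds[best]), float(scores[best])

# Toy labels and probabilities (made up for illustration).
y_true = np.array([0, 0, 1, 1, 1, 1])
probs = np.array([0.2, 0.6, 0.55, 0.7, 0.8, 0.9])
th, score = best_macro_f1_threshold(y_true, probs)
```

On this toy data the best threshold falls between 0.6 and 0.7, where only one positive is misclassified.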

### CV Training!

In [30]:
valid_scores = []
import warnings
import lightgbm as lgb
warnings.filterwarnings("ignore", category=DeprecationWarning) 
boosters = []
for fold, (train_indices, cvalid_indices) in enumerate(folds_train):
    X_tr = X_trs[fold]
    X_cvalid = X_cvalids[fold]
    y_tr = y_trs[fold]
    y_cvalid = y_train.iloc[cvalid_indices]
    
    lgb_train = lgb.Dataset(X_tr, y_tr)
    lgb_eval = lgb.Dataset(X_cvalid, y_cvalid)

    print("fold No. = ", fold)
    params = {
    'objective': 'binary',
    'metric': 'None',  # disable built-in metrics so that only the custom feval (Macrof1) is evaluated
    'verbose': -1,
    'learning_rate':0.1,
    'early_stopping_rounds': 100,
    'scale_pos_weight': 1.0,
    }
    
    model = lgb.train(
        params,
        lgb_train,
        valid_sets = lgb_eval,
        num_boost_round = 100,
        feval = Macrof1,
        callbacks=[lgb.log_evaluation(10)],
    )
    
    print("now caluculating MacroF1 values .....")
    name, score, _ = Macrof1(model.predict(X_cvalid), lgb_eval)
    print(f'fold {fold} MacroF1: {score}')
    valid_scores.append(score)

    boosters.append(model)

fold No. =  0


[10]	valid_0's Macrof1: 0.681079
[20]	valid_0's Macrof1: 0.68243
[30]	valid_0's Macrof1: 0.685347
[40]	valid_0's Macrof1: 0.685313
[50]	valid_0's Macrof1: 0.684716
[60]	valid_0's Macrof1: 0.682284
[70]	valid_0's Macrof1: 0.682242
[80]	valid_0's Macrof1: 0.680164
[90]	valid_0's Macrof1: 0.680963
[100]	valid_0's Macrof1: 0.68235
now caluculating MacroF1 values .....
fold 0 MacroF1: 0.6876402584155049
fold No. =  1
[10]	valid_0's Macrof1: 0.689282
[20]	valid_0's Macrof1: 0.6931
[30]	valid_0's Macrof1: 0.692598
[40]	valid_0's Macrof1: 0.692592
[50]	valid_0's Macrof1: 0.694833
[60]	valid_0's Macrof1: 0.695501
[70]	valid_0's Macrof1: 0.693943
[80]	valid_0's Macrof1: 0.694385
[90]	valid_0's Macrof1: 0.693278
[100]	valid_0's Macrof1: 0.692398
now caluculating MacroF1 values .....
fold 1 MacroF1: 0.6959372776923907
fold No. =  2
[10]	valid_0's Macrof1: 0.675781
[20]	valid_0's Macrof1: 0.677121
[30]	valid_0's Macrof1: 0.678869
[40]	valid_0's Macrof1: 0.684797
[50]	valid_0's Macrof1: 0.68413
[60]

In [None]:
valid_scores

[0.6858837177977554,
 0.6952863920108225,
 0.6842453803525832,
 0.6730346469061509,
 0.6886688530417936]

### Check scores with valid data

In [None]:
# target encode all training data
target_encode_cols = ['Sector', 'State', 'BankState', 'FranchiseCode']
    
# Here the training split is encoded with its own statistics (no holdout)
for col in target_encode_cols:
    X_train = target_encode_test(X_train, y_train, X_train, col)
# for smooth data:
X_train = target_encode_cols_smooth_test(X_train, y_train, X_train, ["longitude", "latitude"], "location")
#X_val =target_encode_smooth_test(X_train, y_train, X_val, "longitude")
#X_val =target_encode_smooth_test(X_train, y_train, X_val, "latitude")

In [None]:
from sklearn.linear_model import LogisticRegression
preds = np.array([item.predict(X_train[all_cols]) for item in boosters])
clf = LogisticRegression(random_state=0).fit(preds.T, y_train)
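
The cell above stacks the fold models: each booster's probabilities become one input feature of a logistic regression. The sketch below shows the shape convention on synthetic data. The random "predictions" merely stand in for `item.predict(...)` and are not real model output.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)

# Synthetic stand-in for per-booster probabilities, shape (n_models, n_samples).
preds = np.clip(y + rng.normal(0, 0.4, size=(5, 200)), 0, 1)

# Transpose so that rows are samples and columns are models.
meta = LogisticRegression(random_state=0).fit(preds.T, y)
blended = meta.predict_proba(preds.T)[:, 1]
```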

In [None]:
#pred_average = np.mean([item.predict(X_val[all_cols]) for item in boosters], axis=0)
pred_test = np.array([item.predict(X_val[all_cols]) for item in boosters])
print(pred_test)
pred_average = clf.predict_proba(pred_test.T)[:,1]

test_prediction = (pred_average > np.quantile(pred_average, 0.1)).astype(int)  # there should be a better way to choose this threshold

[[0.97920583 0.7772076  0.93469781 ... 0.96566192 0.8488847  0.9497741 ]
 [0.97560263 0.8659089  0.94050981 ... 0.94704968 0.894372   0.93765919]
 [0.97809229 0.90228615 0.92094027 ... 0.95643572 0.76844993 0.95860478]
 [0.97964122 0.78037028 0.9089544  ... 0.9599944  0.86564927 0.96976384]
 [0.97263074 0.84098256 0.92867224 ... 0.94723998 0.86507642 0.95421878]]


In [None]:
f1_score(y_val, test_prediction, average='macro')

0.6741911970338366
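
The cutoff used above, `np.quantile(pred_average, 0.1)`, is the 10% quantile of the blended probabilities, so roughly 90% of the rows are labeled 1 by construction, regardless of the probability values themselves. A short check on synthetic scores (not the notebook's predictions) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(0)
pred_average = rng.uniform(0, 1, size=1000)  # synthetic scores for illustration

cutoff = np.quantile(pred_average, 0.1)
labels = (pred_average > cutoff).astype(int)
positive_share = labels.mean()  # fraction of rows labeled 1
```

A threshold chosen to maximize macro F1 on held-out predictions would be one alternative to fixing the positive share in advance.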

## LGB training with all data

### target encoding for CV

In [None]:
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1000)
# for train
kf_iter_train = kf.split(train, train['MIS_Status'])

# create list of indices for training / test data to use for holdout target encoding
folds_train = []
for train_idx, test_idx in kf_iter_train:
    folds_train.append((train_idx, test_idx))

In [None]:
X_trs = []; X_cvalids = []
for fold, (train_indices, cvalid_indices) in enumerate(folds_train):
    X_tr, X_cvalid = train.iloc[train_indices], train.iloc[cvalid_indices]
    y_tr, y_cvalid = train['MIS_Status'].iloc[train_indices], train['MIS_Status'].iloc[cvalid_indices]
    for col in target_encode_cols:
        X_tr = target_encode_test(X_cvalid, y_cvalid, X_tr, col)
        X_cvalid = target_encode_test(X_tr, y_tr, X_cvalid, col)
    # for col in target_encode_smooth_cols:
    #     X_tr = target_encode_smooth_test(X_cvalid, y_cvalid, X_tr, col)
    #     X_cvalid = target_encode_smooth_test(X_tr, y_tr, X_cvalid, col)
    X_tr = target_encode_cols_smooth_test(X_cvalid, y_cvalid, X_tr, ["longitude", "latitude"], "location")
    X_cvalid = target_encode_cols_smooth_test(X_tr, y_tr, X_cvalid, ["longitude", "latitude"], "location")
    X_trs.append(X_tr[all_cols])
    X_cvalids.append(X_cvalid[all_cols])

In [None]:
valid_scores = []
import warnings
import lightgbm as lgb
warnings.filterwarnings("ignore", category=DeprecationWarning) 
boosters = []
for fold, (train_indices, cvalid_indices) in enumerate(folds_train):
    X_tr = X_trs[fold]
    X_cvalid = X_cvalids[fold]
    y_tr = train['MIS_Status'].iloc[train_indices]
    y_cvalid = train['MIS_Status'].iloc[cvalid_indices]
    
    lgb_train = lgb.Dataset(X_tr[all_cols], y_tr)
    lgb_eval = lgb.Dataset(X_cvalid[all_cols], y_cvalid)

    print("fold No. = ", fold)
    params = {
    'objective': 'binary',
    'metric': 'None',  # disable built-in metrics so that only the custom feval (Macrof1) is evaluated
    'verbose': -1,
    'learning_rate':0.1,
    'early_stopping_rounds': 100,
    'scale_pos_weight': 1.0,
    }
    
    model = lgb.train(
        params,
        lgb_train,
        valid_sets = lgb_eval,
        num_boost_round = 100,
        feval = Macrof1,
        callbacks=[lgb.log_evaluation(10)],
    )
    
    print("now caluculating MacroF1 values .....")
    name, score, _ = Macrof1(model.predict(X_cvalid), lgb_eval)
    print(f'fold {fold} MacroF1: {score}')
    valid_scores.append(score)

    boosters.append(model)

fold No. =  0
[10]	valid_0's Macrof1: 0.67046
[20]	valid_0's Macrof1: 0.672227
[30]	valid_0's Macrof1: 0.670903
[40]	valid_0's Macrof1: 0.673336
[50]	valid_0's Macrof1: 0.6725
[60]	valid_0's Macrof1: 0.67226
[70]	valid_0's Macrof1: 0.671062
[80]	valid_0's Macrof1: 0.670736
[90]	valid_0's Macrof1: 0.670406
[100]	valid_0's Macrof1: 0.670241
now caluculating MacroF1 values .....
fold 0 MacroF1: 0.6733355408983948
fold No. =  1
[10]	valid_0's Macrof1: 0.682511
[20]	valid_0's Macrof1: 0.682325
[30]	valid_0's Macrof1: 0.684552
[40]	valid_0's Macrof1: 0.681769
[50]	valid_0's Macrof1: 0.683092
[60]	valid_0's Macrof1: 0.683976
[70]	valid_0's Macrof1: 0.683144
[80]	valid_0's Macrof1: 0.683463
[90]	valid_0's Macrof1: 0.682923
[100]	valid_0's Macrof1: 0.681539
now caluculating MacroF1 values .....
fold 1 MacroF1: 0.6845524993474289
fold No. =  2
[10]	valid_0's Macrof1: 0.678195
[20]	valid_0's Macrof1: 0.677253
[30]	valid_0's Macrof1: 0.680578
[40]	valid_0's Macrof1: 0.6821
[50]	valid_0's Macrof1: 

In [None]:
valid_scores

[0.6733355408983948,
 0.6845524993474289,
 0.6849859714144189,
 0.6812022318001724,
 0.6768916607542323]

### Prepare submission

#### Target encoding for test data

In [None]:
target_encode_cols = ['Sector', 'State', 'BankState', 'FranchiseCode']
for column in retained_cat_cols:
    test[column] = test[column].astype("category")
# We can simply use training data to encode the test data
for col in target_encode_cols:
    test = target_encode_test(train, train['MIS_Status'], test, col)
# for smooth data:
# test =target_encode_smooth_test(train, train['MIS_Status'], test, "longitude")
# test =target_encode_smooth_test(train, train['MIS_Status'], test, "latitude")
test = target_encode_cols_smooth_test(train, train['MIS_Status'], test, ["longitude", "latitude"], name="location")

In [None]:
# prepare submission
pred_average = np.mean([item.predict(test[all_cols]) for item in boosters], axis=0)
test_prediction = (pred_average > np.quantile(pred_average, 0.1)).astype(int)  # there should be a better way to choose this threshold

In [None]:
test['prediction'] = test_prediction
test['prediction'].to_csv('submission_lonlat.csv', header=None)

In [47]:
print(test[all_cols].dtypes)

NoEmp                          int64
CreateJob                      int64
RetainedJob                    int64
ApprovalFY                     int64
DisbursementGross            float64
GrAppv                       float64
SBA_Appv                     float64
NewExist                     float64
RevLineCr                     object
LowDoc                        object
UrbanRural                     int64
DisbursementDate_daystamp    float64
ApprovalDate_daystamp          int64
FranchiseCode1                  bool
FranchiseCode0                  bool
Sector_target                float64
State_target                 float64
BankState_target             float64
FranchiseCode_target         float64
latitude                     float64
longitude                    float64
location_target              float64
dtype: object


In [50]:
print(X_val[all_cols].dtypes)

NoEmp                           int64
CreateJob                       int64
RetainedJob                     int64
ApprovalFY                      int64
DisbursementGross             float64
GrAppv                        float64
SBA_Appv                      float64
NewExist                     category
RevLineCr                    category
LowDoc                       category
UrbanRural                   category
DisbursementDate_daystamp     float64
ApprovalDate_daystamp           int64
FranchiseCode1                   bool
FranchiseCode0                   bool
Sector_target                 float64
State_target                  float64
BankState_target              float64
FranchiseCode_target          float64
latitude                      float64
longitude                     float64
location_target               float64
dtype: object
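
The two listings above differ in the categorical columns: in `test`, `NewExist` is `float64`, `RevLineCr` and `LowDoc` are `object`, and `UrbanRural` is `int64`, whereas in `X_val` all four are `category`. One possible way to keep such columns aligned is to build a single `CategoricalDtype` from the training data and apply it to both frames. The sketch below uses toy frames named `train_df` and `test_df` and is not the notebook's actual fix.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy frames; column names mirror the notebook, values are invented.
train_df = pd.DataFrame({"LowDoc": ["Y", "N", "N"], "UrbanRural": [0, 1, 2]})
test_df = pd.DataFrame({"LowDoc": ["N", "Y"], "UrbanRural": [1, 0]})

for col in ["LowDoc", "UrbanRural"]:
    # Categories are taken from the training data so both frames share one dtype.
    dtype = CategoricalDtype(categories=sorted(train_df[col].dropna().unique()))
    train_df[col] = train_df[col].astype(dtype)
    test_df[col] = test_df[col].astype(dtype)
```

Values that appear only in the test frame would become missing under this scheme, so it is worth checking for new NaNs after the conversion.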
