# First Model

In this notebook, we build a simple baseline model with LightGBM. The model uses the following features:
- All numeric (float or int, non-categorical) variables, used as-is:
    - `NoEmp`, `CreateJob`, `RetainedJob`, `ApprovalFY`, `DisbursementGross`, `GrAppv`, `SBA_Appv`
- Some categorical variables, used as-is:
    - `NewExist`, `RevLineCr`, `LowDoc`, `UrbanRural`
- Some date columns converted to day stamps:
    - `DisbursementDate_daystamp`, `ApprovalDate_daystamp`
- Some categorical variables with coarse labeling:
    - `FranchiseCode` (0, 1, or other)
- Some categorical variables with holdout target encoding:
    - `Sector`, `State`, `BankState`
- `longitude`, `latitude`: used as raw values and also holdout target encoded with `HistGradientBoostingClassifier`

Note that `City` is not used in this model.
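The day-stamp and coarse-label features are produced by `clean_data`, whose source is not shown here. As a rough illustration only, the sketch below shows one way such columns could be derived with pandas. The helper names, the "days since 1970-01-01" definition, and the meaning of `FranchiseCode0` / `FranchiseCode1` as indicator columns are assumptions, not the notebook's actual implementation.

```python
import pandas as pd


def add_daystamp(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Add `<col>_daystamp`: whole days since 1970-01-01 (assumed definition)."""
    out = df.copy()
    out[col + "_daystamp"] = (out[col] - pd.Timestamp("1970-01-01")).dt.days
    return out


def add_coarse_franchise(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse FranchiseCode into two indicators: code == 0 and code == 1.

    Every other code is represented by both indicators being 0 (assumed meaning).
    """
    out = df.copy()
    code = out["FranchiseCode"].astype(int)
    out["FranchiseCode0"] = (code == 0).astype(int)
    out["FranchiseCode1"] = (code == 1).astype(int)
    return out


# Toy data, not taken from the competition files
toy = pd.DataFrame({
    "ApprovalDate": pd.to_datetime(["1970-01-02", "2000-01-01", None]),
    "FranchiseCode": [0, 1, 52000],
})
toy = add_coarse_franchise(add_daystamp(toy, "ApprovalDate"))
print(toy)
```

Missing dates stay missing in the day-stamp column, which LightGBM can handle natively.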

In [38]:
import pandas as pd
import numpy as np
from cat_encodings import target_encode_test, target_encode_smooth_test
from clean_data import clean_data
import os

## Data preparation

In [39]:
DATA_DIR = "../data"
EDITED_DATA_DIR = "edited_data"

In [40]:
# load data
train = pd.read_csv(os.path.join(DATA_DIR, "train.csv"), index_col = 0)
test = pd.read_csv(os.path.join(DATA_DIR, "test.csv"), index_col = 0)
geo_train = pd.read_csv(os.path.join(EDITED_DATA_DIR, "train_geohash.csv"), index_col = 0)
geo_test = pd.read_csv(os.path.join(EDITED_DATA_DIR, "test_geohash.csv"), index_col = 0)

In [41]:
# clean data
train = clean_data(train)
test = clean_data(test)

In [43]:
# columns that are in geo_train but not train
print(set(geo_train) - set(train))

{'longitude', 'location', 'latitude', 'geohash', 'origin'}


In [44]:
# add the latitude and longitude columns (pandas aligns the rows on the index)
for col in ['latitude', 'longitude']:
    train[col] = geo_train[col]
    test[col] = geo_test[col]

In [45]:
train.dtypes

Term                                  int64
NoEmp                                 int64
NewExist                           category
CreateJob                             int64
RetainedJob                           int64
FranchiseCode                      category
RevLineCr                          category
LowDoc                             category
DisbursementDate             datetime64[ns]
MIS_Status                            int64
Sector                             category
ApprovalDate                 datetime64[ns]
ApprovalFY                            int64
City                                 object
State                                object
BankState                            object
DisbursementGross                   float64
GrAppv                              float64
SBA_Appv                            float64
UrbanRural                         category
DisbursementDate_year               float64
DisbursementDate_month              float64
DisbursementDate_day            

In [46]:
# columns used for training -> all_cols
num_cols = ['NoEmp', 'CreateJob', 'RetainedJob', 'ApprovalFY', 'DisbursementGross', 'GrAppv', 'SBA_Appv']
retained_cat_cols = ['NewExist', 'RevLineCr', 'LowDoc', 'UrbanRural']
timestamp_cols = ['DisbursementDate_daystamp', 'ApprovalDate_daystamp']
franchise_cols = ['FranchiseCode1', 'FranchiseCode0']
target_encode_cols = ['Sector', 'State', 'BankState']
target_encoded_cols = [item + "_target" for item in target_encode_cols]
target_encode_smooth_cols = ["longitude", "latitude"]
target_encoded_smooth_cols = [item + "_target" for item in target_encode_smooth_cols]
location_cols = ['latitude', 'longitude']
all_cols = num_cols + retained_cat_cols + timestamp_cols + franchise_cols + target_encoded_cols + location_cols + target_encoded_smooth_cols

## Train / validation split

In [47]:
from sklearn.model_selection import train_test_split

In [48]:
X_train, X_val, y_train, y_val = train_test_split(train, train["MIS_Status"], test_size=0.2, random_state=42, stratify=train["MIS_Status"])  # stratify to keep the class ratio in both splits
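As a quick sanity check on what `stratify` does, the following sketch uses synthetic labels with an assumed 90/10 class ratio (not the real `MIS_Status` distribution) and confirms that both splits keep that ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 900 positives and 100 negatives (assumed ratio)
rng = np.random.default_rng(0)
y = np.array([1] * 900 + [0] * 100)
X = rng.normal(size=(1000, 3))

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratification, both splits keep (almost exactly) the original ratio
print("train positive rate:", y_tr.mean())
print("valid positive rate:", y_va.mean())
```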

### target encoding for valid data

We target-encode the held-out data (here, the validation split) using statistics computed from the training split.

In [49]:
from sklearn.model_selection import StratifiedKFold

In [50]:
target_encode_cols = ['Sector', 'State', 'BankState']

# the validation split can be encoded directly from the training split
for col in target_encode_cols:
    X_val = target_encode_test(X_train, y_train, X_val, col)

In [51]:
# same for the smooth-encoded columns:
X_val = target_encode_smooth_test(X_train, y_train, X_val, "longitude")
X_val = target_encode_smooth_test(X_train, y_train, X_val, "latitude")
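The helper `target_encode_test` lives in `cat_encodings` and its source is not shown here. The sketch below is a hypothetical stand-in that illustrates the general idea: compute the mean target per category on the training split and map it onto the held-out split. The function name `target_encode_holdout` and the fallback to the global training mean for unseen categories are assumptions; the real helper may behave differently.

```python
import pandas as pd


def target_encode_holdout(train_X, train_y, test_X, column):
    """Hypothetical stand-in for `target_encode_test` (not the notebook's code).

    Adds `<column>_target` to a copy of `test_X`, holding the mean of `train_y`
    for each category of `column` as observed in `train_X`.
    """
    # Cast to object so categorical and string columns are treated alike
    keys = train_X[column].astype(object)
    means = train_y.groupby(keys).mean()

    out = test_X.copy()  # work on a copy so the caller's frame is not modified
    encoded = out[column].astype(object).map(means)
    # Assumed fallback: categories unseen in training get the global mean
    out[column + "_target"] = encoded.fillna(train_y.mean())
    return out


# Toy data, not taken from the competition files
toy_train = pd.DataFrame({"State": ["CA", "CA", "NY", "NY", "NY"]})
toy_y = pd.Series([1, 0, 1, 1, 1])
toy_test = pd.DataFrame({"State": ["CA", "NY", "TX"]})

toy_encoded = target_encode_holdout(toy_train, toy_y, toy_test, "State")
print(toy_encoded)
```

Copying the input frame before assigning a new column is one way to avoid pandas' `SettingWithCopyWarning` when the frame passed in is a slice of another frame.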

### Target encoding for train data

We perform the target encoding so that it matches the CV splits.
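For reference, here is a minimal sketch of a common out-of-fold variant of this idea, using only pandas and scikit-learn: every row is encoded with statistics from the folds it does not belong to, so a row's own label never enters its encoding. The function `oof_target_encode` is hypothetical, and the cells that follow use the notebook's own helpers and differ in detail.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold


def oof_target_encode(X, y, column, n_splits=5, seed=1000):
    """Return out-of-fold target means for `column` (illustrative sketch)."""
    encoded = pd.Series(np.nan, index=X.index, dtype=float)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in skf.split(X, y):
        # Statistics come only from the rows outside the fold being encoded
        y_fit = y.iloc[fit_idx]
        keys_fit = X[column].iloc[fit_idx].astype(object)
        means = y_fit.groupby(keys_fit).mean()

        keys_enc = X[column].iloc[enc_idx].astype(object)
        # Assumed fallback for unseen categories: mean of the fitting rows
        values = keys_enc.map(means).fillna(y_fit.mean())
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded


# Toy data, not taken from the competition files
toy_X = pd.DataFrame({"State": ["CA", "NY"] * 10})
toy_y = pd.Series([1, 0, 1, 1] * 5)
oof = oof_target_encode(toy_X, toy_y, "State", n_splits=5)
print(oof.head())
```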

In [52]:
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1000)
# for train
kf_iter_train = kf.split(X_train, y_train)

# create list of indices for training / test data to use for holdout target encoding
folds_train = []
for train_idx, test_idx in kf_iter_train:
    folds_train.append((train_idx, test_idx))

In [53]:
X_trs = []; X_cvalids = []
for fold, (train_indices, cvalid_indices) in enumerate(folds_train):
    X_tr, X_cvalid = X_train.iloc[train_indices], X_train.iloc[cvalid_indices]
    y_tr, y_cvalid = y_train.iloc[train_indices], y_train.iloc[cvalid_indices]
    for col in target_encode_cols:
        X_tr = target_encode_test(X_cvalid, y_cvalid, X_tr, col)
        X_cvalid = target_encode_test(X_tr, y_tr, X_cvalid, col)
    for col in target_encode_smooth_cols:
        X_tr = target_encode_smooth_test(X_cvalid, y_cvalid, X_tr, col)
        X_cvalid = target_encode_smooth_test(X_tr, y_tr, X_cvalid, col)
    X_trs.append(X_tr[all_cols])
    X_cvalids.append(X_cvalid[all_cols])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df = train_X; df['target'] = train_y
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_X[column + "_target"] = 0.9

## LightGBM training

### Macro F1 Score (with threshold included in the metric)

In [54]:
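from sklearn.metrics import f1_score

def Macrof1(preds, eval_dataset):
    """Custom LightGBM metric: best macro F1 over a grid of decision thresholds."""
    y_true = eval_dataset.get_label()
    max_score = 0
    # sweep 100 thresholds in [0.2, 0.9] and keep the best macro F1
    for th in np.linspace(0.2, 0.9, 100):
        y_pred = (preds > th).astype(int)
        score = f1_score(y_true, y_pred, average='macro')
        if score > max_score:
            max_score = score
    # (metric name, value, is_higher_better)
    return 'Macrof1', max_score, True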
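The metric can be checked in isolation, without training a model. The sketch below restates the same threshold sweep and calls it with a small stub object in place of `lgb.Dataset`. The class `LabelHolder` and the name `macro_f1_metric` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.metrics import f1_score


class LabelHolder:
    """Hypothetical stub exposing only `get_label()`, as the metric expects."""

    def __init__(self, labels):
        self._labels = np.asarray(labels)

    def get_label(self):
        return self._labels


def macro_f1_metric(preds, eval_dataset):
    """Best macro F1 over 100 thresholds in [0.2, 0.9] (same sweep as above)."""
    y_true = eval_dataset.get_label()
    thresholds = np.linspace(0.2, 0.9, 100)
    best = max(
        f1_score(y_true, (preds > th).astype(int), average="macro")
        for th in thresholds
    )
    return "Macrof1", best, True


# Toy scores that a threshold between 0.3 and 0.8 separates perfectly
preds = np.array([0.95, 0.85, 0.80, 0.30, 0.10])
labels = [1, 1, 1, 0, 0]
name, score, higher_is_better = macro_f1_metric(preds, LabelHolder(labels))
print(name, score, higher_is_better)
```

Because the threshold is chosen on the same data that the score is computed on, the reported value is an optimistic estimate of the macro F1 obtained with a fixed, pre-chosen threshold.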

### CV Training!

In [55]:
valid_scores = []
import warnings
import lightgbm as lgb
warnings.filterwarnings("ignore", category=DeprecationWarning) 
boosters = []
for fold, (train_indices, cvalid_indices) in enumerate(folds_train):
    X_tr = X_trs[fold]
    X_cvalid = X_cvalids[fold]
    y_tr = y_train.iloc[train_indices]
    y_cvalid = y_train.iloc[cvalid_indices]
    
    lgb_train = lgb.Dataset(X_tr[all_cols], y_tr)
    lgb_eval = lgb.Dataset(X_cvalid[all_cols], y_cvalid)

    print("fold No. = ", fold)
    params = {
    'objective': 'binary',
    'metric': 'None',  # disable built-in metrics; evaluation uses the custom feval
    'verbose': -1,
    'learning_rate':0.1,
    'early_stopping_rounds': 100,  # cannot trigger before the limit, since num_boost_round is also 100
    'scale_pos_weight': 1.0,
    }
    
    model = lgb.train(
        params,
        lgb_train,
        valid_sets = lgb_eval,
        num_boost_round = 100,
        feval = Macrof1,
        callbacks=[lgb.log_evaluation(10)],
    )
    
    print("now caluculating MacroF1 values .....")
    name, score, _ = Macrof1(model.predict(X_cvalid), lgb_eval)
    print(f'fold {fold} MacroF1: {score}')
    valid_scores.append(score)

    boosters.append(model)

fold No. =  0
[10]	valid_0's Macrof1: 0.675966
[20]	valid_0's Macrof1: 0.678506
[30]	valid_0's Macrof1: 0.679468
[40]	valid_0's Macrof1: 0.682742
[50]	valid_0's Macrof1: 0.680244
[60]	valid_0's Macrof1: 0.678919
[70]	valid_0's Macrof1: 0.67835
[80]	valid_0's Macrof1: 0.678847
[90]	valid_0's Macrof1: 0.677382
[100]	valid_0's Macrof1: 0.678011
now caluculating MacroF1 values .....
fold 0 MacroF1: 0.6827423573067081
fold No. =  1
[10]	valid_0's Macrof1: 0.676019
[20]	valid_0's Macrof1: 0.678156
[30]	valid_0's Macrof1: 0.682608
[40]	valid_0's Macrof1: 0.681042
[50]	valid_0's Macrof1: 0.680855
[60]	valid_0's Macrof1: 0.682307
[70]	valid_0's Macrof1: 0.682749
[80]	valid_0's Macrof1: 0.684514
[90]	valid_0's Macrof1: 0.684168
[100]	valid_0's Macrof1: 0.683915
now caluculating MacroF1 values .....
fold 1 MacroF1: 0.685496240466818
fold No. =  2
[10]	valid_0's Macrof1: 0.673643
[20]	valid_0's Macrof1: 0.674569
[30]	valid_0's Macrof1: 0.676017
[40]	valid_0's Macrof1: 0.67719
[50]	valid_0's Macrof

In [56]:
valid_scores

[0.6827423573067081,
 0.685496240466818,
 0.6789997104574546,
 0.6759035726574282,
 0.6865140417997527]

### Check scores with valid data

In [57]:
pred_average = np.mean([item.predict(X_val[all_cols]) for item in boosters], axis=0)
test_prediction = (pred_average > np.quantile(pred_average, 0.1)).astype(int) # labels the top 90% of scores as 1; there should be a better way to choose this threshold

In [58]:
f1_score(y_val, test_prediction, average='macro')

0.6735554725402441
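One possible alternative to the fixed 10% quantile is to pick the threshold that maximizes macro F1 on labeled predictions and then reuse that threshold on unseen data. The sketch below shows this on synthetic scores. The function `best_macro_f1_threshold` and the toy score distribution are assumptions made for illustration, and no claim is made about how much this would change the score above.

```python
import numpy as np
from sklearn.metrics import f1_score


def best_macro_f1_threshold(y_true, proba, grid=None):
    """Return (threshold, macro F1) for the best threshold on the grid."""
    grid = np.linspace(0.2, 0.9, 100) if grid is None else np.asarray(grid)
    scores = [
        f1_score(y_true, (proba > th).astype(int), average="macro")
        for th in grid
    ]
    best = int(np.argmax(scores))
    return float(grid[best]), float(scores[best])


# Synthetic, imbalanced toy problem (roughly 90% positives, assumed)
rng = np.random.default_rng(0)
y_toy = (rng.random(500) < 0.9).astype(int)
proba_toy = np.clip(0.55 + 0.25 * y_toy + rng.normal(0, 0.15, 500), 0, 1)

threshold, tuned_score = best_macro_f1_threshold(y_toy, proba_toy)

# For comparison: the quantile rule used in the cell above
quantile_pred = (proba_toy > np.quantile(proba_toy, 0.1)).astype(int)
quantile_score = f1_score(y_toy, quantile_pred, average="macro")

print("tuned threshold:", threshold)
print("macro F1 (tuned):", tuned_score)
print("macro F1 (10% quantile):", quantile_score)
```

To keep the validation score honest, the threshold would ideally be tuned on out-of-fold predictions from the CV loop and only then applied to `X_val` and to the test data.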

## LGB training with all data

### target encoding for CV

In [59]:
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1000)
# for train
kf_iter_train = kf.split(train, train['MIS_Status'])

# create list of indices for training / test data to use for holdout target encoding
folds_train = []
for train_idx, test_idx in kf_iter_train:
    folds_train.append((train_idx, test_idx))

In [60]:
X_trs = []; X_cvalids = []
for fold, (train_indices, cvalid_indices) in enumerate(folds_train):
    X_tr, X_cvalid = train.iloc[train_indices], train.iloc[cvalid_indices]
    y_tr, y_cvalid = train['MIS_Status'].iloc[train_indices], train['MIS_Status'].iloc[cvalid_indices]
    for col in target_encode_cols:
        X_tr = target_encode_test(X_cvalid, y_cvalid, X_tr, col)
        X_cvalid = target_encode_test(X_tr, y_tr, X_cvalid, col)
    for col in target_encode_smooth_cols:
        X_tr = target_encode_smooth_test(X_cvalid, y_cvalid, X_tr, col)
        X_cvalid = target_encode_smooth_test(X_tr, y_tr, X_cvalid, col)
    X_trs.append(X_tr[all_cols])
    X_cvalids.append(X_cvalid[all_cols])


In [61]:
valid_scores = []
import warnings
import lightgbm as lgb
warnings.filterwarnings("ignore", category=DeprecationWarning) 
boosters = []
for fold, (train_indices, cvalid_indices) in enumerate(folds_train):
    X_tr = X_trs[fold]
    X_cvalid = X_cvalids[fold]
    y_tr = train['MIS_Status'].iloc[train_indices]
    y_cvalid = train['MIS_Status'].iloc[cvalid_indices]
    
    lgb_train = lgb.Dataset(X_tr[all_cols], y_tr)
    lgb_eval = lgb.Dataset(X_cvalid[all_cols], y_cvalid)

    print("fold No. = ", fold)
    params = {
    'objective': 'binary',
    'metric': 'None',  # disable built-in metrics; evaluation uses the custom feval
    'verbose': -1,
    'learning_rate':0.1,
    'early_stopping_rounds': 100,  # cannot trigger before the limit, since num_boost_round is also 100
    'scale_pos_weight': 1.0,
    }
    
    model = lgb.train(
        params,
        lgb_train,
        valid_sets = lgb_eval,
        num_boost_round = 100,
        feval = Macrof1,
        callbacks=[lgb.log_evaluation(10)],
    )
    
    print("now caluculating MacroF1 values .....")
    name, score, _ = Macrof1(model.predict(X_cvalid), lgb_eval)
    print(f'fold {fold} MacroF1: {score}')
    valid_scores.append(score)

    boosters.append(model)

fold No. =  0
[10]	valid_0's Macrof1: 0.667265
[20]	valid_0's Macrof1: 0.674307
[30]	valid_0's Macrof1: 0.672733
[40]	valid_0's Macrof1: 0.671621
[50]	valid_0's Macrof1: 0.674272
[60]	valid_0's Macrof1: 0.674467
[70]	valid_0's Macrof1: 0.672323
[80]	valid_0's Macrof1: 0.671877
[90]	valid_0's Macrof1: 0.672291
[100]	valid_0's Macrof1: 0.671691
now caluculating MacroF1 values .....
fold 0 MacroF1: 0.6748154976369096
fold No. =  1
[10]	valid_0's Macrof1: 0.679223
[20]	valid_0's Macrof1: 0.685906
[30]	valid_0's Macrof1: 0.688404
[40]	valid_0's Macrof1: 0.686784
[50]	valid_0's Macrof1: 0.684356
[60]	valid_0's Macrof1: 0.68259
[70]	valid_0's Macrof1: 0.68367
[80]	valid_0's Macrof1: 0.684339
[90]	valid_0's Macrof1: 0.685675
[100]	valid_0's Macrof1: 0.684815
now caluculating MacroF1 values .....
fold 1 MacroF1: 0.6893938508034391
fold No. =  2
[10]	valid_0's Macrof1: 0.670993
[20]	valid_0's Macrof1: 0.678066
[30]	valid_0's Macrof1: 0.678673
[40]	valid_0's Macrof1: 0.679461
[50]	valid_0's Macro

In [62]:
valid_scores

[0.6748154976369096,
 0.6893938508034391,
 0.6819651180273643,
 0.6779598941004636,
 0.6746502683420891]

### prepare submission
#### target encoding for test data

In [64]:
target_encode_cols = ['Sector', 'State', 'BankState']

# the test data can be encoded directly from the full training data
for col in target_encode_cols:
    test = target_encode_test(train, train['MIS_Status'], test, col)
# same for the smooth-encoded columns:
test = target_encode_smooth_test(train, train['MIS_Status'], test, "longitude")
test = target_encode_smooth_test(train, train['MIS_Status'], test, "latitude")

In [65]:
# prepare submission
pred_average = np.mean([item.predict(test[all_cols]) for item in boosters], axis=0)
test_prediction = (pred_average > np.quantile(pred_average, 0.1)).astype(int) # labels the top 90% of scores as 1; there should be a better way to choose this threshold

In [66]:
test['prediction'] = test_prediction
test['prediction'].to_csv('submission_ex.csv', header=None)